GPT-5.3-Codex: A Late Review (And Why I'm Paying Attention Now)

Sun, May 24, 2026 · 6 min read

This is a late review. GPT-5.3-Codex shipped on February 5, 2026, and here I am almost four months later writing about it. I have no excuse beyond the usual one — too many things to look at, not enough evenings. But having spent the past few weeks reading benchmarks, watching demos, and following what developers are actually saying about it, I want to be honest: it is a good model. A genuinely good one.

What GPT-5.3-Codex actually is

OpenAI’s current Codex line is not the old Codex API that launched in 2021 and was shut down in 2023. That thing is ancient history. The modern Codex is an agentic coding product: a CLI tool, a ChatGPT integration, an IDE extension, and an API, all backed by purpose-built models in the GPT-5 family.

GPT-5.3-Codex is the February 2026 release. The headline claims from OpenAI: 25% faster than GPT-5.2-Codex, 56.8% on SWE-Bench Pro, 77.3% on Terminal-Bench 2.0, and 64.7% on OSWorld-Verified. It was also the first model OpenAI says was “instrumental in creating itself” — the Codex team used it to debug training, manage deployment, and diagnose test results during its own development cycle. That is a weird loop worth paying attention to.

The interactive piece matters more than the raw scores. You can steer GPT-5.3-Codex mid-task — ask it questions, redirect its approach, watch its progress in real time. It feels closer to pair programming than to fire-and-forget batch coding. For long-running tasks that span hours, OpenAI added a “Goal mode” that lets it drive autonomously toward an objective, checking in along the way.

On the API side, pricing sits at $1.75 per million input tokens and $14 per million output tokens, with a 400K context window. Not cheap, but competitive for what it does.
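To make those per-token prices concrete, here is a minimal sketch of the arithmetic in Python. The two rates are the ones quoted above. The function name `request_cost` and the example token counts are my own illustrative assumptions, and any discounts or plan-based billing are not modeled.

```python
# List prices quoted in the post, in USD per million tokens.
INPUT_USD_PER_MILLION = 1.75
OUTPUT_USD_PER_MILLION = 14.00


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request (hypothetical helper, not an OpenAI API)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = input_tokens * INPUT_USD_PER_MILLION + output_tokens * OUTPUT_USD_PER_MILLION
    return total / 1_000_000


# Illustrative request: 50,000 tokens of repo context in, a 5,000-token patch out.
cost = request_cost(50_000, 5_000)
print(f"${cost:.4f}")  # input share is $0.0875, output share is $0.0700
```

Output tokens cost eight times as much as input tokens, so in this example the 5,000-token patch costs almost as much as the 50,000 tokens of context that produced it.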

There is also Codex-Spark, a smaller distilled variant running on Cerebras wafer-scale hardware. It pushes over 1,000 tokens per second — roughly 15x faster than the full model — with a 128K context window. The trade-off is obvious: narrower context and slightly lower benchmark scores in exchange for near-instant feedback on targeted edits. OpenAI positioned it for real-time interaction, not overnight refactors, and that division of labor makes sense.
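The speed figures above translate into wait times with simple division. The sketch below assumes exactly 1,000 tokens per second for Spark (the claim is "over" 1,000), derives the full model's throughput from the "roughly 15x" ratio, and ignores network and time-to-first-token latency. All names are illustrative and not part of any OpenAI SDK.

```python
SPARK_TOKENS_PER_SEC = 1_000.0  # assumption: the quoted floor, treated as exact
SPEEDUP_VS_FULL = 15.0          # "roughly 15x faster", per the post
FULL_TOKENS_PER_SEC = SPARK_TOKENS_PER_SEC / SPEEDUP_VS_FULL  # about 66.7 tokens/s implied

SPARK_CONTEXT = 128_000  # tokens


def generation_seconds(output_tokens: int, tokens_per_sec: float) -> float:
    """Time spent generating output only; ignores queueing and network latency."""
    return output_tokens / tokens_per_sec


def fits_spark(prompt_tokens: int, output_tokens: int) -> bool:
    """Assumption: prompt and output together must fit inside the context window."""
    return prompt_tokens + output_tokens <= SPARK_CONTEXT


edit_tokens = 500  # a small targeted edit
print(generation_seconds(edit_tokens, SPARK_TOKENS_PER_SEC))           # 0.5
print(round(generation_seconds(edit_tokens, FULL_TOKENS_PER_SEC), 1))  # 7.5
print(fits_spark(200_000, edit_tokens))                                # False
```

Half a second versus several seconds is the difference between an inline suggestion and a context switch. The last line shows the other side of the trade: a 200K-token prompt would not fit in Spark's window at all.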

How it compares

Here is where things get complicated, and where I want to be careful.

The benchmark landscape in mid-2026 is a mess. OpenAI stopped reporting SWE-bench Verified scores after finding training-data contamination across frontier models. They report SWE-Bench Pro instead. Anthropic and Google still report Verified. Comparing a 56.8% SWE-Bench Pro score to a 79.6% SWE-bench Verified score is comparing apples to coconuts — the benchmarks measure different things at different difficulty levels. Anyone who lines them up in a neat table and declares a winner is misleading you.

With that caveat, here is what I can say honestly about the current field:

Claude Sonnet 4.7 and Claude Opus 4.7 are the most relevant Anthropic comparisons right now. Anthropic reports Opus 4.7 as a clear step up over Opus 4.6 on coding tasks (including gains on their internal 93-task coding suite), and in day-to-day agent workflows it does feel more stable than the 4.6 generation. Sonnet 4.7 remains the better speed/cost default, while Opus 4.7 is the “use it when the task is gnarly” option.

Kimi K2.6 deserves to be in the same conversation now. Moonshot’s published material positions K2.6 as stronger than K2.5 for coding and long-horizon agent execution. Even if these are vendor-reported numbers, it is enough to treat K2.6 as current-gen competition rather than a fringe alternative.

GLM-5.1 is another model that should not be treated as a side note. Z.ai’s docs and benchmark pages position GLM-5.1 as highly competitive on coding/agentic tasks (including SWE-Bench Pro claims). As always, these should be read as reported results, but they are strong enough to justify GLM-5.1 in any serious 2026 comparison.

If you are doing a 2026 comparison and still anchoring on older baselines like Gemini 2.5 or DeepSeek V3.2, you are probably benchmarking the past, not the present. The current field to watch is Sonnet 4.7, Opus 4.7, Kimi K2.6, GLM-5.1, plus the newest DeepSeek and Qwen coder lines.

What GPT-5.3-Codex does well

Even as someone who reaches for Claude first, I think GPT-5.3-Codex deserves credit on a few specific fronts.

The CLI is solid. The open-source Codex CLI — a Rust-based terminal agent — has crossed 75,000 GitHub stars and 14.5 million monthly npm downloads. It fits the same niche as Claude Code: an agentic coding tool that lives in your terminal and reads, writes, and runs code locally. OpenAI made it open-source from the start, which was the right call.

The interactive model matters. Most agentic coding tools today are some variation of “give it a prompt, wait, review the diff.” Codex’s mid-task steering — asking questions, changing direction, watching decisions happen in real time — is a genuinely different workflow. Not everyone wants it, but for the kind of complex multi-file refactoring where you would normally pair with a colleague, it is closer to the right interaction model.

The Spark variant is clever. Splitting the product into a full reasoning model and a fast, lightweight variant for inline edits is smart product thinking. Different coding tasks have different latency requirements, and one model size does not serve both.

What it does not do

One thing it does not do is fall behind its own successor. GPT-5.5 may be newer, but as an autonomous coding agent I find it horrible in practice: too erratic in long loops, too inconsistent in execution style, and too likely to drift from the original intent unless you babysit every step. GPT-5.3-Codex, on the other hand, actually feels good in that autonomous workflow. It is more predictable, easier to steer mid-task, and more trustworthy when the job spans multiple files and decisions.

The cost adds up. OpenAI moved to a credit-based billing system in April 2026, and real-world developer costs tend to land somewhere around $100–200 per month depending on usage patterns. That is reasonable for a primary tool, expensive for a secondary one.

And benchmarks aside, I think the harder question is about the ecosystem. If you are already invested in one agentic coding workflow — the model, the CLI, the IDE integration, the muscle memory — the switching cost is real. The models are converging in capability faster than the tooling is converging in experience.

Where I land

I said at the top that this is a late review, and I want to end by acknowledging what that lateness taught me. GPT-5.3-Codex shipped, GPT-5.5 shipped on top of it, and the cycle kept moving before I wrote this. Having already found GPT-5.5 erratic and high-friction as an autonomous coding agent, I can say exactly why this post lands where it lands: GPT-5.3-Codex is not perfect, but it feels materially better in that role than the model that supposedly replaced it.

The conclusion is clear: GPT-5.3-Codex was a good release. The interactive steering, the Spark split, and the CLI experience are real strengths, not marketing fluff.

I am excited about what is coming next. The current top tier — Opus 4.7, Sonnet 4.7, Kimi K2.6, GLM-5.1 — keeps raising the bar, and whatever OpenAI ships after the 5.3 Codex line will have to clear it. I will definitely test it when it lands.

The best coding model in 2026 is whichever one you actually use in real work. The second best is whichever one survives your next serious test.