OpenAI shipped GPT-5.5 yesterday, and if you just glanced at the headline, you'd be forgiven for yawning. Another 0.1 bump, another benchmark PDF, another rollout to paid tiers. But here's the thing — this isn't a point release. It's the first full base model retrain since GPT-4.5, and OpenAI built it from the ground up to do work, not answer questions.

That distinction matters more than any number on the benchmark page.

What "Fully Retrained" Actually Means

Every GPT release since 4.5 has been a post-training modification — different RLHF recipe, new safety tuning, extended context window, bolted-on tool calling. The base weights stayed the same. GPT-5.5, codenamed "Spud" internally, throws all of that out. New pretraining data, new architecture decisions, new optimization targets. And those targets were agentic workflows from day one.

Greg Brockman described it as "a faster, sharper thinker for fewer tokens." Marketing line? Sure. But it captures something real: the model was designed to take a vague task, decompose it, pick up tools, check its own output, and keep going without you re-prompting every three steps. Previous versions could do this with scaffolding. Spud does it natively.

Two Very Different Benchmark Stories

The headline numbers are strong — in some places, absurdly so. On Terminal-Bench 2.0, which measures real command-line engineering tasks, GPT-5.5 posts 82.7%. Claude Opus 4.7 sits at 69.4%. Gemini 3.1 Pro at 68.5%. That's not a margin; that's a canyon. Thirteen points over the runner-up on a benchmark designed to test exactly the agentic terminal work this model was trained for.

GDPval, which evaluates 44 knowledge-work occupations, returns 84.9%. OSWorld-Verified — the "can this model actually navigate a desktop?" test — comes in at 78.7%. On BrowseComp Pro, a web information retrieval benchmark, it hits 90.1%.

But then there's SWE-Bench Pro.

This is the end-to-end GitHub issue resolution test — the one where a model reads a real bug report, navigates a real codebase, writes a real fix, and submits it. GPT-5.5 scored 58.6%. Respectable. But Claude Opus 4.7 holds the crown at 64.3%, and that 5.7-point gap isn't small. SWE-Bench Pro directly measures the task most developers care about: patching real bugs in real repos.

OpenAI included a footnote about "potential memorization concerns" in Opus's score. Maybe. The gap is still there.

So here's the honest read: GPT-5.5 dominates terminal and computer-use workflows. For pure code repair — find the bug, understand the context, write the patch — Opus still leads. Your choice depends on which kind of work you're offloading.

The Price Doubled. Your Bill Might Not.

GPT-5.5 costs $5 per million input tokens and $30 per million output — exactly double GPT-5.4's rates. The Pro variant goes up to $30/$180. Sticker shock.

OpenAI's counter-argument: Spud uses "significantly fewer tokens to complete the same Codex tasks." No independent verification yet, but the reasoning checks out structurally. An agent that completes a task in one pass instead of three uses a third of the tokens, so even at double the per-token rate the task costs two-thirds of what it did. Anyone running agentic pipelines with GPT-5.4 knows the pain: the model second-guesses itself mid-task, you retry, it wanders down dead ends, you burn tokens on wasted attempts. If Spud actually holds course more reliably, total spend goes down despite the sticker going up.
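Back-of-the-envelope, the per-million rates are from the announcement but the token counts and pass counts below are made up for illustration:

```python
def task_cost(passes, tokens_in, tokens_out, rate_in, rate_out):
    """Dollar cost of a task that takes `passes` model calls.

    Rates are dollars per million tokens. Token counts per pass
    are hypothetical; real numbers depend on your workload.
    """
    return passes * (tokens_in * rate_in + tokens_out * rate_out) / 1_000_000

# GPT-5.4 rates (half of 5.5's published $5/$30), three passes to finish:
old = task_cost(passes=3, tokens_in=20_000, tokens_out=5_000,
                rate_in=2.50, rate_out=15.00)

# GPT-5.5 rates, one pass:
new = task_cost(passes=1, tokens_in=20_000, tokens_out=5_000,
                rate_in=5.00, rate_out=30.00)

print(f"old: ${old:.3f}  new: ${new:.3f}")  # $0.375 vs $0.250
```

Double the rate, a third of the tokens, two-thirds of the bill. Whether the one-pass assumption holds on your tasks is the entire question.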

Bank of New York Mellon's CIO offered an early data point, calling out "really impressive hallucination resistance" and a "step change" in response quality.

Still: watch the community benchmarks before migrating production workloads. "Fewer tokens per task" is a claim that needs real-world data, not demos.

What This Changes Right Now

If you're on Codex — 4 million developers are — GPT-5.5 is already in your rotation as of Thursday. The improvement you'll feel first isn't raw code quality. It's fewer moments where the agent stalls and asks you to clarify something it should have figured out on its own.

Cherry-picked demo, but telling: a math professor built an algebraic geometry visualization app from a single natural-language prompt in 11 minutes via Codex. Previous models would have required a conversation. The "single prompt" part is the signal — reduced back-and-forth is what "agentic-first" training actually delivers at the interaction level.

For API users building custom agent orchestration, the practical question is whether you can simplify your retry-and-recovery scaffolding. If a meaningful chunk of your code exists to catch model confusion and re-prompt, Spud might let you delete some of it. The 90.1% BrowseComp Pro score suggests tool integration — especially web browsing — got materially better.
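For concreteness, the scaffolding in question usually looks something like this minimal sketch. Names like `call_model` and `validate` are stand-ins for your own model client and output checker, not any real API; the bet is that a model which holds course natively lets you lower `max_retries`, or delete the loop outright.

```python
from typing import Callable

def run_agent_step(
    prompt: str,
    call_model: Callable[[str], str],
    validate: Callable[[str], bool],
    max_retries: int = 3,
) -> str:
    """Re-prompt until the model's output passes validation.

    This is the catch-confusion-and-retry pattern discussed above.
    Production code would typically add backoff and logging.
    """
    for attempt in range(max_retries):
        output = call_model(prompt)
        if validate(output):
            return output
        # Append failure context and try again. This loop is exactly
        # the code a more reliable agent model might make redundant.
        prompt = f"{prompt}\n\nAttempt {attempt + 1} failed validation. Try again."
    raise RuntimeError(f"agent step failed after {max_retries} attempts")
```

Every retry here is billed tokens, which is why "fewer passes per task" can outweigh a doubled per-token rate.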

The Numbering Hides the Real Story

Brockman said something unusually candid yesterday: "there are enough model releases that it's probably getting hard to distinguish one from another." He's right. Six weeks between GPT-5.4 and GPT-5.5 is patch cadence, not generational cadence. Easy to assume this is just another incremental bump.

It isn't. The numbering obscures it, but GPT-5.5 is a different model with different pretraining objectives, not a fine-tune of its predecessor. OpenAI just happened to release it during a stretch of rapid-fire updates, and the sequential version number makes it blend in.

The deeper competitive question isn't which model wins which benchmark this week. It's whether "agentic-first" pretraining — baking autonomous task execution into the base model from the start — compounds over time better than the bolt-on approach. Anthropic and Google are retrofitting agent capabilities onto architectures that were originally built for conversation. OpenAI is now saying: what if we just build the agent from scratch?

If that bet is right, the Terminal-Bench gap grows. If it's wrong, the SWE-Bench gap matters more, and Opus keeps winning on the work developers actually do.

Stop evaluating these models on how well they chat. That era ended yesterday.