OpenAI dropped GPT-5.4 on March 5, and the headline number — 75% on OSWorld-Verified, beating the 72.4% human baseline — made everyone sit up. Almost four weeks later, I've been shipping with it, reading developer threads, and stress-testing the edges. The picture is more nuanced than the press release suggested.
The OSWorld Score Is Legitimate, With a Caveat
OSWorld-Verified measures whether an AI agent can complete real desktop tasks: navigating browsers, filling forms, switching between applications, manipulating files. The benchmark runs on actual virtual machines — Ubuntu, Windows, macOS — not simulated environments. GPT-5.4 scored 75%, up from the previous generation's 47.3%. The human baseline sits at 72.4%.
That jump from 47.3% to 75% in a single generation is staggering. But here's what most coverage misses: these tasks come with well-defined instructions. "Download the Q3 report from SharePoint, rename it, and email it to finance@company.com." The model excels at structured multi-step execution. What the benchmark doesn't measure is ambiguous judgment calls: when a Slack message says "can you handle the Johnson thing," no score captures whether the model copes.
Still, for RPA-style automation and structured workflows, the capability is genuinely new. If you're building tools that automate repetitive computer tasks, OpenAI's computer use API is the first I'd call close to production-ready. The native integration means you don't need a separate orchestrator — the model sees your screen, plans the steps, and executes through keyboard and mouse actions directly.
What makes me cautious is the other 25%. A one-in-four failure rate on desktop tasks is fine in a demo. In production, it demands robust error handling, retry logic, and human-in-the-loop fallbacks. Nobody's running this unsupervised at enterprise scale yet.
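What that fallback layer looks like is mostly plumbing. Here's a minimal sketch in Python, where `run_agent` is a placeholder for whatever executes one attempt of a desktop task (it's a stand-in, not a real API):

```python
def run_with_fallback(task, run_agent, max_retries=3):
    """Retry an agent task a few times, then escalate to a human.

    `run_agent` is a hypothetical callable that performs one attempt
    and returns (success, result). On repeated failure, we hand off
    rather than fail silently.
    """
    for attempt in range(1, max_retries + 1):
        success, result = run_agent(task)
        if success:
            return {"status": "done", "result": result, "attempts": attempt}
    # Retries exhausted: route to a human review queue.
    return {"status": "needs_human", "task": task, "attempts": max_retries}
```

The key design choice is that the failure path returns a structured handoff record instead of raising, so a queue or dashboard can pick it up.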
The Million-Token Fine Print
The headline says 1 million tokens, but the default is 272K — you unlock the rest by setting model_context_window and model_auto_compact_token_limit explicitly. And input pricing doubles to $5.00/MTok above that threshold. Worth it for whole-codebase analysis or legal document review; overkill for most chat applications.
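Assuming the doubled rate applies only to tokens past the 272K threshold (marginal billing; the announcement doesn't specify), the input-cost math is simple:

```python
def input_cost_usd(input_tokens, threshold=272_000,
                   base_rate=2.50, long_rate=5.00):
    """Estimate input cost in USD, with tokens past the threshold
    billed at the doubled long-context rate. Rates are USD per
    million tokens; marginal billing is an assumption, not confirmed.
    """
    below = min(input_tokens, threshold)
    above = max(input_tokens - threshold, 0)
    return (below * base_rate + above * long_rate) / 1_000_000
```

Under that assumption, a full 1M-token prompt costs $4.32 in input rather than the $5.00 a flat doubled rate would imply — a meaningful difference if you batch whole codebases.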
Tool Search Quietly Changes Agent Architecture
This feature deserves more attention than it's getting. Previously, if your agent had access to 50 tools, you'd list all 50 definitions in the system prompt — easily 10K+ tokens of schema boilerplate per request. With MCP crossing 97 million installs, many production agents now connect to dozens or hundreds of capabilities.
GPT-5.4's Tool Search loads definitions on-demand. The model decides which tools it needs, fetches their schemas, then invokes them. OpenAI reports a 47% reduction in token usage for tool-heavy workflows. The practical shift: you can wire your agent into a massive tool registry without paying per-request costs for listing everything. Architecture moves from "curate a small set per task" to "provide everything and let the model pick."
For anyone building agentic systems at scale, this is the most consequential feature in the release. Computer use gets the headlines, but on-demand tool resolution changes the economics.
The Price Math
Standard tier runs $2.50 input / $15.00 output per million tokens — a 43% input bump over the previous version. But the model solves problems in fewer tokens, and on-demand tool loading offsets the increase for agent workloads. Net cost per task is roughly flat for most developers.
The Pro tier at $30/$180 per MTok is aimed squarely at enterprise legal and finance. Harvey's Head of Applied Research called it the new standard for document-heavy legal work. Meanwhile, Mini at ~$0.40/$1.60 offers a cheap on-ramp to test whether the architectural improvements matter for your specific pipeline before committing budget to the full model.
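To see where the tiers land for a concrete workload, plug the listed rates into a cost-per-task calculation. The token counts in the usage note are illustrative, not benchmarks:

```python
# USD per million tokens (input, output), as listed in the release.
TIERS = {
    "standard": (2.50, 15.00),
    "pro": (30.00, 180.00),
    "mini": (0.40, 1.60),
}

def task_cost(tier, input_tokens, output_tokens):
    """Cost in USD for one task at a given tier."""
    inp, out = TIERS[tier]
    return (input_tokens * inp + output_tokens * out) / 1_000_000
```

For a hypothetical task consuming 20K input and 2K output tokens, that works out to about $0.08 on standard, $0.96 on Pro, and $0.011 on Mini — a 12x spread between standard and Pro, which is why the Pro tier only pencils out when accuracy itself carries dollar value.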
Four Weeks of Developer Feedback
The Hacker News thread tells a clear story. Consensus praise: computer use works, long context is genuinely useful, and the token efficiency improvements are real. Cursor's VP of Developer Education said their engineers find it "more natural and assertive... proactive about parallelizing work."
The consistent criticism? Common-sense reasoning gaps persist. The model scores 83% on GDPval professional benchmarks but still misses basic logical implications that require real-world context. You can watch it ace a complex financial analysis and then fail to infer that a meeting scheduled for "next Friday" means the one after the upcoming one, not tomorrow.
For frontend work, the community verdict is unambiguous: Claude Opus and Gemini produce more usable UI code. If your workflow is React components and CSS, switching models might be a downgrade. The strength here is document processing, code analysis, and multi-step desktop automation — not pixel-perfect design work.
Factual accuracy did improve meaningfully: 33% fewer hallucinations versus the predecessor. For code generation specifically, the 57.7% on SWE-bench Pro puts it at the top of that leaderboard, and developer reports of more reliable completions back that number up.
When to Actually Reach for It
Strong fit: document processing pipelines, desktop automation, tool-heavy agent workflows, long-context analysis (legal review, codebase comprehension), enterprise knowledge work where accuracy at the Pro tier justifies the cost.
Look elsewhere: frontend and UI generation, creative writing, latency-sensitive chat, anything where 272K context suffices and you don't need the autonomy features.
The agent landscape shifted this month. Not because the model is flawless, but because crossing the human baseline on realistic desktop work — even structured desktop work — draws a line. The question stopped being whether AI agents can handle your computer tasks. It's which pieces, at what cost, and with how much babysitting.
That 25% failure rate is where all the interesting engineering problems live now.