A month ago, OpenAI quietly crossed a threshold that deserves more attention than it got: GPT-5.4 scored 75% on OSWorld-Verified, a benchmark that measures whether an AI can actually use a desktop computer — clicking buttons, filling forms, navigating between apps. The average human scores 72.4%. That gap is small, but the direction matters. We now have a general-purpose model that can operate a GUI better than most people can under test conditions.
If you're building anything that touches automation, RPA, or agent workflows, this changes the math on what's worth building from scratch versus what you can hand off to a model.
What OSWorld-Verified Actually Tests
OSWorld isn't some toy benchmark. It drops a model into a real desktop environment — full Linux or Windows VM — and gives it multi-step tasks: "open the spreadsheet, filter column B by date, export the result as CSV." The model sees screenshots, reasons about what's on screen, and issues mouse clicks and keystrokes in response.
Previous models struggled here because the tasks chain together. You don't just identify a button; you need to remember you're three steps into a five-step workflow, track which window is active, and recover when something unexpected pops up. GPT-5.2 managed 47.3% on this test back in late 2025. Jumping to 75% in a single generation is a massive leap — the kind of improvement that shifts a capability from "interesting demo" to "actually deployable."
The benchmark is verified by human raters who confirm whether the end state matches the goal, so the model can't game it with partial completions.
How the Computer Use API Works
OpenAI built this into the Responses API as a first-class tool type called computer. You send a screenshot (or let the API capture one), and the model returns structured actions: move mouse to coordinates, click, type text, press key combos. It isn't parsing the DOM or accessibility trees — it's literally looking at pixels and deciding what to do.
The flow looks roughly like this:
1. Your agent captures a screenshot of the target environment.
2. You send it to GPT-5.4 with a task description and the computer tool enabled.
3. The model responds with a sequence of actions (click at x,y; type "quarterly report"; press Enter).
4. Your agent executes those actions and captures a new screenshot.
5. Loop until the task is done or the model signals completion.
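That loop can be sketched in a few lines of Python. Everything below is illustrative: capture_screenshot, call_model, and execute_action are hypothetical stand-ins for your environment glue and the real Responses API call, not actual SDK signatures. Here the model is stubbed to "finish" after three cycles so the structure is runnable end to end.

```python
# Illustrative screenshot -> model -> action loop for a computer-use agent.
# All three helpers are hypothetical stand-ins, not real API signatures.

def capture_screenshot(state):
    # In a real agent this grabs pixels from the VM; here we echo state.
    return f"screen:{state['step']}"

def call_model(screenshot, task, history):
    # Stand-in for the API call with the computer tool enabled.
    # Returns structured actions, or a completion signal.
    if len(history) >= 3:                       # pretend the task is done
        return [{"type": "done"}]
    return [{"type": "click", "x": 100, "y": 200},
            {"type": "type", "text": "quarterly report"}]

def execute_action(state, action):
    state["step"] += 1                          # apply action to environment

def run_agent(task, max_cycles=50):
    state, history = {"step": 0}, []
    for _ in range(max_cycles):                 # safety cap on cycles
        shot = capture_screenshot(state)
        actions = call_model(shot, task, history)
        history.append((shot, actions))
        if any(a["type"] == "done" for a in actions):
            return state, history               # model signaled completion
        for a in actions:
            execute_action(state, a)
    return state, history

state, history = run_agent("export the spreadsheet as CSV")
print(state["step"], len(history))              # prints: 6 4
```

The max_cycles cap matters in practice: a confused model can loop on the same dialog indefinitely, and each wasted cycle costs tokens.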
Developers can steer behavior with system messages and configure confirmation policies — so you can require human approval before destructive actions like deleting files or sending emails. That's not a bolt-on; it's baked into the tool specification.
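Whatever policy knobs the API itself exposes, you can also enforce confirmation on the client side before executing anything. A minimal sketch, assuming a made-up set of action type names — the real action vocabulary will differ:

```python
# Client-side confirmation gate: hold destructive actions for human
# approval before executing. Action type names here are hypothetical.

DESTRUCTIVE = {"delete_file", "send_email", "submit_form"}

def gate(actions, approve):
    """Return only the actions that are safe or explicitly approved."""
    allowed = []
    for a in actions:
        if a["type"] in DESTRUCTIVE and not approve(a):
            continue                      # skipped: human declined
        allowed.append(a)
    return allowed

plan = [{"type": "click", "x": 10, "y": 20},
        {"type": "delete_file", "path": "/tmp/report.csv"}]

# Auto-decline everything destructive (e.g. in an unattended run).
safe = gate(plan, approve=lambda a: False)
print([a["type"] for a in safe])          # prints: ['click']
```

In an attended run, approve would prompt a human instead of returning a constant.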
The Context Window Angle Nobody Talks About
The 1-million-token context window isn't just a headline number. For computer-use agents, long context is the difference between an agent that can handle a 30-second task and one that can manage a 20-minute workflow.
Each screenshot-action cycle consumes tokens. A single screenshot might eat 2,000-4,000 tokens depending on resolution. A complex task that requires 50 back-and-forth cycles could easily burn through 200K tokens just on visual input. With the previous 128K limit, agents would lose track of what they'd done halfway through any serious workflow.
The standard context window is 272K tokens at $2.50/MTok input. Go beyond that and pricing doubles to $5.00/MTok for the full session. The 1M ceiling exists, but you'll pay for it. For most computer-use tasks, 272K should be plenty — that's roughly 70-80 screenshot cycles with room for instructions and history.
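The back-of-envelope arithmetic behind that estimate, using the figures from the text (the 40K overhead for instructions and history is an assumption):

```python
# Rough token budget for a computer-use session (all figures approximate).

STANDARD_CONTEXT = 272_000       # tokens at the base input rate
TOKENS_PER_SCREENSHOT = 3_000    # midpoint of the 2,000-4,000 range
OVERHEAD = 40_000                # assumed: instructions, history, actions

budget = STANDARD_CONTEXT - OVERHEAD
cycles = budget // TOKENS_PER_SCREENSHOT
print(cycles)                    # prints: 77
```

Halve the per-screenshot cost (lower resolution, cropped regions) and the cycle count roughly doubles, which is why aggressive screenshot management is the main lever for long workflows.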
How This Stacks Up Against Claude's Computer Use
Anthropic shipped computer use with Claude back in late 2024 and has iterated since. Claude Opus 4.6 scores 72.7% on the same OSWorld-Verified benchmark — close, but GPT-5.4 edges it out by about 2 percentage points.
The real difference isn't the benchmark gap; it's the integration approach. Claude's computer use has been available longer and has a more mature ecosystem of wrappers and frameworks around it. OpenAI's advantage is bundling everything into a single model that also happens to be very strong at coding, reasoning, and long-context tasks. You don't need a separate model for "thinking about the task" versus "executing on the desktop."
In practice, I'd bet most teams building serious desktop automation agents will test both and pick based on their specific workflow. The models have different failure modes — GPT-5.4 is reportedly better at recovering from unexpected dialogs, while Claude handles multi-window coordination more reliably.
What Developers Should Actually Build With This
The obvious use case is RPA replacement. Traditional RPA tools (UiPath, Automation Anywhere) require you to manually define every click path, and they break the moment a UI changes. An LLM-based agent that reasons from screenshots doesn't care if a button moved 50 pixels to the left.
But the more interesting applications are the ones that couldn't exist before:
QA automation that actually understands intent. Instead of brittle Selenium scripts, you describe what the test should verify in natural language. The agent navigates to the right page, performs the action, and checks the result. When the UI gets redesigned, the test still works because it's reasoning about the goal, not xpath selectors.
Internal tool glue. Every company has three enterprise apps that don't have APIs but do have web UIs. An agent that can log into the HR system, pull a report, paste numbers into the finance tool, and send a summary via the internal messaging app — that's genuinely useful work that currently requires a human to do manually.
Accessibility testing at scale. Point the agent at your app and ask it to complete common tasks using only keyboard navigation, or with specific display settings. It can surface usability issues faster than manual testing.
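For the QA case, the shift is that a test becomes a statement of intent rather than a click path. A sketch of what that might look like as data, with a stubbed agent standing in for the computer-use loop — every name here is made up for illustration:

```python
# Intent-level test cases instead of selector-level scripts.
# IntentTest and the stubbed agent are hypothetical, for illustration only.

from dataclasses import dataclass

@dataclass
class IntentTest:
    goal: str      # what the agent should accomplish, in natural language
    verify: str    # what the end state should look like

tests = [
    IntentTest(goal="log in with the demo account",
               verify="the dashboard shows the user's name"),
    IntentTest(goal="filter orders to last 30 days and export CSV",
               verify="a CSV download completes with more than 0 rows"),
]

def run(test, agent):
    # A real runner would hand test.goal to the computer-use agent and
    # check the final screenshot against test.verify.
    return agent(test.goal)

results = [run(t, agent=lambda goal: f"done: {goal}") for t in tests]
print(len(results))              # prints: 2
```

Nothing in these tests names a button, a URL, or a selector, which is exactly why a UI redesign doesn't break them.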
The Catch
75% is better than human average, but it's not 99%. One in four tasks still fails. For high-stakes workflows — anything involving money, credentials, or irreversible actions — you absolutely need human-in-the-loop confirmation. The confirmation policy system in the API exists for exactly this reason, and if you skip it to save latency, you'll regret it.
Cost is the other constraint. A heavy computer-use session that runs 100 cycles might cost $3-5 in API calls. That's fine for replacing a $50/hour manual process, but it adds up fast if you're running thousands of automation jobs daily. The cached-input discount (50% off for repeated context) helps, but plan your token budget carefully.
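A rough cost model shows why history trimming matters. This sketch assumes the $2.50/MTok input rate and 50% cached-input discount above, plus a sliding window that keeps only the last 10 screenshots in context — that window size is an assumption, not anything the API mandates:

```python
# Rough session-cost estimate (approximate; window size is an assumption).

RATE = 2.50 / 1_000_000          # $ per fresh input token
CACHED_RATE = RATE * 0.5         # repeated context billed at half
TOKENS_PER_SHOT = 3_000
WINDOW = 10                      # assumed: keep only last 10 screenshots

def session_cost(cycles):
    cost, context = 0.0, 0
    for _ in range(cycles):
        cost += context * CACHED_RATE        # re-sent history, cached
        cost += TOKENS_PER_SHOT * RATE       # fresh screenshot
        context = min(context + TOKENS_PER_SHOT,
                      WINDOW * TOKENS_PER_SHOT)
    return cost

print(round(session_cost(100), 2))           # prints: 4.29
```

Without the window, context grows every cycle and the same 100-cycle session costs several times more — most of the bill is re-sent history, not fresh screenshots.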
There's also the question of speed. Each cycle requires a screenshot capture, an API round trip, and action execution. Latency per cycle is roughly 2-4 seconds in ideal conditions. A 50-step workflow takes 2-3 minutes. That's fine for background automation, unacceptable for anything interactive.
Where This Goes Next
GPT-5.5 (internally codenamed "Spud," apparently) has finished pretraining and is expected in Q2. If the computer-use capability continues improving at this rate, we might see 85-90% OSWorld scores by summer. At that point, the reliability objection mostly disappears for non-critical workflows.
The bigger shift is that every major model provider now treats computer use as a core capability, not a research preview. Anthropic, OpenAI, and Google are all shipping it. That means the tooling ecosystem — screenshot capture libraries, action execution frameworks, safety middleware — is about to get a lot more mature.
If you've been waiting for computer-use agents to be "ready," the honest answer is: they're ready for supervised automation today, and unsupervised automation for low-risk tasks. Build the confirmation layer now so you're ready to relax it later.