OpenAI shipped a Codex update on April 16 that got buried under the usual cycle of model-release drama and benchmark wars. No keynote, no countdown timer — just a blog post and a changelog. Which is strange, because this update fundamentally changes what Codex is. It's not a code assistant anymore. It's a desktop agent that happens to be very good at code.

What Actually Shipped

The update bundles four capabilities that would each be a solid standalone release: background computer use on macOS, an in-app browser with direct page commenting, over 90 new plugins including Atlassian Rovo, CircleCI, CodeRabbit, and GitLab Issues, and long-running task scheduling that can span days or weeks. Image generation via gpt-image-1.5 tagged along too, because apparently that's a checkbox feature now.

The unifying thread is that Codex no longer just writes code when you ask. It operates software, manages workflows, and maintains context across sessions. OpenAI is pulling instruction, execution, and inspection into a single app surface.

The Perception-Action Loop Under the Hood

Computer use runs on GPT-5.4 — which OpenAI bills as the first general-purpose model to beat human experts on the OSWorld desktop-task benchmark. The trajectory here is striking: GPT-5.2 scored 47.3% in December, GPT-5.3 Codex hit 64.7% in February, and GPT-5.4 landed at 75.0% in March against a human baseline of 72.4%. A 28-point climb in four months.

The mechanics work through a screenshot-action loop. The model receives a capture of your desktop, reasons about what should happen next, then fires structured commands — mouse clicks at specific pixel coordinates, keyboard inputs, scrolling, window switches. This isn't API-based tool use. The model touches software the way you do: through the graphical interface. That means it can operate apps without APIs, navigate unfamiliar interfaces, and recover from unexpected states like pop-up dialogs or error messages.
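The loop above can be sketched in a few lines. This is a minimal, self-contained illustration, not OpenAI's actual implementation: the `Action` schema, the stub model, and the simulated desktop state are all invented for the example. The real system sends pixel screenshots to the model; here a small JSON blob stands in for the capture so the recovery behavior (handling a surprise dialog) is visible.

```python
import json
from dataclasses import dataclass

# Hypothetical action schema for illustration -- not OpenAI's actual format.
@dataclass
class Action:
    kind: str              # "click", "type", "scroll", "done"
    x: int = 0
    y: int = 0
    text: str = ""

def stub_model(screenshot: bytes, goal: str) -> Action:
    """Stand-in for the model call: reads a fake 'screenshot' (JSON
    describing visible UI state) and decides the next structured command."""
    ui = json.loads(screenshot)
    if ui.get("dialog_open"):                  # recover from a surprise pop-up
        x, y = ui["dialog_close_at"]
        return Action("click", x, y)
    if not ui.get("form_filled"):
        return Action("type", text=goal)
    return Action("done")

def perception_action_loop(goal, capture, act, max_steps=10):
    """Screenshot -> reason -> act, until the model signals completion."""
    trace = []
    for _ in range(max_steps):
        action = stub_model(capture(), goal)
        trace.append(action.kind)
        if action.kind == "done":
            return trace
        act(action)                            # fire the structured command
    return trace

# Simulated desktop: a dialog is blocking, then a form needs filling.
state = {"dialog_open": True, "dialog_close_at": [980, 40], "form_filled": False}

def capture() -> bytes:
    return json.dumps(state).encode()

def act(action: Action) -> None:
    if action.kind == "click":
        state["dialog_open"] = False
    elif action.kind == "type":
        state["form_filled"] = True

trace = perception_action_loop("fill the form", capture, act)
print(trace)   # -> ['click', 'type', 'done']
```

The pop-up branch is the interesting part: because the model re-perceives the screen on every iteration, an unexpected dialog just becomes the next thing to handle, rather than a crash.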

OpenAI also baked in a Playwright bridge for hybrid scenarios. If a web app has a programmable surface, Codex writes automation scripts. If it doesn't, the model falls back to visual interaction. That flexibility matters more than the raw benchmark number, because real developer workflows bounce between scriptable and non-scriptable tools constantly.
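The dispatch logic behind that fallback is simple to sketch. Everything here is a stand-in — `has_automation_hooks`, `run_playwright_script`, and `visual_interaction` are illustrative names, not real Codex or Playwright APIs — but the shape of the decision is the point: prefer the programmable surface, degrade to pixels.

```python
# Hypothetical dispatcher sketching the scriptable-vs-visual fallback.
# All three helpers are illustrative stand-ins, not real Codex APIs.

def has_automation_hooks(target: str) -> bool:
    # Stand-in heuristic: treat URLs as scriptable web surfaces.
    return target.startswith("https://")

def run_playwright_script(target: str) -> str:
    return f"scripted:{target}"    # real code would drive a Playwright page

def visual_interaction(target: str) -> str:
    return f"visual:{target}"      # real code would run the screenshot loop

def automate(target: str) -> str:
    """Prefer the programmable surface; fall back to visual interaction
    if the target has no hooks or the script fails mid-run."""
    if has_automation_hooks(target):
        try:
            return run_playwright_script(target)
        except Exception:
            pass                   # script broke partway -> go visual
    return visual_interaction(target)

print(automate("https://app.example.com"))  # -> scripted:https://app.example.com
print(automate("Settings.app"))             # -> visual:Settings.app
```

Note the `except` branch: a hybrid agent doesn't just pick a path up front, it can abandon a broken script mid-task and finish visually, which is what makes the combination sturdier than either mode alone.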

Practically, this means you can ask Codex to QA a frontend change by actually clicking through the flow in a real browser, debug an iOS Simulator session, or configure a desktop application's settings — all while you're writing code in a different window.

The Plugin Layer Is the Quiet Power Move

Ninety-plus plugins at launch sounds like a marketing number, but the detail worth noting is that plugins combine "skills, app integrations, and MCP servers." That last part is the real play. MCP — Model Context Protocol — is becoming the standard connector between AI agents and external tools, and Codex now supports it natively.

You can wire Codex to your company's internal tooling: deployment pipelines, monitoring dashboards, issue trackers, whatever — through MCP servers you write yourself. The built-in integrations cover the obvious stuff: CircleCI failures, GitHub PR reviews, SSH tunnels into remote devboxes, multiple terminal tabs. But the extensibility story is what turns this from a product update into a platform bet. Codex wants to be the surface where everything converges.
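To make the extensibility claim concrete, here's what the skeleton of such a server looks like. MCP speaks JSON-RPC 2.0, and `tools/list` / `tools/call` are its tool-facing methods; this is a stdlib-only sketch of the dispatch layer rather than the official SDK, and the `deploy_status` tool and its payload are invented for illustration.

```python
import json

# Minimal, stdlib-only sketch of an MCP-style tool server. MCP is JSON-RPC
# 2.0 over stdio; "tools/list" and "tools/call" are its tool methods. The
# deploy_status tool and its response payload are invented for illustration.

TOOLS = {
    "deploy_status": {"description": "Report a (hypothetical) pipeline's state"},
}

def handle(request: dict) -> dict:
    """Dispatch one JSON-RPC request to a tool handler."""
    rid, method = request["id"], request["method"]
    if method == "tools/list":
        result = {"tools": [{"name": n, **meta} for n, meta in TOOLS.items()]}
    elif method == "tools/call" and request["params"]["name"] == "deploy_status":
        result = {"content": [{"type": "text", "text": "pipeline green"}]}
    else:
        return {"jsonrpc": "2.0", "id": rid,
                "error": {"code": -32601, "message": "method or tool not found"}}
    return {"jsonrpc": "2.0", "id": rid, "result": result}

resp = handle({"jsonrpc": "2.0", "id": 1, "method": "tools/call",
               "params": {"name": "deploy_status", "arguments": {}}})
print(resp["result"]["content"][0]["text"])  # -> pipeline green
```

A real server would read requests off stdin and handle the MCP initialization handshake, but the dispatch shape is the whole trick: once your internal tool answers these two methods, any MCP-aware agent can call it.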

Where It Falls Short

Start with the obvious: macOS only for computer use. If your team has developers on Linux or Windows, they're watching from the sidelines for now. The EU, UK, and Switzerland are also blocked at launch — regulatory friction around screen recording and accessibility permissions, presumably.

The bigger frustration is model routing. You can't choose which model handles your task. Codex picks internally based on complexity, repo size, and factors OpenAI doesn't document. For someone who understands the performance and cost gaps between GPT-5.4 and its smaller variants, the opacity is annoying. Sometimes you want the flagship model. Sometimes the cheap one would be fine. The system decides for you.

Computer use also can't touch terminals, which is genuinely ironic for a tool marketed at developers. It can't operate itself (no recursive loops), and it stops cold at system security prompts. Every new app interaction requires your explicit permission — sensible for security, but it interrupts the "background agent" promise whenever it encounters an unfamiliar application.

And for file editing or shell work, Codex itself recommends using the regular coding interface. Which raises the question: how often will developers actually reach for the GUI agent versus the code agent that already works fine?

Anthropic Got Here First

I'd be dodging reality if I didn't mention that Anthropic shipped computer use in public beta back in October 2024. On OSWorld, the current standings look like this:

| Model | OSWorld Score | Input Price (per M tokens) |
| --- | --- | --- |
| GPT-5.4 | 75.0% | $2.50 |
| Claude Opus 4.6 | 72.7% | $15.00 |
| Claude Sonnet 4.6 | 72.5% | $3.00 |

The performance gap is real but tight. GPT-5.4 leads, but Claude Sonnet 4.6 delivers nearly identical computer-use capability at a similar price point — and runs on more platforms. Where Codex wins is integration density. Computer use living inside the same app you already code in, connected to your CI pipeline and issue tracker through MCP, with long-running tasks that persist across sessions — that's a different product category than invoking computer use through an API.

For most developers, the decision comes down to ecosystem lock-in rather than raw benchmark deltas.

The Shift Nobody Announced

Operating your computer went from research demo to feature checkbox in about eighteen months. The April 16 Codex update didn't frame it as a breakthrough — it packaged it alongside image generation and plugin counts, like it was just another line item. That normalization is the real story. Whether developers integrate desktop agents into daily workflows or treat them as an occasional convenience will determine if this was a turning point or a footnote.