Anthropic dropped Claude Opus 4.7 yesterday, and the headline number is hard to ignore: 64.3% on SWE-Bench Pro, up from 53.4% on Opus 4.6. That's an 11-point jump in a single generation on the benchmark everyone treats as the coding model horse race.
The Coding Numbers
SWE-Bench Pro at 64.3% puts Opus 4.7 firmly ahead of GPT-5.4 (57.7%) and GLM-5.1 (58.4%). For context, Opus 4.6 was trading punches with those two at 53.4%. This isn't incremental; it's the largest single-version jump we've seen on this benchmark from any lab.
Anthropic hasn't published full training details, but the improvements seem concentrated in multi-step reasoning and tool orchestration. If you've been using Claude Code or similar agentic frameworks, the difference should be noticeable: fewer loops where the model talks itself into a corner, better recovery when an approach fails mid-stream. The practical upshot is that multi-file refactors and complex debugging sessions need less hand-holding.
The model also introduces an "xhigh" effort level sitting between the existing "high" and "max" settings. If you've been stuck choosing between "fast but sloppy" and "thorough but I'm paying for a novel," this gives you a useful middle ground for tasks like code review or cross-repo migrations where you want careful reasoning without burning max-tier tokens on every call.
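To make the tradeoff concrete, here's a minimal sketch of routing tasks to effort tiers. The tier names come from the release notes, but the mapping, the function, and the idea of a client-side router are all illustrative assumptions, not the actual SDK API:

```python
# Hypothetical sketch: the tier names are from the release notes, but this
# routing heuristic is illustrative and is NOT part of the real SDK.
EFFORT_TIERS = ("low", "high", "xhigh", "max")

def pick_effort(task: str) -> str:
    """Map a task category to an effort tier (illustrative heuristic only)."""
    if task in {"autocomplete", "quick_fix"}:
        return "low"        # fast and cheap; sloppiness is tolerable
    if task in {"code_review", "cross_repo_migration"}:
        return "xhigh"      # careful reasoning without max-tier token burn
    if task == "novel_architecture":
        return "max"        # exhaustive reasoning is worth the spend
    return "high"           # sensible default for everything else

print(pick_effort("code_review"))
```

The point of the middle tier is exactly this kind of routing: most agentic work doesn't need "max", but "high" leaves quality on the table for review-style tasks.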
The Tokenizer Tax
Here's the part Anthropic didn't put in the headline: Opus 4.7 ships with a new tokenizer that maps the same text to up to 35% more tokens.
The per-token price stays at $5/$25 (input/output) per million tokens, same as Opus 4.6. But if your typical prompt now generates 35% more tokens, your actual bill goes up by roughly that much. Anthropic can truthfully say "we didn't raise prices" while your invoice disagrees.
This matters at scale. A team processing 10M tokens per day on input just quietly went from around $50/day to roughly $67.50/day in the worst case, and that's before output tokens, where the multiplier hits harder at $25 per million. Run the math on your own usage before treating this as a free upgrade.
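The back-of-envelope math is simple enough to script. This sketch uses the prices and the worst-case 35% inflation figure quoted above; your real ratio depends on your content mix, so treat the inflation argument as something to measure, not assume:

```python
def daily_cost(million_tokens: float, price_per_million: float,
               token_inflation: float = 0.0) -> float:
    """Daily spend in dollars, with an optional tokenizer-inflation factor."""
    return million_tokens * (1.0 + token_inflation) * price_per_million

# 10M input tokens/day at $5 per million
before = daily_cost(10, 5.0)         # old tokenizer: $50.00/day
after = daily_cost(10, 5.0, 0.35)    # worst-case 35% inflation: $67.50/day
print(f"${before:.2f} -> ${after:.2f}")
```

Swap in your own daily token volume and measured inflation ratio to see what the upgrade actually costs you.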
The new tokenizer likely improves how the model represents code and structured data — more granular tokens can mean better precision on syntax-heavy content. But calling it a price-neutral release is generous at best.
The Model They Won't Let You Touch
The real elephant here is Claude Mythos Preview, which scores 77.8% on SWE-Bench Pro. That's 13.5 points above Opus 4.7. Anthropic built something significantly better than their best public offering and decided you shouldn't have it.
The reason is cybersecurity. According to Anthropic's red team report, Mythos can autonomously discover and chain zero-day exploits across every major operating system and browser. It wrote a browser exploit chaining four vulnerabilities — JIT heap sprays that escaped both renderer and OS sandboxes. It found privilege escalation bugs using race conditions and KASLR bypasses, the kind of work that takes human security researchers weeks.
The response was Project Glasswing: hand Mythos to a select group of security partners like CrowdStrike to shore up defenses before similar capabilities leak from other labs. Meanwhile, Opus 4.7 was designed to be explicitly less capable in those areas. Anthropic says they "experimentally tried to reduce certain cyber capabilities differentially during training." Translation: they surgically reduced the security skills.
For most devs, this is irrelevant — you don't need your coding assistant to write kernel exploits. But the precedent matters. The gap between what a lab can build and what you can access is widening, and the gating criteria have shifted from "pay more" to "you're not authorized."
Vision Got a Quiet Upgrade
One thing that flew under the radar: Opus 4.7 handles images at up to 3.75 megapixels, roughly three times Opus 4.6's limit. Document reasoning accuracy on OfficeQA Pro jumped from 57.1% to 80.6%, a 23.5-point gain that moves use cases like document parsing and screenshot analysis from "cool demo" to "actually production-ready." If you gave up on Claude for visual tasks, it's worth a second look.
Overrefusal Is Better, Not Gone
Opus 4.6 refused legitimate AI safety research requests 88% of the time. Opus 4.7 brings that down to 33%. Real progress, but still a one-in-three chance of getting blocked on reasonable work. If you're doing security research or red-teaming, you'll hit walls. A new Cyber Verification Program gates penetration testing access behind an application process — Anthropic clearly trying to thread the needle between safety and utility.
Where This Leaves the Rankings
The generally available coding leaderboard now reads: Opus 4.7 (64.3%) → GLM-5.1 (58.4%) → GPT-5.4 (57.7%) → Gemini 3.1 Pro (54.2%). Anthropic retakes the lead convincingly.
But pricing muddies the picture. GLM-5.1 runs at $1.40/$4.40 (input/output) per million tokens under an MIT license, and you can self-host the weights. Factor in Opus 4.7's tokenizer inflation and the real cost gap between "best model" and "best value" keeps growing. If coding quality is your top priority, Opus 4.7 is the clear choice. If you're optimizing cost per resolved issue, GLM-5.1 at a fraction of the price is hard to argue against, even trailing by six points.
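One crude way to frame "cost per resolved issue" is to normalize a blended token price by each model's resolve rate. The prices and scores below are the figures quoted in this post; the 3:1 input:output mix and the per-token framing are my assumptions, and this ignores that harder tasks burn more tokens:

```python
def cost_per_resolve(blended_price: float, resolve_rate: float) -> float:
    """Blended $/M tokens divided by resolve rate: a crude value proxy."""
    return blended_price / resolve_rate

# Blended prices assume a 3:1 input:output token mix (an assumption).
opus_blended = (3 * 5.0 + 25.0) / 4 * 1.35   # worst-case tokenizer inflation applied
glm_blended = (3 * 1.40 + 4.40) / 4

print(f"Opus 4.7: {cost_per_resolve(opus_blended, 0.643):.2f}")
print(f"GLM-5.1:  {cost_per_resolve(glm_blended, 0.584):.2f}")
```

Even as a rough proxy, the gap is multiples, not percentage points, which is why "best value" and "best model" keep diverging.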
Anthropic shipped a genuinely better coding model and a genuinely higher bill at the same time — which is about as honest a summary of the current AI market as I can give you.