Two days ago, Treasury Secretary Scott Bessent and Fed Chair Jerome Powell pulled the CEOs of Citigroup, Morgan Stanley, Bank of America, Wells Fargo, and Goldman Sachs into an emergency meeting at the Treasury building. The topic wasn't a banking crisis or a market crash. It was a language model.

Anthropic's Claude Mythos Preview had, in the weeks prior, autonomously discovered thousands of zero-day vulnerabilities across every major operating system and browser. Not static analysis hits. Not theoretical bugs. Working exploits — including a 27-year-old TCP bug in OpenBSD and a 17-year-old remote root exploit in FreeBSD that had been sitting there since the Bush administration.

Anthropic won't ship Mythos publicly. Instead, they built Project Glasswing, a restricted program giving the model to a dozen launch partners — AWS, Apple, Microsoft, Google, CrowdStrike, Palo Alto Networks, JPMorgan — plus about 40 additional organizations maintaining critical infrastructure. Everyone else waits.

This is worth unpacking.

The Capability Jump Is Hard to Argue With

Let me put the gap in context. Opus 4.6, Anthropic's previous frontier model, had a near-zero success rate at autonomous exploit development. In testing against Firefox vulnerabilities, Opus managed exactly 2 working exploits out of several hundred attempts.

Mythos? 181 working exploits from the same test set. Plus 29 cases where it achieved register control — one step short of full exploitation.

Benchmark Opus 4.6 Mythos Preview Delta
SWE-bench Verified 80.8% 93.9% +13.1
CyberGym 66.6% 83.1% +16.5
Terminal-Bench 2.0 ~55% ~72% +16.6
Cybench Partial 100%
Firefox exploit dev 2 successes 181 successes 90×

This isn't incremental. On cybersecurity tasks, the jump from Opus to Mythos looks like a generational leap — maybe two. And Anthropic claims these capabilities weren't explicitly trained for. They emerged as downstream consequences of better reasoning and longer autonomous task execution.

That last part should make you pay attention.

What Mythos Actually Dug Up

The specific vulnerability finds read like a greatest hits of "how did nobody catch this."

CVE-2026-4747 — FreeBSD NFS (17 years old): A stack buffer overflow in the RPCSEC_GSS authentication path. Unauthenticated. Remote. Root. The server binary lacked stack canaries, so Mythos chained the overflow directly into code execution. This was sitting in production FreeBSD deployments for nearly two decades.

OpenBSD SACK Bug (27 years old): A signed integer overflow in TCP sequence number comparisons. Enables remote denial-of-service against the OS that bills itself as the most secure operating system on the planet.

FFmpeg H.264 Codec (16 years old): An out-of-bounds heap write triggered by crafted video frames with 65,536+ slices. Automated fuzzing tools had scanned this exact code path across five million test runs without catching it.

Rust VMM Guest-to-Host Escape: Perhaps the most unsettling find — memory corruption in a production Rust-based virtual machine monitor. The whole premise of rewriting systems code in Rust is memory safety. Mythos found a way through anyway.

The model also completed Linux kernel exploit chains from known CVEs in under a day for less than $2,000 in compute. One chain converted a single-byte read primitive into full root through KASLR defeat, heap spraying via System V message queues, and credential elevation through the packet scheduler. The kind of work that would take an elite human researcher a week got compressed into hours.

Professional human reviewers agreed with Mythos's severity assessments 89% of the time across 198 reviewed reports, with 98% falling within one severity level. The model isn't just finding bugs — it's accurately triaging them.

What Glasswing Means If You're Not on the List

If you're a developer who doesn't work at one of those 50-ish partner organizations, the honest answer is: you can't use Mythos right now and there's no public timeline for when you will.

But the downstream effects are already in motion.

Over 99% of discovered vulnerabilities remain unpatched. Anthropic published SHA-3 hashes of their unreleased vulnerability reports — cryptographic proof they found these bugs, timestamped, without revealing details — and coordinated 90+45 day disclosure windows with affected projects. The patches will come. The volume will be unprecedented.

If your stack includes FreeBSD, FFmpeg for media processing, or Rust-based hypervisors, start paying closer attention to upstream security advisories now. A wave of critical patches is coming in the next few months, and the origin story behind them will be unusual.

For the broader industry, Glasswing sets a precedent that matters. Anthropic is essentially saying: some models are too capable for general release, and the responsible move is restricted deployment under controlled conditions. Simon Willison called this approach "necessary" — and looking at the 90× exploit success rate improvement, it's genuinely hard to disagree.

The Part Nobody Wants to Talk About

Here's what I keep coming back to: Anthropic says these offensive capabilities emerged from general improvements in reasoning, not targeted cybersecurity training. If that's true, every other frontier lab is one or two generations away from hitting the same capability cliff.

Google, Meta, OpenAI, DeepSeek — they're all pushing reasoning and agentic autonomy as hard as they can. When their models cross this same threshold, will they have the partnerships, the disclosure infrastructure, and frankly the willingness to hold back a general release?

The Bessent-Powell meeting suggests the U.S. government is treating this seriously, at least for the financial sector. But summoning bank CEOs to Washington isn't a scalable strategy. Healthcare, energy, telecom — critical infrastructure beyond Wall Street — needs the same conversation.

Some on Hacker News are skeptical, arguing Mythos is partly a sales pitch for Anthropic's security consulting position. Fair enough — there's definitely a business angle to being the lab that finds zero-days in everyone else's code. But the benchmarks are independently verifiable, the CVEs are real, and the government's response wasn't theater.

Mythos Preview is the first frontier model where the security capabilities are so far ahead of existing tooling that restricting access felt unavoidable. I don't think it'll be the last.