Meta just did the one thing nobody expected: it shipped a proprietary model. After three years of building goodwill with the Llama series — the open-weight darling that gave every startup and grad student access to frontier-class AI — Meta Superintelligence Labs dropped Muse Spark on April 8 with no downloadable weights, no Apache 2.0 license, and not even a public API yet. Just a private preview for "select partners."
The question isn't whether Muse Spark is good. It is. The question is whether what Meta gained is worth what it gave up.
What Actually Changed
Muse Spark is the first model from Meta Superintelligence Labs, the unit Zuckerberg built around Alexandr Wang after poaching him from Scale AI in a deal reportedly worth $14 billion. The team rebuilt Meta's entire pretraining stack over nine months — architecture, optimizer, data pipeline, the works. Code-named Avocado internally, the model represents a ground-up departure from the Llama lineage rather than an incremental upgrade.
The headline claim: Muse Spark matches Llama 4 Maverick's capabilities with less than a tenth of the compute. That's not a minor efficiency gain. That's the kind of improvement that changes what's economically feasible to train and serve at Meta's scale — billions of users across Facebook, Instagram, WhatsApp, and Threads hitting the same model simultaneously.
On the Artificial Analysis Intelligence Index, Muse Spark scores 52, landing fourth behind Gemini 3.1 Pro (57), GPT-5.4 (57), and Claude Opus 4.6 (53). For context, Llama 4 Maverick debuted at 18. Nearly tripling your Index score in one generation is a serious jump, even if it doesn't take the crown.
Thought Compression Is the Interesting Bit
Most frontier labs are throwing more compute at test-time reasoning — longer chains of thought, more tokens, bigger bills. Meta went the other direction.
During reinforcement learning, Muse Spark gets penalized for using too many reasoning tokens. The model learns to compress its thinking: solving problems with fewer intermediate steps without losing accuracy. Meta calls this "thought compression," and the training curves show something worth paying attention to — performance initially dips as the model learns to be concise, then recovers and eventually surpasses the uncompressed baseline. The result is a model that reasons well without the sprawling internal monologue that makes other reasoning models expensive to serve.
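Meta hasn't published the exact objective, but a length-penalized reward of this general shape can be sketched in a few lines. Everything here is an illustrative stand-in, not Meta's formulation: `compressed_reward`, the token budget, and the penalty weight are all invented for the sketch.

```python
def compressed_reward(answer_correct: bool,
                      reasoning_tokens: int,
                      token_budget: int = 2048,
                      penalty_weight: float = 0.3) -> float:
    """Reward correctness, but subtract a penalty that grows as the
    chain of thought runs past the token budget."""
    base = 1.0 if answer_correct else 0.0
    # Fractional overage beyond the budget; zero if within budget.
    overage = max(0, reasoning_tokens - token_budget) / token_budget
    return base - penalty_weight * overage

# A correct answer that fits the budget outscores a correct answer
# that takes twice the budget to reach.
concise = compressed_reward(True, 1500)   # 1.0
verbose = compressed_reward(True, 4096)   # 0.7
```

A reward shaped like this explains the dip-then-recover training curves: accuracy suffers while the policy learns to be terse, then recovers once concise reasoning stabilizes.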
This matters for anyone running inference at scale. Reasoning models from other labs — OpenAI's o-series, Gemini Deep Think — optimize for answer quality and let token count balloon. Muse Spark used 58 million output tokens to run the full Intelligence Index suite, in line with Gemini 3.1 Pro's 57 million: reasoning-tier quality without the runaway internal monologue. If you're serving reasoning to two billion daily actives, that efficiency compounds fast.
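To see how the compounding works, here's a back-of-envelope sketch. Every number is a hypothetical placeholder: the per-query token counts, the query volume, and the $0.50-per-million serving cost are invented for illustration, not drawn from Meta's disclosures.

```python
def daily_token_cost(tokens_per_query: float,
                     queries_per_day: float,
                     usd_per_million_tokens: float) -> float:
    """Daily serving cost in USD for output tokens alone."""
    return tokens_per_query * queries_per_day * usd_per_million_tokens / 1e6

# Two billion daily queries at an assumed $0.50 per million output tokens:
verbose_model = daily_token_cost(2_000, 2e9, 0.50)  # sprawling chain of thought
compressed    = daily_token_cost(800, 2e9, 0.50)    # compressed reasoning
savings = verbose_model - compressed                # over $1M per day
```

Shaving 1,200 reasoning tokens off the average query is invisible to a single user and worth seven figures a day at platform scale.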
Three Modes, One Obvious Gap
Muse Spark ships with tiered reasoning: Instant (fast, casual), Thinking (step-by-step), and Contemplating. That last mode is the wild card — it spawns multiple sub-agents that generate solutions independently, then aggregates and refines them into a single answer. Parallel reasoning without proportional latency increase, at least in theory.
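Meta hasn't detailed how Contemplating aggregates its sub-agents, but the pattern it describes resembles self-consistency voting: run several independent reasoning passes concurrently, then pick the consensus answer. A toy sketch of that pattern, with `solve` as a hypothetical stand-in for a real model call sampled at different seeds:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def solve(question: str, seed: int) -> str:
    """Stand-in for one sub-agent's independent reasoning pass.
    A real system would sample the model with varied seeds/temperature."""
    candidates = ["4", "4", "5"]  # two agents agree, one dissents
    return candidates[seed % len(candidates)]

def contemplate(question: str, n_agents: int = 3) -> str:
    # Sub-agents run in parallel, so wall-clock latency stays close
    # to a single pass rather than growing with n_agents.
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        answers = list(pool.map(lambda s: solve(question, s),
                                range(n_agents)))
    # Aggregate by majority vote (self-consistency).
    return Counter(answers).most_common(1)[0][0]
```

A production version would replace majority voting with a refinement step that reads all the drafts, but the latency argument is the same: the expensive passes happen side by side, not end to end.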
The results in Contemplating mode are legitimately strong. On Humanity's Last Exam, Muse Spark hits 50.2 — ahead of Gemini 3.1 Deep Think (48.4) and GPT-5.4 Pro (43.9). On HealthBench Hard, it scores 42.8, comfortably ahead of GPT-5.4 (40.1) and crushing Gemini 3.1 Pro (20.6). The medical AI performance comes from collaborating with over 1,000 physicians on training data curation — domain-specific effort that shows.
But the gaps are real, and Meta is honest about them.
| Benchmark | Muse Spark | GPT-5.4 | Gemini 3.1 Pro | Claude |
|---|---|---|---|---|
| Intelligence Index | 52 | 57 | 57 | 53 (Opus 4.6) |
| MMMU-Pro (vision) | 80.5% | — | 82.4% | — |
| HealthBench Hard | 42.8 | 40.1 | 20.6 | — |
| Terminal-Bench 2.0 | 59.0 | 75.1 | 68.5 | — |
| GDPval-AA (agents) | 1427 | 1676 | — | 1648 (Sonnet 4.6) |
Coding and agentic workflows are where it clearly trails. Terminal-Bench at 59 versus GPT-5.4's 75 is a 16-point chasm. On real-world agent tasks, it scores 1427 against Claude Sonnet's 1648 and GPT-5.4's 1676. Meta's own blog uses the phrase "current performance gaps" for long-horizon agentic systems and coding workflows, which is corporate-speak for "we know, we're working on it."
If you're building AI-powered dev tools or autonomous agents, Muse Spark is not your model. Not yet.
The Open-Weight Elephant
Here's the part that actually stings for the developer community. Every Llama model from 2023 through Llama 4 shipped with downloadable weights. You could fine-tune them, deploy them on your own hardware, build products without API lock-in. That was Meta's whole strategic identity. It made Llama the default choice for anyone who didn't want to depend on OpenAI or Google's pricing whims.
Muse Spark has none of that. No weights. No self-hosting. Not even a public API — just a private preview with no announced pricing. The model currently lives inside Meta AI across their family of apps. For developers outside Meta's walled garden, Muse Spark is a benchmark report you can read about and nothing more.
Meta insists Llama isn't dead. They've positioned it as the open-weight line and Muse as the proprietary one. But when your proprietary model is dramatically better than your open-weight offering — and you're openly exploring API revenue from it — the open model starts looking like a consolation prize. Meanwhile, Google shipped Gemma 4 under Apache 2.0 this same month, and Alibaba dropped Qwen 3.5 with comparable licensing. The open-weight space isn't hurting for options; it's just hurting for Meta specifically.
What's Worth Watching
You can't use Muse Spark right now unless you're a select partner, so there's no action item for most developers. What's worth tracking is whether Meta publishes the thought compression methodology. They've historically been good about this — open-sourcing training innovations even when the models themselves stay proprietary. If the technique generalizes (and there's no architectural reason it shouldn't), it could change how the broader community trains reasoning models. Compressing chain-of-thought at the RL level is applicable to any model architecture, not just Muse's.
For production workloads today, the open-weight landscape is well-served by Gemma 4 or Qwen 3.5 anyway. Both ship under Apache 2.0, both handle multimodal input natively, and neither requires you to wait for a partner invite.
Meta built something technically impressive and strategically confusing — a strong reasoning model behind a wall, released the same month competitors are giving theirs away for free.