Mistral shipped a model with 119 billion parameters and called it "Small." Under Apache 2.0. With a 256K context window, native vision across 24 languages, and built-in function calling. Most of the coverage fixated on the parameter count — but the feature I keep thinking about is quieter: a reasoning_effort toggle that lets you switch between fast instruct mode and deep chain-of-thought reasoning with a single API parameter. One deployment, two behaviors, zero routing complexity.
The Architecture in 30 Seconds
Mistral Small 4 is a mixture-of-experts model packing 128 expert modules, of which only 4 fire per token. Active parameter count per forward pass: 6.5 billion. Total weight sitting in memory: 119B. A learned router picks which specialists handle each token, giving you the knowledge capacity of a much larger model without the full compute hit on every call.
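A toy sketch of those selection mechanics, assuming standard top-k gating. The real router is a trained layer producing scores from the token's hidden state; random numbers stand in for those scores here:

```python
import math
import random

def route_token(scores, k=4):
    """Given one router score per expert, keep the top-k and
    softmax-normalize their mixing weights; the rest stay idle."""
    top_k = sorted(range(len(scores)), key=lambda i: scores[i])[-k:]
    m = max(scores[i] for i in top_k)
    exps = [math.exp(scores[i] - m) for i in top_k]
    total = sum(exps)
    return top_k, [e / total for e in exps]

random.seed(0)
scores = [random.gauss(0, 1) for _ in range(128)]  # 128 experts, as in Small 4
experts, mix = route_token(scores)
print(len(experts), round(sum(mix), 6))  # 4 experts fire; the other 124 stay idle
```

The token's output is the weighted sum of just those four experts' outputs, which is where the per-token compute savings come from.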
That 128-expert count is aggressive. Most MoE architectures top out around 64. Nemotron 3 Super runs a 120B/12B split with 10% activation. DeepSeek V4 pushed to a trillion total parameters with a wider active slice. Mistral went the other direction — extreme sparsity, under 5.5% of the model firing per token. The throughput payoff is real: 3x more requests per second than its predecessor, with 40% lower end-to-end latency. The memory bill, though, is a different story.
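The activation ratios work out as quick arithmetic, using the figures quoted above:

```python
# (active params, total params), in billions, as quoted in the text
models = {
    "Mistral Small 4":  (6.5, 119),
    "Nemotron 3 Super": (12, 120),
}
for name, (active, total) in models.items():
    print(f"{name}: {active / total:.2%} of weights active per token")
# Mistral Small 4:  5.46% of weights active per token
# Nemotron 3 Super: 10.00% of weights active per token
```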
The Feature That Actually Changes Your Stack
What grabbed me in Mistral's announcement wasn't the parameter count. It was reasoning_effort. Here's the practical version:
```python
from mistralai import Mistral

client = Mistral(api_key="your-key")

# Fast mode — snappy tool calls, no thinking overhead
fast = client.chat.complete(
    model="mistral-small-latest",
    messages=[{"role": "user", "content": "Parse this JSON and extract all email addresses"}],
    reasoning_effort="none",
)

# Deep mode — step-by-step reasoning before answering
deep = client.chat.complete(
    model="mistral-small-latest",
    messages=[{"role": "user", "content": "Find the concurrency bug in this Go function"}],
    reasoning_effort="high",
)
```
Set it to "none" and you're running something equivalent to Mistral Small 3.2 — direct, fast, minimal token spend. Flip to "high" and the model engages full chain-of-thought, matching what Mistral previously shipped as the separate Magistral model line.
If you've built production agents, you know the pain this solves. The standard pattern is maintaining two deployments: a cheap model for simple routing and tool calls, and a heavier reasoning model for complex tasks. Your orchestrator decides which to invoke, you pay for two sets of infrastructure, and the latency profile of the split adds up. Mistral just collapsed that into one parameter. Your orchestration layer sets reasoning_effort based on task difficulty, same endpoint handles both.
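A minimal sketch of that collapsed pattern. The keyword heuristic and helper names below are hypothetical placeholders; a real agent would let its planner or a cheap classifier judge task difficulty:

```python
def effort_for(task: str) -> str:
    """Crude difficulty heuristic (illustrative only): send debugging
    and 'why' questions to deep reasoning, everything else to fast mode."""
    hard_hints = ("debug", "deadlock", "prove", "race", "why")
    return "high" if any(h in task.lower() for h in hard_hints) else "none"

def request_kwargs(task: str) -> dict:
    """Build the arguments for client.chat.complete. Same model, same
    endpoint either way; only reasoning_effort changes."""
    return {
        "model": "mistral-small-latest",
        "messages": [{"role": "user", "content": task}],
        "reasoning_effort": effort_for(task),
    }

print(request_kwargs("Extract the order IDs from this log")["reasoning_effort"])   # none
print(request_kwargs("Why does this mutex deadlock under load?")["reasoning_effort"])  # high
```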
The efficiency story is compelling too. On the LCR benchmark, Small 4 achieves 0.72 accuracy generating just 1,600 characters of output. Comparable Qwen models needed 5,800–6,100 characters to reach similar scores. When this thing reasons, it reasons concisely.
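For scale, the character counts quoted above imply:

```python
small4_chars = 1_600
qwen_low, qwen_high = 5_800, 6_100  # range quoted for comparable Qwen models
print(f"{qwen_low / small4_chars:.1f}x to {qwen_high / small4_chars:.1f}x "
      "less output for a similar score")
# 3.6x to 3.8x less output for a similar score
```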
Benchmarks: Solid but Selective
Mistral's headline claim: Small 4 outperforms GPT-OSS 120B on LiveCodeBench while producing 20% fewer tokens. On AIME 2025 it's competitive. Those throughput and latency gains over Small 3 are independently measurable.
I buy these numbers for the workloads Mistral optimized for — function calling, JSON output, code generation, structured agentic tasks. The model handles vision, tool use, and structured output natively, no adapters bolted on. For building agent pipelines that need fast, accurate tool invocation with an optional reasoning fallback, the profile is strong.
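For illustration, tool definitions follow the JSON-schema shape the Mistral chat API consumes. The `get_weather` function below is an invented example, not something from the announcement:

```python
# JSON-schema tool definition in the shape the Mistral chat API expects.
get_weather = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# The call site then becomes, e.g.:
# client.chat.complete(
#     model="mistral-small-latest",
#     messages=[{"role": "user", "content": "Weather in Lyon?"}],
#     tools=[get_weather],
#     tool_choice="auto",          # let the model decide when to invoke
#     reasoning_effort="none",     # tool routing rarely needs deep thought
# )

print(get_weather["function"]["name"])
```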
But look at what's absent from the announcement: no MMLU-Pro scores, no GPQA, no complex multi-hop reasoning benchmarks. Mistral selected evals that reward concise correctness, which is precisely where sparse MoE with low activation shines. Independent evaluations from the community are still rolling in. I'd hold off on declaring this a GPT-4o-mini replacement across every workload.
Open Source With a Hardware Asterisk
The weights live on Hugging Face under Apache 2.0. Fine-tune commercially, deploy anywhere, no strings. Mistral also released an NVFP4-quantized checkpoint (a collaboration with NVIDIA and Red Hat) and GGUF variants through Unsloth for anyone brave enough to try consumer hardware.
The production reality check: minimum deployment is 4x NVIDIA HGX H100, 2x HGX H200, or a single DGX B200. Mistral recommends doubling those for real throughput. You're looking at $150K–$400K in GPU iron at list prices.
This is the MoE memory paradox that rarely makes the marketing copy. Only 6.5B parameters activate per token, but all 119B must reside in GPU VRAM because the router needs instant access to any of those 128 experts. You can't page them in. Cheap compute per token, expensive memory per deployment — that's the fundamental trade.
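Rough weight-memory arithmetic makes the trade concrete. The byte widths below are the usual ones for each format (NVFP4 stored at roughly half a byte per parameter), and the totals cover weights only, ignoring KV cache and activations:

```python
total_params = 119e9  # every parameter must be resident, active or not

for fmt, bytes_per_param in {"BF16": 2, "FP8": 1, "NVFP4": 0.5}.items():
    gib = total_params * bytes_per_param / 2**30
    print(f"{fmt}: ~{gib:,.0f} GiB just for weights")
# BF16:  ~222 GiB  -> needs multi-GPU (e.g. 4x 80 GB H100)
# FP8:   ~111 GiB
# NVFP4: ~55 GiB
```

Even the NVFP4 checkpoint needs most of a high-end accelerator's VRAM before a single token of context is cached, which is why "only 6.5B active" doesn't translate into small-GPU deployment.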
For most developers, the pragmatic path is obvious: Mistral's API or NVIDIA NIM for production traffic, weights downloaded for fine-tuning on rented clusters. The Apache 2.0 license changes your business calculus around customization and data sovereignty, not necessarily around running your own inference farm. Unless you're pushing millions of tokens daily, self-hosting is paying data center rent for the satisfaction of owning your stack.
Who Should Care
If you're wiring up agentic systems — retrieval, multi-step tool use, code generation workflows — Small 4 belongs in your evaluation. The reasoning toggle alone eliminates a category of infrastructure complexity, and native tool calling with structured output means fewer moving parts between your orchestrator and the model.
The open MoE landscape in April 2026 now has three credible options with very different philosophies: DeepSeek V4 for raw trillion-parameter scale, Nemotron 3 Super for hardware-optimized agentic deployment, and Mistral Small 4 for aggressive sparse routing with maximum operational flexibility.
Calling a 119B-parameter model "small" is obviously absurd. But given that you're only paying for 6.5 billion active parameters per token, maybe Mistral's just being honest about where the industry's goalposts landed.