Google dropped Gemma 4 last Wednesday, and predictably, most of the coverage has been about the benchmark horse race — Arena rankings, MMLU Pro scores, AIME math numbers. All well and good. But I think the single most consequential thing about this release fits in two words on the model card: Apache 2.0.
The License Is the Product
Previous Gemma versions shipped under Google's custom license — the kind with usage restrictions, "Harmful Use" carve-outs, and terms Google could update whenever they felt like it. Enterprise legal teams hated it. If you've ever tried to get a custom AI model license through corporate procurement, you know exactly how that goes: weeks of back-and-forth, exceptions, addendums, and a lawyer who keeps asking "but what does 'Harmful Use' mean, precisely?"
Gemma 4 uses Apache 2.0. Same license as Qwen, Mistral, and the rest of the open-weight ecosystem. No custom clauses, no redistribution restrictions, no commercial deployment caveats. Your legal team already has it on the approved list. That kind of friction removal is what actually gets models into production — not another percentage point on a leaderboard.
Four Models, One Confusing Naming Scheme
The lineup initially looks bewildering, but the logic clicks once you see the numbers. Google shipped four variants spanning edge devices to workstations:
| Model | Total Params | Active Params | Context | Type | Arena Score |
|---|---|---|---|---|---|
| E2B | 5.1B | 2.3B | 128K | Dense | — |
| E4B | 8B | 4.5B | 128K | Dense | — |
| 26B A4B | 26B | 3.8B | 256K | MoE | 1441 |
| 31B | 31B | 31B | 256K | Dense | 1452 |
The "E" prefix means "effective": these models carry more total parameters than they use, activating only a fraction during inference. Google is optimizing for phones and edge hardware. The 26B A4B variant achieves the same effect at scale through a different mechanism, mixture-of-experts routing: 26 billion total parameters, but just 3.8 billion active per token. Under the hood it's 128 small experts with 9 activated per forward pass (8 routed plus one shared).
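The routing math is worth a back-of-the-envelope check. This sketch just plugs in the figures quoted above; note that attention, embeddings, and the always-on shared expert count toward the active budget, which is why the overall active fraction comes out higher than the routed-expert fraction alone:

```python
# Figures from the model card numbers quoted above.
TOTAL_PARAMS = 26e9      # total parameters in the 26B A4B
ACTIVE_PARAMS = 3.8e9    # parameters touched per token
NUM_EXPERTS = 128        # routed experts per MoE layer
ACTIVE_EXPERTS = 9       # 8 routed + 1 shared, per forward pass

# Overall fraction of the model that fires per token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"active fraction:  {active_fraction:.1%}")              # 14.6%

# Routed experts alone are an even smaller slice; the gap is the
# always-on machinery (attention, embeddings, shared expert).
print(f"experts per token: {ACTIVE_EXPERTS / NUM_EXPERTS:.1%}")  # 7.0%
```

In other words, roughly one parameter in seven does any work on a given token, which is exactly where the memory and latency savings come from.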
The 31B Dense is the quality heavyweight. It sits at #3 among all open models on the Arena leaderboard and scores 89.2% on AIME 2026. For context, Gemma 3 scored 6.6% on the same benchmark. That's not incremental improvement — that's a generational leap. Whatever the DeepMind team changed between versions, it worked spectacularly.
All four models handle images and video natively with variable resolution support. The smaller E2B and E4B add audio input — speech recognition baked right into the model weights, no separate Whisper pipeline bolted on. For on-device multimodal agents, that's a genuinely useful simplification.
Where It Wins (and Doesn't)
Honestly, calling any single open model "the best" in April 2026 is pointless — the answer is always "depends on your workload." But the competitive picture is interesting.
Qwen 3.5 27B edges out the 31B on MMLU Pro (86.1% vs 85.2%) and GPQA Diamond (85.5% vs 84.3%). Google's model wins decisively on math competitions and competitive programming, with a Codeforces Elo of 2150. Meta's Llama 4 Scout boasts a wild 10-million-token context window but starts at 109B total parameters — firmly server territory.
The parameter efficiency angle is where Google's entry genuinely stands apart. The 26B A4B sitting at #6 on the Arena leaderboard with only 3.8 billion active parameters is borderline absurd. You're getting near-31B quality for a fraction of the memory footprint. On consumer hardware, that ratio is the whole game.
And licensing seals it. Llama 4 restricts applications over 700M monthly active users and mandates "Built with Llama" branding. Both Gemma 4 and Qwen 3.5 are Apache 2.0 with zero strings. For anyone shipping a product, this matters more than half a percentage point on any benchmark.
The Agentic Angle
All four models support native function calling, structured JSON output, and system instructions out of the box. The 26B MoE scored 86.4% on tau2-bench — the standard agentic evaluation — up from single digits in the previous generation.
Google explicitly trained these on Android development workflows, pitching the model as a local coding agent that can refactor legacy Java, scaffold apps, and iterate on fixes. Marketing? Partly. But the function-calling infrastructure is solid and the tool-use pattern works well in practice. If you're building anything that needs a model to decide when to call an API, parse the result, and keep going, the plumbing is there.
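To make that decide-call-parse-continue pattern concrete, here's a minimal sketch of the loop. Everything here is illustrative: the weather tool, its arguments, and `fake_model` (a stub standing in for a real chat call to the served model) are hypothetical, not part of any Gemma API.

```python
# Minimal tool-use loop: the model decides when to call a tool,
# we execute it, feed the result back, and let the model continue.

def get_weather(city: str) -> str:
    """Stub for a real API call."""
    return f"18°C and clear in {city}"

TOOLS = {"get_weather": get_weather}

def fake_model(messages):
    """Pretend model: emits a tool call first, then a final answer."""
    if messages[-1]["role"] == "user":
        return {"tool_call": {"name": "get_weather",
                              "arguments": {"city": "Zurich"}}}
    return {"content": f"It's {messages[-1]['content']} right now."}

def run_agent(question: str) -> str:
    messages = [{"role": "user", "content": question}]
    while True:
        reply = fake_model(messages)
        call = reply.get("tool_call")
        if call is None:                      # model is done; return answer
            return reply["content"]
        result = TOOLS[call["name"]](**call["arguments"])
        messages.append({"role": "tool", "content": result})

print(run_agent("What's the weather in Zurich?"))
# → It's 18°C and clear in Zurich right now.
```

Swap `fake_model` for a real chat completions call and the loop structure stays identical; that's the plumbing the tau2-bench score is measuring.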
Actually Running It
Getting the 26B MoE running locally takes about sixty seconds:
```shell
brew install llama.cpp
llama-server -hf ggml-org/gemma-4-26b-a4b-it-GGUF
```
That spins up an OpenAI-compatible server on localhost:8080. Any tool that speaks the OpenAI chat completions API — your scripts, LangChain, whatever — just works. With Q4 quantization, the 26B MoE fits comfortably in ~8GB of RAM. Most modern laptops handle it without breaking a sweat.
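As a quick smoke test of the "just works" claim, here's a standard-library-only sketch that talks to that server over its chat completions endpoint. The model name and prompt are illustrative placeholders:

```python
import json
import urllib.request

# llama-server's OpenAI-compatible chat completions endpoint.
URL = "http://localhost:8080/v1/chat/completions"

def build_payload(prompt, model="gemma-4-26b-a4b-it", temperature=0.7):
    """Assemble an OpenAI-style chat completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def chat(prompt: str) -> str:
    """POST a prompt to the local server and return the reply text."""
    req = urllib.request.Request(
        URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# With the server running: chat("Summarize MoE routing in one sentence.")
```

Because the wire format is the standard chat completions schema, pointing any OpenAI-client-based tool at `localhost:8080` works the same way.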
On Apple Silicon, the MLX path pairs nicely with TurboQuant (Google's own KV cache compression technique that debuted at ICLR 2026) for a 4x reduction in active memory. I covered TurboQuant in Saturday's post — the two technologies complement each other perfectly here.
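A rough sizing formula shows why the KV cache is the thing worth compressing at 256K context. The formula is the standard one (keys plus values, per layer, per KV head); the architecture numbers plugged in below are purely illustrative placeholders, not Gemma 4's actual dimensions:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_val=2):
    """KV cache size: keys + values (the leading 2), one pair of
    vectors per layer per KV head per token, bytes_per_val=2 for fp16."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_val

# Illustrative architecture, NOT Gemma 4's real dimensions.
full = kv_cache_bytes(layers=48, kv_heads=8, head_dim=128, seq_len=256_000)
print(f"fp16 cache at 256K tokens: {full / 2**30:.1f} GiB")  # 46.9 GiB
print(f"with 4x compression:       {full / 4 / 2**30:.1f} GiB")  # 11.7 GiB
```

At long context the cache dwarfs the quantized weights themselves, which is why a 4x reduction in active memory changes what fits on a laptop.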
Six Labs, One Ecosystem
Zoom out and the bigger picture is striking. Six major labs now ship competitive open-weight models: Google, Alibaba (Qwen), Meta (Llama), Mistral, OpenAI (gpt-oss-120b), and Zhipu AI (GLM-5). The gap between open and closed keeps compressing. The Gemma 4 family isn't necessarily the best open model across every axis — Qwen 3.5 beats it on several reasoning benchmarks, and your mileage varies by task. But the combination of competitive quality, a genuinely permissive license, multimodal capabilities all the way down to phone-sized models, and strong agentic support makes it arguably the most deployable open model family available right now.
The model your legal team won't block is the model that ships.