Google DeepMind released Gemma 4 at the start of April, and the AI community immediately zeroed in on the benchmark charts. Fair enough — a 31B model hitting expert-level competitive programming scores and 89% on AIME is hard to ignore. But after two weeks of watching how people actually use this family, I think the real story isn't the big model. It's the tiny one that squeezes a functional AI agent into 1.5 gigs of RAM on a Raspberry Pi.

The Lineup

Gemma 4 ships as four models, each built for a different tier of hardware.

The E2B is the ultralight — under 1.5GB of memory, targeting phones and single-board computers. The E4B adds native audio input, image understanding, and function calling while staying under 4 billion effective parameters. The 26B MoE is a Mixture-of-Experts model: it loads all 26 billion parameters into memory but only activates 4 billion per token, making it the efficiency play for server-side inference. And the 31B dense is the raw performance flagship, currently sitting at #3 on Arena AI's open-model text leaderboard.
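To make the MoE trade-off concrete, here's a back-of-envelope sketch (my own arithmetic, not Google's published numbers) comparing weight memory against per-token compute, assuming 4-bit quantized weights and the usual ~2 FLOPs per active parameter per token:

```python
def model_costs(total_params_b, active_params_b, bits_per_weight=4):
    """Rough weight memory (GB) and per-token compute (GFLOPs)."""
    mem_gb = total_params_b * 1e9 * bits_per_weight / 8 / 1e9
    gflops_per_token = 2 * active_params_b  # ~2 FLOPs per active weight
    return mem_gb, gflops_per_token

moe_mem, moe_flops = model_costs(26, 4)       # 26B loaded, 4B active
dense_mem, dense_flops = model_costs(31, 31)  # 31B dense flagship

print(f"26B MoE  : {moe_mem:.1f} GB weights, {moe_flops:.0f} GFLOPs/token")
print(f"31B dense: {dense_mem:.1f} GB weights, {dense_flops:.0f} GFLOPs/token")
```

The MoE pays nearly the dense model's memory bill (13 GB vs 15.5 GB at 4-bit) but does roughly an eighth of the arithmetic per token, which is exactly why it's the server-side efficiency play rather than the edge play.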

What ties the family together is native structured tool use across every variant. Function calling, constrained JSON output, system instructions — none of this requires a fine-tune or a clever prompt wrapper. It works out of the box. That's what elevates the smaller models from "impressive tech demo" to "actually deployable in production."
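To ground what "structured tool use" means in practice: you hand the model a JSON tool schema, and it replies with a parseable call object instead of free text. The schema format and function names below are illustrative (not Gemma's actual API surface); the validation side is plain stdlib:

```python
import json

# An illustrative tool schema in the common JSON-Schema-parameters style.
WEATHER_TOOL = {
    "name": "get_weather",
    "description": "Look up current conditions for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def parse_tool_call(raw: str, schema: dict) -> dict:
    """Parse and minimally validate a model's structured tool-call reply."""
    call = json.loads(raw)
    if call.get("name") != schema["name"]:
        raise ValueError(f"unexpected tool: {call.get('name')}")
    args = call.get("arguments", {})
    missing = [k for k in schema["parameters"]["required"] if k not in args]
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    return args

# With constrained output, the model emits something like this verbatim:
reply = '{"name": "get_weather", "arguments": {"city": "Nairobi"}}'
print(parse_tool_call(reply, WEATHER_TOOL))  # → {'city': 'Nairobi'}
```

The point of native constrained output is that the `json.loads` call above never sees markdown fences, apologies, or trailing chatter; the parse either succeeds or fails cleanly.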

The Benchmark Picture

| Model | Active Params | AIME 2026 | Codeforces Elo | MMLU Pro | License |
|---|---|---|---|---|---|
| Gemma 4 31B | 31B | 89.2% | 2150 | 85.2% | Apache 2.0 |
| Gemma 4 26B MoE | 4B | — | — | — | Apache 2.0 |
| Qwen 3.5 27B | 27B | ~49%* | — | 86.1% | Apache 2.0 |
| Llama 4 Scout | varies | — | — | — | Custom |

*Qwen 3.5 score is on AIME 2025.

The competitive programming jump deserves its own paragraph. Gemma 3's Codeforces Elo was 110 — it could barely solve easy warm-up problems. The 31B Gemma 4 lands at 2150. That's expert territory, a rating most human competitive programmers never reach. AIME math went from 20.8% to 89.2% in a single generation. Whatever Google's team changed internally between these versions, the results speak for themselves.

But context matters. Qwen 3.5 still leads on general knowledge benchmarks — 86.1% vs 85.2% on MMLU Pro, 85.5% vs 84.3% on GPQA Diamond. It also supports 201 languages versus Gemma's 140, which is significant if you're building for non-English markets. And if you need to ingest an entire monorepo in one prompt, Llama 4 Scout's 10-million-token context window makes Gemma's 256K look quaint.

For code and math-heavy workloads at this parameter count, though, the 31B model owns the category.

Agents Without the Cloud

Here's the part I actually wanted to write about. Buried in Google's developer blog post are the edge inference numbers that nobody's discussing: the E2B pushes 133 prefill tokens per second and 7.6 decode tokens per second on a Raspberry Pi 5. No GPU — just the ARM CPU doing its thing. On Qualcomm's Dragonwing IQ8 NPU, the E4B hits 3,700 prefill and 31 decode tokens/s.

Those are real-time agent speeds. Fast enough for the model to receive a request, decide which tool to invoke, execute the function call, and return a structured response before a user gets impatient.
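The arithmetic behind "fast enough" is worth spelling out. Here's a worked example using the Pi 5 numbers above; the prompt and response lengths are my assumptions, not figures from Google's post:

```python
def turn_latency(prompt_tokens, response_tokens, prefill_tps, decode_tps):
    """Seconds from request to full response, ignoring tool execution time."""
    return prompt_tokens / prefill_tps + response_tokens / decode_tps

# Raspberry Pi 5, Gemma 4 E2B: 133 prefill tok/s, 7.6 decode tok/s.
# Assume a 400-token prompt (system + tool schemas + query) and a
# 40-token structured tool call as the response.
t = turn_latency(400, 40, 133, 7.6)
print(f"{t:.1f} s")  # → 8.3 s
```

Around eight seconds for a full turn is slow by cloud standards but entirely workable for a kiosk or field device, and the short, schema-constrained tool-call outputs are what keep the decode cost down.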

What does this look like in practice? A field service technician's ruggedized tablet running equipment diagnostics through a structured agent pipeline — completely offline, no proprietary schematics leaving the device. A retail kiosk handling returns by calling inventory and refund APIs locally, with zero cloud dependency. An agricultural sensor hub making irrigation decisions through multi-step reasoning at the edge of a field with no cell signal.
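A minimal sketch of what such an on-device agent turn looks like, with the model stubbed out and hypothetical local tools — the tool names and the `run_model` stub are mine, not anything from Gemma's docs:

```python
import json

# Hypothetical local tools: in the kiosk scenario these would wrap
# on-device inventory and refund databases, never touching the network.
def check_inventory(sku: str) -> dict:
    return {"sku": sku, "in_stock": True}

def issue_refund(order_id: str) -> dict:
    return {"order_id": order_id, "status": "refunded"}

TOOLS = {"check_inventory": check_inventory, "issue_refund": issue_refund}

def run_model(user_request: str) -> str:
    """Stub standing in for the on-device model's structured output."""
    return '{"name": "issue_refund", "arguments": {"order_id": "A-1042"}}'

def handle(user_request: str) -> dict:
    """One agent turn: the model picks a tool, we execute it locally."""
    call = json.loads(run_model(user_request))
    tool = TOOLS[call["name"]]
    return tool(**call["arguments"])

print(handle("I want to return order A-1042"))
```

Everything in that loop — inference, tool dispatch, execution — stays on the device, which is the whole point of the offline scenarios above.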

A year ago, running these scenarios required either a cloud round-trip or accepting a model so limited it was barely useful. Gemma 4's E4B — packing vision, audio comprehension, and structured tool calling into under 4B effective parameters — sits in a space no other open model currently occupies. Google also shipped platform support across an unusually wide surface: Android AICore, iOS, Linux, Windows, macOS via Metal, and browser execution through WebGPU. The litert-lm CLI gives you interactive tool-calling chat with any Gemma 4 variant after a single pip install, no boilerplate required.

My Honest Shorthand for April 2026

Choosing an open model right now comes down to what you need most. Broad multilingual coverage and factual recall? Qwen 3.5. A context window measured in millions of tokens? Llama 4 Scout. Strong reasoning and code generation with the flexibility to deploy anywhere from a Pi to a cloud GPU cluster? Gemma 4, and the Apache 2.0 license means no usage caps or MAU restrictions like Llama's 700-million-user threshold hanging over your roadmap.

The real shift this month isn't about which model tops a leaderboard. It's that running a capable AI agent no longer requires an API key or a rack of GPUs — just a fifty-dollar board and a power cable.