Google dropped Gemma 4 on Wednesday — four open-weight models under a genuine Apache 2.0 license, built from the same research behind Gemini 3. The headline numbers are impressive, but what caught my attention is the architecture. This isn't your typical scaled-up transformer.
A Transformer, But Not Really
The family comes in four sizes: E2B and E4B (edge models using a technique called Per-Layer Embeddings), a 26B Mixture-of-Experts variant with roughly 4B active parameters, and a 31B dense model. One researcher on Hacker News called it a "galaxybrained architecture," and honestly, that tracks.
Instead of uniform attention across all layers, Gemma 4 alternates between local sliding-window attention (512–1024 tokens) and global full-context attention. The ratio on the 31B dense model is 5:1 — five local layers for every global one. You sacrifice some global awareness per layer but massively cut compute. It's the kind of design that only makes sense if you've committed to long contexts and local deployment as first-class priorities.
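To make the 5:1 interleave concrete, here's a minimal sketch of the schedule. The layer count and the exact placement of global layers are my assumptions for illustration, not confirmed Gemma 4 internals:

```python
def layer_schedule(n_layers: int, ratio: int = 5):
    """Return 'local' or 'global' per layer: every (ratio+1)-th
    layer gets full-context attention, the rest use a sliding window."""
    return [
        "global" if (i + 1) % (ratio + 1) == 0 else "local"
        for i in range(n_layers)
    ]

# 12 layers -> 10 local, 2 global (the 5:1 ratio from the paper).
print(layer_schedule(12))

# The payoff: with a 512-token window, a local layer's attention cost is
# O(seq_len * 512) instead of the O(seq_len**2) a global layer pays.
```

At a 256K context, that quadratic-to-linear swap on five of every six layers is most of the compute win.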
Then there's the positional encoding. Standard RoPE for sliding-window layers, proportional RoPE with a different theta for global layers — "Dual RoPE," as Google calls it. This is what lets the larger models handle 256K context windows without the usual quality degradation at long distances. The edge models top out at 128K, still generous for on-device work.
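The intuition behind two thetas: theta sets how slowly the lowest-frequency rotation turns, which caps the longest distance the encoding can discriminate. A sketch with illustrative values (10,000 is the classic RoPE base; the global-layer theta here is a guess in the spirit of long-context models, not Google's published number):

```python
import math

def rope_inv_freq(head_dim: int, theta: float):
    """Per-pair inverse frequencies for rotary embeddings:
    inv_freq[j] = theta ** (-2j / head_dim)."""
    return [theta ** (-2 * j / head_dim) for j in range(head_dim // 2)]

# Assumed thetas for illustration: small base for local layers,
# a much larger one for the global, long-range layers.
local_f = rope_inv_freq(128, 10_000.0)
global_f = rope_inv_freq(128, 1_000_000.0)

# The slowest rotation bounds usable range: wavelength = 2*pi / inv_freq.
print(f"local max wavelength:  {2 * math.pi / local_f[-1]:,.0f} tokens")
print(f"global max wavelength: {2 * math.pi / global_f[-1]:,.0f} tokens")
```

A small theta is fine inside a 512–1024-token window; the global layers need wavelengths that stay distinguishable out to 256K tokens, hence the larger base.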
The vision encoder uses a learned 2D position encoder with multidimensional RoPE that preserves original aspect ratios. You can configure the token budget per image anywhere from 70 to 1,120 tokens — a practical knob for balancing quality against latency on constrained hardware. Audio comes from a USM-style conformer handling speech recognition and translation for up to 30 seconds of input.
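A rough feel for what that per-image knob buys you, assuming a hypothetical edge prefill rate (the 130 tok/s figure is an assumption for illustration):

```python
def image_token_budget(requested: int, lo: int = 70, hi: int = 1120) -> int:
    """Clamp a requested per-image token budget to the supported range."""
    return max(lo, min(hi, requested))

# At an assumed ~130 tok/s prefill on edge hardware, the knob spans
# roughly half a second to several seconds of prefill per image:
for budget in (70, 448, 1120):
    print(budget, f"~{image_token_budget(budget) / 130:.1f}s prefill")
```

A 16x spread between the floor and ceiling is the difference between a snappy thumbnail triage pass and a slow, detail-preserving OCR pass.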
The final layers reuse key/value tensors from earlier in the stack, shaving off memory and compute during inference. Not one revolutionary idea — a dozen smart choices stacked together.
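A back-of-envelope for what KV reuse saves. All shapes here are illustrative assumptions (not confirmed Gemma 4 dimensions), and the model is simplified: it treats every layer as caching the full context, whereas the sliding-window layers above would only cache their window:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, shared_layers=0):
    """KV cache = 2 tensors (K and V) * layers that store their own
    cache * heads * head_dim * seq_len * dtype size. Layers that
    reuse an earlier layer's tensors store nothing new."""
    storing = n_layers - shared_layers
    return 2 * storing * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed shapes: 48 layers, 8 KV heads, head_dim 128, fp16 cache,
# 256K context, with the final 6 layers reusing earlier K/V.
base = kv_cache_bytes(48, 8, 128, 256_000)
shared = kv_cache_bytes(48, 8, 128, 256_000, shared_layers=6)
print(f"{base / 2**30:.1f} GiB -> {shared / 2**30:.1f} GiB")
```

The saving scales linearly with how many layers share, and at long contexts the KV cache, not the weights, is often what blows the memory budget.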
The Numbers
| Benchmark | Gemma 3 (27B) | Gemma 4 (31B) | Gemma 4 (26B MoE) |
|---|---|---|---|
| AIME (math) | 20.8% | 89.2% | ~85% |
| GPQA Diamond | ~45% | 85.7% | ~80% |
| LiveCodeBench | ~40% | 84.0% | ~80% |
| BigBench Extra Hard | 19% | 74% | ~70% |
That AIME jump from 20.8% to 89.2% is the single biggest benchmark leap I've seen in one generation of an open model. Sebastian Raschka thinks the architecture is "pretty much unchanged compared to Gemma 3" and that these gains come from training recipe and data improvements. Maybe. The result speaks for itself either way.
What Developers Are Actually Doing With It
Here's where it gets interesting. The 26B-A4B MoE activates about 4 billion parameters per forward pass. On an RTX 4090, developers report roughly 150 tokens per second — that's 50% faster than Qwen 3.5-35B at comparable quality. On an M2 Ultra with Q8_0 quantization, llama.cpp pushes 300 tok/s.
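Those numbers are roughly what a memory-bandwidth roofline predicts. Single-stream decoding is bound by how many weight bytes stream from VRAM per token, and an MoE only streams its active experts. A sketch using the 4090's published ~1,008 GB/s bandwidth (the quantization level is an assumption):

```python
def decode_ceiling_tok_s(active_params, bits_per_weight, mem_bw_gb_s):
    """Rough bandwidth ceiling for single-stream decoding:
    each generated token must stream every active weight once."""
    bytes_per_token = active_params * bits_per_weight / 8
    return mem_bw_gb_s * 1e9 / bytes_per_token

# RTX 4090: ~1,008 GB/s. ~4B active params at 8-bit = ~4 GB per token.
print(f"26B MoE (4B active): ~{decode_ceiling_tok_s(4e9, 8, 1008):.0f} tok/s ceiling")
print(f"31B dense:           ~{decode_ceiling_tok_s(31e9, 8, 1008):.0f} tok/s ceiling")
```

The reported ~150 tok/s sits comfortably under the ~250 tok/s theoretical ceiling, which is about where real kernels land. It also shows why the dense 31B can never match the MoE for interactive decoding on the same card.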
The edge variants go further. E2B runs on a Raspberry Pi 5 at 133 tok/s prefill, fitting in under 1.5GB with 2-bit weights. Someone built a complete document processing pipeline — OCR, translation, and embedding of historical land records from the 1800s — running entirely locally with Unsloth's quantized GGUFs. No cloud dependency, no API costs, full privacy for sensitive documents.
Tooling landed fast. Day-one support from Ollama, llama.cpp, vLLM, LM Studio, Transformers.js, and others. Two minutes to get the 26B running:
```shell
ollama run gemma4:26b
```
Simon Willison's famous pelican-on-a-bicycle SVG test produced what he called "outstanding" results from the 26B — the best output he's seen from any model running on laptop hardware. For Nix scripting, multiple developers say it outperforms the larger Qwen 3.5 models despite the size gap. The HuggingFace team noted they "struggled to find good fine-tuning examples because [the models] are so good out of the box."
Tool Calling: Native but Raw
Gemma 4 has function calling baked in using special control tokens that separate internal reasoning from tool invocations. The model thinks privately, pauses to request a function call, waits for your app to execute it, then continues generating. Google explicitly designed this for agentic workflows at the edge.
In practice, it's rough. Several developers report the 26B hallucinating tool execution rather than properly invoking available functions. One tester watched it confidently produce wrong timestamps while "verifying" against tools it never actually called. If you're wiring this into a production agent loop today, budget time for prompt engineering and output validation. The architecture is sound — the model just needs the community to iron out the kinks that always come with a fresh release.
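Until then, a defensive wrapper goes a long way. Here's a minimal sketch of the validation step: parse the model's requested call, check it against a registry, and execute only functions that actually exist. The JSON wire format and tool names are hypothetical; Gemma 4's actual control-token format differs, and your chat template or serving stack normally extracts the call for you:

```python
import json

# Hypothetical tool registry; a real app wires actual functions here.
TOOLS = {
    "add": lambda a, b: a + b,
}

def handle_tool_call(raw: str):
    """Validate a model-emitted tool call before executing it.
    Assumes the call arrives as JSON: {"name": ..., "arguments": {...}}."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return {"error": "malformed tool call"}
    fn = TOOLS.get(call.get("name"))
    if fn is None:
        # The failure mode reported above: the model "calls" a tool
        # that doesn't exist. Reject rather than let it fake a result.
        return {"error": f"unknown tool: {call.get('name')!r}"}
    try:
        return {"result": fn(**call.get("arguments", {}))}
    except TypeError as e:
        return {"error": f"bad arguments: {e}"}

print(handle_tool_call('{"name": "add", "arguments": {"a": 2, "b": 3}}'))
print(handle_tool_call('{"name": "get_time", "arguments": {}}'))
```

Feeding the error object back to the model as the tool result, instead of silently dropping it, is what keeps the agent loop honest — the model sees that its invocation failed rather than continuing as if it succeeded.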
Apache 2.0 Is the Real Story
Previous Gemma releases shipped under Google's custom license with commercial restrictions. Apache 2.0 changes that completely — irrevocable rights to use, modify, and ship inside commercial products. No royalties, no user caps, no lawyers needed. As VentureBeat argued, the license change might matter more than any benchmark.
A 26B MoE that fits on a single consumer GPU, under a truly permissive license, with native agent capabilities — grab it from HuggingFace and find out what breaks.