Everyone's obsessed with model quality right now — Muse Spark benchmarks, GPT-5.4 reasoning scores, who tops the leaderboard this week. Meanwhile, the people actually running these models in production are staring at their inference bills and wondering when someone's going to fix the real bottleneck: the KV cache eating all their GPU memory.
Google Research might have just done it. TurboQuant, published in late March and heading to ICLR 2026 in Rio later this month, compresses the key-value cache down to 3–4 bits per element — a roughly 4–5x reduction — without any retraining, fine-tuning, or calibration data. On LongBench at 3.5 bits, it scores identically to the full FP16 baseline. Not "close to." Identically: 50.06 both ways.
The Rotation Trick That Makes It Work
Most cache quantization methods have the same problem: you slam values into fewer bits, and the outliers wreck everything. Some coordinate dimensions carry disproportionate energy, and naive rounding crushes them into noise.
TurboQuant sidesteps this with a two-stage pipeline that's genuinely clever.
Stage one applies a random orthogonal rotation — specifically a Hadamard transform at O(d log d) complexity — to every vector before quantizing. This spreads energy uniformly across all coordinates. No more outlier dimensions dominating the signal. And because the rotation is known in advance, you can precompute mathematically optimal quantization buckets via the Lloyd-Max algorithm. No training data needed. No model-specific tuning. The post-rotation distribution is predictable, so the codebook is pure math.
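To make the rotation concrete, here's a minimal sketch of a randomized Hadamard transform: a random diagonal of ±1 signs followed by a fast Walsh-Hadamard pass. The function name, dimensions, and the single-outlier test vector are illustrative, not from the paper.

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform, O(d log d). Requires d to be a power of 2."""
    x = x.copy()
    d = len(x)
    h = 1
    while h < d:
        for i in range(0, d, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x / np.sqrt(d)  # orthonormal scaling, so norms are preserved

rng = np.random.default_rng(0)
d = 64
signs = rng.choice([-1.0, 1.0], size=d)  # random diagonal D: the "random" in the rotation

v = np.zeros(d)
v[3] = 10.0                    # one outlier coordinate carries all the energy
rotated = fwht(signs * v)      # apply H·D·v

print(np.linalg.norm(v), np.linalg.norm(rotated))       # identical: rotation is orthogonal
print(np.abs(rotated).max(), np.abs(rotated).mean())    # energy now spread across coordinates
```

After the rotation, every coordinate of the outlier vector has the same magnitude, which is exactly the shape a fixed, precomputed codebook can handle.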
Stage two handles residual error with QJL, a Johnson-Lindenstrauss projection that collapses each residual down to a single sign bit per dimension. One bit. The core insight: you don't need to reconstruct the residual perfectly. You just need unbiased inner products when computing attention. A carefully designed estimator balances the high-precision query against the low-precision compressed data, preserving what matters for attention scores while discarding what doesn't.
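The sign-bit estimator can be sketched in a few lines. Caveats: this uses a plain Gaussian projection rather than the paper's exact construction, and the sketch dimension `m` is deliberately inflated here (the paper's setting is one bit per dimension, i.e. m on the order of d) just so the concentration is visible in a single trial.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 64, 4096                    # m inflated for the demo to shrink estimator variance
S = rng.standard_normal((m, d))    # shared random projection

k = rng.standard_normal(d)         # a residual vector to compress
q = rng.standard_normal(d)         # the high-precision query

# Compress: one sign bit per sketch row, plus the residual's norm.
bits = np.sign(S @ k)
norm_k = np.linalg.norm(k)

# Unbiased inner-product estimate, using E[sign(Sk)·(Sq)] = sqrt(2/pi)·<q,k>/||k||
est = norm_k * np.sqrt(np.pi / 2) / m * (bits @ (S @ q))

print(q @ k, est)   # the estimate concentrates around the true inner product
```

The point is exactly what the text says: the stored bits can't reconstruct `k`, but the estimator is unbiased for `⟨q, k⟩`, which is the only quantity attention actually needs.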
The end result is 3.5-bit keys and 2-bit values, with zero memory overhead from quantization constants. Competing methods need per-channel scales and zero-points that eat into your savings. The rotation-based approach eliminates that bookkeeping entirely.
The Numbers
On Llama-3.1-8B-Instruct, the results are hard to argue with:
| Configuration | KV Bits | LongBench Avg | Needle-in-Haystack | Compression |
|---|---|---|---|---|
| Full Precision | 16 | 50.06 | 0.997 | 1× |
| TurboQuant | 3.5 | 50.06 | 0.997 | ~4.6× |
| TurboQuant | 2.5 | 49.44 | 0.997 | ~7.1× |
At 3.5 bits, you get nearly 5x compression for effectively free. Needle-in-haystack retrieval stays perfect. At 2.5 bits you lose 0.6 points on LongBench — barely measurable — while hitting 7x compression. For many production workloads, that trade is a no-brainer.
The speed story matters too. On an H100, the 4-bit variant delivers up to 8x throughput improvement on attention logit computation versus 32-bit keys (roughly 4x over the FP16 baseline most production systems actually use). One developer reported fitting 4 million tokens in the cache on an NVIDIA GB10 — a device the size of a Mac Mini. Another data point: a 104B parameter model at 128K context running at 74GB peak memory on Apple Silicon.
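To put those capacity numbers in perspective, here's the back-of-envelope KV-cache arithmetic. The Llama-3.1-8B shape figures (32 layers, 8 KV heads via GQA, head dim 128) are assumed from the public model card; verify against your model's config before relying on them.

```python
# Assumed Llama-3.1-8B geometry: 32 layers, 8 KV heads (GQA), head dim 128.
layers, kv_heads, head_dim = 32, 8, 128
elems_per_token = 2 * layers * kv_heads * head_dim  # keys + values

def cache_gib(tokens, bits_per_elem):
    """KV cache size in GiB, ignoring rotation/metadata overhead."""
    return tokens * elems_per_token * bits_per_elem / 8 / 2**30

ctx = 128_000
print(f"FP16 @ 128K context:       {cache_gib(ctx, 16):.2f} GiB")
print(f"TurboQuant 3.5b @ 128K:    {cache_gib(ctx, 3.5):.2f} GiB")
```

At 16 bits the cache alone is ~15.6 GiB at 128K context for this geometry; at 3.5 bits it drops to ~3.4 GiB, which is how multi-million-token caches start fitting on small devices.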
The Open-Source Sprint
Google published the paper. The community started shipping implementations immediately, and there are now multiple repos with varying levels of production-readiness.
The most complete is 0xSero's implementation targeting vLLM 0.18.0 with Triton kernels. Real benchmarks on RTX 5090 running Qwen3.5-27B-AWQ show 2x token capacity and 30GB of freed VRAM across 4 GPUs. A vLLM feature request is already open in the main project — upstream support is a matter of when, not if.
If you want to try the algorithm today:
```bash
git clone https://github.com/OnlyTerp/turboquant.git
cd turboquant
pip install -e .
python src/demo.py             # synthetic benchmark, no GPU required
python src/test_real_model.py  # validates on TinyLlama or Nemotron-Nano
```
The OnlyTerp repo is the better starting point if you want to understand the math before deploying anything. There's also Quansloth, a fork specifically targeting air-gapped local inference on consumer hardware — useful if you're running models offline and every gigabyte counts.
Where This Gets Complicated
I should be honest about what's still rough.
Value quantization at 2 bits is the weak link. Key compression hits cosine similarity near 1.0, but 2-bit values drop to 0.940. Bumping values to 4 bits recovers quality to 0.997, at the cost of a lower overall compression ratio. If you're deploying this, 3.5-bit keys with 4-bit values is probably the production sweet spot until the value path improves.
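You can see the bit-width/quality trend with a toy scalar quantizer. This uses uniform levels rather than the paper's Lloyd-Max codebooks, and synthetic Gaussian data standing in for post-rotation coordinates, so it only illustrates the direction of the tradeoff, not the paper's exact numbers.

```python
import numpy as np

def quantize(v, bits, lim=3.0):
    # Toy midpoint uniform quantizer over [-lim, lim]; the paper instead
    # precomputes optimal Lloyd-Max codebooks for the post-rotation distribution.
    levels = 2 ** bits
    step = 2 * lim / levels
    idx = np.clip(np.floor((v + lim) / step), 0, levels - 1)
    return -lim + (idx + 0.5) * step

rng = np.random.default_rng(2)
v = rng.standard_normal(4096)   # rotation makes coordinates roughly Gaussian

cos_by_bits = {}
for bits in (2, 4):
    q = quantize(v, bits)
    cos_by_bits[bits] = v @ q / (np.linalg.norm(v) * np.linalg.norm(q))
    print(f"{bits}-bit cosine similarity: {cos_by_bits[bits]:.3f}")
```

Even this crude quantizer shows the gap: 2-bit reconstruction loses noticeably more cosine similarity than 4-bit, mirroring the 0.940-versus-0.997 split reported for the value path.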
Current implementations also decompress back to float32 on every decode step, so you're trading memory savings for some extra compute. The fused Triton kernels that could eliminate that overhead exist in various repos but remain flagged as experimental. And those "5x compression" claims in the READMEs don't always include the rotation matrices and metadata — the honest figure is closer to 4.6x at short contexts, only hitting 5x past 32K tokens.
None of that undermines the core result. A zero-training method that works on any transformer and delivers 4–5x memory savings with negligible quality loss is exactly the kind of unglamorous infrastructure improvement that actually changes the economics of serving long-context models. The frontier model wars get the headlines. Compression algorithms get the cost reductions.
TechCrunch already called it "Pied Piper". I'd say the comparison holds up better than most Silicon Valley references do.