NVIDIA dropped Nemotron 3 Super a few weeks ago and it flew under the radar — buried by the Mythos leak drama and GPT-5.4's benchmark parade. That's unfortunate, because this model might matter more for people actually shipping agentic systems than either of those headline-grabbers.

The core problem it targets is straightforward: agents are throughput-bound. A single task might chain 30 to 50 model calls — tool use, reflection, planning, code generation, validation. If each inference round costs you two seconds, your agent takes two minutes on something trivial. Users won't tolerate that. Nemotron 3 Super is NVIDIA's answer: a 120-billion-parameter model that only activates 12 billion parameters per token, processes sequences in linear time through Mamba layers, and delivers 5x the throughput of its predecessor. All of it open-weight.

The Architecture Is Genuinely Novel

Nemotron 3 Super isn't another Transformer with a MoE bolt-on. It interleaves three distinct layer types in a repeating block pattern: Mamba-2 → Latent MoE → Mamba-2 → Attention → Mamba-2 → Latent MoE.
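The repeating block can be sketched as a simple layer schedule. This is an illustrative sketch only; the layer names and `build_layers` helper are placeholders, not NVIDIA's actual config keys.

```python
# Illustrative six-layer hybrid block; names are placeholders, not
# NVIDIA's actual configuration keys.
BLOCK = ["mamba2", "latent_moe", "mamba2", "attention", "mamba2", "latent_moe"]

def build_layers(n_blocks):
    """Tile the six-layer block pattern across the depth of the network."""
    return BLOCK * n_blocks

layers = build_layers(8)  # a 48-layer stack
attn_share = layers.count("attention") / len(layers)
print(f"{attn_share:.0%} of layers are attention")  # 17% of layers are attention
```

The point the schedule makes visible: only one layer in six pays the quadratic attention cost; everything else runs in linear time.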

The Mamba-2 layers handle the bulk of sequence processing. Unlike Transformer self-attention — which scales quadratically with sequence length — Mamba stores a fixed-size state that gets updated token by token. Linear time complexity. This is what makes the native million-token context window feasible without burning through your GPU budget on KV cache alone. For agentic workloads where the model needs to hold an entire codebase or long conversation history in context, this is a fundamental advantage over pure Transformer architectures.
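The fixed-size-state idea can be shown with a minimal state-space recurrence. This is a toy scan with a diagonal transition, not Mamba-2's actual selective-scan kernel; the shapes and parameter names are illustrative.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Toy linear-time state-space recurrence (not Mamba-2's real kernel).
    The state h has a fixed size regardless of sequence length, so memory
    stays constant where attention's KV cache would grow with every token."""
    seq_len, d_model = x.shape
    d_state = A.shape[0]
    h = np.zeros((d_state, d_model))  # fixed-size state, independent of seq_len
    y = np.empty_like(x)
    for t in range(seq_len):
        # h_t = A * h_{t-1} + B * x_t  (diagonal transition for simplicity)
        h = A[:, None] * h + B[:, None] * x[t][None, :]
        y[t] = C @ h                   # read out: y_t = C h_t
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal((1024, 8))     # 1024 tokens, 8 model dims
A = np.full(16, 0.9)                   # per-channel decay
B = rng.standard_normal(16) * 0.1
C = rng.standard_normal(16) * 0.1

y = ssm_scan(x, A, B, C)
print(y.shape)  # (1024, 8)
```

Whether the sequence is 1K or 1M tokens, `h` stays a 16x8 array; that constant memory footprint is the property the paragraph above is describing.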

Sparse Transformer attention layers are interleaved at key depths for what NVIDIA calls "precision reasoning" — the associative recall where you need to connect a detail from paragraph three to a question in paragraph four thousand. Mamba handles the sequential flow; attention handles the cross-referencing. The insight is that you don't need attention everywhere, just at the critical junctures where long-range dependencies actually matter.

The training story is equally ambitious. NVIDIA fed the model 25 trillion tokens during pretraining, with 10 trillion unique curated tokens. Post-training involved 1.2 million environment rollouts across 21 configurations using NeMo Gym, training on real multi-step agentic tasks — tool-calling, code generation, multi-part planning — not just chat completion. They also released an agentic safety dataset of roughly 11,000 workflow traces, a resource I haven't seen from other labs.

Latent MoE Is the Clever Bit

Standard MoE models route tokens to experts that operate at the full model dimension. Every expert does computation in a massive space, which limits how many experts you can realistically run.

Nemotron's Latent MoE does something different: it compresses token embeddings into a low-rank latent space before routing to experts, runs the expert computation in that compressed space, then projects back to full dimension. The result is 4x as many expert specialists for the same inference cost. The model develops highly specific experts — one that's particularly sharp on Python syntax, another handling SQL logic, a third tuned for structured JSON generation.
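The compress-route-project flow can be sketched in a few lines. This is a minimal single-token illustration under assumed dimensions (512-dim model, 128-dim latent, 16 experts, top-2 routing); the weight names and routing details are my placeholders, not NVIDIA's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, n_experts, top_k = 512, 128, 16, 2

W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)   # compress
W_up   = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_latent)  # project back
W_gate = rng.standard_normal((d_model, n_experts)) / np.sqrt(d_model)  # router
experts = rng.standard_normal((n_experts, d_latent, d_latent)) / np.sqrt(d_latent)

def latent_moe(x):
    """One token through a latent-space MoE sketch:
    compress -> route to top-k experts -> expert compute in latent dim -> decompress.
    Each expert matmul is d_latent x d_latent (128^2) instead of d_model x d_model
    (512^2), which is the headroom that pays for many more experts."""
    z = x @ W_down                        # (d_latent,) compressed token
    logits = x @ W_gate
    top = np.argsort(logits)[-top_k:]     # indices of the top-k experts
    gates = np.exp(logits[top])
    gates /= gates.sum()                  # softmax over the selected experts
    out = sum(g * np.maximum(z @ experts[e], 0.0) for g, e in zip(gates, top))
    return out @ W_up                     # back to full model dimension

x = rng.standard_normal(d_model)
y = latent_moe(x)
print(y.shape)  # (512,)
```

At a 4x compression ratio the per-expert matmul shrinks by roughly 16x, which is consistent with running several times as many experts for the same budget.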

NVIDIA claims that without this compression trick, the model would need to be 35x larger to hit the same accuracy. I can't independently verify that number, but the benchmark results suggest the approach works: 85.6% on PinchBench and competitive accuracy against both GPT-OSS 120B and Qwen3.5-122B.

The other throughput trick is Multi-Token Prediction — predicting several future tokens in a single forward pass. This doubles as built-in speculative decoding during inference, no separate draft model needed. NVIDIA reports 3x wall-clock speedups on structured generation, which matters hugely for agents spending half their time emitting JSON tool calls.
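The draft-then-verify loop behind self-speculation can be shown with a toy model. This is an illustrative sketch, not NVIDIA's implementation: `propose_k` stands in for the multi-token heads, `verify_next` for the main LM head, and both are replaced with a deterministic toy sequence so the loop is easy to trace.

```python
def speculative_decode(verify_next, propose_k, prompt, k, max_new):
    """Toy self-speculative decoding loop (illustrative only).
    MTP heads propose k tokens in one forward pass; a verification pass keeps
    the longest prefix the main head agrees with, so a fully accepted draft
    costs one forward pass instead of k."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        draft = propose_k(out, k)         # k draft tokens from one forward pass
        accepted = []
        for tok in draft:
            if verify_next(out + accepted) == tok:
                accepted.append(tok)      # draft matches the main head: keep it
            else:
                accepted.append(verify_next(out + accepted))  # correction token
                break
        out.extend(accepted)
    return out[:len(prompt) + max_new]

# Toy stand-ins: the "model" just counts mod 10, and the draft heads agree.
verify = lambda seq: (seq[-1] + 1) % 10
draft = lambda seq, k: [(seq[-1] + i) % 10 for i in range(1, k + 1)]

print(speculative_decode(verify, draft, [3], k=4, max_new=8))
# [3, 4, 5, 6, 7, 8, 9, 0, 1]
```

With a real model the drafts are only sometimes accepted, but the asymmetry holds: verification is one batched pass over k positions, which is where the reported speedups on predictable output like JSON come from.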

How It Stacks Up

Model              Total Params   Active Params   Context       License
Nemotron 3 Super   120B           12B             1M tokens     NVIDIA Open
Mistral Small 4    119B           6B              256K tokens   Apache 2.0
Qwen3.5-122B       122B           ~10B            128K tokens   Qwen License
DeepSeek V4        1T+            ~37B            128K tokens   DeepSeek License

Raw parameters aside, the throughput gap is the differentiator. The Mamba backbone avoids quadratic attention costs, and the latent compression keeps expert computation cheap. For a workload where you're making dozens of sequential calls per task, that compounds fast.
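Back-of-envelope arithmetic makes the compounding concrete. The call count and per-call latency here are illustrative numbers of my choosing; only the 5x figure comes from NVIDIA's claim.

```python
calls_per_task = 40      # illustrative agent loop: tool use, reflection, planning
baseline_latency = 2.0   # seconds per call, illustrative
speedup = 5              # NVIDIA's claimed throughput gain

baseline = calls_per_task * baseline_latency
faster = baseline / speedup
print(f"{baseline:.0f} s -> {faster:.0f} s per task")  # 80 s -> 16 s per task
```

A per-call saving that looks modest in isolation turns a coffee-break wait into something interactive once it multiplies across forty sequential calls.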

Running It

The model is on Hugging Face now, plus NVIDIA NIM and inference providers like OpenRouter, Fireworks, and DeepInfra. Self-hosting targets Blackwell GPUs with native NVFP4 — 4x speedup on B200 versus FP8 on H100. For most folks, the deployment path is vLLM with continuous batching, or SGLang if you're orchestrating multi-agent tool-calling. Cookbooks for both are published.
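For the vLLM path, a minimal self-hosted setup looks like the sketch below. The model ID is a placeholder I've assumed — check the actual Hugging Face repo name before using it — and the context length and parallelism are example values to tune for your hardware.

```shell
pip install vllm

# Launch an OpenAI-compatible server; vLLM uses continuous batching by default.
# "nvidia/Nemotron-3-Super" is an assumed model ID -- verify the real repo name.
vllm serve nvidia/Nemotron-3-Super \
    --max-model-len 131072 \
    --tensor-parallel-size 2
```

From there, any OpenAI-compatible client can point at the server's `/v1` endpoint, which is what makes it a drop-in target for existing agent frameworks.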

The Honest Assessment

The openness story is real and surprisingly thorough. Beyond weights, NVIDIA published the full training recipes, the 10-trillion-token pretraining corpus, 40 million post-training samples, 37 RL datasets, and evaluation suites. You can reproduce the training end-to-end, which sets a bar that most "open" releases don't clear.

But there are things to watch. The NVIDIA Open Model License is not Apache 2.0 — read the fine print if you're building a commercial product. Mistral Small 4 ships under Apache and only activates 6 billion parameters, which might be the better choice if licensing matters more than context length. The million-token context window is impressive on paper, but real-world retrieval accuracy at extreme lengths is still unproven across the field. And there's the ecosystem question: a model built by NVIDIA, optimized for NVIDIA GPUs, trained with NVIDIA tools, deployed through NVIDIA NIM. The openness is genuine, but the gravitational pull toward their stack is strong. If you're already running H100s or B200s, this is an obvious evaluation target. If you're on AMD or building for heterogeneous infrastructure, factor that in.

The biggest takeaway for me is the architectural bet. While everyone else is scaling Transformers with MoE bolted on top, NVIDIA went hybrid — and the throughput numbers suggest they're onto something. For agentic workloads specifically, where you care about calls-per-minute more than peak single-call intelligence, this design philosophy might prove more important than any benchmark score.

A model that reasons at 12 billion active parameters with a million-token window and 5x throughput. That's the kind of thing that changes what agents can realistically do in production.