Three days ago, screenshots from DeepSeek's gray-scale test started circulating on Weibo. The interface showed something new: a mode switcher with Fast, Expert, and Vision options — a clear signal that V4 is being stress-tested in production. The reported specs are eye-popping (1 trillion MoE parameters, 1M context window, 80%+ SWE-bench), but honestly? The number I keep coming back to is 97%.

That's the claimed Needle-in-a-Haystack accuracy at one million tokens. If true, it's the Engram memory architecture underneath V4 that deserves your attention, not the parameter count.

What Engram Actually Does

Standard transformer attention falls apart at extreme context lengths. You've probably seen this yourself — feed a 128K-token document into even the best frontier models and watch retrieval accuracy crater past the 80K mark. The mechanism just can't maintain consistent lookup quality across the entire window.

DeepSeek's Engram is a conditional memory system that sits alongside the regular attention layers. Instead of forcing every token to attend to every other token (quadratic cost, diminishing returns), it selectively stores and retrieves information based on relevance signals. Think of it less like expanding a model's "working memory" and more like giving it an indexed reference shelf.
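To make the "indexed reference shelf" idea concrete, here's a toy sketch of relevance-gated memory retrieval. This is illustrative only — DeepSeek hasn't published Engram's implementation details — and the dimensions, threshold, and top-k value are all arbitrary assumptions:

```python
import numpy as np

# Toy conditional-memory lookup (illustrative; NOT DeepSeek's actual
# Engram mechanism). Memory slots are pre-indexed key/value pairs; a
# query reads only the slots whose relevance score clears a gate,
# instead of attending over every token in the window.

rng = np.random.default_rng(0)
d = 64                       # embedding dimension (assumed)
n_slots = 1000               # number of memory slots (assumed)

keys = rng.normal(size=(n_slots, d))
values = rng.normal(size=(n_slots, d))

def retrieve(query, threshold=2.0, top_k=8):
    scores = keys @ query                  # relevance signal per slot
    idx = np.argsort(scores)[-top_k:]      # keep only the best matches
    idx = idx[scores[idx] > threshold]     # gate: drop irrelevant slots
    if idx.size == 0:
        return np.zeros(d)                 # nothing relevant: null read
    w = np.exp(scores[idx] - scores[idx].max())
    w /= w.sum()                           # softmax over selected slots
    return w @ values[idx]                 # weighted mix of stored values

out = retrieve(rng.normal(size=d))
print(out.shape)  # (64,)
```

The point of the gate is that lookup cost depends on the number of memory slots consulted, not the length of the raw context — which is the intuition behind the O(1) retrieval claim.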

The paper introduces what they call a Sparsity Allocation Law: 20–25% of the model's sparse parameters should go to memory, with the rest allocated to computation. Two complementary techniques round out the design — Modified Hopfield Continuum (mHC) for bounded attention, and Dynamic Sparse Attention with a "Lightning Indexer" for fast lookups. Together, these give V4 its claimed O(1) knowledge retrieval instead of O(n) degradation as context grows.
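Applying the claimed allocation to the reported totals gives a rough sense of scale. A back-of-envelope split, using the article's reported figures (not official specs):

```python
# Sparsity Allocation Law, back of the envelope: 20-25% of sparse
# parameters go to memory, the rest to computation. Totals below are
# the reported rumors, not confirmed specs.

total_sparse = 1.0e12          # ~1T total MoE parameters (reported)
memory_frac = 0.225            # midpoint of the claimed 20-25% range

memory_params = total_sparse * memory_frac
compute_params = total_sparse - memory_params
print(f"memory:  {memory_params / 1e9:.0f}B parameters")   # 225B
print(f"compute: {compute_params / 1e9:.0f}B parameters")  # 775B
```

If the law holds, a couple hundred billion parameters are acting purely as an addressable knowledge store rather than as computation.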

The practical upshot: 97% NIAH accuracy at 1M tokens, compared to 84.2% for standard attention. If those numbers hold up under independent testing, the implications for RAG-heavy applications are significant. A model that reliably retrieves from a million tokens of raw context reduces — or in some cases eliminates — the need for external retrieval infrastructure. Your chunking strategy, your vector database, your re-ranking pipeline? Potentially unnecessary overhead.

Big if, though. These are internal benchmarks. Nobody outside DeepSeek has verified them.
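For context on what that 97% figure measures, a minimal needle-in-a-haystack harness looks like this. `query_model` is a hypothetical stand-in for whatever inference API you'd test against; the mock below just string-searches the context:

```python
# Minimal NIAH-style check: bury a known fact at varying depths in a
# long context, ask for it back, and score exact recovery.
# `query_model` is a hypothetical callable, not a real API.

def build_haystack(filler_sentences, needle, depth):
    """Insert the needle at a fractional depth (0.0 = start, 1.0 = end)."""
    pos = int(len(filler_sentences) * depth)
    return " ".join(filler_sentences[:pos] + [needle] + filler_sentences[pos:])

def niah_score(query_model, filler, needle, answer, depths):
    hits = 0
    for depth in depths:
        context = build_haystack(filler, needle, depth)
        prompt = context + "\n\nWhat is the magic number?"
        if answer in query_model(prompt):
            hits += 1
    return hits / len(depths)

# Trivial mock "model" that just searches its prompt:
filler = [f"Filler sentence number {i}." for i in range(1000)]
needle = "The magic number is 7481."
mock = lambda prompt: "7481" if "7481" in prompt else "unknown"
score = niah_score(mock, filler, needle, "7481", [0.0, 0.25, 0.5, 0.75, 1.0])
print(score)  # 1.0
```

Real NIAH sweeps vary both context length and needle depth; 97% at 1M tokens would mean near-perfect recovery across that whole grid.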

The MoE Math

A trillion parameters sounds absurd until you understand how MoE actually works. Only about 37 billion parameters activate per token — the same as V3. The model uses 256 expert sub-networks and routes each token through 8 of them.
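The arithmetic is easy to verify, and a generic top-k gate shows the mechanism. The routing sketch below is the standard MoE pattern, not DeepSeek's exact router:

```python
import numpy as np

# MoE arithmetic behind "1T total, ~37B active": only the experts a
# token is routed to actually run. Counts are the reported specs; the
# gate below is a generic top-k router for illustration.

n_experts = 256        # expert sub-networks (reported)
k = 8                  # experts activated per token (reported)
print(f"expert capacity touched per token: {k / n_experts:.1%}")  # 3.1%

# Generic top-k gating: score every expert, keep the k best, and
# normalize their scores into mixing weights for the expert outputs.
rng = np.random.default_rng(1)
gate_logits = rng.normal(size=n_experts)
topk = np.argsort(gate_logits)[-k:]
weights = np.exp(gate_logits[topk] - gate_logits[topk].max())
weights /= weights.sum()
```

Each forward pass touches about 3% of expert capacity, which is why per-token serving cost stays flat while total capacity triples.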

The win isn't raw compute per query. It's specialization depth. With 50% more total experts than V3, each one can focus more narrowly — code generation, mathematical reasoning, multilingual tasks, creative writing. The model gets meaningfully smarter without becoming proportionally more expensive to serve.

Here's how the benchmarks shape up against the current frontier (internal numbers, treat accordingly):

| Benchmark | DeepSeek V4 | Claude Opus 4.5 | GPT-5.3 Codex | DeepSeek V3 |
|---|---|---|---|---|
| HumanEval | 90% | ~88% | ~87% | ~82% |
| SWE-bench Verified | 80%+ | 80.9% | ~80% | ~49% |
| NIAH @ 1M tokens | 97% | n/a | n/a | n/a |

The coding numbers are competitive with — possibly slightly ahead of — the best closed-source models. But the jump from V3's 49% on SWE-bench to V4's claimed 80%+ is the line that really pops. That's not incremental. That's a generational leap, if it's real.

Built on Huawei Silicon

Here's the part that matters beyond model architecture. DeepSeek V4 is built to run on Huawei's Ascend chips — specifically the Ascend 950PR for inference. The full model is optimized for Huawei's CANN software stack, and Huawei has adapted its chips to interpret NVIDIA-style programming instructions to minimize code rewrites.

If this works — a frontier-competitive model trained and served entirely on non-NVIDIA silicon — it's a real proof point that the CUDA moat has limits. The 950PR sits roughly between the H100 and H200 in raw capability. Not parity, but close enough for production.

There's also a lighter variant (~200B parameters) designed for Cambricon MLU chips and everyday API workloads. The full trillion-parameter model handles the heavy reasoning while the lite version serves volume traffic.

What the Interface Leak Tells Us

The test screenshots from April 8 reveal a tiered product strategy: Fast mode for daily conversation (lightweight, unlimited), Expert mode for complex reasoning (presumably the full V4), and Vision mode for multimodal tasks. This mirrors how Moonshot structures Kimi — free fast tier, paid deep reasoning.

System crashes and anomalies on the platform over the past week are being widely read as stress tests. The consensus points to a public release in the last two weeks of April.

What This Means If You Build Things

If you're running RAG pipelines with chunking strategies you've spent weeks tuning — watch this space. A model that genuinely retrieves at 97% accuracy over a million tokens at $0.30 per million tokens changes the cost-benefit math for a lot of architectures. That vector database might become optional.
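Worth running the numbers before tearing anything out, though. At the rumored price, raw token spend still favors retrieval; what long context buys you is dropping the pipeline complexity. A rough comparison, with all figures (pricing, query volume, token budgets) as assumptions:

```python
# Back-of-envelope: stuffing raw context vs. retrieved chunks, at the
# rumored $0.30 per million input tokens (unconfirmed). Query volume
# and token budgets are placeholder assumptions.

price_per_mtok = 0.30                 # rumored V4 input price, USD
queries_per_day = 10_000

full_context_tokens = 200_000         # dump the whole corpus per query
rag_context_tokens = 4_000            # typical retrieved-chunk budget

full_cost = queries_per_day * full_context_tokens / 1e6 * price_per_mtok
rag_cost = queries_per_day * rag_context_tokens / 1e6 * price_per_mtok
print(f"long-context: ${full_cost:,.0f}/day")   # $600/day
print(f"RAG tokens:   ${rag_cost:,.0f}/day")    # $12/day
```

So the trade is token spend against infrastructure: the vector database becomes optional when the gap between those two lines is smaller than the cost of running and tuning the retrieval stack.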

If you're evaluating open-weight models for self-hosting, V4 is expected to ship under Apache 2.0. The INT8-quantized version reportedly runs on two RTX 4090s (48GB total VRAM), and INT4 reportedly fits on a single RTX 5090. Frontier-class performance on hardware a serious hobbyist can afford.

For the quantized self-hosting path, the reported configs:

| Precision | Hardware | VRAM |
|---|---|---|
| Full | Multi-node cluster | 800GB+ |
| INT8 | 2× RTX 4090 | 48GB |
| INT4 | 1× RTX 5090 | 32GB |

And if you're thinking bigger picture — a competitive frontier model that doesn't need NVIDIA GPUs reshapes supply chain assumptions. Every cloud provider running H100 clusters is watching whether Ascend-based inference actually scales.

None of this is independently verified yet. The Engram paper needs replication. The benchmark claims are self-reported. The pricing could shift at launch. But the architectural ideas — conditional memory with O(1) lookup, sparsity allocation laws, bounded attention — are worth studying regardless of whether V4 matches every claim on day one.

DeepSeek earned cautious optimism with V3. It delivered. Whether V4 does the same is about to become a testable question.