Google just told you its biggest cost problem isn't training. It's serving.

Reports surfaced this weekend that Google is in talks with Marvell Technology to co-develop two new custom chips — a Memory Processing Unit and an inference-optimized TPU — both aimed squarely at making model serving cheaper. If you're a developer paying per token for Gemini API calls or running models on Vertex AI, this is the infrastructure move that will eventually hit your bill.

The Two Chips

The partnership involves two distinct pieces of silicon. The first is a Memory Processing Unit designed to sit alongside existing TPUs and handle the data-movement bottleneck that plagues large model inference. When you're running a 400-billion-parameter model, shuffling weights between memory and compute becomes the actual wall — not the matrix multiplications themselves. An MPU attacks that specific problem.
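A back-of-envelope calculation shows why data movement, not math, is the wall at this scale. Every number below (weight precision, bandwidth, peak throughput) is an illustrative assumption, not a spec for any particular chip:

```python
# Back-of-envelope: is single-token decoding compute-bound or memory-bound?
# All hardware numbers are illustrative assumptions, not vendor specs.

params = 400e9                # 400B-parameter model (from the text)
bytes_per_param = 2           # FP16/BF16 weights (assumed)
flops_per_token = 2 * params  # ~2 FLOPs per parameter per generated token

hbm_bandwidth = 3e12          # 3 TB/s of memory bandwidth (assumed)
peak_compute = 1e15           # 1 PFLOP/s of matrix throughput (assumed)

# Time to stream every weight from memory once per generated token:
t_memory = params * bytes_per_param / hbm_bandwidth   # seconds
# Time the matrix units actually need for the math:
t_compute = flops_per_token / peak_compute            # seconds

print(f"memory-bound time:  {t_memory * 1e3:.1f} ms/token")   # 266.7 ms
print(f"compute-bound time: {t_compute * 1e3:.1f} ms/token")  # 0.8 ms
print(f"memory is the wall by {t_memory / t_compute:.0f}x")   # 333x
```

Batching shrinks the compute term per request but not the weight-streaming term, which is why serving economics hinge on the memory system. A chip that attacks `t_memory` directly is attacking the dominant cost.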

The second chip is a new inference-specialized TPU built from scratch for serving workloads. Think of it as a TPU that gave up all training flexibility in exchange for being ruthlessly efficient at one thing: processing user queries as fast and cheaply as possible.

Neither chip has a signed contract yet, and development timelines mean production is likely still a couple of years out. But the strategic direction is unmistakable.

Why Inference Is Eating the Budget

Here's the shift most developers haven't internalized. Training a frontier model is a one-time event — expensive, sure, but bounded. You train GPT-5.4 once. You train Gemini 3.1 once. That cost gets amortized over the model's useful life.

Inference runs continuously. Every API call, every search query with AI Overviews, every Gemini conversation — that's inference compute burning. And it scales linearly with demand. Google processes billions of queries daily. Even shaving 20% off per-query cost compounds into billions saved annually.
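To make that scaling concrete, here's the arithmetic as a sketch. The query volume and per-query compute cost are my illustrative assumptions, not Google's actual numbers; only the 20% figure comes from the paragraph above:

```python
# Illustrative: how a modest per-query saving compounds at hyperscale.
# queries_per_day and cost_per_query are assumptions for the arithmetic.

queries_per_day = 8.5e9   # "billions of queries daily" (assumed figure)
cost_per_query = 0.002    # $0.002 of inference compute per query (assumed)
savings_rate = 0.20       # the 20% per-query reduction from the text

annual_cost = queries_per_day * cost_per_query * 365
annual_savings = annual_cost * savings_rate

print(f"annual inference spend: ${annual_cost / 1e9:.1f}B")   # $6.2B
print(f"annual savings at 20%:  ${annual_savings / 1e9:.2f}B")  # $1.24B
```

The point isn't the exact output, which moves with the assumptions, but the shape: savings scale linearly with query volume, so any fixed percentage improvement is worth more every year demand grows.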

The numbers tell the story. Custom ASIC shipments from cloud providers are projected to grow 44.6% this year, dwarfing GPU shipment growth at 16.1%. Midjourney reported cutting monthly compute costs from $2.1 million to $700,000 — roughly a two-thirds reduction — simply by migrating from NVIDIA GPUs to Google's existing TPUs. And that's with current-generation silicon, not purpose-built inference ASICs.

The forecast gets more dramatic further out. While Nvidia holds roughly 90% of the training accelerator market, analysts predict its inference share could drop to 20-30% by 2028. The rest goes to custom chips like what Google, Amazon, and Microsoft are all independently building.

Google's Supply Chain Logic

Google isn't putting all its silicon in one basket. The current hardware strategy involves at least four partners: Broadcom (locked in through 2031), MediaTek (involved in Ironwood TPU design), now Marvell, and TSMC for fabrication — plus an in-house design team.

This is automotive supply chain thinking applied to AI accelerators. No single vendor accumulates enough leverage to dictate pricing. Different partners optimize for different workloads: training, serving, memory-bound operations. The existing Ironwood TPU already demonstrates what specialized silicon delivers — ten times the peak performance of TPU v5p, scaling to 9,216 liquid-cooled chips that together produce 42.5 exaflops of FP8 compute in a single superpod.
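The superpod figures above imply the per-chip number directly; this is just the division, using only the two figures quoted:

```python
# Derive per-chip throughput from the quoted Ironwood superpod figures.
superpod_flops = 42.5e18   # 42.5 FP8 exaflops per superpod (from the text)
chips = 9216               # liquid-cooled chips per superpod (from the text)

per_chip = superpod_flops / chips
print(f"~{per_chip / 1e15:.1f} PFLOP/s of FP8 per chip")  # ~4.6 PFLOP/s
```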

What This Actually Means for Your API Bill

If you're calling cloud AI APIs, upstream hardware barely matters until it reaches your invoice. So here's the practical question: will this make inference cheaper?

Almost certainly yes, but not tomorrow. Chip development to production takes 2-3 years. The more immediate effect is competitive pressure. Google building dedicated inference silicon pushes AWS (which already ships Inferentia 2) and Azure to invest harder in their own custom accelerators. That rivalry is what drives token prices down across the board.

Here's where inference economics stand right now:

| Platform | Approach | Cost per Inference vs. GPU | Power Savings |
|---|---|---|---|
| Nvidia H100/B200 | General-purpose GPU | Baseline | Baseline |
| Google TPU v6e | Semi-custom ASIC | ~4x cheaper per operation | ~40% |
| AWS Inferentia 2 | Custom inference ASIC | ~3.3x cheaper | ~35% |
| Google + Marvell (projected) | Purpose-built inference | ~5-6x cheaper (est.) | ~50% (est.) |

Those projected numbers for the Marvell collaboration are my extrapolation based on the trajectory from general-purpose to increasingly specialized silicon. Each generation of specialization tends to deliver another 30-50% cost reduction for predictable, high-volume workloads.
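Here's what that compounding looks like if you take the 30-50% per-generation figure at face value. This is purely illustrative — the "three generations" framing (general-purpose GPU, semi-custom ASIC, purpose-built inference chip) is my simplification:

```python
# Compounding cost reductions across silicon generations.
# Purely illustrative: each generation is assumed to cut serving cost 30-50%.

baseline = 1.0  # relative cost per inference on a general-purpose GPU

for cut in (0.30, 0.50):
    cost = baseline
    for generation in range(3):  # GPU -> semi-custom -> purpose-built
        cost *= (1 - cut)
    print(f"{int(cut * 100)}% per step -> {baseline / cost:.1f}x cheaper after 3 steps")
# 30% per step -> 2.9x cheaper; 50% per step -> 8.0x cheaper
```

That 2.9x-8x range brackets both the ~4x figure for today's TPUs and the ~5-6x projection in the table, which is what makes the extrapolation at least plausible.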

For developers self-hosting open models, this matters too. Inference-optimized cloud instances built around these chips mean you could serve something like Llama 4 or GLM-5.1 at meaningfully lower cost compared to renting general-purpose H100 time.

Nvidia Isn't Standing Still

Jensen Huang has been pushing deeper into the inference market with TensorRT-LLM optimizations and Blackwell's inference-mode features. Nvidia's pitch is flexibility — one GPU handles training, fine-tuning, and serving, so you don't need to manage three different chip architectures.

That argument resonates for smaller teams and startups where workloads shift constantly. It falls apart at hyperscaler scale, where query patterns are predictable enough to justify custom silicon. Google processes enough inference requests that even a 10% efficiency gain on a dedicated chip saves more than the R&D cost of designing it.
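The break-even claim is easy to sanity-check. Both dollar figures below are assumptions I'm choosing to illustrate the argument, not reported numbers; only the 10% gain comes from the paragraph above:

```python
# Break-even sketch: when does designing a custom inference chip pay off?
# Spend and R&D figures are assumptions chosen to illustrate the argument.

annual_inference_spend = 6e9   # $6B/year on serving compute (assumed)
efficiency_gain = 0.10         # the 10% gain from the text
chip_rd_cost = 500e6           # $500M to design a custom ASIC (assumed)

annual_savings = annual_inference_spend * efficiency_gain
breakeven_years = chip_rd_cost / annual_savings

print(f"annual savings: ${annual_savings / 1e9:.1f}B")      # $0.6B
print(f"break-even in:  {breakeven_years:.2f} years")       # 0.83 years
```

Under these assumptions the chip pays for itself in under a year — and the calculus only improves as inference volume grows, which is why every hyperscaler independently reached the same conclusion.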

The likely outcome isn't "Nvidia loses." It's a market bifurcation: Nvidia keeps training, ASICs take inference. Two distinct segments that spent a decade pretending to be one.

The Boring Bets Win

Cheaper inference doesn't just mean lower API bills. It means AI features that were previously too expensive to ship at scale become viable. Real-time video understanding in every Google Meet call. Multimodal search across every Drive document. Agent workflows that make thousands of model calls per task without blowing through a budget.
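The agent-workflow case is the easiest to quantify. All prices and call counts here are illustrative assumptions; the ~5x factor echoes the projected table row earlier, not a confirmed price cut:

```python
# Why cheaper inference unlocks agent workflows: cost of one agent task.
# Token prices and call counts are illustrative assumptions.

calls_per_task = 2000       # model calls per agent task (assumed)
tokens_per_call = 3000      # prompt + completion tokens per call (assumed)
price_now = 2.50 / 1e6      # $2.50 per million tokens today (assumed)
price_future = price_now / 5  # if purpose-built silicon cuts serving ~5x

for label, price in (("today", price_now), ("at ~5x cheaper", price_future)):
    cost = calls_per_task * tokens_per_call * price
    print(f"{label}: ${cost:.2f} per task")
# today: $15.00 per task -> at ~5x cheaper: $3.00 per task
```

A $15 task is a demo; a $3 task is a shippable feature. That threshold effect, not raw capability, is what the silicon race is really about.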

The next wave of AI products probably won't come from a breakthrough model architecture. It'll come from the cost of running existing models dropping below some critical threshold — and that threshold is being set in chip design offices right now.