Everyone was so busy arguing about Gemma 4 benchmarks this week that Netflix quietly shipped something genuinely weird on HuggingFace. VOID — Video Object and Interaction Deletion — is an open-weight model that removes objects from video. That part isn't new. What's new is that it then figures out how the remaining scene should have behaved if the object had never been there, and generates that instead.
Think about what that means for a second. You remove a bowling ball mid-roll, and VOID doesn't just fill in the lane texture — it makes the pins stay standing. Remove a person holding a stack of boxes, and the boxes don't float in mid-air. They fall. The model reasons about physical causality, then synthesizes a counterfactual version of the scene. Netflix's VFX pipeline apparently needed this badly enough to build it, paper it, and give it away under Apache 2.0.
Why This Isn't Just Another Inpainting Tool
Every video editor from DaVinci Resolve to Runway offers some version of "remove object." The standard approach masks the region and fills it with plausible background pixels — maybe propagating texture from adjacent frames, maybe running a diffusion model over the gap. These tools handle shadows, reflections, and basic occlusion fine.
They completely fall apart when the removed object was physically interacting with other things in the scene. Pull a chair out from under someone sitting on it? Existing tools leave the person hovering. Remove a car from a collision? The other vehicle still swerves around nothing. The core issue is that traditional inpainting models have no concept of "this object was causing that motion." They pattern-match textures. They don't reason about forces.
VOID tackles this with what the Netflix research team calls "counterfactual reasoning" — and the two-stage architecture they built to do it is surprisingly elegant.
The Two-Stage Pipeline
Stage one uses a vision-language model (powered by Gemini's API, interestingly) combined with SAM2 segmentation to analyze the scene. It doesn't just identify where the target object is — it maps every region of the frame that the object was causally influencing. These get encoded into what the team calls a "quadmask": a four-value semantic map where 0 marks the object itself, 63 marks overlap regions, 127 marks the "affected zone" where physics needs to change, and 255 is untouched background.
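To make the quadmask encoding concrete, here's a minimal sketch of how one could be composed from per-region boolean masks, using the four values described above. The mask names and the `compose_quadmask` helper are illustrative, not VOID's actual API:

```python
import numpy as np

def compose_quadmask(object_mask, overlap_mask, affected_mask):
    """Layer three boolean masks into a single 4-value semantic map.
    Later assignments override earlier ones, so the object itself
    always wins over overlap, which wins over the affected zone."""
    h, w = object_mask.shape
    quad = np.full((h, w), 255, dtype=np.uint8)  # 255 = untouched background
    quad[affected_mask] = 127                    # 127 = physics must change here
    quad[overlap_mask] = 63                      # 63  = overlap regions
    quad[object_mask] = 0                        # 0   = the object to remove
    return quad

# Toy 4x4 frame: object in the top-left, overlap to its right,
# affected zone across the bottom rows.
obj = np.zeros((4, 4), dtype=bool); obj[0:2, 0:2] = True
ovl = np.zeros((4, 4), dtype=bool); ovl[0:2, 2] = True
aff = np.zeros((4, 4), dtype=bool); aff[2:4, :] = True

quad = compose_quadmask(obj, ovl, aff)
print(sorted(np.unique(quad).tolist()))  # → [0, 63, 127, 255]
```

The point of the single map, rather than four separate masks, is that the downstream diffusion model gets one conditioning channel whose values tell it how much freedom it has in each region.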
Stage two feeds that quadmask into a video diffusion model fine-tuned on top of CogVideoX-Fun. This is where the actual generation happens — the diffusion model synthesizes new pixel content for every marked region, but constrained by the physical implications encoded in the mask. An optional second pass adds flow-warped noise refinement for longer clips where temporal consistency starts to drift.
The training data is clever too. Rather than trying to label real-world physics interactions by hand (nightmare), they generated synthetic paired datasets using Kubric for 3D scenes and HUMOTO for human motion simulation. Each training example contains the original video, a version with an object removed, and the ground-truth "what actually would have happened" — letting the model learn the causal gap directly.
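The paired structure is the key idea, so here's a tiny sketch of what one training example looks like conceptually. The field names and paths are assumptions for illustration, not the actual VOID data schema:

```python
from dataclasses import dataclass

@dataclass
class CounterfactualExample:
    original: str        # rendered clip with the object present
    object_removed: str  # same clip with the target object masked out
    ground_truth: str    # simulated "what would have happened" clip

# Hypothetical example from a Kubric-style synthetic scene:
ex = CounterfactualExample(
    original="scene_0001/full.mp4",
    object_removed="scene_0001/removed.mp4",
    ground_truth="scene_0001/counterfactual.mp4",
)
print(ex.ground_truth)  # → scene_0001/counterfactual.mp4
```

Because the simulator can re-run the scene without the object, the "what would have happened" clip is exact rather than labeled, which is what makes synthetic data the only practical route here.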
The Numbers That Matter
A user study with 25 evaluators across multiple scenarios put VOID against six competitors: ProPainter, DiffuEraser, Runway, MiniMax-Remover, ROSE, and Gen-Omnimatte. VOID was the preferred result 64.8% of the time. Runway — probably the best-known commercial option — landed at 18.4%. That's not a close race.
Honest caveat: 25 people isn't a massive sample, and user preference studies have well-documented biases. But the gap here is wide enough that I'd trust the direction even if the exact percentages shift with more evaluators. The demo videos on the project page are also genuinely impressive — the domino and bowling scenes in particular show exactly the kind of interaction-aware removal that makes other tools look broken.
Running It Yourself
You'll need serious hardware — 40GB+ VRAM, so an A100 or H100. Not a weekend-laptop project. Here's the quickstart:
```shell
git clone https://github.com/Netflix/void-model.git
cd void-model
pip install -r requirements.txt

# SAM2 for segmentation
git clone https://github.com/facebookresearch/sam2.git
cd sam2 && pip install -e . && cd ..

# Grab the base inpainting model
huggingface-cli download alibaba-pai/CogVideoX-Fun-V1.5-5b-InP \
  --local-dir ./CogVideoX-Fun-V1.5-5b-InP

# VOID checkpoints
huggingface-cli download netflix/void-model --local-dir ./void-checkpoints

export GEMINI_API_KEY=your_key_here
```
That Gemini API key requirement is worth noting — the VLM stage for mask generation calls out to Google's API. So you're not running fully offline; the causal reasoning component depends on an external service. For production use, that's a dependency you'd want to think about.
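If you're wrapping this in your own tooling, it's worth failing fast when the key is missing rather than dying mid-pipeline. This guard is my own addition, not part of the VOID repo:

```python
import os

def require_gemini_key(env=os.environ):
    """Fail early if the Gemini API key for stage one isn't configured."""
    key = env.get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError(
            "GEMINI_API_KEY is not set; the VLM mask-generation stage "
            "calls an external API and cannot run offline."
        )
    return key

# Passing a dict instead of os.environ makes the check easy to test:
print(require_gemini_key({"GEMINI_API_KEY": "demo-key"}))  # → demo-key
```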
Inference runs at 384×672 resolution by default with up to 197 frames. The included Jupyter notebook is honestly the easiest way to get started if you just want to see results on sample videos.
The Uncomfortable Part
The Register's coverage of VOID ended with a line that stuck with me: "Whether the world really needs more convincing video manipulation is another question." Fair point. A model that can remove objects and rewrite the physics of what happens next is extraordinarily powerful for VFX work — and extraordinarily dangerous for misinformation.
Netflix clearly built this for legitimate post-production: erasing boom mics, removing crew reflections, fixing continuity errors where an object shouldn't have been in frame. But the Apache 2.0 license means anyone can run it for anything. The counterfactual physics capability in particular makes deepfake detection harder — you can't just look for "floating shadows" or "objects that shouldn't be there" when the model already handles those tells.
Why Developers Should Pay Attention
Even if you're not doing video work, the architecture pattern here is reusable. Using a VLM for causal scene analysis, encoding that reasoning into a structured mask, then feeding it as conditioning to a generative model — that's a pipeline template that could apply to any domain where you need "understand the implications, then generate accordingly." Robotics simulation. Game physics. Architectural visualization.
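The reusable shape of that pipeline fits in a few lines. Every name below is hypothetical — this is the template, not any real API:

```python
from typing import Callable, Any

def counterfactual_pipeline(
    frames: Any,
    analyze: Callable,      # VLM-style causal scene analysis
    encode_mask: Callable,  # turn the analysis into structured conditioning
    generate: Callable,     # generative model conditioned on that structure
):
    analysis = analyze(frames)            # "what was this object influencing?"
    conditioning = encode_mask(analysis)  # e.g. a quadmask-like semantic map
    return generate(frames, conditioning)

# Toy instantiation with stub functions, just to show the data flow:
result = counterfactual_pipeline(
    frames=["f0", "f1"],
    analyze=lambda fs: {"affected": fs},
    encode_mask=lambda a: len(a["affected"]),
    generate=lambda fs, c: f"{len(fs)} frames, conditioning={c}",
)
print(result)  # → 2 frames, conditioning=2
```

The design choice worth stealing is the middle step: forcing the reasoning model's output through a structured, fixed-vocabulary representation means the generative model never has to parse free-form VLM text.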
Netflix publishing on HuggingFace is also just... notable. The company that famously open-sourced Zuul and Eureka for microservices is now publishing diffusion model weights. The bar for "AI-native company" keeps expanding beyond the usual suspects.
If you have an A100 lying around, the HuggingFace demo is worth ten minutes of your time before you even bother with local setup.