Last Wednesday, Sakana AI's paper describing the AI Scientist system landed in Nature as an open-access publication. The headline result: their v2 system generated a research paper that passed blind peer review at an ICLR workshop, scoring higher than 55% of human-authored submissions. The whole thing cost about twenty bucks in API calls.

I've been poking through the open-source code, the Nature write-up, and the community reactions. Here's the honest breakdown of what this thing actually is, what it can do, and where the cracks are.

The Pipeline: From Idea to LaTeX in Hours

The AI Scientist v2 isn't a single prompt chain — it's a multi-stage agentic system built around what Sakana calls "best-first tree search" (BFTS). The core insight is stolen from game-playing AI: instead of committing to one experimental path and praying it works, the system branches out like a chess engine exploring multiple lines simultaneously.

flowchart LR
    A[Ideation] --> B[Literature Review]
    B --> C[Experiment Design]
    C --> D[Tree Search Execution]
    D --> E[Analysis + Writing]
    E --> F[Automated Review]
    F -->|Score too low| C
    D -->|Branch| D

Five stages: generate a hypothesis, search the literature via Semantic Scholar, design and code the experiments, run them across parallel branches, then write the full paper in LaTeX with a vision model reviewing the figures. An "experiment manager" agent sits on top of the tree, deciding which branches look promising enough to keep exploring and which are dead ends worth pruning.

What makes BFTS different from naive retry loops is the resource allocation. When a branch hits a bug or produces weak results, the system doesn't just retry the same approach — it backtracks to a parent node and tries a different experimental configuration. The manager agent scores each node based on partial results, then funnels compute toward the branches with the highest expected payoff. It's the spirit of Monte Carlo tree search adapted for empirical research, with a heuristic scorer standing in for random rollouts, and the explore-exploit tradeoff is tunable.
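The pattern is small enough to sketch in plain Python: a priority queue of experiment nodes, a manager score, and branch-and-prune expansion. Everything below (the node structure, the scoring, the expansion) is my illustration of the idea, not Sakana's implementation.

```python
import heapq
import itertools
from dataclasses import dataclass, field

# Illustrative best-first tree search over experiment configs.
# Names and structure are mine, not from the AI Scientist codebase.

@dataclass(order=True)
class Node:
    neg_score: float                       # heapq is a min-heap, so store -score
    uid: int                               # tiebreaker; avoids comparing configs
    config: dict = field(compare=False)
    depth: int = field(compare=False, default=0)

def best_first_search(root_config, score, expand, budget=21, num_workers=3):
    """Pop the most promising node, branch it, keep exploring.

    score(config)  -> float, the manager agent's judgment of partial results
    expand(config) -> list of child configs (mutated experiment variants)
    budget         -> total nodes evaluated (cf. the repo's steps default)
    """
    counter = itertools.count()
    frontier = [Node(-score(root_config), next(counter), root_config)]
    best = (score(root_config), root_config)
    evaluated = 1
    while frontier and evaluated < budget:
        node = heapq.heappop(frontier)     # highest-scoring open branch
        # Branch: evaluate several variants (num_workers of them in parallel)
        for child in expand(node.config)[:num_workers]:
            s = score(child)
            evaluated += 1
            if s > best[0]:
                best = (s, child)
            # Backtracking is implicit: weak children sink in the heap, and
            # the next pop may resume from an entirely different subtree.
            heapq.heappush(frontier, Node(-s, next(counter), child, node.depth + 1))
    return best
```

The heap is the whole difference from a retry loop: a weak child isn't retried, it just sinks, and the next pop can pick up from any earlier branch point.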

The cost breakdown is almost absurd. Ideation runs a few bucks. The main experiment pipeline — the tree search, the code generation, the debugging loops — costs $15-20 using Claude 3.5 Sonnet as the backbone. Writing adds another $5. So you're looking at roughly $20-25 per complete paper, soup to nuts. For context, a single human researcher's time on a workshop paper is measured in weeks and thousands of dollars of salary.
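Back-of-envelope, using the figures above (the per-stage midpoints are my picks, not measured values):

```python
# Rough per-paper cost from the stage estimates; midpoints are mine.
stage_cost_usd = {
    "ideation": 2.0,       # "a few bucks"
    "experiments": 17.5,   # tree search, codegen, debugging: $15-20
    "writeup": 5.0,
}
total = sum(stage_cost_usd.values())
print(f"~${total:.2f} per paper")  # ~$24.50, inside the $20-25 ballpark
```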

What the Peer Review Actually Showed

Sakana submitted unedited, fully AI-generated papers to the ICLR 2025 "I Can't Believe It's Not Better" (ICBINB) workshop. One manuscript scored 6.33 on average (individual reviewer scores: 6, 7, 6), clearing the acceptance threshold. Human reviewers, blind review, accepted. No human touched the manuscript between generation and submission.

Context matters, though. ICBINB is a workshop, not a main conference track. Workshop acceptance rates are higher and the bar for novelty is deliberately lower — the whole point of that venue is exploring surprising negative results and unconventional framings. Calling this "AI passed peer review" without that qualifier is like saying someone "passed the bar" when they mean the MPRE.

Sakana also did something I respect: they voluntarily withdrew the accepted paper from the proceedings to avoid flooding the review system with AI submissions, watermarked every generated manuscript, and got IRB approval before the experiment. The system's automated reviewer hit 69% balanced accuracy, beating inter-human agreement from the NeurIPS 2021 consistency experiment. Which says something both about the AI reviewer and about how noisy human peer review already is.
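That 69% figure is balanced accuracy, which matters because accept/reject decisions are heavily imbalanced: a reviewer that always says "reject" gets high plain accuracy for free. A self-contained sketch of the metric (toy labels, not Sakana's evaluation data):

```python
# Balanced accuracy = mean of per-class recalls, so the rare "accept"
# class counts as much as the common "reject" class.

def balanced_accuracy(y_true, y_pred):
    classes = set(y_true)
    recalls = []
    for c in classes:
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        total = sum(1 for t in y_true if t == c)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)

# 8 rejects, 2 accepts: always predicting "reject" would score 80% plain
# accuracy but only 50% balanced accuracy.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
print(balanced_accuracy(y_true, y_pred))  # 0.625: (6/8 + 1/2) / 2
```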

Running It Yourself

The whole thing is open source on GitHub. Linux box, NVIDIA GPUs, Python 3.11, about an hour of setup. Two commands:

# Generate research ideas
python ai_scientist/perform_ideation_temp_free.py \
  --workshop-file "ai_scientist/ideas/my_topic.md" \
  --model gpt-4o-2024-05-13 \
  --max-num-generations 20 \
  --num-reflections 5

# Run the full pipeline
python launch_scientist_bfts.py \
  --load_ideas "ai_scientist/ideas/my_topic.json" \
  --model_writeup o1-preview-2024-09-12 \
  --model_citation gpt-4o-2024-11-20 \
  --num_cite_rounds 20

Supports OpenAI, Gemini, or Claude via Bedrock. The tree search defaults (num_workers: 3, steps: 21) give you a reasonable explore-exploit balance without melting your GPU.
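Those defaults live in the tree-search config file. A hedged sketch of the knobs; the field names below approximate the repo's bfts_config.yaml, so verify against your checkout before editing:

```yaml
# Illustrative only -- check bfts_config.yaml in the repo for exact keys.
agent:
  num_workers: 3     # parallel branches explored at once
  steps: 21          # total tree-search nodes before stopping
  # more workers/steps = wider exploration, higher API bill and GPU load
```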

The Criticisms Are Real

I'd be dishonest if I presented this as a clean win. The pushback has been sharp, and some of it lands hard.

An ACM SIGIR evaluation found that the system's literature review is shallow — keyword searches on Semantic Scholar rather than genuine synthesis. The system retrieves papers that match surface-level terms but misses foundational work that uses different terminology. Several generated ideas were flagged as "novel" by the system when they were actually well-known techniques like micro-batching for SGD. That's not a minor issue. It's a serious failure mode for a tool whose entire value proposition is automated discovery. If you can't reliably determine what's already known, your novelty claims are built on sand.
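The failure mode is easy to picture once you look at what a lexical search actually sends. The Graph API endpoint below is Semantic Scholar's real search route; the helper around it is my illustration of the kind of keyword lookup being criticized:

```python
from urllib.parse import urlencode

# Real public endpoint; the helper is my sketch, not the AI Scientist's code.
S2_SEARCH = "https://api.semanticscholar.org/graph/v1/paper/search"

def build_search_url(query, limit=20):
    params = urlencode({
        "query": query,
        "fields": "title,year,abstract",
        "limit": limit,
    })
    return f"{S2_SEARCH}?{params}"

# A lexical query like this only surfaces papers using these exact terms.
# Foundational work phrased as "gradient accumulation" or "pipeline
# parallelism" never matches, which is how an idea like micro-batching
# can come back looking "novel".
url = build_search_url("micro-batching SGD novelty")
# To actually fetch: urllib.request.urlopen(url), then parse the JSON "data" list.
print(url)
```

Nothing in that round-trip synthesizes anything; it ranks term overlap, and the system's novelty check inherits that ceiling.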

Earlier versions had uglier problems: missing figures, repeated sections, placeholder text like "Conclusions Here" left in manuscripts, and hallucinated numerical results. V2 addresses some of this — the vision model catches broken figures, the tree search lets the system self-correct when experiments produce garbage numbers, and the multi-round writing pipeline catches structural issues. But the fundamental problem hasn't disappeared. LLMs confidently generate plausible-sounding nonsense, and no amount of agentic scaffolding changes the fact that the underlying models don't understand the experiments they're running. They pattern-match on what a good result looks like.

Jimmy Koppel's critique on X put it bluntly: the work "contains little in the way of new ideas" and is "much less impressive than presented." Harsh, but there's truth in it. The accepted paper is competent workshop-level work, not a breakthrough. And competent workshop-level work is exactly what a $20 automated pipeline should be expected to produce. The question is whether "cheap and competent" at scale is more valuable than "expensive and occasionally brilliant." For a lot of applied research contexts, the answer might be yes.

What This Actually Means for Developers

Forget the "can AI do science" debate. The useful takeaway: BFTS as an agentic pattern — branching execution, pruning dead ends, a manager agent allocating compute across promising lines — generalizes far beyond academic papers. If you're building experimentation platforms or automated analysis pipelines, study this architecture.

The Uncomfortable Part

Nature's own editorial flagged the endgame: checking AI-generated papers thoroughly "takes as long or longer than the initial creation itself." AI generates faster than humans can evaluate. That's not a triumph of science — it's a denial-of-service attack on peer review. Sakana seems aware of this, hence the voluntary withdrawal and watermarking. But the code is open source now. The genie situation applies.

Twenty-dollar papers that pass peer review. Whether that excites you or terrifies you probably says something about your relationship with academic publishing.