Somewhere between "cool demo" and "existential threat to grad students," Sakana AI quietly shipped something that deserves more attention than it's gotten. Their AI Scientist v2 system autonomously wrote a scientific paper — hypothesis, experiments, analysis, figures, manuscript, the whole thing — and it passed peer review at an ICLR 2025 workshop. The cost? About twenty bucks in API calls.
What the System Actually Does
The AI Scientist v2 is an end-to-end pipeline that takes a research topic as input and produces a complete scientific manuscript as output. No human-authored code templates. No hand-holding between stages. The system handles ideation, experimentation, analysis, visualization, and writing — all orchestrated by LLM agents.
Here's the rough flow: you feed it a workshop topic file, it generates research ideas (checking novelty against Semantic Scholar), picks the most promising ones, runs experiments through an agentic tree search, and then writes up the results as a LaTeX paper with proper citations. Three manuscripts went to the ICLR "I Can't Believe It's Not Better" workshop. One cleared the acceptance bar, with reviewer scores above the average for human submissions.
That last detail is the one that stings if you've ever spent three months grinding on a workshop paper.
The Tree Search Is the Interesting Part
Version 1 of the AI Scientist was clever but brittle — it relied heavily on human-authored templates and struggled to generalize across different ML subfields. The v2 upgrade centers on what Sakana calls "progressive agentic tree search," and this is where the architecture gets genuinely smart.
Instead of running a single linear pipeline (generate idea → run experiment → write paper), the system grows a search tree of possible research directions. An experiment manager agent sits at the root, deciding which branches to expand and which to prune. Think of it like MCTS for science: the system can explore multiple hypotheses in parallel, allocate compute to the branches that show promise, and bail on dead ends early.
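The mechanics are easiest to see in code. Here's a minimal sketch of that idea under my own assumptions — the names (Node, run_experiment, tree_search) are illustrative, not Sakana's actual API. A priority queue stands in for the experiment manager: each step it expands the most promising branch and prunes low scorers early.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Node:
    neg_score: float                       # heapq is a min-heap, so store -score
    hypothesis: str = field(compare=False)
    depth: int = field(compare=False, default=0)

def run_experiment(hypothesis: str) -> float:
    """Stand-in for an LLM-driven experiment; returns a fitness score in [0, 1]."""
    return len(hypothesis) % 7 / 7.0       # deterministic dummy metric

def tree_search(root_ideas, steps=10, prune_below=0.2):
    frontier = [Node(-run_experiment(h), h) for h in root_ideas]
    heapq.heapify(frontier)
    best = None
    for _ in range(steps):
        if not frontier:
            break
        node = heapq.heappop(frontier)     # most promising branch first
        score = -node.neg_score
        if best is None or score > -best.neg_score:
            best = node
        if score < prune_below:            # bail on dead ends early
            continue
        # expand: refine the hypothesis into child variants
        for variant in (node.hypothesis + " v1", node.hypothesis + " v2"):
            heapq.heappush(frontier,
                           Node(-run_experiment(variant), variant, node.depth + 1))
    return best.hypothesis, -best.neg_score
```

The point of the shape, not the dummy scorer: promising branches get expanded, weak ones never consume another API call.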
The parameters you'd care about if you're running this yourself:
- num_workers: how many parallel exploration paths to pursue
- steps: maximum tree nodes to visit
- max_debug_depth: retry budget when an experiment node fails
- num_drafts: independent trees grown during the ideation phase
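If you were wiring these knobs into your own pipeline, a config object makes the trade-offs explicit. The values below are illustrative, not the repo's defaults:

```python
from dataclasses import dataclass

@dataclass
class SearchConfig:
    num_workers: int = 4       # parallel exploration paths
    steps: int = 21            # maximum tree nodes to visit
    max_debug_depth: int = 3   # retries when an experiment node fails
    num_drafts: int = 2        # independent trees during ideation

cfg = SearchConfig()
# rough upper bound on nodes explored across all drafts
node_budget = cfg.num_drafts * cfg.steps
```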
This is fundamentally different from the "give Claude a notebook and hope for the best" approach most people try when they throw LLMs at research tasks. The tree structure gives the system a way to hedge its bets and recover from bad initial hypotheses without starting over.
Running It Yourself
The whole thing is open source on GitHub and surprisingly straightforward to run. Two commands:
# Stage 1: Generate research ideas
python perform_ideation_temp_free.py --workshop-file topics/your_topic.md
# Stage 2: Run the full pipeline
python launch_scientist_bfts.py --load_ideas ideas.json
Under the hood, it orchestrates multiple models for different stages — Claude 3.5 Sonnet handles the experimental reasoning, o1-preview does the manuscript writing, and GPT-4o manages citations. You can swap these out for Gemini or other providers if you prefer.
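A hedged sketch of what that stage-to-model routing might look like in your own orchestration layer — the dict, function, and fallback here are my assumptions, not the repo's actual config:

```python
# Illustrative routing table mapping pipeline stages to models.
STAGE_MODELS = {
    "ideation": "claude-3-5-sonnet",
    "experiments": "claude-3-5-sonnet",   # tree-search reasoning
    "writing": "o1-preview",              # manuscript drafting
    "citations": "gpt-4o",                # reference management
}

def model_for(stage: str) -> str:
    """Pick the model for a stage, with a default fallback."""
    return STAGE_MODELS.get(stage, "gpt-4o")
```

Swapping providers then means editing one table instead of hunting through the pipeline.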
The cost breakdown is almost comically low: ideation runs a few dollars, the main experiment loop burns $15–20 in API calls (mostly Sonnet tokens for the tree search), and the writing phase adds about $5. Call it $25 total for a complete paper draft. Even if 80% of the output is mediocre, those economics change the game for preliminary research exploration.
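Taking midpoints of the quoted ranges, the arithmetic checks out:

```python
# Back-of-envelope totals from the figures above (USD, midpoints of ranges)
costs = {"ideation": 2.5, "experiments": 17.5, "writing": 5.0}
total = sum(costs.values())   # 25.0, consistent with the ~$25 estimate
```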
A VLM That Critiques Its Own Figures
One neat addition in v2: the system uses a vision-language model in a feedback loop to iteratively refine its figures. It generates a plot, renders it, looks at the rendered image, decides if the axes are labeled properly or if the color scheme is confusing, and revises. Multiple passes until the VLM reviewer is satisfied.
I've seen human researchers submit papers with worse figures than what this system produces on its third iteration. The aesthetic quality isn't groundbreaking, but the self-correction loop is a pattern worth stealing for any agentic coding pipeline — generate, render, visually inspect, revise.
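That generate-render-inspect-revise loop is easy to transplant. A toy sketch with stand-in functions (make_plot, vlm_critique, and refine_figure are hypothetical; a real version would call a plotting library and a VLM API):

```python
def make_plot(spec: dict) -> dict:
    """Stand-in for rendering a figure from a spec."""
    return {"axes_labeled": spec.get("labels", False),
            "colors_ok": spec.get("palette") == "colorblind-safe"}

def vlm_critique(rendered: dict) -> list:
    """Stand-in for a VLM judging the rendered image."""
    issues = []
    if not rendered["axes_labeled"]:
        issues.append("labels")
    if not rendered["colors_ok"]:
        issues.append("palette")
    return issues

def refine_figure(spec: dict, max_passes: int = 3) -> dict:
    for _ in range(max_passes):
        issues = vlm_critique(make_plot(spec))
        if not issues:
            break                          # critic is satisfied
        if "labels" in issues:
            spec["labels"] = True          # revise: add axis labels
        if "palette" in issues:
            spec["palette"] = "colorblind-safe"
    return spec

fixed = refine_figure({})
```

The bound on passes matters: without it, a picky critic and a weak reviser can loop forever.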
Honest Caveats
Before anyone panics (or celebrates), the paper itself is clear-eyed about limitations. The system doesn't consistently produce better papers than v1, especially when strong human-written templates are available. Success rates vary wildly depending on model choice, idea complexity, and the specific ML subfield.
The ICLR acceptance was at a workshop — not a main conference track. Workshop acceptance bars are lower, the review process is lighter, and "I Can't Believe It's Not Better" is specifically a venue that welcomes negative results and unconventional submissions. Nobody should mistake this for an AI writing NeurIPS spotlight papers.
That said, the trajectory matters more than the current benchmark. V1 couldn't generalize beyond its templates. V2 handles arbitrary ML topics. The gap between "workshop-level" and "conference-level" is real but not infinite, and the tree search architecture gives a clear path toward closing it — more compute, better base models, richer experiment environments.
Why Developers Should Care
Even if you never plan to automate paper-writing, the architectural patterns here are directly applicable:
The experiment manager agent pattern — a coordinator that grows a search tree, evaluates branches, and allocates resources — is a reusable blueprint for any complex agentic workflow. Think automated A/B testing, systematic prompt optimization, or even code generation pipelines where you want to explore multiple implementation strategies before committing.
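Applied to, say, prompt optimization, the coordinator shape reduces to something like successive halving: score candidates, prune the weak half, reallocate the trial budget to survivors. Everything below is an illustrative sketch, not code from the repo:

```python
def evaluate(candidate: str, trials: int) -> float:
    """Stand-in scorer; a real one would average `trials` noisy eval runs."""
    return (sum(map(ord, candidate)) % 97) / 97

def coordinate(candidates, budget_per_round=4, rounds=3):
    pool = list(candidates)
    for _ in range(rounds):
        if len(pool) == 1:
            break
        scored = sorted(pool, key=lambda c: evaluate(c, budget_per_round),
                        reverse=True)
        pool = scored[:max(1, len(scored) // 2)]   # prune the weak half
        budget_per_round *= 2                      # spend more on survivors
    return pool[0]
```

The same skeleton works whenever you can score partial results cheaply and only want to pay full price for the finalists.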
The cost structure also tells you something about where agentic AI is headed. If a complete research cycle costs $25, then the bottleneck isn't compute or money — it's the quality of the orchestration layer. Better tree search heuristics, smarter pruning, more robust experiment environments. That's the engineering work that actually matters now, and it's the kind of problem infrastructure developers are well-positioned to tackle.
The code is MIT-licensed. Go read the experiment manager. That's where the real ideas are.