Sakana AI's AI Scientist-v2 — the system that autonomously generates research hypotheses, runs experiments, and writes full papers — just got a write-up published in Nature. One of its papers passed blind peer review at an ICLR workshop. The code is open source, it costs about $140 per paper, and you can spin it up this weekend.
That sentence alone should make you uncomfortable, excited, or both. Here's what's actually going on.
Wait, an AI wrote a real peer-reviewed paper?
Sort of. The team at Sakana AI (in collaboration with UBC, the Vector Institute, and Oxford) submitted three fully AI-generated manuscripts to the ICBINB workshop at ICLR 2025 — a well-regarded but deliberately provocative venue whose full name is "I Can't Believe It's Not Better." One paper cleared review with scores of 6, 7, and 6, averaging 6.33. That beat 55% of human-authored submissions.
But context matters here, and there's a lot of it. ICBINB accepts roughly 70% of submissions. This wasn't a main conference track at NeurIPS or ICML — it was a workshop specifically designed to surface surprising negative results and unconventional approaches, which means the review bar sits well below where a top venue sets it. Jodi Schneider at Wisconsin–Madison compared the output to "a mediocre graduate student" at a generous venue. Jeff Clune, who led the research at UBC, called his own system's output "okay but not great." Nobody involved is pretending this is a breakthrough in scientific quality.
Sakana withdrew the accepted paper before publication. The point was never to sneak AI work into the literature — it was to prove the pipeline could clear the bar. That distinction matters: this is a capabilities demonstration, not a publishing strategy. They wanted to know if the end-to-end system could produce something that passes muster when reviewers don't know a machine wrote it. It could. What you do with that information depends on whether you're more worried about the quality ceiling or the volume floor.
How does it actually work?
AI Scientist-v2 runs a multi-stage pipeline. You give it a topic description file — basically a markdown doc describing a research area — and it takes over from there.
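To make that concrete, here's a hypothetical sketch of what such a topic file might look like. The headings and fields are illustrative, not the repo's actual schema — check the repository docs for the real format.

```markdown
# Topic: Regularization under label noise

## Description
Do simple regularizers (dropout, weight decay) recover accuracy
lost to noisy labels on small image classifiers?

## Keywords
label noise, regularization, robustness

## Constraints
- Models must train on a single GPU in under an hour
- Public datasets only
```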
Ideation: The system generates hypotheses, then checks novelty against Semantic Scholar. v1 needed human-authored code templates to get started; v2 dropped that requirement and generalizes across ML domains.
Experimentation: v2 uses what Sakana calls "progressive agentic tree search" — an experiment manager agent that explores multiple research directions in parallel, pruning dead ends and doubling down on promising branches. Think beam search, but for science. It writes code, runs experiments, debugs failures, and iterates. The tree structure is key: instead of committing to a single hypothesis and grinding through it linearly, the system maintains a frontier of candidate approaches, evaluates intermediate results, and reallocates compute toward whichever branch looks most promising. When an experiment throws an error, the agent diagnoses the failure, patches the code, and re-runs — sometimes cycling through three or four debugging rounds before giving up on a branch. This is where the v1-to-v2 gap is biggest. v1 followed a single path and got stuck often. v2's branching means a bad early result doesn't tank the whole run.
Writing: A separate pass generates a full LaTeX manuscript with figures. A vision model reviews the generated plots for quality. Then an automated reviewer (which Sakana reports performs on par with human reviewers, at 69% balanced accuracy) evaluates the draft.
The whole cycle takes about 15 hours and runs you roughly $140 in API costs.
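The experimentation loop above can be sketched in a few dozen lines. This is a toy illustration of the pattern — a frontier of branches, a debug-retry loop, and pruning after repeated failures — not Sakana's actual code; `run_experiment`, the crash rate, and the scoring are all stand-ins.

```python
import heapq
import random

# Hypothetical sketch of "progressive agentic tree search": keep a
# frontier of candidate branches, expand the most promising one,
# retry failures a few times, and prune branches that keep crashing.
MAX_DEBUG_ROUNDS = 3

def run_experiment(branch, rng):
    """Stand-in for 'write code, run it, collect a metric'.
    Fails sometimes, like real LLM-written experiment code."""
    if rng.random() < 0.3:
        raise RuntimeError("experiment crashed")
    return branch["depth"] + rng.random()  # toy score

def tree_search(seed_ideas, budget=20, seed=0):
    rng = random.Random(seed)
    # Heap keyed on negative score, so the best branch is expanded first.
    frontier = [(-0.0, i, {"idea": idea, "depth": 0})
                for i, idea in enumerate(seed_ideas)]
    heapq.heapify(frontier)
    best = (float("-inf"), None)
    counter = len(seed_ideas)
    while frontier and budget > 0:
        _, _, branch = heapq.heappop(frontier)
        for _ in range(MAX_DEBUG_ROUNDS):
            budget -= 1
            try:
                score = run_experiment(branch, rng)
                break
            except RuntimeError:
                continue  # "debug" the failure and re-run
        else:
            continue  # branch pruned after repeated failures
        if score > best[0]:
            best = (score, branch["idea"])
        # Push a deeper variant of this branch back onto the frontier;
        # high-scoring branches get revisited first.
        child = {"idea": branch["idea"], "depth": branch["depth"] + 1}
        heapq.heappush(frontier, (-score, counter, child))
        counter += 1
    return best

score, idea = tree_search(["idea-A", "idea-B", "idea-C"])
```

The `for`/`else` is the debug loop: the `else` only fires when every retry failed, which is when the branch gets pruned instead of re-queued.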
Can I run this myself?
Yes. The GitHub repo is public. You'll need a Linux box with NVIDIA GPUs and a tolerance for LLM-written code executing on your machine — Sakana's own docs stress running it in a Docker sandbox. Fair warning: 42% of experiments in the original evaluation failed due to coding errors. Prolific, not precise.
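The general shape of a sandboxed GPU run looks like the sketch below. The image name, paths, and environment variable are placeholders — check the repo's README for the real ones.

```shell
# Placeholder image name and mounts; see the repository docs for specifics.
docker build -t ai-scientist-sandbox .

# --gpus all exposes NVIDIA GPUs (requires the NVIDIA Container Toolkit).
# Bind mounts confine what LLM-written code can read and write. Note the
# pipeline still needs network access for LLM and Semantic Scholar APIs,
# so this is containment, not full isolation.
docker run --rm --gpus all \
  -e OPENAI_API_KEY \
  -v "$PWD/output:/workspace/output" \
  ai-scientist-sandbox
```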
Should I be impressed or skeptical?
Both, and I think the honest answer is that the direction matters more than the current quality.
Maria Liakata at Queen Mary University of London told Scientific American the approach is "agentic and without any real novelty" — the system chains together well-known capabilities (LLM coding, search, writing) without a deeper understanding of scientific reasoning. Critics point out that the literature review is shallow (keyword search, not synthesis), the novelty checker sometimes flags established concepts like SGD micro-batching as "novel," and the writing is repetitive and occasionally self-contradictory. These aren't nitpicks. A literature review that searches keywords instead of synthesizing themes will miss the kind of cross-pollination that drives genuinely new ideas. A novelty checker that can't distinguish "nobody's tried this" from "everybody knows this" is a fundamental limitation, not an engineering bug to be patched in the next version.
All fair. But the v1-to-v2 jump was significant. Dropping the template dependency, adding tree search for experiment exploration, and achieving workshop-level acceptance — that's real progress on a hard problem. And the scaling law they describe (better foundation models = better papers) suggests this gets meaningfully better every time Claude or GPT levels up. If you're betting against AI-generated research on the basis that today's output is mediocre, you're making a timing argument, not a structural one.
Sakana isn't alone, either. Intology's Zochi got a paper into ACL main proceedings (with human verification in the loop), and the Autoscience Institute's CARL had ICLR workshop acceptances before AI Scientist did.
What does this mean for developers?
Forget the paper-writing angle for a second. The AI Scientist pipeline is essentially a reference implementation for any "agent explores a large solution space" problem — tree search over experiment space, multi-model orchestration, automated evaluation loops. The experiment manager pattern ports cleanly to automated A/B testing, hyperparameter optimization, even bug reproduction.
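As a concrete port of that pattern, here's a minimal successive-halving hyperparameter search — the same manager idea in miniature: evaluate a frontier of candidates cheaply, prune the worst half, and spend the growing per-candidate budget on survivors. The objective is a toy stand-in, not any particular library's API.

```python
import math

def successive_halving(configs, evaluate, budget=1, rounds=3):
    """Keep the best half of `configs` each round, doubling the
    per-candidate budget, until one survivor remains or rounds run out."""
    survivors = list(configs)
    for _ in range(rounds):
        ranked = sorted(survivors,
                        key=lambda c: evaluate(c, budget),
                        reverse=True)
        survivors = ranked[: max(1, len(ranked) // 2)]  # prune dead ends
        budget *= 2  # reallocate compute toward promising branches
        if len(survivors) == 1:
            break
    return survivors[0]

# Toy objective: the learning rate closest to 1e-2 "trains" best.
# `budget` stands in for training steps and is ignored here.
def toy_eval(lr, budget):
    return -abs(math.log10(lr) - math.log10(1e-2))

best = successive_halving([1e-4, 1e-3, 1e-2, 1e-1, 1.0], toy_eval)
# → 0.01
```

Swap `toy_eval` for "train the model for `budget` steps and return validation accuracy" and this is a workable first pass at hyperparameter search; swap it for "deploy variant and measure conversion" and it's A/B test triage.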
The flood problem
The immediate concern from the scientific community is volume. Yanan Sui, an ICLR 2026 workshop chair at Tsinghua, put it bluntly: "AI-written papers are probably going to make things much worse." Peer review is already stretched thin. A system that produces passable-but-mediocre papers for $140 each could overwhelm conferences with sheer output, and the economics are brutal — a single GPU-equipped lab could generate more submissions in a week than a review committee can process in a month.
Top venues are reacting. ICLR's main conference now prohibits purely AI-written submissions. But enforcement depends on disclosure, and as Aaron Schein at U of Chicago notes, "We're not going to be able to remove the power to generate AI scientific papers. This technology is only going to get better." The detection arms race for AI-written text is already losing on the prose side; scientific writing, which is formulaic to begin with, will be even harder to police.
And if you do care about research from a practical standpoint, the $140 price tag is wild. Run it on five variations of an idea overnight and wake up to five drafted manuscripts you can cherry-pick from. Not as a replacement for thinking, but as a first-pass exploration machine.
The part that keeps me up at night
Nature published an editorial arguing that institutions need to fundamentally rethink how they handle AI-generated research. The Bulletin of the Atomic Scientists went further, warning about integrity erosion. Both are right, but the horse has left the barn. The AI Scientist-v2 is a mediocre researcher today. Given the trajectory, "today" is doing a lot of heavy lifting in that sentence.