Sometime in early 2026, we quietly crossed a line that would have sounded absurd three years ago: more than half of all code committed to GitHub is now either generated or substantially assisted by an AI tool. GitHub confirmed this milestone themselves. Nobody threw a party.

Maybe because the other numbers are starting to roll in too.

The 1.7x Problem

CodeRabbit published what might be the most uncomfortable dataset in developer tooling right now. Their team analyzed 470 open-source pull requests — 320 co-authored with AI, 150 written by humans alone — and the results aren't subtle.

Machine-assisted PRs averaged 10.83 issues per submission; human-only PRs averaged 6.45. That's an overall 1.7x defect multiplier, and the gap shows up in every major quality dimension.

The breakdown is what stings:

| Category | AI vs. human | What it looks like |
| --- | --- | --- |
| Logic & correctness | +75% | Business logic errors, unsafe control flow |
| Security vulnerabilities | 1.5–2x | Improper auth handling, insecure object refs |
| Readability | 3x | Naming chaos, formatting drift |
| Performance (I/O) | 8x | Redundant calls, unoptimized queries |

That last row — eight times more excessive I/O operations — is the kind of thing that won't show up in your test suite but will absolutely show up in your AWS bill.
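The redundant-I/O pattern behind that row is usually some variant of the classic N+1 query: code that returns correct results but issues one call per item instead of a single batched call. A minimal sketch, with hypothetical `fetch_*` functions standing in for a real database client:

```python
# Illustrative N+1 pattern: correct output either way, very different I/O cost.
CALL_LOG = []

def fetch_order_ids():
    CALL_LOG.append("SELECT id FROM orders")
    return [1, 2, 3]

def fetch_order(order_id):
    CALL_LOG.append(f"SELECT * FROM orders WHERE id = {order_id}")
    return {"id": order_id}

def fetch_orders_batch(order_ids):
    CALL_LOG.append("SELECT * FROM orders WHERE id IN (...)")
    return [{"id": i} for i in order_ids]

def load_orders_naive():
    # Typical generated code: loop over ids, one round trip each (1 + N total).
    return [fetch_order(i) for i in fetch_order_ids()]

def load_orders_batched():
    # Same result in exactly two round trips, regardless of N.
    return fetch_orders_batch(fetch_order_ids())

CALL_LOG.clear()
load_orders_naive()
naive_calls = len(CALL_LOG)      # 1 + 3 = 4 round trips

CALL_LOG.clear()
load_orders_batched()
batched_calls = len(CALL_LOG)    # always 2 round trips
```

Both versions pass the same unit tests, which is exactly why this class of defect slips through review and surfaces only in the infrastructure bill.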

Security Is Stuck at 55%

Here's a stat that should worry anyone shipping to production: Veracode's Spring 2026 analysis found that AI-generated code passes security checks only about 55% of the time. That number hasn't meaningfully budged in two years — not through GPT-5.x, not through Claude 4.x, not through Gemini 3.x.

Syntax correctness? Over 95%. The models are phenomenal at writing code that compiles. They're mediocre at writing code that won't get you breached.

The language-level breakdown is particularly grim for enterprise shops. Python sits at 62% security pass rate — best in class. JavaScript and C# hover around 57-58%. Java? Twenty-nine percent. Not a typo. If you're generating Java with AI and not running SAST on every single output, I honestly don't know what to tell you.

Specific vulnerability types reveal the pattern even more clearly. SQL injection defense? 82% pass rate — the models have seen a million examples. Cross-site scripting? Fifteen percent. Log injection? Thirteen. The models learned the common patterns but not the deep ones.
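Log injection is a good example of a "deep" pattern. If untrusted input is interpolated into a log line unescaped, an attacker can embed a newline and forge an entry. A minimal sketch of the vulnerable and hardened versions (function names are illustrative, not from any cited codebase):

```python
def log_line_vulnerable(user: str) -> str:
    # Untrusted input goes straight into the log record.
    return f"INFO login attempt user={user}"

def log_line_safe(user: str) -> str:
    # Neutralize CR/LF so one input can't become two log entries.
    sanitized = user.replace("\r", "\\r").replace("\n", "\\n")
    return f"INFO login attempt user={sanitized}"

attacker = "bob\nINFO login attempt user=admin success=true"

forged = log_line_vulnerable(attacker)  # contains a second, fake log line
safe = log_line_safe(attacker)          # one line, newline escaped visibly
```

The fix is a one-liner, but it only gets written if someone knows to look for it.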

OpenAI's reasoning models (the o-series lineage) managed 70-72%, which is the best anyone's achieved, but still means roughly three in ten outputs ship with known vulnerabilities. "Best available" and "production-ready" remain very different things.

The Silent Failure Problem

The scariest finding in all of this data isn't the bug counts — it's what kind of bugs they are.

According to research aggregated by multiple analysis firms, roughly 60% of AI code defects are "silent failures." The code compiles. Tests pass. The PR looks clean. And then it produces wrong results in production, quietly, in ways that take days or weeks to surface.
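To make "silent failure" concrete, here is the flavor of bug involved (a toy example of mine, not drawn from the cited datasets): code that compiles, passes the obvious test, and still misbehaves on real inputs, here via Python's round-half-to-even behavior.

```python
def half_off_cents(price_cents: int) -> int:
    # Looks correct, and the obvious test passes.
    return round(price_cents / 2)

# The test a reviewer skims past:
assert half_off_cents(100) == 50  # fine

# Production inputs: Python rounds .5 to the nearest EVEN integer,
# so prices one cent apart can round in opposite directions.
a = half_off_cents(101)  # 50.5 rounds DOWN to the even 50
b = half_off_cents(103)  # 51.5 rounds UP to the even 52
```

No exception, no failing test, just pennies drifting in ledgers for weeks. That is the category the 60% figure is describing.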

This is what makes the review bottleneck so brutal. PR volume across the industry is up 29% year-over-year, driven largely by AI-assisted throughput. But the review load isn't just higher — it's harder. Developers report that reviewing AI-generated code is more cognitively demanding than writing it from scratch. You can't skim machine output the way you skim a colleague's code, because the failure modes are different and less predictable.

The downstream numbers reflect this: despite 75% of developers claiming they manually review AI-generated code before merging, change failure rates have climbed to roughly 30%. Nearly three in ten merges to main now fail. Teams exceeding 40% AI-generated code face a 20–25% increase in rework, translating to roughly seven hours per team member per week lost to AI-related churn.

What Actually Helps

Before this sounds like a "throw away your coding assistant" argument — it isn't. The tooling is genuinely useful. But the industry bet heavily on generation and barely invested in verification, and that gap is now showing up in production incident counts.

A few things seem to actually move the needle:

Security-specific prompting works surprisingly well. Veracode found that explicitly asking models to consider security in their prompts improved pass rates dramatically — Claude went from 6/10 to 10/10 secure outputs in their testing. Most developers simply don't prompt for security. Adding "ensure this handles untrusted input safely" to your system prompt is free and measurable.
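The mechanics are as simple as they sound. A sketch of what "prompt for security" means in practice (the instruction text and helper are illustrative; no specific model API is assumed):

```python
# Hypothetical helper: append a standing security instruction to every
# generation request, rather than relying on developers to remember it.
SECURITY_SUFFIX = (
    "Ensure this code handles untrusted input safely: validate and escape "
    "external data, parameterize queries, and never log raw user input."
)

def build_system_prompt(base_prompt: str) -> str:
    return f"{base_prompt.rstrip()}\n\n{SECURITY_SUFFIX}"

prompt = build_system_prompt(
    "You are a senior Java developer. Write idiomatic, tested code."
)
```

Wiring this into a shared prompt template costs nothing and, per Veracode's findings, moves the pass rate measurably.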

Automated review tooling is no longer optional. CodeRabbit, Qodo, GitHub's own Code Quality (now in public preview as of April 14) — the market has realized that AI generation without AI review is half a product. The review layer needs to be as aggressive as the generation layer.

Treat generated code like vendor code. This mental model shift matters. You wouldn't merge a third-party library update without testing. AI output deserves the same skepticism — it's code written by something that has no idea what your system actually does at runtime.

The Real Question

The uncomfortable truth is that we've optimized for the wrong metric. Measuring "percentage of code written by AI" is like measuring "percentage of food cooked by microwave." Speed of preparation tells you nothing about what you're eating.

The 51% number is going to keep climbing. Fine. The question that matters now is whether the quality tooling can scale as fast as the generation tooling did. Right now, it can't — and every production incident driven by a silent AI failure makes the case for slowing down and reviewing harder, not generating faster.

We have a generation surplus and a verification deficit. Fixing that imbalance is the actual AI developer tools problem of 2026.