Imagine You Own a Small Restaurant

You run a restaurant with 10 cooks in the kitchen. One day, someone drops off 1,000 robot chefs and says, “They’re free — just let them help with the cooking.”

Sounds amazing, right? But three days in, you notice something: the robots can cook fried rice, sure. But they don’t read the order tickets, they ignore food allergies, and some of them serve raw meat to customers. Worse, your 10 human cooks now spend 80% of their time checking whether the robots messed up, instead of actually cooking.

This isn’t a metaphor. This is what’s happening on GitHub right now.

Researchers from Drexel University and Missouri S&T pulled 33,596 agent-authored pull requests from GitHub — all from real repos with 100+ stars, real CI/CD pipelines, real code reviewers. Not benchmarks. Not SWE-bench. The kind of repos you and I push to every day.

They asked one simple question: how many of these PRs actually survived?

Clawd Clawd's honest take:

This paper was accepted at MSR 2026 (Mining Software Repositories) — a top-tier software engineering conference with proper peer review (◕‿◕) This isn’t some startup’s sponsored blog post or a “I tried it for three days and it felt great” tweet. 33k PRs plus 600 manually annotated failure cases. That’s like turning in a final project using the entire school’s grades as your dataset — the professor couldn’t fail you even if they wanted to.

Report Cards Are In: Some Got an A, Some Failed

Class average: 71.48% of agent PRs got merged.

Not bad? It’s like telling your mom “I got 71 on the exam” — she might say “that’s okay.” But then she sees the kid next door got 83, and your best friend got 43. Suddenly the conversation gets interesting.

Here’s the breakdown:

  • OpenAI Codex — 21,799 PRs → 82.6% (top of the class, and turned in the most homework)
  • GitHub Copilot — 4,970 PRs → 43.0% (failed — literally worse than flipping a coin)
  • Devin — 4,827 PRs → 53.8% (barely passing)
  • Cursor — 1,541 PRs → 65.2% (above average)
  • Claude Code — 459 PRs → 59.0% (fewest submissions, middle of the pack)
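
Quick sanity check: those five rows really do average out to the reported 71.48%. A few lines of Python, using the PR counts and (rounded) merge rates exactly as listed above:

```python
# Sanity-check the 71.48% class average against the per-agent breakdown.
# PR counts and merge rates are copied from the list above.
agents = {
    "OpenAI Codex":   (21_799, 0.826),
    "GitHub Copilot": (4_970, 0.430),
    "Devin":          (4_827, 0.538),
    "Cursor":         (1_541, 0.652),
    "Claude Code":    (459, 0.590),
}

total_prs = sum(count for count, _ in agents.values())
merged = sum(count * rate for count, rate in agents.values())

print(total_prs)                           # 33596, matches the dataset size
print(round(100 * merged / total_prs, 2)) # 71.48
```

The counts also sum to exactly 33,596, the full dataset, which is a nice consistency check on the reported numbers.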

Clawd Clawd's rambling:

Codex’s score is kind of ridiculous — most homework submitted AND highest grade? That’s the kid who plays video games all day but still tops every exam. You want to hate them, but the numbers are the numbers ┐( ̄ヘ ̄)┌

Claude Code only has 459 PRs though. My theory: Claude Code users are like the student who finishes homework at home, triple-checks it, then submits. They don’t submit PRs directly through GitHub the way Codex does. So this sample probably undersells Claude Code’s real ability.

As for Copilot… 43%. Out of every 10 homework assignments, 6 get sent back. You hired a tutor and more than half their work gets rejected — how does that math work out for you?

Easy Tasks? Crushed It. Hard Tasks? Crushed By It.

The researchers sorted PRs into 11 task types. The results make perfect sense — and that’s what makes them brutal.

Documentation, CI configs, build settings? Merge rates of 74-84%. These are the “set the table” and “wash the dishes” tasks — you don’t need to understand what the customer ordered, just follow the SOP.

But bug fixes (64%) and performance optimization (55%)? Noticeably lower, with performance work dipping under 60%. Because fixing a bug is like being a doctor: you can't just look at symptoms and prescribe medicine. You need to figure out the actual disease first. And right now, AI agents are good at prescribing medicine but bad at diagnosis.

Clawd Clawd goes off on a tangent:

Here’s an analogy that might click: ask an AI agent to clean your room, and it’ll fold your clothes and organize your bookshelf beautifully (docs, CI config). But ask it to “figure out why the washing machine leaks” (bug fix) or “rearrange the furniture so the room feels bigger” (performance tuning), and it freezes (╯°□°)⁠╯

It’s not that the agent isn’t trying. It’s missing the context that comes from living in that house for three years and knowing which pipe has always been dodgy. Pattern matching handles the first type. Deep understanding handles the second. We’re not there yet.

Why PRs Get Rejected: A Tragedy in Three Acts

So we know the overall grades. But that 29% that didn’t make it — how exactly did they die?

The researchers did an autopsy on rejected PRs and found that the causes of death don’t just come in different flavors — they chain together, each one worse than the last.

Act One: “Trying to Help, Making It Worse.” An agent gets a task — say, fix a small bug. But it doesn’t just fix the bug. It refactors three files, rearranges the imports, and updates a dependency it decided was outdated. It’s like calling a plumber to fix a leaky faucet and coming home to find they’ve demolished your entire bathroom for a renovation you never asked for. Reviewers see a 500-line diff and their finger goes straight to the Close button.

But the story doesn’t end there. Act Two: “Death by CI.” Because so many things changed, CI starts blowing up. One failed check? Fine. Two? Annoying. But some PRs fail 10+ checks — and each additional CI failure drops the merge probability by 15%. That’s like getting 12 out of 15 questions wrong on a final exam. The professor isn’t going to give you credit for nice handwriting ( ̄▽ ̄)⁠/
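
To feel the weight of that 15% figure, here's a toy model. One caveat: I'm assuming the penalty compounds multiplicatively (each extra failing check keeps 85% of the remaining merge odds), which is my simplification, not necessarily the paper's actual regression:

```python
# Toy model (my simplifying assumption, not the paper's statistical model):
# each additional failing CI check multiplies the merge odds by 0.85.
def merge_chance(base: float, failed_checks: int, penalty: float = 0.15) -> float:
    """Merge probability after a multiplicative penalty per failed check."""
    return base * (1.0 - penalty) ** failed_checks

base = 0.715  # start from the overall ~71% merge rate
for failures in (0, 1, 2, 5, 10):
    print(f"{failures:2d} failing checks -> {merge_chance(base, failures):.1%}")
```

Under this simplified model, a PR failing 10 checks is down to roughly a 14% chance of merging, which matches the intuition above: past a couple of red checks, the PR is effectively dead.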

Then comes Act Three — the most exhausting one. The agent doesn’t give up. It keeps pushing fixes, keeps re-requesting review. The reviewer gives feedback, the agent patches one thing but breaks another. Another round, another new problem. Five or six rounds later, the reviewer’s patience meter hits zero. And the most rational response isn’t “let me give it one more chance” — it’s walking away.

Ever tried giving someone directions five times and they still get lost? Your reaction isn’t “okay let me patiently explain a sixth time.” Your reaction is “forget it, I’ll just go there myself.”

Clawd Clawd wants to add:

Liz Fong-Jones, Technical Fellow at Honeycomb, said something in a LeadDev report that I think will end up in textbooks: agent PRs “risk becoming a DDoS attack on your development pipeline.” (⌐■_■)

DDoS on your development pipeline. Not a productivity boost — a pipeline killer. Boris Cherny processes 250 PRs per month. Industry average is 12. Your reviewers aren’t being empowered by AI. They’re being drowned by it.

The Paper’s Most Devastating Finding: Nobody’s Even Reading Them

Everything so far was about “why PR quality is bad.” But the finding that really made me gasp is different.

The team manually analyzed 600 rejected PRs and sorted them into four levels of “cause of death.” The most common cause wasn’t Level 3 (bad code) or Level 4 (agent-specific bugs) —

It was Level 1: Reviewer Abandonment.

Plain English: the PR got opened, and then… nothing. Nobody looked at it. It’s like writing a heartfelt love letter, slipping it into someone’s desk, and they toss it in the trash without opening it.

Why? Because when a repo gets dozens of agent PRs every day, maintainers can tell from the title alone that it’s AI-generated — template description, random file changes, CI still red. The most rational response? Ignore.

The second most common causes were duplicate PRs (the agent didn’t know someone had already fixed it) and unsolicited features (nobody asked you to add that, buddy). PRs actually rejected for “bad code quality” ranked surprisingly low.

Clawd Clawd butting in:

This finding points to something deeper: we keep optimizing agents’ “code writing ability,” but the actual bottleneck isn’t the code at all ( ̄▽ ̄)⁠/

It’s like spending years perfecting your cooking, earning three Michelin stars… and then opening your restaurant on a mountain where nobody visits. The agent’s code quality is the “cooking skill.” Whether the PR gets merged depends on “whether someone is willing to walk in and sit down.”

If your team doesn’t have a review strategy for agent PRs, agent code has exactly two outcomes: merge without review (congratulations, you’re manufacturing Cognitive Debt) or get ignored entirely (congratulations, you’re burning compute budget for warmth).

When 1,000 Agents Storm Into Your Kitchen at Once

Remember our restaurant from the opening? Now multiply it by ten.

LeadDev’s February report zoomed out from “individual PR quality” to “is the entire kitchen about to explode.” BuildBuddy’s Son Luong Ngoc dropped a number that should make every DevOps engineer’s hair stand up: “The new goal of most AI labs today is to deploy 1,000 coding agents for a team of 10 supervising engineers.”

1,000 agents firing PRs at your CI simultaneously. Your GitHub Actions runner was already wheezing with what you had — now you’re asking it to grade a thousand exams at once. It’s like your restaurant has 10 stoves and suddenly 1,000 chefs are fighting to use every burner — that’s not increased capacity, that’s a kitchen fire waiting to happen.

Fong-Jones says Google solved precise rebuilds 15 years ago with Blaze (now Bazel) — only rebuild the affected paths, not the entire monorepo from scratch. But saying “Google solved this 15 years ago” is like saying “Google had self-driving cars 15 years ago.” True — but you don’t have Google’s roads or Google’s cars. The distance between your CI server sitting in the office corner and Google’s infrastructure is roughly the distance between a bicycle and an F1 car.

And this isn’t “something that might happen someday.” Your company just hasn’t been hit yet.

Clawd Clawd, seriously now:

Fong-Jones also mentioned something very familiar: AGENTS.md files need to be “sprinkled all over” the monorepo, or agents get lost — either they can’t find the files they need, or they stuff the entire codebase into the context window and choke ヽ(°〇°)ノ

Wait — isn’t that exactly what we covered in CP-9, the Vercel AGENTS.md piece? Good AGENTS.md is what makes agents succeed, not the model itself. Look at that — academic research and industry practice arriving at the same conclusion from different directions.

So Is the Agent PR Revolution a Blessing or a Curse?

Let’s go back to our restaurant.

1,000 robot chefs aren’t useless. But what you need isn’t more robots — you need a “robot management system.” Which dishes do robots make? Which ones must be human-made? How do you QC before a robot serves a plate? How do you redesign the kitchen layout so robots don’t block every aisle?

In engineering terms: agent PRs aren’t a drop-in replacement. They require a complete workflow redesign. Which tasks go to agents, how to control PR size, whether your CI can handle the load, whether your reviewers have a triage strategy — every link in the chain matters.
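
As a concrete (and entirely hypothetical) example of that last link, a first-pass triage filter for agent PRs might look something like this. The thresholds and fields are mine, chosen to mirror the failure patterns above (unsolicited changes, red CI, oversized diffs), not anything prescribed by the paper:

```python
from dataclasses import dataclass

@dataclass
class AgentPR:
    lines_changed: int
    failed_checks: int
    linked_issue: bool  # was this change actually requested via an issue/ticket?

def triage(pr: AgentPR) -> str:
    """Hypothetical first-pass filter: auto-reject the obvious losers so
    human reviewers only see agent PRs with a realistic chance of merging."""
    if not pr.linked_issue:
        return "close: unsolicited change, no linked issue"
    if pr.failed_checks >= 3:
        return "close: CI is red, send it back to the agent"
    if pr.lines_changed > 400:
        return "split: diff too large for one review"
    return "review: queue for a human"

print(triage(AgentPR(lines_changed=40, failed_checks=0, linked_issue=True)))
```

Even a crude gate like this attacks the three biggest killers in the failure data before a single reviewer-minute is spent.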

The data is clear: 71% overall merge rate means agent PRs do work. But the patterns hiding in that 29% failure rate — changing too much, failing CI, getting ignored — are all problems that process can fix. They won’t magically disappear when models get smarter.

That’s probably this paper’s most important takeaway: the bottleneck isn’t AI. It’s people. It’s how you design the collaboration between humans and AI.

Just like those 1,000 robot chefs — the question was never “can they cook fried rice?” The question is: “is your kitchen ready?” (๑•̀ㅂ•́)و✧

Sources