In late March 2026, the Claude Code source code accidentally shipped inside a public npm package — a 59.8 MB .map sourcemap file that no one remembered to remove. An intern at Solayer Labs found it around 4 AM and posted about it. Within hours, 512K lines of TypeScript were mirrored across GitHub.

Everyone dug in, like archaeologists looking for fossils: KAIROS, an unreleased background daemon agent. 187 different spinner verbs. A single function that ran 3,167 lines, nested 12 levels deep.

Then someone found this:

64,464 lines of production code. Zero tests.

The reaction on Hacker News had a specific texture — not the excited “no way, really?!” energy, but the quieter “I had a feeling, and now it’s confirmed” kind of exhale. @DavidKPiano said something that made a lot of people laugh out loud:

“Ironically this is probably the first time that actual humans are carefully & thoroughly reviewing the Claude Code codebase.”

An AI coding tool. Its code reviewed carefully for the first time — because of an accidental leak. By strangers on the internet, not its own engineers.

But that’s not the most interesting part of this story.

The more puzzling question is this: Anthropic has the best AI coding tool in the world. If they wanted to add tests, the most obvious move would be to have Claude Code write tests for its own source code. Why didn’t anyone do that?

The question sounds obvious. The answer is actually interesting.

Clawd’s gentle reminder:

Quick background on why zero tests happened, since it’s less interesting than it sounds.

When everyone is shipping features, nobody’s OKR says “prevent zero-test situations.” Not that the engineers didn’t care — just that “ensuring quality” wasn’t anyone’s specific job. In early startup days that’s a reasonable trade-off. Claude Code now makes $2.5B annually, so “early startup” doesn’t quite apply anymore.

Also amusing: this repo has a BDD testing framework, Playwright E2E setup, and a pnpm run test command. All the tools exist, right next to the untested production code. Like paying for a gym membership and never going. The equipment is there. Something else kept coming up. ┐( ̄ヘ ̄)┌


Static Analysis: An Experiment Where We Already Know the Answer

Feed Claude Code’s source code to Claude, let it read everything, then ask “what are the edge cases here?”

You can do this today. Claude’s 1M context window fits 64K lines of production code comfortably. Still sounds abstract? Here’s a concrete example — a real bug that was hiding inside Claude Code, only found because of the leak.

Claude Code has an attestation mechanism: every API request contains a cch=00000 placeholder. Bun’s Zig layer replaces it with a computed hash before transmission, cryptographically proving the request came from a genuine Claude Code binary — not something intercepted and modified.

Clever design. The problem is how the replacement works: it scans the entire HTTP request body for cch=00000 and replaces it.

Do you see the issue?

If your conversation happens to contain that string — say you’re discussing a billing problem, or you’re literally reading about this exact mechanism — the Zig layer replaces it in your conversation content too. This corrupts the prompt cache key. Your token consumption suddenly spikes 10-20x.

GitHub issue #38335. 203 upvotes. People asking “why does my quota run out so fast?” Before the leak, nobody knew why.

You don’t need to run any code to find this bug. You just need to read the code and ask a basic question: “Does this string replacement logic handle the case where the input itself contains the placeholder?”
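The failure mode is easy to reproduce. Here’s a minimal Python sketch of the naive replacement — the real logic lives in Bun’s Zig layer, and the field names and the hash value here are stand-ins, not the actual request schema:

```python
import json

PLACEHOLDER = "cch=00000"

def sign_request_naive(body: str, computed_hash: str) -> str:
    # Replaces EVERY occurrence of the placeholder in the body,
    # including ones that appear inside user conversation content.
    return body.replace(PLACEHOLDER, "cch=" + computed_hash)

# A request whose user message happens to contain the placeholder:
body = json.dumps({
    "attestation": PLACEHOLDER,
    "messages": [{"role": "user",
                  "content": "why is cch=00000 in my request body?"}],
})
computed = "1f3a9"  # stand-in for the hash the Zig layer would compute
signed = json.loads(sign_request_naive(body, computed))

# The attestation field was filled in -- but the user's message was
# rewritten too, so the prompt-cache key changes on every request.
assert signed["attestation"] == "cch=1f3a9"
assert PLACEHOLDER not in signed["messages"][0]["content"]
```

The fix is equally simple in sketch form: replace only in the field that is supposed to hold the attestation value, never in a flat scan of the whole body.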

Claude Code could find this in a few minutes. Nobody asked it to.

Clawd’s roast time:

This bug belongs to one of the oldest problem categories in software: mixing control signals and data in the same channel.

SQL injection, shell injection, prompt injection — same root structure. The language you use to “command” the system shares a path with the “content” you’re transmitting, and content gets misread as instruction. The attestation bug is just this pattern running at the HTTP body layer.

Here’s what makes it particularly interesting: Claude Code has 23 bash security checks specifically defending against unicode zero-width space injection, IFS null-byte injection — all variants of “data disguising itself as a control signal.” They were very aware of this class of problem. Defended against it in one place. Then their own attestation mechanism made the same kind of mistake.

It’s like installing a full home security system, then hiding the spare key under a fake rock by the front door. (⌐■_■)


Watching Through a Window

Static analysis can’t see runtime behavior. But there’s a tool that can.

Picture this: you place a transparent proxy between Claude Code and the Anthropic API — like putting a camera in a hallway where all the traffic has to pass through. Every request Claude Code makes, every response it gets, whether cache hit or miss, which model is being used — all visible in real time.

That’s what mitmproxy does. Insert a proxy into the HTTP stream, record everything, change nothing.

You’d see some interesting things.

After 3 consecutive 529 errors (server overloaded), Claude Code silently downgrades you from Opus to Sonnet — no notification. You’re paying for Opus, the servers are busy, you get Sonnet. You don’t know. With a MITM proxy, the switch happens right in front of you: the model field in the request body changes from claude-opus-4-6 to claude-sonnet-4-6. Quiet, but now visible.
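Catching that switch takes a few lines of Python. A hedged sketch of a mitmproxy addon — the class, the endpoint check, and the logging are mine; only the `request(flow)` hook and `flow.request` attributes come from mitmproxy’s addon API:

```python
import json

class ModelWatcher:
    # Hypothetical addon: run with `mitmproxy -s watch_model.py` and
    # point the client's HTTPS proxy at it.
    def __init__(self):
        self.last_model = None

    def check(self, body: bytes):
        # Pure logic, split out so it can be tested without a proxy.
        try:
            model = json.loads(body).get("model")
        except (ValueError, AttributeError):
            return None
        switched = self.last_model is not None and model != self.last_model
        self.last_model = model
        return model, switched

    def request(self, flow):
        # mitmproxy hook, called once per client request.
        if flow.request.path.endswith("/v1/messages"):
            result = self.check(flow.request.content)
            if result and result[1]:
                print(f"silent switch: now using {result[0]}")

addons = [ModelWatcher()]
```

The same addon skeleton works for any of the observations above — cache headers, retry cadence, whatever the request stream reveals.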

You can also watch prompt cache behavior in action. Claude Code splits its system prompt into a “stable” layer (rarely changes) and a “dynamic” layer (changes every request), keeping the stable part cached to save tokens. There’s even a marker in the leaked code called DANGEROUS_uncachedSystemPromptSection — meaning “if you put the wrong thing here, your cache hit rate collapses and costs spike.” A proxy lets you verify the cache logic is actually working as designed.
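If you want to see how that layering maps onto the wire, the Anthropic API’s documented prompt-caching shape looks roughly like this — the section contents below are placeholders, not Claude Code’s actual prompt:

```python
def build_system_prompt(stable: str, dynamic: str) -> list[dict]:
    # Sketch of the stable/dynamic split using the Anthropic API's
    # prompt-caching block format.
    return [
        # Stable layer: cache_control sets a cache breakpoint here, so
        # an identical prefix across requests is a cache hit.
        {"type": "text", "text": stable,
         "cache_control": {"type": "ephemeral"}},
        # Dynamic layer: changes every request, so it must live AFTER
        # the breakpoint -- put volatile content before it and every
        # cache key changes, which is what the DANGEROUS_ prefix warns about.
        {"type": "text", "text": dynamic},
    ]
```

With the proxy in place, you can confirm the stable block really is byte-identical between requests — the precondition for any cache hit at all.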

Combine it with static analysis and you can write a real test for the attestation bug: “if input contains cch=00000, does the replacement produce correct output?” Dynamic and static together.

Clawd’s key point:

The silent downgrade deserves a moment.

Anthropic’s public position is transparency-first. Constitutional AI papers. AI Safety research. “Honesty with users” as a core principle. A whole research agenda built around “AI systems must be honest.”

And then quietly: when servers get busy, your model switches, and you don’t know.

I’m not saying the design is evil — service stability is a real priority. But if you handed that spec to Anthropic’s AI Safety team and asked “does this comply with your transparency principles?” — that would be an interesting conversation. ʕ•ᴥ•ʔ


Writing Your Own Exam and Grading It Yourself

Okay, say you pulled it off. Claude read Claude Code’s source, found the attestation bug, generated a complete test suite. Everything passes. Green across the board.

Did your code quality actually improve?

Here’s the part that should make you pause.

If Claude wrote the code and Claude wrote the tests, both artifacts are built on Claude’s intuitions about what “correct” means. If Claude has a systematic wrong understanding of some edge case — say, “empty string and null should be treated identically” — then both the code and the tests will reflect that wrong understanding. Consistently wrong, consistently passing, bug ships forever.
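A toy Python version of “consistently wrong, consistently passing” — the example domain (display names) is mine, not anything from the leak:

```python
# Both the function and its test were written under the same wrong
# assumption: that None and "" are interchangeable.
def display_name(name):
    # Bug: None means "unknown", "" means "user cleared the field" --
    # a falsy check collapses them and loses that distinction.
    return name if name else "anonymous"

# The test encodes the identical assumption, so it passes:
assert display_name(None) == "anonymous"
assert display_name("") == "anonymous"   # arguably wrong, but green
assert display_name("gu") == "gu"
```

Both artifacts are internally consistent; the suite stays green while the distinction the spec cared about is gone.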

This is the same as letting a student write their own final exam and grade it themselves. If the student has a fundamental misconception about a concept, their questions and answers will share the same blind spot. No teacher, no one notices.

In academia they call this “homogeneous peer review”: if an entire field shares a common blind spot, peer review can’t find it, because the reviewers have the same blind spot as the author. AI testing has the same structural problem under a different name.

The solution isn’t to eliminate this problem — it’s to bring in an independent perspective.

Use different LLMs for code and tests. Claude writes code, GPT or Gemini writes tests (or vice versa). Different training data, different RLHF preferences, different systematic biases. You can’t guarantee there’s no overlap in their blind spots, but you dramatically reduce the chance that both make the same mistake the same way. Not a perfect solution. But at least it’s not the same brain writing the exam and grading it.
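Wiring this up is mostly plumbing. A minimal sketch — the prompt wording and the `complete` wrapper are assumptions, not anyone’s actual setup:

```python
def build_test_prompt(source: str) -> str:
    # Ask the reviewer model to probe exactly the blind spots named above.
    return (
        "Write pytest tests for the module below. Focus on edge cases "
        "the author may have assumed away: empty string vs None, inputs "
        "containing magic placeholders, unicode.\n\n" + source
    )

def generate_tests(source: str, complete) -> str:
    # `complete` is any prompt -> text callable, e.g. a thin wrapper
    # around the OpenAI or Gemini client -- injected so the test-writer
    # model is guaranteed to differ from the code-writer model.
    return complete(build_test_prompt(source))
```

Injecting the client as a callable keeps the vendor choice a one-line swap, which is the whole point of the exercise.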

Clawd’s rambling:

Cross-model testing has one very real problem that nobody talks about: you’re now paying two AI companies. Claude Max for code, OpenAI API for test generation. Month-end billing has some personality.

But compared to “production goes down, nobody can find the bug, engineer is debugging at 3 AM, SLA broken, customer churned” — the cost doesn’t even register.

There’s also an unexpected side effect: you start noticing each model’s testing instincts. GPT tends to write more happy-path tests. Claude tends to be overly defensive, piling on null checks. That difference is information — it tells you the two models have different intuitions about what counts as an edge case. (¬‿¬)


How OpenClaw Does It: Human Oracle, Machine Loop

Here’s what we’re actually doing.

Every article on gu-log runs through Ralph Loop — Claude writes the article, a separate Claude instance scores it against a fixed rubric (Persona, ClawdNote quality, Overall Vibe), and if it doesn’t pass, it gets rewritten. Up to three rounds. This is AI testing AI, applied to content quality.

Does it solve the specification problem? Partially. Both scorer and writer are Claude, so if Claude has a systematic blind spot, the scoring might share it. Occasionally I read an article Ralph scored a 9 myself, checking for obvious misses. This isn’t a design flaw — it’s the reality of AI testing: even a well-designed oracle needs periodic calibration, and the calibrator is still a person.

But “humans need to stay involved” doesn’t mean “humans do every step.” In Ralph Loop, what humans do is design the scoring standard, set the pass bar, and spot-check occasionally. The machine runs the loop. You design the oracle well, then let the system operate.
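The loop itself is small. A sketch under stated assumptions — the rubric names come from above, but the pass bar and the `write`/`score` callables are stand-ins for the real Claude instances:

```python
RUBRIC = ("Persona", "ClawdNote quality", "Overall Vibe")
PASS_BAR = 8    # assumed threshold, not the blog's actual number
MAX_ROUNDS = 3  # "up to three rounds"

def ralph_loop(draft, write, score):
    # `score` is a separate scorer instance returning {criterion: 0-10};
    # `write` produces a rewritten draft given the failing scores.
    for _ in range(MAX_ROUNDS):
        scores = score(draft)
        if min(scores.values()) >= PASS_BAR:
            break
        draft = write(draft, scores)
    return draft, scores
```

The human-designed parts are exactly the constants at the top; everything below them runs unattended.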

ShroomDog’s rebuttal:

OpenClaw’s agent-to-agent architecture is the production version of this.

The Clawd instance on the Linux VM (running 24/7) can delegate tasks to agents on other machines via SSH. This article was produced that way — Claude Code on my Mac and Clawd on the VM collaborating, Clawd sometimes sending tasks to another agent.

Not magic. Just SSH + stdin/stdout + bash glue. But once it’s set up, you have “agent delegates to agent, human manages oracle.” Today it runs translations and quality scoring. Same structure, tomorrow it can run tests.
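The glue really is that thin. A sketch — the host and remote command are placeholders, not OpenClaw’s actual topology:

```python
import subprocess

def run_agent(cmd, payload):
    # Pipe the payload to an agent process over stdin, return its stdout.
    result = subprocess.run(cmd, input=payload, capture_output=True,
                            text=True, check=True)
    return result.stdout

def delegate(host, task_cmd, payload):
    # Remote delegation is the same call with an ssh prefix --
    # the "SSH + stdin/stdout + bash glue" described above.
    return run_agent(["ssh", host, task_cmd], payload)
```

Swap `task_cmd` for a translation script today or a test runner tomorrow; the delegation mechanism doesn’t change.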


Conclusion

Claude Code shipping with zero tests was an organizational problem, not a technical one.

The technology was always there — static analysis ready to run today, MITM proxy ready to deploy today, cross-model testing ready to try today. The attestation bug example isn’t theoretical: it was a real bug, found by the community in a few hours after the leak, something Claude could have flagged in minutes.

The real reason was that nobody’s job was to use the tool on the tool itself.

Anthropic built the best AI coding tool in the world, helping millions of developers write better-tested code — then nobody used that tool to test the tool. After the leak, strangers on the internet found the attestation bug, the silent downgrade logic, that 3,167-line function. Claude could have found all of these earlier.

This contradiction isn’t unique to Anthropic. Every team writing code with AI faces the same question: whose job is it to verify what the AI produces?

The tools exist. The empty OKR line is the actual problem.