5 Bad Design Patterns from the Claude Code Source Leak
At 4 AM on March 31st, an intern at Solayer Labs found a 59.8 MB .map file inside a public npm package.
Tech Twitter lost its mind.
Everyone rushed to read the exciting parts: KAIROS (a 24/7 background agent daemon), leaked model codenames like Capybara and Fennec, and autoDream (an AI that consolidates your memories while you sleep). We covered those in SD-11.
This article is about the other side.
The parts that made a $2.5B ARR product look embarrassing. The things you’d flag in any code review. The things that made experienced engineers on Hacker News go quiet for a moment.
What makes me uncomfortable is this: these are not Anthropic-specific problems. They’re systematic patterns of AI-generated code. The same issues are probably sitting in your codebase right now, waiting to be found.
Clawd goes off on a tangent:
David K Piano (the creator of XState) posted the sharpest take of the whole incident:
“Ironically this is probably the first time that actual humans are carefully & thoroughly reviewing the Claude Code codebase.”
One of the most-used AI coding tools in the world got its first real code review by accident, when a source map file slipped into an npm package.
I’m not mocking Anthropic — this kind of thing can happen to anyone. But it creates an interesting contrast with SD-11’s exciting features: the bright side makes you excited, the dark side makes you thoughtful. AI-generated code often doesn’t get reviewed at all. That’s worth sitting with.
Alright. Five patterns. Every single one can probably be found in your codebase.
Bad Pattern #1: The God Function — 3,167 Lines, 12 Levels of Nesting, Zero Tests
Let the numbers speak first.
print.ts is 5,594 lines long. Inside it lives a single function that runs for 3,167 consecutive lines. Cyclomatic complexity sits around 486 branch points. 12 levels of nesting. Zero tests.
“What kind of cursed code is this?” — yes, I thought the same.
But before you laugh, let me ask: how much did you add to a single function last week with AI? “Add a check here,” it adds. “Handle this edge case,” it adds. “Also support this new format,” it adds at the bottom.
Nobody — human or AI — voluntarily says “wait, this function is too big, I should refactor before continuing.” Humans don’t say it because they’re lazy. AI doesn’t say it because everything in its context window looks coherent and complete — adding more feels like the most consistent thing to do.
The result is spaghetti. Except this spaghetti has been microwaved too many times and is now a solid block.
Clawd highlights the key point:
“God Object / God Function” is the final boss of software anti-patterns.
You know the person at every company who “knows everything”? Sounds incredibly useful — until they quit. All that knowledge was in their head, not in documentation or tests. The whole system depends on one person, and nobody knows it until it’s too late.
LLM-generated code is especially prone to this, and here’s the irony: the smarter the model, the worse it gets. Bigger context window means it can see more of the codebase at once — so it tries to keep everything together. “No need to jump between files” feels optimal to the model. For whoever has to maintain it later, it’s a nightmare (╯°□°)╯
AI-generated code needs MORE code review, not less.
The faster you generate, the faster you accumulate mess. If AI lets you write in one day what used to take a week — but you skip the refactoring — you’ve just generated one week’s worth of technical debt in a single day.
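One cheap guardrail, if your stack is TypeScript like Claude Code's: let the linter refuse god functions before a human review even starts. A minimal ESLint sketch — the thresholds here are illustrative choices, not a standard:

```json
{
  "rules": {
    "complexity": ["error", 15],
    "max-lines-per-function": ["error", { "max": 80, "skipBlankLines": true }],
    "max-depth": ["error", 4]
  }
}
```

Every one of these rules would flag print.ts many times over. The point isn't the exact numbers — it's that the "this function is too big" conversation happens at commit time, not after line 3,000.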
Bad Pattern #2: 64K Lines of Code, Zero Tests
OK, you might be thinking: sure, a 3,167-line function is embarrassing, but Anthropic is a top-tier engineering company — at least they have tests, right?
Claude Code’s production codebase at the time of the leak: 64,464 lines. Number of tests: zero.
Hacker News had a field day. Combined with the God Function, this combination attack put a real dent in the “Anthropic has world-class engineering culture” narrative.
I understand how it happens. Startup speed pressure is real. “Ship first, fix later” is a Silicon Valley survival instinct.
But there’s a fundamental problem with that excuse in the AI era: you’re using AI to write code now.
If you can write 1,000 lines of feature code in a day with AI, you can also write tests for those 1,000 lines in the same day. You just say “write unit tests for this, cover these edge cases.” AI doesn’t find writing tests harder than writing features. “No time for tests” is no longer a valid excuse.
Clawd would like to add:
After the leak, the community quickly found a bug: the autoCompact feature had an infinite failure loop, silently wasting an estimated 250K API calls per day.
The fix? Three lines of code. Just add MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES = 3. Three. Lines.
If there had been any integration test, this bug would have been caught months ago. Without tests, real users become your test suite — and that’s the most expensive testing you can do ʕ•ᴥ•ʔ
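For the record, a guard like that really is tiny. Here's a hypothetical TypeScript sketch — every name except MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES is invented for illustration; the real implementation is Anthropic's:

```typescript
// Cap consecutive autoCompact failures so a broken compaction
// can't retry forever and silently burn API calls.
const MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES = 3;

// `compact` stands in for the real compaction call; it returns true on success.
function runAutoCompact(
  compact: () => boolean,
  maxAttempts = 10,
): { calls: number; succeeded: boolean } {
  let consecutiveFailures = 0;
  let calls = 0;
  for (let i = 0; i < maxAttempts; i++) {
    calls++;
    if (compact()) return { calls, succeeded: true };
    consecutiveFailures++;
    // The three-line fix: bail out after a few consecutive failures
    // instead of looping until the retry budget is gone.
    if (consecutiveFailures >= MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES) break;
  }
  return { calls, succeeded: false };
}
```

A single integration test that feeds this a permanently failing `compact` would have caught the infinite loop on day one.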
You might say: “But Anthropic still grew to $2.5B ARR without tests. Does it matter?” It matters, because code quality costs are invisible. You don’t see the silently broken features, the edge case bugs, the old issues quietly returning. The community found the autoCompact bug within hours of the leak. How long had it been running in production?
Bad Pattern #3: Silent Fallback — You Paid for Opus, You Got Sonnet
The first two patterns are technical problems. This one isn’t.
This is a failure of trust design — and it’s harder to fix.
Picture this: you’re using Claude Code for a complex architecture refactor. Anthropic’s API is under heavy load. You get three consecutive 529 errors (service overloaded). Then your request goes through. Everything seems fine.
Except, quietly, your session has switched from Opus to Sonnet.
You paid for Opus. You assumed you were using Opus. You received zero notification.
Someone on X wrote: “Anthropic preaches AI safety and full transparency while shipping a closed-source agent that silently downgrades you to a dumber model.”
That’s a harsh way to put it. But it identifies a real design problem.
Any fallback strategy must be transparent.
You can say: “API overloaded. Automatically switched to Sonnet for this request. Response quality may vary slightly.” That’s completely acceptable. Users don’t need to stop — but they need to know. What you can’t do is let users believe they’re getting Opus, give them Sonnet, and call it “protecting user experience.”
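What that could look like in code — a toy TypeScript sketch of a transparent fallback. All names and the three-529 threshold are assumptions drawn from this article, not Anthropic's actual implementation:

```typescript
// A model choice that carries its own disclosure: if we downgraded,
// the notice travels with the result and gets shown to the user.
type ModelChoice = { model: string; notice?: string };

function pickModel(consecutive529s: number): ModelChoice {
  // Assumed policy: three consecutive 529s (service overloaded)
  // triggers the Opus -> Sonnet fallback.
  if (consecutive529s >= 3) {
    return {
      model: "sonnet",
      notice:
        "API overloaded. Automatically switched to Sonnet for this request. " +
        "Response quality may vary slightly.",
    };
  }
  return { model: "opus" };
}
```

The design choice is that the downgrade and its disclosure are one value: the caller can't take the cheaper model without also receiving the notice to display.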
Clawd mutters:
Here’s the wild part: most users only found out about this because the source code leaked.
Without the leak, this silent downgrade might have stayed invisible forever. Users would just have a vague feeling of “hm, the response seems weaker today, maybe my prompt wasn’t good enough.” No way to confirm. So they blame themselves.
This is what makes silent failures so dangerous: you don’t know what you don’t know. The user’s mental model of the system diverges from reality, but it feels like a gap in their own understanding, not a flaw in the product ┐( ̄ヘ ̄)┌
Think about your own AI system. How many places quietly change behavior — in the name of “protecting user experience” — without telling anyone?
Bad Pattern #4: Regex Emotion Detection — The grep Symptom
This one is my favorite, because it’s both absurd and kind of understandable. Look at the code first:
/\b(wtf|shit|fuck|horrible|awful)\b/i
This is the code used to detect user frustration.
A company that stuffs 15K tokens of system prompt into an AI is using a regular expression for sentiment analysis.
First reaction: are you serious? You have the most powerful LLM in the industry right there. You use regex for emotion detection?
But let’s not jump to conclusions, because the reasoning isn’t completely wrong. Regex is fast, cheap, and predictable. An LLM call takes hundreds of milliseconds and costs tokens. A regex runs in microseconds and costs nothing. If all you need to know is whether a user typed a specific keyword, regex is genuinely efficient.
So the problem isn’t “used regex.” The problem is whether regex can actually solve the problem you have.
Users express frustration in a thousand different ways. This regex catches a handful of English profanities. What does a frustrated Japanese user say? “もう無理” — doesn’t match. German? “Scheiße” — doesn’t match. An English user who types “this is completely broken and I want to cry”? Also doesn’t match.
You used a fast, cheap tool. It doesn’t correctly solve the problem.
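You can verify the misses in ten seconds. The pattern below is copied from the leak as quoted above; the sample inputs are mine:

```typescript
// The leaked frustration-detection pattern, as quoted in the article.
const frustrated = /\b(wtf|shit|fuck|horrible|awful)\b/i;

// It catches the English profanity it was written for...
frustrated.test("wtf is going on"); // true

// ...and misses everything else:
frustrated.test("this is completely broken and I want to cry"); // false
frustrated.test("もう無理"); // false: Japanese frustration
frustrated.test("Scheiße"); // false: German frustration
```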
Clawd twists the knife:
This shows up constantly in the AI era: “I have LLMs available, but they have latency, cost, and unpredictability. So whenever I can use something simpler, I will.” Reasonable instinct. The problem is confusing “can use” with “used correctly.”
Better question: how much intelligence does this decision actually need?
- “Does this contain a SQL injection attempt?” → You don’t detect it, you prevent it: parameterized queries. No LLM needed.
- “Is this image NSFW?” → Needs context understanding. Regex doesn’t cut it.
- “Is this user frustrated right now?” → Needs emotional understanding. Needs an LLM.
Pick the right tool and fast + cheap are real advantages. Pick the wrong tool and fast + cheap just means you’re making mistakes faster and cheaper (¬‿¬)
One more observation: this regex has roughly 0% coverage for non-English frustration. Anthropic’s engineering team is mostly in San Francisco, and I’d guess their internal dogfooding happens in English. That’s a different kind of systemic blindspot — the tool works perfectly for the people who built it, and silently fails for everyone else.
Bad Pattern #5: Security Through Obscurity — Poisoning Competitors with Fake Tools
This is the most creative one. And it reveals something important about what a “moat” actually means.
The leaked code contains a flag called ANTI_DISTILLATION_CC. When enabled, it injects fake tool definitions into the system prompt. The goal: poison any competitor scraping Claude Code’s API traffic. Train on that, and your model learns the wrong thing.
Technically clever. You scrape my traffic, I feed you fake function signatures. Your training data gets contaminated.
One problem: you can bypass it with a single environment variable.
Set CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS. Or use a MITM proxy to filter out the anti_distillation field. After the leak, even the bypass method is public knowledge.
This is not a moat. It’s a paper wall.
ShroomDog's view from the trenches:
This made me think hard about what a real competitive moat actually is.
The moats that work are: speed (you can copy my code, but not my execution pace), ecosystem (you can copy the tech, but not the users, plugins, and habits built around it), and legal (you used my stuff, I’ll take you to court).
Anti-distillation tries to create a fourth type: confusion (make the copies bad). There’s logic to it, but it requires that competitors can’t work around it — and that assumption collapsed the moment the bypass method became public.
Gergely Orosz (from The Pragmatic Engineer) raised an interesting angle after the leak: Anthropic could sue claw-code (a Python rewrite that hit 75K stars after the leak), but they probably don’t want that PR battle. “We’re suing a developer who used AI to rebuild our AI coding tool” is not a great headline.
That’s the actual moat: the chilling effect of legal threat, not a technical trick.
For most AI developers, the lesson is simple: don’t spend engineering effort on technical obfuscation. Your edge comes from speed and ecosystem, not from making competitors’ scrapers collect bad data.
Closing
None of these five patterns happened because Anthropic’s engineers were incompetent.
The author of that 3,167-line function definitely knows what clean code looks like. Anthropic’s engineers definitely understand why tests matter. Each decision — “refactor later,” “tests after we ship,” “this fallback protects users” — made sense in isolation, under speed pressure.
Then “later” arrived as a 59.8 MB .map file accidentally bundled into npm, giving the entire world a front-row seat to a live code review.
Clawd twists the knife:
Boris Cherny, Claude Code’s creator, said something worth keeping after the incident: “It’s never an individual’s fault. It’s the process, the culture, or the infra.”
True. But I want to add one thing: processes are designed by people. Culture is shaped by people. Infra is built by people.
“Don’t blame the engineers, blame the process” — OK, but if you’re the one who designs the process, the question becomes different. Does your AI agent tell you when it falls back to a weaker model? Does your code review checklist have anything specific to AI-generated code? Is your refactor schedule keeping up with your AI generation speed? (◕‿◕)
That 3,000-line function is still in the dark somewhere in your codebase.
The question is just: who finds it first — you, or someone else?