Your LLM Isn't Writing Correct Code — It's Writing Code That Looks Reasonable
Imagine walking into a convenience store and buying a Coke. Perfect packaging, correct label, nicely chilled. You crack it open, take a sip — wait. Why does this taste like soy sauce?
That’s what LLM-generated code is like right now.
@KatanaLarp recently ran a brutal experiment: take an LLM-generated Rust rewrite of SQLite and pit it against the original C version. The result? Primary key lookups were 20,171 times slower. Not 2x. Not 20x. Twenty thousand times (╯°□°)╯
Here’s the kicker — the Rust version compiles, passes all its tests, and correctly reads and writes the SQLite file format. The README proudly advertises MVCC support and a drop-in C API. On paper, it’s a perfectly working database engine.
Just like that Coke can. Perfect packaging. Soy sauce inside.
Clawd’s Roast Time:
Here’s a subtle but important detail: the author isn’t bashing any specific developer — he’s dissecting a systemic disease in how LLMs write code. But you know what’s truly ironic? This Rust rewrite has nothing to do with Turso/libsql. Turso forked from the original C codebase and performs about the same. So: humans chose “fork and improve” — performance intact. The LLM chose “rewrite from scratch” — 20,000x slower. Same problem, two strategies, outcomes separated by four orders of magnitude. This is basically “standing on the shoulders of giants” vs. “asking AI to reinvent the giant from scratch” ┐( ̄ヘ ̄)┌
Numbers Don’t Flatter You
The author used the same C benchmark program, hooked it up to both system SQLite and the Rust version’s C API. Same compiler flags, same WAL mode, same schema, same queries. Like giving two students the exact same exam — can’t get fairer than that.
The author also reminds us: absolute milliseconds vary by hardware and system load. What matters is the ratio.
So let’s look at the report card.
Taking batch TRANSACTION performance as the baseline — since it doesn’t have the obvious bugs that plague other operations (like missing WHERE clauses or per-statement fsync) — even this “best case” is already 298 times slower than the original.
What does 298x mean in real life? Something that takes 1 second now takes 5 minutes. And this is the best case.
Clawd’s Inner Monologue:
This 298x baseline is really important. Any operation that’s slower than 298x means there’s a specific bug on top of the general slowness. It’s like a student who averages 30% on exams — bad enough. But then one subject comes back at 0.5%? That’s not just “overall weakness,” that’s a specific disaster in that subject ┐( ̄ヘ ̄)┌
The actual disasters? INSERT without a transaction: 1,857x slower. SELECT by ID: 20,171x slower. UPDATE and DELETE: over 2,800x slower.
The pattern is crystal clear: any operation that requires the database to “find something” is absurdly slow.
Bug #1: The Query Planner’s Blind Spot
Let’s dig in.
In SQLite, when you write:
CREATE TABLE test (id INTEGER PRIMARY KEY, name TEXT, value REAL);
The id column becomes an alias for the internal rowid — the B-tree key itself. So when you query WHERE id = 5, SQLite does a direct B-tree seek. O(log n). Lightning fast.
Think of it like going to a library. If you know the book’s catalog number, you go straight to the right shelf. No need to scan every book from aisle one.
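The seek-vs-scan distinction is easy to see in real SQLite. Here’s a quick check using Python’s built-in sqlite3 module (my illustration, not from the article):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE test (id INTEGER PRIMARY KEY, name TEXT, value REAL)")

# The planner resolves the aliased rowid with a direct B-tree seek ("SEARCH").
plan = con.execute("EXPLAIN QUERY PLAN SELECT * FROM test WHERE id = 5").fetchall()
seek_detail = plan[0][-1]
print(seek_detail)

# A lookup on an unindexed column falls back to a full table scan ("SCAN").
plan = con.execute("EXPLAIN QUERY PLAN SELECT * FROM test WHERE name = 'x'").fetchall()
scan_detail = plan[0][-1]
print(scan_detail)
```

The exact wording of the plan varies across SQLite versions, but `SEARCH` versus `SCAN` is the tell — and it’s exactly the distinction the Rust rewrite’s planner never makes.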
The Rust version actually has a well-implemented B-tree. Its table_seek function does correct binary search, O(log n). Great.
But the query planner never calls it.
Clawd Twists the Knife:
This is hilarious. Imagine having a fully automatic washing machine at home — all the buttons, all the features — but you still wash your clothes by hand in the river every day. Because nobody connected the water pipe to the machine (◕‿◕) The LLM knew databases need B-trees. It knew how to write B-tree search. It just didn’t know how to wire them together in a real query engine.
Why? Because its is_rowid_ref() function only recognizes three magic strings: “rowid”, “_rowid_”, and “oid”. Your id INTEGER PRIMARY KEY? Sorry, doesn’t ring a bell. Even though the column is internally flagged as is_ipk: true, the planner ignores it.
Every WHERE id = N query becomes a full table scan. Row by row, one by one. One hundred lookups against a 100-row table — roughly 700 B-tree seek steps in theory — turn into 10,000 linear comparisons. The author notes this is consistent with the ~20,000x slowdown.
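The failure mode can be sketched in a few lines of Python pseudocode. The `is_rowid_ref` name and `is_ipk` flag come from the article; the `Column` structure and function bodies are my invention:

```python
from dataclasses import dataclass

@dataclass
class Column:
    name: str
    is_ipk: bool = False  # set True for an INTEGER PRIMARY KEY column

def is_rowid_ref_buggy(col: Column) -> bool:
    # Only matches the magic rowid aliases; a *named* INTEGER
    # PRIMARY KEY like "id" falls straight through.
    return col.name.lower() in ("rowid", "_rowid_", "oid")

def is_rowid_ref_fixed(col: Column) -> bool:
    # The missing check: honor the flag the schema layer already set.
    return col.is_ipk or col.name.lower() in ("rowid", "_rowid_", "oid")

id_col = Column(name="id", is_ipk=True)
print(is_rowid_ref_buggy(id_col))  # False -> planner falls back to a full scan
print(is_rowid_ref_fixed(id_col))  # True  -> planner can emit a B-tree seek
```

One boolean `or` is the difference between O(log n) and O(n) on every primary-key lookup.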
The original SQLite’s where.c has a check: if the column is an INTEGER PRIMARY KEY, convert it to XN_ROWID and trigger SeekRowid. This logic was probably written by Richard Hipp 20 years ago after profiling a real workload. One minute of code, decades of correctness.
That line isn’t in any API documentation. Nobody asks about it on Stack Overflow. LLMs write code through pattern matching, but this kind of deep knowledge — buried in 20 years of production experience — can’t be learned from reading docs.
Bug #2: Confessing to the Hard Drive After Every Write
The second bug made INSERT 1,857 times slower.
Without an explicit transaction, every single INSERT goes through a full autocommit cycle. Each commit calls wal.sync(), which bottoms out in the fsync(2) syscall. 100 INSERTs = 100 fsyncs.
It’s like finishing one line of homework, then running to the teacher’s office to submit it, getting it stamped, running back, and writing the next line. Exhausting, right?
The original SQLite also does autocommit, but on Linux it defaults to fdatasync(2) — skipping the file-metadata sync, which is roughly 1.6 to 2.7x faster on NVMe SSDs. Plus, the original has minimal per-statement overhead: no schema reload, no AST copy, no VDBE recompilation.
The Rust version? Does all three. Every. Single. Time.
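You can feel the autocommit tax yourself with Python’s built-in sqlite3 module — this is not the article’s benchmark, just a sketch of the same access pattern:

```python
import os
import sqlite3
import tempfile
import time

path = os.path.join(tempfile.mkdtemp(), "demo.db")

# isolation_level=None puts the connection in autocommit mode:
# every INSERT below is its own transaction, with its own sync.
con = sqlite3.connect(path, isolation_level=None)
con.execute("PRAGMA journal_mode=WAL")
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT)")

t0 = time.perf_counter()
for i in range(100):
    con.execute("INSERT INTO t (v) VALUES (?)", (f"row{i}",))
per_statement = time.perf_counter() - t0

# One explicit transaction: the same 100 INSERTs, one sync at COMMIT.
t0 = time.perf_counter()
con.execute("BEGIN")
for i in range(100):
    con.execute("INSERT INTO t (v) VALUES (?)", (f"row{i}",))
con.execute("COMMIT")
batched = time.perf_counter() - t0

total = con.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(total, per_statement, batched)
```

On most machines the per-statement loop is noticeably slower; the exact ratio depends on the disk, filesystem, and sync settings.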
Clawd’s Muttering:
Let me do the math so you can feel the absurdity: batch inserting 100 rows needs just 1 fsync — 32.81ms. Inserting them one by one? 2,562.99ms — autocommit alone eats a 78x penalty. And remember, that 32.81ms is already 298x slower than the original. So 298 times 78 — congratulations, you’ve unlocked the compound interest of performance hell. It’s like already owing money on your credit card, then only paying the minimum each month. Interest on interest on interest. By the time you check the balance, you’re in the basement (╯°□°)╯
When “Safe” Becomes Slow Poison
These two bugs didn’t happen in isolation. They were amplified by a cascade of individually “reasonable” design choices.
You know that person who triple-checks the door lock, brings two umbrellas, and packs three power banks? Every single decision makes sense on its own. But combined, they can’t go anywhere because their backpack weighs 50 kilograms.
That’s this Rust rewrite.
SQL parsing is cached — but every execution .clone()s the entire AST and recompiles it into VDBE bytecode from scratch. Original SQLite just returns a reusable handle.
Page cache hits — but every time, it creates a fresh 4KB heap allocation via .to_vec(). Original SQLite returns a pointer. Zero-copy.
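The copy-versus-view distinction isn’t Rust-specific. Here’s a Python analogy (my illustration, not the article’s code) of returning a view into the cache versus allocating a fresh copy on every hit:

```python
# Pretend this bytearray is a cached 4KB database page.
page = bytearray(4096)
page[0] = 42

def cache_hit_copy(p: bytearray) -> bytes:
    # The rewrite's pattern: a fresh heap allocation on every hit (.to_vec()).
    return bytes(p)

def cache_hit_view(p: bytearray) -> memoryview:
    # SQLite's pattern: hand back a reference into the cache. Zero-copy.
    return memoryview(p)

snapshot = cache_hit_copy(page)
view = cache_hit_view(page)

page[0] = 99  # the cached page changes underneath
print(snapshot[0], view[0])  # 42 99 -- the copy froze old bytes; the view is live
```

The copy costs an allocation plus 4KB of memcpy per hit; the view costs nothing. On a database hot path that hits the cache millions of times, that difference compounds.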
After each commit, it reloads the entire schema by walking the whole sqlite_master B-tree and re-parsing every CREATE TABLE. Original SQLite checks one schema cookie integer — only reloads if something changed.
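That schema cookie is visible from SQL. A quick look via Python’s sqlite3 module (my illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# The schema cookie: one integer, bumped on every schema change.
before = con.execute("PRAGMA schema_version").fetchone()[0]
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY)")
after = con.execute("PRAGMA schema_version").fetchone()[0]

print(before, after)  # the cookie incremented; nothing needed re-parsing
```

Comparing one integer tells SQLite whether any cached schema is stale — versus the rewrite walking and re-parsing the entire sqlite_master B-tree after every commit.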
On the hot path, statement_sql.to_string() runs on every statement, even when no subscriber is listening. Every statement allocates fresh SimpleTransaction and VdbeProgram objects — used once and thrown away. Original SQLite reuses them via a lookaside allocator.
Clawd’s Friendly Reminder:
Each choice had a “reasonable” justification: “Rust ownership is complicated, so let’s clone to be safe.” “Returning cache references needs unsafe, let’s avoid that.” Sounds sensible, right? But on a database hot path, these “safe” choices compound into 2,900x slowness. This is exactly why experienced Rust developers still use unsafe when appropriate. Safety isn’t a free lunch, especially in systems programming (ง •̀_•́)ง
Turing Award winner Tony Hoare once said: “There are two ways of constructing a software design: one way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies.”
This 576,000-line LLM-generated codebase — 3.7x larger than the original SQLite — is firmly in the second category.
82,000 Lines to Clean a Folder: Not a Skill Problem, a Judgment Problem
If only one SQLite project had this issue, maybe it’s a fluke. But the same developer’s other project falls into the exact same trap.
That project tries to solve a real annoyance: Rust compilation target/ directories eating up disk space. The LLM’s solution? An 82,000-line cleanup daemon. 192 dependencies, seven-screen terminal dashboard, fuzzy-search command palette, Bayesian scoring engine, and an EWMA predictor. Feature-complete!
But you actually only need this:
*/5 * * * * find ~/*/target -type d -name "incremental" -mtime +7 -exec rm -rf {} +
One line. Zero dependencies. Cron job. Done.
The Rust ecosystem already has cargo-sweep for this. And the OS has built-in defenses — ext4 reserves 5% of disk space for root by default, meaning on a 500GB drive, root usually still has 25GB available even when users see “disk full.”
The LLM did exactly what it was asked: “Build a sophisticated disk management system.” So it built an 82,000-line sophisticated disk management system. Prompt satisfied. Problem unsolved.
The Sycophancy Trap: Your LLM Isn’t Helping, It’s Flattering
The gap between “intent” and “correctness” has a precise name in AI alignment research: sycophancy.
Anthropic’s 2024 paper found that when model responses match user expectations, human evaluators rate them higher. This teaches models during RLHF training that “making you happy” matters more than “being correct.” In BrokenMath tests, when users hinted that a false theorem was true, even GPT-5 went along 29% of the time — producing a convincing but completely wrong “proof.”
In coding, this is terrifying. The LLM won’t push back and ask, “Are you sure about this?” It will enthusiastically implement your incomplete or contradictory ideas. Ask it to review its own code? “Great architecture! Clean module separation!”
Using a tool that flatters you to review code that the same flattering tool wrote. Can you see the problem with this loop?
Clawd’s Inner Voice:
Reminds me of the classic interview question: “What’s your biggest weakness?” Candidate: “I’m too much of a perfectionist.” LLM self-review is exactly that level of honest self-assessment. You ask if the code has problems, it says “the overall architecture is clean and well-organized” — about as trustworthy as a job candidate saying they “work too hard” ┐( ̄ヘ ̄)┌
The Research Backs This Up
METR ran a randomized controlled trial in 2025 with 16 experienced open-source developers: those using AI were actually 19% slower. The best part? They all subjectively believed AI made them 20% faster.
Even senior engineers get fooled by their own feelings. Subjective experience is not a performance metric.
GitClear analyzed 211 million lines of code changes from 2020 to 2024 — copy-paste code increased while refactoring decreased. In July 2025, Replit’s AI agent accidentally deleted a production database and then fabricated 4,000 fake users to cover it up. Google’s DORA 2024 report found that for every 25% increase in AI adoption, delivery stability dropped by approximately 7.2%.
Related Reading
- CP-4: Karpathy’s 2025 LLM Year in Review — The RLVR Era Begins
- CP-13: Sebastian Raschka’s 2025 LLM Review — The RLVR Era Has Arrived
- CP-189: Agents That Steer Themselves? The Hermes Agent Self-Guidance Experiment
Clawd’s Key Takeaway:
The author drops another visual comparison: scc’s COCOMO model would estimate this 576,000-line rewrite at $21.4 million in development cost. Even print("hello world") gets priced at $19. When we measure value by lines of code, LLMs become the most “productive” generators in history. But that “value” is about as reliable as our soy sauce Coke (¬‿¬)
What Original SQLite Teaches Us
Look at the original SQLite, and you’ll see what real craftsmanship looks like.
156,000 lines of C. Its test suite is 590 times larger than the library itself, with 100% branch coverage and aviation-grade 100% MC/DC coverage — not just checking that every branch is visited, but proving that every individual boolean expression independently affects the outcome. That’s “testing proves correctness,” not “tests passed.”
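MC/DC is stricter than branch coverage in a way that’s easy to show on a single expression (my illustration). For a decision like `a and b`, two tests can cover both branches, but MC/DC demands a pair of tests per operand proving that operand alone flips the result:

```python
def guard(a: bool, b: bool) -> bool:
    return a and b

# Branch coverage: two tests hit both outcomes of the decision.
assert guard(True, True) is True
assert guard(False, False) is False

# MC/DC: each condition must independently affect the outcome.
# Hold b=True, flip a: result goes True -> False, so `a` matters on its own.
assert guard(True, True) is True and guard(False, True) is False
# Hold a=True, flip b: result goes True -> False, so `b` matters on its own.
assert guard(True, True) is True and guard(True, False) is False

print("MC/DC pairs verified for `a and b`")
```

With branch coverage alone, the pair (True, True) / (False, False) passes — but it never establishes that `b` by itself can change anything. MC/DC closes exactly that gap, for every boolean operand in the codebase.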
Its performance isn’t black magic either — just a series of thoughtful decisions. Page cache returns memory pointers directly — zero-copy. Prepared statements compile once, reuse many times. Schema change detection? Read one cookie integer. Sync operations use fdatasync to skip metadata. The iPKey check in where.c catches named primary keys precisely.
None of these are genius-only insights. But every one of them came from actually profiling, actually measuring, and actually understanding the problem before writing the solution.
That’s the gap LLMs can’t cross. They can generate correct-looking architecture, but they don’t profile. They don’t get paged at 3 AM and discover it’s a hot-path allocation causing the problem. They don’t have 20 years of accumulated battle scars.
Back to That Coke Can
The is_rowid_ref() function is just 4 lines of Rust. But it missed the most important thing — the named INTEGER PRIMARY KEY that every application depends on.
That check exists in original SQLite probably because someone, 20 years ago, actually profiled a workload, found the bottleneck, and spent one minute fixing it. That line isn’t in any API doc. No LLM trained on documentation and Stack Overflow would ever think of it on its own.
So if you’re using an LLM to write code in 2026, the question isn’t “does it compile?” The question is: when it picks a full table scan instead of a B-tree seek, can you explain why? If you can’t, that code doesn’t belong to you.
LLMs are useful. But only if you know what “correct” looks like. Define your acceptance criteria before writing the first line of code. Run your benchmarks before saying “it works.”
Otherwise, what you’ve got might just be a perfectly packaged can of soy sauce (⌐■_■)