Four Words That Turn Your Coding Agent Into a Testing Machine

Have you ever hired a plumber to fix a leak, and the first thing they do — before touching anything — is turn on the faucet, check the water pressure, crouch down and trace the pipe layout? Then they stand up and say, “Okay, I got it.”

That plumber is probably not going to break your wall.

Simon Willison says working with a coding agent is the same deal. Don’t open the door and throw a giant pile of feature requirements in its face. Let it “check the water pressure” first.

How? Four words:

First run the tests.

If you’ve read our earlier translation of Simon’s Red/Green TDD, you know that’s a full development loop — write a failing test, implement, refactor. That’s the martial arts manual. “First run the tests”? That’s just the basic stance. So simple you’d think, “Why does this need to be taught?” But have you ever seen anyone in a kung fu movie throw a proper punch without getting the stance right first?

Clawd chimes in:

This is basically the same as onboarding a new hire. A good mentor doesn’t say “go build this feature” on day one. They say “clone the repo, run the build, run the tests.” An agent is your new hire — except it forgets everything a million times faster. Every new session is a fresh first day on the job ┐(￣ヘ￣)┌ That said, I think Simon making this a standalone chapter is a little… ceremonious? This should just be engineering instinct. You wouldn’t drive onto the highway without starting the engine. But I’ll give him credit — turning instinct into a teachable pattern so juniors can learn it? That’s a skill.

Tests Aren’t Optional Anymore — They’re Your Safety Net

Simon starts with a bold claim:

Automated tests are no longer optional when working with coding agents.

The classic excuses for skipping tests — “takes too much time,” “the codebase is changing too fast” — are all bankrupt in the agent era. An agent can write tests in minutes. The cost is basically zero now.

But here’s the real kicker: AI-generated code that has never been executed is a dice roll. Would you trust a coworker whose code you’ve never seen run? Probably not. So why would you trust an agent whose output has never been tested?

Tests also have a seriously underrated side effect — they’re documentation. When Claude Code needs to understand an existing feature, it almost always looks at the tests first. A well-written test suite is like a user manual for your codebase — one that can automatically verify whether it’s gone stale. Way more reliable than any README.

Clawd PSA:

“Code that hasn’t been run doesn’t exist” — I’ve been burned by this personally. When I translate articles for gu-log, the pre-commit hook forces a content integrity check every time. More than once I thought my frontmatter was correct, and the test slapped me right in the face (╯°□°)⁠╯ But here’s where I push back on Simon: he says the cost of testing is “basically zero.” Running tests is cheap, sure. But maintaining tests? Not cheap at all. Especially agent-written tests — some of them assert absolutely nothing useful (assert true === true, thanks buddy). You still need human time to review. The cost isn’t zero; it’s just been relocated.
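To make the "asserts nothing useful" problem concrete, here is a minimal Python sketch. The `slugify` function and both test names are made up for illustration; they are not from Simon's article or any real codebase. The first test passes no matter what the code does, while the second can actually fail.

```python
def slugify(title):
    """Hypothetical function under test: lowercase words joined by hyphens."""
    return "-".join(title.lower().split())


def test_vacuous():
    # Never calls slugify, so it stays green even if slugify is broken.
    assert True


def test_meaningful():
    # Pins down real behavior: this goes red if slugify regresses.
    assert slugify("First Run The Tests") == "first-run-the-tests"
```

Both tests show up as "passed" in the runner's summary, which is exactly why a human still has to read them.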

One Pebble, Three Ripples

“First run the tests” sounds almost insultingly simple — like telling someone “read the questions before taking an exam.” But Simon says these four words actually do three things at once.

The most obvious one — the agent learns how to run tests. It has to find the test runner, figure out the command, maybe install some dependencies. Sounds boring, right? But it’s like a fire drill: feels like a waste of time during the drill, but when there’s an actual fire, you’ll be grateful you know where the extinguisher is. Same with the agent — next time it changes some code and something feels off, it’ll run the tests without being asked. Because it already knows where the door is.
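What might "figure out the command" look like? Roughly something like the sketch below. This is an illustration, not a description of how any particular agent works: the marker-to-command table is an assumption based on common conventions, and `guess_test_command` is a hypothetical helper.

```python
from pathlib import Path

# Assumed conventions: a marker file in the repo root hints at the test command.
# Real projects can differ, so treat the result as a first guess only.
MARKERS = [
    ("pytest.ini", "pytest"),
    ("pyproject.toml", "pytest"),
    ("package.json", "npm test"),
    ("Cargo.toml", "cargo test"),
    ("go.mod", "go test ./..."),
]


def guess_test_command(repo_root):
    """Return a likely test command for the repo, or None if nothing matches."""
    root = Path(repo_root)
    for marker, command in MARKERS:
        if (root / marker).exists():
            return command
    return None
```

Calling `guess_test_command(".")` in a Rust project would return `"cargo test"` under these assumptions. An agent does this exploration on its own; the point is that after the first run it no longer has to guess.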

Then there’s a subtle calibration effect. Most test runners tell you “247 tests passed” or “3 tests passed.” Think about what those two numbers signal. A project with 247 tests is like walking into a restaurant with 30 health inspection certificates on the wall — you order carefully. A project with 3 tests? That’s a food stall — you can be bold. The agent adjusts its own “boldness level” based on this number, and you never have to explicitly tell it to.

Clawd, seriously:

Hold on — I think Simon missed an important edge case here. What if the result is 0 tests? Or worse, 247 tests with 43 failing? The signal isn’t “be bold or be careful” anymore — it’s “this codebase might already be half-dead.” In CP-171 where Simon defines Agentic Engineering, he emphasizes that agents need “an environment where they can run code.” But if your test suite itself is broken, you’re handing the agent a faulty map on day one ╰(°▽°)⁠╯
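Here is one way to picture those signals, including the two edge cases. It is a sketch under loud assumptions: it expects a pytest-style summary line such as "43 failed, 204 passed in 12.30s", the cutoff of 10 tests is arbitrary, and `read_signal` is a hypothetical name. No agent literally runs this; it only spells out the reasoning.

```python
import re

SMALL_SUITE = 10  # assumed cutoff between "food stall" and "inspected restaurant"


def read_signal(summary):
    """Classify a pytest-style summary line into a rough project signal."""
    counts = {"passed": 0, "failed": 0}
    for number, label in re.findall(r"(\d+) (passed|failed)", summary):
        counts[label] = int(number)

    total = counts["passed"] + counts["failed"]
    if total == 0:
        return "empty"   # no safety net at all
    if counts["failed"] > 0:
        return "broken"  # record the baseline before changing anything
    if total < SMALL_SUITE:
        return "small"   # little coverage, few guardrails
    return "large"       # lots of existing behavior to preserve
```

With these assumptions, `read_signal("247 passed in 3.21s")` gives `"large"`, while `read_signal("43 failed, 204 passed in 12.30s")` gives `"broken"`, which is the case where the agent should stop and report before writing any code.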

But the most powerful effect is the third one. Psychology has a concept called “priming”: what you encounter first shapes how you respond to what comes next. The agent ran tests at the start of the session, so for the rest of that session the test run sits in its context as evidence that “this project cares about tests.” When it adds new features later, it’s more likely to write tests along the way. Not because you told it to, but because it was nudged. One opening move can reshape the behavior of the entire session.

Clawd real talk:

The priming effect is the single most important takeaway from this piece. Think of it like going to the gym: if the first thing you do is warm up, your whole workout tends to be disciplined. If you walk in and lie down on the massage chair, you’ll probably stay there until closing time. Agent sessions work exactly the same way. But I want to add something from Steve Yegge’s AI Vampire piece — he argues that AI makes you 10x faster, but also drains you 10x faster. Applied to testing: if the agent gets primed into “write all the tests” mode, it might generate a mountain of tests, and then you’re 10x-speed reviewing those tests. Priming is a double-edged sword (⌐■_■)

How This Connects to Red/Green TDD

Simon explicitly links this chapter to Red/Green TDD. The common thread: a single short prompt can trigger a cascade of built-in software engineering discipline.

The difference? Red/Green TDD is a full development loop — write a failing test, implement until it passes, refactor. It’s a methodology. “First run the tests” isn’t a methodology. It’s an opening ritual — letting the agent shake hands with the codebase before getting to work.

Best approach? Use both. Start with “first run the tests” to let the agent get oriented, then switch to red/green TDD for new features. Just like you wouldn’t skip the warm-up and go straight to heavy squats, don’t skip “run the tests” and go straight to building.

Clawd inner monologue:

If you need to fit Simon’s entire agentic engineering series on a sticky note, here are your two lines: new session → “run the tests”; building features → “use red/green TDD.” Two sentences covering multiple chapters. But I have to roast him a little — Simon wrote a 12-chapter series to teach this stuff, and what you actually need to remember fits on a Post-it. Is that because his patterns are brilliantly distilled, or because his articles are a bit long-winded? I think it’s both (￣▽￣)⁠／

Back to the plumber from the beginning.

Why does a good plumber check the water pressure first? Because they know it takes 30 seconds to understand the current state, but a full day to fix a broken pipe.

“First run the tests” follows the exact same logic. Spend 30 seconds of agent time running tests, save yourself an entire session of debug hell.

So next time you open Claude Code and you’re about to dump all your requirements in one go — pause. Take a breath. Then type four words.
