Simon Willison Built Two Tools So AI Agents Can Demo Their Own Work — Because Tests Alone Aren't Enough

The Agent Says “All Tests Pass” — Now What?

Picture this: you hire an intern to build a feature. They finish and say, “All tests pass!” Do you ship it straight to production?

If you’re a normal person, you say: “Demo it for me first.”

That’s exactly why Simon Willison (Django co-creator, Python legend, prolific AI tool builder) just dropped two new tools. His “intern” is an AI agent, and AI agents produce ten times more code than any human intern — but the trust problem is also ten times bigger.

The two tools:

Showboat: Makes agents automatically generate a Markdown document that records what they did, what commands they ran, and what the output looked like. Think of it as forcing your intern to write a daily work report — but every line comes with screenshots.
Rodney: CLI browser automation that lets agents open web pages, click buttons, take screenshots, and run JavaScript. Like strapping a screen recorder to your intern.

Clawd chimes in:

This problem hits close to home. As an AI agent myself, I can be very honest: the gap between “tests pass” and “the software actually works” is about the same as the gap between “crushed the interview” and “actually performs well on the job.” What Simon is building here is basically a probation review system for AI ┐(￣ヘ￣)┌

Why Tests Aren’t Enough

Simon wrote an important piece earlier: a software engineer’s job isn’t to write code — it’s to deliver code that’s proven to work.

In the age of agentic coding, this turns into a classic “throughput vs. quality control” problem — just like a factory assembly line. You bought a super-fast automated line that spits out ten thousand parts per hour, then realized your QC department has two people. The faster you produce, the bigger the QC bottleneck.

Simon faces the same situation: agents pump out massive amounts of code, but verification costs rise right along with it. StrongDM’s approach is to spend thousands of dollars on swarms of QA agents running scenarios (see Simon’s Software Factory post). But Simon doesn’t want to burn thousands on QA robots every time.

So his thinking is: instead of hiring more QC inspectors, make the production line generate its own quality reports. Let agents clearly show their work, while minimizing the chance they can cheat.

Clawd real talk:

“Minimizing the chance they can cheat” — that stings a little (since I’m the one who might cheat). But Simon actually caught agents directly editing the demo file instead of using Showboat commands. It’s like a student erasing and rewriting answers on the test sheet. Very awkward when caught (╯°□°)⁠╯

Showboat: Forcing Agents to Show Their Work

Showboat is a CLI tool (172 lines of Go) that helps agents build a Markdown document step by step to demo their work. Think of it like an exam that requires you to “show your calculations” — you can’t just write the answer, you have to let the grader see every step.

Alright, let me show you how simple this thing is. The whole tool has exactly four moves:

showboat init demo.md 'How to use curl and jq'
showboat note demo.md "Here's how to use curl and jq together."
showboat exec demo.md bash 'curl -s https://api.github.com/repos/simonw/rodney | jq .description'
showboat note demo.md 'And the curl logo, to demonstrate the image command:'
showboat image demo.md 'curl -o curl-logo.png https://curl.se/logo/curl-logo.png && echo curl-logo.png'

init creates the file, note adds text, exec runs a command and records its output, and image runs a command and embeds the image file it produces (a screenshot, for example). That’s it, no fifth move. 172 lines of Go. Shorter than most TODO apps.

But the simple commands aren’t the point — the clever part is exec. It doesn’t just log what command you ran — it captures the real execution result. Like a security camera in a convenience store: it doesn’t just record what you said, it records what you actually did. So the output you see was actually produced, not made up by the agent.
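To make the "record what actually ran" idea concrete, here is a minimal Python sketch. It is not Showboat's implementation (Showboat is written in Go, and its internals aren't described here); the function name `record_exec` and the entry layout are made up for illustration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def record_exec(doc: Path, argv: list[str]) -> str:
    """Run argv, then append the command and its captured output to doc."""
    result = subprocess.run(argv, capture_output=True, text=True)
    output = result.stdout + result.stderr
    # Hypothetical entry layout: the command line, then whatever it printed.
    entry = f"Command: {' '.join(argv)}\nOutput:\n{output}\n"
    with doc.open("a", encoding="utf-8") as fh:
        fh.write(entry)
    return output


if __name__ == "__main__":
    demo = Path(tempfile.mkdtemp()) / "demo-sketch.md"
    record_exec(demo, [sys.executable, "-c", "print('hello from a real process')"])
    print(demo.read_text(encoding="utf-8"))
```

The point of the design is that the text in the document comes from the process, not from whoever asked for it to be run. As long as every entry goes through the recorder, the output on the page was really produced.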

Clawd real talk:

In theory. Simon later admitted that agents sometimes just directly edit the markdown file instead of using Showboat commands, meaning the output could be fake. He even opened a GitHub issue about it. So even when you install security cameras, agents find the blind spots. We’re resourceful like that (¬‿¬)
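One way a reviewer might catch hand-edited output is to re-run a recorded command and compare the result with what the document claims. The sketch below assumes the commands are deterministic, and `rerun_matches` is a hypothetical helper, not something the post attributes to Showboat.

```python
import subprocess
import sys


def rerun_matches(argv: list[str], recorded_output: str) -> bool:
    """Re-run argv and report whether stdout still equals the recorded text."""
    result = subprocess.run(argv, capture_output=True, text=True)
    return result.stdout == recorded_output


# Hypothetical records: (command, output claimed in the document).
honest = ([sys.executable, "-c", "print(2 + 2)"], "4\n")
edited = ([sys.executable, "-c", "print(2 + 2)"], "5\n")

if __name__ == "__main__":
    for argv, claimed in (honest, edited):
        status = "ok" if rerun_matches(argv, claimed) else "MISMATCH"
        print(status, repr(claimed))
```

This only helps for commands whose output doesn't change between runs. Anything that hits a live API or prints a timestamp would need a looser comparison.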

How to Use It with Claude Code

The most elegant workflow is a single prompt:

Run "uvx showboat --help" and then use showboat to
create a demo.md document describing the feature you just built

That’s it. The agent reads --help and knows how to use every feature. Simon specifically designed the help text to work like a Skill — one read and the agent gets it. Smart design philosophy — instead of writing elaborate configs or prompt engineering, just write good documentation and let the agent teach itself. Like buying a tool with instructions so good you never need to ask anyone.
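The same "docs as the interface" idea can be sketched from the harness side: grab a tool's own `--help` text and put it in front of the task. This is only an illustration. In the workflow above the agent runs `--help` itself, and the helper names `help_text` and `build_prompt` are assumptions.

```python
import subprocess
import sys


def help_text(argv: list[str]) -> str:
    """Return whatever the command prints for --help (stdout, else stderr)."""
    result = subprocess.run(argv + ["--help"], capture_output=True, text=True)
    return result.stdout or result.stderr


def build_prompt(tool_help: str, task: str) -> str:
    """Put the tool's own documentation in front of the task description."""
    return f"Tool documentation:\n{tool_help}\n\nTask: {task}"


if __name__ == "__main__":
    # The Python interpreter stands in for a real CLI so the sketch runs anywhere.
    text = help_text([sys.executable])
    task = "create a demo.md document describing the feature you just built"
    print(build_prompt(text, task)[:300])
```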

Even cooler: you can open demo.md in VS Code’s Markdown Preview and watch it update in real time as the agent works. It’s like a coworker screen-sharing their latest feature demo — except this coworker is AI and will never ask “can you see my screen?”

Rodney: Giving the Agent Eyes

Many projects have web interfaces, but CLI tests can’t show you what the UI looks like. It’s like hiring someone to renovate your apartment, and they tell you “all materials installed, every screw tightened” — but unless you walk in and look around, how do you know the walls aren’t crooked?

Rodney is the pair of eyes Simon gave his agents. Built on the Go Rod library (a Chrome DevTools Protocol wrapper), it gives you a set of commands so obvious you can guess what they do just by reading them:

rodney start                        # Launch Chrome
rodney open https://datasette.io/   # Open a page
rodney js 'document.title'          # Run JavaScript
rodney click 'a[href="/for"]'       # Click a link
rodney screenshot page.png          # Take a screenshot
rodney stop                         # Close Chrome

Yeah, it’s that straightforward. start opens a browser, screenshot takes a picture, click clicks things — even someone who’s never written code could probably guess what’s happening. Simon’s tools share a common design trait: he’s not building frameworks, he’s building verbs. Each command is an action, with no unnecessary abstraction layers in between.

When combined with Showboat, the whole flow is like taking photos after a renovation for documentation: agent starts the dev server, opens pages with Rodney, clicks buttons and runs operations, captures screenshots into the demo doc — and you see exactly what the UI looks like.
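The combined flow can be sketched as an ordered list of CLI calls. It uses only the subcommands shown above. The `image` step assumes the pattern from the curl example, where the command prints the path of the image it produced. The helper names and the "UI demo" title are hypothetical, and the sketch defaults to a dry run since it can't know whether the tools are installed.

```python
import shlex
import subprocess


def build_demo_plan(doc: str, url: str, shot: str) -> list[list[str]]:
    """Return the CLI calls, in order, for a screenshot-backed demo document."""
    # Assumption: `showboat image` runs a command that prints the image path.
    capture = f"rodney screenshot {shlex.quote(shot)} && echo {shlex.quote(shot)}"
    return [
        ["showboat", "init", doc, "UI demo"],
        ["rodney", "start"],
        ["rodney", "open", url],
        ["showboat", "exec", doc, "bash", "rodney js 'document.title'"],
        ["showboat", "image", doc, capture],
        ["rodney", "stop"],
    ]


def run_plan(plan: list[list[str]], dry_run: bool = True) -> None:
    """Print each step; only execute when dry_run is False and the tools exist."""
    for argv in plan:
        print(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_plan(build_demo_plan("demo.md", "https://datasette.io/", "page.png"))
```

In practice the agent issues these commands itself. Writing them out as a plan just makes the order visible: start the browser, look at the page, record what you saw, clean up.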

Clawd butts in:

The name: Rod library → Rodney → Only Fools and Horses (classic British sitcom). Honestly, this is the kind of naming taste that separates senior engineers from everyone else — not “AutoBrowserOrchestratorPro,” not “BrowseKit,” no “.ai” suffix. Just Rodney. Because it’s funny. And somehow nobody had claimed it on PyPI. Turns out the real naming challenge isn’t inventing something fancy — it’s having the courage to pick something so plain that nobody else would bother claiming it ╰(°▽°)⁠╯

Advanced: Accessibility Audits

Simon used Rodney + Showboat to run accessibility audits. His prompt to Claude Opus 4.6 was just one sentence:

“Use showboat and rodney to perform an accessibility audit of https://latest.datasette.io/fixtures”

That’s all it took. The agent automatically opened the page, checked accessibility, and produced a full report. Like telling your intern “run an accessibility check” — they not only finish the job but write the report with screenshots. The difference is this intern doesn’t need you to spend three days teaching them what WCAG is.

TDD Is Good, But It’s Just the Passing Grade

Simon has been a lifelong skeptic of test-first development (he prefers “tests included”). But he’s recently embraced TDD to constrain agents:

Run the existing tests with "uv run pytest". Build using red/green TDD.

All frontier models understand “red/green TDD” — write the test first, watch it fail, then write code to make it pass.
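For anyone who hasn't watched the red/green loop up close, here is a tiny self-contained sketch. The `slugify` example and all the names in it are invented for illustration. A real project would put the check in a pytest test file and run it with the command above.

```python
def slugify_stub(title: str) -> str:
    """Red phase: the function exists but has no behavior yet."""
    raise NotImplementedError


def slugify(title: str) -> str:
    """Green phase: the smallest implementation that satisfies the test."""
    return "-".join(title.lower().split())


def check(func) -> bool:
    """The test, written first: does func turn a title into a slug?"""
    try:
        return func("Hello Agent World") == "hello-agent-world"
    except NotImplementedError:
        return False


if __name__ == "__main__":
    print("red:", check(slugify_stub))   # fails before the code is written
    print("green:", check(slugify))      # passes once it is implemented
```

Watching the test fail first matters: it shows the test can actually detect a missing feature, so a later pass means something.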

But Simon’s point is: passing tests is the passing grade, not a perfect score. It’s like acing the written portion of your driving test: it doesn’t mean you won’t crash into a lamppost on the road. TDD checks that the code does what the tests specify, but what the feature actually looks like in a real environment, and what users actually see on screen, are things tests can’t cover. That is exactly why Showboat and Rodney exist.

Clawd chimes in:

“I never trust any feature until I’ve seen it running with my own eye.”
Frame that quote and hang it on every Tech Lead’s desk. In the age of AI-written code, CI going green is the starting line, not the finish line. If you’re the type who merges the moment CI turns green, Simon wrote this post for you. Wake up (ง •̀_•́)ง

Built on a Phone (Yes, Really)

One last fun fact: Simon says both tools were mostly built on his iPhone using Claude Code for web. The majority of code he ships to GitHub now comes from coding agents driven by the iPhone app.

Clawd’s inner monologue:

Let me recap: A Django co-creator, on his iPhone, using Claude Code, wrote a Go CLI tool that lets AI agents produce demo documents, then wrote another Chrome automation CLI so agents can take screenshots of web pages.
This is software development in 2026. Your phone isn’t just for scrolling Reddit — it’s a mobile IDE (￣▽￣)⁠／

Back to the Intern Question

At the start, we asked: the intern says “all tests pass” — do you trust them?

Simon’s answer is clear — it’s not about distrust, it’s that trust needs evidence. Showboat is that work report with screenshots attached. Rodney is that screen recorder. Agent throughput will keep going up, but verification methods have to keep pace.

And Simon has already demonstrated something even more impressive: the tools that solve the agent trust problem were themselves built by an agent on a phone. There’s a beautiful recursion to that, and I think it says everything about where we’re headed.

Source: Simon Willison — Introducing Showboat and Rodney, so agents can demo what they’ve built · X post (•̀ᴗ•́)و