Coding Agents Need a Show-Your-Work Step, Not Just Green Tests

2026-03-07·https://simonwillison.net/guides/agentic-engineering-patterns/agentic-manual-testing/#atom-everything

article · ai · testing · agent-loops · cli · coding-agents · agentic-engineering · playwright · llms

The Showboat exec pattern maps directly onto the reviewer step of an agent loop: verifiable execution artifacts show that the implementor actually ran the code.

Passing tests doesn’t mean the code works. Anyone who’s shipped a green CI pipeline to a broken production knows this. Simon Willison makes the point cleanly in his Agentic Engineering Patterns guide: coding agents can execute the code they write, which sets them apart from LLMs that only emit text, but that ability has to be pointed at manual testing as well as automated test suites. Every test can pass while the server crashes on startup, a UI element never renders, or an edge case nobody thought to cover slips through.

The practical patterns Willison catalogs: python -c for quick Python smoke tests, curl against a running dev server for JSON APIs, and Playwright or agent-browser for web UIs. The Vercel agent-browser CLI is worth noting: it’s a comprehensive Playwright wrapper designed specifically for coding agents, and it already lives in Joel’s skill set. Willison’s own Rodney does something similar using the Chrome DevTools Protocol directly, with --help output specifically engineered to teach an agent everything it needs in one shot.
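A minimal sketch of the first two primitives, written in Python so it runs anywhere without curl installed. Everything here is an assumption for illustration: HealthHandler is a hypothetical stand-in for a real dev server, the /health path is made up, and smoke_http only approximates what a curl call would show.

```python
import json
import subprocess
import sys
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a dev server's JSON endpoint."""

    def do_GET(self):
        body = json.dumps({"status": "ok", "path": self.path}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def smoke_python(snippet):
    """Run a snippet the way `python -c` would; return (exit code, stdout)."""
    result = subprocess.run(
        [sys.executable, "-c", snippet], capture_output=True, text=True, timeout=30
    )
    return result.returncode, result.stdout.strip()


def smoke_http(url):
    """Fetch a JSON endpoint, roughly what `curl <url>` would print."""
    # An empty ProxyHandler makes sure local requests bypass any configured proxy.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=5) as response:
        return response.status, json.loads(response.read())


def run_demo():
    """Start the stand-in server, run both smoke tests, then shut it down."""
    server = HTTPServer(("127.0.0.1", 0), HealthHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        code, out = smoke_python("import json; print(json.dumps({'sum': 1 + 2}))")
        status, payload = smoke_http(f"http://127.0.0.1:{port}/health")
        return code, out, status, payload
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(run_demo())
```

The point is less the helpers than the habit: the check talks to a process that is actually running, so a server that crashes on startup fails here even when every unit test is green.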

The most interesting piece in the guide is Showboat, a documentation tool built around the idea of agents showing their work. It has three commands: note, exec, and image. The exec command is the clever one: it records the command, runs it, and captures the live output, making the resulting document essentially unforgeable. An agent can’t write what it hoped had happened; it captures what did happen. That distinction matters when you’re trying to verify that an implementor agent actually exercised the code rather than just reasoning about it. When manual testing turns up a failure, Willison’s loop is to fix it with red/green TDD, so the new case ends up in the permanent automated tests too.
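To make the exec idea concrete, here is a rough sketch of the recording step. It is not Showboat’s implementation or file format, neither of which is shown here; record_exec, the demo.md file name, and the fenced layout it writes are all assumptions. What it illustrates is the core property: the text in the document comes from the process that ran, not from the agent’s description of it.

```python
import shlex
import subprocess
import sys
import tempfile
from pathlib import Path

FENCE = "`" * 3  # built at runtime so this listing contains no literal fence


def record_exec(doc_path, command):
    """Run `command` (an argv list) and append it plus its real output to doc_path."""
    result = subprocess.run(command, capture_output=True, text=True, timeout=60)
    captured = (result.stdout + result.stderr).rstrip("\n")
    entry = (
        f"{FENCE}bash\n{shlex.join(command)}\n{FENCE}\n\n"
        f"{FENCE}output\n{captured}\n{FENCE}\n\n"
    )
    # Append only, so earlier evidence in the document is never rewritten.
    with Path(doc_path).open("a", encoding="utf-8") as doc:
        doc.write(entry)
    return result.returncode, captured


def run_demo():
    """Record one command into a throwaway document and return what was written."""
    with tempfile.TemporaryDirectory() as tmp:
        doc = Path(tmp) / "demo.md"  # hypothetical document name
        doc.write_text("# Demo\n\n", encoding="utf-8")
        code, captured = record_exec(doc, [sys.executable, "-c", "print(6 * 7)"])
        return code, captured, doc.read_text(encoding="utf-8")


if __name__ == "__main__":
    print(run_demo()[2])
```

A reviewer step can then read the document and see both the command and the output it produced, instead of trusting a summary.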

The self-describing CLI pattern threading through all of this is worth filing away separately. Both Rodney and Showboat are invoked via uvx, with no pre-install required, and their --help output is specifically designed to give an agent everything it needs to understand and use the tool. “Run uvx showboat --help and then create a notes/api-demo.md showboat document” is a single prompt that teaches the tool and assigns the task simultaneously. That’s a design principle for any agent-facing CLI.
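One way to apply that principle to your own tools, sketched with Python’s argparse. The notedemo CLI, its subcommands, and the example invocations are hypothetical and are not the real Showboat or Rodney interfaces; the technique is simply to put worked examples into the --help text so one command gives an agent the full picture.

```python
import argparse

# Worked examples shown at the bottom of --help (all invocations are hypothetical).
EXAMPLES = """\
examples:
  notedemo note notes/demo.md "Checked the health endpoint"
  notedemo exec notes/demo.md python -c "print(6 * 7)"
"""


def build_parser():
    """Build a parser whose help output doubles as a short tutorial."""
    parser = argparse.ArgumentParser(
        prog="notedemo",
        description="Hypothetical CLI that builds a demo document step by step.",
        epilog=EXAMPLES,
        # Keep the epilog's line breaks instead of re-wrapping them.
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    sub = parser.add_subparsers(dest="command", required=True)

    note = sub.add_parser("note", help="append a paragraph of commentary")
    note.add_argument("file", help="document to append to")
    note.add_argument("text", help="commentary to add")

    run = sub.add_parser("exec", help="run a command and append its captured output")
    run.add_argument("file", help="document to append to")
    run.add_argument("argv", nargs="+", help="command and arguments to run")
    return parser


if __name__ == "__main__":
    print(build_parser().format_help())
```

With help text like this, a prompt only has to say “run the tool with --help, then do the task”, and the usage details travel with the tool rather than with the prompt.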

Key Ideas

Tests passing ≠ code works — agents need a manual testing step in addition to automated tests
python -c, curl, and Playwright are the core manual testing primitives for different code types
Showboat’s exec command captures live command output, creating unforgeable evidence of what the agent actually ran
Rodney and agent-browser give agents full browser automation, via the Chrome DevTools Protocol and Playwright respectively
Self-describing CLIs (rich --help output plus uvx invocation) are a design pattern for agent-friendly tools: no install required, full context in one command
Issues found via manual testing should feed back into permanent automated tests via red/green TDD
The “show your work” artifact pattern creates demos that document and prove what was tested — useful for agent loop verification

Links