Coding Agents Need a Show-Your-Work Step, Not Just Green Tests
The showboat exec pattern maps directly to the agent loop reviewer step — verifiable execution artifacts prove the implementor actually ran the code
Passing tests doesn’t mean the code works. Anyone who’s shipped a green CI pipeline to a broken production knows this. Simon Willison makes the point cleanly in his Agentic Engineering Patterns guide: coding agents can execute the code they write, which makes them fundamentally different from LLMs that just spit out text — but that execution ability needs to be directed toward manual testing, not just automated test suites. Tests can all pass while the server crashes on startup, the UI element never renders, or some edge case nobody thought to cover slips through.
The practical patterns Willison catalogs: python -c for quick Python smoke tests, curl against a running dev server for JSON APIs, and Playwright or agent-browser for web UIs. The Vercel agent-browser CLI is worth noting — it’s a comprehensive Playwright wrapper designed specifically for coding agents and it already lives in Joel’s skill set. Willison’s own Rodney does something similar using the Chrome DevTools Protocol directly, with a --help output specifically engineered to teach an agent everything it needs in one shot.
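To make the curl-style check concrete, here is a minimal Python sketch of a smoke test against a live server. HealthHandler, the /health path, and smoke_test are hypothetical stand-ins for illustration, not anything taken from Willison’s guide. A real agent would point the request at the project’s own dev server instead of starting a toy one.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the application under test."""

    def do_GET(self):
        body = json.dumps({"status": "ok", "path": self.path}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Silence per-request logging so the smoke test output stays readable.
        pass


def smoke_test():
    """Start the server, make one real HTTP request, return (status, payload)."""
    # Port 0 asks the OS for any free port.
    server = HTTPServer(("127.0.0.1", 0), HealthHandler)
    port = server.server_address[1]
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    # An empty ProxyHandler keeps the request local even if proxy env vars are set.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        url = f"http://127.0.0.1:{port}/health"
        with opener.open(url, timeout=5) as response:
            status = response.status
            payload = json.loads(response.read().decode("utf-8"))
    finally:
        server.shutdown()
        server.server_close()
    return status, payload


if __name__ == "__main__":
    status, payload = smoke_test()
    print(status, payload)
```

The point of the sketch is that the check goes through a real socket and a real response, so a server that crashes on startup fails here even when every unit test is green.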
The most interesting piece in the guide is Showboat. It’s a documentation tool built around the idea of agents showing their work. It has three commands: note, exec, and image. The exec command is the clever one: it records the command and then captures its live output, making the resulting document essentially unforgeable. An agent can’t write what it hoped had happened; it captures what did happen. That distinction matters when you’re trying to verify that an implementor agent actually exercised the code rather than reasoning about it. When manual testing turns up a failure, Willison’s loop is to fix it with red/green TDD, so the new case ends up in the permanent automated tests too.
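The idea behind exec can be sketched in a few lines of Python. This is not Showboat’s implementation or its file format: record_exec and the entry layout below are assumptions made purely for illustration. What the sketch tries to show is the ordering that matters, where the tool runs the command itself and writes down whatever came back, rather than accepting output text from the agent.

```python
import os
import shlex
import subprocess
import sys
import tempfile


def record_exec(doc_path, argv):
    """Run argv, then append the command and its captured output to doc_path.

    Hypothetical helper: the agent supplies only the command, never the output.
    """
    result = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave stderr so failures are recorded too
        text=True,
        check=False,  # a failing command is still evidence worth recording
    )
    entry = (
        "Command: " + shlex.join(argv) + "\n"
        + "Exit code: " + str(result.returncode) + "\n"
        + "Output:\n" + result.stdout + "\n"
    )
    with open(doc_path, "a", encoding="utf-8") as doc:
        doc.write(entry)
    return result.returncode, result.stdout


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.md")
    record_exec(demo_path, [sys.executable, "-c", "print(2 + 2)"])
    with open(demo_path, encoding="utf-8") as doc:
        print(doc.read())
```

A reviewer step in an agent loop could then read the document and check exit codes and output, instead of trusting a summary written by the implementor.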
The self-describing CLI pattern threading through all of this is worth filing away separately. Both Rodney and Showboat are invoked via uvx, with no pre-install required, and their --help output is specifically designed to give an agent everything it needs to understand and use the tool. “Run uvx showboat --help and then create a notes/api-demo.md showboat document” is a single prompt that teaches the tool and assigns the task simultaneously. That’s a design principle for any agent-facing CLI.
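As a rough sketch of what “self-describing” can mean in practice, the Python below builds an argparse parser whose help text carries worked examples along with the flag list. The tool name demo-notes, its subcommands, and the example lines are all hypothetical; they are not the actual interface of Showboat or Rodney.

```python
import argparse

# Worked examples live in the help text itself, so one --help call is enough
# for an agent to learn the workflow. All names here are hypothetical.
EPILOG = """\
examples:
  demo-notes note demo.md "Checked the health endpoint"
  demo-notes run demo.md "curl -s http://localhost:8000/health"

workflow:
  1. add a note describing what you are about to verify
  2. use run so the real output is captured in the document
"""


def build_parser():
    parser = argparse.ArgumentParser(
        prog="demo-notes",
        description="Build a document that records what was actually run.",
        epilog=EPILOG,
        # Keep the epilog's line breaks instead of re-wrapping them.
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    sub = parser.add_subparsers(dest="action", required=True)

    note = sub.add_parser("note", help="append a commentary paragraph")
    note.add_argument("file", help="document to append to")
    note.add_argument("text", help="commentary to record")

    run = sub.add_parser("run", help="run a command and record its output")
    run.add_argument("file", help="document to append to")
    run.add_argument("command_line", help="command to run, quoted as one argument")
    return parser


if __name__ == "__main__":
    print(build_parser().format_help())
```

The design choice worth copying is that the help output reads like a short tutorial, so the prompt only has to say “run --help, then do the task”.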
Key Ideas
- Tests passing ≠ code works — agents need a manual testing step in addition to automated tests
- python -c, curl, and Playwright are the core manual testing primitives for different code types
- Showboat’s exec command captures live command output, creating unforgeable evidence of what the agent actually ran
- Rodney and agent-browser give agents full browser automation via Chrome DevTools Protocol / Playwright
- Self-describing CLIs via --help + uvx invocation are a design pattern for agent-friendly tools: no install required, full context in one command
- Issues found via manual testing should feed back into permanent automated tests via red/green TDD
- The “show your work” artifact pattern creates demos that document and prove what was tested — useful for agent loop verification
Links
- Agentic Manual Testing (Agentic Engineering Patterns) — source article
- Agentic Engineering Patterns — Willison’s full guide
- Showboat on GitHub — agent testing documentation tool
- Showboat help.txt — the full help text designed to teach agents
- Rodney on GitHub — Chrome DevTools Protocol browser automation for agents
- Rodney help.txt — agent-readable help
- agent-browser by Vercel — comprehensive Playwright CLI for coding agents
- Playwright — Microsoft’s browser automation library
- uvx (uv tools) — run tools without pre-installing them
- Simon Willison — author, creator of Datasette
- Red/Green TDD (Agentic Engineering Patterns) — related pattern in the same guide