Structured Exit Reasons as an Agent Retry Policy Primitive
The exit-reason enum maps directly to the workload rig's dagWorker retry logic: permission_blocked and budget_exceeded should escalate, not retry blindly.
andrewmcoupe/orchestrator is a self-hosted coding orchestrator that wraps Claude Code in a full task lifecycle: plan → implement → test → audit → merge. It’s TypeScript, SQLite for the event store, git worktrees for task isolation, and a Bun-based server. Small, local, no cloud runtime dependency.
The architecture is clean event-sourced coordination — every state transition appends an event, projections are derived views. What makes it worth studying is the PRD for exit classification: when a Claude Code subprocess exits, it doesn’t just record "success" or "failed". It classifies why — a structured enum of "normal" | "timeout" | "budget_exceeded" | "turn_limit" | "permission_blocked" | "killed" | "schema_invalid" | "network_error" | "crashed" | "unknown". That classification drives the retry policy. permission_blocked escalates to human. timeout and network_error retry the same phase. budget_exceeded stops and asks. The retry strategy is a function of exit reason, not just success/failure.
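A minimal sketch of "retry strategy as a function of exit reason" might look like the following. The enum values and the three retry actions come from the description here; the `none` action, the `nextAction` helper, the attempt cap, and every mapping marked as an assumption are hypothetical and not taken from the orchestrator's code.

```typescript
// Exit reasons as listed in the text above.
type ExitReason =
  | "normal" | "timeout" | "budget_exceeded" | "turn_limit"
  | "permission_blocked" | "killed" | "schema_invalid"
  | "network_error" | "crashed" | "unknown";

// "none" is a hypothetical addition for clean exits.
type RetryAction = "retry_same" | "retry_different" | "escalate_to_human" | "none";

const onExitReason: Record<ExitReason, RetryAction> = {
  normal: "none",
  // Stated in the text: transient failures retry the same phase.
  timeout: "retry_same",
  network_error: "retry_same",
  // Stated in the text: these need a human decision.
  permission_blocked: "escalate_to_human",
  budget_exceeded: "escalate_to_human",
  // Assumptions for illustration only.
  turn_limit: "retry_different",
  schema_invalid: "retry_different",
  crashed: "retry_same",
  killed: "escalate_to_human",
  unknown: "escalate_to_human",
};

// Hypothetical helper: cap automatic retries so a flaky phase
// eventually reaches a human instead of looping forever.
function nextAction(reason: ExitReason, attempt: number, maxAttempts = 3): RetryAction {
  const action = onExitReason[reason];
  const isRetry = action === "retry_same" || action === "retry_different";
  return isRetry && attempt >= maxAttempts ? "escalate_to_human" : action;
}
```

Because `Record<ExitReason, RetryAction>` is exhaustive, adding a new exit reason without deciding its policy becomes a compile error, which is the main appeal of driving retries from a typed enum.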
The permission-hang detection is particularly sharp: monitor the stderr stream for "Waiting for permission", and if no stream-json events arrive for 10 seconds, the process is hung — not working. Kill it, classify as permission_blocked, record which tool triggered it. This is the kind of operational detail that only surfaces after you’ve watched an agent spin for five minutes on a permission prompt and wondered what happened. The phaseRunner also has a dependency reactor that listens for task.merged events and unblocks downstream tasks when all their dependencies clear — pure event-driven DAG execution without a separate scheduler.
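The hang detection described above can be sketched as a small state machine with an injected clock. This is an illustration under assumptions, not the project's implementation: the `PermissionHangDetector` class, the `StreamEvent` shape, and the rule that any stream-json event clears a pending prompt are all hypothetical. Only the "Waiting for permission" marker, the 10-second window, and the `permission_blocked_on` field name come from the text.

```typescript
// Hypothetical shape of a parsed stream-json event.
interface StreamEvent { type: string; toolName?: string; }

interface HangVerdict {
  exitReason: "permission_blocked";
  permission_blocked_on: string | null; // tool that triggered the prompt, if known
}

class PermissionHangDetector {
  private readonly quietMs: number;
  private promptSeenAt: number | null = null;
  private lastTool: string | null = null;

  constructor(quietMs = 10_000) {
    this.quietMs = quietMs;
  }

  // Feed every stderr chunk; start the clock when the prompt marker appears.
  onStderr(chunk: string, now: number): void {
    if (this.promptSeenAt === null && chunk.includes("Waiting for permission")) {
      this.promptSeenAt = now;
    }
  }

  // Feed every stream-json event. Assumption: any event means the process
  // is making progress, so a pending prompt is treated as resolved.
  onStreamEvent(event: StreamEvent, _now: number): void {
    if (event.toolName) this.lastTool = event.toolName;
    this.promptSeenAt = null;
  }

  // Poll periodically; a non-null result means "kill and classify".
  check(now: number): HangVerdict | null {
    if (this.promptSeenAt === null) return null;
    if (now - this.promptSeenAt < this.quietMs) return null;
    return { exitReason: "permission_blocked", permission_blocked_on: this.lastTool };
  }
}
```

In a real runner, `check` would be called from a timer with `Date.now()`, and a non-null verdict would trigger killing the subprocess before recording the exit reason.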
The git worktree isolation pattern is worth stealing independently of anything else. Each task gets <repo>/.orchestrator-worktrees/<task_id> — a full checkout on its own branch. Crash recovery on server restart runs git reset --hard HEAD && git clean -fd across all worktrees to discard in-flight uncommitted changes without touching completed work. That’s a clean boundary: committed state survives restarts; in-flight state is expendable. The Firecracker microVM approach in the workload rig solves the same isolation problem differently — this is the lightweight git-native version.
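The restart-time recovery pass can be sketched as below. The worktree path layout and the two git commands come from the text; `recoverWorktrees`, the `GitRunner` callback, and the choice to collect failures rather than abort are assumptions made for illustration. The git invocation is injected so the sketch stays free of runtime-specific imports.

```typescript
// Hypothetical callback that runs `git <args>` inside `cwd` and throws on failure.
type GitRunner = (args: string[], cwd: string) => void;

// The two commands named in the text: discard tracked and untracked changes.
const RECOVERY_STEPS: readonly (readonly string[])[] = [
  ["reset", "--hard", "HEAD"],
  ["clean", "-fd"],
];

// Path layout described in the text: <repo>/.orchestrator-worktrees/<task_id>
function worktreePath(repo: string, taskId: string): string {
  return `${repo}/.orchestrator-worktrees/${taskId}`;
}

// Resets every worktree to its last commit. Committed work is untouched;
// only uncommitted, in-flight changes are discarded.
// Returns the task ids whose recovery failed (an assumption: keep going
// rather than abort the whole restart on one bad worktree).
function recoverWorktrees(repo: string, taskIds: string[], runGit: GitRunner): string[] {
  const failed: string[] = [];
  for (const taskId of taskIds) {
    const cwd = worktreePath(repo, taskId);
    try {
      for (const args of RECOVERY_STEPS) runGit([...args], cwd);
    } catch {
      failed.push(taskId);
    }
  }
  return failed;
}
```

One way to wire this up on Node or Bun would be a runner such as `(args, cwd) => execFileSync("git", args, { cwd })` using `execFileSync` from `node:child_process`, with the task ids read from the worktree directory listing.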
Key Ideas
- Exit reason enum: 10-value structured enum classifies every subprocess exit; `on_exit_reason` map in retry policy config routes each to `retry_same`, `retry_different`, or `escalate_to_human`
- Permission-hang detection: active stderr monitoring kills the process after 10s without new stream-json events; records `permission_blocked_on` with the tool name
- Stdout/stderr tail capture: last ~4KB of both streams stored in the blob store on every invocation; hash referenced on `invocation.completed` for post-mortem debugging
- Event-sourced SQLite: `appendAndProject` is the single write path; projections are derived views rebuilt from the event log; no ORM, just typed queries
- Git worktree per task: isolation without containers; crash recovery via `git reset --hard` on restart; `auto_delete_worktree` config cleans up after merge
- Dependency reactor: listens on `event.committed`; unblocks dependent tasks when all deps reach `merged` status; emits `task.dependency.warning` on terminal failures
- Auditor phase: separate Claude API call with structured JSON output (verdict: `approve | revise | reject`, confidence score, typed concerns); auto-merge gated on auditor approval + all gates passing
- Shadow mode for auto-merge: emits advisory events only, doesn’t merge; safe canary mode before enabling live auto-merge
- A/B testing scaffold: `server/ab/` module for assignment and stats; the system runs experiments on its own behavior
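The dependency reactor idea, unblocking a task once all of its dependencies are merged, can be sketched as a pure function over task state. The `merged` status and the `task.dependency.warning` event name come from the text; the `Task` shape, the other status names, the `task.unblocked` event, and `reactToCommit` itself are hypothetical.

```typescript
// Hypothetical status set; only "merged" is named in the text.
type TaskStatus = "blocked" | "ready" | "running" | "merged" | "failed";

interface Task { id: string; status: TaskStatus; dependsOn: string[]; }

type ReactorEvent =
  | { type: "task.unblocked"; taskId: string } // hypothetical event name
  | { type: "task.dependency.warning"; taskId: string; failedDependency: string };

// Called after an event about `changedTaskId` is committed. Returns the
// events the reactor would append; marks newly unblocked tasks as "ready".
function reactToCommit(tasks: Map<string, Task>, changedTaskId: string): ReactorEvent[] {
  const changed = tasks.get(changedTaskId);
  if (!changed) return [];
  const out: ReactorEvent[] = [];
  tasks.forEach((task) => {
    // Only blocked tasks that depend on the changed task are affected.
    if (task.status !== "blocked" || !task.dependsOn.includes(changedTaskId)) return;
    if (changed.status === "failed") {
      // A terminal failure upstream: warn instead of unblocking.
      out.push({ type: "task.dependency.warning", taskId: task.id, failedDependency: changedTaskId });
      return;
    }
    const allMerged = task.dependsOn.every((dep) => tasks.get(dep)?.status === "merged");
    if (allMerged) {
      task.status = "ready";
      out.push({ type: "task.unblocked", taskId: task.id });
    }
  });
  return out;
}
```

Since the function only reads current task state and returns events, it fits the event-sourced design: no separate scheduler loop is needed, and the same logic can be replayed in tests against a plain `Map`.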