Structured Exit Reasons as an Agent Retry Policy Primitive

repo · ai · agent-loops · typescript · infrastructure · event-sourcing · claude-code · workflow

exit-reason enum maps directly to the workload rig's dagWorker retry logic — permission_blocked and budget_exceeded should escalate, not retry blindly

andrewmcoupe/orchestrator is a self-hosted coding orchestrator that wraps Claude Code in a full task lifecycle: plan → implement → test → audit → merge. It’s TypeScript, SQLite for the event store, git worktrees for task isolation, and a Bun-based server. Small, local, no cloud runtime dependency.

The architecture is clean event-sourced coordination — every state transition appends an event, projections are derived views. What makes it worth studying is the PRD for exit classification: when a Claude Code subprocess exits, it doesn’t just record "success" or "failed". It classifies why — a structured enum of "normal" | "timeout" | "budget_exceeded" | "turn_limit" | "permission_blocked" | "killed" | "schema_invalid" | "network_error" | "crashed" | "unknown". That classification drives the retry policy. permission_blocked escalates to human. timeout and network_error retry the same phase. budget_exceeded stops and asks. The retry strategy is a function of exit reason, not just success/failure.
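To make the routing concrete, here is a minimal TypeScript sketch of a retry policy keyed on exit reason. The ten reasons and the four mappings shown come from the description above; the `RetryPolicy` shape, the `default_action` fallback, the `max_attempts` cap, and the `decide` function are assumptions for illustration, not the project's actual code.

```typescript
// The ten exit reasons listed above, as a string-literal union.
type ExitReason =
  | "normal" | "timeout" | "budget_exceeded" | "turn_limit"
  | "permission_blocked" | "killed" | "schema_invalid"
  | "network_error" | "crashed" | "unknown";

type RetryAction = "retry_same" | "retry_different" | "escalate_to_human";

// Assumed config shape: only on_exit_reason is named in the write-up.
interface RetryPolicy {
  on_exit_reason: Partial<Record<ExitReason, RetryAction>>;
  default_action: RetryAction; // assumption: fallback for unmapped reasons
  max_attempts: number;        // assumption: cap so retries cannot loop forever
}

const policy: RetryPolicy = {
  on_exit_reason: {
    timeout: "retry_same",
    network_error: "retry_same",
    permission_blocked: "escalate_to_human",
    budget_exceeded: "escalate_to_human",
  },
  default_action: "escalate_to_human",
  max_attempts: 3,
};

type Decision = { kind: "done" } | { kind: "act"; action: RetryAction };

// The next step is a function of the exit reason, not of a success/failure bit.
function decide(reason: ExitReason, attempt: number, p: RetryPolicy): Decision {
  if (reason === "normal") return { kind: "done" };
  const action = p.on_exit_reason[reason] ?? p.default_action;
  // Once the attempt budget is spent, hand off to a human instead of retrying.
  if (action !== "escalate_to_human" && attempt >= p.max_attempts) {
    return { kind: "act", action: "escalate_to_human" };
  }
  return { kind: "act", action };
}
```

Defaulting unmapped reasons to escalation is the conservative choice here: a reason nobody planned for is more likely to need a person than another blind attempt.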

The permission-hang detection is particularly sharp: monitor the stderr stream for "Waiting for permission", and if no stream-json events arrive for 10 seconds, the process is hung — not working. Kill it, classify as permission_blocked, record which tool triggered it. This is the kind of operational detail that only surfaces after you’ve watched an agent spin for five minutes on a permission prompt and wondered what happened. The phaseRunner also has a dependency reactor that listens for task.merged events and unblocks downstream tasks when all their dependencies clear — pure event-driven DAG execution without a separate scheduler.
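The permission-hang check can be sketched as a small state machine that is fed stderr chunks and stream-json events and is polled with the current time. This is a rough sketch under stated assumptions: the class name, the method names, and the idea that the tool name is taken from the most recent stream event are all hypothetical, and it assumes any new stream event means the prompt was resolved. Only the "Waiting for permission" marker, the 10-second window, and the `permission_blocked_on` field come from the description.

```typescript
interface HangVerdict {
  exitReason: "permission_blocked";
  permission_blocked_on: string | null; // tool that triggered the prompt, if known
}

class PermissionHangDetector {
  private readonly quietMs: number;
  private waitingSince: number | null = null;
  private lastTool: string | null = null;

  constructor(quietMs: number = 10000) {
    this.quietMs = quietMs;
  }

  // Feed every stderr chunk; start the clock when the prompt marker appears.
  onStderr(chunk: string, now: number): void {
    if (this.waitingSince === null && chunk.includes("Waiting for permission")) {
      this.waitingSince = now;
    }
  }

  // Feed every stream-json event. Assumption: an event means the process is
  // making progress again, so the pending wait is cleared.
  onStreamEvent(now: number, toolName?: string): void {
    this.waitingSince = null;
    if (toolName !== undefined) this.lastTool = toolName;
  }

  // Poll periodically. A non-null result means: kill the process and record this.
  check(now: number): HangVerdict | null {
    if (this.waitingSince === null) return null;
    if (now - this.waitingSince < this.quietMs) return null;
    return { exitReason: "permission_blocked", permission_blocked_on: this.lastTool };
  }
}
```

Passing `now` in explicitly, rather than reading the clock inside the class, keeps the detector deterministic and easy to test; a caller would typically drive `check` from a timer and kill the subprocess when it returns a verdict.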

The git worktree isolation pattern is worth stealing independently of anything else. Each task gets <repo>/.orchestrator-worktrees/<task_id> — a full checkout on its own branch. Crash recovery on server restart runs git reset --hard HEAD && git clean -fd across all worktrees to discard in-flight uncommitted changes without touching completed work. That’s a clean boundary: committed state survives restarts; in-flight state is expendable. The Firecracker microVM approach in the workload rig solves the same isolation problem differently — this is the lightweight git-native version.
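The crash-recovery pass could look roughly like the sketch below. The worktree path layout and the two git commands come from the description above; the function names and the injected `exec` callback are hypothetical, chosen so the logic can be exercised without touching a real repository.

```typescript
type Argv = string[];
// Injected command runner (hypothetical): runs argv in cwd, throws on non-zero exit.
type Exec = (cwd: string, argv: Argv) => void;

// Path layout from the write-up: <repo>/.orchestrator-worktrees/<task_id>
function worktreePath(repoRoot: string, taskId: string): string {
  return `${repoRoot}/.orchestrator-worktrees/${taskId}`;
}

// Equivalent of `git reset --hard HEAD && git clean -fd`.
function recoveryCommands(): Argv[] {
  return [
    ["git", "reset", "--hard", "HEAD"], // drop uncommitted changes to tracked files
    ["git", "clean", "-fd"],            // remove untracked files and directories
  ];
}

// Discard in-flight state in every worktree; commits on each branch are untouched.
function recoverWorktrees(repoRoot: string, taskIds: string[], exec: Exec): string[] {
  const recovered: string[] = [];
  for (const taskId of taskIds) {
    const cwd = worktreePath(repoRoot, taskId);
    // Because exec throws on failure, clean only runs if reset succeeded,
    // which mirrors the && in the shell form.
    for (const argv of recoveryCommands()) exec(cwd, argv);
    recovered.push(cwd);
  }
  return recovered;
}
```

In a real server, `exec` could be a thin wrapper around Node's `execFileSync(argv[0], argv.slice(1), { cwd })` from `node:child_process`. Note that `git clean -fd` leaves ignored files in place; removing those as well would need `-x`, which the description does not mention.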

Key Ideas

  • Exit reason enum — 10-value structured enum classifies every subprocess exit; on_exit_reason map in retry policy config routes each to retry_same, retry_different, or escalate_to_human
  • Permission-hang detection — active stderr monitoring kills the process after 10s without new stream-json events; records permission_blocked_on with the tool name
  • Stdout/stderr tail capture — last ~4KB of both streams stored in the blob store on every invocation; hash referenced on invocation.completed for post-mortem debugging
  • Event-sourced SQLite — appendAndProject is the single write path; projections are derived views rebuilt from the event log; no ORM, just typed queries
  • Git worktree per task — isolation without containers; crash recovery via git reset --hard on restart; auto_delete_worktree config cleans up after merge
  • Dependency reactor — listens on event.committed; unblocks dependent tasks when all deps reach merged status; emits task.dependency.warning on terminal failures
  • Auditor phase — separate Claude API call with structured JSON output (verdict: approve | revise | reject, confidence score, typed concerns); auto-merge gated on auditor approval + all gates passing
  • Shadow mode for auto-merge — emits advisory events only, doesn’t merge; safe canary mode before enabling live auto-merge
  • A/B testing scaffold — server/ab/ module for assignment and stats; the system runs experiments on its own behavior
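
The dependency reactor from the list above is the piece that replaces a scheduler, so a small sketch may help. This is an in-memory illustration, not the project's implementation: `task.merged` and `task.dependency.warning` are event names from the write-up, while `task.failed`, `task.unblocked`, the class name, and the dependency-map shape are hypothetical stand-ins.

```typescript
type TaskEvent =
  | { type: "task.merged"; taskId: string }
  | { type: "task.failed"; taskId: string }    // hypothetical terminal-failure event
  | { type: "task.unblocked"; taskId: string } // hypothetical name
  | { type: "task.dependency.warning"; taskId: string; failedDep: string };

class DependencyReactor {
  private readonly deps: Record<string, string[]>; // task id -> ids it depends on
  private readonly merged = new Set<string>();
  private readonly unblocked = new Set<string>();

  constructor(deps: Record<string, string[]>) {
    this.deps = deps;
  }

  // Called for every committed event; returns follow-up events to append.
  onEventCommitted(ev: TaskEvent): TaskEvent[] {
    const out: TaskEvent[] = [];
    if (ev.type === "task.merged") {
      this.merged.add(ev.taskId);
      for (const [task, deps] of Object.entries(this.deps)) {
        if (this.unblocked.has(task) || !deps.includes(ev.taskId)) continue;
        // Unblock only when every dependency has merged.
        if (deps.every((d) => this.merged.has(d))) {
          this.unblocked.add(task);
          out.push({ type: "task.unblocked", taskId: task });
        }
      }
    } else if (ev.type === "task.failed") {
      // A terminal failure upstream cannot unblock anything; warn dependents.
      for (const [task, deps] of Object.entries(this.deps)) {
        if (deps.includes(ev.taskId)) {
          out.push({ type: "task.dependency.warning", taskId: task, failedDep: ev.taskId });
        }
      }
    }
    return out;
  }
}
```

Returning the follow-up events instead of writing them directly keeps the reactor a pure function of the event stream; in an event-sourced setup the caller would feed them back through the single write path.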