ADR-0093 (shipped)

Agent-Friendly Navigation Contract (AGENT-FIRST 30)

2026-02-22T00:00:00.000Z

Context

JoelClaw has strong capabilities (Inngest workflows, gateway, memory, OTEL, CLI surfaces), but agent UX is still inconsistent:

Discovery is fragmented across commands, skills, and ADRs.
Command contracts are mostly consistent but not enforced by a single CI gate.
Recovery from failures is not uniformly deterministic (error -> runbook -> verify -> rollback).
Memory/context routing is implemented in several places, but still uneven across decision-heavy functions.
Navigation can be surprising (example: joelclaw search against otel_events can fail with embedded params parsing errors while joelclaw otel succeeds).

Goal: make joelclaw extremely agent-friendly, navigable, predictable, and obvious.

Decision

Adopt a 30-day architecture program in three strict phases:

1. CLI contracts & discovery (foundation)
2. Deterministic error runbooks & recovery
3. Memory/context routing standardization

The order is mandatory: no broad memory routing expansion until command contracts and recovery behaviors are deterministic.

Update (2026-02-22)

Phase 1 kickoff started.

AF30-001 implemented in monorepo:
- scripts/validate-cli-contracts.ts added (baseline drift validator)
- docs/agent-contracts/phase1-baseline.json generated
- envelope helpers/validator added in packages/cli/src/response.ts
- contract test added at packages/cli/src/commands/contract-envelope.test.ts
- root script validate:cli-contracts added in package.json
- response output aligned to JSON-only envelope (TOON result mode removed); legacy --json/--toon accepted as no-op for compatibility
AF30-001 validation passed:
- bun run validate:cli-contracts
- bun test packages/cli/src/commands/contract-envelope.test.ts
- cd packages/cli && bun run check-types
AF30-002 implemented in monorepo:
- packages/cli/src/commands/capabilities.ts added (goal-oriented discoverability command)
- packages/cli/src/commands/capabilities.test.ts added
- docs/agent-contracts/capabilities-map.md added
- root CLI wiring updated in packages/cli/src/cli.ts
- baseline refreshed at docs/agent-contracts/phase1-baseline.json
AF30-002 validation passed:
- cd packages/cli && bun src/cli.ts capabilities
- bun test packages/cli/src/commands/capabilities.test.ts
- bun run validate:cli-contracts
AF30-003 implemented in monorepo:
- packages/cli/src/commands/search.ts hardened for deterministic collection selection
- packages/cli/src/commands/search.test.ts added (predictability + semantic guard coverage)
- .github/workflows/agent-contracts.yml added (contract CI gate)
- otel_events now supported as a first-class collection in joelclaw search
AF30-003 validation passed:
- cd packages/cli && bun src/cli.ts search "telegram.callback.received" --collection otel_events --limit 5
- bun test packages/cli/src/commands/search.test.ts
- bun run validate:cli-contracts
Phase-1 core command hardening completed for high-traffic set:
- send.ts: invalid JSON now returns respondError (INVALID_JSON + fix)
- runs.ts: --compact now returns terse JSON rows (no plain-text output)
- run next actions now use CLI-first log commands (no raw tail/docker logs suggestions)
- gateway.ts: invalid payload now returns respondError (INVALID_JSON + fix)
- status.ts and otel.ts revalidated against JSON envelope contract
- baseline refreshed: docs/agent-contracts/phase1-baseline.json (rawOutputCommandFiles: 5)
Core hardening validation passed:
- cd packages/cli && bun src/cli.ts runs -c --count 3 --hours 1
- cd packages/cli && bun src/cli.ts send test.event --data '{bad json}'
- cd packages/cli && bun src/cli.ts gateway push --type test --payload '{bad json}'
- cd packages/cli && bun src/cli.ts status
- cd packages/cli && bun src/cli.ts otel
- bun run validate:cli-contracts
Phase-2 scaffold implemented (dry-run-first recovery):
- packages/cli/src/error-codes.ts added (canonical code list + normalizer)
- packages/cli/src/runbooks.ts added (runbook registry + placeholder resolution)
- packages/cli/src/commands/recover.ts added (recover list, dry-run preview, --execute phase execution)
- packages/cli/src/commands/recover.test.ts and packages/cli/src/commands/runbooks.test.ts added
- root CLI wiring updated in packages/cli/src/cli.ts (joelclaw recover ...)
- capabilities map extended with deterministic recovery flow
- baseline refreshed: docs/agent-contracts/phase1-baseline.json (Commands scanned: 27)
Phase-2 scaffold validation passed:
- cd packages/cli && bun src/cli.ts recover list
- cd packages/cli && bun src/cli.ts recover TYPESENSE_UNREACHABLE --phase fix --context '{"run-id":"01TEST"}'
- cd packages/cli && bun src/cli.ts recover BAD_CODE
- bun test packages/cli/src/commands/recover.test.ts packages/cli/src/commands/runbooks.test.ts
- bun run validate:cli-contracts
Phase-2 completion tranche implemented (top-20 coverage + recovery wiring):
- expanded runbook registry to canonical top 20 error codes in packages/cli/src/runbooks.ts
- enforced runbook completeness (diagnose, fix, verify, rollback all non-empty) via runbooks.test.ts
- respondError and stream emitError now auto-append recover next action when a runbook exists (packages/cli/src/response.ts, packages/cli/src/stream.ts)
- o11y auto-fix handlers now declare runbook mapping metadata (packages/system-bus/src/observability/auto-fixes/index.ts)
- o11y triage emits runbook-backed recovery hints (recoverCommand, runbookCommands) using shared resolver (packages/system-bus/src/observability/recovery-runbooks.ts, packages/system-bus/src/inngest/functions/o11y-triage.ts)
Phase-2 completion tranche validation passed:
- cd packages/cli && bun run check-types
- bun test packages/cli/src/commands/contract-envelope.test.ts packages/cli/src/commands/recover.test.ts packages/cli/src/commands/runbooks.test.ts packages/cli/src/commands/capabilities.test.ts
- cd packages/cli && bun src/cli.ts send test.event --data '{bad json}'
- cd packages/cli && bun src/cli.ts recover list
- cd packages/cli && bun src/cli.ts recover MEMORY_HEALTH_FAILED --phase rollback
- bun run validate:cli-contracts
- cd packages/system-bus && bunx tsc --noEmit
Phase-2 o11y alignment tranche implemented:
- focused integration test added: packages/system-bus/src/inngest/functions/o11y-triage.test.ts asserts auto_fix.applied metadata includes runbookCode + recoverCommand
- shared runbook event resolver expanded: packages/system-bus/src/observability/recovery-runbooks.ts (resolveRunbookPlanForEvent, normalized code fallback)
- tier2 escalation payloads now include runbook metadata (runbookCode, runbookPhase, recoverCommand, runbookCommands) in session/observation.noted
- tier3 escalation context now carries runbook metadata through Todoist description, Telegram message/payload, and OTEL telemetry (triage.telegram_sent, triage.telegram_rate_limited, triage.escalated)
Phase-2 o11y alignment validation passed:
- bun test packages/system-bus/src/inngest/functions/o11y-triage.test.ts
- cd packages/system-bus && bunx tsc --noEmit
- live trigger (no dedicated CLI invoke surface yet): POST /v0/gql invokeFunction(functionSlug: "system-bus-host-check/o11y-triage")
- joelclaw otel search "auto_fix.applied" --hours 1 shows metadata keys including runbookCode, runbookPhase, recoverCommand, runbookCommands
- joelclaw otel search "joelclaw recover" --hours 1 returns the emitted auto_fix.applied event, confirming runbook recovery command is queryable
Phase-2 CLI path-hardening follow-up implemented:
- added compatibility subcommand joelclaw inngest sync-worker [--restart] [--wait-ms] to align with the operational command contract in the AGENTS docs
- Inngest.health worker probing now uses resilient endpoint fallback ($INNGEST_WORKER_URL, $INNGEST_WORKER_URL/health, $INNGEST_WORKER_URL/api/inngest) and robust response parsing, which prevents transient false "worker unreachable" errors
- joelclaw inngest restart-worker and joelclaw inngest sync-worker --restart now include active-run guards: restarts are skipped when RUNNING/QUEUED runs exist unless --force is passed
Phase-2 CLI path-hardening validation passed:
- cd packages/cli && bunx tsc --noEmit -p tsconfig.json
- cd packages/cli && bun src/cli.ts inngest sync-worker --help
- joelclaw send system/network.update -d '{"source":"restart-guard-test"}'
- joelclaw inngest sync-worker --restart (expected: restartSkippedDueToActiveRuns: true while runs active)
- joelclaw inngest restart-worker (expected: skippedDueToActiveRuns: true while runs active)
- joelclaw status
- joelclaw inngest status
Phase-2 aggregate log analysis follow-up implemented:
- joelclaw logs analyze added to aggregate worker stdout, worker stderr, and Inngest server logs into severity/source/component/action rollups with top signatures and sample lines.
- joelclaw langfuse aggregate added for project-level cloud LLM trace rollups (cost/latency/signature trends) with project URL/ID targeting.
- new helper tests added at packages/cli/src/commands/logs.test.ts and packages/cli/src/commands/langfuse.test.ts.
Phase-2 aggregate log analysis validation passed:
- bun test packages/cli/src/commands/logs.test.ts packages/cli/src/commands/langfuse.test.ts
- cd packages/cli && bunx tsc --noEmit -p tsconfig.json
- cd packages/cli && bun src/cli.ts logs analyze --lines 80
- cd packages/cli && bun src/cli.ts langfuse aggregate --hours 24 --bucket-minutes 60 --max-traces 300 --project-url https://us.cloud.langfuse.com/project/cmlx4cd4901lyad07ih16f95i/
Phase-2 invoke + finalization hardening implemented:
- added joelclaw inngest invoke <function-slug> with deterministic wait/poll behavior and dispatch modes (auto|event|invoke) in packages/cli/src/commands/inngest.ts
- auto dispatch prefers EVENT triggers when present (CLI-first path, no raw GQL in operator flow)
- added explicit manual trigger for triage function (check/o11y-triage.requested) in packages/system-bus/src/inngest/functions/o11y-triage.ts
- addressed invoked-run finalization instability by syncing active host-worker code and re-registering worker functions (eliminated repeated Unable to reach SDK URL during validation window)
Phase-2 invoke + finalization hardening validation passed:
- cd packages/cli && bun run check-types
- cd packages/system-bus && bunx tsc --noEmit
- bun test packages/system-bus/src/inngest/functions/o11y-triage.test.ts
- cd packages/cli && bun src/cli.ts inngest invoke system-bus-host-check/o11y-triage --data '{"reason":"cli invoke event route"}' --wait-ms 90000
- cd packages/cli && bun src/cli.ts inngest invoke system-bus-host-check/o11y-triage --mode invoke --data '{"reason":"invoke mode regression check"}' --wait-ms 90000
- joelclaw logs server --lines 200 --grep 'Unable to reach SDK URL' (0 matches)
Policy validator consolidation implemented (single shared workflow):
- .github/workflows/agent-contracts.yml now runs:
  - CLI contract baseline + envelope/capabilities/search contract tests + CLI typecheck
  - LLM observability guard (validate:llm-observability-guards)
  - legacy worker-clone reference guard (validate:no-legacy-worker-clone)
- removed dedicated workflows:
  - .github/workflows/llm-observability-guards.yml
  - .github/workflows/legacy-worker-clone-guard.yml
Policy validator consolidation validation passed:
- bun run validate:cli-contracts
- bun run validate:llm-observability-guards
- bun run validate:no-legacy-worker-clone
Shared policy contract documentation added:
- docs/agent-contracts/README.md now defines the canonical shared workflow contract and expected validator set to prevent scope drift.
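
The active-run guard described in the path-hardening follow-up above can be sketched as a small pure function. This is a minimal illustration, not the code in packages/cli/src/commands/inngest.ts: the function name, the RestartDecision shape, and the status values other than RUNNING and QUEUED are assumptions. Only the skippedDueToActiveRuns flag and the --force override come from this ADR.

```typescript
// Hypothetical status set; only RUNNING and QUEUED are named in this ADR.
type RunStatus = "RUNNING" | "QUEUED" | "COMPLETED" | "FAILED" | "CANCELLED";

// Hypothetical decision shape; skippedDueToActiveRuns mirrors the field
// reported by joelclaw inngest restart-worker.
type RestartDecision = {
  restart: boolean;
  skippedDueToActiveRuns: boolean;
  activeRunCount: number;
};

// Decide whether a worker restart may proceed.
// A restart is skipped when any run is RUNNING or QUEUED, unless force is set.
function decideRestart(statuses: RunStatus[], force: boolean): RestartDecision {
  const activeRunCount = statuses.filter(
    (s) => s === "RUNNING" || s === "QUEUED",
  ).length;
  const skipped = activeRunCount > 0 && !force;
  return {
    restart: !skipped,
    skippedDueToActiveRuns: skipped,
    activeRunCount,
  };
}
```

Keeping the decision separate from the restart side effect makes the guard testable without a live worker.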

Design Contract

CLI-first discoverability: an agent must find the right next command in 1 call.
Stable command envelope: all commands conform to the canonical response schema.
Deterministic failure handling: every known failure maps to machine-readable runbook steps.
Bounded context injection: memory prefetch is policy-driven, traceable, and budgeted.
No silent failure: all degraded/failure paths emit OTEL with actionable metadata.
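
As a rough illustration of the stable command envelope, the sketch below checks a parsed response against the four fields this ADR names (ok, command, result|error, next_actions). The NextAction and error field names (description, code, message) and the validateEnvelope function are assumptions for illustration. The canonical schema lives in packages/cli/src/response.ts and may differ.

```typescript
// Hypothetical shapes inferred from the fields named in this ADR.
type NextAction = { command: string; description: string };

type Envelope =
  | { ok: true; command: string; result: unknown; next_actions: NextAction[] }
  | {
      ok: false;
      command: string;
      error: { code: string; message: string };
      next_actions: NextAction[];
    };

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Returns a list of contract violations; an empty list means the value conforms.
function validateEnvelope(value: unknown): string[] {
  if (!isRecord(value)) return ["envelope must be a JSON object"];
  const problems: string[] = [];
  if (typeof value.ok !== "boolean") problems.push("ok must be a boolean");
  if (typeof value.command !== "string" || value.command.length === 0) {
    problems.push("command must be a non-empty string");
  }
  if (value.ok === true && !("result" in value)) {
    problems.push("ok=true requires result");
  }
  if (value.ok === false && !isRecord(value.error)) {
    problems.push("ok=false requires an error object");
  }
  const actions = value.next_actions;
  if (!Array.isArray(actions) || actions.length === 0) {
    problems.push("next_actions must be a non-empty array");
  } else {
    actions.forEach((a, i) => {
      if (!isRecord(a) || typeof a.command !== "string" || a.command.trim() === "") {
        problems.push(`next_actions[${i}] needs a command`);
      }
    });
  }
  return problems;
}

// Example of a conforming success envelope.
const example: Envelope = {
  ok: true,
  command: "joelclaw status",
  result: {},
  next_actions: [{ command: "joelclaw capabilities", description: "discover flows" }],
};
```

Returning a list of violations, instead of throwing on the first one, lets a CI gate report every drifted field in a single run.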

Implementation Plan

Phase 1 (Days 1–10): CLI contracts & discovery

Deliverables

Add canonical contract validation for all CLI commands:
- packages/cli/src/response.ts
- packages/cli/src/commands/*.ts
- packages/cli/src/schema.ts
Add discoverability surface:
- packages/cli/src/commands/capabilities.ts (new)
- map goals -> command templates -> prerequisites -> next actions
Add contract tests:
- packages/cli/src/commands/*.test.ts (new where missing)
- enforce envelope shape + required next_actions quality
Add CI gate:
- .github/workflows/agent-contracts.yml (new)

Acceptance criteria

100% of CLI commands return the canonical envelope (ok, command, result|error, next_actions).
0 commands missing actionable next_actions.
joelclaw capabilities can enumerate major operational flows (status, runs, gateway, otel, memory).

Day 1 Execution Checklist (Phase 1 kickoff)

Baseline command-surface + contract drift inventory
- inspect:
  - packages/cli/src/cli.ts
  - packages/cli/src/commands/*.ts
  - packages/cli/src/response.ts
- artifact:
  - docs/agent-contracts/phase1-baseline.json (new)
Add contract validator scaffold (failing-first)
- scripts/validate-cli-contracts.ts (new)
- package.json script: validate:cli-contracts (new)
Reproduce and pin current navigation failure as regression test
- failing command to codify:
  - joelclaw search "telegram.callback.received" --collection otel_events --limit 5
- expected behavior:
  - deterministic success or structured recoverable error envelope (never raw Typesense parser failure)
- test file:
  - packages/cli/src/commands/search.test.ts (new)
Upgrade first high-traffic command set to strict contract quality
- packages/cli/src/commands/status.ts
- packages/cli/src/commands/runs.ts
- packages/cli/src/commands/gateway.ts
- packages/cli/src/commands/otel.ts
- packages/cli/src/commands/send.ts

First 3 PR-sized stories (Phase 1)

Story AF30-001 — CLI Contract Harness + Baseline

Goal: make contract drift visible and testable before broad refactors.

Files

packages/cli/src/response.ts
scripts/validate-cli-contracts.ts (new)
packages/cli/src/commands/contract-envelope.test.ts (new)
package.json
docs/agent-contracts/phase1-baseline.json (new)

Acceptance checks

bun run validate:cli-contracts
bun test packages/cli/src/commands/contract-envelope.test.ts

Story AF30-002 — Capabilities Command (Discoverability Surface)

Goal: one-call discovery for agents (goal -> commands -> next_actions -> prerequisites).

Files

packages/cli/src/commands/capabilities.ts (new)
packages/cli/src/cli.ts
packages/cli/src/commands/capabilities.test.ts (new)
docs/agent-contracts/capabilities-map.md (new)

Acceptance checks

joelclaw capabilities
joelclaw capabilities | jq '.result.flows | length'
bun test packages/cli/src/commands/capabilities.test.ts

Story AF30-003 — Predictable Search + Contract CI Gate

Goal: eliminate known navigation surprise and enforce contract checks on every push.

Files

packages/cli/src/commands/search.ts
packages/cli/src/commands/search.test.ts (new)
.github/workflows/agent-contracts.yml (new)
package.json

Acceptance checks

joelclaw search "telegram.callback.received" --collection otel_events --limit 5
bun test packages/cli/src/commands/search.test.ts
CI workflow agent-contracts passes on PR

Phase 2 (Days 11–20): deterministic recovery runbooks

Deliverables

Define shared error/runbook registry:
- packages/cli/src/error-codes.ts (new)
- packages/cli/src/runbooks.ts (new)
Add recovery command:
- packages/cli/src/commands/recover.ts (new)
- supports dry-run first, then execute
Align worker auto-fix + runbook logic:
- packages/system-bus/src/observability/auto-fixes/*
- packages/system-bus/src/inngest/functions/o11y-triage.ts
Require rollback + verify steps in each runbook entry.

Acceptance criteria

Top 20 recurring error codes mapped to deterministic runbooks.
Every runbook includes rollback and verification commands.
OTEL emits for runbook start/success/failure across CLI + worker paths.
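
A minimal sketch of a runbook registry with the four phases this ADR requires (diagnose, fix, verify, rollback), a completeness check, and placeholder resolution. The registry entry, the commands inside it, the <key> placeholder syntax, and both helper names are assumptions for illustration. The actual registry is packages/cli/src/runbooks.ts.

```typescript
type Phase = "diagnose" | "fix" | "verify" | "rollback";
type Runbook = Record<Phase, string[]>;

const PHASES: Phase[] = ["diagnose", "fix", "verify", "rollback"];

// Illustrative entry only; the commands are not the real runbook steps.
const runbooks: Record<string, Runbook> = {
  TYPESENSE_UNREACHABLE: {
    diagnose: ["joelclaw status", 'joelclaw otel search "<run-id>" --hours 1'],
    fix: ["joelclaw inngest sync-worker --restart"],
    verify: ["joelclaw runs --count 3 --hours 1"],
    rollback: ["joelclaw inngest restart-worker"],
  },
};

// Return the codes of runbooks that have at least one empty phase.
function incompleteRunbooks(registry: Record<string, Runbook>): string[] {
  return Object.entries(registry)
    .filter(([, rb]) => PHASES.some((p) => rb[p].length === 0))
    .map(([code]) => code);
}

// Replace <key> placeholders with values from context.
// Unknown placeholders are left intact so a dry run shows what is missing.
function resolvePlaceholders(command: string, context: Record<string, string>): string {
  return command.replace(
    /<([A-Za-z0-9_-]+)>/g,
    (match: string, key: string) => context[key] ?? match,
  );
}
```

For example, resolving the second diagnose step with the context {"run-id":"01TEST"} would yield joelclaw otel search "01TEST" --hours 1, while a missing key leaves <run-id> visible in the preview.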

Phase 3 (Days 21–30): memory routing standardization

Deliverables

Make shared prefetch policy authoritative:
- packages/system-bus/src/memory/context-prefetch.ts
Expand to remaining high-impact functions still missing memory context:
- packages/system-bus/src/inngest/functions/task-triage.ts
- packages/system-bus/src/inngest/functions/check-calendar.ts
- packages/system-bus/src/inngest/functions/check-granola.ts
- packages/system-bus/src/inngest/functions/check-loops.ts
Add context traceability surface:
- packages/cli/src/commands/inngest.ts (extend) or packages/cli/src/commands/memory.ts (new)
- expose what memory was injected, why, and filter/drop diagnostics

Acceptance criteria

Decision-heavy functions use shared memory policy (no ad-hoc retrieval forks).
Context injection remains bounded and observable (latency + quality metrics in OTEL).
Agents can inspect memory-injection evidence from CLI without pod log grepping.
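
Bounded, traceable context injection could look roughly like the selection function below. It is a sketch under stated assumptions: the policy fields (maxItems, maxChars, minScore), the drop reasons, and selectContext are hypothetical and are not taken from packages/system-bus/src/memory/context-prefetch.ts.

```typescript
type MemoryCandidate = { id: string; text: string; score: number };

// Hypothetical policy knobs: item count, character budget, relevance floor.
type PrefetchPolicy = { maxItems: number; maxChars: number; minScore: number };

type DropReason = "low_score" | "over_budget";

// The trace records what was injected and why everything else was dropped,
// so it can be surfaced through a CLI or emitted as telemetry.
type PrefetchTrace = {
  injected: MemoryCandidate[];
  dropped: { id: string; reason: DropReason }[];
};

function selectContext(candidates: MemoryCandidate[], policy: PrefetchPolicy): PrefetchTrace {
  const injected: MemoryCandidate[] = [];
  const dropped: { id: string; reason: DropReason }[] = [];
  let usedChars = 0;

  // Consider the highest-scoring candidates first.
  const ranked = [...candidates].sort((a, b) => b.score - a.score);
  for (const c of ranked) {
    if (c.score < policy.minScore) {
      dropped.push({ id: c.id, reason: "low_score" });
      continue;
    }
    const overItems = injected.length >= policy.maxItems;
    const overChars = usedChars + c.text.length > policy.maxChars;
    if (overItems || overChars) {
      dropped.push({ id: c.id, reason: "over_budget" });
      continue;
    }
    injected.push(c);
    usedChars += c.text.length;
  }
  return { injected, dropped };
}
```

Because every candidate ends up in either injected or dropped with a reason, the same trace can answer "what memory was injected, and why" without reading pod logs.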

Verification Commands

joelclaw status
joelclaw capabilities
joelclaw runs --count 20 --hours 24
joelclaw otel stats --hours 24
joelclaw otel search "auto_fix|runbook|recover|memory.recall" --hours 24
bun test packages/cli/src/commands/recall.test.ts
bun test packages/cli/src/commands/*.test.ts

Non-Goals

Rewriting all existing ADRs or command names.
Replacing CLI-first architecture with dashboard-first operations.
Building a new memory backend.

Consequences

Positive

Faster autonomous execution with less prompt thrash.
Lower ambiguity during outages and degraded states.
Better onboarding for any new agent/harness.

Negative / Risks

Requires disciplined schema/runbook maintenance.
Up-front test and contract work is required before feature velocity gains appear.
Temporary churn across command surfaces during migration.

References

ADR-0009 (joelclaw CLI naming + agent-first CLI direction)
ADR-0018 (gateway CLI/HATEOAS operational contract)
ADR-0058 (NDJSON streaming for temporal operations)
ADR-0087 (observability contract)
ADR-0090 (autonomous o11y triage loop)
ADR-0077 (memory system next phase)

Status

Accepted (execution in progress: Phase 1 complete, Phase 2 actively implementing deterministic runbooks/recovery wiring).