Self-Healing NAS Backup Orchestrator
Status
shipped
This record is the canonical self-healing architecture for non-domain-specific remediation in the system. ADR-0139 is historical and intentionally superseded by this record.
Context
Backup jobs for Typesense and Redis run as Inngest functions and currently must:
- classify failures predictably when transport or infra flakes occur,
- retry with a bounded strategy that includes delay jitter and budget caps,
- switch between local mount paths and SSH remote copy fallback,
- and expose a central, inspectable config surface for operator tuning.
Additional system-wide SDK reachability failures also required durable remediation and guarded worker restart behavior.
A previous ad-hoc approach handled domain-specific symptoms, but this led to duplicated logic and inconsistent operator controls.
Decision
Adopt a durable, first-class self-healing architecture with these canonical rules:
- Keep transport and retry knobs configurable from `~/.joelclaw/system-bus.config.json`, with environment variable overrides.
- Centralize all cross-domain self-healing on a shared event contract and router flow:
  - canonical event: `system/self.healing.requested`
  - canonical completion: `system/self.healing.completed`
  - router action policy: `retry`, `pause`, `escalate`
  - bounded budgets from config + policy, and deterministic scheduling with `sendEvent`/`step.sleep`
- Route backup failures through `system/backup.failure.router`:
  - emit `system/backup.retry.requested` with explicit target + context
  - retry via bounded exponential backoff
  - escalate after budget exhaustion
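The bounded exponential backoff above can be sketched as a pure function. This is a minimal illustration, not the router's actual implementation: the `RetryPolicy` field names follow the proposed contract in this ADR, while `computeBackoffMs` and its jitter strategy are assumptions.

```typescript
// Illustrative sketch of bounded exponential backoff with jitter.
// Field names (maxRetries, sleepMinMs, sleepMaxMs, sleepStepMs) follow the
// proposed retryPolicy contract; the function name and jitter choice are
// hypothetical.
interface RetryPolicy {
  maxRetries: number;  // budget cap: escalate once exceeded
  sleepMinMs: number;  // floor for any computed delay
  sleepMaxMs: number;  // ceiling (the "bounded" part of the backoff)
  sleepStepMs: number; // base step doubled per attempt
}

// Returns null when the retry budget is exhausted (caller should escalate).
function computeBackoffMs(attempt: number, policy: RetryPolicy): number | null {
  if (attempt >= policy.maxRetries) return null;
  const exponential = policy.sleepStepMs * 2 ** attempt;
  const capped = Math.min(Math.max(exponential, policy.sleepMinMs), policy.sleepMaxMs);
  // Jitter within [sleepMinMs, capped] avoids thundering-herd retries.
  return policy.sleepMinMs + Math.random() * (capped - policy.sleepMinMs);
}
```

The returned delay would feed directly into a `step.sleep` call before re-emitting the retry event.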
- Keep backup functions (`system/backup.typesense`, `system/backup.redis`) subscribed to both cron triggers and retry requests.
- Apply the same architecture to SDK reachability and worker lifecycle incidents:
  - `system/self-healing.investigator` scans failed runs and detects `Unable to reach SDK URL`
  - guarded worker restart path replaces unsafe launchd kickstart behavior
  - remediation emits self-healing telemetry for traceability and repeatable recovery
- Instrument each routing/transfer decision in OTEL and include model metadata, transport mode, retry attempts, and incident context.
- Use this as the canonical self-healing decision plane for any future domain that needs durable remediation.
This ADR is the canonical self-healing architecture for system-wide remediation. Future domains (log ingestion, Redis/queue, webhook delivery, gateway sessions, etc.) dispatch into this flow using domain payloads for route-specific context.
Proposed reusable contract:
- Event: `system/self.healing.requested`
- Required fields:
  - `sourceFunction` (string)
  - `targetComponent` (string)
  - `problemSummary` (string)
  - `attempt` (number)
  - `retryPolicy` (`maxRetries`, `sleepMinMs`, `sleepMaxMs`, `sleepStepMs`)
  - `evidence` (e.g. log files, Redis keys, event IDs, query snippets)
  - `playbook` (skills, restart/kill/defer/notify actions, links, runbook references)
- Optional fields:
  - `context` (rich domain context object)
  - `owner` (team/user routing key)
  - `deadlineAt` (ISO timestamp)
  - `fallbackAction` (`escalate`, `manual`)
- Router output: one of
  - `retry` (schedule via `step.sleep` + `sendEvent`)
  - `pause` (bounded hold and recheck)
  - `escalate` (route to manual intervention queue)
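Expressed as TypeScript types, the proposed contract might look like the sketch below. Field names come from the lists above; the type names and the inner shape of `playbook` are assumptions, not the shipped schema.

```typescript
// Illustrative types for the proposed self-healing contract. Field names
// follow the ADR's required/optional field lists; type names and the
// Playbook shape are hypothetical.
type RouterAction = "retry" | "pause" | "escalate";

interface RetryPolicy {
  maxRetries: number;
  sleepMinMs: number;
  sleepMaxMs: number;
  sleepStepMs: number;
}

interface Playbook {
  skills?: string[];
  actions?: Array<"restart" | "kill" | "defer" | "notify">;
  links?: string[];
  runbooks?: string[];
}

interface SelfHealingRequested {
  // Required fields
  sourceFunction: string;
  targetComponent: string;
  problemSummary: string;
  attempt: number;
  retryPolicy: RetryPolicy;
  evidence: string[]; // log files, Redis keys, event IDs, query snippets
  playbook: Playbook;
  // Optional fields
  context?: Record<string, unknown>;
  owner?: string;         // team/user routing key
  deadlineAt?: string;    // ISO timestamp
  fallbackAction?: "escalate" | "manual";
}
```

A typed contract like this lets every domain adapter construct `system/self.healing.requested` payloads that the router can validate before choosing a `RouterAction`.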
Decision Outcome
- Add a shared config loader in `packages/system-bus/src/lib/backup-failure-router-config.ts`.
- Drive router/transport knobs from `~/.joelclaw/system-bus.config.json` and env overrides.
- Extend the CLI with `joelclaw nas config [show|init]` for operator visibility and initialization.
- Add a documented template at `packages/system-bus/system-bus.config.example.md` with all supported keys.
- Introduce shared self-healing model contracts and route-specific adapters under `system/self.healing.requested`.
- Implement the SDK reachability investigator + guarded launchd-restart path as the canonical non-backup flow under this ADR.
- Keep existing event names and transport safety checks while moving behavior from informal notes to explicit architecture.
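The file-plus-env precedence for router knobs can be sketched as below. This is a minimal sketch, assuming a flat JSON config at `~/.joelclaw/system-bus.config.json`; the key names, env-var names, and defaults here are illustrative, not the documented template.

```typescript
// Illustrative config loader: env override > config file > built-in default.
// Env-var names and defaults are hypothetical.
import { readFileSync } from "node:fs";
import { homedir } from "node:os";
import { join } from "node:path";

interface RouterConfig {
  maxRetries: number;
  sleepMinMs: number;
  sleepMaxMs: number;
}

const DEFAULTS: RouterConfig = { maxRetries: 3, sleepMinMs: 500, sleepMaxMs: 30_000 };

function loadRouterConfig(env: Record<string, string | undefined> = process.env): RouterConfig {
  let fileCfg: Partial<RouterConfig> = {};
  try {
    const path = join(homedir(), ".joelclaw", "system-bus.config.json");
    fileCfg = JSON.parse(readFileSync(path, "utf8"));
  } catch {
    // Missing or unreadable config file: fall through to defaults.
  }
  const envNum = (key: string): number | undefined =>
    env[key] !== undefined ? Number(env[key]) : undefined;
  return {
    maxRetries: envNum("JOELCLAW_MAX_RETRIES") ?? fileCfg.maxRetries ?? DEFAULTS.maxRetries,
    sleepMinMs: envNum("JOELCLAW_SLEEP_MIN_MS") ?? fileCfg.sleepMinMs ?? DEFAULTS.sleepMinMs,
    sleepMaxMs: envNum("JOELCLAW_SLEEP_MAX_MS") ?? fileCfg.sleepMaxMs ?? DEFAULTS.sleepMaxMs,
  };
}
```

Keeping the precedence in one loader means `joelclaw nas config show` can print exactly the values the router will use.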
Priority Rollout (Canonical)
P0
- `system-bus` and `joelclaw gateway` worker control plane: guard rails around restarts, safe re-registration, and run-level dedupe.
- Event-router and Inngest execution path resilience: prevent lost callbacks and duplicate remediation loops on run failures.
- NAS backup domains (`system/backup.typesense`, `system/backup.redis`): bounded transport/retry remediation with mount and remote-copy fallbacks.
P1
- `gateway` provider adapters (Telegram/iMessage/Discord/email/webhooks): reconnect, session rebind, and queue drain recovery.
- Redis-backed event bridge and transient state stores: broker liveness checks, stale-lock cleanup, and reconnect backoff.
- Observability pipeline (`otel_events` + tracing ingest): fail-closed telemetry sinks and health gates for missing event writes.
P2
- Search/index and content-serving surfaces (Typesense, Convex projections): query fallback strategies and index rebuild playbooks.
- External dependencies (LLM providers, third-party APIs): model/provider fallback and route-specific error budgets.
- K8s edge services (colima/talos/ingress paths): soft restart/reconcile flow with controlled escalation.
Non-goals for this ADR
- Full autonomous model-level diagnosis is not the first-class control path.
- Human review remains the final escalation channel for unresolved policy loops.
Execution TODO (Actionable Backlog)
P0 (Do now)
- Enforce the guarded worker restart path for all ingress sync and sync-recovery codepaths.
  - Validate: no raw `launchctl kickstart` invocations in sync/control loops.
  - Artifact: `infra/launchd/com.joel.system-bus-sync.plist`, `packages/system-bus` restart handler.
- Ensure `system/self-healing.investigator` is a first-class route target for `system/self.healing.requested` and emits `system/self.healing.completed`.
  - Validate: event schema includes `sourceFunction`, `targetComponent`, `attempt`, `retryPolicy`, `playbook`, `evidence`.
- Add deterministic backoff and jitter policy validation for backup router and investigator loops.
  - Validate: bounded `maxRetries` and `sleepMs` ranges enforced from config and env.
- Implement the canonical `pause` path for transient infra outages before `retry`.
  - Validate: pause emits a completion event with reason and wait duration.
- Add P0 runbook fields to the payload contract (`links`, `restart`, `kill`, `defer`, `notify`).
  - Validate: all P0 routes pass non-empty `playbook` context.
Priority hardening status (canonical rollout):
- High-priority systems currently mapped to this flow:
  - `gateway` and long-running worker control loops (safe restart + guarded register path).
  - `system/backup` (Typesense + Redis copy pipeline).
  - `system/gateway.bridge.health` for queue/reconnect/stale-entry reconciliation.
  - `check-system-health` as router input for self-healing requests from system-wide degraded checks.
  - `system/self-healing.investigator` for SDK reachability and run-level incident scans.
  - `agent-loop` orchestration via trigger/audit signaling and operator handoff paths.
- Blocked work not yet canonicalized:
  - gateway provider adapters (Telegram/iMessage/webhooks) session/session-bridge recovery.
  - Redis/bridge stale-lock recovery for broker reconnection after process restarts.
P1 (execute next)
- Add Redis/bridge health checks and stale-run reconciliation for session bridge queues.
  - Validate: queue health signal appears in periodic OTEL and triggers `pause` when unstable.
- Add an OTEL health circuit for missing telemetry writes during recovery loops.
  - Validate: emits an explicit telemetry-gap event when traces fail to persist.
- Add provider session rebind/retry paths for Telegram/iMessage/email/webhook adapters.
  - Validate: retries use bounded cooldown and escalate via `system/self.healing.completed` status `escalated`.
P2 (planned)
- Add index/search fallback and a rebuild plan for Typesense/Convex incidents.
  - Validate: recoverable route triggers a deferred rerun with explicit backoff.
- Add `0138` execution tasks into the operator task list (task-management) for each P1/P2 domain.
- Add an ADR status telemetry dashboard for shipped self-healing actions and escalation rate.
  - Validate: dashboard shows completion/escalation counts by domain.
Done Criteria
- All TODO items above are linked to code or operational verification events.
- Every executed item has corresponding `system/self.healing.completed` telemetry with action and outcome.
- At least one real remediation run demonstrates `retry -> pause -> escalate` behavior for a synthetic transient fault.
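The `retry -> pause -> escalate` ladder for a synthetic transient fault can be sketched as a deterministic decision function. This is illustrative only: the `routeIncident` name and the `Incident` shape are assumptions, though the policy ordering (pause for transient infra outages, retry within budget, escalate on exhaustion) follows the rules in this ADR.

```typescript
// Illustrative router decision ladder. Names are hypothetical; the ordering
// mirrors the ADR's policy: pause on transient infra outage, retry within
// the bounded budget, escalate once the budget is exhausted.
type Action = "retry" | "pause" | "escalate";

interface Incident {
  attempt: number;      // remediation attempts already made
  maxRetries: number;   // retry budget from the policy
  infraOutage: boolean; // transient infra outage detected upstream
}

function routeIncident(incident: Incident): Action {
  // Transient infra outages get a bounded hold-and-recheck before any retry.
  if (incident.infraOutage) return "pause";
  // Within budget: schedule another bounded retry.
  if (incident.attempt < incident.maxRetries) return "retry";
  // Budget exhausted: hand off to the manual intervention queue.
  return "escalate";
}
```

Driving a synthetic fault through this function (outage clears, retries fail, budget runs out) exercises all three branches, which is exactly what the done criterion above asks a real remediation run to demonstrate.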
Consequences
Positive
- Reliable backup recovery behavior is configurable, bounded, and observable.
- Failure-handling behavior is durable (event-driven) instead of in-process heuristics.
- Ops can tune retry windows and mounts without code edits.
Negative
- More operational complexity: two decision layers (LLM + policy) need monitoring and periodic recalibration.
- Additional config surface adds potential misconfiguration risk; requires template usage and env override discipline.
Risks
- If the configured model IDs are disallowed by the allowlist, router startup or invocation fails.
- Hard transport failures (bad NAS networking) still escalate correctly, but recovery may be delayed.