ADR-0139 (superseded)

SDK Reachability Investigator (Historical, Superseded)

2026-02-25T00:00:00.000Z

Status

superseded

This ADR is superseded by 0138-self-healing-backup-orchestrator.md, which is now the canonical self-healing architecture. This record is kept for historical implementation context and the earlier reasoning about the initial SDK reachability root cause.

Context

system/content-sync failures reporting Unable to reach SDK URL were not always caused by a bad function implementation. Two independent failure modes were interacting:

- Hard worker boot failures (syntax/parse errors) made callback URLs unreachable.
- com.joel.system-bus-sync used raw launchctl kickstart -k ...com.joel.system-bus-worker on git ref updates, which could restart the worker during active runs and drop finalization callbacks.

ADR-0138 already defined a backup-domain self-healing router. We need the same pattern at the system level for SDK reachability incidents.

Decision

Replace unsafe sync-agent restart behavior with guarded CLI restart logic:
- com.joel.system-bus-sync now runs joelclaw inngest restart-worker --register instead of raw launchctl kickstart.
- Keep active-run guard and cooldown semantics from existing CLI + o11y restart handler.
Add a generalized self-healing investigator function:
- New function: system/self-healing.investigator
- Triggers: cron every 10 minutes + manual system/self.healing.requested
- Behavior: scan recent failed runs, inspect run output, detect Unable to reach SDK URL, apply guarded restart remediation via existing restart-worker auto-fix handler, and emit OTEL telemetry.
Formalize event contracts so backup and generic self-healing share typed events:
- system/self.healing.requested
- system/self.healing.completed
- system/backup.failure.detected
- system/backup.retry.requested
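
A minimal sketch of how the shared event contracts and the investigator's detection step could fit together. Only the four event names and the Unable to reach SDK URL string come from this ADR; every payload field, the FailedRun shape, and the helper names (investigate, toCompletedEvent) are hypothetical and are not taken from the actual implementation.

```typescript
// Event names are from the ADR; all payload fields are illustrative assumptions.
type SelfHealingEvent =
  | { name: "system/self.healing.requested"; data: { reason: string } }
  | {
      name: "system/self.healing.completed";
      data: { action: "restart-worker" | "none"; runIds: string[] };
    }
  | { name: "system/backup.failure.detected"; data: { error: string } }
  | { name: "system/backup.retry.requested"; data: { attempt: number } };

// Hypothetical view of a failed run as the investigator might receive it.
interface FailedRun {
  runId: string;
  functionId: string;
  output: string;
}

interface Decision {
  remediate: boolean;
  runIds: string[];
}

const SDK_UNREACHABLE = "Unable to reach SDK URL";

// Scan failed runs and flag those whose output carries the SDK reachability error.
function investigate(runs: FailedRun[]): Decision {
  const affected = runs.filter((run) => run.output.includes(SDK_UNREACHABLE));
  return {
    remediate: affected.length > 0,
    runIds: affected.map((run) => run.runId),
  };
}

// Turn a decision into the typed completion event.
function toCompletedEvent(
  decision: Decision,
): Extract<SelfHealingEvent, { name: "system/self.healing.completed" }> {
  return {
    name: "system/self.healing.completed",
    data: {
      action: decision.remediate ? "restart-worker" : "none",
      runIds: decision.runIds,
    },
  };
}
```

Modelling the events as a discriminated union on name lets the compiler reject a payload attached to the wrong event, which is the practical benefit of sharing typed contracts between the backup and generic self-healing paths.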

Consequences

Positive

- Self-healing is no longer backup-only; SDK reachability regressions get continuous investigation.
- Worker restarts are guarded against active runs instead of unconditional launchd kickstarts.
- Self-healing decisions become observable and queryable in OTEL.

Negative

- More moving parts: investigator cron + Redis dedupe + run detail probes.
- Remediation is only partial: it is limited to restart/register recovery, and deeper root-cause fixes still need agent or human follow-up.

Risks

- If the joelclaw run output format changes, investigator parsing can degrade.
- Repeated SDK errors from upstream Inngest outages may trigger noisy remediation attempts (mitigated by cooldown + dedupe).
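
The active-run guard, cooldown, and dedupe mentioned in this record could combine roughly as sketched below. This is an illustration under stated assumptions, not the actual restart-worker handler: GuardState, decideRestart, the in-memory Set standing in for Redis dedupe keys, and the order of the checks are all hypothetical.

```typescript
// Hypothetical guard state; a Set stands in for the Redis dedupe keys.
interface GuardState {
  lastRestartAt: number | null; // epoch ms of the most recent restart
  handledIncidents: Set<string>;
}

type GuardDecision =
  | { restart: true }
  | { restart: false; reason: "duplicate" | "active-runs" | "cooldown" };

// Decide whether a restart is safe for the given incident.
// Mutates state only when a restart is approved.
function decideRestart(
  state: GuardState,
  incidentKey: string,
  activeRuns: number,
  now: number,
  cooldownMs: number,
): GuardDecision {
  // Dedupe: an incident that already triggered a restart is not handled twice.
  if (state.handledIncidents.has(incidentKey)) {
    return { restart: false, reason: "duplicate" };
  }
  // Active-run guard: restarting now could drop finalization callbacks.
  if (activeRuns > 0) {
    return { restart: false, reason: "active-runs" };
  }
  // Cooldown: avoid restart loops during a sustained upstream outage.
  if (state.lastRestartAt !== null && now - state.lastRestartAt < cooldownMs) {
    return { restart: false, reason: "cooldown" };
  }
  state.handledIncidents.add(incidentKey);
  state.lastRestartAt = now;
  return { restart: true };
}
```

Returning a reason with every refusal keeps each skipped restart explainable, which would fit the goal of making self-healing decisions observable in telemetry.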