ADR-0139 (superseded)
SDK Reachability Investigator (Historical, Superseded)
Status
superseded
This ADR is superseded by 0138-self-healing-backup-orchestrator.md, which is now the canonical self-healing architecture.
See this record for historical implementation context and prior reasoning around the initial SDK reachability root-cause.
Context
system/content-sync failures reporting Unable to reach SDK URL were not always caused by a bad function implementation.
Two independent failure modes were interacting:
- Hard worker boot failure (syntax/parse errors) made callback URLs unreachable.
- com.joel.system-bus-sync used raw launchctl kickstart -k ...com.joel.system-bus-worker on git ref updates, which could restart the worker during active runs and drop finalization callbacks.
ADR-0138 already defined a backup-domain self-healing router. We need the same pattern at the system level for SDK reachability incidents.
Decision
- Replace unsafe sync-agent restart behavior with guarded CLI restart logic:
  - com.joel.system-bus-sync now runs joelclaw inngest restart-worker --register instead of raw launchctl kickstart.
  - Keep active-run guard and cooldown semantics from existing CLI + o11y restart handler.
- Add a generalized self-healing investigator function:
  - New function: system/self-healing.investigator
  - Triggers: cron every 10 minutes + manual system/self.healing.requested
  - Behavior: scan recent failed runs, inspect run output, detect Unable to reach SDK URL, apply guarded restart remediation via existing restart-worker auto-fix handler, and emit OTEL telemetry.
- Formalize event contracts so backup and generic self-healing share typed events:
  - system/self.healing.requested
  - system/self.healing.completed
  - system/backup.failure.detected
  - system/backup.retry.requested
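As a rough illustration of what sharing typed events between backup and generic self-healing could mean, the TypeScript sketch below maps the four event names to payload types. Only the event names come from this ADR; every payload field and the makeEvent helper are hypothetical placeholders, not the actual contract.

```typescript
// Event names are taken from the list above; all payload fields are
// hypothetical placeholders chosen for illustration only.
type SelfHealingEvents = {
  "system/self.healing.requested": {
    data: { reason: string; requestedBy: "cron" | "manual" };
  };
  "system/self.healing.completed": {
    data: { remediated: boolean; detail: string };
  };
  "system/backup.failure.detected": {
    data: { target: string; error: string };
  };
  "system/backup.retry.requested": {
    data: { target: string; attempt: number };
  };
};

type EventName = keyof SelfHealingEvents;

// Hypothetical helper: builds an event whose payload is checked against
// its name at compile time, so a mismatched payload fails type-checking.
function makeEvent<N extends EventName>(
  name: N,
  data: SelfHealingEvents[N]["data"],
): { name: N; data: SelfHealingEvents[N]["data"] } {
  return { name, data };
}

const requested = makeEvent("system/self.healing.requested", {
  reason: "Unable to reach SDK URL",
  requestedBy: "manual",
});
```

With a single map like this, a backup handler and the generic investigator would both reject a payload that does not match its event name before the code runs.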
Consequences
Positive
- Self-healing is no longer backup-only; SDK reachability regressions get continuous investigation.
- Worker restarts are guarded against active runs instead of unconditional launchd kickstarts.
- Self-healing decisions become observable and queryable in OTEL.
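The difference between a guarded restart and an unconditional kickstart can be sketched as a small decision function. This is a minimal TypeScript sketch under stated assumptions: the WorkerState shape, the decideRestart name, and the five-minute cooldown are all invented for illustration, and the real guard lives in the existing CLI and o11y restart handler.

```typescript
// Hypothetical view of worker state; field names are illustrative and
// not taken from the joelclaw CLI.
interface WorkerState {
  activeRuns: number; // runs currently in flight on the worker
  lastRestartAtMs: number | null; // time of the previous restart, if any
}

type RestartDecision =
  | { restart: true }
  | { restart: false; reason: "active-runs" | "cooldown" };

// Assumed cooldown window; the actual value is not stated in this ADR.
const COOLDOWN_MS = 5 * 60 * 1000;

function decideRestart(
  state: WorkerState,
  nowMs: number,
  cooldownMs: number = COOLDOWN_MS,
): RestartDecision {
  // Never restart while runs are in flight: that is what dropped
  // finalization callbacks under the raw kickstart behavior.
  if (state.activeRuns > 0) {
    return { restart: false, reason: "active-runs" };
  }
  // Skip if a restart already happened within the cooldown window.
  if (
    state.lastRestartAtMs !== null &&
    nowMs - state.lastRestartAtMs < cooldownMs
  ) {
    return { restart: false, reason: "cooldown" };
  }
  return { restart: true };
}
```

Returning a reason alongside a refusal keeps skipped restarts as easy to report in telemetry as performed ones.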
Negative
- More moving parts: investigator cron + Redis dedupe + run detail probes.
- Partial remediation remains bounded to restart/register recovery; deeper root-cause fixes still need agent/human follow-up.
Risks
- If joelclaw run output format changes, investigator parsing can degrade.
- Repeated SDK errors from upstream Inngest outages may trigger noisy remediation attempts (mitigated by cooldown + dedupe).
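Both risks are easier to see in a sketch of the detection step. The TypeScript below is illustrative only: the FailedRun shape and the findSdkReachabilityFailures name are assumptions, and an in-memory Set stands in for the Redis dedupe store. Only the matched error string comes from this ADR.

```typescript
// Hypothetical shape of a failed run as the investigator might see it.
interface FailedRun {
  runId: string;
  functionId: string;
  output: string; // raw run output text to be inspected
}

// The error string this ADR says the investigator looks for.
const SDK_UNREACHABLE = "Unable to reach SDK URL";

// Returns runs that show the SDK reachability error and have not been
// handled yet. `alreadyHandled` is an in-memory stand-in for Redis
// dedupe and is mutated so repeated scans skip known runs.
function findSdkReachabilityFailures(
  runs: FailedRun[],
  alreadyHandled: Set<string>,
): FailedRun[] {
  const hits: FailedRun[] = [];
  for (const run of runs) {
    if (alreadyHandled.has(run.runId)) continue;
    // Plain substring match: if the output format changes, this check
    // silently stops matching, which is the parsing risk noted above.
    if (!run.output.includes(SDK_UNREACHABLE)) continue;
    alreadyHandled.add(run.runId);
    hits.push(run);
  }
  return hits;
}
```

Because handled run IDs are remembered, a second scan over the same failed runs yields nothing new, which is what keeps a sustained upstream outage from triggering a remediation attempt on every cron tick.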