# Node-0 Fleet Contract and Localhost Resilience
Status: shipped
Date: 2026-02-28
Updated: 2026-03-01
Deciders: Joel Hooks
Related: ADR-0029, ADR-0089, ADR-0148, ADR-0159
## Context
We keep hitting the same reliability class:
- localhost ↔ Colima VM split-brain (localhost probe fails while VM-side service is healthy)
- runtime/socket drift after Colima cycles
- kubelet proxy permission drift (`apiserver-kubelet-client` authz breakage)
- host-specific assumptions leaking into core logic (launchd, Colima paths, localhost-only probes)
This is fixable, but only if we stop treating Panda like a special one-off machine.
Panda is node-0 of a future fleet.
## Decision
### 1) Hardening now (single Mac, node-0)
- Deploy the Talon split-brain fix that is already committed (`68ba160`) so runtime behavior matches repo state.
- Kill localhost as a hard dependency in operational checks (Talon + CLI + gateway health paths). Resolution order:
  1. localhost
  2. Colima VM IP (default `192.168.64.2`, discovered dynamically)
  3. Kubernetes service DNS (`*.svc.cluster.local`)
- Add RBAC drift guard for kubelet proxy permissions (the `apiserver-kubelet-client` breakage class).
- Run invariant checks after every Colima cycle:
  - Docker socket is healthy
  - kube API is reachable
  - `kubectl exec`/`kubectl logs` authorization works
  - Inngest and Typesense health endpoints pass
- Make runbook + auto-repair loop replay-safe and idempotent end-to-end.
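The endpoint resolution order above can be sketched as a small fallback resolver. The candidate hosts and ports here are illustrative assumptions (the real config discovers the VM IP dynamically and uses the actual service names), but the fall-through behavior is the point: a localhost flap does not fail the check while a later endpoint class is healthy.

```python
import socket

# Hypothetical candidate list for one service, in the ADR's resolution
# order. Hostnames and the port are illustrative, not the real config.
CANDIDATES = [
    ("localhost", 8288),                          # host loopback
    ("192.168.64.2", 8288),                       # Colima VM IP (default)
    ("inngest.default.svc.cluster.local", 8288),  # in-cluster service DNS
]

def resolve_endpoint(candidates, timeout=1.0):
    """Return the first (host, port) that accepts a TCP connection.

    Walks the candidate list in order, so a dead localhost endpoint
    falls through to the VM IP or service DNS instead of hard-failing.
    Returns None only when every endpoint class is unreachable.
    """
    for host, port in candidates:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return host, port
        except OSError:
            continue
    return None
```

A real probe would layer an application-level health check on top of the raw TCP connect, but the resolver shape stays the same.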
### 2) Prepare for multi-Mac / Linux
> ⚠️ Superseded by ADR-0184 — The fleet-prep items below are now tracked with full implementation detail, codebase audit, and phased plan in ADR-0184: Node-0 to Fleet — Platform Abstraction and Multi-Node Readiness.
- Platform-neutral control-plane contract (Talos configs + node labels as the contract; host bootstrap at edges only).
- No host-bound assumptions in core logic (launchd/Colima details live in adapters, not core decision logic).
- Multi-node networking trigger: revisit Cilium/service policy as soon as node-1 joins.
- Stateful service growth path defined now:
  - Redis: single-node now, HA topology spec maintained and ready
  - Typesense: single-node now, multi-node shard/replica plan maintained and ready
- ARM64/Linux-friendly workloads: multi-arch images and no macOS-only binaries in cluster workloads.
## Consequences

### Good
- Split-brain detection becomes first-class instead of accidental.
- Health checks degrade gracefully across endpoint classes instead of failing on localhost flaps.
- RBAC drift is detected and corrected early.
- Node-0 is a reusable bootstrap pattern for additional nodes.
- Linux portability work starts now, not as a rewrite later.
### Tradeoffs
- Slightly more complexity in endpoint resolution and probe logic.
- More explicit policy/config surface area to maintain.
- Additional verification gates on restart/recovery flows.
## Implementation Status (2026-03-01)
- ✅ Endpoint fallback resolver landed for CLI, gateway heartbeat, and system health checks.
- ✅ Talon VM witness probes fixed to run valid remote python socket checks over Colima SSH.
- ✅ RBAC drift guard added for `apiserver-kubelet-client`/`kube-apiserver-kubelet-client` against `nodes/proxy` (get, create).
- ✅ Post-Colima invariant gate added as one replay-safe check (docker socket, kube API, kubelet proxy authz, logs/exec path, Inngest, Typesense).
- ✅ Talon now defers worker supervision when the legacy launchd worker supervisor (`com.joel.system-bus-worker`) is loaded, preventing dual-supervisor restart thrash during cutover.
- ✅ Talon launchd runtime validated in coexistence mode: watchdog healthy, no worker thrash, and all probes passing after voice-agent relink to LiveKit.
- ✅ Warmup-aware gating landed (`WARMUP_GRACE_SECS=120`): transient failures (flannel not ready, ImagePullBackOff, health endpoint timeouts) are tolerated during the grace window; hard faults (kube API unreachable, RBAC broken) fail immediately regardless. Commit `af4dcea`.
- ✅ Voice-agent stale-process cleanup automated: `infra/voice-agent/cleanup-stale.sh` detects stale `main.py` workers holding port 8081, kills them, and kickstarts the launchd service. Called automatically after the post-Colima invariant gate passes. Commit `af4dcea`.
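The warmup-aware gating decision can be sketched as a pure verdict function. The fault names and the `WARMUP_GRACE_SECS=120` default come from this ADR; everything else (verdict strings, call shape) is an illustrative assumption, not the shipped implementation.

```python
# Hard faults fail immediately regardless of warmup; names are
# illustrative labels for the ADR's fault classes.
HARD_FAULTS = {"kube-api-unreachable", "rbac-broken"}
WARMUP_GRACE_SECS = 120

def gate_verdict(failures, elapsed_secs, grace=WARMUP_GRACE_SECS):
    """Decide whether the invariant gate passes, waits, or fails.

    - Any hard fault -> "fail", even inside the grace window.
    - Transient failures (image pulls, CNI not ready, health endpoint
      timeouts) -> "warming" while elapsed_secs < grace, so the gate
      re-checks instead of tripping the repair loop.
    - No failures -> "pass"; transient failures past grace -> "fail".
    """
    if any(f in HARD_FAULTS for f in failures):
        return "fail"
    if failures and elapsed_secs < grace:
        return "warming"
    return "fail" if failures else "pass"
```

The "warming" state is what keeps the repair loop replay-safe: re-running the gate during warmup is a no-op rather than an escalation.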
## Sequence (vector clock, not calendar)
- After ADR-0159 bridge-heal code is present → deploy Talon runtime from `68ba160`.
- After runtime matches source → switch operational checks to the endpoint resolver fallback order.
- After resolver lands → add RBAC drift guard and enforce invariants post-Colima-cycle.
- After invariants are stable on node-0 → freeze platform-neutral control-plane contract.
- When node-1 is introduced → execute Cilium/service-policy revisit and stateful HA expansion plans.
## Verification Gates
- Force-cycle Colima; recovery succeeds without manual intervention; invariant suite passes.
- Health checks continue to pass when localhost endpoints are intentionally disrupted but VM/service DNS paths remain healthy.
- RBAC drift test reproduces kubelet-proxy authz failure and verifies guard recovery.
- Talon repair loop can replay the same failure sequence multiple times without divergent side effects.
- Multi-arch build pipeline produces runnable ARM64/Linux artifacts for cluster workloads.
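The RBAC drift gate can be sketched as a pure check over required grants, with the cluster probe injected so the logic is testable without a live cluster. The `nodes/proxy` get/create pairs come from this ADR; the `kubectl auth can-i` invocation and the impersonated user are assumptions to adapt to the real kubelet-client identity.

```python
import subprocess

# The (resource, verb) pairs the kubelet-proxy path needs, per this
# ADR's drift class.
REQUIRED = [("nodes/proxy", "get"), ("nodes/proxy", "create")]

def missing_grants(can_i):
    """Return required (resource, verb) pairs that the probe denies.

    `can_i(resource, verb) -> bool` is injected, so the drift logic
    can be exercised with a fake probe in tests.
    """
    return [(r, v) for r, v in REQUIRED if not can_i(r, v)]

def kubectl_can_i(resource, verb):
    """Cluster probe sketch using kubectl impersonation (assumed identity)."""
    out = subprocess.run(
        ["kubectl", "auth", "can-i", verb, resource,
         "--as=system:kube-apiserver-kubelet-client"],
        capture_output=True, text=True,
    )
    return out.stdout.strip() == "yes"
```

A non-empty result from `missing_grants(kubectl_can_i)` is the drift signal; the guard's repair step would then reapply the expected ClusterRole binding.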
## Notes
This ADR sets the architecture contract and rollout sequence. Existing shipped work in ADR-0159 remains valid and is extended here for node-0-to-fleet evolution.
## PDF Brain Reference Pack (Book Corpus)
These references are the research substrate for this ADR and should be used by the pdf-brain skill when expanding hardening work.
| Domain | Book / Doc ID | Retrieval anchors (chunk IDs) | Practical idea for node-0 / k8s |
|---|---|---|---|
| Partial failure & nondeterminism | designing-dataintensive-applications-39cc0d1842a5 | s116:n1, s117, s138:n2 | Never trust localhost-only health. Resolve and verify across host, VM witness, and service DNS paths. |
| Fast fail & dependency protection | release-it-michael-nygard-df70f05c7863 | s109:n0 | Use breaker-style escalation when dependency probes flap; avoid retry storms. |
| Blast-radius containment | release-it-michael-nygard-df70f05c7863 | s111 | Bulkhead probe classes (infra-critical, service-critical, app-level) so app failures don’t trigger infra resets. |
| Alert signal quality & reliability governance | site-reliability-engineering-how-google-runs-pro-36bc8fec5a69 | s42, s31:n1 | Keep pager criteria symptom-based, low-noise. Use invariant gates and release-risk throttling from error-budget posture. |
| Stable boundaries & portability | building-microservices-2nd-edition-sam-newman-88c27beee5d6 | s29:n0, s196:n0 | Keep host-specific bootstrap in adapters; keep core control loop platform-neutral for fleet growth. |
| Replay safety & dedupe | building-event-driven-microservices-adam-bellema-4843d259c45b | s161:n0 | Healing and repair loops must be idempotent, dedupe-aware, and safe to replay. |
| Compatibility-driven evolution | building-event-driven-microservices-adam-bellema-4843d259c45b | s63:n0 | Fleet contract/config changes must preserve forward/backward compatibility. |
| Unknown-unknown debugging | observability-engineering-achieving-production-e-65364c03bf43 | s36:n2, s107:n2 | Emit high-cardinality structured telemetry for probe path, resolver branch, authz drift, and heal tier. |
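The observability row above (high-cardinality structured telemetry for probe path, resolver branch, authz drift, and heal tier) can be sketched as a single JSON event renderer. Field names are illustrative, not an existing telemetry schema.

```python
import json
import time

def probe_event(probe_path, resolver_branch, authz_ok, heal_tier):
    """Render one probe decision as a structured JSON log line.

    High-cardinality dimensions stay as first-class keys so resolver
    fallbacks and authz drift are queryable after the fact, instead of
    being buried in free-text log messages.
    """
    return json.dumps({
        "ts": round(time.time(), 3),
        "probe_path": probe_path,            # e.g. "inngest-health"
        "resolver_branch": resolver_branch,  # "localhost" | "vm-ip" | "svc-dns"
        "authz_ok": authz_ok,
        "heal_tier": heal_tier,              # 0 = observe, higher = escalate
    })
```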
## Retrieval Instructions (pdf-brain)

### Baseline retrieval (repeatable)
```bash
# 1) Discover candidate evidence
joelclaw docs search "partial failure localhost split brain" --limit 8 --semantic true

# 2) Pin to source doc and retrieve focused windows
joelclaw docs search "partial failure nondeterministic" \
  --doc designing-dataintensive-applications-39cc0d1842a5 --limit 8 --semantic true

# 3) Expand exact chunk in local context
joelclaw docs context designing-dataintensive-applications-39cc0d1842a5:s116:n1 \
  --mode snippet-window --before 1 --after 1
```

### Expansion protocol (research → operations)
- Build an evidence ledger (`doc`, `chunk-id`, `claim`, `relevance`).
- Convert each claim to one operational principle (imperative, testable).
- Map each principle to one concrete node-0/k8s change (file/service/command path).
- Attach verification signals (exact command + expected healthy output).
- Attach failure signal + next escalation tier.
- If principle changes architecture policy, patch this ADR or superseding ADR in the same turn.
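The ledger and output shape above can be sketched as one record type; field names mirror the protocol, and nothing here is an existing joelclaw type.

```python
from dataclasses import dataclass

@dataclass
class EvidenceEntry:
    """One evidence-ledger row, enriched toward an executable change."""
    doc: str
    chunk_id: str
    claim: str
    relevance: str
    principle: str = ""        # imperative, testable
    infra_move: str = ""       # concrete node-0/k8s change (file/service/command)
    verify_cmd: str = ""       # exact command + expected healthy output
    failure_signal: str = ""
    next_escalation: str = ""

    def is_executable(self):
        """Actionable once principle, infra move, and verify are all filled."""
        return all([self.principle, self.infra_move, self.verify_cmd])
```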
### Required output shape for follow-on work
- Principle:
- Evidence (doc + chunk):
- k8s/infra move:
- Implementation path:
- Verify:
- Failure signal:
- Next escalation:

This keeps research traceable and directly executable for immediate infrastructure hardening.