ADR-0182

Node-0 Fleet Contract and Localhost Resilience

Status: shipped
Date: 2026-02-28
Updated: 2026-03-01
Deciders: Joel Hooks
Related: ADR-0029, ADR-0089, ADR-0148, ADR-0159

Context

We keep hitting the same class of reliability failures:

  • localhost ↔ Colima VM split-brain (localhost probe fails while VM-side service is healthy)
  • runtime/socket drift after Colima cycles
  • kubelet proxy permission drift (apiserver-kubelet-client authz breakage)
  • host-specific assumptions leaking into core logic (launchd, Colima paths, localhost-only probes)

This is fixable, but only if we stop treating Panda like a special one-off machine.

Panda is node-0 of a future fleet.

Decision

1) Hardening now (single Mac, node-0)

  1. Deploy the Talon split-brain fix that is already committed (68ba160) so runtime behavior matches repo state.
  2. Kill localhost as a hard dependency in operational checks (Talon + CLI + gateway health paths):
    • Resolution order:
      1. localhost
      2. Colima VM IP (default 192.168.64.2, discovered dynamically)
      3. Kubernetes service DNS (*.svc.cluster.local)
  3. Add RBAC drift guard for kubelet proxy permissions (the apiserver-kubelet-client breakage class).
  4. Run invariant checks after every Colima cycle:
    • Docker socket is healthy
    • kube API is reachable
    • kubectl exec/kubectl logs authorization works
    • Inngest and Typesense health endpoints pass
  5. Make runbook + auto-repair loop replay-safe and idempotent end-to-end.
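The endpoint fallback in item 2 can be sketched as a small resolver that walks the candidate list in order and returns the first endpoint whose probe succeeds. This is a minimal illustration, not the actual joelclaw implementation; the URLs, port, and probe function are assumptions for the example.

```python
from typing import Callable, Iterable, Optional

def resolve_endpoint(
    candidates: Iterable[str],
    probe: Callable[[str], bool],
) -> Optional[str]:
    """Return the first candidate endpoint whose probe succeeds.

    Candidates are tried in the ADR's fallback order:
    localhost -> Colima VM IP -> Kubernetes service DNS.
    """
    for endpoint in candidates:
        try:
            if probe(endpoint):
                return endpoint
        except Exception:
            # A raising probe is treated the same as an unreachable
            # endpoint; fall through to the next candidate.
            continue
    return None

# Fallback order from the decision above. The port and service name are
# illustrative; the VM IP would normally be discovered dynamically
# rather than hardcoded.
FALLBACK_ORDER = [
    "http://localhost:8288/health",
    "http://192.168.64.2:8288/health",
    "http://inngest.default.svc.cluster.local:8288/health",
]
```

In the real health paths, `probe` would be an HTTP GET with a short timeout; injecting it as a parameter keeps the resolution order testable without a live cluster.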

2) Prepare for multi-Mac / Linux

> ⚠️ Superseded by ADR-0184 — The fleet-prep items below are now tracked with full implementation detail, codebase audit, and phased plan in ADR-0184: Node-0 to Fleet — Platform Abstraction and Multi-Node Readiness.

  1. Platform-neutral control-plane contract (Talos configs + node labels as the contract; host bootstrap at edges only).
  2. No host-bound assumptions in core logic (launchd/Colima details live in adapters, not core decision logic).
  3. Multi-node networking trigger: revisit Cilium/service policy as soon as node-1 joins.
  4. Stateful service growth path defined now:
    • Redis: single-node now, HA topology spec maintained and ready
    • Typesense: single-node now, multi-node shard/replica plan maintained and ready
  5. ARM64/Linux-friendly workloads: multi-arch images and no macOS-only binaries in cluster workloads.

Consequences

Good

  • Split-brain detection becomes first-class instead of accidental.
  • Health checks degrade gracefully across endpoint classes instead of failing on localhost flaps.
  • RBAC drift is detected and corrected early.
  • Node-0 is a reusable bootstrap pattern for additional nodes.
  • Linux portability work starts now, not as a rewrite later.

Tradeoffs

  • Slightly more complexity in endpoint resolution and probe logic.
  • More explicit policy/config surface area to maintain.
  • Additional verification gates on restart/recovery flows.

Implementation Status (2026-03-01)

  • ✅ Endpoint fallback resolver landed for CLI, gateway heartbeat, and system health checks.
  • ✅ Talon VM witness probes fixed to run valid remote python socket checks over Colima SSH.
  • ✅ RBAC drift guard added for apiserver-kubelet-client / kube-apiserver-kubelet-client against nodes/proxy (get, create).
  • ✅ Post-Colima invariant gate added as one replay-safe check (docker socket, kube API, kubelet proxy authz, logs/exec path, Inngest, Typesense).
  • ✅ Talon now defers worker supervision when legacy launchd worker supervisor (com.joel.system-bus-worker) is loaded, preventing dual-supervisor restart thrash during cutover.
  • ✅ Talon launchd runtime validated in coexistence mode: watchdog healthy, no worker thrash, and all probes passing after voice-agent relink to LiveKit.
  • ✅ Warmup-aware gating landed (WARMUP_GRACE_SECS=120): transient failures (flannel not ready, ImagePullBackOff, health endpoint timeouts) are tolerated during grace window; hard faults (kube API unreachable, RBAC broken) fail immediately regardless. Commit af4dcea.
  • ✅ Voice-agent stale-process cleanup automated: infra/voice-agent/cleanup-stale.sh detects stale main.py workers holding port 8081, kills them, and kickstarts the launchd service. Called automatically after post-Colima invariant gate passes. Commit af4dcea.
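The warmup-aware gating described above reduces to a small classification rule: hard faults fail immediately, transient faults are tolerated only inside the grace window. A sketch of that rule, with fault names invented for illustration:

```python
# Failure classes from the warmup-aware gate. Hard faults always fail;
# transient faults are tolerated inside the grace window. The fault
# identifiers here are illustrative, not the actual check names.
HARD_FAULTS = {"kube_api_unreachable", "rbac_broken"}
TRANSIENT_FAULTS = {"flannel_not_ready", "image_pull_backoff", "health_timeout"}

WARMUP_GRACE_SECS = 120

def gate_verdict(fault: str, colima_start: float, now: float) -> str:
    """Classify a probe failure observed at `now` for a Colima cycle
    that began at `colima_start`.

    Returns "fail" for hard faults (always, regardless of warmup),
    "tolerate" for transient faults inside the grace window, and
    "fail" for transient faults that persist past it.
    """
    if fault in HARD_FAULTS:
        return "fail"
    in_grace = (now - colima_start) <= WARMUP_GRACE_SECS
    if fault in TRANSIENT_FAULTS and in_grace:
        return "tolerate"
    return "fail"
```

Keeping the classification pure (timestamps in, verdict out) makes the grace-window behavior trivially replayable in tests.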

Sequence (vector clock, not calendar)

  1. After ADR-0159 bridge-heal code is present → deploy Talon runtime from 68ba160.
  2. After runtime matches source → switch operational checks to endpoint resolver fallback order.
  3. After resolver lands → add RBAC drift guard and enforce invariants post-Colima-cycle.
  4. After invariants are stable on node-0 → freeze platform-neutral control-plane contract.
  5. When node-1 is introduced → execute Cilium/service-policy revisit and stateful HA expansion plans.

Verification Gates

  • Force-cycle Colima; recovery succeeds without manual intervention; invariant suite passes.
  • Health checks continue to pass when localhost endpoints are intentionally disrupted but VM/service DNS paths remain healthy.
  • RBAC drift test reproduces kubelet-proxy authz failure and verifies guard recovery.
  • Talon repair loop can replay the same failure sequence multiple times without divergent side effects.
  • Multi-arch build pipeline produces runnable ARM64/Linux artifacts for cluster workloads.
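The replay-safety gate above hinges on the repair loop deduplicating work. One way to sketch that property, assuming each repair is keyed by a (failure, tier) fingerprint, so replaying the same failure sequence is a no-op the second time:

```python
from typing import Callable

class RepairLoop:
    """Replay-safe repair loop sketch: each (failure, tier) fingerprint
    runs its repair action at most once, so replaying the same failure
    sequence produces no divergent side effects. Names are illustrative,
    not the actual Talon implementation."""

    def __init__(self) -> None:
        self.applied: set[tuple[str, str]] = set()
        self.log: list[str] = []

    def heal(self, failure: str, tier: str, action: Callable[[], None]) -> bool:
        key = (failure, tier)
        if key in self.applied:
            # Dedupe: this failure was already repaired at this tier;
            # a replay becomes a no-op instead of a second restart.
            return False
        action()
        self.applied.add(key)
        self.log.append(f"{failure}@{tier}")
        return True
```

The verification gate then amounts to replaying a recorded failure sequence N times and asserting the side-effect log is identical to a single pass.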

Notes

This ADR sets the architecture contract and rollout sequence. Existing shipped work in ADR-0159 remains valid and is extended here for node-0-to-fleet evolution.

PDF Brain Reference Pack (Book Corpus)

These references are the research substrate for this ADR and should be used by the pdf-brain skill when expanding hardening work.

| Domain | Book / Doc ID | Retrieval anchors (chunk IDs) | Practical idea for node-0 / k8s |
|---|---|---|---|
| Partial failure & nondeterminism | designing-dataintensive-applications-39cc0d1842a5 | s116:n1, s117, s138:n2 | Never trust localhost-only health. Resolve and verify across host, VM witness, and service DNS paths. |
| Fast fail & dependency protection | release-it-michael-nygard-df70f05c7863 | s109:n0 | Use breaker-style escalation when dependency probes flap; avoid retry storms. |
| Blast-radius containment | release-it-michael-nygard-df70f05c7863 | s111 | Bulkhead probe classes (infra-critical, service-critical, app-level) so app failures don’t trigger infra resets. |
| Alert signal quality & reliability governance | site-reliability-engineering-how-google-runs-pro-36bc8fec5a69 | s42, s31:n1 | Keep pager criteria symptom-based, low-noise. Use invariant gates and release-risk throttling from error-budget posture. |
| Stable boundaries & portability | building-microservices-2nd-edition-sam-newman-88c27beee5d6 | s29:n0, s196:n0 | Keep host-specific bootstrap in adapters; keep core control loop platform-neutral for fleet growth. |
| Replay safety & dedupe | building-event-driven-microservices-adam-bellema-4843d259c45b | s161:n0 | Healing and repair loops must be idempotent, dedupe-aware, and safe to replay. |
| Compatibility-driven evolution | building-event-driven-microservices-adam-bellema-4843d259c45b | s63:n0 | Fleet contract/config changes must preserve forward/backward compatibility. |
| Unknown-unknown debugging | observability-engineering-achieving-production-e-65364c03bf43 | s36:n2, s107:n2 | Emit high-cardinality structured telemetry for probe path, resolver branch, authz drift, and heal tier. |

Retrieval Instructions (pdf-brain)

Baseline retrieval (repeatable)

```shell
# 1) Discover candidate evidence
joelclaw docs search "partial failure localhost split brain" --limit 8 --semantic true

# 2) Pin to source doc and retrieve focused windows
joelclaw docs search "partial failure nondeterministic" \
  --doc designing-dataintensive-applications-39cc0d1842a5 --limit 8 --semantic true

# 3) Expand exact chunk in local context
joelclaw docs context designing-dataintensive-applications-39cc0d1842a5:s116:n1 \
  --mode snippet-window --before 1 --after 1
```

Expansion protocol (research → operations)

  1. Build an evidence ledger (doc, chunk-id, claim, relevance).
  2. Convert each claim to one operational principle (imperative, testable).
  3. Map each principle to one concrete node-0/k8s change (file/service/command path).
  4. Attach verification signals (exact command + expected healthy output).
  5. Attach failure signal + next escalation tier.
  6. If principle changes architecture policy, patch this ADR or superseding ADR in the same turn.

Required output shape for follow-on work

- Principle:
- Evidence (doc + chunk):
- k8s/infra move:
- Implementation path:
- Verify:
- Failure signal:
- Next escalation:

This keeps research traceable and directly executable for immediate infrastructure hardening.