# Node-0 Fleet Contract and Localhost Resilience
Status: shipped
Date: 2026-02-28
Updated: 2026-03-01
Deciders: Joel Hooks
Related: ADR-0029, ADR-0089, ADR-0148, ADR-0159
## Context
We keep hitting the same reliability class:
- localhost ↔ Colima VM split-brain (localhost probe fails while VM-side service is healthy)
- runtime/socket drift after Colima cycles
- kubelet proxy permission drift (`apiserver-kubelet-client` authz breakage)
- host-specific assumptions leaking into core logic (launchd, Colima paths, localhost-only probes)
This is fixable, but only if we stop treating Panda like a special one-off machine.
Panda is node-0 of a future fleet.
## Decision
### 1) Hardening now (single Mac, node-0)
- Deploy the Talon split-brain fix that is already committed (`68ba160`) so runtime behavior matches repo state.
- Kill localhost as a hard dependency in operational checks (Talon + CLI + gateway health paths). Resolution order:
  1. localhost
  2. Colima VM IP (default `192.168.64.2`, discovered dynamically)
  3. Kubernetes service DNS (`*.svc.cluster.local`)
- Add RBAC drift guard for kubelet proxy permissions (the `apiserver-kubelet-client` breakage class).
- Run invariant checks after every Colima cycle:
  - Docker socket is healthy
  - kube API is reachable
  - `kubectl exec`/`kubectl logs` authorization works
  - Inngest and Typesense health endpoints pass
- Make runbook + auto-repair loop replay-safe and idempotent end-to-end.
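The endpoint resolution order above can be sketched as a small fallback resolver. The candidate hosts and ports here are illustrative assumptions (the real config discovers the VM IP dynamically and uses the actual service names), but the fall-through behavior is the point: a localhost flap does not fail the check while a later endpoint class is healthy.

```python
import socket

# Hypothetical candidate list for one service, in the ADR's resolution
# order. Hostnames and the port are illustrative, not the real config.
CANDIDATES = [
    ("localhost", 8288),                          # host loopback
    ("192.168.64.2", 8288),                       # Colima VM IP (default)
    ("inngest.default.svc.cluster.local", 8288),  # in-cluster service DNS
]

def resolve_endpoint(candidates, timeout=1.0):
    """Return the first (host, port) that accepts a TCP connection.

    Walks the candidate list in order, so a dead localhost endpoint
    falls through to the VM IP or service DNS instead of hard-failing.
    Returns None only when every endpoint class is unreachable.
    """
    for host, port in candidates:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return host, port
        except OSError:
            continue
    return None
```

A real probe would layer an application-level health check on top of the raw TCP connect, but the resolver shape stays the same.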
### 2) Prepare for multi-Mac / Linux
> ⚠️ Superseded by ADR-0184 — The fleet-prep items below are now tracked with full implementation detail, codebase audit, and phased plan in ADR-0184: Node-0 to Fleet — Platform Abstraction and Multi-Node Readiness.
- Platform-neutral control-plane contract (Talos configs + node labels as the contract; host bootstrap at edges only).
- No host-bound assumptions in core logic (launchd/Colima details live in adapters, not core decision logic).
- Multi-node networking trigger: revisit Cilium/service policy as soon as node-1 joins.
- Stateful service growth path defined now:
  - Redis: single-node now, HA topology spec maintained and ready
  - Typesense: single-node now, multi-node shard/replica plan maintained and ready
- ARM64/Linux-friendly workloads: multi-arch images and no macOS-only binaries in cluster workloads.
## Consequences

### Good
- Split-brain detection becomes first-class instead of accidental.
- Health checks degrade gracefully across endpoint classes instead of failing on localhost flaps.
- RBAC drift is detected and corrected early.
- Node-0 is a reusable bootstrap pattern for additional nodes.
- Linux portability work starts now, not as a rewrite later.
### Tradeoffs
- Slightly more complexity in endpoint resolution and probe logic.
- More explicit policy/config surface area to maintain.
- Additional verification gates on restart/recovery flows.
## Implementation Status (2026-03-01)
- ✅ Endpoint fallback resolver landed for CLI, gateway heartbeat, and system health checks.
- ✅ Talon VM witness probes fixed to run valid remote python socket checks over Colima SSH.
- ✅ RBAC drift guard added for `apiserver-kubelet-client`/`kube-apiserver-kubelet-client` against `nodes/proxy` (get, create).
- ✅ Post-Colima invariant gate added as one replay-safe check (docker socket, kube API, kubelet proxy authz, logs/exec path, Inngest, Typesense).
- ✅ Talon now defers worker supervision when the legacy launchd worker supervisor (`com.joel.system-bus-worker`) is loaded, preventing dual-supervisor restart thrash during cutover.
- ✅ Talon launchd runtime validated in coexistence mode: watchdog healthy, no worker thrash, and all probes passing after voice-agent relink to LiveKit.
- ✅ Warmup-aware gating landed (`WARMUP_GRACE_SECS=120`): transient failures (flannel not ready, ImagePullBackOff, health endpoint timeouts) are tolerated during the grace window; hard faults (kube API unreachable, RBAC broken) fail immediately regardless. Commit `af4dcea`.
- ✅ Voice-agent stale-process cleanup automated: `infra/voice-agent/cleanup-stale.sh` detects stale `main.py` workers holding port 8081, kills them, and kickstarts the launchd service. Called automatically after the post-Colima invariant gate passes. Commit `af4dcea`.
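The warmup-aware gating decision can be sketched as a pure verdict function. The fault names and the `WARMUP_GRACE_SECS=120` default come from this ADR; everything else (verdict strings, call shape) is an illustrative assumption, not the shipped implementation.

```python
# Hard faults fail immediately regardless of warmup; names are
# illustrative labels for the ADR's fault classes.
HARD_FAULTS = {"kube-api-unreachable", "rbac-broken"}
WARMUP_GRACE_SECS = 120

def gate_verdict(failures, elapsed_secs, grace=WARMUP_GRACE_SECS):
    """Decide whether the invariant gate passes, waits, or fails.

    - Any hard fault -> "fail", even inside the grace window.
    - Transient failures (image pulls, CNI not ready, health endpoint
      timeouts) -> "warming" while elapsed_secs < grace, so the gate
      re-checks instead of tripping the repair loop.
    - No failures -> "pass"; transient failures past grace -> "fail".
    """
    if any(f in HARD_FAULTS for f in failures):
        return "fail"
    if failures and elapsed_secs < grace:
        return "warming"
    return "fail" if failures else "pass"
```

The "warming" state is what keeps the repair loop replay-safe: re-running the gate during warmup is a no-op rather than an escalation.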
## Sequence (vector clock, not calendar)
- After ADR-0159 bridge-heal code is present → deploy Talon runtime from `68ba160`.
- After runtime matches source → switch operational checks to the endpoint resolver fallback order.
- After resolver lands → add RBAC drift guard and enforce invariants post-Colima-cycle.
- After invariants are stable on node-0 → freeze platform-neutral control-plane contract.
- When node-1 is introduced → execute Cilium/service-policy revisit and stateful HA expansion plans.
## Verification Gates
- Force-cycle Colima; recovery succeeds without manual intervention; invariant suite passes.
- Health checks continue to pass when localhost endpoints are intentionally disrupted but VM/service DNS paths remain healthy.
- RBAC drift test reproduces kubelet-proxy authz failure and verifies guard recovery.
- Talon repair loop can replay the same failure sequence multiple times without divergent side effects.
- Multi-arch build pipeline produces runnable ARM64/Linux artifacts for cluster workloads.
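The RBAC drift gate can be sketched as a pure check over required grants, with the cluster probe injected so the logic is testable without a live cluster. The `nodes/proxy` get/create pairs come from this ADR; the `kubectl auth can-i` invocation and the impersonated user are assumptions to adapt to the real kubelet-client identity.

```python
import subprocess

# The (resource, verb) pairs the kubelet-proxy path needs, per this
# ADR's drift class.
REQUIRED = [("nodes/proxy", "get"), ("nodes/proxy", "create")]

def missing_grants(can_i):
    """Return required (resource, verb) pairs that the probe denies.

    `can_i(resource, verb) -> bool` is injected, so the drift logic
    can be exercised with a fake probe in tests.
    """
    return [(r, v) for r, v in REQUIRED if not can_i(r, v)]

def kubectl_can_i(resource, verb):
    """Cluster probe sketch using kubectl impersonation (assumed identity)."""
    out = subprocess.run(
        ["kubectl", "auth", "can-i", verb, resource,
         "--as=system:kube-apiserver-kubelet-client"],
        capture_output=True, text=True,
    )
    return out.stdout.strip() == "yes"
```

A non-empty result from `missing_grants(kubectl_can_i)` is the drift signal; the guard's repair step would then reapply the expected ClusterRole binding.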
## Notes
This ADR sets the architecture contract and rollout sequence. Existing shipped work in ADR-0159 remains valid and is extended here for node-0-to-fleet evolution.
## PDF Brain Reference Pack (Book Corpus)
These references are the research substrate for this ADR and should be used by the pdf-brain skill when expanding hardening work.
| Domain | Book / Doc ID | Retrieval anchors (chunk IDs) | Practical idea for node-0 / k8s |
|---|---|---|---|
| Partial failure & nondeterminism | designing-dataintensive-applications-39cc0d1842a5 | s116:n1, s117, s138:n2 | Never trust localhost-only health. Resolve and verify across host, VM witness, and service DNS paths. |
| Fast fail & dependency protection | release-it-michael-nygard-df70f05c7863 | s109:n0 | Use breaker-style escalation when dependency probes flap; avoid retry storms. |
| Blast-radius containment | release-it-michael-nygard-df70f05c7863 | s111 | Bulkhead probe classes (infra-critical, service-critical, app-level) so app failures don’t trigger infra resets. |
| Alert signal quality & reliability governance | site-reliability-engineering-how-google-runs-pro-36bc8fec5a69 | s42, s31:n1 | Keep pager criteria symptom-based, low-noise. Use invariant gates and release-risk throttling from error-budget posture. |
| Stable boundaries & portability | building-microservices-2nd-edition-sam-newman-88c27beee5d6 | s29:n0, s196:n0 | Keep host-specific bootstrap in adapters; keep core control loop platform-neutral for fleet growth. |
| Replay safety & dedupe | building-event-driven-microservices-adam-bellema-4843d259c45b | s161:n0 | Healing and repair loops must be idempotent, dedupe-aware, and safe to replay. |
| Compatibility-driven evolution | building-event-driven-microservices-adam-bellema-4843d259c45b | s63:n0 | Fleet contract/config changes must preserve forward/backward compatibility. |
| Unknown-unknown debugging | observability-engineering-achieving-production-e-65364c03bf43 | s36:n2, s107:n2 | Emit high-cardinality structured telemetry for probe path, resolver branch, authz drift, and heal tier. |
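The observability row above (high-cardinality structured telemetry for probe path, resolver branch, authz drift, and heal tier) can be sketched as a single JSON event renderer. Field names are illustrative, not an existing telemetry schema.

```python
import json
import time

def probe_event(probe_path, resolver_branch, authz_ok, heal_tier):
    """Render one probe decision as a structured JSON log line.

    High-cardinality dimensions stay as first-class keys so resolver
    fallbacks and authz drift are queryable after the fact, instead of
    being buried in free-text log messages.
    """
    return json.dumps({
        "ts": round(time.time(), 3),
        "probe_path": probe_path,            # e.g. "inngest-health"
        "resolver_branch": resolver_branch,  # "localhost" | "vm-ip" | "svc-dns"
        "authz_ok": authz_ok,
        "heal_tier": heal_tier,              # 0 = observe, higher = escalate
    })
```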
## Retrieval Instructions (pdf-brain)

### Baseline retrieval (repeatable)
```bash
# 1) Discover candidate evidence
joelclaw docs search "partial failure localhost split brain" --limit 8 --semantic true

# 2) Pin to source doc and retrieve focused windows
joelclaw docs search "partial failure nondeterministic" \
  --doc designing-dataintensive-applications-39cc0d1842a5 --limit 8 --semantic true

# 3) Expand exact chunk in local context
joelclaw docs context designing-dataintensive-applications-39cc0d1842a5:s116:n1 \
  --mode snippet-window --before 1 --after 1
```

### Expansion protocol (research → operations)
- Build an evidence ledger (`doc`, `chunk-id`, `claim`, `relevance`).
- Convert each claim to one operational principle (imperative, testable).
- Map each principle to one concrete node-0/k8s change (file/service/command path).
- Attach verification signals (exact command + expected healthy output).
- Attach failure signal + next escalation tier.
- If principle changes architecture policy, patch this ADR or superseding ADR in the same turn.
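The ledger and output shape above can be sketched as one record type; field names mirror the protocol, and nothing here is an existing joelclaw type.

```python
from dataclasses import dataclass

@dataclass
class EvidenceEntry:
    """One evidence-ledger row, enriched toward an executable change."""
    doc: str
    chunk_id: str
    claim: str
    relevance: str
    principle: str = ""        # imperative, testable
    infra_move: str = ""       # concrete node-0/k8s change (file/service/command)
    verify_cmd: str = ""       # exact command + expected healthy output
    failure_signal: str = ""
    next_escalation: str = ""

    def is_executable(self):
        """Actionable once principle, infra move, and verify are all filled."""
        return all([self.principle, self.infra_move, self.verify_cmd])
```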
### Required output shape for follow-on work
- Principle:
- Evidence (doc + chunk):
- k8s/infra move:
- Implementation path:
- Verify:
- Failure signal:
- Next escalation:

This keeps research traceable and directly executable for immediate infrastructure hardening.