ADR-0184

Node-0 to Fleet — Platform Abstraction and Multi-Node Readiness

Status: proposed
Date: 2026-03-01
Deciders: Joel Hooks
Related: ADR-0029, ADR-0089, ADR-0148, ADR-0159, ADR-0182
Supersedes: ADR-0182 Section 2 (“Prepare for multi-Mac / Linux”)

Context

ADR-0182 shipped the “hardening now” work: warmup-aware gates, voice-agent cleanup, RBAC drift guards, post-Colima invariant checks. Panda is reliable as a single node.

But the fleet-prep section of ADR-0182 remained aspirational. A codebase audit reveals the actual state:

Current Host-Bound Debt

| Category | Count | Files Affected | Severity |
| --- | --- | --- | --- |
| launchctl calls in core packages | 34 | 11 files across cli, gateway, system-bus | High — direct macOS process manager calls in business logic |
| launchd references (comments + logic) | 48 | Same 11 + additional | Medium — some are comments, but many drive restart/heal behavior |
| /Users/joel hardcoded paths | 62 | 20+ files | High — breaks on any machine with a different username or home dir |
| Colima/192.168.64 references | 5 | endpoint-resolver, network-status, seed-network | Low — mostly in the resolver, which already abstracts this |

What This Means

Panda cannot be cloned. If Joel buys a second Mac Mini tomorrow and wants it running joelclaw workloads, the setup would require:

  1. Creating a joel user account (or rewriting 62 path references)
  2. Installing identical launchd plists (or the gateway/CLI will fail to restart services)
  3. Running Colima in the exact same network config (or endpoint resolution breaks)
  4. Manually configuring every service that assumes single-node

This is the opposite of a fleet-ready system. The goal of this ADR is to make node-1 joinable with minimal manual setup.

Decision

Guiding Principles (from the book corpus)

| Principle | Source | Application |
| --- | --- | --- |
| Partial failures are nondeterministic — never trust a single probe path | Kleppmann, DDIA §8 (s116:n1) | Already shipped in ADR-0182 endpoint resolver. Extend to all service discovery. |
| Fast fail with dependency protection — breaker-style escalation | Nygard, Release It! (s109:n0) | Tier-based escalation in Talon already implements this. Preserve when abstracting. |
| Blast-radius containment via bulkheads | Nygard, Release It! (s111) | Probe classes (infra-critical, service-critical, app-level) must survive platform adapter changes. |
| Stable boundaries for portability | Newman, Building Microservices (s29:n0, s196:n0) | Host bootstrap lives in adapters; core control loop stays platform-neutral. |
| Sidecar pattern for cross-cutting concerns | Ford & Parsons, Building Evolutionary Architectures (s60) | Platform-specific operations (process management, service discovery) as injectable sidecars, not inline calls. |
| Replay safety and idempotency | Bellemare, Building Event-Driven Microservices (s161:n0) | All heal/repair operations must remain idempotent after platform abstraction. |
| High-cardinality structured telemetry | Majors et al., Observability Engineering (s36:n2) | Emit platform, node_id, endpoint_class on every probe for fleet-wide debugging. |
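The telemetry principle can be sketched as a structured probe event. A minimal sketch: the `platform`, `node_id`, and `endpoint_class` fields come from the principle itself; the other field names and the `probeEvent` helper are illustrative assumptions, not an existing API.

```typescript
// Sketch of a high-cardinality probe event. platform, node_id, and
// endpoint_class are the fields named in the principle above; the
// remaining fields and this helper are assumed, not existing code.
interface ProbeEvent {
  probe: string;
  ok: boolean;
  latency_ms: number;
  node_id: string;
  platform: NodeJS.Platform;
  endpoint_class: "localhost" | "vm" | "svc-dns";
}

export function probeEvent(
  probe: string,
  ok: boolean,
  latencyMs: number,
  endpointClass: ProbeEvent["endpoint_class"],
): ProbeEvent {
  return {
    probe,
    ok,
    latency_ms: latencyMs,
    // Falls back to the current node name when no fleet identity is set
    node_id: process.env.NODE_ID ?? "panda",
    platform: process.platform,
    endpoint_class: endpointClass,
  };
}
```

Every probe emitting this shape makes fleet-wide dashboards a simple group-by on `node_id` and `endpoint_class`.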

Phase 1: Platform Adapter Interfaces (the critical path)

Extract host-specific operations behind interfaces. This is the gating work — everything else depends on it.

1a. ProcessManagerPort interface

// packages/process-manager/src/types.ts
interface ProcessManagerPort {
  /** List managed services matching a label pattern */
  list(pattern: string): Promise<ManagedService[]>;
  /** Start or restart a service by label */
  restart(label: string): Promise<void>;
  /** Stop a service */
  stop(label: string): Promise<void>;
  /** Check if a service is loaded and get its PID */
  inspect(label: string): Promise<ServiceState>;
  /** Disable a service (survives reboot) */
  disable(label: string): Promise<void>;
  /** Enable and bootstrap a service */
  enable(label: string, config?: ServiceConfig): Promise<void>;
}

  • macOS adapter: wraps launchctl calls (bootout, bootstrap, kickstart, print, list, disable).
  • Linux adapter: wraps systemctl calls (equivalent operations).
  • Null adapter: for testing and environments where process management is external.
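A minimal sketch of the macOS adapter, assuming a promisified `child_process` wrapper. The `gui/<uid>` domain target and the label-matching logic are assumptions; only the launchctl subcommand mapping comes from this ADR.

```typescript
// Hypothetical macOS adapter sketch. The gui/<uid> domain and label
// conventions are assumptions to be checked against the real plists.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

interface ManagedService {
  label: string;
  pid: number | null;
}

/** Parse `launchctl list` output: tab-separated PID, Status, Label columns. */
export function parseLaunchctlList(stdout: string, pattern: string): ManagedService[] {
  return stdout
    .split("\n")
    .slice(1) // skip the "PID Status Label" header row
    .filter((line) => line.trim().length > 0)
    .map((line) => {
      const [pid, , label] = line.split("\t");
      return { label, pid: pid === "-" ? null : Number(pid) };
    })
    .filter((svc) => svc.label.includes(pattern));
}

export class LaunchdProcessManager {
  private domain = `gui/${process.getuid?.() ?? 501}`;

  async list(pattern: string): Promise<ManagedService[]> {
    const { stdout } = await run("launchctl", ["list"]);
    return parseLaunchctlList(stdout, pattern);
  }

  async restart(label: string): Promise<void> {
    // kickstart -k kills and restarts the service in one call
    await run("launchctl", ["kickstart", "-k", `${this.domain}/${label}`]);
  }

  async stop(label: string): Promise<void> {
    await run("launchctl", ["bootout", `${this.domain}/${label}`]);
  }
}
```

The Linux adapter would mirror this shape over `systemctl`, and the null adapter would return canned `ManagedService` records.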

Current callsites to migrate (11 files, 34 launchctl invocations):

  • packages/cli/src/commands/gateway.ts — gateway start/stop/kill/status
  • packages/cli/src/commands/inngest.ts — worker source repair, launchd drift detection
  • packages/cli/src/commands/nas.ts — NAS mount service kickstart
  • packages/cli/src/commands/logs.ts — worker status hints
  • packages/cli/src/commands/status.ts — worker restart suggestion
  • packages/cli/src/inngest.ts — agent-mail launchd check
  • packages/cli/src/typesense-auth.ts — agent-secrets restart hints
  • packages/gateway/src/channels/imessage.ts — imsg-rpc launchd heal
  • packages/gateway/src/channels/telegram.ts — hard stop (launchctl disable)
  • packages/gateway/src/daemon.ts — self-restart expectation
  • packages/system-bus/src/inngest/functions/network-status-update.ts — launchctl list probe

1b. NodeIdentity and path resolution

Replace all 62 /Users/joel hardcoded paths with resolved paths:

// packages/node-identity/src/index.ts
interface NodeIdentity {
  /** Node identifier (e.g., "panda", "koala") */
  nodeId: string;
  /** Home directory (process.env.HOME) */
  homeDir: string;
  /** Monorepo root */
  repoRoot: string;
  /** Vault path */
  vaultPath: string;
  /** Platform: "darwin" | "linux" */
  platform: NodeJS.Platform;
  /** Architecture: "arm64" | "x64" */
  arch: string;
}

Resolution order:

  1. NODE_ID env var (explicit fleet identity)
  2. hostname (fallback)
  3. Path derivation from HOME, JOELCLAW_ROOT, VAULT_PATH env vars
  4. Current hardcoded defaults as last resort (backward compatible)
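The resolution order above can be sketched as a single resolver function. The `Vault` directory name and the repo-root fallback are assumptions standing in for the current hardcoded defaults.

```typescript
// Hypothetical resolver sketch following the documented resolution order.
// The fallback path segments are assumptions mirroring current defaults.
import * as os from "node:os";
import * as path from "node:path";

interface NodeIdentity {
  nodeId: string;
  homeDir: string;
  repoRoot: string;
  vaultPath: string;
  platform: NodeJS.Platform;
  arch: string;
}

export function resolveNodeIdentity(env: NodeJS.ProcessEnv = process.env): NodeIdentity {
  // 1. Explicit fleet identity; 2. hostname fallback
  const nodeId = env.NODE_ID ?? os.hostname().split(".")[0];
  // 3. Derivation from env vars; 4. hardcoded-style defaults as last resort
  const homeDir = env.HOME ?? os.homedir();
  const repoRoot = env.JOELCLAW_ROOT ?? path.join(homeDir, "Code/joelhooks/joelclaw");
  const vaultPath = env.VAULT_PATH ?? path.join(homeDir, "Vault");
  return {
    nodeId,
    homeDir,
    repoRoot,
    vaultPath,
    platform: process.platform,
    arch: process.arch,
  };
}
```

Core packages would import this once at startup instead of interpolating `/Users/joel` inline.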

1c. ContainerRuntimePort interface

Abstract Colima-specific operations (already partially done in endpoint-resolver):

interface ContainerRuntimePort {
  /** Check runtime health */
  isHealthy(): Promise<boolean>;
  /** Get VM IP address (if applicable) */
  vmIp(): Promise<string | null>;
  /** Get uptime in seconds */
  uptimeSeconds(): Promise<number | null>;
  /** SSH config path (for VM-based runtimes) */
  sshConfig(): string | null;
}

  • Colima adapter: current behavior.
  • Native Docker adapter: for bare-metal Linux where Docker runs natively (no VM).
  • Remote adapter: for nodes where the container runtime is on a different host.
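By analogy with the ProcessManagerPort null adapter, a null container-runtime adapter keeps tests from shelling out to Colima. This class is a sketch I am adding for illustration, not an existing package.

```typescript
// Hypothetical null adapter for tests and for hosts where the container
// runtime is managed externally. The interface is redeclared here so the
// sketch is self-contained.
interface ContainerRuntimePort {
  isHealthy(): Promise<boolean>;
  vmIp(): Promise<string | null>;
  uptimeSeconds(): Promise<number | null>;
  sshConfig(): string | null;
}

export class NullContainerRuntime implements ContainerRuntimePort {
  // Always reports healthy so probes exercise app logic, not the runtime
  async isHealthy(): Promise<boolean> {
    return true;
  }
  // No VM exists for native or remote runtimes
  async vmIp(): Promise<string | null> {
    return null;
  }
  async uptimeSeconds(): Promise<number | null> {
    return null;
  }
  sshConfig(): string | null {
    return null;
  }
}
```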

Phase 2: Multi-Arch Image Pipeline

Current state: k8s/publish-system-bus-worker.sh builds single-arch (ARM64 because Panda is ARM64). No buildx, no multi-arch manifest.

Target: Multi-arch manifest so images work on ARM64 Mac Minis AND x86 Linux boxes.

# publish-system-bus-worker.sh (updated)
docker buildx create --name joelclaw-builder --use 2>/dev/null || true
docker buildx build \
  --platform linux/arm64,linux/amd64 \
  -f "$ROOT_DIR/packages/system-bus/Dockerfile" \
  -t "$IMAGE" \
  -t "$LATEST_IMAGE" \
  --push \
  "$ROOT_DIR"

Key considerations:

  • ARM64 is native (fast build on Panda), AMD64 is cross-compiled (slower but tolerable for CI)
  • GHCR supports OCI multi-arch manifests natively
  • Kubernetes auto-selects the correct platform image from the manifest
  • No changes needed to k8s manifests — image: field stays the same

Phase 3: Stateful Service HA Topology Specs

Not implementing HA now — single-node Redis and Typesense are fine for current load. But the specs must exist so node-1 expansion doesn’t require architecture decisions under pressure.

Redis

| Topology | Nodes | How |
| --- | --- | --- |
| Current | 1 (StatefulSet, single replica) | Direct connection from all clients |
| Target (node-1+) | 3 (Sentinel, Bitnami Helm chart) | 1 master + 2 replicas, Sentinel quorum of 2, HAProxy frontend for non-Sentinel-aware clients |

Key: joelclaw uses ioredis everywhere, which has native Sentinel support. No HAProxy needed if all clients use ioredis Sentinel mode.
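A sketch of building Sentinel-aware connection options for ioredis clients. The master group name `mymaster` and the sentinel DNS names in the usage comment are assumptions that would need to match the Bitnami chart's values; ioredis itself resolves the current master through the listed sentinels.

```typescript
// Sketch of Sentinel connection options for ioredis clients.
// "mymaster" and the service DNS names are assumptions.
interface SentinelAddress {
  host: string;
  port: number;
}

interface SentinelOptions {
  sentinels: SentinelAddress[];
  name: string; // master group name as registered with Sentinel
}

export function sentinelOptions(hosts: string[], masterName = "mymaster"): SentinelOptions {
  return {
    sentinels: hosts.map((host) => ({ host, port: 26379 })), // default Sentinel port
    name: masterName,
  };
}

// Usage with ioredis (resolves the current master via Sentinel):
//   import Redis from "ioredis";
//   const redis = new Redis(sentinelOptions([
//     "redis-sentinel-0.redis-headless",
//     "redis-sentinel-1.redis-headless",
//     "redis-sentinel-2.redis-headless",
//   ]));
```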

Helm values spec (ready to deploy):

# k8s/redis-ha-values.yaml (spec, not deployed)
sentinel:
  enabled: true
  quorum: 2
replica:
  replicaCount: 3
auth:
  enabled: false  # internal cluster, Tailscale mesh provides transport security

Typesense

| Topology | Nodes | How |
| --- | --- | --- |
| Current | 1 (StatefulSet, single replica) | Direct connection |
| Target (node-1+) | 3 (built-in Raft consensus) | Typesense natively supports multi-node via --peers flag |

# k8s/typesense-ha-values.yaml (spec, not deployed)
replicaCount: 3
peering:
  enabled: true
  # Typesense uses Raft for leader election, no external coordination needed

Inngest

| Topology | Nodes | How |
| --- | --- | --- |
| Current | 1 (StatefulSet, single replica) | SQLite state backend |
| Target (node-1+) | 1 (but on persistent storage) | Inngest self-hosted is single-instance. HA = fast restart on any node + persistent volume. Consider Postgres state backend for shared access. |

Phase 4: Network Policy Foundation

Not deploying Cilium yet. Flannel is working, and Cilium migration on a running cluster is complex (requires CNI replacement, kube-proxy swap). But the readiness work is concrete:

  1. Document current Flannel config — capture pod CIDR, service CIDR, VXLAN port
  2. Write Cilium values file — k8s/cilium-values.yaml (spec, not deployed)
  3. Network policy manifests — deny-all default + explicit allow rules for current service mesh
  4. L2 announcement config — Cilium L2 mode replaces MetalLB need (Cilium handles LoadBalancer IPs)
  5. Migration runbook — Flannel→Cilium node-by-node migration steps (from Calico→Cilium guide adapted)
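The deny-all default from item 3 can be sketched as a standard Kubernetes NetworkPolicy. The namespace name is an assumption; per-service allow rules would follow the same shape with specific pod selectors.

```yaml
# Hypothetical deny-all default (spec, not deployed); namespace is an assumption
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: joelclaw        # assumption: apply per namespace
spec:
  podSelector: {}            # selects every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
```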

Trigger: deploy when node-1 physically exists and needs inter-node pod networking.

Phase 5: Fleet Bootstrap Automation

Once Phases 1-4 land, a new node joins with:

# On new Mac Mini "koala":
# 1. Install Colima + Talos (or bare-metal Talos)
# 2. Join existing Talos cluster
talosctl gen config joelclaw-cluster https://panda.tailnet:6443
talosctl apply-config --nodes koala.tailnet --file controlplane.yaml
 
# 3. Set node identity
export NODE_ID=koala
export JOELCLAW_ROOT=/Users/joel/Code/joelhooks/joelclaw  # or wherever
 
# 4. Workloads auto-schedule via k8s
# Multi-arch images just work
# Redis Sentinel auto-rebalances
# Typesense Raft adds a peer

Consequences

Good

  • Node-1 joinable without rewriting 62 path references or understanding launchd internals
  • Linux nodes possible — the same codebase runs on ARM64 Linux (Raspberry Pi cluster, cheap mini PCs)
  • Testing improves — null adapters for ProcessManagerPort mean unit tests don’t shell out to launchctl
  • Observability grows — node_id and platform tags on every probe enable fleet-wide dashboards
  • HA specs exist before the emergency — when a service goes down, the upgrade path is documented

Tradeoffs

  • Phase 1 is large — 11 files, 34 launchctl calls, 62 path references to migrate. Probably 3-4 coding sessions.
  • Over-engineering risk — fleet of 1 doesn’t need all this. But the debt is real (codebase literally can’t run on a different machine) and the abstractions improve testability regardless.
  • Colima abstraction may be premature — only 5 references, already behind endpoint-resolver. Low priority.

Won’t Do (explicit scope exclusion)

  • Cilium deployment — spec only until node-1 exists
  • Redis/Typesense HA deployment — spec only, deploy when load or availability requires it
  • Bare-metal Talos on Macs — Colima VM approach stays. Bare metal requires macOS removal which kills iMessage, Granola, and other macOS-only agent surfaces
  • Multi-cluster — single cluster, multi-node. Multi-cluster adds complexity with no current benefit

Implementation Sequence (vector clock)

  1. Phase 1a first → ProcessManagerPort + macOS adapter. Unblocks everything.
  2. Phase 1b parallel → NodeIdentity path resolution. Independent of 1a.
  3. Phase 1c deferred → ContainerRuntimePort. Only 5 callsites, already partially abstracted.
  4. Phase 2 after 1a+1b → Multi-arch builds. Needs working codebase on both platforms.
  5. Phase 3 any time → HA specs are documents, not code. Can draft independently.
  6. Phase 4 after node-1 hardware → Cilium values + migration runbook.
  7. Phase 5 after all above → Bootstrap automation validates the whole stack.

Verification Gates

| Gate | Condition | How to verify |
| --- | --- | --- |
| Phase 1a complete | Zero launchctl calls outside adapter package | `grep -rn 'launchctl' packages/ --include='*.ts' \| grep -v process-manager \| grep -v node_modules` returns empty |
| Phase 1b complete | Zero /Users/joel hardcoded paths | `grep -rn '/Users/joel' packages/ --include='*.ts' \| grep -v node_modules` returns empty |
| Phase 2 complete | Multi-arch manifest published to GHCR | `docker buildx imagetools inspect ghcr.io/joelhooks/system-bus-worker:latest` shows both linux/arm64 and linux/amd64 |
| Phase 3 complete | HA topology specs exist | k8s/redis-ha-values.yaml, k8s/typesense-ha-values.yaml, k8s/inngest-ha-notes.md exist and are reviewed |
| Phase 4 complete | Cilium spec and migration runbook exist | k8s/cilium-values.yaml and k8s/flannel-to-cilium-runbook.md exist |
| Phase 5 complete | Node join validated end to end | A second node joins the cluster and runs a test workload without manual path/service fixups |

PDF Brain Reference Pack

| Domain | Book / Doc ID | Chunk IDs | Fleet application |
| --- | --- | --- | --- |
| Partial failure & nondeterminism | designing-dataintensive-applications-39cc0d1842a5 | s116:n1, s116:n2 | Never trust single-node probe paths. Every health check must degrade through endpoint classes (localhost → VM → svc DNS). |
| Fast fail & dependency protection | release-it-michael-nygard-df70f05c7863 | s109:n0 | ProcessManagerPort adapters must fail fast with clear errors, not hang on missing launchctl. |
| Blast-radius containment | release-it-michael-nygard-df70f05c7863 | s111 | Platform adapter failures (launchd down) must not cascade into application-level failures. Probe tiers remain isolated. |
| Stable boundaries & portability | building-microservices-2nd-edition-sam-newman-88c27beee5d6 | s29:n0, s196:n0 | Host bootstrap in adapters, core loop platform-neutral. This is the entire thesis of Phase 1. |
| Sidecar pattern | building-evolutionary-architectures-2nd-edition-26211f9a3473 | s60 | Platform-specific operations as injectable dependencies, not inline shells. |
| Replay safety & idempotency | building-event-driven-microservices-adam-bellema-4843d259c45b | s161:n0 | Heal loops must stay idempotent after abstraction. Platform adapter + core loop = same replay guarantees. |
| Observability in distributed systems | observability-engineering-achieving-production-e-65364c03bf43 | s36:n2, s107:n2 | Emit node_id, platform, endpoint_class, adapter_type on every probe. Fleet debugging requires high-cardinality telemetry. |
| Stateful service growth | designing-dataintensive-applications-39cc0d1842a5 | s214:n3 | Stateful services (Redis, Typesense) need explicit growth topology specs before they need to grow. |
| SRE error budget posture | site-reliability-engineering-how-google-runs-pro-36bc8fec5a69 | s42, s31:n1 | Alert on symptoms, not causes. Fleet health = aggregate probe pass rate across nodes, not per-node launchd state. |

Retrieval Instructions

# Partial failure evidence
joelclaw docs context designing-dataintensive-applications-39cc0d1842a5:s116:n1 \
  --mode snippet-window --before 1 --after 1
 
# Stable boundaries
joelclaw docs context building-microservices-2nd-edition-sam-newman-88c27beee5d6:s29:n0 \
  --mode snippet-window --before 1 --after 1
 
# Blast radius
joelclaw docs context release-it-michael-nygard-df70f05c7863:s111 \
  --mode snippet-window --before 1 --after 1
 
# Sidecar pattern
joelclaw docs context building-evolutionary-architectures-2nd-edition-26211f9a3473:s60 \
  --mode snippet-window --before 1 --after 1

External Research References

| Topic | Source | Key Insight |
| --- | --- | --- |
| Mac Mini Talos cluster | 3-node Talos cluster on Mac Minis | 3-node HA quorum with each node as both control plane + worker is viable and commonly used. MinIO/NFS for storage, not S3-as-PV. |
| Talos single-node to multi | r/kubernetes: Talos as single node | Single-node Talos works but upgrades are all-or-nothing. Adding nodes later is straightforward via talosctl gen config. |
| Colima vs bare-metal tradeoff | r/kubernetes: VMs or bare metal | VMs (Colima/Proxmox) offer easier snapshot/destroy cycles. Bare metal offers better performance. For homelab: VMs win on ops simplicity. |
| Cilium on Talos/K3s | Cilium K3s docs | Cilium replaces both Flannel (CNI) and kube-proxy. ARM64 supported. L2 announcement mode eliminates MetalLB. |
| Flannel→Cilium migration | Calico to Cilium migration guide | Node-by-node migration with dual-CNI: label nodes, install Cilium with customConf: true, migrate nodes gradually. Adapts to Flannel→Cilium. |
| Multi-arch Docker builds | Multi-arch container images | docker buildx build --platform linux/arm64,linux/amd64 --push creates a single manifest tag. K8s auto-selects the correct arch. |
| Redis Sentinel on K8s | Bitnami Redis Sentinel Helm | 3-node Sentinel with Bitnami Helm chart. ioredis has native Sentinel support — no HAProxy needed if all clients use ioredis. |
| Mac Mini homelab consensus | r/homelab: 2nd Mac Mini | Community split: Mac Minis are power-efficient but expensive per GB RAM. Mini PCs (N100) are cheaper for pure compute. For joelclaw: Mac Minis win because macOS agent surfaces (iMessage, Granola, voice) require macOS. |
| Kubernetes distro comparison | Best K8s Distros 2025 | Talos: most secure, immutable, API-managed. Learning curve offset by operational simplicity. Right choice for fleet. |

Notes

This ADR is deliberately phased with explicit “spec only” markers on items that don’t need implementation until hardware exists. The critical path is Phase 1 (platform abstraction) which improves testability and code quality regardless of whether node-1 ever materializes.

ADR-0182 Section 2 items are subsumed here:

  • “Platform-neutral control-plane contract” → Phase 1 + Phase 5
  • “No host-bound assumptions in core logic” → Phase 1a + 1b (the bulk of the work)
  • “Multi-node networking trigger” → Phase 4
  • “Stateful service growth path” → Phase 3
  • “ARM64/Linux-friendly workloads” → Phase 2