ADR-0184

Node-0 to Fleet — Platform Abstraction and Multi-Node Readiness

Status: proposed
Date: 2026-03-01
Deciders: Joel Hooks
Related: ADR-0029, ADR-0089, ADR-0148, ADR-0159, ADR-0182
Supersedes: ADR-0182 Section 2 (“Prepare for multi-Mac / Linux”)

Context

ADR-0182 shipped the “hardening now” work: warmup-aware gates, voice-agent cleanup, RBAC drift guards, post-Colima invariant checks. Panda is reliable as a single node.

But the fleet-prep section of ADR-0182 remained aspirational. A codebase audit reveals the actual state:

Current Host-Bound Debt

| Category | Count | Files Affected | Severity |
| --- | --- | --- | --- |
| launchctl calls in core packages | 34 | 11 files across cli, gateway, system-bus | High — direct macOS process manager calls in business logic |
| launchd references (comments + logic) | 48 | Same 11 + additional | Medium — some are comments, but many drive restart/heal behavior |
| /Users/joel hardcoded paths | 62 | 20+ files | High — breaks on any machine with a different username or home dir |
| Colima/192.168.64 references | 5 | endpoint-resolver, network-status, seed-network | Low — mostly in the resolver, which already abstracts this |

What This Means

Panda cannot be cloned. If Joel buys a second Mac Mini tomorrow and wants it running joelclaw workloads, the setup would require:

  1. Creating a joel user account (or rewriting 62 path references)
  2. Installing identical launchd plists (or the gateway/CLI will fail to restart services)
  3. Running Colima in the exact same network config (or endpoint resolution breaks)
  4. Manually configuring every service that assumes single-node

This is the opposite of a fleet-ready system. The goal of this ADR is to make node-1 joinable with minimal manual setup.

Decision

Guiding Principles (from the book corpus)

| Principle | Source | Application |
| --- | --- | --- |
| Partial failures are nondeterministic — never trust a single probe path | Kleppmann, DDIA §8 (s116:n1) | Already shipped in ADR-0182 endpoint resolver. Extend to all service discovery. |
| Fast fail with dependency protection — breaker-style escalation | Nygard, Release It! (s109:n0) | Tier-based escalation in Talon already implements this. Preserve when abstracting. |
| Blast-radius containment via bulkheads | Nygard, Release It! (s111) | Probe classes (infra-critical, service-critical, app-level) must survive platform adapter changes. |
| Stable boundaries for portability | Newman, Building Microservices (s29:n0, s196:n0) | Host bootstrap lives in adapters; core control loop stays platform-neutral. |
| Sidecar pattern for cross-cutting concerns | Ford & Parsons, Building Evolutionary Architectures (s60) | Platform-specific operations (process management, service discovery) as injectable sidecars, not inline calls. |
| Replay safety and idempotency | Bellemare, Building Event-Driven Microservices (s161:n0) | All heal/repair operations must remain idempotent after platform abstraction. |
| High-cardinality structured telemetry | Majors et al., Observability Engineering (s36:n2) | Emit platform, node_id, endpoint_class on every probe for fleet-wide debugging. |
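The telemetry principle can be sketched as a structured probe event. A minimal sketch: the `platform`, `node_id`, and `endpoint_class` fields come from the principle itself; the other field names and the `probeEvent` helper are illustrative assumptions, not an existing API.

```typescript
// Sketch of a high-cardinality probe event. platform, node_id, and
// endpoint_class are the fields named in the principle above; the
// remaining fields and this helper are assumed, not existing code.
interface ProbeEvent {
  probe: string;
  ok: boolean;
  latency_ms: number;
  node_id: string;
  platform: NodeJS.Platform;
  endpoint_class: "localhost" | "vm" | "svc-dns";
}

export function probeEvent(
  probe: string,
  ok: boolean,
  latencyMs: number,
  endpointClass: ProbeEvent["endpoint_class"],
): ProbeEvent {
  return {
    probe,
    ok,
    latency_ms: latencyMs,
    // Falls back to the current node name when no fleet identity is set
    node_id: process.env.NODE_ID ?? "panda",
    platform: process.platform,
    endpoint_class: endpointClass,
  };
}
```

Every probe emitting this shape makes fleet-wide dashboards a simple group-by on `node_id` and `endpoint_class`.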

Phase 1: Platform Adapter Interfaces (the critical path)

Extract host-specific operations behind interfaces. This is the gating work — everything else depends on it.

1a. ProcessManagerPort interface

// packages/process-manager/src/types.ts
interface ProcessManagerPort {
  /** List managed services matching a label pattern */
  list(pattern: string): Promise<ManagedService[]>;
  /** Start or restart a service by label */
  restart(label: string): Promise<void>;
  /** Stop a service */
  stop(label: string): Promise<void>;
  /** Check if a service is loaded and get its PID */
  inspect(label: string): Promise<ServiceState>;
  /** Disable a service (survives reboot) */
  disable(label: string): Promise<void>;
  /** Enable and bootstrap a service */
  enable(label: string, config?: ServiceConfig): Promise<void>;
}

  • macOS adapter: wraps launchctl calls (bootout, bootstrap, kickstart, print, list, disable).
  • Linux adapter: wraps systemctl calls (equivalent operations).
  • Null adapter: for testing and environments where process management is external.
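A minimal sketch of the macOS adapter, assuming a promisified `child_process` wrapper. The `gui/<uid>` domain target and the label-matching logic are assumptions; only the launchctl subcommand mapping comes from this ADR.

```typescript
// Hypothetical macOS adapter sketch. The gui/<uid> domain and label
// conventions are assumptions to be checked against the real plists.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

interface ManagedService {
  label: string;
  pid: number | null;
}

/** Parse `launchctl list` output: tab-separated PID, Status, Label columns. */
export function parseLaunchctlList(stdout: string, pattern: string): ManagedService[] {
  return stdout
    .split("\n")
    .slice(1) // skip the "PID Status Label" header row
    .filter((line) => line.trim().length > 0)
    .map((line) => {
      const [pid, , label] = line.split("\t");
      return { label, pid: pid === "-" ? null : Number(pid) };
    })
    .filter((svc) => svc.label.includes(pattern));
}

export class LaunchdProcessManager {
  private domain = `gui/${process.getuid?.() ?? 501}`;

  async list(pattern: string): Promise<ManagedService[]> {
    const { stdout } = await run("launchctl", ["list"]);
    return parseLaunchctlList(stdout, pattern);
  }

  async restart(label: string): Promise<void> {
    // kickstart -k kills and restarts the service in one call
    await run("launchctl", ["kickstart", "-k", `${this.domain}/${label}`]);
  }

  async stop(label: string): Promise<void> {
    await run("launchctl", ["bootout", `${this.domain}/${label}`]);
  }
}
```

The Linux adapter would mirror this shape over `systemctl`, and the null adapter would return canned `ManagedService` records.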

Current callsites to migrate (11 files, 34 launchctl invocations):

  • packages/cli/src/commands/gateway.ts — gateway start/stop/kill/status
  • packages/cli/src/commands/inngest.ts — worker source repair, launchd drift detection
  • packages/cli/src/commands/nas.ts — NAS mount service kickstart
  • packages/cli/src/commands/logs.ts — worker status hints
  • packages/cli/src/commands/status.ts — worker restart suggestion
  • packages/cli/src/inngest.ts — agent-mail launchd check
  • packages/cli/src/typesense-auth.ts — agent-secrets restart hints
  • packages/gateway/src/channels/imessage.ts — imsg-rpc launchd heal
  • packages/gateway/src/channels/telegram.ts — hard stop (launchctl disable)
  • packages/gateway/src/daemon.ts — self-restart expectation
  • packages/system-bus/src/inngest/functions/network-status-update.ts — launchctl list probe

1b. NodeIdentity and path resolution

Replace all 62 /Users/joel hardcoded paths with resolved paths:

// packages/node-identity/src/index.ts
interface NodeIdentity {
  /** Node identifier (e.g., "panda", "koala") */
  nodeId: string;
  /** Home directory (process.env.HOME) */
  homeDir: string;
  /** Monorepo root */
  repoRoot: string;
  /** Vault path */
  vaultPath: string;
  /** Platform: "darwin" | "linux" */
  platform: NodeJS.Platform;
  /** Architecture: "arm64" | "x64" */
  arch: string;
}

Resolution order:

  1. NODE_ID env var (explicit fleet identity)
  2. hostname (fallback)
  3. Path derivation from HOME, JOELCLAW_ROOT, VAULT_PATH env vars
  4. Current hardcoded defaults as last resort (backward compatible)
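The resolution order above can be sketched as a single resolver function. The `Vault` directory name and the repo-root fallback are assumptions standing in for the current hardcoded defaults.

```typescript
// Hypothetical resolver sketch following the documented resolution order.
// The fallback path segments are assumptions mirroring current defaults.
import * as os from "node:os";
import * as path from "node:path";

interface NodeIdentity {
  nodeId: string;
  homeDir: string;
  repoRoot: string;
  vaultPath: string;
  platform: NodeJS.Platform;
  arch: string;
}

export function resolveNodeIdentity(env: NodeJS.ProcessEnv = process.env): NodeIdentity {
  // 1. Explicit fleet identity; 2. hostname fallback
  const nodeId = env.NODE_ID ?? os.hostname().split(".")[0];
  // 3. Derivation from env vars; 4. hardcoded-style defaults as last resort
  const homeDir = env.HOME ?? os.homedir();
  const repoRoot = env.JOELCLAW_ROOT ?? path.join(homeDir, "Code/joelhooks/joelclaw");
  const vaultPath = env.VAULT_PATH ?? path.join(homeDir, "Vault");
  return {
    nodeId,
    homeDir,
    repoRoot,
    vaultPath,
    platform: process.platform,
    arch: process.arch,
  };
}
```

Core packages would import this once at startup instead of interpolating `/Users/joel` inline.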

1c. ContainerRuntimePort interface

Abstract Colima-specific operations (already partially done in endpoint-resolver):

interface ContainerRuntimePort {
  /** Check runtime health */
  isHealthy(): Promise<boolean>;
  /** Get VM IP address (if applicable) */
  vmIp(): Promise<string | null>;
  /** Get uptime in seconds */
  uptimeSeconds(): Promise<number | null>;
  /** SSH config path (for VM-based runtimes) */
  sshConfig(): string | null;
}

  • Colima adapter: current behavior.
  • Native Docker adapter: for bare-metal Linux where Docker runs natively (no VM).
  • Remote adapter: for nodes where the container runtime is on a different host.
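By analogy with the ProcessManagerPort null adapter, a null container-runtime adapter keeps tests from shelling out to Colima. This class is a sketch I am adding for illustration, not an existing package.

```typescript
// Hypothetical null adapter for tests and for hosts where the container
// runtime is managed externally. The interface is redeclared here so the
// sketch is self-contained.
interface ContainerRuntimePort {
  isHealthy(): Promise<boolean>;
  vmIp(): Promise<string | null>;
  uptimeSeconds(): Promise<number | null>;
  sshConfig(): string | null;
}

export class NullContainerRuntime implements ContainerRuntimePort {
  // Always reports healthy so probes exercise app logic, not the runtime
  async isHealthy(): Promise<boolean> {
    return true;
  }
  // No VM exists for native or remote runtimes
  async vmIp(): Promise<string | null> {
    return null;
  }
  async uptimeSeconds(): Promise<number | null> {
    return null;
  }
  sshConfig(): string | null {
    return null;
  }
}
```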

Phase 2: Multi-Arch Image Pipeline

Current state: k8s/publish-system-bus-worker.sh builds single-arch (ARM64 because Panda is ARM64). No buildx, no multi-arch manifest.

Target: Multi-arch manifest so images work on ARM64 Mac Minis AND x86 Linux boxes.

# publish-system-bus-worker.sh (updated)
docker buildx create --name joelclaw-builder --use 2>/dev/null || true
docker buildx build \
  --platform linux/arm64,linux/amd64 \
  -f "$ROOT_DIR/packages/system-bus/Dockerfile" \
  -t "$IMAGE" \
  -t "$LATEST_IMAGE" \
  --push \
  "$ROOT_DIR"

Key considerations:

  • ARM64 is native (fast build on Panda), AMD64 is cross-compiled (slower but tolerable for CI)
  • GHCR supports OCI multi-arch manifests natively
  • Kubernetes auto-selects the correct platform image from the manifest
  • No changes needed to k8s manifests — image: field stays the same

Phase 3: Stateful Service HA Topology Specs

Not implementing HA now — single-node Redis and Typesense are fine for current load. But the specs must exist so node-1 expansion doesn’t require architecture decisions under pressure.

Redis

| Topology | Nodes | How |
| --- | --- | --- |
| Current | 1 (StatefulSet, single replica) | Direct connection from all clients |
| Target (node-1+) | 3 (Sentinel, Bitnami Helm chart) | 1 master + 2 replicas, Sentinel quorum of 2, HAProxy frontend for non-Sentinel-aware clients |

Key: joelclaw uses ioredis everywhere, which has native Sentinel support. No HAProxy needed if all clients use ioredis Sentinel mode.
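A sketch of building Sentinel-aware connection options for ioredis clients. The master group name `mymaster` and the sentinel DNS names in the usage comment are assumptions that would need to match the Bitnami chart's values; ioredis itself resolves the current master through the listed sentinels.

```typescript
// Sketch of Sentinel connection options for ioredis clients.
// "mymaster" and the service DNS names are assumptions.
interface SentinelAddress {
  host: string;
  port: number;
}

interface SentinelOptions {
  sentinels: SentinelAddress[];
  name: string; // master group name as registered with Sentinel
}

export function sentinelOptions(hosts: string[], masterName = "mymaster"): SentinelOptions {
  return {
    sentinels: hosts.map((host) => ({ host, port: 26379 })), // default Sentinel port
    name: masterName,
  };
}

// Usage with ioredis (resolves the current master via Sentinel):
//   import Redis from "ioredis";
//   const redis = new Redis(sentinelOptions([
//     "redis-sentinel-0.redis-headless",
//     "redis-sentinel-1.redis-headless",
//     "redis-sentinel-2.redis-headless",
//   ]));
```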

Helm values spec (ready to deploy):

# k8s/redis-ha-values.yaml (spec, not deployed)
sentinel:
  enabled: true
  quorum: 2
replica:
  replicaCount: 3
auth:
  enabled: false  # internal cluster, Tailscale mesh provides transport security

Typesense

| Topology | Nodes | How |
| --- | --- | --- |
| Current | 1 (StatefulSet, single replica) | Direct connection |
| Target (node-1+) | 3 (built-in Raft consensus) | Typesense natively supports multi-node via --peers flag |

# k8s/typesense-ha-values.yaml (spec, not deployed)
replicaCount: 3
peering:
  enabled: true
  # Typesense uses Raft for leader election, no external coordination needed

Inngest

| Topology | Nodes | How |
| --- | --- | --- |
| Current | 1 (StatefulSet, single replica) | SQLite state backend |
| Target (node-1+) | 1 (but on persistent storage) | Inngest self-hosted is single-instance. HA = fast restart on any node + persistent volume. Consider Postgres state backend for shared access. |

Phase 4: Network Policy Foundation

Not deploying Cilium yet. Flannel is working, and Cilium migration on a running cluster is complex (requires CNI replacement, kube-proxy swap). But the readiness work is concrete:

  1. Document current Flannel config — capture pod CIDR, service CIDR, VXLAN port
  2. Write Cilium values file — k8s/cilium-values.yaml (spec, not deployed)
  3. Network policy manifests — deny-all default + explicit allow rules for current service mesh
  4. L2 announcement config — Cilium L2 mode replaces MetalLB need (Cilium handles LoadBalancer IPs)
  5. Migration runbook — Flannel→Cilium node-by-node migration steps (from Calico→Cilium guide adapted)
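The deny-all default from item 3 can be sketched as a standard Kubernetes NetworkPolicy. The namespace name is an assumption; per-service allow rules would follow the same shape with specific pod selectors.

```yaml
# Hypothetical deny-all default (spec, not deployed); namespace is an assumption
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: joelclaw        # assumption: apply per namespace
spec:
  podSelector: {}            # selects every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
```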

Trigger: deploy when node-1 physically exists and needs inter-node pod networking.

Phase 5: Fleet Bootstrap Automation

Once Phases 1-4 land, a new node joins with:

# On new Mac Mini "koala":
# 1. Install Colima + Talos (or bare-metal Talos)
# 2. Join existing Talos cluster
talosctl gen config joelclaw-cluster https://panda.tailnet:6443
talosctl apply-config --nodes koala.tailnet --file controlplane.yaml
 
# 3. Set node identity
export NODE_ID=koala
export JOELCLAW_ROOT=/Users/joel/Code/joelhooks/joelclaw  # or wherever
 
# 4. Workloads auto-schedule via k8s
# Multi-arch images just work
# Redis Sentinel auto-rebalances
# Typesense Raft adds a peer

Consequences

Good

  • Node-1 joinable without rewriting 62 path references or understanding launchd internals
  • Linux nodes possible — the same codebase runs on ARM64 Linux (Raspberry Pi cluster, cheap mini PCs)
  • Testing improves — null adapters for ProcessManagerPort mean unit tests don’t shell out to launchctl
  • Observability grows — node_id and platform tags on every probe enable fleet-wide dashboards
  • HA specs exist before the emergency — when a service goes down, the upgrade path is documented

Tradeoffs

  • Phase 1 is large — 11 files, 34 launchctl calls, 62 path references to migrate. Probably 3-4 coding sessions.
  • Over-engineering risk — fleet of 1 doesn’t need all this. But the debt is real (codebase literally can’t run on a different machine) and the abstractions improve testability regardless.
  • Colima abstraction may be premature — only 5 references, already behind endpoint-resolver. Low priority.

Won’t Do (explicit scope exclusion)

  • Cilium deployment — spec only until node-1 exists
  • Redis/Typesense HA deployment — spec only, deploy when load or availability requires it
  • Bare-metal Talos on Macs — Colima VM approach stays. Bare metal requires macOS removal which kills iMessage, Granola, and other macOS-only agent surfaces
  • Multi-cluster — single cluster, multi-node. Multi-cluster adds complexity with no current benefit

Implementation Sequence (vector clock)

  1. Phase 1a first → ProcessManagerPort + macOS adapter. Unblocks everything.
  2. Phase 1b parallel → NodeIdentity path resolution. Independent of 1a.
  3. Phase 1c deferred → ContainerRuntimePort. Only 5 callsites, already partially abstracted.
  4. Phase 2 after 1a+1b → Multi-arch builds. Needs working codebase on both platforms.
  5. Phase 3 any time → HA specs are documents, not code. Can draft independently.
  6. Phase 4 after node-1 hardware → Cilium values + migration runbook.
  7. Phase 5 after all above → Bootstrap automation validates the whole stack.

Verification Gates

| Gate | Condition | How to verify |
| --- | --- | --- |
| Phase 1a complete | Zero launchctl calls outside adapter package | `grep -rn 'launchctl' packages/ --include='*.ts' \| grep -v process-manager \| grep -v node_modules` returns empty |
| Phase 1b complete | Zero /Users/joel hardcoded paths | `grep -rn '/Users/joel' packages/ --include='*.ts' \| grep -v node_modules` returns empty |
| Phase 2 complete | Multi-arch manifest published to GHCR | `docker buildx imagetools inspect ghcr.io/joelhooks/system-bus-worker:latest` shows both linux/arm64 and linux/amd64 |
| Phase 3 complete | HA topology specs exist | k8s/redis-ha-values.yaml, k8s/typesense-ha-values.yaml, k8s/inngest-ha-notes.md exist and are reviewed |
| Phase 4 complete | Cilium spec and migration runbook exist | k8s/cilium-values.yaml and k8s/flannel-to-cilium-runbook.md exist |
| Phase 5 complete | Node join validated end to end | A second node joins the cluster and runs a test workload without manual path/service fixups |

PDF Brain Reference Pack

| Domain | Book / Doc ID | Chunk IDs | Fleet application |
| --- | --- | --- | --- |
| Partial failure & nondeterminism | designing-dataintensive-applications-39cc0d1842a5 | s116:n1, s116:n2 | Never trust single-node probe paths. Every health check must degrade through endpoint classes (localhost → VM → svc DNS). |
| Fast fail & dependency protection | release-it-michael-nygard-df70f05c7863 | s109:n0 | ProcessManagerPort adapters must fail fast with clear errors, not hang on missing launchctl. |
| Blast-radius containment | release-it-michael-nygard-df70f05c7863 | s111 | Platform adapter failures (launchd down) must not cascade into application-level failures. Probe tiers remain isolated. |
| Stable boundaries & portability | building-microservices-2nd-edition-sam-newman-88c27beee5d6 | s29:n0, s196:n0 | Host bootstrap in adapters, core loop platform-neutral. This is the entire thesis of Phase 1. |
| Sidecar pattern | building-evolutionary-architectures-2nd-edition-26211f9a3473 | s60 | Platform-specific operations as injectable dependencies, not inline shells. |
| Replay safety & idempotency | building-event-driven-microservices-adam-bellema-4843d259c45b | s161:n0 | Heal loops must stay idempotent after abstraction. Platform adapter + core loop = same replay guarantees. |
| Observability in distributed systems | observability-engineering-achieving-production-e-65364c03bf43 | s36:n2, s107:n2 | Emit node_id, platform, endpoint_class, adapter_type on every probe. Fleet debugging requires high-cardinality telemetry. |
| Stateful service growth | designing-dataintensive-applications-39cc0d1842a5 | s214:n3 | Stateful services (Redis, Typesense) need explicit growth topology specs before they need to grow. |
| SRE error budget posture | site-reliability-engineering-how-google-runs-pro-36bc8fec5a69 | s42, s31:n1 | Alert on symptoms, not causes. Fleet health = aggregate probe pass rate across nodes, not per-node launchd state. |

Retrieval Instructions

# Partial failure evidence
joelclaw docs context designing-dataintensive-applications-39cc0d1842a5:s116:n1 \
  --mode snippet-window --before 1 --after 1
 
# Stable boundaries
joelclaw docs context building-microservices-2nd-edition-sam-newman-88c27beee5d6:s29:n0 \
  --mode snippet-window --before 1 --after 1
 
# Blast radius
joelclaw docs context release-it-michael-nygard-df70f05c7863:s111 \
  --mode snippet-window --before 1 --after 1
 
# Sidecar pattern
joelclaw docs context building-evolutionary-architectures-2nd-edition-26211f9a3473:s60 \
  --mode snippet-window --before 1 --after 1

External Research References

| Topic | Source | Key Insight |
| --- | --- | --- |
| Mac Mini Talos cluster | 3-node Talos cluster on Mac Minis | 3-node HA quorum with each node as both control plane + worker is viable and commonly used. MinIO/NFS for storage, not S3-as-PV. |
| Talos single-node to multi | r/kubernetes: Talos as single node | Single-node Talos works but upgrades are all-or-nothing. Adding nodes later is straightforward via talosctl gen config. |
| Colima vs bare-metal tradeoff | r/kubernetes: VMs or bare metal | VMs (Colima/Proxmox) offer easier snapshot/destroy cycles. Bare metal offers better performance. For homelab: VMs win on ops simplicity. |
| Cilium on Talos/K3s | Cilium K3s docs | Cilium replaces both Flannel (CNI) and kube-proxy. ARM64 supported. L2 announcement mode eliminates MetalLB. |
| Flannel→Cilium migration | Calico to Cilium migration guide | Node-by-node migration with dual-CNI: label nodes, install Cilium with customConf: true, migrate nodes gradually. Adapts to Flannel→Cilium. |
| Multi-arch Docker builds | Multi-arch container images | docker buildx build --platform linux/arm64,linux/amd64 --push creates a single manifest tag. K8s auto-selects the correct arch. |
| Redis Sentinel on K8s | Bitnami Redis Sentinel Helm | 3-node Sentinel with Bitnami Helm chart. ioredis has native Sentinel support — no HAProxy needed if all clients use ioredis. |
| Mac Mini homelab consensus | r/homelab: 2nd Mac Mini | Community split: Mac Minis are power-efficient but expensive per GB RAM. Mini PCs (N100) are cheaper for pure compute. For joelclaw: Mac Minis win because macOS agent surfaces (iMessage, Granola, voice) require macOS. |
| Kubernetes distro comparison | Best K8s Distros 2025 | Talos: most secure, immutable, API-managed. Learning curve offset by operational simplicity. Right choice for fleet. |

Notes

This ADR is deliberately phased with explicit “spec only” markers on items that don’t need implementation until hardware exists. The critical path is Phase 1 (platform abstraction) which improves testability and code quality regardless of whether node-1 ever materializes.

ADR-0182 Section 2 items are subsumed here:

  • “Platform-neutral control-plane contract” → Phase 1 + Phase 5
  • “No host-bound assumptions in core logic” → Phase 1a + 1b (the bulk of the work)
  • “Multi-node networking trigger” → Phase 4
  • “Stateful service growth path” → Phase 3
  • “ARM64/Linux-friendly workloads” → Phase 2