# Node-0 to Fleet — Platform Abstraction and Multi-Node Readiness
- Status: proposed
- Date: 2026-03-01
- Deciders: Joel Hooks
- Related: ADR-0029, ADR-0089, ADR-0148, ADR-0159, ADR-0182
- Supersedes: ADR-0182 Section 2 (“Prepare for multi-Mac / Linux”)
## Context
ADR-0182 shipped the “hardening now” work: warmup-aware gates, voice-agent cleanup, RBAC drift guards, post-Colima invariant checks. Panda is reliable as a single node.
But the fleet-prep section of ADR-0182 remained aspirational. A codebase audit reveals the actual state:
### Current Host-Bound Debt
| Category | Count | Files Affected | Severity |
|---|---|---|---|
| launchctl calls in core packages | 34 | 11 files across cli, gateway, system-bus | High — direct macOS process manager calls in business logic |
| launchd references (comments + logic) | 48 | Same 11 + additional | Medium — some are comments, but many drive restart/heal behavior |
| /Users/joel hardcoded paths | 62 | 20+ files | High — breaks on any machine with a different username or home dir |
| Colima/192.168.64 references | 5 | endpoint-resolver, network-status, seed-network | Low — mostly in the resolver, which already abstracts this |
### What This Means
Panda cannot be cloned. If Joel buys a second Mac Mini tomorrow and wants it running joelclaw workloads, the setup would require:
- Creating a `joel` user account (or rewriting 62 path references)
- Installing identical launchd plists (or the gateway/CLI will fail to restart services)
- Running Colima in the exact same network config (or endpoint resolution breaks)
- Manually configuring every service that assumes single-node
This is the opposite of a fleet-ready system. The goal of this ADR is to make node-1 joinable with minimal manual setup.
## Decision
### Guiding Principles (from the book corpus)
| Principle | Source | Application |
|---|---|---|
| Partial failures are nondeterministic — never trust a single probe path | Kleppmann, DDIA §8 (s116:n1) | Already shipped in ADR-0182 endpoint resolver. Extend to all service discovery. |
| Fast fail with dependency protection — breaker-style escalation | Nygard, Release It! (s109:n0) | Tier-based escalation in Talon already implements this. Preserve when abstracting. |
| Blast-radius containment via bulkheads | Nygard, Release It! (s111) | Probe classes (infra-critical, service-critical, app-level) must survive platform adapter changes. |
| Stable boundaries for portability | Newman, Building Microservices (s29:n0, s196:n0) | Host bootstrap lives in adapters; core control loop stays platform-neutral. |
| Sidecar pattern for cross-cutting concerns | Ford & Parsons, Building Evolutionary Architectures (s60) | Platform-specific operations (process management, service discovery) as injectable sidecars, not inline calls. |
| Replay safety and idempotency | Bellemare, Building Event-Driven Microservices (s161:n0) | All heal/repair operations must remain idempotent after platform abstraction. |
| High-cardinality structured telemetry | Majors et al., Observability Engineering (s36:n2) | Emit platform, node_id, endpoint_class on every probe for fleet-wide debugging. |
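The telemetry requirement in the last row can be sketched as a shared event envelope. This is a sketch only — the field names and the `probeEvent` helper are assumptions, not an existing schema in the repo:

```typescript
// Sketch of a fleet-aware probe event (hypothetical field names).
// High-cardinality fields like node_id and endpoint_class make
// per-node slicing cheap once more than one node exists.
interface ProbeEvent {
  probe: string;                       // e.g. "redis.ping"
  node_id: string;                     // "panda", "koala", ...
  platform: NodeJS.Platform;           // "darwin" | "linux"
  endpoint_class: "localhost" | "vm" | "svc-dns";
  adapter_type: string;                // "launchd" | "systemd" | "null"
  ok: boolean;
  latency_ms: number;
}

// Hypothetical emitter: every probe result carries the same envelope,
// with node identity stamped in one place instead of at each callsite.
function probeEvent(
  partial: Omit<ProbeEvent, "node_id" | "platform">,
): ProbeEvent {
  return {
    node_id: process.env.NODE_ID ?? "panda", // falls back to the current node
    platform: process.platform,
    ...partial,
  };
}
```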
### Phase 1: Platform Adapter Interfaces (the critical path)
Extract host-specific operations behind interfaces. This is the gating work — everything else depends on it.
#### 1a. `ProcessManagerPort` interface
```typescript
// packages/process-manager/src/types.ts
interface ProcessManagerPort {
  /** List managed services matching a label pattern */
  list(pattern: string): Promise<ManagedService[]>;
  /** Start or restart a service by label */
  restart(label: string): Promise<void>;
  /** Stop a service */
  stop(label: string): Promise<void>;
  /** Check if a service is loaded and get its PID */
  inspect(label: string): Promise<ServiceState>;
  /** Disable a service (survives reboot) */
  disable(label: string): Promise<void>;
  /** Enable and bootstrap a service */
  enable(label: string, config?: ServiceConfig): Promise<void>;
}
```

**macOS adapter**: Wraps `launchctl` calls (`bootout`, `bootstrap`, `kickstart`, `print`, `list`, `disable`).
**Linux adapter**: Wraps `systemctl` calls (equivalent operations).
**Null adapter**: For testing and environments where process management is external.
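A minimal sketch of the null adapter plus an adapter-selection factory, under assumed types (the launchd/systemd adapters are the Phase 1a deliverable and are stubbed here, not implemented):

```typescript
// Supporting types assumed for the sketch; real shapes TBD in Phase 1a.
type ServiceState = { loaded: boolean; pid: number | null };
type ManagedService = { label: string; state: ServiceState };
type ServiceConfig = Record<string, unknown>;

interface ProcessManagerPort {
  list(pattern: string): Promise<ManagedService[]>;
  restart(label: string): Promise<void>;
  stop(label: string): Promise<void>;
  inspect(label: string): Promise<ServiceState>;
  disable(label: string): Promise<void>;
  enable(label: string, config?: ServiceConfig): Promise<void>;
}

// Null adapter: safe no-ops so unit tests never shell out to launchctl.
class NullProcessManager implements ProcessManagerPort {
  async list(): Promise<ManagedService[]> { return []; }
  async restart(): Promise<void> {}
  async stop(): Promise<void> {}
  async inspect(): Promise<ServiceState> { return { loaded: false, pid: null }; }
  async disable(): Promise<void> {}
  async enable(): Promise<void> {}
}

// Factory: pick the adapter from the platform. The real adapters are
// stubbed with the null adapter until Phase 1a lands.
function createProcessManager(
  platform: NodeJS.Platform = process.platform,
): ProcessManagerPort {
  if (platform === "darwin") return new NullProcessManager(); // -> LaunchdProcessManager
  if (platform === "linux") return new NullProcessManager();  // -> SystemdProcessManager
  return new NullProcessManager();
}
```

The factory is the single seam where platform detection happens; everything downstream depends only on the port.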
Current callsites to migrate (11 files, 34 launchctl invocations):
- `packages/cli/src/commands/gateway.ts` — gateway start/stop/kill/status
- `packages/cli/src/commands/inngest.ts` — worker source repair, launchd drift detection
- `packages/cli/src/commands/nas.ts` — NAS mount service kickstart
- `packages/cli/src/commands/logs.ts` — worker status hints
- `packages/cli/src/commands/status.ts` — worker restart suggestion
- `packages/cli/src/inngest.ts` — agent-mail launchd check
- `packages/cli/src/typesense-auth.ts` — agent-secrets restart hints
- `packages/gateway/src/channels/imessage.ts` — imsg-rpc launchd heal
- `packages/gateway/src/channels/telegram.ts` — hard stop (launchctl disable)
- `packages/gateway/src/daemon.ts` — self-restart expectation
- `packages/system-bus/src/inngest/functions/network-status-update.ts` — launchctl list probe
#### 1b. `NodeIdentity` and path resolution
Replace all 62 `/Users/joel` hardcoded paths with resolved paths:

```typescript
// packages/node-identity/src/index.ts
interface NodeIdentity {
  /** Node identifier (e.g., "panda", "koala") */
  nodeId: string;
  /** Home directory (process.env.HOME) */
  homeDir: string;
  /** Monorepo root */
  repoRoot: string;
  /** Vault path */
  vaultPath: string;
  /** Platform: "darwin" | "linux" */
  platform: NodeJS.Platform;
  /** Architecture: "arm64" | "x64" */
  arch: string;
}
```

Resolution order:

1. `NODE_ID` env var (explicit fleet identity)
2. `hostname` (fallback)
3. Path derivation from `HOME`, `JOELCLAW_ROOT`, `VAULT_PATH` env vars
4. Current hardcoded defaults as last resort (backward compatible)
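The resolution order above can be sketched as a single function. The repo-root default mirrors the path used in the Phase 5 bootstrap; the vault default is a placeholder assumption:

```typescript
import os from "node:os";

interface NodeIdentity {
  nodeId: string;
  homeDir: string;
  repoRoot: string;
  vaultPath: string;
  platform: NodeJS.Platform;
  arch: string;
}

// Sketch of the fallback chain: explicit env -> hostname -> derived
// paths -> legacy hardcoded defaults (backward compatible).
function resolveNodeIdentity(env: NodeJS.ProcessEnv = process.env): NodeIdentity {
  const homeDir = env.HOME ?? os.homedir();
  return {
    nodeId: env.NODE_ID ?? os.hostname(),   // 1. fleet identity, 2. hostname
    homeDir,
    // 3. env-derived, 4. current default as last resort
    repoRoot: env.JOELCLAW_ROOT ?? `${homeDir}/Code/joelhooks/joelclaw`,
    vaultPath: env.VAULT_PATH ?? `${homeDir}/vault`, // hypothetical default
    platform: process.platform,
    arch: process.arch,
  };
}
```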
#### 1c. `ContainerRuntimePort` interface
Abstract Colima-specific operations (already partially done in endpoint-resolver):
```typescript
interface ContainerRuntimePort {
  /** Check runtime health */
  isHealthy(): Promise<boolean>;
  /** Get VM IP address (if applicable) */
  vmIp(): Promise<string | null>;
  /** Get uptime in seconds */
  uptimeSeconds(): Promise<number | null>;
  /** SSH config path (for VM-based runtimes) */
  sshConfig(): string | null;
}
```

**Colima adapter**: Current behavior.
**Native Docker adapter**: For bare-metal Linux where Docker runs natively (no VM).
**Remote adapter**: For nodes where the container runtime is on a different host.
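A sketch of the native-Docker adapter, assuming it shells out to the real `docker info` CLI. The VM-specific surface degrades to `null` rather than throwing, which keeps the port contract uniform across adapters:

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const exec = promisify(execFile);

interface ContainerRuntimePort {
  isHealthy(): Promise<boolean>;
  vmIp(): Promise<string | null>;
  uptimeSeconds(): Promise<number | null>;
  sshConfig(): string | null;
}

// Sketch: bare-metal Linux adapter. There is no VM, so VM-only
// operations resolve to null instead of erroring.
class NativeDockerRuntime implements ContainerRuntimePort {
  async isHealthy(): Promise<boolean> {
    try {
      // Fails fast if docker is missing or the daemon is down.
      await exec("docker", ["info", "--format", "{{.ServerVersion}}"]);
      return true;
    } catch {
      return false;
    }
  }
  async vmIp(): Promise<string | null> { return null; }          // no VM
  async uptimeSeconds(): Promise<number | null> { return null; } // not tracked natively
  sshConfig(): string | null { return null; }                    // no VM, no SSH config
}
```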
### Phase 2: Multi-Arch Image Pipeline
Current state: k8s/publish-system-bus-worker.sh builds single-arch (ARM64 because Panda is ARM64). No buildx, no multi-arch manifest.
Target: Multi-arch manifest so images work on ARM64 Mac Minis AND x86 Linux boxes.
```bash
# publish-system-bus-worker.sh (updated)
docker buildx create --name joelclaw-builder --use 2>/dev/null || true
docker buildx build \
  --platform linux/arm64,linux/amd64 \
  -f "$ROOT_DIR/packages/system-bus/Dockerfile" \
  -t "$IMAGE" \
  -t "$LATEST_IMAGE" \
  --push \
  "$ROOT_DIR"
```

Key considerations:
- ARM64 is native (fast build on Panda), AMD64 is cross-compiled (slower but tolerable for CI)
- GHCR supports OCI multi-arch manifests natively
- Kubernetes auto-selects the correct platform image from the manifest
- No changes needed to k8s manifests — the `image:` field stays the same
### Phase 3: Stateful Service HA Topology Specs
Not implementing HA now — single-node Redis and Typesense are fine for current load. But the specs must exist so node-1 expansion doesn’t require architecture decisions under pressure.
#### Redis
| Topology | Nodes | How |
|---|---|---|
| Current | 1 (StatefulSet, single replica) | Direct connection from all clients |
| Target (node-1+) | 3 (Sentinel, Bitnami Helm chart) | 1 master + 2 replicas, Sentinel quorum of 2, HAProxy frontend for non-Sentinel-aware clients |
Key: joelclaw uses ioredis everywhere, which has native Sentinel support. No HAProxy needed if all clients use ioredis Sentinel mode.
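A sketch of what Sentinel-mode connection options could look like with ioredis. The hostnames follow the Bitnami headless-service naming pattern and `mymaster` is that chart's default master-set name — both are assumptions, not deployed config:

```typescript
// Hypothetical Sentinel connection options for ioredis.
// Usage would be: new Redis(sentinelOptions)
const sentinelOptions = {
  // Clients connect to Sentinels, not to Redis directly; Sentinel
  // reports the current master and re-resolves it on failover.
  sentinels: [
    { host: "redis-node-0.redis-headless", port: 26379 },
    { host: "redis-node-1.redis-headless", port: 26379 },
    { host: "redis-node-2.redis-headless", port: 26379 },
  ],
  name: "mymaster", // the master set Sentinel tracks
};
```

If every client connects this way, the HAProxy frontend in the table above becomes unnecessary.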
Helm values spec (ready to deploy):
```yaml
# k8s/redis-ha-values.yaml (spec, not deployed)
sentinel:
  enabled: true
  quorum: 2
replica:
  replicaCount: 3
auth:
  enabled: false # internal cluster, Tailscale mesh provides transport security
```

#### Typesense
| Topology | Nodes | How |
|---|---|---|
| Current | 1 (StatefulSet, single replica) | Direct connection |
| Target (node-1+) | 3 (built-in Raft consensus) | Typesense natively supports multi-node clustering via its `--nodes` peer list |
```yaml
# k8s/typesense-ha-values.yaml (spec, not deployed)
replicaCount: 3
peering:
  enabled: true
# Typesense uses Raft for leader election, no external coordination needed
```

#### Inngest
| Topology | Nodes | How |
|---|---|---|
| Current | 1 (StatefulSet, single replica) | SQLite state backend |
| Target (node-1+) | 1 (but on persistent storage) | Inngest self-hosted is single-instance. HA = fast restart on any node + persistent volume. Consider Postgres state backend for shared access. |
### Phase 4: Network Policy Foundation
Not deploying Cilium yet. Flannel is working, and Cilium migration on a running cluster is complex (requires CNI replacement, kube-proxy swap). But the readiness work is concrete:
- Document current Flannel config — capture pod CIDR, service CIDR, VXLAN port
- Write Cilium values file — `k8s/cilium-values.yaml` (spec, not deployed)
- Network policy manifests — deny-all default + explicit allow rules for current service mesh
- L2 announcement config — Cilium L2 mode replaces the need for MetalLB (Cilium handles LoadBalancer IPs)
- Migration runbook — Flannel→Cilium node-by-node migration steps (adapted from the Calico→Cilium guide)
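The deny-all default in the list above could look like this minimal manifest (file name and namespace are assumptions; this is a sketch, not a deployed policy):

```yaml
# k8s/netpol-default-deny.yaml (sketch, not deployed)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: joelclaw   # assumed namespace
spec:
  podSelector: {}       # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
# Explicit allow rules for the current service mesh layer on top of this.
```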
Trigger: deploy when node-1 physically exists and needs inter-node pod networking.
### Phase 5: Fleet Bootstrap Automation
Once Phases 1-4 land, a new node joins with:
```bash
# On new Mac Mini "koala":

# 1. Install Colima + Talos (or bare-metal Talos)
# 2. Join existing Talos cluster
talosctl gen config joelclaw-cluster https://panda.tailnet:6443
talosctl apply-config --nodes koala.tailnet --file controlplane.yaml

# 3. Set node identity
export NODE_ID=koala
export JOELCLAW_ROOT=/Users/joel/Code/joelhooks/joelclaw # or wherever

# 4. Workloads auto-schedule via k8s
#    Multi-arch images just work
#    Redis Sentinel auto-rebalances
#    Typesense Raft adds a peer
```

## Consequences
### Good
- Node-1 joinable without rewriting 62 path references or understanding launchd internals
- Linux nodes possible — the same codebase runs on ARM64 Linux (Raspberry Pi cluster, cheap mini PCs)
- Testing improves — null adapters for ProcessManagerPort mean unit tests don’t shell out to launchctl
- Observability grows — `node_id` and `platform` tags on every probe enable fleet-wide dashboards
- HA specs exist before the emergency — when a service goes down, the upgrade path is documented
### Tradeoffs
- Phase 1 is large — 11 files, 34 launchctl calls, 62 path references to migrate. Probably 3-4 coding sessions.
- Over-engineering risk — fleet of 1 doesn’t need all this. But the debt is real (codebase literally can’t run on a different machine) and the abstractions improve testability regardless.
- Colima abstraction may be premature — only 5 references, already behind endpoint-resolver. Low priority.
### Won’t Do (explicit scope exclusion)
- Cilium deployment — spec only until node-1 exists
- Redis/Typesense HA deployment — spec only, deploy when load or availability requires it
- Bare-metal Talos on Macs — Colima VM approach stays. Bare metal requires macOS removal which kills iMessage, Granola, and other macOS-only agent surfaces
- Multi-cluster — single cluster, multi-node. Multi-cluster adds complexity with no current benefit
## Implementation Sequence (vector clock)
1. Phase 1a first → `ProcessManagerPort` + macOS adapter. Unblocks everything.
2. Phase 1b parallel → `NodeIdentity` path resolution. Independent of 1a.
3. Phase 1c deferred → `ContainerRuntimePort`. Only 5 callsites, already partially abstracted.
4. Phase 2 after 1a+1b → Multi-arch builds. Needs a working codebase on both platforms.
5. Phase 3 any time → HA specs are documents, not code. Can draft independently.
6. Phase 4 after node-1 hardware → Cilium values + migration runbook.
7. Phase 5 after all above → Bootstrap automation validates the whole stack.
## Verification Gates
| Gate | Condition | How to verify |
|---|---|---|
| Phase 1a complete | Zero launchctl calls outside the adapter package | `grep -rn 'launchctl' packages/ --include='*.ts' \| grep -v process-manager \| grep -v node_modules` returns empty |
| Phase 1b complete | Zero `/Users/joel` hardcoded paths | `grep -rn '/Users/joel' packages/ --include='*.ts' \| grep -v node_modules` returns empty |
| Phase 2 complete | Multi-arch manifest published | `docker buildx imagetools inspect ghcr.io/joelhooks/system-bus-worker:latest` shows both linux/arm64 and linux/amd64 |
| Phase 3 complete | HA specs written and reviewed | `k8s/redis-ha-values.yaml`, `k8s/typesense-ha-values.yaml`, and `k8s/inngest-ha-notes.md` exist and are reviewed |
| Phase 4 complete | Cilium spec and runbook written | `k8s/cilium-values.yaml` and `k8s/flannel-to-cilium-runbook.md` exist |
| Phase 5 complete | A second node joins cleanly | A second node joins the cluster and runs a test workload without manual path/service fixups |
## PDF Brain Reference Pack
| Domain | Book / Doc ID | Chunk IDs | Fleet application |
|---|---|---|---|
| Partial failure & nondeterminism | designing-dataintensive-applications-39cc0d1842a5 | s116:n1, s116:n2 | Never trust single-node probe paths. Every health check must degrade through endpoint classes (localhost → VM → svc DNS). |
| Fast fail & dependency protection | release-it-michael-nygard-df70f05c7863 | s109:n0 | ProcessManagerPort adapters must fail fast with clear errors, not hang on missing launchctl. |
| Blast-radius containment | release-it-michael-nygard-df70f05c7863 | s111 | Platform adapter failures (launchd down) must not cascade into application-level failures. Probe tiers remain isolated. |
| Stable boundaries & portability | building-microservices-2nd-edition-sam-newman-88c27beee5d6 | s29:n0, s196:n0 | Host bootstrap in adapters, core loop platform-neutral. This is the entire thesis of Phase 1. |
| Sidecar pattern | building-evolutionary-architectures-2nd-edition-26211f9a3473 | s60 | Platform-specific operations as injectable dependencies, not inline shells. |
| Replay safety & idempotency | building-event-driven-microservices-adam-bellema-4843d259c45b | s161:n0 | Heal loops must stay idempotent after abstraction. Platform adapter + core loop = same replay guarantees. |
| Observability in distributed systems | observability-engineering-achieving-production-e-65364c03bf43 | s36:n2, s107:n2 | Emit node_id, platform, endpoint_class, adapter_type on every probe. Fleet debugging requires high-cardinality telemetry. |
| Stateful service growth | designing-dataintensive-applications-39cc0d1842a5 | s214:n3 | Stateful services (Redis, Typesense) need explicit growth topology specs before they need to grow. |
| SRE error budget posture | site-reliability-engineering-how-google-runs-pro-36bc8fec5a69 | s42, s31:n1 | Alert on symptoms, not causes. Fleet health = aggregate probe pass rate across nodes, not per-node launchd state. |
### Retrieval Instructions
```bash
# Partial failure evidence
joelclaw docs context designing-dataintensive-applications-39cc0d1842a5:s116:n1 \
  --mode snippet-window --before 1 --after 1

# Stable boundaries
joelclaw docs context building-microservices-2nd-edition-sam-newman-88c27beee5d6:s29:n0 \
  --mode snippet-window --before 1 --after 1

# Blast radius
joelclaw docs context release-it-michael-nygard-df70f05c7863:s111 \
  --mode snippet-window --before 1 --after 1

# Sidecar pattern
joelclaw docs context building-evolutionary-architectures-2nd-edition-26211f9a3473:s60 \
  --mode snippet-window --before 1 --after 1
```

## External Research References
| Topic | Source | Key Insight |
|---|---|---|
| Mac Mini Talos cluster | 3-node Talos cluster on Mac Minis | 3-node HA quorum with each node as both control plane + worker is viable and commonly used. MinIO/NFS for storage, not S3-as-PV. |
| Talos single-node to multi | r/kubernetes: Talos as single node | Single-node Talos works but upgrades are all-or-nothing. Adding nodes later is straightforward via talosctl gen config. |
| Colima vs bare-metal tradeoff | r/kubernetes: VMs or bare metal | VMs (Colima/Proxmox) offer easier snapshot/destroy cycles. Bare metal offers better performance. For homelab: VMs win on ops simplicity. |
| Cilium on Talos/K3s | Cilium K3s docs | Cilium replaces both Flannel (CNI) and kube-proxy. ARM64 supported. L2 announcement mode eliminates MetalLB. |
| Flannel→Cilium migration | Calico to Cilium migration guide | Node-by-node migration with dual-CNI: label nodes, install Cilium with customConf: true, migrate nodes gradually. Adapts to Flannel→Cilium. |
| Multi-arch Docker builds | Multi-arch container images | docker buildx build --platform linux/arm64,linux/amd64 --push creates single manifest tag. K8s auto-selects correct arch. |
| Redis Sentinel on K8s | Bitnami Redis Sentinel Helm | 3-node Sentinel with Bitnami Helm chart. ioredis has native Sentinel support — no HAProxy needed if all clients use ioredis. |
| Mac Mini homelab consensus | r/homelab: 2nd Mac Mini | Community split: Mac Minis are power-efficient but expensive per GB RAM. Mini PCs (N100) are cheaper for pure compute. For joelclaw: Mac Minis win because macOS agent surfaces (iMessage, Granola, voice) require macOS. |
| Kubernetes distro comparison | Best K8s Distros 2025 | Talos: most secure, immutable, API-managed. Learning curve offset by operational simplicity. Right choice for fleet. |
## Notes
This ADR is deliberately phased with explicit “spec only” markers on items that don’t need implementation until hardware exists. The critical path is Phase 1 (platform abstraction) which improves testability and code quality regardless of whether node-1 ever materializes.
ADR-0182 Section 2 items are subsumed here:
- “Platform-neutral control-plane contract” → Phase 1 + Phase 5
- “No host-bound assumptions in core logic” → Phase 1a + 1b (the bulk of the work)
- “Multi-node networking trigger” → Phase 4
- “Stateful service growth path” → Phase 3
- “ARM64/Linux-friendly workloads” → Phase 2