ADR-0240 (accepted)

Boot-safe LaunchDaemons for critical host services

Context

ADR-0239 tried to patch the reboot gap by keeping the critical host services as user LaunchAgents and adding a root-owned bridge that bootstrapped them into user/$UID whenever gui/$UID was absent.

That design did not hold up on Panda.

After installation, the bridge daemon itself loaded, but the one step that mattered kept failing:

launchctl bootstrap user/501 <plist>
Bootstrap failed: 5: Input/output error

So the machine still needed manual nohup recovery for the same critical surfaces after a headless reboot:

  • com.joel.colima
  • com.joel.k8s-reboot-heal
  • com.joel.agent-secrets
  • com.joel.system-bus-worker
  • com.joel.gateway
  • com.joelclaw.agent-mail

That is needless complexity. The safer shape is to run the host control plane as actual system services.

Decision

Replace the ADR-0239 bridge with boot-safe LaunchDaemons for the critical host services.

  1. Keep the canonical plist sources in infra/launchd/.
  2. Install the critical labels into /Library/LaunchDaemons/, not ~/Library/LaunchAgents/.
  3. Run the services in the system launchd domain.
  4. Use UserName=joel / GroupName=staff where the process should execute with Joel’s home, repo, auth, and filesystem context.
  5. Add infra/install-critical-launchdaemons.sh as the canonical root installer.
  6. Keep infra/install-headless-bootstrap.sh only as a compatibility wrapper that now delegates to the new installer.
  7. Remove the installed com.joel.headless-bootstrap system daemon and stop documenting the bridge as an active recovery path.
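
As an illustration of points 2 to 4, a LaunchDaemon plist that runs in the system domain but executes as joel could look roughly like the output of the sketch below. The helper name render_daemon_plist and the program path in the example are hypothetical; the real plists live in infra/launchd/ and may carry more keys. Label, UserName, GroupName, ProgramArguments, and RunAtLoad are standard launchd.plist keys.

```shell
#!/bin/sh
# Hypothetical helper: print a minimal LaunchDaemon plist for one label.
# Assumption: the service is started by a single executable path.
# Note: values are not XML-escaped, so keep them free of <, >, and &.
render_daemon_plist() {
  label=$1
  program=$2
  cat <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>${label}</string>
  <key>UserName</key>
  <string>joel</string>
  <key>GroupName</key>
  <string>staff</string>
  <key>ProgramArguments</key>
  <array>
    <string>${program}</string>
  </array>
  <key>RunAtLoad</key>
  <true/>
</dict>
</plist>
EOF
}

# Example (the script path is a placeholder, not a real repo path):
#   render_daemon_plist com.joel.gateway /path/to/start-gateway.sh
```

Because the plist is loaded from /Library/LaunchDaemons/, launchd starts the job at boot as root and then drops to the UserName/GroupName given, so no GUI session is involved.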

Why this

  • Boot-safe by design: no cross-domain launchctl trickery, just real system services.
  • Fewer moving parts: the bridge, periodic probing, and GUI/user handoff logic all disappear.
  • Same runtime identity where needed: UserName=joel keeps the processes in Joel’s filesystem/auth context without requiring an Aqua login.
  • Cleaner recovery: the installer can also tear down stale user LaunchAgents, stale autossh tunnel listeners, and manual nohup fallbacks before bootstrapping the system daemons.
  • But still only if the daemons tell the truth: the boot-safe launchd shape does not help if recovery scripts still use brittle checks like plain colima status or declare success before the Colima→NAS route and flannel subnet state are back.
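
The last point can be made concrete with a small sketch that separates "launchd has the job" from "the service actually answers". The function name service_status and the probe command passed to it are hypothetical; each service would need its own probe, and nothing here is taken from the real recovery scripts.

```shell
#!/bin/sh
# LAUNCHCTL can be overridden, e.g. to exercise the logic on a non-macOS host.
LAUNCHCTL=${LAUNCHCTL:-launchctl}

# Print one status line for a label.
# Arguments after the label form a probe command (hypothetical and
# service-specific) that should succeed only when the service responds.
# Exit codes: 0 reachable, 1 loaded but unreachable, 2 not loaded.
service_status() {
  label=$1
  shift
  if ! "$LAUNCHCTL" print "system/${label}" >/dev/null 2>&1; then
    echo "${label}: not loaded"
    return 2
  fi
  if "$@" >/dev/null 2>&1; then
    echo "${label}: reachable"
    return 0
  fi
  echo "${label}: loaded but unreachable"
  return 1
}

# Example, assuming a hypothetical GATEWAY_PORT variable holds the listen port:
#   service_status com.joel.gateway nc -z 127.0.0.1 "$GATEWAY_PORT"
```

The three distinct outcomes matter during partial recovery: a job can be loaded in the system domain while the process behind it is still failing or warming up.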

Consequences

Positive

  • Critical host services can start at boot without Aqua login.
  • The installed runtime matches the repo-managed truth directly.
  • The reboot story is simpler to inspect: launchctl print system/<label>.
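
A minimal sketch of that inspection, looping over the six critical labels named in this ADR. The function name check_critical_labels is hypothetical, and the sketch assumes launchctl print exits non-zero when a service is not present in the domain.

```shell
#!/bin/sh
# LAUNCHCTL can be overridden, e.g. to exercise the loop on a non-macOS host.
LAUNCHCTL=${LAUNCHCTL:-launchctl}

CRITICAL_LABELS="com.joel.colima com.joel.k8s-reboot-heal com.joel.agent-secrets com.joel.system-bus-worker com.joel.gateway com.joelclaw.agent-mail"

# Print one line per label and return non-zero if any label is missing
# from the system domain.
check_critical_labels() {
  missing=0
  for label in $CRITICAL_LABELS; do
    if "$LAUNCHCTL" print "system/${label}" >/dev/null 2>&1; then
      echo "ok      ${label}"
    else
      echo "MISSING ${label}"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}

# Usage (reading the system domain may require root):
#   sudo sh -c '. ./check-labels.sh && check_critical_labels'
```

This only confirms that launchd knows about each job; it says nothing about whether the service behind the label is answering.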

Negative

  • Installer still requires root once.
  • Launchd assets must now remain valid for /Library/LaunchDaemons/ semantics, not just GUI LaunchAgents.
  • Services that were previously recovered manually may see a brief restart during migration when the installer kills stale fallbacks and restarts them under launchd ownership.

Implementation notes

The migrated critical labels are:

  • com.joel.colima
  • com.joel.k8s-reboot-heal
  • com.joel.agent-secrets
  • com.joel.system-bus-worker
  • com.joel.gateway
  • com.joelclaw.agent-mail

Canonical installer:

sudo ~/Code/joelhooks/joelclaw/infra/install-critical-launchdaemons.sh

Compatibility alias:

sudo ~/Code/joelhooks/joelclaw/infra/install-headless-bootstrap.sh
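
For orientation, the per-label work such an installer has to do can be sketched as follows. This is not the contents of install-critical-launchdaemons.sh: the function names, the DRY_RUN switch, and the assumption that each source file is named <label>.plist are all hypothetical. Cleanup of stale autossh listeners and nohup fallbacks is left out.

```shell
#!/bin/sh
# With DRY_RUN=1 (the default in this sketch) commands are printed, not run.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Move one label from a per-user LaunchAgent to a system LaunchDaemon.
migrate_label() {
  label=$1
  src_dir=$2   # directory holding the canonical plists, e.g. infra/launchd
  uid=$3       # numeric UID that previously owned the LaunchAgent
  dest="/Library/LaunchDaemons/${label}.plist"

  # Drop stale per-user copies; errors are ignored because the job may
  # not be loaded in either domain.
  run launchctl bootout "gui/${uid}/${label}" || true
  run launchctl bootout "user/${uid}/${label}" || true

  # LaunchDaemon plists should be root-owned and not group/world writable.
  run install -o root -g wheel -m 644 "${src_dir}/${label}.plist" "$dest"

  # Replace any previously loaded system copy, then load the new one.
  run launchctl bootout "system/${label}" || true
  run launchctl bootstrap system "$dest"
}

# Example: migrate_label com.joel.gateway infra/launchd 501
```

Running it with the default DRY_RUN=1 prints the five commands in order, which makes the sequence easy to review before executing anything as root.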

Follow-up

  1. Run the new installer on Panda and verify each critical label via launchctl print system/<label>.
  2. Remove or archive any stale local notes that still instruct operators to rely on ADR-0239’s bridge.
  3. Keep com.joel.colima narrow: it is a boot/startup helper, not a periodic hammer. It should run colima start ... at load and exit, but it must not keep a StartInterval that re-runs colima start every few minutes against an already-running VM.
  4. Do not keep com.joel.colima-tunnel in the critical boot path. Colima/Lima already owns the docker-published host ports for joelclaw-controlplane-1; a second autossh daemon on those same ports is duplicate ownership and can kill Lima’s own ssh listeners. The installer should remove the daemon, and the compatibility script should exit cleanly instead of binding those ports.
  5. Do not keep com.joel.typesense-portforward in the critical boot path either. Typesense is already published through joelclaw-controlplane-1; a separate kubectl port-forward svc/typesense 8108:8108 daemon only adds EOF / connection-refused churn on a port Lima already exposes.
  6. Keep the reboot-heal script aligned with the live runtime contract: use colima status --json, restore the 192.168.1.0/24 via 192.168.64.1 dev col0 NAS route, and restart flannel when recent kubelet events show missing subnet.env even if the flannel pod still claims Running.
  7. Because com.joel.k8s-reboot-heal is a fresh launchd process every interval, any recovery markers that matter across ticks must be persisted on disk. In-memory timestamps are not enough; they let already-seen flannel subnet.env events trigger repeat restarts and can throw Typesense back into warmup after it already recovered.
  8. Host-facing status surfaces must report daemon reachability, not just launchd bookkeeping, or operators will get lied to during partial recovery.
  9. Keep working the separate steering issue: agent-mail search reliability still needs repair so daily steering is based on real traffic.
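
Point 7 can be illustrated with a small on-disk marker scheme. This is a sketch only: the state directory, the function names, and the idea of keying markers by an event id are assumptions, not the actual reboot-heal implementation.

```shell
#!/bin/sh
# Hypothetical state directory; the real script may keep markers elsewhere.
STATE_DIR=${STATE_DIR:-/var/tmp/k8s-reboot-heal}

# Return 0 if this event id was already handled on an earlier tick.
event_seen() {
  [ -f "${STATE_DIR}/seen/$1" ]
}

# Record the event id on disk so the next fresh process can see it.
mark_event_seen() {
  mkdir -p "${STATE_DIR}/seen" && : > "${STATE_DIR}/seen/$1"
}

# Run the given command at most once per event id across process restarts.
# The marker is written only if the command succeeds, so a failed recovery
# is retried on the next tick. Event ids must be safe to use as file names.
handle_once() {
  event_id=$1
  shift
  if event_seen "$event_id"; then
    echo "skip ${event_id}: already handled"
    return 0
  fi
  "$@" && mark_event_seen "$event_id"
}

# Example, where restart_flannel is a hypothetical recovery function and the
# id would be derived from the kubelet event being reacted to:
#   handle_once "flannel-subnet-env-${event_uid}" restart_flannel
```

Since every interval starts a new process, the file under STATE_DIR is the only thing that survives between ticks; an in-memory variable would be reset each time and the same event would trigger another restart.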