ADR-0240 (accepted)
Boot-safe LaunchDaemons for critical host services
Context
ADR-0239 tried to patch the reboot gap by keeping the critical host services as user LaunchAgents and adding a root-owned bridge that bootstrapped them into user/$UID whenever gui/$UID was absent.
That design did not hold up on Panda.
After installation, the bridge daemon itself loaded, but the real step that mattered kept failing:
launchctl bootstrap user/501 <plist>
Bootstrap failed: 5: Input/output error
So the machine still needed manual nohup recovery for the same critical surfaces after a headless reboot:
- com.joel.colima
- com.joel.k8s-reboot-heal
- com.joel.agent-secrets
- com.joel.system-bus-worker
- com.joel.gateway
- com.joelclaw.agent-mail
That is needless complexity. The safer shape is to run the host control plane as actual system services.
Decision
Replace the ADR-0239 bridge with boot-safe LaunchDaemons for the critical host services.
- Keep the canonical plist sources in infra/launchd/.
- Install the critical labels into /Library/LaunchDaemons/, not ~/Library/LaunchAgents/.
- Run the services in the system launchd domain.
- Use UserName=joel / GroupName=staff where the process should execute with Joel’s home, repo, auth, and filesystem context.
- Add infra/install-critical-launchdaemons.sh as the canonical root installer.
- Keep infra/install-headless-bootstrap.sh only as a compatibility wrapper that now delegates to the new installer.
- Remove the installed com.joel.headless-bootstrap system daemon and stop documenting the bridge as an active recovery path.
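The install step for one label can be sketched roughly as follows. This is not the contents of infra/install-critical-launchdaemons.sh, which this ADR does not show. It assumes each plist is named after its label, and the run helper and DRY_RUN switch are hypothetical additions so the sketch prints commands by default instead of touching the system.

```shell
#!/bin/sh
# Sketch only: not the real installer from the repo.
# DRY_RUN defaults to 1, so commands are printed rather than executed.
DRY_RUN="${DRY_RUN:-1}"
SRC_DIR="infra/launchd"              # canonical plist sources named in this ADR
DEST_DIR="/Library/LaunchDaemons"    # system daemon directory

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Install one label as a system LaunchDaemon.
# Assumption: the plist file is named <label>.plist.
install_daemon() {
  label="$1"
  plist="$DEST_DIR/$label.plist"
  # Unload any previously loaded copy; failing here is fine if none exists.
  run launchctl bootout "system/$label" || true
  # launchd expects daemon plists to be root-owned and not group/world writable.
  run install -o root -g wheel -m 644 "$SRC_DIR/$label.plist" "$plist"
  # Load the service into the system domain.
  run launchctl bootstrap system "$plist"
}
```

Calling install_daemon com.joel.gateway in dry-run mode prints the three commands it would run. Setting DRY_RUN=0 and running as root would execute them.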
Why this
- Boot-safe by design: no cross-domain launchctl trickery, just real system services.
- Fewer moving parts: the bridge, periodic probing, and GUI/user handoff logic all disappear.
- Same runtime identity where needed: UserName=joel keeps the processes in Joel’s filesystem/auth context without requiring Aqua login.
- Cleaner recovery: the installer can also tear down stale user LaunchAgents, stale autossh tunnel listeners, and manual nohup fallbacks before bootstrapping the system daemons.
- But still only if the daemons tell the truth: the boot-safe launchd shape does not help if recovery scripts still use brittle checks like plain colima status or declare success before the Colima→NAS route and flannel subnet state are back.
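As an illustration of the runtime-identity point, a system daemon plist carrying UserName and GroupName might look like the output of the sketch below. The render_daemon_plist helper and the program path passed to it are hypothetical, and RunAtLoad is an assumption about how these services start. The real plists live in infra/launchd/ and are not reproduced in this ADR.

```shell
#!/bin/sh
# Sketch: print a LaunchDaemon plist that runs as joel:staff.
# render_daemon_plist is a hypothetical helper, not part of the repo.
render_daemon_plist() {
  label="$1"      # launchd label, e.g. com.joel.gateway
  program="$2"    # placeholder executable path supplied by the caller
  cat <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>$label</string>
  <key>ProgramArguments</key>
  <array>
    <string>$program</string>
  </array>
  <key>UserName</key>
  <string>joel</string>
  <key>GroupName</key>
  <string>staff</string>
  <key>RunAtLoad</key>
  <true/>
</dict>
</plist>
EOF
}
```

On macOS the rendered file can be checked with plutil -lint before it is installed.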
Consequences
Positive
- Critical host services can start at boot without Aqua login.
- The installed runtime matches the repo-managed truth directly.
- The reboot story is simpler to inspect: launchctl print system/<label>.
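A minimal sketch of that inspection across all six critical labels, assuming launchctl print exits non-zero for a service that is not loaded. The function names are hypothetical.

```shell
#!/bin/sh
# The six critical labels listed in this ADR.
CRITICAL_LABELS="com.joel.colima com.joel.k8s-reboot-heal com.joel.agent-secrets com.joel.system-bus-worker com.joel.gateway com.joelclaw.agent-mail"

# Print the launchctl service target for each critical label.
service_targets() {
  for label in $CRITICAL_LABELS; do
    printf 'system/%s\n' "$label"
  done
}

# Report which labels are loaded in the system domain.
# Returns non-zero if any label is missing.
check_critical_labels() {
  missing=0
  for target in $(service_targets); do
    if launchctl print "$target" >/dev/null 2>&1; then
      echo "loaded:  $target"
    else
      echo "MISSING: $target"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}
```

A loaded label only confirms launchd bookkeeping. It does not show that the service behind it is reachable.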
Negative
- The installer still requires root once.
- Launchd assets must now remain valid for /Library/LaunchDaemons/ semantics, not just GUI LaunchAgents.
- Services that were previously recovered manually may see a brief restart during migration when the installer kills stale fallbacks and restarts them under launchd ownership.
Implementation notes
The migrated critical labels are:
- com.joel.colima
- com.joel.k8s-reboot-heal
- com.joel.agent-secrets
- com.joel.system-bus-worker
- com.joel.gateway
- com.joelclaw.agent-mail
Canonical installer:
sudo ~/Code/joelhooks/joelclaw/infra/install-critical-launchdaemons.sh
Compatibility alias:
sudo ~/Code/joelhooks/joelclaw/infra/install-headless-bootstrap.sh
Follow-up
- Run the new installer on Panda and verify each critical label via launchctl print system/<label>.
- Remove or archive any stale local notes that still instruct operators to rely on ADR-0239’s bridge.
- Keep com.joel.colima narrow: it is a boot/startup helper, not a periodic hammer. It should run colima start ... at load and exit, but it must not keep a StartInterval that re-runs colima start every few minutes against an already-running VM.
- Do not keep com.joel.colima-tunnel in the critical boot path. Colima/Lima already owns the docker-published host ports for joelclaw-controlplane-1; a second autossh daemon on those same ports is duplicate ownership and can kill Lima’s own ssh listeners. The installer should remove the daemon, and the compatibility script should exit cleanly instead of binding those ports.
- Do not keep com.joel.typesense-portforward in the critical boot path either. Typesense is already published through joelclaw-controlplane-1; a separate kubectl port-forward svc/typesense 8108:8108 daemon only adds EOF / connection-refused churn on a port Lima already exposes.
- Keep the reboot-heal script aligned with the live runtime contract: use colima status --json, restore the 192.168.1.0/24 via 192.168.64.1 dev col0 NAS route, and restart flannel when recent kubelet events show missing subnet.env even if the flannel pod still claims Running.
- Because com.joel.k8s-reboot-heal is a fresh launchd process every interval, any recovery markers that matter across ticks must be persisted on disk. In-memory timestamps are not enough; they let already-seen flannel subnet.env events trigger repeat restarts and can throw Typesense back into warmup after it already recovered.
- Host-facing status surfaces must report daemon reachability, not just launchd bookkeeping, or operators will get lied to during partial recovery.
- Keep working the separate steering issue: agent-mail search reliability still needs repair so daily steering is based on real traffic.
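The on-disk marker requirement for com.joel.k8s-reboot-heal could be met along these lines. This is a sketch under stated assumptions, not the actual reboot-heal script: the state directory, marker file name, and function names are hypothetical, and events are identified by an epoch-seconds timestamp.

```shell
#!/bin/sh
# Sketch: remember the newest handled event across launchd ticks.
# STATE_DIR and the marker file name are hypothetical.
STATE_DIR="${STATE_DIR:-/tmp/k8s-reboot-heal-state}"
MARKER_FILE="$STATE_DIR/last-flannel-event"

# Succeeds if an event at this timestamp (or newer) was already handled.
already_handled() {
  [ -f "$MARKER_FILE" ] || return 1
  last="$(cat "$MARKER_FILE")"
  [ "$1" -le "$last" ]
}

# Persist the timestamp atomically: write a temp file, then rename it.
record_handled() {
  mkdir -p "$STATE_DIR"
  tmp="$MARKER_FILE.$$"
  printf '%s\n' "$1" > "$tmp"
  mv "$tmp" "$MARKER_FILE"
}

# Act only on events newer than the persisted marker.
handle_event() {
  event_ts="$1"
  if already_handled "$event_ts"; then
    echo "skip $event_ts"
  else
    echo "restart flannel for $event_ts"   # placeholder for the real action
    record_handled "$event_ts"
  fi
}
```

Because the marker lives on disk, a later tick that sees the same subnet.env event again prints skip instead of repeating the restart.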