ADR-0240 · accepted

Boot-safe LaunchDaemons for critical host services

Context

ADR-0239 tried to patch the reboot gap by keeping the critical host services as user LaunchAgents and adding a root-owned bridge that bootstrapped them into user/$UID whenever gui/$UID was absent.

That design did not earn its keep on Panda.

After installation, the bridge daemon itself loaded, but the step that actually mattered kept failing:

launchctl bootstrap user/501 <plist>
Bootstrap failed: 5: Input/output error

So the machine still needed manual nohup recovery for the same critical surfaces after a headless reboot:

  • com.joel.colima
  • com.joel.k8s-reboot-heal
  • com.joel.agent-secrets
  • com.joel.system-bus-worker
  • com.joel.gateway
  • com.joel.typesense-portforward
  • com.joelclaw.agent-mail

That is needless complexity. The safer shape is to run the host control plane as actual system services.

Decision

Replace the ADR-0239 bridge with boot-safe LaunchDaemons for the critical host services.

  1. Keep the canonical plist sources in infra/launchd/.
  2. Install the critical labels into /Library/LaunchDaemons/, not ~/Library/LaunchAgents/.
  3. Run the services in the system launchd domain.
  4. Use UserName=joel / GroupName=staff where the process should execute with Joel’s home, repo, auth, and filesystem context.
  5. Add infra/install-critical-launchdaemons.sh as the canonical root installer.
  6. Keep infra/install-headless-bootstrap.sh only as a compatibility wrapper that now delegates to the new installer.
  7. Remove the installed com.joel.headless-bootstrap system daemon and stop documenting the bridge as an active recovery path.
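
As a rough illustration of steps 2, 3, and 7, the sketch below prints the commands a root installer might run for one label. It is not the contents of infra/install-critical-launchdaemons.sh. The function name plan_install, the <label>.plist file naming, and the teardown order are assumptions made for the example; only the directories and the system domain come from this ADR.

```shell
#!/bin/sh
# Hypothetical sketch, NOT the real installer. It only prints a plan so the
# commands can be reviewed before anything is run as root.

SRC_DIR="${SRC_DIR:-infra/launchd}"   # canonical plist sources (from the ADR)
DST_DIR="/Library/LaunchDaemons"      # system daemon location (from the ADR)

# plan_install LABEL UID
#   LABEL: launchd label; assumed to match the file name LABEL.plist
#   UID:   numeric UID that previously owned the LaunchAgent, e.g. 501
plan_install() {
  label="$1"
  uid="$2"
  # Tear down stale copies in the per-user domains and any earlier system copy.
  # "|| true" because bootout fails harmlessly when nothing is loaded.
  echo "launchctl bootout gui/$uid/$label || true"
  echo "launchctl bootout user/$uid/$label || true"
  echo "launchctl bootout system/$label || true"
  # launchd rejects daemon plists that are not root-owned or are
  # writable by group/other, so install with explicit owner and mode.
  echo "install -o root -g wheel -m 644 $SRC_DIR/$label.plist $DST_DIR/$label.plist"
  # Load the service into the system domain.
  echo "launchctl bootstrap system $DST_DIR/$label.plist"
}
```

Calling plan_install com.joel.gateway 501 prints five commands for that label. Printing rather than executing keeps the sketch reviewable; a real installer would run the equivalent steps directly and handle errors per label.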

Why this

  • Boot-safe by design: no cross-domain launchctl trickery, just real system services.
  • Fewer moving parts: the bridge, periodic probing, and GUI/user handoff logic all disappear.
  • Same runtime identity where needed: UserName=joel keeps the processes in Joel’s filesystem/auth context without requiring Aqua login.
  • Cleaner recovery: the installer can also tear down stale user LaunchAgents and manual nohup fallbacks before bootstrapping the system daemons.
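
To make the UserName/GroupName point concrete, here is a minimal sketch of a LaunchDaemon plist for one of these services, emitted from a shell function. The function name emit_daemon_plist and the program path passed to it are hypothetical; the real plists in infra/launchd/ likely carry more keys (logging paths, KeepAlive, environment variables) that are not shown here.

```shell
#!/bin/sh
# Hypothetical sketch: prints a minimal LaunchDaemon plist to stdout.
# Label, ProgramArguments, UserName, GroupName and RunAtLoad are standard
# launchd.plist keys; the values for UserName/GroupName come from the ADR.

# emit_daemon_plist LABEL PROGRAM
emit_daemon_plist() {
  label="$1"
  program="$2"   # absolute path to the service executable (assumed)
  cat <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>$label</string>
  <key>ProgramArguments</key>
  <array>
    <string>$program</string>
  </array>
  <key>UserName</key>
  <string>joel</string>
  <key>GroupName</key>
  <string>staff</string>
  <key>RunAtLoad</key>
  <true/>
</dict>
</plist>
EOF
}
```

On macOS the output can be sanity-checked by writing it to a file and running plutil -lint on that file. Because the daemon is loaded in the system domain, launchd starts it at boot and drops to the named user before running the program, which is what removes the dependency on an Aqua session.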

Consequences

Positive

  • Critical host services can start at boot without Aqua login.
  • The installed runtime matches the repo-managed truth directly.
  • The reboot story is simpler to inspect: launchctl print system/<label>.

Negative

  • Installer still requires root once.
  • Launchd assets must now remain valid for /Library/LaunchDaemons/ semantics, not just GUI LaunchAgents.
  • Services that were previously recovered manually may see a brief restart during migration, when the installer kills stale fallbacks and restarts them under launchd ownership.

Implementation notes

The migrated critical labels are:

  • com.joel.colima
  • com.joel.k8s-reboot-heal
  • com.joel.agent-secrets
  • com.joel.system-bus-worker
  • com.joel.gateway
  • com.joel.typesense-portforward
  • com.joelclaw.agent-mail

Canonical installer:

sudo ~/Code/joelhooks/joelclaw/infra/install-critical-launchdaemons.sh

Compatibility alias:

sudo ~/Code/joelhooks/joelclaw/infra/install-headless-bootstrap.sh
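
For illustration only, a compatibility wrapper of this kind can be a few lines that resolve the sibling script and hand off to it. The helper name wrapper_target is hypothetical and this is not the actual content of install-headless-bootstrap.sh; it assumes both scripts live in the same directory, as the two paths above suggest.

```shell
#!/bin/sh
# Hypothetical sketch of a delegating wrapper.

# wrapper_target PATH
#   PATH is the path the wrapper was invoked as (normally "$0").
#   Prints the path of the new installer in the same directory.
wrapper_target() {
  echo "$(dirname "$1")/install-critical-launchdaemons.sh"
}

# In a real wrapper the final line would replace the wrapper process with
# the new installer, passing all arguments through unchanged:
#   exec "$(wrapper_target "$0")" "$@"
```

Using exec means the wrapper leaves no extra process behind and the installer's exit status is what the caller sees.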

Follow-up

  1. Run the new installer on Panda and verify each critical label via launchctl print system/<label>.
  2. Remove or archive any stale local notes that still instruct operators to rely on ADR-0239’s bridge.
  3. Keep working the separate steering issue: agent-mail search reliability still needs repair so daily steering is based on real traffic.
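
Follow-up step 1 can be scripted along these lines. This is a sketch: the function name verify_critical and the LAUNCHCTL override variable are assumptions added for the example, and it relies on launchctl print exiting non-zero when a label is not loaded in the system domain. A loaded label is not necessarily a healthy service, so the full launchctl print output is still worth reading for any label that matters.

```shell
#!/bin/sh
# Hypothetical sketch: report which critical labels are loaded in the
# system launchd domain.

# Allows substituting a stub for testing; defaults to the real binary.
LAUNCHCTL="${LAUNCHCTL:-launchctl}"

# The migrated critical labels, as listed in this ADR.
CRITICAL_LABELS="com.joel.colima
com.joel.k8s-reboot-heal
com.joel.agent-secrets
com.joel.system-bus-worker
com.joel.gateway
com.joel.typesense-portforward
com.joelclaw.agent-mail"

# Prints one status line per label; returns non-zero if any label is missing.
verify_critical() {
  missing=0
  for label in $CRITICAL_LABELS; do
    if "$LAUNCHCTL" print "system/$label" >/dev/null 2>&1; then
      echo "ok      $label"
    else
      echo "MISSING $label"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}
```

Run as root (for example from a sudo shell) after sourcing the file, then call verify_critical; the non-zero return on any missing label makes it usable in a post-reboot check.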