ADR-0239 (superseded)

Headless user-domain boot bridge for critical launchd services

> Superseded on 2026-04-12 by ADR-0240 (Boot-safe LaunchDaemons for critical host services).
>
> Why: the installed bridge never earned the reboot path on Panda. The system daemon ran, but launchctl bootstrap user/501 <plist> kept failing with Input/output error, so the bridge could not reliably restore the critical services after headless boot.

Context

A post-reboot failure left Panda in a headless state: the machine was up, the user/$UID launchd domain existed, but the Aqua gui/$UID domain did not. Critical services that had only been managed as GUI LaunchAgents never came back:

  • com.joel.colima
  • com.joel.k8s-reboot-heal
  • com.joel.agent-secrets
  • com.joel.system-bus-worker
  • com.joel.gateway
  • com.joel.typesense-portforward
  • com.joelclaw.agent-mail

The recovery required manual nohup starts just to get the system back on its feet. That is not earned infrastructure.

The reboot also exposed two configuration drifts:

  1. the repo-tracked Colima launchd asset still declared 4 CPU / 8 GiB / 60 GiB instead of the stable 8 CPU / 16 GiB / 100 GiB profile.
  2. several important launchd assets (gateway, typesense-portforward, agent-mail) still lived as hand-edited files under ~/Library/LaunchAgents instead of repo-tracked sources in infra/launchd/.
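
The second drift could be detected mechanically. The following is a minimal sketch, not part of the landed assets: it reports, for each label passed in, whether the file under ~/Library/LaunchAgents is a symlink to the repo source, a hand-edited regular file, or missing. The function name check_drift and the REPO_LAUNCHD / AGENTS_DIR variables are hypothetical, and REPO_LAUNCHD is assumed to be an absolute path so the symlink target can be compared as a string.

```shell
#!/bin/sh
# Hypothetical drift check: are user LaunchAgents symlinks into the repo?
# REPO_LAUNCHD must be absolute for the readlink comparison to be meaningful.
REPO_LAUNCHD="${REPO_LAUNCHD:-$PWD/infra/launchd}"
AGENTS_DIR="${AGENTS_DIR:-$HOME/Library/LaunchAgents}"

check_drift() {
  drift=0
  for label in "$@"; do
    src="$REPO_LAUNCHD/$label.plist"
    dest="$AGENTS_DIR/$label.plist"
    if [ -L "$dest" ]; then
      target=$(readlink "$dest")
      if [ "$target" = "$src" ]; then
        echo "ok $label"
      else
        echo "wrong-target $label -> $target"
        drift=1
      fi
    elif [ -e "$dest" ]; then
      # A regular file here is exactly the hand-edited case described above.
      echo "hand-edited $label"
      drift=1
    else
      echo "missing $label"
      drift=1
    fi
  done
  return "$drift"
}
```

Calling check_drift com.joel.gateway com.joel.typesense-portforward com.joelclaw.agent-mail returns non-zero if any of the three has drifted, which makes it usable as a pre-commit or CI gate.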

Decision

Adopt a headless boot bridge:

  1. Canonical launchd assets live in the repo under infra/launchd/, including newly tracked plists for:
    • com.joel.gateway
    • com.joel.typesense-portforward
    • com.joelclaw.agent-mail
  2. Add a system LaunchDaemon asset: infra/launchd/com.joel.headless-bootstrap.plist.
  3. That LaunchDaemon runs infra/headless-bootstrap.sh as root on boot and every 60 seconds.
  4. The script detects whether gui/$UID exists:
    • if GUI is absent: bootstrap the critical repo-managed launch agents into user/$UID
    • if GUI is present again: boot out the temporary user/$UID copies so normal GUI ownership can resume without duplicate processes
  5. Add infra/install-headless-bootstrap.sh as the canonical installer:
    • symlink critical user launch agents from ~/Library/LaunchAgents/ back to repo sources
    • install the system LaunchDaemon to /Library/LaunchDaemons/
    • bootstrap and kickstart the system bridge
  6. Correct the repo-tracked Colima launchd asset to the stable runtime profile: 8 CPU / 16 GiB / 100 GiB.
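
Step 4 is the core of the bridge, so a rough sketch of one pass may help. This is an illustration under stated assumptions, not the contents of infra/headless-bootstrap.sh: the function names, the LAUNCHCTL and PLIST_DIR variables, and the use of launchctl print as an existence probe (assuming it exits non-zero for a missing domain or service) are all hypothetical. The label list is taken from the Context section.

```shell
#!/bin/sh
# Hypothetical sketch of one bridge pass (the daemon would run it every 60 s).
# LAUNCHCTL is overridable so the logic can be exercised without real launchd.
LAUNCHCTL="${LAUNCHCTL:-launchctl}"
PLIST_DIR="${PLIST_DIR:-$PWD/infra/launchd}"   # assumed checkout location
CRITICAL_LABELS="com.joel.colima com.joel.k8s-reboot-heal com.joel.agent-secrets
com.joel.system-bus-worker com.joel.gateway com.joel.typesense-portforward
com.joelclaw.agent-mail"

# Assumption: `launchctl print <target>` fails when the target does not exist.
gui_domain_exists() { "$LAUNCHCTL" print "gui/$1" >/dev/null 2>&1; }
service_loaded()    { "$LAUNCHCTL" print "$1" >/dev/null 2>&1; }

bridge_tick() {
  uid="$1"
  if gui_domain_exists "$uid"; then gui=1; else gui=0; fi
  for label in $CRITICAL_LABELS; do
    if [ "$gui" -eq 1 ]; then
      # GUI is back: drop the temporary user-domain copy, if any.
      if service_loaded "user/$uid/$label"; then
        "$LAUNCHCTL" bootout "user/$uid/$label" \
          || echo "bootout failed: $label" >&2
      fi
    elif ! service_loaded "user/$uid/$label"; then
      # Headless: bring the agent up in the user domain.
      "$LAUNCHCTL" bootstrap "user/$uid" "$PLIST_DIR/$label.plist" \
        || echo "bootstrap failed: $label" >&2
    fi
  done
}
```

A pass would be invoked as bridge_tick 501. Failures are logged rather than treated as fatal, so one bad plist does not block the remaining services. Whether user/$UID and gui/$UID lookups are fully disjoint on a given macOS release is something to verify on the host before relying on the handoff branch.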

Why this

  • Survives headless reboots — boot no longer depends on Aqua login just to restore the core control plane.
  • No shadow plist drift — launchd assets become git-tracked truth, not hand-edited local snowflakes.
  • Minimal change to service code — keep existing launchd-managed services; add a domain bridge instead of rewriting every runtime.
  • Clean handoff when GUI returns — the bridge is temporary ownership, not a permanent duplicate runtime.

Consequences

Positive

  • Core services recover automatically after reboot even when no GUI session exists.
  • Colima boot no longer regresses to the stale undersized profile from the repo asset.
  • Gateway, agent-mail, and Typesense port-forward become canonical repo-managed launchd assets.

Negative

  • The bridge installer requires root once (sudo infra/install-headless-bootstrap.sh).
  • Critical services now have a cross-domain handoff path that must stay documented and tested.
  • CLI/ops commands that assume gui/$UID still need gradual cleanup to become fully domain-aware.
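
For orientation, the one-time root install mentioned in the first item could look roughly like the sketch below. It is not the contents of infra/install-headless-bootstrap.sh. The function name install_bridge and the overridable directory variables are hypothetical, and copying (rather than symlinking) the daemon plist is an assumption made because launchd is strict about ownership of files under /Library/LaunchDaemons.

```shell
#!/bin/sh
# Hypothetical installer sketch. Directories are overridable so the steps can
# be dry-run against a scratch tree; the defaults mirror the paths in this ADR.
LAUNCHCTL="${LAUNCHCTL:-launchctl}"
REPO_LAUNCHD="${REPO_LAUNCHD:-$PWD/infra/launchd}"
# Under sudo, HOME may not be the target user's home; pass AGENTS_DIR explicitly.
AGENTS_DIR="${AGENTS_DIR:-$HOME/Library/LaunchAgents}"
DAEMONS_DIR="${DAEMONS_DIR:-/Library/LaunchDaemons}"
BRIDGE_LABEL="com.joel.headless-bootstrap"

install_bridge() {
  # 1. Point each user launch agent back at its repo-tracked source.
  for label in "$@"; do
    ln -sfn "$REPO_LAUNCHD/$label.plist" "$AGENTS_DIR/$label.plist" || return 1
  done

  # 2. Install the system LaunchDaemon (copied, not symlinked: an assumption).
  cp "$REPO_LAUNCHD/$BRIDGE_LABEL.plist" "$DAEMONS_DIR/$BRIDGE_LABEL.plist" \
    || return 1

  # 3. Reload the bridge: remove any previous instance, bootstrap, kickstart.
  "$LAUNCHCTL" bootout "system/$BRIDGE_LABEL" >/dev/null 2>&1 || true
  "$LAUNCHCTL" bootstrap system "$DAEMONS_DIR/$BRIDGE_LABEL.plist" || return 1
  "$LAUNCHCTL" kickstart -k "system/$BRIDGE_LABEL"
}
```

A real installer would also need to set root ownership and restrictive permissions on the copied plist; that step is left out here so the sketch can run unprivileged against a scratch directory.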

Implementation notes

Repo assets landed in the reboot-hardening session:

  • infra/launchd/com.joel.gateway.plist
  • infra/launchd/com.joel.typesense-portforward.plist
  • infra/launchd/com.joelclaw.agent-mail.plist
  • infra/launchd/com.joel.headless-bootstrap.plist
  • infra/headless-bootstrap.sh
  • infra/install-headless-bootstrap.sh
  • infra/launchd/com.joel.colima.plist updated to 8 CPU / 16 GiB / 100 GiB
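
The contents of com.joel.headless-bootstrap.plist are not reproduced in this ADR. As a hedged illustration of how "on boot and every 60 seconds" maps onto launchd keys, the sketch below prints one plausible shape using RunAtLoad and StartInterval. The function name render_bridge_plist is hypothetical, the script path is passed in because the absolute checkout location is not recorded here, and running the script through /bin/sh is an assumption.

```shell
#!/bin/sh
# Hypothetical generator for a LaunchDaemon plist that runs a script at load
# and then on a fixed interval. Usage: render_bridge_plist <script> [seconds]
render_bridge_plist() {
  script_path="$1"
  interval="${2:-60}"
  cat <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.joel.headless-bootstrap</string>
  <key>ProgramArguments</key>
  <array>
    <string>/bin/sh</string>
    <string>${script_path}</string>
  </array>
  <key>RunAtLoad</key>
  <true/>
  <key>StartInterval</key>
  <integer>${interval}</integer>
</dict>
</plist>
EOF
}
```

No UserName key is set, so a daemon loaded from /Library/LaunchDaemons would run as root, matching the decision above. The output can be written to a file and checked with plutil -lint before installing.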

Follow-up

  1. Make CLI launchd management domain-aware (gui/$UID vs user/$UID) for gateway, worker, Talon, and secrets surfaces.
  2. Add a deterministic smoke test for the headless bridge install path.
  3. Extend the daily steering repair work to agent-mail search reliability, which was independently degraded during the same recovery window.
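
Follow-up 1 could start from a small helper that picks the domain at call time instead of hard-coding gui/$UID. The sketch below is one possible shape, not existing CLI code: resolve_domain and restart_service are hypothetical names, and it again assumes that launchctl print exits non-zero when the gui domain is absent.

```shell
#!/bin/sh
# Hypothetical domain-aware helpers for CLI/ops commands.
LAUNCHCTL="${LAUNCHCTL:-launchctl}"

# Print the launchd domain a user's agents should be managed in:
# gui/<uid> when an Aqua session exists, otherwise user/<uid>.
resolve_domain() {
  uid="$1"
  if "$LAUNCHCTL" print "gui/$uid" >/dev/null 2>&1; then
    echo "gui/$uid"
  else
    echo "user/$uid"
  fi
}

# Restart a service in whichever domain currently applies.
restart_service() {
  target="$(resolve_domain "$1")/$2"
  "$LAUNCHCTL" kickstart -k "$target"
}
```

With this in place, restart_service 501 com.joel.gateway targets gui/501/com.joel.gateway during a normal session and user/501/com.joel.gateway after a headless boot.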