ADR-0246

Mac Studio Central runtime migration

Status

Proposed.

This ADR captures the design direction resolved in CONTEXT.md: move Central from Panda to the Mac Studio host identified as mac-studio-central, make Panda relay-only, and replace the Panda k8s/Talos operational shape with a reboot-surviving Mac service runtime.

Context

Panda currently carries too much of the joelclaw control plane:

  • Colima VM
  • Talos/k8s single-node cluster
  • Inngest
  • Redis
  • Typesense
  • Restate
  • MinIO/object storage surfaces
  • system-bus worker
  • restate-worker/workload runtime
  • gateway and channel relay surfaces
  • local-hardware-bound iMessage relay work

The recent fan/CPU incident had two causes:

  1. stale orphaned Next.js jest-worker/processChild.js processes from a separate repo, which were killed;
  2. sustained Colima/Talos/Inngest churn, especially memory/run.captured backlog pressure, which exposed that Panda is doing too much as both Central and relay host.

The earlier runtime line has been useful but expensive:

  • ADR-0029 replaced Docker Desktop with Colima + Talos because Docker Desktop was a GUI app pretending to be infrastructure.
  • ADR-0240 moved critical host services toward boot-safe LaunchDaemons.
  • ADR-0241/0242/0244 added increasing amounts of evidence, gating, and recovery machinery around Colima/Talos failure modes.
  • ADR-0245 hardened kube operator access with dedicated tunnels.

That work made Panda more honest, but it also proved the point: the system is accumulating infrastructure ceremony Joel does not want. The goal is stable personal AI infrastructure, not a sysadmin career.

The Mac Studio is the next Central host. Panda should remain useful where it has unique local value — account-bound relay surfaces such as iMessage — but it should stop hosting authoritative Central state and runtime.

Decision

1. Move Central to Mac Studio as a whole-Central cutover

The Mac Studio host is the target primary Central Machine for the joelclaw Network.

  • Stable machine_id: mac-studio-central
  • Human/themed display name: optional and non-authoritative
  • Central remains logical: one Central per Network, hosted on one primary Machine at a time

The migration is a whole-Central cutover:

  • state and runtime move together,
  • Panda’s old Central stack freezes only as rollback,
  • no split-brain mode where Redis lives on one host while the runtime of record lives on another,
  • no long-term active/active Central.

1a. Use a Standard-user macOS host model

The Mac Studio host should follow the same least-privilege macOS account shape as the Infinite Red “Using Standard users instead of Admin users as your primary login” guidance.

Central infrastructure must not assume Joel’s primary login is an Administrator account.

Mac Studio account model:

  • keep a separate Administrator account for setup, package installation, and LaunchDaemon installation,
  • use a Standard human login for day-to-day access and remote development,
  • run Central services under a dedicated Standard service account where practical, default name joelclaw,
  • install critical services as system LaunchDaemons, but set UserName / GroupName so long-running processes do not run as admin/root unless required,
  • lock down service-user and relay-user home directories (chmod 700),
  • keep service state in explicit shared service paths, default /Users/Shared/joelclaw/services, owned by the service account with restrictive permissions,
  • reserve root/admin escalation for installation and host maintenance, not normal runtime.
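The LaunchDaemon bullets above can be sketched as a plist fragment; the label, program path, and GroupName value here are illustrative assumptions from this ADR's defaults, not final decisions:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.joel.system-bus-worker</string>
  <key>ProgramArguments</key>
  <array>
    <string>/Users/Shared/joelclaw/services/system-bus-worker/run.sh</string>
  </array>
  <!-- run as the Standard service account, not admin/root -->
  <key>UserName</key>
  <string>joelclaw</string>
  <key>GroupName</key>
  <string>staff</string>
  <!-- recover after reboot with no logged-in user -->
  <key>RunAtLoad</key>
  <true/>
  <key>KeepAlive</key>
  <true/>
</dict>
</plist>
```

Installed under /Library/LaunchDaemons, a fragment like this runs at boot under launchd with no GUI session, while the UserName/GroupName keys keep the long-running process out of root.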

This complements the Panda Relay Sandbox rule: Relay Sandboxes are also Standard macOS users, isolated per Channel Account owner, and not dev/admin accounts.

1b. Allow Mac Studio to be Joel’s remote dev box without coupling it to Central

The Mac Studio does not need to be a single-purpose appliance. It may also be Joel’s remote development box, similar to Panda today.

That is allowed only if account identity and responsibility stay separated:

  • Joel’s Standard dev account may have repos, pi, codex, claude, editor tooling, and interactive agent runtimes.
  • The Central service account owns Central service state and runtime credentials.
  • Dev processes must not be required for Central boot, health, or recovery.
  • Central services must not depend on Joel being logged in.
  • Dev workloads are allowed to consume spare host resources, but they are not part of the critical Central runtime contract.

The host can be shared. The service identity cannot.

1c. Do not repeat Panda’s dev-account-as-infrastructure coupling

Panda’s current shape is the cautionary tale: useful dev machine and Central infrastructure accreted into the same login, same repo checkouts, same user launchd surfaces, and same operational blast radius.

Mac Studio must not follow that path.

Forbidden for critical Central runtime:

  • no Central service state under /Users/joel or any interactive dev home,
  • no critical Central services installed as ~/Library/LaunchAgents in Joel’s dev account,
  • no Central boot/recovery dependency on Joel being logged in over SSH, Screen Sharing, VS Code, Cursor, or a GUI session,
  • no Central credentials stored only in Joel’s dev account keychain, shell profile, or repo checkout,
  • no system-bus / gateway / worker process whose canonical runtime is “whatever Joel’s dev shell started”,
  • no Docker/Compose/Colima state owned by the dev account if that state is required for Central uptime,
  • no “temporary” manual dev-account service start that becomes the real production path.

If a dev workflow needs access to Central, it uses the same typed service interfaces as any other Machine. It does not become the service owner.

2. Panda becomes a Relay Machine only

After cutover, Panda is not a Central fallback and not a normal family-use Machine.

Panda keeps local-hardware-bound relay responsibilities, especially per-User Relay Sandboxes for Channel Accounts such as iMessage/iCloud:

  • one Standard macOS user account per Relay Sandbox,
  • separate iCloud session, Messages DB, Keychain, TCC/FDA grants, and relay launchd agent,
  • home directories locked down (chmod 700),
  • no pi/codex/claude dev tooling,
  • no repo checkouts,
  • no broad Central credentials.

Relay Machines normalize channel events and forward them to Central with the resolved joelclaw User identity. They do not own authoritative state, indexing, ingestion, or workflow runtime.

3. Critical Central services must be launchd-recoverable after reboot

Unattended reboot recovery is a core pillar for mac-studio-central.

A runtime is eligible for critical Central infrastructure only if it can recover after host reboot with:

  • no logged-in user,
  • no GUI session,
  • no auto-login,
  • no manual app launch.

launchd is the supervisor of record.

A container substrate is acceptable only if launchd can start, stop, health-check, and recover it headlessly. Anything that cannot satisfy this is dev/canary-only.

4. Use launchd + headless Colima/Compose as the default initial runtime shape

The default target shape is:

mac-studio-central
  launchd
    com.joel.central.colima        # start/verify container substrate, if used
    com.joel.central.compose       # bring up stateful services
    com.joel.system-bus-worker     # host-native worker, if not containerized
    com.joel.restate-worker        # host-native worker, if not containerized
    com.joel.gateway               # if Central-hosted after cutover
    com.joel.central-health        # health/verification loop
 
  Docker Compose via headless Colima
    redis
    typesense
    inngest
    restate
    minio or selected object-store surface
    optional supporting services
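The stateful tier in this shape could start from a Compose sketch like the following; image versions, the volume root, and the Typesense key variable are assumptions to finalize in infra/central/, not decisions:

```yaml
# infra/central/docker-compose.yml — illustrative sketch only
services:
  redis:
    image: redis:7            # version pin is a placeholder
    command: ["redis-server", "--appendonly", "yes"]
    volumes:
      - /Users/Shared/joelclaw/services/redis:/data
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
    restart: unless-stopped

  typesense:
    image: typesense/typesense:27.1   # version pin is a placeholder
    environment:
      TYPESENSE_DATA_DIR: /data
      # injected from the env template; never committed
      TYPESENSE_API_KEY: ${TYPESENSE_API_KEY}
    volumes:
      - /Users/Shared/joelclaw/services/typesense:/data
    restart: unless-stopped
```

Explicit host-path volumes under /Users/Shared/joelclaw/services keep state ownership visible, which is the contract section 6 insists on.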

This is not a vote for Colima because it is beautiful. It is not. It is a vote for a CLI-managed substrate that fits launchd and Compose better than a GUI-session-bound app.

Native launchd services are allowed and preferred when they are simpler than containers. The dividing line is operational clarity:

  • containers for stateful services where image/version portability matters,
  • launchd for Mac-native joelclaw workers and supervisors,
  • no Kubernetes for the first Mac Studio Central runtime unless a future ADR earns it again.

5. OrbStack is not the default Central runtime

OrbStack remains a good Mac developer container runtime, but it is not the default for critical Central services because reboot survival beats polish.

OrbStack is acceptable for:

  • dev use,
  • canary experiments,
  • non-critical workloads,
  • or critical use only after a proof shows it can be fully supervised by launchd after reboot without a logged-in GUI session.

The same rule applies to any other runtime: if it needs a person or Aqua login to come back, it is not Central infrastructure.

6. Preserve explicit state ownership and rollback

Central state must be boring to inspect, back up, and move.

Implementation must prefer explicit service data paths or NAS-backed mounts over opaque hidden state. The exact path layout is an implementation detail to finalize before cutover, but the contract is not:

  • state ownership must be visible,
  • backups must be scriptable,
  • restore must be testable,
  • Typesense remains rebuildable from NAS where ADR-0243 says NAS is authoritative,
  • rollback means freezing the Mac Studio cutover and restarting the old Panda Central stack from its last known-good state, not running two Centrals.
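Under that contract, "backups must be scriptable" can start as a per-service snapshot helper. This sketch assumes the /Users/Shared/joelclaw/services layout and a plain tar archive; a real script must also quiesce or export stateful services consistently before archiving:

```shell
#!/bin/sh
# snapshot_service: archive one service's explicit data path into a
# timestamped tarball and print the archive path. Illustrative only.
set -eu

snapshot_service() {
  svc_dir=$1      # e.g. /Users/Shared/joelclaw/services/redis
  out_dir=$2      # backup destination (local dir or NAS mount)
  stamp=$(date -u +%Y%m%dT%H%M%SZ)
  name=$(basename "$svc_dir")
  mkdir -p "$out_dir"
  # archive the service directory by name so restores are self-describing
  tar -czf "$out_dir/$name-$stamp.tar.gz" \
    -C "$(dirname "$svc_dir")" "$name"
  echo "$out_dir/$name-$stamp.tar.gz"
}
```

Because the data path is explicit, the matching restore is just the inverse tar into the same root, which keeps "restore must be testable" honest.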

Decision Drivers

  • Reboot survival: Central must recover after host reboot without a human login.
  • Boring operations: the stack should be understandable over SSH at 2am.
  • Reduced substrate complexity: remove Talos/k8s ceremony unless it earns its keep again.
  • Whole-system truth: one Central owns state and runtime at a time.
  • Relay isolation: Panda can safely host account-bound relay processes without holding Central state.
  • Least-privilege host accounts: Standard macOS users are the runtime default; Administrator is for setup and maintenance.
  • Dev/service separation: Mac Studio can be a remote dev box only if Central service state and reboot recovery remain independent of Joel’s dev login.
  • No Panda repeat: interactive dev convenience must not accrete into the critical Central runtime path.
  • Agent-first observability: health, deployment, migration, and rollback must be inspectable by CLI/OTEL, not a dashboard ritual.
  • Reversibility: cutover must have a frozen rollback path.

Considered Options

Option 1: Keep Panda Colima + Talos/k8s as Central

Pros

  • Already running.
  • Existing manifests and deploy scripts know this shape.
  • ADR-0240 through ADR-0245 improved reboot and recovery truth.

Cons

  • Panda is doing both Central and relay work.
  • k8s/Talos recovery is already too much ceremony for one-person home infrastructure.
  • Fan/CPU incidents show the cost of colocating memory ingest, workflow runtime, and relay services.
  • Continuing here means more healer theatre instead of simplification.

Verdict: keep only as rollback during cutover. Not the target.

Option 2: Mac Studio with OrbStack + Compose

Pros

  • Fast, polished, low idle overhead.
  • Strong Docker Compose compatibility.
  • Good Apple Silicon integration.

Cons

  • Closed-source commercial Mac app dependency.
  • Headless/no-login reboot story is weaker than required for critical Central services.
  • Risks replacing one desktop-app-as-infra mistake with a nicer-looking one.

Verdict: dev/canary only unless proven launchd-recoverable without GUI login.

Option 3: Mac Studio with native Homebrew services + launchd

Pros

  • Best reboot behavior.
  • No container VM.
  • Very inspectable under launchd.

Cons

  • Per-service install/config/version drift.
  • Less portable rollback/cutover story.
  • More host snowflake risk unless every config is repo-managed.

Verdict: allowed where simpler, but not the only substrate for all services.

Option 4: Mac Studio with headless Colima + Docker Compose + launchd

Pros

  • CLI-first, no GUI app required.
  • Docker Compose compatibility.
  • Repo-managed service graph possible.
  • launchd can supervise the substrate and health checks.
  • Lower ceremony than Talos/k8s while preserving container portability.
  • Builds on existing Colima operational knowledge without keeping Kubernetes.

Cons

  • Still a Linux VM substrate on macOS.
  • Colima has already needed recovery hardening on Panda.
  • Compose has less orchestration machinery than Kubernetes; health checks and dependency order must be explicit.

Verdict: recommended initial runtime shape.

Architecture sketch

                    Tailscale Network


┌─────────────────────────────────────────────────────────────┐
│ Mac Studio: mac-studio-central                              │
│                                                             │
│  separate admin account for setup only                      │
│  Standard dev user: joel                                    │
│  Standard service user: joelclaw                            │
│  service data: /Users/Shared/joelclaw/services              │
│                                                             │
│  launchd = supervisor of record                             │
│    ├─ central substrate start/verify                        │
│    ├─ compose up / health gate                              │
│    ├─ system-bus-worker / restate-worker                    │
│    ├─ gateway if Central-hosted                             │
│    └─ backup + migration + health jobs                      │
│                                                             │
│  Compose / headless Colima                                  │
│    ├─ redis                                                 │
│    ├─ typesense                                             │
│    ├─ inngest                                               │
│    ├─ restate                                               │
│    └─ object-store surface                                  │
│                                                             │
│  Authoritative Central state + runtime                      │
└─────────────────────────────────────────────────────────────┘

                           │ relay events / API calls
┌─────────────────────────────────────────────────────────────┐
│ Panda: Relay Machine                                        │
│                                                             │
│  Relay Sandboxes                                            │
│    ├─ Joel channel accounts                                 │
│    ├─ Kristina channel accounts                             │
│    └─ kid channel accounts                                  │
│                                                             │
│  local-hardware-bound relay processes only                  │
│  no Central state, no Central fallback                      │
└─────────────────────────────────────────────────────────────┘

Implementation Plan

Required skills

Load these before implementation starts:

  • system-architecture — source of truth for current Panda topology and event flow.
  • k8s — current Central services still live in the Talos/k8s stack and must be inventoried/frozen safely.
  • sync-system-bus — deploy/run mechanics for system-bus-worker must change without losing worker parity.
  • system-bus — Inngest functions and worker runtime behavior depend on this package.
  • inngest-durable-functions, inngest-events, inngest-steps, inngest-flow-control, inngest-middleware, inngest-setup — required before changing Inngest-hosted durable functions or deployment wiring.
  • restate-workflows or workflow-rig — Restate/workload runtime cutover must preserve DAG and queue contracts.
  • gateway and telegram — if gateway ownership moves with Central, operator relay behavior must stay intact.
  • o11y-logging — every migration, health gate, and rollback edge needs OTEL and truthful status.
  • clawmail — shared-file coordination and edit reservation.
  • adr-skill — ADR lifecycle/status updates during implementation.

Affected paths

Expected repo/vault surfaces:

  • CONTEXT.md
  • docs/architecture.md
  • docs/deploy.md
  • docs/inngest-functions.md
  • docs/gateway.md
  • docs/observability.md
  • infra/launchd/
  • infra/central/ (new, for Compose/service definitions and bootstrap scripts)
  • infra/central/setup-macos-users.sh or equivalent documented host-prep script
  • k8s/ (freeze/export current Central manifests; do not make it the new target)
  • packages/system-bus/
  • packages/restate/
  • packages/gateway/
  • skills/system-architecture/SKILL.md
  • skills/k8s/SKILL.md
  • skills/sync-system-bus/SKILL.md
  • skills/gateway/SKILL.md
  • skills/o11y-logging/SKILL.md
  • ~/Vault/docs/decisions/0246-mac-studio-central-runtime-migration.md

Phase 1 — Inventory and classify

  1. Produce a current Panda Central inventory:
    • Redis
    • Typesense
    • Inngest
    • Restate
    • Dkron
    • MinIO/object store
    • PDS
    • docs-api
    • livekit
    • system-bus-worker
    • restate-worker/workload worker
    • gateway
    • Caddy/Tailscale ingress
    • relay-only iMessage surfaces
  2. Classify each as:
    • Central critical — moves to mac-studio-central.
    • Central optional — moves later or remains canary/dev.
    • Relay-bound — stays on Panda.
    • Retired — not carried forward.
  3. Record the inventory in docs/architecture.md or a linked migration runbook.

Phase 2 — Define the Mac Studio runtime contract

  1. Create repo-managed infra/central/ definitions for:
    • Compose service graph,
    • explicit data paths / volume mounts,
    • environment templates without secrets,
    • health checks,
    • backup/snapshot hooks,
    • rollback/start/stop scripts.
  2. Define host-prep steps for the Standard-user account model:
    • separate Administrator account for setup/maintenance,
    • Standard dev account for Joel’s interactive remote development,
    • Standard service account, default joelclaw,
    • service root /Users/Shared/joelclaw/services,
    • restrictive ownership and modes for service data and homes,
    • clear boundary between dev credentials/tooling and Central service credentials.
  3. Create infra/launchd/ LaunchDaemons for the Central substrate and host-native workers.
  4. Ensure LaunchDaemons use UserName / GroupName instead of running long-lived processes as root where possible.
  5. Ensure all critical labels run after reboot without a logged-in user.
  6. Ensure no runtime depends on a GUI app, auto-login, or manual launch.
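The health checks and health gate named in step 1 imply a wait-until-healthy step between substrate start and worker start. A minimal retry helper sketch; the probe in the trailing comment is an assumed example, not a decided endpoint:

```shell
#!/bin/sh
# wait_until: retry a check command until it succeeds or attempts run
# out. Returns 0 on success, 1 on exhaustion. Sketch only; the real
# gate would log each attempt to OTEL.
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# assumed usage, e.g. gating compose bring-up on a health route:
#   wait_until 30 2 curl -fsS http://127.0.0.1:8108/health
```

launchd can then run the gate as its own label, so a failed gate is visible as a failed job rather than a silently half-started stack.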

Phase 3 — Stand up Mac Studio Central in shadow mode

  1. Start services on mac-studio-central without taking Panda out of service.
  2. Restore/copy data snapshots into Mac Studio service paths.
  3. Run health checks against Mac Studio endpoints over Tailscale.
  4. Verify workers can start and connect to the Mac Studio services.
  5. Keep Panda as the active Central until shadow health passes.

Shadow mode is read/verify only for authoritative state unless a specific migration step requires a controlled write. Do not run two active Centrals.

Phase 4 — Whole-Central cutover

  1. Freeze Panda Central writes.
  2. Take final snapshots/export for stateful services.
  3. Restore/import into Mac Studio.
  4. Start Mac Studio Central runtime under launchd.
  5. Switch Tailscale/Caddy/service discovery endpoints to Mac Studio.
  6. Verify end-to-end:
    • gateway route,
    • Inngest function execution,
    • Restate DAG run,
    • Redis queue flow,
    • Typesense query,
    • Run capture/search path from ADR-0243,
    • OTEL ingestion/query,
    • relay event from Panda to Mac Studio.
  7. Freeze Panda’s old Central stack as rollback.
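Step 6's end-to-end verification is easiest to keep honest as one pass/fail script. A tiny harness sketch; the commented probe commands are assumptions, not the real cutover endpoints:

```shell
#!/bin/sh
# check: run one named probe, print PASS/FAIL, count failures.
# Sketch only — real probes would be e.g. a redis-cli ping, a
# Typesense query, a gateway round trip, a Restate DAG run.
fails=0

check() {
  name=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS $name"
  else
    echo "FAIL $name"
    fails=$((fails + 1))
  fi
}

# assumed wiring (placeholder probes, not decided endpoints):
#   check redis-ping  redis-cli -h mac-studio-central ping
#   check typesense   curl -fsS http://mac-studio-central:8108/health
#   exit "$fails"
```

Exiting with the failure count lets the same script serve as both the cutover gate and the post-cutover verification record.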

Phase 5 — Panda relay-only cleanup

  1. Remove or disable Central launchd/k8s surfaces from Panda once rollback window closes.
  2. Keep Relay Sandboxes and iMessage relay services.
  3. Remove broad Central credentials from Panda relay accounts.
  4. Update docs/skills so future agents stop treating Panda as Central.

Verification

  • mac-studio-central has a stable Machine record with machine_id=mac-studio-central.
  • Mac Studio has a separate Administrator account, Standard Joel dev account, and Standard Central service account model.
  • The Central service account home and service data paths have restrictive ownership/permissions.
  • Central service state, credentials, and runtime configs are not stored only under Joel’s dev home.
  • No critical Central service is installed as a LaunchAgent in Joel’s dev account.
  • Central services do not depend on Joel’s dev account being logged in or running user agents.
  • Critical Mac Studio services recover after host reboot with no logged-in user.
  • launchctl print system/<label> is the canonical inspection path for each critical Central service.
  • No critical Central service requires OrbStack, Docker Desktop, a GUI session, auto-login, or manual app launch.
  • Compose/native service data paths are explicit and covered by backup/restore scripts.
  • Panda’s old Central stack is frozen before cutover and documented as rollback only.
  • After cutover, Redis, Typesense, Inngest, Restate, and worker surfaces resolve to Mac Studio.
  • Panda relay can deliver at least one channel event to Mac Studio Central.
  • Run capture/search from ADR-0243 works against Mac Studio Central.
  • OTEL records show migration start, shadow verification, cutover, rollback availability, and cutover completion.
  • CONTEXT.md, docs/architecture.md, docs/deploy.md, and relevant skills describe the new topology.

Non-goals

  • Building multi-Central active/active replication.
  • Keeping Panda as a hot Central fallback after cutover.
  • Migrating every possible service in one unbounded session.
  • Reintroducing Kubernetes through OrbStack/Rancher/Desktop tooling.
  • Building full enterprise macOS workstation hardening, MDM, or fleet policy.
  • Preventing Joel from using the Mac Studio as a remote dev box.
  • Allowing dev-account convenience to become the production service path.
  • Solving Phase 2 Credential Proxy hardening; that remains a post-stability improvement.
  • Renaming the Mac Studio with a personality/themed Machine ID. The stable ID is mac-studio-central.

Open questions before implementation

  1. Exact per-service subpaths and permissions under /Users/Shared/joelclaw/services.
  2. Which services are Central critical for cutover day versus deferred wave 2.
  3. Whether gateway moves during the same cutover or after the state/runtime core is stable.
  4. Whether PDS moves in wave 1 or is bridged temporarily during the identity transition.
  5. Length and exit criteria for the rollback window before Panda Central surfaces are removed.

Consequences

Positive

  • Central gets the RAM and thermal headroom of the Mac Studio.
  • Panda stops doing two jobs at once.
  • The runtime contract becomes easier: launchd, Compose/native services, explicit health checks.
  • Reboot survival becomes a hard eligibility rule instead of a hope.
  • macOS account privileges stop being accidental: admin for setup, Standard users for runtime.
  • Joel can still use the Mac Studio as a productive remote dev box without making his login part of Central uptime.
  • The system moves away from k8s ceremony for local personal infrastructure.

Negative

  • Migration work must touch deployment scripts, docs, skills, service discovery, and state backups.
  • Colima remains a substrate dependency unless native launchd services replace more of the stack.
  • Compose has fewer built-in safety rails than Kubernetes; health/ordering/rollback need explicit scripts.
  • During cutover there is real risk of stale endpoint references pointing at Panda.

References

Follow-up

  1. Convert this ADR from proposed to accepted once Joel approves the draft.
  2. Write a concrete migration runbook with inventory, commands, health checks, rollback commands, and host account setup.
  3. Define exact per-service subpaths and ownership under /Users/Shared/joelclaw/services before writing Compose files.
  4. Update or supersede ADR-0029 only for future Central runtime direction; keep it as historical truth for the shipped Panda runtime.
  5. When cutover ships, update this ADR status and add verification results.