ADR-0230

Firecracker MicroVM Agent Sandboxes

Status

proposed

Context

joelclaw needs on-demand isolated execution for agent workloads. The current proven path is local host-worker sandboxes (ADR-0221, 11 phases shipped) using git worktrees + compose identity isolation. ADR-0205 proposed k8s Jobs as the next substrate. This ADR proposes Firecracker microVMs instead.

Why Firecracker over k8s Jobs

| Property | k8s Jobs | Firecracker microVMs |
| --- | --- | --- |
| Boot time | 5-15s (image pull + start) | <125ms |
| Isolation | Container (shared kernel) | Hardware VM (separate kernel) |
| Memory overhead | ~50-100MB per container | <5MB per microVM |
| Snapshot/restore | No | Yes (<5ms warm start) |
| Security model | Namespace-based | KVM hardware virtualization |
| AWS equivalent | ECS/Fargate (container) | Lambda/Fargate (microVM) |

Firecracker is what Lambda and Fargate run on. ADR-0205 says “local mirrors AWS primitives” — Firecracker is the most direct local equivalent to the target cloud substrate.

The snapshot capability is the decisive advantage: boot a microVM, install the full agent toolchain (bun, node, git, pi, codex, ~2GB), snapshot it. New agent sandboxes restore from snapshot in milliseconds. This is the ADR-0206 hot-image vision realized at the VM level rather than the container level.

Platform path on Panda

Firecracker requires Linux KVM. macOS doesn’t have KVM. The path:

macOS (M4 Pro 64GB) → Colima VM (VZ framework, aarch64, nested virt) → /dev/kvm → Firecracker aarch64

Lima/Colima supports nestedVirtualization=true on Apple Silicon. Firecracker has native aarch64 binaries (v1.13.0+). This is confirmed working on M3/M4 Pro hardware.

The Colima VM already runs our Talos k8s cluster. Firecracker would run alongside it, managed directly by the Restate worker (which also needs to move inside the Colima/k8s environment per the execution sequence).

Decision

Adopt Firecracker microVMs as the primary agent sandbox substrate for joelclaw.

Architecture

joelclaw workload run plan.json --execution-mode microvm
  → Redis queue (priority admission)
    → Queue drainer
      → Restate dagOrchestrator
        → dagWorker (executionMode: "microvm")
          → Firecracker API socket
            → microVM boots from snapshot (<5ms)
              → Agent runs (codex/pi/claude)
                → Artifacts collected
                  → microVM destroyed
                    → Results in inbox + registry

Components

  1. Firecracker binary — installed inside the Colima VM, aarch64 build
  2. Guest kernel — minimal Linux kernel built for aarch64, KVM guest
  3. Agent rootfs — ext4 image: Alpine base + bun + node + git + pi + codex CLI + common toolchains
  4. Warm snapshot — booted rootfs with tools verified, snapshotted for instant restore
  5. MicroVM runner — packages/agent-execution/src/microvm.ts: create/configure/start/stop/snapshot via Firecracker REST API socket
  6. Network bridge — TAP device per microVM, iptables routing for outbound (API access, git clone)
  7. Filesystem mount — virtio-block for workspace (repo checkout mounted into microVM), read-only rootfs from snapshot
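The TAP-per-microVM networking in component 6 can be sketched as follows. This is an illustrative, privileged-shell outline run inside the Colima VM; the device name (tap-vm0), subnet (172.30.0.0/30), and uplink interface (eth0) are assumptions, not values fixed by this ADR.

```shell
# Sketch: per-microVM TAP device + NAT for outbound traffic.
# Device/subnet/interface names are illustrative assumptions.
ip tuntap add dev tap-vm0 mode tap
ip addr add 172.30.0.1/30 dev tap-vm0
ip link set tap-vm0 up

# NAT the microVM's traffic out through the Colima VM's uplink
# so agents can reach APIs and clone repos.
sysctl -w net.ipv4.ip_forward=1
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
iptables -A FORWARD -i tap-vm0 -o eth0 -j ACCEPT
iptables -A FORWARD -i eth0 -o tap-vm0 -m state --state RELATED,ESTABLISHED -j ACCEPT
```

The runner would create and tear down one such TAP device per microVM alongside VM lifecycle calls.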

Execution modes after this ADR

| Mode | Substrate | Use case |
| --- | --- | --- |
| host | Local process | Legacy, interactive debugging |
| sandbox | Worktree + compose (ADR-0221) | Local dev, compose-backed stacks |
| microvm | Firecracker microVM | Default for agent workloads |
| k8s | k8s Job | Future cloud path (EKS equivalent) |

Dogfood progression

The project dogfoods itself — later stages run through infrastructure built by earlier stages:

  • Phase A (Steps 1-3): Manual/codex. Fix admin port, triage queue, move Restate worker to k8s.
  • Phase B (Steps 4-6): Via joelclaw workload run with local sandbox (ADR-0221). Install Firecracker, build rootfs, create snapshot.
  • Phase C (Steps 7-9): Build the microVM runner, wire dagWorker, prove with canary. The canary proves itself — the first microVM workload validates the microVM infrastructure.
  • Phase D (Step 10): ADR-0228/0229 docs-api work runs as the first real consumer. Full end-to-end: workload plan → Restate orchestration → Firecracker sandbox → artifacts → completion.

Execution Sequence

Step 1: Fix Restate admin reachability

Add NodePort 30970 for Restate admin API (9070). Makes joelclaw jobs status accurate. Acceptance: joelclaw jobs status reports restate as healthy without manual port-forward.
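A minimal sketch of the Service this step adds. The Service name, namespace, and selector labels are assumptions; they must match the actual Restate Deployment in the cluster.

```yaml
# Sketch: NodePort Service exposing the Restate admin API (9070) on 30970.
# metadata/selector values are illustrative assumptions.
apiVersion: v1
kind: Service
metadata:
  name: restate-admin
  namespace: joelclaw
spec:
  type: NodePort
  selector:
    app: restate
  ports:
    - name: admin
      port: 9070
      targetPort: 9070
      nodePort: 30970
```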

Step 2: Triage stale queue

Inspect 58 queued items. Identify unhandled event families. Ack/drain dead items. Acceptance: Queue depth < 10, no items older than 1 hour.

Step 3: Move Restate worker into k8s

Build ARM64 image for @joelclaw/restate. Push to GHCR. Create Deployment manifest + publish script. Migrate from launchd host process. Acceptance: kubectl get pods -n joelclaw | grep restate-worker shows 1/1 Running. Services still registered in Restate server. Queue drainer consuming.

Step 4: Enable nested virtualization in Colima

Reconfigure Colima with nestedVirtualization=true. Verify /dev/kvm exists inside the VM. Install Firecracker aarch64 binary. Acceptance: firecracker --version works inside Colima VM. /dev/kvm accessible.
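A sketch of the relevant profile fragment, editable via colima start --edit. Key names follow Lima's template schema (vmType, nestedVirtualization); verify them against the installed colima/lima versions, and the cpu/memory values shown are placeholder assumptions.

```yaml
# Sketch: Colima profile fragment for nested virtualization.
# Verify key names against the installed colima/lima versions.
vmType: vz
arch: aarch64
nestedVirtualization: true
cpu: 8
memory: 32
```

After restarting the profile, `colima ssh -- ls -l /dev/kvm` should show the KVM device inside the VM.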

Step 5: Build agent rootfs + guest kernel

Create Alpine-based ext4 rootfs with: bun, node (v24), git, pi CLI, codex CLI, common build toolchains. Build or obtain minimal aarch64 Linux guest kernel. Acceptance: microVM boots from rootfs, bun --version && node --version && git --version && pi --version all succeed inside the VM.
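The rootfs build can be sketched as the usual Alpine-into-ext4 bootstrap, run as root inside the Colima VM. Image size, mount path, and package list are illustrative assumptions; bun, pi, and codex have their own installers and would be added in a chroot step that depends on each CLI's distribution method.

```shell
# Sketch: build an Alpine-based ext4 rootfs for the agent image.
# Sizes, paths, and package names are illustrative assumptions.
dd if=/dev/zero of=agent-rootfs.ext4 bs=1M count=4096
mkfs.ext4 agent-rootfs.ext4
mkdir -p /mnt/rootfs
mount agent-rootfs.ext4 /mnt/rootfs

# Bootstrap Alpine into the image with the base toolchain.
apk --arch aarch64 -X https://dl-cdn.alpinelinux.org/alpine/latest-stable/main \
    -U --allow-untrusted --root /mnt/rootfs --initdb add alpine-base nodejs git

# bun / pi / codex install via their own installers inside a chroot.
umount /mnt/rootfs
```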

Step 6: Create warm snapshot

Boot microVM from rootfs. Run tool verification suite. Snapshot the running VM. Verify restore from snapshot boots in <200ms. Acceptance: time firecracker --restore-from-snapshot < 200ms. Tools verified post-restore.
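The snapshot and restore calls here map onto Firecracker's PUT /snapshot/create (on a paused VM) and PUT /snapshot/load (on a fresh Firecracker process). A sketch of the request bodies the runner would send; field names follow the Firecracker API, while the file paths are illustrative assumptions.

```typescript
// Sketch: payloads for Firecracker's snapshot endpoints.
// Field names follow the Firecracker API; paths are assumptions.
interface SnapshotCreate {
  snapshot_type: "Full";
  snapshot_path: string; // VM state file
  mem_file_path: string; // guest memory file
}

interface SnapshotLoad {
  snapshot_path: string;
  mem_backend: { backend_type: "File"; backend_path: string };
  resume_vm: boolean;
}

// Body for PUT /snapshot/create — produces the warm image.
function createPayload(dir: string): SnapshotCreate {
  return {
    snapshot_type: "Full",
    snapshot_path: `${dir}/vmstate`,
    mem_file_path: `${dir}/mem`,
  };
}

// Body for PUT /snapshot/load — restores and resumes the VM.
function loadPayload(dir: string): SnapshotLoad {
  return {
    snapshot_path: `${dir}/vmstate`,
    mem_backend: { backend_type: "File", backend_path: `${dir}/mem` },
    resume_vm: true,
  };
}
```

Because restore resumes a VM whose tools were already verified at snapshot time, the post-restore check only needs to confirm the toolchain is still reachable.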

Step 7: Build microVM runner in agent-execution

packages/agent-execution/src/microvm.ts — create/configure/start/stop microVMs. REST API socket management. Workspace volume mount. Artifact collection. Network setup. Cleanup. Acceptance: Unit tests pass. Integration test boots a microVM, runs a command, collects output, destroys VM.
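A minimal sketch of the runner's shape, assuming Node's http module speaking to the Firecracker API over its unix socket. The option names and helper functions are hypothetical illustrations, not the final microvm.ts API; the machine-config field names (vcpu_count, mem_size_mib) follow Firecracker's API.

```typescript
import http from "node:http";

// Sketch of the proposed microVM runner surface. Helper names and the
// options shape are assumptions for illustration, not the final API.
interface MicroVmOptions {
  socketPath: string; // Firecracker API unix socket
  vcpus: number;
  memMib: number;
}

// PUT a JSON body to the Firecracker API over its unix socket.
function firecrackerPut(socketPath: string, path: string, body: unknown): Promise<number> {
  return new Promise((resolve, reject) => {
    const req = http.request(
      { socketPath, path, method: "PUT", headers: { "Content-Type": "application/json" } },
      (res) => resolve(res.statusCode ?? 0),
    );
    req.on("error", reject);
    req.end(JSON.stringify(body));
  });
}

// Payload for PUT /machine-config before boot.
function machineConfig(opts: MicroVmOptions) {
  return { vcpu_count: opts.vcpus, mem_size_mib: opts.memMib };
}
```

The real runner would layer drive, network-interface, and snapshot calls on the same PUT helper, then poll for exit and collect artifacts from the workspace mount.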

Step 8: Wire dagWorker executionMode: “microvm”

Restate dagWorker dispatches to microVM runner when executionMode: "microvm". Same contract as host/sandbox modes — sandbox identity, env injection, artifact collection, terminal inbox writeback. Acceptance: joelclaw workload run plan.json --execution-mode microvm --dry-run produces valid request.

Step 9: Deterministic canary

Run trivial workload (echo test + file write + artifact collect) through full path: CLI → queue → Restate → Firecracker → artifacts → inbox. Acceptance: Canary completes, artifacts present, inbox terminal, microVM cleaned up, joelclaw jobs status reflects completion.

Step 10: ADR-0228/0229 as first real consumer

Shape docs-api upgrade phases as workload plans. Dispatch through microVM sandboxes. Real code changes, real isolation, real Restate durability. Acceptance: At least one ADR-0228 phase completes end-to-end through the microVM rig, producing a merge-ready patch.

Consequences

Positive

  • Hardware-level isolation for every agent workload
  • <200ms sandbox startup from snapshot (vs 5-15s containers)
  • <5MB memory overhead per sandbox (run many in parallel)
  • Direct path to AWS Lambda/Fargate substrate when needed
  • Snapshot model eliminates repeated toolchain installation
  • Dogfood progression proves each layer before depending on it

Negative

  • Nested virtualization adds a layer (macOS → Colima VM → KVM → microVM)
  • Firecracker networking requires manual TAP/iptables setup per microVM
  • Guest kernel and rootfs are new build artifacts to maintain
  • More complex than k8s Jobs for the initial setup

Risks

  • Nested virtualization performance on M4 Pro is unproven at scale (works on M3 per community reports)
  • Colima VM memory pressure with many concurrent microVMs
  • Firecracker aarch64 may have edge cases vs x86_64 (less community testing)
  • TAP networking inside a nested VM may have latency/reliability issues

References