Firecracker MicroVM Agent Sandboxes
Status
Proposed
Context
joelclaw needs on-demand isolated execution for agent workloads. The current proven path is local host-worker sandboxes (ADR-0221, 11 phases shipped) using git worktrees + compose identity isolation. ADR-0205 proposed k8s Jobs as the next substrate. This ADR proposes Firecracker microVMs instead.
Why Firecracker over k8s Jobs
| Property | k8s Jobs | Firecracker microVMs |
|---|---|---|
| Boot time | 5-15s (image pull + start) | <125ms |
| Isolation | Container (shared kernel) | Hardware VM (separate kernel) |
| Memory overhead | ~50-100MB per container | <5MB per microVM |
| Snapshot/restore | No | Yes (<5ms warm start) |
| Security model | Namespace-based | KVM hardware virtualization |
| AWS equivalent | ECS/Fargate (container) | Lambda/Fargate (microVM) |
Firecracker is what Lambda and Fargate run on. ADR-0205 says “local mirrors AWS primitives” — Firecracker is the most direct local equivalent to the target cloud substrate.
The snapshot capability is the decisive advantage: boot a microVM, install the full agent toolchain (bun, node, git, pi, codex, ~2GB), snapshot it. New agent sandboxes restore from snapshot in milliseconds. This is the ADR-0206 hot-image vision realized at the VM level rather than the container level.
Platform path on Panda
Firecracker requires Linux KVM. macOS doesn’t have KVM. The path:
macOS (M4 Pro 64GB) → Colima VM (VZ framework, aarch64, nested virt) → /dev/kvm → Firecracker aarch64

Lima/Colima supports nestedVirtualization=true on Apple Silicon. Firecracker ships native aarch64 binaries (v1.13.0+). Community reports confirm this path works on M3 hardware; M4 Pro is expected to behave the same (see Risks).
The Colima VM already runs our Talos k8s cluster. Firecracker would run alongside it, managed directly by the Restate worker (which also needs to move inside the Colima/k8s environment per the execution sequence).
Decision
Adopt Firecracker microVMs as the primary agent sandbox substrate for joelclaw.
Architecture
joelclaw workload run plan.json --execution-mode microvm
→ Redis queue (priority admission)
→ Queue drainer
→ Restate dagOrchestrator
→ dagWorker (executionMode: "microvm")
→ Firecracker API socket
→ microVM boots from snapshot (<5ms)
→ Agent runs (codex/pi/claude)
→ Artifacts collected
→ microVM destroyed
→ Results in inbox + registryComponents
- Firecracker binary — installed inside the Colima VM, aarch64 build
- Guest kernel — minimal Linux kernel built for aarch64, KVM guest
- Agent rootfs — ext4 image: Alpine base + bun + node + git + pi + codex CLI + common toolchains
- Warm snapshot — booted rootfs with tools verified, snapshotted for instant restore
- MicroVM runner — packages/agent-execution/src/microvm.ts: create/configure/start/stop/snapshot via the Firecracker REST API socket
- Network bridge — TAP device per microVM, iptables routing for outbound (API access, git clone)
- Filesystem mount — virtio-block for workspace (repo checkout mounted into microVM), read-only rootfs from snapshot
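To make the component list concrete, here is a minimal sketch of the payloads a runner like the proposed microvm.ts would PUT to the Firecracker API socket. Endpoint field names follow the public Firecracker API (boot-source, drives, network-interfaces, machine-config); all paths, device names, and sizes below are illustrative placeholders, not values from this repo.

```typescript
// Sketch: Firecracker API payloads for one agent microVM.
// Field names match the Firecracker REST API; concrete values are assumptions.

interface MachineConfig {
  vcpu_count: number;
  mem_size_mib: number;
}

function bootSource(kernelPath: string) {
  return {
    kernel_image_path: kernelPath,
    boot_args: "console=ttyS0 reboot=k panic=1",
  };
}

function rootDrive(rootfsPath: string) {
  return {
    drive_id: "rootfs",
    path_on_host: rootfsPath,
    is_root_device: true,
    is_read_only: true, // read-only rootfs, per the snapshot model above
  };
}

function workspaceDrive(workspacePath: string) {
  return {
    drive_id: "workspace",
    path_on_host: workspacePath,
    is_root_device: false,
    is_read_only: false, // repo checkout mounted read-write
  };
}

function networkInterface(vmId: number) {
  return {
    iface_id: "eth0",
    host_dev_name: `fc-tap${vmId}`, // one TAP device per microVM (name is hypothetical)
  };
}

const machine: MachineConfig = { vcpu_count: 2, mem_size_mib: 2048 };
```

Each payload goes to its own endpoint (`PUT /boot-source`, `PUT /drives/rootfs`, etc.) before `PUT /actions` with `InstanceStart` — or is skipped entirely when restoring from a snapshot.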
Execution modes after this ADR
| Mode | Substrate | Use case |
|---|---|---|
| host | Local process | Legacy, interactive debugging |
| sandbox | Worktree + compose (ADR-0221) | Local dev, compose-backed stacks |
| microvm | Firecracker microVM | Default for agent workloads |
| k8s | k8s Job | Future cloud path (EKS equivalent) |
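The mode table implies a small validation step wherever `--execution-mode` is parsed. A minimal sketch, assuming a `parseExecutionMode` helper that does not exist yet (the name and default are illustrative; the ADR makes microvm the default for agent workloads):

```typescript
// Sketch: the four execution modes after this ADR, with a hypothetical
// CLI-flag parser. Actual wiring lives in the joelclaw CLI.

type ExecutionMode = "host" | "sandbox" | "microvm" | "k8s";

const EXECUTION_MODES: readonly ExecutionMode[] = ["host", "sandbox", "microvm", "k8s"];

function parseExecutionMode(raw: string | undefined): ExecutionMode {
  // Default to microvm, per the table above.
  if (raw === undefined) return "microvm";
  if ((EXECUTION_MODES as readonly string[]).includes(raw)) {
    return raw as ExecutionMode;
  }
  throw new Error(`unknown --execution-mode: ${raw} (expected ${EXECUTION_MODES.join("|")})`);
}
```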
Dogfood progression
The project dogfoods itself — later stages run through infrastructure built by earlier stages:
- Phase A (Steps 1-3): Manual/codex. Fix admin port, triage queue, move Restate worker to k8s.
- Phase B (Steps 4-6): Via joelclaw workload run with local sandbox (ADR-0221). Install Firecracker, build rootfs, create snapshot.
- Phase C (Steps 7-9): Build the microVM runner, wire dagWorker, prove with canary. The canary proves itself — the first microVM workload validates the microVM infrastructure.
- Phase D (Step 10): ADR-0228/0229 docs-api work runs as the first real consumer. Full end-to-end: workload plan → Restate orchestration → Firecracker sandbox → artifacts → completion.
Execution Sequence
Step 1: Fix Restate admin reachability
Add NodePort 30970 for Restate admin API (9070). Makes joelclaw jobs status accurate.
Acceptance: joelclaw jobs status reports restate as healthy without manual port-forward.
Step 2: Triage stale queue
Inspect the 58 queued items. Identify unhandled event families. Ack/drain dead items.
Acceptance: Queue depth < 10, no items older than 1 hour.
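The triage logic reduces to two pure checks over queue items. A sketch, assuming a `QueueItem` shape (the real items live in Redis and their schema may differ):

```typescript
// Sketch: Step 2 triage helpers. QueueItem fields are assumptions.

interface QueueItem {
  id: string;
  eventFamily: string;
  enqueuedAt: number; // epoch ms
}

const ONE_HOUR_MS = 60 * 60 * 1000;

// Items older than the acceptance threshold (1 hour) are candidates to ack/drain.
function staleItems(items: QueueItem[], now: number): QueueItem[] {
  return items.filter((i) => now - i.enqueuedAt > ONE_HOUR_MS);
}

// Event families with no registered handler explain why items went stale.
function unhandledFamilies(items: QueueItem[], handled: Set<string>): string[] {
  return [...new Set(items.map((i) => i.eventFamily))].filter((f) => !handled.has(f));
}
```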
Step 3: Move Restate worker into k8s
Build ARM64 image for @joelclaw/restate. Push to GHCR. Create Deployment manifest + publish script. Migrate from launchd host process.
Acceptance: kubectl get pods -n joelclaw | grep restate-worker shows 1/1 Running. Services still registered in Restate server. Queue drainer consuming.
Step 4: Enable nested virtualization in Colima
Reconfigure Colima with nestedVirtualization=true. Verify /dev/kvm exists inside the VM. Install Firecracker aarch64 binary.
Acceptance: firecracker --version works inside Colima VM. /dev/kvm accessible.
Step 5: Build agent rootfs + guest kernel
Create Alpine-based ext4 rootfs with: bun, node (v24), git, pi CLI, codex CLI, common build toolchains. Build or obtain minimal aarch64 Linux guest kernel.
Acceptance: microVM boots from rootfs, bun --version && node --version && git --version && pi --version all succeed inside the VM.
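The acceptance check chains tool probes with `&&` so any missing binary fails the whole run. A sketch of building that command (the transport into the VM — serial console vs ssh — is still open; `verificationCommand` is a hypothetical helper):

```typescript
// Sketch: build the in-VM acceptance check for Step 5.
// Tool list mirrors the acceptance criteria above.

const REQUIRED_TOOLS = ["bun", "node", "git", "pi"] as const;

function verificationCommand(tools: readonly string[] = REQUIRED_TOOLS): string {
  // Chain with && so a single missing tool fails the whole check.
  return tools.map((t) => `${t} --version`).join(" && ");
}
```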
Step 6: Create warm snapshot
Boot microVM from rootfs. Run tool verification suite. Snapshot the running VM. Verify restore from snapshot boots in <200ms.
Acceptance: time firecracker --restore-from-snapshot < 200ms. Tools verified post-restore.
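Snapshot create and restore each map to one Firecracker API call. A sketch of the payloads (field names follow the public Firecracker snapshot API — `PUT /snapshot/create` and `PUT /snapshot/load`; the directory layout is a placeholder):

```typescript
// Sketch: Firecracker snapshot payloads for Step 6.
// vmstate/memory filenames are illustrative conventions, not requirements.

function snapshotCreate(dir: string) {
  return {
    snapshot_type: "Full",
    snapshot_path: `${dir}/vmstate`,
    mem_file_path: `${dir}/memory`,
  };
}

function snapshotLoad(dir: string) {
  return {
    snapshot_path: `${dir}/vmstate`,
    mem_backend: { backend_type: "File", backend_path: `${dir}/memory` },
    resume_vm: true, // resume immediately so the agent can start work
  };
}
```

Restoring via `snapshotLoad` replaces the boot-source/drive setup sequence entirely, which is where the sub-200ms start comes from.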
Step 7: Build microVM runner in agent-execution
packages/agent-execution/src/microvm.ts — create/configure/start/stop microVMs. REST API socket management. Workspace volume mount. Artifact collection. Network setup. Cleanup.
Acceptance: Unit tests pass. Integration test boots a microVM, runs a command, collects output, destroys VM.
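The runner's core primitive is an HTTP request over the Firecracker unix socket. A minimal sketch using node:http's `socketPath` option (error handling trimmed; this is one plausible shape for the helper inside microvm.ts, not the actual implementation):

```typescript
// Sketch: request helper for the Firecracker API unix socket.
import http from "node:http";

function apiRequest(
  socketPath: string,
  method: "GET" | "PUT",
  path: string,
  body?: unknown,
): Promise<{ status: number; body: string }> {
  return new Promise((resolve, reject) => {
    const payload = body === undefined ? undefined : JSON.stringify(body);
    const req = http.request(
      {
        socketPath, // unix socket instead of host:port
        path,
        method,
        headers: payload
          ? { "Content-Type": "application/json", "Content-Length": Buffer.byteLength(payload) }
          : {},
      },
      (res) => {
        let data = "";
        res.on("data", (chunk) => (data += chunk));
        res.on("end", () => resolve({ status: res.statusCode ?? 0, body: data }));
      },
    );
    req.on("error", reject);
    if (payload) req.write(payload);
    req.end();
  });
}
```

Every lifecycle operation — configure, start, snapshot, restore — composes from this one call.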
Step 8: Wire dagWorker executionMode: “microvm”
Restate dagWorker dispatches to microVM runner when executionMode: "microvm". Same contract as host/sandbox modes — sandbox identity, env injection, artifact collection, terminal inbox writeback.
Acceptance: joelclaw workload run plan.json --execution-mode microvm --dry-run produces valid request.
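Keeping the same contract across modes means dispatch is a lookup, not a rewrite. A sketch, where `Runner`, `RunRequest`, and `selectRunner` are hypothetical names for the contract the step describes:

```typescript
// Sketch: dagWorker dispatch by executionMode. All interfaces here are
// assumptions about the shared contract (identity, env, artifacts).

type ExecutionMode = "host" | "sandbox" | "microvm" | "k8s";

interface RunRequest {
  executionMode: ExecutionMode;
  planPath: string;
  env: Record<string, string>;
}

type Runner = (req: RunRequest) => Promise<{ artifacts: string[] }>;

// One runner per mode; microvm slots in beside host/sandbox unchanged.
function selectRunner(runners: Record<ExecutionMode, Runner>, req: RunRequest): Runner {
  const runner = runners[req.executionMode];
  if (!runner) throw new Error(`no runner for mode ${req.executionMode}`);
  return runner;
}
```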
Step 9: Deterministic canary
Run trivial workload (echo test + file write + artifact collect) through full path: CLI → queue → Restate → Firecracker → artifacts → inbox.
Acceptance: Canary completes, artifacts present, inbox terminal, microVM cleaned up, joelclaw jobs status reflects completion.
Step 10: ADR-0228/0229 as first real consumer
Shape docs-api upgrade phases as workload plans. Dispatch through microVM sandboxes. Real code changes, real isolation, real Restate durability.
Acceptance: At least one ADR-0228 phase completes end-to-end through the microVM rig, producing a merge-ready patch.
Consequences
Positive
- Hardware-level isolation for every agent workload
- <200ms sandbox startup from snapshot (vs 5-15s containers)
- <5MB memory overhead per sandbox (run many in parallel)
- Direct path to AWS Lambda/Fargate substrate when needed
- Snapshot model eliminates repeated toolchain installation
- Dogfood progression proves each layer before depending on it
Negative
- Nested virtualization adds a layer (macOS → Colima VM → KVM → microVM)
- Firecracker networking requires manual TAP/iptables setup per microVM
- Guest kernel and rootfs are new build artifacts to maintain
- More complex than k8s Jobs for the initial setup
Risks
- Nested virtualization performance on M4 Pro is unproven at scale (works on M3 per community reports)
- Colima VM memory pressure with many concurrent microVMs
- Firecracker aarch64 may have edge cases vs x86_64 (less community testing)
- TAP networking inside a nested VM may have latency/reliability issues
References
- Firecracker GitHub
- Running Firecracker on M3 MacBook
- Lima nested virtualization
- ADR-0205: Cloud-Native Agent Execution Vision
- ADR-0206: Hot Image Agent Development Environments
- ADR-0207: Restate Durable Execution Engine
- ADR-0221: Local Sandbox Isolation Primitives