Hot Image Agent Development Environments
Status
Proposed
Context and Problem Statement
ADR-0205 defines the cloud-native direction for agent execution: isolated, durable, and fast-starting agent runtimes that map local k8s to AWS primitives.
Today, codex/claude/pi sessions start as cold host processes. Each run repeats clone/install/build work. As ADR-0060 swarm DAG execution scales parallel agents, this repeated startup overhead multiplies and directly limits throughput.
Ramp’s Inspect pattern (hot container snapshots rebuilt on a fixed schedule) provides a proven approach: pre-build the full workspace and dependencies, then start sessions from a recent image and perform a fast git sync.
This ADR captures the focused implementation decision for local hot image infrastructure and the AWS-equivalent mapping.
Decision
Build a hot image pipeline for agent execution environments on the local k8s cluster, starting with the joelclaw monorepo.
Boundary
- Cold isolated execution correctness comes first.
- Hot images are a latency optimization, not a substitute for proving isolated k8s Job execution.
- The concrete sequencing for ADR-0217 story execution is tracked in `~/Vault/Projects/09-joelclaw/0205-0206-sandboxed-story-execution-prd.md`.
Hot Image Build Pipeline
- Run a k8s CronJob every 30 minutes.
- Clone the target repo at `HEAD` (initial scope: joelclaw monorepo).
- Run `pnpm install` so dependencies are cached in image layers.
- Run `bunx tsc --noEmit` to warm TypeScript/build caches.
- Run the test suite once to warm test/runtime caches.
- Push image tags to registry: `agent-runner:latest` and `agent-runner:<timestamp>`.
- Add image metadata tags/labels: repo, commit SHA, build time.
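
The tagging and labelling steps above can be sketched as a small pure function. This is an illustration under assumptions, not the pipeline's actual code: the `BuildInfo` shape, the `imageRefs` and `timestampTag` names, the compact UTC timestamp format, and the use of OCI annotation keys for the labels are all choices made here; the ADR only states that repo, commit SHA, and build time are recorded.

```typescript
// Hypothetical input describing one scheduled build.
interface BuildInfo {
  repo: string;      // repo identifier or URL, e.g. "joelclaw"
  commitSha: string; // SHA of the HEAD that was cloned
  builtAt: Date;     // build start time
}

interface ImageRefs {
  tags: string[];
  labels: Record<string, string>;
}

// Image tags cannot contain ":", so the ISO timestamp is compacted
// to the form YYYYMMDDTHHMMSSZ (always UTC).
function timestampTag(d: Date): string {
  return d.toISOString().replace(/[-:]/g, "").replace(/\.\d{3}Z$/, "Z");
}

// Returns the moving tag, the immutable timestamp tag, and the metadata labels.
function imageRefs(info: BuildInfo, image = "agent-runner"): ImageRefs {
  return {
    tags: [`${image}:latest`, `${image}:${timestampTag(info.builtAt)}`],
    labels: {
      // OCI annotation keys are an assumption; any consistent key scheme works.
      "org.opencontainers.image.source": info.repo,
      "org.opencontainers.image.revision": info.commitSha,
      "org.opencontainers.image.created": info.builtAt.toISOString(),
    },
  };
}
```

Keeping a timestamped tag next to `latest` means a session can be pinned to, or rolled back to, a specific build even after `latest` has moved on.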
Image Contents
Each agent runner image must include:
- Bun runtime
- pnpm
- git (configured for push via GitHub App token injected at runtime)
- codex CLI (or active agent runtime)
- pi binary
- Common tools: `ripgrep`, `fd`, `jq`, `curl`
- Pre-built repo workspace mounted at `/workspace`
Warm Pool
- Maintain 2–3 pre-spun pods from the latest hot image via a Deployment.
- Swarm orchestrator claims a warm pod instead of creating a cold-start Job.
- Claimed pods are automatically replaced to maintain pool size.
- Adopt Ramp-style optimization: allow immediate read access while git sync runs; gate writes until sync completion.
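
The read-now, write-after-sync rule in the last bullet can be modelled with a single promise that every write awaits. The sketch below is an in-memory stand-in, not a Kubernetes client or the orchestrator's real interface: `ClaimedWorkspace`, its constructor arguments, and the `sync` callback are hypothetical names introduced for illustration.

```typescript
// Hypothetical in-memory model of a workspace inside a claimed warm pod.
class ClaimedWorkspace {
  private files = new Map<string, string>();
  private readonly synced: Promise<void>;

  // `snapshot` is the tree baked into the hot image.
  // `sync` stands in for the fast git sync that starts as soon as the pod is claimed.
  constructor(
    snapshot: Record<string, string>,
    sync: (files: Map<string, string>) => Promise<void>,
  ) {
    for (const [path, body] of Object.entries(snapshot)) {
      this.files.set(path, body);
    }
    this.synced = sync(this.files);
  }

  // Reads are served immediately, possibly from the pre-sync snapshot.
  read(path: string): string | undefined {
    return this.files.get(path);
  }

  // Writes wait for the sync to finish so they never land on a stale tree.
  // If the sync fails, the write rejects with the same error.
  async write(path: string, body: string): Promise<void> {
    await this.synced;
    this.files.set(path, body);
  }
}
```

The trade-off is that early reads may see content up to one build interval old; that is acceptable for exploration but is the reason writes are gated. For pool replacement, one commonly used mechanism (an assumption here, not something this ADR specifies) is to remove the pool's selector label from the claimed pod so the Deployment's ReplicaSet no longer counts it and creates a substitute.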
Multi-Repo Support
- Use one image definition per target repo.
- Start with `joelclaw` only.
- Extend to `course-builder`/`gremlin` when needed.
- Give each repo its own CronJob and tag lineage.
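
One way to keep a CronJob and tag lineage per repo is to generate each manifest from a repo list. The following is a sketch under assumptions: the `hot-image-<repo>` job name, the `hot-image/repo` label, the `agent-runner-<repo>` image name, the environment variable names, and the placeholder builder image are all hypothetical; only the 30-minute schedule and the one-CronJob-per-repo rule come from this ADR.

```typescript
// Hypothetical repo list; only joelclaw is in scope today.
const repos = ["joelclaw"] as const;

// Builds a batch/v1 CronJob manifest as a plain object for one target repo.
function hotImageCronJob(repo: string) {
  return {
    apiVersion: "batch/v1",
    kind: "CronJob",
    metadata: {
      name: `hot-image-${repo}`,
      labels: { "hot-image/repo": repo },
    },
    spec: {
      schedule: "*/30 * * * *",    // every 30 minutes
      concurrencyPolicy: "Forbid", // skip a run while the previous build is still going
      jobTemplate: {
        spec: {
          template: {
            spec: {
              restartPolicy: "Never",
              containers: [
                {
                  name: "build",
                  // Assumption: the builder image is not specified in this ADR.
                  image: "image-builder:placeholder",
                  env: [
                    { name: "TARGET_REPO", value: repo },
                    // A per-repo image name keeps each tag lineage separate.
                    { name: "IMAGE_NAME", value: `agent-runner-${repo}` },
                  ],
                },
              ],
            },
          },
        },
      },
    },
  };
}

const manifests = repos.map(hotImageCronJob);
```

Adding `course-builder` or `gremlin` later would then be a one-line change to `repos`. The objects can be serialized with `JSON.stringify` and applied with `kubectl apply -f`, which accepts JSON as well as YAML.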
AWS Equivalent
| Local | AWS |
|---|---|
| k8s CronJob | CodeBuild on schedule |
| Local registry / GHCR | ECR |
| k8s Job (agent session) | ECS Fargate task / EKS Job |
| Warm pool Deployment | ECS capacity provider with warm targets |
| PVC workspace | EBS / EFS mount |
Active Consumer
ADR-0217 queue/story work is the first concrete consumer of this ADR. The speed layer here should only be enabled after the sandbox runtime contract and cold isolated runner path are already passing the proof gates in ~/Vault/Projects/09-joelclaw/0205-0206-sandboxed-story-execution-prd.md.
Implementation Status Snapshot (2026-03-07)
- Cold isolated execution has now been proven on the local host-worker sandbox runner.
- That proof unblocks queue/drainer work and validates the contract side of ADR-0206’s dependency chain.
- This ADR itself is still pending because the hot-image build pipeline, warm pool, and k8s Job-backed runtime are not shipped yet.
- Do not mistake the current local sandbox runner for the hot-image solution; it is the correctness substrate this ADR depends on.
Consequences
Positive
- Agent cold start is expected to drop from minutes to seconds (verification target: under 10 seconds from the hot image).
- Dependency install is removed from per-session startup.
- Build/test caches are pre-warmed before agent claim.
- Warm pool removes most image-pull latency.
- Pattern ports directly to AWS primitives with minimal redesign.
Negative
- Scheduled builds consume CPU/disk every 30 minutes.
- Registry storage grows without retention controls.
- Warm pods consume resources while idle.
- Image definitions must stay aligned with real dev environment/tooling needs.
Verification
- CronJob successfully builds and pushes image every 30 minutes.
- Agent Job starts from hot image in <10 seconds.
- Warm pool pod is claimable and replacement spins up automatically.
- Image contains all required tools (`bun`, `pnpm`, `git`, `codex`, `pi`, `rg`, `fd`, `jq`).
- Agent can run `pnpm test` in `/workspace` without any install step.
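
Two of the checks above (tool presence and the 10-second start budget) reduce to simple predicates. The sketch below is illustrative only: `missingTools`, `startedInTime`, and the `Probe` type are hypothetical names, and the probe is injected so the logic stays independent of how a tool lookup is actually performed inside the container (for example by running `command -v <tool>`).

```typescript
// Tool list taken from the verification criteria in this ADR.
const REQUIRED_TOOLS = ["bun", "pnpm", "git", "codex", "pi", "rg", "fd", "jq"] as const;

// Hypothetical probe: returns true if the tool is on PATH inside the image.
type Probe = (tool: string) => boolean;

// Returns the required tools the probe could not find; an empty array means pass.
function missingTools(probe: Probe): string[] {
  return REQUIRED_TOOLS.filter((tool) => !probe(tool));
}

// Checks the "<10 seconds" start criterion from two timestamps in milliseconds,
// e.g. Job creation time and the moment the agent process reports ready.
function startedInTime(createdAtMs: number, readyAtMs: number, budgetMs = 10_000): boolean {
  return readyAtMs - createdAtMs < budgetMs;
}
```

A post-build smoke test could fail the build when `missingTools` returns a non-empty list, so a broken image is never tagged `latest`.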