ADR-0206 (proposed)

Hot Image Agent Development Environments

Status

Proposed

Context and Problem Statement

ADR-0205 defines the cloud-native direction for agent execution: isolated, durable, and fast-starting agent runtimes that map local k8s to AWS primitives.

Today, codex/claude/pi sessions start as cold host processes, and each run repeats the same clone/install/build work. As ADR-0060 swarm DAG execution scales up the number of parallel agents, this startup overhead is paid once per agent and directly limits throughput.

Ramp’s Inspect pattern (hot container snapshots rebuilt on a fixed schedule) provides a proven approach: pre-build the full workspace and dependencies, then start sessions from a recent image and perform a fast git sync.

This ADR captures the focused implementation decision for local hot image infrastructure and the AWS-equivalent mapping.

Decision

Build a hot image pipeline for agent execution environments on the local k8s cluster, starting with the joelclaw monorepo.

Boundary

  • Cold isolated execution correctness comes first.
  • Hot images are a latency optimization, not a substitute for proving isolated k8s Job execution.
  • The concrete sequencing for ADR-0217 story execution is tracked in ~/Vault/Projects/09-joelclaw/0205-0206-sandboxed-story-execution-prd.md.

Hot Image Build Pipeline

  1. Run a k8s CronJob every 30 minutes.
  2. Clone the target repo at HEAD (initial scope: joelclaw monorepo).
  3. Run pnpm install so dependencies are cached in image layers.
  4. Run bunx tsc --noEmit to warm TypeScript/build caches.
  5. Run the test suite once to warm test/runtime caches.
  6. Push image tags to registry: agent-runner:latest and agent-runner:<timestamp>.
  7. Add image metadata tags/labels: repo, commit SHA, build time.
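Steps 6 and 7 can be sketched as a small pure function that derives the two tags and the metadata labels from one build. This is a minimal sketch, not the pipeline's actual code: the type and function names are hypothetical, the compact UTC timestamp format is an assumption (the ADR only says <timestamp>), and the "agent-runner.repo" label key is invented for illustration. The two org.opencontainers.image.* keys are the standard OCI annotation names for revision and creation time.

```typescript
// Hypothetical build description; field names are illustrative only.
interface BuildInfo {
  repo: string;      // target repo, e.g. "joelclaw"
  commitSha: string; // git SHA of the cloned HEAD
  builtAt: Date;     // build time
}

interface ImageRef {
  tags: string[];
  labels: Record<string, string>;
}

// Format a Date as a compact UTC timestamp that is valid inside an image tag,
// e.g. 2026-03-07T12:30:00.000Z becomes "20260307T123000Z".
function timestampTag(d: Date): string {
  return d.toISOString().replace(/[-:]/g, "").replace(/\.\d{3}Z$/, "Z");
}

// Derive the ":latest" tag, the timestamped tag, and the metadata labels.
function imageRef(image: string, info: BuildInfo): ImageRef {
  return {
    tags: [`${image}:latest`, `${image}:${timestampTag(info.builtAt)}`],
    labels: {
      // Standard OCI annotation keys.
      "org.opencontainers.image.revision": info.commitSha,
      "org.opencontainers.image.created": info.builtAt.toISOString(),
      // Assumed custom key for the repo name; not defined by the ADR.
      "agent-runner.repo": info.repo,
    },
  };
}
```

A fixed-width UTC timestamp has the side benefit that tags sort chronologically as plain strings, which keeps later cleanup logic simple.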

Image Contents

Each agent runner image must include:

  • Bun runtime
  • pnpm
  • git (configured for push via GitHub App token injected at runtime)
  • codex CLI (or active agent runtime)
  • pi binary
  • Common tools: ripgrep, fd, jq, curl
  • Pre-built repo workspace mounted at /workspace

Warm Pool

  • Maintain 2–3 pre-spun pods from the latest hot image via a Deployment.
  • Swarm orchestrator claims a warm pod instead of creating a cold-start Job.
  • Claimed pods are automatically replaced to maintain pool size.
  • Adopt Ramp-style optimization: allow immediate read access while git sync runs; gate writes until sync completion.
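The claim step and the read-before-sync rule can be illustrated with an in-memory model. This is a sketch under stated assumptions: WarmPod, claimWarmPod, and SyncGate are hypothetical names, the pool is a plain array rather than the Kubernetes API, and a real claim would need to be atomic across concurrent orchestrators.

```typescript
// All names below are hypothetical; the ADR does not define this API.
type PodState = "warm" | "claimed";

interface WarmPod {
  name: string;
  imageTag: string;
  state: PodState;
}

// Claim the first warm pod, or return undefined so the caller can fall back
// to creating a cold-start Job.
function claimWarmPod(pool: WarmPod[]): WarmPod | undefined {
  const pod = pool.find((p) => p.state === "warm");
  if (pod) pod.state = "claimed";
  return pod;
}

// Reads run immediately; writes wait until the git sync has finished.
class SyncGate {
  private readonly synced: Promise<void>;

  constructor(sync: Promise<void>) {
    this.synced = sync;
  }

  // No waiting: the image already holds a recent checkout.
  async read<T>(op: () => Promise<T>): Promise<T> {
    return op();
  }

  // Block until the sync promise settles; a failed sync rejects the write.
  async write<T>(op: () => Promise<T>): Promise<T> {
    await this.synced;
    return op();
  }
}
```

For the automatic replacement, one standard Kubernetes mechanism is to change the claimed pod's labels so that it no longer matches the Deployment's selector; the underlying ReplicaSet then creates a new pod to restore the replica count. Whether this ADR intends that mechanism is an assumption.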

Multi-Repo Support

  • Use one image definition per target repo.
  • Start with joelclaw only.
  • Extend to course-builder/gremlin when needed.
  • Give each repo its own CronJob and tag lineage.
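One way to keep the per-repo rule explicit is a small target table that drives CronJob creation. This is a hedged sketch: the HotImageTarget shape, the helper names, and the image names for the later repos are assumptions, as is treating course-builder and gremlin as two separate repos. The cron expression "*/30 * * * *" is the standard form for "every 30 minutes".

```typescript
// Hypothetical per-repo build target; field names are illustrative only.
interface HotImageTarget {
  repo: string;     // repository the image is built from
  image: string;    // image name, which also defines the tag lineage
  schedule: string; // cron expression for the repo's CronJob
  enabled: boolean; // false until the repo is actually onboarded
}

const EVERY_30_MINUTES = "*/30 * * * *";

const targets: HotImageTarget[] = [
  { repo: "joelclaw", image: "agent-runner", schedule: EVERY_30_MINUTES, enabled: true },
  // Assumed image names for repos that may be added later.
  { repo: "course-builder", image: "agent-runner-course-builder", schedule: EVERY_30_MINUTES, enabled: false },
  { repo: "gremlin", image: "agent-runner-gremlin", schedule: EVERY_30_MINUTES, enabled: false },
];

// Return the targets that should currently have a CronJob.
function activeTargets(all: HotImageTarget[]): HotImageTarget[] {
  return all.filter((t) => t.enabled);
}

// Throw if two repos share an image name, which would mix their tag lineages.
function assertDistinctImages(all: HotImageTarget[]): void {
  const seen = new Set<string>();
  for (const t of all) {
    if (seen.has(t.image)) throw new Error(`duplicate image name: ${t.image}`);
    seen.add(t.image);
  }
}
```

Keeping disabled entries in the table documents the intended extension path without scheduling any builds for those repos.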

AWS Equivalent

Local                     | AWS
k8s CronJob               | CodeBuild on schedule
Local registry / GHCR     | ECR
k8s Job (agent session)   | ECS Fargate task / EKS Job
Warm pool Deployment      | ECS capacity provider with warm targets
PVC workspace             | EBS / EFS mount

Active Consumer

ADR-0217 queue/story work is the first concrete consumer of this ADR. Enable the speed layer described here only after the sandbox runtime contract and the cold isolated runner path pass the proof gates in ~/Vault/Projects/09-joelclaw/0205-0206-sandboxed-story-execution-prd.md.

Implementation Status Snapshot (2026-03-07)

  • Cold isolated execution has now been proven on the local host-worker sandbox runner.
  • That proof unblocks queue/drainer work and validates the contract side of ADR-0206’s dependency chain.
  • This ADR itself is still pending because the hot-image build pipeline, warm pool, and k8s Job-backed runtime have not shipped yet.
  • Do not mistake the current local sandbox runner for the hot-image solution; it is the correctness substrate this ADR depends on.

Consequences

Positive

  • Agent session startup is expected to drop from minutes to seconds (verification target: under 10 seconds from the hot image).
  • Dependency install is removed from per-session startup.
  • Build/test caches are pre-warmed before agent claim.
  • Warm pool removes most image-pull latency.
  • Pattern ports directly to AWS primitives with minimal redesign.

Negative

  • Scheduled builds consume CPU/disk every 30 minutes.
  • Registry storage grows without retention controls.
  • Warm pods consume resources while idle.
  • Image definitions must stay aligned with real dev environment/tooling needs.
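The registry-growth concern can be bounded with a simple keep-newest-N rule over the timestamped tags. The sketch below is illustrative only: tagsToPrune is a hypothetical helper, and it assumes timestamp tags use a fixed-width UTC format such as 20260307T123000Z so that string order matches chronological order. The ADR does not specify a retention policy or a tag format.

```typescript
// Assumed tag format: fixed-width UTC timestamp, e.g. "20260307T123000Z".
const TIMESTAMP_TAG = /^\d{8}T\d{6}Z$/;

// Return the timestamp tags that fall outside the newest `keep` builds.
// Non-timestamp tags such as "latest" are never selected for deletion.
function tagsToPrune(tags: string[], keep: number): string[] {
  const newestFirst = tags
    .filter((t) => TIMESTAMP_TAG.test(t))
    .sort()     // fixed-width timestamps sort chronologically as strings
    .reverse(); // newest first
  return newestFirst.slice(Math.max(0, keep));
}
```

At one build every 30 minutes, keeping 48 tags would retain roughly one day of image history per repo.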

Verification

  • CronJob successfully builds and pushes image every 30 minutes.
  • Agent Job starts from hot image in <10 seconds.
  • Warm pool pod is claimable and replacement spins up automatically.
  • Image contains all required tools (bun, pnpm, git, codex, pi, rg, fd, jq, curl).
  • Agent can run pnpm test in /workspace without any install step.
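Two of these checks reduce to small predicates that a verification script could call. This is a minimal sketch with hypothetical names: the tool list follows the Image Contents section (using the rg binary name for ripgrep), the PATH lookup is injected as a function so the logic stays independent of any particular runtime API, and the 10-second budget mirrors the startup target above.

```typescript
// Tools the image is expected to provide, per the Image Contents section.
const REQUIRED_TOOLS = ["bun", "pnpm", "git", "codex", "pi", "rg", "fd", "jq", "curl"] as const;

// Return the required tools that the supplied lookup cannot find.
// `isOnPath` is injected by the caller, e.g. a wrapper around a PATH search.
function missingTools(
  required: readonly string[],
  isOnPath: (tool: string) => boolean,
): string[] {
  return required.filter((tool) => !isOnPath(tool));
}

// True when the session became ready in under the budget (default 10 s).
function startedWithinBudget(
  createdAtMs: number,
  readyAtMs: number,
  budgetMs: number = 10_000,
): boolean {
  return readyAtMs - createdAtMs < budgetMs;
}
```

An empty result from missingTools(REQUIRED_TOOLS, lookup) together with a true result from startedWithinBudget would cover the tooling and startup-time checks; the CronJob cadence and warm pool replacement still need to be observed on the cluster.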