Cloud-Native Agent Execution Vision
Status
Proposed
Implementation Status Snapshot (2026-03-07)
- The shared-checkout-only assumption is now false.
- `executionMode: "sandbox"` is live in `system/agent-dispatch` as a local isolated host-worker runner.
- Gate A and Gate B are proven, and a real ADR-0217 acceptance run completed on that sandbox path without dirtying the operator checkout.
- The architecture direction in this ADR is still valid: k8s Job isolation remains the next gate, not done.
- Treat the current local sandbox runner as the proved phase-1 isolation surface, and the k8s Job runner as the follow-on substrate swap.
Context and Problem Statement
joelclaw currently runs on a single Mac Mini (M4 Pro, 64GB). Agent execution (Codex, Claude, and pi sessions) happens as local host processes with broad machine access. ADR-0060 introduces a swarm DAG orchestrator that will run N agents in parallel; without isolation, this becomes N concurrent processes with full host access and coupled failure domains.
Ramp’s Inspect architecture demonstrates a production pattern for background agents: isolated sandbox execution, prebuilt hot images for fast startup, and multi-client operation surfaces. This ADR captures the target direction for joelclaw: cloud-native, isolated, and durable agent execution that can run locally on k8s now and map cleanly to AWS later.
This is a vision ADR. It sets architecture intent and constraints, then explicitly spawns focused implementation ADRs.
The first concrete consumer is ADR-0217: queue/event-routing work can continue on its own axis, but real autonomous story execution now depends on this runtime moving off shared host workspaces and onto isolated sandboxes.
Decision Drivers
- Agent isolation — agents currently have unrestricted host access.
- Self-hosted Inngest reliability concerns — durable execution needs stronger operational guarantees.
- Swarm DAG parallelism — requires sandboxed parallel execution primitives.
- Planned path to AWS/cloud — must be designed into the architecture now.
- Startup latency — hot images (pre-built snapshots) can reduce agent cold-start time dramatically.
Decision
Core Principle: Local infrastructure mirrors AWS primitives. Design for AWS, run locally on k8s. Every local service choice should have a direct AWS equivalent so the path to cloud deployment is a substrate swap, not a rewrite. AWS chosen over Cloudflare for broadest “normal” ecosystem — standard patterns, widest tooling support, most transferable knowledge.
| AWS Primitive | Local k8s Equivalent | Status |
|---|---|---|
| EKS | Talos/Colima k8s | ✅ Running |
| ElastiCache | Redis StatefulSet | ✅ Running |
| S3 | MinIO (needed) or NAS-backed PVC | ⚠️ No S3 API yet |
| Step Functions | Restate (spiked, 1:1 API mapping) | 🔬 Evaluated |
| Secrets Manager | agent-secrets | ✅ Running |
| CloudWatch/OTEL | Typesense OTEL collection | ✅ Running |
| ECR | GHCR | ✅ Running |
| Lambda / Fargate | k8s Jobs (agent pods) | 🔜 Swarm DAG ready |
| API Gateway | Caddy reverse proxy | ✅ Running |
| EventBridge | Inngest events / Restate signals | 🔬 Evaluating |
| IAM Roles | k8s ServiceAccounts | ❌ Not yet |
Restate Spike Findings (2026-03-04)
A working spike (`packages/restate-spike/`) validated three representative patterns:
- Durable step chain — `ctx.run()` maps 1:1 to Inngest `step.run()`. ✅
- Fan-out/fan-in — `ctx.serviceClient(svc).method()` for parallel orchestration. Cleaner than Inngest event fan-out for request/response patterns. ✅
- Workflow with signals — `ctx.promise("key")` + external `.resolve()`. Maps to `step.waitForEvent()` but is more ergonomic for human-in-the-loop approval. ✅
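The workflow-with-signals pattern can be sketched self-containedly as a registry of named durable promises: a workflow awaits a key, and an external caller (a human approving via Telegram/Slack) resolves it. This is an illustrative mock, not the Restate SDK — in real Restate the promise survives process restarts, and the class and method names here are invented for the sketch:

```typescript
// Mock of the durable-promise pattern: workflow side awaits a named promise,
// external side resolves it. Single-waiter-per-key, in-memory only (a sketch).
type Resolver = (value: string) => void;

class PromiseRegistry {
  private pending = new Map<string, Resolver>();
  private settled = new Map<string, string>();

  // Workflow side: await a named promise (the ctx.promise("key") analogue).
  // If already settled, replays the stored value immediately.
  wait(key: string): Promise<string> {
    const done = this.settled.get(key);
    if (done !== undefined) return Promise.resolve(done);
    return new Promise((resolve) => this.pending.set(key, resolve));
  }

  // External side: resolve the promise (the external .resolve() analogue).
  resolve(key: string, value: string): void {
    this.settled.set(key, value);
    this.pending.get(key)?.(value);
    this.pending.delete(key);
  }
}
```

An approval workflow would block on something like `wait("approval:story-42")` until an operator message resolves it, which is the shape `step.waitForEvent()` approximates with event matching.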
Restate advantages: Single Rust binary (no Postgres), direct AWS Step Functions equivalent, service/RPC model natural for orchestration, promise/signal cleaner than event-wait.
Restate gaps: No event-driven topology by default (joelclaw is deeply event-centric), 110+ Inngest functions represent a large migration surface, and operator tooling is less mature than the Inngest dashboard.
Recommendation: Dual-run pilot — mirror swarm orchestrator and approval workflow in Restate alongside Inngest. Measure reliability, DX, and observability. Full migration decision after pilot data.
Adopt a cloud-native execution vision where agent work runs in isolated ephemeral runtimes, orchestration is backed by a reliable durable workflow engine, and infrastructure is expressed as portable IaC so local and cloud environments share the same logical topology.
1) Agent Execution Layer
- Agent sessions run as ephemeral workloads: k8s Jobs locally, cloud sandbox equivalents on AWS.
- Build hot images via a CronJob every 30 minutes; each image bakes in a repo checkout, installed dependencies, and build caches.
- Maintain a warm pool of pre-spun runners for low-latency session start.
- Enforce resource limits, network policies, and TTL-based cleanup lifecycle on all agent runners.
- Treat host-level execution as exception-only fallback, not the default runtime.
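A minimal k8s Job manifest sketches what an agent runner from the list above could look like. All names, images, and values are placeholders (not the actual joelclaw manifests); the point is the shape: scoped identity, resource limits, no blind retries, and TTL-based cleanup.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: agent-run-example            # placeholder name
  labels:
    app: agent-runner
spec:
  ttlSecondsAfterFinished: 600       # TTL-based cleanup after completion
  backoffLimit: 0                    # failed agent runs surface, not silently retry
  template:
    spec:
      serviceAccountName: agent-runner          # scoped identity (IAM-role analogue)
      restartPolicy: Never
      containers:
        - name: agent
          image: ghcr.io/example/agent-hot:latest   # hypothetical hot image from the CronJob build
          resources:
            requests: { cpu: "1", memory: 2Gi }
            limits:   { cpu: "2", memory: 4Gi }
```

Network isolation would come from a NetworkPolicy selecting `app: agent-runner`, which is omitted here for brevity.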
2) Durable Workflow Engine
- Run a formal evaluation of Restate as replacement/complement to self-hosted Inngest.
- The candidate set includes Restate, Inngest Cloud, and an AWS Step Functions alignment strategy.
- Required capabilities:
- event-driven triggers
- fan-out execution
- durable step memoization
- retries and failure recovery
- strong run introspection/observability
- Prefer solutions that preserve local-first operation while providing cloud-grade durability.
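Of the required capabilities, durable step memoization is the one that distinguishes a workflow engine from a plain task queue. The sketch below illustrates the property with an in-memory journal; it is not any SDK's API (`durableStep`, `Journal`, and the step ids are invented for the sketch), but both Restate's `ctx.run()` and Inngest's `step.run()` provide this behavior with a persisted journal:

```typescript
// Durable step memoization, sketched: each step runs once per id; on replay
// the journaled result is returned instead of re-executing, so side effects
// are not repeated after a crash or retry.
type Journal = Map<string, unknown>;

async function durableStep<T>(
  journal: Journal,
  id: string,
  fn: () => Promise<T>,
): Promise<T> {
  if (journal.has(id)) return journal.get(id) as T; // replay: skip execution
  const result = await fn();
  journal.set(id, result); // a real engine persists this durably
  return result;
}

async function workflow(journal: Journal, executed: string[]): Promise<string> {
  const checkout = await durableStep(journal, "checkout", async () => {
    executed.push("checkout");
    return "commit-abc";
  });
  return durableStep(journal, "build", async () => {
    executed.push("build");
    return `image:${checkout}`;
  });
}
```

Rerunning `workflow` with the same journal executes zero steps and returns the same result, which is exactly the recovery semantics the evaluation must verify in each candidate.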
3) Infrastructure as Code
- All network/stateful services (Redis, Typesense, workflow engine, queues, secrets wiring) must be declared as IaC.
- Local substrate remains k8s manifests; cloud substrate uses Terraform/Pulumi for EKS, ElastiCache, S3, and Secrets Manager.
- Keep one logical architecture and swap substrate adapters per environment.
4) Claw Organism Operating Model
Adopt the organism framing from <https://joelclaw.com/joelclaw-is-a-claw-like-organism> as an explicit architecture model:
- Event bus (Inngest/Restate) = nervous system.
- Agent pods/jobs = appendages that grow and retract based on workload.
- Swarm DAG = coordination cortex for multi-appendage tasks.
- Gateway channels (Telegram/Slack/webhooks) = sensory input surface.
This model is descriptive and operational: topology, scaling behavior, and failure handling should follow this distributed-organism contract.
Non-Goals (for this ADR)
- Selecting a single workflow runtime implementation immediately.
- Defining full migration sequencing details from single-node to fleet.
- Committing to a specific cloud vendor managed service set today.
- Replacing all existing local workflows in one cutover.
Active Follow-on Work
- ADR-0206 owns the speed layer. Hot-image builds and warm pools are implementation details of this runtime vision, not separate justification for skipping isolation proof.
- ADR-0217 is the first hard consumer. Queue/event-routing can progress independently, but autonomous story execution for that work now depends on isolated runtime landing.
- Execution build sheet: `~/Vault/Projects/09-joelclaw/0205-0206-sandboxed-story-execution-prd.md`
- Open tracks still to decide: workflow engine evaluation, IaC substrate mapping, and migration sequencing beyond the initial local k8s runner path.
Consequences
Positive
- Agent isolation reduces blast radius from rogue or compromised agent processes.
- Hot images and warm pools reduce startup latency for agent sessions.
- Architecture becomes cloud-ready without requiring a full rewrite.
- Parallel agent scaling can move beyond single-machine process limits.
- Durable runtime options (Restate/Step Functions class) can improve reliability versus current self-hosted fragility.
Negative
- Infrastructure and operational complexity increase significantly.
- Agent runner images must package all required tools (codex, pi, git, language/toolchains).
- Network hops between agent pods and shared services introduce additional latency/failure edges.
- Compute and storage costs rise with Job-based execution and image pipelines.