ADR-0205

Cloud-Native Agent Execution Vision

Status

Proposed

Implementation Status Snapshot (2026-03-07)

  • The shared-checkout-only assumption is now false.
  • executionMode: "sandbox" is live in system/agent-dispatch as a local isolated host-worker runner.
  • Gate A and Gate B are proven, and a real ADR-0217 acceptance run completed on that sandbox path without dirtying the operator checkout.
  • The architecture direction in this ADR is still valid: k8s Job isolation remains the next gate, not yet done.
  • Treat the current local sandbox runner as the proven phase-1 isolation surface, and the k8s Job runner as the follow-on substrate swap.

Context and Problem Statement

joelclaw currently runs on a single Mac Mini (M4 Pro, 64GB). Agent execution (Codex, Claude, and pi sessions) happens as local host processes with broad machine access. ADR-0060 introduces a swarm DAG orchestrator that will run N agents in parallel; without isolation, this becomes N concurrent processes with full host access and coupled failure domains.

Ramp’s Inspect architecture demonstrates a production pattern for background agents: isolated sandbox execution, prebuilt hot images for fast startup, and multi-client operation surfaces. This ADR captures the target direction for joelclaw: cloud-native, isolated, and durable agent execution that can run locally on k8s now and map cleanly to AWS later.

This is a vision ADR. It sets architecture intent and constraints, then explicitly spawns focused implementation ADRs.

The first concrete consumer is ADR-0217: queue/event-routing work can continue on its own axis, but real autonomous story execution now depends on this runtime moving off shared host workspaces and onto isolated sandboxes.

Decision Drivers

  1. Agent isolation — agents currently have unrestricted host access.
  2. Self-hosted Inngest reliability concerns — durable execution needs stronger operational guarantees.
  3. Swarm DAG parallelism — requires sandboxed parallel execution primitives.
  4. Planned path to AWS/cloud — must be designed into the architecture now.
  5. Startup latency — hot images (pre-built snapshots) can reduce agent cold-start time dramatically.

Decision

Core Principle: Local infrastructure mirrors AWS primitives. Design for AWS, run locally on k8s. Every local service choice should have a direct AWS equivalent so the path to cloud deployment is a substrate swap, not a rewrite. AWS was chosen over Cloudflare for the broadest mainstream ecosystem: standard patterns, the widest tooling support, and the most transferable knowledge.

| AWS Primitive | Local k8s Equivalent | Status |
| --- | --- | --- |
| EKS | Talos/Colima k8s | ✅ Running |
| ElastiCache | Redis StatefulSet | ✅ Running |
| S3 | MinIO (needed) or NAS-backed PVC | ⚠️ No S3 API yet |
| Step Functions | Restate (spiked, 1:1 API mapping) | 🔬 Evaluated |
| Secrets Manager | agent-secrets | ✅ Running |
| CloudWatch/OTEL | Typesense OTEL collection | ✅ Running |
| ECR | GHCR | ✅ Running |
| Lambda / Fargate | k8s Jobs (agent pods) | 🔜 Swarm DAG ready |
| API Gateway | Caddy reverse proxy | ✅ Running |
| EventBridge | Inngest events / Restate signals | 🔬 Evaluating |
| IAM Roles | k8s ServiceAccounts | ❌ Not yet |

Restate Spike Findings (2026-03-04)

A working spike (packages/restate-spike/) validated three representative patterns:

  1. Durable step chain: ctx.run() maps 1:1 to Inngest step.run(). ✅
  2. Fan-out/fan-in: ctx.serviceClient(svc).method() for parallel orchestration. Cleaner than Inngest event fan-out for request/response patterns. ✅
  3. Workflow with signals: ctx.promise("key") + external .resolve(). Maps to step.waitForEvent() but is more ergonomic for human-in-the-loop. ✅
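The 1:1 mapping in pattern 1 rests on durable step memoization: each completed step's result is journaled, so a retried invocation replays journaled results instead of re-executing side effects. A minimal in-memory sketch of that contract (this is not the Restate or Inngest SDK; the Journal type and runStep helper are illustrative):

```typescript
// Model of durable step memoization: completed step results are journaled
// by name, so a replayed workflow skips already-executed steps. Both
// Inngest's step.run() and Restate's ctx.run() provide this guarantee;
// a real engine persists the journal durably, not in a Map.
type Journal = Map<string, unknown>;

async function runStep<T>(
  journal: Journal,
  name: string,
  fn: () => Promise<T>,
): Promise<T> {
  if (journal.has(name)) {
    // Replay path: return the journaled result without re-running fn.
    return journal.get(name) as T;
  }
  const result = await fn();
  journal.set(name, result); // In a real engine this write is durable.
  return result;
}

async function demo(): Promise<void> {
  const journal: Journal = new Map();
  let executions = 0;
  const step = () => runStep(journal, "fetch", async () => ++executions);

  const first = await step();  // executes fn once
  const second = await step(); // replays from the journal
  console.log(first, second, executions); // 1 1 1
}

demo();
```

The same journal is what lets a crashed run resume mid-chain: steps before the crash replay from the journal, steps after it execute fresh.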

Restate advantages: Single Rust binary (no Postgres), direct AWS Step Functions equivalent, service/RPC model natural for orchestration, promise/signal cleaner than event-wait.

Restate gaps: no event-driven topology by default (joelclaw is deeply event-centric), 110+ existing Inngest functions represent a large migration surface, and operator tooling is less mature than the Inngest dashboard.

Recommendation: Dual-run pilot — mirror the swarm orchestrator and the approval workflow in Restate alongside Inngest. Measure reliability, DX, and observability. Make the full migration decision after pilot data.

Adopt a cloud-native execution vision where agent work runs in isolated ephemeral runtimes, orchestration is backed by a reliable durable workflow engine, and infrastructure is expressed as portable IaC so local and cloud environments share the same logical topology.

1) Agent Execution Layer

  • Agent sessions run as ephemeral workloads: k8s Jobs locally, cloud sandbox equivalents on AWS.
  • Build hot images every 30 minutes via a CronJob; each image contains a repo checkout, installed dependencies, and build caches.
  • Maintain a warm pool of pre-spun runners for low-latency session start.
  • Enforce resource limits, network policies, and TTL-based cleanup lifecycle on all agent runners.
  • Treat host-level execution as exception-only fallback, not the default runtime.
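The bullets above can be sketched as a single k8s Job manifest. This is a hypothetical fragment, not a shipped manifest: the Job name, namespace, image path, and all numeric limits are illustrative, and the hot image is assumed to come from the hot-image CronJob pipeline.

```yaml
# Illustrative agent-runner Job: ephemeral, resource-bounded, self-cleaning.
apiVersion: batch/v1
kind: Job
metadata:
  name: agent-session-example        # one Job per agent session (name illustrative)
  namespace: agent-runners           # namespace name is illustrative
spec:
  ttlSecondsAfterFinished: 600       # TTL-based cleanup after completion
  backoffLimit: 0                    # no blind retries of a failed session
  activeDeadlineSeconds: 3600        # hard cap on session runtime
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: agent
          image: ghcr.io/example/agent-hot:latest  # hot image: checkout + deps prebuilt
          resources:
            requests: { cpu: "1", memory: 2Gi }
            limits:   { cpu: "2", memory: 4Gi }    # enforce resource limits
      # Egress would be constrained by a NetworkPolicy selecting this
      # namespace; host-level execution remains the exception, not default.
```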

2) Durable Workflow Engine

  • Run a formal evaluation of Restate as replacement/complement to self-hosted Inngest.
  • The candidate set includes Restate, Inngest Cloud, and an AWS Step Functions alignment strategy.
  • Required capabilities:
    • event-driven triggers
    • fan-out execution
    • durable step memoization
    • retries and failure recovery
    • strong run introspection/observability
  • Prefer solutions that preserve local-first operation while providing cloud-grade durability.
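Several of the required capabilities reduce to one primitive: a named, externally resolvable promise, which is what Restate's ctx.promise("key") and Inngest's step.waitForEvent() both expose for human-in-the-loop waits. A minimal in-memory sketch of that contract (the SignalBoard class is illustrative, not either SDK; a durable engine persists this state across restarts):

```typescript
// Model of a named, externally-resolvable promise: a workflow awaits
// board.wait("approval"); a human or webhook later calls
// board.resolve("approval", value). This in-memory Map is only an
// illustration of the contract; it supports one pending waiter per key.
class SignalBoard {
  private pending = new Map<string, (v: unknown) => void>();
  private results = new Map<string, unknown>();

  wait<T>(key: string): Promise<T> {
    if (this.results.has(key)) {
      // Late waiter: the signal already fired, return its value.
      return Promise.resolve(this.results.get(key) as T);
    }
    return new Promise<T>((resolve) => {
      this.pending.set(key, resolve as (v: unknown) => void);
    });
  }

  resolve(key: string, value: unknown): void {
    this.results.set(key, value); // future waiters still see the value
    this.pending.get(key)?.(value);
    this.pending.delete(key);
  }
}

async function approvalDemo(): Promise<void> {
  const board = new SignalBoard();
  const decision = board.wait<string>("story-approval");
  // ...elsewhere, e.g. a Telegram handler resolves the signal:
  board.resolve("story-approval", "approved");
  console.log(await decision); // "approved"
}

approvalDemo();
```

Any engine on the candidate list must provide this resolve-once, survive-restart semantic; the evaluation should verify it explicitly rather than assume it.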

3) Infrastructure as Code

  • All network/stateful services (Redis, Typesense, workflow engine, queues, secrets wiring) must be declared as IaC.
  • Local substrate remains k8s manifests; cloud substrate uses Terraform/Pulumi for EKS, ElastiCache, S3, and Secrets Manager.
  • Keep one logical architecture and swap substrate adapters per environment.

4) Claw Organism Operating Model

Adopt the organism framing from <https://joelclaw.com/joelclaw-is-a-claw-like-organism> as an explicit architecture model:

  • Event bus (Inngest/Restate) = nervous system.
  • Agent pods/jobs = appendages that grow and retract based on workload.
  • Swarm DAG = coordination cortex for multi-appendage tasks.
  • Gateway channels (Telegram/Slack/webhooks) = sensory input surface.

This model is descriptive and operational: topology, scaling behavior, and failure handling should follow this distributed-organism contract.

Non-Goals (for this ADR)

  • Selecting a single workflow runtime implementation immediately.
  • Defining full migration sequencing details from single-node to fleet.
  • Committing to a specific cloud vendor managed service set today.
  • Replacing all existing local workflows in one cutover.

Active Follow-on Work

  1. ADR-0206 owns the speed layer. Hot-image builds and warm pools are implementation details of this runtime vision, not separate justification for skipping isolation proof.
  2. ADR-0217 is the first hard consumer. Queue/event-routing can progress independently, but autonomous story execution for that work now depends on isolated runtime landing.
  3. Execution build sheet: ~/Vault/Projects/09-joelclaw/0205-0206-sandboxed-story-execution-prd.md
  4. Open tracks still to decide: workflow engine evaluation, IaC substrate mapping, and migration sequencing beyond the initial local k8s runner path.

Consequences

Positive

  • Agent isolation reduces blast radius from rogue or compromised agent processes.
  • Hot images and warm pools reduce startup latency for agent sessions.
  • Architecture becomes cloud-ready without requiring a full rewrite.
  • Parallel agent scaling can move beyond single-machine process limits.
  • Durable runtime options (Restate/Step Functions class) can improve reliability versus current self-hosted fragility.

Negative

  • Infrastructure and operational complexity increase significantly.
  • Agent runner images must package all required tools (codex, pi, git, language/toolchains).
  • Network hops between agent pods and shared services introduce additional latency/failure edges.
  • Compute and storage costs rise with Job-based execution and image pipelines.