ADR-0060proposed

Inngest-Backed Swarm/DAG Multi-Agent Orchestration

February 19, 2026

Context and Problem Statement

ralph-loop executes PRD stories sequentially — one codex session per story, linear chain. This works for implementation loops but can’t express:

Parallel fan-out: 3 agents researching simultaneously, then a synthesizer
Dependency graphs: agent C depends on A and B, but A and B are independent
Pipeline iteration: repeat the full graph N times (e.g., find 50 items, one per iteration)
Mixed topologies: diamond, fan-in, hybrid DAGs

oh-my-pi’s packages/swarm-extension/ solves this with YAML-defined DAGs, topological wave sorting, and parallel execution. But their implementation is process-local — spawns subprocesses from the pi session, tracks state on the filesystem, dies if the terminal closes. No durability, no retries, no observability beyond log files.

joelclaw already has durable execution infrastructure via Inngest (ADR-0005). The swarm should be built on top of it.

Decision Drivers

ralph-loop can’t express parallel or DAG workflows
oh-my-pi’s swarm architecture (YAML config, DAG, waves) is well-designed but wrong execution layer
Inngest provides durability, step memoization, retries, event fan-out — exactly what a DAG orchestrator needs
Must run AFK (no pi session required) via the existing system-bus worker
Must integrate with existing joelclaw infrastructure (Redis state, gateway notifications, joelclaw CLI)

Decision

Build an Inngest-backed swarm/DAG orchestrator that uses oh-my-pi’s YAML definition format and DAG algorithms, but executes via Inngest functions and steps instead of local subprocesses.

YAML Definition Format

Adopt oh-my-pi’s swarm YAML format (credit: Can Boluk, MIT licensed):

swarm:
  name: codebase-audit
  workspace: /path/to/project
  mode: parallel          # pipeline | parallel | sequential
  target_count: 1         # iterations (pipeline mode only)
  model: claude-sonnet-4  # optional model override
  tool: codex             # codex | claude | pi (agent runtime)
 
  agents:
    security:
      role: security-auditor
      task: |
        Audit all code in src/ for security vulnerabilities.
        Write findings to reports/security.md with severity ratings.
      reports_to:
        - lead
 
    performance:
      role: performance-analyst
      task: |
        Profile and analyze src/ for performance bottlenecks.
        Write findings to reports/performance.md with benchmarks.
      reports_to:
        - lead
 
    lead:
      role: engineering-lead
      task: |
        Read all reports in reports/.
        Create a prioritized action plan in output/action_plan.md.
      waits_for:
        - security
        - performance

Extensions to oh-my-pi format:

tool field (default: codex) — which agent runtime to use per-agent or globally
Per-agent tool override — mix codex and claude in the same swarm
Per-agent model override
Per-agent sandbox mode (workspace-write default, danger-full-access for system tasks)

Architecture

Event Flow

swarm/started              → Inngest function: swarm-orchestrator
  ├─ step: parse-yaml      → validate + build DAG + compute waves
  ├─ step: wave-0           → fan-out: invoke agent functions in parallel
  │   ├─ swarm/agent.started (security)  → step.invoke → codex exec
  │   └─ swarm/agent.started (performance) → step.invoke → codex exec
  ├─ step: wave-0-collect   → waitForEvent on all wave-0 completions
  ├─ step: wave-1           → fan-out: invoke agent functions for next wave
  │   └─ swarm/agent.started (lead) → step.invoke → codex exec
  ├─ step: wave-1-collect
  └─ step: finalize         → update Redis state, gateway notification

Inngest Functions

swarm-orchestrator — main function, triggered by swarm/started:

step.run("parse") — parse YAML, validate, build dependency graph, compute waves via topological sort
For each wave:
- step.run("wave-N-dispatch") — emit swarm/agent.started events for all agents in wave
- For each agent in wave: step.invoke("swarm-agent-exec", { agent, wave, iteration })
- Collect results (all agents in wave must complete before next wave)
step.run("finalize") — update Redis state, push gateway notification
For pipeline mode: loop back to wave-0 for next iteration

swarm-agent-exec — individual agent execution, invoked by orchestrator:

step.run("exec") — spawn codex/claude/pi with the agent’s task + system prompt
Return result (exit code, output summary, error if any)

DAG Algorithms

Port oh-my-pi’s dag.ts directly (MIT licensed, pure functions, no runtime deps):

buildDependencyGraph() — extracts deps from waits_for / reports_to
detectCycles() — Kahn’s algorithm
buildExecutionWaves() — topological sort into parallel wave groups

State Management

Redis keys (namespace: swarm:{name}):

swarm:{name}:state — JSON: pipeline status, current iteration, current wave
swarm:{name}:agents — hash: per-agent status, timing, errors
swarm:{name}:definition — stored YAML definition for resume/inspection
TTL: 7 days after completion

Inter-Agent Communication

Same pattern as oh-my-pi: agents communicate via the shared workspace filesystem. The orchestrator doesn’t pass data between agents. Patterns:

Signal files: signals/agent_out.txt — lightweight status flags
Structured output: reports/security.md — detailed results
Tracking files: processed.txt — dedup across pipeline iterations

CLI Integration

# Start a swarm
joelclaw swarm start path/to/swarm.yaml
 
# Status
joelclaw swarm status <name>
 
# Cancel
joelclaw swarm cancel <name>
 
# List running swarms
joelclaw swarm list

pi Extension (Optional)

A pi-tools extension for interactive use:

/swarm run path/to/swarm.yaml — start from pi session
/swarm status <name> — check progress
Progress events pushed to gateway via pushGatewayEvent()

Relationship to ralph-loop

ralph-loop is not replaced. It serves a different purpose:

ralph-loop: PRD-driven story implementation with judge evaluation, git branching, test verification
swarm: arbitrary DAG workflows — research pipelines, code audits, content creation, data processing

They coexist. A swarm agent could internally use a ralph-loop for its task if the task is “implement this PRD.”

Consequences

Positive

Parallel multi-agent execution with dependency management
Durable — survives crashes, restarts, terminal closures via Inngest
Observable — joelclaw runs shows swarm progress, Inngest dashboard for step traces
AFK — runs through system-bus worker, no pi session needed
Gateway notifications on progress and completion
YAML config is human-readable and version-controllable
DAG algorithms are pure functions, easily testable
Pipeline mode enables iterative accumulation workflows

Negative

More complex than ralph-loop — DAG orchestration adds conceptual overhead
Inngest step limits may constrain very large swarms (hundreds of agents)
Filesystem-based inter-agent communication is loose — no type safety between agents
Each agent is a full codex/claude session — cost scales with agent count × iterations

packages/system-bus/src/inngest/functions/swarm-orchestrator.ts — new orchestrator function
packages/system-bus/src/inngest/functions/swarm-agent-exec.ts — new agent execution function
packages/system-bus/src/swarm/ — new directory
- dag.ts — dependency graph + wave computation (port from oh-my-pi)
- schema.ts — YAML parsing + validation (port from oh-my-pi + extensions)
- state.ts — Redis state management
- types.ts — swarm event types, agent config types
packages/system-bus/src/inngest/events.ts — add swarm event schemas
packages/system-bus/src/inngest/serve.ts — register new functions
packages/cli/src/commands/swarm.ts — new CLI subcommand
Optional: pi-tools/swarm/index.ts — pi extension for /swarm command

Patterns to Follow

oh-my-pi’s DAG algorithms are MIT-licensed pure functions — port directly with attribution
oh-my-pi’s YAML format is the external interface — maintain compatibility for potential config sharing
Inngest step naming: swarm-{name}-wave-{N}-dispatch, swarm-{name}-wave-{N}-collect
Redis state follows existing patterns from loop PRD state (ADR-0011)
Gateway notifications follow ADR-0018 patterns (pushGatewayEvent)
Agent execution follows existing codex-exec / ralph-loop patterns for spawning

What to Avoid

Don’t port oh-my-pi’s subprocess execution (runSubprocess) — use Inngest step.invoke instead
Don’t port filesystem state tracking — use Redis
Don’t try to pass data between agents via Inngest events — filesystem is the communication channel
Don’t conflate with ralph-loop — they serve different purposes and coexist

Verification

Define a 3-agent fan-out + synthesizer YAML → joelclaw swarm start → all 4 agents run, waves respected
Cycle in YAML → rejected with clear error before any execution
Kill the worker mid-swarm → restart → swarm resumes from last completed wave (Inngest durability)
joelclaw swarm status <name> → shows per-agent status, current wave, iteration
Pipeline mode with target_count=3 → full DAG executes 3 times
Gateway receives progress notifications during swarm execution

More Information

Reference Implementation

oh-my-pi swarm extension (can1357/oh-my-pi):

packages/swarm-extension/src/swarm/dag.ts — dependency graph + topological sort
packages/swarm-extension/src/swarm/schema.ts — YAML parsing + validation
packages/swarm-extension/src/swarm/pipeline.ts — wave-based execution controller
packages/swarm-extension/src/swarm/executor.ts — agent subprocess spawning
packages/swarm-extension/src/swarm/state.ts — filesystem state persistence
packages/swarm-extension/README.md — YAML format reference + usage patterns

Credit: Can Boluk (@can1357) for the YAML format, DAG algorithms, and swarm patterns. MIT licensed.

ADR-0005 — Durable multi-agent coding loops (Inngest foundation)
ADR-0011 — Redis-backed PRD state (same state pattern)
ADR-0018 — Pi-native gateway with Redis event bridge (notification pattern)
ADR-0026 — Background agents via Inngest (execution pattern)