Configurable Sub-Agent Roster
Context
joelclaw delegates work to sub-agents (codex for coding, Inngest infer() for LLM calls in functions). But:
- Codex only supports OpenAI models — design tasks need Opus/Sonnet (taste, not speed)
- No roster — agent selection is ad-hoc, hardcoded per call site
- Role.md exists but isn’t actionable — we have role definitions but no mechanism to spawn a sub-agent with a specific role, model, and tool set
Pi’s native subagent tool (from nicobailon/pi-subagents) already provides:
- Agent definitions (YAML frontmatter + markdown body) with discovery by scope (builtin/user/project)
- Model + thinking level per agent, tool/extension sandboxing
- Chain execution with {previous}, {task}, {chain_dir} template vars
- Parallel dispatch, skill injection
- 6 builtin agents: context-builder, planner, researcher, reviewer, scout, worker
Key insight: Pi-subagents handles agent definitions and spawning. joelclaw needs to make infer() resolve pi agent definitions and wrap execution in Inngest steps for durability. Best of both worlds, minimal new code.
Decision
Adopt pi-subagents as the definition and discovery layer, with Inngest as the execution backbone for durable agent dispatch from system-bus functions.
Gaps identified during review (addressed by this revision)
- The original ADR did not define how chain/parallel semantics map to Inngest step return values ({previous} was underspecified).
- Agent definition parsing/validation rules were implicit; invalid frontmatter, unknown models, and unsafe paths were not explicitly rejected.
- infer() compatibility risk was unaddressed: agent currently maps to inference-router profiles (classifier, triage, reflector) in packages/system-bus/src/lib/inference.ts.
- Extension sandboxing policy was not translated from pi-subagents to joelclaw’s production worker threat model.
Agent Definition Format
Markdown files in ~/.joelclaw/agents/{name}.md (user scope) and .joelclaw/agents/{name}.md (project scope):
---
name: designer
description: Frontend design with taste — UI components, layouts, visual polish
model: claude-opus-4-6
thinking: high
tools: read, bash, edit, write
skill: frontend-design, ui-animation, emilkowal-animations
---
You are a design-focused agent. You create distinctive, production-grade
frontend interfaces. Use the StatusPulseDot and StatusLed components from
@repo/ui/status-badge for activity indicators...
Roster Configuration
.joelclaw/config.toml:
[agents]
# Default agent for unspecified tasks
default = "coder"
# Route by task type
[agents.routing]
design = "designer" # $frontend-design tagged tasks
code = "coder" # Default coding tasks
research = "researcher" # Web research, repo autopsy
review = "reviewer" # Code review, PR review
Execution Backbone: Inngest
All sub-agent execution runs through Inngest — not raw subprocess spawning. This gives durability, retries, observability, and session streaming for free.
// Single agent dispatch
const result = await step.run("designer", async () => {
return infer(task, { agent: "designer" });
});
// Chain execution — each step memoized
const recon = await step.run("scout-recon", () => infer(task, { agent: "scout" }));
const plan = await step.run("planner", () => infer(recon, { agent: "planner" }));
const impl = await step.run("worker", () => infer(plan, { agent: "worker" }));
// Parallel execution
const [a, b] = await Promise.all([
step.run("task-a", () => infer(taskA, { agent: "designer" })),
step.run("task-b", () => infer(taskB, { agent: "coder" })),
]);
Why Inngest over raw subprocess:
- Step memoization: crash mid-chain, resume from last completed step
- Retries: model 503, timeout → automatic retry with backoff
- Observability: every step in Inngest dashboard + OTEL
- Concurrency control: throttle parallel agents, prevent resource contention
- Timeouts: timeouts.finish kills runaway agent calls
- Cancellation: cancel running chains by event
- Session streaming: results stream back into gateway/interactive sessions via Redis event bridge
infer() extension:
// infer() resolves agent definition → pi flags
await infer("redesign this component", { agent: "designer" });
// Resolves to: pi -p --no-session --models claude-opus-4-6:high --tools read,bash,edit,write
// With agent's system prompt injected via --append-system-prompt
Inngest execution model parity (mapped from pi-subagents internals)
pi-subagents currently implements:
- ~/.repo-autopsy/nicobailon/pi-subagents/execution.ts: runSync() derives pi flags from agent definitions (--models, tool splitting, extension policy, skill injection, MCP_DIRECT_TOOLS).
- ~/.repo-autopsy/nicobailon/pi-subagents/chain-execution.ts + settings.ts: sequential/parallel chains with {task}, {previous}, {chain_dir} substitution and per-step behavior overrides.
- ~/.repo-autopsy/nicobailon/pi-subagents/async-execution.ts + subagent-runner.ts: detached background runner with status.json, events.jsonl, and filesystem polling.
In joelclaw, we replace subprocess orchestration with Inngest primitives:
- Single agent: one step.run("agent:{name}") wrapping infer(...).
- Chain: one Inngest function with deterministic step IDs (chain:{index}:{agent}) for memoized replay/resume.
- Parallel step: parallel step.run calls with explicit concurrency and optional failFast behavior.
- Async/background: no detached worker needed; step.sendEvent + Inngest run state replaces FS watcher polling.
- Progress streaming: reuse the existing gateway.progress()/gateway.notify() pattern inside step.run blocks (same replay-safe pattern used in packages/system-bus/src/inngest/functions/story-pipeline.ts).
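The chain mapping above can be illustrated with a toy stand-in for step memoization. This is a sketch under stated assumptions, not the Inngest SDK: MemoStep, runChain, and the InferFn signature are hypothetical names, and the cache is an in-memory Map rather than durable run state. It only shows why deterministic step IDs of the form chain:{index}:{agent} let a replay skip completed work.

```typescript
// Hypothetical stand-in for step memoization (NOT the Inngest SDK).
// Results are cached by step ID, so a replay skips finished steps.
class MemoStep {
  private cache = new Map<string, string>();

  async run(id: string, fn: () => Promise<string>): Promise<string> {
    const hit = this.cache.get(id);
    if (hit !== undefined) return hit; // memoized: do not re-execute
    const result = await fn();
    this.cache.set(id, result);
    return result;
  }
}

// Assumed shape of an infer-like call: prompt in, text out.
type InferFn = (prompt: string, agent: string) => Promise<string>;

// Sequential chain: each step ID follows chain:{index}:{agent}, and each
// step receives the previous step's text as its prompt.
async function runChain(
  step: MemoStep,
  agents: string[],
  task: string,
  infer: InferFn,
): Promise<string> {
  let previous = task;
  for (const [index, agent] of agents.entries()) {
    const input = previous;
    previous = await step.run(`chain:${index}:${agent}`, () => infer(input, agent));
  }
  return previous;
}
```

Calling runChain twice with the same MemoStep invokes infer once per step; the second call returns the cached results, which is the resume behaviour the deterministic IDs are meant to enable.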
{previous} template semantics in Inngest chains
{previous} must map to structured step outputs, not only plain text. Define chain step returns as:
type AgentStepResult = {
agent: string;
text: string;
model?: string;
provider?: string;
usage?: LlmUsage;
artifacts?: Record<string, string>;
exitCode: number;
};
Template context per step:
- {task}: original top-level request
- {previous}: previous step text (or aggregated parallel text)
- {previous_json}: JSON-serialized previous AgentStepResult (or array for parallel)
- {chain_dir}: durable artifact directory path
For parallel steps, aggregate outputs with stable headers (pi-subagents style: === Parallel Task N (agent) ===) so downstream prompts remain parseable and deterministic.
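A minimal sketch of the template semantics described above, assuming 1-based task numbering in the parallel headers and assuming that unknown placeholders are left untouched (neither is stated in this ADR). renderTemplate and aggregateParallel are hypothetical helper names, and AgentStepResult is trimmed to the fields the sketch needs.

```typescript
// Trimmed version of the AgentStepResult shape defined in this section.
type AgentStepResult = { agent: string; text: string; exitCode: number };

// Aggregate parallel results under stable headers in input order, so the
// combined text does not depend on which task finished first.
// Assumption: N in the header is 1-based.
function aggregateParallel(results: AgentStepResult[]): string {
  return results
    .map((r, i) => `=== Parallel Task ${i + 1} (${r.agent}) ===\n${r.text}`)
    .join("\n\n");
}

// Substitute {task}, {previous}, {previous_json}, {chain_dir} in one pass.
// Assumption: unknown placeholders are kept verbatim.
function renderTemplate(
  template: string,
  ctx: { task: string; previous: AgentStepResult | AgentStepResult[]; chainDir: string },
): string {
  const previousText = Array.isArray(ctx.previous)
    ? aggregateParallel(ctx.previous)
    : ctx.previous.text;
  const vars = new Map<string, string>([
    ["task", ctx.task],
    ["previous", previousText],
    ["previous_json", JSON.stringify(ctx.previous)],
    ["chain_dir", ctx.chainDir],
  ]);
  return template.replace(/\{(\w+)\}/g, (match: string, key: string) => vars.get(key) ?? match);
}
```

Because substitution happens in a single pass, a {task} token that happens to appear inside a previous step's output is not expanded a second time.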
Integration Points
- Gateway interactive: $frontend-design tag in user message dispatches agent/task.run Inngest event → result streams back via Redis
- Inngest functions: infer() gains agent option that resolves from roster
- CLI: joelclaw agent list, joelclaw agent run <name> <prompt> (fires Inngest event, streams result)
- Codex delegation: unchanged for OpenAI tasks, designer agent for Anthropic tasks
- Session feedback: Inngest step results emit agent/task.complete events, gateway picks up via Redis subscription and streams into active session
Existing joelclaw plumbing to reuse:
- packages/system-bus/src/inngest/middleware/gateway.ts already carries originSession and exposes gateway.progress()/notify()/alert() helpers.
- packages/system-bus/src/inngest/functions/agent-loop/utils.ts#pushGatewayEvent() already fans events to gateway + originSession targets.
- packages/gateway/src/channels/redis.ts already prefers originSession routing and packages/gateway/src/daemon.ts already routes responses by active source.
Event contracts to standardize:
- agent/task.run: { taskId, agent, task, originSession?, cwd?, timeoutMs?, metadata? }
- agent/task.progress: { taskId, step, message, originSession? } (optional for long chains)
- agent/task.complete: { taskId, agent, status, output, usage?, artifacts?, originSession? }
- agent/task.failed: { taskId, agent, error, retryable, attempt, originSession? }
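The contracts above list field names only, so the following sketch assigns plausible types as an assumption (strings for IDs and text, number for attempt and timeoutMs, unknown for usage). completeFor and failedFor are hypothetical helpers showing how taskId, agent, and originSession could be carried from the run event into the terminal events.

```typescript
// Field names come from the event contracts; the field TYPES are assumptions.
type RunData = {
  taskId: string;
  agent: string;
  task: string;
  originSession?: string;
  cwd?: string;
  timeoutMs?: number;
  metadata?: Record<string, unknown>;
};
type CompleteEvent = {
  name: "agent/task.complete";
  data: { taskId: string; agent: string; status: string; output: string; usage?: unknown; artifacts?: Record<string, string>; originSession?: string };
};
type FailedEvent = {
  name: "agent/task.failed";
  data: { taskId: string; agent: string; error: string; retryable: boolean; attempt: number; originSession?: string };
};

// Copy originSession only when it is set, so the key stays truly optional.
function sessionOf(run: RunData): { originSession?: string } {
  return run.originSession !== undefined ? { originSession: run.originSession } : {};
}

// Hypothetical builder for the success event.
function completeFor(run: RunData, output: string): CompleteEvent {
  return {
    name: "agent/task.complete",
    data: { taskId: run.taskId, agent: run.agent, status: "completed", output, ...sessionOf(run) },
  };
}

// Hypothetical builder for the failure event.
function failedFor(run: RunData, error: string, attempt: number, retryable: boolean): FailedEvent {
  return {
    name: "agent/task.failed",
    data: { taskId: run.taskId, agent: run.agent, error, retryable, attempt, ...sessionOf(run) },
  };
}
```

Keeping originSession on every terminal event is what lets the gateway route the result back to the session that issued the request.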
Discovery Priority
- Project: .joelclaw/agents/ (highest)
- User: ~/.joelclaw/agents/ (medium)
- Builtin: joelclaw/agents/ in repo (lowest, git-tracked)
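The precedence above amounts to a merge where higher-priority scopes overwrite lower ones by agent name. A minimal sketch, assuming agents have already been loaded into flat entries; AgentEntry and mergeRoster are hypothetical names, and the file paths in any usage are illustrative only.

```typescript
type Scope = "project" | "user" | "builtin";
type AgentEntry = { name: string; scope: Scope; filePath: string };

// Highest priority first, matching the discovery order listed above.
const PRIORITY: Scope[] = ["project", "user", "builtin"];

// Build a name → entry map in which a project definition shadows a user
// definition, which in turn shadows a builtin one.
function mergeRoster(entries: AgentEntry[]): Map<string, AgentEntry> {
  const roster = new Map<string, AgentEntry>();
  // Walk from lowest to highest priority so later writes win.
  for (const scope of [...PRIORITY].reverse()) {
    for (const entry of entries) {
      if (entry.scope === scope) roster.set(entry.name, entry);
    }
  }
  return roster;
}
```

With a designer agent defined in both builtin and project scope, mergeRoster keeps the project entry, while agents defined only in builtin scope remain available.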
Patterns Adopted from pi-subagents
Agent definition format — YAML frontmatter + markdown body. Fields: name, description, model, thinking (off/minimal/low/medium/high/xhigh), tools, skill, extensions, output, defaultReads, defaultProgress, interactive.
Extension sandboxing — extensions: field controls which pi extensions load in the sub-agent:
- Absent → all extensions load (default)
- Empty → --no-extensions
- List → --no-extensions --extension a --extension b
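The three cases above hinge on distinguishing an absent field from an empty list. A small sketch of that mapping follows; extensionFlags is a hypothetical helper name, and only the flags named in the list are used.

```typescript
// Map the extensions field to pi flags, following the three cases above.
function extensionFlags(extensions?: string[]): string[] {
  // Absent: emit nothing, so all extensions load (pi-subagents default).
  if (extensions === undefined) return [];
  // Empty or explicit list: start from a clean slate.
  const flags = ["--no-extensions"];
  // Explicit list: re-enable only the named extensions, in order.
  for (const ext of extensions) {
    flags.push("--extension", ext);
  }
  return flags;
}
```

Returning an argument array rather than a joined string avoids shell-quoting issues when an extension path contains spaces.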
Three execution modes — Single (one agent, one task), Chain (sequential with {previous} template var + shared {chain_dir}), Parallel (concurrent with max concurrency).
Spawn mechanism — pi -p --mode json --no-session with --models, --tools, --extension, --append-system-prompt flags derived from agent definition. Captures stdout as JSONL, tracks usage/tokens/duration.
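The flag derivation described above could look roughly like the sketch below. AgentDefinition and buildPiArgs are hypothetical names; the flags are the ones named in this ADR, the MODEL:THINKING form follows the --models examples elsewhere in this document, and how the task prompt itself is passed to pi is deliberately left out because the ADR does not specify it.

```typescript
// Assumed in-memory shape of a parsed agent definition.
type AgentDefinition = {
  name: string;
  model: string;
  thinking?: string;
  tools?: string[];
  systemPrompt: string;
};

// Derive pi arguments from a definition. The prompt argument is omitted on
// purpose: this ADR does not state how the task is handed to pi.
function buildPiArgs(def: AgentDefinition): string[] {
  // Thinking level rides on the model as MODEL:THINKING when present.
  const model = def.thinking ? `${def.model}:${def.thinking}` : def.model;
  const args = ["-p", "--mode", "json", "--no-session", "--models", model];
  if (def.tools && def.tools.length > 0) {
    args.push("--tools", def.tools.join(","));
  }
  if (def.systemPrompt.trim() !== "") {
    args.push("--append-system-prompt", def.systemPrompt);
  }
  return args;
}
```

For the designer definition shown earlier, this yields --models claude-opus-4-6:high and --tools read,bash,edit,write, with the markdown body supplied through --append-system-prompt.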
Async mode — Background execution via worker process. FSWatcher on results directory detects completion. Widget polls progress.
Skill injection — skill: field resolves skill files, injects content into system prompt before spawn.
Key Differences from pi-subagents
- Chain execution via Inngest steps: durable, retryable, observable vs raw subprocess chains
- Inngest-native: long-running agent tasks dispatched as Inngest steps with memoization
- Role.md integration: agent definitions can reference roles/*.md for shared context
- No TUI clarify step: joelclaw agents are headless; confirmation happens via gateway/CLI
- Discovery adds repo scope: builtin agents live in joelclaw/agents/ (git-tracked, lowest priority)
Agent Definition Validation (required)
Use strict runtime validation at load-time before any dispatch:
- Frontmatter parsing: use a real YAML parser (not regex-only key/value parsing) so arrays/booleans are typed reliably.
  - Do not rely on permissive tool schemas alone (Type.Any usage in pi-subagents/schemas.ts); enforce strict server-side validation in joelclaw.
- Identity: name + description required; name must match file basename (designer.md → name: designer).
- Model: must resolve in @joelclaw/inference-router catalog (support bare IDs and provider/model IDs).
- Thinking level: enum off|minimal|low|medium|high|xhigh.
- Tools: each entry must be either allowed builtin tool or approved extension path.
- Skills: each skill in skill/skills must resolve from canonical skill loading paths; missing skills are validation errors (not warnings) in strict mode.
- Path safety: output, defaultReads, and chain file paths must reject traversal (..) outside configured workspace unless explicitly absolute-allowlisted.
- Extensions policy: absent vs empty vs explicit list semantics must be preserved:
  - absent: inherit platform default policy
  - empty list: disable extensions (--no-extensions)
  - explicit list: allowlist only
- Role composition: optional role: roles/<name>.md must resolve to existing repo role file before run.
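A partial sketch of the load-time checks above, operating on frontmatter that a YAML parser has already turned into an object. validateAgent is a hypothetical name. The sketch covers only identity, thinking level, and path safety; it conservatively rejects any ".." segment rather than resolving against a workspace root, and it omits the model catalog, skill, tool, and role checks because those depend on joelclaw internals not shown in this ADR.

```typescript
const THINKING_LEVELS = ["off", "minimal", "low", "medium", "high", "xhigh"];

// Frontmatter as produced by a YAML parser: values are untrusted.
type Frontmatter = Record<string, unknown>;

// Returns a list of validation errors; an empty list means the checks passed.
function validateAgent(fileName: string, fm: Frontmatter): string[] {
  const errors: string[] = [];

  // Identity: name + description required; name must match the file basename.
  const basename = fileName.replace(/^.*\//, "").replace(/\.md$/, "");
  const name = fm["name"];
  if (typeof name !== "string" || name === "") {
    errors.push("name is required");
  } else if (name !== basename) {
    errors.push(`name "${name}" must match file basename "${basename}"`);
  }
  const description = fm["description"];
  if (typeof description !== "string" || description === "") {
    errors.push("description is required");
  }

  // Thinking level: optional, but must be one of the enum values.
  const thinking = fm["thinking"];
  if (thinking !== undefined && (typeof thinking !== "string" || !THINKING_LEVELS.includes(thinking))) {
    errors.push(`thinking must be one of ${THINKING_LEVELS.join("|")}`);
  }

  // Path safety (simplified): reject any ".." path segment outright.
  for (const key of ["output", "defaultReads"]) {
    const value = fm[key];
    const paths: unknown[] = typeof value === "string" ? [value] : Array.isArray(value) ? value : [];
    for (const p of paths) {
      if (typeof p === "string" && p.split("/").includes("..")) {
        errors.push(`${key}: path traversal rejected: ${p}`);
      }
    }
  }
  return errors;
}
```

Collecting all errors instead of throwing on the first one lets a loader report every problem in a definition file in a single pass.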
infer() compatibility contract (critical)
packages/system-bus/src/lib/inference.ts currently maps agent to inference-router profiles via resolveProfile() (packages/inference-router/src/profiles.ts) and existing production callers rely on this (reflect.ts, task-triage.ts, email-cleanup.ts).
Adoption rules:
- Resolution order: roster agent definition → legacy inference profile → explicit options.
- Preserve legacy behavior for classifier, reflector, triage until migrated.
- Add an explicit profile option long-term; keep agent backward compatible during migration window.
- runPiAttempt() currently hardcodes --no-extensions and --model; roster mode must support full flag derivation (--models, tools, extensions, appended prompt) while keeping locked-down defaults for non-roster calls.
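The resolution order can be sketched as a two-stage lookup. This is an assumption-laden illustration rather than the code in inference.ts: resolveAgent and the Resolved type are hypothetical, the roster and legacy profiles are passed in as plain collections, and the final fallback throws (matching the shipped behaviour recorded in the Phase Plan) instead of falling through to explicit options.

```typescript
type Resolved =
  | { source: "roster"; name: string; model: string }
  | { source: "profile"; name: string };

// Roster first, then legacy inference profile, otherwise fail loudly.
function resolveAgent(
  name: string,
  roster: Map<string, { model: string }>,
  legacyProfiles: Set<string>,
): Resolved {
  const def = roster.get(name);
  if (def !== undefined) {
    return { source: "roster", name, model: def.model };
  }
  if (legacyProfiles.has(name)) {
    // Existing callers (classifier, triage, reflector) keep working.
    return { source: "profile", name };
  }
  throw new Error(`Unknown agent roster entry: ${name}`);
}
```

One consequence of this order is that a roster file named after a legacy profile would shadow it, which is the naming-collision risk noted under Risks.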
Extension sandboxing in joelclaw
pi-subagents allows extension paths from agent definitions. In joelclaw worker context this is a security boundary, so defaults must remain deny-by-default:
- Default execution remains no extensions for system-bus unless explicitly allowlisted.
- Add agents.extension_allowlist in .joelclaw/config.toml; reject non-allowlisted extension paths at validation time.
- Record effective extension set in OTEL metadata for every run.
Consequences
Positive
- Design tasks route to Opus automatically
- Agent selection is explicit and configurable
- New agent types added without code changes
- Model/tool/skill combos are named and reusable
Negative
- Another config surface to maintain
- Agent definitions can drift from actual capabilities
- pi-subagents is a third-party dependency (or we steal patterns)
Risks
- Over-engineering if we only need 2-3 agents
- Chain execution complexity if adopted later
- Agent/profile naming collision (agent currently means inference profile in infer())
- Path/extension injection risk from unvalidated agent markdown
- Parallel {previous} aggregation ambiguity can create non-deterministic downstream prompts
- Gateway replay duplicates if progress/notify emits happen outside step.run
Resolved Questions
- Chain artifact directories → Durable per-session workspace with retention policy. Must survive crashes for Inngest replay.
- Chain topology → Full DAG support from day one. Not just linear + parallel groups.
- JSON output contracts → Per-agent configurable. Some agents (coder, reviewer) require strict output schemas; others (designer, researcher) are freeform text. Add outputSchema field to agent definition: when present, output is validated; when absent, plain text passthrough.
- Agent definition storage → Filesystem as source of truth (git-tracked), mirrored to Typesense for search + version pinning. Enables hot reload without git pull and searchable agent catalog.
Phase Plan
Phase 0+1: Loader + infer() Integration ✅ SHIPPED (2026-02-28)
- packages/system-bus/src/lib/agent-roster.ts: loads pi-subagent-format .md files from project (.pi/agents/) and user (~/.pi/agent/agents/) scopes with module-level cache
- infer() resolution order: roster → profile → throw (backward compatible with classifier/triage/reflector)
- Roster agents derive full pi flags: --models MODEL:THINKING, --tools, --append-system-prompt, conditional --no-extensions
- OTEL metadata includes agentSource (roster/profile/direct), agentName, agentDefinitionPath
- 3 project-scoped agents committed: agents/{designer,coder,ops}.md (symlinked to .pi/agents/)
- 6/6 unit tests: project load, user load, project-overrides-user, cache hit, missing agent, malformed frontmatter
- Commit: a709622
- Deferred: strict schema validation (model catalog check, skill resolution, path safety), role composition, extension allowlist. These become Phase 2 prerequisites.
Phase 2: Inngest Functions + CLI + Gateway Routing ✅ SHIPPED (2026-02-28)
- agent/task.run, agent/task.complete, agent/task.progress event types added to Inngest client
- agent-task-run Inngest function: validate → execute via infer() → emit complete/failed
- Concurrency: 3 per agent type, 2 retries, 5m timeout
- Gateway progress notification before execution, OTEL on start/complete/fail
- originSession carried through all events (gateway middleware passthrough)
- joelclaw agent list: discover agents from all scopes
- joelclaw agent show <name>: display full definition + system prompt
- joelclaw agent run <name> <task>: fire agent/task.run event, return taskId
- HATEOAS JSON responses with next_actions throughout
- Commits: f922842 (CLI), 5348b55 (Inngest function + events)
- Deferred: Gateway $frontend-design tag routing (gateway pi session can already dispatch via joelclaw agent run or inngest_send)
Phase 3: Chain Execution ✅ SHIPPED (2026-02-28)
- agent/chain.run, agent/chain.complete event types added to Inngest client
- agent-chain-run Inngest function: sequential steps with {task}/{previous} template substitution
- Parallel groups via Promise.allSettled with === Parallel Task N (agent) === aggregation headers
- failFast option (default false: continue on step failure, collect partial results)
- Concurrency: 2 chains, 1 retry, 15m timeout
- OTEL per step + chain completion/failure; gateway progress per step (replay-safe)
- CLI: joelclaw agent chain scout,planner+reviewer,coder --task "..." (+ = parallel, , = sequential)
- 5 unit tests: template substitution, parallel aggregation, sequential passing, error handling
- Commit: ab1b885
- Deferred: output artifact validation (warning-first, fail-on-strict mode), DAG topology beyond linear+parallel
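The chain syntax used by the CLI (comma for sequential stages, plus for parallel agents within a stage) can be parsed into a list of stages. parseChainSpec is a hypothetical name, and dropping empty segments is an assumption of this sketch rather than documented CLI behaviour.

```typescript
// Parse "scout,planner+reviewer,coder" into sequential stages, where each
// stage is a list of agents that run in parallel.
function parseChainSpec(spec: string): string[][] {
  return spec
    .split(",") // "," separates sequential stages
    .map((stage) =>
      stage
        .split("+") // "+" separates parallel agents within a stage
        .map((agent) => agent.trim())
        .filter((agent) => agent !== ""),
    )
    .filter((stage) => stage.length > 0); // assumption: ignore empty stages
}
```

For the example in the list above, the result is three stages: scout alone, then planner and reviewer in parallel, then coder.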
Runtime proof + recovery timeline (2026-02-28)
Attempt 1 — blocked by local runtime reachability
- joelclaw agent list and joelclaw agent show coder succeeded.
- joelclaw agent run ... failed while local Inngest API was unreachable (localhost:8288).
- No reliable event→run trace could be captured in that attempt.
Attempt 2 — ingress restored, roster drift surfaced
- Event send path recovered.
- agent/task.run reached Agent Task Run, but failed with Unknown agent roster entry: coder.
- This proved ingress was healthy while worker runtime resolution was stale.
Remediation applied
- Patched roster resolution to search ancestor directories for builtin agents/ when worker CWD is nested:
  - commit a3e013a
  - file: packages/system-bus/src/lib/agent-roster.ts
  - tests: packages/system-bus/src/lib/__tests__/agent-roster.test.ts
- Published system-bus-worker image with this fix and rolled k8s deployment.
- Restarted host worker process (the active executor for Agent Task Run in this environment).
- Recovered local control plane after transient outage (Colima/Talos restart + taint cleanup + pod recycle).
Final runtime proof — PASS
- bun run packages/cli/src/cli.ts agent run coder "reply with OK" --timeout 20
  - event ID: 01KJK9JJ1C5P54ZH4F200XYWBD
- bun run packages/cli/src/cli.ts event 01KJK9JJ1C5P54ZH4F200XYWBD
  - run ID: 01KJK9JJEX3A6NW55WQSZXKWNY
  - function: Agent Task Run
  - status: COMPLETED
  - output includes {"status":"completed", ... "text":"OK"}
Conclusion: ADR-0180 runtime contract is now validated end-to-end (list/show/run/chain/watch paths + truthful event navigation + durable execution).
Validation smoke test — ts=1772321467 ✅ DEEP PROOF
Second full end-to-end proof with production binary against live k8s worker, capturing full OTEL metadata:
Roster
- joelclaw agent list → ok: true, total: 3 (coder/designer/ops, all source: builtin)
- joelclaw agent show coder → filePath, systemPrompt, model, tools, skills all present
Dispatch
- joelclaw agent run coder "ADR-0180 smoke test ts=1772321467 — echo the string 'SMOKE_OK' and exit"
- Event 01KJK9MT0X00WXREWX3KZW6F2X accepted · taskId at-1772321662985-gv2oj9
Run
- Run 01KJK9MT3H0N5AGJH8F1PYJ6Z2 · COMPLETED · 3,759 ms
- Output: { status: "completed", text: "SMOKE_OK", model: "anthropic/claude-sonnet-4-6", provider: "anthropic" }
Step trace (7 steps, all COMPLETED)
emit-started-otel → validate → agent-task-progress-execute → execute (2,404 ms) → agent-task-complete → emit-completed-otel → Finalization
OTEL (5 events)
- agent.task.started: taskId, agent, originSession, cwd, timeoutMs
- model_router.request: agentSource: "roster", agentName: "coder", agentDefinitionPath, resolvedModel
- model_router.route: policy version, resolved model
- model_router.result: 2,140 ms, fallbackUsed, usage
- agent.task.completed: model, provider, durationMs
agentSource: "roster" in OTEL confirms builtin scope resolution is healthy end-to-end. Historical OTEL also shows the pre-deploy failure arc: 5 agent.task.failed events with "Unknown agent roster entry: coder" (22:50–23:22 UTC), followed by clean completions post-deploy — observable failure→fix→recovery captured in Typesense.
Phase 4: Live streaming + async UX ✅ SHIPPED (2026-02-28)
- joelclaw agent watch <taskId|chainId>: NDJSON streaming watcher
- Redis pub/sub subscription to joelclaw:notify:gateway for real-time progress events
- Inngest API polling fallback when Redis is degraded or task completed before watch started
- Auto-detects task (at-*) vs chain (ac-*) IDs, adjusts timeout (300s vs 900s)
- Graceful degradation documented in-code: Redis down → polling only, pre-completed → immediate result
- --timeout option, SIGINT/SIGTERM cleanup, HATEOAS next_actions in terminal events
- Commit: 9ab8c6d
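The ID auto-detection in the watcher can be sketched as a prefix check. detectWatchTarget and WatchTarget are hypothetical names; the prefixes and default timeouts come from the list above, while throwing on an unrecognised prefix and letting an explicit timeout take precedence are assumptions of this sketch.

```typescript
type WatchTarget = { kind: "task" | "chain"; timeoutSeconds: number };

// Classify an ID by prefix and choose the default watch timeout.
function detectWatchTarget(id: string, timeoutOverride?: number): WatchTarget {
  let kind: "task" | "chain";
  if (id.startsWith("at-")) {
    kind = "task";
  } else if (id.startsWith("ac-")) {
    kind = "chain";
  } else {
    // Assumption: unknown prefixes are rejected rather than guessed.
    throw new Error(`Unrecognised task/chain ID: ${id}`);
  }
  // Defaults from the list above: 300s for tasks, 900s for chains.
  const defaultTimeout = kind === "task" ? 300 : 900;
  return { kind, timeoutSeconds: timeoutOverride ?? defaultTimeout };
}
```

The longer chain default lines up with the 15m chain timeout recorded under Phase 3, versus the 5m task timeout under Phase 2.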
References
- nicobailon/pi-subagents: pi extension for subagent delegation
  - execution.ts, chain-execution.ts, async-execution.ts, agents.ts, skills.ts, types.ts, schemas.ts, agents/*.md
- packages/system-bus/src/lib/inference.ts: current infer implementation and agent profile resolution path
- packages/inference-router/src/profiles.ts: legacy classifier/triage/reflector profiles
- packages/system-bus/src/inngest/functions/story-pipeline.ts: replay-safe gateway signaling + contract-first stage execution
- packages/system-bus/src/inngest/middleware/gateway.ts: originSession routing helpers
- packages/system-bus/src/inngest/functions/agent-loop/utils.ts (pushGatewayEvent)
- packages/gateway/src/channels/redis.ts + packages/gateway/src/daemon.ts: source-aware response routing and Redis event bridge
- ADR-0170: Agent Role System
- ADR-0163: Adaptive Prompt Architecture