ADR-0118 (superseded)

Koko Shadow Executor Mode

2026-02-23

Context

ADR-0117 proposed novel workloads for Koko (health pulse, event digest, file watcher). Sean Grove suggested a better approach: shadow execution — Koko runs the same workloads as the TypeScript stack in parallel, on the same inputs, and we compare results. This is how you actually validate whether BEAM is better, worse, or equivalent for joelclaw’s real workloads.

Shadow execution eliminates the “apples to oranges” problem. Instead of “Koko can do X,” the question becomes “Koko does X better/faster/more reliably than TypeScript does X, on the same data.”

Decision

How it works

Event arrives
    ├─→ Inngest function (TypeScript) → authoritative result → writes to state
    └─→ Koko shadow (Elixir) → shadow result → writes to shadow log only
                                                     ↓
                                              Compare: latency, output, errors

Koko observes the same Redis events that trigger Inngest functions
Koko executes its own implementation of the same function
Koko writes results to a shadow log (joelclaw:koko:shadow:<function> in Redis, or a local file)
Koko never writes to authoritative state — no Typesense upserts, no Todoist mutations, no gateway notifications
A comparison process periodically diffs shadow results against real results
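
The comparison step can be sketched as a join on event ID. This is a minimal illustration rather than the actual comparison process: the `ShadowEntry` and `AuthoritativeEntry` shapes, the `pairByEventId` name, and the idea that the TypeScript side exposes a result and latency per event are all assumptions. Only the `event_id` key comes from the shadow log itself.

```typescript
// Hypothetical minimal shapes. Only `event_id` is taken from the shadow log;
// the authoritative shape is an assumption about what the OTEL data could yield.
interface ShadowEntry { event_id: string; result: unknown; latency_ms: number }
interface AuthoritativeEntry { event_id: string; result: unknown; latency_ms: number }

interface Pairing {
  matched: { event_id: string; outputsEqual: boolean; latencyDeltaMs: number }[];
  shadowOnly: string[];        // Koko ran, no TypeScript result found
  authoritativeOnly: string[]; // TypeScript ran, Koko missed the event
}

// Canonical JSON with sorted object keys, so key order does not affect equality.
// Inputs are assumed to be plain JSON data (no functions, no undefined).
function canonical(value: unknown): string {
  if (Array.isArray(value)) return "[" + value.map(canonical).join(",") + "]";
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const parts = Object.keys(obj)
      .sort()
      .map((k) => JSON.stringify(k) + ":" + canonical(obj[k]));
    return "{" + parts.join(",") + "}";
  }
  return JSON.stringify(value);
}

function pairByEventId(shadow: ShadowEntry[], authoritative: AuthoritativeEntry[]): Pairing {
  const authById = new Map<string, AuthoritativeEntry>();
  for (const a of authoritative) authById.set(a.event_id, a);

  const out: Pairing = { matched: [], shadowOnly: [], authoritativeOnly: [] };
  const seen = new Set<string>();

  for (const s of shadow) {
    const a = authById.get(s.event_id);
    if (a === undefined) {
      out.shadowOnly.push(s.event_id);
      continue;
    }
    seen.add(s.event_id);
    out.matched.push({
      event_id: s.event_id,
      outputsEqual: canonical(s.result) === canonical(a.result),
      latencyDeltaMs: s.latency_ms - a.latency_ms, // negative means the shadow was faster
    });
  }

  authById.forEach((_entry, id) => {
    if (!seen.has(id)) out.authoritativeOnly.push(id);
  });
  return out;
}
```

Exact equality only suits structured outputs such as health statuses or changed-file lists. LLM-generated text would need a looser comparison, which this sketch does not attempt.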

Shadow log schema

{
  "function": "heartbeat",
  "event_id": "evt_abc123",
  "shadow_result": { ... },
  "shadow_latency_ms": 42,
  "shadow_error": null,
  "timestamp": "2026-02-23T21:00:00Z"
}

The TypeScript side already logs via OTEL. Koko logs shadow results to its own namespace. A comparison can be done offline — no real-time coupling needed.
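
For offline comparison, the schema above could be mirrored as a type with a tolerant line parser. This is a sketch under assumptions: the names `ShadowRecord` and `parseShadowLine` are hypothetical, and treating `shadow_error` as either `null` or a string is a guess, since the example only shows `null`.

```typescript
// Shape of one shadow log entry, mirroring the JSON example above.
// `shadow_result` stays `unknown` because its structure varies per function.
interface ShadowRecord {
  function: string;
  event_id: string;
  shadow_result: unknown;
  shadow_latency_ms: number;
  shadow_error: string | null; // assumption: errors are recorded as strings
  timestamp: string;           // ISO 8601, e.g. "2026-02-23T21:00:00Z"
}

// Parse one JSONL line. Returns null instead of throwing so that a single
// malformed line does not abort an offline comparison run.
function parseShadowLine(line: string): ShadowRecord | null {
  let raw: unknown;
  try {
    raw = JSON.parse(line);
  } catch {
    return null;
  }
  if (typeof raw !== "object" || raw === null) return null;

  const r = raw as Record<string, unknown>;
  const latency = r["shadow_latency_ms"];
  const error = r["shadow_error"];
  const timestamp = r["timestamp"];

  const ok =
    typeof r["function"] === "string" &&
    typeof r["event_id"] === "string" &&
    "shadow_result" in r &&
    typeof latency === "number" &&
    Number.isFinite(latency) &&
    (error === null || typeof error === "string") &&
    typeof timestamp === "string" &&
    !Number.isNaN(Date.parse(timestamp));

  return ok ? (raw as ShadowRecord) : null;
}
```

Rejected lines are worth counting rather than silently dropping, since a high reject rate is itself a signal about the shadow writer.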

Candidate functions for shadow execution

Ranked by suitability (read-only inputs, clear outputs, no mutation required):

Function	Inputs	Output to shadow	Why
heartbeat checks	Redis ping, Typesense health, Inngest API	health status per service	Pure reads. Both check the same endpoints. Compare detection speed + accuracy.
event digest	N hours of events from Redis	summary text	Same events in, LLM summary out. Compare quality + latency.
friction analysis	observation corpus from Typesense	pattern list	Same observations in, patterns out. Compare what each finds.
proposal triage	pending memory proposals	approve/reject/needs-review verdicts	Same proposals in, verdicts out. Compare triage quality.
ADR validation	ADR file contents	validation errors/warnings	Pure file read + check. Compare completeness.
content-sync detection	Vault file mtimes	changed file list	Both scan same directory. Compare detection latency.

Functions explicitly excluded from shadow

Anything that mutates external state: Todoist close, Front archive, Telegram send, PDS write, Vercel deploy hooks
Agent loops: code generation requires tool execution and git mutations
Gateway message routing: has side effects by definition

Comparison metrics

For each shadowed function, track:

Metric	What it tells us
Latency (ms)	Is BEAM faster for this workload?
Error rate	Does Koko crash less? Does supervisor recovery mask errors?
Output quality	When both produce text (digests, summaries), are they equivalent?
Recovery time	When a check fails, how fast does each recover? (Expected BEAM advantage, to be verified.)
Resource usage	Memory per function. Process count vs Node.js event loop.
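
The latency and error-rate rows can be reduced to a few summary numbers per function and per stack. The sketch below is one way to do that, not a prescribed method: `RunSample`, `percentile`, and `summarize` are hypothetical names, and the choice of nearest-rank p50/p95 is an assumption.

```typescript
// One execution of a function on either stack (hypothetical shape).
interface RunSample { latency_ms: number; error: string | null }

// Nearest-rank percentile; `p` is in (0, 100]. Returns NaN for empty input.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN;
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  const index = Math.min(sorted.length, Math.max(1, rank)) - 1;
  return sorted[index];
}

// Summarize all runs, including failed ones, so that slow failures
// still show up in the latency distribution.
function summarize(samples: RunSample[]) {
  const latencies = samples.map((s) => s.latency_ms);
  const errors = samples.filter((s) => s.error !== null).length;
  return {
    runs: samples.length,
    errorRate: samples.length === 0 ? 0 : errors / samples.length,
    p50_ms: percentile(latencies, 50),
    p95_ms: percentile(latencies, 95),
  };
}
```

Running `summarize` once over the Koko samples and once over the TypeScript samples for the same function gives two comparable rows. Output quality and recovery time need separate treatment and are not covered here.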

Implementation phases

Phase 1: Single shadow (heartbeat)

Koko shadows the heartbeat checks only
Logs results to ~/Code/joelhooks/koko/shadow/heartbeat.jsonl
Manual comparison after 7 days

Phase 2: Multi-shadow (3-4 functions)

Add friction analysis, event digest, content-sync
Structured shadow log in Redis (joelclaw:koko:shadow:*)
Basic comparison script (Elixir Mix task or CLI command)

Phase 3: Automated comparison

Koko reads OTEL events for TypeScript function results
Automatic diff report: latency distribution, error rates, output quality
Weekly shadow report posted to gateway (or Vault note)

Shadow execution rules

Shadow must never write to authoritative stores. No Typesense, no Todoist, no Convex, no gateway notify. Violation = immediate disable.
Shadow reads are fine. Redis GET, Typesense search, file reads, HTTP GETs to health endpoints — all OK.
Shadow LLM calls use the same provider/model as TypeScript. Apples to apples. Different models invalidate the comparison.
Shadow failures are logged, not escalated. Koko crashing during shadow execution is data, not an incident.
Shadow results are append-only. Never overwrite or delete shadow logs. They’re the evidence base for ADR-0114 decisions.
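
The first, second, and last rules lend themselves to a mechanical check. The following is a hedged sketch of such a guard, assuming every external call the shadow makes can be tagged with a target and a kind. The `ShadowOp` descriptor, the `ShadowGuard` class, and the target names are hypothetical, and the sketch is written in TypeScript for illustration even though Koko itself is Elixir.

```typescript
// Hypothetical descriptor for an external call made during shadow execution.
type Target =
  | "redis" | "typesense" | "todoist" | "convex"
  | "gateway" | "file" | "http" | "shadow_log";

interface ShadowOp { target: Target; kind: "read" | "append" | "write" }

function isAllowedInShadow(op: ShadowOp): boolean {
  // Reads are fine against any target.
  if (op.kind === "read") return true;
  // The only permitted mutation is appending to the shadow log;
  // overwrites and deletes of the log count as "write" and are rejected.
  return op.target === "shadow_log" && op.kind === "append";
}

// Latches into a disabled state on the first violation ("immediate disable").
class ShadowGuard {
  private disabledReason: string | null = null;

  isDisabled(): boolean {
    return this.disabledReason !== null;
  }

  check(op: ShadowOp): void {
    if (this.disabledReason !== null) {
      throw new Error(`shadow disabled: ${this.disabledReason}`);
    }
    if (!isAllowedInShadow(op)) {
      this.disabledReason = `${op.kind} to ${op.target}`;
      throw new Error(`shadow violation: ${this.disabledReason}`);
    }
  }
}
```

A guard like this only helps if every client the shadow uses routes through it. It complements, rather than replaces, giving the shadow read-only credentials where the backing service supports them.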

Consequences

Koko’s value proposition becomes empirically testable, not hypothetical
Every shadow run produces data that informs the ADR-0114 migration decision
Zero risk to production state: shadow never writes to authoritative stores. It does add read load on shared services and duplicate LLM calls.
Forces Koko implementations to be input-compatible with TypeScript versions (good discipline)
Shadow comparison becomes the graduation exam: if Koko consistently matches or beats TypeScript on 3+ functions, ADR-0114 Strategy B (hybrid) has concrete evidence
If Koko consistently loses or adds no value, we kill ADR-0114 with data instead of opinion