ADR-0118 (superseded)

Koko Shadow Executor Mode

Context

ADR-0117 proposed novel workloads for Koko (health pulse, event digest, file watcher). Sean Grove suggested a better approach: shadow execution — Koko runs the same workloads as the TypeScript stack in parallel, on the same inputs, and we compare results. This is how you actually validate whether BEAM is better, worse, or equivalent for joelclaw’s real workloads.

Shadow execution eliminates the “apples to oranges” problem. Instead of “Koko can do X,” the question becomes “Koko does X better/faster/more reliably than TypeScript does X, on the same data.”

Decision

How it works

Event arrives
    ├─→ Inngest function (TypeScript) → authoritative result → writes to state
    └─→ Koko shadow (Elixir) → shadow result → writes to shadow log only

Compare (offline): latency, output, errors
  1. Koko observes the same Redis events that trigger Inngest functions
  2. Koko executes its own implementation of the same function
  3. Koko writes results to a shadow log (joelclaw:koko:shadow:<function> in Redis, or local file)
  4. Koko never writes to authoritative state — no Typesense upserts, no Todoist mutations, no gateway notifications
  5. A comparison process periodically diffs shadow results against real results
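
The flow above can be sketched as a single dispatch function. This is an illustrative TypeScript sketch, not the real Inngest wiring (all names are hypothetical, and in practice the shadow runs in a separate BEAM process, not an in-process call). It shows the invariant that matters: shadow outcomes are recorded as data, and a shadow failure can never alter the authoritative result.

```typescript
type ShadowEntry = {
  function: string;
  event_id: string;
  shadow_result: unknown;
  shadow_latency_ms: number;
  shadow_error: string | null;
  timestamp: string;
};

// Stand-in for the joelclaw:koko:shadow:<function> log.
const shadowLog: ShadowEntry[] = [];

function handleEvent(
  fn: string,
  eventId: string,
  authoritative: () => unknown, // TypeScript implementation: writes real state
  shadow: () => unknown,        // Koko implementation: observed only
): unknown {
  const started = Date.now();
  try {
    const result = shadow();
    shadowLog.push({
      function: fn, event_id: eventId, shadow_result: result,
      shadow_latency_ms: Date.now() - started, shadow_error: null,
      timestamp: new Date().toISOString(),
    });
  } catch (err) {
    // Shadow failures are logged, not escalated.
    shadowLog.push({
      function: fn, event_id: eventId, shadow_result: null,
      shadow_latency_ms: Date.now() - started, shadow_error: String(err),
      timestamp: new Date().toISOString(),
    });
  }
  // Only the authoritative path returns a result and touches real state.
  return authoritative();
}
```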

Shadow log schema

{
  "function": "heartbeat",
  "event_id": "evt_abc123",
  "shadow_result": { ... },
  "shadow_latency_ms": 42,
  "shadow_error": null,
  "timestamp": "2026-02-23T21:00:00Z"
}
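
For concreteness, the schema can be expressed as a type plus JSONL helpers (one JSON object per line, the same shape a file-based shadow log would use). The helper names are hypothetical; the key builder follows the Redis namespace from step 3.

```typescript
interface ShadowEntry {
  function: string;
  event_id: string;
  shadow_result: unknown;
  shadow_latency_ms: number;
  shadow_error: string | null;
  timestamp: string; // ISO-8601, UTC
}

// Matches the joelclaw:koko:shadow:<function> Redis namespace.
function shadowKey(fn: string): string {
  return `joelclaw:koko:shadow:${fn}`;
}

// Serialize entries as JSONL: one JSON object per line, append-only.
function toJsonl(entries: ShadowEntry[]): string {
  return entries.map((e) => JSON.stringify(e) + "\n").join("");
}

function fromJsonl(text: string): ShadowEntry[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as ShadowEntry);
}
```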

The TypeScript side already logs via OTEL. Koko logs shadow results to its own namespace. A comparison can be done offline — no real-time coupling needed.

Candidate functions for shadow execution

Ranked by suitability (read-only inputs, clear outputs, no mutation required):

| Function | Inputs | Output to shadow | Why |
|---|---|---|---|
| heartbeat checks | Redis ping, Typesense health, Inngest API | health status per service | Pure reads. Both check the same endpoints. Compare detection speed + accuracy. |
| event digest | N hours of events from Redis | summary text | Same events in, LLM summary out. Compare quality + latency. |
| friction analysis | observation corpus from Typesense | pattern list | Same observations in, patterns out. Compare what each finds. |
| proposal triage | pending memory proposals | approve/reject/needs-review verdicts | Same proposals in, verdicts out. Compare triage quality. |
| ADR validation | ADR file contents | validation errors/warnings | Pure file read + check. Compare completeness. |
| content-sync detection | Vault file mtimes | changed file list | Both scan same directory. Compare detection latency. |

Functions explicitly excluded from shadow

  • Anything that mutates external state: Todoist close, Front archive, Telegram send, PDS write, Vercel deploy hooks
  • Agent loops: code generation requires tool execution and git mutations
  • Gateway message routing: has side effects by definition

Comparison metrics

For each shadowed function, track:

| Metric | What it tells us |
|---|---|
| Latency (ms) | Is BEAM faster for this workload? |
| Error rate | Does Koko crash less? Does supervisor recovery mask errors? |
| Output quality | When both produce text (digests, summaries), are they equivalent? |
| Recovery time | When a check fails, how fast does each recover? (BEAM advantage) |
| Resource usage | Memory per function. Process count vs Node.js event loop. |
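
The offline diff over these metrics is straightforward to sketch. The shapes and function names below are assumptions (the real pipeline would join Koko's shadow log against the OTEL export); the core operation is pairing runs by `event_id` and aggregating.

```typescript
// One executed run of a function, from either the real or the shadow side.
type Run = { event_id: string; latency_ms: number; error: string | null; output: unknown };

// Pair shadow runs with real runs by event_id and aggregate comparison metrics.
function compare(real: Run[], shadow: Run[]) {
  const byId = new Map(real.map((r) => [r.event_id, r]));
  let paired = 0, shadowFaster = 0, outputMatch = 0, realErrors = 0, shadowErrors = 0;
  for (const s of shadow) {
    const r = byId.get(s.event_id);
    if (!r) continue; // unmatched shadow runs are excluded from the diff
    paired++;
    if (s.latency_ms < r.latency_ms) shadowFaster++;
    if (r.error) realErrors++;
    if (s.error) shadowErrors++;
    // Crude structural equality; text outputs (digests) need a quality judge instead.
    if (JSON.stringify(s.output) === JSON.stringify(r.output)) outputMatch++;
  }
  return {
    paired,
    shadow_faster_pct: paired ? (100 * shadowFaster) / paired : 0,
    output_match_pct: paired ? (100 * outputMatch) / paired : 0,
    real_error_rate: paired ? realErrors / paired : 0,
    shadow_error_rate: paired ? shadowErrors / paired : 0,
  };
}
```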

Implementation phases

Phase 1: Single shadow (heartbeat)

  • Koko shadows the heartbeat checks only
  • Logs results to ~/Code/joelhooks/koko/shadow/heartbeat.jsonl
  • Manual comparison after 7 days

Phase 2: Multi-shadow (3-4 functions)

  • Add friction analysis, event digest, content-sync
  • Structured shadow log in Redis (joelclaw:koko:shadow:*)
  • Basic comparison script (Elixir Mix task or CLI command)

Phase 3: Automated comparison

  • Koko reads OTEL events for TypeScript function results
  • Automatic diff report: latency distribution, error rates, output quality
  • Weekly shadow report posted to gateway (or Vault note)

Shadow execution rules

  1. Shadow must never write to authoritative stores. No Typesense, no Todoist, no Convex, no gateway notify. Violation = immediate disable.
  2. Shadow reads are fine. Redis GET, Typesense search, file reads, HTTP GETs to health endpoints — all OK.
  3. Shadow LLM calls use the same provider/model as TypeScript. Apples to apples. Different models invalidate the comparison.
  4. Shadow failures are logged, not escalated. Koko crashing during shadow execution is data, not an incident.
  5. Shadow results are append-only. Never overwrite or delete shadow logs. They’re the evidence base for ADR-0114 decisions.
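
Rules 1 and 5 can be enforced mechanically rather than by convention. A sketch, with hypothetical names and an in-memory map standing in for Redis: any write outside the shadow namespace is refused, and shadow writes only ever append.

```typescript
const SHADOW_PREFIX = "joelclaw:koko:shadow:";

// Rule 1: shadow must never write to authoritative stores.
function assertShadowWrite(key: string): void {
  if (!key.startsWith(SHADOW_PREFIX)) {
    throw new Error(`shadow mode: refusing write to authoritative key ${key}`);
  }
}

// Rule 5: shadow results are append-only; entries accumulate, nothing is
// overwritten or deleted.
function guardedWrite(store: Map<string, string[]>, key: string, value: string): void {
  assertShadowWrite(key);
  const existing = store.get(key) ?? [];
  existing.push(value);
  store.set(key, existing);
}
```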

Consequences

  • Koko’s value proposition becomes empirically testable, not hypothetical
  • Every shadow run produces data that informs the ADR-0114 migration decision
  • Zero risk to production — shadow is purely additive
  • Forces Koko implementations to be input-compatible with TypeScript versions (good discipline)
  • Shadow comparison becomes the graduation exam: if Koko consistently matches or beats TypeScript on 3+ functions, ADR-0114 Strategy B (hybrid) has concrete evidence
  • If Koko consistently loses or adds no value, we kill ADR-0114 with data instead of opinion