ADR-0190 (accepted)

Memory Yield Contract

  • Status: proposed
  • Date: 2026-03-02
  • Deciders: Joel, Panda
  • Relates to: ADR-0021, ADR-0068, ADR-0077, ADR-0096, ADR-0140, ADR-0146, ADR-0183, ADR-0189

Context

The memory system's intent is clear: capture durable patterns, reject noise, and make each session smarter than the last.

Observed behavior is falling short:

  1. repeated rewrite and classification calls that return empty/null outputs,
  2. expensive LLM loops that do not create actionable outputs,
  3. high tool churn in long sessions with weak outcome density,
  4. telemetry blind spots that hide true token/cost usage on parts of system-bus inference.

This is a coherence problem, not just a spend problem. We need a hard contract that measures memory yield and blocks no-op churn.

Decision

1) Memory is governed by yield, not activity

From this ADR forward, memory and recall paths are judged by outcome metrics, not by the number of runs, traces, or events.

2) Adopt a mandatory Memory Yield Scorecard

Track and report these metrics as first-class signals:

Metric | Definition | Why it matters
rewrite_fallback_rate | rewrite attempts ending in fallback / total rewrite attempts | detects wasted rewrite loops
null_output_rate | inference calls with parse-failed/null output / total calls | catches paid no-op calls
tool_churn_ratio | toolUse turns / terminal-output turns | detects busy loops
memory_yield_rate | retrieved memories that are referenced or used in output/action / retrieved memories | measures actual compounding
cost_per_useful_outcome | LLM cost / useful outcomes (promotion, action, successful decision) | ROI anchor
usage_coverage_rate | calls with non-empty usage+cost metadata / total traced calls | ensures observability truth
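
As an illustration of the definitions above, the following is a minimal sketch that derives each metric from raw window counters. The WindowCounts shape, its field names, and the computeScorecard helper are assumptions made for this example and are not taken from the system-bus codebase. A zero denominator yields null so that "no data" is not mistaken for a 0% rate.

```typescript
// Hypothetical raw counters for one reporting window (all field names are assumptions).
interface WindowCounts {
  rewriteAttempts: number;
  rewriteFallbacks: number;
  inferenceCalls: number;
  nullOrParseFailedCalls: number;
  toolUseTurns: number;
  terminalOutputTurns: number;
  memoriesRetrieved: number;
  memoriesUsed: number;
  llmCostUsd: number;
  usefulOutcomes: number;
  tracedCalls: number;
  callsWithUsageAndCost: number;
}

// One field per scorecard metric; null means the window had no data for it.
interface Scorecard {
  rewrite_fallback_rate: number | null;
  null_output_rate: number | null;
  tool_churn_ratio: number | null;
  memory_yield_rate: number | null;
  cost_per_useful_outcome: number | null;
  usage_coverage_rate: number | null;
}

// Safe division: returns null instead of NaN or Infinity on an empty denominator.
function ratio(numerator: number, denominator: number): number | null {
  return denominator > 0 ? numerator / denominator : null;
}

function computeScorecard(c: WindowCounts): Scorecard {
  return {
    rewrite_fallback_rate: ratio(c.rewriteFallbacks, c.rewriteAttempts),
    null_output_rate: ratio(c.nullOrParseFailedCalls, c.inferenceCalls),
    tool_churn_ratio: ratio(c.toolUseTurns, c.terminalOutputTurns),
    memory_yield_rate: ratio(c.memoriesUsed, c.memoriesRetrieved),
    cost_per_useful_outcome: ratio(c.llmCostUsd, c.usefulOutcomes),
    usage_coverage_rate: ratio(c.callsWithUsageAndCost, c.tracedCalls),
  };
}

// Made-up counts, purely to show the call shape.
const exampleScorecard = computeScorecard({
  rewriteAttempts: 40,
  rewriteFallbacks: 4,
  inferenceCalls: 200,
  nullOrParseFailedCalls: 10,
  toolUseTurns: 30,
  terminalOutputTurns: 10,
  memoriesRetrieved: 50,
  memoriesUsed: 20,
  llmCostUsd: 2.5,
  usefulOutcomes: 5,
  tracedCalls: 200,
  callsWithUsageAndCost: 190,
});
console.log(exampleScorecard);
```

Keeping the counters separate from the derived ratios means the same raw counts can be re-aggregated over a different window without recomputing from events.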

3) Add hard gate policy for memory-path health

If any critical metric breaches configured thresholds, the system must:

  1. open a degradation gate for the offending path,
  2. fall back to deterministic/simpler behavior,
  3. emit a single high-signal operator alert,
  4. stay degraded until recovery criteria are met.

No silent retries forever.
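
The four gate steps above can be sketched as a small state machine. This is a sketch under stated assumptions, not the shipped implementation: the GatePolicy fields, the "N consecutive healthy windows" recovery rule, and the stepGate function are all hypothetical, and the metric is assumed to be one where higher values are worse (for example null_output_rate).

```typescript
type GateState = "healthy" | "degraded";

// Hypothetical policy: the recovery rule (N consecutive good windows) is an assumption.
interface GatePolicy {
  breachAtOrAbove: number; // metric value that opens the degradation gate
  recoverBelow: number;    // metric value that counts as a healthy window
  recoveryWindows: number; // consecutive healthy windows required to recover
}

interface Gate {
  state: GateState;
  healthyStreak: number;
}

interface GateStep {
  gate: Gate;
  alert: boolean;       // true only on the healthy -> degraded transition
  useFallback: boolean; // true while the path must use deterministic/simpler behavior
}

function stepGate(gate: Gate, value: number, policy: GatePolicy): GateStep {
  if (gate.state === "healthy") {
    if (value >= policy.breachAtOrAbove) {
      // Steps 1-3: open the gate, fall back, and alert exactly once.
      return { gate: { state: "degraded", healthyStreak: 0 }, alert: true, useFallback: true };
    }
    return { gate, alert: false, useFallback: false };
  }
  // Step 4: stay degraded until enough consecutive healthy windows are seen.
  const streak = value < policy.recoverBelow ? gate.healthyStreak + 1 : 0;
  if (streak >= policy.recoveryWindows) {
    return { gate: { state: "healthy", healthyStreak: 0 }, alert: false, useFallback: false };
  }
  return { gate: { state: "degraded", healthyStreak: streak }, alert: false, useFallback: true };
}

// Example run over five made-up windows of a "higher is worse" metric.
const examplePolicy: GatePolicy = { breachAtOrAbove: 0.15, recoverBelow: 0.05, recoveryWindows: 2 };
let exampleGate: Gate = { state: "healthy", healthyStreak: 0 };
const exampleStates: GateState[] = [];
const exampleAlertWindows: number[] = [];
[0.02, 0.2, 0.3, 0.04, 0.04].forEach((value, index) => {
  const step = stepGate(exampleGate, value, examplePolicy);
  exampleGate = step.gate;
  exampleStates.push(step.gate.state);
  if (step.alert) exampleAlertWindows.push(index);
});
console.log(exampleStates.join(","), exampleAlertWindows);
// prints "healthy,degraded,degraded,degraded,healthy" and [ 1 ]
```

Alerting only on the transition is what keeps the operator signal to a single alert per degradation, even when the metric stays in breach for many windows.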

4) Freeze memory feature expansion while core health is failing

No new memory/retrieval features are eligible for work while any critical scorecard gate is red. Work priority is:

  1. observability truth,
  2. no-op prevention,
  3. output contract reliability,
  4. then feature expansion.

5) Split enforcement into focused child ADRs

This umbrella ADR is enacted by:

  • ADR-0191 (no-op inference circuit breakers),
  • ADR-0192 (recall rewrite reliability contract),
  • ADR-0193 (task triage output contract).

Consequences

Good

  • restores the original memory-system intent (compounding patterns, not noise),
  • converts waste complaints into measurable gates and remediations,
  • prevents cost/latency churn from masquerading as progress.

Tradeoffs

  • reduced autonomy on degraded paths until they recover,
  • additional telemetry and gating logic in system-bus/cli paths,
  • some previously “best effort” behavior becomes explicit failure.

Risks

  • threshold tuning may be noisy early,
  • aggressive gates may temporarily reduce recall quality.

Mitigation: start with conservative thresholds and tighten once usage coverage is reliable.

Required Skills (Preflight)

Load before implementation starts:

  • langfuse — tracing contract, usage/cost correctness, and attribution
  • system-bus — implementation points for inference and task workflows
  • inngest-durable-functions — durable gate/circuit state transitions
  • inngest-flow-control — cooldown and suppression behavior
  • joelclaw — operational validation via CLI and run inspection
  • system-architecture — ensure gate behavior matches runtime topology

Implementation Plan (vector clock)

  1. V1 (truth): make usage/cost coverage explicit and queryable for all memory-critical inference calls.
  2. V2 (no-op control): implement no-op circuit breaker contract (ADR-0191).
  3. V3 (recall reliability): implement rewrite reliability and skip contract (ADR-0192).
  4. V4 (triage reliability): enforce task triage output contract (ADR-0193).
  5. V5 (governance): add scorecard summary to daily health reporting and gate feature work when red.

Implementation Progress

V1 (truth) — shipped 2026-03-04

Added the joelclaw memory scorecard CLI command, which queries the Typesense otel_events collection and computes:

Metric | Source | Threshold (green / yellow / red)
rewrite_fallback_rate | recall OTEL (strategy=fallback) | <10% / <30% / ≥30%
null_output_rate | reflect OTEL (failed/total) | <5% / <15% / ≥15%
memory_yield_rate | recall metadata (returned>0/total) | >70% / >50% / ≤50%
usage_coverage_rate | all memory OTEL (success facet) | >95% / >85% / ≤85%
observe_volume | observe-session-noted events | >0 / - / -
reflect_volume | reflect events | >0 / - / -

First scorecard run (24h window):

  • rewrite_fallback_rate: 0% ✅
  • null_output_rate: 0% ✅
  • memory_yield_rate: 59% 🟡 (50 of 122 recalls returned empty, so 72 returned results)
  • usage_coverage_rate: 98.5% ✅
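
To make the threshold bands concrete, here is a minimal sketch that classifies the first-run values against the green/yellow/red thresholds from the V1 table. The Band type, the thresholds constant, and the classify function are illustrative names assumed for this example; the actual CLI may structure this differently. The two volume metrics are left out because they only have a green condition.

```typescript
type Status = "green" | "yellow" | "red";

type MetricName =
  | "rewrite_fallback_rate"
  | "null_output_rate"
  | "memory_yield_rate"
  | "usage_coverage_rate";

// Hypothetical band shape: `green` and `yellow` are the cutoffs from the V1 table.
interface Band {
  direction: "lower_is_better" | "higher_is_better";
  green: number;
  yellow: number;
}

const thresholds: Record<MetricName, Band> = {
  rewrite_fallback_rate: { direction: "lower_is_better", green: 0.1, yellow: 0.3 },
  null_output_rate: { direction: "lower_is_better", green: 0.05, yellow: 0.15 },
  memory_yield_rate: { direction: "higher_is_better", green: 0.7, yellow: 0.5 },
  usage_coverage_rate: { direction: "higher_is_better", green: 0.95, yellow: 0.85 },
};

function classify(value: number, band: Band): Status {
  if (band.direction === "lower_is_better") {
    if (value < band.green) return "green";
    if (value < band.yellow) return "yellow";
    return "red";
  }
  if (value > band.green) return "green";
  if (value > band.yellow) return "yellow";
  return "red";
}

// First scorecard run: 72 of 122 recalls returned at least one result.
const firstRunYield = 72 / 122;
console.log(classify(firstRunYield, thresholds.memory_yield_rate));   // prints "yellow"
console.log(classify(0.985, thresholds.usage_coverage_rate));         // prints "green"
console.log(classify(0, thresholds.rewrite_fallback_rate));           // prints "green"
```

Encoding the direction per metric avoids the easy mistake of applying a "lower is better" comparison to a yield or coverage rate.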

V2 (no-op control) — shipped 2026-03-04

Added an empty-transcript circuit breaker in observe.ts. When sanitizeObservationTranscript() produces empty text, the function skips the LLM call entirely and returns early with an observe.skipped.empty_transcript OTEL event. Previously, empty transcripts were passed to Haiku with the placeholder "No user-facing transcript content available", which produced generic or empty observations that burned tokens with zero yield.

Observable: joelclaw otel search "observe.skipped.empty_transcript" --hours 24
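
The short-circuit described above can be sketched as a pure decision function that runs before any model call. This is an assumption-laden illustration and not the code in observe.ts: planObserve and ObservePlan are hypothetical names, and the sanitizer is passed in as a parameter because the real sanitizeObservationTranscript() implementation is not shown in this ADR.

```typescript
// Hypothetical result type: either skip with an OTEL event name, or proceed to the LLM.
type ObservePlan =
  | { kind: "skip"; otelEvent: "observe.skipped.empty_transcript" }
  | { kind: "call_llm"; transcript: string };

// Decides whether an observe run should reach the LLM at all.
// `sanitize` stands in for the real sanitizer, whose behavior is assumed here.
function planObserve(
  rawTranscript: string,
  sanitize: (raw: string) => string,
): ObservePlan {
  const transcript = sanitize(rawTranscript);
  if (transcript.trim().length === 0) {
    // Nothing user-facing survived sanitization: skip the paid call and record why.
    return { kind: "skip", otelEvent: "observe.skipped.empty_transcript" };
  }
  return { kind: "call_llm", transcript };
}

// Usage with a trivial stand-in sanitizer that only trims whitespace.
const trimOnly = (raw: string): string => raw.trim();
console.log(planObserve("   \n", trimOnly).kind);                    // prints "skip"
console.log(planObserve("user: ship the scorecard", trimOnly).kind); // prints "call_llm"
```

Separating the decision from the model call keeps the breaker testable without network access and makes the skip reason explicit in telemetry.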

Verification Checklist

  • Memory scorecard metrics are emitted and queryable in OTEL (joelclaw memory scorecard)
  • Empty-transcript observe calls short-circuit before LLM (V2 circuit breaker)
  • Critical-gate breaches produce deterministic degraded behavior (not endless retries)
  • Recall rewrite fallback loops are bounded by contract
  • Task triage null outputs are treated as failures, not successes
  • At least one daily report includes scorecard + gate state + active degradations