Memory Yield Contract
- Status: proposed
- Date: 2026-03-02
- Deciders: Joel, Panda
- Relates to: ADR-0021, ADR-0068, ADR-0077, ADR-0096, ADR-0140, ADR-0146, ADR-0183, ADR-0189
Context
The memory system intent is clear: capture durable patterns, reject noise, and make each session smarter than the last.
Observed behavior is falling short:
- repeated rewrite and classification calls that return empty/null outputs,
- expensive LLM loops that do not create actionable outputs,
- high tool churn in long sessions with weak outcome density,
- telemetry blind spots that hide true token/cost usage on parts of system-bus inference.
This is a coherence problem, not just a spend problem. We need a hard contract that measures memory yield and blocks no-op churn.
Decision
1) Memory is governed by yield, not activity
From this ADR forward, memory and recall paths are judged by outcome metrics, not by the number of runs, traces, or events.
2) Adopt a mandatory Memory Yield Scorecard
Track and report these metrics as first-class signals:
| Metric | Definition | Why it matters |
|---|---|---|
| rewrite_fallback_rate | rewrite attempts ending in fallback / total rewrite attempts | detects wasted rewrite loops |
| null_output_rate | inference calls with parse-failed/null output / total calls | catches paid no-op calls |
| tool_churn_ratio | toolUse turns / terminal-output turns | detects busy loops |
| memory_yield_rate | retrieved memories that are referenced or used in output/action / retrieved memories | measures actual compounding |
| cost_per_useful_outcome | LLM cost / useful outcomes (promotion, action, successful decision) | ROI anchor |
| usage_coverage_rate | calls with non-empty usage+cost metadata / total traced calls | ensures observability truth |
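The ratios in the table can be sketched as plain arithmetic over raw counters. This is a minimal illustration, not the shipped implementation: the `ScorecardCounts` shape and its field names are assumptions made up for this example, and the real telemetry schema may differ. A zero denominator yields `null` so that "no data" is never reported as a healthy 0%.

```typescript
// Hypothetical raw counters for one reporting window.
// Field names are illustrative only, not the real telemetry schema.
interface ScorecardCounts {
  rewriteAttempts: number;
  rewriteFallbacks: number;
  inferenceCalls: number;
  nullOutputs: number;
  toolUseTurns: number;
  terminalOutputTurns: number;
  retrievedMemories: number;
  usedMemories: number;
  llmCostUsd: number;
  usefulOutcomes: number;
  tracedCalls: number;
  callsWithUsage: number;
}

interface Scorecard {
  rewrite_fallback_rate: number | null;
  null_output_rate: number | null;
  tool_churn_ratio: number | null;
  memory_yield_rate: number | null;
  cost_per_useful_outcome: number | null;
  usage_coverage_rate: number | null;
}

// Returns null when the denominator is zero, so an idle window is
// distinguishable from a window that measured 0.
function ratio(numerator: number, denominator: number): number | null {
  return denominator > 0 ? numerator / denominator : null;
}

function computeScorecard(c: ScorecardCounts): Scorecard {
  return {
    rewrite_fallback_rate: ratio(c.rewriteFallbacks, c.rewriteAttempts),
    null_output_rate: ratio(c.nullOutputs, c.inferenceCalls),
    tool_churn_ratio: ratio(c.toolUseTurns, c.terminalOutputTurns),
    memory_yield_rate: ratio(c.usedMemories, c.retrievedMemories),
    cost_per_useful_outcome: ratio(c.llmCostUsd, c.usefulOutcomes),
    usage_coverage_rate: ratio(c.callsWithUsage, c.tracedCalls),
  };
}
```

Note that cost_per_useful_outcome comes back as `null` when a window produced spend but no useful outcomes. A reporting layer would probably want to surface that case explicitly rather than hide it.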
3) Add hard gate policy for memory-path health
If any critical metric breaches configured thresholds, the system must:
- open a degradation gate for the offending path,
- fall back to deterministic/simpler behavior,
- emit a single high-signal operator alert,
- stay degraded until recovery criteria are met.
No silent retries forever.
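The gate policy above can be read as a small two-state machine. The sketch below is one possible shape under stated assumptions: `DegradationGate`, `GateConfig`, and the "N consecutive healthy windows" recovery rule are hypothetical and are not taken from ADR-0191 or the system-bus code. It assumes a metric where higher values are worse.

```typescript
type GateState = "healthy" | "degraded";

// Hypothetical config: thresholds apply to a metric where higher is worse.
interface GateConfig {
  breachThreshold: number;   // value at or above this trips the gate
  recoveryThreshold: number; // value must drop below this to count as healthy
  recoveryWindows: number;   // consecutive healthy windows required to recover
}

class DegradationGate {
  private state: GateState = "healthy";
  private healthyStreak = 0;
  readonly alerts: string[] = [];
  private readonly path: string;
  private readonly cfg: GateConfig;

  constructor(path: string, cfg: GateConfig) {
    this.path = path;
    this.cfg = cfg;
  }

  // Feed one evaluation window's metric value and get the resulting state.
  observe(value: number): GateState {
    if (this.state === "healthy") {
      if (value >= this.cfg.breachThreshold) {
        this.state = "degraded";
        this.healthyStreak = 0;
        // One alert per healthy -> degraded transition, not one per breach.
        this.alerts.push(`gate opened for ${this.path}: value=${value}`);
      }
      return this.state;
    }
    // Degraded: stay degraded until the recovery criteria are met.
    this.healthyStreak =
      value < this.cfg.recoveryThreshold ? this.healthyStreak + 1 : 0;
    if (this.healthyStreak >= this.cfg.recoveryWindows) {
      this.state = "healthy";
    }
    return this.state;
  }

  // Route work to the deterministic fallback while the path is degraded.
  run<T>(primary: () => T, fallback: () => T): T {
    return this.state === "degraded" ? fallback() : primary();
  }
}
```

Using separate breach and recovery thresholds gives the gate hysteresis, so a metric hovering near the limit does not flap between states and re-alert on every window.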
4) Freeze memory feature expansion while failing core health
No new memory/retrieval features are eligible while critical scorecard gates are red. Work priority is:
- observability truth,
- no-op prevention,
- output contract reliability,
- then feature expansion.
5) Split enforcement into focused child ADRs
This umbrella ADR is enacted by:
- ADR-0191 (no-op inference circuit breakers),
- ADR-0192 (recall rewrite reliability contract),
- ADR-0193 (task triage output contract).
Consequences
Good
- restores the original memory-system intent (compounding patterns, not noise),
- converts waste complaints into measurable gates and remediations,
- prevents cost/latency churn from masquerading as progress.
Tradeoffs
- reduced autonomy on degraded paths until they recover,
- additional telemetry and gating logic in system-bus/cli paths,
- some previously “best effort” behavior becomes explicit failure.
Risks
- threshold tuning may be noisy early,
- aggressive gates may temporarily reduce recall quality.
Mitigation: start with conservative thresholds and tighten once usage coverage is reliable.
Required Skills (Preflight)
Load before implementation starts:
- langfuse: tracing contract, usage/cost correctness, and attribution
- system-bus: implementation points for inference and task workflows
- inngest-durable-functions: durable gate/circuit state transitions
- inngest-flow-control: cooldown and suppression behavior
- joelclaw: operational validation via CLI and run inspection
- system-architecture: ensure gate behavior matches runtime topology
Implementation Plan (vector clock)
- V1 (truth): make usage/cost coverage explicit and queryable for all memory-critical inference calls.
- V2 (no-op control): implement no-op circuit breaker contract (ADR-0191).
- V3 (recall reliability): implement rewrite reliability and skip contract (ADR-0192).
- V4 (triage reliability): enforce task triage output contract (ADR-0193).
- V5 (governance): add scorecard summary to daily health reporting and gate feature work when red.
Implementation Progress
V1 (truth) — shipped 2026-03-04
Added the joelclaw memory scorecard CLI command. It queries the Typesense otel_events collection and computes:
| Metric | Source | Threshold (green/yellow/red) |
|---|---|---|
| rewrite_fallback_rate | recall OTEL (strategy=fallback) | <10% / <30% / ≥30% |
| null_output_rate | reflect OTEL (failed/total) | <5% / <15% / ≥15% |
| memory_yield_rate | recall metadata (returned>0/total) | >70% / >50% / ≤50% |
| usage_coverage_rate | all memory OTEL (success facet) | >95% / >85% / ≤85% |
| observe_volume | observe-session-noted events | >0 / - / - |
| reflect_volume | reflect events | >0 / - / - |
First scorecard run (24h window):
- rewrite_fallback_rate: 0% ✅
- null_output_rate: 0% ✅
- memory_yield_rate: 59% 🟡 (50 of 122 recalls returned empty)
- usage_coverage_rate: 98.5% ✅
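To make the green/yellow/red bands concrete, here is a minimal classification sketch. The threshold numbers are copied from the V1 table. The `THRESHOLDS` map, the `classify` function, and the direction flag are illustrative assumptions, not the actual CLI code.

```typescript
type Status = "green" | "yellow" | "red";

// green/yellow are the band boundaries from the V1 threshold table,
// expressed as fractions. The direction flag is an assumption for this sketch.
interface Threshold {
  direction: "lowerIsBetter" | "higherIsBetter";
  green: number;
  yellow: number;
}

const THRESHOLDS: Record<string, Threshold> = {
  rewrite_fallback_rate: { direction: "lowerIsBetter", green: 0.10, yellow: 0.30 },
  null_output_rate: { direction: "lowerIsBetter", green: 0.05, yellow: 0.15 },
  memory_yield_rate: { direction: "higherIsBetter", green: 0.70, yellow: 0.50 },
  usage_coverage_rate: { direction: "higherIsBetter", green: 0.95, yellow: 0.85 },
};

function classify(metric: string, value: number): Status {
  const t = THRESHOLDS[metric];
  if (t === undefined) {
    throw new Error(`unknown metric: ${metric}`);
  }
  if (t.direction === "lowerIsBetter") {
    // e.g. <10% green, <30% yellow, >=30% red
    if (value < t.green) return "green";
    return value < t.yellow ? "yellow" : "red";
  }
  // e.g. >70% green, >50% yellow, <=50% red
  if (value > t.green) return "green";
  return value > t.yellow ? "yellow" : "red";
}

// First scorecard run: 72 of 122 recalls returned at least one memory.
const firstRunYield = classify("memory_yield_rate", 72 / 122); // "yellow"
```

Boundary values fall into the worse band, matching the ≥ and ≤ signs in the table: a rewrite_fallback_rate of exactly 30% is red, and a memory_yield_rate of exactly 50% is red.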
V2 (no-op control) — shipped 2026-03-04
Added an empty-transcript circuit breaker in observe.ts. When sanitizeObservationTranscript() produces empty text, the function skips the LLM call entirely and returns early with an observe.skipped.empty_transcript OTEL event. Previously, empty transcripts were passed to Haiku with "No user-facing transcript content available", which produced generic or empty observations that burned tokens with zero yield.
Observable: joelclaw otel search "observe.skipped.empty_transcript" --hours 24
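The short-circuit can be sketched as follows. This is not the code in observe.ts: `sanitizeStandIn` is a placeholder for sanitizeObservationTranscript() (whose real rules are not shown in this ADR), the `callLlm` and `emit` callbacks are hypothetical injection points, and the flow is kept synchronous for brevity where the real function is presumably async.

```typescript
interface ObserveResult {
  skipped: boolean;
  observation?: string;
}

// Placeholder for sanitizeObservationTranscript(); the real sanitization
// logic is not reproduced here. This stand-in only trims whitespace.
function sanitizeStandIn(raw: string): string {
  return raw.trim();
}

// callLlm and emit are hypothetical injection points for this sketch.
function observeSession(
  raw: string,
  callLlm: (transcript: string) => string,
  emit: (eventName: string) => void,
): ObserveResult {
  const transcript = sanitizeStandIn(raw);
  if (transcript.length === 0) {
    // Circuit breaker: nothing to observe, so never pay for an LLM call.
    emit("observe.skipped.empty_transcript");
    return { skipped: true };
  }
  return { skipped: false, observation: callLlm(transcript) };
}
```

Emitting a dedicated skip event keeps the skipped calls countable, so a sudden rise in empty transcripts stays visible rather than silently disappearing from telemetry.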
Verification Checklist
- Memory scorecard metrics are emitted and queryable in OTEL (joelclaw memory scorecard)
- Empty-transcript observe calls short-circuit before LLM (V2 circuit breaker)
- Critical-gate breaches produce deterministic degraded behavior (not endless retries)
- Recall rewrite fallback loops are bounded by contract
- Task triage null outputs are treated as failures, not successes
- At least one daily report includes scorecard + gate state + active degradations