ADR-0190 · accepted

Memory Yield Contract

Status: proposed
Date: 2026-03-02
Deciders: Joel, Panda
Relates to: ADR-0021, ADR-0068, ADR-0077, ADR-0096, ADR-0140, ADR-0146, ADR-0183, ADR-0189

Context

The memory system's intent is clear: capture durable patterns, reject noise, and make each session smarter than the last.

Observed behavior is falling short:

repeated rewrite and classification calls that return empty/null outputs,
expensive LLM loops that do not create actionable outputs,
high tool churn in long sessions with weak outcome density,
telemetry blind spots that hide true token/cost usage on parts of system-bus inference.

This is a coherence problem, not just a spend problem. We need a hard contract that measures memory yield and blocks no-op churn.

Decision

1) Memory is governed by yield, not activity

From this ADR forward, memory and recall paths are judged by outcome metrics, not by the number of runs, traces, or events.

2) Adopt a mandatory Memory Yield Scorecard

Track and report these metrics as first-class signals:

Metric	Definition	Why it matters
`rewrite_fallback_rate`	rewrite attempts ending in fallback / total rewrite attempts	detects wasted rewrite loops
`null_output_rate`	inference calls with parse-failed/null output / total calls	catches paid no-op calls
`tool_churn_ratio`	`toolUse` turns / terminal-output turns	detects busy loops
`memory_yield_rate`	retrieved memories that are referenced or used in output/action / retrieved memories	measures actual compounding
`cost_per_useful_outcome`	LLM cost / useful outcomes (promotion, action, successful decision)	ROI anchor
`usage_coverage_rate`	calls with non-empty usage+cost metadata / total traced calls	ensures observability truth
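
As an illustration of the definitions above, the scorecard can be derived from a handful of raw counters. This is a minimal sketch, not the shipped implementation: the `MemoryCounters` shape, its field names, and `computeScorecard` are assumptions made for the example, and only the metric names and ratios come from the table.

```typescript
// Hypothetical counter shape: field names are illustrative, not the real telemetry schema.
interface MemoryCounters {
  rewriteAttempts: number;
  rewriteFallbacks: number;
  inferenceCalls: number;
  nullOutputs: number; // parse-failed or null outputs
  toolUseTurns: number;
  terminalOutputTurns: number;
  retrievedMemories: number;
  usedMemories: number; // retrieved memories referenced or used in output/action
  llmCostUsd: number;
  usefulOutcomes: number; // promotions, actions, successful decisions
  tracedCalls: number;
  callsWithUsage: number; // calls with non-empty usage+cost metadata
}

type MetricName =
  | "rewrite_fallback_rate"
  | "null_output_rate"
  | "tool_churn_ratio"
  | "memory_yield_rate"
  | "cost_per_useful_outcome"
  | "usage_coverage_rate";

type Scorecard = Record<MetricName, number | null>;

// Returns null when the denominator is zero, so an idle window is reported
// as "no data" rather than as a healthy or failing value.
function ratio(numerator: number, denominator: number): number | null {
  return denominator > 0 ? numerator / denominator : null;
}

function computeScorecard(c: MemoryCounters): Scorecard {
  return {
    rewrite_fallback_rate: ratio(c.rewriteFallbacks, c.rewriteAttempts),
    null_output_rate: ratio(c.nullOutputs, c.inferenceCalls),
    tool_churn_ratio: ratio(c.toolUseTurns, c.terminalOutputTurns),
    memory_yield_rate: ratio(c.usedMemories, c.retrievedMemories),
    cost_per_useful_outcome: ratio(c.llmCostUsd, c.usefulOutcomes),
    usage_coverage_rate: ratio(c.callsWithUsage, c.tracedCalls),
  };
}
```

Returning `null` for an empty window is a design choice of this sketch: it keeps "no traffic" distinct from "0% failure", which matters once gates act on these values.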

3) Add hard gate policy for memory-path health

If any critical metric breaches configured thresholds, the system must:

open a degradation gate for the offending path,
fall back to deterministic/simpler behavior,
emit a single high-signal operator alert,
stay degraded until recovery criteria are met.

No silent, unbounded retries.
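
The four gate obligations above can be sketched as a small state machine. This is an illustrative sketch under stated assumptions: `DegradationGate`, `GateConfig`, and the "consecutive healthy windows" recovery rule are hypothetical, since this ADR leaves thresholds and recovery criteria to configuration and to the child ADRs. The sketch assumes a metric where higher values are worse.

```typescript
type GateState = "healthy" | "degraded";

// Hypothetical configuration: the ADR only says thresholds are "configured".
interface GateConfig {
  breachThreshold: number; // value at or above which the path is degraded
  recoveryThreshold: number; // value the metric must stay below to recover
  recoveryWindows: number; // consecutive healthy windows required to recover
}

class DegradationGate {
  private state: GateState = "healthy";
  private healthyStreak = 0;
  private readonly path: string;
  private readonly cfg: GateConfig;
  private readonly alert: (message: string) => void;

  constructor(path: string, cfg: GateConfig, alert: (message: string) => void) {
    this.path = path;
    this.cfg = cfg;
    this.alert = alert;
  }

  // Feed one measurement window; returns the state after applying it.
  observe(value: number): GateState {
    if (this.state === "healthy") {
      if (value >= this.cfg.breachThreshold) {
        this.state = "degraded";
        this.healthyStreak = 0;
        // Alert only on the healthy -> degraded transition: one alert per breach.
        this.alert(`memory gate degraded: path=${this.path} value=${value}`);
      }
      return this.state;
    }
    // Degraded: stay degraded until the recovery criteria are met.
    this.healthyStreak = value < this.cfg.recoveryThreshold ? this.healthyStreak + 1 : 0;
    if (this.healthyStreak >= this.cfg.recoveryWindows) {
      this.state = "healthy";
      this.healthyStreak = 0;
    }
    return this.state;
  }

  // Route work to the deterministic fallback while the gate is degraded.
  run<T>(primary: () => T, fallback: () => T): T {
    return this.state === "degraded" ? fallback() : primary();
  }
}
```

Using a recovery threshold lower than the breach threshold gives the gate hysteresis, so a metric hovering near the limit does not flap between states or re-alert on every window.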

4) Freeze memory feature expansion while core health is failing

No new memory or retrieval feature work is eligible while any critical scorecard gate is red. Work priority is:

observability truth,
no-op prevention,
output contract reliability,
then feature expansion.

5) Split enforcement into focused child ADRs

This umbrella ADR is enacted by:

ADR-0191 (no-op inference circuit breakers),
ADR-0192 (recall rewrite reliability contract),
ADR-0193 (task triage output contract).

Consequences

Good

restores the original memory-system intent (compounding patterns, not noise),
converts waste complaints into measurable gates and remediations,
prevents cost/latency churn from masquerading as progress.

Tradeoffs

reduced autonomy on degraded paths until they recover,
additional telemetry and gating logic in system-bus/cli paths,
some previously “best effort” behavior becomes explicit failure.

Risks

threshold tuning may be noisy early,
aggressive gates may temporarily reduce recall quality.

Mitigation: start with conservative thresholds and tighten once usage coverage is reliable.

Required Skills (Preflight)

Load before implementation starts:

langfuse — tracing contract, usage/cost correctness, and attribution
system-bus — implementation points for inference and task workflows
inngest-durable-functions — durable gate/circuit state transitions
inngest-flow-control — cooldown and suppression behavior
joelclaw — operational validation via CLI and run inspection
system-architecture — ensure gate behavior matches runtime topology

Implementation Plan (vector clock)

V1 (truth): make usage/cost coverage explicit and queryable for all memory-critical inference calls.
V2 (no-op control): implement no-op circuit breaker contract (ADR-0191).
V3 (recall reliability): implement rewrite reliability and skip contract (ADR-0192).
V4 (triage reliability): enforce task triage output contract (ADR-0193).
V5 (governance): add scorecard summary to daily health reporting and gate feature work when red.

Implementation Progress

V1 (truth) — shipped 2026-03-04

Added the joelclaw memory scorecard CLI command. It queries the Typesense otel_events collection and computes:

Metric	Source	Threshold (green/yellow/red)
`rewrite_fallback_rate`	recall OTEL (strategy=fallback)	<10% / <30% / ≥30%
`null_output_rate`	reflect OTEL (failed/total)	<5% / <15% / ≥15%
`memory_yield_rate`	recall metadata (returned>0/total)	>70% / >50% / ≤50%
`usage_coverage_rate`	all memory OTEL (success facet)	>95% / >85% / ≤85%
`observe_volume`	observe-session-noted events	>0 / - / -
`reflect_volume`	reflect events	>0 / - / -

First scorecard run (24h window):

rewrite_fallback_rate: 0% ✅
null_output_rate: 0% ✅
memory_yield_rate: 59% 🟡 (50 of 122 recalls returned empty)
usage_coverage_rate: 98.5% ✅
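
To make the threshold column concrete, the following sketch classifies a metric value as green, yellow, or red. The threshold numbers are copied from the V1 table; the `THRESHOLDS` map, the `direction` field, and `classify` are hypothetical names for illustration and are not taken from the CLI source.

```typescript
type Status = "green" | "yellow" | "red";

interface Threshold {
  direction: "lower_is_better" | "higher_is_better";
  green: number; // strict bound for green
  yellow: number; // strict bound for yellow; anything else is red
}

// Values transcribed from the V1 threshold table, expressed as fractions.
const THRESHOLDS: Record<string, Threshold> = {
  rewrite_fallback_rate: { direction: "lower_is_better", green: 0.1, yellow: 0.3 },
  null_output_rate: { direction: "lower_is_better", green: 0.05, yellow: 0.15 },
  memory_yield_rate: { direction: "higher_is_better", green: 0.7, yellow: 0.5 },
  usage_coverage_rate: { direction: "higher_is_better", green: 0.95, yellow: 0.85 },
};

function classify(metric: string, value: number): Status {
  const t = THRESHOLDS[metric];
  if (t === undefined) throw new Error(`unknown metric: ${metric}`);
  if (t.direction === "lower_is_better") {
    if (value < t.green) return "green";
    return value < t.yellow ? "yellow" : "red";
  }
  if (value > t.green) return "green";
  return value > t.yellow ? "yellow" : "red";
}

// First-run memory yield: 122 recalls, 50 empty, so 72 / 122 is about 0.59.
console.log(classify("memory_yield_rate", 72 / 122)); // "yellow"
```

The boundaries are strict on the healthy side, matching the table: a fallback rate of exactly 30% is red, and a yield of exactly 50% is red.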

V2 (no-op control) — shipped 2026-03-04

Added an empty-transcript circuit breaker in observe.ts. When sanitizeObservationTranscript() produces empty text, the function skips the LLM call entirely and returns early with an observe.skipped.empty_transcript OTEL event. Previously, empty transcripts were passed to Haiku with the placeholder "No user-facing transcript content available", which produced generic or empty observations that burned tokens with zero yield.

Observable: joelclaw otel search "observe.skipped.empty_transcript" --hours 24
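
The V2 short-circuit can be pictured roughly as follows. This is a hedged sketch, not the contents of observe.ts: `ObserveDeps`, `ObserveResult`, and the injected `sanitize`, `callModel`, and `emit` functions are hypothetical stand-ins for sanitizeObservationTranscript(), the model call, and OTEL emission. Only the event name comes from the text above.

```typescript
type ObserveResult =
  | { status: "skipped"; reason: "empty_transcript" }
  | { status: "observed"; observation: string };

// Hypothetical dependency bundle so the sketch stays self-contained.
interface ObserveDeps {
  sanitize: (raw: string) => string; // stand-in for sanitizeObservationTranscript()
  callModel: (transcript: string) => Promise<string>; // stand-in for the LLM call
  emit: (event: string) => void; // stand-in for OTEL event emission
}

async function observe(raw: string, deps: ObserveDeps): Promise<ObserveResult> {
  const transcript = deps.sanitize(raw).trim();
  if (transcript.length === 0) {
    // Circuit breaker: record the skip and return before any tokens are spent.
    deps.emit("observe.skipped.empty_transcript");
    return { status: "skipped", reason: "empty_transcript" };
  }
  const observation = await deps.callModel(transcript);
  return { status: "observed", observation };
}
```

The key property is ordering: the emptiness check happens after sanitization and before the model call, so a transcript that is non-empty only because of stripped content still short-circuits.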

Verification Checklist

Memory scorecard metrics are emitted and queryable in OTEL (joelclaw memory scorecard)
Empty-transcript observe calls short-circuit before LLM (V2 circuit breaker)
Critical-gate breaches produce deterministic degraded behavior (not endless retries)
Recall rewrite fallback loops are bounded by contract
Task triage null outputs are treated as failures, not successes
At least one daily report includes scorecard + gate state + active degradations