Memory System — Next Phase
Status
implementing
Context
The memory system (ADR-0021) has been running in production since 2026-02-14. Six days of continuous operation have proven the core pipeline and revealed where to invest next.
What’s Built and Running
| Component | Status | Details |
|---|---|---|
| Observer (Phase 1) | ✅ Operational | observe.ts — extracts structured observations from session transcripts on compaction/shutdown. Haiku model. Segment-aware distillation. |
| Qdrant storage | ✅ 1,343 points | 768-dim nomic-embed-text-v1.5 vectors. Semantic search verified. |
| Reflector (Phase 2) | ✅ Operational | reflect.ts — condenses observations into MEMORY.md proposals. Stages in Redis. |
| Promotion (Phase 3) | ✅ Operational | promote.ts — approved proposals merge into MEMORY.md. Todoist-as-review-surface. |
| Auto-triage (ADR-0068) | ✅ Operational | proposal-triage.ts + batch-review.ts — LLM batch review, auto-promote/reject. Only needs-review creates Todoist tasks. Eliminated 50+ junk tasks per compaction. |
| Friction (Phase 4) | ⚠️ Deployed, unproven | friction.ts — daily cron queries Qdrant clusters. First run: 10 patterns from 227 observations. Has not yet produced actionable friction-fix tasks. |
| Recall tool | ✅ Operational | pi-tools recall command — semantic search over Qdrant. Used by agents mid-session. |
| Daily log | ✅ Operational | ~/.joelclaw/workspace/memory/YYYY-MM-DD.md — observations appended per session. |
| Session briefing | ✅ Operational | MEMORY.md + daily log + slog auto-injected at session start. |
What’s Not Built (from @jumperz’s 31-piece memory stack)
@jumperz on X published a comprehensive 31-piece agent memory architecture split across 3 phases. The two diagrams below capture the full system:
Section 1: Memory Storage — “Full Agent Memory Build — In Order”
Section 2: Memory Intelligence — “Feeds Into”
Key concepts from that framework not yet implemented in joelclaw:
Retrieval Intelligence:
- Score Decay — `final_score = raw_score × exp(-0.01 × days_since_created)` for time-weighted relevance
- Tiered Search — category summaries first (fast), vector search fallback (thorough)
- Query Rewriting — LLM rewrites the last 5 messages into an optimal search query before retrieval
- Dual Search — parallel vector similarity + knowledge graph traversal, merged scoring
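To make the decay curve concrete, a quick worked example (the scores and ages are arbitrary illustration values):

```typescript
// Worked example of the score-decay formula above. At a decay rate of 0.01/day,
// a strong but month-old hit loses to a slightly weaker hit from yesterday.
const decayed = (rawScore: number, daysSinceCreated: number): number =>
  rawScore * Math.exp(-0.01 * daysSinceCreated);

const oldHit = decayed(0.9, 30); // 0.9 × e^(-0.3) ≈ 0.667
const freshHit = decayed(0.7, 1); // 0.7 × e^(-0.01) ≈ 0.693 — the fresher fact wins
```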
Storage Intelligence:
- Write Gate — validate facts before storing (verifiable? useful? conflicting?)
- Dedup at write time — cosine sim > 0.85 → merge, keep fresher wording, bump confidence
- Strength tags — EXPLICIT (0.9), IMPLIED (0.7), INFERRED (0.5) confidence levels
- Sentiment tags — emotional context when facts were stated
Maintenance:
- Nightly job — merge similar items, boost recently-accessed, extract from unprocessed resources
- Weekly job — regenerate category summaries, archive old items, cluster insights
- Monthly job — rebuild embeddings, full reindex, cold archive, health report
- Cron Fallback — check job timestamps on heartbeat, run if overdue
Feedback Loops:
- Echo/Fizzle — track which injected memories get referenced in responses (+0.1 priority) vs ignored (-0.05)
- Behavior Loop — extract corrections/preferences as lessons, load on every session start
- Trust Pass — validate memory sets before returning (conflict detection, confidence thresholds, staleness tags)
Advanced (future):
- Categories — per-topic markdown summaries (work.md, health.md, preferences.md)
- Domain TTLs — per-category retention (work: 90d, preferences: 180d, health: 365d)
- Forward triggers — detect temporal signals (“next Friday”), preload context on that date
- Cross-agent memory sharing — SharedMemory REST API with domain relevance filters
- Knowledge graphs — entity relationships, graph-based retrieval
- Budget-aware retrieval — reduce operations under cost constraints
- Memory Agent — dedicated class abstracting all memory operations
Decision
Implement the next phase in three increments, building on what’s proven. Each increment is independently valuable.
Increment 1: Retrieval Quality (highest impact, lowest risk)
These improve what agents get back when they query memory. No storage changes needed.
- Score Decay — Add time-based decay to Qdrant search results. Modify the `recall` tool and any retrieval path to apply `final_score = raw_score × exp(-0.01 × days_since_created)`. Recent relevant facts outrank old relevant facts.
- Query Rewriting — Before Qdrant search, pass the query plus recent conversation context through Haiku to produce an optimized search query. The raw user question (“what was that Redis thing?”) becomes a targeted query (“Redis SETNX deduplication pattern for Inngest functions”).
- Inject Cap — Hard limit of 10 memories per retrieval. Already informally followed; make it explicit in the recall tool.
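Taken together, the three items form one small retrieval pipeline. A minimal sketch, where `rewriteQuery` and `searchQdrant` are hypothetical stand-ins for the Haiku call and the Qdrant client, and `ageDays` is assumed to be derivable from point payload timestamps:

```typescript
// Increment 1 recall pipeline sketch: rewrite → search → decay → cap.
const INJECT_CAP = 10;   // hard limit of 10 memories per retrieval
const DECAY_RATE = 0.01; // per day, matching the score-decay formula

interface Hit {
  text: string;
  score: number;   // raw cosine similarity from Qdrant
  ageDays: number; // days since the point was created
}

async function recall(
  rawQuery: string,
  rewriteQuery: (q: string) => Promise<string>, // hypothetical Haiku wrapper
  searchQdrant: (q: string) => Promise<Hit[]>,  // hypothetical Qdrant client wrapper
): Promise<Hit[]> {
  const query = await rewriteQuery(rawQuery); // "that Redis thing?" → targeted query
  const hits = await searchQdrant(query);
  return hits
    .map((h) => ({ ...h, score: h.score * Math.exp(-DECAY_RATE * h.ageDays) }))
    .sort((a, b) => b.score - a.score) // recent relevant outranks old relevant
    .slice(0, INJECT_CAP);
}
```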
Increment 2: Storage Quality (reduces noise over time)
These improve what goes INTO memory. Builds on the existing observer pipeline.
- Dedup at observation time — Before storing a new observation in Qdrant, check cosine similarity against recent points. If > 0.85, merge: keep the fresher wording, bump confidence metadata. Prevents the same insight from appearing 5+ times across sessions.
- Nightly maintenance — Inngest cron (2 AM): scan today’s observations, merge duplicates, identify orphaned resources with no extractions. Log stats. This is a simple extension of what friction.ts already does with Qdrant queries.
- Staleness tagging — Observations older than 90 days with zero recall hits get tagged `stale` in Qdrant metadata. The recall tool deprioritizes stale results.
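The write-time dedup check can be sketched as follows. The `Observation` shape and the +0.1 confidence bump are illustrative assumptions, not the actual observer schema:

```typescript
// Sketch of write-time dedup (Increment 2): cosine similarity > 0.85 → merge,
// keeping the fresher wording and bumping confidence on the surviving point.
interface Observation {
  text: string;
  vector: number[];   // 768-dim embedding in production; any length here
  confidence: number; // 0..1
}

const MERGE_THRESHOLD = 0.85;

function cosineSim(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function mergeIfDuplicate(
  incoming: Observation,
  recent: Observation[],
): { action: "store" | "merge"; result: Observation } {
  for (const prior of recent) {
    if (cosineSim(incoming.vector, prior.vector) > MERGE_THRESHOLD) {
      // Keep the fresher wording; bump confidence (assumed +0.1, capped at 1).
      return {
        action: "merge",
        result: { ...prior, text: incoming.text, confidence: Math.min(1, prior.confidence + 0.1) },
      };
    }
  }
  return { action: "store", result: incoming };
}
```

In production the `recent` set would come from a Qdrant similarity query rather than an in-memory scan.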
Increment 3: Feedback Loop (the compounding moat)
These create the virtuous cycle that makes memory better with use.
- Echo/Fizzle tracking — After each agent response, heuristically detect which recalled memories were referenced. Boost priority for used memories (+0.1), penalize ignored ones (-0.05). Over time, useful memories surface; irrelevant ones sink. This is the most valuable long-term feature.
- Behavior Loop — When the user corrects the agent, extract the correction pattern and write it to a lessons section in MEMORY.md. Load on every session start. Already partially implemented via friction detection — this formalizes it.
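One plausible heuristic for echo detection is keyword overlap between a recalled memory and the agent’s response. The 30% overlap threshold and the 5-character word filter below are assumptions to be tuned, not tested values:

```typescript
// Echo/fizzle heuristic sketch (Increment 3): a memory "echoes" if enough of its
// distinctive words reappear in the response; otherwise it "fizzles".
const ECHO_BOOST = 0.1;
const FIZZLE_PENALTY = 0.05;
const OVERLAP_THRESHOLD = 0.3; // assumed: ≥30% keyword overlap counts as referenced

function keywords(text: string): Set<string> {
  // Words of 5+ characters as a crude proxy for "distinctive" terms.
  return new Set(text.toLowerCase().match(/[a-z0-9]{5,}/g) ?? []);
}

function updatePriority(memory: { text: string; priority: number }, response: string): number {
  const memWords = keywords(memory.text);
  if (memWords.size === 0) return memory.priority;
  let hits = 0;
  for (const w of keywords(response)) if (memWords.has(w)) hits++;
  const echoed = hits / memWords.size >= OVERLAP_THRESHOLD;
  return memory.priority + (echoed ? ECHO_BOOST : -FIZZLE_PENALTY);
}
```

A smarter variant could ask Haiku to judge referencing directly, at the cost of one extra call per response.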
Not Yet (deferred until increments 1-3 prove out)
- Categories / per-topic summaries — requires defining the category taxonomy first. May emerge naturally from friction analysis.
- Knowledge graphs — needs a graph database (Neo4j or similar). Big investment, unclear ROI at current scale.
- Cross-agent sharing — only one agent system (joelclaw). Premature.
- Forward triggers — interesting but low-frequency. Calendar integration (ADR-0040, gogcli) already covers most temporal awareness.
- Budget-aware — needs per-function token tracking first (see ADR-0078).
- Dual Search — vector search alone is performing well at 1,343 points. Graph traversal adds value at 10K+ points.
- Write Gate — the auto-triage pipeline (ADR-0068) already filters proposals. Adding a gate at observation time risks losing raw data.
Consequences
Positive
- Retrieval quality improves immediately (score decay, query rewriting)
- Storage noise reduces over time (dedup, maintenance, staleness)
- Echo/fizzle creates compounding advantage — the system gets smarter with use
- Each increment is independently deployable and testable
Negative
- Echo/fizzle detection is heuristic — may be noisy initially
- Nightly maintenance adds another cron job to monitor
- Query rewriting adds latency to recall (one Haiku call per query)
References
- ADR-0021: Agent Memory System (canonical spec, all phases)
- ADR-0068: Memory Proposal Auto-Triage Pipeline
- ADR-0078: Opus Token Reduction (budget-aware retrieval dependency)
- @jumperz on X — 31-piece memory system thread (2026-02-20). The source architecture for this ADR’s vision.
Notes
Source: @jumperz’s 31-Piece Memory Stack
The architecture vision comes from @jumperz’s X thread, which Joel shared via Telegram. jumperz frames it as “the entire memory stack if you actually want to take your agent memory to somewhere real — from actually remembering to having an intelligence layer.”
jumperz’s key insight — build order is everything:
“31 pieces total, split into 3 phases: core first, reliability second, then advanced last. You build from core to advanced slowly, and you test each phase before touching the next. If you try to build all 31 at once, you will break everything and you won’t understand anything.”
Phase 1 (10 pieces) — “Memory that actually works”: write pipeline, read pipeline, decay, session flush, behavior loop, categories, strength tags, sentiment tags, inject limit, trust pass.
Phase 2 (7 pieces) — “Memory that survives”: crash recovery, audit trail, dedup, conflict resolution, nightly maintenance, weekly maintenance, cron fallback.
Phase 3 (14 pieces) — “The ceiling”: rewrite query, tiered search, dual search, episodes, episode search, trust scoring, echo/fizzle, cross-agent sharing, memory agent, forward triggers, budget-aware, domain TTLs, monthly reindexing, knowledge graphs.
“None of Phase 3 matters until Phase 1 and 2 are solid. Phase 1 unstable means Phase 3 just amplifies the flaws. Phase 2 missing means Phase 3 is literally optimising pure garbage.”
“Personally I’m not done yet. Phase 1 and 2 are solid, Phase 3 is still being built. But the longer you work with it the more you see, they’re not separate… it’s all one system.”
How joelclaw Maps to jumperz’s Phases
| jumperz Phase 1 (Core) | joelclaw Status |
|---|---|
| Write pipeline | ✅ observe.ts — extracts facts from sessions |
| Read pipeline | ✅ recall tool — Qdrant semantic search |
| Score Decay | ❌ Increment 1 of this ADR |
| Session Flush | ✅ session-lifecycle flushes on compaction/shutdown |
| Behavior Loop | ⚠️ Partially — friction.ts detects patterns, but no corrections→lessons pipeline |
| Categories | ❌ Deferred — taxonomy undefined |
| Strength tags | ❌ Deferred |
| Sentiment tags | ❌ Deferred |
| Inject limit | ⚠️ Informal — not enforced in code |
| Trust Pass | ❌ Increment 1 of this ADR (basic confidence thresholds) |
| jumperz Phase 2 (Reliability) | joelclaw Status |
|---|---|
| Crash recovery | ⚠️ Inngest retries handle transient failures, but no checkpoint/replay |
| Audit trail | ✅ Qdrant stores source metadata, daily logs are append-only |
| Dedup | ⚠️ Redis SETNX dedupe key per session, but no cross-session cosine dedup |
| Conflict resolution | ❌ Increment 2 of this ADR (staleness tagging) |
| Nightly maintenance | ❌ Increment 2 of this ADR |
| Weekly maintenance | ❌ Deferred |
| Cron fallback | ✅ Inngest cron + heartbeat checks = equivalent |
| jumperz Phase 3 (Intelligence) | joelclaw Status |
|---|---|
| Echo/Fizzle | ❌ Increment 3 of this ADR |
| All others | ❌ Deferred — see “Not Yet” section above |
Assessment: joelclaw has ~60% of Phase 1 and ~40% of Phase 2 by jumperz’s framework. This ADR’s 3 increments close the gaps in Phase 1 (score decay, inject cap, trust pass) and Phase 2 (dedup, nightly maintenance, staleness), then start Phase 3 (echo/fizzle). That’s the right order per jumperz: “build from core to advanced slowly.”