Taxonomy-enhanced session search with SKOS concept layer
Context and Problem Statement
In 48 hours, 590 agent sessions were generated (37 Pi, 381 Claude Code, 172 Codex). These contain every decision, debugging insight, architecture discussion, and configuration change that happened on this system. They are completely unsearchable.
ADR-0021 specifies a memory_observations collection for extracted observations — LLM-processed, structured, ~50 bullets per session. But observations are lossy by design. When the Reflect tool (ADR-0021 Phase 5) needs to answer “how did we fix the worker crash?” or “what did Joel say about Redis TTLs?”, it needs the raw transcript context, not just a distilled bullet.
Beyond searchability, a deeper problem exists: the same concepts have different names across every data source.
| Concept | Slog | Vault | Codebase | Sessions |
|---|---|---|---|---|
| System bus worker | tool: system-bus-worker | Projects/07-event-bus/ | packages/system-bus/ | “the worker crashed” |
| Video pipeline | tool: video-ingest | Projects/06-video-ingest/ | src/inngest/video/ | “ingest this video” |
| joelclaw | tool: joelclaw | Projects/09-joelclaw/ | ~/Code/joelhooks/joelclaw/ | “the monorepo” |
| Memory | tool: memory | Projects/08-memory-system/ | (not yet implemented) | “memory system”, “recall”, “Qdrant” |
Vector similarity alone cannot bridge these gaps reliably. A query for “infrastructure” won’t find chunks about Qdrant, Redis, Inngest, or Docker unless those chunks literally contain the word “infrastructure.”
What the research says (Feb 2026)
- FloTorch benchmark (Feb 2026): Recursive 512-token chunking outperformed semantic and proposition-based methods on equal context budgets. Simpler chunking + re-ranking is the dominant strategy. Proposition-based chunking (LLM decomposition into atomic facts) ranked among the worst — smaller fragments dilute accuracy.
- GraphRAG / taxonomy-enhanced search (Squirro 2026): Combining vector search with structured taxonomies achieves up to 99% precision. The prerequisite is a carefully curated taxonomy. This is the single largest precision lever beyond basic chunking.
- Contextual chunking (Anthropic): Adding a short context prefix to each chunk before embedding (e.g., “[SESSION: pi, debugging worker crash, 2026-02-15]”) improves retrieval 2-18% over baseline.
- Re-ranking (FloTorch): Cross-encoder re-ranking after initial retrieval boosts precision 18-42%. This is larger than any chunking improvement.
- Hybrid retrieval (BM25 + dense): 20-40% higher recall than dense search alone, especially for exact terminology, acronyms, and domain jargon.
Synthesis: Simple chunking + rich metadata taxonomy + re-ranking >>> complex chunking alone.
Storage projections
| Timeframe | Sessions | Raw size | Chunks (est.) | Vector storage |
|---|---|---|---|---|
| Current (2 days) | 590 | 137 MB | ~34k | ~100 MB |
| 1 month | ~9,000 | ~2.1 GB | ~500k | ~1.5 GB |
| 6 months (with TTL) | ~30,000 | ~5 GB | ~1.2M | ~4 GB |
With source-based TTL (Codex sessions expire after 30 days), 6-month projection drops from ~9 GB to ~4 GB. Manageable on local SSD.
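These estimates are consistent with raw vector size: at 768 dimensions × 4 bytes per float, each vector is ~3 KB, so ~34k chunks ≈ 100 MB, ~500k ≈ 1.5 GB, and ~1.2M ≈ 3.7 GB of vectors before payload and index overhead.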
Decision
Add two new Qdrant collections — session_transcripts and taxonomy_concepts — that work together with the existing memory_observations (ADR-0021) to provide taxonomy-enhanced semantic search across all system data.
The SKOS taxonomy layer lives in two places:
- Qdrant `taxonomy_concepts` collection — full machine-queryable concept graph with vectors, SKOS relationships, and cross-system mappings
- Vault `Resources/taxonomy/` notes — human-curated concept notes for major categories, browsable in Obsidian with wikilinks
Architecture
┌──────────────────────────────────────────────────────────────────┐
│ Qdrant Collections │
│ │
│ taxonomy_concepts session_transcripts │
│ ├─ vector: 768-dim ├─ vector: 768-dim │
│ ├─ prefLabel ├─ text (chunk content) │
│ ├─ altLabels[] ├─ concept_ids[] ←── taxonomy link │
│ ├─ broader[] ├─ source (pi|claude|codex) │
│ ├─ narrower[] ├─ sessionId │
│ ├─ related[] ├─ timestamp_start / timestamp_end │
│ ├─ exactMatch{} ├─ turn_roles[] (user|assistant|tool) │
│ ├─ closeMatch{} ├─ files_read[] │
│ ├─ scopeNote ├─ files_modified[] │
│ ├─ definition ├─ vault_notes[] │
│ ├─ conceptScheme ├─ slog_tool_refs[] │
│ └─ vault_note_path ├─ codebase_paths[] │
│ ├─ adr_refs[] │
│ memory_observations ├─ chunk_index │
│ (ADR-0021, enriched) ├─ total_chunks │
│ ├─ concept_ids[] ←─┐ ├─ context_prefix │
│ └─ (existing schema)│ └─ ttl_expires_at (codex only) │
│ │ │
│ └──── shared concept_ids enable cross-query │
└──────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────┐
│ Vault (human layer) │
│ │
│ Resources/taxonomy/ │
│ ├─ _index.md (concept scheme overview) │
│ ├─ agent-infrastructure.md │
│ │ frontmatter: { prefLabel, altLabels, narrower, related } │
│ │ body: scope note, definition, links to projects │
│ ├─ memory-system.md │
│ ├─ video-pipeline.md │
│ ├─ joelclaw.md │
│ └─ ... │
│ │
│ Wikilinks = broader/narrower/related graph │
│ Frontmatter = machine-readable SKOS fields │
│ Sync: Vault notes → Qdrant taxonomy_concepts (Inngest function) │
└──────────────────────────────────────────────────────────────────┘
Query Flow
User/Agent query: "how did we fix the worker crash?"
│
├─ 1. Embed query (nomic-embed-text-v1.5, 768-dim)
│
├─ 2. Taxonomy expansion
│ ├─ Search taxonomy_concepts by vector similarity
│ │ → finds: jc:system-bus-worker (score 0.82)
│ ├─ Traverse broader[]: jc:agent-infrastructure
│ ├─ Traverse related[]: jc:inngest, jc:launchd, jc:docker
│ ├─ Collect altLabels: ["worker", "event bus", "system-bus"]
│ └─ Build expanded concept set: [jc:system-bus-worker, jc:inngest, ...]
│
├─ 3. Hybrid retrieval
│ ├─ Dense: vector similarity on session_transcripts + memory_observations
│ ├─ Sparse: BM25 on altLabels + query terms (future: Qdrant sparse vectors)
│ ├─ Filter: concept_ids overlap with expanded concept set (payload filter)
│ └─ Combine via Reciprocal Rank Fusion (RRF)
│
├─ 4. Re-rank top-k results
│ ├─ Cross-encoder or LLM-based re-ranking
│ └─ Score by: relevance + recency + source priority
│
└─ 5. Return ranked chunks with concept context
├─ Each result includes: text, source, timestamp, concept_ids, files
└─ Reflect tool synthesizes answer from top chunks
SKOS Concept Schema
Each concept in taxonomy_concepts:
interface TaxonomyConcept {
// Identity
id: string; // e.g., "jc:system-bus-worker"
prefLabel: string; // "system-bus worker"
altLabels: string[]; // ["worker", "event bus worker", "system-bus"]
hiddenLabels: string[]; // typos, abbreviations: ["sb-worker", "sysbus"]
// Hierarchy (SKOS semantic relations)
broader: string[]; // ["jc:agent-infrastructure"]
narrower: string[]; // ["jc:inngest-functions", "jc:launchd-plist"]
related: string[]; // ["jc:inngest", "jc:docker", "jc:bun"]
// Cross-system mappings (SKOS mapping properties)
exactMatch: {
slog_tool?: string; // "system-bus-worker"
vault_project?: string; // "Projects/07-event-bus/"
codebase_path?: string; // "packages/system-bus/"
skill?: string; // "inngest"
};
closeMatch: {
vault_notes?: string[]; // related Vault notes
adr_refs?: string[]; // ["ADR-0021", "ADR-0022"]
};
// Documentation (SKOS documentation properties)
scopeNote: string; // brief description of concept scope
definition?: string; // formal definition
// Metadata
conceptScheme: string; // "jc:system" | "jc:tools" | "jc:projects"
vault_note_path?: string; // "Resources/taxonomy/system-bus-worker.md"
created: string; // ISO 8601
modified: string; // ISO 8601
source: "mined" | "curated"; // how the concept was created
}
Concept schemes (top-level groupings):
- `jc:system` — infrastructure, services, deployment (Qdrant, Redis, Inngest, Docker, launchd, Caddy, Tailscale)
- `jc:tools` — CLI tools, skills, extensions (pi, claude, codex, slog, igs, yt-dlp, ffmpeg)
- `jc:projects` — active projects, features, pipelines (joelclaw, video-ingest, memory-system)
- `jc:patterns` — architectural patterns, decisions (ADRs, PARA, SKOS, event-driven, durable execution)
- `jc:people` — people and organizations referenced (Joel, Alex Hillman, John Lindquist, Anthropic, Sanity)
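For concreteness, a seeded concept under the schema above might look like the following (a sketch only — field values mirror the system-bus worker Vault note shown later in this ADR, and the timestamps are placeholders):

```typescript
const systemBusWorker: TaxonomyConcept = {
  id: "jc:system-bus-worker",
  prefLabel: "system-bus worker",
  altLabels: ["worker", "event bus worker", "system-bus"],
  hiddenLabels: ["sb-worker", "sysbus"],
  broader: ["jc:agent-infrastructure"],
  narrower: ["jc:inngest-functions", "jc:launchd-plist"],
  related: ["jc:inngest", "jc:docker", "jc:bun"],
  exactMatch: {
    slog_tool: "system-bus-worker",
    vault_project: "Projects/07-event-bus/",
    codebase_path: "packages/system-bus/",
  },
  closeMatch: { adr_refs: ["ADR-0022"] },
  scopeNote: "The Inngest worker process that executes durable functions for the event bus.",
  conceptScheme: "jc:system",
  vault_note_path: "Resources/taxonomy/system-bus-worker.md",
  created: "2026-02-15T00:00:00Z",   // placeholder timestamps
  modified: "2026-02-15T00:00:00Z",
  source: "curated",
};
```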
Chunking Strategy
Based on FloTorch 2026 findings, we use adaptive recursive chunking — simple splitting that respects the natural structure of session transcripts, enhanced with contextual metadata.
Session transcript structure
Sessions are JSONL files with entries like:
{"type":"user","message":"fix the worker crash","timestamp":"2026-02-15T10:30:00Z"}
{"type":"assistant","message":"Let me check the logs...","timestamp":"2026-02-15T10:30:05Z"}
{"type":"tool_use","tool":"bash","input":"docker logs ...","timestamp":"2026-02-15T10:30:06Z"}
{"type":"tool_result","output":"Error: Cannot find module '@qdrant/js-client-rest'","timestamp":"2026-02-15T10:30:07Z"}
{"type":"assistant","message":"The worker crashed because...","timestamp":"2026-02-15T10:30:10Z"}Chunking rules
- Parse JSONL into conversation turns (user message + assistant response + any tool calls between them = 1 logical turn)
- Target chunk size: 400-600 tokens (recursive character splitting within turns at paragraph/sentence boundaries)
- Small turns (< 100 tokens): merge with adjacent turns up to target size
- Large turns (> 600 tokens): split at paragraph boundaries, then sentence boundaries (the grouping and splitting logic is sketched after this list). Tool outputs over 1000 tokens are truncated to first/last 200 tokens with a `[...truncated...]` marker
- Overlap: 50 tokens between chunks from the same turn split (not between different turns)
- Compaction summaries: treated as single high-value chunks (they’re already distilled)
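A minimal sketch of the turn-grouping and size rules above. The `countTokens` helper is hypothetical (a stand-in for whatever tokenizer the pipeline adopts), and sentence-level splitting plus the 50-token overlap are omitted for brevity:

```typescript
interface Turn {
  text: string; // user message + assistant response + tool calls in between
}

// Hypothetical token counter — stand-in for the pipeline's real tokenizer.
declare function countTokens(text: string): number;

/** Group turns into ~400-600 token chunks: merge small turns, split oversized ones. */
function chunkTurns(turns: Turn[], target = 600): string[] {
  const chunks: string[] = [];
  let buffer = "";

  const flush = () => {
    if (buffer.trim()) chunks.push(buffer.trim());
    buffer = "";
  };

  for (const turn of turns) {
    const tokens = countTokens(turn.text);

    if (tokens > target) {
      // Oversized turn: flush the buffer, then split the turn at paragraph boundaries.
      flush();
      let piece = "";
      for (const para of turn.text.split(/\n{2,}/)) {
        if (piece && countTokens(piece) + countTokens(para) > target) {
          chunks.push(piece.trim());
          piece = "";
        }
        piece += para + "\n\n";
      }
      if (piece.trim()) chunks.push(piece.trim());
    } else if (countTokens(buffer) + tokens <= target) {
      // Small/medium turn: merge with neighbors up to the target size.
      buffer += turn.text + "\n\n";
    } else {
      flush();
      buffer = turn.text + "\n\n";
    }
  }

  flush();
  return chunks;
}
```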
Contextual prefix
Each chunk gets a prefix before embedding (Anthropic contextual retrieval pattern):
[SESSION: {source} | {date} | {session_name}]
[TOPIC: {auto-detected or taxonomy concept labels}]
[FILES: {files_read + files_modified, truncated to top 5}]
This prefix is embedded with the chunk but stored separately in the `context_prefix` payload field so it can be excluded from the returned text.
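A sketch of how the prefix might be assembled and kept out of the stored chunk text. The `embed` call is a stand-in for the local nomic-embed-text subprocess; field names follow the payload schema above:

```typescript
// Stand-in for the local nomic-embed-text subprocess call.
declare function embed(text: string): Promise<number[]>;

interface ChunkMeta {
  source: "pi" | "claude" | "codex";
  date: string;
  sessionName: string;
  topics: string[]; // taxonomy concept labels or auto-detected topics
  files: string[];  // files_read + files_modified
}

function buildContextPrefix(meta: ChunkMeta): string {
  return [
    `[SESSION: ${meta.source} | ${meta.date} | ${meta.sessionName}]`,
    `[TOPIC: ${meta.topics.join(", ")}]`,
    `[FILES: ${meta.files.slice(0, 5).join(", ")}]`, // truncated to top 5
  ].join("\n");
}

async function embedChunk(text: string, meta: ChunkMeta) {
  const context_prefix = buildContextPrefix(meta);
  const vector = await embed(`${context_prefix}\n${text}`); // prefix participates in the embedding...
  return { vector, payload: { text, context_prefix } };     // ...but is stored as a separate field
}
```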
Metadata Extraction
At chunk ingestion time, extract the following from the session content (two of the extractors are sketched after the table):
| Metadata field | Extraction method |
|---|---|
| `files_read[]` | Parse tool_use entries for read, cat, head operations |
| `files_modified[]` | Parse tool_use entries for write, edit, sed, tee operations |
| `vault_notes[]` | Regex: paths matching ~/Vault/ or Vault/ |
| `codebase_paths[]` | Regex: paths matching ~/Code/ or common project directories |
| `slog_tool_refs[]` | Match against known slog tool names |
| `adr_refs[]` | Regex: ADR-\d{4} |
| `concept_ids[]` | Match chunk text against taxonomy altLabels + prefLabels (exact + fuzzy) |
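Two of the simpler extractors, sketched with the patterns from the table (the exact Vault path regex is an assumption about how paths appear in transcripts):

```typescript
/** ADR references: matches "ADR-0021", "ADR-0022", etc. */
function extractAdrRefs(chunk: string): string[] {
  return [...new Set(chunk.match(/ADR-\d{4}/g) ?? [])];
}

/** Vault note paths: anything under ~/Vault/ or a bare Vault/ prefix. */
function extractVaultNotes(chunk: string): string[] {
  const matches = chunk.match(/(?:~\/)?Vault\/[\w\-./ ]+\.md/g) ?? [];
  return [...new Set(matches)];
}
```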
TTL Strategy
Source-based TTL reflecting signal density:
| Source | TTL | Rationale |
|---|---|---|
| Pi sessions | ∞ (no expiry) | Joel’s direct conversations — highest signal, decisions, preferences |
| Claude Code sessions | ∞ (no expiry) | Direct coding sessions — architecture context, debugging insights |
| Codex loop sessions | 30 days | Automated iterations — repetitive, low-signal. 52% of storage, ~10% of unique insights |
Implementation: Codex chunks include ttl_expires_at in payload. A daily Inngest cron function deletes expired points.
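A sketch of the daily cleanup, assuming the Inngest TypeScript SDK and the Qdrant JS client; the app id, function id, and schedule are illustrative, not final:

```typescript
import { Inngest } from "inngest";
import { QdrantClient } from "@qdrant/js-client-rest";

const inngest = new Inngest({ id: "system-bus" });                 // illustrative app id
const qdrant = new QdrantClient({ url: "http://localhost:6333" });

export const sessionTtlCleanup = inngest.createFunction(
  { id: "search-session-ttl-cleanup" },
  { cron: "0 4 * * *" },                                           // daily
  async ({ step }) => {
    const result = await step.run("delete-expired", () =>
      // Assumes ttl_expires_at is stored as a millisecond epoch; only Codex chunks set it.
      qdrant.delete("session_transcripts", {
        wait: true,
        filter: { must: [{ key: "ttl_expires_at", range: { lt: Date.now() } }] },
      })
    );
    return { deleted: result };                                    // surfaced as cleanup stats
  }
);
```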
Ingestion Pipeline
┌─────────────────────────────────┐
│ Inngest Functions │
│ │
File watcher or │ search/session.index.requested │
manual trigger ──→│ ① Find unindexed sessions │
│ ② For each: parse JSONL │
│ ③ Chunk (adaptive recursive) │
│ ④ Extract metadata per chunk │
│ ⑤ Tag with concept_ids │
│ (match against taxonomy) │
│ ⑥ Add context prefix │
│ ⑦ Embed (nomic-embed-text) │
│ ⑧ Upsert to Qdrant │
│ ⑨ Mark session as indexed │
│ (Redis: indexed:{hash}) │
│ │
Vault change or │ search/taxonomy.sync.requested │
manual trigger ──→│ ① Read Resources/taxonomy/*.md │
│ ② Parse frontmatter → SKOS │
│ ③ Embed prefLabel + definition │
│ ④ Upsert to taxonomy_concepts │
│ ⑤ Re-tag affected chunks │
│ (concept label changes) │
│ │
Daily cron ──────→│ search/session.ttl.cleanup │
│ ① Find points where │
│ ttl_expires_at < now │
│ ② Delete expired points │
│ ③ Log cleanup stats │
└─────────────────────────────────┘
Idempotency: Each session is hashed (`sha256(filepath + file_mtime)`). Redis stores `indexed:{hash}` — if the hash exists, the session is skipped. If the file changes (rare), the hash changes and it gets re-indexed.
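A sketch of the idempotency check, assuming an ioredis-style client (the key format comes from the paragraph above):

```typescript
import { createHash } from "node:crypto";
import { stat } from "node:fs/promises";
import Redis from "ioredis";

const redis = new Redis(); // assumes a local Redis instance

/** Hash of path + mtime: changes only when the session file itself changes. */
async function sessionHash(filepath: string): Promise<string> {
  const { mtimeMs } = await stat(filepath);
  return createHash("sha256").update(`${filepath}${mtimeMs}`).digest("hex");
}

async function shouldIndex(filepath: string): Promise<boolean> {
  const key = `indexed:${await sessionHash(filepath)}`;
  return (await redis.exists(key)) === 0;
}

async function markIndexed(filepath: string): Promise<void> {
  await redis.set(`indexed:${await sessionHash(filepath)}`, "1");
}
```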
Embedding
Per ADR-0021, use nomic-ai/nomic-embed-text-v1.5:
- 768 dimensions, Cosine distance
- Local execution via subprocess (no external API dependency)
- Already validated in memory system spikes (0.454 similarity for related queries vs 0.004 for unrelated)
- Matryoshka representation: can truncate to 256/512 dims for storage optimization later (sketched below)
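The Matryoshka property means a stored 768-dim vector can be shortened without re-embedding — truncate, then re-normalize:

```typescript
/** Truncate a 768-dim Matryoshka embedding to its first `dims` components and re-normalize. */
function truncateEmbedding(vector: number[], dims: 256 | 512): number[] {
  const sliced = vector.slice(0, dims);
  const norm = Math.hypot(...sliced);
  return sliced.map((v) => v / norm);
}
```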
Qdrant Collection Configuration
// session_transcripts
{
vectors: { size: 768, distance: "Cosine" },
optimizers_config: {
indexing_threshold: 20000, // delay indexing until 20k points (batch-friendly)
},
  // Payload indexes for filtered search (in Qdrant these are created via separate index-creation calls after the collection exists)
payload_indexes: [
{ field: "source", type: "keyword" },
{ field: "sessionId", type: "keyword" },
{ field: "concept_ids", type: "keyword" }, // array of concept IDs
{ field: "timestamp_start", type: "integer" },
{ field: "files_read", type: "keyword" },
{ field: "files_modified", type: "keyword" },
{ field: "ttl_expires_at", type: "integer" },
],
}
// taxonomy_concepts
{
vectors: { size: 768, distance: "Cosine" },
payload_indexes: [
{ field: "prefLabel", type: "keyword" },
{ field: "altLabels", type: "keyword" },
{ field: "conceptScheme", type: "keyword" },
{ field: "broader", type: "keyword" },
{ field: "narrower", type: "keyword" },
{ field: "related", type: "keyword" },
{ field: "exactMatch.slog_tool", type: "keyword" },
{ field: "exactMatch.vault_project", type: "keyword" },
],
}
Vault Taxonomy Notes
Each concept note in Resources/taxonomy/ follows this template:
---
type: taxonomy-concept
concept_id: "jc:system-bus-worker"
prefLabel: "system-bus worker"
altLabels:
- worker
- event bus worker
- system-bus
hiddenLabels:
- sb-worker
- sysbus
broader:
- "[[agent-infrastructure]]"
narrower:
- "[[inngest-functions]]"
- "[[launchd-plist]]"
related:
- "[[inngest]]"
- "[[docker]]"
- "[[bun]]"
exactMatch:
slog_tool: system-bus-worker
vault_project: "Projects/07-event-bus/"
codebase_path: "packages/system-bus/"
conceptScheme: jc:system
tags:
- taxonomy
---
# System Bus Worker
The Inngest worker process that registers and executes durable functions for the event bus. Runs as a launchd service on the Mac Mini, serving the `/api/inngest` endpoint.
## Scope
Covers the worker process itself, its start script (`start.sh`), the launchd plist (`com.joel.system-bus-worker.plist`), and the serve entrypoint (`src/serve.ts`). Does NOT cover individual Inngest functions (those have their own concepts) or the Inngest server (see [[inngest]]).
## See Also
- [[Projects/07-event-bus/index|Event Bus Project]]
- [[ADR-0022|Webhook to System Event Pipeline]]
Bootstrap: Seed Taxonomy from Existing Data
The initial taxonomy is mined from structured data already in the system:
| Source | Concepts extracted | Method |
|---|---|---|
| Slog tool names | ~30 unique tools | Direct: each tool name → concept |
| Vault Projects | 12 projects | Direct: each project → concept with broader: jc:projects |
| Vault Resources/tools | Tool inventory notes | Parse frontmatter |
| Skills | 18 skills | Direct: each skill → concept |
| Codebase packages | 4 packages | Direct: each → concept |
| ADRs | 23 ADRs | Each ADR topic → concept or concept refinement |
| AGENTS.md tool tables | CLI tools, Mac apps | Parse markdown tables |
Estimated seed size: ~80-100 concepts with hierarchy. Agents expand the taxonomy organically as new concepts emerge in sessions.
Bootstrap process: An Inngest function (search/taxonomy.bootstrap.requested) reads all sources, deduplicates, infers broader/narrower from Vault PARA structure and codebase nesting, generates Vault notes, and upserts to Qdrant.
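The Vault note template above maps to a Qdrant concept via its frontmatter. A sketch of that mapping, used by both the sync and bootstrap functions, assuming a frontmatter parser such as gray-matter (an assumption, not a committed dependency) and leaving wikilink → concept-id resolution simplified:

```typescript
import { readFile } from "node:fs/promises";
import matter from "gray-matter"; // assumed frontmatter parser

/** Strip [[wikilink]] brackets; real id resolution would map note slugs to jc: concept ids. */
const unwikilink = (items: string[] = []) =>
  items.map((s) => s.replace(/^\[\[|\]\]$/g, ""));

async function parseConceptNote(path: string) {
  const { data, content } = matter(await readFile(path, "utf8"));
  return {
    id: data.concept_id,
    prefLabel: data.prefLabel,
    altLabels: data.altLabels ?? [],
    hiddenLabels: data.hiddenLabels ?? [],
    broader: unwikilink(data.broader),
    narrower: unwikilink(data.narrower),
    related: unwikilink(data.related),
    exactMatch: data.exactMatch ?? {},
    conceptScheme: data.conceptScheme,
    vault_note_path: path,
    scopeNote: content.trim(), // simplified — the real sync would pull just the ## Scope section
    source: "curated" as const,
  };
}
```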
Implementation Phases
Phase 1: Collections + Taxonomy Seed (MEM-24 through MEM-28)
- MEM-24: Create `session_transcripts` collection in Qdrant with schema above
- MEM-25: Create `taxonomy_concepts` collection in Qdrant (a collection-creation sketch follows this list)
- MEM-26: Build taxonomy bootstrap function — mine slog, Vault, codebase, skills → seed ~80 concepts
- MEM-27: Create Vault `Resources/taxonomy/` with top ~20 concept notes (human-curated subset)
- MEM-28: Taxonomy sync function — Vault notes → Qdrant (and flag drift)
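A sketch of MEM-24 using the Qdrant JS client. In Qdrant, payload indexes are created with separate calls after the collection exists; `taxonomy_concepts` (MEM-25) follows the same pattern with its own fields:

```typescript
import { QdrantClient } from "@qdrant/js-client-rest";

const client = new QdrantClient({ url: "http://localhost:6333" });

async function createSessionTranscripts() {
  await client.createCollection("session_transcripts", {
    vectors: { size: 768, distance: "Cosine" },
    optimizers_config: { indexing_threshold: 20000 },
  });

  // Payload indexes are separate API calls, one per filtered field.
  const keywordFields = ["source", "sessionId", "concept_ids", "files_read", "files_modified"];
  for (const field of keywordFields) {
    await client.createPayloadIndex("session_transcripts", {
      field_name: field,
      field_schema: "keyword",
    });
  }
  for (const field of ["timestamp_start", "ttl_expires_at"]) {
    await client.createPayloadIndex("session_transcripts", {
      field_name: field,
      field_schema: "integer",
    });
  }
}
```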
Phase 2: Session Ingestion (MEM-29 through MEM-33)
- MEM-29: Session JSONL parser — handle Pi, Claude Code, and Codex formats
- MEM-30: Adaptive chunker — turn-based grouping, recursive splitting, context prefix
- MEM-31: Metadata extractor — files, Vault refs, codebase paths, slog tools, ADR refs
- MEM-32: Concept tagger — match chunk content against taxonomy labels
- MEM-33: Ingestion Inngest function — orchestrate parse → chunk → tag → embed → upsert
Phase 3: Search + Query Expansion (MEM-34 through MEM-37)
- MEM-34: Taxonomy expansion — given a query, find related concepts and expand
- MEM-35: Hybrid search — dense vector + payload filter on concept_ids (expansion + fusion sketched after this list)
- MEM-36: Re-ranking layer — cross-encoder or LLM-based re-rank of top-k
- MEM-37: Reflect tool integration — wire search into ADR-0021 Phase 5 Reflect tool
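A sketch of MEM-34/35: expand the query through the concept graph, then fuse a plain dense pass with a concept-filtered pass via Reciprocal Rank Fusion. The `embed` stand-in, score threshold, and limits are illustrative assumptions:

```typescript
import { QdrantClient } from "@qdrant/js-client-rest";

declare function embed(text: string): Promise<number[]>; // stand-in for local nomic-embed-text
const qdrant = new QdrantClient({ url: "http://localhost:6333" });

/** Find matching concepts by vector, then pull in their broader/related neighbors. */
async function expandConcepts(queryVector: number[]): Promise<string[]> {
  const hits = await qdrant.search("taxonomy_concepts", {
    vector: queryVector,
    limit: 5,
    score_threshold: 0.6, // illustrative cutoff
    with_payload: true,
  });
  const ids = new Set<string>();
  for (const hit of hits) {
    // Concept ids live in the payload (Qdrant point ids must be integers or UUIDs).
    const p = hit.payload as { id: string; broader?: string[]; related?: string[] };
    ids.add(p.id);
    for (const c of [...(p.broader ?? []), ...(p.related ?? [])]) ids.add(c);
  }
  return [...ids];
}

/** Reciprocal Rank Fusion across ranked result lists. */
function rrf(lists: { id: string | number }[][], k = 60) {
  const scores = new Map<string | number, number>();
  for (const list of lists) {
    list.forEach((item, rank) => {
      scores.set(item.id, (scores.get(item.id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id);
}

async function hybridSearch(query: string) {
  const vector = await embed(query);
  const concepts = await expandConcepts(vector);

  const dense = await qdrant.search("session_transcripts", { vector, limit: 50 });
  const conceptFiltered = await qdrant.search("session_transcripts", {
    vector,
    limit: 50,
    filter: { must: [{ key: "concept_ids", match: { any: concepts } }] },
  });

  return rrf([dense, conceptFiltered]); // MEM-36 re-ranking happens downstream
}
```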
Phase 4: Maintenance (MEM-38 through MEM-40)
- MEM-38: TTL cleanup cron — daily deletion of expired Codex session chunks
- MEM-39: Incremental indexing — watch for new sessions, index on arrival (via Inngest event)
- MEM-40: Taxonomy growth — agents propose new concepts during sessions, staged for review
Consequences
Positive
- Searchable sessions: Every conversation, debug session, and decision becomes findable
- Cross-system linking: Vault project ↔ slog tool ↔ codebase path ↔ session topic, connected by shared concept IDs
- Hierarchical retrieval: Search for “infrastructure” and get results about Qdrant, Redis, Inngest, Docker
- Disambiguation: “Worker” resolves to the correct concept in context
- Human-browsable: Vault taxonomy notes give Joel a visual map of the concept graph in Obsidian
- Reflect tool powered: ADR-0021 Phase 5 has a real search backend
- Agent vocabulary: Shared controlled vocabulary prevents terminology drift across sessions and agents
Negative
- Two sources of truth for taxonomy: Qdrant (machine) and Vault (human) can drift. Mitigated by sync function, but requires discipline.
- Taxonomy maintenance: Concepts need curation as the system evolves. Mitigated by agent-proposed growth + human review.
- Storage growth: ~4 GB vectors at 6 months. Acceptable for local SSD, but worth monitoring.
- Embedding compute: ~34k chunks today × embedding time. Local `nomic-embed-text` runs at ~100 chunks/sec on M-series, so the initial backfill takes ~6 minutes; incremental indexing is negligible.
- Complexity: Three Qdrant collections + taxonomy sync + TTL cleanup + multiple Inngest functions. More moving parts than flat vector search.
Neutral
- Does not replace ADR-0021's `memory_observations` — complements it. Observations are distilled intelligence; transcript chunks are raw evidence. Different retrieval patterns, connected by `sessionId` and `concept_ids`.
- Does not require changes to the session-lifecycle extension. Sessions are indexed after the fact by the ingestion pipeline.
- Taxonomy is intentionally SKOS-inspired, not SKOS-compliant. We use the conceptual framework (concepts, labels, hierarchies, mappings) without RDF, OWL, or SPARQL. If interoperability with external SKOS systems becomes needed, the JSON schema maps cleanly to RDF.
Related Literature (pdf-brain @ clanker-001:3847)
The following books and papers in the pdf-brain library directly inform this ADR. Agents can query http://100.95.167.75:3847/search with {"query": "...", "limit": N} for deep-dive content.
| Document | Pages | Key Relevance |
|---|---|---|
| The Accidental Taxonomist, 2nd Ed | 500 | SKOS spec (p.184-389), controlled vocabulary design, faceted taxonomies for retrieval (p.57, p.340), auto-tagging (p.255-283) |
| Building Knowledge Graphs: A Practitioner’s Guide | 291 | KG-enhanced semantic search (p.225-247), entity extraction with NER (p.227), disambiguation |
| Knowledge Graphs: Fundamentals, Techniques and Applications | 679 | Comprehensive KG reference, graph algorithms for retrieval |
| Mem0 paper (2504.19413) | 23 | Dual memory (text + graph), Mem0g entity-relation triples, conflict detection, temporal event graph |
| A-MEM: Agentic Memory for LLM Agents | 28 | Zettelkasten-inspired self-evolving memory, dynamic linking, contextual descriptions per note |
| Temporal KG Architecture for Agent Memory | 12 | Zep/Graphiti, deep memory retrieval (98.2% recall), temporally-aware KGs |
| Generative Agents: Interactive Simulacra | 22 | Memory stream, recency/importance/relevance scoring for retrieval |
| Graph-Based RAG for Global Sensemaking | 26 | Microsoft GraphRAG, community detection + summarization, outperforms vector RAG |
| Chip Huyen: AI Engineering | 1209 | Chunking strategies (p.635), contextual retrieval (p.644), re-ranking |
| Patterns for Building AI Agents | 93 | Agent memory patterns, context engineering, tool selection |
| Principles of Building AI Agents, 2nd Ed | 149 | Working memory (p.40), Mastra agents with persistent memory |
| Information Architecture for the Web and Beyond | 603 | Controlled vocabularies (p.335-388), faceted search, synonym rings |
| MemoryBench | 51 | Memory architecture benchmarks, evaluation criteria |
| Ontology Engineering with Ontology Design Patterns | 389 | Formal ontology design (if SKOS-lite needs upgrading) |
Verification Criteria
- `taxonomy_concepts` collection exists with ≥80 concepts, broader/narrower links, and cross-system exactMatch mappings
- `session_transcripts` collection exists with all 590 current sessions indexed
- Each transcript chunk has a `concept_ids[]` payload linking to the taxonomy
- Taxonomy expansion: query “memory” also retrieves chunks about Qdrant, Redis, embedding — not just literal “memory” mentions
- Codex session chunks include
ttl_expires_atand cleanup cron deletes expired points - Vault
Resources/taxonomy/contains ≥20 curated concept notes with SKOS frontmatter - Taxonomy sync function keeps Vault notes ↔ Qdrant concepts consistent