System-Wide Taxonomy + Concept Contract (No Tag Soup)
Status
proposed
Context
Taxonomy exists in multiple places today, but not as one system contract:
- memory has SKOS-lite category IDs (ADR-0095),
- docs migration currently routes by coarse folder categories (ADR-0105 Phase 0),
- free-form tags still exist in several surfaces,
- graph work is deferred (ADR-0099, ADR-0100) and needs stable concept IDs first.
Legacy pdf-brain data proves the risk of ungoverned growth:
concepts: 1643concept_hierarchyedges: 24- most concepts are leaf-like and weakly connected
document_conceptsmostly came from backfill heuristics, not curated ontology decisions
This is useful evidence and alias material, but not a canonical taxonomy source.
Without one shared concept contract, we get tag soup:
- drift between domains (
memory,docs,projects,people,events), - brittle recall/routing behavior,
- weak cross-domain graph mesh.
Decision
Adopt a single system-wide concept taxonomy contract across joelclaw.
- Concepts are first-class and versioned.
- Every major entity and event carries concept IDs.
- Free-form tags are secondary metadata and must map to concepts before they affect retrieval/routing.
- Graph relations and dual-search build on concept IDs, not raw tags.
Canonical Taxonomy Surfaces
Canonical source lives in the monorepo and is mirrored to read-only consumers:
packages/system-bus/src/taxonomy/core-v1.*packages/system-bus/src/taxonomy/aliases-v1.*packages/system-bus/src/taxonomy/mappings-v1.*
Mirror targets:
- CLI/runtime modules in
packages/cliandapps/web - Vault documentation snapshots (optional, generated)
Legacy sources (including ~/Documents/.pdf-library/library.db) feed candidate aliases and proposals only.
Concept Contract
Concept record
id(immutable concept ID)prefLabelaltLabels[]broader[]/narrower[]/related[]scopeNotetaxonomy_versionstate(canonical|candidate|deprecated)
Entity fields (minimum)
primary_concept_id(optional)concept_ids[]concept_source(rules|llm|backfill|manual)taxonomy_version
Event metadata (minimum)
concept_ids[]primary_concept_id(optional)taxonomy_version
Retrieval evidence metadata (required for docs/memory text units)
For any retrievable evidence unit (memory observation, transcript chunk, docs chunk):
context_prefix(short structural context used for retrieval text)source_entity_id(document/session/source id)evidence_tier(snippet|section|summary|observation)parent_evidence_id(required for snippet-tier units)
This keeps retrieval explainable and lets systems expand from precise hits to fuller context deterministically.
Anti-Tag-Soup Guardrails
- No unbounded auto-creation of canonical concepts in hot ingest paths.
- New concepts enter as
candidatefirst, with dedupe + review. - Retrieval/routing features read concept IDs, not raw tags.
- Concept growth is measured and capped by policy until governance matures.
- Every alias must resolve to exactly one canonical concept per taxonomy version.
Rollout Plan
Phase 0: Contract + Freeze
- Publish canonical concept schema and alias schema.
- Freeze direct canonical writes from auto-taggers.
- Import curated base from
pdf-brain/data/taxonomy.json. - Build alias table from legacy
library.dbleaves and current memory/doc categories.
Phase 1: Docs + Memory Alignment
- ADR-0105 docs ingest writes concept fields (
primary_concept_id,concept_ids[], version/source). - ADR-0095 memory writes align to shared concept contract and aliases.
- Docs ingest emits hierarchical evidence (
section+snippet) with concept propagation and parent links. - Keep storage routing separate from semantic taxonomy (filesystem categories are operational, not ontology).
Phase 2: Cross-Domain Coverage
Add concept IDs to:
- memory observations and summaries
- docs records and chunk records
- docs distillates (narrative + retained facts linked back to evidence chunk ids)
- system log events
- discovery records
- people dossiers
- project/task records where applicable
Phase 3: Graph Substrate Activation
Use shared concept IDs to activate ADR-0099 relation substrate:
- relation extraction
- relation health checks
- conflict/drift controls
Phase 4: Dual Search Activation
Activate ADR-0100 only after measurable gains from concept-grounded graph retrieval.
Implementation Backlog
packages/system-bus
- Create taxonomy core modules:
src/taxonomy/core-v1.ts(canonical concepts)src/taxonomy/aliases-v1.ts(alias map)src/taxonomy/mappings-v1.ts(domain/category mappings)
- Add shared resolver helpers:
src/taxonomy/resolve.ts(label/tag -> concept_id)src/taxonomy/validate.ts(alias uniqueness + version checks)
- Wire docs ingest/enrich:
- update ADR-0105 paths to persist
primary_concept_id,concept_ids[],concept_source,taxonomy_version. - ensure hierarchical docs chunk records (
snippet|section) includeevidence_tier,parent_evidence_id,context_prefix.
- update ADR-0105 paths to persist
- Wire memory ingest:
- align ADR-0095 flow to use shared resolver; reject or flag unmapped labels.
- Emit OTEL taxonomy diagnostics on write paths:
- mapped count, unmapped count, alias hit rate, taxonomy version.
packages/cli
- Add concept-aware query filters for:
joelclaw recalljoelclaw docs search
- Add taxonomy diagnostics commands (or subcommands under existing surfaces):
- coverage by domain
- unmapped tag report
- alias collision report
- Ensure command outputs follow ADR-0093 navigation contract with next actions.
- Add retrieval context expansion surfaces shared by docs/memory:
- deterministic expansion modes (
snippet-window,parent-section,section-neighborhoodfor docs analogs) - per-response evidence attribution (
source,evidence_tier,chunk_id/observation_id)
- deterministic expansion modes (
apps/web
- Add concept facet filters to memory/docs system views.
- Show canonical concept labels (not raw tags) in detail panels.
- Add taxonomy version visibility in owner/system views for debugging drift.
Vault/docs + governance
- Keep ADR links synchronized (
0095,0099,0100,0105,0109). - Add a recurring taxonomy governance review note template:
- new candidates
- alias conflicts
- deprecated concepts
- cross-domain drift summary.
Legacy migration tasks
- Import seed concepts from
pdf-brain/data/taxonomy.json. - Build candidate alias set from legacy
~/Documents/.pdf-library/library.dbconcept leaves. - Do not auto-promote legacy candidates to canonical without review.
Phase Exit Criteria
Phase 0 exit
- Shared taxonomy modules exist and are imported by memory/docs write paths.
- Canonical auto-create is disabled in hot ingest routes.
- Seed taxonomy is loaded and versioned.
Phase 1 exit
- New docs + memory writes include concept fields and taxonomy version.
- Resolver coverage for docs + memory is >=95%.
- Unmapped writes are observable and queryable in OTEL.
- Docs chunk entities preserve parent links and expose deterministic expansion behavior.
Phase 2 exit
- Concept fields are present in at least 4 domains (memory, docs, system events, one of people/projects/discovery).
- Cross-domain recall/query by
concept_idreturns mixed-source hits.
Phase 3 exit
- ADR-0099 activation gates pass with shared concept IDs.
- Relation extraction pilot runs with health metrics and drift/conflict reporting.
Phase 4 exit
- ADR-0100 benchmark gate passes (>=10% retrieval quality lift on hard queries).
- Dual search fallback to vector-only is deterministic and telemetry-backed.
Acceptance Criteria
-
concept_ids[]coverage >= 95% for new memory + docs records. - Unmapped tags in write paths <= 2% (weekly).
- Concept IDs are queryable across at least 4 domains.
- Cross-domain recall can pivot by concept ID and return mixed-source hits.
- Concept growth rate and alias drift are observable in OTEL/health output.
- Docs/memory retrieval responses include evidence attribution and expansion path metadata.
Verification Commands
joelclaw otel search "taxonomy_version|concept_ids|primary_concept_id" --hours 24joelclaw otel search "evidence_tier|parent_evidence_id|context_prefix" --hours 24joelclaw otel stats --hours 24joelclaw recall "<query>" --jsonjoelclaw docs search "<query>" --jsonjoelclaw inngest memory-health --hours 24 --json
Non-Goals
- Adopting a dedicated graph database now.
- Replacing all legacy tags in one migration step.
- Perfect ontology design before shipping iterative improvements.
Consequences
Positive
- Shared semantics across memory, docs, and future graph retrieval.
- Less drift and better cross-domain recall precision.
- Clear path from taxonomy to graph mesh without tag explosion.
Negative / Risks
- Governance overhead (proposal review, alias curation, deprecation policy).
- Upfront migration cost to normalize legacy tags and categories.
References
- ADR-0024: Taxonomy-Enhanced Session Search (historical strategy)
- ADR-0095: Typesense-Native Memory Categories
- ADR-0099: Memory Knowledge-Graph Substrate
- ADR-0100: Memory Dual Search Activation Plan
- ADR-0105: Joelclaw PDF Brain