# Agentic Docs API — Taxonomy Graph, Smart Search, Context Assembly
## Status

Proposed
## Context
The docs-api (`apps/docs-api/`) currently provides basic CRUD+search over the Typesense `docs` and `docs_chunks` collections. It works, but it’s a thin Typesense proxy — agents must do all the intelligence themselves.
Current corpus: 623 documents, 145K chunks (56K sections + 89K snippets), 10 SKOS concepts with related edges.
What agents actually need for knowledge retrieval:
- Ask a question → get assembled, contextual answers with breadth (across books) and depth (section → snippet drill-down)
- Navigate the library by concept domain — “show me all the design books”, “what does this library say about systems thinking across domains?”
- Cross-concept exploration — related concepts surface unexpected connections (a design book about feedback loops relates to operations/SRE)
- Pre-assembled context — don’t make the agent fetch chunk → parent → neighbors in 3 round-trips
What’s missing:
- No concept graph API — the SKOS taxonomy exists in `core-v1.ts` but isn’t exposed through the API
- No concept-scoped search or concept faceting on search results
- No doc structure browsing (`/docs/:id/chunks`, `/docs/:id/toc`)
- No doc-level search (by title/author/tags/concepts)
- No context assembly — agents must fetch snippet, then parent section, then neighbors manually
- No concept expansion — when a query crosses domains, agents can’t discover related concepts
- Taxonomy is flat — 10 top-level concepts with no `broader`/`narrower` hierarchy is too coarse for 623 books
- No caching — every request hits Typesense directly
## Decision

### Phase 1: Taxonomy Graph API
Expose the SKOS concept graph as a first-class API surface.
New endpoints:

- `GET /concepts` — full concept graph with doc counts per concept, related/broader/narrower links, chunk counts
- `GET /concepts/:id` — single concept detail: prefLabel, altLabels, scopeNote, broader/narrower/related, doc count, chunk count, top docs
- `GET /concepts/:id/docs[?page=1][&perPage=20]` — all docs in a concept
- `GET /concepts/:id/search?q=<query>` — search chunks scoped to a concept
Concept graph is embedded in the binary (the 10-concept taxonomy is small enough). Doc/chunk counts are facet queries to Typesense cached with 5-minute TTL.
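The 5-minute TTL caching of facet counts could look like the following sketch. The `fetchFacetCounts` function is a stand-in for the real Typesense facet query (e.g. a wildcard search faceted on the concept field); the helper itself is illustrative, not the actual implementation.

```typescript
// Sketch: wrap a facet-count fetch in a 5-minute TTL cache.
// `fetchFacetCounts` is a hypothetical stand-in for the real
// Typesense facet query; only the caching pattern is shown.
type FacetCounts = Record<string, number>;

function cachedFacetCounts(
  fetchFacetCounts: () => Promise<FacetCounts>,
  ttlMs = 5 * 60 * 1000, // 5-minute TTL, per this ADR
): () => Promise<FacetCounts> {
  let cached: { value: FacetCounts; expiresAt: number } | null = null;
  return async () => {
    const now = Date.now();
    if (cached && now < cached.expiresAt) return cached.value;
    const value = await fetchFacetCounts();
    cached = { value, expiresAt: now + ttlMs };
    return value;
  };
}
```

Callers hold the returned function and invoke it per request; Typesense is only hit once per TTL window.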
### Phase 2: Agentic Search Upgrade
Upgrade `GET /search` for agentic multi-strategy retrieval:
- `?concept=<id>` — filter by primary concept (Typesense `filter_by=primary_concept_id:=<id>`)
- `?concepts=<id1>,<id2>` — filter by concept_ids array intersection
- `?doc_id=<id>` — search within a single document
- `?expand=true` — when primary concept hits < perPage, auto-expand to related concepts and mark expanded results
- `?assemble=true` — for each snippet hit, inline its parent section content (one round-trip context assembly)
- Response includes `conceptFacets[]` — which concepts the results span, with counts
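Translating these query params into a Typesense `filter_by` string is straightforward; a minimal sketch, assuming the field names from this ADR (`primary_concept_id`, `concept_ids`, `doc_id`) and omitting value escaping:

```typescript
// Sketch: build a Typesense filter_by string from the search params above.
// Field names follow this ADR; escaping of values is left out for brevity.
interface SearchParams {
  concept?: string;    // ?concept=<id>
  concepts?: string[]; // ?concepts=<id1>,<id2>
  docId?: string;      // ?doc_id=<id>
}

function buildFilterBy({ concept, concepts, docId }: SearchParams): string {
  const clauses: string[] = [];
  if (concept) clauses.push(`primary_concept_id:=${concept}`);
  if (concepts?.length) clauses.push(`concept_ids:=[${concepts.join(",")}]`);
  if (docId) clauses.push(`doc_id:=${docId}`);
  return clauses.join(" && ");
}

// buildFilterBy({ concept: "jc:docs:design", docId: "doc-1" })
// → "primary_concept_id:=jc:docs:design && doc_id:=doc-1"
```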
Search response shape (when `assemble=true`):

```json
{
  "hits": [{
    "id": "chunk-id",
    "docId": "doc-id",
    "title": "Book Title",
    "chunkType": "snippet",
    "score": 0.85,
    "content": "the snippet text",
    "parentSection": {
      "id": "parent-chunk-id",
      "headingPath": ["Chapter 3", "Feedback Loops"],
      "content": "full section text"
    },
    "headingPath": ["Chapter 3", "Feedback Loops"],
    "conceptIds": ["jc:docs:design", "jc:docs:operations"]
  }],
  "conceptFacets": [
    {"concept": "jc:docs:design", "count": 7, "label": "Design"},
    {"concept": "jc:docs:operations", "count": 3, "label": "Operations"}
  ],
  "expandedConcepts": ["jc:docs:operations"]
}
```

### Phase 3: Doc Structure Browsing
New endpoints for navigating a book’s internal structure:
- `GET /docs/:id/chunks[?type=section|snippet][&page=1][&perPage=50]` — paginated chunk list ordered by chunk_index
- `GET /docs/:id/toc` — extracted table of contents from heading_path data (deduplicated section heading paths)
- `GET /docs/search?q=<query>[&concept=<id>][&page=1]` — search doc metadata (title, filename, tags, summary, concept_ids), not chunks
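The TOC extraction described above amounts to deduplicating heading paths in chunk order. A sketch, assuming the chunk field names used in this ADR (`chunk_index`, `heading_path`):

```typescript
// Sketch: derive a table of contents from section chunks by deduplicating
// heading paths while preserving chunk_index order.
interface SectionChunk {
  chunk_index: number;
  heading_path: string[];
}

function extractToc(chunks: SectionChunk[]): string[][] {
  const seen = new Set<string>();
  const toc: string[][] = [];
  const ordered = [...chunks].sort((a, b) => a.chunk_index - b.chunk_index);
  for (const chunk of ordered) {
    const key = chunk.heading_path.join(" > ");
    if (!seen.has(key)) {
      seen.add(key);
      toc.push(chunk.heading_path);
    }
  }
  return toc;
}
```

Because snippets repeat their section’s heading path, deduplication collapses them into one TOC entry per section.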
### Phase 4: Taxonomy Deepening
The current taxonomy has 10 top-level concepts with no hierarchy. For a 623-book library, this is too coarse — “programming” covers everything from Rust to distributed systems to compilers.
Add sub-concepts with proper broader/narrower relationships:
```
jc:docs:programming
├── jc:docs:programming:systems (distributed systems, databases, networking)
├── jc:docs:programming:languages (Rust, TypeScript, language design)
├── jc:docs:programming:architecture (patterns, DDD, clean arch)
└── jc:docs:programming:devtools (editors, build systems, CLI design)

jc:docs:design
├── jc:docs:design:game (game design, play, interactivity)
├── jc:docs:design:systems (systems thinking, complexity, emergence)
├── jc:docs:design:product (product design, UX, interaction)
└── jc:docs:design:visual (graphic design, typography, layout)

jc:docs:education
├── jc:docs:education:learning-science (cognitive science, memory, transfer)
├── jc:docs:education:pedagogy (instructional design, UbD, curriculum)
└── jc:docs:education:media (multimedia learning, video, course design)

jc:docs:business
├── jc:docs:business:strategy (competitive strategy, positioning)
├── jc:docs:business:operations (management, teams, execution)
└── jc:docs:business:creator (creator economy, indie business, audience)

jc:docs:ai
├── jc:docs:ai:agents (autonomous agents, tool use, planning)
├── jc:docs:ai:ml (machine learning, training, models)
└── jc:docs:ai:applied (RAG, embeddings, production AI)
```

Sub-concepts inherit the parent’s broader link. Agents can search at any level — a search for `jc:docs:design` automatically includes all narrower concepts.
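The “search at any level” behavior reduces to expanding a concept id into itself plus its transitive narrower descendants before building the search filter. A sketch, assuming the graph is held as a `narrower` adjacency map (parent id → child ids):

```typescript
// Sketch: expand a concept id to itself plus all transitive narrower
// concepts, so a search scoped to jc:docs:design also covers
// jc:docs:design:game, jc:docs:design:systems, and so on.
function expandNarrower(
  id: string,
  narrower: Map<string, string[]>,
): string[] {
  const result: string[] = [];
  const stack = [id];
  const seen = new Set<string>();
  while (stack.length > 0) {
    const current = stack.pop()!;
    if (seen.has(current)) continue;
    seen.add(current);
    result.push(current);
    stack.push(...(narrower.get(current) ?? []));
  }
  return result;
}
```

The `seen` set guards against cycles, which should not occur in a well-formed SKOS hierarchy but cost nothing to defend against.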
Reclassification: Fire a reindex event (`docs/reindex.requested`) after the taxonomy update. The LLM classifier gets the expanded concept list; existing rules+aliases get new entries.
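A possible shape for that event is sketched below. The event name comes from this ADR; the payload fields and the `reindexEvent` helper are assumptions, not the actual contract.

```typescript
// Sketch: a hypothetical payload for the docs/reindex.requested event
// fired after a taxonomy update. Only the event name is from this ADR;
// the field names are illustrative.
interface ReindexRequested {
  name: "docs/reindex.requested";
  data: {
    reason: "taxonomy-updated";
    conceptIds: string[]; // expanded concept list handed to the LLM classifier
  };
}

function reindexEvent(conceptIds: string[]): ReindexRequested {
  return {
    name: "docs/reindex.requested",
    data: { reason: "taxonomy-updated", conceptIds },
  };
}
```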
### Phase 5: Smart Caching
- Concept graph: in-memory, rebuilt on startup (10-30 concepts, trivially small)
- Concept facet counts: cached 5 minutes (Typesense facet query)
- Doc metadata by ID: LRU cache, 100 entries, 10-min TTL
- Parent section content: LRU cache by chunk_id, 500 entries, 10-min TTL (for `assemble=true`)
- Response-level: `Cache-Control` headers for agent HTTP caches
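The LRU-with-TTL caches above can be built on `Map`’s insertion-order iteration, with no external dependency. A minimal sketch (the class name and API are illustrative; sizes and TTLs would be the ones listed above):

```typescript
// Sketch: minimal LRU cache with TTL, using Map's insertion order
// for eviction. Expired entries are dropped lazily on read.
class LruTtlCache<V> {
  private map = new Map<string, { value: V; expiresAt: number }>();

  constructor(private maxEntries: number, private ttlMs: number) {}

  get(key: string): V | undefined {
    const entry = this.map.get(key);
    if (!entry) return undefined;
    if (Date.now() >= entry.expiresAt) {
      this.map.delete(key);
      return undefined;
    }
    // Re-insert to mark as most recently used.
    this.map.delete(key);
    this.map.set(key, entry);
    return entry.value;
  }

  set(key: string, value: V): void {
    if (this.map.size >= this.maxEntries && !this.map.has(key)) {
      // Evict least recently used: the first key in insertion order.
      const oldest = this.map.keys().next().value;
      if (oldest !== undefined) this.map.delete(oldest);
    }
    this.map.delete(key);
    this.map.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}
```

For example, the parent-section cache would be `new LruTtlCache<string>(500, 10 * 60 * 1000)` keyed by chunk_id.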
### Phase 6: PDF Extraction Upgrade (marker-pdf)
Replace the `pypdf` → `strings -n 6` fallback with marker-pdf for dramatically better text extraction:
- `marker_single input.pdf --output_format markdown --disable_image_extraction`
- Produces structured markdown with real `#` headings, tables, equation handling
- Requires Python 3.10-3.12 + PyTorch in worker image
- Fallback chain: marker → pypdf → strings
- Marker’s markdown output gives heading detection for free (Phase 3 of the chunking improvements)
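The fallback chain is a generic try-in-order pattern; a sketch with the extractor functions as injected stand-ins for whatever the worker actually shells out to (`marker_single`, pypdf, `strings -n 6`):

```typescript
// Sketch: try extractors in order, falling back on error or empty output.
// The extractor functions are hypothetical stand-ins for the real
// marker/pypdf/strings invocations.
type Extractor = (pdfPath: string) => Promise<string>;

async function extractWithFallback(
  pdfPath: string,
  extractors: [name: string, run: Extractor][],
): Promise<{ via: string; text: string }> {
  let lastError: unknown;
  for (const [name, run] of extractors) {
    try {
      const text = await run(pdfPath);
      if (text.trim().length > 0) return { via: name, text };
    } catch (err) {
      lastError = err; // fall through to the next extractor
    }
  }
  throw new Error(`all extractors failed for ${pdfPath}: ${String(lastError)}`);
}
```

Recording which extractor succeeded (`via`) is useful for monitoring how often marker falls back to the lower-quality paths.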
## Consequences

### Positive
- Agents get assembled, contextual answers in one round-trip
- Concept graph enables cross-domain exploration — “what connects design and operations in this library?”
- Deeper taxonomy makes retrieval more precise — searching “game design” doesn’t pull in Rust books
- Caching makes hot paths fast enough for interactive agent use
- marker-pdf extraction dramatically improves chunk quality for complex PDFs
### Negative
- Taxonomy deepening requires reclassifying all 623 docs (LLM cost for the Haiku classification calls)
- marker-pdf adds ~2-3GB of ML models to the worker image
- More API surface to maintain
- Cache invalidation on reindex needs careful handling
## Risks
- Sub-concept boundaries are judgment calls — “is this book systems or architecture?”
- marker-pdf may be slow on large PDFs without GPU
- Concept expansion could return too many results if the related-concept graph is too densely connected
## Implementation Order
1. Phase 1 (Taxonomy Graph API) — exposes what already exists, zero risk
2. Phase 2 (Agentic Search) — biggest agent UX improvement
3. Phase 3 (Doc Structure) — enables book browsing
4. Phase 4 (Taxonomy Deepening) — quality improvement, needs LLM reclassification
5. Phase 5 (Caching) — can be woven into phases 1-3
6. Phase 6 (marker-pdf) — independent, can run in parallel
## References
- ADR-0109: System-Wide Taxonomy + Concept Contract
- ADR-0153: Docs REST HTTP API (current implementation)
- ADR-0164: Mandatory Taxonomy Classification
- Anthropic Contextual Retrieval (2024): contextualize chunks before embedding
- arXiv:2603.06976 (Mar 2026): Paragraph Group Chunking = SOTA for structured docs
- arXiv:2510.20356: FreeChunker cross-granularity retrieval