ADR-0228

Agentic Docs API — Taxonomy Graph, Smart Search, Context Assembly

Status

proposed

Context

The docs-api (apps/docs-api/) currently provides basic CRUD and search over the Typesense docs and docs_chunks collections. It works, but it is a thin Typesense proxy: agents must do all query planning, navigation, and context assembly themselves.

Current corpus: 623 documents, 145K chunks (56K sections + 89K snippets), 10 SKOS concepts with related edges.

What agents actually need for knowledge retrieval:

  1. Ask a question → get assembled, contextual answers with breadth (across books) and depth (section → snippet drill-down)
  2. Navigate the library by concept domain — “show me all the design books”, “what does this library say about systems thinking across domains?”
  3. Cross-concept exploration — related concepts surface unexpected connections (a design book about feedback loops relates to operations/SRE)
  4. Pre-assembled context — don’t make the agent fetch chunk → parent → neighbors in 3 round-trips

What’s missing:

  • No concept graph API — the SKOS taxonomy exists in core-v1.ts but isn’t exposed through the API
  • No concept-scoped search or concept faceting on search results
  • No doc structure browsing (/docs/:id/chunks, /docs/:id/toc)
  • No doc-level search (by title/author/tags/concepts)
  • No context assembly — agents must fetch snippet, then parent section, then neighbors manually
  • No concept expansion — when a query crosses domains, agents can’t discover related concepts
  • Taxonomy is flat — 10 top-level concepts with no broader/narrower hierarchy is too coarse for 623 books
  • No caching — every request hits Typesense directly

Decision

Phase 1: Taxonomy Graph API

Expose the SKOS concept graph as a first-class API surface.

New endpoints:

  • GET /concepts — full concept graph with doc counts per concept, related/broader/narrower links, chunk counts
  • GET /concepts/:id — single concept detail: prefLabel, altLabels, scopeNote, broader/narrower/related, doc count, chunk count, top docs
  • GET /concepts/:id/docs[?page=1][&perPage=20] — all docs in a concept
  • GET /concepts/:id/search?q=<query> — search chunks scoped to a concept

The concept graph is embedded in the binary and held in memory (the taxonomy is small enough: 10 concepts today, under 30 after Phase 4). Doc and chunk counts come from Typesense facet queries, cached with a 5-minute TTL.
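As a rough illustration of the GET /concepts payload, the sketch below merges a static concept list with per-concept counts. The type names, the buildConceptGraph helper, and the two sample entries are assumptions for illustration; the real taxonomy lives in core-v1.ts and may be shaped differently.

```typescript
// Hypothetical shapes; field names follow the SKOS terms used in this ADR.
interface Concept {
  id: string;
  prefLabel: string;
  altLabels: string[];
  scopeNote?: string;
  broader: string[];
  narrower: string[];
  related: string[];
}

interface ConceptCounts {
  docCount: number;
  chunkCount: number;
}

type ConceptWithCounts = Concept & ConceptCounts;

// Two illustrative entries only, not the real taxonomy from core-v1.ts.
const CONCEPTS: Concept[] = [
  { id: "jc:docs:design", prefLabel: "Design", altLabels: [], broader: [], narrower: [], related: ["jc:docs:operations"] },
  { id: "jc:docs:operations", prefLabel: "Operations", altLabels: [], broader: [], narrower: [], related: ["jc:docs:design"] },
];

// Merge the static graph with counts (assumed to come from a cached facet query).
// Concepts without a count entry are reported with zero docs and chunks.
function buildConceptGraph(
  concepts: Concept[],
  counts: Map<string, ConceptCounts>,
): ConceptWithCounts[] {
  return concepts.map((c) => ({
    ...c,
    ...(counts.get(c.id) ?? { docCount: 0, chunkCount: 0 }),
  }));
}
```

Keeping the graph static and the counts separate means only the counts need a TTL; the graph itself changes only on deploy.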

Phase 2: Agentic Search Upgrade

Upgrade GET /search for agentic multi-strategy retrieval:

  • ?concept=<id> — filter by primary concept (Typesense filter_by=primary_concept_id:=<id>)
  • ?concepts=<id1>,<id2> — filter by concept_ids array intersection
  • ?doc_id=<id> — search within a single document
  • ?expand=true — when the primary concept returns fewer than perPage hits, automatically expand to related concepts and mark the expanded results
  • ?assemble=true — for each snippet hit, inline its parent section content (one round-trip context assembly)
  • Response includes conceptFacets[] — which concepts the results span, with counts
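One way the filter parameters above might translate into a Typesense filter_by string is sketched here. The primary_concept_id and concept_ids field names come from this ADR; the doc_id field name, the SearchParams type, and the buildFilterBy helper are assumptions. The sketch also assumes that "intersection" means "matches any of the listed concepts" and wraps values in backticks because concept IDs contain colons.

```typescript
// Hypothetical parsed query parameters for GET /search.
interface SearchParams {
  concept?: string;
  concepts?: string[];
  docId?: string;
}

// Wrap a value in backticks so Typesense treats it as a literal.
// Backticks inside the value are dropped rather than escaped.
const quote = (value: string): string => "`" + value.replace(/`/g, "") + "`";

// Build a filter_by expression; clauses are ANDed together.
function buildFilterBy(params: SearchParams): string {
  const clauses: string[] = [];
  if (params.concept) {
    clauses.push("primary_concept_id:=" + quote(params.concept));
  }
  if (params.concepts && params.concepts.length > 0) {
    // Assumption: a chunk matches if concept_ids contains any listed ID.
    clauses.push("concept_ids:=[" + params.concepts.map(quote).join(",") + "]");
  }
  if (params.docId) {
    // Assumption: chunks carry their document ID in a doc_id field.
    clauses.push("doc_id:=" + quote(params.docId));
  }
  return clauses.join(" && ");
}
```

An empty result string means no filter was requested, so the caller can omit filter_by entirely.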

Search response shape (when assemble=true):

{
  "hits": [{
    "id": "chunk-id",
    "docId": "doc-id",
    "title": "Book Title",
    "chunkType": "snippet",
    "score": 0.85,
    "content": "the snippet text",
    "parentSection": {
      "id": "parent-chunk-id",
      "headingPath": ["Chapter 3", "Feedback Loops"],
      "content": "full section text"
    },
    "headingPath": ["Chapter 3", "Feedback Loops"],
    "conceptIds": ["jc:docs:design", "jc:docs:operations"]
  }],
  "conceptFacets": [
    {"concept": "jc:docs:design", "count": 7, "label": "Design"},
    {"concept": "jc:docs:operations", "count": 3, "label": "Operations"}
  ],
  "expandedConcepts": ["jc:docs:operations"]
}
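The assemble and expand behaviours can be sketched as two small pure functions. This is a minimal sketch under assumptions: the parentId field, the type names, and both helpers are hypothetical, and the parent sections are assumed to have been fetched (or served from cache) before assembly.

```typescript
// Hypothetical internal hit shape; parentId is an assumed field linking a snippet to its section.
interface Hit {
  id: string;
  docId: string;
  chunkType: "section" | "snippet";
  content: string;
  parentId?: string;
  conceptIds: string[];
}

interface ParentSection {
  id: string;
  headingPath: string[];
  content: string;
}

type AssembledHit = Hit & { parentSection?: ParentSection };

// Inline the parent section into each snippet hit. Section hits and snippets
// whose parent is unknown are returned without a parentSection field.
function assembleHits(hits: Hit[], parents: Map<string, ParentSection>): AssembledHit[] {
  return hits.map((hit) => {
    const parent =
      hit.chunkType === "snippet" && hit.parentId ? parents.get(hit.parentId) : undefined;
    return parent ? { ...hit, parentSection: parent } : { ...hit };
  });
}

// Decide which related concepts to query when the primary concept under-fills the page.
function conceptsToExpand(primaryHitCount: number, perPage: number, related: string[]): string[] {
  return primaryHitCount < perPage ? related : [];
}
```

The list returned by conceptsToExpand would populate expandedConcepts in the response, so agents can tell primary hits from expanded ones.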

Phase 3: Doc Structure Browsing

New endpoints for navigating a book’s internal structure:

  • GET /docs/:id/chunks[?type=section|snippet][&page=1][&perPage=50] — paginated chunk list ordered by chunk_index
  • GET /docs/:id/toc — extracted table of contents from heading_path data (deduplicated section heading paths)
  • GET /docs/search?q=<query>[&concept=<id>][&page=1] — search doc metadata (title, filename, tags, summary, concept_ids), not chunks
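The table-of-contents extraction could look roughly like the following: walk section chunks in chunk_index order and emit each distinct heading path once. The SectionChunk and TocEntry types and the buildToc helper are assumptions; only chunk_index and heading_path are named in this ADR.

```typescript
// Hypothetical minimal view of a section chunk.
interface SectionChunk {
  chunk_index: number;
  heading_path: string[];
}

interface TocEntry {
  depth: number;   // 1 for top-level headings
  title: string;   // last element of the heading path
  path: string[];  // full heading path down to this entry
}

// Build a deduplicated TOC in reading order.
function buildToc(chunks: SectionChunk[]): TocEntry[] {
  const seen = new Set<string>();
  const toc: TocEntry[] = [];
  const ordered = [...chunks].sort((a, b) => a.chunk_index - b.chunk_index);

  for (const chunk of ordered) {
    // Emit every prefix so a parent heading appears even if no chunk sits directly under it.
    for (let depth = 1; depth <= chunk.heading_path.length; depth++) {
      const path = chunk.heading_path.slice(0, depth);
      const key = JSON.stringify(path);
      if (seen.has(key)) continue;
      seen.add(key);
      toc.push({ depth, title: path[path.length - 1] ?? "", path });
    }
  }
  return toc;
}
```

Keying on the full path rather than the title keeps two sections with the same heading under different chapters distinct.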

Phase 4: Taxonomy Deepening

The current taxonomy has 10 top-level concepts with no hierarchy. For a 623-book library, this is too coarse — “programming” covers everything from Rust to distributed systems to compilers.

Add sub-concepts with proper broader/narrower relationships:

jc:docs:programming
  ├── jc:docs:programming:systems       (distributed systems, databases, networking)
  ├── jc:docs:programming:languages     (Rust, TypeScript, language design)
  ├── jc:docs:programming:architecture  (patterns, DDD, clean arch)
  └── jc:docs:programming:devtools      (editors, build systems, CLI design)
 
jc:docs:design
  ├── jc:docs:design:game               (game design, play, interactivity)
  ├── jc:docs:design:systems            (systems thinking, complexity, emergence)
  ├── jc:docs:design:product            (product design, UX, interaction)
  └── jc:docs:design:visual             (graphic design, typography, layout)
 
jc:docs:education
  ├── jc:docs:education:learning-science (cognitive science, memory, transfer)
  ├── jc:docs:education:pedagogy         (instructional design, UbD, curriculum)
  └── jc:docs:education:media            (multimedia learning, video, course design)
 
jc:docs:business
  ├── jc:docs:business:strategy         (competitive strategy, positioning)
  ├── jc:docs:business:operations       (management, teams, execution)
  └── jc:docs:business:creator          (creator economy, indie business, audience)
 
jc:docs:ai
  ├── jc:docs:ai:agents                 (autonomous agents, tool use, planning)
  ├── jc:docs:ai:ml                     (machine learning, training, models)
  └── jc:docs:ai:applied                (RAG, embeddings, production AI)

Each sub-concept links to its parent through broader, and the parent lists it under narrower. Agents can search at any level: a search for jc:docs:design automatically includes all of its narrower concepts.
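Including narrower concepts in a search amounts to a graph walk over the narrower links. The sketch below shows one way to do it; the NarrowerIndex type and the withNarrower helper are hypothetical, and the sample data reuses the design sub-concepts listed above.

```typescript
// Hypothetical index: concept ID -> IDs of its direct narrower concepts.
type NarrowerIndex = Map<string, string[]>;

// Return the concept plus all transitive narrower concepts.
// The seen set guards against cycles in a malformed taxonomy.
function withNarrower(conceptId: string, narrower: NarrowerIndex): string[] {
  const result: string[] = [];
  const seen = new Set<string>();
  const stack: string[] = [conceptId];

  while (stack.length > 0) {
    const id = stack.pop() as string;
    if (seen.has(id)) continue;
    seen.add(id);
    result.push(id);
    for (const child of narrower.get(id) ?? []) {
      stack.push(child);
    }
  }
  return result;
}

// Sample data taken from the design branch of the tree above.
const NARROWER: NarrowerIndex = new Map([
  ["jc:docs:design", [
    "jc:docs:design:game",
    "jc:docs:design:systems",
    "jc:docs:design:product",
    "jc:docs:design:visual",
  ]],
]);
```

The resulting ID list can feed a concept_ids filter, so a search on a parent concept also matches documents classified only under a sub-concept.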

Reclassification: Fire a reindex event (docs/reindex.requested) after taxonomy update. The LLM classifier gets the expanded concept list; existing rules+aliases get new entries.

Phase 5: Smart Caching

  • Concept graph: in-memory, rebuilt on startup (10-30 concepts, trivially small)
  • Concept facet counts: cached 5 minutes (Typesense facet query)
  • Doc metadata by ID: LRU cache, 100 entries, 10-min TTL
  • Parent section content: LRU cache by chunk_id, 500 entries, 10-min TTL (for assemble=true)
  • Response-level: Cache-Control headers for agent HTTP caches
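The doc-metadata and parent-section caches above both combine an entry cap with a TTL. A minimal sketch of such a cache follows, relying on Map preserving insertion order; the LruTtlCache class and its injectable clock are assumptions, not an existing module in docs-api.

```typescript
// Minimal LRU cache with per-entry TTL. Map iteration order is insertion order,
// so the first key is always the least recently used one.
class LruTtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly maxEntries: number;
  private readonly ttlMs: number;
  private readonly now: () => number;

  // The clock is injectable so expiry can be tested without waiting.
  constructor(maxEntries: number, ttlMs: number, now: () => number = () => Date.now()) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired
      return undefined;
    }
    // Re-insert to mark the entry as most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    if (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }

  // Intended for reindex events, when cached content may be stale.
  clear(): void {
    this.entries.clear();
  }
}
```

Under these assumptions, the parent-section cache would be new LruTtlCache(500, 10 * 60 * 1000), and a reindex handler would call clear() on it.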

Phase 6: PDF Extraction Upgrade (marker-pdf)

Put marker-pdf in front of the current pypdf → strings -n 6 extraction chain, so that structured markdown is the default and the existing extractors remain as fallbacks:

  • marker_single input.pdf --output_format markdown --disable_image_extraction
  • Produces structured markdown with real # headings, tables, equation handling
  • Requires Python 3.10-3.12 + PyTorch in worker image
  • Fallback chain: marker → pypdf → strings
  • Marker’s markdown output gives heading detection for free (Phase 3 of the chunking improvements)
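The fallback chain can be expressed as an ordered list of extractors tried in turn. The sketch below is in TypeScript for consistency with the API code, although the worker itself may be written in another language; the Extractor interface and extractWithFallback helper are hypothetical, and each run function is assumed to wrap one of the real tools (marker_single, pypdf, strings -n 6).

```typescript
// Hypothetical extractor: returns extracted text or throws on failure.
interface Extractor {
  name: string;
  run: (pdfPath: string) => string;
}

interface ExtractionResult {
  extractor: string; // which extractor produced the text
  text: string;
}

// Try each extractor in order; fall through on errors or effectively empty output.
function extractWithFallback(
  pdfPath: string,
  chain: Extractor[],
  minChars = 1,
): ExtractionResult {
  const failures: string[] = [];
  for (const extractor of chain) {
    try {
      const text = extractor.run(pdfPath);
      if (text.trim().length >= minChars) {
        return { extractor: extractor.name, text };
      }
      failures.push(extractor.name + ": empty output");
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      failures.push(extractor.name + ": " + reason);
    }
  }
  throw new Error("all extractors failed for " + pdfPath + ": " + failures.join("; "));
}
```

Recording which extractor succeeded lets downstream chunking decide whether real markdown headings are available or whether it is working from raw text.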

Consequences

Positive

  • Agents get assembled, contextual answers in one round-trip
  • Concept graph enables cross-domain exploration — “what connects design and operations in this library?”
  • Deeper taxonomy makes retrieval more precise — searching “game design” doesn’t pull in Rust books
  • Caching makes hot paths fast enough for interactive agent use
  • marker-pdf extraction yields structured markdown (real headings, tables, equations), which improves chunk quality for complex PDFs

Negative

  • Taxonomy deepening requires reclassifying all 623 docs, which incurs LLM cost for the Haiku classifier calls
  • marker-pdf adds ~2-3GB of ML models to the worker image
  • More API surface to maintain
  • Cache invalidation on reindex needs careful handling

Risks

  • Sub-concept boundaries are judgment calls — “is this book systems or architecture?”
  • marker-pdf may be slow on large PDFs without GPU
  • Concept expansion could return too many results if related graph is too connected

Implementation Order

  1. Phase 1 (Taxonomy Graph API) — exposes what already exists, zero risk
  2. Phase 2 (Agentic Search) — biggest agent UX improvement
  3. Phase 3 (Doc Structure) — enables book browsing
  4. Phase 4 (Taxonomy Deepening) — quality improvement, needs LLM reclassification
  5. Phase 5 (Caching) — can be woven into phases 1-3
  6. Phase 6 (marker-pdf) — independent, can run in parallel

References

  • ADR-0109: System-Wide Taxonomy + Concept Contract
  • ADR-0153: Docs REST HTTP API (current implementation)
  • ADR-0164: Mandatory Taxonomy Classification
  • Anthropic Contextual Retrieval (2024): contextualize chunks before embedding
  • arXiv:2603.06976 (Mar 2026): Paragraph Group Chunking = SOTA for structured docs
  • arXiv:2510.20356: FreeChunker cross-granularity retrieval