ADR-0229

Markdown-Aware Chunking Pipeline

Status

proposed

Context

The chunking pipeline (packages/system-bus/src/lib/book-chunk.ts) currently operates on flat extracted text. It uses heuristic heading detection (isHeadingLine()) based on regex patterns (ALL CAPS, title case, numbered lines, chapter/section prefixes) and splits by paragraph boundaries with token estimation via words * 1.25.

What works well (keep)

  • Two-tier chunking (sections ~1700 tokens, snippets ~420 tokens) with parent-child links — this IS hierarchical/parent-child chunking, confirmed as SOTA by March 2026 arXiv benchmarks
  • Overlap between adjacent chunks (120/80 tokens) — 10-20% overlap is the confirmed sweet spot
  • Context prefix from heading path — this IS contextual retrieval per Anthropic’s recommendation
  • Retrieval text with [DOC:][PATH:][CONCEPTS:] prefix — contextualizes chunks for embedding quality
  • splitByTokens() respects paragraph boundaries, splits long paragraphs by sentence — this is recursive splitting, the proven general-purpose default
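The `[DOC:][PATH:][CONCEPTS:]` retrieval-text prefix mentioned above can be sketched roughly as follows. This is an illustrative assumption, not the actual `book-chunk.ts` implementation: the interface shape, field names, and separator choices are hypothetical.

```typescript
// Sketch (hypothetical shapes): building retrieval text with the
// [DOC:][PATH:][CONCEPTS:] prefix so the embedding sees document context.
interface ChunkInput {
  docTitle: string;
  headingPath: string[]; // e.g. ["Part II", "Replication"]
  concepts: string[];    // extracted keywords for this chunk
  body: string;
}

function buildRetrievalText(chunk: ChunkInput): string {
  const prefix = [
    `[DOC:${chunk.docTitle}]`,
    `[PATH:${chunk.headingPath.join(" > ")}]`,
    `[CONCEPTS:${chunk.concepts.join(", ")}]`,
  ].join("");
  return `${prefix}\n${chunk.body}`;
}
```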

What’s broken

  1. Heading detection is heuristic-only on flat text. isHeadingLine() guesses based on formatting patterns. Works for well-structured technical books, fails badly on:

    • Game design books (Schell, Koster) with creative formatting
    • Books with sidebars, callout boxes, epigraphs
    • Books where section titles are full sentences
    • PDFs where pypdf mangles the text structure entirely
  2. No markdown structure awareness. When marker-pdf (ADR-0228 Phase 6) replaces pypdf, the extracted text will be structured markdown with real #/##/### headings, markdown tables, and code blocks. The current parseSections() doesn’t know about markdown.

  3. Tables become garbage text. PDF tables extracted by pypdf become random strings. Even with marker’s markdown table output, the chunker treats tables as regular paragraphs — they get split mid-row.

  4. Token estimation is crude. words * 1.25 is a rough proxy for MiniLM-L12’s WordPiece tokenizer. Not a dealbreaker (~15% error on average) but affects chunk size precision, especially for code-heavy or non-English text.

  5. No content-type awareness. A code block, a table, a quote, and running prose are all chunked identically. Code and tables should be kept intact as atomic units.

Decision

Upgrade book-chunk.ts to be markdown-structure-aware while preserving the proven two-tier architecture.

1. Markdown-first heading detection

Add a new heading detection path that runs BEFORE the heuristic fallback:

function detectMarkdownHeading(line: string): { depth: number; text: string } | null {
  const match = line.match(/^(#{1,6})\s+(.+)$/);
  if (!match) return null;
  return { depth: match[1].length, text: match[2].trim() };
}

In parseSections():

  1. First pass: detect if input contains # markdown headings (≥3 occurrences)
  2. If yes: use markdown heading detection exclusively (trust marker’s output)
  3. If no: fall back to current isHeadingLine() heuristic (for txt files, legacy pypdf output)

This means the same chunkBookText() function handles both markdown and plain text — the detection strategy adapts to the input.
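The first-pass dispatch can be sketched as below. `detectMarkdownHeading` is reproduced from the decision above; `hasMarkdownHeadings` is an illustrative name for the ≥3-occurrence check (the real heuristic fallback, `isHeadingLine()`, lives in `book-chunk.ts` and is not shown).

```typescript
// From the decision above: parse a markdown ATX heading, or null.
function detectMarkdownHeading(line: string): { depth: number; text: string } | null {
  const match = line.match(/^(#{1,6})\s+(.+)$/);
  if (!match) return null;
  return { depth: match[1].length, text: match[2].trim() };
}

// Sketch: first pass over the whole document. If it contains at least
// minCount real ATX headings at line starts, trust marker's markdown
// structure; otherwise parseSections() falls back to isHeadingLine().
function hasMarkdownHeadings(text: string, minCount = 3): boolean {
  return (text.match(/^#{1,6}\s+\S/gm)?.length ?? 0) >= minCount;
}
```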

2. Atomic block preservation

Identify and protect content blocks that must not be split:

  • Markdown tables: lines matching |...| patterns → keep entire table as one unit
  • Code blocks: ``` fenced blocks → keep as one unit
  • Block quotes: > prefixed lines → keep as one unit

Implementation: in splitByTokens(), before paragraph splitting, identify atomic blocks and treat each as a single unsplittable unit. If an atomic block exceeds maxTokens, keep it whole and mark it as oversized (don’t split a table mid-row).
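A minimal sketch of that pre-splitting pass, assuming a simple line-based grouping (the `Block` shape and function name are hypothetical, not the actual `splitByTokens()` internals):

```typescript
// Sketch: group lines into atomic runs (code fences, markdown tables,
// block quotes) and splittable text runs before paragraph splitting.
type Block = { kind: "code" | "table" | "quote" | "text"; lines: string[] };

function groupAtomicBlocks(lines: string[]): Block[] {
  const blocks: Block[] = [];
  let i = 0;
  while (i < lines.length) {
    if (/^```/.test(lines[i])) {
      // Fenced code block: consume through the closing fence.
      const start = i++;
      while (i < lines.length && !/^```/.test(lines[i])) i++;
      if (i < lines.length) i++; // include the closing fence
      blocks.push({ kind: "code", lines: lines.slice(start, i) });
    } else if (/^\|.*\|/.test(lines[i])) {
      const start = i;
      while (i < lines.length && /^\|.*\|/.test(lines[i])) i++;
      blocks.push({ kind: "table", lines: lines.slice(start, i) });
    } else if (/^>/.test(lines[i])) {
      const start = i;
      while (i < lines.length && /^>/.test(lines[i])) i++;
      blocks.push({ kind: "quote", lines: lines.slice(start, i) });
    } else {
      const start = i;
      while (
        i < lines.length &&
        !/^```/.test(lines[i]) && !/^\|.*\|/.test(lines[i]) && !/^>/.test(lines[i])
      ) i++;
      blocks.push({ kind: "text", lines: lines.slice(start, i) });
    }
  }
  return blocks;
}
```

Each non-`text` block is then treated as unsplittable; only `text` runs go through paragraph and sentence splitting.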

3. Table metadata

When a chunk contains a markdown table:

  • Set a contains_table: true flag on the chunk record
  • Extract table headers as additional heading_path context
  • Include table column count in chunk metadata

This lets downstream search weight table-containing chunks appropriately and lets agents know a result contains structured data.
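Deriving those metadata fields could look roughly like this, assuming the standard markdown table shape of a header row followed by a `|---|---|` separator row (the `TableMeta` field names mirror the bullets above but are otherwise assumptions):

```typescript
// Sketch: extract table metadata from a chunk's text, per the bullets above.
interface TableMeta {
  contains_table: boolean;
  table_headers: string[];
  table_columns: number;
}

function extractTableMeta(chunkText: string): TableMeta {
  const lines = chunkText.split("\n");
  // A header row is a pipe row immediately followed by a |---|---| separator.
  for (let i = 0; i < lines.length - 1; i++) {
    if (/^\|.*\|/.test(lines[i]) && /^\|[\s:|-]+\|$/.test(lines[i + 1])) {
      const headers = lines[i]
        .split("|")
        .map((cell) => cell.trim())
        .filter(Boolean);
      return { contains_table: true, table_headers: headers, table_columns: headers.length };
    }
  }
  return { contains_table: false, table_headers: [], table_columns: 0 };
}
```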

4. Improved token estimation

Replace words * 1.25 with a lightweight approximation that accounts for:

  • Subword splitting: words >8 chars count as ceil(len/4) tokens (closer to WordPiece behavior)
  • Punctuation: each standalone punctuation token counts as 1
  • Numbers: digit sequences count as ceil(digits/3) tokens

This gets within ~5% of actual MiniLM-L12 token counts without requiring the tokenizer dependency. Still no external dependency, just better math.

function estimateTokens(text: string): number {
  const words = text.trim().split(/\s+/).filter(Boolean);
  if (words.length === 0) return 0;
  let tokens = 0;
  for (const word of words) {
    if (/^\d+$/.test(word)) {
      // Digit sequences: roughly 3 digits per WordPiece token
      tokens += Math.ceil(word.length / 3);
    } else if (/^[^A-Za-z0-9]+$/.test(word)) {
      // Standalone punctuation: one token per character
      tokens += word.length;
    } else if (word.length <= 4) {
      tokens += 1;
    } else if (word.length <= 8) {
      // Mid-length words occasionally split into two subword pieces
      tokens += 1.3;
    } else {
      // Long words split into ~4-character subword pieces
      tokens += Math.ceil(word.length / 4);
    }
  }
  return Math.max(1, Math.round(tokens));
}

5. Content-type tagging on chunks

Add a content_hints field to chunk records:

type ContentHint = "prose" | "table" | "code" | "list" | "quote" | "mixed";

Detected from markdown structure. Stored in Typesense as a facetable string field. Lets agents filter: “give me only prose chunks about X” or “show me tables related to Y”.
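The detection could be a ratio-based classification over a chunk's lines, roughly as sketched below. The 50% and 20% thresholds are illustrative assumptions, not tuned values:

```typescript
// Sketch: classify a chunk into a ContentHint from its markdown surface.
type ContentHint = "prose" | "table" | "code" | "list" | "quote" | "mixed";

function classifyChunk(text: string): ContentHint {
  const lines = text.split("\n").filter((l) => l.trim().length > 0);
  if (lines.length === 0) return "prose";
  const ratio = (re: RegExp) => lines.filter((l) => re.test(l)).length / lines.length;
  const ratios: Array<[ContentHint, number]> = [
    ["table", ratio(/^\|.*\|/)],
    ["code", ratio(/^(```|\s{4})/)],
    ["list", ratio(/^\s*([-*+]|\d+\.)\s/)],
    ["quote", ratio(/^>/)],
  ];
  const dominant = ratios.find(([, r]) => r >= 0.5);
  if (dominant) return dominant[0];
  // No single structure dominates: mixed if structure is present, else prose.
  return ratios.some(([, r]) => r > 0.2) ? "mixed" : "prose";
}
```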

6. Boilerplate detection upgrade

Current shouldDropBoilerplateLine() catches page numbers and copyright lines. Extend to catch:

  • Repeated headers/footers (detect lines that appear >3 times in the same position pattern)
  • Table of contents entries (numbered lines with page references: Chapter 3 ..... 47)
  • Index entries (alphabetically sorted single-word lines at end of document)
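The repeated header/footer case from the first bullet can be sketched as a frequency count over normalized lines (stripping digits so that `Page 12` and `Page 47` collapse to the same key); the normalization rule and 80-character cap are illustrative assumptions:

```typescript
// Sketch: find lines that repeat across the document (running headers,
// footers) by counting normalized occurrences; the >3 threshold is from
// the bullet above (minRepeats = 4 means "appears more than 3 times").
function findRepeatedLines(lines: string[], minRepeats = 4): Set<string> {
  const counts = new Map<string, number>();
  for (const line of lines) {
    // Normalize: collapse whitespace, replace digits so page numbers match.
    const key = line.trim().replace(/\d+/g, "#").replace(/\s+/g, " ");
    if (key.length === 0 || key.length > 80) continue; // headers are short
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return new Set(
    [...counts.entries()].filter(([, n]) => n >= minRepeats).map(([k]) => k),
  );
}
```

`shouldDropBoilerplateLine()` would then drop any line whose normalized form is in the returned set.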

Migration

  • No schema migration needed for existing chunks — new fields (contains_table, content_hints) are optional
  • Reindex recommended after marker-pdf is deployed — the combination of better extraction + better chunking is the full upgrade
  • Backward compatible: chunkBookText() accepts both markdown and plain text, adapts automatically

Consequences

Positive

  • Heading detection goes from ~70% accuracy (heuristic) to ~95%+ (marker markdown structure)
  • Tables preserved as atomic units instead of being split into garbage
  • Agents can filter by content type — “show me tables about X”
  • Token estimation within ~5% of actual, improving chunk size consistency
  • Same function handles both markdown and legacy text — no branching in the ingest pipeline

Negative

  • Slightly more complex parseSections() logic (but cleaner — markdown detection is simpler than heuristics)
  • content_hints field adds a small storage overhead per chunk
  • Oversized atomic blocks (huge tables) may exceed maxTokens — accepted tradeoff, splitting them is worse

Risks

  • Marker may not produce consistent markdown for all PDF types — heuristic fallback handles this
  • Token estimation improvement may change chunk boundaries for existing docs on reindex — expected and desired

Testing

  • Unit tests: markdown heading detection, atomic block preservation, table detection
  • Integration test: chunk a known book (DDIA) with both pypdf text and marker markdown, compare heading accuracy
  • Regression test: existing book-chunk.test.ts must still pass for plain text input

Implementation

All changes in packages/system-bus/src/lib/book-chunk.ts plus a new Typesense field addition in docs-ingest.ts. No new packages or dependencies.

Sequence: implement after ADR-0228 Phase 6 (marker-pdf) so that markdown input is available for testing. Can be developed in parallel on plain text with synthetic markdown test cases.