# Markdown-Aware Chunking Pipeline
## Status

Proposed

## Context
The chunking pipeline (`packages/system-bus/src/lib/book-chunk.ts`) currently operates on flat extracted text. It uses heuristic heading detection (`isHeadingLine()`) based on regex patterns (ALL CAPS, title case, numbered lines, chapter/section prefixes) and splits by paragraph boundaries, with token estimation via `words * 1.25`.
### What works well (keep)
- Two-tier chunking (sections ~1700 tokens, snippets ~420 tokens) with parent-child links — this IS hierarchical/parent-child chunking, confirmed as SOTA by March 2026 arXiv benchmarks
- Overlap between adjacent chunks (120/80 tokens) — 10-20% overlap is the confirmed sweet spot
- Context prefix from heading path — this IS contextual retrieval per Anthropic’s recommendation
- Retrieval text with `[DOC:][PATH:][CONCEPTS:]` prefix — contextualizes chunks for embedding quality
- `splitByTokens()` respects paragraph boundaries, splits long paragraphs by sentence — this is recursive splitting, the proven general-purpose default
### What’s broken
- **Heading detection is heuristic-only on flat text.** `isHeadingLine()` guesses based on formatting patterns. Works for well-structured technical books, fails badly on:
  - Game design books (Schell, Koster) with creative formatting
  - Books with sidebars, callout boxes, epigraphs
  - Books where section titles are full sentences
  - PDFs where pypdf mangles the text structure entirely
- **No markdown structure awareness.** When marker-pdf (ADR-0228 Phase 6) replaces pypdf, the extracted text will be structured markdown with real `#`/`##`/`###` headings, markdown tables, and code blocks. The current `parseSections()` doesn’t know about markdown.
- **Tables become garbage text.** PDF tables extracted by pypdf become random strings. Even with marker’s markdown table output, the chunker treats tables as regular paragraphs — they get split mid-row.
- **Token estimation is crude.** `words * 1.25` is a rough proxy for MiniLM-L12’s WordPiece tokenizer. Not a dealbreaker (~15% error on average), but it affects chunk size precision, especially for code-heavy or non-English text.
- **No content-type awareness.** A code block, a table, a quote, and running prose are all chunked identically. Code and tables should be kept intact as atomic units.
## Decision

Upgrade `book-chunk.ts` to be markdown-structure-aware while preserving the proven two-tier architecture.

### 1. Markdown-first heading detection
Add a new heading detection path that runs BEFORE the heuristic fallback:
```typescript
function detectMarkdownHeading(line: string): { depth: number; text: string } | null {
  const match = line.match(/^(#{1,6})\s+(.+)$/);
  if (!match) return null;
  return { depth: match[1].length, text: match[2].trim() };
}
```

In `parseSections()`:

- First pass: detect whether the input contains `#` markdown headings (≥3 occurrences)
- If yes: use markdown heading detection exclusively (trust marker’s output)
- If no: fall back to the current `isHeadingLine()` heuristic (for txt files, legacy pypdf output)
This means the same `chunkBookText()` function handles both markdown and plain text — the detection strategy adapts to the input.
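The selection step could look something like the sketch below. `pickHeadingDetector` and its threshold are illustrative, not the actual implementation; the `heuristic` parameter stands in for the existing `isHeadingLine()`:

```typescript
// Heading shape shared by both detection paths.
type Heading = { depth: number; text: string };

// Decide which heading detector parseSections() should use for a document.
// Markdown headings are trusted only when they appear at least 3 times, so a
// stray '#' in plain text does not flip the whole document into markdown mode.
function pickHeadingDetector(
  lines: string[],
  heuristic: (line: string) => boolean, // stands in for isHeadingLine()
): (line: string) => Heading | null {
  const md = (line: string): Heading | null => {
    const m = line.match(/^(#{1,6})\s+(.+)$/);
    return m ? { depth: m[1].length, text: m[2].trim() } : null;
  };
  const mdCount = lines.filter((l) => md(l) !== null).length;
  if (mdCount >= 3) return md; // trust marker's markdown structure
  // Legacy path: the heuristic gives no depth information, default to depth 1.
  return (line) => (heuristic(line) ? { depth: 1, text: line.trim() } : null);
}
```

Returning a single detector function keeps the downstream section-building loop identical for both input kinds.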
### 2. Atomic block preservation
Identify and protect content blocks that must not be split:
- Markdown tables: lines matching `|...|` patterns → keep the entire table as one unit
- Code blocks: ``` fenced blocks → keep as one unit
- Block quotes: `>` prefixed lines → keep as one unit
Implementation: in `splitByTokens()`, before paragraph splitting, identify atomic blocks and treat each as a single unsplittable unit. If an atomic block exceeds `maxTokens`, keep it whole and mark it as oversized (don’t split a table mid-row).
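A sketch of how the grouping step might work upstream of `splitByTokens()`. The `Block` type, function name, and exact regexes are illustrative, not the actual implementation:

```typescript
type Block = { kind: "prose" | "table" | "code" | "quote"; text: string };

// Group raw lines into blocks, keeping tables, fenced code, and block quotes
// intact so the paragraph splitter can treat each as one unsplittable unit.
function groupAtomicBlocks(lines: string[]): Block[] {
  const blocks: Block[] = [];
  let i = 0;
  while (i < lines.length) {
    const line = lines[i];
    if (line.startsWith("```")) {
      // Fenced code: consume until the closing fence (or end of input).
      let j = i + 1;
      while (j < lines.length && !lines[j].startsWith("```")) j++;
      const end = Math.min(j + 1, lines.length);
      blocks.push({ kind: "code", text: lines.slice(i, end).join("\n") });
      i = end;
    } else if (/^\s*\|.*\|\s*$/.test(line)) {
      // Markdown table: consume consecutive |...| rows.
      let j = i;
      while (j < lines.length && /^\s*\|.*\|\s*$/.test(lines[j])) j++;
      blocks.push({ kind: "table", text: lines.slice(i, j).join("\n") });
      i = j;
    } else if (/^\s*>/.test(line)) {
      // Block quote: consume consecutive > prefixed lines.
      let j = i;
      while (j < lines.length && /^\s*>/.test(lines[j])) j++;
      blocks.push({ kind: "quote", text: lines.slice(i, j).join("\n") });
      i = j;
    } else {
      blocks.push({ kind: "prose", text: line });
      i++;
    }
  }
  return blocks;
}
```

Prose blocks then flow into the existing paragraph/sentence splitting; the other kinds bypass it.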
### 3. Table metadata
When a chunk contains a markdown table:
- Set a `contains_table: true` flag on the chunk record
- Extract table headers as additional `heading_path` context
- Include the table column count in chunk metadata
This lets downstream search weight table-containing chunks appropriately and lets agents know a result contains structured data.
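Header and column-count extraction can stay very simple. This sketch assumes the first line of the table block is the header row; `extractTableMetadata` is a hypothetical helper, not the actual schema:

```typescript
// Pull searchable metadata out of a markdown table block: the header cells
// (candidates for extra heading_path context) and the column count.
function extractTableMetadata(tableText: string): {
  headers: string[];
  columnCount: number;
} {
  const firstRow = tableText.split("\n")[0] ?? "";
  const headers = firstRow
    .split("|")
    .map((cell) => cell.trim())
    .filter((cell) => cell.length > 0);
  return { headers, columnCount: headers.length };
}
```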
### 4. Improved token estimation

Replace `words * 1.25` with a lightweight approximation that accounts for:
- Subword splitting: words longer than 8 characters count as `ceil(len/4)` tokens (closer to WordPiece behavior)
- Punctuation: each standalone punctuation token counts as 1
- Numbers: digit sequences count as `ceil(digits/3)` tokens
This gets within ~5% of actual MiniLM-L12 token counts without requiring the tokenizer dependency. Still no external dependency, just better math.
```typescript
function estimateTokens(text: string): number {
  const words = text.trim().split(/\s+/).filter(Boolean);
  if (words.length === 0) return 0;
  let tokens = 0;
  for (const word of words) {
    if (/^\d+$/.test(word)) {
      // Digit sequences: roughly one token per three digits
      tokens += Math.ceil(word.length / 3);
    } else if (/^[^A-Za-z0-9]$/.test(word)) {
      // Standalone punctuation counts as one token
      tokens += 1;
    } else if (word.length <= 4) {
      tokens += 1;
    } else if (word.length <= 8) {
      tokens += 1.3;
    } else {
      // Long words split into subwords: ~4 characters per WordPiece token
      tokens += Math.ceil(word.length / 4);
    }
  }
  return Math.max(1, Math.round(tokens));
}
```

### 5. Content-type tagging on chunks
Add a `content_hints` field to chunk records:

```typescript
type ContentHint = "prose" | "table" | "code" | "list" | "quote" | "mixed";
```

Detected from markdown structure and stored in Typesense as a facetable string field. Lets agents filter: “give me only prose chunks about X” or “show me tables related to Y”.
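One possible shape for the detector. `detectContentHint` is a hypothetical helper and the classification rules shown are illustrative:

```typescript
type ContentHint = "prose" | "table" | "code" | "list" | "quote" | "mixed";

// Classify a chunk from its markdown surface features. A chunk that mixes
// more than one kind is tagged "mixed" so agents can still filter on it.
function detectContentHint(chunkText: string): ContentHint {
  const lines = chunkText.split("\n").filter((l) => l.trim().length > 0);
  const hints = new Set<ContentHint>();
  let inFence = false;
  for (const line of lines) {
    if (line.trimStart().startsWith("```")) {
      hints.add("code");
      inFence = !inFence;
      continue;
    }
    if (inFence) continue; // lines inside a fence belong to the code block
    if (/^\s*\|.*\|\s*$/.test(line)) hints.add("table");
    else if (/^\s*([-*+]|\d+\.)\s/.test(line)) hints.add("list");
    else if (/^\s*>/.test(line)) hints.add("quote");
    else hints.add("prose");
  }
  if (hints.size === 0) return "prose";
  if (hints.size === 1) return [...hints][0];
  return "mixed";
}
```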
### 6. Boilerplate detection upgrade

The current `shouldDropBoilerplateLine()` catches page numbers and copyright lines. Extend it to catch:
- Repeated headers/footers (detect lines that appear >3 times in the same position pattern)
- Table of contents entries (numbered lines with page references: `Chapter 3 ..... 47`)
- Index entries (alphabetically sorted single-word lines at the end of the document)
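Two of these detectors sketched below; the thresholds and regexes are illustrative starting points, not the actual implementation:

```typescript
// TOC entries like "Chapter 3 ..... 47": text, a run of dots, a page number.
function looksLikeTocEntry(line: string): boolean {
  return /^.{1,80}\.{3,}\s*\d+\s*$/.test(line.trim());
}

// Repeated headers/footers: count how often each normalized line appears
// across the document; lines seen more than `threshold` times are likely
// page furniture rather than content.
function findRepeatedLines(lines: string[], threshold = 3): Set<string> {
  const counts = new Map<string, number>();
  for (const line of lines) {
    const key = line.trim().toLowerCase();
    if (key.length === 0) continue;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return new Set(
    [...counts.entries()].filter(([, n]) => n > threshold).map(([k]) => k),
  );
}
```

The repeated-line pass needs the whole document up front, so it would run once before line-level filtering rather than inside it.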
## Migration

- No schema migration needed for existing chunks — new fields (`contains_table`, `content_hints`) are optional
- Reindex recommended after marker-pdf is deployed — the combination of better extraction and better chunking is the full upgrade
- Backward compatible — `chunkBookText()` accepts both markdown and plain text and adapts automatically
## Consequences

### Positive
- Heading detection goes from ~70% accuracy (heuristic) to ~95%+ (marker markdown structure)
- Tables preserved as atomic units instead of being split into garbage
- Agents can filter by content type — “show me tables about X”
- Token estimation within ~5% of actual, improving chunk size consistency
- Same function handles both markdown and legacy text — no branching in the ingest pipeline
### Negative

- Slightly more complex `parseSections()` logic (but cleaner — markdown detection is simpler than the heuristics)
- `content_hints` field adds a small storage overhead per chunk
- Oversized atomic blocks (huge tables) may exceed `maxTokens` — an accepted tradeoff, since splitting them is worse
### Risks
- Marker may not produce consistent markdown for all PDF types — heuristic fallback handles this
- Token estimation improvement may change chunk boundaries for existing docs on reindex — expected and desired
## Testing
- Unit tests: markdown heading detection, atomic block preservation, table detection
- Integration test: chunk a known book (DDIA) with both pypdf text and marker markdown, compare heading accuracy
- Regression test: the existing `book-chunk.test.ts` must still pass for plain text input
## Implementation

All changes live in `packages/system-bus/src/lib/book-chunk.ts`, plus a new Typesense field addition in `docs-ingest.ts`. No new packages or dependencies.
Sequence: implement after ADR-0228 Phase 6 (marker-pdf) so that markdown input is available for testing. Can be developed in parallel on plain text with synthetic markdown test cases.