ADR-0234 — Accepted

Docs Pipeline v2 — Artifact-Chain Architecture with Workload Queue

Context

The current docs-ingest pipeline (602 PDFs, 145K chunks) has three problems:

  1. Garbage extraction: pypdf dumps flat text — no tables, no headings, no reading order. opendataloader-pdf (just installed) produces structured markdown but the pipeline wasn’t designed around it.

  2. No intermediate artifacts: The pipeline extracts text to /tmp, processes it in memory, indexes to Typesense, then deletes the temp file. If you change chunking strategy, embedding model, or summary prompt, you must re-extract every PDF from scratch. Every stage is coupled.

  3. No batch orchestration: Re-indexing 602 PDFs requires manually firing events. No progress tracking, no resume-on-failure, no workload shaping.

Research findings (Pinecone, Weaviate, arXiv “Chunk Twice, Embed Once”, Id8, Anthropic):

  • Document-based chunking on markdown headings outperforms fixed-size and semantic chunking for structured documents
  • Recursive chunking with no overlap (R100-0) achieves 45% higher precision than overlapping approaches
  • Context inheritance (prepending parent heading/summary to each chunk) improves retrieval 2-18%
  • Retrieval-tuned embedding models (nomic-embed-text-v1.5, E5, bge) substantially outperform general-purpose MiniLM
  • Typesense has ts/nomic-embed-text-v1.5 built-in — no external API needed
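These findings map directly onto Stage 3's chunker. A minimal sketch of heading-based, no-overlap splitting — each markdown heading starts a new section, and oversized sections are split recursively on paragraph boundaries. Names and the maxChars budget are illustrative, not the pipeline's actual code:

```typescript
// One chunk per heading-scoped section, no overlap (per the R100-0 finding).
interface Chunk {
  headingPath: string[];
  content: string;
}

function chunkMarkdown(md: string, maxChars = 1200): Chunk[] {
  const chunks: Chunk[] = [];
  const headingPath: string[] = [];
  let buf: string[] = [];

  const flush = () => {
    const text = buf.join("\n").trim();
    buf = [];
    if (!text) return;
    if (text.length <= maxChars) {
      chunks.push({ headingPath: [...headingPath], content: text });
      return;
    }
    // Recursive step: split an oversized section on paragraph boundaries.
    let piece = "";
    for (const p of text.split(/\n{2,}/)) {
      if (piece && piece.length + p.length > maxChars) {
        chunks.push({ headingPath: [...headingPath], content: piece.trim() });
        piece = "";
      }
      piece += p + "\n\n";
    }
    if (piece.trim()) chunks.push({ headingPath: [...headingPath], content: piece.trim() });
  };

  for (const line of md.split("\n")) {
    const m = /^(#{1,6})\s+(.*)$/.exec(line); // real "#" markers, not heuristics
    if (m) {
      flush();
      headingPath.length = m[1].length - 1; // truncate to parent depth
      headingPath[m[1].length - 1] = m[2].trim();
    } else {
      buf.push(line);
    }
  }
  flush();
  return chunks;
}
```

The heading_path carried on each chunk is what later feeds the context-inheritance prefix.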

Decision

Staged Artifact Pipeline

Each stage produces a durable artifact on disk. Stages are independently re-runnable. Later stages consume artifacts from earlier stages, not raw source files.

PDF (source, immutable, on NAS)

  ├─ Stage 1: CONVERT
  │   Input:  {path}.pdf
  │   Output: {artifacts_dir}/{docId}/{docId}.md
  │   Tool:   opendataloader-pdf → structured markdown
  │   Fallback: pypdf → flat text
  │   Artifact: markdown file, persisted

  ├─ Stage 2: CLASSIFY + SUMMARIZE
  │   Input:  {docId}.md (first 8K chars for classification, full text for summary)
  │   Output: {artifacts_dir}/{docId}/{docId}.meta.json
  │   Contains:
  │     - title, filename, file_type, page_count, sha256
  │     - primaryConceptId, conceptIds, conceptSource, taxonomyVersion
  │     - storageCategory, documentType, tags
  │     - summary (LLM-generated, 2-3 sentences)
  │     - source_host, nas_path, nas_paths
  │   Tool: existing taxonomy classifier + pi -p for summary
  │   Artifact: JSON metadata file, persisted

  ├─ Stage 3: CHUNK
  │   Input:  {docId}.md + {docId}.meta.json
  │   Output: {artifacts_dir}/{docId}/{docId}.chunks.jsonl
  │   Strategy:
  │     - Markdown-native heading detection (# markers, not heuristics)
  │     - Recursive splitting within sections exceeding target tokens
  │     - No overlap (R100-0 finding)
  │     - Two-level: section chunks + snippet sub-chunks
  │     - heading_path from actual markdown heading levels
  │     - Each chunk line: {chunk_id, chunk_type, heading_path, content, context_prefix, ...}
  │   Artifact: JSONL file, one chunk per line, persisted

  └─ Stage 4: INDEX
      Input:  {docId}.meta.json + {docId}.chunks.jsonl
      Output: Typesense docs + docs_chunks collections
      Process:
        - Upsert document record to `docs` collection from meta.json
        - Build retrieval_text per chunk:
          "[DOC: {title}] [SUMMARY: {summary}] [PATH: {heading_path}] [CONCEPTS: {labels}]\n\n{content}"
        - Bulk import chunks to `docs_chunks` with auto-embedding
      Embedding: ts/nomic-embed-text-v1.5 (768-dim, retrieval-tuned)
      No artifact (Typesense IS the artifact for this stage)
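Stage 4's retrieval_text assembly can be sketched as a pure function. The interface and function names are illustrative; the bracket format follows the template above:

```typescript
// Context inheritance: prepend document- and section-level context to each
// chunk before embedding, mirroring the Stage 4 retrieval_text template.
interface ChunkContext {
  title: string;
  summary: string;
  headingPath: string[]; // e.g. ["Installation", "Linux"]
  conceptLabels: string[];
}

function buildRetrievalText(ctx: ChunkContext, content: string): string {
  const prefix = [
    `[DOC: ${ctx.title}]`,
    `[SUMMARY: ${ctx.summary}]`,
    `[PATH: ${ctx.headingPath.join(" > ")}]`,
    `[CONCEPTS: ${ctx.conceptLabels.join(", ")}]`,
  ].join(" ");
  return `${prefix}\n\n${content}`;
}
```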

Artifact Storage — DURABLE, RESUMABLE, RECOVERABLE

Artifacts live on the NAS at /Volumes/three-body/docs-artifacts/{docId}/:

  • Durable: NAS-backed RAID5 storage, survives reboots and pod restarts
  • Resumable: each stage checks for existing artifacts before re-running — crash mid-pipeline, resume from last completed stage
  • Recoverable: artifacts are immutable once written — if indexing breaks, re-run stage 4 from existing artifacts
  • Archived: artifacts persist permanently — historical record of extraction/chunking quality
  • Persistent: NOT /tmp. NOT ephemeral. Real storage on real disks.

Directory structure:

/Volumes/three-body/docs-artifacts/
  {docId}/
    {docId}.md              — markdown extraction artifact
    {docId}.meta.json       — taxonomy + summary metadata
    {docId}.chunks.jsonl    — chunk records, one per line

Env var DOCS_ARTIFACTS_DIR defaults to /Volumes/three-body/docs-artifacts. Falls back to ~/clawd/data/pdf-brain/artifacts if NAS unavailable.
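The resolution order described above might look like this (helper names are hypothetical):

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Resolve the artifacts root: DOCS_ARTIFACTS_DIR env var first, then the
// NAS default, then the local fallback when the NAS mount is absent.
const NAS_DEFAULT = "/Volumes/three-body/docs-artifacts";
const LOCAL_FALLBACK = path.join(os.homedir(), "clawd/data/pdf-brain/artifacts");

function artifactsRoot(): string {
  const configured = process.env.DOCS_ARTIFACTS_DIR;
  if (configured) return configured;
  return fs.existsSync(NAS_DEFAULT) ? NAS_DEFAULT : LOCAL_FALLBACK;
}

// Per-document artifact paths, matching the directory structure above.
function artifactPath(docId: string, kind: "md" | "meta.json" | "chunks.jsonl"): string {
  return path.join(artifactsRoot(), docId, `${docId}.${kind}`);
}
```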

Re-run Semantics

  Change                      Re-run from   Skips
  ──────────────────────────  ────────────  ───────────────────────────────────
  New PDF added               Stage 1       Nothing
  Chunking strategy changed   Stage 3       Stages 1-2 (reuse .md + .meta.json)
  Summary prompt changed      Stage 2       Stage 1 (reuse .md)
  Embedding model changed     Stage 4       Stages 1-3 (reuse all artifacts)
  Full reindex                Stage 1       Nothing

Embedding Model Upgrade

Switch from ts/all-MiniLM-L12-v2 (384-dim) to ts/nomic-embed-text-v1.5 (768-dim):

  • Nomic is retrieval-tuned — #1 in QA tasks per arXiv “Chunk Twice, Embed Once”
  • Built into Typesense — no external API, no cost
  • Requires dropping and recreating docs_chunks collection with new schema
  • 768-dim vectors are ~2x storage but significantly better retrieval precision
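A sketch of what the recreated collection schema could look like, using Typesense's built-in auto-embedding (the embed config follows Typesense's schema format; field names other than the embedding field are illustrative):

```typescript
// docs_chunks_v2 schema sketch: Typesense concatenates the embed.from
// fields and computes the 768-dim vector server-side at import time.
const docsChunksV2Schema = {
  name: "docs_chunks_v2",
  fields: [
    { name: "chunk_id", type: "string" },
    { name: "doc_id", type: "string", facet: true },
    { name: "heading_path", type: "string" },
    { name: "retrieval_text", type: "string" },
    {
      name: "embedding",
      type: "float[]",
      embed: {
        from: ["retrieval_text"],
        model_config: { model_name: "ts/nomic-embed-text-v1.5" },
      },
    },
  ],
};
```

Because the dimensionality changes (384 → 768), this must be a new collection; existing vectors cannot be migrated in place.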

Workload Queue Orchestration

Batch reindex uses joelclaw workload plan → Redis queue → Inngest execution:

joelclaw workload plan "docs-reindex-v2"
  → Scan: find all PDFs (NAS manifest or filesystem walk)
  → Shape: 602 items, each with nasPath + docId
  → Queue: Redis sorted set, ordered by priority (modified-date or alpha)
  → Dispatch: fire docs/reindex-v2.requested events in batches
     - Concurrency: 3 parallel (JVM startup is memory-heavy)
     - Throttle: 1 event per 5s (prevent JVM pile-up)
  → Track: workload queue tracks pending/running/complete/failed
  → Resume: failed items re-queued automatically (Inngest retries)
  → Progress: OTEL events per stage, queryable via joelclaw o11y

The system-bus-worker on panda executes (Java, opendataloader-pdf, NAS mount, pi all available). The workload queue orchestrates.
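The scan → shape → dispatch flow can be sketched as a pure shaping step (types and the priority rule are illustrative; the real plan lives in a Redis sorted set):

```typescript
// Shape scanned PDFs into a prioritized dispatch plan: most recently
// modified first, with a send offset enforcing the 1-event-per-5s throttle.
interface ScannedPdf {
  nasPath: string;
  docId: string;
  mtimeMs: number;
}

interface DispatchItem {
  docId: string;
  nasPath: string;
  sendAtOffsetMs: number; // delay from batch start
}

function shapeWorkload(items: ScannedPdf[], throttleMs = 5000): DispatchItem[] {
  return [...items]
    .sort((a, b) => b.mtimeMs - a.mtimeMs) // priority: modified-date, newest first
    .map((item, i) => ({
      docId: item.docId,
      nasPath: item.nasPath,
      sendAtOffsetMs: i * throttleMs,
    }));
}
```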

New Inngest Function: docs-reindex-v2

Replaces the monolithic docs-ingest for the reindex flow. Each stage is a separate Inngest step for durability:

step.run("convert-pdf")        → produces .md artifact
step.run("classify-summarize") → produces .meta.json artifact
step.run("chunk")              → produces .chunks.jsonl artifact
step.run("index-typesense")    → upserts to Typesense

If any step fails, Inngest retries from that step. Artifacts from completed steps are preserved.

Consequences

  • Every stage produces a durable, inspectable artifact
  • Re-indexing with different strategies doesn’t require re-extracting PDFs
  • Embedding model upgrade from MiniLM to nomic improves retrieval quality
  • Workload queue provides batch orchestration, progress tracking, resume-on-failure
  • Markdown artifacts are useful beyond indexing (direct reading, vault integration, other pipelines)
  • 768-dim vectors double storage for docs_chunks (~300MB → ~600MB, trivial)
  • JVM startup per PDF adds latency (~2.5s) but runs AFK

Implementation Order

  1. ADR written (this document)
  2. Write artifact storage helpers (save/load .md, .meta.json, .chunks.jsonl)
  3. Upgrade book-chunk.ts for markdown-native heading detection, no overlap
  4. Add LLM summary step to pipeline
  5. Create docs_chunks_v2 collection with nomic-embed-text-v1.5
  6. Build docs-reindex-v2 Inngest function with staged steps
  7. Wire workload queue for batch orchestration
  8. Run full reindex AFK
  9. Verify retrieval quality improvement
  10. Cut over: alias docs_chunks_v2 → docs_chunks, deprecate v1

Durability Provisions

Storage (Layer 1)

  • Artifacts on NAS RAID5 (57TB, hardware redundancy)
  • Write-once: write to .tmp then atomic rename — no partial artifacts
  • Artifacts immutable once written — never modified, only replaced by re-running the stage
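The write-once discipline above — tmp file, fsync, atomic rename — can be sketched as follows (the function name is hypothetical):

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Write to a sibling .tmp file, flush it, then atomically rename into
// place, so readers never observe a partially written artifact.
function writeArtifactAtomic(finalPath: string, contents: string): void {
  fs.mkdirSync(path.dirname(finalPath), { recursive: true });
  const tmpPath = `${finalPath}.tmp`;
  const fd = fs.openSync(tmpPath, "w");
  try {
    fs.writeSync(fd, contents);
    fs.fsyncSync(fd); // flush to disk before the rename makes it visible
  } finally {
    fs.closeSync(fd);
  }
  fs.renameSync(tmpPath, finalPath); // atomic on the same filesystem
}
```

Rename is only atomic within a single filesystem, which holds here since the .tmp sibling lives in the same artifact directory.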

Pipeline Resumability (Layer 2)

  • Each step checks hasArtifact(docId, stage) before running
  • Crash after stage 2 → restart picks up at stage 3 from existing .md + .meta.json
  • skipExistingArtifacts: true on batch event skips completed books entirely
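The per-stage artifact check generalizes to a resume decision: run from the first stage whose artifact is missing. A sketch, where the exists callback stands in for the real on-disk check:

```typescript
// Artifact suffixes in stage order: Stage 1 → .md, Stage 2 → .meta.json,
// Stage 3 → .chunks.jsonl. Stage 4 has no artifact (Typesense is the output).
const STAGE_ARTIFACTS = ["md", "meta.json", "chunks.jsonl"] as const;

function firstIncompleteStage(exists: (suffix: string) => boolean): number {
  for (let i = 0; i < STAGE_ARTIFACTS.length; i++) {
    if (!exists(STAGE_ARTIFACTS[i])) return i + 1; // stages are 1-indexed
  }
  return 4; // all artifacts present → only indexing remains
}
```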

Inngest Durability (Layer 3)

  • Each stage is a separate step.run() — output persisted between retries
  • Default Inngest retry policy kept — never retries: 0, so failed steps always re-run
  • Concurrency limit 3 prevents JVM pile-up
  • Throttle between batch dispatches prevents queue flood

Batch Recovery (Layer 4)

  • Failed items queryable via joelclaw o11y session docs-reindex
  • Re-fire batch with skipExistingArtifacts: true — only unfinished books re-run
  • All progress in artifacts on disk + Inngest run state, nothing in memory

Observability (Layer 5)

  • OTEL event per stage per book (with ADR-0233 provenance)
  • Gateway progress updates per book completion
  • Batch summary via gateway on completion

Rollback (Layer 6)

  • New chunks go to docs_chunks_v2 (parallel to existing docs_chunks)
  • Old collection untouched until verified
  • Cut over only after retrieval quality comparison

Estimated Cost

  • Extraction: 602 PDFs × 2.5s = ~25 min
  • Summaries: 602 LLM calls via pi = ~30 min, ~$6
  • Embedding: 150K+ chunks with nomic via Typesense = ~1-2 hours (CPU, local)
  • Total: ~3 hours AFK