ADR-0234 — Accepted

Docs Pipeline v2 — Artifact-Chain Architecture with Workload Queue

Context

The current docs-ingest pipeline (602 PDFs, 145K chunks) has three problems:

  1. Garbage extraction: pypdf dumps flat text — no tables, no headings, no reading order. opendataloader-pdf (just installed) produces structured markdown but the pipeline wasn’t designed around it.

  2. No intermediate artifacts: The pipeline extracts text to /tmp, processes it in memory, indexes to Typesense, then deletes the temp file. If you change chunking strategy, embedding model, or summary prompt, you must re-extract every PDF from scratch. Every stage is coupled.

  3. No batch orchestration: Re-indexing 602 PDFs requires manually firing events. No progress tracking, no resume-on-failure, no workload shaping.

Research findings (Pinecone, Weaviate, arXiv “Chunk Twice, Embed Once”, Id8, Anthropic):

  • Document-based chunking on markdown headings outperforms fixed-size and semantic chunking for structured documents
  • Recursive chunking with no overlap (R100-0) achieves 45% higher precision than overlapping approaches
  • Context inheritance (prepending parent heading/summary to each chunk) improves retrieval 2-18%
  • Retrieval-tuned embedding models (nomic-embed-text-v1.5, E5, bge) substantially outperform general-purpose MiniLM
  • Typesense has ts/nomic-embed-text-v1.5 built-in — no external API needed
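These findings map directly onto Stage 3's chunker. A minimal sketch of heading-based, no-overlap splitting — each markdown heading starts a new section, and oversized sections are split recursively on paragraph boundaries. Names and the maxChars budget are illustrative, not the pipeline's actual code:

```typescript
// One chunk per heading-scoped section, no overlap (per the R100-0 finding).
interface Chunk {
  headingPath: string[];
  content: string;
}

function chunkMarkdown(md: string, maxChars = 1200): Chunk[] {
  const chunks: Chunk[] = [];
  const headingPath: string[] = [];
  let buf: string[] = [];

  const flush = () => {
    const text = buf.join("\n").trim();
    buf = [];
    if (!text) return;
    if (text.length <= maxChars) {
      chunks.push({ headingPath: [...headingPath], content: text });
      return;
    }
    // Recursive step: split an oversized section on paragraph boundaries.
    let piece = "";
    for (const p of text.split(/\n{2,}/)) {
      if (piece && piece.length + p.length > maxChars) {
        chunks.push({ headingPath: [...headingPath], content: piece.trim() });
        piece = "";
      }
      piece += p + "\n\n";
    }
    if (piece.trim()) chunks.push({ headingPath: [...headingPath], content: piece.trim() });
  };

  for (const line of md.split("\n")) {
    const m = /^(#{1,6})\s+(.*)$/.exec(line); // real "#" markers, not heuristics
    if (m) {
      flush();
      headingPath.length = m[1].length - 1; // truncate to parent depth
      headingPath[m[1].length - 1] = m[2].trim();
    } else {
      buf.push(line);
    }
  }
  flush();
  return chunks;
}
```

The heading_path carried on each chunk is what later feeds the context-inheritance prefix.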

Decision

Staged Artifact Pipeline

Each stage produces a durable artifact on disk. Stages are independently re-runnable. Later stages consume artifacts from earlier stages, not raw source files.

PDF (source, immutable, on NAS)

  ├─ Stage 1: CONVERT
  │   Input:  {path}.pdf
  │   Output: {artifacts_dir}/{docId}/{docId}.md
  │   Tool:   opendataloader-pdf → structured markdown
  │   Fallback: pypdf → flat text
  │   Artifact: markdown file, persisted

  ├─ Stage 2: CLASSIFY + SUMMARIZE
  │   Input:  {docId}.md (first 8K chars for classification, full text for summary)
  │   Output: {artifacts_dir}/{docId}/{docId}.meta.json
  │   Contains:
  │     - title, filename, file_type, page_count, sha256
  │     - primaryConceptId, conceptIds, conceptSource, taxonomyVersion
  │     - storageCategory, documentType, tags
  │     - summary (LLM-generated, 2-3 sentences)
  │     - source_host, nas_path, nas_paths
  │   Tool: existing taxonomy classifier + pi -p for summary
  │   Artifact: JSON metadata file, persisted

  ├─ Stage 3: CHUNK
  │   Input:  {docId}.md + {docId}.meta.json
  │   Output: {artifacts_dir}/{docId}/{docId}.chunks.jsonl
  │   Strategy:
  │     - Markdown-native heading detection (# markers, not heuristics)
  │     - Recursive splitting within sections exceeding target tokens
  │     - No overlap (R100-0 finding)
  │     - Two-level: section chunks + snippet sub-chunks
  │     - heading_path from actual markdown heading levels
  │     - Each chunk line: {chunk_id, chunk_type, heading_path, content, context_prefix, ...}
  │   Artifact: JSONL file, one chunk per line, persisted

  └─ Stage 4: INDEX
      Input:  {docId}.meta.json + {docId}.chunks.jsonl
      Output: Typesense docs + docs_chunks collections
      Process:
        - Upsert document record to `docs` collection from meta.json
        - Build retrieval_text per chunk:
          "[DOC: {title}] [SUMMARY: {summary}] [PATH: {heading_path}] [CONCEPTS: {labels}]\n\n{content}"
        - Bulk import chunks to `docs_chunks` with auto-embedding
      Embedding: ts/nomic-embed-text-v1.5 (768-dim, retrieval-tuned)
      No artifact (Typesense IS the artifact for this stage)
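Stage 4's retrieval_text assembly can be sketched as a pure function. The interface and function names are illustrative; the bracket format follows the template above:

```typescript
// Context inheritance: prepend document- and section-level context to each
// chunk before embedding, mirroring the Stage 4 retrieval_text template.
interface ChunkContext {
  title: string;
  summary: string;
  headingPath: string[]; // e.g. ["Installation", "Linux"]
  conceptLabels: string[];
}

function buildRetrievalText(ctx: ChunkContext, content: string): string {
  const prefix = [
    `[DOC: ${ctx.title}]`,
    `[SUMMARY: ${ctx.summary}]`,
    `[PATH: ${ctx.headingPath.join(" > ")}]`,
    `[CONCEPTS: ${ctx.conceptLabels.join(", ")}]`,
  ].join(" ");
  return `${prefix}\n\n${content}`;
}
```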

Artifact Storage — DURABLE, RESUMABLE, RECOVERABLE

Artifacts live on the NAS at /Volumes/three-body/docs-artifacts/{docId}/:

  • Durable: NAS-backed RAID5 storage, survives reboots and pod restarts
  • Resumable: each stage checks for existing artifacts before re-running — crash mid-pipeline, resume from last completed stage
  • Recoverable: artifacts are immutable once written — if indexing breaks, re-run stage 4 from existing artifacts
  • Archived: artifacts persist permanently — historical record of extraction/chunking quality
  • Persistent: NOT /tmp. NOT ephemeral. Real storage on real disks.

Directory structure:

/Volumes/three-body/docs-artifacts/
  {docId}/
    {docId}.md              — markdown extraction artifact
    {docId}.meta.json       — taxonomy + summary metadata
    {docId}.chunks.jsonl    — chunk records, one per line

Env var DOCS_ARTIFACTS_DIR defaults to /Volumes/three-body/docs-artifacts. Falls back to ~/clawd/data/pdf-brain/artifacts if NAS unavailable.
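The resolution order described above might look like this (helper names are hypothetical):

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Resolve the artifacts root: DOCS_ARTIFACTS_DIR env var first, then the
// NAS default, then the local fallback when the NAS mount is absent.
const NAS_DEFAULT = "/Volumes/three-body/docs-artifacts";
const LOCAL_FALLBACK = path.join(os.homedir(), "clawd/data/pdf-brain/artifacts");

function artifactsRoot(): string {
  const configured = process.env.DOCS_ARTIFACTS_DIR;
  if (configured) return configured;
  return fs.existsSync(NAS_DEFAULT) ? NAS_DEFAULT : LOCAL_FALLBACK;
}

// Per-document artifact paths, matching the directory structure above.
function artifactPath(docId: string, kind: "md" | "meta.json" | "chunks.jsonl"): string {
  return path.join(artifactsRoot(), docId, `${docId}.${kind}`);
}
```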

Re-run Semantics

  Change                      Re-run from   Skips
  ──────────────────────────  ────────────  ───────────────────────────────────
  New PDF added               Stage 1       Nothing
  Chunking strategy changed   Stage 3       Stages 1-2 (reuse .md + .meta.json)
  Summary prompt changed      Stage 2       Stage 1 (reuse .md)
  Embedding model changed     Stage 4       Stages 1-3 (reuse all artifacts)
  Full reindex                Stage 1       Nothing

Embedding Model Upgrade

Switch from ts/all-MiniLM-L12-v2 (384-dim) to ts/nomic-embed-text-v1.5 (768-dim):

  • Nomic is retrieval-tuned — #1 in QA tasks per arXiv “Chunk Twice, Embed Once”
  • Built into Typesense — no external API, no cost
  • Requires dropping and recreating docs_chunks collection with new schema
  • 768-dim vectors are ~2x storage but significantly better retrieval precision
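A sketch of what the recreated collection schema could look like, using Typesense's built-in auto-embedding (the embed config follows Typesense's schema format; field names other than the embedding field are illustrative):

```typescript
// docs_chunks_v2 schema sketch: Typesense concatenates the embed.from
// fields and computes the 768-dim vector server-side at import time.
const docsChunksV2Schema = {
  name: "docs_chunks_v2",
  fields: [
    { name: "chunk_id", type: "string" },
    { name: "doc_id", type: "string", facet: true },
    { name: "heading_path", type: "string" },
    { name: "retrieval_text", type: "string" },
    {
      name: "embedding",
      type: "float[]",
      embed: {
        from: ["retrieval_text"],
        model_config: { model_name: "ts/nomic-embed-text-v1.5" },
      },
    },
  ],
};
```

Because the dimensionality changes (384 → 768), this must be a new collection; existing vectors cannot be migrated in place.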

Workload Queue Orchestration

Batch reindex uses joelclaw workload plan → Redis queue → Inngest execution:

joelclaw workload plan "docs-reindex-v2"
  → Scan: find all PDFs (NAS manifest or filesystem walk)
  → Shape: 602 items, each with nasPath + docId
  → Queue: Redis sorted set, ordered by priority (modified-date or alpha)
  → Dispatch: fire docs/reindex-v2.requested events in batches
     - Concurrency: 3 parallel (JVM startup is memory-heavy)
     - Throttle: 1 event per 5s (prevent JVM pile-up)
  → Track: workload queue tracks pending/running/complete/failed
  → Resume: failed items re-queued automatically (Inngest retries)
  → Progress: OTEL events per stage, queryable via joelclaw o11y

The system-bus-worker on panda executes (Java, opendataloader-pdf, NAS mount, pi all available). The workload queue orchestrates.
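The scan → shape → dispatch flow can be sketched as a pure shaping step (types and the priority rule are illustrative; the real plan lives in a Redis sorted set):

```typescript
// Shape scanned PDFs into a prioritized dispatch plan: most recently
// modified first, with a send offset enforcing the 1-event-per-5s throttle.
interface ScannedPdf {
  nasPath: string;
  docId: string;
  mtimeMs: number;
}

interface DispatchItem {
  docId: string;
  nasPath: string;
  sendAtOffsetMs: number; // delay from batch start
}

function shapeWorkload(items: ScannedPdf[], throttleMs = 5000): DispatchItem[] {
  return [...items]
    .sort((a, b) => b.mtimeMs - a.mtimeMs) // priority: modified-date, newest first
    .map((item, i) => ({
      docId: item.docId,
      nasPath: item.nasPath,
      sendAtOffsetMs: i * throttleMs,
    }));
}
```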

New Inngest Function: docs-reindex-v2

Replaces the monolithic docs-ingest for the reindex flow. Each stage is a separate Inngest step for durability:

step.run("convert-pdf")        → produces .md artifact
step.run("classify-summarize") → produces .meta.json artifact
step.run("chunk")              → produces .chunks.jsonl artifact
step.run("index-typesense")    → upserts to Typesense

If any step fails, Inngest retries from that step. Artifacts from completed steps are preserved.

Consequences

  • Every stage produces a durable, inspectable artifact
  • Re-indexing with different strategies doesn’t require re-extracting PDFs
  • Embedding model upgrade from MiniLM to nomic improves retrieval quality
  • Workload queue provides batch orchestration, progress tracking, resume-on-failure
  • Markdown artifacts are useful beyond indexing (direct reading, vault integration, other pipelines)
  • 768-dim vectors double storage for docs_chunks (~300MB → ~600MB, trivial)
  • JVM startup per PDF adds latency (~2.5s) but runs AFK

Implementation Order

  1. ADR written (this document)
  2. Write artifact storage helpers (save/load .md, .meta.json, .chunks.jsonl)
  3. Upgrade book-chunk.ts for markdown-native heading detection, no overlap
  4. Add LLM summary step to pipeline
  5. Create docs_chunks_v2 collection with nomic-embed-text-v1.5
  6. Build docs-reindex-v2 Inngest function with staged steps
  7. Wire workload queue for batch orchestration
  8. Run full reindex AFK
  9. Verify retrieval quality improvement
  10. Cut over: alias docs_chunks_v2 → docs_chunks, deprecate v1

Durability Provisions

Storage (Layer 1)

  • Artifacts on NAS RAID5 (57TB, hardware redundancy)
  • Write-once: write to .tmp then atomic rename — no partial artifacts
  • Artifacts immutable once written — never modified, only replaced by re-running the stage
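The write-once discipline above — tmp file, fsync, atomic rename — can be sketched as follows (the function name is hypothetical):

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Write to a sibling .tmp file, flush it, then atomically rename into
// place, so readers never observe a partially written artifact.
function writeArtifactAtomic(finalPath: string, contents: string): void {
  fs.mkdirSync(path.dirname(finalPath), { recursive: true });
  const tmpPath = `${finalPath}.tmp`;
  const fd = fs.openSync(tmpPath, "w");
  try {
    fs.writeSync(fd, contents);
    fs.fsyncSync(fd); // flush to disk before the rename makes it visible
  } finally {
    fs.closeSync(fd);
  }
  fs.renameSync(tmpPath, finalPath); // atomic on the same filesystem
}
```

Rename is only atomic within a single filesystem, which holds here since the .tmp sibling lives in the same artifact directory.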

Pipeline Resumability (Layer 2)

  • Each step checks hasArtifact(docId, stage) before running
  • Crash after stage 2 → restart picks up at stage 3 from existing .md + .meta.json
  • skipExistingArtifacts: true on batch event skips completed books entirely
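The per-stage artifact check generalizes to a resume decision: run from the first stage whose artifact is missing. A sketch, where the exists callback stands in for the real on-disk check:

```typescript
// Artifact suffixes in stage order: Stage 1 → .md, Stage 2 → .meta.json,
// Stage 3 → .chunks.jsonl. Stage 4 has no artifact (Typesense is the output).
const STAGE_ARTIFACTS = ["md", "meta.json", "chunks.jsonl"] as const;

function firstIncompleteStage(exists: (suffix: string) => boolean): number {
  for (let i = 0; i < STAGE_ARTIFACTS.length; i++) {
    if (!exists(STAGE_ARTIFACTS[i])) return i + 1; // stages are 1-indexed
  }
  return 4; // all artifacts present → only indexing remains
}
```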

Inngest Durability (Layer 3)

  • Each stage is a separate step.run() — output persisted between retries
  • Default Inngest retry policy kept — never retries: 0, so failed steps always re-run
  • Concurrency limit 3 prevents JVM pile-up
  • Throttle between batch dispatches prevents queue flood

Batch Recovery (Layer 4)

  • Failed items queryable via joelclaw o11y session docs-reindex
  • Re-fire batch with skipExistingArtifacts: true — only unfinished books re-run
  • All progress in artifacts on disk + Inngest run state, nothing in memory

Observability (Layer 5)

  • OTEL event per stage per book (with ADR-0233 provenance)
  • Gateway progress updates per book completion
  • Batch summary via gateway on completion

Rollback (Layer 6)

  • New chunks go to docs_chunks_v2 (parallel to existing docs_chunks)
  • Old collection untouched until verified
  • Cut over only after retrieval quality comparison

Estimated Cost

  • Extraction: 602 PDFs × 2.5s = ~25 min
  • Summaries: 602 LLM calls via pi = ~30 min, ~$6
  • Embedding: 150K+ chunks with nomic via Typesense = ~1-2 hours (CPU, local)
  • Total: ~3 hours AFK