# Docs Pipeline v2 — Artifact-Chain Architecture with Workload Queue
## Context
The current docs-ingest pipeline (602 PDFs, 145K chunks) has three problems:

- Garbage extraction: pypdf dumps flat text — no tables, no headings, no reading order. opendataloader-pdf (just installed) produces structured markdown, but the pipeline wasn't designed around it.
- No intermediate artifacts: The pipeline extracts text to `/tmp`, processes it in memory, indexes to Typesense, then deletes the temp file. If you change the chunking strategy, embedding model, or summary prompt, you must re-extract every PDF from scratch. Every stage is coupled.
- No batch orchestration: Re-indexing 602 PDFs requires manually firing events. No progress tracking, no resume-on-failure, no workload shaping.
Research findings (Pinecone, Weaviate, the arXiv "Chunk Twice, Embed Once" paper, Id8, Anthropic):
- Document-based chunking on markdown headings outperforms fixed-size and semantic chunking for structured documents
- Recursive chunking with no overlap (R100-0) achieves 45% higher precision than overlapping approaches
- Context inheritance (prepending parent heading/summary to each chunk) improves retrieval 2-18%
- Retrieval-tuned embedding models (nomic-embed-text-v1.5, E5, bge) substantially outperform general-purpose MiniLM
- Typesense has `ts/nomic-embed-text-v1.5` built in — no external API needed
## Decision

### Staged Artifact Pipeline

Each stage produces a durable artifact on disk. Stages are independently re-runnable. Later stages consume artifacts from earlier stages, not raw source files.
PDF (source, immutable, on NAS)
│
├─ Stage 1: CONVERT
│ Input: {path}.pdf
│ Output: {artifacts_dir}/{docId}/{docId}.md
│ Tool: opendataloader-pdf → structured markdown
│ Fallback: pypdf → flat text
│ Artifact: markdown file, persisted
│
├─ Stage 2: CLASSIFY + SUMMARIZE
│ Input: {docId}.md (first 8K chars for classification, full text for summary)
│ Output: {artifacts_dir}/{docId}/{docId}.meta.json
│ Contains:
│ - title, filename, file_type, page_count, sha256
│ - primaryConceptId, conceptIds, conceptSource, taxonomyVersion
│ - storageCategory, documentType, tags
│ - summary (LLM-generated, 2-3 sentences)
│ - source_host, nas_path, nas_paths
│ Tool: existing taxonomy classifier + pi -p for summary
│ Artifact: JSON metadata file, persisted
│
├─ Stage 3: CHUNK
│ Input: {docId}.md + {docId}.meta.json
│ Output: {artifacts_dir}/{docId}/{docId}.chunks.jsonl
│ Strategy:
│ - Markdown-native heading detection (# markers, not heuristics)
│ - Recursive splitting within sections exceeding target tokens
│ - No overlap (R100-0 finding)
│ - Two-level: section chunks + snippet sub-chunks
│ - heading_path from actual markdown heading levels
│ - Each chunk line: {chunk_id, chunk_type, heading_path, content, context_prefix, ...}
│ Artifact: JSONL file, one chunk per line, persisted
│
└─ Stage 4: INDEX
Input: {docId}.meta.json + {docId}.chunks.jsonl
Output: Typesense docs + docs_chunks collections
Process:
- Upsert document record to `docs` collection from meta.json
- Build retrieval_text per chunk:
"[DOC: {title}] [SUMMARY: {summary}] [PATH: {heading_path}] [CONCEPTS: {labels}]\n\n{content}"
- Bulk import chunks to `docs_chunks` with auto-embedding
Embedding: ts/nomic-embed-text-v1.5 (768-dim, retrieval-tuned)
       No artifact (Typesense IS the artifact for this stage)

## Artifact Storage — DURABLE, RESUMABLE, RECOVERABLE
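Stage 3's strategy can be sketched as follows. This is a minimal illustration, not the real `book-chunk.ts` implementation: the function and field names are assumptions, and the snippet splitter below is a fixed-size stand-in for the recursive splitter the ADR calls for. It does show the core decisions — real `#` markers (not heuristics), a heading stack that yields `heading_path`, and zero overlap per the R100-0 finding.

```typescript
// Hypothetical sketch of markdown-native, no-overlap, two-level chunking.

interface Chunk {
  chunk_id: string;
  chunk_type: "section" | "snippet";
  heading_path: string[]; // from actual markdown heading levels
  content: string;
}

export function chunkMarkdown(docId: string, md: string, maxChars = 2000): Chunk[] {
  const chunks: Chunk[] = [];
  const stack: { level: number; text: string }[] = [];
  let buf: string[] = [];
  let n = 0;

  const flush = () => {
    const content = buf.join("\n").trim();
    buf = [];
    if (!content) return;
    const heading_path = stack.map((h) => h.text);
    // Level 1: one section chunk per heading-delimited region.
    chunks.push({ chunk_id: `${docId}:${n++}`, chunk_type: "section", heading_path, content });
    // Level 2: snippet sub-chunks for oversized sections, no overlap (R100-0).
    if (content.length > maxChars) {
      for (let i = 0; i < content.length; i += maxChars) {
        chunks.push({
          chunk_id: `${docId}:${n++}`,
          chunk_type: "snippet",
          heading_path,
          content: content.slice(i, i + maxChars),
        });
      }
    }
  };

  for (const line of md.split("\n")) {
    const m = /^(#{1,6})\s+(.*)$/.exec(line); // real # markers, not heuristics
    if (m) {
      flush();
      const level = m[1].length;
      // Pop siblings/deeper headings so the stack mirrors the heading tree.
      while (stack.length && stack[stack.length - 1].level >= level) stack.pop();
      stack.push({ level, text: m[2] });
    } else {
      buf.push(line);
    }
  }
  flush();
  return chunks;
}
```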
Artifacts live on the NAS at `/Volumes/three-body/docs-artifacts/{docId}/`:

- Durable: NAS-backed RAID5 storage, survives reboots and pod restarts
- Resumable: each stage checks for existing artifacts before re-running — crash mid-pipeline, resume from the last completed stage
- Recoverable: artifacts are immutable once written — if indexing breaks, re-run stage 4 from the existing artifacts
- Archived: artifacts persist permanently as a historical record of extraction/chunking quality
- Persistent: NOT `/tmp`. NOT ephemeral. Real storage on real disks.
Directory structure:

/Volumes/three-body/docs-artifacts/
  {docId}/
    {docId}.md           — markdown extraction artifact
    {docId}.meta.json    — taxonomy + summary metadata
    {docId}.chunks.jsonl — chunk records, one per line

Env var `DOCS_ARTIFACTS_DIR` defaults to `/Volumes/three-body/docs-artifacts`. Falls back to `~/clawd/data/pdf-brain/artifacts` if the NAS is unavailable.
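The directory resolution and per-document paths above can be sketched as a small helper (the helper names `artifactsDir`/`artifactPaths` are assumptions, not the real module's API):

```typescript
import { existsSync } from "node:fs";
import { homedir } from "node:os";
import { join } from "node:path";

const NAS_DIR = "/Volumes/three-body/docs-artifacts";
const FALLBACK_DIR = join(homedir(), "clawd/data/pdf-brain/artifacts");

// Resolve the artifacts root: env var wins, then NAS, then local fallback.
export function artifactsDir(): string {
  const fromEnv = process.env.DOCS_ARTIFACTS_DIR;
  if (fromEnv) return fromEnv;
  return existsSync(NAS_DIR) ? NAS_DIR : FALLBACK_DIR;
}

// Paths for the three staged artifacts of one document.
export function artifactPaths(docId: string) {
  const dir = join(artifactsDir(), docId);
  return {
    dir,
    md: join(dir, `${docId}.md`),
    meta: join(dir, `${docId}.meta.json`),
    chunks: join(dir, `${docId}.chunks.jsonl`),
  };
}
```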
## Re-run Semantics
| Change | Re-run from | Skips |
|---|---|---|
| New PDF added | Stage 1 | Nothing |
| Chunking strategy changed | Stage 3 | Stages 1-2 (reuse .md + .meta.json) |
| Summary prompt changed | Stage 2 | Stage 1 (reuse .md) |
| Embedding model changed | Stage 4 | Stages 1-3 (reuse all artifacts) |
| Full reindex | Stage 1 | Nothing |
## Embedding Model Upgrade

Switch from ts/all-MiniLM-L12-v2 (384-dim) to ts/nomic-embed-text-v1.5 (768-dim):

- Nomic is retrieval-tuned — #1 in QA tasks per the arXiv "Chunk Twice, Embed Once" paper
- Built into Typesense — no external API, no cost
- Requires dropping and recreating the `docs_chunks` collection with a new schema
- 768-dim vectors take ~2x the storage but give significantly better retrieval precision
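A sketch of what the recreated collection and the Stage 4 `retrieval_text` builder might look like. The schema uses Typesense's auto-embedding field syntax; the non-embedding field names are assumptions drawn from the chunk record above, and the `" > "` separator for heading paths is an illustrative choice, not a documented one.

```typescript
// Hypothetical docs_chunks_v2 schema with Typesense auto-embedding.
export const docsChunksV2Schema = {
  name: "docs_chunks_v2",
  fields: [
    { name: "chunk_id", type: "string" },
    { name: "heading_path", type: "string[]", facet: true },
    { name: "retrieval_text", type: "string" },
    {
      name: "embedding",
      type: "float[]",
      embed: {
        from: ["retrieval_text"],
        model_config: { model_name: "ts/nomic-embed-text-v1.5" }, // 768-dim
      },
    },
  ],
};

// Build retrieval_text per the Stage 4 template (context inheritance).
export function buildRetrievalText(
  title: string,
  summary: string,
  headingPath: string[],
  labels: string[],
  content: string,
): string {
  return `[DOC: ${title}] [SUMMARY: ${summary}] [PATH: ${headingPath.join(" > ")}] [CONCEPTS: ${labels.join(", ")}]\n\n${content}`;
}
```

Because the embedding is generated from `retrieval_text` at import time, the context-inheritance prefix is baked into the vector, which is what the 2-18% retrieval improvement finding relies on.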
## Workload Queue Orchestration

Batch reindex uses `joelclaw workload plan` → Redis queue → Inngest execution:
joelclaw workload plan "docs-reindex-v2"
→ Scan: find all PDFs (NAS manifest or filesystem walk)
→ Shape: 602 items, each with nasPath + docId
→ Queue: Redis sorted set, ordered by priority (modified-date or alpha)
→ Dispatch: fire docs/reindex-v2.requested events in batches
- Concurrency: 3 parallel (JVM startup is memory-heavy)
- Throttle: 1 event per 5s (prevent JVM pile-up)
→ Track: workload queue tracks pending/running/complete/failed
→ Resume: failed items re-queued automatically (Inngest retries)
→ Progress: OTEL events per stage, queryable via `joelclaw o11y`

The system-bus-worker on panda executes the stages (Java, opendataloader-pdf, the NAS mount, and pi are all available there). The workload queue orchestrates.
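The shape and dispatch steps can be sketched as below. This is a minimal stand-in: the in-memory sort models the Redis sorted set (ZADD by score), and `send` is a stubbed event emitter, since the real `joelclaw` internals aren't shown in this ADR.

```typescript
interface WorkItem {
  docId: string;
  nasPath: string;
  priority: number; // e.g. modified-date epoch, or alpha rank
}

// Shape: order 602 items by priority — stands in for a Redis sorted set.
export function shapeQueue(items: WorkItem[]): WorkItem[] {
  return [...items].sort((a, b) => a.priority - b.priority);
}

// Dispatch: fire one docs/reindex-v2.requested event per item, throttled
// (1 event per 5s by default, to prevent JVM pile-up on the worker).
export async function dispatch(
  queue: WorkItem[],
  send: (name: string, data: WorkItem) => Promise<void>,
  throttleMs = 5000,
): Promise<number> {
  let fired = 0;
  for (const item of queue) {
    await send("docs/reindex-v2.requested", item);
    fired++;
    if (fired < queue.length) await new Promise((r) => setTimeout(r, throttleMs));
  }
  return fired;
}
```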
## New Inngest Function: docs-reindex-v2
Replaces the monolithic docs-ingest for the reindex flow. Each stage is a separate Inngest step for durability:
step.run("convert-pdf") → produces .md artifact
step.run("classify-summarize") → produces .meta.json artifact
step.run("chunk") → produces .chunks.jsonl artifact
step.run("index-typesense") → upserts to Typesense

If any step fails, Inngest retries from that step. Artifacts from completed steps are preserved.
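The four staged steps can be sketched as below. To keep the sketch self-contained, a minimal memoizing `step.run` stand-in replaces Inngest's real step API, and the stage bodies are placeholders for the hypothetical stage helpers (the real ones would call opendataloader-pdf, the classifier, the chunker, and Typesense):

```typescript
// Minimal stand-in for Inngest's step API: completed step outputs are
// memoized, so a retry re-executes only the failed step onward.
type Step = { run<T>(name: string, fn: () => Promise<T> | T): Promise<T> };

export async function docsReindexV2(docId: string, step: Step): Promise<string[]> {
  const done: string[] = [];
  await step.run("convert-pdf", () => done.push(`${docId}.md`));                // Stage 1
  await step.run("classify-summarize", () => done.push(`${docId}.meta.json`));  // Stage 2
  await step.run("chunk", () => done.push(`${docId}.chunks.jsonl`));            // Stage 3
  await step.run("index-typesense", () => done.push("typesense"));              // Stage 4
  return done;
}

// In-process step runner for illustration only.
export function makeStep(): Step {
  const memo = new Map<string, unknown>();
  return {
    async run<T>(name: string, fn: () => Promise<T> | T): Promise<T> {
      if (!memo.has(name)) memo.set(name, await fn());
      return memo.get(name) as T;
    },
  };
}
```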
## Consequences
- Every stage produces a durable, inspectable artifact
- Re-indexing with different strategies doesn’t require re-extracting PDFs
- Embedding model upgrade from MiniLM to nomic improves retrieval quality
- Workload queue provides batch orchestration, progress tracking, resume-on-failure
- Markdown artifacts are useful beyond indexing (direct reading, vault integration, other pipelines)
- 768-dim vectors double storage for docs_chunks (~300MB → ~600MB, trivial)
- JVM startup per PDF adds latency (~2.5s) but runs AFK
## Implementation Order

1. ADR written
2. Write artifact storage helpers (save/load .md, .meta.json, .chunks.jsonl)
3. Upgrade book-chunk.ts for markdown-native heading detection, no overlap
4. Add LLM summary step to pipeline
5. Create docs_chunks_v2 collection with nomic-embed-text-v1.5
6. Build docs-reindex-v2 Inngest function with staged steps
7. Wire workload queue for batch orchestration
8. Run full reindex AFK
9. Verify retrieval quality improvement
10. Cut over: alias docs_chunks_v2 → docs_chunks, deprecate v1
## Durability Provisions

### Storage (Layer 1)

- Artifacts on NAS RAID5 (57TB, hardware redundancy)
- Write-once: write to `.tmp`, then atomic `rename` — no partial artifacts
- Artifacts immutable once written — never modified, only replaced by re-running the stage
### Pipeline Resumability (Layer 2)

- Each step checks `hasArtifact(docId, stage)` before running
- Crash after stage 2 → restart picks up at stage 3 from the existing `.md` + `.meta.json`
- `skipExistingArtifacts: true` on the batch event skips completed books entirely
### Inngest Durability (Layer 3)

- Each stage is a separate `step.run()` — output persisted between retries
- Default retry policy (never `retries: 0`)
- Concurrency limit of 3 prevents JVM pile-up
- Throttle between batch dispatches prevents queue flood
### Batch Recovery (Layer 4)

- Failed items queryable via `joelclaw o11y session docs-reindex`
- Re-fire the batch with `skipExistingArtifacts: true` — only unfinished books re-run
- All progress lives in on-disk artifacts + Inngest run state; nothing is memory-only
### Observability (Layer 5)
- OTEL event per stage per book (with ADR-0233 provenance)
- Gateway progress updates per book completion
- Batch summary via gateway on completion
### Rollback (Layer 6)

- New chunks go to `docs_chunks_v2` (parallel to the existing `docs_chunks`)
- Old collection untouched until verified
- Cut over only after retrieval quality comparison
## Estimated Cost
- Extraction: 602 PDFs × 2.5s = ~25 min
- Summaries: 602 LLM calls via pi = ~30 min, ~$6
- Embedding: 150K+ chunks with nomic via Typesense = ~1-2 hours (CPU, local)
- Total: ~3 hours AFK