ADR-0136proposed
Integrate 20-Year Twitter Archive as a Typesense-Backed Corpus
Status
proposed
Context
A complete Twitter archive (~7+ GB, ~20 years) exists at /Volumes/three-body/twitter.
The question is how to make this useful inside joelclaw without creating another silo.
Current constraints:
- Qdrant is no longer part of the active system.
- Typesense is already the unified search backend and supports vector/embedding fields (ADR-0082).
- joelclaw architecture prefers durable, idempotent Inngest pipelines and CLI-first operations.
- Source archives on NAS should remain immutable and re-indexable (ADR-0088).
If we only import directly into one ad-hoc Typesense collection, we lose reproducibility, make re-indexing painful, and couple parsing bugs to serving state.
Decision
Adopt a three-layer architecture for Twitter archive integration:
-
Raw layer (immutable)
- Keep original archive files unchanged at
/Volumes/three-body/twitter.
- Keep original archive files unchanged at
-
Canonical normalized layer (rebuildable)
- Produce normalized records (JSONL or Parquet) from archive exports.
- This layer is the source for all re-indexing and schema migrations.
-
Typesense serving layer (query/runtime)
- Use Typesense for both lexical and vector retrieval.
- Store embeddings directly in Typesense fields.
Collection strategy
Phase 1 collections:
twitter_tweets— one document per tweettwitter_threads— materialized thread/conversation documents
Optional phase 2:
twitter_links— URL/domain-centric documents for link intelligence
Required normalized tweet fields
At minimum:
id(tweet id, stable primary key)author_idcreated_attextconversation_idin_reply_to_status_idlanghashtags[]mentions[]urls[]media_keys[]raw_ref(pointer back to original archive artifact)archive_batch(for ingest provenance)
Retrieval policy
- Hybrid search by default (keyword + vector), not vector-only.
- Support filters for date range, thread/conversation, mentions, hashtags, URL domains.
- Recency and exact-match signals should be tunable so old tweets do not dominate broad queries.
Alternatives considered
-
Typesense-only, no canonical normalized layer
- Rejected: fragile, difficult re-index, no clean replay path.
-
Keep archive as files + grep/jq only
- Rejected: poor ranking, poor typo tolerance, weak semantic recall.
-
Add another vector DB beside Typesense
- Rejected: unnecessary infra split; current direction is Typesense consolidation.
Non-goals
- Replacing or changing the existing
joelclaw xposting/auth commands (ADR-0119). - Building full social automation from historical tweets in phase 1.
- Real-time firehose ingest in this ADR (archive-first only).
Implementation plan
Affected paths (planned)
packages/system-bus/src/inngest/client.ts(event schema)packages/system-bus/src/inngest/functions/twitter-archive-ingest.ts(new)packages/system-bus/src/inngest/functions/index.host.ts(function registration)packages/system-bus/src/lib/typesense.ts(collection schema helpers/constants)packages/cli/src/commands/x.tsorpackages/cli/src/commands/twitter.ts(archive ingest/search ops)skills/x-api/SKILL.md(operator guidance updates)
Pipeline shape
- Emit
twitter/archive.ingest.requestedwith source path + run metadata. - Parse archive files into normalized records with idempotent dedupe on
tweet_id. - Materialize threads (
twitter_threads) from canonical tweet set. - Upsert into Typesense collections with embedding fields.
- Emit OTEL checkpoints for parse/normalize/threadize/index stages.
- Expose CLI operations for ingest status, reindex, and search.
Patterns to follow / avoid
Follow:
- Reuse existing Typesense helper patterns in
packages/system-bus/src/lib/typesense.ts. - Keep ingest idempotent with deterministic document IDs.
- Keep event naming in past-tense happened/requested style already used by system-bus.
Avoid:
- Do not call the live X API for historical backfill when local archive data exists.
- Do not write parser output straight to Typesense without persisting canonical normalized data first.
- Do not introduce a second vector store for this pipeline.
Data location policy
- Raw source stays at
/Volumes/three-body/twitter. - Canonical normalized output should live adjacent on NAS under a dedicated subdir (e.g.
/Volumes/three-body/twitter/normalized/). - Typesense is serving index only; never source of truth.
Configuration impact
- No new external API credentials are required for archive backfill.
- Reuse existing
typesense_api_keylease flow and current Typesense connectivity.
Verification
- A full ingest run can be re-executed without duplicates (idempotency by tweet id).
-
twitter_tweetsandtwitter_threadsexist in Typesense with expected doc counts. - A known tweet can be found by exact phrase and by semantic paraphrase.
- Date-filtered queries return correct bounded result sets.
- Pipeline emits OTEL events for every stage with success/failure counts.
- Reindex from canonical layer succeeds without reading external APIs.
Consequences
Positive
- 20 years of personal signal becomes first-class, queryable context inside joelclaw.
- No extra search/vector database to operate.
- Deterministic replay/reindex path from canonical data.
Negative / risks
- Initial normalization will be heavy and may expose malformed historical payloads.
- Schema design mistakes in phase 1 can create expensive reindex cycles.
- Embedding cost/latency must be controlled for large backfills.
Follow-up
If accepted, create implementation stories for:
- Archive parser + canonical writer
- Typesense schema + ingest functions
- CLI surfaces (
ingest,status,search,reindex) - OTEL dashboards for archive ingest health