ADR-0220 (accepted)

pi-mono Artifacts Corpus via Restate + Typesense

Status

Accepted

Context and Problem Statement

We keep relearning the same upstream lessons about badlogic/pi-mono: the maintainer's voice, review heuristics, the issue-first contributor gate, common rejection patterns, release velocity, and package-boundary expectations.

Right now that knowledge is fragmented across:

  • local repo clones
  • GitHub issue threads
  • PR review comments
  • commit history
  • one-off agent research notes
  • human memory

That is brittle. It means every future upstream contribution risks redoing the same archaeology and sending low-signal issue/PR text back at Mario.

The useful corpus is not just prose docs. It is the full repo surface:

  • root docs and package READMEs
  • issue templates and GitHub workflow gates
  • issues and issue comments
  • pull requests and review comments
  • commits and releases
  • materialized maintainer guidance distilled from those artifacts

This is an appropriate Restate workload because the sync problem is:

  • paginated
  • idempotent
  • resumable
  • rate-limit sensitive
  • artifact-shaped (issues, comments, PRs, commits, releases)
  • worth re-running incrementally over time
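
A minimal sketch of how these properties fit together, assuming a page-numbered source and an upsert-by-id store. Every name here (Checkpoint, FetchPage, syncPaginated) is hypothetical and not taken from the actual implementation; the Map stands in for a Typesense upsert, and the injected fetchPage stands in for a GitHub API call.

```typescript
// Hypothetical shapes for illustration only.
interface Checkpoint {
  nextPage: number; // first page that has not been processed yet
  done: boolean;    // true once a short (final) page has been seen
}

type FetchPage<T> = (page: number, perPage: number) => Promise<T[]>;

// Reads at most `maxPages` pages per run, upserts every item by id,
// and returns a checkpoint the next run can resume from.
async function syncPaginated<T extends { id: string }>(
  fetchPage: FetchPage<T>,
  store: Map<string, T>,
  checkpoint: Checkpoint,
  opts: { perPage: number; maxPages: number },
): Promise<Checkpoint> {
  let page = checkpoint.nextPage;
  let pagesRead = 0;
  while (pagesRead < opts.maxPages) {
    const items = await fetchPage(page, opts.perPage);
    // Upsert by id: replaying a page overwrites, it never duplicates.
    for (const item of items) store.set(item.id, item);
    pagesRead += 1;
    page += 1;
    // A short page means the source is exhausted.
    if (items.length < opts.perPage) return { nextPage: page, done: true };
  }
  // Page budget spent (rate-limit friendly); resume here next time.
  return { nextPage: page, done: false };
}
```

Capping pages per run keeps a single run inside a rate-limit budget, and because writes are upserts, re-running from an older checkpoint is harmless.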

Decision

Create a dedicated Typesense collection named pi_mono_artifacts and populate it via a Restate DAG pipeline backed by the existing host-side direct task runner pattern.

Collection contract

One denormalized collection holds:

  • repo_doc
  • issue
  • issue_comment
  • pull_request
  • pull_request_review_comment
  • commit
  • release
  • maintainer_profile
  • sync_state

Each document carries enough structure for retrieval and filtering:

  • repo
  • kind
  • title
  • content
  • author
  • author_role
  • maintainer_signal
  • package_scopes
  • labels
  • decision_tags
  • path
  • sha
  • tag
  • thread_key
  • created_at
  • updated_at

Runtime shape

Use the existing Restate DAG runtime, not a parallel ingestion subsystem.

Implementation path:

  1. packages/restate/src/pi-mono-artifacts.ts

    • normalize GitHub + local repo artifacts
    • ensure/create pi_mono_artifacts
    • bulk-upsert documents into Typesense
    • materialize maintainer profile docs
    • write a sync_state checkpoint for later incremental runs
  2. scripts/restate/run-tier1-task.ts

    • add task pi-mono-artifacts-sync
    • reuse the host-runner pattern already used by ADR-0216 tier-1 Restate jobs
  3. packages/restate/src/pipelines.ts

    • add buildPiMonoArtifactsSyncPipeline()
    • first node: real sync shell task
    • second node: infer-based operator summary for readable run output
  4. packages/restate/src/trigger-dag.ts

    • expose pipeline pi-mono-sync
    • support --repo, --full-backfill, --max-pages, --per-page, --local-clone
  5. packages/cli/src/commands/restate.ts

    • add joelclaw restate pi-mono-sync
  6. packages/cli/src/commands/search.ts

    • add pi_mono_artifacts as a first-class searchable collection
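
As an illustration of the trigger surface in step 4, the flags could be parsed along these lines. This is a hedged sketch using Node's built-in util.parseArgs, not the code in trigger-dag.ts: the SyncOptions shape, the default repo, and the assumptions that --max-pages and --per-page take positive integers and that --local-clone takes a path are all hypothetical.

```typescript
import { parseArgs } from "node:util";

// Hypothetical options object; the real trigger may differ.
interface SyncOptions {
  repo: string;
  fullBackfill: boolean;
  maxPages?: number;
  perPage?: number;
  localClone?: string;
}

// Parses an optional positive-integer flag value.
function toPositiveInt(flag: string, raw: string | undefined): number | undefined {
  if (raw === undefined) return undefined;
  const n = Number(raw);
  if (!Number.isInteger(n) || n <= 0) {
    throw new Error(`--${flag} must be a positive integer, got "${raw}"`);
  }
  return n;
}

function parseSyncArgs(argv: string[]): SyncOptions {
  const { values } = parseArgs({
    args: argv,
    options: {
      repo: { type: "string" },
      "full-backfill": { type: "boolean" },
      "max-pages": { type: "string" },
      "per-page": { type: "string" },
      "local-clone": { type: "string" },
    },
    strict: true, // unknown flags are rejected rather than ignored
  });

  // Assumption: default to the repo this ADR is about.
  const repo = values.repo ?? "badlogic/pi-mono";
  if (!/^[^/\s]+\/[^/\s]+$/.test(repo)) {
    throw new Error(`--repo must look like owner/name, got "${repo}"`);
  }

  return {
    repo,
    fullBackfill: values["full-backfill"] ?? false,
    maxPages: toPositiveInt("max-pages", values["max-pages"]),
    perPage: toPositiveInt("per-page", values["per-page"]),
    localClone: values["local-clone"],
  };
}
```

Rejecting unknown flags and malformed numbers up front keeps a bad invocation from starting a long backfill with the wrong bounds.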

Scheduling policy

Do not add a Dkron cron job yet.

The corpus is immediately useful with a manual, operator-triggered sync and backfill. Once it proves its value, a scheduled incremental refresh can be added through an ADR update or a follow-up task.

Public surface and extension split

Keep the corpus, sync runtime, and public search surface in joelclaw.

Why:

  • joelclaw already owns Typesense, Upstash rate limiting, and joelclaw.com API discovery
  • the public operator surface belongs on joelclaw.com
  • the contributor-facing extension should evolve separately from the backend corpus/indexing runtime

Therefore:

  • joelclaw owns pi_mono_artifacts, joelclaw restate pi-mono-sync, joelclaw search --collection pi_mono_artifacts, and the public API discovery/search surface on joelclaw.com
  • the public extension/installer repo lives at joelhooks/contributing-to-pi-mono
  • the public API should include current install instructions for both the public skill and the public extension

Why this shape

One denormalized collection first

We care about retrieval and operator leverage more than perfect normalization.

The real operator questions are:

  • show me examples where Mario rejected config bloat
  • find comments that say a feature belongs in an extension
  • what does “Breaks TUI” usually mean in practice
  • compare an accepted proposal with a rejected one
  • what package boundaries show up in review comments for a given PR

One collection with strong facets answers those now.
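
To make "strong facets" concrete, here is a sketch of how an operator question might translate into Typesense search parameters. The q, query_by, filter_by, and facet_by parameters and the field:=[a,b] filter form are standard Typesense search syntax; the ArtifactQuery shape, the buildSearchParams helper, the choice of query fields, and the tag value config_bloat are hypothetical.

```typescript
// Hypothetical query shape for illustration.
interface ArtifactQuery {
  q: string;
  kinds?: string[];
  decisionTags?: string[];
  authorRole?: string;
}

// Only allow simple tokens so values need no filter escaping.
function assertSimple(value: string): string {
  if (!/^[A-Za-z0-9_.\-/]+$/.test(value)) {
    throw new Error(`unsupported filter value: "${value}"`);
  }
  return value;
}

function buildSearchParams(query: ArtifactQuery): Record<string, string> {
  const filters: string[] = [];
  if (query.kinds?.length) {
    filters.push(`kind:=[${query.kinds.map(assertSimple).join(",")}]`);
  }
  if (query.decisionTags?.length) {
    filters.push(`decision_tags:=[${query.decisionTags.map(assertSimple).join(",")}]`);
  }
  if (query.authorRole) {
    filters.push(`author_role:=${assertSimple(query.authorRole)}`);
  }

  const params: Record<string, string> = {
    q: query.q,
    query_by: "title,content",                    // assumption: text fields to search
    facet_by: "kind,decision_tags,package_scopes", // assumption: facets worth returning
  };
  // Omit filter_by entirely when there is nothing to filter on.
  if (filters.length > 0) params.filter_by = filters.join(" && ");
  return params;
}
```

Under these assumptions, "show me examples where Mario rejected config bloat" becomes a text query over comment kinds filtered to a single decision tag, with no joins across collections.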

Host-runner pattern over a second executor

ADR-0216 already proved the Restate shell-node → host direct task runner pattern. Reusing that is the right first move because:

  • it keeps the runtime consistent
  • it avoids inventing a second ingestion stack
  • it gives us real durable orchestration today
  • it keeps Typesense writes and GitHub auth on the operator host where they already work

Consequences

Positive

  • pi-mono contribution research becomes durable system knowledge, not session vapor
  • future issue/PR drafting can search real maintainer patterns instead of guessing
  • maintainer profile and sync checkpoint docs are materialized automatically
  • Restate gets a research/indexing workload that actually fits its durability model
  • CLI-first search remains intact because joelclaw search can query the new collection

Negative

  • another corpus to keep fresh
  • GitHub API pagination and rate limits must be handled honestly
  • initial implementation still relies on host-side auth/tooling rather than a fully isolated runner

Risks

  • public unauthenticated GitHub requests will be rate-limited; use token/env/gh auth token fallback
  • Typesense collection growth could get noisy if we ingest too much low-value repo documentation; keep the doc set curated
  • maintainer-profile materialization can drift into bullshit if it becomes too generative; keep it evidence-backed and heuristic-heavy for now
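
The token fallback named in the first risk could look something like this. It is a sketch under assumptions: the lookup order (explicit value, then environment, then the gh CLI), the GITHUB_TOKEN and GH_TOKEN variable names, and the function names are illustrative and not confirmed by the implementation. The gh auth token command is the real GitHub CLI subcommand for printing the stored token.

```typescript
import { execFileSync } from "node:child_process";

type TokenSource = "explicit" | "env" | "gh-cli" | "none";

// Asks the GitHub CLI for its stored token; returns undefined if gh
// is missing, not logged in, or prints nothing.
function ghCliToken(): string | undefined {
  try {
    const out = execFileSync("gh", ["auth", "token"], {
      encoding: "utf8",
      stdio: ["ignore", "pipe", "ignore"],
    });
    return out.trim() || undefined;
  } catch {
    return undefined;
  }
}

// Resolves a token in a fixed order and reports where it came from,
// so a run can log that it is about to hit unauthenticated limits.
function resolveGitHubToken(
  explicit?: string,
  env: Record<string, string | undefined> = process.env,
  runGh: () => string | undefined = ghCliToken,
): { token?: string; source: TokenSource } {
  if (explicit) return { token: explicit, source: "explicit" };

  const fromEnv = env.GITHUB_TOKEN || env.GH_TOKEN;
  if (fromEnv) return { token: fromEnv, source: "env" };

  const fromGh = runGh();
  if (fromGh) return { token: fromGh, source: "gh-cli" };

  return { source: "none" };
}
```

Returning the source alongside the token lets the sync refuse or shrink a full backfill when it would otherwise run unauthenticated.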

Required Skills Preflight

Load before implementing or extending this ADR:

  • system-architecture — understand the Restate/CLI/Typesense/operator wiring
  • typesense — collection schema, import, and search behavior
  • adr-skill — keep the ADR executable and updated

Current gap:

  • there is no canonical Restate skill yet. Until one exists, implementers must read packages/restate/README.md, ADR-0207, and ADR-0216 directly.

Implementation Plan

Code

  • add packages/restate/src/pi-mono-artifacts.ts
  • extend scripts/restate/run-tier1-task.ts
  • extend packages/restate/src/pipelines.ts
  • extend packages/restate/src/trigger-dag.ts
  • extend packages/cli/src/commands/restate.ts
  • extend packages/cli/src/commands/search.ts
  • add targeted tests for normalization/tagging heuristics
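
For the tagging heuristics, an evidence-backed rule can be as small as a pattern plus the matched text. The sketch below is illustrative only: the rule list, tag names, and tagDecisions function are hypothetical, and the two phrases are borrowed from the operator questions earlier in this ADR rather than from the shipped heuristics.

```typescript
// Hypothetical rule shape: a tag and the phrase pattern that justifies it.
interface TagRule {
  tag: string;
  pattern: RegExp;
}

interface TagHit {
  tag: string;
  evidence: string; // the exact matched text, kept so tags stay auditable
}

const RULES: TagRule[] = [
  { tag: "belongs_in_extension", pattern: /belongs in an? extension/i },
  { tag: "breaks_tui", pattern: /breaks tui/i },
];

// Returns one hit per matching rule, in rule order, with its evidence.
function tagDecisions(text: string): TagHit[] {
  const hits: TagHit[] = [];
  for (const rule of RULES) {
    const match = rule.pattern.exec(text);
    if (match) hits.push({ tag: rule.tag, evidence: match[0] });
  }
  return hits;
}
```

Keeping the matched snippet next to each tag makes the heuristic straightforward to test and keeps downstream maintainer-profile material tied to quotable evidence.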

Documentation

  • update packages/restate/README.md
  • update docs/cli.md
  • update docs/architecture.md
  • update docs/inngest-functions.md
  • update ADR index ~/Vault/docs/decisions/README.md

Backfill

  • run a real sync against badlogic/pi-mono
  • verify documents landed in pi_mono_artifacts
  • verify maintainer_profile and sync_state were written
  • verify joelclaw search --collection pi_mono_artifacts works

Implementation Progress (2026-03-07)

Phase 1 shipped in the repo:

  • packages/restate/src/pi-mono-artifacts.ts
    • collection schema
    • GitHub + local-doc normalization
    • heuristic decision_tags
    • maintainer_profile materialization
    • sync_state checkpoint writes
  • scripts/restate/run-tier1-task.ts
    • added pi-mono-artifacts-sync
  • packages/restate/src/pipelines.ts
    • added buildPiMonoArtifactsSyncPipeline()
  • packages/restate/src/trigger-dag.ts
    • added pipeline trigger surface for pi-mono-sync
  • packages/cli/src/commands/restate.ts
    • added joelclaw restate pi-mono-sync
  • packages/cli/src/commands/search.ts
    • added pi_mono_artifacts collection support

Verification

  • bun test packages/restate/src/pi-mono-artifacts.test.ts
  • bunx tsc --noEmit
  • pnpm biome check packages/ apps/ (repo-wide pre-existing failures; use targeted checks on touched files)
  • joelclaw restate pi-mono-sync --repo badlogic/pi-mono --sync
  • joelclaw search "badlogic" --collection pi_mono_artifacts
  • joelclaw search "Breaks TUI" --collection pi_mono_artifacts
  • joelclaw search "maintainer profile" --collection pi_mono_artifacts
  • curl -sS "https://joelclaw.com/api/search?q=which+provider%2Fmodel+triggered+this&collection=pi_mono_artifacts" (route implemented locally; deploy verification still required after push)
  • curl -sS "https://joelclaw.com/api/pi-mono" (route implemented locally; deploy verification still required after push)

Follow-up

If the corpus proves useful, add a scheduled incremental refresh via Dkron as a separate step. Do not conflate “collection exists” with “cron should exist.”