ADR-0216

Dkron Distributed Scheduler for Restate DAG Pipelines

Status

accepted

Context and Problem Statement

The joelclaw system currently uses self-hosted Inngest for both scheduling (cron triggers, event triggers) and durable execution (step functions, retries). Operational data from 20 days of slog entries reveals significant instability:

  • 141 worker restarts (avg ~7/day)
  • 110 stale/stuck RUNNING runs requiring manual GraphQL cancellation or sqlite DB surgery
  • 61 function registry sync failures (worker running code Inngest server doesn’t know about)
  • 46 disk pressure / crash / zombie events
  • 9 “SDK URL unreachable” errors (Inngest server can’t reach worker, runs hang forever)

Root causes include: transport instability between k8s Inngest server and host worker (EOF/socket closed), sqlite journal state drift, Colima VM zombie states, and the tight coupling between Inngest’s scheduler, execution runtime, and function registry.

Restate (ADR-0207) now handles durable DAG execution reliably — the contact enrichment pipeline (ADR-0133) runs 11-node DAGs with 46 OTEL events per run, gateway notifications, and zero of the registry/transport issues that plague Inngest. But Restate has no built-in scheduling — it’s a runner, not a scheduler.

The system needs a dedicated scheduler that triggers Restate DAG runs on cron expressions, decoupled from the execution runtime.

Decision Drivers

  • Operational stability — the scheduler must not share failure modes with the runner
  • Minimal new infra — runs in existing k8s cluster, single pod
  • API-first — jobs created/managed via REST API (agent-friendly, CLI-compatible)
  • Dashboard — visual job management and execution history
  • HTTP executor — triggers Restate via plain HTTP POST, zero SDK coupling
  • Cron expressions — standard 5-field plus interval shortcuts (@every 1h)
  • Persistence — job definitions survive restarts
  • Single-node viable — HA is nice but not required for personal infra

Considered Options

1. Dkron (distributed cron service)

  • Go binary, Raft consensus, embedded BuntDB storage
  • Helm chart for k8s deployment
  • Web UI dashboard with job management, execution history
  • HTTP executor — jobs POST to any URL
  • REST API for job CRUD
  • ~20MB binary, minimal resource footprint
  • Open source (LGPL-3.0), actively maintained

2. Kubernetes CronJobs

  • Already available in our k8s cluster
  • Standard YAML-based job definitions
  • No dashboard, no REST API for job management
  • Minute-level granularity only
  • No retry intelligence beyond k8s restartPolicy
  • Each job is a YAML manifest — operational overhead scales linearly

3. Restate self-scheduling (virtual objects with delayed sends)

  • Zero new infra — uses Restate’s durable timers
  • Jobs are code, not config — requires redeployment to change schedules
  • No dashboard or management API
  • State lives in Restate’s journal — couples scheduler and runner failure modes
  • Cron parsing via npm library, not native

4. Cronicle (Node.js job scheduler)

  • Web UI with multi-server support
  • Heavier footprint (Node.js runtime)
  • Less k8s-native than Dkron
  • More features than needed (plugin system, categories, resource limits)

5. Keep Inngest for scheduling, Restate for execution

  • Preserves existing cron triggers
  • Still has all Inngest operational issues for the scheduling layer
  • Couples two runtimes — Inngest event → Inngest function → Restate HTTP call
  • Adds latency and failure points

Decision

Use Dkron as the dedicated scheduler for Restate DAG pipelines.

The architecture becomes:

┌─────────┐    HTTP POST     ┌─────────┐    durable execution    ┌───────────┐
│  Dkron  │ ──────────────→  │ Restate │ ─────────────────────→  │ dagWorker │
│ (cron)  │   /dagOrch/run   │ ingress │   waves + handlers      │ (shell,   │
│         │                  │         │                          │  http,    │
│ Web UI  │                  │ k8s:8080│                          │  infer)   │
└─────────┘                  └─────────┘                          └───────────┘
     ↑                                                                  │
     │ REST API                                              OTEL + gateway
     │ joelclaw restate cron ...                              notification

Scheduler (Dkron) decides when. Runner (Restate) decides how. Clean separation. No shared failure modes.
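As a concrete sketch, the hourly health job could be defined as a Dkron HTTP-executor job like the one below, created via Dkron's REST API (POST /v1/jobs). The ingress URL path and request body here are illustrative (taken loosely from the diagram above), and the executor_config field names follow Dkron's HTTP executor as documented, which expects all values as strings:

```json
{
  "name": "restate-health-check",
  "schedule": "0 7 * * * *",
  "executor": "http",
  "executor_config": {
    "method": "POST",
    "url": "http://restate.restate.svc.cluster.local:8080/dagOrch/run",
    "body": "{\"pipeline\": \"system-health-check\"}",
    "headers": "[\"Content-Type: application/json\"]",
    "timeout": "30",
    "expectCode": "200"
  }
}
```

The schedule is six-field (seconds first), matching Dkron's cron format; expectCode makes Dkron treat anything other than a 200 from the Restate ingress as a job failure, which is what keeps scheduler-side success honest.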

Current State

Phase 1 shipped (2026-03-06 vector)

  • k8s/dkron.yaml deployed: StatefulSet/dkron, headless peer service dkron-peer, API service dkron-svc
  • Dkron runs as ClusterIP-only for now; operator access goes through short-lived CLI-managed tunnels, not a permanent host port mapping
  • joelclaw restate cron command group shipped:
    • status
    • list
    • enable-health
    • sync-tier1
    • delete
  • Tier-1 scheduler ownership migrated off Inngest cron and onto Dkron:
    • restate-health-check → check/system-health-signals-schedule
    • restate-skill-garden → skill-garden
    • restate-typesense-full-sync → typesense/full-sync
    • restate-daily-digest → memory/digest-daily
    • restate-subscription-check-feeds → subscription/check-feeds
  • The health job is a native Restate DAG.
  • The other tier-1 jobs run through Restate shell nodes that execute host-side direct task runners (scripts/restate/run-tier1-task.ts) so Dkron success means real work executed, not just event dispatch.
  • The corresponding Inngest functions keep manual/on-demand event triggers where useful, but their recurring cron triggers were removed.
  • joelclaw restate cron list now surfaces migratedFrom, successCount, errorCount, lastSuccess, and lastError for soak monitoring.
  • Dkron cron expressions are six-field (sec min hour dom month dow), so hourly-at-minute-7 is 0 7 * * * *, not the five-field 7 * * * *
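Because Dkron prepends a seconds field, migrating Inngest's five-field crons means prefixing a 0. A sync-tier1-style helper (hypothetical name; not the actual CLI code) could do the conversion defensively:

```typescript
// Convert a standard 5-field cron (min hour dom month dow) to Dkron's
// 6-field form (sec min hour dom month dow) by prefixing a 0-second field.
// Passes through expressions that are already 6-field or use @shortcuts.
function toDkronSchedule(expr: string): string {
  const trimmed = expr.trim();
  if (trimmed.startsWith("@")) return trimmed; // @every 1h, @daily, ...
  const fields = trimmed.split(/\s+/);
  if (fields.length === 6) return fields.join(" ");
  if (fields.length === 5) return ["0", ...fields].join(" ");
  throw new Error(`unexpected cron expression: ${expr}`);
}
```

So the health check's Inngest cron 7 * * * * becomes 0 7 * * * *, and interval shortcuts like @every 1h pass through untouched.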

Open follow-up

  • Monitor tier-1 soak before touching tier-2 candidates; scheduler green is necessary but not sufficient — underlying OTEL evidence still has to stay clean.
  • Dashboard still lacks a stable host/Tailscale exposure path
  • Upstream dkron/dkron:latest currently needs root to write the local-path PVC; non-root hardening caused permission denied under /data/raft/snapshots/permTest

Deployment Plan

Phase 1: Deploy Dkron to k8s (single-node)

  • StatefulSet manifest in k8s/dkron.yaml
  • Headless peer service dkron-peer + ClusterIP API service dkron-svc
  • PVC for BuntDB persistence
  • Operator access via joelclaw restate cron ... using a short-lived tunnel

Phase 2: Wire to Restate

  • Configure HTTP executor jobs that POST to Restate ingress
  • Seed initial jobs:
    • enrich-vip-contacts — weekly re-enrichment of VIP contacts
    • system-health-check — hourly health pipeline
    • memory-maintenance — daily memory/observation pipeline
  • OTEL emission on job trigger (Dkron → OTEL endpoint)

Phase 3: CLI integration

  • joelclaw restate cron list — list Dkron jobs via REST API
  • joelclaw restate cron create — create job with cron expression + pipeline
  • joelclaw restate cron delete — remove job
  • joelclaw restate cron status — dashboard URL + job summary

Phase 4: Inngest scheduling migration (selective)

  • Identify Inngest cron-triggered functions that can move to Dkron → Restate
  • Migrate incrementally — Inngest keeps event-triggered functions
  • Track which cron triggers have been moved
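One way to track which crons have moved is to tag each Dkron job with the Inngest function it replaced and summarize from the job list. This is a sketch: the job fields below (metadata, success_count, error_count) are assumed from Dkron's GET /v1/jobs response shape, and migratedFrom mirrors the metadata field the cron list command surfaces:

```typescript
// Summarize migration progress from Dkron's job list. Jobs carrying a
// migratedFrom metadata key are counted as moved off Inngest; any moved
// job with recorded errors is flagged for soak attention.
interface DkronJob {
  name: string;
  schedule: string;
  metadata?: { migratedFrom?: string };
  success_count?: number;
  error_count?: number;
}

function migrationSummary(jobs: DkronJob[]): { migrated: string[]; flaky: string[] } {
  const moved = jobs.filter((j) => j.metadata?.migratedFrom);
  return {
    migrated: moved.map((j) => j.metadata!.migratedFrom!),
    flaky: moved.filter((j) => (j.error_count ?? 0) > 0).map((j) => j.name),
  };
}
```

This keeps the migration ledger in Dkron itself rather than a separate tracking file: the scheduler's job list is the source of truth for what has moved.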

Consequences

Positive

  • Decoupled failure domains — scheduler crash doesn’t break running DAGs, runner crash doesn’t lose schedules
  • No function registry sync — Dkron doesn’t need to know about handler code
  • HTTP-only coupling — Dkron just POSTs JSON, doesn’t care what runs it
  • Visual dashboard — first time we have a UI for scheduled job management
  • Agent-friendly — REST API for job CRUD, CLI wrapper for operator access

Negative

  • Another service to run — one more pod in k8s (though lightweight ~20MB)
  • No event triggers — Dkron is cron-only, Inngest’s event fan-out stays on Inngest
  • Auth gap — Dkron Pro has auth, open source doesn’t (mitigated: ClusterIP, no external exposure)
  • Known Dkron bug — leader failover can double-fire jobs (#1569) — acceptable for single-node, need idempotent pipelines regardless
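Given the double-fire bug, handlers should dedupe on the scheduled fire slot rather than trust the trigger count. A minimal in-memory guard is sketched below; a durable variant would key Restate ingress calls on the same slot (e.g. via an idempotency key) so deduplication survives restarts:

```typescript
// Dedupe duplicate cron fires: two triggers for the same job within the
// same schedule slot collapse to one execution. In-memory only — a process
// restart forgets seen keys, so downstream work must still be idempotent.
const seen = new Set<string>();

function shouldRun(jobName: string, firedAt: Date, slotMs = 60_000): boolean {
  // Round the fire time down to its schedule slot (default: one minute).
  const slot = Math.floor(firedAt.getTime() / slotMs) * slotMs;
  const key = `${jobName}@${slot}`;
  if (seen.has(key)) return false;
  seen.add(key);
  return true;
}
```

With this guard, a leader-failover double-fire landing 1–2 seconds after the first trigger is silently dropped, while the next legitimate slot runs normally.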

Migration Candidates

31 cron-triggered Inngest functions identified. Tiered by migration fit:

Tier 1 — First movers (already fit DAG shape, low risk)

| Function | Inngest ID | Cron | Migration notes |
|---|---|---|---|
| System health check | check/system-health-signals-schedule | 7 * * * * | Already a working Restate health pipeline — just needs Dkron trigger. Zero new code. |
| Skill garden | skill-garden | 0 6 * * * | Pure shell checks (broken symlinks, frontmatter, stale patterns). Natural shell-node DAG. |
| Typesense full sync | typesense/full-sync | 0 11 * * * | Sequential shell steps (index vault notes, index slack messages, index blog). Simple wave chain. |
| Daily digest | memory/digest-daily | 55 7 * * * | Gather day’s slog/OTEL/memory data → LLM synthesis → write vault note. Classic fan-in to infer. |
| Feed subscriptions | subscription/check-feeds | 0 * * * * | HTTP fetches per feed → diff → notify on changes. Parallel probes like contact enrichment. |

Tier 2 — Good candidates, more complex

| Function | Inngest ID | Cron | Migration notes |
|---|---|---|---|
| Nightly memory maintenance | system/memory-nightly-maintenance | 0 10 * * * | Multi-step memory pipeline (prune, compact, reconcile). High value — currently flaky on Inngest. |
| NAS backup (5 jobs) | system/backup-*, system/rotate-* | various daily/weekly/monthly | Shell commands (rsync, rotate). Simple but touches NAS infra — test carefully. |
| Friction analysis | memory/friction-analysis | 0 7 * * * | Gather friction signals from OTEL → LLM analysis → vault write. Fan-in pattern. |
| ADR evidence capture | system/memory-adr-evidence-capture | 15 13 * * * | Searches sessions for ADR-relevant observations → writes evidence. Shell + infer. |
| Granola check | granola-check-cron | 7 * * * * | Polls Granola for new meetings. Single HTTP check + conditional processing. |
| Content sync | content-sync | 0 * * * * | Vault → website content sync. Sequential steps, idempotent. |
| Channel intelligence | channel/intelligence.garden | 0 */6 * * * | Slack channel analysis. Multi-source probe + synthesis. |

Tier 3 — Keep on Inngest (monitors Inngest itself or tightly coupled)

| Function | Inngest ID | Cron | Reason to keep |
|---|---|---|---|
| System heartbeat | system-heartbeat | */15 * * * * | Monitors Inngest health — can’t move the watchdog off the thing it watches |
| Self-healing investigator | system/self-healing.investigator | */10 * * * * | Monitors and heals Inngest/worker/gateway — coupled to Inngest internals |
| Self-healing gateway bridge | system/self-healing.gateway-bridge | */10 * * * * | Gateway health check — coupled to Inngest event bridge |
| Self-healing Inngest runtime | system/self-healing.inngest-runtime | */10 * * * * | Monitors Inngest server itself |
| O11y triage | check/o11y-triage | */15 * * * * | Queries Inngest run data via GraphQL |
| Memory batch review | memory/batch-review | */30 * * * * | Tightly coupled to Inngest step patterns and event fan-out |
| Memory reflect | memory/reflect | 0 6 * * * | Triggered by both cron and memory/observations.accumulated event |
| Memory promote | memory/review-promote | 0 8 * * * | Multi-event trigger (approved/rejected + cron) |
| Email nag | email-nag | 0 17,22 * * * | Simple but low priority — not worth migrating early |
| Gateway behavior review | gateway/behavior.daily-review | 15 8 * * * | Reads gateway session data — keep near gateway infra |
| Weekly maintenance | system/memory-weekly-maintenance-summary | 0 13 * * 1 | Complex multi-step — migrate after nightly maintenance proves out |
| Knowledge watchdog | knowledge-watchdog | 0 */4 * * * | Monitors knowledge pipeline health — keep on Inngest until Tier 1 proves stable |
| NAS soak (2 jobs) | nas/soak-* | */30 * * * *, 15 16 * * * | NAS health monitoring — low priority |
| Docs maintenance (2 jobs) | docs-backlog-driver, docs-ingest-janitor | configurable | Complex doc pipeline orchestration |
Migration Order

  1. check-system-health — already a working Restate pipeline, zero new code needed
  2. skill-garden — pure shell commands, validates shell-handler DAG path
  3. typesense-sync — sequential shells, validates wave chains
  4. daily-digest — validates fan-in + infer handler path
  5. subscriptions — validates parallel HTTP probes at scale
  6. Then Tier 2 based on operational confidence
References

  • ADR-0207 — Restate execution layer (runner)
  • ADR-0133 — Contact enrichment pipeline (first Restate DAG workload)
  • ADR-0201 — Workflow runtime deployment alternatives (Restate adoption context)
  • ADR-0205 — Cloud-native agent execution (vision)
  • ADR-0156 — Graceful worker restart (Inngest stability issues that motivated this)