Dkron Distributed Scheduler for Restate DAG Pipelines
Status
accepted
Context and Problem Statement
The joelclaw system currently uses self-hosted Inngest for both scheduling (cron triggers, event triggers) and durable execution (step functions, retries). Operational data from 20 days of slog entries reveals significant instability:
- 141 worker restarts (avg ~7/day)
- 110 stale/stuck RUNNING runs requiring manual GraphQL cancellation or sqlite DB surgery
- 61 function registry sync failures (worker running code Inngest server doesn’t know about)
- 46 disk pressure / crash / zombie events
- 9 “SDK URL unreachable” errors (Inngest server can’t reach worker, runs hang forever)
Root causes include: transport instability between k8s Inngest server and host worker (EOF/socket closed), sqlite journal state drift, Colima VM zombie states, and the tight coupling between Inngest’s scheduler, execution runtime, and function registry.
Restate (ADR-0207) now handles durable DAG execution reliably — the contact enrichment pipeline (ADR-0133) runs 11-node DAGs with 46 OTEL events per run, gateway notifications, and zero of the registry/transport issues that plague Inngest. But Restate has no built-in scheduling — it’s a runner, not a scheduler.
The system needs a dedicated scheduler that triggers Restate DAG runs on cron expressions, decoupled from the execution runtime.
Decision Drivers
- Operational stability — the scheduler must not share failure modes with the runner
- Minimal new infra — runs in existing k8s cluster, single pod
- API-first — jobs created/managed via REST API (agent-friendly, CLI-compatible)
- Dashboard — visual job management and execution history
- HTTP executor — triggers Restate via plain HTTP POST, zero SDK coupling
- Cron expressions — standard 5-field plus interval shortcuts (`@every 1h`)
- Persistence — job definitions survive restarts
- Single-node viable — HA is nice but not required for personal infra
Considered Options
1. Dkron (distributed cron service)
- Go binary, Raft consensus, embedded BuntDB storage
- Helm chart for k8s deployment
- Web UI dashboard with job management, execution history
- HTTP executor — jobs POST to any URL
- REST API for job CRUD
- ~20MB binary, minimal resource footprint
- Open source (LGPL-3.0), actively maintained
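For concreteness, a Dkron job with the HTTP executor is a small JSON document. A sketch in TypeScript (field names follow Dkron's job schema; the job name, target URL, and body are illustrative, not the deployed config):

```typescript
// Illustrative Dkron job using the HTTP executor. executor_config values
// are strings per Dkron's schema; headers is a JSON-encoded array.
const job = {
  name: "restate-health-check",            // hypothetical job name
  schedule: "0 7 * * * *",                 // six-field: second 0, minute 7, hourly
  executor: "http",
  executor_config: {
    method: "POST",
    url: "http://restate-ingress:8080/dagOrch/run", // assumed ingress address
    headers: JSON.stringify(["Content-Type: application/json"]),
    body: JSON.stringify({ pipeline: "system-health-check" }),
    expectCode: "200",                     // non-200 marks the run failed
  },
};
```

Registering it is a single `POST /v1/jobs` call against the Dkron REST API; Dkron upserts by job name.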
2. Kubernetes CronJobs
- Already available in our k8s cluster
- Standard YAML-based job definitions
- No dashboard, no REST API for job management
- Minute-level granularity only
- No retry intelligence beyond k8s restartPolicy
- Each job is a YAML manifest — operational overhead scales linearly
3. Restate self-scheduling (virtual objects with delayed sends)
- Zero new infra — uses Restate’s durable timers
- Jobs are code, not config — requires redeployment to change schedules
- No dashboard or management API
- State lives in Restate’s journal — couples scheduler and runner failure modes
- Cron parsing via npm library, not native
4. Cronicle (Node.js job scheduler)
- Web UI with multi-server support
- Heavier footprint (Node.js runtime)
- Less k8s-native than Dkron
- More features than needed (plugin system, categories, resource limits)
5. Keep Inngest for scheduling, Restate for execution
- Preserves existing cron triggers
- Still has all Inngest operational issues for the scheduling layer
- Couples two runtimes — Inngest event → Inngest function → Restate HTTP call
- Adds latency and failure points
Decision
Use Dkron as the dedicated scheduler for Restate DAG pipelines.
The architecture becomes:
```
┌─────────┐    HTTP POST      ┌─────────┐   durable execution    ┌───────────┐
│  Dkron  │ ────────────────→ │ Restate │ ─────────────────────→ │ dagWorker │
│ (cron)  │   /dagOrch/run    │ ingress │   waves + handlers     │ (shell,   │
│         │                   │         │                        │  http,    │
│ Web UI  │                   │ k8s:8080│                        │  infer)   │
└─────────┘                   └─────────┘                        └───────────┘
     ↑                             │
     │ REST API                    │ OTEL + gateway
     │ joelclaw restate cron ...   ↓   notification
```

Scheduler (Dkron) decides when. Runner (Restate) decides how. Clean separation. No shared failure modes.
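The Dkron → Restate hop is one plain HTTP call. A minimal sketch of the request the HTTP executor effectively sends (Restate ingress invokes a handler via `POST /<service>/<handler>`; the host and payload shape here are assumptions, only the `/dagOrch/run` path comes from the architecture above):

```typescript
// Build the trigger request Dkron sends each time a schedule fires.
// Everything below is illustrative: the ingress hostname and the
// payload fields are assumed, not verified config.
function buildTriggerRequest(pipeline: string) {
  return {
    url: "http://restate-ingress.k8s:8080/dagOrch/run",
    method: "POST" as const,
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ pipeline, triggeredBy: "dkron" }),
  };
}

// Usage (network call elided):
//   const req = buildTriggerRequest("system-health-check");
//   await fetch(req.url, req);
```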
Current State
Phase 1 shipped (2026-03-06 vector)
- `k8s/dkron.yaml` deployed: `StatefulSet/dkron`, headless peer service `dkron-peer`, API service `dkron-svc`
- Dkron runs as ClusterIP-only for now; operator access goes through short-lived CLI-managed tunnels, not a permanent host port mapping
- `joelclaw restate cron` command group shipped: `status`, `list`, `enable-health`, `sync-tier1`, `delete`
- Tier-1 scheduler ownership migrated off Inngest cron and onto Dkron:
  - `restate-health-check` ← `check/system-health-signals-schedule`
  - `restate-skill-garden` ← `skill-garden`
  - `restate-typesense-full-sync` ← `typesense/full-sync`
  - `restate-daily-digest` ← `memory/digest-daily`
  - `restate-subscription-check-feeds` ← `subscription/check-feeds`
- The health job is a native Restate DAG.
- The other tier-1 jobs run through Restate shell nodes that execute host-side direct task runners (`scripts/restate/run-tier1-task.ts`), so Dkron success means real work executed, not just event dispatch.
- The corresponding Inngest functions keep manual/on-demand event triggers where useful, but their recurring cron triggers were removed.
- `joelclaw restate cron list` now surfaces `migratedFrom`, `successCount`, `errorCount`, `lastSuccess`, and `lastError` for soak monitoring.
- Dkron cron expressions are six-field (`sec min hour dom month dow`), so hourly-at-minute-7 is `0 7 * * * *`
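Because Dkron prepends a seconds field, migrating a standard 5-field Inngest cron is a mechanical prefix. A small helper sketch (the function name is ours, not part of any shipped CLI):

```typescript
// Convert a standard 5-field cron (min hour dom month dow) to Dkron's
// six-field form (sec min hour dom month dow) by pinning seconds to 0.
// Interval shortcuts like "@every 1h" pass through unchanged.
function toDkronSchedule(fiveField: string): string {
  if (fiveField.startsWith("@")) return fiveField;
  const fields = fiveField.trim().split(/\s+/);
  if (fields.length !== 5) {
    throw new Error(`expected 5 cron fields, got ${fields.length}`);
  }
  return `0 ${fields.join(" ")}`;
}

// Example: Inngest's hourly-at-minute-7 "7 * * * *" becomes "0 7 * * * *".
```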
Open follow-up
- Monitor tier-1 soak before touching tier-2 candidates; scheduler green is necessary but not sufficient — underlying OTEL evidence still has to stay clean.
- Dashboard still lacks a stable host/Tailscale exposure path
- Upstream `dkron/dkron:latest` currently needs root to write the local-path PVC; non-root hardening caused `permission denied` under `/data/raft/snapshots/permTest`
Deployment Plan
Phase 1: Deploy Dkron to k8s (single-node)
- StatefulSet manifest in `k8s/dkron.yaml`
- Headless peer service `dkron-peer` + ClusterIP API service `dkron-svc`
- PVC for BuntDB persistence
- Operator access via `joelclaw restate cron ...` using a short-lived tunnel
Phase 2: Wire to Restate
- Configure HTTP executor jobs that POST to Restate ingress
- Seed initial jobs:
  - `enrich-vip-contacts` — weekly re-enrichment of VIP contacts
  - `system-health-check` — hourly health pipeline
  - `memory-maintenance` — daily memory/observation pipeline
- OTEL emission on job trigger (Dkron → OTEL endpoint)
Phase 3: CLI integration
- `joelclaw restate cron list` — list Dkron jobs via REST API
- `joelclaw restate cron create` — create job with cron expression + pipeline
- `joelclaw restate cron delete` — remove job
- `joelclaw restate cron status` — dashboard URL + job summary
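Under the hood, `cron list` is a thin wrapper over Dkron's `GET /v1/jobs`. A sketch of the projection (the `success_count`/`error_count`/`last_success`/`last_error` fields follow Dkron's job schema; storing `migratedFrom` in job metadata is this system's own convention):

```typescript
// Shape of the Dkron job fields we consume (subset of the full schema).
interface DkronJob {
  name: string;
  schedule: string;
  success_count: number;
  error_count: number;
  last_success: string | null;
  last_error: string | null;
  metadata?: Record<string, string>;
}

// Project Dkron jobs into the soak-monitoring summary the CLI prints.
function summarize(jobs: DkronJob[]) {
  return jobs.map((j) => ({
    name: j.name,
    migratedFrom: j.metadata?.migratedFrom ?? null,
    successCount: j.success_count,
    errorCount: j.error_count,
    healthy: j.error_count === 0 && j.last_success !== null,
  }));
}

// Usage (service name assumed):
//   const jobs = await (await fetch("http://dkron-svc:8080/v1/jobs")).json();
//   console.table(summarize(jobs));
```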
Phase 4: Inngest scheduling migration (selective)
- Identify Inngest cron-triggered functions that can move to Dkron → Restate
- Migrate incrementally — Inngest keeps event-triggered functions
- Track which cron triggers have been moved
Consequences
Positive
- Decoupled failure domains — scheduler crash doesn’t break running DAGs, runner crash doesn’t lose schedules
- No function registry sync — Dkron doesn’t need to know about handler code
- HTTP-only coupling — Dkron just POSTs JSON, doesn’t care what runs it
- Visual dashboard — first time we have a UI for scheduled job management
- Agent-friendly — REST API for job CRUD, CLI wrapper for operator access
Negative
- Another service to run — one more pod in k8s (though lightweight ~20MB)
- No event triggers — Dkron is cron-only, Inngest’s event fan-out stays on Inngest
- Auth gap — Dkron Pro has auth, open source doesn’t (mitigated: ClusterIP, no external exposure)
- Known Dkron bug — leader failover can double-fire jobs (#1569) — acceptable for single-node, need idempotent pipelines regardless
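The double-fire caveat is why triggers should carry a deterministic idempotency key, so a duplicate fire collapses into one Restate invocation. A sketch (Restate's ingress deduplicates calls sharing an `idempotency-key` header; the key derivation below is a hypothetical convention, not shipped code):

```typescript
// Derive a deterministic key from job name + schedule slot. Two fires
// of the same tick (e.g. after a Dkron leader failover) produce the
// same key, so the second invocation is treated as a duplicate.
function idempotencyKey(jobName: string, firedAt: Date): string {
  const slot = firedAt.toISOString().slice(0, 16); // minute precision
  return `${jobName}:${slot}`;
}

// Usage: send as the "idempotency-key" header on the POST to Restate ingress.
```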
Migration Candidates
31 cron-triggered Inngest functions identified. Tiered by migration fit:
Tier 1 — First movers (already fit DAG shape, low risk)
| Function | Inngest ID | Cron | Migration notes |
|---|---|---|---|
| System health check | check/system-health-signals-schedule | 7 * * * * | Already a working Restate health pipeline — just needs Dkron trigger. Zero new code. |
| Skill garden | skill-garden | 0 6 * * * | Pure shell checks (broken symlinks, frontmatter, stale patterns). Natural shell-node DAG. |
| Typesense full sync | typesense/full-sync | 0 11 * * * | Sequential shell steps (index vault notes, index slack messages, index blog). Simple wave chain. |
| Daily digest | memory/digest-daily | 55 7 * * * | Gather day’s slog/OTEL/memory data → LLM synthesis → write vault note. Classic fan-in to infer. |
| Feed subscriptions | subscription/check-feeds | 0 * * * * | HTTP fetches per feed → diff → notify on changes. Parallel probes like contact enrichment. |
Tier 2 — Good candidates, more complex
| Function | Inngest ID | Cron | Migration notes |
|---|---|---|---|
| Nightly memory maintenance | system/memory-nightly-maintenance | 0 10 * * * | Multi-step memory pipeline (prune, compact, reconcile). High value — currently flaky on Inngest. |
| NAS backup (5 jobs) | system/backup-*, system/rotate-* | various daily/weekly/monthly | Shell commands (rsync, rotate). Simple but touches NAS infra — test carefully. |
| Friction analysis | memory/friction-analysis | 0 7 * * * | Gather friction signals from OTEL → LLM analysis → vault write. Fan-in pattern. |
| ADR evidence capture | system/memory-adr-evidence-capture | 15 13 * * * | Searches sessions for ADR-relevant observations → writes evidence. Shell + infer. |
| Granola check | granola-check-cron | 7 * * * * | Polls Granola for new meetings. Single HTTP check + conditional processing. |
| Content sync | content-sync | 0 * * * * | Vault → website content sync. Sequential steps, idempotent. |
| Channel intelligence | channel/intelligence.garden | 0 */6 * * * | Slack channel analysis. Multi-source probe + synthesis. |
Tier 3 — Keep on Inngest (monitors Inngest itself or tightly coupled)
| Function | Inngest ID | Cron | Reason to keep |
|---|---|---|---|
| System heartbeat | system-heartbeat | */15 * * * * | Monitors Inngest health — can’t move the watchdog off the thing it watches |
| Self-healing investigator | system/self-healing.investigator | */10 * * * * | Monitors and heals Inngest/worker/gateway — coupled to Inngest internals |
| Self-healing gateway bridge | system/self-healing.gateway-bridge | */10 * * * * | Gateway health check — coupled to Inngest event bridge |
| Self-healing Inngest runtime | system/self-healing.inngest-runtime | */10 * * * * | Monitors Inngest server itself |
| O11y triage | check/o11y-triage | */15 * * * * | Queries Inngest run data via GraphQL |
| Memory batch review | memory/batch-review | */30 * * * * | Tightly coupled to Inngest step patterns and event fan-out |
| Memory reflect | memory/reflect | 0 6 * * * | Triggered by both cron and memory/observations.accumulated event |
| Memory promote | memory/review-promote | 0 8 * * * | Multi-event trigger (approved/rejected + cron) |
| Email nag | email-nag | 0 17,22 * * * | Simple but low priority — not worth migrating early |
| Gateway behavior review | gateway/behavior.daily-review | 15 8 * * * | Reads gateway session data — keep near gateway infra |
| Weekly maintenance | system/memory-weekly-maintenance-summary | 0 13 * * 1 | Complex multi-step — migrate after nightly maintenance proves out |
| Knowledge watchdog | knowledge-watchdog | 0 */4 * * * | Monitors knowledge pipeline health — keep on Inngest until Tier 1 proves stable |
| NAS soak (2 jobs) | nas/soak-* | */30 * * * *, 15 16 * * * | NAS health monitoring — low priority |
| Docs maintenance (2 jobs) | docs-backlog-driver, docs-ingest-janitor | configurable | Complex doc pipeline orchestration |
Recommended migration order
- check-system-health — already a working Restate pipeline, zero new code needed
- skill-garden — pure shell commands, validates shell-handler DAG path
- typesense-sync — sequential shells, validates wave chains
- daily-digest — validates fan-in + infer handler path
- subscriptions — validates parallel HTTP probes at scale
- Then Tier 2 based on operational confidence
Related ADRs
- ADR-0207 — Restate execution layer (runner)
- ADR-0133 — Contact enrichment pipeline (first Restate DAG workload)
- ADR-0201 — Workflow runtime deployment alternatives (Restate adoption context)
- ADR-0205 — Cloud-native agent execution (vision)
- ADR-0156 — Graceful worker restart (Inngest stability issues that motivated this)