ADR-0216

Dkron Distributed Scheduler for Restate DAG Pipelines

Status

accepted

Context and Problem Statement

The joelclaw system currently uses self-hosted Inngest for both scheduling (cron triggers, event triggers) and durable execution (step functions, retries). Operational data from 20 days of slog entries reveals significant instability:

  • 141 worker restarts (avg ~7/day)
  • 110 stale/stuck RUNNING runs requiring manual GraphQL cancellation or sqlite DB surgery
  • 61 function registry sync failures (worker running code Inngest server doesn’t know about)
  • 46 disk pressure / crash / zombie events
  • 9 “SDK URL unreachable” errors (Inngest server can’t reach worker, runs hang forever)

Root causes include: transport instability between k8s Inngest server and host worker (EOF/socket closed), sqlite journal state drift, Colima VM zombie states, and the tight coupling between Inngest’s scheduler, execution runtime, and function registry.

Restate (ADR-0207) now handles durable DAG execution reliably — the contact enrichment pipeline (ADR-0133) runs 11-node DAGs with 46 OTEL events per run, gateway notifications, and zero of the registry/transport issues that plague Inngest. But Restate has no built-in scheduling — it’s a runner, not a scheduler.

The system needs a dedicated scheduler that triggers Restate DAG runs on cron expressions, decoupled from the execution runtime.

Decision Drivers

  • Operational stability — the scheduler must not share failure modes with the runner
  • Minimal new infra — runs in existing k8s cluster, single pod
  • API-first — jobs created/managed via REST API (agent-friendly, CLI-compatible)
  • Dashboard — visual job management and execution history
  • HTTP executor — triggers Restate via plain HTTP POST, zero SDK coupling
  • Cron expressions — standard 5-field plus interval shortcuts (@every 1h)
  • Persistence — job definitions survive restarts
  • Single-node viable — HA is nice but not required for personal infra

Considered Options

1. Dkron (distributed cron service)

  • Go binary, Raft consensus, embedded BuntDB storage
  • Helm chart for k8s deployment
  • Web UI dashboard with job management, execution history
  • HTTP executor — jobs POST to any URL
  • REST API for job CRUD
  • ~20MB binary, minimal resource footprint
  • Open source (LGPL-3.0), actively maintained

2. Kubernetes CronJobs

  • Already available in our k8s cluster
  • Standard YAML-based job definitions
  • No dashboard, no REST API for job management
  • Minute-level granularity only
  • No retry intelligence beyond k8s restartPolicy
  • Each job is a YAML manifest — operational overhead scales linearly

3. Restate self-scheduling (virtual objects with delayed sends)

  • Zero new infra — uses Restate’s durable timers
  • Jobs are code, not config — requires redeployment to change schedules
  • No dashboard or management API
  • State lives in Restate’s journal — couples scheduler and runner failure modes
  • Cron parsing via npm library, not native

4. Cronicle (Node.js job scheduler)

  • Web UI with multi-server support
  • Heavier footprint (Node.js runtime)
  • Less k8s-native than Dkron
  • More features than needed (plugin system, categories, resource limits)

5. Keep Inngest for scheduling, Restate for execution

  • Preserves existing cron triggers
  • Still has all Inngest operational issues for the scheduling layer
  • Couples two runtimes — Inngest event → Inngest function → Restate HTTP call
  • Adds latency and failure points

Decision

Use Dkron as the dedicated scheduler for Restate DAG pipelines.

The architecture becomes:

┌─────────┐    HTTP POST     ┌─────────┐    durable execution    ┌───────────┐
│  Dkron  │ ──────────────→  │ Restate │ ─────────────────────→  │ dagWorker │
│ (cron)  │   /dagOrch/run   │ ingress │   waves + handlers      │ (shell,   │
│         │                  │         │                          │  http,    │
│ Web UI  │                  │ k8s:8080│                          │  infer)   │
└─────────┘                  └─────────┘                          └───────────┘
     ↑                                                                  │
     │ REST API                                              OTEL + gateway
     │ joelclaw restate cron ...                              notification

Scheduler (Dkron) decides when. Runner (Restate) decides how. Clean separation. No shared failure modes.
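As a concrete sketch, the hourly health job could be defined as a Dkron HTTP-executor job like the one below, created via Dkron's REST API (POST /v1/jobs). The ingress URL path and request body here are illustrative (taken loosely from the diagram above), and the executor_config field names follow Dkron's HTTP executor as documented, which expects all values as strings:

```json
{
  "name": "restate-health-check",
  "schedule": "0 7 * * * *",
  "executor": "http",
  "executor_config": {
    "method": "POST",
    "url": "http://restate.restate.svc.cluster.local:8080/dagOrch/run",
    "body": "{\"pipeline\": \"system-health-check\"}",
    "headers": "[\"Content-Type: application/json\"]",
    "timeout": "30",
    "expectCode": "200"
  }
}
```

The schedule is six-field (seconds first), matching Dkron's cron format; expectCode makes Dkron treat anything other than a 200 from the Restate ingress as a job failure, which is what keeps scheduler-side success honest.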

Current State

Phase 1 shipped (2026-03-06 vector)

  • k8s/dkron.yaml deployed: StatefulSet/dkron, headless peer service dkron-peer, API service dkron-svc
  • Dkron runs as ClusterIP-only for now; operator access goes through short-lived CLI-managed tunnels, not a permanent host port mapping
  • joelclaw restate cron command group shipped:
    • status
    • list
    • enable-health
    • sync-tier1
    • delete
  • Tier-1 scheduler ownership migrated off Inngest cron and onto Dkron:
    • restate-health-check → check/system-health-signals-schedule
    • restate-skill-garden → skill-garden
    • restate-typesense-full-sync → typesense/full-sync
    • restate-daily-digest → memory/digest-daily
    • restate-subscription-check-feeds → subscription/check-feeds
  • The health job is a native Restate DAG.
  • The other tier-1 jobs run through Restate shell nodes that execute host-side direct task runners (scripts/restate/run-tier1-task.ts) so Dkron success means real work executed, not just event dispatch.
  • The corresponding Inngest functions keep manual/on-demand event triggers where useful, but their recurring cron triggers were removed.
  • joelclaw restate cron list now surfaces migratedFrom, successCount, errorCount, lastSuccess, and lastError for soak monitoring.
  • Dkron cron expressions are six-field (sec min hour dom month dow), so hourly-at-minute-7 is 0 7 * * * *, not the five-field 7 * * * *
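Because Dkron prepends a seconds field, migrating Inngest's five-field crons means prefixing a 0. A sync-tier1-style helper (hypothetical name; not the actual CLI code) could do the conversion defensively:

```typescript
// Convert a standard 5-field cron (min hour dom month dow) to Dkron's
// 6-field form (sec min hour dom month dow) by prefixing a 0-second field.
// Passes through expressions that are already 6-field or use @shortcuts.
function toDkronSchedule(expr: string): string {
  const trimmed = expr.trim();
  if (trimmed.startsWith("@")) return trimmed; // @every 1h, @daily, ...
  const fields = trimmed.split(/\s+/);
  if (fields.length === 6) return fields.join(" ");
  if (fields.length === 5) return ["0", ...fields].join(" ");
  throw new Error(`unexpected cron expression: ${expr}`);
}
```

So the health check's Inngest cron 7 * * * * becomes 0 7 * * * *, and interval shortcuts like @every 1h pass through untouched.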

Open follow-up

  • Monitor tier-1 soak before touching tier-2 candidates; scheduler green is necessary but not sufficient — underlying OTEL evidence still has to stay clean.
  • Dashboard still lacks a stable host/Tailscale exposure path
  • Upstream dkron/dkron:latest currently needs root to write the local-path PVC; non-root hardening caused permission denied under /data/raft/snapshots/permTest

Deployment Plan

Phase 1: Deploy Dkron to k8s (single-node)

  • StatefulSet manifest in k8s/dkron.yaml
  • Headless peer service dkron-peer + ClusterIP API service dkron-svc
  • PVC for BuntDB persistence
  • Operator access via joelclaw restate cron ... using a short-lived tunnel

Phase 2: Wire to Restate

  • Configure HTTP executor jobs that POST to Restate ingress
  • Seed initial jobs:
    • enrich-vip-contacts — weekly re-enrichment of VIP contacts
    • system-health-check — hourly health pipeline
    • memory-maintenance — daily memory/observation pipeline
  • OTEL emission on job trigger (Dkron → OTEL endpoint)

Phase 3: CLI integration

  • joelclaw restate cron list — list Dkron jobs via REST API
  • joelclaw restate cron create — create job with cron expression + pipeline
  • joelclaw restate cron delete — remove job
  • joelclaw restate cron status — dashboard URL + job summary

Phase 4: Inngest scheduling migration (selective)

  • Identify Inngest cron-triggered functions that can move to Dkron → Restate
  • Migrate incrementally — Inngest keeps event-triggered functions
  • Track which cron triggers have been moved
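One way to track which crons have moved is to tag each Dkron job with the Inngest function it replaced and summarize from the job list. This is a sketch: the job fields below (metadata, success_count, error_count) are assumed from Dkron's GET /v1/jobs response shape, and migratedFrom mirrors the metadata field the cron list command surfaces:

```typescript
// Summarize migration progress from Dkron's job list. Jobs carrying a
// migratedFrom metadata key are counted as moved off Inngest; any moved
// job with recorded errors is flagged for soak attention.
interface DkronJob {
  name: string;
  schedule: string;
  metadata?: { migratedFrom?: string };
  success_count?: number;
  error_count?: number;
}

function migrationSummary(jobs: DkronJob[]): { migrated: string[]; flaky: string[] } {
  const moved = jobs.filter((j) => j.metadata?.migratedFrom);
  return {
    migrated: moved.map((j) => j.metadata!.migratedFrom!),
    flaky: moved.filter((j) => (j.error_count ?? 0) > 0).map((j) => j.name),
  };
}
```

This keeps the migration ledger in Dkron itself rather than a separate tracking file: the scheduler's job list is the source of truth for what has moved.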

Consequences

Positive

  • Decoupled failure domains — scheduler crash doesn’t break running DAGs, runner crash doesn’t lose schedules
  • No function registry sync — Dkron doesn’t need to know about handler code
  • HTTP-only coupling — Dkron just POSTs JSON, doesn’t care what runs it
  • Visual dashboard — first time we have a UI for scheduled job management
  • Agent-friendly — REST API for job CRUD, CLI wrapper for operator access

Negative

  • Another service to run — one more pod in k8s (though lightweight ~20MB)
  • No event triggers — Dkron is cron-only, Inngest’s event fan-out stays on Inngest
  • Auth gap — Dkron Pro has auth, open source doesn’t (mitigated: ClusterIP, no external exposure)
  • Known Dkron bug — leader failover can double-fire jobs (#1569) — acceptable for single-node, need idempotent pipelines regardless
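Given the double-fire bug, handlers should dedupe on the scheduled fire slot rather than trust the trigger count. A minimal in-memory guard is sketched below; a durable variant would key Restate ingress calls on the same slot (e.g. via an idempotency key) so deduplication survives restarts:

```typescript
// Dedupe duplicate cron fires: two triggers for the same job within the
// same schedule slot collapse to one execution. In-memory only — a process
// restart forgets seen keys, so downstream work must still be idempotent.
const seen = new Set<string>();

function shouldRun(jobName: string, firedAt: Date, slotMs = 60_000): boolean {
  // Round the fire time down to its schedule slot (default: one minute).
  const slot = Math.floor(firedAt.getTime() / slotMs) * slotMs;
  const key = `${jobName}@${slot}`;
  if (seen.has(key)) return false;
  seen.add(key);
  return true;
}
```

With this guard, a leader-failover double-fire landing 1–2 seconds after the first trigger is silently dropped, while the next legitimate slot runs normally.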

Migration Candidates

31 cron-triggered Inngest functions identified. Tiered by migration fit:

Tier 1 — First movers (already fit DAG shape, low risk)

| Function | Inngest ID | Cron | Migration notes |
|---|---|---|---|
| System health check | check/system-health-signals-schedule | 7 * * * * | Already a working Restate health pipeline — just needs Dkron trigger. Zero new code. |
| Skill garden | skill-garden | 0 6 * * * | Pure shell checks (broken symlinks, frontmatter, stale patterns). Natural shell-node DAG. |
| Typesense full sync | typesense/full-sync | 0 11 * * * | Sequential shell steps (index vault notes, index slack messages, index blog). Simple wave chain. |
| Daily digest | memory/digest-daily | 55 7 * * * | Gather day’s slog/OTEL/memory data → LLM synthesis → write vault note. Classic fan-in to infer. |
| Feed subscriptions | subscription/check-feeds | 0 * * * * | HTTP fetches per feed → diff → notify on changes. Parallel probes like contact enrichment. |

Tier 2 — Good candidates, more complex

| Function | Inngest ID | Cron | Migration notes |
|---|---|---|---|
| Nightly memory maintenance | system/memory-nightly-maintenance | 0 10 * * * | Multi-step memory pipeline (prune, compact, reconcile). High value — currently flaky on Inngest. |
| NAS backup (5 jobs) | system/backup-*, system/rotate-* | various daily/weekly/monthly | Shell commands (rsync, rotate). Simple but touches NAS infra — test carefully. |
| Friction analysis | memory/friction-analysis | 0 7 * * * | Gather friction signals from OTEL → LLM analysis → vault write. Fan-in pattern. |
| ADR evidence capture | system/memory-adr-evidence-capture | 15 13 * * * | Searches sessions for ADR-relevant observations → writes evidence. Shell + infer. |
| Granola check | granola-check-cron | 7 * * * * | Polls Granola for new meetings. Single HTTP check + conditional processing. |
| Content sync | content-sync | 0 * * * * | Vault → website content sync. Sequential steps, idempotent. |
| Channel intelligence | channel/intelligence.garden | 0 */6 * * * | Slack channel analysis. Multi-source probe + synthesis. |

Tier 3 — Keep on Inngest (monitors Inngest itself or tightly coupled)

| Function | Inngest ID | Cron | Reason to keep |
|---|---|---|---|
| System heartbeat | system-heartbeat | */15 * * * * | Monitors Inngest health — can’t move the watchdog off the thing it watches |
| Self-healing investigator | system/self-healing.investigator | */10 * * * * | Monitors and heals Inngest/worker/gateway — coupled to Inngest internals |
| Self-healing gateway bridge | system/self-healing.gateway-bridge | */10 * * * * | Gateway health check — coupled to Inngest event bridge |
| Self-healing Inngest runtime | system/self-healing.inngest-runtime | */10 * * * * | Monitors Inngest server itself |
| O11y triage | check/o11y-triage | */15 * * * * | Queries Inngest run data via GraphQL |
| Memory batch review | memory/batch-review | */30 * * * * | Tightly coupled to Inngest step patterns and event fan-out |
| Memory reflect | memory/reflect | 0 6 * * * | Triggered by both cron and memory/observations.accumulated event |
| Memory promote | memory/review-promote | 0 8 * * * | Multi-event trigger (approved/rejected + cron) |
| Email nag | email-nag | 0 17,22 * * * | Simple but low priority — not worth migrating early |
| Gateway behavior review | gateway/behavior.daily-review | 15 8 * * * | Reads gateway session data — keep near gateway infra |
| Weekly maintenance | system/memory-weekly-maintenance-summary | 0 13 * * 1 | Complex multi-step — migrate after nightly maintenance proves out |
| Knowledge watchdog | knowledge-watchdog | 0 */4 * * * | Monitors knowledge pipeline health — keep on Inngest until Tier 1 proves stable |
| NAS soak (2 jobs) | nas/soak-* | */30 * * * *, 15 16 * * * | NAS health monitoring — low priority |
| Docs maintenance (2 jobs) | docs-backlog-driver, docs-ingest-janitor | configurable | Complex doc pipeline orchestration |
Migration Order

  1. check-system-health — already a working Restate pipeline, zero new code needed
  2. skill-garden — pure shell commands, validates shell-handler DAG path
  3. typesense-sync — sequential shells, validates wave chains
  4. daily-digest — validates fan-in + infer handler path
  5. subscriptions — validates parallel HTTP probes at scale
  6. Then Tier 2 based on operational confidence
References

  • ADR-0207 — Restate execution layer (runner)
  • ADR-0133 — Contact enrichment pipeline (first Restate DAG workload)
  • ADR-0201 — Workflow runtime deployment alternatives (Restate adoption context)
  • ADR-0205 — Cloud-native agent execution (vision)
  • ADR-0156 — Graceful worker restart (Inngest stability issues that motivated this)