Redis, Dkron, Restate, and Sandboxes

ADR-0217 runtime architecture

ADR-0217 can get messy fast if you read it straight from the implementation edge.

There are queue pilots. There are Dkron jobs. There are Restate workflows. There is still some transitional Inngest surface hanging around. There are sandboxed runners that look one way today and will look another way soon.

That’s all real, but it hides the important split.

Here’s the clean version:

  • Redis is the pressure layer.
  • Dkron is the clock.
  • Restate is the durable executor.
  • Sandboxed runners are the side-effect boundary.

That’s the architecture.

Everything else is wiring.

Redis is where work waits

When work shows up from the gateway, a webhook, a CLI command, or another part of the system, Redis is where it lands first.

That queue layer handles the stuff that gets ugly in real systems:

  • bursts
  • priority
  • pause and resume
  • replay after failure
  • family-level control

If the system needs to hold a family for a minute because the downstream path is unhealthy, that belongs here.

If three things arrive at once and one of them matters more, that belongs here.

If a worker restarts and the backlog needs to be recovered truthfully, that belongs here too.

Redis is not trying to be the workflow engine. It’s the pressure valve.
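To make the pressure behaviors above concrete, here is a minimal in-memory sketch of priority ordering, family-level pause and resume, and backlog recovery after a worker restart. It is not a Redis client and not the actual queue implementation: the class name `PressureQueue`, its methods, and the family names in the usage lines are all assumptions made for illustration. In a real Redis-backed version the same roles would be played by Redis data structures rather than Python objects.

```python
import heapq
import itertools


class PressureQueue:
    """In-memory stand-in for the queue layer (hypothetical, not a Redis client)."""

    def __init__(self):
        self._heap = []                 # (priority, sequence, job_id, family)
        self._paused = set()            # families currently held back
        self._inflight = {}             # job_id -> entry, leased but not acked
        self._seq = itertools.count()   # tie-breaker: FIFO within a priority

    def enqueue(self, job_id, family, priority=0):
        # Lower number means more urgent.
        heapq.heappush(self._heap, (priority, next(self._seq), job_id, family))

    def pause(self, family):
        self._paused.add(family)

    def resume(self, family):
        self._paused.discard(family)

    def lease(self):
        """Hand out the most urgent job whose family is not paused, or None."""
        held, leased = [], None
        while self._heap:
            entry = heapq.heappop(self._heap)
            if entry[3] in self._paused:
                held.append(entry)      # skip paused work but keep it queued
                continue
            leased = entry
            break
        for entry in held:
            heapq.heappush(self._heap, entry)
        if leased is None:
            return None
        self._inflight[leased[2]] = leased
        return leased[2]

    def ack(self, job_id):
        """Mark a leased job as finished so it is never replayed."""
        self._inflight.pop(job_id, None)

    def recover(self):
        """After a worker restart, requeue every leased-but-unacked job."""
        recovered = list(self._inflight)
        for entry in self._inflight.values():
            heapq.heappush(self._heap, entry)
        self._inflight.clear()
        return recovered


# Usage with made-up families: hold "digest" while its downstream is unhealthy.
queue = PressureQueue()
queue.enqueue("digest-1", family="digest", priority=5)
queue.enqueue("deploy-1", family="deploy", priority=1)
queue.pause("digest")
next_job = queue.lease()   # only unpaused families are eligible
```

The point of the sketch is the shape of the contract: waiting, holding, and replaying are queue concerns, and nothing here knows what a job does once it is leased.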

Dkron is just the clock

Scheduled work is a different problem.

I don’t want the queue pretending to be cron, and I don’t want the workflow runner pretending to be a scheduler.

So Dkron gets one job: decide when something should start.

That’s it.

Hourly health checks, recurring maintenance, periodic sync work — Dkron starts the run.

What happens after that is not Dkron’s business.

That split matters because it keeps the clock separate from the runner. When the scheduler and the workflow engine are the same thing, changing when something runs means touching how it runs, and you inherit that coupling for no real gain.
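A sketch of what "Dkron only starts the run" can look like as a job definition. The function name `restate_cron_job`, the job name, and the target URL are placeholders I made up. The field names follow Dkron's job JSON and its `http` executor as I understand them, so check them against the Dkron docs for your version before relying on this.

```python
import json


def restate_cron_job(name, schedule, start_url):
    """Build a Dkron job whose only action is one HTTP call that starts a run.

    Hypothetical helper: the job carries no workflow logic, only the trigger.
    """
    return {
        "name": name,
        "schedule": schedule,              # e.g. "@every 1h"
        "executor": "http",
        "executor_config": {
            # Dkron's http executor takes string values for its options.
            "method": "POST",
            "url": start_url,              # placeholder, not a real endpoint
            "headers": json.dumps(["Content-Type: application/json"]),
            "body": json.dumps({"trigger": "dkron", "job": name}),
            "timeout": "30",
        },
    }


job = restate_cron_job(
    name="hourly-health-check",
    schedule="@every 1h",
    start_url="http://restate.example.invalid/HealthCheck/run",
)
# Registering it would be a POST of this JSON to the Dkron jobs API (/v1/jobs).
print(json.dumps(job, indent=2))
```

Everything about retries, steps, and progress is deliberately missing from the job. If the run fails halfway, that is the executor's problem to remember, not the clock's.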

Restate is where work becomes durable

Once work is admitted, Restate is where it becomes a real workflow.

This is where I want:

  • durable steps
  • retries with memory
  • DAGs and waves
  • explicit progress
  • workflow identity that survives a process restart

Redis tells the system what is waiting.

Restate tells the system what is running and what already happened.

That’s a much cleaner separation than trying to force one tool to do both jobs.
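The "retries with memory" idea is easier to see in code. This is a toy journal, not the Restate SDK: `Journal`, `release_workflow`, and the step names are assumptions for illustration, and the dictionary stands in for storage that would survive a process restart. What it shows is the general technique: a step that already completed under a given workflow identity is replayed from the record instead of being executed again.

```python
class Journal:
    """Toy durable log keyed by (workflow_id, step_name). Hypothetical."""

    def __init__(self):
        self._entries = {}

    def run(self, workflow_id, step_name, fn):
        key = (workflow_id, step_name)
        if key in self._entries:
            return self._entries[key]   # replay: do not execute the step again
        result = fn()                   # an exception here records nothing
        self._entries[key] = result
        return result


def release_workflow(journal, workflow_id, executed):
    """Three steps in order; 'verify' fails the first time it executes."""

    def step(name):
        def body():
            executed.append(name)
            if name == "verify" and executed.count("verify") == 1:
                raise RuntimeError("simulated crash during verify")
            return f"{name}:done"
        return journal.run(workflow_id, name, body)

    return [step("build"), step("verify"), step("publish")]


journal, executed = Journal(), []
try:
    release_workflow(journal, "release-42", executed)
except RuntimeError:
    pass  # the process "died"; the journal still remembers that build finished

# Retrying under the same workflow identity skips the completed build step.
results = release_workflow(journal, "release-42", executed)
```

After the retry, `build` has executed once and `verify` twice. That is the difference between a retry that remembers and a retry that starts over.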

Sandboxed runners are the side-effect boundary

The system can reason about work all day, but eventually some workloads need to touch a repo.

That’s where the sandboxed runner matters.

I don’t want autonomous code-changing work mutating the operator checkout directly. That path creates dirt, collisions, and lies.

So the execution boundary is:

  • materialize a clean repo at the requested base SHA
  • run the agent there
  • verify there
  • export an artifact there
  • keep promotion separate

That gives the system a clean place to do side-effect-heavy work without pretending shared state is fine when it obviously isn’t.
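The boundary steps above can be sketched with plain git commands. This is one possible shape under my own assumptions, not the actual runner: `sandbox_plan`, `run_sandbox`, and every argument are hypothetical names, and the agent and verify commands are whatever you pass in. The artifact here is a patch against the base SHA, which is only one way to export the result.

```python
import subprocess
import tempfile
from pathlib import Path


def sandbox_plan(repo_url, base_sha, workdir, agent_cmd, verify_cmd):
    """Return (argv, cwd) steps for one sandboxed run. Hypothetical helper."""
    git = ["git", "-C", workdir]
    return [
        # 1. Materialize a clean repo at the requested base SHA.
        (["git", "clone", "--no-checkout", repo_url, workdir], None),
        (git + ["checkout", "--detach", base_sha], None),
        # 2. Run the agent inside the clean checkout.
        (list(agent_cmd), workdir),
        # 3. Verify in the same place the change was made.
        (list(verify_cmd), workdir),
        # 4. Export: stage everything, then diff the index against the base.
        (git + ["add", "--all"], None),
        (git + ["diff", "--cached", "--binary", base_sha], None),
    ]


def run_sandbox(repo_url, base_sha, agent_cmd, verify_cmd, artifact_path):
    """Execute the plan in a throwaway directory and write the patch artifact.

    Any failing step raises CalledProcessError, so no artifact is written
    for a run that did not verify. Promotion is intentionally not done here.
    """
    with tempfile.TemporaryDirectory() as tmp:
        workdir = str(Path(tmp) / "checkout")
        patch = b""
        for argv, cwd in sandbox_plan(repo_url, base_sha, workdir,
                                      agent_cmd, verify_cmd):
            done = subprocess.run(argv, cwd=cwd, check=True, capture_output=True)
            patch = done.stdout         # the last step's stdout is the diff
        Path(artifact_path).write_bytes(patch)
    return artifact_path
```

Nothing in the plan pushes, merges, or touches the operator checkout. The run ends with a file, and deciding what to do with that file is a separate step.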

What’s real today

A lot of this is already earned.

Redis queue control is real.

Restate is already running the deterministic drainer and the DAG layer.

Dkron is already the scheduler for Restate cron starts.

The local sandbox runner is already proven.

The operator surfaces are finally getting honest too. joelclaw jobs status and the async runtime monitor now do a decent job of saying what the runtime is actually doing instead of dumping a bag of unrelated health checks in your lap.

What’s still transitional

Inngest is still around.

That’s fine.

The goal is not to perform some dramatic rewrite for sport. The goal is to get one architecture slice after another into a shape where it is obviously better, obviously more truthful, and obviously easier to operate.

So right now the system is in a dual-runtime phase.

That means some paths are already native to the new shape, and some still bounce through older surfaces while the substrate settles.

The important thing is to be honest about which is which.
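One way to keep that honesty mechanical is an explicit routing table. This is a sketch under my own assumptions: the family names, `ROUTES`, `runtime_for`, and `migration_report` are invented for illustration and do not describe the real configuration. The design choice worth noting is that an undeclared family raises instead of quietly falling back to either runtime.

```python
from enum import Enum


class Runtime(Enum):
    RESTATE = "restate"
    INNGEST = "inngest"


# Hypothetical families. Every family must be declared on exactly one side.
ROUTES = {
    "health-check": Runtime.RESTATE,
    "repo-agent": Runtime.RESTATE,
    "email-digest": Runtime.INNGEST,
}


def runtime_for(family, routes=ROUTES):
    """Return the declared runtime, refusing to guess for unknown families."""
    if family not in routes:
        raise LookupError(f"no runtime declared for family {family!r}")
    return routes[family]


def migration_report(routes=ROUTES):
    """Group families by runtime so the transitional state is visible."""
    report = {runtime.value: [] for runtime in Runtime}
    for family, runtime in sorted(routes.items()):
        report[runtime.value].append(family)
    return report


print(migration_report())
```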

Why I like this split

It gives each part one clear job.

Redis handles pressure.

Dkron handles time.

Restate handles durable execution.

Sandboxed runners handle side effects.

Once the system is carved up that way, it gets a lot easier to reason about what broke.

It also gets easier to build better operator surfaces because each one is reporting on a real boundary instead of a vague blob of “background work.”

That’s the whole trick.

What’s next

The next batch is where this stops being a nice diagram and becomes a real workload lane.

I want one slice that proves all of this together:

  • queued work enters through Redis
  • scheduled work enters through Dkron
  • Restate runs both durably
  • sandboxed execution handles code-changing side effects
  • the operator sees the same truth from one set of CLI and gateway surfaces

That’s a much better milestone than arguing about whether the architecture is conceptually elegant.

If the lane is real, the diagram earned its keep.

If it isn’t, it’s just a pretty box drawing.