ADR-0244 (accepted)

Durable Colima recovery verification

Context

Recent Colima incidents on Panda showed the same useless pattern:

  1. the healer force-cycled Colima,
  2. limactl and forwarded ports briefly looked healthy,
  3. the hostagent later withdrew the control path,
  4. Docker, SSH, and localhost service health fell over again.

That is not recovery. That is theatre.

The system priority is durability and boring stability, not autonomous recovery choreography that produces a short green flash and then collapses.

ADR-0241 already moved Colima force-cycles behind confirmation gates. ADR-0242 added proof capture so recovery stops destroying evidence. The missing rule is what counts as success after a restart, and what to do when the system flaps again immediately afterward.

Decision

1. A restart is not recovery

A Colima restart, start, or force-cycle does not count as recovery by itself.

Recovery only counts after a post-restart verification window stays healthy long enough to prove that the control path is actually stable.

2. Recovery verification must check the real control path

The verification window must repeatedly confirm all of these:

  • Colima SSH path is usable
  • Docker socket is usable
  • Kubernetes API is reachable
  • Typesense answers on localhost
  • Inngest answers on localhost

If any of those regress during the verification window, the restart is classified as a failed recovery.
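The five checks above can be sketched as a single repeated probe plus a verification window. This is a sketch only: the function names, probe commands, and localhost ports (8108 for Typesense, 8288 for Inngest) are assumptions, not the healer's actual contract.

```shell
# Sketch only: probe commands and ports are assumptions, not the real healer.
check_control_path() {
  colima ssh -- true        >/dev/null 2>&1 || return 1  # Colima SSH path usable
  docker info               >/dev/null 2>&1 || return 1  # Docker socket usable
  kubectl get --raw /readyz >/dev/null 2>&1 || return 1  # Kubernetes API reachable
  curl -fsS http://localhost:8108/health >/dev/null 2>&1 || return 1  # Typesense (assumed port)
  curl -fsS http://localhost:8288/       >/dev/null 2>&1 || return 1  # Inngest (assumed port)
}

# Repeat the probe for the whole window; any single regression fails recovery.
verify_window() {
  local window_secs="$1" interval_secs="$2" deadline
  deadline=$(( $(date +%s) + window_secs ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    check_control_path || { echo "failed-recovery"; return 1; }
    sleep "$interval_secs"
  done
  echo "recovery-verified"
}
```

The point is that one green probe immediately after restart proves nothing; only the full window passing does.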

3. Failed recovery blocks repeated force-cycles

When a restart regresses during the verification window, the healer must persist a failed-recovery marker and refuse to keep force-cycling Colima for a hold period.

The right response is:

  • preserve evidence,
  • mark the substrate degraded,
  • stop pretending more cycling is progress.
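A minimal sketch of the persisted marker and the hold check; the state-file path, marker name, and one-hour hold are illustrative defaults, not the healer's real state layout.

```shell
# Illustrative defaults; the real healer owns the actual state-file path.
STATE_DIR="${STATE_DIR:-/tmp/healer-state}"
MARKER="$STATE_DIR/colima-failed-recovery"
HOLD_SECS="${HOLD_SECS:-3600}"   # refuse force-cycles for one hour after a failed recovery

record_failed_recovery() {
  mkdir -p "$STATE_DIR"
  date +%s > "$MARKER"           # persist when the verification window regressed
}

force_cycle_allowed() {
  [ -f "$MARKER" ] || return 0   # no recorded failure: cycling is allowed
  local failed_at now
  failed_at="$(cat "$MARKER")"
  now="$(date +%s)"
  # Inside the hold period: refuse to keep force-cycling Colima.
  [ $(( now - failed_at )) -ge "$HOLD_SECS" ]
}
```

Because the marker is a persisted timestamp rather than in-memory state, the hold survives healer restarts, which is exactly the case where blind re-cycling is most tempting.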

4. Convergence and durability are different checks

The existing warmup/convergence gate still matters for flannel, pods, image pulls, RBAC, and dependent services.

But convergence is only the first gate.

The new durability gate sits after convergence and proves that the host control path stays up once the quick boot noise is over.
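The two-gate ordering can be sketched as follows; every helper name here is hypothetical, and only the sequencing is the point.

```shell
# Hypothetical helper names; only the gate ordering is the point of this sketch.
heal_and_verify() {
  restart_colima       || return 1                              # the force-cycle itself
  wait_for_convergence || return 1                              # gate 1: flannel, pods, image pulls, RBAC
  verify_durability    || { record_failed_recovery; return 1; } # gate 2: control path stays up
  echo "recovered"
}
```

A restart that passes gate 1 but regresses during gate 2 is recorded as a failed recovery, not as a success that later "mysteriously" broke.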

5. Evidence-first remains mandatory

Every failed recovery and post-restart regression must emit proof artifacts and OTEL so the system can distinguish:

  • stale usernet recovery,
  • hostagent/SSH control-path collapse,
  • guest runtime failure,
  • healer-induced churn.
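One way the failure modes above could be tagged in a proof artifact; the JSON shape, output path, and tag names are illustrative, not the real proof schema.

```shell
# Illustrative proof snapshot; the real schema lives with the proof tooling.
emit_failure_proof() {
  local hypothesis="$1"   # e.g. stale-usernet, hostagent-collapse, guest-runtime, healer-churn
  local dir="${PROOF_DIR:-/tmp/proofs}"
  local out="$dir/failed-recovery-$(date +%s).json"
  mkdir -p "$dir"
  printf '{"event":"failed-recovery","hypothesis":"%s","ts":%s}\n' \
    "$hypothesis" "$(date +%s)" > "$out"
  echo "$out"             # callers can attach this path to OTEL events
}
```

Tagging each failed recovery with a hypothesis is what lets later analysis separate healer-induced churn from genuine hostagent collapse.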

6. Simplicity beats healer cleverness

If this pattern repeats, the system should bias toward substrate simplification rather than ever-more-clever healing logic.

This ADR does not flatten the stack by itself, but it makes the current stack honest about whether it is actually recovering.

Consequences

Positive

  • Short-lived green states no longer count as success.
  • Repeated VM kicks stop masking substrate instability.
  • OTEL and proof artifacts now capture failed recovery as a first-class state.
  • Operators get a truthful signal: stable, failed recovery, or degraded hold.

Negative

  • The healer may now leave the system degraded for longer instead of repeatedly trying to “fix” it.
  • Some incidents will end with “stop cycling and investigate” instead of a fake self-heal story.
  • State tracking grows by one more persisted marker.

Implementation Plan

Required skills

  • k8s — Colima/Talos recovery behavior lives here.
  • system-architecture — needed to distinguish guest health from host control-path health.
  • adr-skill — this is a policy change with cross-agent consequences.
  • clawmail — shared-file coordination and edit reservation.

Affected paths

  • infra/k8s-reboot-heal.sh
  • docs/deploy.md
  • skills/k8s/SKILL.md
  • Vault/docs/decisions/0244-durable-colima-recovery-verification.md

Required changes

  1. Add a post-restart stability verification window after service convergence.
  2. Require repeated healthy checks for Docker, Colima SSH, Kubernetes API, Typesense, and Inngest during that window.
  3. Persist a failed-recovery timestamp when that window regresses.
  4. Block repeated Colima force-cycles while a failed-recovery hold is active.
  5. Emit proof snapshots for post-restart verification failure.
  6. Update operator docs and k8s skill guidance so agents stop treating a restart as success.

Verification

  • bash -n infra/k8s-reboot-heal.sh passes.
  • The state file includes a persisted failed-recovery marker.
  • The healer requires a post-recovery stability window after convergence.
  • Regression of Docker/SSH/Typesense/Inngest during that window causes a failed recovery.
  • A recent failed recovery blocks repeated force-cycles for the configured hold period.
  • docs/deploy.md and skills/k8s/SKILL.md describe the new contract plainly.

Non-goals

  • Replacing Colima in this ADR.
  • Solving every possible hostagent bug.
  • Pretending the current substrate is good enough if it only looks healthy for a minute.

Follow-up

  1. If failed-recovery holds repeat, open the simplification ADR instead of growing more healer theatre.
  2. Consider teaching infra/colima-proof.sh an explicit hostagent-control-path hypothesis tag beyond H1-usernet.
  3. Add a small incident summary view over proof artifacts so the degraded-hold state is obvious at a glance.