ADR-0242 accepted

Colima substrate proof harness

Context

Panda still suffers intermittent Colima/VZ collapse after the deprecated com.joel.colima-tunnel and com.joel.typesense-portforward launchd interference was removed.

The cause of the remaining failures is not yet proven. Current evidence points at Lima/Colima usernet state leakage and hostagent readiness stalls, but evidence is not proof.

The system needs a harness that can:

  1. capture the exact substrate state at failure time,
  2. emit structured OTEL with high-cardinality metadata,
  3. save raw artifacts for later audit,
  4. distinguish hypotheses instead of rewarding whatever recovery guess happened to work.

Without that, every restart risks erasing the thing we needed to inspect.

Decision

Adopt an evidence-first proof harness for Colima substrate failures.

1. Canonical probe script

infra/colima-proof.sh is the canonical harness for Colima substrate evidence capture.

It must:

  • snapshot Colima/Lima substrate state into durable JSON artifacts,
  • emit structured OTEL events through the normal ingest path,
  • support repeatable recovery experiments keyed by incident_id.
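A minimal sketch of the snapshot step, assuming a layout of one directory per incident_id with a raw capture file and a small JSON index next to it. The function names (json_escape, write_snapshot), the file names, and the JSON fields are hypothetical and are not taken from infra/colima-proof.sh.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: names, directory layout, and JSON fields are assumptions.

# Escape backslashes and double quotes so a single-line value fits in a JSON string.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# write_snapshot <artifact_dir> <incident_id> <phase> <raw_text>
# Stores the raw text untouched, writes a JSON index that points at it,
# and prints the path of the JSON index.
write_snapshot() {
  local dir="$1" incident_id="$2" phase="$3" raw_text="$4"
  local base="$dir/$incident_id"
  local raw_path="$base/$phase.status.txt"
  local json_path="$base/$phase.json"

  mkdir -p "$base" || return 1
  printf '%s\n' "$raw_text" > "$raw_path" || return 1
  printf '{"incident_id":"%s","phase":"%s","raw_path":"%s"}\n' \
    "$(json_escape "$incident_id")" \
    "$(json_escape "$phase")" \
    "$(json_escape "$raw_path")" > "$json_path" || return 1
  printf '%s\n' "$json_path"
}
```

In real use the raw text would be the captured output of the Colima and Lima status commands. Keeping raw output in its own file avoids lossy escaping and leaves the JSON index small enough to diff between incidents.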

2. High-cardinality truth belongs in metadata and artifacts

Stable OTEL facets stay low-cardinality:

  • source=infra
  • component=colima-proof
  • stable action
  • level
  • success

High-cardinality diagnostic values go in metadata and referenced artifact files:

  • incident id
  • hypothesis id
  • recovery mode
  • process counts / PIDs
  • socket reachability
  • port-open matrix
  • artifact paths / hashes
  • raw command snapshots and log tails
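The split between facets and metadata can be sketched as a payload builder. This is an illustration only: the function name build_event, the JSON shape, and the metadata key names are assumptions, and the real ingest path may expect a different envelope. Arguments are assumed to be free of quotes and backslashes.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: payload shape and key names are assumptions.

# build_event <action> <level> <success> <incident_id> <hypothesis_id> <recovery_mode> <artifact_path>
# Facets (source, component, action, level, success) stay at the top level.
# High-cardinality values are nested under "metadata".
build_event() {
  printf '{"source":"infra","component":"colima-proof","action":"%s","level":"%s","success":%s,' \
    "$1" "$2" "$3"
  printf '"metadata":{"incident_id":"%s","hypothesis_id":"%s","recovery_mode":"%s","artifact_path":"%s"}}\n' \
    "$4" "$5" "$6" "$7"
}
```

Keeping incident IDs and artifact paths out of the top-level facets means a query can group on action and success without the facet index growing with every incident.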

3. Failure edges must be captured before destructive recovery

infra/k8s-reboot-heal.sh must capture proof snapshots at the failure edge before it force-cycles Colima.

The healer is allowed to recover the machine, but not before leaving a durable evidence trail.
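The ordering constraint can be sketched as a guard around the destructive step. Both capture_failure_edge and force_cycle_colima are stubs here that only append to a log, EVENT_LOG and recover_with_evidence are hypothetical names, and refusing to force-cycle when capture fails is one possible policy, not something this ADR mandates.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: the stubs below only record call order in EVENT_LOG.

capture_failure_edge() { echo "snapshot:$1" >> "$EVENT_LOG"; }
force_cycle_colima()   { echo "force-cycle:$1" >> "$EVENT_LOG"; }

# recover_with_evidence <incident_id>
# Runs the destructive recovery only after the failure-edge capture succeeded.
recover_with_evidence() {
  local incident_id="$1"
  if ! capture_failure_edge "$incident_id"; then
    echo "capture-failed:$incident_id" >> "$EVENT_LOG"
    return 1  # assumption: no destructive recovery without an evidence trail
  fi
  force_cycle_colima "$incident_id"
}
```

A softer variant would record the capture failure and still proceed with recovery. Which policy fits depends on how long the machine can stay down while the capture is retried.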

4. Proof requires discriminating interventions

The harness exists to test competing hypotheses.

Initial hypothesis set:

  • H1-usernet — Lima/Colima usernet leakage or stale user-v2 state is the root cause.
  • H2-hostagent — hostagent/VZ readiness stalls first and usernet is secondary damage.
  • H3-healer — the repair loop is inducing or amplifying the collapse.
  • H4-guest-runtime — guest Docker/Talos/control-plane failure is primary.

A hypothesis is only promoted when:

  1. the fault is present at failure time,
  2. the selective intervention targets only that layer,
  3. the selective intervention restores service,
  4. the result repeats.
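The four promotion criteria can be expressed as a single check. This is a sketch under assumptions: the function name promote_hypothesis, the yes/no argument encoding, and the default threshold of two repeats for "the result repeats" are all illustrative choices.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: argument encoding and the repeat threshold are assumptions.

# promote_hypothesis <fault_present> <intervention_selective> <service_restored> <repeat_count> [min_repeats]
# Prints "promoted" only when all four criteria hold, otherwise "not_promoted".
promote_hypothesis() {
  local fault_present="$1" selective="$2" restored="$3" repeats="$4"
  local min_repeats="${5:-2}"
  if [ "$fault_present" = yes ] && [ "$selective" = yes ] \
     && [ "$restored" = yes ] && [ "$repeats" -ge "$min_repeats" ]; then
    echo promoted
  else
    echo not_promoted
  fi
}
```

For example, `promote_hypothesis yes yes yes 1` prints not_promoted, because a single successful intervention does not yet count as a repeated result.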

5. Artifact-first recovery experiments

Recovery experiments must be named and queryable.

Initial recovery modes:

  • observe
  • hold
  • usernet-only
  • force-cycle

The proof harness must let operators compare these modes by incident, not by memory.

usernet-only is the first non-destructive discriminator. It resets Lima user-v2 control state without touching the VM disk and captures pre/post snapshots. It then writes a verdict artifact and an OTEL event (infra.colima.recovery.usernet_only.verdict) stating whether the intervention supported H1-usernet, failed to support it, or was inconclusive because a broader restart was required.
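One way to derive the usernet-only verdict from the pre/post snapshots is sketched below. The three verdict strings match the values named under Verification. The function name usernet_only_verdict, the yes/no inputs, and the decision order are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: inputs and decision order are assumptions.

# usernet_only_verdict <pre_fault_present> <post_healthy> <broader_restart_required>
usernet_only_verdict() {
  local pre_fault="$1" post_healthy="$2" broader_restart="$3"
  if [ "$broader_restart" = yes ]; then
    # A wider restart also ran, so the usernet reset cannot be credited alone.
    echo inconclusive_broader_recovery
  elif [ "$pre_fault" = yes ] && [ "$post_healthy" = yes ]; then
    echo supports_h1_usernet
  else
    echo does_not_support_h1_usernet
  fi
}
```

Checking the broader-restart flag first keeps a recovery that needed a wider restart from being counted as support for H1-usernet.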

Why this

  • Stops evidence destruction — the next restart no longer wipes away the best clues.
  • Makes root-cause claims falsifiable — each hypothesis gets explicit evidence and intervention criteria.
  • Improves operator trust — decisions can be audited from OTEL + artifacts instead of reconstructed from vague logs.
  • Supports simplification decisions — if the proof harness shows the nested substrate is the real tax, flattening the stack becomes an evidence-based decision instead of a mood.

Consequences

Positive

  • Colima failures become attributable to incident IDs with raw evidence attached.
  • The healer produces proof snapshots at failure edges and after recovery boundaries.
  • OTEL queries can separate failure detection from recovery mode and outcome.
  • Future simplification work can cite actual repeated substrate failure modes.

Negative

  • More artifact volume under local state.
  • More OTEL event volume from infra diagnosis.
  • Slightly more complexity in the healer, because evidence capture now precedes some recovery actions.

Implementation Plan

Required skills

  • k8s
  • system-architecture
  • o11y-logging
  • adr-skill
  • clawmail

Affected paths

  • infra/colima-proof.sh
  • infra/k8s-reboot-heal.sh
  • docs/deploy.md
  • Vault/docs/decisions/0242-colima-substrate-proof-harness.md
  • Vault/docs/decisions/0241-recovery-authority-and-colima-escalation-gates.md

Required changes

  1. Add infra/colima-proof.sh to capture:
    • Colima status JSON
    • Lima list output
    • process snapshots
    • socket probes
    • selected port reachability
    • healer state file
    • hostagent / Colima log tails
    • artifact hashes and incident IDs
  2. Emit OTEL for:
    • failure detection
    • hold state
    • recovery start/completion/failure
    • post-invariant pass/fail
    • usernet-only verdicts
  3. Wire infra/k8s-reboot-heal.sh to call the proof harness before destructive recovery and after recovery boundaries.
  4. Preserve raw artifacts under a durable local path for later diffing and replay.
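Artifact hashing for later diffing could look like the sketch below. It assumes either sha256sum (common on Linux) or shasum (shipped with macOS) is on PATH. The function name artifact_sha256 is hypothetical, and SHA-256 is an assumed choice because this ADR does not name a hash algorithm.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: the hash algorithm and function name are assumptions.

# artifact_sha256 <file>
# Prints the hex SHA-256 digest of the file, or returns 1 if no tool is available.
artifact_sha256() {
  local file="$1"
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$file" | cut -d' ' -f1
  elif command -v shasum >/dev/null 2>&1; then
    shasum -a 256 "$file" | cut -d' ' -f1
  else
    return 1
  fi
}
```

Recording the digest next to the artifact path in event metadata lets two incidents be compared by content without opening the files.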

Verification

  • bash -n infra/colima-proof.sh passes.
  • bash -n infra/k8s-reboot-heal.sh passes after proof-harness integration.
  • infra/colima-proof.sh snapshot writes a JSON artifact and an OTEL payload file.
  • Failure-edge snapshots are emitted before force_cycle_colima() runs.
  • Post-invariant pass/fail is queryable in OTEL.
  • At least two incidents can be compared by incident_id with artifact path + hash.
  • recover-usernet --restart-mode none writes pre/post snapshots and a verdict artifact that distinguishes supports_h1_usernet, does_not_support_h1_usernet, and inconclusive_broader_recovery.

Non-goals

  • Declaring H1-usernet proven immediately.
  • Rewriting Colima, Lima, or VZ internals.
  • Flattening the entire runtime stack in this ADR.

Follow-up

  1. Add incident comparison tooling on top of the proof artifacts.
  2. If the proof harness shows the nested substrate is the real tax, write a separate ADR for stack simplification.