Colima substrate proof harness
Context
Panda still suffers intermittent Colima/VZ collapse after the deprecated com.joel.colima-tunnel and com.joel.typesense-portforward launchd interference was removed.
The cause of the remaining failure is not yet proven. Current evidence points at Lima/Colima usernet state leakage and hostagent readiness stalls, but evidence is not proof.
The system needs a harness that can:
- capture the exact substrate state at failure time,
- emit structured OTEL with high-cardinality metadata,
- save raw artifacts for later audit,
- distinguish hypotheses instead of rewarding whatever recovery guess happened to work.
Without that, every restart risks erasing the thing we needed to inspect.
Decision
Adopt an evidence-first proof harness for Colima substrate failures.
1. Canonical probe script
infra/colima-proof.sh is the canonical harness for Colima substrate evidence capture.
It must:
- snapshot Colima/Lima substrate state into durable JSON artifacts,
- emit structured OTEL events through the normal ingest path,
- support repeatable recovery experiments keyed by incident_id.
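A minimal sketch of the "durable JSON artifact" requirement, written in Python for readability even though the harness itself is a shell script. The function name write_snapshot, the directory layout, the phase labels, and the record fields are all assumptions made for illustration; none of them are taken from infra/colima-proof.sh.

```python
import hashlib
import json
import time
from pathlib import Path


def write_snapshot(artifact_dir, incident_id, phase, state):
    """Write one substrate snapshot as JSON and return (path, sha256 hex digest).

    `state` is whatever the probes collected (hypothetical keys); this function
    only handles durable storage and hashing.
    """
    directory = Path(artifact_dir) / incident_id
    directory.mkdir(parents=True, exist_ok=True)

    record = {
        "incident_id": incident_id,
        "phase": phase,              # hypothetical label, e.g. "failure-edge"
        "captured_at": time.time(),  # epoch seconds
        "state": state,
    }
    # sort_keys gives a stable byte layout, so identical records hash identically.
    payload = json.dumps(record, sort_keys=True, indent=2).encode("utf-8")

    path = directory / f"{phase}.json"
    path.write_bytes(payload)
    return path, hashlib.sha256(payload).hexdigest()
```

Returning the hash together with the path lets the caller put both into event metadata, so a later audit can check that the file on disk is still the one that was captured.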
2. High-cardinality truth belongs in metadata and artifacts
Stable OTEL facets stay low-cardinality:
- source=infra
- component=colima-proof
- stable action
- level
- success
High-cardinality diagnostic values go in metadata and referenced artifact files:
- incident id
- hypothesis id
- recovery mode
- process counts / PIDs
- socket reachability
- port-open matrix
- artifact paths / hashes
- raw command snapshots and log tails
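The split between stable facets and high-cardinality metadata can be sketched as follows. This is an illustrative Python shape only: build_event, the metadata key names, and the example values are assumptions, and the real ingest payload may look different.

```python
# Facets that stay low-cardinality, as listed in this section.
FACET_KEYS = ("source", "component", "action", "level", "success")


def build_event(action, level, success, metadata):
    """Return an event dict with stable facets on top and diagnostics in metadata.

    Rejects metadata keys that would shadow a facet, so high-cardinality
    values cannot leak into the facet namespace by accident.
    """
    overlap = set(metadata) & set(FACET_KEYS)
    if overlap:
        raise ValueError(f"metadata must not shadow facets: {sorted(overlap)}")
    return {
        "source": "infra",
        "component": "colima-proof",
        "action": action,
        "level": level,
        "success": success,
        "metadata": dict(metadata),  # copy, so later caller edits do not alter the event
    }


# Example with hypothetical metadata keys and values.
example = build_event(
    "infra.colima.recovery.usernet_only.verdict",
    "info",
    True,
    {"incident_id": "inc-001", "hypothesis_id": "H1-usernet", "recovery_mode": "usernet-only"},
)
```

Queries then filter on the five facets and read incident-specific values from metadata or from the referenced artifact files.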
3. Failure edges must be captured before destructive recovery
infra/k8s-reboot-heal.sh must capture proof snapshots at the failure edge before it force-cycles Colima.
The healer is allowed to recover the machine, but not before leaving a durable evidence trail.
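The ordering constraint can be sketched as a small gate around the destructive step. This Python sketch is an assumption about structure, not a description of infra/k8s-reboot-heal.sh: recover_with_evidence, the capture and recover callables, and the phase labels are hypothetical.

```python
def recover_with_evidence(capture, recover):
    """Run `recover` only after `capture` has produced a failure-edge artifact.

    `capture(phase)` is expected to return an artifact path, or a falsy value
    if nothing was written. `recover()` performs the destructive recovery.
    Returns (pre_path, post_path).
    """
    pre = capture("failure-edge")
    if not pre:
        # Assumption: this sketch refuses to recover without evidence.
        raise RuntimeError("no failure-edge artifact; refusing destructive recovery")
    recover()
    post = capture("post-recovery")
    return pre, post
```

Refusing outright is the strictest reading of the rule. A real healer would probably bound the capture step with a timeout so that a broken probe cannot block recovery indefinitely.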
4. Proof requires discriminating interventions
The harness exists to test competing hypotheses.
Initial hypothesis set:
- H1-usernet: Lima/Colima usernet leakage or stale user-v2 state is the root cause.
- H2-hostagent: hostagent/VZ readiness stalls first and usernet is secondary damage.
- H3-healer: the repair loop is inducing or amplifying the collapse.
- H4-guest-runtime: guest Docker/Talos/control-plane failure is primary.
A hypothesis is only promoted when:
- the fault is present at failure time,
- the selective intervention targets only that layer,
- the selective intervention restores service,
- the result repeats.
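The four promotion criteria can be expressed as a small predicate over recorded trials. This is a sketch under stated assumptions: the Trial fields and is_promoted are hypothetical names, and the threshold of two passing trials is a guess at what "repeats" means, since no number is specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    fault_present: bool     # the fault was observed at failure time
    selective: bool         # the intervention touched only the suspected layer
    service_restored: bool  # service came back after the intervention


def is_promoted(trials, min_repeats=2):
    """True when at least `min_repeats` trials meet all three per-trial criteria.

    min_repeats=2 is an assumed threshold, not a value from this ADR.
    """
    passing = [
        t for t in trials
        if t.fault_present and t.selective and t.service_restored
    ]
    return len(passing) >= min_repeats
```

This sketch only counts supporting trials. A stricter version could also require that no trial with the fault present and a selective intervention failed to restore service.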
5. Artifact-first recovery experiments
Recovery experiments must be named and queryable.
Initial recovery modes:
- observe
- hold
- usernet-only
- force-cycle
The proof harness must let operators compare these modes by incident, not by memory.
usernet-only is the first non-destructive discriminator. It resets Lima user-v2 control state without touching the VM disk, captures pre/post snapshots, and writes a verdict artifact plus an OTEL event (infra.colima.recovery.usernet_only.verdict). The verdict states explicitly whether the intervention supported H1-usernet, failed to support it, or stayed inconclusive because a broader restart was required.
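One way the three-way verdict could be derived from the pre/post snapshots is sketched below. The verdict strings are the ones named in the Verification section. The function name and its three boolean inputs are hypothetical, and how each input is computed from a snapshot is left open.

```python
def usernet_only_verdict(fault_present_before, healthy_after, broader_restart_needed):
    """Classify one usernet-only experiment into a verdict string.

    fault_present_before:   the pre-snapshot showed the usernet fault
    healthy_after:          the post-snapshot showed service restored
    broader_restart_needed: something beyond the usernet reset had to run
    """
    if broader_restart_needed:
        # A wider restart confounds the experiment, so it proves nothing either way.
        return "inconclusive_broader_recovery"
    if fault_present_before and healthy_after:
        return "supports_h1_usernet"
    return "does_not_support_h1_usernet"
```

Checking the broader-restart flag first keeps a confounded run from being counted as support for H1-usernet, even if service did come back.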
Why this
- Stops evidence destruction — the next restart no longer wipes away the best clues.
- Makes root-cause claims falsifiable — each hypothesis gets explicit evidence and intervention criteria.
- Improves operator trust — decisions can be audited from OTEL + artifacts instead of reconstructed from vague logs.
- Supports simplification decisions — if the proof harness shows the nested substrate is the real tax, flattening the stack becomes an evidence-based decision instead of a mood.
Consequences
Positive
- Colima failures become attributable to incident IDs with raw evidence attached.
- The healer produces proof snapshots at failure edges and after recovery boundaries.
- OTEL queries can separate failure detection from recovery mode and outcome.
- Future simplification work can cite actual repeated substrate failure modes.
Negative
- More artifact volume under local state.
- More OTEL event volume from infra diagnosis.
- Slightly more complexity in the healer, because evidence capture now precedes some recovery actions.
Implementation Plan
Required skills
- k8s
- system-architecture
- o11y-logging
- adr-skill
- clawmail
Affected paths
- infra/colima-proof.sh
- infra/k8s-reboot-heal.sh
- docs/deploy.md
- Vault/docs/decisions/0242-colima-substrate-proof-harness.md
- Vault/docs/decisions/0241-recovery-authority-and-colima-escalation-gates.md
Required changes
- Add infra/colima-proof.sh to capture:
  - Colima status JSON
  - Lima list output
  - process snapshots
  - socket probes
  - selected port reachability
  - healer state file
  - hostagent / Colima log tails
  - artifact hashes and incident IDs
- Emit OTEL for:
  - failure detection
  - hold state
  - recovery start/completion/failure
  - post-invariant pass/fail
  - usernet-only verdicts
- Wire infra/k8s-reboot-heal.sh to call the proof harness before destructive recovery and after recovery boundaries.
- Preserve raw artifacts under a durable local path for later diffing and replay.
Verification
- bash -n infra/colima-proof.sh passes.
- bash -n infra/k8s-reboot-heal.sh passes after proof-harness integration.
- infra/colima-proof.sh snapshot writes a JSON artifact and an OTEL payload file.
- Failure-edge snapshots are emitted before force_cycle_colima() runs.
- Post-invariant pass/fail is queryable in OTEL.
- At least two incidents can be compared by incident_id with artifact path + hash.
- recover-usernet --restart-mode none writes pre/post snapshots and a verdict artifact that distinguishes supports_h1_usernet, does_not_support_h1_usernet, and inconclusive_broader_recovery.
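Comparing incidents by incident_id with artifact path and hash could look like the sketch below. The record keys (incident_id, recovery_mode, artifact_path, sha256) and both function names are assumptions about what an artifact index might contain, not the harness's actual schema.

```python
from collections import defaultdict


def index_by_incident(records):
    """Group artifact records by incident_id, keeping mode, path, and hash together."""
    index = defaultdict(list)
    for record in records:
        index[record["incident_id"]].append(
            (record["recovery_mode"], record["artifact_path"], record["sha256"])
        )
    return dict(index)


def modes_tried(index, incident_id):
    """Recovery modes recorded for one incident, in the order they were logged."""
    return [mode for mode, _path, _digest in index.get(incident_id, [])]
```

With an index like this, two incidents can be compared side by side by the modes that were tried and the exact artifacts each mode produced.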
Non-goals
- Declaring H1-usernet proven immediately.
- Rewriting Colima, Lima, or VZ internals.
- Flattening the entire runtime stack in this ADR.
Follow-up
- Add incident comparison tooling on top of the proof artifacts.
- If the proof harness shows the nested substrate is the real tax, write a separate ADR for stack simplification.