The healer was the killer

kubernetes · homelab · postmortem · automation · colima

I run a real Kubernetes cluster. On a Mac Mini. Under my desk. It hosts every durable workflow, queue, memory store, and AI agent I’ve built for myself.

Last Friday I rebooted it. A boring, thinking-nothing-of-it kind of reboot. Over the next few days, services were flickering, pods restarting for no reason, kubectl hanging and spitting connection-refused. I’d hit colima restart and move on. Standard vibe-coding-k8s bullshit, I assumed.

Then I actually looked. The reboot was the trigger. The watchdogs were the accelerant. The machine had been force-cycling itself every nineteen minutes, for days, without me knowing.

The rate was too regular

Here’s what `last reboot` in the guest VM returned:

reboot   Fri 06:55
reboot   Fri 06:46
reboot   Fri 06:35
reboot   Fri 06:16
reboot   Fri 05:57
reboot   Fri 05:38
reboot   Fri 05:20
reboot   Fri 05:01
reboot   Fri 04:42
reboot   Fri 04:23
reboot   Fri 04:04
reboot   Fri 03:45

Metronomic. Every eighteen to nineteen minutes. All night.

Pod restart counts showed the scale. LiveKit at 1,400+ restarts. Redis, Inngest, Typesense at ~300 each. Healthy pods in this cluster restart maybe once a week. These were restarting three times an hour.
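If you want the same census on your own cluster, kubectl will sort pods by restart count. The JSONPath below assumes single-container pods:

```bash
kubectl get pods -A \
  --sort-by='.status.containerStatuses[0].restartCount' \
  -o custom-columns='NS:.metadata.namespace,POD:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount'
```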

No panic in dmesg. Memory fine. No graceful shutdown in the journal — just abrupt cutoffs mid-log-line. Something external was killing this VM on a schedule.

The culprit had my name on it

Colima daemon log, every nineteen minutes: `terminate signal received` → `fatal: context canceled` → respawn. Nothing inside the VM was crashing. Something outside was reaching in and pulling the plug on a cadence.

That something was a LaunchDaemon I’d written months ago called k8s-reboot-heal. 867 lines of bash meant to babysit the cluster.

It checks a list of invariants every three minutes. Docker socket responsive? kubectl reachable? NAS route healthy? br_netfilter loaded? If enough invariants fail for long enough, it runs colima stop --force and restarts everything. A reasonable-sounding self-healer.
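Boiled down, the loop looked roughly like this. This is a sketch, not the actual script; the hostnames, thresholds, and probe details are stand-ins:

```bash
#!/usr/bin/env bash
# Minimal sketch of the k8s-reboot-heal loop, not the real 867-line script.
# Hostnames, thresholds, and probe details are illustrative.

NAS_HOST="nas.local"   # hypothetical
STREAK=0
LIMIT=5                # 5 failed rounds at a 3-minute interval ≈ 15 minutes

probe() {
  docker ps >/dev/null 2>&1 || return 1                    # Docker socket responsive?
  kubectl get nodes >/dev/null 2>&1 || return 1             # API server reachable?
  ping -c 1 "$NAS_HOST" >/dev/null 2>&1 || return 1         # NAS route healthy?
  colima ssh -- lsmod 2>/dev/null | grep -q br_netfilter    # module loaded in the guest?
}

while true; do
  if probe; then STREAK=0; else STREAK=$((STREAK + 1)); fi
  if [ "$STREAK" -ge "$LIMIT" ]; then
    colima stop --force   # the sledgehammer
    colima start
    STREAK=0
  fi
  sleep 180
done
```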

The problem: several of the invariants could not pass in this environment.

The SSH tunnel Lima sets up to the guest VM dies two to four minutes after every boot — a vsock-forwarder interaction with the systemd version Ubuntu ships in the guest. Services inside the VM stay fine. The tunnel breaks. But k8s-reboot-heal used that tunnel to check docker and kubectl. So two to four minutes after every boot, those probes started failing. The unhealthy-streak counter ticked up. Fifteen minutes later: colima stop --force.

Which rebooted the VM.

Which started fresh probe failures.

Which started the countdown again.

The healer had been force-cycling the cluster every nineteen minutes, for days. The pod restart counts were a census of its body count.

Three watchdogs, one pathology

k8s-reboot-heal wasn’t even alone. There were three separate watchdog systems all racing to “help.”

One was the 867-line bash script. Another was Talon — a Rust program with a proper escalation ladder, Telegram SOS alerts, a “cloud recovery agent,” a “local recovery agent.” Beautifully engineered. Also had colima stop --force in its escalation chain. The third was the gateway daemon’s own internal Redis reconnect loop, spamming retries at a dead endpoint for the last twenty-four hours.

All three watched broadly the same thing. All three used the same broken probe paths. All three reached for the same nuclear button. Only a cooldown gate in Talon held the line and kept it from compounding the damage mid-investigation.

I disabled all three. The VM has been up for seven hours.

Longest uptime in weeks.
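For the two LaunchDaemon watchdogs, “disabled” means unloaded with launchctl, roughly like this (the labels and paths are stand-ins, not the real plists):

```bash
# Unload the watchdog LaunchDaemons; labels/paths here are hypothetical.
sudo launchctl bootout system /Library/LaunchDaemons/local.k8s-reboot-heal.plist
sudo launchctl bootout system /Library/LaunchDaemons/local.talon.plist

# Older macOS releases use the legacy form instead:
# sudo launchctl unload -w /Library/LaunchDaemons/local.k8s-reboot-heal.plist
```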

Here’s where the “AI slop” narrative breaks

The lazy version of this story writes itself: LLM wrote bad bash, blew up infrastructure, cautionary tale. I half-believed it for two weeks.

Except it’s half wrong. The 867-line bash script was absolutely AI-generated — a careful, well-logged, typed-exit-paths-everywhere agent output I skimmed and shipped because it looked reasonable. Talon was thoughtful Rust, hand-written by me. Two different origins. Same pathology. Both executed their assumptions perfectly. Both got wrecked the moment the environment shifted under them — which is to say, the moment I rebooted.

The failure wasn’t AI slop. It was shipping boot-path automation whose probes I’d never pressure-tested against a fresh boot, and whose remediation was a sledgehammer wired to a tripwire that fires every single time the machine comes up cold. An agent wrote half of it, I wrote the other half, and both halves believed the same lies about what “healthy” looks like in the first three minutes after boot.

The lesson isn’t “don’t let AI write your scripts.” AI wrote plenty of careful code I’m still running. The lesson is two things — what your probes actually test, and what your remediation is actually allowed to do — and neither of those is AI-shaped:

Probes have to test the thing you actually care about. My probes checked the admin plane — “is the SSH tunnel up? can I run docker ps remotely?” — which is inherently flaky in this environment. What I actually cared about was the data plane: does Inngest answer a health check? Does Redis PONG? Those diverged. Admin plane was broken continuously. Data plane was fine. The probes looked at the wrong signal and red-alerted on it forever.

Remediation has to be proportional to the fault. A single probe failing is not a reason to force-cycle a hypervisor. That’s a sledgehammer reaching for a thumbtack. If flannel is crashlooping, restart flannel. If kubelet is stuck, restart kubelet. If one pod is wedged, delete the pod. colima stop --force should never be in an automation chain — full stop. That’s a human decision made after looking at actual data.
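Put together, the probe-and-remediation pair I should have shipped looks roughly like this. The endpoints, ports, and namespaces are stand-ins, not paste-able config:

```bash
#!/usr/bin/env bash
# Data-plane probes: ask the workloads themselves, not the admin plane.
# Endpoints, namespaces, and ports below are illustrative.

probe_inngest() { curl -fsS --max-time 5 http://inngest.local:8288/health >/dev/null; }
probe_redis()   { kubectl -n data exec deploy/redis -- redis-cli ping | grep -q PONG; }

# Proportional remediation: restart the specific broken thing, nothing bigger.
probe_inngest || kubectl -n workflows rollout restart deployment/inngest
probe_redis   || kubectl -n data rollout restart deployment/redis

# Deliberately absent: colima stop --force. That stays a human decision.
```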

What I’m doing now

One watchdog, not three. Probes only against data-plane services over stable network paths — HTTP to real service endpoints, not SSH into the hypervisor. Remediation that caps at restart-the-specific-broken-thing. Nothing reaches for colima stop --force except me.

The 867-line bash script is still on disk. Disabled, not deleted.