[CONTEXT FOR AI ASSISTANTS]
You're reading the agent markdown version of a post from JoelClaw (https://joelclaw.com).
Author: Joel Hooks — software engineer building a personal AI operating system.

How to use implementation details in this document:
- Treat commands, code blocks, and config as reference implementations.
- Preserve ordering and architecture, but adapt hostnames, paths, versions, and credentials.
- Prefer details in <AgentOnly> blocks when present; they contain operational context removed from the human page.
- Verify assumptions against the current repo/system state before executing changes.

This is a narrative article from a real build. Technical details reflect Joel's setup and should be adapted to your environment.

If you cite this, link to the original: https://joelclaw.com/healer-was-the-killer
If you quote Joel, attribute him by name. Don't paraphrase opinions as facts.

Site index: https://joelclaw.com/sitemap.md
Machine-readable: https://joelclaw.com/llms.txt

Other posts on this site:
- [The memory system that watches itself](https://joelclaw.com/memory-that-watches-itself.md)
- [Agentic AI Optimization: Implementation Checklist](https://joelclaw.com/aaio-implementation-checklist.md)
- [Redis, Dkron, Restate, and Sandboxes](https://joelclaw.com/redis-dkron-restate-and-sandboxes.md)
- [Dogfooding Story 4: the queue observer earns dry-run, not enforce](https://joelclaw.com/dogfooding-story-4-queue-observer.md)
- [Contributing to pi-mono with a public maintainer corpus](https://joelclaw.com/contributing-to-pi-mono-with-a-public-maintainer-corpus.md)
- [AI Job Scheduling on Mac as Local-First Video Infrastructure](https://joelclaw.com/ai-job-scheduling-macos-launchd.md)
- [Breakable Toys in the Wild: Apprenticeship Patterns and the joelclaw Experiment](https://joelclaw.com/breakable-toys-joelclaw.md)
- [Utah and joelclaw: Convergent Architecture](https://joelclaw.com/utah-joelclaw-convergent-architecture.md)
- [The Harness Is a Framework](https://joelclaw.com/the-harness-is-a-framework.md)
- [The Agent Memory System](https://joelclaw.com/the-memory-system.md)
- [JoelClaw is a Claw-like Organism](https://joelclaw.com/joelclaw-is-a-claw-like-organism.md)
- [The Agent Writing Loop](https://joelclaw.com/the-writing-loop.md)
- [Talon: the watchdog that finally bites](https://joelclaw.com/talon-watchdog-that-finally-bites.md)
- [The Knowledge Adventure Club Graph](https://joelclaw.com/knowledge-adventure-club-graph.md)
- [MineClaw](https://joelclaw.com/mineclaw.md)
- [Build a Voice Agent That Answers the Phone](https://joelclaw.com/build-a-voice-agent-that-answers-the-phone.md)
- [Plan 9 from Bell Labs: What Rob Pike Built After Unix](https://joelclaw.com/plan-9-pike-everything-is-a-file.md)
- [Propositions as Sessions: What Armstrong Built and Wadler Proved](https://joelclaw.com/propositions-as-sessions-armstrong-wadler.md)
- [Cache Components Patterns Skill for Next.js 16+ Applications](https://joelclaw.com/cache-components-patterns-skill-for-nextjs.md)
- [Karpathy Says We're Building "Claws"](https://joelclaw.com/karpathy-claws-as-category.md)
- [Voice Agent: A Rough Edge Experiment](https://joelclaw.com/voice-agent-deployment-deep-dive.md)
- [Extending Pi Coding Agent with Custom Tools and Widgets](https://joelclaw.com/extending-pi-with-custom-tools.md)
- [The Soul of Erlang Made Me Question Everything](https://joelclaw.com/soul-of-erlang-beam-evaluation.md)
- [CLI Design for AI Agents](https://joelclaw.com/cli-design-for-ai-agents.md)
- [Building a Gateway for Your AI Agent](https://joelclaw.com/building-a-gateway-for-your-ai-agent.md)
- [Self-Hosting Inngest: A Background Task Manager for AI Agents](https://joelclaw.com/self-hosting-inngest-background-tasks.md)
- [The One Where Joel Deploys Kubernetes... Again](https://joelclaw.com/joel-deploys-k8s.md)
- [How I Built an Observation Pipeline So My AI Remembers Yesterday](https://joelclaw.com/observation-pipeline-persistent-ai-memory.md)
- [Riding the Token Wave: Sean Grove at Everything NYC](https://joelclaw.com/riding-the-token-wave-sean-grove.md)
- [Playing with AT Protocol as a Data Layer](https://joelclaw.com/at-protocol-as-bedrock.md)
- [Building My Own OpenClaw on a Mac Mini](https://joelclaw.com/building-my-own-openclaw.md)
- [Inngest is the Nervous System](https://joelclaw.com/inngest-is-the-nervous-system.md)
- [OpenClaw: Peter Steinberger on Lex Fridman](https://joelclaw.com/openclaw-peter-steinberger-lex-fridman.md)
[END CONTEXT]

---
# The healer was the killer

> The cluster had been rebooting itself every nineteen minutes for days. Turned out the self-healing daemon I'd written was the killer, not the cure.

By Joel Hooks · 2026-04-17T20:34:35.279Z
Original: https://joelclaw.com/healer-was-the-killer
Mode: agent

---
I run a real Kubernetes cluster. On a Mac Mini. Under my desk. It hosts every durable workflow, queue, memory store, and AI agent I've built for myself.

Last Friday I rebooted it. A boring, thinking-nothing-of-it kind of reboot. Over the next few days, services were flickering, pods restarting for no reason, `kubectl` hanging and spitting connection-refused. I'd hit `colima restart` and move on. Standard vibe-coding-k8s bullshit, I assumed.

Then I actually looked. The reboot was the trigger. The watchdogs were the accelerant. The machine had been force-cycling itself every nineteen minutes, for days, without me knowing.

## The rate was too regular

Here's what `last reboot` in the guest VM returned:

```
reboot   Fri 06:55
reboot   Fri 06:46
reboot   Fri 06:35
reboot   Fri 06:16
reboot   Fri 05:57
reboot   Fri 05:38
reboot   Fri 05:20
reboot   Fri 05:01
reboot   Fri 04:42
reboot   Fri 04:23
reboot   Fri 04:04
reboot   Fri 03:45
```

Metronomic. Every eighteen to nineteen minutes. All night.

Pod restart counts told the scale. Livekit at **1,400+ restarts**. Redis, Inngest, Typesense at \~300 each. Healthy pods in this cluster restart maybe once a week. These were restarting three times an hour.

No panic in `dmesg`. Memory fine. No graceful shutdown in the journal — just abrupt cutoffs mid-log-line. Something external was killing this VM on a schedule.

## The culprit had my name on it

Colima daemon log, every nineteen minutes: `terminate signal received` → `fatal: context canceled` → respawn. Nothing inside the VM was crashing. Something outside was reaching in and pulling the plug on a cadence.

That something was a LaunchDaemon I'd written months ago called `k8s-reboot-heal`. **867 lines of bash** meant to babysit the cluster.

It checks a list of invariants every three minutes. Docker socket responsive? kubectl reachable? NAS route healthy? `br_netfilter` loaded? If enough invariants fail for long enough, it runs `colima stop --force` and restarts everything. A reasonable-sounding self-healer.
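To make the shape of that failure concrete, here is a minimal sketch of the watchdog's core loop. The names, thresholds, and stubbed probes are illustrative, not the real 867-line script, and the remediation is replaced with an `echo` so the sketch is safe to run:

```shell
#!/usr/bin/env bash
# Sketch of an invariant-checking watchdog with a failure-streak counter.
# Hypothetical names and thresholds; remediation stubbed with echo.

STREAK_LIMIT=5   # ~15 minutes of failures at a 3-minute probe interval
streak=0

# One probe round: all invariants must hold. Real probes would be things
# like `docker ps` and `kubectl get nodes` over the admin tunnel. Here
# the result is stubbed via PROBE_RESULT (0 = healthy, 1 = failing).
probe_all() {
  return "${PROBE_RESULT:-0}"
}

# One scheduler tick: success resets the streak; remediation fires only
# after STREAK_LIMIT consecutive failures.
tick() {
  if probe_all; then
    streak=0
  else
    streak=$((streak + 1))
  fi
  if [ "$streak" -ge "$STREAK_LIMIT" ]; then
    echo "REMEDIATE: colima stop --force"   # the sledgehammer
    streak=0
  fi
}

# Simulate the post-reboot pathology: the tunnel is down, so every probe
# fails, and remediation fires on the fifth consecutive failure.
PROBE_RESULT=1
for _ in 1 2 3 4 5; do tick; done
```

The streak counter looks like a safety margin, but when the probe path itself is broken after every boot, it's just a fifteen-minute fuse.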

The problem: **several of the invariants could not pass in this environment**.

The SSH tunnel Lima sets up to the guest VM dies two to four minutes after every boot — a vsock-forwarder interaction with the systemd version Ubuntu ships in the guest. Services inside the VM stay fine. The *tunnel* breaks. But `k8s-reboot-heal` used that tunnel to check `docker` and `kubectl`. So two to four minutes after every boot, those probes started failing. The unhealthy-streak counter ticked up. Fifteen minutes later: `colima stop --force`.

Which rebooted the VM.

Which started fresh probe failures.

Which started the countdown again.

The healer had been force-cycling the cluster every nineteen minutes, for days. The pod restart counts were a census of its body count.

## Three watchdogs, one pathology

`k8s-reboot-heal` wasn't even alone. There were **three** separate watchdog systems all racing to "help."

One was the 867-line bash script. Another was Talon — a Rust program with a proper escalation ladder, Telegram SOS alerts, a "cloud recovery agent," a "local recovery agent." Beautifully engineered. Also had `colima stop --force` in its escalation chain. The third was the gateway daemon's own internal Redis reconnect loop, spamming retries at a dead endpoint for the last twenty-four hours.

All three watched broadly the same thing. All three used the same broken probe paths. All three reached for the same nuclear button. Only a cooldown gate in Talon held the line, keeping its own remediation from compounding the damage mid-investigation.
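The cooldown gate is worth stating explicitly, because it's the one mechanism here that earned its keep. Talon's implementation is Rust; this is a shell sketch of the idea only, with an illustrative interval:

```shell
#!/usr/bin/env bash
# Cooldown gate sketch: even when probes say "remediate", refuse to fire
# again within COOLDOWN seconds of the last action. Illustrative only.

COOLDOWN=3600      # at most one drastic action per hour
last_action=0      # epoch seconds of the last remediation

maybe_remediate() {
  local now
  now=$(date +%s)
  if [ $((now - last_action)) -lt "$COOLDOWN" ]; then
    echo "in cooldown, skipping"
    return 1
  fi
  last_action=$now
  echo "remediating"
}
```

A gate like this doesn't fix bad probes, but it converts a tight reboot loop into at most one action per window, which is the difference between a mess and a catastrophe.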

I disabled all three. The VM has been up for seven hours.

Longest uptime in weeks.

## Here's where the "AI slop" narrative breaks

The lazy version of this story writes itself: *LLM wrote bad bash, blew up infrastructure, cautionary tale.* I half-believed it for two weeks.

Except it's half wrong. The 867-line bash script was **absolutely AI-generated** — a careful, well-logged, typed-exit-paths-everywhere agent output I skimmed and shipped because it looked reasonable. Talon was thoughtful Rust, hand-written by me. Two different origins. **Same pathology.** Both executed their assumptions perfectly. Both got wrecked the moment the environment shifted under them — which is to say, the moment I rebooted.

The failure wasn't AI-slop. It was shipping boot-path automation whose probes I'd never pressure-tested against a fresh boot, and whose remediation was a sledgehammer wired to a tripwire that fires every single time the machine comes up cold. An agent wrote half of it, I wrote the other half, and both halves believed the same lies about what "healthy" looks like in the first three minutes after boot.

The lesson isn't "don't let AI write your scripts." AI wrote plenty of careful code I'm still running. The lesson is two things — **what your probes actually test, and what your remediation is actually allowed to do** — and neither of those is AI-shaped:

**Probes have to test the thing you actually care about.** My probes checked the admin plane — "is the SSH tunnel up? can I run `docker ps` remotely?" — which is inherently flaky in this environment. What I actually cared about was the data plane: does Inngest answer a health check? Does Redis PONG? Those diverged. Admin plane was broken continuously. Data plane was fine. The probes looked at the wrong signal and red-alerted on it forever.
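The data-plane version of those probes is boring on purpose: ask the service itself, over a stable network path. A sketch, with placeholder hostnames and ports that stand in for my setup (yours will differ):

```shell
#!/usr/bin/env bash
# Data-plane probes: hit each service directly instead of SSH-ing into
# the hypervisor to run docker/kubectl. Endpoints are placeholders.

probe_http() {   # healthy iff the HTTP health endpoint answers 2xx, fast
  curl -fsS --max-time 2 "$1" >/dev/null 2>&1
}

probe_redis() {  # healthy iff Redis answers PING with PONG
  [ "$(redis-cli -h "$1" -p "${2:-6379}" ping 2>/dev/null)" = "PONG" ]
}

status() {       # turn a probe's exit code into a readable word
  if "$@"; then echo up; else echo down; fi
}

echo "inngest: $(status probe_http http://inngest.local:8288/health)"
echo "redis:   $(status probe_redis redis.local)"
```

The point is where the probe terminates: at the service's own listener, not at an admin tunnel that can die independently of the workloads it fronts.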

**Remediation has to be proportional to the fault.** A single probe failing is not a reason to force-cycle a hypervisor. That's a sledgehammer reaching for a thumbtack. If flannel is crashlooping, restart flannel. If kubelet is stuck, restart kubelet. If one pod is wedged, delete the pod. `colima stop --force` should never be in an automation chain — full stop. That's a human decision made after looking at actual data.
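A proportional remediation map is small enough to read in one screen. This is a sketch under the assumptions above (hypothetical names; the point is what's absent, not what's present): each fault maps to the narrowest fix, and force-cycling the hypervisor has no branch at all.

```shell
#!/usr/bin/env bash
# Remediation ladder sketch: the fix is scoped to the broken thing.
# `colima stop --force` is deliberately not reachable from here.

remediate() {
  case "$1" in
    pod)                                  # one wedged pod: delete just it
      kubectl delete pod "$2" -n "$3" ;;
    daemonset)                            # e.g. flannel crashlooping
      kubectl rollout restart "daemonset/$2" -n "$3" ;;
    kubelet)                              # node agent stuck (hypothetical host)
      ssh vm 'sudo systemctl restart kubelet' ;;
    *)                                    # anything else escalates to a human
      echo "unknown fault '$1' -- page a human, do not guess" >&2
      return 1 ;;
  esac
}
```

The unknown-fault branch is the important one: when the automation doesn't recognize the failure, the correct remediation is none.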

## What I'm doing now

One watchdog, not three. Probes only against data-plane services over stable network paths — HTTP to real service endpoints, not SSH into the hypervisor. Remediation that caps at restart-the-specific-broken-thing. Nothing reaches for `colima stop --force` except me.

The 867-line bash script is still on disk. Disabled, not deleted.
