ADR-0148 (accepted)

Kubernetes Cluster Resilience Policy

Context

The joelclaw k8s cluster (Talos Linux on Colima, single control-plane node) is the production runtime for all core services: Redis, Inngest, Typesense, system-bus-worker, LiveKit, Bluesky PDS. When the cluster is unhealthy, joelclaw is down.

We’ve hit recurring failures from:

  • kubectl port-forward silently dying, breaking Inngest pipeline runs
  • Control-plane taint returning after Docker/node restarts, blocking pod scheduling
  • Flannel losing subnet.env after Docker restart, causing pod sandbox failures
  • No backup strategy for PVC data (local-path provisioner, reclaimPolicy: Delete)
  • Missing health probes on some services

Decision

Service Exposure: NodePort Only

All services MUST use NodePort with Docker port mappings on the Talos container. Never use kubectl port-forward for any service that needs persistent host access.

To add a new port (a NodePort Service sketch for step 2 follows this list):

  1. Hot-add to Docker container’s hostconfig.json + config.v2.json (see k8s skill)
  2. Set k8s service type to NodePort with matching nodePort value
  3. Update the port mapping table in the k8s skill
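A minimal Service sketch for step 2, assuming a hypothetical workload named example-svc listening on container port 8080 and a nodePort of 30080; substitute the real service name and the port recorded in the k8s skill table:

  apiVersion: v1
  kind: Service
  metadata:
    name: example-svc
  spec:
    type: NodePort
    selector:
      app: example-svc
    ports:
      - name: http
        port: 8080        # in-cluster port
        targetPort: 8080  # container port
        nodePort: 30080   # must match the Docker port mapping hot-added in step 1 (default range 30000-32767)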

Health Probes: All Three Required

Every workload MUST have all three (a combined example follows this list):

  • Liveness probe — restart if hung
  • Readiness probe — don’t route traffic until ready
  • Startup probe — grace period for slow starts (prevents liveness kills during init)
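A sketch of all three probes on one container, assuming a hypothetical HTTP service exposing /healthz on port 8080; the real endpoints, ports, and timings for Typesense, PDS, and system-bus-worker will differ:

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: example-svc
  spec:
    replicas: 1
    selector:
      matchLabels:
        app: example-svc
    template:
      metadata:
        labels:
          app: example-svc
      spec:
        containers:
          - name: example-svc
            image: example-svc:latest   # placeholder image
            ports:
              - containerPort: 8080
            startupProbe:               # up to 30 x 5s = 150s of grace before liveness takes over
              httpGet:
                path: /healthz
                port: 8080
              failureThreshold: 30
              periodSeconds: 5
            livenessProbe:              # restart the container if it hangs after startup
              httpGet:
                path: /healthz
                port: 8080
              periodSeconds: 10
              failureThreshold: 3
            readinessProbe:             # withhold traffic until the service can actually serve
              httpGet:
                path: /healthz
                port: 8080
              periodSeconds: 5
              failureThreshold: 3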

Current gaps to fix:

  • Typesense: missing liveness probe
  • Bluesky PDS: missing readiness and startup probes
  • system-bus-worker: missing startup probe

Post-Restart Recovery Checklist

After any Docker/Colima/node restart (a scripted form of these steps follows the list):

  1. Remove control-plane taint: kubectl taint nodes joelclaw-controlplane-1 node-role.kubernetes.io/control-plane:NoSchedule- || true
  2. Verify flannel is running: kubectl get pods -n kube-system | grep flannel
  3. If flannel is crash-looping: colima ssh -- sudo modprobe br_netfilter, then delete the flannel pod
  4. Verify all pods reach Running state
  5. Test service connectivity on all mapped ports
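The same steps as a script. This is a sketch that assumes the current kubectl context points at the joelclaw cluster; the flannel pod label in step 3 is an assumption to confirm before use:

  #!/usr/bin/env bash
  set -euo pipefail

  # 1. Remove the control-plane taint (no-op if already absent)
  kubectl taint nodes joelclaw-controlplane-1 node-role.kubernetes.io/control-plane:NoSchedule- || true

  # 2. Verify flannel is running
  kubectl get pods -n kube-system | grep flannel

  # 3. If flannel is crash-looping, reload br_netfilter and recreate the pod:
  #      colima ssh -- sudo modprobe br_netfilter
  #      kubectl delete pod -n kube-system -l app=flannel   # confirm the label on this cluster first

  # 4. Verify all pods reach Running
  kubectl get pods -A

  # 5. Test connectivity on each mapped NodePort (ports from the k8s skill table)
  #      nc -z localhost <nodePort>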

PVC Data Protection

  • reclaimPolicy: Delete means PVC deletion = data loss
  • Critical stateful services: Redis (event bus state), Typesense (OTEL + search indices), Inngest (run history), PDS (AT Proto repo)
  • TODO: implement periodic PVC backup to NAS via CronJob or rsync (one possible shape is sketched below)
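A sketch of that TODO for a single PVC; the PVC name (redis-data), NAS alias (nas), target path, image, and SSH credential handling are all assumptions, not settled choices:

  apiVersion: batch/v1
  kind: CronJob
  metadata:
    name: redis-pvc-backup
  spec:
    schedule: "0 3 * * *"               # nightly at 03:00
    jobTemplate:
      spec:
        template:
          spec:
            restartPolicy: OnFailure
            containers:
              - name: backup
                image: instrumentisto/rsync-ssh   # any small image with rsync + ssh
                command: ["rsync", "-a", "/data/", "nas:/backups/joelclaw/redis/"]
                # SSH key / known_hosts mounting omitted here; required for rsync over SSH
                volumeMounts:
                  - name: data
                    mountPath: /data
                    readOnly: true
            volumes:
              - name: data
                persistentVolumeClaim:
                  claimName: redis-data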

Disk Monitoring

The Colima VM has 19GB of disk total. Monitor with colima ssh -- df -h /. Alert if usage exceeds 80%.
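A cron-able sketch of that check; the 80% threshold comes from this ADR, and the alert action is a placeholder:

  #!/usr/bin/env bash
  usage=$(colima ssh -- df -h / | awk 'NR==2 {print $5}' | tr -d '%')
  if [ "${usage}" -gt 80 ]; then
    echo "Colima VM root disk at ${usage}% (>80%)"   # replace with the real alerting hook
  fi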

Consequences

  • No more silent port-forward failures breaking pipelines
  • Services recover after restarts once the post-restart checklist (taint removal, flannel check) is run
  • Health probes catch hung processes instead of leaving them running as zombies
  • PVC backups, once implemented, prevent data loss on VM recreation