ADR-0148
Status: accepted

Kubernetes Cluster Resilience Policy

Context

The joelclaw k8s cluster (Talos Linux on Colima, single control-plane node) is the production runtime for all core services: Redis, Inngest, Typesense, system-bus-worker, LiveKit, Bluesky PDS. When the cluster is unhealthy, joelclaw is down.

We’ve hit recurring failures from:

  • kubectl port-forward silently dying, breaking Inngest pipeline runs
  • Control-plane taint returning after Docker/node restarts, blocking pod scheduling
  • Flannel losing subnet.env after Docker restart, causing pod sandbox failures
  • No backup strategy for PVC data (local-path provisioner, reclaimPolicy: Delete)
  • Missing health probes on some services
  • Kubeconfig port drift: the k8s API (6443) and Talos API (50000) are mapped to random host ports, so the kubeconfig goes stale after a container restart (2026-03-21)

Decision

Service Exposure: NodePort Only

All services MUST use NodePort with Docker port mappings on the Talos container. Never use kubectl port-forward for any service that needs persistent host access.

To add a new port:

  1. Hot-add the port mapping to the Docker container’s hostconfig.json and config.v2.json (see k8s skill)
  2. Set the k8s Service type to NodePort with a matching nodePort value (see the sketch after this list)
  3. Update the port mapping table in the k8s skill
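
As a sketch of step 2, a pinned NodePort Service looks roughly like the following. The namespace, selector, service name, and port numbers are illustrative, not the cluster’s actual values; the nodePort must line up with the Docker mapping added in step 1.

```bash
# Sketch: expose Redis on a pinned NodePort (names and ports are examples, not actual values).
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: redis
  namespace: default
spec:
  type: NodePort
  selector:
    app: redis
  ports:
    - port: 6379          # cluster-internal port
      targetPort: 6379    # container port
      nodePort: 30379     # must line up with the Docker port mapping on the Talos container
EOF
```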

Health Probes: All Three Required

Every workload MUST have:

  • Liveness probe — restart if hung
  • Readiness probe — don’t route traffic until ready
  • Startup probe — grace period for slow starts (prevents liveness kills during init)

Current gaps to fix:

  • Typesense: missing liveness probe
  • Bluesky PDS: missing readiness and startup probes
  • system-bus-worker: missing startup probe
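
As a sketch of what closing one of these gaps looks like, the patch below adds all three probes to Typesense. The workload kind and name, namespace, container name, and the /health endpoint on port 8108 are assumptions; verify them against the real manifest before applying.

```bash
# Sketch: add liveness/readiness/startup probes to Typesense via a strategic-merge patch.
# Workload kind/name, namespace, container name, port, and endpoint are assumptions.
kubectl -n default patch statefulset typesense --type strategic -p '
spec:
  template:
    spec:
      containers:
        - name: typesense
          livenessProbe:
            httpGet: {path: /health, port: 8108}
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet: {path: /health, port: 8108}
            periodSeconds: 5
          startupProbe:
            httpGet: {path: /health, port: 8108}
            periodSeconds: 5
            failureThreshold: 30   # up to ~150s of startup grace before the liveness probe takes over
'
```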

Kubeconfig Port Drift (2026-03-21)

Docker port mappings for the k8s API (6443) and the Talos API (50000) are not pinned; they land on random host ports. When the Talos container restarts, Docker reassigns them and the kubeconfig goes stale. All services remain healthy (their ports ARE pinned), but kubectl-dependent tools fail with tls: internal error.

Self-heal: health.sh auto-regenerates kubeconfig via talosctl kubeconfig --force when kubectl is unreachable.
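
In shell terms, the self-heal amounts to roughly the following (a simplified sketch of the health.sh behaviour, not the actual script):

```bash
# Sketch: if the API server is unreachable, regenerate the kubeconfig from Talos.
if ! kubectl get nodes --request-timeout=5s >/dev/null 2>&1; then
  talosctl --talosconfig ~/.talos/config --nodes 127.0.0.1 kubeconfig --force
fi
```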

Permanent fix: Recreate the Talos container with explicit port bindings for 6443:6443 and 50000:50000 (requires cluster recreation).

Post-Restart Recovery Checklist

After any Docker/Colima/node restart:

  1. Regenerate kubeconfig (if kubectl fails): talosctl --talosconfig ~/.talos/config --nodes 127.0.0.1 kubeconfig --force && kubectl config use-context "$(kubectl config get-contexts -o name | grep joelclaw | head -1)"
  2. Remove control-plane taint: kubectl taint nodes joelclaw-controlplane-1 node-role.kubernetes.io/control-plane:NoSchedule- || true
  3. Verify flannel is running: kubectl get pods -n kube-system | grep flannel
  4. If flannel is crash-looping: colima ssh -- sudo modprobe br_netfilter, then delete the flannel pod
  5. Verify all pods reach Running state
  6. Test service connectivity on all mapped ports
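
A sketch that stitches the checklist into one script. The commands and node name come from the steps above; the flannel pod label and the all-namespaces pod check are assumptions to verify:

```bash
#!/usr/bin/env bash
# Post-restart recovery sketch; mirrors the checklist above, adjust names to the real cluster.
set -uo pipefail

# 1. Regenerate kubeconfig if kubectl cannot reach the API server
if ! kubectl get nodes --request-timeout=5s >/dev/null 2>&1; then
  talosctl --talosconfig ~/.talos/config --nodes 127.0.0.1 kubeconfig --force
  # (then switch kubectl context as in checklist step 1 if needed)
fi

# 2. Remove the control-plane taint (idempotent)
kubectl taint nodes joelclaw-controlplane-1 node-role.kubernetes.io/control-plane:NoSchedule- 2>/dev/null || true

# 3-4. Check flannel; if it is crash-looping, load br_netfilter and recreate the pod
if kubectl get pods -n kube-system | grep flannel | grep -q CrashLoopBackOff; then
  colima ssh -- sudo modprobe br_netfilter
  kubectl delete pods -n kube-system -l app=flannel   # label is an assumption; check with --show-labels
fi

# 5. List anything not yet Running/Completed
kubectl get pods -A | grep -Ev 'Running|Completed|NAMESPACE' || echo "all pods healthy"

# 6. Test service connectivity on all mapped ports (see the port table in the k8s skill)
```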

PVC Data Protection

  • reclaimPolicy: Delete means PVC deletion = data loss
  • Critical stateful services: Redis (event bus state), Typesense (OTEL + search indices), Inngest (run history), PDS (AT Proto repo)
  • TODO: implement periodic PVC backup to NAS via CronJob or rsync
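
A minimal sketch of that TODO, assuming the NAS is reachable as an NFS export. The image, schedule, PVC name, NAS address, and paths are all placeholders, and a file-level tar of a live Redis volume is only crash-consistent; a real backup should snapshot the RDB or pause writes first.

```bash
# Sketch: nightly tarball of one PVC to an NFS-mounted NAS. All names and paths are placeholders.
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: redis-pvc-backup
  namespace: default
spec:
  schedule: "0 3 * * *"            # nightly at 03:00
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: alpine:3.20
              command: ["/bin/sh", "-c"]
              args:
                - tar czf /nas/redis-$(date +%Y%m%d).tar.gz -C /data .
              volumeMounts:
                - {name: data, mountPath: /data, readOnly: true}
                - {name: nas, mountPath: /nas}
          volumes:
            - name: data
              persistentVolumeClaim:
                claimName: redis-data      # placeholder PVC name
            - name: nas
              nfs:
                server: nas.local          # placeholder NAS address
                path: /backups/joelclaw
EOF
```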

Disk Monitoring

The Colima VM has 19GB of disk in total. Monitor with colima ssh -- df -h /; alert if usage exceeds 80%.
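
A sketch of the threshold check; wire the alert line into whatever notifier is already in place:

```bash
# Alert when the Colima VM root filesystem exceeds 80% usage.
usage=$(colima ssh -- df -h / | awk 'NR==2 {gsub("%",""); print $5}')
if [ "${usage:-0}" -gt 80 ]; then
  echo "ALERT: Colima VM disk at ${usage}% (threshold 80%)"
fi
```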

Consequences

  • No more silent port-forward failures breaking pipelines
  • Services recover predictably after restarts, once the taint is removed and flannel is verified
  • Health probes catch hung processes instead of leaving them as zombies
  • PVC backups, once implemented, prevent data loss on VM recreation