Kubernetes Cluster Resilience Policy
Context
The joelclaw k8s cluster (Talos Linux on Colima, single control-plane node) is the production runtime for all core services: Redis, Inngest, Typesense, system-bus-worker, LiveKit, Bluesky PDS. When the cluster is unhealthy, joelclaw is down.
We’ve hit recurring failures from:
- `kubectl port-forward` silently dying, breaking Inngest pipeline runs
- Control-plane taint returning after Docker/node restarts, blocking pod scheduling
- Flannel losing `subnet.env` after Docker restart, causing pod sandbox failures
- No backup strategy for PVC data (local-path provisioner, `reclaimPolicy: Delete`)
- Missing health probes on some services
- Kubeconfig port drift: k8s API (6443) and Talos API (50000) mapped to random host ports, going stale on container restart (2026-03-21)
Decision
Service Exposure: NodePort Only
All services MUST use NodePort with Docker port mappings on the Talos container. Never use `kubectl port-forward` for any service that needs persistent host access.
To add a new port:
- Hot-add to the Docker container's `hostconfig.json` + `config.v2.json` (see k8s skill)
- Set the k8s service type to NodePort with a matching `nodePort` value
- Update the port mapping table in the k8s skill
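As a sketch, a NodePort service pinned to an explicit port might look like the following (service name, ports, and selector are illustrative placeholders, not taken from the cluster; `nodePort` must sit in the default 30000-32767 range):

```yaml
# Sketch only: NodePort service with an explicitly pinned nodePort.
apiVersion: v1
kind: Service
metadata:
  name: typesense          # placeholder name
  namespace: default
spec:
  type: NodePort
  selector:
    app: typesense         # placeholder selector
  ports:
    - port: 8108
      targetPort: 8108
      nodePort: 30108      # must match the Docker port mapping on the Talos container
```

Because the `nodePort` is fixed, the corresponding Docker mapping on the Talos container never needs to change when the service is recreated.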
Health Probes: All Three Required
Every workload MUST have:
- Liveness probe — restart if hung
- Readiness probe — don’t route traffic until ready
- Startup probe — grace period for slow starts (prevents liveness kills during init)
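A minimal sketch of all three probes on a container spec, assuming an HTTP `/health` endpoint (path, port, and timings are illustrative assumptions):

```yaml
# Sketch only: the three required probes for a generic container.
livenessProbe:
  httpGet: { path: /health, port: 8108 }
  periodSeconds: 10
  failureThreshold: 3      # restart after ~30s of failed checks
readinessProbe:
  httpGet: { path: /health, port: 8108 }
  periodSeconds: 5         # gate traffic until the endpoint responds
startupProbe:
  httpGet: { path: /health, port: 8108 }
  periodSeconds: 10
  failureThreshold: 30     # up to 5 min of startup grace before liveness applies
```

While the startup probe is failing, liveness and readiness probes are suspended, which is what prevents liveness kills during slow initialization.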
Current gaps to fix:
- Typesense: missing liveness probe
- Bluesky PDS: missing readiness and startup probes
- system-bus-worker: missing startup probe
Kubeconfig Port Drift (2026-03-21)
Docker port mappings for the k8s API (6443) and Talos API (50000) are not pinned; they use random host ports. When the Talos container restarts, Docker reassigns them and the kubeconfig goes stale. All services remain healthy (their ports ARE pinned), but kubectl-dependent tools fail with `tls: internal error`.
Self-heal: `health.sh` auto-regenerates the kubeconfig via `talosctl kubeconfig --force` when kubectl is unreachable.
Permanent fix: Recreate the Talos container with explicit port bindings for `6443:6443` and `50000:50000` (requires cluster recreation).
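The self-heal step could take roughly this shape (a sketch, not the actual `health.sh`; detecting "unreachable" via a failing `kubectl version` is an assumption):

```shell
#!/usr/bin/env bash
# Sketch of the health.sh self-heal step described above.

heal_kubeconfig() {
  # If kubectl can reach the API server, the kubeconfig is fine.
  if kubectl version >/dev/null 2>&1; then
    echo "kubeconfig ok"
    return 0
  fi
  echo "kubectl unreachable; regenerating kubeconfig"
  # --force overwrites the stale context with the container's current port mapping.
  talosctl --talosconfig ~/.talos/config --nodes 127.0.0.1 kubeconfig --force
}

# Usage: call heal_kubeconfig from a cron job or health-check loop.
```

Running this periodically keeps kubectl-dependent tooling working across container restarts until the port bindings are pinned permanently.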
Post-Restart Recovery Checklist
After any Docker/Colima/node restart:
- Regenerate kubeconfig (if kubectl fails): `talosctl --talosconfig ~/.talos/config --nodes 127.0.0.1 kubeconfig --force && kubectl config use-context "$(kubectl config get-contexts -o name | grep joelclaw | head -1)"`
- Remove the control-plane taint: `kubectl taint nodes joelclaw-controlplane-1 node-role.kubernetes.io/control-plane:NoSchedule- || true`
- Verify flannel is running: `kubectl get pods -n kube-system | grep flannel`
- If flannel is crash-looping: `colima ssh -- sudo modprobe br_netfilter`, then delete the flannel pod
- Verify all pods reach Running state
- Test service connectivity on all mapped ports
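The checklist above could be scripted roughly as follows (a sketch: the flannel pod label, the wait timeout, and the kubectl-reachability check are assumptions):

```shell
#!/usr/bin/env bash
# Sketch of a post-restart recovery script for the checklist above.

NODE="joelclaw-controlplane-1"

recover() {
  # 1. Regenerate kubeconfig if kubectl cannot reach the API server.
  kubectl version >/dev/null 2>&1 || \
    talosctl --talosconfig ~/.talos/config --nodes 127.0.0.1 kubeconfig --force

  # 2. Remove the control-plane taint (idempotent thanks to || true).
  kubectl taint nodes "$NODE" node-role.kubernetes.io/control-plane:NoSchedule- || true

  # 3. If flannel isn't Running, reload br_netfilter and delete the pod
  #    so it gets rescheduled (label selector is an assumption).
  if ! kubectl get pods -n kube-system | grep -q 'flannel.*Running'; then
    colima ssh -- sudo modprobe br_netfilter
    kubectl delete pod -n kube-system -l app=flannel
  fi

  # 4. Wait until every pod reports Ready.
  kubectl wait --for=condition=Ready pod --all --all-namespaces --timeout=300s
}
```

Port connectivity still needs a separate check against the mapped NodePorts once all pods are Ready.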
PVC Data Protection
- `reclaimPolicy: Delete` means PVC deletion = data loss
- Critical stateful services: Redis (event bus state), Typesense (OTEL + search indices), Inngest (run history), PDS (AT Proto repo)
- TODO: implement periodic PVC backup to NAS via CronJob or rsync
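The backup TODO above could eventually take the shape of a CronJob along these lines (a sketch only: the schedule, image, claim name, and NAS mount are all placeholders, and this is not yet implemented):

```yaml
# Sketch only: nightly rsync of one PVC to an NFS share on the NAS.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pvc-backup-redis        # placeholder name
spec:
  schedule: "0 3 * * *"         # nightly at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: instrumentisto/rsync-ssh   # placeholder image
              command: ["rsync", "-a", "/data/", "/nas/redis/"]
              volumeMounts:
                - { name: data, mountPath: /data, readOnly: true }
                - { name: nas,  mountPath: /nas }
          volumes:
            - name: data
              persistentVolumeClaim: { claimName: redis-data }  # placeholder claim
            - name: nas
              nfs: { server: nas.local, path: /backups }        # placeholder NAS
```

One CronJob per critical PVC keeps restore scopes small; a single job mounting every claim is the alternative trade-off.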
Disk Monitoring
The Colima VM has 19 GB of disk total. Monitor with `colima ssh -- df -h /`. Alert if usage exceeds 80%.
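A minimal sketch of that check, parsing the Use% column from `df` (the 80% threshold comes from this section; what "alert" means is left to the caller):

```shell
#!/usr/bin/env bash
# Sketch: alert when the Colima VM root filesystem exceeds 80% used.

disk_check() {
  local pct
  # df -h / prints a header line, then the root filesystem; Use% is column 5.
  pct=$(colima ssh -- df -h / | awk 'NR==2 {gsub(/%/,"",$5); print $5}')
  if [ "${pct:-0}" -gt 80 ]; then
    echo "ALERT: disk ${pct}% used"
  else
    echo "OK: disk ${pct:-?}% used"
  fi
}
```

Wired into the same cron or health loop as the kubeconfig self-heal, this covers the disk-pressure failure mode before pods start getting evicted.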
Consequences
- No more silent port-forward failures breaking pipelines
- Services recover automatically after restarts (with taint removal)
- Health probes catch hung processes instead of leaving them zombie
- PVC backup prevents data loss on VM recreation