Stable kube operator access on dedicated local tunnels
Context
The Colima/Talos rebuild restored the core runtime, but it exposed an operator-plane problem:
- the core services were healthy,
- 10.5.0.2:6443 inside the VM was healthy,
- but the direct host-published kube API path could still return TLS garbage on 127.0.0.1:6443.
That is unacceptable for operator access. A stable substrate is not enough if kubectl and talosctl still depend on an ad hoc manual tunnel to work.
The old com.joel.colima-tunnel daemon is still dead and should stay dead. It fought Lima for ownership of app-facing ports that Colima already published. We need a different contract: one that hardens the operator plane without reintroducing duplicate ownership on runtime/service ports.
Decision
1. Operator access gets its own dedicated local ports
The canonical operator plane is now:
- 127.0.0.1:16443 -> 10.5.0.2:6443 for kube-apiserver
- 127.0.0.1:15000 -> 10.5.0.2:50000 for the Talos API
These ports are dedicated to kubectl/talosctl and are intentionally separate from Colima/Lima-published runtime ports.
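Because the operator ports must stay separate from anything Colima/Lima publishes, a preflight check on 16443 and 15000 can catch a collision before the daemon binds them. The following is a minimal sketch that assumes macOS lsof and treats the daemon's own ssh process as the only acceptable owner. The function names are hypothetical:

```bash
#!/usr/bin/env bash
# Preflight sketch: report which process (if any) listens on an operator port.
# Assumes macOS lsof; a healthy owner is taken to be the daemon's own ssh.

# Reads lsof field output (-F) on stdin and prints the command name from each
# "c<command>" line; "p<pid>" and "f<fd>" lines are ignored.
lsof_commands() {
  local line
  while IFS= read -r line || [[ -n "$line" ]]; do
    if [[ "$line" == c* ]]; then
      printf '%s\n' "${line#c}"
    fi
  done
}

# Prints the unique command names listening on TCP port $1 on any local address.
port_owner() {
  # -n/-P skip DNS and service-name lookups; -sTCP:LISTEN keeps listeners only.
  lsof -nP -iTCP:"$1" -sTCP:LISTEN -Fc 2>/dev/null | lsof_commands | sort -u
}

# Succeeds if port $1 is free or held only by ssh; fails naming any other owner.
check_operator_port() {
  local owner
  owner="$(port_owner "$1")"
  if [[ -n "$owner" && "$owner" != "ssh" ]]; then
    printf 'port %s is held by: %s\n' "$1" "$owner" >&2
    return 1
  fi
}
```

Running check_operator_port 16443 && check_operator_port 15000 before starting the tunnel turns a port collision into one clear error message. Without it, ssh would exit on the failed forward and launchd would keep restarting it.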
2. The operator tunnel is a launchd-managed critical daemon
com.joel.kube-operator-access is a repo-managed system LaunchDaemon.
It runs as joel, starts at boot, keeps the operator tunnel alive, and is installed by the same critical-daemon installer used for the other host control-plane services.
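The sketch below shows which launchd keys implement "runs as joel, starts at boot, keeps the operator tunnel alive". The function prints one plausible plist. The repo's actual infra/launchd/com.joel.kube-operator-access.plist may differ, and the script path, log directory, and ThrottleInterval value are all assumptions:

```bash
#!/usr/bin/env bash
# Prints a candidate LaunchDaemon plist for the operator tunnel.
# $1: path to the tunnel script (assumed); $2: log directory (assumed, default /tmp).
render_operator_plist() {
  local script_path="$1" log_dir="${2:-/tmp}"
  cat <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.joel.kube-operator-access</string>
  <!-- System daemon, but the job itself runs as joel, per the ADR. -->
  <key>UserName</key>
  <string>joel</string>
  <key>ProgramArguments</key>
  <array>
    <string>/bin/bash</string>
    <string>${script_path}</string>
  </array>
  <!-- Start at boot and restart whenever the tunnel process exits. -->
  <key>RunAtLoad</key>
  <true/>
  <key>KeepAlive</key>
  <true/>
  <!-- Assumed value: avoid a tight restart loop while the VM is still booting. -->
  <key>ThrottleInterval</key>
  <integer>10</integer>
  <key>StandardOutPath</key>
  <string>${log_dir}/kube-operator-access.log</string>
  <key>StandardErrorPath</key>
  <string>${log_dir}/kube-operator-access.err.log</string>
</dict>
</plist>
EOF
}
```

To check a rendered copy, run render_operator_plist /path/to/infra/kube-operator-access.sh /tmp > /tmp/op.plist and then plutil -lint /tmp/op.plist. KeepAlive is the key that makes launchd bring the tunnel back after ssh exits.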
3. Use Colima SSH config, but not the generic mux path
The daemon must use:
ssh -F ~/.colima/_lima/colima/ssh.config -o ControlMaster=no -o ExitOnForwardFailure=yes
That avoids trusting the generic Lima mux path for long-lived operator access after rebuild/recovery churn.
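The full ssh invocation for this decision can be sketched as follows. ExitOnForwardFailure=yes matters because launchd only restarts a process that exits. If either port fails to bind, ssh dies and KeepAlive brings it back, instead of leaving a tunnel that forwards one port and silently drops the other.

Several details here are assumptions. The Lima host alias is assumed to be lima-colima, and the ServerAlive values are illustrative. ControlPath=none is an extra option not named in the ADR. ControlMaster=no only stops ssh from becoming a mux master; ssh can still join an existing mux socket as a client if ssh.config names a ControlPath. ControlPath=none disables connection sharing entirely. The script sticks to bash 3.2 features because /bin/bash on macOS is 3.2:

```bash
#!/usr/bin/env bash
# Sketch of the operator tunnel invocation. Bash 3.2 compatible (macOS /bin/bash).
# Assumption: Lima's ssh.config names the host "lima-colima".

COLIMA_SSH_CONFIG="${COLIMA_SSH_CONFIG:-$HOME/.colima/_lima/colima/ssh.config}"
LIMA_HOST="${LIMA_HOST:-lima-colima}"

# Fills the global OPERATOR_SSH_ARGS array with the tunnel's ssh arguments.
build_operator_ssh_args() {
  OPERATOR_SSH_ARGS=(
    -F "$COLIMA_SSH_CONFIG"
    -o ControlMaster=no          # never become a mux master
    -o ControlPath=none          # assumption: also never join an existing mux socket
    -o ExitOnForwardFailure=yes  # exit (so launchd restarts us) if a port will not bind
    -o ServerAliveInterval=15    # assumption: notice a dead VM within ~45s
    -o ServerAliveCountMax=3
    -N                           # forwards only, no remote command
    -L 127.0.0.1:16443:10.5.0.2:6443    # kube-apiserver
    -L 127.0.0.1:15000:10.5.0.2:50000   # Talos API (apid)
    "$LIMA_HOST"
  )
}

# Replaces the shell with ssh so launchd supervises the tunnel process directly.
run_operator_tunnel() {
  build_operator_ssh_args
  exec ssh "${OPERATOR_SSH_ARGS[@]}"
}
```

Using exec means no wrapper shell sits between launchd and ssh, so a dead tunnel is a dead job and KeepAlive applies directly.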
4. kubectl and talosctl should target the stable local operator ports
The daemon is responsible for rewriting:
- ~/.talos/config → endpoint 127.0.0.1:15000, node 10.5.0.2
- ~/.kube/config → cluster server https://127.0.0.1:16443
The kubeconfig may skip TLS verification (insecure-skip-tls-verify) for this loopback endpoint if the rebuilt operator path still presents a certificate chain that is correct for the in-VM endpoint but does not verify cleanly through the loopback tunnel.
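The config rewrite can be sketched using the CLIs' own config subcommands (talosctl config endpoint/node and kubectl config set-cluster) rather than editing YAML by hand. DRY_RUN and KUBE_INSECURE_LOOPBACK are hypothetical switches. The kubeconfig cluster name is passed in because the ADR does not name it:

```bash
#!/usr/bin/env bash
# Sketch: point talosctl and kubectl at the loopback operator ports.
# DRY_RUN=1 prints each command instead of running it (hypothetical switch).

run() {
  if [[ "${DRY_RUN:-0}" == 1 ]]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Talos: connect to the local forward, but keep the in-VM node address as the target.
rewrite_talos_config() {
  run talosctl config endpoint 127.0.0.1:15000
  run talosctl config node 10.5.0.2
}

# kubectl: $1 is the kubeconfig cluster entry to rewrite (name not fixed by the ADR).
rewrite_kube_config() {
  local cluster="$1"
  run kubectl config set-cluster "$cluster" --server=https://127.0.0.1:16443
  # Opt-in only: skip TLS verification when the loopback path will not verify.
  if [[ "${KUBE_INSECURE_LOOPBACK:-0}" == 1 ]]; then
    run kubectl config set-cluster "$cluster" --insecure-skip-tls-verify=true
  fi
}
```

The current context's cluster name is available from kubectl config view --minify -o jsonpath='{.clusters[0].name}'. The two configs are deliberately asymmetric. The talosctl endpoint moves to the loopback forward, but the node stays 10.5.0.2, because apid at the endpoint routes each request to the node address.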
5. Distinct operator tunnel good; duplicate app-port tunnel bad
This ADR does not revive com.joel.colima-tunnel.
The rule is:
- dedicated operator-only ports are allowed,
- duplicate ownership of Colima/Lima-published runtime ports is not.
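One way to make this rule mechanical is for the daemon to refuse any forward outside the two operator ports before it starts ssh. In the minimal sketch below, the allowlist comes from this ADR. The function name and the spec format (bind:lport:host:rport, as passed to ssh -L) are assumptions:

```bash
#!/usr/bin/env bash
# Guard sketch for rule 5: only loopback forwards on the dedicated operator
# ports are allowed. The allowlist is from this ADR; the function is hypothetical.

OPERATOR_PORTS="16443 15000"

# $1: an ssh -L spec "bind:lport:host:rport". Succeeds only when it binds
# 127.0.0.1 on an operator port; anything else risks taking a runtime port.
is_allowed_operator_forward() {
  local bind lport rest p
  IFS=: read -r bind lport rest <<<"$1"
  [[ "$bind" == "127.0.0.1" ]] || return 1
  for p in $OPERATOR_PORTS; do
    [[ "$lport" == "$p" ]] && return 0
  done
  return 1
}
```

Under this guard, 127.0.0.1:6443:10.5.0.2:6443 is refused. Port 6443 is the Colima/Lima-published path, and forwarding it would recreate the duplicate ownership that retired com.joel.colima-tunnel.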
Consequences
Positive
- kubectl/talosctl stop depending on a manual, ad hoc tunnel.
- Operator access survives reboot as part of the critical launchd surface.
- The contract is explicit: runtime ports belong to Colima/Lima, operator ports belong to the operator daemon.
- The system keeps the stable core runtime while hardening the operator plane separately.
Negative
- Operator access now depends on one more critical daemon.
- The kubeconfig uses a loopback tunnel rather than the direct host-published 6443 path.
- The workaround exposes that the direct published kube path is still not boring enough to trust as the canonical operator plane.
Implementation Plan
Required skills
- k8s
- system-architecture
- adr-skill
- clawmail
Affected paths
- infra/kube-operator-access.sh
- infra/launchd/com.joel.kube-operator-access.plist
- infra/install-critical-launchdaemons.sh
- docs/deploy.md
- skills/k8s/SKILL.md
- skills/k8s/references/operations.md
- skills/system-architecture/SKILL.md
Required changes
- Add a repo-managed daemon that owns 16443 and 15000.
- Install it through install-critical-launchdaemons.sh.
- Kill stale manual operator tunnels during install.
- Rewrite the kubectl and talosctl configs to target the stable loopback endpoints.
- Document the distinction between operator-only tunnels and forbidden duplicate app-port tunnels.
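The install-time steps could look roughly like the sketch below inside install-critical-launchdaemons.sh. It makes three assumptions: the plist lands in /Library/LaunchDaemons (the standard system location), the installer runs as root, and a stale manual tunnel is any ssh process forwarding local port 16443 or 15000. The pkill pattern is hypothetical and should match however those tunnels were actually started:

```bash
#!/usr/bin/env bash
# Sketch of the install steps for the operator daemon; run as root.
# DRY_RUN=1 prints each command instead of running it (hypothetical switch).

LABEL=com.joel.kube-operator-access
PLIST_DST="/Library/LaunchDaemons/${LABEL}.plist"

run() {
  if [[ "${DRY_RUN:-0}" == 1 ]]; then printf '%s\n' "$*"; else "$@"; fi
}

# $1: the repo plist, e.g. infra/launchd/com.joel.kube-operator-access.plist
install_operator_daemon() {
  local plist_src="$1"
  # Unload any previous instance so it releases 16443/15000 (ignore "not loaded").
  run launchctl bootout "system/${LABEL}" || true
  # Kill stale manual tunnels on the operator ports (hypothetical ERE pattern).
  run pkill -f 'ssh .*-L *(127\.0\.0\.1:)?(16443|15000):' || true
  # System daemon plists must be root:wheel and not group/world writable.
  run install -m 644 -o root -g wheel "$plist_src" "$PLIST_DST"
  # Load it; RunAtLoad starts the tunnel immediately.
  run launchctl bootstrap system "$PLIST_DST"
}
```

The order is deliberate. Booting out the old daemon and killing manual tunnels first frees the operator ports, so the new instance's ExitOnForwardFailure check does not trip on a predecessor.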
Verification
- launchctl print system/com.joel.kube-operator-access shows the daemon running.
- kubectl get nodes works through 127.0.0.1:16443.
- talosctl -e 127.0.0.1:15000 -n 10.5.0.2 health works.
- Installer output includes com.joel.kube-operator-access.
- Docs/skills describe the operator plane without reviving com.joel.colima-tunnel.
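The first three checks can be scripted so that each failure reports its own name. The helper names in this sketch are hypothetical. The "state = running" match assumes the usual launchctl print output format:

```bash
#!/usr/bin/env bash
# Sketch: run the ADR's verification checks and report each one by name.

# check NAME CMD...: runs CMD quietly, prints ok/FAIL, and returns CMD's success.
check() {
  local name="$1"; shift
  if "$@" >/dev/null 2>&1; then
    printf 'ok   %s\n' "$name"
  else
    printf 'FAIL %s\n' "$name"
    return 1
  fi
}

# Assumes `launchctl print` reports "state = running" for a live job.
daemon_running() {
  launchctl print "system/$1" 2>/dev/null | grep -q 'state = running'
}

# Succeeds only if the current kubeconfig context points at the operator port.
kube_server_is_loopback() {
  [[ "$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}')" == "https://127.0.0.1:16443" ]]
}

# Runs every check (no early exit) and fails if any of them failed.
verify_operator_plane() {
  local rc=0
  check "daemon running" daemon_running com.joel.kube-operator-access || rc=1
  check "kubeconfig targets 127.0.0.1:16443" kube_server_is_loopback || rc=1
  check "kubectl get nodes" kubectl get nodes || rc=1
  check "talosctl health" talosctl -e 127.0.0.1:15000 -n 10.5.0.2 health || rc=1
  return "$rc"
}
```

The kubeconfig check is included so that "kubectl get nodes works" cannot pass by accident through some other server entry.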
Non-goals
- Fixing the direct host-published 6443 path in this ADR.
- Reintroducing duplicate tunnel ownership for runtime ports.
- Restoring wave-2 services.
Follow-up
- If the direct host-published kube path becomes truly boring later, supersede this ADR with a simpler operator-plane contract.
- Keep wave-2 restore work separate from operator-plane hardening so substrate truth stays obvious.