ADR-0245 (accepted)

Stable kube operator access on dedicated local tunnels

Context

The Colima/Talos rebuild restored the core runtime, but it exposed an operator-plane problem:

  • the core services were healthy,
  • 10.5.0.2:6443 inside the VM was healthy,
  • but the direct host-published kube API path could still return TLS garbage on 127.0.0.1:6443.

That is unacceptable for operator access. A stable substrate is not enough if kubectl and talosctl still depend on an ad hoc manual tunnel to work.

The old com.joel.colima-tunnel daemon is still dead and should stay dead. It fought Lima for ownership of app-facing ports that Colima already published. We need a different contract: one that hardens the operator plane without reintroducing duplicate ownership on runtime/service ports.

Decision

1. Operator access gets its own dedicated local ports

The canonical operator plane is now:

  • 127.0.0.1:16443 -> 10.5.0.2:6443 for kube-apiserver
  • 127.0.0.1:15000 -> 10.5.0.2:50000 for the Talos API

These ports are dedicated to kubectl/talosctl and are intentionally separate from Colima/Lima-published runtime ports.

2. The operator tunnel is a launchd-managed critical daemon

com.joel.kube-operator-access is a repo-managed system LaunchDaemon.

It runs as joel, starts at boot, keeps the operator tunnel alive, and is installed by the same critical-daemon installer used for the other host control-plane services.

3. Use Colima SSH config, but not the generic mux path

The daemon must use:

  • ssh -F ~/.colima/_lima/colima/ssh.config
  • -o ControlMaster=no
  • -o ExitOnForwardFailure=yes

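ControlMaster=no stops the daemon from setting up a shared multiplexing master, which avoids trusting the generic Lima mux path for long-lived operator access after rebuild/recovery churn. ExitOnForwardFailure=yes makes ssh exit instead of lingering when a requested port forward cannot be established.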
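
A minimal sketch of how the forwards from section 1 and the ssh options above could be combined into one invocation. The -F path, ControlMaster=no, ExitOnForwardFailure=yes, and the port mappings come from this ADR. The -N flag, the lima-colima host alias, and the DRY_RUN switch are assumptions for illustration; this is not the contents of infra/kube-operator-access.sh.

```shell
#!/bin/sh
# Sketch only, not the real infra/kube-operator-access.sh.
# Assumption: the Host alias in Lima's generated ssh.config is "lima-colima".
LIMA_SSH_CONFIG="${LIMA_SSH_CONFIG:-$HOME/.colima/_lima/colima/ssh.config}"
SSH_HOST="${SSH_HOST:-lima-colima}"

run_operator_tunnel() {
  # Build the argument list in the function's positional parameters.
  # -N (assumption): forward ports only, run no remote command.
  set -- ssh -F "$LIMA_SSH_CONFIG" -N \
    -o ControlMaster=no \
    -o ExitOnForwardFailure=yes \
    -L 127.0.0.1:16443:10.5.0.2:6443 \
    -L 127.0.0.1:15000:10.5.0.2:50000 \
    "$SSH_HOST"

  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Hypothetical dry-run switch: print the command instead of running it.
    printf '%s\n' "$*"
  else
    # Replace the shell with ssh so a supervisor sees ssh's own exit status.
    exec "$@"
  fi
}
```

Setting DRY_RUN=1 before calling run_operator_tunnel prints the command line without opening a connection, which is a cheap way to review the flags before handing the script to launchd.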

4. kubectl and talosctl should target the stable local operator ports

The daemon is responsible for rewriting:

  • ~/.talos/config → endpoint 127.0.0.1:15000, node 10.5.0.2
  • ~/.kube/config → cluster server https://127.0.0.1:16443

The kubeconfig may skip TLS verification (insecure-skip-tls-verify) for this loopback endpoint if the rebuilt operator path still presents a certificate chain that is valid for the in-VM endpoint (10.5.0.2:6443) but does not verify cleanly against 127.0.0.1 on the tunnel.
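
The rewrite described above could be sketched with the stock kubectl and talosctl config subcommands. The cluster name in KUBE_CLUSTER, the INSECURE_TLS toggle, and the DRY_RUN switch are hypothetical names introduced here; the ADR does not say how the daemon performs the rewrite.

```shell
#!/bin/sh
# Sketch only: point kubectl and talosctl at the loopback operator ports.
# KUBE_CLUSTER is a hypothetical cluster name; use the one in your kubeconfig.
KUBE_CLUSTER="${KUBE_CLUSTER:-my-cluster}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

rewrite_operator_configs() {
  # ~/.talos/config: endpoint is the tunnel, node stays the in-VM address.
  run talosctl config endpoint 127.0.0.1:15000
  run talosctl config node 10.5.0.2

  # ~/.kube/config: cluster server becomes the loopback operator port.
  run kubectl config set-cluster "$KUBE_CLUSTER" --server=https://127.0.0.1:16443

  # Opt-in only (assumption): skip TLS verification when the certificate
  # chain does not verify against 127.0.0.1.
  if [ "${INSECURE_TLS:-0}" = "1" ]; then
    run kubectl config set-cluster "$KUBE_CLUSTER" --insecure-skip-tls-verify=true
  fi
}
```

Keeping the TLS skip behind an explicit toggle means the insecure setting is applied only when the certificate mismatch is actually observed, rather than by default.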

5. Distinct operator tunnel good; duplicate app-port tunnel bad

This ADR does not revive com.joel.colima-tunnel.

The rule is:

  • dedicated operator-only ports are allowed,
  • duplicate ownership of Colima/Lima-published runtime ports is not.

Consequences

Positive

  • kubectl/talosctl stop depending on a manual, ad hoc tunnel.
  • Operator access survives reboot as part of the critical launchd surface.
  • The contract is explicit: runtime ports belong to Colima/Lima, operator ports belong to the operator daemon.
  • The system keeps the stable core runtime while hardening the operator plane separately.

Negative

  • Operator access now depends on one more critical daemon.
  • The kubeconfig uses a loopback tunnel rather than the direct host-published 6443 path.
  • The workaround exposes that the direct published kube path is still not boring enough to trust as the canonical operator plane.

Implementation Plan

Required skills

  • k8s
  • system-architecture
  • adr-skill
  • clawmail

Affected paths

  • infra/kube-operator-access.sh
  • infra/launchd/com.joel.kube-operator-access.plist
  • infra/install-critical-launchdaemons.sh
  • docs/deploy.md
  • skills/k8s/SKILL.md
  • skills/k8s/references/operations.md
  • skills/system-architecture/SKILL.md

Required changes

  1. Add a repo-managed daemon that owns 16443 and 15000.
  2. Install it through install-critical-launchdaemons.sh.
  3. Kill stale manual operator tunnels during install.
  4. Rewrite kubectl/talosctl configs toward the stable loopback endpoints.
  5. Document the distinction between operator-only tunnels and forbidden duplicate app-port tunnels.

Verification

  • launchctl print system/com.joel.kube-operator-access shows the daemon running.
  • kubectl get nodes works through 127.0.0.1:16443.
  • talosctl -e 127.0.0.1:15000 -n 10.5.0.2 health works.
  • Installer output includes com.joel.kube-operator-access.
  • Docs/skills describe the operator plane without reviving com.joel.colima-tunnel.
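
The command-level checks in this list could be wrapped in a small script like the sketch below. The check helper, its output format, and the 10s request timeout are assumptions for illustration; the commands themselves are the ones named above. The installer-output and docs checks are left to manual review.

```shell
#!/bin/sh
# Sketch only: run the operator-plane checks and count failures.
failures=0

# check LABEL COMMAND...: run COMMAND quietly and report ok/FAIL.
check() {
  label=$1
  shift
  if "$@" >/dev/null 2>&1; then
    printf 'ok   %s\n' "$label"
  else
    printf 'FAIL %s\n' "$label"
    failures=$((failures + 1))
  fi
}

verify_operator_plane() {
  # Exit status 0 means the service is loaded in the system domain;
  # read the printed state by hand to confirm it is actually running.
  check "daemon loaded" launchctl print system/com.joel.kube-operator-access

  # Relies on ~/.kube/config already pointing at https://127.0.0.1:16443.
  check "kubectl via 16443" kubectl get nodes --request-timeout=10s

  check "talosctl via 15000" talosctl -e 127.0.0.1:15000 -n 10.5.0.2 health

  # Succeed only if every check passed.
  [ "$failures" -eq 0 ]
}
```

Calling verify_operator_plane after an install or a reboot gives one pass/fail result for the operator plane.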

Non-goals

  • Fixing the direct host-published 6443 path in this ADR.
  • Reintroducing duplicate tunnel ownership for runtime ports.
  • Restoring wave-2 services.

Follow-up

  1. If the direct host-published kube path becomes truly boring later, supersede this ADR with a simpler operator-plane contract.
  2. Keep wave-2 restore work separate from operator-plane hardening so substrate truth stays obvious.