Tailscale Kubernetes Operator for Service Mesh
Context
The joelclaw k8s cluster (Talos on Colima, Mac Mini) exposes services via a 6-hop chain:
Tailnet device → Tailscale → panda:port → Caddy HTTPS → localhost:port
→ Lima SSH mux → Docker port map → Talos NodePort → Pod

This works but has compounding friction:
- Docker port maps are immutable — adding a new service port requires full cluster recreation (`talosctl cluster destroy` + recreate with new `--exposed-ports`). This is the #1 operational pain point.
- Lima SSH mux is a single point of failure — killing any process on a forwarded port can take down ALL tunnels.
- Every new service needs 3 manual steps: Docker port mapping, NodePort service, Caddy HTTPS entry.
- No per-service access control — once you can reach panda, you can reach everything.
- Adding Typesense (ADR-0082) requires cluster recreation anyway — perfect opportunity to fix the networking layer.
Current port inventory (all require Docker mapping + Caddy):
| Service | Ports | Caddy HTTPS |
|---|---|---|
| Redis | 6379 | direct (no TLS) |
| Qdrant | 6333, 6334 | :6443 |
| Inngest | 8288, 8289 | :9443, :8290 |
| LiveKit | 7880, 7881 | :7443 |
| PDS | 9627→3000 | port-forward |
| Worker | 3111 | :3443 |
| Typesense (new) | 8108 | would need entry |
That’s 12 port mappings across 7 services, each requiring manual Docker + Caddy config.
Decision
Deploy the Tailscale Kubernetes Operator to the joelclaw cluster. Each k8s service gets a first-class Tailscale identity, accessible from any tailnet device via MagicDNS. Eliminates Docker port mapping, Lima tunneling, and most Caddy config.
How It Works
```
          Tailscale Control Plane
                      │
         ┌────────────┼────────────┐
         │            │            │
    ┌────▼────┐  ┌────▼────┐  ┌────▼────┐
    │  panda  │  │ clanker │  │ laptop  │
    │(Mac Mini│  │  -001   │  │ /phone  │
    └────┬────┘  └─────────┘  └─────────┘
         │
┌────────▼──────────────────────────┐
│  Colima VM → Talos k8s cluster    │
│                                   │
│  ┌──────────────────────┐         │
│  │  Tailscale Operator  │         │
│  │ (watches annotations)│         │
│  └──────────┬───────────┘         │
│             │ creates             │
│  ┌──────────▼───────────┐         │
│  │ Per-service proxies  │         │
│  │   redis.ts.net       │         │
│  │   inngest.ts.net     │         │
│  │   typesense.ts.net   │         │
│  │   livekit.ts.net     │         │
│  └──────────────────────┘         │
└───────────────────────────────────┘
```

Service Exposure
Each service gets exposed by adding one annotation:
```yaml
metadata:
  annotations:
    tailscale.com/expose: "true"
    tailscale.com/hostname: "typesense"
spec:
  type: ClusterIP  # NOT NodePort — no port mapping needed
```

Result: `<internal-tailnet-host>:8108` — accessible from any tailnet device.
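For context, a rough sketch of what the full Typesense Service manifest could look like with that annotation in place; the `app: typesense` selector is an assumption about how the Typesense pods will be labeled.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: typesense
  annotations:
    tailscale.com/expose: "true"      # operator creates a tailnet proxy for this Service
    tailscale.com/hostname: "typesense"
spec:
  type: ClusterIP                     # no NodePort, no Docker port map
  selector:
    app: typesense                    # assumed pod label
  ports:
    - name: api
      port: 8108
      targetPort: 8108
```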
What Changes
| Before (NodePort + Caddy) | After (Tailscale Operator) |
|---|---|
| Add Docker port map → cluster recreation | kubectl annotate svc → done |
| Configure Caddy HTTPS entry | Tailscale provides TLS automatically |
| Lima SSH mux tunnels all ports | Direct WireGuard mesh, no tunneling |
| NodePort on every service | ClusterIP (simpler, no port conflicts) |
| One IP for all services (panda) | Each service gets own tailnet identity |
| No per-service ACLs | Tailscale ACL tags per service |
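The first row of the table in practice: exposing an already-deployed service is a single command (service name `redis` assumed to match the existing manifest).

```bash
# Expose the existing Redis Service on the tailnet; the operator picks up the
# annotations and spins up a proxy. No Docker mapping, no cluster recreation.
kubectl annotate service redis \
  tailscale.com/expose="true" \
  tailscale.com/hostname="redis"
```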
Installation
```bash
# One-time: create OAuth client in Tailscale admin console
kubectl create namespace tailscale
kubectl create secret generic tailscale-oauth \
  --namespace tailscale \
  --from-literal=clientId=$TS_OAUTH_ID \
  --from-literal=clientSecret=$TS_OAUTH_SECRET

helm repo add tailscale https://pkgs.tailscale.com/helmcharts
helm install tailscale-operator tailscale/tailscale-operator \
  --namespace tailscale \
  --set oauth.clientId=$TS_OAUTH_ID \
  --set oauth.clientSecret=$TS_OAUTH_SECRET
```
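A quick sanity check after install (nothing here beyond the namespace created above and a tailnet device to run `tailscale status` from):

```bash
# Operator pod should reach Running in the tailscale namespace
kubectl -n tailscale get pods

# From any tailnet device: the operator (and later each per-service proxy)
# should appear in the machine list
tailscale status
```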
Resource Cost
- Operator pod: ~50m CPU, 64Mi memory
- Per-service proxy: ~10m CPU, 20Mi memory each
- 6 services × 20Mi = ~120Mi total proxy overhead
- Well within Mac Mini’s 64GB capacity
Migration Plan
Deploy alongside the Typesense rollout (ADR-0082) during the cluster recreation:
- Create OAuth client in Tailscale admin console with the `tag:k8s-service` tag
- Recreate cluster with Typesense port (8108) — last time we need to add Docker ports
- Install Tailscale operator via Helm
- Annotate existing services with `tailscale.com/expose: "true"`
- Verify MagicDNS — `<internal-tailnet-host>`, `<internal-tailnet-host>`, etc. (a quick check is sketched after this list)
- Update Caddy — strip internal service proxies, keep only Funnel (public webhooks)
- Update NEIGHBORHOOD.md — new service URLs
- Future services — just annotate, no cluster recreation ever again
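For the MagicDNS step, one hedged way to confirm from any tailnet device, using Typesense's unauthenticated `/health` endpoint and a Redis `PING`; the hostnames are the same placeholders used above.

```bash
# Typesense proxy answers on its MagicDNS name
curl http://<internal-tailnet-host>:8108/health     # expect {"ok":true}

# Redis proxy reachable over the mesh
redis-cli -h <internal-tailnet-host> -p 6379 ping   # expect PONG
```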
What Caddy Keeps
After migration, Caddy only handles:
- Webhook Funnel (`:8443`) — public internet → Tailscale Funnel → Caddy → worker (rough sketch below)
- Everything else served directly by Tailscale operator proxies with auto-TLS
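Roughly, the surviving Caddy config collapses to a single site block; the upstream shown here assumes the worker stays reachable on the tailnet at its current port 3111.

```
:8443 {
    # Public webhooks: Funnel terminates here, Caddy forwards to the worker
    reverse_proxy <internal-tailnet-host>:3111
}
```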
Consequences
Positive
- Never recreate cluster for ports again — the biggest operational win
- MagicDNS — `<internal-tailnet-host>` instead of `panda:8108`
- Per-service ACLs — Redis only reachable from panda, Inngest from panda + clanker (see the policy sketch after this list)
- Simpler Caddy — only public webhook funnel
- Auto-TLS — Tailscale handles certs, no manual cert renewal
- Multi-machine access — clanker-001, laptop, phone all get direct mesh connections to k8s services
- Future-proof — adding any new service is `kubectl annotate`, not infrastructure surgery
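To ground the per-service ACL point: the operator lets each proxy carry its own tag (via a `tailscale.com/tags` annotation, per the operator docs), and the tailnet policy then grants access per tag. A sketch under assumed tag names; `tag:panda` and `tag:svc-redis` do not exist yet.

```jsonc
{
  "tagOwners": {
    "tag:panda":     ["autogroup:admin"],
    "tag:svc-redis": ["autogroup:admin"]
  },
  "acls": [
    // Redis proxy only reachable from panda, and only on the Redis port
    { "action": "accept", "src": ["tag:panda"], "dst": ["tag:svc-redis:6379"] }
  ]
}
```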
Negative
- More tailnet devices — ~6 proxy devices added to machine list (cosmetic)
- OAuth client management — one-time setup, but requires Tailscale admin console access
- Operator dependency — if operator pod dies, proxy pods still run but won’t update
Risks
- Colima VM networking — Tailscale operator needs outbound internet from inside the Talos container. Should work (pods can reach internet today for LiveKit, PDS), but verify during install.
- Talos compatibility — Tailscale operator is well-tested on standard k8s. Talos-in-Docker-in-Colima is unusual. May need `hostNetwork: true` on the operator or proxy pods (a fallback patch is sketched below).
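If that fallback turns out to be necessary, a hedged sketch; the Deployment name is an assumption, so check what the Helm chart actually created first.

```bash
# Confirm the operator Deployment name created by the chart
kubectl -n tailscale get deploy

# Hypothetical fallback: move the operator onto the host network
# (substitute the real Deployment name for "operator")
kubectl -n tailscale patch deploy operator --type=merge \
  -p '{"spec":{"template":{"spec":{"hostNetwork":true}}}}'
```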
References
- Tailscale Kubernetes Operator docs
- Tailscale Helm chart
- Tailscale blog: Mesh your k8s cluster
- ADR-0029: Colima + Talos k8s cluster
- ADR-0082: Typesense unified search (paired deployment)