Replace Flannel + kube-proxy with Cilium
Context and Problem Statement
The Talos cluster (joelclaw) currently runs with the default CNI: Flannel + kube-proxy. This is the simplest path — Talos bundles Flannel out of the box and it works with zero config.
However, Flannel is a minimal overlay network. It provides pod-to-pod connectivity and nothing else: no network policies, no advanced load balancing, no observability. kube-proxy uses iptables rules for service routing, which works but scales poorly (rule evaluation grows with the number of services) and provides no visibility into traffic flows.
Community advice (from X, Feb 2026) recommends Cilium as a kube-proxy replacement for anyone moving beyond basic testing.
Research Findings
What Cilium actually is
Cilium is an eBPF-based CNI that replaces both Flannel (pod networking) and kube-proxy (service routing) in a single component. It’s a CNCF graduated project (same tier as Kubernetes itself).
Key capabilities beyond Flannel:
- eBPF dataplane: Bypasses iptables entirely. Service routing happens in kernel space via eBPF programs. Measurably lower latency and higher throughput than iptables-based kube-proxy.
- Network policies: Kubernetes NetworkPolicy support (Flannel has none), plus CiliumNetworkPolicy for L7-aware rules (see the sketch after this list).
- Hubble observability: Built-in flow visibility (which pods talk to which, protocol-level metrics, DNS resolution tracking). Available as a CLI (`hubble`), a UI, and Prometheus metrics.
- L2/BGP load balancing: Can replace MetalLB for bare-metal LoadBalancer services. Fewer moving parts.
- Identity-based security: Pods get cryptographic identities, not just IP-based rules.
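
To make the L7 point concrete, here is a minimal sketch of a CiliumNetworkPolicy for this cluster. The namespace, labels, and port are illustrative assumptions rather than values taken from the joelclaw manifests; the point is that the HTTP-method restriction in the last rule is something plain Kubernetes NetworkPolicy cannot express.

```yaml
# Sketch only: namespace, labels, and port are assumed, not copied from the cluster.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: qdrant-readonly-from-inngest
  namespace: joelclaw
spec:
  endpointSelector:
    matchLabels:
      app: qdrant              # the policy attaches to the Qdrant pods
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: inngest       # only Inngest pods may connect
      toPorts:
        - ports:
            - port: "6333"     # Qdrant HTTP API port
              protocol: TCP
          rules:
            http:
              - method: GET    # L7 rule: allow read-only requests only
```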
Talos + Cilium is a documented, first-class path
Siderolabs’ own Kubernetes Cluster Reference Architecture (May 2025) states:
“Talos Linux will install Flannel by default. In order to install Cilium or replace Flannel, it is necessary to override the machine config to specify that no CNI should be initially installed.”
Cilium docs have a dedicated Talos installation page. The config is well-documented:
```yaml
# Talos machine config patch to disable Flannel + kube-proxy
cluster:
  network:
    cni:
      name: none
  proxy:
    disabled: true
```

Then install Cilium via Helm with Talos-specific values:
```bash
helm install cilium cilium/cilium \
  --namespace kube-system \
  --set ipam.mode=kubernetes \
  --set kubeProxyReplacement=true \
  --set k8sServiceHost=localhost \
  --set k8sServicePort=7445 \
  --set cgroup.autoMount.enabled=false \
  --set cgroup.hostRoot=/sys/fs/cgroup
```

Setting `k8sServiceHost=localhost` and `k8sServicePort=7445` points Cilium at Talos's KubePrism, a built-in API server proxy on every node. No external load balancer needed.
Flannel limitations that matter now
- No network policy support at all. Any pod can talk to any other pod (see the sketch after this list).
- No traffic observability. Can’t see what’s talking to what.
- kube-proxy iptables rules become a debugging black hole when services misbehave.
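
For contrast, the sketch below is the kind of baseline policy the first limitation rules out today: under Flannel the object is accepted by the API server but never enforced, whereas Cilium (or any policy-capable CNI) would actually apply it. The namespace name is an assumption.

```yaml
# Sketch: namespace-wide default-deny for ingress (namespace name assumed).
# With Flannel this is a silent no-op; with Cilium it blocks all unlisted traffic.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: joelclaw
spec:
  podSelector: {}      # selects every pod in the namespace
  policyTypes:
    - Ingress          # no ingress rules are listed, so all ingress is denied
```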
Flannel’s case for staying
- Zero config. Works out of the box with Talos.
- Lower resource usage than Cilium (~50-100 MB RAM vs Cilium’s ~300-500 MB per node).
- Simpler mental model for a single-node dev/test cluster.
- Current cluster has 3 services (Redis, Qdrant, Inngest). Not exactly a complex network topology.
Decision Drivers
- Observability: Can we see what’s happening in the network?
- Security: Can we enforce pod-to-pod access rules?
- Operational complexity: How much harder is it to operate?
- Resource cost: What’s the RAM/CPU overhead on a single-node Docker-based cluster?
- Future path: Does this carry forward to bare-metal Pi 5 Talos?
Options
Option A: Keep Flannel + kube-proxy (status quo)
Works today with zero configuration and the smallest resource footprint. No network policies, no observability, no advanced routing. Fine for a 3-service testing cluster.
Option B: Replace with Cilium (kube-proxy replacement mode)
Full eBPF networking stack. Network policies, Hubble observability, service mesh capabilities. More resources, more complexity, but the industry-standard path for Talos clusters.
Current Lean
Not yet. The cluster is single-node Docker-based with 3 services. Cilium’s value scales with cluster complexity. The right trigger is one of:
- Moving to bare-metal multi-node (Pi 5)
- Adding services that need network isolation
- Needing traffic observability for debugging
- Adopting Gateway API (ADR-0031) — Cilium is the natural gateway controller
Implementation Plan (When Ready)
- Recreate Talos cluster with CNI disabled:

  ```bash
  talosctl cluster create docker \
    --name joelclaw \
    --workers 0 \
    --memory-controlplanes 4GiB \
    -p 6379:6379/tcp,6333:6333/tcp,6334:6334/tcp,8288:8288/tcp,8289:8289/tcp \
    --config-patch-controlplanes @controlplane-patch.yaml
  ```

  Where `controlplane-patch.yaml` adds:

  ```yaml
  cluster:
    network:
      cni:
        name: none
    proxy:
      disabled: true
    allowSchedulingOnControlPlanes: true
  ```

- Install Cilium via Helm with Talos values (see helm command above)
- Verify: `cilium status`, `cilium connectivity test`
- Re-apply all k8s manifests (unchanged)
- Install Hubble for observability: `cilium hubble enable --ui`
Verification
- `cilium status` shows all components healthy
- `kubectl get pods -n joelclaw` shows all services running
- `hubble observe` shows traffic flows between pods
- No kube-proxy pods running
- Services accessible on same ports as before
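
The checklist could be scripted roughly as below. This is a sketch under assumptions: the namespace and the kube-proxy DaemonSet name are taken from defaults, and exact output formats vary between Cilium and Hubble versions.

```bash
# Rough post-migration checks (namespace and resource names are assumptions)
cilium status --wait                            # all Cilium components report OK
kubectl get pods -n joelclaw                    # workloads are Running
kubectl get daemonset -n kube-system kube-proxy # should return NotFound after replacement
hubble observe --namespace joelclaw --last 20   # recent pod-to-pod flows are visible
```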
Sources
- Cilium docs — Talos installation: https://docs.cilium.io/en/latest/installation/k8s-install-helm/
- Siderolabs Reference Architecture (2025): https://www.siderolabs.com/wp-content/uploads/2025/08/Kubernetes-Cluster-Reference-Architecture-with-Talos-Linux-for-2025-05.pdf
- Talos + Cilium walkthrough: https://rcwz.pl/2025-10-08-adding-cilium-to-talos-cluster/
- Talos + Cilium on Proxmox: https://unixorn.github.io/post/homelab/k8s/01-talos-with-cilium-cni-on-proxmox/
Deferral Note (2026-02-19)
Deferred. Single-node Talos with Flannel works today. Cilium becomes relevant when the cluster goes multi-node (Pi 5 workers on the roadmap) — network policy, eBPF observability, and kube-proxy replacement matter at that scale. Revisit when adding the first worker node.