r/kubernetes
Minimal downtime with MetalLB BGP and Envoy Gateway
Hi everyone,
I'm trying to minimize downtime in a cluster using MetalLB (BGP with BFD) and Envoy Gateway, but I'm struggling to find a configuration that handles both graceful shutdowns (node drain) and sudden node failures (power off) smoothly.
Here is what I've observed so far with two different externalTrafficPolicy settings:
Option 1: externalTrafficPolicy: Local
- Power off (Sudden failure): Works great. MetalLB BFD stops responding immediately, BGP withdraws the route, and the downtime is under 4 seconds.
- Node Drain / Maintenance: Causes issues. When I drain a node (or label it with node.kubernetes.io/exclude-from-external-load-balancers=true), there is a short window with "connect: connection refused" errors. The Envoy pod enters a Terminating state, sends a GOAWAY on existing connections, and refuses new ones. However, MetalLB takes a moment to notice that there are no active endpoints left on that node and withdraw the route, so requests that arrive in between are dropped.
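To make the drain race above concrete, here is a minimal Python sketch of the timing, not anything MetalLB or Envoy Gateway actually exposes. The DrainTimeline type, the refused_window function, and all the numbers are hypothetical and only there for illustration. The point is that the refused-connection window is the gap between Envoy closing its listener and the router dropping the route, so it only disappears if the route is withdrawn first.

```python
from dataclasses import dataclass


@dataclass
class DrainTimeline:
    """Times in seconds, measured from the moment the drain starts."""
    listener_closed_at: float  # Envoy stops accepting new connections
    route_withdrawn_at: float  # upstream router stops using the BGP route


def refused_window(t: DrainTimeline) -> float:
    """Seconds during which the router still sends traffic to a closed listener."""
    return max(0.0, t.route_withdrawn_at - t.listener_closed_at)


# Hypothetical numbers, not measurements from my cluster.
observed = DrainTimeline(listener_closed_at=0.0, route_withdrawn_at=2.0)
delayed = DrainTimeline(listener_closed_at=5.0, route_withdrawn_at=2.0)

print(refused_window(observed))  # 2.0
print(refused_window(delayed))   # 0.0
```

In other words, what I am looking for with the Local policy is some way to get from the first timeline to the second.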
Option 2: externalTrafficPolicy: Cluster
- Node Drain / Maintenance: Works flawlessly. Cilium smoothly redirects new TCP sessions to Envoy pods running on other healthy nodes. Zero downtime.
- Power off (Sudden failure): Breaks. BFD drops the BGP route to the dead node within 4 seconds, so the top-of-rack router stops sending traffic directly to it. However, because there is no active health checking between Cilium (on the other nodes) and the Envoy pod on the dead node, Cilium keeps routing a portion of internal cluster traffic to the dead node for the next 40 seconds, until Kubernetes officially marks the node as NotReady.
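A rough back-of-the-envelope estimate of what that stale window costs under the Cluster policy can be sketched in Python. This assumes one Envoy pod per node and that Cilium spreads new connections evenly across all backends it still believes are alive; both are my assumptions, and retries are ignored. The failed_new_connections function, the request rate, and the node count are hypothetical; only the 40 s and 4 s figures come from what I observed.

```python
def failed_new_connections(rate_per_s: float, backends: int, dead: int,
                           stale_window_s: float) -> float:
    """Estimate new connections sent to dead backends while they stay in rotation.

    Assumes uniform backend selection (an assumption, not confirmed Cilium behaviour).
    """
    if backends <= 0 or not 0 <= dead <= backends:
        raise ValueError("need backends > 0 and 0 <= dead <= backends")
    return rate_per_s * (dead / backends) * stale_window_s


# Hypothetical load: 100 new connections/s across 4 nodes, 1 node powered off.
print(failed_new_connections(100, backends=4, dead=1, stale_window_s=40))  # 1000.0
# Same load if the dead backend left rotation as fast as BFD reacts (4 s).
print(failed_new_connections(100, backends=4, dead=1, stale_window_s=4))   # 100.0
```

So under these assumptions the gap between BFD-speed detection and waiting for NotReady is roughly a factor of ten in failed connections.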
My Question:
What is the correct architectural approach here? I am aiming for zero downtime during planned maintenance and as low downtime as possible during sudden node malfunctions.
Is there a way to make Cilium detect dead pods faster under the Cluster policy, or a way to force MetalLB to withdraw the BGP route before Envoy stops accepting connections under the Local policy?
Thanks in advance for any insights!
u/NegotiationIcy8547 — 2 days ago