r/kubernetes

MetalLB gives confusing errors

Hello,

I'm now trying to get MetalLB working.

So I made these YAML files: https://github.com/RoelofWobben/devops/tree/main/metallb

But every time I try to apply them, I see these errors:

resource mapping not found for name 'default-pool' namespace 'metallb-system' from 'metallb-config.yaml': no matches found for kind 'IPaddressPool' in version metallb.io/v1beta1

ensure CRDs are installed

I use MetalLB 0.16.

Does anyone have an idea how to get out of this mess?
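
The error says the API server has no kind 'IPaddressPool' in metallb.io/v1beta1. Two things can produce that message: the MetalLB CRDs are not registered yet when the config is applied, or the kind is misspelled (the `kind` field is case-sensitive, and MetalLB spells it `IPAddressPool`). A minimal sketch of a pre-apply spelling check, assuming the list of kinds below is what the installed MetalLB version ships; `check_kind` is a hypothetical helper, not part of MetalLB or kubectl:

```python
import difflib

# Kinds defined by the MetalLB CRDs (assumed list; verify against your install).
METALLB_KINDS = [
    "IPAddressPool",
    "L2Advertisement",
    "BGPAdvertisement",
    "BGPPeer",
    "BFDProfile",
    "Community",
]


def check_kind(kind: str) -> str:
    """Return a hint for a manifest `kind` value; the comparison is case-sensitive."""
    if kind in METALLB_KINDS:
        return f"{kind}: ok"
    # Same letters, different capitalisation.
    by_lower = {k.lower(): k for k in METALLB_KINDS}
    if kind.lower() in by_lower:
        return f"{kind}: wrong capitalisation, use {by_lower[kind.lower()]}"
    # Otherwise look for a near miss such as a dropped letter.
    close = difflib.get_close_matches(kind, METALLB_KINDS, n=1)
    if close:
        return f"{kind}: unknown, did you mean {close[0]}?"
    return f"{kind}: unknown kind"


if __name__ == "__main__":
    print(check_kind("IPaddressPool"))
    print(check_kind("IPAddressPool"))
```

If the spelling is already right, the remaining suspect is ordering: the MetalLB install manifest has to be applied, and its CRDs registered, before the pool config is applied.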

reddit.com
u/roelof_w — 6 hours ago

Docker Hub rate limit reached during K8S upgrade, best practices?

We're running into Docker Hub rate limiting during Kubernetes upgrades and I'm curious how others solve this at scale.

Let's say you have 100+ containers coming from external registries (mostly Docker Hub images like busybox, alpine, utility sidecars, etc.).

During a Kubernetes upgrade or large node rotation, eventually new pods start failing with errors like:

Init:failed to pull and unpack image "docker.io/library/busybox:1.37.0": failed to copy: httpReadSeeker: failed open: unexpected status code https://registry-1.docker.io/v2/library/busybox/manifests/sha256:1487d0af5f52b4ba31c7e465126ee2123fe3f2305d638e7827681e7cf6c83d5e: 429 Too Many Requests - Server message: toomanyrequests: You have reached your unauthenticated pull rate limit.

The 101st image pull basically kills the rollout.

I'm interested in how people operating larger clusters handle this in practice. Some options I can think of:

- configuring imagePullSecrets everywhere

- using dedicated ServiceAccounts with registry credentials

- mirroring all external images into an internal/private registry

- registry pull-through cache (Harbor, Artifactory, Nexus, etc.)

- pre-pulling images onto nodes

- completely avoiding Docker Hub in production
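
For the mirroring and pull-through cache options in the list above, the mechanical part is rewriting Docker Hub references so they point at the internal registry. A minimal sketch, assuming a hypothetical cache host `registry.example.internal/dockerhub` and a hypothetical helper `to_mirror`; the path layout (mirror prefix followed by the Docker Hub repository path) is an assumption and depends on the registry product:

```python
MIRROR = "registry.example.internal/dockerhub"  # hypothetical pull-through cache

DOCKER_HUB_HOSTS = ("docker.io", "index.docker.io", "registry-1.docker.io")


def to_mirror(image: str, mirror: str = MIRROR) -> str:
    """Rewrite a Docker Hub image reference to the mirror; leave other registries alone."""
    first, _, rest = image.partition("/")
    # A first path component containing a dot or colon, or equal to
    # "localhost", is treated as a registry host rather than a namespace.
    has_registry = bool(rest) and ("." in first or ":" in first or first == "localhost")
    if has_registry:
        if first not in DOCKER_HUB_HOSTS:
            return image  # not Docker Hub, keep as is
        path = rest
    else:
        path = image
    # Official images such as busybox live under the implicit library/ namespace.
    if "/" not in path:
        path = "library/" + path
    return f"{mirror}/{path}"


if __name__ == "__main__":
    for ref in ("busybox:1.37.0", "docker.io/library/busybox:1.37.0", "quay.io/cilium/cilium:v1.15"):
        print(ref, "->", to_mirror(ref))
```

A function like this could run in CI over rendered manifests, or inside a mutating admission webhook, so individual teams do not have to edit every image reference by hand.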

What has worked best for you operationally?

---

EDIT: The cluster is on AKS.

reddit.com
u/KalnaiK — 13 hours ago
▲ 0 r/kubernetes+2 crossposts

How do you keep track of cloud waste?

At $300k/month Cloud spend, our bill keeps growing faster than our traffic.

Cost Explorer shows the numbers but nobody actually checks it weekly.

Trusted Advisor gives 40+ recommendations with no priority order.

Anomaly detection emails get archived.

What actually works for your team?

Curious about:
- How often someone reviews the bill
- Whether you automate any cleanup
- If you bought a tool, which one and is it used
- War stories from cost incidents

Trying to learn from teams that figured this out.
reddit.com
u/Accomplished_Job_76 — 11 hours ago

Running a node-level binary against a specific pod’s container — Linux and Windows

Hi all,

I want to run a command/binary that exists on the node (not inside the container image) but have it operate in the context of a specific pod’s container — e.g., use the node’s tcpdump to capture traffic on a pod’s network interface, or run a diagnostic tool that isn’t shipped in the container.

On Linux, I know nsenter -t <pid> -n … works for this by entering the container’s namespaces while still executing the node’s binary. Is this the recommended approach, or is there something cleaner (e.g., kubectl debug, ephemeral containers)?
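
To make the Linux approach concrete, here is a minimal sketch that builds the `nsenter` command line for a given container process. The PID `12345`, the interface name `eth0`, the capture path, and the helper `nsenter_cmd` are all hypothetical; the real PID has to come from the container runtime on the node:

```python
import shlex


def nsenter_cmd(pid: int, tool: list[str], net: bool = True, mount: bool = False) -> list[str]:
    """Build an nsenter argv that runs a node binary inside a process's namespaces."""
    if pid <= 0:
        raise ValueError("pid must be a positive process ID")
    cmd = ["nsenter", "-t", str(pid)]
    if net:
        cmd.append("-n")  # enter the target's network namespace
    if mount:
        # Entering the mount namespace switches to the container's filesystem,
        # so a binary that exists only on the node may no longer be found.
        cmd.append("-m")
    return cmd + ["--"] + tool


if __name__ == "__main__":
    # Capture on the pod's interface with the node's tcpdump; the file lands on the node.
    cmd = nsenter_cmd(12345, ["tcpdump", "-i", "eth0", "-w", "/tmp/pod.pcap"])
    print(shlex.join(cmd))
```

Entering only the network namespace keeps the node's filesystem and tools available, which is why `-n` alone is usually enough for packet captures.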

On Windows, nsenter doesn’t exist since containers use Job Objects / Server Silos instead of Linux namespaces. What’s the equivalent pattern for running a node-installed tool against a specific pod’s container?

Thanks!

reddit.com
u/ParticularCake1475 — 15 hours ago
▲ 15 r/kubernetes+2 crossposts

We built an open-source KEDA external scaler for GPU workloads - no Prometheus needed

Been running GPU inference workloads on k8s and got tired of the dcgm-exporter → Prometheus → PromQL → KEDA chain just to autoscale based on GPU utilization. 5 components, 15-30s metric lag, PromQL queries to maintain.


So I built keda-gpu-scaler — a KEDA external scaler that talks to NVML directly on each GPU node via a DaemonSet. Reads GPU utilization, memory, temperature, power and serves them over gRPC to KEDA. Sub-second metrics, no Prometheus in the loop.
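
For readers wondering what happens to the GPU metric once KEDA has it: a rough sketch of the replica calculation, assuming KEDA passes the value to the Horizontal Pod Autoscaler as it does for other scalers and that the HPA applies its documented ratio formula. The function name, the min/max defaults, and the sample numbers are hypothetical, and the HPA's tolerance and stabilization behaviour are left out:

```python
import math


def desired_replicas(
    current_replicas: int,
    current_value: float,
    target_value: float,
    min_replicas: int = 0,
    max_replicas: int = 10,
) -> int:
    """HPA-style ratio: ceil(current_replicas * current_value / target_value), clamped."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    raw = math.ceil(current_replicas * current_value / target_value)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Two replicas averaging 90% GPU utilization against a 60% target.
    print(desired_replicas(2, 90.0, 60.0))
```

The fresher the metric, the sooner this calculation reacts to a load spike, which is where cutting out the scrape interval matters.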


Wrote about the architecture and why it has to be an external scaler (not a native one) on the CNCF blog: https://www.cncf.io/blog/2026/05/27/gpu-autoscaling-on-kubernetes-with-keda-building-an-external-scaler/


It ships with pre-built profiles for vLLM, Triton, training jobs, and batch workloads. Scale-to-zero works too.


GitHub: https://github.com/pmady/keda-gpu-scaler
Docs: https://keda-gpu-scaler.readthedocs.io


Still early (v0.1.0) so if you're running GPU workloads on k8s I'd appreciate feedback, bug reports, or contributions. Roadmap and open issues are on the repo.
reddit.com
u/Aware-Ticket-5585 — 17 hours ago

Getting Started with Self-Managed Kubernetes in Corporate Environment

For reasons I won't go into, we have an increasing desire to start self-managing our Kubernetes clusters as opposed to using GKE, EKS, etc.

Admittedly, though, we don't have a great understanding of everything this will involve or of the initial set of decisions we should be exploring.

Does anyone have any good pointers or references to blogs / articles / documentation exploring the technical details? Most of what is online is pretty high-level and doesn't go into great depth.

reddit.com
u/Equal_Muffin_9402 — 1 day ago
▲ 11 r/kubernetes+5 crossposts

Anyone using telemetry data in tandem with AI coding agents?

Hey folks 👋

I'm building an open-source dev tool that turns telemetry data into knowledge graphs that can be used as context in AI coding agents for debugging purposes or improving performance & costs.

Why? My intuition is threefold:

(1) coding agents are much more useful when they understand how a system actually behaves in production, not just what the repo looks like

(2) using raw telemetry data (for example traces) doesn't really work with coding agents at scale

(3) telemetry context graphs might be even cheaper and more efficient to query compared to using raw telemetry data

Before spending too much time on this & going down the rabbit hole, I'm trying to sanity-check my assumptions and assess if this is actually useful for people building/running AI systems in production. Curious to hear from software engineers that have tried something like this: what worked & what didn't, etc.

Happy to hear thoughts directly in the comments and if anyone's interested in helping out with feedback on the actual tool as I build it, please let me know and I can send more details in private - not my intention to spam anyone.

Appreciate it 🙇

reddit.com
u/n4r735 — 1 day ago

Affordable mini PC option for someone learning DevOps (Netherlands)

Hello everyone

I'm a refugee in the Netherlands and currently studying cloud engineering. I need a mini PC for my studies and I'm on an extremely tight budget (I get 50 euros per month for sustenance). Do you know of a website or a place that sells used or refurbished mini PCs here in the Netherlands? And what specs should I target to help with my studies, especially Kubernetes? Thank you.

reddit.com
u/Severe_Mouse_2597 — 1 day ago

How fast did you patch Copy.Fail?

This is for folks running production K8s on EU providers, whether managed (OVHcloud, etc.) or self-hosted on Hetzner or wherever.

I'm asking because Copy Fail hit in late April and the managed offerings all shipped patched images within roughly 10 days (I checked and scanned their news sources).

Curious how long it took the K8s self-hosters to roll out the fix across their fleets, and whether that kind of incident is shifting your self-hosted vs. managed K8s thinking at all.

Disclosure: I run eucloudcost.com, a comparison site for EU cloud pricing.
I track provider release notes for a monthly roundup there; the full Feb-May breakdown across 14 providers is here if useful: https://www.eucloudcost.com/blog/eu-cloud-news-feb-may-2026/

By the way, OVHcloud has EFS (Trident RWX storage) now, and no, I am not getting paid by them.

reddit.com
u/mixxor1337 — 1 day ago

Minimal downtime with MetalLB BGP and Envoy Gateway

Hi everyone,

I'm trying to minimize downtime in a cluster using MetalLB (BGP with BFD) and Envoy Gateway, but I'm struggling to find a configuration that handles both graceful shutdowns (node drain) and sudden node failures (power off) smoothly.

Here is what I've observed so far with two different externalTrafficPolicy settings:

Option 1: externalTrafficPolicy: Local

  • Power off (Sudden failure): Works great. MetalLB BFD stops responding immediately, BGP withdraws the route, and the downtime is under 4 seconds.
  • Node Drain / Maintenance: Causes issues. When I drain a node (or label it with node.kubernetes.io/exclude-from-external-load-balancers=true), there is a short window with connect: connection refused errors. The Envoy pod enters a Terminating state, sends a GOAWAY to existing TCP sessions, and refuses new ones. However, MetalLB takes a moment to realize there are no active endpoints on that node and withdraw the route, leading to dropped requests.

Option 2: externalTrafficPolicy: Cluster

  • Node Drain / Maintenance: Works flawlessly. Cilium smoothly redirects new TCP sessions to Envoy pods running on other healthy nodes. Zero downtime.
  • Power off (Sudden failure): Breaks. BFD drops the BGP route to the dead node within 4 seconds, so the top-of-rack router stops sending traffic directly to it. However, because there is no active health checking between Cilium (on other nodes) and the Envoy pod on the dead node, Cilium keeps routing a portion of internal cluster traffic to the dead node for the next 40 seconds—until the node is officially marked as NotReady by Kubernetes.

My Question:

What is the correct architectural approach here? I am aiming for zero downtime during planned maintenance and as low downtime as possible during sudden node malfunctions.

Is there a way to make Cilium aware of dead pods faster in the Cluster policy, or a way to force MetalLB to withdraw BGP routes before Envoy stops accepting connections in the Local policy?

Thanks in advance for any insights!

reddit.com
u/NegotiationIcy8547 — 1 day ago

If someone offered to write you a CRD e2e testing framework, what would you like to have?

I'm currently working with Kyverno Chainsaw at my job, and I must admit I really don't like the tool. It's too much code, the logs are nonexistent, and passing variables around is a nightmare.

Do you have experience with any other e2e frameworks? What do you think are the most common problems: flexibility, visibility, or something else?

reddit.com
▲ 4 r/kubernetes+1 crossposts

CPU node usage is different between kubectl top nodes & Prometheus node exporter

We're using Prometheus/Grafana for monitoring.

Node exporter reports high node CPU usage (over 80%), but kubectl top nodes shows a much smaller value (around 20%), which is frustrating!

First question: which one is correct? And how do we monitor our CPU usage correctly?
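
One possible explanation, not confirmed for this cluster, is that the two numbers are computed differently. A node exporter query of the form "100% minus idle" counts every non-idle mode, including iowait and steal, as busy, while `kubectl top nodes` reports measured CPU usage as a share of the node's allocatable CPU. A minimal sketch of the arithmetic with made-up counter deltas; the sample values and both function names are hypothetical:

```python
def busy_percent(deltas: dict[str, float], idle_modes: tuple[str, ...] = ("idle",)) -> float:
    """Share of CPU time not spent in idle_modes, from per-mode counter deltas."""
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("deltas must sum to a positive amount of CPU time")
    idle = sum(deltas.get(mode, 0.0) for mode in idle_modes)
    return 100.0 * (total - idle) / total


def top_percent(usage_millicores: float, allocatable_millicores: float) -> float:
    """Usage as a share of allocatable CPU, the ratio kubectl top nodes prints."""
    if allocatable_millicores <= 0:
        raise ValueError("allocatable_millicores must be positive")
    return 100.0 * usage_millicores / allocatable_millicores


if __name__ == "__main__":
    # Made-up per-mode CPU seconds over one interval on an I/O-bound node.
    sample = {"user": 10.0, "system": 5.0, "iowait": 65.0, "idle": 20.0}
    print(busy_percent(sample))                              # iowait counted as busy
    print(busy_percent(sample, idle_modes=("idle", "iowait")))  # iowait counted as idle
    print(top_percent(800, 3900))
```

Breaking the node exporter series down by `mode` shows quickly whether iowait or steal accounts for the gap.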

reddit.com
u/Old-Broccoli-4704 — 2 days ago

Anyone else struggling with production error detection despite having tons of observability data?

So this is probably a basic question, but I am stuck on it. We have Prometheus, Datadog, custom metrics, and logs going everywhere. Our stack is monitored to death, but when something breaks in production we still find out from customers before alerts catch it.

I have been digging through dashboards and our alert thresholds look reasonable on paper, but clearly they are not working. Either they are too noisy so people ignore them or they are too quiet and miss actual issues.

Has anyone dealt with this situation where the tooling is there but detection still does not work well? Trying to understand if this is a setup problem or something else.

What actually helped you get from lots of data to alerts that catch real problems before your customers do?

reddit.com
u/Economy_Passenger296 — 2 days ago

Migrating Away from the Kong Enterprise Stack

Hey everyone,

We adopted Kong Mesh and Kong Gateway early on, but recently decided to migrate our entire stack over to Istio and Envoy Gateway.

Keeping both stacks running in parallel with live workloads was an absolute nightmare for a minute there, but we officially managed to wire it all together successfully.

Has anyone else gone through this exact migration path? What was the hardest part of the coexistence phase for your team?

reddit.com
u/HarryOwin — 3 days ago

Deep Dive into cgroups v1 vs v2

Hi all

I published an article that tries to explain the huge change in cgroups with v2 and why it’s a great thing for Kubernetes.

The words are human, the visualisations were created with an LLM.

I hope some of y’all find it useful!

rawkode.academy
u/RawkodeAcademy — 3 days ago