r/devopsGuru

▲ 15 r/devopsGuru+2 crossposts

We built an open-source KEDA external scaler for GPU workloads - no Prometheus needed

Been running GPU inference workloads on k8s and got tired of the dcgm-exporter → Prometheus → PromQL → KEDA chain just to autoscale based on GPU utilization. 5 components, 15-30s metric lag, PromQL queries to maintain.


So I built keda-gpu-scaler, a KEDA external scaler that talks to NVML directly on each GPU node via a DaemonSet. It reads GPU utilization, memory, temperature, and power, and serves them over gRPC to KEDA. Sub-second metrics, no Prometheus in the loop.
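
For anyone wondering how a GPU utilization reading turns into a replica count: KEDA hands the metric to the Horizontal Pod Autoscaler, which applies the rule documented for Kubernetes, `ceil(currentReplicas * currentMetric / targetMetric)`, with a default tolerance of 0.1. Below is a minimal Python sketch of that rule only. It is not code from keda-gpu-scaler, and the function name and the sample numbers are made up for illustration.

```python
import math


def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, tolerance: float = 0.1) -> int:
    """Illustrative HPA-style calculation (hypothetical helper, not project code)."""
    if current_replicas <= 0 or target_value <= 0:
        raise ValueError("current_replicas and target_value must be positive")
    ratio = current_value / target_value
    # Inside the tolerance band the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# Two replicas averaging 90% GPU utilization against a 60% target.
print(desired_replicas(2, 90.0, 60.0))
```

The rule above only covers scaling between 1 and N replicas; activation from zero and back to zero is handled by KEDA itself.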


Wrote about the architecture and why it has to be an external scaler (not a native one) on the CNCF blog: https://www.cncf.io/blog/2026/05/27/gpu-autoscaling-on-kubernetes-with-keda-building-an-external-scaler/


It ships with pre-built profiles for vLLM, Triton, training jobs, and batch workloads. Scale-to-zero works too.


GitHub: https://github.com/pmady/keda-gpu-scaler
Docs: https://keda-gpu-scaler.readthedocs.io


Still early (v0.1.0), so if you're running GPU workloads on k8s I'd appreciate feedback, bug reports, or contributions. Roadmap and open issues are on the repo.
reddit.com
u/Aware-Ticket-5585 — 1 day ago
▲ 39 r/devopsGuru+2 crossposts

Built a Dockerized Ansible lab with a browser-based IDE

I built a portable Ansible lab that spins up in seconds using Docker. Thought some of you might find it useful for learning or testing playbooks.

https://github.com/Yoas1/ansible-handson

The setup:

  • **1 controller** (Python + Ansible + code-server IDE on port 8080)
  • **2 workers** — one Ubuntu 22.04, one Red Hat UBI 9
  • Pre-configured SSH keys (Ed25519), inventory, ansible.cfg, Vault, and linters

You literally run `docker compose up`, open your browser, and start writing/running playbooks. No manual VM setup, no SSH config headaches.

What I like about it:

  • **Hot-reload configs** — edit .config/ files and inotifywait auto-applies them via update_config.sh
  • **Pre-commit hooks** built in — yamllint, ansible-lint, shellcheck, markdownlint all run before commit
  • **Multi-distro workers** — test your playbooks against both Debian-based and RHEL-based systems
  • **Code-server** — full VS Code in the browser with Ansible and Python extensions

Would love feedback or ideas for improvement. The full setup is on my GitHub if anyone wants to check it out.

Cheers

u/yoas1a — 2 days ago
▲ 4 r/devopsGuru+2 crossposts

10k followers but almost zero engagement now, can this page recover?

Hey guys,

Need some honest advice.

I have a DevOps/IT related Instagram page with around 10k followers, but engagement is almost dead now. Hardly getting story views, posts don’t perform, and feels like even followers don’t see my content anymore.

To be honest, I was very inconsistent for a long time. Mostly posted normal posts/carousels, barely did reels, and sometimes disappeared for weeks/months.

Content is mostly:

  • DevOps
  • cloud/Azure/AWS
  • sysadmin stuff
  • tech career related posts

Now I’m confused whether:

  • the page is dead because of inactivity
  • tech content just performs badly on Instagram now
  • or I completely ignored reels for too long

Has anyone here actually revived an inactive page successfully?

Should I keep this account and start posting reels consistently or just start fresh?

Also what kind of content is working now for tech/IT pages?

Would appreciate real advice from people who’ve gone through this.

reddit.com
u/Birentechy — 2 days ago

How do you decide whether your side project is worth turning into a real business

Been working on a side project for 3 months now. About 20 users, no revenue yet, but they actually use it weekly. Can't tell if it's just a hobby that some people happen to like, or if it's worth quitting other stuff to focus on. For the founders here who turned a side project into a real business: what was the moment you knew it was worth going all in? Or did you wait too long, or jump too early?

reddit.com
u/AssociateNo2293 — 4 days ago
▲ 2 r/devopsGuru+2 crossposts

Has a SQL migration ever taken down your production database? How did you handle it?

I'm a backend engineer building a tool to prevent Postgres migration outages and I'm in pure research mode right now — no product pitch, just trying to understand how widespread this is.

Our worst case: an ALTER TABLE on a 30M row table held an AccessExclusiveLock for 22 minutes. Everything queued up. Users saw timeouts. We found out from customer support, not monitoring.

Has this happened to your team? How do you currently check migrations before pushing to prod? Do you use squawk, strong_migrations, manual review, or just hope for the best?
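
To make the "check migrations before pushing to prod" question concrete, here is a rough Python sketch of the kind of pre-merge check I mean. It is a toy, not a replacement for squawk or strong_migrations. The rule set, the function name, and the naive split on semicolons are all assumptions for illustration. It flags three things: an ALTER TABLE with no earlier `SET lock_timeout` (so the statement queues instead of failing fast), a CREATE INDEX without CONCURRENTLY, and a column type change, which usually rewrites the table under an ACCESS EXCLUSIVE lock.

```python
import re

# Hypothetical rule set for illustration; real linters cover far more cases.
LOCK_TIMEOUT = re.compile(r"^SET\s+(?:LOCAL\s+)?lock_timeout\b", re.I)
ALTER_TABLE = re.compile(r"^ALTER\s+TABLE\b", re.I)
RULES = [
    (re.compile(r"\bCREATE\s+(?:UNIQUE\s+)?INDEX\b(?!\s+CONCURRENTLY\b)", re.I),
     "CREATE INDEX without CONCURRENTLY blocks writes while the index builds"),
    (re.compile(r"^ALTER\s+TABLE\b.*\bALTER\s+(?:COLUMN\s+)?\S+\s+(?:SET\s+DATA\s+)?TYPE\b",
                re.I | re.S),
     "column type change usually rewrites the table under ACCESS EXCLUSIVE"),
]


def check_migration(sql: str) -> list[str]:
    """Return warnings for risky statements (naive: splits on ';')."""
    warnings = []
    lock_timeout_set = False
    for raw in sql.split(";"):
        stmt = raw.strip()
        if not stmt:
            continue
        if LOCK_TIMEOUT.search(stmt):
            lock_timeout_set = True
            continue
        if ALTER_TABLE.match(stmt) and not lock_timeout_set:
            warnings.append("ALTER TABLE without a prior SET lock_timeout")
        for pattern, message in RULES:
            if pattern.search(stmt):
                warnings.append(message)
    return warnings


risky = """
ALTER TABLE orders ALTER COLUMN total TYPE numeric(12,2);
CREATE INDEX idx_orders_user ON orders (user_id);
"""
for warning in check_migration(risky):
    print(warning)
```

Splitting on semicolons breaks on string literals and function bodies, so anything serious would need a real SQL parser.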

Genuinely trying to understand the problem before I build anything. All experiences welcome.

reddit.com
u/Gadimov03 — 6 days ago

Enterprise AI consulting on LLMOps

Platform team supporting 4 product squads all doing GenAI development. We have Datadog and LangSmith but alerts are noise and we can’t trace why a specific user got a bad response.

Need enterprise AI consulting to set up real observability: prompt + completion logging with PII redaction, cost attribution per feature, hallucination tracking, and replay. Vendors keep selling dashboards but we need pipelines.

What does good LLMOps look like in practice and which tools do you actually run in prod vs demo well? We’re on Kubernetes + AWS. Need to standardize before we have 10 bespoke setups.

reddit.com
u/Difficult-Arrival665 — 6 days ago

K8s: The DevOps job market is really weird right now.

You want to get into DevOps or Platform Engineering, then everywhere online you see people saying:

“Certificates are not important.”
“They do not prove you can do the real job.”
“Anybody can cram and pass exams.”

Honestly, some of them are right.

But I will argue the other side too.

Because certs like CKA, CKAD and CKS can really solidify your DevOps job hunt.

Let’s be realistic here.

You have a degree.
I have a degree.

You have some experience.
I have some experience.

But if neither of us is coming from companies like Uber, Meta or Google, then what makes a recruiter stop and actually want to talk to one candidate over the other?

That extra proof matters.

Before I had these certs, I could apply to 10 jobs and maybe get 1 interview.

Recently, I applied to around 10 jobs again.

I got 4 interviews.

And I made it all the way to third-stage interviews.

That is not by accident.

Ever since I started posting my Kubernetes certs and projects publicly, I constantly get recruiters and hiring managers messaging me asking if I am open for work.

The competition is just very fierce now.

There are too many good candidates.

So you need something that helps you stand out.

And most companies are using Kubernetes now.

They want people who can actually work with this stuff.

Not people who just watched tutorials.

If I was starting from scratch again, I would first focus on understanding Kubernetes basics properly.

Pods.
Deployments.
Services.
Networking.
Secrets.
ConfigMaps.

Then after that, I would go hard on practicing Kubernetes questions tailored to the exams.

That is honestly what helped me pass.

For CKAD:

I failed it before.

Then I passed it after focusing heavily on practicing questions close to the real exam patterns.

You can access my CKAD practice resources here:
👉 CKAD Offer

For CKA:

Do not rush into CKA first if you are new to Kubernetes.

Go chronologically.

CKAD → CKA → CKS

It really helps.

CKA is a serious level up from CKAD.

You can get the CKA resources here:
👉 CKA Offer

For CKS:

Now this is the elephant in the room.

This one is difficult.

You really need to know what you are doing.

I know some gurus jump straight into CKS, but honestly, this post is for the new engineer trying to break into DevOps.

Please start from top to bottom.

Not bottom to top.

You can get the CKS resources here:
👉 CKS Offer

If I had to do this all over again, I would simply follow people who already passed and understand the road.

I had to go through a lot of trial and error.

That is just my nature sometimes.

Long road. Burn time. Burn money.

You can also do that.

Or you can follow someone who already lived it.

Not to brag.

Just to let you know that if I could do it, you can too.

reddit.com
u/Defiant-Chard-2023 — 7 days ago
▲ 1 r/devopsGuru+1 crossposts

Show HN: Taso - Stop crashing in production due to missing environment variables

Hey everyone!

I just released Taso, a high-performance CLI tool built in Go that solves a problem we’ve all had: The "Ghost" Environment Variable.

You know the feeling: you add a new feature, call os.Getenv("NEW_SERVICE_URL"), but forget to add it to your .env or production secrets. Your app crashes, and you waste 20 minutes debugging.

Taso fixes this.

It uses AST (Abstract Syntax Tree) analysis to scan your source code (Go, JS, TS, etc.) and find every environment variable you are actually using. It then compares them against your .env files and reports exactly what's missing.
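
To illustrate the general technique (this is not Taso's implementation, which is written in Go), here is a small Python sketch that uses the standard `ast` module to scan Python source. It collects names passed to `os.getenv`, `os.environ.get`, and `os.environ[...]`, then compares them against the keys in a .env file. The function names and the sample inputs are hypothetical, and it assumes Python 3.9 or newer for `ast.unparse`.

```python
import ast


def env_vars_used(source: str) -> set[str]:
    """Collect literal env var names read via os.getenv / os.environ."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and node.args:
            # os.getenv("NAME") and os.environ.get("NAME", default)
            first = node.args[0]
            if (ast.unparse(node.func) in ("os.getenv", "os.environ.get")
                    and isinstance(first, ast.Constant)
                    and isinstance(first.value, str)):
                found.add(first.value)
        elif isinstance(node, ast.Subscript) and ast.unparse(node.value) == "os.environ":
            # os.environ["NAME"]
            key = node.slice
            if isinstance(key, ast.Constant) and isinstance(key.value, str):
                found.add(key.value)
    return found


def env_vars_defined(dotenv_text: str) -> set[str]:
    """Parse KEY=value lines, skipping blanks and comments."""
    names = set()
    for line in dotenv_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name = line.split("=", 1)[0].strip()
        if name.startswith("export "):
            name = name[len("export "):].strip()
        names.add(name)
    return names


def missing_vars(source: str, dotenv_text: str) -> list[str]:
    """Names used in code but absent from the .env text."""
    return sorted(env_vars_used(source) - env_vars_defined(dotenv_text))


code = (
    'import os\n'
    'url = os.getenv("NEW_SERVICE_URL")\n'
    'key = os.environ["API_KEY"]\n'
    'mode = os.environ.get("MODE", "dev")\n'
)
dotenv = "# local settings\nAPI_KEY=abc\nexport MODE=dev\n"
print(missing_vars(code, dotenv))
```

A sketch like this only catches literal names; variables built at runtime (for example from an f-string) would need a different approach.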

Key Features:

  • Ghost Detection: Finds variables used in code but missing from config.
  • Drift Tracking: Snapshots your env and alerts you if anything changes.
  • Health Scoring: Audits your project and gives you a grade (A-F).
  • Blazing Fast: Scans 10k+ files in milliseconds using SHA-256 caching.
  • Cross-Platform: Supports Windows (Scoop) and macOS/Linux (Homebrew).

Check it out:

If you find Taso useful, please drop a star on GitHub and consider contributing! We have a full roadmap ahead and would love help adding more language parsers.

GitHub Repository: https://github.com/Hossiy21/taso

Install it now:

I’d love to hear your feedback! What language should we add AST support for next? 

reddit.com
u/hossiy16 — 8 days ago
▲ 39 r/devopsGuru+1 crossposts

Real DevOps work is more troubleshooting than deployment

Started my DevOps journey thinking the job was mostly deployments and automation.

In reality, a huge part of the work is:

  • Troubleshooting production issues
  • Monitoring systems
  • Managing infrastructure changes
  • Handling networking and permissions
  • Optimizing CI/CD pipelines

Tools change, but problem-solving skills stay constant.

What was the biggest surprise for you after entering DevOps?

reddit.com
u/Blacksmith-23 — 9 days ago
▲ 7 r/devopsGuru+1 crossposts

Roast my resume - devops engineer 5YOE - not getting calls

Hi,
Please roast my resume as a DevOps engineer. To clarify, I currently work in system development engineering and use most of the DevOps stack, but I'm not in a core DevOps role. I'm looking to move into DevOps/SRE, so I've modified my resume accordingly.

Please let me know about the issues.

u/Humble_Noise2945 — 7 days ago

Is it just me, or is there a massive "Logic Gap" between passing a DevOps certification and actually surviving a live incident? 🤡🛠️

I see a lot of talk about joining specific training batches and "Guru" paths, but as a CS student currently building out technical event pipelines, I’ve realized that no bootcamp can simulate the adrenaline of a database schema failing five minutes before "Go Live."

Are we focusing too much on the "Syntax" of tools and not enough on the "Psychology" of infrastructure? In 2026, the real Gurus aren't the ones who know every Terraform command; they're the ones who don't panic when the ingress controller starts acting sentient. What was the one "un-teachable" moment that actually made you a DevOps professional?

reddit.com
u/Deep-Location-6426 — 8 days ago

The deployment checklist item we always skipped until something broke in production

every deployment we did had the same gap. we checked the obvious stuff: tests passing, environment variables set, database migrations run, health check endpoint responding. what we kept skipping was rollback validation.

we had a rollback plan on paper but we never actually tested whether it worked until we needed it in production. the first time we had to roll back a bad deploy the process was slower and messier than expected because some of the assumptions in the rollback plan were wrong.

started actually running rollback drills after that. deploy to staging, verify it works, then roll it back and verify the previous version is fully functional again. adds maybe 20 minutes to the staging process but means the rollback procedure is tested and familiar before you ever need it under pressure.

sounds obvious but i've talked to a lot of teams who have never actually executed a rollback in a non crisis situation. anyone else building rollback testing into their regular deployment process or only finding out it's broken when something goes wrong?
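
a rough sketch of the drill as a script, in case it helps anyone picture it. the `deploy` and `verify` callables, the `FakeStaging` class, and the version strings are placeholders made up for illustration; in practice they would wrap whatever your own deploy tooling and smoke tests are.

```python
from typing import Callable


def rollback_drill(deploy: Callable[[str], None],
                   verify: Callable[[str], bool],
                   previous: str, candidate: str) -> dict:
    """Deploy the candidate, verify it, roll back, verify the previous version."""
    deploy(candidate)
    forward_ok = verify(candidate)
    # Roll back regardless of the forward result: the rollback path is the thing under test.
    deploy(previous)
    rollback_ok = verify(previous)
    return {"forward_ok": forward_ok, "rollback_ok": rollback_ok}


class FakeStaging:
    """Stand-in environment so the sketch runs without real infrastructure."""

    def __init__(self) -> None:
        self.running = None
        self.history = []

    def deploy(self, version: str) -> None:
        self.running = version
        self.history.append(version)

    def verify(self, version: str) -> bool:
        return self.running == version


staging = FakeStaging()
result = rollback_drill(staging.deploy, staging.verify, "v1.4.2", "v1.5.0")
print(result)
```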

reddit.com
u/Excellent_Poetry_718 — 9 days ago
▲ 1 r/devopsGuru+1 crossposts

Built a CLI tool that remembers your infrastructure fixes so you don't have to

Tired of re-googling the same errors, I built FixDoc. It runs locally, captures fixes as you go, and surfaces them when a similar error shows up again. It also has native support for ServiceNow, Jira, Notion, and Slack. It's not asking you to change anything about how your team already works. It just meets you there.

It classifies errors by whether they're worth storing, scores Terraform change impact before you apply, and works completely offline by default.

No SaaS, no dashboard to log into. Just a searchable history on your machine. Lmk if you try it out. https://fixdoc.dev/

reddit.com
u/FixDoc — 8 days ago
▲ 7 r/devopsGuru+1 crossposts

AI in DevOps

Hi everyone! We’re students (2-year vocational education) conducting a survey on how teams are adopting AI in their daily work. We’re especially interested in hearing from people working in DevOps, infrastructure, platform engineering, SRE, or related fields...but anyone is welcome to participate.
 
https://docs.google.com/forms/d/e/1FAIpQLSdtxsY8EAsY2FL2JHR8-Im0lcKJjWj4mf2Hj5r-dA71C96VaA/viewform?usp=publish-editor

Thanks! 🙏

Edit: Clarification of program.

u/DoraExploraFTW — 9 days ago

How do you handle deployments when the client's existing infra is a mess you didn't build?

I've been seeing this happen more often recently. We get called in to implement something new, only to discover that the customer's existing infrastructure, which we didn't build, is falling apart. There is no documentation, there is no CI/CD, and some of the environment variables are set where they are not supposed to be.

The new development itself is fine, but the real problem is making sure it works well with the existing setup.

We have been including an infrastructure audit step early on, before scoping any new projects.

Has anyone else been incorporating that step into their workflow?

reddit.com
u/Excellent_Poetry_718 — 11 days ago

From Scattered Notes to Shared Knowledge: Why DevOps Professionals Should Document Their Learning

I started my DevOps journey in 2012 running meetups in Pune, and honestly, I was a mess. Questions everywhere—from the community, from speakers, from myself. Notes scattered across my phone, emails, a blog I never updated. I'd lose them when my phone died or I'd upgrade. Google Drive, WhatsApp groups, note-taking apps—nothing stuck. Then I created a simple private channel just for myself, a dumping ground for everything I learned. Answers to tough questions. Articles worth reading. Open source projects that solved real problems. Nothing polished. Just raw knowledge capture.

What I didn't expect was that this channel would eventually grow to thousands of followers organically. People from my meetups asked to join. They shared it with others. Years later, it started getting mentioned in newsletters and curated lists as a resource worth following.

But here's the thing—I'm telling you this not to promote that channel, but to make a bigger point. After thirteen years in DevOps, I'm now teaching. And what I've learned is this: the knowledge you capture for yourself has a way of helping others you'll never meet. Every question you document. Every pattern you notice. Every useful resource you bookmark and share—it compounds over time.

If you're early in your DevOps career, or anywhere along the way, I'd encourage you to start capturing what you learn. Don't aim for perfection. Don't wait until you're an expert. Just document. Share. You might surprise yourself at what grows from that simple habit.

If you're already doing something similar, I'd love to hear how you're managing it in the comments.

reddit.com
u/Man_Of_Steel47 — 13 days ago

The one monitoring mistake we kept making on client deployments

Took us longer than it should have to figure this out. Every time we deployed something for a client we set up monitoring for the obvious stuff: uptime, error rates, response times. Standard stuff.

What we kept missing was business logic monitoring. The app was technically healthy but something in the actual workflow was silently broken. An invoice not sending. A webhook not firing. A scheduled job running but producing wrong output. None of that shows up in your standard infra dashboard.

We started adding a simple layer on top: key business events logged explicitly, alerts if expected events stop happening within a time window. Invoice agent hasn't sent anything in 2 hours? Alert. Reminder scheduler ran but zero reminders fired? Alert.

Sounds obvious but it took a few painful production incidents to actually build it into our standard setup. Anyone else separating business logic monitoring from infra monitoring or treating them as the same thing?
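
A minimal Python sketch of the "alert if an expected event stops happening" layer, assuming an in-memory store. The `EventWatchdog` class, the event names, and the time windows are illustrative, not our actual setup; a real version would persist timestamps and push alerts to whatever paging tool you use.

```python
from datetime import datetime, timedelta, timezone


class EventWatchdog:
    """Tracks the last time each business event was seen (hypothetical helper)."""

    def __init__(self, max_silence: dict[str, timedelta]) -> None:
        self.max_silence = max_silence
        self.last_seen: dict[str, datetime] = {}

    def record(self, event: str, at: datetime) -> None:
        self.last_seen[event] = at

    def stale_events(self, now: datetime) -> list[str]:
        """Events never seen, or silent for longer than their allowed window."""
        stale = []
        for event, limit in self.max_silence.items():
            seen = self.last_seen.get(event)
            if seen is None or now - seen > limit:
                stale.append(event)
        return sorted(stale)


watchdog = EventWatchdog({
    "invoice_sent": timedelta(hours=2),
    "reminder_fired": timedelta(hours=1),
})
start = datetime(2026, 1, 5, 9, 0, tzinfo=timezone.utc)
watchdog.record("invoice_sent", start)
watchdog.record("reminder_fired", start + timedelta(hours=2, minutes=30))

# Three hours in: invoices have been silent too long, reminders are fine.
print(watchdog.stale_events(start + timedelta(hours=3)))
```

Treating a never-seen event as stale means a fresh deployment alerts until the first event arrives, which may or may not be what you want.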

reddit.com
u/Excellent_Poetry_718 — 10 days ago

Attention everyone 👋 I am looking for DevOps training in Hyderabad. I want offline classes only (in person). I can pay for training. If anyone teaches, please message me.

reddit.com
u/Electronic_Golf7155 — 14 days ago