r/devops

▲ 1 r/devops+1 crossposts

I am not sure why my website is getting 24k requests.

Hi, I have a region-specific domain, but it is getting a lot of requests.
I checked whether the traffic is crawlers/bots, but those account for only 38 requests; the rest seem to be legitimate requests.
I have a half-baked e-commerce website with only one product. How do I figure this out, fellow DevOps folks?
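
One way to start narrowing this down is to group requests by client IP, user agent, status, or path and see whether a few sources account for most of the traffic. This is only a rough sketch: it assumes the web server writes access logs in the common "combined" format, and the `top_talkers` helper and the sample lines are hypothetical, not taken from the post.

```python
import re
from collections import Counter

# Assumption: logs use the nginx/Apache "combined" format.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def top_talkers(lines, field, n=5):
    """Count requests per value of `field` (ip, path, agent, status, ...)."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:  # lines that do not parse are skipped
            counts[match.group(field)] += 1
    return counts.most_common(n)

# Hypothetical sample data, for illustration only.
sample = [
    '203.0.113.7 - - [10/May/2025:10:00:01 +0000] "GET /wp-login.php HTTP/1.1" 404 153 "-" "python-requests/2.31"',
    '203.0.113.7 - - [10/May/2025:10:00:02 +0000] "GET /.env HTTP/1.1" 404 153 "-" "python-requests/2.31"',
    '198.51.100.4 - - [10/May/2025:10:00:05 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]

print(top_talkers(sample, "ip"))
print(top_talkers(sample, "status"))
print(top_talkers(sample, "agent"))
```

If most requests come from a small set of IPs, return 404, or target paths the site does not have, that may point to scanners that do not identify themselves as bots rather than real visitors.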

u/EnvironmentalRun4163 — 14 hours ago
▲ 39 r/devops+4 crossposts

pnpm 11 Might Finally Be a Better Default Than npm

pnpm 11 feels like the first Node.js package manager update in a while that actually improves supply chain security by default.

Features like:

  • minimumReleaseAge
  • blockExoticSubdeps
  • allowBuilds

directly reduce the risk of malicious package installs in CI/CD pipelines.
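
To make the idea behind `minimumReleaseAge` concrete, here is a conceptual sketch, not pnpm's actual implementation. As I understand the setting, it delays installing a version until a minimum number of minutes has passed since it was published. The `newest_allowed` function, the version numbers, and the timestamps are all hypothetical, and the sketch orders versions by publish time rather than by semver, which is a simplification.

```python
from datetime import datetime, timedelta, timezone

def newest_allowed(versions, min_age_minutes, now):
    """Pick the most recently published version that is old enough.

    `versions` maps a version string to its publish time (timezone-aware).
    Returns None if every version is still younger than the minimum age.
    """
    cutoff = now - timedelta(minutes=min_age_minutes)
    eligible = {v: published for v, published in versions.items() if published <= cutoff}
    if not eligible:
        return None
    # Latest publish time among the versions that passed the age check.
    return max(eligible, key=eligible.get)

# Hypothetical publish times, for illustration only.
now = datetime(2025, 1, 10, 12, 0, tzinfo=timezone.utc)
versions = {
    "1.2.0": datetime(2025, 1, 1, 9, 0, tzinfo=timezone.utc),
    "1.2.1": datetime(2025, 1, 10, 11, 30, tzinfo=timezone.utc),  # 30 minutes old
}

print(newest_allowed(versions, min_age_minutes=1440, now=now))
print(newest_allowed(versions, min_age_minutes=10, now=now))
```

With a one-day minimum age the 30-minute-old release is skipped in favor of the older one, which is the window in which a malicious publish would hopefully be noticed and pulled.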

I wrote a short deep dive on why I think pnpm is now a better default than npm for production workloads.

Curious what others here are using in production today.

blog.prateekjain.dev
u/root0ps — 17 hours ago
▲ 109 r/devops+11 crossposts

A Practical Back End Engineering Roadmap

A practical backend engineering roadmap with interactive explainers and visualizations.

Topics include networking, HTTP, databases, queues, caching, distributed systems, and real-world outages.

Feedback welcome.

https://semicolony.dev/roadmap/

semicolony.dev
u/nulless — 24 hours ago
▲ 47 r/devops

How do I become valuable in DevOps & Cloud within the next 2 years as a student?

I’m currently at the end of my 2nd year in INFT course and want to build a career in DevOps and Cloud. I’m planning to spend the next 2 years seriously learning and building skills.

Where should I start?

What should I focus on first?

What skills/tools are most important in today’s industry?

What projects should I build to stand out for internships/jobs?

Would appreciate guidance from people already working in DevOps or Cloud engineering.

reddit.com
u/Interesting-Bug7715 — 1 day ago
▲ 0 r/devops

Agent Observability and what I think

Hey all, I wanted to share a perspective on something I've been thinking about a lot lately.

Traditional APM was built for request-response, and AI agents break that model entirely. Most of what's on the market right now is just legacy APM with agent features added, and that leaves a gap you really only feel when things go wrong. You can see the agent's intent (what it decided to do) or the system-level impact (latency, errors, resource usage), but not both in the same trace. As a result, you're flying blind through the exact moments when cost spikes.

I think observability at the agent layer is one of the real problems here. It's not solved yet. But it's defined well enough that you can instrument properly if you start now.

UC Santa Cruz published research on this last year (arxiv:2508.02736). They used eBPF to intercept TLS traffic and correlate what the agent intended to do with what actually happened at the kernel level. Less than 3% overhead. Point being that this is architecturally possible.

About 5% of AI model requests fail in production today (Datadog, April 2026 survey). Sixty percent of those failures are capacity-related, not model errors. So, it's an operational gap. And teams that built agent-layer observability into their setup caught those failures before they cascaded into outages. Teams that didn't had incidents.

If you're building agents, start with OpenTelemetry. If you're buying a platform, ask the hard questions: Does this handle reasoning loops as a first-class thing? Can you see the decision tree as a continuous trace? Does it know the difference between a tool failing and the agent misunderstanding the tool? Can you alert on semantic drift?

Those are the questions that separate something actually built for agents from something that's just adding agent features to traditional APM. Honeycomb published their approach. Langfuse and LangSmith are solid for multi-step debugging. There are about 15 tools competing on this now, most built on OpenTelemetry standards.

My candid assessment is that you're going to be in supervised mode for a while. Your agent still needs human approval; there is no way around it right now. That's not going away in the next two years. If a vendor tells you otherwise, that's a red flag.

Curious if people can share a) what does good agent observability actually look like at your scale? And b) what are you currently missing on the observability side if anything?

reddit.com
u/gaurav_sherlocks_ai — 19 hours ago
▲ 40 r/devops+8 crossposts

Cosmo - Real-time PostgreSQL TUI Dashboard (v0.2.0)

Just shipped Cosmo — a clean TUI to monitor your Postgres database in real-time.

GitHub: https://github.com/mujib77/cosmo

Live overview, active queries, WAL rate, locks, and more.

I’m actively developing more features and older version support.
Would love your feedback and suggestions!

▲ 5 r/devops+4 crossposts

Anonymous survey about AI coding agents and their impact on developers

Hello everyone 👋

I'm working on a study about how developers use AI coding agents and what impact they have on productivity and well-being at work. It would help me a lot, and I'd appreciate it, if you filled out this anonymous survey: https://forms.gle/8QDM46LnVHDweV779. It takes about 1–2 minutes.

Thank you very much for your help and your time! 🙏

u/n4r735 — 19 hours ago
▲ 1.4k r/devops

Today is why I no longer have the desire to work in IT

I have over 20 years of experience and have been a lead for the last 10 years of my career. I'm usually the one people go to for help, and the one folks come to when junior members can't figure things out. I have a love-hate relationship with AI. I'm old school: I prefer vi to VS Code, and with AI I just refuse to accept it. Anyway, today we had an issue in prod. A mid-level engineer went straight to Claude. He couldn't figure out what the issue was. He ran our Salt code through Claude and, in Claude's defense, it did point out what the root cause was.

Now, because everyone nowadays depends heavily on AI, you'd think people would have spent the time to actually check the nginx config and see whether it differed between our prod environments. No, everyone waited a few hours for me to confirm, when all I did was compare our 3 prod environments and, sure enough, they were different. Problem solved once we pushed out the correct config.

I think people have lost the ability to think for themselves. What I'm seeing in my org is that folks go straight to Claude. If you use it right it works, but I can't count the number of times in the past few weeks that I tailed log files and managed to figure out the root cause without using AI.

Lately, we have been told to leverage AI heavily. I found out they are also tracking our token usage. If that is true, then I'm at the bottom of the list in terms of adoption. I guess they can fire me and keep the folks who use Claude for everything while they fumble to address prod issues, because Claude doesn't have all the necessary information about our infra and app.

End rant

reddit.com
u/SecureTaxi — 2 days ago
▲ 0 r/devops

Built an internal AI assistant four months ago. Security just asked what access it has. I have no idea

We shipped an internal assistant about four months ago. It hooks into Slack, Confluence, Jira, and Google Drive. Users authenticate through SSO and the agent acts on their behalf. It worked fine, people use it daily, no complaints.

Security came to me last week asking for a list of what it can access and what scopes we granted. I pulled it together and sent it over, and then looked at it myself properly for the first time.

Confluence is read-write across all spaces. Google Drive is full access. Jira can create and modify issues across every project. I picked those scopes four months ago because they made the integration work. I didn't think too hard about it at the time.

Security came back with questions I couldn't answer. What happens to the OAuth tokens if we switch vendors? Is there an offboarding process for the agent? Who reviews its access? What does it actually do during a session beyond what the logs show?

I don't have answers for any of that. We have an IAM process for employees and service accounts, but nothing that covers this. It doesn't fit neatly into either category.

Is anyone actually governing LLM agent access formally, or is everyone just dealing with it when security asks?
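
A possible first step toward answering security is to write down what the agent was granted next to what it actually needs, and diff the two. The sketch below does only that. The scope labels are hypothetical placeholders that mirror the description above, not real provider scope strings, and the `REQUIRED` set is an assumption about what the assistant's use cases need.

```python
# Hypothetical scope labels mirroring the post; not real OAuth scope strings.
GRANTED = {
    "confluence": {"read", "write"},
    "google_drive": {"read", "write", "delete"},
    "jira": {"read", "create_issue", "modify_issue"},
}

# Assumption: the minimum the assistant needs for its actual use cases.
REQUIRED = {
    "confluence": {"read"},
    "google_drive": {"read"},
    "jira": {"read", "create_issue"},
}

def excess_scopes(granted, required):
    """Return, per integration, the scopes that are granted but not required."""
    report = {}
    for system, scopes in granted.items():
        extra = scopes - required.get(system, set())
        if extra:
            report[system] = sorted(extra)
    return report

for system, extra in excess_scopes(GRANTED, REQUIRED).items():
    print(f"{system}: consider revoking {', '.join(extra)}")
```

Keeping the two maps in version control would at least give the access review an owner and a history, even before there is a formal IAM category for agents.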

reddit.com
u/Awkward-Chemistry627 — 22 hours ago
▲ 1 r/devops+1 crossposts

Pivoting from support to cloud ops. Looking for a reality check.

I have spent the last few years working in operations and support with a heavy focus on escalations and operational excellence. I am currently finishing up my AWS SAA to pivot into a full cloud ops role. I already hold my CCP and FOCP certifications.

I am looking for guidance on how to break into this space, ideally in a remote capacity. I want to find a position that allows me to build my skills without the extreme, high-stress cycle I am used to in my current operations background. I know that passing a test is a totally different planet compared to what happens in a real production environment, so I am looking for a sanity check from someone actually in the trenches.

If you have been in this space for a while and would be willing to share some perspective here, I would appreciate the insight. Even better, if you are open to a quick 10 minute call to tell me how you got your start and how you navigated finding a role that was sustainable, please shoot me a DM.

I am happy to respect your time and keep it brief. Thanks for any help.

reddit.com
u/Broad-Lake6535 — 1 day ago
▲ 30 r/devops

Books about Release Engineering and Management

I'm not sure if this is the right place to ask, but do you know any books or courses that can be helpful in release engineering and management, git tagging and repository branch management, versioning, packaging (including its naming and structuring), and so on?

reddit.com
u/ferry_rex — 1 day ago
▲ 5 r/devops

Maybe I'm overengineering this, but managing AI workloads in production feels weirdly fragmented right now.

I have:

  • normal app monitoring
  • separate GPU metrics
  • separate prompt/version tracking
  • separate model evaluation logs
  • separate cost dashboards
  • and then random scripts duct-taped between all of them

The actual inference part is becoming easier than the infrastructure around it.

Curious if people are converging on a stack yet or if everyone else also has a pile of semi-connected tooling.

reddit.com
u/Bladerunner_7_ — 1 day ago
▲ 12 r/devops

Python dev (Django/FastAPI/Docker/K8s) trying to break into DevOps — what should I prioritize, and what are the real problems no one warns you about?

Hey everyone, long-time lurker, first time posting here. Looking for honest advice from people who've actually made this kind of transition.

My current stack:

Python · Django / FastAPI · Docker + Compose · Kubernetes (basics) · Redis / PostgreSQL · Celery / Async · Bash / Linux · RTSP / FFmpeg pipelines / LLMs · YOLO / OpenCV

I've been building backend systems and a full AI-powered camera security system from the ground up — ingestion pipelines, async workers, containerized deployments, the whole thing. So I'm not starting from scratch, but I know my infra/ops knowledge has real gaps.

Now I want to go deeper into the operations side — CI/CD pipelines, infrastructure-as-code, monitoring, cloud, reliability engineering. Basically bridge the gap between "I can Dockerize things" and "I own the entire deployment lifecycle."

What I want to learn next:

  • CI/CD pipelines end-to-end (GitHub Actions, GitLab CI, Jenkins?)
  • Terraform or Pulumi for infrastructure-as-code
  • Proper Kubernetes beyond just "kubectl apply" — RBAC, Helm, Ingress, autoscaling
  • Cloud fundamentals — AWS or GCP (which is better to start with?)
  • Observability stack — Prometheus, Grafana, ELK, alerting
  • GitOps workflows — ArgoCD, FluxCD

Real questions for this community:

  1. What order should I learn these in? I've seen conflicting roadmaps. Some say start with cloud, others say master Linux first, others say just go build something and learn as you go.
  2. What are the actual painful problems nobody tells you about? Not the beginner stuff — I mean the things that trip up even experienced engineers. The stuff that takes months to unlearn or figure out on your own.
  3. Career reality check — I'm coming from a Python/ML background. Will that help me in DevOps roles or will recruiters just not take me seriously because I don't have a traditional sysadmin / infra background?

The real problems I'm already anticipating (want your take on these):

  • Tool sprawl confusion — Terraform vs Pulumi vs CDK vs Ansible vs Chef — no one agrees and every job posting wants something different. How did you pick one and stick with it?
  • Cloud costs — I have zero experience budgeting cloud infra and I know this bites everyone at some point. Any war stories?
  • Debugging distributed failures — logs scattered across 10 services, no clear owner, alerts firing at midnight. How long did it take you to get good at this?
  • Kubernetes complexity cliff — goes from "simple" to genuinely hard very fast, and tutorials always skip the hard parts. What resource actually helped you get past that wall?
  • "DevOps is a culture, not a role" — some companies don't even have a DevOps team, it's just dumped on top of dev work with no extra support or title. How common is this really?
  • Imposter syndrome — coming in as a developer, not a sysadmin, means constantly feeling like you're missing some foundational Linux/networking knowledge everyone else just has. Did this get better?
reddit.com
u/TodayFar9846 — 2 days ago
▲ 187 r/devops

We accidentally spent $300/month running lint on macOS runners. What's your worst GitHub Actions cost mistake?

Just discovered one of our devs set up a lint workflow using macos-latest instead of ubuntu-latest. That's $0.08/min vs $0.008/min — 10x more expensive. It was running 400+ times a month. $300 down the drain for months before anyone noticed.
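
Working backward from those figures gives a rough sense of scale. The rates, run count, and bill below come from the post; the derived values are plain arithmetic and only approximate, since the post says "400+" runs and billing rounds per job.

```python
# Figures from the post.
MACOS_RATE = 0.08      # USD per minute on macos-latest
UBUNTU_RATE = 0.008    # USD per minute on ubuntu-latest
RUNS_PER_MONTH = 400
MONTHLY_BILL = 300.0   # USD

# Derived values.
billed_minutes = MONTHLY_BILL / MACOS_RATE           # macOS minutes billed per month
minutes_per_run = billed_minutes / RUNS_PER_MONTH    # implied average length of one run
ubuntu_bill = billed_minutes * UBUNTU_RATE           # same minutes at the ubuntu rate
monthly_savings = MONTHLY_BILL - ubuntu_bill

print(f"billed minutes/month: {billed_minutes:.0f}")
print(f"average minutes/run:  {minutes_per_run:.1f}")
print(f"cost on ubuntu:       ${ubuntu_bill:.2f}")
print(f"wasted per month:     ${monthly_savings:.2f}")
```

That works out to roughly 3,750 billed minutes a month, a bit over 9 minutes per lint run, and about $30 a month if the same minutes had run on ubuntu-latest.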

GitHub's billing page doesn't break down costs per workflow, so there was no way to spot this without manually digging through the API.

What's your worst accidental Actions cost waste? And how do you prevent this kind of thing from happening?

reddit.com
u/Zealousideal_Tip4089 — 2 days ago
▲ 7 r/devops

Want to switch to Cloud/DevOps engineer role

I have around 1.2 years of experience as a software developer. My main work has been in Flutter and React frontend development, along with some exposure to full-stack development during my internship (building internal tools and dashboards). Most of my work has been frontend-heavy, but I’ve also worked with APIs and backend.

I’m now looking to transition into Cloud / DevOps engineering roles.

I have learned Linux and its useful commands, and I have limited hands-on experience with cloud platforms and DevOps tools, but I’m actively learning Docker, CI/CD, and AWS fundamentals.

I'd appreciate any advice or guidance on how to approach this transition.

reddit.com
u/Katalyst9957 — 2 days ago
▲ 306 r/devops

The absolute pain of trying to debug a Jira ticket that was clearly written by Claude

I was just assigned an "urgent" infrastructure ticket that contains a beautifully formatted 5-bullet-point summary, meticulous bolding, perfect em-dashes, and a conclusion summarizing why stability matters.

What it doesn't contain? The actual error logs, the cluster environment name, or any indication of what actually broke.

Please tell your developers that a raw, messy terminal copy-paste is worth 100x more than a perfectly polished, AI-generated corporate paragraph.

reddit.com
u/Huge-Instance-1632 — 3 days ago
▲ 10 r/devops

Help me develop a few intermediate to advanced DevOps projects that simulate real-world workflows.

Can someone help me with DevOps projects that simulate real-world workflows and the issues engineers resolve while working in production? I'm trying to pivot to a DevOps Engineer role from a cloud background. I have done some projects, such as 2-tier and 3-tier scalable applications on AWS, using tools like Terraform, Docker, and Jenkins. I'd be thankful if anyone could suggest more advanced projects that would help me land a decent DevOps engineer role.

reddit.com
u/artsybx26 — 2 days ago
▲ 63 r/devops

do you or your colleagues communicate through Claude / LLMs? is it widely common now, and is it culturally acceptable / expected?

I don't mean using them in any capacity to do the work, I mean sending emails / jira comments / instant messages fully and obviously written by them.

by "obviously" I mean that they show all the markings of LLMs:

  • bullet points
  • bolding and / or paragraph titles
  • emdashes
  • phrasing that the person would never use naturally (and it's so very obvious when the message isn't in their native language)
  • emojis (lots of emojis)

a large proportion of the tickets opened for devops stuff are now entirely written by Claude as well, and regularly are shining examples of confidently incorrect X/Y problems where the ticket brings its own "solution".

just like https://nohello.net/ there are equivalents for this like https://stopsloppypasta.ai and https://406.fail/ but I see more and more of it in my company and it often feels like I'm just talking to the person's claude through two layers of redirection...

our management is fully onboard the AI train, we're encouraged to vibe code and vibe review (but somehow still own the result) so they don't see this as problematic. they have praised people for doing it, even! I'm wondering if this is just how things are now.

u/Le_Vagabond — 3 days ago
▲ 0 r/devops

How do you track which GitHub Actions workflows cost the most?

We have ~40 repos with GitHub Actions and our monthly bill keeps climbing. The billing page only shows org-level totals by OS type, but I can't figure out which specific workflow or repo is the biggest cost driver without manually calling the API for every single run.

How are you all handling this? Do you:

  1. Just accept the bill and move on?

  2. Build an internal script to calculate per-workflow costs?

  3. Use a third-party tool? (I haven't found one that does this well)

  4. Manually audit workflow files once in a while?

Our bill went from $800 to $1400 /month in 3 months and I can't explain why to my manager. Would love to hear how others deal with this.
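
For what the internal-script option might look like, here is a minimal sketch of the aggregation step only. It assumes job durations have already been exported (for example from the GitHub API) into simple records. The record shape, the `cost_by_workflow` helper, and the sample data are hypothetical, and the per-minute rates and round-up-per-job billing are assumptions to check against current GitHub pricing.

```python
import math
from collections import defaultdict

# Assumed per-minute rates in USD; verify against your current GitHub pricing.
RATES = {"ubuntu": 0.008, "macos": 0.08}

def cost_by_workflow(jobs, rates=RATES):
    """Sum estimated cost per (repo, workflow), most expensive first.

    Each job is a hypothetical dict with keys:
    repo, workflow, os, duration_seconds.
    """
    totals = defaultdict(float)
    for job in jobs:
        # Assumption: each job is billed in whole minutes, rounded up.
        minutes = math.ceil(job["duration_seconds"] / 60)
        totals[(job["repo"], job["workflow"])] += minutes * rates[job["os"]]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Hypothetical records, for illustration only.
jobs = [
    {"repo": "web", "workflow": "lint", "os": "macos", "duration_seconds": 540},
    {"repo": "web", "workflow": "lint", "os": "macos", "duration_seconds": 600},
    {"repo": "api", "workflow": "test", "os": "ubuntu", "duration_seconds": 1250},
]

for (repo, workflow), cost in cost_by_workflow(jobs):
    print(f"{repo}/{workflow}: ${cost:.2f}")
```

Running something like this on a schedule and comparing month over month would make it easier to show a manager which workflow or repo drove an increase.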

reddit.com
u/Zealousideal_Tip4089 — 2 days ago
▲ 181 r/devops

How are you actually upskilling to survive the shift from traditional DevOps to Platform Eng / MLOps?

Hey everyone,

I’m currently a Cloud/DevOps engineer. With AI rapidly automating things like boilerplate YAML, standard CI/CD pipelines, and basic log analysis, I'm trying to be proactive about my next career move.

For those already adapting:

  • Where do you see traditional DevOps going over the next few years?
  • What do you think is the most reliable, high-demand career shift adjacent to DevOps right now? (e.g., Platform Engineering, MLOps, DevSecOps?)

Would love to hear your thoughts on where to focus my upskilling. Thanks!

reddit.com
u/Fantastic-Leg-5806 — 3 days ago