u/MembershipUnited5355 — reddlx

What actually gets lost when on-call rotates isn’t in the runbooks

When you hand off on-call, the runbook covers the basics: where the dashboards are, who to page if something’s really bad… all that standard stuff.

But there's this other layer, the things you actually say when passing it on. Like:

- "Hey, if Alert X fires, everyone thinks ‘check Y,’ but don’t bother with Y at all. Just go straight to Z."

- "If logging for Service A stops suddenly, it's usually not a logging problem at all. Something upstream probably died quietly."

That kind of info gets shared in conversation during handoffs but isn't documented anywhere official.

You say it once while sitting across from someone and hope they remember later. And if they roll off rotation soon after, that knowledge just evaporates.

reddit.com

u/MembershipUnited5355 — 6 days ago

▲ 0 r/kubernetes

Some of the worst deploy incidents came from things the diff never showed

Deploy-related incidents have started feeling like their own thing to me.

Most of the time, it’s not even the actual code change that slows things down, at least not directly.

Someone deploys something new, a system breaks somewhere, and everyone immediately dives into the diff: application logic? Config changes? Feature flags? Whatever was officially modified.

But weirdly often… that’s not where the problem is.

The real issue ends up being something adjacent: something around or connected to the deploy but not technically part of "the change."

Resource limits that looked fine before this new traffic pattern came in. Dependencies getting restarted in an unexpected order.

It feels like when you make one small update, it ripples through a much bigger operational surface area than any code review or validation would show. That part is fairly obvious. What's worse is that incidents drag on because people keep debugging what was declared as “changed” instead of focusing on all the destabilized systems around it.

I’ve seen some teams start making checklists: not generic rollout plans, but specific lists based on historical failures tied to certain types of deploys, like "This kind usually breaks X first."

But is that really aligned with current practices? (Asking for efficiency reasons.)
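
For what it's worth, a checklist like that can be very small. Here's a minimal sketch, assuming deploys can be tagged with a "kind"; the kinds and the check text below are hypothetical placeholders loosely based on the two examples above, and a real list would come from a team's own incident history.

```python
from dataclasses import dataclass, field


@dataclass
class DeployChecklist:
    """What to look at first for one kind of deploy, ordered by past failures."""
    deploy_kind: str
    first_checks: list[str] = field(default_factory=list)


# Hypothetical entries for illustration only.
CHECKLISTS = {
    "traffic-shape-change": DeployChecklist(
        "traffic-shape-change",
        ["resource limits on the deployed service",
         "resource limits on downstream services"],
    ),
    "dependency-restart": DeployChecklist(
        "dependency-restart",
        ["restart order of dependencies",
         "readiness of upstream services"],
    ),
}


def first_checks(deploy_kind: str) -> list[str]:
    """Return the adjacent systems to inspect before re-reading the diff."""
    checklist = CHECKLISTS.get(deploy_kind)
    return checklist.first_checks if checklist else []


for item in first_checks("dependency-restart"):
    print("check first:", item)
```

The point of keying on deploy kind rather than on service is that the list stays short enough to actually read during an incident.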

u/MembershipUnited5355 — 6 days ago

▲ 0 r/devops

Runbooks survive rotation changes. A lot of actual incident knowledge doesn’t.

One thing I've noticed after rotation changes is that all the runbooks, dashboards, alerts, and postmortems are still there.

But what often disappears? The undocumented filter logic people develop over time. Like which alerts are usually just noise and can be ignored. Which metrics spike with every deploy but don’t actually mean anything’s wrong. Which services look suspicious at first glance but turn out to be fine in reality.

And which checks waste 20 minutes every time because they lead to dead ends.

There are always a few folks on each rotation who handle incidents faster than others, because they've seen the same failure pattern before and *already* know not to go down those useless paths.

You’d have one engineer leave, and nobody would realize how much of that knowledge lived only in her head until later. The same incidents would keep happening with the same tools available, but resolution took way longer this time around, since everyone defaulted back to the obvious routes first instead of cutting straight through like she used to do from experience alone.

u/MembershipUnited5355 — 6 days ago

▲ 0 r/sre

The first place you look after an alert fires is usually not random

One thing you might have noticed: what postmortems consistently erase is why the first debugging path felt correct at the time.

You read the ‘writeup’ and it looks like the responder went straight to the failing dependency.

But in reality they probably lost 15 minutes in the wrong service first because the symptoms matched the last outage, or the alert wording biased them, or one dashboard looked “close enough.”

That decision process right there almost never survives into the final document even though it probably shapes incident response quality more than the root cause itself.

u/MembershipUnited5355 — 6 days ago

▲ 0 r/sre

What gives SRE work meaning for you?

Truth is, SRE is one of the few disciplines where invisible work carries enormous consequences.

Most people never notice the work but are impacted by it every single day: think hospitals, businesses, security, and financial systems (massive deal!).

Since it rarely gets the credit it deserves, a lot of SREs must have a really strong intrinsic drive to keep tackling it every day.

I personally respect operators who can handle that level of responsibility calmly and competently.

But the next generation of infrastructure work will also reward people who are adaptable and willing to evolve beyond the scope of traditional operational thinking.

Interested in hearing how others in the field think about the meaning and the long term direction of the work.

u/MembershipUnited5355 — 6 days ago

▲ 0 r/devops

Is DevOps a “meaningful” career?

Hey devs,

Do you feel your work is meaningful enough to justify the effort you put into it?

Maybe one of your projects contributes to life-saving impact, or quietly powers logistics systems that keep our food supply moving, or even on a smaller level simply saves consumers’ time.

Bonus: if yes, what kind of metrics would you use to justify calling it “positive impact”?

u/MembershipUnited5355 — 6 days ago

▲ 59 r/devops

spent two weeks chasing slow queries before realizing Slack handlers were holding the DB pool

The team had two weeks of intermittent timeouts before they understood what they were actually looking at. The initial on-call engineer opened traces and found HTTP requests waiting almost 20 seconds to get a connection from the Go database/sql pool.

First move was to check which specific endpoints were contending for connections, hoping it came down to one pool, because that would have scoped the problem.

What they found was that the issue was widespread and not confined to any single connection pool. So they went wide instead: pulled historical HTTP traffic, checked PubSub metrics, looked at Heroku Postgres stats. Nothing obviously wrong.

The decision at that point was to just fix whatever looked slow: materialized views, new indexes, rewritten joins. Then they closed the incident.

Within a couple of days, lightning struck twice. The second on-call pulled the same dashboards, saw the same connection pool wait pattern, and still found no discernible concentration in the slow requests. Someone suggested adding a one-second lock timeout to all transactions, not to fix anything, just to force the system to surface which requests were holding connections longest.

Deployed it, nothing broke, still no root cause. 24 deploys’ worth of fixes later… the root cause turned out to be an unnecessary transaction wrapping every Slack modal submission. Many small fast transactions were collectively holding the pool. The Slack events had been processed synchronously inside the HTTP request lifetime the whole time, and nobody had looked there because it didn’t pattern-match to a “slow query” problem.
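
A rough back-of-the-envelope for why many fast transactions can still pin a pool, using Little's law (average connections held = arrival rate × average hold time). The pool size, submission rate, and hold time below are made-up numbers for illustration; the actual figures from the incident aren't known here.

```python
def avg_connections_held(arrivals_per_sec: int, hold_ms: int) -> float:
    """Little's law: average items in the system = arrival rate * time in system."""
    return arrivals_per_sec * hold_ms / 1000


POOL_SIZE = 10  # hypothetical pool size

# Hypothetical load: each modal submission opens a transaction that holds
# a connection for 100 ms, and 100 submissions arrive per second.
slack_held = avg_connections_held(arrivals_per_sec=100, hold_ms=100)

# If the handlers alone hold the whole pool on average, every other HTTP
# request has to queue for a connection, even though no query is slow.
pool_saturated = slack_held >= POOL_SIZE

print(f"connections held by Slack handlers on average: {slack_held}")
print(f"pool saturated: {pool_saturated}")
```

None of the individual transactions would show up in a slow-query view, which is consistent with why nobody looked there first.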

u/MembershipUnited5355 — 7 days ago

▲ 74 r/sre

We had a 40 minute outage and nothing alerted because traffic dropped 95%

Could someone explain to me how we had 0 alerts for a 40 minute outage?

Our users were getting errors for 40 minutes while the status page was green. Our on-call engineer was not paged, and dashboards were not red.

Way we found out: a customer emailed support. support slacked the on-call. on-call looked at dashboards, saw green, said “i don’t see anything.” support pushed back, customer is pretty insistent. on-call actually opens the app and tries to use it themselves and yeah it’s broken.

What happened: we monitor error rates as a percentage of total traffic. our traffic had dropped by 95% because a load balancer config change had been routing most users to a dead endpoint. the small amount of traffic that was hitting the right endpoint was healthy. so our error RATE was fine. our error COUNT was fine. our absolute traffic volume metric existed but nobody had an alert on it.

we had the metric. we looked at it in reviews sometimes. we just never said “if this drops 80% in 5 minutes, page someone.”
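
For anyone wondering what that rule amounts to, here's a minimal sketch in Python rather than in any particular alerting tool's syntax. The per-minute request counts, the 5-minute window, and the function names are assumptions for illustration; only the "drops 80% in 5 minutes" threshold comes from the sentence above.

```python
def error_rate(errors: int, requests: int) -> float:
    """Errors as a fraction of traffic. Note that zero traffic reads as 0.0, i.e. healthy."""
    return errors / requests if requests else 0.0


def traffic_dropped(per_minute_counts: list[int], window: int = 5,
                    drop_threshold: float = 0.8) -> bool:
    """True when the last `window` minutes carry at least `drop_threshold`
    less traffic than the `window` minutes before them."""
    if len(per_minute_counts) < 2 * window:
        return False  # not enough history to compare
    baseline = sum(per_minute_counts[-2 * window:-window])
    recent = sum(per_minute_counts[-window:])
    if baseline == 0:
        return False  # nothing to compare against
    return (baseline - recent) / baseline >= drop_threshold


# Hypothetical numbers: 1000 req/min, then a 95% drop to 50 req/min.
counts = [1000] * 5 + [50] * 5

print("error rate before:", error_rate(5, 1000))    # small, looks healthy
print("error rate after: ", error_rate(0, 50))      # still looks healthy
print("page on traffic drop:", traffic_dropped(counts))
```

The error-rate check stays quiet in both windows because the surviving traffic is healthy; only the volume comparison notices that most of the traffic is gone.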

the thing that’s haunting me: we’ve had a version of this conversation in two previous retrospectives. both times we said “we should alert on traffic volume drops.” both times it was added to a backlog and deprioritized.

u/MembershipUnited5355 — 7 days ago

▲ 0 r/devops

APM pointed us to the wrong service during a payments latency incident

Did our quarterly incident review yesterday and something’s still bugging me…

API latency spike three weeks ago, 47 minutes to resolve, traced back to connection pool exhaustion on RDS. Ok fine.

In the review i asked the engineer who worked it what they checked first. “went straight to DB, payments latency almost always goes back there” confident, made sense.

But then i pulled the actual pagerduty timeline, and the first 14 minutes were spent in APM traces chasing an upstream auth service that had nothing to do with it. total dead end. doesn’t appear anywhere in the postmortem. the written narrative goes: alert fired, checked DB, found it, fixed it. clean. wrong.

i don’t think they were being dishonest. i think they genuinely remembered it that way.

u/MembershipUnited5355 — 7 days ago

▲ 5 r/devops

Why our canary didn't roll back: hardcoded Prometheus endpoint in a separate HelmRelease

So we had a fun tuesday. “fun”

Background: we run a fairly standard k8s setup on EKS, 40 services, nothing crazy.

deploy pipeline goes through ArgoCD, we do canary via flagger with a prometheus-based analysis template. has worked fine for like eight months.

2:17 pm I get paged. error rate on checkout-service climbing past 2%, flagger’s canary analysis is failing but it’s not rolling back. just… sitting there. suspended state. customers hitting errors, canary still live at 20% traffic split.

first thing i do is check the flagger logs. analysis is failing because the prometheus query is returning no data, not bad data, no data. so flagger can’t evaluate the metric and it’s going into this weird limbo instead of failing safe. ok, prometheus issue then. i go check prometheus, it’s up, scraping looks fine, i run the query manually and it returns results no problem.

15 minutes wasted there.

SO i’m thinking maybe it’s a namespace thing, flagger can’t reach the prometheus endpoint, we had an ingress policy change a few weeks back. i start tracing network policy rules and slack the platform team asking if anything changed on the observability namespace. they say no. i spend another twenty minutes here going through netpol yaml diffs. nothing.

what actually got me looking in the right place was one of our senior engineers asking “which prometheus?” offhand in the thread. we have two. the one flagger was configured to query had been migrated off its old service name as part of a prometheus-operator upgrade in staging that got promoted to prod last thursday. the internal DNS name changed. flagger’s helmrelease still had the old address hardcoded.

Our runbook for “flagger canary stuck in suspended” literally says step 3 is “Verify prometheus connectivity using the endpoint in values.yaml” which we did, and that endpoint was fine, but it just wasn’t the one flagger was actually using because the flagger config is in a separate helmrelease that nobody remembered has its own prometheus URL override.

So we manually rolled back the canary, patched the flagger config, redeployed. took about four minutes after knowing what it was.

Now the most frustrating part is that I had a vague sense of something similar having happened before I joined this team. There’s a slack thread from like 14 months ago that references “prometheus endpoint mismatch”, but it’s not in any runbook and nobody who was here then is still on the rotation. So we just wasted time rediscovering it.

updating the runbook NOW to explicitly call out that flagger has its own prometheus config independent of everything else, and adding a config validation step to the deploy pipeline that actually curls the configured endpoint from within the flagger pod. probably should have done that the first time, whoever that was.
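
Rough sketch of what that validation step could look like, written in Python instead of curl so the "no data" case can be checked too. It uses Prometheus's instant-query HTTP API (`/api/v1/query`); the config names, URLs, and the `up` query are hypothetical placeholders, and it would need to run from inside the flagger pod to exercise the same DNS and network policy.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable


def http_get_json(url: str, timeout: float = 5.0) -> dict:
    """Fetch a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def query_has_data(base_url: str, promql: str,
                   fetch: Callable[[str], dict] = http_get_json) -> bool:
    """True only if the endpoint is reachable AND the query returns samples."""
    url = (base_url.rstrip("/") + "/api/v1/query?"
           + urllib.parse.urlencode({"query": promql}))
    try:
        body = fetch(url)
    except (OSError, ValueError):
        return False  # unreachable (e.g. stale DNS name) or non-JSON reply
    return body.get("status") == "success" and bool(body.get("data", {}).get("result"))


def validate_endpoints(configured: dict[str, str], promql: str,
                       fetch: Callable[[str], dict] = http_get_json) -> list[str]:
    """Return the names of configs whose Prometheus URL fails the check."""
    return [name for name, url in configured.items()
            if not query_has_data(url, promql, fetch)]


# Hypothetical: every place a Prometheus URL is configured, overrides included.
CONFIGURED = {
    "values.yaml": "http://prometheus.monitoring:9090",
    "flagger-helmrelease": "http://prometheus-old.monitoring:9090",
}

if __name__ == "__main__":
    failed = validate_endpoints(CONFIGURED, "up")
    raise SystemExit(1 if failed else 0)
```

Treating an empty result as a failure is deliberate: in this incident the query returned no data rather than bad data, and that was the case nothing failed safe on.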

u/MembershipUnited5355 — 7 days ago

▲ 0 r/devops

IAM drift keeps recurring… when do you turn a fix into a CI gate vs leave it as a runbook note?

CI deploy fails due to IAM drift. The Sr Engineer finds it, fixes it, closes the ticket.

Weeks later, different service, different engineer, same root cause shows up again.

Not asking about tooling or documentation here… assume monitoring + CI/CD gates already exist and log a lot of this “class” of problems.

What I’m curious about is:

When you moved prevention earlier (CI / deploy / monitoring gates), how did you decide what stays as “incident-time knowledge” vs what gets promoted into a hard pre-deploy check?

For example: if an IAM drift issue is discovered during a deploy, do you treat the fix as:

- something you only add to the postmortem/runbook (so the next engineer still has to recognize it), or

- something that becomes a CI gate like “fail deploy if IAM policy diff ≠ baseline”?
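
To make the second option concrete, here's a minimal sketch of such a gate. It assumes the baseline and live policies have already been exported as JSON-like dicts keyed by policy name (how you fetch the live ones depends on your cloud tooling and isn't shown); the policy names and contents are hypothetical.

```python
import json


def normalize(policy: dict) -> str:
    """Canonical JSON so key order and whitespace do not count as drift."""
    return json.dumps(policy, sort_keys=True, separators=(",", ":"))


def iam_drift(baseline: dict[str, dict], live: dict[str, dict]) -> list[str]:
    """Names of policies added, removed, or changed relative to the baseline."""
    names = sorted(set(baseline) | set(live))
    return [
        name for name in names
        if name not in baseline
        or name not in live
        or normalize(baseline[name]) != normalize(live[name])
    ]


def gate(baseline: dict[str, dict], live: dict[str, dict]) -> int:
    """Exit code for CI: 0 when live matches the baseline, 1 on any drift."""
    drifted = iam_drift(baseline, live)
    for name in drifted:
        print(f"IAM drift: {name}")
    return 1 if drifted else 0


# Hypothetical policies for illustration.
BASELINE = {"deploy-role": {"Effect": "Allow", "Action": ["s3:GetObject"]}}
LIVE = {"deploy-role": {"Action": ["s3:GetObject", "s3:PutObject"], "Effect": "Allow"}}

print("exit code:", gate(BASELINE, LIVE))
```

One caveat with this version: list order inside a policy (for example the `Action` array) still counts as a difference, so a real gate would probably want to sort those before comparing.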

u/MembershipUnited5355 — 11 days ago

▲ 0 r/sre

What’s one concrete change that made repeat incidents cheaper to diagnose instead of re-learning the same root cause each time?

Something I keep noticing after production incidents:

The fix gets merged, the immediate issue is resolved, and everyone moves on.

A few months later, a very similar failure happens again. Different symptoms, same underlying cause. The team ends up re-deriving the same debugging path from scratch because the useful part of the last incident never really became operational knowledge.

Sometimes there’s a runbook, but it explains what happened instead of what to check first next time. Sometimes the context behind a mitigation or alert threshold only exists in someone’s head.

Feels like less of a monitoring/tooling issue and more of a “decision memory” issue.

For teams that are actually good at reducing repeat debugging effort: what concretely changes after an incident? Not asking about tools so much as process, habits, ownership, review steps, escalation flow, etc.

u/MembershipUnited5355 — 11 days ago