r/sre

▲ 0 r/sre

AI projects in our field...because we have to

With every company pushing AI everywhere, I was wondering what kind of easy, "cheap-thrill" projects I can do.

My company has mandated that everyone use AI, and it is simply not enough to ask an LLM questions and write skills; upper management wants to see something "new and shiny".

What are some cheap, shiny things I can build to satisfy upper management's shiny-new-toy syndrome? That way I can keep them occupied and spend more time on the things that scream for my attention.

u/Odd-Engineering6931 — 1 day ago

▲ 18 r/sre

How's the market in 2026?

Hey everyone!

I recently moved to a small country, and since remote work wasn't an option with my previous employer, I had to leave my job. I used to work at a giant corporation on a project involving government infrastructure. I can't go into details due to an NDA, but yeah – it was a government contract, something along those lines. Anyway, that's not the point.

I've polished my LinkedIn, spruced up my CV, and already received one so-so offer, but I'm not planning to settle just yet. I'd confidently rate myself as a Mid-level engineer, though of course I have some gaps in my knowledge (they say impostor syndrome only hits real pros – ha-ha!).

Here's my question – TL;DR: What's the current state of the job market? Any job-hunting tips? Could you recommend any types of projects (from your network or experience) that a Mid-level should go for, and which to avoid? I'm primarily looking for remote work since I'm not based in Europe or the US, so on-site roles aren't feasible for me – I simply don't have a work permit. Is it even realistic to land a remote SRE position this year? It used to be easier.

And yeah, thanks for your time and insights! When you comment, please mention your level (Jun/Mid/Senior) so I can understand your perspective – no bias at all, I promise!

u/Lazy_Cranberry4545 — 2 days ago

▲ 0 r/sre

How do you protect cloud infrastructure from outages without over-engineering?

I keep getting dragged into debates that start with "what if AWS, Azure or GCP goes down?" and end with proposals for triple-provider setups that nobody can run. We need to protect ourselves from outages, but we also have finite humans and brainpower. For us, the middle ground has been multi-availability-zone as a baseline, multi-region for the systems that justify it, and backups and disaster recovery plans that do not depend on the same control plane and get exercised on purpose.

The subtle failure mode has been configuration entropy: primary and failover stacks drifting apart over time until resilience is theoretical. Terraform everywhere helped, but only once we treated drift detection and clickops discovery as ongoing work rather than an annual audit, and had a way to reconstruct IaC from reality when we needed to rebuild.

People who have been through big outages: what is your minimum viable set of patterns and practices that keeps a medium-to-large estate from going dark, without building an architecture nobody wants to operate?

u/Bright-View-8289 — 3 days ago

▲ 132 r/sre+17 crossposts

Walks the full cmd/compile pipeline in order: package names, data structures, and the SSA construction that drives inlining, escape analysis, bounds-check elimination, and register allocation, with flags to observe each phase directly.

This one took a while, it's probably the longest thing I've written on this blog. I wanted to do a proper end-to-end walkthrough of cmd/compile: real package names, real data structures, diagrams for the AST and SSA CFG, and the flags you actually need (-m, -m=2, GOSSAFUNC, -S) to observe each phase yourself rather than just take my word for it.

Covers the full pipeline: lexer → parser → type checker → IR lowering → SSA construction → optimization passes (inlining, escape analysis, BCE, nil check elimination, register allocation) → architecture-specific code emission.

Hope it's useful — happy to answer questions or push back on anything that looks wrong.

blog.gaborkoos.com

u/OtherwisePush6424 — 4 days ago

▲ 765 r/sre+1 crossposts

Uber left PagerDuty after using it for 12 years.

I wonder what took them so long. PagerDuty seems to have become one of those heavyweight products that are so content in their illusion of market dominance that they have stopped innovating. But until the enterprise CFOs wake up and ask why is this costing us 5k per month, they are going to stay in their bubble.

I last used PD 3 years ago, and the UI had not changed in years, looked like something out of a 90s app. Pricing was our way or the highway.

No wonder people are leaving it for other solutions.

u/thomsterm — 6 days ago

▲ 34 r/sre

Am I a glorified Observability Engineer?

Since joining my current team, I'm mostly working on setting up monitoring on clusters, creating and optimizing alerts and dashboards, and building automation around that. Since we have loads of different microservices with different monitoring approaches, it’s become my daily job, with occasional on-call duties.
I am taking on other tasks as well, like FinOps, AI integration for self-healing, etc., but the sheer amount of monitoring work makes me less productive on those other projects.

I was wondering what this looks like for others: is it normal for SREs to spend most of their time entangled in monitoring work?

u/Fun-Adeptness9700 — 5 days ago

▲ 0 r/sre

Need advice on how to start as a freelancer

Looking for advice and opinions from any freelancers among you. If you are not one, please keep your opinions to yourself.

I want to start off as an SRE and Platform Engineering freelancer, and wanted to ask:

What platforms do you use to get gigs?
How do you position and promote yourself in terms of offerings, e.g. setting up an observability stack or a developer platform?
How has your experience been?
Any other general advice for a rookie?

u/Sea-Examination7503 — 4 days ago

▲ 0 r/sre

Rotating on-call means engineers get paged on unfamiliar systems. How are teams handling the cold-start investigation problem?

The cold-start problem: when a rotation puts an engineer on a system they don't own, the first 30-60 minutes of an incident go to orientation work. What does normal look like, what changed recently, where to start. The investigation itself hasn't started yet.

A few things that can potentially reduce it, each with a published reference behind it.

Runbooks in the service repo, not a wiki. GitLab publishes theirs publicly at runbooks.gitlab.com. The value isn't just the content: co-locating the runbook with the code means whoever's on call finds it without hunting. A runbook scattered across different sources gets skipped at 2 AM.

Alert context at page time. PagerDuty's Rich Incidents documentation covers this very well: embed the dashboard link, the runbook URL, and the last few deploys directly in the page. The engineer gets context before they open a laptop, not after spending 20 minutes hunting for links and working out what relates to what.
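As a rough illustration of what "context at page time" can look like, here is a minimal Python sketch that assembles an alert body with the dashboard link, runbook URL, and recent deploys attached. The field layout loosely follows the general shape of PagerDuty's Events API v2 (`payload`, `custom_details`, `links`) as I understand it, so check it against the current docs before relying on it. The `build_page` helper, the service name, and all URLs are hypothetical.

```python
from datetime import datetime, timezone


def build_page(service, summary, dashboard_url, runbook_url, recent_deploys, max_deploys=3):
    """Return a page body that carries orientation context with the alert.

    Hypothetical helper: the layout mirrors an Events-API-v2-style body,
    but nothing here is actually sent anywhere.
    """
    # Newest deploys first; ISO-8601 UTC strings sort correctly as text.
    newest_first = sorted(recent_deploys, key=lambda d: d["deployed_at"], reverse=True)
    return {
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": service,
            "severity": "critical",
            "timestamp": datetime.now(timezone.utc).isoformat(),
            # Keep only a few deploys so the page stays readable on a phone.
            "custom_details": {"recent_deploys": newest_first[:max_deploys]},
        },
        "links": [
            {"href": dashboard_url, "text": "Dashboard"},
            {"href": runbook_url, "text": "Runbook"},
        ],
    }


page = build_page(
    service="checkout-api",  # hypothetical service
    summary="checkout-api p99 latency above 2s",
    dashboard_url="https://grafana.example.com/d/checkout",
    runbook_url="https://git.example.com/checkout-api/blob/main/RUNBOOK.md",
    recent_deploys=[
        {"sha": "a1b2c3d", "deployed_at": "2026-01-10T09:00:00Z"},
        {"sha": "e4f5a6b", "deployed_at": "2026-01-12T14:30:00Z"},
    ],
)
```

The point of the design is that the runbook link lives next to the alert definition, so whoever changes the alert is also looking at the link that goes out with it.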

What hasn't worked as well, in my experience, is generic postmortem action items. "Improve runbook for X service" never survives the week. The runbooks that get maintained are tied to alerting, so engineers update them when the alert fires wrong.

I'm building in this domain and am happy to answer specific questions about how we've approached any of these, or what we haven't solved.

u/gaurav_sherlocks_ai — 4 days ago

▲ 10 r/sre

18 YOE in IT (5.5 as Observability Engineer, AKS/New Relic) trying to formalize the jump to SRE — what actually matters in interviews?

18 years in IT overall (started in helpdesk/lab admin, 10 of those years at Juniper Networks across QA/test engineering), the last 5.5 as an Observability Engineer on a SaaS platform running on AKS. Day to day is mostly New Relic — alert design, dashboards, APM, some NRQL work that goes deeper than the defaults — plus Fluent Bit for log shipping and Python/PowerShell for internal tooling and custom metrics pipelines.

My contract winds down at the end of this year, so a transition that used to be a "someday" goal is now an active, time-boxed one. I want to move into a proper, production/customer-facing SRE role rather than just another observability/monitoring title, and I'd rather close real gaps now than find out about them in an interview.

Some of what I've actually owned: alert frameworks built around FACET-based NRQL (steady-state dashboards faceted by container, not pod, learned that one the hard way), a New Relic region migration, RCA work using distributed tracing to find gaps between synthetic and APM signal, and building custom metrics pipelines that feed New Relic from SQL/PowerShell.

Where I'm less sure of myself: hands-on K8s admin depth vs. "I can read a dashboard and explain a CrashLoopBackOff," real infra-as-code (Terraform/ARM) vs. just monitoring infra someone else provisioned, and owning SLOs/error budgets rather than just building the dashboards that report on them.

For people who made a similar observability → SRE jump:

What was the actual gap that mattered in interviews — not the resume gap, the real one?
Is CKA worth the time investment, or do interviewers not really probe that deep on K8s admin for a production SRE role?
How much IaC depth do you actually need to do the job vs. just be able to speak credibly to it?

Appreciate any honest input, especially from people who've sat on the hiring side of this transition.

u/naveen0109 — 5 days ago

▲ 97 r/sre

AWS DynamoDB was down for hours on June 28 while the status page said "operating normally." Cost us 3 hours of assuming it was our fault.

DynamoDB us-east-1 was having a bad day on June 28 and we lost about 3 hours assuming it was our fault.

Errors started climbing, we went straight to our own code. Questioned a deploy from earlier that morning, pulled in two people who weren't on call, spent time we didn't have going through changes that turned out to be fine. The AWS status page was green the whole time, so we kept looking inward.

Eventually someone just tried writing to DynamoDB directly from their laptop and it was clearly broken on AWS's end. That's when we checked Twitter and found a bunch of other people hitting the same thing.

The status page didn't update for another hour after that. What stung was that this was a solvable problem. A simple check on our own write success rate, with our own threshold, would have told us within minutes that the failure wasn't in our code. We've since set that up for every external dependency we use. Obvious in hindsight, annoying that it took this to get there.
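A check like the one described could be sketched roughly as follows in Python. This is an illustration under assumptions, not the poster's actual setup: the `DependencyHealth` class, the window size, and the 95% threshold are all hypothetical values to tune per dependency.

```python
from collections import deque


class DependencyHealth:
    """Rolling success rate for calls to one external dependency."""

    def __init__(self, name, window=200, min_success_rate=0.95, min_samples=20):
        self.name = name
        self.min_success_rate = min_success_rate
        self.min_samples = min_samples
        # deque(maxlen=...) drops the oldest outcome automatically.
        self.outcomes = deque(maxlen=window)

    def record(self, ok):
        self.outcomes.append(bool(ok))

    def success_rate(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def is_degraded(self):
        # Avoid alerting on a handful of calls right after startup.
        if len(self.outcomes) < self.min_samples:
            return False
        return self.success_rate() < self.min_success_rate


dynamo = DependencyHealth("dynamodb-writes")
for i in range(100):
    dynamo.record(ok=(i % 4 != 0))  # simulate every fourth write failing

print(dynamo.name, dynamo.success_rate(), dynamo.is_degraded())
```

In a real service, `record()` would be called from the wrapper around each write, and `is_degraded()` would feed an alert that names the dependency, so the first question in an incident becomes "is it them?" rather than "what did we deploy?".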

u/Holiday-Record7341 — 7 days ago

▲ 5 r/sre

What would make an ML curriculum for SREs actually useful day-to-day?

I got tired of ML tutorials that teach through flowers and passenger manifests.

https://github.com/laban254/ml-for-infrastructure

As someone who spends time looking at dashboards, digging through log files, and getting paged at bad hours, I wanted to learn ML through problems I actually face, not toy datasets. So over the past few months, I put together a curriculum of 27 Jupyter notebooks, all framed around real observability and SRE scenarios.

A few examples: Isolation Forest anomaly detection on synthetic Prometheus metrics with real daily seasonality (with a slider to see how the contamination parameter changes alert volume, and a Z-score comparison to show why static thresholds miss seasonal anomalies). Log clustering with TF-IDF + KMeans that auto-names clusters from keywords and flags novel patterns it hasn't seen before. KS-test drift detection for when a production distribution has permanently shifted. A PyTorch LSTM that does recursive forecasting with a preemptive capacity alert. MLflow tracking for a full hyperparameter sweep with inline run comparison. And a small LoRA fine-tune that turns raw log lines into structured JSON.
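To give a flavor of the KS-test drift idea, here is a small standard-library sketch of a two-sample Kolmogorov-Smirnov check. It is not taken from the notebooks; the function names and the synthetic latency data are made up for illustration, and in practice `scipy.stats.ks_2samp` would do the same job with a proper p-value.

```python
from bisect import bisect_right
from math import sqrt


def ks_statistic(baseline, current):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    a, b = sorted(baseline), sorted(current)
    n, m = len(a), len(b)
    largest = 0
    for v in set(a) | set(b):
        # Cross-multiplied counts keep the comparison in exact integers.
        gap = abs(bisect_right(a, v) * m - bisect_right(b, v) * n)
        largest = max(largest, gap)
    return largest / (n * m)


def has_drifted(baseline, current, coefficient=1.358):
    """Compare D to the large-sample critical value (1.358 is roughly alpha = 0.05)."""
    n, m = len(baseline), len(current)
    critical = coefficient * sqrt((n + m) / (n * m))
    return ks_statistic(baseline, current) > critical


last_month_latency_ms = list(range(100, 200))  # synthetic baseline
this_week_latency_ms = list(range(150, 250))   # same shape, shifted up by 50 ms

print(ks_statistic(last_month_latency_ms, this_week_latency_ms))
print(has_drifted(last_month_latency_ms, this_week_latency_ms))
```

Because the test compares whole distributions, it reacts to a permanent shift like this one even when no single sample would trip a static threshold.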

Genuinely curious what people who actually do this job think: what production scenarios am I missing that would be worth adding? Does this kind of framing (real infra data instead of toy datasets) actually help build intuition, or is it a gimmick?

u/kibe254 — 5 days ago

▲ 0 r/sre+1 crossposts

What’s the biggest bottleneck during incident investigations for your team?

I’ve been reading a lot of incident postmortems lately, and one thing that stands out is how different every team’s investigation process is.

Some people jump straight into logs, others start with dashboards or traces, while some rely heavily on service dependencies.

In your experience, what’s the biggest bottleneck during the first 20–30 minutes of an incident? Is it finding the right signal, correlating information across systems, or something completely different?

Curious to hear what has actually improved your team’s workflow.

u/ashhash6007 — 5 days ago

▲ 1 r/sre

Seeking Advice: True Zero-Downtime Redis Sentinel on Kubernetes (Node.js)

Hey everyone, looking for some architectural advice on handling Redis failovers gracefully under high traffic.

Our Setup:

Node.js backend using ioredis

Redis Sentinel (Bitnami Helm Chart) running on AWS EKS (Karpenter for node provisioning)

1 Master, 2 Replicas

What we've done so far: We found that the default Bitnami preStop hook uses CLIENT PAUSE during pod termination, which freezes our app for ~20s and causes massive TimeoutErrors.

We overwrote the preStop script to remove CLIENT PAUSE and instead trigger a SENTINEL FAILOVER immediately, followed by cleanly severing the TCP connections. On the Node.js side, we use ioredis with maxRetriesPerRequest: null and enableOfflineQueue: true.

The Result: When a node is drained, ioredis catches the dropped connection, buffers all incoming commands in memory, asks Sentinel for the new master, and flushes the queue once connected. The failover usually takes about 2 to 5 seconds. To the end user, this just looks like a slightly slower API request. No 500 errors.

My Questions for the community: While this works perfectly in testing, I know we can't guarantee a strict 2-second failover in production.

Under heavy traffic and large datasets, Sentinel elections and DNS propagation could easily push this delay to 5-10 seconds, or even 15 seconds or more.

If the delay extends to 10 seconds under massive traffic, our Node.js ioredis in-memory buffer will explode in size, potentially causing OOM crashes on the application side, or massive latency spikes when it finally flushes thousands of queued commands to the new master at once.
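One common way to bound that risk is to cap the buffer and expire stale commands instead of queueing without limit. The sketch below shows the idea in Python as a language-neutral illustration; it is not the ioredis API, and `BoundedCommandBuffer`, its limits, and the `send` callback are all hypothetical. In Node.js the equivalent would be whatever queue-size and timeout controls the client library actually offers, or a thin wrapper around it.

```python
import time
from collections import deque


class QueueFull(Exception):
    """Raised when the buffer is at capacity and the command is shed."""


class BoundedCommandBuffer:
    """Hold commands during a failover, but never without limits."""

    def __init__(self, max_commands=1000, max_age_s=5.0, clock=time.monotonic):
        self.max_commands = max_commands
        self.max_age_s = max_age_s
        self.clock = clock
        self.pending = deque()  # (enqueued_at, command) pairs

    def enqueue(self, command):
        if len(self.pending) >= self.max_commands:
            # Fail fast so the caller can return an error or a fallback value.
            raise QueueFull(f"dropping {command!r}: buffer at capacity")
        self.pending.append((self.clock(), command))

    def flush(self, send):
        """Send commands that are still fresh; return (sent, expired) counts."""
        sent = expired = 0
        while self.pending:
            enqueued_at, command = self.pending.popleft()
            if self.clock() - enqueued_at > self.max_age_s:
                expired += 1  # the original caller has likely timed out already
            else:
                send(command)
                sent += 1
        return sent, expired
```

With limits like these, memory use during a long failover is capped at `max_commands` entries, and the flush after reconnecting skips work nobody is waiting for anymore. The trade-off is that some requests fail fast during the failover instead of all of them slowing down.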

How do you handle this at scale?

Do you just accept the 5-10 second latency spike during a failover?

Is migrating to a managed service like AWS ElastiCache the only way to avoid this completely?

Would love to hear how folks are handling Redis HA edge cases at scale!

u/wildwarrior007 — 5 days ago

▲ 1 r/sre

Is there a missing pre-event layer in observability, or do current workflows already cover this?

Why is observability still mostly retrospective?

Most monitoring and observability workflows seem excellent at answering what crossed a threshold, what alert fired, and what happened after the incident became visible. But I keep wondering about the earlier window. In many systems, the alert is not the first thing that changes. Queueing, latency, cache behavior, load, memory pressure, or downstream coupling may start moving together before the visible incident.

So the question becomes: given a bounded historical trace, can we test whether the system entered a separable pre-event regime before the current alarm fired?

I’m not thinking of this as another alerting system. More like an offline audit of a past incident trace:

- start from one anonymized telemetry trace around an incident

- map raw metrics into a shared transition representation

- ask whether multiple channels began moving together before the current alarm

- compare that timing against the existing alarm or a tuned baseline

- classify the outcome as usable pre-event structure, no actionable signal, or unstable mapping
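To make the steps above concrete, here is a minimal Python sketch of such an offline audit. It rests on several assumptions of mine: the "transition representation" is reduced to a per-channel onset time (the first sustained excursion beyond k standard deviations of an early baseline window), "moving together" means at least `min_channels` onsets before the alarm, and the "unstable mapping" outcome is left out. All names, thresholds, and data are hypothetical.

```python
from statistics import mean, pstdev


def onset_time(samples, baseline_n=10, k=3.0, sustain=3):
    """First timestamp where a channel leaves its baseline band and stays out.

    samples: list of (timestamp, value) pairs in time order.
    """
    baseline = [v for _, v in samples[:baseline_n]]
    mu, sigma = mean(baseline), pstdev(baseline) or 1e-9
    run_start, run_len = None, 0
    for t, v in samples[baseline_n:]:
        if abs(v - mu) > k * sigma:
            if run_len == 0:
                run_start = t
            run_len += 1
            if run_len >= sustain:
                return run_start
        else:
            run_len = 0  # excursion ended, start over
    return None


def audit(channels, alarm_time, min_channels=2):
    """Classify one incident trace by how many channels moved before the alarm."""
    onsets = {name: onset_time(s) for name, s in channels.items()}
    early = {n: t for n, t in onsets.items() if t is not None and t < alarm_time}
    if len(early) >= min_channels:
        # Lead time is measured from the earliest channel to move.
        return "usable pre-event structure", alarm_time - min(early.values())
    return "no actionable signal", 0


def series(lo, hi, jump_at=None, jump_to=None, n=30):
    """Synthetic channel: alternates lo/hi, then optionally jumps to a new level."""
    return [
        (t, jump_to if jump_at is not None and t >= jump_at else (lo if t % 2 == 0 else hi))
        for t in range(n)
    ]


trace = {
    "queue_depth": series(10, 12, jump_at=20, jump_to=30),
    "p99_latency_ms": series(100, 104, jump_at=22, jump_to=150),
    "cpu_pct": series(50, 52),  # never leaves its band
}
print(audit(trace, alarm_time=27))
```

On real telemetry the hard parts are exactly the ones listed as failure modes: choosing the baseline window, handling seasonality, and trusting the timestamps.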

The distinction I care about is this: not “predict the future,” but audit a past incident and ask whether the telemetry had already entered a separable regime before the alert became operationally visible.

For people running production systems:

Does this sound like a real missing layer, or just overfitting the problem?

Do current observability workflows already cover this well enough?

Where would it fail in practice: noisy metrics, bad timestamps, lack of incident labels, false positives, trust, workflow integration?

I’m investigating this as part of a broader attempt to understand whether observability has a missing pre-event layer — or whether existing tools already cover it in practice and I’m just naming something teams already do informally.

u/BoringRock9997 — 6 days ago

▲ 0 r/sre

What does your team's ops automation stack look like, and is the setup actually painful?

How are SRE teams handling the atomic ops stuff today? Restart pod, vacuum table, rotate creds, replay DLQ, force-delete a stuck namespace, drain a node.

There are tools for different pieces of this:

Runtime / execution: Rundeck, Ansible Automation Platform, AWS SSM, Argo Workflows, Temporal...
Shared / portable library: Ansible Galaxy is config not ops, StackStorm Exchange stalled, Rundeck has no job registry
RBAC + per-action safety: AAP+SAML, custom homegrown, vault dynamic creds bolted on top
Audit + traceability: whatever the runtime has, usually thin and tied to that runtime

Most teams I've worked with end up stitching pieces together. Something like AAP plus a private git of collections plus SAML plus a custom audit pipeline plus a Slack bot for triggers.

Questions I have:

What does your team's stack actually look like for this? Single tool? Stitched?
Can dev teams write their own playbooks, or does authoring stay gatekept by SRE/platform?
Is the setup actively painful (slow to iterate, hard to onboard, scary in incidents), or does it work fine once it's in place?

(Engineering org size context useful - 50 vs 500 vs 5000 changes the answers a lot.)

u/gp42 — 5 days ago

▲ 5 r/sre+1 crossposts

[ Removed by moderator ]

u/Willing-Lettuce-5937 — 6 days ago

▲ 30 r/sre

Is this how an SRE's role actually is?

Around 3 months ago I started as a "senior SRE" in a fairly big company, but I'm really curious to know if this is what SREs typically do. Previously I was a platform engineer and imagined there'd be a lot more crossover than there is. In my prior company, PEs and SREs were mostly interchangeable titles and coexisted in the same teams.

For this role, the job description did emphasize that this team focuses on incident prevention and management efforts such as observability, load testing, disaster recovery, etc. But what I didn't quite realize is that the bulk of this team's work is around standardizing and enforcing those best practices rather than doing much of the "engineering" behind them. The observability portion of our work is mainly about assessing the monitoring stacks of our product teams and calling out how they can improve; the load testing work is mainly about promoting the habit of load testing and driving its adoption rather than actually building or implementing the technology behind it.

Most of our engineering hours are spent on what feel like marginal improvements and on rolling out AI capabilities for each of the areas I mentioned above. I would've imagined there'd be more technical involvement, especially in the things that drive "reliability", but no. We don't really touch the CI/CD process, we don't do any resource management or optimization, and we don't really do any infrastructure work, all of which I thought were probably more impactful to the "reliability" of a service. This team is also in a separate org and reporting line from the platform engineering and CloudOps teams, and our department is specifically the one called "Reliability". But I just feel like we're mostly doing the extra fluff that provides the final 5% of reliability, whilst the rest of it is up to the platform teams.

I don't know, maybe I'm coming at it with too much bias from my previous company, but I'm starting to wonder what this job even is. Is this a common kind of work for SREs in other companies?

u/hxbachigrillonionz — 9 days ago

▲ 0 r/sre

Where do AI incident/RCA tools actually fail under pager pressure?

We’re exploring AI-assisted incident response/RCA and trying to understand where these tools actually break down in real on-call situations.

For people who’ve used tools like Resolve, Traversal, Rootly, Cleric, Komodor, Datadog Bits AI, or built your own setup with Claude/MCP/scripts:

Where did it actually fail?

A few areas we’re trying to understand:

Confident but wrong RCA
Did the tool give a plausible explanation before it had enough evidence, and send you chasing the wrong thing during an incident?

Missing context across tools
Did it explain the alert/symptom but miss the real cause because the important context was in GitHub, deploy history, Kubernetes config, PagerDuty, Slack, feature flags, cloud changes, or internal runbooks?

Security/data concerns
Did the evaluation die because prod logs, traces, or incident data had to go to an external SaaS? Is data sovereignty a hard blocker for your team, or something you worked around?

Self-hosted/on-prem demand
Would running fully inside your environment actually matter, or are teams fine with SaaS if the tool is useful enough?

The write-access wall
Was the tool acceptable as read-only, but blocked once remediation or prod write access came up?

DIY with Claude/MCP/scripts
If you tried building your own version, where did it break down — cost, maintenance, permissions, governance, hallucinations, or reliability under real incident pressure?

No learning loop
After you corrected it, closed the incident, and wrote the postmortem, did the tool learn anything useful for next time? Or did every incident still feel like starting from zero?

All suggestions are welcome; we're mid-stage and trying to understand actual pain points before progressing further.

u/Agreeable_Celery8277 — 7 days ago

▲ 19 r/sre

We are looking for straightforward takes on Terraform Cloud alternatives that have drift detection and governance built in

We have been evaluating IaC orchestration platforms for a few months, and at this point we have opinions. Curious if others have been through the same exercise. Many of them handle the orchestration piece fine: plans, approvals, state management. The problem is that drift detection and IaC governance get treated like afterthoughts. Terraform Cloud runs drift detection on a schedule, which collapses at 100+ workspaces. Spacelift's drift detection does not work at scale. I am sure there are others.

Besides drift, we struggle with IaC coverage. 30% of our infrastructure lives outside any workflow because it was never in IaC to begin with. The downstream consequence is that when we need to recover an environment, we are rebuilding from an incomplete picture of what existed. Has anyone found something that handles both the orchestration and the inventory and drift side without stitching three things together?
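For reference, the basic mechanism most drift checks build on can be sketched in a few lines of Python: run `terraform plan -detailed-exitcode` per workspace and read the exit status (0 means no changes, 1 means error, 2 means changes present). This is only a baseline illustration, not a claim about how any of the platforms above implement it. The helper names are hypothetical, each directory is assumed to be already initialized with credentials available, and a non-empty plan can also mean unapplied config changes rather than true drift.

```python
import subprocess


def plan_exit_code(workspace_dir):
    """Run a read-only plan; -detailed-exitcode makes the exit status meaningful."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-lock=false", "-input=false", "-no-color"],
        cwd=workspace_dir,
        capture_output=True,
        text=True,
    )
    return result.returncode


def classify(exit_code):
    # With -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes present.
    return {0: "in_sync", 2: "drifted"}.get(exit_code, "error")


def drift_report(workspace_dirs, run_plan=plan_exit_code):
    """Map each workspace directory to in_sync, drifted, or error."""
    return {d: classify(run_plan(d)) for d in workspace_dirs}
```

The `run_plan` parameter is injectable so the loop can be tested or parallelized without touching real infrastructure. Note that this only sees resources that are already in state, so it does nothing for the 30% that never made it into IaC.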

u/Bright-View-8289 — 10 days ago

▲ 4 r/sre

DORA has tracked MTTR for years. For most teams it hasn't moved. What actually moved it for you?

We've been grinding on incident response time for the past year. The DORA (DevOps Research and Assessment) 2023 report shows the elite cohort at under an hour for MTTR (mean time to recovery); the bottom 60% still sitting at 1 to 24 hours, same as 2019.

The frustrating part is we added observability tooling over that period, more dashboards, better alerting, structured logs, and none of it moved the number.

What we eventually noticed is that the actual wall-clock time in most incidents goes to the hypothesis loop: you think you know the cause, you check 3 tools, you're wrong, you form another theory. The fix itself is usually fast, sometimes anticlimactic, once you find the root cause.

Is this a universal pattern, or just something specific to our stack? If you and your team actually moved the number, how did you do it?

u/Holiday-Record7341 — 10 days ago