r/sre

Stop copy-pasting Terraform modules, I built a tested registry for AWS, GCP, and Azure with Terratest and CI
▲ 0 r/sre+3 crossposts

Disclaimer: I built this project and am sharing it as a free open-source tool.

Every project I join has the same problem: someone copied and pasted a VPC module from a blog post in 2019, nobody tested it properly, and now it's load-bearing infrastructure.

This registry has 9 modules across AWS, GCP, and Azure: VPC/VNet, Kubernetes (EKS/GKE/AKS), and IAM/Workload Identity for each cloud.

Every module has:

- A Terratest that provisions real infrastructure and tears it down (no mocks)

- GitHub Actions CI (fmt, validate, tflint, Checkov)

- Secure defaults with every option exposed as a variable

- Working examples you can run in under 5 minutes

**Module list:**

- modules/aws/vpc: VPC, public/private subnets, NAT gateway, route tables

- modules/aws/eks: EKS cluster, managed node groups, OIDC, IRSA

- modules/aws/iam: roles, policies, IRSA binding

- modules/gcp/vpc: VPC, Cloud NAT, Private Google Access, firewall rules

- modules/gcp/gke: GKE cluster, node pools, Workload Identity

- modules/gcp/iam: service accounts, IAM bindings, WI federation

- modules/azure/vnet: VNet, subnets, NSGs, route tables

- modules/azure/aks: AKS, managed identity, OIDC, Workload Identity

- modules/azure/iam: managed identities, federated credentials, role assignments

**Quick start:**

    git clone https://github.com/Cloud-Architect-Emma/terraform-module-registry
    cd terraform-module-registry/examples/aws
    terraform init && terraform plan

**Or reference directly in your code:**

    module "vpc" {
      source = "github.com/Cloud-Architect-Emma/terraform-module-registry//modules/aws/vpc?ref=main"

      name               = "production"
      cidr               = "10.0.0.0/16"
      azs                = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
      private_subnets    = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
      public_subnets     = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
      enable_nat_gateway = true
    }

⭐ If this saves you time, a star on the repo helps others find it: https://github.com/Cloud-Architect-Emma/terraform-module-registry

PRs welcome. What module would you add first?

u/EmmaOpu — 9 hours ago
▲ 4 r/sre

Moving from Performance Testing to SRE/Resiliency (How to avoid the LeetCode trap?)

I’ve been in performance testing for about 3.5 years. I’ve run projects end-to-end, getting applications ready before release. Now I want to move on because I don't see the kind of growth that tech-driven product firms provide, and I want to make sure I'm moving toward something that is harder for AI to replace.

Ideally, I still don't want to code heavily. I suck at DSA and LeetCode grinding, and I want to avoid that step in the interview process entirely. I'm okay with letting go of the Software Engineer title. Infrastructure, Platform, Cloud, DevOps, Resiliency, Chaos: anything in that realm works.

I have some DevSecOps experience and I've worked with GitLab CI/CD. I also have an academic research background in system vulnerabilities and stress-testing, so I naturally lean toward breaking systems.

I'm doing my AWS SAA right now. After I'm done with it, where should I take my career more broadly? If I target Resiliency or Chaos Engineering, can I bypass the heavy DSA rounds? Would love to hear from people who have figured out the why/why not of this transition.

u/PerfPivot2026 — 10 hours ago
▲ 0 r/sre

Which AI Alert Investigation Tools Are Actually Good in Production?

Any tools that actually helped reduce:

  • alert fatigue
  • investigation time
  • MTTR

Please only real production experiences, not marketing claims.

u/Wise-Formal494 — 15 hours ago
▲ 2 r/sre

What do you consider a “bad” page-worthy alert?

I’ve been reviewing alert quality lately and noticed a few patterns that seem to create noise:

  • alerts with no owner
  • alerts with no runbook
  • symptom alerts that self-resolve
  • CPU/memory alerts that are not tied to user impact
  • duplicate paging from app + infra layers
  • short “for” windows on bursty workloads
  • vague alert descriptions with no action path

For SRE teams here, what makes an alert page-worthy in your environment?

Do you use a checklist or rubric before an alert is allowed to page someone?
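One way to make such a rubric concrete is a small pre-flight check that blocks paging status until the basics are present. This is a minimal Python sketch, assuming a hypothetical `AlertRule` record and an arbitrary 5-minute minimum "for" window; the field names and the threshold are illustrative and not taken from any particular alerting tool.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed threshold: "for" windows shorter than this are treated as too
# short for bursty workloads. Tune it per service.
MIN_FOR_WINDOW_SECONDS = 300


@dataclass
class AlertRule:
    """Hypothetical alert definition; fields mirror the noise patterns above."""
    name: str
    owner: Optional[str] = None
    runbook_url: Optional[str] = None
    tied_to_user_impact: bool = False
    for_window_seconds: int = 0
    action: str = ""  # what the responder is expected to do


def paging_blockers(rule: AlertRule) -> List[str]:
    """Return the reasons this rule should not be allowed to page yet."""
    blockers = []
    if not rule.owner:
        blockers.append("no owner")
    if not rule.runbook_url:
        blockers.append("no runbook")
    if not rule.tied_to_user_impact:
        blockers.append("not tied to user impact")
    if rule.for_window_seconds < MIN_FOR_WINDOW_SECONDS:
        blockers.append("short for window")
    if not rule.action.strip():
        blockers.append("no action path")
    return blockers


if __name__ == "__main__":
    cpu_alert = AlertRule(name="HighCPU", for_window_seconds=60)
    print(cpu_alert.name, "->", paging_blockers(cpu_alert))
```

A rule with an empty blocker list may page; anything else could be routed to a ticket or dashboard instead. Duplicate paging across app and infra layers and self-resolving symptoms need alert history to detect, so they are out of scope for this static check.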

u/Software_Sennin — 20 hours ago
▲ 15 r/sre

GitHub breach highlights developer tools as part of attack surface

The recent GitHub incident and the reports of a compromised VSCode extension feel like a wake-up call for modern engineering teams.

A trusted extension already has repository access, local context, and developer trust. That makes it a very different security problem than traditional infra attacks.

Teams now need to treat developer environments, extensions, GitHub Apps, and local tooling with the same weight as production infrastructure.

I wonder what other teams are going to do after this.

u/steadwing_official — 1 day ago
▲ 2 r/sre

A simple AI agent override mistake wiped out our ART metrics improvement

Still can't believe i did this. we rolled out this new ai agent setup a couple months ago for tier 1 tickets. it's supposed to auto resolve simple stuff like password resets and basic app crashes, cutting average resolution time from 45 minutes down to under 5 per early reports. whole point was compressing time to value on every employee request. management loves the dashboards showing slas green across the board.

was tweaking permissions yesterday because some high priority incidents were getting stuck in queue. agent was too aggressive on p2s, so i wrote a quick bulk update script that pulled back a few hundred open tickets from last week across a couple of categories. tested on staging first, everything fine. but i was rushing end of day friday, brain dead from back to back meetings, and hit the prod endpoint instead.

script ran in 90 seconds and marked every matching ticket as resolved with the canned note from the agent: 'instant intervention complete user notified'. art plummets overnight from 12 minutes average to 2.3 minutes. looks amazing at first glance until you dig in. 80% reduction, but now 800 tickets show resolved with zero human touch, including around 60 serious cases like broken payroll access and crm outages.

morning meeting, cto pulls up the metrics dashboard screaming about how art never looked this good, but finance director is furious because their month end reports are gone. service desk phones are melting down, employees calling back saying their issues vanished. slas technically hit, but audit trail shows my id did bulk closure on everything. now scrambling to reopen without triggering false alerts or double counting stats.

team is pissed i bypassed qa, manager wants a post mortem asap, and now legal is asking about compliance since some were security tickets. we can recover most data but the embarrassment is killing me. has anyone nuked their core metrics like this with ai overrides, and how bad does this usually blow up?

▲ 73 r/sre+1 crossposts

A24 Films has a new tech startup and we're hiring devops!

A24 Labs is a technology startup within A24 Films.

Compensation: $125k - $200k, plus equity and a competitive benefits package (*see more in posting).

Location: Most of the labs team works from our New York office so we're currently focused on local candidates.

If you apply please mention you saw this post!

https://labs.a24films.com/jobs/devops

u/OddRutabaga2849 — 2 days ago
▲ 0 r/sre

How does your team actually handle runbook documentation? Ours doesn't.

Honest question — we have a strong infra team, great uptime, fast incident resolution. But every single runbook is either 2 years out of date or just doesn't exist.

The engineers who fix things are the same ones who "never have time" to document what they did. And honestly I get it — after a 2am incident nobody wants to write docs.

The knowledge just lives in Slack threads that are impossible to search, or worse, in one person's head.

Curious how other teams actually deal with this. Have you found anything that works, or is this just a universal DevOps tax everyone quietly accepts?

u/Ashwith_Garlapati — 1 day ago
▲ 0 r/sre

Landed a SysAdmin Internship at an ISP & MSP combo. Is this a good path?

I live in a really rural area and IT jobs are really the only thing around me. Software positions don’t exist. I am not quite ready to move to the metropolitan city yet, because I would like to try to get a few years of experience while living at home.

However, I genuinely like both software development and IT. My degree is in IT as well, and I have mainly been working on game engines, games, web apps, and that sort of stuff on the side. The company I'm interning at is well aware that I am currently stronger in software development, but they still brought me on even though the role is more IT heavy. I also told them that doing SRE work is my ultimate career goal.

Anyways, it may sound like a dumb question, but is this a solid path? I’m just assuming an SRE role may prefer a SWE over someone doing SysAdmin work. I do plan to stay on top of programming throughout my career, as a hobby, regardless. It’s just that all my actual work experience will most likely be in IT.

u/Infectedtoe32 — 1 day ago
▲ 3 r/sre

How to load test an I/O-bound service to choose the right autoscaling metric in Kubernetes?

I have a Python data service (gunicorn, 7 workers, 3 pod replicas, static) that the compute service calls during ML workflows. The heavy endpoint reads large datasets from S3 and processes them in-memory.

What I see in Prometheus:

- Request rate stays roughly flat during ML workflows

- p99 duration spikes to several minutes during heavy workflows

- Errors stay at zero

I suspect the high p99 is dominated by I/O wait on S3, and that under enough concurrent load in-flight requests would queue at the worker level, making horizontal autoscaling useful. But I want to confirm this with a load test before deciding which metric to scale on.

My questions:

Is sending varying levels of concurrent heavy requests and watching how key metrics (request duration, worker saturation, CPU, memory) respond a sound way to find the saturation point? Or is there a better-established approach for I/O-bound services?

For a service that pins workers waiting on S3, which metric tends to be the most predictive trigger for autoscaling? Custom worker saturation (queue length), or latency itself?

Using Prometheus with the gunicorn StatsD exporter. Open to suggestions about additional instrumentation worth adding before the test.
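The stepped-concurrency approach described in the first question can be sketched as a small harness. This is a minimal Python sketch under stated assumptions: `fake_heavy_request` is a hypothetical stand-in that models 7 workers blocking on I/O, and the "p99 more than doubles versus baseline" rule in `find_knee` is an arbitrary heuristic, not an established standard. Against a real service, `send_request` would be an HTTP call to the heavy endpoint.

```python
import math
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def timed_call(send_request):
    """Run one request and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    send_request()
    return time.perf_counter() - start


def run_step(send_request, concurrency, total_requests):
    """Send total_requests with at most `concurrency` in flight; report latency."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        durations = sorted(
            pool.map(lambda _: timed_call(send_request), range(total_requests))
        )
    return {
        "concurrency": concurrency,
        "p50": percentile(durations, 50),
        "p99": percentile(durations, 99),
    }


def find_knee(steps, factor=2.0):
    """First concurrency whose p99 exceeds factor x the first step's p99."""
    baseline = steps[0]["p99"]
    for step in steps[1:]:
        if step["p99"] > factor * baseline:
            return step["concurrency"]
    return None


# Hypothetical stand-in for the service: 7 workers, each blocked on "S3".
_workers = threading.Semaphore(7)


def fake_heavy_request():
    with _workers:
        time.sleep(0.05)  # simulated I/O wait


if __name__ == "__main__":
    steps = [run_step(fake_heavy_request, c, total_requests=c * 2)
             for c in (1, 4, 7, 14, 28)]
    for step in steps:
        print(step)
    print("p99 knee at concurrency:", find_knee(steps))
```

If p99 climbs once concurrency passes the total worker count while CPU stays low, that would be consistent with requests queuing behind busy workers, which suggests comparing in-flight requests or busy workers per pod against latency as scaling signals during the test.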

u/fearless_expert216 — 2 days ago
▲ 1 r/sre

What the april anthropic 529 incident revealed about llm gateway reliability posture

If you were on-call during the late-april opus 4.7 capacity squeeze, you probably already have opinions about this. If you were not, the short version: a lot of teams discovered mid-incident that what they had configured as failover was not actually functioning as a reliability mechanism.

Spent most of last weekend going through public post-mortems from teams that ran through it, partly to figure out whether our own posture would have held up. The pattern i kept seeing was that teams running through gateways found their gateway degrading in lockstep with the upstream it was proxying. The gateway did not have anywhere else to send the traffic, so it was effectively a more polished error returner. Some teams had multi-provider failover configured, but the failover only saved them if the secondary provider was healthy. During a few hours of the worst window, the major providers were degrading in correlated fashion.

This is making me think about gateways less as developer-convenience infrastructure and more as a serious dependency choice with the same kind of reliability questions you would ask of a database or a queue. Three concrete things i am now asking when i look at one.

Does the gateway hold any inference capacity itself, or is its entire serving capacity contingent on healthy upstreams? For workloads where a 30-minute degraded window is tolerable, pure-proxy is fine. For tier-1 inference, it is not. Some newer entrants keep their own compute on the back end as a degraded-mode fallback. Not frontier-class, not a substitute, but it converts service fully down into service in degraded mode where some quality drops but it still answers. Whether that distinction matters depends on what you are building.

What is the migration blast radius if you need to bypass the gateway entirely? Most gateways normalize traffic to a single api shape, which is useful until the incident requires hitting a provider direct. Then the normalization becomes a migration tax. We did not have a good answer for this in our runbook before april and i am still working on what the right one is.

Can you charge back by on-call team after the incident? If your gateway only gives you workspace-level rollups, you cannot answer which team consumed extra budget retrying through the degradation. That is a gap that quietly becomes a problem at scale.
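To illustrate the first question, here is a minimal Python sketch of an ordered fallback chain whose last tier is a locally held, lower-quality backend. Everything in it is hypothetical: the `UpstreamUnavailable` exception, the backend tuples, and the response shape are assumptions for illustration, not any real gateway's API.

```python
class UpstreamUnavailable(Exception):
    """Raised by a backend that cannot serve the request right now."""


def complete_with_fallback(prompt, backends):
    """Try each backend in order and return the first successful answer.

    backends: list of (name, is_degraded_tier, call) tuples, where call(prompt)
    returns text or raises UpstreamUnavailable. All names are illustrative.
    """
    failures = []
    for name, is_degraded_tier, call in backends:
        try:
            text = call(prompt)
        except UpstreamUnavailable as exc:
            failures.append(f"{name}: {exc}")
            continue
        return {
            "backend": name,
            "degraded": is_degraded_tier,  # lets callers flag reduced quality
            "text": text,
            "failures": failures,          # which upstreams were skipped
        }
    raise UpstreamUnavailable("all backends failed: " + "; ".join(failures))


def overloaded(prompt):
    """Stand-in for a provider returning capacity errors."""
    raise UpstreamUnavailable("overloaded")


def local_small_model(prompt):
    """Stand-in for capacity the gateway holds itself."""
    return "short answer to: " + prompt


if __name__ == "__main__":
    result = complete_with_fallback("summarize the incident", [
        ("provider-a", False, overloaded),
        ("provider-b", False, overloaded),   # correlated failure
        ("local-fallback", True, local_small_model),
    ])
    print(result["backend"], result["degraded"])
```

With only the two provider entries, the correlated failure case ends in an exception, which is the "more polished error returner" behaviour; the third tier is what turns fully down into degraded.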

I do not have a clean conclusion. Mostly april reminded me that reliability decisions for ai infra need to be made on the messy case. The gateway category is still maturing toward primitives that genuinely help under degradation rather than just under healthy load.

Would be useful to hear how teams here are writing runbooks for this failure mode specifically, especially the bilateral-degradation case where two upstreams are correlated. Mine still feels weak on it.

u/Drysetcat — 2 days ago
▲ 50 r/sre

Bringing laptop with you in public on-call?

Hey fellow SRE'rs,

I just started my first full-time position as an SRE & that means going on-call (🙌). I have a date coming up but it conflicts with my on-call so I'm just planning on bringing my laptop with me. Anyone ever been in a similar situation? It feels like this is probably pretty common in this field?

edit: appreciate the help everyone! It's just a casual date, so I'll just bring my laptop and leave it in my car. this is all very good to know though!

u/CallsyReds — 4 days ago
▲ 8 r/sre

Anyone else notice how incidents always seem to happen at the absolute worst possible time?

Anyone else feeling like production incidents have perfect timing?
I began to see this after a few on-call rotations.
The whole day can be silent. Dashboards are good. No weird alerts. Everything is stable.
Then the moment you actually stop for a minute, production suddenly finds a whole new failure mode, and Slack goes into chaos.

It’s almost as if the systems know when the engineers stop looking at them.
What’s the worst-timed incident your team has encountered?

u/steadwing_official — 3 days ago
▲ 51 r/sre

every team has a postmortem action item from 2 years ago everyone agreed was P1 and nobody has touched

the kind that says "implement circuit breaker on payments service" or "set up automated runbook for stale leader election" and just sits in jira under "later"

mine is "add chaos testing to the deployment pipeline" from an incident in 2023 where a bad rollout took down half the platform for 40 min. everyone in the room nodded. ticket got created. it has priority "high" and has been moved across 4 different epics since. every quarter someone brings it up. every quarter the answer is "we should do that this sprint".

nothing happens

edit: bonus points if the engineer who wrote the action item has since left

what's the action item your team has been carrying for years?

u/Complex_Computer2966 — 4 days ago
▲ 0 r/sre

The Observability Cosmos

https://preview.redd.it/p6v9ylo20w1h1.png?width=900&format=png&auto=webp&s=a48a5a6aba3731408316d321dc14361844f5c1cd

So, I have built a map of the observability space.

The market seems to be evolving and growing at an incredible rate. New specialisms are developing and AI is changing the nature of observability itself. This is an attempt to identify some kind of order and structure. It currently encompasses 126 products (with many more to come) across 16 categories. Not surprisingly, SRE, especially AI SRE, is one of the hotspots.

If you want to dive straight in and explore the Cosmos, this is your launchpad:
https://observability-360.com/Product/Cosmos

There is also an introductory article here:
https://observability-360.com/article/viewArticle?id=introducing-the-observability-cosmos

And an explanation of the classifications here:
https://observability-360.com/article/viewArticle?id=observability-cosmos-classifications

u/Observability-Guy — 3 days ago
▲ 3 r/sre

Would embedded systems engineers make good SREs?

Hi,

I currently work as an embedded systems engineer and have been thinking of transitioning into SRE.

I also had a stint as a Backend Engineer, where I thought the fun parts were actually finding production bugs, from Chrome Dev Tools to Datadog logs, etc. Merging a PR that fixes a bug is more fun than merging one that adds a feature, IMO.

So I think my backend experience brings something relevant, but the embedded side brings a lot too. For instance, I think embedded folks know a lot more about Linux than most SWEs. One day I might be working with filesystems, the next with the networking subsystem, creating boot and initialization scripts, patching a kernel module, adding a driver to the kernel, etc. Moreover, once the hardware is fully brought up, the work can pivot to tasks that are similar to SRE but at the edge rather than in the cloud, like monitoring the device fleet (like how SREs monitor servers/VMs/pods, idk), optimizing the CI pipeline, etc.

In my mind there is a good intersection there. But I actually haven’t found too many examples of people who did this. Maybe because there is a class of SREs who are “embedded SREs” so search results become very mixed.

For me, I’m interested in the change because modern software companies have better culture in my experience, and software has better margins and pay. In an SRE role I’d still use skills I like. I like Linux, networking, and writing software, and I have solid CS fundamentals (I even do well in leetcode interviews), but I haven’t worked with Kubernetes or much of the other tech you see in JDs (which doesn't intimidate me, but I'm aware I don’t have the experience).

Any input appreciated

u/hejirerr — 4 days ago
▲ 74 r/sre

We had a 40 minute outage and nothing alerted because traffic dropped 95%

Could someone explain to me how we had 0 alerts for a 40 minute outage?

Our users were getting errors for 40 minutes. status page was green, our on-call engineer was not paged, and dashboards were not red.

Way we found out: a customer emailed support. support slacked the on-call. on-call looked at dashboards, saw green, said “i don’t see anything.” support pushed back, customer is pretty insistent. on-call actually opens the app and tries to use it themselves and yeah it’s broken.

What happened: we monitor error rates as a percentage of total traffic. our traffic had dropped by 95% because a load balancer config change had been routing most users to a dead endpoint. the small amount of traffic that was hitting the right endpoint was healthy. so our error RATE was fine. our error COUNT was fine. our absolute traffic volume metric existed but nobody had an alert on it.

we had the metric. we looked at it in reviews sometimes. we just never said “if this drops 80% in 5 minutes, page someone.”

the thing that’s haunting me: we’ve had a version of this conversation in two previous retrospectives. both times we said “we should alert on traffic volume drops.” both times it was added to a backlog and deprioritized.
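The missing rule ("if this drops 80% in 5 minutes, page someone") is small enough to sketch. This is a minimal Python version, assuming one request-rate sample per minute with the oldest first; the function name, the `min_baseline` guard, and the choice of baseline (everything before the last 5 samples) are assumptions for illustration, and in practice this would live as a rule in the monitoring system.

```python
from statistics import mean


def should_page_on_traffic_drop(rates_per_minute, window=5,
                                drop_threshold=0.8, min_baseline=1.0):
    """Decide whether the latest traffic level warrants a page.

    rates_per_minute: request rates, oldest first, one sample per minute.
    Baseline is the mean of everything before the last `window` samples;
    the latest sample is compared against it.
    """
    if len(rates_per_minute) <= window:
        return False  # not enough history to form a baseline
    baseline = mean(rates_per_minute[:-window])
    if baseline < min_baseline:
        return False  # too little normal traffic for a ratio to mean much
    current = rates_per_minute[-1]
    drop = 1 - current / baseline
    return drop >= drop_threshold


if __name__ == "__main__":
    # Steady 1000 req/min, then a routing change sends most users elsewhere.
    history = [1000] * 10 + [900, 600, 300, 100, 50]
    print(should_page_on_traffic_drop(history))
```

A check like this fires on volume alone, so it would have caught the case where the error rate stayed healthy because the failing traffic never arrived. The `min_baseline` guard is there to avoid paging on naturally quiet periods; services with strong daily cycles may need a same-time-yesterday baseline instead.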

u/MembershipUnited5355 — 6 days ago
▲ 49 r/sre

Observability for AI tooling: Grafana dashboard for Claude Code's OpenTelemetry metrics on Prometheus

Hi! I'm an SRE who got pretty excited when Claude Code added the ability to emit OpenTelemetry metrics. Felt like that capability landed pretty quietly out there, so I built a Grafana dashboard on top.

https://preview.redd.it/6llimh66pi1h1.png?width=1840&format=png&auto=webp&s=61945c7ef15ec3ab45c34888ab77359171760f5a

The metrics mostly cover what you'd want to watch: cost, cache hit ratio, active time, tool decisions, lines of code. Compatible with Prometheus, VictoriaMetrics, Mimir, Thanos.

https://preview.redd.it/2wydaoj7pi1h1.png?width=1820&format=png&auto=webp&s=816aa081f92981aa10ab56eb3d492eabfab78b8b

Parallel implementation of dashboard 25052 by 1w2w3y (Azure Application Insights / KQL). Every panel rewritten in PromQL.

https://preview.redd.it/pdnyz1j8pi1h1.png?width=1833&format=png&auto=webp&s=0ccff65ce3b5762e7c04f365f633a930469df485

Things worth flagging up front (covered in the article):

- Temporality settings matter. Pin to cumulative or you'll get silently broken rates.

- Cost is a client-side estimate; it won't match Anthropic billing to the cent.

- The PR counter only increments when Claude Code itself opens the PR (e.g., via gh CLI inside a session); manual PRs don't register.

- Custom labels via OTEL_RESOURCE_ATTRIBUTES extend the dashboard to per-team / per-project / per-cost-center views. For org-wide rollouts the same labels enable cost attribution by team or cost center. The per-user data is exposed too; what you do with it is up to you.

Article with the walkthrough: https://rockdarko.dev/posts/grafana-dashboard-for-claude-code-on-prometheus/

Dashboard on Grafana Labs: https://grafana.com/grafana/dashboards/25255-claude-code-metrics-prometheus/

Repo (MIT): https://github.com/rockdarko/claude-code-metrics-prometheus

u/rockdarko — 5 days ago
▲ 2 r/sre+1 crossposts

How do you currently turn noisy incident logs into a useful timeline?

I’m working on a small local prototype around incident-log grouping, and I’m trying to understand whether this output shape is actually useful for SRE/DevOps workflows.

During incidents, exported alerts/logs often become repeated noise:

- database pool exhaustion

- payment gateway timeout

- retry storm

- checkout failure

- unrelated background events

The prototype, Crystal Incident Lens, tries to turn exported events into:

- incident groups

- timestamp-backed incident paths

- related-incident candidates

- resource usage numbers

Example recovered path from a demo incident set:

db_pool_exhaustion -> payment_gateway_timeout -> retry_storm -> checkout_failure
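For readers who want to see what a "timestamp-backed path" could look like mechanically, here is a minimal Python sketch that collapses repeated events to their first occurrence and orders them by time. It is an illustration only and not the prototype's algorithm (the engine is not public); the function name, the `(timestamp, kind)` event shape, and the explicit set of incident kinds are all assumptions.

```python
def first_seen_path(events, incident_kinds):
    """Order incident kinds by the first time each one appears.

    events: iterable of (timestamp, kind) pairs in any order.
    incident_kinds: kinds considered part of the incident; others are ignored.
    """
    first_seen = {}
    for timestamp, kind in events:
        if kind not in incident_kinds:
            continue  # unrelated background event
        if kind not in first_seen or timestamp < first_seen[kind]:
            first_seen[kind] = timestamp
    ordered = sorted(first_seen.items(), key=lambda item: item[1])
    return [kind for kind, _ in ordered]


if __name__ == "__main__":
    events = [
        (100, "db_pool_exhaustion"),
        (105, "cron_cleanup"),              # unrelated
        (110, "payment_gateway_timeout"),
        (111, "db_pool_exhaustion"),        # repeat, collapsed
        (120, "retry_storm"),
        (121, "payment_gateway_timeout"),   # repeat, collapsed
        (130, "checkout_failure"),
    ]
    kinds = {"db_pool_exhaustion", "payment_gateway_timeout",
             "retry_storm", "checkout_failure"}
    print(" -> ".join(first_seen_path(events, kinds)))
```

Ordering by first occurrence gives a timeline, not proof of causation; deciding which kinds belong to the same incident is the hard part and is not addressed here.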

I also tested it on a 2,500-line slice of Rootly AI Labs’ public logs dataset:

https://github.com/Rootly-AI-Labs/logs-dataset

Current measured result:

- 2,500 real log lines

- 10 incident groups

- 0 queue rejections

- 27.42 events/sec

- 21.43 MB working set RAM

- 56.0 CPU-sec

Important boundaries:

- This is a prototype research project.

- It is not a Datadog/PagerDuty replacement.

- It is not an LLM wrapper.

- The runnable engine is not public yet.

- Rootly is only used as a public dataset source; this is not affiliated with or endorsed by Rootly.

What I’m trying to validate:

Can a local event-memory layer reduce noisy operational exports into a smaller, evidence-backed incident picture without sending logs to a cloud API?

Questions for SREs / DevOps / incident response people:

  1. Is this output shape useful?

  2. Do you care more about deduplication or causal timeline?

  3. What exported log/alert format should I test next?

  4. What would make this convincing enough to try on anonymized real incidents?

Proof repo:

https://github.com/Antriksh005/CRYSTAL_GITHUB_PUBLIC

Disclaimer: I built this prototype in my spare time; there is no paid product or cloud service behind it.

u/Salt_Diamond5703 — 4 days ago