u/steadwing_official

▲ 15 r/sre

GitHub breach highlights developer tools as part of attack surface

The recent GitHub incident, plus reports of a compromised VS Code extension, feels like a wake-up call for modern engineering teams.

A trusted extension already has repository access, local context, and developer trust. That makes it a very different security problem from traditional infra attacks.

Teams now need to treat developer environments, extensions, GitHub Apps, and local tooling with the same weight as production infrastructure.

I wonder what other teams are going to do after this.

reddit.com
u/steadwing_official — 1 day ago

▲ 5 r/redhat

Anyone else notice how incidents always seem to happen at the absolute worst possible time?

Anyone else feel like production incidents have perfect timing?
I started noticing this after a few on-call rotations.
The whole day can be silent. Dashboards look good. No weird alerts. Everything is stable.
Then the moment you actually stop for a minute, production suddenly finds a whole new failure mode, and Slack goes into chaos.

It’s almost as if the systems know when the engineers stop looking at them.
What’s the worst-timed incident your team has encountered?

reddit.com
u/steadwing_official — 3 days ago

Building an SRE subreddit

Hey 👋
We’re building a community around SRE where people can share ideas, ask questions, discover resources, and connect with others in the space.

If you're interested in discussions around SRE, startups, tech, learning, opportunities, and real conversations without spam, feel free to join us on Reddit:
https://www.reddit.com/r/steadwing/

Would love to have early members help shape the community.

reddit.com
u/steadwing_official — 3 days ago

What’s the weirdest thing that caused a production incident for your team?

I don't mean the major outages.
Just the little stupid things that somehow brought prod down.

For us it has been:
- expired certificates
- a bad env var
- DNS oddities
- queue lag that went unnoticed for hours

Sometimes it seems that little config issues cause way more incidents than big system failures.
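
For the expired-certificate case, here is a rough sketch of the kind of check that could flag it early. This is an illustration only: the function names, the 30-day threshold, and the sample cert names are made up, and the date format is assumed to be the `notAfter` text that Python's `ssl` module reports for a peer certificate.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days left before a certificate's notAfter time.

    not_after is assumed to use the text format found in the dict returned
    by SSLSocket.getpeercert(), e.g. "Jun  1 12:00:00 2030 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)  # seconds since epoch, UTC
    return (expires_at - now.timestamp()) / 86400


def expiring_soon(certs: dict[str, str], now: datetime, warn_days: int = 30) -> list[str]:
    """Names of certs expiring within warn_days (already expired ones included)."""
    return sorted(
        name
        for name, not_after in certs.items()
        if days_until_expiry(not_after, now) <= warn_days
    )


# Hypothetical inventory: cert name -> notAfter string.
inventory = {
    "api": "Jun  1 12:00:00 2030 GMT",
    "web": "Jan  1 00:00:00 2031 GMT",
}
check_time = datetime(2030, 5, 22, 12, 0, 0, tzinfo=timezone.utc)
soon = expiring_soon(inventory, check_time)
```

Passing `now` in explicitly keeps the check easy to test; a scheduled job would pass `datetime.now(timezone.utc)` instead.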

What’s the most shockingly dumb root cause other teams have discovered?

reddit.com
u/steadwing_official — 6 days ago

What’s the dumbest thing that has caused a production incident for your team?

I'll go first.
We’ve wasted way too much time debugging what seemed like an app problem, only to find out the real problem was a tiny config change that no one thought could break anything.

Small things account for a big percentage of incidents:
- expired certs
- DNS weirdness
- one bad env variable
- clock skew
- queue buildup
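
For the env variable case, one cheap guard is to fail fast at startup instead of finding out mid-request. A minimal sketch, assuming a service that knows its required settings up front; `REQUIRED_VARS` and both function names are hypothetical, not taken from any real setup:

```python
import os
from typing import Mapping

# Hypothetical list; a real service would declare its own required settings.
REQUIRED_VARS = ("DATABASE_URL", "QUEUE_URL")


def missing_env_vars(env: Mapping[str, str], required=REQUIRED_VARS) -> list[str]:
    """Return the required variables that are unset or blank in env."""
    return [name for name in required if not env.get(name, "").strip()]


def check_env_or_exit(env: Mapping[str, str] = os.environ) -> None:
    """Stop the process with a clear message if any required variable is missing."""
    missing = missing_env_vars(env)
    if missing:
        raise SystemExit(f"refusing to start, missing env vars: {', '.join(missing)}")
```

Calling `check_env_or_exit()` as the first line of the entry point turns a confusing runtime failure into an obvious startup error. It only catches missing or blank values, not wrong ones.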

What's the most surprising root cause you've found for an incident?

reddit.com
u/steadwing_official — 6 days ago

What’s something about on call that actually got better at your company?

For us it was turn-by-turn handoffs.

We added a 10-minute sync between the outgoing and incoming on-call: here’s what’s flaky right now, here’s what to watch for this week.

Small thing, but it killed the anxiety of going into a rotation completely blind.
Most on-call posts here are pretty grim.

Anyone got a real win they can share?

reddit.com
u/steadwing_official — 9 days ago

▲ 1 r/devops

Developer onboarding used to be a lot more painful

Was just talking to a friend about how much time used to get sucked into local setup issues.

Dependency mismatches, missing env vars, weird machine-specific bugs, outdated docs, permission problems... Sometimes it took longer to get onboarded than to understand the actual codebase.

It seems like over the last few years, teams have gotten better at minimising that friction.

What improvements have you seen to onboarding for your team?

reddit.com
u/steadwing_official — 10 days ago

Developer onboarding is finally getting better

Feels like the industry is slowly sorting out one of its biggest productivity drains: local setup chaos.

Cloud/ephemeral dev environments are making onboarding so much easier than the good old days of dependency conflicts, broken configs, and "works on my machine" issues.

Still not perfect, but definitely moving in a better direction.

Curious what tooling/workflows have been most helpful for your team?

reddit.com
u/steadwing_official — 10 days ago

What K8s debugging trick do you wish you had known on day one?

For me it was kubectl get events --sort-by=.metadata.creationTimestamp
Before that I was running describe on each and every resource, trying to figure out what happened. 90% of the time the answer was in the events section.

Also learned the hard way that events expire after 1 hour by default (the kube-apiserver --event-ttl default). If you're debugging anything older than that, they're just gone.
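
If you want the same view in a script, here is a small sketch that turns the JSON from `kubectl get events -o json` into a sorted timeline. The function name is made up, and it assumes the usual core/v1 Event fields (`lastTimestamp`, `metadata.creationTimestamp`, `type`, `reason`, `message`, `involvedObject`):

```python
def event_timeline(event_list: dict) -> list[str]:
    """One line per event, oldest first."""

    def when(ev: dict) -> str:
        # lastTimestamp can be null on some events; fall back to creation time.
        # Both are RFC 3339 UTC strings, so plain string sorting is chronological.
        return ev.get("lastTimestamp") or ev["metadata"]["creationTimestamp"]

    lines = []
    for ev in sorted(event_list.get("items", []), key=when):
        obj = ev.get("involvedObject") or {}
        lines.append(
            f'{when(ev)} {ev.get("type", "?")} '
            f'{obj.get("kind", "?")}/{obj.get("name", "?")} '
            f'{ev.get("reason", "")}: {ev.get("message", "")}'
        )
    return lines
```

To use it, pipe the kubectl output into a script that calls `event_timeline(json.load(sys.stdin))` and prints the lines. Saving that output somewhere also gets around the one-hour expiry for later debugging.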

What’s something that would have saved you hours if you knew it earlier?

reddit.com
u/steadwing_official — 11 days ago

Developer onboarding is finally getting better

Feels like the industry is slowly solving one of the biggest productivity suckers: local setup hell.

Cloud/ephemeral dev environments have greatly improved onboarding from the days of dependency conflicts, broken configs, and "works on my machine" problems.

Not perfect by any stretch, but certainly trending in a better direction.

Wondering what tooling/workflows have been most helpful to your team?

reddit.com
u/steadwing_official — 11 days ago
▲ 0 r/sre

Which SRE practice reduced incidents more than you expected?

Funny how small reliability things can sometimes outdo big infra changes.

For our team, better alert tuning did more to reduce noise and improve response time than adding new monitoring tools.

Wondering what had the biggest impact for your team.

reddit.com
u/steadwing_official — 12 days ago

What’s the most underrated Kubernetes feature your team actually uses in production?

Everyone talks about autoscaling, service mesh, and operators.

But I'm curious about the smaller k8s features/configs that genuinely made life easier in production for your team.

For us, proper readiness/liveness probes + resource requests fixed more issues than some “advanced” tooling ever did.
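
One way to keep those basics from slipping is a tiny lint over pod specs before they ship. A sketch only: the function name is hypothetical, and it assumes the spec has already been loaded into a dict with the standard `containers`, `readinessProbe`, `livenessProbe`, and `resources.requests` fields:

```python
def missing_safeguards(pod_spec: dict) -> dict[str, list[str]]:
    """Map container name -> safeguards it lacks, listing only containers with gaps."""
    report = {}
    for container in pod_spec.get("containers", []):
        gaps = [
            field
            for field in ("readinessProbe", "livenessProbe")
            if not container.get(field)
        ]
        # resources may be absent or explicitly null in a manifest.
        if not (container.get("resources") or {}).get("requests"):
            gaps.append("resources.requests")
        if gaps:
            report[container.get("name", "<unnamed>")] = gaps
    return report
```

It only checks that the fields exist, not that the probe endpoints or request sizes are sensible, so it is a floor rather than a review.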

What’s yours?

reddit.com
u/steadwing_official — 12 days ago

What one small DevOps change saved your team a lot of time?

For us it was about making rollbacks easier, not only thinking about deployments.

Fast, clean ways to roll back changes removed a lot of stress from releases and incidents.

Wondering what small infra/devops change had the biggest impact on your workflow or team?

reddit.com
u/steadwing_official — 14 days ago

Trying to learn from teams that have made on-call work as they grew, because we're in the middle of that change and still figuring it out.

When the engineering team was smaller, on-call worked well because everyone knew what services everyone else owned. The right person usually just noticed when something broke. Slack threads stayed short, post-incident discussions were brief, and most things were fixed within a week.

As we've grown, that model has slowly stopped working. People are on call for systems they hardly ever touch, the institutional knowledge that used to spread through conversations is getting lost, and our most senior engineers keep getting pulled in by default, which I don't think is good for them or for us.

I'm not asking about tools in particular. I want to know what changed in the way your team worked: how you designed the rotation, who owns what, how new engineers are onboarded to on-call, whether you split alerting from response, and anything else.

If you've made this change without burning out your senior staff, what advice would you give to a team that is going through it right now?

reddit.com
u/steadwing_official — 16 days ago