u/Upper_Caterpillar_96

We need automation without adding complexity.

The irony of ITSM: you buy automation software… then need a full-time admin just to manage the automation. We are trying to simplify operations, not create another platform that needs babysitting.

Looking for something that's easy to deploy, automation-heavy, AI-native, low-maintenance, and a good fit for lean IT teams.

reddit.com

Network security troubleshooting tools that actually work for SASE environments?

we merged networking and security a couple months ago. triage time went up.

environment is AWS with Transit Gateway, inline Palo Alto firewalls, and Okta for identity. mix of EC2, EKS, and some on-prem VMware. traffic goes through centralized inspection.

symptoms show up as latency and intermittent drops. hard to tell if it’s routing, firewall policy, or identity timing.

this has turned into a recurring SASE troubleshooting problem where no single layer gives a complete picture.

we pull VPC flow logs, firewall logs, and packet captures, but each view is partial. changes in one layer don’t line up with the others.

recent incident took hours to isolate. traffic was blocked by a firewall app-id override while identity hadn’t propagated yet. looked like a network issue at first.
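a rough sketch of the kind of cross-layer view that would have helped here: merge events from each layer into one timeline for a single source IP, so a firewall deny and a late identity update show up next to each other. the records, timestamps, and field names below are made-up sample data, not a real log schema, and the parsing of actual VPC flow logs or firewall logs is left out.

```python
from datetime import datetime, timezone

# Hypothetical, already-parsed sample events. Field names ("ts", "src", "event")
# and all values are assumptions for illustration, not a real log format.
flow_logs = [
    {"ts": "2024-01-01T10:00:05Z", "src": "10.0.1.15", "event": "REJECT tcp/443"},
    {"ts": "2024-01-01T10:00:06Z", "src": "10.0.2.20", "event": "ACCEPT tcp/443"},
]
firewall_logs = [
    {"ts": "2024-01-01T10:00:04Z", "src": "10.0.1.15", "event": "deny (app-id override)"},
]
identity_events = [
    {"ts": "2024-01-01T10:00:09Z", "src": "10.0.1.15", "event": "user-to-ip mapping updated"},
]


def parse_ts(value):
    """Parse a UTC timestamp of the form 2024-01-01T10:00:05Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def build_timeline(src_ip, **layers):
    """Merge events from every layer for one source IP, ordered by time."""
    merged = []
    for layer, events in layers.items():
        for e in events:
            if e["src"] == src_ip:
                merged.append((parse_ts(e["ts"]), layer, e["event"]))
    return sorted(merged)


timeline = build_timeline(
    "10.0.1.15",
    flow=flow_logs,
    firewall=firewall_logs,
    identity=identity_events,
)
for ts, layer, event in timeline:
    print(ts.isoformat(), f"[{layer}]", event)
```

in this sample the firewall deny comes first and the identity update lands several seconds later, which is the ordering that looked like a network problem from the flow logs alone.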

how are you isolating the failure domain quickly in setups like this?

u/Upper_Caterpillar_96 — 2 days ago
▲ 0 r/sre

Accidentally DoSed our production cluster with function level performance monitoring.

turned on function level performance monitoring in prod and it did not go well.

we have been discussing it internally for a while. wanted better visibility into hot paths in our Go services, not just endpoint latency but what's actually happening inside requests. staging tests looked fine. we had it running at 1% sampling, no noticeable overhead, clean traces.

prod is a different story.

we enabled it on one of our main services during a low traffic window. that service handles 500k reqs/min during peak, a bit lower at the time. within about 10 to 15 minutes CPU started climbing across all pods. not a spike, just steady increase until everything was under pressure.

latency followed. p99 went from 200ms to over 2s. error rate started creeping up. alerts everywhere.

initial assumption was traffic or some dependency issue, but nothing else changed. digging in, it was the tracing layer itself. even with 1% sampling, at that volume we were generating a huge number of spans. the function level hooks were firing constantly on hot paths and adding overhead we didn't see in staging.
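back-of-envelope version of why 1% can still hurt at this volume. the 500k reqs/min and 1% figures are from above; the number of instrumented function calls per request is a pure assumption for illustration, and whether the hooks run on unsampled requests depends on the tooling.

```python
# Figures from the post: 500k requests/min at peak, 1% sampling.
PEAK_REQS_PER_MIN = 500_000
SAMPLE_RATE = 0.01

# Assumption (not from the post): instrumented function calls per request on a hot path.
HOOKED_CALLS_PER_REQUEST = 200


def rates(reqs_per_min, sample_rate, hooked_calls_per_request):
    """Estimate per-second span and hook-call rates for a function-level tracer."""
    reqs_per_sec = reqs_per_min / 60
    sampled_reqs_per_sec = reqs_per_sec * sample_rate
    return {
        "reqs_per_sec": reqs_per_sec,
        # Spans actually recorded: only sampled requests emit them.
        "spans_per_sec": sampled_reqs_per_sec * hooked_calls_per_request,
        # If the hooks run before the sampling decision is checked,
        # every request pays the hook cost, sampled or not.
        "hook_calls_per_sec": reqs_per_sec * hooked_calls_per_request,
    }


result = rates(PEAK_REQS_PER_MIN, SAMPLE_RATE, HOOKED_CALLS_PER_REQUEST)
for name, value in result.items():
    print(f"{name}: {value:,.0f}")
```

with those assumed numbers that works out to roughly 16.7k spans/s recorded, and around 1.7M hook invocations/s if the hooks fire on every request. staging at a fraction of the traffic would not show either.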

heap usage also went up more than expected. looks like metadata collection per span added pressure there too. nothing obviously broken, just too much work being done per request.

we rolled it back as soon as it was clear, but it still took time for things to stabilize. traffic had already started shifting to other regions and we spent a couple hours just getting everything back to normal.

for now we have turned it off and gone back to basic endpoint level metrics and some targeted tracing.

wondering if others are using function level monitoring at this scale without causing issues. is it mostly about much lower sampling, or only enabling it selectively? how are you rolling this out safely in production?

u/Upper_Caterpillar_96 — 4 days ago

Anyone else think hybrid work made IT support a mess?

Ever since everyone started working from home more, support has been a pain sometimes. When people were in the office it was easy. If something broke you could just walk over, look at the laptop, maybe restart something, done.

Now it's a lot of "can you share your screen" and trying to figure out if the problem is the laptop, the VPN, bad wifi, or something else completely random.

Had one guy last week say his laptop was acting weird and it turned out updates had been failing forever and the drive was basically full. Nobody noticed until everything slowed down.

Feels like small issues sit in the background way longer now because nobody's physically around to catch them early. And remote troubleshooting somehow turns a 5 minute fix into an hour.

u/Upper_Caterpillar_96 — 5 days ago
▲ 42 r/SmallMSP+1 crossposts

This has been bugging me for a while now. We have like eight techs across our org, and every week I get asked by leadership where the bottlenecks are and what we should prioritize for next quarter, but I have no real data to point to.

I know tickets come in, I know they get resolved, but I have zero insight into what's taking time. One tech might spend two days on a single issue while another closes ten tickets in the same span. Are we categorizing wrong? Is someone drowning? Are we missing patterns? No clue.

We tried using the ticketing system reporting but it's basically useless. The data is there but it's noise. Ticket volume doesn't tell me anything about actual workload. Someone might spend four hours troubleshooting a network issue that shows as one ticket, while another person cranks out password resets that look like fifty tickets.

I can't see time spent per issue type. I can't track which problems keep coming back. I can't tell if one person is getting stuck on recurring issues or if we just have bad processes. And trying to measure technician performance without actual data just breeds resentment because it's all guesswork.
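To make it concrete, this is roughly the view I'm after: hours and ticket counts rolled up per issue type instead of raw volume. A minimal sketch, assuming each ticket had a category and logged hours attached (which is exactly the data we don't reliably have today). The ticket IDs, categories, names, and hours are invented sample data.

```python
from collections import defaultdict

# Hypothetical ticket export: (ticket_id, category, technician, hours_logged).
# All values are made up for illustration.
tickets = [
    ("T-1", "network", "alex", 4.0),
    ("T-2", "password_reset", "sam", 0.1),
    ("T-3", "password_reset", "sam", 0.1),
    ("T-4", "network", "alex", 6.5),
    ("T-5", "laptop_setup", "sam", 1.5),
]


def summarize(tickets):
    """Roll up ticket count and total hours per category."""
    by_category = defaultdict(lambda: {"tickets": 0, "hours": 0.0})
    for _ticket_id, category, _tech, hours in tickets:
        by_category[category]["tickets"] += 1
        by_category[category]["hours"] += hours
    return dict(by_category)


summary = summarize(tickets)

# Sort by total hours so the real time sinks come first, not the highest ticket counts.
for category, s in sorted(summary.items(), key=lambda kv: kv[1]["hours"], reverse=True):
    avg = s["hours"] / s["tickets"]
    print(f"{category:15} tickets={s['tickets']:3} hours={s['hours']:5.1f} avg={avg:.2f}")
```

Sorting by hours rather than count is the point: in the sample, the two network tickets outweigh everything else even though password resets have the same ticket count.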

Leadership wants to know if we need more hiring or if we are just unorganized. I genuinely don't know how to answer that without just... guessing. How are you all handling this??

u/Upper_Caterpillar_96 — 17 days ago

Boss drops bomb Friday: we're doubling headcount this year. Sounds good 'til you see my queue. 150 tickets last month, mostly password resets and laptop setups. Now imagine 2x the users, same crap.

I'm solo IT right now basically, part time msp. Can't hire fast enough. What's breaking first? probably me.

How do you prep for this? Outsource? AI chatbots for basics? Or just pray? Share your disasters or wins, need ideas bad.

u/Upper_Caterpillar_96 — 23 days ago

We're at 620 endpoints now (mix of Windows + some Macs) across 3 locations and currently running everything through NinjaOne. At first it worked fine when we were under 200 devices, but now we're starting to hit some weird friction.

Patch visibility feels… fragmented? Like I can see status, but not always in a way that helps me prioritize fast.

Alert noise is getting harder to manage especially when multiple issues hit the same device.

Asset tracking isn't as clean as I expected at this scale (we've had duplicate entries and stale devices still showing active).

What's frustrating is leadership expects faster response times now that we've scaled, but operationally it feels slower. We've tried tightening policies, adjusting alerts, even restructuring device groups, but it still feels like we're working around the tool instead of with it. Would appreciate advice here.

u/Upper_Caterpillar_96 — 27 days ago
▲ 5 r/FinOps

Added a few new Spark pipelines last week to handle more data going into BigQuery. Before that usage and costs were fairly stable.

Since then monthly costs are up around 30–40%. Billing shows higher slot usage but doesn't point to which jobs caused it.

Went through Spark UI history and BigQuery jobs. There are a lot of runs across teams, some scheduled, some ad hoc. Hard to connect specific pipelines to the increase. Current monitoring is cluster-level and doesn't give job-level attribution, so everything looks averaged out.

Tried grouping by project and job id. Still no clear link between Spark runs and BigQuery cost changes. GCP billing doesn't help much either when trying to trace back to a specific pipeline.
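One approach that might work instead of grouping by job id is grouping by a label. BigQuery job metadata (the INFORMATION_SCHEMA.JOBS views) includes `total_slot_ms` and `labels` per job, so if each pipeline tagged its BigQuery jobs, slot usage could be summed per pipeline. A minimal sketch of that rollup, assuming the job rows are already exported and assuming a hypothetical `pipeline` label key. Whether the Spark connector in use can set job labels is something to confirm in its docs. The rows below are sample data.

```python
from collections import defaultdict

# Hypothetical rows shaped like an export of BigQuery job metadata.
# Assumption: each Spark pipeline tags its BigQuery jobs with a "pipeline" label.
jobs = [
    {"job_id": "job_a", "labels": {"pipeline": "orders_ingest"}, "total_slot_ms": 7_200_000},
    {"job_id": "job_b", "labels": {"pipeline": "orders_ingest"}, "total_slot_ms": 3_600_000},
    {"job_id": "job_c", "labels": {"pipeline": "clickstream"}, "total_slot_ms": 1_800_000},
    {"job_id": "job_d", "labels": {}, "total_slot_ms": 900_000},
]

MS_PER_HOUR = 3_600_000


def slot_hours_by_pipeline(jobs, label_key="pipeline"):
    """Sum slot-hours per pipeline label; jobs without the label go to 'unlabeled'."""
    totals = defaultdict(float)
    for job in jobs:
        pipeline = job["labels"].get(label_key, "unlabeled")
        totals[pipeline] += job["total_slot_ms"] / MS_PER_HOUR
    return dict(totals)


totals = slot_hours_by_pipeline(jobs)
for pipeline, hours in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{pipeline:15} {hours:6.2f} slot-hours")
```

The "unlabeled" bucket is deliberate: its size shows how much usage still can't be attributed, which is most of it right now.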

Is there a reliable way to tie Spark job activity to BigQuery costs on Dataproc without manually tracing everything? And has Spark monitoring at the job level helped anyone solve this?

u/Upper_Caterpillar_96 — 30 days ago