u/GoldTap9957

I'm honestly starting to feel like my IT team is becoming a password reset department :(

Every morning it's the same cycle:

Locked out again.

Can you install this app?

VPN is not working.

How do I access the shared drive?

We support around 200 remote employees and the repetitive tickets are eating up the entire day. The frustrating part is most of these are simple fixes, but they still need someone from our team to jump in manually.

reddit.com
u/GoldTap9957 — 2 days ago
▲ 0 r/sre

Best practices for software performance optimization before production rollout in 2026?

We have an API handling checkout for our ecommerce site, usually around 500 reqs/sec. Last week we started looking at performance because some endpoints were hitting 300ms p95.

I found a service doing N+1 queries and rewrote it with batching using goroutines and a worker pool. Also adjusted the caching layer, moved to Redis with pipelining, and tuned connection pooling. In staging it looked good, latencies dropped significantly and no obvious issues.

We pushed it to prod during low traffic and everything looked fine.

Then traffic ramped up hard. Latencies jumped to seconds, error rate climbed, and the API started timing out. CPU spiked across pods, Redis backed up, and the worker pool started thrashing under load.

Looking back, a few things didn't hold up under real traffic:

Batching assumed fairly uniform request sizes, which wasn't true during peak.

The Redis instance could not handle the burst pattern the way Memcached did.

Connection pool limits were not enforced the way we expected under load.
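
For the batching issue, here is a minimal sketch of one possible fix: cap each batch by total weight instead of item count, so a single oversized request cannot drag a whole batch down. `Item`, `Size`, and `maxWeight` are made-up names for illustration, not what our service actually uses.

```go
package main

import "fmt"

// Item is a hypothetical unit of work. Size is whatever cost proxy you trust
// (rows to fetch, payload bytes, cart lines).
type Item struct {
	ID   int
	Size int
}

// batchByWeight groups items in order so that each batch's total Size stays
// within maxWeight. An item larger than maxWeight gets a batch of its own
// instead of being dropped.
func batchByWeight(items []Item, maxWeight int) [][]Item {
	var batches [][]Item
	var cur []Item
	weight := 0
	for _, it := range items {
		// Close the current batch if adding this item would exceed the budget.
		if len(cur) > 0 && weight+it.Size > maxWeight {
			batches = append(batches, cur)
			cur, weight = nil, 0
		}
		cur = append(cur, it)
		weight += it.Size
	}
	if len(cur) > 0 {
		batches = append(batches, cur)
	}
	return batches
}

func main() {
	// Illustrative sizes only: one request is far larger than the rest.
	items := []Item{{1, 2}, {2, 3}, {3, 40}, {4, 1}, {5, 1}}
	for i, b := range batchByWeight(items, 10) {
		fmt.Printf("batch %d: %v\n", i, b)
	}
}
```

With a count-based batch size, the large item would share a batch with small ones and set the latency for all of them. Weight-based batching isolates it.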

We rolled back, but not before taking a hit.

This is not the first time that optimizing ahead of traffic has caused more damage than the original issue.

How are you validating performance changes under realistic load before pushing to production?
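
For context, this is roughly the shape of stepped load test I mean, as a sketch only. It ramps concurrency in phases against an `httptest` stand-in (swap in a staging URL) and reports p95 per phase. `runPhase` and `percentile` are hypothetical helpers, and the worker counts are arbitrary.

```go
package main

import (
	"fmt"
	"io"
	"math"
	"net/http"
	"net/http/httptest"
	"sort"
	"sync"
	"time"
)

// percentile returns the nearest-rank p-th percentile (0 < p <= 100) of the
// samples. It sorts a copy so the caller's slice is left untouched.
func percentile(samples []float64, p float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	s := append([]float64(nil), samples...)
	sort.Float64s(s)
	rank := int(math.Ceil(p / 100 * float64(len(s))))
	if rank < 1 {
		rank = 1
	}
	if rank > len(s) {
		rank = len(s)
	}
	return s[rank-1]
}

// runPhase starts `workers` goroutines, each sending `perWorker` sequential
// GETs. It returns successful latencies in milliseconds and the error count.
func runPhase(client *http.Client, url string, workers, perWorker int) ([]float64, int) {
	var (
		mu   sync.Mutex
		lat  []float64
		errs int
		wg   sync.WaitGroup
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				start := time.Now()
				resp, err := client.Get(url)
				if err == nil {
					io.Copy(io.Discard, resp.Body) // drain so the connection can be reused
					resp.Body.Close()
				}
				ms := float64(time.Since(start).Microseconds()) / 1000
				mu.Lock()
				if err != nil || resp.StatusCode >= 500 {
					errs++
				} else {
					lat = append(lat, ms)
				}
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return lat, errs
}

func main() {
	// Stand-in for the real endpoint; the 2 ms sleep is an arbitrary placeholder.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(2 * time.Millisecond)
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	client := &http.Client{Timeout: 2 * time.Second}
	for _, workers := range []int{5, 20, 50} {
		lat, errs := runPhase(client, srv.URL, workers, 10)
		fmt.Printf("workers=%d ok=%d errs=%d p95=%.2fms\n", workers, len(lat), errs, percentile(lat, 95))
	}
}
```

A closed-loop generator like this slows down when the server slows down, so it understates real burst traffic. It is a starting point for spotting where p95 bends, not a replacement for replaying production-shaped load.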

u/GoldTap9957 — 9 days ago
▲ 3 r/SaaS

We spent the last year trying to modernize our B2B funnel. Everything looked right on paper: an AI SDR layer for qualification, CRM integration with Salesforce, inbound lead automation across the website and LinkedIn, and lead routing rules designed to push high-intent buyers straight into sales.

But the reality is that our best leads still fall through the cracks in places nobody really monitors. Not because of a lack of traffic or intent, but because every tool optimizes its own step instead of the full journey.

Here is what keeps happening:

- A high value account engages on one channel but gets treated as a new lead on another
- Sales assumes qualification already happened, marketing assumes sales followed up, and nobody owns the transition
- AI SDR tools flag interest but do not always connect context across interactions
- CRM ends up with fragmented history instead of a single continuous buyer story
- By the time someone looks at the lead, the intent is already cold

So even with a fully automated stack, the system behaves less like a funnel and more like disconnected checkpoints. We optimized every stage individually, but not the handoffs between them.

At this point I am starting to think the real problem in B2B is not capturing leads or qualifying them. It is keeping context alive long enough for a real conversation to happen. Where exactly do you think most B2B funnels break today: in the tools, or in the transitions between them?

u/GoldTap9957 — 14 days ago

We started testing some AI debugging tools in staging because logs were getting messy with edge cases in our Go services.

Pushed one to prod last week on a low traffic service. Now it's throwing suggestions all the time. Some are useful, but a lot are obvious or don't really help, and they just add noise.

We are running k8s with a few microservices. Nothing huge, but enough that this extra layer makes it harder to tell what actually matters.

Tried tuning it a bit, still feels like more distraction than help. Anyone running this in prod in a way that actually cuts down debug time instead of adding noise?

u/GoldTap9957 — 15 days ago

In a lot of MSP environments, profit leaks don't come from big failures. They come from day-to-day inefficiencies like:

Spending too much time on low value tickets.

Assigning senior techs to basic password resets and routine issues.

No real optimization of technician workload distribution.

Manually handling repetitive maintenance and updates.

Poor visibility into where time is going.

Individually, none of these look serious. But together, they quietly destroy margins.

Most teams think they have a revenue problem when in reality it's an efficiency problem. The real shift happens when MSPs start automating repetitive work and balancing workload properly instead of scaling headcount linearly with ticket volume.

Otherwise, growth just means more noise, not more profit. So at what point do MSPs usually realize they are bleeding margin through inefficiency rather than lack of clients or pricing?

u/GoldTap9957 — 15 days ago

At small companies (10 to 30 people), IT support is easy to manage.

But once you hit the 50 to 200 employee range, things start to fall apart fast.

Here's why:

  1. Ticket volume grows faster than IT staff. More employees = more devices, SaaS tools, onboarding, and quick fixes that pile up daily.

  2. No real prioritization system. Everything becomes urgent, so critical issues get buried under low impact tickets.

  3. Manual triage becomes the actual job. Instead of solving problems, IT spends hours just sorting, assigning, and chasing tickets.

  4. One tech handles everything. Password resets, app issues, network problems: constant context switching slows everything down.

  5. Tool fragmentation adds more complexity. Separate RMM, helpdesk, and monitoring tools create more work instead of reducing it.
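
To make point 2 concrete, here is a minimal sketch of a prioritization rule, loosely modeled on the impact/urgency matrix that ITIL-style service desks use. The `Ticket` fields, the 1-3 scale, and the multiplication are assumptions for illustration, not a feature of any particular helpdesk tool.

```go
package main

import (
	"fmt"
	"sort"
)

// Ticket is a hypothetical helpdesk record. Impact and Urgency use a 1-3
// scale where 3 is the highest.
type Ticket struct {
	ID      int
	Summary string
	Impact  int // how many people or systems are affected
	Urgency int // how soon work is blocked
}

// Score combines impact and urgency into a single sortable priority.
func (t Ticket) Score() int { return t.Impact * t.Urgency }

// triage returns a copy of the queue ordered by Score, highest first.
// Tickets with equal scores keep their arrival order.
func triage(tickets []Ticket) []Ticket {
	out := append([]Ticket(nil), tickets...)
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].Score() > out[j].Score()
	})
	return out
}

func main() {
	// Made-up queue for illustration.
	queue := []Ticket{
		{1, "Password reset", 1, 2},
		{2, "VPN down for the sales team", 3, 3},
		{3, "Install an app", 1, 1},
		{4, "Shared drive access", 2, 2},
	}
	for _, t := range triage(queue) {
		fmt.Printf("score=%d  #%d %s\n", t.Score(), t.ID, t.Summary)
	}
}
```

Even a crude score like this stops a team-wide outage from sitting behind a stack of single-user requests, and the low scorers are the natural candidates for self-service or automation.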

This is exactly where most IT teams realize that adding more people doesn't fix the problem, only better systems do. Has anyone's team ever hit this breaking point in IT support, and what helped you get past it?

u/GoldTap9957 — 17 days ago

Spent the last 6 weeks tuning Spark configs across a handful of jobs. Executor memory, parallelism, shuffle partitions: went through the usual levers. Runtime improved on most runs but job stability didn't move.

The same jobs that ran faster after tuning still fail or slow down randomly under load. The numbers look better on average but the variance is wider than before. A job that used to take 28 min consistently now finishes anywhere between 18 and 55 min.

Checked for skew, GC pressure, shuffle spill. Nothing obvious. Bad runs don't leave much of a trace, just slow tasks with no clear pattern between them.

The working theory is that tuning without run-to-run visibility just moves the problem. You improve one metric and introduce instability somewhere else without seeing it. What's missing is a Spark observability tool that shows what shifts between a good run and a bad one, not just aggregate stage times but the specific conditions that differ.

How do you approach stability separately from raw performance? And has a Spark observability tool helped you connect the variance to a root cause?
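
To show the kind of good-run versus bad-run comparison I have in mind, here is a rough sketch. It takes two flat maps of per-run metrics and ranks them by relative change. The metric names and all values other than the 18 and 55 min runtimes are invented for illustration, and how you export the metrics from Spark is left open.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// MetricDelta describes how one metric moved between a good and a bad run.
type MetricDelta struct {
	Name      string
	Good, Bad float64
	RelChange float64 // (bad - good) / |good|; +Inf when good is 0 and bad is not
}

// diffRuns compares the metrics present in both runs and returns them ordered
// by the absolute size of their relative change, largest first.
func diffRuns(good, bad map[string]float64) []MetricDelta {
	var out []MetricDelta
	for name, g := range good {
		b, ok := bad[name]
		if !ok {
			continue // only compare metrics recorded for both runs
		}
		var rel float64
		switch {
		case g != 0:
			rel = (b - g) / math.Abs(g)
		case b != 0:
			rel = math.Inf(1)
		}
		out = append(out, MetricDelta{name, g, b, rel})
	}
	sort.Slice(out, func(i, j int) bool {
		ai, aj := math.Abs(out[i].RelChange), math.Abs(out[j].RelChange)
		if ai != aj {
			return ai > aj
		}
		return out[i].Name < out[j].Name // deterministic order for ties
	})
	return out
}

func main() {
	// Hypothetical metric names and values, apart from the two runtimes.
	good := map[string]float64{"runtime_min": 18, "shuffle_read_gb": 40, "max_task_sec": 90, "executors_lost": 0}
	bad := map[string]float64{"runtime_min": 55, "shuffle_read_gb": 41, "max_task_sec": 900, "executors_lost": 2}
	for _, d := range diffRuns(good, bad) {
		fmt.Printf("%-16s good=%-6g bad=%-6g rel=%+.2f\n", d.Name, d.Good, d.Bad, d.RelChange)
	}
}
```

The point is to rank what changed rather than stare at averages: in this made-up example the shuffle volume barely moves while the slowest task and lost executors dominate, which would point away from config tuning and toward the cluster.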

u/GoldTap9957 — 25 days ago