u/BalancedQuilt3

so i ended up mass reading threads about alert fatigue last night and holy shit its basically everyones on-call horror story lol

found a write-up about tuning monitoring alerts while i was digging around. nothing groundbreaking but a few things actually clicked for me:

separate your alerts into "page me now" vs "make a ticket" vs "just log it." if users/revenue arent actively impacted, it shouldnt be waking anyone up.
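
rough sketch of how id think about that three-way split in python. the names (`Route`, `route_alert`) and the two yes/no inputs are made up by me for illustration, not from the write-up:

```python
from enum import Enum


class Route(Enum):
    PAGE = "page"      # wake someone up
    TICKET = "ticket"  # handle during work hours
    LOG = "log"        # just keep a record


def route_alert(user_impact: bool, needs_action: bool) -> Route:
    """Pick a destination for an alert (hypothetical helper).

    user_impact: users or revenue are actively affected right now.
    needs_action: a human has to do something eventually.
    """
    if user_impact:
        return Route.PAGE
    if needs_action:
        return Route.TICKET
    return Route.LOG
```

so something like a disk slowly filling up would be `route_alert(False, True)` and land in tickets, not on someones phone.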

the 3am test: if this fires at 3am and nothing is actually broken, would i be pissed? if yes, it shouldnt be paging me.

hysteresis matters more than people think. if you alert at 90%, dont resolve at 89.9% or the alert will flap on-off-on-off all night. resolve at something like 85%.
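
minimal sketch of the hysteresis idea in python, using the 90/85 numbers from above. the class name and `update` method are my own invention, real alerting tools each have their own way of configuring this:

```python
class HysteresisAlert:
    """Fires at one threshold, resolves only at a lower one (illustrative)."""

    def __init__(self, fire_at: float = 90.0, resolve_at: float = 85.0):
        if resolve_at >= fire_at:
            raise ValueError("resolve_at must be below fire_at")
        self.fire_at = fire_at
        self.resolve_at = resolve_at
        self.firing = False

    def update(self, value: float) -> bool:
        """Feed one sample and return whether the alert is firing."""
        if not self.firing and value >= self.fire_at:
            self.firing = True
        elif self.firing and value <= self.resolve_at:
            self.firing = False
        # values between the two thresholds leave the state unchanged
        return self.firing


alert = HysteresisAlert()
samples = [89.0, 90.5, 89.9, 90.1, 88.0, 84.0]
states = [alert.update(v) for v in samples]
print(states)  # [False, True, True, True, True, False]
```

with a single 90% threshold those same samples would fire, resolve, and fire again before finally clearing. with the gap its one fire and one resolve.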

alert on what users actually feel. nobody cares about cpu at 95% if latency is fine and error rate is zero.
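
heres roughly what "alert on symptoms" could look like as a paging check in python. the function name and the thresholds (500 ms p99, 1% errors) are placeholders i picked, not recommendations:

```python
def should_page(
    p99_latency_ms: float,
    error_rate: float,
    cpu_pct: float,
    latency_limit_ms: float = 500.0,  # assumed threshold, tune for your service
    max_error_rate: float = 0.01,     # assumed threshold (1% of requests)
) -> bool:
    """Page only on things users feel: slow responses or errors.

    cpu_pct is accepted but deliberately ignored for paging; high cpu
    with healthy latency and errors is a ticket or a log line at most.
    """
    return p99_latency_ms > latency_limit_ms or error_rate > max_error_rate


print(should_page(p99_latency_ms=120.0, error_rate=0.0, cpu_pct=95.0))  # False
print(should_page(p99_latency_ms=900.0, error_rate=0.0, cpu_pct=40.0))  # True
```

first call is the exact case from above: cpu at 95% but latency fine and zero errors, so nobody gets woken up.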

all of this makes sense in theory but i really dont want to sit there hand-tuning 50 rules and maintaining runbooks and routing and silences for weeks on end

still looking for the best way to avoid that. for those of you whove been through it, what was the ONE thing that cut noise the most?

u/BalancedQuilt3 — 25 days ago