u/Software_Sennin

r/sre

What do you consider a “bad” page-worthy alert?

I’ve been reviewing alert quality lately and noticed a few patterns that seem to create noise:

  • alerts with no owner
  • alerts with no runbook
  • symptom alerts that self-resolve
  • CPU/memory alerts that are not tied to user impact
  • duplicate paging from app + infra layers
  • short “for” durations (the time a condition must hold before the alert fires) on bursty workloads
  • vague alert descriptions with no action path
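
To make the list above concrete, here is a rough sketch of how some of these patterns could be checked mechanically before an alert is allowed to page. Everything in it is an assumption for illustration: the `Alert` record, its field names, and the two thresholds are hypothetical and not taken from any real alerting tool. Self-resolving alerts and duplicate app + infra pages are left out because they need paging history rather than the alert definition alone.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Alert:
    # Hypothetical alert record; field names are illustrative only.
    name: str
    owner: Optional[str] = None
    runbook_url: Optional[str] = None
    description: str = ""
    for_seconds: int = 0            # how long the condition must hold before firing
    tied_to_user_impact: bool = False
    bursty_workload: bool = False


# Assumed thresholds; tune them for your own environment.
MIN_FOR_SECONDS_BURSTY = 300
MIN_DESCRIPTION_WORDS = 5


def lint_alert(alert: Alert) -> List[str]:
    """Return the reasons this alert should not page yet (empty list means it passes)."""
    problems = []
    if not alert.owner:
        problems.append("no owner")
    if not alert.runbook_url:
        problems.append("no runbook")
    if not alert.tied_to_user_impact:
        problems.append("not tied to user impact")
    if alert.bursty_workload and alert.for_seconds < MIN_FOR_SECONDS_BURSTY:
        problems.append("short 'for' window on bursty workload")
    if len(alert.description.split()) < MIN_DESCRIPTION_WORDS:
        problems.append("vague description")
    return problems


if __name__ == "__main__":
    # A resource alert with no owner, runbook, or link to user impact.
    cpu = Alert(name="HighCPU", for_seconds=60, bursty_workload=True,
                description="CPU high")
    for reason in lint_alert(cpu):
        print(f"{cpu.name}: {reason}")
```

A check like this could run in CI against alert definitions, so that anything returning a non-empty list is routed to a ticket or dashboard instead of a pager.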

For SRE teams here, what makes an alert page-worthy in your environment?

Do you use a checklist or rubric before an alert is allowed to page someone?

u/Software_Sennin — 23 hours ago