Anyone else struggling with production error detection despite having tons of observability data?
This is probably a basic question, but I am stuck on it. We have Prometheus, Datadog, custom metrics, and logs going everywhere. Our stack is monitored to death, but when something breaks in production we still find out from customers before any alert catches it.
I have been digging through dashboards, and our alert thresholds look reasonable on paper, but clearly they are not working. Either they are too noisy, so people ignore them, or they are too quiet and miss actual issues.
Has anyone dealt with this situation, where the tooling is there but detection still does not work well? I am trying to understand whether this is a setup problem or something else.
What actually helped you get from lots of data to alerts that catch real problems before your customers do?