Automating root cause analysis for AI agent failures
I’m an SRE and have a high tolerance for pain. But for the last few months I’ve been babysitting LLM agents in production. This is a new kind of hell.
When an agent behaves unexpectedly and an alert fires, I open the logs. Then I have to go through fifty thousand lines to find the prompt or tool call that sent things off the rails. It feels more like I’m doing archaeology. This is not sustainable.
The failure mode is rarely a crash, it’s more of a drift. The agent completed successfully by the metrics we’re tracking but the output was wrong, which only becomes apparent downstream. Sometimes it’s hours later so by the time I’m investigating I have to reverse-engineer intent from a log file. The log files were not designed for this.
I’ve tried dumping structured JSON logs with full prompt / response pairs, but that ends up becoming its own archaeological dig. Datadog gives me spans and latency and token counts, which doesn’t tell me what I need to know. Grafana gives me a dashboard of things that are not the problem. How do you all deal with this?