u/Cheap-Slide-8146

Building self-healing infrastructure for LLM applications

Is anyone interested in building a self-healing AI monitoring system?

Current flow:

LLM App → langsmith ->traces → n8n webhook → failure alerts/errors → another LLM analyzes the failure and suggests improvements (model issues, latency problems, retries, etc.) → sends actionable insights directly to your email.

The idea is to make AI systems not just observable, but capable of diagnosing their own failures.

Would love to hear thoughts from people working in AI infra, LLMOps, or agent engineering.

reddit.com
u/Cheap-Slide-8146 — 2 days ago