I spent the last 6 months talking to AI engineering teams about production agent failures
I was recently building infrastructure for AI agent experimentation and ended up having 50+ in-depth conversations with engineering teams at startups and Series B companies about what actually breaks in production and why. A few things that surprised me:
- most agent failures are not model failures
- prompt changes are often tested way more casually than normal code changes
- almost nobody fully agrees on who owns agent reliability
- teams underestimate the operational cost of flaky agents until customers feel it
Happy to talk about how teams run controlled experiments on prompts/configs, common production failure patterns, evals, reliability ownership, rollout strategies, and the economics behind all this.
Ask me anything.