u/Signal_Mammoth_9622

How reliable are your Eleven Labs agents?

Hey guys, I have been tinkering with Eleven Labs agents and I wanted to understand how you scale agents to 1000s of calls in production. A lot of things are already breaking, and honestly I'm not sure what specifically is breaking.

How do you guys handle observability or post-call analytics at the conversation level? Do you use 3rd-party tools or manage it in-house?

reddit.com
u/Signal_Mammoth_9622 — 3 days ago

Your Voice AI agent fails in production because you have 0 observability into your stack

I have been building and running voice agents in production for a while now (we've crossed 300K calls) and wanted to write up the failure modes that keep showing up across stacks. Posting here because I'd genuinely like to hear what others are seeing.

The five we keep hitting:

  1. Teams blend infrastructure failures and conversation failures into one quality score. A VAD misconfig is not a conversation problem, but if your dashboard treats them the same, you debug in the wrong direction every time.
  2. No visibility into VAD performance. When this layer fails silently, the agent looks dumb but the actual problem is two layers upstream of the LLM.
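  3. Sampling at 1-2%. At that rate you will almost certainly miss or undercount accent-triggered misclassifications, late-call breakdowns, and underperforming segments: a failure that hits 0.1% of calls shows up, on average, 0.1 times in a 1% sample of 10,000 calls. The stuff that matters lives in the long tail.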
  4. Auto-generated evals from failed calls. Produces noise that looks like signal. We ended up building a human-in-the-loop annotation flow at the sentence level instead.
  5. Evaluating at the agent level instead of the campaign level. An agent can score well on average while quietly tanking a specific campaign objective. "Does this agent speak well" is the wrong unit of evaluation. "Does this agent serve this campaign goal" is the right one.
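To make points 1 and 5 concrete, here's a rough Python sketch. It assumes each call record carries hypothetical fields `agent`, `campaign`, `goal_met`, and `failure_layer` (`"infra"`, `"conversation"`, or `None`); these names are made up for illustration and don't come from any particular platform. The idea is just to keep infrastructure and conversation failures in separate buckets, and to compute the goal rate per campaign as well as per agent.

```python
from collections import defaultdict

# Hypothetical call records; field names and values are illustrative only.
CALLS = (
    [{"agent": "a1", "campaign": "renewals", "goal_met": True, "failure_layer": None}] * 6
    + [
        {"agent": "a1", "campaign": "collections", "goal_met": False, "failure_layer": "infra"},
        {"agent": "a1", "campaign": "collections", "goal_met": False, "failure_layer": "conversation"},
    ]
)


def failure_counts_by_layer(calls):
    """Count failed calls per layer so infra and conversation issues stay separate."""
    counts = defaultdict(int)
    for call in calls:
        layer = call["failure_layer"]
        if layer is not None:
            counts[layer] += 1
    return dict(counts)


def goal_rate_by(calls, key):
    """Fraction of calls that met the goal, grouped by `key` (e.g. agent or campaign)."""
    totals = defaultdict(lambda: [0, 0])  # group -> [goals met, total calls]
    for call in calls:
        bucket = totals[call[key]]
        bucket[0] += int(call["goal_met"])
        bucket[1] += 1
    return {group: met / n for group, (met, n) in totals.items()}


if __name__ == "__main__":
    print("failures by layer:", failure_counts_by_layer(CALLS))
    print("goal rate by agent:", goal_rate_by(CALLS, "agent"))
    print("goal rate by campaign:", goal_rate_by(CALLS, "campaign"))
```

With this toy data the agent-level goal rate is 0.75, which looks acceptable, while the `collections` campaign sits at 0.0, and one of its two failures is an infra problem rather than a conversation problem.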

Curious what others are running into. What's the failure mode you wish you'd caught earlier?

Full writeup with how we built around these is here if anyone wants the longer version:

https://dinodial.ai/voice-ai-observability

u/Signal_Mammoth_9622 — 3 days ago

Question for teams shipping LLM agents in production:

How are you handling prompt iteration + regression testing at scale?

Right now most workflows seem painfully manual:
prompt tweak → test calls → note failures → rewrite → repeat

But every fix creates new failure modes somewhere else.

Has anyone actually found a reliable way to automate prompt evaluation/iteration without:

  • breaking something new
  • overfitting to synthetic conversations
  • humans manually QA’ing everything anyway?

Would love to discuss with folks thinking deeply about this problem and explore whether there’s a better way to solve it.
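For reference, the kind of check I have in mind can be sketched in a few lines of Python. Everything here is an assumption for illustration: the two "agents" are plain stub functions standing in for the old and new prompt (a real setup would call the LLM), and the cases and `check` functions are hypothetical. The point is only the shape: run a fixed case set against both versions and diff the results case by case, so a fix that breaks something else is visible immediately.

```python
from typing import Callable, Dict, List

# Hypothetical fixed test set. In practice these would be drawn from real,
# human-reviewed calls rather than purely synthetic conversations.
CASES: List[dict] = [
    {"id": "greeting", "input": "hi", "check": lambda r: "hello" in r.lower()},
    {"id": "refund", "input": "I want a refund", "check": lambda r: "refund" in r.lower()},
    {"id": "handoff", "input": "talk to a human", "check": lambda r: "transfer" in r.lower()},
]


def baseline_agent(text: str) -> str:
    """Stub standing in for the agent running the current prompt."""
    if "refund" in text:
        return "I can start a refund for you."
    if "human" in text:
        return "Let me transfer you."
    return "Thanks for calling."


def candidate_agent(text: str) -> str:
    """Stub standing in for the agent running the tweaked prompt."""
    if "refund" in text:
        return "I can start a refund for you."
    if "human" in text:
        return "I can help with that myself."
    return "Hello, thanks for calling."


def run_suite(agent: Callable[[str], str], cases: List[dict]) -> Dict[str, bool]:
    """Run every case through the agent and record pass/fail per case id."""
    return {case["id"]: bool(case["check"](agent(case["input"]))) for case in cases}


def diff_results(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> Dict[str, List[str]]:
    """List cases that went pass -> fail (regressions) and fail -> pass (fixes)."""
    regressions = sorted(k for k, ok in baseline.items() if ok and not candidate.get(k, False))
    fixes = sorted(k for k, ok in baseline.items() if not ok and candidate.get(k, False))
    return {"regressions": regressions, "fixes": fixes}


report = diff_results(run_suite(baseline_agent, CASES), run_suite(candidate_agent, CASES))

if __name__ == "__main__":
    print(report)
```

In this toy run the tweak fixes `greeting` but regresses `handoff`, which is exactly the "every fix creates new failure modes" pattern. A simple gate would be to block the change whenever `regressions` is non-empty.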

u/Signal_Mammoth_9622 — 4 days ago
