r/VoiceAutomationAI

▲ 5 r/VoiceAutomationAI+2 crossposts

Voice AI biggest unsolved challenges

What do you think are the biggest unsolved challenges in Voice AI that almost nobody is seriously working on right now?

Not “better ASR” or “lower latency” but deeper problems that could define the next generation of voice products/research.

Examples:
- Real-time conversational memory that actually feels human
- Emotion + intent understanding beyond sentiment analysis
- Interruptions/turn-taking that feel natural
- Voice-native UX instead of “ChatGPT but spoken”
- Long-term personalization without being creepy
- Multilingual/code-switching conversations
- Continuous ambient agents
- Social/companion dynamics
- Voice AI for kids/elderly/accessibility
- Real-time multimodal understanding (voice + environment + context)

Curious what people building/using Voice AI think is still fundamentally broken or missing.

reddit.com
u/Legal_Wolverine_7267 — 13 hours ago
▲ 19 r/VoiceAutomationAI+13 crossposts

How do you actually test a voice AI agent without calling it yourself every time?

So we've been working on a voice bot that handles customer calls and honestly the testing part has been brutal. We were literally calling the thing ourselves to check if it broke after every change.

Eventually we just wrote a framework that synthesizes fake caller audio, pipes it into the agent, and checks if the response is sane — latency, hallucinations, whether it handles interruptions, etc. Runs locally against a SQLite db, no cloud stuff.

It connects over websockets, can mock twilio streams, works with elevenlabs and vapi agents too. You can also plug in ollama as the judge so the whole thing runs offline.

We open sourced it: https://github.com/unforkopensource-org/decibench
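For anyone who wants the general shape of that loop without pulling in a framework, here is a minimal sketch. It is not decibench's API: `synth_tone`, `run_case`, and `stub_agent` are hypothetical names, the "caller audio" is just a sine tone, and the agent is a local stub standing in for something you would really reach over a websocket.

```python
import math
import struct
import time
from typing import Callable

SAMPLE_RATE = 8000  # Hz; a common telephony rate, assumed here


def synth_tone(duration_s: float, freq_hz: float = 440.0) -> bytes:
    """Return 16-bit mono PCM for a sine tone, standing in for synthesized caller speech."""
    n = int(duration_s * SAMPLE_RATE)
    samples = (
        int(12000 * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE))
        for i in range(n)
    )
    return struct.pack(f"<{n}h", *samples)


def run_case(agent: Callable[[bytes], str], audio: bytes,
             must_contain: str, max_latency_s: float) -> dict:
    """Feed audio to the agent, time the reply, and apply simple sanity checks."""
    start = time.perf_counter()
    reply = agent(audio)
    latency = time.perf_counter() - start
    return {
        "latency_s": latency,
        "latency_ok": latency <= max_latency_s,
        "content_ok": must_contain.lower() in reply.lower(),
    }


def stub_agent(audio: bytes) -> str:
    """Hypothetical agent; a real one would sit behind a websocket or phone stream."""
    return "Thanks for calling, how can I help?" if audio else ""
```

Calling `run_case(stub_agent, synth_tone(0.5), "help", 1.0)` returns a dict you can assert on in CI. Hallucination and interruption checks need a judge model and streaming audio, which this sketch leaves out.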

Curious how others here handle this. Are you just vibing and hoping production doesn't break or is there a better workflow I'm missing?

u/Tricky_School_4613 — 2 days ago

What's the most human-sounding TTS voice you've actually used in production?

ElevenLabs sounds incredible on paper, but it's almost too clean: that HD clarity actually breaks immersion in a lot of real-world call flows. Not to mention the cost at scale.

My hunch is that slightly imperfect, mid-range voices convert better because they feel less robotic-trying-too-hard.

Heard good things about Cartesia and Deepgram Aura lately. Anyone running these in production? What's working for you and what accent/style are your US or UK users responding to best?

u/Aryanlabs — 3 days ago
▲ 1 r/VoiceAutomationAI+1 crossposts

Help building offline AI assistant

I'm kinda new to this and feeling overwhelmed by the amount of software out there. I want to build my own voice-controlled AI assistant. I would like to use it for:

- simple automation tasks

- play music from a local hdd

- chat (general conversation)

It would be nice if I could choose or configure its voice.

I'll need to be able to add my own scripts in Python or C# to integrate various other devices.

For the hardware I would like to use an EliteDesk mini PC with an i5-7500 CPU, 16 GB RAM, and an SSD, but no graphics card.

I've just started to fiddle with ollama and open claw. But it's soooo slow and I think it might be overkill for what I'm trying to achieve.

I don't own a rpi and I would like to use the hw I already have. Unless it's impossible or way too complicated compared to other hw solutions. Still, I would consider changing the hw if it greatly simplifies the project.

Can you guys help me out? Thanks in advance

u/Gold_Atmosphere_7502 — 3 days ago
▲ 9 r/VoiceAutomationAI+1 crossposts

Vapi raised $50M Series B and immediately rebuilt the dashboard, here's what actually changed

So Vapi closed a $50M Series B from Peak XV, they've now done 1 billion calls, and they're powering Ring, Intuit, ServiceTitan. Cool. But the second you log in this week you can see where part of that money went, because the agent builder looks completely different.

Spent the last few days rebuilding agents on it. Here's what's worth knowing if you haven't logged in yet:

The wins:

  • Logs are now ON the agent page. FINALLY. No more bouncing to Observe > Logs > filter by assistant just to check what your agent did on the last call. This alone saves me probably 50 clicks a day.
  • Transcriber, model and voice are now top-level on the assistant page. Click into any of them and a sidebar opens with everything (provider, language, fallback, denoising, smart endpointing). Way less scrolling.
  • Fallbacks are first-class citizens now. You can set fallback transcriber, model AND voice from the same panel. Pro tip if you're new: never set the fallback to the same provider as the primary. If OpenAI goes down, your fallback to OpenAI also goes down. Pick a different vendor.
  • Structured outputs and scorecards have replaced the old summary, success eval, and structured data fields (those are deprecated now). Structured outputs are now created globally and assigned to assistants, same model as tools. Build once, attach everywhere.
  • Monitors are (fairly) new. Set them up to catch things like transcriber request failures across all assistants, with thresholds and Slack/email notifications. Production hygiene baked in.
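The "different vendor for the fallback" tip is easy to lint for before you deploy. A rough sketch, assuming you keep your assistant settings in a plain dict: the layout and the provider names below are made up for illustration and are not Vapi's actual config schema.

```python
from typing import Dict, List


def shared_vendor_fallbacks(config: Dict[str, Dict[str, str]]) -> List[str]:
    """Return the component names whose fallback uses the same provider as the primary."""
    return [
        name
        for name, c in config.items()
        if c.get("fallback") is not None
        and c["fallback"].lower() == c["primary"].lower()
    ]


# Hypothetical assistant settings, one entry per pipeline component.
assistant = {
    "transcriber": {"primary": "deepgram", "fallback": "assemblyai"},
    "model": {"primary": "openai", "fallback": "openai"},
    "voice": {"primary": "elevenlabs", "fallback": "cartesia"},
}

print(shared_vendor_fallbacks(assistant))  # ['model']
```

Anything it returns is a component that goes down together with its own fallback.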

The small stuff that still annoys me:

  • End call is no longer a toggle, you have to create an End Call tool and assign it. Fine once you know, but I broke an agent the first time because I assumed it carried over.
  • The old layout toggle is hidden under the three dots top right. If you panic on first login, that's your escape hatch.

My humble opinion: the rebrand especially, but also the new UI feel less developer-first and more enterprise-pitchable. Make of that what you will. I think it's net positive, the old dashboard had real friction once you scaled past a couple of agents.

Made a full walkthrough on YouTube going through every tab if you want the visual version.

What's been your experience so far? Anyone hit weirdness on the migration?

u/ApprehensiveUnion288 — 3 days ago

AI for sales training

Any recommendations for a company that could build a platform for inbound sales training? I’m looking to build an AI voice-based training platform for front desk and sales teams in the medspa/aesthetics industry.

The platform would allow team members to practice real inbound phone scenarios with an AI caller. The AI would roleplay different types of callers, objections, price shoppers, booking situations, and treatment inquiries. After each call, the system would score the rep based on a custom training framework and provide feedback.

The first version would need:

- User logins
- AI voice roleplay calls
- Custom scenarios
- Call recording/playback
- Automated scoring
- Manager dashboard
- Training content library
- Ability to upload scripts, notes, call recordings, and company website information as the knowledge base
- Progress tracking by rep and location

u/TNLex23 — 4 days ago
▲ 6 r/VoiceAutomationAI+1 crossposts

OSS to win - VoiceBox is here

OSS app replaces ElevenLabs & WisprFlow, runs 100% locally.

→ Clone voice from 3s audio
→ 7 TTS engines in one
→ 23 langs: Ar, Hi, Ja etc.
→ Built-in MCP srv so Claude Code/Cursor/Cline speak cloned voice
→ Local LLM rewrites in-char before TTS

u/bhalothia — 4 days ago

Is Selling Voice AI Agents Really That Hard? What Worked for You?

Hi,

I’m a Software Developer and recently built a Voice Agent platform backed by deep expertise in low-latency systems and AI. The product has highly human-like voice interactions, very low response latency, and a complete external dashboard for usage tracking and billing.

The challenge now is sales.

I’ve worked with salespeople from different countries, but most of them seem to struggle when it comes to selling Voice AI solutions effectively. I’d really love to hear from people who have actual experience selling Voice Agents or AI automation products.

A few things I’d love insights on:

- What worked differently for you in Voice Agent sales?

- How do you approach sales properly for this type of product?

- How do you find and evaluate good salespeople for AI/SaaS products?

- Any practical strategies, lessons, or growth hacks that helped?

Would really appreciate any advice or experiences you can share.

Thanks!

u/Away_Gift2387 — 4 days ago

Question for teams shipping LLM agents in production:

How are you handling prompt iteration + regression testing at scale?

Right now most workflows seem painfully manual:
prompt tweak → test calls → note failures → rewrite → repeat

But every fix creates new failure modes somewhere else.

Has anyone actually found a reliable way to automate prompt evaluation/iteration without:

  • breaking something new
  • overfitting to synthetic conversations
  • humans manually QA’ing everything anyway?

Would love to discuss with folks thinking deeply about this problem and explore whether there’s a better way to solve it.

u/Signal_Mammoth_9622 — 4 days ago

How I built a production TTS API: sentence-boundary chunking, Redis distributed locks, and killing the thundering herd

Built a text-to-speech API that converts full articles to MP3. The interesting engineering problems weren't the TTS calls — they were everything around them.

**The chunking problem**

Every TTS provider has a per-request character limit (Polly standard: 3,000 chars). A real article is 8,000–20,000 chars. Naive character-boundary splitting produces broken audio mid-word. The solution: a two-threshold sentence-boundary splitter.

- `target_chars = 2500` — soft target; flush the buffer when reached

- `max_chars = 4000` — hard ceiling; flush before appending if the next sentence would exceed it

- Split regex: `(?<=[.!?])\s+` — only splits after terminal punctuation

Result: every chunk is a coherent group of complete sentences, always within the provider limit.
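A minimal sketch of the two-threshold splitter described above, using the stated regex and defaults. The function name `chunk_text` and the exact flush order are my assumptions, not necessarily the code from the writeup.

```python
import re
from typing import List

# Split only on whitespace that follows terminal punctuation.
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")


def chunk_text(text: str, target_chars: int = 2500, max_chars: int = 4000) -> List[str]:
    """Group whole sentences into chunks using a soft target and a hard ceiling."""
    chunks: List[str] = []
    buf = ""
    for sentence in SENTENCE_SPLIT.split(text.strip()):
        if not sentence:
            continue
        candidate = f"{buf} {sentence}" if buf else sentence
        if buf and len(candidate) > max_chars:
            # Hard ceiling: flush before appending the sentence that would overflow.
            chunks.append(buf)
            buf = sentence
        else:
            buf = candidate
        if len(buf) >= target_chars:
            # Soft target: flush once the buffer is big enough.
            chunks.append(buf)
            buf = ""
    if buf:
        chunks.append(buf)
    return chunks
```

Two caveats with this sketch: a single sentence longer than `max_chars` is still emitted as one oversized chunk, and `max_chars` should be set no higher than the per-request limit of the provider you actually call.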

**The caching layer**

TTS synthesis is deterministic — same text + same voice/engine/region = identical audio bytes every time. Cache key structure:

`sha256(text) + voice_id + engine + region`

All four parameters matter. Swapping from `Joanna/standard` to `Matthew/neural` must be a cache miss, not a hit.

Warm cache: N × `redis.get()` + ffmpeg concat. Latency under 300ms for most articles. Zero upstream calls.
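The key structure above could be built like this. A sketch only: the `tts:` prefix and the colon separators are assumptions added for readability.

```python
import hashlib


def chunk_cache_key(text: str, voice_id: str, engine: str, region: str) -> str:
    """Build a cache key that changes when any of the four inputs changes."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"tts:{digest}:{voice_id}:{engine}:{region}"
```

Hashing the text keeps the key a fixed length no matter how long the chunk is, while the voice, engine, and region stay readable when you inspect keys in Redis.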

**The thundering herd**

Without locking: 50 concurrent users hit a cold article → 50 × 7 chunks = 350 Polly calls, 343 of them redundant.

Fix: Redis `SET NX` distributed lock per chunk. One worker wins the lock, synthesizes, writes to cache, releases. Everyone else exponential-backoff polls until the cache key appears.

Backoff: start at 50ms, grow ×1.25 per iteration, cap at 500ms.

Critical detail: lock release is in a `finally` block. A failed synthesis that doesn't release its lock blocks all subsequent requests for that chunk until TTL expiry — potentially minutes.

Result under load: `chunk cache stats hits=49 misses=1` per chunk. 7 Polly calls total, not 350.
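Roughly what the lock-then-poll flow looks like in code. This is a sketch under assumptions: `get_or_synthesize`, the `lock:` key prefix, and the TTL and timeout defaults are mine. `client` is expected to behave like a redis-py client (`get`, `set(..., nx=True, ex=...)`, `delete`), and `InMemoryClient` is only a stand-in so the example runs without a Redis server.

```python
import time
from typing import Callable, Iterator


def backoff_delays(start: float = 0.05, factor: float = 1.25,
                   cap: float = 0.5) -> Iterator[float]:
    """Yield poll delays in seconds: 50 ms, growing x1.25, capped at 500 ms."""
    delay = start
    while True:
        yield delay
        delay = min(delay * factor, cap)


def get_or_synthesize(client, cache_key: str, synthesize: Callable[[], bytes],
                      lock_ttl_s: int = 60, timeout_s: float = 30.0) -> bytes:
    """Return cached audio, or let exactly one caller synthesize it."""
    cached = client.get(cache_key)
    if cached is not None:
        return cached

    lock_key = f"lock:{cache_key}"  # hypothetical naming scheme
    if client.set(lock_key, "1", nx=True, ex=lock_ttl_s):
        try:
            audio = synthesize()
            client.set(cache_key, audio)
            return audio
        finally:
            # Always release, even if synthesis raised.
            client.delete(lock_key)

    # Lost the race: poll the cache with exponential backoff.
    deadline = time.monotonic() + timeout_s
    for delay in backoff_delays():
        if time.monotonic() > deadline:
            raise TimeoutError(f"gave up waiting for {cache_key}")
        time.sleep(delay)
        cached = client.get(cache_key)
        if cached is not None:
            return cached
    raise AssertionError("unreachable")


class InMemoryClient:
    """Single-process stand-in for the redis-py calls used above (TTL is ignored)."""

    def __init__(self) -> None:
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value, nx=False, ex=None):
        if nx and key in self.data:
            return None
        self.data[key] = value
        return True

    def delete(self, key):
        return 1 if self.data.pop(key, None) is not None else 0
```

One caveat: a plain `delete` in `finally` can remove another worker's lock if yours outlived its TTL. The usual hardening is to store a random token as the lock value and only delete when the token still matches. In this sketch, waiters also simply time out if the winner fails.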

**Provider comparison (brief)**

- Piper (local): free, no concurrency, model files are hundreds of MB, degrades on long inputs

- ElevenLabs: best voice quality, cost curve is steep at real traffic levels

- Amazon Polly: 5M chars/month free (standard), permanent — right economics for this use case

Full writeup with architecture diagram, all code, and the failure sequence in order: From Piper to Polly: How I Built a Production-Ready Text-to-Speech API (and That Broke Along the Way)

What I'm solving next: moving synthesis off the request thread into an async job queue (ARQ vs Celery) and streaming chunk_0 to the client while chunk_1 is still synthesizing.

u/lizcodes — 6 days ago

Why are realistic conversational audio datasets still so hard to find?

Working on conversational/voice systems internally and we keep running into the same bottleneck where most public speech datasets still feel very “lab-like” compared to production environments.

A lot of the common datasets are:
- clean recordings
- isolated speakers
- minimal interruption overlap
- low emotional variance
- limited telecom degradation
- highly structured conversations

all of which differ drastically from real deployments.

We’ve been trying to find stronger datasets around:
- customer support conversations
- multilingual support calls
- emotionally escalated interactions
- difficult ASR environments
- telecom degradation
- retention/escalation scenarios
- interruption-heavy dialogue
- and long-form conversational drift.

Any recommendations on where to find datasets like these would be appreciated!

Feels like current public datasets still underrepresent the actual conditions most conversational systems face once they hit production traffic.

u/Helpful_Actuator9790 — 8 days ago
▲ 12 r/VoiceAutomationAI+7 crossposts

Three bots in a trenchcoat is not omnichannel

Self-serve is exciting. Genuinely. But if I am honest, it is not the most interesting thing about 13 May.

The most interesting thing is that we have been quietly running architecture that the rest of the industry is only just figuring out exists.

A competitor recently launched real-time SMS ingestion. The coverage was breathless. Everyone lost it. So innovative. Revolutionary. Game-changing.

Me? I looked at our codebase and thought: "SMS ingestion. Wow. That is so 2025."

Here is what we actually built, and have been running in production for the better part of a year.

Mid-voice-call, Elba texts a short URL to the caller. The caller fills out a form on their phone. The structured data comes back into the live call via RPC. The workflow receives clean JSON. The voice call never paused. The agent never lost session state. The caller submitted a form while still talking and the agent acted on it in the same conversational turn.

That is not SMS ingestion. That is a bidirectional channel bridge inside a single active session. Sending an SMS during a call is not new. Getting structured data back into the active session in real time without dropping state on either side - that is the part nobody else has shipped.

And it sits on top of something even more fundamental.

Most "omnichannel AI" is three bots in a trench coat. A voice agent, a WhatsApp bot, a webchat widget, all pointing at the same CRM row and calling it unified. Each with its own prompt, its own config, its own version history, its own failure modes.

Elba is one agent. One workflow. One memory layer. Voice, WhatsApp, SMS, email and webchat all running through the same execution engine. Not copies. Not synced versions. The same agent, same logic, same memory, regardless of which channel the conversation arrived on. Deployments are atomic - every channel switches to the new workflow version in the same transaction. No drift. No "did the WhatsApp bot get the update" incident. One audit trail.

When a regulated enterprise customer asks what exactly their AI told a customer across every channel and every session for the past six months, we have a single clean answer.

The competition is announcing SMS ingestion and calling it a breakthrough.

We are launching self-serve on 13 May and already cooking the next thing. We may have put it on hold until after the launch. Our tech never sleeps though.

If you want an agent that actually knows who it is talking to across every channel and every session: self-serve opens 13 May at www.kolsetu.com.

Full technical writeup: https://www.kolsetu.com/blog/the-architecture-nobody-else-built

u/EdikTheFurry — 9 days ago

Seeking collaborator/advice for "StillVoice" – AI-driven silent-speech interface for tracheostomy patients

Hi everyone,

I’m working on a project called StillVoice. The mission is to restore vocal identity for tracheostomy patients using a silent-speech interface. I’ve developed the business logic, branding, and a high-level technical roadmap, but I’ve hit a wall with the hardware execution and recently lost access to my local prototyping lab. It's a lot to handle solo, and I’m looking for some technical guidance (or a partner) to help move the needle.

The Concept:

A wearable device (the "Stealth Band") that captures non-vocalized speech intent and uses an on-device AI inference engine to provide localized audio output.

Current Technical Targets:

  • Latency: Sub-100ms (crucial for natural conversation).
  • Connectivity: BLE 5.3 for high-fidelity streaming.
  • Sensors: Exploring multimodal sensor fusion using piezoelectric and MEMS technology to capture "silent" speech.
  • Processing: Edge AI/On-device inference to keep it fast and private.

Where I’m Stuck:

I need advice on optimizing the sensor fusion to filter out biogenic noise (swallowing, movement) while maintaining a high signal-to-noise ratio for the speech intent. I’m also looking for recommendations on low-power microcontrollers that can handle this level of Edge AI without becoming too bulky for a neck-based wearable.

Does anyone have experience with MEMS-based speech capture or low-latency audio hardware? I'd love to hear your thoughts on the most viable path forward for a solo dev moving from a lab environment to a home setup.

u/Kooky-Ball6382 — 7 days ago

We’re One Bug Away From Launching Our Voice AI Startup and Nobody Can Figure Out What’s Breaking

Hey everyone,

We’ve been building a real-time AI voice agent for the last 4 months and we’re finally in the final stages. The frustrating part is… the core experience works beautifully sometimes, and then completely falls apart the next moment.

Our stack right now:

  • STT: Deepgram
  • LLM: Groq using Llama 3.3 70B Versatile
  • TTS: Grok TTS

The issue:

  • Internally, the voice agent often works perfectly.
  • Low latency, smooth responses, natural conversation.
  • But the moment we ask external users to test it, the voice starts cracking, glitching, stuttering, or breaking randomly.
  • Sometimes it works flawlessly for them too… and then suddenly breaks again after a few interactions.

What’s driving us insane is the inconsistency.

We’ve checked:

  • Internet stability
  • Different devices
  • Browsers
  • Concurrency
  • Streaming logic
  • Buffering
  • Latency spikes
  • Sample rate mismatches (at least we think)

But we still cannot pinpoint the root cause.

At this point we genuinely don’t know whether:

  • the issue is in streaming architecture,
  • audio chunk handling,
  • WebRTC,
  • Groq response timing,
  • Deepgram streaming,
  • TTS buffering,
  • or some synchronization issue between all components.

Has anyone here faced similar “works internally but breaks for real users” problems in voice AI systems?

Would love:

  • debugging suggestions,
  • architecture advice,
  • common hidden issues,
  • monitoring ideas,
  • or even theories.

This one issue is literally blocking our launch right now.

u/Electronic_Argument6 — 12 days ago

Do you think this is useful?

I built an AI role-play system that integrates into Claude via MCP. It can take any skills or knowledge from the organization and build a custom role-play at the user's request, so I can train their sales team at scale based on the things they’re already focusing on.

My question: is this actually helpful, given the other companies out there like Hyperbound and the other players in this space? I haven’t seen many of them integrate via Claude, so I’m wondering if this is something people would actually find useful.

u/Complex_Report_356 — 13 days ago

Anyone using speech-to-text for Indian languages in production? What's actually working and what's not?

Marketing pages claim 90%+ accuracy on Hinglish. Reality from the teams I've talked to looks very different.

If you're using or have evaluated Indian-language STT for any use case (voicebots, call analytics, video KYC, transcription, voice search, etc.), I'd love to hear what you picked, why, and where it falls short.

Happy to share my learnings. Drop a comment or DM for a 30 min chat.

u/Spare-Ad2520 — 14 days ago