r/AIVoice_Agents

▲ 3 r/AIVoice_Agents+1 crossposts

cant i make my own ai voice agent but the claude The ONE Place You Cannot Avoid Paying Phone calls. That's it. Just phone calls. To give a client a real phone number that customers can call, you need Twilio or similar. There is no legitimate free option for this anywhere in the world. Minimum cost

#voiceagent #ai is there any way

reddit.com
u/Wrong_Factor_4048 — 6 hours ago
▲ 6 r/AIVoice_Agents+2 crossposts

Voice AI biggest unsolved challenges

What do you think are the biggest unsolved challenges in Voice AI that almost nobody is seriously working on right now?

Not “better ASR” or “lower latency” but deeper problems that could define the next generation of voice products/research.

Examples:
- Real-time conversational memory that actually feels human
- Emotion + intent understanding beyond sentiment analysis
- Interruptions/turn-taking that feel natural
- Voice-native UX instead of “ChatGPT but spoken”
- Long-term personalization without being creepy
- Multilingual/code-switching conversations
- Continuous ambient agents
- Social/companion dynamics
- Voice AI for kids/elderly/accessibility
- Real-time multimodal understanding (voice + environment + context)

Curious what people building/using Voice AI think is still fundamentally broken or missing.

reddit.com
▲ 19 r/AIVoice_Agents+13 crossposts

How do you actually test a voice AI agent without calling it yourself every time?

So we've been working on a voice bot that handles customer calls and honestly the testing part has been brutal. We were literally calling the thing ourselves to check if it broke after every change.

Eventually we just wrote a framework that synthesizes fake caller audio, pipes it into the agent, and checks if the response is sane — latency, hallucinations, whether it handles interruptions, etc. Runs locally against a SQLite db, no cloud stuff.

It connects over websockets, can mock twilio streams, works with elevenlabs and vapi agents too. You can also plug in ollama as the judge so the whole thing runs offline.

We open sourced it: https://github.com/unforkopensource-org/decibench

Curious how others here handle this. Are you just vibing and hoping production doesn't break or is there a better workflow I'm missing?

u/Tricky_School_4613 — 3 days ago
▲ 2 r/AIVoice_Agents+1 crossposts

I built a voice AI that triages medical emergencies over the phone, call it and try to break it

I built FirstCall, a voice AI you call during a medical emergency. Describe what's happening in plain English, and it triages severity and walks you through first aid step-by-step. No app, no account, just call.

How it works:

- You call and describe the emergency in plain language

- It classifies severity: routine → urgent → critical

- For life-threatening situations (cardiac arrest, stroke, severe bleeding, choking, etc.) it tells you to call 911 first, then stays on the line and guides you through first aid while you wait

- For less critical situations, it walks you through first aid step-by-step

---

📞 Call it: +1 782-802-0868 — US/Canada, regular phone call, free, no signup

---

I want you to stress test it. Try weird edge cases, vague descriptions, multiple symptoms at once, or anything you think might trip it up.

Specifically curious about:

- Does the severity triage feel right?

- Does it escalate to 911 correctly — not too aggressively, not too timidly?

- Are the first aid instructions clear enough to actually follow in a panic?

- Any responses that felt wrong, scary, or confusing?

Happy to answer questions about the build.

---

⚠️ Prototype — not a substitute for 911. Real emergency? Call 911.

reddit.com
u/flyingadansonii — 6 days ago
▲ 5 r/AIVoice_Agents+3 crossposts

I kept blanking during technical interviews so I built an AI that listens to calls and answers questions in real time — fully open source, works with local LLMs too

Long story short: I bombed a few interviews because my brain

decided to go on vacation right when I needed it most.

I'd know the answer 10 minutes after the call ended. Classic.

So instead of just practicing more (boring), I built

Helply — a desktop AI meeting assistant that:

- Listens to your call via mic

- Lets you type or speak questions

- Gets you AI answers instantly, right on your screen

- Hides itself before you screenshare (one toggle)

The part I'm most proud of: it works with literally any

LLM backend.

Cloud: OpenAI, Groq, Anthropic

Local/Private: Ollama, LM Studio

Custom: any OpenAI-compatible endpoint (vLLM, llama.cpp,

OpenRouter, LocalAI, your own server, whatever)

If you use Ollama locally — nothing leaves your machine.

Full privacy. No API costs. No cloud dependency.

You can also add your resume and the job description to

the settings, and it tailors every answer to your actual

experience and the role. Which is wild.

It's built with Node.js + Electron, runs on Windows/macOS/

Linux, and is MIT licensed.

I'm not selling anything. Just built something I needed and

figured others might too.

GitHub: github.com/PIYUSH-MISHRA-00/Helply

Would love feedback — especially from people who've tried

similar tools or have opinions on the local LLM setup.

What backend are you running if you try it?

u/Consistent-Ruin1868 — 8 days ago
▲ 11 r/AIVoice_Agents+7 crossposts

Three bots in a trenchcoat is not omnichannel

Self-serve is exciting. Genuinely. But if I am honest, it is not the most interesting thing about 13 May.

The most interesting thing is that we have been quietly running architecture that the rest of the industry is only just figuring out exists.

A competitor recently launched real-time SMS ingestion. The coverage was breathless. Everyone lost it. So innovative. Revolutionary. Game-changing.

Me? I looked at our codebase and thought: "SMS ingestion. Wow. That is so 2025."

Here is what we actually built, and have been running in production for the better part of a year.

Mid-voice-call, Elba texts a short URL to the caller. The caller fills out a form on their phone. The structured data comes back into the live call via RPC. The workflow receives clean JSON. The voice call never paused. The agent never lost session state. The caller submitted a form while still talking and the agent acted on it in the same conversational turn.

That is not SMS ingestion. That is a bidirectional channel bridge inside a single active session. Sending an SMS during a call is not new. Getting structured data back into the active session in real time without dropping state on either side - that is the part nobody else has shipped.

And it sits on top of something even more fundamental.

Most "omnichannel AI" are three bots in a trench coat. A voice agent, a WhatsApp bot, a webchat widget, all pointing at the same CRM row and calling it unified. Each with its own prompt, its own config, its own version history, its own failure modes.

Elba is one agent. One workflow. One memory layer. Voice, WhatsApp, SMS, email and webchat all running through the same execution engine. Not copies. Not synced versions. The same agent, same logic, same memory, regardless of which channel the conversation arrived on. Deployments are atomic - every channel switches to the new workflow version in the same transaction. No drift. No "did the WhatsApp bot get the update" incident. One audit trail.

When a regulated enterprise customer asks what exactly their AI told a customer across every channel and every session for the past six months, we have a single clean answer.

The competition is announcing SMS ingestion and calling it a breakthrough.

We are launching self-serve on 13 May and already cooking the next thing. We may have put it on hold until after the launch. Our tech never sleeps though.

If you want an agent that actually knows who it is talking to across every channel and every session: self-serve opens 13 May at www.kolsetu.com.

Full technical writeup: https://www.kolsetu.com/blog/the-architecture-nobody-else-built

u/EdikTheFurry — 10 days ago
▲ 2 r/AIVoice_Agents+5 crossposts

Running Claude Opus for free? I thought it was a scam until I tried it.

Hey everyone,

​I’ve been working on a financial audit system (IntegrityOps) for a while now, and to be honest, I was hitting a massive wall. Dealing with high-volume PDFs and images was draining my budget. Between OpenAI and Anthropic, the API costs were becoming a nightmare for a solo builder.

​Yesterday, I was about to give up on using high-end models like Claude Opus because I couldn't justify the cost during the testing phase. But then, I stumbled upon a way to get $125 in free credits on a multi-model router.

​I honestly didn't think it would work, or that it would be some limited trial, but it gave me full access to everything—Claude 4.6, DeepSeek, and even ALM 5.1—all in one place without even asking for a credit card up front.

​It completely changed my workflow. Now I can test my automations without staring at my bank balance every 5 minutes.

​If any of you are struggling with the same 'API burnout' or just want to test these heavy models for free while building, I'd be happy to share my experience or show you how I set it up. We builders have to stick together!

reddit.com
u/GabriellaAmaya — 12 days ago

We’ve built what is essentially a full real-time telephony conversational operating system, not just a chatbot, and we’re trying to diagnose where our biggest failures actually are.

What we built:

A live voice pipeline for outbound/inbound calls:

Telephony (8kHz µ-law) → PCM decode → VAD → Silence thresholds → Echo suppression / AEC → STT (Deepgram/Groq/Sarvam) → Validation / hallucination filters → State machine → LLM (Groq LLaMA) → TTS (Grok) → Playback

Current capabilities:

Real-time Hindi + Hinglish support

Sales / lead-gen / support agents

Silero VAD

Deepgram Nova-3 primary STT

Groq LLaMA 3.x

Grok TTS

Barge-in

Sentence streaming

TTS cache

Carrier suppression

Hallucination filtering

Hindi grammar / transliteration optimization

Pipecat-style orchestration

FAISS RAG

The problem:

Users often feel like:

“The AI forgot what I said”

or

“It stopped responding”

or

“It heard me but replied weirdly”

But from logs, the LLM itself is often fine.

What we’re seeing:

STT:

Hindi strong

Hinglish moderate

Brand/model names weak

Short acknowledgements (“haan”, “ji”) vulnerable

Some blank transcripts / segmentation misses

TTS:

Biggest bottleneck

1.1–2.4s latency

“Response ended prematurely”

Long Hindi promotional lines degrade badly

Pipeline suspicion:

We may have over-engineered thresholds:

VAD

RMS gates

Silence windows

Echo suppression

Carrier suppression

Hallucination filtering

Confidence thresholds

Our current hypothesis:

This may not be a memory problem.

It may be a pipeline integrity problem where user intent is getting:

Clipped before STT

Mis-segmented

Filtered out

Suppressed during state transitions

Corrupted before conversational memory ever forms

Example:

Caller says a short Hindi response during suppression or barge-in window → speech never becomes canonical transcript → LLM never truly receives it → AI appears forgetful.

Questions for people who’ve built production voice stacks:

  1. Where do advanced telephony systems most commonly lose conversational fidelity?

VAD?

Endpointing?

Suppression windows?

STT confidence gates?

State machine transitions?

  1. For Hindi/Hinglish specifically:

How are people handling:

Short acknowledgements

Code-switching

Brand names

Telecom narrowband degradation?

  1. Would you simplify the stack?

Are we harming reliability by stacking too many protections before STT?

  1. TTS:

Would you prioritize:

Faster lower-quality speech

Smaller sentence chunks

Interruptibility

over polished voice quality?

  1. Architecture:

At what point does “production safety” become “signal destruction”?

Brutal honesty welcome:

If this architecture sounds overbuilt, fragile, or fundamentally mis-prioritized, I’d genuinely love to hear it.

We’re trying to move from:

“Smart AI on a fragile phone line”

to:

“Reliable conversational telecom system”

Right now it feels like our AI may actually be smarter than the user experience — but too much user intent dies before intelligence can act.

Would really appreciate insights from:

Voice AI engineers

Contact center architects

Telecom DSP people

Deepgram / Whisper / Pipecat builders

Hindi ASR/TTS teams

Thanks — looking for architecture-level criticism, not just model suggestions.

reddit.com
u/Electronic_Argument6 — 9 days ago

Why do most AI voice agents still sound robotic even in 2026?

I’ve been building voice AI agents for businesses at Vomyra for quite some time now, and one thing we noticed early was this:

Most people don’t actually care which AI model you’re using.

They care about one thing:

“Does it feel natural?”

And honestly… most AI voice agents still sound robotic.

Not because the technology is bad.

But because real conversations are imperfect.

Humans:

pause while thinking

breathe between sentences

whisper sometimes

laugh unexpectedly

change tone based on emotion

Most AI systems only focus on words.

Very few focus on conversation behavior.

Over the last few months we tested multiple TTS engines like:

ElevenLabs

Cartesia

xAI voices

Voxtral and more for real-world customer calls.

Some had amazing voice quality.

Some had ultra-low latency.

Some handled emotions better.

Some worked better for Indian languages like Hindi, Tamil, Telugu, Kannada etc.

But the biggest learning was:

The moment AI starts sounding less perfect… it actually starts sounding more human.

We recently started adding:

natural pauses

breathing

whispering

emotional tone shifts

human-like conversation flow

And customer reactions changed instantly.

People stopped asking:

“Is this AI?”

Instead they started saying:

“This actually feels real.”

Curious to know:

What makes an AI voice sound robotic to you?

latency?

monotone speech?

wrong emotions?

unnatural pauses?

pronunciation?

over-politeness?

Would love to hear real experiences from people using voice AI tools daily.

#VoiceAI #ConversationalAI #TextToSpeech #AI #ElevenLabs #Cartesia #OpenAI #AIvoice

reddit.com
u/Sumit-Voiceman — 13 days ago

Voice AI for non-English speakers, what's actually working in production

Curious what use cases people here have seen succeed with multilingual voice agents. From what I've been reading, healthcare appointment booking in Spanish is basically a solved problem at this, point, and Arabic is close behind, still some accent inconsistencies depending on dialect, but viable enough for production. Debt and EMI reminders in regional dialects also seem to be doing really well, especially across South Asian languages like Hindi, Tamil, Telugu, and Bengali. The hyperlocal support has gotten surprisingly deep. The auto language detection stuff is what I keep coming back to. Detecting from the first couple words and switching mid-call is pretty standard now, and the better systems are handling code-switching too, like someone bouncing between Hindi and English in the same sentence, which is just how a lot of people actually talk. That feels like a genuinely hard problem that's quietly gotten a lot better. I work with a lot of non-native English speakers and the trust angle is something I think about constantly. When someone is navigating healthcare or a loan reminder in their second language, the cognitive load is already high. A voice that sounds native to them, even if it's clearly an AI, probably changes the whole dynamic in ways that are hard to measure but really matter. There's also something to be said for culturally adapted prompts, not just the language but the framing and tone. Curious if anyone here has actually built or deployed something in this space. What was the hardest part to get right? Accent handling, latency, something else entirely?

reddit.com
u/noechuvi — 10 days ago
▲ 11 r/AIVoice_Agents+4 crossposts

The return path nobody built

A few days ago I posted about why most "omnichannel AI" is three bots in a trenchcoat. One agent, one memory layer, one execution engine across voice, WhatsApp, SMS, email and webchat. If you missed it, short version: what the industry calls unified is usually three separate configurations pointing at the same CRM row and hoping nobody looks too closely.

Today I want to go one layer deeper. Because the single-agent architecture is not just cleaner operationally. It enables something that no other platform has shipped.

Here is the problem every voice AI system has and nobody talks about honestly.

Structured data collection over voice is unreliable. Alphanumeric strings - vehicle registrations, policy reference numbers, membership IDs - get transcribed wrong at a rate that matters in production. One wrong character in a registration fails a lookup. A mishearing in a policy number causes a downstream processing failure that someone has to fix manually. Production systems either flag everything for human review or quietly accept the errors and clean up after themselves. Neither is a solution.

The alternative is deferring collection to a post-call follow-up. The call ends without the data. A second interaction is required. In emergency services, insurance intake, or patient triage, that is not a workflow step. That is an operational failure.

We did not accept either of these.

When the agent reaches a data collection node in the workflow, it sends a single SMS to the caller. The caller, who is still on the call, opens the URL on their phone. A dynamic form renders with exactly the fields the agent needs. The caller fills it in and submits. The structured JSON payload is returned to the active call session via LiveKit RPC. The workflow receives the payload and continues. The call never paused. The agent never lost session state.

Now here is the part that does not exist anywhere else.

Every other platform that sends an SMS during a call sends it outbound. A confirmation, a receipt, a link. The SMS departs the session. The call and the message are separate interactions from that point. There is no return path. Data flows one direction.

What we built is a bidirectional channel bridge inside a single active session. The SMS is an ingestion pipe. The form submission is an RPC call into the live session that the agent is actively listening for. The agent holds the workflow at the data collection node, waits for the return, receives the payload, and continues. All of this while the call is live.

The technical implementation: the short URL resolves via GraphQL and AppSync with connection state bound to the active session ID, so the form submission knows exactly which running instance to deliver the payload to. LiveKit RPC handles the return path with the session remaining open throughout. Connection state handling covers disconnection and retry so a brief signal drop does not orphan the session.

This only works because there is one session underneath all of it. A voice call, an SMS form submission, a WhatsApp message, a webchat interaction - they all feed the same stateful session. If you have three separate bots, there is no session to return the data to. You are firing a webhook into a void and hoping something picks it up after the call ends.

The previous architecture, which is still what most platforms use today, required one SMS per field. Five fields, ten asynchronous exchanges, call long over before collection completes. We replaced this in February 2026 with the single-form RPC architecture.

In production this was stress-tested in roadside assistance. A stranded caller. The agent needs a vehicle registration, a membership number, and a location reference. Over voice, the registration can take three to five exchanges and still produces errors. Post-call collection means the dispatcher works without confirmed vehicle details while the caller waits. With in-session RPC: one SMS, one form, all data collected in under thirty seconds, structured payload delivered before the call ends and without errors. The dispatcher has confirmed data. No callback needed. Single session, start to finish.

Sending an SMS during a call is not the hard part. The hard part is binding a form submission on a second device to an active session on a different channel, delivering the payload in real time, and having the agent act on it within the same conversational turn.

That is the part we built. Nearly a year ago. While the industry was still announcing SMS ingestion as a breakthrough.

Full writeup: https://www.kolsetu.com/blog/the-return-path-nobody-built

reddit.com
u/EdikTheFurry — 11 days ago