after 2.5 years running ~1k calls a day, here's the voice ai stack i'd actually pick today. llm, stt, tts, the whole thing
this is one i've been wanting to write for a while. every time someone asks "what should i use for llm/stt/tts" the honest answer is "depends what you're optimizing for" which is genuinely not useful to anyone trying to ship.
i've been running voice ai across a few hundred businesses, ~1k calls/day for 2.5 years. here's how i'd actually pick the stack today if i was starting from zero. fair warning that the space moves fast, so what's true in may is probably not true in august.
llm:
gpt-4.1 mini is my default for most voice agent loops right now. cheap enough, smart enough, low enough latency that the model basically disappears into the loop. its instruction following on long system prompts is what keeps me from migrating off.
gpt-4o mini still works. slightly faster, slightly worse at multi-turn context. fine for short flows.
groq is the fastest inference layer i've tested by a real margin. first-token latency feels unreal when you hear it. the catch is the open models running on it (llama, qwen) follow instructions less reliably than the openai stack on the exact same prompt. great for narrow agents. less great when the conversation gets messy.
people overthink this layer tbh. unless your agent is doing real reasoning, the gap between 4.1 mini and llama 3.3 on groq is mostly perceived latency, not capability. so pick speed unless you really need the reasoning.
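if you want to check the "perceived latency" point on your own prompts, here's a rough sketch of timing time-to-first-token on any streaming response. `time_to_first_token` and `fake_stream` are names i made up for illustration, and `fake_stream` is just a stand-in for whatever iterator your provider's sdk hands back. it's a measuring stick, not a benchmark.

```python
import time
from typing import Callable, Iterable, Iterator, List, Optional, Tuple


def time_to_first_token(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], float, str]:
    """Consume a token stream and return (ttft_seconds, total_seconds, text).

    ttft is None if the stream produced no tokens at all.
    """
    start = clock()
    first: Optional[float] = None
    parts: List[str] = []
    for token in stream:
        if first is None:
            first = clock()  # the moment the caller would start hearing something
        parts.append(token)
    end = clock()
    ttft = None if first is None else first - start
    return ttft, end - start, "".join(parts)


def fake_stream(tokens: List[str], first_delay: float, gap: float) -> Iterator[str]:
    """Hypothetical stand-in for a provider's streaming response."""
    time.sleep(first_delay)  # simulated wait before the first token arrives
    for token in tokens:
        yield token
        time.sleep(gap)  # simulated gap between tokens


if __name__ == "__main__":
    ttft, total, text = time_to_first_token(fake_stream(["sure,", " one", " sec"], 0.05, 0.01))
    print(f"ttft={ttft:.3f}s total={total:.3f}s text={text!r}")
```

run the same prompt through each candidate and compare ttft rather than total time. in a voice loop the first token is what the caller feels.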
stt:
deepgram is still my default. nova-3 handles accents well, streaming latency is competitive, and the tooling is mature.
openai's whisper is top tier on accuracy but its streaming endpoints lag behind deepgram's. fine for post-call transcription. i wouldn't put it in the live loop yet.
groq whisper is the fastest whisper deployment i've used. if you don't need deepgram's full streaming protocol, groq whisper is genuinely underrated.
stt is mostly a solved problem at this point. the real bugs aren't in transcription quality, they're in how your platform's streaming protocol talks to your turn-taking model. that's where the gnarly debugging happens.
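to make the turn-taking point concrete, here's a minimal sketch of the simplest possible endpointing rule: close the caller's turn when no stt event (interim or final) has arrived for some silence window. the `SttEvent` shape, `split_turns`, and the 0.7s default are all assumptions for illustration, not any platform's actual protocol. real stacks layer vad and smarter turn models on top of this.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class SttEvent:
    t: float        # seconds since call start when the event arrived
    text: str       # transcript text for this segment
    is_final: bool  # True once the stt engine has finalized the segment


def split_turns(events: Iterable[SttEvent], silence_s: float = 0.7) -> List[str]:
    """Group finalized segments into caller turns using a silence gap.

    Interim events carry no text into the turn, but they do reset the
    silence timer because they mean the caller is still talking.
    """
    turns: List[str] = []
    current: List[str] = []
    last_t: Optional[float] = None
    for ev in events:
        gap = 0.0 if last_t is None else ev.t - last_t
        if current and gap > silence_s:
            turns.append(" ".join(current))  # long silence: the turn is over
            current = []
        if ev.is_final:
            current.append(ev.text)
        last_t = ev.t
    if current:
        turns.append(" ".join(current))  # flush whatever is left at hangup
    return turns
```

most of the gnarly bugs show up as this window being wrong for the platform: too short and the agent interrupts, too long and it feels dead. replaying recorded event timelines through something like this is a cheap way to see which one you have.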
tts:
this is the layer callers actually perceive. nobody complains the llm sounds bad. they complain the voice sounds weird. so this is the layer i'd actually spend the most tuning time on.
elevenlabs flash 2.5 is the safe pick. voices sound right out of the box. the cost gets steep at scale, especially on enterprise tier, but it works.
cartesia sonic 3 is my favorite for price-to-quality right now. fast, voices are solid, cheaper per minute than 11labs. it still has some lingering edge cases on numbers and acronyms but the gap is closing.
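a common workaround for the numbers and acronyms edge cases, whatever tts you're on, is to normalize the text before it reaches the voice. below is a minimal sketch of that idea. `normalize_for_tts` is a hypothetical helper, the "4+ digits means read it digit by digit" rule is my assumption (think confirmation codes), and nothing here is part of any vendor's api.

```python
import re
from typing import Set

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def normalize_for_tts(text: str, acronyms: Set[str]) -> str:
    """Rewrite text so a tts engine is less likely to mangle it."""

    def spell_acronym(match: "re.Match[str]") -> str:
        word = match.group()
        # only touch acronyms we explicitly listed: "ETA" -> "E T A"
        return " ".join(word) if word in acronyms else word

    def spell_digits(match: "re.Match[str]") -> str:
        # read long digit runs one digit at a time: "4071" -> "four zero seven one"
        return " ".join(DIGIT_WORDS[int(d)] for d in match.group())

    text = re.sub(r"\b[A-Z]{2,}\b", spell_acronym, text)
    text = re.sub(r"\b\d{4,}\b", spell_digits, text)
    return text


if __name__ == "__main__":
    print(normalize_for_tts("your ETA code is 4071", {"ETA"}))
    # your E T A code is four zero seven one
```

keeping the acronym list explicit is deliberate. spelling out every all-caps word would wreck things like "ASAP" that callers expect to hear as a word.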
rime arcana is the most "human" sounding model i've heard in production. great for inbound where you really don't want the caller to feel like they're talking to a robot. it's a tick slower than cartesia or 11labs flash though.
sarvam is the only serious option for indian languages right now (hindi/tamil/telugu). for non-indian languages it's not worth the swap.
starting from zero today i'd go gpt-4.1 mini (or 4o mini for short flows) + deepgram nova-3 + cartesia sonic 3. swap in groq for narrow high-frequency agents. swap in elevenlabs flash 2.5 if the budget is there and the brand voice matters. swap in rime if "doesn't sound like a robot" is the top requirement.
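if it helps, the swap rules above fit in a tiny picker function. this is only a sketch: `Needs` and `pick_stack` are made-up names, the strings are labels and not real api model ids, and the order the tts swaps win in when several flags are set is my assumption, not something the rules above settle.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Needs:
    narrow_high_frequency: bool = False    # narrow agent, lots of short calls
    brand_voice_with_budget: bool = False  # brand voice matters and budget exists
    must_not_sound_robotic: bool = False   # "doesn't sound like a robot" is top priority
    indian_languages: bool = False         # hindi / tamil / telugu


def pick_stack(needs: Needs) -> Dict[str, str]:
    """Return human-readable labels for each layer (not api model ids)."""
    stack = {
        "llm": "gpt-4.1 mini",
        "stt": "deepgram nova-3",
        "tts": "cartesia sonic 3",
    }
    if needs.narrow_high_frequency:
        stack["llm"] = "llama 3.3 on groq"
    # assumed precedence: later checks override earlier ones
    if needs.brand_voice_with_budget:
        stack["tts"] = "elevenlabs flash 2.5"
    if needs.must_not_sound_robotic:
        stack["tts"] = "rime arcana"
    if needs.indian_languages:
        stack["tts"] = "sarvam"
    return stack


if __name__ == "__main__":
    print(pick_stack(Needs()))
    print(pick_stack(Needs(narrow_high_frequency=True, must_not_sound_robotic=True)))
```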
none of this would have been the right answer 6 months ago. ask me again in october.