r/TextToSpeech

▲ 5 r/TextToSpeech+2 crossposts

Voice AI biggest unsolved challenges

What do you think are the biggest unsolved challenges in Voice AI that almost nobody is seriously working on right now?

Not “better ASR” or “lower latency” but deeper problems that could define the next generation of voice products/research.

Examples:
- Real-time conversational memory that actually feels human
- Emotion + intent understanding beyond sentiment analysis
- Interruptions/turn-taking that feel natural
- Voice-native UX instead of “ChatGPT but spoken”
- Long-term personalization without being creepy
- Multilingual/code-switching conversations
- Continuous ambient agents
- Social/companion dynamics
- Voice AI for kids/elderly/accessibility
- Real-time multimodal understanding (voice + environment + context)

Curious what people building/using Voice AI think is still fundamentally broken or missing.

reddit.com
u/Legal_Wolverine_7267 — 21 hours ago

Why do emotional AI voices still feel hard to control across longer scripts?

I’ve tried a few emotional TTS tools recently (including Noiz.ai and others).

What I notice is:

they often sound great in short sentences, but once you move into longer narration, the tone becomes less consistent.

It feels like we’re still missing a “director layer” for AI voice control.

Is this just a limitation of current models, or are there tools that actually handle long-form emotional consistency well?

u/Obvious_kirby — 23 hours ago

ElevenLabs SynthID and YT

Google just announced that ElevenLabs will support Google’s SynthID. Do you think this makes it a good choice for YT voice-overs? How does it compare with Google’s TTS solution? I’m worried about using TTS if YT eventually rules against it and my videos get demonetized.


Why does long-form AI narration still feel worse than a real audiobook?

I’ve been experimenting with turning long-form text into audiobook-style audio, and one thing surprised me:
The voice model is usually not the main problem.
A lot of bad long-form AI narration comes from the source text itself. Ebooks, blog posts, PDFs, and scripts are written for reading, not listening. Once they become audio, small things become very obvious:
- table of contents text gets read aloud
- footnotes interrupt the sentence
- repeated headers or page numbers sound strange
- long sentences are harder to follow
- dialogue becomes confusing when you can’t see line breaks
- chapter transitions need to be heard, not just seen
The biggest lesson for me is that “text to speech” and “audiobook production” are not the same workflow.
For short text, you can paste and generate.
For long-form content, the better workflow seems to be:

  1. clean the source text
  2. test a 500–1,000 word sample
  3. listen for pacing, pronunciation, dialogue, and structure
  4. fix the text
  5. then generate the full chapter
I’m building a small tool around this workflow, but I’m mostly interested in the workflow problem itself.
For people who use TTS for long-form content: do you clean the text first, or generate first and fix problems afterward?
For context, I’m testing this idea at https://audiobookgenerator.net/ but the main question is whether this sample-first workflow actually matches how people handle long-form TTS.
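
A minimal sketch of the "clean the source text" and "test a sample" steps from the workflow above, using only the Python standard library. The function names, the bracketed-number footnote pattern, and the repeat threshold are assumptions for illustration; they are not taken from the tool linked in the post, and real ebooks will need their own rules.

```python
import re
from collections import Counter


def clean_for_audio(text, min_repeats=3, max_header_words=8):
    """Drop lines that make sense on a page but not in audio."""
    lines = [ln.strip() for ln in text.splitlines()]
    counts = Counter(ln for ln in lines if ln)
    kept = []
    for ln in lines:
        # Bare page numbers such as "12" or "Page 13".
        if re.fullmatch(r"(page\s+)?\d+", ln, flags=re.IGNORECASE):
            continue
        # Short lines repeated many times are treated as running headers.
        # Caveat: this can also hit short repeated dialogue lines.
        if ln and counts[ln] >= min_repeats and len(ln.split()) <= max_header_words:
            continue
        # Inline footnote markers like "[12]" (assumed marker style).
        ln = re.sub(r"\s*\[\d+\]", "", ln)
        kept.append(ln)
    return "\n".join(kept)


def sample_words(text, max_words=1000):
    """Return the first max_words words as a listening sample."""
    return " ".join(text.split()[:max_words])
```

A typical use would be `sample_words(clean_for_audio(raw_text), 750)`, generating audio only for that sample, then adjusting the cleaning rules before committing to the full chapter. Note that `sample_words` flattens paragraph breaks, which is acceptable for a pacing check but not for final output.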
u/BasisRoutine6228 — 2 days ago
▲ 6 r/TextToSpeech+1 crossposts

I’ve been testing AI voice tools for short storytelling… emotional TTS is still inconsistent

I’ve been using a few AI voice tools recently for short-form storytelling content (TikTok / YouTube Shorts).

Noiz.ai stood out because it actually adds emotional tone, which most TTS tools don’t really do well.

But I’m running into a problem:

sometimes the same script comes out very different emotionally depending on the generation.

One time it sounds perfect, another time it feels over-dramatic or slightly off in pacing.

Curious if others here have found a way to control emotional consistency better in AI voices?

u/Obvious_kirby — 2 days ago

What's the best free TTS currently

I recently started with YT automation and I need a solid voice that doesn't sound robotic.
No voice cloning needed.

u/A001S — 3 days ago
▲ 2 r/TextToSpeech+1 crossposts

As of May 2026, LongCat DiT 3.5B and Moss TTS 8B are the best SOTA TTS models, and Qwen TTS is not even close

[Disclaimer: I am deliberately leaving out Fish Audio S2 Pro because it's not a truly open-source model (non-commercial license).]

For context, I asked several AIs for the best TTS model right now, and most of them said Qwen 3 TTS, Voxtral, etc. Almost none of them mentioned LongCat TTS, and a few mentioned the smaller Moss TTS versions but not the main 8B version.

And the stupid LongCat team didn't even add the text-to-speech tag to their Hugging Face repo, so it's hard to discover.

I am writing this because both of these models are heavily underrated for no reason 😑

My ranking: #1 LongCat DiT 3.5B, #2 Moss TTS 8B.

Here are voice-cloning samples from both models (the real voice is also provided): https://github.com/9r4n4y/Voice-samples

If you want to test them right now:

For LongCat - https://huggingface.co/spaces/hysts/LongCat-AudioDiT-3.5B

For Moss TTS - https://studio.mosi.cn/

u/9r4n4y — 3 days ago

Which Paid TTS platform actually gives the MOST usable audio hours for the LOWEST monthly price?

I’m currently mapping out a long-form audio project (roughly 20–30 hours of total runtime), and I am hitting a massive wall trying to figure out the actual ROI of different paid TTS platforms.

ElevenLabs has amazing quality — probably still the most natural voices overall, but for high-volume, long-form content, it is just completely cost-prohibitive for me right now. I don't need a million ultra-premium cinematic emotional whispers; I just need solid, natural-sounding, highly consistent narration that won't require 15 re-generations (which burns through quotas like crazy).

I’ve been doing some deep dives and found that:

  • The Character-to-Hour Conversion Trap: A lot of platforms price at "$X per 1M characters." On paper, 1 million characters sounds like an encyclopedia. In reality, that’s only about 11 to 14 hours of generated audio depending on pacing.
  • The Re-generation Tax: If a budget tool sounds robotic 30% of the time, and I have to re-generate paragraphs to fix the glitchy/distorted artifacts, a "generous" monthly quota suddenly gets cut in half.
  • API vs. Dashboard Pricing: I noticed OpenAI’s TTS standard API runs around $15 per 1M characters (roughly $1.15 to $1.30 per hour of audio), and their new GPT-4o-mini audio output is dirt cheap at around $0.015 per minute, but the workflow is clunky for a non-coder like me who just wants to paste a script.
  • Front-End Apps: I've seen folks mention tools like Podcastle AI (allegedly much cheaper than ElevenLabs for long-form), and Audiobookify (no-subscription models), but I'm wary of hidden limits or sudden voice drop-offs during long scripts.
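
The conversions in the list above can be checked with a few lines of arithmetic. This is a rough sketch, not a pricing calculator: the function names are made up for illustration, the hours-per-million-characters figure is the post's own 11 to 14 hour estimate, and the retry model assumes every re-generation fails at the same rate as the first attempt.

```python
def cost_per_audio_hour(price_per_million_chars, hours_per_million_chars, regen_rate=0.0):
    """Effective dollars per usable hour of narration.

    regen_rate is the fraction of generated audio that gets thrown away
    and re-generated (assumed constant across retries).
    """
    if not 0.0 <= regen_rate < 1.0:
        raise ValueError("regen_rate must be in [0, 1)")
    base = price_per_million_chars / hours_per_million_chars
    # On average, one usable hour needs 1 / (1 - regen_rate) generated hours.
    return base / (1.0 - regen_rate)


def per_minute_to_per_hour(price_per_minute):
    """Convert per-minute audio pricing to dollars per hour."""
    return price_per_minute * 60
```

With the post's numbers, $15 per 1M characters works out to roughly $1.07 to $1.36 per hour across the 14 to 11 hour range, and $0.015 per minute is about $0.90 per hour. Under this simple retry model, a 30% discard rate leaves about 70% of a quota usable; a quota that is truly "cut in half" would correspond to discarding about half of what is generated.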

My Question: If your main metric is strictly Most audio hours generated per dollar spent, what paid platform or wrapper are you actually happy to subscribe to?

Would love to hear from anyone running high-volume channels or doing audiobook narration. Which platform felt affordable at first… until real production work started?

u/Luca_Tangen — 3 days ago

piper tts onnx model for korean

I tried the Sherpa Korean model, and the quality is very low. Does anyone know of a custom ONNX model for Korean? It needs to be ONNX because I'm working on a project that has to run on a standard mobile phone, where resources are rather limited.

u/ForbidenSugar — 3 days ago
▲ 38 r/TextToSpeech+2 crossposts

Benchmarked Kokoro 82M vs Supertonic 3 TTS on CPU

Wanted a real head to head on the two TTS models that actually run well on CPU. Couldn't find one with proper numbers, so I ran one. Posting because the result was not what I expected going in.

Quick context for anyone who hasn't seen Supertonic 3 yet: it's a flow-matching TTS where you can dial down inference steps to trade quality for speed. Default is 5 steps, "speed mode" is 2. Kokoro 82M everyone here knows by now.

Hardware: AMD EPYC 7763, 4 vCPUs, 16GB RAM, no GPU. Roughly comparable to a Ryzen 5600 or a decent N100 box.

Setup: 6 text lengths from 12 chars to 1712 chars, 5 runs each, 120 timed runs total. CUDA explicitly disabled. Warmup run discarded.

Mean RTF (lower is faster):

  • Supertonic 3, 2 steps: 0.165 (6.1x realtime)
  • Supertonic 3, 5 steps: 0.313 (3.2x realtime)
  • Kokoro 82M PyTorch: 0.469 (2.1x realtime)
  • Kokoro 82M ONNX: 0.509 (2.0x realtime)
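
For anyone reproducing these figures, a minimal sketch of how RTF can be measured. It is not the harness from the repo: `synthesize` is a hypothetical callable that takes text and returns a sequence of audio samples, and the warmup and run counts simply mirror the setup described above.

```python
import time
from statistics import mean


def real_time_factor(synth_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the audio produced."""
    return synth_seconds / audio_seconds


def benchmark(synthesize, text, sample_rate, runs=5, warmup=1):
    """Mean RTF of synthesize(text) over several timed runs.

    synthesize is a placeholder for whatever TTS call is being tested.
    """
    for _ in range(warmup):
        synthesize(text)  # warmup run, result discarded
    rtfs = []
    for _ in range(runs):
        start = time.perf_counter()
        samples = synthesize(text)
        elapsed = time.perf_counter() - start
        rtfs.append(real_time_factor(elapsed, len(samples) / sample_rate))
    return mean(rtfs)
```

The "x realtime" figures are just the reciprocal of RTF, so 0.165 corresponds to about 6.1x. Likewise, 1.82 s of synthesis for about 13 s of audio is an RTF of 0.14.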

Wall-clock latency on the medium text (196 chars, about 13 seconds of audio):

  • Supertonic 2-step: 1.82s
  • Supertonic 5-step: 3.67s
  • Kokoro PyTorch: 5.62s
  • Kokoro ONNX: 5.51s

Results for the long and extended texts are in the GitHub repo below.

Throughput in chars per second at steady state: Supertonic 2-step gets to ~111, Supertonic 5-step ~55, Kokoro hovers around 33 to 36 regardless of backend.

The quality side, which actually flips the ranking:

Supertonic at 2 steps is fast, but the audio is rough. Words slur, prosody is mechanical, not something I'd ship. At 5 steps it cleans up a lot and is genuinely usable. Kokoro at either backend still produces the most natural speech of anything I've tested in this size class. It's #1 on the TTS Arena leaderboard for a reason.

So the practical ranking is more like:

  • Want it to sound like a human → Kokoro, accept the slower speed
  • Want low latency for an assistant/chatbot → Supertonic 5-step is the sweet spot
  • Supertonic 2-step → demos and prototyping, that's it

Two things that surprised me:

  1. Kokoro ONNX was slower than PyTorch on this CPU. I expected the opposite. ONNX wins on the longer texts but loses on tiny ones because of higher fixed overhead. Worth retesting on Intel hardware to see if it's an AMD thing.
  2. Supertonic has way more fixed per-call overhead than Kokoro. RTF on tiny text is 0.30, on medium it drops to 0.13. Kokoro is much flatter across lengths. So if your workload is lots of short utterances, the gap between them narrows.
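
One way to make the fixed-overhead point concrete is to fit latency as `overhead + cost_per_second * audio_seconds` with ordinary least squares. This is a sketch under that linear assumption; the function name is invented here and the numbers in the usage note are made up to show the shape of the effect, not taken from the benchmark.

```python
def fit_overhead(audio_seconds, latencies):
    """Least-squares fit of latency = overhead + slope * audio_seconds.

    Returns (overhead, slope). Needs at least two distinct durations.
    """
    n = len(audio_seconds)
    mean_x = sum(audio_seconds) / n
    mean_y = sum(latencies) / n
    sxx = sum((x - mean_x) ** 2 for x in audio_seconds)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(audio_seconds, latencies))
    slope = sxy / sxx
    overhead = mean_y - slope * mean_x
    return overhead, slope


def predicted_rtf(overhead, slope, audio_seconds):
    """RTF implied by the linear model at a given audio duration."""
    return (overhead + slope * audio_seconds) / audio_seconds
```

With a hypothetical 0.2 s overhead and 0.1 s of compute per second of audio, the model gives an RTF of 0.30 for a 1 s clip and about 0.115 for a 13 s clip, which is the same pattern as a model whose RTF falls sharply as text gets longer.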

The detailed write-up and the GitHub repo, with all 24 audio samples and the benchmarks, are linked in the comments below 👇

This evaluation of both TTS models was performed using Neo AI Engineer, which built the eval harness, handled model runtime issues, and consolidated the results. I reviewed everything manually.

If anyone has an N100 or a Pi 5 lying around and runs this, I'd love to see the numbers. That's the tier I actually want to deploy on.

u/gvij — 4 days ago

TTS Site that has Dupes of Dallas and Wiseguy?

Lazypy.ro is basically dead, Cepstral is broken, and VoiceForge forces you to pay. Is there a site or app that has duplicates of those voices for free?

u/Sammysin00 — 3 days ago
▲ 4 r/TextToSpeech+2 crossposts

The text to speech function does not work any more?!

Hi all,

I've sent emails to the Google AI Studio feedback team, but they have not answered. Am I the only one who has problems with the text-to-speech function?

It used to work like a charm but for the last 10 days it does not. The same prompts and workflow that previously generated good audio now either fail, produce no usable output, or behave differently than before.

u/Disastrous_Drama_967 — 3 days ago

TTS for school

Hi, I’m a student who has both ADHD and dyslexia. This makes it incredibly difficult for me to read long texts: I get bored easily, and the fact that reading is a challenge doesn’t make it any better. So when I found myself needing to read a 55-page article for a class, I sadly started procrastinating. For my brain it is an impossible challenge to read that much text in what is now two days. So I’m searching for a good TTS website. However, I am very much against generative AI. I understand that TTS algorithms need AI to function, but it would be great to find a non-LLM website. I do not want to destroy the climate for my education. And yes, I do not have a lot of knowledge about AI. Most of what I know is from friends, family, or TikTok. If you don’t have any issue with generative AI, please do not interact with this post. I have my personal opinion on generative AI and will not interact with it. If you are also against generative AI and know of a non-LLM TTS, please let me know.

u/Silent_Response4162 — 4 days ago
▲ 21 r/TextToSpeech+1 crossposts

Supertonic TTS (Android) now live on Play Store and F-Droid

This is just an appreciation post for all the beta testers who helped push this app to production on the Play Store. Couldn't have done it without you guys.

The F-Droid version is slightly behind; I'll update it soon.

Please test it on more devices like Android Auto, Chromebook or wherever else you use to listen to ebooks.

Play Store - https://play.google.com/store/apps/details?id=com.brahmadeo.supertonic.tts

F-Droid - https://f-droid.org/packages/com.brahmadeo.supertonic.tts

Please share it with anyone who might want something like this. Thanks again.

u/Brahmadeo — 5 days ago

Are there any good, easy-to-download free voice cloners?

I'm not very good with Python, and a lot of them seem to require Python to download or install.

u/ollietron3 — 4 days ago
▲ 34 r/TextToSpeech+16 crossposts

Audio Option

As I commute long hours on a daily basis, I would like to stay informed about what is happening on Reddit; however, I am unable to read while driving. Perhaps Reddit could provide an audio option that reads updates aloud while driving, which I believe could significantly increase the DAU metric.

u/ahabest — 5 days ago
▲ 2 r/TextToSpeech+1 crossposts

Help with using cloned voice from Chatterbox?

I'm running Llama.cpp with OpenWebUI, and have installed Chatterbox TTS to handle the voice side of it, because I really wanted to use a cloned voice locally. I've been at this for hours, first trying OpenedAI for TTS, and now Chatterbox. I'm relatively new to Linux, Python, etc.

Here's what I've got working:

OpenWebUI sees and uses the model running on Llama.cpp.

The Chatterbox frontend works as expected: type text, click the button, get output in the cloned voice. It's a little slow, but that's likely because it's not using my RX 580 for inference. Old card, I know.

Here's what looks sketchy to me:

Putting the backend's address into the browser shows a terminal-like window that simply says: detail: "Not Found".

The actual error:

When I try to generate TTS in OpenWebUI, it claims that the voice I've pointed to (correctly, according to everything I've found) doesn't exist, then tells me to use one of the included voices.
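
One way to narrow this down is to call the backend directly, without OpenWebUI in between. The sketch below assumes the Chatterbox backend exposes an OpenAI-style `POST /v1/audio/speech` route; that is an assumption about this particular setup, and the base URL, port, model name, and voice name shown are placeholders. A `{"detail": "Not Found"}` body at the root address is the default 404 response of FastAPI-style servers and usually only means that no page is served at `/`, not that the API is down.

```python
import json
import urllib.request


def build_speech_request(base_url, voice, text, model="tts-1"):
    """Build an OpenAI-style text-to-speech request (not sent yet).

    base_url, voice and model are placeholders; use the values the
    backend actually reports or documents.
    """
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it with `urllib.request.urlopen(build_speech_request("http://localhost:8004", "my_voice.wav", "test"))` and reading the response or error body shows whether the backend itself rejects the voice name. If it does, the name or file extension the backend expects probably differs from what was entered in OpenWebUI.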

What I'm running (old hardware, I know):

AMD Ryzen 5 2600
32GB RAM
RX 580 8G VRAM
OS: Ubuntu 26.04 LTS

u/BrokeBoyFresh — 4 days ago

Voice emotions for cloned voice

I'm using Qwen TTS, and I create my own voice models. Then I use the audio to clone the voice and narrate text.

The only problem: I can't get emotions in a cloned voice with Qwen TTS.

I need to add emotions to my cloned voice and then use them independently in Qwen TTS (Python coding).

What software should I use to add an emotion to my cloned voice and export a .wav for that emotion?

My plan is to get about 10 emotions for my cloned voice and use each of them as a cloned voice in Qwen TTS.


UPDATE

I’ve already given up on “cloning + emotions”—not even Fish Audio has managed to do it right. (I just need to try Elevenlabs.)

I'm using the “Spanish” language.

I've used Qwen TTS and got a beautiful voice that I really like. The problem is that if I “change” the prompt or the seed, the voice changes completely.

That’s why I can’t create a library of similar voices for different moods (at least with Qwen TTS).

I’ve checked out the ZeroVoice repository, and it’s great (too bad it’s only in English).

What do you recommend for designing a voice and adding emotions to it?

Thanks a lot!!

u/goyetus — 5 days ago