r/tts

Just launched ContextLM on PH today. The most expressive Text-to-Speech platform.
▲ 6 r/tts+2 crossposts

Hey 👋

We just launched ContextLM on Product Hunt today 🚀

ContextLM is an expressive, context-aware, LLM-based Text-to-Speech and Text-to-Podcast platform that lets users instantly clone a voice and generate human-like speech using custom prompts.

Your upvote and feedback will be appreciated.

We have 10,000 FREE credits 🎁 ready for everyone in this community who shares, upvotes, or comments on our launch today.

DM me for your credits.

Please upvote and comment on Product Hunt:

https://www.producthunt.com/products/contextlm?comment=5382565

Thank you 😊

u/herberz — 1 day ago
▲ 1 r/tts

Which paid TTS websites/apps give the most hours for the lowest price?

Looking specifically for the cheapest services that offer voice cloning and long-form audio generation.

u/tr0picana — 8 days ago
▲ 7 r/tts+2 crossposts

Which TTS API provider would you recommend for long-ish narrations?

I'm making an app where an AI narrates a story for the player to take part in. The app is turn-based, and each turn typically generates around 400 words of narration.

Which TTS API providers would you recommend that can produce around 2–3 minutes of audio in a single request?

I tested Qwen TTS on Alibaba Cloud, but it seems to cut the output off after about 50 seconds, and chunking the audio sounds really bad because the voice changes pitch between chunks.

I'm aiming for a TTS API provider in the range of $13–15 USD per million characters, preferably multilingual.

Any recommendations?
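
For a rough sense of what that budget means per turn, here is a minimal sketch. The 6 characters per word figure (spaces included) is only an assumption for English text and will differ for other languages; `cost_per_turn` is a hypothetical helper, not part of any provider's API.

```python
def cost_per_turn(words, price_per_million_chars, chars_per_word=6.0):
    """Estimate the TTS cost of one turn of narration.

    chars_per_word is an assumed average (spaces included), not a measured value.
    """
    chars = words * chars_per_word
    return chars * price_per_million_chars / 1_000_000


# ~400 words per turn at the two ends of the stated budget
for price in (13, 15):
    print(f"${price}/M chars -> ${cost_per_turn(400, price):.4f} per turn")
```

Under that assumption a turn lands somewhere around 3 to 4 cents, so the per-character price matters less than whether the provider can return 2–3 minutes of audio in one request.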

u/popyui — 8 days ago
▲ 16 r/tts

I ran OmniVoice and Qwen3-TTS through the same tests for (English) voice cloning. Here's everything I learned about how they compare.

I ran Qwen3-TTS and OmniVoice through the same tests, on the same hardware (8GB NVIDIA RTX 3070), with the same reference audio. This is by no means scientific - just sharing my observations and adding some quantifiable data to compare the two.

Voice match (Tie)
Both models were excellent. I used a 7-second reference clip and generated the same text three times with each. Both produced clones extremely close to the original; unless you're cloning a voice you know very well, you wouldn't notice a difference in most use cases.

I ran a speaker similarity test using SpeechBrain's ECAPA-TDNN model, which compares speaker embeddings using cosine similarity (-1 to 1, where 1 = same speaker). Also tested Chatterbox since I had it set up.

| Model | Sample 1 | Sample 2 | Sample 3 | Avg Score |
|---|---|---|---|---|
| Qwen3-TTS | 0.912 | 0.918 | 0.908 | 0.913 |
| Chatterbox | 0.876 | 0.915 | 0.882 | 0.891 |
| OmniVoice | 0.886 | 0.894 | 0.881 | 0.887 |

Qwen3 edged out slightly, but at these levels the differences are hard to hear.
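
For anyone unfamiliar with the metric, here is a minimal sketch of how cosine similarity between two speaker embeddings is computed, plus the averaging used for the last column of the table. It is plain Python and does not call SpeechBrain; the 3-dimensional vectors are made-up stand-ins for real ECAPA-TDNN embeddings, which are much longer.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors (-1 to 1)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def average(values):
    """Arithmetic mean of a non-empty list of scores."""
    return sum(values) / len(values)


# Hypothetical toy embeddings, for illustration only
reference = [0.2, 0.7, 0.1]
clone = [0.25, 0.65, 0.05]
print(round(cosine_similarity(reference, clone), 3))

# Per-sample scores copied from the table above
scores = {
    "Qwen3-TTS": [0.912, 0.918, 0.908],
    "Chatterbox": [0.876, 0.915, 0.882],
    "OmniVoice": [0.886, 0.894, 0.881],
}
for model, values in scores.items():
    print(model, round(average(values), 3))
```

A score of 1 means the two embeddings point in exactly the same direction, so higher means the clone sits closer to the reference speaker in embedding space.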

Long text (Tie)
Generated a full paragraph (~110 words). Neither model showed voice drift or artifacts. I've had issues with Chatterbox sometimes adding weird artifacts at the end, but not with either of these.

Emotional expression (OmniVoice wins)
I used a reference clip of someone crying while talking. Not full sobbing, but that shaky voice you get when trying to hold it together. OmniVoice carried this quality into the generated speech really well. Qwen3 matched the voice itself but the emotion was much flatter. It sounded like the same person, but a version of that person who wasn't crying.

Speed (OmniVoice wins)
Most generations were significantly faster with OmniVoice, in some cases 3-5x.

One thing I noticed: OmniVoice tended to rush output with shorter references. A sentence that came out around 5s with Qwen3 was ~4.4s with OmniVoice. I fixed it by changing the speed parameter, but worth knowing.
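
A rough way to pick a starting value for that adjustment, as a sketch only: this assumes the speed parameter behaves like a playback-rate multiplier (1.0 is the default, lower is slower) and that duration scales inversely with it. I haven't verified that this is how OmniVoice defines the parameter, and `speed_to_match` is a hypothetical helper.

```python
def speed_to_match(current_seconds, target_seconds, current_speed=1.0):
    """Speed multiplier that would bring a clip from current_seconds to target_seconds.

    Assumes duration is inversely proportional to the speed multiplier.
    """
    if current_seconds <= 0 or target_seconds <= 0:
        raise ValueError("durations must be positive")
    return current_speed * current_seconds / target_seconds


# Stretch the ~4.4s OmniVoice output to roughly match the ~5s Qwen3 output
print(round(speed_to_match(4.4, 5.0), 2))
```

Treat the result as a first guess and tune by ear from there.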

Numbers, abbreviations, mixed languages (Qwen3 wins)
Tested both with this sentence: "The flight from JFK departs at 7:45 AM on March 3rd, costs $1,249.99, and the pilot announced 'bienvenidos a bordo' before switching back to English for the safety briefing."

Qwen3 handled it cleanly. OmniVoice struggled with the price. It couldn’t get the 99 cents right and kept saying "ninety-nine sons" or "ninety-nines".

This is a known limitation of OmniVoice. It doesn't have built-in text normalization, so complex numbers and currency formats can trip it up. If your text has a lot of numbers or abbreviations, you'd need to write them out ("one thousand two hundred forty-nine dollars and ninety-nine cents" instead of $1,249.99).
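
If you'd rather not do that by hand, a small pre-processing step can handle the simple cases. This is a minimal sketch that only spells out US dollar amounts below one million; the function names are my own, and times, dates, and abbreviations like "JFK" or "7:45 AM" would each need their own rules.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def int_to_words(n):
    """Spell out an integer from 0 to 999,999 in American English."""
    if not 0 <= n < 1_000_000:
        raise ValueError("only 0..999,999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + int_to_words(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return int_to_words(thousands) + " thousand" + (" " + int_to_words(rest) if rest else "")


# "$" followed by digits (optionally comma-grouped) and optional two-digit cents
PRICE = re.compile(r"\$(\d+(?:,\d{3})*)(?:\.(\d{2}))?")


def spell_price(match):
    """Turn one regex match such as "$1,249.99" into words."""
    dollars = int(match.group(1).replace(",", ""))
    words = int_to_words(dollars) + (" dollar" if dollars == 1 else " dollars")
    if match.group(2) is not None:
        cents = int(match.group(2))
        words += " and " + int_to_words(cents) + (" cent" if cents == 1 else " cents")
    return words


def normalize_prices(text):
    """Replace every dollar amount in text with its spelled-out form."""
    return PRICE.sub(spell_price, text)


print(normalize_prices("The flight costs $1,249.99, and departs soon."))
```

Run the text through `normalize_prices` before sending it to the model, so the model only ever sees words.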

Cross-lingual cloning (OmniVoice, if you prefer to preserve source accent)
I tested Italian to English with an Italian-accented reference. Qwen3 kept the Italian accent on some words but slipped into a more English-sounding delivery on others. OmniVoice kept the Italian accent almost completely throughout. Both models matched the voice well, though, so it comes down to whether you'd like to preserve the source accent or not.

Overall takeaway
Neither model is strictly better. The right choice depends on what you're doing.

Use OmniVoice for: audiobooks, narration, emotional delivery, multilingual content where accent preservation matters. It also supports paralinguistic tags for adding things like laughter, sighs, and other vocal expressions into the output.

Use Qwen3-TTS for: technical content with numbers, prices, dates, abbreviations, anything where text normalization matters and you don't want to pre-process.

For most creative and conversational use cases I'd lean OmniVoice. For structured or technical text, use Qwen3, or pre-process the text before sending it to OmniVoice.

If you want to try these without the setup, I've been building a desktop app called Voice Creator Pro that bundles OmniVoice, Qwen3-TTS, and Chatterbox into one interface. It runs on Windows (free trial) and Mac.
Both of these models are open source so you can also try them for free - https://huggingface.co/k2-fsa/OmniVoice, https://huggingface.co/spaces/Qwen/Qwen3-TTS.

Curious to hear what your experience has been if you've tried these or other TTS models.

u/c08mic_cha08 — 13 days ago