r/TextToSpeech

Maya-2-Native is now the #2 Hindi TTS model on Voice Arena!

Voice Arena's latest Hindi TTS rankings have Maya-2-Native at #2, behind Gemini 3.1 Flash TTS and ahead of ElevenLabs v3, Cartesia Sonic 3.5, Sarvam Bulbul-V3, Grok TTS and others.

Interesting to see how quickly Hindi speech synthesis quality is improving across the board.

u/Bladerunner_7_ — 5 hours ago

▲ 53 r/TextToSpeech+21 crossposts

I’ve been working on Murmur, a local text-to-speech app for Apple Silicon Macs.

The new feature I’m building is called Projects / Story Studio, and it solves a problem I kept running into:

TTS tools are fine for one-off clips, but messy for actual audio projects.

If you’re making a podcast segment, audiobook chapter, course lesson, ad, or game dialogue, you usually need multiple speakers, multiple takes, pauses, reactions, music, edits, exports, and a way to come back to the project later.

So I built a project-based workflow:

Write a script → assign voices → generate dialogue → edit clips on a timeline → add music/SFX → export final audio.

It supports things like:

multiple scripts inside one project
Host / Guest / Narrator / Character speakers
inline tags like [pause], [laugh], [chuckle]
per-block regeneration
timeline editing with waveforms
media lane for music and SFX
ripple editing and gap tools
WAV/M4A export
transcript and stem export
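
To make the speaker and inline-tag features above concrete, here is a minimal sketch of how a script could be split into per-speaker blocks. The `Speaker: text` line format, the `Block` type, and `parse_script` are assumptions made for illustration; the post does not describe Murmur's actual script syntax or internals.

```python
import re
from dataclasses import dataclass
from typing import List

# Matches inline tags such as [pause], [laugh], [chuckle].
TAG_PATTERN = re.compile(r"\[(\w+)\]")


@dataclass
class Block:
    speaker: str     # e.g. "Host", "Guest", "Narrator"
    text: str        # spoken text with inline tags removed
    tags: List[str]  # inline tags in order of appearance


def parse_script(script: str) -> List[Block]:
    """Split a script into blocks, assuming one 'Speaker: text' entry per line."""
    blocks = []
    for line in script.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blank lines and lines without a speaker label
        speaker, _, body = line.partition(":")
        tags = TAG_PATTERN.findall(body)
        # Drop the tags, then collapse the whitespace they leave behind.
        text = " ".join(TAG_PATTERN.sub(" ", body).split())
        blocks.append(Block(speaker.strip(), text, tags))
    return blocks


blocks = parse_script("Host: Welcome back. [pause] Ready?\nGuest: [laugh] Always.")
for block in blocks:
    print(block.speaker, block.tags, block.text)
```

Keeping each line as its own block is what would make per-block regeneration cheap: only the edited block needs a new take.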

Everything runs locally on Mac, so long scripts and voice samples do not need to be uploaded to a cloud service.

I’m still polishing the workflow and would love feedback from Mac users, especially people who make podcasts, audiobooks, courses, YouTube narration, or game dialogue.

u/tarunyadav9761 — 1 day ago

▲ 2 r/TextToSpeech

Looking for Expressive AI TTS

Which TTS platform is the best, or your personal choice, for expressive character voices?

I’m looking for voices that can handle creative stuff like dubbing, game dialogue, character reactions, emotional lines, anime-style conversations, etc., rather than commercial corporate or customer-service-style voices.

I’ve seen people mention ElevenLabs, Fish Audio S2.1 Pro, Cartesia, Google 3.1 Flash, OpenAI TTS, and some local models like Omnivoice and Chatterbox, but there's a ton out there.

u/Consistent-Teach4336 — 18 hours ago

▲ 3 r/TextToSpeech+2 crossposts

Has anyone achieved consistent voice identity with Gemini 3.1 Flash TTS for long-form narration?

Hi everyone,

I’m currently researching Gemini 3.1 Flash TTS Preview as the primary TTS engine for a long-form audiobook/storytelling application.

So far, the biggest challenge I’ve encountered is voice identity consistency.

My setup

Model: Gemini 3.1 Flash TTS Preview
Voice: Kore
Official Gemini API (Node.js)
Same API key
Same voice
Same prompt
Same text
Same parameters

The problem

When I generate exactly the same text multiple times, the voice does not change gender, but the voice identity changes noticeably.

I’m referring to subtle differences in:

Timbre
Tone
Speaking style
Delivery
Overall vocal character

It still sounds like “Kore,” but more like different recording sessions or different voice actors trying to imitate the same voice.

For long-form narration, this becomes very obvious after stitching multiple chunks together.

What I’ve already tested

I intentionally tested many different approaches:

Generating the exact same text multiple times
Different chunk sizes
Emotion tags
Synonym emotion tags
Natural-language performance directions
No emotion tags at all
Different prompting styles

None of these significantly improved voice consistency.

My question

Has anyone successfully built a long-form narration or audiobook pipeline using Gemini 3.1 Flash TTS?

Specifically:

Have you found a way to keep the voice identity consistent across multiple API calls?
Is there any hidden parameter, seed, or context mechanism that helps?
Does Vertex AI behave differently from the Gemini API?
Are there any prompting techniques that actually improve consistency?
Or is this simply a current limitation of Gemini 3.1 Flash TTS?

I’m not trying to clone a custom voice; I’m only trying to keep the built-in Kore voice sounding like the same narrator throughout an entire audiobook.

Any insights or real-world experience would be greatly appreciated.

Thanks!

u/Ok_Coat4453 — 19 hours ago

▲ 5 r/TextToSpeech

Old TTS suggestions?

Greetings friends, I'm working on a game where I'd like to use some "retro" text-to-speech voices for character voice lines. Could you lovely people point me to some (preferably free) tools for using TTS voices such as Microsoft Sam and the macOS Boing voice (the one used in Castle Crashers for the Painter boss, I think)?

u/Jolly-Ad-1161 — 1 day ago

▲ 8 r/TextToSpeech

The best value Cloud TTS has dropped guys (MAI Voice-2)

I have just migrated my pipelines from Deepgram to Microsoft's newly released MAI Voice-2 TTS models, and I have to say, I am impressed!

The voice is crisp, with tons of emotion, and sounds very human.

They boast voice cloning as well, but I have not tested this.

I have spent tons of time researching TTS models, and this is by far the best bang for the buck I have seen out there.

Here are their release notes:

https://microsoft.ai/news/mai-voice-2/

I found no one talking about this in this subreddit. Has anyone else checked it out yet?

u/byllefar — 3 days ago

▲ 10 r/TextToSpeech

When will TTS software be released that works flawlessly, can be downloaded with a single click, and operates just like a standard, fast application? Is something like that even possible in the future? I mean, it doesn't necessarily have to be high-quality; Kokoro-like quality would be fine.

When will TTS software be released that works flawlessly, can be downloaded with a single click, and operates just like a standard, fast application? Is something like that even possible in the future? I mean, it doesn't necessarily have to be as high-quality as ElevenLabs; being on par with Kokoro would be enough.

u/etlorkey — 4 days ago

▲ 4 r/TextToSpeech+1 crossposts

Voice agents, demystified: STT+TTS and 4 demo agents you can talk to in the browser + build yours with RAG and Tools

I added voice to AgentSwarms! You can create voice agents in a few clicks and talk to them in the browser, and you can try 4 demo voice agents right now: no setup, just tap the mic. Here's how it works and why it turned out to be less "new" than I expected.

The surprise building this: a voice agent is basically the chat agent you already know, with a voice on top. Same system prompt, same tools, same RAG, memory, and guardrails. Under the hood it's a simple loop — your mic gets transcribed to text (OpenAI GPT-4o-mini-transcribe), your agent replies exactly like it would in chat, and that reply gets spoken back (OpenAI GPT-4o-mini-TTS). The agent's brain doesn't change at all. You've just added ears and a voice.

Which is the whole point: everything you've already learned building chat agents carries straight over. If your agent can pull an answer from a knowledge base, call a tool, or respect a guardrail in text, it does all of that out loud too — because it's the exact same engine with audio on the two ends, not a separate stripped-down "voice mode."
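
The loop described above (transcribe, reply as in chat, speak the reply) can be sketched as follows. This is only an illustration under stated assumptions: `VoiceAgent`, `turn`, and the three stub functions are hypothetical names, and the stubs stand in for the real STT, agent, and TTS calls, which the post does not show.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (user text, agent text) pairs


@dataclass
class VoiceAgent:
    transcribe: Callable[[bytes], str]     # STT: mic audio -> text
    reply: Callable[[str, History], str]   # the unchanged chat agent
    synthesize: Callable[[str], bytes]     # TTS: text -> audio
    history: History = field(default_factory=list)

    def turn(self, mic_audio: bytes) -> bytes:
        """Run one voice turn: ears, then the chat brain, then a voice."""
        user_text = self.transcribe(mic_audio)
        agent_text = self.reply(user_text, self.history)
        self.history.append((user_text, agent_text))
        return self.synthesize(agent_text)


# Stubs so the sketch runs without any audio service (assumptions, not real APIs).
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_chat(text: str, history: History) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


agent = VoiceAgent(fake_stt, fake_chat, fake_tts)
print(agent.turn(b"hello"))  # prints b'You said: hello'
```

Because `reply` is the same callable a text chat would use, tools, retrieval, memory, and guardrails would sit inside it and need no voice-specific changes.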

**What I shipped**

* **New Voice Agent** in the builder: pick a voice (11 of them), a greeting, and your STT/TTS models. That's the whole setup.
* Every spoken reply runs the same pipeline as a chat agent — tools, knowledge base, memory, and guardrails all apply.
* A **Voice Playground**: tap the mic, talk, and hear the reply back, with the transcript on screen so you can read along.

**Talk to it (free, in the browser)** — 4 demos, tap the mic:

* **Aria** — customer support triage
* **Nova** — B2B discovery caller
* **Kai** — Spanish conversation tutor
* **Echo** — daily standup coach

Open one, talk to it, and fork it into your own workspace if you like it.

* Voice Playground → https://agentswarms.fyi/voice-playground
* Build your own (New Voice Agent) → https://agentswarms.fyi/agents
* Docs → https://agentswarms.fyi/docs/voice

*Disclosure: AgentSwarms is a school of agentic AI for both no-code people and developers, a learn-by-building platform. The demos are free. Happy to answer anything about the setup in the comments.*

u/Outside-Risk-8912 — 3 days ago

▲ 7 r/TextToSpeech

Multilingual TTS mispronounces brand names differently in every language. How do you solve this?

I run a video localization SaaS where users generate their own voiceovers/dubs in multiple languages with AI voices.

Problem: a brand name is spelled the same everywhere, but each language's voice pronounces it differently, so it sounds wrong/inconsistent across locales. I can't hand-fix each name, because my users generate the content themselves and telling them to write phoneme tags isn't realistic.

For those building on top of TTS: how do you keep brand/product names consistent across languages when your users drive the generation, not you?

u/RudeAd8468 — 3 days ago

▲ 135 r/TextToSpeech+2 crossposts

[audio.cpp] VibeVoice 1.5B released — 90-min podcast in 22.95 min, 4.08x real-time, 2.86x faster than Python without quantization. Native C++/ggml

I’m the author of audio.cpp, a C++/ggml runtime for local audio models.

I just added VibeVoice 1.5B support and wanted to share the benchmark because long-form multi-speaker TTS is a good stress test for local inference runtimes.

Result on RTX 5090:

VibeVoice 1.5B
Audio length: 5615.73s / 93.60 min
Wall time: 1376.84s / 22.95 min
RTF: 0.245
Speed: 4.08x faster than real time
Python baseline: 92.66 min audio in 65.70 min
Speedup vs baseline: 2.86x
Quantization: none
Diffusion steps: 10
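
For readers checking the figures above, the arithmetic can be reproduced in a few lines. The sketch assumes that RTF means wall time divided by audio duration and that the reported speedup is the ratio of the two wall times; the baseline run produced slightly less audio (92.66 min versus 93.60 min), so that ratio is approximate.

```python
def real_time_factor(wall_s: float, audio_s: float) -> float:
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    return wall_s / audio_s


audio_s = 5615.73          # audio length reported in the post
wall_s = 1376.84           # wall time reported in the post
baseline_wall_min = 65.70  # Python baseline wall time reported in the post

rtf = real_time_factor(wall_s, audio_s)
speed = 1.0 / rtf                                # multiples of real time
speedup = baseline_wall_min / (wall_s / 60.0)    # ratio of wall times (assumption)

print(f"RTF: {rtf:.3f}")                        # RTF: 0.245
print(f"Speed: {speed:.2f}x real time")         # Speed: 4.08x real time
print(f"Speedup vs baseline: {speedup:.2f}x")   # Speedup vs baseline: 2.86x
```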

The main point is not just avoiding Python setup pain, though that is part of it. The goal is to make audio models practical in a native local runtime: reusable sessions, server-like usage, long-form generation, stable memory behavior, and CUDA-focused (CPU and Metal later) optimization.

VibeVoice is a useful milestone because it is not just short-sentence TTS. It is designed for long-form, multi-speaker dialogue such as podcasts, character chats, and narration, where runtime behavior matters a lot.

Current framework progress:

Released model families: 16 / 28
[███████████░░░░░░░░░] 57%

The other model families are already running end-to-end internally, but I’m releasing them gradually after testing and cleanup.

The repo is https://github.com/0xShug0/audio.cpp

I’d be interested in feedback from people testing VibeVoice on other GPUs or CPUs, especially long prompts, multi-speaker formatting, VRAM behavior, and performance numbers.

u/Acceptable-Cycle4645 — 5 days ago

▲ 0 r/TextToSpeech

Can I use an ElevenLabs voice in my Android app?

I’m making an Android app with narrated educational stories. I want to use a Turkish voice from the ElevenLabs Voice Library. I’ll use a paid plan, and the texts will be original or edited by me. For background music, I’ll only use licensed or royalty-free music. Is it okay to use an ElevenLabs Voice Library voice in a commercial Android app? And if the voice gets removed later, can I keep using the audio files I already generated?
Not asking for legal advice, just looking for real experience.

u/burakkrkynlu — 4 days ago

▲ 4 r/TextToSpeech+1 crossposts

Fine-tuning Qwen 3 TTS on low-resource languages or languages not officially supported by Qwen 3 TTS

I am trying to fine-tune Qwen 3 TTS on Indian languages. It worked well on Hindi, but it works very badly on other languages. How do I make the audio better? Any suggestions?

u/ravk_1234 — 4 days ago

▲ 2 r/TextToSpeech+1 crossposts

ASR & TTS Data in English Language

We have a few hundred hours of English data from native speakers for ASR/TTS model training. All of the data is GDPR-compliant and licensed. We also have more than 1,000 native speakers who can contribute to further data generation if needed. As AI solutions and models grow, they need lots of data for training and tuning. As a startup in speech tech, we are not sure where or how to approach people or organizations to offer our datasets.

Does anyone have any ideas or information about this? It would be a great help if you could share your experience. Thank you so much in advance.

u/Debug_And_Solve — 5 days ago

▲ 3 r/TextToSpeech+1 crossposts

How can I run text-to-speech (TTS) locally for German text?

I tried Voicebox, but Kokoro does not seem to have German presets.

u/HistoricalStrength21 — 6 days ago

▲ 172 r/TextToSpeech+1 crossposts

Purr - free, open-source macOS dictation with Smart Typing, Voice Edit, and Meeting Mode (Wispr Flow / SuperWhisper alternative, 100% on-device)

I used SuperWhisper for a while and liked it. But the audio still goes to a server somewhere, and I didn't want to rent a utility indefinitely. Wispr Flow is the same deal.

So I built Purr. It's free, MIT-licensed, and everything runs on your Mac's Apple Neural Engine. No account, no subscription, no telemetry. After the first model download (~450 MB) it works fully offline.

The basic flow: hold Right Option, speak, release. Words appear in whatever text field has focus - Slack, Notes, a terminal, a browser input, anywhere.

Three things I put real time into that most dictation apps don't do well:

Smart Typing - words appear live as you speak, not all dumped at once when you release the key. Each phrase lands as its own undoable chunk, so Cmd+Z still works the way you'd expect, and autocorrect doesn't go haywire. Uses Parakeet TDT v2 on the ANE, which is why it's Apple Silicon only. Swap to WhisperKit in settings if you need a language other than English (~100 languages, though it's batch so words land on release).

Voice Edit - select some text, hold the voice-edit hotkey, and say "change X to Y", "delete that last sentence", "capitalize", or just speak what you want instead. A parser handles the common command patterns; anything else replaces the selection wholesale. Works in any Accessibility-supported text field (most native macOS apps) and falls back to paste elsewhere.
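
As a rough illustration of the "parser for common patterns, otherwise replace wholesale" behavior described above, here is a small sketch. The function name, the regular expressions, and the exact command wording are assumptions for illustration; this is not Purr's implementation, which the post does not show.

```python
import re


def apply_voice_edit(selection: str, spoken: str) -> str:
    """Apply a spoken edit command to the selected text (hypothetical sketch)."""
    command = spoken.strip()
    lowered = command.lower()

    # "change X to Y": replace every occurrence of X in the selection.
    match = re.fullmatch(r"change (.+?) to (.+)", command, flags=re.IGNORECASE)
    if match:
        return selection.replace(match.group(1), match.group(2))

    # "capitalize": upper-case the first character only.
    if lowered == "capitalize":
        return selection[:1].upper() + selection[1:]

    # "delete that last sentence": drop the final sentence.
    if lowered == "delete that last sentence":
        sentences = re.split(r"(?<=[.!?])\s+", selection.strip())
        return " ".join(sentences[:-1])

    # Anything else replaces the selection wholesale.
    return command


print(apply_voice_edit("Meet at noon.", "change noon to 3 pm"))
print(apply_voice_edit("One. Two. Three.", "delete that last sentence"))
```

Putting the wholesale replacement last means an unrecognized phrase is never silently dropped; it simply becomes the new text.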

Meeting Mode - captures your mic and your Mac's system audio together, so the whole call gets transcribed, not just your side. Speaker diarization runs locally (FluidAudio on ANE). Transcripts save as Markdown. Can also generate a sidecar summary with TL;DR, decisions, and action items using Apple's on-device model on macOS 26, or Gemma 3 4B locally on older systems.

Also has a custom dictionary (teach it your proper nouns and acronyms), filler word trimming, and in-speech commands like "new paragraph", "comma", "scratch that".

GitHub: https://github.com/iamarunbrahma/purr
Website: https://purr.arunbrahma.com

u/heliosarun — 8 days ago

▲ 2 r/TextToSpeech

Best paid/free service for natural-sounding narration?

I don't care if it's something I have to download or pay for, I just want to know what the current best is for generating human-sounding narration without oddities that mess it up.

u/Correct_Pick6199 — 6 days ago

▲ 1 r/TextToSpeech

What's the best Amharic text to speech today?

I'm looking for an Amharic text-to-speech service that sounds natural enough for YouTube narration.

I've tried a few options like ElevenLabs and Speechify, but none of them support Amharic.

If you've used one for videos, which service gave you the best results?

I'm fine with both paid and free recommendations.

u/Plastic_Account_575 — 5 days ago

▲ 1 r/TextToSpeech

Where can I find the TTS for that unsettling creepypasta-type voice?

u/theok8234 — 5 days ago

▲ 6 r/TextToSpeech

Best non-subscription offline one?

Say I'm camping in Yellowstone for a whole month with no internet, only electricity from a generator. What would be the best offline converter to run on a laptop or phone? I want to convert as much text as I like, with as natural-sounding a voice as possible, to listen to books from PDFs.

u/iShaoKhan — 5 days ago

▲ 15 r/TextToSpeech+2 crossposts

VoxFlash-TTS: an ultra-compressed latent diffusion voice cloning model (9 Hz latent space, ONNX, zero-shot CN/EN)

I've been working on a zero-shot voice cloning TTS system and wanted to share it here in case it's useful to anyone working on similar problems.

What it is

VoxFlash-TTS is a flow-matching based voice cloning model that operates in a heavily compressed latent space — 9 Hz, instead of the much higher frame rates most diffusion/flow TTS systems use. The idea was to see how far latent compression could go before quality breaks down, since lower frame rates mean fewer steps and faster inference for a given NFE budget.

Some of the design choices:

Phoneme encoder: ConvNeXtV2-based, rather than a standard Conformer/Transformer stack.
Generative backbone: flow matching with an Euler solver, NFE=16 by default.
Speaker conditioning: ConvNeXtV2 speaker encoder with attentive statistical pooling, fed into AdaLN.
Cross-lingual zero-shot cloning: Chinese and English, including code-switching.
Inference: exported to ONNX, packaged for Docker deployment, no Python training stack required at inference time.
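
To illustrate what "flow matching with an Euler solver, NFE=16" means in practice, here is a minimal pure-Python sketch of the sampling loop. The velocity function is a toy stand-in (the conditional velocity of a straight-line path toward a known target), not the model's learned network, and `euler_sample` and `make_toy_velocity` are hypothetical names.

```python
from typing import Callable, List

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]


def euler_sample(velocity: VelocityFn, x0: Vector, nfe: int = 16) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with `nfe` Euler steps."""
    x = list(x0)
    dt = 1.0 / nfe
    for step in range(nfe):
        t = step * dt
        v = velocity(x, t)  # one function evaluation per step
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> VelocityFn:
    """Toy field pointing at `target`; a real model would predict this from data."""
    def velocity(x: Vector, t: float) -> Vector:
        # t never reaches 1.0 inside euler_sample, so the division is safe.
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


noise = [0.0, 0.0, 0.0]            # stands in for a noise sample
target = [1.0, -2.0, 0.5]          # stands in for a clean latent frame
latent = euler_sample(make_toy_velocity(target), noise, nfe=16)
print(latent)
```

In a real system each `velocity` call is a full network forward pass, so NFE sets the inference cost almost directly.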

Why 9 Hz

Most latent TTS systems run their diffusion/flow process at much higher temporal resolution. Compressing the latent sequence rate this aggressively is mainly a bet on inference cost — fewer latent frames per second of audio means a much smaller sequence for the flow matching model to denoise, which matters a lot if you care about real-time or low-resource deployment rather than just sample quality in isolation. It's a tradeoff, and I'd be curious to hear from others who've pushed compression in either direction.
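
The cost argument above comes down to sequence length, which a short calculation makes concrete. The 9 Hz figure is from the post; the 50 Hz comparison rate is an illustrative assumption, since the post does not name the frame rates of other systems.

```python
import math


def latent_frames(duration_s: float, frame_rate_hz: float) -> int:
    """Number of latent frames the flow model must denoise for a clip."""
    return math.ceil(duration_s * frame_rate_hz)


clip_s = 10.0
frames_9hz = latent_frames(clip_s, 9.0)    # rate stated in the post
frames_50hz = latent_frames(clip_s, 50.0)  # comparison rate: assumption only

print(frames_9hz)   # 90
print(frames_50hz)  # 500
```

Since every solver step processes the whole sequence, a shorter sequence reduces the work in each of the NFE evaluations, not just once.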

Links

GitHub: https://github.com/VoxFlash/VoxFlashTTS
Hugging Face: https://huggingface.co/VoxFlashTTS/VoxFlashTTS
Demo: https://voxflash.github.io

Happy to answer questions about the architecture, the flow matching formulation, or the ONNX export pipeline — these were the trickiest parts to get right, especially the velocity target derivation and keeping VAE latent normalization consistent between training and inference.

u/Significant-Disk1890 — 6 days ago
