r/tts

I've been working on a zero-shot voice cloning TTS system and wanted to share it here in case it's useful to anyone working on similar problems.

What it is

VoxFlash-TTS is a flow-matching-based voice cloning model that operates in a heavily compressed latent space: 9 Hz, instead of the much higher frame rates most diffusion/flow TTS systems use. The idea was to see how far latent compression could go before quality breaks down, since a lower frame rate means fewer latent frames per second of audio, and therefore cheaper inference for a given NFE budget.

Some of the design choices:

Phoneme encoder: ConvNeXtV2-based, rather than a standard Conformer/Transformer stack.
Generative backbone: flow matching with an Euler solver, NFE=16 (number of function evaluations) by default.
Speaker conditioning: ConvNeXtV2 speaker encoder with attentive statistical pooling, fed into AdaLN.
Cross-lingual zero-shot cloning: Chinese and English, including code-switching.
Inference: exported to ONNX, packaged for Docker deployment, no Python training stack required at inference time.
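
For anyone who hasn't seen a fixed-step flow matching sampler before, here is a generic sketch of what "Euler solver, NFE=16" means in practice. It is not the VoxFlash-TTS code: `euler_sample`, `toy_velocity`, and the 64-dim latent size are made up for illustration, and the toy velocity field stands in for the trained, speaker-conditioned network.

```python
import numpy as np

def euler_sample(velocity_fn, x0, nfe=16):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = x0.copy()
    dt = 1.0 / nfe
    for i in range(nfe):
        t = i * dt
        # One call to velocity_fn per step, so nfe calls in total.
        x = x + dt * velocity_fn(x, t)
    return x

# Toy setup: 9 latent frames (1 second at 9 Hz), 64 latent dims (hypothetical size).
rng = np.random.default_rng(0)
noise = rng.standard_normal((9, 64))
target = np.ones_like(noise)

def toy_velocity(x, t):
    # Stand-in for the trained model: points straight from x to a fixed target.
    # t never reaches 1 inside the loop, so the division is safe.
    return (target - x) / (1.0 - t)

latents = euler_sample(toy_velocity, noise, nfe=16)
print(np.abs(latents - target).max())
```

In a real system `velocity_fn` would be the conditioned backbone, and the resulting latents would go to the decoder afterwards.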

Why 9 Hz

Most latent TTS systems run their diffusion/flow process at much higher temporal resolution. Compressing the latent sequence rate this aggressively is mainly a bet on inference cost — fewer latent frames per second of audio means a much smaller sequence for the flow matching model to denoise, which matters a lot if you care about real-time or low-resource deployment rather than just sample quality in isolation. It's a tradeoff, and I'd be curious to hear from others who've pushed compression in either direction.
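
To put rough numbers on that, here is a small back-of-the-envelope sketch. The 50 Hz and 100 Hz rates are only illustrative comparison points, not measurements of any particular system, and `latent_frames` is a helper made up for this example.

```python
import math

def latent_frames(duration_s, frame_rate_hz):
    """Number of latent frames needed to cover duration_s seconds of audio."""
    return math.ceil(duration_s * frame_rate_hz)

NFE = 16           # default number of function evaluations
DURATION_S = 10.0  # length of the utterance in seconds

for rate_hz in (9, 50, 100):  # 50 and 100 Hz are illustrative, not benchmarks
    frames = latent_frames(DURATION_S, rate_hz)
    # Each function evaluation processes the whole sequence once.
    frame_evals = frames * NFE
    print(f"{rate_hz:>3} Hz: {frames:>4} frames, {frame_evals:>5} frame evaluations")
```

The sequence length scales linearly with the frame rate, so at a fixed NFE the 9 Hz model processes about 5.6x fewer frames than a 50 Hz one for the same audio. How that translates to wall-clock time depends on the backbone.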

Links

GitHub: https://github.com/VoxFlash/VoxFlashTTS
Hugging Face: https://huggingface.co/VoxFlashTTS/VoxFlashTTS
Demo: https://voxflash.github.io

Happy to answer questions about the architecture, the flow matching formulation, or the ONNX export pipeline — these were the trickiest parts to get right, especially the velocity target derivation and keeping VAE latent normalization consistent between training and inference.
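
For anyone new to flow matching, here is a generic sketch of those two pieces. It assumes the common straight-line path between noise and data, which may differ from the exact formulation in the repo, and every name in it (`normalize`, `denormalize`, `flow_matching_pair`, the placeholder mean/std) is hypothetical rather than taken from the VoxFlash-TTS code.

```python
import numpy as np

def normalize(latent, mean, std):
    """Map raw VAE latents to the scale the flow model is trained on."""
    return (latent - mean) / std

def denormalize(latent, mean, std):
    """Inverse of normalize; must use the SAME mean/std at inference."""
    return latent * std + mean

def flow_matching_pair(x1, rng):
    """Build (x_t, t, v_target) for the path x_t = (1 - t) * x0 + t * x1."""
    x0 = rng.standard_normal(x1.shape)  # noise endpoint
    t = rng.uniform()                   # random time in [0, 1)
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0                  # d(x_t)/dt for the straight-line path
    return x_t, t, v_target

rng = np.random.default_rng(0)
mean, std = 0.3, 2.0                          # placeholder dataset statistics
raw = rng.standard_normal((9, 64)) * std + mean  # fake raw VAE latents
x1 = normalize(raw, mean, std)
x_t, t, v_target = flow_matching_pair(x1, rng)

# Following the target velocity for the remaining time lands exactly on x1.
print(np.allclose(x_t + (1.0 - t) * v_target, x1))
# Round-tripping through the same statistics recovers the raw latents.
print(np.allclose(denormalize(x1, mean, std), raw))
```

The point of the second check is that the statistics used for `normalize` during training have to ship with the exported model, otherwise the decoder sees latents on the wrong scale.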


VoxFlash-TTS: an ultra-compressed latent diffusion voice cloning model (9 Hz latent space, ONNX, zero-shot CN/EN)
