r/speechtech

▲ 52 r/speechtech+21 crossposts

I’ve been working on Murmur, a local text-to-speech app for Apple Silicon Macs.

The new feature I’m building is called Projects / Story Studio, and it solves a problem I kept running into:

TTS tools are fine for one-off clips, but messy for actual audio projects.

If you’re making a podcast segment, audiobook chapter, course lesson, ad, or game dialogue, you usually need multiple speakers, multiple takes, pauses, reactions, music, edits, exports, and a way to come back to the project later.

So I built a project-based workflow:

Write a script → assign voices → generate dialogue → edit clips on a timeline → add music/SFX → export final audio.

It supports things like:

  • multiple scripts inside one project
  • Host / Guest / Narrator / Character speakers
  • inline tags like [pause], [laugh], [chuckle]
  • per-block regeneration
  • timeline editing with waveforms
  • media lane for music and SFX
  • ripple editing and gap tools
  • WAV/M4A export
  • transcript and stem export
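A minimal sketch of how a script with speakers and inline tags could be split into per-block data for regeneration. The `Speaker: text` line format, the `Block` type, and `parse_script` are assumptions made for illustration, not Murmur's actual project format.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Matches inline tags such as [pause], [laugh], [chuckle].
TAG_PATTERN = re.compile(r"\[(\w+)\]")


@dataclass
class Block:
    speaker: str                      # e.g. "Host", "Guest", "Narrator"
    text: str                         # spoken text with the tags stripped out
    tags: List[str] = field(default_factory=list)


def parse_script(script: str) -> List[Block]:
    """Split a script into one Block per 'Speaker: text' line (hypothetical format)."""
    blocks = []
    for line in script.splitlines():
        if ":" not in line:
            continue  # skip blank or unlabelled lines
        speaker, _, body = line.partition(":")
        tags = TAG_PATTERN.findall(body)
        # Remove the tags, then collapse the leftover whitespace.
        text = " ".join(TAG_PATTERN.sub(" ", body).split())
        blocks.append(Block(speaker.strip(), text, tags))
    return blocks


blocks = parse_script("Host: Welcome back [pause] to the show.\nGuest: [laugh] Thanks!")
```

Keeping each block as its own record is what makes per-block regeneration cheap: only the edited block needs a new take.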

Everything runs locally on Mac, so long scripts and voice samples do not need to be uploaded to a cloud service.

I’m still polishing the workflow and would love feedback from Mac users, especially people who make podcasts, audiobooks, courses, YouTube narration, or game dialogue.

u/tarunyadav9761 — 10 hours ago
▲ 34 r/speechtech+4 crossposts

I created a RecognitionService that handles system-wide voice input fully on-device (no Google, no network)

Most voice input on Android, meaning any SpeechRecognizer.createSpeechRecognizer(context) call, gets routed to Google's network-backed recognizer. I wanted that path to run locally, so I wrote one.

The service hooks the framework's SpeechRecognizer API. Once it's set as the default, any app calling createSpeechRecognizer(context) (no ComponentName) ends up in our pipeline and gets back transcription that never left the device. Pipeline is Silero VAD + Parakeet TDT v3 (25 languages, ~890 MB INT8) on ONNX Runtime with NNAPI.

Honest caveat: Gboard, Samsung Keyboard, and Google Assistant ship their own recognizers and skip the system default. So the default-IME voice button on most phones won't go through this. What does: accessibility tools, custom dictation UIs, and anything calling the framework API directly.

Models download on first use (~1.2 GB) via a foreground WorkManager job so it survives backgrounding. After that, fully offline.

Setup + demo APK: github.com/soniqo/speech-android

Library: audio.soniqo:speech:0.0.9 on Maven Central

Happy to answer questions about the binder lifecycle, the foreground worker setup, or why SpeechRecognizer is such a tarpit of edge cases.

u/ivan_digital — 13 hours ago

Transcria ... Yet another open-source transcription project — except it got a little out of hand

Fair warning: the world does not need another Whisper wrapper, and that isn't quite what I set out to build. It started for a handful of real users who needed meeting minutes and were tired of paying per minute to ship confidential recordings to a cloud they didn't control. It ended up running inside two companies, which was honestly all I ever wanted from it.

The problem is that I'd already started. You know how that goes.

The summary came from a local LLM, so I figured the LLM might as well correct the SRT using a validated lexicon and the meeting context. Then I wanted better speaker attribution, so I wired in pyannote, later Sortformer, benchmarked a few STT backends against each other, and built a per-VRAM-tier model catalog because "which model fits my GPU" is a question I got tired of answering by hand. Somewhere along the way it grew a GPU job queue, calendar scheduling, a resumable pipeline, multi-user roles, an audit trail, and a full-screen transcript editor. I keep promising people the next feature is the last one. They've stopped believing me.

What I think actually sets it apart isn't the models — we all pull from the same shelf. It's the unglamorous parts. A human-in-the-loop step where you validate speakers from audio excerpts and a domain lexicon before the final pass. Deliverables you can chat with, then fix a term once and have it corrected coherently across the SRT, the summary and the Word minutes. And the boring production plumbing you only miss once real people use the thing every week: "waiting for VRAM" as an actual state instead of a crash, a pipeline that resumes instead of redoing four hours of audio, and a split mode where a CPU web front talks to a separate GPU node.
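A rough sketch of the "waiting for VRAM" and resumable-pipeline ideas above, under stated assumptions: `JobState`, `Job`, and `advance` are hypothetical names, and appending to `completed` stands in for running a stage and writing a checkpoint. It is not Transcria's actual scheduler.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class JobState(Enum):
    QUEUED = "queued"
    WAITING_FOR_VRAM = "waiting_for_vram"
    RUNNING = "running"
    DONE = "done"


@dataclass
class Job:
    name: str
    stages: List[str]                 # e.g. ["stt", "diarize", "summarize"]
    vram_needed_gb: float
    completed: List[str] = field(default_factory=list)  # persisted checkpoint list
    state: JobState = JobState.QUEUED


def advance(job: Job, free_vram_gb: float) -> JobState:
    """Run the next unfinished stage, or park the job if the GPU is too full."""
    remaining = [s for s in job.stages if s not in job.completed]
    if not remaining:
        job.state = JobState.DONE
    elif free_vram_gb < job.vram_needed_gb:
        job.state = JobState.WAITING_FOR_VRAM   # a visible state, not a crash
    else:
        job.completed.append(remaining[0])      # stand-in for real work + checkpoint
        job.state = JobState.RUNNING if len(remaining) > 1 else JobState.DONE
    return job.state
```

Because finished stages are recorded on the job, a restart only re-enters `advance` and skips whatever is already in `completed`.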

I tried to keep the code honest — tests, CI, a coverage gate — mostly so future me doesn't resent present me.

Honest caveats before anyone gets ideas: the UI is French-first for now (strings are centralized, so it's a translation job, not a rewrite), reference quality leans on gated models, and you need an NVIDIA GPU. Apache-2.0.

I'd genuinely value a specialist eye on the diarization and STT-arbitration choices — this is one of the few rooms where people actually know where the bodies are buried in this domain. Repo and a few screenshots below.

https://preview.redd.it/emtjrkaiz8bh1.png?width=1331&format=png&auto=webp&s=2637d709f10f13b57a5a05e8d0c800698452bd4f

https://preview.redd.it/gkeyh99lz8bh1.png?width=922&format=png&auto=webp&s=19bf7ddef6925398a0ab2da6e8941435109b0488

https://preview.redd.it/ijuwg8p409bh1.png?width=1331&format=png&auto=webp&s=1b756ef1a78a6bfd8bdedb0487e675ae6f69ec63

https://preview.redd.it/ezbtn3fc29bh1.png?width=1303&format=png&auto=webp&s=765bc4258af9dcad0de4eedfcbc206c2a63e2e4c

https://preview.redd.it/vzvwoo4tj9bh1.png?width=860&format=png&auto=webp&s=244e2480409f21039d97c2872fe5507eb47e8416

https://preview.redd.it/53ipzx85s9bh1.png?width=1960&format=png&auto=webp&s=dc33a04ba77f6bbe647e2be4e23a5d69560542c0

reddit.com
u/Foreign-Watch-3730 — 1 day ago
▲ 4 r/speechtech+1 crossposts

Open source Speech recognition/transcription models in Indian languages

Hey, I'm looking for a speech recognition model with proper speaker diarization for Indian languages. It is for a meeting transcription use case.
Which are the best models available?
I plan to fine-tune, which is why I'm looking for open source.
Also curious whether anybody has tried the Sarvam models.

u/Specialist_Grab9164 — 2 days ago

Fine-tuning Qwen 3 TTS on low-resource languages or languages not officially supported by Qwen 3 TTS

I am trying to fine-tune Qwen 3 TTS on Indian languages. It worked well on Hindi, but it performs very poorly on other languages. How do I make the audio better? Any suggestions?

u/ravk_1234 — 1 day ago
▲ 17 r/speechtech+4 crossposts

I benchmarked local voice cloning models across English, German, Arabic, Spanish, and Chinese

I put together a dataset-backed benchmark for local voice cloning models:


- OmniVoice int8
- Chatterbox Multilingual fp16
- VoxCPM2 bf16
- Fish Audio S2 Pro fp16


It uses Google FLEURS test clips as references, then reports speaker similarity, WER/CER, generated audio length, and RTF. I also included the reference audio and generated clips for every row, so the table is not just numbers.
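For readers less familiar with the metrics named above, here is a minimal sketch of WER, CER, and RTF. It assumes the hypothesis text comes from transcribing the generated clip with some ASR system; the function names are illustrative and this is not the benchmark's own scoring code.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) between two sequences."""
    prev: List[int] = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate; assumes a non-empty reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate; assumes a non-empty reference."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: below 1.0 means generation is faster than playback."""
    return synthesis_seconds / audio_seconds
```

CER is usually the more meaningful of the two for languages without whitespace word boundaries, such as Chinese.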


Post:
https://www.soniqo.audio/blog/voice-cloning-benchmarks


The result that surprised me most: OmniVoice was the best all-around row set in this run, but the language-by-language behavior is more interesting than the aggregate. VoxCPM2 was strong on Arabic speaker match; Fish Audio had strong German/Arabic similarity but slower RTF; Chatterbox looked good on Arabic/Spanish but I am not benchmarking Chinese until the Swift tokenizer path is ready.


I maintain the Soniqo speech stack, so this is self-promo, but the benchmark is meant to be useful/reproducible rather than a launch post.


Speech Studio is the open-source desktop app built on the same stack:


https://www.soniqo.audio/speech-studio
https://github.com/soniqo/speech-studio


What model/language should I add next?
u/ivan_digital — 3 days ago

UK English IPA > UK Dialect IPA

I am planning on converting some existing UK English IPA dictionaries into various UK dialects, including some old traditional dialects.

I have found a lot of literature on the dialects in question and I think I can combine some scripting with LLM calls to help, especially with any OOV words, which are more likely in the older dialects.
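One way the scripting half could look, as a hedged sketch: ordered substitution rules for changes that apply everywhere, plus a hand-checked override table for changes that depend on the word. The rule list, `apply_rules`, and `convert_lexicon` are hypothetical. The single rule shown reflects the absence of the FOOT-STRUT split in many Northern English accents.

```python
from typing import Dict, List, Tuple

# Blanket rule: without the FOOT-STRUT split, STRUT /ʌ/ is realised as /ʊ/.
NORTHERN_RULES: List[Tuple[str, str]] = [("ʌ", "ʊ")]


def apply_rules(ipa: str, rules: List[Tuple[str, str]]) -> str:
    """Apply each (source, target) substitution in order."""
    for source, target in rules:
        ipa = ipa.replace(source, target)
    return ipa


def convert_lexicon(
    lexicon: Dict[str, str],
    rules: List[Tuple[str, str]],
    overrides: Dict[str, str],
) -> Dict[str, str]:
    """Prefer a hand-checked override; otherwise fall back to the rules."""
    return {
        word: overrides.get(word, apply_rules(ipa, rules))
        for word, ipa in lexicon.items()
    }


source = {"cup": "kʌp", "foot": "fʊt", "bath": "bɑːθ", "father": "fɑːðə"}
# BATH words shorten to /a/, but a blanket ɑː -> a rule would also hit "father",
# so word-dependent changes live in the override table instead.
dialect = convert_lexicon(source, NORTHERN_RULES, overrides={"bath": "baθ"})
```

Words missing from both the source dictionary and the override table are the natural candidates to queue for an LLM pass or manual review.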

I have no idea what I am doing, to be frank. Which is fine, I’ll muck about and get it sorted eventually.

Any tips, questions, or linguistic nerd discussions you’d like to have spring up from my silly passion project?

u/Spicy_mch4ggis — 5 days ago
▲ 2 r/speechtech+1 crossposts

ASR & TTS Data in English Language

We have a few hundred hours of English data from native speakers for ASR / TTS model training. All of the data is GDPR compliant and licensed. We also have more than 1000 native speakers who can contribute to further data generation if needed. As AI solutions and models keep growing, they need lots of data for training and tuning. As a startup in speech tech, we are not sure where or how to approach people or organizations to offer our datasets.

Does anyone have any ideas or information about this? It would be a great help if you could share your experience. Thank you so much in advance.

u/Debug_And_Solve — 4 days ago
▲ 15 r/speechtech+2 crossposts

VoxFlash-TTS: an ultra-compressed latent diffusion voice cloning model (9 Hz latent space, ONNX, zero-shot CN/EN)

I've been working on a zero-shot voice cloning TTS system and wanted to share it here in case it's useful to anyone working on similar problems.

What it is

VoxFlash-TTS is a flow-matching based voice cloning model that operates in a heavily compressed latent space: 9 Hz, instead of the much higher frame rates most diffusion/flow TTS systems use. The idea was to see how far latent compression could go before quality breaks down, since lower frame rates mean shorter latent sequences and faster inference for a given NFE budget.

Some of the design choices:

  • Phoneme encoder: ConvNeXtV2-based, rather than a standard Conformer/Transformer stack.
  • Generative backbone: flow matching with an Euler solver, NFE=16 by default.
  • Speaker conditioning: ConvNeXtV2 speaker encoder with attentive statistical pooling, fed into AdaLN.
  • Cross-lingual zero-shot cloning: Chinese and English, including code-switching.
  • Inference: exported to ONNX, packaged for Docker deployment, no Python training stack required at inference time.
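To make "flow matching with an Euler solver, NFE=16" concrete, here is a minimal sketch of a fixed-step Euler sampler. The toy velocity field stands in for the trained network; `euler_sample` and `make_toy_velocity` are illustrative names, and none of this is VoxFlash-TTS code.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(
    velocity: Callable[[Vector, float], Vector],
    x0: Vector,
    nfe: int = 16,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 in `nfe` Euler steps."""
    x = list(x0)
    dt = 1.0 / nfe
    for step in range(nfe):
        t = step * dt
        v = velocity(x, t)                        # one function evaluation per step
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Stand-in for the model: the velocity of a straight path toward `target`."""
    def velocity(x: Vector, t: float) -> Vector:
        # t never reaches 1.0 inside euler_sample, so the division is safe.
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


sample = euler_sample(make_toy_velocity([1.0, -2.0]), [0.0, 0.0], nfe=16)
```

In a real system `x0` would be noise with the shape of the latent sequence, and each `velocity` call would be a forward pass of the conditioned backbone, which is why NFE maps directly onto inference cost.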

Why 9 Hz

Most latent TTS systems run their diffusion/flow process at much higher temporal resolution. Compressing the latent sequence rate this aggressively is mainly a bet on inference cost — fewer latent frames per second of audio means a much smaller sequence for the flow matching model to denoise, which matters a lot if you care about real-time or low-resource deployment rather than just sample quality in isolation. It's a tradeoff, and I'd be curious to hear from others who've pushed compression in either direction.
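The cost argument can be put in numbers with a small sketch. The 50 Hz comparison rate below is an assumption chosen purely for illustration; the post does not name a specific baseline.

```python
import math


def latent_frames(audio_seconds: float, frame_rate_hz: float) -> int:
    """Number of latent frames the model has to denoise for a clip."""
    return math.ceil(audio_seconds * frame_rate_hz)


def frame_evaluations(audio_seconds: float, frame_rate_hz: float, nfe: int = 16) -> int:
    """Frames processed across all solver steps (sequence length times NFE)."""
    return latent_frames(audio_seconds, frame_rate_hz) * nfe


clip_seconds = 10.0
compressed = latent_frames(clip_seconds, 9.0)    # the 9 Hz latent space
baseline = latent_frames(clip_seconds, 50.0)     # hypothetical higher-rate latent
```

For a 10-second clip that is 90 frames instead of 500 per solver step, before accounting for any layers whose cost grows faster than linearly with sequence length.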

Links

Happy to answer questions about the architecture, the flow matching formulation, or the ONNX export pipeline — these were the trickiest parts to get right, especially the velocity target derivation and keeping VAE latent normalization consistent between training and inference.

u/Significant-Disk1890 — 6 days ago

Qwen3-TTS-Triton v0.3.0: Triton + CUDA Graph + batched AR TTS serving, ~14× per-sample throughput

Hi everyone,

I just released Qwen3-TTS-Triton v0.3.0.

When I first started this project, my focus was mostly on single-clip latency. In v0.1.0, I fused several hot paths with Triton (RMSNorm, SwiGLU, RoPE, etc.) and combined that with CUDA Graph. That got single-clip Qwen3-TTS inference to roughly 5× faster than vanilla PyTorch eager on my setup.

For v0.3.0, I shifted the focus a bit.

I tried PyTorch’s new Helion DSL as well, but for this specific workload it only gave me about 1.03× over the existing Triton version. That pushed me toward a different question: instead of only optimizing one sample at a time, what happens if we optimize batched serving throughput for autoregressive TTS?

The result:

Triton kernels + CUDA Graph + batched serving reached ~14× per-sample throughput vs vanilla PyTorch eager batch=1.

Test setup:

The part I found most interesting is memory efficiency. In the recommended hybrid mode, per-sample VRAM usage drops a lot compared to running each request independently:

  • Hybrid batched serving: around 0.49 GB per sample
  • Previous single-request style: around 4.4 GB per sample
  • Roughly 6–10× lower per-sample VRAM, depending on the comparison point

I also tried to make sure this was not just a “faster but different output” situation. Batched generation was tested against single-clip generation using my Tier 3 evaluation setup:

  • CER distribution
  • UTMOS distribution
  • Speaker similarity distribution
  • Mann-Whitney statistical comparison
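As an illustration of the last item, here is a sketch of a two-sided Mann-Whitney U test on two sets of per-utterance scores, for example CER from batched versus single-clip generation. It uses the normal approximation with tie correction and no continuity correction. The function name is illustrative and this is not the project's evaluation code; in practice `scipy.stats.mannwhitneyu` covers the same ground.

```python
import math
from typing import Sequence, Tuple


def mann_whitney_u(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (U statistic for sample a, two-sided p-value by normal approximation)."""
    n1, n2 = len(a), len(b)
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b], key=lambda p: p[0])
    ranks = [0.0] * len(pooled)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0          # mean of the 1-based ranks i+1..j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t ** 3 - t                # tie correction accumulator
        i = j
    rank_sum_a = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u = rank_sum_a - n1 * (n1 + 1) / 2.0
    n = n1 + n2
    variance = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return u, 1.0                         # every value is identical
    z = (u - n1 * n2 / 2.0) / math.sqrt(variance)
    return u, math.erfc(abs(z) / math.sqrt(2.0))
```

A large p-value here only means the test found no evidence of a shift between the two distributions; it is not proof that the outputs are equivalent.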

The recommended mode right now is hybrid, since it gives the best balance between throughput, memory, and quality-equivalence.

One thing I’ve been thinking about while working on this: Qwen3-TTS is autoregressive, while OmniVoice is non-autoregressive. I also maintain a Triton version for OmniVoice, and working on both has made the AR vs NAR serving tradeoff much more concrete for me. AR TTS feels much closer to LLM serving than I expected, especially once batching, graph capture, and memory efficiency become the bottlenecks.

Links:

I’d be curious to hear how other people here think about AR TTS serving. In particular, whether you think future local TTS systems should optimize more for lowest single-request latency, or for batched throughput / multi-user serving efficiency.

u/DamageSea2135 — 6 days ago
▲ 24 r/speechtech+3 crossposts

Prof Prathosh AP Open-Sources Vāgdhenu, a Vṛtta-Aware Text-to-Speech System for Chanting Sanskrit Shlokas

Found something I had to share. Prof. Prathosh A P (IISc Bengaluru) just open-sourced Vāgdhenu - an AI that chants Sanskrit shlokas perfectly, understanding meter and all. After 15 years of work, he's made the whole thing public with zero venture backing.

It's named after the Upanishadic phrase ॥ वाचं धेनुमुपासीत ॥ (like the divine wish-fulfilling cow). The world's first vrutta-aware, open-source TTS for Sanskrit - genuinely impressed by what one person can build with conviction.

Try the demo here: https://prathosh.in/vagdhenu/

As a Sanskrit student myself (SSS Pravesha level), I've already pasted 10 shlokas from my memorization list and downloaded them. First one's already locked in! This tool is a game-changer for learners.

u/v2click — 6 days ago
▲ 320 r/speechtech+8 crossposts

"AI for the Good of All"?

A study of Brazil's national AI plan finds that the phrase "for the good of all" masks a structural problem: AI algorithms are built to process people at scale, not as individuals, making the promise of equal benefit harder to deliver than the policy suggests.

doi.org
u/Cad_Lin — 8 days ago
▲ 19 r/speechtech+3 crossposts

Compressed Whisper large-v3-turbo to 368 MB with Q3_K-matched QAT — multilingual WER results

I’ve released Orbination Whisper AI, an experiment in compressing Whisper large-v3-turbo into a compact multilingual speech-to-text engine.

The default model is 368 MB using Q3_K quantization and runs through a Go runtime built on whisper.cpp, with no Python required at runtime. It supports CPU/GPU backends and includes CLI + HTTP server modes.

I focused on reducing the train/inference mismatch by training with the actual ggml Q3_K quantize/dequantize path in the forward pass, using a straight-through estimator and teacher distillation. The goal was to make the exported Q3_K checkpoint behave like the model seen during training, rather than fine-tuning first and losing accuracy after quantization.
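The straight-through estimator idea can be shown on a single weight. This is a toy sketch under loud assumptions: a uniform grid quantizer stands in for ggml's block-wise Q3_K, there is no teacher distillation, and `fake_quant` and `train_with_ste` are hypothetical names rather than anything from the repo.

```python
def fake_quant(w: float, step: float = 0.25, limit: float = 1.0) -> float:
    """Quantize then dequantize: clamp w and snap it to a uniform grid."""
    clamped = max(-limit, min(limit, w))
    return round(clamped / step) * step


def train_with_ste(x: float, y: float, w: float = 0.0,
                   lr: float = 0.1, steps: int = 20) -> float:
    """Fit y = w * x while the forward pass only ever sees the quantized weight."""
    for _ in range(steps):
        w_q = fake_quant(w)                  # forward pass uses the quantized weight
        grad_wq = 2.0 * (w_q * x - y) * x    # d(loss)/d(w_q) for squared error
        w -= lr * grad_wq                    # STE: treat d(w_q)/d(w) as 1
    return w


latent_w = train_with_ste(x=1.0, y=0.5)
exported_w = fake_quant(latent_w)            # what a quantized export would hold
```

The latent weight keeps moving until its quantized value fits the target, which is the property that lets the exported checkpoint behave like the model seen during training.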

WER on held-out FLEURS, using beam search in the deployed Go runtime:

- Q3_K, 368 MB: EN 0.065, ES 0.050, FR 0.065, EL 0.148
- Q4_K, 474 MB: EN 0.062, ES 0.048, FR 0.063, EL 0.124
- Q5_K, 574 MB: EN 0.061, ES 0.047, FR 0.061, EL 0.110
- FP16 upper bound, 1.6 GB: EN 0.061, ES 0.046, FR 0.060, EL 0.108

The interesting part for me is that the high-resource languages stay close across precisions, while Greek shows the biggest sensitivity to quantization.
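The sensitivity claim can be checked directly from the table above by computing each precision's relative WER increase over the FP16 row. The numbers are copied from the post; `relative_increase` is just an illustrative helper.

```python
# WER values as reported in the list above.
WER = {
    "Q3_K": {"EN": 0.065, "ES": 0.050, "FR": 0.065, "EL": 0.148},
    "Q4_K": {"EN": 0.062, "ES": 0.048, "FR": 0.063, "EL": 0.124},
    "Q5_K": {"EN": 0.061, "ES": 0.047, "FR": 0.061, "EL": 0.110},
    "FP16": {"EN": 0.061, "ES": 0.046, "FR": 0.060, "EL": 0.108},
}


def relative_increase(model: str, baseline: str = "FP16") -> dict:
    """Relative WER increase per language versus the baseline precision."""
    return {
        lang: (WER[model][lang] - WER[baseline][lang]) / WER[baseline][lang]
        for lang in WER[model]
    }


q3 = relative_increase("Q3_K")
for lang, delta in sorted(q3.items(), key=lambda item: item[1]):
    print(f"{lang}: +{delta:.1%} WER vs FP16")
```

At Q3_K, English, Spanish, and French stay within roughly 7 to 9 percent of the FP16 error rate, while Greek rises by about 37 percent.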

Repo:
https://github.com/amichail-1/Orbination-Whisper-AI

I’d be interested in feedback from people working with Whisper, whisper.cpp, QAT, or multilingual ASR deployment.

u/antonismix36 — 7 days ago

Parakeet-TDT-v3 vs Whisper-Turbo-v3 vs Mega-ASR (Qwen3-ASR): What are people using in production for real-time voice agents?

I'm building a production voice AI system (STT → LLM → TTS) and have been evaluating three ASR options:

  • Parakeet-TDT-v3 (self-hosted via vLLM-Omni)
  • Whisper-Turbo-v3 (via Groq)
  • Mega-ASR / Qwen3-ASR fine-tune (vLLM-Omni)

From my own testing:

  • Whisper-Turbo-v3 produced the best transcription quality overall.
  • Qwen3-ASR / Mega-ASR was noticeably better than Parakeet-TDT-v3.
  • Parakeet-TDT-v3 occasionally missed or misidentified certain words, especially in conversational speech.

However, I keep seeing people recommend Parakeet-v3 as the best open ASR model for production deployments.

For those who have deployed these models in real systems:

What has your experience been with transcription quality? How do they compare on noisy audio and accented speech?

I'd love to hear experiences from people running these models in production for real time voice assistants rather than benchmark-only evaluations.

u/Dark-Horn — 12 days ago
▲ 8 r/speechtech+2 crossposts

ChatGPT Advanced Voice Mode

As far as I can tell, the best voice mode for LLMs right now is ChatGPT's advanced voice mode. That's still running on ChatGPT 4o. So, it's a fascinating toy with occasional real value.

I keep thinking about the AI Assistant in HER, obviously not the romantic part.

The biggest differences right now are the persistence of existence beyond when responding to prompts and thinking quality. Well, we're not going to solve the persistence issue any time soon (at least I hope not).

So, what's left is reasoning quality and integrations. Well, today, with 5.5 and app connectors, I feel like we're getting pretty close if you want to limit your interactions to a keyboard. For multimodal though, we really need a substantial upgrade from 4o.

OpenAI has released GPT-Realtime-2 which supports voice using ChatGPT 5.5. So, I imagine it's only a matter of time before we see a substantial upgrade to advanced voice mode. At least I sure hope so.

OpenAI, if you're reading this, please let us know at least the theoretical plan.

u/Hybrid-Intelligence — 8 days ago
▲ 21 r/speechtech+8 crossposts

I made a free voice-typing app for Windows after my hands started hurting from typing

Years of long days at the keyboard caught up with my hands. I got tired of two options: push through the ache, or pay for one of the dictation apps I'd tried (Wispr Flow is $144/yr). So I built my own. It's called Pipevoice. It's free, no account, and the code is on GitHub. I made it, so to be clear, I'm showing it and asking for feedback, not selling anything.

How it works: hold a key, talk, let go, and the text drops into whatever app you already had focused. Browser, a text box, your editor, a terminal, doesn't matter. It isn't a separate window you copy out of, so you never have to remember what you said and paste it somewhere. There's a second hotkey that puts the text on your clipboard instead, which I reach for when I'm filling out forms.

Two settings I think this sub will care about. One is an accent picker (UK/US/AU/Indian/NZ). The other is a plain notes field where you describe how you actually talk. Mine says "I stutter and use a lot of fillers," and the optional AI cleanup pass uses that instead of typing out every "um." Don't want any of that? Turn cleanup off and it keeps your words exactly as spoken.

Cost and privacy both come up a lot, so: free, no subscription, no sign-up. You can run it fully offline with local transcription, in which case your voice never leaves your PC. Or plug in your own API key for a cloud engine if you want it faster. Nothing goes through me either way.

A 3-minute demo video is in the post if you want to watch it work first.

What I actually want from you all: where does it fall short? If you rely on voice input every day, what breaks in the tools you've used? The accent and speech-pattern side is the part I'm least sure I've gotten right, and I'd rather hear that from people it affects than keep guessing.

u/powleads — 13 days ago
▲ 2 r/speechtech+2 crossposts

PEFT and contextual biasing for TTS domain adaptation

I'm working on a cascade-based TTS model for a very narrow domain with a specific noise environment and a heavy jargon library. I've tried every provider's ASR/TTS models and honestly they're not good. Really not good (high WER and tutti quanti). So as an ML engineer I'm trying to fine-tune TTS models (currently on Qwen3 ASR), but I'm a text guy, not an audio guy, and so is my team haha. So we are learning and trying to build one because off-the-shelf models are bad, but it's very difficult and we need to ship soon. Does anyone in the same situation have advice, or can anyone direct me to GitHub discussions, Discord groups, or events that may help? (I'm US based but I can travel anywhere if there is real value and a real community of people blocked like us and looking for a solution.)

u/Busy-Banana-5257 — 10 days ago

Is Whisper still the best default for speech-to-text if the app needs to be realtime?

For batch transcription, Whisper / faster-whisper / whisper.cpp still feel like the default starting point.

But I’m trying to separate two use cases:

1. Batch transcription
Upload audio → wait → transcript
For this, Whisper is still great. Especially if privacy/local matters.

2. Realtime voice app / voice agent
User speaks → partial transcript → LLM starts reasoning → agent responds
Here the requirements feel very different.

The problems I keep seeing:

  • chunking delay
  • VAD / endpointing hacks
  • no native diarization
  • timestamps need extra work
  • mixed-language audio gets messy
  • GPU cost if you want scale
  • hard to get low p95 latency
  • local setup becomes infra work
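Two of the items above, chunking delay and p95 latency, are easy to quantify with a small sketch. The nearest-rank percentile and the "chunk length plus inference time" model are simplifying assumptions, and the sample latencies are made up for illustration.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def first_partial_latency_ms(chunk_ms: int, inference_ms: int) -> int:
    """A chunked recognizer cannot emit text until a full chunk is buffered and decoded."""
    return chunk_ms + inference_ms


# Hypothetical end-to-end latencies in milliseconds for 20 requests.
latencies_ms = [100] * 18 + [400, 900]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
```

The median looks fine in this made-up sample while the tail is several times slower, which is the usual reason a self-hosted setup feels worse in a live conversation than its average latency suggests.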

Hosted tools I’m seeing people test: Deepgram, AssemblyAI, Speechmatics, Soniox, Gladia, OpenAI realtime/transcribe, and now Smallest AI Pulse for realtime STT.

I’m not trying to dunk on Whisper. It’s still the baseline.

But for a live voice agent or realtime captioning product, when do you personally stop self-hosting and move to a streaming STT API?

Is the line latency? concurrency? diarization? maintenance? cost?

u/Relevant_Duty_7248 — 14 days ago

Anyone running SALMs in production? (Voxtral style models) Looking for training recipes and open-source implementations

I'm curious whether anyone here is actually running SALMs in production today, or actively experimenting with them.

A reasonable starting point seems to be something like:

  • Voxtral-Small + TTS
  • Whisper / mimi-style audio encoder + existing LLM backbone (Qwen, Gemma, etc.)
  • Speech adapters on top of strong tool-calling LLMs

What I'm more interested in is the training side rather than inference.

For example, suppose we take:

  • Whisper / Mimi as an audio encoder
  • Qwen3 / Gemma as the backbone LLM
  • Freeze most of the LLM initially
  • Train an audio adapter / projector
  • Continue with SFT, distillation, RL, or some combination

Questions:

  1. Has anyone actually built and deployed something like this?
  2. What datasets are people using? Pure ASR data, speech-instruction data, synthetic data, or some mixture?
  3. How are you generating/cooking the data for tool-calling and conversational voice assistants?
  4. Are there any open-source implementations, training recipes, cookbooks, or papers you'd recommend?
  5. How well do these systems scale compared to a traditional voice stack?
  6. What ended up being the hardest part: data, alignment, latency, turn-taking, tool calling, or something else?

Would love to hear from people who've trained these systems themselves rather than only consuming hosted APIs.

u/Dark-Horn — 12 days ago