r/speechtech

Just launched ContextLM on PH today. The most expressive Text-to-Speech platform.
▲ 6 r/speechtech+2 crossposts

Hey 👋

We just launched ContextLM on Product Hunt today 🚀

ContextLM is an expressive, context-aware, LLM-based Text-to-Speech and Text-to-Podcast platform that lets users instantly clone a voice and generate human-like speech using custom prompts.

Your upvote and feedback will be appreciated.

We have 10,000 FREE credits 🎁 ready for everyone in this community who shares, upvotes, or comments on our launch today.

DM me for your credits.

Please upvote and comment on Product Hunt:

https://www.producthunt.com/products/contextlm?comment=5382565

Thank you 😊

u/herberz — 1 day ago

Making text to speech word highlighting work for complex documents

https://preview.redd.it/a5dadaznr52h1.png?width=2816&format=png&auto=webp&s=6fa6ca14c57b1aba9b533603141bab3457a422a1

I’m Joe, the founder of Paper2Audio, a text to speech service that turns PDFs, research papers, ebooks, and web articles into audio, with a focus on accuracy for complex documents.

We’ve recently come up with a solution to a text to speech processing challenge: how to combine accurate text to speech pronunciation with a rich transcript view that maintains the formatting details of the original document, and keeps word-level highlighting accurate when the text shown to the user is not the same text spoken by the TTS model.

In more complex documents like research papers or reports, the displayed text might include math equations, HTML tags, markdown, Roman numerals, or other similar formatting. But the spoken text needs to be normalized first so it sounds right. For example, $x^2 + y^2 = r^2$ is read as “x squared plus y squared equals r squared,” while the transcript highlights the math.

We wrote up a blog post covering how we built a reconciliation algorithm that maps TTS word timestamps back onto the original formatted document. Our solution is basically a translation layer after TTS. Our TTS model tells us when each word in the cleaned-up spoken text is said. We then line that back up with the richer document text users actually see. Instead of writing separate rules for equations, citations, formatting, and punctuation, we look for matching words in both versions and use them to keep the two texts synced, so that word-level highlighting in the audio transcript (our “Reader View”) works properly.
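A minimal sketch of the word-matching idea described above, not Paper2Audio's actual implementation. It assumes the displayed document is already split into tokens and that the TTS model returns `(word, start, end)` tuples; the function names and the choice of Python's `difflib.SequenceMatcher` for the alignment are illustrative assumptions.

```python
import re
from difflib import SequenceMatcher


def normalize(word: str) -> str:
    """Lowercase a token and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", word.lower())


def align_timestamps(display_tokens, spoken_words):
    """Map TTS word timings back onto the displayed tokens.

    display_tokens: list of str, the text as shown to the user.
    spoken_words:   list of (word, start_sec, end_sec) from the TTS model.
    Returns one (start_sec, end_sec) tuple, or None, per display token.
    """
    disp_norm = [normalize(t) for t in display_tokens]
    spok_norm = [normalize(w) for w, _, _ in spoken_words]
    matcher = SequenceMatcher(a=disp_norm, b=spok_norm, autojunk=False)

    spans = [None] * len(display_tokens)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Matching words: copy each word's timing one-to-one.
            for k in range(i2 - i1):
                _, start, end = spoken_words[j1 + k]
                spans[i1 + k] = (start, end)
        elif tag == "replace":
            # Unmatched run on both sides (e.g. an equation vs. its spoken
            # form): every display token shares the whole spoken time span.
            start = spoken_words[j1][1]
            end = spoken_words[j2 - 1][2]
            for i in range(i1, i2):
                spans[i] = (start, end)
        # "delete": display-only tokens keep None (nothing was spoken).
        # "insert": spoken-only words have nothing to highlight.
    return spans
```

With a displayed sentence such as `["The", "circle", "$x^2+y^2=r^2$", "is", "round."]`, the plain words on either side of the equation match directly, and the equation token inherits the time span of the spoken words that fall between them.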

We were able to improve both the reading and the listening experience without changing the underlying TTS model itself. The audio output stays the same, but the post-processing layer lets us preserve rich document rendering, better pronunciation, and accurate highlighting at the same time.  

As far as we can tell, other text to speech services haven’t figured out how to solve this problem.  I would love feedback from people who have worked on TTS highlighting.  Does this general reconciliation approach match how you’d solve it?  Do you think there are any failure modes we should watch for?

u/goldenjm — 2 days ago
▲ 7 r/speechtech+2 crossposts

Which TTS API provider would you recommend for long-ish narrations?

I'm making an app where an AI narrates a story for the player to take part in. The app is turn-based, and each turn typically generates around 400 words of narration.

Which TTS API providers would you recommend that can produce around 2–3 minutes of audio in a single request?

I tested Qwen TTS on Alibaba Cloud, but it seems to cut the output off after about 50 seconds, and chunking the audio sounds really bad because the voice changes pitch between chunks.

I'm aiming for a TTS API provider in the range of $13–15 USD per million characters, preferably multilingual.

Any recommendations?

u/popyui — 8 days ago

What's a good refresher/crash course on speech analytics, natural language processing and sentiment analysis for someone who hasn't done this stuff in a few years?

I haven't done much data science, machine learning, or NLP in the past few years. I would like a refresher/crash course on speech analytics, NLP, and sentiment analysis techniques, especially how they're done today, including speech analytics programs like Nexidia, CallMiner, etc. I'm preparing for a job I will start in a couple of weeks, so preferably something I can review over a week or so. Thanks!

u/JustAPieceOfMeat385 — 7 days ago

Vibration and Distortion in CosyVoice3 Fine Tuned Model

I fine-tuned Fun-CosyVoice3-0.5B, but after training, during inference I observe significant distortion, noise, and vibration in the generated audio.

To isolate the issue, I performed the following tests:

1. HiFiGAN-only test

  • Regenerated audio directly from an input audio chunk using HiFiGAN (no tokenizer or Flow)
  • Regenerated output matches the original clean audio
  • Suggests HiFiGAN is not the source of the issue

2. Full pipeline test (tokenizer → Flow → HiFiGAN)

  • Passed clean audio samples from my dataset through the full pipeline
  • Regenerated output contains noticeable vibration and distortion, despite clean input

3. Base vs fine-tuned Flow

Tested with both:

  • Base Flow model
  • Fine-tuned Flow model
  • Both produce similar vibration artifacts

Additional observation:

  • A clicking/mouse-like sound appears at the start and end of generated audio

What I’ve tried:

  • Multiple audio normalization techniques (LUFS) before feeding data to the tokenizer
  • Also tried de-clipping
  • No improvement

I have been stuck on this for weeks now and I cannot figure out a way forward. It would be really helpful if someone with past experience working with CosyVoice could help out.

Questions:

  • Has anyone encountered similar vibration/distortion artifacts in the tokenizer → Flow → HiFiGAN pipeline?
  • Could this be related to tokenizer encoding/decoding mismatch or preprocessing?
  • Any suggestions on debugging?

u/NoTransition8017 — 10 days ago

Looking for help with a specific use case: speaker diarization between two individuals in a noisy atmosphere. Have tried a Seeed Studio microphone and a Raspberry Pi, but the audio isn't clear enough. Need help.

I have been trying to capture voices in a noisy atmosphere with a Seeed Studio ReSpeaker XVF3800 and a Raspberry Pi, but I can't get the audio clear enough to do speaker diarization at a high enough level to accomplish what I need. I'm looking for someone to help me solve this problem. I think I need a sound engineer and someone who also knows how to leverage AI to enhance the captured audio so this can be done at scale. Is anyone interested, or do you know someone who might be able to help?

u/FitStatistician2661 — 10 days ago
▲ 6 r/speechtech+1 crossposts

best voice api

Hello, I'm building an app via vibe coding, and it really needs audio in and audio out for the AI questions and answers. What are people's experiences of the best way to achieve ultra-clear audio in and audio out? #audioai #vibecoding #ai #helpneeded

u/ofah1974 — 12 days ago
▲ 21 r/speechtech+5 crossposts

I was recently trying to transcribe an interview for my dad, and he was very cautious about uploading anything to a cloud service, which made sense. When I looked for local options, everything required complex self-hosted setups that would have taken an hour to configure. So instead of doing the 1-hour setup, I spent the next 4 making an in-browser, zero-setup tool anyone can use to transcribe audio locally. Your audio never leaves your device; you can even turn off your wifi to prove it (after the models load, of course). Give it a try and let me know what you think. I would love feedback from this community especially.

u/Gizmo_4Life — 13 days ago

Best APIs for speech to text?

Hi colleagues, I have a SaaS that transcribes 10 million minutes of audio per month, and I've tried many different processing methods. Currently, I'm using orchardrun.com because it offers the best performance and price (0.025 per hour) and allows me to handle fairly large audio files. But do you know of any other, more economical options?
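For anyone comparing quotes against this baseline, the figures above work out roughly as follows. This is a quick sketch that assumes the 0.025 rate is charged per audio hour in a single currency (the post does not say which one); the function name is illustrative.

```python
def monthly_cost(minutes_per_month: float, price_per_hour: float) -> float:
    """Convert minutes to hours, then multiply by the hourly rate."""
    return minutes_per_month / 60 * price_per_hour


# 10 million minutes is about 166,667 audio hours per month.
baseline = monthly_cost(10_000_000, 0.025)
print(f"{baseline:,.2f} per month")
```

Any alternative provider would need to come in under about 4,167 per month (same currency) at this volume to be cheaper.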

u/SmoothConnection1670 — 13 days ago

Building a Voice Assistant for Medication Reminders — Wake Word Detection Was Harder Than Expected

We’ve been building a voice-first medication assistant at https://www.wiserx.health/. Patients talk to the assistant directly, and the experience is focused on helping them manage medications at home without apps or caregivers.

One of the hardest parts for us was wake word detection. We tested a few public/open solutions, but accuracy in real-world home environments wasn’t great, especially with elderly users, background TV noise, accents, etc. We also looked at Picovoice, but it was pretty expensive for our stage as a startup.

We ended up working with https://davoice.io/ for custom wake word models and speaker identification, and honestly it’s been solid so far. Detection accuracy has been much better for our use case and we’ve seen way fewer false positives compared to what we tested earlier. Importantly, we were trying to optimize CPU usage, and the team at DaVoice helped us tweak the model and gave us an efficient one. Beyond wake word detection, they also offer speaker identification and isolation.

Curious what others here are using for wake word detection on embedded/edge devices and how you’re handling noisy environments.

u/FinishHot5984 — 13 days ago

Anyone using speech-to-text for Indian languages in production? What's actually working and what's not?

Marketing pages claim 90%+ accuracy on Hinglish. Reality from the teams I've talked to looks very different.

If you're using or have evaluated Indian-language STT for any use case (voicebots, call analytics, video KYC, transcription, voice search, etc.), I would love to hear what you picked, why, and where it falls short.

Happy to share my learnings. Drop a comment or DM for a 30 min chat.

u/Spare-Ad2520 — 14 days ago