
Better word highlighting for complex text-to-speech documents
I’m Joe, the founder of Paper2Audio, a text-to-speech service that turns PDFs, research papers, ebooks, and web articles into audio, with a focus on accuracy for complex documents.
I really appreciate all of the outstanding feedback I’ve received from many members of this community. We’ve made a ton of improvements to Paper2Audio based on your feature requests and issue reports.
One of the more interesting product/engineering problems we solved recently is how to handle word-level highlighting when the text spoken by the text-to-speech model is not the same as the text shown in our audio transcript UI.
In more complex documents like research papers or reports, the displayed text might include math equations, HTML tags, Roman numerals, or similar formatting, but the spoken text needs to be normalized first so it sounds right. For example, $x^2 + y^2 = r^2$ might be spoken as “x squared plus y squared equals r squared,” while the transcript UI still needs to highlight the math as it is being narrated. We want users to see the original rich document formatting, not a word-by-word transcript of the audio.
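To make the normalization step concrete, here is a minimal Python sketch. It is not Paper2Audio's normalizer: the function names (`roman_to_int`, `normalize_for_speech`) and the rules are hypothetical, and it only covers HTML tags, bracketed numeric citations, and Roman numerals after a heading word. Math-to-speech conversion is left out entirely.

```python
import re

ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_to_int(numeral):
    """Convert a Roman numeral such as 'XIV' to an integer."""
    total = 0
    # Pair each symbol with the one after it; a smaller value before a
    # larger one is subtracted (the I in IV), otherwise it is added.
    for ch, nxt in zip(numeral, numeral[1:] + " "):
        value = ROMAN[ch]
        total += -value if ROMAN.get(nxt, 0) > value else value
    return total

def normalize_for_speech(text):
    """Hypothetical normalizer: displayed text in, speakable text out."""
    text = re.sub(r"<[^>]+>", "", text)   # drop HTML tags
    text = re.sub(r"\[\d+\]", "", text)   # drop numeric citations like [12]
    # Only convert Roman numerals that follow a heading word (an assumption
    # made here so that a standalone "I" is not read as "1").
    text = re.sub(
        r"\b(Part|Chapter|Section) ([IVXLCDM]+)\b",
        lambda m: f"{m.group(1)} {roman_to_int(m.group(2))}",
        text,
    )
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace

spoken = normalize_for_speech("<b>Part III</b> shows [12] that it works.")
```

After a pass like this, the spoken string no longer lines up with the displayed one character for character, which is where the highlighting problem comes from.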
This mismatch between displayed and spoken words creates a timestamp problem for word highlighting. Our TTS model (Kokoro) gives us word-level timestamps for the spoken text, but our UI needs to highlight the formatted document text as it is being read aloud. A simple character-count mapping doesn’t work because the two strings can differ in words, punctuation, and length, and a single visual token sometimes maps to many spoken words.
To solve this problem, we treat the spoken text and the text displayed in the audio transcript (our “Reader View”) as two separate versions of the same content, then align them with a general-purpose alignment algorithm we developed. After TTS runs, we use the words that match in both versions as anchors, then reconcile the mismatched regions between those anchors. This lets us display the original formatting in the audio transcript while making sure the portion being read aloud is highlighted at the correct time in our Reader View. Check out our blog post if you want more details.
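For anyone who wants to experiment with the anchor idea, here is a rough Python sketch. It is a stand-in, not our production algorithm: it uses the standard library's `difflib.SequenceMatcher` to find matching words, and the function name `align`, the millisecond timestamp tuples, and the rule of splitting a mismatched time span evenly across the displayed tokens are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

def align(display_tokens, spoken_words):
    """Map TTS word timestamps onto displayed tokens.

    display_tokens: list of str shown in the transcript.
    spoken_words:   list of (word, start_ms, end_ms) from the TTS model.
    Returns a list of (token, start_ms, end_ms); tokens that were never
    spoken get (token, None, None).
    """
    def key(word):
        # Compare case-insensitively and ignore edge punctuation.
        return word.lower().strip(".,;:!?")

    a = [key(t) for t in display_tokens]
    b = [key(w) for w, _, _ in spoken_words]
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)

    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Anchors: identical words copy their timestamps directly.
            for i, j in zip(range(i1, i2), range(j1, j2)):
                _, start, end = spoken_words[j]
                out.append((display_tokens[i], start, end))
        elif tag == "replace":
            # Mismatched region: split the spoken time span evenly
            # across the displayed tokens (integer milliseconds).
            start, end = spoken_words[j1][1], spoken_words[j2 - 1][2]
            n = i2 - i1
            for k in range(n):
                lo = start + (end - start) * k // n
                hi = start + (end - start) * (k + 1) // n
                out.append((display_tokens[i1 + k], lo, hi))
        elif tag == "delete":
            # Displayed but not spoken: nothing to highlight.
            for i in range(i1, i2):
                out.append((display_tokens[i], None, None))
        # "insert" (spoken only, nothing displayed) needs no highlight.
    return out

display = ["Part", "III", "shows", "[12]", "that", "$x^2$", "grows."]
spoken = [("Part", 0, 300), ("3", 300, 600), ("shows", 600, 1000),
          ("that", 1000, 1200), ("x", 1200, 1400),
          ("squared", 1400, 1900), ("grows", 1900, 2400)]
result = align(display, spoken)
```

In this toy input, “III” picks up the time span of “3”, the citation gets no timestamp, and the math token covers both “x” and “squared”. An even split is a crude way to reconcile mismatched regions, and it will assign a time to a silent token that lands in the same region as a spoken one, so treat it as a starting point only.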
With this solution, when a citation is visible but skipped in speech, it does not get its own timestamp. When “Part III” is spoken as “Part 3,” it still lines up.
Please let us know if you have any questions or feedback about this post. We’re thinking about writing additional technical blog posts about different challenges we’ve encountered building Paper2Audio, so please feel free to request topics.