u/goldenjm

Better word highlighting for complex text to speech documents

https://preview.redd.it/9af3otchwd2h1.png?width=2816&format=png&auto=webp&s=c1c3df477e5301b6b51467fff94c90dc0dec19e2

I’m Joe, the founder of Paper2Audio, a text to speech service that turns PDFs, research papers, ebooks, and web articles into audio, with a focus on accuracy for complex documents.

I really appreciate all of the outstanding feedback I’ve received from many members of this community.  We’ve made a ton of improvements to Paper2Audio based on your feature requests and issue reports.

One of the more interesting product/engineering problems we solved recently is how to handle word-level highlighting when the text spoken by the text to speech model is not the same text shown in our audio transcript UI.

In more complex documents like research papers or reports, the displayed text might include math equations, HTML tags, Roman numerals, or other similar formatting. But the spoken text needs to be normalized first so it sounds right. For example, $x^2 + y^2 = r^2$ might be spoken as “x squared plus y squared equals r squared,” while the transcript UI still needs to highlight the math as it is being narrated. We want users to see the original rich document formatting, not the word-by-word audio transcript.

The mismatch between displayed and read aloud words creates a timestamp problem for word highlighting. Our TTS model (Kokoro) gives us word-level timestamps for the spoken text, but our UI needs to highlight the formatted document text as it is being read aloud. A simple character-count mapping doesn’t work because the two strings can have different words, different punctuation, different lengths, and sometimes one visual token maps to many spoken words.
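To make the drift concrete, here is a toy sketch (an illustration only, not the actual Paper2Audio code) of the naive approach: map each spoken word to the display token at the same relative character position. The function name `proportional_index` and the sample sentence are made up for this example.

```python
def proportional_index(spoken_words, display_tokens, spoken_idx):
    """Naive mapping: pick the display token that sits at the same
    relative character position as the spoken word."""
    # Count one extra character per word for the separating space.
    spoken_offset = sum(len(w) + 1 for w in spoken_words[:spoken_idx])
    spoken_total = sum(len(w) + 1 for w in spoken_words)
    display_total = sum(len(t) + 1 for t in display_tokens)
    target = spoken_offset / spoken_total * display_total

    pos = 0
    for i, tok in enumerate(display_tokens):
        pos += len(tok) + 1
        if target < pos:
            return i
    return len(display_tokens) - 1


display = ["where", "$x^2 + y^2 = r^2$", "holds"]
spoken = "where x squared plus y squared equals r squared holds".split()

# Spoken word 1 is "x", which belongs to the math token (index 1),
# but the proportional guess still points at "where" (index 0).
print(proportional_index(spoken, display, 1))
```

The spoken version of the equation is much longer than the displayed one, so every spoken position before the equation's end gets pulled toward the start of the display text.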

To solve this problem, we treat the spoken text and the text displayed in the audio transcript (our “Reader View”) as two separate versions of the same content, then apply a general alignment algorithm we developed to line them up. After the TTS runs, we use words that match in both versions as anchors, then reconcile the mismatched regions between those anchors. This lets us display the original formatting in the audio transcript while making sure the portion being read aloud is highlighted at the correct time in Reader View. Check out our blog post if you want more details.

With this solution, when a citation is visible but skipped in speech, it does not get its own timestamp. When “Part III” is spoken as “Part 3,” it still lines up.
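For anyone who wants to play with the general idea, here is a minimal sketch of anchor-based alignment. It is not our production algorithm: it swaps in Python's standard `difflib.SequenceMatcher` to find the matching words, and the names `norm` and `align`, the token lists, and the timestamps are all invented for the example.

```python
from difflib import SequenceMatcher


def norm(tok):
    # Lowercase and drop punctuation so "Squared," can anchor on "squared".
    return "".join(ch for ch in tok.lower() if ch.isalnum())


def align(display, spoken):
    """display: list of display tokens (str).
    spoken: list of (word, start_sec, end_sec) from the TTS model.
    Returns one (start, end) tuple or None per display token."""
    a = [norm(t) for t in display]
    b = [norm(w) for w, _, _ in spoken]
    spans = [None] * len(display)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Anchors: identical words map one to one.
            for k in range(i2 - i1):
                _, start, end = spoken[j1 + k]
                spans[i1 + k] = (start, end)
        elif tag == "replace":
            # Mismatched region: display tokens share the whole spoken span.
            start = spoken[j1][1]
            end = spoken[j2 - 1][2]
            for i in range(i1, i2):
                spans[i] = (start, end)
        # "delete": display-only tokens (e.g. a skipped citation) stay None.
        # "insert": spoken-only words have nothing on screen to highlight.
    return spans


display = ["Part", "III", "shows", "this", "[12]", "clearly"]
spoken = [("Part", 0.0, 0.3), ("3", 0.3, 0.6), ("shows", 0.6, 0.9),
          ("this", 0.9, 1.1), ("clearly", 1.1, 1.6)]
spans = align(display, spoken)
```

In this toy run, “III” picks up the time span of the spoken “3”, and the citation “[12]” gets no timestamp because nothing was spoken for it. A single visual token that expands to several spoken words (like an equation) would take the span from the first spoken word's start to the last one's end.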

Please let us know if you have any questions or feedback about this post.  We’re thinking about writing additional technical blog posts about different challenges we’ve encountered building Paper2Audio, so please feel free to request topics.

u/goldenjm — 1 day ago

Making text to speech word highlighting work for complex documents

https://preview.redd.it/a5dadaznr52h1.png?width=2816&format=png&auto=webp&s=6fa6ca14c57b1aba9b533603141bab3457a422a1

I’m Joe, the founder of Paper2Audio, a text to speech service that turns PDFs, research papers, ebooks, and web articles into audio, with a focus on accuracy for complex documents.

We’ve recently come up with a solution to a text to speech processing challenge: how to combine accurate text to speech pronunciation with a rich transcript view that maintains the formatting details of the original document, while keeping word-level highlighting accurate when the text shown to the user is not the same text spoken by the TTS model.

In more complex documents like research papers or reports, the displayed text might include math equations, HTML tags, markdown, Roman numerals, or other similar formatting. But the spoken text needs to be normalized first so it sounds right. For example, $x^2 + y^2 = r^2$ is read as “x squared plus y squared equals r squared,” while the transcript highlights the math.
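To give a feel for what “normalized” means here, below is a deliberately tiny sketch that only handles squared variables and three operators inside `$...$`. It is a toy under those assumptions, not how Paper2Audio actually normalizes math, and the names `SYMBOLS`, `speak_math`, and `normalize` are hypothetical.

```python
import re

# Hypothetical, very incomplete symbol table for the example.
SYMBOLS = {"+": "plus", "-": "minus", "=": "equals"}


def speak_math(expr):
    """Turn a space-separated inline formula into speakable words."""
    words = []
    for tok in expr.split():
        squared = re.fullmatch(r"([A-Za-z])\^2", tok)
        if squared:
            words.append(f"{squared.group(1)} squared")
        else:
            # Unknown tokens pass through unchanged.
            words.append(SYMBOLS.get(tok, tok))
    return " ".join(words)


def normalize(text):
    """Replace each $...$ span with its spoken form; leave other text alone."""
    return re.sub(r"\$([^$]+)\$", lambda m: speak_math(m.group(1)), text)


print(normalize("The circle $x^2 + y^2 = r^2$ is centered at the origin."))
```

The display text keeps the original `$x^2 + y^2 = r^2$`, while the TTS model only ever sees the normalized string, which is exactly why the two versions drift apart.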

We wrote up a blog post covering how we built a reconciliation algorithm that maps TTS word timestamps back onto the original formatted document. Our solution is basically a translation layer after TTS. Our TTS model tells us when each word in the cleaned-up spoken text is said. We then line that back up with the richer document text users actually see. Instead of writing separate rules for equations, citations, formatting, and punctuation, we look for matching words in both versions and use them to keep the two texts synced. With the two texts in sync, word-level highlighting in the audio transcript (our “Reader View”) works properly.
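Once every displayed token has a time span (or none, if it is never spoken), the playback side is a simple lookup. Here is a minimal sketch of that lookup using Python's standard `bisect` module. The function names `build_index` and `token_at` and the sample spans are assumptions for illustration, not our actual player code.

```python
from bisect import bisect_right


def build_index(spans):
    """spans: one (start, end) tuple or None per display token.
    Returns (starts, timed), both sorted by start time."""
    timed = [(span[0], span[1], i)
             for i, span in enumerate(spans) if span is not None]
    timed.sort()
    starts = [start for start, _, _ in timed]
    return starts, timed


def token_at(index, t):
    """Return the display token index to highlight at playback time t."""
    starts, timed = index
    k = bisect_right(starts, t) - 1  # last token starting at or before t
    if k < 0:
        return None
    _, end, i = timed[k]
    return i if t < end else None


# Token 2 is visible but never spoken (e.g. a skipped citation).
spans = [(0.0, 0.3), (0.3, 0.6), None, (0.6, 1.0)]
index = build_index(spans)
print(token_at(index, 0.45))
```

Tokens without a span are never returned, so they are simply never highlighted. If several display tokens share one span, this sketch returns only one of them, so a real player would need to group those tokens and highlight them together.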

We were able to improve both the reading and the listening experience without changing the underlying TTS model itself. The audio output stays the same, but the post-processing layer lets us preserve rich document rendering, better pronunciation, and accurate highlighting at the same time.  

As far as we can tell, other text to speech services haven’t figured out how to solve this problem.  I would love feedback from people who have worked on TTS highlighting.  Does this general reconciliation approach match how you’d solve it?  Do you think there are any failure modes we should watch for?

u/goldenjm — 2 days ago
▲ 14 r/iosapps

Paper2Audio text to speech, now for reading documents too (Free and paid plan options)

https://preview.redd.it/nqnzyvvpuzzg1.jpg?width=5106&format=pjpg&auto=webp&s=9f0377c5f47e0956882438be922503ccc221bd87

I’m Joe, the founder of Paper2Audio, a free text to speech reader designed to help you get through long documents and books more efficiently, with high-accuracy narration for complicated material and high-quality voices.  Our free plan allows 56 hours of audio generation per week.  

A: What problem does Paper2Audio solve?
Most text to speech tools can handle simple documents, but not messy PDFs, research papers, reports, and books. Paper2Audio is built to turn complex documents into accurate audio. 

I posted in r/iosapps previously, with my most recent post here (4 months ago).  Since then, we’ve been working on a big improvement: making Paper2Audio better not just for listening to documents, but for reading along while you listen so it’s easier to stay focused, skim when needed, retain more from dense material, and get through your reading backlog more quickly. 

B: Why is Paper2Audio better than the top alternatives?

  1. Higher audio limits for our free plan (56 hours weekly audio generation) with high quality voices.
  2. Summarizes figures, tables, math, and even code into plain English so you’re not stuck with symbol-by-symbol or line-by-line narration.
  3. Hyper-focus on accuracy:  Paper2Audio avoids reading things that usually make text to speech audio annoying, like repeated page numbers, headers, footers, citations, footnotes, and unnecessary boilerplate. We clean up and normalize tricky text first, including math, code, abbreviations, Roman numerals, symbols, units, formulas, and other things that often sound wrong when read aloud by other text to speech services.
  4. Use Paper2Audio to read and follow along, not just listen:  With Reader View, our new method of reformatting PDFs and other documents to fit your screen while including rich content like images and document formatting, you get:
    • Visual elements are included in the transcript: If your document includes visual elements like tables, figures, images, or math, you can now see them directly in the audio transcript which makes it easier to follow along without losing your place or switching views. 
    • "Figure view" for visual elements: Click on any visual element to bring up the figure view pop up, then zoom and pan around the image for a more detailed view.
    • Single column view: Documents with multiple columns are displayed in a single column to improve readability on smaller screens.
    • Rich text formatting: We now preserve the original formatting of your documents, including math, headings, lists, subscripts, and other inline styling, so you can skim, navigate and understand the document more quickly. Citations are also included so that you know when an author is making a reference, but citation text is only read aloud when needed to keep sentences intact.

C: Cost
Paper2Audio is available on iOS, Android and on our website. We have a generous free plan for personal use (56 hours of audio generation per week), as well as a paid Plus subscription with higher audio and file/size limits ($20/month) for business users.

Any feedback or questions?
I’d love feedback on the new Reader View, the Paper2Audio listening experience, and the overall workflow.  What would make Paper2Audio your go-to tool for listening and reading your documents?

u/goldenjm — 13 days ago

https://preview.redd.it/vxjpe8qp1szg1.png?width=831&format=png&auto=webp&s=af7b5fb6968f69de56dd5eb9a11260dcc97d1224

I’m Joe, the founder of Paper2Audio, a free text to speech reader designed to help you get through long documents and books more efficiently, with high-accuracy narration for complicated material and high-quality voices.  Our free plan allows 56 hours of audio generation per week.  

I’ve posted previously in r/productivityapps (most recently 4 months ago here).  Since then, we’ve been working on a big improvement: making Paper2Audio better not just for listening to documents, but for reading along while you listen so it’s easier to stay focused, skim when needed, retain more from dense material, and get through your reading backlog more quickly.   

We are excited to announce Reader View, our new method of reformatting PDFs and other documents to fit your screen while including rich content like images and document formatting.  Reader View is the default when you open a document.

  • Visual elements are included in the transcript: If your document includes visual elements like tables, figures, images, or math, you can now see them directly in the audio transcript which makes it easier to follow along without losing your place or switching views. Any summary text for a visual element will only appear when it is being read to give you a more streamlined reading view.
  • "Figure view" for visual elements: Click on any visual element to bring up the figure view pop up, then zoom and pan around the image for a more detailed view.
  • Single column view: Documents with multiple columns are displayed in a single column to improve readability on smaller screens.
  • Rich text formatting: We now preserve the original formatting of your documents, including math, headings, lists, subscripts, and other inline styling, so you can skim, navigate and understand the document more quickly. Citations are also included so that you know when an author is making a reference, but citation text is only read aloud when needed to keep sentences intact.

What else has improved since our last post?

  • Better image and figure labeling and summaries: Significantly improved the detection and extraction of tables, figures and images, reduced repetitive summaries for the same figure or table, and improved summary accuracy in cases where an item has multiple sub-objects.  This makes technical documents faster to understand because the summaries are less repetitive and more useful.
  • Improved math detection and summaries: Better detection accuracy of math elements within documents so that non-math items are not incorrectly summarized.
  • Overhaul to citation processing: Complete update to our citation system, including changes to better distinguish citations that are part of a sentence from those that are not, for smoother listening without citations unnecessarily interrupting your focus.  In-sentence citations are kept when removing them would break the sentence.
  • More accurate Roman numeral processing: Major changes to overhaul detection and processing for Roman numerals so that they are pronounced correctly (there may still be issues with I and X since those characters often appear on their own in other non-Roman numeral contexts).
  • Faster overall document processing: About 10-20% faster processing for documents (this varies, but complicated non-fiction PDFs will benefit most).
  • Accessibility feature improvements for visually impaired users (better screen reader support and navigation)
  • Lots of bug fixes and minor UI improvements
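On the Roman numeral item above, here is a small sketch of why “I” and “X” are tricky and one conservative way to handle them: only convert a numeral when the previous word is a cue like “Part” or “Chapter”. This is an assumed heuristic for illustration, not a description of our actual detection logic, and the `CUES` list and function names are hypothetical.

```python
import re

VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
# Strict pattern for well-formed numerals from 1 to 3999.
ROMAN = re.compile(r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")
# Hypothetical cue words that make a following numeral unambiguous.
CUES = {"part", "chapter", "section", "volume", "appendix"}


def roman_to_int(s):
    total = 0
    for ch, nxt in zip(s, s[1:] + " "):
        value = VALUES[ch]
        # Subtract when a smaller value precedes a larger one (IV, IX, ...).
        total += -value if nxt in VALUES and VALUES[nxt] > value else value
    return total


def normalize_romans(text):
    words = text.split()
    out = []
    for i, word in enumerate(words):
        core = word.rstrip(".,;:)")
        is_roman = bool(core) and ROMAN.fullmatch(core) is not None
        cued = i > 0 and words[i - 1].lower() in CUES
        if is_roman and cued:
            # Keep any trailing punctuation that was stripped from core.
            out.append(str(roman_to_int(core)) + word[len(core):])
        else:
            out.append(word)
    return " ".join(out)


print(normalize_romans("Part III shows that I agree."))
```

The pronoun “I” is a valid numeral on its own, so without the cue check it would be read as “1”. The trade-off is that uncued numerals (for example in a list of kings or popes) are left untouched by this sketch.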

Where do I get Paper2Audio and how much does it cost?
Paper2Audio is available on iOS, Android and on our website. We have a generous free plan for personal use (56 hours of audio generation per week), as well as a paid Plus subscription with higher audio and file/size limits ($20/month) for business users.

Any feedback or questions?
I’d love feedback on the new Reader View, the Paper2Audio listening experience, and the overall workflow. Is this something you’d use for PDFs, books, research papers, or work documents? What would you want improved?

u/goldenjm — 14 days ago