u/BasisRoutine6228

I’ve been experimenting with turning long-form text into audiobook-style audio, and one thing surprised me:
The voice model is usually not the main problem.
A lot of bad long-form AI narration comes from the source text itself. Ebooks, blog posts, PDFs, and scripts are written for reading, not listening. Once they become audio, small things become very obvious:
- table of contents text gets read aloud
- footnotes interrupt the sentence
- repeated headers or page numbers sound strange
- long sentences are harder to follow
- dialogue becomes confusing when you can’t see line breaks
- chapter transitions need to be heard, not just seen
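A minimal Python sketch of what cleaning for a few of these might look like. It assumes plain text input, that a line containing only digits is a page number, that a short line repeated several times is a running header, and that footnote markers look like `[12]`. The function name `clean_for_listening` and the `min_repeats` threshold are hypothetical, not taken from any particular tool, and real ebooks or PDFs will need more careful rules.

```python
import re
from collections import Counter


def clean_for_listening(text: str, min_repeats: int = 3) -> str:
    """Strip a few read-aloud hazards from plain text (heuristic sketch)."""
    lines = text.splitlines()
    # Count how often each non-empty line appears in the whole text.
    counts = Counter(line.strip() for line in lines if line.strip())
    cleaned = []
    for line in lines:
        stripped = line.strip()
        # Assumption: a line holding only 1-4 digits is a page number.
        if re.fullmatch(r"\d{1,4}", stripped):
            continue
        # Assumption: a short line seen min_repeats or more times is a
        # running header/footer. This can also hit repeated short dialogue
        # lines, so the result still needs a human check.
        if stripped and len(stripped) < 60 and counts[stripped] >= min_repeats:
            continue
        # Assumption: footnote markers look like [12]; remove them inline.
        line = re.sub(r"\s*\[\d+\]", "", line)
        cleaned.append(line)
    return "\n".join(cleaned)
```

Table-of-contents removal, dialogue attribution, and chapter transitions are harder to automate and are left out here on purpose.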
The biggest lesson for me is that “text to speech” and “audiobook production” are not the same workflow.
For short text, you can paste and generate.
For long-form content, the better workflow seems to be:

1. clean the source text
2. test a 500–1,000 word sample
3. listen for pacing, pronunciation, dialogue, and structure
4. fix the text
5. then generate the full chapter
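The sampling step above can be sketched roughly as follows. This assumes paragraphs are separated by blank lines and cuts only at paragraph boundaries so the sample does not end mid-sentence. The name `take_sample` and its parameters are hypothetical, chosen only for illustration.

```python
def take_sample(text: str, min_words: int = 500, max_words: int = 1000) -> str:
    """Return leading whole paragraphs totalling roughly min_words..max_words."""
    # Assumption: paragraphs are separated by one blank line.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    sample = []
    total = 0
    for para in paragraphs:
        n = len(para.split())
        # Stop before overshooting max_words, unless nothing is taken yet
        # (a single very long first paragraph is returned whole).
        if sample and total + n > max_words:
            break
        sample.append(para)
        total += n
        if total >= min_words:
            break
    return "\n\n".join(sample)
```

Taking the sample from the start of a chapter is the simplest choice. A passage with dialogue or unusual names may surface more problems per minute of listening.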
I’m building a small tool around this workflow, but I’m mostly interested in the workflow problem itself.
For people who use TTS for long-form content: do you clean the text first, or generate first and fix problems afterward?
For context, I’m testing this idea here: https://audiobookgenerator.net/. The main question, though, is whether this sample-first workflow matches how people actually handle long-form TTS.

Why does long-form AI narration still feel worse than a real audiobook?