How I built a production TTS API: sentence-boundary chunking, Redis distributed locks, and killing the thundering herd problem.
I built a text-to-speech API that converts full articles to MP3. The interesting engineering problems weren't the TTS calls; they were everything around them.
**The chunking problem**
Every TTS provider has a per-request character limit (Polly standard: 3,000 chars). A real article is 8,000–20,000 chars. Naive character-boundary splitting produces broken audio mid-word. The solution: a two-threshold sentence-boundary splitter.
- `target_chars = 2500` — soft target; flush the buffer when reached
- `max_chars = 4000` — hard ceiling; flush before appending if the next sentence would exceed it
- Split regex: `(?<=[.!?])\s+` — only splits after terminal punctuation
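Result: every chunk is a coherent group of complete sentences and never exceeds `max_chars`, unless a single sentence is longer than that. Staying within the provider limit therefore depends on `max_chars` being at or below it: against Polly standard's 3,000 chars, a 4,000 ceiling can still produce oversized chunks.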
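A minimal sketch of the two-threshold splitter described above, assuming plain-text input. The function name `chunk_text` and the choice to rejoin sentences with a single space are illustrative assumptions, not the original implementation.

```python
import re

# Split only after terminal punctuation, as in the regex quoted above.
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")


def chunk_text(text: str, target_chars: int = 2500, max_chars: int = 4000) -> list[str]:
    """Group whole sentences into chunks using a soft target and a hard ceiling."""
    chunks: list[str] = []
    buffer = ""
    for sentence in SENTENCE_SPLIT.split(text.strip()):
        if not sentence:
            continue
        candidate = f"{buffer} {sentence}" if buffer else sentence
        if buffer and len(candidate) > max_chars:
            # Hard ceiling: flush before appending the sentence that would overflow.
            chunks.append(buffer)
            buffer = sentence
        else:
            buffer = candidate
        if len(buffer) >= target_chars:
            # Soft target: the buffer is big enough, so flush it.
            chunks.append(buffer)
            buffer = ""
    if buffer:
        chunks.append(buffer)
    return chunks


if __name__ == "__main__":
    print(chunk_text("One. Two! Three? Four.", target_chars=10, max_chars=12))
```

One limitation of this sketch: a single sentence longer than `max_chars` is emitted as its own oversized chunk, so a real splitter would need a fallback (for example splitting on commas or whitespace) for that case.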
**The caching layer**
TTS synthesis is deterministic — same text + same voice/engine/region = identical audio bytes every time. Cache key structure:
`sha256(text) + voice_id + engine + region`
All four parameters matter. Swapping from `Joanna/standard` to `Matthew/neural` must be a cache miss, not a hit.
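As a rough illustration of that key structure, here is one way to build it. The `tts:` prefix, the colon separators, and the function name `chunk_cache_key` are assumptions made for the example.

```python
import hashlib


def chunk_cache_key(text: str, voice_id: str, engine: str, region: str) -> str:
    """Build a cache key from the text hash plus every parameter that changes the audio."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    # Prefix and separators are illustrative; any unambiguous layout works.
    return f"tts:{digest}:{voice_id}:{engine}:{region}"


if __name__ == "__main__":
    print(chunk_cache_key("Hello world.", "Joanna", "standard", "us-east-1"))
    print(chunk_cache_key("Hello world.", "Matthew", "neural", "us-east-1"))
```

Hashing the text keeps keys a fixed length regardless of chunk size, while leaving the voice, engine, and region readable makes it easier to inspect or bulk-delete keys for one configuration.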
Warm cache: N × `redis.get()` + ffmpeg concat. Latency under 300ms for most articles. Zero upstream calls.
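The warm path might look roughly like the sketch below. The helper names are hypothetical, `store` is anything with a Redis-style `get()`, and the use of ffmpeg's concat demuxer with stream copy is an assumption about how the concatenation is done. The sketch only builds the list file and the command line; it does not invoke ffmpeg.

```python
from typing import Optional, Sequence


def fetch_cached_chunks(store, keys: Sequence[str]) -> Optional[list[bytes]]:
    """Return the audio for every chunk, or None if any chunk is missing (cold path)."""
    chunks: list[bytes] = []
    for key in keys:
        audio = store.get(key)  # one get() per chunk
        if audio is None:
            return None
        chunks.append(audio)
    return chunks


def concat_list_body(paths: Sequence[str]) -> str:
    """Contents of the list file read by ffmpeg's concat demuxer.

    Assumes the paths contain no single quotes.
    """
    return "".join(f"file '{path}'\n" for path in paths)


def ffmpeg_concat_command(list_path: str, output_path: str) -> list[str]:
    """Argument list for joining the chunk files without re-encoding."""
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-i", list_path, "-c", "copy", output_path]


if __name__ == "__main__":
    # A plain dict works as a stand-in store here because it also exposes get().
    cache = {"chunk-0": b"aaa", "chunk-1": b"bbb"}
    print(fetch_cached_chunks(cache, ["chunk-0", "chunk-1"]))
    print(concat_list_body(["/tmp/chunk-0.mp3", "/tmp/chunk-1.mp3"]), end="")
    print(ffmpeg_concat_command("/tmp/list.txt", "/tmp/article.mp3"))
```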
**The thundering herd**
Without locking: 50 concurrent users hit a cold article → 50 × 7 chunks = 350 Polly calls, 343 of them redundant (only 7 are needed).
Fix: a Redis `SET NX` distributed lock per chunk, set with a TTL. One worker wins the lock, synthesizes, writes to cache, and releases. Everyone else polls with exponential backoff until the cache key appears.
Backoff: start at 50ms, grow ×1.25 per iteration, cap at 500ms.
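The schedule above can be written as a small generator. This is a sketch using the stated numbers; the name `backoff_delays` and the use of seconds as the unit are assumptions.

```python
from itertools import islice
from typing import Iterator


def backoff_delays(start: float = 0.050, factor: float = 1.25, cap: float = 0.500) -> Iterator[float]:
    """Yield poll delays in seconds: start, then multiply by factor, never above cap."""
    delay = start
    while True:
        # min() here also covers the case where start is already above cap.
        yield min(delay, cap)
        delay = min(delay * factor, cap)


if __name__ == "__main__":
    for delay in islice(backoff_delays(), 13):
        print(f"{delay * 1000:.1f} ms")
```

With these numbers the delay reaches the 500ms cap on the twelfth poll, after roughly 2.1 seconds of cumulative waiting, so waiters check often while a short chunk is being synthesized and back off to a steady two polls per second for slower ones.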
Critical detail: lock release is in a `finally` block. A failed synthesis that doesn't release its lock blocks all subsequent requests for that chunk until TTL expiry — potentially minutes.
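A sketch of the lock-then-synthesize flow with the release in `finally`. The `set(..., nx=True, ex=...)`, `get`, and `delete` calls follow redis-py's client interface; everything else is an assumption for illustration: the function name `get_or_synthesize`, the `lock:` key prefix, the timeout, and `InMemoryStore`, a single-process stand-in that ignores TTLs so the example runs without a Redis server. In this variant waiters also retry the lock on each poll, so they can take over if the first worker fails.

```python
import time
from typing import Callable, Optional


class InMemoryStore:
    """Stand-in for a Redis client (single-threaded, TTLs ignored)."""

    def __init__(self) -> None:
        self._data: dict[str, bytes] = {}

    def get(self, name: str) -> Optional[bytes]:
        return self._data.get(name)

    def set(self, name: str, value: bytes, ex: Optional[int] = None, nx: bool = False) -> Optional[bool]:
        if nx and name in self._data:
            return None  # redis-py returns None when NX blocks the write
        self._data[name] = value
        return True

    def delete(self, *names: str) -> int:
        return sum(1 for name in names if self._data.pop(name, None) is not None)


def get_or_synthesize(store, key: str, synthesize: Callable[[], bytes],
                      lock_ttl_s: int = 60, wait_timeout_s: float = 30.0) -> bytes:
    """Return cached audio, letting only the lock holder call synthesize()."""
    lock_key = f"lock:{key}"
    delay = 0.050
    deadline = time.monotonic() + wait_timeout_s
    while True:
        cached = store.get(key)
        if cached is not None:
            return cached
        if store.set(lock_key, b"1", nx=True, ex=lock_ttl_s):
            try:
                # Re-check: another worker may have filled the cache just before we won.
                cached = store.get(key)
                if cached is not None:
                    return cached
                audio = synthesize()
                store.set(key, audio)
                return audio
            finally:
                # Runs on success and on failure, so a crash in synthesize()
                # does not leave waiters blocked until the TTL expires.
                store.delete(lock_key)
        if time.monotonic() >= deadline:
            raise TimeoutError(f"gave up waiting for {key}")
        time.sleep(delay)
        delay = min(delay * 1.25, 0.500)


if __name__ == "__main__":
    store = InMemoryStore()
    print(get_or_synthesize(store, "chunk-0", lambda: b"fake-mp3-bytes"))
```

A production version would usually store a unique token as the lock value and delete the lock only if the token still matches, so that a worker whose lock has already expired cannot release a lock now held by someone else.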
Result under load: `chunk cache stats hits=49 misses=1` per chunk. 7 Polly calls total, not 350.
**Provider comparison (brief)**
- Piper (local): free, no concurrency, model files are hundreds of MB, degrades on long inputs
- ElevenLabs: best voice quality, cost curve is steep at real traffic levels
- Amazon Polly: 5M chars/month free (standard) for the first 12 months, pay-per-character after that; right economics for this use case
Full writeup with architecture diagram, all code, and the failure sequence in order: *From Piper to Polly: How I Built a Production-Ready Text-to-Speech API (and That Broke Along the Way)*
What I'm solving next: moving synthesis off the request thread into an async job queue (ARQ vs Celery) and streaming chunk_0 to the client while chunk_1 is still synthesizing.