u/lizcodes — reddlx

How I built a production TTS API: sentence-boundary chunking, Redis distributed locks, and killing the thundering herd problem.

Built a text-to-speech API that converts full articles to MP3. The interesting engineering problems weren't the TTS calls — they were everything around them.

**The chunking problem**

Every TTS provider has a per-request character limit (Polly standard: 3,000 chars). A real article is 8,000–20,000 chars. Naive character-boundary splitting produces broken audio mid-word. The solution: a two-threshold sentence-boundary splitter.

- `target_chars = 2500` — soft target; flush the buffer when reached

- `max_chars = 4000` — hard ceiling; flush before appending if the next sentence would exceed it

- Split regex: `(?<=[.!?])\s+` — only splits after terminal punctuation

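Result: every chunk is a coherent group of complete sentences. It stays within the provider limit only if `max_chars` is at or below that limit, so for Polly standard (3,000 chars) the 4,000 ceiling listed above would need to drop to 3,000 or less.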
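A minimal sketch of the two-threshold splitter described above, not the exact production code. The function name `chunk_text` and the single-space join between sentences are assumptions; the thresholds and regex are the ones listed in the post.

```python
import re

# Split only on whitespace that follows terminal punctuation.
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")


def chunk_text(text, target_chars=2500, max_chars=4000):
    """Group complete sentences into chunks using a soft target and a hard ceiling."""
    chunks = []
    buffer = ""
    for sentence in SENTENCE_SPLIT.split(text.strip()):
        if not sentence:
            continue
        candidate = f"{buffer} {sentence}" if buffer else sentence
        # Hard ceiling: flush before appending if the sentence would overflow.
        if buffer and len(candidate) > max_chars:
            chunks.append(buffer)
            candidate = sentence
        buffer = candidate
        # Soft target: flush as soon as the buffer is long enough.
        if len(buffer) >= target_chars:
            chunks.append(buffer)
            buffer = ""
    if buffer:
        chunks.append(buffer)
    return chunks


demo_chunks = chunk_text("One. Two! Three? Four.", target_chars=10, max_chars=12)
```

One limitation of this sketch: a single sentence longer than `max_chars` is passed through unsplit, so a real splitter would need a fallback (for example, splitting on commas or whitespace) for that case.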

**The caching layer**

TTS synthesis is deterministic — same text + same voice/engine/region = identical audio bytes every time. Cache key structure:

`sha256(text) + voice_id + engine + region`

All four parameters matter. Swapping from `Joanna/standard` to `Matthew/neural` must be a cache miss, not a hit.
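A rough sketch of how such a key could be built. The `tts:` prefix, the colon delimiter, the field order, and the function name `chunk_cache_key` are assumptions for illustration; only the four inputs come from the post.

```python
import hashlib


def chunk_cache_key(text, voice_id, engine, region, prefix="tts"):
    """Build a cache key from the chunk text hash plus every synthesis parameter."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    # Any change in voice, engine, or region yields a different key.
    return f"{prefix}:{region}:{engine}:{voice_id}:{digest}"


key_a = chunk_cache_key("Hello world.", "Joanna", "standard", "us-east-1")
key_b = chunk_cache_key("Hello world.", "Matthew", "neural", "us-east-1")
```

Hashing the text keeps keys a fixed length regardless of chunk size, while the plain-text parameter fields make keys easy to inspect or delete by pattern.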

Warm cache: N × `redis.get()` + ffmpeg concat. Latency under 300ms for most articles. Zero upstream calls.
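The post doesn't say how ffmpeg is invoked, so the following is only one plausible shape for the warm path: fetch every chunk from the cache, and if all are present, join the chunk files with ffmpeg's concat demuxer. The helper names and the idea of writing chunks to temporary files are assumptions; `client` is any object with a redis-py style `get()`.

```python
def load_cached_chunks(client, keys):
    """Return audio bytes for every key, or None if any chunk is missing."""
    audio = [client.get(key) for key in keys]
    return None if any(item is None for item in audio) else audio


def concat_list_text(paths):
    """Build the list file consumed by ffmpeg's concat demuxer."""
    lines = []
    for path in paths:
        # Inside single quotes, a literal quote is written as '\''.
        escaped = str(path).replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"


def concat_command(list_path, out_path):
    """Arguments for a stream-copy concat (no re-encoding)."""
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-i", str(list_path), "-c", "copy", str(out_path)]


demo_list = concat_list_text(["chunk_0.mp3", "chunk_1.mp3"])
```

The command list can be passed to `subprocess.run(..., check=True)` once the chunk bytes and the list file have been written to disk. Stream copy assumes all chunks share the same codec and sample rate, which holds when they come from the same voice and engine.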

**The thundering herd**

Without locking: 50 concurrent users hit a cold article → 50 × 7 chunks = 350 Polly calls, 343 of them redundant (only 7 are needed).

Fix: Redis `SET NX` distributed lock per chunk. One worker wins the lock, synthesizes, writes to cache, releases. Every other worker polls with exponential backoff until the cache key appears.

Backoff: start at 50ms, grow ×1.25 per iteration, cap at 500ms.
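As a quick illustration of that schedule, here is a small generator using the numbers above. The name `backoff_delays_ms` is hypothetical, and the sketch has no jitter, which a production version might add.

```python
import itertools


def backoff_delays_ms(start_ms=50.0, factor=1.25, cap_ms=500.0):
    """Yield poll delays in milliseconds: geometric growth up to a fixed cap."""
    delay = start_ms
    while True:
        yield min(delay, cap_ms)
        delay = min(delay * factor, cap_ms)


first_delays = list(itertools.islice(backoff_delays_ms(), 4))
# first_delays == [50.0, 62.5, 78.125, 97.65625]
```

With these values the delay reaches the 500ms cap on the 12th poll, so a waiter checks the cache often early on and settles at two polls per second for slow syntheses.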

Critical detail: lock release is in a `finally` block. A failed synthesis that doesn't release its lock blocks all subsequent requests for that chunk until TTL expiry — potentially minutes.
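A simplified sketch of the lock-then-poll flow with the release in `finally`, assuming a redis-py style client (`set(..., nx=True, ex=...)`, `get`, `delete`). The function name, the `lock:` key prefix, the TTL, and the `max_wait` timeout are assumptions. The token check on release is not atomic here; a production version would typically use a Lua script or redis-py's built-in lock for that.

```python
import time
import uuid


def get_or_synthesize(client, cache_key, synthesize,
                      lock_ttl=60, max_wait=30.0, sleep=time.sleep):
    """Return cached audio, synthesizing at most once across competing workers."""
    cached = client.get(cache_key)
    if cached is not None:
        return cached
    lock_key = f"lock:{cache_key}"
    token = uuid.uuid4().hex  # identifies this worker as the lock owner
    delay, waited = 0.05, 0.0
    while True:
        if client.set(lock_key, token, nx=True, ex=lock_ttl):
            try:
                # Re-check: another worker may have filled the cache already.
                cached = client.get(cache_key)
                if cached is None:
                    cached = synthesize()
                    client.set(cache_key, cached)
                return cached
            finally:
                # Release even if synthesize() raised, but only our own lock.
                if client.get(lock_key) in (token, token.encode()):
                    client.delete(lock_key)
        if waited >= max_wait:
            raise TimeoutError(f"gave up waiting for {cache_key}")
        sleep(delay)
        waited += delay
        cached = client.get(cache_key)
        if cached is not None:
            return cached
        delay = min(delay * 1.25, 0.5)  # grow x1.25, cap at 500ms
```

Waiters retry the lock on each loop iteration, so if the winning worker fails and releases, one of them takes over instead of everyone waiting for the TTL.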

Result under load: `chunk cache stats hits=49 misses=1` per chunk. 7 Polly calls total, not 350.

**Provider comparison (brief)**

- Piper (local): free, no concurrency, model files are hundreds of MB, degrades on long inputs

- ElevenLabs: best voice quality, cost curve is steep at real traffic levels

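- Amazon Polly: 5M chars/month free (standard) under the AWS free tier (AWS has listed this as a 12-month offer rather than a permanent one, so check current terms); right economics for this use case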

Full writeup with architecture diagram, all code, and the failure sequence in order: From Piper to Polly: How I Built a Production-Ready Text-to-Speech API (and That Broke Along the Way)

What I'm solving next: moving synthesis off the request thread into an async job queue (ARQ vs Celery) and streaming chunk_0 to the client while chunk_1 is still synthesizing.

reddit.com

u/lizcodes — 4 days ago

▲ 2 r/developersIndia

How I built a production TTS API: sentence-boundary chunking, Redis distributed locks, and killing the thundering herd

reddit.com

u/lizcodes — 6 days ago

▲ 2 r/VoiceAutomationAI

How I built a production TTS API: sentence-boundary chunking, Redis distributed locks, and killing the thundering herd

reddit.com

u/lizcodes — 6 days ago