Why are realistic conversational TTS / speech datasets still so hard to find?
We're working on conversational voice systems internally, and we keep running into the same bottleneck: most public speech/TTS datasets still feel very “lab-like” compared to production environments.
A lot of the common datasets feature:
- clean recordings
- isolated speakers
- controlled pacing/prosody
- few interruptions and little speaker overlap
- limited emotional variance
- minimal telecom degradation
- highly structured speech
all of which differs pretty drastically from real deployments.
We’ve been trying to find stronger datasets covering:
- emotionally dynamic conversations
- multilingual / code-switching dialogue
- interruption-heavy interactions
- degraded VoIP/mobile audio
- acoustically difficult conditions for ASR/TTS
- long-form conversational drift
- overlapping speakers
- real customer-support-style conversations
- retention / escalation scenarios
Any recommendations on where to find datasets like these would be appreciated.
It feels like most public datasets still underrepresent the conversational conditions voice systems actually face once they hit production traffic.