Anyone else notice how dataset problems start showing up only after a project gets serious?
Early on everything looks fine:
good benchmark numbers,
clean demos,
decent validation results.
Then production starts and suddenly you’re chasing weird edge cases for weeks.
We had one vision pipeline where the actual model wasn’t even the main issue. The bigger headache turned out to be the data itself:
same images coming from different sources,
slightly different labels across batches,
missing metadata,
random scraped assets mixed with curated ones,
etc.
What made it worse is that most of this wasn’t obvious during training. It only started surfacing once we tried scaling the system and auditing failures properly.
At some point we stopped obsessing over architectures and spent more of our time cleaning up the ingestion and sourcing workflows instead.
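For anyone wondering what a cheap first check for the first two items on that list could look like, here is a rough sketch. It is illustrative only, not the actual pipeline: the record fields (`data`, `source`, `label`) and the function name `audit_records` are made up for the example. It groups records by a SHA-256 hash of the raw bytes, then flags any hash that shows up under more than one source or with more than one label.

```python
import hashlib
from collections import defaultdict


def audit_records(records):
    """Flag byte-identical items that span sources or carry conflicting labels.

    Each record is a dict with hypothetical keys:
    'data' (raw image bytes), 'source' (str), 'label' (str).
    """
    by_hash = defaultdict(list)
    for rec in records:
        digest = hashlib.sha256(rec["data"]).hexdigest()
        by_hash[digest].append(rec)

    cross_source = {}     # hash -> sorted list of sources containing it
    label_conflicts = {}  # hash -> sorted list of distinct labels seen
    for digest, group in by_hash.items():
        sources = sorted({r["source"] for r in group})
        labels = sorted({r["label"] for r in group})
        if len(sources) > 1:
            cross_source[digest] = sources
        if len(labels) > 1:
            label_conflicts[digest] = labels
    return cross_source, label_conflicts


# Toy data standing in for real image bytes.
records = [
    {"data": b"img-A", "source": "curated", "label": "cat"},
    {"data": b"img-A", "source": "scraped", "label": "kitten"},
    {"data": b"img-B", "source": "curated", "label": "dog"},
]
cross_source, label_conflicts = audit_records(records)
print(list(cross_source.values()))     # [['curated', 'scraped']]
print(list(label_conflicts.values()))  # [['cat', 'kitten']]
```

Big caveat: an exact byte hash only catches identical files. Resized or re-encoded copies of the same image will slip through, so near-duplicates would need something like perceptual hashing or embedding similarity on top.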
What’s been the biggest hidden dataset issue in your projects?