u/ParsimmonIO

▲ 8 r/datasets+1 crossposts

What’s the most underserved public dataset you wish existed in clean, RAG-ready form?

We’re building Parsimmon, a document parsing pipeline that handles the messy stuff most tools choke on: scanned PDFs, mixed layouts, tables embedded in images, inconsistent formats across sources. We’ve been benchmarking on ParseBench and are sitting alongside Google and Reducto on the leaderboard, with particularly strong recall on complex layouts like XBRL/SEC filings.

We want to use it for something people will actually find interesting: take a historically significant, publicly available corpus that’s scattered and hard to access, and normalize it into a single clean, queryable dataset that we release for free.

We’ve been kicking around things like:
• Leonardo da Vinci’s notebooks (7,000+ pages scattered across 10+ institutions, never unified)
• Einstein’s personal papers (digitized by Princeton/Hebrew University but never normalized)
• Darwin’s notebooks (Cambridge has digitized the full archive, but it’s completely scattered)

But we want to know what you actually wish existed. What corpus have you run into that’s technically public but practically unusable? What would you build on top of it if the data were clean?

Ideally something with appeal beyond researchers, but we’re open to anything.

u/ParsimmonIO — 7 days ago