r/datasets

I added a 5th pipeline to my open-source pain-finder - tried using court records for profession-level pain, it didn't work, here's what did
▲ 6 r/datasets+3 crossposts

I've been running unfairgaps-os for a while - MIT repo with 4 pipelines that mine court filings, regulatory fines, and enforcement data to find business pain points. B2B angle: which industry-level problems documented in lawsuits are worth solving with a SaaS.

Wanted to extend it to individual professionals. Started off thinking the same court-records approach would work - just narrow it from "construction in US" to "lawyers in US." It didn't. Lawyers don't get sued over the fact that calculating filing fees per court is tedious. Accountants don't get fined because reconciling trust accounts is annoying. The pain a working professional feels every Tuesday isn't in court records - it's in the regulation that says "you must file form X by date Y or pay penalty Z" plus the daily grind of actually doing that.

So I switched approach. Two-stage pipeline:

Stage 1 is WebSearch - 7 targeted queries pulling regulatory facts from .gov, law.cornell.edu, BLS, and professional association sites. The queries cover daily routine + documents, regulations + licensing, software they use, jargon, career levels + fears, professional communities, and the labor market. Output is a structured JSON profile with ~30 specific facts and source URLs per profession.

Stage 2 hands the profile to Opus 4.7 with a deductive prompt and no web access. Given the regulation and daily routine, infer 8-15 specific recurring tasks that would be painful and produce a structured spec for the AI tool that would solve each one.
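A rough sketch of how the two stages could fit together, assuming a profile organized around the seven query topics above. The field names, `ProfessionProfile`, and `build_stage2_prompt` are hypothetical illustrations, not the repo's actual schema or prompt, and no model call is made here.

```python
import json
from dataclasses import dataclass, field, asdict

# The seven stage-1 query topics from the post (identifier names are made up).
TOPICS = [
    "daily_routine_and_documents",
    "regulations_and_licensing",
    "software",
    "jargon",
    "career_levels_and_fears",
    "professional_communities",
    "labor_market",
]


@dataclass
class Fact:
    topic: str       # one of TOPICS
    statement: str   # the fact itself
    source_url: str  # where stage 1 found it


@dataclass
class ProfessionProfile:
    profession: str
    country: str
    facts: list[Fact] = field(default_factory=list)

    def missing_topics(self) -> list[str]:
        """Topics with no facts yet, so thin profiles can be re-queried."""
        covered = {f.topic for f in self.facts}
        return [t for t in TOPICS if t not in covered]


def build_stage2_prompt(profile: ProfessionProfile) -> str:
    """Assemble a deductive prompt from the profile alone (no web access)."""
    return (
        "Using only the facts in the profile below, infer 8-15 specific "
        "recurring tasks that are painful for this profession, and write a "
        "structured spec for a tool that would solve each one.\n\n"
        + json.dumps(asdict(profile), indent=2)
    )


# Placeholder fact and URL, for illustration only.
profile = ProfessionProfile(
    profession="auto detailers",
    country="US",
    facts=[Fact("regulations_and_licensing",
                "Wash water must not be dumped in a storm drain.",
                "https://example.gov/placeholder")],
)
print(profile.missing_topics())
print(build_stage2_prompt(profile)[:200])
```

Keeping the source URL on every fact is what would let stage 2 output carry citations without the model needing web access.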

Loaded 130 US profession profiles into the repo. Ran stage 2 on 25 of them to seed.

Here's the full output from one run - auto detailers in the US - so you can see what actually comes out:

  1. Price a detail job profitably (cost-plus, not guess) - calculator
  2. Quarterly estimated tax + self-employment tax calculation - calculator
  3. EPA stormwater compliance checklist (avoid wash-water Clean Water Act fines) - checklist
  4. California Car Wash and Polishing Act registration + bond compliance - checklist
  5. Vehicle intake / pre-inspection form (protect against damage claims) - template
  6. Ceramic coating warranty + service agreement template - template
  7. Sales tax on detailing services - state-by-state lookup - reference
  8. Mobile detailer route optimization + travel cost recovery - calculator
  9. Chemical inventory + reorder + PFAS compliance tracker - checklist
  10. Paint correction estimate from photos + paint depth gauge - advisor
  11. Winter cash flow + slow-season pricing strategy - advisor
  12. Damage claim response (customer alleges scratches/damage) - checklist
  13. IDA Certified Detailer (CD/SV-CD) exam prep + study tracker - reference

The first one is the most obviously buildable. Most detailers eyeball pricing and undercut by 25% because they don't run a real cost-plus formula. The actual output JSON includes the formula (labor + chemicals + the 2026 IRS $0.67/mile rate + 15.3% SE tax + monthly overhead allocation), inputs (10 of them including services list and target margin), and outputs (minimum profitable price, recommended price with margin, breakdown, tax set-aside). That's a $19/mo SaaS already specced out.
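To make the cost-plus idea concrete, here is a minimal sketch of such a calculator. It is not the formula from the repo's JSON: the input names are hypothetical, the mileage and SE tax rates are simply the figures quoted above (check the current IRS rate before relying on them), and applying SE tax to labor income only is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class JobInputs:
    labor_hours: float
    hourly_wage: float           # what the detailer pays themselves per hour
    chemicals_cost: float        # products consumed on this job
    miles_driven: float          # round trip for a mobile job
    monthly_overhead: float      # insurance, software, equipment, etc.
    jobs_per_month: int          # used to allocate overhead per job
    target_margin: float = 0.25  # profit as a fraction of the final price
    mileage_rate: float = 0.67   # $/mile as quoted in the post; verify
    se_tax_rate: float = 0.153   # self-employment tax rate from the post


def price_job(job: JobInputs) -> dict[str, float]:
    if not 0 <= job.target_margin < 1:
        raise ValueError("target_margin must be in [0, 1)")
    if job.jobs_per_month <= 0:
        raise ValueError("jobs_per_month must be positive")

    labor = job.labor_hours * job.hourly_wage
    travel = job.miles_driven * job.mileage_rate
    overhead = job.monthly_overhead / job.jobs_per_month
    # Simplifying assumption: set aside SE tax on the labor income only.
    tax_set_aside = labor * job.se_tax_rate

    minimum_price = (labor + job.chemicals_cost + travel
                     + overhead + tax_set_aside)
    # Margin is defined as a share of price, so divide rather than multiply.
    recommended_price = minimum_price / (1 - job.target_margin)

    return {
        "labor": round(labor, 2),
        "chemicals": round(job.chemicals_cost, 2),
        "travel": round(travel, 2),
        "overhead": round(overhead, 2),
        "tax_set_aside": round(tax_set_aside, 2),
        "minimum_price": round(minimum_price, 2),
        "recommended_price": round(recommended_price, 2),
    }


quote = price_job(JobInputs(labor_hours=4, hourly_wage=30, chemicals_cost=18,
                            miles_driven=20, monthly_overhead=600,
                            jobs_per_month=40))
for name, value in quote.items():
    print(f"{name:>18}: ${value:.2f}")
```

Dividing by `1 - target_margin` keeps the margin as a share of the final price; multiplying cost by `1 + margin` would give a markup instead, which lands lower.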

Number 3 is the scariest. EPA Clean Water Act civil penalty is $64,618 per day per violation if you dump wash water in a storm drain. EPA has literally put mobile detailers out of business for this. The output is a 12-step compliance procedure with warnings (biodegradable soap is NOT a defense) and citations (33 USC 1311, 40 CFR 122.26).

Each of the 13 has a structured spec like that. Not platitudes, buildable tools.

Honest framing: this isn't a problem interview. It's a discovery funnel. The pains are inferred from regulation + daily routine, not from real users complaining. You'd use this to sift 130 professions in an afternoon, pick 5-10 candidates that sound viable, then spend a week on real customer development to validate. Beats brainstorming SaaS ideas with your roommate.

Repo: https://github.com/AyanbekDos/unfairgaps-os

Direct link to the auto-detailer output: https://github.com/AyanbekDos/unfairgaps-os/blob/main/data/professions/us/pains/us-auto-detailers.json

105 profiles still need stage 2 run on them. Takes ~5 min of LLM time each.

tldr: open-source repo finds AI tool ideas per profession by reading regulations instead of guessing. 13 specific ideas with formulas + citations for auto detailers as a real example.

u/Ogretape — 19 hours ago
▲ 10 r/datasets+2 crossposts

Honest Opinion - Data Analytics Google Certification

I am currently completing the Google Data Analytics course on Coursera. Does anyone who has finished it have feedback to share?

I want to move into data analysis and change careers.

Any tips?

u/Devoo07 — 1 day ago

[dataset] 2.3M U.S. employer profiles joined across 16 federal enforcement agencies (OSHA, EPA, EEOC, WHD, MSHA, and more) — free, CC BY 4.0

Full disclosure [self-promotion]: I'm the solo builder. Happy to answer questions about the data, methodology, or entity resolution approach.

I built FastDOL, a platform that links federal workplace enforcement records across agencies into a single employer profile. The government publishes this data, but each agency has its own database, its own identifiers, and its own terrible search UI.

The cross-agency dataset links enforcement records from OSHA, WHD, MSHA, EPA, EEOC, OFCCP, OFLC, and others at the employer level with parent-company rollup. The interesting finding: employers cited by 3+ agencies have a 3.4x higher worker fatality rate than employers cited by 1-2 agencies.

Four open datasets available so far, all CC BY 4.0:

  • Cross-Agency Federal Violations by Employer (~2.3M rows)
  • OSHA Construction Enforcement by Employer (377K rows)
  • OSHA Citations Q1 2026 (28,827 rows, citation-level)
  • WHD Wage Theft Enforcement Actions by Employer

All hosted on Hugging Face, Kaggle, and Zenodo with DOIs. Full schema, methodology, and BibTeX on the canonical pages: https://www.fastdol.com/datasets

u/chill-botulism — 1 day ago
▲ 30 r/datasets+1 crossposts

Released a free 9.8M doc Indic multilingual corpus — Hindi, Bengali, Tamil, Telugu + 7 more (CC0, HuggingFace)

Built this over the past few weeks as part of a multilingual research project. Figured I'd share it here. Check it out!

What: ~9.8M web documents across 11 languages — hi, bn, ta, te, mr, gu, kn, ml, pa, ur, en. ~8.4B tokens. CC0 license.

🤗 https://huggingface.co/datasets/AM0908/indic-hplt-v1

u/ashtok897 — 3 days ago

Need fun project ideas for a 3 node physical cluster (Uni Project)

Hey guys

I’m building a physical 3-node cluster (1 Master, 2 Workers, Docker Swarm) for a backend class. I need to distribute a heavy workload to process massive text/JSON data, but I want the final presentation to be actually funny. No boring corporate data!!!!

I’m looking for ideas on what exactly to analyze. I want to calculate crazy metrics, find weird patterns, etc

I was thinking of:
• Analyzing League of Legends chat logs, but that feels meh

The dataset needs to be easy to find (Kaggle, Hugging Face, APIs) but large enough to justify parallel processing on a cluster pleaaaase

Any crazy ideas or dataset links? Thanks! :D

u/Much_Palpitation9699 — 2 days ago
▲ 2 r/datasets+1 crossposts

Need reliable source for 30+ years of S&P 500 historical data for LSTM/Transformer research [P]

Hi everyone,

I'm starting a research project on financial time-series forecasting using LSTM and Transformer models for predicting S&P 500 market direction.

Right now, I'm struggling with obtaining reliable long-term historical data.

I tried Yahoo Finance, but downloads are inconsistent/failing for me, and most Kaggle datasets I found only contain around 5–10 years of data.

I specifically need:

  • Around 30 years of historical S&P 500 data
  • Preferably daily OHLCV data
  • Reliable and clean source suitable for ML research
  • Ideally free or student-friendly

I also want to understand what researchers typically use in academic work for financial forecasting:

  • Yahoo Finance?
  • Alpha Vantage?
  • WRDS/CRSP?
  • Polygon?
  • Kaggle?
  • Something else?

Additionally:

  • Is using only S&P 500 index data enough for a Master's level research project?
  • Or should I include technical indicators, macroeconomic data, sentiment, or constituent stock data?

Would appreciate guidance from people who've actually worked on financial ML projects.

Thanks.

u/stickPotatoe — 3 days ago
▲ 20 r/datasets+12 crossposts

PreSeedVCList.com

PreSeedVCList covers 390 venture capital firms actively writing pre-seed checks, with data on firm websites, investment stages, sectors, office locations, and portfolio links, structured from recent funding activity and updated monthly at https://preseedvclist.com.

u/project_startups — 3 days ago
▲ 3 r/datasets+1 crossposts

Looking for tools to enrich 3,800 licensed property manager names (Ontario, Canada) — need emails. What actually works?

I’m building a lead enrichment pipeline for my friend in Canada and hitting a wall. Looking for advice from anyone who’s done similar work.

The data I have:

• 3,800 licensed property managers from Ontario’s official CMRAO registry
• Name only: no employer, no domain, no address
• These are real licensed professionals, not residential contacts

What I’ve already tested (with results):

• Apollo.io free tier → blocked on Search API, needs paid plan
• Hunter.io → needs company domain to work, useless without it
• PeopleDataLabs → blocked signup, requires work email
• Prospeo → B2B only, 0% hit on Canadian residential-style data
• Spokeo/BeenVerified → US database only, no Canada coverage
• Canada411 via Apify → works but returns phone numbers only, no emails

What I’m trying to figure out:

1. Is Apollo Basic ($49) actually worth it for Canadian property managers? Has anyone tested it for Canada specifically?
2. Is there any people-search or enrichment tool with decent Canadian professional coverage?
3. Has anyone successfully enriched name-only Canadian professional contacts at scale?

What I’ve already ruled out:

• US-only people search tools (Spokeo, BeenVerified, TruthFinder)
• Tools that need a company domain as input
• Residential Canadian data (confirmed it basically doesn’t exist)

These are licensed professionals so they should have LinkedIn profiles and company affiliations — just need the right tool to match name → email efficiently.

Any real-world experience appreciated. Happy to share results once I find something that works.

u/divyanshu_gupta007 — 4 days ago
▲ 17 r/datasets+1 crossposts

I made the largest public gender-labeled Japanese name dataset, 731k+ names

Built by merging 5 existing public datasets into one. I also scraped the 69k Wikipedia names myself.

Kaggle Dataset License: CC BY-SA 4.0

| Dataset | Size | Male % | Notes |
|---|---|---|---|
| Wikipedia | 69,209 | 44.1% | Real attested people, 87% have birth year |
| ENAMDICT | 116,009 | 16.4% | Dictionary-based, heavily skewed female |
| Facebook 530M leak | 392,434 | 60.6% | Largest source, kanji or kana only |
| GenDec | 64,139 | 49.8% | |
| 名前由来 | 89,635 | 60.4% | Popularity rankings, not real frequency |
| Total | 731,426 | 51.0% | |

Each individual dataset has its own gaps — size, quality, or skew — but combining them gives a more complete picture. The Wikipedia subset is the only one covering real individuals and has a temporal dimension through birth years. ENAMDICT skews female partly because Japanese female names have more variety. The Facebook data is massive but only records kanji or kana, not both.

Use cases: gender inference (training classifiers without LLMs), Japanese NLP (NER, tokenization, reading prediction), cross-source data quality research

I'm also working on a gender prediction model and will post it when it's ready. It currently has around 90% accuracy.

u/Careful_Sand_7838 — 4 days ago

How are you handling training data when public datasets don't match your use case?

Public datasets on HF or Kaggle can sometimes be too generic, wrong domain, wrong schema, outdated, or just not enough volume to generalize properly. Collecting real-world proprietary data takes months. What do people actually do? From what I have seen, the options tend to be:

- Ship with what you have and accept degraded performance
- Spend weeks scraping and cleaning, which eats engineering time
- Augmentation techniques like SMOTE or noise injection, which help at the margins but do not solve domain specificity
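For anyone unfamiliar with the third option, here is a minimal pure-Python sketch of both ideas: Gaussian noise injection, and a SMOTE-style interpolation between a minority sample and one of its nearest neighbours. It is a simplified illustration with made-up function names, not the reference SMOTE algorithm; a maintained implementation is available in libraries such as imbalanced-learn.

```python
import math
import random


def add_gaussian_noise(row, sigma, rng):
    """Noise injection: jitter every feature by a draw from N(0, sigma)."""
    return [x + rng.gauss(0.0, sigma) for x in row]


def smote_like(minority, n_new, k, rng):
    """SMOTE-style oversampling: interpolate between a minority sample
    and one of its k nearest minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest neighbours of `base` among the other minority samples.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the segment base -> other
        synthetic.append([b + t * (o - b) for b, o in zip(base, other)])
    return synthetic


rng = random.Random(0)  # seeded so runs are repeatable
minority = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.2], [1.1, 2.1]]
new_rows = smote_like(minority, n_new=3, k=2, rng=rng)
noisy = add_gaussian_noise(minority[0], sigma=0.05, rng=rng)
print(new_rows)
print(noisy)
```

Both functions only produce points close to samples you already have, which is why they help with class balance and robustness but cannot supply a domain the source data never covered.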

I am working on a project that approaches this differently. Sourcing permissively licensed real-world data, curating it to a company's specified schema, then running synthetic expansion to hit the volume and edge case coverage the model actually needs. Every output includes a fidelity report showing statistical alignment between the synthetic output and the source distribution.

Before going further with it, I genuinely want to know whether this is a pain people feel acutely or whether most teams have found workarounds that make something like this unnecessary.

If you are hitting a data wall on something you are building right now, I would love to hear what the specific bottleneck looks like. Also happy to put together a free sample dataset for anyone who wants to see whether this approach actually produces something useful for a real use case.

What has worked for you?

u/earthtoali7 — 4 days ago
▲ 4 r/datasets+1 crossposts

What deepfake detection models can I test my validation dataset on?

Hello, I built a validation dataset of real and generated images (with a vanilla SDXL+InstantID architecture). I'm running low on AWS credits and have a small budget, but I want to benchmark the performance of detection models against it. Can anyone recommend open-source detection models that I can test?

I know there's a mix of models from universities and from the open-source community, but any opinions on which 4-5 I should test would be greatly appreciated.

u/Tasty_Pressure_5618 — 5 days ago
▲ 47 r/datasets+1 crossposts

The Keeling Curve: CO₂ at Mauna Loa since 1958, the most important climate measurement in history

datahub.io
u/anuveya — 6 days ago
▲ 0 r/datasets+1 crossposts

Looking for a real world dataset (or website where i can find it) [P]

Hi guys, I’m gonna do a data analysis project on data privacy, bias, and data interpretability. Our professor asked for a real-world dataset so we can analyze a real case. I'd also prefer a dataset with as little anonymization as possible, so I can apply some interesting techniques to it (differential privacy, k-anonymity, etc.).

Do you have any advice on where to find such a dataset? (links or website names)
I checked Kaggle, but I don’t know how to tell whether a dataset is real or not.

u/novromeda — 6 days ago

Open, self-hostable pipeline for U.S. financial datasets — SEC filings (full-text), 13F holdings, insider and congressional trades, FINRA short data, FRED, CFTC, CBOE

Sharing an open-source pipeline I built that scrapes, stores, and serves a bundle of public U.S. financial datasets so you can run the whole thing yourself instead of stitching together rate-limited APIs.

Datasets included (with their original sources — pull straight from these too):

  • SEC filings 10-K/10-Q/8-K, full-text searchable — source: SEC EDGAR (https://www.sec.gov/edgar)
  • Institutional holdings (13F-HR) — source: SEC EDGAR
  • Insider transactions (Form 3/4) — source: SEC EDGAR
  • Congressional trades — source: U.S. House & Senate financial disclosures (disclosures-clerk.house.gov / efdsearch.senate.gov)
  • Short data: fails-to-deliver — source: SEC; short volume & short interest — source: FINRA (https://www.finra.org)
  • Economic indicators — source: FRED, Federal Reserve Bank of St. Louis (https://fred.stlouisfed.org)
  • Futures positioning (Commitments of Traders) — source: CFTC (https://www.cftc.gov)
  • VIX & put/call ratios — source: CBOE
  • Daily OHLCV prices + indicators — source: Yahoo Finance

How to get it: self-host with one command (`docker compose up`); data lands in Postgres + ParadeDB so you get SQL + full-text/vector search out of the box. There's a web UI for browsing, a plain HTTP API, and an MCP server if you want to query it from an LLM. Stores everything locally — no account, no paid API.

Repo: https://github.com/daniel3303/Equibles (if you liked it, leave a star :) )

Disclaimer: I'm the developer of this project. It's free and open-source, I'm not selling anything, and all data comes from the public government/exchange sources listed above. Equibles is just the open pipeline to collect and query them yourself.

Feedback and feature requests welcome.

u/DanielAPO — 6 days ago
▲ 49 r/datasets+2 crossposts

S&P 500 market cap vs P/E ratio by sector: where the market is cheap and where it's expensive right now

datahub.io
u/anuveya — 7 days ago
▲ 8 r/datasets+1 crossposts

What’s the most underserved public dataset you wish existed in clean, RAG-ready form?

We’re building Parsimmon, a document parsing pipeline that handles the messy stuff most tools choke on: scanned PDFs, mixed layouts, tables embedded in images, inconsistent formats across sources. We’ve been benchmarking on ParseBench and are sitting alongside Google and Reducto on the leaderboard, with particularly strong recall on complex layouts like XBRL/SEC filings.

We want to use it to do something actually interesting for people, like take a historically significant, publicly available corpus that’s scattered and inaccessible and normalize it into a single clean, queryable dataset we can release for free.

We’ve been kicking around things like:
• Leonardo da Vinci’s notebooks (7,000+ pages scattered across 10+ institutions, never unified)
• Einstein’s personal papers (Princeton/Hebrew University digitized but never normalized)
• Darwin’s notebooks (Cambridge has the full archive digitized but completely scattered)

But we want to know what you actually wish existed. What corpus have you run into that’s technically public but practically unusable? What would you build on top of it if the data were clean?

Ideally something with appeal beyond researchers, but we’re open to anything.

u/ParsimmonIO — 7 days ago
▲ 2 r/datasets+2 crossposts

[Synthetic][PAID][self-promotion] Made-to-order training data generator with web search and exports

Disclosure: I’m on the Abliteration team.

We just shipped a training-data generator for people who need specific examples rather than another generic public dataset.

You describe the examples you want and it generates structured synthetic data. If the dataset needs current or real-world facts, you can turn on web search. Exports are live for Hugging Face, Kaggle, S3, and OpenAI.

The first use cases we built around are classifier and eval datasets for trust and safety: grooming detection, harassment detection, security research evals, jailbreak and edge-case sets, and similar work where teams need examples that general-purpose models often refuse to generate.

I marked this as synthetic and paid because the outputs are generated and this is a commercial tool.

Product: https://abliteration.ai/

Synthetic data page: https://abliteration.ai/use-cases/synthetic-data

Launch video: https://x.com/abliteration_ai/status/2054675554138194178

For people who curate datasets: what export format or per-row provenance metadata do you usually need before a generated dataset is usable?

u/Effective_Attempt_72 — 8 days ago

How to apply normalization to cross-sectional time series data?

I can't settle on one method. The options I've considered:

  1. Normalize each subject's full training data across all features. This introduces a kind of lookahead bias, and it also loses some information that could have been valuable. And if I want to use one model (say, regression with gradient descent) for all subjects combined, I can't judge whether this is a good method.
  2. Ignore the subjects and just normalize each feature over the full dataset. This just feels wrong to me.
  3. Cross-sectional normalization, which ranks the subjects and normalizes based on that. I'm unsure how useful that would be.
  4. A rolling window, normalizing over only the past window rather than the full data. This seems better, but it raises a lot of questions, starting with how to choose the window.

The bigger problem across all of these is the time-series structure. I would lose quite a lot of information by not accounting for it (although it is not a big factor for every feature).
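Options 3 and 4 can be sketched in a few lines of plain Python. This is an illustration under assumed names (`rolling_zscore`, `cross_sectional_rank`) and toy data, not a recommendation of either method. The rolling version scores each value against the window strictly before it, so no future observation leaks in; the rank version compares subjects at a single timestamp.

```python
from statistics import mean, pstdev


def rolling_zscore(series, window):
    """Z-score each value using only the `window` observations before it."""
    out = []
    for t, x in enumerate(series):
        past = series[max(0, t - window):t]  # excludes x itself
        if len(past) < 2:
            out.append(None)  # not enough history yet
            continue
        mu, sd = mean(past), pstdev(past)
        out.append((x - mu) / sd if sd > 0 else 0.0)
    return out


def cross_sectional_rank(values_by_subject):
    """At one timestamp, map each subject's value to a rank in [0, 1].
    Ties are broken by insertion order (a simplification)."""
    ordered = sorted(values_by_subject, key=values_by_subject.get)
    n = len(ordered)
    if n == 1:
        return {ordered[0]: 0.5}
    return {subject: i / (n - 1) for i, subject in enumerate(ordered)}


# Toy panel: one feature, two subjects on very different scales.
panel = {
    "A": [10, 11, 12, 13, 20],
    "B": [100, 98, 101, 99, 100],
}
per_subject = {s: rolling_zscore(v, window=3) for s, v in panel.items()}
print(per_subject)

# Rank subjects against each other at the last timestamp.
print(cross_sectional_rank({s: v[-1] for s, v in panel.items()}))
```

The rolling z-score keeps each subject's own dynamics and puts subjects on a comparable scale, at the cost of a warm-up period and a window length to tune. The rank discards magnitudes entirely and keeps only the ordering across subjects at each time step.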

u/Virtual-Current6295 — 9 days ago