r/datasets

▲ 7 r/datasets+1 crossposts

Historical Return Data Files

Getting good data is a big hurdle for retail investors. Reliable return histories are often locked behind thousand-dollar-a-year subscriptions. But you can get a lot for free.

I put together a small return dataset covering developed-market stocks, sovereign bonds, interest rates, and currencies.

The goal is to consolidate the kinds of return series that are useful for testing global asset allocation strategies, especially those involving foreign equity, sovereign bonds, currency hedging, and excess returns.

The dataset includes 50+ years of coverage across several files, all available for free. Check it out!

https://github.com/birjusuketupatel/ReturnDataFiles/tree/main

reddit.com

u/NecessarySpread2592 — 1 day ago

▲ 6 r/datasets+1 crossposts

Datasets versioning

Hey folks,

How are y'all managing dataset versions? Does Unity Catalog have this feature, or are you using a third-party tool? I am looking for something that keeps track of data changes: when the data was last updated, what was updated, etc.

reddit.com

u/Severe-Committee87 — 1 day ago

▲ 2 r/datasets+1 crossposts

[Spreadsheet newbie] a simple functionality that doesn't seem to exist?

Hi, I'm a spreadsheet newbie here (but I'm working on it!). Recently, I wanted to do something that seems straightforward and basic to me, but it seems (after consulting AI as well) that there is no native, straightforward way to do it.

In particular, I want a formula to run once and return a value, writing it into the cell as plaintext (independent from any input cells), and I want this done only for specific cells and not the whole document. This can be in either Google Sheets or Excel.

In other words, say I have data in the "a" column. I want each cell in the "b" column to have a result based on its corresponding "a" cell (doesn't matter what; let's say it's just adding 1 to it for simplicity's sake). Crucially, I want that result in the "b" column to remain if I were to delete or change the contents of the "a" column. So, phrased differently, I want the result of the function to be written in the cells of the "b" column in plaintext, say once I hit "enter" or something like that.

Solutions that AI has offered me include copying and pasting with "paste special, values only", writing an extension script, or changing how the entire document behaves around formulas (which wouldn't work because I only want this to apply to a certain cell range).

I understand WHY this could be tricky (cells have no concept of "time", formulas by default are dynamic, etc.), but it still seems like it should be a very simple native functionality: have the result of this formula be written in this cell as plaintext. Am I missing something?

reddit.com

u/SunshineProvides — 1 day ago

▲ 24 r/datasets+8 crossposts

World Atlas

reddit.com

u/No_Twist6127 — 3 days ago

▲ 23 r/datasets+4 crossposts

GitHub - onhexgroup/TABPE: A monthly Windows PE baseline dataset for Cyber security researchers

github.com

u/seyyid_ — 3 days ago

▲ 88 r/datasets+1 crossposts

Dataset: global real interest rates from 1311 to 2018. Schmelzing (2020), 8 countries, annual sovereign bond yields.

datahub.io

u/anuveya — 4 days ago

▲ 11 r/datasets+5 crossposts

I engineered 102 leakage-free ML features from 49,000+ international football matches (1872–2026) and published it as a free dataset

Been working on a football prediction project and couldn't find a dataset that had the actual context needed to model match outcomes; it was just raw results everywhere. So I built one from scratch on top of the International Football Results dataset by Mart Jürisoo (the well-known one on Kaggle with 49,000+ matches going back to 1872).

What I added:

**Elo ratings**: built from scratch, updated after every single match across 150 years. Both teams' ratings, their difference, and the expected win probability going into each match.

**Rolling form**: win rate, goals scored, goals conceded, goal difference, clean sheet rate, both-teams-scored rate, scoring rate, and win streak. Computed at three lookback windows: last 5, last 10, and last 20 matches. For both teams.

**Head-to-head history**: based on the last 10 meetings between those two specific teams. Some teams have persistent edges over specific opponents that their general form doesn't explain.

**Fatigue signals**: days since each team's last match and the difference between the two.

**Penalty reliance**: fraction of each team's historical goals that came from penalties, pulled from the goalscorer dataset.

**Shootout composure**: historical penalty shootout win rate for each team, from the shootouts dataset.

**Tournament context**: World Cup, qualifier, friendly, neutral venue, competition importance weight, confederation.

The thing I spent the most time on: every feature is computed in strict chronological order using only data that existed before that match was played. State updates happen after each row is recorded, never before. No lookahead, no leakage anywhere in the 102 columns.
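
The record-then-update ordering described above can be sketched for the Elo columns as follows. This is a minimal illustration, not the dataset's actual pipeline: the K-factor of 20, the starting rating of 1500, and the names `expected_score` and `build_elo_features` are all assumptions made for the example.

```python
from collections import defaultdict


def expected_score(rating_a, rating_b):
    """Standard Elo expected score for side A against side B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def build_elo_features(matches, k=20.0, start=1500.0):
    """Build one feature row per match from pre-match ratings only.

    matches: iterable of (iso_date, home, away, result), result in {"H", "D", "A"}.
    k and start are illustrative defaults, not the dataset's values.
    """
    ratings = defaultdict(lambda: start)
    rows = []
    # ISO date strings sort chronologically, so a plain sort is enough here.
    for date, home, away, result in sorted(matches, key=lambda m: m[0]):
        # 1) Record features from the state as it stood before kickoff.
        r_home, r_away = ratings[home], ratings[away]
        p_home = expected_score(r_home, r_away)
        rows.append({
            "date": date, "home": home, "away": away,
            "elo_home": r_home, "elo_away": r_away,
            "elo_diff": r_home - r_away, "p_home": p_home,
            "result": result,
        })
        # 2) Only now fold the outcome into the state.
        score_home = {"H": 1.0, "D": 0.5, "A": 0.0}[result]
        delta = k * (score_home - p_home)
        ratings[home] = r_home + delta
        ratings[away] = r_away - delta
    return rows
```

Because the row is appended before the ratings change, a match's own result can never influence its own features; swapping steps 1 and 2 is exactly the kind of leakage the post is guarding against.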

102 features total. 49,094 rows. The result column (H/D/A) is included as the label. Drop date and result, plug into any classifier.

Dataset is fully documented with column descriptors for every feature.

Link: https://www.kaggle.com/datasets/kriishgulati/football-match-results-1872-2026-with-ml-features

Built on top of the original dataset by Mart Jürisoo; full credit and link are in the dataset description.

kaggle.com

u/Kriish_Gulati — 3 days ago

▲ 4 r/datasets+2 crossposts

Is it possible to build an AI-powered platform that automatically transforms messy, complex medical data into reliable, research-ready data for analysis and AI models? Is it worth investing in it?

Recently I've come across this query on many platforms.

Here is what I think:

First of all, healthcare data is a completely different beast. Building an AI solution for medical data quality isn't just about fixing duplicate records or filling in missing values. To build an AI-powered model to turn messy data into clean and accurate training data, you need a large volume of representative and relevant medical data.

There are challenges involved in collecting medical data for research, analytics, and AI models. Here are some of the biggest ones:

You need access to large, diverse, and representative patient datasets from different hospitals, regions, and healthcare systems to build a reliable model.
Clinical notes tend to be messy -- doctors' handwriting, abbreviations, and local terminology can make identification and standardization extremely difficult.
Medical coding standards also evolve regularly, so your system has to keep up with those changes.
And because healthcare is heavily regulated, handling sensitive patient information means de-identification, privacy, and compliance aren't optional but crucial.
Staggering ambiguities in clinical data still require domain experts to validate and resolve.

These are areas where healthcare data annotation companies, who work with AI companies, have already invested heavily.

Give it a thought when you are looking to build a model.

What do you guys have to say?

reddit.com

u/manuspresso — 4 days ago

▲ 1 r/datasets

How to deal with null values for a health prediction dataset?

Hi! So I have this dataset where the objective is to predict a student's health risk, but I'm a lil confused about how to handle the null values. These are the % of null values for the columns:

id                          0.000000
health_condition            0.000000
sleep_duration             11.012943
heart_rate                  1.135073
bmi                         2.013946
calorie_expenditure         7.658878
step_count                  2.016554
exercise_duration           1.000017
water_intake                6.300211
diet_type                   1.000017
stress_level               12.000064
sleep_quality               8.452690
physical_activity_level     5.306715
smoking_alcohol             4.141791
gender                      3.097141
dtype: float64

What would you recommend I do for these values? If I were to drop the rows with nulls in the columns that are <5% null, I would be losing nearly 100,000 rows (out of 700,000), which I don't think is all that good. I thought of using K-means to fill the null BMI values but I don't know.

I would appreciate any advice! Thanks :)

reddit.com

u/Defiant-Ad3530 — 5 days ago

▲ 1 r/datasets

Looking for dataset of surnames with compound names uncompressed

I'm trying to find a database of surnames for use in writing/testing code that converts an author name (e.g., "Stan Sieler") into a sortable/alphabetizable name (e.g., "Sieler, Stan").

Many surnames are compound ("de Camp", "Cartwright-Chickering" (bonus for people who recognize that one!)), some with and some without hyphens, and some with more than two words.

The U.S. Census database isn't useful to me ... they compress all last names, removing spaces.

(I'm ignoring people like "Arthur Conan Doyle", whose last name at birth was "Doyle", but later adopted the practice of using "Conan Doyle" as his surname ... confusing librarians around the world :)
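
To make the conversion concrete, here is a minimal sketch of the kind of code such a dataset would feed. The `PARTICLES` set and the `sortable_name` function are hypothetical stand-ins: a real surname list would replace the hand-written particles, and this heuristic only handles hyphenated surnames and surnames introduced by a known particle.

```python
# Hypothetical particle list for illustration only; a real surname
# dataset would replace this hand-written set.
PARTICLES = {"de", "del", "van", "von", "der", "la", "le", "di"}


def sortable_name(full_name):
    """Convert "Given Surname" into "Surname, Given".

    Hyphenated surnames stay intact because they are a single token;
    space-separated compounds are kept together only when the words
    before the last one are known particles.
    """
    parts = full_name.split()
    if len(parts) < 2:
        return full_name  # single-word names are returned unchanged
    # Start with the last word as the surname, then absorb particles
    # to its left, always leaving at least one given name.
    split_at = len(parts) - 1
    while split_at > 1 and parts[split_at - 1].lower() in PARTICLES:
        split_at -= 1
    surname = " ".join(parts[split_at:])
    given = " ".join(parts[:split_at])
    return f"{surname}, {given}"


print(sortable_name("Stan Sieler"))
print(sortable_name("Jane de Camp"))
print(sortable_name("Jane Cartwright-Chickering"))
```

A particle list like this breaks down quickly on multi-word surnames that contain no particle at all, which is why an uncompressed surname dataset would be the better source of truth.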

Any pointers appreciated, thanks!

reddit.com

u/Ssieler — 5 days ago

▲ 6 r/datasets

I pulled data from 1.5 million US websites - what data would you want to know?

Started out with a question: how do I spend $300 in free GCP credits, and how much could I do with it? I started with figuring out how to query the HTTP Archive, pulling CrUX data to correlate sites, and learning a bit about BigQuery along the way. I went from ~12 million total sites and pared that down to 1.5 million that I could verify were live and had enough data to classify/categorize, and then built a front end to access the highlights.

So far, I've been focused on identifying key business segments with missing opportunities, classic one click misses, some schema mapping for business type, and wondering why in the world any sane business owner would use Weebly.

What would YOU want to know?

reddit.com

u/gillygangopolus — 7 days ago

▲ 118 r/datasets+1 crossposts

I processed the entire arXiv LaTeX source corpus (3M+ papers) into a metadata-aligned Parquet dataset to save on S3 egress fees

I’ve spent the last few weeks working on a pipeline to solve a problem that has frustrated me (and likely other researchers) for a while: working with arXiv source files at scale.

If you have ever tried to analyze the LaTeX source code of arXiv papers, you have probably run into two major roadblocks:

The Egress Tax: arXiv’s official bulk S3 bucket is configured as "requester-pays." If you try to download the complete 5 TB corpus to any machine outside of the AWS us-east-1 region, you get hit with standard egress fees. At $0.09 per GB, a single full download can cost over $450 in bandwidth alone.
Unpacking Pain: The raw S3 data is packaged as hundreds of nested .tar archives containing gzipped payloads of individual papers. Extracting these, parsing the inner LaTeX code, and matching the files with their JSON metadata snapshots is quite CPU-intensive and requires a lot of boilerplate ingestion code.
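
The egress figure above is simple arithmetic, sketched below. It assumes a flat per-GB rate with no pricing tiers, which is a simplification; `egress_cost` is a name made up for this example.

```python
CORPUS_TB = 5          # corpus size quoted in the post
PRICE_PER_GB = 0.09    # USD per GB, the egress rate quoted in the post


def egress_cost(size_tb, price_per_gb, gb_per_tb=1000):
    """Flat-rate transfer cost in USD (assumes no tiered pricing)."""
    return size_tb * gb_per_tb * price_per_gb


decimal_cost = egress_cost(CORPUS_TB, PRICE_PER_GB)        # 1 TB = 1000 GB
binary_cost = egress_cost(CORPUS_TB, PRICE_PER_GB, 1024)   # 1 TB = 1024 GB
print(f"${decimal_cost:.2f} to ${binary_cost:.2f}")
```

Depending on whether a terabyte is counted as 1000 or 1024 GB, the result lands at roughly $450 to $461 per full download.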

To make this easier, I built a pipeline that runs inside AWS us-east-1 (where transfer is free), pulls the raw source files, unpacks them, matches them with the official metadata, and bundles them into ready-to-query Parquet partitions.

HuggingFace Dataset Link: https://huggingface.co/datasets/scholarweave/arxiv

What is inside:

Each row represents a single paper and contains both the official metadata and the parsed source files:

Core Metadata: id, title, authors, abstract, doi, categories, license, versions, etc.
latex (Large String): The parsed, compiled LaTeX source code from the paper. I wrote a parser to bundle the primary .tex, .bib, and .sty files into a single, readable Markdown-style tree structure.

Maintenance & Syncing:

Monthly Updates: I plan to sync the pipeline once a month to capture new uploads.
Resilient Syncing: I maintain an XML manifest file in the HuggingFace repository (arxiv_parquet_manifest.xml) that maps each Parquet partition to its size, MD5 checksum, and the raw S3 .tar source files used to generate it. This should make incremental syncing or troubleshooting much easier.
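
As a rough sketch of how the manifest's checksums could be used after a download: the helper names below are hypothetical, and the expected MD5 and size are assumed to have been read out of the manifest already, since its XML schema is not described here.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Stream a file through MD5 so large partitions never sit fully in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def partition_is_intact(path, expected_md5, expected_size=None):
    """Compare a downloaded partition against manifest values.

    expected_md5 and expected_size are assumed to come from the manifest;
    the cheap size check runs first when a size is supplied.
    """
    if expected_size is not None and os.path.getsize(path) != expected_size:
        return False
    return md5_of_file(path) == expected_md5.lower()
```

Any partition that fails the check can be re-downloaded on its own, which is what makes incremental syncing practical.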

If you are working on NLP, training LLMs on scientific text, analyzing citation networks, or doing sociolinguistic research, hopefully this saves you some time and cloud budget.

u/Invicto_50 — 11 days ago

▲ 11 r/datasets+1 crossposts

Anyone here into niche dataset creation? 🇧🇷📊🔥

Hey folks,

I’ve started a small generative dataset project (it made some money as well) and now I’m trying to find Brazilians who are into the same weird corner of the universe. I’m not Brazilian myself (from The Netherlands, learning the language for 2 years now etc), but I’m super curious about how people in Brazil think about niche dataset creation, cultural data, local knowledge, all that good stuff that only Brazil seems to produce in unlimited supply.

If you’re in Brazil and you’re experimenting with:

• niche or unusual datasets
• indie AI projects
• creative data ideas that usually start in a WhatsApp group at 2am
• or you just enjoy talking about how to turn Brazilian chaos into structured information

I’d love to connect.

Thinking of making a WhatsApp group just to gather people who enjoy this kind of thing. Nothing formal, nothing corporate, just that classic “Brazilian startup energy” where everyone is building something strange and ambitious at the same time.

If this sounds fun, drop a comment. I’d love to meet more people in Brazil who are into dataset‑making adventures.

reddit.com

u/Gold-Translator3210 — 10 days ago

▲ 22 r/datasets+1 crossposts

Every provider reports a different P/E and hides the formula, I open-sourced one where all 200+ metrics show their math

Run a factor backtest on fundamental signals and your historical P/E series often won't match what you'd actually have seen at the time. The usual cause: trailing-twelve-month vs fiscal-year-end earnings, diluted vs basic shares, different treatment of extraordinary items. Most providers don't document which they use. For example, Microsoft's P/E ratio on May 6, 2023 is reported to be 28.93 (Stockopedia), 32.05 (Morningstar), 32.66 (Macrotrends), 33.09 (Finance Charts), 33.66 (Y Charts), 33.67 (Wall Street Journal), 33.80 (Yahoo Finance) and 34.4 (Companies Market Cap). By contrast, I report it as 32.07 (diluted), 31.96 (non-diluted), 30.07 (TTM diluted) and 29.93 (TTM non-diluted).

So that's why I open-sourced the math. The P/E, for example, is as simple as stock_price / earnings_per_share, where earnings per share is (net_income - preferred_dividends) / average_outstanding_shares. A layer on top, which is also documented, lets you work with trailing, TTM and growth-rate variants of any metric. This means you can reproduce any provider's P/E if you know which definition they used (although not knowing that is often the issue).
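
The two formulas can be written out directly. This sketch only restates the definitions given in the post and is not the library's code; the input numbers are made up to show how the share-count definition alone moves the ratio.

```python
def earnings_per_share(net_income, preferred_dividends, average_outstanding_shares):
    """EPS = (net_income - preferred_dividends) / average_outstanding_shares."""
    return (net_income - preferred_dividends) / average_outstanding_shares


def price_to_earnings(stock_price, eps):
    """P/E = stock_price / earnings_per_share."""
    return stock_price / eps


# Made-up inputs: same price and income, two different share counts.
basic_eps = earnings_per_share(1_000.0, 0.0, 100.0)     # fewer shares
diluted_eps = earnings_per_share(1_000.0, 0.0, 125.0)   # more shares after dilution

print(price_to_earnings(200.0, basic_eps))    # 20.0
print(price_to_earnings(200.0, diluted_eps))  # 25.0
```

With identical price and income, switching from the basic to the diluted share count moves the P/E from 20 to 25, the same kind of gap seen between providers.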

The same idea applies to all the other metrics (about 200), whether fundamentals, technicals, risk or performance. Here's CAPM, VaR and Max Drawdown for NVDA, AMD and INTC at five-year intervals since 2000:

# Install first: pip install financetoolkit -U

from financetoolkit import Toolkit

semis = Toolkit(["NVDA", "AMD", "INTC"], api_key="FMP_KEY", start_date="2000-01-01")
capm = semis.performance.get_capital_asset_pricing_model(period="yearly")
var  = semis.risk.get_value_at_risk(period="yearly")
mdd  = semis.risk.get_maximum_drawdown(period="yearly")

CAPM (expected annual return):

Year	NVDA	AMD	INTC
2010	16.9%	20.2%	13.7%
2015	-1.3%	-1.3%	-0.8%
2020	21.6%	19.0%	18.9%
2025	27.0%	29.2%	22.9%

VaR (95% historic, worst expected daily loss):

Year	NVDA	AMD	INTC
2010	-4.8%	-5.0%	-2.7%
2015	-2.7%	-5.2%	-2.3%
2020	-5.7%	-5.4%	-4.2%
2025	-4.5%	-5.4%	-5.3%

Max Drawdown (peak-to-trough within year):

Year	NVDA	AMD	INTC
2010	-53.0%	-44.8%	-27.1%
2015	-17.7%	-51.1%	-29.9%
2020	-37.6%	-34.3%	-35.6%
2025	-36.9%	-39.6%	-33.8%
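
For readers who want to see the math behind the last two tables, here is a textbook-style sketch of historic VaR and maximum drawdown. It is not necessarily how the library computes them; the function names and the linear-interpolation quantile rule are assumptions made for this example.

```python
def historic_var(returns, alpha=0.05):
    """Empirical alpha-quantile of returns (95% historic VaR when alpha=0.05).

    Uses linear interpolation between order statistics; other quantile
    conventions give slightly different numbers.
    """
    ordered = sorted(returns)
    pos = alpha * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def max_drawdown(prices):
    """Largest peak-to-trough decline, as a negative fraction of the peak."""
    peak = prices[0]
    worst = 0.0
    for price in prices:
        peak = max(peak, price)                  # running high-water mark
        worst = min(worst, price / peak - 1.0)   # decline from that mark
    return worst


# Made-up price path: rises to 120, then falls to 90 before recovering.
print(max_drawdown([100.0, 120.0, 90.0, 110.0]))  # -0.25
```

Applied to one calendar year of daily data at a time, these give per-year figures of the same shape as the tables above.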

The whole point is to make things more transparent. When you're creating models, it is especially important that the metric you're training on is actually defined the way you think it is.

While the library uses Financial Modeling Prep and Yahoo Finance as a default source, I've added in logic so you can swap in your own data provider or even a local CSV to not be provider-dependent. This way it should really become auditable for your backtests. The library is MIT-licensed, so you can use it in your own projects without restriction.

u/Traditional_Yogurt — 11 days ago

▲ 4 r/datasets+1 crossposts

I am creating a stock market tool, Need some help with data

Hello traders, I am building software for day traders, but I need tradebook data of at least 1000 INTRADAY trades (futures, options or equity) to test on.

If anyone is willing to share, please DM me.

Will let you know more if it succeeds.

reddit.com

u/Rajnish357 — 10 days ago

▲ 12 r/datasets+5 crossposts

Built APIs for Aussie StartUps, trade contractor rates and PBS drug pricing (plus rental and subscription data)

Been working on this for a while and finally feel like it’s worth sharing.
If you’ve ever tried to get structured, up-to-date Australian data into an app (rental prices, drug costs, trade pricing), you know how painful it is. Either you’re scraping PDFs, wrestling with government portals, or paying for enterprise data contracts you can’t afford as a small team.
So I built an API that covers four datasets I kept needing myself:
• Rental prices: median weekly rent by suburb, postcode, and bedroom count across Australia (quarterly data going back to 2000)
• PBS drug pricing: 14,000+ medications with patient copayment costs (what you actually pay at the chemist, not just the benefit price)
• Trade/contractor pricing: what plumbers, electricians, etc. charge across different states
• Subscription pricing: SaaS and streaming prices across AU, US, UK for comparison tools
Single API key, consistent response format across all four. Also ships with an MCP server so you can wire it straight into an AI agent without writing API calls.
Mostly aimed at fintech apps, healthtech, real estate tools, and anyone building something that needs this kind of reference data without standing up their own scraper.
Free tier available — you can register and start hitting endpoints in under a minute.
Docs + free key: https://api.aristocles.com.au/docs
More info: https://aristocles.com.au
Happy to answer questions about the data sources or coverage gaps.

Questions or want to talk enterprise access?
https://aristocles.com.au/contact

reddit.com

u/Fit_Mango7142 — 11 days ago

▲ 0 r/datasets

Need LinkedIn profile data of everyone

I need a dataset of all LinkedIn profiles. I know there are some paid sources for this, but I want a free source. The reason I want a free source is that it makes no sense to pay for data: if I have to pay for data, why can’t I then just sell that data for half price to other people after buying it?

reddit.com

u/Overall-Suspect7760 — 14 days ago

▲ 1 r/datasets+2 crossposts

[Collaboration] Analyzing Luxury Watches as Alternative Investments (5-Year Auction Dataset)

Hello,

I'm a student researching the secondary market for luxury watches, and I have 5 years of auction data.

My goal is to do a comparison on returns and volatility to see if they hold up as alternative investments.

Since I lack the programming background (Python/R) and can't afford to pay a consultant, I am looking for a co-author to tackle this with me.
If you need a unique, real-world dataset for a portfolio project, let's partner up.

I'll provide the raw material, and you can build out the statistical analysis.

Let me know if you are interested in collaborating!

reddit.com

u/figuringitout1269 — 14 days ago

▲ 35 r/datasets+1 crossposts

UK macroeconomic data from the Domesday Book to the present: the Bank of England's thousand-year dataset

datahub.io

u/anuveya — 14 days ago