u/TarHeelActuary

I'm an actuary, and I spent the last few weeks building a realistic insurance claims dataset. Insurance data is difficult to come by in the wild; most datasets are either too simple or completely proprietary. The usual practice datasets are retail sales or Titanic, which don't offer much value for insurance work. I wanted something that reflects how real industry data actually looks, so I built it.

It's a SQLite database covering four years of claims across employer groups: members, claims, claim lines, providers, plans, premium rates, the works. The data is realistically messy, with out-of-network pricing spreads, denial reasons, pending claims, annual maximum exhaustion, and processing lag.

Comes with 54 exercises across five tiers:

- Foundational SQL (SELECT, WHERE, GROUP BY, JOINs)

- Intermediate analytics (window functions, utilization metrics, provider analysis)

- Advanced (CTEs, self-joins, cost trend, member behavior analysis)

- Actuarial analyses (IBNR, experience rating, credibility, frequency/severity)

- Data quality investigation (duplicate claims, billing anomalies, eligibility audits)

Plus four open-ended capstone projects suitable for a portfolio (e.g. dashboards).

Full solution guide included. Works in DBeaver, DB Browser for SQLite, or any SQLite-compatible client — no server setup.
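
If you'd rather poke at it from a script than a GUI client, Python's built-in sqlite3 module should work too. Here is a minimal sketch, not taken from the guide: the table names are placeholders I made up for illustration, and an in-memory database stands in for the real file so the snippet runs on its own.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all user tables in a SQLite database, sorted."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
        "ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# With the real download you would pass the path of the .db file to
# sqlite3.connect() instead of ":memory:".
conn = sqlite3.connect(":memory:")

# Placeholder tables (hypothetical, not the dataset's actual schema).
conn.execute("CREATE TABLE members (member_id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE claims (claim_id INTEGER PRIMARY KEY, member_id INTEGER)"
)

tables = list_tables(conn)
print(tables)
```

Listing the tables first is a quick way to confirm the file opened correctly before starting the exercises.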

I published the dataset and guides on Gumroad. If interested, let me know and I will provide the link.

Hello r/actuary. I spent the last few weeks building a synthetic health insurance claims dataset. I designed it for people who want to practice the kind of analyses that show up in the real world. Throughout my own actuarial journey, I realized that publicly available data was lacking in a lot of ways. The datasets were either too simple, overly summarized, or proprietary.

The dataset covers four years of claims across employer groups with quasi-realistic benefit logic. I chose to model dental insurance because of its relative simplicity compared to major medical. The data mimics the kind of relational database you would see at an insurer. The initial version contains multiple domains: claims, groups, providers, members, and premium. I plan to expand the ecosystem in future updates.

I packaged it with 54 exercises organized by difficulty, from basic SQL up to:

- IBNR approximation and analyzing lag patterns

- Experience rating, credibility weighting, and loss ratio variance analysis

- Frequency vs severity trend decomposition

- Provider analysis
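
To give a flavor of the frequency vs severity item, here is a minimal sketch of the idea. It assumes a toy schema that I made up for illustration (`exposure` and `claims` tables with invented column names and numbers), not the dataset's actual tables. It relies on the identity cost per member = frequency × severity, so the cost trend factors into a frequency trend and a severity trend.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Toy data (hypothetical schema and values, for illustration only).
conn.executescript("""
    CREATE TABLE exposure (service_year INTEGER PRIMARY KEY, member_count INTEGER);
    CREATE TABLE claims (
        claim_id INTEGER PRIMARY KEY,
        service_year INTEGER,
        paid_amount REAL
    );
    INSERT INTO exposure VALUES (2022, 100), (2023, 100);
    INSERT INTO claims (service_year, paid_amount) VALUES
        (2022, 100.0), (2022, 200.0), (2022, 300.0),
        (2023, 150.0), (2023, 250.0), (2023, 200.0), (2023, 200.0);
""")

QUERY = """
    SELECT c.service_year,
           COUNT(*) * 1.0 / e.member_count    AS frequency,       -- claims per member
           SUM(c.paid_amount) / COUNT(*)      AS severity,        -- paid per claim
           SUM(c.paid_amount) / e.member_count AS cost_per_member -- frequency * severity
    FROM claims AS c
    JOIN exposure AS e ON e.service_year = c.service_year
    GROUP BY c.service_year, e.member_count
    ORDER BY c.service_year
"""
rows = conn.execute(QUERY).fetchall()

# Year-over-year trends between the first and last year in the result.
(_, freq0, sev0, cost0), (_, freq1, sev1, cost1) = rows[0], rows[-1]
freq_trend = freq1 / freq0 - 1
sev_trend = sev1 / sev0 - 1
cost_trend = cost1 / cost0 - 1

for row in rows:
    print(row)
print(freq_trend, sev_trend, cost_trend)
```

In this toy data the whole cost increase comes from frequency, since severity is flat; the check `(1 + freq_trend) * (1 + sev_trend) - 1 == cost_trend` holds up to rounding.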

I published this on Gumroad if you are interested and want to support the effort. Message me if you need the link. I'd appreciate feedback on it or any ideas for future iterations.

Synthetic Insurance Claims Dataset for SQL practice - 54 exercises from basic to advanced

Realistic Synthetic Health Insurance Dataset for SQL and Actuarial Analysis Practice