Synthetic Insurance Claims Dataset for SQL practice - 54 exercises from basic to advanced
I'm an actuary, and I spent the last few weeks building a realistic insurance claims dataset. Insurance data is hard to come by in the wild; most datasets are either too simple or completely proprietary. The usual practice datasets are retail sales or Titanic, which don't help much if you want to work with insurance data. I wanted something that reflects how real industry data actually looks, so I built it.
It's a SQLite database covering four years of claims across employer groups: members, claims, claim lines, providers, plans, premium rates, the works. The data is realistically messy: out-of-network pricing spreads, denial reasons, pending claims, annual maximum exhaustion, processing lag.
Comes with 54 exercises across five tiers:
Foundational SQL (SELECT, WHERE, GROUP BY, JOINs)
Intermediate analytics (window functions, utilization metrics, provider analysis)
Advanced (CTEs, self-joins, cost trend, member behavior analysis)
Actuarial analyses (IBNR, experience rating, credibility, frequency/severity)
Data quality investigation (duplicate claims, billing anomalies, eligibility audits)
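To give a flavor of the data quality tier, here is a rough sketch of a duplicate-claim check. The `claims` table, its columns, and the sample rows below are all made up for illustration; the actual schema in the dataset may differ. The idea is just that two claims for the same member, provider, service date, and billed amount are worth a second look.

```python
import sqlite3

# Hypothetical schema and rows for illustration only; the real dataset's
# table and column names may differ.
def build_demo_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE claims (
            claim_id      INTEGER PRIMARY KEY,
            member_id     INTEGER NOT NULL,
            provider_id   INTEGER NOT NULL,
            service_date  TEXT    NOT NULL,
            billed_amount REAL    NOT NULL
        )
        """
    )
    conn.executemany(
        "INSERT INTO claims VALUES (?, ?, ?, ?, ?)",
        [
            (1, 100, 10, "2023-01-15", 250.0),
            (2, 100, 10, "2023-01-15", 250.0),  # same service billed twice
            (3, 101, 10, "2023-01-15", 250.0),  # different member
            (4, 100, 11, "2023-02-01", 90.0),   # different provider and date
        ],
    )
    return conn

# Group on the fields that should identify one service event and keep
# only the groups that contain more than one claim.
DUPLICATE_SQL = """
SELECT member_id,
       provider_id,
       service_date,
       billed_amount,
       COUNT(*)      AS n_claims,
       MIN(claim_id) AS first_claim_id
FROM claims
GROUP BY member_id, provider_id, service_date, billed_amount
HAVING COUNT(*) > 1
"""

def find_duplicate_claims(conn: sqlite3.Connection) -> list:
    """Return one row per group of suspected duplicate claims."""
    return conn.execute(DUPLICATE_SQL).fetchall()

if __name__ == "__main__":
    for row in find_duplicate_claims(build_demo_db()):
        print(row)
```

A match here is only a candidate. Legitimate repeat services on the same day can look identical at this grain, so a real audit would likely drill into the claim lines before calling anything a duplicate.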
Plus four open-ended capstone projects suitable for a portfolio (e.g. dashboards).
Full solution guide included. Works in DBeaver, DB Browser for SQLite, or any SQLite-compatible client, with no server setup.
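Since it's a plain SQLite file, Python's built-in `sqlite3` module should be able to open it too. A minimal sketch for listing the tables; the `claims.db` filename in the comment is a placeholder, not the actual file name.

```python
import sqlite3

def list_tables(conn: sqlite3.Connection) -> list[str]:
    """Return user table names from SQLite's built-in schema catalog."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
        "ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]

# Usage (placeholder path; note that sqlite3.connect creates an empty
# file if the path does not exist, so double-check the location):
#
#   conn = sqlite3.connect("claims.db")
#   print(list_tables(conn))
#   conn.close()
```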
I published the dataset and guides on Gumroad. If interested, let me know and I will provide the link.