u/Careful_Sand_7838

Built by merging 5 existing public datasets into one. I also scraped the roughly 69k Wikipedia names myself.

Kaggle Dataset License: CC BY-SA 4.0

Dataset	Size	Male %	Notes
Wikipedia	69,209	44.1%	Real attested people, 87% have birth year
ENAMDICT	116,009	16.4%	Dictionary-based, heavily skewed female
Facebook 530M leak	392,434	60.6%	Largest source, kanji or kana only
GenDec	64,139	49.8%
名前由来	89,635	60.4%	Popularity rankings, not real frequency
Total	731,426	51.0%
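
For anyone who wants to sanity-check the Total row, here is a minimal sketch that recomputes it from the per-source figures in the table. It assumes the overall male share is a size-weighted average of the per-source shares, which is my reading and not something the table states. The variable names are mine.

```python
# Per-source (size, male %) copied from the table above.
SOURCES = {
    "Wikipedia": (69_209, 44.1),
    "ENAMDICT": (116_009, 16.4),
    "Facebook 530M leak": (392_434, 60.6),
    "GenDec": (64_139, 49.8),
    "名前由来": (89_635, 60.4),
}

# Total size is the plain sum of the source sizes.
total = sum(size for size, _ in SOURCES.values())

# Assumption: the overall male share is the size-weighted mean of the source shares.
weighted_male = sum(size * pct for size, pct in SOURCES.values()) / total

print(f"total={total:,} male={weighted_male:.2f}%")
```

The sizes add up to exactly 731,426. The weighted share comes out at about 51.06% from the rounded inputs, which is close to the 51.0% in the table; the small gap is presumably rounding in the per-source percentages.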

Each individual dataset has its own gaps (size, quality, or skew), but combining them gives a more complete picture. The Wikipedia subset is the only one covering real individuals and has a temporal dimension through birth years. ENAMDICT skews female partly because Japanese female names have more variety. The Facebook data is massive but only records kanji or kana, not both.
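
To make the idea of combining sources concrete, here is a rough sketch of one way to merge rows while keeping track of which source said what. This is not the actual pipeline used for the dataset: the row layout `(source, kanji, kana, gender)`, the example rows, and the function names `merge` and `conflicts` are all invented for illustration. Rows with only kanji or only kana (as in the Facebook data) are kept with the missing field set to `None`.

```python
from collections import defaultdict

# Hypothetical rows, invented for illustration: (source, kanji, kana, gender).
# Either script may be missing, so the missing one is None.
ROWS = [
    ("wikipedia", "太郎", "たろう", "M"),
    ("facebook", "太郎", None, "M"),
    ("facebook", None, "はなこ", "F"),
    ("enamdict", "花子", "はなこ", "F"),
    ("enamdict", "薫", "かおる", "F"),
    ("gendec", "薫", "かおる", "M"),
]

def merge(rows):
    """Group rows by (kanji, kana) and record which sources gave each label."""
    merged = defaultdict(lambda: {"M": set(), "F": set()})
    for source, kanji, kana, gender in rows:
        merged[(kanji, kana)][gender].add(source)
    return merged

def conflicts(merged):
    """Return the keys where sources disagree on the gender label."""
    return [key for key, votes in merged.items() if votes["M"] and votes["F"]]

merged = merge(ROWS)
print(len(merged), conflicts(merged))
```

Keeping the source per label is what makes cross-source comparisons possible later, for example listing names where two sources disagree.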

Use cases: gender inference (training classifiers without LLMs), Japanese NLP (NER, tokenization, reading prediction), and cross-source data quality research.
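
As an illustration of the "classifiers without LLMs" use case, here is a tiny naive Bayes sketch over the first and last character of a name, using only the standard library. It is not the model mentioned in this post. The training rows are a handful of names I made up for the example rather than rows from the dataset, and the feature choice is an assumption.

```python
import math
from collections import Counter, defaultdict

# Toy training rows, invented for illustration (not taken from the dataset).
TRAIN = [
    ("太郎", "M"), ("次郎", "M"), ("健太", "M"), ("翔太", "M"),
    ("花子", "F"), ("洋子", "F"), ("美咲", "F"), ("七海", "F"),
]

def features(name):
    # Assumption: first and last characters carry most of the signal.
    return ["first=" + name[0], "last=" + name[-1]]

def train(rows):
    class_counts = Counter()
    feat_counts = defaultdict(Counter)
    vocab = set()
    for name, label in rows:
        class_counts[label] += 1
        for f in features(name):
            feat_counts[label][f] += 1
            vocab.add(f)
    return class_counts, feat_counts, vocab

def predict(model, name):
    class_counts, feat_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        # Log prior plus log likelihoods with add-one (Laplace) smoothing.
        score = math.log(n / total)
        denom = sum(feat_counts[label].values()) + len(vocab)
        for f in features(name):
            score += math.log((feat_counts[label][f] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

model = train(TRAIN)
print(predict(model, "一郎"), predict(model, "恵子"))
```

With the real dataset you would swap `TRAIN` for the labeled rows and hold out a test split before trusting any accuracy number.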

I'm also working on a gender prediction model and will post it when it's ready. It has around 90% accuracy.

I made the largest public gender-labeled Japanese name dataset, 731k+ names