I made the largest public gender-labeled Japanese name dataset, 731k+ names
Built by merging 5 existing public datasets into one. I also scraped about 69k names from Wikipedia.
Kaggle Dataset. License: CC BY-SA 4.0
| Dataset | Size | Male % | Notes |
|---|---|---|---|
| Wikipedia | 69,209 | 44.1% | Real attested people, 87% have birth year |
| ENAMDICT | 116,009 | 16.4% | Dictionary-based, heavily skewed female |
| Facebook 530M leak | 392,434 | 60.6% | Largest source, kanji or kana only |
| GenDec | 64,139 | 49.8% | |
| 名前由来 | 89,635 | 60.4% | Popularity rankings, not real frequency |
| Total | 731,426 | 51.0% | |
Each individual dataset has its own gaps (size, quality, or skew), but combining them gives a more complete picture. The Wikipedia subset is the only one covering real individuals, and it has a temporal dimension through birth years. ENAMDICT skews female partly because Japanese female names have more variety. The Facebook data is massive, but each entry records either kanji or kana, not both.
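As a quick sanity check on the Total row, here is a minimal sketch that recomputes the combined size and the size-weighted male share from the per-source numbers in the table. The variable names are just for illustration. The weighted figure should land close to the 51.0% in the table; a small difference is expected because the per-source percentages are already rounded.

```python
# Sizes and male percentages copied from the table above.
sources = {
    "Wikipedia": (69_209, 44.1),
    "ENAMDICT": (116_009, 16.4),
    "Facebook 530M leak": (392_434, 60.6),
    "GenDec": (64_139, 49.8),
    "名前由来": (89_635, 60.4),
}

# Combined number of names across all five sources.
total = sum(size for size, _ in sources.values())

# Size-weighted average of the per-source male percentages.
weighted_male_pct = sum(size * pct for size, pct in sources.values()) / total

# Unweighted mean for comparison: treats every source as equally large.
simple_mean_pct = sum(pct for _, pct in sources.values()) / len(sources)

print(f"total names:        {total:,}")
print(f"weighted male %:    {weighted_male_pct:.1f}")
print(f"unweighted male %:  {simple_mean_pct:.1f}")
```

The gap between the weighted and unweighted figures shows how much the large Facebook source pulls the combined male share up, which is worth keeping in mind when sampling training data from the merged set.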
Use cases: gender inference (training classifiers without LLMs), Japanese NLP (NER, tokenization, reading prediction), and cross-source data quality research.
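To make the "classifiers without LLMs" use case concrete, here is a rough sketch of a character-level naive Bayes classifier using only the standard library. It is not the model mentioned in this post. The ten training names are toy examples typed in by hand, not rows from the dataset, and the helper names (`features`, `train`, `predict`) are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Toy training pairs for illustration only; NOT drawn from the dataset.
TRAIN = [
    ("太郎", "M"), ("一郎", "M"), ("健太", "M"), ("翔太", "M"), ("大輔", "M"),
    ("花子", "F"), ("洋子", "F"), ("美咲", "F"), ("由美", "F"), ("愛子", "F"),
]

def features(name):
    """One feature per character, plus a marker for the final character."""
    feats = [f"c={ch}" for ch in name]
    if name:
        feats.append(f"last={name[-1]}")
    return feats

def train(pairs):
    """Count labels and per-label feature occurrences."""
    label_counts = Counter()
    feat_counts = defaultdict(Counter)
    vocab = set()
    for name, label in pairs:
        label_counts[label] += 1
        for f in features(name):
            feat_counts[label][f] += 1
            vocab.add(f)
    return label_counts, feat_counts, vocab

def predict(model, name):
    """Return the label with the highest log-probability (add-one smoothing)."""
    label_counts, feat_counts, vocab = model
    n = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / n)
        # +1 in the denominator reserves mass for unseen features.
        denom = sum(feat_counts[label].values()) + len(vocab) + 1
        for f in features(name):
            score += math.log((feat_counts[label][f] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(TRAIN)
print(predict(model, "恵子"))
print(predict(model, "健太郎"))
```

With real data you would replace `TRAIN` with (name, label) pairs loaded from the dataset and hold out a test split. The final-character feature is included because name endings tend to carry a lot of the gender signal in Japanese given names, but that is a modeling assumption to validate, not a result from this dataset.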
I'm also working on a gender prediction model and will post it when it's ready. It has around 90% accuracy.