u/Careful_Sand_7838


I made the largest public gender-labeled Japanese name dataset, 731k+ names

Built by merging 5 existing public datasets into one. I also scraped about 69k names from Wikipedia.

Kaggle Dataset | License: CC BY-SA 4.0

| Dataset | Size | Male % | Notes |
|---|---|---|---|
| Wikipedia | 69,209 | 44.1% | Real attested people, 87% have birth year |
| ENAMDICT | 116,009 | 16.4% | Dictionary-based, heavily skewed female |
| Facebook 530M leak | 392,434 | 60.6% | Largest source, kanji or kana only |
| GenDec | 64,139 | 49.8% | |
| 名前由来 | 89,635 | 60.4% | Popularity rankings, not real frequency |
| Total | 731,426 | 51.0% | |
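If you want to sanity-check the totals, here is a minimal sketch that recomputes the combined size and the size-weighted male share from the per-source rows above. The variable and function names are just for illustration, and the percentages are the rounded values from the table, so the weighted share can differ from the stated total by about a tenth of a point.

```python
# (source, size, male share in %) copied from the table above
SOURCES = [
    ("Wikipedia", 69_209, 44.1),
    ("ENAMDICT", 116_009, 16.4),
    ("Facebook 530M leak", 392_434, 60.6),
    ("GenDec", 64_139, 49.8),
    ("名前由来", 89_635, 60.4),
]


def combined_stats(sources):
    """Return (total size, size-weighted male share in %)."""
    total = sum(size for _, size, _ in sources)
    # Estimated male count per source, from the rounded percentages
    male = sum(size * pct / 100 for _, size, pct in sources)
    return total, 100 * male / total


total, male_pct = combined_stats(SOURCES)
print(f"{total:,} names, roughly {male_pct:.1f}% male")
```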

Each individual dataset has its own gaps (size, quality, or skew), but combining them gives a more complete picture. The Wikipedia subset is the only one covering real individuals, and its birth years (available for 87% of entries) add a temporal dimension. ENAMDICT skews female partly because Japanese female names have more variety. The Facebook data is massive, but each entry records the name in either kanji or kana, never both.
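To give an idea of what combining sources can look like, here is a rough sketch, not the actual pipeline used for this dataset. It assumes a hypothetical `(name, gender, source)` row format, tags each name as kanji or kana by Unicode block, and records which sources back each gender label so that cross-source disagreements stay visible.

```python
from collections import defaultdict


def script_of(name):
    """Classify a name as 'kana', 'kanji', or 'mixed' by Unicode block."""
    def is_kana(ch):
        # Hiragana and katakana blocks (includes the long-vowel mark)
        return "\u3040" <= ch <= "\u30ff"

    def is_kanji(ch):
        # CJK Unified Ideographs plus the iteration mark
        return ch == "々" or "\u4e00" <= ch <= "\u9fff"

    if not name:
        return "empty"
    if all(is_kana(c) for c in name):
        return "kana"
    if all(is_kanji(c) for c in name):
        return "kanji"
    return "mixed"


def merge_sources(rows):
    """Group rows by name; keep the set of sources behind each gender label."""
    merged = {}
    for name, gender, source in rows:
        entry = merged.setdefault(
            name, {"script": script_of(name), "labels": defaultdict(set)}
        )
        entry["labels"][gender].add(source)
    return merged


# Hypothetical rows, for illustration only
rows = [
    ("太郎", "M", "Wikipedia"),
    ("太郎", "M", "ENAMDICT"),
    ("あきら", "M", "Facebook"),
    ("あきら", "F", "GenDec"),
]
merged = merge_sources(rows)
# Names with more than one gender label are cross-source conflicts
conflicts = [n for n, e in merged.items() if len(e["labels"]) > 1]
```

Keeping the source sets around instead of collapsing to one label makes it easy to weight or filter sources later.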

Use cases: gender inference (training classifiers without LLMs), Japanese NLP (NER, tokenization, reading prediction), cross-source data quality research
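As an example of the first use case, here is a tiny character-level naive Bayes baseline in plain Python. It is only a sketch of the general idea and not the model mentioned in this post: the class name, the feature choice (each character plus a marked final character), and the six training names are all made up for illustration.

```python
import math
from collections import Counter, defaultdict


def features(name):
    # Each character, plus the final character marked with "$".
    # Assumes a non-empty name.
    return list(name) + ["$" + name[-1]]


class NaiveBayesGender:
    """Multinomial naive Bayes with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, names, labels):
        for name, label in zip(names, labels):
            self.class_counts[label] += 1
            for f in features(name):
                self.feat_counts[label][f] += 1
                self.vocab.add(f)
        return self

    def predict(self, name):
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.feat_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            # Log prior plus smoothed log likelihood of each feature
            score = math.log(n / total)
            for f in features(name):
                score += math.log((counts[f] + self.alpha) / denom)
            if score > best_score:
                best, best_score = label, score
        return best


# Toy training data, hypothetical and far too small for real use
names = ["はるこ", "ゆきこ", "あいこ", "たろう", "じろう", "いちろう"]
labels = ["F", "F", "F", "M", "M", "M"]
model = NaiveBayesGender().fit(names, labels)
pred_f = model.predict("みちこ")
pred_m = model.predict("さぶろう")
```

On the real data you would train on one source and evaluate on another, since the sources differ a lot in gender balance.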

I'm also working on a gender prediction model. It currently reaches around 90% accuracy, and I'll post it when it's ready.

u/Careful_Sand_7838 — 6 days ago