I'm building an Ekegusii ↔ English NLP translator for a critically low-resource Bantu language in Kenya. Here's where I am and what I'm figuring out next
Hey everyone 👋 Long-time lurker, first-time poster. I've been self-teaching NLP over the past few months and got hit with an idea I can't shake: building a machine translation system for Ekegusii (also called Gusii), a Bantu language spoken by the Gusii people in western Kenya, with roughly 2–3 million speakers.
Ekegusii is critically underrepresented in NLP. There's almost no public tooling, no pre-trained models, and very little parallel data available online. I want to change that, starting with an Ekegusii ↔ English translator, with Kiswahili as a future target.
What I've done so far:
- Found a large parallel corpus: the Bible in both Ekegusii and English
- Parsed and aligned it into a structured .json file with paired sentence entries: { "ekegusii": "...", "english": "..." }
- 31,000 verse-level pairs: not huge, but a real start for a low-resource language
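For anyone curious about the format, here is a minimal sketch of how the pairs could be cleaned and split before training. It assumes the file is a top-level JSON array of objects with the two keys shown above; the function names, the split fractions, and the seed are illustrative choices of mine, not anything fixed yet.

```python
import json
import random


def load_pairs(path):
    """Load a JSON array of {"ekegusii": ..., "english": ...} objects.

    The path is whatever the corpus file ends up being called (hypothetical here).
    """
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def clean_pairs(pairs):
    """Strip whitespace, drop pairs with an empty side, drop exact duplicates."""
    seen = set()
    cleaned = []
    for pair in pairs:
        src = (pair.get("ekegusii") or "").strip()
        tgt = (pair.get("english") or "").strip()
        if not src or not tgt:
            continue  # skip verses where one side is missing
        key = (src, tgt)
        if key in seen:
            continue  # skip exact repeats of an already-kept pair
        seen.add(key)
        cleaned.append({"ekegusii": src, "english": tgt})
    return cleaned


def split_pairs(pairs, dev_frac=0.05, test_frac=0.05, seed=13):
    """Shuffle with a fixed seed and cut into train/dev/test lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_frac)
    n_test = int(len(shuffled) * test_frac)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test


if __name__ == "__main__":
    # Placeholder strings (not real Ekegusii), only to exercise the functions.
    demo = [{"ekegusii": f"src {i}", "english": f"tgt {i}"} for i in range(100)]
    demo.append({"ekegusii": "src 0", "english": "tgt 0"})  # exact duplicate
    demo.append({"ekegusii": "  ", "english": "tgt x"})     # empty source side
    train, dev, test = split_pairs(clean_pairs(demo))
    print(len(train), len(dev), len(test))
```

One caveat with a random verse-level split: the Bible repeats some passages nearly word for word across books, so near-duplicates can end up in both train and test and make scores look better than they are. Holding out whole books may be a safer split if book labels are available.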
Where I'm stuck / what I'm figuring out next:
- Should I fine-tune an existing multilingual model (e.g. mBART-50, NLLB-200, or Helsinki-NLP opus-mt) or try to build something smaller from scratch given compute constraints?
- Bible text is highly formal and domain-specific. How much will that hurt generalization?
- Tokenization: Ekegusii has rich morphology, so I'm wondering whether a standard BPE tokenizer will handle it well.
- Which data augmentation strategies are worth trying for low-resource MT?
- Has anyone worked on low-resource African language MT before? Any advice, papers, or communities I should know about? Would love to connect with others working on similar problems.
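On the tokenization question, one cheap check I'm considering is "fertility": the average number of subword tokens a tokenizer produces per whitespace-separated word. The sketch below is tokenizer-agnostic and takes any function that maps a word to a list of tokens. `toy_chunk_tokenizer` is a made-up stand-in so the example runs on its own; in practice it would be replaced with the tokenizer of whichever pre-trained model is being evaluated.

```python
from statistics import mean


def fertility(sentences, tokenize):
    """Average number of subword tokens per whitespace-separated word."""
    counts = []
    for sentence in sentences:
        for word in sentence.split():
            counts.append(len(tokenize(word)))
    return mean(counts) if counts else 0.0


def toy_chunk_tokenizer(word, size=3):
    """Stand-in tokenizer: cuts a word into fixed-size character chunks.

    Hypothetical placeholder only; swap in the real subword tokenizer under test.
    """
    return [word[i:i + size] for i in range(0, len(word), size)]


if __name__ == "__main__":
    # Placeholder text, not real corpus data.
    sample = ["abcdef gh"]
    print(fertility(sample, toy_chunk_tokenizer))
```

If fertility comes out much higher on the Ekegusii side than on the English side of the same verses, that would suggest the vocabulary is fragmenting Ekegusii words heavily, and training a dedicated subword vocabulary might be worth a try. Note that some tokenizers behave differently on isolated words than on full sentences, so this word-by-word count is only a rough proxy.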
Happy to share the dataset and code publicly once it's cleaned up. I would love for this to become a community resource.