r/LanguageTechnology

I'm building an Ekegusii ↔ English NLP translator for a critically low-resource Bantu language in Kenya. Here's where I am and what I'm figuring out next

Hey everyone 👋 Long-time lurker, first-time poster. I've been self-teaching NLP over the past few months and got hit with an idea I can't shake: building a machine translation system for Ekegusii (also called Gusii), a Bantu language spoken by the Gusii people in western Kenya, with roughly 2–3 million speakers.

Ekegusii is critically underrepresented in NLP. There's almost no public tooling, no pre-trained models, and very little parallel data available online. I want to change that, starting with an Ekegusii ↔ English translator, with Kiswahili as a future target.

What I've done so far:

Found a large parallel corpus: the Bible in both Ekegusii and English

Parsed and aligned it into a structured .json file with paired sentence entries: { "ekegusii": "...", "english": "..." }

31,000 verse-level pairs: not huge, but a real start for a low-resource language
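For anyone who wants to picture the data handling, here is a minimal sketch of loading and splitting pairs in the format above. It assumes the file is a single top-level JSON list of objects with the "ekegusii" and "english" keys; the function names, the 5% dev/test fractions, and the seed are illustrative choices, not what the project actually uses.

```python
import json
import random


def clean_pairs(records):
    """Strip whitespace and drop empty or exactly duplicated pairs."""
    seen, pairs = set(), []
    for rec in records:
        src = rec.get("ekegusii", "").strip()
        tgt = rec.get("english", "").strip()
        if src and tgt and (src, tgt) not in seen:
            seen.add((src, tgt))
            pairs.append((src, tgt))
    return pairs


def load_pairs(path):
    """Read a JSON list of {"ekegusii": ..., "english": ...} objects (assumed layout)."""
    with open(path, encoding="utf-8") as f:
        return clean_pairs(json.load(f))


def split_pairs(pairs, dev_frac=0.05, test_frac=0.05, seed=13):
    """Shuffle deterministically and return (train, dev, test) lists."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_frac)
    n_test = int(len(shuffled) * test_frac)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test
```

A random verse-level split like this keeps the test set in the same domain as training, so scores from it say little about how the model does on non-Bible text.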

Where I'm stuck / what I'm figuring out next:

  • Should I fine-tune an existing multilingual model (e.g. mBART-50, NLLB-200, or Helsinki-NLP opus-mt) or try to build something smaller from scratch given compute constraints?
  • Bible text is highly formal and domain-specific. How much will that hurt generalization?
  • Tokenization: Ekegusii has rich morphology, so I'm wondering whether a standard BPE tokenizer will handle it well
  • Data augmentation strategies for low-resource MT?
  • Has anyone worked on low-resource African language MT before? Any advice, papers, or communities I should know about? Would love to connect with others working on similar problems.
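On the tokenization question, one rough way to compare candidate tokenizers is "fertility", the average number of subword tokens produced per whitespace-separated word. The sketch below assumes any tokenizer can be wrapped as a function from a word to a list of tokens; `char_trigram_tokenizer` is a toy stand-in for illustration, not a real BPE model.

```python
def fertility(tokenize, sentences):
    """Average number of subword tokens per whitespace-separated word."""
    total_words = 0
    total_tokens = 0
    for sentence in sentences:
        words = sentence.split()
        total_words += len(words)
        total_tokens += sum(len(tokenize(word)) for word in words)
    return total_tokens / total_words if total_words else 0.0


def char_trigram_tokenizer(word):
    """Toy stand-in: cut a word into chunks of up to three characters."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]
```

A much higher fertility on the Ekegusii side than on the English side would suggest the tokenizer is fragmenting words heavily, which is worth checking before committing to a pretrained vocabulary.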

Happy to share the dataset and code publicly once it's cleaned up. I would love for this to become a community resource.

reddit.com
u/Pioskeff — 10 hours ago

Indian accent english speech recognition

Been testing a bunch of ASR models lately, and I think I’ve found the best one so far for English with Indian accents.

NVIDIA’s Parakeet TDT 0.6B v2 has been surprisingly good. Accent handling feels much more natural compared to a lot of models that struggle with Indian pronunciation, mixed speech patterns, or common regional variations.

What stood out for me:

✅ Better recognition of Indian English accents

✅ Strong transcription quality

✅ Fast and lightweight (0.6B)

✅ Handles real-world speech better than expected

Model: parakeet-tdt-0.6b-v2 on huggingface

Curious if others here have tried it against Whisper, Moonshine, or other recent ASR models. So far this might be my favorite for Indian English use cases.

Anyone else tested it?

u/AI_Guy_In_Fintech — 3 days ago

what’s actually the most reliable way to translate spoken audio into english using ai?

been working with a lot of multilingual audio lately like interviews, meetings, recorded calls etc and i still haven’t found a setup that feels actually reliable

transcription is usually decent depending on the tool but translation is where things start to break

meaning gets slightly distorted or sentences come out rearranged in a way that doesn’t sound natural, especially when there are accents, background noise, or people switching languages mid conversation

just wondering what people are actually using these days
is it still the usual transcription first then translation approach or is there something better now that handles it more cleanly end to end?

u/Little_Tangelo2196 — 3 days ago

Extracting predictive moves from sales call transcripts, patterns too generic

I'm trying to extract useful behavioral patterns from sales call transcripts and I'm stuck on the abstraction level. Hoping someone here has thought about this.

Setup: Danish-language sales calls, around 5 min each, transcribed and speaker-labeled. About 15k calls a month from a team of 15 reps. Binary outcome per call: did the rep book a meeting or not. I want to figure out which conversational moves actually work, so the manager can coach the team on real stuff instead of vibes.

Right now I run transcripts through Gemini Flash and ask it to pull out behavioral patterns with verbatim quotes. Then I aggregate across calls and check if a pattern shows up more often in booked calls vs lost ones. Threshold to call something validated is n>=20, lift >=3pp booking rate, p<0.05.
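For concreteness, the validation rule can be sketched as below. The thresholds (n>=20, lift >=3pp, p<0.05) are the ones stated above; the choice of a pooled two-proportion z-test and the function names are assumptions for illustration, since the post does not say which test produces the p-value.

```python
import math


def pattern_lift(booked_with, total_with, booked_without, total_without):
    """Return (lift in percentage points, two-sided p-value) for one pattern.

    Uses a pooled two-proportion z-test (an assumed choice of test).
    """
    rate_with = booked_with / total_with
    rate_without = booked_without / total_without
    lift_pp = (rate_with - rate_without) * 100
    pooled = (booked_with + booked_without) / (total_with + total_without)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_with + 1 / total_without))
    z = (rate_with - rate_without) / se if se > 0 else 0.0
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return lift_pp, p_value


def is_validated(booked_with, total_with, booked_without, total_without,
                 min_n=20, min_lift_pp=3.0, alpha=0.05):
    """Apply the three thresholds from the post to one candidate pattern."""
    lift_pp, p_value = pattern_lift(booked_with, total_with,
                                    booked_without, total_without)
    return total_with >= min_n and lift_pp >= min_lift_pp and p_value < alpha
```

The normal approximation is shaky at 20 to 50 observations, so an exact test may be a better fit at that sample size.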

Problem is the patterns that come out are too generic to actually use. Stuff like "asks follow-up questions" or "mentions price". Technically true, useless as coaching. What the manager actually needs is something like "asks about urgency right after a price objection", a specific move in a specific spot.

I think there are a few things going wrong but I'm not sure which one to fix first:

The LLM produces category-level labels because that's what it's trained to do. Even when I ask for verbatim quotes it still ends up grouping them under a generic label, and the aggregation step throws away the specifics.

The sample size is small once you slice by phase and behavior: 20 to 50 observations per candidate. P-values at that size with no multiple-comparisons correction probably mean I'm just catching noise.

I'm treating it as a hypothesis test when it should probably be a ranking problem. I don't actually need "this is statistically true". I need "this move is more likely to precede a good outcome than this other move".
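Two of these points can be sketched in a few lines: a Benjamini-Hochberg correction for testing many candidate patterns at once, and a shrunken booking rate for ranking moves instead of testing them. The prior `strength` of 20 pseudo-calls and the function names are illustrative assumptions, not tuned values.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: which hypotheses survive FDR control at alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank  # largest rank that passes its threshold
    keep = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            keep[idx] = True
    return keep


def shrunk_rate(booked, total, base_rate, strength=20):
    """Booking rate pulled toward the team base rate by `strength` pseudo-calls."""
    return (booked + strength * base_rate) / (total + strength)


def rank_moves(moves, base_rate, strength=20):
    """Sort (name, booked, total) tuples by shrunken booking rate, best first."""
    return sorted(moves,
                  key=lambda m: shrunk_rate(m[1], m[2], base_rate, strength),
                  reverse=True)
```

With shrinkage, a move seen in 2 calls with 2 bookings no longer outranks one seen in 50 calls with 30 bookings, which is the behaviour a coaching shortlist probably wants.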

Stuff I've considered: tightening the prompt to demand phrase-level output with context (helps a bit, doesn't fix aggregation). Clustering phrase embeddings before aggregating instead of using the LLM label as the unit. Comparing top vs bottom performers within the same team directly instead of trying to make population-level claims. Reframing the whole thing as next-move prediction conditioned on call state.

What I'd love input on: has anyone done conversational success prediction at this kind of low-n where you want phrase-level moves and not category labels? Any prompting tricks for forcing the LLM to keep specifics through aggregation? Any pointers to the dialog acts literature that's actually useful for this vs theoretical?

Happy to share examples if it helps.

u/Playful_Air_7174 — 5 days ago

desk rejection after camera ready version ACL 2026

hi everyone. my paper got accepted at one of the ACL '26 workshops. however, only after submitting the camera-ready version did I realize that most of my references were wrong (outdated or not in ACL style). I sent the corrected version a day later.

could that lead to rejection? thanks

u/Helpful_Income_9989 — 6 days ago

ACL Conference

My guide requires a virtual ACL conference presentation for my PhD work (India). Does anyone know (1) whether ACL proceedings are Scopus-indexed and whether virtual presentation is allowed, (2) the total virtual registration cost for a student paper presenter, and (3) whether virtual presentation runs smoothly? I need precise numbers for my guide.

Thanks!

u/StatusArrival3382 — 6 days ago

Has anyone received BioNLP 2026 decisions yet?

The official BioNLP 2026 notification date has already passed, but my SoftConf submission page still says:

“At this time, there are no action items available for this submission.”

I’m trying to understand whether there is a general delay or whether decisions were already released for others.

u/Equivalent_Move_8137 — 7 days ago

Indian Spoken Language detection model

Hey everyone,

Over the past few months, I’ve been building a spoken language identification (LID) model focused specifically on Indic languages and real-world conversational speech.

The model can automatically detect the spoken language directly from audio input, even in noisy telephony-style conversations.

Supported Languages

Hindi

English

Bengali

Marathi

Tamil

Telugu

Kannada

Malayalam

Gujarati

Punjabi

What the Model Handles

Short utterances

Call-center / telephony audio

Conversational speech

Background noise

Indian accents & regional variations

Some level of code-mixed speech

Tech Stack

PyTorch

Deep learning–based audio classification

Custom preprocessing pipeline

Audio embeddings + transformer/CNN experiments

Automated evaluation & benchmarking workflows

Biggest Challenges

One thing I underestimated was how difficult Indic spoken LID becomes in real-world data.

Some major issues:

Similar phonetics across languages

Hindi mixed with regional languages

Accent & dialect diversity

Imbalanced datasets

Extremely short voice samples

Noisy customer-support recordings

A lot of effort went into preprocessing, balancing, and improving robustness.

Potential Use Cases

IVR language routing

Multilingual voice assistants

ASR model selection

Customer support automation

Speech analytics

Voice AI systems for India

Current Focus

Right now I’m experimenting with:

Better short-utterance detection

Robustness on noisy audio

Reducing confusion between related languages

Faster inference for production deployment

Looking for Feedback

Would especially appreciate:

Good Indic LID benchmarks/datasets

Ideas for handling heavy code-mixing

Production deployment suggestions

Interest in an open-source release

Happy to discuss architecture choices, datasets, or experiments if people are interested.

u/AI_Guy_In_Fintech — 7 days ago

We checked TranslateGemma-12b's "clean" subtitle translations against human review. Linguists flagged 71% of them.

We've been running translation quality benchmarks at Alconost. A few weeks ago we published one with 6 models (Claude Sonnet 4.6, GPT-5.4 mini, GPT-5.4 nano, DeepSeek V3.2, Gemini Flash Lite, TranslateGemma-12b) translating English subtitles into 6 languages, 167 segments per language pair, scored with two reference-free QE metrics: MetricX-24 and COMETKiwi. TranslateGemma-12b came out on top in every language pair, which made us want to verify the result: when the metrics say a TranslateGemma translation is clean, do human linguists agree?

So we picked 21 English segments from one tutorial video where TranslateGemma's output had scored well on both metrics, in 4 languages - Spanish, Japanese, Thai, and Simplified Chinese (Korean and Traditional Chinese got dropped). We sent those 84 translations to human linguists for MQM annotation.

Headline numbers, using the rule the published benchmark dashboard itself uses to flag segments as poor (MetricX-24 ≥ 5 OR COMETKiwi < 0.70):

Language   Auto-flagged   Human-flagged (any error)
ES         0/21           11/21
JA         0/21           17/21
TH         0/21           17/21
ZH-CN      1/21           15/21
Total      1/84 (1.2%)    60/84 (71%)

The single segment automated metrics flagged was also human-flagged, so there's no disagreement there. The action is on the other side: 59 cases where metrics said clean and humans said not clean.
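The flagging rule and the totals can be reproduced in a few lines. The per-language counts are copied from the table; the function and variable names are just for illustration.

```python
def auto_flagged(metricx, cometkiwi):
    """Dashboard rule: a segment is poor if MetricX-24 >= 5 or COMETKiwi < 0.70."""
    return metricx >= 5 or cometkiwi < 0.70


SEGMENTS_PER_LANGUAGE = 21

# language -> (auto-flagged segments, human-flagged segments)
counts = {
    "ES": (0, 11),
    "JA": (0, 17),
    "TH": (0, 17),
    "ZH-CN": (1, 15),
}

total = SEGMENTS_PER_LANGUAGE * len(counts)
auto_total = sum(auto for auto, _ in counts.values())
human_total = sum(human for _, human in counts.values())

# The one auto-flagged segment was also human-flagged, so every other
# human-flagged segment is a case the metrics rated clean.
missed_by_metrics = human_total - auto_total

print(f"auto-flagged:  {auto_total}/{total} ({100 * auto_total / total:.1f}%)")
print(f"human-flagged: {human_total}/{total} ({100 * human_total / total:.0f}%)")
print(f"clean per metrics, flagged by humans: {missed_by_metrics}")
```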

All 25 Accuracy-class errors found by humans (mistranslation, omission, addition, untranslated content) occurred on segments the metrics rated clean - 100%. Not one accuracy error landed in the auto-flagged region. Japanese accounts for 10 of the 15 mistranslations.

Caveat: small audit on one model and one content set, so the numbers are directional rather than definitive.

PS: I can share the full benchmark in the comments if somebody asks - noticed my own comments with a link get hidden.

u/ritis88 — 10 days ago

Regarding choosing same Reviewer for next ARR cycle

I got reviews (3,3,3.5,2) with confidence (3,3,3,5) in the March cycle.

I have mostly addressed the reviewers' concerns and plan to resubmit in the next cycle. From your experience, is it better to request the same set of reviewers or different ones? If we have answered their queries, do they generally give a better score than they did before?

And what are the chances of getting accepted at EMNLP?

u/Happy_Today_3288 — 10 days ago

Commonly used algorithms to compare texts

Hi! I'm new to computational linguistics, and for a project I need to estimate how much of a text our participants can remember. So far we have had a list of "information units" that are in the text, and we manually checked whether the participants mentioned them in what they wrote. Now we want to automate this process. I tried to look for machine learning approaches, but I found mostly sentiment analysis papers or word counts, plus a lot with LLMs (however, the latter didn't look very standard in the field to me, more like a new approach). I also found algorithms that you have to train, but we don't have enough data to do so. In general there was a lot, so I had trouble knowing what to choose or where to even start.
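To make the task concrete, the manual check could be approximated by a plain word-overlap baseline like the sketch below. This is only an illustration of the setup, not an established method in the field: the 0.6 threshold and the function names are arbitrary assumptions, and simple overlap will miss paraphrases.

```python
import re


def tokens(text):
    """Lowercased set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def unit_recall(unit, response):
    """Fraction of the information unit's words that appear in the response."""
    unit_tokens = tokens(unit)
    if not unit_tokens:
        return 0.0
    return len(unit_tokens & tokens(response)) / len(unit_tokens)


def score_response(units, response, threshold=0.6):
    """Return (share of units counted as recalled, per-unit hit list)."""
    hits = [unit_recall(unit, response) >= threshold for unit in units]
    return (sum(hits) / len(units) if units else 0.0), hits
```

For example, with the units "the dog chased the cat" and "it rained all day", a response of "A dog chased a cat through the yard." counts the first unit as recalled and the second as missed.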

Is there any algorithm or tool already trained that is commonly used for this? Any insights or guidance is appreciated.

u/vnshmnt — 10 days ago

Can ARR reviews commit to a second venue after rejection at the first?

If I commit a paper to EMNLP and it gets rejected, can I then commit the same ARR reviews to AACL or EACL afterwards? Or does the rejection burn that review set and force me to go through a new ARR cycle?

Has anyone actually tried this cascade? Curious whether it's mechanically allowed, formally forbidden, or just gray area in practice.

Thanks.

u/Greedy-Teach1533 — 11 days ago

Computational Linguistics

Hi everyone,

I’m looking into applying for an MS in Computational Linguistics for Fall 2027, specifically at the University of Washington and the University of Rochester, and I wanted to ask if anyone here has had a similar journey/background.

My academic background is in Modern Languages (English & German), and I’m currently doing an MSc in International Business. Linguistics/languages have always been my strongest area, and over the past year I’ve become really interested in NLP, computational linguistics, and language technology.

The biggest issue is that I currently have zero formal background in computer science or coding. No CS degree, no math-heavy background, no programming courses from university. However, I’m fully willing to put in the work before applying - learning Python, taking online courses, improving my quantitative skills, etc.

I wanted to ask:

  • Has anyone here transitioned into computational linguistics from a humanities/languages background?
  • If so, what did you do before applying to become a competitive applicant?
  • Were universities receptive to applicants without a CS degree?
  • What kind of portfolio/projects helped the most?

Also, since I’m an international student, I’d love to hear if anyone had experience getting scholarships, assistantships, funding, or tuition support for computational linguistics programs in the US - especially at UW or Rochester.

Sometimes I feel intimidated seeing applicants with strong CS backgrounds, so hearing from people who successfully made the transition would honestly help a lot.

Thank you!

u/Obvious-Ad6806 — 12 days ago

I need your help with a hypothesis

Hi everyone,

I'm not entirely sure this request belongs on this subreddit, but I'll give it a shot anyway.

I'm working on a personal project called WeakSignalFinder, focused on quantitative text analysis to help detect emerging themes.

What the project currently does:

The program relies on Natural Language Processing (NLP) to identify various categories of terms (nouns, pronouns, adjectives, verbs) and quantitatively count the occurrences of a given set of keywords (e.g., war, economic…). It also analyzes co-occurrences, meaning it captures the immediate neighborhood of each word (positions n-1 and n+1), in order to produce a kind of map or dictionary of the linguistic patterns within the input corpus.
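As a rough illustration of the counting and co-occurrence step, here is a minimal sketch using only the standard library. It leaves out the part-of-speech tagging, and the function name and regex tokenization are simplifying assumptions rather than the actual WeakSignalFinder code.

```python
import re
from collections import Counter, defaultdict


def keyword_neighbors(text, keywords):
    """Count keyword occurrences and the words at positions n-1 and n+1."""
    words = re.findall(r"\w+", text.lower())
    keyword_set = {k.lower() for k in keywords}
    counts = Counter()
    neighbors = defaultdict(Counter)
    for i, word in enumerate(words):
        if word not in keyword_set:
            continue
        counts[word] += 1
        if i > 0:
            neighbors[word][words[i - 1]] += 1  # position n-1
        if i + 1 < len(words):
            neighbors[word][words[i + 1]] += 1  # position n+1
    return counts, neighbors
```

Calling `keyword_neighbors(corpus_text, ["war", "economic"])` returns the keyword counts plus, for each keyword, a counter of its immediate neighbors.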

The problem I'm currently stuck on:

I'm now tackling a feature that was actually the original goal of the project: identifying weak informational signals (in the Ansoff sense). For a long time this seemed too complex to me, mainly because of one core difficulty: how do you distinguish noise from a genuine weak signal?

The hypothesis I'd like to submit:

A few days ago, I came up with a possible angle. To filter out noise from the pool of terms suspected of being weak signals, one could compute an average coefficient for each suspect term (across all of its occurrences), in order to derive a density of "theme-words" (terms with high, or very high, occurrence rates).
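One possible reading of this hypothesis, sketched in code so it can be criticized more easily: treat any word above a frequency cutoff as a theme-word, and for each suspect term average, over all of its occurrences, the share of immediate neighbors (n-1 and n+1) that are theme-words. The cutoff `min_theme_count`, the one-word window, and the function name are all assumptions made for this sketch.

```python
import re
from collections import Counter


def theme_density(text, suspects, min_theme_count=3):
    """Average share of theme-word neighbors per occurrence of each suspect term."""
    words = re.findall(r"\w+", text.lower())
    freq = Counter(words)
    theme_words = {w for w, c in freq.items() if c >= min_theme_count}
    per_occurrence = {s.lower(): [] for s in suspects}
    for i, word in enumerate(words):
        if word not in per_occurrence:
            continue
        window = words[max(0, i - 1):i] + words[i + 1:i + 2]
        if window:
            share = sum(n in theme_words for n in window) / len(window)
            per_occurrence[word].append(share)
    return {s: (sum(v) / len(v) if v else 0.0) for s, v in per_occurrence.items()}
```

Under this reading, a rare term that keeps appearing next to theme-words scores close to 1 and an isolated rare term scores close to 0, so the score could serve as a first noise filter.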

I'm coming to this subreddit today hoping to get critical feedback on this hypothesis, pointers to academic literature that could help me validate, refine, or correct the approach, and ideally any existing implementations or experimental code that have explored these concepts in practice.

Thanks in advance for any help. My current self, armed only with an Associate's Degree in Computer Science, will be more than happy to quench a bit of his insatiable thirst for knowledge.

u/transmision — 12 days ago

ACL TrustNLP Camera-Ready

I have two accepted papers for the ACL TrustNLP 2026 workshop, and the camera-ready submission deadline is May 12th, but I don’t see an option to upload the camera-ready version in OpenReview. Is anybody else facing this issue? Thanks

u/rohithnamboothiri — 14 days ago