u/ritis88

We checked TranslateGemma-12b's "clean" subtitle translations against human review. Linguists flagged 71% of them.

We've been running translation quality benchmarks at Alconost. A few weeks ago we published one with 6 models (Claude Sonnet 4.6, GPT-5.4 mini, GPT-5.4 nano, DeepSeek V3.2, Gemini Flash Lite, TranslateGemma-12b) translating English subtitles into 6 languages, 167 segments per language pair, scored with two reference-free QE metrics: MetricX-24 and COMETKiwi. TranslateGemma-12b came out on top in every language pair, which made us want to verify the result: when the metrics say a TranslateGemma translation is clean, do human linguists agree?

So we picked 21 English segments from one tutorial video where TranslateGemma's output had scored well on both metrics, in 4 languages - Spanish, Japanese, Thai, and Simplified Chinese (Korean and Traditional Chinese got dropped). We sent those 84 translations to human linguists for MQM annotation.

Headline numbers, using the rule the published benchmark dashboard itself uses to flag segments as poor (MetricX-24 ≥ 5 OR COMETKiwi < 0.70):

| Language | Auto-flagged | Human-flagged (any error) |
|---|---|---|
| ES | 0/21 | 11/21 |
| JA | 0/21 | 17/21 |
| TH | 0/21 | 17/21 |
| ZH-CN | 1/21 | 15/21 |
| Total | 1/84 (1.2%) | 60/84 (71%) |

The single segment automated metrics flagged was also human-flagged, so there's no disagreement there. The action is on the other side: 59 cases where metrics said clean and humans said not clean.
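
For anyone who wants to apply the same flag to their own scores, here is a minimal sketch of the rule as quoted above. The function and constant names are mine, and the example scores are made up for illustration, not taken from the benchmark.

```python
# Thresholds as stated in the post. MetricX-24 is an error score (lower is
# better); COMETKiwi is a quality score (higher is better).
METRICX_FLAG_AT = 5.0
COMETKIWI_FLAG_BELOW = 0.70


def is_auto_flagged(metricx: float, cometkiwi: float) -> bool:
    """Return True if a segment is flagged as poor by either metric."""
    return metricx >= METRICX_FLAG_AT or cometkiwi < COMETKIWI_FLAG_BELOW


# Hypothetical (MetricX-24, COMETKiwi) pairs, for illustration only.
examples = [
    (1.2, 0.85),  # inside both thresholds
    (5.0, 0.90),  # MetricX exactly at the threshold
    (2.0, 0.69),  # COMETKiwi just under the threshold
]
flags = [is_auto_flagged(mx, ck) for mx, ck in examples]
print(flags)
```

Because the rule is an OR, a segment only counts as "clean" when it is inside both thresholds at once.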

All 25 Accuracy-class errors found by humans (mistranslation, omission, addition, untranslated content) occurred on segments the metrics rated clean - 100%. Not one accuracy error landed in the auto-flagged region. Japanese accounts for 10 of the 15 mistranslations.

Caveat: small audit on one model and one content set, so the numbers are directional rather than definitive.

PS: I can share the full benchmark in the comments if somebody asks - noticed my own comments with a link get hidden.

reddit.com
u/ritis88 — 10 days ago

Follow-up to my TranslateGemma-12b benchmark post: human reviewers flagged 71% of the segments automated metrics rated clean

A couple of weeks ago I shared the results of a benchmark here showing TranslateGemma-12b beating frontier general models (Claude Sonnet, GPT-5.4, DeepSeek, Gemini Flash Lite) on subtitle translation across 6 languages. The result was strong enough that we wanted to verify it ourselves - was TranslateGemma really that good, or were the metrics easy on it? So we added a layer of human review.

Setup: 21 English subtitle segments from one tutorial video. TranslateGemma's translations into 4 languages (ES, JA, TH, ZH-CN - Korean and Traditional Chinese got dropped). 84 translations total, all chosen because they scored well on both automated metrics. Then we sent every translation to human MQM review.

Under the dashboard's own red-flag threshold (MetricX-24, MX, ≥ 5 OR COMETKiwi, CK, < 0.70):

| Language | Auto-flagged | Human-flagged (any) | Human-flagged (Major) |
|---|---|---|---|
| ES | 0/21 | 11/21 | 2/21 |
| JA | 0/21 | 17/21 | 3/21 |
| TH | 0/21 | 17/21 | 5/21 |
| ZH-CN | 1/21 | 15/21 | 3/21 |
| Total | 1/84 (1.2%) | 60/84 (71%) | 13/84 (15%) |
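
If you want to sanity-check the Total row, a quick sketch that sums the per-language counts from the table above. The dictionary layout and variable names are just my choice for illustration.

```python
# Per-language counts copied from the table:
# (auto-flagged, human-flagged any, human-flagged Major), each out of 21.
SEGMENTS_PER_LANGUAGE = 21
counts = {
    "ES": (0, 11, 2),
    "JA": (0, 17, 3),
    "TH": (0, 17, 5),
    "ZH-CN": (1, 15, 3),
}

total_n = SEGMENTS_PER_LANGUAGE * len(counts)
# Sum each column across the four languages.
totals = tuple(sum(column) for column in zip(*counts.values()))
shares = [round(100 * t / total_n, 1) for t in totals]

for label, t, pct in zip(("auto", "human any", "human Major"), totals, shares):
    print(f"{label}: {t}/{total_n} ({pct}%)")
```

The sums come out to 1, 60 and 13 out of 84, matching the Total row once the percentages are rounded to whole numbers.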

Of 25 Accuracy-class errors humans found (mistranslation, omission, addition, untranslated), every single one was in the metric-blind quadrant, i.e. on segments the metrics rated clean and humans flagged. The metrics caught zero accuracy errors in this sample.

Per-language failure modes look quite different:

  • Japanese is the "fluent but wrong meaning" pattern - high COMETKiwi (0.86 mean), reasonable MetricX, but 10 of the 15 total mistranslations in the dataset are in JA. In the original report we'd already seen the same pattern in Claude Sonnet 4.6 on Japanese (TQI 0.5364, MetricX 3.90, COMETKiwi 0.79 - fluent-sounding but drifting from source). Looks like the failure mode generalises across model families on JA.
  • Thai is over-production: 5 Accuracy/Addition errors where the model inserted content not in the source, plus a bunch of punctuation errors driven by English-style periods that Thai doesn't use.
  • Spanish is mostly tone inconsistencies (formal/informal switches), genuinely the easiest of the four.
  • Simplified Chinese (ZH-CN) had 4 Major errors total, including the one segment automated metrics flagged (Style - "unidiomatic collocation and inappropriate style"; humans agreed with the metric on that one). The other 3 Majors: another Style ("literal translation"), an Accuracy/Omission where "store" was dropped and the meaning changed, and a Fluency/Inconsistency where "ticket" was translated inconsistently across segments.

Caveat: small audit on one model, one content set, so the numbers are directional rather than definitive.

u/ritis88 — 10 days ago

Follow-up on the TranslateGemma subtitle benchmark: human review of segments rated "clean" by MetricX-24 and COMETKiwi [D]

A few weeks ago I shared the results of a benchmark here comparing 6 LLMs on subtitle translation, scored with two reference-free QE metrics - MetricX-24 (~13B mT5-XXL) and COMETKiwi (~10.7B XLM-R-XXL) - combined into a TQI index. Posting a follow-up because we did human review afterwards, and the result is worth discussing.

The original benchmark put TranslateGemma-12b first in every language pair. The natural question: are those high scores accurate, or are the metrics insensitive in their high-confidence zone? These metrics correlate well with human judgment at the population level (that's what they're trained for), but population-level correlation doesn't tell you whether the segments they call "clean" are actually clean.

So we ran the check directly. 21 English subtitle segments from one tutorial video. TranslateGemma's translations into 4 languages (ES, JA, TH, ZH-CN - Korean and Traditional Chinese got dropped). All 84 translations chosen because they passed the dashboard clean-rule (MX < 5 AND CK ≥ 0.70) in all 4 languages simultaneously. Then full MQM annotation by professional linguists - Major/Minor severity, with categories covering accuracy (mistranslation, omission, addition, untranslated), fluency (grammar, punctuation, inconsistency), style, terminology.

Results under the dashboard threshold:

  • Auto-flagged: 1/84
  • Human-flagged: 60/84 any-error, 13/84 Major-only
  • Metric-blindness rate, defined as (auto-clean ∩ human-flagged) / auto-clean: 59/83 = 71% any-error, 12/83 = 14.5% Major-only
  • All 25 human-found Accuracy-class errors fell in the metric-blind quadrant. Zero overlap with the auto-flagged region (which contained one Style-category Major error).
  • Japanese carries 10 of 15 total mistranslations across the dataset, all metric-blind, despite having the highest mean COMETKiwi (0.863) of the four languages.
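
The metric-blindness rate in the list above can be recomputed from four counts. This is a minimal sketch; the function name and argument names are my own, and the overlap of 1 comes from the single auto-flagged segment, which humans also flagged (with a Major Style error).

```python
def metric_blind_rate(n_total, n_auto_flagged, n_human_flagged, n_both):
    """Share of auto-clean segments that humans still flagged.

    n_both is the number of segments flagged by both metrics and humans.
    Returns (blind_count, auto_clean_count, rate).
    """
    auto_clean = n_total - n_auto_flagged
    blind = n_human_flagged - n_both  # human-flagged but auto-clean
    return blind, auto_clean, blind / auto_clean


# Counts from the results list: 84 segments, 1 auto-flagged,
# 60 human-flagged (any error), 13 human-flagged (Major), overlap of 1.
any_error = metric_blind_rate(84, 1, 60, 1)
major_only = metric_blind_rate(84, 1, 13, 1)

print(f"any-error: {any_error[0]}/{any_error[1]} = {any_error[2]:.1%}")
print(f"Major-only: {major_only[0]}/{major_only[1]} = {major_only[2]:.1%}")
```

Note the denominator is 83 rather than 84, since the one auto-flagged segment is excluded from the auto-clean set.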

Caveat: small n, one model, one content set, so the numbers are directional rather than definitive.

Original thread: [link]
Full benchmark report: in comments.

u/ritis88 — 10 days ago