We checked TranslateGemma-12b's "clean" subtitle translations against human review. Linguists flagged 71% of them.
We've been running translation quality benchmarks at Alconost. A few weeks ago we published one with 6 models (Claude Sonnet 4.6, GPT-5.4 mini, GPT-5.4 nano, DeepSeek V3.2, Gemini Flash Lite, TranslateGemma-12b) translating English subtitles into 6 languages, 167 segments per language pair, scored with two reference-free QE metrics: MetricX-24 and COMETKiwi. TranslateGemma-12b came out on top in every language pair, which made us want to verify the result: when the metrics say a TranslateGemma translation is clean, do human linguists agree?
So we picked 21 English segments from one tutorial video where TranslateGemma's output had scored well on both metrics, in 4 of the 6 languages: Spanish, Japanese, Thai, and Simplified Chinese (Korean and Traditional Chinese got dropped). We sent those 84 translations (21 segments × 4 languages) to human linguists for MQM annotation.
Headline numbers, using the rule the published benchmark dashboard itself uses to flag segments as poor (MetricX-24 ≥ 5 OR COMETKiwi < 0.70):
| language | auto-flagged | human-flagged (any error) |
|---|---|---|
| ES | 0/21 | 11/21 |
| JA | 0/21 | 17/21 |
| TH | 0/21 | 17/21 |
| ZH-CN | 1/21 | 15/21 |
| Total | 1/84 (1.2%) | 60/84 (71%) |
The single segment automated metrics flagged was also human-flagged, so there's no disagreement there. The action is on the other side: 59 cases where metrics said clean and humans said not clean.
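If you want to check the arithmetic, here is a minimal Python sketch of the flag rule and the totals. The function and variable names are mine, not from the benchmark code, and the per-language counts are copied straight from the table. It assumes MetricX-24 is an error score (higher is worse) and COMETKiwi is a quality score (lower is worse), which is how the thresholds read.

```python
def is_auto_flagged(metricx_24: float, cometkiwi: float) -> bool:
    """Dashboard rule from the post: poor if MetricX-24 >= 5 OR COMETKiwi < 0.70."""
    return metricx_24 >= 5 or cometkiwi < 0.70


# Per-language counts copied from the table: (auto_flagged, human_flagged, segments)
counts = {
    "ES": (0, 11, 21),
    "JA": (0, 17, 21),
    "TH": (0, 17, 21),
    "ZH-CN": (1, 15, 21),
}

auto_total = sum(auto for auto, _, _ in counts.values())
human_total = sum(human for _, human, _ in counts.values())
n_segments = sum(n for _, _, n in counts.values())

# The one auto-flagged segment was also human-flagged, so the overlap is 1.
flagged_by_both = 1
missed_by_metrics = human_total - flagged_by_both

auto_pct = round(100 * auto_total / n_segments, 1)
human_pct = round(100 * human_total / n_segments)

print(f"auto-flagged:  {auto_total}/{n_segments} ({auto_pct}%)")
print(f"human-flagged: {human_total}/{n_segments} ({human_pct}%)")
print(f"metrics said clean, humans disagreed: {missed_by_metrics}")
```

Because the rule is an OR, a segment has to pass both thresholds to count as clean, so the 59 missed segments cleared MetricX-24 and COMETKiwi at the same time.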
All 25 Accuracy-class errors found by humans (mistranslation, omission, addition, untranslated content) occurred on segments the metrics rated clean. Not one accuracy error landed in the auto-flagged region. Japanese accounts for 10 of the 15 mistranslations.
Caveat: small audit on one model and one content set, so the numbers are directional rather than definitive.
PS: I can share the full benchmark in the comments if somebody asks - noticed my own comments with a link get hidden.