u/cefoo

Manuel Herranz from Pangeanic here. Long time no posting!

We have been focusing heavily on how to move MTQE beyond passive, post-facto scoring and turn it into a dynamic routing layer in production. I just published a deep dive on why enterprise localization needs to shift away from raw scalar metrics and toward actionable operational signals.

The core argument is that a machine translation can be perfectly fluent and linguistically accurate, yet still fail the job if it ignores client-specific terminology, glossaries, or contextual risk profiles. While frameworks like COMETKiwi have been useful for general evaluation, true production automation requires an adaptive control layer that triggers human review per use case, client, job, or industry, or triggers automatic corrective post-editing, based on actual asset compliance rather than a generic confidence number.
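To make the routing idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not Pangeanic's implementation: the names JobPolicy, glossary_violations, and route, the three routes, and the case-insensitive substring glossary check are all hypothetical, and qe_score stands for whatever scalar a QE model returns.

```python
from dataclasses import dataclass, field
from enum import Enum


class Route(Enum):
    PUBLISH = "publish"
    AUTO_POST_EDIT = "auto_post_edit"
    HUMAN_REVIEW = "human_review"


@dataclass
class JobPolicy:
    """Hypothetical per-client / per-job settings (not from the article)."""
    min_qe_score: float                           # below this, always send to a human
    glossary: dict = field(default_factory=dict)  # source term -> required target term
    high_risk: bool = False                       # e.g. legal or medical content


def glossary_violations(source: str, target: str, glossary: dict) -> list:
    """Return source terms whose required target term is missing from the translation."""
    src, tgt = source.lower(), target.lower()
    return [s for s, t in glossary.items() if s.lower() in src and t.lower() not in tgt]


def route(source: str, target: str, qe_score: float, policy: JobPolicy) -> Route:
    """Combine a generic QE score with asset compliance to pick a route."""
    if policy.high_risk or qe_score < policy.min_qe_score:
        return Route.HUMAN_REVIEW
    if glossary_violations(source, target, policy.glossary):
        # Fluent and confident, but non-compliant with the client's terminology.
        return Route.AUTO_POST_EDIT
    return Route.PUBLISH
```

In this sketch, a translation with a high QE score that misses a glossary term is still held back from publication, which is the gap a generic confidence number alone cannot close.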

For me, this represents a paradigmatic shift from treating QE as a passive audit log to using it as an active routing mechanism. For those interested in the workflow architecture and how we are balancing varying data risks across different content domains, the full article is here:
https://blog.pangeanic.com/mtqe-is-becoming-a-translation-control-layer-from-scores-to-adaptive-quality-workflows

Given the deep technical focus of this community, I would love to get your thoughts on the practical hurdles of automated thresholding. How are your teams handling the engineering challenges of real-time routing, and where do you see the boundary between automated gating and human veto power?

Overview

This shared task focuses on evaluating the performance of automated systems that assess the quality of language translation systems. It continues the WMT 2025 shared task that unified and consolidated the separate shared tasks on Machine Translation Metrics and Quality Estimation (QE) from previous years, under an updated structure designed to encourage the development and assessment of new state-of-the-art translation quality evaluation systems.

A primary focus of the task is on systems that can evaluate translation quality in context (where context is at the document or lengthy multi-segment level), even when the generated quality assessments are at the word or segment level. Content segmentation will still be provided as part of the input, but as in last year's task and unlike earlier years, these segments will be long multi-sentence units of text. Reference translations will also be provided as an optional input to the evaluation systems, thus covering both the classical MT metrics and translation-time QE scenarios. However, unlike previous years, reference translations will be generated by post-editing of MT or via synthetic generation and selection.

The shared task this year consists of three primary subtasks that address translation quality assessment from three perspectives: (1) segment-level translation error detection and span annotation, (2) segment-level quality score prediction, and (3) detection of error-free segments. Curated evaluation data sets will be provided for all three subtasks. These include test sets obtained from the General MT shared task as well as a collection of “challenge sets” that were developed by the organizers and members of the research community. A fourth subtask solicits the submission of these challenge sets.

Languages Covered

The list below provides the language pairs covered this year (which fully parallel the languages covered by the General MT shared task):

Czech to German
Czech to Ukrainian
Czech to Vietnamese [new]
Chinese (Simplified) to Japanese [new]
English to Arabic (Egyptian)
English to Armenian [new]
English to Belarusian [new]
English to Chinese (Simplified)
English to Chinese (Traditional Taiwan) [new]
English to Czech
English to Estonian
English to German
English to Icelandic
English to Indonesian [new]
English to Japanese
English to Kazakh [new]
English to Korean
English to Ladin (Italy) [new]
English to Ligurian (Italy) [new]
English to Russian
English to Northern Sámi [new]
English to Thai [new]
English to Ukrainian

Data and Submission

Human assessments of translation quality, collected by the General MT shared task using an ESA-based protocol, will act as the “gold standard” for our shared task evaluation. For details, see the “Human Evaluation” section of the General MT shared task description.

Training and validation datasets will be made available for each of our three subtasks. See the detailed task descriptions for the list of specific resources available for each of the tasks.

Submissions for Tasks 1, 2, and 3 (described below) will be automated and conducted via Codabench. We will provide an automated mechanism for easing the evaluation of submitted systems for all three tasks: direct submissions to Task 1 will by default also be evaluated on Tasks 2 and 3 via automations developed by the organizing committee. Similarly, direct submissions to Task 2 will by default also be evaluated on Task 3 via a similar automation. Participants have the option of opting out of this automated evaluation as well as submitting separate direct systems to the three tasks. Details of this automation process will be posted here at a later date.
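The details of the organizers' automation have not been published, so the following Python sketch only illustrates the general idea of deriving Task 2 and Task 3 outputs from a Task 1 prediction. Everything in it is an assumption: the ErrorSpan layout, the penalty weights (borrowed from common MQM practice), the half-open offset convention, and the threshold used to turn a score into a binary label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorSpan:
    start: int      # character offset where the error begins (inclusive, assumed)
    end: int        # character offset where the error ends (exclusive, assumed)
    severity: str   # "minor" or "major"


# Hypothetical penalty weights; the task description does not specify any.
PENALTY = {"minor": 1.0, "major": 5.0}


def spans_to_score(spans):
    """Task 1 -> Task 2: collapse spans into one score (0 is best, lower is worse)."""
    return -float(sum(PENALTY[s.severity] for s in spans))


def spans_to_error_free_label(spans):
    """Task 1 -> Task 3: a segment with no predicted spans is labelled error-free (1)."""
    return 1 if not spans else 0


def score_to_error_free_label(score, threshold=0.0):
    """Task 2 -> Task 3: label as error-free when the score reaches a chosen threshold."""
    return 1 if score >= threshold else 0
```

A system that only submits to Task 1 would, under these assumptions, get a score of -6.0 and a label of 0 for a segment with one minor and one major error.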

Participants in any of the subtasks will be expected to contribute a one-paragraph description of their system(s) for inclusion in our overall Findings paper, along with a four- to six-page system description paper for inclusion in the wider WMT conference proceedings. See the main WMT website for paper submission information.

Tasks Organized

Task 1: Segment-Level Error Detection and Span Annotation

Task 1 is a segment-level subtask where the goal is to detect translation errors and identify the precise span of each error within the target-side translation along with its severity. For this subtask we use the error spans obtained from the ESA human annotations generated for the General MT primary task as the target “gold standard.” Participants are asked to predict both the error spans (start and end indices) as well as the error severities (major or minor) within each segment. Submissions will be evaluated and ranked based on their ability to correctly identify the presence of errors, correctly mark the spans of any identified errors, and correctly identify the severity of each of these errors.
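As a rough illustration of how span predictions can be compared against gold annotations, the Python sketch below computes a character-level F1 between predicted and gold spans. This is one common way to score span overlap and is not necessarily the official Task 1 metric, which is defined in the Task 1 detailed description. The (start, end, severity) tuple layout with an exclusive end offset is an assumption.

```python
def char_set(spans, match_severity=False):
    """Expand (start, end, severity) spans into a set of covered character positions."""
    if match_severity:
        # A character only counts as matched if the severity agrees too.
        return {(i, sev) for start, end, sev in spans for i in range(start, end)}
    return {i for start, end, _sev in spans for i in range(start, end)}


def span_f1(pred, gold, match_severity=False):
    """Character-level F1 between predicted and gold error spans for one segment."""
    p = char_set(pred, match_severity)
    g = char_set(gold, match_severity)
    if not p and not g:
        return 1.0  # both agree that the segment has no errors
    overlap = len(p & g)
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

For example, a gold span covering characters 0 to 5 and a predicted span covering 3 to 8 share two characters, giving precision and recall of 2/5 each.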

Information about the annotation specifics, the evaluation criteria, and the available training and development resources is available in the Task 1 detailed description.

Task 2: Segment-Level Quality Score Prediction

Task 2 is similar to the corresponding task from last year and is largely an updated version of similar tasks from previous years’ QE and Metrics tracks. The goal of the segment-level quality prediction subtask is to predict a quality score for each source–target segment pair in the evaluation set. Participants this year will be asked to predict the ESA score. Submissions will be evaluated and ranked based on their prediction correlations with human-annotated scores at both the segment and system levels.
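The difference between segment-level and system-level correlation can be sketched as follows in Python. The text above does not name the correlation statistic, so Pearson is used here purely as a stand-in; the row layout with "system", "pred", and "human" keys is also an assumption.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def system_means(rows, key):
    """Average a per-segment value over all segments of each MT system."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["system"]].append(row[key])
    return {system: sum(vals) / len(vals) for system, vals in buckets.items()}


def segment_and_system_correlation(rows):
    """Return (segment-level r, system-level r) for predicted vs. human scores."""
    seg_r = pearson([r["pred"] for r in rows], [r["human"] for r in rows])
    pred_means = system_means(rows, "pred")
    human_means = system_means(rows, "human")
    systems = sorted(pred_means)
    sys_r = pearson([pred_means[s] for s in systems], [human_means[s] for s in systems])
    return seg_r, sys_r
```

Segment-level correlation rewards getting individual segments right, while system-level correlation only asks whether the metric ranks whole MT systems the way human annotators do, so a metric can do well on one and poorly on the other.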

Information about the annotation specifics, the evaluation criteria, and the available training and development resources is available in the Task 2 detailed description.

Task 3: Detection of Error-Free Segments

Task 3 is a new subtask focused on a concrete application of translation quality evaluation systems: detection of error-free segments. While traditional quality estimation focuses on continuous scores or fine-grained error spans, this task simplifies the objective to a binary classification: Is this translation error-free? The goal is to identify segments that can be published or consumed without further human intervention. Participants must predict a binary label (1 for Error-Free, 0 for Contains Errors) for a given set of source–target pairs. Submissions will be evaluated and ranked based on their alignment with gold labels derived from human annotations, measured by the Matthews Correlation Coefficient.
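For reference, the Matthews Correlation Coefficient for binary labels can be computed from the four confusion-matrix counts, as in this short Python sketch. The function name and the choice to return 0.0 when the denominator is zero (a common convention) are assumptions; the official scoring script may handle that edge case differently.

```python
from math import sqrt


def matthews_corrcoef(gold, pred):
    """MCC for binary labels, where 1 = Error-Free and 0 = Contains Errors."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # undefined case, e.g. all predictions fall in one class
    return (tp * tn - fp * fn) / denom
```

MCC ranges from -1 (every label inverted) through 0 (no better than chance) to 1 (perfect agreement), and unlike accuracy it does not reward a system for always predicting the majority class.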

Information about the input and output specifics and the evaluation methodology is available in the Task 3 detailed description.

Task 4: Challenge Sets

While the first three tasks are focused on the development of stronger and better automated quality evaluation systems, the goal of this subtask is for participants to create test sets with challenging evaluation examples that current automated metrics and evaluation systems fail to identify or score correctly. This subtask is organized into three rounds:

Breaking Round: Challenge set participants (Breakers) create challenging examples for metrics. They then send their resulting challenge sets to the organizers.
Scoring Round: The challenge sets created by Breakers will be packaged along with the rest of the evaluation data and sent to all participants in the three previous tasks (the Builders) to score.
Analysis Round: Breakers will receive their data with all the metrics scores for analysis. They are encouraged to then submit an analysis paper describing their findings to the WMT 2026 conference.

This year we are inviting submissions of challenge sets for all (or any of the) three other subtasks (see detailed descriptions above). In addition, challenge sets can target languages beyond the official language pairs listed for Tasks 1–3, but note that evaluation system developers may opt out from evaluating languages other than the official ones. If you are interested in submitting a challenge set this year, the organizers request that you indicate your intentions by completing the sign-up form here.

More details about the challenge sets can be found in the Task 4 detailed description.

Deadlines

Event	Date
Challenge set submission deadline	16 July 2026
Tasks 1, 2, and 3 test data release and submission opening	23 July 2026
Tasks 1, 2, and 3 submission deadline	30 July 2026
Scored challenge sets returned to creators for analysis	3 August 2026
WMT paper submission deadline	TBA (follows EMNLP)
WMT notification of acceptance	TBA (follows EMNLP)
WMT camera-ready submission deadline	TBA (follows EMNLP)
Conference	24–29 October 2026

All deadlines are in Anywhere on Earth (AoE) time.
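Anywhere on Earth corresponds to UTC-12, so a deadline remains open until the listed date has ended in that time zone. The Python sketch below converts a listed date to UTC, assuming the deadline means the end of that calendar day (23:59:59 AoE); the function name is hypothetical.

```python
from datetime import date, datetime, time, timedelta, timezone

# Anywhere on Earth is a fixed offset of UTC-12 with no daylight saving.
AOE = timezone(timedelta(hours=-12), "AoE")


def aoe_deadline_utc(day):
    """Return the last second of `day` in AoE, expressed in UTC."""
    end_of_day = datetime.combine(day, time(23, 59, 59), tzinfo=AOE)
    return end_of_day.astimezone(timezone.utc)


# Tasks 1, 2, and 3 submission deadline from the table above.
print(aoe_deadline_utc(date(2026, 7, 30)).isoformat())
```

Under that assumption, the 30 July 2026 submission deadline falls at 11:59:59 UTC on 31 July 2026.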

Contact

Please contact the organizing committee at wmt-qe-metrics-organizers@googlegroups.com in the event of any questions or difficulties regarding any of the four subtasks.

The organizing committee consists of:

Alon Lavie
Aamon Shurtz
Archchana Sindhujan
Chi-kiu (Jackie) Lo
Chrysoula Zerva
Diptesh Kanojia
Eleftherios Avramidis
Fabio Barth
Fred Blain
Giorgos Filandrianos
Greg Hanneman
Lorenzo Proietti
Monishwaran Maheshwaran
Orfeas Menis Mastromichalakis
Shuoyang Ding
Stefano Perrella
Tom Kocmi
Vilém Zouhar

---

See original: groups.google.com/g/wmt-tasks/c/EZxyTA9Yb1U/m/Br71t8TgAAAJ
