
Is Human Annotation Necessary? Iterative MBR Distillation for Error Span Detection in Machine Translation

arXiv:2603.12983v1 Announce Type: new Abstract: Error Span Detection (ESD) is a crucial subtask in Machine Translation (MT) evaluation, aiming to identify the location and severity of translation errors. While fine-tuning models on human-annotated data improves ESD performance, acquiring such data is expensive and prone to inconsistencies among annotators. To address this, we propose a novel self-evolution framework based on Minimum Bayes Risk (MBR) decoding, named Iterative MBR Distillation for ESD, which eliminates the reliance on human annotations by leveraging an off-the-shelf LLM to generate pseudo-labels. Extensive experiments on the WMT Metrics Shared Task datasets demonstrate that models trained solely on these self-generated pseudo-labels outperform both the unadapted base model and supervised baselines trained on human annotations at the system and span levels, while maintaining competitive sentence-level performance.

Boxuan Lyu, Haiyue Song, Zhi Qu
Executive Summary

This paper proposes a novel framework, Iterative MBR Distillation, for Error Span Detection in Machine Translation. It eliminates the need for human annotations by using an off-the-shelf LLM to generate pseudo-labels, which are then filtered via Minimum Bayes Risk decoding and distilled back into the model. Experimental results demonstrate the effectiveness of this approach, which outperforms both supervised baselines and the unadapted base model. The framework has significant implications for reducing the costs and inconsistencies associated with human annotation while maintaining competitive performance.

Key Points

  • Introduction of Iterative MBR Distillation for Error Span Detection
  • Elimination of human annotations through self-generated pseudo-labels
  • Experimental results demonstrating the effectiveness of the proposed framework
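To make the MBR step concrete, the sketch below shows the core idea in a hypothetical form: sample several candidate error-span annotations from an LLM, then keep the one with the highest average agreement with the others. The span representation, the `span_f1` utility, and all names here are illustrative assumptions, not the paper's actual implementation, which may use a different utility function and annotation format.

```python
def span_f1(cand, ref):
    """Span-level F1 between two annotations, each a list of
    (start, end, severity) tuples. Illustrative utility only."""
    cand_set, ref_set = set(cand), set(ref)
    if not cand_set and not ref_set:
        return 1.0  # both say "no errors": perfect agreement
    tp = len(cand_set & ref_set)
    prec = tp / len(cand_set) if cand_set else 0.0
    rec = tp / len(ref_set) if ref_set else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def mbr_select(candidates):
    """MBR decoding over sampled annotations: return the candidate
    with the highest expected utility, i.e. the best average
    agreement with every other sampled candidate."""
    best, best_score = None, -1.0
    for i, cand in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        score = sum(span_f1(cand, o) for o in others) / len(others)
        if score > best_score:
            best, best_score = cand, score
    return best

# Three hypothetical LLM samples for the same translation; the
# majority-consistent annotation wins and becomes the pseudo-label.
samples = [
    [(0, 5, "major")],
    [(0, 5, "major"), (10, 12, "minor")],
    [(0, 5, "major")],
]
print(mbr_select(samples))  # → [(0, 5, 'major')]
```

The selected annotation then serves as a pseudo-label for fine-tuning, and in the iterative setup the fine-tuned model's own samples feed the next round of MBR selection.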

Merits

Reduced Reliance on Human Annotations

The proposed framework reduces the need for expensive and inconsistent human annotations, making it a more efficient and cost-effective approach.

Competitive Performance

The model trained on self-generated pseudo-labels outperforms supervised baselines and the unadapted base model at the system and span levels, while remaining competitive at the sentence level, demonstrating its effectiveness.

Demerits

Dependence on Off-the-Shelf LLM

The framework relies on an off-the-shelf LLM to generate pseudo-labels, which may introduce biases and limitations.

Limited Generalizability

The experimental results are based on specific datasets and may not generalize to other domains or tasks.

Expert Commentary

The proposed Iterative MBR Distillation framework represents a significant advancement in Error Span Detection for Machine Translation, offering a more efficient and cost-effective approach to evaluating translation errors. While the results are promising, further research is needed to address the limitations and potential biases of the framework, particularly with regard to its dependence on off-the-shelf LLMs. The generalizability of the framework to other domains and tasks also requires further investigation.

Recommendations

  • Further research on the limitations and biases of off-the-shelf LLMs in Machine Translation evaluation
  • Investigation into the generalizability of the proposed framework to other domains and tasks
  • Exploration of active learning techniques to improve the accuracy and efficiency of Machine Translation evaluation