Is Human Annotation Necessary? Iterative MBR Distillation for Error Span Detection in Machine Translation
arXiv:2603.12983v1 Announce Type: new Abstract: Error Span Detection (ESD) is a crucial subtask in Machine Translation (MT) evaluation, aiming to identify the location and severity of translation errors. While fine-tuning models on human-annotated data improves ESD performance, acquiring such data is expensive and prone to inconsistencies among annotators. To address this, we propose a novel self-evolution framework based on Minimum Bayes Risk (MBR) decoding, named Iterative MBR Distillation for ESD, which eliminates the reliance on human annotations by leveraging an off-the-shelf LLM to generate pseudo-labels. Extensive experiments on the WMT Metrics Shared Task datasets demonstrate that models trained solely on these self-generated pseudo-labels outperform both the unadapted base model and supervised baselines trained on human annotations at the system and span levels, while maintaining competitive sentence-level performance.
Executive Summary
This paper proposes a novel framework, Iterative MBR Distillation, for Error Span Detection in Machine Translation, eliminating the need for human annotations by using an off-the-shelf LLM to generate pseudo-labels that are filtered and refined via Minimum Bayes Risk (MBR) decoding. On the WMT Metrics Shared Task datasets, models trained solely on these pseudo-labels outperform both supervised baselines and the unadapted base model at the system and span levels. The framework has significant implications for reducing the cost and annotator inconsistency associated with human-labeled data, while maintaining competitive sentence-level performance.
Key Points
- ▸ Introduction of Iterative MBR Distillation for Error Span Detection
- ▸ Elimination of human annotations through self-generated pseudo-labels
- ▸ Experimental results demonstrating the effectiveness of the proposed framework
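To make the core idea concrete, the MBR selection step at the heart of the framework can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the span representation, the `span_f1` utility function, and the function names are all assumptions. The principle shown is generic MBR decoding, i.e. sampling several candidate pseudo-labels from an LLM and keeping the one with the highest expected agreement with the other samples.

```python
from typing import List, Tuple

# Hypothetical representation of one error span: (start, end, severity).
Span = Tuple[int, int, str]

def span_f1(a: List[Span], b: List[Span]) -> float:
    """Utility between two candidate annotations: F1 over exact span matches.
    (Assumed utility; the paper may use a different agreement measure.)"""
    if not a and not b:
        return 1.0  # both say "no errors" -> perfect agreement
    if not a or not b:
        return 0.0
    sa, sb = set(a), set(b)
    overlap = len(sa & sb)
    if overlap == 0:
        return 0.0
    precision = overlap / len(sa)
    recall = overlap / len(sb)
    return 2 * precision * recall / (precision + recall)

def mbr_select(candidates: List[List[Span]]) -> List[Span]:
    """Return the candidate annotation with maximum expected utility
    against all other samples (equivalently, minimum Bayes risk)."""
    best, best_score = candidates[0], float("-inf")
    for cand in candidates:
        score = sum(span_f1(cand, other)
                    for other in candidates if other is not cand)
        if score > best_score:
            best, best_score = cand, score
    return best

# Example: three sampled pseudo-labels for one translation; the two
# agreeing samples outvote the outlier, so the consensus label is kept.
samples = [[(0, 5, "major")], [(0, 5, "major")], [(3, 7, "minor")]]
pseudo_label = mbr_select(samples)  # -> [(0, 5, "major")]
```

In the iterative variant described by the abstract, the selected pseudo-labels would then be used to fine-tune the student model, and the improved model's outputs could seed the next round of sampling.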
Merits
Reduced Reliance on Human Annotations
The proposed framework reduces the need for expensive and inconsistent human annotations, making it a more efficient and cost-effective approach.
Competitive Performance
Models trained on self-generated pseudo-labels outperform supervised baselines and the unadapted base model at the system and span levels, while remaining competitive at the sentence level.
Demerits
Dependence on Off-the-Shelf LLM
The framework relies on an off-the-shelf LLM to generate pseudo-labels, which may introduce biases and limitations.
Limited Generalizability
The experimental results are based on specific datasets and may not generalize to other domains or tasks.
Expert Commentary
The proposed Iterative MBR Distillation framework represents a meaningful advance in Error Span Detection for Machine Translation, offering a more efficient and cost-effective route to fine-grained translation error evaluation. While the results are promising, further research is needed to address the framework's limitations and potential biases, particularly its dependence on off-the-shelf LLMs, whose errors and blind spots may propagate into the pseudo-labels. The generalizability of the framework beyond the WMT Metrics Shared Task datasets also requires further investigation.
Recommendations
- ✓ Further research on the limitations and biases of off-the-shelf LLMs in Machine Translation evaluation
- ✓ Investigation into the generalizability of the proposed framework to other domains and tasks
- ✓ Exploration of active learning techniques to improve the accuracy and efficiency of Machine Translation evaluation