Evaluating Explainable AI Attribution Methods in Neural Machine Translation via Attention-Guided Knowledge Distillation

arXiv:2603.11342v1

Abstract: The study of the attribution of input features to the output of neural network models is an active area of research. While numerous Explainable AI (XAI) techniques have been proposed to interpret these models, the systematic and automated evaluation of these methods in sequence-to-sequence (seq2seq) models is less explored. This paper introduces a new approach for evaluating explainability methods in transformer-based seq2seq models. We use teacher-derived attribution maps as a structured side signal to guide a student model, and quantify the utility of different attribution methods through the student's ability to simulate targets. Using the Inseq library, we extract attribution scores over source-target sequence pairs and inject these scores into the attention mechanism of a student transformer model under four composition operators (addition, multiplication, averaging, and replacement). Across three language pairs (de-en, fr-en, ar-en) and attributions from Marian-MT and mBART models, Attention, Value Zeroing, and Layer Gradient $\times$ Activation consistently yield the largest gains in BLEU (and corresponding improvements in chrF) relative to baselines. In contrast, other gradient-based methods (Saliency, Integrated Gradients, DeepLIFT, Input $\times$ Gradient, GradientShap) lead to smaller and less consistent improvements. These results suggest that different attribution methods capture distinct signals and that attention-derived attributions better capture alignment between source and target representations in seq2seq models. Finally, we introduce an Attributor transformer that, given a source-target pair, learns to reconstruct the teacher's attribution map. Our findings demonstrate that the more accurately the Attributor can reproduce attribution maps, the more useful an injection of those maps is for the downstream task. The source code can be found on GitHub.
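The abstract describes injecting teacher attribution scores into the student's attention under four composition operators. The paper's own implementation is on GitHub; the sketch below only illustrates what such composition might look like on plain row-stochastic weight matrices (function and operator names here are illustrative, not the authors' code):

```python
def row_normalize(mat):
    """Rescale each row so its entries sum to 1 (keeps weights attention-like)."""
    out = []
    for row in mat:
        s = sum(row)
        out.append([x / s for x in row] if s else row[:])
    return out

def compose(attn, attrib, op):
    """Combine student attention weights with a teacher attribution map.

    attn, attrib: nested lists of equal shape (target-len x source-len).
    op: one of the four operators named in the abstract.
    """
    ops = {
        "add": lambda a, m: a + m,
        "mul": lambda a, m: a * m,
        "avg": lambda a, m: (a + m) / 2.0,
        "replace": lambda a, m: m,
    }
    f = ops[op]
    mixed = [[f(a, m) for a, m in zip(a_row, m_row)]
             for a_row, m_row in zip(attn, attrib)]
    return row_normalize(mixed)
```

For example, `compose([[0.5, 0.5]], [[1.0, 0.0]], "replace")` discards the student's weights entirely, while `"add"` shifts them toward the teacher's map before renormalizing.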

Executive Summary

This study introduces a novel approach to evaluating Explainable AI (XAI) techniques in transformer-based sequence-to-sequence (seq2seq) models. The authors use teacher-derived attribution maps as a structured side signal to guide a student model, measuring each attribution method's utility by how much it improves the student's translations. The results suggest that different attribution methods capture distinct signals: Attention, Value Zeroing, and Layer Gradient × Activation yield the largest BLEU and chrF gains, indicating that attention-derived attributions best capture the alignment between source and target representations. The study also introduces an Attributor transformer that learns to reconstruct teacher attribution maps; the more faithfully a map can be reproduced, the more useful it proves when injected downstream. Overall, this research contributes to the growing body of work on XAI evaluation and has implications for the development of more transparent and interpretable AI systems.

Key Points

  • The study introduces a novel approach for evaluating XAI techniques in seq2seq models using transformer-based architectures.
  • Attention-derived attributions (Attention, Value Zeroing) and Layer Gradient × Activation are shown to be more effective than other gradient-based methods (Saliency, Integrated Gradients, DeepLIFT, Input × Gradient, GradientShap) at capturing alignment between source and target representations.
  • The Attributor transformer can learn to reconstruct teacher attribution maps, offering a potential solution for improving model interpretability.

Merits

Strength in Methodology

The study's use of teacher-derived attribution maps as a structured side signal to guide a student model is a novel and effective approach for evaluating XAI techniques.

Effective Evaluation Metrics

The use of BLEU (n-gram precision) and chrF (character n-gram F-score) as evaluation metrics grounds the comparison in standard, widely accepted machine-translation benchmarks, making the downstream utility of each XAI technique directly measurable.
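In practice the field computes these scores with tooling such as sacrebleu; purely to make the metric concrete, the following is a simplified sentence-level sketch of BLEU as a brevity-penalized geometric mean of smoothed modified n-gram precisions (not the implementation used in the paper):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hyp, ref, max_n=4):
    """Sentence-level BLEU with add-one smoothing (a simplified sketch)."""
    precisions = []
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((h & r).values())          # clipped n-gram matches
        total = max(sum(h.values()), 1)
        precisions.append((overlap + 1) / (total + 1))  # add-one smoothing
    # Brevity penalty discourages overly short hypotheses.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A hypothesis identical to its reference scores 1.0, and scores fall as n-gram overlap shrinks; chrF works analogously but over character n-grams with an F-score instead of precision.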

Demerits

Limited Generalizability

The study's findings may not generalize to other types of machine learning models or tasks, limiting the applicability of the results.

Overreliance on Attention Mechanism

Because the evaluation injects attributions directly into the student's attention mechanism, it may structurally favor attention-derived attributions, understating the value that gradient-based techniques could offer under a different injection scheme.

Expert Commentary

This study represents a significant contribution to the field of XAI in machine learning, offering a novel approach for evaluating the effectiveness of attribution methods in seq2seq models. The results suggest that attention-derived attributions are more effective than most gradient-based methods, though the findings may not generalize to other model families or tasks. To further improve the interpretability of AI systems, researchers should continue to explore different attribution methods and develop more robust evaluation protocols. Policymakers and regulatory bodies should also take note of the study's implications for building more transparent and accountable AI systems.

Recommendations

  • Future studies should investigate the applicability of attention-derived attributions to other types of machine learning models and tasks.
  • Developers and researchers should consider incorporating more robust evaluation metrics, such as those used in this study, to assess the effectiveness of XAI techniques.
