
Cross-Lingual LLM-Judge Transfer via Evaluation Decomposition

arXiv:2603.18557v1

Abstract: As large language models are increasingly deployed across diverse real-world applications, extending automated evaluation beyond English has become a critical challenge. Existing evaluation approaches are predominantly English-focused, and adapting them to other languages is hindered by the scarcity and cost of human-annotated judgments in most languages. We introduce a decomposition-based evaluation framework built around a Universal Criteria Set (UCS). UCS consists of a shared, language-agnostic set of evaluation dimensions, producing an interpretable intermediate representation that supports cross-lingual transfer with minimal supervision. Experiments on multiple faithfulness tasks across languages and model backbones demonstrate consistent improvements over strong baselines without requiring target-language annotations.

Ivaxi Sheth, Zeno Jonke, Amin Mantrach, Saab Mansour

Executive Summary

The article addresses a critical gap in the evaluation of large language models (LLMs) by proposing a decomposition-based framework to enable cross-lingual automated assessment. The core innovation is the Universal Criteria Set (UCS), a language-agnostic intermediate representation that decomposes evaluation into interpretable dimensions. This approach mitigates the scarcity of human-annotated judgments in non-English languages, which has historically limited the scalability of multilingual LLM evaluation. The framework demonstrates consistent performance improvements across multiple faithfulness tasks, model architectures, and target languages without requiring target-language annotations. The work aligns with the broader push for equitable, multilingual AI systems and offers a pragmatic solution to longstanding challenges in cross-lingual evaluation.

Key Points

  • Introduces a Universal Criteria Set (UCS), a language-agnostic intermediate evaluation framework that decomposes LLM outputs into interpretable dimensions (sketched in code after this list).
  • Demonstrates cross-lingual transferability with minimal supervision, addressing the scarcity of human-annotated judgments in non-English languages.
  • Shows consistent improvements over strong baselines across multiple faithfulness tasks and model backbones, validating the framework's robustness.
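
To make the decomposition concrete, the sketch below shows how a UCS-style judge could be wired up: each shared criterion is scored independently by an LLM judge, and the per-dimension scores form the interpretable intermediate representation. The criterion names, prompt wording, uniform averaging, and the query_llm helper are illustrative assumptions on our part, not the authors' implementation.

```python
# Hypothetical sketch of a decomposition-based LLM judge built on a
# shared, language-agnostic criteria set. The criteria, prompts, and
# query_llm helper are illustrative assumptions, not the paper's code.
from dataclasses import dataclass

# Each criterion is judged independently, yielding an interpretable
# per-dimension score vector rather than a single opaque verdict.
UNIVERSAL_CRITERIA = {
    "source_consistency": "Does the response contradict the source text?",
    "completeness": "Does the response omit key information from the source?",
    "fabrication": "Does the response add facts absent from the source?",
}

def query_llm(prompt: str) -> float:
    """Placeholder for a call to any LLM-judge backbone; assumed to
    return a score in [0, 1] parsed from the model's answer."""
    raise NotImplementedError("plug in a model client here")

@dataclass
class Judgment:
    scores: dict[str, float]   # per-criterion scores in [0, 1]
    overall: float             # aggregate used for the final verdict

def judge(source: str, response: str) -> Judgment:
    scores = {}
    for name, question in UNIVERSAL_CRITERIA.items():
        prompt = (
            f"Source:\n{source}\n\nResponse:\n{response}\n\n"
            f"Criterion: {question}\n"
            "Answer with a single score from 0 (worst) to 1 (best)."
        )
        scores[name] = query_llm(prompt)
    # Uniform averaging is the simplest aggregation; the paper may
    # weight or combine the dimensions differently.
    return Judgment(scores=scores, overall=sum(scores.values()) / len(scores))
```

Because the criteria are phrased once in a language-agnostic way, the same score vector can be produced for outputs in any language, which is what makes the intermediate representation transferable.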

Merits

Innovation in Cross-Lingual Evaluation

The UCS framework represents a significant advancement by enabling language-agnostic evaluation, which is critical for deploying LLMs in multilingual contexts. This addresses a major bottleneck in automated assessment where English-centric methods dominate.

Scalability and Efficiency

By reducing reliance on target-language annotations, the framework enhances scalability and reduces costs associated with human evaluation, making it feasible for rapid deployment across diverse languages.
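
One plausible way such transfer could be realized in practice, offered here as an assumption rather than a detail from the abstract, is to fit a lightweight aggregator on English-annotated criterion-score vectors and reuse it unchanged for other languages:

```python
# Hypothetical illustration of cross-lingual transfer: a lightweight
# aggregator is fit on English-annotated criterion-score vectors and
# then reused unchanged for other languages. This reading of "minimal
# supervision" is our assumption, not a detail from the abstract.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Rows: per-example criterion scores (e.g., from a judge like the one
# sketched above); labels: human faithfulness judgments, needed only
# for English.
english_scores = np.array([
    [0.9, 0.8, 1.0],   # faithful example
    [0.2, 0.5, 0.1],   # unfaithful example
    [0.7, 0.9, 0.8],   # faithful example
])
english_labels = np.array([1, 0, 1])

aggregator = LogisticRegression().fit(english_scores, english_labels)

# At evaluation time the same aggregator scores criterion vectors from
# any target language, because the dimensions themselves are shared.
german_scores = np.array([[0.85, 0.7, 0.95]])
print(aggregator.predict_proba(german_scores)[:, 1])  # P(faithful)
```

Under this reading, adding a new language requires criterion scoring but no new human annotations, which is what makes the approach cheap to scale.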

Interpretability and Standardization

The decomposition into interpretable dimensions enhances transparency and enables standardized evaluation metrics, which are essential for rigorous academic and industrial applications.

Demerits

Dependency on English Annotations

While the framework reduces the need for target-language annotations, it still anchors the UCS and its evaluation infrastructure in English. This could limit its applicability in settings where English supervision is unavailable or where English is not a widely used second language.

Generalizability to Low-Resource Languages

The experiments focus on languages with relatively robust resources. The framework's performance in extremely low-resource languages, where even English annotations may be scarce, remains untested and warrants further investigation.

Faithfulness Task Specificity

The framework is evaluated primarily on faithfulness tasks (e.g., summarization, translation). Its applicability to other tasks (e.g., creative writing, coding) or domains (e.g., specialized legal or medical text) is not explored, leaving its generalizability an open question.

Expert Commentary

The authors present a compelling solution to a longstanding challenge in multilingual LLM evaluation. By introducing the Universal Criteria Set (UCS), they offer a language-agnostic framework that not only improves scalability but also enhances interpretability—a critical yet often overlooked aspect of evaluation metrics. The decomposition approach is particularly innovative, as it aligns with the growing demand for transparent and auditable AI systems. However, the framework's reliance on English for the UCS and evaluation infrastructure may pose challenges in truly low-resource linguistic contexts. Additionally, while the results on faithfulness tasks are impressive, the framework's applicability to other domains remains an open question. This work is a significant step forward, but further validation across a broader range of tasks and languages is essential to fully realize its potential. The implications for both academia and industry are profound, particularly in bridging the gap between English-centric research and global AI deployment.

Recommendations

  • Expand the UCS to incorporate domain-specific dimensions (e.g., legal, medical) to enhance its applicability beyond faithfulness tasks.
  • Conduct rigorous evaluations in low-resource languages to assess the framework's scalability and robustness in truly underrepresented linguistic contexts.
  • Collaborate with standardization bodies (e.g., ISO, IEEE) to develop universal evaluation guidelines that incorporate the UCS framework, ensuring industry-wide adoption and compliance.
