
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing - ACL Anthology


Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng (Editors)

Anthology ID: 2025.emnlp-main
Month: November
Year: 2025
Address: Suzhou, China
Venue: EMNLP
Publisher: Association for Computational Linguistics
URL: https://aclanthology.org/2025.emnlp-main/
DOI: 10.18653/v1/2025.emnlp-main
ISBN: 979-8-89176-332-6

Towards Automated Error Discovery: A Study in Conversational AI
Dominic Petrak | Thy Thy Tran | Iryna Gurevych

Although LLM-based conversational agents demonstrate strong fluency and coherence, they still produce undesirable behaviors (errors) that are challenging to prevent from reaching users during deployment. Recent research leverages large language models (LLMs) to detect errors and guide response-generation models toward improvement. However, current LLMs struggle to identify errors not explicitly specified in their instructions, such as those arising from updates to the response-generation model or shifts in user behavior. In this work, we introduce Automated Error Discovery, a framework for detecting and defining errors in conversational AI, and propose SEEED (Soft Clustering Extended Encoder-Based Error Detection), an encoder-based approach to its implementation. We enhance the Soft Nearest Neighbor Loss by amplifying distance weighting for negative samples and introduce Label-Based Sample Ranking to select highly contrastive examples for better representation learning. SEEED outperforms adapted baselines, including GPT-4o and Phi-4, across multiple error-annotated dialogue datasets, improving accuracy for detecting unknown errors by up to 8 points and demonstrating strong generalization to unknown intent detection.

Break the Checkbox: Challenging Closed-Style Evaluations of Cultural Alignment in LLMs
Mohsinul Kabir | Ajwad Abrar | Sophia Ananiadou

A large number of studies rely on closed-style multiple-choice surveys to evaluate cultural alignment in Large Language Models (LLMs). In this work, we challenge this constrained evaluation paradigm and explore more realistic, unconstrained approaches. Using the World Values Survey (WVS) and Hofstede Cultural Dimensions as case studies, we demonstrate that LLMs exhibit stronger cultural alignment in less constrained settings, where responses are not forced. Additionally, we show that even minor changes, such as reordering survey choices, lead to inconsistent outputs, exposing the limitations of closed-style evaluations. Our findings advocate for more robust and flexible evaluation frameworks that focus on specific cultural proxies, encouraging more nuanced and accurate assessments of cultural alignment in LLMs.

Biased Tales: Cultural and Topic Bias in Generating Children’s Stories
Donya Rooein | Vilém Zouhar | Debora Nozza | Dirk Hovy

Stories play a pivotal role in human communication, shaping beliefs and morals, particularly in children. As parents increasingly rely on large language models (LLMs) to craft bedtime stories, the presence of cultural and gender stereotypes in these narratives raises significant concerns. To address this issue, we present Biased Tales, a comprehensive dataset designed to analyze how biases influence protagonists’ attributes and story elements in LLM-generated stories. Our analysis uncovers striking disparities: when the protagonist is described as a girl (as compared to a boy), appearance-related attributes increase by 55.26%.
Stories featuring non-Western children disproportionately emphasize cultural heritage, tradition, and family themes far more than those for Western children. Our findings highlight the role of sociocultural bias and the importance of making creative AI use more equitable and diverse.

Large Language Models as Realistic Microservice Trace Generators
Donghyun Kim | Sriram Ravula | Taemin Ha | Alex Dimakis | Daehyeok Kim | Aditya Akella

Workload traces are essential to understand complex computer systems’ behavior and manage processing and memory resources. Since real-world traces are hard to obtain, synthetic trace generation is a promising alternative. This paper proposes a first-of-its-kind approach that relies on training a large language model (LLM) to generate synthetic workload traces, specifically microservice call graphs. To capture complex and arbitrary hierarchical structures and implicit constraints in such traces, we propose to train LLMs to generate recursively, making call graph generation a sequence of more manageable steps. To further enforce learning constraints on the traces and generate uncommon situations, we apply additional instruction tuning steps to align our model with the desired trace features. With this method, we train TraceLLM, an LLM for microservice trace generation, and demonstrate that it produces diverse, realistic traces under varied conditions, outperforming existing approaches in both accuracy and validity. The synthetically generated traces can effectively replace real data to optimize important microservice management tasks. Additionally, TraceLLM adapts to downstream trace-related tasks, such as predicting key trace features and infilling missing data.
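The recursive generation idea behind TraceLLM can be sketched in miniature: decompose call-graph generation into one step per service, then flatten the nested result into the edge list a trace file would record. The names, the graph schema, and the deterministic stub standing in for the LLM are all illustrative assumptions, not the paper's actual setup:

```python
# Sketch: recursive decomposition of a microservice call graph into a
# sequence of smaller per-node generation steps. A deterministic stub
# stands in for the LLM that TraceLLM would condition on the parent call.

def generate_call_graph(service, depth, fanout, max_depth=3):
    """Recursively 'generate' callees for a service, one node at a time."""
    if depth >= max_depth:
        return {"service": service, "calls": []}
    children = [
        generate_call_graph(f"{service}.{i}", depth + 1, fanout, max_depth)
        for i in range(fanout(depth))
    ]
    return {"service": service, "calls": children}

def flatten(node):
    """Serialise the nested graph into the flat (parent, child) edge list
    a trace file would contain."""
    edges = []
    for child in node["calls"]:
        edges.append((node["service"], child["service"]))
        edges.extend(flatten(child))
    return edges

# Two callees at the root, one at the next level, none below that.
graph = generate_call_graph("frontend", 0, fanout=lambda d: 2 - d if d < 2 else 0)
print(flatten(graph))
```

Treating each node expansion as its own step is what turns an arbitrarily deep hierarchy into a sequence of bounded generation problems.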
JUDGEBERT: Assessing Legal Meaning Preservation Between Sentences
David Beauchemin | Michelle Albert-Rochette | Richard Khoury | Pierre-Luc Déziel

Simplifying text while preserving its meaning is a complex yet essential task, especially in sensitive domains like legal texts. When applied to a specialized field such as law, meaning preservation differs significantly from its role in regular texts. This paper introduces FrJUDGE, a new dataset to assess legal meaning preservation between two legal texts. It also introduces JUDGEBERT, a novel evaluation metric designed to assess legal meaning preservation in French legal text simplification. JUDGEBERT demonstrates a superior correlation with human judgment compared to existing metrics. It also passes two crucial sanity checks that other metrics fail: it always returns a score of 100% for two identical sentences and 0% for two unrelated sentences. Our findings highlight its potential to transform legal NLP applications, ensuring accuracy and accessibility of text simplification for legal practitioners and lay users.

QFrCoLA: A Quebec-French Corpus of Linguistic Acceptability Judgments
David Beauchemin | Richard Khoury

Large Transformer-based language models perform outstandingly in various downstream tasks. However, there is limited understanding of how these models internalize linguistic knowledge, so various linguistic benchmarks have recently been proposed to facilitate syntactic evaluation of language models across languages. This paper introduces QFrCoLA (Quebec-French Corpus of Linguistic Acceptability Judgments), a normative binary acceptability judgment dataset comprising 25,153 in-domain and 2,675 out-of-domain sentences. Our study leverages the QFrCoLA dataset and seven other linguistic binary acceptability judgment corpora to benchmark seven language models. The results demonstrate that, on average, fine-tuned Transformer-based LMs are strong baselines for most languages and that zero-shot binary classification with large language models performs poorly on the task. On the QFrCoLA benchmark, a fine-tuned Transformer-based LM likewise outperformed the other methods tested on average. The results also show that the pre-trained cross-lingual LLMs selected for our experimentation do not seem to have acquired linguistic judgment capabilities for Quebec French during their pre-training. Finally, our experimental results on QFrCoLA show that our dataset, built from examples that illustrate linguistic norms rather than speakers’ feelings, behaves like a linguistic acceptability judgment task and is a challenging benchmark of LMs’ linguistic judgment capabilities.

Revisiting LLM Value Probing Strategies: Are They Robust and Expressive?
Siqi Shen | Mehar Singh | Lajanugen Logeswaran | Moontae Lee | Honglak Lee | Rada Mihalcea

The value orientation of Large Language Models (LLMs) has been extensively studied, as it can shape user experiences across demographic groups. However, two key challenges remain: (1) the lack of systematic comparison across value probing strategies, despite the Multiple Choice Question (MCQ) setting being vulnerable to perturbations, and (2) the uncertainty over whether probed values capture in-context information or predict models’ real-world actions. In this paper, we systematically compare three widely used value probing methods: token likelihood, sequence perplexity, and text generation. Our results show that all three methods exhibit large variances under non-semantic perturbations in prompts and option formats, with sequence perplexity being the most robust overall. We further introduce two tasks to assess expressiveness: demographic prompting, testing whether probed values adapt to cultural context; and value-action agreement, testing the alignment of probed values with value-based actions. We find
that demographic context has little effect on the text generation method, and probed values only weakly correlate with action preferences across all methods. Our work highlights the instability and the limited expressive power of current value probing methods, calling for more reliable LLM value representations.

A Systematic Analysis of Base Model Choice for Reward Modeling
Kian Ahrabian | Pegah Jandaghi | Negar Mokhberian | Sai Praneeth Karimireddy | Jay Pujara

Reinforcement learning from human feedback (RLHF) and, at its core, reward modeling have become a crucial part of training powerful large language models (LLMs). One commonly overlooked factor in training high-quality reward models (RMs) is the effect of the base model, which is becoming more challenging to choose given the rapidly growing pool of LLMs. In this work, we present a systematic analysis of the effect of base model selection on reward modeling performance. Our results show that performance can be improved by up to 14% compared to the most common (i.e., default) choice. Moreover, we showcase the strong statistical relation between some existing benchmarks and downstream performance. We also demonstrate that the results from a small set of benchmarks can be combined to boost model selection (+18% on average in the top 5-10). Lastly, we illustrate the impact of different post-training steps on the final performance and explore using estimated data distributions to reduce performance prediction error.

Comparing Specialised Small and General Large Language Models on Text Classification: 100 Labelled Samples to Achieve Break-Even Performance
Branislav Pecher | Ivan Srba | Maria Bielikova

When solving NLP tasks with limited labelled data, researchers typically either use a general large language model without further update, or use a small number of labelled samples to tune a specialised smaller model. In this work, we answer an important question: how many labelled samples are required for the specialised small models to outperform general large models, while taking the performance variance into consideration? By observing the behaviour of fine-tuning, instruction-tuning, prompting and in-context learning on 8 language models, we identify such performance break-even points across 8 representative text classification tasks of varying characteristics. We show that the specialised models often need only a few samples (on average 100) to be on par with or better than the general ones. At the same time, the number of required labels strongly depends on the dataset and task characteristics, with fine-tuning on binary datasets requiring significantly more samples. When performance variance is taken into consideration, the number of required labels increases on average by 100-200%. Finally, larger models do not consistently lead to better performance and lower variance, with 4-bit quantisation having negligible impact.

Is the Top Still Spinning? Evaluating Subjectivity in Narrative Understanding
Melanie Subbiah | Akankshya Mishra | Grace Kim | Liyan Tang | Greg Durrett | Kathleen McKeown

Determining the faithfulness of a claim to a source document is an important problem across many domains. This task is generally treated as a binary judgment of whether the claim is supported or unsupported in relation to the source. In many cases, though, whether a claim is supported can be ambiguous. For instance, it may depend on making inferences from given evidence, and different people can reasonably interpret the claim as either supported or unsupported based on their agreement with those inferences. Forcing binary labels upon such claims lowers the reliability of evaluation. In this work, we reframe the task to manage the subjectivity involved with factuality judgments of ambiguous claims.
We introduce LLM-generated edits of summaries as a method of providing a nuanced evaluation of claims: how much does a summary need to be edited to be unambiguous? Whether a claim gets rewritten, and how much it changes, can be used as an automatic evaluation metric, the Ambiguity Rewrite Metric (ARM), with a much richer feedback signal than a binary judgment of faithfulness. We focus on narrative summarization, as it is particularly rife with ambiguity and subjective interpretation. We show that ARM produces a 21% absolute improvement in annotator agreement on claim faithfulness, indicating that subjectivity is reduced.

MathTutorBench: A Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors
Jakub Macina | Nico Daheim | Ido Hakimi | Manu Kapur | Iryna Gurevych | Mrinmaya Sachan

Evaluating the pedagogical capabilities of AI-based tutoring models is critical for making guided progress in the field. Yet, we lack a reliable, easy-to-use, and simple-to-run evaluation that reflects the pedagogical abilities of models. To fill this gap, we present MathTutorBench, an open-source benchmark for holistic tutoring model evaluation. MathTutorBench contains a collection of datasets and metrics that broadly cover tutor abilities as defined by learning sciences research in dialog-based teaching. To score the pedagogical quality of open-ended teacher responses, we train a reward model and show it can discriminate expert from novice teacher responses with high accuracy. We evaluate a wide set of closed- and open-weight models on MathTutorBench and find that subject expertise, indicated by solving ability, does not immediately translate to…
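The edit-based intuition behind the Ambiguity Rewrite Metric can be illustrated with a toy rewrite score: zero if a claim survives an ambiguity-removing rewrite untouched, and larger the more it had to change. The word-level edit distance used here is a stand-in assumption, not the paper's actual metric or edit model:

```python
# Toy ARM-style score: how much did a claim change under a rewrite?

def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists,
    computed with a single rolling row."""
    dp = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, wb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (wa != wb))
    return dp[-1]

def rewrite_score(original, rewritten):
    """0.0 = untouched (claim was unambiguous); higher = heavier rewrite."""
    a, b = original.split(), rewritten.split()
    return edit_distance(a, b) / max(len(a), len(b), 1)

print(rewrite_score("the top keeps spinning", "the top keeps spinning"))  # 0.0
print(rewrite_score("the top keeps spinning", "the top may still be spinning"))
```

An unchanged claim scoring exactly zero mirrors the binary "supported" case, while the continuous tail captures the graded ambiguity that binary labels discard.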

Executive Summary

The Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP), held in Suzhou, China, present cutting-edge research in natural language processing (NLP). Two notable papers are highlighted: one introducing a framework for automated error discovery in conversational AI, and another challenging traditional closed-style evaluations of cultural alignment in large language models (LLMs). Together, the proceedings reflect a push toward robust, flexible evaluation frameworks and innovative approaches to error detection.

Key Points

  • Introduction of the Automated Error Discovery framework and its encoder-based implementation, SEEED, for conversational AI.
  • Challenging traditional closed-style evaluations of cultural alignment in LLMs.
  • Advocacy for more robust and flexible evaluation frameworks in NLP.

Merits

Innovative Framework

The SEEED framework advances automated error detection by pairing an encoder with an enhanced Soft Nearest Neighbor Loss and Label-Based Sample Ranking, improving accuracy on unknown errors by up to 8 points over adapted baselines and generalizing to unknown intent detection.
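As a rough illustration of the contrastive objective involved, here is a minimal soft nearest neighbor loss with amplified distance weighting for negative samples. The amplification factor `beta` and its placement are assumptions made for illustration, not the paper's exact formulation:

```python
import math

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def snn_loss(embeddings, labels, temperature=1.0, beta=2.0):
    """Average -log of the probability mass that an anchor's neighbours
    share its label. Negative-pair distances are scaled by beta >= 1,
    which sharpens the penalty on negatives that sit close to the anchor."""
    total = 0.0
    for i, (e_i, y_i) in enumerate(zip(embeddings, labels)):
        pos = neg = 0.0
        for j, (e_j, y_j) in enumerate(zip(embeddings, labels)):
            if i == j:
                continue
            d = sq_dist(e_i, e_j) / temperature
            if y_i == y_j:
                pos += math.exp(-d)
            else:
                neg += math.exp(-beta * d)  # amplified negative weighting
        total += -math.log(pos / (pos + neg))
    return total / len(embeddings)

# Tight same-label clusters yield a low loss; mixed clusters a high one.
tight = snn_loss([(0, 0), (0.1, 0), (5, 5), (5, 5.1)], [0, 0, 1, 1])
mixed = snn_loss([(0, 0), (5, 5), (0.1, 0), (5, 5.1)], [0, 0, 1, 1])
print(tight, mixed)
```

Minimizing a loss of this shape pulls same-error samples together while pushing different errors apart, which is what makes soft clustering of unseen error types possible downstream.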

Critical Evaluation

The critique of closed-style evaluations highlights the need for more nuanced and accurate assessment methods: even minor perturbations, such as reordering survey choices, produced inconsistent outputs, motivating a shift toward more realistic, unconstrained approaches.
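The reordering finding suggests a simple robustness probe: ask the same question under every permutation of the answer options and measure how often the chosen content stays the same. The sketch below uses a hypothetical `ask_model` stub with a deliberate position bias to show how such a check behaves; a real evaluation would call an LLM here:

```python
import itertools

def ask_model(question, options):
    # Toy position-biased model: always answers the second option,
    # mimicking the ordering sensitivity the paper reports.
    return options[1]

def reorder_consistency(question, options):
    """Fraction of option permutations on which the model's chosen
    *content* matches its choice under the original ordering."""
    reference = ask_model(question, options)
    perms = list(itertools.permutations(options))
    hits = sum(ask_model(question, p) == reference for p in perms)
    return hits / len(perms)

score = reorder_consistency(
    "Is it important that children learn obedience at home?",
    ["agree", "disagree", "neither"],
)
print(score)  # well below 1.0 for a position-biased model
```

A content-consistent model would score 1.0 here; anything lower indicates the survey format, not the model's values, is driving the answer.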

Demerits

Limited Generalizability

While SEEED shows promise, its effectiveness may be limited by the specific datasets and error types it was trained on, potentially restricting its generalizability.

Evaluation Constraints

The critique of closed-style evaluations, while valid, does not provide a comprehensive alternative framework, leaving a gap in practical implementation.

Expert Commentary

The Proceedings of the 2025 EMNLP conference underscore the rapid advancements in NLP, particularly in the areas of error detection and cultural alignment evaluation. The introduction of the SEEED framework is a notable step forward, addressing the critical need for automated error discovery in conversational AI. By leveraging soft clustering and contrastive learning, SEEED demonstrates improved accuracy and generalization, outperforming current baselines. However, its effectiveness may be constrained by the specific datasets and error types it was trained on, necessitating further research to enhance its generalizability.

The critique of closed-style evaluations is equally significant, highlighting the limitations of traditional multiple-choice surveys in assessing cultural alignment. The findings advocate for more robust and flexible evaluation frameworks, focusing on specific cultural proxies to achieve more nuanced and accurate assessments. This shift is crucial for developing ethical and unbiased AI systems, as it addresses the inherent biases and inconsistencies in current evaluation methods.

Policymakers and regulatory bodies should take note of these advancements and limitations, promoting the adoption of comprehensive evaluation frameworks to ensure the responsible development and deployment of AI technologies.

Recommendations

  • Further research should focus on enhancing the generalizability of the SEEED framework to a broader range of error types and datasets.
  • Developers and researchers should explore and implement more flexible evaluation frameworks for cultural alignment, moving beyond closed-style surveys to achieve more accurate and nuanced assessments.
