
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations - ACL Anthology


Editors: Ivan Habernal, Peter Schulam, Jörg Tiedemann
Anthology ID: 2025.emnlp-demos
Month: November
Year: 2025
Address: Suzhou, China
Venue: EMNLP
Publisher: Association for Computational Linguistics
URL: https://aclanthology.org/2025.emnlp-demos/
DOI: 10.18653/v1/2025.emnlp-demos
ISBN: 979-8-89176-334-0
PDF: https://aclanthology.org/2025.emnlp-demos.pdf

Synthetic Data for Evaluation: Supporting LLM-as-a-Judge Workflows with EvalAssist
Martín Santillán Cooper, Zahra Ashktorab, Hyo Jin Do, Erik Miehling, Werner Geyer, Jasmina Gajcin, Elizabeth M. Daly, Qian Pan, Michael Desmond
We present a synthetic data generation tool integrated into EvalAssist. EvalAssist is a web-based application designed to assist human-centered evaluation of language model outputs by allowing users to refine LLM-as-a-Judge evaluation criteria. The synthetic data generation tool in EvalAssist is tailored for evaluation contexts and informed by findings from user studies with AI practitioners. Participants identified key pain points in current workflows, including circularity risks (where models are judged by criteria derived from the models themselves), compounded bias (amplification of biases across multiple stages of a pipeline), and poor support for edge cases, and expressed a strong preference for real-world grounding and fine-grained control. In response, our tool supports flexible prompting, RAG-based grounding, persona diversity, and iterative generation workflows. We also incorporate features for quality assurance and edge case discovery.
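The shape of the LLM-as-a-Judge loop that EvalAssist's criteria refinement supports can be sketched minimally, with the judge stubbed out as a keyword heuristic. All names, criteria, and synthetic outputs below are illustrative, not EvalAssist's actual API:

```python
# Minimal sketch of an LLM-as-a-Judge loop over synthetic test cases.
# The judge is stubbed with a keyword heuristic; in a real EvalAssist-style
# workflow it would be an LLM call. All names here are illustrative.
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    description: str
    keywords: tuple  # stand-in for what a real LLM judge would reason about

def judge(output: str, criterion: Criterion) -> bool:
    """Stub judge: pass if any criterion keyword appears in the output."""
    text = output.lower()
    return any(k in text for k in criterion.keywords)

def evaluate(outputs, criteria):
    """Score each output against each criterion; return pass rates."""
    results = {}
    for c in criteria:
        passed = sum(judge(o, c) for o in outputs)
        results[c.name] = passed / len(outputs)
    return results

# Synthetic test cases, including an edge case with no hedging at all.
synthetic_outputs = [
    "The model may be biased; results should be interpreted with caution.",
    "This answer is definitely correct and needs no verification.",
    "Caution: the evidence is mixed and may not generalize.",
]
criteria = [Criterion("hedging", "Output acknowledges uncertainty",
                      ("may", "caution", "might"))]

print(evaluate(synthetic_outputs, criteria))  # two of the three outputs hedge
```

In a real workflow the `judge` stub would be an LLM call, and pass rates over synthetic edge cases would guide iterative refinement of the criteria themselves.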
ROBoto2: An Interactive System and Dataset for LLM-assisted Clinical Trial Risk of Bias Assessment
Anthony Hevia, Sanjana Chintalapati, Veronica Ka Wai Lai, Nguyen Thanh Tam, Wai-Tat Wong, Terry P Klassen, Lucy Lu Wang
We present ROBoto2, an open-source, web-based platform for large language model (LLM)-assisted risk of bias (ROB) assessment of clinical trials. ROBoto2 streamlines the traditionally labor-intensive ROB v2 (ROB2) annotation process via an interactive interface that combines PDF parsing, retrieval-augmented LLM prompting, and human-in-the-loop review. Users can upload clinical trial reports, receive preliminary answers and supporting evidence for ROB2 signaling questions, and provide real-time feedback or corrections to system suggestions. ROBoto2 is publicly available at https://roboto2.vercel.app/, with code and data released to foster reproducibility and adoption. We construct and release a dataset of 521 pediatric clinical trial reports (8954 signaling questions with 1202 evidence passages), annotated using both manual and LLM-assisted methods, serving as a benchmark and enabling future research. Using this dataset, we benchmark ROB2 performance for 4 LLMs and provide an analysis of current model capabilities and ongoing challenges in automating this critical aspect of systematic review.

SpiritRAG: A Q&A System for Religion and Spirituality in the United Nations Archive
Yingqiang Gao, Fabian Winiger, Patrick Montjourides, Anastassia Shaitarova, Nianlong Gu, Simon Peng-Keller, Gerold Schneider
Religion and spirituality (R/S) are complex and highly domain-dependent concepts which have long confounded researchers and policymakers. Due to their context-specificity, R/S are difficult to operationalize in conventional archival search strategies, particularly when datasets are very large, poorly accessible, and marked by information noise. As a result, considerable time investment and specialist knowledge are often needed to extract actionable insights related to R/S from general archival sources, increasing reliance on published literature and manual desk reviews. To address this challenge, we present SpiritRAG, an interactive Question Answering (Q&A) system based on Retrieval-Augmented Generation (RAG). Built using 7,500 United Nations (UN) resolution documents related to R/S in the domains of health and education, SpiritRAG allows researchers and policymakers to conduct complex, context-sensitive database searches of very large datasets using an easily accessible, chat-based web interface. SpiritRAG is lightweight to deploy and leverages both UN documents and user-provided documents as source material. A pilot test and evaluation with domain experts on 100 manually composed questions demonstrates the practical value and usefulness of SpiritRAG.

LingConv: An Interactive Toolkit for Controlled Paraphrase Generation with Linguistic Attribute Control
Mohamed Elgaar, Hadi Amiri
We introduce LingConv, an interactive toolkit for paraphrase generation enabling fine-grained control over 40 specific lexical, syntactic, and discourse linguistic attributes. Users can directly manipulate target attributes using sliders, with automatic imputation of unspecified attributes simplifying the control process. Our adaptive Quality Control mechanism employs iterative refinement guided by line search to precisely steer the generation towards target attributes while preserving semantic meaning, overcoming limitations associated with fixed control strengths. Applications of LingConv include enhancing text accessibility by adjusting complexity for different literacy levels, enabling personalized communication through style adaptation, providing a valuable tool for linguistics and NLP research, and facilitating second language learning by tailoring text complexity. The system is available at https://mohdelgaar-lingconv.hf.space, with a demo video at https://youtu.be/wRBJEJ6EALQ.

AgentMaster: A Multi-Agent Conversational Framework Using A2A and MCP Protocols for Multimodal Information Retrieval and Analysis
Callie C. Liao, Duoduo Liao, Sai Surya Gadiraju
The rise of Multi-Agent Systems (MAS) in Artificial Intelligence (AI), especially integrated with Large Language Models (LLMs), has greatly facilitated the resolution of complex tasks. However, current systems still face challenges in inter-agent communication, coordination, and interaction with heterogeneous tools and resources. Most recently, the Model Context Protocol (MCP) by Anthropic and the Agent-to-Agent (A2A) communication protocol by Google have been introduced, and to the best of our knowledge, very few applications exist where both protocols are employed within a single MAS framework. We present a pilot study of AgentMaster, a novel modular multi-protocol MAS framework with self-implemented A2A and MCP, enabling dynamic coordination, flexible communication, and rapid development with faster iteration. Through a unified conversational interface, the system supports natural language interaction without prior technical expertise and responds to multimodal queries for tasks including information retrieval, question answering, and image analysis. The experiments are validated through both human evaluation and quantitative metrics, including BERTScore F1 (96.3%) and LLM-as-a-Judge G-Eval (87.1%). These results demonstrate robust automated inter-agent coordination, query decomposition, task allocation, dynamic routing, and domain-specific relevant responses. Overall, our proposed framework contributes to the potential capabilities of domain-specific, cooperative, and scalable conversational AI powered by MAS.
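LingConv's adaptive Quality Control steers generation toward a target attribute value via line search rather than a fixed control strength. The idea can be illustrated with a toy numeric stand-in; the generator and attribute below are invented for illustration and bear no relation to the LingConv model, which controls up to 40 linguistic attributes at once:

```python
# Toy sketch of adaptive attribute steering via line search, in the spirit
# of LingConv's Quality Control: adjust a control strength until a measured
# attribute of the generated text hits its target. The generator and
# attribute here are invented stand-ins, not the LingConv model.

def generate(strength: float) -> str:
    """Stand-in generator: higher control strength -> longer paraphrase."""
    n_words = max(1, int(round(5 + 10 * strength)))
    return " ".join(["word"] * n_words)

def measure(text: str) -> float:
    """Measured attribute: word count of the generated text."""
    return float(len(text.split()))

def steer(target: float, lo: float = 0.0, hi: float = 1.0, iters: int = 30) -> float:
    """Bisection line search on control strength until the attribute ~ target."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if measure(generate(mid)) < target:
            lo = mid  # control too weak: strengthen
        else:
            hi = mid  # strong enough: back off
    return (lo + hi) / 2.0

strength = steer(target=12.0)
print(measure(generate(strength)))  # lands within one word of the target
```

Bisection assumes the attribute responds monotonically to the control strength; when that holds, the search finds the strength that hits the target without having to fix it in advance.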
The iRead4Skills Intelligent Complexity Analyzer
Wafa Aissa, Raquel Amaro, David Antunes, Thibault Bañeras-Roux, Jorge Baptista, Alejandro Catala, Luís Correia, Thomas François, Marcos Garcia, Mario Izquierdo-Álvarez, Nuno Mamede, Vasco Martins, Miguel Neves, Eugénio Ribeiro, Sandra Rodriguez Rey, Elodie Vanzeveren
We present the iRead4Skills Intelligent Complexity Analyzer, an open-access platform specifically designed to assist educators and content developers in addressing the needs of low-literacy adults by analyzing and diagnosing text complexity. This multilingual system integrates a range of Natural Language Processing (NLP) components to assess input texts along multiple levels of granularity and linguistic dimensions in Portuguese, Spanish, and French. It assigns four tailored difficulty levels using state-of-the-art models, and introduces four diagnostic yardsticks (textual structure, lexicon, syntax, and semantics), offering users actionable feedback on specific dimensions of textual complexity. Each component of the system is supported by experiments comparing alternative models on manually annotated data.

AIPOM: Agent-aware Interactive Planning for Multi-Agent Systems
Hannah Kim, Kushan Mitra, Chen Shen, Dan Zhang, Estevam Hruschka
Large language models (LLMs) are being increasingly used for planning in orchestrated multi-agent systems. However, existing LLM-based approaches often fall short of human expectations and, critically, lack effective mechanisms for users to inspect, understand, and control their behaviors. These limitations call for enhanced transparency, controllability, and human oversight. To address this, we introduce AIPOM, a system supporting human-in-the-loop planning through conversational and graph-based interfaces. AIPOM enables users to transparently inspect, refine, and collaboratively guide LLM-generated plans, significantly enhancing user control and trust in multi-agent workflows. Our code and demo video are available at https://github.com/megagonlabs/aipom.

LAD: LoRA-Adapted Diffusion
Ruurd Jan Anthonius Kuiper, Lars de Groot, Bram van Es, Maarten van Smeden, Ayoub Bagheri
Autoregressive models dominate text generation but suffer from left-to-right decoding constraints that limit efficiency and bidirectional reasoning. Diffusion-based models offer a flexible alternative but face challenges in adapting to discrete text efficiently. We propose LAD (LoRA-Adapted Diffusion), a framework for non-autoregressive generation that adapts LLaMA models for iterative, bidirectional sequence refinement using LoRA adapters. LAD employs a structural denoising objective combining masking with text perturbations (swaps, duplications, and span shifts), enabling full sequence editing during generation. We aim to demonstrate that LAD could be a viable and efficient alternative to training diffusion models from scratch, by providing validation results and two interactive demos available online at https://ruurdkuiper.github.io/tini-lad/ and https://huggingface.co/spaces/Ruurd/tini-lad. Inference and training code: https://github.com/RuurdKuiper/lad-code

Automated Evidence Extraction and Scoring for Corporate Climate Policy Engagement: A Multilingual RAG Approach
Imene Kolli, Saeid Vaghefi, Chiara Colesanti Senni, Shantam Raj, Markus Leippold
InfluenceMap’s LobbyMap Platform monitors the climate policy engagement of over 500 companies and 250 industry associations, assessing each entity’s support or opposition to science-based policy pathways for achieving the Paris Agreement’s goal of limiting global warming to 1.5°C. Although InfluenceMap has made progress with automating key elements of the analytical workflow, a significant portion of the assessment remains manual, making it time- and labor-intensive and susceptible to human error. We propose an AI-assisted framework to accelerate the monitoring of corporate climate policy engagement by leveraging Retrieval-Augmented Generation to automate the most time-intensive extraction of relevant evidence from large-scale textual data. Our evaluation shows that a combination of layout-aware parsing, the Nomic embedding model, and few-shot prompting strategies yields the best performance in extracting and classifying evidence from multilingual corporate documents. We conclude that while the automated RAG system effectively accelerates evidence extraction, the nuanced nature of the analysis necessitates a human-in-the-loop approach where the technology augments, rather than replaces, expert judgment to ensure accuracy.

GLiNER2: Schema-Driven Multi-Task Learning for Structured Information Extraction
Urchade Zaratiana, Gil Pasternak, Oliver Boyd, George Hurn-Maloney, Ash Lewis
Information extraction (IE) is fundamental to numerous NLP applications, yet existing solutions often require specialized models for different tasks or rely on computationally expensive large language models. We present GLiNER2, a unified framework that enhances the original GLiNER architecture to support named entity recognition, text classification, and hierarchical structured data extraction within a single efficient model. Built on a fine-tuned encoder architecture, GLiNER2 maintains CPU efficiency and compact size while introducing multi-task composition through an intuitive schema-based interface. Our experiments demonstrate competitive performance across diverse IE tasks with substantial improvements in deployment accessibility compared to LLM-based alternatives. We release GLiNER2 as an open-source library available through pip, complete with pre-trained models and comprehensive documentation.
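The swap, duplication, and span-shift perturbations named in LAD's structural denoising objective can be written down directly; a model trained on such corruptions learns to invert them. The implementations below are generic token-list operations, illustrative rather than LAD's actual code:

```python
# Sketch of the structural text perturbations LAD's denoising objective
# trains against: token swaps, duplications, and span shifts. Function
# names and exact corruption rules are illustrative, not LAD's code.
import random

def swap_tokens(tokens, i):
    """Swap adjacent tokens at positions i and i+1."""
    out = list(tokens)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out

def duplicate_token(tokens, i):
    """Duplicate the token at position i."""
    return list(tokens[:i + 1]) + [tokens[i]] + list(tokens[i + 1:])

def shift_span(tokens, start, length, offset):
    """Remove a span of tokens and reinsert it `offset` positions later."""
    out = list(tokens)
    span = out[start:start + length]
    del out[start:start + length]
    dest = start + offset
    return out[:dest] + span + out[dest:]

def corrupt(tokens, seed=0):
    """Apply one random perturbation; a denoiser learns to undo these."""
    rng = random.Random(seed)
    op = rng.choice(["swap", "dup", "shift"])
    if op == "swap":
        return swap_tokens(tokens, rng.randrange(len(tokens) - 1))
    if op == "dup":
        return duplicate_token(tokens, rng.randrange(len(tokens)))
    return shift_span(tokens, 0, 1, rng.randrange(1, len(tokens)))

print(swap_tokens(["a", "b", "c"], 0))            # ['b', 'a', 'c']
print(duplicate_token(["a", "b"], 1))             # ['a', 'b', 'b']
print(shift_span(["a", "b", "c", "d"], 0, 2, 1))  # ['c', 'a', 'b', 'd']
```

Pairing corrupted sequences with their originals yields training data for the bidirectional, full-sequence editing the abstract describes.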
SciClaims: An End-to-End Generative System for Biomedical Claim Analysis
Raúl Ortega, Jose Manuel Gomez-Perez
We present SciClaims, an interactive web-based system for end-to-end scientific claim analysis in the biomedical domain. Designed for high-stakes use cases such as systematic literature reviews and patent validation, SciClaims extracts claims from text, retrieves relevant evidence from PubMed, and verifies their veracity. The system features a user-friendly interface where users can input scientific text and view extracted claims, predictions, supporting or refuting evidence, and justifications in natural language. Unlike prior approaches, SciClaims seamlessly integrates the entire scientific claim analysis process using a single large language model, without requiring additional fine-tuning. SciClaims is optimized to run efficiently on a single GPU and is publicly available for live interaction.

AgentCPM-GUI: Building Mobile-Use Agents with Reinforcement Fine-Tuning
Zhong Zhang, Yaxi Lu, Yikun Fu, Yupeng Huo, Shenzhi Yang, Yesai Wu, Han Si, Xin Cong, Haotian Chen, Yankai Lin, Jie Xie, Wei Zhou, Wang Xu, Yuanheng Zhang, Zhou Su, Zhongwu Zhai, Xiaoming Liu, Yudong Mei, Jianming Xu, Hongyan Tian, Chongyi Wang, Chi Chen, Yuan Yao, Zhiyuan Liu, Maosong Sun
Large language model agents have enabled GUI-based automation, particularly for mobile devices. However, deployment remains limited by noisy data, poor generalization, and lack of support for non-English GUIs. In this work
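Several of the systems above (SciClaims, SpiritRAG, ROBoto2) share a retrieval step that ranks candidate evidence passages against a query or claim. A crude stand-in using token-overlap (Jaccard) scoring shows the shape of that step; the data is made up for illustration, and production systems use dense embeddings or search APIs instead:

```python
# Minimal sketch of the evidence-retrieval step in a claim-analysis
# pipeline: rank candidate passages against a claim by token overlap.
# A crude stand-in for dense or API-based retrieval; data is invented.

def tokenize(text):
    return set(text.lower().split())

def rank_evidence(claim, passages):
    """Return passages sorted by Jaccard overlap with the claim."""
    q = tokenize(claim)
    def score(p):
        t = tokenize(p)
        return len(q & t) / len(q | t)
    return sorted(passages, key=score, reverse=True)

claim = "aspirin reduces fever in children"
passages = [
    "trial shows aspirin reduces fever in children under five",
    "the study protocol was approved by the ethics board",
]
best = rank_evidence(claim, passages)[0]
print(best)  # the passage that actually mentions the claim's terms
```

The top-ranked passages would then be fed to an LLM for the verification and justification stages these systems describe.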

Executive Summary

The Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations presents innovative tools and methodologies in natural language processing (NLP). Among the many demonstrations, two notable contributions stand out: EvalAssist, a web-based application with an integrated synthetic data generation tool designed to support human-centered evaluation of language model outputs, and ROBoto2, an interactive system for LLM-assisted risk of bias assessment in clinical trials. Both tools emphasize user-centered design, flexibility, and iterative workflows in enhancing the evaluation and application of language models.

Key Points

  • EvalAssist addresses key pain points in LLM evaluation, including circularity risks, compounded bias, and poor support for edge cases.
  • ROBoto2 streamlines the risk of bias assessment process in clinical trials using LLM-assisted annotation and human-in-the-loop review.
  • Both tools incorporate user feedback and iterative workflows to improve the evaluation and application of language models.

Merits

User-Centered Design

Both EvalAssist and ROBoto2 were designed around feedback from their target users and support iterative workflows, helping ensure that the tools meet the practical needs of AI practitioners and researchers.

Flexibility and Control

EvalAssist offers flexible prompting, RAG-based grounding, persona diversity, and iterative generation workflows, providing users with fine-grained control over the evaluation process.

Open-Source and Reproducibility

ROBoto2 is open-source and publicly available, fostering reproducibility and adoption in the research community.

Demerits

Potential Bias in Synthetic Data

While EvalAssist aims to mitigate bias, the use of synthetic data may still introduce unintended biases if not carefully curated and validated.

Limited Scope of ROBoto2

ROBoto2 is specifically designed for clinical trial risk of bias assessment, which may limit its applicability to other domains.

Dependence on User Expertise

The effectiveness of both tools depends on the expertise of their users, which varies across practitioners and can affect the quality of the resulting evaluations.

Expert Commentary

The Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations showcases significant advancements in the field of NLP, particularly in the evaluation and application of language models. EvalAssist and ROBoto2 represent innovative approaches to addressing critical issues such as bias, circularity, and the need for user-centered design. The tools' emphasis on flexibility, control, and iterative workflows aligns with the growing recognition of the importance of human oversight and expertise in AI systems. However, the potential for bias in synthetic data and the limited scope of ROBoto2 highlight the need for continued research and development in these areas. The open-source nature of ROBoto2 fosters reproducibility and adoption, which is crucial for advancing the field. Overall, these contributions have significant implications for both practical applications and policy recommendations, promoting more transparent, accountable, and reliable AI systems.

Recommendations

  • Further research should focus on validating the effectiveness of synthetic data generation tools in mitigating bias and improving the evaluation of language models.
  • Expanding the scope of tools like ROBoto2 to other domains could enhance their applicability and impact in various fields.
