Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Yansong Feng, Els Lefever (Editors)
Anthology ID: 2023.emnlp-demo | December 2023 | Singapore | Venue: EMNLP | Publisher: Association for Computational Linguistics
URL: https://aclanthology.org/2023.emnlp-demo/
PDF: https://aclanthology.org/2023.emnlp-demo.pdf

Fabricator: An Open Source Toolkit for Generating Labeled Training Data with Teacher LLMs
Jonas Golde | Patrick Haller | Felix Hamborg | Julian Risch | Alan Akbik
Most NLP tasks are modeled as supervised learning and thus require labeled training data to train effective models. However, manually producing such data at sufficient quality and quantity is known to be costly and time-intensive. Current research addresses this bottleneck by exploring a novel paradigm called zero-shot learning via dataset generation. Here, a powerful LLM is prompted with a task description to generate labeled data that can be used to train a downstream NLP model. For instance, an LLM might be prompted to “generate 500 movie reviews with positive overall sentiment, and another 500 with negative sentiment.” The generated data could then be used to train a binary sentiment classifier, effectively leveraging an LLM as a teacher to a smaller student model. With this demo, we introduce Fabricator, an open-source Python toolkit for dataset generation. Fabricator implements common dataset generation workflows, supports a wide range of downstream NLP tasks (such as text classification, question answering, and entity recognition), and is integrated with well-known libraries to facilitate quick experimentation.
With Fabricator, we aim to support researchers in conducting reproducible dataset generation experiments using LLMs and help practitioners apply this approach to train models for downstream tasks.

End-to-End Evaluation for Low-Latency Simultaneous Speech Translation
Christian Huber | Tu Anh Dinh | Carlos Mullov | Ngoc-Quan Pham | Thai Binh Nguyen | Fabian Retkowski | Stefan Constantin | Enes Ugan | Danni Liu | Zhaolin Li | Sai Koneru | Jan Niehues | Alexander Waibel
The challenge of low-latency speech translation has recently drawn significant interest in the research community, as shown by several publications and shared tasks. It is therefore essential to evaluate these different approaches in realistic scenarios. However, currently only specific aspects of the systems are evaluated, and it is often not possible to compare different approaches. In this work, we propose the first framework to perform and evaluate the various aspects of low-latency speech translation under realistic conditions. The evaluation is carried out in an end-to-end fashion, covering the segmentation of the audio as well as the run-time of the different components. Secondly, we compare different approaches to low-latency speech translation using this framework: we evaluate models with the option to revise the output as well as methods with fixed output, and we directly compare state-of-the-art cascaded and end-to-end systems. Finally, the framework automatically evaluates translation quality as well as latency, and provides a web interface to show the low-latency model outputs to the user.
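The kind of end-to-end latency measurement described in the abstract above can be illustrated in miniature. The sketch below computes a simple average lag between the moment a source audio segment ends and the moment its translation is emitted; the data format and the metric itself are invented for illustration and are not the framework's actual implementation, which also models segmentation and per-component runtime.

```python
# Minimal sketch: average emission lag for simultaneous speech translation.
# Hypothetical timing format: (source_segment_end_s, translation_emit_s).

def average_lag(segments):
    """Mean delay between source-segment end and translation emission."""
    if not segments:
        raise ValueError("no segments to evaluate")
    lags = [emit - src_end for src_end, emit in segments]
    return sum(lags) / len(lags)

# Example: translations appear 0.8 s, 1.2 s, and 1.0 s after the
# corresponding source audio ends.
timings = [(2.0, 2.8), (4.5, 5.7), (7.0, 8.0)]
print(round(average_lag(timings), 2))  # 1.0
```

Real simultaneous-translation metrics (e.g. average lagging) additionally account for partial outputs and re-translation, which a per-segment mean like this does not capture.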
ChatReport: Democratizing Sustainability Disclosure Analysis through LLM-based Tools
Jingwei Ni | Julia Bingler | Chiara Colesanti-Senni | Mathias Kraus | Glen Gostlow | Tobias Schimanski | Dominik Stammbach | Saeid Ashraf Vaghefi | Qian Wang | Nicolas Webersinke | Tobias Wekhof | Tingyu Yu | Markus Leippold
In the face of climate change, are companies really taking substantial steps toward more sustainable operations? A comprehensive answer lies in the dense, information-rich landscape of corporate sustainability reports. However, the sheer volume and complexity of these reports make human analysis very costly, so only a few entities worldwide have the resources to analyze them at scale, which leads to a lack of transparency in sustainability reporting. Empowering stakeholders with LLM-based automatic analysis tools can be a promising way to democratize sustainability report analysis. However, developing such tools is challenging due to (1) the hallucination of LLMs and (2) the inefficiency of bringing domain experts into the AI development loop. In this paper, we introduce ChatReport, a novel LLM-based system to automate the analysis of corporate sustainability reports, addressing both challenges by (1) making the answers traceable to reduce the harm of hallucination and (2) actively involving domain experts in the development loop. We make our methodology, annotated datasets, and generated analyses of 1015 reports publicly available.
Video introduction: https://www.youtube.com/watch?v=Q5AzaKzPE4M
GitHub: https://github.com/EdisonNi-hku/chatreport
Live web app: reports.chatclimate.ai

RaLLe: A Framework for Developing and Evaluating Retrieval-Augmented Large Language Models
Yasuto Hoshi | Daisuke Miyashita | Youyang Ng | Kento Tatsuno | Yasuhiro Morioka | Osamu Torii | Jun Deguchi
Retrieval-augmented large language models (R-LLMs) combine pre-trained large language models (LLMs) with information retrieval systems to improve the accuracy of factual question answering. However, current libraries for building R-LLMs provide high-level abstractions without sufficient transparency for evaluating and optimizing prompts within specific inference processes such as retrieval and generation. To address this gap, we present RaLLe, an open-source framework designed to facilitate the development, evaluation, and optimization of R-LLMs for knowledge-intensive tasks. With RaLLe, developers can easily develop and evaluate R-LLMs, improving hand-crafted prompts, assessing individual inference processes, and objectively measuring overall system performance quantitatively. By leveraging these features, developers can enhance the performance and accuracy of their R-LLMs in knowledge-intensive generation tasks.

VIST5: An Adaptive, Retrieval-Augmented Language Model for Visualization-oriented Dialog
Henrik Voigt | Nuno Carvalhais | Monique Meuschke | Markus Reichstein | Sina Zarrieß | Kai Lawonn
The advent of large language models has brought about new ways of interacting with data intuitively via natural language. In recent years, a variety of visualization systems have explored the use of natural language to create and modify visualizations through visualization-oriented dialog. However, the majority of these systems rely on tailored dialog agents to analyze domain-specific data and operate domain-specific visualization tools and libraries.
This is a major challenge when trying to transfer functionalities between dialog interfaces of different visualization applications. To address this issue, we propose VIST5, a visualization-oriented dialog system that focuses on easy adaptability to an application domain as well as easy transferability of language-controllable visualization library functions between applications. Its architecture is based on a retrieval-augmented T5 language model that leverages few-shot learning capabilities to enable rapid adaptation of the system.

H2O Open Ecosystem for State-of-the-art Large Language Models
Arno Candel | Jon McKinney | Philipp Singer | Pascal Pfeiffer | Maximilian Jeblick | Chun Ming Lee | Marcos Conde
Large Language Models (LLMs) represent a revolution in AI. However, they also pose many significant risks, such as the presence of biased, private, copyrighted, or harmful text. For this reason, we need open, transparent, and safe solutions. We introduce a complete open-source ecosystem for developing and testing LLMs. The goal of this project is to boost open alternatives to closed-source approaches. We release h2oGPT, a family of fine-tuned LLMs with 7 to 70 billion parameters. We also introduce H2O LLM Studio, a framework and no-code GUI designed for efficient fine-tuning, evaluation, and deployment of LLMs using the most recent state-of-the-art techniques. Our code and models are licensed under fully permissive Apache 2.0 licenses. We believe open-source language models help to boost AI development and make it more accessible and trustworthy. Our demo is available at: https://gpt.h2o.ai/

Koala: An Index for Quantifying Overlaps with Pre-training Corpora
Thuy-Trang Vu | Xuanli He | Gholamreza Haffari | Ehsan Shareghi
In recent years, increasing attention has been placed on probing the role of pre-training data in the downstream behaviour of Large Language Models (LLMs).
Despite its importance, there is no public tool that supports such analysis of pre-training corpora at large scale. To help research in this space, we launch Koala, a searchable index over large pre-training corpora that uses lossless compressed suffix arrays with a highly efficient compression rate and search support. In its first release, we index the public portion of the OPT 175B, GPT-3, GPT-Neo, LLaMA, BERT, ELECTRA, RoBERTa, and XLNet pre-training corpora. Koala provides a framework for forensic analysis of current and future benchmarks, as well as for assessing the degree of memorization in the output of LLMs. Koala is available for public use at https://koala-index.erc.monash.edu/.

Sudowoodo: A Chinese Lyric Imitation System with Source Lyrics
Yongzhu Chang | Rongsheng Zhang | Lin Jiang | Qihang Chen | Le Zhang | Jiashu Pu
Lyrics generation is a well-known application in natural language generation research, with several previous studies focusing on generating accurate lyrics using precise control such as keywords, rhymes, etc. However, lyrics imitation, which involves writing new lyrics by imitating the style and content of the source lyrics, remains a challenging task due to the lack of a parallel corpus. In this paper, we introduce Sudowoodo, a Chinese lyrics imitation system that can generate new lyrics based on the text of source lyrics. To address the lack of a parallel training corpus for lyrics imitation, we propose a novel framework that constructs a parallel corpus from source lyrics using a keyword-based lyrics model. The resulting (new lyrics, source lyrics) pairs are then used to train the lyrics imitation model. During inference, a post-processing module filters and ranks the generated lyrics, selecting the highest-quality ones. As a bonus, we incorporate audio information and align the lyrics with the audio to form songs.
The human evaluation results show that our framework achieves better lyric imitation. The Sudowoodo system and a demo video of the system are available at Sudowoodo and https://youtu.be/u5BBT_j1L5M

ConvLab-3: A Flexible Dialogue System Toolkit Based on a Unified Data Format
Qi Zhu | Christian Geishauser | Hsien-chin Lin | Carel van Niekerk | Baolin Peng | Zheng Zhang | Shutong Feng | Michael Heck | Nurul Lubis | Dazhen Wan | Xiaochen Zhu | Jianfeng Gao | Milica Gasic | Minlie Huang
Task-oriented dialogue (TOD) systems function as digital assistants, guiding users through various tasks such as booking flights or finding restaurants. Existing toolkits for building TOD systems often fall short in delivering comprehensive arrays of data, model, and experimental environments with a user-friendly experience. We introduce ConvLab-3: a multifaceted dialogue system toolkit crafted to bridge this gap. Our unified data format simplifies the integration of diverse datasets and models, significantly reducing complexity and cost for studying generalization and transfer. Enhanced with robust reinforcement learning (RL) tools, featuring a streamlined training process, in-depth evaluation tools, and a selection of user simulators, ConvLab-3 supports the rapid development and evaluation of robust dialogue policies. Through an extensive study, we demonstrate the efficacy of transfer learning and RL and showcase that ConvLab-3 is not only a powerful tool for seasoned researchers but also an accessible platform for newcomers.

FLEEK: Factual Error Detection and Correction with Evidence Retrieved from External Knowledge
Farima Fatahi Bayat | Kun Qian | Benjamin Han | Yisi Sang | Anton Belyy | Samira Khorshidi | Fei Wu | Ihab Ilyas | Yunyao Li
Detecting factual errors in textual information, whether generated by large language models (LLMs) or curated by humans, is crucial for making informed decisions.
LLMs’ inability to attribute their claims to external knowledge and their tendency to hallucinate make it difficult to rely on their responses. Humans, too, are prone to factual errors in their writing. Since manual detection and correction of factual errors is labor-intensive, developing an automatic approach can greatly reduce human effort. We present a prototype tool that automatically extracts factual claims from text, gathers evidence from external knowledge sources, evaluates the factuality of each claim, and suggests revisions for identified errors using the collected evidence. An initial empirical evaluation on fact error detection (77-85% F1) shows the potential of our tool.

YATO: Yet Another deep learning based Text analysis Open toolkit
Zeqiang Wang | Yile Wang | Jiageng Wu | Zhiyang Teng | Jie Yang
We introduce YATO, an open-source, easy-to-use toolkit for text analysis with deep learning. Unlike existing heavily engineered toolkits and platforms, YATO is lightweight and user-friendly for researchers from cross-disciplinary areas. Designed in a hierarchical structure, YATO supports free combinations of three types of widely used features: 1) traditional neural networks (CNN, RNN, etc.); 2) pre-trained language models (BERT, RoBERTa, ELECTRA, etc.); and 3) user-customized neural features via a simple configuration file. Benefiting from its flexibility and ease of use, YATO can facilitate fast reproduction and refinement of state-of-the-art NLP models and promote cross-disciplinary applications of NLP techniques. The code, examples, and documentation are publicly available at https://github.com/jiesutd/YATO. A demo video is also available at https://www.youtube.com/playlist?list=PLJ0mhzMcRuDUlTkzBfAftOqiJRxYTTjXH.
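The suffix-array machinery that Koala (above) scales up to full pre-training corpora can be shown in miniature. The sketch below builds a plain, uncompressed suffix array over a tiny corpus and binary-searches it for verbatim overlap with a query string; Koala itself uses lossless compressed suffix arrays and indexes corpora far too large for this naive construction, so this is an illustration of the search principle only.

```python
def build_suffix_array(text):
    """Return suffix start offsets sorted lexicographically.

    O(n^2 log n) toy construction; real indexes like Koala's use
    compressed suffix arrays built in (near-)linear time.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])

def occurs(text, sa, query):
    """Binary-search the suffix array for a verbatim occurrence of query."""
    lo, hi = 0, len(sa)
    while lo < hi:  # lower-bound search over query-length prefixes
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(query)] < query:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and text[sa[lo]:sa[lo] + len(query)] == query

corpus = "the quick brown fox jumps over the lazy dog"
sa = build_suffix_array(corpus)
print(occurs(corpus, sa, "lazy dog"))  # True
print(occurs(corpus, sa, "lazy fox"))  # False
```

The same lower-bound search extends naturally to counting occurrences (find the matching range instead of one hit), which is how overlap between benchmarks and pre-training data can be quantified.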
Spacerini: Plug-and-play Search Engines with Pyserini and Hugging Face
Christopher Akiki | Odunayo Ogundepo | Aleksandra Piktus | Xinyu Zhang | Akintunde Oladipo | Jimmy Lin | Martin Potthast
We present Spacerini, a tool that integrates the Pyserini toolkit for reproducible information retrieval research with Hugging Face to enable the se
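Retrieval is central to several of the systems above (RaLLe's R-LLM pipelines, Spacerini's search engines). The retrieve-then-prompt step at the heart of a retrieval-augmented LLM can be sketched as follows; bag-of-words cosine similarity stands in for a real retriever, the "generation" step merely assembles a prompt rather than calling an LLM, and the documents and prompt template are invented for illustration.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    num = sum(a[t] * b[t] for t in a)
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def retrieve(query, docs, k=1):
    """Return the k documents most similar to the query."""
    qv = Counter(query.lower().split())
    ranked = sorted(docs, key=lambda d: cosine(qv, Counter(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(query, docs):
    """Assemble a grounded prompt from the retrieved context."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return (f"Answer using only the context below.\n"
            f"Context:\n{context}\nQuestion: {query}")

docs = [
    "Singapore hosted EMNLP 2023 in December.",
    "Suffix arrays support fast substring search.",
]
print(build_prompt("Where was EMNLP 2023 held?", docs))
```

A production R-LLM would swap in a dense or BM25 retriever and feed the assembled prompt to an LLM; frameworks like RaLLe exist precisely to make each of these steps inspectable and tunable.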
Executive Summary
The Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP): System Demonstrations present cutting-edge research and tools in natural language processing. Two notable contributions are Fabricator, an open-source toolkit for generating labeled training data using large language models (LLMs), and an end-to-end framework for evaluating low-latency simultaneous speech translation. These advancements address critical bottlenecks in NLP research and practice, emphasizing reproducibility and real-world applicability.
Key Points
- ▸ Introduction of Fabricator, an open-source toolkit for dataset generation using LLMs.
- ▸ Proposal of an end-to-end evaluation framework for low-latency simultaneous speech translation.
- ▸ Emphasis on reproducibility and practical application in NLP research.
Merits
Innovative Toolkit
Fabricator provides a comprehensive and user-friendly toolkit for generating labeled training data, which is crucial for advancing supervised learning in NLP. Its integration with well-known libraries facilitates quick experimentation and reproducibility.
Comprehensive Evaluation Framework
The proposed framework for evaluating low-latency simultaneous speech translation addresses a significant gap in the current research landscape. It offers a holistic approach to assessing various aspects of speech translation systems under realistic conditions.
Demerits
Limited Scope of Evaluation
While the evaluation framework for low-latency speech translation is comprehensive, it may not cover all possible real-world scenarios and edge cases, which could limit its applicability in certain contexts.
Dependency on LLM Quality
The effectiveness of Fabricator is highly dependent on the quality and capabilities of the LLMs used for dataset generation. Variations in LLM performance could impact the reliability and quality of the generated datasets.
Expert Commentary
The Proceedings of the 2023 EMNLP: System Demonstrations highlight the rapid advancements in NLP research, particularly in the areas of dataset generation and speech translation evaluation. Fabricator represents a significant step forward in addressing the data bottleneck in NLP, offering a robust and reproducible method for generating labeled training data. Its integration with popular libraries and support for a wide range of NLP tasks make it a valuable tool for both researchers and practitioners.

The proposed evaluation framework for low-latency speech translation is equally noteworthy, as it provides a comprehensive approach to assessing the performance of these systems under realistic conditions. This is crucial for advancing the field and ensuring that the technologies developed are reliable and effective in real-world applications.

However, it is important to note that the effectiveness of these tools and frameworks is contingent on the quality of the underlying LLMs and the specific use cases they are applied to. Future research should focus on addressing these limitations and expanding the scope of evaluation to cover a broader range of scenarios.
Recommendations
- ✓ Further development and refinement of Fabricator to support a broader range of NLP tasks and improve its robustness across different LLMs.
- ✓ Expansion of the evaluation framework for low-latency speech translation to include more diverse and complex real-world scenarios, ensuring its applicability in various contexts.