
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations - ACL Anthology


Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Editors: Delia Irazu Hernandez Farias, Tom Hope, Manling Li
Anthology ID: 2024.emnlp-demo
Month: November
Year: 2024
Address: Miami, Florida, USA
Venue: EMNLP
Publisher: Association for Computational Linguistics
URL: https://aclanthology.org/2024.emnlp-demo/
DOI: 10.18653/v1/2024.emnlp-demo
Full proceedings PDF: https://aclanthology.org/2024.emnlp-demo.pdf

FreeEval: A Modular Framework for Trustworthy and Efficient Evaluation of Large Language Models
Zhuohao Yu | Chang Gao | Wenjin Yao | Yidong Wang | Zhengran Zeng | Wei Ye | Jindong Wang | Yue Zhang | Shikun Zhang
The rapid growth of evaluation methodologies and datasets for large language models (LLMs) has created a pressing need for their unified integration. Meanwhile, concerns about data contamination and bias compromise the trustworthiness of evaluation findings, while the efficiency of evaluation processes remains a bottleneck due to the significant computational costs associated with LLM inference. In response to these challenges, we introduce FreeEval, a modular framework not only for conducting trustworthy and efficient automatic evaluations of LLMs but also serving as a platform to develop and validate new evaluation methodologies. FreeEval addresses key challenges through: (1) unified abstractions that simplify the integration of diverse evaluation methods, including dynamic evaluations requiring complex LLM interactions; (2) built-in meta-evaluation techniques such as data contamination detection and human evaluation to enhance result fairness; (3) a high-performance infrastructure with distributed computation and caching strategies for efficient large-scale evaluations; and (4) an interactive Visualizer for result analysis and interpretation to support innovation of evaluation techniques. We open-source all our code at https://github.com/WisdomShell/FreeEval, and our demonstration video, live demo, and installation guides are available at https://freeeval.zhuohao.me/.

i-Code Studio: A Configurable and Composable Framework for Integrative AI
Yuwei Fang | Mahmoud Khademi | Chenguang Zhu | Ziyi Yang | Reid Pryzant | Yichong Xu | Yao Qian | Takuya Yoshioka | Lu Yuan | Michael Zeng | Xuedong Huang
Artificial General Intelligence (AGI) requires comprehensive understanding and generation capabilities for a variety of tasks spanning different modalities and functionalities. Integrative AI is one important direction to approach AGI, by combining multiple models to tackle complex multimodal tasks. However, there is a lack of a flexible and composable platform to facilitate efficient and effective model composition and coordination. In this paper, we propose the i-Code Studio, a configurable and composable framework for Integrative AI. The i-Code Studio orchestrates multiple pre-trained models in a finetuning-free fashion to conduct complex multimodal tasks. Rather than offering simple model composition, the i-Code Studio provides an integrative, flexible, and composable setting for developers to quickly and easily compose cutting-edge services and technologies tailored to their specific requirements. The i-Code Studio achieves impressive results on a variety of zero-shot multimodal tasks, such as video-to-text retrieval, speech-to-speech translation, and visual question answering. We also demonstrate how to quickly build a multimodal agent based on the i-Code Studio that can communicate with and personalize for users. The project page with demonstrations and code is at https://i-code-studio.github.io/.

Evalverse: Unified and Accessible Library for Large Language Model Evaluation
Jihoo Kim | Wonho Song | Dahyun Kim | Yunsu Kim | Yungi Kim | Chanjun Park
This paper introduces Evalverse, a novel library that streamlines the evaluation of Large Language Models (LLMs) by unifying disparate evaluation tools into a single, user-friendly framework. Evalverse enables individuals with limited knowledge of artificial intelligence to easily request LLM evaluations and receive detailed reports, facilitated by an integration with communication platforms like Slack. Thus, Evalverse serves as a powerful tool for the comprehensive assessment of LLMs, offering both researchers and practitioners a centralized and easily accessible evaluation framework. Finally, we also provide a demo video for Evalverse, showcasing its capabilities and implementation in a two-minute format.

Medico: Towards Hallucination Detection and Correction with Multi-source Evidence Fusion
Xinping Zhao | Jindi Yu | Zhenyu Liu | Jifang Wang | Dongfang Li | Yibin Chen | Baotian Hu | Min Zhang
Hallucinations prevail in Large Language Models (LLMs): the generated content is coherent but factually incorrect, which inflicts a heavy blow on the widespread application of LLMs. Previous studies have shown that LLMs can confidently state non-existent facts rather than answering “I don’t know”. Therefore, it is necessary to resort to external knowledge to detect and correct hallucinated content. Since manual detection and correction of factual errors is labor-intensive, an automatic end-to-end hallucination-checking approach is needed. To this end, we present Medico, a Multi-source evidence fusion enhanced hallucination detection and correction framework. It fuses diverse evidence from multiple sources, detects whether the generated content contains factual errors, provides the rationale behind the judgment, and iteratively revises the hallucinated content. Experimental results on evidence retrieval (0.964 HR@5, 0.908 MRR@5), hallucination detection (0.927-0.951 F1), and hallucination correction (0.973-0.979 approval rate) demonstrate the great potential of Medico. A video demo of Medico can be found at https://youtu.be/RtsO6CSesBI.

OpenOmni: A Collaborative Open Source Tool for Building Future-Ready Multimodal Conversational Agents
Qiang Sun | Yuanyi Luo | Sirui Li | Wenxiao Zhang | Wei Liu
Multimodal conversational agents are highly desirable because they offer natural and human-like interaction. However, there is a lack of comprehensive end-to-end solutions to support collaborative development and benchmarking. While proprietary systems like GPT-4o and Gemini demonstrate impressive integration of audio, video, and text with response times of 200-250 ms, challenges remain in balancing latency, accuracy, cost, and data privacy. To better understand and quantify these issues, we developed OpenOmni, an open-source, end-to-end pipeline benchmarking tool that integrates advanced technologies such as Speech-to-Text, Emotion Detection, Retrieval-Augmented Generation, and Large Language Models, along with the ability to integrate customized models. OpenOmni supports local and cloud deployment, ensuring data privacy and supporting latency and accuracy benchmarking. This flexible framework allows researchers to customize the pipeline, focusing on real bottlenecks and facilitating rapid proof-of-concept development. OpenOmni can significantly enhance applications like indoor assistance for visually impaired individuals, advancing human-computer interaction. Our demonstration video is available at https://www.youtube.com/watch?v=zaSiT3clWqY, a live demo is available at https://openomni.ai4wa.com, and the code is available at https://github.com/AI4WA/OpenOmniFramework.

Lighthouse: A User-Friendly Library for Reproducible Video Moment Retrieval and Highlight Detection
Taichi Nishimura | Shota Nakada | Hokuto Munakata | Tatsuya Komatsu
We propose Lighthouse, a user-friendly library for reproducible video moment retrieval and highlight detection (MR-HD). Although researchers have proposed various MR-HD approaches, the research community faces two main issues. The first is a lack of comprehensive and reproducible experiments across various methods, datasets, and video-text features, because no unified training and evaluation codebase covers multiple settings. The second is user-unfriendly design: because previous works use different libraries, researchers must set up individual environments. In addition, most works release only the training code, requiring users to implement the whole inference process of MR-HD themselves. Lighthouse addresses these issues by implementing a unified, reproducible codebase that includes six models, three features, and five datasets. In addition, it provides an inference API and web demo to make these methods easily accessible for researchers and developers. Our experiments demonstrate that Lighthouse generally reproduces the reported scores in the reference papers. The code is available at https://github.com/line/lighthouse.

MarkLLM: An Open-Source Toolkit for LLM Watermarking
Leyi Pan | Aiwei Liu | Zhiwei He | Zitian Gao | Xuandong Zhao | Yijian Lu | Binglin Zhou | Shuliang Liu | Xuming Hu | Lijie Wen | Irwin King | Philip S. Yu
Watermarking for Large Language Models (LLMs), which embeds imperceptible yet algorithmically detectable signals in model outputs to identify LLM-generated text, has become crucial in mitigating the potential misuse of LLMs. However, the abundance of LLM watermarking algorithms, their intricate mechanisms, and the complex evaluation procedures and perspectives pose challenges for researchers and the community to easily understand, implement, and evaluate the latest advancements. To address these issues, we introduce MarkLLM, an open-source toolkit for LLM watermarking. MarkLLM offers a unified and extensible framework for implementing LLM watermarking algorithms, while providing user-friendly interfaces to ensure ease of access. Furthermore, it enhances understanding by supporting automatic visualization of the underlying mechanisms of these algorithms. For evaluation, MarkLLM offers a comprehensive suite of 12 tools spanning three perspectives, along with two types of automated evaluation pipelines. Through MarkLLM, we aim to support researchers while improving the comprehension and involvement of the general public in LLM watermarking technology, fostering consensus and driving further advancements in research and application. Our code is available at https://github.com/THU-BPM/MarkLLM.

AUTOGEN STUDIO: A No-Code Developer Tool for Building and Debugging Multi-Agent Systems
Victor Dibia | Jingya Chen | Gagan Bansal | Suff Syed | Adam Fourney | Erkang Zhu | Chi Wang | Saleema Amershi
Multi-agent systems, where multiple agents (generative AI models + tools) collaborate, are emerging as an effective pattern for solving long-running, complex tasks in numerous domains. However, specifying their parameters (such as models, tools, and orchestration mechanisms) and debugging them remains challenging for most developers. To address this challenge, we present AUTOGEN STUDIO, a no-code developer tool for rapidly prototyping, debugging, and evaluating multi-agent workflows built upon the AUTOGEN framework. AUTOGEN STUDIO offers a web interface and a Python API for representing LLM-enabled agents using a declarative (JSON-based) specification. It provides an intuitive drag-and-drop UI for agent workflow specification, interactive evaluation and debugging of workflows, and a gallery of reusable agent components. We highlight four design principles for no-code multi-agent developer tools and contribute an open-source implementation: https://github.com/microsoft/autogen/tree/autogenstudio/samples/apps/autogen-studio

TinyAgent: Function Calling at the Edge
Lutfi Eren Erdogan | Nicholas Lee | Siddharth Jha | Sehoon Kim | Ryan Tabrizi | Suhong Moon | Coleman Richard Charles Hooper | Gopala Anumanchipalli | Kurt Keutzer | Amir Gholami
Recent large language models (LLMs) have enabled the development of advanced agentic systems that can integrate various tools and APIs to fulfill user queries through function calling. However, the deployment of these LLMs at the edge has not been explored, since they typically require cloud-based infrastructure due to their substantial model size and computational demands. To this end, we present TinyAgent, an end-to-end framework for training and deploying task-specific small language model agents capable of function calling for driving agentic systems at the edge. We first show how to enable accurate function calling for open-source models via the LLMCompiler framework. We then systematically curate a high-quality dataset for function calling, which we use to fine-tune two small language models, TinyAgent-1.1B and 7B. For efficient inference, we introduce a novel tool retrieval method to reduce the input prompt length and use quantization to further accelerate inference. As a driving application, we demonstrate a local Siri-like system for Apple’s MacBook that can execute user commands through text or voice input. Our results show that our models can achieve, and even surpass, the function-calling capabilities of larger models like GPT-4-Turbo, while being fully deployed at the edge. We open-source our dataset, models, and installable package (https://github.com/SqueezeAILab/TinyAgent) and provide a demo video (https://www.youtube.com/watch?v=0GvaGL9IDpQ) for our MacBook assistant agent.

TruthReader: Towards Trustworthy Document Assistant Chatbot with Reliable Attribution
Dongfang Li | Xinshuo Hu | Zetian Sun | Baotian Hu | Shaolin Ye | Zifei Shan | Qian Chen | Min Zhang
Document assistant chatbots are empowered with extensive capabilities by Large Language Models (LLMs) and have exhibited significant advancements. However, these systems may suffer from hallucinations that are difficult to verify in the context of the given documents. Moreover, despite the emergence of products for document assistants, they either heavily rely on commercial LLM APIs or lack transparency in their technical implementations, leading to expensive usage costs and data privacy concerns. In this work, we introduce a fully open-source document assistant chatbot with reliable attribution, named TruthReader, using an adapted conversational retriever and LLMs. Our system enables the LLMs to generate answers with detailed inline citations, which can be attributed to the original document paragraphs, facilitating verification of the factual consistency of the generated text. To further adapt the generative model, we develop a comprehensive pipeline consisting of data construction and model optimization processes. This pipeline equips the LLMs with the necessary capabilities to generate accurate answers, produce reliable citations, and refuse unanswerable questions.

Executive Summary

The proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations collect demonstration systems spanning LLM evaluation, multimodal agents, watermarking, hallucination detection, and edge deployment. This digest highlights two representative frameworks. FreeEval is a modular framework for trustworthy and efficient evaluation of large language models (LLMs) that addresses data contamination, bias, and the computational cost of evaluation. i-Code Studio is a configurable and composable framework for Integrative AI that orchestrates multiple pre-trained models, without fine-tuning, to tackle complex multimodal tasks.

Key Points

  • Introduction of FreeEval, a modular framework for evaluating large language models
  • Presentation of i-Code Studio, a configurable and composable framework for integrative AI
  • Emphasis on addressing challenges such as data contamination, bias, and efficiency in evaluation processes (a minimal contamination-check sketch follows this list)
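
To make the data-contamination point concrete, the sketch below shows one common detection heuristic: flagging benchmark items whose word n-grams substantially overlap a training corpus. It is a generic, hypothetical illustration of the idea (function names and thresholds are assumptions), not FreeEval's actual contamination detector.

```python
from typing import Iterable, Set

def ngrams(text: str, n: int = 8) -> Set[str]:
    """Return the set of lower-cased word n-grams in a text."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(benchmark_items: Iterable[str],
                       training_docs: Iterable[str],
                       n: int = 8,
                       overlap_threshold: float = 0.3) -> float:
    """Fraction of benchmark items whose n-grams substantially overlap the training data.

    An item is flagged as potentially contaminated when more than
    `overlap_threshold` of its n-grams also occur in the training corpus.
    """
    train_ngrams: Set[str] = set()
    for doc in training_docs:
        train_ngrams |= ngrams(doc, n)

    flagged, total = 0, 0
    for item in benchmark_items:
        item_ngrams = ngrams(item, n)
        if not item_ngrams:
            continue
        total += 1
        overlap = len(item_ngrams & train_ngrams) / len(item_ngrams)
        if overlap > overlap_threshold:
            flagged += 1
    return flagged / total if total else 0.0
```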

Merits

Unified Abstractions

FreeEval's unified abstractions simplify the integration of diverse evaluation methods, making it easier to conduct trustworthy and efficient evaluations.
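
As a rough sketch of what such a unified abstraction can look like, the hypothetical example below puts a static multiple-choice benchmark and a dynamic LLM-as-judge protocol behind a single Evaluator interface. The class and function names are assumptions made for illustration, not FreeEval's actual API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, Dict, List

# A "model" is abstracted as a text-in/text-out callable, so API models,
# local models, and cached models can be plugged in interchangeably.
Model = Callable[[str], str]

@dataclass
class EvalResult:
    metric: str
    score: float
    details: Dict[str, float]

class Evaluator(ABC):
    """Common interface every evaluation method implements."""

    @abstractmethod
    def run(self, model: Model) -> EvalResult: ...

class MultipleChoiceEval(Evaluator):
    """Static benchmark: compare the model's answer letter to a gold label."""

    def __init__(self, questions: List[Dict[str, str]]):
        self.questions = questions  # each item: {"prompt": ..., "answer": "A"}

    def run(self, model: Model) -> EvalResult:
        correct = sum(
            model(q["prompt"]).strip().upper().startswith(q["answer"])
            for q in self.questions
        )
        return EvalResult("accuracy", correct / len(self.questions),
                          {"n": len(self.questions)})

class LLMJudgeEval(Evaluator):
    """Dynamic evaluation: a judge model scores the candidate's responses."""

    def __init__(self, prompts: List[str], judge: Model):
        self.prompts, self.judge = prompts, judge

    def run(self, model: Model) -> EvalResult:
        scores = []
        for p in self.prompts:
            answer = model(p)
            verdict = self.judge(f"Rate this answer from 1-10.\nQ: {p}\nA: {answer}\nScore:")
            scores.append(float(verdict.strip().split()[0]))
        return EvalResult("judge_score", sum(scores) / len(scores), {"n": len(scores)})

# Because every method exposes the same run(model) entry point, a harness can
# loop over evaluators and models without method-specific glue code.
```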

Flexibility and Composability

i-Code Studio's configurable and composable framework allows for efficient and effective combination of multiple models to tackle complex multimodal tasks.
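
The hypothetical sketch below illustrates the kind of finetuning-free composition this enables: chaining off-the-shelf speech recognition, translation, and speech synthesis models into a zero-shot speech-to-speech translation pipeline, one of the tasks named in the i-Code Studio abstract. The interfaces and names are placeholders chosen for illustration, not i-Code Studio's actual services.

```python
from dataclasses import dataclass
from typing import Callable, List

# Each stage is a pre-trained model wrapped behind a simple callable interface;
# no stage is fine-tuned, the models are only composed.
SpeechToText = Callable[[bytes], str]   # audio waveform -> transcript
TextToText   = Callable[[str], str]     # e.g. a translation model or a prompted LLM
TextToSpeech = Callable[[str], bytes]   # text -> synthesized audio

@dataclass
class Pipeline:
    """A linear composition of heterogeneous pre-trained components."""
    stages: List[Callable]

    def __call__(self, x):
        for stage in self.stages:
            x = stage(x)
        return x

def build_s2st_pipeline(asr: SpeechToText,
                        translator: TextToText,
                        tts: TextToSpeech) -> Pipeline:
    """Zero-shot speech-to-speech translation by composing three models."""
    return Pipeline(stages=[asr, translator, tts])

# Swapping a component (e.g. replacing `translator` with a general-purpose LLM
# prompted to translate, or adding a retrieval stage before it) only changes
# the stage list, not the surrounding code.
```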

Demerits

Complexity

The modular and composable nature of FreeEval and i-Code Studio may require significant expertise and resources to fully utilize and customize.

Scalability

FreeEval's distributed computation and caching strategies reduce, but do not remove, the cost of large-scale evaluation: every run still depends on substantial LLM inference, so scaling to very large models or complex, interaction-heavy evaluation protocols may remain expensive.
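
For context on the caching strategies mentioned in the FreeEval abstract, the sketch below shows one generic way such a cache can work: memoizing model responses keyed by model identifier and prompt, so that repeated or resumed evaluation runs skip redundant inference. This is a simplified illustration under assumed interfaces, not FreeEval's implementation.

```python
import hashlib
import json
import sqlite3
from typing import Callable

Model = Callable[[str], str]

class CachedModel:
    """Wrap a model so identical (model_id, prompt) pairs hit a persistent cache."""

    def __init__(self, model: Model, model_id: str, db_path: str = "eval_cache.sqlite"):
        self.model, self.model_id = model, model_id
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, response TEXT)"
        )

    def _key(self, prompt: str) -> str:
        payload = json.dumps({"model": self.model_id, "prompt": prompt})
        return hashlib.sha256(payload.encode()).hexdigest()

    def __call__(self, prompt: str) -> str:
        key = self._key(prompt)
        row = self.conn.execute(
            "SELECT response FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return row[0]               # cache hit: no LLM inference needed
        response = self.model(prompt)   # cache miss: run (expensive) inference
        self.conn.execute(
            "INSERT OR REPLACE INTO cache (key, response) VALUES (?, ?)", (key, response)
        )
        self.conn.commit()
        return response
```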

Expert Commentary

The presentation of FreeEval and i-Code Studio at the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations highlights the growing importance of addressing challenges in AI evaluation and development. These frameworks demonstrate the potential for modular and composable approaches to improve the trustworthiness and efficiency of AI systems. However, further research is needed to fully address the complexities and scalability concerns associated with these frameworks.

Recommendations

  • Further research on the scalability and customization of FreeEval and i-Code Studio
  • Investigation into the potential applications of these frameworks in real-world AI development and deployment scenarios
