Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Anthology ID: 2024.emnlp-main
Month: November
Year: 2024
Address: Miami, Florida, USA
Venue: EMNLP
Publisher: Association for Computational Linguistics
URL: https://aclanthology.org/2024.emnlp-main/
DOI: 10.18653/v1/2024.emnlp-main

UniGen: Universal Domain Generalization for Sentiment Classification via Zero-shot Dataset Generation
Juhwan Choi, Yeonghwa Kim, Seunguk Yu, JungMin Yun, YoungBin Kim
Although pre-trained language models (PLMs) have exhibited great flexibility and versatility with prompt-based few-shot learning, they suffer from extensive parameter size and limited applicability for inference. Recent studies have suggested that PLMs be used as dataset generators and that a tiny task-specific model be trained to achieve efficient inference. However, their applicability to various domains is limited because they tend to generate domain-specific datasets. In this work, we propose a novel approach to universal domain generalization that generates a dataset regardless of the target domain. This allows the tiny task model to generalize to any domain that shares the label space, thus enhancing the real-world applicability of the dataset generation paradigm. Our experiments indicate that the proposed method accomplishes generalizability across various domains while using a parameter set that is orders of magnitude smaller than PLMs.

Multi-News+: Cost-efficient Dataset Cleansing via LLM-based Data Annotation
Juhwan Choi, JungMin Yun, Kyohoon Jin, YoungBin Kim
The quality of a dataset is crucial for ensuring optimal performance and reliability of downstream task models. However, datasets often contain noisy data inadvertently included during the construction process. Numerous attempts have been made to correct this issue through human annotators, but hiring and managing human annotators is expensive and time-consuming. As an alternative, recent studies are exploring the use of large language models (LLMs) for data annotation. In this study, we present a case study that extends the application of LLM-based data annotation to enhance the quality of existing datasets through a cleansing strategy. Specifically, we leverage approaches such as chain-of-thought and majority voting to imitate human annotation and classify unrelated documents from the Multi-News dataset, which is widely used for the multi-document summarization task. Through our proposed cleansing method, we introduce an enhanced Multi-News+. By employing LLMs for data cleansing, we demonstrate an efficient and effective approach to improving dataset quality without relying on expensive human annotation efforts.

FIZZ: Factual Inconsistency Detection by Zoom-in Summary and Zoom-out Document
Joonho Yang, Seunghyun Yoon, ByeongJeong Kim, Hwanhee Lee
Through the advent of pre-trained language models, there have been notable advancements in abstractive summarization systems. Simultaneously, a considerable number of novel methods for evaluating factual consistency in abstractive summarization systems have been developed. However, these evaluation approaches have substantial limitations, especially in refinement and interpretability. In this work, we propose a highly effective and interpretable factual inconsistency detection method, FIZZ (Factual Inconsistency Detection by Zoom-in Summary and Zoom-out Document), for abstractive summarization systems that is based on fine-grained atomic fact decomposition. Moreover, we align atomic facts decomposed from the summary with the source document through adaptive granularity expansion. These atomic facts represent a more fine-grained unit of information, facilitating detailed understanding and interpretability of the summary's factual inconsistency. Experimental results demonstrate that our proposed factual consistency checking system significantly outperforms existing systems. We release the code at https://github.com/plm3332/FIZZ.

Prompts have evil twins
Rimon Melamed, Lucas Hurley McCabe, Tanay Wakhare, Yejin Kim, H. Howie Huang, Enric Boix-Adserà
We discover that many natural-language prompts can be replaced by corresponding prompts that are unintelligible to humans but that provably elicit similar behavior in language models. We call these prompts "evil twins" because they are obfuscated and uninterpretable (evil), but at the same time mimic the functionality of the original natural-language prompts (twins). Remarkably, evil twins transfer between models. We find these prompts by solving a maximum-likelihood problem which has applications of independent interest.

Table Question Answering for Low-resourced Indic Languages
Vaishali Pal, Evangelos Kanoulas, Andrew Yates, Maarten de Rijke
TableQA is the task of answering questions over tables of structured information, returning individual cells or tables as output. TableQA research has focused primarily on high-resource languages, leaving medium- and low-resource languages with little progress due to scarcity of annotated data and neural models. We address this gap by introducing a fully automatic large-scale tableQA data generation process for low-resource languages with limited budget. We apply our data generation method to two Indic languages, Bengali and Hindi, which have no tableQA datasets or models. TableQA models trained on our large-scale datasets outperform state-of-the-art LLMs. We further study the trained models on different aspects, including mathematical reasoning capabilities and zero-shot cross-lingual transfer. Our work is the first on low-resource tableQA focusing on scalable data generation and evaluation procedures. Our proposed data generation method can be applied to any low-resource language with a web presence. We release datasets, models, and code (https://github.com/kolk/Low-Resource-TableQA-Indic-languages).

ImageInWords: Unlocking Hyper-Detailed Image Descriptions
Roopal Garg, Andrea Burns, Burcu Karagol Ayan, Yonatan Bitton, Ceslee Montgomery, Yasumasa Onoe, Andrew Bunner, Ranjay Krishna, Jason Michael Baldridge, Radu Soricut
Despite the longstanding adage "an image is worth a thousand words," generating accurate hyper-detailed image descriptions remains unsolved. Trained on short web-scraped image-text pairs, vision-language models often generate incomplete descriptions with visual inconsistencies. We address this via a novel data-centric approach with ImageInWords (IIW), a carefully designed human-in-the-loop framework for curating hyper-detailed image descriptions. Human evaluations on IIW data show major gains compared to recent datasets (+66%) and GPT-4V (+48%) across comprehensiveness, specificity, hallucinations, and more. We also show that fine-tuning with IIW data improves these metrics by +31% against models trained with prior work, even with only 9k samples. Lastly, we evaluate IIW models with text-to-image generation and vision-language reasoning tasks. Our generated descriptions result in the highest fidelity images, and boost compositional reasoning by up to 6% on the ARO, SVO-Probes, and Winoground datasets. We release the IIW-Eval benchmark with human judgement labels, object- and image-level annotations from our framework, and existing image caption datasets enriched via the IIW model.

LLM-Based Agent Society Investigation: Collaboration and Confrontation in Avalon Gameplay
Yihuai Lan, Zhiqiang Hu, Lei Wang, Yang Wang, Deheng Ye, Peilin Zhao, Ee-Peng Lim, Hui Xiong, Hao Wang
This paper explores the open research problem of understanding the social behaviors of LLM-based agents. Using Avalon as a testbed, we employ system prompts to guide LLM agents in gameplay. While previous studies have touched on gameplay with LLM agents, research on their social behaviors is lacking. We propose a novel framework, tailored for Avalon, that features a multi-agent system facilitating efficient communication and interaction. We evaluate its performance based on game success and analyze LLM agents' social behaviors. Results affirm the framework's effectiveness in creating adaptive agents and suggest LLM-based agents' potential in navigating dynamic social interactions. By examining collaboration and confrontation behaviors, we offer insights into this field's research and applications.

When LLMs Meets Acoustic Landmarks: An Efficient Approach to Integrate Speech into Large Language Models for Depression Detection
Xiangyu Zhang, Hexin Liu, Kaishuai Xu, Qiquan Zhang, Daijiao Liu, Beena Ahmed, Julien Epps
Depression is a critical concern in global mental health, prompting extensive research into AI-based detection methods. Among various AI technologies, Large Language Models (LLMs) stand out for their versatility in healthcare applications. However, the application of LLMs in the identification and analysis of depressive states remains relatively unexplored, presenting an intriguing avenue for future research. In this paper, we present an innovative approach to employing an LLM for depression detection, integrating acoustic speech information into the LLM framework for this specific application. We investigate an efficient method for automatic depression detection by integrating speech signals into LLMs utilizing Acoustic Landmarks. This approach is not only valuable for the detection of depression but also represents a new perspective in enhancing the ability of LLMs to comprehend and process speech signals. By incorporating acoustic landmarks, which are specific to the pronunciation of spoken words, our method adds critical dimensions to text transcripts. This integration also provides insights into the unique speech patterns of individuals, revealing their potential mental states. By encoding acoustic landmark information into LLMs, evaluations of the proposed approach on the DAIC-WOZ dataset reveal state-of-the-art results when compared with existing audio-text baselines.

Speaking in Wavelet Domain: A Simple and Efficient Approach to Speed up Speech Diffusion Model
Xiangyu Zhang, Daijiao Liu, Hexin Liu, Qiquan Zhang, Hanyu Meng, Leibny Paola Garcia, Eng Siong Chng, Lina Yao
Recently, Denoising Diffusion Probabilistic Models (DDPMs) have attained leading performances across a diverse range of generative tasks. However, in the field of speech synthesis, although DDPMs exhibit impressive performance, their prolonged training duration and substantial inference costs hinder practical deployment. Existing approaches primarily focus on enhancing inference speed, while approaches to accelerate training (a key factor in the costs associated with adding or customizing voices) often necessitate complex modifications to the model, compromising their universal applicability. To address these challenges, we pose a question: is it possible to enhance the training/inference speed and performance of DDPMs by modifying the speech signal itself? In this paper, we double the training and inference speed of speech DDPMs by simply redirecting the generative target to the wavelet domain. This method not only achieves comparable or superior performance to the original model in speech synthesis tasks but also demonstrates its versatility. By investigating and utilizing different wavelet bases, our approach proves effective not just in speech synthesis, but also in speech enhancement.

Hateful Word in Context Classification
Sanne Hoeken, Sina Zarrieß, Özge Alacam
Hate speech detection is a prevalent research field, yet it remains underexplored at the level of word meaning. This is significant, as terms used to convey hate often involve non-standard or novel usages which might be overlooked by commonly leveraged LMs trained on general language use. In this paper, we introduce the Hateful Word in Context Classification (HateWiC) task and present a dataset of ~4000 WiC instances, each labeled by three annotators. Our analyses and computational exploration focus on the interplay between the subjective nature (context-dependent connotations) and the descriptive nature (as described in dictionary definitions) of hateful word senses. HateWiC annotations confirm that the hatefulness of a word in context does not always derive from the sense definition alone. We explore the prediction of both majority and individual annotator labels, and we experiment with modeling context- and sense-based inputs. Our findings indicate that including definitions proves effective overall, yet not in cases where hateful connotations vary. Conversely, including annotator demographics becomes more important for mitigating the performance drop in subjective hate prediction.

Eyes Don't Lie: Subjective Hate Annotation and Detection with Gaze
Özge Alacam, Sanne Hoeken, Sina Zarrieß
Hate speech is a complex and subjective phenomenon. In this paper, we present a dataset (GAZE4HATE) that provides gaze data collected in a hate speech annotation experiment. We study whether the gaze of an annotator provides predictors of their subjective hatefulness rating, and how gaze features can improve Hate Speech Detection (HSD). We conduct experiments on statistical modeling of subjective hate ratings and gaze, and analyze to what extent rationales derived from hate speech models correspond to human gaze and explanations in our data. Finally, we introduce MEANION, a first gaze-integrated HSD model. Our experiments show that particular gaze features like dwell time or fixation counts systematically correlate with annotators' subjective hate ratings and improve predictions of text-only hate speech models.

NumeroLogic: Number Encoding for Enhanced LLMs' Numerical Reasoning
Eli Schwartz, Leshem Choshen, Joseph Shtok, Sivan Doveh, Leonid Karlinsky, Assaf Arbelle
Language models struggle with handling numerical data and performing arithmetic operations. We hypothesize that this limitation can be partially attributed to non-intuitive textual number representation. When a digit is read or generated by a causal language model, it does not know its place value (e.g. thousands vs. hundreds) until the entire number is processed. To address this issue, we propose a simple adjustment to how numbers are represented: including the count of digits before each number. For instance, instead of "42", we suggest using "2:42" as the new format. This approach, which we term NumeroLogic, offers an added advantage in number generation by serving as a Chain of Thought (CoT). By requiring the model to consider the num…
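The digit-count prefix that the NumeroLogic abstract describes ("42" becomes "2:42") can be sketched in a few lines of Python. This is an illustrative reimplementation from the abstract alone, not the authors' code: the function name is hypothetical, and the paper's exact rules (decimals, negative numbers, interaction with the tokenizer) may differ.

```python
import re

def numerologic_encode(text: str) -> str:
    """Prefix every run of digits in `text` with its digit count,
    as in the abstract's example: "42" -> "2:42".
    Simplified sketch; integer literals only."""
    return re.sub(r"\d+", lambda m: f"{len(m.group())}:{m.group()}", text)

print(numerologic_encode("12 plus 345 equals 357"))
# -> 2:12 plus 3:345 equals 3:357
```

Because the digit count is emitted before the number itself, a causal model generating in this format must commit to the number's length (and hence each digit's place value) first, which is the chain-of-thought effect the abstract alludes to.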
Executive Summary
The Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024) collect cutting-edge research in natural language processing (NLP). Of the twelve papers listed in this excerpt, two are examined in detail: one proposing universal domain generalization for sentiment classification via zero-shot dataset generation, and another introducing cost-efficient dataset cleansing via large language model (LLM)-based data annotation. Both studies aim to enhance the efficiency and real-world applicability of NLP models.
Key Points
- ▸ Universal domain generalization for sentiment classification using zero-shot dataset generation.
- ▸ Cost-efficient dataset cleansing via LLM-based data annotation.
- ▸ Improving dataset quality and model performance through innovative NLP techniques.
Merits
Innovative Approach
The proposed methods leverage advanced NLP techniques to address critical issues in dataset generation and cleansing, demonstrating significant improvements in model efficiency and applicability.
Practical Applicability
The research focuses on real-world applicability, making the findings highly relevant for practitioners and researchers in the field of NLP.
Demerits
Limited Scope
The studies are focused on specific tasks (sentiment classification and multi-document summarization), which may limit the generalizability of the findings to other NLP tasks.
Data Dependency
The effectiveness of the proposed methods is highly dependent on the quality and diversity of the datasets used, which may not be universally applicable.
Expert Commentary
The Proceedings of the 2024 EMNLP conference present significant advancements in the field of NLP, particularly in the areas of dataset generation and cleansing. The study on universal domain generalization for sentiment classification introduces a novel approach that leverages zero-shot dataset generation to enhance the applicability of NLP models across various domains. This method addresses the limitations of pre-trained language models (PLMs) by using a smaller parameter set, making it more efficient and scalable.
The research on cost-efficient dataset cleansing via LLM-based data annotation demonstrates the potential of using large language models for improving dataset quality without the need for expensive human annotators. This approach not only reduces costs but also enhances the reliability of downstream task models.
While these studies show promising results, they are not without limitations. The scope of the research is focused on specific tasks, which may limit the generalizability of the findings. Additionally, the effectiveness of the proposed methods is highly dependent on the quality and diversity of the datasets used. Despite these limitations, the research presents valuable insights and practical applications that can significantly impact the field of NLP. The findings underscore the importance of continuous innovation and investment in NLP research to address the evolving challenges in the deployment of these models in real-world scenarios.
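The majority-voting step that the Multi-News+ abstract describes can be illustrated with a minimal sketch. The LLM judgments are stubbed out as plain strings here; the function name, the label set, and the tie-breaking behavior are assumptions for illustration, not the authors' implementation.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most common label among several independent
    LLM annotations of the same document. On ties, Counter
    returns the first label encountered -- an arbitrary choice
    a real pipeline would need to handle explicitly."""
    return Counter(labels).most_common(1)[0][0]

# Three hypothetical LLM judgments on whether a source document
# is related to its summary cluster.
votes = ["related", "unrelated", "related"]
print(majority_vote(votes))  # -> related
```

Aggregating several independent model judgments this way imitates the redundancy of multiple human annotators: a single noisy LLM answer is less likely to flip the final label.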
Recommendations
- ✓ Further research should explore the applicability of the proposed methods to a broader range of NLP tasks to enhance their generalizability.
- ✓ Investment in diverse and high-quality datasets is crucial for the effectiveness of the proposed methods and should be a priority for future research.