Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Houda Bouamor, Juan Pino, Kalika Bali (Editors)
Anthology ID: 2023.emnlp-main
Month: December
Year: 2023
Address: Singapore
Venue: EMNLP
Publisher: Association for Computational Linguistics
URL: https://aclanthology.org/2023.emnlp-main/

IAG: Induction-Augmented Generation Framework for Answering Reasoning Questions
Zhebin Zhang | Xinyu Zhang | Yuanhang Ren | Saijiang Shi | Meng Han | Yongkang Wu | Ruofei Lai | Zhao Cao
Retrieval-Augmented Generation (RAG), by incorporating external knowledge with the parametric memory of language models, has become the state-of-the-art architecture for open-domain QA tasks. However, common knowledge bases are inherently constrained by limited coverage and noisy information, making retrieval-based approaches inadequate for answering implicit reasoning questions. In this paper, we propose an Induction-Augmented Generation (IAG) framework that utilizes inductive knowledge along with the retrieved documents for implicit reasoning. We leverage large language models (LLMs) to derive such knowledge via a novel prompting method based on inductive reasoning patterns. On top of this, we implement two versions of IAG, named IAG-GPT and IAG-Student. IAG-GPT directly utilizes the knowledge generated by GPT-3 for answer prediction, while IAG-Student removes the dependency on the GPT service at inference time by incorporating a student inductor model. The inductor is first trained via knowledge distillation and further optimized by back-propagating the generator feedback via differentiable beam scores.
Experimental results show that IAG outperforms RAG baselines as well as ChatGPT on two open-domain QA tasks. Notably, our best models won first place on the official leaderboards of CSQA2.0 (since Nov 1, 2022) and StrategyQA (since Jan 8, 2023).

Absolute Position Embedding Learns Sinusoid-like Waves for Attention Based on Relative Position
Yuji Yamamoto | Takuya Matsuzaki
Attention weight is a clue to interpreting how a Transformer-based model makes an inference. In some attention heads, the attention focuses on the neighbors of each token. This allows the output vector of each token to depend on the surrounding tokens and contributes to making the inference context-dependent. We analyze the mechanism behind this concentration of attention on nearby tokens. We show that the phenomenon emerges as follows: (1) the learned position embedding has sinusoid-like components, (2) such components are transmitted to the query and the key in the self-attention, and (3) the attention head shifts the phases of the sinusoid-like components so that the attention concentrates on nearby tokens at specific relative positions. In other words, a certain type of Transformer-based model acquires the sinusoidal positional encoding to some extent on its own through Masked Language Modeling.

Chinese Lexical Substitution: Dataset and Method
Jipeng Qiang | Kang Liu | Ying Li | Yun Li | Yi Zhu | Yun-Hao Yuan | Xiaocheng Hu | Xiaoye Ouyang
Existing lexical substitution (LS) benchmarks were collected by asking human annotators to think of substitutes from memory, resulting in benchmarks with limited coverage and relatively small scales. To overcome this problem, we propose a novel annotation method to construct an LS dataset based on human and machine collaboration. Based on our annotation method, we construct the first Chinese LS dataset, CHNLS, which consists of 33,695 instances and 144,708 substitutes, covering three text genres (News, Novel, and Wikipedia).
Specifically, we first combine four unsupervised LS methods as an ensemble method to generate the candidate substitutes, and then let human annotators judge these candidates or add new ones. This collaborative process combines the diversity of machine-generated substitutes with the expertise of human annotators. Experimental results show that the ensemble method outperforms the other LS methods. To the best of our knowledge, this is the first study of the Chinese LS task.

Decoding the Silent Majority: Inducing Belief Augmented Social Graph with Large Language Model for Response Forecasting
Chenkai Sun | Jinning Li | Yi Fung | Hou Chan | Tarek Abdelzaher | ChengXiang Zhai | Heng Ji
Automatic response forecasting for news media plays a crucial role in enabling content producers to efficiently predict the impact of news releases and prevent unexpected negative outcomes such as social conflict and moral injury. To effectively forecast responses, it is essential to develop measures that leverage the social dynamics and contextual information surrounding individuals, especially in cases where explicit profiles or historical actions of the users are limited (such users are referred to as lurkers). As shown in a previous study, 97% of all tweets are produced by only the most active 25% of users. However, existing approaches have limited exploration of how to best process and utilize these important features. To address this gap, we propose a novel framework, named SocialSense, that leverages a large language model to induce a belief-centered graph on top of an existing social network, along with graph-based propagation to capture social dynamics. We hypothesize that the induced graph, which bridges the gap between distant users who share similar beliefs, allows the model to effectively capture response patterns. Our method surpasses the existing state of the art in experimental evaluations for both zero-shot and supervised settings, demonstrating its effectiveness in response forecasting.
Moreover, the analysis reveals the framework's capability to effectively handle unseen user and lurker scenarios, further highlighting its robustness and practical applicability.

Fine-grained Conversational Decoding via Isotropic and Proximal Search
Yuxuan Yao | Han Wu | Qiling Xu | Linqi Song
General-purpose text decoding approaches are usually adopted for dialogue response generation. Although the quality of the generated responses can be improved with dialogue-specific encoding methods, conversational decoding methods are still under-explored. Inspired by SimDRC's finding that a good dialogue feature space should follow the rules of locality and isotropy, we present a fine-grained conversational decoding method, termed isotropic and proximal search (IPS). Our method is designed to generate semantically concentrated responses while still maintaining informativeness and discrimination against the context. Experiments show that our approach significantly outperforms existing decoding strategies in the dialogue field across both automatic and human evaluation metrics. More in-depth analyses further confirm the effectiveness of our approach.

Holistic Inter-Annotator Agreement and Corpus Coherence Estimation in a Large-scale Multilingual Annotation Campaign
Nicolas Stefanovitch | Jakub Piskorski
In this paper we report on the complexity of persuasion technique annotation in the context of a large multilingual annotation campaign involving 6 languages and approximately 40 annotators. We highlight the techniques that appear to be difficult for humans to annotate and elaborate on our findings on the causes of this phenomenon. We introduce Holistic IAA, a new word embedding-based annotator agreement metric, and we report on various experiments using this metric and its correlation with traditional Inter-Annotator Agreement (IAA) metrics.
However, given the somewhat limited and loose interaction between annotators (only a few annotators annotate the same document subsets), we devise a way to assess the coherence of the entire dataset and strive to find a good proxy for IAA between annotators tasked with annotating different documents in different languages, for which classical IAA metrics cannot be applied.

PHD: Pixel-Based Language Modeling of Historical Documents
Nadav Borenstein | Phillip Rust | Desmond Elliott | Isabelle Augenstein
The digitisation of historical documents has provided historians with unprecedented research opportunities. Yet, the conventional approach to analysing historical documents involves converting them from images to text using OCR, a process that overlooks the potential benefits of treating them as images and introduces high levels of noise. To bridge this gap, we take advantage of recent advancements in pixel-based language models trained to reconstruct masked patches of pixels instead of predicting token distributions. Due to the scarcity of real historical scans, we propose a novel method for generating synthetic scans that resemble real historical documents. We then pre-train our model, PHD, on a combination of synthetic scans and real historical newspapers from the 1700-1900 period. Through our experiments, we demonstrate that PHD exhibits high proficiency in reconstructing masked image patches and provide evidence of our model's noteworthy language understanding capabilities. Notably, we successfully apply our model to a historical QA task, highlighting its usefulness in this domain.

Primacy Effect of ChatGPT
Yiwei Wang | Yujun Cai | Muhao Chen | Yuxuan Liang | Bryan Hooi
Instruction-tuned large language models (LLMs), such as ChatGPT, have led to promising zero-shot performance in discriminative natural language understanding (NLU) tasks.
This involves querying the LLM with a prompt containing the question and the candidate labels to choose from. The question-answering capabilities of ChatGPT arise from its pre-training on large amounts of human-written text, as well as its subsequent fine-tuning on human preferences, which motivates us to ask: does ChatGPT also inherit humans' cognitive biases? In this paper, we study the primacy effect of ChatGPT: the tendency to select the labels at earlier positions as the answer. We have two main findings: i) ChatGPT's decision is sensitive to the order of labels in the prompt; ii) ChatGPT has a clearly higher chance of selecting the labels at earlier positions as the answer. We hope that our experiments and analyses provide additional insights into building more reliable ChatGPT-based solutions. We release the source code at https://github.com/wangywUST/PrimacyEffectGPT.

Evaluating the Rationale Understanding of Critical Reasoning in Logical Reading Comprehension
Akira Kawabata | Saku Sugawara
To precisely evaluate a language model's capability for logical reading comprehension, we present a dataset for testing the understanding of the rationale behind critical reasoning. For questions taken from an existing multiple-choice logical reading comprehension dataset, we crowdsource rationale texts that explain why we should select or eliminate answer options, resulting in 3,003 multiple-choice subquestions associated with 943 main questions. Experiments on our dataset show that recent large language models (e.g., InstructGPT) struggle to answer the subquestions even when they are able to answer the main questions correctly. We find that the models perform particularly poorly in answering subquestions written for the incorrect options of the main questions, implying that the models have a limited capability for explaining why incorrect alternatives should be eliminated.
These results suggest that our dataset encourages further investigation into the critical reasoning ability of language models, with a focus on the process of eliminating relevant alternatives.

Evaluating and Modeling Attribution for Cross-Lingual Question Answering
Benjamin Muller | John Wieting | Jonathan H. Clark | Tom Kwiatkowski | Sebastian Ruder | Livio Baldini Soares | Roee Aharoni | Jonathan Herzig | Xinyi Wang
Trustworthy answer content is abundant in many high-resource languages and is instantly accessible through question answering systems, yet this content can be hard to access for those who do not speak these languages. The leap forward in cross-lingual modeling quality offered by generative language models offers much promise, yet their raw generations often fall short in factuality. To improve trustworthiness in these systems, a promising direction is to attribute the answer to a retrieved source, possibly in a content-rich language different from the query. Our work is the first to study attribution for cross-lingual question answering. First, we collect data in 5 languages to assess the attribution level of a state-of-the-art cross-lingual QA system. To our surprise, we find that a substantial portion of the answers is not attributable to any retrieved passage (up to 50% of answers exactly matching a gold reference), despite the system being able to attend directly to the retrieved text. Second, to address this poor attribution level, we experiment with a wide range of attribution detection techniques. We find that Natural Language Inference models and PaLM 2 fine-tuned on a very small amount of attribution data can accurately detect attribution. With these models, we improve the attribution level of a cross-lingual QA system. Overall, we show that current academic generative cross-lingual QA systems have substantial shortcomings in attribution, and we build tooling to mitigate these issues.
Better Quality Pre-training Data and T5 Models for African Languages
Akintunde Oladipo | Mofetoluwa Adeyemi | Orevaoghene Ahia | Abraham Toluwalase Owodunni | Odunayo Ogundepo | David Ifeoluwa Adelani | Jimmy Lin
In this study, we highlight the importance of enhancing the quality of pretraining data in multilingual language models. Existing web crawls have demonstrated quality issues, particularly in the context of low-resource languages. Consequently, we introduce a new multilingual pretraining corpus for 16 African languages, designed by carefully auditing existing pretraining corpora to understand and rectify prevalent quality issues. To compile this dataset, we undertake a rigorous examination of current data sources for thirteen languages within one of the most extensive multilingual web crawls, mC4, and extract cleaner data through meticulous auditing and improved web crawling strategies. Subsequently, we pretrain a new T5-based model on this dataset and evaluate its performance on multiple downstream tasks. Our model demonstrates better downstream effectiveness than existing pretrained models across four NLP tasks, underscoring the critical role data quality plays in pretraining language models in low-resource scenarios. Specifically, on cross-lingual QA evaluation, our new model is more than twice as effective as multilingual T5. All code, data and models are publicly available at https://github.com/castorini/Afr
Executive Summary
The Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP) present a collection of cutting-edge research in the field of natural language processing (NLP). Notably, the conference highlights two significant papers: one introducing the Induction-Augmented Generation (IAG) framework for answering reasoning questions and another analyzing the mechanisms behind the concentration of attention on nearby tokens in Transformer-based models. The IAG framework enhances Retrieval-Augmented Generation (RAG) by incorporating inductive knowledge derived from large language models, demonstrating superior performance on open-domain QA tasks. The second paper delves into the sinusoid-like waves in absolute position embeddings that influence attention mechanisms, providing insights into the interpretability of Transformer models.
Key Points
- ▸ Introduction of the Induction-Augmented Generation (IAG) framework for improving open-domain QA tasks.
- ▸ Analysis of the mechanism behind the concentration of attention on nearby tokens in Transformer-based models.
- ▸ The IAG framework outperforms RAG baselines and ChatGPT on two open-domain QA tasks (CSQA2.0 and StrategyQA).
- ▸ Discovery of sinusoid-like components in learned position embeddings affecting attention mechanisms.
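The last point can be made concrete with a small numerical sketch. The snippet below is an illustrative assumption, not the authors' code: it builds the classic fixed sinusoidal encoding and shows that phase-shifting the query sinusoids makes dot-product attention scores peak at a fixed relative offset, mirroring step (3) of the paper's proposed mechanism.

```python
import numpy as np

def sinusoidal_embeddings(n_pos: int, d: int) -> np.ndarray:
    """Classic fixed sinusoidal position encoding (interleaved sin/cos pairs)."""
    pos = np.arange(n_pos)[:, None]             # (n_pos, 1)
    i = np.arange(d // 2)[None, :]              # (1, d/2)
    freq = 1.0 / (10000.0 ** (2 * i / d))       # one frequency per sin/cos pair
    emb = np.zeros((n_pos, d))
    emb[:, 0::2] = np.sin(pos * freq)
    emb[:, 1::2] = np.cos(pos * freq)
    return emb

n, d, delta = 32, 64, 2
emb = sinusoidal_embeddings(n + delta, d)

# A head that phase-shifts each query sinusoid by delta steps is equivalent
# to using the embedding of position p + delta as the query at position p.
queries = emb[delta:n + delta]
keys = emb[:n]
scores = queries @ keys.T   # scores[p, k] = sum_i cos((p + delta - k) * w_i)

# Each position attends most strongly to the token delta positions away.
peaks = scores.argmax(axis=1)
print(peaks[:5])            # [2 3 4 5 6]
```

Because the dot product of two sinusoidal embeddings depends only on their relative distance, the scores are maximal where the phase difference is zero, i.e. at a fixed relative position.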
Merits
Innovative Framework
The IAG framework represents a significant advancement in NLP by integrating inductive reasoning with retrieved documents, enhancing the ability to answer implicit reasoning questions.
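To make the two-stage idea tangible, the sketch below shows one plausible way induced knowledge and retrieved evidence could be fused into a single generator input. The prompt wording, function names, and example content are all hypothetical illustrations; the paper does not reproduce its exact prompts here.

```python
# Hypothetical sketch of the IAG idea: templates and names are illustrative
# assumptions, not the authors' actual prompting method.

def build_induction_prompt(question: str) -> str:
    """Ask an LLM to induce general knowledge relevant to an implicit question."""
    return (
        "Derive a general statement that helps answer the question, "
        "reasoning from specific instances to a general pattern.\n"
        f"Question: {question}\nInduced knowledge:"
    )

def build_generator_input(question: str, induced: str, passages: list[str]) -> str:
    """Fuse induced knowledge with retrieved evidence for the answer generator."""
    evidence = "\n".join(f"- {p}" for p in passages)
    return (
        f"Question: {question}\n"
        f"Induced knowledge: {induced}\n"
        f"Retrieved evidence:\n{evidence}\n"
        "Answer:"
    )

prompt = build_generator_input(
    "Can a sunflower grow at the South Pole?",
    "Plants generally require sustained warmth and sunlight to grow.",
    ["The South Pole has months of darkness and extreme cold."],
)
print(prompt.splitlines()[0])  # Question: Can a sunflower grow at the South Pole?
```

In IAG-GPT the induced statement would come from GPT-3; in IAG-Student it would come from the distilled student inductor, removing the GPT dependency at inference time.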
Empirical Validation
The experimental results demonstrate the superiority of the IAG framework over existing RAG baselines and ChatGPT, as evidenced by its first-place standing on the official CSQA2.0 and StrategyQA leaderboards.
Insightful Analysis
The analysis of sinusoid-like waves in position embeddings provides valuable insights into the interpretability and functionality of Transformer-based models.
Demerits
Dependency on Large Language Models
The IAG framework's reliance on large language models like GPT-3 for generating inductive knowledge may limit its accessibility and scalability.
Complexity of Implementation
The implementation of the IAG framework, particularly the training of the student inductor model, involves complex processes that may require significant computational resources.
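The distillation stage referred to above is conceptually simple even if the full pipeline is not. The snippet below sketches the standard temperature-scaled knowledge-distillation objective as a generic illustration; it is not the authors' exact training objective, which additionally back-propagates generator feedback through differentiable beam scores.

```python
import numpy as np

def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """Temperature-scaled KL(teacher || student), the standard KD objective."""
    p = softmax(np.asarray(teacher_logits, dtype=float), temperature)  # soft targets
    q = softmax(np.asarray(student_logits, dtype=float), temperature)
    kl = np.sum(p * (np.log(p) - np.log(q)), axis=-1)
    return (temperature ** 2) * kl.mean()   # T^2 rescales gradient magnitudes

teacher = np.array([[4.0, 1.0, 0.5]])
print(distillation_loss(teacher, teacher))                           # 0.0
print(distillation_loss(np.array([[1.0, 1.0, 1.0]]), teacher) > 0)   # True
```

A student inductor trained this way only matches the teacher's output distribution; the reported second phase, which optimizes the inductor against generator feedback, is what adds the engineering complexity noted here.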
Limited Generalizability
The findings on sinusoid-like waves in position embeddings may not be universally applicable to all Transformer-based models, potentially limiting the generalizability of the insights.
Expert Commentary
The Proceedings of the 2023 EMNLP conference present a compelling collection of research that pushes the boundaries of natural language processing. The IAG framework, in particular, addresses a critical gap in the current RAG architecture by incorporating inductive reasoning, thereby enhancing the system's ability to handle complex, implicit reasoning questions. This innovation is not only theoretically significant but also practically impactful, as demonstrated by its top performance on official leaderboards. The analysis of sinusoid-like waves in position embeddings provides a deeper understanding of how attention mechanisms operate in Transformer models, contributing to the broader goal of making these models more interpretable. However, the reliance on large language models like GPT-3 raises questions about the scalability and accessibility of the IAG framework. Future research should explore ways to reduce this dependency while maintaining the framework's effectiveness. Additionally, the complexity of implementing the IAG framework underscores the need for continued advancements in computational efficiency and resource management. Overall, the proceedings offer valuable insights and innovations that will likely influence both practical applications and policy discussions in the field of NLP.
Recommendations
- ✓ Further research should focus on reducing the dependency of the IAG framework on large language models to enhance its accessibility and scalability.
- ✓ Investigation into the generalizability of the findings on sinusoid-like waves in position embeddings across different Transformer-based models is recommended to validate their broader applicability.