Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Editors: Bonnie Webber, Trevor Cohn, Yulan He, Yang Liu
Anthology ID: 2020.emnlp-main
Month: November
Year: 2020
Address: Online
Venue: EMNLP
Publisher: Association for Computational Linguistics
URL: https://aclanthology.org/2020.emnlp-main/
DOI: 10.18653/v1/2020.emnlp-main
PDF: https://aclanthology.org/2020.emnlp-main.pdf

Detecting Attackable Sentences in Arguments
Yohan Jo, Seojin Bang, Emaad Manzoor, Eduard Hovy, Chris Reed
Finding attackable sentences in an argument is the first step toward successful refutation in argumentation. We present a first large-scale analysis of sentence attackability in online arguments. We analyze driving reasons for attacks in argumentation and identify relevant characteristics of sentences. We demonstrate that a sentence’s attackability is associated with many of these characteristics regarding the sentence’s content, proposition types, and tone, and that an external knowledge source can provide useful information about attackability. Building on these findings, we demonstrate that machine learning models can automatically detect attackable sentences in arguments, significantly better than several baselines and comparably well to laypeople.

Extracting Implicitly Asserted Propositions in Argumentation
Yohan Jo, Jacky Visser, Chris Reed, Eduard Hovy
Argumentation accommodates various rhetorical devices, such as questions, reported speech, and imperatives. These rhetorical tools usually assert argumentatively relevant propositions rather implicitly, so understanding their true meaning is key to understanding certain arguments properly. However, most argument mining systems and computational linguistics research have paid little attention to implicitly asserted propositions in argumentation. In this paper, we examine a wide range of computational methods for extracting propositions that are implicitly asserted in questions, reported speech, and imperatives in argumentation. By evaluating the models on a corpus of 2016 U.S. presidential debates and online commentary, we demonstrate the effectiveness and limitations of the computational models. Our study may inform future research on argument mining and the semantics of these rhetorical devices in argumentation.

Quantitative argument summarization and beyond: Cross-domain key point analysis
Roy Bar-Haim, Yoav Kantor, Lilach Eden, Roni Friedman, Dan Lahav, Noam Slonim
When summarizing a collection of views, arguments or opinions on some topic, it is often desirable not only to extract the most salient points, but also to quantify their prevalence. Work on multi-document summarization has traditionally focused on creating textual summaries, which lack this quantitative aspect. Recent work has proposed to summarize arguments by mapping them to a small set of expert-generated key points, where the salience of each key point corresponds to the number of its matching arguments. The current work advances key point analysis in two important respects: first, we develop a method for automatic extraction of key points, which enables fully automatic analysis, and is shown to achieve performance comparable to a human expert. Second, we demonstrate that the applicability of key point analysis goes well beyond argumentation data. Using models trained on publicly available argumentation datasets, we achieve promising results in two additional domains: municipal surveys and user reviews. An additional contribution is an in-depth evaluation of argument-to-key point matching models, where we substantially outperform previous results.
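Since the core of key point analysis is matching free-form arguments to a handful of key points, a compact sketch helps make the pipeline concrete. The following is a minimal illustration using off-the-shelf sentence embeddings; the encoder checkpoint, similarity threshold, and helper function are illustrative assumptions, not the authors' trained matching model.

```python
# Hedged sketch of argument-to-key-point matching: NOT the authors' model.
# The encoder checkpoint and the 0.6 similarity threshold are assumptions.
from collections import Counter

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed off-the-shelf encoder

def key_point_salience(arguments, key_points, threshold=0.6):
    """Assign each argument to its most similar key point and count matches."""
    arg_emb = encoder.encode(arguments, convert_to_tensor=True)
    kp_emb = encoder.encode(key_points, convert_to_tensor=True)
    sims = util.cos_sim(arg_emb, kp_emb)     # shape: (num_args, num_key_points)
    salience = Counter()
    for i in range(len(arguments)):
        j = int(sims[i].argmax())
        if float(sims[i, j]) >= threshold:   # leave weak matches unassigned
            salience[key_points[j]] += 1
    return salience                          # key point -> matching-argument count
```

The counts returned here correspond to the paper's notion of a key point's prevalence; the actual systems replace this cosine-threshold rule with a trained argument-to-key-point matching model.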
Unsupervised stance detection for arguments from consequences
Jonathan Kobbe, Ioana Hulpuș, Heiner Stuckenschmidt
Social media platforms have become an essential venue for online deliberation where users discuss arguments, debate, and form opinions. In this paper, we propose an unsupervised method to detect the stance of argumentative claims with respect to a topic. Most related work focuses on topic-specific supervised models that need to be trained for every emergent debate topic. To address this limitation, we propose a topic-independent approach that focuses on a frequently encountered class of arguments, specifically, on arguments from consequences. We do this by extracting the effects that claims refer to, and proposing a means for inferring if the effect is a good or bad consequence. Our experiments provide promising results that are comparable to, and in some respects even outperform, BERT. Furthermore, we publish a novel dataset of arguments relating to consequences, annotated with Amazon Mechanical Turk.

BLEU might be Guilty but References are not Innocent
Markus Freitag, David Grangier, Isaac Caswell
The quality of automatic metrics for machine translation has been increasingly called into question, especially for high-quality systems. This paper demonstrates that, while choice of metric is important, the nature of the references is also critical. We study different methods to collect references and compare their value in automated evaluation by reporting correlation with human evaluation for a variety of systems and metrics. Motivated by the finding that typical references exhibit poor diversity, concentrating around translationese language, we develop a paraphrasing task for linguists to perform on existing reference translations, which counteracts this bias. Our method yields higher correlation with human judgment not only for the submissions of WMT 2019 English to German, but also for Back-translation and APE augmented MT output, which have been shown to have low correlation with automatic metrics using standard references. We demonstrate that our methodology improves correlation with all modern evaluation metrics we look at, including embedding-based methods. To complete this picture, we reveal that multi-reference BLEU does not improve the correlation for high quality output, and present an alternative multi-reference formulation that is more effective.

Statistical Power and Translationese in Machine Translation Evaluation
Yvette Graham, Barry Haddow, Philipp Koehn
The term translationese has been used to describe features of translated text, and in this paper, we provide detailed analysis of potential adverse effects of translationese on machine translation evaluation. Our analysis shows differences in conclusions drawn from evaluations that include translationese in test data compared to experiments that tested only with text originally composed in that language. For this reason we recommend that reverse-created test data be omitted from future machine translation test sets. In addition, we provide a re-evaluation of a past machine translation evaluation claiming human parity of MT. One important issue not previously considered is the statistical power of significance tests applied to the comparison of human and machine translation. Since the very aim of past evaluations was investigation of ties between human and MT systems, power analysis is of particular importance, to avoid, for example, claims of human parity simply corresponding to Type II error resulting from the application of a low-powered test. We provide detailed analysis of tests used in such evaluations to provide an indication of a suitable minimum sample size for future studies.
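The point about low-powered tests can be made concrete with a standard power calculation. The sketch below uses a conventional two-sample t-test power analysis; the effect size, alpha, and power targets are textbook defaults chosen for illustration, not values taken from the paper.

```python
# Illustrative power analysis for comparing two systems (e.g., human vs. MT).
# Effect size, alpha, and power are conventional defaults, not the paper's values.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# Solve for the per-system sample size needed to detect a small effect
# (Cohen's d = 0.2) at alpha = 0.05 with 80% power.
n = analysis.solve_power(effect_size=0.2, alpha=0.05, power=0.8, ratio=1.0)
print(f"minimum segments per system: {n:.0f}")  # roughly 394
```

With far fewer evaluated segments than this, a genuine quality difference can easily go undetected, which is exactly the Type II error behind premature human-parity claims.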
Simulated multiple reference training improves low-resource machine translation
Huda Khayrallah, Brian Thompson, Matt Post, Philipp Koehn
Many valid translations exist for a given sentence, yet machine translation (MT) is trained with a single reference translation, exacerbating data sparsity in low-resource settings. We introduce Simulated Multiple Reference Training (SMRT), a novel MT training method that approximates the full space of possible translations by sampling a paraphrase of the reference sentence from a paraphraser and training the MT model to predict the paraphraser’s distribution over possible tokens. We demonstrate the effectiveness of SMRT in low-resource settings when translating to English, with improvements of 1.2 to 7.0 BLEU. We also find SMRT is complementary to back-translation.

Automatic Machine Translation Evaluation in Many Languages via Zero-Shot Paraphrasing
Brian Thompson, Matt Post
We frame the task of machine translation evaluation as one of scoring machine translation output with a sequence-to-sequence paraphraser, conditioned on a human reference. We propose training the paraphraser as a multilingual NMT system, treating paraphrasing as a zero-shot translation task (e.g., Czech to Czech). This results in the paraphraser’s output mode being centered around a copy of the input sequence, which represents the best-case scenario where the MT system output matches a human reference. Our method is simple and intuitive, and does not require human judgements for training. Our single model (trained in 39 languages) outperforms or statistically ties with all prior metrics on the WMT 2019 segment-level shared metrics task in all languages (excluding Gujarati, where the model had no training data). We also explore using our model for the task of quality estimation as a metric—conditioning on the source instead of the reference—and find that it significantly outperforms every submission to the WMT 2019 shared task on quality estimation in every language pair.
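The paraphraser-as-metric idea reduces to forced decoding: score the hypothesis by its likelihood under a sequence-to-sequence model conditioned on the reference. The sketch below shows only the scoring mechanics with a generic Hugging Face seq2seq checkpoint; the model name is a stand-in, not the paper's multilingual paraphraser, and an untuned checkpoint will not yield meaningful metric scores.

```python
# Sketch of scoring an MT hypothesis with a seq2seq model conditioned on the
# reference. "google/mt5-small" is a stand-in, not the paper's paraphraser.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "google/mt5-small"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name).eval()

def paraphrase_score(reference: str, hypothesis: str) -> float:
    """Length-normalized log-likelihood of the hypothesis given the reference."""
    enc = tok(reference, return_tensors="pt")
    labels = tok(hypothesis, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(**enc, labels=labels)  # loss = mean cross-entropy per token
    return -out.loss.item()               # higher = more probable paraphrase
```

A hypothesis that copies the reference verbatim sits at the paraphraser's output mode and scores highest, which is why the zero-shot copy behavior described in the abstract is desirable here.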
PRover: Proof Generation for Interpretable Reasoning over Rules
Swarnadeep Saha, Sayan Ghosh, Shashank Srivastava, Mohit Bansal
Recent work by Clark et al. (2020) shows that transformers can act as “soft theorem provers” by answering questions over explicitly provided knowledge in natural language. In our work, we take a step closer to emulating formal theorem provers, by proposing PRover, an interpretable transformer-based model that jointly answers binary questions over rule-bases and generates the corresponding proofs. Our model learns to predict nodes and edges corresponding to proof graphs in an efficient constrained training paradigm. During inference, a valid proof, satisfying a set of global constraints, is generated. We conduct experiments on synthetic, hand-authored, and human-paraphrased rule-bases to show promising results for QA and proof generation, with strong generalization performance. First, PRover generates proofs with an accuracy of 87%, while retaining or improving performance on the QA task, compared to RuleTakers (up to 6% improvement on zero-shot evaluation). Second, when trained on questions requiring lower depths of reasoning, it generalizes significantly better to higher depths (up to 15% improvement). Third, PRover obtains near-perfect QA accuracy of 98% using only 40% of the training data. However, generating proofs for questions requiring higher depths of reasoning becomes challenging, and the accuracy drops to 65% for “depth 5”, indicating significant scope for future work.

Learning to Explain: Datasets and Models for Identifying Valid Reasoning Chains in Multihop Question-Answering
Harsh Jhamtani, Peter Clark
Despite the rapid progress in multihop question-answering (QA), models still have trouble explaining why an answer is correct, with limited explanation training data available to learn from. To address this, we introduce three explanation datasets in which explanations formed from corpus facts are annotated. Our first dataset, eQASC, contains over 98K explanation annotations for the multihop question answering dataset QASC, and is the first that annotates multiple candidate explanations for each answer. The second dataset, eQASC-perturbed, is constructed by crowd-sourcing perturbations (while preserving their validity) of a subset of explanations in QASC, to test consistency and generalization of explanation prediction models. The third dataset, eOBQA, is constructed by adding explanation annotations to the OBQA dataset to test generalization of models trained on eQASC. We show that this data can be used to significantly improve explanation quality (+14% absolute F1 over a strong retrieval baseline) using a BERT-based classifier, but still behind the upper bound, offering a new challenge for future research. We also explore a delexicalized chain representation in which repeated noun phrases are replaced by variables, thus turning them into generalized reasoning chains (for example: “X is a Y” AND “Y has Z” IMPLIES “X has Z”). We find that generalized chains maintain performance while also being more robust to certain perturbations.

Self-Supervised Knowledge Triplet Learning for Zero-Shot Question Answering
Pratyay Banerjee, Chitta Baral
The aim of all Question Answering (QA) systems is to generalize to unseen questions. Current supervised methods are reliant on expensive data annotation. Moreover, such annotations can introduce unintended annotator bias, making systems focus more on the bias than the actual task. This work proposes Knowledge Triplet Learning (KTL), a self-supervised task over knowledge graphs. We propose heuristics to create synthetic graphs for commonsense and scientific knowledge. We propose using KTL to perform zero-shot question answering, and our experiments show considerable improvements over large pre-trained transformer language models.
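The self-supervision in Knowledge Triplet Learning can be pictured as deriving three fill-in-the-blank tasks from every knowledge-graph triple, one per masked slot. The sketch below shows only that instance-generation step; the textual format, mask token, and example triple are assumptions, not the paper's exact encoding.

```python
# Hedged sketch of triplet-based self-supervision: from one (subject, relation,
# object) triple, derive three prediction tasks. The format is an assumption.
from typing import List, Tuple

def triplet_instances(triple: Tuple[str, str, str],
                      mask: str = "<mask>") -> List[Tuple[str, str]]:
    """Return (masked_input, target) pairs, one for each slot of the triple."""
    s, r, o = triple
    return [
        (f"{mask} {r} {o}", s),  # predict the subject
        (f"{s} {mask} {o}", r),  # predict the relation
        (f"{s} {r} {mask}", o),  # predict the object
    ]

# Example with a commonsense-style triple (invented for illustration):
print(triplet_instances(("a match", "is used for", "starting a fire")))
```

Training a model on such derived instances requires no question-answer annotation at all, which is what makes the zero-shot QA setting possible.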
More Bang for Your Buck: Natural Perturbation for Robust Question Answering
Daniel Khashabi, Tushar Khot, Ashish Sabharwal
Deep learning models for linguistic tasks require large training datasets, which are expensive to create. As an alternative to the traditional approach of creating new instances by repeating the process of creating one instance, we propose doing so by first collecting a set of seed examples and then applying human-driven natural perturbations (as opposed to rule-based machine perturbations), which often change the gold label as well. Such perturbations have the advantage of being relatively easier (and hence cheaper) to create than writing out completely new examples. Further, they help address the issue that even models achieving human-level scores on NLP datasets are known to be considerably sensitive to small changes in input. To evaluate the idea, we consider a recent question-answering dataset (BOOLQ) and study our approach as a function of the perturbation cost ratio, the relative cost of perturbing an existing question vs. creating a new one from scratch. We find that when natural perturbations are moderately cheaper to create (cost ratio under 60%), it is more effective to use them for training BOOLQ models: such models exhibit 9% higher robustness.
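The perturbation cost ratio that drives the study's conclusion is simple arithmetic, sketched below with invented budget figures; only the roughly 60% break-even threshold comes from the abstract.

```python
# Worked example of the perturbation cost ratio. The costs are invented for
# illustration; only the ~60% break-even point is reported in the abstract.
cost_new = 1.00      # cost of writing a brand-new question (arbitrary units)
cost_perturb = 0.50  # assumed cost of perturbing an existing seed question
ratio = cost_perturb / cost_new
print(f"cost ratio = {ratio:.0%}")  # 50%
print("prefer natural perturbations" if ratio < 0.60 else "prefer new questions")
```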
Executive Summary
The Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), edited by Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, present a collection of cutting-edge research in computational linguistics. The papers excerpted above span argumentation (detecting attackable sentences, extracting implicitly asserted propositions, and quantitative argument summarization), machine translation evaluation (reference quality, translationese, and statistical power), and question answering (interpretable proof generation, explanation datasets, zero-shot QA, and natural perturbations). Together, these papers contribute to the understanding of argument structures and the development of advanced natural language processing techniques.
Key Points
- ▸ Detection of attackable sentences in arguments as a first step toward successful refutation.
- ▸ Computational extraction of implicitly asserted propositions in argumentation.
- ▸ Quantitative argument summarization and cross-domain key point analysis.
- ▸ Evidence that reference quality, translationese, and statistical power substantially affect machine translation evaluation.
- ▸ Interpretable, explainable, and data-efficient approaches to question answering.
Merits
Innovative Research
The conference proceedings showcase innovative research in natural language processing and argumentation, pushing the boundaries of computational linguistics.
Practical Applications
The studies present practical applications in argument mining and natural language understanding, which can be beneficial in various fields such as law, politics, and education.
Demerits
Limited Scope
Several of the studies focus on specific datasets or contexts (e.g., U.S. presidential debates or the BOOLQ dataset), so their findings may not generalize to all scenarios.
Technical Complexity
The advanced methodologies and technical jargon used in the papers may make them less accessible to non-experts in the field.
Expert Commentary
The Proceedings of the 2020 EMNLP conference represent a significant contribution to computational linguistics, particularly in the area of argumentation. The research presented in these proceedings demonstrates the potential of advanced natural language processing techniques to deepen our understanding of human language and argument structures. The studies on detecting attackable sentences and extracting implicitly asserted propositions are particularly noteworthy, as they address aspects of argumentation that have been relatively under-explored, while the work on translationese and statistical power offers concrete methodological guidance for machine translation evaluation. The practical applications of these findings range from improving legal and political analysis to enhancing educational tools. However, the technical complexity of the methodologies may limit the immediate accessibility of these findings to a broader audience; future research should aim to make these advanced techniques more accessible and applicable to a wider range of contexts.
Recommendations
- ✓ Encourage further research on the generalization of these methodologies to diverse datasets and contexts.
- ✓ Promote interdisciplinary collaboration to integrate these findings into practical applications in law, politics, and education.