The Chronicles of RiDiC: Generating Datasets with Controlled Popularity Distribution for Long-form Factuality Evaluation
arXiv:2604.00019v1 Announce Type: cross Abstract: We present a configurable pipeline for generating multilingual sets of entities with specified characteristics, such as domain, geographical location and popularity, using data from Wikipedia and Wikidata. These datasets are intended for evaluating the factuality of LLMs' long-form generation, thereby complementing evaluation based on short-form QA datasets. We present the RiDiC dataset as an example of this approach. RiDiC contains 3,000 entities from three domains -- rivers, natural disasters, and car models -- spanning different popularity tiers. Each entity is accompanied by its geographical location, English and Chinese names (if available) and relevant English and Chinese Wikipedia content, which is used to evaluate LLMs' responses. Generations about RiDiC entities were obtained from three LLMs in English and Chinese. These were then evaluated using a third-party factuality checker, which showed that entities from our dataset caused even frontier models to hallucinate. To facilitate the evaluation of LLMs' long-form factuality in multiple languages, the code, data, and generation/evaluation scripts have been released.
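To make the pipeline's data source concrete, the sketch below shows how one might query Wikidata for candidate entities of a given domain together with their location and a popularity signal. This is a minimal illustration, not the paper's actual queries: the property and class identifiers are standard Wikidata IDs (P31 = instance of, P625 = coordinate location, Q4022 = river), and the query shape is an assumption.

```python
def build_entity_query(class_qid: str, limit: int = 100) -> str:
    """Build a Wikidata SPARQL query returning items of a class together
    with their coordinates and cross-wiki sitelink count (a common
    popularity signal).  Illustrative only; the paper's queries may differ."""
    return f"""
    SELECT ?item ?itemLabel ?coord ?sitelinks WHERE {{
      ?item wdt:P31 wd:{class_qid} ;         # instance of the given class
            wdt:P625 ?coord ;                # coordinate location
            wikibase:sitelinks ?sitelinks .  # number of Wikipedia sitelinks
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en,zh". }}
    }}
    LIMIT {limit}
    """

# Q4022 is the Wikidata class for "river", one of RiDiC's three domains.
query = build_entity_query("Q4022")
```

Such a query could be sent to the Wikidata Query Service endpoint; requesting both English and Chinese labels mirrors the dataset's bilingual entity names.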
Executive Summary
This study presents a configurable dataset-generation pipeline and an example dataset, RiDiC, designed to evaluate the factuality of long-form generation in Large Language Models (LLMs) across multiple languages. Drawing on Wikipedia and Wikidata, the pipeline produces multilingual sets of entities with a controlled popularity distribution, enabling systematic evaluation of LLM factuality across popularity tiers. The authors demonstrate the approach by generating 3,000 entities across three domains and evaluating responses from three LLMs, in English and Chinese, with a third-party factuality checker. The dataset, code, and generation/evaluation scripts are publicly released, facilitating future research in this area. The study contributes to the development of more reliable and robust LLMs by providing a reproducible resource for factuality evaluation.
Key Points
- ▸ RiDiC is a configurable pipeline for generating multilingual datasets with controlled popularity distribution
- ▸ The pipeline leverages Wikipedia and Wikidata data to create entities with specified characteristics
- ▸ The authors demonstrate the effectiveness of RiDiC by evaluating LLMs' responses using a third-party factuality checker
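The "controlled popularity distribution" in the points above can be sketched as follows. This is a minimal illustration under an assumption the paper does not spell out here: that Wikipedia sitelink counts serve as the popularity proxy and that entities are ranked and split into equal-sized tiers. The QIDs and counts below are placeholders, not entries from RiDiC.

```python
def assign_popularity_tiers(entities, n_tiers=3):
    """Rank entities by sitelink count (descending) and split them into
    n_tiers groups of roughly equal size; tier 0 is the most popular."""
    ranked = sorted(entities, key=lambda e: e["sitelinks"], reverse=True)
    tier_size = -(-len(ranked) // n_tiers)  # ceiling division
    return {e["qid"]: rank // tier_size for rank, e in enumerate(ranked)}

# Hypothetical river entities with illustrative sitelink counts.
rivers = [
    {"qid": "Q100", "sitelinks": 120},  # widely covered entity
    {"qid": "Q200", "sitelinks": 95},
    {"qid": "Q300", "sitelinks": 4},    # long-tail entity
]
tiers = assign_popularity_tiers(rivers, n_tiers=3)
# {'Q100': 0, 'Q200': 1, 'Q300': 2}
```

Sampling a fixed number of entities from each tier then yields a dataset whose popularity distribution is controlled by construction, which is what lets the evaluation probe long-tail entities where models are more likely to hallucinate.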
Merits
Strength in Methodology
The authors provide a well-structured and transparent methodology for generating the RiDiC dataset, ensuring reproducibility and reliability.
Comprehensive Evaluation
The use of a third-party factuality checker provides a robust evaluation framework for LLMs' responses, addressing potential biases and limitations.
Demerits
Limited Domain Scope
The study is limited to three domains (rivers, natural disasters, and car models), which may not be representative of the broader scope of LLM applications.
Dependence on External Data Sources
The pipeline's reliance on Wikipedia and Wikidata may introduce coverage bias, particularly for low-popularity or non-English entities, where these sources are more likely to be incomplete or inaccurate; any such gaps propagate directly into the evaluation references.
Expert Commentary
The RiDiC study makes a useful contribution to the evaluation of LLMs by providing a reproducible framework for assessing long-form factuality. The authors' use of a third-party factuality checker and the public release of the dataset and scripts support the transparency and reproducibility of the study. However, the limited domain scope and the dependence on Wikipedia and Wikidata are notable limitations. Future research should aim to expand the scope of the RiDiC pipeline and explore alternative reference sources to mitigate potential biases. The finding that even frontier models hallucinate on entities from this dataset also underscores the need for caution when deploying LLMs in factuality-critical applications.
Recommendations
- ✓ Future researchers should aim to expand the RiDiC pipeline to encompass a broader range of domains and applications.
- ✓ The use of alternative data sources, such as official government data or expert-curated datasets, may help mitigate potential biases and limitations in the RiDiC pipeline.
Sources
Original: arXiv - cs.AI