The Chronicles of RiDiC: Generating Datasets with Controlled Popularity Distribution for Long-form Factuality Evaluation
arXiv:2604.00019v1 Announce Type: cross Abstract: We present a configurable pipeline for generating multilingual sets of entities with specified characteristics, such as domain, geographical location and popularity, using data from Wikipedia and Wikidata. These datasets are intended for evaluating the factuality of LLMs' long-form generation, thereby complementing evaluation based on short-form QA datasets. We present the RiDiC dataset as an example of this approach. RiDiC contains 3,000 entities from three domains -- rivers, natural disasters, and car models -- spanning different popularity tiers. Each entity is accompanied by its geographical location, English and Chinese names (if available) and relevant English and Chinese Wikipedia content, which is used to evaluate LLMs' responses. Generations about RiDiC entities were obtained from three LLMs in English and Chinese. These were then evaluated using a third-party factuality checker, which showed that entities from our dataset caused even frontier models to hallucinate. To facilitate the evaluation of LLMs' long-form factuality in multiple languages, the code, data, and generation/evaluation scripts have been released.
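To make the pipeline's data source concrete, the sketch below shows how one might query Wikidata for candidate entities of a given domain together with their location and a popularity signal. This is a minimal illustration, not the paper's actual queries: the property and class identifiers are standard Wikidata IDs (P31 = instance of, P625 = coordinate location, Q4022 = river), and the query shape is an assumption.

```python
def build_entity_query(class_qid: str, limit: int = 100) -> str:
    """Build a Wikidata SPARQL query returning items of a class together
    with their coordinates and cross-wiki sitelink count (a common
    popularity signal).  Illustrative only; the paper's queries may differ."""
    return f"""
    SELECT ?item ?itemLabel ?coord ?sitelinks WHERE {{
      ?item wdt:P31 wd:{class_qid} ;         # instance of the given class
            wdt:P625 ?coord ;                # coordinate location
            wikibase:sitelinks ?sitelinks .  # number of Wikipedia sitelinks
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en,zh". }}
    }}
    LIMIT {limit}
    """

# Q4022 is the Wikidata class for "river", one of RiDiC's three domains.
query = build_entity_query("Q4022")
```

Such a query could be sent to the Wikidata Query Service endpoint; requesting both English and Chinese labels mirrors the dataset's bilingual entity names.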
Executive Summary
This study presents a configurable dataset-generation pipeline and an example dataset, RiDiC, designed to evaluate the factuality of long-form generation in Large Language Models (LLMs) across multiple languages. Drawing on Wikipedia and Wikidata, the pipeline produces multilingual sets of entities with a controlled popularity distribution, enabling systematic evaluation of LLM factuality across popularity tiers. The authors demonstrate the approach by generating 3,000 entities across three domains and evaluating responses from three LLMs, in English and Chinese, with a third-party factuality checker. The dataset, code, and generation/evaluation scripts are publicly released, facilitating future research in this area. The study contributes to the development of more reliable and robust LLMs by providing a reproducible resource for factuality evaluation.
Key Points
- ▸ RiDiC is a configurable pipeline for generating multilingual datasets with controlled popularity distribution
- ▸ The pipeline leverages Wikipedia and Wikidata data to create entities with specified characteristics
- ▸ The authors demonstrate the effectiveness of RiDiC by evaluating LLMs' responses using a third-party factuality checker
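The "controlled popularity distribution" in the points above can be sketched as follows. This is a minimal illustration under an assumption the paper does not spell out here: that Wikipedia sitelink counts serve as the popularity proxy and that entities are ranked and split into equal-sized tiers. The QIDs and counts below are placeholders, not entries from RiDiC.

```python
def assign_popularity_tiers(entities, n_tiers=3):
    """Rank entities by sitelink count (descending) and split them into
    n_tiers groups of roughly equal size; tier 0 is the most popular."""
    ranked = sorted(entities, key=lambda e: e["sitelinks"], reverse=True)
    tier_size = -(-len(ranked) // n_tiers)  # ceiling division
    return {e["qid"]: rank // tier_size for rank, e in enumerate(ranked)}

# Hypothetical river entities with illustrative sitelink counts.
rivers = [
    {"qid": "Q100", "sitelinks": 120},  # widely covered entity
    {"qid": "Q200", "sitelinks": 95},
    {"qid": "Q300", "sitelinks": 4},    # long-tail entity
]
tiers = assign_popularity_tiers(rivers, n_tiers=3)
# {'Q100': 0, 'Q200': 1, 'Q300': 2}
```

Sampling a fixed number of entities from each tier then yields a dataset whose popularity distribution is controlled by construction, which is what lets the evaluation probe long-tail entities where models are more likely to hallucinate.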
Merits
Strength in Methodology
The authors provide a well-structured and transparent methodology for generating the RiDiC dataset, ensuring reproducibility and reliability.
Comprehensive Evaluation
The use of a third-party factuality checker provides a robust evaluation framework for LLMs' responses, addressing potential biases and limitations.
Demerits
Limited Domain Scope
The study is limited to three domains (rivers, natural disasters, and car models), which may not be representative of the broader scope of LLM applications.
Dependence on External Data Sources
The pipeline's reliance on Wikipedia and Wikidata may introduce coverage bias, particularly for low-popularity or non-English entities, where these sources are more likely to be incomplete or inaccurate; any such gaps propagate directly into the evaluation references.
Expert Commentary
The RiDiC study makes a useful contribution to the evaluation of LLMs by providing a reproducible framework for assessing long-form factuality. The authors' use of a third-party factuality checker and the public release of the dataset and scripts support the transparency and reproducibility of the study. However, the limited domain scope and the dependence on Wikipedia and Wikidata are notable limitations. Future research should aim to expand the scope of the RiDiC pipeline and explore alternative reference sources to mitigate potential biases. The finding that even frontier models hallucinate on entities from this dataset also underscores the need for caution when deploying LLMs in factuality-critical applications.
Recommendations
- ✓ Future researchers should aim to expand the RiDiC pipeline to encompass a broader range of domains and applications.
- ✓ The use of alternative data sources, such as official government data or expert-curated datasets, may help mitigate potential biases and limitations in the RiDiC pipeline.
Sources
Original: arXiv - cs.AI