Academic

CiteAudit: You Cited It, But Did You Read It? A Benchmark for Verifying Scientific References in the LLM Era

arXiv:2602.23452v1 Announce Type: new Abstract: Scientific research relies on accurate citation for attribution and integrity, yet large language models (LLMs) introduce a new risk: fabricated references that appear plausible but correspond to no real publications. Such hallucinated citations have already been observed in submissions and accepted papers at major machine learning venues, exposing vulnerabilities in peer review. Meanwhile, rapidly growing reference lists make manual verification impractical, and existing automated tools remain fragile to noisy and heterogeneous citation formats and lack standardized evaluation. We present the first comprehensive benchmark and detection framework for hallucinated citations in scientific writing. Our multi-agent verification pipeline decomposes citation checking into claim extraction, evidence retrieval, passage matching, reasoning, and calibrated judgment to assess whether a cited source truly supports its claim. We construct a large-scale human-validated dataset across domains and define unified metrics for citation faithfulness and evidence alignment. Experiments with state-of-the-art LLMs reveal substantial citation errors and show that our framework significantly outperforms prior methods in both accuracy and interpretability. This work provides the first scalable infrastructure for auditing citations in the LLM era and practical tools to improve the trustworthiness of scientific references.

Executive Summary

The paper presents CiteAudit, a benchmark and detection framework for verifying scientific references in the large language model (LLM) era. It targets fabricated references that appear plausible but correspond to no real publications, a failure mode already observed in submissions and accepted papers at major machine learning venues. CiteAudit decomposes citation checking into claim extraction, evidence retrieval, passage matching, reasoning, and calibrated judgment, and it is accompanied by a large-scale, human-validated dataset spanning multiple domains together with unified metrics for citation faithfulness and evidence alignment. Experiments with state-of-the-art LLMs reveal substantial citation errors and show that CiteAudit significantly outperforms prior methods in both accuracy and interpretability.

Key Points

  • CiteAudit addresses the risk of fabricated references in scientific writing
  • The framework decomposes citation checking into multiple stages
  • CiteAudit is accompanied by a large-scale human-validated dataset and unified metrics

Merits

Comprehensive Benchmark

CiteAudit provides the first comprehensive benchmark for verifying scientific references in the LLM era, addressing a critical vulnerability in peer review.

Multi-Agent Verification Pipeline

The multi-agent verification pipeline decomposes citation checking into distinct stages (claim extraction, evidence retrieval, passage matching, reasoning, and calibrated judgment), which enables a more robust and accurate assessment of whether a cited source truly supports its claim.
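To make the staged design concrete, here is a minimal, hypothetical sketch of such a pipeline in Python. The abstract does not describe the individual agents, so every function name and body below is an assumption: each stage is a simple stand-in where CiteAudit would presumably invoke an LLM-backed agent or a retrieval system.

```python
"""Hypothetical sketch of a staged citation-verification pipeline.

All stage implementations are simple stand-ins; they only illustrate how
citation checking can be decomposed into claim extraction, evidence
retrieval, passage matching, reasoning, and calibrated judgment.
"""
from dataclasses import dataclass


@dataclass
class Verdict:
    claim: str
    supported: bool
    confidence: float
    evidence: str


def extract_claim(citing_sentence: str) -> str:
    # Stand-in: a real agent would isolate the factual assertion that the
    # citation is supposed to support.
    return citing_sentence.strip()


def retrieve_evidence(reference: dict) -> list[str]:
    # Stand-in: a real retriever would query a bibliographic index or the
    # full text of the cited work; finding nothing at all is itself a
    # signal of a possibly fabricated reference.
    return reference.get("passages", [])


def match_passages(claim: str, passages: list[str]) -> list[str]:
    # Stand-in ranking of candidate passages by lexical overlap with the claim.
    terms = set(claim.lower().split())
    return sorted(passages, key=lambda p: -len(terms & set(p.lower().split())))


def reason_over(claim: str, ranked: list[str]) -> str:
    # Stand-in: a real agent would argue whether the top passages entail the claim.
    if not ranked:
        return "no evidence retrieved"
    return f"claim {claim!r} vs. best passage {ranked[0]!r}"


def calibrated_judgment(claim: str, ranked: list[str], rationale: str) -> tuple[bool, float]:
    # Stand-in: a real judge would be an LLM whose confidence is calibrated
    # against human labels; here lexical overlap plays that role.
    if not ranked:
        return False, 0.9
    overlap = len(set(claim.lower().split()) & set(ranked[0].lower().split()))
    score = min(1.0, overlap / max(len(claim.split()), 1))
    return score > 0.5, score


def verify_citation(citing_sentence: str, reference: dict) -> Verdict:
    claim = extract_claim(citing_sentence)
    passages = retrieve_evidence(reference)
    ranked = match_passages(claim, passages)
    rationale = reason_over(claim, ranked)
    supported, confidence = calibrated_judgment(claim, ranked, rationale)
    return Verdict(claim, supported, confidence, ranked[0] if ranked else rationale)
```

Because each stage exposes a narrow interface, any stand-in could be swapped for a stronger, LLM-backed component without changing the rest of the pipeline, which is presumably what underlies the framework's robustness and interpretability claims.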

Improved Accuracy and Interpretability

CiteAudit significantly outperforms prior methods in both accuracy and interpretability, providing a scalable infrastructure for auditing citations in the LLM era.

Demerits

Dataset Limitations

Although the human-validated dataset is large-scale, it may not cover every domain or citation format, so performance on under-represented fields and unusual reference styles remains uncertain.

Evaluation Metrics

The unified metrics for citation faithfulness and evidence alignment may not capture every aspect of citation verification; for example, a single aggregate score may not distinguish a fabricated reference from a real paper cited for a claim it does not actually support.
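The abstract names these metrics without defining them, so the following is only a rough guess at their shape: corpus-level aggregates over per-citation verdicts, reusing the hypothetical Verdict type from the pipeline sketch above. The names and definitions here are assumptions, not the paper's.

```python
def citation_faithfulness(verdicts: list[Verdict]) -> float:
    # Assumed definition: fraction of checked citations judged to truly
    # support their claims.
    return sum(v.supported for v in verdicts) / len(verdicts) if verdicts else 0.0


def evidence_alignment(verdicts: list[Verdict]) -> float:
    # Assumed definition: mean calibrated confidence that the retrieved
    # evidence matches the extracted claim.
    return sum(v.confidence for v in verdicts) / len(verdicts) if verdicts else 0.0
```

Simple averages like these also illustrate the concern above: one corpus-level number can mask whether failures stem from fabricated references, misattributed claims, or retrieval misses.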

Expert Commentary

CiteAudit is a timely contribution to scientific verification, addressing a pressing vulnerability in peer review with scalable infrastructure for auditing citations. Its multi-agent verification pipeline and unified metrics are notable strengths, supporting verdicts that are both more accurate and easier to interpret than those of prior methods. The dataset coverage and the metrics themselves, however, will likely need further refinement. As the scientific community grapples with the risks introduced by LLMs, CiteAudit offers a valuable resource for maintaining the integrity and trustworthiness of scientific references.

Recommendations

  • Future research should extend the dataset and refine the evaluation metrics so that CiteAudit covers a broader range of domains and citation formats.
  • Researchers and institutions should adopt CiteAudit as a standard tool for verifying citations, promoting a culture of transparency and accountability in scientific research.
