ESG-Bench: Benchmarking Long-Context ESG Reports for Hallucination Mitigation

arXiv:2603.13154v1 Announce Type: new Abstract: As corporate responsibility increasingly incorporates environmental, social, and governance (ESG) criteria, ESG reporting is becoming a legal requirement in many regions and a key channel for documenting sustainability practices and assessing firms' long-term and ethical performance. However, the length and complexity of ESG disclosures make them difficult to interpret and their analysis hard to automate reliably. To support scalable and trustworthy analysis, this paper introduces ESG-Bench, a benchmark dataset for ESG report understanding and hallucination mitigation in large language models (LLMs). ESG-Bench contains human-annotated question-answer (QA) pairs grounded in real-world ESG report contexts, with fine-grained labels indicating whether model outputs are factually supported or hallucinated. Framing ESG report analysis as a QA task with verifiability constraints enables systematic evaluation of LLMs' ability to extract and reason over ESG content and provides a new use case: mitigating hallucinations in socially sensitive, compliance-critical settings. We design task-specific Chain-of-Thought (CoT) prompting strategies and fine-tune multiple state-of-the-art LLMs on ESG-Bench using CoT-annotated rationales. Our experiments show that these CoT-based methods substantially outperform standard prompting and direct fine-tuning in reducing hallucinations, and that the gains transfer to existing QA benchmarks beyond the ESG domain.

Executive Summary

This article introduces ESG-Bench, a benchmark dataset for evaluating large language models' ability to accurately analyze and reason over environmental, social, and governance (ESG) reports. The dataset consists of human-annotated question-answer pairs grounded in real-world ESG report contexts, with fine-grained labels indicating whether model outputs are factually supported or hallucinated. The authors design task-specific Chain-of-Thought (CoT) prompting strategies and fine-tune state-of-the-art LLMs on ESG-Bench, demonstrating substantial reductions in hallucinations compared to standard prompting and direct fine-tuning. The work has significant implications for mitigating hallucinations in socially sensitive, compliance-critical settings, and it establishes ESG report analysis as a new use case for evaluating LLMs' ability to extract and reason over ESG content.
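The summary does not reproduce the dataset schema, but the described design (QA pairs grounded in report excerpts, with binary support labels) suggests a record layout along the following lines. This is a hedged illustration only; every field name here is an assumption, not taken from the paper:

```python
from dataclasses import dataclass

@dataclass
class ESGBenchRecord:
    # Hypothetical record layout; field names are illustrative, not from the paper.
    report_excerpt: str  # grounding context drawn from a real ESG report
    question: str        # human-written question about the excerpt
    answer: str          # reference answer
    supported: bool      # True if the answer is factually supported, False if hallucinated

record = ESGBenchRecord(
    report_excerpt="In FY2023 the company reduced Scope 1 emissions by 12%.",
    question="By how much did Scope 1 emissions fall in FY2023?",
    answer="12%",
    supported=True,
)
print(record.supported)
```

A layout like this makes the verifiability constraint explicit: every answer carries a label tying it back to the grounding excerpt.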

Key Points

  • ESG-Bench is a benchmark dataset for evaluating LLMs' ability to accurately analyze ESG reports.
  • The dataset consists of human-annotated question-answer pairs from real-world ESG report contexts.
  • CoT-based methods substantially outperform standard prompting and direct fine-tuning in reducing hallucinations.
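The abstract describes task-specific CoT prompting but does not publish a template. As a minimal sketch of what such a prompt could look like, the following asks the model to quote its evidence before answering; the wording and function name are assumptions, not the authors' method:

```python
def build_cot_prompt(context: str, question: str) -> str:
    """Assemble a chain-of-thought style prompt that asks the model to
    cite the grounding passage and reason step by step before answering.
    The template wording is illustrative, not taken from the paper."""
    return (
        "You are analyzing an ESG report excerpt.\n"
        f"Excerpt: {context}\n"
        f"Question: {question}\n"
        "First, quote the sentence(s) from the excerpt that support your answer.\n"
        "Then reason step by step. Answer only if the excerpt supports it; "
        "otherwise reply 'not supported by the report'."
    )

prompt = build_cot_prompt(
    "The firm sourced 40% of its electricity from renewables in 2022.",
    "What share of electricity came from renewables in 2022?",
)
print(prompt)
```

Requiring an explicit "not supported" escape hatch is one common way such templates discourage the model from inventing figures that are absent from the source text.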

Merits

Strength

The introduction of ESG-Bench provides a much-needed benchmark for evaluating LLMs' ability to accurately analyze and reason over ESG reports, addressing a significant gap in the current literature.

Strength

The task-specific CoT prompting strategies and fine-tuning methods developed in this research demonstrate substantial reductions in hallucinations, providing a valuable solution for mitigating hallucinations in socially sensitive and compliance-critical settings.

Strength

The research also provides a new use case for evaluating LLMs' ability to extract and reason over ESG content, highlighting the potential applications of LLMs in the ESG domain.

Demerits

Limitation

The dataset is currently limited to ESG reports, and it is unclear whether the CoT-based methods will generalize to other domains or datasets.

Limitation

The experiments were conducted on a limited set of LLMs, and it is unclear whether the findings will hold for other LLM architectures or configurations.

Limitation

The research does not provide a comprehensive analysis of the potential biases or limitations of the ESG-Bench dataset itself.

Expert Commentary

ESG-Bench is a significant contribution to the field, providing a much-needed benchmark for evaluating how accurately LLMs analyze and reason over ESG reports. The task-specific CoT prompting strategies and fine-tuning methods demonstrate substantial reductions in hallucinations, a valuable result for socially sensitive, compliance-critical settings. The work is limited, however, by its focus on ESG reports and by the absence of a thorough analysis of the dataset's own potential biases. Even so, it carries clear implications for the development and evaluation of LLMs in the ESG domain, including extracting and reasoning over ESG content.
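The reported reductions in hallucinations presuppose a way to score them. Given binary support labels like those in ESG-Bench, one natural metric is the fraction of model answers judged unsupported; a simple sketch, assuming each label marks whether an answer was factually supported (the metric definition here is an assumption, not necessarily the paper's):

```python
def hallucination_rate(labels: list[bool]) -> float:
    """Fraction of model outputs labeled as hallucinated (unsupported).
    Assumes each entry is True when the output is factually supported."""
    if not labels:
        raise ValueError("no labels to score")
    return sum(1 for supported in labels if not supported) / len(labels)

# Toy illustration: one hallucinated answer out of four.
print(hallucination_rate([True, True, False, True]))  # 0.25
```

Comparing this rate between standard prompting and CoT-based runs is the kind of measurement the reported gains would rest on.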

Recommendations

  • Future research should focus on expanding the scope of ESG-Bench to include other datasets and domains, in order to further evaluate the generalizability of the CoT-based methods.
  • The development of more comprehensive and nuanced benchmarks for evaluating LLMs' ability to accurately analyze and reason over ESG reports is essential for advancing the field of LLMs.
