
FinReflectKG -- HalluBench: GraphRAG Hallucination Benchmark for Financial Question Answering Systems


Mahesh Kumar, Bhaskarjit Sarmah, Stefano Pasquali

arXiv:2603.20252v1 Announce Type: new Abstract: As organizations increasingly integrate AI-powered question-answering systems into financial information systems for compliance, risk assessment, and decision support, ensuring the factual accuracy of AI-generated outputs becomes a critical engineering challenge. Current Knowledge Graph (KG)-augmented QA systems lack systematic mechanisms to detect hallucinations - factually incorrect outputs that undermine reliability and user trust. We introduce FinBench-QA-Hallucination, a benchmark for evaluating hallucination detection methods in KG-augmented financial QA over SEC 10-K filings. The dataset contains 755 annotated examples from 300 pages, each labeled for groundedness using a conservative evidence-linkage protocol requiring support from both textual chunks and extracted relational triplets. We evaluate six detection approaches - LLM judges, fine-tuned classifiers, Natural Language Inference (NLI) models, span detectors, and embedding-based methods under two conditions: with and without KG triplets. Results show that LLM-based judges and embedding approaches achieve the highest performance (F1: 0.82-0.86) under clean conditions. However, most methods degrade significantly when noisy triplets are introduced, with Matthews Correlation Coefficient (MCC) dropping 44-84 percent, while embedding methods remain relatively robust with only 9 percent degradation. Statistical tests (Cochran's Q and McNemar) confirm significant performance differences (p < 0.001). Our findings highlight vulnerabilities in current KG-augmented systems and provide insights for building reliable financial information systems, where hallucinations can lead to regulatory violations and flawed decisions. The benchmark also offers a framework for integrating AI reliability evaluation into information system design across other high-stakes domains such as healthcare, legal, and government.
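The abstract's headline numbers rest on two statistics: the Matthews Correlation Coefficient (whose drop quantifies degradation under noisy triplets) and the McNemar test (which checks whether two detectors' error patterns differ significantly). A minimal sketch of both, using illustrative confusion-matrix counts rather than the benchmark's actual data:

```python
# Sketch of the evaluation statistics the abstract cites. All counts
# below are hypothetical, chosen only to illustrate the computations;
# they are NOT the benchmark's reported numbers.
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def mcnemar_chi2(b: int, c: int) -> float:
    """Continuity-corrected McNemar chi-square from the two discordant
    cells: b = detector A right / B wrong, c = A wrong / B right."""
    if b + c == 0:
        return 0.0
    return (abs(b - c) - 1) ** 2 / (b + c)

# A detector under clean vs. noisy triplet conditions (toy numbers).
clean = mcc(tp=310, tn=320, fp=60, fn=65)
noisy = mcc(tp=220, tn=250, fp=130, fn=155)
drop_pct = 100 * (clean - noisy) / clean
print(f"MCC clean={clean:.3f} noisy={noisy:.3f} drop={drop_pct:.0f}%")
print(f"McNemar chi2={mcnemar_chi2(b=40, c=12):.2f}")  # > 3.84 -> p < 0.05
```

A 44-84% MCC drop, as the paper reports for most methods, corresponds to exactly this kind of collapse in the clean-vs-noisy comparison.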

Executive Summary

The article introduces FinReflectKG -- HalluBench, a benchmark for evaluating hallucination detection in KG-augmented financial QA systems built over SEC 10-K filings. With 755 annotated examples drawn from 300 pages, it establishes a conservative groundedness protocol requiring support from both textual chunks and extracted relational triplets. Evaluating six detection approaches, spanning LLM judges, fine-tuned classifiers, NLI models, span detectors, and embedding-based methods, under two conditions (with and without KG triplets) reveals a critical vulnerability: most methods degrade substantially when noisy triplets are introduced, with MCC dropping by as much as 84%, while embedding-based methods remain relatively robust at roughly 9% degradation. Statistical tests (Cochran's Q and McNemar) confirm the performance differences are significant (p < 0.001). This work fills a critical gap in AI reliability assessment for high-stakes financial decision-making and offers a scalable framework applicable to healthcare, legal, and government domains.

Key Points

  • Introduction of a benchmark for hallucination detection in financial QA
  • Evaluation of six detection approaches across two conditions
  • Identification of significant degradation in most methods under noisy triplet conditions

Merits

Comprehensive Benchmark Design

The dataset’s conservative evidence-linkage protocol and annotated examples provide a robust foundation for evaluating hallucination detection in real-world financial contexts.
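The dual-evidence idea can be sketched concretely: a claim is labeled grounded only if it finds support in BOTH a retrieved text chunk and an extracted (subject, relation, object) triplet. The paper does not publish its matching criterion, so simple token overlap and substring checks stand in for it here; everything below is an illustrative assumption, not the authors' protocol.

```python
# Hedged sketch of a conservative evidence-linkage check. The overlap
# threshold and matching rules are placeholder assumptions.
def supported_by_chunk(claim: str, chunk: str, threshold: float = 0.5) -> bool:
    """Toy criterion: enough of the claim's tokens appear in the chunk."""
    claim_tokens = set(claim.lower().split())
    if not claim_tokens:
        return False
    overlap = claim_tokens & set(chunk.lower().split())
    return len(overlap) / len(claim_tokens) >= threshold

def supported_by_triplet(claim: str, triplet: tuple) -> bool:
    """Toy criterion: the triplet's subject and object both appear."""
    subj, _relation, obj = triplet
    text = claim.lower()
    return subj.lower() in text and obj.lower() in text

def is_grounded(claim: str, chunks: list, triplets: list) -> bool:
    # Conservative: require at least one supporting chunk AND one triplet.
    return (any(supported_by_chunk(claim, c) for c in chunks)
            and any(supported_by_triplet(claim, t) for t in triplets))

claim = "Apple reported revenue of $383 billion in fiscal 2023"
chunks = ["Apple Inc. reported net revenue of $383 billion for fiscal 2023"]
triplets = [("Apple", "reported_revenue", "$383 billion")]
print(is_grounded(claim, chunks, triplets))  # True under this toy criterion
```

Requiring both evidence sources makes the labeling conservative: a claim supported by text alone, or by a triplet alone, is not counted as grounded.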

Demerits

Limited Scope of Method Diversity

The evaluation excludes certain emerging detection paradigms, such as multimodal fusion models, potentially limiting generalizability.

Expert Commentary

This work represents a pivotal contribution to the field of AI-augmented decision support systems. The empirical rigor in benchmark construction—particularly the dual-evidence validation mechanism—sets a new standard for evaluating hallucination detection. The observed performance disparities between LLM judges and embedding methods under noisy conditions are particularly noteworthy: while LLM-based judges, despite their sophistication, are disproportionately affected by triplet noise, embedding-based approaches demonstrate a more nuanced resilience, suggesting a deeper alignment with semantic structure over surface-level cues. These findings challenge conventional assumptions about the superiority of LLMs in reliability contexts and open avenues for hybrid architectures that combine LLM interpretability with embedding robustness. Moreover, the extension of applicability beyond finance to healthcare and legal domains underscores the universality of the problem: hallucination is not a domain-specific artifact but a systemic risk in any knowledge-augmented QA pipeline. The authors rightly position this as a foundational resource for future research in AI reliability engineering.
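The embedding-based detection mechanism discussed above can be sketched as a similarity threshold: score an answer against its retrieved evidence and flag low-similarity answers as likely hallucinations. Real systems use learned sentence embeddings; a bag-of-words vector stands in below so the sketch runs without model downloads, and the threshold is an illustrative assumption.

```python
# Minimal sketch of embedding-style hallucination flagging. The embed()
# function and the 0.35 threshold are stand-in assumptions, not the
# paper's method.
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; real systems use sentence encoders."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def flag_hallucination(answer: str, evidence: list, threshold: float = 0.35):
    """Flag the answer if it is not close to ANY evidence passage."""
    score = max(cosine(embed(answer), embed(e)) for e in evidence)
    return score < threshold, score

evidence = ["The company reported total net sales of 383 billion dollars"]
grounded = "Net sales were 383 billion dollars"
fabricated = "The CEO announced a merger with a European bank"
print(flag_hallucination(grounded, evidence))
print(flag_hallucination(fabricated, evidence))
```

The design choice this illustrates is why embeddings resist triplet noise: the score depends only on answer-evidence similarity, so corrupted triplets in the prompt cannot directly distort it the way they distort an LLM judge's reasoning.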

Recommendations

  • Adopt FinReflectKG -- HalluBench as a standard for evaluating hallucination detection in financial QA
  • Investigate hybrid models integrating embedding-based resilience with LLM contextual richness
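One way to read the hybrid recommendation: combine an LLM judge's verdict with an embedding-similarity score, trusting the embedding signal more as retrieved triplets look noisier, since the paper reports judges degrade sharply under noisy triplets. The function name, weights, and noise estimate below are all hypothetical, offered only to make the recommendation concrete.

```python
# Hypothetical hybrid scoring sketch; weights are illustrative, not tuned.
def hybrid_grounded(llm_judge_prob: float, embed_sim: float,
                    triplet_noise: float, w_clean: float = 0.6):
    """Blend judge and embedding scores; triplet_noise in [0, 1] is an
    (assumed available) estimate of retrieval-graph corruption."""
    # Down-weight the LLM judge as triplet noise rises.
    w_llm = w_clean * (1.0 - triplet_noise)
    score = w_llm * llm_judge_prob + (1.0 - w_llm) * embed_sim
    return score >= 0.5, round(score, 3)

# Clean graph: the judge's confidence carries most of the weight.
print(hybrid_grounded(llm_judge_prob=0.9, embed_sim=0.8, triplet_noise=0.0))
# Heavily noisy graph: the embedding signal dominates the blend.
print(hybrid_grounded(llm_judge_prob=0.2, embed_sim=0.8, triplet_noise=0.9))
```

Under this sketch, a misled judge (low `llm_judge_prob`) no longer vetoes a well-grounded answer when the graph is known to be noisy, which is the robustness property the recommendation is after.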

Sources

Original: arXiv - cs.CL