
EvidenceRL: Reinforcing Evidence Consistency for Trustworthy Language Models


arXiv:2603.19532v1 Announce Type: new Abstract: Large Language Models (LLMs) are fluent but prone to hallucinations, producing answers that appear plausible yet are unsupported by available evidence. This failure is especially problematic in high-stakes domains where decisions must be justified by verifiable information. We introduce \textbf{EvidenceRL}, a reinforcement learning framework that enforces evidence adherence during training. EvidenceRL scores candidate responses for grounding (entailment with retrieved evidence and context) and correctness (agreement with reference answers) and optimizes the generator using Group Relative Policy Optimization (GRPO). We evaluate across two high-stakes domains, cardiac diagnosis and legal reasoning, where EvidenceRL consistently improves evidence grounding and faithfulness without sacrificing task accuracy. On cardiac diagnosis, F1@3 increases from 37.0 to 54.5 on Llama-3.2-3B while grounding ($G_{\max}@3$) rises from 47.6 to 78.2; hallucinations drop nearly 5$\times$ and evidence-supported diagnoses increase from 31.8\% to 61.6\%. On legal reasoning, EvidenceRL raises Faithfulness from 32.8\% to 67.6\% on Llama-3.1-8B, demonstrating consistent behavioral change across domains. Our code is open-sourced at https://github.com/Wizaaard/EvidenceRL.git.

Executive Summary

EvidenceRL is a reinforcement learning framework that addresses hallucinations in large language models by enforcing evidence adherence during training. It scores candidate responses for grounding (entailment with retrieved evidence and context) and correctness (agreement with reference answers), and optimizes the generator using Group Relative Policy Optimization (GRPO). Evaluated on two high-stakes domains, the framework improves evidence grounding and faithfulness without sacrificing task accuracy: on cardiac diagnosis, F1@3 rises from 37.0 to 54.5 on Llama-3.2-3B while hallucinations drop nearly fivefold, and on legal reasoning, faithfulness improves from 32.8% to 67.6% on Llama-3.1-8B. The open-sourced code allows for further development and adaptation. This framework has the potential to enhance the reliability and trustworthiness of language models in critical applications.
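The mechanism described above can be sketched in a few lines. The snippet below is a hypothetical illustration, not the paper's implementation: the equal weighting of grounding and correctness and the example scores are assumptions, and the entailment/correctness scorers are stubbed as plain floats. The group-relative normalization, however, is the standard GRPO formulation, where each sampled response's reward is compared against the mean of its sampling group.

```python
# Hypothetical sketch of EvidenceRL-style reward shaping plus GRPO
# group-relative advantages. Weighting and scores are illustrative
# assumptions, not the paper's exact formulation.
from statistics import mean, pstdev


def reward(grounding: float, correctness: float, w: float = 0.5) -> float:
    """Blend an entailment-based grounding score with answer correctness.

    In the paper, grounding would come from an entailment check against
    retrieved evidence and correctness from agreement with the reference
    answer; here both are supplied directly for illustration.
    """
    return w * grounding + (1.0 - w) * correctness


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO: normalize each response's reward against its sampling group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four candidate responses sampled for one prompt, as (grounding, correctness):
candidates = [(0.9, 1.0), (0.6, 1.0), (0.4, 0.0), (0.2, 0.0)]
scores = [reward(g, c) for g, c in candidates]
adv = group_relative_advantages(scores)
# Well-grounded, correct responses get positive advantages (reinforced);
# ungrounded or incorrect ones get negative advantages (suppressed).
```

Because advantages are computed relative to the group rather than a learned value baseline, GRPO needs no separate critic model, which keeps the training loop comparatively lightweight.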

Key Points

  • EvidenceRL is a reinforcement learning framework that enforces evidence adherence in large language models.
  • It improves evidence grounding and faithfulness without sacrificing task accuracy in high-stakes domains.
  • The framework delivers large measured gains, e.g. grounding ($G_{\max}@3$) on cardiac diagnosis rising from 47.6 to 78.2 and faithfulness on legal reasoning from 32.8% to 67.6%.

Merits

Strength in Addressing Hallucinations

EvidenceRL effectively reduces hallucinations in large language models by enforcing evidence adherence, making it a significant advancement in addressing this critical issue.

Improved Performance in High-Stakes Domains

The framework demonstrates substantial improvement in F1 scores, grounding, and faithfulness metrics in high-stakes domains such as cardiac diagnosis and legal reasoning.

Open-Source Code for Further Development

The open-sourced code allows for further development, adaptation, and integration of EvidenceRL into various applications.

Demerits

Limited Evaluation Across Domains

While the framework is evaluated in two high-stakes domains, its performance and effectiveness may vary in other domains, and further evaluation is necessary to establish its generalizability.

Potential Overreliance on Retrieved Evidence

Because rewards are tied to retrieved evidence, the framework may overweight whatever retrieval returns and neglect critical information that the evidence set fails to capture.

Expert Commentary

EvidenceRL is a meaningful step toward reducing hallucinations in large language models. By rewarding entailment with retrieved evidence during training rather than filtering outputs after the fact, it produces measurable behavioral change across both evaluated domains. However, its limited evaluation, covering only cardiac diagnosis and legal reasoning, and its dependence on retrieval quality are notable limitations. As the framework evolves, addressing these limitations will be essential before deploying it in critical applications.

Recommendations

  • Further evaluation of EvidenceRL across various domains and applications is necessary to establish its generalizability and effectiveness.
  • Developers should add safeguards against overreliance on retrieved evidence, so that critical information missing from the retrieval set is not silently ignored.

Sources

Original: arXiv - cs.CL