
Quantifying Hallucinations in Large Language Models on Medical Textbooks

Brandon C. Colelough, Davis Bartels, Dina Demner-Fushman

arXiv:2603.09986v1 Announce Type: cross Abstract: Hallucinations, the tendency of large language models to produce responses containing factually incorrect and unsupported claims, are a serious problem in natural language processing for which we do not yet have an effective mitigation. Existing benchmarks for medical QA rarely evaluate this behavior against a fixed evidence source. We ask how often hallucinations occur in textbook-grounded QA and how responses to medical QA prompts vary across models. We conduct two experiments: the first determines the prevalence of hallucinations for a prominent open-source large language model (LLaMA-70B-Instruct) in medical QA given novel prompts, and the second determines the prevalence of hallucinations and clinician preference for model responses. In experiment one, with the passages provided, LLaMA-70B-Instruct hallucinated in 19.7% of answers (95% CI 18.6 to 20.7) even though 98.8% of prompt responses received maximal plausibility; in experiment two, across models, lower hallucination rates aligned with higher usefulness scores ($\rho=-0.71$, $p=0.058$). Clinicians showed high agreement in experiment one (quadratic weighted $\kappa=0.92$), and agreement in experiment two was $\tau_b=0.06$ to $0.18$, $\kappa=0.57$ to $0.61$.
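
The abstract reports the hallucination rate with a 95% confidence interval but does not say how that interval was computed. Below is a minimal sketch, assuming a Wilson score interval for a binomial proportion; the counts used are illustrative and not taken from the paper.

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score confidence interval for a binomial proportion (z=1.96 ~ 95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

# Hypothetical counts only: roughly 19.7% hallucinated answers out of 5,000 graded responses.
lo, hi = wilson_ci(successes=985, n=5000)
print(f"hallucination rate 95% CI: {lo:.3f} to {hi:.3f}")
```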

Executive Summary

The article investigates the phenomenon of hallucinations in large language models, specifically in the context of medical question answering. The study finds that a prominent open-source model, LLaMA-70B-Instruct, hallucinates in approximately 19.7% of answers, despite receiving high plausibility scores. The research highlights the need for more effective benchmarks to evaluate and mitigate hallucinations in medical QA, with implications for the development of reliable language models in healthcare.

Key Points

  • Hallucinations in language models are a significant problem in medical question answering
  • The study evaluates the prevalence of hallucinations in LLaMA-70B-Instruct using novel prompts
  • Lower hallucination rates are associated with higher usefulness scores

Merits

Systematic Evaluation

The study provides a systematic evaluation of hallucinations in a prominent language model, shedding light on the severity of the issue.

Demerits

Limited Model Scope

The prevalence analysis in the first experiment focuses on a single model, LLaMA-70B-Instruct, which may not be representative of the broader range of language models used in medical QA.

Expert Commentary

The study's findings underscore the importance of addressing hallucinations in language models, particularly in medical QA. The association between lower hallucination rates and higher usefulness scores suggests that mitigating hallucinations can lead to more reliable and trustworthy language models. However, the study's limitations, such as its focus on a single model, highlight the need for further research to develop more comprehensive solutions to this problem.
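
The statistics quoted above (Spearman's $\rho$, quadratic weighted $\kappa$, Kendall's $\tau_b$) can be computed from rating data with standard tooling. The paper's actual analysis pipeline is not described here, so the sketch below uses made-up per-model rates and clinician ratings purely to show what each metric measures, via scipy and scikit-learn.

```python
from scipy.stats import spearmanr, kendalltau
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-model aggregates: hallucination rate vs. mean clinician usefulness score.
hallucination_rate = [0.05, 0.10, 0.12, 0.15, 0.20, 0.25, 0.30]
usefulness_score = [4.8, 4.6, 4.5, 4.2, 4.0, 3.6, 3.4]

rho, p_value = spearmanr(hallucination_rate, usefulness_score)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")  # negative rho: fewer hallucinations, higher usefulness

# Hypothetical ordinal ratings (1-5) from two clinicians on the same set of answers.
rater_a = [5, 4, 4, 3, 5, 2, 4, 3, 5, 4]
rater_b = [5, 4, 3, 3, 5, 2, 4, 2, 5, 4]

qwk = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
tau_b, _ = kendalltau(rater_a, rater_b)
print(f"Quadratic weighted kappa = {qwk:.2f}, Kendall tau-b = {tau_b:.2f}")
```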

Recommendations

  • Developing and evaluating more diverse and robust benchmarks for medical QA
  • Investigating the use of explainability and transparency techniques to mitigate hallucinations in language models
