
Ran Score: an LLM-based Evaluation Score for Radiology Report Generation

Abstract (arXiv:2603.22935v1): Chest X-ray report generation and automated evaluation are limited by poor recognition of low-prevalence abnormalities and inadequate handling of clinically important language, including negation and ambiguity. We develop a clinician-guided framework combining human expertise and large language models for multi-label finding extraction from free-text chest X-ray reports and use it to define Ran Score, a finding-level metric for report evaluation. Using three non-overlapping MIMIC-CXR-EN cohorts from a public chest X-ray dataset and an independent ChestX-CN validation cohort, we optimize prompts, establish radiologist-derived reference labels, and evaluate report generation models. The optimized framework improves the macro-averaged score from 0.753 to 0.956 on the MIMIC-CXR-EN development cohort, exceeds the CheXbert benchmark by 15.7 percentage points on directly comparable labels, and shows robust generalization on the ChestX-CN validation cohort. Here we show that clinician-guided prompt optimization improves agreement with a radiologist-derived reference standard and that Ran Score enables finding-level evaluation of report fidelity, particularly for low-prevalence abnormalities.

Executive Summary

The article introduces Ran Score, an LLM-based evaluation metric for radiology report generation that addresses two gaps in current automated evaluation: underdetection of low-prevalence abnormalities and inadequate handling of negation and ambiguity. By integrating clinician expertise with LLM capabilities, the authors develop a framework for multi-label finding extraction that raises the macro-averaged score from 0.753 to 0.956 on the MIMIC-CXR-EN development cohort, exceeds the CheXbert benchmark by 15.7 percentage points on directly comparable labels, and generalizes robustly to the independent ChestX-CN validation cohort. The result is a finding-level evaluation tool that aligns more closely with radiologist-derived reference standards.
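
The abstract does not give the exact formula for Ran Score, so the following is only a minimal sketch of what a finding-level, macro-averaged metric of this kind could look like; the finding vocabulary, binary labels, and F1 aggregation are illustrative assumptions, not the authors' implementation.

    # Sketch of a finding-level, macro-averaged score (assumptions noted above).
    FINDINGS = ["atelectasis", "cardiomegaly", "pneumothorax"]  # hypothetical subset

    def finding_f1(gold: list[dict], pred: list[dict], finding: str) -> float:
        """F1 for a single finding across paired reference/generated labels."""
        tp = fp = fn = 0
        for g, p in zip(gold, pred):
            g_pos = g.get(finding, 0) == 1
            p_pos = p.get(finding, 0) == 1
            tp += g_pos and p_pos
            fp += (not g_pos) and p_pos
            fn += g_pos and not p_pos
        if tp == 0:
            return 0.0
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return 2 * precision * recall / (precision + recall)

    def macro_score(gold: list[dict], pred: list[dict]) -> float:
        """Macro-averaging weighs every finding equally, so gains on rare
        abnormalities move the score as much as gains on common ones."""
        return sum(finding_f1(gold, pred, f) for f in FINDINGS) / len(FINDINGS)

Macro-averaging is why the reported gains on low-prevalence abnormalities show up directly in the headline score: each finding contributes equally regardless of how often it occurs.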

Key Points

  • Development of clinician-guided LLM framework for multi-label finding extraction
  • Introduction of Ran Score as a finding-level evaluation metric
  • Macro-averaged extraction score improved from 0.753 to 0.956, with robust generalization across independent cohorts

Merits

Precision Enhancement

Ran Score improves detection of low-prevalence abnormalities through clinician-guided prompt optimization, enabling more accurate fidelity assessment at the finding level.
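
The clinician-optimized prompts themselves are not reproduced in the abstract, so the following is only a hypothetical sketch of the general shape such an extraction prompt might take, with explicit handling of negation and uncertainty; the finding list, label scheme, and llm_call hook are all assumptions.

    import json

    # Hypothetical extraction prompt; the paper's actual clinician-guided
    # prompts are not public in the abstract.
    EXTRACTION_PROMPT = """You are a radiologist. For each finding listed, label it
    1 (present), 0 (absent or explicitly negated), or -1 (uncertain or ambiguous),
    using only the report text. Answer with JSON of the form {{"finding": label}}.

    Findings: atelectasis, cardiomegaly, pneumothorax, pleural effusion
    Report:
    {report}"""

    def extract_findings(report: str, llm_call) -> dict:
        # llm_call is any caller-supplied text-in/text-out LLM client;
        # no specific vendor API is assumed here.
        raw = llm_call(EXTRACTION_PROMPT.format(report=report))
        return json.loads(raw)

A three-valued label scheme like this is one common way to keep negated and hedged findings from being conflated with positive ones, which is the failure mode the paper targets.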

Generalization

Robust performance across multiple non-overlapping cohorts, including an independent ChestX-CN validation set, indicates strong generalizability and broad applicability.

Demerits

Scope Limitation

Evaluation is confined to chest X-ray reports; applicability to other radiological modalities or clinical domains remains unverified.

Implementation Constraint

Clinician-guided optimization, while effective, may introduce scalability challenges due to dependency on expert input for prompt refinement.

Expert Commentary

This work represents a meaningful step forward in the evaluation of AI-generated radiology reports. The integration of clinician-guided prompt optimization into LLM-based evaluation systems is a sophisticated, pragmatic innovation that bridges the gap between algorithmic output and clinical expectations. Unlike traditional metrics that aggregate at the document level, Ran Score’s finding-level granularity allows for more nuanced, clinically relevant assessments—critical for detecting subtle abnormalities that often evade automated systems. The authors’ rigorous validation across independent datasets strengthens credibility and applicability. However, one must acknowledge the inherent trade-off between expert-driven customization and scalability: while clinician involvement improves accuracy, it may limit widespread adoption without automation of the refinement process. Nevertheless, the demonstrated efficacy and robustness suggest that Ran Score could become a standard benchmark in radiology AI evaluation, influencing both research and clinical validation pipelines. This is a significant contribution to the field of medical AI, particularly for domains where precision in language interpretation is paramount.

Recommendations

  1. Encourage adoption of Ran Score as a supplemental or primary evaluation metric in radiology AI research and clinical deployment.
  2. Explore automated prompt refinement mechanisms to reduce dependency on manual clinician input while preserving accuracy, thereby enhancing scalability; a minimal sketch follows this list.
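
As a minimal sketch of recommendation 2, assuming a development set with radiologist reference labels and a scoring function such as the macro_score sketched earlier, a greedy search could stand in for part of the manual refinement loop. The propose_variants, run_extraction, and score helpers are caller-supplied assumptions, not components described in the paper.

    def refine_prompt(base_prompt, dev_reports, dev_gold,
                      propose_variants, run_extraction, score):
        # Greedy hill-climbing over candidate prompts: keep whichever
        # variant scores best against the radiologist-derived labels.
        best_prompt = base_prompt
        best = score(dev_gold, run_extraction(best_prompt, dev_reports))
        for candidate in propose_variants(best_prompt):
            s = score(dev_gold, run_extraction(candidate, dev_reports))
            if s > best:
                best_prompt, best = candidate, s
        return best_prompt

Automating candidate generation (for example, by asking an LLM to rewrite the instructions that fail most often) would reduce, though likely not eliminate, the need for expert review of each variant.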

Sources

Original: arXiv:2603.22935 (cs.AI)