WHBench: Evaluating Frontier LLMs with Expert-in-the-Loop Validation on Women's Health Topics

Sneha Maurya, Pragya Saboo, Girish Kumar · April 3, 2026

#cs.CL #cs.AI #cs.CY

arXiv:2604.00024v1 (announce type: new)

Abstract: Large language models are increasingly used for medical guidance, but women's health remains under-evaluated in benchmark design. We present the Women's Health Benchmark (WHBench), a targeted evaluation suite of 47 expert-crafted scenarios across 10 women's health topics, designed to expose clinically meaningful failure modes including outdated guidelines, unsafe omissions, dosing errors, and equity-related blind spots. We evaluate 22 models using a 23-criterion rubric spanning clinical accuracy, completeness, safety, communication quality, instruction following, equity, uncertainty handling, and guideline adherence, with safety-weighted penalties and server-side score recalculation. Across 3,102 attempted responses (3,100 scored), no model's mean performance exceeds 75 percent; the best model reaches 72.1 percent. Even top models show low fully correct rates and substantial variation in harm rates. Inter-rater reliability is moderate at the response label level but high for model ranking, supporting WHBench's utility for comparative system evaluation while highlighting the need for expert oversight in clinical deployment. WHBench provides a public, failure-mode-aware benchmark to track safer and more equitable progress in women's health AI.
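The abstract mentions a rubric with safety-weighted penalties and server-side score recalculation but does not give the formula. The sketch below is only an illustration of how such a scheme could work, not the authors' method. The criterion names, weights, the `safety_critical` flag, and the `safety_penalty` multiplier are all hypothetical. The idea shown is that the total is recomputed from per-criterion ratings instead of trusting a submitted total, and that shortfalls on safety-critical criteria cost extra.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float          # hypothetical weight, not taken from the paper
    safety_critical: bool  # hypothetical flag marking safety criteria


# Hypothetical two-criterion rubric; the real WHBench rubric has 23 criteria.
EXAMPLE_CRITERIA = [
    Criterion("clinical_accuracy", weight=1.0, safety_critical=False),
    Criterion("safety", weight=1.0, safety_critical=True),
]


def recompute_score(criteria, ratings, safety_penalty=1.0):
    """Recompute a percentage score from per-criterion ratings in [0, 1].

    Any shortfall on a safety-critical criterion is subtracted a second
    time, scaled by `safety_penalty` (an assumed parameter).
    """
    earned = 0.0
    possible = 0.0
    penalty = 0.0
    for criterion in criteria:
        rating = ratings[criterion.name]
        if not 0.0 <= rating <= 1.0:
            raise ValueError(f"rating for {criterion.name} must be in [0, 1]")
        earned += criterion.weight * rating
        possible += criterion.weight
        if criterion.safety_critical:
            penalty += safety_penalty * criterion.weight * (1.0 - rating)
    raw = (earned - penalty) / possible
    # Clamp so that heavy penalties cannot push the score below zero.
    return 100.0 * max(0.0, min(1.0, raw))
```

Under these assumed weights, a response that loses half credit on the safety criterion scores lower than one that loses half credit on accuracy, which is the qualitative behavior a safety-weighted penalty is meant to produce.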

Sources

Original: arXiv - cs.CL
