MedPriv-Bench: Benchmarking the Privacy-Utility Trade-off of Large Language Models in Medical Open-Ended Question Answering

arXiv:2603.14265v1. Abstract: Recent advances in Retrieval-Augmented Generation (RAG) have enabled large language models (LLMs) to ground outputs in clinical evidence. However, connecting LLMs with external databases introduces the risk of contextual leakage: a subtle privacy threat where unique combinations of medical details enable patient re-identification even without explicit identifiers. Current benchmarks in healthcare focus heavily on accuracy and ignore such privacy issues, despite strict regulations like the Health Insurance Portability and Accountability Act (HIPAA) and the General Data Protection Regulation (GDPR). To fill this gap, we present MedPriv-Bench, the first benchmark specifically designed to jointly evaluate privacy preservation and clinical utility in medical open-ended question answering. Our framework utilizes a multi-agent, human-in-the-loop pipeline to synthesize sensitive medical contexts and clinically relevant queries that create realistic privacy pressure. We establish a standardized evaluation protocol leveraging a pre-trained RoBERTa-Natural Language Inference (NLI) model as an automated judge to quantify data leakage, achieving an average of 85.9% alignment with human experts. Through an extensive evaluation of 9 representative LLMs, we demonstrate a pervasive privacy-utility trade-off. Our findings underscore the necessity of domain-specific benchmarks to validate the safety and efficacy of medical AI systems in privacy-sensitive environments.
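The contextual-leakage threat described in the abstract hinges on quasi-identifier uniqueness: even with explicit identifiers removed, a rare combination of attributes can single out a patient. A minimal illustration of this idea (not the authors' method) is a k-anonymity count over de-identified records; the records and field names below are hypothetical:

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the smallest equivalence-class size over the given
    quasi-identifier columns; k == 1 means some record is unique
    on those fields alone and therefore re-identifiable."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

# Hypothetical de-identified records: no names or IDs, yet the
# combination of a rare condition, age, and ZIP prefix is unique.
records = [
    {"age": 34, "zip3": "021", "condition": "asthma"},
    {"age": 34, "zip3": "021", "condition": "asthma"},
    {"age": 71, "zip3": "021", "condition": "Erdheim-Chester disease"},
]

print(k_anonymity(records, ["age", "zip3", "condition"]))  # 1 -> re-identification risk
print(k_anonymity(records, ["zip3"]))                      # 3 -> safe on ZIP prefix alone
```

A benchmark that only checks whether names and IDs are withheld would miss exactly this failure mode, which is why the paper frames leakage at the level of detail combinations rather than explicit identifiers.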

Executive Summary

The article introduces MedPriv-Bench, a pioneering benchmark that addresses a critical gap in medical AI evaluation by integrating privacy preservation with clinical utility assessment in open-ended question answering. While current healthcare benchmarks prioritize accuracy, MedPriv-Bench uniquely incorporates a multi-agent, human-in-the-loop framework to simulate realistic privacy risks via contextual leakage, using a RoBERTa-NLI model as an automated judge to quantify data leakage with notable alignment to human experts (85.9%). The evaluation of nine LLMs reveals a consistent privacy-utility trade-off, establishing a much-needed domain-specific evaluation tool for medical AI systems operating in sensitive environments. This work fills a regulatory and ethical void by aligning evaluation metrics with HIPAA and GDPR imperatives.

Key Points

  • First benchmark to jointly evaluate privacy and utility in medical open-ended QA
  • Utilizes human-in-the-loop simulation of contextual leakage
  • Demonstrates pervasive privacy-utility trade-off across nine LLMs

Merits

Innovation

MedPriv-Bench fills a significant void by introducing a domain-specific benchmark that aligns with HIPAA/GDPR compliance requirements, offering a measurable framework for assessing privacy risks in medical AI.

Methodological Rigor

The use of a standardized automated judge (RoBERTa-NLI) and multi-agent pipeline enhances reproducibility and objectivity in evaluating data leakage.
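To make the judging protocol concrete, one plausible reading is that each sensitive fact from the retrieved context is treated as an NLI hypothesis and the model's answer as the premise: a fact counts as leaked if the judge's entailment probability clears a threshold. The sketch below is illustrative only; `nli_entailment` stands in for a real RoBERTa-NLI call, and the threshold and stub judge are assumptions, not the paper's protocol:

```python
def leakage_rate(answer, sensitive_facts, nli_entailment, threshold=0.9):
    """Fraction of sensitive facts the answer entails, per the NLI judge.

    nli_entailment(premise, hypothesis) -> entailment probability in [0, 1].
    """
    if not sensitive_facts:
        return 0.0
    leaked = sum(
        1 for fact in sensitive_facts
        if nli_entailment(answer, fact) >= threshold
    )
    return leaked / len(sensitive_facts)

# Stub judge for illustration: scores a fact as entailed when all of its
# words appear in the answer (a real system would use a RoBERTa-NLI model).
def stub_judge(premise, hypothesis):
    return 1.0 if all(w in premise.lower() for w in hypothesis.lower().split()) else 0.0

answer = "The patient, a 71-year-old with Erdheim-Chester disease, should start therapy."
facts = ["71-year-old", "erdheim-chester disease", "lives in zip 02139"]
print(leakage_rate(answer, facts, stub_judge))  # 2 of 3 hypothetical facts leaked
```

Framing leakage as textual entailment rather than string matching is what lets the judge catch paraphrased disclosures, which is presumably why the authors report alignment with human experts rather than exact-match metrics.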

Demerits

Scope Limitation

While comprehensive for open-ended QA, the benchmark may not fully capture privacy risks in other medical AI applications such as diagnostic decision-support or longitudinal data analysis.

Generalizability Concern

The evaluation relies on synthetic contexts; real-world applicability may vary depending on actual clinical data structures and user interaction patterns.

Expert Commentary

MedPriv-Bench represents a significant advance in the ethical evaluation of medical AI. The authors correctly identify that privacy threats via contextual leakage—though subtle—are among the most insidious risks in clinical AI, particularly when external databases are integrated via RAG. Their decision to use a human-in-the-loop validation layer, combined with an automated NLI-based judge, strikes an admirable balance between computational efficiency and human-centered judgment. The 85.9% alignment with expert assessments is a strong indicator of validity and practicality. Moreover, by situating their work within the legal-ethical nexus of HIPAA and GDPR, they elevate the discourse from technical optimization to systemic accountability. This is not merely a benchmark; it is a normative intervention in the field. As medical AI continues to scale, the absence of privacy-centric evaluation benchmarks has been a critical blind spot; MedPriv-Bench fills it with precision and purpose.

Recommendations

  1. Extend MedPriv-Bench to include longitudinal and diagnostic AI applications beyond open-ended QA.
  2. Develop a public repository of synthetic contexts and queries to facilitate third-party validation and reproducibility.
  3. Collaborate with regulatory bodies (e.g., HIPAA oversight agencies, EU data protection authorities) to incorporate MedPriv-Bench into formal certification processes for medical AI systems.
