
MedMT-Bench: Can LLMs Memorize and Understand Long Multi-Turn Conversations in Medical Scenarios?


Lin Yang, Yuancheng Yang, Xu Wang, Changkun Liu, Haihua Yang

arXiv:2603.23519v1 Announce Type: new Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities across various specialist domains and have been integrated into high-stakes areas such as medicine. However, existing medical benchmarks rarely stress-test the long-context memory, interference robustness, and safety defenses required in practice. To bridge this gap, we introduce MedMT-Bench, a challenging medical multi-turn instruction-following benchmark that simulates the entire diagnosis and treatment process. We construct the benchmark via scene-by-scene data synthesis refined by manual expert editing, yielding 400 test cases that closely mirror real-world application scenarios. Each test case spans an average of 22 rounds (maximum of 52), covering 5 types of difficult instruction-following issues. For evaluation, we propose an LLM-as-judge protocol with instance-level rubrics and atomic test points, validated against expert annotations with a human-LLM agreement of 91.94%. We test 17 frontier models, all of which underperform on MedMT-Bench (overall accuracy below 60.00%), with the best model reaching 59.75%. MedMT-Bench can be an essential tool for driving future research towards safer and more reliable medical AI. The benchmark is available at https://openreview.net/attachment?id=aKyBCsPOHB&name=supplementary_material

Executive Summary

This article introduces MedMT-Bench, a challenging medical multi-turn instruction-following benchmark designed to test how well Large Language Models (LLMs) handle the entire real-world diagnosis and treatment process. The benchmark consists of 400 test cases, each with an average of 22 rounds, covering five types of difficult instruction-following issues. The authors evaluate 17 frontier models, all of which underperform on the benchmark (the best reaches 59.75% overall accuracy), highlighting the need for further research toward safer and more reliable medical AI. MedMT-Bench is made available to the research community to drive progress in this area.

Key Points

  • MedMT-Bench is a novel medical multi-turn instruction-following benchmark designed to test LLMs in realistic clinical scenarios.
  • The benchmark consists of 400 test cases, each with an average of 22 rounds (maximum of 52), covering five types of difficult instruction-following issues.
  • All 17 frontier models evaluated underperform on MedMT-Bench, highlighting the need for further research in medical AI.

Merits

Strength

The authors provide a comprehensive and well-structured benchmark that simulates real-world medical scenarios, addressing a significant gap in existing medical-related benchmarks.

Strength

The LLM-as-judge protocol with instance-level rubrics and atomic test points is a novel and effective evaluation method for benchmarking medical AI models.
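The paper does not publish its judging code, but the idea of instance-level rubrics decomposed into atomic test points can be sketched as follows. This is a minimal, hypothetical illustration: the `TestPoint` class, the example rubric items, and the equal-weight aggregation are assumptions, not the authors' implementation, and in practice each `passed` verdict would come from an LLM judge rather than a hardcoded boolean.

```python
from dataclasses import dataclass

@dataclass
class TestPoint:
    """One atomic, binary-checkable requirement from an instance-level rubric."""
    description: str
    passed: bool  # in the real protocol, an LLM judge would decide this

def score_case(points: list[TestPoint]) -> float:
    """Score a test case as the fraction of its atomic test points satisfied."""
    if not points:
        return 0.0
    return sum(p.passed for p in points) / len(points)

# Hypothetical rubric for one multi-turn medical conversation.
points = [
    TestPoint("Recalls the drug allergy stated in an early turn", True),
    TestPoint("Declines to prescribe without confirming dosage history", True),
    TestPoint("Avoids contradicting the earlier working diagnosis", False),
]

print(f"case score: {score_case(points):.3f}")  # 2 of 3 points satisfied
```

Aggregating overall accuracy across all 400 cases from such per-case scores is one plausible way to arrive at headline numbers like the reported 59.75%, though the paper's exact weighting may differ.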

Strength

The benchmark is made available for the research community, facilitating progress in the development of safer and more reliable medical AI.

Demerits

Limitation

The article does not provide a detailed analysis of the potential biases in the benchmark data or the evaluation protocol.

Limitation

The authors do not explore the potential applications of MedMT-Bench beyond medical AI research, limiting its potential impact.

Limitation

The article does not clearly explain how the benchmark should be used to guide future research in medical AI, which may make it harder for others to build on the results.

Expert Commentary

MedMT-Bench is a significant contribution to medical AI research, providing a comprehensive and well-structured benchmark for evaluating LLMs in realistic clinical scenarios. Its findings underscore the need for further work on medical AI safety, since even frontier models fall short on long multi-turn interactions. However, the article's limitations, notably the absence of a detailed analysis of potential biases in the benchmark data and the evaluation protocol, should be addressed in future work. The findings also have implications for policymakers, who must weigh the risks and benefits of integrating AI systems into healthcare settings and develop regulations that ensure safe and effective adoption.

Recommendations

  • Future research should focus on addressing the limitations of MedMT-Bench, such as the lack of detailed analysis of potential biases in the benchmark data or the evaluation protocol.
  • The development of more comprehensive and well-structured benchmarks, such as MedMT-Bench, should be encouraged to facilitate progress in the development of reliable and safe AI systems for medical applications.

Sources

Original: arXiv - cs.CL