
ThReadMed-QA: A Multi-Turn Medical Dialogue Benchmark from Real Patient Questions


Monica Munnangi, Saiph Savage

arXiv:2603.11281v1 Announce Type: new Abstract: Medical question-answering benchmarks predominantly evaluate single-turn exchanges, failing to capture the iterative, clarification-seeking nature of real patient consultations. We introduce ThReadMed-QA, a benchmark of 2,437 fully-answered patient-physician conversation threads extracted from r/AskDocs, comprising 8,204 question-answer pairs across up to 9 turns. Unlike prior work relying on simulated dialogues, adversarial prompts, or exam-style questions, ThReadMed-QA captures authentic patient follow-up questions and verified physician responses, reflecting how patients naturally seek medical information online. We evaluate five state-of-the-art LLMs -- GPT-5, GPT-4o, Claude Haiku, Gemini 2.5 Flash, and Llama 3.3 70B -- on a stratified test split of 238 conversations (948 QA pairs) using a calibrated LLM-as-a-judge rubric grounded in physician ground truth. Even the strongest model, GPT-5, achieves only 41.2% fully-correct responses. All five models degrade significantly from turn 0 to turn 2 (p < 0.001), with wrong-answer rates roughly tripling by the third turn. We identify a fundamental tension between single-turn capability and multi-turn reliability: models with the strongest initial performance (GPT-5: 75.2; Claude Haiku: 72.3 out of 100) exhibit the steepest declines by turn 2 (dropping 16.2 and 25.0 points respectively), while weaker models plateau or marginally improve. We introduce two metrics to quantify multi-turn failure modes: Conversational Consistency Score (CCS) and Error Propagation Rate (EPR). CCS reveals that nearly one in three Claude Haiku conversations swings between a fully correct and a completely wrong response within the same thread. EPR shows that a single wrong turn raises the probability of a subsequent wrong turn by 1.9-6.1x across all models.
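The abstract reports that a wrong turn raises the probability of a subsequent wrong turn by 1.9-6.1x, but does not give the Error Propagation Rate formula. A minimal sketch, assuming EPR is the ratio of the conditional probabilities of a wrong turn given the previous turn's verdict (the function name and label scheme are illustrative, not the authors' definitions):

```python
def error_propagation_rate(threads):
    """Hypothetical formalization of EPR: P(wrong | previous wrong) divided
    by P(wrong | previous correct). Each thread is a list of per-turn
    verdicts, 'correct' or 'wrong', in conversation order."""
    wrong_after_wrong = wrong_after_correct = 0
    n_after_wrong = n_after_correct = 0
    for labels in threads:
        for prev, cur in zip(labels, labels[1:]):
            if prev == "wrong":
                n_after_wrong += 1
                wrong_after_wrong += cur == "wrong"
            else:
                n_after_correct += 1
                wrong_after_correct += cur == "wrong"
    p_given_wrong = wrong_after_wrong / n_after_wrong
    p_given_correct = wrong_after_correct / n_after_correct
    return p_given_wrong / p_given_correct
```

Under this reading, an EPR of 6.1 means a model is six times more likely to answer wrong after it has already erred in the same thread.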

Executive Summary

This study introduces ThReadMed-QA, a multi-turn medical dialogue benchmark that captures the iterative and clarification-seeking nature of real patient consultations. The benchmark consists of 2,437 patient-physician conversation threads extracted from the r/AskDocs community, comprising 8,204 question-answer pairs across up to 9 turns. The authors evaluate five state-of-the-art language models on ThReadMed-QA and find that even the strongest model, GPT-5, achieves only 41.2% fully-correct responses. The study highlights the tension between single-turn capability and multi-turn reliability, finding that models with strong initial performance degrade significantly in subsequent turns. The authors introduce two metrics to quantify multi-turn failure modes: Conversational Consistency Score and Error Propagation Rate.
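The per-turn degradation finding (wrong-answer rates roughly tripling by the third turn) implies grouping judged QA pairs by turn index. A minimal sketch of that analysis, assuming binary per-turn verdicts rather than the paper's full rubric:

```python
from collections import defaultdict

def fully_correct_rate_by_turn(threads):
    """Group per-turn verdicts by turn index and report the fraction judged
    fully correct at each turn. Verdict labels ('correct'/'wrong') are an
    assumption; the paper uses a calibrated LLM-as-a-judge rubric."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for labels in threads:
        for turn, label in enumerate(labels):
            totals[turn] += 1
            correct[turn] += label == "correct"
    return {turn: correct[turn] / totals[turn] for turn in sorted(totals)}
```

A monotonically falling curve from turn 0 to turn 2 would reproduce the degradation pattern the authors report for all five models.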

Key Points

  • ThReadMed-QA is a multi-turn medical dialogue benchmark built from real patient consultations rather than simulated dialogues or exam-style questions
  • The benchmark consists of 2,437 conversation threads with up to 9 turns and 8,204 question-answer pairs
  • Even the strongest language model, GPT-5, achieves only 41.2% fully-correct responses on ThReadMed-QA
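The thread structure described above (up to 9 turns, each pairing a patient question with a verified physician answer) can be sketched as a simple record type. The field names here are assumptions for illustration, not the benchmark's published schema:

```python
from dataclasses import dataclass

@dataclass
class QATurn:
    # One patient question paired with the verified physician answer.
    patient_question: str
    physician_answer: str

@dataclass
class ConsultThread:
    # Illustrative layout for one ThReadMed-QA conversation thread.
    thread_id: str
    turns: list[QATurn]  # up to 9 turns per thread
```

With 2,437 threads and 8,204 QA pairs, threads average roughly 3.4 turns each.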

Merits

Comprehensive Evaluation

The study evaluates five state-of-the-art models on a realistic medical dialogue task, exposing multi-turn failure modes that single-turn benchmarks do not capture.

Grounded in Real Data

The benchmark is based on real patient questions and verified physician responses, making it a more authentic representation of medical consultations.

New Metrics

The study introduces two novel metrics, Conversational Consistency Score and Error Propagation Rate, to quantify multi-turn failure modes in language models.
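The abstract states that nearly one in three Claude Haiku conversations swings between a fully correct and a completely wrong response within the same thread, but does not publish the CCS formula. A minimal sketch, assuming CCS flags such full-range swings over per-turn rubric scores (the thresholds and function names are assumptions):

```python
def thread_has_full_swing(turn_scores, full=100, zero=0):
    """Return True if a thread contains both a fully correct turn
    (score == 100) and a completely wrong turn (score == 0). This is an
    assumed operationalization of what CCS penalizes, not the paper's
    exact definition."""
    return any(s >= full for s in turn_scores) and any(s <= zero for s in turn_scores)

def fraction_inconsistent(all_threads):
    """Fraction of threads containing a full swing; the abstract reports
    roughly 1 in 3 for Claude Haiku."""
    flags = [thread_has_full_swing(scores) for scores in all_threads]
    return sum(flags) / len(flags)
```

A per-thread consistency score could then be defined as, e.g., one minus some normalized measure of score variance; the swing check above captures only the most extreme inconsistency the abstract describes.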

Demerits

Limited Generalizability

The study focuses on a specific online community (r/AskDocs) and may not generalize to other medical consultation settings.

Evaluation Metrics

While the proposed metrics (CCS and EPR) are novel, how they should be interpreted and how well they transfer beyond this benchmark still require validation.

Expert Commentary

The findings are significant and timely given the growing deployment of language models in medical AI systems. Two limitations warrant care: the data come from a single online community, and the proposed metrics have not yet been validated outside this benchmark. The implications for policy and practice also deserve further discussion. Overall, the study makes a valuable contribution to medical AI and underscores the need for research on the reliability of language models in multi-turn medical dialogue.

Recommendations

  • Developers should prioritize realistic, diverse multi-turn datasets such as ThReadMed-QA when evaluating and improving language models for medical dialogue, rather than relying on single-turn or exam-style benchmarks.
  • Researchers should explore and validate the proposed metrics, Conversational Consistency Score and Error Propagation Rate, to establish their reliability and applicability in evaluating language models.
