Can LLM Agents Generate Real-World Evidence? Evaluating Observational Studies in Medical Databases
arXiv:2603.22767v1 Announce Type: new Abstract: Observational studies can yield clinically actionable evidence at scale, but executing them on real-world databases is open-ended and requires coherent decisions across cohort construction, analysis, and reporting. Prior evaluations of LLM agents emphasize isolated steps or single answers, missing the integrity and internal structure of the resulting evidence bundle. To address this gap, we introduce RWE-bench, a benchmark grounded in MIMIC-IV and derived from peer-reviewed observational studies. Each task provides the corresponding study protocol as the reference standard, requiring agents to execute experiments in a real database and iteratively generate tree-structured evidence bundles. We evaluate six LLMs (three open-source, three closed-source) under three agent scaffolds using both question-level correctness and end-to-end task metrics. Across 162 tasks, task success is low: the best agent reaches 39.9%, and the best open-source model reaches 30.4%. Agent scaffolds also matter substantially, causing over 30% variation in performance metrics. Furthermore, we implement an automated cohort evaluation method to rapidly localize errors and identify agent failure modes. Overall, the results highlight persistent limitations in agents' ability to produce end-to-end evidence bundles, and efficient validation remains an important direction for future work. Code and data are available at https://github.com/somewordstoolate/RWE-bench.
Executive Summary
The article evaluates the capability of large language model (LLM) agents to generate real-world evidence (RWE) through observational studies on real medical databases. Introducing RWE-bench, a benchmark grounded in MIMIC-IV, the authors assess six LLMs across three agent scaffolds using both question-level correctness and end-to-end task metrics. The results reveal significant limitations: across 162 tasks, the best agent reaches only 39.9% task success, and the best open-source model reaches 30.4%. The choice of agent scaffold exerts a material influence, varying performance metrics by over 30%. The authors also introduce an automated cohort evaluation method to rapidly localize errors and identify agent failure modes. The study underscores persistent challenges in producing coherent, end-to-end evidence bundles and highlights the need for improved validation mechanisms in LLM-driven RWE generation.
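The paper does not detail how its automated cohort evaluation works internally; as an illustrative sketch only, one common way to localize cohort-construction errors is to compare the patient IDs an agent selects against the reference cohort from the study protocol and report set-overlap metrics. All names below (`cohort_metrics`, the example IDs) are hypothetical, not from the paper.

```python
def cohort_metrics(agent_ids, reference_ids):
    """Compare an agent-built cohort to a reference cohort via set overlap.

    Returns precision (how many selected patients belong in the cohort),
    recall (how much of the true cohort was recovered), and Jaccard
    similarity. Low precision suggests overly loose inclusion criteria;
    low recall suggests criteria that are too strict or mistranslated.
    """
    agent, ref = set(agent_ids), set(reference_ids)
    overlap = len(agent & ref)
    union = len(agent | ref)
    return {
        "precision": overlap / len(agent) if agent else 0.0,
        "recall": overlap / len(ref) if ref else 0.0,
        "jaccard": overlap / union if union else 0.0,
    }

# Hypothetical example: agent selected patients {101, 102, 103},
# reference cohort is {102, 103, 104}.
m = cohort_metrics([101, 102, 103], [102, 103, 104])
```

Because the metrics are computed per cohort-definition step, a sharp drop at one step (e.g., after applying an exclusion criterion) points directly at the decision where the agent diverged from the protocol.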
Key Points
- ▸ Low overall task success rates (max 39.9%)
- ▸ Agent scaffold variations impact performance by over 30%
- ▸ RWE-bench provides a structured benchmark for evaluating end-to-end evidence bundle generation
Merits
Innovative Benchmark Design
RWE-bench is a novel, structured benchmark that integrates study protocols as reference standards and evaluates end-to-end evidence bundle generation, filling a critical gap in prior evaluations that focused on isolated steps.
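The abstract describes the agent's output as a "tree-structured evidence bundle" without specifying its schema; purely as a hedged sketch, such a bundle might be modeled as a node tree whose top-level branches mirror the study stages named in the paper (cohort construction, analysis, reporting). The `EvidenceNode` class and field names here are assumptions for illustration, not the benchmark's actual format.

```python
from dataclasses import dataclass, field


@dataclass
class EvidenceNode:
    """One node in a hypothetical tree-structured evidence bundle."""
    name: str
    content: str = ""
    children: list["EvidenceNode"] = field(default_factory=list)

    def size(self) -> int:
        """Total number of nodes in this subtree."""
        return 1 + sum(child.size() for child in self.children)


# Illustrative bundle skeleton following the stages named in the abstract.
bundle = EvidenceNode("study", children=[
    EvidenceNode("cohort", "inclusion/exclusion criteria and patient counts"),
    EvidenceNode("analysis", children=[
        EvidenceNode("primary_outcome", "effect estimate with CI"),
    ]),
    EvidenceNode("report", "structured summary of findings"),
])
```

A tree representation like this makes question-level grading natural: each protocol question can be matched against one subtree, while end-to-end task success requires every required node to be present and correct.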
Demerits
Performance Limitations
Despite the benchmark’s rigor, the low success rates—particularly among open-source models—reveal persistent operational constraints in LLM agents’ ability to translate protocol-driven tasks into clinically actionable evidence.
Expert Commentary
This study represents a pivotal contribution to the emerging field of AI-assisted clinical evidence generation. The authors rightly identify a critical gap in prior evaluations: the tendency to assess LLMs in siloed, stepwise contexts rather than as integrated agents producing coherent evidence bundles. The RWE-bench framework is a substantial methodological advance, offering a replicable, protocol-aligned evaluation paradigm that aligns with real-world clinical data workflows.

However, the results are sobering, particularly the sub-40% success rate across even the best-performing models. This signals a fundamental challenge: LLMs, despite their linguistic prowess, currently lack the domain-specific interpretive capacity and procedural fidelity required to replicate the nuanced decision-making inherent in observational study design and analysis. The automated cohort evaluation tool adds a valuable layer of diagnostic precision, enabling more targeted debugging of agent failures.

Moving forward, the field must pivot from benchmarking isolated outputs to evaluating integrated cognitive pipelines, incorporating domain-aware reasoning, counterfactual validation, and auditability. Without these advances, the promise of AI-driven RWE remains aspirational rather than actionable.
Recommendations
- ✓ Develop domain-adapted LLM architectures with embedded clinical reasoning modules
- ✓ Integrate automated validation pipelines as standard components in LLM-assisted research workflows
Sources
Original: arXiv - cs.AI