Sample Size Calculations for Developing Clinical Prediction Models: Overview and pmsims R package
arXiv:2602.23507v1
Abstract
Background: Clinical prediction models are increasingly used to inform healthcare decisions, but determining the minimum sample size for their development remains a critical and unresolved challenge. Inadequate sample sizes can lead to overfitting, poor generalisability, and biased predictions. Existing approaches, such as heuristic rules, closed-form formulas, and simulation-based methods, vary in flexibility and accuracy, particularly for complex data structures and machine learning models.
Methods: We review current methodologies for sample size estimation in prediction modelling and introduce a conceptual framework that distinguishes between mean-based and assurance-based criteria. Building on this, we propose a novel simulation-based approach that integrates learning curves, Gaussian Process optimisation, and assurance principles to identify sample sizes that achieve target performance with high probability. This approach is implemented in pmsims, an open-source, model-agnostic R package.
Results: Through case studies, we demonstrate that sample size estimates vary substantially across methods, performance metrics, and modelling strategies. Compared to existing tools, pmsims provides flexible, efficient, and interpretable solutions that accommodate diverse models and user-defined metrics while explicitly accounting for variability in model performance.
Conclusions: Our framework and software advance sample size methodology for clinical prediction modelling by combining flexibility with computational efficiency. Future work should extend these methods to hierarchical and multimodal data, incorporate fairness and stability metrics, and address challenges such as missing data and complex dependency structures.
Executive Summary
This article addresses the challenge of determining the minimum sample size for developing clinical prediction models. The authors review existing approaches (heuristic rules, closed-form formulas, and simulation-based methods) and introduce a conceptual framework that distinguishes between mean-based and assurance-based criteria. Building on this framework, they propose a simulation-based approach that combines learning curves, Gaussian Process optimisation, and assurance principles to find sample sizes that achieve a target performance with high probability, and they implement it in the open-source, model-agnostic pmsims R package. Case studies show that sample size estimates vary substantially across methods, performance metrics, and modelling strategies. Compared to existing tools, the authors report that pmsims accommodates diverse models and user-defined metrics while explicitly accounting for variability in model performance. Future work is needed to extend these methods to hierarchical and multimodal data, incorporate fairness and stability metrics, and address missing data and complex dependency structures. The work matters because inadequate sample sizes can lead to overfitting, poor generalisability, and biased predictions.
Key Points
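- The article introduces a conceptual framework for sample size estimation in prediction modelling that distinguishes between mean-based and assurance-based criteria.
- A novel simulation-based approach, combining learning curves, Gaussian Process optimisation, and assurance principles, is proposed and implemented in the pmsims R package.
- Case studies show that sample size estimates vary substantially across methods, performance metrics, and modelling strategies, and the authors present pmsims as more flexible and efficient than existing tools.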
Merits
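Clear conceptual framework
The article reviews existing methodologies and organises them within a framework that distinguishes between mean-based and assurance-based criteria. Assurance-based criteria, as used in the proposed approach, seek a sample size that achieves the target performance with high probability, which sets them apart from criteria defined on mean performance.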
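The difference between the two kinds of criterion can be illustrated with a small simulation. The sketch below is not the pmsims API and does not reproduce the paper's method: the learning curve in `simulate_performance`, the target of 0.75, the 80% assurance level, and all function names are assumptions made for illustration, and a plain grid search stands in for the Gaussian Process optimisation described in the abstract. It treats the mean-based criterion as "average simulated performance reaches the target" and the assurance-based criterion as "a chosen share of simulated models reach the target".

```python
import random
import statistics


def simulate_performance(n, reps, rng):
    """Draw `reps` simulated performance values for models developed on n cases.

    ASSUMPTION: a toy learning curve, not taken from the paper. Mean
    performance rises towards a plateau of 0.80 and the spread between
    repetitions shrinks as n grows.
    """
    mean = 0.80 - 1.0 / n ** 0.5
    sd = 0.5 / n ** 0.5
    return [rng.gauss(mean, sd) for _ in range(reps)]


def meets_mean_criterion(values, target):
    """Mean-based: the average performance reaches the target."""
    return statistics.fmean(values) >= target


def meets_assurance_criterion(values, target, assurance):
    """Assurance-based: at least `assurance` of the repetitions reach the target."""
    share = sum(v >= target for v in values) / len(values)
    return share >= assurance


def smallest_n(grid, criterion, reps=1000, seed=0):
    """Return the first n on the grid whose simulated values satisfy `criterion`."""
    rng = random.Random(seed)
    for n in grid:
        if criterion(simulate_performance(n, reps, rng)):
            return n
    return None  # target not reached anywhere on the grid


grid = range(100, 2001, 50)   # hypothetical candidate sample sizes
target = 0.75                 # hypothetical target performance

n_mean = smallest_n(grid, lambda v: meets_mean_criterion(v, target))
n_assurance = smallest_n(grid, lambda v: meets_assurance_criterion(v, target, 0.80))

print("mean-based n:", n_mean)
print("assurance-based n (80%):", n_assurance)
```

Under these assumptions the assurance-based criterion asks for a larger sample than the mean-based one, because it requires most simulated models, not just the average model, to reach the target.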
Flexibility and efficiency of pmsims
The pmsims R package offers flexible, efficient, and interpretable solutions for sample size estimation, accommodating diverse models and user-defined metrics.
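What "model-agnostic" with "user-defined metrics" can mean in a simulation setting is sketched below: the simulation loop only needs a data generator, a fitting function, and a metric, all passed in as plain functions. This is an illustrative Python sketch, not pmsims code. The names `generate`, `fit_ols`, `r_squared`, and `learning_curve`, the one-predictor linear data-generating process, and the choice of out-of-sample R² are all assumptions made for the example.

```python
import random
import statistics


def generate(n, rng):
    """Hypothetical data generator: y = x + noise, so the population R^2 ceiling is 0.5."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [(x, x + rng.gauss(0.0, 1.0)) for x in xs]


def fit_ols(train):
    """Fit a one-predictor least-squares line and return a prediction function."""
    xs = [x for x, _ in train]
    ys = [y for _, y in train]
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in train)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def r_squared(y_true, y_pred):
    """User-defined metric: out-of-sample R^2, computed as 1 - SSE / SST."""
    y_bar = statistics.fmean(y_true)
    sse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    sst = sum((y - y_bar) ** 2 for y in y_true)
    return 1.0 - sse / sst


def learning_curve(sizes, generate, fit, metric, reps=100, n_valid=1000, seed=1):
    """For each n, develop `reps` models and score each on one large validation set."""
    rng = random.Random(seed)
    valid = generate(n_valid, rng)
    x_valid = [x for x, _ in valid]
    y_valid = [y for _, y in valid]
    curve = {}
    for n in sizes:
        scores = []
        for _ in range(reps):
            predict = fit(generate(n, rng))
            scores.append(metric(y_valid, [predict(x) for x in x_valid]))
        curve[n] = scores
    return curve


curve = learning_curve([10, 25, 50, 100, 200], generate, fit_ols, r_squared)
for n, scores in curve.items():
    ordered = sorted(scores)
    lower = ordered[len(ordered) // 5]  # roughly the 20th percentile of the scores
    print(n, round(statistics.fmean(scores), 3), round(lower, 3))
```

Swapping in a different model or metric only means passing different functions to `learning_curve`. Keeping every repetition's score, not only the average, is what lets the variability in model performance be inspected at each sample size.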
Demerits
Limited coverage of hierarchical and multimodal data
The proposed methods do not yet extend to hierarchical and multimodal data, which are common in real-world clinical datasets.
No explicit handling of missing data and complex dependency structures
The authors' framework and software do not explicitly address challenges such as missing data and complex dependency structures, which can affect model performance and generalisability.
Expert Commentary
By pairing a conceptual framework with a novel simulation-based approach, the article addresses a critical challenge in clinical prediction model development. Its notable limitations are that hierarchical and multimodal data are not yet covered and that missing data and complex dependency structures are not explicitly handled. Future work should extend the methods in these directions so that the framework and software apply more readily to real-world clinical datasets. The pmsims R package has the potential to become a valuable tool for researchers and clinicians developing clinical prediction models.
Recommendations
- Extend the proposed framework and software to hierarchical and multimodal data, missing data, and complex dependency structures.
- Conduct further validation studies to assess the performance and reliability of the pmsims R package on real-world clinical datasets.