Brett Klamer

# Required Sample Size

## Prediction models

A sufficiently large sample is needed to limit overfitting and to support the generalizability of a prediction model. The R package pmsampsize (Ensor, Martin, and Riley 2021) provides an easy way to determine the sample size required for prediction modeling. Based on Riley et al. (2019a) and Riley et al. (2019b), it uses the following strategies:

1. Binary and time-to-event outcomes
    1. Choose the number of predictor variables/parameters that will be included in the model.
    2. Choose $$R^2_{CS, adj}$$ (based on a previously published, internally validated model in the same setting/population) or $$\max(R^2_{CS, app})$$ (the maximum value, determined by the population prevalence/risk).
    3. Calculate the sample size so that the estimated Van Houwelingen's global shrinkage factor ($$S_{VH}$$) is at least some chosen constant (commonly $$S_{VH} \geq 0.90$$).
    4. Calculate the sample size so that the absolute difference between $$R^2_{Nagelkerke, adj}$$ and $$R^2_{Nagelkerke, app}$$ is no more than some chosen constant (commonly $$\leq 0.05$$).
    5. Calculate the sample size so that the absolute margin of error for the population outcome risk is no more than some chosen constant (commonly $$\leq 0.05$$).
    6. The minimum sample size required for the model is the maximum sample size from steps 3-5.
2. Continuous outcomes
    1. Choose the number of predictor variables/parameters that will be included in the model.
    2. Choose $$R^2_{CS, adj}$$ (based on a previously published, internally validated model in the same setting/population) or $$R^2_{CS, app}$$ (the raw value from a previous externally validated model in the same setting/population, or an adjusted apparent measure).
    3. Calculate the sample size so that the estimated Van Houwelingen's global shrinkage factor ($$S_{VH}$$) is at least some chosen constant (commonly $$S_{VH} \geq 0.90$$).
    4. Calculate the sample size so that the absolute difference between $$R^2_{Nagelkerke, adj}$$ and $$R^2_{Nagelkerke, app}$$ is no more than some chosen constant (commonly $$\leq 0.05$$).
    5. Calculate the sample size so that the multiplicative margin of error for the residual standard deviation is within some chosen constant (commonly 10%).
    6. Calculate the sample size so that the multiplicative margin of error for the predicted mean outcome is within some chosen constant (commonly 10%).
    7. The minimum sample size required for the model is the maximum sample size from steps 3-6.
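
As a rough illustration of steps 3 to 6 for a binary outcome, here is a minimal sketch, assuming the closed-form expressions given in Riley et al. (2019b). It is written in Python rather than R and is not the pmsampsize implementation. The function names (`max_cox_snell_r2`, `n_for_shrinkage`, `binary_sample_size`) and the example inputs are hypothetical and chosen only for illustration. Time-to-event and continuous outcomes need different formulas and are not covered.

```python
import math


def max_cox_snell_r2(prevalence: float) -> float:
    """Largest attainable Cox-Snell R^2 for a binary outcome with this prevalence."""
    # Per-observation log-likelihood of the intercept-only (null) logistic model.
    ll_null = (prevalence * math.log(prevalence)
               + (1 - prevalence) * math.log(1 - prevalence))
    return 1 - math.exp(2 * ll_null)


def n_for_shrinkage(parameters: int, r2_cs_adj: float, shrinkage: float) -> int:
    """Smallest n whose expected global shrinkage factor is at least `shrinkage`."""
    n = parameters / ((shrinkage - 1) * math.log(1 - r2_cs_adj / shrinkage))
    return math.ceil(n)


def binary_sample_size(parameters: int, r2_cs_adj: float, prevalence: float,
                       shrinkage: float = 0.90, r2_diff: float = 0.05,
                       margin: float = 0.05) -> dict:
    """Sample size per criterion and the overall minimum (binary outcome only)."""
    # Step 3: target the chosen global shrinkage factor directly.
    n_shrinkage = n_for_shrinkage(parameters, r2_cs_adj, shrinkage)

    # Step 4: convert the allowed Nagelkerke R^2 optimism into an
    # equivalent shrinkage target, then reuse the step 3 formula.
    s_equiv = r2_cs_adj / (r2_cs_adj + r2_diff * max_cox_snell_r2(prevalence))
    n_r2_diff = n_for_shrinkage(parameters, r2_cs_adj, s_equiv)

    # Step 5: precise estimate of the overall outcome risk
    # (1.96 assumes a 95% confidence interval).
    n_risk = math.ceil((1.96 / margin) ** 2 * prevalence * (1 - prevalence))

    # Step 6: the largest of the three requirements wins.
    return {
        "shrinkage": n_shrinkage,
        "r2_difference": n_r2_diff,
        "outcome_risk": n_risk,
        "minimum_n": max(n_shrinkage, n_r2_diff, n_risk),
    }


# Illustrative inputs (assumed): 24 parameters, R^2_CS,adj = 0.288, prevalence = 0.174.
result = binary_sample_size(parameters=24, r2_cs_adj=0.288, prevalence=0.174)
print(result)
```

With these illustrative inputs the three criteria give 623, 662, and 221 participants, so the sketch returns 662 as the minimum sample size. In practice the package itself should be used for the calculation; the sketch is only meant to make the "maximum across criteria" logic concrete.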

Riley, Van Calster, and Collins (2021) provide further guidance on estimating the Cox-Snell $$R^2$$ from a reported ROC-AUC.

## Bibliography

Ensor, Joie, Emma C. Martin, and Richard D. Riley. 2021. Pmsampsize: Calculates the Minimum Sample Size Required for Developing a Multivariable Prediction Model. https://CRAN.R-project.org/package=pmsampsize.
Riley, Richard D., Ben Van Calster, and Gary S. Collins. 2021. “A Note on Estimating the Cox-Snell R2 from a Reported C Statistic (AUROC) to Inform Sample Size Calculations for Developing a Prediction Model with a Binary Outcome.” Statistics in Medicine 40 (4): 859–64. https://doi.org/10.1002/sim.8806.
Riley, Richard D., Kym I. E. Snell, Joie Ensor, Danielle L. Burke, Frank E. Harrell, Karel G. M. Moons, and Gary S. Collins. 2019a. “Minimum Sample Size for Developing a Multivariable Prediction Model: Part I - Continuous Outcomes.” Statistics in Medicine 38 (7): 1262–75. https://doi.org/10.1002/sim.7993.
Riley, Richard D., Kym I. E. Snell, Joie Ensor, Danielle L. Burke, Frank E. Harrell, Karel G. M. Moons, and Gary S. Collins. 2019b. “Minimum Sample Size for Developing a Multivariable Prediction Model: PART II - Binary and Time-to-Event Outcomes.” Statistics in Medicine 38 (7): 1276–96. https://doi.org/10.1002/sim.7992.
Published: 2022-02-14