Brett Klamer

Required Sample Size

Prediction models

An adequate sample size is needed to limit overfitting and support the generalizability of a prediction model. The R package pmsampsize (Ensor, Martin, and Riley 2021) provides an easy way to determine the sample size required for developing a prediction model. Based on Riley et al. (2019a) and Riley et al. (2019b), it uses the following strategies:

  1. Binary and time to event outcomes
    1. Choose the number of predictor variables/parameters that will be included in the model.
    2. Choose \(R^2_{CS, adj}\) (based on a previously published, internally validated model in the same setting/population) or \(\max(R^2_{CS, app})\) (maximum value determined by the population prevalence/risk).
    3. Calculate the sample size so that the estimated Van Houwelingen's global shrinkage factor (\(S_{VH}\)) is at least some chosen constant (commonly \(S_{VH} \geq 0.90\)).
    4. Calculate the sample size so that the absolute difference between \(R^2_{Nagelkerke, adj}\) and \(R^2_{Nagelkerke, app}\) is no larger than some constant (commonly \(\leq 0.05\)).
    5. Calculate the sample size so that the absolute margin of error for the estimated population outcome risk is no larger than a chosen constant (commonly \(\leq 0.05\)).
    6. The minimum sample size required for the model is the maximum sample size from steps 3-5.
  2. Continuous outcomes
    1. Choose the number of predictor variables/parameters that will be included in the model.
    2. Choose \(R^2_{CS, adj}\) (based on a previously published, internally validated model in the same setting/population) or \(R^2_{CS, app}\) (raw value from a previous externally validated model in the same setting/population, or adjust an apparent measure).
    3. Calculate the sample size so that the estimated Van Houwelingen's global shrinkage factor (\(S_{VH}\)) is at least some chosen constant (commonly \(S_{VH} \geq 0.90\)).
    4. Calculate the sample size so that the absolute difference between the adjusted and apparent \(R^2\) (\(R^2_{adj}\) and \(R^2_{app}\)) is no larger than some constant (commonly \(\leq 0.05\)).
    5. Calculate the sample size so that the residual standard deviation is estimated with a multiplicative margin of error within some constant (commonly 10%) of the true value.
    6. Calculate the sample size so that the mean predicted outcome is estimated with a multiplicative margin of error within some constant (commonly 10%) of the true value.
    7. The minimum sample size required for the model is the maximum sample size from steps 3-6.
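
The binary-outcome steps above can be sketched in a few lines of Python. This is a minimal illustration of the closed-form criteria as described in Riley et al. (2019b), not the pmsampsize implementation. The function and argument names are hypothetical, the defaults mirror the "commonly" used constants in the list, and the 1.96 multiplier assumes a 95% confidence interval for the overall risk.

```python
import math


def shrinkage_sample_size(n_params, r2_cs, shrinkage):
    """Sample size at which the expected global shrinkage factor equals `shrinkage`."""
    return n_params / ((shrinkage - 1) * math.log(1 - r2_cs / shrinkage))


def max_r2_cs(prevalence):
    """Largest attainable Cox-Snell R^2 for a binary outcome with this prevalence."""
    lnl_null_per_obs = (prevalence * math.log(prevalence)
                        + (1 - prevalence) * math.log(1 - prevalence))
    return 1 - math.exp(2 * lnl_null_per_obs)


def binary_min_sample_size(n_params, r2_cs, prevalence,
                           shrinkage=0.90, r2_diff=0.05, risk_moe=0.05):
    """Hypothetical helper: sample size for each criterion and their maximum."""
    # Step 3: expected shrinkage of at least `shrinkage`.
    n_shrinkage = math.ceil(shrinkage_sample_size(n_params, r2_cs, shrinkage))

    # Step 4: shrinkage needed so that apparent and adjusted Nagelkerke R^2
    # differ by at most `r2_diff`, then the sample size for that shrinkage.
    s_needed = r2_cs / (r2_cs + r2_diff * max_r2_cs(prevalence))
    n_r2 = math.ceil(shrinkage_sample_size(n_params, r2_cs, s_needed))

    # Step 5: overall outcome risk estimated to within +/- `risk_moe`
    # (95% confidence interval assumed).
    n_risk = math.ceil((1.96 / risk_moe) ** 2 * prevalence * (1 - prevalence))

    # Step 6: the largest requirement wins.
    return {
        "shrinkage": n_shrinkage,
        "r2_difference": n_r2,
        "overall_risk": n_risk,
        "minimum": max(n_shrinkage, n_r2, n_risk),
    }


# Illustrative inputs: 24 candidate parameters, anticipated Cox-Snell R^2
# of 0.288, and an outcome prevalence of 0.174.
print(binary_min_sample_size(n_params=24, r2_cs=0.288, prevalence=0.174))
```

With these illustrative inputs the three criteria give 623, 662, and 221 participants, so the \(R^2_{Nagelkerke}\) criterion drives the minimum of 662.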

Riley, Van Calster, and Collins (2021) provide further guidance on estimating the Cox-Snell \(R^2\) from a reported ROC-AUC (C statistic).
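
A rough sketch of that idea, assuming the approach is to simulate a linear predictor that is \(N(0, 1)\) in non-events and \(N(\mu, 1)\) in events, with \(\mu = \sqrt{2}\,\Phi^{-1}(C)\) chosen to reproduce the reported C statistic. The function name, simulation size, and seed below are hypothetical. As a simplification, the log-likelihood is evaluated at the data-generating logistic coefficients instead of refitting a logistic model, so the result should be read as an approximation only.

```python
import math
import random
from statistics import NormalDist


def approx_r2_cs_from_c(c_statistic, prevalence, n=100_000, seed=2021):
    """Simulation-based approximation of the Cox-Snell R^2 implied by a C statistic."""
    # Separation between event and non-event linear predictors implied by C.
    mu = math.sqrt(2) * NormalDist().inv_cdf(c_statistic)
    n_events = round(n * prevalence)
    p = n_events / n

    # Under this two-normal setup the true model is logistic in the LP:
    # logit P(Y = 1 | LP) = logit(p) - mu^2 / 2 + mu * LP.
    intercept = math.log(p / (1 - p)) - mu ** 2 / 2

    rng = random.Random(seed)
    loglik_model = 0.0
    for i in range(n):
        y = 1 if i < n_events else 0
        lp = rng.gauss(mu if y else 0.0, 1.0)
        eta = intercept + mu * lp
        # Bernoulli log-likelihood contribution: y * eta - log(1 + exp(eta)).
        loglik_model += y * eta - math.log1p(math.exp(eta))

    # Intercept-only (null) model log-likelihood.
    loglik_null = n * (p * math.log(p) + (1 - p) * math.log(1 - p))

    # Cox-Snell R^2 = 1 - exp(-LR / n), with LR = 2 * (lnL_model - lnL_null).
    return 1 - math.exp(-2 * (loglik_model - loglik_null) / n)


# Illustrative inputs: reported C statistic of 0.80, outcome prevalence of 0.20.
print(round(approx_r2_cs_from_c(0.80, 0.20), 3))
```

The returned value can then be used as the anticipated \(R^2_{CS}\) in the sample size criteria. A C statistic of 0.5 yields a value of essentially zero, and larger C statistics yield larger values, bounded above by the maximum \(R^2_{CS}\) for the given prevalence.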

Bibliography

Ensor, Joie, Emma C. Martin, and Richard D. Riley. 2021. Pmsampsize: Calculates the Minimum Sample Size Required for Developing a Multivariable Prediction Model. https://CRAN.R-project.org/package=pmsampsize.
Riley, Richard D., Ben Van Calster, and Gary S. Collins. 2021. “A Note on Estimating the Cox-Snell \(R^2\) from a Reported C Statistic (AUROC) to Inform Sample Size Calculations for Developing a Prediction Model with a Binary Outcome.” Statistics in Medicine 40 (4): 859–64. https://doi.org/10.1002/sim.8806.
Riley, Richard D., Kym I. E. Snell, Joie Ensor, Danielle L. Burke, Frank E. Harrell, Karel G. M. Moons, and Gary S. Collins. 2019a. “Minimum Sample Size for Developing a Multivariable Prediction Model: Part I - Continuous Outcomes.” Statistics in Medicine 38 (7): 1262–75. https://doi.org/10.1002/sim.7993.
Riley, Richard D., Kym I. E. Snell, Joie Ensor, Danielle L. Burke, Frank E. Harrell, Karel G. M. Moons, and Gary S. Collins. 2019b. “Minimum Sample Size for Developing a Multivariable Prediction Model: Part II - Binary and Time-to-Event Outcomes.” Statistics in Medicine 38 (7): 1276–96. https://doi.org/10.1002/sim.7992.
Published: 2022-02-14