Brett Klamer

Resampling Methods

There are two common implementations of internal validation: the bootstrap and K-fold (aka V-fold) cross-validation.

The simple nonparametric bootstrap performance estimate is calculated by averaging apparent model performance across bootstrap resamples. Efron showed that a refined method has less bias: it calculates and subtracts the ‘optimism’ of the apparent estimate (for a coherent explanation, see section 17.6 of Efron and Tibshirani 1994). Although this refined bootstrap method is better, it can still be biased in various scenarios, such as small-N-large-P problems or when using a discontinuous performance measure such as the proportion classified correctly. Efron introduced the .632 and .632+ bootstrap to address these known biases; the name comes from the probability that an observation is included in a bootstrap resample, \(P(\text{observation included}) = 1-(1-\frac{1}{n})^n \to 1 - e^{-1} \approx 0.632\) as \(n\to\infty\). The .632 and .632+ methods calculate a weighted performance estimate that combines the estimate from the out-of-bootstrap (out-of-bag) observations (weight .632) with the apparent performance estimate on the original sample (weight .368) (Efron and Tibshirani 1997). However, the original refined bootstrap method is most commonly used because 1) it is simple, and 2) when used for model selection (where performance estimates are not the primary outcome), a consistent bias will still tend to lead to a consistent choice of the optimal model.
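
As a quick numerical check of the 0.632 figure, the inclusion probability can be computed directly (a trivial R snippet; the choice of n is arbitrary):

```r
## Probability that a given observation appears at least once in a bootstrap
## resample of size n, compared with the limiting value 1 - exp(-1).
n <- 1000
1 - (1 - 1/n)^n   # ~0.632 (approaches the limit from above)
1 - exp(-1)       # 0.6321206, the limiting value
```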

K-fold CV is useful for estimating the predictive performance of a single pre-selected model. If you would like to choose from competing models (Note: ‘models’ refers to the various choices of learning methods, variable selection methods, and other tuning parameters/hyperparameters) and interpret the resulting predictive performance, then you’ll want to use nested CV (or the double bootstrap). Nested K-fold CV uses training, validation, and test sets, while simple K-fold CV uses only training and test sets. The training set is used to fit all competing models, the validation set is used to select the top-performing model, and the test set is used to estimate the performance of the top model. A single run of K-fold CV may result in highly variable estimates, so repeated runs are recommended to ensure stable estimates. See Arlot and Celisse (2010) and Krstajic et al. (2014) for more information.

Below I provide detailed algorithms for these two approaches. However, there are many different versions of cross-validation and no single version is optimal in every situation.

Bootstrap validation

The algorithm for bootstrap optimism correction (Efron and Tibshirani 1994; Harrell, Lee, and Mark 1996) is defined as follows (a short R sketch comes after the list):

  1. Fit the model to the original data and calculate the apparent performance estimate \(S_{app}\).
  2. For \(n = 1, ..., N\)
    1. Generate a bootstrapped dataset (with replacement) from the original data.
    2. Fit the model to resample \(n\).
    3. Calculate the apparent bootstrapped performance estimate \(S_{n_{boot}}\).
    4. Calculate an additional performance estimate \(S_{n_{boot:orig}}\) by evaluating the bootstrapped model on the original data.
  3. Calculate the optimism of the apparent performance estimate. \[O = \frac{1}{N} \sum_{1}^{N} (S_{n_{boot}} - S_{n_{boot:orig}})\]
  4. Calculate the optimism adjusted performance estimate. \[S_{adj} = S_{app} - O\]
  5. Report the model’s validated performance estimate using \(S_{adj}\).
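
For concreteness, here is a minimal R sketch of the algorithm above. The simulated data, the linear model, and mean squared error as the performance measure are arbitrary choices for illustration, not part of the method itself.

```r
set.seed(1)

## Simulated example data (hypothetical).
n_obs <- 200
dat <- data.frame(x1 = rnorm(n_obs), x2 = rnorm(n_obs))
dat$y <- 1 + 0.5 * dat$x1 - 0.3 * dat$x2 + rnorm(n_obs)

## Performance measure: mean squared error (lower is better).
mse <- function(fit, data) mean((data$y - predict(fit, newdata = data))^2)

## 1. Apparent performance on the original data.
fit_app <- lm(y ~ x1 + x2, data = dat)
S_app <- mse(fit_app, dat)

## 2. Bootstrap resamples: apparent bootstrap performance minus performance
##    of the bootstrap model on the original data.
N <- 500
optimism <- replicate(N, {
  boot <- dat[sample(nrow(dat), replace = TRUE), ]
  fit_boot <- lm(y ~ x1 + x2, data = boot)
  mse(fit_boot, boot) - mse(fit_boot, dat)   # S_n,boot - S_n,boot:orig
})

## 3-5. Optimism-adjusted performance estimate.
S_adj <- S_app - mean(optimism)
S_adj
```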

The following method is useful when you wish to compare competing models and report the performance of the selected model. The algorithm for two-stage bootstrap optimism correction is defined as follows (an R sketch comes after the list):

  1. For each model \(i = 1, \dots, I\)
    1. Fit model \(i\) to the original data.
    2. Calculate the apparent performance estimate \(S_{i_{app}}\).
  2. Determine which model has the best performance.
    1. For resamples \(j = 1, ..., J\)
      1. Generate a bootstrapped dataset (with replacement) from the original data.
      2. For each model \(i = 1, \dots, I\)
        1. Fit model \(i\) to resample \(j\).
        2. Calculate the performance estimate \(S_{ij_{boot}}\).
        3. Calculate an additional performance estimate \(S_{ij_{boot:orig}}\) by evaluating bootstrapped model \(i\) on the original data.
    2. Calculate the optimism of the apparent performance estimate. For each model \(i\), \[O_i = \frac{1}{J} \sum_{1}^{J} (S_{ij_{boot}} - S_{ij_{boot:orig}})\]
    3. Calculate the optimism adjusted performance estimate. For each model \(i\), \[S_{i_{adj}} = S_{i_{app}} - O_i\]
    4. Determine which model has the best prediction performance and proceed with the selected model \(i\).
  3. Estimate the prediction performance of the selected model.
    1. For resamples \(k = 1, ..., K\)
      1. Generate a bootstrapped dataset (with replacement) from the original data.
      2. Fit the selected model to resample \(k\).
      3. Calculate the performance estimate \(S_{k_{boot}}\).
      4. Calculate an additional performance estimate \(S_{k_{boot:orig}}\) by evaluating the bootstrapped model on the original data.
    2. Calculate the optimism of the apparent performance estimate. \[O_{top} = \frac{1}{K} \sum_{1}^{K} (S_{k_{boot}} - S_{k_{boot:orig}})\]
    3. Calculate the optimism adjusted performance estimate, where \(S_{{top}_{app}}\) is the apparent estimate of the selected model from step 1. \[S_{{top}_{adj}} = S_{{top}_{app}} - O_{top}\]
    4. Report the model’s validated performance estimate using \(S_{{top}_{adj}}\).
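
Continuing the toy example above, the two-stage procedure can reuse a helper that returns the optimism-adjusted estimate for a single model (lm and MSE remain placeholder choices; the candidate formulas are hypothetical):

```r
## Optimism-adjusted MSE for one model formula (assumes the response is 'y').
optimism_adjusted_mse <- function(formula, data, B = 500) {
  mse <- function(fit, d) mean((d$y - predict(fit, newdata = d))^2)
  S_app <- mse(lm(formula, data = data), data)
  O <- replicate(B, {
    boot <- data[sample(nrow(data), replace = TRUE), ]
    fit_b <- lm(formula, data = boot)
    mse(fit_b, boot) - mse(fit_b, data)
  })
  S_app - mean(O)
}

## Stage 1: compare candidate models and select the best (lowest adjusted MSE).
candidates <- list(m1 = y ~ x1, m2 = y ~ x2, m3 = y ~ x1 + x2)
stage1 <- vapply(candidates, optimism_adjusted_mse, numeric(1), data = dat)
best <- names(which.min(stage1))

## Stage 2: rerun the correction for the selected model with fresh resamples;
## report this value as the validated performance estimate.
S_top_adj <- optimism_adjusted_mse(candidates[[best]], data = dat)
```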

The names ‘two-stage bootstrap’ and ‘double bootstrap’ might be used interchangeably. From what I’ve seen, ‘double bootstrap’ refers to cases where bootstrap resamples are taken twice: once from the original sample, and again from each bootstrap resample. This version may be useful for creating confidence intervals (e.g., the standard percentile-based bootstrap confidence interval of the estimator) or for bias adjustment of the optimism correction term. An example of the ‘double bootstrap’ is seen in Noma et al. (2021) (although they use the name ‘two-stage bootstrap’…).

There isn’t much research on combining multiple imputation with bootstrap optimism correction. Bartlett and Hughes (2020) provide a nice summary in the context of parameter variance estimation and confidence interval estimation. They found that bootstrapping followed by imputation was robust under model misspecification and uncongeniality. Taking this as a hint for how to proceed, we can define two-stage bootstrap-MI optimism correction as follows (a rough sketch of the structure comes after the list):

  1. Select a multiple imputation method for which the assumption that the data are conditionally missing at random is justifiable.
  2. For each imputation \(h = 1, \dots, H\)
    1. Generate imputed original dataset \(D_{h, orig}\).
    2. For each model \(i = 1, \dots, I\)
      1. Fit model \(i\) to dataset \(D_{h, orig}\).
  3. For each model \(i = 1, \dots, I\)
    1. Create pooled model \(i\) from the \(H\) multiple imputations using Rubin’s rules.
    2. Calculate the apparent pooled performance estimate \(S_{{i}_{app}}\).
  4. Determine which model has the best performance.
    1. For resamples \(j = 1, ..., J\)
      1. Generate a bootstrapped dataset (with replacement) from the original data.
      2. For each imputation \(h = 1, \dots, H\)
        1. Generate imputed bootstrapped dataset \(D_{hj, boot}\).
        2. For each model \(i = 1, \dots, I\)
          1. Fit model \(i\) to imputed resample \(D_{hj, boot}\).
      3. For each model \(i = 1, \dots, I\)
        1. Create pooled model \(i\) from the \(H\) multiple imputations using Rubin’s rules.
        2. Calculate the pooled performance estimate \(S_{{ij}_{boot}}\).
        3. Calculate an additional pooled performance estimate \(S_{ij_{boot:orig}}\) by evaluating pooled model \(i\) on the original imputed data \(D_{h, orig}\).
    2. Calculate the optimism of the apparent pooled performance estimate. For each model \(i\), \[O_{i} = \frac{1}{J} \sum_{1}^{J} (S_{ij_{boot}} - S_{ij_{boot:orig}})\]
    3. Calculate the optimism adjusted pooled performance estimate. For each model \(i\), \[S_{i_{adj}} = S_{i_{app}} - O_i\]
    4. Determine which model has the best prediction performance and proceed with the selected model \(i\).
  5. Estimate the prediction performance of the selected model.
    1. For resamples \(k = 1, ..., K\)
      1. Generate a bootstrapped dataset (with replacement) from the original data.
      2. For each imputation \(h = 1, \dots, H\)
        1. Generate imputed bootstrapped dataset \(D_{hk, boot}\).
        2. Fit the selected model to imputed resample \(D_{hk, boot}\).
      3. Create the pooled model from the \(H\) multiple imputations using Rubin’s rules.
      4. Calculate the pooled performance estimate \(S_{{k}_{boot}}\).
      5. Calculate an additional pooled performance estimate \(S_{k_{boot:orig}}\) by evaluating the pooled model on the original imputed data \(D_{h, orig}\).
    2. Calculate the optimism of the apparent pooled performance estimate. \[O_{top} = \frac{1}{K} \sum_{1}^{K} (S_{k_{boot}} - S_{k_{boot:orig}})\]
    3. Calculate the optimism adjusted pooled performance estimate. \[S_{top_{adj}} = S_{top_{app}} - O_{top}\]
    4. Report the model’s validated pooled performance estimate using \(S_{{top}_{adj}}\).
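
As a rough illustration of the structure only (one optimism iteration, one model), the bootstrap-then-impute steps might look like the sketch below. The mice package, the linear model, MSE, and the dat_miss object (a hypothetical version of the data containing missing values) are illustrative assumptions; see the psfmi package linked below for complete implementations.

```r
library(mice)

## Rubin's rules for point estimates: average coefficients across imputations.
pooled_coefs <- function(data_list, formula) {
  rowMeans(sapply(data_list, function(d) coef(lm(formula, data = d))))
}

## Evaluate the pooled coefficients on each imputed dataset and average the MSE.
pooled_mse <- function(beta, data_list, formula) {
  mean(sapply(data_list, function(d) {
    X <- model.matrix(formula, data = d)
    mean((d$y - drop(X %*% beta))^2)
  }))
}

H <- 5
form <- y ~ x1 + x2

## Imputed versions of the original data (D_h,orig).
imp_orig  <- mice(dat_miss, m = H, printFlag = FALSE)
orig_list <- lapply(seq_len(H), function(h) complete(imp_orig, h))

## One bootstrap iteration: resample rows first, then impute the resample.
boot      <- dat_miss[sample(nrow(dat_miss), replace = TRUE), ]
imp_boot  <- mice(boot, m = H, printFlag = FALSE)
boot_list <- lapply(seq_len(H), function(h) complete(imp_boot, h))

beta_boot  <- pooled_coefs(boot_list, form)
optimism_k <- pooled_mse(beta_boot, boot_list, form) -  # S_k,boot
              pooled_mse(beta_boot, orig_list, form)    # S_k,boot:orig
## Averaging optimism_k over K bootstrap iterations gives O_top, and
## S_top,adj = S_top,app - O_top as in the algorithm above.
```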

Practical implementations for this and similar methods can be found at https://cran.r-project.org/web/packages/psfmi.

K-fold cross-validation

The algorithm for repeated K-fold cross-validation is defined as follows (a short R sketch comes after the list):

  1. Use repeated cross-validation for stable prediction estimates.
  2. For each repeat \(n = 1, \dots, N\)
    1. Randomly divide (or stratify) the dataset into \(K\) folds.
    2. For each fold \(k = 1, \dots, K\)
      1. Let fold \(k\) be the test set and the remaining \(K-1\) folds used for training.
      2. Fit the model on the training set.
        • Note: ‘model’ refers to the learning method, variable selection method, and tuning parameter/hyperparameter choices.
      3. Calculate the performance estimate \(S_{kn}\) by evaluating the model on test set \(k\).
  3. Calculate the mean prediction performance estimate \(\bar{S}\) across all folds and repeats. \[\bar{S} = \frac{1}{KN} \sum_1^{N} \sum_1^{K} S_{kn}\]
  4. Report the cross-validated prediction performance \(\bar{S}\).
  5. Fit the final model on the original dataset.
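
A minimal R sketch of repeated K-fold cross-validation, with the same placeholder model and performance measure as in the bootstrap examples:

```r
## Repeated K-fold CV estimate of MSE (linear model and MSE are arbitrary
## example choices; assumes the response is named 'y').
repeated_cv_mse <- function(formula, data, K = 10, N = 20) {
  est <- replicate(N, {
    folds <- sample(rep(seq_len(K), length.out = nrow(data)))
    sapply(seq_len(K), function(k) {
      fit  <- lm(formula, data = data[folds != k, ])
      test <- data[folds == k, ]
      mean((test$y - predict(fit, newdata = test))^2)
    })
  })
  mean(est)   # average over all K x N fold estimates
}

repeated_cv_mse(y ~ x1 + x2, data = dat)

## The final model to report/use is fit on the full original dataset.
final_fit <- lm(y ~ x1 + x2, data = dat)
```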

Nested CV is required when you need to perform model selection and report performance estimates of the selected model. The algorithm for repeated grid-search/nested K-fold cross-validation is defined as follows (a sketch of one repeat comes after the list):

  1. Use repeated cross-validation for stable prediction estimates.
  2. For each repeat \(n = 1, \dots, N\)
    1. Randomly divide (or stratify) the dataset into \(K\) folds.
    2. For each fold \(k = 1, \dots, K\)
      1. Let fold \(k\) be the test set and the remaining \(K-1\) folds used for cross-validation.
      2. Recombine the remaining \(K-1\) folds and randomly divide (or stratify) them into \(J\) folds.
      3. For each fold \(j = 1, \dots, J\)
        1. Let fold \(j\) be the validation set and the remaining \(J-1\) folds used for training.
        2. For each model \(i_{inner} = 1, \dots, I\)
          1. Fit model \(i_{inner}\) on the training set.
            • Note: ‘model’ refers to the unique combination of learning method, variable selection method, and tuning parameter/hyperparameter choices.
          2. Calculate the performance estimate \(S_{i_{inner}jkn}\) by evaluating model \(i_{inner}\) on the validation set \(j\).
      4. For each model \(i_{outer} = 1, \dots, I\)
        1. Fit model \(i_{outer}\) on the cross-validation set.
          • You can construct a more efficient algorithm which does not require fitting the unselected models.
          • You may carry over the hyperparameters estimated in the inner loop.
        2. Calculate the performance estimate \(S_{i_{outer}kn}\) by evaluating model \(i_{outer}\) on the test set \(k\).
  3. Calculate the mean prediction performance \(\bar{S}\) from inner loop \((j)\). For each model \(i_{inner}\), \[\bar{S}_{i_{inner}} = \frac{1}{JKN} \sum_1^{N} \sum_1^{K} \sum_1^{J}(S_{i_{inner}jkn})\]
  4. Determine which model has the best prediction performance based on \(\bar{S}_{i_{inner}}\) and proceed with the selected model \(i\).
  5. Calculate the mean prediction performance from outer loop \((k)\). For selected model \(i_{outer}^{top}\), \[\bar{S}_{i_{outer}^{top}} = \frac{1}{KN} \sum_1^{N} \sum_1^{K}(S_{i_{outer}^{top}kn})\]
  6. Report the cross-validated prediction performance \(\bar{S}_{i_{outer}^{top}}\).
  7. Use the selected method to fit the final model on the original dataset.
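
A sketch of a single repeat of the nested procedure, again with toy lm candidates and MSE standing in for real learning methods and performance measures (repeats would wrap this in another loop and average the results):

```r
nested_cv_once <- function(candidates, data, K = 5, J = 5) {
  outer_folds <- sample(rep(seq_len(K), length.out = nrow(data)))
  inner_mse <- matrix(0,  K, length(candidates), dimnames = list(NULL, names(candidates)))
  outer_mse <- matrix(NA, K, length(candidates), dimnames = list(NULL, names(candidates)))
  for (k in seq_len(K)) {
    cv_set <- data[outer_folds != k, ]   # training + validation folds
    test   <- data[outer_folds == k, ]   # outer test fold
    inner_folds <- sample(rep(seq_len(J), length.out = nrow(cv_set)))
    for (j in seq_len(J)) {
      train <- cv_set[inner_folds != j, ]
      valid <- cv_set[inner_folds == j, ]
      for (i in seq_along(candidates)) {
        fit <- lm(candidates[[i]], data = train)
        inner_mse[k, i] <- inner_mse[k, i] +
          mean((valid$y - predict(fit, newdata = valid))^2) / J
      }
    }
    for (i in seq_along(candidates)) {
      fit <- lm(candidates[[i]], data = cv_set)
      outer_mse[k, i] <- mean((test$y - predict(fit, newdata = test))^2)
    }
  }
  best <- names(which.min(colMeans(inner_mse)))      # selection via inner loop
  list(selected = best,
       cv_performance = mean(outer_mse[, best]))     # reported estimate via outer loop
}

nested_cv_once(list(m1 = y ~ x1, m2 = y ~ x2, m3 = y ~ x1 + x2), data = dat)
```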

Bibliography

Arlot, Sylvain, and Alain Celisse. 2010. “A Survey of Cross-Validation Procedures for Model Selection.” Statistics Surveys 4. https://doi.org/10.1214/09-SS054.
Bartlett, Jonathan W, and Rachael A Hughes. 2020. “Bootstrap Inference for Multiple Imputation Under Uncongeniality and Misspecification.” Statistical Methods in Medical Research 29 (12): 3533–46. https://doi.org/10.1177/0962280220932189.
Efron, Bradley, and Robert Tibshirani. 1994. An Introduction to the Bootstrap. Chapman & Hall/CRC. https://doi.org/10.1201/9780429246593.
———. 1997. “Improvements on Cross-Validation: The 632+ Bootstrap Method.” Journal of the American Statistical Association 92 (438): 548–60. https://doi.org/10.1080/01621459.1997.10474007.
Harrell, Frank E., K. L. Lee, and D. B. Mark. 1996. “Multivariable Prognostic Models: Issues in Developing Models, Evaluating Assumptions and Adequacy, and Measuring and Reducing Errors.” Statistics in Medicine 15 (4): 361–87. https://doi.org/10.1002/(SICI)1097-0258(19960229)15:4<361::AID-SIM168>3.0.CO;2-4.
Krstajic, Damjan, Ljubomir J Buturovic, David E Leahy, and Simon Thomas. 2014. “Cross-Validation Pitfalls When Selecting and Assessing Regression and Classification Models.” Journal of Cheminformatics 6 (1): 10. https://doi.org/10.1186/1758-2946-6-10.
Noma, Hisashi, Tomohiro Shinozaki, Katsuhiro Iba, Satoshi Teramukai, and Toshi A. Furukawa. 2021. “Confidence Intervals of Prediction Accuracy Measures for Multivariable Prediction Models Based on the Bootstrap‐based Optimism Correction Methods.” Statistics in Medicine 40 (26): 5691–701. https://doi.org/10.1002/sim.9148.
Published: 2022-02-19