Brett Klamer

Resampling Methods

There are two common implementations of internal validation: bootstrap and K-fold cross-validation.

The simple nonparametric bootstrap can be used to study the sampling variability of apparent model performance. However, averaging apparent performance across bootstrap resamples is not a valid estimate of out-of-sample performance. Because each bootstrap model is evaluated on the data used to fit it, this estimate remains optimistic. Bootstrap internal validation instead requires a method that estimates or avoids this optimism, such as optimism correction, out-of-bootstrap evaluation, the .632 bootstrap, or the .632+ bootstrap.

Efron showed that optimism correction reduces this bias by estimating the optimism of the apparent estimate and adjusting for it (for a coherent explanation, see section 17.6 of Efron and Tibshirani 1994). Although this refined bootstrap method is better, it can still be biased in scenarios such as small-N-large-P or when using a discontinuous performance measure such as the proportion classified correctly.

Efron introduced the .632 and .632+ bootstrap to attempt a fix for these known biases. The name comes from the probability that a given observation is included in a bootstrap resample of size \(n\): \[ \begin{aligned} P(\text{observation included}) &= 1 - \left(1 - \frac{1}{n} \right)^n \to 0.632 \text{ as } n\to\infty \\ P(\text{observation not included}) &= \left(1 - \frac{1}{n} \right)^n \to 0.368 \text{ as } n\to\infty. \end{aligned} \] The .632 bootstrap calculates a weighted performance estimate using the out-of-bootstrap observations (weighted .632) and the apparent performance estimate on the original sample (weighted .368). The .632+ bootstrap modifies this approach by using an adaptive weight based on the relative overfitting rate, so its weight is not fixed at .632 (Efron and Tibshirani 1997). However, the original refined bootstrap method (optimism correction) is commonly used due to its simplicity and sufficient performance.
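
As a quick numerical check of the inclusion probability and the fixed .632 weighting described above, here is a minimal Python sketch. The function names `inclusion_probability` and `estimate_632` are hypothetical and used only for illustration. The .632+ adaptive weight is not shown.

```python
import math

def inclusion_probability(n: int) -> float:
    """Probability that a given observation appears at least once
    in a bootstrap resample of size n drawn from n observations."""
    return 1.0 - (1.0 - 1.0 / n) ** n

def estimate_632(s_app: float, s_oob: float) -> float:
    """Fixed-weight .632 estimate: apparent performance weighted .368,
    out-of-bootstrap performance weighted .632."""
    return 0.368 * s_app + 0.632 * s_oob

# The inclusion probability approaches 1 - exp(-1) as n grows.
for n in (10, 100, 1000):
    print(n, round(inclusion_probability(n), 4))
print("limit", round(1.0 - math.exp(-1.0), 4))
```

The weighted estimate always lies between the apparent and out-of-bootstrap values, closer to the out-of-bootstrap value.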

K-fold CV is useful for estimating the predictive performance of a single pre-selected model. If you would like to choose among competing models (note: ‘models’ refers to the various choices of learning methods, variable selection methods, and tuning parameters/hyperparameters) and interpret the resulting predictive performance, you will need to use nested CV. Nested K-fold CV uses training, validation, and test sets, while simple K-fold CV uses only training and test sets. In nested CV, the training set is used to fit all competing models, the validation set is used to select the top performing model, and the test set is used to estimate the performance of the top model. A single run of K-fold CV may result in highly variable estimates, so repeated runs of K-fold CV are recommended to obtain stable estimates. See Arlot and Celisse (2010) and Krstajic et al. (2014) for more information.

Below I provide detailed algorithms for bootstrap and K-fold cross-validation methods. However, there are many different versions of cross-validation and no single version is optimal in every situation.

Bootstrap optimism correction

The algorithm for bootstrap optimism correction (Efron and Tibshirani 1994; Harrell, Lee, and Mark 1996) estimates and removes the optimism of the apparent performance estimate.

Let \(A\) denote the pre-specified model fitting procedure. The procedure must include all steps that would be applied in practice, including data-dependent preprocessing, feature selection, tuning choices or tuning rules, and model fitting. Let \(S\) denote a performance statistic where larger values indicate better performance. If the original statistic is a loss or error measure where smaller values indicate better performance, define \(S\) as the negative loss or error before applying the algorithm. Let \(S_{app}\) denote the apparent performance estimate from fitting \(A\) to the original dataset and evaluating the fitted model on that same dataset. Let \(S_{b, boot}\) denote the apparent bootstrap performance estimate from fitting \(A\) to bootstrap resample \(b\) and evaluating the fitted model on that same bootstrap resample. Let \(S_{b, orig}\) denote the test performance estimate from evaluating the fitted bootstrap model on the original dataset.

The algorithm is defined as:

  1. Fit the model fitting procedure \(A\) to the original dataset.

  2. Evaluate the fitted model on the original dataset and store the apparent performance estimate as \(S_{app}\).

  3. For each bootstrap resample \(b = 1, \dots, B\):

    1. Generate a bootstrap dataset by sampling observations with replacement from the original dataset.

    2. Fit the model fitting procedure \(A\) to bootstrap resample \(b\).

    3. Evaluate the fitted bootstrap model on bootstrap resample \(b\) and store the apparent bootstrap performance estimate as \(S_{b, boot}\).

    4. Evaluate the fitted bootstrap model on the original dataset and store the test performance estimate as \(S_{b, orig}\).

  4. Calculate the average optimism of the apparent performance estimate:

    \[ O = \frac{1}{B} \sum_{b = 1}^{B} (S_{b, boot} - S_{b, orig}). \]

  5. Calculate the optimism-adjusted performance estimate:

    \[ S_{adj} = S_{app} - O. \]

  6. Report the optimism-adjusted performance estimate \(S_{adj}\).
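
The steps above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the procedure \(A\) is stood in for by ordinary least squares, \(S\) is taken to be \(R^2\), and the data are simulated. The function names `fit`, `r_squared`, and `optimism_corrected` are hypothetical.

```python
import numpy as np

def fit(X, y):
    """Stand-in for procedure A (assumption): OLS with an intercept."""
    Xd = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return coef

def r_squared(coef, X, y):
    """Performance statistic S (larger is better): R^2 of the model on (X, y)."""
    Xd = np.column_stack([np.ones(len(X)), X])
    resid = y - Xd @ coef
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def optimism_corrected(X, y, B=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    s_app = r_squared(fit(X, y), X, y)              # steps 1-2
    optimism = np.empty(B)
    for b in range(B):                              # step 3
        idx = rng.integers(0, n, size=n)            # 3.1: sample with replacement
        coef_b = fit(X[idx], y[idx])                # 3.2: refit A on the resample
        s_boot = r_squared(coef_b, X[idx], y[idx])  # 3.3: apparent bootstrap performance
        s_orig = r_squared(coef_b, X, y)            # 3.4: test on the original data
        optimism[b] = s_boot - s_orig
    o = optimism.mean()                             # step 4
    return s_app, o, s_app - o                      # step 5

# Simulated example: 60 observations, 8 predictors, only the first is informative.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 8))
y = X[:, 0] + rng.normal(size=60)
s_app, o, s_adj = optimism_corrected(X, y)
print(f"S_app={s_app:.3f}  O={o:.3f}  S_adj={s_adj:.3f}")
```

In a real analysis, `fit` would need to wrap every data-dependent step of \(A\), so that preprocessing, feature selection, and tuning are all repeated inside each resample.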

Bootstrap optimism correction for model selection

Bootstrap optimism correction can also be used when the goal is to select the top performing model fitting procedure and report the internally validated performance of the selected procedure. The key requirement is that the full selection process must be repeated inside each bootstrap resample. This estimates the optimism of the complete data-dependent modeling procedure, including model selection.

Let \(A_1, \dots, A_I\) denote the candidate model fitting procedures. Each \(A_i\) must include all steps that would be applied in practice, including data-dependent preprocessing, feature selection, tuning choices or tuning rules, and model fitting. Let \(M\) denote the pre-specified model-selection process. This process uses the available dataset to estimate the performance of each candidate procedure \(A_1, \dots, A_I\) and then selects the top performing procedure. For example, \(M\) may choose the procedure with the best internally validated performance or another pre-specified selection criterion appropriate for the candidate procedures. If \(M\) uses repeated K-fold cross-validation, that repeated cross-validation procedure must be run using only the original dataset when selecting \(A_{i^*}\) and using only bootstrap resample \(b\) when selecting \(A_{i_b^*}\). If \(M\) uses bootstrap optimism correction, that inner bootstrap validation procedure must also be run using only the dataset currently available to \(M\). This creates a nested bootstrap algorithm, with an outer bootstrap loop to estimate the optimism of the full model-selection process and an inner bootstrap loop used by \(M\) to choose among candidate procedures. Let \(S\) denote a performance statistic where larger values indicate better performance. If the original statistic is a loss or error measure where smaller values indicate better performance, define \(S\) as the negative loss or error before applying the algorithm.

Apply the model-selection process \(M\) to the original dataset and let \(A_{i^*}\) denote the selected model fitting procedure. Let \(S_{app}\) denote the apparent performance estimate from fitting \(A_{i^*}\) to the original dataset and evaluating the fitted model on that same dataset. Let \(i_b^*\) denote the index of the model fitting procedure selected by applying \(M\) to bootstrap resample \(b\). Let \(S_{b, boot}\) denote the apparent bootstrap performance estimate from fitting \(A_{i_b^*}\) to bootstrap resample \(b\) and evaluating the fitted model on that same bootstrap resample. Let \(S_{b, orig}\) denote the test performance estimate from evaluating the fitted bootstrap model on the original dataset.

The algorithm to estimate performance for the selected model fitting procedure after accounting for the selection process is defined as:

  1. Pre-specify the candidate model fitting procedures \(A_1, \dots, A_I\). Each candidate procedure must include all preprocessing, feature selection, tuning, and model fitting steps assigned to that procedure.

  2. Pre-specify the model-selection process \(M\). The same model-selection process must be used on the original dataset and inside every bootstrap resample.

  3. Using the original dataset, estimate the performance of each candidate procedure according to the pre-specified model-selection process \(M\).

  4. Select the top performing procedure and denote it by \(A_{i^*}\).

  5. Fit \(A_{i^*}\) to the original dataset.

  6. Evaluate the selected model on the original dataset and store the apparent performance estimate as \(S_{app}\).

  7. For each bootstrap resample \(b = 1, \dots, B\):

    1. Generate a bootstrap dataset by sampling observations with replacement from the original dataset.

    2. Using only bootstrap resample \(b\), estimate the performance of each candidate procedure according to the same pre-specified model-selection process used on the original dataset.

    3. Select the top performing procedure within bootstrap resample \(b\) and denote it by \(A_{i_b^*}\).

    4. Fit \(A_{i_b^*}\) to bootstrap resample \(b\).

    5. Evaluate the fitted bootstrap-selected model on bootstrap resample \(b\) and store the apparent bootstrap performance estimate as \(S_{b, boot}\).

    6. Evaluate the fitted bootstrap-selected model on the original dataset and store the test performance estimate as \(S_{b, orig}\).

  8. Calculate the average optimism of the complete selection procedure:

    \[ O = \frac{1}{B} \sum_{b = 1}^{B} (S_{b, boot} - S_{b, orig}). \]

  9. Calculate the optimism-adjusted performance estimate for the selected procedure:

    \[ S_{adj} = S_{app} - O. \]

  10. Report the selected model fitting procedure \(A_{i^*}\).

  11. Report the optimism-adjusted performance estimate \(S_{adj}\). Because \(A_{i^*}\) and \(S_{adj}\) are both derived from the same original dataset, \(S_{adj}\) is interpreted as an internally validated out-of-sample estimate rather than an externally validated performance estimate.
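
A minimal Python sketch of this algorithm follows. It rests on several assumptions made only for illustration: the candidate procedures are ridge regressions with different fixed penalties, the selection process \(M\) is a single run of 5-fold cross-validation, \(S\) is the negative mean squared error, and the data are simulated. All function names are hypothetical.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Candidate procedure A_i (assumption): ridge regression with fixed penalty lam."""
    xm, ym = X.mean(axis=0), y.mean()
    Xc = X - xm
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ (y - ym))
    return ym - xm @ beta, beta  # (intercept, coefficients)

def neg_mse(model, X, y):
    """Performance statistic S: negative mean squared error (larger is better)."""
    b0, beta = model
    return -np.mean((y - b0 - X @ beta) ** 2)

def select(X, y, lams, rng, K=5):
    """Stand-in for M: choose the penalty with the best K-fold CV score,
    using only the dataset passed in."""
    folds = np.array_split(rng.permutation(len(y)), K)
    scores = np.zeros(len(lams))
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        for i, lam in enumerate(lams):
            model = fit_ridge(X[train], y[train], lam)
            scores[i] += neg_mse(model, X[test], y[test]) / K
    return int(np.argmax(scores))

def optimism_corrected_selection(X, y, lams, B=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    i_star = select(X, y, lams, rng)                       # steps 3-4
    s_app = neg_mse(fit_ridge(X, y, lams[i_star]), X, y)   # steps 5-6
    optimism = np.empty(B)
    for b in range(B):                                     # step 7
        idx = rng.integers(0, n, size=n)                   # 7.1
        Xb, yb = X[idx], y[idx]
        i_b = select(Xb, yb, lams, rng)                    # 7.2-7.3: selection repeated
        model_b = fit_ridge(Xb, yb, lams[i_b])             # 7.4
        optimism[b] = neg_mse(model_b, Xb, yb) - neg_mse(model_b, X, y)  # 7.5-7.6
    o = optimism.mean()                                    # step 8
    return i_star, s_app, o, s_app - o                     # step 9

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 10))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=80)
lams = [0.1, 1.0, 10.0, 100.0]
i_star, s_app, o, s_adj = optimism_corrected_selection(X, y, lams)
print(f"selected lambda={lams[i_star]}  S_app={s_app:.3f}  O={o:.3f}  S_adj={s_adj:.3f}")
```

The point of the sketch is that `select` is called inside the loop on the resample alone, so the index chosen within a resample may differ from `i_star`. The estimated optimism then reflects the variability of the selection step as well as the fit.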

See Noma et al. (2021) for additional bootstrap optimism correction applications.

Bootstrap optimism correction for model selection with multiple imputation

When the data have missing values, multiple imputation adds another data-dependent step to the modeling procedure. Bartlett and Hughes (2020) found that, for parameter inference, imputation followed by bootstrapping generally did not provide robust variance estimates under uncongeniality or misspecification, whereas certain bootstrap followed by imputation methods did. For bootstrap optimism correction, this suggests that bootstrap resampling should occur before imputation. The full imputation, model-selection, and model-fitting process should then be repeated inside each bootstrap resample.

Let \(D_{obs}\) denote the original incomplete dataset. Let \(A_1, \dots, A_I\) denote the candidate model fitting procedures. Each \(A_i\) must include all steps that would be applied in practice, including data-dependent preprocessing, feature selection, tuning choices or tuning rules, imputation choices, and model fitting. Let \(M\) denote the pre-specified model-selection process. This process uses the available incomplete dataset to impute missing values as needed, estimate the performance of each candidate procedure \(A_1, \dots, A_I\), and then select the top performing procedure. Let \(H\) denote the number of imputations used for estimating performance and fitting the selected procedure. Let \(S\) denote a performance statistic where larger values indicate better performance. If the original statistic is a loss or error measure where smaller values indicate better performance, define \(S\) as the negative loss or error before applying the algorithm. The final prediction rule after multiple imputation must also be pre-specified. For example, predictions may be averaged across imputation-specific fitted models, or a pooled model may be used when pooling is valid for the model class. The performance statistic must be calculated for that final prediction rule.

Apply the model-selection process \(M\) to \(D_{obs}\) and let \(A_{i^*}\) denote the selected model fitting procedure. Let \(S_{app}\) denote the apparent performance estimate from fitting \(A_{i^*}\) to multiply imputed versions of \(D_{obs}\) and evaluating the resulting prediction rule on those same completed datasets according to the pre-specified prediction-combining and performance-combining rules. Let \(i_b^*\) denote the index of the model fitting procedure selected by applying \(M\) to bootstrap resample \(b\). Let \(S_{b, boot}\) denote the apparent bootstrap performance estimate from fitting \(A_{i_b^*}\) to imputed versions of bootstrap resample \(b\) and evaluating the resulting prediction rule on those same completed bootstrap datasets. Let \(S_{b, orig}\) denote the test performance estimate from evaluating the bootstrap-fitted prediction rule on completed versions of the original dataset generated according to the pre-specified imputation rule for validation data.

The algorithm is defined as:

  1. Pre-specify the imputation process. The imputation model should make the missing at random assumption plausible conditional on the variables included in the imputation model. The imputation process must also define how incomplete validation data will be imputed or otherwise handled when evaluating prediction performance. Any imputation model or preprocessing rule used to complete validation predictors must be fixed in advance or estimated using only the corresponding training data. For the outer bootstrap loop, this means the rule used to complete \(D_{b, h, orig}\) must be fixed in advance or estimated using only \(D_{b, obs}\). Validation outcomes must not be used to complete validation predictors before evaluating prediction performance. The imputation process must also define the final prediction rule after multiple imputation. The same prediction-combining rule must be used on the original dataset and inside every bootstrap resample.

  2. Pre-specify the candidate model fitting procedures \(A_1, \dots, A_I\). Each candidate procedure must include all preprocessing, feature selection, tuning, imputation, and model fitting steps assigned to that procedure.

  3. Pre-specify the model-selection process \(M\). The same model-selection process must be used on the original dataset and inside every bootstrap resample. If \(M\) uses repeated K-fold cross-validation, cross-validation must be run using only the dataset currently available to \(M\). If \(M\) uses bootstrap optimism correction, the inner bootstrap validation must also be run using only the dataset currently available to \(M\).

  4. Apply the model-selection process \(M\) to the original incomplete dataset \(D_{obs}\) and select the model fitting procedure \(A_{i^*}\).

  5. For each imputation \(h = 1, \dots, H\):

    1. Generate completed original dataset \(D_{h, orig}\) from \(D_{obs}\).

    2. Fit the selected model fitting procedure \(A_{i^*}\) to \(D_{h, orig}\).

    3. Generate predictions from the pre-specified final prediction rule and store the imputation-specific apparent performance contribution as \(S_{h, app}\).

  6. Combine \(S_{1, app}, \dots, S_{H, app}\) according to the pre-specified performance-combining rule and store the result as \(S_{app}\).

  7. For each bootstrap resample \(b = 1, \dots, B\):

    1. Generate an incomplete bootstrap dataset \(D_{b, obs}\) by sampling observations with replacement from \(D_{obs}\).

    2. Apply the model-selection process \(M\) to \(D_{b, obs}\) and select the model fitting procedure \(A_{i_b^*}\).

    3. For each imputation \(h = 1, \dots, H\):

      1. Generate completed bootstrap dataset \(D_{b, h, boot}\) from \(D_{b, obs}\).

      2. Fit the bootstrap-selected model fitting procedure \(A_{i_b^*}\) to \(D_{b, h, boot}\).

      3. Generate predictions from the pre-specified final prediction rule and store the imputation-specific apparent bootstrap performance contribution as \(S_{b, h, boot}\).

      4. Generate completed original validation dataset \(D_{b, h, orig}\) from \(D_{obs}\) according to the pre-specified imputation rule for validation data. Any imputation model or preprocessing rule used to complete validation predictors must be fixed in advance or estimated using only \(D_{b, obs}\). Validation outcomes must not be used to complete validation predictors before evaluating prediction performance.

      5. Evaluate the same bootstrap-fitted prediction rule on \(D_{b, h, orig}\). Store the resulting test performance estimate as \(S_{b, h, orig}\).

    4. Combine \(S_{b, 1, boot}, \dots, S_{b, H, boot}\) according to the pre-specified performance-combining rule and store the result as \(S_{b, boot}\).

    5. Combine \(S_{b, 1, orig}, \dots, S_{b, H, orig}\) according to the pre-specified performance-combining rule and store the result as \(S_{b, orig}\).

  8. Calculate the average optimism of the complete imputation and model-selection procedure:

    \[ O = \frac{1}{B} \sum_{b = 1}^{B} (S_{b, boot} - S_{b, orig}). \]

  9. Calculate the optimism-adjusted performance estimate for the selected procedure:

    \[ S_{adj} = S_{app} - O. \]

  10. Report the selected model fitting procedure \(A_{i^*}\).

  11. Report the optimism-adjusted performance estimate \(S_{adj}\). Because \(A_{i^*}\) and \(S_{adj}\) are both derived from the same original dataset, \(S_{adj}\) is interpreted as an internally validated out-of-sample estimate rather than an externally validated performance estimate.
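
The ordering of bootstrap and imputation can be illustrated with a deliberately simplified Python sketch. It makes strong assumptions that are not part of the algorithm above: there is a single candidate procedure (so the selection process \(M\) is omitted), only predictors have missing values, imputation is a crude random draw from a per-column normal distribution, each imputation-specific model is scored on its matching completed dataset, and performance is combined by averaging over the \(H\) imputations. All function names are hypothetical.

```python
import numpy as np

def impute_params(X):
    """Estimate per-column mean and SD from observed entries of the training predictors."""
    return np.nanmean(X, axis=0), np.nanstd(X, axis=0)

def impute(X, params, rng):
    """Fill missing entries with draws from N(mean, sd). Crude on purpose:
    it ignores relationships between variables and never uses outcomes."""
    mean, sd = params
    draws = rng.normal(mean, sd, size=X.shape)
    return np.where(np.isnan(X), draws, X)

def fit(X, y):
    Xd = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(Xd, y, rcond=None)[0]

def neg_mse(coef, X, y):
    Xd = np.column_stack([np.ones(len(X)), X])
    return -np.mean((y - Xd @ coef) ** 2)

def mi_performance(X_train, y_train, X_eval, y_eval, H, rng):
    """For each of H imputations: complete the training data, fit, and score on the
    completed training and evaluation data. Returns the two averaged scores."""
    params = impute_params(X_train)  # imputation rule estimated from training data only
    s_train, s_eval = [], []
    for _ in range(H):
        X_train_h = impute(X_train, params, rng)
        coef = fit(X_train_h, y_train)
        s_train.append(neg_mse(coef, X_train_h, y_train))
        s_eval.append(neg_mse(coef, impute(X_eval, params, rng), y_eval))
    return np.mean(s_train), np.mean(s_eval)

def mi_optimism_corrected(X, y, H=5, B=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    s_app, _ = mi_performance(X, y, X, y, H, rng)      # steps 5-6
    optimism = np.empty(B)
    for b in range(B):                                 # step 7
        idx = rng.integers(0, n, size=n)               # resample the incomplete data first
        s_boot, s_orig = mi_performance(X[idx], y[idx], X, y, H, rng)  # then impute
        optimism[b] = s_boot - s_orig
    o = optimism.mean()                                # step 8
    return s_app, o, s_app - o                         # step 9

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 5))
y = X[:, 0] + rng.normal(size=80)
X[rng.random(X.shape) < 0.15] = np.nan                 # about 15% missing predictor values
s_app, o, s_adj = mi_optimism_corrected(X, y)
print(f"S_app={s_app:.3f}  O={o:.3f}  S_adj={s_adj:.3f}")
```

Note that the imputation parameters used to complete the original data inside the loop come from the bootstrap resample, and that outcomes are never used to complete predictors. A realistic analysis would replace `impute` with a proper multiple imputation model and would add the selection step inside `mi_performance`.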

Practical implementations for this scenario can be found at https://cran.r-project.org/web/packages/psfmi.

Repeated K-fold cross-validation

Repeated K-fold cross-validation can be used to estimate the out-of-sample prediction performance of one pre-specified model fitting procedure.

A model fitting procedure is the complete workflow that would be applied in practice, including preprocessing, feature selection, fixed tuning values or an internal tuning rule, and final model fitting. All data-dependent steps must be performed using only the training data available within each cross-validation split. If the procedure includes tuning, the tuning rule must be part of the pre-specified procedure and must be applied only within the training set for each fold.

The repeated part improves stability by averaging over multiple random fold partitions. This reduces dependence on a single split and provides a more stable performance summary.

Let \(A\) denote the pre-specified model fitting procedure. Let \(S_{kn}\) denote the model performance estimate for fold \(k\) and repeat \(n\).

The algorithm for repeated \(K\)-fold cross-validation is defined as:

  1. Pre-specify the model fitting procedure \(A\). The procedure must include all steps that would be applied in practice, including data-dependent preprocessing, feature selection, tuning choices or tuning rules, and model fitting.

  2. For each repeat \(n = 1, \dots, N\):

    1. Randomly divide or stratify the dataset into \(K\) folds.

    2. For each fold \(k = 1, \dots, K\):

      1. Let fold \(k\) be the test set.

      2. Let the remaining \(K - 1\) folds be the training set.

      3. Fit the model fitting procedure \(A\) using only the training set. This includes all preprocessing, feature selection, tuning, and model fitting steps assigned to \(A\).

      4. Evaluate the fitted model on the test set.

      5. Store the model performance estimate as \(S_{kn}\).

  3. After all repeats and folds are complete, calculate the repeated cross-validated performance estimate:

    \[ \bar{S} = \frac{1}{KN} \sum_{n = 1}^{N} \sum_{k = 1}^{K} S_{kn}. \]

  4. Report the repeated cross-validated prediction performance estimate \(\bar{S}\).

  5. Fit the model fitting procedure \(A\) on the full original dataset. The final fit should use the same preprocessing, feature selection, tuning, and model fitting rules that defined \(A\) during validation.

This algorithm estimates the performance of one pre-specified procedure \(A\). If the goal is to compare several procedures and then report the selected one, use nested cross-validation instead.
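
The repeated K-fold algorithm above can be sketched in Python as follows. This is an illustration under assumptions: the procedure \(A\) is taken to be predictor standardization followed by ordinary least squares, the performance measure is the negative mean squared error, folds are formed by simple random partition without stratification, and the data are simulated. The function names are hypothetical.

```python
import numpy as np

def fit(X, y):
    """Stand-in for procedure A (assumption): standardize predictors, then OLS.
    The scaling constants are estimated from the data passed in, i.e. the training set."""
    mean, sd = X.mean(axis=0), X.std(axis=0)
    Z = np.column_stack([np.ones(len(X)), (X - mean) / sd])
    coef = np.linalg.lstsq(Z, y, rcond=None)[0]
    return mean, sd, coef

def neg_mse(model, X, y):
    """Score new data using the training-set scaling stored in the model."""
    mean, sd, coef = model
    Z = np.column_stack([np.ones(len(X)), (X - mean) / sd])
    return -np.mean((y - Z @ coef) ** 2)

def repeated_kfold(X, y, K=5, N=10, seed=0):
    rng = np.random.default_rng(seed)
    scores = np.empty((N, K))                            # S_kn, one row per repeat
    for rep in range(N):                                 # step 2
        folds = np.array_split(rng.permutation(len(y)), K)   # 2.1
        for k, test in enumerate(folds):                 # 2.2
            train = np.concatenate([f for j, f in enumerate(folds) if j != k])
            model = fit(X[train], y[train])              # fit on training folds only
            scores[rep, k] = neg_mse(model, X[test], y[test])
    return scores.mean(), scores                         # step 3

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 6))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=100)
s_bar, scores = repeated_kfold(X, y)
print(f"S_bar={s_bar:.3f}  per-repeat means={scores.mean(axis=1).round(3)}")

final_model = fit(X, y)                                  # step 5: refit A on all data
```

Returning the full matrix of fold scores alongside the average makes it easy to inspect how much the estimate moves from one random partition to the next.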

Repeated nested K-fold cross-validation

Repeated nested cross-validation can be used to compare several pre-specified model fitting procedures when those procedures include tuning or other data-dependent steps. A model fitting procedure is the complete workflow that would be applied if that procedure were chosen, including preprocessing, feature selection, a set of candidate tuning values, a rule for choosing among those values, and final model fitting. For example, one procedure may be ridge regression with a specified set of candidate \(\lambda\) values and a rule that selects the best \(\lambda\) by inner cross-validation. Another procedure may be lasso regression with a specified set of candidate \(\lambda\) values and the same kind of inner cross-validation tuning rule. In the algorithm below, the inner cross-validation loop is used to choose the tuning values for each model fitting procedure.

The repeated part improves stability by averaging over multiple random fold partitions. This reduces dependence on a single split and provides more stable inner selection summaries and outer performance summaries. Separation among the inner cross-validation averages indicates how clearly the selection criterion favors one procedure over the others. Selection stability can also be assessed by recording which procedure would be selected within each repeat or outer fold. Performance stability can be assessed from the distribution of the outer performance estimates across repeats and folds.

Let \(A_1, \dots, A_I\) denote the model fitting procedures. Let \(\bar{S}_{ikn}\) denote the inner cross-validation summary for model fitting procedure \(i\) within outer fold \(k\) and repeat \(n\). This summary is calculated after the inner loop tunes model fitting procedure \(A_i\). The individual inner validation results are used to choose tuning settings and calculate \(\bar{S}_{ikn}\), but they are not used directly in the final reported performance estimate. Let \(T_{ikn}\) denote the outer test set performance estimate for model fitting procedure \(i\), outer fold \(k\), and repeat \(n\).

The algorithm for repeated nested \(K\)-fold cross-validation with inner-CV procedure selection and outer-CV performance reporting is defined as:

  1. Pre-specify the model fitting procedures \(A_1, \dots, A_I\). Each \(A_i\) must include all steps that would be applied in practice, including data-dependent preprocessing, feature selection, candidate tuning values, a tuning rule, and model fitting.

  2. For each repeat \(n = 1, \dots, N\):

    1. Randomly divide or stratify the dataset into \(K\) outer folds.

    2. For each outer fold \(k = 1, \dots, K\):

      1. Let outer fold \(k\) be the outer test set.

      2. Let the remaining \(K - 1\) folds be the outer training set.

      3. Randomly divide or stratify the outer training set into \(J\) inner folds.

      4. For each model fitting procedure \(i = 1, \dots, I\):

        1. For each inner fold \(j = 1, \dots, J\):

          1. Let inner fold \(j\) be the inner validation set.

          2. Let the remaining \(J - 1\) inner folds be the inner training set.

          3. Fit and evaluate model fitting procedure \(A_i\) on the inner training and validation data according to its tuning rule. This includes evaluating the candidate tuning settings and all other data-dependent steps assigned to \(A_i\).

          4. Store the inner validation results needed to tune \(A_i\).

        2. Use the inner validation results within the current outer training set to choose the tuning settings for \(A_i\). For example, select \(\lambda\) for ridge or lasso regression.

        3. Store the resulting inner cross-validation summary for procedure \(A_i\) as \(\bar{S}_{ikn}\). This is the inner-loop performance summary for \(A_i\) after its tuning settings have been selected.

        4. Refit procedure \(A_i\) on the full outer training set using the selected tuning settings.

        5. Evaluate the refit model on the outer test set.

        6. Store the outer test set performance estimate as \(T_{ikn}\).

  3. After all repeats and folds are complete, calculate the average inner cross-validation summary for each model fitting procedure:

    \[ \bar{S}_i = \frac{1}{KN} \sum_{n = 1}^{N} \sum_{k = 1}^{K} \bar{S}_{ikn}. \]

  4. Select the final model fitting procedure using the averaged inner cross-validation results:

    \[ i^* = \operatorname{argbest}_{i \in \{1, \dots, I\}} \bar{S}_i. \]

  5. Calculate the repeated nested cross-validated outer performance estimate for each model fitting procedure:

    \[ \bar{T}_i = \frac{1}{KN} \sum_{n = 1}^{N} \sum_{k = 1}^{K} T_{ikn}. \]

  6. Report the selected model fitting procedure \(A_{i^*}\).

  7. Report the selected procedure’s outer cross-validation performance estimate:

    \[ \bar{T}_{i^*}. \]

    Because \(i^*\) and \(\bar{T}_{i^*}\) are both derived from the same original dataset, \(\bar{T}_{i^*}\) is interpreted as an internally validated out-of-sample estimate rather than an externally validated (strictly independent) post-selection performance estimate.

  8. Also report the procedure-specific outer estimates \(\bar{T}_1, \dots, \bar{T}_I\) when model comparison is part of the goal. These estimates describe the out-of-sample performance of each pre-specified procedure when evaluated on held-out outer test folds.

  9. Do not include the inner validation results or inner summaries in the reported outer performance estimates. The inner results are used for procedure selection and for tuning within each outer training set. The outer estimates are used for prediction performance reporting.

  10. Fit the selected procedure \(A_{i^*}\) on the full original dataset. The final fit should use the same internal tuning and model fitting rules that defined \(A_{i^*}\) during validation.
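
A compact Python sketch of the nested algorithm is given below. It is a simplified illustration under assumptions: the two candidate procedures are ridge regression tuned over a penalty grid and a hypothetical "screen then OLS" procedure tuned over the number of retained predictors (used here in place of lasso to keep the code short), the score is the negative mean squared error, folds are unstratified, and the data are simulated. All names are hypothetical.

```python
import numpy as np

def ridge(X, y, lam):
    xm, ym = X.mean(axis=0), y.mean()
    Xc = X - xm
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ (y - ym))
    return lambda Z: ym + (Z - xm) @ beta

def screened_ols(X, y, m):
    """Keep the m predictors most correlated with y (chosen on the training data), then OLS."""
    xm, ym = X.mean(axis=0), y.mean()
    Xc = X - xm
    keep = np.argsort(-np.abs(Xc.T @ (y - ym)) / np.linalg.norm(Xc, axis=0))[:m]
    beta = np.linalg.lstsq(Xc[:, keep], y - ym, rcond=None)[0]
    return lambda Z: ym + (Z[:, keep] - xm[keep]) @ beta

# Each procedure A_i: (fitting function, candidate tuning values).
PROCEDURES = [(ridge, [0.1, 1.0, 10.0, 100.0]), (screened_ols, [1, 2, 4, 8])]

def score(predict, X, y):
    return -np.mean((y - predict(X)) ** 2)

def split(n, K, rng):
    """Random partition of range(n) into K (train, test) index pairs."""
    folds = np.array_split(rng.permutation(n), K)
    return [(np.concatenate(folds[:k] + folds[k + 1:]), folds[k]) for k in range(K)]

def nested_cv(X, y, K=5, J=5, N=3, seed=0):
    rng = np.random.default_rng(seed)
    S = np.empty((len(PROCEDURES), K, N))   # inner summaries S_bar_ikn
    T = np.empty((len(PROCEDURES), K, N))   # outer test estimates T_ikn
    for rep in range(N):
        for k, (otr, ote) in enumerate(split(len(y), K, rng)):
            Xo, yo = X[otr], y[otr]                       # outer training set
            inner = split(len(yo), J, rng)                # inner folds within it
            for i, (fit, grid) in enumerate(PROCEDURES):
                cv = np.array([[score(fit(Xo[tr], yo[tr], g), Xo[va], yo[va])
                                for tr, va in inner] for g in grid]).mean(axis=1)
                best = int(np.argmax(cv))                 # tuning chosen by inner CV
                S[i, k, rep] = cv[best]
                T[i, k, rep] = score(fit(Xo, yo, grid[best]), X[ote], y[ote])
    S_bar, T_bar = S.mean(axis=(1, 2)), T.mean(axis=(1, 2))
    return int(np.argmax(S_bar)), S_bar, T_bar            # i*, inner and outer averages

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 8))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=100)
i_star, S_bar, T_bar = nested_cv(X, y)
print(f"selected procedure index={i_star}  S_bar={S_bar.round(3)}  T_bar={T_bar.round(3)}")
```

The selection uses only the inner summaries `S_bar`, while the reported performance comes from `T_bar`, mirroring the separation between steps 3 to 4 and steps 5 to 7. The final refit on the full dataset (step 10) would rerun the selected procedure's tuning rule on all observations and is omitted here.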

Bibliography

Arlot, Sylvain, and Alain Celisse. 2010. “A Survey of Cross-Validation Procedures for Model Selection.” Statistics Surveys 4: 40–79. https://doi.org/10.1214/09-SS054.
Bartlett, Jonathan W, and Rachael A Hughes. 2020. “Bootstrap Inference for Multiple Imputation Under Uncongeniality and Misspecification.” Statistical Methods in Medical Research 29 (12): 3533–46. https://doi.org/10.1177/0962280220932189.
Efron, Bradley, and Robert Tibshirani. 1994. An Introduction to the Bootstrap. Chapman and Hall/CRC. https://doi.org/10.1201/9780429246593.
———. 1997. “Improvements on Cross-Validation: The 632+ Bootstrap Method.” Journal of the American Statistical Association 92 (438): 548–60. https://doi.org/10.1080/01621459.1997.10474007.
Harrell, Frank E., K. L. Lee, and D. B. Mark. 1996. “Multivariable Prognostic Models: Issues in Developing Models, Evaluating Assumptions and Adequacy, and Measuring and Reducing Errors.” Statistics in Medicine 15 (4): 361–87. https://doi.org/10.1002/(SICI)1097-0258(19960229)15:4<361::AID-SIM168>3.0.CO;2-4.
Krstajic, Damjan, Ljubomir J Buturovic, David E Leahy, and Simon Thomas. 2014. “Cross-Validation Pitfalls When Selecting and Assessing Regression and Classification Models.” Journal of Cheminformatics 6 (1): 10. https://doi.org/10.1186/1758-2946-6-10.
Noma, Hisashi, Tomohiro Shinozaki, Katsuhiro Iba, Satoshi Teramukai, and Toshi A. Furukawa. 2021. “Confidence Intervals of Prediction Accuracy Measures for Multivariable Prediction Models Based on the Bootstrap‐based Optimism Correction Methods.” Statistics in Medicine 40 (26): 5691–701. https://doi.org/10.1002/sim.9148.
Published: 2022-02-19
Last Updated: 2026-04-28