Brett Klamer

Performance Measures

Below are a few measures for summarizing model predictive performance.

Continuous outcomes

  • \(R^2 = 1-\frac{SSE}{SST}\)
    • For linear models, the proportion of variability in the response explained by the model.
    • For linear models, the squared Pearson correlation between the observed and predicted values. This method is commonly used as a generalized form for other models as well.
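    • As a squared correlation, ranges from 0, no association between predictions and outcome, to 1, perfect fit/predictions. Computed as \(1-\frac{SSE}{SST}\) on new data, it can be negative when the predictions are worse than the mean of the observed outcome.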
    • Does not represent agreement/calibration.
    • We wish to maximize this value.
  • Adjusted \(R^2 = 1 - \frac{SSE / (n-p)}{SST / (n-1)}\)
    • Where \(p\) is the number of fitted parameters, including the intercept.
    • We wish to maximize this value.
  • \(\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2} = \sqrt{\frac{SSE}{n}}\)
    • The root mean squared prediction error.
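    • It answers: “What is the typical difference between the predicted value and observed value, in the outcome’s units?” Strictly, it is the quadratic mean (root mean square) of the prediction errors, not their arithmetic mean.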
    • The residual standard error estimates the residual standard deviation as \(\sqrt{\frac{SSE}{n-p}}\), where \(p\) is the number of fitted parameters.
    • We wish to minimize this value.
  • \(\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i-\hat{f}(x_i)|\)
    • The mean absolute prediction error.
    • It answers: “What is the average absolute difference between the predicted value and observed value, in the outcome’s units?”
    • It has the same unit of measurement as the outcome.
    • MAE is less sensitive to very large errors than RMSE because errors are not squared.
    • We wish to minimize this value.
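  • Mallows’s \(C_p = \frac{1}{n} (SSE + 2d\hat{\sigma}^2)\)
    • Where \(d\) is the number of predictors and \(\hat{\sigma}^2\) is an estimate of the error variance.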
    • We wish to minimize this value.
  • \(\text{AIC} = -2 \log L + 2p\)
    • Where \(p\) is the number of estimated parameters in the model.
    • An information criterion for comparing fitted models using the same outcome likelihood.
    • We wish to minimize this value.
  • Calibration
    • Measures agreement between predicted and observed values on the outcome scale.
    • For continuous outcomes, this is often assessed by regressing the observed outcome on the predicted value.
    • Ideal calibration has intercept 0 and slope 1.
    • A calibration curve can show whether observed values agree with predicted values across the prediction range.
    • The apparent or optimism-adjusted calibration curve should be close to the 45-degree line.
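
As a rough illustration of the continuous-outcome measures above, the following Python sketch computes \(R^2\), adjusted \(R^2\), RMSE, MAE, and a calibration intercept and slope from the regression of observed on predicted values. The function name `continuous_metrics`, its arguments, and the data are assumptions made for this example; they do not come from any particular package.

```python
import math


def continuous_metrics(y, y_hat, p):
    """Summarize predictions for a continuous outcome (illustrative helper).

    y     : observed outcomes
    y_hat : predicted values
    p     : number of fitted parameters, including the intercept (assumed known)
    """
    n = len(y)
    y_bar = sum(y) / n
    f_bar = sum(y_hat) / n

    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # error sum of squares
    sst = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares

    # Calibration: least-squares regression of observed on predicted values.
    sxy = sum((fi - f_bar) * (yi - y_bar) for yi, fi in zip(y, y_hat))
    sxx = sum((fi - f_bar) ** 2 for fi in y_hat)
    slope = sxy / sxx
    intercept = y_bar - slope * f_bar

    return {
        "r2": 1 - sse / sst,
        "adj_r2": 1 - (sse / (n - p)) / (sst / (n - 1)),
        "rmse": math.sqrt(sse / n),
        "mae": sum(abs(yi - fi) for yi, fi in zip(y, y_hat)) / n,
        "cal_intercept": intercept,
        "cal_slope": slope,
    }


# Made-up data for illustration only.
y = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.5, 5.5, 6.5, 9.5]
for name, value in continuous_metrics(y, y_hat, p=2).items():
    print(f"{name}: {value:.3f}")
```

With these made-up values, \(R^2\) is 0.95 while the calibration slope is 0.88, so a high \(R^2\) coexists with predictions that are slightly too spread out. This is one way to see that \(R^2\) does not represent calibration.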

Ordinal outcomes

  • Concordance Index (C-index)
    • Measures how well the model ranks observations by outcome level.
    • The C-index is the probability that, among two observations with different outcome levels, the observation with the higher outcome level receives the higher predicted score, with tied predictions counted as 0.5.
    • The interpretation depends on the predicted score used, such as a linear predictor, an expected outcome score, or the predicted probability of being at or above a chosen outcome level.
    • The C-index measures rank discrimination, not calibration or probability accuracy.
    • We wish to maximize this value.
  • Ranked probability score (RPS)
    • A proper scoring rule for ordinal outcomes that uses the full predicted probability distribution.
    • For one observation with \(K\) ordered outcome categories, \(RPS = \sum_{k=1}^{K-1}(\hat{F}_k - F_k)^2\), where \(\hat{F}_k\) is the predicted cumulative probability up to category \(k\) and \(F_k\) is the observed cumulative indicator.
    • For a dataset, the RPS is averaged over observations.
    • RPS rewards probability assigned near the observed category and penalizes probability assigned far from it.
    • RPS is 0 for a perfect prediction.
    • Some definitions divide by \(K-1\) to put the score on a comparable scale across outcomes with different numbers of categories.
    • A scaled ranked probability score (also called ranked probability skill score) can compare a model with a reference model such as the marginal outcome distribution: \(RPSS = 1 - \frac{\text{RPS}}{\text{RPS}_{\text{ref}}}\).
    • The RPSS equals 1 for perfect predictions, equals 0 when the model matches the reference model, and can be negative when the model performs worse than the reference model.
    • We wish to minimize the RPS and maximize the RPSS.
  • Log score
    • A proper scoring rule based on the predicted probability assigned to the observed category.
    • For one observation, the log score is \(-\log(\hat{p}_y)\), where \(\hat{p}_y\) is the predicted probability of the observed category.
    • It strongly penalizes assigning very low probability to the observed category.
    • Unlike RPS, it does not directly use the ordering of categories.
    • We wish to minimize this value.
  • Classification summaries
    • Exact accuracy is the proportion of observations whose predicted category equals the observed category.
    • Accuracy within one category is the proportion of observations whose predicted category is equal to the observed category or within one adjacent category.
    • Weighted kappa measures agreement between observed and predicted categories, adjusted for chance agreement, while penalizing larger category disagreements more than smaller disagreements.
    • These summaries require reducing predictions to a single category and do not evaluate the full predicted probability distribution.
  • Ordinal calibration
    • Measures whether predicted probabilities agree with observed frequencies for ordered outcome levels.
    • Calibration can be assessed for category probabilities or cumulative probabilities, such as \(P(Y \leq k)\) or \(P(Y \geq k)\).
    • A well-calibrated ordinal model gives reliable probabilities across the outcome scale.
    • Calibration is distinct from rank discrimination.
    • We want calibration curves close to the 45-degree line for the relevant category or cumulative probabilities.
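
The ordinal measures above can be sketched in a few lines of Python. The helpers `rps`, `log_score`, and `c_index` are hypothetical names written for this example, categories are assumed to be coded as 0-based integers in their natural order, and the data are made up. The C-index uses a simple pairwise loop, which is fine for illustration but slow for large datasets.

```python
import math


def rps(probs, y):
    """Ranked probability score for one observation.

    probs : predicted probabilities for K ordered categories (sum to 1)
    y     : observed category as a 0-based index
    """
    score = 0.0
    cum_pred = 0.0
    for k in range(len(probs) - 1):
        cum_pred += probs[k]               # predicted P(Y <= k)
        cum_obs = 1.0 if y <= k else 0.0   # observed cumulative indicator
        score += (cum_pred - cum_obs) ** 2
    return score


def log_score(probs, y):
    """Negative log of the probability assigned to the observed category."""
    return -math.log(probs[y])


def c_index(scores, outcomes):
    """Concordance over pairs with different outcome levels; ties count 0.5."""
    concordant = tied = comparable = 0
    n = len(scores)
    for i in range(n):
        for j in range(i + 1, n):
            if outcomes[i] == outcomes[j]:
                continue  # same outcome level: pair is not comparable
            comparable += 1
            hi, lo = (i, j) if outcomes[i] > outcomes[j] else (j, i)
            if scores[hi] > scores[lo]:
                concordant += 1
            elif scores[hi] == scores[lo]:
                tied += 1
    return (concordant + 0.5 * tied) / comparable


# Observed category is the lowest (index 0). Both predictions give it
# probability 0.1, but `far` puts most of its mass two categories away.
near = [0.1, 0.8, 0.1]
far = [0.1, 0.1, 0.8]
print(rps(near, 0), rps(far, 0))              # RPS penalizes `far` more
print(log_score(near, 0), log_score(far, 0))  # log score is identical

# Made-up predicted scores and ordinal outcomes for the C-index.
print(c_index([0.2, 0.5, 0.9, 0.2], [0, 1, 2, 1]))
```

The two example predictions show the contrast described in the list: the RPS distinguishes them because it uses the category ordering, while the log score does not.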

Binary outcomes

  • ROC-AUC
    • Area under the receiver operating characteristic curve.
    • The ROC curve plots sensitivity against 1-specificity across all possible prediction thresholds.
    • For binary outcomes, ROC-AUC is the probability that the model ranks a random observation with the outcome more highly than a random observation without the outcome, with tied predictions counted as 0.5.
    • ROC-AUC measures rank discrimination, not calibration or clinical utility.
    • We wish to maximize this value.
  • Concordance Index (C-index)
    • The proportion of comparable outcome pairs whose predicted risks are correctly ordered, with tied predictions usually counted as 0.5.
    • A pair is concordant when the observation with the worse or higher-risk outcome also has the higher predicted risk.
    • For ordinary uncensored binary outcomes, the C-index is equivalent to ROC-AUC when computed from the same predicted risks.
    • For ordinary uncensored binary outcomes, the C-index is the probability that the model assigns a higher predicted risk to a randomly selected observation with the outcome than to a randomly selected observation without the outcome, with tied predictions counted as 0.5.
    • For binary outcomes, \(C = \frac{\text{concordant pairs} + 0.5 \times \text{tied prediction pairs}}{\text{comparable pairs}}\).
    • The C-index is more general than ROC-AUC because it can also be defined for ordinal, continuous, and time-to-event outcomes, although the definition of comparable pairs changes by outcome type.
    • The C-index measures rank discrimination, not calibration or clinical utility.
    • We wish to maximize this value.
  • Brier score
    • Quadratic scoring rule analogous to mean squared error (MSE).
    • The mean squared difference between the observed binary outcomes \(y_i\) and the predicted probabilities \(p_i\): \(\frac{1}{N}\sum_{i=1}^{N}(y_i-p_i)^2\).
    • The square root of the Brier score is the quadratic mean of the differences between the observed and predicted values on the probability scale.
    • The Brier score ranges from 0 for perfect predictions to 1 for the worst possible deterministic predictions.
    • A non-informative model that always predicts the outcome prevalence \(x\) has Brier score \(x(1-x)^2+(1-x)x^2=x(1-x)\), which equals 0.25 when \(x=0.5\).
    • A scaled Brier score (also called Brier Skill Score) can compare a model with a reference model such as the prevalence-only model: \(BSS = 1 - \frac{\text{Brier}}{\text{Brier}_{\text{null}}}\). The scaled Brier score equals 1 for perfect predictions, equals 0 when the model matches the reference model, and can be negative when the model performs worse than the reference model.
    • The Brier score is a valid overall probability score, but it does not measure clinical utility and can be misleading when the decision problem requires high sensitivity at low outcome prevalence. See Assel, Sjoberg, and Vickers (2017).
    • We wish to minimize the Brier score and maximize the Brier skill score.
  • Calibration (Van Calster et al. 2016)
    • Measures agreement between predicted probabilities and observed outcome frequencies.
    • A well-calibrated model gives predictions that are correct on the probability scale.
    • Calibration is distinct from discrimination: a model can rank observations well but still predict risks that are too high or too low.
    • Mean calibration, also called calibration-in-the-large
      • Compares the mean predicted risk with the observed outcome proportion.
      • For binary logistic models, this is often summarized by a calibration intercept.
      • Ideal calibration has intercept 0.
      • A negative intercept suggests predictions are too high on average.
      • A positive intercept suggests predictions are too low on average.
    • Weak calibration
      • Assesses both calibration-in-the-large and calibration slope.
      • For binary logistic models, this is commonly evaluated by regressing the outcome on the model’s predicted log odds.
      • Ideal calibration has intercept 0 and slope 1.
      • A slope less than 1 suggests predictions are too extreme, often from overfitting.
      • A slope greater than 1 suggests predictions are not extreme enough.
      • The calibration intercept and calibration slope should usually be reported together (Stevens and Poppe 2020; Wang 2020).
    • Moderate calibration
      • Assesses whether observed outcome frequencies agree with predicted risks across the full range of predictions.
      • Often shown with a smoothed calibration curve of observed outcome frequency versus predicted risk.
      • Ideal calibration follows the 45-degree line.
      • Grouped calibration plots can hide problems because they depend on arbitrary risk categories.
    • Strong calibration
      • Requires agreement between predicted and observed outcome risk for every covariate pattern.
      • This is the most demanding form of calibration.
      • It is generally not verifiable in real data because most covariate patterns have too few observations.
    • We want calibration intercept close to 0, calibration slope close to 1, and the calibration curve close to the 45-degree line.
  • Confusion matrix
    • Contingency table of the observed and predicted outcomes. (Note: this forces classification when using probability models, which results in optimization of improper scoring rules.)
  • Cohen’s Kappa
    • A measure of agreement between observed and predicted classes that accounts for the agreement expected by chance. (Note: this forces classification when using probability models, which results in optimization of improper scoring rules.)
    • Ranges from -1, complete discordance, to 1, complete concordance.
  • Receiver operating characteristic (ROC) curves
    • Visualize the sensitivity and specificity tradeoff across all possible prediction thresholds.
    • When comparing models, crossing ROC curves can show that one model performs better in one threshold region while another performs better elsewhere.
  • Precision-recall curve
    • Appropriate for summarizing information retrieval ability.
    • The area under the curve is approximately the observed prevalence for a non-informative model and 1 for a perfect model.
  • Goodness of fit test (GOF)
    • le Cessie-van Houwelingen-Copas-Hosmer unweighted sum of squares test for global goodness of fit. See Hosmer et al. (1997).
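
A minimal Python sketch of three of the binary-outcome measures above: the Brier score, the Brier skill score against a prevalence-only reference, and ROC-AUC computed directly from its pairwise definition. The function names and data are assumptions made for this example. The calibration intercept and slope are not shown because they require fitting a logistic regression.

```python
def brier_score(y, p):
    """Mean squared difference between 0/1 outcomes and predicted risks."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y)


def brier_skill_score(y, p):
    """Scaled Brier score relative to always predicting the prevalence."""
    prevalence = sum(y) / len(y)
    brier_null = prevalence * (1 - prevalence)
    return 1 - brier_score(y, p) / brier_null


def roc_auc(y, p):
    """Probability that a random event is ranked above a random non-event.

    Uses all event/non-event pairs; tied predictions count as 0.5.
    """
    events = [pi for yi, pi in zip(y, p) if yi == 1]
    non_events = [pi for yi, pi in zip(y, p) if yi == 0]
    total = 0.0
    for e in events:
        for ne in non_events:
            if e > ne:
                total += 1.0
            elif e == ne:
                total += 0.5
    return total / (len(events) * len(non_events))


# Made-up outcomes and predicted risks for illustration only.
y = [0, 0, 1, 1, 0, 1]
p = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

print("Brier score:      ", round(brier_score(y, p), 4))
print("Brier skill score:", round(brier_skill_score(y, p), 4))
print("ROC-AUC:          ", round(roc_auc(y, p), 4))
print("Mean predicted vs. observed:", sum(p) / len(p), sum(y) / len(y))
```

The last line is a crude check of mean calibration: here the mean predicted risk (0.425) is below the observed proportion (0.5), so the predictions are slightly too low on average even though the ranking is good.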

Bibliography

Assel, Melissa, Daniel D. Sjoberg, and Andrew J. Vickers. 2017. “The Brier Score Does Not Evaluate the Clinical Utility of Diagnostic Tests or Prediction Models.” Diagnostic and Prognostic Research 1 (1): 19. https://doi.org/10.1186/s41512-017-0020-3.
Hosmer, D. W., T. Hosmer, S. Le Cessie, and S. Lemeshow. 1997. “A Comparison of Goodness-of-Fit Tests for the Logistic Regression Model.” Statistics in Medicine 16 (9): 965–80. https://doi.org/10.1002/(SICI)1097-0258(19970515)16:9<965::AID-SIM509>3.0.CO;2-O.
Stevens, Richard J., and Katrina K. Poppe. 2020. “Validation of Clinical Prediction Models: What Does the ‘Calibration Slope’ Really Measure?” Journal of Clinical Epidemiology 118 (February): 93–99. https://doi.org/10.1016/j.jclinepi.2019.09.016.
Van Calster, Ben, Daan Nieboer, Yvonne Vergouwe, Bavo De Cock, Michael J. Pencina, and Ewout W. Steyerberg. 2016. “A Calibration Hierarchy for Risk Models Was Defined: From Utopia to Empirical Data.” Journal of Clinical Epidemiology 74 (June): 167–76. https://doi.org/10.1016/j.jclinepi.2015.12.005.
Wang, Junfeng. 2020. “Calibration Slope Versus Discrimination Slope: Shoes on the Wrong Feet.” Journal of Clinical Epidemiology 125 (September): 161–62. https://doi.org/10.1016/j.jclinepi.2020.06.002.
Published: 2022-02-18
Last Updated: 2026-04-29