Brett Klamer

Performance Measures

Below are a few estimators for summarizing model predictive performance. A short computational sketch follows each list.

Continuous outcomes

  • \(R^2 = 1-\frac{SSE}{SST}\)
    • For linear models, the proportion of variability in the response explained by the model.
    • For linear models, the squared Pearson correlation between the observed and predicted values. The squared correlation is also commonly used as a generalized \(R^2\) for other model types.
    • Ranges from 0 (no association between predictions and outcome) to 1 (perfect predictions), so we wish to maximize this value.
    • Does not represent agreement/calibration.
  • Adjusted \(R^2 = 1 - \frac{SSE / (n-d-1)}{SST / (n-1)}\)
    • Where \(d\) is the number of predictors in the model.
  • \(\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2} = \sqrt{\frac{SSE}{n}}\) (the residual standard error instead divides \(SSE\) by its degrees of freedom, \(n-d-1\))
    • The standard deviation of the residuals, i.e. the square root of the unexplained variance. The typical difference between the observed and fitted values.
    • It has the same unit of measurement as the outcome.
    • We wish to minimize this value.
  • Mean absolute error (MAE)
    • The mean of the absolute model errors; less sensitive to large errors than RMSE. We wish to minimize this value.
  • Mallows’s \(C_p = \frac{1}{n} (SSE + 2d\hat{\sigma}^2)\)
    • Where \(d\) is the number of predictors and \(\hat{\sigma}^2\) is an estimate of the error variance.
    • We wish to minimize this value.
  • \(\text{AIC} = -2\log L + 2d\)
    • Where \(d\) is the number of parameters in the model.
    • For linear models with Gaussian errors, \(C_p\) and AIC are proportional, so they rank models identically.
    • We wish to minimize this value.
  • Calibration
    • Measures agreement between predicted and observed values.
    • The optimism-adjusted curve should lie close to the ideal line.
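
As a concrete illustration, here is a minimal Python sketch of the continuous-outcome measures above, computed for a toy OLS fit. The simulated data, variable names, and the parameter count used for AIC are my assumptions, not part of the original list.

```python
# Minimal sketch of the continuous-outcome measures on a toy OLS fit.
# The simulated data and parameter counting are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 3                           # n observations, d predictors
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=1.5, size=n)

# OLS fit with an intercept column
Xd = np.column_stack([np.ones(n), X])
beta = np.linalg.lstsq(Xd, y, rcond=None)[0]
y_hat = Xd @ beta

sse = np.sum((y - y_hat) ** 2)
sst = np.sum((y - y.mean()) ** 2)

r2 = 1 - sse / sst
adj_r2 = 1 - (sse / (n - d - 1)) / (sst / (n - 1))
rmse = np.sqrt(sse / n)                 # divide by n - d - 1 for the residual SE
mae = np.mean(np.abs(y - y_hat))

# Gaussian log-likelihood at the MLE; AIC = -2 log L + 2 * (number of
# parameters), here counted as d slopes + intercept + error variance.
log_lik = -n / 2 * (np.log(2 * np.pi) + np.log(sse / n) + 1)
aic = -2 * log_lik + 2 * (d + 2)

print(f"R2={r2:.3f} adjR2={adj_r2:.3f} RMSE={rmse:.3f} MAE={mae:.3f} AIC={aic:.1f}")
```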

Binary outcomes

  • ROC-AUC
    • The ROC-AUC is the probability that the model ranks a random positive observation more highly than a random negative observation.
    • We wish to maximize this value.
  • Concordant pairs
    • The percents of concordant, discordant, and tied pairwise comparisons, where each pair consists of one positive and one negative observation. The concordant percent is directly related to the AUC: \(\text{AUC} = \text{proportion concordant} + 0.5 \times \text{proportion tied}\).
    • We wish to maximize this value.
  • Brier score
    • Quadratic scoring rule analogous to mean squared error (MSE).
    • The mean squared difference between the observed binary outcomes \(y_i\) and the predicted probabilities \(p_i\): \(\frac{1}{N}\sum_{i=1}^{N}(y_i-p_i)^2\)
    • The square root of the Brier score is the expected distance between the observed and predicted values on the probability scale.
    • The Brier score for a model can range from 0 for a perfect model to 0.25 for a non-informative model with a 50% prevalence of the outcome. For population prevalence \(x\), \(\max(\text{Brier}) = x(1-x)^2 + (1-x)x^2 = x(1-x)\).
    • As with Nagelkerke’s \(R^2\), the scaled version of the Brier score is similar in spirit to Pearson’s \(R^2\): \(R^2_{\text{Brier}} = 1 - \frac{\text{Brier}}{\max(\text{Brier})}\) ranges from 0 to 1.
    • The Brier score performs poorly in situations with low prevalence when high sensitivity is required; in this case, the Brier score will favor a test with high specificity. See also Assel, Sjoberg, and Vickers (2017).
    • We wish to minimize this value.
  • Calibration (Van Calster et al. 2016)
    • Mean calibration = calibration-in-the-large
      • Calibration intercept
      • Proper calibration leads to an intercept of 0.
    • Weak calibration
      • Combination of calibration intercept and calibration slope
      • Proper loss functions lead to an apparent intercept of 0 and slope of 1. Cross-validated estimates typically have a nonzero intercept and, commonly, a slope < 1 due to overfitting.
      • Calibration slope = linear shrinkage factor
      • Need to report calibration intercept together with calibration slope (Stevens and Poppe 2020; Wang 2020).
    • Moderate calibration
      • Smoothed calibration plot of observed vs. predicted outcomes.
      • Pointwise calibration suffers from the same issues as dichotomization.
    • Strong calibration
      • Assessment of observed vs. predicted outcomes over multidimensional covariate space.
      • Not feasible in any real situation.
  • Confusion matrix
    • Contingency table of the observed and predicted outcomes. (Note: this forces classification when using probability models, which results in optimization of improper scoring rules)
  • Cohen’s Kappa
    • A measure which takes into account the error rate that would be expected by chance. (Note: this forces classification when using probability models, which results in optimization of improper scoring rules)
    • Ranges from -1, complete discordance, to 1, complete concordance.
  • Receiver operating characteristic (ROC) curves
    • The curve used to calculate the AUC
    • When comparing more than one model, crossing curves may indicate that one model has better predictive performance in some regions of risk than another.
  • Precision-recall curve
    • Appropriate for summarizing information-retrieval ability.
    • The area under the curve ranges from the observed prevalence (worst performance, a non-informative model) to 1 (best performance).
  • Goodness of fit test (GOF)
    • le Cessie-van Houwelingen-Copas-Hosmer unweighted sum of squares test for global goodness of fit. See Hosmer et al. (1997).
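
Below is a minimal Python sketch of the main binary-outcome measures: AUC computed directly from concordant and tied pairs, the Brier score and its scaled version, and the calibration intercept and slope via logistic recalibration. The simulated data and the deliberately overconfident predictions are my assumptions for illustration.

```python
# Minimal sketch of the binary-outcome measures; toy data and the
# overconfident predictions are illustrative assumptions.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000
lp_true = rng.normal(size=n)                  # true linear predictor
y = rng.binomial(1, 1 / (1 + np.exp(-lp_true)))
lp = 1.5 * lp_true                            # overconfident model logits
p = 1 / (1 + np.exp(-lp))                     # predicted probabilities

# ROC-AUC from pairwise comparisons of positives vs. negatives:
# AUC = proportion concordant + 0.5 * proportion tied
diff = p[y == 1][:, None] - p[y == 0][None, :]
auc = np.mean(diff > 0) + 0.5 * np.mean(diff == 0)

# Brier score and its scaled version, using max(Brier) = x(1 - x)
brier = np.mean((y - p) ** 2)
x = y.mean()
r2_brier = 1 - brier / (x * (1 - x))

# Calibration intercept: intercept-only logistic model with logit(p) as offset
fam = sm.families.Binomial()
intercept = sm.GLM(y, np.ones(n), family=fam, offset=lp).fit().params[0]

# Calibration slope: logistic regression of y on logit(p)
slope = sm.GLM(y, sm.add_constant(lp), family=fam).fit().params[1]

print(f"AUC={auc:.3f} Brier={brier:.3f} scaled Brier={r2_brier:.3f}")
print(f"calibration intercept={intercept:.2f} slope={slope:.2f}")
```

Because the toy logits are inflated by a factor of 1.5, the recovered calibration slope lands near 1/1.5, illustrating the slope < 1 pattern described under weak calibration.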

Bibliography

Assel, Melissa, Daniel D. Sjoberg, and Andrew J. Vickers. 2017. “The Brier Score Does Not Evaluate the Clinical Utility of Diagnostic Tests or Prediction Models.” Diagnostic and Prognostic Research 1 (1): 19. https://doi.org/10.1186/s41512-017-0020-3.
Hosmer, D. W., T. Hosmer, S. Le Cessie, and S. Lemeshow. 1997. “A Comparison of Goodness-of-Fit Tests for the Logistic Regression Model.” Statistics in Medicine 16 (9): 965–80. https://doi.org/10.1002/(SICI)1097-0258(19970515)16:9<965::AID-SIM509>3.0.CO;2-O.
Stevens, Richard J., and Katrina K. Poppe. 2020. “Validation of Clinical Prediction Models: What Does the ‘Calibration Slope’ Really Measure?” Journal of Clinical Epidemiology 118 (February): 93–99. https://doi.org/10.1016/j.jclinepi.2019.09.016.
Van Calster, Ben, Daan Nieboer, Yvonne Vergouwe, Bavo De Cock, Michael J. Pencina, and Ewout W. Steyerberg. 2016. “A Calibration Hierarchy for Risk Models Was Defined: From Utopia to Empirical Data.” Journal of Clinical Epidemiology 74 (June): 167–76. https://doi.org/10.1016/j.jclinepi.2015.12.005.
Wang, Junfeng. 2020. “Calibration Slope Versus Discrimination Slope: Shoes on the Wrong Feet.” Journal of Clinical Epidemiology 125 (September): 161–62. https://doi.org/10.1016/j.jclinepi.2020.06.002.
Published: 2022-02-18