Brett Klamer

# Performance Measures

Below are a few estimators for summarizing model predictive performance.

## Continuous outcomes

• $$R^2 = 1-\frac{SSE}{SST}$$
• For linear models, the proportion of variability in the response explained by the model.
• For linear models, the squared Pearson correlation between the observed and predicted values. This correlation-based definition is commonly used to generalize $$R^2$$ to other model types.
• Ranges from 0, no association between predictions and outcome, to 1, perfect predictions. Thus, we wish to maximize this value. (Out of sample, $$R^2$$ can be negative when the model predicts worse than the outcome mean.)
• Does not represent agreement/calibration.
• Adjusted $$R^2 = 1 - \frac{SSE / (n-d-1)}{SST / (n-1)}$$
• Where $$d$$ is the number of predictors in the model (excluding the intercept).
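The definitions above can be sketched in a few lines of Python. The outcome and prediction values below are made up for illustration, and $$d$$ is set to 1 as if the predictions came from a single-predictor model:

```python
# Toy data: observed outcomes and model predictions (hypothetical values).
y = [3.0, 5.0, 7.0, 9.0, 11.0]
yhat = [2.8, 5.3, 6.9, 9.4, 10.6]

n = len(y)
ybar = sum(y) / n

# R^2 from sums of squares.
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, yhat))
sst = sum((yi - ybar) ** 2 for yi in y)
r2 = 1 - sse / sst

# R^2 as the squared Pearson correlation between observed and predicted.
fbar = sum(yhat) / n
cov = sum((yi - ybar) * (fi - fbar) for yi, fi in zip(y, yhat))
var_f = sum((fi - fbar) ** 2 for fi in yhat)
r2_corr = cov ** 2 / (sst * var_f)

# Adjusted R^2 with d predictors (d = 1 here, an assumption of the example).
d = 1
r2_adj = 1 - (sse / (n - d - 1)) / (sst / (n - 1))
```

For fitted values from an OLS model with an intercept, the two $$R^2$$ definitions agree exactly; for arbitrary predictions, as here, they can differ slightly.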
• $$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2} = \sqrt{\frac{SSE}{n}}$$
• The standard deviation of the residuals; the typical difference between the observed and fitted values. (Dividing by the residual degrees of freedom instead of $$n$$ gives the residual standard error.)
• It has the same unit of measurement as the outcome.
• We wish to minimize this value.
• Mean absolute error (MAE)
• The magnitude of deviation of the model errors. We wish to minimize this value.
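A minimal sketch of RMSE and MAE on made-up data:

```python
import math

# Toy data: observed outcomes and model predictions (hypothetical values).
y = [3.0, 5.0, 7.0, 9.0, 11.0]
yhat = [2.8, 5.3, 6.9, 9.4, 10.6]
n = len(y)

# Root mean squared error: same units as the outcome.
rmse = math.sqrt(sum((yi - fi) ** 2 for yi, fi in zip(y, yhat)) / n)

# Mean absolute error: less sensitive to large individual errors than RMSE.
mae = sum(abs(yi - fi) for yi, fi in zip(y, yhat)) / n
```

Note that MAE is never larger than RMSE, with equality only when all absolute errors are identical.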
• Mallows’ $$C_p = \frac{1}{n} (SSE + 2d\hat{\sigma}^2)$$
• Where $$\hat{\sigma}^2$$ is an estimate of the error variance, typically from the full model containing all candidate predictors.
• We wish to minimize this value.
• $$\text{AIC} = -2\log L + 2d$$
• Where $$d$$ is the number of parameters in the model.
• For Gaussian linear models, $$C_p$$ and $$\text{AIC}$$ are proportional to each other, so they select the same model.
• We wish to minimize this value.
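Assuming Gaussian errors, the maximized log-likelihood of an OLS fit can be written in terms of SSE, which makes the AIC and $$C_p$$ computations concrete. The SSE, $$n$$, $$d$$, and $$\hat{\sigma}^2$$ values below are hypothetical:

```python
import math

# Quantities from a hypothetical OLS fit.
sse = 0.46    # residual sum of squares
n = 5         # number of observations
d = 2         # parameters counted in the penalty (intercept + 1 slope here;
              # conventions differ on whether the error variance is counted)

# Maximized Gaussian log-likelihood with sigma^2 = SSE / n plugged in.
loglik = -n / 2 * (math.log(2 * math.pi) + math.log(sse / n) + 1)
aic = -2 * loglik + 2 * d

# Mallows' Cp with an external error-variance estimate (hypothetical value).
sigma2 = 0.15
cp = (sse + 2 * d * sigma2) / n
```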
• Calibration
• Measures agreement between predicted and observed values.
• The optimism-adjusted curve should fit closely to the ideal line.

## Binary outcomes

• ROC-AUC
• The ROC-AUC is the probability that the model ranks a random positive observation more highly than a random negative observation.
• We wish to maximize this value.
• Concordant pairs
• The percent of concordant, discordant, and tied comparisons among all pairs of one positive and one negative observation. The AUC is recovered as $$\text{AUC} = \text{proportion concordant} + 0.5 \times \text{proportion tied}$$.
• We wish to maximize this value.
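The pairwise definition makes the AUC identity easy to verify directly. The outcomes and predicted probabilities below are made up:

```python
# Binary outcomes and predicted probabilities (hypothetical values).
y = [0, 0, 1, 1, 1, 0]
p = [0.1, 0.4, 0.35, 0.8, 0.8, 0.8]

# Predictions for the positive and negative observations.
pos = [pi for yi, pi in zip(y, p) if yi == 1]
neg = [pi for yi, pi in zip(y, p) if yi == 0]

# Compare every positive-negative pair.
concordant = sum(pp > pn for pp in pos for pn in neg)
tied = sum(pp == pn for pp in pos for pn in neg)
n_pairs = len(pos) * len(neg)

# AUC = proportion concordant + 0.5 * proportion tied.
auc = (concordant + 0.5 * tied) / n_pairs
```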
• Brier score
• Quadratic scoring rule analogous to mean squared error (MSE).
• The mean squared difference between the observed binary outcomes $$y_i$$ and predicted probabilities $$p_i$$: $$\frac{1}{N}\sum_{i=1}^{N}(y_i-p_i)^2$$
• The square root of the Brier score is the expected distance between the observed and predicted values on the probability scale.
• The Brier score for a model can range from 0 for a perfect model to 0.25 for a non-informative model with a 50% prevalence of the outcome. For population prevalence $$x$$, $$\max(\text{Brier}) = x(1-x)^2 + (1-x)x^2 = x(1-x)$$.
• As with Nagelkerke’s $$R^2$$, a scaled version of the Brier score, $$R^2_{\text{Brier}} = 1 - \frac{\text{Brier}}{\max(\text{Brier})}$$, ranges from 0 to 1 and can be interpreted similarly to Pearson’s $$R^2$$.
• The Brier score performs poorly in situations with low prevalence when high sensitivity is required; in this case, the Brier score will favor a test with high specificity. See also Assel, Sjoberg, and Vickers (2017).
• We wish to minimize this value.
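A sketch of the Brier score and its scaled version, using made-up outcomes and predictions:

```python
# Binary outcomes and predicted probabilities (hypothetical values).
y = [0, 0, 1, 1, 1, 0]
p = [0.1, 0.4, 0.35, 0.8, 0.8, 0.8]
n = len(y)

# Brier score: mean squared difference on the probability scale.
brier = sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / n

# A non-informative model predicts the prevalence for everyone.
prev = sum(y) / n
brier_max = prev * (1 - prev) ** 2 + (1 - prev) * prev ** 2  # = prev * (1 - prev)

# Scaled Brier score, ranging from 0 to 1.
brier_scaled = 1 - brier / brier_max
```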
• Calibration
• Mean calibration = calibration-in-the-large
• Calibration intercept
• Proper calibration leads to an intercept of 0.
• Weak calibration
• Combination of calibration intercept and calibration slope
• Fitting with a proper loss function leads to an apparent intercept of 0 and slope of 1. Cross-validated values have a different intercept and commonly a slope < 1, reflecting overfitting.
• Calibration slope = linear shrinkage factor
• Need to report the calibration intercept together with the calibration slope (Stevens and Poppe 2020; Wang 2020).
• Moderate calibration
• Smoothed calibration plot of observed vs. predicted outcomes.
• Pointwise calibration suffers from the same issues as dichotomization.
• Strong calibration
• Assessment of observed vs. predicted outcomes over multidimensional covariate space.
• Not possible to assess in any real situation (Van Calster et al. 2016).
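Weak calibration can be estimated by regressing the observed outcomes on the logit of the predictions. Below is a self-contained sketch using Newton-Raphson for the two-parameter logistic regression; the outcomes and probabilities are hypothetical, and in practice this would be computed on validation or cross-validated predictions:

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def expit(x):
    # Numerically safe inverse logit.
    if x >= 0:
        return 1 / (1 + math.exp(-x))
    z = math.exp(x)
    return z / (1 + z)

# Validation outcomes and predicted probabilities (hypothetical values).
y = [0, 0, 0, 1, 0, 1, 1, 1, 0, 1]
p = [0.2, 0.3, 0.4, 0.4, 0.5, 0.6, 0.7, 0.8, 0.3, 0.9]
lp = [logit(pi) for pi in p]

# Logistic regression of y on logit(p), fit by Newton-Raphson:
# the intercept a is the calibration intercept, the coefficient b the slope.
a, b = 0.0, 1.0  # start at the ideal values
for _ in range(25):
    mu = [expit(a + b * x) for x in lp]
    w = [m * (1 - m) for m in mu]
    # Score (gradient of the log-likelihood).
    g0 = sum(yi - mi for yi, mi in zip(y, mu))
    g1 = sum((yi - mi) * x for yi, mi, x in zip(y, mu, lp))
    # Observed information matrix entries.
    h00 = sum(w)
    h01 = sum(wi * x for wi, x in zip(w, lp))
    h11 = sum(wi * x * x for wi, x in zip(w, lp))
    det = h00 * h11 - h01 * h01
    # Newton step: (a, b) += H^{-1} g.
    a += (h11 * g0 - h01 * g1) / det
    b += (h00 * g1 - h01 * g0) / det

calibration_intercept, calibration_slope = a, b
```

At convergence the score equations are zero, so the mean predicted probability from the recalibration model matches the observed prevalence (mean calibration).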
• Confusion matrix
• Contingency table of the observed and predicted outcomes. (Note: this forces classification when using probability models, which results in optimization of improper scoring rules)
• Cohen’s Kappa
• A measure which takes into account the error rate that would be expected by chance. (Note: this forces classification when using probability models, which results in optimization of improper scoring rules)
• Ranges from -1, complete discordance, to 1, complete concordance.
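A sketch of the confusion-matrix counts and Cohen's kappa after thresholding predictions at 0.5 (the classes below are made up; as noted above, thresholding forces classification):

```python
# Observed classes and predicted classes after thresholding at 0.5
# (hypothetical values).
y = [0, 0, 1, 1, 1, 0, 1, 0]
yhat = [0, 1, 1, 1, 0, 0, 1, 0]
n = len(y)

# Confusion matrix cells.
tp = sum(o == 1 and c == 1 for o, c in zip(y, yhat))
tn = sum(o == 0 and c == 0 for o, c in zip(y, yhat))
fp = sum(o == 0 and c == 1 for o, c in zip(y, yhat))
fn = sum(o == 1 and c == 0 for o, c in zip(y, yhat))

# Observed agreement and chance-expected agreement.
po = (tp + tn) / n
pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2

# Cohen's kappa: agreement beyond chance.
kappa = (po - pe) / (1 - pe)
```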
• Receiver operating characteristic (ROC) curves
• The curve used to calculate the AUC
• When comparing more than one model, the curves may show that one model has better predictive performance than another in some regions but not others (i.e., crossing curves).
• Precision-recall curve
• Appropriate to summarize information retrieval ability
• The area under the curve ranges from the observed prevalence (performance of a non-informative model) to 1 (best performance).
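The curve can be traced by sweeping a classification threshold over the unique predicted values; the outcomes and probabilities below are made up:

```python
# Binary outcomes and predicted probabilities (hypothetical values).
y = [0, 0, 1, 1, 1, 0]
p = [0.1, 0.4, 0.35, 0.8, 0.7, 0.8]

n_pos = sum(y)

# Precision and recall at each unique threshold, classifying an observation
# as positive when its prediction is >= the threshold.
points = []
for t in sorted(set(p), reverse=True):
    tp = sum(yi == 1 and pi >= t for yi, pi in zip(y, p))
    fp = sum(yi == 0 and pi >= t for yi, pi in zip(y, p))
    points.append((tp / (tp + fp), tp / n_pos))  # (precision, recall)
```

As the threshold drops, recall only increases, while precision can move in either direction; at the lowest threshold precision equals the prevalence.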
• Goodness of fit test (GOF)
• le Cessie-van Houwelingen-Copas-Hosmer unweighted sum of squares test for global goodness of fit. See Hosmer et al. (1997).

## Bibliography

Assel, Melissa, Daniel D. Sjoberg, and Andrew J. Vickers. 2017. “The Brier Score Does Not Evaluate the Clinical Utility of Diagnostic Tests or Prediction Models.” Diagnostic and Prognostic Research 1 (1): 19. https://doi.org/10.1186/s41512-017-0020-3.
Hosmer, D. W., T. Hosmer, S. Le Cessie, and S. Lemeshow. 1997. “A Comparison of Goodness-of-Fit Tests for the Logistic Regression Model.” Statistics in Medicine 16 (9): 965–80. https://doi.org/10.1002/(SICI)1097-0258(19970515)16:9<965::AID-SIM509>3.0.CO;2-O.
Stevens, Richard J., and Katrina K. Poppe. 2020. “Validation of Clinical Prediction Models: What Does the ‘Calibration Slope’ Really Measure?” Journal of Clinical Epidemiology 118 (February): 93–99. https://doi.org/10.1016/j.jclinepi.2019.09.016.
Van Calster, Ben, Daan Nieboer, Yvonne Vergouwe, Bavo De Cock, Michael J. Pencina, and Ewout W. Steyerberg. 2016. “A Calibration Hierarchy for Risk Models Was Defined: From Utopia to Empirical Data.” Journal of Clinical Epidemiology 74 (June): 167–76. https://doi.org/10.1016/j.jclinepi.2015.12.005.
Wang, Junfeng. 2020. “Calibration Slope Versus Discrimination Slope: Shoes on the Wrong Feet.” Journal of Clinical Epidemiology 125 (September): 161–62. https://doi.org/10.1016/j.jclinepi.2020.06.002.
Published: 2022-02-18