# Performance Measures

The measures below are commonly used to summarize a model's predictive performance.

## Continuous outcomes

- \(R^2 = 1-\frac{SSE}{SST}\)
- For linear models, the proportion of variability in the response explained by the model.
- For linear models, the squared Pearson correlation between the observed and predicted values. The squared correlation is also commonly used as a generalized \(R^2\) for other types of models.
- Ranges from 0 (no association between predictions and outcome) to 1 (perfect predictions), so we wish to maximize this value. When computed as \(1-\frac{SSE}{SST}\) on new data, it can fall below 0 if the model predicts worse than the mean of the outcome.
- Does not represent agreement/calibration.

- Adjusted \(R^2 = 1 - \frac{SSE / (n-d-1)}{SST / (n-1)}\)
- Where \(d\) is the number of parameters in the model.
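
The two formulas above can be sketched in a few lines. This is a minimal illustration using only the standard library; the function names and the toy data are assumptions made for this example, not part of any package.

```python
def _sse_sst(y, y_hat):
    """Return (SSE, SST) for observed values y and predictions y_hat."""
    y_bar = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return sse, sst


def r_squared(y, y_hat):
    """R^2 = 1 - SSE / SST."""
    sse, sst = _sse_sst(y, y_hat)
    return 1.0 - sse / sst


def adjusted_r_squared(y, y_hat, d):
    """Adjusted R^2 = 1 - (SSE / (n - d - 1)) / (SST / (n - 1))."""
    n = len(y)
    sse, sst = _sse_sst(y, y_hat)
    return 1.0 - (sse / (n - d - 1)) / (sst / (n - 1))


# Hypothetical observed and predicted values from a one-predictor model (d = 1).
y = [1.0, 2.0, 3.0, 4.0, 5.0]
y_hat = [1.1, 1.9, 3.2, 3.8, 5.0]

print(round(r_squared(y, y_hat), 4))
print(round(adjusted_r_squared(y, y_hat, d=1), 4))
```

The adjusted value is never larger than the unadjusted one, because the penalty grows with \(d\).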

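- \(\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2} = \sqrt{\frac{SSE}{n}}\)
- When reported as the residual standard error of a fitted linear model, the divisor is the residual degrees of freedom, \(\sqrt{\frac{SSE}{d.f.}}\), rather than \(n\).
- The standard deviation of the residuals (the unexplained variation). The typical difference between the observed and fitted values.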
- It has the same unit of measurement as the outcome.
- We wish to minimize this value.

- Mean absolute error (MAE)
- The average magnitude of the model errors, \(\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{f}(x_i)|\). We wish to minimize this value.
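
As a rough sketch of how the two error summaries differ, the snippet below computes both on the same hypothetical data. The function names and values are illustrative assumptions; RMSE here uses the \(\frac{1}{n}\) divisor.

```python
import math


def rmse(y, y_hat):
    """Root mean squared error with a 1/n divisor."""
    n = len(y)
    return math.sqrt(sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat)) / n)


def mae(y, y_hat):
    """Mean absolute error."""
    n = len(y)
    return sum(abs(yi - fi) for yi, fi in zip(y, y_hat)) / n


# Hypothetical data: errors are 1, 0, -2, 0.
y = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.0, 5.0, 9.0, 9.0]

print(rmse(y, y_hat))  # sqrt(5 / 4)
print(mae(y, y_hat))   # 3 / 4
```

Because the errors are squared before averaging, RMSE weights the single large error more heavily than MAE does, so RMSE is at least as large as MAE.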

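- Mallows’s \(C_p = \frac{1}{n} (SSE + 2d\hat{\sigma}^2)\)
- Where \(d\) is the number of parameters in the model and \(\hat{\sigma}^2\) is an estimate of the error variance.
- We wish to minimize this value.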

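- \(\text{AIC} = -2\log L + 2d\)
- Where \(L\) is the maximized likelihood and \(d\) is the number of parameters in the model.
- For least-squares models with Gaussian errors and a common \(\hat{\sigma}^2\), AIC is an increasing linear function of \(C_p\). The two are not numerically equal, but they rank models identically.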
- We wish to minimize this value.
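
The relationship between \(C_p\) and AIC for least-squares models can be checked numerically. The sketch below assumes Gaussian errors with the same plug-in variance estimate \(\hat{\sigma}^2\) for every candidate model; the function names, the candidate models, and their SSE values are all hypothetical.

```python
import math


def mallows_cp(sse, n, d, sigma2):
    """C_p = (SSE + 2 * d * sigma2) / n."""
    return (sse + 2 * d * sigma2) / n


def gaussian_aic(sse, n, d, sigma2):
    """AIC = -2 log L + 2d for a Gaussian model with plug-in variance sigma2."""
    neg2_loglik = n * math.log(2 * math.pi * sigma2) + sse / sigma2
    return neg2_loglik + 2 * d


# Hypothetical candidate models: (SSE, number of parameters d).
candidates = [(120.0, 1), (80.0, 2), (78.0, 3), (77.5, 4)]
n, sigma2 = 50, 2.0

cp = [mallows_cp(sse, n, d, sigma2) for sse, d in candidates]
aic = [gaussian_aic(sse, n, d, sigma2) for sse, d in candidates]

# The values differ, but both criteria order the candidates the same way.
rank_cp = sorted(range(len(candidates)), key=lambda i: cp[i])
rank_aic = sorted(range(len(candidates)), key=lambda i: aic[i])
print(rank_cp, rank_aic)
```

Under these assumptions AIC equals a constant plus \(n C_p / \hat{\sigma}^2\), which is why the rankings agree even though the values do not.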

- Calibration
- Measures agreement between predicted and observed values.
- The optimism adjusted curve should fit closely to the ideal line.

## Binary outcomes

- ROC-AUC
- The ROC-AUC is the probability that the model ranks a random positive observation more highly than a random negative observation.
- We wish to maximize this value.

- Concordant pairs
- The percent of concordant, discordant, and tied pairwise comparisons, where each pair consists of one positive and one negative observation. The AUC follows directly from these percentages: AUC = percent concordant + 0.5 × percent tied.
- We wish to maximize this value.
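
The pairwise definition can be computed directly by comparing every positive observation with every negative one. This is a minimal sketch with hypothetical data and function names; it is quadratic in the sample size, so it is meant for illustration rather than large data sets.

```python
def concordance(y, p):
    """Return the proportions of concordant, discordant, and tied pairs."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    conc = disc = tied = 0
    for p_pos in pos:
        for p_neg in neg:
            if p_pos > p_neg:
                conc += 1  # positive ranked above negative
            elif p_pos < p_neg:
                disc += 1  # positive ranked below negative
            else:
                tied += 1
    total = len(pos) * len(neg)
    return conc / total, disc / total, tied / total


def auc_from_pairs(y, p):
    """AUC = proportion concordant + 0.5 * proportion tied."""
    conc, _, tied = concordance(y, p)
    return conc + 0.5 * tied


# Hypothetical outcomes and predicted probabilities.
y = [1, 1, 1, 0, 0, 0]
p = [0.9, 0.6, 0.4, 0.4, 0.3, 0.7]

print(concordance(y, p))     # 6, 2, and 1 of the 9 pairs
print(auc_from_pairs(y, p))  # 6.5 / 9
```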

- Brier score
- Quadratic scoring rule
- The mean squared difference between the observed binary outcomes \(y\) and the predicted probabilities \(p\): \(\frac{1}{n}\sum_{i=1}^{n}(y_i-p_i)^2\)
- Analogous to mean squared error
- The Brier score ranges from 0 for a perfect model to 0.25 for a non-informative model with a 50% prevalence of the outcome. In general, a non-informative model that predicts the prevalence \(x\) for every observation scores \(\max(\text{Brier})=x(1-x)^2+(1-x)x^2 = x(1-x)\).
- As with Nagelkerke’s \(R^2\), the scaled version of the Brier score can be similar to Pearson’s \(R^2\). \(R^2_{\text{Brier}} = 1 - \frac{\text{Brier}}{\max(\text{Brier})}\) ranges from 0 to 1 for any model that does no worse than the non-informative model.
- We wish to minimize this value.
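
A small sketch of the Brier score and its scaled version follows. The function names and data are hypothetical, and the maximum is taken to be the score of a non-informative model that predicts the observed prevalence for everyone.

```python
def brier_score(y, p):
    """Mean squared difference between binary outcomes and predicted probabilities."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y)


def max_brier(y):
    """Brier score of a model that predicts the prevalence x for every observation."""
    x = sum(y) / len(y)
    return x * (1 - x) ** 2 + (1 - x) * x ** 2  # simplifies to x * (1 - x)


def scaled_brier(y, p):
    """R^2_Brier = 1 - Brier / max(Brier)."""
    return 1.0 - brier_score(y, p) / max_brier(y)


# Hypothetical outcomes (prevalence 0.4) and predicted probabilities.
y = [1, 0, 0, 1, 0]
p = [0.8, 0.2, 0.1, 0.6, 0.3]

print(brier_score(y, p))
print(max_brier(y))
print(scaled_brier(y, p))
```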

- Calibration (Van Calster et al. 2016)
- Mean calibration = calibration-in-the-large
- Calibration intercept
- Proper calibration leads to an intercept of 0.

- Weak calibration
- Combination of calibration intercept and calibration slope
- Proper loss functions lead to an apparent (in-sample) intercept of 0 and slope of 1. Cross-validated values usually have a non-zero intercept and commonly a slope < 1, which indicates overfitting.
- Calibration slope = linear shrinkage factor
- Need to report calibration intercept together with calibration slope (Stevens and Poppe 2020; Wang 2020).
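
One common way to estimate these two quantities is to regress the outcome on the logit of the predicted probabilities. The sketch below assumes that approach: the slope comes from a logistic model with an intercept and slope, and the intercept comes from a model with the slope fixed at 1 (the logit used as an offset). The Newton-Raphson fitting code, function names, and grouped toy data are all illustrative assumptions; in practice a statistics package would do the fitting.

```python
import math


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    return math.log(p / (1.0 - p))


def calibration_slope(y, p, n_iter=25):
    """Fit logit(P(y = 1)) = a + b * logit(p) by Newton-Raphson; return (a, b)."""
    x = [logit(pi) for pi in p]
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        mu = [expit(a + b * xi) for xi in x]
        w = [m * (1 - m) for m in mu]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(yi - m for yi, m in zip(y, mu))
        g1 = sum(xi * (yi - m) for xi, yi, m in zip(x, y, mu))
        # Information matrix entries.
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def calibration_intercept(y, p, n_iter=25):
    """Fit logit(P(y = 1)) = a + logit(p), i.e. slope fixed at 1; return a."""
    x = [logit(pi) for pi in p]
    a = 0.0
    for _ in range(n_iter):
        mu = [expit(a + xi) for xi in x]
        a += sum(yi - m for yi, m in zip(y, mu)) / sum(m * (1 - m) for m in mu)
    return a


# Hypothetical data: three risk groups whose observed event rates are 0.2, 0.5, 0.8.
y = [1] * 2 + [0] * 8 + [1] * 5 + [0] * 5 + [1] * 8 + [0] * 2
p_good = [0.2] * 10 + [0.5] * 10 + [0.8] * 10
# Predictions that are too extreme: the logit is doubled.
p_extreme = [expit(2 * logit(pi)) for pi in p_good]

print(calibration_intercept(y, p_good), calibration_slope(y, p_good)[1])
print(calibration_intercept(y, p_extreme), calibration_slope(y, p_extreme)[1])
```

In this toy example the well-calibrated predictions give a slope near 1, while the overly extreme predictions give a slope near 0.5 even though the intercept stays near 0. That is one reason to report the two together.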

- Moderate calibration
- Smoothed calibration plot of observed vs. predicted outcomes.
- Pointwise calibration suffers from the same issues as dichotomization.

- Strong calibration
- Assessment of observed vs. predicted outcomes over multidimensional covariate space.
- Not achievable in any real situation.

- Confusion matrix
- Contingency table of the observed and predicted outcomes. (Note: this forces classification when using probability models, which results in optimization of improper scoring rules)

- Cohen’s Kappa
- A measure which takes into account the error rate that would be expected by chance. (Note: this forces classification when using probability models, which results in optimization of improper scoring rules)
- Ranges from -1, complete discordance, to 1, complete concordance.
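
To make the chance correction concrete, the sketch below builds a 2×2 confusion matrix and computes Cohen's kappa from it. The 0.5 classification threshold, the function names, and the data are assumptions chosen only for illustration; as noted above, thresholding discards information from a probability model.

```python
def confusion_matrix(y, y_pred):
    """Return (tp, fp, fn, tn) for binary outcomes and binary predictions."""
    tp = sum(1 for a, b in zip(y, y_pred) if a == 1 and b == 1)
    fp = sum(1 for a, b in zip(y, y_pred) if a == 0 and b == 1)
    fn = sum(1 for a, b in zip(y, y_pred) if a == 1 and b == 0)
    tn = sum(1 for a, b in zip(y, y_pred) if a == 0 and b == 0)
    return tp, fp, fn, tn


def cohens_kappa(tp, fp, fn, tn):
    """kappa = (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = tp + fp + fn + tn
    p_obs = (tp + tn) / n
    # Agreement expected by chance, from the row and column totals.
    p_exp = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return (p_obs - p_exp) / (1 - p_exp)


# Hypothetical outcomes and predicted probabilities, classified at 0.5.
y = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
p = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.1]
y_pred = [1 if pi >= 0.5 else 0 for pi in p]

tp, fp, fn, tn = confusion_matrix(y, y_pred)
print(tp, fp, fn, tn)
print(cohens_kappa(tp, fp, fn, tn))
```

Here the raw accuracy is 0.8, but kappa is lower because roughly half of that agreement would be expected by chance.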

- Receiver operating characteristic (ROC) curves
- The curve used to calculate the AUC
- When comparing models, crossing curves can show that one model has better predictive performance than another over some ranges of the curve and worse performance over others.

- Precision-recall curve
- Appropriate to summarize information retrieval ability
- The area under the curve ranges from approximately the observed prevalence (the performance of a non-informative model) to 1 (best performance).
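
The link between the precision-recall curve and the prevalence can be seen by tracing the curve's points. This is a minimal sketch that assumes thresholds are taken at each distinct predicted probability; the function name and data are hypothetical.

```python
def precision_recall_points(y, p):
    """Return (threshold, precision, recall) for each distinct predicted value."""
    n_pos = sum(y)
    points = []
    for t in sorted(set(p), reverse=True):
        # Classify as positive when the predicted probability is at least t.
        tp = sum(1 for yi, pi in zip(y, p) if pi >= t and yi == 1)
        fp = sum(1 for yi, pi in zip(y, p) if pi >= t and yi == 0)
        points.append((t, tp / (tp + fp), tp / n_pos))
    return points


# Hypothetical outcomes (prevalence 0.4) and predicted probabilities.
y = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
p = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.1]

for threshold, precision, recall in precision_recall_points(y, p):
    print(threshold, round(precision, 3), round(recall, 3))
```

At the lowest threshold every observation is classified as positive, so recall is 1 and precision equals the prevalence. A model with no ranking ability stays near that precision at every threshold.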

- Goodness of fit test (GOF)
- le Cessie-van Houwelingen-Copas-Hosmer unweighted sum of squares test for global goodness of fit. See Hosmer et al. (1997).

## Bibliography

Hosmer, D. W., T. Hosmer, S. Le Cessie, and S. Lemeshow. 1997. “A Comparison of Goodness-of-Fit Tests for the Logistic Regression Model.” *Statistics in Medicine* 16 (9): 965–80. https://doi.org/10.1002/(SICI)1097-0258(19970515)16:9<965::AID-SIM509>3.0.CO;2-O.

Stevens, Richard J., and Katrina K. Poppe. 2020. “Validation of Clinical Prediction Models: What Does the ‘Calibration Slope’ Really Measure?” *Journal of Clinical Epidemiology* 118 (February): 93–99. https://doi.org/10.1016/j.jclinepi.2019.09.016.

Van Calster, Ben, Daan Nieboer, Yvonne Vergouwe, Bavo De Cock, Michael J. Pencina, and Ewout W. Steyerberg. 2016. “A Calibration Hierarchy for Risk Models Was Defined: From Utopia to Empirical Data.” *Journal of Clinical Epidemiology* 74 (June): 167–76. https://doi.org/10.1016/j.jclinepi.2015.12.005.

Wang, Junfeng. 2020. “Calibration Slope Versus Discrimination Slope: Shoes on the Wrong Feet.” *Journal of Clinical Epidemiology* 125 (September): 161–62. https://doi.org/10.1016/j.jclinepi.2020.06.002.

Published: 2022-02-18