Statistical Models

A statistical model should be designed for one of the following purposes (Breiman 2001; Shmueli 2010; Westreich and Greenland 2013; Hernán, Hsu, and Healy 2019; Laubach et al. 2021; Carlin and Moreno‐Betancur 2025):

Description
- Answers the question: “What are the observed characteristics or associations among observations?”
- Describes what has happened using descriptive statistics and visualizations.
Prediction
- Answers the question: “How well can we predict the outcome variable and what predictor variables were important in predicting the outcome variable?”
- Identifies a set of predictor variables which explain the maximum amount of cross-validated variation in the outcome variable.
Explanation/Association
- Answers the question: “What is the minimally biased association between the predictor of interest and the outcome?”
- Estimates the marginal/conditional relationship between a predictor and the outcome of interest using non-causal language (association, relation, or correlation).
Causation
- Answers the question: “What is the minimally biased effect of the predictor on the outcome?”
- Estimates the causal (asymptotically unbiased) relationship between a predictor and outcome of interest using causal language (effect, affect, or cause).

Prediction Models

Prediction performance is judged by the model’s ability to provide valid predictions for new cases. The generalizability of a model is assessed through internal and external validation. Internal validation reuses the data that was used to build the model to assess its validity (e.g. cross validation or bootstrapping). External validation uses new data, independent of the data used to build the model, to assess its validity. External validation is the benchmark method of assessing prediction performance because internal validation may inherit any bias present in the original dataset.
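As a rough illustration of internal validation, the sketch below runs k-fold cross validation on simulated data using only the Python standard library. The data-generating process, the simple linear model, and the helper names (`fit_line`, `mse`, `k_fold_indices`, `cross_validated_mse`) are assumptions made for this example rather than part of any particular workflow.

```python
import random
from statistics import fmean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def mse(model, xs, ys):
    """Mean squared prediction error of a fitted line on (xs, ys)."""
    a, b = model
    return fmean((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


def k_fold_indices(n, k, rng):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validated_mse(xs, ys, k, rng):
    """Average held-out error: each fold is predicted by a model fit without it."""
    fold_errors = []
    for test_idx in k_fold_indices(len(xs), k, rng):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit_line(train_x, train_y)
        test_x = [xs[i] for i in test_idx]
        test_y = [ys[i] for i in test_idx]
        fold_errors.append(mse(model, test_x, test_y))
    return fmean(fold_errors)


# Simulated data (assumption): a linear signal plus Gaussian noise.
rng = random.Random(2022)
xs = [rng.uniform(0, 10) for _ in range(60)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]

apparent_mse = mse(fit_line(xs, ys), xs, ys)      # fit and evaluated on the same cases
cv_mse = cross_validated_mse(xs, ys, k=5, rng=rng)  # evaluated on held-out folds

print(f"apparent MSE:        {apparent_mse:.3f}")
print(f"5-fold CV MSE:       {cv_mse:.3f}")
```

The apparent error uses every observation for both fitting and evaluation, so it tends to be optimistic. The cross-validated estimate scores each observation with a model that never saw it, yet all of the data still comes from the original sample.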

External validation is more challenging than internal validation. Some may try to use random split-sample (hold-out) performance as a method for external validation. However, a single random split is neither external validation nor cross validation: the held-out cases come from the same source as the development cases, and the split is performed only once rather than rotated across folds. Fully independent data is required for external validation. Independent datasets have one or more of the following properties:

Temporal differences
- Data was collected from the same locations, but over different periods of time.
Geographic differences
- Data was collected from different locations.
Institutional differences
- Data was collected from an organization not connected with the original source.

If the external sample is very similar to the original sample, the assessment is for reproducibility rather than for transportability.
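The difference between a random split and a split along one of the properties listed above can be sketched as follows. The record layout, the `year` field, the cutoff year, and the helper names (`random_split`, `temporal_split`, `years`) are hypothetical and chosen only for illustration.

```python
import random


def random_split(records, validation_fraction, rng):
    """Hold out a random fraction; both parts share the same time periods."""
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]


def temporal_split(records, first_validation_year):
    """Develop on earlier periods, validate on later ones."""
    development = [r for r in records if r["year"] < first_validation_year]
    validation = [r for r in records if r["year"] >= first_validation_year]
    return development, validation


def years(records):
    """Sorted list of the distinct years present in a set of records."""
    return sorted({r["year"] for r in records})


# Hypothetical records: 50 cases per year from 2015 through 2022.
rng = random.Random(1)
records = [{"id": i, "year": 2015 + i % 8} for i in range(400)]

dev_random, val_random = random_split(records, 0.25, rng)
dev_temporal, val_temporal = temporal_split(records, first_validation_year=2021)

print("random split   development years:", years(dev_random))
print("random split   validation years: ", years(val_random))
print("temporal split development years:", years(dev_temporal))
print("temporal split validation years: ", years(val_temporal))
```

In the random split the validation cases are drawn from the same periods as the development cases. In the temporal split the two sets do not overlap in time, which is the minimum needed before the validation set can be called temporally different.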

A common misconception in predictive modeling is that a ‘true’ parsimonious model exists, that model selection can discern this ‘true’ model, and that the fitted model identifies which predictors are important in the relationship with the outcome. Unfortunately, statistical methods cannot distinguish between spurious and real associations. At best, one might be able to truthfully say, “The chosen variables, or variables correlated with the chosen variables but not included in the model, are expected to have some degree of association with the outcome, in a fashion which may behave as the estimated functional form.” In real life, no solution is sparse. An individual predictor variable cannot be considered apart from 1) the algorithm used to create the model and 2) the covariates which were included in the model. In summary, a model built for optimal prediction carries no guarantee of being appropriate for description or causal association. Likewise, a model built for description or causal association carries no guarantee of practical prediction performance.
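A small simulation can make the instability of variable selection concrete. In the sketch below, two predictors are noisy copies of the same unobserved variable, and a deliberately simplistic selection rule (keep whichever predictor is most correlated with the outcome) is repeated across bootstrap resamples. The data-generating process, the selection rule, and the names `pearson` and `select_predictor` are assumptions made for this example.

```python
import random
from collections import Counter
from statistics import fmean


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    ma, mb = fmean(a), fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


def select_predictor(data):
    """Toy 'model selection': keep the predictor most correlated with y."""
    y = [row["y"] for row in data]
    scores = {
        name: abs(pearson([row[name] for row in data], y))
        for name in ("x1", "x2")
    }
    return max(scores, key=scores.get)


# Simulated data (assumption): x1 and x2 are both noisy proxies of z,
# and the outcome depends on z only. Neither predictor is the 'true' one.
rng = random.Random(7)
data = []
for _ in range(100):
    z = rng.gauss(0, 1)
    data.append({
        "x1": z + rng.gauss(0, 0.5),
        "x2": z + rng.gauss(0, 0.5),
        "y": z + rng.gauss(0, 1),
    })

# Repeat the selection on bootstrap resamples of the same dataset.
n_boot = 500
selected = Counter(
    select_predictor(rng.choices(data, k=len(data))) for _ in range(n_boot)
)
print(selected)
```

Whichever predictor is chosen in a given resample, the choice says little about importance: both variables stand in for something that was never measured, and the winner can change with the sample.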

Since the estimated coefficients are not the primary focus, the specific functional form is not central to model building. This opens up more possibilities for flexible modeling methods such as regression splines, tree-based methods, support vector machines, and neural networks. These models incorporate nonlinearities and interactions that would otherwise be very hard to pre-specify or incorporate in traditional linear models. Greater predictive performance is often found, at the cost of functional forms that are harder to interpret.
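The trade-off can be sketched with a hand-rolled regression tree compared against a straight line on simulated nonlinear data. This is a minimal standard-library illustration, not a substitute for an established implementation; the sine-shaped signal, the tree depth, and the helper names (`fit_line`, `sse`, `fit_tree`, `predict_tree`, `simulate`) are assumptions made for the example.

```python
import math
import random
from statistics import fmean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = fmean(xs), fmean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b


def sse(values):
    """Sum of squared deviations from the mean."""
    m = fmean(values)
    return sum((v - m) ** 2 for v in values)


def fit_tree(xs, ys, depth, min_leaf=5):
    """Greedy one-predictor regression tree, returned as nested tuples."""
    if depth == 0 or len(xs) < 2 * min_leaf:
        return ("leaf", fmean(ys))
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(min_leaf, len(pairs) - min_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        score = sse([y for _, y in pairs[:i]]) + sse([y for _, y in pairs[i:]])
        if best is None or score < best[0]:
            best = (score, i)
    if best is None:
        return ("leaf", fmean(ys))
    i = best[1]
    threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
    left = fit_tree([x for x, _ in pairs[:i]], [y for _, y in pairs[:i]], depth - 1, min_leaf)
    right = fit_tree([x for x, _ in pairs[i:]], [y for _, y in pairs[i:]], depth - 1, min_leaf)
    return ("split", threshold, left, right)


def predict_tree(tree, x):
    """Walk down the tree until a leaf is reached and return its mean."""
    while tree[0] == "split":
        tree = tree[2] if x < tree[1] else tree[3]
    return tree[1]


def simulate(n, rng):
    """Simulated data (assumption): a sine-shaped signal plus Gaussian noise."""
    xs = [rng.uniform(0, 2 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0, 0.3) for x in xs]
    return xs, ys


rng = random.Random(11)
train_x, train_y = simulate(300, rng)
new_x, new_y = simulate(300, rng)  # new cases from the same simulated process

a, b = fit_line(train_x, train_y)
tree = fit_tree(train_x, train_y, depth=3)

line_mse = fmean((y - (a + b * x)) ** 2 for x, y in zip(new_x, new_y))
tree_mse = fmean((y - predict_tree(tree, x)) ** 2 for x, y in zip(new_x, new_y))
print(f"linear model MSE on new cases: {line_mse:.3f}")
print(f"depth-3 tree MSE on new cases: {tree_mse:.3f}")
```

The line has two coefficients that can be read directly, while the tree is a set of nested thresholds with no single slope to report. On a signal like this one the tree predicts better precisely because it is not tied to a pre-specified functional form.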

Further guidance on the details of prediction modeling is provided for

This material was originally written to consolidate my notes on prediction modeling in a single article. I decided to break up each section so that other relevant topics, such as the context for descriptive or causal modeling and code examples, could be included at a later time.

Bibliography

Breiman, Leo. 2001. “Statistical Modeling: The Two Cultures (with Comments and a Rejoinder by the Author).” Statistical Science 16 (3): 199–231. https://doi.org/10.1214/ss/1009213726.

Carlin, John B., and Margarita Moreno‐Betancur. 2025. “On the Uses and Abuses of Regression Models: A Call for Reform of Statistical Practice and Teaching.” Statistics in Medicine 44 (13-14): e10244. https://doi.org/10.1002/sim.10244.

Hernán, Miguel A., John Hsu, and Brian Healy. 2019. “A Second Chance to Get Causal Inference Right: A Classification of Data Science Tasks.” CHANCE 32 (1): 42–49. https://doi.org/10.1080/09332480.2019.1579578.

Laubach, Zachary M., Eleanor J. Murray, Kim L. Hoke, Rebecca J. Safran, and Wei Perng. 2021. “A Biologist’s Guide to Model Selection and Causal Inference.” Proceedings of the Royal Society B: Biological Sciences 288 (1943). https://doi.org/10.1098/rspb.2020.2815.

Shmueli, Galit. 2010. “To Explain or to Predict?” Statistical Science 25 (3): 289–310. https://doi.org/10.1214/10-STS330.

Westreich, Daniel, and Sander Greenland. 2013. “The Table 2 Fallacy: Presenting and Interpreting Confounder and Modifier Coefficients.” American Journal of Epidemiology 177 (4): 292–98. https://doi.org/10.1093/aje/kws412.

Published: 2022-02-13
Last Updated: 2025-06-16