There are three major areas of focus in modeling (Shmueli 2010):
- Descriptive modeling: describes what has happened
  - Concerned with the association between X and Y
- Predictive modeling: predicts what will happen
  - Concerned with prediction performance
- Explanatory modeling: explains why something happened
  - Concerned with counterfactual outcomes
Prediction performance is based on the model’s ability to provide valid predictions for new cases. The generalizability of a model is assessed through internal and external validation. Internal validation assesses validity using the same data that were used to build the model (e.g., cross-validation). External validation assesses validity using new data that are independent of the data used to build the model.
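To make the idea of internal validation concrete, here is a minimal sketch in plain Python. The simulated data, the one-predictor linear model, and the helper names (`fit_line`, `mse`, `k_fold_mse`) are all assumptions made for illustration. It compares the apparent error (model assessed on the same rows it was fit to) with a 5-fold cross-validated error.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def mse(model, xs, ys):
    """Mean squared error of a fitted (a, b) line on the given rows."""
    a, b = model
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def k_fold_mse(xs, ys, k=5, seed=0):
    """Average held-out MSE over k folds (internal validation)."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Refit the model without the held-out fold, then score on that fold.
        model = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors.append(mse(model, [xs[i] for i in held_out],
                          [ys[i] for i in held_out]))
    return sum(errors) / k

# Simulated data (hypothetical): a linear signal plus noise.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(60)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]

apparent = mse(fit_line(xs, ys), xs, ys)  # same rows used to fit and assess
cv = k_fold_mse(xs, ys)                   # rows held out during fitting

print(f"apparent MSE: {apparent:.3f}")
print(f"5-fold CV MSE: {cv:.3f}")
```

The apparent error is optimistic because the model has already seen every row it is scored on. The cross-validated estimate is still internal validation, since all folds come from the same dataset.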
External validation is more challenging than internal validation. Some may try to use random split-sample (hold-out) performance as a method for external validation. However, a random split is not external validation, and it is not even cross-validation: both parts come from the same source, and the model is assessed on a single hold-out set rather than across repeated folds. Fully independent data is required for external validation. Independent datasets have one or more of the following properties:
- Temporal differences
  - Data were collected from the same locations, but over different periods of time.
- Geographic differences
  - Data were collected from different locations.
- Institutional differences
  - Data were collected from an organization not connected with the original source.
If the external sample is very similar to the original sample, the assessment is for reproducibility rather than for transportability.
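The difference between a random split and a split along one of these properties can be sketched as follows. The record layout (`year`, `site`, `id`) and the helper names are hypothetical, chosen only to illustrate the idea; real external validation would use data that were actually collected separately.

```python
import random

# Hypothetical records: 8 collection years (2015-2022), two sites.
records = [
    {"id": i, "year": 2015 + i % 8, "site": "A" if i % 3 else "B"}
    for i in range(40)
]

def random_split(rows, frac=0.7, seed=0):
    """Random hold-out: both parts come from the same periods and sites."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = round(len(rows) * frac)
    return rows[:cut], rows[cut:]

def temporal_split(rows, first_validation_year):
    """Develop on earlier years, validate on strictly later years."""
    dev = [r for r in rows if r["year"] < first_validation_year]
    val = [r for r in rows if r["year"] >= first_validation_year]
    return dev, val

def geographic_split(rows, held_out_site):
    """Develop on all other sites, validate on the held-out site."""
    dev = [r for r in rows if r["site"] != held_out_site]
    val = [r for r in rows if r["site"] == held_out_site]
    return dev, val

def years(rows):
    return {r["year"] for r in rows}

dev_r, val_r = random_split(records)
dev_t, val_t = temporal_split(records, first_validation_year=2021)
dev_g, val_g = geographic_split(records, held_out_site="B")

print("random split shares years:", sorted(years(dev_r) & years(val_r)))
print("temporal split shares years:", sorted(years(dev_t) & years(val_t)))
print("geographic validation sites:", {r["site"] for r in val_g})
```

In the random split, development and validation rows share collection years, so the validation set is not independent in time. The temporal and geographic splits keep the validation rows separate along one dimension, which is closer in spirit to the properties listed above.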
A common misconception in predictive modeling is that a ‘true’ parsimonious model exists, that model selection can discern this ‘true’ model, and that the fitted model identifies which predictors are important in the relationship with the outcome. Unfortunately, statistical methods cannot distinguish between spurious and real associations. At best, one might be able to truthfully say, “The chosen variables, or variables correlated with the chosen variables but not included in the model, are expected to have some degree of association with the outcome, in a fashion which may behave as the estimated functional form.” In real life, no solution is sparse. An individual predictor variable cannot be interpreted apart from 1) the algorithm used to create the model and 2) the covariates which were included in the model. In summary, a model built for optimal prediction carries no guarantee of being appropriate for description or causal association. Likewise, a model built for description or causal association carries no guarantee of practical prediction performance.
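One way to see why a selected variable should not be read as the ‘true’ predictor is to repeat the selection on bootstrap resamples. The sketch below is an illustration under assumed conditions, not a recommended selection procedure: the simulated data, the variable names (`x1`, `x2`, `x3`), and the rule "pick the predictor with the largest absolute correlation with the outcome" are all hypothetical. Here `x1` and `x2` are equally noisy measurements of the same unobserved driver, and `x3` is pure noise.

```python
import random
from collections import Counter

def pearson(u, v):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)

rng = random.Random(1)
n = 80
z = [rng.gauss(0, 1) for _ in range(n)]          # unobserved driver
data = {
    "x1": [v + rng.gauss(0, 0.3) for v in z],    # noisy proxy of z
    "x2": [v + rng.gauss(0, 0.3) for v in z],    # another noisy proxy of z
    "x3": [rng.gauss(0, 1) for _ in range(n)],   # unrelated noise
}
y = [v + rng.gauss(0, 1) for v in z]

def select_best(rows):
    """Toy selection rule: largest |correlation| with y on the given rows."""
    ys = [y[i] for i in rows]
    return max(
        data,
        key=lambda name: abs(pearson([data[name][i] for i in rows], ys)),
    )

# Repeat the selection on bootstrap resamples of the rows.
n_boot = 200
counts = Counter(
    select_best([rng.randrange(n) for _ in range(n)]) for _ in range(n_boot)
)
print(dict(counts))
```

If the selection flips between `x1` and `x2` across resamples, that reflects the point above: the data support "something correlated with these variables is associated with the outcome", not "this particular variable is the important one".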
Since the estimated coefficients are not the primary focus in prediction modeling, the specific functional form is not central to model building. This opens up more possibilities for flexible methods of modeling, such as regression splines, tree-based methods, support vector machines, and neural networks. These models incorporate nonlinearities and interactions that would otherwise be very hard to pre-specify or incorporate in traditional linear models. The resulting functional forms are harder to interpret, but predictive performance is often better.
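As a rough illustration of this trade-off, the sketch below fits a straight line and a very small one-dimensional regression tree to simulated nonlinear data and compares their error on a separate simulated test set. Everything here is an assumption made for the example: the sine-shaped signal, the noise level, and the hand-rolled `fit_tree` helper, which is a toy stand-in for a real tree-based method.

```python
import math
import random

def fit_line(xs, ys):
    """Least-squares straight line; returns a prediction function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def sse(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def fit_tree(xs, ys, depth=3, min_leaf=10):
    """Toy 1-D regression tree: nested dicts with leaf means."""
    pairs = sorted(zip(xs, ys))
    if depth == 0 or len(pairs) < 2 * min_leaf:
        return sum(ys) / len(ys)
    sorted_y = [y for _, y in pairs]
    # Pick the split position that minimises the total within-node SSE.
    best_i = min(range(min_leaf, len(pairs) - min_leaf + 1),
                 key=lambda i: sse(sorted_y[:i]) + sse(sorted_y[i:]))
    cut = (pairs[best_i - 1][0] + pairs[best_i][0]) / 2
    left, right = pairs[:best_i], pairs[best_i:]
    return {
        "cut": cut,
        "left": fit_tree([x for x, _ in left], [y for _, y in left],
                         depth - 1, min_leaf),
        "right": fit_tree([x for x, _ in right], [y for _, y in right],
                          depth - 1, min_leaf),
    }

def predict_tree(tree, x):
    """Walk down the tree until a leaf mean is reached."""
    while isinstance(tree, dict):
        tree = tree["left"] if x < tree["cut"] else tree["right"]
    return tree

def mse(predict, xs, ys):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(7)

def simulate(n):
    """Hypothetical nonlinear data: y = sin(x) + noise."""
    xs = [rng.uniform(0, 2 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0, 0.3) for x in xs]
    return xs, ys

train_x, train_y = simulate(200)
test_x, test_y = simulate(200)

line = fit_line(train_x, train_y)
tree = fit_tree(train_x, train_y)

line_mse = mse(line, test_x, test_y)
tree_mse = mse(lambda x: predict_tree(tree, x), test_x, test_y)
print(f"linear model test MSE: {line_mse:.3f}")
print(f"small tree test MSE:   {tree_mse:.3f}")
```

The line has two readable coefficients, while the tree is a set of cut points and leaf means that says little on its own. The tree can nonetheless follow the curve without the nonlinearity being specified in advance.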
Further guidance on the details of prediction modeling is provided for:
- Required Sample Size
- Missing Data Imputation
- Variable Selection
- Functional Form
- Learning Methods
- Performance Measures
- Resampling Methods
- External Validation
I originally wrote this as a single article to consolidate my notes on prediction modeling. I decided to break up each section so that other relevant topics, such as the context for descriptive or causal modeling and code examples, could be included at a later time.