# Missing Data Imputation

The interaction between how data are missing and where data are missing forms the basis for deciding how to handle missing data. For *how* data are missing, there are three main mechanisms, as popularized by Little and Rubin (1987):

- Missing Completely at Random (MCAR) or Not Data Dependent (NDD)
  - The probability of being missing is the same for all observations and is unrelated to the data.

- Missing At Random (MAR) or Seen Data Dependent (SDD)
  - The probability of being missing depends on observed values in the data; conditional on those values, the data are missing at random.

- Missing Not At Random (MNAR) or Unseen Data Dependent (UDD)
  - The probability of being missing varies for reasons that are unobserved, or both unobserved and observed.
  - The probability of being missing is dependent on the values of the variable with missingness. This can extend to cases of censoring (such as right, left, and interval censoring seen in time-to-event analysis) and limit of detection (inability to quantify a value below/above a threshold).
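
To make the three mechanisms above concrete, here is a minimal simulation sketch. The variables `x` and `y`, the coefficients, and the logistic missingness models are all hypothetical choices made for illustration; only the standard library is used. It deletes values of `y` under each mechanism and compares the mean of the remaining observed values with the full-data mean.

```python
import math
import random
import statistics


def logistic(z):
    return 1 / (1 + math.exp(-z))


def simulate(n=20000, seed=0):
    """Delete y under MCAR, MAR, and MNAR; return full and observed means."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)              # always observed
        y = 0.8 * x + rng.gauss(0, 0.6)  # the variable that will go missing
        rows.append((x, y))

    observed = {"MCAR": [], "MAR": [], "MNAR": []}
    for x, y in rows:
        # MCAR: constant probability of missingness, unrelated to the data
        if rng.random() > 0.3:
            observed["MCAR"].append(y)
        # MAR: probability of missingness depends only on the observed x
        if rng.random() > logistic(2 * x - 1):
            observed["MAR"].append(y)
        # MNAR: probability of missingness depends on the unseen y itself
        if rng.random() > logistic(2 * y - 1):
            observed["MNAR"].append(y)

    full_mean = statistics.fmean(y for _, y in rows)
    return full_mean, {k: statistics.fmean(v) for k, v in observed.items()}


full_mean, observed_means = simulate()
print(f"full-data mean: {full_mean:.3f}")
for name, value in observed_means.items():
    print(f"{name} observed mean: {value:.3f}")
```

In this setup the MCAR mean should stay close to the full-data mean, while the MAR and MNAR means are pulled downward because large values are preferentially removed. Under MAR the shift could be corrected by conditioning on `x`; under MNAR it could not.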

Keep in mind that missingness may not just be inability to measure, but also incorrect recording or retrieval. A common instance of this is miscoding. Values are miscoded when the missingness itself signifies a valid category of analysis (often zero for continuous values or a new level for categorical).

There are three locations for *where* data may be missing:

- Missing in the outcome (Y)
- Missing in the predictor variables (X)
- Missing in both the outcome and the predictor variables (Y and X).

Many different methods for missing data imputation are available. Too often, little thought is put into justifying why any one method is most appropriate for a given situation. Even in the best-case scenario (MCAR), many methods will unexpectedly lead to biased estimates. There are four common methods for handling missing data: 1) complete case analysis, 2) single dataset imputation, 3) multiple dataset imputation, and 4) fully Bayesian imputation. Another common method, almost always a bad idea, is to simply drop X from the analysis when it has missingness. (If applied to Y, it might surprisingly lead to the best possible outcome, i.e. resetting the client’s goals.)

When an observation is missing information required for the model, complete case analysis (CCA) will remove the entire observation from analysis. Most modeling methods default to complete case analysis in the event of missing data, except a few which can naturally handle random missingness such as CART and Naive Bayes. Complete case analysis is valid when observations included in the analysis are no different than the observations excluded after conditioning on features that influence relationships of interest. This can lead to loss of efficiency from the reduction in sample size, but is the easiest and most common approach to handling missing data.

- Complete case analysis (CCA, listwise deletion, no imputation)
  - Definition:
    - Remove any observation (row of data) which has missing information that is required for the model, then proceed to fit the model.
  - Why it’s used:
    - It’s the easiest solution to implement.
    - When data are MCAR, it produces unbiased estimates of means, regression coefficients, and regression covariances.
    - When data are MAR in Y, then multiple imputation and CCA are equivalent in producing (unbiased?) regression coefficient estimates. CCA will still be biased for means and \(R^2\) under MAR.
    - If the probability that a value is missing is not related to Y, then CCA regression coefficients are unbiased. This applies whether the missing values are in Y, in X, or in both, hence CCA may be ok for this special class of MNAR.
    - For the case of logistic regression with missing values in dichotomous Y or X (but not in both), the regression coefficients from CCA will be unbiased if the probability of being missing depends only on Y and not on X (e.g. case-control studies).
  - Where it fails:
    - The loss of efficiency caused by fitting on the reduced subset of data will cause the standard errors to be larger than what would be seen when using all available data.
    - When data are missing in X and not MCAR (MAR or MNAR), it will produce biased estimates of means, regression coefficients, and regression covariances.
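
As a minimal sketch of listwise deletion, assuming rows are stored as dictionaries with `None` marking a missing value (the helper `complete_cases` and the toy rows are hypothetical):

```python
def complete_cases(rows, required):
    """Listwise deletion: keep rows where every required field is present."""
    return [r for r in rows if all(r.get(k) is not None for k in required)]


rows = [
    {"y": 3.1, "x1": 1.0, "x2": 0.5},
    {"y": None, "x1": 2.0, "x2": 0.1},
    {"y": 4.0, "x1": None, "x2": 0.9},
    {"y": 5.2, "x1": 3.0, "x2": None},
]

# A model using y, x1, and x2 keeps only the first row.
cc_full_model = complete_cases(rows, ["y", "x1", "x2"])

# A model using only y and x1 keeps the first and last rows.
cc_small_model = complete_cases(rows, ["y", "x1"])
```

The analysis sample depends on which variables the model requires, so two models fit to the same raw data can end up being estimated on different subsets of rows.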

## Single imputation

Single imputation methods fill in each missing value with a single imputed value. The resulting (single) dataset is used for analysis. There are many different methods for single dataset imputation and each may have different areas of bias.

- Last observation carried forward (LOCF)
  - Definition:
    - LOCF uses the last observed value of an individual to fill their later, missing values. Common in longitudinal studies and for time series data.
  - Why it’s used:
    - It’s easy and makes you feel as though you’ve accomplished something. You get to use all observations without having to make explicit the conditionally missing at random assumption.
    - Sometimes you may know that LOCF is correct by definition.
  - Where it fails:
    - Reduces variation.
    - Imputation for X may be justified, but imputation in Y is much harder to justify.

- Mean/Median/Mode imputation
  - Definition:
    - Missing values for a given variable are filled in with that variable’s mean/median/mode.
  - Why it’s used:
    - It’s easy and makes you feel as though you’ve accomplished something. You get to use all observations without having to make explicit the conditionally missing at random assumption.
    - If X is MCAR, then the estimated mean will be unbiased.
  - Where it fails:
    - Reduces variation.
    - If missing in X, disturbs correlation between predictor variables.
    - If X is not MCAR, the estimated mean (and everything else) will be biased.
    - If X is MCAR, everything other than the mean will be biased.

- Indicator method
  - Definition:
    - For each predictor variable which has missing values, first fill those missing values with 0 (if numeric) or a new level “missing” (if categorical), and second create an indicator variable (0=observed/1=missing) for inclusion in the model.
  - Why it’s used:
    - Allows for direct comparison between observed and unobserved cases through use of the indicator variable.
    - Useful for unbiased estimates of the treatment effect when a baseline covariate has missingness.
  - Where it fails:
    - Does not support missing data in Y.
    - Generally leads to bias in observational data, even under MCAR.
    - Only useful for specific scenarios.

- Worst observation carried forward
  - Definition:
    - Replace each instance of a missing value with the corresponding variable’s ‘worst’ value (worst being defined as resulting in the most conservative estimate).

- K-nearest neighbor (KNN)
  - Definition:
    - Finds the k cases most similar to the case with a missing value and fills in the missing value using the average of those k nearest neighbors (also called hot deck imputation). When the variables are all numeric, the Euclidean distance metric may be used. When both numeric and categorical variables are involved, Gower’s distance is a good alternative. A k between 5 and 10 is a sensible default choice.

- Linear regression
  - Definition:
    - Missing values are filled in based on a linear regression model of the observed data.
  - Why it’s used:
    - If X is MCAR, then the estimated mean will be unbiased.
    - If Y is MCAR, then the regression coefficients will be unbiased.
    - If X is MAR and the factors that explain the missingness are observed, then the regression coefficients will be unbiased.
  - Where it fails:
    - If the imputation model is misspecified, then the imputations may lead to bias.
    - Reduces variation. Amount of reduction is a function of the explained variation of the imputation model and the proportion of missingness.
    - If missing in X, disturbs correlation between predictor variables (biased upwards).

- Stochastic regression
  - Definition:
    - Missing values are filled in based on a stochastic regression model of the observed data. Stochastic regression fits a linear model, then adds a random draw from the residuals to the estimated values.
  - Why it’s used:
    - Conceptual improvement on linear regression imputation.
    - Preserves correlation between the X variables.
    - If X is MAR and the factors that explain the missingness are observed, then the regression coefficients will be unbiased.
  - Where it fails:
    - If the imputation model is misspecified, then the imputations may lead to bias.
    - May produce implausible imputed values.
    - Just as with other single imputation methods, too much certainty is placed on the single imputed values, which biases p-values and confidence interval widths downwards.
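
The "reduces variation" point in the list above can be checked with a small sketch. The data-generating model, the 40% MCAR missingness rate, and the helper names `fit_line` and `impute` are assumptions made for illustration. The sketch fills in a variable `y` from a single fully observed predictor `x` using mean, regression, and stochastic regression imputation, then compares the standard deviation of each completed variable with that of the true values.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residuals)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    return a, b, resid


def impute(xs, ys, method, rng):
    """Fill None entries of ys, using x as the only predictor."""
    obs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    ox, oy = zip(*obs)
    a, b, resid = fit_line(ox, oy)
    mean_y = statistics.fmean(oy)
    out = []
    for x, y in zip(xs, ys):
        if y is not None:
            out.append(y)                  # observed values are kept
        elif method == "mean":
            out.append(mean_y)
        elif method == "regression":
            out.append(a + b * x)          # prediction only
        elif method == "stochastic":
            out.append(a + b * x + rng.choice(resid))  # prediction + residual
        else:
            raise ValueError(method)
    return out


rng = random.Random(1)
xs = [rng.gauss(0, 1) for _ in range(5000)]
truth = [1.0 + 0.7 * x + rng.gauss(0, 1) for x in xs]
ys = [None if rng.random() < 0.4 else y for y in truth]  # MCAR deletion

sd = {m: statistics.stdev(impute(xs, ys, m, rng))
      for m in ("mean", "regression", "stochastic")}
sd["truth"] = statistics.stdev(truth)
print(sd)
```

Under these assumptions, mean imputation should shrink the spread the most, regression imputation somewhat less, and stochastic regression should roughly recover the spread of the true values. None of the three reflects the uncertainty of the imputed values in downstream standard errors.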

As seen in van Buuren (2018), we can summarize the basic operating characteristics of these basic methods in tabular form:

Impute method | Unbiased Mean | Unbiased Coefficient | Unbiased Correlation | Standard Error |
---|---|---|---|---|
CCA | MCAR | MCAR | MCAR | Too large |
Mean | MCAR | – | – | Too small |
Regression | MAR | MAR | – | Too small |
Stochastic | MAR | MAR | MAR | Too small |
LOCF | – | – | – | Too small |
Indicator | – | – | – | Too small |

The first row can be interpreted as:

- CCA produces an unbiased estimate of the mean, regression coefficients, and correlations, provided that the data are MCAR.
- CCA produces an estimate of the standard error that is too large.

The “–” indicates that the method cannot produce unbiased estimates. This table is a simplification. For example, CCA produces unbiased regression coefficients when the probability to be missing (in X and/or Y) does not depend on Y (thus CCA is ok for MNAR in this scenario). CCA also produces unbiased **logistic** regression coefficients (except the intercept) when missing data are confined to either a dichotomous Y or to X, but not to both, and if the probability to be missing depends only on Y (which justifies the use of odds ratios in case-control studies).

## Multiple imputation

Multiple imputation is a three-step method for estimation with incomplete data. First, multiple complete datasets are formed, with different plausible values used to fill in the missing data in each dataset. Second, each dataset is analyzed as if it were its own complete dataset. Finally, Rubin’s rules (Rubin 2004) are used to combine the estimates from the multiple models. This method is seen as the current best practice for frequentist inference. By incorporating multiple estimates for each case of missingness, the issues of biased covariances, coefficients, p-values, and confidence interval widths can be resolved.
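
The pooling step can be sketched for a single scalar parameter. The function name `pool_rubin` and the example numbers are hypothetical, and the degrees-of-freedom adjustment used for confidence intervals is omitted. The pooled estimate is the average of the per-dataset estimates, and the total variance is the average within-imputation variance plus the between-imputation variance inflated by a factor of (1 + 1/m).

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)     # pooled point estimate
    within = statistics.fmean(variances)    # average within-imputation variance
    between = statistics.variance(estimates)  # between-imputation variance
    total = within + (1 + 1 / m) * between
    return q_bar, math.sqrt(total)


# Three imputed datasets, each giving a coefficient with standard error 0.2.
estimate, std_error = pool_rubin([1.0, 1.2, 0.8], [0.04, 0.04, 0.04])
print(estimate, std_error)
```

In this example the pooled standard error is about 0.31, larger than the 0.2 reported by any single completed dataset, because the disagreement between the imputed datasets is counted as additional uncertainty.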

Multiple imputation is often implemented using the chained equations method by van Buuren, Boshuizen, and Knook (1999). It takes a variable-by-variable approach, imputing missing values using Gibbs sampling until the process reaches convergence. Separate chains are used to generate the multiple imputations.
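
The variable-by-variable idea can be illustrated with a deliberately simplified toy. This is a sketch under strong assumptions, not the published algorithm: it handles only two numeric columns, imputes each with a stochastic linear regression on the other, and adds normal noise instead of drawing the regression parameters from their posterior as a proper implementation would. The names `ols` and `chained_imputation` are hypothetical.

```python
import random
import statistics


def ols(xs, ys):
    """Least squares fit of y = a + b*x; returns intercept, slope, residual SD."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    sd = statistics.pstdev([y - (a + b * x) for x, y in zip(xs, ys)])
    return a, b, sd


def chained_imputation(data, n_iter=10, seed=0):
    """Impute two columns in turn, each regressed on the other.

    data: list of [u, v] pairs with None marking missing entries.
    """
    rng = random.Random(seed)
    miss = [[val is None for val in row] for row in data]
    # Starting imputations: fill each column with its observed mean.
    means = [statistics.fmean(r[j] for r in data if r[j] is not None)
             for j in (0, 1)]
    cur = [[means[j] if r[j] is None else r[j] for j in (0, 1)] for r in data]
    for _ in range(n_iter):
        for target, pred in ((0, 1), (1, 0)):
            # Fit on rows where the target was observed, using the current
            # (possibly imputed) values of the predictor column.
            fit_rows = [r for r, m in zip(cur, miss) if not m[target]]
            a, b, sd = ols([r[pred] for r in fit_rows],
                           [r[target] for r in fit_rows])
            for r, m in zip(cur, miss):
                if m[target]:
                    r[target] = a + b * r[pred] + rng.gauss(0, sd)
    return cur


# Toy data: two correlated columns, each with about 20% of values deleted.
rng = random.Random(42)
data = []
for _ in range(300):
    u = rng.gauss(0, 1)
    v = 0.5 * u + rng.gauss(0, 1)
    data.append([None if rng.random() < 0.2 else u,
                 None if rng.random() < 0.2 else v])

# One independent chain (seed) per imputed dataset.
imputed_sets = [chained_imputation(data, seed=s) for s in range(3)]
```

Each seed produces a separate chain and therefore a separate completed dataset; observed values are identical across datasets and only the imputed cells differ.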

van Buuren (2018) describes a seven-step process for using multiple imputation:

- Check whether the MAR assumption is plausible. Multiple imputation under MNAR requires additional modeling assumptions.
- Decide on the form of the imputation model. Options include:
  - Imputing any variable type
    - Predictive mean matching
    - Classification and regression trees
    - Random forest
  - Imputing numeric variable types
    - Bayesian linear regression
    - Normal imputation with bootstrap
  - Imputing binary variable types
    - Logistic regression with bootstrap
  - Imputing ordinal variable types
    - Proportional odds model
  - Imputing nominal variable types
    - Polytomous logistic regression
    - Discriminant analysis
- Decide which variables should be used in the imputation model. Auxiliary variables that contain information on the probability of missingness are the most important; the focus is not on predictors of the outcome.
- Leave derived variables out of the imputation. For example, given height, weight, and their ratio, a missing value in one can be filled in deterministically from knowledge of the others.
- Determine the order in which multiple variables will be imputed.
- Determine the starting imputations and the number of iterations.
- Determine the number of imputed datasets. The easiest rule is to set the number of datasets equal to the highest percentage of missing data encountered (https://stefvanbuuren.name/fimd/sec-howmany.html). von Hippel (2020) shows that the actual relationship is closer to quadratic than linear.
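
The two rules for the number of imputed datasets can be compared side by side. This sketch assumes the quadratic rule takes the form m = 1 + ½ (FMI / CV)², where FMI is the fraction of missing information and CV is the acceptable coefficient of variation of the standard error estimate; that form, the default `cv=0.05`, and both function names are assumptions here and should be checked against von Hippel (2020) before use.

```python
import math


def m_linear(max_pct_missing):
    """Rule of thumb: one imputed dataset per percentage point missing."""
    return math.ceil(max_pct_missing)


def m_quadratic(fmi, cv=0.05):
    """Assumed quadratic rule: m = 1 + 0.5 * (fmi / cv) ** 2.

    Returns the unrounded value; round up to get a number of datasets.
    """
    return 1 + 0.5 * (fmi / cv) ** 2


for fmi in (0.1, 0.3, 0.6):
    print(fmi, m_linear(100 * fmi), round(m_quadratic(fmi), 1))
```

If the assumed form is right, doubling the fraction of missing information roughly quadruples the number of imputations needed, so the linear rule would undershoot when missingness is heavy. Note also that the linear rule is stated in terms of percent missing, while the quadratic rule uses the fraction of missing information, which is not the same quantity.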

After creating the multiple dataset imputations, use diagnostic plots to check model fit (see https://stefvanbuuren.name/fimd/sec-diagnostics.html). For further applied examples, see https://bookdown.org/mwheymans/bookmi/.

In summary, consider the following strategy:

- If you believe the data are MCAR and the sample size of the final model is sufficient, use complete case analysis. You can argue the validity of the MCAR assumption by creating a ‘table one’ grouped by missingness and comparing the distributions of the other Ys and Xs. If there are differences in distribution, the data are unlikely to be MCAR, but we cannot say whether the missingness is MAR or MNAR. Keep in mind that a lack of any differences does not guarantee MCAR, so do not place too much importance on it compared to subject matter expertise.
- If missing in X, and you have data that may explain why those values are missing (MAR), use multiple imputation (or a ‘nice’ single imputation method if simplicity is more important) where the imputation model includes both Y and the variables related to missingness.
- If missing in Y, or X and Y, and you have data that may explain why those values are missing, use multiple imputation. It is important to use subject matter expertise to define the structure of the imputation model (make sure to capture nonlinearities and interactions). If only Y is MAR or MCAR, van Buuren (2013) indicates that MI has little benefit compared to CCA, unless an auxiliary variable is available for filling in Y. If missing in X and Y, then follow the approach of von Hippel (2007).
- For all other cases, your choices are:
  - Use multiple imputation and pray your results are not significantly biased.
  - Collect a new sample that will not be biased from missing values.
  - Do not conduct the analysis.
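
The ‘table one’ grouped by missingness mentioned in the first point can be sketched for one covariate at a time. The helper `compare_by_missingness` and the toy `bmi` and `age` rows are hypothetical, with `None` marking a missing value.

```python
import statistics


def compare_by_missingness(rows, target, covariate):
    """Summarize a covariate for rows where target is missing vs observed."""
    groups = {"missing": [], "observed": []}
    for r in rows:
        if r[covariate] is None:
            continue  # cannot summarize a covariate value that is itself missing
        key = "missing" if r[target] is None else "observed"
        groups[key].append(r[covariate])
    return {k: {"n": len(v),
                "mean": statistics.fmean(v),
                "sd": statistics.stdev(v)}
            for k, v in groups.items() if len(v) > 1}


rows = [
    {"bmi": None, "age": 60},
    {"bmi": None, "age": 70},
    {"bmi": 22.0, "age": 30},
    {"bmi": 25.0, "age": 40},
    {"bmi": 27.0, "age": 50},
]
summary = compare_by_missingness(rows, target="bmi", covariate="age")
print(summary)
```

In this toy data the rows with missing `bmi` are noticeably older, which would argue against MCAR. A table with no visible differences would not prove MCAR, since the missingness could still depend on unobserved values.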

Since prediction models are dependent on cross validation, the computational and time burden of adding an imputation step (not to mention multiple imputation) can be quite high. In addition, the imputation operating characteristics mentioned above may not necessarily apply to prediction performance estimates.

Evaluating multiple methods of imputation for the purposes of selecting the optimal imputation method seems uncommon. Optimal imputation methods are defined by low bias of parameter estimates and coverage rates that equal the nominal 95% interval with minimum width. Outside of the simulation framework, we really only have diagnostic plot methods as described above. Prediction performance, such as RMSE, is not a sufficient measure of multiple imputation performance.

## Bibliography

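Little, Roderick J. A., and Donald B. Rubin. 1987. *Statistical Analysis with Missing Data*. Wiley Series in Probability and Mathematical Statistics. New York: Wiley.

Rubin, Donald B. 2004. *Multiple Imputation for Nonresponse in Surveys*. Wiley Classics Library. Hoboken, N.J: Wiley-Interscience.

van Buuren, Stef. 2018. *Flexible Imputation of Missing Data*. Second edition. Chapman and Hall/CRC Interdisciplinary Statistics Series. Boca Raton: CRC Press, Taylor & Francis Group. https://doi.org/10.1201/9780429492259.

van Buuren, S., H. C. Boshuizen, and D. L. Knook. 1999. “Multiple Imputation of Missing Blood Pressure Covariates in Survival Analysis.” *Statistics in Medicine* 18 (6): 681–94. https://doi.org/10.1002/(SICI)1097-0258(19990330)18:6<681::AID-SIM71>3.0.CO;2-R.

von Hippel, Paul T. 2007. “Regression with Missing Ys: An Improved Strategy for Analyzing Multiply Imputed Data.” *Sociological Methodology* 37 (1): 83–117. https://doi.org/10.1111/j.1467-9531.2007.00180.x.

von Hippel, Paul T. 2020. “How Many Imputations Do You Need? A Two-Stage Calculation Using a Quadratic Rule.” *Sociological Methods & Research* 49 (3): 699–718. https://doi.org/10.1177/0049124117747303.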