Table of Contents
- Learning Methods
- Ordinary least squares (OLS)
- Generalized least squares (GLS)
- Generalized linear models (GLM)
- Linear mixed models (LMM)
- Generalized linear mixed models (GLMM)
- Generalized additive models (GAM)
- Generalized estimating equations (GEE)
- Time-to-event models (Survival)
- Joint models
- Tree-based models
- K-nearest neighbors (KNN)
- Neural networks
- Bibliography
Learning Methods
Ordinary least squares (OLS)
A model used to predict quantitative outcomes. The model formula is
\[ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_pX_p + \epsilon \]
which may be described as
\[ \begin{aligned} \mathbf{y}_{n \times 1} &= \mathbf{X}_{n \times (p+1)} \mathbf{\beta}_{(p+1) \times 1} + \mathbf{\epsilon}_{n \times 1} \\ E[Y \mid X] &= \mu \\ &= \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_pX_p \\ &= \text{linear predictor} \end{aligned} \]
The normal linear model adds the distributional assumption \(Y \mid X \sim N(\mu, \sigma^2)\). This assumption is not required to compute the OLS estimator, but it supports exact finite-sample inference and makes OLS equivalent to maximum likelihood under homoskedastic normal errors.
The predictor variables \(X\) may include quantitative variables, transformations of quantitative variables (polynomials, splines, etc.), dummy variables, interactions, etc.
Assumptions made for OLS models are
Validity
The data collected maps correctly to the questions of interest. The response variable reflects the phenomenon of interest and all relevant predictors are included.
Additivity
The effects of the covariates on the response are additive, along with the additive error term.
Linearity
The response variable is a deterministic component of a linear function of the predictors (linear in the parameters) plus an error term. The conditional means of the response are a linear function of the predictor variables.
No perfect multicolinearity
The predictor variables are not a linear combination of each other. The design matrix \(\mathbf{X}\) is full rank. Imperfect multicolinearity does not make OLS invalid, but it will inflate standard errors.
Independent or uncorrelated errors
The errors from the regression line are conditionally uncorrelated. Independence is a stronger assumption that also rules out nonlinear dependence among the errors. For the usual OLS variance formula, conditional lack of correlation is sufficient; for likelihood-based inference under the normal linear model, conditional independence is typically assumed.
Equal variance of errors
The errors have constant variance conditional on the predictors. In diagnostic plots, this appears as similar residual variability across the range of fitted values or each predictor.
\(E[\epsilon \mid X] = 0\)
The errors have mean 0. This is also known as the exogeneity assumption. Exogeneity is violated when a confounding variable is not included.
Normality of errors
Normality of errors provides finite-sample guarantees for hypothesis testing and equivalence between the OLS estimate and the MLE.
The objective (cost) function is given as
\[ \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_jx_{ij} \right)^2 \]
Minimization of the objective function has a closed form solution through solving the normal equations \((X^{\prime}X)\hat{\beta} = X^{\prime}y\) which results in \(\hat{\beta} = (X^{\prime}X)^{-1}X^{\prime}Y\). Here \(p\) is the number of predictors excluding the intercept, so \(\text{rank}(X) = p + 1\) when the design matrix is full rank.
\[ \begin{aligned} E[\hat{\beta} \mid X] &= \beta \implies \text{unbiased}\\ V[\hat{\beta} \mid X] &= \sigma^2(X^{\prime}X)^{-1} \\ \hat{\beta} &\sim N(\beta, \sigma^2(X^{\prime}X)^{-1}) \end{aligned} \]
For simple linear regression with one predictor and an intercept, the slope estimator has conditional variance
\[ V[\hat{\beta}_1 \mid X] = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}. \]
Note that the residuals (observed \(e = y - \hat{y}\)) are estimates of the error (unobserved \(\epsilon = y - x\beta\)).
\[ \begin{aligned} E[\epsilon \mid X] &= 0 \\ E[e \mid X] &= (I - H)E[\epsilon \mid X] \\ &= E[Y \mid X] - E[\hat{Y} \mid X] \\ &= 0 \\ V[\epsilon \mid X] &= \sigma^2 I_{n \times n} \\ V[e \mid X] &= \sigma^2(I - H) \\ H &= X(X^{\prime}X)^{-1}X^{\prime}\\ &= \text{hat matrix} \\ &= \text{projection matrix} \implies \hat{\mu} = Hy \\ s^2 &= \frac{e^{\prime}e}{n - p - 1} \\ &= \frac{1}{n - p - 1} \sum_{i=1}^n (y_i - x_i^{\prime}\hat{\beta})^2 \\ E[s^2 \mid X] &= \sigma^2 \implies \text{unbiased} \\ V[s^2 \mid X] &= \frac{2\sigma^4}{n - p - 1} \end{aligned} \]
What if you would like to run a hypothesis test (generate p-values or confidence intervals) but notice violations of the OLS assumptions? Bootstrap confidence intervals of parameter estimates may be used to relax the assumption that \(Y \mid X\) is normal. Robust standard errors (Huber-White standard errors, heteroskedasticity-consistent standard errors, sandwich estimator, etc.) can make standard errors valid under some forms of heteroskedasticity, but they do not fix biased coefficient estimates caused by an incorrect mean structure, omitted confounding, measurement error, or other forms of endogeneity. If the mean model is misspecified, the parameter estimates are likely to be meaningful only as descriptive summaries of the fitted linear projection.
References
- Fox (2016)
Generalized least squares (GLS)
A model for numeric outcomes. Take the concepts described for OLS and generalize them for errors which have non-constant variance and/or are not all independent. In other words, an extended linear model with no random effects. Alternatives to GLS include linear mixed models or Bayesian hierarchical models, however GLS does not allow subjects/groups to have unique trajectories/slopes.
Instead of the normal OLS definition \(\epsilon \sim N(0, \sigma^2\mathbf{I})\), consider \(\epsilon \sim N(0, \Sigma)\) where
\[ \begin{align} V[\epsilon] &= \Sigma \\ &= \sigma^2 \mathbf{P}_{n \times n} \end{align}\]
If \(\sigma^2\) is a vector, or the diagonal of \(\mathbf{P}\) is a vector of weights, then we can allow for unequal variance between the observations within-group errors. If the off-diagonal elements \(\rho\) are non-zero, then we can allow for correlated observations. Note that GLS specifically addresses dependency in errors (correlations are non-zero in the off-diagonal), while weighted least squares (WLS), a special case of GLS, can handle independent but not identically distributed errors (correlations are all 0 in the off-diagonal).
For example, if \(\mathbf{P}\) represented the correlation between two errors for any given lag, then it may have form
\[ \mathbf{P} = \begin{bmatrix} 1 & \rho_1 & \rho_2 & \cdots & \rho_{n-1} \\ \rho_1 & 1 & \rho_1 & \cdots & \rho_{n-2} \\ \rho_2 & \rho_1 & 1 & \cdots & \rho_{n-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho_{n-1} & \rho_{n-2} & \rho_{n-3} & \cdots & 1 \end{bmatrix} \]
This structure leads to the OLS estimator as no longer optimal. Minimization of the objective function through solving the generalized normal equations leads to \(\hat{\beta}_{GLS} = (X^{\prime}\Sigma^{-1}X)^{-1}X^{\prime}\Sigma^{-1}Y\). It follows that
\[ \begin{align} E[\hat{\beta}_{GLS} \mid X] &= \beta \implies \text{unbiased} \\ V[\hat{\beta}_{GLS} \mid X] &= (X^{\prime}\Sigma^{-1}X)^{-1} \\ &= \sigma^2(X^{\prime}P^{-1}X)^{-1} \end{align} \]
By letting \(\Gamma = \sqrt{\Sigma^{-1}}\), we can transform the equation to
\[ \begin{align} \hat{\beta}_{GLS} &= (X^{\prime}\Gamma^{\prime}\Gamma X)^{-1}X^{\prime}\Gamma^{\prime}\Gamma Y \\ &= (X^{*\prime}X^*)^{-1}X^{*\prime}Y^* \end{align} \]
Since \(X^* \equiv \Gamma X\) and \(y^* \equiv \Gamma Y\), the GLS estimator is the OLS estimator for \(y^* = X^*\beta+\epsilon^*\), \(\epsilon^* \sim N(0, \sigma^2I)\), which is just a linear transformation of \(y\) and \(X\). However, this result depends on \(\mathbf{P}\) being known (the correlation and relative variance between the errors are known, only the absolute scale of the variation \(\sigma^2\) is unknown), while in practice \(\mathbf{P}\) is always unknown. Thus, OLS may be used to estimate the regression coefficients, but maximum likelihood estimation (ML) or restricted maximum likelihood estimation (REML) are required for estimating variance parameters (which, after estimating, can be used to update the regression coefficients).
ML maximizes the likelihood with respect to all parameters in a model simultaneously (the regression coefficients and the variance parameters). ML variance-parameter estimates are often biased downward in small samples because ML does not account for the degrees of freedom used to estimate the fixed effects. REML integrates out the regression coefficients and estimates only the variance parameters. When the sample size (or number of clusters) is small, REML tends to reduce this bias because it accounts for the loss of degrees of freedom from estimating the fixed effects. However, REML likelihoods should not be used to compare models with different fixed-effect structures via likelihood ratio tests or information criteria.
Some common variance structures for non-constant variance:
- varFixed
- When the variance of the residuals is linearly related to a continuous covariate (\(x\)).
- \(\Sigma = \sigma^2x\)
- varIdent
- When the variance of the residuals is different for each level (\(k = 1, \dots, K\)) of a categorical covariate(s) and constant within those levels.
- \(\Sigma = \sigma^2_k\mathbf{I}\)
- varPower
- When the variance of the residuals is related to the power of a continuous covariate \(x\).
- \(\Sigma = \sigma^2 |x|^{2\delta}\)
- varExp
- When the variance of the residuals is related to the exponential of a continuous covariate \(x\).
- \(\Sigma = \sigma^2 e^{{2\delta}x}\)
- varComb
- When the variance of the residuals is related to a combination of variance functions. generated by multiplying together the corresponding variance functions.
Common correlation structures for correlated observations:
- Serial correlation
- corCompSymm
- Compound symmetry
- Simplest serial correlation structure. It assumes equal correlation among all within-group errors pertaining to the same group. It’s useful for short time series per group, or when all observations within a group are collected at the same time, as in split-plot experiments.
- corSymm
- General
- The most complex correlation structure. Each correlation is represented by a different parameter. When there are a small number of observations per group, the general correlation structure is useful for determining a more parsimonious correlation model. Must be a positive-definite correlation matrix.
- corAR1
- corCAR1
- Continuous time AR(1) structure.
- corARMA
- corCompSymm
- Spatial correlation
- Semivariogram
- Isotropic Variogram
- The correlation between responses within subject at two times depends only on a measure of the distance between the two times and not the times themselves.
- corExp
- corGaus
- corLin
- corRatio
- corSpher
References
Generalized linear models (GLM)
A model for outcomes such as continuous, binary, binomial, count, positive continuous, or rate data. Take the concepts described for OLS and generalize them for \(Y\), where \(Y\) belongs to the (mostly) exponential family. OLS is a special case of GLM. Ordinal and nominal categorical outcome models are related extensions, but they are not typical single-response GLMs.
A GLM has three components:
- a random component specifying the response distribution.
- a systematic component \(\eta_i = \mathbf{x}_i^\prime\boldsymbol{\beta}\).
- a link function \(g(\mu_i) = \eta_i\) connecting the conditional mean to the linear predictor.
A distribution belongs to the exponential family if it can be written in the form
\[ f(y_i \mid \theta_i, \phi) = \exp\left( \frac{y_i\theta_i - b(\theta_i)}{a_i(\phi)} + c(y_i, \phi) \right) \]
where \(\theta_i\) is the natural parameter, \(\phi\) is a dispersion parameter, and \(a_i(\cdot)\), \(b(\cdot)\), and \(c(\cdot)\) are known functions determined by the distribution. For distributions in the exponential family, the conditional variance of \(Y\) is a function of its mean \(\mu\) and/or the dispersion parameter \(\phi\). The normal distribution has constant, unknown variance \(\sigma^2 = \phi\) while binomial and Poisson families have constant known \(\phi = 1\).
The linear predictor is related to \(E[Y_i \mid \mathbf{x}_i]\) through the link function \(g\).
\[ \begin{aligned} E[Y_i \mid \mathbf{x}_i] &= \mu_i \\ g(\mu_i) &= \eta_i \\ &= \mathbf{x}_i^\prime\boldsymbol{\beta} \end{aligned} \]
The link function \(g\) is a monotonic, invertible, and differentiable function. A link function for which \(g(\mu_i) = \theta_i\) is known as the canonical link, so the linear predictor \(\eta_i\) equals the natural parameter \(\theta_i\). The canonical link is preferred for GLMs since it results in concave log-likelihood functions, simple sufficient statistics, and simple likelihood equations. The link function controls the scale on which covariate effects are additive, and the inverse link maps linear predictions back to the response mean scale.
Examples of canonical link functions:
| Distribution | Canonical Link | Range of Y | Variance function |
|---|---|---|---|
| Gaussian | Identity | \((-\infty, \infty)\) | \(\phi\) |
| Binomial | Logit | \(\frac{0, 1, \dots, m_i}{m_i}\) | \(\frac{\mu(1-\mu)}{m_i}\) |
| Poisson | Log | \(0, 1, 2, \dots\) | \(\mu\) |
| Gamma | Inverse | \((0, \infty)\) | \(\phi \mu^2\) |
| Inverse Gaussian | Inverse-square | \((0, \infty)\) | \(\phi \mu^3\) |
Note that binary data are a subset of binomial data with \(m_i = 1\) trial for each observation, so that \(Y_i = 0\) or \(Y_i = 1\) and \(E(Y_i \mid \mathbf{x}_i) = \mu_i\) is the probability of observing \(Y_i = 1\).
Examples of common link functions:
| Link | \(\eta = g(\mu)\) | \(\mu = g^{-1}(\eta)\) |
|---|---|---|
| Identity | \(\mu\) | \(\eta\) |
| Log | \(\log(\mu)\) | \(\exp(\eta)\) |
| Inverse | \(\mu^{-1}\) | \(\eta^{-1}\) |
| Inverse-square | \(\mu^{-2}\) | \(\eta^{-1/2}\) |
| Square-root | \(\sqrt{\mu}\) | \(\eta^2\) |
| Logit | \(\log\left( \frac{\mu}{1-\mu} \right)\) | \(\frac{1}{1+\exp(-\eta)}\) |
| Probit | \(\Phi^{-1}(\mu)\) | \(\Phi(\eta)\) |
| Log-log | \(-\log[-\log(\mu)]\) | \(\exp[-\exp(-\eta)]\) |
| Complementary log-log | \(\log[-\log(1-\mu)]\) | \(1-\exp[-\exp(\eta)]\) |
Model parameters (both the regression coefficients and asymptotic standard errors of the coefficients) are estimated using maximum likelihood (ML). The basic Wald statistic is \(Z = \frac{\hat{\beta} - \beta_0}{SE(\hat{\beta})}\). For distributions with unknown parameter \(\phi\), the method of moments (MoM) is commonly used as its estimator, although ML could in principle be used as well.
The log-likelihood is found to be
\[ \log(L(\mathbf{\theta}, \phi;\mathbf{y})) = \sum_{i = 1}^{n} \frac{Y_i\theta_i - b(\theta_i)}{a_i(\phi)} + c(Y, \phi) \]
where \(a_i(\cdot)\), \(b(\cdot)\), and \(c(\cdot)\) are known functions depending on the exponential family, \(\theta_i\) is the natural parameter, and \(\phi > 0\) is a dispersion parameter. When the canonical link is used, \(\eta_i = g(\mu_i) = \theta_i\).
The estimating equations for the regression parameters are found by differentiating the log-likelihood with respect to each coefficient using the chain rule. Setting this sum to zero results in the maximum-likelihood estimating equations. Iterative weighted least squares (IWLS, otherwise known as iterative reweighted least squares IRWLS or IRLS) is used to estimate the MLE of the coefficients. The IWLS steps are
- Start with initial estimates for \(\hat{\mu}\) and \(\hat{\eta} = g(\hat{\mu})\). Usually just set \(\hat{\mu}_i = Y_i\).
- At each iteration, compute the working response \(Z\) and the working weights \(W\) using the chosen values of \(\hat{\mu}\) and \(\hat{\eta} = g(\hat{\mu})\)
- Fit a weighted least squares regression of \(Z\) on \(X\) using \(W\) as weights.
- Repeat steps 2 and 3 until convergence to the MLE of the coefficients.
References
Linear mixed models (LMM)
A model for numeric outcomes. Take the concepts described for OLS and generalize them for data resulting from hierarchical structures (lower-level units are nested within higher-level units) or longitudinal structures (measurements are taken over time) which implies that errors may be correlated and/or have unequal variance. Alternatives to LMM include GLS or Bayesian hierarchical models.
The model formula in Laird-Ware form is (Laird and Ware 1982)
\[ \begin{align} Y_{ij} &= \beta_0 + \beta_1X_{1ij} + \beta_2X_{2ij} + \dots + \beta_pX_{pij} + \delta_{1i}Z_{1ij} + \cdots + \delta_{qi}Z_{qij} + \epsilon_{ij} \\ \delta_{ki} &\sim N(0, \Psi^2_k) \\ \epsilon_{ij} &\sim N(0, \sigma^2_{\epsilon}\lambda_{ij}) \end{align}\]
where
- \(Y_{ij}\) is the value of the outcome variable for observation \(j = 1, \dots, n\) in cluster \(i = 1, \dots, m\).
- The \(\beta\)s are fixed-effect coefficients for fixed parameters \(g = 1, \dots, p\) and are constant within clusters.
- The \(X\)s are fixed-effect regressors.
- The \(\delta\)s are random-effect coefficients for random variables \(k = 1, \dots, q\). In a standard LMM, the random effects are assumed to be normally distributed and to vary by group. Random effects are independent between groups. They are ‘random’ because if the study were to be repeated, different clusters would be sampled, resulting in different random effect estimates. We are not interested in the effect of a particular grouping level, but may be interested in the variability between groups.
- \(\delta_{ki}, \delta_{k^{\prime}i^{\prime}}\) are independent for \(i \neq i^{\prime}\)
- The \(Z\)s are random-effect regressors. usually a subset of the \(X\)s.
- \[ \begin{align} E[Y \mid X, \delta] &= \mu = X\beta+Z\delta \\ V[Y \mid X,\delta] &= \sigma^2_{\epsilon}\lambda_{ij} \end{align}\]
- The \(\epsilon\)s are errors for observations within clusters.
- Residual vectors are independent across clusters and independent of the random effects. Within a cluster, residuals may be correlated according to \(\Lambda_i\).
- \(\delta_{ki}, \epsilon_{i^{\prime}j}\) are independent for all \(i, i^{\prime}, k, j\)
- \(\Psi^2_k\) is the variance and \(\Psi_{kk^{\prime}}\) the covariance of the random effects. These are assumed to be constant across groups.
- \(\sigma^2\lambda_{ijj^{\prime}}\) are the covariances among the errors in cluster \(i\).
- The \(\Psi\)s and \(\lambda\)s define the dependencies among the random effects and errors within clusters, respectively.
In matrix form, we have
\[ \begin{align} \mathbf{y}_{i} &= \mathbf{X}_i \beta + \mathbf{Z}_i \delta_{i} + \epsilon_{i} \\ \delta_{i} &\sim N_q(0, \Psi) \\ \epsilon_{i} &\sim N_{n_i}(0, \sigma^2_{\epsilon}\Lambda_{i}) \end{align}\]
where
- \(\mathbf{y}_{i}\) is a \(n_i \times 1\) vector for observations in the \(i\)th of \(m\) groups.
- \(\mathbf{X}_{i}\) is a \(n_i \times p\) model matrix for the fixed-effects in group \(i\).
- \(\mathbf{\beta}\) is a \(p \times 1\) vector.
- \(\mathbf{Z}_{i}\) is a \(n_i \times q\) model matrix for the random-effects in group \(i\).
- \(\mathbf{\delta}_{i}\) is a \(q \times 1\) vector.
- \(\mathbf{\epsilon}_{i}\) is a \(n_i \times 1\) vector.
- \(\mathbf{\Psi}_{i}\) is a \(q \times q\) covariance matrix for the random-effects.
- \(\mathbf{\sigma^2}_{\epsilon}\Lambda_i\) is a \(n_i \times n_i\) covariance matrix for the errors in group \(i\). If the within-group errors are constant and are independent, then \(\Lambda_i = \mathbf{I}_{n_i}\).
where the above can be simplified further by defining \(n = \sum_{i=1}^{m}n_i\). Alternatively, the model structure can be described in a hierarchy of simpler models. When this model formula structure is used, they are often referred to as hierarchical linear models (HLM) or multilevel linear models (MLM). These alternative forms are identical to the Laird-Ware form which is common under the LMM umbrella.
The algorithms used for optimization of the likelihood function are most commonly Newton-Raphson, followed by expectation-maximization (EM) (or expectation conditional maximization, or ECM/ECME), or Fisher scoring. These algorithms may be used for maximum likelihood (ML) or restricted maximum likelihood (REML) estimation.
Optimization algorithms may fail because
- Iterations reach a singular gradient
- The maximum number of iterations is reached without convergence
- Variance estimates are not all positive or may be exactly 1
- Covariance matrices are not positive definite
If models differ in fixed-effects, then ML must be used for model comparison. If models differ only in random-effects, then REML may be used for model comparison. Generally parameter estimates of the final model should be fit using REML.
The landscape in R looks like the following:
| Feature | nlme | lme4 | glmmTMB |
|---|---|---|---|
| REML for Gaussian LMMs | Yes | Yes | Yes |
| residual unequal variance | Yes | No | Yes |
| residual correlated errors | Yes | No | Limited |
| crossed random effects | No | Yes | Yes |
In general, you should have a relatively large number of groups/clusters (required for robust estimate of between group variation) with a small or medium number of observations within groups (enough to estimate group specific means and/or slopes).
References
Generalized linear mixed models (GLMM)
A GLMM extends a GLM by adding random effects to the linear predictor for clustered or grouped outcomes such as continuous, binary, binomial, count, positive continuous, or rate data. Conditional on the random effects, \(Y\) is modeled with an exponential-family distribution. Ordinal and nominal categorical mixed models are related extensions, but they require response-specific likelihoods and link functions.
For observation \(j\) in cluster \(i\),
\[ \begin{aligned} \mu_{ij} &= E[Y_{ij} \mid \boldsymbol{\delta}_i] \\ g(\mu_{ij}) &= \mathbf{x}_{ij}^{\prime}\boldsymbol{\beta} + \mathbf{z}_{ij}^{\prime}\boldsymbol{\delta}_i \\ \boldsymbol{\delta}_i &\sim N_q(0, \boldsymbol{\Psi}) \end{aligned} \]
References
- Faraway (2016)
Generalized additive models (GAM)
A model for outcomes such as continuous, binary, binomial, count, positive continuous, or rate data when the mean response may vary smoothly and nonlinearly with predictors. Take the concepts described for OLS and GLM and generalize them for flexible prediction of \(Y\) on the basis of multiple predictors \(X\). Alternatives to GAM include spline transformations in OLS or GLM models. Ordinal and nominal categorical GAMs are specialized extensions rather than the basic GAM setting.
The model formula may be written on the link scale as
\[ g(\mu_i) = \eta_i = \beta_0 + \sum_{j=1}^{p} s_j(x_{ij}) \]
where \(\mu_i = E[Y_i \mid \mathbf{x}_i]\) and \(s_j(\cdot)\) is a smooth (nonlinear) function of predictor \(j\). This is referred to as an additive model since separate smooths are calculated for each \(X\) and then added together. For a Gaussian identity-link GAM, this reduces to \(Y_i = \beta_0 + \sum_{j=1}^{p} s_j(x_{ij}) + \epsilon_i\). Interactions can be represented with multivariable smooths, including tensor product smooths, such as \(s_{jk}(x_{ij}, x_{ik})\).
The objective function for GAM is similar to OLS or GLM with the addition of a penalty term for the amount of smoothness in the fits. For example, if fitting piecewise linear regression we have
\[ \begin{align} \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_jx_{ij} \right)^2 + \lambda \sum_{j=2}^{k-1} (s(x^*_{j-1}) - 2s(x^*_{j}) + s(x^*_{j+1}))^2 \end{align} \]
where \(x^*_j\) are the \(j=1, \dots, k\) knots. Since natural cubic splines (restricted cubic splines) are almost always a good default choice, the resulting objective function becomes
\[ \begin{align} \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p s_j(x_{ij}) \right)^2 + \sum_{j=1}^{p} \lambda_j \int s_j^{\prime \prime} (t_j)^2 dt_j \end{align} \]
which is the general case for knots at each unique value of \(X\). In practice, an iterative procedure known as ‘backfitting’ is used in conjunction with a method for likelihood maximization (Newton-Raphson or iteratively weighted least squares (IWLS)) to generate an optimal level of smoothness. The tuning parameter \(\lambda_j \geq 0\) controls the amount of smoothness. \(\lambda = 0\) results in no penalization and \(\lambda \to \infty\) produces a straight line.
References
Generalized estimating equations (GEE)
A model for correlated outcomes such as continuous, binary, binomial, count, positive continuous, or rate data. Take the concepts described for GLS and GLM and generalize them for \(Y\), where observations within a cluster may be associated and the variance is linked to the marginal mean. Ordinal and nominal categorical GEEs exist, but they are specialized extensions rather than the basic GEE setting.
GEE models the marginal mean of clustered observations and uses a working covariance structure to account for within-cluster association. This results in population-averaged interpretations of the regression coefficients. This is in contrast to mixed-effect models which produce cluster-specific estimates that have conditional interpretations.
GEE does not specify a complete joint distribution, thus it does not use a likelihood and likelihood-based fitting methods. Likelihood-based methods cannot be used for assessment of model fit, so quasi-likelihood or Wald statistics based on asymptotic normality and the estimated covariance matrix are required. This makes GEE computationally simple relative to other ML based models (GLS, GLM) and places no distributional assumptions on the coefficients. With independent clusters and a correctly specified marginal mean model, regression coefficient estimates remain consistent even if the working covariance structure is incorrect. Robust sandwich standard errors are then used to account for covariance misspecification, although they rely on large-sample approximations in the number of clusters.
References
Time-to-event models (Survival)
There are many different ways to fit time-to-event models, so I’ll focus on just a few common methods here.
Time-to-event models are used when the outcome is the time until an event and some event times may be censored. Unlike a standard GLM for a fully observed response, survival models use both the observed follow-up time and the event indicator. The likelihood is constructed from the survival, density, hazard, or cumulative hazard functions, depending on the model. In the absence of censoring, positive-outcome regression models such as gamma GLMs and parametric survival models can both model event times, but they make different distributional assumptions.
For right-censored data, the observed time is \(\tilde{T} = \min(T, C)\), where \(T\) is the event time and \(C\) is the censoring time. The event indicator is \(\delta = I(T \le C)\), where \(\delta = 1\) indicates that the event was observed and \(\delta = 0\) indicates right-censoring.
Censoring occurs when we only have partial information on a subjects outcome. Right-censoring is defined as an event occurring after time of recording (so the total time to event is greater than the observed time). Right-censoring may be a result from the study ending before the subject had a chance to experience the event, or when the subject is lost to follow-up during the study. Left-censoring is defined as an event occurring before the time of recording (so the actual time to event is less than the time to recording). Interval censoring is defined as an event occurring between two known time points, but the exact time-to-event is unknown.
There are three levels of assumptions for censoring:
Random censoring
Subjects censored at time \(t\) should be representative of all subject who remained at risk at time \(t\).
Independent censoring
Subjects in subgroup \(g\) censored at time \(t\) should be representative of all subjects in subgroup \(g\) who remained at risk at time \(t\). i.e. random censoring conditional on each covariate, or conditional independence of \(T\) and \(C\) given covariates.
Non-informative censoring
When the distribution of survival times \(T\) provides no information about the distribution of censoring times (C). Non-informative censoring is often justifiable when censoring is random or independent.
In general, think of non-informative censoring as censoring due to reasons unrelated to the study. If censoring is caused by poor treatment leading to drop-out, total cures leading to drop-out, etc, then you will have informative censoring. Administrative censoring is usually non-informative because the study ends at the same time for everyone (as long as the study started at the same time for everyone).
If censoring is informative, standard survival models may be biased. Possible approaches for this scenario include using IPTW to weight similar subjects with non-missing information, joint modeling of the event and censoring process, or multiple imputation. These methods require additional assumptions that should be stated explicitly. Otherwise, you may need to abandon the model.
The survival function is the probability that a subject survives longer than some specific time \(t\); \(S(t) = P(T > t)\). In standard survival models without a cured fraction, \(S(t)\) is monotone non-increasing with \(S(0) = 1\) and \(\lim_{t \to \infty} S(t) = 0\). In practice we may estimate \(S(t)\) as a step function (See the Kaplan-Meier estimate).
The hazard function is the instantaneous event rate at time \(t\) among subjects who have survived up to time \(t\). In notation:
\[h(t) = {\lim \atop \Delta t \to 0} \frac{P(t \leq T < t + \Delta t \mid T \geq t)}{\Delta t}\]
In other words, the hazard is an instantaneous event rate at time \(t\) given the subject has survived up to \(t\) as the time interval approaches 0 (\(\lim_{\Delta t \to 0}\)). As an analogy, think of this as the speed you are travelling at a given time point, but what you really want (and don’t know yet) is the total distance covered over the total time of travel. We only assume \(h(t) \geq 0\), so the curve across time may increase or decrease.
The cumulative hazard function is
\[ H(t) = \int_0^t h(u) \, du. \]
For continuous event times, the survival, density, hazard, and cumulative hazard functions are related by
\[ \begin{aligned} S(t) &= \exp\{-H(t)\}, \\ f(t) &= -\frac{dS(t)}{dt}, \\ h(t) &= \frac{f(t)}{S(t)} = -\frac{d}{dt}\log S(t). \end{aligned} \]
References
- Kleinbaum and Klein (2012)
Cox proportional hazards models
The Cox proportional hazards model is
\[ h(t \mid X) = h_0(t) \exp(\beta_1X_1 + \cdots + \beta_pX_p) \]
where \(h_0(t)\) is the unspecified baseline hazard for subjects with \(\mathbf{x} = \mathbf{0}\), or for the reference covariate pattern after centering or coding. Using the method of partial likelihood developed by Cox, the regression coefficients can be estimated without estimation of the baseline hazard. After fitting the model, the baseline cumulative hazard and baseline survival function can be estimated.
A stratified Cox model allows the baseline hazard to vary by strata. The stratifying variable has no associated regression coefficient, but each stratum gets its own baseline hazard while the other covariate effects are shared across strata. This is useful when the stratifying variable may violate proportional hazards and its coefficient is not of direct interest. The stratifying variable should take only a few distinct values with sufficient observations in each level.
Because the Cox model leaves the baseline hazard unspecified but models covariate effects parametrically, it is a semiparametric model.
The assumptions for Cox proportional hazards regression are
Validity
The data collected maps correctly to the questions of interest. The response variable reflects the phenomenon of interest (time to event or censoring) and all relevant predictors are included.
Linearity
The model assumes that each covariate has a linear effect on the log hazard ratio. For two covariate vectors \(\mathbf{x}_1\) and \(\mathbf{x}_0\),
\[ \log\left\{\frac{h(t \mid \mathbf{x}_1)}{h(t \mid \mathbf{x}_0)}\right\} = (\mathbf{x}_1 - \mathbf{x}_0)^\prime \boldsymbol{\beta}. \]
No perfect multicolinearity
The predictor variables are not a linear combination of each other.
Independence of observations
Event times are independent across observational units, conditional on covariates. For clustered, matched, or recurrent-event data, dependence should be handled with robust standard errors, frailty terms, or a model designed for recurrent events.
No event time bias (immortal time bias)
Covariates must be observed before the risk interval in which they are used. Otherwise, you require that the observation remain event free until the time they were measured on that variable. Baseline covariates should be measured before start of survival time. Time-varying covariates are allowed, but their values must be updated using information available before or at each risk time. Using future covariate information may introduce immortal time bias.
No informative censoring bias
Censoring should be independent of the event time, conditional on the covariates included in the model. Subjects censored at time \(t\) should be representative of subjects with the same covariate history who remain at risk at time \(t\).
Proportional hazards
The hazard ratio for a predictor is constant over time. Equivalently, there are no unmodeled interactions between predictors and time on the log hazard scale.
Weibull survival models
The Weibull survival model is a parametric model in which the event time \(T\) is assumed to follow a Weibull distribution. This assumption implies specific forms for the survival and hazard functions.
One common parameterization is
\[ S(t) = \exp\{-(\lambda t)^\gamma\} \]
and
\[ h(t) = \gamma \lambda^\gamma t^{\gamma - 1}, \]
where \(\lambda > 0\) is a scale parameter and \(\gamma > 0\) is a shape parameter. When \(\gamma = 1\), the Weibull model reduces to the exponential model with constant hazard. When \(\gamma > 1\), the hazard increases over time. When \(\gamma < 1\), the hazard decreases over time.
The Weibull model is unusual because it can be expressed as either a proportional hazards model or an accelerated failure time model, depending on the parameterization. Unlike the Cox model, the Weibull model estimates the baseline hazard parametrically. This can improve efficiency when the Weibull assumption is reasonable, but it can introduce bias when the hazard shape is misspecified.
Counting process models
Counting process models represent survival data using start-stop risk intervals and event counts over time. For subject \(i\), the counting process \(N_i(t)\) records the number of observed events up to time \(t\), and the at-risk process \(Y_i(t)\) indicates whether subject \(i\) is under observation and at risk at time \(t\). This framework is useful for recurrent events, delayed entry, time-varying covariates, and other settings where each subject can contribute multiple risk intervals. Because multiple intervals or events from the same subject are usually correlated, robust standard errors, frailty terms, or another subject-level dependence structure are often needed.
Andersen-Gill models
The Andersen-Gill model treats recurrent events as repeated increments in one counting process. For subject \(i\),
\[ N_i(t) = \sum_{k \ge 1} I(T_{ik} \le t, \delta_{ik} = 1), \]
where \(T_{ik}\) is the observed time for recurrent event \(k\) and \(\delta_{ik}\) indicates whether event \(k\) was observed. The intensity model is
\[ \lambda_i(t \mid \mathcal{H}_i(t)) = Y_i(t)\lambda_0(t)\exp\{\mathbf{x}_i(t)^\prime\boldsymbol{\beta}\}, \]
where \(\mathcal{H}_i(t)\) is the event history before time \(t\), \(\lambda_0(t)\) is a common baseline hazard, and \(\mathbf{x}_i(t)\) may include time-varying covariates. A subject can return to the risk set after each event if they remain under observation. The model is simple and flexible, but it does not explicitly distinguish first, second, third, and later events. Robust sandwich standard errors are commonly used because recurrent events within a subject may be correlated.
Prentice-Williams-Peterson models
The Prentice-Williams-Peterson model stratifies the baseline hazard by event number. For event number \(k\), the intensity model is
\[ \lambda_{ik}(t \mid \mathcal{H}_i(t)) = Y_{ik}(t)\lambda_{0k}(t)\exp\{\mathbf{x}_i(t)^\prime\boldsymbol{\beta}\}, \]
where \(Y_{ik}(t)\) indicates whether subject \(i\) is at risk for event \(k\) at time \(t\), and \(\lambda_{0k}(t)\) is the baseline hazard for event number \(k\). The risk set for event \(k\) includes subjects who have already experienced event \(k - 1\). This makes the model conditional on event order. It is useful when the first event, second event, and later events may have different baseline risks. It is less appropriate when event order is not meaningful.
Wei-Lin-Weissfeld models
The Wei-Lin-Weissfeld model treats each event number as a separate marginal process. For event number \(k\), one common model is
\[ \lambda_{ik}(t) = Y_{ik}(t)\lambda_{0k}(t)\exp\{\mathbf{x}_i^\prime\boldsymbol{\beta}_k\}, \]
where \(\lambda_{0k}(t)\) is an event-specific baseline hazard and \(\boldsymbol{\beta}_k\) is an event-specific coefficient vector. Each subject can be considered at risk for each event number from the time origin. The model estimates event-specific marginal hazard ratios. It is useful when comparing covariate effects across event order or event type. For ordered recurrent events, the risk-set construction can be less intuitive because subjects may be included in risk sets for later events before earlier events occur. Robust standard errors are commonly used to handle within-subject dependence.
Frailty models
Frailty models extend survival models by adding a random effect that represents unobserved subject-level or cluster-level risk. They are useful when event times are correlated within groups, or when some subjects or clusters are systematically more event-prone than others after accounting for observed covariates.
A shared frailty model can be written as
\[ h_{ij}(t \mid \mathbf{x}_{ij}, u_i) = u_i h_0(t)\exp\{\mathbf{x}_{ij}^{\prime}\boldsymbol{\beta}\}, \]
where \(u_i > 0\) is the frailty term for cluster \(i\). The frailty distribution is usually constrained so that \(E(u_i) = 1\), making \(h_0(t)\) interpretable as the baseline hazard for an average frailty level. Subjects in the same cluster share the same frailty term, which induces dependence among their event times. A larger frailty value corresponds to a higher hazard for all subjects in that cluster.
Equivalently, using \(b_i = \log(u_i)\),
\[ \log h_{ij}(t \mid \mathbf{x}_{ij}, b_i) = \log h_0(t) + \mathbf{x}_{ij}^{\prime}\boldsymbol{\beta} + b_i. \]
Common choices include gamma frailty for \(u_i\) and normal random effects for \(b_i = \log(u_i)\). The frailty variance measures the amount of unexplained heterogeneity between clusters. Regression coefficients are interpreted conditionally on the frailty term, similar to cluster-specific effects in mixed models. These conditional hazard ratios are not generally equal to marginal hazard ratios averaged over the frailty distribution.
Accelerated failure time models
Accelerated failure time models describe covariate effects as multiplicative changes in survival time rather than multiplicative changes in the hazard. They model the log event time as a linear function of covariates.
\[ \log(T_i) = \mathbf{x}_i^{\prime}\boldsymbol{\beta} + \sigma \epsilon_i, \]
where \(\epsilon_i\) follows a specified distribution and \(\sigma > 0\) is a scale parameter. The distribution of \(\epsilon_i\) determines the corresponding survival distribution for \(T_i\). Common AFT distributions include Weibull, log-normal, log-logistic, and exponential models.
Exponentiating a coefficient gives a time ratio. For a one-unit increase in covariate \(x_j\), \(\exp(\beta_j) > 1\) means the survival time is stretched, while \(\exp(\beta_j) < 1\) means the survival time is shortened.
AFT models are parametric, so they can estimate survival times and survival probabilities directly when the distributional assumption is reasonable. Unlike Cox proportional hazards models, AFT models do not require hazards to be proportional, but they do require the specified event-time distribution to be adequate.
Censored observations contribute information through the survival function rather than the density function. For right-censored data, observed events contribute \(f(t_i \mid \mathbf{x}_i)\) and censored observations contribute \(S(t_i \mid \mathbf{x}_i)\) to the likelihood.
Landmark method
The landmark method is used when prediction or treatment comparison depends on information measured after baseline. A fixed landmark time \(t_L\) is chosen, and the analysis includes only subjects who are alive and event-free at \(t_L\). Covariates are defined using information available up to \(t_L\), and subsequent survival is modeled from \(t_L\) forward.
For subjects with \(T_i > t_L\), define the post-landmark outcome as
\[ T_i^* = T_i - t_L. \]
A landmark Cox model can then be written as
\[ h_i(t \mid T_i > t_L, \mathbf{x}_i(t_L)) = h_0(t)\exp\{\mathbf{x}_i(t_L)^\prime\boldsymbol{\beta}\}, \quad t > t_L. \]
In simple terms, the landmark analysis asks: among subjects who have survived to \(t_L\), how do measurements available at \(t_L\) predict future event risk? Subjects who experience the event before \(t_L\) are excluded because they were never eligible for prediction at the landmark time.
The landmark method can reduce immortal time bias because covariates are only defined using information available at or before \(t_L\). The landmark time should be chosen based on the scientific question, not selected after looking for the most favorable result. Changing \(t_L\) changes the target population because the analysis conditions on surviving event-free to that time.
Competing event models
Fine-Gray cumulative incidence
Aalen-Johansen cumulative incidence
Fine-Gray competing risk model
Multistate model
Multi-task logistic regression (MTLR)
- https://proceedings.neurips.cc/paper/2011/file/1019c8091693ef5c5f55970346633f92-Paper.pdf
- https://www.jmlr.org/papers/volume21/18-772/18-772.pdf
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8232898/pdf/nihms-1714699.pdf
- https://www.jmlr.org/papers/volume21/18-772/18-772.pdf
- https://www.biorxiv.org/content/10.1101/2021.07.11.451967v2.full.pdf
- https://cran.r-project.org/web/packages/MTLR/
- https://cran.r-project.org/web/packages/MTLR/vignettes/workflow.html
Joint models
Tree-based models
Classification/Regression trees are models that relate a categorical/numeric response to their predictors by recursive binary splits. Elith et al. (2008) described how tree-based models inherently model complex nonlinear relationships and interactions between predictors through subsequent recursive splits in the tree.
Classification and regression tree (CART)
CART models are one of the most popular methods for building trees. Alternative methods include C4.5 or C5.0: https://www.rulequest.com/see5-unix.html.
Regression trees
The regression tree model is defined as
\[ \hat{y} = \sum_{m = 1}^M c_mI(X \in R_{m}) \]
where \(m = 1, \dots, M\) is the number of partition regions, \(c_m\) is a constant for the predicted value of \(Y\) in region \(R_m\), and \(R_m\) is a particular region in the covariate space. If we wish to minimize the sum of squared errors \(\sum(y_i - \hat{y}_i)^2\), then \(\hat{c}_m = \text{mean}(y_i \mid x_i \in R_m)\) is the best estimate. The location of the binary split is found through a top-down, greedy approach known as recursive binary splitting. For each predictor \(X_1, \dots, X_p\) and cutpoint \(s\) for each predictor, choose the predictor and cutpoint which minimizes error the greatest. So we wish to minimize
\[ \sum_{x_i \in R_1(j, s)} (y_i - \hat{y}_{R_1})^2 + \sum_{x_i \in R_2(j, s)} (y_i - \hat{y}_{R_2})^2 \]
This process is then repeated, searching for the best predictor and cutpoint which minimizes error within the two previously identified regions, resulting in three regions. The stopping criterion may be to repeat this process until no region has more than 10 observations, for example.
The algorithm can be defined as
- Use recursive binary splitting to grow a large tree on the training data, stopping only when each terminal node has fewer than some minimum number of observations.
- Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of \(\alpha\).
- Use K-fold cross-validation to choose \(\alpha\). That is, divide the training
observations into \(K\) folds. For each \(k = 1, \dots, K\):
- Repeat Steps 1 and 2 on all but the kth fold of the training data.
- Evaluate the mean squared prediction error on the data in the left-out kth fold, as a function of \(\alpha\).
- Return the subtree from Step 2 that corresponds to the chosen value of \(\alpha\).
The tuning parameter \(\alpha\) controls the trade-off between the subtree’s complexity and its fit to the training data. When \(\alpha = 0\), then the subtree will be at its most complex. As \(\alpha\) increases from zero, branches get pruned from the tree.
References:
Classification trees
The classification tree is similar to the regression tree, except used for categorical outcomes instead of continuous. The classification tree model is defined as
\[ \hat{p}_{mk} = \frac{1}{N_m} \sum_{x \in R_m} I(y_i = k) \]
for node \(m\) in region \(R_m\) with \(N_m\) observations. There are three common objective functions which we could optimize
\[ \begin{align} \text{Misclassification error:}& \frac{1}{N_m} \sum_{i \in R_m} I(y_i \neq k(m)) = 1 - \hat{p}_{mk(m)} \\ \text{Gini index:}& \sum_{k \neq k^{\prime}} \hat{p}_{mk} \hat{p}_{mk^{\prime}} = \sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk}) \\ \text{Cross-entropy:}& - \sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk} \end{align} \]
where \(\hat{p}_{mk}\) is the proportion of training observations in the \(m\)th region that are from the \(k\)th class. All three are suitable for pruning of classification trees, however, the classification error is preferable for optimizing the prediction accuracy of the final pruned tree.
References:
Random forests
Random forests is a technique that falls under the ‘averaging’ or ‘bagging’ class of ensemble-based models. Random forests combines the estimators of many fully grown decision trees (high variance) and averages them to produce an estimator with lower variance.
- For \(b = 1, 2, \dots, B\), where \(B\) is the desired number of trees
- Generate a bootstrapped dataset (with replacement) from the original data.
- Grow the tree \((T_b)\) to the largest extent possible for the chosen stopping criteria and do not prune. Follow these steps for each terminal node of the tree:
- Randomly sample \(p\) predictor variables from the \(P\) predictors available (\(p < P\)).
- Select the best variable/split-point among the \(p\).
- Split the node into two daughter nodes.
- Output the ensemble of trees \(\{T_b\}^B\).
Model predictions are either the most common predicted class of the \(B\) trees (classification) or the mean prediction (sum divided \(B\); regression). Since the
Since each tree is identically distributed, the low expected bias of any one tree is kept as a property for the ensemble of trees. However, the ensemble of trees has reduced variance compared to a single tree.
Tuning parameters of interest:
- The number of trees \(B\). We want enough to produce stable estimates without using too many for efficiency sake.
- The number of predictor variables to sample for each node of a tree.
- For classification, a default recommendation is \(\lfloor \sqrt{P} \rfloor\).
- For regression, a default recommendation is \(\lfloor \frac{P}{3} \rfloor\).
- Tree complexity.
- For classification, minimum node size as 1.
- For regression, a minimum node size as 3.
References:
Gradient boosted trees
Gradient boosted trees is a technique that falls under the ‘boosting’ class of ensemble models. Boosted trees combine many shallow decision trees (high bias) through a sequential algorithm to minimize a chosen loss function.
For squared-error regression, the algorithm can be written as
- Set \(\hat{f}(x) = \bar{y}\) and \(r_i = y_i - \hat{f}(x_i)\) for all \(i\) in the training set.
- For \(b = 1, 2, \dots, B\)
- Fit a tree \(\hat{f}^b\) with \(d\) splits (\(d+1\) terminal nodes) to the training data \((X, r)\).
- Update \(\hat{f}\) by adding in a shrunken version of the new tree \(\hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^b(x)\).
- Update the residuals \(r_i \leftarrow r_i - \lambda \hat{f}^b(x_i)\).
- Output the boosted model \(\hat{f}(x) = \bar{y} + \sum_{b=1}^B \lambda \hat{f}^b(x)\)
More generally, the base learner is fit to pseudo-residuals, which are the negative gradient of the loss function with respect to the current predictions. The residual update above is the special case for squared-error loss.
Tuning parameters of interest:
- The number of trees \(B\). The model is at risk of overfitting if \(B\) is too large. Cross-validation can be used to select B.
- The shrinkage parameter \(\lambda \in \mathbb{R}^+\) which controls the learning rate. Usually this is set around 0.001 to 0.01. A small \(\lambda\) can require a large \(B\) to achieve good performance.
- Tree complexity.
- The number of splits in each tree \(d\). This controls the complexity of the ensemble. If \(d=1\), each tree is a stump (single split) which is equivalent to an additive model since each term is a single variable. Higher values can involve \(d\) variables, thus it can be said to control the interaction depth of the model.
References:
K-nearest neighbors (KNN)
A method used to predict categorical or quantitative outcomes. KNN is a nonparametric, model free method.
Neural networks
A method used to predict categorical or quantitative outcomes. Like many other methods presented here, the term neural network is overloaded. I’ll focus on the the most basic example, the single layer perceptron.