Brett Klamer

Table of Contents

Learning Methods

Ordinary least squares (OLS)

A model used to predict quantitative outcomes. The model formula is

\[ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_pX_p + \epsilon \]

which may be described as

\[ \begin{aligned} \mathbf{y}_{n \times 1} &= \mathbf{X}_{n \times k+1} \mathbf{\beta}_{k+1 \times 1} + \mathbf{\epsilon}_{n \times 1} \\ E[Y \mid X] &= \mu \\ &= \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_pX_p \\ &= \text{linear predictor} \\ Y \mid X &\sim N(\mu, \sigma^2) \end{aligned} \]

The predictor variables \(X\) may include quantitative variables, transformations of quantitative variables (polynomials, splines, etc.), dummy variables, interactions, etc.

Assumptions made for OLS models are

  1. Validity

    The data collected maps correctly to the questions of interest. The response variable reflects the phenomenon of interest and all relevant predictors are included.

  2. Additivity

    The effects of the covariates on the response are additive, along with the additive error term.

  3. Linearity

    The response variable is a deterministic component of a linear function of the predictors (linear in the parameters) plus an error term. The conditional means of the response are a linear function of the predictor variables.

  4. No multicolinearity

    The predictor variables are not a linear combination of each other. The design matrix \(\mathbf{X}\) is full rank.

  5. Independence of errors

    The errors from the regression line are independent (uncorrelated). In other words, the observations are independent from one another.

  6. Equal variance of errors

    The errors from the regression line have similar variability across the range of the response.

  7. \(E[\epsilon \mid X] = 0\)

    The errors have mean 0. This is also known as the strict exogeneity assumption. Exogeneity is violated when a confounding variable is not included.

  8. Normality of errors

    Normality of errors provides finite-sample guarantees for hypothesis testing and equivalence between the OLS estimate and the MLE.

The objective (cost) function is given as

\[ \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_jx_{ij} \right)^2 \]

Minimization of the objective function has a closed form solution through solving the normal equations \((X^{\prime}X)\hat{\beta} = X^{\prime}y\) which results in \(\hat{\beta} = (X^{\prime}X)^{-1}X^{\prime}Y\).

\[ \begin{aligned} E[\hat{\beta} \mid X] &= \beta \implies \text{unbiased}\\ V[\hat{\beta} \mid X] &= \sigma^2(X^{\prime}X)^{-1} \\ &= \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \\ \hat{\beta} &\sim N(\beta, \sigma^2(X^{\prime}X)^{-1}) \end{aligned} \]

Note that the residuals (observed \(e = y - \hat{y}\)) are estimates of the error (unobserved \(\epsilon = y - x\beta\)).

\[ \begin{aligned} E[\epsilon \mid X] &= 0 \\ E[e \mid X] &= (I - H)E[\epsilon \mid X] \\ &= E[Y \mid X] - E[\hat{Y} \mid X] \\ &= 0 \\ V[\epsilon \mid X] &= \sigma^2 I_{n \times n} \\ V[e \mid X] &= \sigma^2(I - H) \\ H &= X(X^{\prime}X)^{-1}X^{\prime}\\ &= \text{hat matrix} \\ &= \text{projection matrix} \implies \hat{\mu} = Hy \\ s^2 &= \frac{e^{\prime}e}{n - p} \\ &= \frac{1}{n-p} \sum_{i=1}^n (y - X\hat{\beta})^2 \\ E[s^2 \mid X] &= \sigma^2 \implies \text{unbiased} \\ V[s^2 \mid X] &= \frac{2\sigma^4}{n-p} \end{aligned} \]

What if you would like to run a hypothesis test (generate p-values or confidence intervals) but notice violations of the OLS assumptions? Bootstrap confidence intervals of parameter estimates may be used to relax the assumption that \(Y \mid X\) is normal. Robust standard errors (Huber-White standard errors, heteroskedasticity-consistent standard errors, sandwich estimator, etc.) may be used to relax the assumption of constant variance. However, if the model is misspecified, the parameter estimates are likely to be meaningless outside the descriptive statistic context.


Generalized least squares (GLS)

A model for numeric outcomes. Take the concepts described for OLS and generalize them for errors which have non-constant variance and/or are not all independent. In other words, an extended linear model with no random effects. Alternatives to GLS include linear mixed models or Bayesian hierarchical models, however GLS does not allow subjects/groups to have unique trajectories/slopes.

Instead of the normal OLS definition \(\epsilon \sim N(0, \sigma^2\mathbf{I})\), consider \(\epsilon \sim N(0, \Sigma)\) where

\[ \begin{align} V[\epsilon] &= \Sigma \\ &= \sigma^2 \mathbf{P}_{n \times n} \end{align}\]

If \(\sigma^2\) is a vector, or the diagonal of \(\mathbf{P}\) is a vector of weights, then we can allow for unequal variance between the observations within-group errors. If the off-diagonal elements \(\rho\) are non-zero, then we can allow for correlated observations. Note that GLS specifically addresses dependency in errors (correlations are non-zero in the off-diagonal), while weighted least squares (WLS), a special case of GLS, can handle independent but not identically distributed errors (correlations are all 0 in the off-diagonal).

For example, if \(\mathbf{P}\) represented the correlation between two errors for any given lag, then it may have form

\[ \mathbf{P} = \begin{bmatrix} 1 & \rho_1 & \rho_2 & \cdots & \rho_{n-1} \\ \rho_1 & 1 & \rho_1 & \cdots & \rho_{n-2} \\ \rho_2 & \rho_1 & 1 & \cdots & \rho_{n-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho_{n-1} & \rho_{n-2} & \rho_{n-3} & \cdots & 1 \end{bmatrix} \]

This structure leads to the OLS estimator as no longer optimal. Minimization of the objective function through solving the normal equations leads to \(\hat{\beta}_{GLS} = (X^{\prime}\Sigma^{-1}X)^{-1}X\Sigma^{-1}Y\). It follows that

\[ \begin{align} E[\beta \mid X] &= \beta \implies \text{unbiased} \\ V[\beta \mid X] &= (X^{\prime}\Sigma^{-1}X)^{-1} \end{align} \]

By letting \(\Gamma = \sqrt{\Sigma^{-1}}\), we can transform the equation to

\[ \begin{align} \beta_{GLS} &= (X^{\prime}\Gamma^{\prime}\Gamma X)^{-1}X^{\prime}\Gamma^{\prime}\Gamma Y \\ &= (X^{*\prime}X^*)^{-1}X^{*\prime}Y^* \end{align} \]

Since \(X^* \equiv \Gamma X\) and \(y^* \equiv \Gamma Y\), the GLS estimator is the OLS estimator for \(y^* = X^*\beta+\epsilon^*\), \(\epsilon^* \sim N(0, \sigma^2I)\), which is just a linear transformation of \(y\) and \(X\). However, this result depends on \(\mathbf{P}\) being known (the correlation and relative variance between the errors are known, only the absolute scale of the variation \(\sigma^2\) is unknown), while in practice \(\mathbf{P}\) is always unknown. Thus, OLS may be used to estimate the regression coefficients, but maximum likelihood estimation (ML) or restricted maximum likelihood estimation (REML) are required for estimating variance parameters (which, after estimating, can be used to update the regression coefficients).

ML maximizes the likelihood with respect to all parameters in a model simultaneously (the regression coefficients and the variance parameters). ML estimates are biased for small sample sizes, where variance parameters are expected to be biased downwards. REML integrates out the regression coefficients and estimates only the variance parameters. When the sample size (or number of clusters) is small, REML tends to be less biased than ML. This is because it takes into account the loss of degrees of freedom resulting from estimating the regression coefficients. However, models with different fixed-effects that are estimated using REML cannot be compared via likelihood ratio tests or information criteria.

Some common variance structures for non-constant variance:

  • varFixed
    • When the variance of the residuals is linearly related to a continuous covariate (\(x\)).
    • \(\Sigma = \sigma^2x\)
  • varIdent
    • When the variance of the residuals is different for each level (\(k = 1, \dots, K\)) of a categorical covariate(s) and constant within those levels.
    • \(\Sigma = \sigma^2_k\mathbf{I}\)
  • varPower
    • When the variance of the residuals is related to the power of a continuous covariate \(x\).
    • \(\Sigma = \sigma^2 |x|^{2\delta}\)
  • varExp
    • When the variance of the residuals is related to the exponential of a continuous covariate \(x\).
    • \(\Sigma = \sigma^2 e^{{2\delta}x}\)
  • varComb
    • When the variance of the residuals is related to a combination of variance functions. generated by multiplying together the corresponding variance functions.

Common variance structures for correlated observations:

  • Serial correlation
    • corCompSymm
      • Compound symmetry
      • Simplest serial correlation structure. It assumes equal correlation among all within-group errors pertaining to the same group. Its useful for short time series per group, or when all observations within a group are collected at the same time, as in split-plot experiments.
    • corSymm
      • General
      • The most complex correlation structure. Each correlation is represented by a different parameter. When there are a small number of observations per group, the general correlation structure is useful for determining a more parsimonious correlation model. Must be a positive-definite correlation matrix.
    • corAR1
    • corCAR1
      • Continuous time AR(1) structure.
    • corARMA
  • Spatial correlation
    • Semivariogram
    • Isotropic Variogram
      • The correlation between responses within subject at two times depends only on a measure of the distance between the two times and not the times themselves.
      • corExp
      • corGaus
      • corLin
      • corRatio
      • corSpher


Generalized linear models (GLM)

A model for outcomes numeric, ordinal, or categorical in nature. Take the concepts described for OLS and generalize them for \(Y\), where \(Y\) belongs to the (mostly) exponential family. OLS is a special case of GLM.

A distribution belongs to the exponential family if it has form \(f(x | \eta) = h(x)g(\eta)e^{\eta T(x)}\) where \(\eta\) is known as the natural parameter and \(T\) is the sufficient statistic. Note that \(\eta\) and \(T\) are vectors, thus may be length 2 for two-parameter distributions (such as normal) or length \(k\) for multinomial distributions with \(k\) categories/outcomes. For distributions in the exponential family, the conditional variance of \(Y\) is a function of its mean \(\mu\) and/or a dispersion parameter \(\phi\). The normal distribution has constant, unknown variance \(\sigma^2 = \phi\) while binomial and Poisson families have constant known \(\phi = 1\).

The linear predictor is related to \(E[Y \mid X]\) through the link function \(g\).

\[ \begin{aligned} E[Y \mid X] &= \mu \\ g(E[Y \mid X]) &= \mathbf{X\beta} \\ &= \mathbf{\eta} \end{aligned} \]

The link function \(g\) is a monotonic, invertible, and differentiable function. The link function \(g\) that transforms \(\mu\) to the natural parameter is known as the canonical link. The canonical link is preferred for GLMs since it results in concave log-likelihood functions, simple sufficient statistics, and simple likelihood equations.

Examples of canonical link functions:

Distribution Canonical Link Range of Y \(V(Y \mid \eta)\)
Gaussian Identity \((-\infty, \infty)\) \(\phi\)
Binomial Logit \(\frac{0, 1, \dots, n}{n}\) \(\frac{\mu(1-\mu)}{\eta}\)
Poisson Log \(0, 1, 2, \dots\) \(\mu\)
Gamma Inverse \((0, \infty)\) \(\phi \mu^2\)
Inverse-Gamma Inverse-square \((0, \infty)\) \(\phi \mu^3\)

Note that binary data are a subset of binomial data, where all outcomes represent \(\eta = 1\) trial so that \(Y = 0\) or \(Y = 1\) and \(E(Y \mid X) = \mu\) is the probability of observing \(Y = 1\).

Examples of common link functions:

Link \(\eta = g(\mu)\) \(\mu = g^{-1}(\eta)\)
Identity \(\mu\) \(\eta\)
Log \(ln(\mu)\) \(e^{\eta}\)
Inverse \(\mu^{-1}\) \(\eta^{-1}\)
Inverse-square \(\mu^{-2}\) \(\eta^{-1/2}\)
Square-root \(\sqrt{\mu}\) \(\eta^2\)
Logit \(ln\left( \frac{\mu}{1-\mu} \right)\) \(\frac{1}{1+e^{-\eta}}\)
Probit \(\Phi^{-1}(\mu)\) \(\Phi(\eta)\)
Log-log \(-ln[-ln(\mu)]\) \(exp[-exp(-\eta)]\)
Complementary log-log \(ln[-ln(1-\mu)]\) \(1-exp[-exp(\eta)]\)

Model parameters (both the regression coefficients and asymptotic standard errors of the coefficients) are estimated using maximum likelihood (ML). The basic Wald statistic is \(Z = \frac{\hat{\beta} - \beta_0}{SE(\hat{\beta})}\). For distributions with unknown parameter \(\phi\), the method of moments (MoM) is commonly used as its estimator, although ML could in principle be used as well.

The log-likelihood is found to be

\[ ln(L(\mathbf{\theta}, \phi;\mathbf{y})) = \sum_{i = 1}^{n} \frac{Y_i\theta_i - b(\theta_i)}{a_i(\phi)} + c(Y, \phi) \]

where \(a(\cdot)\), \(b(\cdot)\), and \(c(\cdot)\) are known functions depending on the exponential family, \(\theta = g(\mu)\) is the canonical parameter where the canonical link function \(g(\cdot)\) does not depend on \(\phi\), and \(\phi > 0\) is a dispersion parameter.

The estimating equations for the regression parameters are found by differentiating the log-likelihood with respect to each coefficient using the chain rule. Setting this sum to zero results in the maximum-likelihood estimating equations. Iterative weighted least squares (IWLS, otherwise known as iterative reweighted least squares IRWLS or IRLS) is used to estimate the MLE of the coefficients. The IWLS steps are

  1. Start with initial estimates for \(\hat{\mu}\) and \(\hat{\eta} = g(\hat{\mu})\). Usually just set \(\hat{\mu}_i = Y_i\).
  2. At each iteration, compute the working response \(Z\) and the working weights \(W\) using the chosen values of \(\hat{\mu}\) and \(\hat{\eta} = g(\hat{\mu})\)
  3. Fit a weighted least squares regression of \(Z\) on \(X\) using \(W\) as weights.
  4. Repeat steps 2 and 3 until convergence to the MLE of the coefficients.


Linear mixed models (LMM)

A model for numeric outcomes. Take the concepts described for OLS and generalize them for data resulting from hierarchical structures (lower-level units are nested within higher-level units) or longitudinal structures (measurement are taken over time) which implies that errors may be correlated and/or have unequal variance. Alternatives to LMM include GLS or Bayesian hierarchical models.

The model formula in Laird-Ware form is (Laird and Ware 1982)

\[ \begin{align} Y_{ij} &= \beta_0 + \beta_1X_{1ij} + \beta_2X_{2ij} + \dots + \beta_pX_{pij} + \delta_{1i}Z_{1ij} + \cdots + \delta_{qi}Z_{qij} + \epsilon_{ij} \\ \delta_{ki} &\sim N(0, \Psi^2_k) \\ \epsilon_{ij} &\sim N(0, \sigma^2_{\epsilon}\lambda_{ij}) \end{align}\]


  • \(Y_{ij}\) is the value of the outcome variable for observation \(j = 1, \dots, n\) in cluster \(i = 1, \dots, m\).
  • The \(\beta\)s are fixed-effect coefficients for fixed parameters \(g = 1, \dots, p\) and are constant within clusters.
  • The \(X\)s are fixed-effect regressors.
  • The \(\delta\)s are random-effect coefficients for random variables \(k = 1, \dots, q\). By definition the random effects are normally distributed and vary by group (they are independent between groups). They are ‘random’ because if the study were to be repeated, different clusters would be sampled, resulting in different random effect estimates. We are not interested in the effect of a particular grouping level, but may be interested in the variability between groups.
  • \(\delta_{ki}, \delta_{k^{\prime}i^{\prime}}\) are independent for \(i \neq i^{\prime}\)
  • The \(Z\)s are random-effect regressors. usually a subset of the \(X\)s.
  • \[ \begin{align} E[Y \mid X, \delta] &= \mu = X\beta+Z\delta \\ V[Y \mid X,\delta] &= \sigma^2_{\epsilon}\lambda_{ij} \end{align}\]
  • The \(\epsilon\)s are errors for observations within clusters.
  • \(\epsilon_{ij}, \epsilon_{i^{\prime}j^{\prime}}\) are independent for \(i \neq i^{\prime}\) and independent of the random effects.
  • \(\delta_{ki}, \epsilon_{i^{\prime}j}\) are independent for all \(i, i^{\prime}, k, j\)
  • \(\Psi^2_k\) is the variance and \(\Psi_{kk^{\prime}}\) the covariance of the random effects. These are assumed to be constant across groups.
  • \(\sigma^2\lambda_{ijj^{\prime}}\) are the covariances among the errors in cluster \(i\).
  • The \(\Psi\)s and \(\lambda\)s define the dependencies among the random effects and errors within clusters, respectively.

In matrix form, we have

\[ \begin{align} \mathbf{y}_{i} &= \mathbf{X}_i \beta + \mathbf{Z}_i \delta_{i} + \epsilon_{i} \\ \delta_{i} &\sim N_q(0, \Psi) \\ \epsilon_{i} &\sim N_{n_i}(0, \sigma^2_{\epsilon}\Lambda_{i}) \end{align}\]


  • \(\mathbf{y}_{i}\) is a \(n_i \times 1\) vector for observations in the \(i\)th of \(m\) groups.
  • \(\mathbf{\mathbf{X}}_{i}\) is a \(n_i \times p\) model matrix for the fixed-effects in group \(i\).
  • \(\mathbf{\beta}\) is a \(p \times 1\) vector.
  • \(\mathbf{\mathbf{Z}}_{i}\) is a \(n_i \times q\) model matrix for the random-effects in group \(i\).
  • \(\mathbf{\delta}_{i}\) is a \(q \times 1\) vector.
  • \(\mathbf{\epsilon}_{i}\) is a \(n_i \times 1\) vector.
  • \(\mathbf{\Psi}_{i}\) is a \(q \times q\) covariance matrix for the random-effects.
  • \(\mathbf{\sigma^2}_{\epsilon}\Lambda_i\) is a \(n_i \times n_i\) covariance matrix for the errors in group \(i\). If the within-group errors are constant and are independent, then \(\Lambda_i = \mathbf{I}_{n_i}\).

where the above can be simplified further by defining \(n = \sum_{i=1}^{m}n_i\). Alternatively, the model structure can be described in a hierarchy of simpler models. When this model formula structure is used, they are often referred to as hierarchical linear models (HLM) or multilevel linear models (MLM). These alternative forms are identical to the Laird-Ware form which is common under the LMM umbrella.

The algorithms used for optimization of the likelihood function are most commonly Newton-Raphson, followed by expectation-maximization (EM) (or expectation conditional maximization either ECME), or Fisher scoring. These algorithms may be used for maximum likelihood (ML) or restricted maximum likelihood (REML) estimation.

Optimization algorithms may fail because

  • Iterations reach a singular gradient
  • The maximum number of iterations is reached without convergence
  • Variance estimates are not all positive or may be exactly 1
  • Covariance matrices are not positive definite

If models differ in fixed-effects, then ML must be used for model comparison. If models differ only in random-effects, then REML may be used for model comparison. Generally parameter estimates of the final model should be fit using REML.

The landscape in R looks like the following:

Feature nlme lme4 glmmTMB
REML Yes Yes No
unequal variance Yes No Yes
correlated errors Yes No Yes
crossed random effects No Yes Yes

In general, you should have a relatively large number of groups/clusters (required for robust estimate of between group variation) with a small or medium number of observations within groups (enough to estimate group specific means and/or slopes).


  • Pinheiro and Bates (2000)
  • Fox (2016)
  • Singmann and Kellen (2019)
  • Harrison et al. (2018)

Generalized linear mixed models (GLMM)

A model for outcomes numeric, ordinal, or categorical in nature. Take the concepts described for OLS and LMM and generalize them for \(Y\), where \(Y\) belongs to the (mostly) exponential family and both fixed and random-effects are allowed in the linear predictor.

For \(\mu_{ij} = E[Y_{ij} | \mathbf{u}_i]\), the link function is defined as \(g(\mu_{ij}) = \mathbf{X}_{ij}\beta + \mathbf{Z}_{ij}\mathbf{\delta}_i\) for observation \(j\) in cluster \(i\).


Generalized additive models (GAM)

A model for outcomes numeric, ordinal, or categorical in nature. Take the concepts described for OLS and GLM and generalize them for flexible prediction of \(Y\) on the basis of multiple predictors \(X\). Alternatives to GAM include spline transformations in OLS or GLM models.

The model formula may be written as

\[ Y = \beta_0 + \sum_{j=1}^{p} s_j(X_{j}) + \epsilon \]

where \(s_j(\cdot)\) is a smooth (nonlinear) function of predictor \(j\) and \(E[Y \mid X] = \beta_0 + \sum_{j=1}^{p} s_j(X_{j})\). This is referred to as an additive model since separate \(s\) are calculated for each \(X\) and then added together. Just as in OLS, interactions between predictors can be added as \(X_jX_k\).

The objective function for GAM is similar to OLS or GLM with the addition of a penalty term for the amount of smoothness in the fits. For example, if fitting piecewise linear regression we have

\[ \begin{align} \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_jx_{ij} \right)^2 + \lambda \sum_{j=2}^{k-1} (s(x^*_{j-1}) - 2s(x^*_{j}) + s(x^*_{j+1}))^2 \end{align} \]

where \(x^*_j\) are the \(j=1, \dots, k\) knots. Since natural cubic splines (restricted cubic splines) are almost always a good default choice, the resulting objective function becomes

\[ \begin{align} \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p s_j(x_{ij}) \right)^2 + \sum_{j=1}^{p} \lambda_j \int s_j^{\prime \prime} (t_j)^2 dt_j \end{align} \]

which is the general case for knots at each unique value of \(X\). In practice, an iterative procedure known as ‘backfitting’ is used in conjunction with a method for likelihood maximization (Newton-Raphson or iteratively weighted least squares (IWLS)) to generate an optimal level of smoothness. The tuning parameter \(\lambda_j \geq 0\) controls the amount of smoothness. \(\lambda = 0\) results in no penalization and \(\lambda \to \infty\) produces a straight line.


  • Wood (2017)
  • Hastie, Tibshirani, and Friedman (2009)

Generalized estimating equations (GEE)

A model for outcomes numeric, ordinal, or categorical in nature. Take the concepts described for GLS and GLM and generalize them for \(Y\), where \(Y\) belongs to the exponential or even non-exponential families and where errors may have non-constant variance and/or are not all independent.

GEE model the marginal distribution of clustered observations by treating the joint dependence structure as a nuisance. This results in population-averaged interpretations of resulting estimates. This is in contrast to mixed-effect models which produce cluster-specific estimates that have conditional interpretations.

GEE does not specify a complete joint distribution, thus it does not use a likelihood and likelihood-based fitting methods. Likelihood-based methods cannot be used for assessment of model fit, so quasi-likelihood or Wald statistics based on asymptotic normality and the estimated covariance matrix are required. This makes GEE computationally simple relative to other ML based models (GLS, GLM) and places no distributional assumptions on the coefficients. Model estimates are consistent even if we choose the incorrect with-group covariance structure (as long as the model for the mean is correct) since the ‘sandwich’ estimator of \(Cov(\hat{\beta})\) is used.


  • Agresti (2013)
  • Fitzmaurice, Laird, and Ware (2011)

Time-to-event models (Survival)

There are many different ways to fit time-to-event models, so I’ll focus on just a few common methods here.

Take the concepts described for GLM and extend it for \(Y\), where \(Y\) is (roughly) defined as time until some event with possible censoring. For example, a log-link gamma GLM and Weibull survival model produce similar estimates in the absence of censoring.

The outcome is defined by observed time \(Y = \min(T, C)\) for event time \(T\), censoring time \(C\), and event indicator \(\delta = \begin{cases} 0, \text{if censored} \\ 1, \text{if event}\end{cases}\).

Censoring occurs when we only have partial information on a subjects outcome. Right-censoring is defined as an event occurring after time of recording (so the total time to event is greater than the observed time). Right-censoring may be a result from the study ending before the subject had a chance to experience the event, or when the subject is lost to follow-up during the study. Left-censoring is defined as an event occurring before the time of recording (so the actual time to event is less than the time to recording). Interval censoring is defined as an event occurring between two known time points, but the exact time-to-event is unknown.

There are three levels of assumptions for censoring:

  • Random censoring

    Subjects censored at time \(t\) should be representative of all subject who remained at risk at time \(t\).

  • Independent censoring

    Subjects in subgroup \(g\) censored at time \(t\) should be representative of all subjects in subgroup \(g\) who remained at risk at time \(t\). i.e. random censoring conditional on each covariate.

  • Non-informative censoring

    When the distribution of survival times \(T\) provides no information about the distribution of censoring times (C). Non-informative censoring is often justifiable when censoring is random or independent.

    In general, think of non-informative censoring as censoring due to reasons unrelated to the study. If censoring is caused by poor treatment leading to drop-out, total cures leading to drop-out, etc, then you will have informative censoring. Administrative censoring is usually non-informative because the study ends at the same time for everyone (as long as the study started at the same time for everyone).

If censoring is informative, then you can use IPTW to weight similar subjects with non-missing information or use multiple imputation. Otherwise, you may need to abandon the model.

The survival function is the probability that a subject survives longer than some specific time \(t\); \(S(t) = P(T > t)\). The theoretical properties assume that \(S(t)\) is monotonic decreasing with \(S(0) = 1\) and \(S(\infty) = 0\). In practice we may estimate \(S(t)\) as a step function (See the Kaplan-Meier estimate).

The hazard function is the risk of experiencing an event given that you have survived up to time \(t\). In notation:

\[h(t) = {\lim \atop \Delta t \to 0} \frac{P(t \leq T < t + \Delta t \mid T \geq t)}{\Delta t}\]

In other words, the instantaneous risk of having an event at time \(t\) given the subject has survived up to \(t\) as the time interval approaches 0 (\(\lim_{\Delta t \to 0}\)). As an analogy, think of this as the speed you are travelling at a given time point, but what you really want (and don’t know yet) is the total distance covered over the total time of travel. We only assume \(h(t) \geq 0\), so the curve across time may increase or decrease.

The survival and hazard functions are related:

\[ \begin{align} S(t) &= e^{- \int_0^t h(u) du}\\ h(t) &= - \left( \frac{dS(t) / dt}{S(t)} \right) \end{align} \]


  • Kleinbaum and Klein (2012)

Cox proportional hazards models

The Cox proportional hazards model is

\[ h(t \mid X) = h_0(t) exp(\beta_1X_1 + \cdots + \beta_pX_p) \]

where \(h_0(t)\) is the unspecified underlying baseline hazard applicable to all subjects and does not depend on \(X\). Using the method of partial likelihood developed by Cox, the regression coefficients can be estimated without estimation of the baseline hazard. However, after fitting the model, the baseline hazard can be extracted.

It’s possible to let the baseline hazard vary by a selected strata. The variable used for this strata has no associated regression coefficient. This may be advantageous because it frees you from specifying the corresponding interaction model. The strata variable should only take on a few distinct values (with sufficient observations in each level) and the effect estimates should not be of direct interest.

The assumptions for Cox proportional hazards regression are

  1. Validity

    The data collected maps correctly to the questions of interest. The response variable reflects the phenomenon of interest (time to event or censoring) and all relevant predictors are included.

  2. Linearity

    The log hazard (or log cumulative hazard) of the outcome is linearly related to each predictor variable.

  3. No Multicolinearity

    The predictor variables are not a linear combination of each other.

  4. Independence of observations

  5. No event time bias

    Each predictor variable is measured at or before baseline event time. When predictors, which are not a constant property of the observation, are measured after start of event time, then the observation had to remain event free until the time they were measured on that variable. This potential bias is known as event time bias (survival time bias or immortal time bias).

  6. No informative censoring bias

    This is generally defined by independence of censoring. subjects in subgroup \(g\) censored at time \(t\) should be representative of all subjects in subgroup \(g\) who remained at risk at time \(t\). i.e. random censoring conditional on each covariate.

  7. Proportional hazards

    There are no time by predictor interactions. The predictors have the same effect on the hazard function at all time points. The treatment effect should have the same relative difference in hazard compared to the control effect at all time points (their curves should not cross).

Since the Cox model makes no assumption about the distribution of the hazard function, but does have parametric assumptions on the effects of the predictors on the hazard, we call is a semiparametric model.

Weibull survival models

The Weibull survival model is a parametric model since the hazard function is assumed to be from a Weibull distribution.

Counting process models

  • Andersen-Gill
  • Prentice-Williams-Peterson
  • Wei-Lin-Weissfeld

Frailty models

Accelerated failure time models

Landmark method

If a patient who dies early has a different predictor variable value than if they had died later, then this may bias the model estimates. The landmark method is used to reduce this event time bias (survival time bias or immortal time bias).

Competing event models

  • Fine-Gray cumulative incidence
  • Aalen-Johansen cumulative incidence
  • Multistate model

Multivariate/Joint models

Tree-based models

Classification/Regression trees are models that relate a categorical/numeric response to their predictors by recursive binary splits. Elith, Leathwick, and Hastie (2008) described how tree-based models inherently model complex nonlinear relationships and interactions between predictors through subsequent recursive splits in the tree.

Classification and regression tree (CART)

CART models are one of the most popular methods for building trees. Alternative methods include C4.5 or C5.0:

Regression trees

The regression tree model is defined as

\[ \hat{y} = \sum_{m = 1}^M c_mI(X \in R_{m}) \]

where \(m = 1, \dots, M\) is the number of partition regions, \(c_m\) is a constant for the predicted value of \(Y\) in region \(R_m\), and \(R_m\) is a particular region in the covariate space. If we wish to minimize the sum of squared errors \(\sum(y_i - \hat{y}_i)^2\), then \(\hat{c}_m = \text{mean}(y_i \mid x_i \in R_m)\) is the best estimate. The location of the binary split is found through a top-down, greedy approach known as recursive binary splitting. For each predictor \(X_1, \dots, X_p\) and cutpoint \(s\) for each predictor, choose the predictor and cutpoint which minimizes error the greatest. So we wish to minimize

\[ \sum_{x_i \in R_1(j, s)} (y_i - \hat{y}_{R_1})^2 + \sum_{x_i \in R_2(j, s)} (y_i - \hat{y}_{R_2})^2 \]

This process is then repeated, searching for the best predictor and cutpoint which minimizes error within the two previously identified regions, resulting in three regions. The stopping criterion may be to repeat this process until no region has more than 10 observations, for example.

The algorithm can be defined as

  1. Use recursive binary splitting to grow a large tree on the training data, stopping only when each terminal node has fewer than some minimum number of observations.
  2. Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of \(\alpha\).
  3. Use K-fold cross-validation to choose \(\alpha\). That is, divide the training observations into \(K\) folds. For each \(k = 1, \dots, K\):
    1. Repeat Steps 1 and 2 on all but the kth fold of the training data.
    2. Evaluate the mean squared prediction error on the data in the left-out kth fold, as a function of \(\alpha\).
    Average the results for each value of \(\alpha\), and pick \(\alpha\) to minimize the average error.
  4. Return the subtree from Step 2 that corresponds to the chosen value of \(\alpha\).

The tuning parameter \(\alpha\) controls the trade-off between the subtree’s complexity and its fit to the training data. When \(\alpha = 0\), then the subtree will be at its most complex. As \(\alpha\) increases from zero, branches get pruned from the tree.


  • Hastie, Tibshirani, and Friedman (2009)
  • James et al. (2013)

Classification trees

The classification tree is similar to the regression tree, except used for categorical outcomes instead of continuous. The classification tree model is defined as

\[ \hat{p}_{mk} = \frac{1}{N_m} \sum_{x \in R_m} I(y_i = k) \]

for node \(m\) in region \(R_m\) with \(N_m\) observations. There are three common objective functions which we could optimize

\[ \begin{align} \text{Misclassification error:}& \frac{1}{N_m} \sum_{i \in R_m} I(y_i \neq k(m)) = 1 - \hat{p}_{mk(m)} \\ \text{Gini index:}& \sum_{k \neq k^{\prime}} \hat{p}_{mk} \hat{p}_{mk^{\prime}} = \sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk}) \\ \text{Cross-entropy:}& - \sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk} \end{align} \]

where \(\hat{p}_{mk}\) is the proportion of training observations in the \(m\)th region that are from the \(k\)th class. All three are suitable for pruning of classification trees, however, the classification error is preferable for optimizing the prediction accuracy of the final pruned tree.


  • Hastie, Tibshirani, and Friedman (2009)
  • James et al. (2013)

Random forests

Random forests is a technique that falls under the ‘averaging’ or ‘bagging’ class of ensemble-based models. Random forests combines the estimators of many fully grown decision trees (high variance) and averages them to produce an estimator with lower variance.

  1. For \(b = 1, 2, \dots, B\), where \(B\) is the desired number of trees
    1. Generate a bootstrapped dataset (with replacement) from the original data.
    2. Grow the tree \((T_b)\) to the largest extent possible for the chosen stopping criteria and do not prune. Follow these steps for each terminal node of the tree:
      1. Randomly sample \(p\) predictor variables from the \(P\) predictors available (\(p < P\)).
      2. Select the best variable/split-point among the \(p\).
      3. Split the node into two daughter nodes.
  2. Output the ensemble of trees \(\{T_b\}^B\).

Model predictions are either the most common predicted class of the \(B\) trees (classification) or the mean prediction (sum divided \(B\); regression). Since the

Since each tree is identically distributed, the low expected bias of any one tree is kept as a property for the ensemble of trees. However, the ensemble of trees has reduced variance compared to a single tree.

Tuning parameters of interest:

  1. The number of trees \(B\). We want enough to produce stable estimates without using too many for efficiency sake.
  2. The number of predictor variables to sample for each node of a tree.
    • For classification, a default recommendation is \(\lfloor \sqrt{P} \rfloor\).
    • For regression, a default recommendation is \(\lfloor \frac{P}{3} \rfloor\).
  3. Tree complexity.
    • For classification, minimum node size as 1.
    • For regression, a minimum node size as 3.


  • Hastie, Tibshirani, and Friedman (2009)
  • James et al. (2013)

Gradient boosted trees

Gradient boosted trees is a technique that falls under the ‘boosting’ class of ensemble models. Boosted regression trees combine many shallow decision trees (high bias) through a sequential algorithm to provide better predictive performance.

  1. Set \(\hat{f}(x) = 0\) and \(r_i = y_i\) for all \(i\) in the training set.
  2. For \(b = 1, 2, \dots, B\)
    1. Fit a tree \(\hat{f}^b\) with \(d\) splits (\(d+1\) terminal nodes) to the training data \((X, r)\).
    2. Update \(\hat{f}\) by adding in a shrunken version of the new tree \(\hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^b(x)\).
    3. Update the residuals \(r_i \leftarrow r_i - \lambda \hat{f}^b(x_i)\).
  3. Output the boosted model \(\hat{f}(x) = \sum_{b=1}^B \lambda \hat{f}^b(x)\)

The base learner is fit to the pseudo-residuals at each iteration and the goal is to sequentially reduce the model error.

Tuning parameters of interest:

  1. The number of trees \(B\). The model is at risk of overfitting if \(B\) is too large. Cross-validation can be used to select B.
  2. The shrinkage parameter \(\lambda \in \mathbb{R}^+\) which controls the learning rate. Usually this is set around 0.001 to 0.01. A small \(\lambda\) can require a large \(B\) to achieve good performance.
  3. Tree complexity.
    • The number of splits in each tree \(d\). This controls the complexity of the ensemble. If \(d=1\), each tree is a stump (single split) which is equivalent to an additive model since each term is a single variable. Higher values can involve \(d\) variables, thus it can be said to control the interaction depth of the model.


  • Hastie, Tibshirani, and Friedman (2009)
  • James et al. (2013)

Support vector machines (SVM)

A model used to predict categorical or quantitative outcomes.

K-nearest neighbors (KNN)

A method used to predict categorical or quantitative outcomes. KNN is a nonparametric, model free method.

Neural networks

A method used to predict categorical or quantitative outcomes. Like many other methods presented here, the term neural network is overloaded. I’ll focus on the the most basic example, the single layer perceptron.


Agresti, Alan. 2013. Categorical Data Analysis. 3rd ed. Wiley Series in Probability and Statistics 792. Hoboken, NJ: Wiley.
Elith, J., J. R. Leathwick, and T. Hastie. 2008. “A Working Guide to Boosted Regression Trees.” Journal of Animal Ecology 77 (4): 802–13.
Faraway, Julian James. 2016. Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models. Second edition. Chapman & Hall/CRC Texts in Statistical Science Series. Boca Raton: CRC Press, Taylor & Francis Group.
Fitzmaurice, Garrett M., Nan M. Laird, and James H. Ware. 2011. Applied Longitudinal Analysis. Wiley Series in Probability and Statistics. Hoboken, NJ, USA: John Wiley & Sons, Inc.
Fox, John. 2016. Applied Regression Analysis and Generalized Linear Models. Third Edition. SAGE.
Harrison, Xavier A., Lynda Donaldson, Maria Eugenia Correa-Cano, Julian Evans, David N. Fisher, Cecily E. D. Goodwin, Beth S. Robinson, David J. Hodgson, and Richard Inger. 2018. “A Brief Introduction to Mixed Effects Modelling and Multi-Model Inference in Ecology.” PeerJ 6 (May): e4794.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning. Springer Series in Statistics. New York, NY: Springer New York.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani, eds. 2013. An Introduction to Statistical Learning: With Applications in R. Springer Texts in Statistics 103. New York: Springer.
Kleinbaum, David G., and Mitchel Klein. 2012. Survival Analysis: A Self-Learning Text. Statistics for Biology and Health. New York, NY: Springer New York.
Laird, Nan M., and James H. Ware. 1982. “Random-Effects Models for Longitudinal Data.” Biometrics 38 (4): 963.
Pinheiro, José C., and Douglas M. Bates. 2000. Mixed-Effects Models in S and S-PLUS. Statistics and Computing. Springer.
Singmann, Henrik, and David Kellen. 2019. “An Introduction to Mixed Models for Experimental Psychology.”
Wood, Simon N. 2017. Generalized Additive Models: An Introduction with R. Second edition. Chapman & Hall/CRC Texts in Statistical Science. Boca Raton: CRC Press/Taylor & Francis Group.
Published: 2022-02-17