# Q&A

Concise questions and answers for a variety of topics. Things you should probably know off the top of your head. Unfortunately I’ve lost references for most, though generally these are short summaries taken from texts, papers, stackexchange, or other notes.

## Basic Stats

- What is statistics?
Statistics is the scientific process used for data collection, organization, analysis, and interpretation.

- Explain the law of large numbers.
- The law of large numbers shows that the sample mean converges to the population
mean \(\mu\) as the sample size increases.
- This holds for iid observations from a distribution with finite expected value.

- Explain the central limit theorem.
- The central limit theorem shows that when you sample iid observations from
a distribution with finite variance, then the sampling distribution of the mean
approaches a normal distribution.
- \(\bar{x} \sim N\left( \mu, \left( \frac{\sigma}{\sqrt{n}} \right)^2 \right)\)
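
A quick simulation illustrating the CLT (the exponential distribution and the sample size here are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 1.0, 1.0, 50  # exponential(1) has mean 1 and sd 1, but is skewed

# Draw 10,000 repeated samples of size n and keep each sample's mean.
means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)

# The sampling distribution of the mean is centered at mu with sd sigma/sqrt(n),
# even though the underlying distribution is far from normal:
# means.mean() is close to 1.0 and means.std() is close to 1/sqrt(50) ~ 0.141.
```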

- What is the standard error?
- The standard error is the standard deviation of a statistic's sampling distribution.
- For a single mean: \(se = \frac{sd}{\sqrt{n}}\)
- Difference of means (equal variance): \(se = \sqrt{sd^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}\)
- Difference of means (unequal variance): \(se = \sqrt{\left( \frac{sd^2_1}{n_1} + \frac{sd^2_2}{n_2} \right)}\)
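
The single-mean formula in code (data values hypothetical):

```python
import numpy as np

x = np.array([4.1, 5.3, 3.8, 6.0, 5.2, 4.7])  # hypothetical sample

# Standard error of the mean: sample sd over sqrt(n)
se = x.std(ddof=1) / np.sqrt(len(x))  # ~0.333 for this sample
```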

- What is a p-value?
- A p-value is the probability of observing your sampled data, or data more extreme, given
the null hypothesis is true and the model used for inference is true. ~ \(P(data | \theta)\)
- Probability of how well the data fits with what was assumed under the null hypothesis.

- If p>0.05 should you fail to reject the null hypothesis?
Not necessarily. A large P value simply flags the data as being not unusual if all the assumptions used to compute it (including the test hypothesis) were correct. If there were violations of those assumptions then we would expect a higher chance of incorrectly concluding to not reject the null hypothesis (false negative).

- If p≤0.05 should you reject the null hypothesis?
Not necessarily. A small P value simply flags the data as being unusual if all the assumptions used to compute it (including the test hypothesis) were correct. If there were violations of those assumptions then we would expect a higher chance of incorrectly concluding to reject the null hypothesis (false positive).

- If p>0.05 does it show evidence for the null hypothesis?
No. A large P value often indicates only that the data are incapable of discriminating among many competing hypotheses (as would be seen immediately by examining the range of the confidence interval).

- If p≤0.05 does this mean the chance you’ve made a false positive conclusion is 5%?
No. If you reject the null hypothesis when it is actually true, then the chance you’ve made a false positive error is 100%. The 5% refers only to how often you would reject it, and therefore be in error, over very many uses of the test across different studies when the test hypothesis and all other assumptions used for the test are true. This statement doesn’t apply to a single test.

- Statistical significance is a property of the phenomenon being studied, and thus statistical tests detect significance.
This misinterpretation is promoted when researchers state that they have or have not found “evidence of” a statistically significant effect. The effect being tested either exists or does not exist. “Statistical significance” is a dichotomous description of a P value (that it is below the chosen cut-off) and thus is a property of a result of a study design and the statistical test; it is not a property of the effect or population being studied. In summary, significance is an attribute of the study, not of the effect.

- Explain the 95% confidence interval.
**Easy to understand**: The confidence interval contains all values of the population statistic that we could reasonably believe to be compatible with the sample data.

**Accurate**: If the data collection and analysis method are valid, and the experiment were repeated many times, then 95% of the resulting confidence intervals would include the population parameter.

- The observed 95% CI has a 95% chance of containing the true effect size.
No. The frequency that an observed interval contains the true effect is either 100% if the true effect is within the interval or 0% if not. The 95% refers only to the situation if you replicated the exact same study design an infinite number of times: 95% of those confidence intervals would contain the true effect size if all assumptions used to compute the intervals were correct.

- If two CIs overlap, the difference between two estimates or studies is not significant.
No. The 95% confidence intervals from two subgroups or studies may overlap substantially and yet the test for difference between them may still produce P < 0.05. Suppose for example, two 95% confidence intervals for means from normal populations with known variances are (1.04, 4.96) and (4.16, 19.84); these intervals overlap, yet the test of the hypothesis of no difference in effect across studies gives P = 0.03. As with P values, comparison between groups requires statistics that directly test and estimate the differences across groups. It can, however, be noted that if the two 95% confidence intervals fail to overlap, then when using the same assumptions used to compute the confidence intervals we will find P < 0.05 for the difference; and if one of the 95% intervals contains the point estimate from the other group or study, we will find P > 0.05 for the difference.

- What is type 1 error?
- The probability of rejecting a true null hypothesis. AKA false positive.
- \(\alpha\)

- What is type 2 error?
- The probability of failing to reject a false null hypothesis. AKA false negative.
- \(\beta\)

- When do you need to adjust for multiple comparisons?
When you test multiple hypotheses, want to control the otherwise increasing number of false positive results, and can tolerate an increased proportion of false negatives.

- What is the probability of a false positive among multiple independent hypothesis tests?
If n independent hypotheses are tested, then the probability that at least one of them will be found significant is \(1 - (1 - \alpha)^n\) when all the null hypotheses are true.
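
A quick numerical check of this formula, with the Bonferroni-adjusted version added for comparison:

```python
alpha, n = 0.05, 20

# Probability of at least one false positive across n independent true nulls:
fwer = 1 - (1 - alpha) ** n            # ~0.64

# Bonferroni correction (test each hypothesis at alpha / n) restores ~alpha:
fwer_bonf = 1 - (1 - alpha / n) ** n   # ~0.049
```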

When \(\alpha\)=0.05 and n=20, then \(P(\geq 1 \text{ false positive})=0.64\)

- What is power?
- The power of a test to detect a correct alternative hypothesis is the pre-study probability that the test will reject the test hypothesis (e.g., the probability that P will not exceed a pre-specified cut-off such as 0.05). The corresponding pre-study probability of failing to reject the test hypothesis when the alternative is correct is one minus the power, also known as the Type-II or beta error rate. As with P values and confidence intervals, this probability is defined over repetitions of the same study design and so is a frequency probability.
- Power is the probability of correctly rejecting the null hypothesis, given that the alternative hypothesis is true.

- What information is needed for power analysis and sample size calculation?
- Verify the question of interest and the variables needed for this question.
- Verify you are able to collect a representative sample for the question of interest.
- Check if there is previously published data for the question of interest.
- Determine the alpha level required.
- Determine the minimum relevant effect size.
- Determine the variability in the variable of interest.

Standardized effect sizes are nice as they remove the need to specify variance. Raw effect sizes are easier to visualize and interpret.
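
Once these inputs are chosen, power can also be estimated by simulation. A minimal sketch for a two-sample z-test (the effect size, per-group n, and alpha below are hypothetical choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, effect, reps = 64, 0.5, 4000  # per-group n, effect in sd units, simulations

rejections = 0
for _ in range(reps):
    a = rng.normal(0.0, 1.0, n)       # control group
    b = rng.normal(effect, 1.0, n)    # treatment group, shifted by the effect
    se = np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)
    z = (b.mean() - a.mean()) / se
    rejections += abs(z) > 1.96       # two-sided test at alpha = 0.05

power = rejections / reps  # roughly 0.8 for this configuration
```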

- Is it ok to calculate retrospective power?
Very rarely. Retrospective power analyses can be conducted by 4 methods:

- using both the observed effect size and variance
- Issues with method 1: Both the P value and power are dependent upon the observed effect size and so are inversely related such that tests with high P values tend to have low power and vice versa. Therefore calculating power using the observed effect size and variance is simply a way of re-stating the statistical significance of the test. In addition, this method will result in an over-estimate of the true power and the point estimates are often imprecise. The estimate of the observed statistical power is not meaningful because there is no way to be sure that the effect size estimate from the study is the true parameter.

- only the observed variance
- Benefits of method 2: Using observed variances but not observed effect sizes is helpful because it allows one to evaluate whether the sample size and alpha level were sufficient to have a good chance of detecting a biologically significant effect given the observed level of variation. You may report power over a range of effect sizes or determine the minimum effect size detectable for a given power.

- neither the observed effect size nor variance
- Issues with method 3: Using standardized effect sizes avoids the need to specify sampling variances, so we only use the sample size and alpha level from the study. The result will provide information to evaluate the study design, but makes it much harder to assess biological significance.

- avoided completely by computing confidence intervals about the observed effect size.
- Issues with method 4: This method is only relevant before the results of the hypothesis test are known.

Since the goal is often to simply quantify the uncertainty in the findings of a study, calculating confidence intervals is more appropriate than retrospective power. If the goal is to evaluate the ability of a study to detect a biologically meaningful pattern, then retrospective power might be useful.

Calculating power using pre-specified effect sizes (or calculating detectable effect size using pre-specified power) is helpful, especially if easily interpreted raw effect size measures are used. Standardized measures may be useful in more complex tests (such as tests for interaction in multi-way ANOVA) where it is hard to specify an intuitive raw measure of effect size. In these cases power analysis may be performed using conventional levels of effect size, such as those proposed by Cohen (1988).

All power calculations should be accompanied by a sensitivity analysis. For power calculations that use assumed values for the effect size or variance, this means using a range of plausible values for each variable. Graphs showing how two or more variables interact with one another are particularly valuable. For power calculations that use values estimated from sample data (such as sampling variance), a confidence interval about the power estimate should be given.

- What is Pearson’s correlation coefficient?
- \(\rho_{X, Y} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}\)
- \(\rho_{X, Y} = \frac{E[(X - \mu_{X})(Y - \mu_{Y})]}{\sigma_X \sigma_Y}\)
- \(\rho_{X, Y} = \frac{E(XY) - E(X)E(Y)}{\sqrt{E(X^2)-(E(X))^2}\sqrt{E(Y^2)-(E(Y))^2}}\)
- In words: the normalized covariance between two variables; a measure of how strongly the paired values follow a linear relationship. Ranges from -1 to 1.

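
A numerical illustration with hypothetical paired data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])  # roughly 2x with a little noise

# Pearson's r: the covariance normalized by the two standard deviations
r = np.cov(x, y)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))

# Agrees with numpy's built-in correlation; near 1 for this nearly linear data
assert np.isclose(r, np.corrcoef(x, y)[0, 1])
```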
- What is selection bias?
Selection bias occurs when the sampled data has characteristics that are not representative of the population you wish to make inferences about.

## Models

### Theory

- What is ordinary least squares (OLS)?
- Ordinary Least Squares (OLS) is an estimation method that corresponds to minimizing the sum of square differences between the observed and predicted values (SSE).
- \(\hat{\beta} = (X^TX)^{-1} X^T Y\)
- Under the assumption that the linear model residuals are normally distributed, OLS and MLE estimators are equivalent.
- If the model assumptions are met, then the OLS estimates are the Best Linear Unbiased Estimators (BLUE) (minimum variance and unbiased)
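
The closed-form estimator can be verified directly in numpy (the simulated data and true coefficients below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + 1 predictor
beta_true = np.array([2.0, 3.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)      # add normal noise

# beta_hat = (X^T X)^{-1} X^T y, computed via a linear solve for stability
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # close to [2.0, 3.0]
```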

- What is maximum likelihood estimation (MLE)?
- Maximum Likelihood Estimation (MLE) is an estimation method that corresponds to maximizing the likelihood function.
- You find the MLE estimates of the betas through an iterative algorithm:
- Specify the statistical process (linear model) that relates the response to the predictors.
- Choose starting values for the coefficients
- Find the residual and likelihood score
- Propose new coefficient values that will improve the likelihood.
- Repeat until there is no improvement in the likelihood

- In software, we often phrase MLE as minimizing a cost function. MLE thus becomes minimization of the negative log-likelihood (NLL)
- The expectation maximization (EM) algorithm is a common MLE estimation method.
- Under the assumption that the linear model residuals are normally distributed, OLS and MLE estimators are equivalent.
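
The iterative recipe above can be sketched with a crude propose-and-accept search on a one-parameter model (a toy illustration only; real software minimizes the NLL with Newton-type methods):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = 3.0 * x + rng.normal(scale=0.5, size=200)  # true slope is 3.0

# Negative log-likelihood of a normal linear model with sigma fixed at 1;
# up to additive constants this is half the sum of squared residuals.
def nll(b):
    return 0.5 * np.sum((y - b * x) ** 2)

b, step = 0.0, 1.0                     # starting value for the coefficient
for _ in range(200):
    for cand in (b - step, b + step):  # propose new coefficient values
        if nll(cand) < nll(b):         # keep a proposal if it improves the NLL
            b = cand
    step *= 0.9                        # shrink until there is no improvement

# b converges to the MLE (equal to the OLS slope here)
```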

- What is the likelihood function?
The likelihood is a measure of the extent to which a sample provides support
for each possible value of a parameter in a parametric model.

Given a probability density function \(f(x|\theta) = (\theta + 1)x^{\theta},\)
the likelihood, \(L(\theta|x)\), is defined as \(L(\theta|x) = \prod_{i = 1}^{N}(\theta + 1)x_{i}^{\theta},\)
and the corresponding log-likelihood, \(\Lambda(\theta|x)\), is defined as
\(\Lambda(\theta|x) = \sum_{i = 1}^{N} \text{log}(\theta + 1) + \theta \text{log}(x_{i}).\)

To find the maximum likelihood estimator (MLE), set the derivative of the
log-likelihood to zero:
\(0 = \frac{\partial\Lambda(\theta|x)}{\partial\theta} = \sum_{i = 1}^{N} \left( \frac{1}{\theta + 1} + \text{log}(x_{i}) \right) = \frac{N}{\theta + 1} + \sum_{i = 1}^{N}\text{log}(x_{i}),\)
which gives \(\hat{\theta} = -\frac{N}{\sum_{i = 1}^{N}\text{log}(x_{i})} - 1.\)

- What is the difference between a linear and nonlinear model?
A model is linear (or nonlinear) in its parameters. For example, \(y = \beta x^2 + \epsilon\) is linear in \(\beta\) but not in \(x\), while \(y = exp(\beta) x + \epsilon\) is nonlinear in \(\beta\) but linear in \(x\).

- What is the difference between linear and generalized linear models?
- Linear model:
- Y, conditional on X, is normally distributed with a mean of the predicted values and some variance.
- \(Y|X \sim N(\beta X, \sigma^2)\)
- The error term \(\epsilon \sim N(0, \sigma^2)\)
- \(E(Y|X) = \beta X\)
- Fit with OLS

- Generalized linear model:
- Allows us to fit response variables that are from any exponential family (conditioning on the predictors).
- \(E(Y|X) = g^{-1}(\beta X)\)
- \(g(E(Y|X)) = \beta X\)
- \(g\) is the link function. The link function determines how the expected value of the response relates to the linear predictor.
- It also requires a variance function to determine how \(Var(Y)\) depends on the mean. \(Var(Y) = V(E(Y)).\)
- Fit with MLE

- What is the bias-variance tradeoff?
- The variance refers to the amount the predictions (i.e. the model) would change if the model was built using a different dataset.
- Bias refers to the error that is introduced by approximating a real life problem.
- The bias and variance are components of the MSE.

Models that have high bias tend to have low variance (e.g. simple linear regression), while models with low bias tend to have high variance (e.g. complex, flexible models). We want a model that is complex enough to capture the true relationship between the explanatory variables and the response variable, but not so complex that it finds patterns that don't really exist.

- What is regularization?
Regularization is the process of constraining or penalizing a model's parameters in order to prevent overfitting. This is most often done by adding a penalty term to the loss function: a constant multiple of the L1 norm of the weight vector (lasso), the squared L2 norm (ridge), or a combination of the two (elastic net). The model is then fit by minimizing this penalized loss on the training set.

Regularization is synonymous with penalization. We regularize a model by penalizing estimates that would have otherwise violated some desirable behavior (e.g. sparsity)
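
A minimal sketch of L2 (ridge) penalization using its closed form (the data and penalty strength are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 50, 10
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [4.0, -2.0, 1.0]              # only 3 of 10 true coefficients nonzero
y = X @ beta + rng.normal(size=n)

lam = 5.0  # penalty strength, normally chosen by cross-validation
# Ridge minimizes ||y - Xb||^2 + lam * ||b||^2 and has a closed form:
b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
b_ols = np.linalg.solve(X.T @ X, X.T @ y)

# The penalty shrinks the coefficient vector toward zero relative to OLS.
shrunk = np.linalg.norm(b_ridge) < np.linalg.norm(b_ols)
```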

- What is sparsity?
A sparse statistical model is one in which only a relatively small number of parameters (or predictors) play an important role.

If p>N, and the true model is not sparse, then the number of samples, N, is too small to allow for accurate estimation of the parameters. But if the true model is sparse, so that only k<N parameters are actually nonzero in the true underlying model, then it turns out that we can estimate the parameters effectively using the lasso method. This is possible even though we are not told which k of the p parameters are actually nonzero.

Ref: Page 3, Statistical learning with sparsity by Hastie et al.

- Explain Bayesian inference.
Bayesian inference can be thought of in the following three steps

- Creating the full probability model
- A joint probability model for all observable and unobservable quantities in a problem.

- Conditioning on observed data
- Calculate the posterior distribution (the conditional probability distribution of the unobserved quantities of interest, given the observed data)

- Evaluating the fit of the model and the implications of the posterior distribution.
- How well does the model fit the data?
- Are the conclusions reasonable?
- How sensitive are the results to the assumptions made in step 1?

- Posterior density
- \(p(\theta|D) = \frac{p(D|\theta)p(\theta)}{p(D)}\)

- Unnormalized posterior density
- \(\text{Posterior} \propto \text{Likelihood} \times \text{Prior}\)
- \(p(\theta|D) \propto L(\theta|D) p(\theta)\)

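
A minimal conjugate example of conditioning on observed data (a Beta prior on a coin's heads probability; all numbers hypothetical):

```python
# Beta(a, b) prior; binomial likelihood; observe 7 heads in 10 flips.
a, b = 2.0, 2.0
heads, flips = 7, 10

# Conjugacy: the posterior is Beta(a + heads, b + tails), so the
# normalizing constant p(D) is never needed explicitly.
a_post = a + heads
b_post = b + (flips - heads)

posterior_mean = a_post / (a_post + b_post)  # 9 / 14
```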
- What’s the difference between likelihood ratio, Wald, and score tests?
The likelihood ratio test was developed by Jerzy Neyman and Egon Pearson (1928), the Wald test by Abraham Wald (1943), and the score test by C.R. Rao (1948) (Aitchison and Silvey (1958) developed the Lagrange Multiplier test which is equal to the score test).

The purpose of these tests is to compare two different (where one model is nested within the other) models to see which fits the data better. All three utilize the likelihood, though do so in different ways. The likelihood ratio test is the gold standard (most powerful), but the Wald and score tests may be computationally easier for certain model comparisons. The Wald and score tests generally perform poorly in small sample size scenarios. We assume the smaller model is true under the null hypothesis and that it is not the true model under the alternative.

- LRT: Fit both models and compare their likelihoods using \[\begin{align} LR &= -2log \left( \frac{L(model_1)}{L(model_2)}\right) \\ &= 2(logL(model_2) - logL(model_1)) \\ &\sim \chi^2_{df} \end{align}\] where \(model_1\) is the more restrictive model (fewer predictors; reduced), \(model_2\) the less restrictive model (more predictors; full), and the degrees of freedom for the test statistic is equal to the number of predictors included in \(model_2\) but not in \(model_1\). A small p-value indicates \(model_2\) is an improvement in fit.
- Wald: The Wald test only requires the larger model to be fit. It then tests how far the estimated coefficients are from zero (or whatever value is used in the null hypothesis) in (asymptotic) standard errors and can allow multiple coefficients to be tested simultaneously. Thus we simply need to test that the parameters not included in the reduced model are simultaneously equal to zero. The resulting test statistic is also a \(\chi^2\) with degrees of freedom equal to the number of predictors being tested. A small p-value indicates the larger model is an improvement in fit.
- Score: The score test only requires the more restrictive (reduced) model. After fitting the model, the slope, or "score", of the likelihood function (or log-likelihood) can be evaluated at different coefficient values. If the score (slope) is very large, then we know that we are not close to the best value of the parameter, the MLE. The resulting test statistic is also a \(\chi^2\) with degrees of freedom equal to the number of predictors being tested. A small p-value indicates the larger model is an improvement in fit.
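
A sketch of the LRT for nested normal linear models, using the profile log-likelihood (simulated data; 3.84 is the 0.95 quantile of \(\chi^2_1\)):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)

# For a normal linear model, the maximized log-likelihood at the OLS fit is,
# up to additive constants, -n/2 * log(SSE / n).
def loglik(X):
    b = np.linalg.solve(X.T @ X, X.T @ y)
    sse = np.sum((y - X @ b) ** 2)
    return -0.5 * n * np.log(sse / n)

ones = np.ones((n, 1))
ll_reduced = loglik(ones)                     # model_1: intercept only
ll_full = loglik(np.column_stack([ones, x]))  # model_2: intercept + x
lr = 2 * (ll_full - ll_reduced)               # ~ chi^2 with 1 df under H0

# lr far exceeds 3.84 here, so adding x improves the fit at the 0.05 level.
```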

Ref: https://doi.org/10.1080/00031305.1982.10482817

Ref: John Fox, Applied Regression Analysis


### Assumptions

- What are the assumptions of linear regression?
- Validity: The data collected maps correctly to the questions of interest. The response variable reflects the phenomenon of interest and all relevant predictors are included.
- Additivity: The effects of the predictor variables on the response are additive, along with the additive error term.
- Linearity: The response variable is a deterministic component of a linear function of the predictors plus an error term. The conditional means of the response are a linear function of the predictor variables.
- Multicollinearity: The predictor variables are not linear combinations of each other.
- Independence of errors: The errors from the regression line are independent.
- Equal variance of errors: The errors from the regression line have similar variability across the range of the response.
- Normality of errors: The errors are normally distributed with mean 0.

- How can you validate the validity assumption of linear regression?
- Use expert’s knowledge to determine the functional relationship with the response.
- If a relevant predictor variable is omitted:
- The OLS estimator of the coefficients will be biased (if the omitted variable is correlated with the included predictors).
- The variance of the error term will be biased.
- The covariance matrix of the coefficients will be biased.

- If an irrelevant predictor variable is included:
- The OLS estimators of the coefficients are unbiased (as the unneeded coefficient should be 0).
- The covariance matrix of the coefficients will be biased (increased).

- Before estimating coefficients, predict the sign of them. If they are different than expected, you may have omitted something correlated with that predictor variable.

Ref: http://www.sfu.ca/~pendakur/teaching/buec333/The%20Classical%20Model%20and%20Specification.pdf

- How can you validate the linearity assumption of linear regression?
- The residual vs. predicted plot should be symmetrically distributed around
a horizontal line. If this relationship drifts away from zero, then the conditional
expected response isn’t linear in the fitted values.
- Each individual predictor vs. observed plot should be symmetrically distributed around a linear line.

- How can you validate the multicollinearity assumption of linear regression?
- Check pairwise correlations of predictor variables.
- Check for instability in coefficients and standard error of those estimates.
- Check the variance inflation factors (VIF)

- A VIF of 1.9 tells you that the variance of a particular coefficient is 90% bigger than what you would expect if there was no multicollinearity.
- VIFs are calculated by taking a predictor, and regressing it against every other predictor in the model. This gives you the \(R^2\) values for each predictor. \(VIF = \frac{1}{1-R^2_i}\)

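
VIFs can be computed directly from these auxiliary regressions (simulated predictors; x1 and x2 are constructed to be highly correlated):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + np.sqrt(1 - 0.9**2) * rng.normal(size=n)  # corr(x1, x2) ~ 0.9
x3 = rng.normal(size=n)                                    # unrelated predictor

def vif(target, others):
    # Regress one predictor on the rest; VIF = 1 / (1 - R^2).
    X = np.column_stack([np.ones(n)] + others)
    b = np.linalg.solve(X.T @ X, X.T @ target)
    r2 = 1 - (target - X @ b).var() / target.var()
    return 1 / (1 - r2)

vif_x1 = vif(x1, [x2, x3])  # large: x1 is nearly a linear function of x2
vif_x3 = vif(x3, [x1, x2])  # near 1: x3 carries no collinearity
```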
- How can you validate the independence assumption of linear regression?
The residual plots should look as if they were random noise.
- residuals vs. observation order
- residuals vs. time
- residuals vs. response
- residuals vs. each predictor
- ACF plot

Otherwise, verify that each observation in the data is independent of others.

- How can you validate the equal variance assumption of linear regression?
The residual vs. fitted values plot should show constant variability.

- How can you validate the normality of errors assumption of linear regression?
The QQ plot should show points falling along a line with slope one and intercept zero. QQ plots can compare the distributions of any two vectors of data. For regression models, we usually look at a plot with the Normal distribution's theoretical quantiles on the x-axis and the sample (residual) quantiles on the y-axis.

- How can you fix the equal variance assumption of linear regression?
- Identify source of variation in residuals with residual vs. predictors plots.
- Use a variance stabilizing transformation. (Log or square root transform the response)
- Use a weighted least squares or robust regression.
- Use a generalized least squares (GLS) model.

- How can you fix the independence assumption of linear regression?
A linear mixed model (LMM) or generalized least squares (GLS) model may be a good choice as they provide methods to fit correlated errors.

- How can you fix the linearity assumption of linear regression?
- You can try to identify the source of nonlinearity with response vs. predictor plots.
- Log transformations on the response or predictors variables.
- Interaction effects between predictor variables.
- Polynomial terms for predictors.
- Spline terms for predictors.

- How can you fix the normality of errors assumption of linear regression?
- Try transformations on response. (Log or square root)
- Try generalized least squares (GLS) model
- Try generalized linear model (GLM) model

- What are the assumptions of logistic regression?
- Validity: The data collected maps correctly to the questions of interest. The response variable reflects the phenomenon of interest (binary) and all relevant predictors are included.
- Linearity: The log odds of the outcome (logit of outcome) are linearly related to each predictor variable.
- Multicollinearity: The predictor variables are not linear combinations of each other.
- Independence of observations.

- What are the assumptions of Cox proportional hazards regression?
- Validity: The data collected maps correctly to the questions of interest. The response variable reflects the phenomenon of interest (time to event, censoring if lost to follow-up) and all relevant predictors are included.
- Linearity: The log hazard (or log cumulative hazard) of the outcome is linearly related to each predictor variable.
- Multicollinearity: The predictor variables are not linear combinations of each other.
- No information bias: Predictor variables were measured at or before the baseline survival time. When predictors that are not an inherent property of the observation are measured after the start of survival time, the observation had to remain event-free until the time that variable was measured.
- Proportional hazards: There are no time by predictor interactions.
- The predictors have the same effect on the hazard function at all timepoints.
- The treatment effect should have the same relative difference in hazard compared to the control effect at all timepoints (their curves should not cross.)

- Independence of observations.

### Paradoxes and other issues

- What is Simpson’s paradox?
This paradox was described in detail by Edward Simpson in 1951. It occurs when the relationship between two categorical variables reverses (or is diminished or enhanced) when you condition on a third categorical variable. Consider the situation where, as a whole, the control group was more likely to get sick than those who took a drug \((\frac{13}{60} = 0.22 > \frac{11}{60} = 0.18)\). However, within each gender, the control group had a **smaller** chance of getting sick than those who took the drug: \((\frac{1}{20} = 0.05 < \frac{3}{40} = 0.07 \text{ and } \frac{12}{40} = 0.3 < \frac{8}{20} = 0.4)\). This reversal in relative frequency reflects the mathematical fact that \(\frac{A}{B} > \frac{a}{b} \text{ and } \frac{C}{D} > \frac{c}{d}\) does not imply that \(\frac{A + C}{B + D} > \frac{a + c}{b + d}\). In this case \(\frac{3}{40} > \frac{1}{20}\) and \(\frac{8}{20} > \frac{12}{40}\) but \(\frac{3 + 8}{40 + 20} < \frac{1 + 12}{20 + 40}\).

Both the control and drug groups had 60 people total. However, the control group had 40 men while the drug group had 20 men, and men were overall more likely to get sick \((\frac{20}{60} = 0.33 > \frac{4}{60} = 0.07)\). Looking at the aggregated data hides this fact and results in Simpson's paradox.

|        | control sick | control healthy | control total | drug sick | drug healthy | drug total |
|--------|--------------|-----------------|---------------|-----------|--------------|------------|
| female | 1            | 19              | 20            | 3         | 37           | 40         |
| male   | 12           | 28              | 40            | 8         | 12           | 20         |
| total  | 13           | 47              | 60            | 11        | 49           | 60         |

You might think this means the conditional analysis is always the correct choice; however, the only way to formally decide is to draw the relevant causal networks, decide which variables are confounding, and make those assumptions known when reporting the results. Choosing the correct way to analyze data is not based on naively conditioning on every variable measured, but rather is extracted from the context of the question, the data generating process, and the design of experiment.
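
The reversal in these counts can be checked with a few lines of arithmetic:

```python
# Sick counts and group sizes for the control and drug groups, by gender.
control = {"female": (1, 20), "male": (12, 40)}
drug = {"female": (3, 40), "male": (8, 20)}

def rate(sick, total):
    return sick / total

def pooled_rate(group):
    return sum(s for s, _ in group.values()) / sum(t for _, t in group.values())

# Within each gender the drug group gets sick more often than control...
within = all(rate(*drug[g]) > rate(*control[g]) for g in ("female", "male"))
# ...yet pooling over gender reverses the comparison: Simpson's paradox.
reversed_overall = pooled_rate(drug) < pooled_rate(control)
```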

- What is Lord’s paradox?
This paradox was described in detail by Frederic Lord in 1967. It occurs when the relationship between two variables (one categorical and one continuous) reverses (or is diminished or enhanced) when you condition on a third continuous variable. The usual example for the third variable is a baseline measurement, while the outcome (response) is the final measurement. There are two options for analysis

- An unconditioned test on the change score
- A conditioned test on the final outcome.

A t-test on the change score, \(y - x2\), provides a summary for the total effect (direct effect of *x1* plus indirect effects through all other paths), while an ANCOVA with baseline as a covariate provides the direct effect of *x1*.

In a pre-post design, if differences between treatment groups at baseline are expected to be zero (such as in a RCT), then a baseline adjusted model and change score model are expected to give the same result. If differences between the two treatment groups are expected, then the two methods are expected to give different results. Thus, the researcher needs to choose the causal mediation effect of interest in order to have a meaningful interpretation.

- What is Suppression?
Suppression occurs when the relationship between two continuous variables reverses (or is diminished or enhanced) when you condition on a third continuous variable. The general case is observed when an explanatory variable that is unrelated to the response variable increases the fit of a model. Suppose a model has continuous response *y* and continuous explanatory variables *x1* and *x2*. If \(corr(y, x1) = 0\), \(corr(y, x2) > 0\), and \(corr(x1, x2) > 0\), then including *x1* can suppress the part of *x2* that is uncorrelated with *y*. The coefficient for *x2* can increase, \(R^2\) can increase, and the coefficient for *x1* will be less than zero. In general, suppression is seen when controlling for a ‘third’ variable that is termed a confounder (although this confounder can only be interpreted based on a causal diagram in the context of the question).

- What is Berkson’s paradox?
In 1946, Joseph Berkson, a biostatistician at the Mayo Clinic, noticed that even if two diseases have no relation to each other in the general population, they can be associated among hospital patients. This can result from a situation where neither of the two diseases by themselves cause hospitalization, but together they do. So conditioning on hospitalization created a spurious association between the two diseases. Determining this effect is hard, and it’s important to create causal diagrams to illustrate the potential confounding paths. This often arises from convenience samples or other sampling bias.

- What is the Will Rogers phenomenon?
This phenomenon is attributed (perhaps incorrectly) to the comedian Will Rogers via the following quote:

“When the Okies left Oklahoma and moved to California, they raised the average intelligence level in both states.”

This is more generally referred to as stage migration. For example, improved detection of illness increases the number of people classified as unhealthy, which can increase the average life span in both groups even if there is no improvement in treatment. This is a potential explanation for the increase in cancer survival times. In summary, you may wrongly attribute improvements to treatment effects when you should be conditioning on diagnostic criteria.

- What is regression to the mean?
In 1885, Francis Galton published a paper on the heights of parents and their children. He found that parents on the extreme ends of the height spectrum tend to have children whose heights are closer to the average. This effect causes issues when selection for measurement is based on some criterion (above or below a certain value) from an initial measure. Consider a situation where you measure blood pressure for 1000 patients and 285 have DBP values greater than 95 mmHg. Instead of following up with all 1000 patients, you only call back the 285 and find that their mean DBP value has decreased. In this situation you would not be able to separate the effect of a drug or other intervention from regression to the mean. To see the effect of an intervention, you would need to randomly assign half of the 285 to a control group and half to a treatment group. Then both groups would have the same regression to the mean improvement and any differences could be attributed to the treatment.

### Other

- What is logistic regression?
Logistic regression is a logit-link generalized linear model that will generate predicted probabilities for the binary response variable using a linear combination of predictor variables.

- How would you interpret the coefficient for a continuous predictor in linear regression?
The coefficient of a continuous predictor represents the change in the response variable for each 1 unit change in the predictor, holding the other predictors constant.

- How would you interpret the coefficient for a categorical predictor in linear regression?
For treatment or dummy encoded categorical variables, each coefficient represents the change in the response variable relative to the reference level of the categorical variable.

- How would you interpret the coefficient for a continuous predictor in logistic regression?
The coefficient of a continuous predictor represents the change in the log odds of the response for each 1 unit change in the predictor.

- How would you interpret the coefficient for a categorical predictor in logistic regression?
For treatment/dummy encoded variables, each coefficient of a categorical predictor represents the log odds ratio of the response relative to the reference level of the categorical variable.

- What is the risk ratio?
- Relative risk = risk ratio.
- The ratio of the probability of an outcome in an exposed group to the probability of an outcome in an unexposed group.
- We generally prefer Odds Ratios as the RR does a poor job when baseline risk varies, and baseline risk tends to vary a lot.
- Risk ratios should not be used in case-control studies.
- Ref: https://doi.org/10.1016/j.jclinepi.2020.08.019

If we found that 17% of smokers develop lung cancer and 1% of non-smokers develop lung cancer, then we can calculate the relative risk of lung cancer in smokers versus non-smokers as: Relative Risk = 17% / 1% = 17

Thus, smokers are 17 times more likely to develop lung cancer than non-smokers.

- What is the odds ratio?
- The ratio of the odds of an outcome in an exposed group to the odds of an outcome in an unexposed group.
- We generally prefer Odds Ratios as the RR does a poor job when baseline risk varies, and baseline risk tends to vary a lot.
- When an event is rare, then odds ratios approximate ratios of probabilities (the risk ratio).
- When an event is not rare, the odds ratio is larger compared to the risk ratio.
- Can be used in case-control studies.
- In comparison with identity and log-link (risk ratio) GLM models, the logit-link model results in non-collapsible odds ratios.
- Ref: https://doi.org/10.1016/j.jclinepi.2020.08.019

If 17 smokers have lung cancer, 83 smokers do not have lung cancer, one non-smoker has lung cancer, and 99 non-smokers do not have lung cancer, the odds ratio is calculated as follows:

- Odds in exposed group = (smokers with lung cancer) / (smokers without lung cancer) = 17/83 = 0.205
- Odds in not exposed group = (non-smokers with lung cancer) / (non-smokers without lung cancer) = 1/99 = 0.01
- Odds ratio = (odds in exposed group) / (odds in not exposed group) = 0.205 / 0.01 = 20.5
- This group of smokers has about 20 times the odds of having lung cancer compared with non-smokers.
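The two calculations above can be sketched in a few lines of Python (counts taken from the smoking examples; the odds ratio differs slightly from the hand calculation because no intermediate rounding is done):

```python
# Relative risk and odds ratio from a 2x2 table, using the
# smoking / lung cancer counts from the examples above.
a, b = 17, 83  # exposed (smokers): with / without lung cancer
c, d = 1, 99   # unexposed (non-smokers): with / without lung cancer

risk_exposed = a / (a + b)      # 17/100
risk_unexposed = c / (c + d)    # 1/100
relative_risk = risk_exposed / risk_unexposed  # 17.0

odds_exposed = a / b            # 17/83
odds_unexposed = c / d          # 1/99
odds_ratio = odds_exposed / odds_unexposed     # ~20.3 without rounding
```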

- Collapsible vs non-collapsible estimates?
- Collapsible means that adjusting for other covariates which are uncorrelated with the covariate of interest **will not** change its estimated value (although it may change the SE).
- Non-collapsible means that adjusting for other covariates which are uncorrelated with the covariate of interest **will** change its estimated value.

- When is a model underfit?
A model is underfit when it assumes the relationship between the explanatory variables and the response variable is more simple than it actually is.

- When is a model overfit?
A model is overfit when the model predictions match very closely with the observed data, but perform poorly on new data.

- What is supervised learning?
Supervised learning is when you have measurements on a set of predictor variables and response variable so that you can create a function, Y = f(X), that best explains the relationship between them. Regression model = supervised learning.

- What is unsupervised learning?
Unsupervised learning is when you have data on a set of variables and would simply like to find underlying structure or patterns within this set of data. Clustering/Association = unsupervised learning.

- What is cross validation?
Cross validation is a method to evaluate the performance of a statistical model on unseen data. Common methods include K-fold cross validation and bootstrap resampling.

- What is K-fold cross validation?
- Randomly divide a dataset into k groups, or “folds”, of roughly equal size.
- Choose one of the folds to be the holdout set. Fit the model on the remaining k-1 folds. Calculate the test MSE (or other measure of fit) on the observations in the fold that was held out.
- Repeat this process k times, using a different set each time as the holdout set.
- Calculate the overall test MSE to be the average of the k test MSE’s.

Typically, given these considerations, one performs k-fold cross-validation using k = 5 or k = 10, as these values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance. Ref: Page 184, An Introduction to Statistical Learning
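A minimal numpy sketch of the steps above, assuming simulated data and a simple straight-line model (`np.polyfit`) as the estimator:

```python
import numpy as np

# Simulated data: y depends linearly on x plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(scale=0.5, size=100)

k = 5
folds = np.array_split(rng.permutation(100), k)  # random folds of ~equal size

fold_mses = []
for i in range(k):
    test_idx = folds[i]                                    # holdout fold
    train_idx = np.concatenate(folds[:i] + folds[i + 1:])  # remaining k-1 folds
    slope, intercept = np.polyfit(x[train_idx], y[train_idx], 1)
    pred = slope * x[test_idx] + intercept
    fold_mses.append(np.mean((y[test_idx] - pred) ** 2))

cv_mse = np.mean(fold_mses)  # overall test MSE = average of the k fold MSEs
```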

- What is bootstrap cross validation?
- Fit the model to the original data and estimate the validation statistics \(VStat_{orig}\)
- Repeat 1000 times:
- Create a bootstrapped dataset (with replacement) from the original data.
- Fit the model to the bootstrapped data and estimate validation statistics \(VStat_{boot}\).
- Estimate new validation statistics \(VStat_{boot:orig}\) by applying the bootstrapped survival model to the original data.

- Calculate the optimism of the original validation statistics \(Optimism = \frac{1}{1000} \sum_{i=1}^{1000} \left( VStat_{boot} - VStat_{boot:orig} \right)\)
- Adjust the original validation estimate by the optimism: \(VStat_{orig} - Optimism\)
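A sketch of the optimism-adjusted bootstrap above, using the \(R^2\) of a straight-line fit as the validation statistic on simulated data (200 resamples here instead of 1000, purely for speed):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(size=n)

def r_squared(x, y, slope, intercept):
    resid = y - (slope * x + intercept)
    return 1.0 - resid.var() / y.var()

# Apparent performance on the original data.
b1, b0 = np.polyfit(x, y, 1)
r2_orig = r_squared(x, y, b1, b0)

optimism = []
for _ in range(200):
    idx = rng.integers(0, n, n)                    # resample with replacement
    bb1, bb0 = np.polyfit(x[idx], y[idx], 1)       # refit on the bootstrap data
    r2_boot = r_squared(x[idx], y[idx], bb1, bb0)  # performance on bootstrap data
    r2_boot_orig = r_squared(x, y, bb1, bb0)       # same model on original data
    optimism.append(r2_boot - r2_boot_orig)

# Optimism-adjusted estimate of the validation statistic.
r2_adjusted = r2_orig - np.mean(optimism)
```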

- What cross validation method applies to time series data?
You can use forward chaining. e.g. split the sample into 5 sets chronologically:

- training (1), test (2)
- training (1, 2), test (3)
- training (1, 2, 3), test (4)
- training (1, 2, 3, 4), test (5)
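The forward-chaining splits above can be generated with a small helper (a hypothetical sketch for 100 observations split into 5 chronological sets):

```python
def forward_chain_splits(n_obs, n_folds):
    """Yield (train, test) index lists: train on sets 1..t, test on set t+1."""
    size = n_obs // n_folds
    for t in range(1, n_folds):
        train = list(range(0, t * size))
        test = list(range(t * size, (t + 1) * size))
        yield train, test

splits = list(forward_chain_splits(100, 5))
# 4 splits; training grows (20, 40, 60, 80 obs) and the test set is always
# the next chronological block, so the model never sees the future.
```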

- What model would you use for an ordered response variable?
- Proportional odds model. AKA cumulative logit model with proportional odds assumption.
- Continuation ratio model. Observations assumed to pass through lower levels to reach higher levels.
- Adjacent categories model.

- How would you perform model selection?
- The best model selection is through pre-specification using existing expert knowledge.
- The second best model selection method is to check for selection stability using cross validation.
  - Statistical methods cannot distinguish between spurious and real associations between variables. So we first need to pre-specify a set of candidate covariates for which domain expertise would assume association with the response variable.
    - Since this is usually hard to do, consider the results from unsupervised learning methods such as hierarchical clustering or PCA.
  - Choose a usual model selection method.
    - Traditional
      - Univariate screening
      - Backward selection
      - Change in coefficient
        - If elimination of a standardized coefficient results in more than X% change in another predictor of interest, then it will not be removed.
    - Penalized likelihood
      - Lasso
      - Elastic net
      - Adaptive Lasso
    - Boosting
      - Gradient boosting
      - Likelihood boosting
    - Model free
      - Feature ordering by conditional independence (FOCI).
  - Use the nonparametric bootstrap to generate repeated runs of the selection algorithm.
  - Assess the stability of model selection through
    - Variable inclusion frequency
    - Model selection frequency
  - If you do not see stable results, then it is unlikely that a data based model selection method will return interpretable results.

- What is Lasso regression?
Lasso regression uses L1 penalization. The penalization is proportional to the sum of the absolute values of the regression coefficients. It can shrink some regression coefficients exactly to zero, performing variable selection. LARS (least angle regression) provides an efficient way to compute the Lasso solution path.

- Cost = loss function + penalty term
- Cost = \(\sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_jx_{ij} \right)^2 + \lambda \sum_{j=1}^p |\beta_j|\)

Generally, we select the tuning parameter value \(\lambda\) for which the cross-validation error is smallest. Larger values of \(\lambda\) are more likely to force coefficients to zero.

- What is Ridge regression?
Ridge regression uses L2 penalization. The penalization is proportional to the sum of the squared regression coefficients. It will shrink regression coefficients, but will not force them to zero.

- Cost = loss function + penalty term
- Cost = \(\sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_jx_{ij} \right)^2 + \lambda \sum_{j=1}^p \beta_j^2\)

Generally, we select the tuning parameter value \(\lambda\) for which the cross-validation error is smallest.
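Both penalties can be illustrated on simulated data. Ridge has the closed form \((X^TX + \lambda I)^{-1}X^Ty\); the Lasso has no closed form in general, and coordinate descent with soft-thresholding is one standard way to solve it. A minimal numpy sketch (simulated data, \(\lambda\) chosen by hand rather than by cross validation):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 5
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, 1.5, 0.0, 0.0, 0.5])
y = X @ beta_true + rng.normal(size=n)

lam = 100.0

# OLS and ridge: ridge adds lam to the diagonal, shrinking the solution.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# Lasso by coordinate descent on cost = SSE + lam * sum(|beta|);
# for this cost each coordinate update soft-thresholds at lam / 2.
beta_lasso = np.zeros(p)
for _ in range(100):
    for j in range(p):
        r = y - X @ beta_lasso + X[:, j] * beta_lasso[j]  # partial residual
        beta_lasso[j] = soft_threshold(X[:, j] @ r, lam / 2) / (X[:, j] @ X[:, j])
```

With a penalty this large the Lasso typically sets the truly-zero coefficients exactly to zero, while ridge only shrinks every coefficient toward zero.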

- What is the class imbalance problem?
The class imbalance problem is the observation that some ‘classification’ algorithms do not perform well when the proportion of one class is much smaller than the other class.

This problem is really a result of three separate issues:

- Assuming that accuracy is an appropriate metric of model fit. i.e. you assign an equal cost to false positive and false negatives.
- Assuming that the test distribution matches the training distribution when it actually does not.
- The frequency of the minority class is too small to make reliable inference on.

Using the wrong model fit measure (accuracy) or an algorithm that does not model probabilities (SVM, tree methods) is the main problem in this scenario.

Otherwise this problem is not that important if you use a model that generates predicted probabilities and have enough observations in the smaller class.

- What is an interaction effect?
An interaction is when the effect of one predictor variable on the response variable changes at different levels of another predictor variable.

- What are precision, recall, sensitivity, and specificity?
- Precision = Positive predictive value
- Recall = Sensitivity
- Precision = P(truth is positive | estimate is positive) = What is the probability that this is a real hit given my classifier says it is?
- Recall = P(estimate is positive | truth is positive)
- Specificity = P(estimate is negative | truth is negative)
- ROC uses sensitivity and specificity, and they condition on the true class label.

- What is an ROC curve?
The ROC curve is a graphical representation of the contrast between true positive rates and false positive rates at various thresholds.

- When should you use a ROC or PR curve?
- Receiver operator characteristic (ROC)
- Precision Recall (PR) curve
- PR curve: How meaningful is a positive result from my classifier given the baseline probabilities of my problem?
- ROC: How well can this classifier be expected to perform in general, at a variety of different baseline probabilities?

- What is the ROC-AUC?
The AUC is the probability that the model ranks a random positive observation more highly than a random negative observation.
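This definition can be computed directly by comparing every (positive, negative) pair, counting ties as one half (hypothetical labels and scores):

```python
import numpy as np

y = np.array([1, 1, 1, 0, 0, 0, 0])                     # true labels
score = np.array([0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1])   # model scores

pos = score[y == 1]
neg = score[y == 0]

# Fraction of (positive, negative) pairs where the positive observation
# is ranked higher; ties contribute one half.
wins = [(p > q) + 0.5 * (p == q) for p in pos for q in neg]
auc = float(np.mean(wins))  # 11 of the 12 pairs are correctly ordered
```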

- What is the confusion matrix?
A matrix with actual classification as columns and predicted classification as rows.

| | actual + | actual - |
|--------|----------|----------|
| test + | a (TP) | b (FP) |
| test - | c (FN) | d (TN) |

- True positive (TP) — Correct positive prediction
- False positive (FP) — Incorrect positive prediction
- True negative (TN) — Correct negative prediction
- False negative (FN) — Incorrect negative prediction

- What are the common binary diagnostic summaries?
| | actual + | actual - |
|--------|----------|----------|
| test + | a (TP) | b (FP) |
| test - | c (FN) | d (TN) |

- Apparent prevalence \(= P(\text{test} +) = \frac{a+b}{N}\)
- True prevalence \(= P(\text{actual} +) = \frac{a+c}{N}\)
- Sensitivity \(= P(\text{test} + | \text{actual} +) = \frac{a}{a+c}\)
- Specificity \(= P(\text{test} - | \text{actual} -) = \frac{d}{b+d}\)
- Overall accuracy \(= P(\text{test was correct}) = \frac{a+d}{N}\)
- Positive Predictive Value \(= P(\text{actual} + | \text{test} +) = \frac{a}{a+b}\)
- Negative Predictive Value \(= P(\text{actual} - | \text{test} -) = \frac{d}{c+d}\)
- Proportion of false positives \(= P(\text{test} + | \text{actual} -) = \frac{b}{b+d}\)
- Proportion of false negatives \(= P(\text{test} - | \text{actual} +) = \frac{c}{a+c}\)
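Each of these summaries is a one-liner once the four cell counts are named (hypothetical counts):

```python
# 2x2 table cells: a = TP, b = FP, c = FN, d = TN (hypothetical counts).
a, b, c, d = 90, 30, 10, 870
N = a + b + c + d

apparent_prevalence = (a + b) / N   # P(test +)
true_prevalence = (a + c) / N       # P(actual +)
sensitivity = a / (a + c)           # P(test + | actual +)
specificity = d / (b + d)           # P(test - | actual -)
accuracy = (a + d) / N              # P(test was correct)
ppv = a / (a + b)                   # P(actual + | test +)
npv = d / (c + d)                   # P(actual - | test -)
```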

- What is pruning a decision tree?
When you remove a sub-node of a decision node for the purposes of simplifying the complexity of the classifier.

- What are some model fit summaries for linear regression?
- \(R^2\) - The proportion of variability in the response explained by the
model. \(R^2 = 1-\frac{SSE}{SST}\)
- RMSE - The standard deviation of the unexplained variance (same units as response).
The typical difference between the observed and fitted values. \(RMSE = \sqrt{\frac{SSE}{d.f.}}\)
- Optimism adjusted calibration curve

- What are some model fit summaries for logistic regression?
- AUC - The AUC is the probability that the model ranks a random positive observation more highly than a random negative observation.
- Concordant pairs - The percent of concordant, discordant, and tied pairwise comparisons.
The concordant percent is proportional to the AUC: AUC = proportion concordant + 0.5 × proportion tied.
- Brier score - The squared differences between actual binary outcomes Y and predictions
p are calculated. When the outcome incidence is lower, the maximum score for a
non-informative model is lower. The Brier score for a model can range from 0 for
a perfect model to 0.25 for a non-informative model with a 50% incidence of the outcome.
- Optimism adjusted calibration curve
- Classification table
- Goodness of fit test (GOF) - Hosmer-Lemeshow GOF
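As one concrete example, the Brier score above is just the mean squared difference between outcomes and predicted probabilities (hypothetical values):

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0])             # observed binary outcomes
p = np.array([0.8, 0.2, 0.6, 0.9, 0.3])   # predicted probabilities

brier = np.mean((y - p) ** 2)  # 0 = perfect; lower is better

# A non-informative model that predicts 0.5 for every case scores 0.25,
# the worst-case value quoted above for a 50% incidence outcome.
brier_noninformative = np.mean((y - 0.5) ** 2)
```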

- Why should a linear model include an intercept term?
A linear model should include an intercept term so that slope estimates are not unintentionally biased.

Even if we believe the intercept to be zero, there may be threshold input levels that are best estimated including the intercept.

- If two predictor variables have zero correlation, can we consider them statistically independent?
Not necessarily. Lack of linear correlation doesn’t mean they are independent. They might be non-linearly dependent, such as x and x^2. (Hint: verify assumptions before answering.)

- What is a principled way to handle missing data?
- Double check and verify missingness.
- If missingness is informative:
- For a categorical variable, you might try creating a missing level.
- For numeric variables, you can substitute missings with the mean, then add an additional predictor variable that is an indicator for missingness.

- If missingness is at random
- MCAR allows complete case analysis.
- MCAR and MAR allow imputation.

See more at https://stefvanbuuren.name/fimd/
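The indicator-for-missingness idea above can be sketched in numpy (hypothetical values, with NaN marking missing):

```python
import numpy as np

x = np.array([1.2, np.nan, 3.4, 2.2, np.nan, 5.0])   # one numeric predictor

missing = np.isnan(x).astype(float)                  # 1 where x was missing
x_imputed = np.where(np.isnan(x), np.nanmean(x), x)  # mean-substitute

# Design matrix: imputed values plus the indicator, letting the model
# estimate a separate effect for "this value was missing".
X = np.column_stack([x_imputed, missing])
```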

- What are two methods for analyzing pre-post data?
- Baseline adjusted model
- Use a regression model with the pre-response measure as a predictor variable and the post-response measure as the response variable.

- Change score model
- Use a regression model with the pre-post difference as the response variable.

- How would you determine the sample size required for a prediction model?
- Continuous outcomes: https://doi.org/10.1002/sim.7993
- Binary and time-to-event outcomes: https://doi.org/10.1002/sim.7992
- Tutorial: https://doi.org/10.1136/bmj.m441

- What is internal model validation?
Internal validation means reusing parts or all of the data on which a model was developed to assess the amount of (likely) overfitting and correct for the amount of ‘optimism’ in the performance of the model.

This is generally done through cross validation and model performance summaries such as R^2, calibration, and discrimination are assessed.

See notes on bootstrap cross validation for implementation.

- What is external model validation?
External model validation means assessing the performance of a model already developed when applied to an independent dataset.

Generally, independent datasets have one or more of the following properties:

- Temporal differences
- Data may be collected from the same locations, but over different periods of time.

- Geographic differences
- Data was collected from different locations.

- Institutional differences
- Data was collected from an organization not connected with the original source.

If the external sample is very similar to the original sample, the assessment is for reproducibility rather than for transportability.

Steps:

- Compare variable characteristics of original sample and new sample.
- Determine if process used to create original model would result in same choice on new data.
- Use the original model and apply it to the new sample, then calculate apparent
calibration and discrimination.
- Compare discrimination and calibration statistics for
- original model, original data, apparent
- original model, original data, optimism adjusted (old model internal validation)
- original model, new data, apparent (old model external validation)

- Determine if differences are from overfitting, missing predictors, differences in data capture, extrapolation, etc.

- If warranted, combine original and new data, then re-fit a new model.
- Compare the discrimination and calibration statistics for
- combined model, combined data, apparent
- combined model, combined data, optimism adjusted (new model internal validation)


Royston and Altman suggest:

- Regression on linear predictors of original and external data
- This is one way to calculate calibration.
- Use a regression model with Y = linear predictors from new model fit on external data and X = linear predictors from old model fit on external data.
- If the slope in the validation dataset is < 1, discrimination is poorer. If it is > 1, discrimination is better.
- Likelihood ratio test for slope=1 is recommended, but may be anti-conservative since it does not allow for uncertainty in the estimated regression coefficients that constitute the linear predictor.

- Check model misspecification/fit
- See paper?

- Measures of discrimination
- \(R^2\)
- c-index

- Kaplan-Meier curves for risk groups
- Discrimination: Does the original model fit on new data keep the curves separated?
- Do not use logrank test for differences….

- Calibration: Does the original model fit on the new data keep the curves overlapping?


See TRIPOD statement

- What is a confounding variable?
Confounding is systematic error introduced into a statistical model when a third variable is associated with both the exposure (predictor variable of interest) and the outcome variable.

Covariates included in the model should be

- Associated with the predictor of interest.
- Cause the outcome
- The covariate influences the outcome and not the other way around.

- Not located between the exposure and outcome on the causal pathway.

For example:

- Bad: exposure -> covariate -> outcome
- Good: covariate -> exposure -> outcome
- Consider a study about grades and alertness in class. Amount of sleep is associated
with alertness and also could be regarded as a causal influence on grades. Thus
amount of sleep is a good (confounding) covariate to adjust for.
- Amount of sleep (confounder) -> alertness (predictor) -> grades (outcome)

So the steps to follow are:

- Identify variables which may cause the outcome.
- Identify the variables in step 1 that are associated with the exposure.
- Make sure the variables in step 2 are not on the causal pathway (the exposure is not causing the covariate)

- How should you model nested data?
When the sampling method incorporates a nesting structure, we need to mirror this structure in the model. This structure creates dependencies among observations and ignoring this can lead to inflation of type 1 errors.

- Hierarchical model / multi-level model / mixed model
- incorporate the sampling structure into the random effect terms
- Should use if you are interested in the variability within the nesting structure.

- Linear model with cluster-robust variance estimation
- Ok to use if you're focused on the effects on individuals.

- Or, use a hierarchical model with cluster-robust variance estimation?

- What is principal component analysis (PCA)?
PCA was introduced by Karl Pearson (1901) and Harold Hotelling (1933). PCA is an unsupervised learning method, and is useful when you wish to reduce the number of predictor variables, but still keep as much information as possible. In essence, the principal component scores are uncorrelated linear combinations of weighted observed variables which explain the maximal amount of variance in the data. The first principal component identified accounts for most of the variance in the data. The second component identified accounts for the second largest amount of variance in the data and is uncorrelated with the first principal component and so on.

- Eigenvectors are the weights in a linear transformation when computing principal component scores.
- Eigenvalues indicate the amount of variance explained by each principal component or each factor.
- \(\text{Component loading} = \text{Eigenvectors} \times \sqrt{\text{Eigenvalues}}\)

If your goal is to reduce your variable list down to a linear combination of smaller components, then PCA is the way to go. However, if you believe there is some latent construct (the actual characteristic you are interested in, but which cannot be directly measured) that defines the interrelationship between the variables, then factor analysis is more appropriate.

## Math Stat

- What are the basic properties of variance?
- \(\sigma^2 = \frac{\sum(x-\mu)^2}{n}\)
- \(s^2 = \frac{\sum(x-\bar{x})^2}{n-1}\)
- \(Var(X) = E(X^2) - (E(X))^2\)
- \(Var(X) = E \left((X - E(X))^2 \right)\)
- \(Var(aX + b) = a^2 Var(X)\)
- If independent: \(Var(X+Y) = Var(X) + Var(Y)\)
- If not independent: \(Var(X + Y) = Var(X) + Var(Y) + 2Cov(X,Y)\)
- Variance is a measure of dispersion around the mean.
- Standard deviation is the average distance (same units as the values) between the mean and the sample values.
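These identities are easy to sanity-check numerically on simulated draws:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=1.0, scale=2.0, size=100_000)
z = rng.normal(size=100_000)  # independent of x

a, b = 3.0, 5.0
# Var(aX + b) = a^2 Var(X): exact on any sample, up to floating point.
lhs = np.var(a * x + b)
rhs = a ** 2 * np.var(x)

# For independent X and Z, Var(X + Z) is close to Var(X) + Var(Z)
# (equal only up to the sample covariance, which shrinks with n).
var_sum = np.var(x + z)
```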

- What are the basic properties of covariance?
- \(Cov(X, Y) = E(XY) - E(X)E(Y)\)
- \(Cov(X, Y) = E\left( (X-E(X))(Y-E(Y)) \right)\)

- What are the basic properties of expectation?
- \(E(X) = \int^{\infty}_{-\infty} x f(x)\, dx\)
- \(E(g(X)) = \int^{\infty}_{-\infty} g(x) f(x)\, dx\)
- \(E(aX + b) = aE(X) + b\)
- \(E(X + Y) = E(X) + E(Y)\)
- If independent: \(E(XY) = E(X)E(Y)\)


## Probability

### Basics

- What are the basic rules of probability?
If independent: \(P(A \cap B) = P(A)P(B)\)

If not independent: \(P(A \cap B) = P(A)P(B|A) = P(A|B)P(B)\)

If independent: \(P(A|B) = P(A)\)

If not independent: \(P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B|A)P(A)}{P(B)}\)

If mutually exclusive: \(P(A \cup B) = P(A) + P(B)\)

If not mutually exclusive: \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\)

- What is permutation?
- The number of ways to order a set of items, with or without replacement,
given that the ordering matters.
- When the order matters, we use the word “permutation”.
- The number of samples of size k out of a population of size n.
- Without replacement: \(P(n, k) = \frac{n!}{(n-k)!}\)
- With replacement: \(n^k\)

- What is combination?
- The number of ways to order a set of items, with or without replacement,
given that the order does not matter.
- When the order does not matter, we use the word “combination”.
- The number of samples of size k out of a population of size n.
- Without replacement: \(C(n, k) = \binom{n}{k} = \frac{n!}{k!(n-k)!}\)
- With replacement: \(\binom{k+n-1}{k}\)
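Python's standard library covers three of these four counting rules directly; the with-replacement combination is the multiset coefficient (n = 5 items and k = 2 draws here):

```python
import math

n, k = 5, 2

perm_without_rep = math.perm(n, k)        # n! / (n-k)!        = 20
perm_with_rep = n ** k                    # n^k                = 25
comb_without_rep = math.comb(n, k)        # n! / (k!(n-k)!)    = 10
comb_with_rep = math.comb(k + n - 1, k)   # C(k+n-1, k)        = 15
```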


### Distributions

- What is the normal distribution?
- Notation: \(N(\mu, \sigma^2)\)
- pdf: \(f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2} (\frac{x-\mu}{\sigma})^2}\)
- Expectation: \(\mu\)
- Variance: \(\sigma^2\)
- mgf: \(e^{(\mu t + \frac{1}{2} \sigma^2 t^2)}\)
- Independent sum: \(N \left( \sum^n_{i = 1} \mu_i, \sum^n_{i = 1} \sigma^2_i \right)\)
- Support: \(x \in (-\infty, \infty)\)
- In words: Describes data that cluster around the mean.

- What is the binomial distribution?
- Notation: \(Bin(n, p)\)
- pmf: \(P(X = k) = \binom{n}{k} p^k q^{n-k}\)
- cdf: \(\sum^k_{i = 0} \binom{n}{i} p^i q^{n-i}\)
- Expectation: \(np\)
- Variance: \(npq\)
- mgf: \((q+pe^t)^n\)
- In words: The binomial distribution provides the discrete probability distribution of obtaining exactly \(k\) successes out of \(n\) independent Bernoulli trials (each with success probability \(p\))

- What is the uniform distribution?
- Notation: \(U(a, b)\)
- pdf: \(\frac{1}{b-a}\)
- cdf: \(\frac{x-a}{b-a}\)
- Support: \(x \in [a, b]\)
- Expectation: \(\frac{1}{2} (a+b)\)
- Variance: \(\frac{1}{12} (b-a)^2\)
- mgf: \(\frac{e^{tb} - e^{ta}}{t(b-a)}\)
- In words: all intervals of the same length on the distribution’s support are equally probable.

- What is the geometric distribution?
- Notation: \(G(p)\)
- pmf: \(P(X = k) = (1-p)^{k-1} p\)
- cdf: \(1-(1-p)^k\), where \(k \in \mathbb{N}\)
- Expectation: \(\frac{1}{p}\)
- Variance: \(\frac{1-p}{p^2}\)
- mgf: \(\frac{pe^t}{1-(1-p)e^t}\)
- In words: The number of Bernoulli trials needed to get one success. i.e. The probability of getting the 1st success on the nth trial. Memoryless. It is a special case of the negative binomial distribution when there is only \(r = 1\) success.

- What is the Poisson distribution?
- Notation: \(Poisson(\lambda)\)
- pmf: \(\frac{\lambda^k}{k!} e^{-\lambda}\)
- cdf: \(e^{-\lambda} \sum^k_{i=0} \frac{\lambda^i}{i!}\), where \(k \in \mathbb{N}\)
- Expectation: \(\lambda\)
- Variance: \(\lambda\)
- mgf: \(e^{\lambda(e^t - 1)}\)
- independent sum: \(\sum^n_{i=1} X_i \sim Poisson\left( \sum^n_{i=1} \lambda_i \right)\)
- In words: The probability of a number of events occurring in a fixed period of
time, if these events occur with a known average rate and independently of the
time since the last event.
- Other notes: The binomial distribution can get unwieldy as the number of trials gets large. The Poisson distribution with \(\lambda = np\) approximates the binomial distribution when the number of trials is large and \(p\) is small.
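The approximation in the last note is easy to check numerically: with many trials and a small success probability, the two pmfs nearly coincide.

```python
import math

n, p, k = 1000, 0.002, 3   # many trials, small success probability
lam = n * p                # Poisson rate matching the binomial mean

binom_pmf = math.comb(n, k) * p ** k * (1 - p) ** (n - k)
pois_pmf = lam ** k / math.factorial(k) * math.exp(-lam)
# The two pmfs agree to roughly three decimal places here.
```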

- What is the exponential distribution?
- Notation: \(exp(\lambda)\)
- pdf: \(\lambda e^{-\lambda x}\)
- cdf: \(1-e^{-\lambda x}\), where \(x \ge 0\)
- Expectation: \(\frac{1}{\lambda}\)
- Variance: \(\frac{1}{\lambda^2}\)
- mgf: \(\frac{\lambda}{\lambda - t}\)
- independent sum: \(\sum^k_{i=1} X_i \sim Gamma(k, \lambda)\)
- In words: The amount of time until some specific event occurs, starting from now, being memoryless.
- Other notes: This is the simplest way to model survival. The hazard, \(\lambda\), is assumed constant over time and the survival function is \(e^{-\lambda x}\), where x is time.

- What is the gamma distribution?
- Notation: \(Gamma(k, \theta)\)
- pdf: \(\frac{\theta^k x^{k-1} e^{-\theta x}}{\Gamma(k)}\), where \(\Gamma(k) = \int^{\infty}_0 x^{k-1} e^{-x} dx\)
- Expectation: \(k\theta\)
- Variance: \(k\theta^2\)
- mgf: \((1-\theta t)^{-k}, t<\frac{1}{\theta}\)
- independent sum: \(\sum^n_{i=1} X_i \sim Gamma\left( \sum^n_{i=1} k_i, \theta \right)\)
- In words: The sum of k independent exponentially distributed random variables each of which has a mean of \(\theta\) (which is equivalent to a rate parameter of \(\theta^{-1}\))

- What is the negative binomial distribution?
- Notation: \(NBin(k, r, p)\)
- pmf: \(P(X = k) = \binom{k+r-1}{r-1} p^r q^{k} = \binom{k+r-1}{k} p^r q^{k}\),
where \(r\) is the number of successes, \(k\) is the number of failures, and \(p\)
is the probability of success.
- Expectation: \(r \frac{q}{p}\)
- Variance: \(r \frac{q}{p^2}\)
- mgf: \(\left( \frac{p}{1-qe^t} \right)^r\)
- In words: There are many different parameterizations:

- The probability distribution for the number of failures (k) before the \(r^{th}\) success in a bernoulli process. (detailed here)
- The probability distribution for the number of trials (n) given (r) success in a bernoulli process.
- The probability distribution for the number of successes (r) given (n) trials in a bernoulli process.


- What is the Bernoulli distribution?
- Notation: \(Bern(p)\)
- pmf: \(P(X = x) = p^x (1-p)^{1-x}\), where \(x \in \{0, 1\}\)
- cdf: \(0\) if \(x < 0\); \(1-p\) if \(0 \le x < 1\); \(1\) if \(x \ge 1\)
- Expectation: \(p\)
- Variance: \(pq\)
- mgf: \(q+pe^t\)

- Review the probability distribution flowchart.