Maximum Likelihood Estimation

Throughout this course, you all have learned about estimating linear models through ordinary least squares (OLS). Assuming a linear model: \[ y = X\beta + \epsilon \] the goal of OLS is to find the set of \(\beta\) that minimizes the squared errors, where \(\epsilon = y - X\beta\). If you’ve taken calculus, here is a derivation of the least-squares estimator for simple regression slopes (assuming \(x_i, y_i\) are centered): \[ \epsilon_i^2 = (y_i - \beta_0 - x_i\beta)^2 \\ \sum\epsilon_i^2 = \sum(y_i - \beta_0 - x_i\beta)(y_i - \beta_0 - x_i\beta) \\ = \sum (y_i^2 - 2y_i\beta_0 - 2y_ix_i\beta + \beta_0^2 + 2\beta_0x_i\beta + x_i^2\beta^2) \\ = (n-1)\sigma^2_y - 2(n-1)\beta\sigma^2_{x,y} + (n-1)\beta^2\sigma^2_x + n\beta_0^2 + 0 + 0 \quad \left[\text{since } \sum x_i = \sum y_i = 0 \right]\\ \frac{d(\text{SSE})}{d\beta} = -2(n-1)\sigma^2_{x,y} + 2(n-1)\sigma^2_x\beta \quad \left[\text{Derivative with respect to } \beta\right]\\ 0 = -\sigma^2_{x,y} + \sigma^2_x\beta \quad \left[\text{Set to zero to find the minimum}\right]\\ \beta = \frac{\sigma^2_{x,y}}{\sigma^2_x} \] where \(\sigma^2_{x,y}\) denotes the covariance of x and y. The OLS estimator is thus based on the principle of finding the parameters that minimize the squared prediction error. OLS is one of several techniques for estimating parameters from observed data. Broadly speaking, there are two other estimation approaches that operate on different principles: maximum likelihood and Bayesian estimation. Only maximum likelihood (ML) will be discussed here.
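As a quick numerical check of that result, the short simulation below (hypothetical data and variable names of our own, not part of the running example used later) compares the closed-form slope \(\sigma^2_{x,y}/\sigma^2_x\) to the slope returned by lm():

set.seed(1)
xc <- rnorm(50); xc <- xc - mean(xc)           # Centered predictor
yc <- .5 * xc + rnorm(50); yc <- yc - mean(yc) # Centered outcome
cov(xc, yc) / var(xc)                          # Closed-form slope from the derivation
coef(lm(yc ~ xc))['xc']                        # OLS slope from lm(); the two agree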

Whereas OLS operates on the goal of minimizing the squared prediction error, maximum likelihood operates on the goal of maximizing the likelihood of the data. In ML, the data are assumed to be modeled by some probability distribution, like the normal distribution (but there are hundreds of others). The goal of ML is to find the combination of parameters for this distribution that would make the data most likely. The combination of parameter values that would maximize the likelihood of our observations is taken to be the best estimate of the true parameter values.

As an example, let’s assume that data are generated from a normal distribution with unknown mean and standard deviation parameters. The goal is to find the mean (\(\mu\)) and standard deviation (\(\sigma\)) parameters to a normal distribution that would maximize the joint probability of the observations. The code below simulates 100 observations from a normal distribution in which the true \(\mu = 0, \sigma = 1\). Then three possible sets of parameter values for a normal distribution are plotted.

set.seed(13)
y <- rnorm(100,0,1) # Random sample
curve(dnorm(x,mean(y),sd(y)),-5,5,xlab='Y',ylab='Probability') # Black line: normal curve at the sample mean and SD
curve(dnorm(x,0,2),add=TRUE,col='red')  # Red line: mu = 0, sigma = 2
curve(dnorm(x,1,1),add=TRUE,col='blue') # Blue line: mu = 1, sigma = 1
rug(y,side = 1) # Observed data along the x-axis

The data are plotted along the x-axis (a rug plot). The goal is to find the parameters for the normal distribution under which the data are most likely to have been generated. The height of each plotted curve represents the probability [density] at each point. We want the curve under which the points are, overall, maximally probable. In plotting terms, we want most of the points to fall in a region where the curve is high, and few points to fall where the curve is low.

In this Goldilocks scenario, there is a red line, a blue line, and a black line. The red line has the right location, but the width is all wrong. The wider \(\sigma\) implies that data should be seen near -4 and 4, but we don’t see any beyond -2 and 2. The extra width also lowers the height of the distribution for every observation, so the joint likelihood is lower. The red line doesn’t fit the data well, because it is unnecessarily low for the points we’ve observed.

The blue line seems to have the right width, but its location is wrong. The curve is high for observations in the 1-2 range, but quite low for most of the observed data points. Consequently, the joint likelihood is again unnecessarily low.

The black line is just right — The width looks about right, and the location is about right. The line is high for a good portion of the data. This black line is the normal distribution described by the maximum-likelihood \(\mu\) and \(\sigma\) values. For our observed data, no other \(\mu\) or \(\sigma\) would fit better. The joint probability (likelihood) of the Y variates under the black curve (\(p(Y|\mu=-.06,\sigma=.95)\)) is greater than the joint probability of the Y variates when \(\mu = 0,\sigma=2\) (red line) or when \(\mu = 1, \sigma=1\) (blue line). Essentially — a normal distribution with parameters \(\mu = -.06, \sigma = .95\) fits the data much better than the other two distributions — The data are more likely under the black distribution than any other normal distribution, and so the parameters for the black line are considered our best estimates.
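As a rough check of that claim (a check of our own, not part of the original figure code), we can compute the joint density of the 100 simulated observations under each of the three curves; the black curve should come out on top:

prod(dnorm(y, mean(y), sd(y)))  # Black line: sample mean and SD
prod(dnorm(y, 0, 2))            # Red line: right location, too wide
prod(dnorm(y, 1, 1))            # Blue line: right width, wrong location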

Working with the likelihood

Maximum likelihood aims to find parameters for a probability distribution that maximize the joint probability of the data under that distribution, and essentially finds the distributional parameters that best reflect the distribution of the data. In other words, the goal of ML is to specify some likelihood function, \(p(Y|\mu,\sigma)\), and find a maximum of that function with respect to the parameters \(\mu, \sigma\) (\(\max_{\mu,\sigma} p(Y|\mu,\sigma)\)), where \(p(Y|\mu,\sigma)\) should be read as “the probability (\(p\)) of the data (\(Y\)) given (\(|\)) the parameters (\(\mu,\sigma\))”. When observations are independent of one another, probability theory says that the joint probability is the product of the probabilities of the individual observations: \[ p(Y|\mu,\sigma) = p(y_1|\mu,\sigma)p(y_2|\mu,\sigma)\ldots p(y_n|\mu,\sigma) = \prod_{i=1}^N p(y_i|\mu,\sigma) \]

However, this computation has some difficulties. Joint probabilities become very, very small, very quickly. As one example, imagine that we have a fair coin, such that the probability of landing heads on any flip is .5. If we flip the coin 100 times and it lands heads all 100 times, the probability of that sequence is \((.5)(.5)(.5)\ldots(.5) = .5^{100} \approx 7.9\times10^{-31}\): a decimal point followed by 30 zeros and then 79. Numbers like this quickly stretch the computer’s ability to represent them. As another example, if we compute the probability [density] of ten observations from a standard normal distribution:

probs <- dnorm(y[1:10])
probs
##  [1] 0.34212548 0.38357707 0.08253432 0.39200411 0.20770810 0.36594599
##  [7] 0.18734907 0.38792348 0.37318134 0.21662003
prod(probs)
## [1] 1.89605e-06

the value is .0000019. The more data we have, the smaller the joint probability becomes. If we use all 100 observations of our y variable, the joint probability [density] is \(7.77\times10^{-56}\); imagine if we had thousands of observations! Quite literally, joint probabilities can become so small that the computer’s floating-point numbers cannot represent them.

Instead, we work with the log likelihood. If the joint probability is .00000000000000000000001, the log is -52.95946, which is much easier for a computer to represent. Because log-transformations are “monotonic” (if Y increases, then log(Y) increases), finding the maximum of a log likelihood is the same as finding the maximum of a likelihood.

The above likelihood formula on the log scale equals: \[ \log p(Y|\mu,\sigma) = \sum_{i = 1}^N \log p(y_i|\mu,\sigma) \] The log-likelihood therefore has some nice properties. Instead of multiplying many small numbers, you add their logs, which is easier. The computer can represent the resulting numbers more easily. Maximizing the log-likelihood means maximizing the likelihood. And in calculus, taking derivatives of logs (for finding the maximum) is sometimes easier than taking the derivative of the raw function.
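To see this with the 100 simulated y values from earlier (a small check of our own):

log(prod(dnorm(y)))        # Log of the joint density; prod() risks underflow with more data
sum(dnorm(y, log = TRUE))  # Sum of log densities; same value, with no underflow risk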

In sum, we can define a probability distribution for our data, and find the parameters for that distribution that maximize the (log) likelihood of the data. These maximum-likelihood estimates are taken to be our best estimates.

Likelihood maximization

Let’s assume \(\sigma\) is known, and is set to .9508399 (the sample standard deviation of y). We can plot the log likelihood as a function of \(\mu\), and the goal is to find the value of \(\mu\) that maximizes the log-likelihood (LL). The log-likelihood is equal to: \[ \sum_i^N \log \left[\text{Normal}(y_i | \mu, \sigma = .9508399)\right] \]

sigma.y <- sd(y)
LL <- function(mu,y) { # Log-likelihood of y as a function of mu, with sigma fixed at sd(y)
  sum(dnorm(y,mean=mu,sd=sigma.y,log=TRUE))
}
LL <- Vectorize(LL,'mu') # Vectorize over mu so curve() can evaluate it across many values
curve(LL(x,y),-3,3,xlab=expression(mu),ylab='LL')
abline(v=mean(y),lty='dashed')
abline(h=LL(mean(y),y),lty='dashed')

The above plot is the log-likelihood of the data across various values of \(\mu\). Holding \(\sigma\) constant, we see that the likelihood of the data is maximized when \(\mu = -.06182533\), and therefore we treat that as the best estimate of \(\mu\) (the mean).

We don’t have to hold \(\sigma\) constant; instead we can compute the log-likelihood surface with both \(\mu\) and \(\sigma\) unknown.

library(rgl)
d <- y # To avoid confusion
LL <- function(x,y,d){
  sum(dnorm(d,x,y,log=TRUE))
}
LL <- Vectorize(LL,c('x','y'))
xmap <- seq(-3,3,by=.1)
ymap <- seq(.01,3,by=.1)
zmap <- outer(xmap,ymap,LL,d=y)
persp3d(LL,xlim=c(-3,3),ylim=c(.8,1.4),otherargs=list(d=y),col='blue',alpha=.8,lit=FALSE,xlab=expression(mu),ylab=expression(sigma))
points3d(mean(y),sd(y),LL(mean(d),sd(d),d),col='red')
segments3d(c(mean(d),mean(d)),c(sd(d),sd(d)),c(-900,LL(mean(d),sd(d),d)),col='red')
grid3d(c('z'))
rglwidget()

The 3D plot above is a log-likelihood surface. The maximum of this surface is where \(\mu = -.06179853, \sigma = .94611845\) — The maximum likelihood estimates of the mean and standard deviation parameters. Assuming a normal probability distribution describes our data, our data are most probable (log-likelihood = -136.3529) under a normal distribution in which the mean is -.0618 and the standard deviation is .946.

Thankfully, we do not have to do this by hand. The computer has all sorts of optimization algorithms for this. If you’re curious about how they work, one intuitive way to think about it is to invert that surface so it looks like a valley, so that higher points are worse and lower points are better. In the plot above, rotate it to be upside down. Then imagine dropping a virtual marble onto that surface and letting it roll into the lowest point. At that “lowest” point, the log-likelihood of the data is maximized. The code below uses an algorithm that essentially does that for us.

negLogLik <- function(x, data){ # Negative log likelihood; flip the surface plotted above
  -sum(dnorm(x=data,mean=x[1],sd=x[2],log=TRUE)) # Negative sum of the log likelihoods
}
optim(c(mu=0,sigma=1),negLogLik,data=y)$par # Drop the marble and let it roll; The output is where it ends up
##          mu       sigma 
## -0.06179853  0.94611845
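A side note (our own check, not part of the original output): the ML estimate of \(\sigma\) divides by \(n\) rather than \(n - 1\), which is why it comes out slightly smaller than sd(y):

sqrt(sum((y - mean(y))^2) / length(y)) # About .946; essentially the optim() estimate (small differences reflect optimizer tolerance)
sd(y)                                  # About .951; the usual n - 1 estimate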

Regression in maximum likelihood

Regression coefficients can be estimated in maximum likelihood. We assume that Y is generated from a model such that \(y_i = \beta_0 + \beta_1 x_i + \epsilon_i\), where \(\epsilon_i \sim Normal(0,\sigma)\). This is the same thing as saying \(y_i \sim Normal(\mu_i = \beta_0 + \beta_1 x_i, \sigma)\), or “Y is normally distributed around some predicted mean value”. Our goal, then, is to find the \(\beta\) and \(\sigma\) parameters that maximize the likelihood of the observed Y data under those normal distributions. We can consider the \(\beta\) parameters as predicting a \(\mu\) parameter to a normal distribution. That is, \[ \hat y_i = \hat \mu_i \\ \hat\mu_i = \beta_0 + \beta_1x_i \\ y_i = \hat\mu_i + \epsilon_i \\ \epsilon_i \sim \text{Normal}(0,\sigma) \\ y_i \sim \text{Normal}(\mu_i = \beta_0 + \beta_1x_i, \sigma) \\ LL = \sum_{i=1}^N \log P_\text{Normal}(y_i | \mu_i = \beta_0 + \beta_1x_i, \sigma) \] So just as we could use maximum likelihood to estimate a normal distribution’s \(\mu, \sigma\), we can use maximum likelihood to estimate \(\hat\mu_i = \beta_0 + \beta_1x_i, \sigma\). In words: “What regression coefficients, when multiplied by the predictors, produce a mean to a normal distribution that maximizes the likelihood of the observed outcome variable?” Just as before, we are finding the parameters to a normal distribution that maximize the likelihood of the data, but the mean (\(\mu\)) parameter shifts across the predictor. This makes sense, because we are trying to predict expected values in Y from a predictor X. These normal distributions are plotted below. The red line is the predicted \(\mu_i\) values (the predicted, expected value in the outcome Y): the regression line. The blue lines are the normal distributions assumed to generate the Y values at each predictor value.

set.seed(14)
x <- rnorm(100,0,1)
y <- rnorm(100,2 + .7*x, .7141428) # y ~ N(mu_i = 2 + .7*x, sigma = .714)
glmOut <- glm(y ~ x) # Maximum likelihood estimation
ml.sigma <- sigma(glmOut)
plot3d(x,y,0,xlab='Predictor',ylab='Outcome')
mu_i <- predict(glmOut)
for(n in 1:length(y)){
  lines3d(x = x[n],y=seq(0,4,by=.1),dnorm(seq(0,4,by=.1),mu_i[n],ml.sigma),col='blue',alpha=.4)
}
lines3d(x=seq(-2,2,by=.01),y=coef(glmOut)[1] + coef(glmOut)[2]*seq(-2,2,by=.01),z=0,col='red')
rglwidget()

Just as in the simpler example above, if the regression coefficient were larger or smaller, the distributions would be located poorly; the black points would too often fall in low-height regions of the blue curves, and the joint probability would be lower. If \(\sigma\) were larger, then all the blue curves would be wider, and their heights too low. If \(\sigma\) were smaller, then the blue curves would be taller, but more points would land in the low-height tails of the curves. So the maximum likelihood solution is the Goldilocks solution: it is “just right.” Assuming the Y values are distributed normally around the values predicted by the regression coefficients, the maximum-likelihood coefficients and \(\sigma\) are the ones that make the observed data most likely.
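For completeness, here is a minimal sketch (our own code, not part of the original example) that estimates the same regression by directly maximizing the likelihood with optim(), mirroring the negLogLik approach used earlier; the function and parameter names (negLogLikReg, b0, b1) are ours. The estimates should land close to the glm() coefficients, and the ML \(\sigma\) will again be slightly smaller than sigma(glmOut) because it divides by \(n\) rather than \(n - 2\).

negLogLikReg <- function(par, x, y){   # Negative log-likelihood of the regression model
  if (par['sigma'] <= 0) return(Inf)   # Keep sigma positive
  mu_i <- par['b0'] + par['b1'] * x    # Predicted mean for each observation
  -sum(dnorm(y, mean = mu_i, sd = par['sigma'], log = TRUE))
}
optim(c(b0 = 0, b1 = 0, sigma = 1), negLogLikReg, x = x, y = y)$par
coef(glmOut)   # Compare: intercept and slope from glm()
sigma(glmOut)  # Compare: residual standard deviation from glm()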

Recap

Whereas OLS is based on minimizing the squared residual error, ML is based on maximizing the joint probability of the observed data. A maximum likelihood estimate is the combination of parameter values that maximizes that joint probability. Regression is just another instance of this. We assume that the Y values are generated probabilistically from some normal distribution whose mean is predicted from X, and the ML solution is the set of coefficients and the standard deviation that together maximize the likelihood of the observed Y values.

Why are you teaching me this?

Excellent question! Because maximum likelihood estimation opens up doors to new questions and new models. Maximum likelihood allows us to expand beyond normally distributed outcome variables, and venture into variables that are distributed in all sorts of weird ways. Logistic regression is a good example of this, and will be discussed briefly below.

Second, different models can be compared based on their respective data likelihoods. If the data are more probable under one model than another, then you might prefer the model with a greater data likelihood. This logic will be discussed in a section below.

Logistic regression

On the course projects and test summaries, some students in the past have mistakenly described the results of a t-test backwards. E.g., if the correct summary includes the sentence “Senior adults scored higher than younger adults”, the backwards summary would be “Those who score higher tend to be senior adults”. These sentences may be linguistically similar, but the test used for each is not the same.

In one case, we are assessing whether a mean differs between two groups, coded 0 (young adults) and 1 (senior adults). This uses a t-test, modeled using a regression model: \(y_i = \bar y_0 + (\bar y_1 - \bar y_0)x_i + \epsilon_i\), with a test on whether \(\bar y_1 - \bar y_0\) is zero. Continuous data are predicted from group membership.

In the other case, we are assessing whether the probability of being young adults (coded 0) or senior adults (coded 1) changes as a function of a continuous variable \(y_i\). Dichotomous data (0s and 1s) are predicted from continuous data. In this latter case, we use logistic regression, which is an example of a generalized linear model.

Logistic regression is extraordinarily useful, and used frequently in biomedicine, health psychology, and other fields wherein some event either occurs (Heart attack = 1) or does not (0). However, it can be used to predict any dichotomous variable, such as whether one is a young adult (0) or senior adult (1).

Ordinary least squares is not well suited to dichotomous outcomes. First, the assumption of normality is obviously not met - people can only score 0 or 1, and nowhere beyond or in between. Second, the residuals do not behave the way OLS assumes: because each score is necessarily either 0 or 1, the “errors” around any predicted value can take only two possible values, so what squared error is there to sensibly minimize?

Maximum likelihood saves the day. In logistic regression, there is a linear model component, and a likelihood component.

The likelihood is defined such that there are two categories (0 and 1). The probability of each observation belonging to category 1 is estimated (\(\pi_i\)), and the probability of belonging to category 0 is its complement (\(1 - \pi_i\)). So if person \(i\) is predicted to have a \(\pi_i = .9\) probability of belonging to group 1, and they are in fact in group 1, then the likelihood of that observation is .9. Conversely, if they are in group 0, the likelihood of that observation is .1.

Probabilities are hard to predict with a linear model, because they are constrained to lie between 0 and 1. Instead, we predict a log-odds value, and convert that into a probability. The log odds is literally the log of the odds of belonging to group 1 vs group 0. This “log odds” unit is called a “logit” (log-odds unit), and is why this technique is called “logistic” regression.

This is a two-step procedure. A linear model predicts the log-odds of belonging to group 1 vs 0. We take those predicted log-odds, convert them to probabilities, and use those probabilities in the likelihood.

\[ \text{log odds} = \log \frac{\pi_i}{1 - \pi_i} = \beta_0 + \beta_1x_i \\ \text{Probability} = \pi_i = \frac{\text{Odds}}{\text{1 + Odds}} = \frac{e^{\beta_0 + \beta_1x_i}}{1 + e^{\beta_0 + \beta_1x_i}} \\ p(y_i|\pi_i) = \begin{cases}\pi_i & \text{if } y_i = 1 \\ 1 - \pi_i & \text{if } y_i = 0\end{cases} \\ LL = \sum \log p(y_i|\pi_i) \] In sum, logistic regression uses the coefficients to predict a log-odds value which, once converted to a probability, is the predicted probability that a given observation belongs to group 1. The coefficients are interpreted on the log-odds scale: The intercept is the log-odds of belonging to group 1 when x is zero; the beta coefficient is the change in log-odds for each one-unit increase in the predictor. To clarify this, let’s work through an example.
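Before the full example, here is a tiny worked sketch of the two-step conversion with made-up numbers (a hypothetical \(\beta_0 = -1\), \(\beta_1 = 2\), and one observation; none of these values come from the example below):

y_i <- 1                                    # This person is observed in group 1
log_odds <- -1 + 2 * 1                      # beta0 + beta1*x_i, with x_i = 1
pi_i <- exp(log_odds) / (1 + exp(log_odds)) # About .73; same as plogis(log_odds)
lik_i <- ifelse(y_i == 1, pi_i, 1 - pi_i)   # Likelihood of this single observation
log(lik_i)                                  # Its contribution to the summed log-likelihood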

Example:

Let’s say we have two continuous predictors (positive emotions, coffee liking) and a dichotomous outcome (Senior = 1, Young = 0). We wish to determine whether the extent of positive emotions predicts whether you are a senior or a young adult. The data look as follows:

ds
##          PosEmot Senior      Coffee
## 1   -0.661849828      0  1.90565357
## 2    1.718954157      1 -0.50151999
## 3    2.121666991      1  4.38486480
## 4    1.497153684      1 -5.77691614
## 5   -0.036140578      1  4.59131718
## 6    1.231945176      1 -1.87364817
## 7   -0.064880768      1 -4.02380238
## 8    1.068993731      1  0.25425071
## 9   -0.376965306      1  4.09989039
## 10   1.043183093      1  2.29256807
## 11  -0.382821881      1  7.38872840
## 12   0.299421596      1  0.44773348
## 13   0.674239758      0 -2.98414928
## 14  -0.292816322      0  6.63285464
## 15   0.488053358      1  2.88688227
## 16   0.882801824      1 -6.65232199
## 17   1.862748980      1  2.30363814
## 18   1.611725288      1  1.89104047
## 19   0.135479541      1 -2.17092663
## 20   1.088086012      1  4.51074829
## 21  -1.266814760      1 -6.59119046
## 22  -0.198583285      1 -1.25669580
## 23   0.138865782      1 -0.73072628
## 24  -0.279336001      1  5.88191397
## 25   0.708919415      0 -3.46359514
## 26  -0.766610461      1  6.10986793
## 27   1.443362947      1  4.21671224
## 28   0.844879300      0  4.12028404
## 29  -0.399370404      0  3.36064343
## 30  -1.427767586      0  0.86785881
## 31  -1.421992452      1 -2.69010230
## 32  -0.328228291      1  0.53039411
## 33   0.284570073      1 -0.28370939
## 34   0.719335881      0 -3.77078188
## 35   0.432415977      1 -4.08812400
## 36  -0.351924768      1  1.12222050
## 37   0.297721428      1  2.17913347
## 38  -0.261432362      1  0.52347901
## 39   1.308689731      1  1.12737757
## 40   0.015870264      1 -1.17092301
## 41  -0.431174368      1 -5.30141249
## 42   0.382446538      0  8.26054261
## 43   0.041125095      1  0.96869216
## 44  -0.059224001      1 -1.39638892
## 45  -1.296423984      0 -2.52324949
## 46  -2.136976377      0  1.13562686
## 47  -0.893614686      0  0.48630795
## 48   0.612732875      0  2.26537644
## 49   0.582971232      0  2.27613159
## 50  -0.005882013      1 -0.36234705
## 51  -1.865448725      0  0.92198598
## 52   1.829984327      1  3.01141514
## 53  -0.991115900      0  3.45354164
## 54  -1.450434624      0  3.35279455
## 55  -0.014824761      0 -7.06340847
## 56   0.536639054      1  2.73397800
## 57  -0.811103766      1  4.65204512
## 58  -0.318324798      1 -4.80240739
## 59   1.111476005      1 12.99812042
## 60  -0.141744714      1 -2.70294293
## 61  -0.205378953      1 -4.85170893
## 62  -0.680795796      1 -1.66820165
## 63  -0.042991742      1 -2.26221201
## 64  -1.161552608      1 -5.95892097
## 65  -0.772692171      1 -4.73398262
## 66  -0.918739132      0 -2.90844848
## 67   0.089153175      1 -2.73737287
## 68  -0.757539131      1 -2.82807316
## 69   1.748081178      1  3.51903461
## 70   0.170460702      1 -2.03947452
## 71   0.825531128      1 -5.17651618
## 72  -0.017155915      1 -0.45250082
## 73   0.492221664      1  3.54550378
## 74   0.355488970      1  0.06818458
## 75   0.112576702      1  3.10416580
## 76   1.137029319      1  0.29182107
## 77  -0.566944843      1 -8.06202318
## 78   0.552616244      1 -4.03080639
## 79  -0.012501730      1 -0.62194684
## 80  -0.571210609      0  3.75416819
## 81  -0.684277142      1  6.25946081
## 82   1.169309216      1  4.34294264
## 83   0.369468674      0  4.87981963
## 84  -0.510734921      0 -1.65322048
## 85  -0.631544836      1 -0.52178285
## 86   1.713277962      1 -0.59863933
## 87  -0.223003222      0  3.30354623
## 88  -0.958897273      0 -0.49367157
## 89   0.037745242      0  5.03387269
## 90  -1.641506242      0 -9.27382903
## 91   0.789993246      1 -2.54603004
## 92  -1.057130371      1  1.23990313
## 93  -1.028925359      0 -8.49029513
## 94   1.100929767      1  2.25253497
## 95  -0.613909980      1  2.41197743
## 96  -0.345882414      1 -4.42785588
## 97   0.085990118      0  2.59777204
## 98  -0.649316293      0  2.57803841
## 99   0.305962791      1 -2.82823518
## 100 -0.771453283      0  5.85078825

A logistic regression model is fit using a generalized linear model (glm) with a maximum likelihood estimator.

glmOut <- glm(data=ds, Senior ~ PosEmot,family=binomial('logit'))
summary(glmOut)
## 
## Call:
## glm(formula = Senior ~ PosEmot, family = binomial("logit"), data = ds)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9950  -0.9775   0.5521   0.8348   1.4042  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.9627     0.2461   3.911 9.18e-05 ***
## PosEmot       1.0419     0.3086   3.376 0.000734 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 122.17  on 99  degrees of freedom
## Residual deviance: 107.82  on 98  degrees of freedom
## AIC: 111.82
## 
## Number of Fisher Scoring iterations: 4

These coefficients are on the log-odds scale. When a positive emotion score is zero, the expected log-odds of being a senior is .9627 (the intercept). For each one-unit increase in positive emotions, there is a 1.0419 log-odds increase (the coefficient) in being a senior adult. For example, if someone has a positive emotion score of 1, then their predicted log-odds is 2.0046, which means their predicted probability of being a senior is \(\frac{e^{2.0046}}{1 + e^{2.0046}} = 0.8813\).
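The same predicted probability can be obtained directly from the fitted model with predict(), using type = 'response' to return the probability scale:

predict(glmOut, newdata = data.frame(PosEmot = 1), type = 'response') # About .88, matching the hand computation above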

The relationship between positive emotions and senior adult membership can be plotted, one on the logit scale and the other on the probability scale.

par(mfrow=c(1,2))
logit_to_odds <- function(logit) exp(logit)     # Helper: log-odds -> odds
odds_to_prob  <- function(odds) odds/(1 + odds) # Helper: odds -> probability
curve(coef(glmOut)[1] + coef(glmOut)[2]*x, -4,4,xlab='Pos. Emotions',ylab='Log-odds (logit) of Senior Adult')
curve(odds_to_prob(logit_to_odds(coef(glmOut)[1] + coef(glmOut)[2]*x)),-4,4,xlab='Pos. Emotions',ylab='Probability of Senior Adult',ylim=c(0,1))

From this, we can say that the more positive emotions one reports, the greater the probability that one is a senior adult, and not a young adult. The relationship between positive emotions and senior-adulthood is positive, and significant.

Model comparison

We see that we can predict senior adulthood from positive emotions. Would adding a predictor help? Could knowing how much they like coffee predict senior adulthood over and above just knowing positive emotions?

In ordinary multiple regression, we can test this using hierarchical regression. In that procedure, we estimate two models: model A with one predictor, and model B with both predictors. The question is whether B (the fuller model) explains variance over and above model A (the simpler model). You typically compute the \(\Delta R^2\) and the \(\Delta F\), and see whether \(\Delta F\) is significant.
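For reference, a hypothetical sketch of that OLS comparison (simulated data and model names of our own, not the ds data used below):

set.seed(1)
simDat <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
simDat$out <- 1 + .5 * simDat$x1 + rnorm(50)  # Outcome related to x1 only
modelA <- lm(out ~ x1, data = simDat)         # Simpler model
modelB <- lm(out ~ x1 + x2, data = simDat)    # Fuller model
anova(modelA, modelB)                         # Delta-F test: does x2 explain additional variance?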

In logistic regression, we don’t have “variance” to explain in the same way, so this method will not work for us. Instead, we can ask: “Are the data sufficiently more likely under model B than under model A?” For this, we can compute a likelihood ratio test.

Likelihood ratio test

The likelihood ratio test is fairly simple:

  1. Compute the log-likelihood of model A
  2. Compute the log-likelihood of model B
  3. Compute \(\lambda = -2(LL_A - LL_B) = -2\log\frac{L_A}{L_B}\)
  4. Compute degrees of freedom: \(df = \text{Number of parameters in B} - \text{Number of parameters in A}\)
  5. Compute the p-value for \(\lambda\), from a \(\chi^2\) distribution with \(df\) degrees of freedom.

If \(p < \alpha\), then the data are significantly more likely under the more complex model B than under the simpler model A. Thankfully, R can compute this for us.

Example:

glmOut2 <- glm(data=ds,Senior ~ PosEmot + Coffee, family=binomial('logit'))
anova(glmOut,glmOut2,test = 'LRT')
## Analysis of Deviance Table
## 
## Model 1: Senior ~ PosEmot
## Model 2: Senior ~ PosEmot + Coffee
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
## 1        98     107.82                       
## 2        97     104.98  1   2.8415  0.09186 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In this case, the addition of the coffee liking predictor did not significantly improve the likelihood of the data. Therefore, we can stick with the simpler model that only includes positive emotion as a predictor.
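For those curious, the same test can be reproduced by hand from the two log-likelihoods (a check of our own, using the two models fit above):

lambda <- as.numeric(-2 * (logLik(glmOut) - logLik(glmOut2))) # Should match the Deviance column above (2.8415)
pchisq(lambda, df = 1, lower.tail = FALSE)                    # Should match the Pr(>Chi) value above (.09186)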

Information Criterion

A very similar method for comparing two models is to use an “information criterion”. Information criteria are all used similarly: Choose the model with the lowest information criterion. Lower is better; -1200 is better than -1100, and 10 is better than 15. They also have similar goals: Information criteria provide some metric of model performance while penalizing for model complexity. Several ICs exist (AIC, AICc, BIC, SABIC, WAIC, DIC, LOOIC, to name a few), but we will focus on the most common: the Akaike Information Criterion.

The Akaike Information Criterion (AIC) is used to choose the model that should perform best at out-of-sample prediction. This means that if you were to collect a new sample with coffee-liking and positive emotion scores, the estimated model with the lowest AIC should predict whether those people are senior or young adults better than the models with higher AIC values.

Its computation is simple: \[ AIC = -2LL + 2p \] where \(p\) is the number of parameters estimated (i.e., regression coefficients + intercept).

You fit multiple models, and compute their respective AIC values. The model with the lowest AIC is expected to perform the best at predicting new data, and therefore is chosen over the other models. Although a formal test could be conducted on the AIC (it amounts to a likelihood ratio test penalized for model complexity), the typical use is to simply choose the model with the lowest AIC.
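As a quick check on that formula (our own computation, using the logistic model fit earlier), the AIC can be built by hand from logLik():

-2 * as.numeric(logLik(glmOut)) + 2 * 2 # Two parameters (intercept + PosEmot slope); about 111.82, matching the summary above and AIC() below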

Example

AIC(glmOut,glmOut2)
##         df      AIC
## glmOut   2 111.8230
## glmOut2  3 110.9815

In this case, the AIC of the model with coffee-liking as a predictor is lower than that of the model without coffee-liking, and therefore the former should be chosen as the better model for later predictions. This conflicts with the likelihood ratio test above, but the two have distinct goals: The AIC method chooses the model that should be better at out-of-sample prediction; the likelihood ratio test is simply a test of whether the observations are significantly more likely under the more complex model than under the simpler one. Therefore: The observed data are not significantly more likely under the more complex model, but the more complex model may still be better at out-of-sample prediction of whether someone is a young adult or a senior.