Throughout this course, you all have learned about estimating linear models through ordinary least squares (OLS). Assuming a linear model: \[
y = X\beta + \epsilon
\] the goal of OLS is to find the values of \(\beta\) that minimize the sum of squared errors, where \(\epsilon = y - X\beta\). If you’ve taken calculus, the following derives the least-squares estimator of the slope in simple regression (assuming \(x_i, y_i\) are centered): \[
\epsilon_i^2 = (y_i - \beta_0 - x_i\beta)^2 \\
\sum\epsilon_i^2 = \sum(y_i - \beta_0 - x_i\beta)(y_i - \beta_0 - x_i\beta) \\
= \sum (y_i^2 - 2y_i\beta_0 - 2y_ix_i\beta + \beta_0^2 + 2\beta_0x_i\beta + x_i^2\beta^2) \\
= (n-1)\sigma^2_y -2\beta \sigma^2_{x,y}(n-1) + (n-1)\beta^2\sigma^2_x + n\beta^2_0 + 0 + 0 \left[\sum x_i = \sum y_i = 0 \right]\\
\frac{d(sse)}{d\beta} = -2(n-1)\sigma^2_{x,y} + 2(n-1)\sigma^2_x\beta \left[\text{Derivative}\right]\\
0 = -\sigma^2_{x,y} + \sigma^2_x\beta \left[\text{Find where error is at minimum}\right]\\
\frac{\sigma^2_{x,y}}{\sigma^2_x} = \beta
\] The OLS estimator is based on the principle of finding the parameter values that minimize the squared prediction error. OLS is *one* of several techniques for estimating parameters from observed data. Broadly speaking, two other approaches operate on different principles: maximum likelihood and Bayesian estimation. Only maximum likelihood (ML) will be discussed here.
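
As a quick numerical check of the derivation above, the sketch below simulates some data and compares the ratio of the covariance to the variance of \(x\) against the slope estimated by `lm()`. The seed, sample size, true slope, and variable names are all assumptions made for illustration.

```
set.seed(1)                      # Arbitrary seed, for reproducibility
n <- 200                         # Hypothetical sample size
x <- rnorm(n, 0, 1)
y <- 0.5 * x + rnorm(n, 0, 1)    # Hypothetical true slope of 0.5

# Center both variables, as assumed in the derivation
x_c <- x - mean(x)
y_c <- y - mean(y)

beta_formula <- cov(x_c, y_c) / var(x_c)      # Covariance over variance
beta_lm <- unname(coef(lm(y_c ~ x_c))[2])     # Slope estimated by lm()

all.equal(beta_formula, beta_lm)              # Compares the two, up to numerical tolerance
```

The two estimates should agree because `lm()` also uses least squares; the closed-form ratio is simply what that minimization reduces to in the single-predictor case.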

Whereas OLS operates on the goal of minimizing the squared prediction error, maximum likelihood operates on the goal of *maximizing* the *likelihood* of the data. In ML, the data are assumed to be modeled by some probability distribution, like the normal distribution (but there are hundreds of others). The goal of ML is to find the combination of parameters for this distribution that would make the data *most likely*. The combination of parameter values that would maximize the likelihood of our observations is taken to be the best estimate of the true parameter values.
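
To make the idea concrete, one common way to write it down is sketched below. It assumes the \(n\) observations are independent, and the symbols \(\theta\) (the parameters), \(f\) (the assumed density), and \(L\) (the likelihood) are notation introduced here for illustration: \[
L(\theta \mid y) = \prod_{i=1}^{n} f(y_i \mid \theta) \\
\log L(\theta \mid y) = \sum_{i=1}^{n} \log f(y_i \mid \theta) \\
\hat{\theta}_{ML} = \arg\max_{\theta} \log L(\theta \mid y)
\] Because the logarithm is monotonically increasing, the parameter values that maximize the log-likelihood also maximize the likelihood itself, and the sum is usually easier to work with than the product.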

As an example, let’s assume that data are generated from a normal distribution with unknown mean and standard deviation parameters. The goal is to find the mean (\(\mu\)) and standard deviation (\(\sigma\)) of a normal distribution that would *maximize* the joint probability (density) of the observations. The code below simulates 100 observations from a normal distribution in which the true \(\mu = 0, \sigma = 1\). Three candidate normal distributions are then plotted: one using the sample mean and standard deviation (black), one with \(\mu = 0, \sigma = 2\) (red), and one with \(\mu = 1, \sigma = 1\) (blue).

```
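set.seed(13)
y <- rnorm(100, 0, 1) # Random sample of 100 observations; true mu = 0, sigma = 1
# Candidate 1 (black): sample mean and standard deviation
curve(dnorm(x, mean(y), sd(y)), -5, 5, xlab = 'Y', ylab = 'Density')
# Candidate 2 (red): mu = 0, sigma = 2
curve(dnorm(x, 0, 2), add = TRUE, col = 'red')
# Candidate 3 (blue): mu = 1, sigma = 1
curve(dnorm(x, 1, 1), add = TRUE, col = 'blue')
rug(y, side = 1) # Mark the observed values along the x-axis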
```
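
The three curves can also be compared numerically. The sketch below assumes the observations are independent, so the joint log-likelihood is the sum of the individual log-densities. The helper `loglik_normal` and the labels `black`, `red`, and `blue` are hypothetical names chosen to match the plot colors.

```
set.seed(13)
y <- rnorm(100, 0, 1)   # Same simulated sample as in the plot

# Joint log-likelihood of the sample under a normal distribution
# with the given mean (mu) and standard deviation (sigma)
loglik_normal <- function(mu, sigma) sum(dnorm(y, mu, sigma, log = TRUE))

ll <- c(
  black = loglik_normal(mean(y), sd(y)),  # Sample mean and standard deviation
  red   = loglik_normal(0, 2),
  blue  = loglik_normal(1, 1)
)

ll                    # Higher (less negative) values mean the data are more likely
names(which.max(ll))  # Candidate with the highest log-likelihood
```

Of the three candidates, the curve built from the sample mean and standard deviation should have the highest log-likelihood, since those values sit very close to the ones that maximize the normal likelihood for this sample.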