Why (N-1) vs N [Long post]

I have a lot of students ask why we divide by (N - 1) vs (N) when computing variance. There are of course canned answers, like “Using N-1 unbiases the estimate” or “Using N-1 inflates the estimate, and it’s more accurate” or “Using N results in an underestimate”.

These are all true, but they’re deeply unsatisfying. I remember learning the more precise reason in the first semester of my PhD program, but the question came up again and I decided to write it out myself as a reference. This will be fairly basic stuff for many, I’m sure, but for the sake of my students and my future self, here it is:

First, distinguish between an estimator and its properties in expectation.
An estimator is a method, built around a defined goal, that produces the best estimate with respect to that goal.

Let’s say we are estimating the $\mu$ and $\sigma$ parameters to a normal distribution.
Let’s further say that we want a point estimate, and we’re not Bayesians (for now…), nor do we have any penalties.
One estimator is the maximum likelihood estimator, for obtaining the maximum likelihood estimate (MLE).

The MLE is the estimate that maximizes the joint probability density of the observed data, assuming some probability distribution generated the data.
In our case, we have the normal distribution, which looks like:
p(x | \mu, \sigma) = \sigma^{-1}(2\pi)^{-.5}e^{-(2\sigma^2)^{-1}(x - \mu)^2}
When we have more than one x, collected in a vector called $X$ (capitalized), and we assume the observations are independent, the joint density is:
p(X | \mu, \sigma) = \prod_{i = 1}^N \sigma^{-1}(2\pi)^{-.5}e^{-(2\sigma^2)^{-1}(x_i - \mu)^2}

This is a pain to compute with, and the log-likelihood is much easier.

f(X | \mu, \sigma) = -N\log(\sigma) - .5N\log(2\pi) - (2\sigma^2)^{-1}\sum_{i = 1}^N (x_i - \mu)^2
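
To make the “pain to compute with” point concrete, here is a minimal Python sketch. The function names, the seed, and the parameter values (mean 10, standard deviation 2) are my own choices for illustration. It checks that the log of the product of densities matches the log-likelihood formula above on a small sample, and shows that the raw product underflows on a larger one while the log-likelihood stays finite.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of one observation under a normal(mu, sigma)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def log_likelihood(xs, mu, sigma):
    """Log-likelihood of an independent normal sample, term by term as in the formula."""
    n = len(xs)
    sum_sq = sum((x - mu) ** 2 for x in xs)
    return -n * math.log(sigma) - 0.5 * n * math.log(2 * math.pi) - sum_sq / (2 * sigma ** 2)


rng = random.Random(1)  # arbitrary seed, only for reproducibility

# Small sample: the product of densities is still representable.
small = [rng.gauss(10.0, 2.0) for _ in range(20)]
joint_small = math.prod(normal_pdf(x, 10.0, 2.0) for x in small)
print(math.log(joint_small), log_likelihood(small, 10.0, 2.0))  # the two should agree

# Large sample: every density is below 0.2, so the product underflows to 0.0.
large = [rng.gauss(10.0, 2.0) for _ in range(5000)]
joint_large = math.prod(normal_pdf(x, 10.0, 2.0) for x in large)
print(joint_large)
print(log_likelihood(large, 10.0, 2.0))  # still a finite number
```

The product is a likelihood you cannot even take the log of once it hits zero, which is one practical reason to work with the sum of logs from the start.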

Now, using this log-likelihood, we just need to see which value of $\mu$ and which value of $\sigma$ maximize the function output, the log-likelihood.
Because the log function is a monotonic function, maximizing the log-likelihood is the same as maximizing the likelihood.

To do this, we need to take the derivative of the log-likelihood with respect to each parameter at a time. Ignoring the notation for partial derivatives for the moment, the derivatives are as follows:

f(X|\mu,\sigma) = \ldots -(2\sigma^2)^{-1}\sum(x_i^2 - 2x_i\mu + \mu^2) \\
= \ldots -(2\sigma^2)^{-1} \left ( \sum x_i^2 - 2\mu\sum x_i + N\mu^2 \right ) \\
= \ldots - \frac{\sum x_i^2}{2\sigma^2} + \frac{\mu}{\sigma^2}\sum x_i - N\mu^2(2\sigma^2)^{-1} \\
\frac{d f(X|\mu,\sigma)}{d\mu} = 0 + \frac{\sum x_i}{\sigma^2} - N\mu(\sigma^2)^{-1} \\
0 = \frac{d f(X|\mu,\sigma)}{d\mu} \\
0 = \sum x_i - N\mu \\
\mu = \frac{\sum x_i}{N}

I’ll call that $\hat\mu$, since it’s an estimate that maximizes the likelihood of the observations, and not the “true” $\mu$ parameter value.

Onto $\sigma$:
f(X | \mu, \sigma) = -N\log(\sigma) - .5N\log(2\pi) - (2\sigma^2)^{-1}\sum_{i = 1}^N (x_i - \mu)^2 \\
\frac{df}{d\sigma} = -N\sigma^{-1} - 0 + \sigma^{-3}\sum(x_i - \mu)^2 \\
0 = -\sigma^{-1}\left (N - \sigma^{-2}\sum(x_i - \mu)^2 \right ) \\
\sigma^{-2}\sum(x_i - \mu)^2 = N \\
\sigma^2 = \frac{\sum(x_i - \mu)^2}{N}

I’ll call that $\hat\sigma^2$ (with $\hat\mu$ plugged in for $\mu$), since it’s an estimate that maximizes the likelihood of the observations, and not the “true” $\sigma^2$ parameter value.

So now we have two estimators, derived from maximum likelihood goals:
\hat\mu = \frac{\sum x_i}{N} \\
\hat\sigma^2 = \frac{\sum (x_i - \hat\mu)^2}{N}
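
As a sanity check on the two estimators, here is a small Python sketch. The helper names, the seed, and the simulated data are assumptions of mine. It computes the sample mean and the divide-by-N variance, then confirms that nudging either value away from the closed-form answer never raises the log-likelihood. It also compares the variance against `statistics.pvariance`, which divides by N.

```python
import math
import random
import statistics


def log_likelihood(xs, mu, sigma):
    """Log-likelihood of an independent normal sample."""
    n = len(xs)
    sum_sq = sum((x - mu) ** 2 for x in xs)
    return -n * math.log(sigma) - 0.5 * n * math.log(2 * math.pi) - sum_sq / (2 * sigma ** 2)


def mle_normal(xs):
    """Closed-form maximum likelihood estimates: sample mean and divide-by-N variance."""
    n = len(xs)
    mu_hat = sum(xs) / n
    var_hat = sum((x - mu_hat) ** 2 for x in xs) / n
    return mu_hat, var_hat


rng = random.Random(7)  # arbitrary seed
xs = [rng.gauss(5.0, 3.0) for _ in range(50)]

mu_hat, var_hat = mle_normal(xs)
best = log_likelihood(xs, mu_hat, math.sqrt(var_hat))

# Evaluate the log-likelihood on a small grid of perturbed (mu, sigma) values.
perturbed = [
    log_likelihood(xs, mu_hat + d_mu, math.sqrt(var_hat) * scale)
    for d_mu in (-0.1, 0.0, 0.1)
    for scale in (0.9, 1.0, 1.1)
]

print(all(ll <= best for ll in perturbed))               # no perturbation beats the MLE
print(math.isclose(var_hat, statistics.pvariance(xs)))   # pvariance also divides by N
```

This is only a spot check on a grid, not a proof; the derivative argument above is what actually establishes the maximum.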

Now the question, what are the expected values of these parameter estimates, and are they equal to the desired parameter values?

Let me just take a shortcut and say the expected value of $\hat\mu$ is indeed $\mu$.
The $\hat\mu$ estimate still varies around $\mu$, and that variance is derived as follows.
Assume $E(x_i) = \mu$, $E(\sum x_i) = \sum E(x_i)$, and $E(cX) = cE(X)$, which are some basic expectation rules, plus independence of the $x_i$, so that $E((x_i - \mu)(x_j - \mu)) = 0$ for $i \neq j$; the rest follows.

E(\hat\mu - \mu)^2 = E(\hat\mu^2 - 2\hat\mu\mu + \mu^2) \\
= E(\hat\mu^2) - 2\mu E(\hat\mu) + \mu^2 \\
= E(\frac{(\sum x_i)^2}{N^2}) - 2\mu E(\frac{\sum x_i}{N}) + \mu^2 \\
= \frac{1}{N^2}E((\sum x_i)^2) - \mu^2 \\
= \frac{1}{N^2}E\left (\sum (x_i - \mu + \mu) \right )^2 - \mu^2 \\
= \frac{1}{N^2}E\left (\sum (x_i - \mu) + N\mu \right )^2 - \mu^2 \\
= \frac{1}{N^2} E \left ( (\sum (x_i - \mu))^2 + 2N\mu\sum(x_i - \mu) + N^2\mu^2 \right ) - \mu^2\\
= \frac{1}{N^2} E \left ( \sum (x_i - \mu)^2 + \sum_{i \neq j}(x_i - \mu)(x_j - \mu) + 2N\mu\sum(x_i - \mu) + N^2\mu^2 \right ) - \mu^2\\
= \frac{1}{N^2} (N\sigma^2 + 0 + 0 + N^2\mu^2) - \mu^2\\
= \frac{\sigma^2}{N} = \sigma^2_\mu
Voila, the variance of the $\hat\mu$ estimator, defined as $E(\hat\mu - \mu)^2$, is equal to $\frac{\sigma^2}{N}$.
Just as our textbooks tell us.
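
If you would rather see this than trust the algebra, a quick simulation does the job. This is a minimal sketch; the function name, sample size, seed, and number of replications are assumptions I picked for illustration. It draws many samples, computes the sample mean of each, and averages the squared distance from the true mean.

```python
import random
import statistics


def simulate_mean_variance(n, mu, sigma, reps, seed=0):
    """Monte Carlo estimate of E(mu_hat - mu)^2 for samples of size n."""
    rng = random.Random(seed)
    squared_errors = []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        mu_hat = sum(sample) / n
        squared_errors.append((mu_hat - mu) ** 2)
    return statistics.fmean(squared_errors)


n, mu, sigma = 10, 0.0, 2.0
simulated = simulate_mean_variance(n, mu, sigma, reps=20000)
theoretical = sigma ** 2 / n  # sigma^2 / N from the derivation

# The two numbers should be close, up to Monte Carlo noise.
print(simulated, theoretical)
```

With more replications the simulated value settles closer to $\frac{\sigma^2}{N}$; it will never match exactly.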

With that out of the way, what is the expected value of $\hat\sigma^2$?

\hat\sigma^2 = \frac{\sum(x_i - \hat\mu)^2}{N} \\
E(\hat\sigma^2) = \frac{1}{N}E(\sum((x_i - \mu) - (\hat\mu - \mu))^2) \\
= \frac{1}{N}E(\sum (x_i - \mu)^2 - 2\sum(x_i - \mu)(\hat\mu - \mu) + \sum(\hat\mu - \mu)^2) \\
= \frac{1}{N}\left (N\sigma^2 - 2\sum E((x_i - \mu)(\hat\mu - \mu)) + N\sigma^2_\mu \right ) \\
= \frac{1}{N}\left (N\sigma^2 - 2\sum E(\hat\mu x_i - \mu x_i - \mu\hat\mu + \mu^2) + N\sigma^2_\mu \right ) \\
= \sigma^2 + \sigma^2_\mu - \frac{2}{N}E(N\mu^2 - N\mu\hat\mu - N\mu\hat\mu + N\hat\mu^2) \\
= \sigma^2 + \sigma^2_\mu - 2E(\hat\mu - \mu)^2\\
= \sigma^2 - \sigma^2_\mu \\
E(\hat\sigma^2)= \sigma^2 - \frac{\sigma^2}{N} = \sigma^2(1 - \frac{1}{N}) = \sigma^2(\frac{N-1}{N}) \\
\sigma^2 = E(\hat\sigma^2)\frac{N}{N-1}
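
The same result can be checked by simulation. The sketch below is my own illustration, with an assumed function name, seed, and a deliberately small sample size (N = 5) so the bias is easy to see. It averages the divide-by-N and divide-by-(N-1) estimates over many samples and prints them next to $\sigma^2\frac{N-1}{N}$ and $\sigma^2$.

```python
import random


def average_variance_estimates(n, mu, sigma, reps, seed=0):
    """Average the divide-by-N and divide-by-(N-1) variance estimates over many samples."""
    rng = random.Random(seed)
    total_div_n = 0.0
    total_div_n_minus_1 = 0.0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        mu_hat = sum(sample) / n
        sum_sq = sum((x - mu_hat) ** 2 for x in sample)  # deviations around the sample mean
        total_div_n += sum_sq / n
        total_div_n_minus_1 += sum_sq / (n - 1)
    return total_div_n / reps, total_div_n_minus_1 / reps


n, sigma = 5, 2.0
biased, corrected = average_variance_estimates(n, mu=0.0, sigma=sigma, reps=40000)

print(biased, sigma ** 2 * (n - 1) / n)  # average MLE vs sigma^2 (N-1)/N
print(corrected, sigma ** 2)             # average corrected estimate vs sigma^2
```

The divide-by-N average should sit near 3.2 here and the corrected one near 4, up to simulation noise.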

  1. $E(\hat\sigma^2)= \sigma^2 - \sigma^2_\mu$ is really interesting. It’s saying the expected value of our estimate of the variance is equal to the true variance, minus the error variance of the mean estimator. Conversely, you can say that the true variance is equal to the expected observed variance plus the error variance in the mean estimate. That already should intuitively suggest to you that the reason the MLE underestimates $\sigma^2$ is that it fails to account for the variability in the mean used to compute the sample variance. Scores not only vary around the mean, but the mean varies around the parameter, and by neglecting that latter variability, we underestimate the population variance.
    From an ordinary least squares perspective, you can think of it this way: The mean is the value for which the average squared deviation from it is the smallest. Already, the mean is minimizing the variability. But because the mean does not equal the population parameter, that variability will be minimized around the incorrect point.
  2. The $\frac{N}{N-1}$ adjustment merely alters the estimated value so that the expected value will equal the true value. It does so basically by indirectly adding back some error variance due to the estimation of the mean itself. And if that fraction doesn’t look familiar to you, maybe this helps: $\hat\sigma^2 \frac{N}{N-1} = \frac{N\sum (x_i - \hat\mu)^2}{N(N-1)} = \frac{\sum (x_i - \hat\mu)^2}{N-1}$
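
The least squares point in item 1 can be checked on a single sample. The sketch below uses made-up parameter values and a seed of my choosing. It compares the sum of squared deviations around the sample mean with the sum around the true mean, and checks the exact decomposition $\sum (x_i - \mu)^2 = \sum (x_i - \hat\mu)^2 + N(\hat\mu - \mu)^2$, which is the sample-level version of “true variance equals observed variance plus error in the mean”.

```python
import math
import random

rng = random.Random(3)  # arbitrary seed
mu, sigma, n = 100.0, 15.0, 8  # illustrative "true" parameters and sample size
xs = [rng.gauss(mu, sigma) for _ in range(n)]

mu_hat = sum(xs) / n
ss_around_true_mean = sum((x - mu) ** 2 for x in xs)
ss_around_sample_mean = sum((x - mu_hat) ** 2 for x in xs)
gap = n * (mu_hat - mu) ** 2  # squared error of the mean, counted once per observation

# The sample mean minimizes the sum of squares, so it can never exceed the other sum.
print(ss_around_sample_mean <= ss_around_true_mean)

# The shortfall is exactly N * (mu_hat - mu)^2.
print(math.isclose(ss_around_true_mean, ss_around_sample_mean + gap))
```

Taking expectations of both sides of that decomposition gives $N\sigma^2 = N E(\hat\sigma^2) + N\sigma^2_\mu$, which is the result from the derivation.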

And that’s basically it. I probably have minor mistakes in the above, as this was all hastily written. But that’s the gist: using (N-1) as the divisor ultimately comes from the fact that our estimator is not expected to equal the true variance parameter. The estimated sample variance is off by a factor of $\frac{N-1}{N}$ in expectation, so we just adjust it so the expected value with the adjustment does equal the true variance.
