Maximum Likelihood Lecture for Undergraduates

This semester I was permitted to give one brief lecture on maximum likelihood in the advanced statistics undergraduate course.

The course focuses primarily on the linear model, so we talk about OLS linear models, and variance-based approaches to model selection.
I have free reign over three lectures each semester, and my final lecture was on maximum likelihood approaches, and how we can use them.

The knitted lecture notes are embedded below. Note: I know some content is incorrect; I throw the term “likelihood” around pretty freely and incorrectly, in order to focus on the intuition rather than the correctness of that term. So before anyone comments on how I used the term likelihood incorrectly — I know I did. It was intentional, and with only one lecture to spend, it wasn’t particularly worth it to me to go into what precisely likelihood means, or why the $p(Y|\mu,\sigma)$ is not Fisher-friendly (because it treats parameters as random values to be conditioned on, even if mathematically they are the same).

Anyway, these notes got good reviews from undergraduates who had previously never seen or heard of maximum likelihood; even a couple of graduate students thought the visualizations were helpful to them.
The undergraduates were also given a fairly basic homework assignment wherein they fit logistic regression models, computed and used LRT and AIC for model selection, and made predictions (e.g., if high in twitter addiction, the model would suggest a high probability of being a graduate student).

Yes, collect more data and pool them.

A question arose on one of the numerous facebook-methods groups. The question, in essence was: I have a sample of 30 from a dissertation; I want to collect 30 more observations to have better [sic] results for a publication. Should I use a bayesian analysis with the prior informed by the first 30, or should I just use an uninformed prior and completely pool the data? My response: Yes, collect more data and just completely pool the data.


Partial pooling is great, and I encourage it nearly all of the time. However, in this case there was just an arbitrary break in the data collection from the same process, so I think complete pooling is fine (also, partial pooling with only two samples seems pointless, but I could be wrong).
You could of course use the posterior from the first 30 as the prior for the second 30, but this winds up being the same thing as just completely pooling the data, assuming the observations are independently sampled from the same process:
p(\theta|y_1) &= \frac{p(y_1 | \theta)p(\theta)}{p(y_1)} \\
p(\theta, y_1, y_2) &= p(y_2 | \theta, y_1)p(\theta|y_1)p(y_1) \\
p(\theta|y_1,y_2) &\propto p(y_2 | \theta, y_1)p(y_1|\theta)p(\theta) \\
p(\theta|y_1,y_2) &\propto p(y_1,y_2|\theta)p(\theta) \\
p(\theta|y_1,y_2) &\propto p(y|\theta)p(\theta)
So – Just to save time, I suggested to just pool the data.

Another user was a bit suspicious though.
This user said that if you collect 30 observations which suggest something interesting enough that you should continue collecting observations, then this data-dependent choice will bias your results.

I disagree! This is very close to the problem of multiple stopping inflating the probability of a wrong decision, but is very different for a special reason: The data are not stopped if they hit “significance”, but rather continued if they hit “significance”. If you stop upon hitting significance, yes, you will bias your result and inflate error rates, if you care for that sort of thing.
But if you continue after hitting significance for some fixed N, the estimate thereafter is less biased, and if the “null” is true, then you will actually reduce the error rate.

The reason it is less biased is simply that if you stopped after significance with an N = 30, then your estimate is, for most effect sizes, necessarily an overestimate. It must be an extreme estimate in order to pass a threshold, if the study is underpowered (which, for most effect sizes, N=30 is underpowered). If you then sample 30 more, your estimate will recede toward the mean, and is not likely to become more extreme, but less extreme. Gaining more observations means gaining a better representation of the population, and you are far more probable to observe new values near the typical set than anything more extreme than you already observed.

The reason you have less type 1 error (of course, assuming the nil-null is “true”) is for similar reasons – With an N=30, you still have a $\alpha$ chance of a t1 error. Given that t1 error, sampling $N_\Delta = 30$ more observations will draw your chance-extreme estimate back toward the true mean of zero. In order to make a t1 error after $N_\Delta$ more, think about what would need to occur: With some $Z_n$ statistic for the first $N$ sample, what is the $N_\Delta$ sample needed to maintain at least a $Z_\alpha$ for rejection in the same direction? Or – What sample would you need to keep being significant purely by chance, and what is the probability of obtaining that sample from a nil-null sampling distribution? It’s small!


When the null is true

Let’s test it via simulation.
The following procedure is simulated 100,000 times.

  1. Collect N=30 from Null.
  2. If significant, collect 30 more.
  3. Report $p$, the estimate, and the n (30 or 60).
  • Type 1 error: .01641
  • Expected effect estimate: .00042
  • Effect estimate conditioned on significance: -.01311

Note that this conditional effect should be biased, but due to the 2-tailed test, the biases are canceling each other out.

When the true effect is .4

  • Power: .54841
  • Expected effect estimate: .36552
  • Effect estimate conditioned on significance: .46870
  • Effect estimate when conditioned on obtaining $N_\Delta = 30$ more after initial significance: .46207

The confirmatory approach

But what if we only did as the other user suggested instead: Ignore the first sample, just collect an N=30 sample as a confirmatory study?
That is – Collect N=30, if significant, collect another N=30 and analyze the latter one by itself. We’ll only simulate from the latter one — Just simulating from the distributions.

When the null is true

  • Type 1 error: .05032
  • Expected effect estimate: -.00070
  • Conditional effect estimate: .00019

Note that the conditional effect estimate really should be biased — It’s just that I’m using a 2-tailed test, and the biases are canceling each other out. The expected positive effect is about .415.

When the true effect is .4

  • Power: .56336
  • Expected effect estimate: .39981
  • Conditional effect estimate: .52331


From the above, I recommend that if you have a sample, and you find something interesting in it, do not be afraid to collect a larger sample and include your previous data in it. If the “null” is true, the error rate decreases due to regression toward the mean. If the null is false, you will have a less biased estimate. Ideally, you should collect enough data that any initial sampling-error induced extremeness will shrink away.
Yes, there’s a chance that you could “lose” significance — That’s probably for a good reason though, given the simulations above; but there’s a pretty strong chance you would lose significance anyway. In the 30 + 30 case, the power was about .54 overall; in the 30 first, 30 again confirmatory case, the power is about .56 each time – So the chance of you getting two samples that support your claim is about .32.


  • Do not collect data until you hit significance, then turn off your data collector. You will obtain overestimates and drastically alter your t1 error rate; under extreme conditions, there is a 100% chance of rejecting the null even if it’s true under this procedure.
  • Do collect more data after finding something interesting, because you will improve or maintain your power and decrease the significance-conditioned bias in your estimate. That’s probably a good rule of thumb anyway – Once you find something interesting even in adequately powered studies, collect a bit more just to shimmy toward the generative parameter a bit. Type 1 error is inflated a bit relative to two independent tests, but not by much — In one case it’s about .0025, and in the other it’s .01; not enough to overwhelm the benefit in estimate accuracy and power. The more N you collect after the first set, the lower this is (which is counter-intuitive, given the p-value asymmetry, but it’s due to joint t1 errors: Those non-sig aren’t continued; those significant then gain more N, but samples are more likely to be near zero than extreme enough to maintain the already improbable t1 error).
  • Do this with enough N to overcome any sampling-induced extreme estimates — As in, don’t collect 30, get significance, then collect 1 more to say the estimate is more accurate, that is silly. You can play around with the code below to see how much N to add to reduce conditional estimates. All things being equal, more N is better N.
  • Do this with a fixed $N_\Delta$, rather than doing it repeatedly until you hit significance again, otherwise you wind up with the conditional stops problem. Basically, don’t switch your rule from “stop once an N is hit” to “stop once an N is hit and significance is found again”.
  • Do partially pool if you can, and you probably can.


I wanted to clarify: I don’t like NHST! And I know Bayesian inference doesn’t have nor care about error rates for inference. I know this. Do not yell at me. The above is in response to the idea that pooling data with “interesting” (significant, probable, whatever you want) results with a new dataset from the same process would bias your results and/or inflate error rates — I am playing a frequentist for a moment and saying, no they don’t (or, it’s less biased with better error properties).

sim_study <- function(n1,n2,mu,sigma,alpha=.05){
  d <- rnorm(n1,mu,sigma)
  lmOut <- lm(d ~ 1)
  lmSum <- summary(lmOut)
  p <- lmSum$coefficients[1,4]
  eff <- lmSum$coefficients[1,1]

  # Comment until you see #%%% If you don't want to sample more
  if(p <= alpha) {
    d <- c(d,rnorm(n2,mu,sigma))
    lmOut <- lm(d ~ 1)
    lmSum <- summary(lmOut)
    p <- lmSum$coefficients[1,4]
    eff <- lmSum$coefficients[1,1]

# Null
reps.null <- t(replicate(100000,expr = sim_study(30,300,0,1)))

mean(reps.null[,'p'] < .05)
mean(reps.null[reps.null[,'p'] < .05,'eff'])

# Not null
reps <- t(replicate(100000,expr = sim_study(30,30,.4,1)))

mean(reps[,'p'] < .05)
mean(reps[reps[,'p'] < .05,'eff'])

A short thought on the inconsistency of simplifications

While grading tests today for our undergraduate advance stats course, I had a funny (I think, anyway) thought about the inconsistency of model simplifications.

The course is focused on linear models. It’s “advanced” statistics for this reason (advanced is… relative). So they learn to do independent samples t-tests, dependent samples-t tests, ANOVA all within a linear model framework.
Then of course, they learn about [normal-assumptive] multiple regression, and we teach them to think about statistics themselves as being a super simple linear model of sorts ($\bar x_{0,s,p} = \mu_{0,p} + \epsilon_{0,s,p}$).

Anyway, on the test, they have to interpret a paired-samples “t-test” (a linear model on paired score differences). I had a realization about something, that I think highlights why I don’t particularly care about model simplification, and some inconsistent practices we seem to engage in.

When the analyst knows that scores are paired, and they want to conduct a t-test, they wind up using a dependent samples t-test which explicitly includes a covariance in the computation of the standard error (It’s twice subtracted out).
However, I don’t know if anyone actually bothers to see whether the paired-sample correlation is “significant” before doing so.
They simply acknowledge that a dependency in scores exists due to the design and units collected. It’s a design feature, and it’s a data-generative feature that is assumed to exist.
Basically, the researcher knows that some dependency probably exists, and then they use the “appropriate” model or test for such a dependency, all most likely without actually assessing whether that assumed dependency is “significant” and should be included.

I say that to contrast what we tend to do with other models.
I was helping a labmate with some multilevel modeling problems shortly before grading the tests, and this is why these ideas clicked together.
A researcher collects a sample which inevitably consists of both known and unknown subpopulations and clusters for which dependencies may exist — If you know someone is of a certain demographic, then there’s a chance that their score may correlate with someone of the same demographic, and hence there is a dependency.
But what do we do?

Some MLM books and programs by default recommend or compute for you the significance of the random effects or random correlations.
These tests basically include likelihood ratio tests (likelihood with correlation over likelihood without correlation), Wald tests (Horrible for variance params, so not gonna discuss that), and variance-testing (Chi-2 test for whether the random variance is greater than the [estimated] expected sampling error variance).
A researcher may determine whether to include the random variance or not, depending on the significance, despite knowing that these clusters do exist and could have dependencies.

The funny realization is that – In these harder models where more thoughtwork is needed, researchers are told to check for these dependencies via some test or model comparison, and only if deemed necessary include it in the model.
But I hazard a guess – These people probably don’t do the same with paired-samples t-tests — They probably don’t test to see whether the paired-samples correlation is significant before using a dependent samples t-test.

Why? Because in the paired-samples t-test, you acknowledge the design and structure that presumably does exist in the data, and doing so improves your inferences. Yet we don’t do the same for linear models!
This dovetails with other recommendations for why random effects models should really be the default. We know there is most likely a dependency among scores to some degree; we don’t really need to test for that before fitting the model. If the researcher knows that a dependency may exist, just use MLM, period, just like if you have two paired data vectors you assume are dependent you would by default use a paired-samples t-test.

Basically what I am saying is – Be consistent.
If you would consider the design and possible dependencies, and in doing so be ok with using a paired-samples t-test without actually assessing the significance of the paired correlation, you should do the exact same thing with linear models.
If you suspect a dependency due to pairings and clusters, use a MLM instead of an LM, just as you would choose a paired-samples t-test instead of the independent samples t-test.
I am not suggesting that you actually check for significance before using a dependent samples t-test or MLM — On the contrary, your model should express what you know and can feasibly model. If subpopulations exist that could feasibly be dependent, just use a MLM; easy; no checking necessary. Best case scenario, your inference is improved; worst case scenario, it’ll basically collapse back into an LM anyway, just as a dependent t-test with zero covariance will wind up being an independent samples t-test.

Separate frequentist inference, frequency probability, and estimation

This is a short post, but I think people need to separate frequentist inference, frequency probability, and estimation a bit better when describing statistical practices.

Some concepts in an oversimplified nutshell:

  • Frequentist inference: Inferring about an unknown parameter or process based on assumed or known frequentist properties. E.g., “If the generative parameter were zero, the [assumed, frequentist] probability of observing something at least as extreme as what I observed is low, so I will assume the generative parameter is not zero.” This inference is based on an assumed distribution resulting from repeated sampling of the same process, and from this assumed distribution, the frequency of occurrence of some effect (or more extreme) is small. One obtains data, employs an assumed distribution of events, and makes inferences about the relative frequency of occurrences.
  • Frequency probability: The computation of a probability from frequencies of events. This is required for frequentist inference, but is not limited to frequentist inference. Bayes theorem can use frequency-derived probabilities. Frequency-derived probabilites are sometimes used as data for a Bayesian model. Simulations of model-based inferential decisions use frequency probabilities — The frequentist probability of making Bayesian-informed decisions may be computed, and is irrelevant to frequentist inference per se. It is what it is — The probability of an event over repeated instances; this is useful for so many contexts outside of frequentist inference.
  • Estimation — Does not depend on frequentism or frequentist probabilities. I throw this in because I see people say “frequentism assumes a uniform prior” and that’s not really true. Maximum likelihood estimation may implicitly assume a uniform prior, but that is not relevant to frequentist inference. You could use penalized maximum likelihood (i.e., a non-uniform prior combined with a likelihood function) and still be frequentist; you’re a frequentist once your inferential logic depends on an assumed universe of possible repeated events.

You can be a Bayesian and accept the existence of frequency probabilities. But as a Bayesian, you do not construct frequency probabilities for use in inference, as inference comes from posterior uncertainty. Frequent_ists_ construct frequency probabilities with an assumed sample space of events, and use the implied frequency probability for inference. In fact, modern Bayesians can’t really reject the premise of frequency probabilities, given that posterior inferences are typically defined from the frequentist properties of an MCMC posterior sampler. What is p(theta > 0 | data)? You estimate it from the proportion of representative posterior MCMC samples that are above zero — It’s a frequency probability used to estimate a Bayesian quantity. There’s nothing inconsistent about this — Seriously, there isn’t. In the same way that a Bayesian can still flip a fair coin 100 times and see that 54% of the flips resulted in heads; that probability is not wrong or invalid, we just don’t use assumed sampling distributions of possible coin flips for inference about the fairness of a coin. The inference about fairness comes from posterior uncertainty given that 54/100 of the flips were heads, instead of “assuming the coin is fair, what is the probability of seeing 54/100 of coin flips?”. Frequency probabilities are not bound to frequentist inferences, but are required for frequentist inference.

You can be a frequentist and use non-uniform “priors” (in the sense of penalized maximum likelihood), because estimation is separate from frequentist inference. The frequentist inference would be employed the second you construct a hypothetical sampling distribution of possible events under some parameter given some statistical model (including a PML-estimated model) for use in inferring from some observed estimate. I.e., a frequentist may use PML to estimate a model, then construct a sampling distribution of PML-model derived estimates under some fixed parameter value, and compute the probability of a given estimate (or greater) from that sampling distribution. That’s still frequentist, even with a estimation penalty, simply because the inference is derived from the frequency of events under an assumed counter-factual.

TLDR; frequentist inference depends on frequentist probabilities and properties. Frequentist probabilities are not limited to frequentist inference. Bayesians may still use frequency-probabilities (as data, as priors [See Jaynes], in MCMC samplers, to understand frequentist properties of Bayesian estimators and decisions over repeated instances). Estimation is independent of frequentism, and frequentism is not defined by the use of implicit “uniform” priors (that is a property of the estimator, not the inferential logic).

Thought Droppings on Substantive and Statistical Hypotheses

The recent “use p=.005”, “use BFs”, “no, use estimation”, “no, justify your alpha” discussion rekindled some of my thinking about substantive and statistical hypotheses. This posts will truly be a rambling, without any organization or goals, but maybe it will have something interesting in there somewhere.

Researchers need to understand that a statistical hypothesis is not a substantive hypothesis in the vast majority of cases.
Alternatively put, statistical hypotheses very rarely map onto substantive hypothesis in a one-to-one manner. This is true whether you are Bayesian, frequentist, or anywhere in between. In the vast majority of cases, when you test a statistical hypothesis in any manner, you have to do some logical gymnastics in order to make an inference about the substantive hypothesis.

So what do I mean by this?
A substantive hypothesis is what you posit in normal language: “ESP exists”; “Power posing makes people feel powerful.”

A statistical hypothesis is some statement about a parameter that the substantive hypothesis implies: $H_1: \mu \neq 0$.

Now, without putting much more thoughtwork into it, this may make sense. “Yes, if ESP exists, then the mean cannot be zero. If power posing works in people, then the mean shouldn’t be zero.”

Oof, let’s be careful though. In order to reason backwards and forwards from/to substantive and statistical hypotheses so simply, these things must map one-to-one, or closely to it. And they generally do not.

Thought droppings based on the hypothesis-testing wars

The common nil-null hypothesis only makes sense to me under a couple of circumstances:

  • A physical law constrains a parameter to be zero: ESP can actually have a zero-effect, because known physical laws prohibit it. It can, in fact, have a zero-effect in any individual, anywhere in time or space, and of course in the population at large. Zero. Effect. None. If ESP exists, it would require a reconsideration of physical laws and constraints. In this case, hypotheses about ESP can actually map onto “Zero, period” and “Not zero, in the population or the individual”.
  • The measurement sucks. If on a latent continuum, some effect exists to some teeny-tiny decimal point (d=.00000000000000001), then the nil-null is false. This is what Tukey, Gelman, and several, several others mean when they say that H0 is false to some decimal point, and so it’s pointless to test it. Substantively, the Null is false, because the effect is assumed to exist to some stupid small decimal point that noone cares about, but nevertheless is non-zero — Everything is related to everything else, or insert your favorite zen concept here. However, the statistical hypothesis that the population parameter is zero could in fact be true due to an artifact. If we had a measure that could only be precise to .1, then even if we had every member of a population, the population parameter may in fact be zero. But then if we used sufficiently precise instruments, perhaps the population parameter would indeed be d=.000000000000001 or whatever.

Those are basically the only two circumstances where I can justify a true nil-null statistical hypothesis: Physical constraints and measurement constraints.

In the case of ESP, both the substantive and statistical hypotheses could easily be that there is no effect. There is a constraint by physical law. The effect could statistically be zero (Although, the measurement problem nevertheless exists for assessing the alternative). There is a symmetry here. If ESP doesn’t exist, then the statistical hypothesis can be zero within individuals and in the population. If it does exist, then the value is indeed not zero; even if one person has ESP, the population parameter will be non-zero. Barring methodological problems, if the population parameter is zero, then ESP does not exist according to the substantive hypothesis; likewise, if the population parameter is non-zero, then ESP does exist. This is one of the very few circumstances where substantive and statistical hypotheses map decently well to one another, and a nil-null can make sense.

The case of, say, power posing is not so simple.
Let’s say the substantive hypothesis is that the act power posing causes people to feel/behave powerfully. The statistical hypothesis to be rejected is that the effect is zero. The statistical hypothesis of interest is that the effect is not zero.

A rejection of the zero-effect is taken to mean evidence that power posing causes powerful feelings/behaviors. However, this is not one-to-one. The statistical hypothesis of interest is that there is a non-zero effect; that can be true, even if the substantive hypothesis is false — Perhaps people can guess the purpose of the study, and inspire themselves to act more powerfully; power posing itself is not the cause of powerful behavior, but instead self-motivation is, and the only reason power posing worked is because it made people think about self-motivating. In this case, the statistical H1 is correct, but the substantive hypothesis is false.
Conversely, the substantive hypothesis could be true, but the statistical hypothesis of interest is false.
Let’s assume we had an imperfect measure of powerful feelings/behaviors following power posing. The population mean could in fact be zero, with some people feeling more or less powerful after power-posing. In this case, the very vague substantive hypothesis that “power posing will have an effect” is actually true, even if the mean of the entire population is exactly zero.

It’s for these reasons, I get testy about hypothesis testing. It’s rare that a substantive hypothesis can map directly onto a statistical hypothesis. Multiple substantive hypotheses can map onto the same statistical hypothesis, and multiple statistical hypotheses can map onto several substantive hypotheses. Even if you do attempt a mapping, it requires a fairly specific substantive hypothesis. For instance, saying “people are affected by power-posing” is entirely too vague; if you had 10,000 people do 1,000 trials of power-posing, then what do you do with that hypothesis if the population mean is zero, but effect variance > 0? People in general (on average) are not affected by power-posing, but certainly some are positively or negatively — In this case, the substantive hypothesis is simultaneously true and false, depending on how you read it.1 And regardless, in the vast majority of psychology where the question realistically is not “whether a physical law exists that permits such an event” but rather “to what degree an event predicts another event”, the nil-null statistical hypothesis is only really valid when the measurement sucks2, not because substantively a zero-effect is plausible in the population. In such a case, the substantive H0 that “an effect as measured by my crappy measure is equal to zero in the population” may be true and match the statistical hypothesis, but “an effect does not exist” is simply wrong.

Substantive vs statistical parsimony

The issue of substantive vs statistical hypotheses crop up in other domains too; in particular: Occam’s razor.
Here’s something that will probably bother a lot of people:

  • I like Occam’s razor
  • I also like super complex, perhaps unnecessarily “over-parameterized” models.

Occam’s razor is essentially that the simplest set of explanations that can account for a phenomenon is the preferred one.
To me, this is a statement about the substantive explanation, or substantive hypothesis.
I, personally, do not care about simple models. I may posit a single fairly simple substantive mechanism, but the model may also have 1,000 parameters due to latent variables, crossed random effects, locations/scales/shapes of residual distributions, etc. My substantive hypothesis doesn’t say much about how much skew is to be expected in residuals, if any — So I may relax an assertion that residuals must be symmetric-normal, and instead use a skew-normal. My substantive hypothesis doesn’t say much about whether random effects are or aren’t correlated, so I may just permit them to be correlated in any manner they please.
For many, that’s terrible — “Your model is too complicated! Simple is better! Occam’s razor!”.
To me, the important part of Occam’s razor is the simplicity in the substantive theory or hypothesis, not in a model used to extract information for evaluating that theory or hypothesis.
If someone claims that a model is too complex and cites Occam’s razor, I tend to have two thoughts: First, you are conflating substantive parsimony with statistical parsimony — Substantive hypotheses are not one-to-one with statistical hypotheses or models, and I doubt when you say “Power posing is effective” you actually are hypothesizing “Power posing should linearly affect the population on average with d = .2 with absolutely zero effect variance and perfectly normally distributed residual variance that is homogenous across all individuals within a population”. In fact, that sounds much less substantively parsimonious, even if it is statistically parsimonious in terms of the number of parameters. There are way more assertions to that statement. In some sense then, a less parsimonious substantive claim may imply a more parsimonious statistical model. It really depends on how you define parsimony; if fewer parameters means parsimony to you, then obviously complex models are less parsimonious. On the other hand, if fewer substantive assertions and assumptions imply parsimony to you, then more complicated models may actually be more parsimonious. It really depends on what type of parsimony you value, but for me personally, I’m much more concerned with substantive parsimony than I am with the complexity of a model used to infer about that “substance”.

Second, do we really need to worry about parsimonious hypotheses or models in psychology at this point? At least in social psychology, we’re barely able to explain even 10% of the variance in the sample. We know social behavior and cognitions are complex, multiply determined, and vary within and between individuals. There are thousands of causes at various levels that produce any one given social behavior or cognition. If adding a predictor to my model makes it seem “unparsimonious”, I fear we will never be able to fully grok human psychology — We’re too afraid of overexplaining, when really we’re underexplaining nearly everything anyway. I think we’re a long, long way off from overexplaining a behavior.

Concluding thoughts

My take-home point is that substantive and statistical hypotheses need to be conceptually separated.
Multiple substantive hypotheses can map onto the same statistical hypothesis, and multiple statistical hypotheses can map onto the same substantive hypothesis.
They are not one-to-one mappings.
The understanding of each, their separation, their relation, are all important to understanding hypothesis testing — Whether nil-null hypotheses are even feasible, what model you build, the kind of inference you make.
Substantive hypotheses about the data generating mechanism should always precede any statistical hypotheses, but one must also note that statistical hypotheses may not map extremely well onto a substantive hypothesis.
Evaluating a substantive hypothesis is therefore much more complicated than testing a statistical hypothesis and calling it a day. Think of all the reasons there may be a mismatch, why a statistical hypothesis may be true or false regardless of your substantive hypothesis, or whether either hypothesis even makes sense (i.e., can there be a truly nil-null 0.000…000 effect).
To really support a substantive hypothesis (which is what we really care about anyway), we need to build better models; rule out more explanations; choose useful hypotheses to pit against our own in prediction; have converging evidence from several modalities and disciplines.
But mainly, stop thinking that a statistical hypothesis is your hypothesis; it isn’t. It’s a proxy at best.

  1. In the case of ESP, this sort of thing can occur as well, but it would be… weirder. If ESP is “true”, but the population parameter were still zero, that would imply that people can have negative or positive ESP effects — Or, people have ESP and make the correct decision, or people have anti-ESP, where they know what will happen and make the totally wrong decision. I doubt this fits with the substantive ESP hypothesis, so if the substantive ESP hypothesis were correct, we’d need to see a mean greater than 0. 
  2. To be fair, this is probably valid in most cases. This has made me a little ambivalent about the nil-null hypothesis, because with a truly crappy measure, the population parameter could actually be zero due to imprecision, but I doubt that’s what people mean when they say the “effect is none”. 

Bayesian CFA primer

I’m working on [yet another] Bayesian CFA model. I’ve grown to really like using Stan for latent variable models like CFA, SEM, mixtures, etc.
The problem is that the docs are rather sparse about how to do this, or even where to begin.
This post won’t get you to the point of writing the Stan code, but I can at least give the basic gist while I wait for Stan to finish.

Most CFA packages operate on covariance matrices. The “aim” is to specify a set of linear equations such that the implied covariance matrix adequately matches the observed covariance matrix.
This has a host of “problems” that people have been solving for years.
Some examples:

  • Latent Interactions are hard
  • Distributional assumptions are hard to include
  • Assume normality, then tweak every fit statistic to be robust against the inherent non-normality.
  • The observed covariance matrix is often treated as a “known parameter” in the minimization formula (e.g., this manifests in treating computing the density of the implied covariance matrix given the observed covariance matrix)

Although you can still take a covariance-comparison approach in Bayes, there isn’t really a need to if you have access to raw data.

Let’s say you have one factor and ten indicators.
For the sake of simplicity, we’ll include a normal assumption on the indicators.
Also for the sake of simplicity, we will standardize the data and assume a standardized latent variable (mean of zero, sd of 1).

The basic assumptions are as follows:

\theta_i &\sim N(0,1) \\
y_{ij} &\sim N(\theta_i \lambda_j, \sigma_j) \\
\lambda_j &\sim N(0,1)^+ \\
\sigma_j &\sim Cauchy(0,1)^+

where $\theta_i$ is the latent score for person $i$, $y_ij$ is the observation of item $j$ for person $i$, $\lambda_j$ is a factor loading, and $\sigma_j$ is the residual standard deviation.
It’s really about that simple. Note that this method gives you the posterior distributions (and expected values) for factor scores for free.
That also means that in structural models, or moderated CFAs, you can very easily include interactions, as $\theta_i\times x_i$ or $\theta_i\times \gamma_i$ (where $\gamma_i$ is some other latent variable).

The nice thing about this is that it’s flexible. Want differing assumptions about the item likelihoods? Go for it; don’t use a normal likelihood for each item. Use ordered logit, ordered probit, skew-normal, skew-T, T, or whatever the heart desires.
Want to model residual covariance? Go for it. Want to allow everything to possibly residually covary using a soft constraint? Sure! Use a multivariate normal for all manifest variable likelihoods, and put a strong LKJ prior on $\Sigma_r$ to have a soft-constraint.
Do you want non-normal latent score distributions? Sure! Don’t assume latent scores are distributed normally; just use a different prior for $\theta_i$ and estimate any other parameters you need to.

This was a really simple model, but it’s very easy to code into Stan. Extensions are just about as easy.
If you want multiple factors to covary, use a MVN prior with an unknown correlation matrix.
If you want multiple factors to potentially load onto multiple indicators beyond those hypothesized, just use priors to place soft-constraints. This is equivalent to exploratory CFA/SEM modeling. You can have all factors load onto all items, just with non-hypothesized paths having peaked zero-priors for model identification.

Of course, the remaining problem is — How do you assess model fit?
Well, two ways come to mind.
First, you could actually estimate the observed covariance matrix for all observed variables and the model-implied covariance matrix within the same model, then compute the standard fit indices based on those matrices.
This would provide not only the fit indices of interest, but uncertainty about those fit indices.
This sounds like a terrible pain though.

Second, you can ignore model fit indices (e.g., RMSEA, SRMR, CFI) and just focus on predictive adequacy. This comes in multiple flavors.
You can obtain the log-likelihood of each observation within each manifest variable, and examine for which people the model is inadequate for which items using point-wise LOOIC.
This is a killer diagnostic tool.
You can also obtain the log-likelihood of each observation across all items jointly, and compare models’ person-wise LOOIC values. This would give you a metric akin to the AIC, but in the Bayesian world. It’s essentially the approximate joint leave-one-out error of the model. The best model is the one with the lowest LOOIC.
Another nice thing about the elpd (which is computed in the LOOIC), is that when used in model comparison, it’s akin to assessing the posterior predictive performance in a similar manner to an “informed” Bayes factor.
This isn’t the time to really get into what this means, but suffice it to say that IF you computed the posteriors of a model, then plugged those in as priors for a second equally-sized sample and computed the marginal likelihood, it would approximately equal what the elpd within the LOOIC attempts to estimate (I will take another post to explain why that is, but you can test that yourself if you want).

“Second generation p-value” – Interesting

Blume et al. (2017) posted a paper to the ArXiv about a Second generation p-value.
The proposal is essentially as follows.

  • Define an interval [a, b] defining a range of estimates that would be considered support for the null. E.g., perhaps a d statistic would have a null interval of -.1, .1, with the explanation that if the true d statistic falls within that interval, the effect is useless/incompatible with substantive hypothesis/etc.
  • Compute the statistic and some interval representing the range of values compatible with the data. In frequentist terms, this may be a CI that is constructed such that any values outside of the interval would be rejected IF the null hypothesis were the point estimate. In Bayesian terms, this may be some highest density or quantile interval of the posterior for the estimate.
  • Compute the proportion of the statistic-interval that is contained within the null-interval. If the statistic-interval is “too large”, then apply a correction to make the proportion max out at .5.
  • The second generation p-value is then the proportion of the “data-consistent interval” that overlaps with the “null-consistent interval”.

Some properties include:

  • If H0 is true, such that the true parameter value is within the null interval, then as information increases, the p-value will increase to 1.
  • If H1 is true, such that the true parameter is outside of the null interval, then as information increases, the p-value will decrease to 0.
  • If an estimate is directly on the null interval border, the p-value will be .5.
  • The authors argue that it outperforms in other ways, such as controlling t1 error rates, even compared to FDR/FWER correction procedures.

I like the idea, though it sounds so much like a frequentist has rediscovered the Bayesian ROPE concept — Define a Region of practical equivalence, and use the posterior densities to evaluate the probability of the parameter falling within the ROPE. In fact, it’s pretty much the same thing, with the only difference being that the second-generation p-value uses overlapping intervals, rather than probability mass contained within an interval. The Bayesian ROPE can have a probabilistic interpretation, while the authors argue that the second generation p-value is just a descriptive, a proportion, not a posterior probability or anything similar to that. I respect them for admitting that, rather than trying to shoehorn in a probabilistic interpretation.

One thing I’m curious about, is whether another X-generation p-value could essentially be a marginalized p-value.
This make already exist, in which case I am rediscovering it much like they may have rediscovered the ROPE.
The p-value (for a nil-null hypothesis) is defined as: $p(t > T(y,\theta)|\theta = 0)$. What if instead of defining an interval, we defined the null hypothesis as a density.
E.g., instead of saying $H0: \theta = [-.1,1]$ we said $H0: \theta \sim N(0,.05)$, or something similar.
Then computed: $\int p(t > T(y,\theta)|\theta)p(\theta)d\theta$.

Before any frequentists jump on me for putting a GASP parameter on a distribution: just imagine this is a probabilistic weight, for penalization, rather than some philosophical statement about a-priori plausibility of parameters.
But I think that, too, would have interesting and similar properties.
I may explore that at some point. It’d be conceptually similar to assurance, which is just marginal power when an effect size is uncertain.

The stickiest points, for me, are the use of CIs as the data-consistent intervals — Only certain CIs can be built with the property needed for this to work, and even then they can be problematic. So that leaves one with credible intervals as an alternative, but then you may as well just use the posterior for inference anyway; you’re in Bayes land, just enjoy your time there.
The other issue is in the case where the true hypothesis is, somehow, EXACTLY equal to a null interval boundary, the second generation p-value would asymptotically be .5. I’m not sure that’s bad or unintended, given that it would imply you can’t really choose between the null or the alternative, but it’s a bit annoying to have a singularity of indecision.

Anyway, it may be an interesting metric. I haven’t decided yet whether I like it more than a p-value (but still less than a Bayesian treatment, of course); but I see it being a potentially useful metric nevertheless.

Proposal of a new term: “DOCO”

I am proposing a new term: DOCO. I will, in spirit, add it to Gelman’s already impressive list of useful terminology. DOCO stands for Data(or datum) Otherwise Considered Outliers.

That is, if you have a X-assumptive model, and you see statistical “outliers”, then you should probably change the model assumptions. These are not “wrong” data, but rather data that, if you assume X, are notably outlying observations.

Thus: DOCOs. If you have data that are outliers according to your statistical model, then that implies that your model is most likely incomplete, and does not fully describe the data generating process. Have several positive “outliers”? Your data are skewed, and you should use a skew normal distribution or some other skew-permissive distribution. Have several tail-end “outliers”? Your data have fat tails, and you should use a student-t distribution. Have coding errors? Then part of the data generating process involves an error-in-coding component; if you’re only interested in the non-error-in-coding generative component, then sure, remove those points. If not, maybe you can simultaneously model the secondary component. Data seem to be very ill-defined by any single generative distribution? You should probably model the data under some fixed-mixture model that could sufficiently generate observations.

The point is: Outliers shouldn’t exist in principle. Something generated that data, and if some data are considered outliers, then your model isn’t adequately accounting for such observations and needs modifications. As such these types of data should be called DOCOs with a revised model.

Bayes and optional stopping

When I explain to people why I love Bayes, it’s typically some combination of:

  • Build the model you want, with assumptions you want
  • Use probabilistic (soft) instead of hard constraints
  • Incorporate prior information
  • Identify difficult models with priors
  • Interpret the posterior how you’ve always wanted to interpret non-Bayesian quantities
  • Obtain better estimates and make better inferences thanks to priors and modeling the DGP as you see fit
  • Infer without worrying too much about the sampling distribution

However, some people tack on an extra benefit:

  • Don’t worry about optional stopping; the likelihood principle says there’s no problem with it.

I’ve long stopped saying that, because I’ve stopped believing it.

Stopping rules do affect your inference. I am not the first to say this (e.g., 1 2 3 4, several papers if you google scholar this topic).
First, stopping rules can by their very rule provide information about the parameter.
Second, stopping rules affect the sample space, these outcomes which are used to inform the parameter. Therefore, even if a stopping rule doesn’t directly modify the likelihood of a particular set of observations, it can affect the distribution of possible parameter values that could even be inferred from the sample space. This can be represented in the prior specification, $p(\theta|S)$, where $S$ is a stopping rule, among other places in the model.

I think there are ways out of that predicament by modeling a DGP that takes into account the modification of sample space and the consequent modification of parameter estimates that may result from observation, estimate, or posterior-dependent stopping rules. Essentially, a joint generative model must account for the observations, $N$, the modified sample space, etc all induced by the stopping rule (stopping which, itself, is contingent on current data, N, and posterior quantities… a confusing mess). But that’s not my point here (and also, I imagine that’s insanely complicated, and should there be a solution… a paper, and not a blog post, should be written).

Instead, I just want to argue that Bayesian inference is affected by stopping rules.
The effect of stopping rules such as “95% HDIs must exclude zero” on parameter estimates are already known. These sorts of stopping rules are guaranteed to bias your inference. You’re sampling from a population until a magic sequence of observations produces the desired non-zero interval, and then you turn off the data collecting machine once the moving posterior has moved far enough away. Most probably, this will result in an overestimate and an erroneous uncertainty in the direction and magnitude of an effect. Observations are random, and randomly ordered. At some point, to infinity, there will be some sequence extreme enough to sway a posterior estimate toward an extreme, and it’s at this moment the collection machine turns off. Should you let the collection machine continue, you would likely see a sequence of observations that would move the posterior distribution back toward the expected region. In essence, you’re waiting for an occurrence of a sequence of events extreme enough to push the posterior distribution sufficiently far away from zero, then stopping. Yes, of course this is going to bias your estimate — You are waiting for an extreme event in order to make your (now extreme) estimate.

This problem is manifest in any stopping rule, really. However, some stopping rules aren’t as inherently problematic as others. For instance, sampling until a desired precision (posterior SD, SE) is obtained isn’t really problematic for parameter locations. That generally means you just have sufficient information about the parameter, and all is well; you aren’t waiting for an extreme event that translates the posterior before stopping, you aren’t waiting for a sequence to occur that would produce a spurious confirmation of a hypothesis before stopping, etc. I’m guessing there would be some amount of bias in the posterior width, in the sense that you could have a sequence of events that spuriously makes the posterior width smaller than had you collected $\Delta_n$ more (presumably extreme values), and therefore you could have a negligibly spurious amount of (un)certainty about the estimate. But to summarize, as long as you decide to turn off the data collection machine upon hitting some data, estimate, or posterior-dependent condition, noone should be surprised that the type of data, estimate, or posteriors obtained from such a procedure cause an overrepresentation of these conditions — And there will therefore be some undesirable properties.

Bayes Factors are not immune to this issue. A common selling point of Bayes Factors (and Bayesianism in general) is that the sampling intention and stopping rule are irrelevant to the inferential procedure.
As stated, this is correct. The Bayesian posterior and BFs (not a posterior quantity) are interpreted the same way regardless of the sampling intention. P-values, on the other hand, depend explicitly on the sampling distribution, and optional stopping and sampling intentions both modify that sampling distribution — Great care must be taken to obtain an interpretable p-value, let alone a correct one.

Although bayesian quantities are interpretable despite sampling intentions and stopping rules, the probability of bayesian decisions is certainly affected by optional stopping, and not in a desirable way.
Here is my simple example, then I’ll stop rambling.

We’ll use the BF, since people too often consider the BF a panacea of inference that is immune to the stopping rule.
Bayes Factors are essentially a ratio of prior-predictive success from two priors defined by hypotheses. In the case of optional stopping, what you are doing is waiting for a sequence of observations for which prior predictive success is relatively high. It should be of no surprise then, that with a stopping rule condition of relatively high prior predictive success, some sequence of data may spuriously “overstate” prior predictive success and thus evidence of one hypothesis over another.

Two hypotheses are defined, conveniently as the defaults for the BayesFactor package. H0: A point mass at $\delta = 0$, and H1: Cauchy(0,.707) 1
For this example, H0 is true, and $\delta = 0$.
Two stopping rules are defined.
For the first, the stopping rule is defined as a fixed-N stopping rule, such that the researcher collects N=400 and stops to evaluate the BF.
For the second, the stopping rule is defined as: Collect 4 observations, test whether the BF is beyond threshold for either H1 or H0; if so, stop and report the BF; if not, collect two more observations and repeat until N=400.

I simulated this procedure across 10,000 “studies”. Each study reported their BF at the time of stopping. With a BF threshold of 3, H1 is supported if BF > 3, H0 if BF < 1/3, or neither if BF is inbetween.

H0 Neither H1
Fixed-N .8629 .1280 .0091
Optional Stopping .8467 .0006 .1527

Under a fixed-N, 86% of the studies made the correct decision to support H0, 13% couldn’t adequately distinguish between H1 and H0, and 1% erroneously2 decided on H1.
Under optional stopping, the proportion that made the correct decision did not change much, but the percentage that erroneously decided on H1 was 14% higher (17x more often)!

Before you say “but BF of 3 is too low”, let’s use BF of 10 (and 1/10) as the threshold.

H0 Neither H1
Fixed-N 0.00 .9975 .0025
Optional Stopping 0.00 .9528 .0472

I’m guessing H0 wasn’t supported only due to some monte carlo error. But the point is again evident: H1 was supported 19x more often by the data (erroneously, by chance) when optional stopping was used than when it was not used.

This is the problem I have with the claim that BFs and other bayesian quantities are immune to optional stopping. Yes, the interpretation of the quantity remains the same. Also yes, if you use optional stopping, it can change the probability of making some erroneous inference or decision. Their definitions are unchanged by optional stopping, but the probability of decisions are affected by optional stopping.

So next time someone tells you that Bayesian inference is unaffected by optional stopping, ask them why the probability of making an erroneous decision increases notably when using Bayesian quantities produced from an optional stopping rule.

  1. The reader should know that I 1) have a list of grievances with the bayes factor and do not actually like using them and 2) I hate the hypotheses tested by default in this package. 
  2. By “erroneously”, I mean the truth is H0, and the data collected would support H1. The data observed indeed did support H1, by chance, but this is erroneous given the true known state that H0 is true. 

Power analysis for SEM with only $\alpha$

I was helping a labmate with a power analysis problem. He’s planning a study and will use SEM. Unfortunately, the scale he’s using only reported Cronbach’s $\alpha = .80$ with 14 items.

This is irritating, to say the least. But nevertheless, there are two approaches to this.
One, you could simulate “true” scores that have some desired relationship with another variable, then simulate 14 realizations of those true scores using a standard error that equates to $\alpha$.

But I wanted to see whether we could construct a latent variable model that implies an $\alpha = .80$.

Reliability in SEM is a complicated thing, with many definitions. Some are interested in AVE (average variance explained). Then there’s $\omega_1, \omega_2, \omega_3$, the definitions of which can be found in the semTools::reliability help page.
$\omega_1$ and $\omega_2$ are more similar to each other than either are to $\omega_3$, but generally the definition can be thought of as “IF there is a latent variable, and we see J realizations of that latent variable, what percentage of the variance in the sum-scores is attributable to the latent variable?”
This is an estimate of the “true” reliability, which can be thought of as the proportion of variance in $\hat\theta$ explained by $\theta$. Of course, in reality we don’t have $\theta$, so we settle for the proportion of variance in some observed metric that is explained by the latent estimate. With greater N, this estimates converges with the “true” reliability.
Side note: If you have high N but a crappy measure, your reliability does not improve, it just means your estimate of your crappy reliability is more accurate of the true crappiness.

Here’s the basic, probably imperfect, proof of reliability:

\tilde x_i = \sum x_{ij}$$

$$x_{ij} = \hat x_{ij} + \epsilon_{ij}$$
$$x_{ij} = \lambda_j\theta_i + \epsilon_{ij}$$
$$\tilde x_i = \sum_j \lambda_j\theta_i + \sum_j\epsilon_{ij}$$
$$Var(\tilde x_i) = Var(\sum_j \lambda_j\theta_i + \sum_j\epsilon_{ij})$$
$$Var(\tilde x_i) = (\sum_j\lambda_j)^2Var(\theta_i) + \sum_j Var(\epsilon_{ij})$$
$$Reliability_{\omega_2} = \frac{(\sum_j\lambda_j)^2Var(\theta_i)}{(\sum_j\lambda_j)^2 Var(\theta_i) + \sum_j Var(\epsilon_{ij})}$$

If variance is set to 1, and outcomes standardized: $$\omega_2 = \frac{(\sum_j\lambda_j)^2}{(\sum_j\lambda_j)^2 + \sum_j(1 – \delta_j)}$$

Well, what we have is $\alpha$, not $\omega_2$. However, $\alpha$ can be approximated in SEM assuming that the residual variance and loadings are equal across items.
This further simplifies things.
$$\alpha = \frac{(J\lambda)^2}{(J\lambda)^2 + J\delta} \
= \frac{(J\lambda)^2}{(J\lambda)^2 + J(1 – \lambda^2)}$$
Solving for $\lambda$ gives you:
$$\lambda = \sqrt{\frac{\alpha}{(J – \alpha(J-1))}}$$

This represents the $\lambda$ needed to obtain $\alpha$ with $J$ standardized items, assuming a standardized latent variable.
We tested it to be sure; we simulated data using lavaan in R, and sure enough, running psych::alpha(x) gave $\alpha \approx .80$.

This is not to say that when scales have $\alpha=.80$ that the factor loadings are indeed equal to the above (they won’t be), but if you only have reliability and you want to simulate a measurement model that would suggest such reliability, then there you go.
This is also not for doing power analysis for the factor loadings themselves, of course.
But if you wanted to do a power analysis for a correlation or structural coefficient with one or more variables being latents + measurement error, this approach may work when information is lacking.