The recent “use p=.005”, “use BFs”, “no, use estimation”, “no, justify your alpha” discussion rekindled some of my thinking about substantive and statistical hypotheses. This post will truly be a ramble, without any organization or goals, but maybe it will have something interesting in there somewhere.

Researchers *need* to understand that a statistical hypothesis is *not* a substantive hypothesis in the vast majority of cases.

Alternatively put, statistical hypotheses very *rarely* map onto substantive hypotheses in a one-to-one manner. This is true whether you are Bayesian, frequentist, or anywhere in between. In the vast majority of cases, when you test a statistical hypothesis in any manner, you have to do some logical gymnastics in order to make an inference about the *substantive* hypothesis.

So what do I mean by this?

A substantive hypothesis is what you posit in normal language: “ESP exists”; “Power posing makes people feel powerful.”

A statistical hypothesis is some statement about a parameter that the substantive hypothesis implies: $H_1: \mu \neq 0$.

Now, without putting much more thoughtwork into it, this may make sense. “Yes, if ESP exists, then the mean cannot be zero. If power posing works in people, then the mean shouldn’t be zero.”

Oof, let’s be careful though. In order to reason backwards and forwards from/to substantive and statistical hypotheses so simply, these things *must* map one-to-one, or closely to it. And they generally do not.

# Thought droppings based on the hypothesis-testing wars

The common nil-null hypothesis only makes sense to me under a couple of circumstances:

- A physical law constrains a parameter to be zero: ESP can actually have a zero-effect, because known physical laws prohibit it. It can, in fact, have a zero-effect in any individual, anywhere in time or space, and of course in the population at large. Zero. Effect. None. If ESP exists, it would require a reconsideration of physical laws and constraints. In this case, hypotheses about ESP can actually map onto “Zero, period” and “Not zero, in the population or the individual”.
- The measurement sucks. If, on a latent continuum, some effect exists to some teeny-tiny decimal point (d=.00000000000000001), then the nil-null is false. This is what Tukey, Gelman, and several, several others mean when they say that H0 is false to some decimal point, and so it’s pointless to test it. Substantively, the null is false, because the effect is assumed to exist to some stupid small decimal point that no one cares about, but nevertheless is non-zero (everything is related to everything else, or insert your favorite zen concept here). However, the statistical hypothesis that the population parameter is zero could in fact be true due to an artifact. If we had a measure that was only precise to .1, then even if we measured every member of a population, the population parameter as measured may in fact be zero. But then if we used sufficiently precise instruments, perhaps the population parameter would indeed be d=.000000000000001 or whatever.
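The measurement point can be sketched in a few lines of Python. This is a toy illustration under loud assumptions: the population, the size of the latent effects, and the `measure` helper are all hypothetical, and the instrument is assumed to simply round to its resolution.

```python
import random

def measure(value, resolution):
    """Record a latent value on an instrument that only resolves steps of `resolution`."""
    return round(value / resolution) * resolution

random.seed(1)

# Hypothetical population: every member has a tiny, strictly positive latent effect.
latent_effects = [random.uniform(1e-9, 1e-6) for _ in range(10_000)]

# The "true" population parameter on the latent continuum.
latent_mean = sum(latent_effects) / len(latent_effects)

# The same population seen through an instrument precise only to 0.1 ...
coarse_scores = [measure(e, 0.1) for e in latent_effects]
coarse_mean = sum(coarse_scores) / len(coarse_scores)

# ... and through a (hypothetical) instrument precise to 1e-12.
fine_scores = [measure(e, 1e-12) for e in latent_effects]
fine_mean = sum(fine_scores) / len(fine_scores)

print("latent mean is non-zero:      ", latent_mean > 0)
print("coarse-measure mean is zero:  ", coarse_mean == 0)
print("fine-measure mean is non-zero:", fine_mean > 0)
```

Even with the whole population in hand, the coarse instrument reports a population parameter of exactly zero, while the latent parameter is not zero. The nil-null is "true" only as a statement about the measure.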

Those are basically the only two circumstances where I can justify a true nil-null statistical hypothesis: Physical constraints and measurement constraints.

In the case of ESP, both the substantive and statistical hypotheses could easily be that there is no effect. There is a constraint by physical law. The effect *could* statistically be zero (Although, the measurement problem nevertheless exists for assessing the alternative). There is a symmetry here. If ESP doesn’t exist, then the statistical hypothesis can be zero within individuals and in the population. If it does exist, then the value is indeed not zero; even if one person has ESP, the population parameter will be non-zero. Barring methodological problems, if the population parameter is zero, then ESP does not exist according to the substantive hypothesis; likewise, if the population parameter is non-zero, then ESP does exist. This is one of the very few circumstances where substantive and statistical hypotheses map decently well to one another, and a nil-null can make sense.

The case of, say, power posing is not so simple.

Let’s say the substantive hypothesis is that the act of power posing causes people to feel/behave powerfully. The statistical hypothesis to be rejected is that the effect is zero. The statistical hypothesis of interest is that the effect is not zero.

A rejection of the zero-effect is taken to mean evidence that power posing causes powerful feelings/behaviors. However, this is not one-to-one. The statistical hypothesis of *interest* is that there is a non-zero effect; that can be true, even if the substantive hypothesis is false — Perhaps people can guess the purpose of the study, and inspire themselves to act more powerfully; power posing itself is not the cause of powerful behavior, but instead self-motivation is, and the only reason power posing worked is because it made people think about self-motivating. In this case, the statistical H1 is correct, but the substantive hypothesis is false.
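To make that scenario concrete, here is a small simulation sketch. Everything in it is an assumption made up for illustration: the sample size, the share of posers who guess the study's purpose (`GUESS_RATE`), the size of the self-motivation boost, and the `outcome` helper. The one thing fixed by the scenario is that the pose itself has exactly zero direct effect.

```python
import random
from statistics import mean

random.seed(42)

N = 5000                 # people per condition (hypothetical)
GUESS_RATE = 0.6         # assumed share of posers who guess the study's purpose
MOTIVATION_EFFECT = 0.5  # assumed boost from self-motivation
POSE_EFFECT = 0.0        # the pose itself does nothing in this scenario

def outcome(posed):
    """Return (felt_power, self_motivated) for one simulated participant."""
    # Only people who posed can guess the purpose and motivate themselves.
    self_motivated = posed and random.random() < GUESS_RATE
    felt_power = (random.gauss(0, 1)
                  + POSE_EFFECT * posed
                  + MOTIVATION_EFFECT * self_motivated)
    return felt_power, self_motivated

control = [outcome(False) for _ in range(N)]
posers = [outcome(True) for _ in range(N)]

control_mean = mean(y for y, _ in control)
poser_mean = mean(y for y, _ in posers)
unmotivated_poser_mean = mean(y for y, m in posers if not m)

group_difference = poser_mean - control_mean
print(f"posers vs control:             {group_difference:.2f}")
print(f"unmotivated posers vs control: {unmotivated_poser_mean - control_mean:.2f}")
```

The group difference is clearly non-zero, so the statistical H1 holds, yet posers who did not self-motivate look just like the control group. A test on the group means alone cannot tell this story apart from "power posing causes powerful feelings".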

Conversely, the substantive hypothesis could be true, but the statistical hypothesis of *interest* is false.

Let’s assume we had an imperfect measure of powerful feelings/behaviors following power posing. The population mean could in fact be zero, with some people feeling more or less powerful after power-posing. In this case, the very vague substantive hypothesis that “power posing will have an effect” is actually true, even if the mean of the entire population is exactly zero.
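A deliberately artificial sketch of that situation, assuming a hypothetical population in which half the people gain half a unit of felt power and half lose it (the numbers are invented for illustration):

```python
from statistics import mean, pvariance

# Hypothetical population of 10,000 individual effects of power posing:
# half feel more powerful (+0.5), half feel less powerful (-0.5).
effects = [0.5] * 5000 + [-0.5] * 5000

population_mean = mean(effects)        # the parameter the nil-null is about
effect_variance = pvariance(effects)   # how much individuals differ in their effect
share_affected = sum(e != 0 for e in effects) / len(effects)

print("population mean effect:", population_mean)
print("effect variance:       ", effect_variance)
print("share of people affected:", share_affected)
```

Here the nil-null about the mean is exactly true, while every single person is affected. Which of those facts "counts" depends entirely on how the substantive hypothesis is worded.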

It’s for these reasons that I get testy about hypothesis testing. It’s rare that a substantive hypothesis can map directly onto a statistical hypothesis. Multiple substantive hypotheses can map onto the same statistical hypothesis, and multiple statistical hypotheses can map onto the same substantive hypothesis. Even if you do attempt a mapping, it requires a fairly specific substantive hypothesis. For instance, saying “people are affected by power-posing” is entirely too vague; if you had 10,000 people do 1,000 trials of power-posing, then what do you do with that hypothesis if the population mean is zero, *but* effect variance > 0? People in *general* (on average) are not affected by power-posing, but certainly *some* are, positively or negatively. In this case, the substantive hypothesis is simultaneously true and false, depending on how you read it.^{1} And regardless, in the vast majority of psychology, where the question realistically is not “whether a physical law exists that permits such an event” but rather “to what degree an event predicts another event”, the nil-null statistical hypothesis is only really valid when the measurement sucks^{2}, not because a zero-effect is substantively plausible in the population. In such a case, the substantive H0 that “an effect as measured by my crappy measure is equal to zero in the population” may be true and match the statistical hypothesis, but “an effect does not exist” is simply wrong.

# Substantive vs statistical parsimony

The issue of substantive vs statistical hypotheses crops up in other domains too; in particular: Occam’s razor.

Here’s something that will probably bother a lot of people:

- I like Occam’s razor
- I also like super complex, perhaps unnecessarily “over-parameterized” models.

Occam’s razor is essentially that the simplest set of explanations that can account for a phenomenon is the preferred one.

To me, this is a statement about the *substantive* explanation, or *substantive* hypothesis.

I, personally, do not care about simple *models*. I may posit a single fairly simple *substantive* mechanism, but the model may also have 1,000 parameters due to latent variables, crossed random effects, locations/scales/shapes of residual distributions, etc. My substantive hypothesis doesn’t say much about how much skew is to be expected in residuals, if any — So I may relax an assertion that residuals *must* be symmetric-normal, and instead use a skew-normal. My substantive hypothesis doesn’t say much about whether random effects are or aren’t correlated, so I may just permit them to be correlated in any manner they please.

For many, that’s terrible — “Your model is too complicated! Simple is better! Occam’s razor!”.

To me, the important part of Occam’s razor is the simplicity in the substantive theory or hypothesis, not in a model used to extract information for evaluating that theory or hypothesis.

If someone claims that a model is too complex and cites Occam’s razor, I tend to have two thoughts: First, you are conflating substantive parsimony with statistical parsimony. Substantive hypotheses are not one-to-one with statistical hypotheses or models, and I doubt that when you say “Power posing is effective” you actually are hypothesizing “Power posing should linearly affect the population on average with d = .2 with absolutely zero effect variance and perfectly normally distributed residuals whose variance is homogeneous across all individuals within a population”. In fact, that sounds much less substantively parsimonious, even if it is statistically parsimonious in terms of the number of parameters. There are *way* more assertions in that statement. In some sense then, a less parsimonious substantive claim may imply a more parsimonious statistical model. It really depends on how you define parsimony; if fewer parameters means parsimony to you, then obviously complex models are less parsimonious. On the other hand, if fewer substantive assertions and assumptions imply parsimony to you, then more complicated models may actually be more parsimonious. It really depends on what type of parsimony you value, but for me personally, I’m much more concerned with substantive parsimony than I am with the complexity of a model used to infer about that “substance”.
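As a rough sketch of that trade-off, the “simple” statement could be written as follows. The symbols are illustrative assumptions rather than anything from a fitted model: $x_i$ is a pose indicator, $y_i$ the outcome, and $d$ is taken to be $\beta_1 / \sigma$.

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2), \qquad \beta_1 / \sigma = .2$$

A more relaxed version might let the effect vary across people and let the residuals be skewed and heteroscedastic:

$$y_i = \beta_0 + (\beta_1 + u_i)\, x_i + \varepsilon_i, \qquad u_i \sim \mathcal{N}(0, \tau^2), \qquad \varepsilon_i \sim \text{SkewNormal}(0, \sigma_i, \alpha)$$

Under this notation, the first model is the second one plus the extra assertions $\tau = 0$, $\alpha = 0$, $\sigma_i = \sigma$ for all $i$, and $\beta_1 / \sigma = .2$. Fewer free parameters, but more claims about the world.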

Second, do we *really* need to worry about parsimonious hypotheses or models in psychology at this point? At least in social psychology, we’re barely able to explain even 10% of the variance in the sample. We *know* social behavior and cognitions are complex, multiply determined, and vary within and between individuals. There are thousands of causes at various levels that produce any one given social behavior or cognition. If adding a predictor to my model makes it seem “unparsimonious”, I fear we will never be able to fully grok human psychology — We’re too afraid of overexplaining, when really we’re underexplaining nearly everything anyway. I think we’re a long, long way off from overexplaining a behavior.

# Concluding thoughts

My take-home point is that substantive and statistical hypotheses need to be conceptually separated.

Multiple substantive hypotheses can map onto the same statistical hypothesis, and multiple statistical hypotheses can map onto the same substantive hypothesis.

They are not one-to-one mappings.

The understanding of each, their separation, their relation, are all important to understanding hypothesis testing — Whether nil-null hypotheses are even feasible, what model you build, the kind of inference you make.

Substantive hypotheses about the data generating mechanism should always precede any statistical hypotheses, but one must also note that statistical hypotheses may not map extremely well onto a substantive hypothesis.

Evaluating a substantive hypothesis is therefore much more complicated than testing a statistical hypothesis and calling it a day. Think of all the reasons there may be a mismatch, why a statistical hypothesis may be true or false regardless of your substantive hypothesis, or whether either hypothesis even makes sense (e.g., can there be a truly nil-null 0.000…000 effect?).

To really support a substantive hypothesis (which is what we really care about anyway), we need to build better models; rule out more explanations; choose *useful* hypotheses to pit against our own in prediction; have converging evidence from several modalities and disciplines.

But mainly, stop thinking that a statistical hypothesis *is* your hypothesis; it isn’t. It’s a proxy at best.

1. In the case of ESP, this sort of thing can occur as well, but it would be… weirder. If ESP is “true”, but the population parameter were still zero, that would imply that people can have negative or positive ESP effects: either people have ESP and make the correct decision, or people have anti-ESP, where they know what will happen and make the totally wrong decision. I doubt this fits with the substantive ESP hypothesis, so if the substantive ESP hypothesis were correct, we’d need to see a mean greater than 0.
2. To be fair, this is probably valid in most cases. This has made me a little ambivalent about the nil-null hypothesis, because with a truly crappy measure, the population parameter could actually be zero due to imprecision, but I doubt that’s what people mean when they say the “effect is none”.