In this post, I want to talk about the difficulties of probabilistic inference and generalizability of claims.

The recent twitter debates about Tal’s paper The Generalizability Crisis have rendered me befuddled. Personally, I find his arguments to be obvious, conditional on reading Meehl and giving more than 5 minutes of thought toward inference and statistical testing. But, apparently, it is not obvious, and some people strongly disagree. Tal further clarifies matters in a well-written response here.

Tal’s central point, in my view, can be boiled down to the following gist:

- Verbal hypotheses should map onto statistical hypotheses, if statistical hypothesis testing is to be relevant to supporting or rejecting the verbal hypothesis.
- The generalizability of the supported or rejected hypotheses is bounded by the statistical model/test employed.
- Psychologists do not tend to map verbal hypotheses onto statistical hypotheses, but nevertheless use statistical hypothesis tests to support verbal hypotheses.
- Psychologists tend to invoke a "Weak" form of inference to support broad claims. These broad claims are supported using, literally, a weak form of inference (a confirmationist framework), and reduced statistical models that can not, and are not intended to, support claims broader than permitted by the variables entered into it, which are in turn generated from a specific sampling and collection scheme. I.e., the statistical models, used to assess a verbal hypothesis, operate on assumptions and are bound by the variables collected, which do not cleanly translate to broader hypothetical statements. More simply, a 2-group randomized experiment using measure X to measure construct F may imply that one group’s mean F score is greater than another, but that is where statistical generalization ends. Seeing such a quantitative difference, even if definitive, does not imply one’s verbal hypothesis is
*generally supported*(e.g., across populations, across other measures of construct F, across contexts, across time), or is*uniquely*supported (that other hypotheses would not have made the same prediction).

Tal goes onto to offer some solutions. One solution is to include hypothetically exchangeable components into the model itself. If the verbal hypothesis makes a claim that should be evident regardless of the precise measure, stimuli, time, context, population, or subject, then the hypothesis is implicitly claiming that these variables are exchangeable in some way. Therefore, the statistical model for the hypothetical process should also treat these as exchangeable, and a random effects model could be employed.

Of course, knowing all possible sources of exchangeable variation is hard, if not impossible. Collecting data on and actually varying these exchangeable variables even more so. So a second, very reasonable, solution is to stop making claims broader than the statistical model permits. Instead of saying "Receiving an anonymous gift makes people grateful", one could instead say "Receiving an anonymous gift [in the form of digital raffle tickets for a $50 Amazon gift card on a computerized task from a fake stranger] makes [undergraduates at our Southern US University] [score higher on gratitude measure X, assuming the measurement model is specified as it was] [in a controlled double-blind lab study]."

One may say, "That is not ideal", and I would say "Tough, that’s exactly what you can gain from your study." If you wish to claim support for a hypothesis more generalizable than that, then you need to generalize your design to make the nuances exchangeable.

One may also say, "The verbal hypothesis does not need to be the same as the statistical hypothesis", and "The verbal hypothesis can be broad, but still supported by specific studies, or refuted by other specific studies." To the former, I disagree, for reasons I explain in this post. To the latter, I agree, but I rarely see this in psychology; I, more often than not, see a broad claim, described as "confirmed" by a weak statistical test, with some lip service paid in a limitations section.

With all of this said, I really want to dive into the "Weak inference" part, and the notion that the verbal hypothesis needs to match the statistical hypothesis.

# Is my lamp broken?

Tal’s paper raised at least two major points of contention from critics. The first, is whether a verbal hypothesis must map onto a statistical hypothesis, in order for the statistical test to mean anything for the verbal hypothesis. The second, is in the use of the term "induction" when describing statistical inference. I am mainly going to focus on the former, with some sprinkling of the latter.

Tal (and Meehl, and I) argued that a statistical test can inform the credence of a verbal hypothesis to the extent that the statistical test maps injectively (one-to-one, uniquely) onto the verbal hypothesis. I.e., if many verbal hypotheses can predict the same test result, then obtaining that test result tells you little about whether *your* verbal hypothesis is true. Likewise, if *your* verbal hypothesis implies nearly an infinite number of possible statistical outcomes, then nearly an infinite number of possible statistical outcomes would support your claim, and therefore it is no test at all. In even simpler terms, if verbal hypotheses A, B, and C all imply a quantity X, and you see X, it is hard to say A is ‘right’; conversely, if A implies any X other than zero, then you had no real statistical prediction at all, and any of the infinite possible outcomes would have supported A (non-falsifiable).

Regardless of your position on Popper, Lakatos, deduction, induction, abduction, etc, the above holds. This problem is not unique to statistical inference; statistical inference merely adds a link to the inferential chain. It does not matter what your particular brand of inference is — If your hypothesis does not imply outcomes unique from other hypotheses, or if any outcome could be explained from one’s hypothesis, it is a poor hypothesis. This would break deduction, induction, and abduction alike; these all require meaningful, unique predictions.

However, because critics of the paper have seemingly latched onto Tal’s use of "induction" as an argument, I am going to discuss this problem with respect to deductive logic. Again, it does not matter whether one clings to "inductive" or "deductive" inference, and arguing about the paper’s use of the word "induction" is a red herring.

Let’s start with a really simple case: Is my lamp broken?
I have a home office. It’s simple: A nice desktop machine that I built, a window, and a lamp. Let’s assume my blinds are closed, so there’s *some* ambient light, but not much. My desktop is off, so the monitor isn’t producing any light. I want to know if my lamp is broken.

## Modus Tollens

The Modus Tollens framework is as follows:

If A, then B. [This must necessarily be true, for the rest to follow].

Not B.

Therefore, not A.

For my trusty lamp then,

If my lamp is broken, then the room should be dark.

The room is not dark.

Therefore, my lamp is not broken.

This logic makes perfect sense, assuming these postulates are indeed *true*.
It *must* be true, that *if* my lamp is broken, then the room *must be dark*. Then if I see that the room is *not* dark, I can safely conclude that my lamp is *not* broken. Simple. Elegant.

And also, impractically naive for how science works.

## Operationalizing

Let’s start by asking, "What does it mean for the room to be dark?"

For the sake of argument, let’s say I have a light sensor. It detects, on a scale of 0-100, how much light there is. Let’s also assume it is perfectly reliable.

I will define "dark" loosely as "notably darker than it would be, if my lamp were on".

Now, I measure the light in my room at 4pm on a Tuesday in May, with my lamp off. It reads 10. Then I turn my lamp on, and it reads 20. So I can operationalize "dark" as the measure of light at 4pm on a Tuesday in May when my lamp is off. When my lamp is on, it’s at 20; when it’s not, the room is dark with a light reading of 10.

Let’s continue the Modus Tollens then:

If my lamp is broken, then the room should be dark. [Verbal hypothesis]

If the room is dark (and if I operationalize dark as the reading observed when the lamp is off at 4pm on a May Tuesday), then my measure should be 10. [Statistical hypothesis]

My measure is 20. The room is not dark.

Therefore, my lamp is not broken.

Notice how the hypotheses chain:

If A, then B. [A verbal hypothesis, with a necessary prediction]

If B, then X. [The prediction, mapped onto a statistical prediction]

Not X, therefore not B. [The statistical prediction is untrue]

Not B, therefore not A. [Therefore, the verbal prediction is untrue].

So far so good. We have a good enough hypothesis to generate a prediction; that prediction can be mapped onto a quantitative prediction. If that quantitative prediction is falsified, then the verbal prediction is invalidated, and the hypothesis is therefore falsified. Simple. Elegant.

And also impractically naive for how science works.

## "Probabilistic" Modus Tollens

We do not have perfect measures… so now we have to enter the realm of probability. My light sensor is not perfectly reliable. Each time I turn on the light sensor, it produces a slightly different answer. There’s some error to it. But the means are consistent, and the distribution of observations is fairly small when the lamp is on and off. But we do have to revise our inferential chain.

If my lamp is broken, then the room should be dark.

If the room is dark (and if [operationalization, model assumptions]), then the true light measure should be approximately 10.

If the measure must be approximately 10, then my observation (21.5) would be highly unlikely.

Therefore, the measure is unlikely to be 10; the room is unlikely to be dark; and the lamp is unlikely to be broken.

Oof. That makes our statement much more tentative. It is also questionable whether Modus Tollens really makes sense once converted into a "probabilistic" modus tollens. Modus tollens is a logical argument; logical systems aren’t necessarily coherent once uncertainty is introduced into the statements. For one thing, it’s inverting the conditional probability. p(observation | 10) is not the same as p(10 | observation). For another, our quantitative prediction should be somewhat uncertain too… "if the room is dark, then my measure should be distributed N(mu[0], .1); mu[0] ~ N(10, .01)".

In order to have a ‘proper’ probabilistic modus tollens, we need to write the modus tollens as a probability function. Bear with me here. This part won’t be heavily mathy. The modus tollens argument can be written as a probability function, where some things have probability 1, and other things have probability 0.

For example, the "If A, then B; Not B; therefore, not A" can be understood through the following probability statements.
First, let’s define the probability of "A", given "Not B":
$$
p(A|\lnot B) = p(\lnot B | A)p(A) / P(\lnot B)
$$
This looks difficult to compute, but under the standard modus tollens, it is simplified greatly.
When there is *no* uncertainty in the logical statement, such that B *must* occur if A is true, then $p(B | A) = 1; p(\lnot B|A) = 0$. The whole formula collapses:
$$
p(A|\lnot B) = 0 * p(A) / p(\lnot B) = 0
$$
That is, assuming that B must occur given A, then the probability of "A" given "not B" is 0. Therefore, "Not A". Voila! Modus tollens expressed via probability.

The key point of this, is that your proposition *must be true*. "A" must *necessarily* imply "B". When "A" happens, ‘B" must happen. If that does not happen, then modus tollens does not easily hold.

For example, if "B" only occurs with some probability, given "A", it becomes more complicated. "If A, then probably B [let’s say, .95]; not B, therefore …" looks like this: $$ p(A | \lnot B) = .05 \times p(A) / p(\lnot B) $$ Once you enter probability space, you have to start justifying some probabilities. It is no longer straight forward.

So… let’s return once again to my possibly broken lamp. This is going to get complicated. If probability statements overwhelm you, then your take-away from this section should simply be: "Inference under uncertainty is extremely hard, and not as straightforward as Modus Tollens would suggest." It is fine if you do not understand this portion; that is largely the point of this section! With uncertainty (in data, in parameters, in statements), inference becomes much more complicated.

Let’s express the probabilistic modus tollens of the lamp problem. Letters in parentheses represent conditions to include in the probability algebra.

If my lamp is broken (A), the room must be dark (B). [A necessary implication]

If the room is dark (B) (and if [operationalization, model assumptions]), then the true light measure (C) should be approximately 10. [A probabilistic statement]

If the true light quantity should be around 10 (C), then my observation (D) would be highly unlikely [A probabilistic statement]

So what is the answer? I want to know whether my lamp is broken. Unfortunately, there is uncertainty in my predicted value (due to measurement error in my tool). Argh! What I do know, is "D"; I have an observation. $$ \begin{align} p(A | D) &= p(D | A)p(A) / p(D) \\ p(D | A) &= \int p(D | C) p(C | B) p(B| A) \end{align} $$ We need to know a number of things. What is the probability of the true light measure, given that it is dark: $p(C | B)$. What is the probability of my observation, given the true measure in darkness: $p(D | C)$. What is the a-priori probability of my lamp being broken: p(A) [e.g., maybe it’s broken many times in the past; maybe it is 30 years old; maybe this brand has a known defect; maybe it is currently sizzling and smoking].

This is.. difficult, obviously. If I previously, repeatedly measured the ambient light when the room is ‘dark’, then I can form a distribution for $p(C|B)$. If we take for granted that the room is always considered dark when the lamp is broken, then $p(B|A) = 1$. If we know the distribution of measurement error for our light sensor, then we can specify the probability of the observed value given the true ambient light value.

This is all *possible*, for sure. But the point is to highlight: Modus Tollens is much more difficult when the logical statements become probabilistic statements. It is unclear whether this is *even considered deductive*, since it includes uncertainty instead of merely deterministic statements. Given that it is no longer defined by logical, binary statements, but rather probabilistic ones, it appears more similar to induction than deduction, despite having a deductive bent. It is not straightforward, and I have never, in my life, seen a paper in psychology actually reason through, then compute something like this.

## Random variation

Moreover, this is *still* a simplified example. We have thus far only talked about my measurement error. There are all sorts of external sources of variation that insert themselves into the system. What if I randomly sampled the light sensor across the year to collect data on baseline light levels when the lamp is on and off? Then there are new conditions, new sources of uncertainty.

Recall that I previously collected light data when the lamp is on and off, in order to operationalize and quantify what I consider "dark" and "not dark", on a May Tuesday at 4pm. My statistical inference is based on whether my current light data is improbably larger than would be expected assuming the light measure is under the operationalized ‘dark’ regime. But across multiple data collections, there can be several random sources of light variation that are irrelevant to my question, but nevertheless affect my data collection process.

If I collected light data across multiple rooms, multiple days, multiple seasons, etc, then there are all kinds of effects on my light data. Whether my computer monitor is on matters. Whether it’s 6pm in the summer, or 6pm in the winter matters. Whether I blocked my lamp with a giant coffee mug matters. Where I placed the light sensor matters. These are all stochastic, and these all change the distributions of uncertainty. They all change what the expected amount of light would be, what I would consider ‘dark’, and how much error there is in my measurement itself.

Importantly, I would *have* to change my model to account for these things if I wanted to claim my lamp is, indeed, not currently broken at any point in time. I would need to estimate and control for the effects of these stochastic influences, if I wanted to know whether, in a given moment, my lamp is not broken. If I wanted to know whether my lamp is broken, and I broadly operationalize the verbal prediction "the room is dark" as "the amount of light present when the lamp is off [no matter what time of day, whether the monitor is on, the weather, the sensor distance, etc]", then I would need to model and control for all the sources of light variation that are irrelevant to that broader prediction.

What happens if I do *not* account for these things?
Well, it really depends on how you collected your data. If you collect data in a very controlled environment first (e.g., holding sensor location constant, collecting repeatedly in a short time frame, turning the lamp on and off to obtain a distribution of samples from each condition), then you may not need to include other sources of variation. *However*, any inference about your lamp is now conditioned on that exact circumstance. For you to say "my lamp is not broken", it is necessary to say "my lamp is not broken, assuming this exact sampling scheme, my model assumptions are met, no other sources of variation are present, and my context is the same as when my data were collected on that sunny May Tuesday afternoon, on which I operationalized what ‘dark’ even means". A harder sell, for sure, but it’s also more *honest*.

If you collect data across the year, while recording all the different sources of variation you know about, then the story changes. You want to know if your lamp is broken, given all this data and known sources of random influences. You obtain your single observation from the light detector, and throw it into a model that does *not* account for any of these variations. Well, obviously, whether your inference is correct is a crapshoot. You’re comparing it to [uncertain] values averaged across a whole host of random influences. Because you are not accounting for known random influences, your comparison may be completely wrong, and lead to the wrong conclusion altogether (inflated error rates). Moreover, because you are excluding the *uncertainty* of these effects, you are *more confidently* wrong too. You may say "my lamp is not broken", even though it is, just because you’re measuring at 10am on a sunny day with the blinds open, and failed to account for those effects.

If you wanted to test the broader prediction that the room is dark, broadly operationalized as "the amount of light present when the lamp is off or broken", then you would need to broaden your model as well, to obtain a predicted light level after controlling for the various random influences that bear no importance to the lamp itself. To reiterate, the role of the expanded model is to allow a generalized verbal prediction to map onto a generalized statistical prediction — One that is valid across various hypothetically unimportant nuances. If you wanted to test the prediction that the room is dark, and strictly operationalize "dark" as the ambient light present when the lamp is off on Tuesday, May 12, at 4pm in Davis, California; then you don’t need to control for other various things. Of course, that also means your hypothetical claim of support is limited to that specific measure, at that specific place. So, you don’t get to claim that your "lamp is not broken, because the room is not dark", but rather your "lamp is not broken because the room is not dark, assuming we define darkness as the amount of ambient light in this specific place, with this specific measure taken at this time."

This is analogous to what Tal is talking about. For one, we need our verbal hypotheses to *imply* something statistically precise (an injective mapping between the two), if we want a statistical inference to correspond to a verbal inference. Then for generalizable inferences to be accurately stated, we either need to *model* the hypothetically exchangeable conditions (e.g., via random effects models), or we need to *limit* our inferential claims to the boundaries of our conditions, operationalization, and design. In other words, we can either vary, model, and account for the conditions over which we wish to broadly generalize; or we can clearly state the conditions assumed and given, and constrain our expressed hypothetical support appropriately. We can either integrate out known random effects, or we cannot, and instead keep our claims conditional on those effects.

# What happens in Psychology?

The lamp example is not perfect, but it’s a seemingly simple inferential problem that can complicated by introducing common research problems into the mix. Psychology, in reality, has a much harder problem. At least with the lamp, we can directly see and toggle the light. Psychology, on the other hand, largely deals with intangible, unobservable constructs in complex, ever-changing organisms. The uncertainty only compounds.

Complications aside, Psychology does not seemingly use any acceptable form of inference. Consider the following normative inferential chain, void of any probability statements:

If [verbal hypothesis], then [verbal prediction].

If [verbal prediction], then [non-zero quantity].

Non-zero quantity, therefore [verbal prediction].

[verbal prediction], therefore [verbal hypothesis].

This is problematic:

- This is not modus tollens, nor valid deductive logic, despite critics seemingly arguing that it is. It simply isn’t. This is 100% an invalid statement. It breaks down because the
*statistical*hypothesis is committing a logical fallacy. You cannot falsify by confirming, and yet this is what this statistical hypothesis is doing. It therefore breaks the inferential chain, rendering it fallacious. Yes, this implies that normative NHST is*not*deductive. - There is no unique mapping between the verbal and statistical hypothesis whatsoever. There are likely to be any number of alternative hypotheses for why the quantity is not
*exactly*zero. Whether it’s Meehl’s crud factor, or differences in stimuli, or experimenter effects, or whatever else, this verbal hypothesis effectively predicts*nothing*, by predicting*anything*. Because any number of things can also predict a non-zero quantity, we therefore have extremely weak support for*our*hypothesis – In actuality, it merely rejects any hypothesis that would predict zero, and retain all other possible hypotheses.

Unfortunately, this inferential chain is commonly used, and nevertheless wrong. You cannot affirm the consequent. More obviously, rejecting the statistical hypothesis implied by a completely different verbal hypothesis, does nothing to support *your* verbal hypothesis that makes a completely different statistical prediction, if any prediction was made at all.

Ok, so let’s change our testing behavior to something that is not entirely fallacious.

If [verbal hypothesis], then [verbal prediction].

If [verbal prediction], then [non-zero quantity].

Not [non-zero quantity], therefore not [verbal prediction].

Not [verbal prediction], therefore [not verbal hypothesis].

Now we have ourselves a modus-tollens. How did we do it? By completely reversing the role of null hypothesis testing. We do not test a prediction that we *did not posit* (a zero). Instead, we test the prediction that we *did* posit (non-zero).

Note, that a normative NHST is not necessarily fallacious, it is the way in which we *use* it that makes it fallacious. A valid modus tollens for a nil-null hypothesis would be:

If [verbal hypothesis], then [verbal prediction].

If [verbal prediction], then [zero-quantity].

Not [zero-quantity], therefore not [verbal prediction].

Not [verbal prediction], therefore not [verbal hypothesis].

But this would imply *our* hypothesis predicts a zero. If we observe no zero quantity, then *our* hypothesis is falsified. It says nothing of any other hypotheses, only that whatever hypothesis implies a value of zero, is to be rejected. See how that is different from the fallacious chain above?

However, it still is not this simple. The above assumes no probabilistic statements. Once again, this will break down considerably once we do. Back to our non-zero inferential chain:

If [verbal hypothesis], then [verbal prediction].

If [verbal prediction], then probably [non-zero quantity].

Probably not a [non-zero quantity], therefore probably not [verbal prediction].

Probably not [verbal prediction], therefore probably not [verbal hypothesis].

This once again requires us to create a joint probability model. I will spare you this here, but the probability algebra is straightforward. Getting the distributions, however, may not be so straightforward.

## Inferential chains must be thoroughly justified

Even *if* you can test in a logically consistent manner, from verbal down to the data, you will *have to justify* each of these steps. You cannot just state "If A, then B"; you have to either prove that to be the case, or justify why such an assumption is made. Modus Tollens *depends* on the validity of each logical consequent. Otherwise, you can make nonsense inferences, in a logically structured way. For example:

If unicorns are fictional, then my coffee mug would be full. (If A, then B)

My coffee mug is not full. (Not B)

Therefore, unicorns are not fictional. (Therefore not A)

This is, of course, ridiculous. Even though it does indeed follow modus tollens, the assumptions are completely wrong. For this to be valid, I would need to prove, or sufficiently argue, that if indeed unicorns are fictional, then my coffee mug would be full.

Psychology is not as ridiculous, but the requirements are exactly the same, from the verbal hypothesis to the statistical.

Therefore, every single link in the inferential chain, from the verbal hypothesis, to the statistical hypothesis, must be coherent, justified, or proved.

# Inference is hard

Hypothesis testing therefore has some high requirements, that are rarely, if ever, met in psychology.

- The verbal prediction
*must*be justified, and adequately linked to any generative verbal hypothesis. - Any statistical prediction
*must*be injectively mapped, and adequately linked to any generative verbal prediction, if a statistical inference is to inform the support for or against a hypothesis against any others. - Any stated support for a verbal hypothesis is necessarily constrained — bound by the assumptions, explicit or implicit, of the model, design, and operationalization used to ultimately test it. Every assumption in the design and model is ultimately added as a logical conditional, or a conditional in the probability model. There are many to consider, and I barely scratched the surface here. In essence, every "If" statement in the inferential chain is added to a global list of "If" statements that must precede the claim of hypothetical support; one cannot simply state the hypothesis and whether it was supported. To reiterate, this
*includes all*implicit and explicit assumptions in the statistical model (e.g., distributional shape, whether stimuli have an effect or not, whether a measure includes error or not), because these ultimately underlie how a statistical prediction is formulated, and what is treated as exchangeable.

In sum, hypothesis testing is hard. Inference is hard. Formulating valid, justifiable, precise predictions is hard. Once uncertainty is introduced to those predictions, it is made much harder. It’s not as straightforward as some seem to think. Probability makes logical argumentation explode into a cascading set of assumptions, constraints, and probability equations. Making precise, unique statistical predictions that map onto a verbal prediction is uniquely difficult in psychology, where "[measures are] made up, and points don’t matter". And unfortunately, I do not see an easy way out of it. We can continue on as normal, and make unjustifiably broad claims from incompatible models, using fallacious confirmationist arguments masquerating as falsificationism; or we can temper our claims, operate exploratively, and make careful predictions when we can.

# Communication is hard

Lastly, I’d like to mention that *communication* about inference is difficult. Even a broad statement like "Bears are brown" can be taken two opposite ways from people. One person may take that to mean "Bears can be brown, but may not necessarily be brown in all places, across all species." I.e., one answer to the question "What is brown? — Bears are brown!". Another person may interpret it as "Bears are always brown, no matter what random place or species you choose," and take issue with such a broad claim. I.e., one answer to "What describes a bear? — Bears are brown!" Some may think in terms of "What outcome could my hypothesis predict?" Vs "What hypothesis could predict my outcome?"; and these people would interpret a broad statement differently.

Part of me thinks some of the debate surrounding Tal’s paper is actually a "Bears are brown" problem. If someone said "Receiving an unrequested gift makes people feel grateful," I take that to be a broad statement, applicable across measures, gifts, persons, etc; and I may criticize that paper to say "Only if you measure gratitude as you do, and only if the gift is valuable, and only if there is no expectation of reciprocation." Others may interpret as "Receiving an unrequested gift *can* make people feel grateful," and have no issue with it failing to generalize.

I think Tal and I view such broad hypotheses ("Bears are brown") as ones to be generalized ("Bears are always brown"); to support such a claim, they would then need to model and control for hypothetically exchangeable conditions across which such a broad claim should be valid; and then map a verbal prediction onto the quantity of interest.

Others may view such broad hypotheses ("Bears are brown") as non-generalized claims to be tested ("Bears can be brown; when are they not?"). This is fine too. However, I do not often see papers making claims so that others can test them — They usually seek to confirm generalized claims from specific operationalizations/conditions/contexts, using faulty inferential logic, then attempt to explain away (find moderators for) specific replication failures without modifying the broad claim. So really, the point is moot to me — Regardless of how you interpret broad claims, any serious attempts to generalize such statements or to refute them with logically coherent argumentation seems extraordinarily rare. Nevertheless, these two interpretations may contribute to the conflict.