The absurdity of mapping p-values to Bayes factors

In the “Redefine statistical significance” (RSS) paper, the authors argued that p = .05 corresponds to low evidentiary value.

Let’s unpack what is meant by that.

The authors’ argument comes from an “upper bound” argument on how large the Bayes Factor can be given a particular p-value.

Recall that a Bayes Factor is defined as:

$$
\frac{p(D|H_a)}{p(D|H_0)} = \frac{\int_l^u p(D|\theta,H_a)p(\theta|H_a)d\theta}{\int_l^u p(D|\theta,H_0)p(\theta|H_0)d\theta}
$$

The parameter priors, $p(\theta|H_k)$, correspond to an uncertain prediction from some hypothesis, $H_k$.
The marginal likelihoods computed in the numerator and denominator above can then be thought of as “prior predictive success”, or “predictive success, marginalized over the uncertainty expressed in the parameter from hypothesis $k$”. The BF10 then represents the evidence in favor of hypothesis a over hypothesis 0.
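
To make the definition concrete, here is a minimal R sketch under assumptions that are mine, not the RSS authors': the data are summarized by a sample mean `ybar` with a known standard error `se`, $H_0$ is a point mass at zero, and $H_a$ puts a normal prior with a made-up scale `tau` on $\theta$. Each marginal likelihood is just the likelihood averaged over that hypothesis's prior.

```r
# Hypothetical summary data: observed mean and its (assumed known) standard error.
ybar <- 0.4
se   <- 0.2
tau  <- 1     # assumed prior SD for theta under H_a (illustrative only)

# Marginal likelihood under H_a: integrate likelihood x prior over theta.
m_a <- integrate(function(theta) dnorm(ybar, mean = theta, sd = se) *
                                 dnorm(theta, mean = 0, sd = tau),
                 lower = -Inf, upper = Inf)$value

# Under the point-mass null, the integral collapses to the likelihood at theta = 0.
m_0 <- dnorm(ybar, mean = 0, sd = se)

# BF10: prior predictive success of H_a relative to H_0.
bf10 <- m_a / m_0
bf10
```

Changing `tau` changes `bf10` even though the data stay the same, which is the sense in which the BF evaluates a hypothesis's prediction and not only the data.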

With that out of the way, let’s return to the RSS paper.
The authors wished to argue that a p=.05 necessarily has a low upper bound on evidence in favor of the alternative hypothesis.

RSS used as an example a BF for a two-tailed alternative hypothesis in comparison to a nil-null hypothesis.
This makes intuitive sense, because a commonly used test is the two-tailed t-test with a nil-null hypothesis.
In the latter case, the p-value corresponds to the probability of observing a t-statistic at least as extreme as the observed statistic, assuming that the true sampling distribution is centered around zero.
No explicit alternative hypothesis really exists; the statement is simply $H_a: \theta \neq 0$, which isn’t exactly a useful hypothesis, but I digress.

In contrast, the BF does use two explicit hypotheses.
$H_0: p(\theta|H_0) = \begin{cases}1 & \text{if } \theta = 0 \\ 0 & \text{if } \theta \neq 0\end{cases}$
and $H_a$, whose exact form depends on how favorable you wish to be when deriving an upper bound.
The most direct upper bound on the BF for p = .05 at some specified N would simply be another point mass, placed exactly at the effect whose t-statistic equals the .975 quantile of the t-distribution.

Using this line of reasoning, they argue that at best, a p = .05 corresponds to a small BF, and as such, p = .05 has a low upper bound on evidentiary value.
Instead, they argue that p = .005 would correspond to a more considerable evidentiary value, with BFs ranging from 14–28 or so.

So what’s the problem? Seems reasonable!

The problem(s)

Ok, well let’s start with some intuitive problems.

First, as mentioned in the “Justify your alpha” paper, why should the rate at which a hypothetical error is presumably controlled map onto a marginal likelihood ratio (I refuse to call this “evidence”, as though the whole of scientific evidence can be expressed in a single unidimensional metric of prior predictive success ratios)?
We could likewise flip this around: how about we devise Bayes factor thresholds that map onto hypothetical error rates, so that these crazy Bayesians don’t make more or fewer Type I errors than expected? (I am not encouraging this, just highlighting the arbitrariness.)

Second, these are metrics of completely different things! A p-value is exactly what it states: the probability of observing what we did (or something more extreme), assuming some state-to-be-nullified is true. The BF is a ratio of prior predictive success, or a marginal likelihood ratio, or the ratio of probability densities of the observed data from two prior-predictive distributions corresponding to two hypotheses.
The goals of each are different too.
The p-value is just a “surprise” factor, given some assumed state.
The BF is a relative measure, evaluating the data under two certain or uncertain states.
The p-value is used inferentially as a means to reject some state if the observation is sufficiently improbable under it.
The BF is used to express how much one’s prior odds in each state would change as a function of the data.

Why these things should map onto one another is not intuitive.
The p-value has its uses. The BF has its uses. I’m not sure why one needs to be defined in terms of the other.

Third, and most importantly, the mapping from “highly evidentiary” upper bound BFs to p-values (or alphas) is not generalizable. Richard Morey has an excellent series on this topic. The gist is, the whole “p=.005 has reasonable upper bounds, but p=.05 does not”, depends entirely on which hypotheses you are testing. And as I said above, the RSS authors used a two-tailed alternative against a point-mass null, and from the standpoint that statistical hypotheses should map onto their respective substantive hypotheses, this makes no sense anyway. More on that in a future blog post.

Morey recently made his point again on Twitter.
I decided to take it a step further.

Likelihood Ratio Tests

A likelihood ratio test is defined as:
$$
Q = -2(LL_0 - LL_a) \\
Q = 2(LL_a - LL_0) \\
Q = 2\log\frac{L_a}{L_0} \\
k_i = \text{Number of parameters in model } i \\
df = k_a - k_0 \\
Q \sim \chi^2(df)
$$

We can solve for the likelihood ratio required to reject at some $\alpha$ level:
$$
e^{\frac{Q_{c,\alpha,df}}{2}} = \frac{L_a}{L_0}
$$
Given some $\alpha$ and $df$, find the critical value for the $\chi^2$ statistic, $Q_{c,\alpha,df}$, then simply solve for the likelihood ratio.
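
As a minimal sketch of that recipe in R (the function name `lr_at_critical` is made up for illustration), the critical value comes from `qchisq` and the likelihood ratio from inverting $Q = 2\log\frac{L_a}{L_0}$:

```r
# Likelihood ratio L_a / L_0 at which an LRT just reaches significance.
# `lr_at_critical` is a name invented for this sketch.
lr_at_critical <- function(alpha, df) {
  q_crit <- qchisq(1 - alpha, df = df)  # critical chi-square value
  exp(q_crit / 2)                       # invert Q = 2 * log(L_a / L_0)
}

# One extra parameter in the alternative model, alpha = .05
lr_at_critical(alpha = .05, df = 1)
```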

Now, let’s define the upper bound BF.
As it turns out, if you have two hypothetical models with point masses at fixed values ($\hat\theta$), then the BF is simply a good ol’ fashioned likelihood ratio.
$$
p(D|H_a) = \int p(D|\theta,H_a)p(\theta|H_a)d\theta \\
\text{If }p(\theta|H_a) = \begin{cases}1 & \text{if } \theta=\hat\theta \\ 0 & \text{if } \theta\neq\hat\theta\end{cases} \\
p(D|H_a) = p(D|\hat\theta,H_a)
$$
In other words, the BF just becomes a likelihood ratio.
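
Spelled out, and assuming that both hypotheses are point masses sitting at the maximum likelihood estimates of their respective models ($\hat\theta_a$ and $\hat\theta_0$ are my labels for those estimates), the link between the upper bound and the LRT statistic is:
$$
BF_{ub} = \frac{p(D|\hat\theta_a,H_a)}{p(D|\hat\theta_0,H_0)} = \frac{L_a}{L_0} = e^{Q/2}
$$
So an LRT statistic sitting exactly at its critical value, $Q = Q_{c,\alpha,df}$, gives $BF_{ub} = e^{Q_{c,\alpha,df}/2}$.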

If you have model A with 2 parameters, and model 0 with 1 parameter, and p = .05, what is the upper bound BF ($BF_{ub}$)?
$$
df = 1 \\
Q_{c,\alpha=.05,df=1} = 3.841459 \\
e^{Q/2} = 6.825936 = BF_{ub}
$$

So if you are comparing two models with a difference of one parameter, and you see p = .05, your upper bound Bayes Factor is 6.825936.
Basically, that means if your alternative hypothesis had one more parameter and your hypothesis predicts the exact estimate achieved, and you got p = .05, then the greatest BF you could see is 6.825936.
That’s not stellar, which is the point of the RSS argument.

Not so fast! That’s in one particular scenario, wherein you had two models, and the only difference is the inclusion of an extra parameter.
Let’s vary the p-value obtained and the df for the test. Recall that df in LRTs represents the difference in parameters estimated for each model, so we are varying how complex the alternative model is from the reference or null model.
Plot time.

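library(ggplot2)

# Grid of p-values (alpha levels) and LRT degrees of freedom
alphas <- seq(.005, .5, by = .005)
dfs <- 1:10
ds <- expand.grid(alpha = alphas, df = dfs)

# Critical chi-square value for each (alpha, df) and the implied upper-bound BF
ds$qchisq <- qchisq(p = 1 - ds$alpha, df = ds$df)
ds$bfup <- exp(ds$qchisq / 2)

# log(BF) upper bound against p, one line per df;
# horizontal lines mark BF10 = 14 and BF10 = 28
ggplot(data = ds, aes(color = factor(df), x = alpha, y = log(bfup))) +
    geom_line() +
    labs(x = 'p', y = 'log(BF10) Upper Bound', color = 'LRT df') +
    theme_classic() +
    geom_hline(yintercept = c(log(14), log(28))) +
    scale_x_continuous(breaks = seq(0, .5, by = .025)) +
    theme(axis.text.x = element_text(angle = 90))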

The $\log BF_{ub}$ is plotted on the y-axis and the p-value on the x-axis. Colors represent the df (i.e., the difference in the number of parameters between the alternative and null models).
The black horizontal lines represent BF10 14-28.

If your model has 2 parameters more than the null, then p=.05 has an upper bound of “strong evidence”, according to the RSS authors.
Well, that changes things doesn’t it?
In fact, if you only have one parameter difference, then p = .01 to .02 has “strong evidence” upper bounds (BFs of roughly 28 down to 15).
A p=.005 would have an upper bound BF of about 50.

In the more extreme case, if you have 6–7 more parameters in the alternative model, then a p=.50 corresponds to “strong evidence” upper bounds.
Let me repeat that, for those who misread .50 as .05.
A p-value, of .50 (50%, point five zero, not 5% or point zero five), corresponds to strong evidence upper bounds.
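
These readings off the plot can be spot-checked directly. The sketch below recomputes the upper bound for the specific (p, df) pairs mentioned above; `bf_upper` and `checks` are names invented for this example.

```r
# Upper-bound BF implied by an LRT sitting exactly at a given p-value.
# `bf_upper` is a helper name invented for this sketch.
bf_upper <- function(p, df) exp(qchisq(1 - p, df = df) / 2)

# The (p, df) pairs discussed in the text
checks <- data.frame(p  = c(.05, .005, .50, .50),
                     df = c(2,   1,    6,   7))
checks$bf_ub <- bf_upper(checks$p, checks$df)
checks
```

The first row comes out at exactly 20: with df = 2 the chi-square tail probability is $e^{-Q/2}$, so the upper bound is simply $1/p$.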

Using the logic of the RSS authors, if you have 6-7 parameters more in your alternative model than the null model, then alpha should be set to .50, because that’s where the upper bound of “evidence” is strong.
That’s some really great news for people who fit SEM and MLM models, both of which often compare much higher dimensional models to much lower dimensional models. Go for broke, know that if you have p=.42, it’s justifiable as a cutoff, since it has a high evidence upper bound.
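
To see what “setting alpha by the upper bound” would look like, the relationship can be inverted: pick a target upper-bound BF and solve for the p-value that produces it at each df. This is only a sketch of that logic; `p_for_bf` is a hypothetical helper, and the target of 14 is the lower end of the “strong evidence” band used above.

```r
# p-value at which the upper-bound BF equals a target value, for a given LRT df.
# Inverts BF_ub = exp(Q / 2): Q = 2 * log(BF), then p = P(chi-square_df > Q).
# `p_for_bf` is a hypothetical helper, not something from the RSS paper.
p_for_bf <- function(bf, df) pchisq(2 * log(bf), df = df, lower.tail = FALSE)

# The "alpha" that the upper-bound logic would imply at each df
data.frame(df = 1:10, p_at_bf14 = round(p_for_bf(14, df = 1:10), 3))
```

Each row is a different significance threshold justified by the same argument, which is the arbitrariness being pointed out.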

Note that this does make sense from a BF-inferential perspective.
If you have a hypothesis that predicts precisely the parameter values that you estimated, then that hypothesis is insanely good.
And the goodness of that hypothesis increases as your model has more parameters that it precisely predicted.
This reasoning is intuitively why as df increases, the upper bound increases too.
If you have a hypothesis that is perfectly predicting the parameter values, then the BF upper bound should be higher for models that have more parameters to predict.
In other words, it’s much more impressive for a hypothesis to perfectly predict 10 parameters than 2.

Despite the pattern making sense from a BF-inferential perspective, it makes little sense to claim p-values overstate the “evidence”, when in many cases the p-values understate the “evidence”.
Sure, a p=.05 may only imply a BF of 3-6.
But also, a p=.05 may imply a BF of 20-10000.

It depends on the hypotheses you are testing, and therefore there can be no general mapping between a p-value and the “evidence” indicated by a BF.
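
A short sketch of that spread, holding the p-value fixed at .05 and letting only the LRT df vary (the variable names are mine):

```r
# Upper-bound BF at a fixed p = .05 as the LRT df grows from 1 to 10.
dfs <- 1:10
bf_at_05 <- exp(qchisq(1 - .05, df = dfs) / 2)

# Same p-value in every row; only the hypotheses being compared change.
data.frame(df = dfs, bf_ub = signif(bf_at_05, 4))
```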

Final thoughts

The argument that p=.005 should be the new standard due to its relationship to the BF is questionable.
Philosophies aside, the p-value does not map onto the BF.
A p-value maps onto a BF, and only for a particular pair of hypotheses.
It all depends on the hypotheses you test. A p of .5 could mean a BF of 106; a p of .005 could mean a BF of 51; a p of .05 could mean a BF from 7 to 10000. Who knows?

The recommendation that all findings be labeled significant only when $p \leq .005$ is based purely on one particular set of hypotheses, under a set of assumptions, with the goal of matching it to upper bounds of one particular evidentiary metric.
It cannot generalize to all computed p-values, and I doubt we’d want a big table of significance thresholds corresponding to a range of BFs for every possible significance test.

Finally, I agree that a single p-value near .05 is insufficient evidence; just as I would agree that a single BF of 3-10000 would be insufficient evidence.
The reason I state that a single p-value near .05 is insufficient evidence is not because the .05 is necessarily weak, but rather that any singular metric is necessarily weak, when assessing the evidence for a hypothesis.
More on this in a future blog post, I’m sure.

In sum, use $\alpha = .005$ if you’d like. Use BFs if you’d like. Use posteriors and CIs if you’d like. Report all the things if you’d like.
Justify your inference within your inferential framework, and justify the decisions made.
But maybe don’t say p=.05 is necessarily weak because your p = .05.

And maybe all of this is wrong, I don’t know. You tell me.
