# The absurdity of mapping p-values to Bayes factors

In the “Redefine statistical significance” (RSS) paper, the authors argued that p = .05 corresponds to low evidentiary value.

Let’s unpack what is meant by that.

The authors’ argument rests on an “upper bound” on how large the Bayes Factor can be given a particular p-value.

Recall that a Bayes Factor is defined as:

$$\frac{p(D|H_a)}{p(D|H_0)} = \frac{\int_l^u p(D|\theta,H_a)p(\theta|H_a)d\theta}{\int_l^u p(D|\theta,H_0)p(\theta|H_0)d\theta}$$

The parameter prior, $p(\theta|H_k)$, corresponds to an uncertain prediction from some hypothesis, $H_k$.
The marginal likelihoods computed in the numerator and denominator above can then be thought of as “prior predictive success”, or “predictive success, marginalized over the uncertainty expressed in the parameter from hypothesis $k$”. The BF10 then represents the evidence in favor of hypothesis $a$ over hypothesis $0$.
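To make the marginal-likelihood idea concrete, here is a toy sketch (my own example, not one from the RSS paper): $H_0$ fixes $\theta = 0$, $H_a$ places a normal prior on $\theta$, and each marginal likelihood is the likelihood averaged over the corresponding prior.

```r
# Toy example: sample mean ybar of n observations with known sd = 1,
# so ybar ~ Normal(theta, 1 / sqrt(n))
n <- 30
ybar <- 0.4
se <- 1 / sqrt(n)

# H0 is a point mass at theta = 0: its marginal likelihood is just
# the likelihood evaluated at theta = 0
m0 <- dnorm(ybar, mean = 0, sd = se)

# Ha says theta ~ Normal(0, 1): marginalize the likelihood over that prior
ma <- integrate(function(theta) dnorm(ybar, theta, se) * dnorm(theta, 0, 1),
                lower = -Inf, upper = Inf)$value

bf10 <- ma / m0  # prior predictive success of Ha relative to H0
```

In this toy case the integral also has a closed form (the prior predictive under $H_a$ is normal with sd $\sqrt{1 + 1/n}$), which makes the numeric result easy to check.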

The authors wished to argue that p = .05 necessarily has a low upper bound on evidence in favor of the alternative hypothesis.

RSS used as an example a BF for a two-tailed alternative hypothesis in comparison to a nil-null hypothesis.
This makes intuitive sense, because a commonly used test is the two-tailed t-test with a nil-null hypothesis.
In the latter case, the p-value corresponds to the probability of observing a t-statistic at least as extreme as the observed statistic, assuming that the true sampling distribution is centered on zero.
No explicit alternative hypothesis really exists; the statement is simply $H_a: \theta \neq 0$, which isn’t exactly a useful hypothesis, but I digress.

In contrast, the BF does use two explicit hypotheses.
$H_0: p(\theta|H_0) = \begin{cases}1 & \text{if } \theta = 0 \\ 0 & \text{if } \theta \neq 0\end{cases}$
and
$H_a$ depends slightly on how favorable you wish to be, in order to obtain an upper bound.
The most direct upper bound on the BF for p = .05 with some specified N would simply be another point mass located exactly at the .975 quantile of the t-distribution.
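As a quick sketch of that oracle bound, using the large-sample normal approximation in place of the t (my simplification, not the RSS authors’ exact calculation): put the alternative’s point mass exactly at the observed z, and the BF collapses to a ratio of two normal densities.

```r
# Two-tailed p = .05 corresponds to z = 1.96 in the normal approximation
z <- qnorm(.975)

# Oracle alternative: a point mass exactly at the observed statistic.
# The BF is then a plain likelihood ratio of two normal densities,
# which algebraically equals exp(z^2 / 2), about 6.8
bf_ub <- dnorm(z, mean = z, sd = 1) / dnorm(z, mean = 0, sd = 1)
```

Note that no prior uncertainty survives in either hypothesis here, which is exactly why this gives the most favorable (upper bound) BF.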

Using this line of reasoning, they argue that at best, a p = .05 corresponds to a small BF, and as such, p = .05 has a low upper bound on evidentiary value.
Instead, they argue that p = .005 would correspond to a more considerable evidentiary value, with upper-bound BFs ranging from roughly 14–26.

So what’s the problem? Seems reasonable!

# The problem(s)

First, as mentioned in the “Justify your alpha” paper, why should the rate at which a hypothetical error is presumably controlled map onto a marginal likelihood ratio (I refuse to call this “evidence”, as though the whole of scientific evidence can be expressed in a single unidimensional metric of prior predictive success ratios)?
We could likewise flip this around: how about we devise Bayes factor thresholds that map onto hypothetical error rates, so that these crazy Bayesians don’t make more or fewer type 1 errors than expected? (I am not encouraging this, just highlighting the arbitrariness.)

Second, these are metrics of completely different things! A p-value is exactly what it states: the probability of observing what we did (or something more extreme), assuming some state-to-be-nullified is true. The BF is a ratio of prior predictive success, or a marginal likelihood ratio, or the ratio of probability densities of the observed data under two prior-predictive distributions corresponding to two hypotheses.
The goals of each are different too.
The p-value is just a “surprise” factor, given some assumed state.
The BF is a relative measure, evaluating the data under two certain or uncertain states.
The p-value is used inferentially as a means to reject some state if the observation is sufficiently improbable under it.
The BF is used to express how much one’s prior odds in each state would change as a function of the data.
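That updating role can be written as posterior odds = BF × prior odds. A quick sketch with hypothetical numbers:

```r
# Hypothetical numbers: indifferent prior odds, and a BF10 of about 6.8
prior_odds <- 1
bf10 <- 6.8

# The BF multiplies the prior odds to give posterior odds of Ha vs H0
posterior_odds <- bf10 * prior_odds
posterior_prob <- posterior_odds / (1 + posterior_odds)
```

The p-value plays no analogous updating role; it conditions on one state only.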

Why these things should map onto one another is not intuitive.
The p-value has its uses. The BF has its uses. I’m not sure why one needs to be defined in terms of the other.

Third, and most importantly, the mapping from “highly evidentiary” upper-bound BFs to p-values (or alphas) is not generalizable. Richard Morey has an excellent series on this topic. The gist is that the whole “p = .005 has reasonable upper bounds, but p = .05 does not” claim depends entirely on which hypotheses you are testing. And as I said above, the RSS authors used a two-tailed alternative against a point-mass null; from the standpoint that statistical hypotheses should map onto their respective substantive hypotheses, this makes no sense anyway. More on that in a future blog post.

I decided to take it a step further.

# Likelihood Ratio Tests

A likelihood ratio test is defined as:
$$Q = -2(LL_0 - LL_a) \\ Q = 2(LL_a - LL_0) \\ Q = 2\log\frac{L_a}{L_0} \\ k_i = \text{Number of parameters in model } i \\ df = k_a - k_0 \\ Q \sim \chi^2(df)$$

We can solve for the likelihood ratio required to reject at some $\alpha$ level:
$$e^{\frac{Q_{c,\alpha,df}}{2}} = \frac{L_a}{L_0}$$
Given some $\alpha$ and $df$, find the critical value for the $\chi^2$ statistic, $Q_{c,\alpha,df}$, then simply solve for the likelihood ratio.

Now, let’s define the upper bound BF.
As it turns out, if you have two hypothetical models with point masses at fixed values ($\hat\theta$), then the BF is simply a good ol’ fashioned likelihood ratio.
$$p(D|H_a) = \int p(D|\theta,H_a)p(\theta|H_a)d\theta \\ \text{If }p(\theta|H_a) = \begin{cases}1 & \text{if } \theta=\hat\theta \\ 0 & \text{if } \theta\neq\hat\theta\end{cases} \\ p(D|H_a) = p(D|\hat\theta,H_a)$$
Meaning, that the BF will just become a likelihood ratio.

If you have model A with 2 parameters, and model 0 with 1 parameter, and p = .05, what is the upper bound BF ($BF_{ub}$)?
$$df = 1 \\ Q_{c,\alpha=.05,df=1} = 3.841459 \\ e^{Q/2} = 6.825936 = BF_{ub}$$
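These numbers are easy to reproduce in R:

```r
# Critical chi-square at alpha = .05 with df = 1, then convert to the
# likelihood-ratio scale to get the upper-bound BF
q_c <- qchisq(.95, df = 1)
bf_ub <- exp(q_c / 2)
round(q_c, 6)   # 3.841459
round(bf_ub, 6) # 6.825936
```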

So if you are comparing two models with a difference of one parameter, and you see p = .05, your upper bound Bayes Factor is 6.825936.
Basically, that means if your alternative hypothesis has one more parameter, predicts exactly the estimate obtained, and you got p = .05, then the greatest BF you could see is 6.825936.
That’s not stellar, which is the point of the RSS argument.

Not so fast! That’s in one particular scenario, wherein you had two models, and the only difference is the inclusion of an extra parameter.
Let’s vary the p-value obtained and the df for the test. Recall that df in LRTs represents the difference in parameters estimated for each model, so we are varying how complex the alternative model is from the reference or null model.
Plot time.

```r
library(ggplot2)
alphas <- seq(.005, .5, by = .005)
dfs <- 1:10

ds <- expand.grid(alpha = alphas, df = dfs)
# Upper-bound BF at each p-value and df: exp(Q_c / 2)
ds$bf_ub <- exp(qchisq(ds$alpha, ds$df, lower.tail = FALSE) / 2)
ggplot(ds, aes(alpha, bf_ub, color = factor(df))) + geom_line()
```