Blume et al. (2017) posted a paper to the arXiv about a second-generation p-value.
The proposal is essentially as follows.
- Define an interval [a, b] of estimates that would be considered support for the null. E.g., a d statistic might have a null interval of [-.1, .1], with the explanation that if the true d falls within that interval, the effect is useless, incompatible with the substantive hypothesis, etc.
- Compute the statistic and some interval representing the range of values compatible with the data. In frequentist terms, this may be a CI constructed such that any value outside the interval would be rejected if it were taken as the null hypothesis. In Bayesian terms, it may be a highest-density or quantile interval of the posterior for the estimate.
- Compute the proportion of the statistic interval that is contained within the null interval. If the statistic interval is "too large" (in their formulation, more than twice the length of the null interval), apply a correction so the proportion maxes out at .5.
- The second generation p-value is then the proportion of the “data-consistent interval” that overlaps with the “null-consistent interval”.
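The steps above amount to a simple interval-overlap computation. Here is a minimal sketch; the function name and the edge-case handling for a degenerate (zero-width) interval are my own, not from the paper:

```python
def sgpv(lo, hi, null_lo, null_hi):
    """Second-generation p-value: proportion of the data-consistent
    interval [lo, hi] overlapping the null interval [null_lo, null_hi],
    with a correction capping very wide intervals at 0.5."""
    overlap = max(0.0, min(hi, null_hi) - max(lo, null_lo))
    length = hi - lo
    null_length = null_hi - null_lo
    if length == 0:  # degenerate point-estimate "interval"
        return 1.0 if null_lo <= lo <= null_hi else 0.0
    # If the interval is more than twice the null interval's length,
    # the reported proportion is overlap / (2 * null length), which
    # maxes out at 0.5 when the null interval is fully covered.
    return (overlap / length) * max(length / (2 * null_length), 1.0)

# interval fully inside the null interval:
print(sgpv(-0.05, 0.05, -0.1, 0.1))  # 1.0
# very wide interval covering the whole null region: capped at 0.5
print(sgpv(-1.0, 1.0, -0.1, 0.1))    # 0.5
```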
Some properties include:
- If H0 is true, such that the true parameter value is within the null interval, then as information increases, the p-value will increase to 1.
- If H1 is true, such that the true parameter is outside of the null interval, then as information increases, the p-value will decrease to 0.
- If an estimate is directly on the null interval border, the p-value will be .5.
- The authors argue that it outperforms alternatives in other respects, such as controlling type I error rates, even when compared to FDR/FWER correction procedures.
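The first convergence property is easy to see in simulation. A rough sketch, assuming a known-variance normal model with a 95% CI for the mean; the overlap function and all numbers here are my own illustrative choices:

```python
import math
import random

def overlap_proportion(lo, hi, null_lo=-0.1, null_hi=0.1):
    # proportion of [lo, hi] inside the null interval, with the
    # wide-interval correction capping the value at 0.5
    overlap = max(0.0, min(hi, null_hi) - max(lo, null_lo))
    return (overlap / (hi - lo)) * max((hi - lo) / (2 * (null_hi - null_lo)), 1.0)

random.seed(1)
for n in (10, 100, 10_000):
    # H0 true: data from N(0, 1), so the true mean lies inside the null interval
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    mean = sum(xs) / n
    half_width = 1.96 / math.sqrt(n)  # known-sigma 95% CI half-width
    print(n, round(overlap_proportion(mean - half_width, mean + half_width), 3))
```

As n grows, the CI shrinks inside the null interval and the second-generation p-value climbs toward 1; under a true effect outside the interval it would instead fall toward 0.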
I like the idea, though it sounds very much like a frequentist rediscovering the Bayesian ROPE concept: define a region of practical equivalence, and use the posterior density to evaluate the probability of the parameter falling within the ROPE. In fact, it's pretty much the same thing; the only real difference is that the second-generation p-value uses overlapping intervals, rather than probability mass contained within an interval. The Bayesian ROPE has a probabilistic interpretation, while the authors are explicit that the second-generation p-value is just a descriptive, a proportion, not a posterior probability or anything similar. I respect them for admitting that, rather than trying to shoehorn in a probabilistic interpretation.
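For contrast, the ROPE quantity is a probability mass rather than an interval overlap. A minimal sketch, assuming (purely for illustration) a normal posterior for the parameter; the function names are hypothetical:

```python
from statistics import NormalDist

def rope_probability(post_mean, post_sd, rope_lo=-0.1, rope_hi=0.1):
    # posterior probability mass inside the ROPE, assuming a normal
    # posterior for the parameter (an illustrative choice, not general)
    post = NormalDist(post_mean, post_sd)
    return post.cdf(rope_hi) - post.cdf(rope_lo)

# posterior tightly concentrated inside the ROPE: mass near 1
print(rope_probability(0.0, 0.02))
# posterior centered well outside the ROPE: mass near 0
print(rope_probability(0.5, 0.05))
```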
One thing I’m curious about is whether another X-generation p-value could essentially be a marginalized p-value.
This may already exist, in which case I am rediscovering it, much like they may have rediscovered the ROPE.
The p-value (for a nil-null hypothesis) is defined as: $p(t > T(y,\theta)|\theta = 0)$. What if, instead of defining an interval, we defined the null hypothesis as a density?
E.g., instead of saying $H_0: \theta \in [-.1,.1]$, we said $H_0: \theta \sim N(0,.05)$, or something similar.
Then we would compute: $\int p(t > T(y,\theta)|\theta)\,p(\theta)\,d\theta$.
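That integral is easy to approximate by Monte Carlo. A rough sketch for a known-variance z-test, with every name and default (prior scale, draw count, seed) being my own illustrative choice:

```python
import math
import random

def p_value(ybar, theta, n, sigma=1.0):
    # two-sided z-test p-value for H0: mean = theta, known sigma
    z = abs(ybar - theta) * math.sqrt(n) / sigma
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))

def marginal_p(ybar, n, prior_mean=0.0, prior_sd=0.05, draws=20_000, seed=7):
    # Monte Carlo approximation of the marginalized p-value:
    # average the p-value over theta drawn from the null density N(prior_mean, prior_sd)
    rng = random.Random(seed)
    thetas = [rng.gauss(prior_mean, prior_sd) for _ in range(draws)]
    return sum(p_value(ybar, th, n) for th in thetas) / draws
```

An estimate near the null density stays "non-significant" (`marginal_p(0.0, 100)` is large), while one far from it marginalizes to a tiny value (`marginal_p(0.5, 100)` is near zero).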
Before any frequentists jump on me for putting a (GASP) distribution on a parameter: just imagine it is a probabilistic weight, for penalization, rather than some philosophical statement about the a priori plausibility of parameter values.
But I think that, too, would have interesting and similar properties.
I may explore that at some point. It’d be conceptually similar to assurance, which is just marginal power when an effect size is uncertain.
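Assurance itself has the same marginalization structure: average the power curve over an effect-size prior. A minimal sketch for a one-sample two-sided z-test; the prior on the effect size and all defaults are hypothetical:

```python
import math
import random
from statistics import NormalDist

def power(d, n, alpha=0.05):
    # power of a two-sided one-sample z-test at true standardized effect d
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    shift = d * math.sqrt(n)
    return (1 - nd.cdf(z_crit - shift)) + nd.cdf(-z_crit - shift)

def assurance(n, prior_mean=0.5, prior_sd=0.1, draws=20_000, seed=11):
    # marginal power: average power over an effect-size prior d ~ N(prior_mean, prior_sd)
    rng = random.Random(seed)
    return sum(power(rng.gauss(prior_mean, prior_sd), n) for _ in range(draws)) / draws
```

With a fixed effect size, `power(d, n)` is the usual conditional power; `assurance(n)` replaces the point value with an average over the uncertainty in d.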
The stickiest point, for me, is the use of CIs as the data-consistent intervals. Only certain CIs can be constructed with the property needed for this to work, and even then they can be problematic. That leaves credible intervals as an alternative, but then you may as well just use the posterior for inference anyway; you’re in Bayes land, just enjoy your time there.
The other issue is the case where the true parameter value is, somehow, EXACTLY equal to a null interval boundary: the second generation p-value would asymptotically be .5. I’m not sure that’s bad or unintended, given that it would imply you can’t really choose between the null and the alternative, but it’s a bit annoying to have a singularity of indecision.
Anyway, it may be an interesting metric. I haven’t decided yet whether I like it more than a p-value (but still less than a Bayesian treatment, of course), but I see it as potentially useful nevertheless.