Bayes and optional stopping

When I explain to people why I love Bayes, it’s typically some combination of:

  • Build the model you want, with assumptions you want
  • Use probabilistic (soft) instead of hard constraints
  • Incorporate prior information
  • Identify difficult models with priors
  • Interpret the posterior how you’ve always wanted to interpret non-Bayesian quantities
  • Obtain better estimates and make better inferences thanks to priors and modeling the data-generating process (DGP) as you see fit
  • Infer without worrying too much about the sampling distribution

However, some people tack on an extra benefit:

  • Don’t worry about optional stopping; the likelihood principle says there’s no problem with it.

I’ve long stopped saying that, because I’ve stopped believing it.

Stopping rules do affect your inference. I am not the first to say this; a Google Scholar search on the topic turns up several papers.
First, a stopping rule can, by its very nature, provide information about the parameter.
Second, stopping rules affect the sample space, that is, the set of possible outcomes used to inform the parameter. Therefore, even if a stopping rule doesn’t directly modify the likelihood of a particular set of observations, it can affect the distribution of parameter values that could even be inferred from that sample space. This can be represented in the prior specification, $p(\theta|S)$, where $S$ is the stopping rule, among other places in the model.

I think there are ways out of that predicament: model a DGP that takes into account the modification of the sample space, and the consequent modification of parameter estimates, that results from observation-, estimate-, or posterior-dependent stopping rules. Essentially, a joint generative model must account for the observations, $N$, the modified sample space, and so on, all induced by the stopping rule (which is itself contingent on the current data, $N$, and posterior quantities: a confusing mess). But that’s not my point here (and I imagine it’s insanely complicated; should there be a solution, it deserves a paper, not a blog post).

Instead, I just want to argue that Bayesian inference is affected by stopping rules.
The effect of stopping rules such as “95% HDIs must exclude zero” on parameter estimates is already known. These sorts of stopping rules are guaranteed to bias your inference. You’re sampling from a population until a magic sequence of observations produces the desired non-zero interval, then turning off the data-collecting machine once the moving posterior has moved far enough away. Most probably, this will result in an overestimate and a misleading sense of certainty about the direction and magnitude of an effect. Observations are random, and randomly ordered. At some point, as collection continues toward infinity, some sequence will be extreme enough to sway the posterior estimate toward an extreme, and it’s at that moment that the collection machine turns off. Should you let the collection machine continue, you would likely see a sequence of observations that moves the posterior back toward the expected region. In essence, you’re waiting for a sequence of events extreme enough to push the posterior distribution sufficiently far from zero, then stopping. Yes, of course this is going to bias your estimate: you are waiting for an extreme event in order to make your (now extreme) estimate.
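As a toy illustration of that mechanism (my own sketch, not a full treatment), assume normally distributed data with known unit variance and a true mean of zero, a conjugate standard-normal prior on the mean, and the rule “stop as soon as the 95% credible interval excludes zero” (capped at 400 observations):

```r
# Toy sketch: x ~ N(0, 1), so the true mean is 0. Conjugate N(0, 1) prior on the mean.
# Optional stopping = stop as soon as the 95% credible interval excludes zero.
set.seed(1)
n_max  <- 400
n_sims <- 2000

one_run <- function(optional_stopping = TRUE) {
  x <- rnorm(n_max)
  post_mean <- NA
  for (n in 1:n_max) {
    post_prec <- 1 + n                   # prior precision 1 plus n unit-variance observations
    post_mean <- sum(x[1:n]) / post_prec
    half      <- 1.96 / sqrt(post_prec)  # half-width of the 95% credible interval
    if (optional_stopping && abs(post_mean) > half) break
  }
  post_mean
}

stopped <- replicate(n_sims, one_run(TRUE))
fixed   <- replicate(n_sims, one_run(FALSE))
mean(abs(stopped))  # average |posterior mean| at the moment of stopping
mean(abs(fixed))    # average |posterior mean| under fixed N (should be noticeably smaller)
```

The runs that stop early are exactly the runs that happened upon an extreme sequence, so the stopped estimates sit farther from the true value of zero on average.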

This problem is manifest in any stopping rule, really. However, some stopping rules aren’t as inherently problematic as others. For instance, sampling until a desired precision (posterior SD or SE) is obtained isn’t really problematic for parameter locations. Hitting that rule generally just means you have sufficient information about the parameter, and all is well; you aren’t waiting for an extreme event that translates the posterior before stopping, and you aren’t waiting for a sequence that would spuriously confirm a hypothesis before stopping. I’m guessing there would be some amount of bias in the posterior width, in the sense that some sequence of events could spuriously make the posterior narrower than it would have been had you collected $\Delta_n$ more (presumably extreme) observations, and therefore you could have a negligibly spurious amount of (un)certainty about the estimate. But to summarize: as long as you decide to turn off the data-collection machine upon hitting some data-, estimate-, or posterior-dependent condition, no one should be surprised that the data, estimates, or posteriors obtained from such a procedure overrepresent those conditions, and there will therefore be some undesirable properties.

Bayes Factors are not immune to this issue. A common selling point of Bayes Factors (and Bayesianism in general) is that the sampling intention and stopping rule are irrelevant to the inferential procedure.
As stated, this is correct. The Bayesian posterior, and the BF (which is not a posterior quantity), are interpreted the same way regardless of the sampling intention. P-values, on the other hand, depend explicitly on the sampling distribution, and optional stopping and sampling intentions both modify that sampling distribution; great care must be taken to obtain an interpretable p-value, let alone a correct one.

Although Bayesian quantities are interpretable regardless of sampling intentions and stopping rules, the probability of making particular Bayesian decisions is certainly affected by optional stopping, and not in a desirable way.
Here is my simple example; then I’ll stop rambling.

We’ll use the BF, since people too often consider the BF a panacea of inference that is immune to the stopping rule.
Bayes factors are essentially a ratio of prior-predictive success under two priors defined by the competing hypotheses. With optional stopping, what you are doing is waiting for a sequence of observations under which one hypothesis’s prior predictive success is relatively high. It should be no surprise, then, that when the stopping condition is itself “relatively high prior predictive success,” some sequences of data will spuriously overstate prior predictive success, and thus the evidence for one hypothesis over the other.
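To make that concrete, here is the standard definition (nothing here is specific to this example): the Bayes factor comparing $H_1$ to $H_0$ is the ratio of prior predictive (marginal) likelihoods,

$$BF_{10} = \frac{p(y|H_1)}{p(y|H_0)} = \frac{\int p(y|\theta)\,p(\theta|H_1)\,d\theta}{\int p(y|\theta)\,p(\theta|H_0)\,d\theta},$$

so stopping when $BF_{10}$ (or its reciprocal) crosses a threshold means stopping whenever one prior happens to have predicted the observed sequence comparatively well.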

Two hypotheses are defined, conveniently the defaults for the BayesFactor package: H0, a point mass at $\delta = 0$, and H1, $\delta \sim \text{Cauchy}(0, 0.707)$. [1]
For this example, H0 is true, and $\delta = 0$.
Two stopping rules are defined.
For the first, the stopping rule is defined as a fixed-N stopping rule, such that the researcher collects N=400 and stops to evaluate the BF.
For the second, the stopping rule is defined as: Collect 4 observations, test whether the BF is beyond threshold for either H1 or H0; if so, stop and report the BF; if not, collect two more observations and repeat until N=400.

I simulated this procedure across 10,000 “studies”. Each study reported its BF at the time of stopping. With a BF threshold of 3, H1 is supported if BF > 3, H0 is supported if BF < 1/3, and neither is supported if the BF is in between.
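A rough sketch of that procedure (not the original simulation code; I’m assuming the one-sample ttestBF() from the BayesFactor package, whose rscale of 0.707 matches the Cauchy prior above, and far fewer than 10,000 studies, since recomputing the BF every two observations is slow):

```r
# Sketch of the two stopping rules. Data are generated under H0 (delta = 0).
library(BayesFactor)

bf10 <- function(x) extractBF(ttestBF(x = x, rscale = sqrt(2) / 2))$bf

one_study <- function(n_max = 400, threshold = 3) {
  x <- rnorm(n_max)                         # H0 is true
  fixed <- bf10(x)                          # fixed-N rule: evaluate once at N = 400
  optional <- NA
  for (n in seq(4, n_max, by = 2)) {        # optional stopping: check every 2 obs from n = 4
    b <- bf10(x[1:n])
    if (b > threshold || b < 1 / threshold) { optional <- b; break }
  }
  if (is.na(optional)) optional <- fixed    # never crossed a threshold: report the BF at N = 400
  c(fixed = fixed, optional = optional)
}

decide <- function(b, threshold = 3)
  ifelse(b > threshold, "H1", ifelse(b < 1 / threshold, "H0", "Neither"))

res <- replicate(300, one_study())          # a few hundred studies already shows the pattern
table(decide(res["fixed", ])) / ncol(res)
table(decide(res["optional", ])) / ncol(res)
```

Changing `threshold` to 10 in both `one_study()` and `decide()` reproduces the second comparison further below.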

|                   | H0    | Neither | H1    |
|-------------------|-------|---------|-------|
| Fixed-N           | .8629 | .1280   | .0091 |
| Optional stopping | .8467 | .0006   | .1527 |

Under fixed-N, 86% of the studies made the correct decision to support H0, 13% could not adequately distinguish between H1 and H0, and 1% erroneously [2] decided on H1.
Under optional stopping, the proportion that made the correct decision did not change much, but the percentage that erroneously decided on H1 was 14 percentage points higher (about 17x more often)!

Before you say “but a BF of 3 is too low,” let’s use 10 (and 1/10) as the threshold.

|                   | H0   | Neither | H1    |
|-------------------|------|---------|-------|
| Fixed-N           | 0.00 | .9975   | .0025 |
| Optional stopping | 0.00 | .9528   | .0472 |

I’m guessing the complete lack of support for H0 is just Monte Carlo error. But the point is again evident: H1 was supported 19x more often by the data (erroneously, by chance) when optional stopping was used than when it was not.

This is the problem I have with the claim that BFs and other Bayesian quantities are immune to optional stopping. Yes, the interpretation of the quantity remains the same. But also yes, if you use optional stopping, it changes the probability of making an erroneous inference or decision. The definitions are unchanged by optional stopping, but the probabilities of the resulting decisions are not.

So next time someone tells you that Bayesian inference is unaffected by optional stopping, ask them why the probability of making an erroneous decision increases notably when using Bayesian quantities produced from an optional stopping rule.


  1. The reader should know that I (1) have a list of grievances with the Bayes factor and do not actually like using it, and (2) hate the hypotheses tested by default in this package. 
  2. By “erroneously”, I mean that the truth is H0 and the data collected support H1. The data observed did indeed support H1, by chance, but this is erroneous given the known true state that H0 is true. 

Power analysis for SEM with only $\alpha$

I was helping a labmate with a power analysis problem. He’s planning a study and will use SEM. Unfortunately, the only psychometric information reported for the scale he’s using is a Cronbach’s $\alpha = .80$ with 14 items.

This is irritating, to say the least. Nevertheless, there are two approaches to it.
One, you could simulate “true” scores that have some desired relationship with another variable, then simulate 14 item-level realizations of those true scores using an error variance that corresponds to $\alpha = .80$; a sketch of this follows below.
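A minimal sketch of that first approach (my own; the outcome $y$ and its $r = .3$ relation to the true scores are made up purely for illustration). For unit-variance true scores and $J$ parallel items $x_{ij} = \theta_i + e_{ij}$, Cronbach’s $\alpha$ works out to $J/(J + \sigma^2_e)$, so the error variance implying a given $\alpha$ is $\sigma^2_e = J(1-\alpha)/\alpha$:

```r
# Approach one: simulate true scores, an outcome, and J noisy item realizations whose
# error variance implies the target alpha (assumes parallel items; y and r = .3 are made up).
set.seed(2)
alpha_target <- 0.80
J <- 14
n <- 1000

sigma2_e <- J * (1 - alpha_target) / alpha_target      # error variance implying alpha = .80

theta <- rnorm(n)                                      # "true" scores
y     <- 0.3 * theta + rnorm(n, sd = sqrt(1 - 0.3^2))  # hypothetical outcome, r = .3 with theta
items <- sapply(1:J, function(j) theta + rnorm(n, sd = sqrt(sigma2_e)))

psych::alpha(as.data.frame(items))$total$raw_alpha     # should land near .80
cor(rowMeans(items), y)                                # attenuated observed correlation
```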

But I wanted to see whether we could construct a latent variable model that implies an $\alpha = .80$.

Reliability in SEM is a complicated thing, with many definitions. Some are interested in the AVE (average variance extracted). Then there are $\omega_1, \omega_2, \omega_3$, the definitions of which can be found in the semTools::reliability help page.
$\omega_1$ and $\omega_2$ are more similar to each other than either is to $\omega_3$, but generally the definition can be thought of as: if there is a latent variable, and we see $J$ realizations of that latent variable, what proportion of the variance in the sum scores is attributable to the latent variable?
This is an estimate of the “true” reliability, which can be thought of as the proportion of variance in $\hat\theta$ explained by $\theta$. Of course, in reality we don’t have $\theta$, so we settle for the proportion of variance in some observed metric that is explained by the latent estimate. With greater N, this estimate converges to the “true” reliability.
Side note: if you have a high N but a crappy measure, your reliability does not improve; it just means your estimate of that crappy reliability more accurately reflects the true crappiness.

Here’s the basic, probably imperfect, derivation of reliability:

$$\tilde x_i = \sum_j x_{ij}$$

$$x_{ij} = \hat x_{ij} + \epsilon_{ij}$$
$$x_{ij} = \lambda_j\theta_i + \epsilon_{ij}$$
$$\tilde x_i = \sum_j \lambda_j\theta_i + \sum_j\epsilon_{ij}$$
$$Var(\tilde x_i) = Var(\sum_j \lambda_j\theta_i + \sum_j\epsilon_{ij})$$
$$Var(\tilde x_i) = (\sum_j\lambda_j)^2Var(\theta_i) + \sum_j Var(\epsilon_{ij})$$
$$Reliability_{\omega_2} = \frac{(\sum_j\lambda_j)^2Var(\theta_i)}{(\sum_j\lambda_j)^2 Var(\theta_i) + \sum_j Var(\epsilon_{ij})}$$

If the latent variance is set to 1 and the outcomes are standardized: $$\omega_2 = \frac{(\sum_j\lambda_j)^2}{(\sum_j\lambda_j)^2 + \sum_j(1 - \lambda_j^2)}$$
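As a quick check of that formula, a hypothetical one-liner (assuming standardized items and a unit-variance latent variable):

```r
# omega_2 from standardized loadings (hypothetical helper, not from semTools)
omega2 <- function(lambda) sum(lambda)^2 / (sum(lambda)^2 + sum(1 - lambda^2))
omega2(rep(0.4714, 14))   # with the equal loadings derived below, this is ~.80
```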

Well, what we have is $\alpha$, not $\omega_2$. However, $\alpha$ can be approximated in SEM assuming that the residual variance and loadings are equal across items.
This further simplifies things.
$$\alpha = \frac{(J\lambda)^2}{(J\lambda)^2 + J\delta} = \frac{(J\lambda)^2}{(J\lambda)^2 + J(1 - \lambda^2)}$$
Solving for $\lambda$ gives you:
$$\lambda = \sqrt{\frac{\alpha}{(J – \alpha(J-1))}}$$

This represents the $\lambda$ needed to obtain $\alpha$ with $J$ standardized items, assuming a standardized latent variable.
We tested it to be sure; we simulated data using lavaan in R, and sure enough, running psych::alpha(x) gave $\alpha \approx .80$.
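Something along these lines (a sketch of what we did, not the exact code; the population model fixes equal loadings, a unit latent variance, and residual variances of $1 - \lambda^2$):

```r
library(lavaan)

alpha_target <- 0.80
J <- 14
lambda <- sqrt(alpha_target / (J - alpha_target * (J - 1)))   # ~0.47 for alpha = .80, J = 14

# One-factor population model: equal standardized loadings, latent variance 1,
# residual variances 1 - lambda^2.
items <- paste0("x", 1:J)
pop_model <- paste0(
  "f =~ ", paste(lambda, "*", items, collapse = " + "), "\n",
  "f ~~ 1 * f\n",
  paste(items, "~~", 1 - lambda^2, "*", items, collapse = "\n")
)

set.seed(3)
dat <- simulateData(pop_model, sample.nobs = 5000)
psych::alpha(dat)$total$raw_alpha   # comes out near .80
```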

This is not to say that when a scale has $\alpha = .80$ its factor loadings actually equal the value above (they won’t), but if you only have the reliability and you want to simulate a measurement model that implies that reliability, then there you go.
This is also not for doing power analysis for the factor loadings themselves, of course.
But if you wanted to do a power analysis for a correlation or structural coefficient with one or more variables being latents + measurement error, this approach may work when information is lacking.

Finally, a blog with [drum roll please…]

Mathjax and markdown. [1]

Markdown

Markdown is enabled for both posts and comments. It can be used like this:

    Markdown
    ======
    Markdown is enabled for both posts _and_ comments. It can be used like this:

Recursion!

Mathjax

One thing I hate about commenting on facebook, twitter, and most blogs is the lack of latex or mathjax equations.
But behold, if/when you want to tell me my math is terrible, you can!

`$p(\theta|y) = \frac{p(y|\theta)p(\theta)}{p(y)}$`

produces:
$p(\theta|y) = \frac{p(y|\theta)p(\theta)}{p(y)}$

Displayed equations can just use two \$\$ on each side instead: $$p(\theta|y) = \frac{p(y|\theta)p(\theta)}{p(y)}$$


  1. My markdown and latex [2] addictions are really out of control at this point. 
  2. To clarify, my addiction is to $\LaTeX$, not latex. 

Well, it’s about time.

I spend entirely too much time in the facebook groups (psychMAD, psychMAP, R2 must be stopped, bayesian methods, etc.) discussing psychology, statistics, and inference. Every day, I open my browser to visit entirely too many blogs about statistics and inference. I respond to emails on mailing lists about statistics and inference. At times, I even spend hours in [usually civil] debates about these things, making new friendships, alliances, and in all probability, some accidental enemies.

At the same time, I have no real platform to vent frustrations about the life of a graduate student, nor do I have a space for going into much more detail about my perspectives on stat methods.

To feed this addiction of mine, and because I need a more permanent web presence, I’m finally just giving in and starting a blog. We’ll see how it goes.

This blog will probably lean toward thoughts about quantitative methods, frequentism and bayesianism, rants about my own field, and grievances about the graduate life.  Some may be short [and hopefully, at least somewhat humorous; at least to me], some may be longer and go into depth about some online debate I’m in, some simulation I ran to combat an argument, or some fun new intuitions or methods I’ve built.

If you already know me from twitter or facebook, hit me up with any questions you may want my longer thoughts on.