Yes, collect more data and pool them.

A question arose on one of the numerous Facebook methods groups. The question, in essence, was: I have a sample of 30 from a dissertation; I want to collect 30 more observations to have better [sic] results for a publication. Should I use a Bayesian analysis with the prior informed by the first 30, or should I just use an uninformative prior and completely pool the data? My response: Yes, collect more data and just completely pool the data.

Why

Partial pooling is great, and I encourage it nearly all of the time. However, in this case there was just an arbitrary break in the data collection from the same process, so I think complete pooling is fine (also, partial pooling with only two samples seems pointless, but I could be wrong).
You could of course use the posterior from the first 30 as the prior for the second 30, but this winds up being the same thing as just completely pooling the data, assuming the observations are independently sampled from the same process:
\begin{align}
p(\theta|y_1) &= \frac{p(y_1 | \theta)p(\theta)}{p(y_1)} \\
p(\theta, y_1, y_2) &= p(y_2 | \theta, y_1)p(\theta|y_1)p(y_1) \\
p(\theta|y_1,y_2) &\propto p(y_2 | \theta, y_1)p(y_1|\theta)p(\theta) \\
p(\theta|y_1,y_2) &\propto p(y_2 | \theta)p(y_1|\theta)p(\theta) \\
p(\theta|y_1,y_2) &\propto p(y_1,y_2|\theta)p(\theta) \\
p(\theta|y_1,y_2) &\propto p(y|\theta)p(\theta)
\end{align}
So, just to save time, I suggested completely pooling the data.
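A minimal sketch of the same point, assuming a conjugate normal model with known sigma (update_normal, y1, and y2 below are just illustrative names): updating the prior with the first batch and then using that posterior as the prior for the second batch lands on exactly the same posterior as pooling both batches under the original prior.

update_normal <- function(prior_mean, prior_var, y, sigma = 1) {
  # Conjugate update for a normal mean with known sigma
  prec <- 1 / prior_var + length(y) / sigma^2
  list(mean = (prior_mean / prior_var + sum(y) / sigma^2) / prec,
       var  = 1 / prec)
}

set.seed(123)
y1 <- rnorm(30, 0.4)                                     # first batch of 30
y2 <- rnorm(30, 0.4)                                     # second batch of 30

post1      <- update_normal(0, 10, y1)                   # posterior from y1 alone
sequential <- update_normal(post1$mean, post1$var, y2)   # ...used as the prior for y2
pooled     <- update_normal(0, 10, c(y1, y2))            # original prior, pooled data

unlist(sequential)
unlist(pooled)                                           # identical posterior mean and variance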

Another user was a bit suspicious, though.
This user said that if you collect 30 observations that suggest something interesting enough to justify collecting more observations, then this data-dependent choice will bias your results.

I disagree! This is very close to the optional-stopping problem, in which repeatedly testing and stopping once you hit significance inflates the probability of a wrong decision, but it differs for a special reason: data collection is not stopped upon hitting “significance”, it is continued after hitting “significance”. If you stop as soon as you hit significance, then yes, you will bias your result and inflate error rates, if you care about that sort of thing.
But if you continue sampling some fixed additional N after hitting significance, the resulting estimate is less biased, and if the “null” is true, you will actually reduce the error rate.

The reason it is less biased is simply that if you stopped after obtaining significance with N = 30, then your estimate is, for most effect sizes, necessarily an overestimate: it must be extreme in order to pass the threshold when the study is underpowered (and for most effect sizes, N = 30 is underpowered). If you then sample 30 more, your estimate will recede toward the mean; it is unlikely to become more extreme and likely to become less so. Gaining more observations means gaining a better representation of the population, and you are far more likely to observe new values near the typical set than values more extreme than anything you have already observed.
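One way to see this, writing $N_\Delta$ for the number of added observations and conditioning only on the first sample’s significance: the added observations are drawn without any selection, so their expected mean is just $\mu$, and the pooled estimate is a sample-size-weighted average that gets pulled from the inflated conditional estimate toward $\mu$ as $N_\Delta$ grows:
\begin{align}
E[\bar{y}_{pooled} \mid \text{sig}] &= \frac{N \, E[\bar{y}_1 \mid \text{sig}] + N_\Delta \, \mu}{N + N_\Delta}
\end{align}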

The reason you have a lower type 1 error rate (assuming, of course, that the nil-null is “true”) is similar: with N = 30 you still have an $\alpha$ chance of a type 1 error. Given that type 1 error, sampling $N_\Delta = 30$ more observations will draw your chance-extreme estimate back toward the true mean of zero. To make a type 1 error after the $N_\Delta$ additional observations, think about what would need to occur: given some statistic $Z_n$ for the first $N$ observations, what would the $N_\Delta$ sample have to look like to maintain at least $Z_\alpha$ for rejection in the same direction? Or: what sample would you need to remain significant purely by chance, and what is the probability of obtaining that sample from a nil-null sampling distribution? It’s small!
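Here is a quick sketch of that calculation, using a z-test with known variance as a stand-in for the t-test in the simulation code below: under the null, how often does a chance-significant first sample of $N = 30$ stay significant in the same direction after pooling in $N_\Delta = 30$ more observations?

set.seed(1)
n1 <- 30; n2 <- 30; alpha <- .05
z_crit <- qnorm(1 - alpha / 2)
z1 <- rnorm(1e6)                       # z statistics of first samples under the null
z2 <- rnorm(1e6)                       # independent z statistics of the added samples
sig1 <- abs(z1) > z_crit               # first samples that were significant by chance
# pooled z statistic after combining the n1 + n2 observations
z_pooled <- (sqrt(n1) * z1 + sqrt(n2) * z2) / sqrt(n1 + n2)
maintained <- sig1 & abs(z_pooled) > z_crit & sign(z_pooled) == sign(z1)
mean(maintained[sig1])                 # conditional probability of staying significant, roughly a third
mean(maintained)                       # overall type 1 error, close to the ~.016 from the full simulation below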

Simulation

When the null is true

Let’s test it via simulation.
The following procedure is simulated 100,000 times (R code at the end of the post).

  1. Collect N=30 from Null.
  2. If significant, collect 30 more.
  3. Report $p$, the estimate, and the n (30 or 60).
  • Type 1 error: .01641
  • Expected effect estimate: .00042
  • Effect estimate conditioned on significance: -.01311

Note that this conditional effect estimate should be biased, but because the test is two-tailed, the positive and negative biases cancel each other out (see the snippet below).
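To see the cancellation, condition on the sign of the estimate as well (this assumes the reps.null object from the simulation code at the end of the post):

sig <- reps.null[,'p'] < .05
mean(reps.null[sig & reps.null[,'eff'] > 0,'eff'])   # positive flukes: biased upward
mean(reps.null[sig & reps.null[,'eff'] < 0,'eff'])   # negative flukes: biased downward by about the same amount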

When the true effect is .4

  • Power: .54841
  • Expected effect estimate: .36552
  • Effect estimate conditioned on significance: .46870
  • Effect estimate when conditioned on obtaining $N_\Delta = 30$ more after initial significance: .46207

The confirmatory approach

But what if we instead did only what the other user suggested: ignore the first sample and just collect a new N=30 sample as a confirmatory study?
That is, collect N=30; if significant, collect another N=30 and analyze that second sample by itself. Since the confirmatory sample is independent of the first and drawn from the same distribution, we only need to simulate the confirmatory sample itself (sketched below).
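A standalone sketch of this confirmatory-only analysis (sim_confirm below is just an illustrative helper; it is equivalent to running sim_study() from the end of the post with the pooling block commented out):

# Analyze a fresh N = 30 confirmatory sample on its own, with no pooling
sim_confirm <- function(n, mu, sigma) {
  d <- rnorm(n, mu, sigma)
  s <- summary(lm(d ~ 1))
  c(eff = s$coefficients[1,1], p = s$coefficients[1,4])
}

reps.confirm <- t(replicate(100000, sim_confirm(30, 0, 1)))
mean(reps.confirm[,'p'] < .05)                            # type 1 error
mean(reps.confirm[,'eff'])                                # expected effect estimate
mean(reps.confirm[reps.confirm[,'p'] < .05,'eff'])        # estimate conditioned on significance
# repeat with sim_confirm(30, .4, 1) for the power and conditional estimate under a true effect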

When the null is true

  • Type 1 error: .05032
  • Expected effect estimate: -.00070
  • Conditional effect estimate: .00019

Note that the conditional effect estimate really should be biased; it’s just that I’m using a two-tailed test, and the positive and negative biases cancel each other out. The expected effect, conditional on a significant positive estimate, is about .415.

When the true effect is .4

  • Power: .56336
  • Expected effect estimate: .39981
  • Conditional effect estimate: .52331

Conclusion

From the above, I recommend that if you have a sample and you find something interesting in it, do not be afraid to collect a larger sample and include your previous data in it. If the “null” is true, the error rate decreases due to regression toward the mean. If the null is false, you will have a less biased estimate. Ideally, you should collect enough data that any initial sampling-error-induced extremeness will shrink away.
Yes, there’s a chance that you could “lose” significance; given the simulations above, that’s probably for a good reason, and there’s a pretty strong chance you would lose significance under the confirmatory approach anyway. In the 30 + 30 pooled case, the power was about .54 overall; in the 30-first, 30-again confirmatory case, the power is about .56 each time, so the chance of getting two samples that both support your claim is only about .32 (.56 × .56).

TLDR:

  • Do not collect data until you hit significance and then turn off your data collector. You will obtain overestimates and drastically inflate your type 1 error rate; in the extreme, if you are willing to keep sampling indefinitely, this procedure is guaranteed to reject the null eventually, even when it is true.
  • Do collect more data after finding something interesting, because you will improve or maintain your power and decrease the significance-conditioned bias in your estimate. That’s probably a good rule of thumb anyway: once you find something interesting, even in adequately powered studies, collect a bit more just to shimmy toward the generative parameter. Type 1 error is inflated a bit relative to requiring two independent significant tests, but not by much: about .0025 for two independent tests versus about .016 for the pooled procedure above; not enough to overwhelm the benefit in estimate accuracy and power. The more N you collect after the first set, the lower this is (which is counter-intuitive, given the p-value asymmetry, but it’s due to joint type 1 errors: non-significant first samples aren’t continued; significant ones then gain more N, and the new observations are more likely to be near zero than extreme enough to maintain the already improbable type 1 error).
  • Do this with enough N to overcome any sampling-induced extreme estimates. That is, don’t collect 30, get significance, then collect 1 more and claim the estimate is now more accurate; that is silly. You can play around with the code at the end of the post (and the sketch just after this list) to see how much N to add to shrink the conditional estimates. All things being equal, more N is better N.
  • Do this with a fixed $N_\Delta$, rather than adding data repeatedly until you hit significance again; otherwise you wind up with the optional-stopping problem all over again. Basically, don’t switch your rule from “stop once an N is hit” to “stop once an N is hit and significance is found again”.
  • Do partially pool if you can, and you probably can.
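As a starting point for that kind of playing around, here is a sketch (it assumes sim_study() from the end of the post has already been defined) of how the significance-conditioned estimate shrinks toward the true effect of .4 as the added $N_\Delta$ grows:

set.seed(2)
n_delta <- c(10, 30, 100, 300)
sapply(n_delta, function(n2) {
  reps <- t(replicate(10000, sim_study(30, n2, .4, 1)))
  mean(reps[reps[,'p'] < .05,'eff'])   # significance-conditioned estimate for this added n
})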

EDIT

I wanted to clarify: I don’t like NHST! And I know that Bayesian inference neither has nor cares about error rates for inference. I know this. Do not yell at me. The above is in response to the idea that pooling data that showed “interesting” (significant, probable, whatever you want) results with a new dataset from the same process would bias your results and/or inflate error rates. I am playing a frequentist for a moment and saying: no, it doesn’t (or rather, it’s less biased, with better error properties).

# Simulate one run of the procedure: collect n1 observations, test the mean against zero,
# and, if significant at alpha, collect n2 more and re-analyze the pooled data.
sim_study <- function(n1,n2,mu,sigma,alpha=.05){
  d <- rnorm(n1,mu,sigma)
  lmOut <- lm(d ~ 1)
  lmSum <- summary(lmOut)
  p <- lmSum$coefficients[1,4]
  eff <- lmSum$coefficients[1,1]

  # Comment out this block (down to #%%%) to skip the additional sampling,
  # i.e., to simulate the confirmatory-only analysis of a single sample.
  if(p <= alpha) {
    d <- c(d,rnorm(n2,mu,sigma))
    lmOut <- lm(d ~ 1)
    lmSum <- summary(lmOut)
    p <- lmSum$coefficients[1,4]
    eff <- lmSum$coefficients[1,1]
  }
  #%%%
  c(eff=eff,p=p,n=length(d))   # estimate, p-value, and final sample size (n1 or n1 + n2)
}

# Null: true effect is zero
reps.null <- t(replicate(100000,expr = sim_study(30,30,0,1)))

mean(reps.null[,'p'] < .05)                       # type 1 error rate
mean(reps.null[,'eff'])                           # expected effect estimate
mean(reps.null[reps.null[,'p'] < .05,'eff'])      # estimate conditioned on significance

# Not null: true effect is .4
reps <- t(replicate(100000,expr = sim_study(30,30,.4,1)))

mean(reps[,'p'] < .05)                            # power
mean(reps[,'eff'])                                # expected effect estimate
mean(reps[reps[,'p'] < .05,'eff'])                # estimate conditioned on significance
