LaTeX in Slack!

I just joined a new Slack workspace and was growing weary of using the codecogs equation editor just to paste an image of some LaTeX math into Slack.
It took a bit of effort, but I thought I would document the process for anyone who is curious.

You need:
– Slack, obviously
– A Linux server, accessible from anywhere
– Familiarity with some command line

I have a digital ocean droplet hosting this webpage, running Ubuntu 16.04.

First, SSH into your server.

git clone https://github.com/colbygk/mathslax.git
sudo apt-get install nodejs npm
cd mathslax

Second, you will probably need an updated version of node.

sudo npm install -g n
sudo n latest

Third, get mathslax ready.

make install
npm audit fix

Fourth, in Slack, navigate to Workspace name -> Customize Slack -> Apps -> search for Slash Commands, and install that integration.
Go to Custom Integrations -> Slash Commands -> Add Configuration
Enter some settings:

Command: /math
URL: http://yourServerName:yourConfiguredPortNumber/slashtypeset
Customize Name: LatexBot

yourServerName should be your webpage name; e.g., mine is srmart.in

yourConfiguredPortNumber should be the port that the mathslax server will listen on. E.g., 7777

Write down or copy the Token that is generated for you.

Fifth, in your SSH session on the server, create a simple script for launching the mathslax server:

#!/bin/sh
cd /usr/local/bin/mathslax
SERVER=yourServerName PORT=yourConfiguredPortNumber SLACK_AUTH_TOKEN="TheToken" node server.js

Save that to /usr/local/bin/tex.sh, and mark it as executable:

sudo mv tex.sh /usr/local/bin/tex.sh
sudo chown root:root /usr/local/bin/tex.sh
sudo chmod a+x /usr/local/bin/tex.sh

Sixth, move the mathslax directory into place. I did not follow best practices; this should probably live in /opt, /usr/share, or some other directory, but whatever.

cd ~
sudo mv mathslax /usr/local/bin/
sudo chown -R root:root /usr/local/bin/mathslax

Finally, create a systemd service unit, and place it in /etc/systemd/system/mathslax.service:

[Unit]
Description=MathSlax
After=network.target

[Service]
ExecStart=/usr/local/bin/tex.sh
Restart=on-failure

[Install]
WantedBy=multi-user.target

Then:

sudo systemctl enable mathslax.service
sudo systemctl start mathslax.service

You may need to enable the port in your firewall. If using UFW, then it’s simply sudo ufw allow yourConfiguredPortNumber

In slack, you can then type:

/math \begin{align}y &\sim p(\mu, \sigma) \\ \mu &\sim N(0,1) \\ \sigma &\sim N^+(0,1)\end{align}

And get an image of those three aligned sampling statements.

Edit

Part two. The bot method.

On your server,

git clone https://github.com/sand500/SlackLateX.git
cd SlackLateX
npm install fs
npm install request
npm install websocket

As with mathslax, move the SlackLateX directory to /usr/local/bin. Then create a file /usr/local/bin/texbot.sh with:

#!/bin/sh

cd /usr/local/bin/SlackLateX
node ./bot.js

Create a file /etc/systemd/system/SlackLateX.service with:

[Unit]
Description=SlackLateX
After=network.target

[Service]
ExecStart=/usr/local/bin/texbot.sh
Restart=on-failure

[Install]
WantedBy=multi-user.target

Go to your Slack workspace -> Apps -> Install Slack bots -> Add configuration:

Customize name: latexbot

Copy the API token, paste it into /usr/local/bin/SlackLateX/secret.txt

Then simply invite latexbot into the channel, and start the bot with sudo systemctl start SlackLateX.

Now you can type $\pi$ to get a rendering of the pi symbol (or any other LaTeX equation). This method improves upon the first method because it’s channel-specific, and you can edit the LaTeX code, which will re-render.

A foray into Bayesian handling of missing data

I have been, uh, “blessed” by the data gods for most of my research data, in that I really rarely have non-planned missing data. Seriously.
Most of my research has involved surveys, lab experiments, or within-subject repeated measures, and for some reason, I just rarely had missing data.
The missing data was small enough to ignore (like, less than 1%).
Consequently, I’ve never really had a need to “handle” missing observations.

Sure, I’ve dealt with models wherein some unknowns are treated as missing data, like latent scores or groups in latent variable models, but that’s not really the same thing, now is it? Those were “known-unknowns”.
I’ve also had data where missingness is planned and ignorable, like a between-subjects manipulation of which response condition a repeated-measures task requires.

I use Stan or brms for nearly every analysis I do, and I have constructed some fairly complicated stan models.
Multilevel SEM with non-normal residual distributions and moderation? Check.
Assessing differential item functioning or measurement invariance through item model competition? Check.
Simultaneously estimating the probability that some item is DIF across latent groups? Check.
Full information meta-analytic path models? You betcha.
And so much more (Seriously, Stan rocks).

But missing observations? Never dealt with it.
Inspired by an assignment for a course, I decided to dive in and see just how bad missing data handling is in Stan.

The model

The data has 6 columns: read, parents, iq, ses, absent, and treat, roughly corresponding to a reading score, number of parents (coded 0 for one parent, 1 for two), IQ, socioeconomic status, number of absences, and whether the person was involved in the reading improvement treatment.
Simple enough.
The goal is to estimate the basic linear regression, read ~ parents + iq + ses + treat, which is of course very easy.
However, there’s fairly substantial missingness in read, iq, and ses.
One-third of the IQ scores are missing, 29% of SES is missing, and 14% of reading scores are missing.
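For reference, a quick way to see those rates, assuming the data live in a data frame called ds (as in the code further below):

round(colMeans(is.na(ds)), 2) # Proportion of missing values per column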

So what do you do?
The two most common methods are multiple imputation and full information maximum likelihood.

I specify the model in lavaan.

lavMod <- '
read ~ 1 + parents + ses + iq + treat
'

By default, lavaan uses listwise deletion, which is not so much a way of “handling” missing data as it is ignoring it and hoping we have enough N to say anything at all; and that’s only if data are MCAR, which is basically an insane assumption.
In reality, we can expect data to be MAR, and listwise deletion will result in some bad estimates.

Easy enough to fix in lavaan; to use FIML, you just add missing='fiml' as an argument.

lavModOutFiml <- sem(lavMod,data=ds,missing='fiml')

All FIML really does is change the estimation technique.
Instead of operating only on covariance matrices, the estimator maximizes a likelihood function defined at the observation level, then (I think) integrates out the missings.
This allows every observed variable to provide information to the model, and to share information about the missing values.

Or we could use multiple imputation, which is fairly easy as well.

library(semTools)
lavModOutMI <- runMI(lavMod,data=ds,m=20,miPackage='mice',fun='sem')

Multiple imputation generates $M$ datasets using, basically, Gibbs sampling for the missings.
Then you estimate the model on each dataset and pool the estimates and compute the total standard errors.
In effect, this also integrates out the missings, and is essentially a less principled Bayesian method.
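For concreteness, the pooling step is just Rubin’s rules, which runMI handles for you. A minimal sketch, where est and se are hypothetical per-imputation estimates and standard errors for a single coefficient:

est <- c(0.52, 0.47, 0.55, 0.50, 0.49) # Estimates of one coefficient from m imputed datasets
se  <- c(0.10, 0.11, 0.09, 0.10, 0.10) # Their standard errors
m   <- length(est)
qbar <- mean(est)          # Pooled point estimate
W    <- mean(se^2)         # Average within-imputation variance
B    <- var(est)           # Between-imputation variance
Tvar <- W + (1 + 1/m) * B  # Total variance
c(estimate = qbar, se = sqrt(Tvar))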

Easy enough, but onto Stan!

Bayesian Missing Data

The nice thing about Bayesian modeling is that there is not really a clear line between parameters and mere “unknowns”.
Really, observations are known, and everything else is unknown.
The goal is to condition on those knowns to make probabilistic claims about the unknowns.

Usually when people talk about unknowns, they mean parameters, but that is needlessly restrictive.
Missing data are unknown, latent groups and states are unknown, latent scores are unknown, but none are “parameters” per se.

In order to “handle” missings, we merely need a model for them; then any posteriors of interest can be computed after marginalizing across it.
No need to scrap entire rows of data — Just model the missings with the observed quantities, condition on the known and unknown data, then marginalize.
In this way, missing data handling in Bayesian models is very natural. No external imputation needed; no fancy algorithm required. Missing data are merely part of the joint probability system.
Most realizations were observed with absolute certainty; some were not observed, but are informed by what is observed.

Take multiple regression as an example. Let X be the non-missing predictors, $\tilde{X}$ the missing predictors, $\sigma$ is the residual standard deviation, $\beta$ is the vector of regression coefficients, y is the outcome, $\mu$ is the vector of means and $\Sigma$ the covariance matrix for a multivariate normal distribution on the predictors.
$$
p(\beta,\sigma,\mu,\Sigma|X,\tilde{X},y) \propto p(y | X, \tilde{X},\beta,\sigma)p(\tilde{X}|X,\mu,\Sigma)p(\mu,\Sigma,\beta,\sigma) \\
p(\beta,\sigma,\mu,\Sigma|X,y) \propto \int p(y | X, \tilde{X},\beta,\sigma)p(\tilde{X}|X,\mu,\Sigma)p(\mu,\Sigma,\beta,\sigma) d\tilde{X}
$$
Essentially, we impose a multivariate normal distribution on the predictor variables, with unknown mean and covariance parameters.
The known predictors inform the mu and covariances, which in turn inform unknown scores.
Both the known and informed unknown scores predict y, and this in turn also informs the unknown scores (It’s a joint probability system, after all).
The goal is to obtain the marginal posterior of the parameters of interest, and to do so you must integrate over the unknowns, including unknown scores.

MCMC is there to help us approximate integrals and expectations.
Importantly though, MCMC samplers are essentially imputing the unknown data points just like multiple imputation, but those imputations are informed by the full-information likelihood at every iteration.
Bayesian handling of missing data therefore sits somewhere between multiple imputation and FIML-like techniques.
From a mathematical perspective, it looks like FIML. From an estimation perspective, it looks like multiple imputation.

To stan!

The Stan model, decrypted

Let me premise this section by saying: The Stan code I show below is not optimized. It is written for clarity, not for speed. There are several ways of optimizing this, but for a one-off model, it’s not critical.
Additionally, there are multiple ways of handling missings in Stan; the one I show below seemed easiest to me, even though it is an expensive method.

Stan hates NA values. Stan (or, I assume, its C++ headers and libraries) has no concept of missing values, and has no way of representing them.
So we need to do two things. We need to save which values are missing, and also replace those missing values with a temporary value.
This R code accomplishes those goals:

missings <- lapply(ds,function(var){which(is.na(var))}) # For each variable in ds, which rows are NA
nMissings <- lapply(missings,function(var){length(var)}) # How many NAs?
names(missings) <- paste0(names(missings),'_index') # Rename indices to variable_index
names(nMissings) <- paste0(names(nMissings),'_nMiss') # Rename n_miss to variable_nMiss

ds2 <- ds # Work with copy of dataset ds
ds2[is.na(ds)] <- -100 # Magic number to remove NAs

For clarity, this is what missings looks like:

> missings
$read_index
 [1]   7  22  24  30  40  42  50  59  62  70  86  87  93  96  99 104 111 120 139 148 150 157 172 174 180 190 192 200 209 212 220 236 237 243 246 249 254 261 270 289 298 300

$parents_index
integer(0)

$ses_index
 [1]   1   7  17  21  22  24  30  31  44  50  52  59  60  61  62  64  65  68  72  82  83  86  89  90  92  94  96  98  99 101 104 115 119 125 126 129 138 139 141 144 145 148 150 151 157 167 171 172 174 180
[51] 181 194 200 202 209 210 211 212 214 215 218 222 232 233 236 239 240 242 244 246 248 249 251 254 265 269 275 276 279 288 289 291 294 295 298 300

$iq_index
  [1]   8  11  15  17  21  22  24  31  40  52  59  60  61  62  68  70  71  73  74  77  80  82  85  86  89  90  92  93  94  95  96  97  98  99 104 108 110 112 114 115 119 120 122 124 125 135 136 138 139
 [50] 141 158 161 165 167 171 172 174 181 190 202 209 210 211 212 218 220 221 223 224 227 230 232 235 236 239 240 242 243 244 245 246 247 248 249 254 258 260 262 264 265 269 270 272 274 275 285 286 288
 [99] 289 291

$absent_index
integer(0)

$treat_index
integer(0)

Time to compose the Stan model.

The data block is as follows:

data {
    int N;
    vector[N] read; // Data vectors
    vector[N] parents;
    vector[N] ses;
    vector[N] iq;
    vector[N] treat;

    int read_nMiss; // N missing from variables; excluded have no missings
    int ses_nMiss;
    int iq_nMiss;

    int read_index[read_nMiss];
    int ses_index[ses_nMiss];
    int iq_index[iq_nMiss]; 
}

N is defined as the number of rows in the dataset (number of observations).
The outcome variable vector and the four predictor vectors are expected.
The number of missings for the three variables containing missing values are expected.
Finally, an integer array for the vector indices containing missings is expected for each variable with missings.

The parameters block:

parameters {
    // Structural
    vector[4] beta;
    real intercept;
    real<lower=0> sigma;

    // MVN covariates; for imputation
    cholesky_factor_cov[3] Sigma;
    vector[3] Mu;

    // Missings
    vector[read_nMiss] read_imp;
    vector[ses_nMiss] ses_imp;
    vector[iq_nMiss] iq_imp;
}

The structural parameters are the four regression coefficients, the intercept, and sigma — Corresponding to the model $y \sim \text{Normal}(X\beta,\sigma)$.
The multivariate normal parameters include the Cholesky factor of a covariance matrix, $\Sigma$, and a mean vector, $\mu$; the known predictor values will inform the parameters of a multivariate normal distribution, which will in turn inform the unknown values for these variables.
The three remaining vectors correspond to the unknowns of each variable. The _imp should be read as “imputed”.

Transformed parameters block:

transformed parameters {
    // Recombine data
    matrix[N,5] Data; // 1: read, 2: parents, 3: ses, 4: iq, 5: treat
    Data[,2] = parents;
    Data[,3] = ses; Data[ses_index,3] = ses_imp;
    Data[,4] = iq; Data[iq_index,4] = iq_imp;
    Data[,5] = treat;
    Data[,1] = read; Data[read_index,1] = read_imp;
}

Here, we combine the observed and missing data into a single data matrix, called Data.
Normally, I would put this in the model block, but I hope to use this in generated quantities as well, and I do not want to copy and paste all of this.
We cannot merely edit the data vectors provided in the data block, because Stan disallows it.
Instead, a data matrix is created, and modified to include the estimated missing values.
This block is straightforward. An Nx5 matrix named Data is created, and I include a little key noting which columns represent which variables.
Each column is initially defined to be the corresponding vector provided in the data block.
For those three variables with missings, the indices with missing values (which we set to -100) are replaced with the “imputed” value for that person.

Model block:

model {
    // MVN imputation
    for(n in 1:N){
        Data[n,2:4] ~ multi_normal_cholesky(Mu,Sigma);
    }

    // Priors
    beta ~ normal(0,2);
    intercept ~ normal(0,2);
    sigma ~ cauchy(0,1);
    Mu[3] ~ normal(100,10);
    Mu[1:2] ~ normal(0,2);
    //// Sigma given uniform prior implicitly

    Data[,1] ~ normal(Data[,2:5]*beta + intercept, sigma);
}

For each person, the parents, ses, and iq quantities (whether it be observed or unknown) are assumed distributed multivariate normal with $\mu$ means and $\Sigma\Sigma’$ covariance ($\Sigma$ is a cholesky factor).
Any observed data contribute to the likelihood, and thus inform these unknown parameters.
Any unknown data are simulated (in a sense) from the distribution.
And yes, it is weird to assume the number of parents is normally distributed; I am ignoring that for now for the sake of simplicity, because all I care about is the covariance, and I am not imputing the parent variable.
Note that I could have included all predictors into the multivariate normal, but treatment is completely orthogonal to every other variable, and was excluded for simplicity.

Priors are loosely defined by values I think are plausible given the scales of the variables.
The only odd looking one out is Mu[3], but that corresponds to IQ, and a-priori I can assume the mean is about 100, and extreme means are very unlikely.
Finally, read is assumed distributed normally about a value predicted from the known and unknown data.
Note that unknown read values are likewise predicted or imputed from the model, although I do not think it has any impact on the other parameters.

Generated quantities:

generated quantities {
    cov_matrix[3] Omega = Sigma*(Sigma');
    real<lower=0,upper=1> R2;
    {
        vector[N] muHat = Data[,2:5]*beta + intercept;
        R2 = (variance(muHat))/(variance(muHat) + variance(Data[,1]-muHat));
    }
}

In this block, I compute the covariance matrix of the three predictors involved in imputation.
The $R^2$ value is computed on the full data as well.

Then we run Stan. I only monitor the parameters of interest, and not the imputed data.

stan_data <- c(missings,nMissings,ds2,N=nrow(ds2))
stanOut <- stan(model_code = stanMod,data=stan_data,cores=4,pars = c('beta','sigma','intercept','Mu','Omega','R2'))
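A quick look at the monitored parameters afterward is just the usual rstan summary (nothing specific to this model):

print(stanOut, pars = c('beta','intercept','sigma','Mu','Omega','R2'))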

It stacks up well to lavaan’s FIML and MI output.

The model above produced the “Bayes” line near the bottom.
The Bayesian model looks very similar to the FIML estimator from lavaan.

Auxiliary variables can also be used, and a model with an Auxiliary variable for the multivariate normal imputation method is reported on the final line of the table.
I won’t put the Stan code here, but the gist is: Don’t predict the outcome with the auxiliary variable; permit the auxiliary variable to covary with all the predictors in order for it to add information to the unknown predictor values. You can also have the auxiliary variable covary with the residual of the outcome variable (requiring a multivariate normal response model) to inform imputed outcomes. An alternative is to have all the predictors additionally predict the auxiliary variable, the residuals of which covary with the outcome variable’s residuals. The former is a saturated covariate model, the latter is an added-DV model; both accomplish the same goal of informing both missing predictors and missing outcomes.

In the end, I was pleasantly surprised with how easy it was to handle missing data in Stan. Ideally, you specify your generative model, and just combine the known data with the unknown data. The known data will inform the unknown data through its influence on the unknown parameters. In this case, I simply chose to model the exogenous variables as multivariate normal, which permitted unknown data to be informed and used along with the known data to predict the outcome of interest.

For MVN imputation:

  • Save which observations are missing, and how many, from each variable.
  • In addition to the typical parameters, include parameters for a multivariate normal.
  • Include a “parameter” per missing value.
  • Combine the known data with the unknown data into a new data structure.
  • Model exogenous variables as multivariate normal.
  • Use the full combined data in the model.
  • Thanks to MCMC, marginal posteriors will already be integrated over unknown data and parameter uncertainty.

Demystifying the Bayesian analysis of ego-depletion

At SPSP 2018, there was the big reveal of the awesome Many Labs replication report of ego depletion effects.
Thirty-six labs participated, and not even the labs had yet seen the full results.
I was in my room decompressing a bit (conferences are exhausting), but following along through Sanjay Srivastava’s live-tweeting of the session.

The anticipation for the results was palpable even on twitter. I rushed down and made it into the room just in time to hear the results myself.

Wagenmakers couldn’t be there to deliver the results of a Bayesian analysis himself, unfortunately.
The speaker did their best to convey the results, but I had the sense that some nuances were missed a bit.
I want to reiterate: The speaker did a good job! But there are some subtleties to these analyses that unless you’re a fairly experienced Bayesian, you will miss or under-emphasize.
So this post isn’t to criticize the speaker at all!

The gist I got from the Bayesian results was that:

  • The data are approximately 4x more likely under the point-mass null model than under the a-priori specified model.
  • The meta-analytic effect is approximately .08. (I mistakenly read it as .06, but others told me .08; I was in the back, sorry!)

But it seemed like the take-home message was “The null was favored; but the meta-analytic effect is .08, so let’s instead talk about how we think about small effects”.

Those I spoke with after the session had the same take-home message — The effect exists, but it’s tiny: $d = .08$.

It’s not actually this simple. The “real” best-estimate is, according to Wagenmakers, about $d = .016$.
The “problem” is that the Bayesian meta-analytic method used provides a conditional effect estimate — An effect conditioned on the assumption that an effect is non-zero and positive.
To non-Bayesians, this is easily ignored. To Bayesians, this is a critical piece of information, because that condition completely changes the point estimates.

Let’s dig into the method, as best as I currently understand it.
Keep in mind, please, I have absolutely nothing to do with this analysis; I’m not involved in this project.
Many things I type here may be incorrect, because I simply do not know the details of the analysis.
Also worth mentioning: I have no invested interest in the approach being correct, either professionally or personally.
Quite honestly, I have had a lot of disputes with Wagenmakers et al’s approach to inference, because not all Bayesians are alike, and I generally hate model averaging and Bayes factors. With that said, if I am correct about how the Bayesian approach was performed, I think it was a good approach for the purposes of this project. Kudos to the entire Many Labs project, and kudos to Wagenmakers et al for their analysis.

The two meta-analytic models

The slides mentioned that both a fixed-effects and a random-effects meta-analysis were performed, and the overall effect was estimated using both.
This does not mean the data were used to “choose” a fixed or random-effects model.
Instead, this method weights the “overall” estimate by the meta-analytic model probability.

A fixed-effects meta-analysis in Bayes looks like this:

$$
p(\theta|y_1,\ldots,y_k,MA=1) \propto p(y_1,\ldots,y_k|\theta)p(\theta|MA=1)
$$
Or with sufficient statistics…
$$
p(\theta|\hat\theta_k,\sigma_{\hat\theta_k},MA=1) \propto p(\hat\theta_k|\sigma_{\hat\theta_k},\theta,MA=1)p(\theta|MA=1)
$$

Essentially, either one is simply stating that there is one fixed effect, and each dataset ($y_k$) or estimate-information combination ($\hat\theta_k,\sigma_{\hat\theta_k}$) is an instance of that fixed effect.

The random effects meta-analysis looks similar, except that each estimated effect is assumed to come from a distribution of effects with an unknown variance and mean.
The mean is taken to be the expected value of the effect.

$$
p(\Theta,\theta_k,\sigma_{\theta_k}|y_1,\ldots,y_k,MA=2) \propto p(y_k|\theta_k,\Theta,\sigma_{\theta_k})p(\theta_k|\Theta,\sigma_{\theta_k})p(\Theta,\sigma_{\theta_k}|MA=2)
$$
Or with sufficient statistics…
$$
p(\Theta,\theta_k,\sigma_{\theta_k}|\hat\theta_k,\sigma_{\hat\theta_k},MA=2) \propto p(\hat\theta_k|\sigma_{\hat\theta_k},\theta_k,\Theta,\sigma_{\theta_k})p(\theta_k|\Theta,\sigma_{\theta_k})p(\Theta,\sigma_{\theta_k}|MA=2)
$$

In order to get just the posterior for $\theta$ or $\Theta$ (fixed effects, respectively), you’d have to integrate out the unknowns, which I’ll skip for the time being, because MCMC does that for you.

But across fixed-effects and random-effects models, what is the posterior distribution? Well, you have to integrate over the particular models themselves.
Assuming we’re working with the raw data vector, $y$:
$$
p(\Theta|y) = p(\Theta|y,MA=2)p(MA=2|y) + p(\theta|y,MA=1)p(MA=1|y) \\
p(MA=1|y) = \frac{p(y|MA=1)p(MA=1)}{p(y)} \\
p(MA=2|y) = \frac{p(y|MA=2)p(MA=2)}{p(y)} \\
p(y|MA=1) = \int p(y|\theta,MA=1)p(\theta|MA=1)d\theta \\
p(y|MA=2) = \iiint p(y|\theta_k,\Theta,\sigma_{\theta_k},MA=2)p(\theta_k|\Theta,\sigma_{\theta_k},MA=2)p(\Theta,\sigma_{\theta_k}|MA=2)d\theta_k d\Theta d\sigma_{\theta_k} \\
p(y) = p(y|MA=1)p(MA=1) + p(y|MA=2)p(MA=2)
$$

This all looks really convoluted, but in words:

  • The posterior probability distribution of the fixed effect is marginalized across the two different meta-analytic methods (fixed and random effect models), weighted by the posterior probability of each meta-analytic model given the data.
  • The posterior probability of a fixed-effects model is the probability of the data under that model times the prior probability (probably .5), divided by the marginal probability of the data across both. This is just Bayes theorem.
  • The posterior probability of a random effects model is the probability of the data under that model times the prior probability (probably .5), divided by the marginal probability of the data across both. This is just Bayes theorem.
  • The probability of the data under each model is the probability of the data given the parameters in each model, marginalized over the a-priori specified priors for each unknown parameter. This is where the $\text{Normal}^+(.3,.15)$ prior comes in for both $\theta$ and $\Theta$.

All of this comes from probability theory and Bayes theorem.
In a nutshell, you estimate a fixed-effects model and a random-effects model, assuming that the hypothesis about the effect is true and is expressed statistically as $\text{Normal}^+(.3,.15)$.
You get the marginal probability of the data under each model, using bridgesampling, because computing that manually is intractable.
This marginal probability represents the probability of observing the data they did given the a-priori hypothesis-informed uncertainty about the fixed effects.
The posterior probability distribution of the fixed effect is then informed by both the fixed and random-effects models, weighted by their respective marginal probabilities.
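In code, that weighting step is just arithmetic on the two marginal likelihoods. A sketch, where the log marginal likelihood values are made up purely for illustration:

logml <- c(fixed = -432.1, random = -431.4) # Hypothetical log p(y|MA=k) from bridge sampling
prior <- c(fixed = .5, random = .5)         # Prior model probabilities
w <- exp(logml - max(logml)) * prior        # Unnormalized posterior model probabilities
w / sum(w)                                  # Weights used to mix the two posteriors for the effect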

Both of these models assume the effect exists, and is positive.
A-priori, the hypothesis states that the ego-depletion effect is positive, with uncertainty represented by $\text{Normal}^+(.3,.15)$.
The posterior distribution for the ego-depletion effect, assuming such an effect exists as hypothesized, is informed by both the fixed and random-effects models, but weighted by the probability of observing the data they did under each model.
Really, the posterior distribution is: $p(\Theta|y,H1)$, meaning it is conditioned on both the observed data and that the hypothesized effect is true.

The hypothesis-driven prior on $\Theta$ (d) looks a bit like this:

And the posterior might look something like this:

But remember! This posterior is conditioned on the effect being positive, non-zero, because the prior constrains it to be so.
And this is likely not what the posterior actually looks like, this is just a best-guess approximation.
Something else to consider…

Posterior distributions and Bayesian point estimates are not like modal estimates

This is a quick aside.
Bayesian posteriors are distributions of effects, representing the data-informed posterior probability density or mass of parameters.
Outside of Bayesian inference, people are really accustomed to modal estimates, with an accompanying standard error or confidence interval.
In Bayesian methods, people often don’t report modal estimates (maximum a-posteriori), but rather medians or expectations with an accompanying credible interval.

When a prior induces a posterior boundary, these will not equal one another.
Just to build the intuition, let’s assume you fit a model without any boundaries, and the posterior looks like this:

This posterior suggests a mean, median, and mode of zero.

Now let’s use a prior to restrict negative values altogether. The posterior looks like this:

The modal estimate is still zero, but the median is about .629, and the mean (expected value) is about .765.
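To make that concrete, here is a rough stand-in for that figure (my own toy version, not the actual plot): take a standard-normal posterior and simply discard the negative draws. The numbers are close to, though not exactly, the ones above.

set.seed(1)
draws <- rnorm(1e6)          # A posterior centered at zero
positive <- draws[draws > 0] # Prior restricts the parameter to be positive
c(median = median(positive), mean = mean(positive)) # median ~ .67, mean ~ .80; the mode is still 0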

Why is this important?
The Bayesian MA posterior, conditioned on the effect being positive, has a median estimate of .08.
The mean estimate is necessarily higher, and the modal estimate is necessarily lower.
For what we currently know, that posterior could literally look like this:

In this posterior, the median is .08 (the same effect estimated by the Bayesian MA model), but the modal estimate is zero.

This is where Bayesian estimates and non-Bayesian estimates diverge: Maximizing estimators would say the effect is actually zero, whereas the Bayesian median and mean would be .08 and .10, respectively.
It’s not that Bayesian estimates are wrong, but they are really descriptions of a posterior distribution.
They are not intuitively the same as maximum-likelihood or other estimators. In this case, that posterior is really just a normal distribution, centered at zero, with all negative mass piled up at zero.
In other words, if the prior had not constrained the posterior to positivity, that posterior would have a mean, median, and mode of about zero.

This is important to recognize, because having a median effect of .08 is not particularly impressive or exciting.
In reality, that suggests a 50% posterior probability that the effect is between 0 and .08, assuming it can only be positive — And even effects that are in actuality zero can have a median estimate of .08 when positivity is induced.

The actual posterior

Wagenmakers and others compared two models.
One model was the combined-meta-analytic approach discussed above.
The other model assumed a point-mass of zero on the effect.

The Bayes Factor for this was $BF_{01} \approx 4$, from my best recollection.
This means the data are four times more likely to occur if the true effect were precisely zero than if the hypothesized (meta-analytic) effect of $\text{Normal}^+(.3,.15)$ were true.

If you want to know the overall posterior distribution for the effect, you have to marginalize across these models.
$$
p(\Theta|y) = p(\Theta|y,H1)p(H1|y) + p(\Theta|y,H0)p(H0|y)
$$

This posterior will look a bit like this:

The mean estimate is about .016.
That is — Assuming the effect could either be zero or positive, the effect is about .016.
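A back-of-the-envelope version of that marginalization, using only the numbers reported in the talk (a BF01 of about 4, equal prior odds, and the ~.08 conditional estimate under H1):

bf01 <- 4
p_h1 <- 1 / (1 + bf01)        # Posterior probability of H1, assuming 1:1 prior odds
p_h0 <- bf01 / (1 + bf01)     # Posterior probability of H0
est_h1 <- .08                 # Conditional estimate, assuming H1 is true
est_h0 <- 0                   # Point mass at zero under H0
p_h1 * est_h1 + p_h0 * est_h0 # ~ .016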

This is not the traditional meta-analytic effect size estimate; but given the a-priori hypothesized effect size range and the alternative hypothesized effect of zero, the meta-analytic effect would be approximately .016, not .08.
Once again, .08 assumes that the effect is positive and exists, which we are not certain of.
If we marginalize across the uncertainty in the models themselves, the expected estimate is about .016.

Hopefully, a more traditional meta-analytic estimate will be reported in the paper as well.
For example, using a zero-symmetric estimation prior (rather than the hypothesis-testing prior employed in the presented analyses) permits the effect to be any real number, weighted by the prior.
This would serve to probabilistically down-weight insane estimates (like $d\geq1$).
I am guessing that the meta-analytic effect would be between 0 and .05.
I am further guessing that if they used a prior mass at zero along with a zero-symmetric prior, the mean, median, and modal estimates would all be essentially zero (between 0 and .02).

Summary

From what I can gather:

  • The meta-analytic effect used a hypothesis-informed prior for the effect, meaning the meta-analytic effect is estimated assuming it is positive.
  • The data are about 4 times more likely to come from a parameter value of zero than from the hypothesized distribution of effects.
  • Assuming the hypothesis is correct, the median estimate is about .08.
  • Marginalizing across whether H1 or H0 are correct, the median estimate is about .016.
  • Do not take the meta-analytic effect size estimate as the take-home message, because that neglects the possibility that zero is a fairly probable estimate. I.e., don’t forget that the estimate is conditioned on the effect existing as hypothesized, which is 4x less likely than the null hypothesis.

Why (N-1) vs N [Long post]

I have a lot of students ask why we use (N - 1) vs N when computing variance. There are of course canned answers, like “Using N-1 unbiases the estimate” or “Using N-1 inflates the estimate, and it’s more accurate” or “Using N results in an underestimate”.

These are all true, but they’re deeply unsatisfying. I remember learning the more precise reason in the first semester of my PhD program, but the question came up again and I decided to write it out myself as a reference. This will be fairly basic stuff for many, I’m sure, but for the sake of my students and my future self, here it is:

First, distinguish between an estimator and its properties in expectation.
An estimator is some method with a defined goal that produces the best estimate given that goal.

Let’s say we are estimating the $\mu$ and $\sigma$ parameters to a normal distribution.
Let’s further say that we want a point estimate, and we’re not Bayesians (for now…), nor do we have any penalties.
One estimator is the maximum likelihood estimator, for obtaining the maximum likelihood estimate (MLE).

The MLE is the estimate that maximizes the joint probability density of the observed data, assuming some probability distribution underlies the sampling distribution.
In our case, we have the normal distribution, which looks like:
$$
p(x | \mu, \sigma) = \sigma^{-1}(2\pi)^{-.5}e^{-(2\sigma^2)^{-1}(x - \mu)^2}
$$
When we have more than one x, i.e., a vector of observations called $X$ (capitalized), and we assume independence, the joint density is:
$$
p(X | \mu, \sigma) = \prod_{i = 1}^N \sigma^{-1}(2\pi)^{-.5}e^{-(2\sigma^2)^{-1}(x_i - \mu)^2}
$$

This is a pain to compute with, and the log-likelihood is much easier.

$$
f(X | \mu, \sigma) = -N\log(\sigma) - .5N\log(2\pi) - (2\sigma^2)^{-1}\sum_{i = 1}^N (x_i - \mu)^2
$$

Now, using this log-likelihood, we just need to see which value of $\mu$ and which value of $\sigma$ maximize the function output, the log-likelihood.
Because the log function is a monotonic function, maximizing the log-likelihood is the same as maximizing the likelihood.

To do this, we take the derivative of the log-likelihood with respect to each parameter in turn. Ignoring the notation for partial derivatives for the moment, the derivatives are as follows:

$$
f(X|\mu,\sigma) = \ldots -(2\sigma^2)^{-1}\sum(x_i^2 - 2x_i\mu + \mu^2) \\
= \ldots -(2\sigma^2)^{-1} \left ( \sum x_i^2 - 2\mu\sum x_i + N\mu^2 \right ) \\
= \ldots - \frac{\sum x_i^2}{2\sigma^2} + \frac{\mu}{\sigma^2}\sum x_i - N\mu^2(2\sigma^2)^{-1} \\
\frac{d f(X|\mu,\sigma)}{d\mu} = 0 + \frac{\sum x_i}{\sigma^2} - N\mu(\sigma^2)^{-1} \\
0 = \frac{d f(X|\mu,\sigma)}{d\mu} \\
0 = \sum x_i - N\mu \\
\mu = \frac{\sum x_i}{N}
$$

I’ll call that $\hat\mu$, since it’s an estimate that maximizes the likelihood of the observations, and not the “true” $\mu$ parameter value.

Onto $\sigma$:
$$
f(X | \mu, \sigma) = -N\log(\sigma) - .5N\log(2\pi) - (2\sigma^2)^{-1}\sum_{i = 1}^N (x_i - \mu)^2 \\
\frac{d f}{d\sigma} = -N\sigma^{-1} - 0 + \sigma^{-3}\sum(x_i - \mu)^2 \\
0 = -N\sigma^{-1} + \sigma^{-3}\sum(x_i - \mu)^2 \\
\sigma^{-2}\sum(x_i - \mu)^2 = N \\
\sigma^2 = \frac{\sum(x_i - \mu)^2}{N}
$$

I’ll call that $\hat\sigma^2$, since it’s an estimate that maximizes the likelihood of the observations, and not the “true” $\sigma$ parameter value.

So now we have two estimators, derived from maximum likelihood goals:
$$
\hat\mu = \frac{\sum x_i}{N} \\
\hat\sigma^2 = \frac{\sum (x_i - \hat\mu)^2}{N}
$$

Now the question, what are the expected values of these parameter estimates, and are they equal to the desired parameter values?

Let me just take a shortcut and say the expected value of $\hat\mu$ is indeed $\mu$.
There is variance of the $\hat\mu$ estimate around $\mu$, derived as follows.
Assume $E(x_i) = \mu$, $E(\sum x_i) = \sum E(x_i)$, $E(cX) = cE(X)$, which are some basic expectation rules, and the rest follows.

$$
E\left((\hat\mu - \mu)^2\right) = E(\hat\mu^2 - 2\hat\mu\mu + \mu^2) \\
= E(\hat\mu^2) - 2\mu E(\hat\mu) + \mu^2 \\
= E\left(\frac{(\sum x_i)^2}{N^2}\right) - 2\mu E\left(\frac{\sum x_i}{N}\right) + \mu^2 \\
= \frac{1}{N^2}E\left(\left(\sum x_i\right)^2\right) - \mu^2 \\
= \frac{1}{N^2}E\left(\left(\sum (x_i - \mu + \mu)\right)^2\right) - \mu^2 \\
= \frac{1}{N^2}E\left(\left(\sum (x_i - \mu) + N\mu\right)^2\right) - \mu^2 \\
= \frac{1}{N^2} E\left(\left(\sum (x_i - \mu)\right)^2 + 2N\mu\sum(x_i - \mu) + N^2\mu^2\right) - \mu^2\\
= \frac{1}{N^2}(N\sigma^2 + 0 + N^2\mu^2) - \mu^2\\
= \frac{\sigma^2}{N} = \sigma^2_\mu
$$
Voila, the expected variance (VAR) of the $\hat\mu$ estimator, defined as $E\left((\hat\mu - \mu)^2\right)$, is equal to $\frac{\sigma^2}{N}$.
Just as our textbooks tell us.

With that out of the way, what is the expected value of $\hat\sigma^2$?

$$
\hat\sigma^2 = \frac{\sum(x_i - \hat\mu)^2}{N} \\
E(\hat\sigma^2) = \frac{1}{N}E\left(\sum\left((x_i - \mu) - (\hat\mu - \mu)\right)^2\right) \\
= \frac{1}{N}E\left(\sum (x_i - \mu)^2 - 2\sum(x_i - \mu)(\hat\mu - \mu) + \sum(\hat\mu - \mu)^2\right) \\
= \frac{1}{N}\left(N\sigma^2 - 2\sum E\left((x_i - \mu)(\hat\mu - \mu)\right) + N\sigma^2_\mu\right) \\
= \frac{1}{N}\left(N\sigma^2 - 2\sum E(\hat\mu x_i - \mu x_i - \mu\hat\mu + \mu^2) + N\sigma^2_\mu\right) \\
= \sigma^2 + \sigma^2_\mu - \frac{2}{N}E(N\mu^2 - N\mu\hat\mu - N\mu\hat\mu + N\hat\mu^2) \\
= \sigma^2 + \sigma^2_\mu - 2E\left((\hat\mu - \mu)^2\right)\\
= \sigma^2 - \sigma^2_\mu \\
E(\hat\sigma^2) = \sigma^2 - \frac{\sigma^2}{N} = \sigma^2\left(1 - \frac{1}{N}\right) = \sigma^2\left(\frac{N-1}{N}\right) \\
\sigma^2 = \hat\sigma^2\frac{N}{N-1}
$$

  1. $E(\hat\sigma^2) = \sigma^2 - \sigma^2_\mu$ is really interesting. It’s saying the expected value of our estimate of the variance is equal to the true variance, minus the error variance of the mean estimator. Conversely, you can say that the true variance is equal to the expected sample variance plus the error variance in the mean estimate. That should already suggest, intuitively, that the reason the MLE for $\sigma^2$ is an underestimate is that it fails to account for the variability of the mean used to compute the sample variance. Scores not only vary around the mean, but the mean varies around the parameter, and by neglecting that latter variability, we underestimate the population variance.
    From an ordinary least squares perspective, you can think of it this way: The mean is the value for which the average squared deviation from it is the smallest. Already, the mean is minimizing the variability. But because the mean does not equal the population parameter, that variability will be minimized around the incorrect point.
  2. The $\frac{N}{N-1}$ adjustment merely alters the estimated value so that its expected value will equal the true value. It does so basically by indirectly adding back some error variance due to the estimation of the mean itself. And if that fraction doesn’t look familiar to you, maybe this helps: $\hat\sigma^2 \frac{N}{N-1} = \frac{N\sum (x_i - \hat\mu)^2}{N(N-1)} = \frac{\sum (x_i - \hat\mu)^2}{N-1}$

And that’s basically it. I probably have minor mistakes in the above, as this was all hastily written. But that’s the gist: using (N-1) as the divisor ultimately comes from the fact that our estimator is not expected to equal the true variance parameter. The estimated sample variance is off by a factor of $\frac{N-1}{N}$ in expectation, so we adjust it so that the adjusted estimator’s expected value does equal the true variance.
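A quick simulation check of that expectation (the particular mean, SD, and N below are arbitrary choices of mine):

set.seed(1)
N <- 10
sims <- replicate(1e5, {
  x <- rnorm(N, mean = 5, sd = 2) # True variance is 4
  sum((x - mean(x))^2) / N        # MLE of the variance (divide by N)
})
mean(sims)           # ~ 4*(N-1)/N = 3.6
mean(sims) * N/(N-1) # ~ 4 after the correction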

The absurdity of mapping p-values to Bayes factors

In the “Redefine statistical significance” (RSS) paper, the authors argued that p = .05 corresponds to low evidentiary value.

Let’s unpack what is meant by that.

The authors’ argument comes from an “upper bound” argument on how large the Bayes Factor can be given a particular p-value.

Recall that a Bayes Factor is defined as:

$$
\frac{p(D|H_a)}{p(D|H_0)} = \frac{\int_l^u p(D|\theta,H_a)p(\theta|H_a)d\theta}{\int_l^u p(D|\theta,H_0)p(\theta|H_0)d\theta}
$$

The parameter prior, $p(\theta|H_k)$, corresponds to an uncertain prediction from some hypothesis, $H_k$.
The marginal likelihoods computed in the numerator and denominator above can then be thought of as “prior predictive success”, or “predictive success, marginalized over the uncertainty expressed in the parameter from hypothesis $k$”. The BF10 then represents the evidence in favor of hypothesis $a$ over hypothesis 0.

With that out of the way, let’s return to the RSS paper.
The authors wished to argue that a p=.05 necessarily has a low upper bound on evidence in favor of the alternative hypothesis.

RSS used as an example a BF for a two-tailed alternative hypothesis in comparison to a nil-null hypothesis.
This makes intuitive sense, because a commonly used test is the two-tailed t-test with a nil-null hypothesis.
In the latter case, the p-value corresponds to the probability of observing a t-statistic at least as extreme as the observed statistic, assuming that the true sampling distribution is centered around zero.
No explicit alternative hypothesis really exists; the statement is simply $H_a: \theta \neq 0$, which isn’t exactly a useful hypothesis, but I digress.

In contrast, the BF does use two explicit hypotheses.
$H_0: p(\theta|H_0) = \begin{cases}1 & \text{if } \theta = 0 \\ 0 & \text{if } \theta \neq 0\end{cases}$
and
$H_a$ depends slightly on how favorable you wish to be, in order to obtain an upper bound.
The most direct upper bound of the BF for a p=.05 value with some specified N would simply be another point mass located exactly at the .975 quantile of the t-distribution.

Using this line of reasoning, they argue that at best, a p = .05 corresponds to a small BF, and as such, p = .05 has a low upper bound on evidentiary value.
Instead, they argue that p = .005 would correspond to a more considerable evidentiary value, with BFs ranging from 14–28 or so.

So what’s the problem? Seems reasonable!

The problem(s)

Ok, well let’s start with some intuitive problems.

First, as mentioned in the “Justify your alpha” paper, why should the rate at which a hypothetical error is presumably controlled map onto a marginal likelihood ratio (I refuse to call this “evidence”, as though the whole of scientific evidence can be expressed in a single unidimensional metric of prior predictive success ratios)?
We could likewise flip this around — How about we devise Bayes factor thresholds that map onto hypothetical error rates, so that these crazy Bayesians don’t make more or less type 1 errors than expected? (I am not encouraging this, just highlighting the arbitrariness).

Second, these are metrics of completely different things! A p-value is exactly what it states: the probability of observing what we did (or something more extreme) assuming some state-to-be-nullified is true. The BF is a ratio of prior predictive success, or a marginal likelihood ratio, or the ratio of probability densities of the observed data under two prior-predictive distributions corresponding to two hypotheses.
The goals of each are different too.
The p-value is just a “surprise” factor, given some assumed state.
The BF is a relative measure, evaluating the data under two certain or uncertain states.
The p-value is used inferentially as a means to reject some state if the observation is sufficiently improbable under it.
The BF is used to express how much one’s prior odds in each state would change as a function of the data.

Why these things should map onto one another is not intuitive.
The p-value has its uses. The BF has its uses. I’m not sure why one needs to be defined in terms of the other.

Third, and most importantly, the mapping from “highly evidentiary” upper bound BFs to p-values (or alphas) is not generalizable. Richard Morey has an excellent series on this topic. The gist is, the whole “p=.005 has reasonable upper bounds, but p=.05 does not”, depends entirely on which hypotheses you are testing. And as I said above, the RSS authors used a two-tailed alternative against a point-mass null, and from the standpoint that statistical hypotheses should map onto their respective substantive hypotheses, this makes no sense anyway. More on that in a future blog post.

Morey recently made his point again on twitter.
I decided to take it a step further.

Likelihood Ratio Tests

A likelihood ratio test is defined as:
$$
Q = -2(LL_0 – LL_a) \\
Q = 2(LL_a – LL_0) \\
Q = 2\log\frac{L_a}{L_0} \\
k_i = \text{Number of parameters in model i} \\
df = k_a – k_0 \\
Q \sim \chi^2(df)
$$

We can solve for the likelihood ratio required to reject at some $\alpha$ level:
$$
e^{\frac{Q_{c,\alpha,df}}{2}} = \frac{L_a}{L_0}
$$
Given some $\alpha$ and $df$, find the critical value for the $\chi^2$ statistic, $Q_{c,\alpha,df}$, then simply solve for the likelihood ratio.

Now, let’s define the upper bound BF.
As it turns out, if you have two hypothetical models with point masses at fixed values ($\hat\theta$), then the BF is simply a good ol’ fashioned likelihood ratio.
$$
p(D|H_a) = \int p(D|\theta,H_a)p(\theta|H_a)d\theta \\
\text{If }p(\theta|H_a) = \begin{cases}1 & \text{if } \theta=\hat\theta \\ 0 & \text{if } \theta\neq\hat\theta\end{cases} \\
p(D|H_a) = p(D|\hat\theta,H_a)
$$
Meaning, the BF just becomes a likelihood ratio.

If you have model A with 2 parameters, and model 0 with 1 parameter, and p = .05, what is the upper bound BF ($BF_{ub}$)?
$$
df = 1 \\
Q_{c,\alpha=.05,df=1} = 3.841459 \\
e^{Q/2} = 6.825936 = BF_{ub}
$$

So if you are comparing two models with a difference of one parameter, and you see p = .05, your upper bound Bayes Factor is 6.825936.
Basically, that means if your alternative hypothesis had one more parameter and your hypothesis predicts the exact estimate achieved, and you got p = .05, then the greatest BF you could see is 6.825936.
That’s not stellar, which is the point of the RSS argument.
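The same numbers, verified in R:

Qc <- qchisq(1 - .05, df = 1) # Critical chi-square: 3.841459
exp(Qc / 2)                   # Upper-bound BF: 6.825936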

Not so fast! That’s in one particular scenario, wherein you had two models, and the only difference is the inclusion of an extra parameter.
Let’s vary the p-value obtained and the df for the test. Recall that df in LRTs represents the difference in the number of parameters estimated by each model, so we are varying how much more complex the alternative model is than the reference or null model.
Plot time.

library(ggplot2)
alphas <- seq(.005,.5,by=.005)
dfs <- 1:10

ds <- expand.grid(alpha=alphas,df=dfs)
ds$qchisq <- qchisq(p = 1-ds$alpha,df = ds$df) # Critical chi-square value for each (p, df) pair
ds$bfup <- exp(ds$qchisq/2)                    # Upper-bound BF given a point-mass alternative: exp(Q/2)

ggplot(data=ds,aes(color=factor(df),x=alpha,y=log(bfup))) + 
    geom_line() + 
    labs(x='p',y='log(BF10) Upper Bound',color='LRT df') + 
    theme_classic() + 
    geom_hline(yintercept=c(log(14),log(28))) + 
    scale_x_continuous(breaks=seq(0,.5,by=.025)) + 
    theme(axis.text.x = element_text(angle = 90))

The $\log BF_{ub}$ is plotted on the Y axis. P-values on the X-axis. Colors represent the df (i.e., difference in number of parameters from alternative to null models).
The black horizontal lines represent BF10 14-28.

If your model has 2 parameters more than the null, then p=.05 has an upper bound of “strong evidence”, according to the RSS authors.
Well, that changes things, doesn’t it?
In fact, if you only have one parameter difference, then p=.01-.03 has “strong evidence” upper bounds.
A p=.005 would have an upper bound BF of about 50.

In the more extreme case, if you have 6–7 more parameters in the alternative model, then a p=.50 corresponds to “strong evidence” upper bounds.
Let me repeat that, for those who misread .50 as .05.
A p-value, of .50 (50%, point five zero, not 5% or point zero five), corresponds to strong evidence upper bounds.

Using the logic of the RSS authors, if you have 6-7 parameters more in your alternative model than the null model, then alpha should be set to .50, because that’s where the upper bound of “evidence” is strong.
That’s some really great news for people who fit SEM and MLM models, both of which often compare much higher dimensional models to much lower dimensional models. Go for broke; know that if you have p=.42, it’s justifiable as a cutoff, since it has a high evidence upper bound.

Note that this does make sense from a BF-inferential perspective.
If you have a hypothesis that predicts precisely the parameter values that you estimated, then that hypothesis is insanely good.
And the goodness of that hypothesis increases as your model has more parameters that it precisely predicted.
This reasoning is intuitively why the upper bound increases as df increases.
If you have a hypothesis that is perfectly predicting the parameter values, then the BF upper bound should be higher for models that have more parameters to predict.
In other words, it’s much more impressive for a hypothesis to perfectly predict 10 parameters than 2.

Despite the pattern making sense from a BF-inferential perspective, it makes little sense to claim p-values overstate the “evidence”, when in many cases the p-values understate the “evidence”.
Sure, a p=.05 may only imply a BF of 3-6.
But also, a p=.05 may imply a BF of 20-10000.

It depends on the hypotheses you are testing, and therefore there can be no general mapping between a p-value and the “evidence” indicated by a BF.

Final thoughts

The argument that p=.005 should be the new standard due to its relationship to the BF is questionable.
Philosophies aside, the p-value does not map onto the BF.
A p-value maps onto a BF.
It all depends on the hypotheses you test. A p of .5 could mean a BF of 106; a p of .005 could mean a BF of 51; a p of .05 could mean a BF from 7 to 10000. Who knows?

The recommendation that all findings be labeled significant only when $p \leq .005$ is based purely on one particular set of hypotheses, under a set of assumptions, with the goal of matching it to upper bounds of one particular evidentiary metric.
It cannot generalize to all computed p-values, and I doubt we’d want a big table of significance thresholds corresponding to a range of BFs for every possible significance test.

Finally, I agree that a single p-value near .05 is insufficient evidence; just as I would agree that a single BF of 3-10000 would be insufficient evidence.
The reason I state that a single p-value near .05 is insufficient evidence is not because the .05 is necessarily weak, but rather that any singular metric is necessarily weak, when assessing the evidence for a hypothesis.
More on this in a future blog post, I’m sure.

In sum, use $\alpha = .005$ if you’d like. Use BFs if you’d like. Use posteriors and CIs if you’d like. Report all the things if you’d like.
Justify your inference within your inferential framework, justify the decisions made.
But maybe don’t say p = .05 is necessarily weak just because its upper-bound BF is low in one particular scenario.

And maybe all of this is wrong, I don’t know. You tell me.

Updating? Pooled data? Meta-analysis? Yes.

Meta-psychology is a journal that is founded on the principles of total transparency. Manuscripts are open. Reviews are open. Open commentary is permitted. Each draft is visible. It’s as though the journal cares about science or something.
Rickard mentioned a new submission about the utility of Bayesian updating, or as the authors call it – Posterior passing.

My comment can be read here by clicking on the upper-right comment button. Due to the excruciating 1000 character limit, you will have to hit the ‘+’ sign to read all of them.

Although I generally appreciate the paper, I think it misses an important point.
Their proposal to use “posterior passing” instead of meta-analysis or using full raw datasets is strange upon realizing that posterior passing is the same as having all of the raw data, and the same as a fixed-effects meta-analysis under a few reasonable assumptions.

Posterior Passing

The gist of posterior passing is that a researcher can take the posterior of a parameter computed from a previous sample, and use that posterior as a prior.
Or, as the rest of the world calls it – Bayesian updating.
The prior is then updated by the new sample data into a new posterior.
That new posterior can serve as someone else’s prior, and so on.
Over repeated samples, this results in decreasing uncertainty (smaller posteriors) about parameter values.

Posterior passing looks like the following. Assume $Y_j$ represents some current observation vector, and $Y_{-j}$ represents a set of previous observation vectors. Or: $Y_j$ is your data, and $Y_{-j}$ is not your data.
\begin{align}
p(\theta,Y_j|Y_{-j}) & \propto p(Y_{j}|\theta,Y_{-j})\overbrace{p(\theta|Y_{-j})}^{\text{Your prior, their posterior}}
\end{align}

That’s about it. Obtain data, estimate a posterior, use that posterior as a prior for the next sample. Keep doing it ad nauseam, and you have posterior passing.

The argument is that this method is underutilized in the sciences, and that engaging in posterior passing will permit researchers to quickly converge toward a more certain, accurate estimate.

I do not disagree with the idea of Bayesian updating, of course. It’s just a mathematical fact, and it is underutilized. People are way too afraid to incorporate prior information into a model.
With that said, the proposal has a few problems, some they mention and some they do not.

  • The proposed use of it assumes a single set of generative parameters across all of the studies. Given that multiple studies have minor to major differences between them, surely their corresponding data are not generated from precisely the same set of parameters. It would be difficult to find three, let alone 60, studies that are feasibly generated from the exact same parameters.
  • The quick convergence toward the “true value” strongly depends on the quality of the available posteriors from previous studies. $ p_\text{osterior}-\text{hacked} $ studies and publication bias result in a misleading assortment of posterior distributions, and though your own data may overwhelm the priors, it is not guaranteed. Indeed, if they had simulated the existence of hacking and pub bias, I suspect the performance would be underwhelming.

My bigger disagreement is with advocating it over meta-analysis, or even comparing it to having the full data available.

They’re the same things! Advocating for repeated Bayesian updating across studies in order to get a highly informative posterior is the same thing as advocating for full-data analysis and fixed effects meta-analysis.

Bayesian updating vs full-data analysis

Imagine two researchers are arguing about their next three samples.
One researcher wants to collect 300 observations, then another 300, then another 300.
She then wants to compute a posterior based on the first 300, feed that in as a prior for the second 300, and finally the resulting posterior into the third.
Specify an initial prior, then chain posteriors into the next model’s prior, until we get a three-study Bayesian update on the state of information.
The other researcher thinks that’s inferior to the full-data method.
He wants to just collect 900 and be done with it; analyze the whole thing.
Specify an initial prior, then throw all 900 observations into a single model.
They argue for hours. Friendships are severed. Children are neglected. They use all caps when texting because their voices have given out.

Neither researcher is right or wrong.
These two approaches will accomplish the exact same end result.
They should have just worked out the math and grabbed a beer instead.

\begin{align}
p(\theta,Y_j,Y_{-j}) &\propto p(Y_j|\theta,Y_{-j})\overbrace{p(\theta|Y_{-j})}^\text{A posterior} \\
&\propto p(Y_j|\theta,Y_{-j})p(Y_{-j}|\theta)p(\theta) \\
\text{If IID…} &\\
&\propto p(Y_\cdot|\theta)p(\theta)
\end{align}

Assuming your data and other data are IID, you can either use a posterior as your prior, or combine all the data.
The result is the same. The joint distribution decomposes into either one.

Consequently, using all available data in a single model is no better than using posteriors derived from all previous data to inform the current data.
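A toy illustration of that equivalence (my own example, using a conjugate Beta-Binomial model rather than anything from the paper): updating study by study and pooling all 900 observations yield the identical posterior.

set.seed(1)
y1 <- rbinom(300, 1, .3); y2 <- rbinom(300, 1, .3); y3 <- rbinom(300, 1, .3)

a0 <- 1; b0 <- 1 # Initial Beta(1,1) prior

# Posterior passing: update one study at a time
a1 <- a0 + sum(y1); b1 <- b0 + sum(1 - y1)
a2 <- a1 + sum(y2); b2 <- b1 + sum(1 - y2)
a3 <- a2 + sum(y3); b3 <- b2 + sum(1 - y3)

# Full-data analysis: all 900 observations at once
y  <- c(y1, y2, y3)
aP <- a0 + sum(y); bP <- b0 + sum(1 - y)

c(a3, b3) == c(aP, bP) # TRUE TRUE; same posterior either way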

Bayesian updating vs fixed-effect meta-analysis

A fixed-effect meta-analysis is defined as follows.
Some parameter, $\theta$, exists that is estimated by several $\hat\theta_i$. A variance estimate, $\sigma^2_{\hat\theta_i}$, is computed for each $\hat\theta_i$ estimate.
It follows then, that:
$$
p(\theta | \hat\theta_\cdot,\sigma_{\theta_\cdot}) \propto p(\hat\theta_\cdot | \theta, \sigma_{\theta_\cdot})p(\theta)
$$

Decomposing this further,

\begin{align}
p(\theta | \hat\theta_\cdot,\sigma_{\theta_\cdot}) &\propto p(\hat\theta_\cdot | \theta, \sigma_{\theta_\cdot})p(\theta) \\
&\propto p(\hat\theta_j | \theta, \sigma_{\hat\theta_j})p(\hat\theta_{-j}|\theta,\sigma_{\hat\theta_{-j}})p(\theta) \\
&\propto p(\hat\theta_j | \theta, \sigma_{\hat\theta_j})p(\theta | \hat\theta_{-j},\sigma_{\hat\theta_{-j}})
\end{align}

If $\hat\theta_j, \sigma_{\hat\theta_j}$ are sufficient statistics for $Y_j$, then
$$
p(\hat\theta_j | \theta, \sigma_{\hat\theta_j})p(\theta | \hat\theta_{-j},\sigma_{\hat\theta_{-j}}) = p(Y_j|\theta)p(\theta|Y_{-j})
$$

Assuming each study’s $\hat\theta_j, \sigma_{\hat\theta_j}$ calculations are sufficient statistics and are independent of one another, the fixed-effect meta-analysis will provide the same information as posterior passing or full-data analysis.
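
As a quick numerical check, here is a small R sketch of the simplest case where the sufficiency assumption holds exactly: normal data with known $\sigma$, so each study’s mean and its sampling variance $\sigma^2/n_j$ summarize that study’s data (the sample sizes and true effect below are made up). The inverse-variance-weighted fixed-effect estimate matches the estimate from pooling all of the raw observations.

# Minimal sketch: fixed-effect meta-analysis of sufficient statistics vs full-data analysis
# (normal data, known sigma, so each study mean has sampling variance sigma^2/n_j)
set.seed(2)
sigma <- 1
n <- c(150, 300, 450)
studies <- lapply(n, function(ni) rnorm(ni, 0.3, sigma))

theta_hat <- sapply(studies, mean)   # per-study estimates
v_hat     <- sigma^2 / n             # their known sampling variances

# Fixed-effect meta-analysis: inverse-variance weighting
w <- 1 / v_hat
fe_est <- sum(w * theta_hat) / sum(w)
fe_var <- 1 / sum(w)

# Full-data analysis: one mean over all 900 observations
y_all <- unlist(studies)
c(fe_est = fe_est, full_est = mean(y_all))               # same estimate
c(fe_var = fe_var, full_var = sigma^2 / length(y_all))   # same variance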

Summary

In essence, whether you want to utilize posterior passing, incorporate all data into a single model, or use a fixed-effect MA, you will produce the same information in the posterior.

I think, anyway. Feel free to critique the above.

Based on the above, it is inconsequential which version you use. They all suffer from the same problems, and share the same troubling assumption that all included data were generated from the same single parameter.

Practically speaking, I believe the fixed-effect MA would be the simplest of those three to use. Regardless, I think the assumption of one true effect across multiple studies is seriously improbable in practice, and would recommend a random effects meta-analysis instead.
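
For what it’s worth, here is a hedged sketch of what both options look like with the metafor package (assuming it is installed; the study summaries below are made up): rma() with method = "FE" fits the fixed-effect model, and the default REML fit gives the random effects model.

library(metafor)

# Made-up per-study effect estimates and their sampling variances
theta_hat <- c(0.42, 0.18, 0.55, 0.30)
v_hat     <- c(0.020, 0.015, 0.030, 0.010)

rma(yi = theta_hat, vi = v_hat, method = "FE")   # fixed effect: one shared true theta
rma(yi = theta_hat, vi = v_hat)                  # random effects (REML): true thetas vary across studies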

Maximum Likelihood Lecture for Undergraduates

This semester I was permitted to give one brief lecture on maximum likelihood in the advanced statistics undergraduate course.

The course focuses primarily on the linear model, so we talk about OLS linear models, and variance-based approaches to model selection.
I have free rein over three lectures each semester, and my final lecture was on maximum likelihood approaches, and how we can use them.

The knitted lecture notes are embedded below. Note: I know some content is incorrect; I throw the term “likelihood” around pretty freely and incorrectly, in order to focus on the intuition rather than the correctness of that term. So before anyone comments on how I used the term likelihood incorrectly — I know I did. It was intentional, and with only one lecture to spend, it wasn’t worth it to me to go into what precisely likelihood means, or why $p(Y|\mu,\sigma)$ is not Fisher-friendly (because it treats parameters as random values to be conditioned on, even if mathematically they are the same).

Anyway, these notes got good reviews from undergraduates who had previously never seen or heard of maximum likelihood; even a couple of graduate students thought the visualizations were helpful to them.
The undergraduates were also given a fairly basic homework assignment wherein they fit logistic regression models, computed and used LRT and AIC for model selection, and made predictions (e.g., if high in twitter addiction, the model would suggest a high probability of being a graduate student).
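
For the curious, here is a sketch of the kind of model comparison the homework involved, using simulated data and illustrative variable names rather than the students’ actual dataset.

set.seed(6)
n <- 200
twitter_addiction <- rnorm(n)
grad_student <- rbinom(n, 1, plogis(-0.5 + 1.2 * twitter_addiction))  # made-up generative model
d <- data.frame(grad_student, twitter_addiction)

m0 <- glm(grad_student ~ 1, family = binomial, data = d)                  # intercept-only
m1 <- glm(grad_student ~ twitter_addiction, family = binomial, data = d)  # adds the predictor

anova(m0, m1, test = "LRT")   # likelihood ratio test between the nested models
AIC(m0, m1)                   # AIC comparison

# Predicted probability of being a graduate student for someone high in twitter addiction
predict(m1, newdata = data.frame(twitter_addiction = 2), type = "response")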

Yes, collect more data and pool them.

A question arose on one of the numerous facebook-methods groups. The question, in essence, was: I have a sample of 30 from a dissertation; I want to collect 30 more observations to have better [sic] results for a publication. Should I use a Bayesian analysis with the prior informed by the first 30, or should I just use an uninformed prior and completely pool the data? My response: Yes, collect more data and just completely pool the data.

Why

Partial pooling is great, and I encourage it nearly all of the time. However, in this case there was just an arbitrary break in the data collection from the same process, so I think complete pooling is fine (also, partial pooling with only two samples seems pointless, but I could be wrong).
You could of course use the posterior from the first 30 as the prior for the second 30, but this winds up being the same thing as just completely pooling the data, assuming the observations are independently sampled from the same process:
\begin{align}
p(\theta|y_1) &= \frac{p(y_1 | \theta)p(\theta)}{p(y_1)} \\
p(\theta, y_1, y_2) &= p(y_2 | \theta, y_1)p(\theta|y_1)p(y_1) \\
p(\theta|y_1,y_2) &\propto p(y_2 | \theta, y_1)p(y_1|\theta)p(\theta) \\
p(\theta|y_1,y_2) &\propto p(y_1,y_2|\theta)p(\theta) \\
p(\theta|y_1,y_2) &\propto p(y|\theta)p(\theta)
\end{align}
So – just to save time, I suggested simply pooling the data.

Another user was a bit suspicious though.
This user said that if you collect 30 observations which suggest something interesting enough that you should continue collecting observations, then this data-dependent choice will bias your results.

I disagree! This is very close to the optional stopping problem, which inflates the probability of a wrong decision, but it differs for a special reason: data collection is not stopped when the result hits “significance”, but rather continued after it hits “significance”. If you stop upon hitting significance, yes, you will bias your result and inflate error rates, if you care for that sort of thing.
But if you continue after hitting significance for some fixed N, the resulting estimate is less biased, and if the “null” is true, you will actually reduce the error rate.

The reason it is less biased is simple: if you stopped after significance with N = 30, then your estimate is, for most effect sizes, necessarily an overestimate. It must be an extreme estimate in order to pass a threshold if the study is underpowered (and for most effect sizes, N = 30 is underpowered). If you then sample 30 more, your estimate will regress toward the mean; it is unlikely to become more extreme and likely to become less extreme. Gaining more observations means gaining a better representation of the population, and you are far more likely to observe new values near the typical set than anything more extreme than what you already observed.

The reason you have a lower type 1 error rate (assuming, of course, the nil-null is “true”) is similar. With N = 30, you still have an $\alpha$ chance of a type 1 error. Given that type 1 error, sampling $N_\Delta = 30$ more observations will draw your chance-extreme estimate back toward the true mean of zero. To make a type 1 error after the additional $N_\Delta$ observations, think about what would need to occur: given some $Z_n$ statistic for the first $N$ observations, what $N_\Delta$ sample is needed to maintain at least $Z_\alpha$ for rejection in the same direction? Or – what sample would you need to keep being significant purely by chance, and what is the probability of obtaining that sample from a nil-null sampling distribution? It’s small!

Simulation

When the null is true

Let’s test it via simulation.
The following procedure is simulated 100,000 times.

  1. Collect N=30 from Null.
  2. If significant, collect 30 more.
  3. Report $p$, the estimate, and the n (30 or 60).
  • Type 1 error: .01641
  • Expected effect estimate: .00042
  • Effect estimate conditioned on significance: -.01311

Note that this conditional effect should be biased, but due to the 2-tailed test, the biases are canceling each other out.

When the true effect is .4

  • Power: .54841
  • Expected effect estimate: .36552
  • Effect estimate conditioned on significance: .46870
  • Effect estimate when conditioned on obtaining $N_\Delta = 30$ more after initial significance: .46207

The confirmatory approach

But what if we only did as the other user suggested instead: ignore the first sample and just collect a new N=30 sample as a confirmatory study?
That is – collect N=30; if significant, collect another N=30 and analyze that second sample by itself. We’ll only simulate the second sample — since it is independent of the first, we can just simulate from the sampling distributions directly.

When the null is true

  • Type 1 error: .05032
  • Expected effect estimate: -.00070
  • Conditional effect estimate: .00019

Note that the conditional effect estimate really should be biased — It’s just that I’m using a 2-tailed test, and the biases are canceling each other out. The expected positive effect is about .415.

When the true effect is .4

  • Power: .56336
  • Expected effect estimate: .39981
  • Conditional effect estimate: .52331

Conclusion

From the above, I recommend that if you have a sample, and you find something interesting in it, do not be afraid to collect a larger sample and include your previous data in it. If the “null” is true, the error rate decreases due to regression toward the mean. If the null is false, you will have a less biased estimate. Ideally, you should collect enough data that any initial sampling-error induced extremeness will shrink away.
Yes, there’s a chance that you could “lose” significance — given the simulations above, that’s probably for a good reason; but there’s a pretty strong chance you would lose significance anyway. In the 30 + 30 case, overall power was about .54; in the confirmatory 30-then-30 case, power is about .56 each time, so the chance of getting two samples that both support your claim is about .32.

TLDR:

  • Do not collect data until you hit significance, then turn off your data collector. You will obtain overestimates and drastically alter your t1 error rate; under extreme conditions, there is a 100% chance of rejecting the null even if it’s true under this procedure.
  • Do collect more data after finding something interesting, because you will improve or maintain your power and decrease the significance-conditioned bias in your estimate. That’s probably a good rule of thumb anyway – Once you find something interesting even in adequately powered studies, collect a bit more just to shimmy toward the generative parameter a bit. Type 1 error is inflated a bit relative to two independent tests, but not by much — In one case it’s about .0025, and in the other it’s .01; not enough to overwhelm the benefit in estimate accuracy and power. The more N you collect after the first set, the lower this is (which is counter-intuitive, given the p-value asymmetry, but it’s due to joint t1 errors: Those non-sig aren’t continued; those significant then gain more N, but samples are more likely to be near zero than extreme enough to maintain the already improbable t1 error).
  • Do this with enough N to overcome any sampling-induced extreme estimates — As in, don’t collect 30, get significance, then collect 1 more to say the estimate is more accurate, that is silly. You can play around with the code below to see how much N to add to reduce conditional estimates. All things being equal, more N is better N.
  • Do this with a fixed $N_\Delta$, rather than doing it repeatedly until you hit significance again, otherwise you wind up with the conditional stops problem. Basically, don’t switch your rule from “stop once an N is hit” to “stop once an N is hit and significance is found again”.
  • Do partially pool if you can, and you probably can.

EDIT

I wanted to clarify: I don’t like NHST! And I know Bayesian inference doesn’t have nor care about error rates for inference. I know this. Do not yell at me. The above is in response to the idea that pooling data with “interesting” (significant, probable, whatever you want) results with a new dataset from the same process would bias your results and/or inflate error rates — I am playing a frequentist for a moment and saying, no they don’t (or, it’s less biased with better error properties).

sim_study <- function(n1, n2, mu, sigma, alpha = .05){
  # Collect an initial sample of n1 observations
  d <- rnorm(n1, mu, sigma)
  lmOut <- lm(d ~ 1)                    # intercept-only model (equivalent to a one-sample t-test)
  lmSum <- summary(lmOut)
  p <- lmSum$coefficients[1,4]          # p-value for the intercept
  eff <- lmSum$coefficients[1,1]        # estimated effect (the mean)

  # Comment out the block below (down to #%%%) if you don't want to sample more after significance
  if(p <= alpha) {
    d <- c(d, rnorm(n2, mu, sigma))     # collect n2 more observations and pool
    lmOut <- lm(d ~ 1)
    lmSum <- summary(lmOut)
    p <- lmSum$coefficients[1,4]
    eff <- lmSum$coefficients[1,1]
  }
  #%%%
  c(eff = eff, p = p, n = length(d))    # report the estimate, p-value, and final n
}

# Null
reps.null <- t(replicate(100000,expr = sim_study(30,300,0,1)))

mean(reps.null[,'p'] < .05)
mean(reps.null[,'eff'])
mean(reps.null[reps.null[,'p'] < .05,'eff'])

# Not null
reps <- t(replicate(100000,expr = sim_study(30,30,.4,1)))

mean(reps[,'p'] < .05)
mean(reps[,'eff'])
mean(reps[reps[,'p'] < .05,'eff'])

A short thought on the inconsistency of simplifications

While grading tests today for our undergraduate advanced stats course, I had a funny (I think, anyway) thought about the inconsistency of model simplifications.

The course is focused on linear models. It’s “advanced” statistics for this reason (advanced is… relative). So they learn to do independent samples t-tests, dependent samples t-tests, and ANOVA, all within a linear model framework.
Then of course, they learn about [normal-assumptive] multiple regression, and we teach them to think about statistics themselves as being a super simple linear model of sorts ($\bar x_{0,s,p} = \mu_{0,p} + \epsilon_{0,s,p}$).

Anyway, on the test, they have to interpret a paired-samples “t-test” (a linear model on paired score differences). I had a realization about something, that I think highlights why I don’t particularly care about model simplification, and some inconsistent practices we seem to engage in.

When the analyst knows that scores are paired and they want to conduct a t-test, they wind up using a dependent samples t-test, which explicitly includes a covariance in the computation of the standard error (the covariance is subtracted out twice).
However, I don’t know if anyone actually bothers to see whether the paired-sample correlation is “significant” before doing so.
They simply acknowledge that a dependency in scores exists due to the design and units collected. It’s a design feature, and it’s a data-generative feature that is assumed to exist.
Basically, the researcher knows that some dependency probably exists, and then they use the “appropriate” model or test for such a dependency, all most likely without actually assessing whether that assumed dependency is “significant” and should be included.
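
A small R sketch of that point with simulated paired scores (names and values are illustrative): the paired-samples standard error is built from $\mathrm{Var}(x_1 - x_2) = \mathrm{Var}(x_1) + \mathrm{Var}(x_2) - 2\,\mathrm{Cov}(x_1, x_2)$, and nobody tests that covariance before using it.

set.seed(3)
n <- 50
pre  <- rnorm(n, 100, 15)
post <- pre + rnorm(n, 2, 5)    # post scores depend on pre scores, so the pairs are correlated

# Var(pre - post) = Var(pre) + Var(post) - 2*Cov(pre, post)
se_paired <- sqrt((var(pre) + var(post) - 2 * cov(pre, post)) / n)
se_naive  <- sqrt((var(pre) + var(post)) / n)   # what an independent-samples-style SE would use

t.test(pre, post, paired = TRUE)$statistic   # built-in paired t-test
(mean(pre) - mean(post)) / se_paired         # same t statistic; the covariance is subtracted out
(mean(pre) - mean(post)) / se_naive          # ignores the dependency; smaller in magnitude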

I say that to contrast what we tend to do with other models.
I was helping a labmate with some multilevel modeling problems shortly before grading the tests, and this is why these ideas clicked together.
A researcher collects a sample which inevitably consists of both known and unknown subpopulations and clusters for which dependencies may exist — If you know someone is of a certain demographic, then there’s a chance that their score may correlate with someone of the same demographic, and hence there is a dependency.
But what do we do?

Some MLM books and programs by default recommend or compute for you the significance of the random effects or random correlations.
These tests basically include likelihood ratio tests (likelihood with the correlation over likelihood without it), Wald tests (horrible for variance parameters, so not gonna discuss those), and variance tests (a $\chi^2$ test for whether the random variance is greater than the [estimated] expected sampling error variance).
A researcher may determine whether to include the random variance or not, depending on the significance, despite knowing that these clusters do exist and could have dependencies.
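
For concreteness, here is a hedged sketch of that practice using lme4 and simulated clustered data (all names and values are mine): fit the model with and without the random intercept and compare likelihoods.

library(lme4)

set.seed(4)
g <- gl(30, 10)                               # 30 clusters, 10 observations each
y <- rnorm(30, 0, 0.5)[g] + rnorm(300, 0, 1)  # cluster effect plus residual noise
d <- data.frame(y, g)

fit_lm  <- lm(y ~ 1, data = d)                            # ignores the clustering
fit_mlm <- lmer(y ~ 1 + (1 | g), data = d, REML = FALSE)  # random intercept per cluster

# Likelihood ratio test for the random-intercept variance
lrt <- as.numeric(2 * (logLik(fit_mlm) - logLik(fit_lm)))
pchisq(lrt, df = 1, lower.tail = FALSE)   # conservative: the null variance sits on a boundary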

The funny realization is that – In these harder models where more thoughtwork is needed, researchers are told to check for these dependencies via some test or model comparison, and only if deemed necessary include it in the model.
But I hazard a guess – These people probably don’t do the same with paired-samples t-tests — They probably don’t test to see whether the paired-samples correlation is significant before using a dependent samples t-test.

Why? Because in the paired-samples t-test, you acknowledge the design and structure that presumably does exist in the data, and doing so improves your inferences. Yet we don’t do the same for linear models!
This dovetails with other recommendations for why random effects models should really be the default. We know there is most likely a dependency among scores to some degree; we don’t really need to test for that before fitting the model. If the researcher knows that a dependency may exist, just use an MLM, period, just as you would default to a paired-samples t-test when you have two paired data vectors you assume are dependent.

Basically what I am saying is – Be consistent.
If you would consider the design and possible dependencies, and in doing so be ok with using a paired-samples t-test without actually assessing the significance of the paired correlation, you should do the exact same thing with linear models.
If you suspect a dependency due to pairings and clusters, use a MLM instead of an LM, just as you would choose a paired-samples t-test instead of the independent samples t-test.
I am not suggesting that you actually check for significance before using a dependent samples t-test or MLM — On the contrary, your model should express what you know and can feasibly model. If subpopulations exist that could feasibly be dependent, just use a MLM; easy; no checking necessary. Best case scenario, your inference is improved; worst case scenario, it’ll basically collapse back into an LM anyway, just as a dependent t-test with zero covariance will wind up being an independent samples t-test.

Separate frequentist inference, frequency probability, and estimation

This is a short post, but I think people need to separate frequentist inference, frequency probability, and estimation a bit better when describing statistical practices.

Some concepts in an oversimplified nutshell:

  • Frequentist inference: Inferring about an unknown parameter or process based on assumed or known frequentist properties. E.g., “If the generative parameter were zero, the [assumed, frequentist] probability of observing something at least as extreme as what I observed is low, so I will assume the generative parameter is not zero.” This inference is based on an assumed distribution resulting from repeated sampling of the same process, and from this assumed distribution, the frequency of occurrence of some effect (or more extreme) is small. One obtains data, employs an assumed distribution of events, and makes inferences about the relative frequency of occurrences.
  • Frequency probability: The computation of a probability from frequencies of events. This is required for frequentist inference, but is not limited to frequentist inference. Bayes theorem can use frequency-derived probabilities. Frequency-derived probabilities are sometimes used as data for a Bayesian model. Simulations of model-based inferential decisions use frequency probabilities — The frequentist probability of making Bayesian-informed decisions may be computed, and is irrelevant to frequentist inference per se. It is what it is — The probability of an event over repeated instances; this is useful for so many contexts outside of frequentist inference.
  • Estimation — Does not depend on frequentism or frequentist probabilities. I throw this in because I see people say “frequentism assumes a uniform prior” and that’s not really true. Maximum likelihood estimation may implicitly assume a uniform prior, but that is not relevant to frequentist inference. You could use penalized maximum likelihood (i.e., a non-uniform prior combined with a likelihood function) and still be frequentist; you’re a frequentist once your inferential logic depends on an assumed universe of possible repeated events.

You can be a Bayesian and accept the existence of frequency probabilities. But as a Bayesian, you do not construct frequency probabilities for use in inference, as inference comes from posterior uncertainty. Frequentists construct frequency probabilities with an assumed sample space of events, and use the implied frequency probability for inference. In fact, modern Bayesians can’t really reject the premise of frequency probabilities, given that posterior inferences are typically defined from the frequentist properties of an MCMC posterior sampler. What is $p(\theta > 0 \mid \text{data})$? You estimate it from the proportion of representative posterior MCMC samples that are above zero — It’s a frequency probability used to estimate a Bayesian quantity. There’s nothing inconsistent about this — Seriously, there isn’t. In the same way, a Bayesian can still flip a fair coin 100 times and see that 54% of the flips resulted in heads; that probability is not wrong or invalid, we just don’t use assumed sampling distributions of possible coin flips for inference about the fairness of the coin. The inference about fairness comes from posterior uncertainty given that 54/100 of the flips were heads, instead of “assuming the coin is fair, what is the probability of seeing at least 54 heads in 100 flips?”. Frequency probabilities are not bound to frequentist inference, but are required for frequentist inference.
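
A trivial R sketch of that last point; the draws below are simulated stand-ins for MCMC output rather than samples from a real model.

set.seed(5)
posterior_draws <- rnorm(4000, mean = 0.12, sd = 0.08)  # pretend these are draws from an MCMC sampler
mean(posterior_draws > 0)   # relative frequency of draws above zero, estimating p(theta > 0 | data)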

You can be a frequentist and use non-uniform “priors” (in the sense of penalized maximum likelihood), because estimation is separate from frequentist inference. The frequentist inference would be employed the second you construct a hypothetical sampling distribution of possible events under some parameter given some statistical model (including a PML-estimated model) for use in inferring from some observed estimate. I.e., a frequentist may use PML to estimate a model, then construct a sampling distribution of PML-model derived estimates under some fixed parameter value, and compute the probability of a given estimate (or greater) from that sampling distribution. That’s still frequentist, even with an estimation penalty, simply because the inference is derived from the frequency of events under an assumed counter-factual.

TLDR: frequentist inference depends on frequency probabilities and properties. Frequency probabilities are not limited to frequentist inference. Bayesians may still use frequency probabilities (as data, as priors [see Jaynes], in MCMC samplers, to understand frequentist properties of Bayesian estimators and decisions over repeated instances). Estimation is independent of frequentism, and frequentism is not defined by the use of implicit “uniform” priors (that is a property of the estimator, not the inferential logic).