Informative priors for correlation matrices: An easy approach

Estimating correlations is a dead simple thing to do, right?


But sampling correlations is a dead simple thing to do right?

Not necessarily.

Correlation matrices must be positive semi-definite. When doing Bayesian analysis, you therefore need to parameterize in such a way that the correlation matrix is positive semi-definite. That is not a trivial procedure. Thankfully, Stan (and others, e.g., PyMC3) do that hard work for you, under the hood, and with the correct Jacobian adjustments.

However, even a uniformly distributed correlation matrix does not yield marginally uniform elements. My previous post explored this. Essentially, the positive semi-definite (PSD, for short) constraint restricts the range of possible correlation matrices, which imposes an implicit prior on the matrix, and in turn the unique elements.

More to the point, the LKJ(1) prior is a uniform prior on permissible matrices (permissible, meaning symmetric and PSD). Most permissible matrices beyond dimension 2 will be in a hypersphere around the identity matrix. Even with a uniform density, most of the volume contains solutions without large individual elements.

In short – A uniform LKJ, which says all permissible solutions are equally likely, is not "uniform" in the intuitive meaning of "uniform", because most permissible solutions are found around an identity matrix.

Because probability mass is density $\times$ volume (the integral of density over area), most probability mass will therefore be near identity. Unfortunately, this means that it can be surprisingly hard to attain high posterior mass over correlation matrices with high correlations.

To show this "problem", I generated 50 observations of 15 highly correlated data points. I computed the observed correlation matrix, and plotted each element in black. I then estimated a Bayesian multivariate normal with a uniform LKJ prior on the correlation matrix. The posterior means are plotted in red.

Notice how they are all considerably lower than they "should be" given the output of cor(y).

This is a well-known "problem" of the LKJ prior. However, this is not a problem with LKJ at all, but with most of the volume, and therefore most of the mass, being around the identity. With enough data points, you will be able to "overcome" the volume problem due to the much higher joint density of larger correlations.

Using priors

When there are problems like this, the natural Bayesian solution is to include some more prior information. Recall that the uniform prior on correlation matrices has the implicit prior of "Must be symmetric and positive semi-definite". Well, we may know that some correlations should be higher with high certainty, or others are relatively unknown.

Naively, you may try to construct a matrix from $D(D-1)/2$ estimated parameters, and put priors on each of the individual parameters. But you will be sorely disappointed — The probability of such a matrix being PSD is vanishingly small as D increases.

However! What you can do is construct a PSD matrix, and add more prior information. That is:

$$ p(R) \propto \text{LKJ}(R|\eta = 1)\prod_u \mathcal{N}(R_u | \mu_u, \sigma_u) $$ where $R$ is a symmetric, PSD correlation matrix, $u$ indexes the unique, lower-triangular elements of $R$, and the normal distribution with locations $\mu_u$ and scales $\sigma_u$ encode prior information about each element $R_u$. In words, this just means that R has a joint prior consisting of a uniform LKJ on the matrix (effectively contributing nothing) and normals on its elements.

E.g., if you think R[2,1] should be fairly high, then you could say $\mu_1 = .8, \sigma_1 = .2$.

This approach is adding more density to certain regions of R’s parameter space. By adding more density, you are adding more mass. Intuitively, you are countering the enormous volume of near-identity PSD solutions with more density further away from identity. Therefore, more mass can exist away from the high-volume set of PSD solutions that otherwise "shrink" your gloriously high correlations down to zero.

Some Stan code

functions {
    Extract lower-triangular from mat to vector.
    Excludes diagonal.
    @param mat Matrix.
    @return vector
  vector lower_tri(matrix mat) {
    int d = rows(mat);
    int lower_tri_d = d * (d - 1) / 2;
    vector[lower_tri_d] lower;
    int count = 1;
    for(r in 2:d) {
      for(c in 1:(r - 1)) {
	lower[count] = mat[r,c];
	count += 1;

    Proportional log-density for correlation matrix.
    @param cor Matrix. PSD, symmetric correlation matrix.
    @param point_mu_lower Vector. D(D-1)/2 length. Locations of each [unique] element's prior.
    @param point_scale_lower Vector. D(D-1)/2 length. Scales of each [unique] element's prior.
    @return log density.
  real lkj_corr_point_lower_tri_lpdf(matrix cor, vector point_mu_lower, vector point_scale_lower) {
    real lpdf = lkj_corr_lpdf(cor | 1) + normal_lpdf(lower_tri(cor) | point_mu_lower, point_scale_lower);
    Same as lkj_corr_point_lower_tri_lpdf, but takes matrices of prior locations and scales.
  real lkj_corr_point_lpdf(matrix cor, matrix point_mu, matrix point_scale) {
    int d = rows(cor);
    int lower_tri_d = d * (d - 1) / 2;
    vector[lower_tri_d] point_mu_lower = lower_tri(point_mu);
    vector[lower_tri_d] point_scale_lower = lower_tri(point_scale);
    real out = lkj_corr_point_lower_tri_lpdf(cor| point_mu_lower, point_scale_lower);
    Same as lkj_corr_point_lower_tri_lpdf, but takes cholesky-factor of correlation matrix.
  real lkj_corr_cholesky_point_lower_tri_lpdf(matrix cor_L, vector point_mu_lower, vector point_scale_lower) {
    real lpdf = lkj_corr_cholesky_lpdf(cor_L | 1);
    int d = rows(cor_L);
    matrix[d,d] cor = multiply_lower_tri_self_transpose(cor_L);
    lpdf += normal_lpdf(lower_tri(cor) | point_mu_lower, point_scale_lower);

data {
  int N; // # observations
  int D; // # variables
  vector[D] y[N]; // Data
  matrix[D,D] cor_guess; // Prior correlation location guesses.
  matrix[D,D] cor_sd; // Prior correlation uncertainties/scales.
  int just_lkj; // Whether to just use LKJ, or use defined lpdfs.

transformed data {
  int d_unique = D*(D-1)/2;
  vector[d_unique] cor_guess_lower = lower_tri(cor_guess);
  vector[d_unique] cor_sd_lower = lower_tri(cor_sd);

parameters {
  vector[D] mu;
  vector<lower=0>[D] sigma;
  cholesky_factor_corr[D] cor_L;

model {
  mu ~ std_normal();
  sigma ~ std_normal();
  if(just_lkj) {
    cor_L ~ lkj_corr_cholesky(1);
  } else {
    cor_L ~ lkj_corr_cholesky_point_lower_tri(cor_guess_lower, cor_sd_lower);

  y ~ multi_normal_cholesky(mu, diag_pre_multiply(sigma, cor_L));

generated quantities {
  matrix[D,D] cor = multiply_lower_tri_self_transpose(cor_L);

Some data generating code


sm <- stan_model("lkj.stan")

N <- 50
D <- 15

y <- data.frame(x1 = rnorm(N))
y[,paste0("x",2:D)] <- y[,1] + matrix(rnorm(N*(D-1), 0, .5), N, D-1)

# Uninformative (High scale value)

cor_guess <- cor(y)
cor_sd <- matrix(10, D,D)

stan_data <- list(N = N,
                  D = D,
                  y = y,
                  cor_guess = cor_guess,
                  cor_sd = cor_sd,
                  just_lkj = FALSE)

sOut_loose <- sampling(sm, stan_data, cores = 4)
means_loose <- summary(sOut_loose, pars = "cor")$summary[,"mean"]

# Make informative
cor_sd <- matrix(.1, D, D)
stan_data$cor_sd <- cor_sd

sOut_tight <- sampling(sm, stan_data, cores = 4)
means_tight <- summary(sOut_tight, pars = "cor")$summary[,"mean"]

# Compare to LKJ
stan_data$just_lkj <- TRUE
sOut_lkj <- sampling(sm, stan_data, cores = 4)
means_lkj <- summary(sOut_lkj, pars = "cor")$summary[,"mean"]

How did we do?

Note that I’m effectively "cheating" here by using the computed correlation matrix as our "prior guess". However, this shows a "best case scenario" where our prior guesses are exactly as observed in the data, plus a bit of uncertainty encoded into the cor_sd matrix.

Below is a plot of cor(y) values (black) and posterior means of a uniform LKJ prior (red), uninformed prior (purple), and informed prior (blue). The informed prior approach generally performed better. Note that they are all still a bit low, but the uninformed and uniform LKJ priors were consistently far away from the cor(y) values, to a startling degree.

Recasting to a cleaner plot below, you can see that the informed approach tended to map onto the estimated correlations much better. The black line indicates a perfect mapping. The informed approach still has downward bias, but that is nothing compared to the bias seen in the uninformed and LKJ priors.


The LKJ is extremely popular in Stan and other similar samplers. However, due to the PSD constraint, the posterior mass of high-dimensional correlation matrices tends to be around identity, and high correlations are difficult to accurately estimate. Coupling the PSD constraint with additional prior information can attenuate this problem. The low volume of PSD solutions is matched with greater density, leading to more mass near the high correlation solutions.

In retrospect, this is an obvious solution to an annoying problem. I think the reason it was not obvious before, is due to the way PPLs express priors. When one says a ~ p(), one assumes (rightfully so) that a is generated from p(). This makes joint priors unintuitive. With something like Stan, writing a ~ p(); a ~ p_2() implies a joint probability contribution, not a "simulation" statement. Similarly, one can say a ~ p(); f(a) ~ p_2(). This all translates to joint statements, not generative ones. As a downside, this also means it is difficult to generate random variables from the joint prior.

Future work could improve this method. I used a Normal kernel to contribute more density to the desired regions. While this is fine for unnormalized, proportional distributions, one would need to account for the (-1,1) constraint for each element — This requires a (potentially) costly set of CDF divisions to normalize the contribution. For Stan this isn’t an issue, because those statements would be constants, and constants are irrelevant for the proportional densities that MCMC needs.

Instead, one could substitute the normal kernel for something like a location-scale parameterized, translated beta that has support on (-1,1). In that case, no CDF is required, because the support is already naturally in the right region. Additionally, it may be more intuitive to reason with, because translated beta is effectively how LKJ-distributed matrix elements are distributed. Most generically, you could think of this approach as: $$ p(R|\eta, K, \phi) = \text{LKJ}(R|\eta)\prod_uK(R_u|\phi_u) $$, where R is a constructed as PSD, $u$ indexes unique lower diagonal elements, $K$ is some kernel (density function), and $\phi_u$ are the parameters for element $u$ and kernel $K$.

Another big benefit of this approach is that it may be possible to hierarchically model specific correlation matrix elements as randomly varying across sets of groups. Instead of providing a prior on each element, one would estimate the prior. One can already hierarchically model the entire correlation matrix, but not individual elements. This approach would likely fix that problem, and permit fully-hierarchical models. E.g., if you used an mixed effects location scale model, then location, scale, and correlation parameters could all hierarchically vary.

Lastly, shout out to Mike Lawrence and Sean Pinkney for inspiration of this approach. They have both been working on problems related to underestimated, or non-PSD, correlation matrices. Without their and my discussions about these issues, I wouldn’t have even tried this admittedly simple (but useful) approach.

Addendum 1 – Sean’s perspective and equivalent solution

I wanted to elaborate a bit more on Sean Pinkney’s parallel approach to this. By mathematical coincidence, we came up with equivalent solutions to this problem, but from different perspectives, and all within a month of one another.

In a very long Slack thread, Mike, Sean, and I were all talking about correlation matrix issues (underestimation, non-PSD). At some point in a twitter DM and in that thread, I mentioned the approach above as a possible solution to adding prior information about high correlations (and similarly, about where correlations are expected to be zero).

At the same time, Sean has been working on a Bayesian approach to finding a "Nearest PSD matrix to the target matrix", as I understand it. On a recent thread, he supplied some code for this approach.

Sean’s approach was elegant, and made perfect sense:

  • Construct a PSD matrix
  • Have target correlations
  • Say that target_cor ~ normal(estimated_cor, corpriorscale)

Therefore, the estimated PSD matrix has to be 1) PSD 2) Fit the target_cor well and 3) Fit the data well. Consequently, it gives a "Nearest-PSD solution to the target correlations and data".

Then I realized that Stan constructs PSD matrices (using the technique that Sean’s code does), and that due to the symmetric nature of the Normal parameterization: target_cor ~ normal(estimated_cor, corpriorscale) [Sean’s approach] is the same as saying estimated_cor ~ normal(target_cor, corpriorscale) [My approach].

Therefore, our approaches are equivalent if using a normal kernel. I love that this happened, because it can be understood in two ways. From my perspective — One is adding prior information to the individual elements, thereby creating a joint prior on the matrix and its elements. From Sean’s perspective — One is rotating a PSD matrix toward a target solution and data. But it’s the same thing! I love when this happens in statistical methods. One intuition strengthens another. I suspect they may not be equal for other kernels (because for some distributional families, one parameter and one data point may not be swapped), but I also suspect the two would always be one transform away from another. After all, these are all just joint probability statements.

Addendum 2 – Constraints!

Ed Merkle (in the comments) brought up the idea of constraints. I think this approach could work well for constraints.

For constraining an element, $R_a$ to be effectively zero, one could use: R_a ~ normal(0, very_small_scale). It would not be exactly zero, but it would be effectively zero. This is akin to how Bayesian EFA replaces a-priori zero-constrained loadings with loadings that have spiked priors at zero.

Similarly, one could constrain one correlation, $R_b$ to be effectively equal to $R_a$, by using:

R_a ~ normal(mu_a, scale_a)
R_b ~ normal(R_a, very_small_scale_b)

Again, this does not enforce a hard constraint that the two are precisely equal, but rather a soft constraint that the two should be effectively equal.

Addendum 3 – An intuitive plot

  • Left pane: LKJ(1) (a uniform prior over all permissible correlation matrices).
  • Middle pane: Add density to solutions where $R_{2,1}$ is high (using Normal(.8, .1))
  • Right pane: The resulting prior.

Neural Networks in Stan: Or how I was utterly surprised that it worked at all.

Feed-forward neural networks are a staple in machine learning. The basic feed-forward NN (which I’ll just call a NN from here on out) is a relatively simple idea.

The tale is as old as (statistical) time: You have a set of "features" (covariates) and you want to predict an outcome. This outcome might be continuous, or it might be categorical (for classification tasks). However, you may not care whatsoever about interpreting parameters to the model, you just want good predictions (or, "inferences", depending on your background). Stated alternatively, you want to estimate a function that takes some input and spits out a relatively accurate output.

The set of predictive models is enormous. Given my Bayesian background, I tend to prefer principled approaches, such as the Gaussian process models, or Dirichlet process models, etc, for function-finding. Look no further than my omegad package for evidence of that. There, I use a GP to model the relationship between exogenous variables, latent variables, and a latent measurement error term to get non-parametric, non-linear predictions of reliability! In a forthcoming package called mires, I use Dirichlet processes to estimate some terrible posterior density functions from MCMC output for Bayes factor computation.

But I digress, you are all here for neural nets in Stan.

NN in a nutshell

A neural network takes features, and projects them into a latent space. It does this via a multivariate linear model. Then those linear variates are fed through an activation function, which maps the linear variate to a constrained space. Some examples are the inverse logit function, tanh, and rectified linear units (ReLU).

You then take those activation outputs, and project them again into a second latent space (a second hidden layer). You again run these through activation functions. Rinse and repeat through the hidden layers until you arrive at the output stage.

In the output stage, you project the last hidden layer’s values into your prediction, again using a (possibly multivariate) linear model. If you are doing classification, then this last stage involves a multinomial logistic regression. If you are doing regression, this this last stage is a linear model prediction.

To make it more concrete, let’s look at formulas. I am going to describe it as a multivariate logistic model; however, in the machine learning world, they often work with all these things transposed. I am also going to fit a rectangular NN, such that all hidden layers have the same number of neurons (or nodes, or linear outputs).

Let $H_{it}$ be the $t$-th hidden layer $X_i$ is a vector of inputs. $Y_i$ is the output. $W_t$ is the linear coefficient matrix for the $t$-th hidden layer; or "weights". $b_t$ is the row vector of intercepts, or "biases", for the $t$-th hidden layer. $n_H$ is the number of neurons per hidden layer.

Step 1: Project $X_i$ into latent space, and call activation function $f$ on each element: $$ H_{i1} = f(X_i W_1 + b_1) $$ $H_{i1}$ now contains the $n_H$ activations. I use the inverse logit function for this, so that each element in $H_{i1}$ is squashed to be between 0 and 1.

Step 2: Project the neural outputs onto a second layer: $$ H_{i2} = f(H_{i1} W_2 + b_2) $$

Step 3 through T: Continue this process until the final layer. $$ H_{it} = f(H_{i, t – 1}W_t + b_t) $$

Step T+1: Generate output predictions.

How this step is implemented depends on the task. For classification, you are predicting multinomial logits (usually with reference to a base class). $$ \begin{align*} \hat y_\text{logit} &= H_T W_\text{out} + b_\text{out} \\ \hat y_\text{prob} &= \text{softmax}(\hat y_\text{logit}) \\ y &\sim Categorical(\pi = \hat y_\text{prob}) \tag{If using Bayes} \end{align*} $$ For regression, it is just a linear model: $$ \begin{align*} \hat y &= H_T W_\text{out} + b_\text{out} \\ y &\sim \mathcal{N}(\hat y, \sigma) \tag{If using Bayes} \end{align*} $$

That’s it. That’s the model. Then for predictions, you just feed in new $X_i$ values and carry through the matrix multiplication and activations. This is, quite literally, a multivariate logistic regression model in the hidden layers, then a multinomial regression for classification, or a linear model for regression.

There are all sorts of tweaks you can include if you choose. Perhaps you don’t want a normal likelihood (assuming you’re using a probabilistic loss at all). Perhaps you want to model both the $\mu$ and $\sigma$ parameters from a NN (given my work in location scale models, this is an interesting idea to me). You could embed this into larger models to jointly include random effects and latent variables into it. And this doesn’t even begin to get into, e.g., dropout (i.e., Bayesian spike and slab) or regularization.

Bayesian models need priors (alternatively stated, your model deserves priors). I was lazy, I just threw standard normal priors on all weights and biases. In effect, this is similar to L2 regularization. For sampling, I also assume biases within each layer are ordered, because neural nets are notoriously overparameterized and underidentified.

And this is why I was surprised it worked at all in Stan. Bayesian NNs are extremely hard, because the posterior is intractably aliased. All hidden layers contain multiple neurons, and it does not much matter which neuron does what. In essence, it’s a label-switching problem as seen in mixture modeling, but taken to a new extreme.

When estimating NNs, the MCMC chains should either oscillate between identical solutions within each chain; or each chain should find a different mode. I am sure that my implementation would simply not sample well for even moderately large NNs. Stan is just not optimized for this problem; it would take until the heat death of the universe to fit a truly large, deep NN, and you’d have long exhausted the memory needed for it. But in my small regime of say, two layers each with 50 nodes, it sampled fine! More specifically, the posterior predictive distribution sampled fine. However, sometimes the NN parameters even sampled fine, much to my utter surprise.

The classification model

functions {
  vector[] nn_predict(matrix x, matrix d_t_h, matrix[] h_t_h, matrix h_t_d, row_vector[] hidden_bias, row_vector y_bias) {
    int N = rows(x);
    int n_H = cols(d_t_h);
    int H = size(hidden_bias);
    int num_labels = cols(y_bias) + 1;
    matrix[N, n_H] hidden_layers[H];
    vector[num_labels] output_layer_logit[N];
    vector[N] ones = rep_vector(1., N);

    hidden_layers[1] = inv_logit(x * d_t_h + ones * hidden_bias[1]);
    for(h in 2:H) {
      hidden_layers[h] = inv_logit(hidden_layers[h-1] * h_t_h[h - 1] + ones * hidden_bias[h]);
    for(n in 1:N) {
      output_layer_logit[n, 1] = 0.0;
      output_layer_logit[n, 2:num_labels] = (hidden_layers[H, n] * h_t_d + y_bias)';

data {
  int N; // Number of training samples
  int P; // Number of predictors (features)
  matrix[N, P] x; // Feature data
  int labels[N]; // Outcome labels
  int H; // Number of hidden layers
  int n_H; // Number of nodes per layer (All get the same)

  int N_test; // Number of test samples
  matrix[N_test, P] x_test; // Test predictors

transformed data {
  int num_labels = max(labels); // How many labels are there

parameters {
  matrix[P, n_H] data_to_hidden_weights; // Data -> Hidden 1
  matrix[n_H, n_H] hidden_to_hidden_weights[H - 1]; // Hidden[t] -> Hidden[t+1]
  matrix[n_H, num_labels - 1] hidden_to_data_weights; // Hidden[T] -> Labels. Base class gets 0.
  // ordered[n_H] hidden_bias[H]; // Use ordered if using NUTS
  row_vector[n_H] hidden_bias[H]; // Hidden layer biases
  row_vector[num_labels - 1] labels_bias; // Labels biases. Base class gets 0.

transformed parameters {
  vector[num_labels] output_layer_logit[N]; // Predicted output layer logits

  output_layer_logit = nn_predict(x,


model {
  // Priors
  to_vector(data_to_hidden_weights) ~ std_normal();

  for(h in 1:(H-1)) {
    to_vector(hidden_to_hidden_weights[h]) ~ std_normal();

  to_vector(hidden_to_data_weights) ~ std_normal();

  for(h in 1:H) {
    to_vector(hidden_bias[h]) ~ std_normal();
  labels_bias ~ std_normal();

  for(n in 1:N) { // Likelihood
    labels[n] ~ categorical_logit(output_layer_logit[n]);

generated quantities {
  vector[num_labels] output_layer_logit_test[N_test] = nn_predict(x_test,
  matrix[N_test, num_labels] output_test;
  for(n in 1:N_test) {
    output_test[n] = softmax(output_layer_logit_test[n])';

The regression model

functions {
  vector nn_predict(matrix x, matrix d_t_h, matrix[] h_t_h, vector h_t_d, row_vector[] hidden_bias, real y_bias) {
    int N = rows(x);
    int n_H = cols(d_t_h);
    int H = size(hidden_bias);
    matrix[N, n_H] hidden_layers[H];
    vector[N] output_layer;
    vector[N] ones = rep_vector(1., N);

    hidden_layers[1] = inv_logit(x * d_t_h + ones * hidden_bias[1]);
    for(h in 2:H) {
      hidden_layers[h] = inv_logit(hidden_layers[h-1] * h_t_h[h - 1] + ones * hidden_bias[h]);
    output_layer = hidden_layers[H] * h_t_d + y_bias;

data {
  int N; // Number of training samples
  int P; // Number of predictors (features)
  matrix[N, P] x; // Feature data
  vector[N] y; // Outcome
  int H; // Number of hidden layers
  int n_H; // Number of nodes per layer (All get the same)

  int N_test; // Number of test samples
  matrix[N_test, P] x_test; // Test predictors

transformed data {

parameters {
  matrix[P, n_H] data_to_hidden_weights; // Data -> Hidden 1
  matrix[n_H, n_H] hidden_to_hidden_weights[H - 1]; // Hidden[t] -> Hidden[t+1]
  vector[n_H] hidden_to_data_weights;
  // ordered[n_H] hidden_bias[H]; // Use ordered if using NUTS
  row_vector[n_H] hidden_bias[H]; // Hidden layer biases
  real y_bias; // Bias. 
  real<lower=0> sigma;

transformed parameters {
  vector[N] output_layer;

  output_layer = nn_predict(x,


model {
  // Priors
  to_vector(data_to_hidden_weights) ~ std_normal();

  for(h in 1:(H-1)) {
    to_vector(hidden_to_hidden_weights[h]) ~ std_normal();

  to_vector(hidden_to_data_weights) ~ std_normal();

  for(h in 1:H) {
    to_vector(hidden_bias[h]) ~ std_normal();
  y_bias ~ std_normal();

  sigma ~ std_normal();

  y ~ normal(output_layer, sigma);

generated quantities {
  vector[N_test] output_test = nn_predict(x_test,
  vector[N_test] output_test_rng;
  for(n in 1:N_test) {
    output_test_rng[n] = normal_rng(output_test[n], sigma);

Fitting the categorical model

As is typical, you need to compile the stan model. I also include a (quickly hacked together, messy) function to fit the model and give some sane output (Stan’s optimizing output is… less than stellar).

sm <- stan_model("nn_cat.stan")

fit_nn_cat <- function(x_train, y_train, x_test, y_test, H, n_H, method = "optimizing", ...) {
    stan_data <- list(
        N = nrow(x_train),
        P = ncol(x_train),
        x = x_train,
        labels = y_train,
        H = H,
        n_H = n_H,
        N_test = length(y_test)
    if(method == "optimizing") {
        optOut <- optimizing(sm, data = stan_data)
        test_char <- paste0("output_test[",1:length(y_test), ",",rep(1:max(y_train), each = length(y_test)),"]") 
        y_test_pred <- matrix(optOut$par[test_char], stan_data$N_test, max(y_train))
        y_test_cat <- apply(y_test_pred, 1, which.max)
        out <- list(y_test_pred = y_test_pred,
                    y_test_cat = y_test_cat,
                    conf = table(y_test_cat, y_test),
                    fit = optOut)
    } else if(method == "sampling") {
        out <- sampling(sm, data = stan_data, pars = "output_test", ...)


Now let’s get the Iris data, split into a training and testing sample, and fit.

x <- iris[,1:4]
y <- as.numeric(as.factor((iris[,"Species"])))

N_test <- 50
test_indices <- sample(1:nrow(x), N_test)
x_train <- x[-test_indices,]
y_train <- y[-test_indices]
x_test <- x[test_indices,]
y_test <- y[test_indices]

I just used two hidden layers, each with 50 neurons. Fewer neurons seems to be fine too. Adding more layers seems to break prediction considerably in optimization, and sampling takes ages due to the rather insane increase of parameters that new layers bring.

fit_opt <- fit_nn_cat(x_train, y_train, x_test, y_test, 2, 50, method = "optimizing")

Let’s see how we did:

> fit_opt$conf
y_test_cat  1  2  3
         1 16  0  0
         2  0 13  0
         3  0  0 21

Nice! Our held-out sample was classified perfectly.

Let’s try with NUTS/MCMC:

fit_nuts <- fit_nn_cat(x_train, y_train, x_test, y_test, 2, 50, method = "sampling", cores = 4, iter = 1000)

And… let’s see how we did:

cat_nuts <- summary(fit_nuts)$summary[,"mean"] %>%
                            matrix(N_test, 3, byrow = TRUE) %>%
                            apply(1, which.max)
table(cat_nuts, y_test)

That terribly named variable name contains the classifications of each test output, based on the posterior expectation (not the mode) of their predicted class probabilities.

After about 30 seconds, we get:

>         y_test
cat_nuts  1  2  3
       1 16  0  0
       2  0 13  0
       3  0  0 21

Which is great!

The Iris dataset is fairly easy to predict, because the labels are well-separated in the feature space. Let’s give another classic a go.


The MNIST dataset contains handwritten digits from 0 to 9. The dataset I import has all the pixel data, flattened into rows, for each image. It then has the true label as the last column. Altogether, this dataset has 70,000 images, and 784 features (one for each pixel in a 28×28 px image).

Because Stan is, and I cannot emphasize this enough, not built for neural nets, I cannot estimate a large NN, nor can I feed it a lot of data. When you have ‘bigger’ data and lots of matrix multiplication, you really need GPU acceleration and multithreaded computations. Stan now has some GPU acceleration, and some support for multithreading; rstan (2.21) does not currently have GPU acceleration, and its support for multithreading depends on map_rect, which would be awful to implement for this model.

Therefore, I’m only going to optimize, not sample, and with only 5000 training, 1000 test samples. Moreover, I’ll only use 2 hidden layers, each with 30 neurons.

library(snedata) #

mnist <- download_mnist()

x <- mnist[,c(-ncol(mnist))]
y <- as.numeric(mnist[,"Label"])

N_train <- 5000
N_test <- nrow(x) - N_train
N_test <- 1000

x_train <- head(x, N_train)
y_train <- head(y, N_train)
x_test <- tail(x, N_test)
y_test <- tail(y, N_test)

fit <- fit_nn_cat(x_train, y_train, x_test, y_test, 2, 30)

After waiting for a bit, we get:

# Classification matrix for MNIST
y_test_cat   1   2   3   4   5   6   7   8   9  10
        1   97   0   5   0   0   1   1   0   0   0
        2    0 115   0   0   0   0   0   0   2   0
        3    0   2  85   2   1   3   1   5   0   0
        4    0   0   2  94   0   4   0   0   1   0
        5    1   0   0   0  91   1   2   0   1  10
        6    2   0   1   1   0  72   2   0   5   0
        7    1   0   0   0   0   4  95   0   4   0
        8    0   0   1   1   0   0   0 109   0   5
        9    1   2   4   3   0   0   1   0  81   0
        10   0   0   1   1   0   0   0   1   0  75

This is an accuracy of .914. For only using around 7% of the data, and a relatively small NN, this is decent.


Stan was used to estimate relatively small neural nets on the Iris and MNIST datasets. Despite Stan not being built for it, it can technically do it. I am not saying you should do it, but as a heavy Stan user, and a moderate neural net user, it was nice to see that Stan didn’t collapse in on itself like a dying star when trying to fit one.

In practice, if you really wanted to use Stan for optimizing NNs, I would suggest doing so at the C++ level, and implementing a gradient descent approach with Stan’s autodiff. The optimizing algorithm used in Stan is a Quasi-Newton method – L-BFGS. A great optimizer, no doubt, but not well-suited for extremely high-dimensional models with irregular loss surfaces. Gradient descent approaches, by contrast, are fairly cheap, and probably easier to parallelize. Primarily though, you really need GPU acceleration to make quick work of NNs; here’s to looking forward to more GPU compute from the Stan team.

For sampling neural nets, I wouldn’t recommend it. The Stan team wouldn’t recommend it either. It’s just an intractable posterior. On the Iris dataset, I was lucky to have acceptable $\hat R$ values, even for the NN parameters. The ordered bias constraint (and priors) should not be enough to identify the NN parameters. Nevertheless, I did shockingly obtain convergence across the whole set of parameters.

Fitting a NN regression

Let’s first generate data from a somewhat ugly function. We’ll only work with a univariate function for now, because visualization is simple.

N <- 1000
x <- rnorm(1000)
sigma <- .5

y_normal_f <- function(x) {2 * x^2 - 2*x + 3*cos(3*x) - sqrt(abs(x) / 10) + .2 * (x - 3)^2 - 10}
y <- y_normal_f(x) + rnorm(N,0, sigma)

When plotted, this data is:

No doubt that a GP could make quick work of that. But today is neural net day, so let’s fit a NN regression.

Let’s split our data again into train and test sets. Then we’ll optimize a 2-layer, 10-neuron NN regression model.

x <- matrix(x, N, 1)

N_test <- 200 
test_indices <- sample(1:nrow(x), N_test)
x_train <- x[-test_indices,, drop = FALSE]
y_train <- y[-test_indices]
x_test <- x[test_indices,, drop = FALSE]
y_test <- y[test_indices]

fit_opt <- fit_nn_reg(x_train, y_train, x_test, y_test, 2, 10, method = "optimize")

The mean squared error was .275, and the predictions look spot on.

This plot contains the true function (in purple), and the test points. The blue line represents the (modal) predicted values, and the dashed lines are just the prediction $\pm 1.96 \sigma$. Therefore, the dashed lines are modal prediction intervals.

plot(x_test, y_test, xlab = "X", ylab = "Y")
curve(y_normal_f, col = "purple", add = TRUE)
lines(sort(x_test), fit_opt$y_test_pred[order(x_test)], col = "blue")
lines(sort(x_test), fit_opt$y_test_pred[order(x_test)] - fit_opt$sigma*1.96, lty = "dashed")
lines(sort(x_test), fit_opt$y_test_pred[order(x_test)] + fit_opt$sigma*1.96, lty = "dashed")
legend("topright", legend = c("True", "NN"), lty = "solid", col = c("purple", "blue"))

Optimizing the NN therefore did a decent job, and only took around 4 seconds. It also recovered the $\sigma$ parameter well; the true value was .5, the estimated value was .51.

Now let’s use NUTS/MCMC to sample the posterior:

fit_nuts <- fit_nn_reg(x_train, y_train, x_test, y_test, 2, 10, method = "sampling", cores = 4, iter = 300)

Of no surprise, sampling takes longer at 7 minutes, for only a total of 600 posterior samples.

Let’s plot the predictions now. Note that we do not have modal estimates, but expectations. And by the magic of probabilistic inference, we have distributions across parameters and predictions. Therefore, the uncertainty expressed in the predictive plot includes uncertainty in the neural net parameters and in the realizations. The dotted line represents the 95% predictive interval — This includes uncertainty in the NN parameters, the $\sigma$ parameter, and the realizations. The dashed line represents the 95% fitted interval — This only includes uncertainty in the function itself (i.e., the NN parameters).

Again — NUTS did well here. There is not much uncertainty in the expected value, which is why the predictive intervals look so similar to those of the optimized solution.

The $\hat R$ values for the predicted quantities are all acceptable (i.e., $\approx 1.0$). This is not unexpected from unidentified models. In mixtures or other unidentifiable models, the aliased or otherwise unidentified parameters can have terrible convergence, but functions of those parameters can be identified. E.g., a model with $a = b + c$ may be unidentified because the values of b and c can be interchanged without affecting a: 10 = 5 + 5, 4 + 6, -10 + 20, etc. Without prior information, b and c are therefore unidentified. However, the MCMC samples of a can nevertheless be convergent.

That is the case here. Even if the NN parameters are not identified, the resulting predictions remain the same (in a sense, this is why it is unidentified). Therefore, the model converged with respect to the NN predictions.


I wrote out a quick Bayesian neural network model for both regression and classification. Despite Stan not being built for such models, it performed well! Optimization was fairly quick, and NUTS converged with respect to the predictions.

Would I use Stan for NNs?

No. No I wouldn’t. This was a fun, toy exercise for an afternoon, just to see if it could be done. I’ve had good luck with mxnet and JAX for actually estimating NNs; and these frameworks are general enough that you can still add priors (penalties) on the parameters if you want to (look at the "loss" function as an unnormalized negative log posterior density for why).

What would I use in Stan instead?

Gaussian processes, where data permit. Approximate Gaussian processes (as I did in omegad) are fairly quick. GAMs would also work well in these examples. Stan really shines in generative modeling contexts, to build custom models for processes of interest; it lets you build things like multivariate mixed effects location scale models with latent variables, interactions, non-linearities, etc. It lets you estimate models with structures and assumptions that you simply can’t do otherwise.

But Stan is really not meant for big data, nor for models with inherently bad posteriors like neural nets. But as a testament to the robustness of Stan, it still estimated a neural net, despite not being built for it. And that’s cool.


I just realized I forgot to include the fit_reg_nn function:

sm_reg <- stan_model("nn_reg.stan")

fit_nn_reg <- function(x_train, y_train, x_test, y_test, H, n_H, method = "optimize", ...) {
    stan_data <- list(
        N = nrow(x_train),
        P = ncol(x_train),
        x = x_train,
        y = y_train,
        H = H,
        n_H = n_H,
        N_test = length(y_test),
        y_test = y_test
    if(method == "optimize") {
        optOut <- optimizing(sm_reg, data = stan_data)
        test_char <- paste0("output_test[", 1:stan_data$N_test, "]")
        y_test_pred <- optOut$par[test_char]
        mse <- mean((y_test_pred - y_test)^2)
        out <- list(y_test_pred = y_test_pred,
                    sigma = optOut$par["sigma"],
                    mse  = mse,
                    fit = optOut)
    } else {
        if(method == "sampling") {
            out <- sampling(sm_reg, data = stan_data, ...)
        } else if (method == "vb") {
            out <- vb(sm_reg, data = stan_data, pars = c("output_test", "sigma", "output_test_rng"), ...)
        y_test_pred <- summary(out, pars = "output_test")$summary
        sigma <- summary(out, pars = "sigma")$summary
        out <- list(y_test_pred = y_test_pred,
                    sigma = sigma,
                    fit = out)

Is the LKJ(1) prior uniform? “Yes”

This post is about how the LKJ(1) prior is indeed "uniform", but perhaps not in the way people expect it to be. To be clear – This result is not a new one, nor is this post novel, because this is a well-known result. However, I was long confused by it, and I see others confused by it as well.

The LKJ prior is a spherical prior on correlation matrices. I emphasize matrices because it is not a marginal prior on individual elements. It is commonly used in Stan, in part due to performance reasons, and in part because specifying priors on correlations separately from variances is much easier. This latter point is especially useful when estimating multivariate scale models, wherein variances can vary, but we may want correlations to remain constant or vary according to a different model.

The LKJ prior takes one parameter, $\eta \in \mathbb{R}^+$. As $\eta$ increases, more mass is placed over identity matrices (i.e., a diagonal of one; little to no correlation). When $\eta = 1$, the LKJ prior is a uniform prior over correlation matrices. Because people (myself included) are, in general, poor at visualizing and intuitively understanding multidimensional space beyond three dimensions — One may think that the elements of the correlation matrices should therefore be uniformly distributed.

However, when you sample from the LKJ(1) prior with $K > 2$ variables (i.e., more than one correlation, a $K>2 \times K>2$ matrix), and look at the marginal distributions of the elements, they are not distributed uniformly. What gives?

The answer lies in the constraints of the correlation matrix. Correlation matrices must be symmetric, and positive semi-definite (PSD). That PSD constraint alters where probability mass can exist, and where it will accumulate marginally. When you only have $K=2$ variables, this is not an issue. But when you have $K>2$ variables, then one correlation’s value constrains what other correlations can be, if the psd constraint is to be met. I.e., you cannot create a $K\times K$ correlation matrix, and fill the off-diagonals with uniform(-1,1) values and expect it to be PSD. As K increases, the chances of creating a correlation matrix from uniformly distributed elements that is also PSD rapidly increases near-zero.

The LKJ(1) prior can therefore be thought of as a "uniform prior on permissible correlation matrices"; that is, it is uniform subject to the PSD constraint.

To build that intuition, I randomly generated 50,000, $K = 4$, "correlation matrices", wherein each unique off-diagonal element is sampled from a uniform(-1,1) distribution. For each matrix, I checked whether the PSD constraint was met. I then resampled each invalid matrix until the PSD constraint was met[^1].

I then plotted the marginal distribution of $\rho_{1,2}$. For comparison, I also sampled from LKJ(1) using rethinking::rlkjcorr. The plot below shows the two overlaid. In red, I show the $\rho_{1,2}$ from the resampling method. In blue, I show the $\rho_{1,2}$ from the LKJ(1) distribution.

Marginal distribution

The two are effectively identical (within sampling error). As K increases, the PSD constraint will cause the marginal distributions to be increasingly concentrated over zero. It is hard to say that it is "regularizing more" as dimensionality increases, but rather that in large matrices, the space in which the PSD constraint is met, is considerably smaller than when dimensionality is low.

You may want to know what the marginal priors for each correlation are, given a uniform LKJ and K variables. A useful result is provided, as always, by Ben Goodrich. He stated:

In general, when there are K variables, then the marginal distribution of a single correlation is Beta on the (-1,1) interval with both shape parameters equal to K / 2.

We can examine this nifty result. I took the PSD-enforced samples of $\rho_{1,2}$, and transformed them to be in the [0, 1] domain ($\tilde\rho_{1,2}^s = .5\rho_{1,2}^s + .5$, for each $s \in S$ samples). I then plotted the $\text{beta}(x| 4/2, 4/2)$ density on top of the normalized histogram. This is the result:

Marginal Transformed with Beta overlay

Indeed, the marginal distribution matches perfectly with the $Beta(\frac{K}{2}, \frac{K}{2})$ distribution.

To recap: I generated the elements of a correlation matrix from a uniform(-1, 1) distribution. I then enforced an PSD constraint by resampling the matrix until PSD was met. The marginal distributions of the correlations are then equivalent to those from the LKJ(1) distribution. In other words: The LKJ(1) distribution is indeed a "uniform" prior over matrices, subject to the constraint that the matrix must be positive semi-definite. This constraint disallows the marginals from being uniformly distributed.

The LKJ prior is not the only possible prior for correlations. The Wishart can be used, provided that you divide out the diagonals. If you want to better control the marginal priors of the elements, a better option may be the matrix-F distribution — You can get effectively uniform marginal priors for the correlations using this, but it is not standard (not too hard to implement though).

R Code

rlkj_naive <- function(N, k, enforce_spd = TRUE, as_samples = TRUE) {
    n_cors <- k*(k - 1)/2
    out <- list()
    gen_unif_matrix <- function(n_cors, k) {
        cors <- runif(n_cors, -1, 1)
        mat <- diag(1, k, k)
        mat[lower.tri(mat)] <- cors
        mat <- as.matrix(Matrix::forceSymmetric(mat, uplo = "L"))
    for(n in 1:N) {
        out[[n]] <- gen_unif_matrix(n_cors, k)

    if(enforce_spd) {
        is_not_spd <- !sapply(out, matrixcalc::is.positive.semi.definite)
        if(any(is_not_spd)) {
            warning("PSD not met; rejection sampling. ", mean(is_not_spd), " of the mats are not PSD.")
        while(any(is_not_spd)) {
            n_not_spd <- sum(is_not_spd)
            out[is_not_spd] <- replicate(n_not_spd, gen_unif_matrix(n_cors, k), simplify = FALSE)
            is_not_spd <- !sapply(out, matrixcalc::is.positive.semi.definite)
    if(as_samples) {
        out <- t(sapply(out, function(x){x[lower.tri(x)]}))

cors <- rlkj_naive(50000, 4)
cors_better <- rethinking::rlkjcorr(50000, 4, 1)

hist(cors[,1], col = rgb(1,0,0,.5), xlab = expression(rho[12]), freq = FALSE, main = "Marginal")
hist(cors_better[,2,1], add = TRUE, col = rgb(0,0,1,.5), freq = FALSE)

cor12_trans <- (cors[,1] / 2) + .5
hist(cor12_trans, freq = FALSE, xlab = expression(tilde(rho)[12]), main = "Marginal (transformed)")
curve(dbeta(x, 4/2, 4/2), 0, 1, add = TRUE)

Matrix F distribution (For those who are curious)

$E(V) = \nu/(\delta – 2)*B$

$\nu, \delta$ control the shape. B scales it. For correlations, you can sample from matrix F, and divide out the diagonal to produce a correlation matrix. If $\nu = K$, where K is the cov dimension, and $\delta = 3$ (the minimum value to have a defined mean), then you have a fairly uniform prior on the correlations.

  real matrix_f_lpdf(matrix cov, real nu, real delta){
    int k = cols(cov);
    return(lmgamma(k, (nu + delta + k - 1)/2) - (lmgamma(k, nu/2) + lmgamma(k, (delta + k - 1)/2)) + log_determinant(cov)*((nu -k - 1)/2) - (nu + delta + k - 1)/2 * log_determinant(cov + diag_matrix(rep_vector(1, k))));
  real matrix_f_fast_lpdf(matrix cov, real nu, real delta){
    int k = cols(cov);
    real log_det_cov = 2*sum(log(diagonal(cholesky_decompose(cov))));
    real I_Sig_log_det = 2*sum(log(diagonal(cholesky_decompose(diag_matrix(rep_vector(1,k)) + cov))));
    return(log_det_cov*(nu - k - 1)/2 - I_Sig_log_det*(nu + delta + k - 1)/2);
    // return(log_determinant(cov)*(nu - k - 1)/2 + log_determinant(cov + diag_matrix(rep_vector(1,k)))*-(nu + delta + k - 1)/2);
  real f_lpdf(real v, real nu, real delta){
    return((nu/2 - 1)*log(v) - (nu + delta)/2*log(1 + v));
  real matrix_f_fast_cholesky_lpdf(matrix L, real nu, real delta){
    int k = cols(L);
    vector[k] L_diag = diagonal(L);
    real log_det_cov = 2*sum(log(L_diag));
    real log_jac = k*log(2);
    real I_Sig_log_det;
    for(i in 1:k){
      log_jac += (k - i + 1)*log(L_diag[i]);
    // I_Sig_log_det = log_determinant(diag_matrix(rep_vector(1,k)) + multiply_lower_tri_self_transpose(L));
    I_Sig_log_det = 2*sum(log(diagonal(cholesky_decompose(diag_matrix(rep_vector(1,k)) + multiply_lower_tri_self_transpose(L)))));
    return(log_jac + log_det_cov*((nu - k - 1)/2) - ((nu + delta + k - 1)/2)*I_Sig_log_det);
  real matrix_f_B_fast_cholesky_lpdf(matrix L, real nu, real delta,real B){
    int k = cols(L);
    vector[k] L_diag = diagonal(L);
    real log_det_cov = 2*sum(log(L_diag));
    real log_jac = k*log(2);
    matrix[k,k] BmatInv = diag_matrix(rep_vector(1/B,k));
    real I_Sig_log_det;
    for(i in 1:k){
      log_jac += (k - i + 1)*log(L_diag[i]);
    I_Sig_log_det = 2*sum(log(diagonal(cholesky_decompose(diag_matrix(rep_vector(1,k)) + multiply_lower_tri_self_transpose(L)*BmatInv))));
    return(log_jac + log_det_cov*((nu - k - 1)/2) - ((nu + delta + k - 1)/2)*I_Sig_log_det);


[^1]: This took a long time. It is highly improbable to find an PSD matrix, even when the matrix is a measly 4×4, when constructing a symmetric matrix with uniformly distributed elements. For reference, of the 50,000 initial matrices, 81.58% of them were not PSD.

Is my lamp broken?

In this post, I want to talk about the difficulties of probabilistic inference and generalizability of claims.

The recent twitter debates about Tal’s paper The Generalizability Crisis have rendered me befuddled. Personally, I find his arguments to be obvious, conditional on reading Meehl and giving more than 5 minutes of thought toward inference and statistical testing. But, apparently, it is not obvious, and some people strongly disagree. Tal further clarifies matters in a well-written response here.

Tal’s central point, in my view, can be boiled down to the following gist:

  • Verbal hypotheses should map onto statistical hypotheses, if statistical hypothesis testing is to be relevant to supporting or rejecting the verbal hypothesis.
  • The generalizability of the supported or rejected hypotheses is bounded by the statistical model/test employed.
  • Psychologists do not tend to map verbal hypotheses onto statistical hypotheses, but nevertheless use statistical hypothesis tests to support verbal hypotheses.
  • Psychologists tend to invoke a "Weak" form of inference to support broad claims. These broad claims are supported using, literally, a weak form of inference (a confirmationist framework), and reduced statistical models that can not, and are not intended to, support claims broader than permitted by the variables entered into it, which are in turn generated from a specific sampling and collection scheme. I.e., the statistical models, used to assess a verbal hypothesis, operate on assumptions and are bound by the variables collected, which do not cleanly translate to broader hypothetical statements. More simply, a 2-group randomized experiment using measure X to measure construct F may imply that one group’s mean F score is greater than another, but that is where statistical generalization ends. Seeing such a quantitative difference, even if definitive, does not imply one’s verbal hypothesis is generally supported (e.g., across populations, across other measures of construct F, across contexts, across time), or is uniquely supported (that other hypotheses would not have made the same prediction).

Tal goes onto to offer some solutions. One solution is to include hypothetically exchangeable components into the model itself. If the verbal hypothesis makes a claim that should be evident regardless of the precise measure, stimuli, time, context, population, or subject, then the hypothesis is implicitly claiming that these variables are exchangeable in some way. Therefore, the statistical model for the hypothetical process should also treat these as exchangeable, and a random effects model could be employed.

Of course, knowing all possible sources of exchangeable variation is hard, if not impossible. Collecting data on and actually varying these exchangeable variables even more so. So a second, very reasonable, solution is to stop making claims broader than the statistical model permits. Instead of saying "Receiving an anonymous gift makes people grateful", one could instead say "Receiving an anonymous gift [in the form of digital raffle tickets for a $50 Amazon gift card on a computerized task from a fake stranger] makes [undergraduates at our Southern US University] [score higher on gratitude measure X, assuming the measurement model is specified as it was] [in a controlled double-blind lab study]."

One may say, "That is not ideal", and I would say "Tough, that’s exactly what you can gain from your study." If you wish to claim support for a hypothesis more generalizable than that, then you need to generalize your design to make the nuances exchangeable.

One may also say, "The verbal hypothesis does not need to be the same as the statistical hypothesis", and "The verbal hypothesis can be broad, but still supported by specific studies, or refuted by other specific studies." To the former, I disagree, for reasons I explain in this post. To the latter, I agree, but I rarely see this in psychology; I, more often than not, see a broad claim, described as "confirmed" by a weak statistical test, with some lip service paid in a limitations section.

With all of this said, I really want to dive into the "Weak inference" part, and the notion that the verbal hypothesis needs to match the statistical hypothesis.

Is my lamp broken?

Tal’s paper raised at least two major points of contention from critics. The first, is whether a verbal hypothesis must map onto a statistical hypothesis, in order for the statistical test to mean anything for the verbal hypothesis. The second, is in the use of the term "induction" when describing statistical inference. I am mainly going to focus on the former, with some sprinkling of the latter.

Tal (and Meehl, and I) argued that a statistical test can inform the credence of a verbal hypothesis to the extent that the statistical test maps injectively (one-to-one, uniquely) onto the verbal hypothesis. I.e., if many verbal hypotheses can predict the same test result, then obtaining that test result tells you little about whether your verbal hypothesis is true. Likewise, if your verbal hypothesis implies nearly an infinite number of possible statistical outcomes, then nearly an infinite number of possible statistical outcomes would support your claim, and therefore it is no test at all. In even simpler terms, if verbal hypotheses A, B, and C all imply a quantity X, and you see X, it is hard to say A is ‘right’; conversely, if A implies any X other than zero, then you had no real statistical prediction at all, and any of the infinite possible outcomes would have supported A (non-falsifiable).

Regardless of your position on Popper, Lakatos, deduction, induction, abduction, etc, the above holds. This problem is not unique to statistical inference; statistical inference merely adds a link to the inferential chain. It does not matter what your particular brand of inference is — If your hypothesis does not imply outcomes unique from other hypotheses, or if any outcome could be explained from one’s hypothesis, it is a poor hypothesis. This would break deduction, induction, and abduction alike; these all require meaningful, unique predictions.

However, because critics of the paper have seemingly latched onto Tal’s use of "induction" as an argument, I am going to discuss this problem with respect to deductive logic. Again, it does not matter whether one clings to "inductive" or "deductive" inference, and arguing about the paper’s use of the word "induction" is a red herring.

Let’s start with a really simple case: Is my lamp broken? I have a home office. It’s simple: A nice desktop machine that I built, a window, and a lamp. Let’s assume my blinds are closed, so there’s some ambient light, but not much. My desktop is off, so the monitor isn’t producing any light. I want to know if my lamp is broken.

Modus Tollens

The Modus Tollens framework is as follows:

If A, then B. [This must necessarily be true, for the rest to follow].

Not B.

Therefore, not A.

For my trusty lamp then,

If my lamp is broken, then the room should be dark.

The room is not dark.

Therefore, my lamp is not broken.

This logic makes perfect sense, assuming these postulates are indeed true. It must be true, that if my lamp is broken, then the room must be dark. Then if I see that the room is not dark, I can safely conclude that my lamp is not broken. Simple. Elegant.

And also, impractically naive for how science works.


Let’s start by asking, "What does it mean for the room to be dark?"

For the sake of argument, let’s say I have a light sensor. It detects, on a scale of 0-100, how much light there is. Let’s also assume it is perfectly reliable.

I will define "dark" loosely as "notably darker than it would be, if my lamp were on".

Now, I measure the light in my room at 4pm on a Tuesday in May, with my lamp off. It reads 10. Then I turn my lamp on, and it reads 20. So I can operationalize "dark" as the measure of light at 4pm on a Tuesday in May when my lamp is off. When my lamp is on, it’s at 20; when it’s not, the room is dark with a light reading of 10.

Let’s continue the Modus Tollens then:

If my lamp is broken, then the room should be dark. [Verbal hypothesis]

If the room is dark (and if I operationalize dark as the reading observed when the lamp is off at 4pm on a May Tuesday), then my measure should be 10. [Statistical hypothesis]

My measure is 20. The room is not dark.

Therefore, my lamp is not broken.

Notice how the hypotheses chain:

If A, then B. [A verbal hypothesis, with a necessary prediction]

If B, then X. [The prediction, mapped onto a statistical prediction]

Not X, therefore not B. [The statistical prediction is untrue]

Not B, therefore not A. [Therefore, the verbal prediction is untrue].

So far so good. We have a good enough hypothesis to generate a prediction; that prediction can be mapped onto a quantitative prediction. If that quantitative prediction is falsified, then the verbal prediction is invalidated, and the hypothesis is therefore falsified. Simple. Elegant.

And also impractically naive for how science works.

"Probabilistic" Modus Tollens

We do not have perfect measures… so now we have to enter the realm of probability. My light sensor is not perfectly reliable. Each time I turn on the light sensor, it produces a slightly different answer. There’s some error to it. But the means are consistent, and the distribution of observations is fairly small when the lamp is on and off. But we do have to revise our inferential chain.

If my lamp is broken, then the room should be dark.

If the room is dark (and if [operationalization, model assumptions]), then the true light measure should be approximately 10.

If the measure must be approximately 10, then my observation (21.5) would be highly unlikely.

Therefore, the measure is unlikely to be 10; the room is unlikely to be dark; and the lamp is unlikely to be broken.

Oof. That makes our statement much more tentative. It is also questionable whether Modus Tollens really makes sense once converted into a "probabilistic" modus tollens. Modus tollens is a logical argument; logical systems aren’t necessarily coherent once uncertainty is introduced into the statements. For one thing, it’s inverting the conditional probability. p(observation | 10) is not the same as p(10 | observation). For another, our quantitative prediction should be somewhat uncertain too… "if the room is dark, then my measure should be distributed N(mu[0], .1); mu[0] ~ N(10, .01)".

In order to have a ‘proper’ probabilistic modus tollens, we need to write the modus tollens as a probability function. Bear with me here. This part won’t be heavily mathy. The modus tollens argument can be written as a probability function, where some things have probability 1, and other things have probability 0.

For example, the "If A, then B; Not B; therefore, not A" can be understood through the following probability statements. First, let’s define the probability of "A", given "Not B": $$ p(A|\lnot B) = p(\lnot B | A)p(A) / P(\lnot B) $$ This looks difficult to compute, but under the standard modus tollens, it is simplified greatly. When there is no uncertainty in the logical statement, such that B must occur if A is true, then $p(B | A) = 1; p(\lnot B|A) = 0$. The whole formula collapses: $$ p(A|\lnot B) = 0 * p(A) / p(\lnot B) = 0 $$ That is, assuming that B must occur given A, then the probability of "A" given "not B" is 0. Therefore, "Not A". Voila! Modus tollens expressed via probability.

The key point of this, is that your proposition must be true. "A" must necessarily imply "B". When "A" happens, ‘B" must happen. If that does not happen, then modus tollens does not easily hold.

For example, if "B" only occurs with some probability, given "A", it becomes more complicated. "If A, then probably B [let’s say, .95]; not B, therefore …" looks like this: $$ p(A | \lnot B) = .05 \times p(A) / p(\lnot B) $$ Once you enter probability space, you have to start justifying some probabilities. It is no longer straight forward.

So… let’s return once again to my possibly broken lamp. This is going to get complicated. If probability statements overwhelm you, then your take-away from this section should simply be: "Inference under uncertainty is extremely hard, and not as straightforward as Modus Tollens would suggest." It is fine if you do not understand this portion; that is largely the point of this section! With uncertainty (in data, in parameters, in statements), inference becomes much more complicated.

Let’s express the probabilistic modus tollens of the lamp problem. Letters in parentheses represent conditions to include in the probability algebra.

If my lamp is broken (A), the room must be dark (B). [A necessary implication]

If the room is dark (B) (and if [operationalization, model assumptions]), then the true light measure (C) should be approximately 10. [A probabilistic statement]

If the true light quantity should be around 10 (C), then my observation (D) would be highly unlikely [A probabilistic statement]

So what is the answer? I want to know whether my lamp is broken. Unfortunately, there is uncertainty in my predicted value (due to measurement error in my tool). Argh! What I do know, is "D"; I have an observation. $$ \begin{align} p(A | D) &= p(D | A)p(A) / p(D) \\ p(D | A) &= \int p(D | C) p(C | B) p(B| A) \end{align} $$ We need to know a number of things. What is the probability of the true light measure, given that it is dark: $p(C | B)$. What is the probability of my observation, given the true measure in darkness: $p(D | C)$. What is the a-priori probability of my lamp being broken: p(A) [e.g., maybe it’s broken many times in the past; maybe it is 30 years old; maybe this brand has a known defect; maybe it is currently sizzling and smoking].

This is.. difficult, obviously. If I previously, repeatedly measured the ambient light when the room is ‘dark’, then I can form a distribution for $p(C|B)$. If we take for granted that the room is always considered dark when the lamp is broken, then $p(B|A) = 1$. If we know the distribution of measurement error for our light sensor, then we can specify the probability of the observed value given the true ambient light value.

This is all possible, for sure. But the point is to highlight: Modus Tollens is much more difficult when the logical statements become probabilistic statements. It is unclear whether this is even considered deductive, since it includes uncertainty instead of merely deterministic statements. Given that it is no longer defined by logical, binary statements, but rather probabilistic ones, it appears more similar to induction than deduction, despite having a deductive bent. It is not straightforward, and I have never, in my life, seen a paper in psychology actually reason through, then compute something like this.

Random variation

Moreover, this is still a simplified example. We have thus far only talked about my measurement error. There are all sorts of external sources of variation that insert themselves into the system. What if I randomly sampled the light sensor across the year to collect data on baseline light levels when the lamp is on and off? Then there are new conditions, new sources of uncertainty.

Recall that I previously collected light data when the lamp is on and off, in order to operationalize and quantify what I consider "dark" and "not dark", on a May Tuesday at 4pm. My statistical inference is based on whether my current light data is improbably larger than would be expected assuming the light measure is under the operationalized ‘dark’ regime. But across multiple data collections, there can be several random sources of light variation that are irrelevant to my question, but nevertheless affect my data collection process.

If I collected light data across multiple rooms, multiple days, multiple seasons, etc, then there are all kinds of effects on my light data. Whether my computer monitor is on matters. Whether it’s 6pm in the summer, or 6pm in the winter matters. Whether I blocked my lamp with a giant coffee mug matters. Where I placed the light sensor matters. These are all stochastic, and these all change the distributions of uncertainty. They all change what the expected amount of light would be, what I would consider ‘dark’, and how much error there is in my measurement itself.

Importantly, I would have to change my model to account for these things if I wanted to claim my lamp is, indeed, not currently broken at any point in time. I would need to estimate and control for the effects of these stochastic influences, if I wanted to know whether, in a given moment, my lamp is not broken. If I wanted to know whether my lamp is broken, and I broadly operationalize the verbal prediction "the room is dark" as "the amount of light present when the lamp is off [no matter what time of day, whether the monitor is on, the weather, the sensor distance, etc]", then I would need to model and control for all the sources of light variation that are irrelevant to that broader prediction.

What happens if I do not account for these things? Well, it really depends on how you collected your data. If you collect data in a very controlled environment first (e.g., holding sensor location constant, collecting repeatedly in a short time frame, turning the lamp on and off to obtain a distribution of samples from each condition), then you may not need to include other sources of variation. However, any inference about your lamp is now conditioned on that exact circumstance. For you to say "my lamp is not broken", it is necessary to say "my lamp is not broken, assuming this exact sampling scheme, my model assumptions are met, no other sources of variation are present, and my context is the same as when my data were collected on that sunny May Tuesday afternoon, on which I operationalized what ‘dark’ even means". A harder sell, for sure, but it’s also more honest.

If you collect data across the year, while recording all the different sources of variation you know about, then the story changes. You want to know if your lamp is broken, given all this data and known sources of random influences. You obtain your single observation from the light detector, and throw it into a model that does not account for any of these variations. Well, obviously, whether your inference is correct is a crapshoot. You’re comparing it to [uncertain] values averaged across a whole host of random influences. Because you are not accounting for known random influences, your comparison may be completely wrong, and lead to the wrong conclusion altogether (inflated error rates). Moreover, because you are excluding the uncertainty of these effects, you are more confidently wrong too. You may say "my lamp is not broken", even though it is, just because you’re measuring at 10am on a sunny day with the blinds open, and failed to account for those effects.

If you wanted to test the broader prediction that the room is dark, broadly operationalized as "the amount of light present when the lamp is off or broken", then you would need to broaden your model as well, to obtain a predicted light level after controlling for the various random influences that bear no importance to the lamp itself. To reiterate, the role of the expanded model is to allow a generalized verbal prediction to map onto a generalized statistical prediction — One that is valid across various hypothetically unimportant nuances. If you wanted to test the prediction that the room is dark, and strictly operationalize "dark" as the ambient light present when the lamp is off on Tuesday, May 12, at 4pm in Davis, California; then you don’t need to control for other various things. Of course, that also means your hypothetical claim of support is limited to that specific measure, at that specific place. So, you don’t get to claim that your "lamp is not broken, because the room is not dark", but rather your "lamp is not broken because the room is not dark, assuming we define darkness as the amount of ambient light in this specific place, with this specific measure taken at this time."

This is analogous to what Tal is talking about. For one, we need our verbal hypotheses to imply something statistically precise (an injective mapping between the two), if we want a statistical inference to correspond to a verbal inference. Then for generalizable inferences to be accurately stated, we either need to model the hypothetically exchangeable conditions (e.g., via random effects models), or we need to limit our inferential claims to the boundaries of our conditions, operationalization, and design. In other words, we can either vary, model, and account for the conditions over which we wish to broadly generalize; or we can clearly state the conditions assumed and given, and constrain our expressed hypothetical support appropriately. We can either integrate out known random effects, or we cannot, and instead keep our claims conditional on those effects.

What happens in Psychology?

The lamp example is not perfect, but it’s a seemingly simple inferential problem that can complicated by introducing common research problems into the mix. Psychology, in reality, has a much harder problem. At least with the lamp, we can directly see and toggle the light. Psychology, on the other hand, largely deals with intangible, unobservable constructs in complex, ever-changing organisms. The uncertainty only compounds.

Complications aside, Psychology does not seemingly use any acceptable form of inference. Consider the following normative inferential chain, void of any probability statements:

If [verbal hypothesis], then [verbal prediction].

If [verbal prediction], then [non-zero quantity].

Non-zero quantity, therefore [verbal prediction].

[verbal prediction], therefore [verbal hypothesis].

This is problematic:

  1. This is not modus tollens, nor valid deductive logic, despite critics seemingly arguing that it is. It simply isn’t. This is 100% an invalid statement. It breaks down because the statistical hypothesis is committing a logical fallacy. You cannot falsify by confirming, and yet this is what this statistical hypothesis is doing. It therefore breaks the inferential chain, rendering it fallacious. Yes, this implies that normative NHST is not deductive.
  2. There is no unique mapping between the verbal and statistical hypothesis whatsoever. There are likely to be any number of alternative hypotheses for why the quantity is not exactly zero. Whether it’s Meehl’s crud factor, or differences in stimuli, or experimenter effects, or whatever else, this verbal hypothesis effectively predicts nothing, by predicting anything. Because any number of things can also predict a non-zero quantity, we therefore have extremely weak support for our hypothesis – In actuality, it merely rejects any hypothesis that would predict zero, and retain all other possible hypotheses.

Unfortunately, this inferential chain is commonly used, and nevertheless wrong. You cannot affirm the consequent. More obviously, rejecting the statistical hypothesis implied by a completely different verbal hypothesis, does nothing to support your verbal hypothesis that makes a completely different statistical prediction, if any prediction was made at all.

Ok, so let’s change our testing behavior to something that is not entirely fallacious.

If [verbal hypothesis], then [verbal prediction].

If [verbal prediction], then [non-zero quantity].

Not [non-zero quantity], therefore not [verbal prediction].

Not [verbal prediction], therefore [not verbal hypothesis].

Now we have ourselves a modus-tollens. How did we do it? By completely reversing the role of null hypothesis testing. We do not test a prediction that we did not posit (a zero). Instead, we test the prediction that we did posit (non-zero).

Note, that a normative NHST is not necessarily fallacious, it is the way in which we use it that makes it fallacious. A valid modus tollens for a nil-null hypothesis would be:

If [verbal hypothesis], then [verbal prediction].

If [verbal prediction], then [zero-quantity].

Not [zero-quantity], therefore not [verbal prediction].

Not [verbal prediction], therefore not [verbal hypothesis].

But this would imply our hypothesis predicts a zero. If we observe no zero quantity, then our hypothesis is falsified. It says nothing of any other hypotheses, only that whatever hypothesis implies a value of zero, is to be rejected. See how that is different from the fallacious chain above?

However, it still is not this simple. The above assumes no probabilistic statements. Once again, this will break down considerably once we do. Back to our non-zero inferential chain:

If [verbal hypothesis], then [verbal prediction].

If [verbal prediction], then probably [non-zero quantity].

Probably not a [non-zero quantity], therefore probably not [verbal prediction].

Probably not [verbal prediction], therefore probably not [verbal hypothesis].

This once again requires us to create a joint probability model. I will spare you this here, but the probability algebra is straightforward. Getting the distributions, however, may not be so straightforward.

Inferential chains must be thoroughly justified

Even if you can test in a logically consistent manner, from verbal down to the data, you will have to justify each of these steps. You cannot just state "If A, then B"; you have to either prove that to be the case, or justify why such an assumption is made. Modus Tollens depends on the validity of each logical consequent. Otherwise, you can make nonsense inferences, in a logically structured way. For example:

If unicorns are fictional, then my coffee mug would be full. (If A, then B)

My coffee mug is not full. (Not B)

Therefore, unicorns are not fictional. (Therefore not A)

This is, of course, ridiculous. Even though it does indeed follow modus tollens, the assumptions are completely wrong. For this to be valid, I would need to prove, or sufficiently argue, that if indeed unicorns are fictional, then my coffee mug would be full.

Psychology is not as ridiculous, but the requirements are exactly the same, from the verbal hypothesis to the statistical.

Therefore, every single link in the inferential chain, from the verbal hypothesis, to the statistical hypothesis, must be coherent, justified, or proved.

Inference is hard

Hypothesis testing therefore has some high requirements, that are rarely, if ever, met in psychology.

  1. The verbal prediction must be justified, and adequately linked to any generative verbal hypothesis.
  2. Any statistical prediction must be injectively mapped, and adequately linked to any generative verbal prediction, if a statistical inference is to inform the support for or against a hypothesis against any others.
  3. Any stated support for a verbal hypothesis is necessarily constrained — bound by the assumptions, explicit or implicit, of the model, design, and operationalization used to ultimately test it. Every assumption in the design and model is ultimately added as a logical conditional, or a conditional in the probability model. There are many to consider, and I barely scratched the surface here. In essence, every "If" statement in the inferential chain is added to a global list of "If" statements that must precede the claim of hypothetical support; one cannot simply state the hypothesis and whether it was supported. To reiterate, this includes all implicit and explicit assumptions in the statistical model (e.g., distributional shape, whether stimuli have an effect or not, whether a measure includes error or not), because these ultimately underlie how a statistical prediction is formulated, and what is treated as exchangeable.

In sum, hypothesis testing is hard. Inference is hard. Formulating valid, justifiable, precise predictions is hard. Once uncertainty is introduced to those predictions, it is made much harder. It’s not as straightforward as some seem to think. Probability makes logical argumentation explode into a cascading set of assumptions, constraints, and probability equations. Making precise, unique statistical predictions that map onto a verbal prediction is uniquely difficult in psychology, where "[measures are] made up, and points don’t matter". And unfortunately, I do not see an easy way out of it. We can continue on as normal, and make unjustifiably broad claims from incompatible models, using fallacious confirmationist arguments masquerating as falsificationism; or we can temper our claims, operate exploratively, and make careful predictions when we can.

Communication is hard

Lastly, I’d like to mention that communication about inference is difficult. Even a broad statement like "Bears are brown" can be taken two opposite ways from people. One person may take that to mean "Bears can be brown, but may not necessarily be brown in all places, across all species." I.e., one answer to the question "What is brown? — Bears are brown!". Another person may interpret it as "Bears are always brown, no matter what random place or species you choose," and take issue with such a broad claim. I.e., one answer to "What describes a bear? — Bears are brown!" Some may think in terms of "What outcome could my hypothesis predict?" Vs "What hypothesis could predict my outcome?"; and these people would interpret a broad statement differently.

Part of me thinks some of the debate surrounding Tal’s paper is actually a "Bears are brown" problem. If someone said "Receiving an unrequested gift makes people feel grateful," I take that to be a broad statement, applicable across measures, gifts, persons, etc; and I may criticize that paper to say "Only if you measure gratitude as you do, and only if the gift is valuable, and only if there is no expectation of reciprocation." Others may interpret as "Receiving an unrequested gift can make people feel grateful," and have no issue with it failing to generalize.

I think Tal and I view such broad hypotheses ("Bears are brown") as ones to be generalized ("Bears are always brown"); to support such a claim, they would then need to model and control for hypothetically exchangeable conditions across which such a broad claim should be valid; and then map a verbal prediction onto the quantity of interest.

Others may view such broad hypotheses ("Bears are brown") as non-generalized claims to be tested ("Bears can be brown; when are they not?"). This is fine too. However, I do not often see papers making claims so that others can test them — They usually seek to confirm generalized claims from specific operationalizations/conditions/contexts, using faulty inferential logic, then attempt to explain away (find moderators for) specific replication failures without modifying the broad claim. So really, the point is moot to me — Regardless of how you interpret broad claims, any serious attempts to generalize such statements or to refute them with logically coherent argumentation seems extraordinarily rare. Nevertheless, these two interpretations may contribute to the conflict.

LaTeX in Slack!

I just joined a new Slack workspace, and was growing weary of using the codecogs equation editor just to paste in an image of some latex math into Slack.
It took a bit of effort, but I thought I would document the process for anyone who is curious.

You need:
– Slack, obviously
– A Linux server, accessible from anywhere
– Familiarity with some command line

I have a digital ocean droplet hosting this webpage, running Ubuntu 16.04.

First, SSH into your server.

git clone
sudo apt-get install nodejs npm
cd mathslax

Second, you will probably need an updated version of node.

sudo npm install -g n
sudo n latest

Third, get mathslax ready.

make install
npm audit fix

Fourth, navigate to slack chat -> Workspace name -> Customize slack -> Apps -> Search for Slash Commands, install that integration.
Go to Custom Integrations -> Slash Commands -> Add Configuration
Enter some settings:

Command: /math
URL: http://yourServerName:yourConfiguredPortNumber/slashtypeset
Customize Name: LatexBot

yourServerName should be your webpage name; e.g., mine is

yourConfiguredPortNumber should be the port that the mathslax server will listen on. E.g., 7777

Write down or copy the Token that is generated for you.

Fifth, in your SSH session or server, create a simple script for launching the mathslax server

cd /usr/local/bin/mathslax
SERVER=yourServerName PORT=yourConfiguredPortNumber SLACK_AUTH_TOKEN="TheToken" node server.js

Save that to /usr/local/bin/, and mark as executable

sudo mv /usr/local/bin/
sudo chown root:root /usr/local/bin/
sudo chmod a+x /usr/local/bin/

Sixth, move mathslax to a location. I did not follow best practices; this should probably be in /opt, /usr/share, or some other directory, but whatever.

cd ~
sudo mv mathslax /usr/local/bin/
sudo chown -R root:root /usr/local/bin/mathslax

Finally, create a systemd services script, and place it in /etc/systemd/system/mathslax.service





sudo systemctl enable mathslax.service
sudo systemctl start mathslax.service

You may need to enable the port in your firewall. If using UFW, then it’s simply sudo ufw allow yourConfiguredPortNumber

In slack, you can then type:

/math \begin{align}y &\sim p(\mu, \sigma) // \mu &\sim N(0,1) \\ \sigma &\sim N^+ N(0,1)\end{align}

And get an image of those three aligned sampling statements.


Part two. The bot method.

On your server,

git clone
cd SlackLateX
npm install fs
npm install request
npm install websocket

Create a file /usr/local/bin/ with:


cd /usr/local/bin/SlackLateX
node ./bot.js

Create a file /etc/systemd/system/SlackLateX.service with:




Go to your slack chat workspace -> Apps -> Install Slack bots -> Add configuration:

Customize name: latexbot

Copy the API token, paste it into /usr/local/bin/SlackLateX/secret.txt

Then simply invite latexbot into the channel, and start the bot sudo systemctl start SlackLateX

Now you can type $\pi$ to get a rendering of the pi symbol (or any other latex equation). This method improves upon the first method because it’s channel-specific, and you can edit latex code, which will re-render.

A foray into Bayesian handling of missing data

I have been, uh, “blessed” by the data gods for most of my research data, in that I really rarely have non-planned missing data. Seriously.
Most of my research has involved surveys, lab experiments, or within-subject repeated measures, and for some reason, I just rarely had missing data.
The missing data was small enough to ignore (like, less than 1%).
Consequently, I’ve never really had a need to “handle” missing observations.

Sure, I’ve dealt with models wherein some unknowns are treated as missing data, like latent scores or groups in latent variable models, but that’s not really the same thing, now is it? Those were “known-unknowns”.
I’ve also had data where missingness is planned and ignorable, like a between-subjects manipulation of which response condition a repeated-measures task requires.

I use Stan or brms for nearly every analysis I do, and I have constructed some fairly complicated stan models.
Multilevel SEM with non-normal residual distributions and moderation? Check.
Assessing differential item functioning or measurement variance through item model competition? Check.
Simultaneously estimating the probability that some item is DIF across latent groups? Check.
Full information meta-analytic path models? You betcha.
And so much more (Seriously, Stan rocks).

But missing observations? Never dealt with it.
Inspired by an assignment for a course, I decided to dive in and see just how bad missing data handling is in Stan.

The model

The data has 6 columns: read, parents, iq, ses, absent, and treat, roughly corresponding to a reading score, number of parents (0 being 1, 1 being 2), IQ, socioeconomic status, number of absences, and whether the person was involved in the reading improvement treatment.
Simple enough.
The goal is to estimate the basic linear regression, read ~ parents + iq + ses + treat, which is of course very easy.
However, there’s fairly substantial missingness in read, iq, and ses.
One-third of the IQ scores are missing, 29% of SES is missing, and 14% of reading scores are missing.

So what do you do?
The two most common methods are multiple imputation and full information maximum likelihood.

I specify the model in lavaan.

lavMod <- '
read ~ 1 + parents + ses + iq + treat

By default, lavaan uses listwise deletion, which is not so much a way of “handling” missing data as it is ignoring it and hoping we have enough N to say anything at all; and that’s only if data are MCAR, which is basically an insane assumption.
In reality, we can expect data to be MAR, and listwise deletion will result in some bad estimates.

Easy enough to fix in lavaan; to use FIML, you just add missings='fiml' as an argument.

lavModOutFiml <- sem(lavMod,data=ds,missing='fiml')

All FIML really does, is change the estimation technique.
Instead of operating only on covariance matrices, the estimator maximizes a likelihood function that is at the observation-level, then I think it integrates out the missings.
This allows every observed variable to provide information to the model, and share information for missing variables.

Or we could use multiple imputation, which is fairly easy as well.

lavModOutMI <- runMI(lavMod,data=ds,m=20,miPackage='mice',fun='sem')

Multiple imputation generates $M$ datasets using, basically, gibbs sampling for the missings.
Then you estimate the model on each dataset and pool the estimates and compute the total standard errors.
In effect, this also integrates out the missings, and is essentially a less principled Bayesian method.

Easy enough, but onto Stan!

Bayesian Missing Data

The nice thing about Bayesian modeling, is that there is not really a clear line between parameters and mere “unknowns”.
Really, observations are known, and everything else is unknown.
The goal is to condition on those knowns to make probabilistic claims about the unknowns.

Usually when people talk about unknowns, they mean parameters, but that is needlessly restrictive.
Missing data are unknown, latent groups and states are unknown, latent scores are unknown, but none are “parameters” per se.

In order to “handle” missings, we merely need a model for them; then any posteriors of interest can be computed after marginalizing across it.
No need to scrap entire rows of data — Just model the missings with the observed quantities, condition on the known and unknown data, then marginalize.
In this way, missing data handling in Bayesian models is very natural. No external imputation needed; no fancy algorithm required. Missing data are merely part of the joint probability system.
Most realizations were observed with absolute certainty; some were not observed, but are informed by what is observed.

Take multiple regression as an example. Let X be the non-missing predictors, $\tilde{X}$ the missing predictors, $\sigma$ is the residual standard deviation, $\beta$ is the vector of regression coefficients, y is the outcome, $\mu$ is the vector of means and $\Sigma$ the covariance matrix for a multivariate normal distribution on the predictors.
p(\beta,\sigma,\mu,\Sigma|X,\tilde{X},y) &\propto p(y | X, \tilde{X},\beta,\sigma)p(\tilde{X}|X,\mu,\Sigma)p(\mu,\Sigma,\beta,\sigma) \\
p(\beta,\sigma,\mu,\Sigma|X,y) &\propto \int p(y | X, \tilde{X},\beta,\sigma)p(\tilde{X}|X,\mu,\Sigma)p(\mu,\Sigma,\beta,\sigma) d\tilde{X}
Essentially, we impose a multivariate normal distribution on the predictor variables, with unknown mean and covariance parameters.
The known predictors inform the mu and covariances, which in turn inform unknown scores.
Both the known and informed unknown scores predict y, and this in turn also informs the unknown scores (It’s a joint probability system, after all).
The goal is to obtain the marginal posterior of the parameters of interest, and to do so you must integrate over the unknowns, including unknown scores.

MCMC is there to help us approximate integrals and expectations.
Importantly though, MCMC samplers are essentially imputing the unknown data points just like multiple imputation, but the model also uses full information likelihoods to inform the model.
Bayesian handling of missing data therefore sits somewhere between multiple imputation and FIML-like techniques.
From a mathematical perspective, it looks like FIML. From an estimation perspective, it looks like multiple imputation.

To stan!

The Stan model, decrypted

Let me premise this section by saying: The Stan code I show below is not optimized. It is written for clarity, not for speed. There are several ways of optimizing this, but for a one-off model, it’s not critical.
Additionally, there are multiple ways of handling missings in Stan; the one I show below seemed easiest to me, even though it is an expensive method.

Stan hates NA values. Stan (or I assume, their C++ header and libraries) has no concept of missing values, and has no way of representing them.
So we need to do two things. We need to save which values are missing, and also replace those missing values with a temporary value.
This R code accomplishes those goals:

missings <- lapply(ds,function(var){which(}) # For each variable in ds, which rows are NA
nMissings <- lapply(missings,function(var){length(var)}) # How many NAs?
names(missings) <- paste0(names(missings),'_index') # Rename indices to variable_index
names(nMissings) <- paste0(names(nMissings),'_nMiss') # Rename n_miss to variable_nMiss

ds2 <- ds # Work with copy of dataset ds
ds2[] <- -100 # Magic number to remove NAs

For clarity, this is what missings looks like:

> missings
 [1]   7  22  24  30  40  42  50  59  62  70  86  87  93  96  99 104 111 120 139 148 150 157 172 174 180 190 192 200 209 212 220 236 237 243 246 249 254 261 270 289 298 300


 [1]   1   7  17  21  22  24  30  31  44  50  52  59  60  61  62  64  65  68  72  82  83  86  89  90  92  94  96  98  99 101 104 115 119 125 126 129 138 139 141 144 145 148 150 151 157 167 171 172 174 180
[51] 181 194 200 202 209 210 211 212 214 215 218 222 232 233 236 239 240 242 244 246 248 249 251 254 265 269 275 276 279 288 289 291 294 295 298 300

  [1]   8  11  15  17  21  22  24  31  40  52  59  60  61  62  68  70  71  73  74  77  80  82  85  86  89  90  92  93  94  95  96  97  98  99 104 108 110 112 114 115 119 120 122 124 125 135 136 138 139
 [50] 141 158 161 165 167 171 172 174 181 190 202 209 210 211 212 218 220 221 223 224 227 230 232 235 236 239 240 242 243 244 245 246 247 248 249 254 258 260 262 264 265 269 270 272 274 275 285 286 288
 [99] 289 291



Time to compose the Stan model.

The data block is as follows:

data {
    int N;
    vector[N] read; // Data vectors
    vector[N] parents;
    vector[N] ses;
    vector[N] iq;
    vector[N] treat;

    int read_nMiss; // N missing from variables; excluded have no missings
    int ses_nMiss;
    int iq_nMiss;

    int read_index[read_nMiss];
    int ses_index[ses_nMiss];
    int iq_index[iq_nMiss]; 

N is defined as the number of rows in the dataset (number of observations).
The outcome variable vector and the four predictor vectors are expected.
The number of missings for the three variables containing missing values are expected.
Finally, an integer array for the vector indices containing missings is expected for each variable with missings.

The parameters block:

parameters {
    // Structural
    vector[4] beta;
    real intercept;
    real<lower=0> sigma;

    // MVN covariates; for imputation
    cholesky_factor_cov[3] Sigma;
    vector[3] Mu;

    // Missings
    vector[read_nMiss] read_imp;
    vector[ses_nMiss] ses_imp;
    vector[iq_nMiss] iq_imp;

The structural parameters are the four regression coefficients, the intercept, and sigma — Corresponding to the model $y \sim \text{Normal}(X\beta,\sigma)$.
The multivariate normal parameters include a cholesky-factorized covariance matrix $\Sigma$, and $\mu$-vector; the known predictor values will inform the parameters to a multivariate normal distribution, which will in turn inform the unknown values for these variables.
The three remaining vectors correspond to the unknowns of each variable. The _imp should be read as “imputed”.

Transformed parameters block:

transformed parameters {
    // Recombine data
    matrix[N,5] Data; // 1: read, 2: parents, 3: ses, 4: iq, 5: treat
    Data[,2] = parents;
    Data[,3] = ses; Data[ses_index,3] = ses_imp;
    Data[,4] = iq; Data[iq_index,4] = iq_imp;
    Data[,5] = treat;
    Data[,1] = read; Data[read_index,1] = read_imp;

Here, we combine the observed and missing data into a single data matrix, called Data.
Normally, I would put this in the model block, but I hope to use this in generated quantities as well, and I do not want to copy and paste all of this.
We cannot merely edit the data vectors provided in the data block, because Stan disallows it.
Instead, a data matrix is created, and modified to include the estimated missing values.
This block is straight forward. An Nx5 matrix is created named Data, and I create a little key corresponding to which columns should represent which variables.
Each column is initially defined to be the corresponding vector provided in the data block.
For those three variables with missings, the indices with missing values (which we set to -100) are replaced with the “imputed” value for that person.

Model block:

model {
    // MVN imputation
    for(n in 1:N){
        Data[n,2:4] ~ multi_normal_cholesky(Mu,Sigma);

    // Priors
    beta ~ normal(0,2);
    intercept ~ normal(0,2);
    sigma ~ cauchy(0,1);
    Mu[3] ~ normal(100,10);
    Mu[1:2] ~ normal(0,2);
    //// Sigma given uniform prior implicitly

    Data[,1] ~ normal(Data[,2:5]*beta + intercept, sigma);

For each person, the parents, ses, and iq quantities (whether it be observed or unknown) are assumed distributed multivariate normal with $\mu$ means and $\Sigma\Sigma’$ covariance ($\Sigma$ is a cholesky factor).
Any observed data contribute to the likelihood, and thus inform these unknown parameters.
Any unknown data are simulated (in a sense) from the distribution.
And yes, it is weird to assume the number of parents is normally distributed; I am ignoring that for now for the sake of simplicity, because all I care about is the covariance, and I am not imputing the parent variable.
Note that I could have included all predictors into the multivariate normal, but treatment is completely orthogonal to every other variable, and was excluded for simplicity.

Priors are loosely defined by values I think are plausible given the scales of the variables.
The only odd looking one out is Mu[3], but that corresponds to IQ, and a-priori I can assume the mean is about 100, and extreme means are very unlikely.
Finally, read is assumed distributed normally about a value predicted from the known and unknown data.
Note that unknown read values are likewise predicted or imputed from the model, although I do not think it has any impact on the other parameters.

Generated quantities:

generated quantities {
    cov_matrix[3] Omega = Sigma*(Sigma');
    real<lower=0,upper=1> R2;
        vector[N] muHat = Data[,2:5]*beta + intercept;
        R2 = (variance(muHat))/(variance(muHat) + variance(Data[,1]-muHat));

In this block, I compute the covariance matrix of the three predictors involved in imputation.
The $R^2$ value is computed on the full data as well.

Then we run Stan. I only monitor the parameters of interest, and not the imputed data.

stan_data <- c(missings,nMissings,ds2,N=nrow(ds2))
stanOut <- stan(model_code = stanMod,data=stan_data,cores=4,pars = c('beta','sigma','intercept','Mu','Omega','R2'))

It stacks up well to lavaan’s FIML and MI output.

The model above produced the “Bayes” line near the bottom.
The Bayesian model looks very similar to the FIML estimator from lavaan.

Auxiliary variables can also be used, and a model with an Auxiliary variable for the multivariate normal imputation method is reported on the final line of the table.
I won’t put the stan code here, but the gist is: Don’t predict the outcome with the Auxiliary variable; permit the Auxiliary variable to covary with all the predictors in order for it to add information to the unknown predictor values. You can also have the auxiliary variable covary with the residual of the outcome variable (requiring a multivariate normal response model) to inform imputed outcomes. An alternative is to have all the predictors additionally predict the auxiliary variable, the residuals which covary with the outcome variable residuals. The former is a saturated covariate model, the latter is an added DV model; both accomplish the same goal of informing both missing predictors and missing outcomes.

In the end, I was pleasantly surprised with how easy it was to handle missing data in Stan. Ideally, you specify your generative model, and just combine the known data with the unknown data. The known data will inform the unknown data through its influence on the unknown parameters. In this case, I simply chose to model the exogenous variables as multivariate normal, which permitted unknown data to be informed and used along with the known data to predict the outcome of interest.

For MVN imputation:

  • Save which observations are missing, and how many, from each variable.
  • In addition to the typical parameters, include parameters for a multivariate normal.
  • Include a “parameter” per missing value.
  • Combine the known data with the unknown data into a new data structure.
  • Model exogenous variables as multivariate normal.
  • Use full combined data in the model
  • Thanks to MCMC, marginal posteriors will already be integrated over unknown data and parameter uncertainty.

Demystifying the Bayesian analysis of ego-depletion

At SPSP 2018, there was the big reveal of the awesome Many Labs replication report of ego depletion effects.
Thirty-six labs participated, and not even the labs had yet seen the full results.
I was in my room decompressing a bit (conferences are exhausting), but following along through Sanjay Srivastava’s live tweeting the session.

The anticipation for the results was palpable even on twitter. I had to rush down there to hear the results myself.
I ran in just in time to hear the results.

Wagenmakers couldn’t be there to deliver the results of a Bayesian analysis himself, unfortunately.
The speaker did their best to convey the results, but I had the sense that some nuances were missed a bit.
I want to reiterate: The speaker did a good job! But there are some subtleties to these analyses that unless you’re a fairly experienced Bayesian, you will miss or under-emphasize.
So this post isn’t to criticize the speaker at all!

The gist I got from the Bayesian results was that:

  • The data are approximately 4x more likely under the point-mass null model than under the a-priori specified model.
  • The meta-analytic effect is approximately .08. (I mistakenly read it as .06, but others told me .08; I was in the back, sorry!)

But it seemed like the take-home message was “The null was favored; but the meta-analytic effect is .08, so let’s instead talk about how we think about small effects”.

Those I spoke with after the session had the same take-home message — The effect exists, but it’s tiny: $d = .08$.

It’s not actually this simple. The “real” best-estimate is, according to Wagenmakers, about $d = .016$.
The “problem” is that the Bayesian meta-analytic method used provides a conditional effect estimate — An effect conditioned on the assumption that an effect is non-zero and positive.
To non-Bayesians, this is easily ignored. To Bayesians, this is a critical piece of information, because that condition completely changes the point estimates.

Let’s dig into the method, as best as I currently understand it.
Keep in mind, please, I have absolutely nothing to do with this analysis; I’m not involved in this project.
Many things I type here may be incorrect, because I simply do not know the details of the analysis.
Also worth mentioning: I have no invested interest in the approach being correct, either professionally or personally.
Quite honestly, I have had a lot of disputes with Wagenmakers et al’s approach to inference, because not all Bayesians are alike, and I generally hate model averaging and Bayes factors. With that said, if I am correct about how the Bayesian approach was performed, I think it was a good approach for the purposes of this project. Kudos to the entire Many Labs project, and kudos to Wagenmaker et al for their analysis.

The two meta-analytic models

The slides mentioned that both a fixed-effects and random-effects meta-analysis was performed, and the overall effect was estimated using both.
This does not mean the data were used to “choose” a fixed or random-effects model.
Instead, this method weights the “overall” estimate by the meta-analytic model probability.

A fixed-effects meta-analysis in Bayes looks like this:

p(\theta|y_1,\ldots,y_k,MA=1) \propto p(y_1,\ldots,y_k|\theta)p(\theta|MA=1)
Or with sufficient statistics…
p(\theta|\hat\theta_k,\sigma_{\hat\theta_k},MA=1) \propto p(\hat\theta_k|\sigma_{\hat\theta_k},\theta,MA=1)p(\theta|MA=1)

Essentially, either one is simply stating that there is one fixed effect, and each dataset ($y_k$) or estimate-information combination ($\hat\theta_k,\sigma_{\hat\theta_k}$) is an instance of that fixed effect.

The random effects meta-analysis looks similar, except that each estimated effect is assumed to come from a distribution of effects with an unknown variance and mean.
The mean is taken to be the expected value of the effect.

p(\Theta|y_1,\ldots,y_k,\theta_k,\sigma_{\theta_k},MA=2) \propto p(y_k|\theta_k,\Theta,\sigma_{\theta_k})p(\theta_k|\Theta,\sigma_{\theta_k})p(\Theta,\sigma_{\theta_k}|MA=2)
Or with sufficient statistics…
p(\Theta|\hat\theta_k,\sigma_{\hat\theta_k},\sigma_{\theta_k},MA=2) \propto p(\hat\theta_k|\sigma_{\hat\theta_k},\theta_k,\Theta,\sigma_{\theta_k})p(\theta_k|\Theta,\sigma_{\theta_k})p(\Theta,\sigma_{\theta_k}|MA=2)

In order to get just the posterior for $\theta$ or $\Theta$ (fixed effects, respectively), you’d have to integrate out the unknowns, which I’ll skip for the time being, because MCMC does that for you.

But across fixed-effects and random-effects models, what is the posterior distribution? Well, you have to integrate over the particular models themselves.
Assuming we’re working with the raw data vector, $y$:
p(\Theta|y) &= p(\Theta|y,MA=2)p(MA=2|y) + p(\theta|y,MA=1)p(MA=1|y) \\
p(MA=1|y) &= \frac{p(y|MA=1)p(MA=1)}{p(y)} \\
p(MA=2|y) &= \frac{p(y|MA=2)p(MA=2)}{p(y)} \\
p(y|MA=1) &= \int p(y|\theta,MA=1)p(\theta|MA=1)d\theta \\
p(y|MA=2) &= \iiint p(y|\theta_k,\Theta,\sigma_{\theta_k},MA=2)p(\theta_k|\Theta,\sigma_{\theta_k},MA=2)p(\Theta,\sigma_{\theta_k}|MA=2)d\theta_k d\Theta d\sigma_{\theta_k} \\
p(y) &= p(y|MA=1)p(MA=1) + p(y|MA=2)p(MA=2)

This all looks really convoluted, but in words:

  • The posterior probability distribution of the fixed effect is marginalized across the two different meta-analytic methods (fixed and random effect models), weighted by the posterior probability of each meta-analytic model given the data.
  • The posterior probability of a fixed-effects model is the probability of the data under that model times the prior probability (probably .5), divided by the marginal probability of the data across both. This is just Bayes theorem.
  • The posterior probability of a random effects model is the probability of the data under that model times the prior probability (probably .5), divided by the marginal probability of the data across both. This is just Bayes theorem.
  • The probability of the data under each model is the probability of the data given the parameters in each model, marginalized over the a-priori specified priors for each unknown parameter. This is where the $\text{Normal}^+(.3,.15)$ prior comes in for both $\theta$ and $\Theta$.

All of this comes from probability theory and Bayes theorem.
In a nutshell, you estimate a fixed-effects model and a random-effects model, assuming that the hypothesis about the effect is true and is expressed statistically as $\text{Normal}^+(.3,.15)$.
You get the marginal probability of the data under each model, using bridgesampling, because computing that manually is intractable.
This marginal probability represents the probability of observing the data they did given the a-priori hypothesis-informed uncertainty about the fixed effects.
The posterior probability distribution of the fixed effect is then informed by both the fixed and random-effects models, weighted by their respective marginal probabilities.

Both of these models assume the effect exists, and is positive.
A-priori, the hypothesis states that the ego-depletion effect is positive, and with uncertainty represented by Normal(.3,.15).
The posterior distribution for the ego-depletion effect, assuming such an effect exists as hypothesized, is informed by both the fixed and random-effects models, but weighted by the probability of observing the data they did under each model.
Really, the posterior distribution is: $p(\Theta|y,H1)$, meaning it is conditioned on both the observed data and that the hypothesized effect is true.

The hypothesis-driven prior on $\Theta$ (d) looks a bit like this:

And the posterior might look something like this:

But remember! This posterior is conditioned on the effect being positive, non-zero, because the prior constrains it to be so.
And this is likely not what the posterior actually looks like, this is just a best-guess approximation.
Something else to consider…

Posterior distributions and Bayesian point estimates are not like modal estimates

This is a quick aside.
Bayesian posteriors are distributions of effects, representing the data-informed posterior probability density or mass of parameters.
Outside of Bayesian inference, people are really accustomed to modal estimates, with an accompanying standard error or confidence interval.
In Bayesian methods, people often don’t report modal estimates (maximum a-posteriori), but rather medians or expectations with an accompanying credible interval.

When a prior induces a posterior boundary, these will not equal one another.
Just to build the intuition, let’s assume you fit a model without any boundaries, and the posterior looks like this:

This posterior suggests a mean, median, and mode of zero.

Now let’s use a prior to restrict negative values altogether. The posterior looks like this:

The modal estimate is still zero, but the median is about .629, and the mean (expected value) is about .765.

Why is this important?
The Bayesian MA posterior, conditioned on the effect being positive, has a median estimate of .08.
The mean estimate is necessarily higher, and the modal estimate is necessarily lower.
For what we currently know, that posterior could literally look like this:

In this posterior, the median is .08 (the same effect estimated by the Bayesian MA model), but the modal estimate is zero.

This is where Bayesian estimates and non-Bayesian estimates diverge: Maximizing estimators would say the effect is actually zero, whereas the Bayesian median and mean would be .08 and .10, respectively.
It’s not that Bayesian estimates are wrong, but they are really descriptions of a posterior distribution.
They are not intuitively the same as maximum-likelihood or other estimators. In this case, that posterior is really just a normal distribution, centered at zero, with all negative mass piled up at zero.
In other words, if the prior had not constrained the posterior to positivity, that posterior would have a mean, median, and mode of about zero.

This is important to recognize, because having a median effect of .08 is not particularly impressive or exciting.
In reality, that suggests a 50% posterior probability that the effect is between 0 and .08, assuming it can only be positive — And even effects that are in actuality zero can have a median estimate of .08 when positivity is induced.

The actual posterior

Wagenmakers and others compared two models.
One model was the combined-meta-analytic approach discussed above.
The other model assumed a point-mass of zero on the effect.

The Bayes Factor for this was $BF_{01} \approx 4$, from my best recollection.
This means the data are four times more likely to occur if the true effect were precisely zero than if the hypothesized (meta-analytic) effect of $\text{Normal}^+(.3,.15)$ were true.

If you want to know the overall posterior distribution for the effect, you have to marginalize across these models.
p(\Theta|y) = p(\Theta|y,H1)p(H1|y) + p(\Theta|y,H0)p(H0|y)

This posterior will look a bit like this:

The mean estimate is about .016.
That is — Assuming the effect could either be zero or positive, the effect is about .016.

This is not the traditional meta-analytic effect size estimate, but given the a-priori hypothesized effect size range and the alternative hypothesized effect of zero, then the meta-analytic effect would be approximately .016, not .08.
Once again, .08 assumes that the effect is positive and exists, which we are not certain of.
If we marginalize across the uncertainty in the models themselves, the expected estimate is about .016.

Hopefully, a more traditional meta-analytic estimate will be reported in the paper as well.
For example, using a zero-symmetric estimation prior (rather than the hypothesis-testing prior employed in the presented analyses) permits the effect to be any real number, weighted by the prior.
This would serve to probabilistically down-weight insane estimates (like $d\geq1$).
I am guessing that the meta-analytic effect would be between 0 and .05.
I am further guessing that if they used a prior mass at zero along with a zero-symmetric prior, the mean,median, and modal estimate would all be essentially zero (between 0 and .02).


From what I can gather:

  • The meta-analytic effect used a hypothesis-informed prior for the effect, meaning the meta-analytic effect is estimated assuming it is positive.
  • The data are about 4 times more likely to come from a parameter value of zero than from the hypothesized distribution of effects.
  • Assuming the hypothesis is correct, the median estimate is about .08.
  • Marginalizing across whether H1 or H0 are correct, the median estimate is about .016.
  • Do not take the meta-analytic effect size estimate as the take-home message, because that neglects the possibility that zero is a fairly probable estimate. I.e., don’t forget that the estimate is conditioned on the effect existing as hypothesized, which is 4x less likely than the null hypothesis.

Why (N-1) vs N [Long post]

I have a lot of students ask why we use (n – 1) vs (n) when computing variance. There are of course canned answers, like “Using N-1 unbiases the estimate” or “Using N-1 inflates the estimate, and it’s more accurate” or “Using N results in an underestimate”.

These are all true, but they’re deeply unsatisfying. I remember learning the more precise reason in my first semester of my phd program, but the question came up again and I decided to and write it out myself as a reference. This will be fairly basic stuff for many, I’m sure, but for the sake of my students and my future self, here it is:

First, distinguish between an estimator and its properties in expectation.
The estimator is some method with a defined goal that produces the best estimate given the goals.

Let’s say we are estimating the $\mu$ and $\sigma$ parameters to a normal distribution.
Let’s further say that we want a point estimate, and we’re not Bayesians (for now…), nor do we have any penalties.
One estimator is the maximum likelihood estimator, for obtaining the maximum likelihood estimate (MLE).

The MLE is the estimate that maximizes the joint probability density of the observed data, assuming some probability distribution underlies the sampling distribution.
In our case, we have the normal distribution, which looks like:
p(x | \mu, \sigma) = \sigma^{-1}(2\pi)^{-.5}e^{-(2\sigma^2)^{-1}(x – \mu)^2}
When we have more than one x – A vector of x called $X$ (capitalized), and if we assume independence, then the joint density is:
p(X | \mu, \sigma) = \prod_{i = 1}^N \sigma^{-1}(2\pi)^{-.5}e^{-(2\sigma^2)^{-1}(x_i – \mu)^2}

This is a pain to compute with, and the log-likelihood is much easier.

f(X | \mu, \sigma) = -N\log(\sigma) – .5N\log(2\pi) – (2\sigma^2)^{-1}\sum_{i = 1}^N (x_i – \mu)^2

Now, using this log-likelihood, we just need to see which value of $\mu$ and which value of $\sigma$ maximize the function output, the log-likelihood.
Because the log function is a monotonic function, maximizing the log-likelihood is the same as maximizing the likelihood.

To do this, we need to take the derivative of the log-likelihood with respect to each parameter at a time. Ignoring the notation for partial derivatives for the moment, the derivatives are as follows:

f(X|\mu,\sigma) &= \ldots -(2\sigma^2)^{-1}\sum(x_i^2 – 2x_i\mu + \mu^2) \\
&= \ldots -(2\sigma^2)^{-1} \left ( \sum x_i^2 – 2\mu\sum x_i + N\mu^2 \right ) \\
&= \ldots – \frac{\sum x_i^2}{2\sigma^2} + \frac{\mu}{\sigma^2}\sum x_i – N\mu^2(2\sigma^2)^{-1} \\
\frac{d(f(X|\mu,\sigma)}{d\mu} &= 0 + \frac{\sum x_i}{\sigma^2} – N\mu(\sigma^2)^{-1} \\
0 &= \frac{d(f(X|\mu,\sigma)}{d\mu} \\
0 &= \sum x_i – N\mu \\
\mu &= \frac{\sum x_i}{N}

I’ll call that $\hat\mu$, since it’s an estimate that maximizes the likelihood of the observations, and not the “true” $\mu$ parameter value.

Onto $\sigma$:
f(X | \mu, \sigma) &= -N\log(\sigma) – .5N\log(2\pi) – (2\sigma^2)^{-1}\sum_{i = 1}^N (x_i – \mu)^2 \\
\frac{d(f)}{d\sigma} &= -N\sigma^{-1} – 0 – \sum(x_i – \mu)^2\sigma^{-3} \\
0 &= \sigma^{-1}(N – \sigma^{-2}\sum(x_i – \mu)^2 \\
\sigma^{-2}\sum(x_i – \mu)^2 &= N \\
\sigma^2 &= \frac{\sum(x_i – \mu)^2}{N}

I’ll call that $\hat\sigma^2$, since it’s an estimate that maximizes the likelihood of the observations, and not the “true” $\sigma$ parameter value.

So now we have two estimators, derived from maximum likelihood goals:
\hat\mu = \frac{\sum x_i}{N} \
\hat\sigma^2 = \frac{(\sum x_i – \mu)^2}{N}

Now the question, what are the expected values of these parameter estimates, and are they equal to the desired parameter values?

Let me just take a shortcut and say the expected value of $\hat\mu$ is indeed $\mu$.
There is variance of the $\hat\mu$ estimate around $\mu$, derived as follows.
Assume $E(x_i) = \mu$, $E(\sum x_i) = \sum E(x_i)$, $E(cX) = cE(X)$, which are some basic expectation rules, and the rest follows.

E(\hat\mu – \mu)^2 &= E(\hat\mu^2 – 2\hat\mu\mu + \mu^2) \\
&= E(\hat\mu^2) – 2\mu E(\hat\mu) + \mu^2 \\
&= E(\frac{(\sum x_i)^2}{N^2}) – 2\mu E(\frac{\sum x_i}{N}) + \mu^2 \\
&= \frac{1}{N^2}E((\sum x_i)^2) – \mu^2 \\
&= \frac{1}{N^2}E\left (\sum x_i – \mu + \mu \right )^2 -\mu^2 \\
&= \frac{1}{N^2}E\left ((\sum x_i – \mu) + N\mu \right )^2 – \mu^2 \\
&= \frac{1}{N^2} E \left ( (\sum (x_i – \mu)^2 + 2N\mu\sum(x_i – \mu) + N^2\mu^2)\right ) – \mu^2\\
&= \frac{1}{N^2} (N\sigma^2 + 0 + N^2\mu^2) – \mu^2\\
&= \frac{\sigma^2}{N} = \sigma^2_\mu
Voila, the expected variance (VAR) of the $\hat\mu$ estimator, defined as $E(\hat\mu – \mu)^2)$, is equal to $\frac{\sigma^2}{N}$.
Just as our textbooks tell us.

With that out of the way, what is the expected value of $\hat\sigma^2$?

\hat\sigma^2 &= \frac{\sum(x_i – \hat\mu)^2}{N} \\
E(\hat\sigma^2) &= \frac{1}{N}E(\sum((x_i – \mu) – (\hat\mu – \mu))^2) \\
&= \frac{1}{N}E(\sum (x_i – \mu)^2 -2\sum(x_i – \mu)(\hat\mu – \mu) + \sum(\hat\mu – \mu)^2) \\
&= \frac{1}{N}\left (N\sigma^2 – 2\sum E(x_i – \mu)(\hat\mu – \mu) + N\sigma^2_\mu) \right ) \\
&= \frac{1}{N}\left (N\sigma^2 – 2\sum E(\hat\mu x_i – \mu x_i – \mu\hat\mu + \mu^2) + N\sigma^2_\mu \right ) \\
&= \sigma^2 + \sigma^2_\mu – \frac{2}{N}E(N\mu^2 – N\mu\hat\mu – N\mu\hat\mu + N\hat\mu^2) \\
&= \sigma^2 + \sigma^2_\mu – 2E(\hat\mu – \mu)^2\\
&= \sigma^2 – \sigma^2_\mu \\
E(\hat\sigma^2)&= \sigma^2 – \frac{\sigma^2}{N} = \sigma^2(1 – \frac{1}{N}) = \sigma^2(\frac{N-1}{N}) \\
\sigma^2 &= \hat\sigma^2\frac{N}{N-1}

  1. $E(\hat\sigma^2)= \sigma^2 – \sigma^2_\mu$ is really interesting. It’s saying the expected value of our estimate of the variance is equal to the true variance, minus the error variance of the mean estimator. Conversely, you can say that the true variance is equal to the observed variance plus the error variance in the mean estimate. That already should intuitively suggest to you that the reason the MLE for $\sigma$ is underestimated is because it fails to account for the variability in the mean used to compute the sample variance. Scores not only vary around the mean, but the mean varies around the parameter, and by neglecting that latter variability, we underestimate the population variance.
    From an ordinary least squares perspective, you can think of it this way: The mean is the value for which the average squared deviation from it is the smallest. Already, the mean is minimizing the variability. But because the mean does not equal the population parameter, that variability will be minimized around the incorrect point.
  2. The $\frac{N}{N-1}$ adjustment merely alters the estimated value so that the expected value will equal the true value. It does so basically by indirectly adding back some error variance due to the estimation of the mean itself. And if that fraction doesn’t look familiar to you, maybe this helps: $\hat\sigma^2 \frac{N}{N-1} = \frac{(\sum x_i- \hat\mu)^2 N}{N(N-1)} = \frac{\sum (x_i – \hat\mu)^2}{N-1}$

And that’s basically it. I probably have minor mistakes in the above, as this was all hastily written. But that’s the gist – Using (N-1) as the divisor ultimately comes from the fact that our estimator is not expected to equal the true variance parameter. The estimated sample variance is off by a factor of $\frac{N-1}{N}$, asymptotically, so we just adjust it so the expected value with the adjustment does equal the true variance.

The absurdity of mapping p-values to Bayes factors

In the “Redefine statistical significance” (RSS) paper, the authors argued that p = .05 corresponds to low evidentiary value.

Let’s unpack what is meant by that.

The authors’ argument comes from an “upper bound” argument on how large the Bayes Factor can be given a particular p-value.

Recall that a Bayes Factor is defined as:

\frac{p(D|H_a)}{p(D|H_0)} = \frac{\int_l^u p(D|\theta,H_a)p(\theta|H_a)d\theta}{\int_l^u p(D|\theta,H_0)p(\theta|H_0)d\theta}

The parameter priors, $ p(\theta|H_k) $ corresponds to an uncertain prediction from some hypothesis, $H_k$.
The marginal likelihoods computed in the numerator and denominators above can then be thought of “prior predictive success”, or “predictive success, marginalized over the uncertainty expressed in the parameter from hypothesis $k$ “. The BF10 then represents the evidence in favor of hypothesis a over hypothesis 0.

With that out of the way, let’s return to the RSS paper.
The authors wished to argue that a p=.05 necessarily has a low upper bound on evidence in favor of the alternative hypothesis.

RSS used as an example a BF for a two-tailed alternative hypothesis in comparison to a nil-null hypothesis.
This makes intuitive sense, because a commonly used test is the two-tailed t-test with a nil-null hypothesis.
In the latter case, the p-value is corresponding to the probability of observing a t-statistic at least as extreme as the observed statistic, assuming that the true sampling distribution is centered around zero.
No explicit alternative hypothesis really exists; the statement is simply $H_a: \theta \neq 0$, which isn’t exactly a useful hypothesis, but I digress.

In contrast, the BF does use two explicit hypotheses.
$H_0: p(\theta|H_0) = \begin{cases}1 & \text{if } \theta = 0 \\ 0 & \text{if } \theta \neq 0\end{cases}$
$H_a$ depends slightly on how favorable you wish to be, in order to obtain an upper bound.
The most direct upper bound of BF for a p=.05 value with some specified N would simply be another point mass at exactly the quantile corresponding to the .975 percentile of the t-distribution.

Using this line of reasoning, they argue that at best, a p = .05 corresponds to a small BF, and as such, p = .05 has a low upper bound on evidentiary value.
Instead, they argue that p = .005 would correspond to a more considerable evidentiary value, with BFs ranging from 14–28 or so.

So what’s the problem? Seems reasonable!

The problem(s)

Ok, well let’s start with some intuitive problems.

First, as mentioned in the “Justify your alpha” paper, why should the rate at which a hypothetical error is presumably controlled map onto a marginal likelihood ratio (I refuse to call this “evidence”, as though the whole of scientific evidence can be expressed in a single unidimensional metric of prior predictive success ratios)?
We could likewise flip this around — How about we devise Bayes factor thresholds that map onto hypothetical error rates, so that these crazy Bayesians don’t make more or less type 1 errors than expected? (I am not encouraging this, just highlighting the arbitrariness).

Second, these are metrics of completely different things! A p-value is exactly what it states: The probability of observing what we did (or more, or more extreme) assuming some state-to-be-nullified is true? The BF is a ratio of prior predictive success, or a marginal likelihood ratio, or the ratio of probability densities of the observed data from two prior-predictive distributions corresponding to two hypotheses.
The goals of each are different too.
The p-value is just a “surprise” factor, given some assumed state.
The BF is a relative measure, evaluating the data under two certain or uncertain states.
The p-value is used inferentially as a means to reject some state if the observation is sufficiently improbable under it.
The BF is used to express how much one’s prior odds in each state would change as a function of the data.

Why these things should map onto one another is not intuitive.
The p-value has its uses. The BF has its uses. I’m not sure why one needs to be defined in terms of the other.

Third, and most importantly, the mapping from “highly evidentiary” upper bound BFs to p-values (or alphas) is not generalizable. Richard Morey has an excellent series on this topic. The gist is, the whole “p=.005 has reasonable upper bounds, but p=.05 does not”, depends entirely on which hypotheses you are testing. And as I said above, the RSS authors used a two-tailed alternative against a point-mass null, and from the standpoint that statistical hypotheses should map onto their respective substantive hypotheses, this makes no sense anyway. More on that in a future blog post.

Morey recently made his point again on twitter.
I decided to take it a step further.

Likelihood Ratio Tests

A likelihood ratio test is defined as:
Q &= -2(LL_0 – LL_a) \\
Q &= 2(LL_a – LL_0) \\
Q &= 2\log\frac{L_a}{L_0} \\
k_i &= \text{Number of parameters in model i} \\
df &= k_a – k_0 \\
Q &\sim \chi^2(df)|

We can solve for the likelihood ratio required to reject at some $\alpha$ level:
e^{\frac{Q_{c,\alpha,df}}{2}} = \frac{L_a}{L_0}
Given some $\alpha$ and $df$, find the critical value for the $\chi^2$ statistic, $Q_{c,\alpha,df}$, then simply solve for the likelihood ratio.

Now, let’s define the upper bound BF.
As it turns out, if you have two hypothetical models with point masses at fixed values ($\hat\theta$), then the BF is simply a good ‘ol fashion likelihood ratio.
p(D|H_a) &= \int p(D|\theta,H_a)p(\theta|H_a) \\
\text{If }p(\theta|H_a) &= \begin{cases}1 & \text{if } \theta=\hat\theta \\ 0 & \text{if } \theta\neq\hat\theta\end{cases} \\
p(D|H_a) &= p(D|\hat\theta,H_a)
Meaning, that the BF will just become a likelihood ratio.

If you have model A with 2 parameters, and model 0 with 1 parameter, and p = .05, what is the upper bound BF ($BF_{ub}$)?
df &= 1 \\
Q_{c,\alpha=.05,df=1} &= 3.841459 \\
e^{Q/2} &= 6.825936 = BF_{ub}

So if you are comparing two models with a difference of one parameter, and you see p = .05, your upper bound Bayes Factor is 6.825936.
Basically, that means if your alternative hypothesis had one more parameter and your hypothesis predicts the exact estimate achieved, and you got p = .05, then the greatest BF you could see is 6.825936.
That’s not stellar, which is the point of the RSS argument.

Not so fast! That’s in one particular scenario, wherein you had two models, and the only difference is the inclusion of an extra parameter.
Let’s vary the p-value obtained and the df for the test. Recall that df in LRTs represents the difference in parameters estimated for each model, so we are varying how complex the alternative model is from the reference or null model.
Plot time.

alphas <- seq(.005,.5,by=.005)
dfs <- 1:10

ds <- expand.grid(alpha=alphas,df=dfs)
ds$qchisq <- qchisq(p = 1-ds$alpha,df = ds$df)
ds$bfup <- exp(ds$qchisq/2)

ggplot(data=ds,aes(color=factor(df),x=alpha,y=log(bfup))) + 
    geom_line() + 
    labs(x='p',y='log(BF10) Upper Bound',color='LRT df') + 
    theme_classic() + 
    geom_hline(yintercept=c(log(14),log(28))) + 
    scale_x_continuous(breaks=seq(0,.5,by=.025)) + 
    theme(axis.text.x = element_text(angle = 90))

The $\log BF_{ub}$ is plotted on the Y axis. P-values on the X-axis. Colors represent the df (i.e., difference in number of parameters from alternative to null models).
The black horizontal lines represent BF10 14-28.

If your model has 2 parameters more than the null, then p=.05 has an upper bound of “strong evidence”, according to the RSS authors.
Well, that changes things doesn’t it?
In fact, if you only have one parameter difference, then p=.01-.03 has “strong evidence” upper bounds.
A p=.005 would have an upper bound BF of about 50.

In the more extreme case, if you have 6–7 more parameters in the alternative model, then a p=.50 corresponds to “strong evidence” upper bounds.
Let me repeat that, for those who misread .50 as .05.
A p-value, of .50 (50%, point five zero, not 5% or point zero five), corresponds to strong evidence upper bounds.

Using the logic of the RSS authors, if you have 6-7 parameters more in your alternative model than the null model, then alpha should be set to .50, because that’s where the upper bound of “evidence” is strong.
That’s some really great news for people who fit SEM and MLM models, both which often compare much higher dimensional models to much lower dimensional models. Go for broke, know that if you have p=.42, it’s justifiable as a cutoff, since it has a high evidence upper bound.

Note that this does make sense from a BF-inferential perspective.
If you have a hypothesis that predicts precisely the parameter values that you estimated, then that hypothesis is insanely good.
And the goodness of that hypothesis increases as your model has more parameters that it precisely predicted.
This reasoning is intuitively why as df increases, the upper bound increases too.
If you have a hypothesis that is perfectly predicting the parameter values, then the BF upper bound should be higher for models that have more parameters to predict.
In other words, it’s much more impressive for a hypothesis to perfectly predict 10 parameters than 2.

Despite the pattern making sense from a BF-inferential perspective, it makes little sense to claim p-values overstate the “evidence”, when in many cases the p-values understate the “evidence”.
Sure, a p=.05 may only imply a BF of 3-6.
But also, a p=.05 may imply a BF of 20-10000.

It depends on the hypotheses you are testing, and therefore there can be no general mapping between a p-value and the “evidence” indicated by a BF.

Final thoughts

The argument that p=.005 should be the new standard due to its relationship to the BF is questionable.
Philosophies aside, the p-value does not map onto the BF.
A p-value maps onto a BF.
It all depends on the hypotheses you test. A p of .5 could mean a BF of 106; a p of .005 could mean a BF of 51; a p of .05 could mean a BF from 7 to 10000. Who knows?

The recommendation that all findings be labeled significant only when $p \leq .005$ is based purely on one particular set of hypotheses, under a set of assumptions, with the goal of matching it to upper bounds of one particular evidentiary metric.
It cannot generalize to all computed p-values, and I doubt we’d want a big table of significance thresholds corresponding to a range of BFs for every possible significance test.

Finally, I agree that a single p-value near .05 is insufficient evidence; just as I would agree that a single BF of 3-10000 would be insufficient evidence.
The reason I state that a single p-value near .05 is insufficient evidence is not because the .05 is necessarily weak, but rather that any singular metric is necessarily weak, when assessing the evidence for a hypothesis.
More on this in a future blog post, I’m sure.

In sum, use $\alpha = .005$ if you’d like. Use BFs if you’d like. Use posteriors and CIs if you’d like. Report all the things if you’d like.
Justify your inference within your inferential framework, justify the decisions made.
But maybe don’t say p=.05 is necessarily weak because your p = .05.

And maybe all of this is wrong, I don’t know. You tell me.

Updating? Pooled data? Meta-analysis? Yes.

Meta-psychology is a journal that is founded on the principles of total transparency. Manuscripts are open. Reviews are open. Open commentary is permitted. Each draft is visible. It’s as though the journal cares about science or something.
Rickard mentioned a new submission about the utility of Bayesian updating, or as the authors call it – Posterior passing.

My comment can be read here by clicking on the upper-right comment button. Due to the excruciating 1000 character limit, you will have to hit the ‘+’ sign to read all of them.

Although I generally appreciate the paper, I think it misses an important point.
Their proposal to use “posterior passing” instead of meta-analysis or using full raw datasets is strange upon realizing that posterior passing is the same as having all of the raw data, and the same as a fixed-effects meta-analysis under a few reasonable assumptions.

Posterior Passing

The gist of posterior passing is that a researcher can take the posterior of a parameter computed from a previous sample, and use that posterior as a prior.
Or, as the rest of the world calls it – Bayesian updating.
The prior is then updated by the new sample data into a new posterior.
That new posterior can serve as someone else’s prior, and so on.
Over repeated samples, this results in decreasing uncertainty (smaller posteriors) about parameter values.

Posterior passing looks like the following. Assume $Y_j$ represents some current observation vector, and $Y_{-j}$ represents a set of previous observation vectors. Or: $Y_j$ is your data, and $Y_{-j}$ is not your data.
p(\theta,Y_j|Y_{-j}) & \propto p(Y_{j}|\theta,Y_{-j})\overbrace{p(\theta|Y_{-j})}^{\text{Your prior, their posterior}}

That’s about it. Obtain data, estimate a posterior, use that posterior as a prior for the next sample. Keep doing it ad nauseam, and you have posterior passing.

The argument is that this method is underutilized in the sciences, and that engaging in posterior passing will permit researchers to quickly converge toward a more certain, accurate estimate.

I do not disagree with the idea of Bayesian updating, of course. It’s just a mathematical fact, and it is underutilized. People are way too afraid to incorporate prior information into a model.
With that said, the proposal has a few problems, some they mention and some they do not.

  • The proposed use of it assumes a single set of generative parameters across all of the studies. Given that multiple studies have minor to major differences between them, surely their corresponding data are not generated from precisely the same set of parameters. It would be difficult to find three, let alone 60, studies that are feasibly generated from the exact same parameters.
  • The quick convergence toward the “true value” strongly depends on the quality of the available posteriors from previous studies. $ p_\text{osterior}-\text{hacked} $ studies and publication bias result in a misleading assortment of posterior distributions, and though your own data may overwhelm the priors, it is not guaranteed. Indeed, if they had simulated the existence of hacking and pub bias, I suspect the performance would be underwhelming.

My bigger disagreement is with advocating it over meta-analysis, or even comparing it to having the full data available.

They’re the same things! Advocating for repeated Bayesian updating across studies in order to get a highly informative posterior is the same thing as advocating for full-data analysis and fixed effects meta-analysis.

Bayesian updating vs full-data analysis

Imagine two researchers are arguing about their next three samples.
One researcher wants to collect 300 observations, then another 300, then another 300.
She then wants to compute a posterior based on the first 300, feed that in as a prior for the second 300, and finally the resulting posterior into the third.
Specify an initial prior, then chain posteriors into the next model’s prior, until we get a three-study Bayesian update on the state of information.
The other researcher thinks that’s inferior to the full-data method.
He wants to just collect 900 and be done with it; analyze the whole thing.
Specify an initial prior, then throw all 900 observations into a single model.
They argue for hours. Friendships are severed. Children are neglected. They use all caps when texting because their voices have given out.

Neither researcher is right or wrong.
These two approaches will accomplish the exact same end result.
They should have just worked out the math and grabbed a beer instead.

p(\theta,Y_j,Y_{-j}) &\propto p(Y_j|\theta,Y_{-j})\overbrace{p(\theta|Y_{-j})}^\text{A posterior} \\
&\propto p(Y_j|\theta,Y_{-j})p(Y_{-j}|\theta)p(\theta) \\
\text{If IID…} &\\
&\propto p(Y_\cdot|\theta)p(\theta)

Assuming your data and other data are IID, you can either use a posterior as your prior, or combine all the data.
The result is the same. The joint distribution decomposes into either one.

Consequently, using all available data in a single model is no better than using posteriors derived from all previous data to inform the current data.

Bayesian updating vs fixed-effect meta-analysis

A fixed-effect meta-analysis is defined as follows.
Some parameter, $\theta$, exists that is estimated by several $\hat\theta_i$. A variance estimate, $\sigma^2_{\theta_i}$ is computed for each $\hat\theta_i$ estimate.
It follows then, that:
p(\theta | \hat\theta_\cdot,\sigma_{\theta_\cdot}) \propto p(\hat\theta_\cdot | \theta, \sigma_{\theta_\cdot})p(\theta)

Decomposing this further,

p(\theta | \hat\theta_\cdot,\sigma_{\theta_\cdot}) &\propto p(\hat\theta_\cdot | \theta, \sigma_{\theta_\cdot})p(\theta) \\
&\propto p(\hat\theta_j | \theta, \sigma_{\hat\theta_j})p(\hat\theta_{-j}|\theta,\sigma_{\hat\theta_{-j}})p(\theta) \\
&\propto p(\hat\theta_j | \theta, \sigma_{\hat\theta_j})p(\theta | \hat\theta_{-j},\sigma_{\hat\theta_{-j}})

If $\hat\theta_j, \sigma_{\hat\theta_j}$ are sufficient statistics for $Y_j$, then
p(\hat\theta_j | \theta, \sigma_{\hat\theta_j})p(\theta | \hat\theta_{-j},\sigma_{\hat\theta_{-j}}) = p(Y_j|\theta)p(\theta|Y_{-j})

Assuming each study’s $\hat\theta_j, \sigma_{\hat\theta_j}$ calculations are sufficient statistics and are independent of one another, the fixed-effect meta-analysis will provide the same information as posterior passing or full-data analysis.


In essence, whether you want to utilize posterior passing, incorporate all data into a single model, or use a fixed-effect MA, you will produce the same information in the posterior.

I think, anyway. Feel free to critique the above.

It is inconsequential, based on the above, to use one version or another. They will all suffer the same consequences with the same problems, and have the same troubling assumption that all included data are presumed generated from the same single parameter.

Practically speaking, I believe the fixed-effect MA would be the simplest to utilize of those three. Regardless, I think the assumption of one-true-effect for multiple studies is seriously improbable in practice, and would recommend a random effects meta-analysis.