Informative priors for correlation matrices: An easy approach

Estimating correlations is a dead simple thing to do, right?

Yes.

But sampling correlations is a dead simple thing to do right?

Not necessarily.

Correlation matrices must be positive semi-definite. When doing Bayesian analysis, you therefore need to parameterize the model in such a way that the correlation matrix is positive semi-definite. That is not a trivial procedure. Thankfully, Stan (like other PPLs, e.g., PyMC3) does that hard work for you, under the hood, and with the correct Jacobian adjustments.

However, even a uniformly distributed correlation matrix does not yield marginally uniform elements. My previous post explored this. Essentially, the positive semi-definite (PSD, for short) constraint restricts the range of possible correlation matrices, which imposes an implicit prior on the matrix, and in turn the unique elements.

More to the point, the LKJ(1) prior is a uniform prior on permissible matrices (permissible, meaning symmetric and PSD). Most permissible matrices beyond dimension 2 will be in a hypersphere around the identity matrix. Even with a uniform density, most of the volume contains solutions without large individual elements.

In short – A uniform LKJ, which says all permissible solutions are equally likely, is not "uniform" in the intuitive meaning of "uniform", because most permissible solutions are found around an identity matrix.

Because probability mass is density $\times$ volume (the integral of density over a region), most of the probability mass will be near the identity. Unfortunately, this means that it can be surprisingly hard to attain high posterior mass over correlation matrices with high correlations.
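To put a number on this, here is a small R sketch (not from the original analysis) that uses the known marginal of a single LKJ element, namely that (r + 1)/2 follows a Beta(a, a) distribution with a = eta + (D - 2)/2, to show how quickly the implied prior tightens around zero as D grows:

# Implied marginal prior SD of a single correlation under LKJ(1), by dimension D.
# Relies on the known result that (r + 1)/2 ~ Beta(a, a), with a = eta + (D - 2)/2.
lkj_marginal_sd <- function(D, eta = 1) {
  a <- eta + (D - 2) / 2
  2 * sqrt(a^2 / ((2 * a)^2 * (2 * a + 1)))  # SD of Beta(a, a), rescaled to (-1, 1)
}

round(sapply(c(2, 5, 15, 50), lkj_marginal_sd), 2)
# 0.58 0.41 0.25 0.14 -- by D = 15, the implied prior SD of each element is only .25.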

To show this "problem", I generated 50 observations of 15 highly correlated data points. I computed the observed correlation matrix, and plotted each element in black. I then estimated a Bayesian multivariate normal with a uniform LKJ prior on the correlation matrix. The posterior means are plotted in red.

Notice how they are all considerably lower than they "should be" given the output of cor(y).

This is a well-known "problem" of the LKJ prior. However, this is not a problem with LKJ at all, but with most of the volume, and therefore most of the mass, being around the identity. With enough data points, you will be able to "overcome" the volume problem due to the much higher joint density of larger correlations.

Using priors

When there are problems like this, the natural Bayesian solution is to include more prior information. Recall that the uniform prior on correlation matrices carries the implicit prior of "must be symmetric and positive semi-definite". Well, we may know with high certainty that some correlations should be high, while others are relatively unknown.

Naively, you may try to construct a matrix from $D(D-1)/2$ estimated parameters, and put priors on each of the individual parameters. But you will be sorely disappointed — The probability of such a matrix being PSD is vanishingly small as D increases.
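A quick simulation (a sketch, not from the original post) makes the point; it fills a symmetric matrix with independent Uniform(-1, 1) off-diagonal elements and checks how often the result is PSD:

# Fill a symmetric matrix with unit diagonal and independent Uniform(-1, 1)
# off-diagonal elements, then check whether it is positive semi-definite.
random_symmetric_is_psd <- function(D) {
  R <- diag(D)
  R[lower.tri(R)] <- runif(D * (D - 1) / 2, -1, 1)
  R[upper.tri(R)] <- t(R)[upper.tri(R)]
  all(eigen(R, symmetric = TRUE, only.values = TRUE)$values >= 0)
}

set.seed(13)
sapply(c(3, 5, 7, 10), function(D) mean(replicate(5000, random_symmetric_is_psd(D))))
# The proportion of PSD draws collapses toward zero as D grows.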

However! What you can do is construct a PSD matrix, and add more prior information. That is:

$$ p(R) \propto \text{LKJ}(R|\eta = 1)\prod_u \mathcal{N}(R_u | \mu_u, \sigma_u) $$

where $R$ is a symmetric, PSD correlation matrix, $u$ indexes the unique, lower-triangular elements of $R$, and the normal distributions with locations $\mu_u$ and scales $\sigma_u$ encode prior information about each element $R_u$. In words, this just means that $R$ has a joint prior consisting of a uniform LKJ on the matrix (which effectively contributes nothing) and normals on its elements.

E.g., if you think R[2,1] should be fairly high, then you could say $\mu_1 = .8, \sigma_1 = .2$.

This approach is adding more density to certain regions of R’s parameter space. By adding more density, you are adding more mass. Intuitively, you are countering the enormous volume of near-identity PSD solutions with more density further away from identity. Therefore, more mass can exist away from the high-volume set of PSD solutions that otherwise "shrink" your gloriously high correlations down to zero.

Some Stan code

functions {
  /*
    Extract lower-triangular from mat to vector.
    Excludes diagonal.
    @param mat Matrix.
    @return vector
   */
  vector lower_tri(matrix mat) {
    int d = rows(mat);
    int lower_tri_d = d * (d - 1) / 2;
    vector[lower_tri_d] lower;
    int count = 1;
    for(r in 2:d) {
      for(c in 1:(r - 1)) {
        lower[count] = mat[r,c];
        count += 1;
      }
    }
    return(lower);
  }
  /*
    Proportional log-density for correlation matrix.
    @param cor Matrix. PSD, symmetric correlation matrix.
    @param point_mu_lower Vector. D(D-1)/2 length. Locations of each [unique] element's prior.
    @param point_scale_lower Vector. D(D-1)/2 length. Scales of each [unique] element's prior.
    @return log density.
   */
  real lkj_corr_point_lower_tri_lpdf(matrix cor, vector point_mu_lower, vector point_scale_lower) {
    real lpdf = lkj_corr_lpdf(cor | 1) + normal_lpdf(lower_tri(cor) | point_mu_lower, point_scale_lower);
    return(lpdf);
  }
  /*
    Same as lkj_corr_point_lower_tri_lpdf, but takes matrices of prior locations and scales.
   */
  real lkj_corr_point_lpdf(matrix cor, matrix point_mu, matrix point_scale) {
    int d = rows(cor);
    int lower_tri_d = d * (d - 1) / 2;
    vector[lower_tri_d] point_mu_lower = lower_tri(point_mu);
    vector[lower_tri_d] point_scale_lower = lower_tri(point_scale);
    real out = lkj_corr_point_lower_tri_lpdf(cor | point_mu_lower, point_scale_lower);
    return(out);
  }
  /*
    Same as lkj_corr_point_lower_tri_lpdf, but takes cholesky-factor of correlation matrix.
   */
  real lkj_corr_cholesky_point_lower_tri_lpdf(matrix cor_L, vector point_mu_lower, vector point_scale_lower) {
    real lpdf = lkj_corr_cholesky_lpdf(cor_L | 1);
    int d = rows(cor_L);
    matrix[d,d] cor = multiply_lower_tri_self_transpose(cor_L);
    lpdf += normal_lpdf(lower_tri(cor) | point_mu_lower, point_scale_lower);
    return(lpdf);
  }
}

data {
  int N; // # observations
  int D; // # variables
  vector[D] y[N]; // Data
  matrix[D,D] cor_guess; // Prior correlation location guesses.
  matrix[D,D] cor_sd; // Prior correlation uncertainties/scales.
  int just_lkj; // Whether to just use LKJ, or use defined lpdfs.
}

transformed data {
  int d_unique = D*(D-1)/2;
  vector[d_unique] cor_guess_lower = lower_tri(cor_guess);
  vector[d_unique] cor_sd_lower = lower_tri(cor_sd);
}

parameters {
  vector[D] mu;
  vector<lower=0>[D] sigma;
  cholesky_factor_corr[D] cor_L;
}

model {
  mu ~ std_normal();
  sigma ~ std_normal();
  if(just_lkj) {
    cor_L ~ lkj_corr_cholesky(1);
  } else {
    cor_L ~ lkj_corr_cholesky_point_lower_tri(cor_guess_lower, cor_sd_lower);
  }

  y ~ multi_normal_cholesky(mu, diag_pre_multiply(sigma, cor_L));
}

generated quantities {
  matrix[D,D] cor = multiply_lower_tri_self_transpose(cor_L);
}


Some data generating code

library(rstan)
set.seed(13)

sm <- stan_model("lkj.stan")

N <- 50
D <- 15

y <- data.frame(x1 = rnorm(N))
y[,paste0("x",2:D)] <- y[,1] + matrix(rnorm(N*(D-1), 0, .5), N, D-1)

# Uninformative (High scale value)

cor_guess <- cor(y)
cor_sd <- matrix(10, D,D)

stan_data <- list(N = N,
                  D = D,
                  y = y,
                  cor_guess = cor_guess,
                  cor_sd = cor_sd,
                  just_lkj = FALSE)

sOut_loose <- sampling(sm, stan_data, cores = 4)
means_loose <- summary(sOut_loose, pars = "cor")$summary[,"mean"]

# Make informative
cor_sd <- matrix(.1, D, D)
stan_data$cor_sd <- cor_sd

sOut_tight <- sampling(sm, stan_data, cores = 4)
means_tight <- summary(sOut_tight, pars = "cor")$summary[,"mean"]

# Compare to LKJ
stan_data$just_lkj <- TRUE
sOut_lkj <- sampling(sm, stan_data, cores = 4)
means_lkj <- summary(sOut_lkj, pars = "cor")$summary[,"mean"]

How did we do?

Note that I’m effectively "cheating" here by using the computed correlation matrix as our "prior guess". However, this shows a "best case scenario" where our prior guesses are exactly as observed in the data, plus a bit of uncertainty encoded into the cor_sd matrix.

Below is a plot of cor(y) values (black) and posterior means of a uniform LKJ prior (red), uninformed prior (purple), and informed prior (blue). The informed prior approach generally performed better. Note that they are all still a bit low, but the uninformed and uniform LKJ priors were consistently far away from the cor(y) values, to a startling degree.

Recasting to a cleaner plot below, you can see that the informed approach tended to map onto the estimated correlations much better. The black line indicates a perfect mapping. The informed approach still has downward bias, but that is nothing compared to the bias seen in the uninformed and LKJ priors.

Summary

The LKJ prior is extremely popular in Stan and other similar samplers. However, due to the PSD constraint, most of the prior mass over high-dimensional correlation matrices sits near the identity, pulling posteriors toward it, and high correlations are difficult to estimate accurately. Coupling the PSD constraint with additional prior information can attenuate this problem: the low volume of high-correlation PSD solutions is offset by greater density, leading to more mass near those solutions.

In retrospect, this is an obvious solution to an annoying problem. I think the reason it was not obvious before is the way PPLs express priors. When one says a ~ p(), one assumes (rightfully so) that a is generated from p(). This makes joint priors unintuitive. With something like Stan, writing a ~ p(); a ~ p_2() implies a joint probability contribution, not a "simulation" statement. Similarly, one can say a ~ p(); f(a) ~ p_2(). These all translate to joint probability statements, not generative ones. As a downside, this also means it is difficult to generate random variables from the joint prior.

Future work could improve this method. I used a Normal kernel to contribute more density to the desired regions. While this is fine for unnormalized, proportional distributions, one would need to account for the (-1, 1) constraint on each element — this requires a (potentially costly) set of CDF divisions to normalize the contribution. For Stan this isn't an issue, because those normalizing terms would be constants, and constants are irrelevant for the proportional densities that MCMC needs.
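For reference, that normalization is just a difference of CDFs; a minimal sketch of a truncated normal kernel in R (illustrative, not the post's code):

# Log-density of a normal kernel truncated to (-1, 1). The pnorm difference is the
# normalizing term discussed above; it is constant when mu and sigma are fixed data.
truncated_normal_lpdf <- function(r, mu, sigma) {
  dnorm(r, mu, sigma, log = TRUE) -
    log(pnorm(1, mu, sigma) - pnorm(-1, mu, sigma))
}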

Instead, one could substitute the normal kernel for something like a location-scale parameterized, translated beta that has support on (-1,1). In that case, no CDF is required, because the support is already in the right region. Additionally, it may be more intuitive to reason with, because a translated beta is effectively how LKJ-distributed matrix elements are marginally distributed. Most generically, you could think of this approach as: $$ p(R|\eta, K, \phi) \propto \text{LKJ}(R|\eta)\prod_u K(R_u|\phi_u) $$ where $R$ is constructed as PSD, $u$ indexes the unique lower-triangular elements, $K$ is some kernel (density function), and $\phi_u$ are the parameters for element $u$ and kernel $K$.
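A minimal R sketch of such a kernel (the location/concentration parameterization here is my assumption, not code from the post):

# Log-density of a "translated beta" kernel on (-1, 1), with a location mu in (-1, 1)
# and a concentration kappa > 0 (larger kappa = more peaked around mu).
translated_beta_lpdf <- function(r, mu, kappa) {
  m <- (mu + 1) / 2   # map the location onto (0, 1)
  dbeta((r + 1) / 2, m * kappa, (1 - m) * kappa, log = TRUE) - log(2)  # log(2) is the Jacobian
}

translated_beta_lpdf(0.7, mu = 0.8, kappa = 50)  # kernel peaked near .8, no CDFs needed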

Another big benefit of this approach is that it may be possible to hierarchically model specific correlation matrix elements as randomly varying across sets of groups. Instead of providing a prior on each element, one would estimate the prior. One can already hierarchically model the entire correlation matrix, but not individual elements. This approach would likely fix that problem, and permit fully-hierarchical models. E.g., if you used a mixed effects location scale model, then location, scale, and correlation parameters could all hierarchically vary.

Lastly, shout out to Mike Lawrence and Sean Pinkney for inspiring this approach. They have both been working on problems related to underestimated, or non-PSD, correlation matrices. Without our discussions about these issues, I wouldn't have even tried this admittedly simple (but useful) approach.

Addendum 1 – Sean’s perspective and equivalent solution

I wanted to elaborate a bit more on Sean Pinkney’s parallel approach to this. By mathematical coincidence, we came up with equivalent solutions to this problem, but from different perspectives, and all within a month of one another.

In a very long Slack thread, Mike, Sean, and I were all talking about correlation matrix issues (underestimation, non-PSD). At some point in a twitter DM and in that thread, I mentioned the approach above as a possible solution to adding prior information about high correlations (and similarly, about where correlations are expected to be zero).

At the same time, Sean has been working on a Bayesian approach to finding a "Nearest PSD matrix to the target matrix", as I understand it. On a recent thread, he supplied some code for this approach.

Sean’s approach was elegant, and made perfect sense:

  • Construct a PSD matrix
  • Have target correlations
  • Say that target_cor ~ normal(estimated_cor, corpriorscale)

Therefore, the estimated matrix has to 1) be PSD, 2) fit the target_cor well, and 3) fit the data well. Consequently, it gives a "nearest-PSD solution to the target correlations and data".

Then I realized that Stan already constructs PSD matrices (using the same technique as Sean's code), and that, because the Normal density is symmetric in its outcome and location, target_cor ~ normal(estimated_cor, corpriorscale) [Sean's approach] is the same as saying estimated_cor ~ normal(target_cor, corpriorscale) [my approach].
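That symmetry is easy to verify numerically (illustrative only):

# Swapping the outcome and the location leaves the normal log-density unchanged.
dnorm(0.3, mean = 0.8, sd = 0.2, log = TRUE) == dnorm(0.8, mean = 0.3, sd = 0.2, log = TRUE)
# TRUE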

Therefore, our approaches are equivalent if using a normal kernel. I love that this happened, because it can be understood in two ways. From my perspective — One is adding prior information to the individual elements, thereby creating a joint prior on the matrix and its elements. From Sean's perspective — One is rotating a PSD matrix toward a target solution and data. But it's the same thing! I love when this happens in statistical methods. One intuition strengthens another. I suspect they may not be equal for other kernels (because for some distributional families, the outcome and the location parameter cannot simply be swapped), but I also suspect the two would always be one transform away from one another. After all, these are all just joint probability statements.

Addendum 2 – Constraints!

Ed Merkle (in the comments) brought up the idea of constraints. I think this approach could work well for constraints.

For constraining an element, $R_a$, to be effectively zero, one could use: R_a ~ normal(0, very_small_scale). It would not be exactly zero, but it would be effectively zero. This is akin to how Bayesian EFA replaces a-priori zero-constrained loadings with loadings that have spiked priors at zero.
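With the Stan model above, such a soft zero constraint can be passed in through the cor_guess and cor_sd data matrices; for example (a sketch reusing the earlier R objects, with the 0.5 and 0.01 scales chosen purely for illustration):

# Softly constrain R[2,1] toward zero while leaving the other elements loosely informed.
cor_guess <- cor(y)
cor_sd <- matrix(0.5, D, D)              # loosely informative elsewhere
cor_guess[2, 1] <- cor_guess[1, 2] <- 0  # prior location of zero for this element
cor_sd[2, 1] <- cor_sd[1, 2] <- 0.01     # very small scale: "effectively zero"

stan_data$cor_guess <- cor_guess
stan_data$cor_sd <- cor_sd
stan_data$just_lkj <- FALSE              # use the joint prior, not plain LKJ
sOut_zero <- sampling(sm, stan_data, cores = 4)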

Similarly, one could constrain one correlation, $R_b$, to be effectively equal to another, $R_a$, by using:

R_a ~ normal(mu_a, scale_a)
R_b ~ normal(R_a, very_small_scale_b)

Again, this does not enforce a hard constraint that the two are precisely equal, but rather a soft constraint that the two should be effectively equal.

Addendum 3 – An intuitive plot

  • Left pane: LKJ(1) (a uniform prior over all permissible correlation matrices).
  • Middle pane: Add density to solutions where $R_{2,1}$ is high (using Normal(.8, .1)).
  • Right pane: The resulting prior (a short sketch reproducing these curves follows below).
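The panes can be reproduced numerically with a short R sketch (assuming D = 15, as in the post's example, and that only R[2,1] receives a kernel, in which case its marginal prior is proportional to the LKJ marginal times the kernel):

# Numerical sketch of the three panes for the single element R[2,1], assuming D = 15.
# LKJ(1) marginal of one element: (r + 1)/2 ~ Beta(a, a), with a = 1 + (D - 2)/2.
D <- 15
a <- 1 + (D - 2) / 2
r <- seq(-0.99, 0.99, by = 0.005)

lkj_marginal <- dbeta((r + 1) / 2, a, a) / 2   # left pane
kernel <- dnorm(r, 0.8, 0.1)                   # middle pane
joint <- lkj_marginal * kernel
joint <- joint / sum(joint * 0.005)            # right pane, renormalized over the grid

matplot(r, cbind(lkj_marginal, kernel, joint), type = "l", lty = 1,
        xlab = "R[2,1]", ylab = "Density")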

6 Replies to “Informative priors for correlation matrices: An easy approach”

  1. This is a nice approach that I am likely to use in the future. It seems like it can also work when some correlations are fixed to 0, which can happen in certain factor analysis/structural equation models. I wonder whether it can work when two correlations are constrained equal to one another.

  2. Great! I’m glad it may see some use.

    As for constraining some to zero, I think one could say something like R_a ~ normal(0, very_small_scale); it wouldn’t be exactly zero, but it would feed information that it should be effectively zero. This is akin to the soft constraints in Bayesian EFA, in which loadings that are not a-priori free (i.e., assumed to be zero) are given spiked priors at zero.

    Similarly, for constraining two correlations to be equal, one could say:
    R_a ~ normal(mu_a, priorscale); R_b ~ normal(R_a, very_small_scale). This would likewise add information that the two should be effectively equal, even if they won't be exactly equal.

    1. Thanks, I can see how that would work. Though it might require some care to set the value of “very_small_scale”, i.e., large datasets could overpower the prior, so that what you think is fixed to 0 actually strays from 0.

  3. I have code to skip over the "known" 0s and find the nearest-PSD matrix with those elements as 0. I'll add it to the short paper Stephen and I are putting together.

    1. Hey,

      any news on the progress in that short paper? Seems to be tightly related to what I am working on (and being able to cite a document would be a plus)
