## Is the LKJ(1) prior uniform? “Yes”

This post is about how the LKJ(1) prior is indeed "uniform", but perhaps not in the way people expect it to be. To be clear – This result is not a new one, nor is this post novel, because this is a well-known result. However, I was long confused by it, and I see others confused by it as well.

The LKJ prior is a spherical prior on correlation matrices. I emphasize matrices because it is not a marginal prior on individual elements. It is commonly used in Stan, in part due to performance reasons, and in part because specifying priors on correlations separately from variances is much easier. This latter point is especially useful when estimating multivariate scale models, wherein variances can vary, but we may want correlations to remain constant or vary according to a different model.

The LKJ prior takes one parameter, $\eta \in \mathbb{R}^+$. As $\eta$ increases, more mass is placed over identity matrices (i.e., a diagonal of one; little to no correlation). When $\eta = 1$, the LKJ prior is a uniform prior over correlation matrices. Because people (myself included) are, in general, poor at visualizing and intuitively understanding multidimensional space beyond three dimensions — One may think that the elements of the correlation matrices should therefore be uniformly distributed.

However, when you sample from the LKJ(1) prior with $K > 2$ variables (i.e., more than one correlation, a $K>2 \times K>2$ matrix), and look at the marginal distributions of the elements, they are not distributed uniformly. What gives?

The answer lies in the constraints of the correlation matrix. Correlation matrices must be symmetric, and positive semi-definite (PSD). That PSD constraint alters where probability mass can exist, and where it will accumulate marginally. When you only have $K=2$ variables, this is not an issue. But when you have $K>2$ variables, then one correlation’s value constrains what other correlations can be, if the psd constraint is to be met. I.e., you cannot create a $K\times K$ correlation matrix, and fill the off-diagonals with uniform(-1,1) values and expect it to be PSD. As K increases, the chances of creating a correlation matrix from uniformly distributed elements that is also PSD rapidly increases near-zero.

The LKJ(1) prior can therefore be thought of as a "uniform prior on permissible correlation matrices"; that is, it is uniform subject to the PSD constraint.

To build that intuition, I randomly generated 50,000, $K = 4$, "correlation matrices", wherein each unique off-diagonal element is sampled from a uniform(-1,1) distribution. For each matrix, I checked whether the PSD constraint was met. I then resampled each invalid matrix until the PSD constraint was met[^1].

I then plotted the marginal distribution of $\rho_{1,2}$. For comparison, I also sampled from LKJ(1) using rethinking::rlkjcorr. The plot below shows the two overlaid. In red, I show the $\rho_{1,2}$ from the resampling method. In blue, I show the $\rho_{1,2}$ from the LKJ(1) distribution. The two are effectively identical (within sampling error). As K increases, the PSD constraint will cause the marginal distributions to be increasingly concentrated over zero. It is hard to say that it is "regularizing more" as dimensionality increases, but rather that in large matrices, the space in which the PSD constraint is met, is considerably smaller than when dimensionality is low.

You may want to know what the marginal priors for each correlation are, given a uniform LKJ and K variables. A useful result is provided, as always, by Ben Goodrich. He stated:

In general, when there are K variables, then the marginal distribution of a single correlation is Beta on the (-1,1) interval with both shape parameters equal to K / 2.

We can examine this nifty result. I took the PSD-enforced samples of $\rho_{1,2}$, and transformed them to be in the [0, 1] domain ($\tilde\rho_{1,2}^s = .5\rho_{1,2}^s + .5$, for each $s \in S$ samples). I then plotted the $\text{beta}(x| 4/2, 4/2)$ density on top of the normalized histogram. This is the result: Indeed, the marginal distribution matches perfectly with the $Beta(\frac{K}{2}, \frac{K}{2})$ distribution.

To recap: I generated the elements of a correlation matrix from a uniform(-1, 1) distribution. I then enforced an PSD constraint by resampling the matrix until PSD was met. The marginal distributions of the correlations are then equivalent to those from the LKJ(1) distribution. In other words: The LKJ(1) distribution is indeed a "uniform" prior over matrices, subject to the constraint that the matrix must be positive semi-definite. This constraint disallows the marginals from being uniformly distributed.

The LKJ prior is not the only possible prior for correlations. The Wishart can be used, provided that you divide out the diagonals. If you want to better control the marginal priors of the elements, a better option may be the matrix-F distribution — You can get effectively uniform marginal priors for the correlations using this, but it is not standard (not too hard to implement though).

# R Code

rlkj_naive <- function(N, k, enforce_spd = TRUE, as_samples = TRUE) {
n_cors <- k*(k - 1)/2
out <- list()
gen_unif_matrix <- function(n_cors, k) {
cors <- runif(n_cors, -1, 1)
mat <- diag(1, k, k)
mat[lower.tri(mat)] <- cors
mat <- as.matrix(Matrix::forceSymmetric(mat, uplo = "L"))
return(mat)
}
for(n in 1:N) {
out[[n]] <- gen_unif_matrix(n_cors, k)
}

if(enforce_spd) {
is_not_spd <- !sapply(out, matrixcalc::is.positive.semi.definite)
if(any(is_not_spd)) {
mean(is_not_spd)
warning("PSD not met; rejection sampling. ", mean(is_not_spd), " of the mats are not PSD.")
}
while(any(is_not_spd)) {
n_not_spd <- sum(is_not_spd)
out[is_not_spd] <- replicate(n_not_spd, gen_unif_matrix(n_cors, k), simplify = FALSE)
is_not_spd <- !sapply(out, matrixcalc::is.positive.semi.definite)
}
}

if(as_samples) {
out <- t(sapply(out, function(x){x[lower.tri(x)]}))
}
return(out)
}

cors <- rlkj_naive(50000, 4)
cors_better <- rethinking::rlkjcorr(50000, 4, 1)

png("psdResampling_vs_lkj.png")
hist(cors[,1], col = rgb(1,0,0,.5), xlab = expression(rho), freq = FALSE, main = "Marginal")
hist(cors_better[,2,1], add = TRUE, col = rgb(0,0,1,.5), freq = FALSE)
dev.off()

png("corTrans_vs_beta.png")
cor12_trans <- (cors[,1] / 2) + .5
hist(cor12_trans, freq = FALSE, xlab = expression(tilde(rho)), main = "Marginal (transformed)")
curve(dbeta(x, 4/2, 4/2), 0, 1, add = TRUE)
dev.off()



# Matrix F distribution (For those who are curious)

$E(V) = \nu/(\delta – 2)*B$

$\nu, \delta$ control the shape. B scales it. For correlations, you can sample from matrix F, and divide out the diagonal to produce a correlation matrix. If $\nu = K$, where K is the cov dimension, and $\delta = 3$ (the minimum value to have a defined mean), then you have a fairly uniform prior on the correlations.

  real matrix_f_lpdf(matrix cov, real nu, real delta){
int k = cols(cov);
return(lmgamma(k, (nu + delta + k - 1)/2) - (lmgamma(k, nu/2) + lmgamma(k, (delta + k - 1)/2)) + log_determinant(cov)*((nu -k - 1)/2) - (nu + delta + k - 1)/2 * log_determinant(cov + diag_matrix(rep_vector(1, k))));
}

real matrix_f_fast_lpdf(matrix cov, real nu, real delta){
int k = cols(cov);
real log_det_cov = 2*sum(log(diagonal(cholesky_decompose(cov))));
real I_Sig_log_det = 2*sum(log(diagonal(cholesky_decompose(diag_matrix(rep_vector(1,k)) + cov))));
return(log_det_cov*(nu - k - 1)/2 - I_Sig_log_det*(nu + delta + k - 1)/2);
// return(log_determinant(cov)*(nu - k - 1)/2 + log_determinant(cov + diag_matrix(rep_vector(1,k)))*-(nu + delta + k - 1)/2);
}

real f_lpdf(real v, real nu, real delta){
return((nu/2 - 1)*log(v) - (nu + delta)/2*log(1 + v));
}

real matrix_f_fast_cholesky_lpdf(matrix L, real nu, real delta){
int k = cols(L);
vector[k] L_diag = diagonal(L);
real log_det_cov = 2*sum(log(L_diag));
real log_jac = k*log(2);
real I_Sig_log_det;
for(i in 1:k){
log_jac += (k - i + 1)*log(L_diag[i]);
}
// I_Sig_log_det = log_determinant(diag_matrix(rep_vector(1,k)) + multiply_lower_tri_self_transpose(L));
I_Sig_log_det = 2*sum(log(diagonal(cholesky_decompose(diag_matrix(rep_vector(1,k)) + multiply_lower_tri_self_transpose(L)))));
return(log_jac + log_det_cov*((nu - k - 1)/2) - ((nu + delta + k - 1)/2)*I_Sig_log_det);
}
real matrix_f_B_fast_cholesky_lpdf(matrix L, real nu, real delta,real B){
int k = cols(L);
vector[k] L_diag = diagonal(L);
real log_det_cov = 2*sum(log(L_diag));
real log_jac = k*log(2);
matrix[k,k] BmatInv = diag_matrix(rep_vector(1/B,k));
real I_Sig_log_det;
for(i in 1:k){
log_jac += (k - i + 1)*log(L_diag[i]);
}
I_Sig_log_det = 2*sum(log(diagonal(cholesky_decompose(diag_matrix(rep_vector(1,k)) + multiply_lower_tri_self_transpose(L)*BmatInv))));
return(log_jac + log_det_cov*((nu - k - 1)/2) - ((nu + delta + k - 1)/2)*I_Sig_log_det);
}



# Footnotes

[^1]: This took a long time. It is highly improbable to find an PSD matrix, even when the matrix is a measly 4×4, when constructing a symmetric matrix with uniformly distributed elements. For reference, of the 50,000 initial matrices, 81.58% of them were not PSD.

## Is my lamp broken?

In this post, I want to talk about the difficulties of probabilistic inference and generalizability of claims.

The recent twitter debates about Tal’s paper The Generalizability Crisis have rendered me befuddled. Personally, I find his arguments to be obvious, conditional on reading Meehl and giving more than 5 minutes of thought toward inference and statistical testing. But, apparently, it is not obvious, and some people strongly disagree. Tal further clarifies matters in a well-written response here.

Tal’s central point, in my view, can be boiled down to the following gist:

• Verbal hypotheses should map onto statistical hypotheses, if statistical hypothesis testing is to be relevant to supporting or rejecting the verbal hypothesis.
• The generalizability of the supported or rejected hypotheses is bounded by the statistical model/test employed.
• Psychologists do not tend to map verbal hypotheses onto statistical hypotheses, but nevertheless use statistical hypothesis tests to support verbal hypotheses.
• Psychologists tend to invoke a "Weak" form of inference to support broad claims. These broad claims are supported using, literally, a weak form of inference (a confirmationist framework), and reduced statistical models that can not, and are not intended to, support claims broader than permitted by the variables entered into it, which are in turn generated from a specific sampling and collection scheme. I.e., the statistical models, used to assess a verbal hypothesis, operate on assumptions and are bound by the variables collected, which do not cleanly translate to broader hypothetical statements. More simply, a 2-group randomized experiment using measure X to measure construct F may imply that one group’s mean F score is greater than another, but that is where statistical generalization ends. Seeing such a quantitative difference, even if definitive, does not imply one’s verbal hypothesis is generally supported (e.g., across populations, across other measures of construct F, across contexts, across time), or is uniquely supported (that other hypotheses would not have made the same prediction).

Tal goes onto to offer some solutions. One solution is to include hypothetically exchangeable components into the model itself. If the verbal hypothesis makes a claim that should be evident regardless of the precise measure, stimuli, time, context, population, or subject, then the hypothesis is implicitly claiming that these variables are exchangeable in some way. Therefore, the statistical model for the hypothetical process should also treat these as exchangeable, and a random effects model could be employed.

Of course, knowing all possible sources of exchangeable variation is hard, if not impossible. Collecting data on and actually varying these exchangeable variables even more so. So a second, very reasonable, solution is to stop making claims broader than the statistical model permits. Instead of saying "Receiving an anonymous gift makes people grateful", one could instead say "Receiving an anonymous gift [in the form of digital raffle tickets for a $50 Amazon gift card on a computerized task from a fake stranger] makes [undergraduates at our Southern US University] [score higher on gratitude measure X, assuming the measurement model is specified as it was] [in a controlled double-blind lab study]." One may say, "That is not ideal", and I would say "Tough, that’s exactly what you can gain from your study." If you wish to claim support for a hypothesis more generalizable than that, then you need to generalize your design to make the nuances exchangeable. One may also say, "The verbal hypothesis does not need to be the same as the statistical hypothesis", and "The verbal hypothesis can be broad, but still supported by specific studies, or refuted by other specific studies." To the former, I disagree, for reasons I explain in this post. To the latter, I agree, but I rarely see this in psychology; I, more often than not, see a broad claim, described as "confirmed" by a weak statistical test, with some lip service paid in a limitations section. With all of this said, I really want to dive into the "Weak inference" part, and the notion that the verbal hypothesis needs to match the statistical hypothesis. # Is my lamp broken? Tal’s paper raised at least two major points of contention from critics. The first, is whether a verbal hypothesis must map onto a statistical hypothesis, in order for the statistical test to mean anything for the verbal hypothesis. The second, is in the use of the term "induction" when describing statistical inference. I am mainly going to focus on the former, with some sprinkling of the latter. Tal (and Meehl, and I) argued that a statistical test can inform the credence of a verbal hypothesis to the extent that the statistical test maps injectively (one-to-one, uniquely) onto the verbal hypothesis. I.e., if many verbal hypotheses can predict the same test result, then obtaining that test result tells you little about whether your verbal hypothesis is true. Likewise, if your verbal hypothesis implies nearly an infinite number of possible statistical outcomes, then nearly an infinite number of possible statistical outcomes would support your claim, and therefore it is no test at all. In even simpler terms, if verbal hypotheses A, B, and C all imply a quantity X, and you see X, it is hard to say A is ‘right’; conversely, if A implies any X other than zero, then you had no real statistical prediction at all, and any of the infinite possible outcomes would have supported A (non-falsifiable). Regardless of your position on Popper, Lakatos, deduction, induction, abduction, etc, the above holds. This problem is not unique to statistical inference; statistical inference merely adds a link to the inferential chain. It does not matter what your particular brand of inference is — If your hypothesis does not imply outcomes unique from other hypotheses, or if any outcome could be explained from one’s hypothesis, it is a poor hypothesis. This would break deduction, induction, and abduction alike; these all require meaningful, unique predictions. However, because critics of the paper have seemingly latched onto Tal’s use of "induction" as an argument, I am going to discuss this problem with respect to deductive logic. Again, it does not matter whether one clings to "inductive" or "deductive" inference, and arguing about the paper’s use of the word "induction" is a red herring. Let’s start with a really simple case: Is my lamp broken? I have a home office. It’s simple: A nice desktop machine that I built, a window, and a lamp. Let’s assume my blinds are closed, so there’s some ambient light, but not much. My desktop is off, so the monitor isn’t producing any light. I want to know if my lamp is broken. ## Modus Tollens The Modus Tollens framework is as follows: If A, then B. [This must necessarily be true, for the rest to follow]. Not B. Therefore, not A. For my trusty lamp then, If my lamp is broken, then the room should be dark. The room is not dark. Therefore, my lamp is not broken. This logic makes perfect sense, assuming these postulates are indeed true. It must be true, that if my lamp is broken, then the room must be dark. Then if I see that the room is not dark, I can safely conclude that my lamp is not broken. Simple. Elegant. And also, impractically naive for how science works. ## Operationalizing Let’s start by asking, "What does it mean for the room to be dark?" For the sake of argument, let’s say I have a light sensor. It detects, on a scale of 0-100, how much light there is. Let’s also assume it is perfectly reliable. I will define "dark" loosely as "notably darker than it would be, if my lamp were on". Now, I measure the light in my room at 4pm on a Tuesday in May, with my lamp off. It reads 10. Then I turn my lamp on, and it reads 20. So I can operationalize "dark" as the measure of light at 4pm on a Tuesday in May when my lamp is off. When my lamp is on, it’s at 20; when it’s not, the room is dark with a light reading of 10. Let’s continue the Modus Tollens then: If my lamp is broken, then the room should be dark. [Verbal hypothesis] If the room is dark (and if I operationalize dark as the reading observed when the lamp is off at 4pm on a May Tuesday), then my measure should be 10. [Statistical hypothesis] My measure is 20. The room is not dark. Therefore, my lamp is not broken. Notice how the hypotheses chain: If A, then B. [A verbal hypothesis, with a necessary prediction] If B, then X. [The prediction, mapped onto a statistical prediction] Not X, therefore not B. [The statistical prediction is untrue] Not B, therefore not A. [Therefore, the verbal prediction is untrue]. So far so good. We have a good enough hypothesis to generate a prediction; that prediction can be mapped onto a quantitative prediction. If that quantitative prediction is falsified, then the verbal prediction is invalidated, and the hypothesis is therefore falsified. Simple. Elegant. And also impractically naive for how science works. ## "Probabilistic" Modus Tollens We do not have perfect measures… so now we have to enter the realm of probability. My light sensor is not perfectly reliable. Each time I turn on the light sensor, it produces a slightly different answer. There’s some error to it. But the means are consistent, and the distribution of observations is fairly small when the lamp is on and off. But we do have to revise our inferential chain. If my lamp is broken, then the room should be dark. If the room is dark (and if [operationalization, model assumptions]), then the true light measure should be approximately 10. If the measure must be approximately 10, then my observation (21.5) would be highly unlikely. Therefore, the measure is unlikely to be 10; the room is unlikely to be dark; and the lamp is unlikely to be broken. Oof. That makes our statement much more tentative. It is also questionable whether Modus Tollens really makes sense once converted into a "probabilistic" modus tollens. Modus tollens is a logical argument; logical systems aren’t necessarily coherent once uncertainty is introduced into the statements. For one thing, it’s inverting the conditional probability. p(observation | 10) is not the same as p(10 | observation). For another, our quantitative prediction should be somewhat uncertain too… "if the room is dark, then my measure should be distributed N(mu, .1); mu ~ N(10, .01)". In order to have a ‘proper’ probabilistic modus tollens, we need to write the modus tollens as a probability function. Bear with me here. This part won’t be heavily mathy. The modus tollens argument can be written as a probability function, where some things have probability 1, and other things have probability 0. For example, the "If A, then B; Not B; therefore, not A" can be understood through the following probability statements. First, let’s define the probability of "A", given "Not B": $$p(A|\lnot B) = p(\lnot B | A)p(A) / P(\lnot B)$$ This looks difficult to compute, but under the standard modus tollens, it is simplified greatly. When there is no uncertainty in the logical statement, such that B must occur if A is true, then$p(B | A) = 1; p(\lnot B|A) = 0. The whole formula collapses: $$p(A|\lnot B) = 0 * p(A) / p(\lnot B) = 0$$ That is, assuming that B must occur given A, then the probability of "A" given "not B" is 0. Therefore, "Not A". Voila! Modus tollens expressed via probability. The key point of this, is that your proposition must be true. "A" must necessarily imply "B". When "A" happens, ‘B" must happen. If that does not happen, then modus tollens does not easily hold. For example, if "B" only occurs with some probability, given "A", it becomes more complicated. "If A, then probably B [let’s say, .95]; not B, therefore …" looks like this: $$p(A | \lnot B) = .05 \times p(A) / p(\lnot B)$$ Once you enter probability space, you have to start justifying some probabilities. It is no longer straight forward. So… let’s return once again to my possibly broken lamp. This is going to get complicated. If probability statements overwhelm you, then your take-away from this section should simply be: "Inference under uncertainty is extremely hard, and not as straightforward as Modus Tollens would suggest." It is fine if you do not understand this portion; that is largely the point of this section! With uncertainty (in data, in parameters, in statements), inference becomes much more complicated. Let’s express the probabilistic modus tollens of the lamp problem. Letters in parentheses represent conditions to include in the probability algebra. If my lamp is broken (A), the room must be dark (B). [A necessary implication] If the room is dark (B) (and if [operationalization, model assumptions]), then the true light measure (C) should be approximately 10. [A probabilistic statement] If the true light quantity should be around 10 (C), then my observation (D) would be highly unlikely [A probabilistic statement] So what is the answer? I want to know whether my lamp is broken. Unfortunately, there is uncertainty in my predicted value (due to measurement error in my tool). Argh! What I do know, is "D"; I have an observation. \begin{align} p(A | D) &= p(D | A)p(A) / p(D) \\ p(D | A) &= \int p(D | C) p(C | B) p(B| A) \end{align} We need to know a number of things. What is the probability of the true light measure, given that it is dark:p(C | B)$. What is the probability of my observation, given the true measure in darkness:$p(D | C)$. What is the a-priori probability of my lamp being broken: p(A) [e.g., maybe it’s broken many times in the past; maybe it is 30 years old; maybe this brand has a known defect; maybe it is currently sizzling and smoking]. This is.. difficult, obviously. If I previously, repeatedly measured the ambient light when the room is ‘dark’, then I can form a distribution for$p(C|B)$. If we take for granted that the room is always considered dark when the lamp is broken, then$p(B|A) = 1. If we know the distribution of measurement error for our light sensor, then we can specify the probability of the observed value given the true ambient light value. This is all possible, for sure. But the point is to highlight: Modus Tollens is much more difficult when the logical statements become probabilistic statements. It is unclear whether this is even considered deductive, since it includes uncertainty instead of merely deterministic statements. Given that it is no longer defined by logical, binary statements, but rather probabilistic ones, it appears more similar to induction than deduction, despite having a deductive bent. It is not straightforward, and I have never, in my life, seen a paper in psychology actually reason through, then compute something like this. ## Random variation Moreover, this is still a simplified example. We have thus far only talked about my measurement error. There are all sorts of external sources of variation that insert themselves into the system. What if I randomly sampled the light sensor across the year to collect data on baseline light levels when the lamp is on and off? Then there are new conditions, new sources of uncertainty. Recall that I previously collected light data when the lamp is on and off, in order to operationalize and quantify what I consider "dark" and "not dark", on a May Tuesday at 4pm. My statistical inference is based on whether my current light data is improbably larger than would be expected assuming the light measure is under the operationalized ‘dark’ regime. But across multiple data collections, there can be several random sources of light variation that are irrelevant to my question, but nevertheless affect my data collection process. If I collected light data across multiple rooms, multiple days, multiple seasons, etc, then there are all kinds of effects on my light data. Whether my computer monitor is on matters. Whether it’s 6pm in the summer, or 6pm in the winter matters. Whether I blocked my lamp with a giant coffee mug matters. Where I placed the light sensor matters. These are all stochastic, and these all change the distributions of uncertainty. They all change what the expected amount of light would be, what I would consider ‘dark’, and how much error there is in my measurement itself. Importantly, I would have to change my model to account for these things if I wanted to claim my lamp is, indeed, not currently broken at any point in time. I would need to estimate and control for the effects of these stochastic influences, if I wanted to know whether, in a given moment, my lamp is not broken. If I wanted to know whether my lamp is broken, and I broadly operationalize the verbal prediction "the room is dark" as "the amount of light present when the lamp is off [no matter what time of day, whether the monitor is on, the weather, the sensor distance, etc]", then I would need to model and control for all the sources of light variation that are irrelevant to that broader prediction. What happens if I do not account for these things? Well, it really depends on how you collected your data. If you collect data in a very controlled environment first (e.g., holding sensor location constant, collecting repeatedly in a short time frame, turning the lamp on and off to obtain a distribution of samples from each condition), then you may not need to include other sources of variation. However, any inference about your lamp is now conditioned on that exact circumstance. For you to say "my lamp is not broken", it is necessary to say "my lamp is not broken, assuming this exact sampling scheme, my model assumptions are met, no other sources of variation are present, and my context is the same as when my data were collected on that sunny May Tuesday afternoon, on which I operationalized what ‘dark’ even means". A harder sell, for sure, but it’s also more honest. If you collect data across the year, while recording all the different sources of variation you know about, then the story changes. You want to know if your lamp is broken, given all this data and known sources of random influences. You obtain your single observation from the light detector, and throw it into a model that does not account for any of these variations. Well, obviously, whether your inference is correct is a crapshoot. You’re comparing it to [uncertain] values averaged across a whole host of random influences. Because you are not accounting for known random influences, your comparison may be completely wrong, and lead to the wrong conclusion altogether (inflated error rates). Moreover, because you are excluding the uncertainty of these effects, you are more confidently wrong too. You may say "my lamp is not broken", even though it is, just because you’re measuring at 10am on a sunny day with the blinds open, and failed to account for those effects. If you wanted to test the broader prediction that the room is dark, broadly operationalized as "the amount of light present when the lamp is off or broken", then you would need to broaden your model as well, to obtain a predicted light level after controlling for the various random influences that bear no importance to the lamp itself. To reiterate, the role of the expanded model is to allow a generalized verbal prediction to map onto a generalized statistical prediction — One that is valid across various hypothetically unimportant nuances. If you wanted to test the prediction that the room is dark, and strictly operationalize "dark" as the ambient light present when the lamp is off on Tuesday, May 12, at 4pm in Davis, California; then you don’t need to control for other various things. Of course, that also means your hypothetical claim of support is limited to that specific measure, at that specific place. So, you don’t get to claim that your "lamp is not broken, because the room is not dark", but rather your "lamp is not broken because the room is not dark, assuming we define darkness as the amount of ambient light in this specific place, with this specific measure taken at this time." This is analogous to what Tal is talking about. For one, we need our verbal hypotheses to imply something statistically precise (an injective mapping between the two), if we want a statistical inference to correspond to a verbal inference. Then for generalizable inferences to be accurately stated, we either need to model the hypothetically exchangeable conditions (e.g., via random effects models), or we need to limit our inferential claims to the boundaries of our conditions, operationalization, and design. In other words, we can either vary, model, and account for the conditions over which we wish to broadly generalize; or we can clearly state the conditions assumed and given, and constrain our expressed hypothetical support appropriately. We can either integrate out known random effects, or we cannot, and instead keep our claims conditional on those effects. # What happens in Psychology? The lamp example is not perfect, but it’s a seemingly simple inferential problem that can complicated by introducing common research problems into the mix. Psychology, in reality, has a much harder problem. At least with the lamp, we can directly see and toggle the light. Psychology, on the other hand, largely deals with intangible, unobservable constructs in complex, ever-changing organisms. The uncertainty only compounds. Complications aside, Psychology does not seemingly use any acceptable form of inference. Consider the following normative inferential chain, void of any probability statements: If [verbal hypothesis], then [verbal prediction]. If [verbal prediction], then [non-zero quantity]. Non-zero quantity, therefore [verbal prediction]. [verbal prediction], therefore [verbal hypothesis]. This is problematic: 1. This is not modus tollens, nor valid deductive logic, despite critics seemingly arguing that it is. It simply isn’t. This is 100% an invalid statement. It breaks down because the statistical hypothesis is committing a logical fallacy. You cannot falsify by confirming, and yet this is what this statistical hypothesis is doing. It therefore breaks the inferential chain, rendering it fallacious. Yes, this implies that normative NHST is not deductive. 2. There is no unique mapping between the verbal and statistical hypothesis whatsoever. There are likely to be any number of alternative hypotheses for why the quantity is not exactly zero. Whether it’s Meehl’s crud factor, or differences in stimuli, or experimenter effects, or whatever else, this verbal hypothesis effectively predicts nothing, by predicting anything. Because any number of things can also predict a non-zero quantity, we therefore have extremely weak support for our hypothesis – In actuality, it merely rejects any hypothesis that would predict zero, and retain all other possible hypotheses. Unfortunately, this inferential chain is commonly used, and nevertheless wrong. You cannot affirm the consequent. More obviously, rejecting the statistical hypothesis implied by a completely different verbal hypothesis, does nothing to support your verbal hypothesis that makes a completely different statistical prediction, if any prediction was made at all. Ok, so let’s change our testing behavior to something that is not entirely fallacious. If [verbal hypothesis], then [verbal prediction]. If [verbal prediction], then [non-zero quantity]. Not [non-zero quantity], therefore not [verbal prediction]. Not [verbal prediction], therefore [not verbal hypothesis]. Now we have ourselves a modus-tollens. How did we do it? By completely reversing the role of null hypothesis testing. We do not test a prediction that we did not posit (a zero). Instead, we test the prediction that we did posit (non-zero). Note, that a normative NHST is not necessarily fallacious, it is the way in which we use it that makes it fallacious. A valid modus tollens for a nil-null hypothesis would be: If [verbal hypothesis], then [verbal prediction]. If [verbal prediction], then [zero-quantity]. Not [zero-quantity], therefore not [verbal prediction]. Not [verbal prediction], therefore not [verbal hypothesis]. But this would imply our hypothesis predicts a zero. If we observe no zero quantity, then our hypothesis is falsified. It says nothing of any other hypotheses, only that whatever hypothesis implies a value of zero, is to be rejected. See how that is different from the fallacious chain above? However, it still is not this simple. The above assumes no probabilistic statements. Once again, this will break down considerably once we do. Back to our non-zero inferential chain: If [verbal hypothesis], then [verbal prediction]. If [verbal prediction], then probably [non-zero quantity]. Probably not a [non-zero quantity], therefore probably not [verbal prediction]. Probably not [verbal prediction], therefore probably not [verbal hypothesis]. This once again requires us to create a joint probability model. I will spare you this here, but the probability algebra is straightforward. Getting the distributions, however, may not be so straightforward. ## Inferential chains must be thoroughly justified Even if you can test in a logically consistent manner, from verbal down to the data, you will have to justify each of these steps. You cannot just state "If A, then B"; you have to either prove that to be the case, or justify why such an assumption is made. Modus Tollens depends on the validity of each logical consequent. Otherwise, you can make nonsense inferences, in a logically structured way. For example: If unicorns are fictional, then my coffee mug would be full. (If A, then B) My coffee mug is not full. (Not B) Therefore, unicorns are not fictional. (Therefore not A) This is, of course, ridiculous. Even though it does indeed follow modus tollens, the assumptions are completely wrong. For this to be valid, I would need to prove, or sufficiently argue, that if indeed unicorns are fictional, then my coffee mug would be full. Psychology is not as ridiculous, but the requirements are exactly the same, from the verbal hypothesis to the statistical. Therefore, every single link in the inferential chain, from the verbal hypothesis, to the statistical hypothesis, must be coherent, justified, or proved. # Inference is hard Hypothesis testing therefore has some high requirements, that are rarely, if ever, met in psychology. 1. The verbal prediction must be justified, and adequately linked to any generative verbal hypothesis. 2. Any statistical prediction must be injectively mapped, and adequately linked to any generative verbal prediction, if a statistical inference is to inform the support for or against a hypothesis against any others. 3. Any stated support for a verbal hypothesis is necessarily constrained — bound by the assumptions, explicit or implicit, of the model, design, and operationalization used to ultimately test it. Every assumption in the design and model is ultimately added as a logical conditional, or a conditional in the probability model. There are many to consider, and I barely scratched the surface here. In essence, every "If" statement in the inferential chain is added to a global list of "If" statements that must precede the claim of hypothetical support; one cannot simply state the hypothesis and whether it was supported. To reiterate, this includes all implicit and explicit assumptions in the statistical model (e.g., distributional shape, whether stimuli have an effect or not, whether a measure includes error or not), because these ultimately underlie how a statistical prediction is formulated, and what is treated as exchangeable. In sum, hypothesis testing is hard. Inference is hard. Formulating valid, justifiable, precise predictions is hard. Once uncertainty is introduced to those predictions, it is made much harder. It’s not as straightforward as some seem to think. Probability makes logical argumentation explode into a cascading set of assumptions, constraints, and probability equations. Making precise, unique statistical predictions that map onto a verbal prediction is uniquely difficult in psychology, where "[measures are] made up, and points don’t matter". And unfortunately, I do not see an easy way out of it. We can continue on as normal, and make unjustifiably broad claims from incompatible models, using fallacious confirmationist arguments masquerating as falsificationism; or we can temper our claims, operate exploratively, and make careful predictions when we can. # Communication is hard Lastly, I’d like to mention that communication about inference is difficult. Even a broad statement like "Bears are brown" can be taken two opposite ways from people. One person may take that to mean "Bears can be brown, but may not necessarily be brown in all places, across all species." I.e., one answer to the question "What is brown? — Bears are brown!". Another person may interpret it as "Bears are always brown, no matter what random place or species you choose," and take issue with such a broad claim. I.e., one answer to "What describes a bear? — Bears are brown!" Some may think in terms of "What outcome could my hypothesis predict?" Vs "What hypothesis could predict my outcome?"; and these people would interpret a broad statement differently. Part of me thinks some of the debate surrounding Tal’s paper is actually a "Bears are brown" problem. If someone said "Receiving an unrequested gift makes people feel grateful," I take that to be a broad statement, applicable across measures, gifts, persons, etc; and I may criticize that paper to say "Only if you measure gratitude as you do, and only if the gift is valuable, and only if there is no expectation of reciprocation." Others may interpret as "Receiving an unrequested gift can make people feel grateful," and have no issue with it failing to generalize. I think Tal and I view such broad hypotheses ("Bears are brown") as ones to be generalized ("Bears are always brown"); to support such a claim, they would then need to model and control for hypothetically exchangeable conditions across which such a broad claim should be valid; and then map a verbal prediction onto the quantity of interest. Others may view such broad hypotheses ("Bears are brown") as non-generalized claims to be tested ("Bears can be brown; when are they not?"). This is fine too. However, I do not often see papers making claims so that others can test them — They usually seek to confirm generalized claims from specific operationalizations/conditions/contexts, using faulty inferential logic, then attempt to explain away (find moderators for) specific replication failures without modifying the broad claim. So really, the point is moot to me — Regardless of how you interpret broad claims, any serious attempts to generalize such statements or to refute them with logically coherent argumentation seems extraordinarily rare. Nevertheless, these two interpretations may contribute to the conflict. ## LaTeX in Slack! I just joined a new Slack workspace, and was growing weary of using the codecogs equation editor just to paste in an image of some latex math into Slack. It took a bit of effort, but I thought I would document the process for anyone who is curious. You need: – Slack, obviously – A Linux server, accessible from anywhere – Familiarity with some command line I have a digital ocean droplet hosting this webpage, running Ubuntu 16.04. First, SSH into your server. git clone https://github.com/colbygk/mathslax.git sudo apt-get install nodejs npm cd mathslax  Second, you will probably need an updated version of node. sudo npm install -g n sudo n latest  Third, get mathslax ready. make install npm audit fix  Fourth, navigate to slack chat -> Workspace name -> Customize slack -> Apps -> Search for Slash Commands, install that integration. Go to Custom Integrations -> Slash Commands -> Add Configuration Enter some settings: Command: /math URL: http://yourServerName:yourConfiguredPortNumber/slashtypeset Customize Name: LatexBot  yourServerName should be your webpage name; e.g., mine is srmart.in yourConfiguredPortNumber should be the port that the mathslax server will listen on. E.g., 7777 Write down or copy the Token that is generated for you. Fifth, in your SSH session or server, create a simple script for launching the mathslax server #!/bin/sh cd /usr/local/bin/mathslax SERVER=yourServerName PORT=yourConfiguredPortNumber SLACK_AUTH_TOKEN="TheToken" node server.js  Save that to /usr/local/bin/tex.sh, and mark as executable sudo mv tex.sh /usr/local/bin/tex.sh sudo chown root:root /usr/local/bin/tex.sh sudo chmod a+x /usr/local/bin/tex.sh  Sixth, move mathslax to a location. I did not follow best practices; this should probably be in /opt, /usr/share, or some other directory, but whatever. cd ~ sudo mv mathslax /usr/local/bin/ sudo chown -R root:root /usr/local/bin/mathslax  Finally, create a systemd services script, and place it in /etc/systemd/system/mathslax.service [Unit] Description=MathSlax After=network.target [Service] ExecStart=/usr/local/bin/tex.sh Restart=on-failure [Install] WantedBy=multi-user.target  Then: sudo systemctl enable mathslax.service sudo systemctl start mathslax.service  You may need to enable the port in your firewall. If using UFW, then it’s simply sudo ufw allow yourConfiguredPortNumber In slack, you can then type: /math \begin{align}y &\sim p(\mu, \sigma) // \mu &\sim N(0,1) \\ \sigma &\sim N^+ N(0,1)\end{align}  And get an image of those three aligned sampling statements. # Edit Part two. The bot method. On your server, git clone https://github.com/sand500/SlackLateX.git cd SlackLateX npm install fs npm install request npm install websocket  Create a file /usr/local/bin/texbot.sh with: #!/bin/sh cd /usr/local/bin/SlackLateX node ./bot.js  Create a file /etc/systemd/system/SlackLateX.service with: [Unit] Description=SlackLateX After=network.target [Service] ExecStart=/usr/local/bin/texbot.sh Restart=on-failure [Install] WantedBy=multi-user.target  Go to your slack chat workspace -> Apps -> Install Slack bots -> Add configuration: Customize name: latexbot  Copy the API token, paste it into /usr/local/bin/SlackLateX/secret.txt Then simply invite latexbot into the channel, and start the bot sudo systemctl start SlackLateX Now you can type \pi$ to get a rendering of the pi symbol (or any other latex equation). This method improves upon the first method because it’s channel-specific, and you can edit latex code, which will re-render. ## A foray into Bayesian handling of missing data I have been, uh, “blessed” by the data gods for most of my research data, in that I really rarely have non-planned missing data. Seriously. Most of my research has involved surveys, lab experiments, or within-subject repeated measures, and for some reason, I just rarely had missing data. The missing data was small enough to ignore (like, less than 1%). Consequently, I’ve never really had a need to “handle” missing observations. Sure, I’ve dealt with models wherein some unknowns are treated as missing data, like latent scores or groups in latent variable models, but that’s not really the same thing, now is it? Those were “known-unknowns”. I’ve also had data where missingness is planned and ignorable, like a between-subjects manipulation of which response condition a repeated-measures task requires. I use Stan or brms for nearly every analysis I do, and I have constructed some fairly complicated stan models. Multilevel SEM with non-normal residual distributions and moderation? Check. Assessing differential item functioning or measurement variance through item model competition? Check. Simultaneously estimating the probability that some item is DIF across latent groups? Check. Full information meta-analytic path models? You betcha. And so much more (Seriously, Stan rocks). But missing observations? Never dealt with it. Inspired by an assignment for a course, I decided to dive in and see just how bad missing data handling is in Stan. # The model The data has 6 columns: read, parents, iq, ses, absent, and treat, roughly corresponding to a reading score, number of parents (0 being 1, 1 being 2), IQ, socioeconomic status, number of absences, and whether the person was involved in the reading improvement treatment. Simple enough. The goal is to estimate the basic linear regression, read ~ parents + iq + ses + treat, which is of course very easy. However, there’s fairly substantial missingness in read, iq, and ses. One-third of the IQ scores are missing, 29% of SES is missing, and 14% of reading scores are missing. So what do you do? The two most common methods are multiple imputation and full information maximum likelihood. I specify the model in lavaan. lavMod <- ' read ~ 1 + parents + ses + iq + treat '  By default, lavaan uses listwise deletion, which is not so much a way of “handling” missing data as it is ignoring it and hoping we have enough N to say anything at all; and that’s only if data are MCAR, which is basically an insane assumption. In reality, we can expect data to be MAR, and listwise deletion will result in some bad estimates. Easy enough to fix in lavaan; to use FIML, you just add missings='fiml' as an argument. lavModOutFiml <- sem(lavMod,data=ds,missing='fiml')  All FIML really does, is change the estimation technique. Instead of operating only on covariance matrices, the estimator maximizes a likelihood function that is at the observation-level, then I think it integrates out the missings. This allows every observed variable to provide information to the model, and share information for missing variables. Or we could use multiple imputation, which is fairly easy as well. library(semTools) lavModOutMI <- runMI(lavMod,data=ds,m=20,miPackage='mice',fun='sem')  Multiple imputation generates$M$datasets using, basically, gibbs sampling for the missings. Then you estimate the model on each dataset and pool the estimates and compute the total standard errors. In effect, this also integrates out the missings, and is essentially a less principled Bayesian method. Easy enough, but onto Stan! # Bayesian Missing Data The nice thing about Bayesian modeling, is that there is not really a clear line between parameters and mere “unknowns”. Really, observations are known, and everything else is unknown. The goal is to condition on those knowns to make probabilistic claims about the unknowns. Usually when people talk about unknowns, they mean parameters, but that is needlessly restrictive. Missing data are unknown, latent groups and states are unknown, latent scores are unknown, but none are “parameters” per se. In order to “handle” missings, we merely need a model for them; then any posteriors of interest can be computed after marginalizing across it. No need to scrap entire rows of data — Just model the missings with the observed quantities, condition on the known and unknown data, then marginalize. In this way, missing data handling in Bayesian models is very natural. No external imputation needed; no fancy algorithm required. Missing data are merely part of the joint probability system. Most realizations were observed with absolute certainty; some were not observed, but are informed by what is observed. Take multiple regression as an example. Let X be the non-missing predictors,$\tilde{X}$the missing predictors,$\sigma$is the residual standard deviation,$\beta$is the vector of regression coefficients, y is the outcome,$\mu$is the vector of means and$\Sigmathe covariance matrix for a multivariate normal distribution on the predictors. \begin{align} p(\beta,\sigma,\mu,\Sigma|X,\tilde{X},y) &\propto p(y | X, \tilde{X},\beta,\sigma)p(\tilde{X}|X,\mu,\Sigma)p(\mu,\Sigma,\beta,\sigma) \\ p(\beta,\sigma,\mu,\Sigma|X,y) &\propto \int p(y | X, \tilde{X},\beta,\sigma)p(\tilde{X}|X,\mu,\Sigma)p(\mu,\Sigma,\beta,\sigma) d\tilde{X} \end{align} Essentially, we impose a multivariate normal distribution on the predictor variables, with unknown mean and covariance parameters. The known predictors inform the mu and covariances, which in turn inform unknown scores. Both the known and informed unknown scores predict y, and this in turn also informs the unknown scores (It’s a joint probability system, after all). The goal is to obtain the marginal posterior of the parameters of interest, and to do so you must integrate over the unknowns, including unknown scores. MCMC is there to help us approximate integrals and expectations. Importantly though, MCMC samplers are essentially imputing the unknown data points just like multiple imputation, but the model also uses full information likelihoods to inform the model. Bayesian handling of missing data therefore sits somewhere between multiple imputation and FIML-like techniques. From a mathematical perspective, it looks like FIML. From an estimation perspective, it looks like multiple imputation. To stan! # The Stan model, decrypted Let me premise this section by saying: The Stan code I show below is not optimized. It is written for clarity, not for speed. There are several ways of optimizing this, but for a one-off model, it’s not critical. Additionally, there are multiple ways of handling missings in Stan; the one I show below seemed easiest to me, even though it is an expensive method. Stan hates NA values. Stan (or I assume, their C++ header and libraries) has no concept of missing values, and has no way of representing them. So we need to do two things. We need to save which values are missing, and also replace those missing values with a temporary value. This R code accomplishes those goals: missings <- lapply(ds,function(var){which(is.na(var))}) # For each variable in ds, which rows are NA nMissings <- lapply(missings,function(var){length(var)}) # How many NAs? names(missings) <- paste0(names(missings),'_index') # Rename indices to variable_index names(nMissings) <- paste0(names(nMissings),'_nMiss') # Rename n_miss to variable_nMiss ds2 <- ds # Work with copy of dataset ds ds2[is.na(ds)] <- -100 # Magic number to remove NAs  For clarity, this is what missings looks like: > missingsread_index
   7  22  24  30  40  42  50  59  62  70  86  87  93  96  99 104 111 120 139 148 150 157 172 174 180 190 192 200 209 212 220 236 237 243 246 249 254 261 270 289 298 300

$parents_index integer(0)$ses_index
   1   7  17  21  22  24  30  31  44  50  52  59  60  61  62  64  65  68  72  82  83  86  89  90  92  94  96  98  99 101 104 115 119 125 126 129 138 139 141 144 145 148 150 151 157 167 171 172 174 180
 181 194 200 202 209 210 211 212 214 215 218 222 232 233 236 239 240 242 244 246 248 249 251 254 265 269 275 276 279 288 289 291 294 295 298 300

$iq_index  8 11 15 17 21 22 24 31 40 52 59 60 61 62 68 70 71 73 74 77 80 82 85 86 89 90 92 93 94 95 96 97 98 99 104 108 110 112 114 115 119 120 122 124 125 135 136 138 139  141 158 161 165 167 171 172 174 181 190 202 209 210 211 212 218 220 221 223 224 227 230 232 235 236 239 240 242 243 244 245 246 247 248 249 254 258 260 262 264 265 269 270 272 274 275 285 286 288  289 291$absent_index
integer(0)

$treat_index integer(0)  Time to compose the Stan model. The data block is as follows: data { int N; vector[N] read; // Data vectors vector[N] parents; vector[N] ses; vector[N] iq; vector[N] treat; int read_nMiss; // N missing from variables; excluded have no missings int ses_nMiss; int iq_nMiss; int read_index[read_nMiss]; int ses_index[ses_nMiss]; int iq_index[iq_nMiss]; }  N is defined as the number of rows in the dataset (number of observations). The outcome variable vector and the four predictor vectors are expected. The number of missings for the three variables containing missing values are expected. Finally, an integer array for the vector indices containing missings is expected for each variable with missings. The parameters block: parameters { // Structural vector beta; real intercept; real<lower=0> sigma; // MVN covariates; for imputation cholesky_factor_cov Sigma; vector Mu; // Missings vector[read_nMiss] read_imp; vector[ses_nMiss] ses_imp; vector[iq_nMiss] iq_imp; }  The structural parameters are the four regression coefficients, the intercept, and sigma — Corresponding to the model$y \sim \text{Normal}(X\beta,\sigma)$. The multivariate normal parameters include a cholesky-factorized covariance matrix$\Sigma$, and$\mu$-vector; the known predictor values will inform the parameters to a multivariate normal distribution, which will in turn inform the unknown values for these variables. The three remaining vectors correspond to the unknowns of each variable. The _imp should be read as “imputed”. Transformed parameters block: transformed parameters { // Recombine data matrix[N,5] Data; // 1: read, 2: parents, 3: ses, 4: iq, 5: treat Data[,2] = parents; Data[,3] = ses; Data[ses_index,3] = ses_imp; Data[,4] = iq; Data[iq_index,4] = iq_imp; Data[,5] = treat; Data[,1] = read; Data[read_index,1] = read_imp; }  Here, we combine the observed and missing data into a single data matrix, called Data. Normally, I would put this in the model block, but I hope to use this in generated quantities as well, and I do not want to copy and paste all of this. We cannot merely edit the data vectors provided in the data block, because Stan disallows it. Instead, a data matrix is created, and modified to include the estimated missing values. This block is straight forward. An Nx5 matrix is created named Data, and I create a little key corresponding to which columns should represent which variables. Each column is initially defined to be the corresponding vector provided in the data block. For those three variables with missings, the indices with missing values (which we set to -100) are replaced with the “imputed” value for that person. Model block: model { // MVN imputation for(n in 1:N){ Data[n,2:4] ~ multi_normal_cholesky(Mu,Sigma); } // Priors beta ~ normal(0,2); intercept ~ normal(0,2); sigma ~ cauchy(0,1); Mu ~ normal(100,10); Mu[1:2] ~ normal(0,2); //// Sigma given uniform prior implicitly Data[,1] ~ normal(Data[,2:5]*beta + intercept, sigma); }  For each person, the parents, ses, and iq quantities (whether it be observed or unknown) are assumed distributed multivariate normal with$\mu$means and$\Sigma\Sigma’$covariance ($\Sigma$is a cholesky factor). Any observed data contribute to the likelihood, and thus inform these unknown parameters. Any unknown data are simulated (in a sense) from the distribution. And yes, it is weird to assume the number of parents is normally distributed; I am ignoring that for now for the sake of simplicity, because all I care about is the covariance, and I am not imputing the parent variable. Note that I could have included all predictors into the multivariate normal, but treatment is completely orthogonal to every other variable, and was excluded for simplicity. Priors are loosely defined by values I think are plausible given the scales of the variables. The only odd looking one out is Mu, but that corresponds to IQ, and a-priori I can assume the mean is about 100, and extreme means are very unlikely. Finally, read is assumed distributed normally about a value predicted from the known and unknown data. Note that unknown read values are likewise predicted or imputed from the model, although I do not think it has any impact on the other parameters. Generated quantities: generated quantities { cov_matrix Omega = Sigma*(Sigma'); real<lower=0,upper=1> R2; { vector[N] muHat = Data[,2:5]*beta + intercept; R2 = (variance(muHat))/(variance(muHat) + variance(Data[,1]-muHat)); } }  In this block, I compute the covariance matrix of the three predictors involved in imputation. The$R^2$value is computed on the full data as well. Then we run Stan. I only monitor the parameters of interest, and not the imputed data. stan_data <- c(missings,nMissings,ds2,N=nrow(ds2)) stanOut <- stan(model_code = stanMod,data=stan_data,cores=4,pars = c('beta','sigma','intercept','Mu','Omega','R2'))  It stacks up well to lavaan’s FIML and MI output. The model above produced the “Bayes” line near the bottom. The Bayesian model looks very similar to the FIML estimator from lavaan. Auxiliary variables can also be used, and a model with an Auxiliary variable for the multivariate normal imputation method is reported on the final line of the table. I won’t put the stan code here, but the gist is: Don’t predict the outcome with the Auxiliary variable; permit the Auxiliary variable to covary with all the predictors in order for it to add information to the unknown predictor values. You can also have the auxiliary variable covary with the residual of the outcome variable (requiring a multivariate normal response model) to inform imputed outcomes. An alternative is to have all the predictors additionally predict the auxiliary variable, the residuals which covary with the outcome variable residuals. The former is a saturated covariate model, the latter is an added DV model; both accomplish the same goal of informing both missing predictors and missing outcomes. In the end, I was pleasantly surprised with how easy it was to handle missing data in Stan. Ideally, you specify your generative model, and just combine the known data with the unknown data. The known data will inform the unknown data through its influence on the unknown parameters. In this case, I simply chose to model the exogenous variables as multivariate normal, which permitted unknown data to be informed and used along with the known data to predict the outcome of interest. For MVN imputation: • Save which observations are missing, and how many, from each variable. • In addition to the typical parameters, include parameters for a multivariate normal. • Include a “parameter” per missing value. • Combine the known data with the unknown data into a new data structure. • Model exogenous variables as multivariate normal. • Use full combined data in the model • Thanks to MCMC, marginal posteriors will already be integrated over unknown data and parameter uncertainty. ## Demystifying the Bayesian analysis of ego-depletion At SPSP 2018, there was the big reveal of the awesome Many Labs replication report of ego depletion effects. Thirty-six labs participated, and not even the labs had yet seen the full results. I was in my room decompressing a bit (conferences are exhausting), but following along through Sanjay Srivastava’s live tweeting the session. The anticipation for the results was palpable even on twitter. I had to rush down there to hear the results myself. I ran in just in time to hear the results. Wagenmakers couldn’t be there to deliver the results of a Bayesian analysis himself, unfortunately. The speaker did their best to convey the results, but I had the sense that some nuances were missed a bit. I want to reiterate: The speaker did a good job! But there are some subtleties to these analyses that unless you’re a fairly experienced Bayesian, you will miss or under-emphasize. So this post isn’t to criticize the speaker at all! The gist I got from the Bayesian results was that: • The data are approximately 4x more likely under the point-mass null model than under the a-priori specified model. • The meta-analytic effect is approximately .08. (I mistakenly read it as .06, but others told me .08; I was in the back, sorry!) But it seemed like the take-home message was “The null was favored; but the meta-analytic effect is .08, so let’s instead talk about how we think about small effects”. Those I spoke with after the session had the same take-home message — The effect exists, but it’s tiny:$d = .08$. It’s not actually this simple. The “real” best-estimate is, according to Wagenmakers, about$d = .016$. The “problem” is that the Bayesian meta-analytic method used provides a conditional effect estimate — An effect conditioned on the assumption that an effect is non-zero and positive. To non-Bayesians, this is easily ignored. To Bayesians, this is a critical piece of information, because that condition completely changes the point estimates. Let’s dig into the method, as best as I currently understand it. Keep in mind, please, I have absolutely nothing to do with this analysis; I’m not involved in this project. Many things I type here may be incorrect, because I simply do not know the details of the analysis. Also worth mentioning: I have no invested interest in the approach being correct, either professionally or personally. Quite honestly, I have had a lot of disputes with Wagenmakers et al’s approach to inference, because not all Bayesians are alike, and I generally hate model averaging and Bayes factors. With that said, if I am correct about how the Bayesian approach was performed, I think it was a good approach for the purposes of this project. Kudos to the entire Many Labs project, and kudos to Wagenmaker et al for their analysis. # The two meta-analytic models The slides mentioned that both a fixed-effects and random-effects meta-analysis was performed, and the overall effect was estimated using both. This does not mean the data were used to “choose” a fixed or random-effects model. Instead, this method weights the “overall” estimate by the meta-analytic model probability. A fixed-effects meta-analysis in Bayes looks like this: $$p(\theta|y_1,\ldots,y_k,MA=1) \propto p(y_1,\ldots,y_k|\theta)p(\theta|MA=1)$$ Or with sufficient statistics… $$p(\theta|\hat\theta_k,\sigma_{\hat\theta_k},MA=1) \propto p(\hat\theta_k|\sigma_{\hat\theta_k},\theta,MA=1)p(\theta|MA=1)$$ Essentially, either one is simply stating that there is one fixed effect, and each dataset ($y_k$) or estimate-information combination ($\hat\theta_k,\sigma_{\hat\theta_k}$) is an instance of that fixed effect. The random effects meta-analysis looks similar, except that each estimated effect is assumed to come from a distribution of effects with an unknown variance and mean. The mean is taken to be the expected value of the effect. $$p(\Theta|y_1,\ldots,y_k,\theta_k,\sigma_{\theta_k},MA=2) \propto p(y_k|\theta_k,\Theta,\sigma_{\theta_k})p(\theta_k|\Theta,\sigma_{\theta_k})p(\Theta,\sigma_{\theta_k}|MA=2)$$ Or with sufficient statistics… $$p(\Theta|\hat\theta_k,\sigma_{\hat\theta_k},\sigma_{\theta_k},MA=2) \propto p(\hat\theta_k|\sigma_{\hat\theta_k},\theta_k,\Theta,\sigma_{\theta_k})p(\theta_k|\Theta,\sigma_{\theta_k})p(\Theta,\sigma_{\theta_k}|MA=2)$$ In order to get just the posterior for$\theta$or$\Theta$(fixed effects, respectively), you’d have to integrate out the unknowns, which I’ll skip for the time being, because MCMC does that for you. But across fixed-effects and random-effects models, what is the posterior distribution? Well, you have to integrate over the particular models themselves. Assuming we’re working with the raw data vector,$y: \begin{align} p(\Theta|y) &= p(\Theta|y,MA=2)p(MA=2|y) + p(\theta|y,MA=1)p(MA=1|y) \\ p(MA=1|y) &= \frac{p(y|MA=1)p(MA=1)}{p(y)} \\ p(MA=2|y) &= \frac{p(y|MA=2)p(MA=2)}{p(y)} \\ p(y|MA=1) &= \int p(y|\theta,MA=1)p(\theta|MA=1)d\theta \\ p(y|MA=2) &= \iiint p(y|\theta_k,\Theta,\sigma_{\theta_k},MA=2)p(\theta_k|\Theta,\sigma_{\theta_k},MA=2)p(\Theta,\sigma_{\theta_k}|MA=2)d\theta_k d\Theta d\sigma_{\theta_k} \\ p(y) &= p(y|MA=1)p(MA=1) + p(y|MA=2)p(MA=2) \end{align} This all looks really convoluted, but in words: • The posterior probability distribution of the fixed effect is marginalized across the two different meta-analytic methods (fixed and random effect models), weighted by the posterior probability of each meta-analytic model given the data. • The posterior probability of a fixed-effects model is the probability of the data under that model times the prior probability (probably .5), divided by the marginal probability of the data across both. This is just Bayes theorem. • The posterior probability of a random effects model is the probability of the data under that model times the prior probability (probably .5), divided by the marginal probability of the data across both. This is just Bayes theorem. • The probability of the data under each model is the probability of the data given the parameters in each model, marginalized over the a-priori specified priors for each unknown parameter. This is where the\text{Normal}^+(.3,.15)$prior comes in for both$\theta$and$\Theta$. All of this comes from probability theory and Bayes theorem. In a nutshell, you estimate a fixed-effects model and a random-effects model, assuming that the hypothesis about the effect is true and is expressed statistically as$\text{Normal}^+(.3,.15)$. You get the marginal probability of the data under each model, using bridgesampling, because computing that manually is intractable. This marginal probability represents the probability of observing the data they did given the a-priori hypothesis-informed uncertainty about the fixed effects. The posterior probability distribution of the fixed effect is then informed by both the fixed and random-effects models, weighted by their respective marginal probabilities. Both of these models assume the effect exists, and is positive. A-priori, the hypothesis states that the ego-depletion effect is positive, and with uncertainty represented by Normal(.3,.15). The posterior distribution for the ego-depletion effect, assuming such an effect exists as hypothesized, is informed by both the fixed and random-effects models, but weighted by the probability of observing the data they did under each model. Really, the posterior distribution is:$p(\Theta|y,H1)$, meaning it is conditioned on both the observed data and that the hypothesized effect is true. The hypothesis-driven prior on$\Theta$(d) looks a bit like this: And the posterior might look something like this: But remember! This posterior is conditioned on the effect being positive, non-zero, because the prior constrains it to be so. And this is likely not what the posterior actually looks like, this is just a best-guess approximation. Something else to consider… # Posterior distributions and Bayesian point estimates are not like modal estimates This is a quick aside. Bayesian posteriors are distributions of effects, representing the data-informed posterior probability density or mass of parameters. Outside of Bayesian inference, people are really accustomed to modal estimates, with an accompanying standard error or confidence interval. In Bayesian methods, people often don’t report modal estimates (maximum a-posteriori), but rather medians or expectations with an accompanying credible interval. When a prior induces a posterior boundary, these will not equal one another. Just to build the intuition, let’s assume you fit a model without any boundaries, and the posterior looks like this: This posterior suggests a mean, median, and mode of zero. Now let’s use a prior to restrict negative values altogether. The posterior looks like this: The modal estimate is still zero, but the median is about .629, and the mean (expected value) is about .765. Why is this important? The Bayesian MA posterior, conditioned on the effect being positive, has a median estimate of .08. The mean estimate is necessarily higher, and the modal estimate is necessarily lower. For what we currently know, that posterior could literally look like this: In this posterior, the median is .08 (the same effect estimated by the Bayesian MA model), but the modal estimate is zero. This is where Bayesian estimates and non-Bayesian estimates diverge: Maximizing estimators would say the effect is actually zero, whereas the Bayesian median and mean would be .08 and .10, respectively. It’s not that Bayesian estimates are wrong, but they are really descriptions of a posterior distribution. They are not intuitively the same as maximum-likelihood or other estimators. In this case, that posterior is really just a normal distribution, centered at zero, with all negative mass piled up at zero. In other words, if the prior had not constrained the posterior to positivity, that posterior would have a mean, median, and mode of about zero. This is important to recognize, because having a median effect of .08 is not particularly impressive or exciting. In reality, that suggests a 50% posterior probability that the effect is between 0 and .08, assuming it can only be positive — And even effects that are in actuality zero can have a median estimate of .08 when positivity is induced. # The actual posterior Wagenmakers and others compared two models. One model was the combined-meta-analytic approach discussed above. The other model assumed a point-mass of zero on the effect. The Bayes Factor for this was$BF_{01} \approx 4$, from my best recollection. This means the data are four times more likely to occur if the true effect were precisely zero than if the hypothesized (meta-analytic) effect of$\text{Normal}^+(.3,.15)$were true. If you want to know the overall posterior distribution for the effect, you have to marginalize across these models. $$p(\Theta|y) = p(\Theta|y,H1)p(H1|y) + p(\Theta|y,H0)p(H0|y)$$ This posterior will look a bit like this: The mean estimate is about .016. That is — Assuming the effect could either be zero or positive, the effect is about .016. This is not the traditional meta-analytic effect size estimate, but given the a-priori hypothesized effect size range and the alternative hypothesized effect of zero, then the meta-analytic effect would be approximately .016, not .08. Once again, .08 assumes that the effect is positive and exists, which we are not certain of. If we marginalize across the uncertainty in the models themselves, the expected estimate is about .016. Hopefully, a more traditional meta-analytic estimate will be reported in the paper as well. For example, using a zero-symmetric estimation prior (rather than the hypothesis-testing prior employed in the presented analyses) permits the effect to be any real number, weighted by the prior. This would serve to probabilistically down-weight insane estimates (like$d\geq1$). I am guessing that the meta-analytic effect would be between 0 and .05. I am further guessing that if they used a prior mass at zero along with a zero-symmetric prior, the mean,median, and modal estimate would all be essentially zero (between 0 and .02). # Summary From what I can gather: • The meta-analytic effect used a hypothesis-informed prior for the effect, meaning the meta-analytic effect is estimated assuming it is positive. • The data are about 4 times more likely to come from a parameter value of zero than from the hypothesized distribution of effects. • Assuming the hypothesis is correct, the median estimate is about .08. • Marginalizing across whether H1 or H0 are correct, the median estimate is about .016. • Do not take the meta-analytic effect size estimate as the take-home message, because that neglects the possibility that zero is a fairly probable estimate. I.e., don’t forget that the estimate is conditioned on the effect existing as hypothesized, which is 4x less likely than the null hypothesis. ## Why (N-1) vs N [Long post] I have a lot of students ask why we use (n – 1) vs (n) when computing variance. There are of course canned answers, like “Using N-1 unbiases the estimate” or “Using N-1 inflates the estimate, and it’s more accurate” or “Using N results in an underestimate”. These are all true, but they’re deeply unsatisfying. I remember learning the more precise reason in my first semester of my phd program, but the question came up again and I decided to and write it out myself as a reference. This will be fairly basic stuff for many, I’m sure, but for the sake of my students and my future self, here it is: First, distinguish between an estimator and its properties in expectation. The estimator is some method with a defined goal that produces the best estimate given the goals. Let’s say we are estimating the$\mu$and$\sigma$parameters to a normal distribution. Let’s further say that we want a point estimate, and we’re not Bayesians (for now…), nor do we have any penalties. One estimator is the maximum likelihood estimator, for obtaining the maximum likelihood estimate (MLE). The MLE is the estimate that maximizes the joint probability density of the observed data, assuming some probability distribution underlies the sampling distribution. In our case, we have the normal distribution, which looks like: $$p(x | \mu, \sigma) = \sigma^{-1}(2\pi)^{-.5}e^{-(2\sigma^2)^{-1}(x – \mu)^2}$$ When we have more than one x – A vector of x called$X$(capitalized), and if we assume independence, then the joint density is: $$p(X | \mu, \sigma) = \prod_{i = 1}^N \sigma^{-1}(2\pi)^{-.5}e^{-(2\sigma^2)^{-1}(x_i – \mu)^2}$$ This is a pain to compute with, and the log-likelihood is much easier. $$f(X | \mu, \sigma) = -N\log(\sigma) – .5N\log(2\pi) – (2\sigma^2)^{-1}\sum_{i = 1}^N (x_i – \mu)^2$$ Now, using this log-likelihood, we just need to see which value of$\mu$and which value of$\sigmamaximize the function output, the log-likelihood. Because the log function is a monotonic function, maximizing the log-likelihood is the same as maximizing the likelihood. To do this, we need to take the derivative of the log-likelihood with respect to each parameter at a time. Ignoring the notation for partial derivatives for the moment, the derivatives are as follows: \begin{align} f(X|\mu,\sigma) &= \ldots -(2\sigma^2)^{-1}\sum(x_i^2 – 2x_i\mu + \mu^2) \\ &= \ldots -(2\sigma^2)^{-1} \left ( \sum x_i^2 – 2\mu\sum x_i + N\mu^2 \right ) \\ &= \ldots – \frac{\sum x_i^2}{2\sigma^2} + \frac{\mu}{\sigma^2}\sum x_i – N\mu^2(2\sigma^2)^{-1} \\ \frac{d(f(X|\mu,\sigma)}{d\mu} &= 0 + \frac{\sum x_i}{\sigma^2} – N\mu(\sigma^2)^{-1} \\ 0 &= \frac{d(f(X|\mu,\sigma)}{d\mu} \\ 0 &= \sum x_i – N\mu \\ \mu &= \frac{\sum x_i}{N} \end{align} I’ll call that\hat\mu$, since it’s an estimate that maximizes the likelihood of the observations, and not the “true”$\mu$parameter value. Onto$\sigma: \begin{align} f(X | \mu, \sigma) &= -N\log(\sigma) – .5N\log(2\pi) – (2\sigma^2)^{-1}\sum_{i = 1}^N (x_i – \mu)^2 \\ \frac{d(f)}{d\sigma} &= -N\sigma^{-1} – 0 – \sum(x_i – \mu)^2\sigma^{-3} \\ 0 &= \sigma^{-1}(N – \sigma^{-2}\sum(x_i – \mu)^2 \\ \sigma^{-2}\sum(x_i – \mu)^2 &= N \\ \sigma^2 &= \frac{\sum(x_i – \mu)^2}{N} \end{align} I’ll call that\hat\sigma^2$, since it’s an estimate that maximizes the likelihood of the observations, and not the “true”$\sigma$parameter value. So now we have two estimators, derived from maximum likelihood goals: $$\hat\mu = \frac{\sum x_i}{N} \ \hat\sigma^2 = \frac{(\sum x_i – \mu)^2}{N}$$ Now the question, what are the expected values of these parameter estimates, and are they equal to the desired parameter values? Let me just take a shortcut and say the expected value of$\hat\mu$is indeed$\mu$. There is variance of the$\hat\mu$estimate around$\mu$, derived as follows. Assume$E(x_i) = \mu$,$E(\sum x_i) = \sum E(x_i)$,$E(cX) = cE(X), which are some basic expectation rules, and the rest follows. \begin{align} E(\hat\mu – \mu)^2 &= E(\hat\mu^2 – 2\hat\mu\mu + \mu^2) \\ &= E(\hat\mu^2) – 2\mu E(\hat\mu) + \mu^2 \\ &= E(\frac{(\sum x_i)^2}{N^2}) – 2\mu E(\frac{\sum x_i}{N}) + \mu^2 \\ &= \frac{1}{N^2}E((\sum x_i)^2) – \mu^2 \\ &= \frac{1}{N^2}E\left (\sum x_i – \mu + \mu \right )^2 -\mu^2 \\ &= \frac{1}{N^2}E\left ((\sum x_i – \mu) + N\mu \right )^2 – \mu^2 \\ &= \frac{1}{N^2} E \left ( (\sum (x_i – \mu)^2 + 2N\mu\sum(x_i – \mu) + N^2\mu^2)\right ) – \mu^2\\ &= \frac{1}{N^2} (N\sigma^2 + 0 + N^2\mu^2) – \mu^2\\ &= \frac{\sigma^2}{N} = \sigma^2_\mu \end{align} Voila, the expected variance (VAR) of the\hat\mu$estimator, defined as$E(\hat\mu – \mu)^2)$, is equal to$\frac{\sigma^2}{N}$. Just as our textbooks tell us. With that out of the way, what is the expected value of$\hat\sigma^2? \begin{align} \hat\sigma^2 &= \frac{\sum(x_i – \hat\mu)^2}{N} \\ E(\hat\sigma^2) &= \frac{1}{N}E(\sum((x_i – \mu) – (\hat\mu – \mu))^2) \\ &= \frac{1}{N}E(\sum (x_i – \mu)^2 -2\sum(x_i – \mu)(\hat\mu – \mu) + \sum(\hat\mu – \mu)^2) \\ &= \frac{1}{N}\left (N\sigma^2 – 2\sum E(x_i – \mu)(\hat\mu – \mu) + N\sigma^2_\mu) \right ) \\ &= \frac{1}{N}\left (N\sigma^2 – 2\sum E(\hat\mu x_i – \mu x_i – \mu\hat\mu + \mu^2) + N\sigma^2_\mu \right ) \\ &= \sigma^2 + \sigma^2_\mu – \frac{2}{N}E(N\mu^2 – N\mu\hat\mu – N\mu\hat\mu + N\hat\mu^2) \\ &= \sigma^2 + \sigma^2_\mu – 2E(\hat\mu – \mu)^2\\ &= \sigma^2 – \sigma^2_\mu \\ E(\hat\sigma^2)&= \sigma^2 – \frac{\sigma^2}{N} = \sigma^2(1 – \frac{1}{N}) = \sigma^2(\frac{N-1}{N}) \\ \sigma^2 &= \hat\sigma^2\frac{N}{N-1} \end{align} 1.E(\hat\sigma^2)= \sigma^2 – \sigma^2_\mu$is really interesting. It’s saying the expected value of our estimate of the variance is equal to the true variance, minus the error variance of the mean estimator. Conversely, you can say that the true variance is equal to the observed variance plus the error variance in the mean estimate. That already should intuitively suggest to you that the reason the MLE for$\sigma$is underestimated is because it fails to account for the variability in the mean used to compute the sample variance. Scores not only vary around the mean, but the mean varies around the parameter, and by neglecting that latter variability, we underestimate the population variance. From an ordinary least squares perspective, you can think of it this way: The mean is the value for which the average squared deviation from it is the smallest. Already, the mean is minimizing the variability. But because the mean does not equal the population parameter, that variability will be minimized around the incorrect point. 2. The$\frac{N}{N-1}$adjustment merely alters the estimated value so that the expected value will equal the true value. It does so basically by indirectly adding back some error variance due to the estimation of the mean itself. And if that fraction doesn’t look familiar to you, maybe this helps:$\hat\sigma^2 \frac{N}{N-1} = \frac{(\sum x_i- \hat\mu)^2 N}{N(N-1)} = \frac{\sum (x_i – \hat\mu)^2}{N-1}$And that’s basically it. I probably have minor mistakes in the above, as this was all hastily written. But that’s the gist – Using (N-1) as the divisor ultimately comes from the fact that our estimator is not expected to equal the true variance parameter. The estimated sample variance is off by a factor of$\frac{N-1}{N}$, asymptotically, so we just adjust it so the expected value with the adjustment does equal the true variance. ## The absurdity of mapping p-values to Bayes factors In the “Redefine statistical significance” (RSS) paper, the authors argued that p = .05 corresponds to low evidentiary value. Let’s unpack what is meant by that. The authors’ argument comes from an “upper bound” argument on how large the Bayes Factor can be given a particular p-value. Recall that a Bayes Factor is defined as: $$\frac{p(D|H_a)}{p(D|H_0)} = \frac{\int_l^u p(D|\theta,H_a)p(\theta|H_a)d\theta}{\int_l^u p(D|\theta,H_0)p(\theta|H_0)d\theta}$$ The parameter priors,$ p(\theta|H_k) $corresponds to an uncertain prediction from some hypothesis,$H_k$. The marginal likelihoods computed in the numerator and denominators above can then be thought of “prior predictive success”, or “predictive success, marginalized over the uncertainty expressed in the parameter from hypothesis$k$“. The BF10 then represents the evidence in favor of hypothesis a over hypothesis 0. With that out of the way, let’s return to the RSS paper. The authors wished to argue that a p=.05 necessarily has a low upper bound on evidence in favor of the alternative hypothesis. RSS used as an example a BF for a two-tailed alternative hypothesis in comparison to a nil-null hypothesis. This makes intuitive sense, because a commonly used test is the two-tailed t-test with a nil-null hypothesis. In the latter case, the p-value is corresponding to the probability of observing a t-statistic at least as extreme as the observed statistic, assuming that the true sampling distribution is centered around zero. No explicit alternative hypothesis really exists; the statement is simply$H_a: \theta \neq 0$, which isn’t exactly a useful hypothesis, but I digress. In contrast, the BF does use two explicit hypotheses.$H_0: p(\theta|H_0) = \begin{cases}1 & \text{if } \theta = 0 \\ 0 & \text{if } \theta \neq 0\end{cases}$and$H_adepends slightly on how favorable you wish to be, in order to obtain an upper bound. The most direct upper bound of BF for a p=.05 value with some specified N would simply be another point mass at exactly the quantile corresponding to the .975 percentile of the t-distribution. Using this line of reasoning, they argue that at best, a p = .05 corresponds to a small BF, and as such, p = .05 has a low upper bound on evidentiary value. Instead, they argue that p = .005 would correspond to a more considerable evidentiary value, with BFs ranging from 14–28 or so. So what’s the problem? Seems reasonable! # The problem(s) Ok, well let’s start with some intuitive problems. First, as mentioned in the “Justify your alpha” paper, why should the rate at which a hypothetical error is presumably controlled map onto a marginal likelihood ratio (I refuse to call this “evidence”, as though the whole of scientific evidence can be expressed in a single unidimensional metric of prior predictive success ratios)? We could likewise flip this around — How about we devise Bayes factor thresholds that map onto hypothetical error rates, so that these crazy Bayesians don’t make more or less type 1 errors than expected? (I am not encouraging this, just highlighting the arbitrariness). Second, these are metrics of completely different things! A p-value is exactly what it states: The probability of observing what we did (or more, or more extreme) assuming some state-to-be-nullified is true? The BF is a ratio of prior predictive success, or a marginal likelihood ratio, or the ratio of probability densities of the observed data from two prior-predictive distributions corresponding to two hypotheses. The goals of each are different too. The p-value is just a “surprise” factor, given some assumed state. The BF is a relative measure, evaluating the data under two certain or uncertain states. The p-value is used inferentially as a means to reject some state if the observation is sufficiently improbable under it. The BF is used to express how much one’s prior odds in each state would change as a function of the data. Why these things should map onto one another is not intuitive. The p-value has its uses. The BF has its uses. I’m not sure why one needs to be defined in terms of the other. Third, and most importantly, the mapping from “highly evidentiary” upper bound BFs to p-values (or alphas) is not generalizable. Richard Morey has an excellent series on this topic. The gist is, the whole “p=.005 has reasonable upper bounds, but p=.05 does not”, depends entirely on which hypotheses you are testing. And as I said above, the RSS authors used a two-tailed alternative against a point-mass null, and from the standpoint that statistical hypotheses should map onto their respective substantive hypotheses, this makes no sense anyway. More on that in a future blog post. Morey recently made his point again on twitter. I decided to take it a step further. # Likelihood Ratio Tests A likelihood ratio test is defined as: \begin{align} Q &= -2(LL_0 – LL_a) \\ Q &= 2(LL_a – LL_0) \\ Q &= 2\log\frac{L_a}{L_0} \\ k_i &= \text{Number of parameters in model i} \\ df &= k_a – k_0 \\ Q &\sim \chi^2(df)| \end{align} We can solve for the likelihood ratio required to reject at some\alpha$level: $$e^{\frac{Q_{c,\alpha,df}}{2}} = \frac{L_a}{L_0}$$ Given some$\alpha$and$df$, find the critical value for the$\chi^2$statistic,$Q_{c,\alpha,df}$, then simply solve for the likelihood ratio. Now, let’s define the upper bound BF. As it turns out, if you have two hypothetical models with point masses at fixed values ($\hat\theta), then the BF is simply a good ‘ol fashion likelihood ratio. \begin{align} p(D|H_a) &= \int p(D|\theta,H_a)p(\theta|H_a) \\ \text{If }p(\theta|H_a) &= \begin{cases}1 & \text{if } \theta=\hat\theta \\ 0 & \text{if } \theta\neq\hat\theta\end{cases} \\ p(D|H_a) &= p(D|\hat\theta,H_a) \end{align} Meaning, that the BF will just become a likelihood ratio. If you have model A with 2 parameters, and model 0 with 1 parameter, and p = .05, what is the upper bound BF (BF_{ub})? \begin{align} df &= 1 \\ Q_{c,\alpha=.05,df=1} &= 3.841459 \\ e^{Q/2} &= 6.825936 = BF_{ub} \end{align} So if you are comparing two models with a difference of one parameter, and you see p = .05, your upper bound Bayes Factor is 6.825936. Basically, that means if your alternative hypothesis had one more parameter and your hypothesis predicts the exact estimate achieved, and you got p = .05, then the greatest BF you could see is 6.825936. That’s not stellar, which is the point of the RSS argument. Not so fast! That’s in one particular scenario, wherein you had two models, and the only difference is the inclusion of an extra parameter. Let’s vary the p-value obtained and the df for the test. Recall that df in LRTs represents the difference in parameters estimated for each model, so we are varying how complex the alternative model is from the reference or null model. Plot time. library(ggplot2) alphas <- seq(.005,.5,by=.005) dfs <- 1:10 ds <- expand.grid(alpha=alphas,df=dfs) dsqchisq <- qchisq(p = 1-ds$alpha,df = ds$df)
ds$bfup <- exp(ds$qchisq/2)

ggplot(data=ds,aes(color=factor(df),x=alpha,y=log(bfup))) +
geom_line() +
labs(x='p',y='log(BF10) Upper Bound',color='LRT df') +
theme_classic() +
geom_hline(yintercept=c(log(14),log(28))) +
scale_x_continuous(breaks=seq(0,.5,by=.025)) +
theme(axis.text.x = element_text(angle = 90)) The $\log BF_{ub}$ is plotted on the Y axis. P-values on the X-axis. Colors represent the df (i.e., difference in number of parameters from alternative to null models).
The black horizontal lines represent BF10 14-28.

If your model has 2 parameters more than the null, then p=.05 has an upper bound of “strong evidence”, according to the RSS authors.
Well, that changes things doesn’t it?
In fact, if you only have one parameter difference, then p=.01-.03 has “strong evidence” upper bounds.
A p=.005 would have an upper bound BF of about 50.

In the more extreme case, if you have 6–7 more parameters in the alternative model, then a p=.50 corresponds to “strong evidence” upper bounds.
Let me repeat that, for those who misread .50 as .05.
A p-value, of .50 (50%, point five zero, not 5% or point zero five), corresponds to strong evidence upper bounds.

Using the logic of the RSS authors, if you have 6-7 parameters more in your alternative model than the null model, then alpha should be set to .50, because that’s where the upper bound of “evidence” is strong.
That’s some really great news for people who fit SEM and MLM models, both which often compare much higher dimensional models to much lower dimensional models. Go for broke, know that if you have p=.42, it’s justifiable as a cutoff, since it has a high evidence upper bound.

Note that this does make sense from a BF-inferential perspective.
If you have a hypothesis that predicts precisely the parameter values that you estimated, then that hypothesis is insanely good.
And the goodness of that hypothesis increases as your model has more parameters that it precisely predicted.
This reasoning is intuitively why as df increases, the upper bound increases too.
If you have a hypothesis that is perfectly predicting the parameter values, then the BF upper bound should be higher for models that have more parameters to predict.
In other words, it’s much more impressive for a hypothesis to perfectly predict 10 parameters than 2.

Despite the pattern making sense from a BF-inferential perspective, it makes little sense to claim p-values overstate the “evidence”, when in many cases the p-values understate the “evidence”.
Sure, a p=.05 may only imply a BF of 3-6.
But also, a p=.05 may imply a BF of 20-10000.

It depends on the hypotheses you are testing, and therefore there can be no general mapping between a p-value and the “evidence” indicated by a BF.

# Final thoughts

The argument that p=.005 should be the new standard due to its relationship to the BF is questionable.
Philosophies aside, the p-value does not map onto the BF.
A p-value maps onto a BF.
It all depends on the hypotheses you test. A p of .5 could mean a BF of 106; a p of .005 could mean a BF of 51; a p of .05 could mean a BF from 7 to 10000. Who knows?

The recommendation that all findings be labeled significant only when $p \leq .005$ is based purely on one particular set of hypotheses, under a set of assumptions, with the goal of matching it to upper bounds of one particular evidentiary metric.
It cannot generalize to all computed p-values, and I doubt we’d want a big table of significance thresholds corresponding to a range of BFs for every possible significance test.

Finally, I agree that a single p-value near .05 is insufficient evidence; just as I would agree that a single BF of 3-10000 would be insufficient evidence.
The reason I state that a single p-value near .05 is insufficient evidence is not because the .05 is necessarily weak, but rather that any singular metric is necessarily weak, when assessing the evidence for a hypothesis.
More on this in a future blog post, I’m sure.

In sum, use $\alpha = .005$ if you’d like. Use BFs if you’d like. Use posteriors and CIs if you’d like. Report all the things if you’d like.
But maybe don’t say p=.05 is necessarily weak because your p = .05.

And maybe all of this is wrong, I don’t know. You tell me.

## Updating? Pooled data? Meta-analysis? Yes.

Meta-psychology is a journal that is founded on the principles of total transparency. Manuscripts are open. Reviews are open. Open commentary is permitted. Each draft is visible. It’s as though the journal cares about science or something.
Rickard mentioned a new submission about the utility of Bayesian updating, or as the authors call it – Posterior passing.

My comment can be read here by clicking on the upper-right comment button. Due to the excruciating 1000 character limit, you will have to hit the ‘+’ sign to read all of them.

Although I generally appreciate the paper, I think it misses an important point.
Their proposal to use “posterior passing” instead of meta-analysis or using full raw datasets is strange upon realizing that posterior passing is the same as having all of the raw data, and the same as a fixed-effects meta-analysis under a few reasonable assumptions.

# Posterior Passing

The gist of posterior passing is that a researcher can take the posterior of a parameter computed from a previous sample, and use that posterior as a prior.
Or, as the rest of the world calls it – Bayesian updating.
The prior is then updated by the new sample data into a new posterior.
That new posterior can serve as someone else’s prior, and so on.
Over repeated samples, this results in decreasing uncertainty (smaller posteriors) about parameter values.

Posterior passing looks like the following. Assume $Y_j$ represents some current observation vector, and $Y_{-j}$ represents a set of previous observation vectors. Or: $Y_j$ is your data, and $Y_{-j}$ is not your data.
\begin{align}
p(\theta,Y_j|Y_{-j}) & \propto p(Y_{j}|\theta,Y_{-j})\overbrace{p(\theta|Y_{-j})}^{\text{Your prior, their posterior}}
\end{align}

That’s about it. Obtain data, estimate a posterior, use that posterior as a prior for the next sample. Keep doing it ad nauseam, and you have posterior passing.

The argument is that this method is underutilized in the sciences, and that engaging in posterior passing will permit researchers to quickly converge toward a more certain, accurate estimate.

I do not disagree with the idea of Bayesian updating, of course. It’s just a mathematical fact, and it is underutilized. People are way too afraid to incorporate prior information into a model.
With that said, the proposal has a few problems, some they mention and some they do not.

• The proposed use of it assumes a single set of generative parameters across all of the studies. Given that multiple studies have minor to major differences between them, surely their corresponding data are not generated from precisely the same set of parameters. It would be difficult to find three, let alone 60, studies that are feasibly generated from the exact same parameters.
• The quick convergence toward the “true value” strongly depends on the quality of the available posteriors from previous studies. $p_\text{osterior}-\text{hacked}$ studies and publication bias result in a misleading assortment of posterior distributions, and though your own data may overwhelm the priors, it is not guaranteed. Indeed, if they had simulated the existence of hacking and pub bias, I suspect the performance would be underwhelming.

My bigger disagreement is with advocating it over meta-analysis, or even comparing it to having the full data available.

They’re the same things! Advocating for repeated Bayesian updating across studies in order to get a highly informative posterior is the same thing as advocating for full-data analysis and fixed effects meta-analysis.

## Bayesian updating vs full-data analysis

Imagine two researchers are arguing about their next three samples.
One researcher wants to collect 300 observations, then another 300, then another 300.
She then wants to compute a posterior based on the first 300, feed that in as a prior for the second 300, and finally the resulting posterior into the third.
Specify an initial prior, then chain posteriors into the next model’s prior, until we get a three-study Bayesian update on the state of information.
The other researcher thinks that’s inferior to the full-data method.
He wants to just collect 900 and be done with it; analyze the whole thing.
Specify an initial prior, then throw all 900 observations into a single model.
They argue for hours. Friendships are severed. Children are neglected. They use all caps when texting because their voices have given out.

Neither researcher is right or wrong.
These two approaches will accomplish the exact same end result.
They should have just worked out the math and grabbed a beer instead.

\begin{align}
p(\theta,Y_j,Y_{-j}) &\propto p(Y_j|\theta,Y_{-j})\overbrace{p(\theta|Y_{-j})}^\text{A posterior} \\
&\propto p(Y_j|\theta,Y_{-j})p(Y_{-j}|\theta)p(\theta) \\
\text{If IID…} &\\
&\propto p(Y_\cdot|\theta)p(\theta)
\end{align}

Assuming your data and other data are IID, you can either use a posterior as your prior, or combine all the data.
The result is the same. The joint distribution decomposes into either one.

Consequently, using all available data in a single model is no better than using posteriors derived from all previous data to inform the current data.

## Bayesian updating vs fixed-effect meta-analysis

A fixed-effect meta-analysis is defined as follows.
Some parameter, $\theta$, exists that is estimated by several $\hat\theta_i$. A variance estimate, $\sigma^2_{\theta_i}$ is computed for each $\hat\theta_i$ estimate.
It follows then, that:
$$p(\theta | \hat\theta_\cdot,\sigma_{\theta_\cdot}) \propto p(\hat\theta_\cdot | \theta, \sigma_{\theta_\cdot})p(\theta)$$

Decomposing this further,

\begin{align}
p(\theta | \hat\theta_\cdot,\sigma_{\theta_\cdot}) &\propto p(\hat\theta_\cdot | \theta, \sigma_{\theta_\cdot})p(\theta) \\
&\propto p(\hat\theta_j | \theta, \sigma_{\hat\theta_j})p(\hat\theta_{-j}|\theta,\sigma_{\hat\theta_{-j}})p(\theta) \\
&\propto p(\hat\theta_j | \theta, \sigma_{\hat\theta_j})p(\theta | \hat\theta_{-j},\sigma_{\hat\theta_{-j}})
\end{align}

If $\hat\theta_j, \sigma_{\hat\theta_j}$ are sufficient statistics for $Y_j$, then
$$p(\hat\theta_j | \theta, \sigma_{\hat\theta_j})p(\theta | \hat\theta_{-j},\sigma_{\hat\theta_{-j}}) = p(Y_j|\theta)p(\theta|Y_{-j})$$

Assuming each study’s $\hat\theta_j, \sigma_{\hat\theta_j}$ calculations are sufficient statistics and are independent of one another, the fixed-effect meta-analysis will provide the same information as posterior passing or full-data analysis.

# Summary

In essence, whether you want to utilize posterior passing, incorporate all data into a single model, or use a fixed-effect MA, you will produce the same information in the posterior.

I think, anyway. Feel free to critique the above.

It is inconsequential, based on the above, to use one version or another. They will all suffer the same consequences with the same problems, and have the same troubling assumption that all included data are presumed generated from the same single parameter.

Practically speaking, I believe the fixed-effect MA would be the simplest to utilize of those three. Regardless, I think the assumption of one-true-effect for multiple studies is seriously improbable in practice, and would recommend a random effects meta-analysis.

## Maximum Likelihood Lecture for Undergraduates

This semester I was permitted to give one brief lecture on maximum likelihood in the advanced statistics undergraduate course.

The course focuses primarily on the linear model, so we talk about OLS linear models, and variance-based approaches to model selection.
I have free reign over three lectures each semester, and my final lecture was on maximum likelihood approaches, and how we can use them.

The knitted lecture notes are embedded below. Note: I know some content is incorrect; I throw the term “likelihood” around pretty freely and incorrectly, in order to focus on the intuition rather than the correctness of that term. So before anyone comments on how I used the term likelihood incorrectly — I know I did. It was intentional, and with only one lecture to spend, it wasn’t particularly worth it to me to go into what precisely likelihood means, or why the $p(Y|\mu,\sigma)$ is not Fisher-friendly (because it treats parameters as random values to be conditioned on, even if mathematically they are the same).

Anyway, these notes got good reviews from undergraduates who had previously never seen or heard of maximum likelihood; even a couple of graduate students thought the visualizations were helpful to them.
The undergraduates were also given a fairly basic homework assignment wherein they fit logistic regression models, computed and used LRT and AIC for model selection, and made predictions (e.g., if high in twitter addiction, the model would suggest a high probability of being a graduate student).

## Yes, collect more data and pool them.

A question arose on one of the numerous facebook-methods groups. The question, in essence was: I have a sample of 30 from a dissertation; I want to collect 30 more observations to have better [sic] results for a publication. Should I use a bayesian analysis with the prior informed by the first 30, or should I just use an uninformed prior and completely pool the data? My response: Yes, collect more data and just completely pool the data.

# Why

Partial pooling is great, and I encourage it nearly all of the time. However, in this case there was just an arbitrary break in the data collection from the same process, so I think complete pooling is fine (also, partial pooling with only two samples seems pointless, but I could be wrong).
You could of course use the posterior from the first 30 as the prior for the second 30, but this winds up being the same thing as just completely pooling the data, assuming the observations are independently sampled from the same process:
\begin{align}
p(\theta|y_1) &= \frac{p(y_1 | \theta)p(\theta)}{p(y_1)} \\
p(\theta, y_1, y_2) &= p(y_2 | \theta, y_1)p(\theta|y_1)p(y_1) \\
p(\theta|y_1,y_2) &\propto p(y_2 | \theta, y_1)p(y_1|\theta)p(\theta) \\
p(\theta|y_1,y_2) &\propto p(y_1,y_2|\theta)p(\theta) \\
p(\theta|y_1,y_2) &\propto p(y|\theta)p(\theta)
\end{align}
So – Just to save time, I suggested to just pool the data.

Another user was a bit suspicious though.
This user said that if you collect 30 observations which suggest something interesting enough that you should continue collecting observations, then this data-dependent choice will bias your results.

I disagree! This is very close to the problem of multiple stopping inflating the probability of a wrong decision, but is very different for a special reason: The data are not stopped if they hit “significance”, but rather continued if they hit “significance”. If you stop upon hitting significance, yes, you will bias your result and inflate error rates, if you care for that sort of thing.
But if you continue after hitting significance for some fixed N, the estimate thereafter is less biased, and if the “null” is true, then you will actually reduce the error rate.

The reason it is less biased is simply that if you stopped after significance with an N = 30, then your estimate is, for most effect sizes, necessarily an overestimate. It must be an extreme estimate in order to pass a threshold, if the study is underpowered (which, for most effect sizes, N=30 is underpowered). If you then sample 30 more, your estimate will recede toward the mean, and is not likely to become more extreme, but less extreme. Gaining more observations means gaining a better representation of the population, and you are far more probable to observe new values near the typical set than anything more extreme than you already observed.

The reason you have less type 1 error (of course, assuming the nil-null is “true”) is for similar reasons – With an N=30, you still have a $\alpha$ chance of a t1 error. Given that t1 error, sampling $N_\Delta = 30$ more observations will draw your chance-extreme estimate back toward the true mean of zero. In order to make a t1 error after $N_\Delta$ more, think about what would need to occur: With some $Z_n$ statistic for the first $N$ sample, what is the $N_\Delta$ sample needed to maintain at least a $Z_\alpha$ for rejection in the same direction? Or – What sample would you need to keep being significant purely by chance, and what is the probability of obtaining that sample from a nil-null sampling distribution? It’s small!

# When the null is true

Let’s test it via simulation.
The following procedure is simulated 100,000 times.

1. Collect N=30 from Null.
2. If significant, collect 30 more.
3. Report $p$, the estimate, and the n (30 or 60).
• Type 1 error: .01641
• Expected effect estimate: .00042
• Effect estimate conditioned on significance: -.01311

Note that this conditional effect should be biased, but due to the 2-tailed test, the biases are canceling each other out.

# When the true effect is .4

• Power: .54841
• Expected effect estimate: .36552
• Effect estimate conditioned on significance: .46870
• Effect estimate when conditioned on obtaining $N_\Delta = 30$ more after initial significance: .46207

# The confirmatory approach

But what if we only did as the other user suggested instead: Ignore the first sample, just collect an N=30 sample as a confirmatory study?
That is – Collect N=30, if significant, collect another N=30 and analyze the latter one by itself. We’ll only simulate from the latter one — Just simulating from the distributions.

## When the null is true

• Type 1 error: .05032
• Expected effect estimate: -.00070
• Conditional effect estimate: .00019

Note that the conditional effect estimate really should be biased — It’s just that I’m using a 2-tailed test, and the biases are canceling each other out. The expected positive effect is about .415. ## When the true effect is .4

• Power: .56336
• Expected effect estimate: .39981
• Conditional effect estimate: .52331

# Conclusion

From the above, I recommend that if you have a sample, and you find something interesting in it, do not be afraid to collect a larger sample and include your previous data in it. If the “null” is true, the error rate decreases due to regression toward the mean. If the null is false, you will have a less biased estimate. Ideally, you should collect enough data that any initial sampling-error induced extremeness will shrink away.
Yes, there’s a chance that you could “lose” significance — That’s probably for a good reason though, given the simulations above; but there’s a pretty strong chance you would lose significance anyway. In the 30 + 30 case, the power was about .54 overall; in the 30 first, 30 again confirmatory case, the power is about .56 each time – So the chance of you getting two samples that support your claim is about .32.

TLDR:

• Do not collect data until you hit significance, then turn off your data collector. You will obtain overestimates and drastically alter your t1 error rate; under extreme conditions, there is a 100% chance of rejecting the null even if it’s true under this procedure.
• Do collect more data after finding something interesting, because you will improve or maintain your power and decrease the significance-conditioned bias in your estimate. That’s probably a good rule of thumb anyway – Once you find something interesting even in adequately powered studies, collect a bit more just to shimmy toward the generative parameter a bit. Type 1 error is inflated a bit relative to two independent tests, but not by much — In one case it’s about .0025, and in the other it’s .01; not enough to overwhelm the benefit in estimate accuracy and power. The more N you collect after the first set, the lower this is (which is counter-intuitive, given the p-value asymmetry, but it’s due to joint t1 errors: Those non-sig aren’t continued; those significant then gain more N, but samples are more likely to be near zero than extreme enough to maintain the already improbable t1 error).
• Do this with enough N to overcome any sampling-induced extreme estimates — As in, don’t collect 30, get significance, then collect 1 more to say the estimate is more accurate, that is silly. You can play around with the code below to see how much N to add to reduce conditional estimates. All things being equal, more N is better N.
• Do this with a fixed $N_\Delta$, rather than doing it repeatedly until you hit significance again, otherwise you wind up with the conditional stops problem. Basically, don’t switch your rule from “stop once an N is hit” to “stop once an N is hit and significance is found again”.
• Do partially pool if you can, and you probably can.

# EDIT

I wanted to clarify: I don’t like NHST! And I know Bayesian inference doesn’t have nor care about error rates for inference. I know this. Do not yell at me. The above is in response to the idea that pooling data with “interesting” (significant, probable, whatever you want) results with a new dataset from the same process would bias your results and/or inflate error rates — I am playing a frequentist for a moment and saying, no they don’t (or, it’s less biased with better error properties).

sim_study <- function(n1,n2,mu,sigma,alpha=.05){
d <- rnorm(n1,mu,sigma)
lmOut <- lm(d ~ 1)
lmSum <- summary(lmOut)
p <- lmSum$coefficients[1,4] eff <- lmSum$coefficients[1,1]

# Comment until you see #%%% If you don't want to sample more
if(p <= alpha) {
d <- c(d,rnorm(n2,mu,sigma))
lmOut <- lm(d ~ 1)
lmSum <- summary(lmOut)
p <- lmSum$coefficients[1,4] eff <- lmSum$coefficients[1,1]
}
#%%%
c(eff=eff,p=p,n=length(d))
}

# Null
reps.null <- t(replicate(100000,expr = sim_study(30,300,0,1)))

mean(reps.null[,'p'] < .05)
mean(reps.null[,'eff'])
mean(reps.null[reps.null[,'p'] < .05,'eff'])

# Not null
reps <- t(replicate(100000,expr = sim_study(30,30,.4,1)))

mean(reps[,'p'] < .05)
mean(reps[,'eff'])
mean(reps[reps[,'p'] < .05,'eff'])