# A foray into Bayesian handling of missing data

I have been, uh, “blessed” by the data gods for most of my research data, in that I really rarely have non-planned missing data. Seriously.
Most of my research has involved surveys, lab experiments, or within-subject repeated measures, and for some reason, I just rarely had missing data.
The missing data was small enough to ignore (like, less than 1%).
Consequently, I’ve never really had a need to “handle” missing observations.

Sure, I’ve dealt with models wherein some unknowns are treated as missing data, like latent scores or groups in latent variable models, but that’s not really the same thing, now is it? Those were “known-unknowns”.
I’ve also had data where missingness is planned and ignorable, like a between-subjects manipulation of which response condition a repeated-measures task requires.

I use Stan or brms for nearly every analysis I do, and I have constructed some fairly complicated stan models.
Multilevel SEM with non-normal residual distributions and moderation? Check.
Assessing differential item functioning or measurement variance through item model competition? Check.
Simultaneously estimating the probability that some item is DIF across latent groups? Check.
Full information meta-analytic path models? You betcha.
And so much more (Seriously, Stan rocks).

But missing observations? Never dealt with it.
Inspired by an assignment for a course, I decided to dive in and see just how bad missing data handling is in Stan.

# The model

The data has 6 columns: read, parents, iq, ses, absent, and treat, roughly corresponding to a reading score, number of parents (0 being 1, 1 being 2), IQ, socioeconomic status, number of absences, and whether the person was involved in the reading improvement treatment.
Simple enough.
The goal is to estimate the basic linear regression, read ~ parents + iq + ses + treat, which is of course very easy.
However, there’s fairly substantial missingness in read, iq, and ses.
One-third of the IQ scores are missing, 29% of SES is missing, and 14% of reading scores are missing.

So what do you do?
The two most common methods are multiple imputation and full information maximum likelihood.

I specify the model in lavaan.

lavMod <- '
read ~ 1 + parents + ses + iq + treat
'


By default, lavaan uses listwise deletion, which is not so much a way of “handling” missing data as it is ignoring it and hoping we have enough N to say anything at all; and that’s only if data are MCAR, which is basically an insane assumption.
In reality, we can expect data to be MAR, and listwise deletion will result in some bad estimates.

Easy enough to fix in lavaan; to use FIML, you just add missings='fiml' as an argument.

lavModOutFiml <- sem(lavMod,data=ds,missing='fiml')


All FIML really does, is change the estimation technique.
Instead of operating only on covariance matrices, the estimator maximizes a likelihood function that is at the observation-level, then I think it integrates out the missings.
This allows every observed variable to provide information to the model, and share information for missing variables.

Or we could use multiple imputation, which is fairly easy as well.

library(semTools)
lavModOutMI <- runMI(lavMod,data=ds,m=20,miPackage='mice',fun='sem')


Multiple imputation generates $M$ datasets using, basically, gibbs sampling for the missings.
Then you estimate the model on each dataset and pool the estimates and compute the total standard errors.
In effect, this also integrates out the missings, and is essentially a less principled Bayesian method.

Easy enough, but onto Stan!

# Bayesian Missing Data

The nice thing about Bayesian modeling, is that there is not really a clear line between parameters and mere “unknowns”.
Really, observations are known, and everything else is unknown.
The goal is to condition on those knowns to make probabilistic claims about the unknowns.

Usually when people talk about unknowns, they mean parameters, but that is needlessly restrictive.
Missing data are unknown, latent groups and states are unknown, latent scores are unknown, but none are “parameters” per se.

In order to “handle” missings, we merely need a model for them; then any posteriors of interest can be computed after marginalizing across it.
No need to scrap entire rows of data — Just model the missings with the observed quantities, condition on the known and unknown data, then marginalize.
In this way, missing data handling in Bayesian models is very natural. No external imputation needed; no fancy algorithm required. Missing data are merely part of the joint probability system.
Most realizations were observed with absolute certainty; some were not observed, but are informed by what is observed.

Take multiple regression as an example. Let X be the non-missing predictors, $\tilde{X}$ the missing predictors, $\sigma$ is the residual standard deviation, $\beta$ is the vector of regression coefficients, y is the outcome, $\mu$ is the vector of means and $\Sigma$ the covariance matrix for a multivariate normal distribution on the predictors.
\begin{align} p(\beta,\sigma,\mu,\Sigma|X,\tilde{X},y) &\propto p(y | X, \tilde{X},\beta,\sigma)p(\tilde{X}|X,\mu,\Sigma)p(\mu,\Sigma,\beta,\sigma) \\ p(\beta,\sigma,\mu,\Sigma|X,y) &\propto \int p(y | X, \tilde{X},\beta,\sigma)p(\tilde{X}|X,\mu,\Sigma)p(\mu,\Sigma,\beta,\sigma) d\tilde{X} \end{align}
Essentially, we impose a multivariate normal distribution on the predictor variables, with unknown mean and covariance parameters.
The known predictors inform the mu and covariances, which in turn inform unknown scores.
Both the known and informed unknown scores predict y, and this in turn also informs the unknown scores (It’s a joint probability system, after all).
The goal is to obtain the marginal posterior of the parameters of interest, and to do so you must integrate over the unknowns, including unknown scores.

MCMC is there to help us approximate integrals and expectations.
Importantly though, MCMC samplers are essentially imputing the unknown data points just like multiple imputation, but the model also uses full information likelihoods to inform the model.
Bayesian handling of missing data therefore sits somewhere between multiple imputation and FIML-like techniques.
From a mathematical perspective, it looks like FIML. From an estimation perspective, it looks like multiple imputation.

To stan!

# The Stan model, decrypted

Let me premise this section by saying: The Stan code I show below is not optimized. It is written for clarity, not for speed. There are several ways of optimizing this, but for a one-off model, it’s not critical.
Additionally, there are multiple ways of handling missings in Stan; the one I show below seemed easiest to me, even though it is an expensive method.

Stan hates NA values. Stan (or I assume, their C++ header and libraries) has no concept of missing values, and has no way of representing them.
So we need to do two things. We need to save which values are missing, and also replace those missing values with a temporary value.
This R code accomplishes those goals:

missings <- lapply(ds,function(var){which(is.na(var))}) # For each variable in ds, which rows are NA
nMissings <- lapply(missings,function(var){length(var)}) # How many NAs?
names(missings) <- paste0(names(missings),'_index') # Rename indices to variable_index
names(nMissings) <- paste0(names(nMissings),'_nMiss') # Rename n_miss to variable_nMiss

ds2 <- ds # Work with copy of dataset ds
ds2[is.na(ds)] <- -100 # Magic number to remove NAs


For clarity, this is what missings looks like:

> missings
$read_index [1] 7 22 24 30 40 42 50 59 62 70 86 87 93 96 99 104 111 120 139 148 150 157 172 174 180 190 192 200 209 212 220 236 237 243 246 249 254 261 270 289 298 300$parents_index
integer(0)

$ses_index [1] 1 7 17 21 22 24 30 31 44 50 52 59 60 61 62 64 65 68 72 82 83 86 89 90 92 94 96 98 99 101 104 115 119 125 126 129 138 139 141 144 145 148 150 151 157 167 171 172 174 180 [51] 181 194 200 202 209 210 211 212 214 215 218 222 232 233 236 239 240 242 244 246 248 249 251 254 265 269 275 276 279 288 289 291 294 295 298 300$iq_index
[1]   8  11  15  17  21  22  24  31  40  52  59  60  61  62  68  70  71  73  74  77  80  82  85  86  89  90  92  93  94  95  96  97  98  99 104 108 110 112 114 115 119 120 122 124 125 135 136 138 139
[50] 141 158 161 165 167 171 172 174 181 190 202 209 210 211 212 218 220 221 223 224 227 230 232 235 236 239 240 242 243 244 245 246 247 248 249 254 258 260 262 264 265 269 270 272 274 275 285 286 288
[99] 289 291

$absent_index integer(0)$treat_index
integer(0)


Time to compose the Stan model.

The data block is as follows:

data {
int N;
vector[N] read; // Data vectors
vector[N] parents;
vector[N] ses;
vector[N] iq;
vector[N] treat;

int read_nMiss; // N missing from variables; excluded have no missings
int ses_nMiss;
int iq_nMiss;

int ses_index[ses_nMiss];
int iq_index[iq_nMiss];
}


N is defined as the number of rows in the dataset (number of observations).
The outcome variable vector and the four predictor vectors are expected.
The number of missings for the three variables containing missing values are expected.
Finally, an integer array for the vector indices containing missings is expected for each variable with missings.

The parameters block:

parameters {
// Structural
vector[4] beta;
real intercept;
real<lower=0> sigma;

// MVN covariates; for imputation
cholesky_factor_cov[3] Sigma;
vector[3] Mu;

// Missings
vector[ses_nMiss] ses_imp;
vector[iq_nMiss] iq_imp;
}


The structural parameters are the four regression coefficients, the intercept, and sigma — Corresponding to the model $y \sim \text{Normal}(X\beta,\sigma)$.
The multivariate normal parameters include a cholesky-factorized covariance matrix $\Sigma$, and $\mu$-vector; the known predictor values will inform the parameters to a multivariate normal distribution, which will in turn inform the unknown values for these variables.
The three remaining vectors correspond to the unknowns of each variable. The _imp should be read as “imputed”.

Transformed parameters block:

transformed parameters {
// Recombine data
matrix[N,5] Data; // 1: read, 2: parents, 3: ses, 4: iq, 5: treat
Data[,2] = parents;
Data[,3] = ses; Data[ses_index,3] = ses_imp;
Data[,4] = iq; Data[iq_index,4] = iq_imp;
Data[,5] = treat;
}


Here, we combine the observed and missing data into a single data matrix, called Data.
Normally, I would put this in the model block, but I hope to use this in generated quantities as well, and I do not want to copy and paste all of this.
We cannot merely edit the data vectors provided in the data block, because Stan disallows it.
Instead, a data matrix is created, and modified to include the estimated missing values.
This block is straight forward. An Nx5 matrix is created named Data, and I create a little key corresponding to which columns should represent which variables.
Each column is initially defined to be the corresponding vector provided in the data block.
For those three variables with missings, the indices with missing values (which we set to -100) are replaced with the “imputed” value for that person.

Model block:

model {
// MVN imputation
for(n in 1:N){
Data[n,2:4] ~ multi_normal_cholesky(Mu,Sigma);
}

// Priors
beta ~ normal(0,2);
intercept ~ normal(0,2);
sigma ~ cauchy(0,1);
Mu[3] ~ normal(100,10);
Mu[1:2] ~ normal(0,2);
//// Sigma given uniform prior implicitly

Data[,1] ~ normal(Data[,2:5]*beta + intercept, sigma);
}


For each person, the parents, ses, and iq quantities (whether it be observed or unknown) are assumed distributed multivariate normal with $\mu$ means and $\Sigma\Sigma’$ covariance ($\Sigma$ is a cholesky factor).
Any observed data contribute to the likelihood, and thus inform these unknown parameters.
Any unknown data are simulated (in a sense) from the distribution.
And yes, it is weird to assume the number of parents is normally distributed; I am ignoring that for now for the sake of simplicity, because all I care about is the covariance, and I am not imputing the parent variable.
Note that I could have included all predictors into the multivariate normal, but treatment is completely orthogonal to every other variable, and was excluded for simplicity.

Priors are loosely defined by values I think are plausible given the scales of the variables.
The only odd looking one out is Mu[3], but that corresponds to IQ, and a-priori I can assume the mean is about 100, and extreme means are very unlikely.
Finally, read is assumed distributed normally about a value predicted from the known and unknown data.
Note that unknown read values are likewise predicted or imputed from the model, although I do not think it has any impact on the other parameters.

Generated quantities:

generated quantities {
cov_matrix[3] Omega = Sigma*(Sigma');
real<lower=0,upper=1> R2;
{
vector[N] muHat = Data[,2:5]*beta + intercept;
R2 = (variance(muHat))/(variance(muHat) + variance(Data[,1]-muHat));
}
}


In this block, I compute the covariance matrix of the three predictors involved in imputation.
The $R^2$ value is computed on the full data as well.

Then we run Stan. I only monitor the parameters of interest, and not the imputed data.

stan_data <- c(missings,nMissings,ds2,N=nrow(ds2))
stanOut <- stan(model_code = stanMod,data=stan_data,cores=4,pars = c('beta','sigma','intercept','Mu','Omega','R2'))


It stacks up well to lavaan’s FIML and MI output.

The model above produced the “Bayes” line near the bottom.
The Bayesian model looks very similar to the FIML estimator from lavaan.

Auxiliary variables can also be used, and a model with an Auxiliary variable for the multivariate normal imputation method is reported on the final line of the table.
I won’t put the stan code here, but the gist is: Don’t predict the outcome with the Auxiliary variable; permit the Auxiliary variable to covary with all the predictors in order for it to add information to the unknown predictor values. You can also have the auxiliary variable covary with the residual of the outcome variable (requiring a multivariate normal response model) to inform imputed outcomes. An alternative is to have all the predictors additionally predict the auxiliary variable, the residuals which covary with the outcome variable residuals. The former is a saturated covariate model, the latter is an added DV model; both accomplish the same goal of informing both missing predictors and missing outcomes.

In the end, I was pleasantly surprised with how easy it was to handle missing data in Stan. Ideally, you specify your generative model, and just combine the known data with the unknown data. The known data will inform the unknown data through its influence on the unknown parameters. In this case, I simply chose to model the exogenous variables as multivariate normal, which permitted unknown data to be informed and used along with the known data to predict the outcome of interest.

For MVN imputation:

• Save which observations are missing, and how many, from each variable.
• In addition to the typical parameters, include parameters for a multivariate normal.
• Include a “parameter” per missing value.
• Combine the known data with the unknown data into a new data structure.
• Model exogenous variables as multivariate normal.
• Use full combined data in the model
• Thanks to MCMC, marginal posteriors will already be integrated over unknown data and parameter uncertainty.