knitr::opts_chunk$set(echo = TRUE)
library(MEMSS); library(tidyverse); library(kableExtra); library(INLA); library(Pmisc); library(naniar)
set.seed(12345)
# install.packages("Pmisc", repos="http://R-Forge.R-project.org")
# install.packages("INLA", repos=c(getOption("repos"), INLA="https://inla.r-inla-download.org/R/stable"), dep=TRUE)
INLA is a framework for fast, approximate Bayesian inference. It is especially tailored for hierarchical, latent variable models.
INLA stands for Integrated Nested Laplace Approximation, and it is particularly well suited for models with Gaussian latent variables. If you are interested in fitting generalized hierarchical models with non-normal random effects, as well as other probabilistic models, I encourage you to explore Stan.
You can refer to my session slides and their references for the theoretical and practical background to INLA.
Let's begin by setting it up on our system (required: R and RStudio). INLA is not available on CRAN, so run the following to install it:
install.packages("INLA", repos=c(getOption("repos"), INLA="https://inla.r-inla-download.org/R/stable"), dep=TRUE)
As with any other R library, to use it in your work you first have to load it:
library(INLA)
Other useful packages for this workshop, and for modern R usage in general, are tidyverse, kableExtra, and ggplot2. Our dataset can be fetched from the MEMSS package, and some additional functionality for reformatting the output of an INLA model can be obtained from Pmisc.
Let’s jump into a direct application of INLA. We can examine math achievement scores across schools and individuals as a hierarchical regression problem.
Our dataset consists of 7,185 observations depicting the individual math achievement scores (MathAch) of children in a metropolitan area, across multiple schools (School). In addition, the dataset contains three demographic variables: binary membership of a minority racial group (Minority), binary sex (Sex), and numeric socioeconomic status (SES).
Research Question: Controlling for sex, socioeconomic status, and minority status, what is the role of these fixed effects, as well as that of the school a child attends, with regard to a child's math achievement scores?
library(MEMSS)
data(MathAchieve)
MathAchieve %>%
head() %>%
kableExtra::kable(digits = 3, escape = FALSE, booktabs = TRUE, linesep = "") %>%
kable_styling(bootstrap_options = "striped", full_width = F)
School | Minority | Sex | SES | MathAch | MEANSES
---|---|---|---|---|---
1224 | No | Female | -1.528 | 5.876 | -0.428
1224 | No | Female | -0.588 | 19.708 | -0.428
1224 | No | Male | -0.528 | 20.349 | -0.428
1224 | No | Male | -0.668 | 8.781 | -0.428
1224 | No | Male | -0.158 | 17.898 | -0.428
1224 | No | Male | 0.022 | 4.583 | -0.428
To formulate the models and assumptions suitable for this particular dataset, let’s start with some essential EDA (Exploratory Data Analysis).
Let’s have a quick look at some summary statistics:
MathAchieve %>%
summary() %>%
kableExtra::kable(digits = 3, escape = FALSE, booktabs = TRUE, linesep = "") %>%
kable_styling(bootstrap_options = "striped", full_width = F)
School | Minority | Sex | SES | MathAch | MEANSES
---|---|---|---|---|---
2305 : 67 | No :5211 | Female:3795 | Min. :-3.758 | Min. :-2.832 | Min. :-1.188
5619 : 66 | Yes:1974 | Male :3390 | 1st Qu.:-0.538 | 1st Qu.: 7.275 | 1st Qu.:-0.317
4292 : 65 | | | Median : 0.002 | Median :13.131 | Median : 0.038
3610 : 64 | | | Mean : 0.000143 | Mean :12.748 | Mean : 0.006138
4042 : 64 | | | 3rd Qu.: 0.602 | 3rd Qu.:18.317 | 3rd Qu.: 0.333
8857 : 64 | | | Max. : 2.692 | Max. :24.993 | Max. : 0.831
(Other):6795 | | | | |
Ouch, does it make sense to have negative MathAch scores? A negative value on a test score? Let's clean it up and work with cleanMathAchieve going forward.
cleanMathAchieve <- subset(MathAchieve, MathAch >= 0)
dim(cleanMathAchieve) # We filtered out 212 records
## [1] 6973 6
What likelihood should we choose to model our response variable?
hist(cleanMathAchieve$MathAch,freq=F)
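To help eyeball whether a Gaussian likelihood is plausible, we can overlay a normal density using the sample mean and SD (a quick sketch, nothing rigorous):
# Overlay a normal density with matching moments on the histogram
hist(cleanMathAchieve$MathAch, freq = FALSE)
curve(dnorm(x, mean = mean(cleanMathAchieve$MathAch),
            sd = sd(cleanMathAchieve$MathAch)),
      add = TRUE, col = "red")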
Let’s look at our features via some pretty dull R base plots:
plot(MathAch ~ SES, data=cleanMathAchieve)
plot(MathAch ~ Minority, data=cleanMathAchieve)
plot(MathAch ~ Sex, data=cleanMathAchieve)
Ignoring joint effects, it seems like MathAch is higher for non-minority children and for males. It also seems that there is a positive correlation between MathAch and SES.
What about scores within and between schools? Let's compare a handful of schools (the five largest, per the summary above):
graph_sample <- subset(cleanMathAchieve, School %in% c(2305, 5619, 4292, 3610, 4042))
graph_sample$School <- factor(graph_sample$School) # drop unused levels for plotting
plot(MathAch ~ School, data=graph_sample)
There seem to be non-trivial differences in the performance levels between schools.
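As a numeric companion to the boxplots, the spread of the per-school average scores gives a rough sense of the between-school variability we are about to model:
# Per-school mean MathAch across all schools
school_means <- tapply(cleanMathAchieve$MathAch, cleanMathAchieve$School, mean)
summary(school_means)
sd(school_means, na.rm = TRUE) # na.rm in case any school was filtered out entirely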
We could certainly allow the school ID School to be a fixed effect, i.e., a standard categorical feature. What's the problem with this, though?
Instead, we allow School to be a random Gaussian variable, such that it can model performance differentials across schools without adding an excessive number of categories.
To understand this in the context of our model, we first consider the standard regression equation: \[ Y = X\beta + \epsilon,\\ \epsilon \sim N(0,\sigma) \] where,
- Y is an n-dimensional vector of MathAch scores
- X is an \(n \times p\) matrix of fixed effects (including the intercept column)
  - This includes SES, Minority, Sex, and School
  - Recall that there are 160 schools, thus we have 159 extra one-hot encoded features
  - Thus, our number of parameters/fixed effects is 159 + 3 + 1 = 163
- \(\epsilon\) is an n-dimensional vector of independent errors, which we assume follow a normal distribution

I highlighted independent above because I want to raise the following: one of the assumptions of standard linear regression is that the errors across observations are uncorrelated. This need not be the case here. There are good and bad schools everywhere; math achievement should be correlated across children in the same school.
Now, in the latent variable context, our model becomes: \[ Y = X\beta + S + \epsilon,\\ S \sim N(0,\sigma_S),\\ \epsilon \sim N(0,\sigma),\\ \beta \sim \text{N}(a,b),\\ \sigma_S \sim \;??? \] where,
- Y is unchanged from the above.
- X is an \(n \times p\) matrix of fixed effects (now with \(p = 4\)).
  - This includes SES, Minority, and Sex. School is out!
  - Our parameters/fixed effects are now only 3 + 1 = 4.
- S is now a latent variable denoting the School effect (one Gaussian draw per school).
- \(\epsilon\) is unchanged from the above, unless we seek to also model observation error, which may be too involved for now…
- \(\sigma_S\) is a "new" flexibility/complexity hyperparameter (PC framework), which indicates the (unobserved) variability across School that we have not controlled for in our model.
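It is worth spelling out the consequence: under this model, two children in the same school share the same draw of \(S\), so their scores are correlated even after accounting for the fixed effects: \[ \text{Cor}(Y_i, Y_j) = \frac{\sigma_S^2}{\sigma_S^2 + \sigma^2} \] This intraclass correlation is exactly what the standard regression equation forces to zero.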
There already exist tried-and-tested libraries for fitting mixed effects models through restricted maximum likelihood estimation (glmer from lme4, nlme). But these can be numerically unstable, do not take in priors, and will not give us the same level of detail and mathematical rigor as an (approximate) Bayesian treatment.
Moreover, what if we had repeated test scores for each child? What if the schools surveyed spanned multiple cities across multiple states? These are multi-level hierarchical models, and they are INLA's bread and butter.
Recall the primordial Bayes' rule in the context of model inference: \[ \begin{align} P(\theta|D)&=\frac{P(D|\theta)P(\theta)}{P(D)} \\ &\propto P(D|\theta)P(\theta) \end{align}\\ \text{where } P(D) = \int P(D|\theta)P(\theta)d\theta \]
This is the basic syntax for INLA:
inla(y ~ x + f(latent,model="iid"), family = c("gaussian"),
data = data, control.predictor=list(link=1))
Now, family in the above corresponds to the distribution of the target variable. INLA handles the standard distributions: Gaussian (standard regression), Binomial (logistic regression), Poisson, Gamma, Weibull, etc.
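To see the full list of likelihood families your installation supports, one way (assuming a recent INLA build, where inla.models() describes the available model components) is:
# List the likelihood families known to this INLA installation
names(inla.models()$likelihood)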
Let’s set up our train-test dataset:
# We need to set the seed for reproducibility!
set.seed(1234)
cleanMathAchieve <- cleanMathAchieve[sample(nrow(cleanMathAchieve)),]
# Using 90% of data to train just for kicks. Can use more traditional splits: 80-20, 70-30
# I won't employ a loss-based approach here
trainMathAchieve <- cleanMathAchieve %>%
rownames_to_column('idd') %>%
sample_frac(.90) %>%
column_to_rownames('idd')
# Let's re-factor these bad boys in case we left out entire schools from training set
trainMathAchieve$School <- factor(trainMathAchieve$School)
# Row names here are the original row labels, not positions, so split by
# name rather than by negative numeric index (which would misalign)
testMathAchieve <- cleanMathAchieve[!(rownames(cleanMathAchieve) %in% rownames(trainMathAchieve)), ]
Now let’s prepare our variables for model-fitting:
trainX <- within(trainMathAchieve, rm("MathAch"))
# Preserving the index
trainY <- trainMathAchieve[c("MathAch")]
testX <- within(testMathAchieve, rm("MathAch"))
# Preserving the index
testY <- testMathAchieve[c("MathAch")]
testYdum <- testY %>%
rownames_to_column('idd') %>%
# Censor ALL test responses (replacing only values > 0 would leak any exact-zero scores)
mutate_if(is.numeric, ~replace(., TRUE, NA)) %>%
column_to_rownames('idd')
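Two quick sanity checks before we stack the data: the split should partition the cleaned data, and no test labels should survive the censoring:
stopifnot(nrow(trainMathAchieve) + nrow(testMathAchieve) == nrow(cleanMathAchieve))
stopifnot(all(is.na(testYdum$MathAch)))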
Let's fit a simple additive model with School as a Gaussian latent variable. We shall see how well this model predicts MathAch for the testing set. INLA is pretty "tricky" with predictions: being a generative model, we simply censor the test responses (as above) and fit the model to the stacked data:
# Stack train and censored test rows, matching columns by name
complete_data = rbind(
trainMathAchieve[, c("MathAch", "School", "Sex", "Minority", "SES", "MEANSES")],
cbind(testYdum, testX[, c("School", "Sex", "Minority", "SES", "MEANSES")])
)
predmod1 = inla(MathAch ~ Minority + Sex + SES + f(School,model="iid"),
family = "gaussian",
data = complete_data,
control.predictor = list(link = 1))
predmod1$summary.fixed[, 1:6] %>%
kable(digits = 3) %>%
kable_styling(bootstrap_options = "striped", full_width = F)
 | mean | sd | 0.025quant | 0.5quant | 0.975quant | mode
---|---|---|---|---|---|---
(Intercept) | 13.186 | 0.180 | 12.832 | 13.186 | 13.540 | 13.186
MinorityYes | -2.819 | 0.207 | -3.226 | -2.819 | -2.412 | -2.819
SexMale | 1.233 | 0.165 | 0.909 | 1.233 | 1.556 | 1.233
SES | 2.041 | 0.109 | 1.827 | 2.041 | 2.256 | 2.041
A big no-no in Bayesian computation software is to use the default priors… INLA places priors on the precision, which corresponds to \(\frac{1}{\sigma^2}\) for \(\sigma\) the standard deviation, so its default log-gamma distribution might seem a sensible choice. Why not? The implied prior on the precision lives on \((0,\infty)\), so it's not incorrect.
Penalized Complexity (PC) actually provides a principled framework to formulate a sensible prior that, after inference, would yield a "parsimonious" model.
It puts a prior on the KL divergence (a measure of the difference between probability distributions), penalizing departures from a simpler base model.
Think of it as specifying \(P(\sigma > u) = \alpha\); below we use \(P(\sigma > 5) = 0.05\).
So we fit a “better” model:
predmod2 = inla(MathAch ~ Minority + Sex + SES + f(School,model="iid",
hyper=list(prec=list(
prior = 'pc.prec',
param=c(u = 5, alpha = 0.05))
)
),
control.fixed = list(
mean.intercept=0,
prec.intercept=1/(3^2),
mean=0, prec= 1/(3^2)
),
family = "gaussian", data = complete_data,
control.predictor = list(link = 1),
control.compute=list(return.marginals=TRUE)
)
# PC prior on sdev
# gaussian prior on predictors
# Likelihood can be Gaussian, or gamma, or others...
predmod2$summary.fixed[, 1:6] %>%
kable(digits = 3) %>%
kable_styling(bootstrap_options = "striped", full_width = F)
 | mean | sd | 0.025quant | 0.5quant | 0.975quant | mode
---|---|---|---|---|---|---
(Intercept) | 13.137 | 0.181 | 12.780 | 13.137 | 13.492 | 13.137
MinorityYes | -2.790 | 0.207 | -3.196 | -2.790 | -2.384 | -2.790
SexMale | 1.247 | 0.165 | 0.924 | 1.247 | 1.570 | 1.247
SES | 2.038 | 0.109 | 1.824 | 2.038 | 2.252 | 2.038
and we examine the posteriors of some of its fixed effects:
for(D in c('MinorityYes', 'SexMale','SES')) {
plot(predmod2$marginals.fixed[[D]],type='l',col='blue',ylab="dens",
xlab=D)
xseq = predmod2$marginals.fixed[[D]][,'x']
# Overlay the N(0, 3^2) prior we specified in control.fixed
lines(xseq, dnorm(xseq, sd=3), type='l', col='red')
# Legend inside the loop so it appears on every plot, not just the last
legend('topleft',lty=1,col=c('blue','red'),legend=c('post','prior'))
}
Let's look at the overall effect of our random effect School; after all, it could be that the school in which the child studies has no effect…
reparameterized = Pmisc::priorPostSd(predmod2)
reparameterized$summary %>%
kable(digits = 2) %>%
kable_styling(bootstrap_options = "striped", full_width = F)
 | mean | sd | 0.025quant | 0.5quant | 0.975quant | mode
---|---|---|---|---|---|---
SD for School | 1.7 | 0.13 | 1.46 | 1.69 | 1.97 | 1.68
plot(reparameterized$posterior,type='l',col='blue',ylab="dens",
xlab= "posterior")
lines(reparameterized$prior, type='l', col='red')
legend('topleft',lty=1,col=c('blue','red'),legend=c('post','prior'))
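With the posterior of the school-level SD in hand, we can plug posterior means into the intraclass correlation from earlier. A minimal sketch, assuming INLA's standard hyperparameter labels ("Precision for the Gaussian observations" and "Precision for School"):
# Plug-in (posterior mean) estimate of the intraclass correlation
# (check rownames(predmod2$summary.hyperpar) if the labels differ)
prec <- predmod2$summary.hyperpar$mean
names(prec) <- rownames(predmod2$summary.hyperpar)
var_obs <- 1 / prec["Precision for the Gaussian observations"]
var_school <- 1 / prec["Precision for School"]
var_school / (var_school + var_obs) # share of total variance between schools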
Turns out it does! What does the posterior effect \(E(S_j|Y)\) look like for individual schools? Let's plot a few:
for(D in c(8, 50, 100)) {
plot(predmod2$marginals.random$School[[D]],type='l',col='blue',ylab="dens",
xlab=paste("School index", D))
xseq = predmod2$marginals.random$School[[D]][,'x']
# N(0, 3^2) reference curve; the schools' actual prior SD is the hyperparameter above
lines(xseq, dnorm(xseq, sd=3), type='l', col='red')
legend('topleft',lty=1,col=c('blue','red'),legend=c('post','prior'))
}
What about a regularized model with priors N(0,1) on the fixed effects?
predmod3 = inla(MathAch ~ Minority + Sex + SES + f(School,model="iid",
hyper=list(prec=list(
prior = 'pc.prec',
param=c(u = 5, alpha = 0.05))
)
),
control.fixed = list(
mean.intercept=0,
prec.intercept=1,
mean=0, prec= 1
),
family = "gaussian", data = complete_data,
control.predictor = list(link = 1)
)
predmod3$summary.fixed[, 1:6] %>%
kable(digits = 3) %>%
kable_styling(bootstrap_options = "striped", full_width = F)
 | mean | sd | 0.025quant | 0.5quant | 0.975quant | mode
---|---|---|---|---|---|---
(Intercept) | 12.744 | 0.184 | 12.378 | 12.746 | 13.101 | 12.749
MinorityYes | -2.569 | 0.203 | -2.969 | -2.570 | -2.170 | -2.570
SexMale | 1.359 | 0.163 | 1.040 | 1.359 | 1.678 | 1.359
SES | 2.031 | 0.109 | 1.818 | 2.031 | 2.245 | 2.031
for(D in c('MinorityYes', 'SexMale','SES')) {
plot(predmod3$marginals.fixed[[D]],type='l',col='blue',ylab="dens",
xlab=D)
xseq = predmod3$marginals.fixed[[D]][,'x']
# Overlay the N(0, 1) prior specified in control.fixed
lines(xseq, dnorm(xseq, sd=1), type='l', col='red')
legend('topleft',lty=1,col=c('blue','red'),legend=c('post','prior'))
}
But what about prediction? Turns out we’ve already generated our predictions and we can assess these using the standard Mean Square Error (MSE).
Our predictions were generated from the predictive posterior distribution:
\[ P(y^{*} | D , x^{*}) = \int{P(\theta | D)P(y^{*}|x^{*},\theta)d\theta} \]
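Concretely, these posterior predictive summaries are already sitting in summary.fitted.values, ordered like complete_data, so we can peek at the held-out rows directly:
# Posterior predictive mean and sd for the first few held-out observations
test_idx <- (nrow(trainMathAchieve) + 1):nrow(complete_data)
head(predmod1$summary.fitted.values[test_idx, c("mean", "sd")])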
# Let's set up a function to calculate MSE just for fun
# Can also use MLmetrics::MSE if we're lazy
mse_calc <- function(x, y){
# Vectorized mean squared error
mean((x - y)^2)
}
# Train MSE for model 1 with default priors
train_mse_1 <- mse_calc(predmod1$summary.fitted.values$mean[1:length(trainY$MathAch)],trainY$MathAch)
# Test MSE
start_n <- length(trainY$MathAch)+1
end_n <- length(complete_data$MathAch)
test_mse_1 <- mse_calc(predmod1$summary.fitted.values$mean[start_n:end_n],testY$MathAch)
# Train MSE for model 2 with user-specified priors
train_mse_2 <- mse_calc(predmod2$summary.fitted.values$mean[1:length(trainY$MathAch)],trainY$MathAch)
# Test MSE
start_n <- length(trainY$MathAch)+1
end_n <- length(complete_data$MathAch)
test_mse_2 <- mse_calc(predmod2$summary.fitted.values$mean[start_n:end_n],testY$MathAch)
# Train MSE for model 3 with regularized N(0,1) priors
train_mse_3 <- mse_calc(predmod3$summary.fitted.values$mean[1:length(trainY$MathAch)],trainY$MathAch)
# Test MSE
start_n <- length(trainY$MathAch)+1
end_n <- length(complete_data$MathAch)
test_mse_3 <- mse_calc(predmod3$summary.fitted.values$mean[start_n:end_n],testY$MathAch)
train_mse_b <- c(train_mse_1, train_mse_2, train_mse_3)
test_mse_b <- c(test_mse_1, test_mse_2, test_mse_3)
results <- data.frame(train_mse_b,test_mse_b, row.names=c("Model 1","Model 2", "Model 3"))
colnames(results) <- c("Train MSE", "Test MSE")
results %>%
kable(digits = 3) %>%
kable_styling(bootstrap_options = "striped", full_width = F)
 | Train MSE | Test MSE
---|---|---
Model 1 | 31.765 | 31.627
Model 2 | 31.757 | 31.621
Model 3 | 31.754 | 31.627
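For context, compare these numbers against the naive baseline that predicts the training mean for every child; a test MSE close to that baseline would mean the covariates and school effects buy us nothing:
# Test MSE of the "predict the training mean" baseline
mse_calc(rep(mean(trainY$MathAch), nrow(testY)), testY$MathAch)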
Turns out that our choice of priors did not affect the prediction error very much. But what if we chose an outlandish prior? Or changed the likelihood?
We can proceed to do a sensitivity analysis on our various choices of priors and move on to “model selection”…
I want to highlight that INLA is particularly well suited for hierarchical models with Gaussian random effects.
It is only available for R, as are many cutting-edge scientific computation frameworks.
If you need generalized models, I encourage you to explore Stan and its numerous wrappers: PyStan, RStan, etc.
I can tackle Stan next time…
Fantastic online INLA resource here
For all my hardcore peeps, I encourage you to read up on Monte Carlo methods (integration, optimization, etc.) and play around with Monte Carlo libraries, e.g., PyMC.
Bayesian inference is not that hard to incorporate into your Machine Learning practice.
Useful as long as you know what you are doing! Prior specification is a difficult art form.
Especially useful when you have limited amounts of data (expensive data collection/labelling, rare events, etc.).
Unlikely to outperform ML approaches on ML tasks (prediction, classification, etc.), especially as the size of the dataset grows.
A properly specified Bayesian model is a generative model, so you should be able to easily generate data to check the consistency of your model with the dataset/data-generating process in question; see the sketch below.
EDA and plots are crucial before, throughout, and after the model building and checking process
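To make that last point concrete, here is a minimal sketch: refit model 2 with config = TRUE (which the fits above did not request) so that joint draws from the approximate posterior become available:
# Refit with config = TRUE to enable joint posterior sampling
predmod2c <- inla(MathAch ~ Minority + Sex + SES +
                    f(School, model = "iid",
                      hyper = list(prec = list(prior = "pc.prec",
                                               param = c(5, 0.05)))),
                  family = "gaussian", data = complete_data,
                  control.predictor = list(link = 1),
                  control.compute = list(config = TRUE))
samp <- inla.posterior.sample(10, predmod2c) # ten joint posterior draws
str(samp[[1]]$latent) # one draw of the full latent field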