knitr::opts_chunk$set(echo = TRUE)
library(MEMSS); library(tidyverse); library(kableExtra); library(INLA); library(Pmisc); library(naniar)
set.seed(12345)
# install.packages("Pmisc", repos="http://R-Forge.R-project.org")
# install.packages("INLA", repos=c(getOption("repos"), INLA="https://inla.r-inla-download.org/R/stable"), dep=TRUE)

Introduction

INLA is a framework for fast, approximate Bayesian inference. It is especially tailored to hierarchical, latent-variable models.

INLA stands for Integrated Nested Laplace Approximation, and it is particularly well suited to models with Gaussian latent variables. If you are interested in fitting hierarchical models with non-Gaussian random effects, or other probabilistic models more broadly, I encourage you to explore Stan.

You can refer to my session slides and their references for the theoretical and practical background to INLA.

Let’s begin by setting it up on our system (required: R and RStudio). INLA is not available on CRAN, so run the following to install it:

install.packages("INLA", repos=c(getOption("repos"), INLA="https://inla.r-inla-download.org/R/stable"), dep=TRUE)

As with any other R library, to use it in your work you first have to load it:

library(INLA)

Other packages useful for this workshop, and for modern R usage in general, are tidyverse, kableExtra, and ggplot2. Our dataset comes from the MEMSS package, and some additional functionality for reformatting the output of an INLA model comes from Pmisc.

Demo

Let’s jump into a direct application of INLA. We can examine math achievement scores across schools and individuals as a hierarchical regression problem.

Our dataset consists of 7,185 observations depicting the individual math achievement scores (MathAch) of children in a metropolitan area, across multiple schools (School). In addition, the dataset contains three demographic variables: Binary membership of a minority racial group (Minority), binary sex (Sex), and numeric socioeconomic status (SES).

Research Question: Controlling for sex, socioeconomic status, and minority status, what are the effects of these fixed effects, and of the school a child attends, on a child’s math achievement score?

Loading the data

library(MEMSS)
data(MathAchieve)

MathAchieve %>%
  head() %>%
  kableExtra::kable(digits = 3, escape = FALSE, booktabs = TRUE, linesep = "") %>% 
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
School  Minority  Sex     SES      MathAch  MEANSES
1224    No        Female  -1.528     5.876   -0.428
1224    No        Female  -0.588    19.708   -0.428
1224    No        Male    -0.528    20.349   -0.428
1224    No        Male    -0.668     8.781   -0.428
1224    No        Male    -0.158    17.898   -0.428
1224    No        Male     0.022     4.583   -0.428

To formulate the models and assumptions suitable for this particular dataset, let’s start with some essential EDA (Exploratory Data Analysis).

EDA

Let’s have a quick look at some summary statistics:

MathAchieve %>%
  summary() %>%
  kableExtra::kable(digits = 3, escape = FALSE, booktabs = TRUE, linesep = "") %>% 
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
School         Minority   Sex           SES                 MathAch          MEANSES
2305   :  67   No :5211   Female:3795   Min.   :-3.758000   Min.   :-2.832   Min.   :-1.188000
5619   :  66   Yes:1974   Male  :3390   1st Qu.:-0.538000   1st Qu.: 7.275   1st Qu.:-0.317000
4292   :  65                            Median : 0.002000   Median :13.131   Median : 0.038000
3610   :  64                            Mean   : 0.000143   Mean   :12.748   Mean   : 0.006138
4042   :  64                            3rd Qu.: 0.602000   3rd Qu.:18.317   3rd Qu.: 0.333000
8857   :  64                            Max.   : 2.692000   Max.   :24.993   Max.   : 0.831000
(Other):6795

Ouch: do negative MathAch values make sense for a test score? Let’s drop them and work with cleanMathAchieve going forward.

cleanMathAchieve <- subset(MathAchieve, MathAch >= 0)
dim(cleanMathAchieve) # We filtered out 212 records
## [1] 6973    6

What likelihood should we choose to model our response variable?

hist(cleanMathAchieve$MathAch, freq = FALSE, main = "Histogram of MathAch", xlab = "MathAch")
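The distribution looks roughly symmetric and unimodal, which makes a Gaussian likelihood a reasonable first choice. As a quick visual check, we can overlay a normal density using the sample mean and standard deviation (a sketch, redrawing the histogram from above):

hist(cleanMathAchieve$MathAch, freq = FALSE, main = "Histogram of MathAch", xlab = "MathAch")
# Overlay the normal density implied by the sample moments
curve(dnorm(x, mean = mean(cleanMathAchieve$MathAch),
            sd = sd(cleanMathAchieve$MathAch)),
      col = "red", add = TRUE)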

Let’s look at our features via some pretty dull R base plots:

# Using the cleaned data, as promised above
plot(MathAch ~ SES, data = cleanMathAchieve)
plot(MathAch ~ Minority, data = cleanMathAchieve)
plot(MathAch ~ Sex, data = cleanMathAchieve)
EDA plots of fixed effects

Ignoring joint effects, it seems that MathAch is higher for non-minority children and for males, and that MathAch is positively correlated with SES.
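We can back that correlation claim with a number (a quick sketch):

# Sample correlation between SES and math achievement
cor(cleanMathAchieve$SES, cleanMathAchieve$MathAch)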

What about scores within and between schools? Let’s compare a sample of five schools:

sampled_schools <- c("2305", "5619", "4292", "3610", "4042")
graph_sample <- subset(cleanMathAchieve, School %in% sampled_schools)
graph_sample$School <- factor(graph_sample$School)  # drop unused levels

plot(MathAch ~ School, data=graph_sample)

There seem to be non-trivial differences in the performance levels between schools.

We could certainly let the school ID School be a fixed effect, i.e. a standard categorical feature. What’s the problem with this, though?

Enter hierarchical models!

We let the school effect be a Gaussian random variable, so that the model can capture performance differentials across schools without adding an excessive number of categories.
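To see the scale of the problem a fixed-effect encoding would create, count the schools: each one would need its own dummy column (minus one baseline). A quick check (a sketch; factor() drops any levels emptied by our cleaning step):

# Number of distinct schools left in the cleaned data
nlevels(factor(cleanMathAchieve$School))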

To understand this in the context of our model, first consider the standard regression equation: \[ Y = X\beta + \epsilon,\\ \epsilon \sim N(0,\sigma) \] where \(Y\) is the vector of responses, \(X\) the design matrix of fixed effects, \(\beta\) the vector of coefficients, and \(\epsilon\) a vector of independent errors with standard deviation \(\sigma\).

I stress independent above because I want to raise the following: one of the assumptions of standard linear regression is that the errors across observations are uncorrelated. That need not be the case here. There are good and bad schools everywhere; math achievement should be correlated across children in the same school.
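A quick empirical hint of this within-school correlation (a sketch): if schools matter, the per-school mean scores should spread out noticeably relative to the overall spread.

# Between-school spread of mean scores vs. overall spread of scores
school_means <- tapply(cleanMathAchieve$MathAch, cleanMathAchieve$School, mean)
sd(school_means, na.rm = TRUE)  # spread of per-school means
sd(cleanMathAchieve$MathAch)    # overall spread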

Now, in the latent-variable context, our model becomes: \[ Y = X\beta + S + \epsilon,\\ S \sim N(0,\sigma_S),\\ \epsilon \sim N(0,\sigma),\\ \beta \sim \text{N}(a,b),\\ \sigma_S \sim \;??? \] where \(S\) is the vector of school-level random effects, \(\sigma_S\) is the between-school standard deviation, and the prior on \(\sigma_S\) is, for now, left open (hence the \(???\)).

There already exist some tried-and-tested libraries that fit mixed-effects models through (restricted) maximum likelihood estimation (lme4::glmer, nlme::lme). But these can be numerically unstable, do not take priors, and will not give us the same level of detail and mathematical rigor as an (approximate) Bayesian treatment.

Moreover, what if we had repeated test scores for each child? What if the surveyed schools spanned multiple cities across multiple states? These are multi-level hierarchical models, and they are INLA’s bread and butter.
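As a sketch of what that would look like, extra levels are just extra f() terms in the formula. Note that City, State, and some_multilevel_data below are hypothetical; they are not part of MathAchieve:

# Hypothetical three-level model: children within schools within cities/states
inla(MathAch ~ Minority + Sex + SES +
       f(School, model = "iid") +
       f(City, model = "iid") +    # hypothetical city-level effect
       f(State, model = "iid"),    # hypothetical state-level effect
     family = "gaussian", data = some_multilevel_data)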

Modeling

Recall the primordial Bayesian formula in the context of model inference: \[ \begin{align} P(\theta|D)&=\frac{P(D|\theta)P(\theta)}{P(D)} \\ &\propto P(D|\theta)P(\theta) \end{align}\\ \text{where } P(D) = \int P(D|\theta)P(\theta)\,d\theta \]

This is the basic syntax for INLA:

inla(y ~ x + f(latent, model = "iid"), family = "gaussian",
     data = data, control.predictor = list(link = 1))

Now, family in the above corresponds to the distribution of the target variable. INLA handles the standard distributions: Gaussian (standard regression), binomial (logistic regression), Poisson, gamma, Weibull, and more.
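If you are unsure whether your target’s distribution is supported, you can list every likelihood your INLA installation knows about:

# All likelihood families available in the installed version of INLA
names(inla.models()$likelihood)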

Let’s set up our train-test dataset:

# We need to set the seed for reproducibility!
set.seed(1234)

cleanMathAchieve <- cleanMathAchieve[sample(nrow(cleanMathAchieve)),]
# Using 90% of data to train just for kicks. Can use more traditional splits: 80-20, 70-30
# I won't employ a loss-based approach here
trainMathAchieve <- cleanMathAchieve %>%
  rownames_to_column('idd') %>%
  sample_frac(.90) %>% 
  column_to_rownames('idd')

# Let's re-factor these bad boys in case we left out entire schools from training set
trainMathAchieve$School <- factor(trainMathAchieve$School)

# Anti-join on rownames: rownames are labels from the original data, not positions,
# so negative positional indexing with them would drop the wrong rows
sid <- rownames(trainMathAchieve)
testMathAchieve <- cleanMathAchieve[!(rownames(cleanMathAchieve) %in% sid), ]
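A quick sanity check (a sketch) that the split is clean, i.e. the sizes add up and no row appears in both sets:

# Both checks should return TRUE
nrow(trainMathAchieve) + nrow(testMathAchieve) == nrow(cleanMathAchieve)
length(intersect(rownames(trainMathAchieve), rownames(testMathAchieve))) == 0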

Now let’s prepare our variables for model-fitting:

trainX <- within(trainMathAchieve, rm("MathAch"))
# Preserving the index
trainY <- trainMathAchieve[c("MathAch")]

testX <- within(testMathAchieve, rm("MathAch"))

# Preserving the index
testY <- testMathAchieve[c("MathAch")]

# Censor the test responses: set MathAch to NA so INLA treats them as missing.
# (Replacing only values > 0 would leave exact-zero scores uncensored.)
testYdum <- testY
testYdum$MathAch <- NA

Model 1

Let’s fit a simple additive model with School as a Gaussian latent variable, and see how well it predicts MathAch for the testing set. INLA handles prediction in a slightly “tricky” way: being a generative framework, it has no separate predict step. We simply censor the test responses (set them to NA), fit the model to the combined data, and read off the fitted values for the censored rows as our predictions:

# rbind matches data.frame columns by name, so no manual reordering is needed
complete_data = rbind(trainMathAchieve, cbind(testYdum, testX))

predmod1 = inla(MathAch ~ Minority + Sex + SES + f(School, model = "iid"),
                family = "gaussian",
                data = complete_data,
                control.predictor = list(link = 1))

predmod1$summary.fixed[, 1:6] %>%
  kable(digits = 3) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
             mean     sd     0.025quant  0.5quant  0.975quant  mode
(Intercept)  13.186   0.180  12.832      13.186    13.540      13.186
MinorityYes  -2.819   0.207  -3.226      -2.819    -2.412      -2.819
SexMale       1.233   0.165   0.909       1.233     1.556       1.233
SES           2.041   0.109   1.827       2.041     2.256       2.041

A big no-no in Bayesian computation is to blindly use the default priors… INLA places its priors on the precision, which corresponds to \(\frac{1}{\sigma^2}\) for \(\sigma\) the standard deviation, and its default is a log-gamma distribution. Should we just accept that? It is defined on \([0,\infty)\), so it is not incorrect, but being well-defined is not the same as being a sensible encoding of our beliefs.

  • What’s the rationale for putting a prior on the “precision” hyperparameter, and how should we choose that prior’s parameters?

Penalized Complexity (PC) priors actually provide a principled framework for formulating a sensible prior that, after inference, yields a “parsimonious” model.

  • They put a prior on a distance between probability distributions (based on the Kullback–Leibler divergence) from a simpler base model.

  • In practice, think of it through a tail statement such as \(P(\sigma_S > u)=\alpha\); we use \(P(\sigma_S > 5)=0.05\) below. See the sketch after this list.
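Concretely, a PC prior with tail statement \(P(\sigma_S > u)=\alpha\) makes the implied prior on \(\sigma_S\) an exponential distribution with rate \(\lambda = -\log(\alpha)/u\). A sketch of that density for \(u = 5\), \(\alpha = 0.05\), the values we use below:

# The PC prior on a standard deviation is exponential with rate -log(alpha)/u
u <- 5; alpha <- 0.05
lambda <- -log(alpha) / u
curve(dexp(x, rate = lambda), from = 0, to = 15,
      xlab = expression(sigma[S]), ylab = "prior density")
abline(v = u, lty = 2)  # 5% of the prior mass lies to the right of u = 5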

Model 2

So we fit a “better” model:

predmod2 = inla(MathAch ~ Minority + Sex + SES +
                  f(School, model = "iid",
                    hyper = list(prec = list(
                      prior = "pc.prec",
                      param = c(u = 5, alpha = 0.05)))),
                control.fixed = list(
                  mean.intercept = 0,
                  prec.intercept = 1/(3^2),
                  mean = 0, prec = 1/(3^2)
                ),
                family = "gaussian", data = complete_data,
                control.predictor = list(link = 1),
                control.compute = list(return.marginals = TRUE)
)
# PC prior on the standard deviation of the School random effect
# Gaussian N(0, 3^2) priors on the intercept and fixed effects (via precisions)
# The likelihood can be Gaussian, or gamma, or others...


predmod2$summary.fixed[, 1:6] %>%
  kable(digits = 3) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
             mean     sd     0.025quant  0.5quant  0.975quant  mode
(Intercept)  13.137   0.181  12.780      13.137    13.492      13.137
MinorityYes  -2.790   0.207  -3.196      -2.790    -2.384      -2.790
SexMale       1.247   0.165   0.924       1.247     1.570       1.247
SES           2.038   0.109   1.824       2.038     2.252       2.038

and we examine the posteriors of some of its fixed effects:

# Overlay each posterior marginal (blue) with its N(0, 3^2) prior (red)
for(D in c('MinorityYes', 'SexMale', 'SES')) {
  plot(predmod2$marginals.fixed[[D]], type = 'l', col = 'blue',
       ylab = "dens", xlab = D)
  xseq <- predmod2$marginals.fixed[[D]][, 'x']
  lines(xseq, dnorm(xseq, sd = 3), type = 'l', col = 'red')
}

legend('topleft',lty=1,col=c('blue','red'),legend=c('post','prior'))
Priors and Posteriors of Fixed Effects \(\beta\)

Let’s look at the overall effect of our random effect School: after all, it could be that the school a child attends has no effect…

reparameterized = Pmisc::priorPostSd(predmod2)
reparameterized$summary %>%
  kable(digits = 2) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
               mean  sd    0.025quant  0.5quant  0.975quant  mode
SD for School  1.70  0.13  1.46        1.69      1.97        1.68

plot(reparameterized$posterior, type = 'l', col = 'blue',
     ylab = "dens", xlab = "posterior")
lines(reparameterized$prior, type = 'l', col = 'red')

legend('topleft', lty = 1, col = c('blue', 'red'), legend = c('post', 'prior'))
Priors and Posterior of the flexibility hyperparameter \(\sigma_S\)

Turns out it does: the posterior of \(\sigma_S\) is concentrated well away from zero. What does the posterior mean effect \(E(S_j|Y)\) look like for each school?

# Posterior marginals of the School random effect for three example schools
for(D in c(8, 50, 100)) {
  plot(predmod2$marginals.random$School[[D]], type = 'l', col = 'blue',
       ylab = "dens", xlab = D)
  xseq <- predmod2$marginals.random$School[[D]][, 'x']
  lines(xseq, dnorm(xseq, sd = 3), type = 'l', col = 'red')
}

legend('topleft',lty=1,col=c('blue','red'),legend=c('post','prior'))
Priors and Posteriors of Random Effects (\(S\)) for a Set of Schools
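Beyond individual marginals, predmod2$summary.random$School holds one summary row per school, so we can rank schools by their posterior mean effect (a sketch):

# Five most negative and five most positive school effects, with 95% intervals
school_effects <- predmod2$summary.random$School
school_effects <- school_effects[order(school_effects$mean), ]
head(school_effects[, c("ID", "mean", "0.025quant", "0.975quant")], 5)
tail(school_effects[, c("ID", "mean", "0.025quant", "0.975quant")], 5)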

Model 3

What about a regularized model with priors N(0,1) on the fixed effects?

predmod3 = inla(MathAch ~ Minority + Sex + SES +
                  f(School, model = "iid",
                    hyper = list(prec = list(
                      prior = "pc.prec",
                      param = c(u = 5, alpha = 0.05)))),
                control.fixed = list(
                  mean.intercept = 0,
                  prec.intercept = 1,
                  mean = 0, prec = 1
                ),
                family = "gaussian", data = complete_data,
                control.predictor = list(link = 1)
)


predmod3$summary.fixed[, 1:6] %>%
  kable(digits = 3) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
             mean     sd     0.025quant  0.5quant  0.975quant  mode
(Intercept)  12.744   0.184  12.378      12.746    13.101      12.749
MinorityYes  -2.569   0.203  -2.969      -2.570    -2.170      -2.570
SexMale       1.359   0.163   1.040       1.359     1.678       1.359
SES           2.031   0.109   1.818       2.031     2.245       2.031

# Overlay each posterior marginal (blue) with its tighter N(0, 1) prior (red)
for(D in c('MinorityYes', 'SexMale', 'SES')) {
  plot(predmod3$marginals.fixed[[D]], type = 'l', col = 'blue',
       ylab = "dens", xlab = D)
  xseq <- predmod3$marginals.fixed[[D]][, 'x']
  lines(xseq, dnorm(xseq, sd = 1), type = 'l', col = 'red')
}

legend('topleft',lty=1,col=c('blue','red'),legend=c('post','prior'))
Priors and Posteriors of Fixed Effects \(\beta\)

But what about prediction? It turns out we have already generated our predictions, and we can assess them using the standard Mean Squared Error (MSE).

Our predictions were generated from the posterior predictive distribution:

\[ P(y^{*} | D , x^{*}) = \int{P(\theta | D)P(y^{*}|x^{*},\theta)d\theta} \]

# A small function to calculate MSE, just for fun
# (vectorized; MLmetrics::MSE does the same if we're lazy)
mse_calc <- function(x, y) {
  mean((x - y)^2)
}
# Train MSE for model 1 with default priors
train_mse_1 <- mse_calc(predmod1$summary.fitted.values$mean[1:length(trainY$MathAch)],trainY$MathAch)

# Test MSE
start_n <- length(trainY$MathAch)+1
end_n <- length(complete_data$MathAch)
test_mse_1 <- mse_calc(predmod1$summary.fitted.values$mean[start_n:end_n],testY$MathAch)
# Train MSE for model 2 with user-specified priors
train_mse_2 <- mse_calc(predmod2$summary.fitted.values$mean[1:length(trainY$MathAch)],trainY$MathAch)

# Test MSE
start_n <- length(trainY$MathAch)+1
end_n <- length(complete_data$MathAch)
test_mse_2 <- mse_calc(predmod2$summary.fitted.values$mean[start_n:end_n],testY$MathAch)
# Train MSE for model 3 with tighter N(0,1) priors
train_mse_3 <- mse_calc(predmod3$summary.fitted.values$mean[1:length(trainY$MathAch)],trainY$MathAch)

# Test MSE
start_n <- length(trainY$MathAch)+1
end_n <- length(complete_data$MathAch)
test_mse_3 <- mse_calc(predmod3$summary.fitted.values$mean[start_n:end_n],testY$MathAch)
train_mse_b <- c(train_mse_1, train_mse_2, train_mse_3)
test_mse_b <- c(test_mse_1, test_mse_2, test_mse_3)
results <- data.frame(train_mse_b,test_mse_b, row.names=c("Model 1","Model 2", "Model 3"))
colnames(results) <- c("Train MSE", "Test MSE")

results %>%
  kable(digits = 3) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
         Train MSE  Test MSE
Model 1  31.765     31.627
Model 2  31.757     31.621
Model 3  31.754     31.627
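MSE only scores the posterior mean. Since INLA returns full distributions, we can also check interval coverage on the test set. A sketch, reusing start_n and end_n from above; note that with link = 1 these quantiles describe the latent mean rather than a full predictive interval with observation noise, so coverage will fall short of the nominal 95%:

# Fraction of test scores inside the 95% credible interval of the fitted mean
lower_q <- predmod2$summary.fitted.values[start_n:end_n, "0.025quant"]
upper_q <- predmod2$summary.fitted.values[start_n:end_n, "0.975quant"]
mean(testY$MathAch >= lower_q & testY$MathAch <= upper_q)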

It turns out that our choice of priors did not affect the prediction error very much. But what if we chose an outlandish prior? Or changed the likelihood?

We can proceed to do a sensitivity analysis on our various choices of priors and move on to “model selection”…

Discussion