knitr::opts_chunk$set(echo = TRUE)
library(MEMSS); library(tidyverse); library(kableExtra); library(INLA); library(Pmisc); library(naniar)
set.seed(12345)
# install.packages("Pmisc", repos="http://R-Forge.R-project.org")
# install.packages("INLA", repos=c(getOption("repos"), INLA="https://inla.r-inla-download.org/R/stable"), dep=TRUE)
INLA is a framework for fast, approximate Bayesian inference. It is especially tailored for hierarchical, latent variable models.
INLA stands for Integrated Nested Laplace Approximation, and it is particularly well suited for models with Gaussian latent variables. If you are interested in fitting generalized hierarchical models with non-normal random effects, as well as other probabilistic models, I encourage you to explore Stan.
You can refer to my session slides and their references for the theoretical and practical background to INLA.
Let's begin by setting it up on our system (required: R and RStudio). INLA is not available on CRAN, so run the following to install it:
install.packages("INLA", repos=c(getOption("repos"), INLA="https://inla.r-inla-download.org/R/stable"), dep=TRUE)
As with any other R library, to use it in your work you first have to load it:
library(INLA)
Other useful packages for this workshop, and for modern R usage in general, are tidyverse, kableExtra, and ggplot2. Our dataset can be fetched from the MEMSS package, and some additional functionality for reformatting the output of an INLA model can be obtained from Pmisc.
Let’s jump into a direct application of INLA. We can examine math achievement scores across schools and individuals as a hierarchical regression problem.
Our dataset consists of 7,185 observations depicting the individual math achievement scores (MathAch) of children in a metropolitan area, across multiple schools (School). In addition, the dataset contains three demographic variables: binary membership of a minority racial group (Minority), binary sex (Sex), and numeric socioeconomic status (SES).
Research Question: Controlling for sex, socioeconomic status, and minority status, what is the role of these fixed effects, as well as that of the school a child attends, with regard to a child's math achievement scores?
library(MEMSS)
data(MathAchieve)
MathAchieve %>%
head() %>%
kableExtra::kable(digits = 3, escape = FALSE, booktabs = TRUE, linesep = "") %>%
kable_styling(bootstrap_options = "striped", full_width = F)
School | Minority | Sex | SES | MathAch | MEANSES
---|---|---|---|---|---
1224 | No | Female | -1.528 | 5.876 | -0.428
1224 | No | Female | -0.588 | 19.708 | -0.428
1224 | No | Male | -0.528 | 20.349 | -0.428
1224 | No | Male | -0.668 | 8.781 | -0.428
1224 | No | Male | -0.158 | 17.898 | -0.428
1224 | No | Male | 0.022 | 4.583 | -0.428
To formulate the models and assumptions suitable for this particular dataset, let’s start with some essential EDA (Exploratory Data Analysis).
Let’s have a quick look at some summary statistics:
MathAchieve %>%
summary() %>%
kableExtra::kable(digits = 3, escape = FALSE, booktabs = TRUE, linesep = "") %>%
kable_styling(bootstrap_options = "striped", full_width = F)
School | Minority | Sex | SES | MathAch | MEANSES
---|---|---|---|---|---
2305 : 67 | No :5211 | Female:3795 | Min. :-3.758 | Min. :-2.832 | Min. :-1.188
5619 : 66 | Yes:1974 | Male :3390 | 1st Qu.:-0.538 | 1st Qu.: 7.275 | 1st Qu.:-0.317
4292 : 65 | | | Median : 0.002 | Median :13.131 | Median : 0.038
3610 : 64 | | | Mean : 0.000143 | Mean :12.748 | Mean : 0.006138
4042 : 64 | | | 3rd Qu.: 0.602 | 3rd Qu.:18.317 | 3rd Qu.: 0.333
8857 : 64 | | | Max. : 2.692 | Max. :24.993 | Max. : 0.831
(Other):6795 | | | | |
Ouch, does it make sense to have negative MathAch scores? A negative value on a test score? Let's clean it up and work with cleanMathAchieve going forward.
cleanMathAchieve <- subset(MathAchieve, MathAch >= 0)
dim(cleanMathAchieve) # We filtered out 212 records
## [1] 6973 6
What likelihood should we choose to model our response variable?
hist(cleanMathAchieve$MathAch,freq=F)
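To help eyeball whether a Gaussian likelihood is plausible, we can overlay a normal density using the sample mean and SD (a quick sketch, nothing rigorous):
# Overlay a normal density with matching moments on the histogram
hist(cleanMathAchieve$MathAch, freq = FALSE)
curve(dnorm(x, mean = mean(cleanMathAchieve$MathAch),
            sd = sd(cleanMathAchieve$MathAch)),
      add = TRUE, col = "red")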
Let’s look at our features via some pretty dull R base plots:
plot(MathAch ~ SES, data=cleanMathAchieve)
plot(MathAch ~ Minority, data=cleanMathAchieve)
plot(MathAch ~ Sex, data=cleanMathAchieve)
Ignoring joint effects, it seems like MathAch is higher for non-minority children and for males. It also seems that there is a positive correlation between MathAch and SES.
What about scores within and between schools? Let's compare a handful of schools (the five largest, per the summary above):
graph_sample <- subset(cleanMathAchieve, School %in% c(2305, 5619, 4292, 3610, 4042))
graph_sample$School <- factor(graph_sample$School) # drop unused levels for plotting
plot(MathAch ~ School, data=graph_sample)
There seem to be non-trivial differences in the performance levels between schools.
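As a numeric companion to the boxplots, the spread of the per-school average scores gives a rough sense of the between-school variability we are about to model:
# Per-school mean MathAch across all schools
school_means <- tapply(cleanMathAchieve$MathAch, cleanMathAchieve$School, mean)
summary(school_means)
sd(school_means, na.rm = TRUE) # na.rm in case any school was filtered out entirely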
We could certainly allow the school ID School to be a fixed effect, i.e., a standard categorical feature. What's the problem with this, though?
Instead, we allow School to be a random Gaussian variable, such that it can model performance differentials across schools without adding an excessive number of categories.
To understand this in the context of our model, we first consider the standard regression equation: \[ Y = X\beta + \epsilon,\\ \epsilon \sim N(0,\sigma) \] where,
- Y is an n-dimensional vector of MathAch scores
- X is an \(n \times p\) matrix of fixed effects (including the intercept column)
  - This includes SES, Minority, Sex, and School
  - Recall that there are 160 schools, thus we have 159 extra one-hot encoded features
  - Thus, our number of parameters/fixed effects is 159 + 3 + 1 = 163
- \(\epsilon\) is an n-dimensional vector of independent errors, which we assume follow a normal distribution

I highlighted independent above because I want to raise the following: one of the assumptions of standard linear regression is that the errors across observations are uncorrelated. This need not be the case here. There are good and bad schools everywhere; math achievement should be correlated across children in the same school.
Now, in the latent variable context, our model becomes: \[ Y = X\beta + S + \epsilon,\\ S \sim N(0,\sigma_S),\\ \epsilon \sim N(0,\sigma),\\ \beta \sim \text{N}(a,b),\\ \sigma_S \sim \;??? \] where,
- Y is unchanged from the above.
- X is an \(n \times p\) matrix of fixed effects (now with \(p = 4\)).
  - This includes SES, Minority, and Sex. School is out!
  - Our parameters/fixed effects are now only 3 + 1 = 4.
- S is now a latent variable denoting the School effect (one Gaussian draw per school).
- \(\epsilon\) is unchanged from the above, unless we seek to also model observation error, which may be too involved for now…
- \(\sigma_S\) is a "new" flexibility/complexity hyperparameter (PC framework), which indicates the (unobserved) variability across School that we have not controlled for in our model.
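It is worth spelling out the consequence: under this model, two children in the same school share the same draw of \(S\), so their scores are correlated even after accounting for the fixed effects: \[ \text{Cor}(Y_i, Y_j) = \frac{\sigma_S^2}{\sigma_S^2 + \sigma^2} \] This intraclass correlation is exactly what the standard regression equation forces to zero.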
There already exist tried-and-tested libraries for fitting mixed effects models through restricted maximum likelihood estimation (glmer from lme4, nlme). But these can be numerically unstable, do not take in priors, and will not give us the same level of detail and mathematical rigor as an (approximate) Bayesian treatment.
Moreover, what if we had repeated test scores for each child? What if the schools surveyed spanned multiple cities across multiple states? These are multi-level hierarchical models, and they are INLA's bread and butter.
Recall the primordial Bayes' rule in the context of model inference: \[ \begin{align} P(\theta|D)&=\frac{P(D|\theta)P(\theta)}{P(D)} \\ &\propto P(D|\theta)P(\theta) \end{align}\\ \text{where } P(D) = \int P(D|\theta)P(\theta)d\theta \]
This is the basic syntax for INLA:
inla(y ~ x + f(latent,model="iid"), family = c("gaussian"),
data = data, control.predictor=list(link=1))
Now, family in the above corresponds to the distribution of the target variable. INLA handles the standard distributions: Gaussian (standard regression), Binomial (logistic regression), Poisson, Gamma, Weibull, etc.
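To see the full list of likelihood families your installation supports, one way (assuming a recent INLA build, where inla.models() describes the available model components) is:
# List the likelihood families known to this INLA installation
names(inla.models()$likelihood)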
Let’s set up our train-test dataset:
# We need to set the seed for reproducibility!
set.seed(1234)
cleanMathAchieve <- cleanMathAchieve[sample(nrow(cleanMathAchieve)),]
# Using 90% of data to train just for kicks. Can use more traditional splits: 80-20, 70-30
# I won't employ a loss-based approach here
trainMathAchieve <- cleanMathAchieve %>%
rownames_to_column('idd') %>%
sample_frac(.90) %>%
column_to_rownames('idd')
# Let's re-factor these bad boys in case we left out entire schools from training set
trainMathAchieve$School <- factor(trainMathAchieve$School)
# Row names here are the original row labels, not positions, so split by
# name rather than by negative numeric index (which would misalign)
testMathAchieve <- cleanMathAchieve[!(rownames(cleanMathAchieve) %in% rownames(trainMathAchieve)), ]
Now let’s prepare our variables for model-fitting:
trainX <- within(trainMathAchieve, rm("MathAch"))
# Preserving the index
trainY <- trainMathAchieve[c("MathAch")]
testX <- within(testMathAchieve, rm("MathAch"))
# Preserving the index
testY <- testMathAchieve[c("MathAch")]
testYdum <- testY %>%
rownames_to_column('idd') %>%
# Censor ALL test responses (replacing only values > 0 would leak any exact-zero scores)
mutate_if(is.numeric, ~replace(., TRUE, NA)) %>%
column_to_rownames('idd')
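Two quick sanity checks before we stack the data: the split should partition the cleaned data, and no test labels should survive the censoring:
stopifnot(nrow(trainMathAchieve) + nrow(testMathAchieve) == nrow(cleanMathAchieve))
stopifnot(all(is.na(testYdum$MathAch)))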
Let's fit a simple additive model with School as a Gaussian latent variable. We shall see how well this model predicts MathAch for the testing set. INLA is pretty "tricky" with predictions: being a generative model, we simply censor the test responses (as above) and fit the model to the stacked data:
# Stack train and censored test rows, matching columns by name
complete_data = rbind(
trainMathAchieve[, c("MathAch", "School", "Sex", "Minority", "SES", "MEANSES")],
cbind(testYdum, testX[, c("School", "Sex", "Minority", "SES", "MEANSES")])
)
predmod1 = inla(MathAch ~ Minority + Sex + SES + f(School,model="iid"),
family = "gaussian",
data = complete_data,
control.predictor = list(link = 1))
predmod1$summary.fixed[, 1:6] %>%
kable(digits = 3) %>%
kable_styling(bootstrap_options = "striped", full_width = F)
 | mean | sd | 0.025quant | 0.5quant | 0.975quant | mode
---|---|---|---|---|---|---
(Intercept) | 13.186 | 0.180 | 12.832 | 13.186 | 13.540 | 13.186
MinorityYes | -2.819 | 0.207 | -3.226 | -2.819 | -2.412 | -2.819
SexMale | 1.233 | 0.165 | 0.909 | 1.233 | 1.556 | 1.233
SES | 2.041 | 0.109 | 1.827 | 2.041 | 2.256 | 2.041
A big no-no in Bayesian computation software is to use the default priors… INLA places priors on the precision, which corresponds to \(\frac{1}{\sigma^2}\) for \(\sigma\) the standard deviation, so its default log-gamma distribution might seem a sensible choice. Why not? The implied prior on the precision lives on \((0,\infty)\), so it's not incorrect.
Penalized Complexity (PC) actually provides a principled framework to formulate a sensible prior that, after inference, would yield a "parsimonious" model.
It puts a prior on the KL divergence (a measure of the difference between probability distributions), penalizing departures from a simpler base model.
Think of it as specifying \(P(\sigma > u) = \alpha\); below we use \(P(\sigma > 5) = 0.05\).
So we fit a “better” model:
predmod2 = inla(MathAch ~ Minority + Sex + SES + f(School,model="iid",
hyper=list(prec=list(
prior = 'pc.prec',
param=c(u = 5, alpha = 0.05))
)
),
control.fixed = list(
mean.intercept=0,
prec.intercept=1/(3^2),
mean=0, prec= 1/(3^2)
),
family = "gaussian", data = complete_data,
control.predictor = list(link = 1),
control.compute=list(return.marginals=TRUE)
)
# PC prior on sdev
# gaussian prior on predictors
# Likelihood can be Gaussian, or gamma, or others...
predmod2$summary.fixed[, 1:6] %>%
kable(digits = 3) %>%
kable_styling(bootstrap_options = "striped", full_width = F)
 | mean | sd | 0.025quant | 0.5quant | 0.975quant | mode
---|---|---|---|---|---|---
(Intercept) | 13.137 | 0.181 | 12.780 | 13.137 | 13.492 | 13.137
MinorityYes | -2.790 | 0.207 | -3.196 | -2.790 | -2.384 | -2.790
SexMale | 1.247 | 0.165 | 0.924 | 1.247 | 1.570 | 1.247
SES | 2.038 | 0.109 | 1.824 | 2.038 | 2.252 | 2.038
and we examine the posteriors of some of its fixed effects:
for(D in c('MinorityYes', 'SexMale','SES')) {
plot(predmod2$marginals.fixed[[D]],type='l',col='blue',ylab="dens",
xlab=D)
xseq = predmod2$marginals.fixed[[D]][,'x']
# Overlay the N(0, 3^2) prior we specified in control.fixed
lines(xseq, dnorm(xseq, sd=3), type='l', col='red')
# Legend inside the loop so it appears on every plot, not just the last
legend('topleft',lty=1,col=c('blue','red'),legend=c('post','prior'))
}
Let's look at the overall effect of our random effect School; after all, it could be that the school in which the child studies has no effect…
reparameterized = Pmisc::priorPostSd(predmod2)
reparameterized$summary %>%
kable(digits = 2) %>%
kable_styling(bootstrap_options = "striped", full_width = F)
 | mean | sd | 0.025quant | 0.5quant | 0.975quant | mode
---|---|---|---|---|---|---
SD for School | 1.7 | 0.13 | 1.46 | 1.69 | 1.97 | 1.68
plot(reparameterized$posterior,type='l',col='blue',ylab="dens",
xlab= "posterior")
lines(reparameterized$prior, type='l', col='red')
legend('topleft',lty=1,col=c('blue','red'),legend=c('post','prior'))
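With the posterior of the school-level SD in hand, we can plug posterior means into the intraclass correlation from earlier. A minimal sketch, assuming INLA's standard hyperparameter labels ("Precision for the Gaussian observations" and "Precision for School"):
# Plug-in (posterior mean) estimate of the intraclass correlation
# (check rownames(predmod2$summary.hyperpar) if the labels differ)
prec <- predmod2$summary.hyperpar$mean
names(prec) <- rownames(predmod2$summary.hyperpar)
var_obs <- 1 / prec["Precision for the Gaussian observations"]
var_school <- 1 / prec["Precision for School"]
var_school / (var_school + var_obs) # share of total variance between schools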
Turns out it does! What does the posterior effect \(E(S_j|Y)\) look like for individual schools? Let's plot a few:
for(D in c(8, 50, 100)) {
plot(predmod2$marginals.random$School[[D]],type='l',col='blue',ylab="dens",
xlab=paste("School index", D))
xseq = predmod2$marginals.random$School[[D]][,'x']
# N(0, 3^2) reference curve; the schools' actual prior SD is the hyperparameter above
lines(xseq, dnorm(xseq, sd=3), type='l', col='red')
legend('topleft',lty=1,col=c('blue','red'),legend=c('post','prior'))
}
What about a regularized model with priors N(0,1) on the fixed effects?
predmod3 = inla(MathAch ~ Minority + Sex + SES + f(School,model="iid",
hyper=list(prec=list(
prior = 'pc.prec',
param=c(u = 5, alpha = 0.05))
)
),
control.fixed = list(
mean.intercept=0,
prec.intercept=1,
mean=0, prec= 1
),
family = "gaussian", data = complete_data,
control.predictor = list(link = 1)
)
predmod3$summary.fixed[, 1:6] %>%
kable(digits = 3) %>%
kable_styling(bootstrap_options = "striped", full_width = F)
 | mean | sd | 0.025quant | 0.5quant | 0.975quant | mode
---|---|---|---|---|---|---
(Intercept) | 12.744 | 0.184 | 12.378 | 12.746 | 13.101 | 12.749
MinorityYes | -2.569 | 0.203 | -2.969 | -2.570 | -2.170 | -2.570
SexMale | 1.359 | 0.163 | 1.040 | 1.359 | 1.678 | 1.359
SES | 2.031 | 0.109 | 1.818 | 2.031 | 2.245 | 2.031
for(D in c('MinorityYes', 'SexMale','SES')) {
plot(predmod3$marginals.fixed[[D]],type='l',col='blue',ylab="dens",
xlab=D)
xseq = predmod3$marginals.fixed[[D]][,'x']
# Overlay the N(0, 1) prior specified in control.fixed
lines(xseq, dnorm(xseq, sd=1), type='l', col='red')
legend('topleft',lty=1,col=c('blue','red'),legend=c('post','prior'))
}
But what about prediction? Turns out we’ve already generated our predictions and we can assess these using the standard Mean Square Error (MSE).
Our predictions were generated from the predictive posterior distribution:
\[ P(y^{*} | D , x^{*}) = \int{P(\theta | D)P(y^{*}|x^{*},\theta)d\theta} \]
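Concretely, these posterior predictive summaries are already sitting in summary.fitted.values, ordered like complete_data, so we can peek at the held-out rows directly:
# Posterior predictive mean and sd for the first few held-out observations
test_idx <- (nrow(trainMathAchieve) + 1):nrow(complete_data)
head(predmod1$summary.fitted.values[test_idx, c("mean", "sd")])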
# Let's set up a function to calculate MSE just for fun
# Can also use MLmetrics::MSE if we're lazy
mse_calc <- function(x, y){
# Vectorized mean squared error
mean((x - y)^2)
}
# Train MSE for model 1 with default priors
train_mse_1 <- mse_calc(predmod1$summary.fitted.values$mean[1:length(trainY$MathAch)],trainY$MathAch)
# Test MSE
start_n <- length(trainY$MathAch)+1
end_n <- length(complete_data$MathAch)
test_mse_1 <- mse_calc(predmod1$summary.fitted.values$mean[start_n:end_n],testY$MathAch)
# Train MSE for model 2 with user-specified priors
train_mse_2 <- mse_calc(predmod2$summary.fitted.values$mean[1:length(trainY$MathAch)],trainY$MathAch)
# Test MSE
start_n <- length(trainY$MathAch)+1
end_n <- length(complete_data$MathAch)
test_mse_2 <- mse_calc(predmod2$summary.fitted.values$mean[start_n:end_n],testY$MathAch)
# Train MSE for model 3 with regularized N(0,1) priors
train_mse_3 <- mse_calc(predmod3$summary.fitted.values$mean[1:length(trainY$MathAch)],trainY$MathAch)
# Test MSE
start_n <- length(trainY$MathAch)+1
end_n <- length(complete_data$MathAch)
test_mse_3 <- mse_calc(predmod3$summary.fitted.values$mean[start_n:end_n],testY$MathAch)
train_mse_b <- c(train_mse_1, train_mse_2, train_mse_3)
test_mse_b <- c(test_mse_1, test_mse_2, test_mse_3)
results <- data.frame(train_mse_b,test_mse_b, row.names=c("Model 1","Model 2", "Model 3"))
colnames(results) <- c("Train MSE", "Test MSE")
results %>%
kable(digits = 3) %>%
kable_styling(bootstrap_options = "striped", full_width = F)
 | Train MSE | Test MSE
---|---|---
Model 1 | 31.765 | 31.627
Model 2 | 31.757 | 31.621
Model 3 | 31.754 | 31.627
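For context, compare these numbers against the naive baseline that predicts the training mean for every child; a test MSE close to that baseline would mean the covariates and school effects buy us nothing:
# Test MSE of the "predict the training mean" baseline
mse_calc(rep(mean(trainY$MathAch), nrow(testY)), testY$MathAch)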
Turns out that our choice of priors did not affect the prediction error very much. But what if we chose an outlandish prior? Or changed the likelihood?
We can proceed to do a sensitivity analysis on our various choices of priors and move on to “model selection”…
I want to highlight that INLA is particularly well suited for hierarchical models with Gaussian random effects.
It is only available for R, as are many cutting-edge scientific computation frameworks.
If you need generalized models, I encourage you to explore Stan and its numerous wrappers: PyStan, RStan, etc.
I can tackle Stan next time…
Fantastic online INLA resource here
For all my hardcore peeps, I encourage you to read up on Monte Carlo methods (integration, optimization, etc.) and play around with Monte Carlo libraries, e.g., PyMC.
Bayesian inference is not that hard to incorporate into your Machine Learning practice.
Useful as long as you know what you are doing! Prior specification is a difficult art form.
Especially useful when you have limited amounts of data (expensive data collection/labelling, rare events, etc.).
Unlikely to outperform ML approaches on ML tasks (prediction, classification, etc.), especially as the size of the dataset grows.
A properly specified Bayesian model is a generative model, so you should be able to easily generate data to check the consistency of your model with the dataset/data-generating process in question; see the sketch below.
EDA and plots are crucial before, throughout, and after the model building and checking process
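To make that last point concrete, here is a minimal sketch: refit model 2 with config = TRUE (which the fits above did not request) so that joint draws from the approximate posterior become available:
# Refit with config = TRUE to enable joint posterior sampling
predmod2c <- inla(MathAch ~ Minority + Sex + SES +
                    f(School, model = "iid",
                      hyper = list(prec = list(prior = "pc.prec",
                                               param = c(5, 0.05)))),
                  family = "gaussian", data = complete_data,
                  control.predictor = list(link = 1),
                  control.compute = list(config = TRUE))
samp <- inla.posterior.sample(10, predmod2c) # ten joint posterior draws
str(samp[[1]]$latent) # one draw of the full latent field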