Bayesian Methods for Time and Space-Dependent Data

Published: April 09, 2019

```{r setup, include=FALSE} library(MASS); library(lmtest); library(knitr); library(kableExtra); library(nleqslv); library(lme4); library(broom.mixed); library(broom); library(mapmisc); library(brinla) library(Pmisc); library(extrafont); library(VGAM); library(INLA); library(MEMSS); library(nlme); library(ciTools); library(survival); knitr::opts_chunk$set(fig.pos = 'H'); ``` #Part 1: CO$_2$ Time Series ##Introduction Greenhouse emissions and their effect on global warming have been the subject of extensive research and discussion since the mid 1900s. In this report we study measurements of athmospheric Carbon Dioxide ($\text{CO}_2$) from an observatory in Hawaii, USA, recorded from the 1960s to the present day. The data was collected through the Scripps $\text{CO}_2$ [program](http://scrippsco2.ucsd.edu/). We aim to test the following hypotheses in the context of our model and the data in question: 1. Even though CO$_2$ levels continue increasing, there is evidence that the increase has slowed down recently 2. The data are consistent with slowing concentration during the global economic recessions in 1982 and 2008, and the collapse of the USSR after 1989 3. CO$_2$ levels tend to be higher in October than March 4. There is a reasonable chance that $\text{CO}_2$ levels will exceed 430 parts per gallon (ppG) by 2025 ##Modelling First we plot the nominal CO$_2$ readings through the years to inspect their behavior and determine how to best model them: ```{r echo=F} # PART 1 # Loading the data cUrl = paste0("http://scrippsco2.ucsd.edu/assets/data/atmospheric/", "stations/flask_co2/daily/daily_flask_co2_mlo.csv") cFile = basename(cUrl) if (!file.exists(cFile)) download.file(cUrl, cFile) co2s = read.table(cFile, header = FALSE, sep = ",", skip = 69, stringsAsFactors = FALSE, col.names = c("day", "time", "junk1", "junk2", "Nflasks", "quality", "co2")) co2s$date = strptime(paste(co2s$day, co2s$time), format = "%Y-%m-%d %H:%M", tz = "UTC") # remove low-quality measurements co2s[co2s$quality > 2, "co2"] = NA ``` ```{r echo=F, fig.pos='H', fig.align='center',out.width=c('45%','45%'), fig.subcap=c('Years 1960 - 2019 (Entire Time Period)', 'Years 2015 - 2019'), fig.cap="\\label{fig:figs}Observed CO$_2$ Readings at the Mauna Loa Observatory in Hawaii, USA"} #EDA plots plot(co2s$date, co2s$co2, log = "y", cex = 0.3, col = "#00000040", xlab = "time", ylab = "ppm") plot(co2s[co2s$date > ISOdate(2015, 3, 1, tz = "UTC"), c("date", "co2")], log = "y", type = "o", xlab = "time", ylab = "ppm", cex = 0.5) ``` Based on the strong seasonality and trend features visible in these plots, we fit the below semi-parametric model: $$ \text{Y}_{t} \sim \text{Gamma}(\mu_{t}, \nu) $$ $$ \mu_{t} = \Phi(X_{t})\mathbf{\beta} + f(U_t) $$ $$ U_{t+1} - 2U_{t} + U_{t-1} \sim N(0,\tau)\;\;\; \text{ (SecondOrder Random Walk)} $$ In the above: >* $\text{Y}_{t}$ corresponds to the observed CO$_2$ reading at time $t$. >* $\mu_{t}$ is the expected CO$_2$ reading at time $t$. >* $\Phi(X_{t})$ are trigonometric basis functions, taking in day indeces (t) from the origin date of January 1, 1980: $$ \text{cos12}(t) = \cos(2\pi \times \frac{\text{days}_t}{365.25});\;\; \text{sin12}(t)= \sin(2\pi \times \frac{\text{days}_t}{365.25});\;\; \text{cos6}(t) = \cos(4\pi \times \frac{\text{days}_t}{365.25});\;\; \text{sin6}(t) = \sin(4\pi \times \frac{\text{days}_t}{365.25}) $$ >* $U_{t}$ is the random time trend component >* $\sigma$ is the observation's residual variability >* $\tau$ is the variability in the estimated trend component This model accounts for the residual variation in observed measurements which may attributed to measurement error or unaccounted factors. This model also estimates a time trend component, which it assumes follows a RW2 process. ##Results ```{r echo = F} ## Prep time vars timeOrigin = ISOdate(1980, 1, 1, 0, 0, 0, tz = "UTC") co2s$days = as.numeric(difftime(co2s$date, timeOrigin, units = "days")) co2s$cos12 = cos(2 * pi * co2s$days/365.25) co2s$sin12 = sin(2 * pi * co2s$days/365.25) co2s$cos6 = cos(2 * 2 * pi * co2s$days/365.25) co2s$sin6 = sin(2 * 2 * pi * co2s$days/365.25) ``` ```{r echo=F, warning=F, message=F} ##### BAYESIAN MODEL ###### ## Prepping the dataset ## # time random effect timeBreaks = seq(min(co2s$date), ISOdate(2025, 1, 1, tz = "UTC"), by = "14 days") timePoints = timeBreaks[-1] co2s$timeRw2 = as.numeric(cut(co2s$date, timeBreaks)) # derivatives of time random effect D = Diagonal(length(timePoints)) - bandSparse(length(timePoints), k = -1) derivLincomb = inla.make.lincombs(timeRw2 = D[-1, ]) names(derivLincomb) = gsub("^lc", "time", names(derivLincomb)) # seasonal effect StimeSeason = seq(ISOdate(2009, 9, 1, tz = "UTC"), ISOdate(2011, 3, 1, tz = "UTC"), len = 1001) StimeYear = as.numeric(difftime(StimeSeason, timeOrigin, "days"))/365.35 seasonLincomb = inla.make.lincombs(sin12 = sin(2 * pi * StimeYear), cos12 = cos(2 * pi * StimeYear), sin6 = sin(2 * 2 * pi * StimeYear), cos6 = cos(2 * 2 * pi * StimeYear)) names(seasonLincomb) = gsub("^lc", "season", names(seasonLincomb)) # predictions timeBreaks1 = seq(min(co2s$date), ISOdate(2030, 1, 1, tz = "UTC"), by = "14 days") timePoints1 = timeBreaks1[-1] StimePred = as.numeric(difftime(timePoints1, timeOrigin, units = "days"))/365.35 predLincomb = inla.make.lincombs(timeRw2 = Diagonal(length(timePoints1)), `(Intercept)` = rep(1, length(timePoints1)), sin12 = sin(2 * pi * StimePred), cos12 = cos(2 * pi * StimePred), sin6 = sin(2 * 2 * pi * StimePred), cos6 = cos(2 * 2 * pi * StimePred)) names(predLincomb) = gsub("^lc", "pred", names(predLincomb)) StimeIndex = seq(1, length(timePoints1)) #Extending prediction timeOriginIndex = which.min(abs(difftime(timePoints1, timeOrigin))) ## disable some error checking in INLA ## mm = get("inla.models", INLA:::inla.get.inlaEnv(), mode='function') mm = mm() mm$latent$rw2$min.diff = NULL assign("inla.models", mm, INLA:::inla.get.inlaEnv()) ## Fitting the INLA model ## co2res = inla(co2 ~ sin12 + cos12 + sin6 + cos6 + f(timeRw2, model = 'rw2', values = StimeIndex, prior='pc.prec', param = c(log(1.01)/26, 0.5)), #log(1.01)/26 = .00039 data = co2s, family='gamma', lincomb = c(derivLincomb,seasonLincomb, predLincomb), control.family = list(hyper=list(prec=list(prior='pc.prec', param=c(2, 0.5)))), # add this line if your computer has trouble # control.inla = list(strategy='gaussian', int.strategy='eb'), verbose=F) ``` Below are the results of the fitted model. Recall that INLA's default parametrization of the Gamma distribution uses the log-link function. Thus, I exponentiate the estimated model coefficients in the following table: ```{r echo=F} co2resTab1 <- exp(co2res$summary.fixed[, c("mean", "0.025quant", "0.975quant")]) #BRINLA: Convert precisions to SD co2resTab3 <- bri.hyperpar.summary(co2res)[1, c("mean", "q0.025", "q0.975")] names(co2resTab3) <- c("mean", "0.025quant", "0.975quant") co2resTab2 <- Pmisc::priorPostSd(co2res)$summary[, c("mean", "0.025quant", "0.975quant")] resTable <- rbind(co2resTab1,co2resTab2) resTable <- rbind(resTable,co2resTab3) rownames(resTable) <- c("(Intercept)", "sin12","cos12","sin6","cos6","SD for timeRw2","Intern precision-parameter for the Gamma observations") knitr::kable(resTable, digits=5, escape=F, format="latex", booktab=T,linesep = "", caption="Exponentiated posterior mean and 2.5\\% and 97.5\\% percentiles by model coefficients") %>% kable_styling(latex_options = "hold_position") %>% row_spec(c(2,3,4,5), bold = T, color = "blue") ``` Note how the coefficients of the basis function terms above (highlighted in blue) are statistically credible--namely, they model the underlying data process well. ```{r echo=FALSE, eval=T, fig.pos='H', fig.align='center', out.width=c('46%','46%','46%','46%'), fig.subcap=c('Trend Component', 'Derivative', 'Seasonality with September as the baseline month','Predicted Measurements'), fig.ncol=2, fig.cap="\\label{fig:figs}Posterior plots of different estimated components, along with 95\\% credibility bands"} ## INLA plots par(mar = c(4,4,4,2) + 0.1); par(mgp=c(2,1,0)); matplot(timePoints1, exp(co2res$summary.random$timeRw2[, c("0.5quant", "0.025quant", "0.975quant")]), type = "l", col = "black", lty = c(1, 2, 2), log = "y", xaxt = "n", xlab = "time", ylab = "ppm") xax = pretty(timePoints1, n=10) axis(1, xax, format(xax, "%Y")) derivPred = co2res$summary.lincomb.derived[grep("time", rownames(co2res$summary.lincomb.derived)), c("0.5quant", "0.025quant", "0.975quant")] scaleTo10Years = (10 * 365.25/as.numeric(diff(timePoints, units = "days"))) matplot(timePoints[-1], scaleTo10Years * derivPred, type = "l", col = "black", lty = c(1, 2, 2), ylim = c(0, 0.1), xlim = range(as.numeric(co2s$date)), xaxs = "i", xaxt = "n", xlab = "time", ylab = "log ppm, change per 10yr") axis(1, xax, format(xax, "%Y")) abline(v = ISOdate(2008, 1, 1, tz = "UTC"), col = "blue") abline(v = ISOdate(2009, 1, 1, tz = "UTC"), col = "blue") abline(v = ISOdate(1989, 1, 1, tz = "UTC"), col = "red") abline(v = ISOdate(1992, 1, 1, tz = "UTC"), col = "red") abline(v = ISOdate(1980, 1, 1, tz = "UTC"), col = "green") abline(v = ISOdate(1982, 1, 1, tz = "UTC"), col = "green") abline(v = ISOdate(2017, 1, 1, tz = "UTC"), col = "purple") legend("topright", legend = c("1980-82 Recession", "1989-91 USSR Collapse", "2008 Recession","2017+ Recent Times"), col = c("green","red","blue","purple"), lty=c(1,1,1,1), cex=0.8,lwd=c(1.5,1.5,1.5,1.5), merge = F, box.lty=0, bg="transparent",xpd=TRUE,inset=c(0,-.3)) #1964, .105, matplot(StimeSeason, exp(co2res$summary.lincomb.derived[grep("season", rownames(co2res$summary.lincomb.derived)), c("0.5quant", "0.025quant", "0.975quant")]), type = "l", col = "black", lty = c(1, 2, 2), log = "y", xaxs = "i", xaxt = "n", xlab = "time", ylab = "relative ppm") xaxSeason = seq(ISOdate(2009, 9, 1, tz = "UTC"), by = "2 months", len = 12) abline(v = ISOdate(2010, 10, 1, tz = "UTC"), col = "red") abline(v = ISOdate(2010, 3, 1, tz = "UTC"), col = "blue") legend("topright", legend = c("March", "October"), col = c("blue","red"), lty=c(1,1), cex=0.8,lwd=c(1.5,1.5), merge = F, box.lty=0, bg="transparent",xpd=TRUE,inset=c(0,-.2)) #1964, .105, axis(1, xaxSeason, format(xaxSeason, "%b")) timePred = co2res$summary.lincomb.derived[grep("pred", rownames(co2res$summary.lincomb.derived)), c("0.5quant", "0.025quant", "0.975quant")] matplot(timePoints1, exp(timePred), type = "l", col = "black", lty = c(1, 2, 2), log = "y", xlim = ISOdate(c(2010, 2027), 1, 1, tz = "UTC"), ylim = c(390, 470), xaxs = "i", xaxt = "n", xlab = "time", ylab = "ppm") xaxPred = seq(ISOdate(2010, 1, 1, tz = "UTC"), by = "5 years", len = 20) abline(v = ISOdate(2025, 1, 1, tz = "UTC"), col = "red") abline(h = 430, col = "red") axis(1, xaxPred, format(xaxPred, "%Y")) ``` ```{r echo=F, eval=F, fig.pos='H', fig.align='center',out.width='50%', fig.cap="\\label{fig:figs}Prior and posterior distributions of the time trend component of this CO2 model"} #Plotting priors and posteriors #\alpha, sd state, and sd school co2res$priorPost = Pmisc::priorPost(co2res) par(mar = c(4,4,4,2) + 0.1); par(mgp=c(2,1,0)); for (Dparam in co2res$priorPost$parameters) { do.call(matplot, co2res$priorPost[[Dparam]]$matplot) } co2res$priorPost$legend$x = "topleft" do.call(legend, co2res$priorPost$legend) ``` With regards to the first hypothesis in question, the above results do not present conclusive evidence for an affine recent slowdown of carbon concentrations in the atmosphere. Figure 2 (b) above shows the rate of change of PPM readings, which does appear to show signs of a slowdown in the years 2016 and 2017, but shows a rising rate of change through the year 2018 and beyond. Nonetheless, the same figure points to a marked slowdown during the 1980-2 Recession. This Recession was infamously influenced by the global energy crises that affected much of the world around the same years. We see less pronounced slowdowns during the period of collapse of the USSR (1989-1991) and the most recent economic recession of 2008. In terms of seasonality, the fitted model estimates carbon concentration in the atmosphere to be higher in March, when compared to October, thus contradicting the initial hypothesis put forward. It would be a worthy endeavor to explore what structural, historical, and environmental factors may be driving this estimated seasonal difference. Now, what does the future look like (hypothesis 4)? Although the model does predict an approximate linear increasing trend in carbon concentrations, as seen in Figures 2 (a) and (b), the median prediction for the year 2025 is seen around 420 ppm. But we must also note the 95% credibility bands for this estimate, which go up to about 440 ppm in 2025, but also fall to almost 400 ppm. Therefore, there is reasonable chance that atmospheric readings of carbon may exceed 430 ppm in 2025. ##Conclusion The study of climate and mankind's influence on it has gained much traction in the past few decades. To gain a more complete picture of CO$_2$ concentrations in the atmosphere we may wish to consider other measurement sites around the world, for in this study we restrict ourselves to one location in Hawaii, USA. It would be interesting to extend our analysis to other, more impactful gases, like Methane (CH$_4$) and Nitrous Oxide (N$_2$O). Our model does indicate increasing CO$_2$ concentrations in the Hawaii location. We hope that given the extensive scientific literature for the greenhouse properties of this gas, perhaps the rise of clean energy and heightened climate awareness at the national and individual scales will curb emissions and diminish the impact of human-induced climate change. \pagebreak \newpage #Part 2: Malaria Spatial Mapping ## Introduction With the advent of the 2019 World Health Organization's [Malaria Day](https://www.who.int/campaigns/world-malaria-day/world-malaria-day-2019) on April 25 of the present year, and the number of reported malaria deaths worldwide as 435,000 in 2017, we find it worth exploring how to improve the spatial prediction of cases of this disease in developing countries where collecting data related to this disease is either costly or extremely difficult due to poor infrastructure, war, or political instability. In this report we focus on the country of Gambia, seen below: ```{r echo=F, message=F, warning=F} ### Part 2 # Loading the dataset eUrl = "http://pbrown.ca/teaching/astwo/data/eviMean.RData" eFile = basename(eUrl) if (!file.exists(eFile)) download.file(eUrl, eFile) load(eFile) library("geostatsp", quietly = TRUE) data("gambiaUTM") gborder = raster::getData("GADM", country = "GMB", level = 0) gborder = spTransform(gborder, projection(gambiaUTM)) graster = squareRaster(gborder,100) ``` ```{r echo=F, eval=T, warning=F, message=F} ### Default Spatial model # Note that in glgm default is shape = 1 myres0 = glgm(pos ~ phc + netuse + age + treated, data = gambiaUTM, grid = squareRaster(gborder,100), shape=2, family = "binomial", prior = list(sd = 5, range = 60 * 1000), control.inla = list(strategy="gaussian", int.strategy="eb")) ### Final Spatial model myres1 = glgm(pos ~ phc + evi + netuse + age + treated, data = gambiaUTM, grid = squareRaster(gborder,100), shape=2, family = "binomial", prior = list(sd = 5, range = 60 * 1000), covariates = list(evi = eviMean), control.inla = list(strategy="gaussian", int.strategy="eb")) ``` ```{r echo=FALSE, eval=T, warning=F, out.width=c('70%','70%'), fig.ncol=1, message = F, fig.pos='H', fig.show=T , fig.align= 'center', fig.subcap=c('Map of Gambia - crosses correspond to the villages which data we consider', 'Map of Vegetation Index EVI in Gambia'), fig.cap="\\label{fig:figs}Maps of Gambia and its Vegetation"} #clip(usr[1], usr[2], usr[3], -10000) gmap = trim(openmap(gambiaUTM, fact = 2, maxTiles=12)) gmap2 = tonerToTrans(openmap(gambiaUTM, path="stamen-toner", fact = 2, maxTiles=12)) map.new(graster,bty="n", axes=F) plot(gmap, add=T) plot(gambiaUTM, add=T) Axis(side=1, labels=FALSE) Axis(side=2, labels=FALSE) scaleBar(gambiaUTM, "bottom", cex = 1, bty = "n", outer=FALSE, bg="transparent") evi_col <- colourScale(eviMean, breaks=10, dec=0, col = "Greens", opacity=c(0.8, 0.8)) gmap_w_evi = tonerToTrans(openmap(eviMean, path = "stamen-toner",fact = 2)) map.new(eviMean, legendRight = 0.85,bty="n", axes=F) plot(eviMean, add = T, col = evi_col$col,bty="n",breaks = evi_col$breaks, legend = F) plot(gmap_w_evi, add = T, maxpixels = 10^6) plot(gambiaUTM, add=T) legendBreaks("right", evi_col, outer = T, bty = "n", inset = 0) ``` Our objective is to determine whether using satellite data (in this case the vegetation index EVI) is feasible in the prediction of the spatial distribution of the malaria disease in areas of Gambia where nearby samples are scarce. ##Methods For this research question I consider the below generalized linear geostatistical model (GLGM): $$ Y_{i} \sim \text{bernoulli}(\pi (s_{i})) $$ $$ \text{logit}(\pi (s_{i})) = X(s_{i})\beta + U(s_{i}) $$ $$ \text{cov}[U(s_{i}),U(s_{j})] = \sigma^2 \rho \left [\frac{||s_{i} - s_{j}||}{\phi};\kappa\right] $$ where, $$ \rho(h;\phi,\kappa) = \frac{1}{\Gamma(\kappa)2^{\kappa -1}}\left( \frac{\sqrt{8\kappa}||h||}{\phi}\right)^{\kappa} \text{bessel}_{\kappa}\left( \frac{\sqrt{8\kappa}||h||}{\phi}\right) $$ In the above: >* The dependent variable $\text{Y}_{i}$ is defined as below: $$ Y_{i} = \left\{ \begin{array}{rcl} 1 &\text{ if ith child tests positive for malaria}\\ 0 &\text{ if ith child tests negative for malaria} \end{array}\right. $$ >* $X(s_{i})$ corresponds to the covariates of age of the child, whether they use a bed net for mosquito protection, if their net has been treated with insecticide, if the village has a public health center, and EVI as the measure of vegetation for a given location. >* $U(s_{i})$ corresponds to the residual spatial variation. >* $\phi$ corresponds to the range hyperparameter, for which we choose an exponential prior with median = 60 $\times$ 1000, signifying that we expect the spatial correlation to start falling past 60 KM. >* $\sigma$ corresponds to the variability in residual variation, for which we choose an exponential prior with median = 5. >* $\kappa$ corresponds to the shape parameter in the Matern correlation function $\rho$. ##Results The below table shows evidence for net use and age as important covariates in assessing the odds of malaria in the observed locations. ```{r echo=F} sparesTable1 <- myres1$parameters$summary[, c("mean", "0.025quant", "0.975quant")]; sparesTable1[1:6,] <- exp(sparesTable1[1:6,]) rownames(sparesTable1) <- c("(Intercept)","Public Health Centre Presence", "Vegetation Index EVI", "Bed Net Usage", "Age","Bed Net Insecticide Treatment", "range/1000","sd") knitr::kable(sparesTable1,digits=5, escape=F, format="latex", booktab=T,linesep = "",caption="95\\% credibility intervals for exponentiated coefficients and posterior distributions of parameters") %>% kable_styling(latex_options = "hold_position") %>% row_spec(c(4,5), bold = T, color = "blue") %>% row_spec(c(3), bold = T, color = "red") ``` The model suggests that usage of mosquito nets **reduces** the odds of contracting malaria by approximately 28%, with weak contribution from additional insecticide treatment. Similarly, although having the opposite effect, the age of the child (fitted in terms of days) has an important effect in the prediction of malaria odds. To compute the 95% credibility interval of the effect of a one year increase in age on the odds of contracting malaria, we solve the following expression: $(1.00043^{365},1.00090^{365})$ = (1.169898, 1.388678). We also obtain a median value of $1.00066^365$=1.272293. This suggests that an increment of one year of age in the children in our sample yields an approximate median increment of 27% in the odds of contracting malaria. Below we plot the surfaces of estimated probabilities of malaria, as well as the estimated residual spatial variation (random effect): ```{r echo=F, eval=T, warning=F, message=F, fig.pos='H', fig.align= 'center', out.width=c('50%','50%'),fig.subcap=c('Estimated probabilities of malaria E($\\lambda(s)$|Y)','Estimated residual spatial variation E($U(s)$|Y)'), fig.cap="\\label{fig:figs}Estimates Including Vegetation Index EVI"} fitCol1 = colourScale(trim(myres1$raster[["predict.invlogit"]]), style = "equal", breaks = 10, dec = -log10(0.02), col = "RdYlGn", rev = TRUE, opacity = 0.8) map.new(trim(myres1$raster), legendRight = 0.9, axes=F) plot(trim(myres1$raster[["predict.invlogit"]]), add = TRUE, col = fitCol1$colOpacity,bty="n", breaks = fitCol1$breaks, legend = FALSE) plot(trim(gmap2), add = TRUE, maxpixels = 10^6) plot(gambiaUTM, add=T) legendBreaks("right", fitCol1, outer = TRUE, bty = "n", inset = 0) fitCol2 = colourScale(trim(myres1$raster[["random.mean"]]), style = "equal", breaks = 10, dec = -log10(0.02), col = "RdYlGn", rev = TRUE, opacity = 0.8) map.new(trim(myres1$raster), legendRight = 0.9, axes=F) plot(trim(myres1$raster[["random.mean"]]), add = TRUE, col = fitCol1$colOpacity,bty="n", breaks = fitCol2$breaks, legend = FALSE) plot(trim(gmap2), add = TRUE, maxpixels = 10^6) plot(gambiaUTM, add=T) legendBreaks("right", fitCol2, outer = TRUE, bty = "n", inset = 0) ``` ```{r echo=F, eval=T, warning=F, message=F, fig.pos='H', fig.align='center', out.width=c('50%','50%'),fig.subcap=c('Estimated probabilities of malaria E($\\lambda(s)|Y$)','Estimated Residual Spatial Variation E($U(s)|Y$)'),fig.cap="\\label{fig:figs}Estimates Excluding Vegetation Index EVI"} fitCol0 = colourScale(trim(myres0$raster[["predict.invlogit"]]), style = "equal", breaks = 10, dec = -log10(0.02), col = "RdYlGn", rev = TRUE, opacity = 0.8) map.new(trim(myres0$raster), legendRight = 0.9, axes=F) plot(trim(myres0$raster[["predict.invlogit"]]), add = TRUE, col = fitCol0$colOpacity,bty="n", breaks = fitCol0$breaks, legend = FALSE) plot(trim(gmap2), add = TRUE, maxpixels = 10^6) plot(gambiaUTM, add=T) legendBreaks("right", fitCol0, outer = TRUE, bty = "n", inset = 0) fitCol3 = colourScale(trim(myres0$raster[["random.mean"]]), style = "equal", breaks = 10, dec = -log10(0.02), col = "RdYlGn", rev = TRUE, opacity = 0.8) map.new(trim(myres0$raster), legendRight = 0.9, axes=F) plot(trim(myres0$raster[["random.mean"]]), add = TRUE, col = fitCol3$colOpacity,bty="n", breaks = fitCol3$breaks, legend = FALSE) plot(trim(gmap2), add = TRUE, maxpixels = 10^6) plot(gambiaUTM, add=T) legendBreaks("right", fitCol3, outer = TRUE, bty = "n", inset = 0) ``` The above maps of the estimated probabilites highlight the highest probability of contracting malaria in the easternmost village cluster, compared to perhaps the rest of Gambia, controlling for several factors. There also appears to be high probability of contracting this disease in the northern portion of the northwest village cluster. However, the maps of the estimated random effects show highly pronounced residual variation in these two areas—quite negative for the lower probability region and quite positive for the higher one. The redder area in the random effect heat map may be interpreted as higher actual observed risk of contracting the malaria disease than our model is able to account for, i.e., estimate. This variability in intensities suggests that there may be other unaccounted spatial factors driving malaria risk. These other factors could be public infrastructure, type of animals and plants in the area, type of crops and field usage, among others. But what can be said about locations for which we have no recorded data? Although the coefficient pertaining to the EVI vegetation index does not show great influence in our model from simple inspection of the table above, the below plots show how EVI reduces the residual spatial variation in our model, especially in areas without recorded observations, south of Nioro du Rip (Midwest Gambia). Moreover, it adds more granularity to the plots of estimated probabilities of malaria, highlighting increased probability near the rivers south of Nioro du Rip, as well as the outskirts of the southwestern village cluster. Note in the maps above that EVI measurements are not available for the coastal regions of Gambia (West), which is why the model cannot estimate the probability of malaria there. To gain a more complete understanding in this missing area we may wish to find the EVI readings for these missing (cropped) locations. Below are the diagrams contrasting our prior distributions on the model parameters range ($\phi$) and the variability in the residual spatial variation ($\sigma$), with their corresponding posterior densities: ```{r echo=F, fig.pos='H', fig.align='center',out.width=c('45%','45%'), fig.subcap=c( 'SD of residual spatial variation $\\sigma$','Range $\\phi$'), fig.cap="\\label{fig:figs}Densities of prior and posterior distributions pertaining to this spatial model"} #Plotting priors and posteriors par(mar = c(4,4,4,2) + 0.1); par(mgp=c(2,1,0)); matplot( myres1$parameters$sd$posterior[,'x'], myres1$parameters$sd$posterior[,c('y','prior')], lty=1, col=c('blue','red'), type='l', xlab='sd', ylab='density') legend("topright", col=c("blue","red"),lty=1,legend=c("posterior", "prior")) matplot( myres1$parameters$range$posterior[,'x'], myres1$parameters$range$posterior[,c('y','prior')], lty=1, col=c('blue','red'), type='l', xlab='range', ylab='density') legend("topright", col=c("blue","red"),lty=1,legend=c("posterior", "prior")) ``` The posterior plot for $\sigma$ above indicates that our model has reduced the variability in residual variation, from a prior with median of 5, to a narrower posterior around .9, assigning almost nill probability to 0. We can thus conclude that the posterior of $\sigma$ suggests that location accounts for part of the probability that a child will contract malaria. The posterior plot for range suggests a certain degree of smoothness in the spatial distribution of malaria, with a median of 22,000 (interpreted as 22KM) and a 95% credibility interval of (14,000, 33,000) approximately. We may interpret this results as the following: the spatial correlation between estimates at two different points in our space decays as their distance apart goes beyond 22KM. Intuitively, this makes sense if we examine the initial map with the observation locations: The distance between villages in a given cluster are almost always no larger than 50KM, but the distance between the western, central, and eastern clusters is larger than 50KM. ## Discussion It is important for governments and health NGOs to understand the incidence of preventable, debilitating diseases. With regards to malaria, John's Hopkins University estimates the yearly economic cost of this disease for the African continent as approximately USD $12 billion [**here**](https://www.malariafreefuture.org/malaria). In this report we see marginal benefits of including satellite vegetation imagery towards gaining a better understanding of the malaria disease in areas for which there are few or no recorded observations. By going with observations alone, which are often difficult to procure in developing countries due to safety and resource constraints, we may be limiting the scope of our modelling, whereas we can expand our insight by harnessing technological advances in satellite imagery. Although the results of considering the EVI vegetation index are not decidedly strong, they may inform us of areas that may require greater effort in data collection. Moreover, EVI does not appear to discriminate by the predominant type of vegetation or fauna in the area, perhaps suggesting that we consider other types of satellite data. We also see the benefit of providing cost-effective protection against the carriers of this disease, in the form of bed nets. Bed nets are a physical barrier that shields individuals from mosquitoes, especially during sleep, for children are particularly vulnerable to bites while they lay motionless, getting rest. However, our model suggests increased odds of contracting malaria as children advance in age--perhaps due to increased levels of activity and curiosity, playing near still waters or in areas with large mosquito presence. It is important to educate these children and those around them on common malaria prevention practices (e.g., wearing long-sleeved clothing and using insect repellect). \pagebreak \newpage #Appendix: Code ```{r ref.label=knitr::all_labels(), echo = T, eval = F} ```

Sergio E. Betancourt