In this lab you will explore how performance of an MCMC algorithm and the validity of scientific inferences can be affected by the choice of prior distribution.
Set up: Gotelli and Ellison (2002) investigated the impact of elevation and latitude on ant species richness. To measure species richness, square grids, 64 meters on each side, were established in 22 different bogs and in the forests surrounding them (44 grids in total), and the number of ant species within each grid was recorded. We will use this data set in the investigation below.
For this lab, you will want to load the following packages:
library(GGally)
library(rjags)
set.seed(10)
Read in the data and explore it using pairs or GGally::ggpairs. What do you notice? How might this impact your modeling choices?
ants <- read.csv('ants.csv', header = TRUE)
GGally::ggpairs(ants)
Richness is a count variable giving the number of ant species found in each sampling grid. Forest is an indicator variable equal to 1 if the site is in the forest and 0 if it is in the bog. Latitude and elevation are both continuous variables. We will consider a Poisson data model.
Let \[g(\boldsymbol{\beta}, \mathbf{x}_i)=\text{log}^{-1}(\beta_0 + \beta_1x_{\text{forest}} + \beta_2x_{\text{latitude}} + \beta_3x_{\text{elev}}) =\exp(\beta_0 + \beta_1x_{\text{forest}} + \beta_2x_{\text{latitude}} + \beta_3x_{\text{elev}}).\]
Then, \[\begin{align*} [\boldsymbol{\beta} \mid \mathbf{y}] & \propto [\mathbf{y} \mid g(\boldsymbol{\beta}, \mathbf{X})][\boldsymbol{\beta}] \\ &= \prod_{i=1}^{44}{\text{Poisson}(y_i\mid g(\boldsymbol{\beta}, \mathbf{x}_i))}\prod_{j=0}^{3}{\text{normal}(\beta_j \mid 0, 10000)} \end{align*}\]
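To make this concrete, here is a minimal R sketch of the (unnormalized) log posterior implied by the expressions above; the function name log_post and the object X are illustrative and not part of the lab code.
log_post <- function(beta, y, X, prior_var = 10000) {
  lambda <- exp(X %*% beta)                             # inverse-log link, g(beta, x_i)
  sum(dpois(y, lambda, log = TRUE)) +                   # Poisson data model
    sum(dnorm(beta, 0, sqrt(prior_var), log = TRUE))    # normal(0, 10000) priors
}
X <- cbind(1, as.matrix(ants[, -1]))     # intercept, forest, latitude, elevation
log_post(rep(0, 4), ants$richness, X)    # quick evaluation at beta = 0
For comparison (and for later use as starting values for the MCMC chains), we first obtain the frequentist maximum likelihood estimates with glm().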
freq_glm <- glm(richness ~ ., data = ants, family = poisson(link = 'log'))$coef
names(freq_glm) <- c('b0', 'b1', 'b2', 'b3')
freq_glm
## b0 b1 b2 b3
## 11.936812111 0.635438863 -0.235793020 -0.001141079
# jags code
m.jags <- "
model{
  # Poisson data model with log link
  for(i in 1:n){
    ants[i, 1] ~ dpois(lam[i])
    log(lam[i]) <- b0 + b1*ants[i, 2] + b2*ants[i, 3] + b3*ants[i, 4]
  }
  # priors; note that dnorm in JAGS is parameterized by mean and precision (tau)
  b0 ~ dnorm(mu, tau)
  b1 ~ dnorm(mu, tau)
  b2 ~ dnorm(mu, tau)
  b3 ~ dnorm(mu, tau)
}
"
m <- textConnection(m.jags)
# set initializations
inits <- list(list(b0 = 0, b1 = 0, b2 = 0, b3 = 0),
list(b0 = 0, b1 = 0, b2 = 0, b3 = 0),
list(b0 = 0, b1 = 0, b2 = 0, b3 = 0))
# build model and algorithm
m.out <-
jags.model(
m,
data = list(
'ants' = as.matrix(ants),
'n' = nrow(ants),
'mu' = 0,
'tau' = 1/10000
),
inits = inits,
n.chains = 3,
n.adapt = 1000,
quiet = TRUE
)
# fit
update(m.out, 50)
m.samples <-
coda.samples(m.out, c('b0', 'b1', 'b2', 'b3'), 500)
We first look at the summary information and the trace plots.
MCMCvis::MCMCsummary(m.samples)
## mean sd 2.5% 50% 97.5% Rhat n.eff
## b0 6.842604785 0.8569001163 5.058657141 6.778858982 8.6704410119 2.06 11
## b1 0.641359666 0.1181498137 0.411558120 0.643233556 0.8768860371 1.01 249
## b2 -0.116202363 0.0199950935 -0.158752229 -0.115221922 -0.0740831533 2.09 11
## b3 -0.001322329 0.0003604419 -0.002007422 -0.001335771 -0.0005631731 1.02 370
MCMCvis::MCMCtrace(m.samples, pdf = FALSE)
Clearly, we should be concerned about convergence of the chains. We will need to correct these issues and conduct posterior predictive checks before making any inferences. Notice specifically the trace plots and effective sample sizes for \(\beta_0\) and \(\beta_2\).
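MCMCvis reports Rhat and effective sample sizes, but you can also cross-check these diagnostics directly with the coda package (loaded along with rjags); a quick sketch:
coda::gelman.diag(m.samples)    # Rhat (potential scale reduction factor) for each parameter
coda::effectiveSize(m.samples)  # effective sample size for each parameter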
There are three (more) obvious concerns we need to address to improve convergence of the chains. That is: (1) all three chains were initialized at the same arbitrary values (every coefficient set to zero), (2) a burn-in of only 50 iterations is far too short, and (3) 500 post-burn-in iterations are too few to produce adequate effective sample sizes.
We implement the three suggestions outlined above. Specifically, we initialize the first chain starting at the frequentist estimates. This is called a lukewarm start. We initialize the other two chains uniformly at random within a reasonable range. Again, we use three chains so we can formally assess convergence using Rhat.
# jags code
m <- textConnection(m.jags)
inits <- list(
as.list(freq_glm),
as.list(freq_glm + runif(4, -1, 1)),
as.list(freq_glm + runif(4, -1, 1))
)
# build model and algorithm
m.out <-
jags.model(
m,
data = list(
'ants' = as.matrix(ants),
'n' = nrow(ants),
'mu' = 0,
'tau' = 1/10000
),
inits = inits,
n.chains = 3,
n.adapt = 1000,
quiet = TRUE
)
# fit
update(m.out, 2000) # increase burn in to 2K
m.samples <-
coda.samples(m.out, c('b0', 'b1', 'b2', 'b3'), 10000) # increase MCMC iterations to 10K
Let’s take a look at the diagnostics.
MCMCvis::MCMCsummary(m.samples)
## mean sd 2.5% 50% 97.5% Rhat n.eff
## b0 12.480473699 2.0559738409 8.38867555 12.818138112 16.0260997927 1.45 15
## b1 0.639058345 0.1192834465 0.40685757 0.638498230 0.8738555421 1.00 4458
## b2 -0.248737543 0.0483309153 -0.33211418 -0.256428447 -0.1529219613 1.45 15
## b3 -0.001135004 0.0003705971 -0.00186752 -0.001130364 -0.0004185119 1.01 4743
MCMCvis::MCMCtrace(m.samples, pdf = FALSE)
Clearly, there is still an issue. The effective sample sizes for \(\beta_0\) and \(\beta_2\) are still comically low, and the trace plots for those parameters do not exhibit stationarity. The next question will motivate how to track down the mistake. Bonus points to those who caught it already.
We first simulate one data set according to the data model and prior specification (that is, a prior predictive check).
prior_var <- 10000
betas <- rnorm(4,0,sqrt(prior_var))
lambda_i <- exp(cbind(rep(1, nrow(ants)), as.matrix(ants[,-1])) %*% betas)
as.vector(lambda_i)
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [39] 0 0 0 0 0 0
rpois(n = nrow(ants), as.vector(lambda_i))
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [39] 0 0 0 0 0 0
Let’s simulate a second data set.
betas <- rnorm(4,0,sqrt(prior_var))
lambda_i <- exp(cbind(rep(1, nrow(ants)), as.matrix(ants[,-1])) %*% betas)
as.vector(lambda_i)
## [1] Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf
## [20] Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf
## [39] Inf Inf Inf Inf Inf Inf
rpois(n = nrow(ants), as.vector(lambda_i))
## Warning in rpois(n = nrow(ants), as.vector(lambda_i)): NAs produced
## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
We used a \(\text{normal}(0, 10000)\) as a “vague” prior for each \(\beta_j\). However, we see that this large prior variance causes every \(\lambda_i\) to be either zero or infinite. YIKES! This is another instance where a non-linear function of parameters with vague priors has a highly informative prior.
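A quick numerical check (a sketch, not part of the original lab code) shows why: because covariates such as latitude and elevation are far from zero, a prior standard deviation of 100 on the coefficients typically puts the linear predictor in the hundreds or thousands in absolute value, well past the point where exp() overflows to Inf or underflows to 0 in double precision.
# why every lambda_i collapses to 0 or Inf: the linear predictor is enormous
eta <- cbind(1, as.matrix(ants[, -1])) %*% betas   # linear predictor for the last prior draw
range(eta)                                         # typically hundreds or thousands in magnitude
exp(710); exp(-746)                                # exp() overflows to Inf / underflows to 0 here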
prior_var <- 0.01
betas <- rnorm(4,0,sqrt(prior_var))
lambda_i <- exp(cbind(rep(1, nrow(ants)), as.matrix(ants[,-1])) %*% betas)
as.vector(lambda_i)
## [1] 782.22635 708.67688 26.05572 23.60581 94.54371 85.65416
## [7] 24.56693 22.25701 158.97695 144.02901 49.31715 44.68007
## [13] 37.44144 33.92098 1983.78599 1797.25888 72.95765 66.09775
## [19] 58.00573 52.55169 292.47892 264.97835 504.79496 457.33120
## [25] 3239.96191 2935.32184 457.52534 414.50615 109.84292 99.51485
## [31] 463.15208 419.60382 1881.81133 1704.87248 732.06887 663.23550
## [37] 240.49581 217.88300 38.31555 34.71290 697.38794 631.81547
## [43] 99.06621 89.75143
sort(rpois(n = nrow(ants), as.vector(lambda_i)))
## [1] 23 24 26 26 29 34 35 37 44 50 53 58 61 66 79
## [16] 85 87 105 106 110 143 178 225 251 253 322 413 434 458 463
## [31] 487 492 662 687 698 707 726 755 1737 1805 1903 2026 2965 3184
We see that this smaller prior variance provides more reasonable ant species counts. We might still consider this prior to be relatively “vague” in the sense that the range of simulated species counts varies considerably.
Based on this investigation, we might propose a “vague” prior of \(\beta_j\sim\text{normal}(0, 0.01)\) for each of the coefficients in this generalized linear model.
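To be more systematic, you could repeat the prior predictive simulation many times and summarize the implied expected counts rather than judging from a single draw; a minimal sketch (the number of replicates, 1000, is arbitrary, and lambda_draws is an illustrative name):
# prior predictive check of the expected counts under beta_j ~ normal(0, 0.01)
X <- cbind(1, as.matrix(ants[, -1]))
lambda_draws <- replicate(1000, exp(X %*% rnorm(4, 0, sqrt(0.01))))
summary(as.vector(lambda_draws))   # do the implied expected counts look plausible?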
We’ll first refit the model, adjusting the prior variance for each \(\beta_j\) to 0.01.
# jags code
m <- textConnection(m.jags)
inits <- list(
as.list(freq_glm),
as.list(freq_glm + runif(4, -1, 1)),
as.list(freq_glm + runif(4, -1, 1))
)
# build model and algorithm
m.out <-
jags.model(
m,
data = list(
'ants' = as.matrix(ants),
'n' = nrow(ants),
'mu' = 0,
'tau' = 1/0.01
),
inits = inits,
n.chains = 3,
n.adapt = 1000,
quiet = TRUE
)
# fit
update(m.out, 2000) # increase burn in to 2K
m.samples <-
coda.samples(m.out, c('b0', 'b1', 'b2', 'b3'), 10000) # increase MCMC iterations to 10K
Now, let’s look at the summary of the output and the trace plots.
MCMCvis::MCMCsummary(m.samples)
## mean sd 2.5% 50% 97.5% Rhat n.eff
## b0 0.021749106 0.0991553317 -0.172503392 0.021770138 0.2150903820 1 2954
## b1 0.275827917 0.0754870928 0.125876591 0.276337703 0.4236786858 1 9606
## b2 0.048559859 0.0033124999 0.042047439 0.048546132 0.0550756705 1 1943
## b3 -0.001493488 0.0003745039 -0.002223922 -0.001490949 -0.0007694229 1 4769
MCMCvis::MCMCtrace(m.samples, pdf = FALSE)
The trace plots look good, and the effective sample sizes have drastically improved. At this point, we can safely proceed with posterior predictive checks before making inferences from the model. First, though, let's compare the Bayesian estimates to the frequentist estimates from earlier.
freq_glm
## b0 b1 b2 b3
## 11.936812111 0.635438863 -0.235793020 -0.001141079
MCMCvis::MCMCsummary(m.samples)
## mean sd 2.5% 50% 97.5% Rhat n.eff
## b0 0.021749106 0.0991553317 -0.172503392 0.021770138 0.2150903820 1 2954
## b1 0.275827917 0.0754870928 0.125876591 0.276337703 0.4236786858 1 9606
## b2 0.048559859 0.0033124999 0.042047439 0.048546132 0.0550756705 1 1943
## b3 -0.001493488 0.0003745039 -0.002223922 -0.001490949 -0.0007694229 1 4769
Yikes! These are so different! A key reason for this is that our prior specification on the \(\beta\) coefficients performs significant shrinkage, especially on \(\beta_0\). Notice, for example, that the frequentist estimate of \(\beta_0\) is extremely unlikely under the current prior. One remedy, explored below, is to standardize the covariates so that the coefficients (the intercept in particular) take values on a scale more compatible with the prior.
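To quantify this (a short sketch): the unstandardized intercept estimate of roughly 11.9 lies about 119 prior standard deviations from the prior mean of zero, so its prior density is effectively zero.
freq_glm['b0'] / sqrt(0.01)                                    # about 119 prior sds from 0
dnorm(freq_glm['b0'], mean = 0, sd = sqrt(0.01), log = TRUE)   # log prior density: enormous and negative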
ants$forest <- as.vector(scale(ants$forest))
ants$latitude <- as.vector(scale(ants$latitude))
ants$elevation <- as.vector(scale(ants$elevation))
# frequentist glm estimation
freqest_scale <- glm(richness~., data = ants, family = poisson(link = 'log'))$coef
names(freqest_scale) <- c('b0', 'b1', 'b2', 'b3')
freqest_scale
## b0 b1 b2 b3
## 1.8451557 0.3213926 -0.2522916 -0.1839013
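As a sanity check (a sketch; ants_raw, m_x, s_x, and b_orig are illustrative names, and it assumes ants.csv is still in the working directory), the standardized-scale MLEs can be mapped back to the original scale: each slope is divided by its covariate's standard deviation and the intercept absorbs the centering. The result should agree with freq_glm.
# back-transform the standardized-scale MLEs to the original covariate scale
ants_raw <- read.csv('ants.csv', header = TRUE)   # unscaled copy of the data
m_x <- colMeans(ants_raw[, -1])                   # covariate means
s_x <- apply(ants_raw[, -1], 2, sd)               # covariate standard deviations
b_orig <- c(freqest_scale['b0'] - sum(freqest_scale[-1] * m_x / s_x),
            freqest_scale[-1] / s_x)
round(b_orig, 6)                                  # should match freq_glm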
# jags code
m <- textConnection(m.jags)
# lukewarm start
inits <- list(
as.list(freqest_scale),
as.list(freqest_scale + runif(4, -1, 1)),
as.list(freqest_scale + runif(4, -1, 1))
)
# build model and algorithm
m.out <-
jags.model(
m,
data = list(
'ants' = as.matrix(ants),
'n' = nrow(ants),
'mu' = 0,
'tau' = 1/0.01
),
inits = inits,
n.chains = 3,
n.adapt = 1000,
quiet = TRUE
)
# fit
update(m.out, 2000) # increase burn in to 2K
m.samples <-
coda.samples(m.out, c('b0', 'b1', 'b2', 'b3'), 10000) # increase MCMC iterations to 10K
The summary and trace plots are provided below. After standardizing the covariates, the convergence diagnostics look good and the Bayesian estimates are much closer to the frequentist estimates. The intercept, in particular, is still pulled somewhat toward zero, reflecting the shrinkage (i.e., regularization) coming from the prior distributions.
MCMCvis::MCMCsummary(m.samples)
## mean sd 2.5% 50% 97.5% Rhat n.eff
## b0 1.2796982 0.06278389 1.1567410 1.2796726 1.4011604 1 15658
## b1 0.3477188 0.06123358 0.2276809 0.3477978 0.4674124 1 17306
## b2 -0.2805512 0.06513413 -0.4081767 -0.2801762 -0.1552971 1 16612
## b3 -0.2197938 0.06166953 -0.3425698 -0.2193559 -0.1003379 1 17210
MCMCvis::MCMCtrace(m.samples, pdf = FALSE)
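Finally, to see the remaining shrinkage directly (it is most visible in the intercept), you can place the frequentist and posterior-mean estimates, both on the standardized scale, side by side; a short sketch using coda's summary method, with post_mean as an illustrative name:
# compare standardized-scale MLEs with posterior means
post_mean <- summary(m.samples)$statistics[, 'Mean']
round(rbind(frequentist = freqest_scale, bayesian = post_mean), 3)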