In this lab you will explore how performance of an MCMC algorithm and the validity of scientific inferences can be affected by the choice of prior distribution.
Set up: Gotelli and Ellison (2002) investigated the impact of elevation and latitude on ant species richness. To measure species richness, square grids, 64 meters on each side, were established in 22 different bogs and in the forests surrounding them (44 grids in total), and the number of ant species within each grid was recorded. We will use this data set in the investigation below.
For this lab, you will want to load the following packages:
library(GGally)
library(rjags)
set.seed(10)
Read in the data and explore it using pairs or GGally::ggpairs. What do you notice? How might this impact your modeling choices?
ants <- read.csv('ants.csv', header = TRUE)
GGally::ggpairs(ants)
Richness is a count variable giving the number of ant species found in each sampling grid. Forest is an indicator variable equal to 1 if the site is in the forest and 0 if it is in the bog. Latitude and elevation are both continuous variables. We will consider a Poisson data model.
Let \[g(\boldsymbol{\beta}, \mathbf{x}_i)=\text{log}^{-1}(\beta_0 + \beta_1x_{\text{forest}} + \beta_2x_{\text{latitude}} + \beta_3x_{\text{elev}}) =\exp(\beta_0 + \beta_1x_{\text{forest}} + \beta_2x_{\text{latitude}} + \beta_3x_{\text{elev}}).\]
Then, \[\begin{align*} [\boldsymbol{\beta} \mid \mathbf{y}] & \propto [\mathbf{y} \mid g(\boldsymbol{\beta}, \mathbf{X})][\boldsymbol{\beta}] \\ &= \prod_{i=1}^{44}{\text{Poisson}(y_i\mid g(\boldsymbol{\beta}, \mathbf{x}_i))}\prod_{j=0}^{3}{\text{normal}(\beta_j \mid 0, 10000)} \end{align*}\]
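To make this concrete, here is a minimal R sketch of the (unnormalized) log posterior implied by the expressions above; the function name log_post and the object X are illustrative and not part of the lab code.
log_post <- function(beta, y, X, prior_var = 10000) {
  lambda <- exp(X %*% beta)                             # inverse-log link, g(beta, x_i)
  sum(dpois(y, lambda, log = TRUE)) +                   # Poisson data model
    sum(dnorm(beta, 0, sqrt(prior_var), log = TRUE))    # normal(0, 10000) priors
}
X <- cbind(1, as.matrix(ants[, -1]))     # intercept, forest, latitude, elevation
log_post(rep(0, 4), ants$richness, X)    # quick evaluation at beta = 0
For comparison (and for later use as starting values for the MCMC chains), we first obtain the frequentist maximum likelihood estimates with glm().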
freq_glm <- glm(richness ~ ., data = ants, family = poisson(link = 'log'))$coef
names(freq_glm) <- c('b0', 'b1', 'b2', 'b3')
freq_glm
## b0 b1 b2 b3
## 11.936812111 0.635438863 -0.235793020 -0.001141079
# jags code
m.jags <- "
model{
  # Poisson data model with log link
  for(i in 1:n){
    ants[i, 1] ~ dpois(lam[i])
    log(lam[i]) <- b0 + b1*ants[i, 2] + b2*ants[i, 3] + b3*ants[i, 4]
  }
  # priors; note that dnorm in JAGS is parameterized by mean and precision (tau)
  b0 ~ dnorm(mu, tau)
  b1 ~ dnorm(mu, tau)
  b2 ~ dnorm(mu, tau)
  b3 ~ dnorm(mu, tau)
}
"
m <- textConnection(m.jags)
# set initializations
inits <- list(list(b0 = 0, b1 = 0, b2 = 0, b3 = 0),
list(b0 = 0, b1 = 0, b2 = 0, b3 = 0),
list(b0 = 0, b1 = 0, b2 = 0, b3 = 0))
# build model and algorithm
m.out <-
jags.model(
m,
data = list(
'ants' = as.matrix(ants),
'n' = nrow(ants),
'mu' = 0,
'tau' = 1/10000
),
inits = inits,
n.chains = 3,
n.adapt = 1000,
quiet = TRUE
)
# fit
update(m.out, 50)
m.samples <-
coda.samples(m.out, c('b0', 'b1', 'b2', 'b3'), 500)
We first look at the summary information and the trace plots.
MCMCvis::MCMCsummary(m.samples)
## mean sd 2.5% 50% 97.5% Rhat n.eff
## b0 6.842604785 0.8569001163 5.058657141 6.778858982 8.6704410119 2.06 11
## b1 0.641359666 0.1181498137 0.411558120 0.643233556 0.8768860371 1.01 249
## b2 -0.116202363 0.0199950935 -0.158752229 -0.115221922 -0.0740831533 2.09 11
## b3 -0.001322329 0.0003604419 -0.002007422 -0.001335771 -0.0005631731 1.02 370
MCMCvis::MCMCtrace(m.samples, pdf = FALSE)
Clearly, we should be concerned about convergence of the chains. We will need to correct these issues and conduct posterior predictive checks before making any inferences. Notice specifically the trace plots and effective sample sizes for \(\beta_0\) and \(\beta_2\).
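MCMCvis reports Rhat and effective sample sizes, but you can also cross-check these diagnostics directly with the coda package (loaded along with rjags); a quick sketch:
coda::gelman.diag(m.samples)    # Rhat (potential scale reduction factor) for each parameter
coda::effectiveSize(m.samples)  # effective sample size for each parameter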
There are three (more) obvious concerns we need to address to improve convergence of the chains. That is: (1) all three chains were initialized at the same arbitrary values (every coefficient set to zero), (2) a burn-in of only 50 iterations is far too short, and (3) 500 post-burn-in iterations are too few to produce adequate effective sample sizes.
We implement the three suggestions outlined above. Specifically, we initialize the first chain starting at the frequentist estimates. This is called a lukewarm start. We initialize the other two chains uniformly at random within a reasonable range. Again, we use three chains so we can formally assess convergence using Rhat.
# jags code
m <- textConnection(m.jags)
inits <- list(
as.list(freq_glm),
as.list(freq_glm + runif(4, -1, 1)),
as.list(freq_glm + runif(4, -1, 1))
)
# build model and algorithm
m.out <-
jags.model(
m,
data = list(
'ants' = as.matrix(ants),
'n' = nrow(ants),
'mu' = 0,
'tau' = 1/10000
),
inits = inits,
n.chains = 3,
n.adapt = 1000,
quiet = TRUE
)
# fit
update(m.out, 2000) # increase burn in to 2K
m.samples <-
coda.samples(m.out, c('b0', 'b1', 'b2', 'b3'), 10000) # increase MCMC iterations to 10K
Let’s take a look at the diagnostics.
MCMCvis::MCMCsummary(m.samples)
## mean sd 2.5% 50% 97.5% Rhat n.eff
## b0 12.480473699 2.0559738409 8.38867555 12.818138112 16.0260997927 1.45 15
## b1 0.639058345 0.1192834465 0.40685757 0.638498230 0.8738555421 1.00 4458
## b2 -0.248737543 0.0483309153 -0.33211418 -0.256428447 -0.1529219613 1.45 15
## b3 -0.001135004 0.0003705971 -0.00186752 -0.001130364 -0.0004185119 1.01 4743
MCMCvis::MCMCtrace(m.samples, pdf = FALSE)
Clearly, there is still an issue. The effective sample sizes for \(\beta_0\) and \(\beta_2\) are still comically low, and the trace plots for those parameters do not exhibit stationarity. The next question will motivate how to track down the mistake. Bonus points to those who caught it already.
We first simulate one data set according to the data model and prior specification (that is, a prior predictive check).
prior_var <- 10000
betas <- rnorm(4,0,sqrt(prior_var))
lambda_i <- exp(cbind(rep(1, nrow(ants)), as.matrix(ants[,-1])) %*% betas)
as.vector(lambda_i)
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [39] 0 0 0 0 0 0
rpois(n = nrow(ants), as.vector(lambda_i))
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [39] 0 0 0 0 0 0
Let’s simulate a second data set.
betas <- rnorm(4,0,sqrt(prior_var))
lambda_i <- exp(cbind(rep(1, nrow(ants)), as.matrix(ants[,-1])) %*% betas)
as.vector(lambda_i)
## [1] Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf
## [20] Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf
## [39] Inf Inf Inf Inf Inf Inf
rpois(n = nrow(ants), as.vector(lambda_i))
## Warning in rpois(n = nrow(ants), as.vector(lambda_i)): NAs produced
## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
We used a \(\text{normal}(0, 10000)\) as a “vague” prior for each \(\beta_j\). However, we see that this large prior variance causes every \(\lambda_i\) to be either zero or infinite. YIKES! This is another instance where a non-linear function of parameters with vague priors has a highly informative prior.
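A quick numerical check (a sketch, not part of the original lab code) shows why: because covariates such as latitude and elevation are far from zero, a prior standard deviation of 100 on the coefficients typically puts the linear predictor in the hundreds or thousands in absolute value, well past the point where exp() overflows to Inf or underflows to 0 in double precision.
# why every lambda_i collapses to 0 or Inf: the linear predictor is enormous
eta <- cbind(1, as.matrix(ants[, -1])) %*% betas   # linear predictor for the last prior draw
range(eta)                                         # typically hundreds or thousands in magnitude
exp(710); exp(-746)                                # exp() overflows to Inf / underflows to 0 here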
prior_var <- 0.01
betas <- rnorm(4,0,sqrt(prior_var))
lambda_i <- exp(cbind(rep(1, nrow(ants)), as.matrix(ants[,-1])) %*% betas)
as.vector(lambda_i)
## [1] 782.22635 708.67688 26.05572 23.60581 94.54371 85.65416
## [7] 24.56693 22.25701 158.97695 144.02901 49.31715 44.68007
## [13] 37.44144 33.92098 1983.78599 1797.25888 72.95765 66.09775
## [19] 58.00573 52.55169 292.47892 264.97835 504.79496 457.33120
## [25] 3239.96191 2935.32184 457.52534 414.50615 109.84292 99.51485
## [31] 463.15208 419.60382 1881.81133 1704.87248 732.06887 663.23550
## [37] 240.49581 217.88300 38.31555 34.71290 697.38794 631.81547
## [43] 99.06621 89.75143
sort(rpois(n = nrow(ants), as.vector(lambda_i)))
## [1] 23 24 26 26 29 34 35 37 44 50 53 58 61 66 79
## [16] 85 87 105 106 110 143 178 225 251 253 322 413 434 458 463
## [31] 487 492 662 687 698 707 726 755 1737 1805 1903 2026 2965 3184
We see that this smaller prior variance provides more reasonable ant species counts. We might still consider this prior to be relatively “vague” in the sense that the range of simulated species counts varies considerably.
Based on this investigation, we might propose a “vague” prior of \(\beta_j\sim\text{normal}(0, 0.01)\) for each of the coefficients in this generalized linear model.
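To be more systematic, you could repeat the prior predictive simulation many times and summarize the implied expected counts rather than judging from a single draw; a minimal sketch (the number of replicates, 1000, is arbitrary, and lambda_draws is an illustrative name):
# prior predictive check of the expected counts under beta_j ~ normal(0, 0.01)
X <- cbind(1, as.matrix(ants[, -1]))
lambda_draws <- replicate(1000, exp(X %*% rnorm(4, 0, sqrt(0.01))))
summary(as.vector(lambda_draws))   # do the implied expected counts look plausible?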
We’ll first refit the model, adjusting the prior variance for each \(\beta_j\) to 0.01.
# jags code
m <- textConnection(m.jags)
inits <- list(
as.list(freq_glm),
as.list(freq_glm + runif(4, -1, 1)),
as.list(freq_glm + runif(4, -1, 1))
)
# build model and algorithm
m.out <-
jags.model(
m,
data = list(
'ants' = as.matrix(ants),
'n' = nrow(ants),
'mu' = 0,
'tau' = 1/0.01
),
inits = inits,
n.chains = 3,
n.adapt = 1000,
quiet = TRUE
)
# fit
update(m.out, 2000) # increase burn in to 2K
m.samples <-
coda.samples(m.out, c('b0', 'b1', 'b2', 'b3'), 10000) # increase MCMC iterations to 10K
Now, let’s look at the summary of the output and the trace plots.
MCMCvis::MCMCsummary(m.samples)
## mean sd 2.5% 50% 97.5% Rhat n.eff
## b0 0.021749106 0.0991553317 -0.172503392 0.021770138 0.2150903820 1 2954
## b1 0.275827917 0.0754870928 0.125876591 0.276337703 0.4236786858 1 9606
## b2 0.048559859 0.0033124999 0.042047439 0.048546132 0.0550756705 1 1943
## b3 -0.001493488 0.0003745039 -0.002223922 -0.001490949 -0.0007694229 1 4769
MCMCvis::MCMCtrace(m.samples, pdf = FALSE)
The trace plots look good, and the effective sample sizes have drastically improved. At this point, we can safely proceed with posterior predictive checks before making inferences from the model. First, though, let's compare the Bayesian estimates to the frequentist estimates from earlier.
freq_glm
## b0 b1 b2 b3
## 11.936812111 0.635438863 -0.235793020 -0.001141079
MCMCvis::MCMCsummary(m.samples)
## mean sd 2.5% 50% 97.5% Rhat n.eff
## b0 0.021749106 0.0991553317 -0.172503392 0.021770138 0.2150903820 1 2954
## b1 0.275827917 0.0754870928 0.125876591 0.276337703 0.4236786858 1 9606
## b2 0.048559859 0.0033124999 0.042047439 0.048546132 0.0550756705 1 1943
## b3 -0.001493488 0.0003745039 -0.002223922 -0.001490949 -0.0007694229 1 4769
Yikes! These are so different! A key reason for this is that our prior specification on the \(\beta\) coefficients performs significant shrinkage, especially on \(\beta_0\). Notice, for example, that the frequentist estimate of \(\beta_0\) is extremely unlikely under the current prior. One remedy, explored below, is to standardize the covariates so that the coefficients (the intercept in particular) take values on a scale more compatible with the prior.
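To quantify this (a short sketch): the unstandardized intercept estimate of roughly 11.9 lies about 119 prior standard deviations from the prior mean of zero, so its prior density is effectively zero.
freq_glm['b0'] / sqrt(0.01)                                    # about 119 prior sds from 0
dnorm(freq_glm['b0'], mean = 0, sd = sqrt(0.01), log = TRUE)   # log prior density: enormous and negative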
ants$forest <- as.vector(scale(ants$forest))
ants$latitude <- as.vector(scale(ants$latitude))
ants$elevation <- as.vector(scale(ants$elevation))
# frequentist glm estimation
freqest_scale <- glm(richness~., data = ants, family = poisson(link = 'log'))$coef
names(freqest_scale) <- c('b0', 'b1', 'b2', 'b3')
freqest_scale
## b0 b1 b2 b3
## 1.8451557 0.3213926 -0.2522916 -0.1839013
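As a sanity check (a sketch; ants_raw, m_x, s_x, and b_orig are illustrative names, and it assumes ants.csv is still in the working directory), the standardized-scale MLEs can be mapped back to the original scale: each slope is divided by its covariate's standard deviation and the intercept absorbs the centering. The result should agree with freq_glm.
# back-transform the standardized-scale MLEs to the original covariate scale
ants_raw <- read.csv('ants.csv', header = TRUE)   # unscaled copy of the data
m_x <- colMeans(ants_raw[, -1])                   # covariate means
s_x <- apply(ants_raw[, -1], 2, sd)               # covariate standard deviations
b_orig <- c(freqest_scale['b0'] - sum(freqest_scale[-1] * m_x / s_x),
            freqest_scale[-1] / s_x)
round(b_orig, 6)                                  # should match freq_glm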
# jags code
m <- textConnection(m.jags)
# lukewarm start
inits <- list(
as.list(freqest_scale),
as.list(freqest_scale + runif(4, -1, 1)),
as.list(freqest_scale + runif(4, -1, 1))
)
# build model and algorithm
m.out <-
jags.model(
m,
data = list(
'ants' = as.matrix(ants),
'n' = nrow(ants),
'mu' = 0,
'tau' = 1/0.01
),
inits = inits,
n.chains = 3,
n.adapt = 1000,
quiet = TRUE
)
# fit
update(m.out, 2000) # increase burn in to 2K
m.samples <-
coda.samples(m.out, c('b0', 'b1', 'b2', 'b3'), 10000) # increase MCMC iterations to 10K
The summary and trace plots are provided below. After standardizing the covariates, the convergence diagnostics look good and the Bayesian estimates are much closer to the frequentist estimates. The intercept, in particular, is still pulled somewhat toward zero, reflecting the shrinkage (i.e., regularization) coming from the prior distributions.
MCMCvis::MCMCsummary(m.samples)
## mean sd 2.5% 50% 97.5% Rhat n.eff
## b0 1.2796982 0.06278389 1.1567410 1.2796726 1.4011604 1 15658
## b1 0.3477188 0.06123358 0.2276809 0.3477978 0.4674124 1 17306
## b2 -0.2805512 0.06513413 -0.4081767 -0.2801762 -0.1552971 1 16612
## b3 -0.2197938 0.06166953 -0.3425698 -0.2193559 -0.1003379 1 17210
MCMCvis::MCMCtrace(m.samples, pdf = FALSE)
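Finally, to see the remaining shrinkage directly (it is most visible in the intercept), you can place the frequentist and posterior-mean estimates, both on the standardized scale, side by side; a short sketch using coda's summary method, with post_mean as an illustrative name:
# compare standardized-scale MLEs with posterior means
post_mean <- summary(m.samples)$statistics[, 'Mean']
round(rbind(frequentist = freqest_scale, bayesian = post_mean), 3)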