Bayesian Models for Ecologists
Bayesian Regression Lab
June 06, 2024

In this lab you will practice building regression models that are appropriate for a given data set.


Problem A. Hein et. al (2013) investigate how environmental conditions influence coexistence between brown trout (Salmo trutta) and pike (Esox lucius) in Swedish lakes. One-hundred and fifty one (151) lakes consisting of brown trout are introduced to pike by human intervention, and whether or not the two species coexist was recorded. For each lake, five environmental conditions deemed relevant to coexistence patterns are considered:

The predictors have been standardized.

  1. Plot the data using pairs or GGally::ggpairs. What do you notice? How might this impact your modeling choices?


coexist <- read.csv('coexist.csv',header=TRUE)
GGally::ggpairs(coexist)

This grid of plots given by GGally shows the empirical density (a.k.a. marginal distribution) of each variable on the diagonal, the scatterplot of points for pairs of variables on the lower triangle, and the Pearson correlation between variables in the upper right triangle.

We notice that the coexist outcome variable takes on two values (0 and 1). We also notice that there is one relatively large lake in the data set. A binary data model is appropriate given the support of the data.



  1. Let \(y_i\) be the indicator of whether the species coexisted in lake \(i\). We wish to understand the relationship between the probability of coexistence and the five environmental covariates. Write out a reasonable data model.


\[y_i \sim\text{Bernoulli}(p_i)\] where \[\text{logit}(p_i)=\beta_0 + \beta_1x_{i1} + \dots + \beta_{5}x_{i5}\]



  1. Assume we know very little about the impact of the environmental covariates on coexistence. Specify appropriate prior distributions for all unknown parameters. Explain your choice in priors.


Considering no prior information is available, we use a vague prior specification by positing that \(\beta_j\sim\text{Normal}(0, 2.7)\) for \(j=0, 1, \dots, 5\). Note, the variance parameter \(\sigma^2\) is set to 2.7 based on the suggestion in the Bayesian Regression lecture.

(As a reminder, when we write the model the second parameter of the normal distribution represents the variance, however when we write the JAGS code the second parameter of the normal distribution is the precision (1/variance).)



  1. Write the expression for the posterior in terms of the joint distribution of the parameters and data.


\[[\beta_0, \beta_1, \dots, \beta_5\mid \mathbf{y}] \propto \prod_{i=1}^{151}{\text{Bernoulli}(y_i\mid p_i)}\prod_{j=0}^{5}{\text{Normal}(\beta_j\mid 0, 2.7)}\] where \(p_i = \text{logit}^{-1}(\beta_0 + \beta_1x_{i1} + \dots + \beta_5x_{i5})=\frac{\exp(\beta_0 + \beta_1x_{i1} + \dots + \beta_5x_{i5})}{1 + \exp(\beta_0 + \beta_1x_{i1} + \dots + \beta_5x_{i5})}\)



  1. Write out the JAGS code for the model.


y <- coexist$coexist
n <- length(y)
X <- as.matrix(cbind(intercept = rep(1, times = nrow(coexist)), coexist[,-1]))

model{
    b0 ~ dnorm(0,1/2.7)
    b1 ~ dnorm(0,1/2.7)
    b2 ~ dnorm(0,1/2.7)
    b3 ~ dnorm(0,1/2.7)
    b4 ~ dnorm(0,1/2.7)
    b5 ~ dnorm(0,1/2.7)
    for(i in 1:n){
      logit(theta[i]) <- b0+b1*X[i,2]+b2*X[i,3]+b3*X[i,4]+b4*X[i,5]+b5*X[i,6]
      y[i] ~ dbern(theta[i]) 
    }
}



Problem B. Kembel and Cahill Jr. (2011) collect data from temperate grassland plant communities in Alberta, Canada. Twenty-seven plots are established, and the slope, slope position, aspect, and relative moisture of each plot is recorded. For each plot, the abundance of several species is recorded, interpreted as the proportion of land area covered by the species. The land coverage by Hesperostipa comata ssp. comata is considered here.

  1. Plot the data using pairs or GGally::ggpairs. What do you notice? How might this impact your modeling choices?


coverage <- read.csv('hesp_coverage.csv',header=TRUE)
GGally::ggpairs(coverage)

Coverage is a proportion, bounded between 0 and 1. There are several values of coverage near zero. Lastly, the slope position appears to be discrete.



  1. Let \(y_i\) represent the proportion of land covered by Hesperostipa comata ssp. comata. Consider the Bayesian regression model: \[\begin{equation} y_i\sim\text{Beta}(\alpha_i, \gamma_i), \text{ for } i=1, \dots, 27, \\ \alpha_i=\mu_{i}\tau, \\ \gamma_i=(1-\mu_i)\tau \end{equation}\]

Express the mean and variance of \(y_i\) in terms of \(\mu_i\) and \(\tau\). What is the interpretation of \(\mu_i\)? What is the interpretation of \(\tau\)?


\[\text{mean} = \mu_i\]

\[\text{variance} = \frac{\mu_i(1-\mu_i)}{(\tau + 1)} \] As \(\tau\) increases, the variance decreases so \(\tau\) controls the precision.



  1. What is an appropriate link function to connect \(\mu_i\) with the four predictor variables?


Since the mean \(\mu_i\) is a proportion, we use the logit link \[\text{logit}(\mu_i)=\beta_0 + \beta_1x_{i1} + \dots + \beta_4x_{i4}\]



  1. For the model, assume the priors \[\begin{equation} \beta_j\sim\text{normal}(0,10), j=0,1,2,3,4\\ \tau\sim\text{gamma}(0.04, 0.02). \end{equation}\] Express the posterior distribution as proportional to joint distribution for your model.


\[\begin{equation*} [\boldsymbol{\beta}, \tau \mid \mathbf{y}] \propto \prod_{i=1}^{27}{\text{Beta}(y_i\mid g(\boldsymbol{\beta}, \mathbf{x}_i), \tau)}\times \prod_{j=0}^{4}{\text{Normal}(\beta_j\mid 0, 10)} \times {\text{Gamma}(\tau \mid 0.04, 0.02)} \end{equation*}\] where \(\mu_{i} = g(\boldsymbol{\beta}, \mathbf{x}_i) = \text{logit}^{-1}(\beta_0 + \beta_1x_{i1} + \dots + \beta_4x_{i4})\).



  1. Write out the JAGS code for the model.


y <- coverage$coverage
n <- length(y)
X <- as.matrix(cbind(intercept = rep(1, times = nrow(coverage)), coverage[,-1]))

model{
  tau ~ dgamma(0.04, 0.02)
  b0 ~ dnorm(0,0.1)
  b1 ~ dnorm(0,0.1)
  b2 ~ dnorm(0,0.1)
  b3 ~ dnorm(0,0.1)
  b4 ~ dnorm(0,0.1)
  for(i in 1:n){
    logit(mu[i]) <- b0+b1*X[i,2]+b2*X[i,3]+b3*X[i,4]+b4*X[i,5]
    alpha[i] <- mu[i] * tau
    gamma[i] <- (1-mu[i]) * tau
    y[i] ~ dbeta(alpha[i], gamma[i])
  }
}