In this lab you will practice building regression models that are appropriate for a given data set.
Problem A. Hein et. al (2013) investigate how environmental conditions influence coexistence between brown trout (Salmo trutta) and pike (Esox lucius) in Swedish lakes. One-hundred and fifty one (151) lakes consisting of brown trout are introduced to pike by human intervention, and whether or not the two species coexist was recorded. For each lake, five environmental conditions deemed relevant to coexistence patterns are considered:
The predictors have been standardized.
pairs
or
GGally::ggpairs
. What do you notice? How might this impact
your modeling choices?coexist <- read.csv('coexist.csv',header=TRUE)
GGally::ggpairs(coexist)
This grid of plots given by GGally shows the empirical density (a.k.a. marginal distribution) of each variable on the diagonal, the scatterplot of points for pairs of variables on the lower triangle, and the Pearson correlation between variables in the upper right triangle.
We notice that the coexist outcome variable takes on two values (0 and 1). We also notice that there is one relatively large lake in the data set. A binary data model is appropriate given the support of the data.
\[y_i \sim\text{Bernoulli}(p_i)\] where \[\text{logit}(p_i)=\beta_0 + \beta_1x_{i1} + \dots + \beta_{5}x_{i5}\]
Considering no prior information is available, we use a vague prior specification by positing that \(\beta_j\sim\text{Normal}(0, 2.7)\) for \(j=0, 1, \dots, 5\). Note, the variance parameter \(\sigma^2\) is set to 2.7 based on the suggestion in the Bayesian Regression lecture.
(As a reminder, when we write the model the second parameter of the normal distribution represents the variance, however when we write the JAGS code the second parameter of the normal distribution is the precision (1/variance).)
\[[\beta_0, \beta_1, \dots, \beta_5\mid \mathbf{y}] \propto \prod_{i=1}^{151}{\text{Bernoulli}(y_i\mid p_i)}\prod_{j=0}^{5}{\text{Normal}(\beta_j\mid 0, 2.7)}\] where \(p_i = \text{logit}^{-1}(\beta_0 + \beta_1x_{i1} + \dots + \beta_5x_{i5})=\frac{\exp(\beta_0 + \beta_1x_{i1} + \dots + \beta_5x_{i5})}{1 + \exp(\beta_0 + \beta_1x_{i1} + \dots + \beta_5x_{i5})}\)
y <- coexist$coexist
n <- length(y)
X <- as.matrix(cbind(intercept = rep(1, times = nrow(coexist)), coexist[,-1]))
model{
b0 ~ dnorm(0,1/2.7)
b1 ~ dnorm(0,1/2.7)
b2 ~ dnorm(0,1/2.7)
b3 ~ dnorm(0,1/2.7)
b4 ~ dnorm(0,1/2.7)
b5 ~ dnorm(0,1/2.7)
for(i in 1:n){
logit(theta[i]) <- b0+b1*X[i,2]+b2*X[i,3]+b3*X[i,4]+b4*X[i,5]+b5*X[i,6]
y[i] ~ dbern(theta[i])
}
}
Problem B. Kembel and Cahill Jr. (2011) collect data from temperate grassland plant communities in Alberta, Canada. Twenty-seven plots are established, and the slope, slope position, aspect, and relative moisture of each plot is recorded. For each plot, the abundance of several species is recorded, interpreted as the proportion of land area covered by the species. The land coverage by Hesperostipa comata ssp. comata is considered here.
pairs
or
GGally::ggpairs
. What do you notice? How might this impact
your modeling choices?coverage <- read.csv('hesp_coverage.csv',header=TRUE)
GGally::ggpairs(coverage)
Coverage is a proportion, bounded between 0 and 1. There are several values of coverage near zero. Lastly, the slope position appears to be discrete.
Express the mean and variance of \(y_i\) in terms of \(\mu_i\) and \(\tau\). What is the interpretation of \(\mu_i\)? What is the interpretation of \(\tau\)?
\[\text{mean} = \mu_i\]
\[\text{variance} = \frac{\mu_i(1-\mu_i)}{(\tau + 1)} \] As \(\tau\) increases, the variance decreases so \(\tau\) controls the precision.
Since the mean \(\mu_i\) is a proportion, we use the logit link \[\text{logit}(\mu_i)=\beta_0 + \beta_1x_{i1} + \dots + \beta_4x_{i4}\]
\[\begin{equation*} [\boldsymbol{\beta}, \tau \mid \mathbf{y}] \propto \prod_{i=1}^{27}{\text{Beta}(y_i\mid g(\boldsymbol{\beta}, \mathbf{x}_i), \tau)}\times \prod_{j=0}^{4}{\text{Normal}(\beta_j\mid 0, 10)} \times {\text{Gamma}(\tau \mid 0.04, 0.02)} \end{equation*}\] where \(\mu_{i} = g(\boldsymbol{\beta}, \mathbf{x}_i) = \text{logit}^{-1}(\beta_0 + \beta_1x_{i1} + \dots + \beta_4x_{i4})\).
y <- coverage$coverage
n <- length(y)
X <- as.matrix(cbind(intercept = rep(1, times = nrow(coverage)), coverage[,-1]))
model{
tau ~ dgamma(0.04, 0.02)
b0 ~ dnorm(0,0.1)
b1 ~ dnorm(0,0.1)
b2 ~ dnorm(0,0.1)
b3 ~ dnorm(0,0.1)
b4 ~ dnorm(0,0.1)
for(i in 1:n){
logit(mu[i]) <- b0+b1*X[i,2]+b2*X[i,3]+b3*X[i,4]+b4*X[i,5]
alpha[i] <- mu[i] * tau
gamma[i] <- (1-mu[i]) * tau
y[i] ~ dbeta(alpha[i], gamma[i])
}
}