Bayesian Models for Ecologists
Markov Chain Monte Carlo: Gibbs Sampling
June 05, 2024

Motivation

This problem challenges you to understand how the Markov chain Monte Carlo algorithm takes a multivariate joint distribution and breaks it into a series of univariate distributions that can be sampled one at a time, allowing you to approximate the marginal posterior distribution of each unknown. There are only two unknowns in this problem, but the same principles and approach would apply if there were two hundred. The accompanying document, MCMCMath1.pdf, describes the math behind the coding you will do here. You should study it thoroughly before proceeding.



Problem

You will write code using conjugate relationships, also known as Gibbs updates, to draw samples from marginal posterior distributions of a mean and variance.

  1. Set the random number seed to 10 in R with set.seed(10).

  2. Load the actuar library, which contains functions for the inverse gamma distribution needed by the draw_var function in step 5.

  3. Simulate 100 data points from a normal distribution with mean \(\theta = 100\) and variance \(\varsigma^{2} = 25\). Call the data set y. Be careful here. R requires the standard deviation, not the variance, as a parameter. You will use these “fake” data to verify the Gibbs sampler you will write below. Simulating data is always a good way to test methods. Your method should be able to recover the generating parameters given a sufficiently large number of simulated observations.


varsigma.sq = 25
theta = 100
n = 100
y = rnorm(n, theta, sqrt(varsigma.sq))


  4. We have saved you some time by writing a function called draw_mean that makes draws from the marginal posterior distribution of \(\theta\) using a normal-normal conjugate relationship where the variance is assumed to be known. It is vital that you study the MCMCmath.pdf notes relative to this function.


# normal likelihood with normal prior conjugate for mean, assuming variance is known
# mu_0 is prior mean
# sigma.sq_0 is prior variance of mean
# varsigma.sq is known variance of data

draw_mean = function(mu_0, sigma.sq_0, varsigma.sq, y){
    mu_1 = (mu_0 / sigma.sq_0 + sum(y) / varsigma.sq) / (1 / sigma.sq_0 + length(y) / varsigma.sq)
    sigma.sq_1 = 1/(1 / sigma.sq_0 + length(y) / varsigma.sq)
    z = rnorm(1, mu_1, sqrt(sigma.sq_1))
    param = list(z = z, mu_1 = mu_1, sigma.sq_1 = sigma.sq_1)
    return(param)
}
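
The quantities computed inside draw_mean are the standard normal-normal conjugate updates, written out here directly from the code above (see the handout for the full derivation):

\[\begin{align} \mu_1 &= \frac{\mu_0/\sigma^2_0 + \sum_{i=1}^n y_i/\varsigma^2}{1/\sigma^2_0 + n/\varsigma^2}\\ \sigma^2_1 &= \frac{1}{1/\sigma^2_0 + n/\varsigma^2}\\ \theta \mid \cdot &\sim \text{normal}(\mu_1, \sigma^2_1) \end{align}\]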


  5. We have also provided a function called draw_var that makes draws from the marginal posterior distribution for \(\varsigma^2\) using an inverse gamma-normal conjugate relationship where the mean is assumed to be known. Study this function relative to the MCMCmath.pdf handout.


# normal likelihood with inverse-gamma prior conjugate relationship for variance, assuming mean is known
# alpha_0 is parameter of prior for variance
# beta_0 is parameter of prior for variance
# Note that this uses scale parameterization for inverse gamma

draw_var = function(alpha_0, beta_0, theta, y){
    alpha_1 = alpha_0 + length(y) / 2
    beta_1 = beta_0 + sum((y - theta)^2) / 2
    z = rinvgamma(1, alpha_1, scale = beta_1)
    param = list(z = z, alpha_1 = alpha_1, beta_1 = beta_1)
    return(param)
}
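
Similarly, the quantities computed inside draw_var are the inverse gamma-normal conjugate updates implied by the code above:

\[\begin{align} \alpha_1 &= \alpha_0 + \frac{n}{2}\\ \beta_1 &= \beta_0 + \frac{1}{2}\sum_{i=1}^n(y_i-\theta)^2\\ \varsigma^2 \mid \cdot &\sim \text{inverse gamma}(\alpha_1, \beta_1) \end{align}\]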


  6. Check the functions by simulating a large number of data points from a normal distribution using a mean and variance of your own choosing. Store the data points in a vector called y_check. Assume flat priors for the mean and the variance. A vague prior for the inverse gamma has parameters \(\alpha_0=.001\) and \(\beta_0=.001\).


check_mean = 32
check_sigma = 3.2
check_var = check_sigma^2
n.draw = 10000
y_check=rnorm(n.draw, mean = check_mean, sd = check_sigma)

var_vec = numeric(n.draw)
mean_vec = numeric(n.draw)

for(i in 1:n.draw) {
  
  var_vec[i] = draw_var(alpha_0 = .001, beta_0 = .001, theta = check_mean, y = y_check)$z
  mean_vec[i] = draw_mean(mu_0 = 0, sigma.sq_0 = 10000, varsigma.sq = check_var, y = y_check)$z
  
} 

mean(var_vec)
## [1] 10.36529
check_var
## [1] 10.24
mean(mean_vec)
## [1] 32.00431
check_mean
## [1] 32
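
As an additional check, not required by the exercise, the spread of the draws of the mean should approximate the analytic conditional standard deviation \(\sqrt{\sigma^2_1}\) returned by draw_mean:

sd(mean_vec)
sqrt(draw_mean(0, 10000, check_var, y_check)$sigma.sq_1)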



Writing a sampler

Now execute these steps:

  1. Set up a matrix for storing samples from the posterior distribution of the mean. The number of rows should equal the number of chains (3) and the number of columns should equal the number of iterations (10,000). Do the same thing for storing samples from the posterior distribution of the variance.


n.iter = 10000
n.chains = 3
pvar = matrix(nrow = n.chains, ncol = n.iter)
pmean = matrix(nrow = n.chains, ncol = n.iter)


  2. Assign initial values to the first column of each matrix, a different value for each of the chains. These can be virtually any value within the support of the random variable, but for this exercise it is fine to use values not terribly far from those you used to simulate the data, reflecting some prior knowledge. You might try varying these later to show that you get the same results.


pmean[1:3, 1] = c(50, 20, 1)
pvar[1:3, 1] = c(10, 5, .1)


  3. Set up nested for loops to iterate from one to the total number of iterations for each of the three chains for each parameter. Use the conjugate functions draw_mean and draw_var to draw a sample from the distribution of the mean using the most recent value of the variance, then make a draw from the variance using the value of the mean you just drew. Repeat. Assume vague priors for the mean and variance:


\[[\,\theta\,] = \textrm{normal}(\,\theta \mid 0, 10000\,)\] \[[\,\varsigma^{2}\,] = \textrm{inverse gamma}(\,\varsigma^{2} \mid .001, .001\,)\]

for(t in 2:n.iter) {
  for (j in 1:n.chains) {
    pmean[j, t] = draw_mean(mu_0 = 0, sigma.sq_0 = 10000, varsigma.sq = pvar[j, t - 1], y = y)$z
    pvar[j, t] = draw_var(alpha_0 = .001, beta_0 = .001, theta = pmean[j, t], y = y)$z
  }
}
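
A quick, informal check, not part of the original exercise, is that chains started from different initial values should give essentially the same posterior summaries once early iterations are ignored:

# posterior mean of each parameter by chain, ignoring the first 1000 iterations;
# the three values should be nearly identical if the chains have converged
rowMeans(pmean[, 1001:n.iter])
rowMeans(pvar[, 1001:n.iter])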



Trace plots and plots of marginal posteriors

  1. Discard the first 1000 iterations as burn-in. Plot the value of the mean as a function of iteration number for each chain. This is called a trace plot.


burnin = 1000
samplesKept <- (burnin+1):n.iter

plot(samplesKept, pmean[1, samplesKept], typ = "l", ylab = expression(theta), xlab = "Iteration", col = "yellow")
lines(samplesKept, pmean[2, samplesKept], typ = "l", col = "red")
lines(samplesKept, pmean[3, samplesKept], typ = "l", col = "green")


  2. Make a histogram of the samples of the mean retained after burn-in, including all chains. Put a vertical line on the plot showing the generating value.


hist(pmean[, samplesKept], breaks = 100, freq = FALSE, main = expression(theta), xlim = c(95, 105), xlab = "Value of MCMC samples", col = "gray")
lines(density(pmean[, samplesKept]), col = "red", lwd = 3)
abline(v = theta, lty = "dashed", col = "blue", lwd = 4)


  3. Repeat steps 1-2 for the variance.


plot(samplesKept, pvar[1, samplesKept], typ = "l", ylab = expression(varsigma^2), xlab = "Iteration", col = "yellow")
lines(samplesKept, pvar[2, samplesKept], typ = "l", col = "red")
lines(samplesKept, pvar[3, samplesKept], typ = "l", col = "green")

hist(pvar[, samplesKept], breaks = 100, freq = FALSE, main = expression(varsigma^2), xlab = "Value of MCMC samples", col = "gray")
lines(density(pvar[, samplesKept]), col = "red", lwd = 3)
abline(v = varsigma.sq, lty = "dashed", col = "blue", lwd = 4)


  4. For both \(\theta\) and \(\varsigma^{2}\), calculate the mean of all the chains combined and its standard deviation. Interpret these quantities.


mean(pmean[, samplesKept])
## [1] 99.31091
sd(pmean[, samplesKept]) 
## [1] 0.4787576
mean(pvar[, samplesKept])
## [1] 22.58976
sd(pvar[, samplesKept]) 
## [1] 3.259028

We learned in the lecture on basic probability that the first moment of a distribution can be approximated by computing the mean of many random draws from the distribution, and the second central moment can be approximated by computing the variance of many random draws. So the mean of the retained samples approximates the posterior mean of each parameter, and their standard deviation approximates the posterior standard deviation, a measure of our uncertainty about the parameter.
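
The same logic extends to other posterior summaries. For example, equal-tailed 95% credible intervals can be approximated from the quantiles of the retained draws (a small addition beyond what the exercise asks for):

quantile(pmean[, samplesKept], probs = c(.025, .975))
quantile(pvar[, samplesKept], probs = c(.025, .975))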


  5. Compare the standard deviation of the posterior distribution of \(\theta\) with an approximation using the standard deviation of the data divided by the square root of the sample size. What is this approximation called in the frequentist world?


sd(pmean[, samplesKept]) 
## [1] 0.4787576
sd(y)/sqrt(length(y))
## [1] 0.4706179

The standard deviation of the posterior distribution of \(\theta\) closely matches the approximation known as the standard error in frequentist lingo.


  6. Vary the number of values in the simulated data set, e.g., n = 10, 100, 1,000, 10,000. We do not exactly recover the generating values of \(\theta\) and \(\varsigma^{2}\) when \(n\) is small. Why? The mean of the marginal posterior distribution of the variance is further from its generating value than the mean of \(\theta\) is from its generating value. Why? Try different values for set.seed() with n = 100 and interpret the effect of changing the random number sequence.


The means of the posterior distributions of \(\theta\) and \(\varsigma^2\) will not match the generating values when the sample size is small because the simulated data represent a single realization of a stochastic process. We can see different realizations by changing the value passed to set.seed(). This stochasticity can produce means and variances in small data sets that are quite different from the generating values. The posterior mean of the variance tends to be further from its generating value than the posterior mean of \(\theta\) is because the variance is more sensitive to extreme values, which are rarely included in small simulated data sets.
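
A minimal sketch of one way to explore this is shown below. The helper gibbs_check is written here for illustration and is not part of the original exercise; it wraps a single-chain version of the sampler above and reports posterior means for several sample sizes:

gibbs_check <- function(n, theta = 100, varsigma.sq = 25, n.iter = 5000, burnin = 1000) {
  # simulate one realization of n observations from the generating model
  y.sim <- rnorm(n, theta, sqrt(varsigma.sq))
  th <- v <- numeric(n.iter)
  th[1] <- 50
  v[1] <- 10
  for (t in 2:n.iter) {
    th[t] <- draw_mean(mu_0 = 0, sigma.sq_0 = 10000, varsigma.sq = v[t - 1], y = y.sim)$z
    v[t] <- draw_var(alpha_0 = .001, beta_0 = .001, theta = th[t], y = y.sim)$z
  }
  c(n = n, post.mean.theta = mean(th[-(1:burnin)]), post.mean.varsigma.sq = mean(v[-(1:burnin)]))
}

t(sapply(c(10, 100, 1000, 10000), gibbs_check))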


  7. Make the burnin = 1 instead of 1000. Does this change your results? Why or why not?


Gibbs updates are extremely efficient. Convergence is virtually immediate for this simple, two-parameter model, so reducing the burn-in to 1 changes the results very little. This will not be true for accept-reject updates or for Gibbs updates with many parameters.
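
You can verify this directly by comparing summaries computed with and without discarding the first 1000 samples; for this model they should be nearly identical:

mean(pmean[, 2:n.iter])
mean(pmean[, samplesKept])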


  8. Reverse the order of the conjugate functions in step 3 of the Writing a Sampler section so that the variance is drawn first, followed by the mean. (Be careful: this involves a bit more than simply reversing the order of the functions in the loop.) Does this reordering have an effect on the posteriors? Why or why not?


for(t in 2:n.iter) {
  for (j in 1:n.chains) {
    pvar[j, t] = draw_var(alpha_0 = .001, beta_0 = .001, theta = pmean[j, t - 1], y = y)$z
    pmean[j, t] = draw_mean(mu_0 = 0, sigma.sq_0 = 10000, varsigma.sq = pvar[j, t], y = y)$z
  }
}

The MCMC algorithm is not sensitive to the order in which samples are drawn from the full-conditional distributions.


  9. ADVANCED: A Gibbs sampler for simple regression. This problem is modified from Hao Zhang. A model for a simple linear regression with support including all real numbers is:

\[\begin{align} g(a,b,x_i) &= a+bx_i \\ [a,b,\tau\mid\mathbf{y}]&\propto\prod_{i=1}^n\text{normal}(y_i \mid g(a,b,x_i),\tau)\\ &\times \text{normal}(a\mid 0,\tau_a)\,\text{normal}(b \mid 0,\tau_b)\\ &\times\text{gamma}(\tau\mid \alpha, \beta) \end{align}\]

The parameters of the priors \((\alpha, \beta, \tau_a, \tau_b)\), of course, must be given numeric values. Note that the quantities \(\tau, \tau_a, \tau_b\) are precisions, the inverses of the variances. It can be shown that the full conditionals for the parameters in this model simplify to conjugate forms, allowing sampling from

\[\begin{align} [a|.] &\sim \text{normal}\left(\frac{\tau}{n\tau+\tau_a}\sum_{i=1}^n(y_i-bx_i),\frac{1}{n\tau+\tau_a} \right)\\ [b\mid .] &\sim \text{normal} \left(\frac{\tau\sum_{i=1}^n(y_i-a)x_i}{\tau\sum_{i=1}^nx_i^2+\tau_b},\frac{1}{\tau\sum_{i=1}^nx_i^2 + \tau_b} \right)\\ [\tau \mid .] &\sim \text{gamma}\left(\alpha+n/2, \beta + \frac{1}{2}\sum_{i=1}^n(y_i-a-bx_i)^2 \right) \end{align}\]

We first make some fake data on the relationship between plant growth and tannin content. We do a simple lm in R to find maximum likelihood estimates of the parameters.

data2 = data.frame(growth = c(12, 10, 8, 11, 6, 7), tannin = 0:5)
plot(data2$tannin, data2$growth, xlab = "Tannins", ylab="Plant growth")

summary(lm(data2$growth ~ data2$tannin))
## 
## Call:
## lm(formula = data2$growth ~ data2$tannin)
## 
## Residuals:
##       1       2       3       4       5       6 
##  0.5714 -0.4571 -1.4857  2.4857 -1.5429  0.4286 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   11.4286     1.2264   9.319 0.000738 ***
## data2$tannin  -0.9714     0.4051  -2.398 0.074504 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.695 on 4 degrees of freedom
## Multiple R-squared:  0.5898, Adjusted R-squared:  0.4872 
## F-statistic: 5.751 on 1 and 4 DF,  p-value: 0.0745

Use these data to construct a Gibbs sampler with a single chain for each parameter to approximate their marginal posterior distributions. Assume numeric values for the prior parameters \(\alpha=\beta=\tau_a=\tau_b=.001\). Compare your results with the parameter estimates obtained by lm.


lm.bayes <- function(y, x, tau.a, tau.b, alpha = 0.001, beta = 0.001, niter = 5000) {
  n <- length(y)
  # initial values for the chain
  a <- mean(y)
  b <- 0
  tau <- 1
  result <- matrix(nrow = niter, ncol = 3)

  for (i in 1:niter) {
    # draw each parameter from its full-conditional distribution
    a <- rnorm(1, mean = (tau / (n * tau + tau.a)) * sum(y - b * x), sd = 1 / sqrt(n * tau + tau.a))
    b <- rnorm(1, mean = (tau * sum((y - a) * x)) / (tau * sum(x^2) + tau.b),
      sd = 1 / sqrt(tau * sum(x^2) + tau.b))
    tau <- rgamma(1, shape = alpha + n / 2, rate = beta + 0.5 * sum((y - a - b * x)^2))
    result[i, ] <- c(a, b, tau)
  } # end of for

  result
} # end of function

n.iter = 10000
burn.in = 2000
keep = n.iter - burn.in
growth.lm = lm.bayes(y = data2[,1], x = data2[,2], tau.a = 0.001, tau.b = 0.001, niter = n.iter)

# drop the burn-in samples
growth.lm = growth.lm[-(1:burn.in), ]

# posterior means
param_means = colSums(growth.lm) / (n.iter-burn.in)
param_means
## [1] 11.3743799 -0.9515276  0.3463069
#transform precision to standard deviation
sqrt(1 / param_means[3])
## [1] 1.699298
summary(lm(data2$growth ~ data2$tannin))
## 
## Call:
## lm(formula = data2$growth ~ data2$tannin)
## 
## Residuals:
##       1       2       3       4       5       6 
##  0.5714 -0.4571 -1.4857  2.4857 -1.5429  0.4286 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   11.4286     1.2264   9.319 0.000738 ***
## data2$tannin  -0.9714     0.4051  -2.398 0.074504 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.695 on 4 degrees of freedom
## Multiple R-squared:  0.5898, Adjusted R-squared:  0.4872 
## F-statistic: 5.751 on 1 and 4 DF,  p-value: 0.0745
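
For the intercept and slope, the posterior standard deviations from the Gibbs sampler can also be set beside the lm standard errors above (a small addition to the original code); with vague priors they should be of similar magnitude, though not identical for such a small sample:

# posterior standard deviations of a and b after burn-in
apply(growth.lm[, 1:2], 2, sd)
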
# plot mean of chain at each iteration to assess convergence
plot(cumsum(growth.lm[,1]) / (1:keep), type = "l", main = "a", ylab = "", xlab = "")

plot(cumsum(growth.lm[,2]) / (1:keep), type = "l", main = "b", ylab = "", xlab = "")

plot(cumsum(growth.lm[,3]) / (1:keep), type = "l", main = "tau", ylab = "", xlab = "")