This problem challenges you to understand how the Markov chain Monte Carlo algorithm takes a multivariate joint distribution and breaks it into a series of univariate, marginal distributions that can be approximated one at a time. There are only two unknowns in this problem, but the same principles and approach would apply if there were two hundred. The accompanying document, MCMCMath1.pdf, describes the math that stands behind the coding you will do here. You should study it thoroughly before proceeding.
You will write code using conjugate relationships, also known as Gibbs updates, to draw samples from marginal posterior distributions of a mean and variance.
1. Set the seed for random numbers to 10 in R with set.seed(10).
2. Load the actuar library, which contains functions for the inverse gamma distribution needed in step 5.
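In code, the first two steps are:
set.seed(10)     # step 1: make the random draws reproducible
library(actuar)  # step 2: provides rinvgamma() for inverse gamma draws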
3. Simulate 100 data points from a normal distribution with mean \(\theta = 100\) and variance \(\varsigma^{2} = 25\). Call the data set y. Be careful here: R parameterizes the normal distribution with the standard deviation, not the variance. You will use these "fake" data to verify the Gibbs sampler you will write below. Simulating data is always a good way to test methods; your method should be able to recover the generating parameters given a sufficiently large number of simulated observations.
varsigma.sq = 25
theta = 100
n = 100
y = rnorm(n, theta, sqrt(varsigma.sq))
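A quick sanity check on the simulated data (the exact values depend on the seed):
mean(y)  # should be near theta = 100
var(y)   # should be near varsigma.sq = 25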
4. We have saved you some time by writing a function called draw_mean that makes draws from the marginal posterior distribution for \(\theta\) using a normal-normal conjugate relationship where the variance is assumed to be known. It is vital that you study the MCMCMath1.pdf notes relative to this function.
# normal likelihood with normal prior conjugate for mean, assuming variance is known
# mu_0 is prior mean
# sigma.sq_0 is prior variance of mean
# varsigma.sq is known variance of data
draw_mean = function(mu_0, sigma.sq_0, varsigma.sq, y){
  mu_1 = (mu_0 / sigma.sq_0 + sum(y) / varsigma.sq) / (1 / sigma.sq_0 + length(y) / varsigma.sq)
  sigma.sq_1 = 1 / (1 / sigma.sq_0 + length(y) / varsigma.sq)
  z = rnorm(1, mu_1, sqrt(sigma.sq_1))
  param = list(z = z, mu_1 = mu_1, sigma.sq_1 = sigma.sq_1)
  return(param)
}
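For example, a single draw with a vague prior, using the simulated y and the known generating variance:
draw_mean(mu_0 = 0, sigma.sq_0 = 10000, varsigma.sq = 25, y = y)$z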
5. We have also written a function called draw_var that makes draws from the marginal posterior distribution for \(\varsigma^2\) using an inverse gamma-normal conjugate relationship where the mean is assumed to be known. Study this function relative to the MCMCMath1.pdf handout.
# normal likelihood with inverse-gamma prior conjugate relationship for variance, assuming mean is known
# alpha_0 is parameter of prior for variance
# beta_0 is parameter of prior for variance
# Note that this uses scale parameterization for inverse gamma
draw_var = function(alpha_0, beta_0, theta, y){
  alpha_1 = alpha_0 + length(y) / 2
  beta_1 = beta_0 + sum((y - theta)^2) / 2
  z = rinvgamma(1, alpha_1, scale = beta_1)
  param = list(z = z, alpha_1 = alpha_1, beta_1 = beta_1)
  return(param)
}
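Similarly, a single draw for the variance with a vague prior and the known generating mean:
draw_var(alpha_0 = .001, beta_0 = .001, theta = 100, y = y)$z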
6. Check the functions by simulating a large number of data points from a normal distribution with known mean and variance and verifying that repeated draws from the marginal posteriors center on the generating values. Call the simulated data set y_check. Assume flat priors for the mean and the variance. A vague prior for the inverse gamma has parameters \(\alpha_0=.001\) and \(\beta_0=.001\).
check_mean = 32
check_sigma = 3.2
check_var = check_sigma^2
n.draw = 10000
y_check=rnorm(n.draw, mean = check_mean, sd = check_sigma)
var_vec = numeric(n.draw)
mean_vec = numeric(n.draw)
for(i in 1:n.draw) {
  var_vec[i] = draw_var(alpha_0 = .001, beta_0 = .001, theta = check_mean, y = y_check)$z
  mean_vec[i] = draw_mean(mu_0 = 0, sigma.sq_0 = 10000, varsigma.sq = check_sigma^2, y = y_check)$z
}
mean(var_vec)
## [1] 10.36529
check_var
## [1] 10.24
mean(mean_vec)
## [1] 32.00431
check_mean
## [1] 32
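With a flat prior and known variance, the posterior standard deviation of the mean is approximately \(\sigma/\sqrt{n}\), which gives one more check:
sd(mean_vec)                # Monte Carlo estimate of the posterior sd of the mean
check_sigma / sqrt(n.draw)  # analytic value: 3.2 / sqrt(10000) = 0.032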
Now execute these steps. Set up matrices to store three chains for each parameter and assign initial values in the first column:
n.iter = 10000
n.chains = 3
pvar = matrix(nrow = n.chains, ncol = n.iter)
pmean = matrix(nrow = n.chains, ncol = n.iter)
pmean[1:3, 1] = c(50, 20, 1)
pvar[1:3, 1] = c(10, 5, .1)
Write nested for loops that iterate from the second iteration to the total number of iterations for each of the three chains for each parameter (the first column of each matrix holds the initial values). Use the conjugate functions draw_mean and draw_var: draw a sample from the distribution of the mean using the value of the variance from the previous iteration, then make a draw from the variance using the current value of the mean. Repeat. Assume vague priors for the mean and variance:
\[[\,\theta\,] = \textrm{normal}(\,\theta \mid 0, 10000\,)\]
\[[\,\varsigma^{2}\,] = \textrm{inverse gamma}(\,\varsigma^{2} \mid .001, .001\,)\]
for(t in 2:n.iter) {
  for (j in 1:n.chains) {
    pmean[j, t] = draw_mean(mu_0 = 0, sigma.sq_0 = 10000, varsigma.sq = pvar[j, t - 1], y = y)$z
    pvar[j, t] = draw_var(alpha_0 = .001, beta_0 = .001, theta = pmean[j, t], y = y)$z
  }
}
burnin = 1000
samplesKept <- (burnin+1):n.iter
plot(samplesKept, pmean[1, samplesKept], type = "l", ylab = expression(theta), xlab = "Iteration", col = "yellow")
lines(samplesKept, pmean[2, samplesKept], type = "l", col = "red")
lines(samplesKept, pmean[3, samplesKept], type = "l", col = "green")
hist(pmean[, samplesKept], breaks = 100, freq = FALSE, main = expression(theta), xlim = c(95, 105), xlab = "Value of MCMC samples", col = "gray")
lines(density(pmean[, samplesKept]), col = "red", lwd = 3)
abline(v = theta, lty = "dashed", col = "blue", lwd = 4)
plot(samplesKept, pvar[1, samplesKept], type = "l", ylab = expression(varsigma^2), xlab = "Iteration", col = "yellow")
lines(samplesKept, pvar[2, samplesKept], type = "l", col = "red")
lines(samplesKept, pvar[3, samplesKept], type = "l", col = "green")
hist(pvar[, samplesKept], breaks = 100, freq = FALSE, main = expression(varsigma^2), xlab = "Value of MCMC samples", col = "gray")
lines(density(pvar[, samplesKept]), col = "red", lwd = 3)
abline(v = varsigma.sq, lty = "dashed", col = "blue", lwd = 4)
mean(pmean[, samplesKept])
## [1] 99.31091
sd(pmean[, samplesKept])
## [1] 0.4787576
mean(pvar[, samplesKept])
## [1] 22.58976
sd(pvar[, samplesKept])
## [1] 3.259028
We learned in the lecture on basic probability that the first moment of a distribution can be approximated by computing the mean of many random draws from the distribution, and that the second central moment can be approximated by computing the variance of those draws.
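For example, a minimal illustration with a known normal distribution:
draws = rnorm(100000, mean = 3, sd = 2)
mean(draws)  # approximates the first moment, 3
var(draws)   # approximates the second central moment, 2^2 = 4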
sd(pmean[, samplesKept])
## [1] 0.4787576
sd(y)/sqrt(length(y))
## [1] 0.4706179
The standard deviation of the posterior distribution of \(\theta\) closely matches the approximation known as the standard error in frequentist lingo.
The means of the posterior distributions of \(\theta\) and \(\varsigma^2\) will not exactly match the generating values when the sample size is small because simulated data represent a single realization of a stochastic process. We can see different realizations by changing the argument to set.seed. This stochasticity can produce means and variances in small data sets that are quite different from the generating values. The variance falls further from its generating value than the mean does because the variance is more sensitive to extreme values, which are rarely included in small simulated data sets.
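For instance, any seed other than 10 produces a different realization (re-run the simulation above with set.seed(10) to reproduce the results shown here):
set.seed(11)
y2 = rnorm(n, theta, sqrt(varsigma.sq))
mean(y2)  # will differ from mean(y) and from theta = 100
var(y2)   # will differ from var(y) and from varsigma.sq = 25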
Gibbs updates are extremely efficient. Convergence is virtually immediate for this simple, two-parameter model. This will not be true for accept-reject updates or for Gibbs updates with many parameters.
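We can see this by plotting only the first 100 iterations of the stored chains for \(\theta\) (a quick sketch):
# the chains move from their dispersed starting values to the posterior almost immediately
plot(1:100, pmean[1, 1:100], type = "l", ylab = expression(theta), xlab = "Iteration", col = "yellow")
lines(1:100, pmean[2, 1:100], col = "red")
lines(1:100, pmean[3, 1:100], col = "green")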
For example, we can reverse the order of the updates, drawing the variance first and then the mean at each iteration:
for(t in 2:n.iter) {
  for (j in 1:n.chains) {
    pvar[j, t] = draw_var(alpha_0 = .001, beta_0 = .001, theta = pmean[j, t - 1], y = y)$z
    pmean[j, t] = draw_mean(mu_0 = 0, sigma.sq_0 = 10000, varsigma.sq = pvar[j, t], y = y)$z
  }
}
The MCMC algorithm is not sensitive to the order in which samples are drawn from the full-conditional distributions.
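Summarizing the chains from this reversed-order run (after discarding the same burn-in) should give essentially the same posterior means as before:
mean(pmean[, samplesKept])
mean(pvar[, samplesKept])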
The same machinery extends to a simple linear regression model with intercept \(a\), slope \(b\), and precision \(\tau\):
\[\begin{align} g(a,b,x_i) &= a+bx_i \\ [\,a,b,\tau \mid \mathbf{y}\,] &\propto \prod_{i=1}^n \text{normal}(y_i \mid g(a,b,x_i),\tau)\\ &\times \text{normal}(a \mid 0,\tau_a)\,\text{normal}(b \mid 0,\tau_b)\\ &\times \text{gamma}(\tau \mid \alpha, \beta) \end{align}\]
The parameters of the priors \((\alpha, \beta, \tau_a, \tau_b)\), of course, must be given numeric values. Note that the quantities \(\tau, \tau_a, \tau_b\) are precisions, the inverses of the variances. It can be shown that the full-conditional distributions for the parameters in this model simplify to conjugates, allowing sampling from
\[\begin{align} [a|.] &\sim \text{normal}\left(\frac{\tau}{n\tau+\tau_a}\sum_{i=1}^n(y_i-bx_i),\frac{1}{n\tau+\tau_a} \right)\\ [b\mid .] &\sim \text{normal} \left(\frac{\tau\sum_{i=1}^n(y_i-a)x_i}{\tau\sum_{i=1}^nx_i^2+\tau_b},\frac{1}{\tau\sum_{i=1}^nx_i^2 + \tau_b} \right)\\ [\tau \mid .] &\sim \text{gamma}\left(\alpha+n/2, \beta + \frac{1}{2}\sum_{i=1}^n(y_i-a-bx_i)^2 \right) \end{align}\]
We first make some fake data on the relationship between plant growth and tannin content. We do a simple lm in R to find maximum likelihood estimates of the parameters.
data2 = data.frame(growth = c(12, 10, 8, 11, 6, 7), tannin = 0:5)
plot(data2$tannin, data2$growth, xlab = "Tannins", ylab="Plant growth")
summary(lm(data2$growth ~ data2$tannin))
##
## Call:
## lm(formula = data2$growth ~ data2$tannin)
##
## Residuals:
## 1 2 3 4 5 6
## 0.5714 -0.4571 -1.4857 2.4857 -1.5429 0.4286
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.4286 1.2264 9.319 0.000738 ***
## data2$tannin -0.9714 0.4051 -2.398 0.074504 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.695 on 4 degrees of freedom
## Multiple R-squared: 0.5898, Adjusted R-squared: 0.4872
## F-statistic: 5.751 on 1 and 4 DF, p-value: 0.0745
Use these data to construct a Gibbs sampler with a single chain for each parameter to approximate their marginal posterior distributions. Assume numeric values for the prior parameters \(\alpha=\beta=\tau_a=\tau_b=.001\). Compare your results with the parameter estimates obtained by lm.
# Gibbs sampler for the linear regression, using the conjugate full-conditionals above
lm.bayes <- function(y, x, tau.a, tau.b, alpha = 0.001, beta = 0.001, niter = 5000) {
  n <- length(y)
  # initial values
  a <- mean(y)
  b <- 0
  tau <- 1
  result <- matrix(nrow = niter, ncol = 3)
  for (i in 1:niter) {
    # draw each parameter from its full-conditional, given current values of the others
    a <- rnorm(1, mean = (tau / (n * tau + tau.a)) * sum(y - b * x), sd = 1 / sqrt(n * tau + tau.a))
    b <- rnorm(1, mean = (tau * sum((y - a) * x)) / (tau * sum(x^2) + tau.b), sd = 1 / sqrt(tau * sum(x^2) + tau.b))
    tau <- rgamma(1, shape = alpha + n/2, rate = beta + 0.5 * sum((y - a - b * x)^2))
    result[i, ] <- c(a, b, tau)
  } # end of for
  result
} # end of function
n.iter = 10000
burn.in = 2000
keep = n.iter - burn.in
growth.lm = lm.bayes(y = data2[,1], x = data2[,2], tau.a = 0.001, tau.b = 0.001, niter = n.iter)
# drop the burn-in samples
growth.lm = growth.lm[-(1:burn.in), ]
# posterior means
param_means = colSums(growth.lm) / (n.iter-burn.in)
param_means
## [1] 11.3743799 -0.9515276 0.3463069
#transform precision to standard deviation
sqrt(1 / param_means[3])
## [1] 1.699298
summary(lm(data2$growth ~ data2$tannin))
##
## Call:
## lm(formula = data2$growth ~ data2$tannin)
##
## Residuals:
## 1 2 3 4 5 6
## 0.5714 -0.4571 -1.4857 2.4857 -1.5429 0.4286
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.4286 1.2264 9.319 0.000738 ***
## data2$tannin -0.9714 0.4051 -2.398 0.074504 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.695 on 4 degrees of freedom
## Multiple R-squared: 0.5898, Adjusted R-squared: 0.4872
## F-statistic: 5.751 on 1 and 4 DF, p-value: 0.0745
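We can also compare the posterior standard deviations of the intercept and slope with the standard errors reported by lm; they should be of similar magnitude, though they need not match exactly:
# posterior standard deviations of a and b
apply(growth.lm[, 1:2], 2, sd)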
# plot the running mean of each chain at each iteration to assess convergence
plot(cumsum(growth.lm[,1]) / (1:keep), type = "l", main = "a", ylab = "", xlab = "")
plot(cumsum(growth.lm[,2]) / (1:keep), type = "l", main = "b", ylab = "", xlab = "")
plot(cumsum(growth.lm[,3]) / (1:keep), type = "l", main = "tau", ylab = "", xlab = "")