Bayesian analysis is flexible and intuitive because it uses the same approach to all problems. We learn about unobserved quantities from observed ones using the laws of probability as an inferential scaffolding. You derived Bayes’ Theorem in lecture from those basic laws. We learned that all Bayesian models have the same parts: the posterior distribution, the joint distribution composed of the likelihood(s) and priors, and the marginal probability of the data, which serves as a normalizing constant after the data are collected. For most of this course, we will exploit the proportionality between the posterior and the joint to allow us to dispense with the marginal probability of the data using simulation methods. However, it is vital that you understand the relationship among the components of Bayes’ Theorem before we do that. This understanding is the foundation for all of the powerful methods that follow. It is what sets Bayesian analysis apart from maximum likelihood.
This problem is a bit of a toy because you will never estimate posterior distributions this way, at least I never have. But, it is an instructive toy. Reliably using Bayesian methods requires understanding how Bayes’ theorem works. The best way to gain this understanding is to compute a posterior distribution from its component parts: the prior, the likelihood, and the marginal distribution of the data. It helps you understand how to multiply and divide statistical distributions, something you probably don’t do every morning after coffee. You will be thrilled to know that you only need to do this once.
To start, take out your notes on Bayes’ theorem and go through its derivation, particularly the color-coded section on the likelihood, the prior, the joint, and the probability of the data. Keep the equation in front of you as you do this exercise. Think about what each piece means as you write your code.
The key to success is to create code that does the following: 1) creates vectors using the probability density functions in the numerator of Bayes’ theorem (the likelihood and the prior), 2) multiplies these vectors to obtain the joint distribution of the parameters and the data, 3) integrates the joint distribution and 4) divides the joint distribution by its integral. Voila.
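As a roadmap, here is a self-contained toy sketch of those four steps. It uses three made-up observations and an arbitrary gamma prior chosen only for illustration; the exercise below carries out the same steps with the real simulated data and prior, so treat the names and numbers here as placeholders.
step.toy  <- .01                                                  # grid spacing, reused for integration
theta.toy <- seq(0, 15, step.toy)                                 # grid of candidate parameter values
y.toy     <- c(5, 7, 6)                                           # made-up data
like.toy  <- sapply(theta.toy, function(t) prod(dpois(y.toy, t))) # 1) vector from the likelihood
prior.toy <- dgamma(theta.toy, shape = 2, rate = 0.5)             # 1) vector from the prior (arbitrary)
joint.toy <- like.toy * prior.toy                                 # 2) joint distribution of parameters and data
p.y.toy   <- sum(joint.toy * step.toy)                            # 3) integrate the joint (Riemann sum)
post.toy  <- joint.toy / p.y.toy                                  # 4) divide the joint by its integral
sum(post.toy * step.toy)                                          # check: the posterior integrates to 1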
You are interested in estimating the posterior distribution of the mean number of individuals of an invasive plant species per m² in a disturbed grassland. We will call that mean \(\theta\). You have prior information telling you that the average number of these plants per m² is 10.2, with a standard deviation of that mean of 0.5. You have a set of fifty observations in hand obtained by sweaty labor in the field. Execute the following steps.
Simulate 50 data points from a Poisson distribution with mean \(\theta\) = 6.4 to represent the data set. (This portrays the data that you gathered from plots, but it is lots easier to obtain.) What is the variance? Be sure to put the R function set.seed(10) before the call to rpois() to assure that we all get the same results. Call the data vector y.
Plot a histogram of the data with density on the y-axis. It turns out that the histogram function in R is not really appropriate for discrete data (why?). Here is a chance to write a function that plots discrete data properly! Hint: the count() function in the dplyr package, the base R prop.table() function, and the type = "h" argument in the plot() function might prove useful. (You can skip this discrete histogram bit with no loss of value from the rest of the exercise, but if you do, I urge you to revisit it sometime. Moreover, there are R packages that do this for you. Using one of those would be fine if you are not gripped by writing functions.)
Set values for the prior mean (mu.prior) and standard deviation (sigma.prior).
set.seed(10)
y <- rpois(50, lambda = 6.4)
hist(y, freq = FALSE, breaks = 10, main = "Histogram of data", col = "gray")
library(dplyr)
discrete_hist <- function(y) {
  # tabulate the count of each observed value and convert the counts to proportions
  z <- data.frame(y) %>% count(y) %>% mutate(prop = prop.table(n))
  # draw a vertical line at each observed value with height equal to its proportion
  plot(z$y, z$prop, type = "h", ylab = "Probability", xlab = "y", main = "Improved histogram of data",
    frame = TRUE, xaxt = "n", lwd = 3, col = "blue")
  # label the x axis at the integer values actually observed
  x <- seq(min(z$y), max(z$y), 1)
  axis(side = 1, at = x, labels = x)
}
discrete_hist(y)
mu.prior <- 10.2
sigma.prior <- 0.5
Construct a vector of values for \(\theta\) using theta = seq(0, 15, step), where you set step = .01. Setting a value for step with global scope is important. You will use it later when you integrate.

step <- .01
theta <- seq(0, 15, step)

Write a function for the prior on \(\theta\) with arguments for its mean, mu.prior = 10.2, and standard deviation, sigma.prior = .5. You could hard-code these in the function template, but that is bad practice.

prior <- function(theta, mu = mu.prior, sigma = sigma.prior){
  # your code implementing function
}
Moment matching converts the prior mean and standard deviation into the shape and rate parameters of a gamma distribution, shape \(=\frac{\text{mean}^{2}}{\text{sd}^{2}}\) and rate \(=\frac{\text{mean}}{\text{sd}^{2}}\), so the prior is
\([\theta]=\text{gamma}\left(\theta \biggm{|}\frac{10.2^{2}}{.5^{2}},\frac{10.2}{.5^{2}}\right)\)
prior <- function(theta, mu = mu.prior, sigma = sigma.prior) dgamma(theta, mu^2 / sigma^2, mu / sigma^2)
plot(theta, prior(theta), type = "l", ylab = expression(paste("[", theta, "]")), xlab = expression(theta),
main = "Prior", xlim = c(5, 15))
sd(rgamma(100000, mu.prior^2 / sigma.prior^2, mu.prior / sigma.prior^2))
## [1] 0.5000646
mean(rgamma(100000, mu.prior^2 / sigma.prior^2, mu.prior / sigma.prior^2))
## [1] 10.20185
like <- function(theta, y){
  # your code to calculate total likelihood of the data conditional on each value of theta
}
like <- function(theta, y){
  L = rep(0, length(theta))
  # for each candidate value of theta, the total likelihood is the product of the
  # Poisson probabilities of all of the observations
  for(i in 1:length(theta)) L[i] = prod(dpois(y, theta[i], log = FALSE))
  return(L)
}
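As an aside, the same calculation can be written without the explicit loop. The sketch below is an equivalent alternative under the same logic, not part of the required answer; like2 is a name of my choosing.
# For each value of theta, take the product of the Poisson probabilities of all observations.
like2 <- function(theta, y) sapply(theta, function(t) prod(dpois(y, t)))
# all.equal(like(theta, y = y), like2(theta, y = y)) should return TRUE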
The plot is called a likelihood profile. There is nothing we can say about the area under the curve except that it is not equal to one. The inference based on the likelihood profile remains the same regardless of the value of the constant multiplier because all evidence in likelihood is relative.
plot(theta, like(theta, y = y), type = "l", xlim = c(5, 15), main = "Likelihood", xlab = expression(theta),
ylab = expression(paste("[y|", theta, "]")))
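A quick numerical check of that claim (not part of the required answer): multiplying the likelihood by an arbitrary constant, here 1e40 chosen only for illustration, changes the height of the profile but not the value of \(\theta\) that maximizes it or the ratio of evidence between any two values of \(\theta\).
L <- like(theta, y = y)
theta[which.max(L)] == theta[which.max(1e40 * L)]              # same maximizing value of theta
# likelihood ratio at two arbitrary grid points is unchanged by the constant
all.equal((1e40 * L[600]) / (1e40 * L[800]), L[600] / L[800])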
Write a function for the joint distribution, the product of the likelihood and the prior, and plot joint(theta) as a function of theta. Does this seem reasonable? Why are the values on the y axis so small? Think about what is going on here.
\[[\mathbf{y},\theta] =\prod_{i=1}^{50}\text{Poisson}(y_{i}|\theta)\,\text{gamma}\left(\theta \biggm{|}\frac{10.2^{2}}{.5^{2}},\frac{10.2}{.5^{2}}\right)\]
joint = function(theta) like(theta, y = y) * prior(theta)
plot(theta, joint(theta), type = "l", main = "Joint", xlim = c(5, 15), xlab = expression(theta),
ylab = expression(paste("[y|", theta, "] x [", theta, "]")))
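One way to see why the values on the y axis are so small is to work on the log scale: the joint is the product of 50 Poisson probabilities, each less than one, times the prior density, so its logarithm is a large negative number and exponentiating it gives tiny values. Here is a minimal sketch of the usual remedy, not part of the required answer, assuming the same gamma prior parameters used above; the object names are mine.
# Log of the joint: sum of Poisson log probabilities plus the log of the gamma prior.
log.joint <- sapply(theta, function(t) sum(dpois(y, t, log = TRUE))) +
  dgamma(theta, mu.prior^2 / sigma.prior^2, mu.prior / sigma.prior^2, log = TRUE)
# Subtracting the maximum before exponentiating avoids underflow; the shift cancels
# when the result is normalized, so the posterior is unchanged.
post.log <- exp(log.joint - max(log.joint))
post.log <- post.log / sum(post.log * step)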
The marginal probability of the data is the integral of the joint distribution over \(\theta\):
\[[\mathbf{y}]=\int\prod_{i=1}^{50}\text{Poisson}(y_{i}|\theta)\,\text{gamma}\left(\theta \biggm{|}\frac{10.2^{2}}{.5^{2}},\frac{10.2}{.5^{2}}\right)d\theta\]
\([\mathbf{y}]\) is a distribution before the data are collected because \(\mathbf{y}\) is a random variable. The integral evaluates to the area under the joint distribution, a scalar, after the data are collected because the data are fixed.
Py <- sum(like(theta, y = y) * prior(theta) * step)
\[[\theta \mid \mathbf{y}]=\frac{\prod_{i=1}^{50}\text{Poisson}(y_{i}|\theta)\,\text{gamma}\left(\theta \biggm{|}\frac{10.2^{2}}{.5^{2}},\frac{10.2}{.5^{2}}\right)}{\int\prod_{i=1}^{50}\text{Poisson}(y_{i}|\theta)\,\text{gamma}\left(\theta \biggm{|}\frac{10.2^{2}}{.5^{2}},\frac{10.2}{.5^{2}}\right)d\theta}\]
p.theta <- joint(theta) / Py
plot(theta, p.theta, type = "l", xlim = c(5, 15), main = "Posterior", xlab = expression(theta),
ylab = expression(paste("[ ", theta, " | y]")))
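It is worth checking that the result behaves like a proper probability density. This is a sanity check, not part of the required answer, using the same Riemann-sum logic as the integration step; post.mean is a name of my choosing.
sum(p.theta * step)                                        # should be very close to 1
(post.mean <- sum(theta * p.theta * step))                 # approximate posterior mean
sqrt(sum((theta - post.mean)^2 * p.theta * step))          # approximate posterior standard deviation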
par(mfrow = c(2, 3))
plot(theta, prior(theta), type = "l", ylab = expression(paste("[", theta, "]")), xlab = expression(theta),
main = "Prior", xlim = c(5, 15))
hist(y, freq = FALSE, breaks = 10, main = "Histogram of data")
discrete_hist(y = y)
plot(theta, like(theta, y = y), type = "l", main = "Likelihood", xlim = c(5, 15), xlab = expression(theta),
ylab = expression(paste("[y|", theta, "]")))
plot(theta, joint(theta), type = "l", main = "Joint", xlim = c(5, 15), xlab = expression(theta),
ylab = expression(paste("[y | ", theta, "] x [", theta, "]")))
plot(theta, p.theta, type = "l", xlim = c(5, 15), main = "Posterior", xlab = expression(theta),
ylab = expression(paste("[ ", theta, " | y]")))
(c <- max(p.theta) / max(like(theta, y)))
## [1] 1.609436e+47
like.scaled <- c * like(theta, y)
par(mfrow=c(1, 1))
plot(theta, like.scaled, type = "l", col = "red", xlim = c(5, 15), xlab = expression(theta),
main = "Scaled Overlay", ylab = "Probability density")
lines(theta, p.theta, type = "l")
lines(theta, prior(theta), col = "blue")
legend(11, 1, c("Scaled likelihood","Posterior", "Prior"), lty = rep("solid",3), col = c("red", "black", "blue"))
The likelihood profile is narrower than the histogram because the likelihood profile reflects the distribution of the mean of the data, while the histogram reflects the distribution of the data.
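A rough numerical check of that statement: the spread of the data is on the order of its standard deviation, while the spread of the likelihood profile for the mean is on the order of the standard error of the mean, which is smaller by a factor of \(\sqrt{n}\).
sd(y)                    # spread of the data, roughly the width of the histogram
sd(y) / sqrt(length(y))  # standard error of the mean, roughly the width of the likelihood profile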
The gamma prior is conjugate to the Poisson likelihood, so the posterior is also a gamma distribution, with shape equal to the prior shape plus the sum of the observations and rate equal to the prior rate plus the number of observations.
p.conj = dgamma(theta, sum(y) + mu.prior^2 / sigma.prior^2, length(y) + mu.prior / sigma.prior^2)
par(mfrow=c(1, 1))
plot(theta, like.scaled, type = "l", col = "red", xlim = c(5, 15), xlab = expression(theta),
ylab = "Probability density", main = "Scaled Overlay")
lines(theta, prior(theta), col = "blue")
lines(theta, p.conj, type = "l", lwd = 1, col = "orange")
lines(theta, p.theta, col = "black", lwd = 4, type = "l", lty = "dashed")
legend(11, 1.2, legend= c("Scaled likelihood", "Prior","Integrated posterior", "Conjugate posterior"), cex = .8,
lwd = 2, bty = "n", col = c("red", "blue", "black", "orange"), lty = c("solid", "solid", "dashed", "solid"))
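If the grid approximation is working, the integrated posterior and the conjugate posterior should be essentially indistinguishable in the overlay. A quick numerical check of that agreement, not part of the required answer:
# The largest pointwise difference between the two posteriors should be very small,
# reflecting only the discretization of theta.
max(abs(p.theta - p.conj))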