The ability of Bayesian methods to handle hierarchical models in an unusually tidy way is why they are becoming the first choice for complex problems in social and environmental science, problems with many unknowns, different type of data, and multiple sources of uncertainty. Recall that the posterior distribution of all of the unobserved quantities is proportionate to the joint distributions of the unobserved quantities and the data:
\[\big[\pmb{\theta}\mid\mathbf{y}\big]\propto\underbrace{\big[\pmb{\theta},\mathbf{y}\big]}_{{\text{Factor into sensible parts}}}\]
It follows that the starting point for developing hierarchical models is to write a properly factored expression for the proportionality between the posterior and joint distribution of the observed and unobserved quantities. Properly means that the expression for the factored joint distribution obeys the chain rule of probability after assumptions about independence have been made. Bayesian networks, also called directed acyclic graphs (or, unattractively in my view, DAGs), offer a way to visually assure that your model does so. This will be true if there is one unknown and one data set or one hundred unknowns and ten data sets. This factored expression is all that is required to specify a “roll-your-own” MCMC algorithm or to write code in one of the current software packages that sample from the marginal posterior distributions, JAGS, STAN, OpenBUGS etc. The expression for posterior and joint is where you start discussions with statistical colleagues. It must be included in all papers and proposals using Bayesian methods because it communicates virtually everything about where your inferences come from.
Learning to write proper mathematical and statistical expressions for Bayesian models is 70 percent of the battle of learning how to do Bayesian analysis. We will return to this battle time and time again during this course. In this exercise, we begin to learn the vital skill of model building. The problems increase in difficulty as we proceed, so it will be important to understand what you did right and wrong before you proceed to the next problem. In addition to practice drawing Bayesian networks and writing posterior and joint distributions, the problems will challenge you to:
For each problem below, draw the Bayesian network, write the posterior and joint distributions using generic bracket notation with appropriate products. Next, choose specific distributions following the general flow that we illustrated in the Light Writing Hierarchical Models lecture. At this point, don’t worry too much about the specific forms for prior distributions. We will learn more about composing these as the course proceeds. You may use uniform distributions with bounds that are vague for non-negative parameters. Use normal distributions centered on zero with large variances for real-valued parameters. Again, don’t sweat this too much.
Work in groups to allow discussion and to teach each other. Prepare a write up, one per group. You may use pencil and paper for drawing DAGs and writing models. Please practice proper notation. Indicate vectors and matrices by underling their symbols.
We urge you to do a problem as completely as you possibly can, then consult the answer before going to the next problem.
Good for you if you think you found a mistake in the answer! There is no better way to show that you are learning than to find mistakes.
Accumulate questions.
You seek to understand how radon levels influence risk of cancer. You have data on the incidence of lung cancer in households (1 if cancer is present and 0 if no cancer) and radon levels (a continuous, non-negative number) for 1000 houses in each of 40 counties within a state. You also have data on the clay soil content at the county level. You heroically assume both clay content and radon levels are known without error. How would you model the effect of radon and soil type on the probability of lung cancer? Some hints:
What deterministic model would you use to predict the probability of cancer in a household as a function of radon level?
What likelihood would you use for these 0 or 1 data?
Assume that the intercept in your deterministic model of the effect of radon level on probability of cancer in a household is a linear function of county level clay soil content.
\[\begin{align*} \big[\pmb{\gamma},\pmb{\beta},\sigma\mid\pmb{y}\big]\varpropto&\biggm(\prod_{i=1}^{1000}\prod_{j=1}^{40}\big[y_{ij}\mid g\big(\pmb{\beta},x_{ij}\big)\big]\biggm)\prod_{j=1}^{40}\big[\beta_{0,j}\mid h\big(\pmb{\gamma},w_{j}\big),\sigma^{2}\big]\big[\pmb{\gamma}\big]\big[\beta_{1}\big]\big[\sigma^2\big]\\ g\big(\pmb{\beta},x_{ij}\big)=&\,\frac{e^{\beta_{0,j}+\beta_{1}x_{ij}}}{1+e^{\beta_{0,j}+\beta_{1}x_{ij}}}\\ h\big(\pmb{\gamma},w_{j}\big)=&\,\gamma_{0}+\gamma_{1}w_{j}\\ y_{ij}\sim&\,\textrm{Bernoulli}\big(g\big(\pmb{\beta},x_{ij}\big)\big)\\ \prod_{j=1}^{40}\beta_{0,j}\sim&\,\textrm{normal}\big(h\big(\pmb{\gamma},w_{j}\big),\sigma^{2}\big)\\ \beta_{1}\sim&\,\textrm{normal}\big(0,1000\big)\\ \gamma_{0}\sim&\,\textrm{normal}\big(0,1000\big)\\ \gamma_{1}\sim&\,\textrm{normal}\big(0,1000\big)\\ \sigma^2\sim&\,\textrm{uniform}\big(0,1000\big)\\ \end{align*}\]
This problem is taken with minor modification from work Tom did for the National Park Service Inventory and Monitoring Program. The data consist of point intercepts taken at 0.1 m intervals along 30 meter transects. Each transect has 300 points where observers classify a point defined by a plumb line dropped vertically from the transect as intersecting with exotic vegetation, native vegetation, or bare ground. There are 100 transects. You seek to estimate the proportion of total vegetative cover that is non-native using the proportion of a transect that is bare ground as a predictor variable. Non-native cover as a proportion of total vegetative cover, not the proportion of the transect, is a key idea. Develop a Bayesian model for this problem that properly models uncertainty in the predictor variable (proportion bare ground) and the response variable (proportion of vegetative cover that is exotic). Hints - Think about two data models, one for the response and one for the predictor variable. Model the \(x\)’s using the same approach you use for modeling the \(y\)’s and link the data models via a latent, unobserved quantity. The sample unit is a transect. Start by thinking about what is observed, what is unobserved, and what is known for each transect. A key distinction here is that we want to know the proportion of the vegetative cover that is exotic, not the proportion of the transect (with inference to the landscape) that is exotic.
Define:
\(y_{i}=\) observed number of hits of exotics on transect \(i\)
\(v_{i}=\) number of hits of exotic and native vegetation on the \(i_{th}\) transect.
\(x_{i}=\) the true, unobserved proportion of bare ground on transect i
\(w_{i}=\) observed number of hits with bare soil
\(n_{i}=\) total number of intercepts on a transect, known (300)
\(N\) = total number of transects (100)
\[ \begin{align*} [\pmb{\beta}, \pmb{x} \mid \pmb{w}, \pmb{y}] \propto & \prod_{i=1}^{N}[y_{i} \mid v_{i}, g(\beta_{0}, \beta_{1}, x_{i})][w_{i} \mid n_{i}, x_{i}][x_{i}] \\ &\times [\beta_{0}][\beta_{1}],\\ g(\beta_{0}, \beta_{1}, x_{i})=&\frac{e^{(\beta_{0} + \beta_{1} x_{i})}}{e^{(\beta_{0} + \beta_{1} x_{i})}+1}\\ y_{i}\sim&\,\text{binomial}(v_{i}, g(\beta_{0}, \beta_{1}, x_{i}))\\ w_{i}\sim&\,\text{binomial}(n_{i}, x_{i})\\ x_{i}\sim&\,\text{uniform}(0,1)\\ \beta_{0}\sim&\,\text{normal}(0, 2.7)\\ \beta_{1}\sim&\,\text{normal}(0, 2.7)\\ \end{align*} \]
Define:
\(y_{i}=\) observed counts of willow seedlings in the \(i_{th}\) plot, assumed to be measured perfectly
\(w_{ij}=\) the \(j_{th}\) measurement of soil water in the \(i_{th}\) plot
\(x_{i}=\) observed proportion of herbaceous cover in the \(i_{th}\) plot, assumed to be measured perfectly
\(\mu_{i}=\) true measure of soil water content in the \(i_{th}\) plot
\[\begin{align*} [\pmb{\beta}, \pmb{\mu}, \varsigma^{2} \mid \pmb{w}, \pmb{y}] & \propto \prod_{i=1}^{100}\text{Poisson}( y_{i}\mid e^{\beta_{0} + \beta_{1}\mu_{i} + \beta_{2}x_{i}})\prod_{i=1}^{100}\prod_{j=1}^{5} \text{gamma}\bigg(w_{ij}\mid\frac{\mu_{i}^{2}}{\varsigma^{2}},\frac{\mu_{i}}{\varsigma^{2}}\bigg)\\ &\times \text{normal}(\beta_{0} \mid 0, 1E4)\,\text{normal}(\beta_{1} \mid 0, 1E4)\,\text{normal}(\beta_{2} \mid 0, 1E4)\\ &\times \prod_{i=1}^{100}\text{uniform}(\mu_{i} \mid 0, 2)\,\text{uniform}(\varsigma^2 \mid 0, 100) \end{align*}\]
A colleague objects to your assumption of cover observed perfectly by eye, insisting, reasonably we think, that you develop a data model relating your ocular estimate to the true cover in a plot. So, you obtained visual estimates of cover paired with the actual proportion of vegetated area (measured using small sub-plots) on 15 10 x 10 m plots. After days of sweaty labor, you regressed visual estimates \((x_{i})\) on the true cover \((z_{i})\) and developed a calibration equation \(h(\pmb{\eta},z_{i})\):
\[\begin{align*} h(\eta_{0}, \eta_{1}, z_{i}) = &\, \frac{e^{\eta_{0}+\eta_{1}z_{i}}}{1+e^{\eta_{0}+\eta_{1}z_{i}}}\\ x_{i} \sim &\, \text{beta}\big(m(h(\eta_{0}, \eta_{1}, z_{i}), \sigma^{2})\big)\\ \eta_{0} \sim &\, \text{normal}(0.05, 0.006)\\ \eta_{1} \sim &\, \text{normal}(1.07, 0.13)\\ \sigma^{2} \sim &\, \text{inverse gamma}(10.2, 630) \end{align*}\]
Define:
\(y_{i}=\) observed counts of willow seedlings in the \(i_{th}\) plot, assumed to be measured perfectly
\(w_{ij}=\) the \(j_{th}\) measurement of soil water in the \(i_{th}\) plot
\(x_{i}=\) observed proportion of herbaceous cover in the \(i_{th}\) plot
\(u_{i}=\) true measure of soil water content in the \(i_{th}\) plot
\(z_{i}=\) true proportion of herbaceous cover in the \(i_{th}\) plot
\[\begin{align*} [\pmb{\beta}, \pmb{\eta}, \pmb{\mu}, \mathbf{z}, \varsigma^{2} \sigma^{2}, \mid \pmb{w}, \pmb{y}, \pmb{x}] & \propto \prod_{i=1}^{100}\text{Poisson}(y_{i}\mid e^{\beta_{0} + \beta_{1}\mu_{i} + \beta_{2}z_{i}})\text{beta}\bigg(x_{i} \mid m\bigg(\frac{e^{\eta_{0} + \eta_{1}z_{i}}}{e^{\eta_{0} + \eta_{1}z_{i}}+1}, \sigma^{2}\bigg)\bigg)\\ &\times\prod_{i=1}^{100}\prod_{j=1}^{5}\text{gamma}\bigg(w_{ij}\mid\frac{\mu_{i}^{2}}{\varsigma^{2}},\frac{\mu_{i}}{\varsigma^{2}}\bigg) \\ &\times \text{normal}(\eta_{0}\mid 0.05, 0.006)\,\text{normal}(\eta_{1} \mid 1.07, 0.13)\,\text{inverse gamma}(\sigma^{2} \mid 10.2, 630)\\ &\times \text{normal}(\beta_{0} \mid 0, 1E4)\,\text{normal}(\beta_{1} \mid 0, 1E4)\,\text{normal}(\beta_{2} \mid 0, 1E4)\\ &\times \prod_{i=1}^{100}\text{uniform}(\mu_{i} \mid 0, 2)\,\text{uniform}(\varsigma \mid 0, 100)\,\text{uniform}(z_{i}\mid 0, 1)\end{align*}\]
Define:
\(y_{ik}=\) observed counts of willow seedlings in the \(i_{th}\) plot in the \(k_{th}\) stream reach, assumed to be measured perfectly
\(w_{ijk}=\) the \(j_{th}\) measurement of soil water in the \(i_{th}\) plot in the \(k_{th}\) stream reach
\(x_{ik}=\) observed proportion of herbaceous cover in the \(i_{th}\) plot in the \(k_{th}\) stream reach
\(z_{ik}=\) true proportion of herbaceous cover in the \(i_{th}\) plot in the \(k_{th}\) stream reach
\[\begin{align*} [\pmb{\beta}, \pmb{\eta}, \pmb{\mu}, \mu_{\alpha}, \sigma^2_{\alpha}, \varsigma^{2} \sigma^{2} \mid \pmb{w}, \pmb{y}, \pmb{x}] & \propto \prod_{i=1}^{20}\prod_{k=1}^{5}\text{Poisson}(y_{ik}\mid e^{\beta_{0,k} + \beta_{1}\mu_{ik} + \beta_{2}z_{ik}})\times\\ &\prod_{i=1}^{20}\prod_{k=1}^{5}\prod_{j=1}^{5}\text{gamma}\bigg(w_{ijk}\mid\frac{\mu_{ik}^{2}}{\varsigma^{2}},\frac{\mu_{ik}}{\varsigma^{2}}\bigg)\times\\ &\prod_{i=1}^{20}\prod_{k=1}^{5} \text{beta}\bigg(x_{ik} \mid m\bigg(\frac{e^{\eta_{0} + \eta_{1}z_{ik}}}{e^{\eta_{0} + \eta_{1}z_{ik}}+1}, \sigma^{2}\bigg)\bigg)\times\\ &\text{normal}(\beta_{0,k} \mid \mu_{\alpha}, \varsigma_{\alpha}^{2})\,\text{normal}(\mu_{\alpha}\mid 0, 1E4)\,\text{uniform}(\varsigma_{\alpha}\mid 0, 100)\times \\ &\text{normal}(\eta_{0}\mid 0.05, 0.006)\, \text{normal}(\eta_{1} \mid 1.07, 0.13)\,\text{inverse gamma}(\sigma^{2} \mid 10.2, 630)\times\\ &\text{normal}(\beta_{1} \mid 0, 1E4)\,\text{normal}(\beta_{2} \mid 0, 1E4)\,\text{uniform}(\mu_{i} \mid 0, 2)\,\text{uniform}(\varsigma^{2} \mid 0, 100)\times\\ &\text{uniform}(z_{i}\mid 0, 1) \end{align*}\]