Bayesian Models for Ecologists
Missing data lab
June 11, 2024
R libraries needed for this lab

You need to load the following libraries. Set the seed to 10 to compare your answers to ours.

library(rjags)
library(MCMCvis)
library(HDInterval)
library(BayesNSF)
set.seed(10)



Mevin and Becky would be out of a job if it weren’t for missing data.

Statistics is the science of determining the value of observations as evidence. It is rare that we are able to observe all of the data that are relevant to the problem at hand, which means that we observe a subset of all of the possible observations, i.e., a sample. The data we fail to observe are missing by design. Moreover, there are inevitably observations that we intended to take but failed to obtain. These data are missing by accident. The Bayesian framework treats these two types of missing data in precisely the same way. The exercises here will develop your skills for dealing with intentionally and accidentally missing data.



Data missing by accident

Exercise: Ignorable or not?

Determine whether the missing data are ignorable in the following examples. Classify each one as missing completely at random, missing at random, or missing not at random. Propose a solution that would make the missing data mechanism ignorable. Discuss among your lab mates.

Imagine a series of plots with automated gas-flux analyzers that record observations over time on a 10 ha site. We are interested in carbon dioxide emissions in response to a suite of covariates including plant biomass, soil water, and the other usual suspects.


  • A wildfire burns 2 km north of the site. Some of the plots burn because windblown embers land on them. We lose the data in those plots. Can we ignore the lost data? The only thing that determines whether an analyzer was lost is whether a spark landed on the plot. There is no spatial pattern in the burned plots. These data are missing completely at random and can be viewed as ignorable, demonic intrusions.

  • Grass biomass and soil water influenced whether a plot burned. Low biomass, high moisture plots did not lose analyzers. Ignorable, because the grass biomass and soil moisture covariates account for the probability of missingness as well as the ecological process. Missing at random.

  • Plots on the north side of the site burn. Others don’t. There is a spatial gradient in the lost analyzers. We include UTM coordinates or other spatial covariates in our model to make the missing data mechanism ignorable. Missing at random if we include those covariates.

  • There are two brands of analyzers on the site. Some analyzers fail because their connections to batteries are chewed by field mice (really). Are the missing data ignorable? The missing data may or may not be ignorable. The first thing you should check is whether there is a pattern in the missing data. Perhaps one brand is more subject to chewing than the other (tastier cords). In this case, you could make the missingness ignorable by including an indicator covariate for the brand of analyzer. We would plausibly assume that adding this covariate makes the data missing at random, although we could not prove this is true.

  • Some analyzers have intake vents blocked by grass, causing data to be missed. Are the missing data ignorable? Missing at random. These missing data are ignorable because the model includes a covariate for grass biomass.

  • Lightning strikes a single plot with an analyzer, sizzling the instrument. (This is probably not ignorable, at least in some sense.) These missing data can be reasonably viewed as missing completely at random and are ignorable.

  • We are taking weekly measurements for three months. Lightning strikes the grid of plots, rendering them all ineffective for a week. The missing data mechanism is ignorable. The model contains all the information needed to make out-of-sample predictions for the missing values using the posterior predictive distribution. These data are missing at random because a time covariate accounts for the missingness.


Exercise: Missing counts in the Serengeti

Consider the plot of transects surveyed from the air to count wildebeest in the Serengeti. The transect number indicates the order in which each was flown. Note that three transects were missing at the end of the series of parallel transects. Are these missing data ignorable? If not, what could you do to make them so?


The missing transects are not ignorable. Note that they occur in areas with low wildebeest density, so omitting them will bias the estimate of total population size upwards. We need to add covariates to assure that the missing data mechanism is ignorable. Distance along the diagonal line perpendicular to the transects should be included. It would make sense to add the proportion of grass cover on the transects because grass cover follows a NW to SE gradient across the transects. Similarly, NDVI data revealing greenness could be included. Distance to water at the transect centroid might plausibly be added.


Data missing by design

Exercise: Finite population inference

This exercise is gratefully borrowed from W. A. Link and R. J. Barker. Bayesian inference with ecological applications. Academic Press, 2010.

Williams and colleagues (2002) studied the abundance of muskrat (Ondatra zibethicus) houses in a 100 ha marsh, searching the area in air boats. The study area was subdivided into 50, 2-ha plots. Ten of these were chosen at random and searched exhaustively for houses. Data were

y.obs = c(13, 18, 10, 6, 16, 13, 12, 13, 9, 11)

Write a model for the mean number of muskrat houses per plot incorporating finite population inference. Code the model and approximate the marginal posterior distribution of the mean. Overlay a plot of the marginal posterior for finite population inference on the mean number of houses vs a plot of the marginal posterior for infinite population inference for the mean number of houses.


\[\begin{align*} [\mathbf{y}_{miss}, \lambda \mid \mathbf{y}_{obs}] &\propto \prod_{i=1}^{10}[y_{obs,i}\mid \lambda]\,\prod_{j=11}^{50}[y_{miss,j}\mid\lambda]\,[\lambda]\\ y_{obs,i} &\sim \text{Poisson}(\lambda) \\ y_{miss,j} &\sim \text{Poisson}(\lambda)\\ \lambda &\sim \text{gamma}(.001,.001) \end{align*}\]


{ # Extra bracket needed only for R markdown files
sink("Muskrats.R")
cat(" 
model {
  # priors
  lambda ~ dgamma(.001,.001)
  
  # likelihood
  for(i in 1:m){
      y.obs[i] ~ dpois(lambda)
  }
  for(i in 1 : (M - m)){
     y.miss[i] ~ dpois(lambda)
  }
  # derived quantity for number of houses
  N = sum(y.obs[]) + sum(y.miss[])
  lambda.finite = N / M
  } #end of model
",fill = TRUE)
sink()
} # Extra bracket needed only for R markdown files
set.seed(10)
y.obs = c(13, 18, 10, 6, 16, 13, 12, 13, 9, 11)
dataj = list(M = 50, m = 10, y.obs = y.obs )
jm = jags.model("Muskrats.R", data = dataj, n.adapt = 1000, n.chains = 3)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 10
##    Unobserved stochastic nodes: 41
##    Total graph size: 60
## 
## Initializing model
update(jm, n.iter = 5000)
zc = coda.samples(jm, variable.names = c("lambda", "lambda.finite"), n.iter = 50000)
MCMCsummary(zc)
##                   mean        sd     2.5%     50%    97.5% Rhat  n.eff
## lambda        12.10108 1.1013395 10.03308 12.0652 14.34889    1 149259
## lambda.finite 12.10106 0.9856783 10.26000 12.0800 14.12000    1 150000
lambda.samples = MCMCchains(zc, params = "lambda")
lambda.finite.samples = MCMCchains(zc, params = "lambda.finite")
plot(density(lambda.samples), main = "", xlab = expression(lambda), ylim = c(0,.5))
lines(density(lambda.finite.samples), col = "red")
legend(15,.4, legend = c("Infinite", "Finite"), lty = c(1,1), col = c("black", "red"))

Exercise: Data missing by design in time

Inventory and monitoring programs often use “revisit designs” where sites are observed according to a schedule that leaves out some years, creating missing data during those years. Managers would like predictions during all years. How can you make predictions for the missing years?

Revisit designs are often developed based on some measure of effort to obtain observations. Sites that require more effort to observe are visited less frequently. How could you account for the missing data that arise when the probability of observing a site is inversely related to hiking time?

No need for math here (whew!), just describe what you would do.

Each site has its own model of the response over time with intercept (or intercept and time slope) drawn from a distribution. Every site must have its own model to account for the repeated measures structure of the design. You use the posterior predictive distribution of each site’s model to impute the values for the missing years. You are using the data you have to inform the years when data are missing. If there are predictor variables (other than year) in your model, they must be included for the missing years or modeled as random variables.

We could include a covariate for hiking time in our model to make missing data in the revisit design ignorable.
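
A minimal JAGS sketch of the approach described above, using hypothetical data: y is an n.site by n.year matrix of counts with NA for unvisited site-years, yr is a standardized year covariate, and hike is a standardized hiking-time covariate. None of these names or values come from the exercise; they are illustrative only. Because the NA elements of y are unobserved stochastic nodes, JAGS draws them from the posterior predictive distribution automatically.

{ # Extra bracket needed only for R markdown files
sink("RevisitSketch.R")
cat("
model {
  # priors
  mu.alpha ~ dnorm(0, .001)
  sigma.alpha ~ dunif(0, 10)
  beta.yr ~ dnorm(0, .001)
  beta.hike ~ dnorm(0, .001)
  # site-level intercepts drawn from a common distribution
  for (j in 1:n.site) {
    alpha[j] ~ dnorm(mu.alpha, 1 / sigma.alpha^2)
    for (k in 1:n.year) {
      log(lambda[j, k]) <- alpha[j] + beta.yr * yr[k] + beta.hike * hike[j]
      # NA elements of y are imputed from the posterior predictive distribution
      y[j, k] ~ dpois(lambda[j, k])
    }
  }
}
", fill = TRUE)
sink()
} # Extra bracket needed only for R markdown files
# hypothetical data: 5 sites, 6 years, three site-years never visited
set.seed(10)
y = matrix(rpois(30, 8), nrow = 5)
y[cbind(c(2, 4, 5), c(3, 5, 6))] = NA
dataj = list(y = y, n.site = 5, n.year = 6, yr = scale(1:6)[, 1],
             hike = scale(rnorm(5, 2, 1))[, 1])
jm = jags.model("RevisitSketch.R", data = dataj, n.adapt = 1000, n.chains = 3)
update(jm, n.iter = 1000)
zc = coda.samples(jm, variable.names = c("mu.alpha", "beta.yr", "beta.hike", "y"), n.iter = 5000)
# rows of the summary for the NA site-years are posterior predictive imputations;
# observed site-years appear as constants
MCMCsummary(zc, params = "y")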



Exercise: Model specification to reflect sampling design


Consider the schematic of three sampling designs above. We start simply. Imagine a set of 1 m\(^2\) plots placed at randomly chosen points on a landscape (A). We count the number of native species in each plot \((y_{i})\) and take plot-level measurements of covariates \((\mathbf{w}_{i})\), for example the area of bare soil, the number of non-native species, soil nitrogen availability, and the other usual suspects. Covariates are standardized prior to analysis.

Completely random sampling (A) is ignorable without conditioning on a design matrix or parameter subscripts. Designs with restrictions on randomization, for example cluster (B) and stratified random (C) designs, are ignorable when the model includes the proper design matrix or proper subscripting. The shaded area is a hypothetical sampling frame of interest. The dots are samples. The rectangles represent clustering of samples within sites on a landscape, for example. The dark lines represent strata.

Write a properly subscripted model for each of these cases to assure that the missing data mechanism is ignorable.

For case B, write a model for a single point in time and also a model where you have repeated measures at the plot scale over time.

How would you make inference across strata in C assuming that the area of inference is limited to the shaded area from which the strata were drawn?

You may omit priors to reduce notational clutter.


A. A completely random design. The \(w\) are covariates.

\[\begin{align} g(\alpha_{0},\mathbf{\boldsymbol{\beta}},\mathbf{w}_{i})&=\exp(\alpha_{0}+\mathbf{w}_{i}'\boldsymbol{\beta})\\y_{i}&\sim\text{Poisson}(g(\alpha_{0},\mathbf{\boldsymbol{\beta}},\mathbf{w}_{i})) \end{align}\]

This example is a bit of a toy because completely random samples are unusual in ecological research. There are almost always logistical or biological reasons to use sampling designs with restrictions on the randomization, designs that are ignorable conditional on covariates or proper subscripting, as illustrated below.
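
A minimal JAGS sketch of model A, fit to hypothetical simulated data (the counts y, the covariate matrix w, and the coefficient values used in the simulation are illustrative, not part of the exercise):

{ # Extra bracket needed only for R markdown files
sink("DesignA.R")
cat("
model {
  # priors
  alpha0 ~ dnorm(0, .001)
  for (p in 1:n.cov) {
    beta[p] ~ dnorm(0, .001)
  }
  # likelihood for a completely random sample of plots
  for (i in 1:n) {
    log(lambda[i]) <- alpha0 + inprod(w[i, ], beta)
    y[i] ~ dpois(lambda[i])
  }
}
", fill = TRUE)
sink()
} # Extra bracket needed only for R markdown files
# hypothetical data: 40 plots, 2 standardized covariates
set.seed(10)
n = 40
w = matrix(rnorm(n * 2), ncol = 2)
y = rpois(n, exp(1 + w %*% c(.5, -.3))[, 1])
dataj = list(y = y, w = w, n = n, n.cov = 2)
jm = jags.model("DesignA.R", data = dataj, n.adapt = 1000, n.chains = 3)
update(jm, n.iter = 1000)
zc = coda.samples(jm, variable.names = c("alpha0", "beta"), n.iter = 5000)
MCMCsummary(zc)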

B. A cluster sample at one point in time

\[\begin{align} g(\alpha_{0j},\mathbf{\boldsymbol{\beta}},\mathbf{w}_{ij})&=\exp(\alpha_{0j}+\mathbf{w}_{ij}'\boldsymbol{\beta})\\y_{ij}&\sim\text{Poisson}(g(\alpha_{0j},\mathbf{\boldsymbol{\beta}},\mathbf{w}_{ij}))\\\alpha_{0j}&\sim\text{normal}(\mu_{\alpha_{0}},\sigma_{\alpha_{0}}^{2}) \end{align}\]

where \(y_{ij}\) is now the count of native species in plot \(i\) of site \(j\). The change in the likelihood is subtle: we have added the subscript \(j\) to show the site-level restriction on randomization and to allow each site to have its own intercept. Notice the utility of the hierarchical structure for \(\alpha_{0j}\), which supports inference across sites using \(\mu_{\alpha_{0}}\).
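
A JAGS translation of this cluster-sample model is sketched below, assuming the covariates are stored in a three-dimensional array w[i, j, p] and that each site contains the same number of plots; the names n.plot, n.site, and n.cov are hypothetical.

{ # Extra bracket needed only for R markdown files
sink("DesignB.R")
cat("
model {
  # priors
  mu.alpha0 ~ dnorm(0, .001)
  sigma.alpha0 ~ dunif(0, 10)
  for (p in 1:n.cov) {
    beta[p] ~ dnorm(0, .001)
  }
  for (j in 1:n.site) {
    # site-level intercepts reflect the restriction on randomization
    alpha0[j] ~ dnorm(mu.alpha0, 1 / sigma.alpha0^2)
    for (i in 1:n.plot) {
      log(lambda[i, j]) <- alpha0[j] + inprod(w[i, j, ], beta)
      y[i, j] ~ dpois(lambda[i, j])
    }
  }
}
", fill = TRUE)
sink()
} # Extra bracket needed only for R markdown files

This model file could be fit with jags.model and coda.samples exactly as in the muskrat and design A examples.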

B. A cluster sample with repeated measures on plots within sites

\[\begin{align*} g(\alpha_{0ij},\boldsymbol{\beta},\gamma,t_{k},\mathbf{w}_{ijk})&=\exp(\alpha_{0ij}+\mathbf{w}_{ijk}'\boldsymbol{\beta}+\gamma t_{k})\\ y_{ijk}&\sim\text{Poisson}(g(\alpha_{0ij},\boldsymbol{\beta},\gamma,t_{k},\mathbf{w}_{ijk}))\\ \alpha_{0ij}&\sim\text{normal}(\mu_{\alpha_{0j}},\sigma_{\alpha_{0j}}^{2})\\ \mu_{\alpha_{0j}} &\sim \text{normal}(\overset{all}{\mu_{\alpha_0}},\overset{all}{\sigma^2_{\alpha_0}}) \end{align*}\]

We now add a vector of times at which samples were taken, \(\mathbf{t}\), as a covariate in the model and add a subscript \(k\) to \(y\) to index those times, \(k=1,...,K\), where \(K\) is the length of the \(\mathbf{t}\) vector. The parameter \(\gamma\) is the rate of change. We set the first value of the \(\mathbf{t}\) vector to 0, which allows us to interpret \(\alpha_{0ij}\) as the species richness of plot \(i\) in site \(j\) at the start of the study at the mean of the covariates. Note that each plot has its own intercept \(\alpha_{0ij}\) that is drawn from a distribution of intercepts within each site. The means of the site-level intercepts, in turn, are drawn from a distribution of all intercepts.
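
A JAGS sketch of the repeated-measures version, again with hypothetical names: y[i, j, k] is the count in plot i of site j at time k, time[k] is the time covariate with time[1] = 0, and beta.time plays the role of \(\gamma\) above.

{ # Extra bracket needed only for R markdown files
sink("DesignBrepeat.R")
cat("
model {
  # priors
  mu.alpha0.all ~ dnorm(0, .001)
  sigma.alpha0.all ~ dunif(0, 10)
  beta.time ~ dnorm(0, .001)  # gamma in the notation above
  for (p in 1:n.cov) {
    beta[p] ~ dnorm(0, .001)
  }
  for (j in 1:n.site) {
    # site-level mean intercepts drawn from the distribution of all intercepts
    mu.alpha0[j] ~ dnorm(mu.alpha0.all, 1 / sigma.alpha0.all^2)
    sigma.alpha0[j] ~ dunif(0, 10)
    for (i in 1:n.plot) {
      # plot-level intercepts drawn from their site's distribution
      alpha0[i, j] ~ dnorm(mu.alpha0[j], 1 / sigma.alpha0[j]^2)
      for (k in 1:n.time) {
        log(lambda[i, j, k]) <- alpha0[i, j] + inprod(w[i, j, k, ], beta) + beta.time * time[k]
        y[i, j, k] ~ dpois(lambda[i, j, k])
      }
    }
  }
}
", fill = TRUE)
sink()
} # Extra bracket needed only for R markdown files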

C. A stratified random sample

It is often desirable to group sites into strata based on characteristics that reduce the variation of the response within each stratum. Stratification requires that we include additional information about how the data were collected in our model, in particular the subscript \(k=1,2,3\) (not the same as the \(k\) in the previous problem) to indicate membership in the three strata and the restriction on randomization imposed by stratification. Our model becomes \[\begin{align} g(\alpha_{0jk},\boldsymbol{\beta},\mathbf{w}_{ijk})&=\exp(\alpha_{0jk}+\mathbf{w}_{ijk}'\boldsymbol{\beta})\\y_{ijk}&\sim\text{Poisson}(g(\alpha_{0jk},\boldsymbol{\beta},\mathbf{w}_{ijk}))\\ \alpha_{0jk} &\sim \text{normal}(\mu_{\alpha_{0k}}, \sigma^2_{\alpha_{0k}}) \end{align}\]

We could use a hierarchical formulation as we did for the cluster example to make inference across sites, but it would not be appropriate to use a hierarchical structure for strata because the interest here is the mean across the three specific strata, not all possible strata. (Moreover, we only have three strata, which makes a hierarchical formulation somewhat sketchy in the absence of prior knowledge of the stratum-level variance.) We should instead compute a weighted average of the stratum means as a derived quantity, \(\mu_{\alpha_{0}}=\sum_{k=1}^{3}\frac{N_{k}}{N}\mu_{\alpha_{0k}}\), where \(N_{k}\) is the number of potential sample sites in stratum \(k\) and \(N=\sum_{k=1}^{3}N_{k}\) is the total number of potential sample sites (Gelman et al., 2013, eq. 8.7).
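
A JAGS sketch of the stratified model with the weighted average as a derived quantity. The stratum sizes N[k] would be supplied as data along with the counts and covariates; the names n.site, n.plot, n.cov, and N are hypothetical.

{ # Extra bracket needed only for R markdown files
sink("DesignC.R")
cat("
model {
  # priors
  for (p in 1:n.cov) {
    beta[p] ~ dnorm(0, .001)
  }
  for (k in 1:3) {
    mu.alpha0[k] ~ dnorm(0, .001)
    sigma.alpha0[k] ~ dunif(0, 10)
    for (j in 1:n.site[k]) {
      # site-level intercepts within stratum k
      alpha0[j, k] ~ dnorm(mu.alpha0[k], 1 / sigma.alpha0[k]^2)
      for (i in 1:n.plot) {
        log(lambda[i, j, k]) <- alpha0[j, k] + inprod(w[i, j, k, ], beta)
        y[i, j, k] ~ dpois(lambda[i, j, k])
      }
    }
  }
  # derived quantity: stratum means weighted by stratum size (Gelman et al. 2013, eq. 8.7)
  mu.alpha0.overall <- inprod(N[1:3], mu.alpha0[1:3]) / sum(N[1:3])
}
", fill = TRUE)
sink()
} # Extra bracket needed only for R markdown files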

A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin. Bayesian data analysis. Chapman and Hall/CRC, London, UK, 2013.