Bayesian Models for Ecologists
Bayesian Model Checking
June 09, 2022

Posterior Predictive P-values

To check the goodness of fit of a Bayesian model, we can compare posterior predictions to the observed data in terms of a chosen statistic of interest (e.g., the variance of the data) to see if our model is capable of generating data like those we observed. We do this using a derived quantity (i.e., a statistic) calculated from both the observed and the predicted data. If the statistic for the predicted data is consistently far from the statistic calculated from the observed data, we have evidence that our model could not plausibly have generated the observed data, and thus that the fit of the model is inadequate.

For example, if we are interested in the variance (i.e., spread of the data) then our statistic \(v(\mathbf{y})\) for the observed data \(\mathbf{y}=(y_1,\ldots,y_n)'\) is \[\begin{equation} v(\mathbf{y})=\frac{\sum_{i=1}^n(y_i-\bar{y})^2}{n} \;. \notag \end{equation}\]
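This statistic is easy to compute directly in R. Note that R's built-in var() divides by \(n-1\) rather than \(n\), so a small helper (my own, not from the source) is used here for the \(n\)-divisor version; either divisor works for the p-value as long as it is applied consistently to the observed and predicted data.

```r
# Sample variance with an n divisor, matching the definition of v(y) above
v <- function(y) mean((y - mean(y))^2)

y <- c(2, 4, 4, 4, 5, 5, 7, 9)
v(y)        # 4 (n divisor)
var(y)      # 32/7, approx 4.571 (n-1 divisor)
```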

Then, for each MCMC realization of posterior predictions for the data \(\mathbf{y}^{\text{new} (k)}=(y_1^{\text{new} (k)},\ldots,y_n^{\text{new} (k)})'\), we can calculate the same statistic as \[\begin{equation} v(\mathbf{y}^{\text{new} (k)})=\frac{\sum_{i=1}^n(y_i^{\text{new} (k)}-\bar{y}^{\text{new} (k)})^2}{n} \;, \notag \end{equation}\] for \(k=1,\ldots,K\) MCMC realizations.

The posterior predictive p-value can then be calculated as \[\begin{equation} \text{P}(v(\mathbf{y}^{\text{new}}) > v(\mathbf{y}) | \mathbf{y})=\frac{\sum_{k=1}^K 1_{\{v(\mathbf{y}^{\text{new} (k)}) > v(\mathbf{y})\}}}{K} \;. \notag \end{equation}\]
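In code, the indicator-function sum above is just the proportion of predictive statistics that exceed the observed statistic. A minimal sketch (the function name is my own, not from the source):

```r
# Posterior predictive p-value: proportion of MC realizations in which the
# statistic for the predicted data exceeds the statistic for the observed data
ppp <- function(v.new, v.obs) mean(v.new > v.obs)

ppp(c(1.2, 0.8, 1.5, 0.9), 1.0)  # 0.5
```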



Simulate Data

To see how Bayesian model checking using posterior predictive p-values works in practice, simulate data from two different generative models:

set.seed(202)
n=100   # sample size
lam=10  # Poisson intensity
sig=.2  # SD of the log-scale error in the second data set
y.1=rpois(n,lam)                      # Poisson data
y.2=rpois(n,lam*exp(rnorm(n,0,sig)))  # Poisson data with multiplicative lognormal error

Note that the second data-generating process includes extra dispersion due to the multiplicative exponentiated (i.e., lognormal) error term. Thus, the second data set has more dispersion than the Poisson data model alone allows. Nevertheless, we will fit a Poisson model to both data sets and check each using the variance as the statistic in a posterior predictive p-value.
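To see why the second process is overdispersed, the marginal moments can be obtained by iterated expectation over the lognormal error (this calculation is my own addition, not part of the original exercise):

```r
# For y ~ Pois(lam * exp(e)) with e ~ N(0, sig^2), iterated expectation gives
#   E[y]   = lam * exp(sig^2 / 2)
#   Var[y] = E[Var(y|e)] + Var(E[y|e])
#          = lam * exp(sig^2 / 2) + lam^2 * (exp(2*sig^2) - exp(sig^2))
lam=10; sig=.2
m=lam*exp(sig^2/2)                                   # approx 10.2
v=lam*exp(sig^2/2)+lam^2*(exp(2*sig^2)-exp(sig^2))   # approx 14.5
c(mean=m, var=v)
```

The marginal variance exceeds the marginal mean, whereas the Poisson model forces them to be equal, so a Poisson fit cannot accommodate this data-generating process.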

Problem

Find the posterior predictive p-value associated with a Bayesian Poisson model for each of the two data sets we simulated above. Do we have evidence that the Poisson model is not a good fit for either of them in terms of dispersion? Take the following steps to answer this question.

  1. Using pencil and paper, write a Bayesian model statement assuming a Poisson model for the response variable \(\mathbf{y}=(y_1,\ldots,y_n)'\) and use a gamma distribution for the prior of the Poisson intensity parameter \(\lambda\).


\[\begin{align*} y_i &\sim \text{Poisson}(\lambda), \text{for } i=1,\ldots,n \;, \\ \lambda &\sim \text{gamma}(\alpha,\beta) \;. \\ \end{align*}\]


  2. What is the posterior distribution for \(\lambda\) given \(\mathbf{y}\) associated with this model?


\[\begin{align*} [\lambda \mid \mathbf{y}] &\propto [\mathbf{y}\mid\lambda][\lambda] \\ &\propto \prod_{i=1}^{n} \text{Poisson} (y_{i} \mid \lambda) \, \text{gamma} (\lambda\mid\alpha,\beta) \\ &\propto \text{gamma}\left(\sum_{i=1}^n y_i+\alpha,\,n+\beta\right) \end{align*}\]
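As a quick sanity check (my own addition, not part of the exercise), the conjugate result can be verified numerically by comparing the gamma posterior to a brute-force grid evaluation of likelihood times prior:

```r
# Numerical check of Poisson-gamma conjugacy on a grid of lambda values
set.seed(1)
alpha=.01; beta=.01
y=rpois(50,10)
lam.grid=seq(5,15,length.out=200)

# Unnormalized log posterior: log likelihood + log prior
log.post=sapply(lam.grid,function(l) sum(dpois(y,l,log=TRUE))+dgamma(l,alpha,beta,log=TRUE))
post.grid=exp(log.post-max(log.post))
post.grid=post.grid/(sum(post.grid)*diff(lam.grid)[1])  # normalize on the grid

# Exact conjugate posterior density
post.exact=dgamma(lam.grid,sum(y)+alpha,length(y)+beta)
max(abs(post.grid-post.exact))  # small (grid-approximation error only)
```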


  3. Obtain a Monte Carlo (MC) sample from the posterior predictive distribution for the model fit to both data sets we simulated above. Use the hyperparameters \(\alpha=0.01\) and \(\beta=0.01\) in the prior for \(\lambda\).


In this case, we have an analytical solution for the posterior distribution and conditionally independent data, so we can first sample \(\lambda^{(k)} \sim \text{gamma}(\sum_{i=1}^n y_i+\alpha,n+\beta)\) and then sample \(y_i^{\text{new} (k)} \sim \text{Poisson}(\lambda^{(k)})\) for \(k=1,\ldots,K\) MC iterations. No MCMC algorithm is necessary because we know the exact posterior distribution and can sample directly from it.

alpha=.01  # gamma prior shape
beta=.01   # gamma prior rate
K=10000    # number of MC samples

Y.new.1=matrix(0,n,K)  # posterior predictive draws for data set 1
Y.new.2=matrix(0,n,K)  # posterior predictive draws for data set 2
for(k in 1:K){
  # draw lambda from its gamma posterior, then new data from the Poisson model
  lambda.1=rgamma(1,sum(y.1)+alpha,n+beta)
  Y.new.1[,k]=rpois(n,lambda.1)
  
  lambda.2=rgamma(1,sum(y.2)+alpha,n+beta)
  Y.new.2[,k]=rpois(n,lambda.2)
}


  4. Compute the posterior predictive p-values using the variance statistic for the model fit to both data sets. Is there evidence of lack of fit for the model fit to either data set?


v.1=var(y.1)                  # observed variance, data set 1
v.1.new=apply(Y.new.1,2,var)  # variance of each posterior predictive data set
p.1=mean(v.1.new>v.1)         # posterior predictive p-value
p.1
## [1] 0.358
v.2=var(y.2)                  # observed variance, data set 2
v.2.new=apply(Y.new.2,2,var)
p.2=mean(v.2.new>v.2)
p.2
## [1] 0.0045

The p-value for the first data set (0.358) is not extreme, so we have no evidence of lack of fit for the Poisson model there. The p-value for the second data set (0.0045) is close to zero, meaning the posterior predictive data sets almost never have as much variance as the observed data; the Poisson model cannot accommodate the extra dispersion, so we have evidence of lack of fit in terms of dispersion.
