Model Checking: Posterior Predictive Checks

Once a model is fit and parameters estimated, we would look at how well the model explains the data and what aspects of the data generation process in nature are not captured by the model. Most of the material covered in this post follows the examples from:

[1] Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2013). Bayesian Data Analysis, Third Edition (Texts in Statistical Science). Book.

The example data I use here is Simon Newcomb’s experiment to measure the speed of light. The R source code and the data are present in the github repository. Import the data and define a model function i.e. log posterior function to model this data.

## define a log posterior function
lp = function(theta, data){
  # we define the sigma on a log scale as optimizers work better
  # if scale parameters are well behaved
  s = exp(theta[2])
  m = theta[1]
  d = data$vector # observed data vector
  log.lik = sum(dnorm(d, m, s, log=T))
  log.prior = 1 = log.lik + log.prior

Here we assume the data can be modelled using a normal distribution and estimate the parameters.

## try the laplace function from LearnBayes
fit = laplace(lp, start, lData)

## lets take a sample from this
sigSample.op = rnorm(1000, fit$mode['sigma'], se['sigma'])
muSample.op = rnorm(1000, mean(lData$vector), exp(sigSample.op)/sqrt(length(lData$vector)))

#### the posterior interval reported on P 67 Gelman [2013] is
## y ± 1.997s/ 66 = [23.6, 28.8]
## compare with our intervals
fit$mode['mu']-1.96*se['mu']; fit$mode['mu']+1.96*se['mu']
##      mu 
## 23.63911 
##      mu 
## 28.78365 
quantile(muSample.op, c(0.025, 0.975))
##     2.5%    97.5% 
## 23.61960 28.83061 

The figure below suggests that the normal model may not be appropriate for this data, as there is an extreme outlier measurement around -44.


There are various approaches to model checking e.g.:

  • Using external validation: use data to fit the model (i.e. estimate parameters), make predictions about the new or future data using the model and parameters, collect new real data and compare with the predictions.
  • Recycle existing data: if new data is not available. This also involves other modifications to model fitting cross-validation techniques. I will talk about this further in a future post.

How do we compare the predictions with the real data, i.e. we define some metrics to compare against, which are called Test Quantities.

Prediction of New/Future Data (Posterior Predictive Distribution)

New data or future data can be generated using simulation, from the model and estimated parameters.

## observed data should look plausible under the posterior predictive distribution
## Draw simulated values from the joint posterior of Yrep and compare to Yobs and look for systematic differences

## Gelman [2013] P 144 -
## sample 66 values, 20 times, each time drawing a fresh draw of sd and mean from the joint posterior
mDraws = matrix(NA, nrow = 66, ncol=20)

for (i in 1:20){
  p = sample(1:1000, size = 1)
  s = exp(sigSample.op[p])
  m = muSample.op[p]
  mDraws[,i] = rnorm(66, m, s)

In the example shown above we sample the standard deviation parameter, and then sample the mean parameter (conditioned on the standard deviation, notice we use the same index p). We draw 66 samples and repeat the process 20 times, this is akin to repeating the experiment 20 times and each time taking a 66 measurements. The figure below shows some of the histograms of the 20 simulations and the original data.


It appears that the replications do not include the outlier measurement at -44, and suggests that our current model does not capture that part of the data generation process. Using graphical checks can be a bit tedious in high-throughput settings, and we can use some quantitative checks by defining Test Quantities.

Test Quantities

A test quantity is a function of the original data and replicated data, with some optional additional parameters. We can evaluate the discrepancy in the between the test quantities using the original data and replicated data by calculating a PValue and watch for extreme tail area PValues. We define 5 test quantities in the current case representing: Variance, Mean, Symmetry, Minimum and Maximum.

# The procedure for carrying out a posterior predictive model check requires specifying a test
# quantity, T (y) or T (y, θ), and an appropriate predictive distribution for the replications
# y rep [Gelman 2008]
## variance
T1_var = function(Y) return(var(Y))

## is the model adequate except for the extreme tails
T1_symmetry = function(Y, th){
  Yq = quantile(Y, c(0.90, 0.10))
  return(abs(Yq[1]-th) - abs(Yq[2]-th))

## min quantity
T1_min = function(Y){

## max quantity
T1_max = function(Y){

## mean quantity
T1_mean = function(Y){

## calculate bayesian p-value for this test statistic
getPValue = function(Trep, Tobs){
  left = sum(Trep <= Tobs)/length(Trep)
  right = sum(Trep >= Tobs)/length(Trep)
  return(min(left, right))

Extreme PValues (typically less than 0.01) for the test quantities suggest areas of failure for the model which can be addressed by expanding the model, or ignored if appropriate for the applied question at hand. “The relevant goal is not to answer the question, ‘Do the data come from the assumed model?’ (to which the answer is almost always no), but to quantify the discrepancies between data and model, and assess whether they could have arisen by chance, under the model’s own assumptions.” [1].

Under the normal distribution model, the 5 test quantities show the following PValues.

> mChecks[,'Normal']
Variance Symmetry      Max      Min     Mean 
    0.47     0.16     0.01     0.00     0.47 

The tests for Min and Max quantities fail, suggesting that this model does not perform well in the tail regions. Perhaps using a heavy tailed distribution, like a T distribution with low degrees of freedom or a Contaminated normal distribution will be more useful.

## define a second log posterior function for mixture with contaminated normal distribution
lp2 = function(theta, data){
  # we define the sigma on a log scale as optimizers work better
  # if scale parameters are well behaved
  s = exp(theta[2])
  m = theta[1]
  mix = 0.9
  cont = theta[3]
  d = data$vector # observed data vector
  log.lik = sum(log(dnorm(d, m, s) * mix + dnorm(d, m, s*cont) * (1-mix)))
  log.prior = 1 = log.lik + log.prior

# sanity check for function
# choose a starting value
start = c('mu'=mean(ivTime), 'sigma'=log(sd(ivTime)), 'cont'=1)
lp2(start, lData)

## try the laplace function from LearnBayes
fit2 = laplace(lp2, start, lData)
se2 = sqrt(diag(fit2$var))

sigSample.op = rnorm(1000, fit2$mode['sigma'], se2['sigma'])
muSample.op = rnorm(1000, mean(lData$vector), exp(sigSample.op)/sqrt(length(lData$vector)))

A contaminated normal distribution is a mixture distribution where we add 2 more parameters to the model, the mixing probability and contamination parameter. We fix the mixing probability to 0.9 and track the contamination parameter.

Here are a couple of web links if you are more interested in contaminated normal distributions (Link 1, Link 2).

The third distribution we try is a T distribution with small degrees of freedom. We use a slightly different parameterization of the t-distribution which is nicely explained in another blog here.

lp3 = function(theta, data){
  # function to use to use scale parameter
  ## see here
  dt_ls = function(x, df, mu, a) 1/a * dt((x - mu)/a, df)
  ## likelihood function
  lf = function(dat, pred){
    return(log(dt_ls(dat, nu, pred, sigma)))
  nu = exp(theta['nu']) ## normality parameter for t distribution
  sigma = exp(theta['sigma']) # scale parameter for t distribution
  m = theta[1]
  d = data$vector # observed data vector
  if (exp(nu) < 1) return(-Inf)
  log.lik = sum(lf(d, m))
  log.prior = 1 = log.lik + log.prior

# sanity check for function
# choose a starting value
start = c('mu'=mean(ivTime), 'sigma'=log(sd(ivTime)), 'nu'=log(2))
lp3(start, lData)

op = optim(start, lp3, control = list(fnscale = -1), data=lData)

## try the laplace function from LearnBayes
fit3 = laplace(lp3, start, lData)
se3 = sqrt(diag(fit3$var))

sigSample.op = rnorm(1000, fit3$mode['sigma'], se3['sigma'])
nuSample = exp(rnorm(1000, fit3$mode['nu'], se3['nu']))
# threshold the sample values to above or equal to 1
nuSample[nuSample < 1] = 1

## generate random samples from alternative t-distribution parameterization
## see
rt_ls <- function(n, df, mu, a) rt(n,df)*a + mu
muSample.op = rnorm(1000, mean(lData$vector), exp(sigSample.op)/sqrt(length(lData$vector)))

Using the 3 approaches we calculated the test quantities and the PValues for these quantities.

> round(mChecks, 2)
         Normal NormalCont    T
Variance   0.47       0.34 0.34
Symmetry   0.16       0.10 0.14
Max        0.01       0.06 0.18
Min        0.00       0.21 0.14
Mean       0.47       0.48 0.47

We can see that the normal model does not seem to fit the data well in terms of outliers, while the Contaminated Normal and T distributions, both fit the data better.

The figure below shows the histogram of the data and the smattering of the density lines from the 200 posterior predictive replications of the data. We can see that the normal model does not fit well in the tails, while the other two show  a reasonable fit, at least it can be explained by sampling variation according to PValues.


While researching the material for this blog I also came across the post about this data set and extreme outliers in data sets at Andrew Gelman’s blog.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s