Once a model is fit and the parameters are estimated, we can ask how well the model explains the data, and which aspects of the data generation process in nature it does not capture. Most of the material covered in this post follows the examples from:

[1] Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2013). Bayesian Data Analysis, Third Edition (Texts in Statistical Science).

The example data I use here is from Simon Newcomb's experiment to measure the speed of light. The R source code and the data are available in the GitHub repository. We import the data and define a model function, i.e. a log posterior function, to model these data.

```r
## define a log posterior function
lp = function(theta, data){
  # we define the sigma on a log scale as optimizers work better
  # if scale parameters are well behaved
  s = exp(theta[2])
  m = theta[1]
  d = data$vector # observed data vector
  log.lik = sum(dnorm(d, m, s, log=T))
  log.prior = 1
  log.post = log.lik + log.prior
  return(log.post)
}
```

Here we assume the data can be modelled using a normal distribution, and we estimate its parameters.
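The fitting code below expects a starting value `start` and the data wrapped in a list `lData`. A minimal setup, assuming the measurements are held in a vector named `ivTime` (the name used in the later snippets), might look like this; the values shown are only the first few of Newcomb's 66 recorded deviations:

```r
## assumed setup: measurements in a vector named ivTime, as the later snippets expect
ivTime = c(28, 26, 33, 24, 34, -44, 27, 16, 40, -2)  # first few of the 66 values
lData = list(vector=ivTime)
## starting values: the mean, and the sd on the log scale
## (the log posterior exponentiates theta[2])
start = c('mu'=mean(ivTime), 'sigma'=log(sd(ivTime)))
```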

```r
## try the laplace function from LearnBayes
library(LearnBayes)
fit = laplace(lp, start, lData)
se = sqrt(diag(fit$var))
## lets take a sample from this
sigSample.op = rnorm(1000, fit$mode['sigma'], se['sigma'])
muSample.op = rnorm(1000, mean(lData$vector), exp(sigSample.op)/sqrt(length(lData$vector)))
#### the posterior interval reported on P 67 Gelman [2013] is
## y.bar ± 1.997*s/sqrt(66) = [23.6, 28.8]
## compare with our intervals
fit$mode['mu']-1.96*se['mu']; fit$mode['mu']+1.96*se['mu']
##       mu
## 23.63911
##       mu
## 28.78365
quantile(muSample.op, c(0.025, 0.975))
##     2.5%    97.5%
## 23.61960 28.83061
```

The figure below suggests that the normal model may not be appropriate for this data, as there is an extreme outlier measurement around -44.

There are various approaches to model checking, e.g.:

- Using external validation: use the data to fit the model (i.e. estimate the parameters), make predictions about new or future data using the model and parameters, collect new real data, and compare it with the predictions.
- Recycling existing data: used when new data is not available. This also involves other modifications to model fitting, like cross-validation techniques. I will talk about this further in a future post.

How do we compare the predictions with the real data? We define some metrics to compare against, which are called Test Quantities.

## Prediction of New/Future Data (Posterior Predictive Distribution)

New or future data can be generated by simulation, using the model and the estimated parameters.

```r
## POSTERIOR PREDICTIVE CHECKING
## observed data should look plausible under the posterior predictive distribution
## Draw simulated values from the joint posterior of Yrep and compare to Yobs
## and look for systematic differences - Gelman [2013] P 144
## sample 66 values, 20 times, each time drawing a fresh draw of sd and mean
## from the joint posterior
mDraws = matrix(NA, nrow = 66, ncol=20)

for (i in 1:20){
  p = sample(1:1000, size = 1)
  s = exp(sigSample.op[p])
  m = muSample.op[p]
  mDraws[,i] = rnorm(66, m, s)
}
```

In the example shown above we sample the standard deviation parameter, and then sample the mean parameter conditioned on the standard deviation (notice we use the same index p). We draw 66 samples and repeat the process 20 times; this is akin to repeating the experiment 20 times and taking 66 measurements each time. The figure below shows some of the histograms of the 20 simulations along with the original data.

The replications do not include the outlier measurement at -44, which suggests that our current model does not capture that part of the data generation process. Graphical checks can be a bit tedious in high-throughput settings, so we can add some quantitative checks by defining Test Quantities.

## Test Quantities

A test quantity is a function of the original data and the replicated data, with some optional additional parameters. We evaluate the discrepancy between the test quantities computed on the original data and on the replicated data by calculating a *PValue*, and watch for extreme tail-area *PValues*. We define 5 test quantities in the current case, representing: Variance, Mean, Symmetry, Minimum and Maximum.

```r
# The procedure for carrying out a posterior predictive model check requires
# specifying a test quantity, T(y) or T(y, θ), and an appropriate predictive
# distribution for the replications y rep [Gelman 2008]
## variance
T1_var = function(Y) return(var(Y))

## is the model adequate except for the extreme tails
T1_symmetry = function(Y, th){
  Yq = quantile(Y, c(0.90, 0.10))
  return(abs(Yq[1]-th) - abs(Yq[2]-th))
}

## min quantity
T1_min = function(Y){
  return(min(Y))
}

## max quantity
T1_max = function(Y){
  return(max(Y))
}

## mean quantity
T1_mean = function(Y){
  return(mean(Y))
}
```

```r
## calculate bayesian p-value for this test statistic
getPValue = function(Trep, Tobs){
  left = sum(Trep <= Tobs)/length(Trep)
  right = sum(Trep >= Tobs)/length(Trep)
  return(min(left, right))
}
```

Extreme *PValues* (typically less than 0.01) for the test quantities suggest areas of failure for the model which can be addressed by expanding the model, or ignored if appropriate for the applied question at hand. *“The relevant goal is not to answer the question, ‘Do the data come from the assumed model?’ (to which the answer is almost always no), but to quantify the discrepancies between data and model, and assess whether they could have arisen by chance, under the model’s own assumptions.”* [1].
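As a self-contained sketch of how these pieces fit together, the p-value for the variance quantity can be computed as below. This uses toy stand-ins for the replication matrix and the observed data, so the numbers are illustrative only; in the analysis above, `mDraws` and the Newcomb measurements take their place:

```r
set.seed(123)
## test quantity and p-value function, as defined in the post
T1_var = function(Y) return(var(Y))
getPValue = function(Trep, Tobs){
  left = sum(Trep <= Tobs)/length(Trep)
  right = sum(Trep >= Tobs)/length(Trep)
  return(min(left, right))
}
## toy stand-ins: 20 replications of 66 draws, and one 'observed' vector
mDraws.toy = matrix(rnorm(66*20, 26, 10), nrow=66, ncol=20)
yobs.toy = rnorm(66, 26, 10)
t.rep = apply(mDraws.toy, 2, T1_var)  # T(y_rep) for each replication
p = getPValue(t.rep, T1_var(yobs.toy))  # Bayesian p-value for the variance quantity
```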

Under the normal distribution model, the 5 test quantities show the following *PValues*.

```
> mChecks[,'Normal']
Variance Symmetry      Max      Min     Mean 
    0.47     0.16     0.01     0.00     0.47 
```

The tests for the Min and Max quantities fail, suggesting that this model does not perform well in the tail regions. Perhaps a heavy-tailed distribution, like a T distribution with low degrees of freedom or a contaminated normal distribution, will be more useful.

```r
## define a second log posterior function for mixture with contaminated normal distribution
lp2 = function(theta, data){
  # we define the sigma on a log scale as optimizers work better
  # if scale parameters are well behaved
  s = exp(theta[2])
  m = theta[1]
  mix = 0.9
  cont = theta[3]
  d = data$vector # observed data vector
  log.lik = sum(log(dnorm(d, m, s) * mix + dnorm(d, m, s*cont) * (1-mix)))
  log.prior = 1
  log.post = log.lik + log.prior
  return(log.post)
}

# sanity check for function
# choose a starting value
start = c('mu'=mean(ivTime), 'sigma'=log(sd(ivTime)), 'cont'=1)
lp2(start, lData)

## try the laplace function from LearnBayes
fit2 = laplace(lp2, start, lData)
fit2
se2 = sqrt(diag(fit2$var))
sigSample.op = rnorm(1000, fit2$mode['sigma'], se2['sigma'])
muSample.op = rnorm(1000, mean(lData$vector), exp(sigSample.op)/sqrt(length(lData$vector)))
```

A contaminated normal distribution is a mixture distribution where we add 2 more parameters to the model: the mixing probability and the contamination parameter. We fix the mixing probability at 0.9 and track the contamination parameter.
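To see why this helps with outliers, here is a small sketch (my own illustration, not part of the analysis above) comparing the lower tail mass of a plain normal with that of a 90/10 contaminated normal whose second component has 5 times the scale:

```r
## CDF of a contaminated normal: mix*N(m, s) + (1-mix)*N(m, s*cont)
pcontnorm = function(q, m, s, cont, mix=0.9){
  pnorm(q, m, s)*mix + pnorm(q, m, s*cont)*(1-mix)
}
## probability of an observation more than 4 sd below the mean
pnorm(-4)                     # plain normal: about 3e-05
pcontnorm(-4, 0, 1, cont=5)   # contaminated normal: about 0.02
```

The wider component soaks up extreme observations like the -44 measurement without distorting the fit to the bulk of the data.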

Here are a couple of web links if you are more interested in contaminated normal distributions (Link 1, Link 2).

The third distribution we try is a T distribution with a small number of degrees of freedom. We use a slightly different parameterization of the t-distribution, which is nicely explained in another blog here.
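This parameterization adds a location `mu` and scale `a` to the standard t density. As a quick sanity check (my own sketch, not from the analysis below), with large degrees of freedom it should approach the normal density with the same location and scale:

```r
## location-scale t density: shift by mu, scale by a, and divide by a
## so the density still integrates to 1
dt_ls = function(x, df, mu, a) 1/a * dt((x - mu)/a, df)
dt_ls(26, df=1e4, mu=26, a=10)  # close to...
dnorm(26, mean=26, sd=10)       # ...the normal density, about 0.0399
```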

```r
lp3 = function(theta, data){
  # function to use the scale parameter
  ## see here https://grollchristian.wordpress.com/2013/04/30/students-t-location-scale/
  dt_ls = function(x, df, mu, a) 1/a * dt((x - mu)/a, df)
  ## likelihood function
  lf = function(dat, pred){
    return(log(dt_ls(dat, nu, pred, sigma)))
  }
  nu = exp(theta['nu']) ## normality parameter for t distribution
  sigma = exp(theta['sigma']) # scale parameter for t distribution
  m = theta[1]
  d = data$vector # observed data vector
  # nu is already on the natural scale after exp(), so compare it directly to 1
  if (nu < 1) return(-Inf)
  log.lik = sum(lf(d, m))
  log.prior = 1
  log.post = log.lik + log.prior
  return(log.post)
}

# sanity check for function
# choose a starting value
start = c('mu'=mean(ivTime), 'sigma'=log(sd(ivTime)), 'nu'=log(2))
lp3(start, lData)

op = optim(start, lp3, control = list(fnscale = -1), data=lData)
op$par
exp(op$par[2:3])

## try the laplace function from LearnBayes
fit3 = laplace(lp3, start, lData)
fit3
se3 = sqrt(diag(fit3$var))
sigSample.op = rnorm(1000, fit3$mode['sigma'], se3['sigma'])
nuSample = exp(rnorm(1000, fit3$mode['nu'], se3['nu']))
# threshold the sample values to above or equal to 1
nuSample[nuSample < 1] = 1
## generate random samples from alternative t-distribution parameterization
## see https://grollchristian.wordpress.com/2013/04/30/students-t-location-scale/
rt_ls <- function(n, df, mu, a) rt(n,df)*a + mu
muSample.op = rnorm(1000, mean(lData$vector), exp(sigSample.op)/sqrt(length(lData$vector)))
```

Using the 3 models, we calculate the test quantities and their *PValues*.

```
> round(mChecks, 2)
         Normal NormalCont    T
Variance   0.47       0.34 0.34
Symmetry   0.16       0.10 0.14
Max        0.01       0.06 0.18
Min        0.00       0.21 0.14
Mean       0.47       0.48 0.47
```

We can see that the normal model does not fit the data well in terms of outliers, while the contaminated normal and T distributions both fit the data better.

The figure below shows the histogram of the data with a smattering of density lines from the 200 posterior predictive replications. The normal model does not fit well in the tails, while the other two show a reasonable fit; at least the remaining discrepancy can be explained by sampling variation, according to the *PValues*.
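That kind of overlay can be produced along the following lines (a sketch with toy stand-ins for the replication matrix and the data; in the post, the 200 posterior predictive replications and the Newcomb measurements take their place):

```r
set.seed(42)
## toy stand-ins: a 66 x 200 matrix of replicated data and an 'observed' vector
mDraws.rep = matrix(rnorm(66*200, 26, 10), nrow=66)
yobs = rnorm(66, 26, 10)
## histogram of the data with one density line per replication
hist(yobs, prob=TRUE, xlab='measurement', main='posterior predictive densities')
invisible(apply(mDraws.rep, 2, function(x) lines(density(x), col='grey')))
```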

While researching the material for this blog I also came across the post about this data set and extreme outliers in data sets at Andrew Gelman’s blog.