Model Checking: Scoring and Comparing Models

This is another post in the series on model checking. Previously we looked at which aspects of the data and model are compatible, using posterior predictive checks. Once we have selected a model, or a set of models, for the data, we would like to score and compare them. One aspect of comparison, using mixture distributions and Bayes factors, was shown in a previous post. Here we use the log predictive density, calculated from posterior samples, to compute information criteria and leave-one-out cross-validation scores (with adjustments for the number of parameters in the model). The examples and algorithms used in this post are covered in more detail in:

[1] Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2013). Bayesian Data Analysis, Third Edition (Texts in Statistical Science). CRC Press.

[2] Albert, J. (2009). Bayesian Computation with R (Use R!). Springer. http://doi.org/10.1007/978-0-387-92298-0

[3] James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning: with Applications in R (Springer Texts in Statistics). Springer. http://doi.org/10.1007/978-1-4614-7138-7

The predictive accuracy of a model can be used to evaluate it and compare it against other candidate models. Accuracy is measured with a scoring function, e.g. mean squared error, log-likelihood or log predictive density. This ties into the concept of deviance or information criteria, which put different models (with different parameters) on a common scale, along with some adjustments for the number of parameters being estimated.

The data used to calculate this predictive accuracy can be recycled (within sample), external (out of sample, i.e. new data), or partitioned using cross-validation (Ref [3] has a good chapter on cross-validation).
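
As a toy illustration of these ideas, separate from the example that follows and using made-up data, the sketch below scores a fitted linear model on a held-out test set with both mean squared error and the log predictive density (all object names here are hypothetical):

## hypothetical data: fit on a training set, score on a test set
set.seed(123)
df = data.frame(x = rnorm(50))
df$y = 2 + 3 * df$x + rnorm(50)
iTrain = 1:40; iTest = 41:50
fit.toy = lm(y ~ x, data = df[iTrain, ])
pred = predict(fit.toy, newdata = df[iTest, ])
## mean squared error on the test set
mean((df$y[iTest] - pred)^2)
## log predictive density on the test set (plugging in the point estimates)
sum(dnorm(df$y[iTest], mean = pred, sd = summary(fit.toy)$sigma, log = TRUE))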

Example Data and Model

The data used in this example are from the ‘bread and peace’ model of US presidential vote share and income growth (see the original paper for details). The example is taken from Ref [1], and the R code can be found in the GitHub repository.

We first fit a linear model using the lm function in R.
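
The call producing the summary below is simply (assuming dfData holds the VoteShare and IncomeGrowth columns seen in the output):

fit.lm = lm(VoteShare ~ IncomeGrowth, data = dfData)
summary(fit.lm)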

Call:
lm(formula = VoteShare ~ IncomeGrowth, data = dfData)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.8370 -0.3748  0.1379  1.7745  5.6291 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   45.8027     1.7297  26.480 1.07e-12 ***
IncomeGrowth   3.1809     0.7226   4.402 0.000715 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.816 on 13 degrees of freedom
Multiple R-squared:  0.5985,    Adjusted R-squared:  0.5676 
F-statistic: 19.38 on 1 and 13 DF,  p-value: 0.0007149

There are three parameters being estimated here, the two coefficients and the standard deviation for the likelihood function.

## write the model with the log posterior function
# use the likelihood with non-informative priors
mylogpost = function(theta, data){
  sigma = exp(theta['sigma']) # sigma is passed on the log scale
  betas = theta[-1]
  x = data$pred
  y = data$resp
  
  # get the predicted or fitted value 
  mModMatrix = model.matrix(y ~ x)
  mCoef = matrix(betas, nrow = length(betas))
  iFitted = mModMatrix %*% mCoef # using identity link
  ## likelihood function
  llik = sum(dnorm(y, iFitted, sigma, log=T))
  ## non-informative (flat) prior, so the log prior is just a constant
  lprior = 1
  lpost = llik + lprior
  return(lpost)
}

We define this regression model using our custom function mylogpost, tracking 3 parameters (sigma and the 2 betas). As we are using non-informative priors, the results should be similar to those from the lm fit.
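
The laplace and sir functions used below come from the LearnBayes package that accompanies Ref [2]. The starting values and the data list passed to laplace are set up along these lines (mirroring the set-up reused later in the cross-validation section; note that sigma is started on the log scale):

library(LearnBayes) # provides laplace() and sir() [2]
## starting values, with sigma on the log scale
start = c('sigma' = log(sd(dfData$VoteShare)), 'beta0' = 0, 'beta1' = 0)
## data list in the format expected by mylogpost
lData = list(pred = dfData$IncomeGrowth, resp = dfData$VoteShare)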

> fit = laplace(mylogpost, start, lData)
> fit
$mode
    sigma     beta0     beta1 
 1.268109 45.806382  3.180065 

$var
              sigma         beta0         beta1
sigma  3.335900e-02  0.0002461944 -5.857976e-05
beta0  2.461944e-04  2.5949267841 -8.909553e-01
beta1 -5.857976e-05 -0.8909553156  4.528745e-01

$int
[1] -38.72532

$converge
[1] TRUE

> se = sqrt(diag(fit$var))
> se
    sigma     beta0     beta1 
0.1826445 1.6108776 0.6729595 
## coef and sd
> fit$mode[-1]
    beta0     beta1 
45.806382  3.180065 
> exp(fit$mode[1])
   sigma 
3.554124 
> 

We take a sample from the posterior distribution of these three parameters using a Sampling Importance Resampling (SIR) algorithm, with a multivariate t proposal density and mylogpost as the target density (I will write a post about these sampling methods at some point in the future). You could also use the approach suggested in the post on parameter estimation, but if the parameters are sampled independently we cannot account for the covariance between them, which is avoided here by sampling from a multivariate proposal density.

## parameters for the multivariate t density
tpar = list(m=fit$mode, var=fit$var*2, df=4)
## get a sample directly and using sir (sampling importance resampling with a t proposal density)
s = sir(mylogpost, tpar, 1000, lData)
#s = rmt(10000, fit$mode, S = fit$var)
sigSample = s[,'sigma']
beta0Sample = s[,'beta0']
beta1Sample = s[,'beta1']

We use the SIR algorithm from [2], but you can write your own; it is straightforward once you know how. The tpar argument is a list of parameters for the multivariate t proposal density. To check our sampler, we also use Stan to generate samples for these parameters via MCMC. The figure below shows the samples generated by SIR as histograms, with lines showing the samples generated by Stan (they agree pretty well). The fourth plot, in the bottom right panel, shows the data and the fitted regression line.

[Figure: posterior samples for sigma, beta0 and beta1 drawn by SIR (histograms) overlaid with the Stan MCMC samples (lines); the bottom right panel shows the data and the fitted regression line.]
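
For reference, a minimal sketch of the sort of Stan model (fitted via the rstan package) that could be used to draw the MCMC samples shown above; the actual model in the repository may differ, and the object names here are illustrative (flat priors are left implicit, matching mylogpost):

library(rstan)
stanCode = "
data {
  int<lower=1> N;
  vector[N] x; // income growth
  vector[N] y; // vote share
}
parameters {
  real beta0;
  real beta1;
  real<lower=0> sigma;
}
model {
  // no explicit priors, i.e. flat priors
  y ~ normal(beta0 + beta1 * x, sigma);
}
"
fit.stan = stan(model_code = stanCode,
                data = list(N = nrow(dfData), x = dfData$IncomeGrowth, y = dfData$VoteShare),
                chains = 4, iter = 2000)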

## first write the log predictive density function
lpd = function(beta0, beta1, sig){
  sigma = exp(sig)
  x = lData$pred
  y = lData$resp
  # get the predicted or fitted value 
  mModMatrix = model.matrix(y ~ x)
  mCoef = matrix(c(beta0, beta1), nrow = 2)
  iFitted = mModMatrix %*% mCoef # using identity link
  ## likelihood function with posterior theta
  return(sum(dnorm(y, iFitted, sigma, log=T)))
}

The scoring function for the model check is the log predictive density, which is essentially the log-likelihood function evaluated at the posterior (fitted) parameter values.

“In the d-dimensional normal distribution, the logarithm of the density function is a constant plus a \chi_{d}^{2} distribution divided by -2” [1]. The posterior distribution of the log predictive density has a maximum of ~ -40.3 and a mean of ~ -41.9, a difference of about 1.66, which is close to the d/2 = 3/2 = 1.5 predicted from theory, since d (the number of parameters being estimated) is 3 here. [See page 171 of Ref 1 for details.]
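
These numbers can be reproduced from the posterior sample drawn earlier; a small sketch (lpdSample is a name introduced here, and the exact values vary from run to run):

## posterior distribution of the log predictive density
lpdSample = sapply(1:1000, function(x) lpd(beta0Sample[x], beta1Sample[x], sigSample[x]))
max(lpdSample)                   # ~ -40.3
mean(lpdSample)                  # ~ -41.9
max(lpdSample) - mean(lpdSample) # ~ 1.66, close to d/2 = 1.5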

Predictive Scores

We show three predictive scores, each with an adjustment for the number of parameters being estimated, plus leave-one-out cross-validation. The details of these scores can be found in References [1] and [3], and elsewhere in the literature. Lower values imply higher predictive accuracy.

Akaike Information Criterion (AIC)

This is the log predictive density evaluated at the maximum likelihood estimate, adjusted for the number of parameters k (3 in our case) and converted to the deviance scale, i.e. AIC = -2 × (lpd - k).

> iAIC = (lpd(fit$mode['beta0'], fit$mode['beta1'], fit$mode['sigma']) - 3) * -2
> AIC(fit.lm)
[1] 86.59985
> iAIC
[1] 86.59986

The AIC calculated using the lpd function and the AIC returned by R's AIC function on the linear model fit object are essentially identical.

Deviance Information Criterion (DIC)

This is a somewhat Bayesian version of the AIC: it replaces the maximum likelihood estimate of θ with a Bayesian point estimate (the posterior mean in [1]; here we plug in the posterior mode from the Laplace approximation) and replaces the parameter count with a data-based bias correction, the effective number of parameters pDIC.

> ## pDIC are the effective number of parameters
> ## 2 * [lpd(Expectation(theta)) - Expectation(lpd(Sample of thetas from posterior))]
> # calculate E(lpd(theta))
> eLPD = mean(sapply(1:1000, function(x) lpd(beta0Sample[x], beta1Sample[x], sigSample[x])))
> # calculate lpd(E(theta)) and pDIC
> pDIC = 2 *(lpd(fit$mode['beta0'], fit$mode['beta1'], fit$mode['sigma']) - eLPD)
> iDIC = (lpd(fit$mode['beta0'], fit$mode['beta1'], fit$mode['sigma']) - pDIC) * -2
> pDIC
[1] 3.424087
> iDIC
[1] 87.44804

The effective number of parameters will change slightly depending on the simulation size, but the two scores (AIC and DIC) are pretty close.

Watanabe-Akaike or widely available information criterion (WAIC)

This is a more fully Bayesian approach and can be seen as an approximation to cross-validation. We show one version of it here. First we calculate the log pointwise predictive density, which differs slightly from the log predictive density calculated earlier: the likelihood is evaluated one data point at a time and averaged over the posterior draws before taking the log.

## log pointwise predictive density
lppd = function(beta0, beta1, sig, index){
  sigma = exp(sig)
  x = lData$pred[index]
  y = lData$resp[index]
  # get the predicted or fitted value 
  mModMatrix = model.matrix(y ~ x)
  mCoef = matrix(c(beta0, beta1), nrow = 2, byrow = T)
  iFitted = mModMatrix %*% mCoef # using identity link
  ## average the likelihood over the posterior draws of theta (log taken later)
  return(mean(dnorm(y, iFitted, sigma, log=F)))
}
# log pointwise predictive probability of the observed data under the fitted
# model is
ilppd = sum(log(sapply(1:15, function(x) lppd(beta0Sample, beta1Sample, sigSample, x))))

The lppd function above is slightly different from the lpd function: it works on one data point at a time (see the index argument) and averages the likelihood over the posterior draws, with the log taken afterwards in the sum.

> ## effective numbers of parameters pWAIC1
> pWAIC1 = 2 * (ilppd - eLPD)
> 
> iWAIC = -2 * (ilppd - pWAIC1)
> pWAIC1
[1] 2.245018
> iWAIC
[1] 86.26897

The effective number of parameters is about 2.2, instead of 3. This is because the effective number of parameters in a model is also a function of the data and can be considered a random variable; hence, in more complex models it is not straightforward to simply count the parameters.
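
For completeness, Ref [1] also describes a second variant of this correction, pWAIC2, based on the posterior variance of the pointwise log-likelihood; a sketch using the same posterior samples (the object names here are introduced for illustration, and the fitted values are computed directly rather than via model.matrix):

## pWAIC2 = sum over data points of the posterior variance of log p(y_i | theta)
mPointLlik = sapply(1:15, function(i){
  dnorm(lData$resp[i],
        mean = beta0Sample + beta1Sample * lData$pred[i],
        sd = exp(sigSample), log = TRUE)
}) # 1000 posterior draws x 15 data points
pWAIC2 = sum(apply(mPointLlik, 2, var))
iWAIC2 = -2 * (ilppd - pWAIC2)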

Leave-one-out cross-validation

Cross-validation can be computationally expensive and awkward in structured models, but the idea is simple: the data are partitioned into training and test sets, the parameters are estimated on the training set, and the fit is evaluated on the test set. Here we use leave-one-out cross-validation, refitting the model 15 times, each time holding out one data point.

## leave one out cross validation
# we need to fit the model 15 times each time removing one data point
lFits = lapply(1:15, function(x){
  start = c('sigma' = log(sd(dfData$VoteShare[-x])), 'beta0'=0, 'beta1'=0)
  lData = list(pred=dfData$IncomeGrowth[-x], resp=dfData$VoteShare[-x])
  laplace(mylogpost, start, lData)
})

# lets take samples for posterior theta
lSamples = lapply(1:15, function(x){
  fit = lFits[[x]]
  tpar = list(m=fit$mode, var=fit$var*2, df=4)
  ## get a sample directly and using sir (sampling importance resampling with a t proposal density)
  lData = list(pred=dfData$IncomeGrowth[-x], resp=dfData$VoteShare[-x])
  s = sir(mylogpost, tpar, 1000, lData)
  return(s)
})

## calculate lppd on the hold out set
iLppd = sapply(1:15, function(x){
  s = lSamples[[x]]
  log(lppd(s[,'beta0'], s[,'beta1'], s[,'sigma'], x))
})
iLppd = sum(iLppd)
# the lppd on the full data set (ilppd) was calculated earlier
# the effective number of parameters is the difference between the two
iParamCV = ilppd - iLppd
# "Given that this model includes two linear coefficients and a variance parameter, these
# all look like reasonable estimates of effective number of parameters." [Ref 1]
# on the deviance scale the cross-validation estimate is
iCV = -2 * iLppd
# and after correcting for the effective number of parameters
iCV = -2 * (iLppd - iParamCV)
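
Finally, as a convenience step added here (not part of the original script), the four scores can be collected on the deviance scale for a side-by-side look:

## AIC, DIC, WAIC and the corrected leave-one-out CV score
data.frame(AIC = iAIC, DIC = iDIC, WAIC = iWAIC, LOO.CV = iCV)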

Generally, predictive accuracy measures should be used alongside posterior predictive checks, starting with simpler models and expanding as needed. They can also be applied to comparing nested models, where the full model includes all the potentially meaningful parameters and restricted models set some of them to zero or force them to be equal. In that case the extra complexity of the full model, and the additional difficulty of fitting and interpreting it, should be justified by the improvement in fit.
