The following text and code snippets show examples from two books on Bayesian Data Analysis:

[1] Kruschke, J. K. (2014). Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan (2nd ed.). http://doi.org/10.1016/B978-0-12-405888-0.09999-2

[2] Albert, J. (2009). Bayesian computation with R. http://doi.org/10.1007/978-0-387-92298-0

The examples I refer to here can be seen on page 50 of ref [2] (Mixtures of conjugate priors) and Chapter 10 of ref [1] (Model Comparison). I try to connect the two ideas presented there, where two independent models are either modelled together as a mixture distribution, or modelled separately and then compared.

**Problem:**

We can imagine this problem as two machines or two coins, each with a parameter theta (θ): in the case of coins, the proportion of heads. Coin A is biased to produce more tails, and Coin B is biased to produce more heads. We are presented with some data, i.e. a number of heads and a number of tails.

- Which coin generated this data?
- What is the distribution of the parameter theta before and after observing the data?

The R code for this analysis and the figures can be found here.

Following the example from ref [2], let's define the two distributions:

```r
### define two models that explain the data

## model m1
## it is a beta distribution with a weight on the left side
g1 = function(theta) dbeta(theta, 6, 14)

## model m2
## it is a beta distribution with a weight on the right side
g2 = function(theta) dbeta(theta, 14, 6)
```

We assign a weight to each model, i.e. a prior probability for each model, and this can be used to generate the mixture distribution.

```r
## define an array that represents the number of models in our parameter space
## each index has a prior weight/probability of being selected
## this can be thought of as coming from a categorical distribution
mix.prior = c(50, 50)
mix.prior = mix.prior/sum(mix.prior)

## define a joint prior space for these 2 models
## (mix.prior[1] AND g1) OR (mix.prior[2] AND g2)
g.mix = function(theta) mix.prior[1]*g1(theta) + mix.prior[2]*g2(theta)
```

This function represents the prior distribution or prior beliefs about how the parameter theta for each coin is distributed.
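As a quick sanity check, the mixture should itself be a proper density: each component integrates to 1 and the weights sum to 1, so the weighted sum must also integrate to 1. A minimal, self-contained sketch in base R (it redefines the functions from above so it runs on its own):

```r
## check that the 50/50 mixture of the two Beta priors integrates to 1
g1 = function(theta) dbeta(theta, 6, 14)
g2 = function(theta) dbeta(theta, 14, 6)
mix.prior = c(0.5, 0.5)
g.mix = function(theta) mix.prior[1]*g1(theta) + mix.prior[2]*g2(theta)
area = integrate(g.mix, 0, 1)$value   # numeric integral over [0, 1]
area   # approx. 1
```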

The observed data shows 7 heads and 3 tails. We now define the sampling distribution/likelihood functions and posterior distributions for each model. The posterior distribution represents how much parameter theta changes for each model after observing the data. I have used the word conjugate here, which I will explain at a later time – for the time being think of it as a simple analytical way to calculate the posterior.
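As a quick check on what "conjugate" buys us here: a Beta(a, b) prior combined with a binomial likelihood of s successes and f failures gives a Beta(a+s, b+f) posterior. The sketch below (base R only, using the model-1 prior from above) compares the analytical posterior with a numerically normalised likelihood × prior on a grid:

```r
## sanity check of the conjugate Beta-Binomial update:
## Beta(a, b) prior + (s successes, f failures) -> Beta(a+s, b+f) posterior
a = 6; b = 14          # prior parameters for model m1
s = 7; f = 3           # observed data
th = seq(0.001, 0.999, by=0.001)
## unnormalised posterior: likelihood x prior
unnorm = dbinom(s, s+f, th) * dbeta(th, a, b)
## normalise numerically on the grid (cell width 0.001)
post.numeric = unnorm / sum(unnorm * 0.001)
## analytical conjugate posterior
post.analytic = dbeta(th, a+s, b+f)
## the two curves should agree closely
max(abs(post.numeric - post.analytic))
```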

```r
## we flip a coin, i.e. perform a binary experiment
## our prior beliefs are that the experiment tends to follow two extreme models
## we will either see more 1s or more 0s, depending on which model (m1 or m2) is
## more representative of the data
data = c(success=7, fail=3)

## the posterior is proportional to Likelihood * Prior
## P(Data | Theta, Model) X P(Theta, Model)
## define the likelihood function; both models share the same functional form
lik = function(data, theta) dbinom(data['success'], sum(data), theta)

## since this is a conjugate analysis (the prior is beta distributed and the
## likelihood is binomial), the posterior can be derived analytically
## using a similar analogy as before when describing priors for each model,
## we can define a posterior for each of the 2 models
g1.post = function(theta, data) dbeta(theta, 6+data['success'], 14+data['fail'])
g2.post = function(theta, data) dbeta(theta, 14+data['success'], 6+data['fail'])
```

We have assigned a prior mixture probability in the variable *mix.prior*; however, we also need to calculate the posterior mixing probability, which requires some algebra, and I have tried to show that in the code below. For details, read the first few pages of Chapter 10 in ref [1].

In order to calculate the posterior mixture probability, i.e. the probability of each model (m1 or m2) after we see the data:

P(Model[1], Data) = P(Data, Model[1])

P(Model[1] | Data) X P(Data) = P(Data | Model[1]) X P(Model[1])

**P(Model[1] | Data) = P(Data | Model[1]) X P(Model[1]) / P(Data) … Equation (1)**

where P(Data) can be expanded by summing over all models:

P(Data) = Sum over i of P(Data | Model[i]) X P(Model[i])

We need to calculate a few things:

P(Data | Model[i]) – this is the prior predictive distribution for the data given the selected model. So for each model we solve this equation:

P(Data, Theta) = P(Data | Theta) X P(Theta)

P(Data) = P(Data | Theta) X P(Theta) / P(Theta | Data)

In words this means: Prior predictive distribution for Data = Likelihood X Prior / Posterior
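Because both the prior and the posterior are Beta densities, P(Data | Model) also has a closed form, the beta-binomial: choose(n, s) × B(a+s, b+f) / B(a, b), where B is the beta function. A sketch checking this against the Likelihood × Prior / Posterior identity for the model-1 prior Beta(6, 14); since the identity holds for every theta, evaluating it at one arbitrary point (0.3 here) must give the same number:

```r
## closed-form prior predictive (beta-binomial) vs the identity
## P(Data) = Likelihood x Prior / Posterior, for model m1
a = 6; b = 14           # prior parameters for model m1
s = 7; f = 3; n = s + f # observed data
## closed form: choose(n, s) * B(a+s, b+f) / B(a, b)
pred.analytic = choose(n, s) * beta(a+s, b+f) / beta(a, b)
## the identity evaluated at an arbitrary theta must give the same number
theta = 0.3
pred.identity = dbinom(s, n, theta) * dbeta(theta, a, b) / dbeta(theta, a+s, b+f)
c(pred.analytic, pred.identity)
```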

```r
## Prior predictive probability for Data = Likelihood X Prior / Posterior
## for model 1
data.prior.g1 = function(data, theta){
  ret = lik(data, theta) * g1(theta) / g1.post(theta, data)
  return(ret)
}
## for model 2
data.prior.g2 = function(data, theta){
  ret = lik(data, theta) * g2(theta) / g2.post(theta, data)
  return(ret)
}
## P(Data | Model) for each model should be the same for any value of theta
## you can use that as a sanity check
th = seq(0.01, 0.99, by=0.01)
data.g1 = mean(data.prior.g1(data, th))
data.g2 = mean(data.prior.g2(data, th))
```

We now have everything needed to solve Equation (1).

```r
## P(Data) = (P(Data | Model[1]) X P(Model[1])) + (P(Data | Model[2]) X P(Model[2]))
## we can calculate the posterior for Model 1
## P(Model[1] | Data) = P(Data | Model[1]) X P(Model[1]) / P(Data)
mix.post = data.g1 * mix.prior[1] / (data.g1 * mix.prior[1] + data.g2 * mix.prior[2])
mix.post = c(mix.post, 1-mix.post)

## (mix.post[1] AND g1.post) OR (mix.post[2] AND g2.post)
g.mix.post = function(theta, data) mix.post[1]*g1.post(theta, data) + mix.post[2]*g2.post(theta, data)
```

The figures below show the posterior distribution of theta for each model of the data, i.e. the head-biased and tail-biased coins.

You can see that the posterior distribution of theta for model one has shifted more to the right (towards higher values of theta), as the data has influenced this shift. The mixture distribution has also shifted most of its weight towards the right with a very slight bump around 0.4.

We can now approximate this distribution of theta on a grid, and take a random sample from this distribution.

```r
## approximate the posterior for theta on the grid
p.th = g.mix.post(th, data)
p.th = p.th/sum(p.th)
th.sam = sample(th, 10000, replace = T, prob=p.th)
## add a small uniform jitter to smooth out the discrete grid values
th.sam = th.sam + runif(10000, 0.01/2 * -1, 0.01/2)
summary(th.sam)
##   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## 0.2041  0.6224  0.6922  0.6748  0.7512  0.9342
```

I will add another post on grid approximations at a later point. For the time being, this is a useful method to generate random samples from a distribution and this sample can be used to calculate various statistics or parameters of the distribution.
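To see that the grid-sampling recipe above recovers the right answer, here is a minimal toy version applied to a distribution whose mean is known analytically, a Beta(14, 6) with mean 14/20 = 0.7 (base R only; the seed is arbitrary):

```r
## grid approximation of a known distribution as a sanity check
set.seed(123)
th = seq(0.01, 0.99, by=0.01)
p.th = dbeta(th, 14, 6)          # density evaluated on the grid
p.th = p.th / sum(p.th)          # normalise to grid probabilities
sam = sample(th, 10000, replace=TRUE, prob=p.th)
sam = sam + runif(10000, -0.01/2, 0.01/2)  # jitter within grid cells
mean(sam)                        # should be close to 14/(14+6) = 0.7
```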

We will also approximate this using another technique: we first define a function that takes one or more parameters (defining the parameter space) and the data as input, and returns a single value, the log posterior at that position in the parameter space. The code below defines two such functions, one for each model, and uses an optimiser to explore the surface of each function and find its maximum.

```r
library(LearnBayes)
library(car)

logit.inv = function(p) { exp(p)/(exp(p)+1) }

mylogpost_m1 = function(theta, data){
  ## theta contains parameters we wish to track
  th = logit.inv(theta['theta'])
  suc = data['success']
  fail = data['fail']
  # define likelihood function
  lf = function(s, f, t) return(dbinom(s, s+f, t, log=T))
  # calculate log posterior
  val = lf(suc, fail, th) + dbeta(th, 6, 14, log = T)
  return(val)
}

mylogpost_m2 = function(theta, data){
  ## theta contains parameters we wish to track
  th = logit.inv(theta['theta'])
  suc = data['success']
  fail = data['fail']
  # define likelihood function
  lf = function(s, f, t) return(dbinom(s, s+f, t, log=T))
  # calculate log posterior
  val = lf(suc, fail, th) + dbeta(th, 14, 6, log = T)
  return(val)
}

# choose sensible starting values for the optimiser
start = c(theta=logit(0.5))
data = c(success=7, fail=3)

fit_m1 = laplace(mylogpost_m1, start, data)
fit_m2 = laplace(mylogpost_m2, start, data)

logit.inv(fit_m1$mode)
##    theta
## 0.428616
logit.inv(fit_m2$mode)
##     theta
## 0.7142694
```

The values of theta at the maximum of each function are close to those obtained using grid approximation.
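If the LearnBayes and car packages are not available, the same mode can be found with base R's optim. Below is a sketch for the model-2 case only, maximising the same log posterior on the logit scale (logit and logit.inv are defined inline so the snippet is self-contained); the result should match fit_m2 above:

```r
## base-R alternative to LearnBayes::laplace for finding the posterior mode
logit = function(p) log(p/(1-p))
logit.inv = function(p) exp(p)/(exp(p)+1)
data = c(success=7, fail=3)
logpost_m2 = function(theta, data){
  th = logit.inv(theta)
  dbinom(data['success'], sum(data), th, log=TRUE) + dbeta(th, 14, 6, log=TRUE)
}
## optim minimises by default, so set fnscale=-1 to maximise
fit = optim(logit(0.5), logpost_m2, data=data, method='BFGS',
            control=list(fnscale=-1))
logit.inv(fit$par)   # posterior mode, approx. 0.714
```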

The figure above shows the histogram of the random sample generated using grid approximation, the line represents the value of the mixture posterior *g.mix.post*, and the red and green lines represent the maximum points for the 2 models calculated using an optimiser.

## Bayes Factor:

Simply put, the Bayes Factor (BF) indicates how much the prior odds of the two models change after seeing the data, i.e. how much evidence there is for model 1 versus model 2. As usual, the technical details can be found in references [1] and [2].

```r
## Bayes factor from the ratio of the prior predictive distributions
## of the 2 models
## P(Data | Model[1]) / P(Data | Model[2])
BF = data.g1 / data.g2

## OR posterior odds for the models / prior odds for the 2 models
mix.post[1]/mix.post[2] / (mix.prior[1]/mix.prior[2])
```

The BF (model 1 vs model 2) calculated analytically is 0.102. The general convention for a discrete decision about the models is that there is significant evidence for model 1 when BF > 3, and for model 2 when BF < 1/3 (i.e. 0.33). In our case the BF is about 0.1, which is below 1/3 and shows significant support for model 2, suggesting that the data was generated from model 2 rather than model 1.
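The binomial coefficient choose(n, s) appears in both prior predictive probabilities and cancels in their ratio, so the BF can also be written directly in terms of beta functions. A self-contained base-R sketch, which reproduces the 0.102 figure:

```r
## analytic Bayes factor from the two beta-binomial marginal likelihoods
## the choose(n, s) term is common to both models and cancels in the ratio
s = 7; f = 3
m1 = beta(6+s, 14+f) / beta(6, 14)    # P(Data | Model 1), up to choose(n, s)
m2 = beta(14+s, 6+f) / beta(14, 6)    # P(Data | Model 2), up to choose(n, s)
BF = m1 / m2
round(BF, 3)   # approx. 0.102
```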

The BF can also be approximated using the optimisation approach shown earlier, where the function *laplace* returns the log of P(Data | Model) in the slot $int.

```r
round(exp(fit_m1$int - fit_m2$int), 2)
## BF
## 0.09
```

This is very close to the BF of 0.1 calculated analytically. Both analyses suggest that the data was generated by Coin or Machine B.