Parameter Estimation

In statistical or mathematical models our aim is to look at the data and estimate the parameters, along with the uncertainty of those estimates. Generally speaking, given a data set, we wish to choose a likelihood/noise/sampling distribution that fits the data. A distribution requires some parameters, Θ; e.g. a normal distribution (a very common error/noise distribution in practice) has two parameters, the mean (µ) and the standard deviation (σ). In a Bayesian setting we may put some constraints on these parameters, called priors, and eventually the model will (hopefully) converge to a region of this parameter space, driven by the constraints from the priors and the data. For single and two parameter models, the model parameters and their uncertainty can be estimated, under the constraints of the data and the priors, using a variety of methods. I try to show a few of these methods along with R code.

The examples and methodological details, including equations can be seen in

[1] Albert, J. (2009). Bayesian Computation with R (2nd ed.). Springer. http://doi.org/10.1007/978-0-387-92298-0

[2] Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2008). Bayesian Data Analysis. Chapman & Hall/CRC Texts in Statistical Science.

The R code for these examples can be found in the accompanying GitHub repository.

The data used here are marathon running times, taken from Ref [1], and can be found in the LearnBayes package.
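To make the later code chunks self-contained, the data can be loaded as below; this is a minimal sketch, assuming the marathontimes data set from LearnBayes with its time column of running times.

## load the marathon running times from the LearnBayes package
library(LearnBayes)
data(marathontimes)
# vector of running times
time = marathontimes$time
# keep the data in a list, as the functions below expect a list with a 'time' element
lData = list('time'=time)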

Analytical Estimation

We first use an analytical approach to estimate the parameters and add some simulation to estimate the uncertainty. Details and equations can be seen around pages 62-68 of Ref [2].

### first we will use a conjugate analysis and calculate the parameters
### for the data analytically as shown in Gelman 2008 - see page 62 
### also see page 64 of bayesian computation with R
## We are not using any prior information, but to make this function more general 
## I am adding some prior information with 0s
## look at page 68 of Bayesian Data Analysis (Gelman) for formula
sim.post = function(dat.grp){
  # prior variance
  sigma.0 = 0
  # prior observations
  k.0 = 0
  # prior degrees of freedom
  v.0 = k.0 - 1
  # prior mean
  mu.0 = 0
  # calculate conjugate posterior
  # number of observations in the data
  n = length(dat.grp)
  # prior observations + data points = posterior number of observations
  k.n = k.0 + n
  # posterior degrees of freedom
  v.n = v.0 + n
  # mean and sd for the data
  y.bar = mean(dat.grp)
  s = sd(dat.grp)
  # posterior mean
  mu.n = (k.0/k.n * mu.0) + (n/k.n * y.bar)
  # posterior var
  sigma.n = (( v.0*sigma.0 ) + ( (n-1)*(s^2) ) + ( (k.0*n/k.n)*((y.bar-mu.0)^2) )) / v.n
  ## simulate posterior variance and conditional posterior mean
  sigma = (sigma.n * v.n)/rchisq(1000, v.n)
  mu = rnorm(1000, mu.n, sqrt(sigma)/sqrt(k.n))
  return(list(mu=mu, var=sigma))
}

We can call this function to simulate the posterior mean and variance that fit within the parameter space under the constraints of the data and priors.
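As a usage example (a sketch, assuming the time vector loaded earlier), we pass the running times to the function and summarise the simulated posterior draws:

## simulate the joint posterior for the marathon times
post = sim.post(time)
# posterior mean and its uncertainty
mean(post$mu); quantile(post$mu, c(0.025, 0.975))
# posterior variance and its uncertainty
mean(post$var); quantile(post$var, c(0.025, 0.975))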

Grid Approximation

Another useful way to estimate the parameters, when analytical formulas are not available or practical, is to use a grid-based approximation. For a grid approximation the trick is first to define a function that returns a single value and accepts the parameters and data that need tracking. The simplest such function is the likelihood; however, we follow the strategy from [1] and [2] and use the log posterior to define the function.

## define a log posterior function
lp = function(theta, data){
  s = theta[2] # variance parameter
  m = theta[1] # mean parameter
  d = data$time # data from marathon times
  log.lik = sum(dnorm(d, m, sqrt(s), log=T))
  # flat prior, contributes only a constant on the log scale
  log.prior = 0
  log.post = log.lik + log.prior
  return(log.post)
}

The function accepts two arguments: theta, a vector of the parameters that need tracking/calculating, and the data. Together the data and theta define a parameter space within which we can ‘move’ and estimate the constrained/fitted theta. The other thing you may notice is that we are doing everything on the log scale, so this function should return the log posterior, or the log likelihood if you do not use any priors.

## define a discrete grid of parameters
Mu = seq(250, 302, by=0.5)
Sigma2 = seq(500, 9000, by=10)

We define the grid with a certain spacing, and this spacing can be adjusted depending on how much accuracy we want in our estimates. Once the grid is defined, one can use a double loop, or a shortcut like the outer command, to create a matrix where each cell contains the value of the function at that combination of grid points.

## calculate the log posterior at each grid point 
lm = matrix(NA, nrow=length(Mu), ncol = length(Sigma2))
for (r in seq_along(Mu)){
  for (c in seq_along(Sigma2)){
    s = c('mu'=Mu[r], 'sigma2'=Sigma2[c])
    lm[r, c] = lp(s, lData)
  }
}

# convert from log posterior 
# subtract max before exponentiating
lm = lm-max(lm)
lm = exp(lm)
# set total prob of grid to 1
lm = lm/sum(colSums(lm))

lm is the grid matrix filled inside this double loop. Once the matrix is filled, we follow the strategies from [1] and [2]: subtract off the maximum before exponentiating (converting back from the log scale), and then normalise so the total grid probability sums to 1.
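As mentioned earlier, the same grid can also be filled without the explicit double loop. A minimal sketch using outer, with the same lp and lData as above (stored under a different name so it does not overwrite the normalised lm):

## alternative to the double loop: fill the (unnormalised) log posterior grid in one call
## Vectorize wraps lp so it is evaluated element-wise over every (mu, sigma2) pair
lm.alt = outer(Mu, Sigma2, Vectorize(function(m, s2) lp(c('mu'=m, 'sigma2'=s2), lData)))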

Samples can be generated from this matrix as shown in the code below.

## take samples from this distribution
pSigma2 = colSums(lm)
pMu = rowSums(lm)

## add a uniform jitter, centred at zero with width equal to the grid spacing, to make the distribution continuous
sig2Sample = runif(1000, -1*10/2, 10/2) +  sample(size = 1000, Sigma2, replace = T, prob = pSigma2)
# for each sampled value of sig2Sample, get the posterior mu, conditioned on sig2Sample
muSample = rnorm(1000, mean(time), sqrt(sig2Sample)/sqrt(length(time)))

Optimization based approach

This time we explore the surface of the function using an optimizer, and calculate the variation using the Hessian. As usual, the technical details can be found in reference [1].

Start by defining the log posterior function like before; the main difference this time is that the standard deviation parameter is rescaled onto the log scale.

lp2 = function(theta, data){
  # we define sigma on the log scale as optimizers work better
  # if scale parameters are well behaved
  s = exp(theta[2])
  m = theta[1]
  d = data$time # data from marathon times
  log.lik = sum(dnorm(d, m, s, log=T))
  # flat prior, contributes only a constant on the log scale
  log.prior = 0
  log.post = log.lik + log.prior
  return(log.post)
}

# sanity check for function
# choose a starting value
start = c('mu'=mean(time), 'sigma'=log(sd(time)))
lData = list('time'=time)
lp2(start, lData)

## try the laplace function from LearnBayes
fit = laplace(lp2, start, lData)
fit
se = sqrt(diag(fit$var))

## lets take a sample from this normal approximation
# draws of log(sigma) from the approximate posterior
sigSample.op = rnorm(1000, fit$mode['sigma'], se['sigma'])
# conditional draws of mu given each sampled sigma
muSample.op = rnorm(1000, mean(time), exp(sigSample.op)/sqrt(length(time)))

I prefer the optimization-based approach for multiparameter problems, and it is simpler to implement as long as you can provide sensible starting values. There are a couple of other methods I will cover at some point, such as rejection sampling and importance sampling.

MCMC using RStan and Stan

The final method I will cover briefly is doing this in Stan, using an MCMC-based approach. The model is very similar to the one defined in the functions earlier, but is written in the Stan modelling language syntax.

data {
    int<lower=1> Ntotal; // number of observations
    real y[Ntotal]; // observed data
}
parameters {
    real mu; // mean of the normal likelihood
    real<lower=0> sig; // standard deviation of the normal likelihood
}
model {
    // no explicit priors, so mu and sig get flat (improper) priors over their support
    y ~ normal(mu, sig);
}

I call this model from R using the rstan library, and extract the simulated posterior draws from the fitted Stan object.
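A minimal sketch of that call is shown below; it assumes the Stan model above is saved in a file called normalModel.stan (a hypothetical file name) and that the time vector is in the workspace:

library(rstan)
# data list matching the names in the data block of the Stan model
lStanData = list(Ntotal=length(time), y=time)
# compile the model and run the MCMC sampler
fit.stan = stan(file='normalModel.stan', data=lStanData, chains=4, iter=2000)
# extract the simulated draws of mu and sig
lDraws = extract(fit.stan)
muSample.stan = lDraws$mu
sigSample.stan = lDraws$sig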

Comparing All the Estimates

I plot a contour plot of the log posterior function and scatter the simulated data points from each method over it; they all agree pretty well. I also show two density plots to compare the estimates of the posterior means and variances and their uncertainty.
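A rough sketch of how such a comparison can be produced is shown below; it assumes the grid objects (Mu, Sigma2, lm) and the sampled draws from the previous sections are in the workspace, and converts the optimization-based draws of log(sigma) back to the variance scale:

## contour of the normalised grid posterior, with simulated draws overlaid
contour(Mu, Sigma2, lm, xlab='mu', ylab='sigma2')
points(muSample, sig2Sample, pch=20, col='blue')              # grid approximation
points(muSample.op, exp(sigSample.op)^2, pch=20, col='red')   # laplace approximation
points(muSample.stan, sigSample.stan^2, pch=20, col='green')  # stan / MCMC

## density plots for the posterior mean from two of the methods
plot(density(muSample), main='Posterior mean')
lines(density(muSample.op), col='red')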

[Figure: paramEstimationContour - contour plot of the log posterior with the simulated draws from each method overlaid]

[Figure: paramEstimationDensity - density plots comparing the posterior mean and variance estimates from each method]
