I wrote a post a while back about Mixture Distributions and Model Comparisons. This post continues that theme and tries to model multiple data-generating processes with a single model. The code for this post is available at the github repository. Many resources helped me understand this model; some are mentioned below, and others are mentioned in the previous post on this topic:

- Model Label Degeneracy issues.
- A couple of forum posts with examples here and here.
- And a couple of blogs here and here.
- Stan user manual.

I use the R library flexmix for the sample data and for fitting the model, in addition to the Stan/MCMC approach. The data distribution is shown in the figure below, which suggests that there may be two data-generating processes at work. To start with, we try to find a distribution that will capture certain aspects of this data, and use posterior predictive checks to see whether our choice of distribution is appropriate. A previous post on posterior predictive checks covers this topic in more detail and tests different distributions. To keep it short, I just try a normal mixture distribution with 2 mixture components and then perform some tests.
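To make the idea concrete, here is a small Python sketch of sampling from a two-component normal mixture: first draw a component label according to the mixing weights, then draw from that component's normal distribution. The parameter values are illustrative only, not the values fitted below.

```python
import random

def sample_normal_mixture(n, weights, means, sds, seed=42):
    """Draw n values from a normal mixture: pick a component by its
    mixing weight, then sample from that component's normal."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        k = rng.choices(range(len(weights)), weights=weights)[0]
        draws.append(rng.gauss(means[k], sds[k]))
    return draws

# Illustrative parameters only (roughly in the spirit of the data below).
data = sample_normal_mixture(1000, weights=[0.44, 0.56],
                             means=[0.0, 32.0], sds=[3.4, 7.8])
```

A density plot of `data` would show the same kind of two-bump shape as the raw data in the figure.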

The left panel of the figure shows the density plot for the raw data, and the right panel shows the same data with a smattering of 200 predictive data sets drawn from the fitted distribution. Visually the fit appears to be fine, with the left component slightly higher in some cases. The aspects of the data that I want our distribution to capture are the mean, variance, maximum and minimum values, and the p-values for these test quantities suggest that any variation can be explained by sampling variability.

```
> ivTestQuantities
variance  minimum  maximum     mean
   0.380    0.155    0.300    0.420
```
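These are posterior predictive p-values: for each replicated data set we compute the test quantity, then report the proportion of replicates whose quantity is at least as large as the observed one. A minimal Python sketch of that calculation (the replicates here are simulated stand-ins, not the actual draws from the fitted mixture):

```python
import random

def ppc_pvalue(observed, replicates, stat):
    """Proportion of replicated data sets whose test statistic is
    greater than or equal to the statistic of the observed data."""
    t_obs = stat(observed)
    return sum(stat(rep) >= t_obs for rep in replicates) / len(replicates)

rng = random.Random(1)
observed = [rng.gauss(0, 1) for _ in range(200)]
# Stand-in replicates; in the post these come from the fitted mixture.
replicates = [[rng.gauss(0, 1) for _ in range(200)] for _ in range(200)]

def mean(xs):
    return sum(xs) / len(xs)

p_mean = ppc_pvalue(observed, replicates, mean)
p_max = ppc_pvalue(observed, replicates, max)
```

A p-value near 0 or 1 would indicate that the fitted distribution systematically fails to reproduce that aspect of the data.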

The first step is to fit a mixture regression model with 2 components and one predictor using the flexmix package, and then to repeat the fit using Stan.

```
> fit.flex = flexmix(yn ~ x, data=NPreg, k=2)
> summary(fit.flex)

Call:
flexmix(formula = yn ~ x, data = NPreg, k = 2)

       prior size post>0 ratio
Comp.1 0.561  104    198 0.525
Comp.2 0.439   96    146 0.658

'log Lik.' -704.0635 (df=7)
AIC: 1422.127   BIC: 1445.215

> ## fitted coefficients
> parameters(fit.flex)
                     Comp.1     Comp.2
coef.(Intercept) 31.7644433 -0.7685309
coef.x            0.1867837  5.1841287
sigma             7.7718438  3.3828795
```

The same model is now broken down into parts and fit using Stan. You can also write a log-posterior function in R to do this, but I find that the optimiser tends not to explore the parameter space if we estimate the mixture weights as well; if we keep the mixture weights fixed, it tends to converge to the same point as Stan.
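The core of the Stan model is the marginal likelihood of the mixture: for each observation, the component log-densities are combined with log-sum-exp (Stan's `log_mix` does this for the two-component case). A Python sketch of that calculation, with parameter values that are illustrative rather than the exact fitted ones:

```python
import math

def normal_lpdf(y, mu, sigma):
    """Log density of a normal distribution."""
    return (-0.5 * ((y - mu) / sigma) ** 2
            - math.log(sigma * math.sqrt(2 * math.pi)))

def mixture_loglik(y, x, weights, intercepts, slopes, sigmas):
    """Log-likelihood of a 2-component normal mixture regression.
    Each observation's density is sum_k w_k * N(y | a_k + b_k * x, s_k),
    accumulated on the log scale with log-sum-exp for stability."""
    total = 0.0
    for yi, xi in zip(y, x):
        lps = [math.log(w) + normal_lpdf(yi, a + b * xi, s)
               for w, a, b, s in zip(weights, intercepts, slopes, sigmas)]
        m = max(lps)
        total += m + math.log(sum(math.exp(lp - m) for lp in lps))
    return total

# Two toy observations; parameters roughly in the spirit of the fit.
ll = mixture_loglik([1.0, 33.0], [0.5, 2.0],
                    weights=[0.44, 0.56],
                    intercepts=[-0.69, 31.85],
                    slopes=[5.16, 0.18],
                    sigmas=[3.4, 7.85])
```

Summing on the log scale with the max subtracted out avoids underflow when one component's density is vanishingly small for a given observation.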

```
> print(fit.stan, digi=3)
Inference for Stan model: fitNormalMixtureRegression.
1 chains, each with iter=4000; warmup=2000; thin=1;
post-warmup draws per chain=2000, total post-warmup draws=2000.

                   mean se_mean    sd     2.5%      25%      50%      75%    97.5% n_eff  Rhat
sigma[1]          3.405   0.010 0.395    2.676    3.126    3.395    3.667    4.223  1448 1.002
sigma[2]          7.848   0.019 0.632    6.725    7.407    7.805    8.270    9.164  1142 1.000
iMixWeights[1]    0.439   0.001 0.046    0.350    0.406    0.438    0.472    0.530  1000 1.000
iMixWeights[2]    0.561   0.001 0.046    0.470    0.528    0.562    0.594    0.650  1000 1.000
betasMix1[1]      5.163   0.003 0.146    4.870    5.067    5.160    5.262    5.452  2000 1.001
betasMix2[1]      0.176   0.013 0.347   -0.501   -0.056    0.171    0.391    0.861   751 1.001
mu[1]            -0.686   0.016 0.713   -2.072   -1.164   -0.705   -0.228    0.750  2000 1.000
mu[2]            31.846   0.067 1.971   27.898   30.590   31.890   33.136   35.791   853 1.000
lp__           -708.099   0.063 1.885 -712.367 -709.142 -707.715 -706.712 -705.449   899 1.001

Samples were drawn using NUTS(diag_e) at Tue Jul 4 16:10:02 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
```

Comparing the results from the two procedures, i.e. flexmix and the Stan model, shows a nice agreement between the two methods (note that the component labels are swapped between the two fits, as is common with mixture models).

## Fitted or Predicted Values

Once the coefficients of the regression equation have been determined, we can calculate the predicted (on new data) or fitted (on the current data) values. There are two ways to get these values: either compute predicted values for each component separately, or compute an aggregate predicted value across the components.

```
> intercepts
[1] -0.6598333 31.5694812
> betas
[1] 5.1614796 0.2241089
> iMixWeights
[1] 0.44 0.56
> iPred1 = lStanData$X %*% c(intercepts[1], betas[1])
> iPred2 = lStanData$X %*% c(intercepts[2], betas[2])
> ## compare with fit.flex
> head(f)
       Comp.1    Comp.2
[1,] 32.54457 20.883670
[2,] 31.98889  5.460877
[3,] 32.19311 11.129077
[4,] 32.87877 30.159296
[5,] 32.20489 11.456077
[6,] 33.19092 38.822976
> head(cbind(iPred1, iPred2))
       [,1]     [,2]
1 20.897771 32.50550
2  5.542358 31.83878
3 11.185795 32.08381
4 30.132872 32.90649
5 11.511366 32.09795
6 38.758702 33.28101
```
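The per-component fitted values are simply the design matrix multiplied by each component's coefficient vector. The same step as a small Python sketch (the design matrix here is a toy stand-in; the real one comes from `lStanData$X`):

```python
def predict_component(X, coefs):
    """Fitted values X %*% coefs for one mixture component, where each
    row of X is (1, x) and coefs is (intercept, slope)."""
    return [sum(xij * cj for xij, cj in zip(row, coefs)) for row in X]

# Toy design matrix with an intercept column; coefficients roughly
# in the spirit of the fitted model (illustrative, not exact).
X = [[1.0, 4.2], [1.0, 1.2], [1.0, 2.3]]
pred1 = predict_component(X, [-0.66, 5.16])   # component 1
pred2 = predict_component(X, [31.57, 0.22])   # component 2
```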

However, a single aggregate predicted value for the response variable may be more appropriate in certain settings; this requires taking a weighted average of the component predictions, with the weights given by the mixing probabilities.

```
> iAggregate = cbind(iPred1, iPred2)
> iAggregate = sweep(iAggregate, 2, iMixWeights, '*')
> iAggregate = rowSums(iAggregate)
> dfPredicted = data.frame(stan=iAggregate, flexmix=p.agg)
> head(dfPredicted)
      stan  flexmix
1 27.39810 27.42164
2 20.26835 20.33447
3 22.88868 22.93915
4 31.68610 31.68404
5 23.03985 23.08942
6 35.69120 35.66522
> ## calculate Mean Squared Error MSE
> mean((dfPredicted$flexmix - NPreg$yn)^2)
[1] 104.4622
> mean((dfPredicted$stan - NPreg$yn)^2)
[1] 104.3325
> ## both pretty close
```
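The `sweep` followed by `rowSums` above is just a weighted average of the component predictions. A Python equivalent of that step plus the MSE calculation, using toy numbers rather than the real predictions:

```python
def aggregate_predictions(component_preds, weights):
    """Weighted average of per-component predictions: for each
    observation, sum_k w_k * pred_k (mirrors sweep + rowSums in R)."""
    return [sum(w * p for w, p in zip(weights, preds))
            for preds in zip(*component_preds)]

def mse(predicted, observed):
    """Mean squared error between predictions and observations."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)

# Toy per-component predictions and mixing weights (illustrative values).
pred1 = [20.9, 5.5, 11.2]
pred2 = [32.5, 31.8, 32.1]
agg = aggregate_predictions([pred1, pred2], weights=[0.44, 0.56])
```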

Comparing the aggregate values from the two models shows that both have produced similar results, and the mean squared errors in each case are also similar.