Recently I came across a problem with a suspiciously large difference in the averages of two groups while analysing some Omics data. An article dealing with similar issues can be seen here. The data distribution is shown below in Figure 1 (FYI: the fold change was around 6, which is very large for this kind of data). The source code can be found in the github repository.

**Figure 1:** Density distribution of the data in the two main groups of interest (labelled 0 and 1).

The distribution clearly appears to be bimodal, and the metadata for this experiment does not explain the bimodality. In Omics experiments, problems like this typically arise from technical batch effects.
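The mechanism can be illustrated with a small simulation (a hypothetical sketch, not the actual experiment; the batch offset, group sizes and noise levels are invented): an unrecorded technical batch shifts the measurements by a large offset, producing a bimodal marginal distribution even though the treatment effect itself is small.

```r
## hypothetical illustration of a batch effect producing bimodality
set.seed(123)
n = 45
fBatch = factor(sample(1:2, n, replace = TRUE))   # unrecorded technical batch
fGroups = factor(sample(0:1, n, replace = TRUE))  # treatment groups of interest
## a small treatment effect (~0.2) swamped by a large batch offset (~6.5)
x = 9.5 + 6.5 * (as.numeric(fBatch) - 1) +
    0.2 * (as.numeric(fGroups) - 1) + rnorm(n, sd = 0.5)
## the marginal density is bimodal, much like Figure 1
plot(density(x))
```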

```
> str(dfData)
'data.frame':	45 obs. of  2 variables:
 $ x      : num  16.25 9.5 15.97 9.78 9.46 ...
 $ fGroups: Factor w/ 2 levels "0","1": 2 1 2 2 1 1 1 1 1 2 ...
> tapply(dfData$x, dfData$fGroups, mean)
       0        1 
13.44531 12.97933 
> 12.97933 - 13.44531
[1] -0.46598
```

Calculating the averages and their difference directly puts both group means at around 13, which falls in the dip between the two peaks in Figure 1, so neither average describes the data well.

I will try to model this in a few different ways, starting with a finite mixture model. First we use the flexmix package in R to estimate the averages for the two peaks, and then check for differences in the coefficients.

```
> fit.flex = flexmix(x ~ fGroups, data=dfData, k=2)
> summary(fit.flex)

Call:
flexmix(formula = x ~ fGroups, data = dfData, k = 2)

       prior size post>0 ratio
Comp.1 0.422   19     19     1
Comp.2 0.578   26     26     1

'log Lik.' -64.27491 (df=7)
AIC: 142.5498   BIC: 155.1965 

> ## fitted coefficients
> parameters(fit.flex)
                     Comp.1     Comp.2
coef.(Intercept)  9.5091564 15.9054071
coef.fGroups1    -0.1400754  0.3231403
sigma             0.3633179  0.6810994

########### running this again may converge in a
########### different position
> ## fitted coefficients
> parameters(fit.flex)
                    Comp.1    Comp.2
coef.(Intercept) 9.5091564 15.905407
coef.fGroups1    6.7193910 -6.536326  ## huge fold difference
sigma            0.4995355  0.618986
```

Running this code multiple times may also converge in places where the intercepts are fine but the fold changes are highly overestimated. Let's try this in Stan with two approaches: one with a single set of regression coefficients, assuming the difference between groups 0 and 1 is the same in both batches; and one with a separate set of regression coefficients for each component (batch).
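The Stan programs themselves are not reproduced in this post; as a rough guide, a minimal sketch of the single-coefficient version (my reconstruction from the parameter names printed below, not the author's actual model file) could look like:

```stan
// sketch of a 2-component normal mixture with one shared set of
// group effects (rGroupsJitter1); reconstructed, hypothetical code
data {
  int<lower=1> N;
  vector[N] y;
  int<lower=1, upper=2> group[N];  // treatment group index (0/1 coded 1/2)
}
parameters {
  ordered[2] mu;                   // component means (ordered for identifiability)
  vector<lower=0>[2] sigma;        // component standard deviations
  real<lower=0> sigmaRan1;         // scale of the group effects
  vector[2] rGroupsJitter1;        // shared group effects
  simplex[2] iMixWeights;          // mixing proportions
}
model {
  rGroupsJitter1 ~ normal(0, sigmaRan1);
  for (n in 1:N) {
    vector[2] lp;
    for (k in 1:2)
      lp[k] = log(iMixWeights[k]) +
              normal_lpdf(y[n] | mu[k] + rGroupsJitter1[group[n]], sigma[k]);
    target += log_sum_exp(lp);  // marginalise over the mixture component
  }
}
```

The two-coefficient version would simply replace `rGroupsJitter1` with one such vector per component.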

```
############### SINGLE SET of Regression Coefficients
> print(fit.stan, c('sigmaRan1', 'sigma', 'mu', 'iMixWeights', 'rGroupsJitter1'), digits=3)
Inference for Stan model: normResponseFiniteMixture1RandomEffect.
2 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=2000.

                    mean se_mean    sd   2.5%    25%    50%    75%  97.5% n_eff  Rhat
sigmaRan1          2.030   0.091 2.492  0.056  0.421  1.126  2.684  9.199   758 1.000
sigma[1]           0.419   0.002 0.083  0.292  0.360  0.404  0.465  0.617  1197 0.999
sigma[2]           0.746   0.003 0.115  0.559  0.664  0.734  0.809  1.012  1330 1.001
mu[1]              9.326   0.051 0.932  7.312  8.991  9.405  9.695 11.202   334 1.007
mu[2]             15.913   0.050 0.937 13.919 15.569 15.981 16.297 17.817   353 1.007
iMixWeights[1]     0.428   0.002 0.072  0.294  0.378  0.425  0.479  0.573  1401 1.001
iMixWeights[2]     0.572   0.002 0.072  0.427  0.521  0.575  0.622  0.706  1401 1.001
rGroupsJitter1[1]  0.118   0.051 0.932 -1.767 -0.249  0.042  0.461  2.179   334 1.007
rGroupsJitter1[2]  0.118   0.051 0.932 -1.802 -0.229  0.034  0.432  2.096   335 1.007

#################### TWO SETS of Regression Coefficients
> print(fit.stan, digits=3)
Inference for Stan model: normResponseFiniteMixture1RandomEffectSeparateCoefficients.
2 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=2000.

                      mean se_mean    sd   2.5%    25%    50%    75%  97.5% n_eff  Rhat
sigmaRan1_1          2.023   0.090 2.674  0.080  0.504  1.167  2.417  8.895   884 0.999
sigmaRan1_2          2.254   0.074 2.522  0.137  0.677  1.438  2.833  9.192  1171 1.001
sigma[1]             0.416   0.002 0.082  0.291  0.359  0.405  0.462  0.607  1564 1.001
sigma[2]             0.739   0.003 0.114  0.562  0.656  0.723  0.808  0.995  1248 1.000
rGroupsJitter1_1[1] -0.186   0.037 0.690 -1.870 -0.509 -0.080  0.203  1.025   349 1.002
rGroupsJitter1_1[2] -0.320   0.040 0.695 -2.007 -0.673 -0.198  0.071  0.961   308 1.001
rGroupsJitter1_2[1]  0.410   0.030 0.788 -0.895 -0.114  0.252  0.876  2.285   686 1.003
rGroupsJitter1_2[2]  0.690   0.031 0.805 -0.628  0.116  0.555  1.189  2.517   674 1.005
mu[1]                9.694   0.037 0.688  8.475  9.297  9.593 10.045 11.383   343 1.001
mu[2]               15.504   0.030 0.781 13.679 15.034 15.653 16.041 16.793   676 1.005
iMixWeights[1]       0.425   0.002 0.069  0.292  0.378  0.424  0.473  0.558  1700 1.000
iMixWeights[2]       0.575   0.002 0.069  0.442  0.527  0.576  0.622  0.708  1700 1.000
lp__

#################### Differences
###### SINGLE SET
> getDifference(mCoef[,'1'], mCoef[,'0'])
-0.0002651771  ## essentially zero
###### TWO SETS
> getDifference(mCoef.1[,'1'], mCoef.1[,'0'])
-0.1333644  ## 1 slightly less than 0 in component 1
> getDifference(mCoef.2[,'1'], mCoef.2[,'0'])
0.2809643  ## 1 slightly higher in component 2
## compare with flexmix results
```

The results from Stan and flexmix (in some runs of flexmix) agree. Looking at Figure 1, the fold changes from the model with two sets of regression coefficients appear more plausible, but for practical purposes the simpler model with one set of coefficients is not bad either, as these differences are essentially close to zero in both models once sampling variability is accounted for (see Figure 2).
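The helper `getDifference` used above is not shown in the post; assuming it simply summarises the posterior distribution of the difference between two coefficients, a plausible sketch is:

```r
## hypothetical reconstruction of the helper used above: the posterior
## mean of the difference between the MCMC samples of two coefficients,
## taking the difference at each step in the chain
getDifference = function(a, b) {
  stopifnot(length(a) == length(b))  # one draw per MCMC step in each vector
  mean(a - b)
}

## usage as in the post: getDifference(mCoef[, '1'], mCoef[, '0'])
```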

**Figure 2:** Differences in the regression coefficients for groups 1 and 0, estimated by taking the difference of the MCMC samples for the two groups at each step in the chain. Top Left: finite mixture model with one set of regression coefficients; Top Right and Bottom Right: finite mixture model with a separate set of regression coefficients for each component.

Now we generate a grouping factor for this data using a clustering algorithm, e.g. kmeans, and look for two clusters.

**Figure 3:** Density plot of the data conditioned on the 2 new groups assigned via kmeans.

The code snippets below show the results after using kmeans and fitting a simple linear model with and without the new grouping factor included.

```
> km = kmeans(dfData$x, centers = 2)
> dfData$cluster = factor(km$cluster)
> densityplot(~ x | cluster, groups=fGroups, data=dfData, auto.key=T)
## calculate averages in each group in each cluster
> tapply(dfData$x, dfData$cluster:dfData$fGroups, mean)
      1:0       1:1       2:0       2:1 
 9.509156  9.369081 15.905407 16.228547 
## compare these results with the stan results
## using algebra to calculate these differences
> 9.369081 - 9.509156
[1] -0.140075  ### cluster 1 difference
> 16.228547 - 15.905407
[1] 0.32314  ### cluster 2 difference

> fit.1 = lm(x ~ fGroups, data=dfData)
> summary(fit.1)

Call:
lm(formula = x ~ fGroups, data = dfData)

Residuals:
   Min     1Q Median     3Q    Max 
-4.243 -3.822  1.939  2.793  4.700 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  13.4453     0.6611  20.339   <2e-16 ***
fGroups1     -0.4660     1.0174  -0.458    0.649    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.371 on 43 degrees of freedom
Multiple R-squared:  0.004855,	Adjusted R-squared:  -0.01829 
F-statistic: 0.2098 on 1 and 43 DF,  p-value: 0.6492

> fit.2 = lm(x ~ fGroups + cluster, data=dfData)
> summary(fit.2)

Call:
lm(formula = x ~ fGroups + cluster, data = dfData)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.22321 -0.37132 -0.09708  0.27576  1.95703 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.3852     0.1590  59.032   <2e-16 ***
fGroups1      0.1217     0.1780   0.683    0.498    
cluster2      6.5977     0.1780  37.061   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5875 on 42 degrees of freedom
Multiple R-squared:  0.9705,	Adjusted R-squared:  0.9691 
F-statistic: 690.2 on 2 and 42 DF,  p-value: < 2.2e-16
```

The second model has smaller residuals and a better fit, as shown in the results above and in the residuals vs fitted values plot in Figure 4 below.
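The informal comparison above can also be formalised as a nested-model F-test via `anova()`; this step is not in the original post, and since `dfData` is not reproduced here the sketch below uses simulated data with roughly the same structure (two well-separated clusters of 22 and 23 observations).

```r
## sketch: F-test for the added cluster term, on data mimicking dfData
set.seed(1)
dfSim = data.frame(
  x = c(rnorm(22, mean = 9.5, sd = 0.5), rnorm(23, mean = 16, sd = 0.7)),
  fGroups = factor(sample(0:1, 45, replace = TRUE)),
  cluster = factor(rep(1:2, c(22, 23)))
)
fit.a = lm(x ~ fGroups, data = dfSim)            # without the cluster factor
fit.b = lm(x ~ fGroups + cluster, data = dfSim)  # with the cluster factor
anova(fit.a, fit.b)  # tiny p-value: the cluster term clearly improves the fit
```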

**Figure 4:** Residuals vs fitted values plots for the two normal linear models. The left plot is from the model without the new grouping factor identifying the clusters. The right plot shows smaller residuals, as the model includes both the main treatment effect and the cluster information.

If we think that the average difference between treatment groups is not the same in the two clusters, then we can include an interaction term between the treatment and cluster factors, but in this case I don't think it's necessary.
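If an interaction were needed, it would be a one-line change to the model formula. A sketch on simulated data (again mimicking `dfData`, which is not reproduced here):

```r
## sketch of the interaction model on data with the same structure as dfData
set.seed(2)
dfSim = data.frame(fGroups = factor(rep(0:1, length.out = 45)),
                   cluster = factor(rep(1:2, c(22, 23))))
dfSim$x = with(dfSim, 9.5 + 6.5 * (cluster == "2") +
                 0.1 * (fGroups == "1") + rnorm(45, sd = 0.5))
## '*' expands to main effects plus their interaction
fit.int = lm(x ~ fGroups * cluster, data = dfSim)
## the fGroups1:cluster2 coefficient estimates how much the treatment
## effect in cluster 2 differs from that in cluster 1
coef(summary(fit.int))["fGroups1:cluster2", ]
```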

*What I gather from this analysis is that one should be very careful when getting larger than expected effect sizes for coefficients*, as this can be due to missing information in the model. We tried two approaches: a mixture model (which brings its own complexities) and a simpler linear model. The mixture model builds the bimodality into the likelihood, so it can get away with the missing covariate at the price of extra complexity. The simpler model produces practically comparable results once the relevant covariate information is added.