Large Effect Sizes: Missing information produces misleading results.

Recently I came across a problem with a suspiciously large difference in the averages of two groups while analysing some Omics data. An article dealing with similar issues can be seen here. The data distribution is shown below in Figure 1 (for reference, the fold change was around 6, which is very large for this kind of data). The source code can be found in the GitHub repository.

largeEffectsFigure1

Figure 1: Density distribution of the data in the two main groups of interest (labelled 0 and 1).

The distribution is clearly bimodal, and the metadata for this experiment does not explain the bimodality. In Omics experiments, this kind of problem typically arises from technical batch effects.
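To illustrate how a hidden batch effect can produce this picture, here is a minimal base-R simulation. This is my own sketch, not the original data: the batch means (9.5 and 15.9) and sample size are chosen to mimic Figure 1.

```r
## Simulated sketch of a batch-confounded dataset (not the real data):
## two hidden technical batches shift the response strongly, while the
## treatment effect of interest is small.
set.seed(1)
n <- 45
batch <- rbinom(n, 1, 0.58)          # hidden batch membership (unrecorded)
group <- factor(rbinom(n, 1, 0.5))   # treatment groups 0/1
x <- ifelse(batch == 1, 15.9, 9.5) + # batch means dominate the signal
     0.1 * (group == "1") +          # small true treatment effect
     rnorm(n, 0, 0.5)
## the pooled group means land in the dip between the two modes
tapply(x, group, mean)
```

Because each treatment group mixes observations from both batches, the naive group means fall between the two modes, just as in the real data.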

> str(dfData)
'data.frame':   45 obs. of  2 variables:
 $ x      : num  16.25 9.5 15.97 9.78 9.46 ...
 $ fGroups: Factor w/ 2 levels "0","1": 2 1 2 2 1 1 1 1 1 2 ...
> tapply(dfData$x, dfData$fGroups, mean)
       0        1 
13.44531 12.97933 
> 12.97933 - 13.44531 
[1] -0.46598

Calculating the averages directly estimates the means of both groups to be around 13, which sits in the dip between the two peaks in Figure 1; neither group mean describes the data well.

I will try to model this in a few different ways, starting with a finite mixture model. First we use the flexmix package in R to estimate the means of the two peaks, and then check for differences in the coefficients.

> library(flexmix)
> fit.flex = flexmix(x ~ fGroups, data=dfData, k=2)
> summary(fit.flex)

Call:
flexmix(formula = x ~ fGroups, data = dfData, k = 2)

       prior size post>0 ratio
Comp.1 0.422   19     19     1
Comp.2 0.578   26     26     1

'log Lik.' -64.27491 (df=7)
AIC: 142.5498   BIC: 155.1965 

> ## fitted coefficients
> parameters(fit.flex)
                     Comp.1     Comp.2
coef.(Intercept)  9.5091564 15.9054071
coef.fGroups1    -0.1400754  0.3231403
sigma             0.3633179  0.6810994

########### running this again may converge in a 
########### different position
> ## fitted coefficients
> parameters(fit.flex)
                    Comp.1    Comp.2
coef.(Intercept) 9.5091564 15.905407
coef.fGroups1    6.7193910 -6.536326 ## huge fold difference
sigma            0.4995355  0.618986

Running this code multiple times may also converge in places where the intercepts are fine but the fold changes are highly overestimated. Let's try this in Stan with two approaches: first, a single set of regression coefficients, assuming the difference between groups 0 and 1 is the same in both batches; second, a separate set of regression coefficients for each component (batch).
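As a sketch of what the single-coefficient model computes, the likelihood for each observation is a two-component normal mixture whose component means are both shifted by one shared treatment effect. The base-R function below is my own illustration of that likelihood, not the actual Stan file:

```r
## Illustrative log-likelihood for the single-coefficient mixture model:
## each observation comes from one of two components with means mu[k],
## both shifted by a shared treatment effect beta for group "1".
mixLogLik <- function(x, group, mu, beta, sigma, w) {
  shift <- beta * (group == "1")
  ll <- log(w[1] * dnorm(x, mu[1] + shift, sigma[1]) +
            w[2] * dnorm(x, mu[2] + shift, sigma[2]))
  sum(ll)
}

## evaluate near the values reported by flexmix/Stan (for illustration)
mixLogLik(c(9.5, 15.9), c("0", "1"),
          mu = c(9.5, 15.9), beta = 0.1,
          sigma = c(0.4, 0.7), w = c(0.42, 0.58))
```

The two-coefficient model simply replaces the single `beta` with a `beta[k]` per component; the Stan versions additionally place priors on the parameters and treat the group effects as random effects.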

############### SINGLE SET of Regression Coefficients
> print(fit.stan, c('sigmaRan1', 'sigma', 'mu', 'iMixWeights', 'rGroupsJitter1'), digits=3)
Inference for Stan model: normResponseFiniteMixture1RandomEffect.
2 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=2000.

                    mean se_mean    sd   2.5%    25%    50%    75%  97.5% n_eff  Rhat
sigmaRan1          2.030   0.091 2.492  0.056  0.421  1.126  2.684  9.199   758 1.000
sigma[1]           0.419   0.002 0.083  0.292  0.360  0.404  0.465  0.617  1197 0.999
sigma[2]           0.746   0.003 0.115  0.559  0.664  0.734  0.809  1.012  1330 1.001
mu[1]              9.326   0.051 0.932  7.312  8.991  9.405  9.695 11.202   334 1.007
mu[2]             15.913   0.050 0.937 13.919 15.569 15.981 16.297 17.817   353 1.007
iMixWeights[1]     0.428   0.002 0.072  0.294  0.378  0.425  0.479  0.573  1401 1.001
iMixWeights[2]     0.572   0.002 0.072  0.427  0.521  0.575  0.622  0.706  1401 1.001
rGroupsJitter1[1]  0.118   0.051 0.932 -1.767 -0.249  0.042  0.461  2.179   334 1.007
rGroupsJitter1[2]  0.118   0.051 0.932 -1.802 -0.229  0.034  0.432  2.096   335 1.007

#################### TWO SETS of Regression Coefficients
> print(fit.stan, digits=3)
Inference for Stan model: normResponseFiniteMixture1RandomEffectSeparateCoefficients.
2 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=2000.

                       mean se_mean    sd    2.5%     25%     50%     75%   97.5% n_eff  Rhat
sigmaRan1_1           2.023   0.090 2.674   0.080   0.504   1.167   2.417   8.895   884 0.999
sigmaRan1_2           2.254   0.074 2.522   0.137   0.677   1.438   2.833   9.192  1171 1.001
sigma[1]              0.416   0.002 0.082   0.291   0.359   0.405   0.462   0.607  1564 1.001
sigma[2]              0.739   0.003 0.114   0.562   0.656   0.723   0.808   0.995  1248 1.000
rGroupsJitter1_1[1]  -0.186   0.037 0.690  -1.870  -0.509  -0.080   0.203   1.025   349 1.002
rGroupsJitter1_1[2]  -0.320   0.040 0.695  -2.007  -0.673  -0.198   0.071   0.961   308 1.001
rGroupsJitter1_2[1]   0.410   0.030 0.788  -0.895  -0.114   0.252   0.876   2.285   686 1.003
rGroupsJitter1_2[2]   0.690   0.031 0.805  -0.628   0.116   0.555   1.189   2.517   674 1.005
mu[1]                 9.694   0.037 0.688   8.475   9.297   9.593  10.045  11.383   343 1.001
mu[2]                15.504   0.030 0.781  13.679  15.034  15.653  16.041  16.793   676 1.005
iMixWeights[1]        0.425   0.002 0.069   0.292   0.378   0.424   0.473   0.558  1700 1.000
iMixWeights[2]        0.575   0.002 0.069   0.442   0.527   0.576   0.622   0.708  1700 1.000
lp__       

#################### Differences
###### SINGLE SET
> getDifference(mCoef[,'1'], mCoef[,'0'])
-0.0002651771 ## essentially zero

###### TWO SETS
> getDifference(mCoef.1[,'1'], mCoef.1[,'0'])
-0.1333644  ## 1 slightly less than 0 in component 1

> getDifference(mCoef.2[,'1'], mCoef.2[,'0'])
0.2809643  ## 1 slightly higher in component 2

## compare with flexmix results 

The results from Stan and flexmix (in some runs of flexmix) agree. Looking at Figure 1, the fold changes from the model with two sets of regression coefficients appear more faithful to the data, but for practical purposes the simpler model with one set of coefficients is not bad either: once sampling variability is accounted for, the differences are essentially close to zero in both models (see Figure 2).

largeEffectsFigure2

Figure 2: Differences in the regression coefficients for groups 1 and 0, estimated by calculating the differences in the MCMC samples for each group at each step in the chain. Top Left: Finite mixture model with one set of regression coefficients; Top Right and Bottom Right: Finite mixture model with a separate set of regression coefficients for each group.

Now we generate a grouping factor for this data using a clustering algorithm, e.g. k-means, looking for two clusters.

largeEffectsFigure3

Figure 3: Density plot of the data conditioned on the 2 new groups assigned via kmeans.

The code snippets below show the results after using kmeans and fitting a simple linear model with and without the new grouping factor included.

> km = kmeans(dfData$x, centers = 2)
> 
> dfData$cluster = factor(km$cluster)
> 
> library(lattice)
> densityplot(~ x | cluster, groups=fGroups, data=dfData, auto.key=T)
## calculate averages in each group in each cluster
> tapply(dfData$x, dfData$cluster:dfData$fGroups, mean)
      1:0       1:1       2:0       2:1 
 9.509156  9.369081 15.905407 16.228547 
## compare these results with the Stan estimates
## calculating the differences directly
> 9.369081 - 9.509156
[1] -0.140075 ### cluster 1 difference
> 16.228547 - 15.905407
[1] 0.32314 ### cluster 2 difference
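As a quick sanity check (my own arithmetic, using the component sizes 19 and 26 reported by flexmix as the cluster sizes), the size-weighted average of the two per-cluster differences lands close to the `fGroups1` coefficient that the cluster-adjusted linear model reports:

```r
## Size-weighted average of the per-cluster group differences;
## weights 19/45 and 26/45 are the flexmix component sizes.
d1 <- 9.369081 - 9.509156    # cluster 1 difference
d2 <- 16.228547 - 15.905407  # cluster 2 difference
(19 * d1 + 26 * d2) / 45     # roughly 0.128
```

The exact OLS estimate differs slightly because the groups are not perfectly balanced within clusters, but the agreement shows the adjusted model is recovering the within-cluster differences rather than the spurious pooled one.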

> fit.1 = lm(x ~ fGroups, data=dfData)
> summary(fit.1)

Call:
lm(formula = x ~ fGroups, data = dfData)

Residuals:
   Min     1Q Median     3Q    Max 
-4.243 -3.822  1.939  2.793  4.700 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  13.4453     0.6611  20.339   <2e-16 ***
fGroups1     -0.4660     1.0174  -0.458    0.649    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.371 on 43 degrees of freedom
Multiple R-squared:  0.004855,  Adjusted R-squared:  -0.01829 
F-statistic: 0.2098 on 1 and 43 DF,  p-value: 0.6492

> 
> fit.2 = lm(x ~ fGroups + cluster, data=dfData)
> summary(fit.2)

Call:
lm(formula = x ~ fGroups + cluster, data = dfData)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.22321 -0.37132 -0.09708  0.27576  1.95703 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.3852     0.1590  59.032   <2e-16 ***
fGroups1      0.1217     0.1780   0.683    0.498    
cluster2      6.5977     0.1780  37.061   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5875 on 42 degrees of freedom
Multiple R-squared:  0.9705,    Adjusted R-squared:  0.9691 
F-statistic: 690.2 on 2 and 42 DF,  p-value: < 2.2e-16

The second model has much smaller residuals and a far better fit, as shown in the output above and the residuals vs fitted values plot in Figure 4 below.

largeEffectsFigure4

Figure 4: Residuals vs fitted values plots for the two normal linear models. The left plot is from the model without the new grouping factor; the right plot, from the model that includes both the main treatment effect and the cluster information, shows much smaller residuals.

If we suspect that the average difference between the treatment groups is not the same in the two clusters, we can include an interaction term between treatment and cluster; in this case I don't think it's necessary.
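For completeness, here is how such an interaction model would be fitted. The data frame below is simulated for illustration (mimicking the cluster means above), since the original data is in the repository:

```r
## Sketch: testing whether the treatment effect differs by cluster.
## dfData here is simulated to mimic the real data, not the original.
set.seed(2)
cl <- rep(c(1, 2), each = 20)
dfData <- data.frame(fGroups = factor(rep(0:1, 20)),
                     cluster = factor(cl))
dfData$x <- ifelse(cl == 2, 15.9, 9.5) +
            0.1 * (dfData$fGroups == "1") + rnorm(40, 0, 0.5)

fit.3 <- lm(x ~ fGroups * cluster, data = dfData)
anova(fit.3)  # the fGroups:cluster row tests the interaction
```

A non-significant `fGroups:cluster` term supports dropping the interaction and keeping the simpler additive model used above.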

What I gather from this analysis is that one should be very careful when getting larger-than-expected effect sizes for coefficients, as this can be due to missing information in the model. We tried a mixture model approach (which carries its own complexities) and a simpler linear model. The mixture model assumes the bimodality in the data, so it can get away with the missing covariate, at the price of complexity. The simpler model produces practically comparable results once the relevant covariate information is added.
