In our previous work in tuberculosis diagnostics research, we developed workflows to detect candidate biomarkers using omics data from large cohort studies. Although identified in the 19th century, tuberculosis (TB) remains a serious public health problem, and it is estimated that one third of the world's population is infected with *Mycobacterium tuberculosis* (Mtb). A biomarker is defined as a characteristic that is objectively measured and evaluated as an indicator of normal or pathological processes, or of pharmacologic responses to a therapeutic intervention ("Biomarkers and Surrogate Endpoints", 2001), and can be used for prediction, diagnosis, staging and monitoring the prognosis of a disease. The two data sets used in this workflow can be found under the NCBI GEO accessions GSE19491 and GSE37250. I have provided these data sets, after some microarray data processing steps, on the GitHub page for this post.

Our objective here is to train a classifier with the smallest possible number of predictors, as this should be a viable and cheap option to implement in a clinical setting. Classifiers with 30 or more genes require expensive custom gene panels, which eventually becomes a burden on any country's healthcare system.

Some of the material covered here refers to previous posts (which contain the original references to the literature), and I will add some references here as well.

[1] Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

[2] Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2013). Bayesian Data Analysis (3rd ed., Texts in Statistical Science). Chapman & Hall/CRC.

[3] James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning: with Applications in R. Springer Texts in Statistics. http://doi.org/10.1007/978-1-4614-7138-7

The training data looks something like this:

```r
> str(lData.train)
List of 3
 $ data    : num [1:1999, 1:537] 13.38 14.02 9.56 10.94 7.1 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:1999] "GAPDH" "RPS28" "TICAM2" "SERTAD2" ...
  .. ..$ : chr [1:537] "GSM914353" "GSM914354" "GSM914355" "GSM914356" ...
 $ grouping: Factor w/ 3 levels "HC","LTB","ATB": 3 3 3 3 3 3 3 3 3 3 ...
 $ batch   : Factor w/ 2 levels "HIV-","HIV+": 1 1 1 1 1 1 1 1 1 1 ...
```

I will use two approaches to select a smaller subset of variables (instead of all ~2000): a Random Forest approach, with some custom tweaks applied to the workflow, and a Bayesian binomial model in which all coefficients share a scale (or variance) term (a bit like a Lasso model). You can read more technical details in references [1] to [3]. In both models the number of observations is ~537 and the number of predictors is ~2000, making this a high-dimensional problem.
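
As a loose, non-Bayesian sketch of the second idea, here is a ridge-penalised logistic regression where a single penalty weight plays a role similar to the shared scale on the coefficients, and predictors are then ranked by absolute coefficient size. The data and gene names below are simulated for illustration, not the actual study data, and this is not the exact model fitted in the post.

```r
## Crude analogue of the shared-scale binomial model: penalised logistic
## regression fitted with optim(), then predictors ranked by |coefficient|.
set.seed(1)
n <- 200; p <- 20
X <- matrix(rnorm(n * p), n, p,
            dimnames = list(NULL, paste0("gene", 1:p)))
## only the first 3 simulated genes carry signal
eta <- X[, 1] * 1.5 - X[, 2] * 1.2 + X[, 3] * 1.0
y <- rbinom(n, 1, plogis(eta))

## penalised negative log-likelihood; the common penalty lambda acts like
## a single shared scale on all the coefficients
f_nll <- function(par, lambda = 1) {
  b0 <- par[1]; b <- par[-1]
  eta <- b0 + X %*% b
  -sum(y * eta - log1p(exp(eta))) + lambda * sum(b^2)
}
fit <- optim(rep(0, p + 1), f_nll, method = "BFGS")
coefs <- setNames(fit$par[-1], colnames(X))
## rank predictors by absolute coefficient size (cf. Figure 2)
head(sort(abs(coefs), decreasing = TRUE), 5)
```

The three signal genes should dominate the top of the ranking; in the real workflow the ranking comes from the MCMC posterior of the binomial model rather than a penalised point estimate.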

**Figure 1:** Top 20 predictors ranked by nested Random Forest.

**Figure 2:** Top 30 predictors ranked on absolute coefficient size, from the binomial model.

Figures 1 and 2 show the top predictors from the two models; some predictors overlap while others differ. In the next step we use an exhaustive subset selection approach [3], splitting the data into training and test sets, in order to select a model size (number of predictors/variables) that gives acceptable test-set predictions while keeping the number of variables small.
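
The core of exhaustive subset selection can be sketched in a few lines: for each model size, fit every combination of candidate predictors on a training split and score it on a test split. The data below are simulated and the candidate list is tiny; in the real workflow the candidates come from the Random Forest and binomial rankings, and the splitting is nested.

```r
## Exhaustive subset selection over a small candidate list (illustrative).
set.seed(2)
n <- 300
dfData <- data.frame(matrix(rnorm(n * 6), n, 6,
                            dimnames = list(NULL, paste0("g", 1:6))))
dfData$y <- rbinom(n, 1, plogis(1.5 * dfData$g1 - 1.2 * dfData$g2))
iTrain <- sample(n, n / 2)

cvars <- paste0("g", 1:6)
lErr <- lapply(seq_along(cvars), function(size) {
  mComb <- combn(cvars, size)          ## every subset of this size
  apply(mComb, 2, function(vars) {
    f <- reformulate(vars, response = "y")
    fit <- glm(f, data = dfData[iTrain, ], family = binomial)
    p <- predict(fit, newdata = dfData[-iTrain, ], type = "response")
    mean((p > 0.5) != dfData$y[-iTrain])  ## test-set error rate
  })
})
## best test error achievable at each model size (cf. Figure 3)
sapply(lErr, min)
```

The error curve typically flattens after the true signal variables enter, which is what motivates choosing the smallest model size on the flat part of the curve.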

**Figure 3:** Top left and top right panels show the results of nested exhaustive subset selection using the variable lists from the Random Forest and binomial models. The bottom panel shows the result after combining the top variable combinations from the two top panels.

Looking at Figure 3, I would perhaps use just a one-variable model, but here we fit a seven-variable model using a binomial regression approach.

**Figure 4:** Pairs plots of the MCMC samples for the regression coefficients from the binomial model.

Figure 4 shows the sampled values of the regression coefficients from the seven-variable binomial model. The parameters for some variables are correlated, and those variables could perhaps be dropped or merged to further reduce the number of variables. The model checking blog posts describe more ways of testing the model fit and can be read for details.

A model eventually produces a prediction score for an observation belonging to the Active TB (ATB) class – in our case these scores from the binomial model lie between 0 and 1 and can be interpreted as the probability of having ATB.

**Figure 5:** The actual class labels (HC = healthy control, LTB = latent TB, ATB = active TB) vs. predicted class probability (ATB or Other = both HC and LTB).

Figure 5 shows the scores assigned to each observation, originally classified as healthy (HC), latent TB (LTB) or active TB (ATB). The closer the score is to 1, the more likely the observation belongs to the ATB class. This takes us into the realm of decision making, where we need to select an optimal cutoff for classification: at which score do we classify an observation as ATB? The idea is to maximise the benefit of correct classifications while minimising wrong ones. Furthermore, this is just one of the criteria for diagnosing TB: the classifier can either classify a person as ATB or make a *'Reject'* decision, in which case the clinician uses other criteria to reach a diagnosis. Reference [3], around page 18, has a good example of modelling the distribution of this score, which can help in choosing a cutoff; I try to replicate some of that here in this context.
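
The ATB-or-Reject decision rule described above is a one-liner once a cutoff has been chosen. A minimal sketch (the 0.6 used in the example call is the cutoff suggested later in the post; the function itself is generic):

```r
## Reject-option decision rule: call ATB at or above the cutoff,
## otherwise defer the decision to the clinician ("Reject").
f_decide <- function(score, cutoff = 0.6) {
  ifelse(score >= cutoff, "ATB", "Reject")
}

f_decide(c(0.9, 0.61, 0.3))  ## returns "ATB" "ATB" "Reject"
```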

**Figure 6:** The density distributions of the scores in each class (ATB or Other) on the 0-to-1 scale and after logit transform.

Figure 6 shows these prediction scores on two scales: the 0-to-1 probability scale and the logit-transformed scale. For modelling purposes the logit scale is more appropriate, as it allows us to model the score as a mixture of two normal distributions. Once the mixture model is fitted, one can take the parameters of the Other component (the blue curve in the bottom panel of Figure 6) and calculate the probability of a false positive (i.e. wrongly declaring an observation ATB when it belongs to the Other class). We first use a receiver operating characteristic (ROC) curve to show the performance of this classifier, and then plot the simulation lines obtained from the posterior predictive distribution of the mixture model (Figure 7).
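
The post fits the two-component mixture with Stan, but the same idea can be sketched in base R with a minimal EM algorithm. The logit-scale scores below are simulated, with the component means and sizes chosen purely for illustration:

```r
## Minimal EM for a two-component normal mixture on the logit scale.
set.seed(3)
logit_scores <- c(rnorm(150, -3, 1),   ## "Other" component
                  rnorm(100,  2, 1.2)) ## "ATB" component

f_em2 <- function(x, iter = 200) {
  ## crude initialisation from the data quantiles
  mu <- quantile(x, c(0.25, 0.75)); s <- rep(sd(x), 2); pi1 <- 0.5
  for (i in 1:iter) {
    ## E step: responsibility of component 1 for each point
    d1 <- pi1 * dnorm(x, mu[1], s[1])
    d2 <- (1 - pi1) * dnorm(x, mu[2], s[2])
    r <- d1 / (d1 + d2)
    ## M step: responsibility-weighted means, sds and mixing proportion
    mu <- c(weighted.mean(x, r), weighted.mean(x, 1 - r))
    s  <- c(sqrt(weighted.mean((x - mu[1])^2, r)),
            sqrt(weighted.mean((x - mu[2])^2, 1 - r)))
    pi1 <- mean(r)
  }
  list(mu = mu, sigma = s, pi = c(pi1, 1 - pi1))
}

fit <- f_em2(logit_scores)
fit$mu  ## should sit near the two simulated component means
```

A full Bayesian fit (as in the post) additionally gives posterior uncertainty on the component parameters, which is what drives the simulated lines in Figure 7.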

```r
## draw the simulation lines
## these are p-values from the mixture components
## create posterior smatter lines
grid = seq(-7.5, 7, length.out = 100)
f_getSmatterLines = function(m, s, g){
  return(pnorm(g, m, s, lower.tail = F))
}
## holders for the simulated p-values
mTP = matrix(NA, nrow = length(grid), ncol = 2000)
mFP = matrix(NA, nrow = length(grid), ncol = 2000)
for (i in 1:2000){
  p = sample(1:nrow(mStan), size = 1)
  x = f_getSmatterLines(mStan[p, 'mu1'], mStan[p, 'sigma1'], grid)
  y = f_getSmatterLines(mStan[p, 'mu2'], mStan[p, 'sigma2'], grid)
  lines(x, y, col = 'darkgrey', lwd = 0.5)
  mFP[, i] = x
  mTP[, i] = y
}
```

We also test the performance using nested 10-fold cross-validation. The nested cross-validation workflow (written some time back) uses an LDA model, but for practical purposes the results are comparable with a binomial model.
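
The core fold loop of cross-validation can be sketched as follows. This is a plain (non-nested) 10-fold example on simulated data with a binomial GLM standing in for the LDA model; the author's actual workflow nests a second cross-validation loop inside each training fold:

```r
## Minimal 10-fold cross-validation for a binomial GLM (illustrative data).
set.seed(5)
n <- 300
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- rbinom(n, 1, plogis(1.5 * df$x1 - df$x2))

## assign each observation to one of 10 folds at random
folds <- sample(rep(1:10, length.out = n))
pred <- numeric(n)
for (k in 1:10) {
  fit <- glm(y ~ x1 + x2, data = df[folds != k, ], family = binomial)
  pred[folds == k] <- predict(fit, newdata = df[folds == k, ],
                              type = "response")
}
mean((pred > 0.5) == df$y)  ## cross-validated accuracy
```

Because every prediction is made on a fold the model never saw, the resulting ROC curve (solid line in Figure 7) is a more honest performance estimate than the in-sample one (broken red line).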

**Figure 7:** ROC curve with nested 10-fold cross-validation, and a smattering of simulated true and false positive rates from the mixture model. The broken red line is the same data without cross-validation.

The model produces a score, but the discrete decision to classify an observation as ATB depends on the cutoff point. As the cutoff moves, one classifies more ATB cases correctly but also makes more mistakes. Using the curve, or the results from the simulation, we can choose a cutoff point as a decision boundary that yields an acceptable true positive rate while keeping false positives low.
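
Given scores and true labels, the empirical true and false positive rates at any candidate cutoff are simple proportions. A small sketch with simulated scores (the real rates in the post come from the study data):

```r
## Empirical TPR and FPR at a given score cutoff.
f_rates <- function(scores, is_atb, cutoff) {
  call_atb <- scores >= cutoff
  c(tpr = mean(call_atb[is_atb]),   ## P(call ATB | truly ATB)
    fpr = mean(call_atb[!is_atb]))  ## P(call ATB | truly Other)
}

set.seed(4)
scores <- c(runif(100, 0.3, 1), runif(200, 0, 0.7))  ## ATB then Other
is_atb <- rep(c(TRUE, FALSE), c(100, 200))
f_rates(scores, is_atb, cutoff = 0.6)
```

Sweeping the cutoff over a grid and plotting TPR against FPR is exactly what traces out the ROC curve.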

**Figure 8:** Using the ROC and mixture model approach, we suggest a score cutoff of ~0.6. Anything at or above 0.6 is assigned the class ATB, while we defer the decision on other observations and classify them as Reject.

Figure 8 shows that with a cutoff of 0.6 as the decision boundary, the false positive rate is ~0.06 while the true positive rate is ~0.73. The mixture model can also give us an estimate of how many false positives and true positives to expect at a decision boundary of 0.6; these rates, shown in Figure 9, are concordant with the actual rates of 0.06 and 0.73.

```r
p = sample(1:nrow(mStan), size = 2000)
x = pnorm(logit(0.6), mStan[p, 'mu1'], mStan[p, 'sigma1'], lower.tail = F)
y = pnorm(logit(0.6), mStan[p, 'mu2'], mStan[p, 'sigma2'], lower.tail = F)
```

**Figure 9:** The expected false positive and true positive rates with a decision cutoff of 0.6. These rates are calculated using the posterior predictive distribution from the MCMC-estimated parameters of the two components of the finite mixture model.

## Test Set:

We can use the second data set to see whether the 7 variables chosen on the training data perform at a comparable level on a test set. For this purpose, we use the nested 10-fold cross-validation workflow from earlier.

**Figure 10:** ROC analysis for prediction of ATB using the variables selected at the training stage. The solid line with error bars shows the performance under nested 10-fold cross-validation, while the red line shows the performance without cross-validation.

Using the predicted model score and the cutoff of 0.6 suggested at the training stage, we can assign classes to these observations.

**Figure 11:** Observations with a score of 0.6 or higher are assigned the class ATB, while we defer the decision on other observations and classify them as Reject.

Figure 11 shows that with a cutoff of 0.6 as the decision boundary, the false positive rate is ~0.04 while the true positive rate is ~0.78; these rates are concordant with the posterior predictive rates in Figure 9. You may have noticed that the training set contains a mixture of HIV+ and HIV- populations. Stratifying such populations based on clinical metadata can sometimes produce cleaner data sets, leading to a better classifier.

We have had promising results from similar workflows in the past; in fact, we ended up with a 2-gene biomarker that was later confirmed in a different cohort. Generally, the first step of massively reducing the number of variables, from a few thousand down to the top ~30, requires a combination of methods (I have shown just two approaches); once you are down to these small numbers, a brute-force approach like exhaustive subset selection works nicely. In the past we have also used network-based approaches to select this short list of genes; indeed, the top-ranked genes from Figure 3 of the previous blog post also contain a good candidate list.