This post summarises multiple chapters from [1], in which a regression model is built in steps by adding more structure to the model based on group- and individual-level information associated with the data. Some of the concepts like Shrinkage, Exchangeability and Pooling have been … Continue reading Partial, Complete or No Pooling: Information Content, Sample Sizes and Shrinkage in Multilevel Regression Models.

# Model Checking: Overconfident Predictions – controlling for unequal variance between groups.

When a model does not account for uncertainty, it can produce overconfident predictions and standard errors on parameters that are too small. We cover this in the context of model checking. Here we use posterior predictive checks and test quantities in a regression problem to detect outliers and unequal variance between groups at the likelihood level. Most …

# Biomarker Discovery: a machine learning workflow applied to Tuberculosis diagnosis.

In our previous work in tuberculosis diagnostics research, we developed workflows to detect possible biomarkers using Omics data from large cohort studies. Discovered in the 19th century, tuberculosis (TB) is still a serious public health problem, and it is estimated that one third of the world’s population is infected with Mycobacterium tuberculosis (mTB). A …

# Large Effect Sizes: Missing information produce misleading results.

Recently I came across a suspiciously large difference between the averages of two groups while analysing some Omics data. An article dealing with similar issues can be seen here. The data distribution is shown below in Figure 1 (FYI: the fold change was around 6 - which is very large for this kind …

# High Dimensional Data & Hierarchical Regression

In a high-throughput experiment one performs measurements on thousands of variables (e.g. genes or proteins) across two or more experimental conditions. In bioinformatics, we come across such data generated using technologies like microarrays, next generation sequencing, mass spectrometry, etc. Data from these technologies have their own pre-processing, normalising and quality checks (see here and here …

# Logistic “Aggression”: binary classification problems

Binary problems, where the outcome can be either True or False, are very common in data analysis, from both an inference and a classification point of view. A previous post on binomial modelling deals with a similar problem, but this time we frame the problem from a regression or generalized linear model (GLM) viewpoint. Previously we …

# Next Generation Sequencing Data Quality Checks

Having analysed a variety of Next Generation Sequencing (NGS) data sets from different projects over the past years, we have developed a general workflow to assess data quality. This is a guideline and can be applied at various steps of the analysis, starting with raw FASTQ file checks. FASTQ Quality Checks: Generally the simplest tool to …

# Hierarchical Models: A Binomial Model with Shrinkage

The material in this post comes from various sources, some of which can be found in [1] Kruschke, J. K. (2014). Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan, Second Edition. http://doi.org/10.1016/B978-0-12-405888-0.09999-2 [2] Gelman, A., Carlin, J. B., Stern, …

# Pattern Recognition using PCA: Variables and their Geometric Relationships

Principal component analysis is a commonly used technique in the multivariate statistics and pattern recognition literature. In this post I try to merge the geometric and algebraic interpretations of data as vectors in a vector space and relate them to PCA. The 3 major sources used in this blog are: [1] Thomas D. Wickens (1995). The …