Next Generation Sequencing Data Quality Checks

Analysing a variety of Next Generation Sequencing (NGS) data sets from different projects over the past years, we have developed a general workflow to assess data quality. This is a guideline and can be applied at various steps of the analysis, starting with raw FASTQ file checks.

FASTQ Quality Checks:

Generally the simplest tool to use is FASTQC, which is very popular and gives various sequence quality metrics. Most of the time, we deal with a population of samples, as they come from an experimental study. Our aim for the QC is to see how does the population behave, and are there any obvious outliers. If there are any differences present, then are any of these due to technical or biological reasons?

I will link some examples from our work, but they are built using the bioconductor libraries – which means feel free to use the original libraries where appropriate. CFastqQuality is an R class which will create an S4 object  using the bioconductor library ShortRead.

## perform the analysis one sample at a time
## function to write the qa files
write.qa = function(fls, indir, title){
  wd = getwd()
  setwd(indir)
  ob = CFastqQuality(fls, title)
  setwd(wd)
  cat(paste('done', title, '\n'))
  return(ob)
}

ivFilesIndex = unlist(lFilesIndex)

lOb = lapply(ivFilesIndex, function(x){
  write.qa(csFiles[x], getwd(), csFiles[x])
})

The code snippet above takes a vector of FASTQ file names, and creates CFastqQuality objects. There are various types of diagnostics you can look at, but generally I would look at number of reads, per base quality and sequence content in certain cases. If these plots point out problems, then I would delve deeper into other diagnostics.

Single Cell RNA-Seq

Figure 1: Number of reads in each sample, before and after adapter trimming and removal of low quality reads.

Single Cell RNA-Seq (1)

Figure 2: The quality scores before and after adapter trimming and removal of low quality reads.

Figure 2 suggests that some samples have lower quality reads when compared to the population – e.g. the broken red line in the left panel. This trend appears to disappear after removing low quality reads, however there is an odd dip in quality on the reverse reads between bases 20 and 40. This odd pattern warrants further exploration, and when we look at the tile plots from FASTQC for the reverse reads, we see this trend as shown below

Single Cell RNA-Seq (2)

Figure 3: Dip in quality on tile plots, concordant with the reverse reads base quality in the right panel.

I tried looking for reasons, for this pattern – and found various suggestions that include: Flowcell overloading, Quality of run is low, Biased sequence composition, and dirt or obstruction to imaging system [sources 1 and 2].

BS Seq 1

Figure 4: Average base composition at various positions in the sequencing reads.

A plot like Figure 4 may be useful in cases where we have special libraries, e.g. Bisulphite Sequencing or MeDIP Sequencing. Sample M407 stands out from the rest of the population, suggesting problems with bisulphite conversion at the library preparation step. If one looks at each of these samples individually, then such trends will be missed. Some example source code from one of the projects can be seen here. One can even extract the sequence quality matrix, and then use various diagnostics like PCA or Hierarchical clustering to look for outliers.

Read Trimming:

I would perform the quality checks on the FASTQ files before and after trimming. I tend to use Trimmomatic with following settings.

Sequence Alignments:

The aligner of choice is pretty subjective, e.g. Hisat2 for RNA-Seq is a popular choice. You should try and parse the log files after the alignments are done, to extract some basic information e.g. number of reads aligned and mapping efficiency. This may help to point out samples where mapping efficiency is very different to the rest of the samples.

Samtools or Picard Tools:

Its your choice which one to use, but generally this is the workflow I use for e.g. RNA-Seq data or Bisulphite Seq data. Bottom line is I will remove low quality alignments and PCR duplicates.

Bam Files QC:

I tend to look at the coverage in the Bam Files over the chromosomes. A utility class for these tasks CBamScaffold can be used which uses various bioconductor libraries. Some example usage, if looking at a large number of Bam files

## perform the analysis one sample at a time
## function to write the qa files
write.qa = function(bfn, sn){
  # open bam file
  bf = BamFile(bfn)
  lBam = lapply(sn, function(x){
    return(CBamScaffold(bf, x))
  })
  cat(paste('done', bfn, '\n'))
  return(lBam)
}

lAllBams = lapply(csFiles, function(x){
  write.qa(x, cvSeqnames)
})
names(lAllBams) = dfSample$id

A script with some analysis can be seen here as an example.  We can look at the number of aligned reads before and after duplicate removal, which will highlight, if any samples stand out as in the figure below:

aligned reads pcr

Figure 5: Number of reads aligned in millions before and after duplicate removal.

Figure 5 suggests that some samples had problems that include large PCR duplication and low mapping efficiency. We can also model the coverage over each chromosome by splitting each chromosome into 1000 bins. This should highlight trends in the data as shown below:

coverage 1

Figure 6: Average coverage over binned chromosomes 1 and 2. The Red and Green groups show a similar trend, while the Purple and Turquoise samples are more similar. This is reflective of the biological groupings.

coverage 2

Figure 7: The coverage shows that there is a large difference between the sequencing depth for the three biological groups, and in this data set, this is not a true biological effect, but introduced due to technical reasons – less sequencing depth for Diet 1 samples.

Trends like the one in Figure 6 show what is expected biologically, i.e. the Red and Green groups are more similar. While the trend in Figure 7 is not good, as it is a problem with sequencing depth, and may cause problems in downstream analysis – which may not be fixed by normalisation.

I would probably collect the data from all the various steps as shown below:

workflow

Figure 8: Highlights the samples that are flagged at each step of the workflow.

A workflow like figure 8 is easy to follow, and highlights samples at each step that are problematic. I think of this as accumulating evidence, and if a sample is highlighted multiple times, then it most likely has problems. Such samples may be dropped from the statistical analysis or some sort of a control covariate might be required for these samples, in addition to the biological groupings. Adding these flagged samples into the statistical modelling step, without dropping or controlling for these, will most likely lead to issues with normalisation or spurious findings.

These are general steps, that can be performed for the data QC – but each type of experiment will require further checks. E.g. for Single cell RNA-Seq data, you will need to perform additional checks after generating the count matrix; Bisulphite Seq data will require checking for methylation proportions over each position in the genome. Read this with the section on omics data quality checks.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s