How to analyze sequencing data generated by NGS

Next-generation sequencing (NGS), also known as high-throughput sequencing (HTS), covers several sequencing technologies, including Illumina (Solexa), Roche 454, SOLiD, and Ion Torrent sequencing. These technologies can produce enormous volumes of DNA or RNA sequence data in a short time, and that data has become essential to genomics. NGS has the potential to transform life-sciences research, with benefits for healthcare, agriculture, and the environment. Analyzing NGS data with dedicated software and high-performance computing resources helps address many open problems in biology. Bioinformatics has made this analysis far more accessible to biologists and researchers by providing a wide range of next generation sequencing software.

NGS is an umbrella term for diverse sequencing applications, including transcriptome sequencing (RNA-Seq), whole-genome and whole-exome sequencing, genome-wide association studies (GWAS), chromatin immunoprecipitation sequencing (ChIP-Seq), methylated DNA immunoprecipitation sequencing (MeDIP-Seq), and single-cell sequencing. The analysis workflows for these data types share many steps. The scientific community has developed numerous workflows that combine different next generation sequencing software to produce meaningful results. Commonly used tools include BWA, Bowtie, SAMtools, Galaxy, and Picard.

NGS data analysis depends on instrument-specific processing and can be divided into three phases: (i) primary; (ii) secondary; and (iii) tertiary analysis. Primary analysis consists of the instrument-specific steps needed to call bases and compute a quality score for each call. It produces FASTQ files, which contain the nucleotide sequence of each read along with the associated Phred quality scores. These files are the starting point for the downstream analysis pipelines. Before further processing, however, the raw reads must be evaluated for quality (quality control). Sequencing and base-calling errors on NGS platforms produce poor-quality reads, which can lead to false positive results. Detecting and removing such reads is therefore an early concern of every NGS analysis pipeline. Read quality is commonly assessed with FastQC, a quality control tool that reports on the data; reads or bases that fail are then trimmed or filtered out.
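As an illustration of what a FASTQ record holds, here is a minimal Python sketch that reads four-line FASTQ records, decodes the quality string into Phred scores, and keeps reads above a mean-quality cutoff. It assumes the common Phred+33 encoding and unwrapped four-line records. The function names, the example reads, and the cutoff of 20 are illustrative assumptions, not part of any particular tool.

```python
import io


def parse_fastq(handle):
    """Yield (read_id, sequence, quality_string) from a FASTQ stream.

    Assumes each record is exactly four lines (no wrapped sequences).
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line) stops the parser
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality  # drop the leading '@'


def phred_scores(quality, offset=33):
    """Decode a quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q):
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def mean_quality(quality, offset=33):
    scores = phred_scores(quality, offset)
    return sum(scores) / len(scores)


# Two made-up reads: read1 is mostly high quality, read2 is uniformly poor.
example = io.StringIO("@read1\nACGT\n+\nII5#\n@read2\nGGCA\n+\n####\n")
records = list(parse_fastq(example))

# Hypothetical cutoff: keep reads whose mean Phred score is at least 20.
passing = [read_id for read_id, _, qual in records if mean_quality(qual) >= 20]
print(passing)
```

A Phred score of 20 corresponds to a 1 in 100 chance that the base call is wrong, which is why cutoffs around that value are often used. Real pipelines typically filter per base or per window rather than on a simple per-read mean.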

High-throughput sequencing (HTS) is based on the shotgun sequencing method, which breaks the DNA into small fragments and generates short reads whose length depends on the platform chemistry. Secondary analysis reconstructs the original sequence from these short reads. For organisms without a reference genome, this is done by de novo assembly: the reads are assembled into contigs based on their overlaps, the contigs are joined into scaffolds, and the remaining gaps are closed with gap-filling software to produce a draft genome.

Analysis is simpler for organisms with a reference genome, because the short reads only need to be aligned to that reference, using aligners such as BWA and Bowtie. Reference genomes for the alignment step can be retrieved from sources such as UCSC (University of California, Santa Cruz) and the GRC (Genome Reference Consortium). Variant identification, one of the most widely used applications of NGS data analysis, follows alignment: differences between the aligned reads and the reference sequence are detected at each position and reported by a variant caller. Single nucleotide polymorphism (SNP) discovery can be performed on both whole-genome and targeted or exome data. SNP identification plays a significant role in the diagnosis of many diseases, including cancers and Mendelian and other hereditary diseases.
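The position-by-position comparison of aligned reads against a reference can be illustrated with a toy pileup. The sketch below is not how production variant callers work (they model base quality, mapping quality, and genotype likelihoods). It assumes ungapped reads that are already aligned and given as (0-based start, sequence) pairs. The function name, the thresholds, and the example data are all hypothetical.

```python
from collections import Counter


def call_snps(reference, aligned_reads, min_depth=3, min_fraction=0.8):
    """Report positions where most reads disagree with the reference.

    aligned_reads: list of (start, sequence) tuples, 0-based, no gaps.
    Returns a list of (position, ref_base, alt_base, depth) tuples.
    """
    # One counter of observed bases per reference position.
    pileup = [Counter() for _ in reference]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1

    snps = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:  # too few reads to make a call
            continue
        base, count = counts.most_common(1)[0]
        if base != reference[pos] and count / depth >= min_fraction:
            snps.append((pos, reference[pos], base, depth))
    return snps


# Made-up reference and reads; every read carries a T where the reference has A.
reference = "ACGTACGTAC"
reads = [(0, "ACGTTCG"), (2, "GTTCGTA"), (3, "TTCGTAC"), (1, "CGTTCGT")]

snps = call_snps(reference, reads)
print(snps)
```

Positions here are 0-based for simplicity, whereas variant files in VCF format use 1-based coordinates. The depth and allele-fraction thresholds are arbitrary choices for the example.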

Gene expression analysis of RNA-Seq data is another major area of NGS data analysis. Genome-wide studies of transcription factor binding and of epigenetic modifications are also opening new avenues of research in the natural sciences.

Finally, tertiary analysis interprets the results of secondary analysis to make sense of the data. Variant annotation is one of the methods that makes the secondary data meaningful: it links called variants to databases such as dbSNP and helps identify disease-causing variants. Variant annotation and effect-prediction tools include ANNOVAR, SIFT, PROVEAN, and PolyPhen-2. Genome browsers and other visualization tools give detailed views of a newly assembled genome or a set of identified variants, showing read alignments, mapping quality, variant positions, and the associated quality scores.
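At its simplest, linking called variants to a database is a keyed lookup. The following Python sketch shows that idea only. The in-memory table, its identifiers, and its significance labels are invented placeholders rather than real dbSNP records, and the `annotate` function is a hypothetical helper, not the interface of ANNOVAR or any other tool.

```python
# Hypothetical stand-in for a variant database, keyed by
# (chromosome, position, reference allele, alternate allele).
KNOWN_VARIANTS = {
    ("chr1", 1000, "A", "T"): {"id": "example_id_1", "significance": "pathogenic"},
    ("chr2", 2500, "G", "C"): {"id": "example_id_2", "significance": "benign"},
}


def annotate(variants, database):
    """Attach database information to each called variant, if present."""
    annotated = []
    for variant in variants:
        record = database.get(variant)
        annotated.append({
            "variant": variant,
            "known": record is not None,
            "id": record["id"] if record else None,
            "significance": record["significance"] if record else "unknown",
        })
    return annotated


# Made-up calls: the first is in the table, the second is novel.
called = [("chr1", 1000, "A", "T"), ("chr3", 42, "C", "G")]
annotations = annotate(called, KNOWN_VARIANTS)

# Variants flagged for follow-up in this toy example.
candidates = [a["variant"] for a in annotations if a["significance"] == "pathogenic"]
print(candidates)
```

Variants absent from the table are kept and marked as unknown rather than discarded, since novel variants are often the ones of most interest.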

Overall, the development of next generation sequencing software has made NGS data analysis considerably smoother. Some challenges persist, however. The first is choosing appropriate software; building automated workflows or pipelines and storing the data are further concerns. The enormous size of NGS datasets is a challenge for many researchers, and the multi-node computer clusters with dedicated compute nodes that are considered the de facto standard for NGS data analysis are often costly. Adequate computational resources combined with suitable software allow highly streamlined workflows that handle these challenges well. Even so, the most time-consuming part of NGS data analysis is setting up a new pipeline, which means selecting the right tools from the vast array of options available. To help with this, many flexible software packages and computational algorithms now support the development of automated pipelines.

References:

Gogol-Döring, A., & Chen, W. (2012). An overview of the analysis of next generation sequencing data. In Next generation microarray bioinformatics (pp. 249-257). Humana Press.

Mutz, K. O., Heilkenbrinker, A., Lönne, M., Walter, J. G., & Stahl, F. (2013). Transcriptome analysis using next-generation sequencing. Current opinion in biotechnology, 24(1), 22-30.

Nielsen, R., Paul, J. S., Albrechtsen, A., & Song, Y. S. (2011). Genotype and SNP calling from next-generation sequencing data. Nature Reviews Genetics, 12(6), 443.

Babraham Bioinformatics. (2011). FastQC: a quality control tool for high throughput sequence data. Cambridge, UK: Babraham Institute.

Wang, K., Li, M., & Hakonarson, H. (2010). ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic acids research, 38(16), e164-e164.

Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., … & Durbin, R. (2009). The sequence alignment/map format and SAMtools. Bioinformatics, 25(16), 2078-2079.

Langmead, B., & Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature methods, 9(4), 357.

Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14), 1754-1760.