High-throughput RNA sequencing (RNA-Seq) promises a complete annotation and quantification of all genes and their isoforms across samples. Because the sequencing reads produced by this technology are shorter than the transcripts from which they are derived, expression estimation with RNA-Seq requires increasingly complex computational methods. In recent years, a number of expression quantification methods have been published from both public and commercial sources. Here we present an overview of these approaches to quantifying gene expression. We then define a set of criteria, compare the performance of several programs against these criteria, and provide advice on selecting suitable tools for different biological applications.
Next-generation sequencing (NGS) platforms have been widely available recently [
Recently, many studies have applied RNA-Seq to various biological and medical research. Quantification of alternative splicing in tissues [
Here we focus on the computational methods for gene expression quantification by RNA-Seq. Using Google Scholar citation, as shown in
In general, current transcriptome assembly tools belong to either a reference-based strategy or de novo strategy or both [
Cufflinks can be launched in two modes using the options -G/--GTF and -g/--GTF-guide. Both modes need a reference GFF annotation file, obtained mainly from three data sources: Ensembl (www.ensembl.org), NCBI (www.ncbi.nlm.nih.gov) and UCSC (http://genome.ucsc.edu/). The first option, -G/--GTF, tells Cufflinks to use the supplied reference annotation to estimate isoform expression. The second option, -g/--GTF-guide, tells Cufflinks to use the supplied reference annotation to guide RABT assembly [
Array Studio is a suite of tools developed by OmicSoft (www.omicsoft.com) that provides an RNA-Seq analysis workflow. Expression quantification from RNA-Seq data can be performed in two ways: by mapping reads to either the genome or the transcriptome.
CLC Genomics Workbench is a desktop application for NGS analysis developed by CLCbio (www.clcbio.com).
An in-house RNA-Seq dataset was used, and six sets of results were generated. For Cufflinks, two results were produced, one from each of the -G/--GTF and -g/--GTF-guide modes, denoted Cuff.(-G) and Cuff.(-g). For Array Studio, reads were mapped against both the genome and the transcriptome, giving two results denoted OMIC(G) and OMIC(T). The result of CLC Genomics Workbench is denoted CLC GW, and the last result is from Scripture. Summaries of the six results can be found in Tables 2 and 3. Note that CLC Genomics Workbench gives only gene-level information, so genes with only one transcript were counted.
There are some differences between Tables 2 and 3. In
In
Three types of expression values are commonly used in RNA-Seq analysis: RPKM/FPKM, TPM and naïve counts. Because Affy presents expression values on the gene level, only gene-level RNA-Seq expression values were shown in
In
ferent parameters. The y-axis values of Figure 1(c) are calculated by CLC Genomics Workbench. The y-axis values of Figures 1(d)-(f) are calculated by the OmicSoft software using different methods of computing gene expression values from RNA-Seq reads. Because the two highest correlations with Affy are those of Cuff.(-g) and Cuff.(-G), Cufflinks performs best at calculating expression values from RNA-Seq reads. With the exception of OMIC(G, naïve count), which has a Pearson correlation of 0.68, Array Studio (OMIC(G, TPM), OMIC(G, RPKM)) performs better than CLC Genomics Workbench (CLC GW).
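The comparison above can be illustrated with a minimal sketch that derives RPKM and TPM from naïve counts and correlates each with array values. The toy counts, gene lengths and array intensities are hypothetical, the total number of mapped reads is assumed to equal the sum of the gene counts, and the log2(x + 1) transform is an assumption; none of these reflects the exact procedure of the tools evaluated here.

```python
import math

def rpkm(counts, lengths_bp):
    """Reads per kilobase per million mapped reads.
    Assumption: total mapped reads = sum of the supplied gene counts."""
    total = sum(counts)
    return [c * 1e9 / (total * l) for c, l in zip(counts, lengths_bp)]

def tpm(counts, lengths_bp):
    """Transcripts per million: normalise by length first, then scale to 1e6."""
    rates = [c / l for c, l in zip(counts, lengths_bp)]
    scale = sum(rates)
    return [r * 1e6 / scale for r in rates]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical toy data: four genes, not taken from the in-house dataset.
counts = [100, 400, 50, 900]          # naive read counts per gene
lengths = [1000, 2000, 500, 3000]     # gene lengths in base pairs
array_log2 = [6.1, 7.0, 6.0, 7.9]     # log2 microarray intensities

for name, values in [
    ("naive count", counts),
    ("RPKM", rpkm(counts, lengths)),
    ("TPM", tpm(counts, lengths)),
]:
    # Assumption: RNA-Seq values are log2(x + 1) transformed before comparison.
    logged = [math.log2(v + 1) for v in values]
    print(f"{name}: Pearson r = {pearson(logged, array_log2):.3f}")
```

Within a single sample, TPM is a constant multiple of RPKM, so the two units differ from each other only through the log transform, whereas naïve counts are not corrected for gene length at all.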
In this section, the results of RNA-Seq analysis were assessed from the perspective of the linear structure of genomic features at the transcript level. The UCSC hg19 annotation file was used (ftp://igenome:G3nom3s4u@ftp.illumina.com/Homo_sapiens/UCSC/hg19/Homo_sapiens_UCSC_hg19.tar.gz).
To explore the differences among the results of the chosen RNA-Seq analyses, four relationships between a transcript in a result and a transcript in the public annotations were defined. First, “Match” means that the two transcripts are the same, that is, all of their exons match by chromosome coordinates. Second, “Including” means that the exons of the transcript in the public annotations contain the exons of the transcript in the result. Third, “Included” means that the exons of the transcript in the public annotations are a subset of the exons of the transcript in the result. Fourth, “Overlap” means that the two transcripts share common exons.
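The four relationships can be sketched as set operations on exons. This is an illustration under stated assumptions rather than the procedure actually used: each exon is represented as a hypothetical (chromosome, start, end) tuple, two exons are considered shared only when their coordinates are identical, both transcripts are assumed to have at least one exon, and the categories are treated as mutually exclusive with the precedence Match, Including, Included, Overlap.

```python
def classify(result_exons, annotation_exons):
    """Classify a result transcript against an annotated transcript.

    Both arguments are iterables of (chrom, start, end) tuples.
    Assumption: exons are shared only if their coordinates are identical.
    """
    res = set(result_exons)
    ann = set(annotation_exons)
    if res == ann:
        return "Match"        # all exons agree
    if res < ann:
        return "Including"    # annotation contains every result exon, plus more
    if ann < res:
        return "Included"     # annotation exons are a subset of the result
    if res & ann:
        return "Overlap"      # at least one exon in common
    return "None"             # no shared exon

# Hypothetical exons on chr1, for illustration only.
e1 = ("chr1", 100, 200)
e2 = ("chr1", 300, 400)
e3 = ("chr1", 500, 600)

print(classify([e1, e2], [e1, e2]))      # same exon set
print(classify([e1], [e1, e2]))          # result is part of the annotation
print(classify([e1, e2, e3], [e1, e2]))  # annotation is part of the result
print(classify([e1, e3], [e1, e2]))      # one exon in common
```

A looser definition, for example one that treats exons with overlapping but unequal coordinates as shared, would move some transcripts from "None" to "Overlap"; the strict equality used here is only the simplest choice.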
As shown in
and (b) are calculated by the OmicSoft software with different parameters. The x-axis values of Figures 2(c) and (d) are calculated by Cufflinks with different parameters. The x-axis values of Figures 2(e) and (f) are calculated by CLC Genomics Workbench and Scripture, respectively. Comparing Cuff.(-G) and OMIC(G) on the “Match” aspect, the larger values of Cuff.(-G) may explain why its expression values correlate more strongly with those of Affy than OMIC(G) does. On the “Including” aspect, both Cuff.(-G) and OMIC(T) have very small values, whereas Scripture has the largest values. This can be explained by Scripture using only RNA-Seq data, without public annotations as reference, while both Cuff.(-G) and OMIC(T) fully utilize reference annotations for transcriptome assembly.
In this article, we have presented the evaluation results for a set of public and commercial tools for gene expression quantification by RNA-Seq. Because of rapid improvements in RNA-Seq data generation, more effort is needed in the areas of transcriptome analysis, mutation detection, and fusion identification. New questions will continue to emerge and novel programs will evolve. Tool evaluation must keep pace with these changes if RNA-Seq technologies are to be applied to drug discovery and development.
This work was supported in part by a project from AstraZeneca.