Evaluation of Ribosomal RNA Removal Protocols for Salmonella RNA-Seq Projects *

Next generation sequencing is a powerful technology whose application in sequencing entire RNA populations (RNA-Seq) of food-borne pathogens will provide valuable insights. A problem unique to prokaryotic RNA-Seq is removal of ribosomal RNA. Unlike eukaryotic messenger RNA (mRNA), bacterial mRNA species are devoid of polyadenylation at the 3’-end and thus the approach of affinity enrichment of mRNA using oligo-dT probes is not an option. Among several approaches to enriching mRNA molecules, removal of ribosomal RNA (rRNA) by subtractive hybridization has been widely used. This approach is a single-step procedure for which several rRNA-depletion kits are commercially available. We evaluated three commercially available rRNA-depletion kits to determine their respective efficiencies of rRNA removal from Salmonella enterica serovar Typhimurium strain SL1344. The three protocols achieved varying degrees of rRNA depletion and resulted in 8 to 1000-fold enrichment of mRNA. rRNA removal probes from two of the three kits were unable to titrate out 23S rRNA species while removal of 16S rRNA was less efficient. The Ribo-Zero kit was most efficient in eliminating 16S, 23S and 5S ribosomal RNA species from the transcriptome of S. enterica serovar Typhimurium strain SL1344.


Introduction
In order to gain insight into quantitative differences in gene expression, whole transcriptome sequencing (RNA-Seq) has become increasingly popular.Its application for the analysis of food-borne pathogens could be very beneficial as it has the potential to identify specific gene expression pathways associated with growth, survival and virulence under a variety of food processing and storage conditions.Despite the perception that prokaryotes have a simpler genome structure and transcription apparatus, RNA population structures are nonetheless of comparable complexity.The bacterial RNA population is comprised of rRNA, tRNA, non-coding/antisense/regulatory RNA and mRNA [1,2].
In eukaryotes, oligo-dT based capture hybridization protocols are very effective in eliminating other RNA species such as rRNA, tRNA and non-coding RNA.The challenge in bacterial mRNA enrichment is that oligo-dT based protocols are of no use as polyadenylation of bacterial mRNA is very limited [3,4].The task of bacterial mRNA enrichment becomes further complicated by the fact that >97% of bacterial total RNA is comprised of rRNA [2,5,6].Although various approaches have been proposed [5,7,8], hybridization capture of rRNA with complimentary oligonucleotide probes has been popular [9][10][11].Until recently, MICROB Express kit was considered as the best protocol for eliminating ribosomal RNA species [7].Additionally, several laboratories have designed combinations of different approaches and commercial kits to eliminate rRNA species [1,8].In spite of these attempts, mRNA enrichment was only modest with less than 50% of aligned sequencing reads representing transcripts other than rRNA [1,9,11].
Considering that the efficiency of rRNA removal by oligonucleotide capture may vary among different bacteria, we set out to compare and optimize rRNA depletion protocols for Salmonella RNA-Seq experiments.We selected three hybridization-based rRNA elimination kits.MICROBExpress (Ambion, TX) has been used widely in earlier studies [1,7,11].This protocol uses magnetic beads, derivative oligonucleotide probes designed to capture 16S and 23S rRNA species.RiboMinus kit (Invitrogen, CA) uses Locked Nucleic Acid (LNA)-monomers as rRNA probes where a ribonucleoside is linked between the 2' oxygen and 4' carbon atom of the methylene ring [12].This configuration increases the melting temperature and stability of the oligonucleotide probe and rRNA complexes [13].The Ribo-Minus kit utilizes streptavadin-coated magnetic beads that bind to biotinlabeled oligonucleotide probe thus facilitating easy removal from hybridization reaction mixture.The third and newest kit, Ribo-Zero (Epicenter, Madison, WI), uses magnetic beads and biotinylated capture probe, but unlike other kits, it promises to remove rRNA including the 5S species using only half the starting quantity of total RNA compared to other two kits.

Experimental Design
We isolated total RNA from Salmonella enterica serovar Typhimurium strain SL1344 grown in Luria broth till mid-logarithmic growth phase [14].Total RNA, free of genomic DNA, was subjected to one of three methods of rRNA depletion.RNA-Seq libraries were made from either total or rRNA-depleted RNA compatible with Illumina sequencing platform.Libraries were bar-coded with index-primers and examined for median insert length on the Experion (BIO-RAD Laboratories, Hercules, CA) microfluidic platform.Libraries were sequenced using the paired-end 100-cycle procedure on a HiSeq2500 Illumina sequencer.

RNA Extraction
For RNA isolation, an overnight culture of S. enterica SL1344 was diluted 1000-fold into Luria-Bertani-Miller (LB) broth and grown with shaking under aerobic conditions (230 rpm, 37˚C) for 3.5 -4.0 h until the OD 600 reached 0.4 -0.45.Three biological replicates were used for each treatment and the control.Cells were treated with RNA stabilization reagent and total RNA was isolated as described [15,16].RNA was treated with RNasefree DNase twice (once using on-column DNase treatment as per Qiagen RNeasy protocol, and once in-solu-tion as per New England BioLab's protocol and re-purified using Qiagen RNeasy kit.Finally RNA was examined for DNA contamination by performing real-time PCR using primers against invA gene as previously described [17].RNA quality and quantity measurements were performed using the Experion microfluidic gel platform (Bio-Rad Laboratories, Hercules, CA).

Removal of rRNA
Removal of rRNA was performed using one of the three protocols.For MICROB Express and RiboMinus selective hybridization kits, a total of 10 µg RNA was processed.For the Ribo-Zero kit, the recommended starting RNA quantity was 5 µg.rRNA depletion was carried out as per manufacturers' instructions.The relative efficacy of the three rRNA-depletion methods as well as the RNA concentration of samples was assessed using the Experion microfluidic platform and high-sensitivity picoRNA chips.

Construction and Sequencing of Illumina Strand Specific RNA-Seq Libraries
Strand-specific libraries were constructed starting with 8 -50 ng of rRNA-depleted RNA or 100 ng of total RNA.Randomly-primed cDNA was synthesized using the ScriptSeq v2 RNA-Seq library preparation kit (Epicenter, Madison, WI). cDNA was purified using Agencourt AMPure XP system (Beckman Coulter, Indianapolis, IN).Libraries were amplified using FailSafe PCR enzyme kit (Epicenter, Madison, WI).Typically 12 (50 -100 ng starting RNA) or 15 (8 ng starting RNA) PCR cycles were used with one of the ScriptSeq Index primers replacing the reverse primer from the kit.After PCR amplification, libraries were purified and size selected (~280 bp) using Agencourt AMPure XP system (Beckman-Coulter, Indianapolis, IN).The resulting cDNA was analyzed for library insert sizes using the Experion microfluidic platform and 1K DNA chips (Bio-Rad Laboratories, Hercules, CA).Libraries were sequenced on Illumina Hi-Seq2500 instrument using the 100 cycle paired-end mode.

Mapping and Assembly of Raw Reads
Transcriptome assembly and mapping were carried out using Seqman Ngen v. 11 (DNAstar, Madison, WI).The S. enterica serovar Typhimurium SL1344 chromosome and three plasmids were used as the reference genome (GenBank accession numbers: NC_016810, NC_017718, NC_017719, NC_017720).The default parameters for pair-end assemblies were: 45 bp as the minimum aligned length, 21 as mer length, and a minimum match percentage of 93.

Analysis of Gene Expression and Abundance
Analyses were carried out using ArrayStar v.11 (DNA star, Madison, WI).Seqman Ngen v.11 assembly files were loaded and preprocessed using the Qseq software.We used all the Array Star v.11 default settings.We only used features that have been annotated as genes, that includes rRNAs and tRNAs.We excluded known pseudogenes from the analysis.
To calculate intergenic reads, the total numbers of mapped reads to annotated genes were subtracted from reads that mapped to the entire chromosome and three plasmids.The normalized expression values (read counts) for each gene were calculated as RPKM values [defined as 1000 × (Assigned number of reads/Gene length) × (10 6 /Total mapped reads).Statistical comparisons were made using ANOVA followed by all pair-wise multiple comparisons using the Holm-Sidak method with a significance level of p < 0.05 (SigmaStat version 3.0).

Nucleotide Sequence Accession Numbers
Raw sequences reported in this article have been deposited in the NCBI Short Read Archive (SRA).Submission ID SRX347205.

Results and Discussion
The lack of polyadenylation at the 3' end of bacterial mRNA and the abundance of rRNA (>95% of total RNA), present unique challenges for examining prokaryotic gene expression using RNA-Seq.Although several approaches have been proposed to eliminate rRNA or selectively transcribe mRNA using specific probes, there is no consensus in the available literature and the published data are highly variable [7][8][9]11].Salmonella species constitute a major food borne threat to public health.RNA-Seq has the potential to elucidate the genetic regulation and pathways that Salmonella species utilize for survival in various food processing environments.Our goal was to evaluate commercially available rRNA depletion kits and optimize a rRNA depletion protocol for use in bacterial transcriptome sequencing.
We compared three commercial kits for rRNA removal, MICROBExpress, Ribo-Zero and RiboMinus using RNA isolated from logarithmic-growth phase S. enterica serovar Typhimurium SL1344 cells.Strand specific RNA-seq libraries from total (undepleted) and rRNA-depleted samples were prepared, sequenced, and mapped to the reference genome.Sequence reads that aligned to rRNA genes and to the coding DNA sequence (CDS) of annotated genes were counted and used for comparison (see Materials and Methods).
Without depletion, 97% of the mapped reads aligned to rRNA (Figure 1) with only about 1% mapping to mRNA and the remainder to intergenic regions.We observed that 90.8% ± 2.6% and 92.9% ± 1.3% of the reads mapped to rRNA genes when rRNA was depleted using MICRO-BExpress and RiboMinus protocols respectively.In parallel, these treatments increased reads mapping to CDS (coding DNA sequences, i.e. genes) from 1.8% ± 0.26% (total RNA) to 4.8% ± 0.9% (RiboMinus) and 6.6% ± 1.8% (MICROBExpress) (P <0.01) (Figure 1).Thus, Ribo-Minus and MICROBExpress depleted only a small fraction of the rRNA present in each sample.In contrast, Ribo-Zero was able to deplete the vast majority of rRNA as few reads mapped to rRNA from these samples.Concomitantly, reads corresponding to mRNA and mapping to CDS regions of the annotated genome increased from 1.8% ± 0.26% (total RNA) to 70% ± 0.9% in Ribo-Zero protocol and the remainder 29.38% ± 0.97% mapped to intergenic regions (see below).In spite of varying rRNA depletion efficiencies, biological replicates for each protocol were highly correlated with R 2 values ranging from 0.83 to 0.95 (see legends to Figure 1).Importantly, normalized gene expression values i.e., RPKM values were strongly correlated among different libraries prepared before and after rRNA depletion (Table 1).High r 2 values indicate absence of any systematic bias caused by removal of rRNA species.Additionally, no bias was observed for the effect of gene length and mapped reads before and after depletion (Figure 2).A linear regression value, r 2 of 0.037 was observed for mapped reads obtained from total RNA.Upon rRNA depletion by MI-CROBExpress, Ribo-Minus and RiboZero resulted in  a Regression values (R 2 ) were calculated between pairs of rRNA depletion protocols by performing linear regressions in ArrayStar using the least squares method.The data used for the regression were the log-2 transformed mean RPKM calculated from the three replicates of each protocol.

OPEN ACCESS AiM
higher median RPKM range but lower r 2 values of 0.019, 0.016 and 0.006 respectively indicating very little, if any, correlation between gene length and mapped reads.
To investigate the relatively low rRNA depletion efficiencies of MICROBExpress and Ribo-Minus kits, we analyzed the reads mapped specifically to the 23S and 16S genes (Figure 3).23S and 16S rRNA species are encoded by 7 structural genes (each) in the genome of SL1344.We observed that differences in rRNA depletion among the kits were not merely due to overall poor depletion efficiency, but also due to the inability to specifically deplete the 23S rRNA species.For example, 16S rRNA species encoded by SL1344 rRNA001 (and six other genes) were depleted by ~1-log (MICROBExpress and Ribo-Minus) to 3-logs (RiboZero) (Figure 3(a)).In contrast, 23S rRNA species encoded by SL1344_rRNA005 (and six other genes) were depleted only by RiboZero (Figure 3(b)).
We further evaluated the efficiency of 16S rRNA depletion by quantitative reverse transcriptase PCR.A low abundance gene, opgH, encoding a osmoregulatd periplasmic glucan (OPG) [14] was chosen to evaluate mRNA enrichment as a result of rRNA depletion.Transcripts Our observations also are in agreement with He et al. [7] who reported difficulty in accurately correlating rRNA depletion by RNA-seq and qPCR methods for treatments resulting in low rRNA removal efficiency.One of the main objectives in RNA-seq experiments is quantifying gene expression under different sets of conditions.We compared the proportion of total open reading frames (ORFs) detected between total RNA and post-rRNA-depletion (Figure 4).Determining the percentage of ORFs detected using a threshold of 30 mapped reads per ORF, even a modest depletion of rRNA (5% -7% depletion) resulted in two-fold increase in % ORFs de-   OPEN ACCESS AiM tected (17.9% for total RNA vs 35% -36% for Ribo-Minus and MICROBExpress treatments).These values were much less compared to 80% ORF detection following RiboZero treatment.It is worthwhile to note that these two rRNA depletion treatments as well as total RNA provided enough mapped reads to detect 87% -94% of the ORFs with minimum of 1 mapped read per ORF, which is very close to the 99.5% ORFs detected post RiboZero treatment.It is the accurate detection of low abundance transcripts that is facilitated by the efficient removal of rRNA as documented by the fact that 92.9% and 80.6% ORFs were detected post RiboZero depletion, using a minimum threshold of 10 and 30 mapped reads per ORF respectively.As shown in Figure 1, a high proportion of fragments were mapped to integenic positions.In E. coli a higher proportion of the genome is thought to be transcribed than indicated by current gene annotation, and recent studies have documented large numbers of transcripts mapping to intergenic regions postulated to be small noncoding RNAs [9].Alternative or spurious transcription initiation events may also give rise to high proportion of intergenic reads.Another reason for intergenic reads could be low level contamination of Salmonella genomic DNA (gDNA).This appears rather unlikely due to the fact that total RNA was subjected to 2 rounds of DNase treatments and no gDNA was detected following 35 cycle-PCR prior when rRNA-depleted RNA preparations were used as template.Moreover, had the high intergenic reads come from gDNA, we would expect a uniform distribution of such reads, which we do not find to be the case.Intergenic reads were not uniformly distributed, nor did all genes exhibit the same pattern.Broadly speaking, there were four patterns (Figure 5): 1) sequences with high intergenic reads but low ORF reads, 2) sequences with high intergenic reads and high ORF reads, 3) sequences with low intergenic reads but high ORF reads, and 4) sequences with low intergenic reads and low ORF reads (data not shown).For instance, the Salmonella pathogenicity island gene, hilA, is expressed in exponentially growing cells and has its transcriptional start site 500 bases upstream from its translational start site [18] (Figure 5(b)).Similarly the fliC gene, expressed highly during active growth phase and studies in E. coli have identified the transcriptional start 70 bases upstream [19] which would account for fewer reads in the 5' upstream region of fliC (Figure 5(c)).

OPEN ACCESS AiM
Taken together, our data indicate that RiboZero rRNA depletion protocol was superior to the other depletion methods tested and worked well with the model Salmonella strain SL1344.Even a modest depletion in rRNA, which was observed with MICROBExpress and Ribo-Minus rRNA protocols, resulted in a significant increase in the proportion of ORFs detected.When using a threshold as high as 10 mapped reads per gene, 92% of the ORFs were detected in RiboZero treated RNA samples as compared to 40% of the ORFs in total RNA and ~60% of the ORFs following MICROBExpress or Ribo-Minus rRNA depletion protocols.Depletion of rRNA species using RiboZero generated reproducible results with no apparent bias.

Conclusion
In conclusion, the data present herein illustrate the importance of proper sample preparation for use in next generation sequencing.While it was surprising that even with very limited rRNA depletion, it was possible to detect most transcripts, albeit at a very low coverage, efficient removal of rRNA is key to in-depth coverage.To maximize the amount of data that can be obtained and to reduce the average cost per sample, multiple samples are often combined in a single run by using different index primers for each sample.By efficiently removing rRNA, more considerable mRNA-derived sequence data can be obtained thus allowing more samples to be combined in a single run and reducing overall cost without sacrificing depth of coverage.

Figure 1 .
Figure 1.Efficacy of different ribosomal RNA depletion protocols and resultant mapping of RNA-Seq reads.Raw reads mapping to S. enterica serovar Tyhpimurium genome were grouped into three classes: mRNA, rRNA and intergenic regions.Different letters in each group indicate significant difference with p < 0.001.Three biological replicates were used for each treatment and the control.Linear regression values (r 2 ) for RPK expression signals between three replicates were 0.833 ± 0.005 (Total RNA), 0.957 ± 0.004 (RiboZero), 0.878 ± 0.006 (RiboMinus) and 0.869 ± 0.008 (MICROBExpress).

Figure 2 .
Figure 2. Effect of gene length on expression signal (RPKM).r 2 values were calculated between pairs of conditions by performing linear regressions in Sigma Plot 11.The data used for the regression were the log-2 transformed mean RPKM calculated from the three replicates of each condition.Dashed lines mark 99 percentile confidence interval.Prediction intervals, also called the confidence interval for the population, describe the range where the data values will fall a percentage of the time for repeated measurements.

Figure 3 .
Figure 3. Reads mapped to 16S rRNA genes (a) and 23S rRNA genes (b) from the libraries prepared from either total RNA or rRNA-depleted by three protocols.RPKM values were calculated from three replicates of each condition.correspondingto gene opgH were quantitated by measuring Ct values along with transcripts for 16 S rRNA_001 gene from total RNA and after rRNA depletion (Table2).Ct values provide an independent evaluation of transcript abundance.High abundance transcripts need fewer PCR cycles to cross the detection threshold (i.e.Ct value).Accordingly 16S RNA gene transcripts were detected in total RNA samples after only 15.23 ± 0.19 PCR cycles, whereas 29.02 ± 0.48 cycles were needed post RiboZero rRNA depletion.This observation correlates with the ~3-log reduction in mapped reads for 16S rRNA genes following RiboZero treatment (Figure3(a)).In parallel, RiboZero rRNA depletion resulted in earlier detection (i.e.Ct values of 32.24 ± 1.33 vs 25.19 ± 0.65 for before and after depletion respectively) of the low abundance transcript, opgH.MICROBExpress and Ribo-Minus treatments also resulted in earlier detection of the opgH transcript (difference of 3-5 PCR cycles) which roughly matched to the moderate depletion of

a
Ct values were calculated from three biological replicates and each replicate was tested three times.Values with different letters in each column denote significant difference at p < 0.001.

Figure 4 .
Figure 4. Efficacy of different rRNA depletion protocols on detection of gene expression at different threshold mapped reads.The percent of open reading frames detected (ORF) were plotted against the number of mapped reads per ORF.As the stringency is increased, i.e., minimum number of mapped fragments per ORF, the percentage of the ORF detected decreases and this varied according to the rRNA removal method used.

Figure 5 .
Figure 5. Different categories of intergenic reads (a) Low or no gene expression but high intergenic reads; (b) High gene expression and high intergenic reads and (c) High gene expression and no or low intergenic reads.