Lessons from microRNA Sequencing Using Illumina Technology

The numbers of reads generated by second-generation sequencing technologies permit to establish in a single sequencing lane multiple microRNA (miRNA) expression profiles from small RNAderived cDNA libraries tagged by barcodes consisting of few bases. Multiplex sequencing allows sample size expansion and thus the statistical reliability of generated data. This allows the detection of discrete changes in miRNA expression levels that occur at the onset of cellular processes. With the development of the “by-amplification” strategy, tagging cDNA libraries is no more a source of technical variability. However, other specific features should be kept in mind when designing experiments aimed at profiling miRNA expression using Illumina sequencing technology, the most important being the substantial distortion between miRNA expression in sequencing data and the true miRNA abundancy. miRNAs of low expression in profiles may correspond to abundant miRNAs in samples and vice versa. We report here data obtained from rat cerebellum and liver that illustrate 1) the high 3’ adaptor dependency of miRNA expression profiles, 2) the impact of sample size when working with moderate (3 4 fold) changes of miRNA expression and 3) the impact of the statistical tools used to identify differentially expressed miRNAs.


Introduction
In the last decade, three next-generation sequencing (NGS) technologies (the Roche/454 GS FLX, llumina/ Solexa Genome Analyzer and Applied Biosystems SOLiD system) have been released revolutionizing the field respectively, from Safe (Augy, France)). At birth, litters were sized to ten pups to prevent under-or overnutrition during lactation. At weaning, all animals were fed the standard food. From 4.5 months, half of each progeny was shifted to the unbalanced food for 10 weeks. Animals of 2.5 or 7 months were killed by decapitation under isoflurane deep anesthesia between 9 am and 10 am, following a 15-16-hour fasting. Brain was removed from males of 2.5 months born to dams fed a standard food, cerebellum was dissected and immediately homogenized in a ceramic bead tube (Ozyme) containing 2 ml of QIAzol lysis reagent (Qiagen) using a PreCellys homogenizer (PreCellys 24/Cryolys) for 20 s at 10˚C. Livers from males of 7 months were taken, immediately frozen in liquid nitrogen and stored at −80˚C. Samples were homogenized as described for cerebellum.

Purification of miRNA-Enriched Small RNA Fractions
Small (<150 -200 bases) and long (>150 -200 bases) RNA fractions of cerebellum were separated from QIAzol homogenates using the RNeasy Mini Kit (Qiagen) following manufacturer's instructions and recovered in RNase-free H 2 O (Invitrogen). For liver, nuclei acids in the aqueous phase of the Qiazol extraction were precipitated by adding 1 volume of isopropanol. Pellets were resuspended in RNAse-free H 2 O. Part of small RNAs was heated in 50% (v/v) formamide, for 3 min at 70˚C, and size-fractionated on a 17% Tris-Borate-EDTA (TBE) urea (8 M) polyacrylamide gel. RNAs of 16 -40 bases were eluted in 0.4 M NaCl overnight at 4˚C and precipitated in presence of glycogen (0.04 μg/μl; Invitrogen) by addition of 3 volumes of EtOH. After centrifugation at 13,200 rpm for 30 min at 4˚C, pellets were rinsed twice with cold 70% EtOH, air dried at room temperature and resuspended in RNase-free H 2 O.

Construction of Tagged cDNA Libraries
All DNA and RNA oligonucleotides ( Table 1) used in the preparation of cDNA libraries were purchased from Sigma (France). 3' adaptors were bought phosphorylated at the 5' end and blocked at the 3' end by a C7-Amine residue. They were adenylated as described [6].

Tagging with Bulged Primers
For each library, 16 -40 bases RNA (corresponding to the RNA obtained from 300 -600 ng of total RNA) was ligated to 0.25 pmoles of the adenylated BC8 3' adaptor. Ligation reaction was performed in a final volume of 10 μl containing 2.4 μl of PEG 8000 (Biolabs), 1X truncated T4 RNA ligase 2 buffer (Biolabs) and 0.4 μl (20 U/μl) of T4 RNA ligase 2 truncated (Biolabs). The ligation reaction was incubated for 75 min at 25˚C. The 5' RNA adaptor was subsequently ligated to RNA by adding 1 pmole of adaptor to the mixture. After 3 min denaturation at 70˚C, the mixture was kept on ice while adding 1.0 μl of (20 U/μl) T4 RNA ligase 1 (Biolabs) and 10 pmoles of ATP. The ligation reaction was incubated for 60 min at 25˚C. The ligated RNA was converted to single stranded cDNA using 130 units of Superscript II reverse transcriptase (Life technology) and 50 pmoles of RT1-primer at 50˚C for 90 min in a final volume of 30 μl. Following reverse transcription, a PCR mix was directly added to the RT reaction tube that contained 30 μl of 2X Master Mix Phusion enzyme (Biolabs), 60 pmoles of P B primer and 100 pmoles of bulged primer. The resulting 60 μl reaction volume was distributed into 4 × 15 μl to enhance thermic exchange, denaturated for 1 mn at 98˚C and submitted to 16 PCR cycles (20 s at 98˚C, 30 s at 55˚C, 25 s at 72˚C). PCR products were EtOH precipitated in sodium acetate 0.3 M final and resuspended before size-fractionation on 6% TBE polyacrylamide gel to purify the small RNA-derived cDNAs of 90 -95 base pair (bp) from the adaptor dimer byproducts of 74 bp. cDNAs were eluted as described above and resuspended in H 2 O.

Tagging with Barcoded Primers (TruSeq)
For each library, 4 μl of 16 -40 bases RNA was ligated to 0.25 pmoles of the RA3 3'adaptor. Following ligation to the 5' adaptor (1 pmole) and reverse transcription with the RT2-primer (50 pmoles), PCR amplification was performed in presence of 100 pmoles of TruSeq primer under the same thermal cycling conditions as above. PCR products of 130 -140 bp were purified from the 112 bp dimer adaptor byproducts as described above.

Sequencing Data Analysis
A GAIIx sequencing lane yields approximately 30 -32 millions quality-filtered reads. The TruSeq barcoded reads were obtained demultiplexed from Imagif platform (Centre de Recherche de Gif http://www.imagif.cnrs.fr). The barcoded BP reads were demultiplexed using our script written for this purpose. This script retrieved sequences having the first eleven nucleotides of the 3' adaptor. The filtered reads were trimmed from the adaptor Table 1. Names and sequences (5'-3') of oligonucleotides used in the three methods of small-RNA library preparation. Barcodes of the cDNA amplification primers are in bold. The canonical sequence of the reverse PCR bulging primers is given in which NN symbolizes different dinucleotides.  [15]. Reads were mapped onto the BN/SsNHsd/Mcwi reference rat genome allowing 2 mismatches. This server uses the miRBase (http://www.mirbase.org) and additional databases to classify the mapped reads into miRNAs, miRNA-related and non-miRNA sequences. Sequences that do not fall into these annotation categories but that match on the reference genome are tested for the encoding of putative novel miRNAs. Concerning former categories, reads are assigned to the miRNA genes that have been identified in the rat genome. Detailed counts of known mature miRNA and mature miRNA-star (the less stable strand of the RNA duplex precursor) sequences as well as detailed counts of new variant sequences not yet annotated in the miRBase database are provided. The former designation of mature miRNA/miRNA* has been replaced by the miRNA-3p/miRNA-5p designation according to the hairpin arm precursor miRNA arise from. The variant sequences denoted "Hairpin rno-miR'" differ from known miRNAs at the 5' end by one base and define 5'_isomiRs. miRNA sequence-based profiles were then constructed and compared. To account for library size differences, we computed the expression of each miRNA the DESeq normalization methods [16]. The raw data files have been submitted to SRA database (NCBI) under the accession numbers (SRX363287-90, SRX363303-18, SRS1443267 and SRS1443207). Individual group statistics were calculated using Mann and Whitney tests and corrected for multiple testing.

First Challenge: No Barcode Bias
As illustrated in Figure 2, the introduction of barcodes is no more a source of technical variability when barcoding through cDNA amplification with tagged primers (TruSeq indexes or bulged primers). As mentioned in the Introduction, significant variability used to be introduced when using a 'by-ligation' multiplexing strategy. This is important to recall because by-ligation protocols were still in use in the scientific community [10]. It is also important to take into account when re-examining early published data.

Second Challenge: MicroRNAs of Low Expression
In sequencing data, moderately or even weakly expressed miRNAs can actually correspond to abundant miRNAs in the samples because of adaptor bias. Figure 3 highlights the strong 3' adaptor-dependency of  miRNA expression profiles. Studies devoted to RNA ligase activities have shown that, for a given adaptor, RNA ligases favour some miRNA species over others [8] [9]. This selective capture impairs measurement of the true (absolute) expression level of miRNAs and less biased library preparation remains an important issue as testified by the recent publications in the field of small-RNA deep sequencing [17]- [19]. Our study markedly illustrates how much miRNA expression profiles are dependent of the adaptor sequence per se. From the same small RNA sample, we have built eight multiplexed BP libraries and four TruSeq multiplex cDNA libraries. These two protocols used the same 5' adaptor but two 3' adaptors of unrelated sequences (BC8 or RA3). Following the construction of mean profiles specific of the BC8 or RA3 oligonucleotides, miRNA expression were established in reads per million (RPM). Both adaptors approximately capture the same number of miRNAs with a common list of 268 miRNAs, "the 268-set". However, these miRNAs displayed important variation that can be two orders of magnitude even for miRNAs of high expression ranks (Figure 3(a)). For instance we obtained in BP and TruSeq mean profiles respectively, 760,000 versus 8000 for let-7d-5p, 20,000 versus 204,000 for miR-26a-5p, 1500 versus 78,000 for miR-30a-5p. In Figure 3(b), we plot for each miRNA its expression value in the mean BP profile versus the RA3 one. Four categories of miRNAs can be detected according to expression fold changes: 1 to 3, 3 to 10, 10 to 50, and beyond. Our data confirmed the strong sequence and/or structural-fold preference of RNA ligases ultimately leading to 3'-or 5'-adaptor-dependent miRNA profiles [20]. To solve this problem and allow a more accurate measurement of the absolute expression level of miRNA, a strategy had been proposed using pools of adaptors degenerated at their first positions [8] [17] [18] [20]. However, this cannot be sufficient to achieve the "true" miRNome characterization as we reported that profiles recovered with a family of barcoded adaptors displayed variation far below the variation observed between adaptors differing over their entire length for a large number of miRNAs (personal data). So it is important to estimate the abundance of a miRNA using alternative quantification techniques such as in situ hybridization or calibration curve in real-time RT-qPCRs.

Third Challenge: Quantification of Moderate Expression Changes
miRNA repertoires and miRNA expression profiles are specific of tissues and/or physiological/pathophysiological status. The range of miRNA expression changes is expected to greatly differ, depending on their biological status: small miRNA expression differences (<3) often characterize a tissue under different physiological conditions while miRNA expression differences of several orders of magnitude are observed between different tissues, or healthy tissues versus neoplastic tumors. In former cases, estimates of expression change will be highly dependent on the accuracy of expression level between sample groups. As shown in [21], increasing the number of biological replicates is powerful to increase accuracy of expression level of highly or moderately expressed genes. For lowly expressed genes, adding replicates and sequencing depth both participate to increase accuracy of expression levels. From our experience on miRNA populations of rodent tissues and body fluids, we recommend to sequence 5 -7 biological replicates at a sequencing depth of at least 1 -2 millions miRNA-related reads [22]- [24]. As an example we provide here comparison of miRNA expression profiles of liver of 7 months old rat fed standard diet (7 biological replicates) versus high-fat diet (6 biological replicates). The 13 libraries were sequenced at a depth of 3.1 ± 0.2 millions reads. More than six hundreds miRNAs were detected giving a total of 181 miRNA gene families. Eleven miRNAs were identified as differentially expressed (padj-values < 0.1). Fold-changes of expression ranged from 2 to 4. Reducing the sample size to 3 -4 replicates markedly deceased the number of identified differentially expressed miRNAs (Figure 4). In one case, only 4 differentially expressed miRNAs could be identified, in another one, only 1 differentially expressed miRNAs could be identified while in two other cases, no differentially expressed miRNA was identified. In addition, we would like to draw attention to the fact that very different statistical tools are used for the characterization of false-discoveryrates (FDR) [25]- [27]. Figure 5 illustrates the fact that different FDR corrections give very different lists of putatively differentially expressed miRNAs.

Conclusion
In summary, this work shows 1) the dependency of miRNA expression profiles from adaptors. This questions the statement of abundance widely used in the literature from sequence data. This may in turn influence our vision of relationships between miRNA expression and function; 2) the impact of sample size on comparative expression analyses. This should be strengthen because profiling individual miRNA expression of complex tissue is becoming a major breakthrough notably in brain specialized structures or substructures which express a high number of ubiquitous and specific miRNAs [28]- [30].  . Impact of statistical tools. miRNA expression profiles of rat liver were constructed using the TruSeq protocol. Samples have been compared using a DESeq-like procedure and methods described by Holm (padj-values(H)) [28], Benjamini and Hochberg (padj-values(BH)) [29] or Benjamini and Yekutieli (padj-values(BY)) [30] to characterize falsediscovery-rates and whole sets of replicates (7 for sample C and 6 for sample HF). Out of 256 miRNAs identified by more than 10 reads, 73 displayed fold-change of expression with p-values < 0.05. Two miRNAs displayed fold-change of expression with a padj-values (BH) < 0.05, 10 miRNAs, with a padj-values (H) < 0.05. No miRNA displayed fold-change of expression with a padj-values (BY) < 0.05. Depending on the test, 0 -10 miRNAs therefore appeared putatively differentially expressed between samples C and HF.
(Centre de Recherche de Gif http://www.imagif.cnrs.fr). This work was supported by the Centre National de la Recherche Scientifique, the University Paris-Sud, XI and grant from Danone/FRM 2011.