Identification of MicroRNA Precursors with New Sequence-Structure Features

MicroRNAs (miRNAs) are an important subclass of non-coding RNAs (ncRNAs) and serve as main players in RNA interference (RNAi). A mature microRNA is derived from a stem-loop structure called a precursor. Identification of precursor microRNAs (pre-miRNAs) is an essential step toward finding microRNAs in a whole genome. The present work proposes 25 novel local features for identifying the stem-loop structure of pre-miRNAs, which capture characteristics of both the sequence and the structure. First, we pulled the stem of each hairpin and aligned the bases in bulges and internal loops using '-'. We then counted the 24 base-pair types ('AA', 'AU', …, '-G', excluding '--') in the pulled stem, normalized by the length of the pulled stem, and used them as the feature vector of a Support Vector Machine (SVM). Three classifiers using our features with different kernels, trained on human data, all outperformed the Triplet-SVM-classifier on the positive and negative testing data sets. Moreover, we achieved higher prediction accuracy by adding 7 global sequence-structure features. The results indicate the validity of the novel local features.


INTRODUCTION
MicroRNAs (miRNAs) are small regulatory non-coding RNA molecules, 17-25 bp long, whose function is to down-regulate gene expression in a variety of ways, including translational repression, mRNA cleavage, and deadenylation [1,2]. More than one-third of human genes are thought to be regulated by miRNAs, and these molecules are among the most numerous regulators in eukaryotic genomes. The miRNA genes are initially transcribed as long primary transcripts (pri-miRNAs), which are then processed into shorter, 60-120 bp stem-loop structures (called hairpins) known as miRNA precursors (pre-miRNAs) [3]. Finally, the mature miRNA is released from one of the two strands of the pre-miRNA hairpin and binds to a complementary target in the mRNA, which induces mRNA cleavage or translational repression [4].
Although the majority of miRNAs were identified experimentally [5][6][7], computational prediction techniques have become both possible and necessary owing to the accumulation of information and data about miRNA properties [8]. Existing computational prediction methods can be classified into two categories: comparative sequence analysis approaches and de novo (or ab initio) predictive approaches. Methods in the first category are based on the assumption that miRNA genes are conserved in primary sequence and secondary structure across species. Several such algorithms have been developed and successfully used for predicting miRNAs in various species [9][10][11][12][13][14][15][16][17]. However, for a species that has no closely related species sequenced, the first-category methods will not work [15]. For this reason, the second category of methods, the de novo prediction methods, has been developed to predict miRNAs in a single genome. Instead of evolutionary information, these methods use characteristics of the sequence and/or secondary structure of pre-miRNAs. The stem-loop hairpin structure is the most noticeable characteristic of pre-miRNAs, but it is not discriminative, because a large number of non-pre-miRNA sequences can fold into pre-miRNA-like hairpins. To identify pre-miRNA hairpins, most existing methods use sets of features concerning sequence composition [17][18][19], topological properties of the stem-loop [17,19,20], thermodynamic stability [17,19,20], and sometimes other properties including entropy measures [19]. Xue [18] showed that the local contiguous substructures of pre-miRNAs are significantly distinct from those of pseudo pre-miRNAs.
Moreover, most de novo methods employ machine learning techniques to identify pre-miRNAs, such as Hidden Markov Models (HMM) [21,22], Support Vector Machines (SVM) [17][18][19][23], Naïve Bayes [24], Random Forest [25] and Random Walks [26]. In this work, novel local sequence-structure features of pre-miRNAs based on the "pulled" stem-loop structure are introduced, and an SVM is employed as the classifier to distinguish real pre-miRNAs from pseudo ones. These features contain information on both the sequence and the structure of pre-miRNAs. Moreover, a new positive testing data set was built from the updated miRNA registry database [28] following Xue's procedure [18]. The tests show that the new method outperforms the Triplet-SVM-classifier.

Features for Identifying Pre-miRNAs
The main differences in hairpin structure between pre-miRNAs and pseudo pre-miRNAs are the base-pair composition of the stem, the number of bulges and internal loops, and the size of the bulges and internal loops. Both sequence and structure information can be obtained simply by counting base pairs in the "pulled" stem. Inspired by Xue's result, we propose novel local sequence-structure features of pre-miRNAs based on the "pulled" stem of hairpins. First, the secondary structures of the pre-miRNAs and the candidates are predicted with RNAfold [29]. Then the stems of the hairpins are pulled, as Figure 1 shows, and the bases in bulges and internal loops are aligned with '-'. Finally, the occurrences of the 24 base-pair types ('AA', 'AU', …, '-G', excluding '--', with '-' treated as a fifth base) in the pulled stem are counted, as in Table 1, and normalized by the length of the pulled stem. Note that the base pair 'AU' is different from 'UA' because of the direction of miRNA sequences (from 5' to 3'). The number of canonical base pairs ('AU', 'UA', 'GC', 'CG', 'GU' and 'UG') reveals the base-pair composition of the stem. The number of non-canonical base pairs (without gaps) carries information about internal loops, and the number of gapped base pairs carries information about bulges. Another local feature is the length of the pulled stem.
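The pulling and counting steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a single hairpin given in dot-bracket notation (the format RNAfold produces), excludes the terminal loop and the unpaired tails, and pads the shorter side of each bulge or internal loop with '-' on the side nearer the loop. The names `pull_stem` and `local_features`, the alphabet order, and the padding rule are hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGU-"  # '-' is treated as a fifth base (order is an assumption)
# All 5 x 5 ordered pairs except '--' give the 24 pair types.
PAIR_TYPES = [a + b for a, b in product(ALPHABET, repeat=2) if a + b != "--"]

def pair_table(structure):
    """Map each 5' paired position to its 3' partner (dot-bracket input)."""
    stack, pairs = [], {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs[stack.pop()] = i
    return pairs

def pull_stem(seq, structure):
    """Return the pulled stem as a list of two-character columns (5' base, 3' base).

    Assumes a single hairpin, so walking the 5' strand outward-to-inward
    visits the 3' partners in strictly decreasing order.
    """
    pairs = pair_table(structure)
    columns = []
    prev_i = prev_j = None
    for i in sorted(pairs):  # outermost pair first
        j = pairs[i]
        if prev_i is not None:
            left = seq[prev_i + 1:i]            # unpaired 5' bases, outer to inner
            right = seq[j + 1:prev_j][::-1]     # unpaired 3' bases, outer to inner
            width = max(len(left), len(right))
            # Pad the shorter side with gaps (assumed alignment rule).
            columns.extend(a + b for a, b in
                           zip(left.ljust(width, "-"), right.ljust(width, "-")))
        columns.append(seq[i] + seq[j])
        prev_i, prev_j = i, j
    return columns

def local_features(seq, structure):
    """Return (24 normalized pair frequencies, pulled-stem length)."""
    columns = pull_stem(seq.upper().replace("T", "U"), structure)
    counts = Counter(columns)
    n = len(columns)
    freqs = [counts[p] / n if n else 0.0 for p in PAIR_TYPES]
    return freqs, n

if __name__ == "__main__":
    # Toy hairpin with a one-base bulge on the 5' strand (not a real pre-miRNA).
    print(pull_stem("GCAGCUUCGGCGC", "((.((....))))"))
```

In this toy hairpin the single bulged base becomes a gapped column, while an internal loop with one unpaired base on each side would become an ungapped, non-canonical column, which matches the roles the text assigns to the two kinds of counts.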
To improve the performance, 7 global features used in other methods are also included: the number of base pairs, the GC content, the length of the sequence, the length of the central loop, the free energy per nucleotide, and the 5' and 3' tail lengths.
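As a rough illustration, the seven global features can be derived from the sequence, its dot-bracket structure, and the minimum free energy reported by the folding program. The sketch below assumes a single hairpin with at least one base pair; the function name `global_features`, the output order, and the free-energy value in the example are hypothetical, and the authors' exact definitions may differ.

```python
def global_features(seq, structure, mfe):
    """Return the 7 global features for a single-hairpin structure.

    seq       -- RNA sequence
    structure -- dot-bracket string of the same length
    mfe       -- minimum free energy (kcal/mol) supplied by the caller
    """
    n = len(seq)
    n_pairs = structure.count("(")                        # number of base pairs
    gc_content = (seq.count("G") + seq.count("C")) / n    # GC fraction
    tail5 = structure.find("(")                           # unpaired bases before the stem
    tail3 = n - 1 - structure.rfind(")")                  # unpaired bases after the stem
    # Central loop: unpaired bases between the innermost pair.
    loop_len = structure.find(")") - structure.rfind("(") - 1
    return [n_pairs, gc_content, n, loop_len, mfe / n, tail5, tail3]

if __name__ == "__main__":
    # Toy input; the free energy here is a made-up number, not an RNAfold result.
    print(global_features("AGCAGCUUCGGCGCUU", ".((.((....))))..", -3.2))
```

Concatenating these seven values with the 24 pair frequencies and the pulled-stem length gives a 32-dimensional vector.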
The combined feature vector for the example in Figure 1 is shown in Table 2.

Measures for Assessment
The prediction performance was evaluated by four indexes [31]: prediction accuracy (ACC), Matthews correlation coefficient (MCC), sensitivity (Sen) and selectivity (Sel). For example, the accuracy is

$$\mathrm{ACC} = \frac{tp + tn}{tp + tn + fp + fn} \times 100\%$$

where tp is the number of true positives, fp the number of false positives, tn the number of true negatives, and fn the number of false negatives.
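These indexes can be computed directly from the four confusion-matrix counts. The sketch below uses the standard definitions of accuracy, Matthews correlation coefficient and sensitivity. Because the text does not spell out the formula for Sel, the sketch reports the true-negative rate instead, which may or may not match the authors' definition. The function name `confusion_metrics` is hypothetical; the example counts are those reported for the RBF-kernel SVM in the Results section.

```python
from math import sqrt

def confusion_metrics(tp, tn, fp, fn):
    """Return ACC, Sen and TNR as percentages, and MCC in [-1, 1]."""
    total = tp + tn + fp + fn
    acc = 100.0 * (tp + tn) / total
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0   # 0 when undefined
    sen = 100.0 * tp / (tp + fn) if tp + fn else 0.0      # true-positive rate
    tnr = 100.0 * tn / (tn + fp) if tn + fp else 0.0      # true-negative rate (assumed stand-in for Sel)
    return {"ACC": acc, "MCC": mcc, "Sen": sen, "TNR": tnr}

# 2956 of 3607 real pre-miRNAs and 3159 of 3444 pseudo pre-miRNAs classified correctly.
m = confusion_metrics(tp=2956, tn=3159, fp=3444 - 3159, fn=3607 - 2956)

if __name__ == "__main__":
    for name, value in m.items():
        print(f"{name}: {value:.2f}")
```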

RESULTS AND DISCUSSION
To demonstrate the validity of the novel local sequence-structure features, SVM classifiers were first run with only the 24 novel features (not including the length of the pulled stem) on all testing data sets. The feature vectors of the training sets were scaled to zero mean and unit deviation, and the feature vectors of the testing sets were scaled using the means and deviations of the training sets. Three basic kernel functions (linear, polynomial and RBF) were tested on all testing data sets, with parameters tuned by grid search.
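The scaling step matters because the test data must be transformed with statistics estimated on the training data only. A minimal standard-library sketch is given below. The function names `fit_scaler` and `apply_scaler` are hypothetical, and two details are assumptions not stated in the text: the population standard deviation is used, and constant features are divided by 1 to avoid division by zero.

```python
from statistics import mean, pstdev

def fit_scaler(train_rows):
    """Estimate per-feature mean and standard deviation from the training set."""
    columns = list(zip(*train_rows))
    means = [mean(col) for col in columns]
    # A constant column has deviation 0; fall back to 1.0 (assumption).
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds

def apply_scaler(rows, means, stds):
    """Scale rows with previously fitted statistics (used for train and test)."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

if __name__ == "__main__":
    train = [[1.0, 10.0], [3.0, 10.0]]   # toy feature vectors
    test = [[5.0, 11.0]]
    means, stds = fit_scaler(train)
    print(apply_scaler(train, means, stds))
    print(apply_scaler(test, means, stds))
```

The scaled vectors would then be passed to an SVM implementation, with kernel parameters selected by grid search as described.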
The results are listed in Table 3 (see the supplementary material for detailed results). For comparison, the results of the Triplet-SVM-classifier (3SVM) [18] are also listed. Boldface in the tables marks the maximum in each row.
As shown in Table 3, all three SVMs with the 24 novel local features perform better than the Triplet-SVM-classifier. The best SVM (RBF kernel) correctly predicts 82% (2956 out of 3607) of all pre-miRNAs and identifies 92% (3159 out of 3444) of the pseudo pre-miRNAs. In contrast, 3SVM reports 80% (2886 out of 3607) of all pre-miRNAs and 89% (3056 out of 3444) of all pseudo pre-miRNAs. This result demonstrates the validity of the 24 novel local sequence-structure features for distinguishing real pre-miRNAs from pseudo ones.
To improve the performance of the SVM classifier, SVMs with the 7 global features appended were tested on all testing sets, and the results are listed in Table 4. Table 4 shows that the performance of the SVM classifier increased significantly when the 7 global features were combined with the 25 new local features (including the length of the pulled stem). The ACC and MCC of the best SVM with the 32 combined features are 90.11% and 80.34%, respectively. This indicates that the global features are important for distinguishing real pre-miRNAs from pseudo ones.

SciRes Copyright © 2009 JBiSE
Table 5 shows the SVM predictions on the CROSS-SPECIES data sets, which contain 3207 known pre-miRNAs from 31 species. The SVMs with the 24 new local features and with the 32 combined features achieve overall accuracies of 83.5% and 88.9% on the CROSS-SPECIES data sets, respectively. The 24 new local features perform better than Xue's local features in almost all of the 31 species. In particular, for Epstein Barr virus (ebv) and Fugu rubripes (fru) our accuracy reaches 100%, whereas Xue's accuracy is 91.7% and 87%, respectively.

CONCLUSIONS
In this paper, novel local features different from Xue's [18] have been presented for distinguishing real pre-miRNAs from pseudo ones. These features come from simple statistics on the pulled stem of the hairpin structure and, with an SVM classifier, achieve higher accuracy than the Triplet-SVM-classifier on the updated testing data sets. The results indicate that our method could be used as an alternative way of finding pre-miRNAs.

Figure 1. An example of a pulled stem. The sequence is hsa-mir-139 of Homo sapiens from the miRNA registry database [28].

Table 2. The composition of the feature vector in our method.

Table 4. Performance comparison of the three kernels (with 32 features) and 3SVM.

Table 5. The prediction results of our method and 3SVM on the cross-species test sets.