Accurate Plant MicroRNA Prediction Can Be Achieved Using Sequence Motif Features

MicroRNAs (miRNAs) are short (~21 nt) nucleotide sequences that are either co-transcribed during the production of mRNA or organized in intergenic regions transcribed by RNA polymerase II. Drosha in animals and DCL1 in plants recognize pre-miRNAs, which set themselves apart by their characteristic stem-loop (hairpin) structure. This structure appears to be important for their recognition during maturation into functional mature miRNAs. A large body of research is available on computational pre-miRNA detection in animals, but less within the plant kingdom. Pre-miRNAs are usually predicted with machine learning approaches, which require converting the pre-miRNAs into a set of computable features; many such features have been described. Here we select a subset of the previously described features and add sequence motifs as new features. The resulting model, which we call MotifmiRNAPred, was tested on known pre-miRNAs listed in miRBase, and its accuracy was compared to existing approaches in the field. With an accuracy of 99.95% for the generalized plant model, it distinguishes itself from previously published results, which reach average accuracies between 74% and 98%. We believe that our approach is useful for the prediction of pre-miRNAs in plants without per-species adjustment.


Introduction
MicroRNAs (miRNAs) are short RNA sequences that form a hairpin structure which harbors one or more mature miRNAs of about 21 nucleotides in length [1]. Mature miRNAs, when incorporated into RISC, provide a template sequence for the recognition of their target mRNAs, which are then either degraded or translated at a reduced rate [2]. Since their discovery by Lee and colleagues [3], they have received increasing attention, and it is now clear that in animals they are involved in many diseases [4] and in plants they play essential roles in regulation, development, and the response to cold stress and nutrient deprivation [5]. MicroRNAs are found in multicellular organisms ranging from sponges [6] to humans, but the plant miRNA pathway may have evolved distinctly from the animal one [7].
The experimental study of miRNAs is quite involved and is complicated by the fact that a miRNA and its targets have to be expressed at the same time in the same cell to produce a measurable effect. For this reason, computational detection of miRNAs and their targets is important [8] [9]. Different approaches to computational miRNA detection have been applied, but most are based on feature extraction followed by machine learning [10] [11]. The so-called ab initio miRNA detection methodology is well established in animal models, for which abundant learning data are available, for example in miRBase [12].
Most studies that report new ab initio approaches to pre-miRNA prediction have used different data sets, which makes it difficult to compare the results. Additionally, various computational approaches apart from machine learning have been employed, for example based on sequence conservation and/or structural similarity [13]-[17]. Most detrimental for a true comparison of methodologies, however, is that no fully annotated genome is available that would allow a proper accuracy assessment on real data. For these reasons, accuracies and other measures reported in the studies below cannot be compared directly, but they can provide a general idea.
NOVOMIR [18] uses a series of filter steps and a statistical model to discriminate pre-miRNAs from other RNAs and reports a sensitivity of 80% at a specificity of 99%. MiRenSVM, an algorithm combining three SVMs, achieved a sensitivity of 93% at a specificity of 97% [19]. Xue and colleagues trained a support vector machine on human data (93% sensitivity at 88% specificity) but, interestingly, also achieved high accuracies of up to 90% in other species [20]. Jiang and colleagues [21] added a P-value and minimum free energy to the classification parameters of Xue and colleagues and, using Random Forest, a different classification algorithm, achieved a sensitivity of 95% at a specificity of 98%. A recent study by Zeller and coworkers employed structure/sequence conservation, homology to known microRNAs, and phylogenetic footprinting [22]. Others have used homology searches for revealing paralogous and orthologous miRNAs [14] [23]-[26]. Additionally, Wang and others [27] developed a method based on sequence and structure alignment for miRNA identification. Finally, Hertel and Stadler included multiple sequence alignment for microRNA detection [28].
Many algorithms for miRNA gene prediction are based on machine learning strategies. In general, these algorithms need a sufficient number of positive as well as negative examples. Although many miRNA genes seem to be unique in any organism, positive training examples can easily be found, whereas negative examples are hard to come by [19] [29]-[31]. Some negative examples that were picked in studies, for example mRNA sequences [32], are dubious since, to our current knowledge, miRNAs can originate from any part of a pri-miRNA. Thus, defining the negative class is a major challenge in training machine learning algorithms for miRNA discovery. For this reason, one-class machine learning, which only needs positive examples, has been tried [20] [31].
As pointed out above, plant miRNAs may have evolved distinctly from animal ones, and thus the approaches for miRNA detection introduced so far may need to be adapted when applied to plants. It has been found that plant miRNA precursors are more variable in size and very heterogeneous, but usually larger than their animal counterparts. Their base pairing propensity (bonds in the stem) also seems to be more extensive, and the length of the mature miRNA is close to 21 nucleotides [33]. Billoud and colleagues predicted miRNAs in brown algae, which are different from both land plants and animals, using a set of normalized features such as Shannon entropy that had previously been used for detection of miRNAs in plants and animals [34]. Other studies also use tools developed for miRNA detection in animals for studies in plants [18] [35] [36]. PlantMiRNAPred achieved an accuracy of more than 90% when used with multiple plant species [36]. One study shows that generalized training using data from multiple plants as input for a decision tree leads to a sensitivity of 84% at a specificity of 99% [37]. This may be due to the concurrent usage of structural features and targeting parameters, which is beneficial for the accuracy of miRNA prediction [38]. In Arabidopsis thaliana, one approach searched the transcriptome for all complementary pairs of sequences of the expected size of a miRNA-mRNA duplex and then successfully filtered the results according to divergence patterns [39].
We should note in passing that high-throughput methods for sequencing isolated small RNAs provide a new tool for discovering novel microRNA species [40] [41] and that such information for plants is available in PMRD [5]. Another new method for amplifying low-concentration microRNAs allows easier testing of predictions [42]. These tools are equally important for plant and animal models. This study, however, focuses on the ab initio detection of miRNAs from genomic rather than transcriptomic data.
Compared to animals, less effort has been devoted to the computational detection of miRNAs and their targets in plants, since the problem was thought to be simple, but it has become clear that miRNA regulation in plants is more complex than anticipated [43]. It is difficult to differentiate between miRNAs and short interfering RNAs in plants [44], but this is beyond the scope of this study. Here, we aimed to improve upon current methodologies for plant pre-miRNA prediction. To achieve this, we pursued two routes for the ab initio prediction of miRNAs. Like many other studies, we employed features describing hairpins, but we included many more than usual (~700), of which we selected the 100 most discriminative. This strategy led to a prediction accuracy of 98%, which is comparable to previous studies. The second approach describes miRNAs solely based on motifs. This novel approach by itself also reaches an accuracy (90%) comparable to previous studies. Employing a hybrid approach using the best of both descriptors led to an accuracy of 99.48%, which is the best result reported for plants to date.

Data
We downloaded microRNAs from different plant species available on miRBase (Release 20 and 21). We considered the Brassicaceae, with 699 pre-miRNAs in total, consisting of Arabidopsis lyrata (205 precursors), Arabidopsis thaliana (298 precursors), Brassica napus (90 precursors), Brassica oleracea (10 precursors), and Brassica rapa (96 precursors). We also included the data published on the PlantMiRNAPred web server [36], whose training dataset consists of 980 real pre-miRNAs and 980 pseudo pre-miRNAs (we refer to this data as PlantMiRNAPred data in the following). Our negative data pool consists of the 980 pseudo pre-miRNAs in the PlantMiRNAPred dataset.

Motif Parameters
Here, a sequence motif is a short stretch of nucleotides that is widespread among plant hairpins. Motif discovery, in turn, is the process of finding such short sequences within larger sequences; in our case, motifs within plant hairpins.
The MEME (Multiple EM for Motif Elicitation) [45] suite web server was used in our study to discover sequence motifs in our input data, which consist of plant pre-microRNAs (positive sequences) and plant pseudo hairpins (negative sequences). The MEME algorithm for motif discovery is based on [46] and works by searching for repeated, ungapped sequence motifs that occur in DNA or protein sequences. MEME provides the results as regular expressions (Table 1). Nucleotides within brackets represent alternatives for the given position in the sequence; where no brackets are given, only the stated nucleotide occurs abundantly at that position within all collected sequences representing the motif. Sequence logos offer a more visual representation of such motifs (Figure 1). MEME was instructed to generate 20 motifs, each of which must appear in at least 10 sites to be an acceptable motif.

Example motif expressed as a regular expression: [GA]A[GAC][AC][GC]A[AG]A[CG][AG][GA][ACG][AC][AGC][AC][CG][GAC][AGC]AAA.
Table 1. Match score calculation. Example of the match score between a motif and a part of a sequence. The number of matches is 6. For the assessment, the score is normalized by the length of the motif. The final match score is 6/19 ≈ 0.32.
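The normalization described in Table 1 can be illustrated with a minimal sketch. It assumes that the motif is aligned position by position against a sequence window of the same length and that a position counts as a match when the nucleotide is among the alternatives the motif allows there. The function names `parse_motif` and `match_score` and the toy input are hypothetical and are not taken from the original implementation.

```python
import re

def parse_motif(motif):
    """Split a MEME-style regular expression into one set of allowed nucleotides per position."""
    tokens = re.findall(r"\[([ACGTU]+)\]|([ACGTU])", motif)
    return [set(group or single) for group, single in tokens]

def match_score(motif, window):
    """Fraction of motif positions whose allowed set contains the aligned nucleotide."""
    positions = parse_motif(motif)
    if len(window) != len(positions):
        raise ValueError("window must have the same length as the motif")
    matches = sum(nt in allowed for allowed, nt in zip(positions, window))
    return matches / len(positions)

# Toy example (hypothetical data): positions 1 and 3 match, position 2 does not.
score = match_score("[GA]AC", "GTC")  # 2 matches over 3 positions
```

Dividing by the motif length keeps scores comparable between motifs of different lengths, which is presumably why the normalization is applied.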

Sequence-Based and Motif Features for Plant Pre-miRNA Detection
Simple sequence-based features have been described and used for ab initio pre-miRNA detection in numerous studies (see Hairpin Feature Calculation). These simple features, also called words, k-mers, or n-grams, describe a short sequence of nucleotides of length k or n. For example, a 1-gram over the alphabet {A,U,C,G} can produce the words A, U, C, and G, while a 2-gram over the same alphabet can generate AA, AC, AG, AU, CA, CC, CG, CU, GA, GC, GG, GU, UA, UC, UG, and UU. Higher n have also been used [38], but only for a selection of informative 3-grams.
Motif features differ from n-grams in that they are not exact and allow some degree of error tolerance. In this study, motifs are represented as regular expressions (see above). Regular expressions are widespread in approximate pattern matching, and many programs allow searching with regular expressions (e.g., most Linux tools such as grep). Here we use PatMatch [47] to analyze whether a pattern is within a hairpin (1) or not (0). The hairpin is analyzed using the following algorithm:
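As an illustration of n-gram features, the following sketch enumerates all words of length k over {A,U,C,G} and computes their relative frequencies in a sequence. Whether the original features were raw counts or frequencies is not stated here, so the normalization by the number of windows is an assumption, as are the function names.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGU"

def all_kmers(k):
    """All 4**k words of length k over the RNA alphabet, in lexicographic order."""
    return ["".join(p) for p in product(ALPHABET, repeat=k)]

def kmer_frequencies(seq, k):
    """Relative frequency of every k-mer in seq (assumption: counts / number of windows)."""
    seq = seq.upper().replace("T", "U")  # treat DNA-style T as U
    total = len(seq) - k + 1
    if total <= 0:
        return {kmer: 0.0 for kmer in all_kmers(k)}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: counts[kmer] / total for kmer in all_kmers(k)}

# Hypothetical toy sequence: the 2-mers are AA, AU, and UG.
features = kmer_frequencies("AAUG", 2)
```

With k = 1, 2, and 3 this enumeration yields 4 + 16 + 64 = 84 words, which matches the number of n-gram features reported in the Results.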
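The algorithm listing itself is not reproduced in this text. The binary presence check can nonetheless be sketched as follows, using Python's `re` module as a stand-in for PatMatch (an assumption made purely for illustration); the function name, the example patterns, and the example hairpin are hypothetical.

```python
import re

def motif_features(hairpin, motifs):
    """Return one flag per motif: 1 if the pattern occurs anywhere in the hairpin, else 0."""
    seq = hairpin.upper()
    return [1 if re.search(pattern, seq) else 0 for pattern in motifs]

# Hypothetical patterns and hairpin fragment, not taken from the study's data.
motifs = ["[GA]A[GAC][AC]", "UUU[CG]G"]
flags = motif_features("CCGAACAGG", motifs)  # first pattern matches "GAAC", second does not occur
```

Each hairpin is thereby mapped to a fixed-length 0/1 vector with one entry per motif, which can be concatenated with the n-gram features.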

Traditional Hairpin Feature Calculation
Apart from the novel motifs discovered in this study, we also calculated conventional features, which may be statistical, thermodynamic, sequence-based, or structural in nature, or any combination of these. The features calculated were taken from nine studies presenting ab initio detection of hairpins in animals [19]-[21] [32] [48]-[52]. We further added their logical extensions and normalizations, for example normalization based on stem or hairpin length. While a full description is outside the scope of this work, some of the features are further explained in Saçar and Allmer [10]. All features were implemented in Java, and the calculations were distributed over a 200-core HTCondor [53] cluster at the Izmir Institute of Technology, Urla, Turkey.

Feature Selection
Features were ranked according to the recursive feature elimination with SVM procedure (SVM-RFE) implemented in WEKA, separately for the motif and the traditional approach. SVM-RFE [54] is an SVM-based method that recursively removes features based on their contribution to the discrimination between the two classes. The features with the lowest coefficient weights are removed, the remaining features are scored again, and the procedure is repeated until only a few features remain. Supplementary Table S1 contains the 60 top-ranked features for the three models presented in Table 5.
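The elimination loop can be sketched in a few lines. This is not the WEKA implementation: as a stand-in for the SVM, the sketch fits a hinge-loss linear classifier with plain subgradient steps and no bias term, and it removes one feature per round. All names, hyperparameters, and the toy data are hypothetical.

```python
def train_linear_svm(X, y, lr=0.1, lam=0.01, epochs=50):
    """Hinge-loss linear classifier fitted by subgradient steps (stand-in for a real SVM)."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            for j, xj in enumerate(xi):
                # L2 shrinkage always; hinge term only when the margin is violated.
                grad = lam * w[j] - (yi * xj if margin < 1 else 0.0)
                w[j] -= lr * grad
    return w

def rfe_rank(X, y, keep=1):
    """Retrain and drop the feature with the smallest |weight| until `keep` features remain."""
    active = list(range(len(X[0])))
    while len(active) > keep:
        w = train_linear_svm([[row[j] for j in active] for row in X], y)
        worst = min(range(len(active)), key=lambda j: abs(w[j]))
        active.pop(worst)
    return active

# Toy data: feature 0 separates the classes, feature 1 is constant, feature 2 is noise.
X = [[1, 0, 0.1], [1, 0, -0.1], [-1, 0, 0.1], [-1, 0, -0.1]]
y = [1, 1, -1, -1]
selected = rfe_rank(X, y, keep=1)  # indices of the surviving features
```

Retraining after every removal is what distinguishes recursive elimination from a single ranking pass: the weight of a feature can change once a correlated feature has been dropped.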

Support Vector Machine Classification
Support Vector Machines (SVMs) are used in machine learning for classification [55]. In general, during training of a linear SVM, examples from two classes (positive and negative) need to be provided. The SVM learns a model by finding the hyperplane that best separates the two classes, maximizing the margin between the hyperplane and the support vectors (Figure 2).
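For training, a set of labeled examples $E$ needs to be provided, where each $x_i$ is an $l$-dimensional vector and $y_i$ defines the class of the example $x_i$ (i.e., with $p$ representing the positive (+1) and $n$ representing the negative class (−1)). The separating hyperplane then takes the following form: $w \cdot x + b = 0$ with $w \in \mathbb{R}^l$ and $b \in \mathbb{R}$, where $w$ is the normal vector of the hyperplane and $b$ defines its position in space. In order to predict a new instance given a trained model, the formulation $f(x) = \operatorname{sign}(w \cdot x + b)$ can be evaluated; a positive result indicates membership of the positive class and a negative result membership of the negative class. If the value is zero, the example lies on the separating hyperplane and cannot be classified. Here we used the SVM learner, which has become the method of choice for solving difficult classification problems in a wide range of application domains, especially in the field of bioinformatics. In previous studies we examined other classifiers, and while there were no great differences in outcome, the SVM worked well consistently.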
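The decision rule of a linear SVM, f(x) = sign(w · x + b), can be illustrated with a short sketch. The weight vector, offset, and inputs below are hypothetical values chosen for illustration; they do not come from a trained model.

```python
def svm_predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane, 0 if x lies exactly on it."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    if score > 0:
        return 1
    if score < 0:
        return -1
    return 0

# Hypothetical hyperplane parameters.
w, b = [2.0, -1.0], -0.5
label = svm_predict(w, b, [1.0, 0.5])  # score = 2.0 - 0.5 - 0.5 = 1.0, so the label is +1
```

The zero case is returned separately because an example lying on the hyperplane cannot be assigned to either class.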
We used the WEKA software [56] for the implementation of our SVM classifier based on LibSVM [57]. The radial basis function kernel was used with a gamma value of 0.7, the cost parameter was set to 4.0, and the normalization option was set to true.
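For reference, the radial basis function kernel as defined in LibSVM is K(x, z) = exp(−gamma · ||x − z||²). The sketch below evaluates it with the gamma value of 0.7 quoted above; the function name and the input vectors are hypothetical.

```python
import math

def rbf_kernel(x, z, gamma=0.7):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

# Hypothetical feature vectors.
same = rbf_kernel([0.2, 0.4], [0.2, 0.4])  # identical vectors give the maximum value 1.0
far = rbf_kernel([0.0, 0.0], [3.0, 4.0])   # distant vectors give a value close to 0
```

Because the kernel depends on distances between feature vectors, features on very different scales would dominate it, which is one reason normalization matters for this kind of model.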
Any machine learning algorithm needs initial training and we performed five-fold cross validation during learning employing stratified random sampling (Figure 3).
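A minimal sketch of stratified five-fold splitting is given below. It is not the WEKA routine: it simply shuffles the indices of each class with a fixed seed and deals them round-robin into k folds so that every fold keeps the class ratio. The function name, the seed, and the toy labels are assumptions.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=5, seed=0):
    """Assign example indices to k folds while preserving the class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal indices round-robin
    return folds

# Hypothetical balanced toy data: 10 positive and 10 negative examples.
labels = [1] * 10 + [0] * 10
folds = stratified_folds(labels)
for test_fold in folds:
    # Each fold serves once as the test set; the other four folds form the training set.
    train = [i for fold in folds if fold is not test_fold for i in fold]
```

With balanced data such as equal numbers of real and pseudo pre-miRNAs, every fold then contains the same number of positive and negative examples.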

Trained Models
We trained three separate models using the strategy outlined above to investigate whether motifs, previously described features, or their combination are most successful at separating true from false plant pre-miRNAs. For training the motif-only model, we used the best 60 motifs and n-grams. For training the traditional model, we used the best 60 features. For the combined model, the top 60 features were selected from the mixture of n-grams, traditional features, and motifs. The selected features are listed in Supplementary Table S1.

Evaluation Methods
Positive data from miRBase and negative data from PlantMiRNAPred were used to evaluate the models derived via SVM training. We calculated the performance of the classifier with the well-known sensitivity (SE), specificity (SP), and accuracy (ACC) statistics, which are conventionally defined as follows (TP refers to true positives, FP to false positives, TN to true negatives, and FN to false negatives): $SE = TP / (TP + FN)$, $SP = TN / (TN + FP)$, and $ACC = (TP + TN) / (TP + TN + FP + FN)$.
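A minimal sketch of these three statistics computed from confusion-matrix counts, assuming the conventional definitions of sensitivity, specificity, and accuracy. The function name and the counts in the usage line are hypothetical and do not correspond to any result of this study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    se = tp / (tp + fn)                    # share of true pre-miRNAs that were recognized
    sp = tn / (tn + fp)                    # share of pseudo hairpins that were rejected
    acc = (tp + tn) / (tp + fp + tn + fn)  # share of all examples classified correctly
    return se, sp, acc

# Hypothetical counts for illustration only.
se, sp, acc = classification_metrics(tp=90, fp=5, tn=95, fn=10)
```

Reporting sensitivity and specificity alongside accuracy is useful because accuracy alone can hide poor performance on one of the two classes.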

Results
The PlantMiRNAPred data was divided into two parts: PlantMiRNAPred-p1, consisting of 450 pre-miRNAs (positive data) and 450 pseudo pre-miRNAs (negative data), and PlantMiRNAPred-p2, composed of 530 pre-miRNAs and 530 pseudo pre-miRNAs. The Brassicaceae data was also divided into two parts: the first part consists of one third of the data (233 sequences; named Brassicaceae-p1), and the remaining two thirds (named Brassicaceae-p2) contain 466 sequences. MEME was used to discover motifs in the dataset as described in the Materials and Methods Section, and several motifs were found in all datasets, as seen in Table 2. MEME was used to discover motifs in one part of the divided dataset (p1), and the same motifs were used for further experiments on the remainder of the data (p2) to ensure that the extracted motifs are meaningful and not dataset dependent.
The selected motifs and the n-grams (short nucleotide sequences; see Materials and Methods; Supplementary Table S1) were used to train a support vector machine (SVM) model, for which the accuracy and other performance measures were established (Table 3). To see the impact of motifs on the classification accuracy, two models were trained for all datasets, one which uses both motifs and n-grams and one which relies only on the latter.
Table 3 presents the average performance of our SVM classifier MotifmiRNAPred using five-fold cross validation. For the motifs extracted from PlantMiRNAPred-p1 and applied to PlantMiRNAPred-p2, we see a decrease in model performance of about 13%, which indicates that there is some data dependency of the motifs in this case. For Brassicaceae there was no significant difference between the datasets p1 and p2, which shows that in this case stable motifs were generated that are not affected by differences in the tested datasets. When comparing the results on PlantMiRNAPred-p1 with the results achieved by PlantMiRNAPred [36], it can be seen that our methodology achieves a similar performance (Table 4). PlantMiRNAPred achieves accuracies between 92% and 100% when the data is separated into species, with a trend to be more successful for smaller datasets.
In Table 4, we used the data from the PlantMiRNAPred web server [36] to compare our performance with the classification results of PlantMiRNAPred, TripletSVM [20], and microPred [58]. The data was represented by 174 features consisting of 84 n-grams and 90 motifs. From these, the top 60 features selected by SVM-RFE, a feature selection method available in WEKA [56], were considered, and the performance resulting from five-fold cross validation is presented (Table 4).
The nucleotide T(U) is one of the most informative features (Supplementary Table S1), and it always appeared at the top of the selected features for each data set individually. This observation is consistent with the study of Zhang, Pan et al. [59] and seems to confirm that the sequences of pre-miRNAs and mature miRNAs are slightly enriched in T(U) and T(U) plus G, respectively.
The comparison in Table 4 shows that using motifs for miRNA detection is comparably successful to using traditional features and at times even slightly more successful. Following this, we calculated the traditional features used to describe hairpins and trained a model (traditional) for pre-miRNA detection. We calculated about 700 features but ranked them as above and selected only the top 60 features for machine learning. Additionally, we combined the traditional features with the motifs, ranked the mixture, and again selected the 60 best-ranked features to train a model (combined). These two models were compared to the initially learned model (motifs-only), which is based only on motifs and n-grams. The combined model performs better than the underlying models individually, with an increase of about 11% and 1%, respectively (Table 5).
The best accuracy was achieved by the combined feature set. It is striking that this accuracy is even better than the best accuracy for any of the models trained on the individual plant data sets (Table 4). Since our motif-only approach is comparable in accuracy with previously published studies (Table 4), and since the combined feature set is significantly better than the motifs-only one, we propose that it suffices to use our feature set and create one model for miRNA detection that is applicable even across different plant species.

Conclusion
An abundance of features describing miRNA hairpins has been proposed, mostly based on structural, statistical, and thermodynamic properties [60]. Here we showed that for plant miRNA detection, motif-based features are useful and by themselves lead to a good recognition of pre-miRNAs at an accuracy of 90% to 95%, depending on the plant species (Table 4). When using a mixture of plant pre-miRNAs to train models based on motifs and n-grams, traditional features, and their combination, it can be seen that the combination of features is most successful (Table 5). We found no great difference when comparing the performance of selected features from the domain of animal pre-miRNA detection to using sequence motifs and n-grams as features (Table 4). However, the combination of these features (Table 5) performed about 4% better, even when compared to the average performance of classifiers specifically trained for plant species (Table 4). We conclude that using motifs for the prediction of pre-miRNAs is useful and that their combination with traditional features is most successful. Furthermore, we propose that it may be sufficient to use a classifier trained in this manner to detect plant pre-miRNAs at an accuracy level high enough to warrant experimental confirmation of predicted pre-miRNAs.

Figure 1. Motif Construction. The sequence logo corresponding to one of the motifs discovered in this study. The size of the letters in a stack represents their frequencies, while the height of the stack represents the information content. Not all options in the profile may be incorporated into the corresponding regular expression: [GA]A[GAC][AC][GC]A[AG]A[CG][AG][GA][ACG][AC][AGC][AC][CG][GAC][AGC]AAA.

Figure 2. A support vector machine separates examples, represented by vectors with n dimensions, using a hyperplane. Positive examples (green points) and negative examples (grey doughnuts) are separated by maximising the margins, which intersect with the so-called support vectors.

Figure 3. SVM Training. The figure depicts the workflow that was used to train the SVM classifier. Positive and negative data were combined and stratified random sampling was applied. The sampled data was split into 90% data for training and 10% for testing. This procedure was repeated 5 times.

Table 2. Dataset description. Dataset description and number of generated motifs per dataset.

Table 3. Classifier performance. The result of MotifmiRNAPred applied to different plant miRNA data. The first value given as performance measure refers to the model trained with both n-grams and motifs, and the value following it refers to a model trained with only the former. ROC: receiver operating characteristic.

Table 4. Performance comparison among tools. Comparison of MotifmiRNAPred with different methods. The first 4 columns are taken from the PlantMiRNAPred paper. The columns below MotifmiRNAPred present our results using the same data as in the PlantMiRNAPred paper.

Table 5. Performance comparison among feature sets. Comparison of two synergistic feature sets and their synthesis proposed in this study. The models were trained on the combined dataset including all plant miRNAs.