A comparison study between one-class and two-class machine learning for MicroRNA target detection

The application of one-class machine learning is gaining attention in the computational biology community. Several studies have described the use of two-class machine learning to predict microRNA (miRNA) gene targets. Most of these methods require the generation of an artificial negative class. However, designating the negative class can be problematic; if it is not done properly, it can dramatically affect the performance of the classifier and/or yield a biased estimate of performance. We present a study using one-class machine learning for miRNA target discovery and compare one-class to two-class approaches. Most of the one-class methods tested gave similar accuracies, ranging from 0.81 to 0.89, while the two-class naive Bayes classifier gave an accuracy of 0.99. One-class and two-class methods can both give useful classification accuracies. The advantage of one-class methods is that they do not require any additional effort to choose the best way of generating the negative class. In such cases, one-class methods can be superior to two-class methods when the features chosen to represent the positive class are well defined.


INTRODUCTION
MicroRNAs (miRNAs) are single-stranded, non-coding RNAs averaging 21 nucleotides in length. The mature miRNA is cleaved from a 70-110 nucleotide (nt) "hairpin" precursor with a double-stranded region containing one or more single-stranded loops. MiRNAs regulate their target messenger RNAs (mRNAs) primarily by repressing translation and causing mRNA degradation [1]. Although recent findings [2] suggest that miRNAs may affect gene expression by binding to either the 5' or the 3' untranslated regions (UTRs) of mRNA, most studies have found that miRNAs mark their target mRNAs for degradation or suppress their translation by binding to the 3' UTR, and most target prediction programs search there. These studies have suggested that the miRNA seed segment, which includes 6-8 nt at the 5' end of the mature miRNA sequence, is very important in the selection of the target site (see Figure 1).
Several computational approaches have been applied to miRNA gene prediction using methods based on sequence conservation and/or structural similarity [3,4,5,6,7]. Those methods that used machine learning were based on two-class approaches, while the results reported here are based on one-class approaches.
Several additional methods for the prediction of miRNA targets have subsequently been developed. These methods mainly use sequence complementarity, thermodynamic stability calculations, and evolutionary conservation among species to determine the likelihood of a productive miRNA:mRNA duplex formation [8,9]. John et al. (2004) developed the miRanda algorithm [10] for miRNA target prediction. MiRanda uses dynamic programming to search for optimal sequence complementarity between a set of mature miRNAs and a given mRNA. Another algorithm, RNAhybrid [8,9], is similar to an RNA secondary structure prediction algorithm such as the Mfold program [11], but it determines the most favorable hybridization site between two sequences. Lewis et al. (2005) developed TargetScanS [12]. TargetScanS scores target sites based on the conservation of the target sequences between five genomes (human, mouse, rat, dog and chicken), as evolutionarily conserved target sequences are more likely to be true targets. In testing, TargetScanS was able to recover targets for all 5300 human genes known at the time to be targeted by miRNAs.
PicTar [13] is a computational method to detect common miRNA targets in vertebrates, nematodes (C. elegans), and insects (Drosophila melanogaster). PicTar is based on a statistical method applied to eight vertebrate genome-wide alignments (multiple alignments of orthologous nucleotide sequences (3' UTRs)). PicTar was able to recover validated miRNA targets at an estimated 30% false-positive rate. In a separate study, PicTar was applied to target identification in D. melanogaster [14]. These studies suggest that one miRNA can target 54 genes on average and that known miRNAs are projected to regulate a large fraction (15%) of all D. melanogaster genes. This is likely to be a conservative estimate due to the incomplete input data.
Copyright © 2010 SciRes. JBiSE
TargetBoost [15] is a machine learning algorithm for miRNA target prediction that uses only sequence information to create weighted sequence motifs that capture the binding characteristics between miRNAs and their targets. The authors suggest that TargetBoost is stable and identifies more of the already verified true targets than other existing algorithms do.
Sung-Kyu et al. (2005) also reported the development of a machine learning algorithm using a Support Vector Machine (SVM). The best reported results [16] were a sensitivity of 0.921 and a specificity of 0.833. More recently, Yan et al. used a machine learning approach that employs features extracted from both the seed and out-seed segments [17]. The best result obtained was an accuracy of 82.95%, but it was generated using only 48 positive human examples and 16 negative examples, a relatively small training set with which to assess the algorithm.
In 2006, Thadani and Tammi [18] launched MicroTar, a statistical computational tool for the prediction of miRNA targets from RNA duplexes that does not use sequence homology for prediction. MicroTar relies mainly on a novel approach to estimating the duplex energy. However, the reported sensitivity (60%) is significantly lower than that achieved using other published algorithms. At the same time, a miRNA pattern discovery method, RNA22 [19], was proposed to scan UTR sequences for targets. RNA22 does not rely upon cross-species conservation but was able to recover most of the known target sites, with validation of some of its new predictions.
More recently, Yousef et al. (2007) described a target prediction method, NBmiRTar [20], that instead uses machine learning with a Naïve Bayes classifier. NBmiRTar does not require sequence conservation but generates a model from sequence and miRNA:mRNA duplex information derived from validated target sequences and artificially generated negative examples. In this case, both the seed and "out-seed" segments of the miRNA:mRNA duplexes are used for target identification. The NBmiRTar technique produces fewer false positive predictions and fewer target candidates to be tested than miRanda [10]. It exhibits higher sensitivity and specificity than algorithms that rely only on conserved genomic regions to decrease false positive predictions.
This paper describes a comparison of one-class and two-class approaches for miRNA target detection. The advantage of one-class methods is that they do not require any additional effort to choose the best way of generating the negative class, although the two-class approaches clearly outperform the one-class methods.

Designing Duplex Structure and Sequence Features
Machine learning enables one to generate automatic rules based on observation of the appropriate examples by the learning machine. However, the selection and design of the features used to represent each example for the learning process are very important and influence the classifier performance. We have followed [20] for feature design. We have partitioned the duplex into two parts, the seed (the 5' 8 nt of the miRNA) and the out-seed (the 3' remainder), as described in Figure 1. For each of these parts the following features are extracted, giving 57 structural features in total:
1) the number of paired bases (bp);
2) the number of bulges (inserts on one strand between paired bases);
3) the number of loops (unpaired bases opposite each other between paired bases);
4) the number of asymmetric loops (loops with unequal numbers of unpaired bases on the two strands);
5) eight features, each representing the number of bulges of lengths 1-7 and those with lengths greater than 7;
6) eight features, each representing the number of symmetric loops with lengths 1-7 and those with lengths greater than 7;
7) eight features, each representing the number of asymmetric loops with lengths 1-7 and those with lengths greater than 7;
8) the distance from the start of the seed (the 3' end) to the first paired base of the 5' start of the out-seed part, extracted as one additional feature.
For the sequence features, we define "words" as sequences having lengths equal to or less than 3. The frequency of each word in the seed part is extracted to form a representation in the vector space.
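As a rough illustration of the sequence features, the sketch below counts every word of length 1 to 3 in a sequence and lays the counts out in a fixed order. The RNA alphabet `ACGU`, the counting of overlapping occurrences, and all function names are assumptions made for this example; the source does not specify them.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGU"  # assumed RNA alphabet (not stated in the source)


def word_vocabulary(max_len=3):
    """Return all words over ALPHABET with length 1..max_len, in a fixed order."""
    return ["".join(p)
            for k in range(1, max_len + 1)
            for p in product(ALPHABET, repeat=k)]


def word_frequencies(sequence, max_len=3):
    """Count overlapping occurrences of every word of length <= max_len."""
    seq = sequence.upper()
    counts = Counter(seq[i:i + k]
                     for k in range(1, max_len + 1)
                     for i in range(len(seq) - k + 1))
    # One vector entry per vocabulary word; absent words get a count of 0.
    return [counts[w] for w in word_vocabulary(max_len)]


vocab = word_vocabulary()
# Illustrative input only: the capitalized seed segment shown in Figure 1.
vector = word_frequencies("UUUACUUA")
```

With these assumptions the vector has 4 + 16 + 64 = 84 entries, one per possible word.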

One-Class Methods
In general, a binary learning (two-class) approach to miRNA discovery considers both positive (miRNA) and negative (non-miRNA) classes by providing examples of the two classes to a learning algorithm in order to build a classifier that will attempt to discriminate between them. The most common term for this kind of learning is supervised learning, where the labels of the two classes are known beforehand. One-class learning uses only the information for the target class (positive class), building a classifier that is able to recognize the examples belonging to its target and to reject others as outliers.
Among the many classification algorithms available, we chose five one-class algorithms to compare for miRNA discovery. We give a brief description of each one-class classifier and refer the reader to references [21,22] for additional details, including a description of parameters and thresholds. The LIBSVM library [23] was used as the implementation of the SVM (one-class, using the RBF kernel function) and DDtools [24] was used for the other one-class methods. The WEKA software [25] was used as the implementation of the two-class classifiers.

One-Class Support Vector Machines (OC-SVM)
Support Vector Machines (SVMs) are learning machines developed as a two-class approach [26,27]. The use of one-class SVM was originally suggested in [22]. One-class SVM is an algorithmic method that produces a prediction function trained to "capture" most of the training data. For that purpose, a kernel function is used to map the data into a feature space where the SVM is employed to find the hyper-plane with maximum margin from the origin of the feature space. In this use, the margin to be maximized between the two classes (in two-class SVM) becomes the distance between the origin and the support vectors, which define the boundaries of the surrounding circle (or hyper-sphere in high-dimensional space) that encloses the single class.

One-Class Gaussian (OC-Gaussian)
The Gaussian model is considered a density estimation model. The assumption is that the target samples follow a multivariate normal distribution; therefore, for a given test sample z in n-dimensional space, the probability density function can be calculated as:

$$p(z) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(z-\mu)^{T}\Sigma^{-1}(z-\mu)\right)$$

where $\mu$ and $\Sigma$ are the mean and covariance matrix of the target class estimated from the training samples.
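The following is a minimal sketch of a Gaussian one-class classifier, not the DDtools implementation. To stay dependency-free it assumes a diagonal covariance matrix (a simplification of the full covariance described above), and it sets the acceptance threshold so that roughly a chosen fraction of the training samples is rejected. The function names, the toy data and the `reject_fraction` parameter are all hypothetical.

```python
import math


def fit_diagonal_gaussian(samples, eps=1e-6):
    """Estimate per-feature mean and variance (diagonal covariance assumption)."""
    n, dims = len(samples), len(samples[0])
    mean = [sum(s[j] for s in samples) / n for j in range(dims)]
    # eps keeps the variance strictly positive for constant features.
    var = [sum((s[j] - mean[j]) ** 2 for s in samples) / n + eps
           for j in range(dims)]
    return mean, var


def log_density(z, mean, var):
    """Log of the normal density of z under the diagonal-covariance model."""
    return sum(-0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
               for x, m, v in zip(z, mean, var))


def fit_threshold(samples, mean, var, reject_fraction=0.1):
    """Pick the log-density below which about reject_fraction of training data falls."""
    scores = sorted(log_density(s, mean, var) for s in samples)
    return scores[int(reject_fraction * len(scores))]


def accept(z, mean, var, threshold):
    """Accept z as a member of the target class if it is dense enough."""
    return log_density(z, mean, var) >= threshold


# Toy 2-D target class (hypothetical data).
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5),
         (0.4, 0.6), (0.6, 0.4), (0.5, 0.4), (0.4, 0.5), (0.6, 0.6)]
mean, var = fit_diagonal_gaussian(train)
threshold = fit_threshold(train, mean, var)
```

A sample near the center of the training data, such as `(0.5, 0.5)`, is accepted, while a distant one, such as `(5.0, 5.0)`, is rejected as an outlier.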

One-Class Kmeans (OC-Kmeans)
Kmeans is a simple and well-known unsupervised machine learning algorithm used to partition the data into k clusters. Using OC-Kmeans, we describe the data as k clusters or, more specifically, as k centroids, one derived from each cluster. For a new sample z, the distance d(z) is calculated as the distance from z to its nearest centroid. The classification decision is then made based on a user-defined threshold: if d(z) is less than the threshold, the new sample belongs to the target class; otherwise it is rejected.
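The decision rule can be sketched as follows. This is an illustrative plain-Python version of Lloyd's k-means followed by the nearest-centroid threshold test, not the DDtools code; the function names, the toy data, the fixed random seed and the threshold value of 1.0 are assumptions.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns k centroids as tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k distinct data points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Recompute each centroid as its cluster mean; keep it if the cluster is empty.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster
            else centroids[j]
            for j, cluster in enumerate(clusters)
        ]
    return centroids


def d(z, centroids):
    """Distance from z to its nearest centroid."""
    return min(math.dist(z, c) for c in centroids)


def is_target(z, centroids, threshold):
    """Accept z as a target-class member if it lies close enough to a centroid."""
    return d(z, centroids) < threshold


# Toy target class made of two well-separated groups (hypothetical data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
centroids = kmeans(points, k=2)
```

With a threshold of 1.0, `is_target((0.6, 0.4), centroids, 1.0)` accepts the sample, whereas a point between the two groups such as `(5, 5)` is rejected.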

One-Class Principal Component Analysis (OC-PCA)
Principal component analysis (PCA) is a classical statistical method, a linear transform that has been widely used in data analysis and compression. PCA is mainly a projection method used to reduce the dimensionality of a given dataset by capturing most of the variance in a few orthogonal directions called principal components (PCs). For the one-class approach (OC-PCA), one builds the PCA model from the training set; then, for a given test example z, the distance from z to the PCA model, PCA(z), is calculated and used as the decision factor for acceptance or rejection.
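A minimal sketch of this idea is given below, under strong simplifying assumptions: only the first principal component is kept, it is found by power iteration, and the "distance to the model" is taken to be the Euclidean distance between the centered sample and its projection onto that component. None of these choices is confirmed by the source, and the names and data are hypothetical.

```python
import math


def fit_first_pc(samples, iterations=100):
    """Estimate the mean and first principal direction by power iteration."""
    n, dims = len(samples), len(samples[0])
    mean = [sum(s[j] for s in samples) / n for j in range(dims)]
    centered = [[s[j] - mean[j] for j in range(dims)] for s in samples]
    v = [1.0] * dims  # start direction; fails if it is orthogonal to the first PC
    for _ in range(iterations):
        # Multiply v by the (unnormalized) covariance matrix: w = X^T (X v).
        proj = [sum(a * b for a, b in zip(row, v)) for row in centered]
        w = [sum(p * row[j] for p, row in zip(proj, centered))
             for j in range(dims)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v


def reconstruction_error(z, mean, v):
    """Distance between the centered sample and its projection onto v."""
    c = [x - m for x, m in zip(z, mean)]
    t = sum(a * b for a, b in zip(c, v))  # coordinate along the first PC
    return math.sqrt(sum((a - t * b) ** 2 for a, b in zip(c, v)))


# Toy target class lying on the line y = x (hypothetical data).
train = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
mean, pc = fit_first_pc(train)
```

A sample on the same line, such as `(4.0, 4.0)`, has an error close to zero, while `(0.0, 3.0)` lies far from the model and would be rejected for any small threshold.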

One-Class K-Nearest Neighbor (OC-KNN)
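The one-class nearest neighbor classifier accepts a test sample z when the distance from z to its nearest neighbor in the training set, y = NN(z), is small relative to the distance from y to its own nearest neighbor, NN(y):

$$\frac{\lVert z - y \rVert}{\lVert y - NN(y) \rVert} \le \delta, \qquad y = NN(z)$$

where NN(y) is the nearest neighbor of y; in other words, it is the nearest neighbor of the nearest neighbor of z.
The default value of $\delta$ is 1. The average distance of the k nearest neighbors is considered for the OC-KNN implementation.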
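One way to read this rule is sketched below: the average distance from z to its k nearest training samples is compared with the average distance from the nearest neighbor of z to its own k nearest training samples, and z is accepted when the ratio does not exceed the threshold. The exact averaging used by DDtools is not given in the source, so this is an assumption, as are the names and toy data.

```python
import math


def knn_mean_distance(point, training, k, exclude_index=None):
    """Average distance from point to its k nearest training samples."""
    dists = sorted(math.dist(point, t)
                   for i, t in enumerate(training) if i != exclude_index)
    return sum(dists[:k]) / k


def nearest_index(point, training):
    """Index of the training sample closest to point."""
    return min(range(len(training)),
               key=lambda i: math.dist(point, training[i]))


def accept(z, training, k=1, threshold=1.0):
    """Accept z if its neighbor distance is at most threshold times that of NN(z)."""
    y_idx = nearest_index(z, training)
    numerator = knn_mean_distance(z, training, k)
    # Exclude y itself so that its distance to itself (zero) is not counted.
    denominator = knn_mean_distance(training[y_idx], training, k,
                                    exclude_index=y_idx)
    # Written without division so a zero denominator does not raise.
    return numerator <= threshold * denominator


# Toy target class: corners of the unit square (hypothetical data).
training = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
```

Here `accept((0.5, 0.5), training)` is true because the sample is closer to a corner than the corners are to each other, while `accept((3.0, 3.0), training)` is false.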

Naïve Bayes
Naïve Bayes is a classification model obtained by applying a relatively simple method to a training dataset [28]. A Naïve Bayes classifier calculates the probability that a given instance (example) belongs to a certain class. It makes the simplifying assumption that the features constituting the instance are conditionally independent given the class. Given an example X, described by its feature vector (x1,...,xn), we are looking for a class C that maximizes the likelihood:

$$P(X \mid C) = P(x_1, \ldots, x_n \mid C)$$

The (naïve) assumption of conditional independence among the features, given the class, allows us to express this conditional probability P(X | C) as a product of simpler probabilities:

$$P(X \mid C) = \prod_{i=1}^{n} P(x_i \mid C)$$
We used the Rainbow program [29] to train the naïve Bayes classifier. To combine the numeric features identified in the miRNA-target duplex with the sequence features ("words") in the target candidate sequence, a dictionary of all the unique "words" was generated and the frequency of each "word" in the sequence was used.
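To make the product of per-feature probabilities concrete, the sketch below trains a small word-count naïve Bayes model and classifies a new list of words. It is not the Rainbow program: the add-one (Laplace) smoothing, the inclusion of a class prior, the class labels and the toy examples are all assumptions introduced for illustration.

```python
import math
from collections import Counter


def train(examples):
    """examples: list of (words, label). Returns (class counts, word counts, vocabulary)."""
    class_counts = Counter(label for _, label in examples)
    word_counts = {label: Counter() for label in class_counts}
    for words, label in examples:
        word_counts[label].update(words)
    vocab = {w for words, _ in examples for w in words}
    return class_counts, word_counts, vocab


def log_score(words, label, class_counts, word_counts, vocab):
    """log P(C) + sum_i log P(x_i | C), with add-one smoothing (an assumption)."""
    score = math.log(class_counts[label] / sum(class_counts.values()))
    denom = sum(word_counts[label].values()) + len(vocab)
    for w in words:
        if w in vocab:  # words never seen in training are ignored
            score += math.log((word_counts[label][w] + 1) / denom)
    return score


def classify(words, model):
    """Return the class with the highest log score."""
    class_counts = model[0]
    return max(class_counts, key=lambda c: log_score(words, c, *model))


# Hypothetical toy examples: lists of sequence "words" with class labels.
examples = [
    (["UU", "UA", "AC"], "target"),
    (["UU", "UU", "UA"], "target"),
    (["GG", "GC", "CG"], "non-target"),
    (["GC", "GG", "GG"], "non-target"),
]
model = train(examples)
```

Working in log space turns the product into a sum and avoids numerical underflow when many features are multiplied together.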

Support Vector Machines (SVMs)
The Support Vector Machine (SVM) is a learning machine developed by Vapnik [27]. Compared to other algorithms, it has proven particularly useful for the analysis of various classification problems and has recently been widely used in the bioinformatics field [30,31,32]. Linear SVMs are usually defined as SVMs with a linear kernel. The training data for a linear SVM may not be linearly separable, in which case a soft-margin SVM can be applied. A linear SVM separates the two classes in the training data by producing the optimal separating hyper-plane with a maximal margin between the class 1 and class 2 samples. Given a training set of labeled examples $(x_i, y_i)$, $i = 1, \ldots, \ell$, where $x_i \in \mathbb{R}^n$ and $y_i \in \{+1, -1\}$, the SVM finds the separating hyper-plane of the form

$$w \cdot x + b = 0$$

Here, w is the "normal" of the hyper-plane. The constant b defines the position of the hyper-plane in the space. One could use the following formula as a predictor for a new instance:

$$f(x) = \operatorname{sign}(w \cdot x + b)$$

For more information see Vapnik [27].
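For a linear SVM, the predictor is the sign of w · x + b. The short sketch below applies that rule with a hand-picked normal vector and offset; in practice w and b come out of SVM training, so the values here are purely hypothetical.

```python
def predict(x, w, b):
    """Return +1 or -1 according to the side of the hyper-plane w.x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    # Points exactly on the hyper-plane are assigned to class +1 by convention here.
    return 1 if score >= 0 else -1


# Hypothetical hyper-plane parameters (not learned from data).
w = (1.0, -1.0)
b = -0.5
label_a = predict((2.0, 0.0), w, b)
label_b = predict((0.0, 2.0), w, b)
```

The first point gives a score of 1.5 and falls in class +1; the second gives -2.5 and falls in class -1.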

Random Forest
Random forests are an ensemble of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest [33]. The improvement in classification accuracy is due to the growing of an ensemble of trees that vote for the most popular class. Random forests are becoming increasingly popular because of their ability to deal with small sample sizes in high-dimensional spaces.

C4.5
C4.5 is a decision tree algorithm developed by Quinlan (1993) [34]. A decision tree is a simple structure where non-terminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes.

Data
A collection of 326 confirmed miRNA targets (human, mouse, fruit fly, worm and virus) was downloaded from the TarBase web site [35] (TarBase_V4, TarBase flat file data as of 04/2007) to serve as positive examples, and 1,000 negative examples were chosen at random from the negative class pool generated in the NBmiRTar study [20].
To evaluate classification performance, we used the positive class data and the 1,000 negative examples. The negative class is not used for training the one-class classifiers, but merely for estimating specificity. Each one-class algorithm was trained using 90% of the positive class, and the remaining 10% was used for sensitivity evaluation. The randomly selected 1,000 negative examples were used for the evaluation of specificity. The whole process was repeated 100 times in order to evaluate the stability of the methods. Additionally, the Matthews Correlation Coefficient (MCC) [36] is used to take into account both over-prediction and under-prediction in imbalanced data sets. It is defined as:

$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$

where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively.
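The evaluation measures can be computed from the four confusion-matrix counts as sketched below, with MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The counts in the example are hypothetical and are not results from this study; returning 0 when the denominator is zero is a common convention adopted here as an assumption.

```python
import math


def sensitivity(tp, fn):
    """Fraction of positive examples that are correctly accepted."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of negative examples that are correctly rejected."""
    return tn / (tn + fp)


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient; 0.0 if any marginal total is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts for one test split: 33 held-out positives, 1,000 negatives.
tp, fn, tn, fp = 30, 3, 900, 100
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
score = mcc(tp, tn, fp, fn)
```

Because the negative test set is much larger than the positive one, MCC can be modest even when sensitivity and specificity are both high, which is why it is reported alongside them.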

DISCUSSION
The one-class approach in machine learning has been receiving more attention, particularly for solving problems where the negative class is not well defined [37,38,39,40]; moreover, the one-class approach has been successfully applied in various fields, including text mining [41], functional Magnetic Resonance Imaging (fMRI) [42], signature verification [43] and miRNA gene discovery [44].
This paper describes a comparison of one-class and two-class approaches for miRNA target detection. The advantage of one-class methods is that they do not require any additional effort to choose the best way of generating the negative class, although the two-class approaches clearly outperform the one-class methods.
Table 1 shows the performance of the five one-class classifiers, while Table 2 shows the performance of the two-class methods. The results of the one-class approaches show a slight superiority of OC-Kmeans over the other one-class methods, based on the average MCC. An MCC value of +1 represents a perfect prediction, while a value of 0 indicates a prediction no better than random. However, the accuracy is lower than that of the two-class approaches. During the training stage of each one-class classifier, we designated as "outliers" the 10% of the positive data whose likelihood under the estimated distribution is furthest from that of the remaining positive data, in order to produce a compact classifier. This might cause a loss of information about the target class, which might in turn reduce performance compared to the two-class approach.

CONCLUSIONS
The current results show that it is possible to build a classifier based only on positive examples that yields reasonable performance. More effort is required to identify additional biological features for use in the design of the one-class classifier in order to improve performance. We also hypothesize that treating 10% of the training data as "outliers" is a cause of the reduced one-class performance.

3' uagcgccaaauauggUUUACUUA 5'
Figure 1. Duplex partitioned into two parts, the seed part and the out-seed part, for miRNA hsa-miR-579 and its target LRIG3. The seed part appears in capital letters.

Table 1. One-class results.
Table 2. Two-class results.