A New Approach for HIV-1 Protease Cleavage Site Prediction Combined with Feature Selection

Acquired immunodeficiency syndrome (AIDS) is a fatal disease that seriously threatens human health, and human immunodeficiency virus (HIV) is its pathogen. Investigating HIV-1 protease cleavage sites can help researchers find or develop protease inhibitors that restrain the replication of HIV-1 and thus combat AIDS. Feature selection is a new approach to the HIV-1 protease cleavage site prediction task and is the key point of our research. Compared with previous work, our work has several advantages. First, a filter method is used to eliminate redundant features. Second, besides traditional orthogonal encoding (OE), we use two kinds of recently proposed features extracted by applying principal component analysis (PCA) and non-linear Fisher transformation (NLF) to the AAindex database; both have been shown to perform better than OE. Third, the data set is substantially expanded to 1922 samples. To further improve prediction performance, we optimize the parameters of the support vector machine (SVM) classifier, and we fuse the three kinds of features to obtain a comprehensive feature representation. To evaluate our method thoroughly, we use five evaluation parameters, more than in previous work. The experimental results show that our method achieves better performance than the state-of-the-art method. This indicates that feature selection combined with feature fusion and classifier parameter optimization can effectively improve HIV-1 cleavage site prediction. Moreover, our work can support the future development of HIV-1 protease inhibitors.


INTRODUCTION
Acquired immune deficiency syndrome (AIDS) is a disease with high mortality caused by infection with HIV-1. HIV-1 protease is a key enzyme in the virus replication process: it cleaves specific proteins into smaller peptides, which in turn form the proteins indispensable for replication [1]. HIV-1 protease inhibitors bind firmly to the protease but cannot be cleaved, so the protease can no longer bind its substrates and its function is inhibited. Nevertheless, it is not practical to find inhibitors in the laboratory by biological experiment alone, because there are too many kinds of peptides to test one by one. Take octapeptides as an example: there are 20 kinds of amino acid residues in nature, so there are 20^8 (about 2.56 × 10^10) kinds of octapeptides altogether, far too many to test experimentally. Machine learning can be used to address this problem [2].
For a machine learning task, feature extraction, dimensionality reduction, classifier design and performance evaluation are all of great importance, and they are discussed in turn below. The object of study in this work is the octapeptide, which contains eight amino acid residues. In previous investigations, researchers proposed different feature extraction methods for octapeptide sequences, which can be divided into two main categories: features based on the peptide sequence and features based on physicochemical properties [3]. Orthogonal encoding (OE) is a classical sequence-based feature extraction method. Features based on physicochemical properties can be extracted from the Amino Acid Index Database (AAindex database), a collection of amino acid indices from published papers [4]. The inherent characteristics of amino acids can provide useful information for the prediction task [5], and many published bioinformatics investigations use data from this database [6][7][8]. Loris Nanni and his colleague proposed two kinds of new physicochemical features by applying principal component analysis (PCA) and non-linear Fisher transformation (NLF) to this database [9]. The two kinds of new features were compared with OE and turned out to perform better. For some pattern recognition tasks, if a stand-alone method is not good enough, ensembles of features can be used to improve classification performance [10]. The three kinds of features are therefore fused in our research to obtain a comprehensive representation. Their work also mentions that feature selection can improve classification performance, and feature selection is a key point of this paper.
Feature selection is an effective dimensionality reduction method that is quite different from feature transformation. It does not change the original features; it keeps their original structure and helps in understanding the physical meaning of the data [11]. It also removes redundant features and raises classifier efficiency, thus improving prediction performance [12]. Locality preserving projection (LPP) is an effective feature transformation method that retains the meaningful information and eliminates the redundant information [13]. However, the retained information is stored in the transformed features, which are difficult to interpret. We want to find the relationship between the retained information and the transformed features. Thus a feature selection approach called BPFS, which approximates LPP, is used to find the optimal feature subset [14]. The subset consists of features from the original feature space and contains the meaningful information. BPFS has one severe drawback: the optimal number of features in the subset is not clearly defined, and different data may have different optimal numbers. In this paper, we test subsets of every possible size and calculate multiple evaluation parameters to compare their prediction performance, on the basis of which we determine the optimal feature number for each kind of original features.
Performance evaluation is very important for a machine learning task, and different evaluation parameters can be used. Loris Nanni and his colleague use euc (1 − auc) to evaluate their method, which is equivalent to using auc [9,15]. Auc measures the overall performance of a classifier by setting different classification thresholds and calculating the corresponding sensitivities and specificities. However, for our HIV-1 protease cleavage site prediction task, the best threshold needs to be determined in order to provide the best prediction capability. Matthews correlation coefficient (mcc) is well suited to evaluating the prediction performance of our work at the best classification threshold [16], because it takes sensitivity and specificity into consideration at the same time. We also calculate accuracy, sensitivity, specificity and auc to evaluate our work more fully; each has its own characteristics and advantages. Among the five, mcc is the most important evaluation parameter.
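The threshold-dependent evaluation parameters discussed above can be computed from the four confusion-matrix counts. The following is a minimal sketch using the standard definitions; the function name and the counts in the usage lines are illustrative assumptions, not results from this paper.

```python
import math

def confusion_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity and MCC from confusion counts.

    Assumes both classes are present (tp + fn > 0 and tn + fp > 0).
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)   # fraction of cleaved sites found
    specificity = tn / (tn + fp)   # fraction of non-cleaved sites rejected
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report MCC as 0 when a marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "mcc": mcc}

# Hypothetical counts, for illustration only.
metrics = confusion_metrics(tp=90, tn=180, fp=20, fn=10)
```

Because mcc combines all four counts, it penalizes a classifier that scores well on one class at the expense of the other, which accuracy alone can hide on an unbalanced data set.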
The rest of this paper is organized as follows: Section 2 introduces the data set and the feature selection method. Section 3 shows the experimental results and presents a detailed analysis of them. Finally, Section 4 provides the conclusion.

Data Set
There are 20^8 kinds of octapeptides, which is a very large number. To investigate inhibitor prediction effectively, the data set should contain as many samples as possible so that it is as complete as possible: the larger the data set, the more useful the prediction results. In previous papers some classic data sets have been collected and analyzed. The best known is the 362 data set collected by Cai and Chou [17]. A larger one is the 746 data set collected by You, Garwicz and Rognvaldsson [18]. To enlarge the data set, Hyeoncheol Kim, Tae-Sun Yoon and their colleagues added 392 new octapeptides to the 362 data set, generating a 754-sample data set [19]. The largest data set mentioned in published investigations is the 1625 data set collected by Kontijevskis and his colleagues [20]. To obtain a larger data set, we merge all the data sets above and get 3618 samples. After removing contradictory and redundant samples, 1922 octapeptides remain, including 596 positive samples and 1326 negative samples. This data set is called the 1922 data set.
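The merging step can be sketched as follows. This is only an illustration under stated assumptions: a redundant sample is taken to be an octapeptide that occurs more than once with the same label, a contradictory sample is one that occurs with both labels, and contradictory octapeptides are discarded entirely. The text does not specify how contradictions were resolved, and the sequences below are placeholders, not real data.

```python
def merge_datasets(*datasets):
    """Merge (octapeptide, label) pairs from several data sets.

    Duplicates collapse to one entry; octapeptides that appear with
    conflicting labels are dropped (an assumption, see the lead-in).
    """
    labels_seen = {}
    for dataset in datasets:
        for peptide, label in dataset:
            labels_seen.setdefault(peptide, set()).add(label)
    return {peptide: next(iter(labels))
            for peptide, labels in labels_seen.items()
            if len(labels) == 1}

# Placeholder sequences; label 1 = cleaved, 0 = not cleaved.
set_a = [("AAAAAAAA", 1), ("CCCCCCCC", 0)]
set_b = [("AAAAAAAA", 1), ("CCCCCCCC", 1), ("DDDDDDDD", 0)]
merged = merge_datasets(set_a, set_b)
```

Here the repeated "AAAAAAAA" is kept once, "CCCCCCCC" is removed because its two labels disagree, and "DDDDDDDD" is kept.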

Feature Selection
A filter method named BPFS is used here to eliminate redundant features. BPFS is a recently proposed feature selection method that maps the original high-dimensional features into a lower-dimensional space with a binary projection matrix (all of its elements are 0 or 1), thus accomplishing feature selection. Correntropy is used as the evaluation function: BPFS seeks the subset whose correntropy with the sample labels is maximal. Assume there are two data sets X and Y, each containing N samples. Then the correntropy of X and Y can be calculated according to Eq. 1.
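As an illustration of the quantity involved, the sketch below computes the usual sample estimator of correntropy, the mean of a Gaussian kernel evaluated on the pairwise differences x_i − y_i. The Gaussian kernel and the width sigma are assumptions made here for illustration; the exact form of Eq. 1 is not reproduced in this text.

```python
import math

def gaussian_kernel(u, sigma):
    """Normalised Gaussian kernel with width sigma."""
    return math.exp(-u * u / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)

def correntropy(x, y, sigma=1.0):
    """Sample correntropy: average kernel value over paired samples."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    return sum(gaussian_kernel(a - b, sigma) for a, b in zip(x, y)) / len(x)

# Identical sequences give the kernel's peak value; dissimilar ones give less.
same = correntropy([0.0, 1.0, 2.0], [0.0, 1.0, 2.0])
different = correntropy([0.0, 1.0, 2.0], [3.0, -2.0, 5.0])
```

The estimator is largest when the two sequences agree sample by sample, which is why maximizing it favours feature subsets that track the target closely.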
At the beginning of the algorithm, LPP is carried out to get the mapping matrix C. Assume the data set contains n samples, the original feature number of the data is d, and the feature number after conducting LPP is p. The feature selection model is as follows: a data set X ∈ R^(d×n) contains n samples, each represented by a d-element vector x_i; learn a mapping matrix W that maximizes the objective function J(W), where W is a 0-1 matrix. Assume that the n samples in the data set belong to N_c different classes and that the number of samples in the class to which x_i belongs is [...]. Let Y be the data set after feature selection; then Y = WX. J(W) can be represented by the correntropy between Y and C, as shown in Eq. 2.
Here [...].
For all i and j, [...], and [...]. A series of mathematical operations shows that the task of finding the best projection matrix can be converted into a binary programming problem, which we solve with the Hungarian algorithm. A drawback of BPFS is that the intrinsic dimension of the data is not determined, so the optimal number of features in the subset is not known in advance. In the following part, we determine the best subset size for each kind of features.
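To make the binary programming step concrete, the sketch below assumes it takes the form of an assignment problem: each of the p rows of W selects exactly one original feature, no feature is selected twice, and the total of a p × d score matrix is maximized. Both this constraint set and the score matrix are assumptions for illustration, since the actual objective and constraints are not reproduced in this text. Exhaustive search stands in for the Hungarian algorithm, which solves the same kind of problem in polynomial time.

```python
from itertools import permutations

def best_binary_projection(score):
    """Return the 0-1 matrix W (p x d) maximizing the summed selected scores.

    Brute force over all assignments of p rows to distinct columns;
    suitable only for tiny examples.
    """
    p, d = len(score), len(score[0])
    best_cols, best_total = None, float("-inf")
    for cols in permutations(range(d), p):
        total = sum(score[row][col] for row, col in enumerate(cols))
        if total > best_total:
            best_cols, best_total = cols, total
    w = [[1 if col == best_cols[row] else 0 for col in range(d)]
         for row in range(p)]
    return w, best_total

# Hypothetical scores for p = 2 projected dimensions and d = 4 features.
score = [[0.9, 0.1, 0.3, 0.2],
         [0.8, 0.7, 0.1, 0.4]]
w, total = best_binary_projection(score)
selected = [row.index(1) for row in w]   # indices of the chosen features
```

Note that the second row cannot take feature 0 even though 0.8 is its highest score, because the first row already uses it; this coupling between rows is what makes the problem an assignment problem rather than a row-by-row maximum.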

Optimization for Subset Feature Number
BPFS is an effective feature selection method, but the number of features in the subset, p, must be set before using it. Thus, before applying BPFS to the three kinds of features, the optimal p value for each must be determined. Here p is determined by exhaustively testing subsets for every value of p. Take OE as an example: each amino acid residue is represented by a 20-bit vector, so an octapeptide sequence is represented by a 160-feature vector, which means the feature number of the original OE data is 160. At the beginning, p is set to 1 and BPFS is conducted, giving a subset that contains one feature. We carry out 10-fold cross validation on this subset, compute four evaluation parameters (accuracy, sensitivity, specificity and mcc) and save them. Then p is set to 2 and the same procedure is repeated. Each time p is increased by 1, until p reaches 160. When all the work is done, the evaluation parameters for each value of p have been saved, and the optimal p is determined from them.
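The 160-feature OE representation mentioned above can be sketched as follows. The ordering of the 20 residues within each 20-bit block is an assumption (alphabetical one-letter codes); any fixed ordering gives an equivalent encoding.

```python
# The 20 standard amino acids as one-letter codes; the order is an assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def orthogonal_encode(peptide):
    """Encode an octapeptide as 8 concatenated 20-bit one-hot blocks."""
    if len(peptide) != 8:
        raise ValueError("expected an octapeptide (8 residues)")
    features = []
    for residue in peptide:
        block = [0] * len(AMINO_ACIDS)
        block[AMINO_ACIDS.index(residue)] = 1   # raises ValueError if unknown
        features.extend(block)
    return features

# Placeholder sequence, for illustration only.
vector = orthogonal_encode("ACDEFGHI")
```

Each octapeptide therefore has exactly eight non-zero features out of 160, one per position.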
The principle we follow is to choose the point at which each parameter reaches a relatively high value and from which all subsequent values remain relatively high. We consider the values of all the parameters across all subsets together and then determine the optimal p value. For example, the original feature number of OE for an octapeptide is 160. Figure 1 shows all the parameter values for the different subsets. The abscissa of each subgraph denotes the feature number of the subset, and the ordinate denotes the value of the evaluation parameter. When the subset includes 120 features, the four parameters reach relatively high values and the following values are high too. Thus p is set to 120 for OE. For PCA based features, each amino acid residue is represented by a 19-element feature vector, so an octapeptide sequence is represented by a 152-feature vector. For NLF based features, each amino acid residue is represented by an 18-element feature vector, so an octapeptide sequence is represented by a 144-feature vector. Repeating the same procedure for PCA and NLF based features gives optimal p values of 124 and 106, respectively. In the following part, the prediction capability of the three optimal subsets is examined.
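The selection principle above is qualitative. One possible way to make it concrete, offered here only as a sketch, is to pick the smallest p from which every evaluation parameter stays within a fixed tolerance of its own maximum. The tolerance value and the metric curves below are hypothetical and are not taken from Figure 1.

```python
def choose_p(metric_curves, tolerance=0.02):
    """Smallest p such that, from p onward, every metric stays within
    `tolerance` of that metric's maximum over all subset sizes.

    metric_curves maps a metric name to a list whose entry k holds the
    value obtained with a subset of k + 1 features.
    """
    curves = list(metric_curves.values())
    n = len(curves[0])
    for start in range(n):
        if all(min(curve[start:]) >= max(curve) - tolerance for curve in curves):
            return start + 1          # p is 1-based
    return n                          # fall back to the full feature set

# Hypothetical curves for subsets of 1 to 5 features.
curves = {
    "accuracy": [0.70, 0.80, 0.90, 0.91, 0.90],
    "mcc":      [0.40, 0.60, 0.79, 0.80, 0.80],
}
p_opt = choose_p(curves)
```

With these curves both metrics are near their maxima from three features onward, so the rule returns 3. A tighter tolerance pushes the chosen p higher.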

EXPERIMENTS AND DISCUSSIONS
In order to analyze and compare the experimental results comprehensively, multiple evaluation parameters are used in this paper: accuracy, sensitivity, specificity, mcc and auc. Unlike Loris Nanni's work, in which only euc is used, our work can assess the experimental results from several angles and provide guidance for designing HIV-1 protease inhibitors.
In order to obtain good prediction capability, parameter optimization is conducted for SVM in this paper. The radial basis function (RBF) is chosen as the kernel function. Accuracy, mcc and auc are used separately to determine the optimal C and g values by 10-fold cross validation. The three parameters are unbiased and can thus evaluate classification performance effectively. The range of C is set between 2^0 and 2^5, and the range of g is set between 2^(−5) and 2^0. Each time the exponent of 2 increases by 0.5 until it reaches the upper limit. The results of parameter optimization are shown in Table 1, which lists the optimal C and g determined according to accuracy, mcc and auc respectively.
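The search grid described above can be written out explicitly. The sketch below only enumerates the candidate (C, g) pairs; the helper name is hypothetical, and scoring each pair by 10-fold cross validation is left to whichever SVM implementation is used.

```python
def power_of_two_grid(low_exp, high_exp, step=0.5):
    """Values 2**e for e = low_exp, low_exp + step, ..., high_exp."""
    count = int(round((high_exp - low_exp) / step)) + 1
    return [2.0 ** (low_exp + i * step) for i in range(count)]

c_values = power_of_two_grid(0, 5)     # 2^0 ... 2^5
g_values = power_of_two_grid(-5, 0)    # 2^-5 ... 2^0
grid = [(c, g) for c in c_values for g in g_values]
```

Each range contains 11 values, so the full grid has 121 candidate pairs. If an implementation such as scikit-learn were used (an assumption; the text does not name a library), each pair would correspond to `SVC(kernel="rbf", C=c, gamma=g)`, with g interpreted as the RBF kernel width parameter.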
First we use accuracy to determine the optimal C and g. We then test the prediction performance by 10-fold cross validation and calculate the five evaluation parameters. Table 2 shows the detailed results of each kind of features and their fusion combinations. Comparing the five evaluation parameters of the original OE, PCA and NLF based features, we find that PCA and NLF based features achieve better prediction performance than OE, and that PCA based features perform a little better than NLF based features. The ensemble of the three original features significantly improves prediction capability and performs better than every single original feature. This means that fusing the three kinds of original features makes effective use of the different information they contain, thus improving prediction capability. Examining the results of the three subsets, we find that their performance is quite close to that of the corresponding original features. This means that feature selection successfully eliminates redundant features and preserves informative ones, thus keeping good prediction capability. The ensemble of the three subsets achieves the best result in this table, which means that redundant features are eliminated, useful features are preserved, and the different kinds of information are used effectively. The results show that fusing the subsets obtained by feature selection can significantly improve prediction performance.
Mcc is also used to optimize the SVM parameters. The prediction results of 10-fold cross validation are shown in Table 3.
From this table, we find that the prediction capabilities of the original OE, PCA and NLF based features differ: PCA based features achieve the best results, NLF based features are slightly inferior, and OE performs worst of the three. This is consistent with the conclusion of the previous part: PCA and NLF based features have better prediction capability than OE. Again the ensemble of the three kinds of original features significantly improves prediction performance. The three subsets obtain prediction capability very close to that of their original features. The ensemble of the three subsets also achieves very good results, equivalent to those of the ensemble of the three kinds of original features. This means that fusing the subsets keeps prediction capability as good as that of the original features even though the dimension of the feature space is reduced.
Finally, auc is used to choose the optimal parameters for SVM, and the results of 10-fold cross validation are shown in Table 4.
From this table, we find that the original OE and NLF based features have equivalent prediction capability, and PCA based features are better than both. The results of the three subsets are again close to those of their original features. This time the ensemble of the three kinds of original features gives slightly inferior results to the original PCA based features, possibly because the SVM parameters are not appropriate enough. Nevertheless, the ensemble of the three subsets still achieves the best results, which means that fusing the three kinds of features after feature selection is useful and effective for HIV-1 protease cleavage site prediction.
Comparing all the results shown in the three tables, we find that the best results are obtained by feature fusion of the three subsets using the SVM parameters optimized for classification accuracy. Its mcc and auc values are the largest among all the experimental results, and the other three evaluation parameters are also very high. In Loris Nanni's work, only one evaluation parameter is used: euc, which is calculated as 1 − auc. Our work provides five parameters to evaluate prediction performance, because a single parameter is not enough to measure the results effectively. Although the best euc in Loris Nanni's work is 0.007 and the best euc in our work is 0.008 (1 − 0.992), our work achieves a high mcc value. Euc measures the overall performance of a classifier across different classification thresholds, but the most important point of the HIV-1 protease cleavage site prediction task is to train a good classifier with optimal parameters and so obtain a good prediction model. Finding the single best threshold ensures that the classifier has its best prediction capability, and mcc evaluates the prediction performance at the optimal parameters and classification threshold. The best results in our work are encouraging: the best mcc is 0.914, which is a high value. It is reasonable to believe that our results are better than the state-of-the-art results, including Loris Nanni's. Our work can help researchers and doctors discover or design HIV-1 protease inhibitors in the future.

CONCLUSION
Feature selection is a new approach for HIV-1 protease cleavage site prediction. Unlike traditional methods, our work eliminates redundant features, simplifies the feature structure and improves prediction performance. The physicochemical properties of amino acid residues provide much useful information, and we try to make good use of them for the prediction task. Thus two recently proposed kinds of features, extracted from the AAindex database by PCA and NLF, are used in this paper. Traditional OE features are also used, and the experimental results show that the two kinds of new features perform better than OE. To make effective use of the physicochemical and sequence information contained in an octapeptide, we fuse the three kinds of features to represent it. Parameter optimization for SVM is also conducted to improve the prediction capability of the classifier. To make a thorough comparison between our method and previous work, five evaluation parameters are calculated. The results show that our method achieves better prediction performance than the state-of-the-art work. In the future, we expect to find a new feature extraction method that generates more informative features to represent an amino acid residue. More effective feature selection methods could be used to pick out useful and informative features and improve prediction performance. Moreover, a more successful ensemble method of features or classifiers could be applied to the prediction task. We hope that future investigation of HIV-1 protease cleavage sites will provide further help for HIV-1 protease inhibitor development.

Figure 1. The test results of all possible subsets for OE features.

Table 1. The optimal C and g determined according to different evaluation parameters (a).
(a) Here OE means the original OE features, and OE_FS means the subset of OE features after feature selection. The PCA and NLF based features are indicated in the same way. The ensemble of the three kinds of original features is shown as All_fusion, and the ensemble of the three subsets is shown as FS_Fusion. The two values in each column are the C and g values for SVM respectively.

Table 2. Prediction performance of accuracy based optimization parameters.

Table 3. Prediction performance of mcc based optimization parameters.

Table 4. Prediction performance of auc based optimization parameters.