Improving Protein Sequence Classification Performance Using Adjacent and Overlapped Segments on Existing Protein Descriptors

In protein sequence classification research, it is common to convert a variable-length protein sequence into a fixed-length numerical vector by using various descriptors, for instance, k-mer composition. Such position-independent descriptors are useful because they are applicable to sequences of any length; however, the positional information of subsequences is discarded even though it might contribute substantially to classification performance. To solve this problem, we divided the original sequence into several segments and then calculated the numerical features for each of them. This enables us to partially introduce positional information (for instance, the compositions of serine in the anterior and posterior segments of a sequence). Through comprehensive experiments on the number of segments and the length of the overlapping region, we found that our classification approach with sequence segmentation and feature selection is effective in improving performance. We evaluated our approach on three protein classification problems and achieved significant improvement in all cases whose datasets have a sufficient number of amino acids in each sequence. This result shows the potential of using additional segments in protein sequence classification to solve other sequence problems in bioinformatics.


INTRODUCTION
The protein sequence is an essential asset in protein classification research. To apply different machine learning approaches to protein sequence data, it is standard practice to convert a protein sequence into a numerical representation. This process is called feature extraction, and it is a critical step because the choice of an effective and appropriate type of feature extraction profoundly affects classification performance. This has driven scientists to develop algorithms or programs that perform the feature extraction process, which are commonly known as protein descriptors [1].
Within two decades, scientists have developed various protein descriptors and have used them for various cases of protein analysis. Xiao et al. [1] grouped the commonly used descriptors into eight groups: Amino Acid Composition, Autocorrelation, CTD, Conjoint Triad, Quasi-Sequence-Order, Pseudo-Amino Acid Composition, Proteochemometric descriptors, and PSSM. These groups contain 22 types of descriptors that have been actively used in research.
The following are commonly used protein descriptors and their applications in protein analysis research. Bhasin and Gajendra [2] used Amino Acid Composition (AAC) and Dipeptide Composition (DC) to predict nuclear receptors. They used a Support Vector Machine (SVM) as the classifier and achieved an overall accuracy of 82.6% with numerical features from AAC and 97.5% with DC. Feng and Zhang [3] studied the prediction of membrane protein types. They adopted a formulation of the autocorrelation functions based on the hydrophobicity index of the 20 amino acids as the protein descriptor. Using the Bayes discriminant algorithm as the classifier, they achieved overall predictive accuracies as high as 94% and 82% for the re-substitution and jackknife tests, respectively. These results are about 13% higher in the re-substitution test and 8% higher in the jackknife test than those of algorithms based only on the amino acid composition. Dubchak et al. [4] conducted a study on protein folding prediction using the global description of amino acid sequences, also known as CTD (Composition/Transition/Distribution), as the protein descriptor. Using a neural network as the classifier, they obtained 71.7% accuracy for positive class prediction and 90% to 95% for the negative class. In 2007, Shen et al. [5] presented a computational approach for predicting protein-protein interactions (PPIs). The Support Vector Machine (SVM) algorithm was used to develop the methodology. They constructed numerical features representing the PPI information by using the conjoint triad descriptor. On average, their method produced a PPI prediction model with an accuracy of 83.90% ± 1.29%. Another commonly used protein descriptor is the quasi-sequence-order descriptor. Chou [6] used this descriptor to predict protein subcellular locations. Using this descriptor with the augmented covariant discriminant algorithm as the classifier, the author achieved accuracies between 79.6% and 86.4%.
Amino Acid Composition (AAC) is one of the protein descriptors most often used in protein analysis. AAC carries information about the 20 amino acid components but has no positional (i.e. sequence order) information. To increase the descriptor's ability, Chou [7] developed Pseudo Amino Acid Composition (PseAAC) by adding a set of sequence correlation factors. Using PseAAC, a significant improvement in the quality of protein subcellular location prediction was observed for both the ProtLock algorithm and the covariant discriminant algorithm. In another study, the author combined the 20 features from amino acid composition with 2λ correlation factors that reflect different hydrophobicity and hydrophilicity distribution patterns along a protein chain [8]. This combination also achieved better performance than AAC on the prediction of 16 subfamily classes of oxidoreductases.
The protein descriptors described above can be grouped as alignment-free descriptors. There are also descriptors grouped as alignment-based descriptors [9] or profile-based descriptors [10]. A profile-based descriptor generates a feature vector based on a Position-Specific Scoring Matrix (PSSM) obtained by running PSI-BLAST. It produces feature vectors whose size varies with the number of amino acids in the sequence. Rangwala and Karypis [11] used this descriptor for remote homology detection and fold recognition. It can improve the overall ability to recognize remote homologs and to distinguish proteins that share the same structural fold.
Existing protein descriptors perform feature extraction using information such as hydrophobicity, polarizability, polarity, charge, surface tension, secondary structure, solvent accessibility and normalized Van der Waals volume. However, Asgari and Mofrad [12] developed a protein descriptor that does not need that information. They adopted an existing method from natural language processing (NLP), continuous vector representation, which is a distributed representation of words. Testing was performed on 7027 protein families using SVM as the classifier. They obtained a weighted average accuracy of 93% ± 0.06%.
A combination of several existing descriptors can also generate a new numerical representation of a protein sequence. This numerical representation carries more information than the features generated from a single descriptor, and it can improve prediction accuracy. Ong et al. [13] carried out such a study in 2007 for predicting protein functional families. They used various descriptors of the alignment-free group: Amino Acid Composition, Dipeptide Composition, Normalized Moreau-Broto Autocorrelation, Moran Autocorrelation, Geary Autocorrelation, Quasi Sequence Order, Pseudo Amino Acid Composition, and Descriptors of Composition, Transition, and Distribution. They gained slightly better prediction performance than with individual descriptors. In other research, Liu et al. [14] studied combinations of alignment-free and alignment-based descriptors using Pseudo Amino Acid Composition (PseAAC) and a profile-based descriptor. They proposed two methods for remote protein homology detection. The first method, named PseAAC Index, is a combination of features from PseAAC and 531 indices extracted from the AA Index database. This method obtained an average ROC score of 0.88. The second method, named PseAAC Index-Profile, is a combination of PseAAC Index with a profile-based protein representation, and it obtained an average ROC score of 0.922. According to these studies, combining features from various protein descriptors can improve prediction performance in general.
However, according to the study by Ong et al. [13], those features may not always improve prediction performance because they contain noise. The authors suggested using a feature selection method to reduce noise and choose important features.
One thing these studies have in common is that only the full length of the sequence is used as input to the protein descriptor. This means that the output of the protein descriptor describes only the state of the whole protein. If the descriptor takes a segment of the sequence as input, it gives information about that segment. We can therefore expect a mixture of numerical representations from the full-length sequence and its segments to provide information about the whole protein state (global information) as well as the state of each segment (local information). In this study, we propose an effective approach for improving the capabilities of existing alignment-free protein descriptors by using adjacent and overlapped segments as inputs. We also tried a combination of various descriptors with this input. With this approach, we improved prediction performance on several validation datasets. However, this approach may produce noisy features through the use of overlapping segments and redundant descriptors. Accordingly, by applying feature selection together with feature ranking, we achieved slight improvements in prediction accuracy and, at the same time, could find which types of features were most useful for increasing it.
The remainder of this paper is organized as follows. In Section 2, we present in detail how the model works. In Section 3, we show experiments and results of evaluating the model with several validation datasets. Finally, discussions and conclusions are given in Section 4.

Existing Alignment-Free Protein Descriptors
In this research, we used the protein descriptors from the R package protr. This package provides various structural and physicochemical descriptors and proteochemometric (PCM) modeling descriptors for amino acid sequences [1]. A list of the protein descriptors covered by protr is presented in Table 1.
Protr has eight descriptor groups. The first seven groups are alignment-free descriptors and the last group, PSSM, is an alignment-based descriptor. The PSSM group has the PSSM profile descriptor, which produces outputs whose number of features depends on the number of amino acids.
In active research on protein classification, feature extraction is one of the important processes. This process converts a protein sequence into numerical features by using a protein descriptor. If s is a protein sequence with n amino acids, where

$$s = a_1 a_2 \cdots a_n$$

the protein descriptor can then be written as the following formula:

$$\mathrm{descriptor}(s) = f = (f_1, f_2, \ldots, f_m)$$

The output of descriptor(s) is the numerical features f, where each $f_j$ is a decimal number and m is the number of features.
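As an illustration of the mapping from a variable-length sequence s to a fixed-length vector f, the following is a minimal Python sketch of Amino Acid Composition. It is not the protr implementation used in this study: the function name `aac` and the alphabetical feature order in `AMINO_ACIDS` are assumptions made only for this example.

```python
from collections import Counter

# Assumed feature order: the 20 standard residues in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(sequence):
    """Amino acid composition: fraction of each of the 20 residue types (m = 20)."""
    if not sequence:
        raise ValueError("sequence must contain at least one residue")
    counts = Counter(sequence)
    return [counts[a] / len(sequence) for a in AMINO_ACIDS]


# A sequence of any length n is mapped to a vector with a fixed m = 20 entries.
s = "MCMDVRCPSICTAPGSRGLASACMERVCIC"
f = aac(s)
print(len(f))
```

Whatever the length of s, the output always has 20 entries, which is what makes the descriptor position-independent.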
To obtain more information about the sequence and to improve prediction accuracy, a combination of various descriptors is also commonly used to generate a numerical representation of a protein sequence. The combination of various descriptors can be represented as:

$$f = \bigcup_{type} \mathrm{descriptor}_{type}(s), \qquad \mathrm{descriptor}_{type}(s) = (f^{type}_1, f^{type}_2, \ldots, f^{type}_{m_{type}})$$

where type is the descriptor type, type ∈ {amino acid composition, dipeptide composition, tripeptide composition, and the other descriptors listed in Table 1}, and $m_{type}$ is the number of features generated by descriptor type. For instance, if we use two types of descriptors such as Amino Acid Composition (aac) and Dipeptide Composition (dc), then we have the numerical features:

$$f = (f^{aac}_1, \ldots, f^{aac}_{20}, f^{dc}_1, \ldots, f^{dc}_{400})$$
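The concatenation of two descriptors can be sketched in Python as follows. This is an illustrative stand-in, not the protr code: the names `aac`, `dc` and `combined`, and the ordering of residues and dipeptides, are assumptions of this example. It shows that Amino Acid Composition (20 features) and Dipeptide Composition (400 features) together give 420 features.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed feature order
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def aac(sequence):
    """Fraction of each of the 20 residue types."""
    counts = Counter(sequence)
    return [counts[a] / len(sequence) for a in AMINO_ACIDS]


def dc(sequence):
    """Fraction of each of the 400 dipeptides among the n - 1 adjacent pairs."""
    total = len(sequence) - 1
    if total < 1:
        raise ValueError("dipeptide composition needs at least two residues")
    counts = Counter(sequence[i:i + 2] for i in range(total))
    return [counts[p] / total for p in DIPEPTIDES]


def combined(sequence):
    """Concatenate the outputs of both descriptors into one feature vector."""
    return aac(sequence) + dc(sequence)


features = combined("MCMDVRCPSICTAPGSRGLASACMERVCIC")
print(len(features))
```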
One successful report of this approach is the study that predicted protein functional families by using a combination of eight descriptors from the alignment-free groups [13]. Another study used a combination of alignment-free and alignment-based descriptors for remote protein homology detection [14]. Both studies reached the same conclusion: a combination of various descriptors can give a better result than a single descriptor.

Protein's Features Construction
Equations (1) and (2) represent the feature extraction process that has been used in active research. What both equations have in common is that they use the full-length sequence s as the input, and the output f provides global information about s.
Our goal is to construct protein features that carry complete information: not only global information but also local information. If the sequence s is divided into several segments and each segment becomes an input to a descriptor, then each output carries local information about its location. We obtain new features by concatenating all those outputs. The division into segments is done in two steps.
In the first step, we generate segments that have approximately the same length. The first segment starts at the beginning of the sequence, followed by the second segment, and so on. We call these adjacent segments. Given a protein sequence

$$s = a_1 a_2 \cdots a_{n_s}$$

where $n_s$ is the number of amino acids in sequence s, and an initial number of amino acids in each segment

$$n_{segment} = \lfloor n_s / k \rfloor$$

each segment is then generated as follows:

$$segment_j = a_{(j-1)\,n_{segment}+1} \cdots a_{j\,n_{segment}}, \qquad j = 1, \ldots, k-1$$

and for the last segment, when j = k:

$$segment_k = a_{(k-1)\,n_{segment}+1} \cdots a_{n_s}$$

In the second step, we generate additional segments to capture local information between two adjacent segments. We call these overlapped segments. An overlapped segment is the union of the half at the end of one segment and the half at the beginning of the next segment. For example, the overlapped segment for segment 1 and segment 2 consists of the second half of segment 1 followed by the first half of segment 2. Each overlapped segment can be generated using the following formula:

$$overlapped_j = r(segment_j) \cup l(segment_{j+1}), \qquad j = 1, \ldots, k-1$$

where $r(segment_j)$ is the second half of $segment_j$ and $l(segment_{j+1})$ is the first half of $segment_{j+1}$.

After all segments are created, we calculate the features of the sequence by using the formula below:

$$f = \mathrm{descriptor}(s) \cup \bigcup_{j=1}^{k} \mathrm{descriptor}(segment_j) \cup \bigcup_{j=1}^{k-1} \mathrm{descriptor}(overlapped_j)$$

For instance, if sequence s is divided into k segments (k = 3) and the protein descriptor is Amino Acid Composition, the generated features are:

$$f = (f^{s}_1, \ldots, f^{s}_{20}) \cup (f^{segment_1}_1, \ldots, f^{segment_1}_{20}) \cup (f^{segment_2}_1, \ldots, f^{segment_2}_{20}) \cup (f^{segment_3}_1, \ldots, f^{segment_3}_{20}) \cup (f^{overlapped_1}_1, \ldots, f^{overlapped_1}_{20}) \cup (f^{overlapped_2}_1, \ldots, f^{overlapped_2}_{20})$$

By using k = 3, the numerical representation of a sequence has 120 numerical features.

J. Biomedical Science and Engineering
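The two segmentation steps can be sketched in Python as follows, using the 30-residue example sequence of this paper with k = 3. This is a sketch under stated assumptions rather than the authors' code: the function names are hypothetical, the per-segment length is assumed to be rounded down with the last segment absorbing the remainder, the halves of odd-length segments are assumed to be split by integer division, and `aac` is a simplified stand-in for the protr descriptor.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed feature order


def aac(sequence):
    """Simplified amino acid composition (20 features)."""
    counts = Counter(sequence)
    return [counts[a] / len(sequence) for a in AMINO_ACIDS]


def adjacent_segments(s, k):
    """Split s into k consecutive segments; the last one takes any remainder."""
    if k < 1 or k > len(s):
        raise ValueError("k must be between 1 and the sequence length")
    n_seg = len(s) // k  # assumed: initial segment length, rounded down
    segments = [s[j * n_seg:(j + 1) * n_seg] for j in range(k - 1)]
    segments.append(s[(k - 1) * n_seg:])
    return segments


def overlapped_segments(segments):
    """Join the second half of each segment with the first half of the next."""
    return [left[len(left) // 2:] + right[:len(right) // 2]
            for left, right in zip(segments, segments[1:])]


def segment_features(s, k, descriptor=aac):
    """Concatenate descriptor outputs of s, its adjacent and overlapped segments."""
    segments = adjacent_segments(s, k)
    inputs = [s] + segments + overlapped_segments(segments)
    features = []
    for part in inputs:
        features.extend(descriptor(part))
    return features


s = "MCMDVRCPSICTAPGSRGLASACMERVCIC"
print(adjacent_segments(s, 3))
print(overlapped_segments(adjacent_segments(s, 3)))
print(len(segment_features(s, 3)))
```

With k = 3 there are 1 + 3 + 2 = 6 descriptor inputs, which matches the 120 features stated above for Amino Acid Composition.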
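In our study, we expect that using several values of k provides more complete information about sequence s than using a single k value, for example k = 2, 3, ..., z, where z is a positive integer. The numerical features for sequence s are then generated as defined below:

$$f = \mathrm{descriptor}(s) \cup \bigcup_{k=2}^{z} \left( \bigcup_{j=1}^{k} \mathrm{descriptor}(segment_{k,j}) \cup \bigcup_{j=1}^{k-1} \mathrm{descriptor}(overlapped_{k,j}) \right)$$

where $segment_{k,j}$ and $overlapped_{k,j}$ are the j-th adjacent and overlapped segments obtained when s is divided into k segments. We also implement this approach with a combination of various descriptors, so that sequence s has the numerical features:

$$f = \bigcup_{type} \left( \mathrm{descriptor}_{type}(s) \cup \bigcup_{k=2}^{z} \left( \bigcup_{j=1}^{k} \mathrm{descriptor}_{type}(segment_{k,j}) \cup \bigcup_{j=1}^{k-1} \mathrm{descriptor}_{type}(overlapped_{k,j}) \right) \right)$$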
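The size of the resulting feature vector can be estimated with a short worked calculation. It assumes that the full-length sequence is described once, that each k contributes k adjacent and k - 1 overlapped segments, and that the descriptor returns a fixed number m of features per input; these assumptions are ours and are not stated explicitly in the text.

```latex
% Number of descriptor inputs for k = 2, ..., z (full sequence counted once):
N(z) = 1 + \sum_{k=2}^{z} \bigl( k + (k-1) \bigr)
     = \sum_{k=1}^{z} (2k - 1)
     = z^{2}
% Total number of features for a descriptor with m outputs:
|f| = m \, z^{2}
% Examples: AAC (m = 20),  z = 7:  20 \cdot 49 = 980
%           DC  (m = 400), z = 4: 400 \cdot 16 = 6400
```

Under these assumptions the feature count grows quadratically with z, which is one reason feature ranking and selection become useful for larger z.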

Algorithm
Our proposed approach consists of three main steps. The flowchart of our approach is shown in Figure 1. The first step is feature extraction, which has three processes:
1) Sanity check of the amino acid types removes any amino acid that is not one of the 20 standard amino acid types.
2) Sequence segmentation divides a sequence into adjacent segments and overlapped segments.
3) Feature construction converts the original sequence, the adjacent segments, and the overlapped segments into numerical features by using an existing descriptor from the protr package. A concatenation of all those numerical features is then created.
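The sanity check in process 1 can be sketched as a simple filter. This is an assumed implementation for illustration only: the function name `sanitize` is hypothetical, and converting the input to upper case first is our own choice, not something described in the text.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residue codes


def sanitize(sequence):
    """Drop every character that is not one of the 20 standard amino acid codes."""
    # Assumption: lower-case input is accepted and normalized to upper case.
    return "".join(ch for ch in sequence.upper() if ch in STANDARD_AMINO_ACIDS)


print(sanitize("MCXMDU*B"))
```

Codes such as X, U and B, as well as stop symbols, are removed before segmentation, so that every descriptor input contains only the residues the descriptor can count.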
The second step is classification. This step has two processes that are commonly used in classification research. We conduct k-fold cross-validation or a jackknife test; each process in this step is repeated k times or n times, respectively, where n is the number of samples.
1) Feature ranking sorts the features by importance. The randomForest function for R [15] performs this process.
2) Feature selection and prediction create feature subsets and then perform learning and prediction with the ksvm function in the kernlab package for R [16].
The last step is prediction accuracy calculation, which computes the accuracy for each feature subset.
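The bookkeeping around the classification step can be sketched in plain Python. The sketch below leaves out the random forest and SVM themselves (the study uses randomForest and ksvm in R) and only illustrates how folds, ranked feature subsets and accuracy might be organized. The helper names are hypothetical, and building nested top-n subsets from the ranking is an assumption, since the text does not say how the subsets are formed.

```python
def fold_indices(n_samples, k):
    """Test indices of each of k folds; k == n_samples gives the jackknife test."""
    return [list(range(i, n_samples, k)) for i in range(k)]


def top_n_subsets(importance, sizes):
    """Rank features by importance (descending) and return nested top-n subsets."""
    ranked = sorted(range(len(importance)),
                    key=lambda j: importance[j], reverse=True)
    return {n: ranked[:n] for n in sizes}


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy importance scores for four features (illustrative values only).
subsets = top_n_subsets([0.1, 0.5, 0.3, 0.9], sizes=[1, 2])
print(fold_indices(10, 5))
print(subsets)
print(accuracy([1, 0, 1, 1], [1, 1, 1, 0]))
```

In a full pipeline, each fold would train a classifier on every feature subset and the accuracy would be recorded per subset, so that the best-performing subset can be reported.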

EXPERIMENTS AND RESULTS
To show the validity of our proposed approach for improving existing alignment-free protein descriptors on protein sequence classification problems, we conducted experiments with datasets from UniProt, Swiss-Prot, and NucleaRDB. Our experiments are grouped into three protein analysis cases: classification of nuclear receptors, protein family classification, and cell-penetrating peptides prediction.

Classification of Nuclear Receptors
In this section, we evaluate the strength of our proposed approach on a single protein descriptor in the classification of nuclear receptors. Nuclear receptors are key transcription factors that regulate important gene networks responsible for cell growth, differentiation, and homeostasis [2]. Classification of nuclear receptors has been studied in research [2,17].
In the work of Bhasin and Gajendra [2], the classification was based on the amino acid composition and dipeptide composition of nuclear receptor sequences using a support vector machine (SVM). They did training and testing on a non-redundant dataset of 282 proteins obtained from the NucleaRDB database. The dataset had four subfamilies of nuclear receptors, as shown in Table 2.
The performance of both classifiers was evaluated using 5-fold cross-validation. The accuracy of the amino acid composition-based classifier was 82%, and that of the dipeptide composition-based classifier was 97.5%.
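In the research by Wang et al. [17], the classification was based on various protein descriptors of nuclear receptor sequences using fuzzy K-nearest neighbor (FK-NN). They converted each sequence into numerical features by using a combination of amino acid composition, dipeptide composition, complexity factor and low-frequency Fourier spectrum components. The training and testing were done on 159 sequences of nuclear receptors obtained from the NucleaRDB database and 500 sequences of non-nuclear receptors obtained from the UniProt database. No sequence had ≥60% sequence identity with any other sequence in this dataset. The nuclear receptor data had seven subfamilies, as shown in Table 3. They created a two-layer predictor. The first layer was used to identify a query protein as NR or non-NR. If it was an NR, the second layer was used to identify the NR among the seven subfamilies. The performance of all classifiers was evaluated using the jackknife test and an independent dataset test. The overall accuracy of the first-layer predictor was 92.56% with the jackknife test and 98.03% with the independent dataset test. The overall accuracy of the second-layer predictor was 88.68% with the jackknife test and 99.65% with the independent dataset test.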
Research [2] uses a single-descriptor-based classifier, whereas research [17] uses a classifier based on various descriptors. The two studies are similar in that both use amino acid composition and dipeptide composition.
To compare the results of our proposed approach with those of research [2], we applied their method to the data provided by research [17]. However, we used only four subfamilies, the same subfamilies that were used in research [2], as shown in Table 4.
We also used the same classifier and evaluation method, namely a Support Vector Machine with a 5-fold cross-validation test. In this experiment, we converted each sequence into numerical features by using Equation (17). In the amino acid composition-based classifier experiment, we obtained the best prediction accuracy at z = 7. In the dipeptide composition-based classifier experiment, the best prediction accuracy was achieved at z = 4. The comparison of our experimental results with the results of the method from research [2] is shown in Table 5.
We also investigated the important features that contributed to the prediction accuracy. Table 6 and Table 7 show the details of the 790 important features obtained in the AAC_7 FS experiment and the 355 important features obtained in the DC_4 FS experiment.
Furthermore, we compared the results of research [17] with ours. In the experiment of identifying NR and non-NR, we used the amino acid composition-based classifier with z = 3 and the dipeptide composition-based classifier with z = 2. The result is shown in Table 8.
Details of the important features of AAC_3 FS and DC_2 FS are shown in Table 9 and Table 10.
In the second-level experiment, we identified NR subfamilies by using the amino acid composition-based classifier with z = 5 and the dipeptide composition-based classifier with z = 2. The comparison result is shown in Table 11. The details of the important features in the AAC_5 FS and DC_2 FS experiments are shown in Table 12 and Table 13.

Protein Family Classification
In this experiment, we evaluate the strength of our proposed approach on a combination of various protein descriptors. We selected protein family classification as the case. A protein family is a set of proteins that are evolutionarily related, typically involving similar structures or functions [12]. Protein family classification has been studied in research [12,18]. Cai et al. [18] classified 54 functional families. Their feature extraction used a combination of the composition, transition, and distribution descriptors. The reported accuracies of family classification were in the range of 69.1% to 99.6%. In another study, Asgari and Mofrad [12] performed classification of 7027 protein families. They applied a new feature extraction method known as ProtVec. The average accuracy for the first 1000 families was 94% ± 0.05%, and the average accuracies for the 2000, 3000 and 4000 most frequent families were 93% ± 0.05%, 92% ± 0.06%, and 91% ± 0.08%, respectively. The weighted accuracy over all 7027 families was 93% ± 0.06%.
In this experiment, we used the dataset provided by Asgari and Mofrad [12] and performed 1000 classification cases using the first 1000 families. Each classification in this experiment is a balanced binary classification: the positive class consists of samples of the selected protein family, and the negative class consists of randomly selected samples. In the feature extraction process, we used a combination of various protein descriptors, namely Amino Acid Composition (AAC), composition (CTDC), transition (CTDT), and distribution (CTDD), with z = 5. We used SVM as the classifier with a 10-fold cross-validation test as the evaluation method. We used feature selection to check whether it gave a significant increase in prediction accuracy. There were improvements, but they were not significant, as shown in Table 14.
We investigated the feature subsets that obtained the best prediction accuracy in each family classification case. The results of our investigation for three families are shown in Tables 15-17. We found that these feature subsets were formed from all four descriptors that we used and from all of the k values.
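The construction of one balanced binary classification case can be sketched as follows. This is an illustrative assumption about the procedure, not the code used in the experiment: the function name, the use of sampling without replacement, and the fixed random seed are all choices made for this example.

```python
import random


def balanced_binary_dataset(family_sequences, other_sequences, seed=0):
    """Pair all sequences of one family with an equal number of random negatives."""
    if len(other_sequences) < len(family_sequences):
        raise ValueError("not enough non-family sequences to balance the classes")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    negatives = rng.sample(list(other_sequences), len(family_sequences))
    samples = list(family_sequences) + negatives
    labels = [1] * len(family_sequences) + [0] * len(negatives)
    return samples, labels


# Toy sequences standing in for one family and for the rest of the dataset.
family = ["MCMDVRCPSI", "MCMDVKCPSI", "MCMEVRCPSL"]
others = ["CTAPGSRGLA", "SACMERVCIC", "GGSLAKRTPV", "AAVLKMNPQR", "WYTSRQPNML"]
samples, labels = balanced_binary_dataset(family, others)
print(len(samples), sum(labels))
```

Repeating this construction once per family would give the 1000 classification cases described above, each with the same number of positive and negative samples.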

Cell-Penetrating Peptides Prediction
Cell-penetrating peptides (CPPs) are small peptides that are about 10 to 30 amino acids long. CPPs can carry various bioactive cargoes, ranging from small molecules to proteins and supramolecular particles, directly into cells without significantly damaging the cell membrane. This makes them potential drug delivery agents for the translocation of cargo into cells. CPP prediction research has increased in the past few years. CPPsite2.0 is a CPP-specific database that has approximately 1850 experimentally validated CPPs [19].
CPPred-RF is one method that has succeeded in solving the CPP prediction case [19]. In this study, Wei et al. used two datasets, CPP924 and CPPsite3. The details of those datasets are shown in Table 18. In the feature extraction process, they used a combination of several descriptors, i.e. parallel correlation pseudo-amino-acid composition (PC-PseAAC), series correlation pseudo-amino-acid composition (SC-PseAAC), adaptive skip dipeptide composition (ASDC) and physicochemical properties (PPs). The result is a numerical representation with 636 features. Feature selection was then applied by using Max-Relevance-Max-Distance (MRMD) as the feature ranking method and Sequential Feature Selection (SFS) as the optimal feature selector. They used random forest as the classifier with the jackknife test at the prediction and evaluation stage. The result is 91.6% accuracy for the CPP924 dataset and 71.1% accuracy for CPPsite3.
In this experiment, we implemented our approach with both single-descriptor and combined-descriptor classifiers. We used the amino acid composition, dipeptide composition and composition/transition/distribution (CTD) descriptors in the feature extraction process. In the classification and evaluation process, we used SVM as the classifier with a 10-fold cross-validation test. The results are shown in Table 19 and Table 20.
The best performance was obtained by using the AAC-based classifier with the original input sequence. Our approach with z = 2 and z = 3 did not produce better performance; instead, the accuracy decreased. In the experiment with a classifier based on the combination of those descriptors and feature selection, we obtained 76.08% accuracy and 20 important features.

DISCUSSIONS AND CONCLUSIONS
We have shown that our proposed approach is simple to implement and effective in solving protein sequence classification problems. Our approach was tested on three cases: classification of nuclear receptors, protein family classification, and cell-penetrating peptides prediction. We compared the performance of our approach with the performance of other methods that have been used in those cases.
In the first two classification cases, the experimental results show a significant improvement in the prediction accuracy of our approach. We also used randomForest to generate variable importance scores, ranked the features by them, and used this ranking to conduct feature selection. Feature selection also showed us that the feature subsets giving the best accuracy contain features generated from the additional segments. Our approach worked with both single-descriptor and combined-descriptor classifiers. In contrast, our approach did not work well in cell-penetrating peptides prediction. The performance of our approach was not significantly improved, or it was lower than the result of the classifier with the original sequence only. This occurred because the sequences have a small number of amino acids. Table 21 shows the comparison of amino acid numbers for each case.
In this research, we focused only on solving protein sequence classification problems with five out of 21 of the existing protein descriptors that are grouped as alignment-free descriptors. In the future, we will apply the proposed approach using other descriptors. We also need further investigation to find the minimum number of amino acids in a sequence that our approach requires to work properly.
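A short worked calculation illustrates why short sequences leave little information per segment. It assumes the rounded-down segment length used in the segmentation step and takes the CPP length range of about 10 to 30 amino acids quoted earlier; the specific values of n_s are chosen only for illustration.

```latex
% Segment length for a sequence of n_s residues divided into k segments:
n_{segment} = \lfloor n_s / k \rfloor
% Short peptide, n_s = 10:
%   k = 2: n_{segment} = 5, \qquad k = 3: n_{segment} = 3
% Longer peptide, n_s = 30:
%   k = 2: n_{segment} = 15, \qquad k = 3: n_{segment} = 10
% A segment of 3 residues has at most 3 non-zero values among the
% 20 amino acid composition features, each a multiple of 1/3.
```

With so few residues per segment, most segment-level composition features are zero, so the additional features add dimensionality without adding much discriminative information.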
a The number of the descriptor's output features depends on the selection of the number of properties of amino acids and the selection of the parameter.
b The number of the descriptor's output features depends on the selection of the number of components and the selection of the lag parameter.

MCMDVRCPSICTAPGSRGLASACMERVCIC
If we divide sequence s into k segments where k = 3, then the generated segments are as follows:
segment 1 = MCMDVRCPSI
segment 2 = CTAPGSRGLA
segment 3 = SACMERVCIC
With the following formula, where n_segment is the initial number of amino acids in each segment:

Figure 1. The flowchart of the proposed approach.

Table 1. Description of existing protein descriptors.

Table 2. Description of the dataset in Bhasin and Gajendra research.
into numerical features by using a combination of amino acid composition, dipeptide composition, complexity factor and low-frequency Fourier spectrum components. The training and testing were done on 159 sequences of nuclear receptors obtained from the NucleaRDB database and 500 sequences of non-nuclear receptors obtained from the UniProt database. No sequence had ≥60% sequence identity with any other sequence in this dataset. Nuclear receptors data had seven subfamilies as shown in Table

Table 3. Description of the dataset in Wang et al. research.

Table 4. Description of the modified dataset in our research.

Table 5. Prediction accuracy comparison of our approach and the method in research [2].

Table 6. Detail of important features in AAC_7 FS experiment.

Table 7. Detail of important features in DC_4 FS experiment.

Table 8. Prediction accuracy comparison of our approach and the method in research [17] for identifying NR and non-NR.

Table 9. Detail of important features in AAC_3 FS experiment.

Table 10. Detail of important features in DC_2 FS experiment.

Table 11. Prediction accuracy comparison of our approach and the method in research [17] for identifying NR subfamilies.

Table 12. Detail of important features in AAC_5 FS experiment.

Table 13. Detail of important features in DC_2 FS experiment.

Table 14. Prediction accuracy comparison of our approach and the method in research [12] for classifying the first 1000 families.

Table 15. Detail of important features in 50S ribosome-binding GTPase family classification.

Table 16. Detail of important features in Transmembrane receptor (rhodopsin family) family classification.

Table 17. Detail of important features in Ribosomal protein S14p/S29e family classification.

Table 19. The predictive result of the proposed approach on CPP924 dataset.

Table 20. The predictive result of the proposed approach on CPPsite3 dataset.

Table 21. Statistical comparison of amino acid numbers in sequences.