Use Chou’s 5-Steps Rule to Predict Remote Homology Proteins by Merging Grey Incidence Analysis and Domain Similarity Analysis

Detecting remote homology proteins is a challenging problem for both basic research and drug development. Although several methods exist to deal with this problem, the benchmark datasets on which they were trained and tested contain many highly homologous samples, as reflected by the fact that the cutoff threshold was set at 95%. In this study, we reconstructed the benchmark dataset by setting the threshold at 40%, meaning that none of the proteins included in the benchmark dataset has more than 40% pairwise sequence identity with any other in the same subset. Using the new benchmark dataset, we propose a new predictor called “dRHP-GreyFun” based on grey modeling and the functional domain approach. Rigorous cross-validations indicate that the new predictor is superior to its counterparts in both enhancing success rates and reducing computational cost. The predictor can be downloaded from https://github.com/jcilwz/dRHP-GreyFun.


INTRODUCTION
Detecting remote homology relationships among proteins plays a fundamental and central role in computational proteomics. It is particularly useful for drug development [1,2]. With the avalanche of protein sequences generated in the post-genomic age, it is highly desirable to detect remote homology proteins in a timely manner. Although X-ray crystallography is a powerful tool for determining protein 3D structures, it is time-consuming and expensive. Moreover, not all proteins can be successfully crystallized, particularly membrane proteins, which are difficult to crystallize and most of which will not dissolve in normal solvents. Therefore, so far very few membrane protein structures have been determined. Although NMR is indeed a very powerful tool for determining the 3D structures of membrane proteins (see, e.g., [3][4][5][6][7]), it is also time-consuming and costly. To acquire the structural information in a timely manner, a series of 3D protein structures have been developed by means of structural bioinformatics tools (see, e.g., [8][9][10][11][12][13][14][15][16][17][18][19][20]). Meanwhile, facing the explosive growth of biological sequences discovered in the post-genomic age, and in order to use them for drug development in a timely manner, a lot of important sequence-based information, such as PTM (posttranslational modification) sites in proteins [21,22], protein-drug interaction in cellular networking [23], DNA-methylation sites [24], recombination spots [25], and sigma-54 promoters [26], has been deduced by various sequential bioinformatics tools such as the PseAAC approach [27] and the PseKNC approach [28]. Actually, the rapid development of sequential bioinformatics and structural bioinformatics has driven medicinal chemistry into an unprecedented revolution [29], in which computational biology has played an increasingly important role in stimulating the discovery of novel drugs.
In view of this, the computational methods were also utilized in this study for detecting remote homology.
To acquire the structural information in a timely manner, one has to resort to various structural bioinformatics tools based on the sequence similarity principle (see, e.g., [30]). Unfortunately, such a principle cannot cover the cases of remote homology proteins. In view of this, considerable efforts [31-35] have been made to detect remote homology proteins.
Although these methods each had their own merits and did play a stimulating role in this area, further work is needed. Firstly, the benchmark datasets used in their studies had high similarity. For instance, the benchmark dataset in [33, 34] contains 7329 proteins from 1070 different super families, with the pairwise sequence identity cutoff set at 95%. In other words, it would allow proteins with higher than 80% similarity in the benchmark dataset. Secondly, the ranking algorithm used in those studies would spend a lot of time training or learning the model. For example, if the training dataset had $N$ proteins, LambdaMART would need to deal with $N^2$ protein-pair samples.
As demonstrated by a series of recent publications [23,25,26,, to develop a really useful predictor for a biological system, one needs to follow Chou's 5-step rule and go through the following five steps: 1) select or construct a valid benchmark dataset to train and test the predictor; 2) represent the samples with an effective formulation that can truly reflect their intrinsic correlation with the target to be predicted; 3) introduce or develop a powerful algorithm to conduct the prediction; 4) properly perform cross-validation tests to objectively evaluate the anticipated prediction accuracy; 5) establish a user-friendly web-server for the predictor that is accessible to the public. Papers that present a new sequence-analyzing method or statistical predictor by observing the guidelines of Chou's 5-step rule have the following notable merits: 1) crystal clear in logic development, 2) completely transparent in operation, 3) easy for other investigators to repeat the reported results, 4) with high potential in stimulating other sequence-analyzing methods, and 5) very convenient to be used by the majority of experimental scientists. Below, let us elaborate on how to deal with these five steps one by one.

Benchmark Dataset
According to Chou's 5-step rules [72], the first prerequisite in establishing a new predictor is to construct or select an effective benchmark dataset.
In this study, the benchmark dataset was taken from Liu et al. [33]. It contains 7329 proteins from 1070 different super families and 1824 families derived from the SCOP database. To reduce redundancy and homology bias, the program CD-HIT [73] was adopted to remove those proteins that had ≥40% pairwise sequence identity to any other in the same subset. Families that were left with only one protein sequence were also removed. Finally, we obtained 3128 proteins from 540 super-families and 777 families.
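The second filtering step described above (dropping families with a single remaining member) can be illustrated with a minimal sketch. It assumes the redundancy reduction has already been done externally, and the dictionary `toy`, the protein names, and the function name `drop_singleton_families` are hypothetical placeholders rather than anything taken from the dRHP-GreyFun code.

```python
from collections import Counter


def drop_singleton_families(protein_to_family):
    """Keep only proteins whose family still has at least two members."""
    counts = Counter(protein_to_family.values())
    return {p: f for p, f in protein_to_family.items() if counts[f] >= 2}


# Hypothetical toy labels standing in for the output of the redundancy filter.
toy = {"p1": "famA", "p2": "famA", "p3": "famB", "p4": "famC", "p5": "famC"}
kept = drop_singleton_families(toy)
print(sorted(kept))
```

A family with one member cannot supply a true homolog for its own query protein, which is presumably why such families are excluded before evaluation.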

Sample Formulation
Most biological systems have two remarkable features: one is evolution and the other is complexity. All biological species have developed from a very limited number of ancestral species. This is true for protein sequences as well [30]. Their evolution involves changes of single residues, insertions and deletions of several residues, gene doubling, and gene fusion [9,74]. With these changes accumulated over a long period of time, many similarities between the initial and resultant amino acid sequences are gradually eliminated, but the corresponding proteins may still share many common attributes, such as having basically the same biological function, subcellular location, and similar binding sites. To take the evolution information into account, many investigators used the PSSM (Position-Specific Scoring Matrix) approach [75], as done in a series of previous publications (see, e.g., [76][77][78][79][80][81]). On the other hand, biological systems are extremely complicated, with a lot of uncertainties. According to the grey system theory [82], if the information of an investigated system is fully known, it is called a “white system”; if completely unknown, a “black system”; if partially known, a “grey system”. Actually, most biological systems belong to the grey systems, and hence it is particularly effective to treat them with the grey model approach [83][84][85][86].

Grey Incidence Analysis of Proteins Formulated by Grey-PSSM
Given a protein with L amino acid residues, it is usually expressed by

$$\mathbf{P} = R_1 R_2 R_3 \cdots R_L \tag{1}$$

where $R_i$ is the i-th residue in the protein. All the existing machine-learning algorithms (such as the "Optimization" algorithm [87], the "Covariance Discriminant" or "CD" algorithm [88,89], the "Nearest Neighbor" or "NN" algorithm [90], and the "Support Vector Machine" or "SVM" algorithm [90]) can only handle vectors, as elaborated in a comprehensive review [29]. However, a vector defined in a discrete model may completely lose all the sequence-pattern information. To avoid completely losing the sequence-pattern information for proteins, the pseudo amino acid composition [27] or PseAAC [91] was proposed. Ever since then, it has been widely used in nearly all the areas of computational proteomics (see, e.g., [92][93][94][95] as well as a long list of references cited in [96]). Because it has been widely and increasingly used, four powerful open-access software packages, called "PseAAC" [97], "PseAAC-Builder" [98], "propy" [99], and "PseAAC-General" [100], were established: the former three are for generating various modes of Chou's special PseAAC [101], while the fourth is for those of Chou's general PseAAC [72], including not only all the special modes of feature vectors for proteins but also the higher-level feature vectors such as the "Functional Domain" mode (see Eqs.9-10 of [72]), the "Gene Ontology" mode (see Eqs.11-12 of [72]), and the "Sequential Evolution" or "PSSM" mode (see Eqs.13-14 of [72]). Encouraged by the successes of using PseAAC to deal with protein/peptide sequences, the concept of PseKNC (Pseudo K-tuple Nucleotide Composition) [28] was developed for generating various feature vectors for DNA/RNA sequences [102,103], which have proved very useful as well.
In particular, a very powerful web-server called "Pse-in-One" [104] and its updated version "Pse-in-One2.0" [105] have recently been established, which can be used to generate any desired feature vectors for protein/peptide and DNA/RNA sequences according to the needs of users' studies.
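According to the general PseAAC [72], the protein of Equation (1) can be formulated as

$$\mathbf{P} = \left[ \Psi_1 \;\; \Psi_2 \;\; \cdots \;\; \Psi_u \;\; \cdots \;\; \Psi_\Omega \right]^{\mathbf{T}} \tag{2}$$

where $\mathbf{T}$ is the transposing operator, the subscript $\Omega$ is an integer, and its value and the components $\Psi_u$ $(u = 1, 2, \ldots, \Omega)$ will depend on how to extract the desired features and properties from the protein sequence.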
In this study, the Grey-PSSM model proposed by Lin et al. [85,86] is adopted. It extracts the sequential evolution information from the Position-Specific Scoring Matrix (PSSM). After the Grey-PSSM treatment, we finally obtain a 60-D PseAAC vector for Equation (2); i.e., its subscript parameter $\Omega = 60$ and each of the 60 components therein has been uniquely defined. Suppose the set of protein samples is

$$\mathbb{S} = \{P_1, P_2, \ldots, P_N\}$$

where $P_i$ is the i-th protein. According to Eqs.6-11 in Lin et al. [106], the distance $\Gamma(P_i, P_j)$ is defined as the grey incidence degree between $P_i$ and $P_j$. The larger the value of $\Gamma(P_i, P_j)$, the more similar $P_i$ and $P_j$ will be.
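To make the notion of a grey incidence degree concrete, the following is a minimal sketch of the classic Deng-style grey relational grade between two feature vectors. It is an assumption-laden illustration: the exact form of Eqs.6-11 in [106] is not reproduced in this text and may differ, the distinguishing coefficient `rho = 0.5` is only the conventional default, and the minimum and maximum differences are taken over the two vectors alone rather than over a whole reference set.

```python
def grey_incidence_degree(x, y, rho=0.5):
    """Deng-style grey relational grade between two equal-length vectors.

    Assumption: min/max differences are computed over this pair only;
    rho is the conventional distinguishing coefficient.
    """
    if len(x) != len(y) or not x:
        raise ValueError("vectors must be non-empty and of equal length")
    deltas = [abs(a - b) for a, b in zip(x, y)]
    d_min, d_max = min(deltas), max(deltas)
    if d_max == 0.0:
        return 1.0  # identical vectors: maximal incidence
    # Grey relational coefficient per component, then average to a grade.
    coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in deltas]
    return sum(coeffs) / len(coeffs)


# Hypothetical 2-D vectors; real inputs would be the 60-D Grey-PSSM vectors.
g = grey_incidence_degree([0.0, 0.0], [0.0, 1.0])
print(g)
```

With this form the value lies in (0, 1] and is symmetric in its two arguments, which matches the reading that a larger value means greater similarity.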

Domain Similarity Analysis
In addition to the PseAAC [27, 91] approach, the functional domain [107][108][109][110][111][112] can also be used to characterize a protein sample, $P_i \in \mathbb{S}$, according to the following steps.
Step 1. Searching the UniProt release 2018_08 Swiss-Prot FASTA format flatfile by HMMER [113][114][115] for the homology set of protein $P_i$, we obtain $\mathbb{S}_i^{\mathrm{homo}}$. If the outcome has more than 10 protein sequences, only the top 10-ranking ones are used.
Step 2. For the protein in $\mathbb{S}_i^{\mathrm{homo}}$ where $\cup$ denotes union in the set theory. As we can see from Equations (5) and (6), the distance (Dis) between $P_i$ and $P_j$ is within the range $0 \le \mathrm{Dis}(P_i, P_j) \le 1$.
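Since Equations (5) and (6) did not survive in this text, the sketch below only illustrates one plausible way to obtain a set-based distance bounded by 0 and 1 from functional domains. It assumes, hypothetically, that the domain identifiers of the proteins in a homology set are merged by set union and that the distance is of the Jaccard type; the function names and the identifiers `D1`, `D2`, `D3` are placeholders and should not be read as the authors' definition.

```python
def domain_set(homolog_domains):
    """Union of the domain identifiers found across a protein's homology set."""
    merged = set()
    for domains in homolog_domains:
        merged |= set(domains)
    return merged


def domain_distance(domains_i, domains_j):
    """Jaccard-style distance in [0, 1]; 0 means identical domain sets."""
    union = domains_i | domains_j
    if not union:
        # Assumption: with no domain evidence, treat the pair as maximally distant.
        return 1.0
    return 1.0 - len(domains_i & domains_j) / len(union)


# Hypothetical domain annotations for the homologs of two proteins.
d_i = domain_set([["D1", "D2"], ["D2"]])
d_j = domain_set([["D2", "D3"]])
dis = domain_distance(d_i, d_j)
print(dis)
```

Any distance of this shape is 0 for identical domain sets and 1 for disjoint ones, consistent with the stated range.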

Operation Engine or Algorithm
In this study, the Grey Relational Analysis [82,116] and the Domain Similarity Index were utilized to rank the relationship of proteins. Given a query protein, the system will search the benchmark dataset for it and return the top-ranking similar proteins. The predictor thus formed is called "dRHP-GreyFun". Illustrated in Figure 1 is a flowchart showing how the proposed predictor works. In this paper, w(1) and w(2) are equal to 0.5.
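How the two measures might be merged into a single ranking can be sketched as follows. The combination rule is an assumption: the text gives only the values w(1) = w(2) = 0.5, so the sketch supposes a weighted sum of the grey incidence degree and one minus the domain distance. The function `rank_candidates` and the toy scores are hypothetical.

```python
def rank_candidates(grey, dist, w1=0.5, w2=0.5):
    """Rank candidates by w1 * grey incidence + w2 * (1 - domain distance).

    Assumption: a simple weighted sum; ties are broken by candidate name.
    """
    scores = {name: w1 * grey[name] + w2 * (1.0 - dist[name]) for name in grey}
    return sorted(scores, key=lambda name: (-scores[name], name))


# Hypothetical scores of three database proteins against one query.
grey = {"a": 0.9, "b": 0.6, "c": 0.8}
dist = {"a": 0.5, "b": 0.0, "c": 0.9}
ranking = rank_candidates(grey, dist)
print(ranking)
```

Because both inputs are bounded by 0 and 1, equal weights give each source of evidence the same influence on the final order.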

RESULTS AND DISCUSSION
Among the independent dataset test, the sub-sampling (e.g., 5 or 10-fold cross-validation) test, and the jackknife test, which are often used for examining the accuracy of a statistical prediction method [117], the jackknife test was deemed the least arbitrary because it always yields a unique result for a given benchmark dataset [118,119], as clearly elucidated in a comprehensive review paper [72] and demonstrated by Eqs.28-32 therein. Therefore, the jackknife test has been increasingly recognized and widely adopted by investigators to test the power of various prediction methods (see, e.g., [120][121][122][123]). However, to reduce the computational time, we adopted the 5-fold and 10-fold cross-validation in this study, as done by many investigators with SVM as the prediction engine. This is also because the LambdaMART ranking algorithm used in previous studies [33, 34] would consume a lot of training time and computer memory. As a compromise, the 5-fold cross-validation test was adopted there. Now, however, we employ an operation engine based on grey modeling and functional domains to detect the remote homology proteins, significantly reducing the computing time and memory. Therefore, it would be feasible to use the most rigorous jackknife test to examine the prediction quality. The outcomes thus obtained are given in Table 1, where we can see that dRHP-GreyFun achieved the best performance in both the ROC1 score and the ROC50 score.
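The ROC1 and ROC50 scores are not defined in this text, so the sketch below follows the commonly used ROC-n convention as an assumption: the area under the ROC curve up to the first n false positives, normalized by n times the number of true homologs. Padding the remaining false-positive slots when a ranked list contains fewer than n false positives is likewise a convention chosen here, and the function name `roc_n` is hypothetical.

```python
def roc_n(ranked_labels, n):
    """ROC-n score for a ranked list of booleans (True = true homolog)."""
    total_pos = sum(ranked_labels)
    if total_pos == 0 or n <= 0:
        return 0.0
    tp = fp = area = 0
    for is_pos in ranked_labels:
        if is_pos:
            tp += 1
        else:
            fp += 1
            area += tp  # true positives ranked ahead of this false positive
            if fp == n:
                break
    # Assumption: missing false-positive slots count all positives seen so far.
    area += (n - fp) * tp
    return area / (n * total_pos)


# Hypothetical ranked hit list for one query.
labels = [True, False, True, False]
print(roc_n(labels, 1), roc_n(labels, 50))
```

Under this convention ROC1 rewards only the homologs ranked above the very first false positive, whereas ROC50 tolerates up to 50 false positives, so the two scores probe strict and lenient ranking quality respectively.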

CONCLUSIONS
Protein remote homology detection is vitally important for studying protein structures and functions. It is anticipated that the proposed method may become a useful high-throughput tool for both basic research and drug design. As pointed out in [124] and demonstrated in a series of recent publications (see, e.g., [40,[125][126][127][128][129][130][131][132][133][134][135][136][137][138][139][140][141][142][143][144]), user-friendly and publicly accessible web-servers represent the future direction for developing practically more useful prediction methods and computational tools. Actually, many practically useful web-servers have significantly increased the impact of bioinformatics on medical science [29], driving medicinal chemistry into an unprecedented revolution [96]. Accordingly, we have also provided a web-server for the prediction method presented in this paper, by which users can easily get their desired results without the need to go through the complicated mathematical equations involved. Also, all the programs can be downloaded from https://github.com/jcilwz/dRHP-GreyFun.
For the remarkable and awesome roles of the "5-steps rule" in driving proteome, genome analyses and drug development, see a series of recent papers [139,, where the rule and its wide applications have been very impressively presented from various aspects or at different angles.