pLoc_Deep-mEuk: Predict Subcellular Localization of Eukaryotic Proteins by Deep Learning

Recently, human life worldwide has been endangered by the spread of pneumonia-causing viruses, such as Coronavirus, COVID-19, and H1N1. To develop effective drugs against Coronavirus, knowledge of protein subcellular localization is a prerequisite. In 2019, a predictor called “pLoc_bal-mEuk” was developed for identifying the subcellular localization of eukaryotic proteins. Its predicted results are significantly better than those of its counterparts, particularly for proteins that may simultaneously occur in, or move between, two or more subcellular location sites. However, more effort is needed to further improve its power, since pLoc_bal-mEuk was not trained by “deep learning”, a very powerful technique developed recently. The present study was devoted to incorporating the “deep-learning” technique and developing a new predictor called “pLoc_Deep-mEuk”. The global absolute true rate achieved by the new predictor is over 81% and its local accuracy is over 90%. Both are markedly superior to those of its counterparts. Moreover, a user-friendly web-server for the new predictor has been established at http://www.jci-bioinfo.cn/pLoc_Deep-mEuk/.


INTRODUCTION
Knowledge of the subcellular localization of proteins is crucially important for fulfilling the following two goals: 1) revealing the intricate pathways that regulate biological processes at the cellular level [1,2]; 2) selecting the right targets [3] for developing new drugs.
With the avalanche of protein sequences in the post-genomic age, we are challenged to develop computational tools for effectively identifying their subcellular localization purely based on the sequence information.
Open Access Natural Science
In 2019, a very powerful predictor, called "pLoc_bal-mEuk" [4], was developed for predicting the subcellular localization of eukaryotic proteins based on their sequence information alone. It has the following remarkable advantages. 1) Most existing protein subcellular location prediction methods were developed based on the single-label system, in which it was assumed that each constituent protein had one, and only one, subcellular location (see, e.g., [5-7] and the long list of references cited in a review paper [8]). With more experimental data uncovered, however, it has become clear that the localization of proteins in a cell is actually a multi-label system, where some proteins may simultaneously occur in two or more different location sites. Multiplex proteins of this kind often bear exceptional functions worthy of special notice [2]. The pLoc_bal-mEuk predictor [4] can cover this kind of important information, which is missed by most other methods, since it was established based on a multi-label benchmark dataset and theory. 2) Although there are a few methods (see, e.g., [9,10]) that can be used to deal with multi-label subcellular localization for eukaryotic proteins, the prediction quality achieved by pLoc_bal-mEuk [4] is much higher, particularly in the absolute true rate.
Despite these merits, however, the pLoc_bal-mEuk predictor [4] has not yet been trained at a deeper level [11-14].
The present study was initiated in an attempt to address this problem. As done in pLoc_bal-mEuk [4], as well as in many other recent publications on developing new prediction methods, the guidelines of the 5-step rule [58] are followed. They concern the detailed procedures for 1) benchmark dataset, 2) sample formulation, 3) operation engine or algorithm, 4) cross-validation, and 5) web-server. Here, however, our attention is focused on the procedures that significantly differ from those used in developing the predictor pLoc_bal-mEuk [4].

Benchmark Dataset
The benchmark dataset used in this study is exactly the same as that in pLoc_bal-mEuk [4]; i.e.,
$$\mathbb{S} = \mathbb{S}_1 \cup \mathbb{S}_2 \cup \cdots \cup \mathbb{S}_{22} \qquad (1)$$
where $\mathbb{S}_1$ only contains the protein samples from the "Acrosome" location, $\mathbb{S}_2$ only contains those from the "Cell membrane" location, and so forth; $\cup$ denotes the symbol for "union" in set theory. For readers' convenience, their detailed sequences and accession numbers (or ID codes) are given in Supporting Information S1, which is also available at http://www.jci-bioinfo.cn/pLoc_bal-mEuk/Supp1.pdf, where none of the proteins included has ≥25% sequence identity to any other in the same subset (subcellular location).
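The union of location subsets described above can be illustrated with a minimal sketch. The protein identifiers and the third location below are hypothetical placeholders, and the multi-hot encoding is only one common way to represent multi-label samples; it is not claimed to be the representation used by the authors.

```python
# Toy location subsets keyed by 1-based location index.
# Only "Acrosome" (1) and "Cell membrane" (2) are named in the text;
# location 3 and all protein identifiers are hypothetical.
toy_subsets = {
    1: {"protA"},           # "Acrosome"
    2: {"protB", "protC"},  # "Cell membrane"
    3: {"protC"},           # hypothetical third location
}


def union_of_subsets(subsets):
    """Return the whole dataset as the union of all location subsets."""
    dataset = set()
    for members in subsets.values():
        dataset |= members
    return dataset


def multi_hot_labels(subsets, n_locations):
    """Map each protein to a 0/1 vector with one entry per location."""
    labels = {}
    for location, members in subsets.items():
        for protein in members:
            vector = labels.setdefault(protein, [0] * n_locations)
            vector[location - 1] = 1
    return labels
```

A protein that occurs in more than one subset, such as the hypothetical "protC", receives more than one non-zero entry, which is what distinguishes a multi-label system from a single-label one.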

Protein Sample Formulation
Now let us consider the 2nd step of the 5-step rule [58]; i.e., how to formulate the biological sequence samples with an effective mathematical expression that can truly reflect their essential correlation with the target concerned. Given a protein sequence P, its most straightforward expression is
$$\mathbf{P} = \mathrm{R}_1 \mathrm{R}_2 \mathrm{R}_3 \cdots \mathrm{R}_L \qquad (2)$$
where L denotes the protein's length or the number of its constituent amino acid residues, $\mathrm{R}_1$ is the 1st residue, $\mathrm{R}_2$ the 2nd residue, $\mathrm{R}_3$ the 3rd residue, and so forth. Since all the existing machine-learning algorithms can only handle vectors, as elaborated in [3], one has to convert a protein sample from its sequential expression (Equation (2)) to a vector. But a vector defined in a discrete model might completely miss all the sequence-order or pattern information. To deal with this problem, the Pseudo Amino Acid Composition [59], or PseAAC [60], was proposed. Ever since then, the concept of "Pseudo Amino Acid Composition" has been widely used in nearly all areas of computational proteomics with the aim of grasping the various sequence patterns that are essential to the targets investigated (see, e.g., [4,10,23,24]). Because it has been widely and increasingly used, three powerful open-access software packages, called "PseAAC-Builder" [93], "propy" [181], and "PseAAC-General" [120], were recently established: the former two are for generating various modes of special PseAAC [228], while the 3rd one is for those of general PseAAC [58], including not only all the special modes of feature vectors for proteins but also higher-level feature vectors such as the "Functional Domain" mode, "Gene Ontology" mode, and "Sequential Evolution" or "PSSM" mode.
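To see why a vector defined in a discrete model can miss sequence-order information, consider the following minimal sketch. It uses the plain amino acid composition (the normalized frequencies of the 20 native residues) as the simplest discrete model; this choice is an assumption made for illustration and is not the 22-D formulation of [4]. Both sequences are hypothetical.

```python
from collections import Counter

# One-letter codes of the 20 native amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the 20-D vector of normalized residue frequencies."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence)
    length = len(sequence)
    return [counts[residue] / length for residue in AMINO_ACIDS]


# Two hypothetical peptides: the second is the first one reversed.
forward_peptide = "MKVLAAG"
reversed_peptide = forward_peptide[::-1]

# The residue order differs, yet the composition vectors are identical,
# so a purely compositional model cannot tell the two peptides apart.
same_vector = (
    amino_acid_composition(forward_peptide)
    == amino_acid_composition(reversed_peptide)
)
```

Here `same_vector` evaluates to true although the two peptides are different sequences, which is the loss of order information that the pseudo components of PseAAC are meant to compensate for.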
Encouraged by the successes of using PseAAC to deal with protein/peptide sequences, its idea and approach were extended to PseKNC (Pseudo K-tuple Nucleotide Composition) to generate various feature vectors for DNA/RNA sequences [229], which have proved very successful as well [141,146,147,230-238]. According to the concept of general PseAAC [58], any protein sequence can be formulated as a PseAAC vector given by
$$\mathbf{P} = [\Psi_1 \ \Psi_2 \ \cdots \ \Psi_u \ \cdots \ \Psi_\Omega]^{\mathbf{T}} \qquad (3)$$
where T is a transpose operator, while the integer Ω is a parameter whose value, as well as the components $\Psi_u$ (u = 1, 2, …, Ω), will depend on how the desired information is extracted from the amino acid sequence of P, as elaborated in [4]. Thus, by following exactly the same procedures as described in Section 2.2 of [4], each of the protein samples in the benchmark dataset can be uniquely defined as a 22-D numerical vector as given in columns 3-24 of Supporting Information S2, which can also be directly downloaded at http://www.jci-bioinfo.cn/pLoc_bal-mEuk/Supp2.pdf.

Installing Deep-Learning for Three Deeper Levels
In this study, a dense neural network with 3 fully connected layers was used to predict the subcellular localization of multi-label eukaryotic proteins, as illustrated in Figure 1. The predicted results were decided by comparing the network output with the threshold θ: if the output is greater than 0.5, the outcome is true; otherwise, false.
For more information about this, see [11], where the details have been clearly elaborated; hence there is no need to repeat them here.
The new predictor developed via the above procedures is called "pLoc_Deep-mEuk", where "pLoc_Deep" stands for "predict subcellular localization by deep learning", and "mEuk" for "multi-label eukaryotic proteins".
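The decision rule described above can be sketched as a forward pass through three fully connected layers followed by thresholding at 0.5. The sketch below is not the authors' implementation: the hidden-layer widths, the ReLU hidden activation, the sigmoid output, and the random initialization are all assumptions made for illustration, and training is omitted entirely.

```python
import math
import random

# 22-D input vector, two hidden layers, 22 outputs (one per location).
# The hidden widths 64 and 32 are assumptions, not values from the paper.
LAYER_SIZES = [22, 64, 32, 22]
THETA = 0.5  # decision threshold applied to every output


def build_network(layer_sizes, seed=0):
    """Create one (weights, biases) pair per fully connected layer."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        scale = math.sqrt(2.0 / n_in)
        weights = [[rng.gauss(0.0, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def dense(layer, x):
    """Affine map of one layer: weights times x plus biases."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def sigmoid(a):
    """Numerically stable logistic function."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def forward(layers, x):
    """ReLU on the hidden layers (assumed), sigmoid on the output layer."""
    h = x
    for layer in layers[:-1]:
        h = [max(0.0, a) for a in dense(layer, h)]
    return [sigmoid(a) for a in dense(layers[-1], h)]


def predict_locations(layers, x, theta=THETA):
    """Return the 1-based indices of locations whose output exceeds theta."""
    return [i + 1 for i, p in enumerate(forward(layers, x)) if p > theta]
```

Because each of the 22 outputs is thresholded independently, a single query protein can be assigned zero, one, or several locations, which is what a multi-label predictor requires.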

RESULTS AND DISCUSSION
According to the 5-step rule [58], one of the important procedures in developing a new predictor is how to properly evaluate its anticipated accuracy. To deal with that, two issues need to be considered. 1) What metrics should be used to quantitatively reflect the predictor's quality? 2) What test method should be applied to score the metrics?

A Set of Five Metrics for Multi-Label Systems
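Different from the metrics used to measure the prediction quality of single-label systems, the metrics for multi-label systems are much more complicated. To make them more intuitive and easier to understand for most experimental scientists, here we use Chou's five intuitive metrics [239], which have recently been widely used for studying various multi-label systems (see, e.g., [240,241]):
$$
\begin{aligned}
\text{Aiming}\uparrow &= \frac{1}{N^q}\sum_{k=1}^{N^q}\frac{\|\mathbb{L}_k \cap \mathbb{L}_k^*\|}{\|\mathbb{L}_k^*\|} \\
\text{Coverage}\uparrow &= \frac{1}{N^q}\sum_{k=1}^{N^q}\frac{\|\mathbb{L}_k \cap \mathbb{L}_k^*\|}{\|\mathbb{L}_k\|} \\
\text{Accuracy}\uparrow &= \frac{1}{N^q}\sum_{k=1}^{N^q}\frac{\|\mathbb{L}_k \cap \mathbb{L}_k^*\|}{\|\mathbb{L}_k \cup \mathbb{L}_k^*\|} \\
\text{Absolute true}\uparrow &= \frac{1}{N^q}\sum_{k=1}^{N^q}\Delta(\mathbb{L}_k, \mathbb{L}_k^*) \\
\text{Absolute false}\downarrow &= \frac{1}{N^q}\sum_{k=1}^{N^q}\frac{\|\mathbb{L}_k \cup \mathbb{L}_k^*\| - \|\mathbb{L}_k \cap \mathbb{L}_k^*\|}{M}
\end{aligned}
\qquad (4)
$$
where $N^q$ is the total number of query proteins or tested proteins, M is the total number of different labels for the investigated system (for the current study it is $L_{\text{cell}} = 22$), $\|\cdot\|$ means the operator acting on the set therein to count the number of its elements, $\cup$ means the symbol for the "union" in set theory, $\cap$ denotes the symbol for the "intersection", $\mathbb{L}_k$ denotes the subset that contains all the labels observed by experiments for the k-th tested sample, $\mathbb{L}_k^*$ represents the subset that contains all the labels predicted for the k-th sample, and
$$
\Delta(\mathbb{L}_k, \mathbb{L}_k^*) =
\begin{cases}
1, & \text{if all the labels in } \mathbb{L}_k^* \text{ are identical to those in } \mathbb{L}_k \\
0, & \text{otherwise}
\end{cases}
\qquad (5)
$$
In Equation (4), the first four metrics with an upper arrow ↑ are called positive metrics, meaning that the larger the rate is, the better the prediction quality will be; the 5th metric with a down arrow ↓ is called a negative metric, implying just the opposite.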
From Equation (4) we can see the following: 1) the "Aiming" defined by the 1st sub-equation is for checking the rate or percentage of the correctly predicted labels over the practically predicted labels; 2) the "Coverage" defined in the 2nd sub-equation is for checking the rate of the correctly predicted labels over the actual labels in the system concerned; 3) the "Accuracy" in the 3rd sub-equation is for checking the average ratio of correctly predicted labels over the total labels, including correctly and incorrectly predicted labels as well as those real labels that are missed in the prediction; 4) the "Absolute true" in the 4th sub-equation is for checking the ratio of the perfectly or completely correct prediction events over the total prediction events; 5) the "Absolute false" in the 5th sub-equation is for checking the ratio of the completely wrong predictions over the total prediction events.
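The five metrics described above can be computed with a few set operations. The following is a minimal sketch under stated assumptions: the function name is hypothetical, a sample with an empty predicted set is assumed to contribute zero to "Aiming", and the predicted label sets in the usage note are invented for illustration (only the observed label sets are taken from the web-server examples given later in the text).

```python
def five_metrics(observed, predicted, n_labels):
    """Compute the five multi-label metrics over paired label sets.

    observed  -- list of sets of experimentally observed labels
    predicted -- list of sets of predicted labels (same order)
    n_labels  -- total number of different labels M in the system
    """
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty lists of equal length")
    totals = {"aiming": 0.0, "coverage": 0.0, "accuracy": 0.0,
              "absolute_true": 0.0, "absolute_false": 0.0}
    for real, pred in zip(observed, predicted):
        if not real:
            raise ValueError("every sample needs at least one observed label")
        n_common = len(real & pred)
        n_union = len(real | pred)
        # Assumption: an empty prediction contributes 0 to "Aiming".
        totals["aiming"] += n_common / len(pred) if pred else 0.0
        totals["coverage"] += n_common / len(real)
        totals["accuracy"] += n_common / n_union
        totals["absolute_true"] += 1.0 if real == pred else 0.0
        totals["absolute_false"] += (n_union - n_common) / n_labels
    n_samples = len(observed)
    return {name: value / n_samples for name, value in totals.items()}
```

For instance, with the observed sets {1}, {2, 8}, {2, 7, 18} and the invented predictions {1}, {2}, {2, 7, 10} in a 22-label system, only the first sample counts towards "Absolute true", whereas the other two still contribute partial credit to "Aiming", "Coverage", and "Accuracy". This is why the absolute true rate is the most stringent of the five.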

Comparison with the State-of-the-Art Predictor
Listed in Table 1 are the rates achieved by the current pLoc_Deep-mEuk predictor via cross-validation on the same experiment-confirmed dataset as used in [4]. To facilitate comparison, the corresponding results obtained by pLoc_bal-mEuk [4], the most powerful existing predictor for identifying the subcellular localization of eukaryotic proteins with both single and multiple location sites, are also listed there. As shown in Table 1, the newly proposed predictor pLoc_Deep-mEuk is remarkably superior to the existing state-of-the-art predictor pLoc_bal-mEuk in all five metrics. In particular, it can be seen from the table that the absolute true rate achieved by the new predictor is over 81%, which is far beyond the reach of any other existing method. This is notable because it is extremely difficult to enhance the absolute true rate of a prediction method for a multi-label system, as clearly elucidated in [4]. Actually, to avoid embarrassment, many investigators even chose not to mention the absolute true rate in dealing with multi-label systems (see, e.g., [91,178,184]).
Moreover, to examine in depth the prediction quality of the new predictor for the proteins in each of the subcellular locations concerned (cf. Table 2), we used a set of four intuitive metrics that were derived in [242] based on Chou's symbols introduced for studying protein signal peptides [243] and that have since been widely accepted (see, e.g., [242,244]). For the current study, the set of metrics is formulated as given in [242].
Notes to Table 1: the rates for pLoc_bal-mEuk are taken from [4], where the reported metric rates were obtained by the jackknife test on the benchmark dataset of Supporting Information S1, which contains experiment-confirmed proteins only. For the proposed predictor, the test was performed on exactly the same experimental data as reported in [4] for pLoc_bal-mEuk.
Listed in Table 2 are the results achieved by pLoc_Deep-mEuk for the eukaryotic proteins in each of the 22 subcellular locations. As we can see from the table, nearly all the success rates achieved by the new predictor for the eukaryotic proteins in each of the 22 subcellular locations are within the range of 90%-100%, which is once again far beyond the reach of any of its counterparts.
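A per-location evaluation of this kind can be sketched as follows. The sketch assumes that the four metrics are the sensitivity (Sn), specificity (Sp), accuracy (Acc), and Matthews correlation coefficient (MCC), written in the count-based form commonly used with Chou's symbols; this is an assumption about the formulation, and the function and argument names are hypothetical.

```python
import math


def location_metrics(n_pos, n_pos_missed, n_neg, n_neg_wrong):
    """Per-location Sn, Sp, Acc and MCC from four counts.

    n_pos        -- samples that truly belong to the location
    n_pos_missed -- of those, samples predicted as not belonging to it
    n_neg        -- samples that do not belong to the location
    n_neg_wrong  -- of those, samples predicted as belonging to it
    """
    if n_pos <= 0 or n_neg <= 0:
        raise ValueError("both classes need at least one sample")
    miss_rate = n_pos_missed / n_pos
    false_rate = n_neg_wrong / n_neg
    sn = 1.0 - miss_rate
    sp = 1.0 - false_rate
    acc = 1.0 - (n_pos_missed + n_neg_wrong) / (n_pos + n_neg)
    denominator = math.sqrt(
        (1.0 + (n_neg_wrong - n_pos_missed) / n_pos)
        * (1.0 + (n_pos_missed - n_neg_wrong) / n_neg)
    )
    # MCC is undefined when everything is predicted into one class.
    mcc = (1.0 - (miss_rate + false_rate)) / denominator if denominator else float("nan")
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}
```

In a multi-label setting such a function would be applied once per location, treating the proteins of that location as positives and all remaining proteins as negatives.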

Web Server and User Guide
As pointed out in [245], user-friendly and publicly accessible web-servers represent the future direction for developing practically more useful predictors. Actually, user-friendly web-servers as given in a series of recent publications (see, e.g., [219,220,234]) will significantly enhance the impact of theoretical work because they can attract a broad range of experimental scientists [301]. In view of this, the web-server of the current pLoc_Deep-mEuk predictor has also been established. Moreover, to maximize users' convenience, a step-by-step guide is given below.
Step 1. Click the link at http://www.jci-bioinfo.cn/pLoc_Deep-mEuk/; the top page of the pLoc_Deep-mEuk web-server will appear on your computer screen, as shown in Figure 2. Click on the Read Me button to see a brief introduction about the predictor.
Step 2. Either type or copy/paste the sequences of query eukaryotic proteins into the input box at the center of Figure 2. The input sequences should be in the FASTA format. For examples of sequences in FASTA format, click the Example button right above the input box.
Step 3. Click on the Submit button to see the predicted result. For instance, if you use the four protein sequences in the Example window as the input, after 10 seconds or so a new screen (Figure 3) will appear. On its upper part are listed the names of the subcellular locations numbered from (1) to (22) covered by the current predictor. On its lower part are the predicted results: the query protein Q63564 of example-1 corresponds to "1", meaning it belongs to "Acrosome" only; the query protein P23276 of example-2 corresponds to "2, 8", meaning it belongs to "Cell membrane" and "Cytoskeleton"; the query protein Q9VVV9 of example-3 corresponds to "2, 7, 18", meaning it belongs to "Cell membrane", "Cytoplasm", and "Nucleus"; the query protein Q673G8 of example-4 corresponds to "2, 7, 10, 18", meaning it belongs to "Cell membrane", "Cytoplasm", "Endosome", and "Nucleus". All these results are perfectly consistent with experimental observations.
Step 4. As shown on the lower panel of Figure 2, you may also choose the batch prediction by entering your e-mail address and your desired batch input file (in FASTA format, of course) via the Browse button. To see a sample batch input file, click on the Batch-example button. After clicking the Batch-submit button, you will see "Your batch job is under computation; once the results are available, you will be notified by e-mail".
Step 5. Click on the Citation button to find the papers that have played the key role in developing the current predictor of pLoc_Deep-mEuk.
Step 6. Click the Supporting Information button to download the Supporting Information mentioned in this paper.
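Since the input for Steps 2 and 4 must be in FASTA format, it can be useful to check a file locally before submitting it. The following is a minimal sketch of such a check; the function names are hypothetical, and the assumption that only the 20 native amino acid codes are acceptable is ours, not a documented requirement of the web-server. The sequences in the usage note are invented.

```python
# Assumed set of acceptable one-letter residue codes (20 native amino acids).
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(text):
    """Split FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def invalid_residues(sequence):
    """Return the sorted characters that are not in VALID_RESIDUES."""
    return sorted(set(sequence) - VALID_RESIDUES)
```

For example, parsing the text ">query-1" followed by the lines "MKV" and "LAA" yields one record whose sequence is "MKVLAA", and `invalid_residues("GGX")` reports "X" as a character that may need attention before submission.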

CONCLUSION
It is anticipated that the pLoc_Deep-mEuk predictor holds very high potential to become a useful high-throughput tool for identifying the subcellular localization of eukaryotic proteins, particularly for finding multi-target drugs, which is currently a very hot trend in drug development.