iATC_Deep-mISF: A Multi-Label Classifier for Predicting the Classes of Anatomical Therapeutic Chemicals by Deep Learning

The recent worldwide spread of pneumonia-causing viruses, such as H1N1 and the coronavirus that causes COVID-19, has been endangering human lives all around the world. To provide useful clues for developing antiviral drugs, information on anatomical therapeutic chemicals is vitally important. In view of this, a CNN-based predictor called “iATC_Deep-mISF” has been developed. The predictor is particularly useful in dealing with multi-label systems, in which some chemicals may occur in two or more different classes. To maximize convenience for most experimental scientists, a user-friendly web-server for the new predictor has been established at http://www.jci-bioinfo.cn/iATC_Deep-mISF/, which is expected to become a powerful tool for developing effective drugs to fight the pandemic coronavirus and save mankind.


Introduction
According to the ATC (Anatomical Therapeutic Chemical) system (http://www.whocc.no/atc/structure_and_principles) recommended by the WHO (World Health Organization), drug compounds are categorized into the following 14 main groups: 1) alimentary tract and metabolism; 2) blood and blood forming organs; 3) cardiovascular system; 4) dermatologicals; 5) genitourinary system and sex hormones; 6) systemic hormonal preparations, excluding sex hormones and insulins; 7) anti-infectives for systemic use; 8) antineoplastic and immunomodulating agents; 9) musculoskeletal system; 10) nervous system; 11) antiparasitic products, insecticides and repellents; 12) respiratory system; 13) sensory organs; 14) various. Given an uncharacterized compound, can we identify which ATC class or classes it belongs to? This is undoubtedly an important problem for both basic research and drug development.
Advances in Bioscience and Biotechnology
In 2017, a powerful predictor called "iATC-mISF" was developed, which outperformed its counterparts by a wide margin. However, that method has not yet been combined with deep learning, a very powerful technique [1] [2]. The present study is devoted to doing so.
As laid out in the 5-step guidelines [3] and demonstrated in a series of recent publications (see, e.g., [4] [5]), to develop a statistical predictor that can be easily used by experimental scientists and can also stimulate theoretical scientists to develop more relevant ones, the following five steps should be made crystal clear: 1) benchmark dataset, 2) sample formulation, 3) operation algorithm, 4) anticipated accuracy, and 5) web-server. Below, we elaborate on how to deal with these procedures one by one.

Benchmark Dataset
The benchmark dataset used in this study is exactly the same as that in iATC-mISF [6]; i.e.,

Installing Deep-Learning for Three Deeper Levels
In this study, we use a multilayer perceptron neural network model, which consists of 3 fully connected layers, to predict the multi-label ATC classes, as illustrated in Figure 1. We set the input layer with 14 neural units […] were decided by the output of the threshold θ. If the output is greater than 0.5, the outcome is true; otherwise, it is false. For more information, see [1], where the details have been clearly elaborated and hence need not be repeated here.
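The thresholding step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the hidden-layer widths (32 and 16), the ReLU hidden activation, the sigmoid output, the 14-unit output layer (one unit per ATC main group), the random untrained weights, and the helper names `init_layer`, `dense`, and `predict` are all assumptions made for illustration. Only the three fully connected layers, the 14 input units, and the 0.5 cut-off come from the text.

```python
import math
import random

# The 14 ATC main groups, in the order listed in the Introduction.
ATC_CLASSES = [
    "alimentary tract and metabolism", "blood and blood forming organs",
    "cardiovascular system", "dermatologicals",
    "genitourinary system and sex hormones",
    "systemic hormonal preparations", "anti-infectives for systemic use",
    "antineoplastic and immunomodulating agents", "musculoskeletal system",
    "nervous system", "antiparasitic products, insecticides and repellents",
    "respiratory system", "sensory organs", "various",
]

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def relu(x):
    return max(0.0, x)

def init_layer(n_in, n_out, rng):
    # Hypothetical initialisation: small random weights, zero biases.
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def dense(inputs, weights, biases, activation):
    # One fully connected layer: activation(W x + b).
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def predict(features, layers, threshold=0.5):
    # Hidden layers use ReLU (assumed); the last layer uses a sigmoid so that
    # each output can be compared with the threshold independently.
    h = features
    for i, (weights, biases) in enumerate(layers):
        activation = sigmoid if i == len(layers) - 1 else relu
        h = dense(h, weights, biases, activation)
    labels = [p > threshold for p in h]  # True if output > 0.5, else False
    return labels, h

rng = random.Random(0)
# Three fully connected layers: 14 -> 32 -> 16 -> 14 (hidden widths assumed).
layers = [init_layer(14, 32, rng), init_layer(32, 16, rng), init_layer(16, 14, rng)]
features = [rng.random() for _ in range(14)]  # made-up input vector
labels, probs = predict(features, layers)
predicted = [name for name, on in zip(ATC_CLASSES, labels) if on]
print(predicted)
```

Because every output unit is thresholded on its own, a compound can be assigned to zero, one, or several classes, which is what makes the scheme suitable for a multi-label system. The weights here are untrained, so the printed classes carry no meaning.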

Results and Discussion
According to the 5-step rules [3], one of the important procedures in developing a new predictor is how to properly evaluate its anticipated accuracy. To deal with that, two issues need to be considered. 1) What metrics should be used to quantitatively reflect the predictor's quality? 2) What test method should be applied to score the metrics?

A Set of Five Metrics for Multi-Label Systems
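Unlike the metrics used to measure the prediction quality of single-label systems, the metrics for multi-label systems are considerably more complicated. To make them intuitive and easy to understand for most experimental scientists, here we use Chou's five metrics [7], also known as the "global metrics", which have recently been widely used for studying various multi-label systems (see, e.g., [8] [9]). For the current study, the set of global metrics can be formulated as:

$$
\begin{cases}
\text{Aiming}\uparrow = \dfrac{1}{N^{q}} \displaystyle\sum_{k=1}^{N^{q}} \left( \dfrac{\| \mathbb{L}_k \cap \mathbb{L}_k^{*} \|}{\| \mathbb{L}_k^{*} \|} \right) \\[2ex]
\text{Coverage}\uparrow = \dfrac{1}{N^{q}} \displaystyle\sum_{k=1}^{N^{q}} \left( \dfrac{\| \mathbb{L}_k \cap \mathbb{L}_k^{*} \|}{\| \mathbb{L}_k \|} \right) \\[2ex]
\text{Accuracy}\uparrow = \dfrac{1}{N^{q}} \displaystyle\sum_{k=1}^{N^{q}} \left( \dfrac{\| \mathbb{L}_k \cap \mathbb{L}_k^{*} \|}{\| \mathbb{L}_k \cup \mathbb{L}_k^{*} \|} \right) \\[2ex]
\text{Absolute true}\uparrow = \dfrac{1}{N^{q}} \displaystyle\sum_{k=1}^{N^{q}} \Delta\left( \mathbb{L}_k, \mathbb{L}_k^{*} \right) \\[2ex]
\text{Absolute false}\downarrow = \dfrac{1}{N^{q}} \displaystyle\sum_{k=1}^{N^{q}} \left( \dfrac{\| \mathbb{L}_k \cup \mathbb{L}_k^{*} \| - \| \mathbb{L}_k \cap \mathbb{L}_k^{*} \|}{M} \right)
\end{cases}
\tag{2}
$$

where $N^{q}$ is the total number of query or tested compounds, $M$ is the total number of different labels for the investigated system (for the current study it is $M = 14$, the number of ATC main groups), $\| \cdot \|$ means the operator acting on the set therein to count the number of its elements, $\cup$ means the symbol for the "union" in set theory, $\cap$ denotes the symbol for the "intersection", $\mathbb{L}_k$ denotes the subset that contains all the labels observed by experiments for the k-th tested sample, $\mathbb{L}_k^{*}$ represents the subset that contains all the labels predicted for the k-th sample, and

$$
\Delta\left( \mathbb{L}_k, \mathbb{L}_k^{*} \right) =
\begin{cases}
1, & \text{if all the labels in } \mathbb{L}_k^{*} \text{ are identical to those in } \mathbb{L}_k \\
0, & \text{otherwise}
\end{cases}
\tag{3}
$$

In Equation (2), the first four metrics with an upper arrow ↑ are called positive metrics, meaning that the larger the rate is, the better the prediction quality will be; the fifth metric with a down arrow ↓ is called a negative metric, implying just the opposite meaning.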
From Equation (2) we can see the following: 1) the "Aiming" defined by the first sub-equation checks the rate or percentage of the correctly predicted labels over the labels actually predicted; 2) the "Coverage" defined by the second sub-equation checks the rate of the correctly predicted labels over the actual labels in the system concerned; 3) the "Accuracy" in the third sub-equation checks the average ratio of the correctly predicted labels over the total labels, including the correctly and incorrectly predicted labels as well as the real labels that were missed in the prediction; 4) the "Absolute true" in the fourth sub-equation checks the ratio of the perfectly or completely correct prediction events over the total prediction events; 5) the "Absolute false" in the fifth sub-equation checks the ratio of the completely wrong predictions over the total prediction events.
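The five metrics described above can be sketched in a few lines of Python. This is an illustrative sketch, not code from the study. The function name `global_metrics` and the three-sample toy data are made up, and labels are assumed to be integers 1 to 14 for the 14 ATC main groups. Two further assumptions are that every sample has at least one experimentally observed label, and that an empty predicted set contributes zero to "Aiming".

```python
def global_metrics(true_sets, pred_sets, num_labels):
    """Average the five multi-label metrics over all tested samples.

    true_sets[k] holds the labels observed for sample k, pred_sets[k] the
    labels predicted for it; num_labels is the total number of labels (M).
    Assumes every set in true_sets is non-empty.
    """
    n = len(true_sets)
    totals = {"aiming": 0.0, "coverage": 0.0, "accuracy": 0.0,
              "absolute_true": 0.0, "absolute_false": 0.0}
    for true, pred in zip(true_sets, pred_sets):
        inter = len(true & pred)  # correctly predicted labels
        union = len(true | pred)  # correct, wrong, and missed labels
        # Assumption: an empty prediction contributes 0 to Aiming.
        totals["aiming"] += inter / len(pred) if pred else 0.0
        totals["coverage"] += inter / len(true)
        totals["accuracy"] += inter / union
        totals["absolute_true"] += 1.0 if true == pred else 0.0
        totals["absolute_false"] += (union - inter) / num_labels
    return {name: value / n for name, value in totals.items()}

# Made-up toy data: three compounds, labels numbered 1..14.
observed = [{1, 7}, {3}, {10, 12}]
predicted = [{7, 12}, {3}, {10, 12}]
scores = global_metrics(observed, predicted, num_labels=14)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

In this toy case, only the first compound is partly wrong: one of its two predicted labels is correct, and one of its two real labels is missed. It therefore contributes 1/2 to Aiming and Coverage, 1/3 to Accuracy, 0 to Absolute true, and 2/14 to Absolute false. This shows why the Absolute true rate is the strictest of the five: a single wrong or missing label turns the whole prediction event into a miss.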

Comparison with the State-of-the-Art Predictor
Listed in Table 1 are the results achieved by iATC_Deep-mISF on the same benchmark dataset as used in [6]. For facilitating comparison, listed there are also the corresponding results obtained by the iATC-mISF predictor [6], the most powerful existing method for predicting the classes of anatomical therapeutic chemicals. As shown in Table 1, the newly proposed predictor iATC_Deep-mISF is remarkably superior to the existing state-of-the-art predictor iATC-mISF in all five metrics. In particular, it can be seen from the table that the absolute true rate achieved by the new predictor is over 67%, which is about 7% higher than that of iATC-mISF [6]. This is notable because it is extremely difficult to enhance the absolute true rate of a prediction method for a multi-label system, as clearly elucidated in [6]. In fact, many investigators even chose not to report the absolute true rate when dealing with multi-label systems (see, e.g., [10] [11]). Meanwhile, as a byproduct, the present paper has also stimulated some very interesting or thought-provoking papers (see, e.g., [12]-[17]).

Notes to Table 1: a) See Equation (2) for the definition of the metrics. b) See [6], where the reported metric rates were obtained by the jackknife test on the benchmark dataset of Supporting Information S1, which contains experiment-confirmed compounds only. c) The proposed predictor; to assure a fair comparison, the test was performed on exactly the same experimental data as reported in [6] for iATC-mISF.

Web Server and User Guide
As pointed out in [18], user-friendly and publicly accessible web-servers represent the future direction for developing practically more useful predictors. Indeed, user-friendly web-servers significantly enhance the impact of theoretical work because they can attract a broad range of experimental scientists [19]. In view of this, the web-server of the current iATC_Deep-mISF predictor has also been established at http://www.jci-bioinfo.cn/iATC_Deep-mISF/, by which users can easily get their desired data without the need to go through the mathematical details.

Conclusion
It is anticipated that the iATC_Deep-mISF predictor holds very high potential to become a useful high-throughput tool for identifying the classes of anatomical therapeutic chemicals. Most importantly, the predictor is expected to become a very useful tool in the fight against the coronavirus.