Exploring QSARs for Inhibitory Activity of Cyclic Urea and Nonpeptide-Cyclic Cyanoguanidine Derivatives HIV-1 Protease Inhibitors by Artificial Neural Network

Quantitative structure-activity relationship study using artificial neural network (ANN) methodology were conducted to predict the inhibition constants of 127 symmetrical and unsymmetrical cyclic urea and cyclic cyanoguanidine derivatives containing different substituent groups such as: benzyl, isopropyl, 4-hydroxybenzyl, ketone, oxime, pyrazole, imidazole, triazole and having anti-HIV-1 protease activities. The results obtained by artificial neural network give advanced regression models with good prediction ability. The two optimal artificial neural network models obtained have coefficients of determination of 0.746 and 0.756. The lowest prediction’s root mean square error obtained is 0.607. Artificial neural networks provide improved models for heterogeneous data sets without splitting them into families. Both the external and cross-validation methods are used to validate the performances of the resulting models. Randomization test is employed to check the suitability of the models.


Introduction
HIV-1 protease (HIV-1 PR) is an enzyme that belongs to the family of aspartic acid protease.The Human immunodeficiency virus (HIV), the causative agent of acquired immunodeficiency syndrome (AIDS), infects vital organs of human immune system such as CD4 + T cells, macrophages and dendritic cells.Therefore, AIDS consider as one of the most destructive diseases and it infects millions of people worldwide.It is characterized by reducetion of the effectiveness of the immune system leaving the individual susceptible opportunistic infections and tumors.AIDS is transmitted due to direct contact of blood or body fluids with those of a body containing AIDS.
Not surprising that protease enzyme represents the most attractive target site for development of therapeutic agents for treatment of AIDS, the most agents target this site are cyclic urea and non-peptide cyclic cyanoquanidine derivatives.These agents contains many functional groups that interact with the wild type HIV-1 PR and its mutants, this interaction results in a complex of HIV-1 PR with the peptidomimetic inhibitors that produce an inactive HIV-1 PR and so inactive HIV [1][2][3][4][5].
Number of potent inhibitors have been developed and approved as drugs for the treatment of HIV infection; there has been a continuous interest for the search of new drugs.Saquinavir was the first protease inhibitor approved by FDA.It has been in clinical use since 1995 [6].A review related to the current development on HIV-1 Protease Inhibitors focuses in the first part on the general features of the HIV-1 PR as well as its structure and functions.While in the second part, the review was targeted to characteristic and activity of drug resistant of the nine FDA approval inhibitors [7].
Ligands having high potency against HIV may be properly developed using quantitative structure-activity relationship (QSAR) procedures.
The base of QSAR is the correlation between the experimental values of the activity and theoretical molecular descriptors reflecting the molecular structure of the compounds.
Quantitative structure activity relationship (QSAR) is the quantitative correlation of structural properties of a compound with its chemical, physical, pharmaceutical, or biological effect.Based on this assumption, many trials were made to correlate various physicochemical properties of a set of molecules with their experimentally known biological activity, and so QSAR goals are: 1) Prediction of the activity of untested molecules, depend-ing on models developed using a series of molecules and 2) Constructing ideas about mechanism of action of a group of compounds leading to a design of new compounds of better activity and less toxicity.QSAR model development process is typically divided into three steps: data preparation, data analysis and model validation.
Data preparation starts by selection of the data set to be used; this may simply be the extraction of data from a database or may need additional experimental studies.There are two steps to complete data preparation: geometry optimization and descriptors calculation.Geometry optimization or minimization is finding the coordinates that represents the potential energy minimum for the molecular structure in its 3D form.Theoretical molecular descriptor is a value that describes the molecular structure numerically.These descriptors can be simple such as molecular weight or complex such as geometrical descriptors.
In data analysis, the first step is to decide which techniques for statistical analysis and correlation to be used.If our correlation models to be built are linear then we use multilinear regression (MLR) or non linear then we use artificial neural network (ANN).
Model validation is the final part of the model development process, the predictive power of the model is tested on an independent set of compounds, generally predictive power is the most important characteristics of the model and model predictivity is the ability of the model to predict accurately the target activity of a compound that was not used for model development.
In model validation step, most of validation processes implement the leave one out (LOO) and leave many out (LMO) cross-validation procedures.The most common outcome parameters resulted from cross-validation procedures are cross-validated determination coefficient q 2 (R 2 cv ) and root mean squares error (RMSE).High R 2 cv and low RMSE values is a result of good and more predictive model and that lead to better description of the observed data.
Finally and the most important advantage of QSAR is that we can use QSAR resultant models outside the range of the data set; the model can be used to design new drugs depending on the most effective descriptors.
Multilinear regression (MLR) is multivariate statistical technique to examine the linear relationship between the single dependent variable (activity) and two or more independent variables (molecular descriptors).Collinearity, which often exists between independent variables, generates a severe problem in certain types of mathematical handling such as matrix inversion [8].As it was recently reviewed by Schneider and Wrede [9], the flexibility of ANN for finding out relationships that are more complex allows this method to be widely applied in QSAR studies.Both linear and nonlinear mapping functions can be modeled by configuring the network properly.To obtain po-werful and accurate ANN models, one should train a subset of descriptors instead of all generated descriptors [10][11][12][13][14][15].
In a recent study, Coutinho et al. [16] performed molecular docking and 3D-QSAR studies of HIV-1 protease inhibitors on a series of 54 cyclic urea analogs.Another study was performed by Deeb et al., [17] related to QSAR for inhibitory activity of 46 non-peptide HIV-1 protease inhibitors by GA-PLS and GA-SVM.
This study aims to predict the anti-HIV-1 protease activity of the heterogeneous data set in reference [18] as one group without splitting them into categorizes.This is achieved by applying ANN to develop new statistically validated QSAR models utilizing different types of descriptors.The strength and the predictive performance of the proposed models were verified using cross validation, chance correlation and external test set.Therefore, the motivation of this work is to provide QSAR models that will be used to predict anti-HIV-1 protease activity of unknown compounds and also these models may be used to design new drugs.

Chemical Data and Descriptors
A data set of 127 symmetrical and unsymmetrical cyclic urea and cyclic cyanoguanidine derivatives and their activity (log 1/K i ) obtained from reference [18] was used in this study.Compound's name and activities are included in Table S1 in the supporting information.
The structures of the compounds are drawn by hyperchem software.The resultant structures are 2D then we convert them to 3D.HyperChem software was used to optimize the different compound structures using AM1 semi-empirical level.The optimization was preceded by the Polak-Rebiere algorithm.To be sure that we reached global minima, geometry optimization was run multiple times with different starting points for each molecule.
In this study, a pool of 1481 descriptors classified into 18 different groups was calculated using Dragon software.Two groups of descriptors (properties and empirical descriptors) were constant or nearly constant for all the 127 compounds.Therefore, these descriptors were discarded from further analysis.The remaining 16 groups of descriptors are: molecular walk counts, Galves topological charge indices, Randic molecular profiles, aromaticity indices, functional groups, atom-centered fragments, constitutional, charge, RDF, WHIM, topological, BUCT, geometrical, 3D-MoRSE, GETAWAY and 2D descriptors.Furthermore, chemical descriptors such as HO-MO, LUMO and polarizability were calculated using HyperChem software.Depending on the HOMO and LUMO values, electrophylicity, electronegativity, hardness, and softness descriptors were calculated.Other descriptors such as surface area approximate, surface area grid.Volume, mass, polarizability, hydration energy, octanol-water partition coefficient (log P), and refractivity were calculated (group 17).Discarding highly inter-correlated (r > 0.95) descriptors reduced the total number of descriptors to 223 (see Table S2 in the supporting information).Following the procedure described in the next section, this number of descriptors was declined to 11 descriptors in the "final" MLR regression model (model 11 in Table 1).

Multiple Linear Regression (MLR) Analysis
Multiple linear regression analysis with stepwise selection and elimination of variables was employed to model the anti-HIV-1 protease activity (log 1/K i ) relationships with each group of descriptors separately.Log 1/K i is the dependent variable and the set of descriptors as independent variables.Then, the "optimal" descriptors for each group were selected and gathered in one group to perform new MLR analysis.

Principal Components Analysis (PCA)
Collinear descriptors add redundancy to the input data matrix and consequently the performances of the models obtained by using these descriptors would be degraded.PCA and more specifically factor analysis, groups together variables that are collinear to form a composite indicator capable of capturing as much of common information of those indicators as possible.Each factor reveals the set of variables with the highest relationship.The idea under this approach is to explain the highest possible variation in the indicators set using the smallest possible number of factors.Consequently, the index no longer depends upon the dimensionality of the data set but it is rather based on the "statistical" dimensions of the data.Application of PCA on a descriptor data matrix results in a loading matrix containing factors or PCs, which are orthogonal and therefore have no correlation with each other.
The PC's were calculated by singular value decomposition (SVD) method in MATLAB environment (Math-Work Inc.Version 7.0.1 (R14)).Due to the quality of data, a previous treatment of the data is essential before applying the multivariate analysis methods.Scaling and centering is one of the pre-processing methods needed before performing the regression methods joint with feature extraction.Projection methods results depend on the normalization of the data.Descriptors with small absolute values have a small contribution to overall variances leading to biased PC's caused by the presence of other descriptors with higher values.In order to have the focus on the important variables in the model, equal weights are assigned to each descriptor, with appropriate scaling.Furthermore, descriptors were standardized to unit variance and zero mean (autoscaling) to give all variables the same importance.Then, the data matrix containing the entire set of descriptors and activity were simultaneously subjected to PCA.

Principal Component-Artificial Neural Network (PC-ANN) Analysis
ANNs are computer-based models in which a number of nodes, also called neurons are interconnected by links forming netlike structure "layers".A variable value is assigned to every neuron.There are three kinds of neurons: 1) the input neurons which receive their values from independent variables 0.760 and constitute the input layer, 2) the hidden neurons which collect values from other neurons, giving a result that is passed to a successor neuron, 3) the output neurons which take values from other units and correspond to different dependent variables, forming the output layer.In this sense, network architecture is commonly represented as I-H-O, where I, H, and O are the number of neurons in the input, hidden, and output layers, respectively [5].
The weights are links between units that condition the values assigned to the neurons.The weights are adjusted through a training process in order to minimize network error.For this, a non-linear transfer function relates the input parameters with the outputs.Commonly neural networks are adjusted, or trained, so that a particular input leads to a specific target output.
In PC-ANN analysis, as a preliminary treatment, the input data (i.e., molecular descriptors) were normalized to have zero mean and unity variance, and then were subjected to PCA before being introduced into the neural network.It should be illustrated that for each MLR resulted model, separate ANN models were developed so that the input's descriptors were the subsets selected by the stepwise MLR methods.In the case of each MLR model, a feed-forward neural network with back-propagation of error algorithm was constructed to model the activity-structure relationships between the descriptors on one hand and inhibitory activity on the other hand.The model development in ANN and the network architecture is fully described by us [13] and others [14].The data set was divided into training and external test sets.The test set is used to test the trend of the prediction precision of the model trained at some point of the training evolution.The extracted PC's for each MLR model were classified homogenously, based on the factors space of the descriptors, into training set (80%) and external test set (20%) according to the PCA and the first two PC's were plotted against each other (see Figure 1).Afterward, the training set was used to optimize the network performance.The regression between the network output and the observed activity was calculated for the two sets individually.The training function "Tanh" was used to train the network.To find models with lower errors, the ANN algorithm was run many times, with different geometry and initial weights each time.

MLR Analysis
In continuation to recent QSAR studies [19][20][21][22] done using similar methods including nonpeptide HIV-1PR inhibitors [23], we developed an ANN-QSAR model that describes the anti-HIV activity of a series of compounds using large number of different descriptors.MLR were performed on each one of the 17 groups of descriptors individually (individual approach described in Reference [24] by Deeb) where log 1/K i is the dependent variable.
Stepwise method is used to develop multilinear equation by correlating dependent variable (activity) and the best independent variables.The results of the 17 MLR analyses are summarized in Table S2 in the supporting information.
Next, a new or "final" MLR analysis was performed by correlating the dependent variable (activity) and the optimal descriptors selected from the individual 17 MLR models.Table 1 shows the regression models suggested from the "final" MLR analysis.The number of descriptors in these models is varied between 1 and 11.The highest coefficient of determination (R 2 ) obtained, is 0.760 for a regression model with 11 descriptors (model 11).Table 2 shows a key for the different descriptors used in the final MLR model.According to the above equation, the most important descriptor in this equation is R1p which reflects the polarizability of the compounds; it is inversely proportional to the activity of the compounds.The second important descriptor is R7u which reflects the geometrical matrix of the compound.
Then, leave many out (LMO) cross validation was performed on models 6-11 since these models have coefficients of determination larger than 0.6 [25].The results of LMO cross validation are summarized in Table S3 in the supporting information.This table shows that the cross-validation coefficient of determination ( cv ) has positive values starting from model 6 to model 11.Table S3 shows also those models 8-11 have the highest R 2 and cv values as well as the lowest root mean square error (RMSE) values.Thus, models 8-11 were chosen for further analysis with ANN.

PCA
The inputs of the ANN were the subset of the descriptors used in different MLR models (Table 1).First, PCA was performed to classify the molecules into training (80%) and test (20%) sets.Plotting the first and second PC's, shows that compounds SD146 (molecule 124) and XP-521 (molecule 118) are outliers (see Figure 1).This indicates that these 2 molecules behave differently from other molecules with respect to both molecular structure (descriptors) and anti-HIV1 protease activity.Therefore, these molecules are not used in future analysis.According to the pattern of the distribution of the data in factor spaces (Figure 1), the training and test sets molecules were selected homogenously so that molecules in different zones of Figure 1 belong to the two subsets.After removing the outliers and subjecting the data of the remaining 125 molecules to the preliminary treatment mentioned previously, the classified data were used as an input for the ANN.

ANN
In this study, a three-layered feed-forward ANN model with back propagation learning algorithm [26] was employed.At first, non-linear relationship between the sub-set of descriptors selected by stepwise selection-based MLR and anti-HIV-1 protease activity was preceded by ANN models with similar structure.The number of hidden layer's nodes was set to 6 for all models, and the number of nodes in the input layer was the number of descriptors.
The correlation coefficients and cross-validation parameters of ANN analysis for ANN model numbers 8-11 are given in Table S4 in the supporting information.This table shows that the lowest prediction RMSE (RMSE P ) is obtained for model 8 while the lowest calibration (training) RMSE (RMSE C ) is obtained from model 10.Table S4 in the supporting information shows that model 8 has the highest coefficient of determination for the test set ( p R = 0.800).However, the calibration and cross validation coefficients of determination for this model are 0.693 and 0.550, respectively.
On the other hand, the highest coefficients of determination for calibration and cross validation are obtained for model 10 ( C = 0.796 and cv = 0.734).However, the value for this model is 0.772.Hence, these two models were subjected for further analysis by optimizing the number of hidden nodes.
To optimize the performance of the ANN models 8 and 10, these models were trained using different number of hidden nodes starting from 2 to 20.Choosing the best model was based on cross-validation parameters and determination of minimum prediction error [27].For the evaluation of the predictive ability of a multivariate calibration model, RMSE P is an important statistical parameter to find the best number of hidden nodes.Moreover, because large numbers of hidden nodes often draw attention to the risk of overfitting [28], considering models with low prediction error is avoided if a large number of hidden nodes are used in their network training.
The results of optimizing the number of hidden nodes for models 8 and 10 are summarized in Tables S5 and S6 in the supporting information respectively.
Figure 2(a) shows RMSE C and RMSE P values against the number of hidden nodes for model 8.This figure shows that the lowest RMSE C (0.619) is obtained when using 8 hidden nodes.This value is close to the obtained RMSE P (0.632).Using 8 hidden nodes gives the highest coefficients of determination ( C = 0.746 and cv = 0.653).Furthermore, the p R 2 value (0.743) is close to that obtained for the training set.

 
Randomization test is performed to investigate the probability of chance correlation for the optimal models (models 8 and 10 with 8 and 6 hidden nodes in the network, respectively).Chance correlation was done using the same configuration parameters and the same activetion functions of all our ANN models.The results of chance correlation for models 8 (using 8 hidden nodes) and 10 (using 6 hidden nodes) are summarized in Tables S7 and S8 in the supporting information, respectively.These tables show that the coefficients of determination obtained by chance are low in general while the RMSE values are high.This indicates that the models obtained from ANN are better than those obtained by chance.
Model 10 has higher coefficients of determination and lower RMSE P than those obtained for model 8. Furthermore, the optimal number of hidden nodes obtained for model 10 (6 hidden nodes) is smaller than that obtained for model 8 (8 hidden nodes).However, the differences between the two models are not large.
As we can see, our models were validated by calculateing different statistical parameters, using external test set and finally performing randomization test.Correlation between calculated and observed log (1/K i ) for the training set of model 8 is given by: and for the test set of this model is given by: while the correlation between calculated and observed log (1/K i ) for the training set of model 10 is given by: and for the test set of this model is given by: To check the presence of outliers in a model, for the training and test sets, the standard deviation of the observed activity data was calculated.The residue which is equal to the difference between the predicted and observed one were calculated also.Finally, if the value of the residue is larger than two times the standard deviation of the observed activity, then this point is considered as an outlier.We found that there was no outlier in our data.

Comparison with Other QSAR Studies
Speranta, et al. [18] have performed QSAR study on the same dataset of anti-HIV-1 protease compounds used in this study.They have modeled the HIV-1 protease inhibitor activity (log 1/K i ) from different families using Comparative Molecular Field Analysis (CoMFA) methodology.They found that no simple or multiple regressions gave any statistically significant model.They have obtained of 0.63 and R 2 of 0.70.
CV Khedkar, et al. [16] used comparative molecular field analysis (CoMFA) and comparative molecular similarity indices analysis (CoMSIA) to build QSAR models for 54 compounds out of 127 compounds used in this study.Higher calibration and cross validation coefficients of determination were obtained in this study.However, the results obtained by Speranta and Khedkar are for one group of compounds that have the same core structure while in this study, an ANN-QSAR model was built for one heterogeneous group of compounds that contains many families of anti-HIV-1 protease compounds without splitting them into families.The ANN approach used in this study succeeds to explain the non-linear relationships for the data of interest considering the nature of the heterogeneous data set.Although our results seems to be close to Separanta and Khedkar results, our models are more predictive because we used more compounds with different core structures in our data, also we calculated a wider range of descriptors.

Conclusions
The performance of the ANN modeling method combi-ned with the individual [24] factor selection approach is applied to predict the anti-HIV inhibitory activity of a set of 127 compounds.The optimal two models have calibration and prediction coefficients of determinations of 0.746 and 0.756.The lowest RMSE P obtained is 0.607.ANN provides improved models for heterogeneous data sets without splitting them into families and gives good regression models with good prediction ability.
Generally, the models obtained from the ANN analysis are better than those obtained by MLR analysis.Both the external and cross-validation methods are used to validate the performances of the resulting models.Employed randomization test indicates that the models obtained from ANN are better than those obtained by chance.

Supporting Information
Table S1.Molecular structures and observed activities of the 127 HIV-1 PR cyclic urea and non peptide cyanoguanidine derivative inhibitors expressed as log 1/K i (R) and (X) in all structures represent the substituent.

Figure 1 .
Figure 1.First and second principal components for the factor spaces of the descriptors and anti-HIV-1 protease activity data.

Figure 2 (
b) shows RMSE C and RMSE P against the number of hidden nodes for model 10.This figure shows that the lowest RMSE P (0.607) is obtained when using 6 hidden nodes.The RMSE C obtained when using this number of hidden nodes is 0.644.The coefficient of determination obtained for the test set ( 750) is close to that obtained for the training set ( = 0.756).The obtained for this model is 0.675.

Figure 2 .
Figure 2. RMSE of calibration and prediction against hidden nodes number for (a) model 8 and (b) model 10.

Figure 3 (
a) shows plot of the predicted activity against observed ones for the training and test sets compounds of model 8 while Figure 3(b) shows their residuals.Similarly, Figure 3(c) shows plot of the predicted activity against observed ones for the training and test sets compounds of model 10 while Figure 3(d) shows their residuals.

Figure 3 .
Figure 3. (a) Plot of the predicted anti-HIV-1 protease activities against observed ones and (b) their residuals for model 8; (c) Plot of the anti-HIV-1 protease activities against observed ones and (d) their residuals for model 10.Two different alignment schemes viz.receptor-based and atom-fit alignment, were used to build the QSAR models.The cv values for CoMFA and CoMSIA derived from receptor-based alignment were 0.68 and 0.65, respectively.2 R

Table S6 . Coefficient of determination and cross validation parameters for optimizing number of hidden nodes for model 10.
RMSE C is root mean square error of calibration, RMSE P is root mean square error of prediction.

R 2 cv R 2 P R Table S7. Coefficients of determination and cross validation parameters for chance correlation results for model 8 with 8 hidden nodes.
CV is cross validation coefficient of determination, is prediction coefficient of determination.RMSE C is root mean square error of calibration, RMSE P is root mean square error of prediction.
C is calibration (training) coefficient of determination,

Table S8 . Coefficients of determination and cross validation parameters for chance correlation results for model 10 with 6 hidden nodes.
) coefficient of determination, CV is cross validation coefficient of determination, is prediction coefficient of determination.MSE C is root mean square error of calibration, RMSE P is root mean square error of prediction.