Prediction of Anti-Inflammatory Activity of a Series of Pyrimidine Derivatives by Multiple Linear Regression and Artificial Neural Networks
1. Introduction
Inflammation is a local response of the organism to an aggression of exogenous or endogenous origin [1]. It aims to circumscribe and repair this aggression and involves a series of events characterized by a combination of redness, heat, edema and pain [2]. Like pain, inflammation changes behavior. It can lead to job loss and even to the marginalization of the patient by relatives, and therefore has a very significant social and economic cost [3]. Several classes of drugs are available to treat inflammation, such as aspirin and other Non-Steroidal Anti-Inflammatory Drugs (NSAIDs). However, many of these drugs have adverse side effects [4]. This is why researchers continue to search for new effective molecules with fewer side effects. Pyrimidine derivatives are a promising avenue, as evidenced by the many studies on series of molecules that contain the pyrimidine core and have analgesic and anti-inflammatory properties [5] [6] [7]. A fairly large number of these studies relate to tri-substituted derivatives of pyrimidine [8] [9]. The results obtained are very encouraging and various substituents have been tested. Quantitative Structure-Activity Relationship (QSAR) models of anti-inflammatory activity have been developed for other organic compounds [10] [11], but few models relate to the anti-inflammatory activity of pyrimidine derivatives. A QSAR model is an alternative and complementary solution to traditional methods for investigating a biological activity [12]. This approach is increasingly used to reduce the number of experiments, which are sometimes long, dangerous and costly [13]. The model establishes a quantitative relationship between biological activity and molecular descriptors. Most models use multiple linear regression.
But sometimes linear models are not sufficient to explain all sources of variability due to the complex nature of the relationships between molecular structure and activity [14]. Therefore, nonlinear modeling approaches are used to develop statistically significant and predictive QSAR models [15]. The aim of this work is to develop QSAR models of the anti-inflammatory activity of a series of tri-substituted pyrimidine derivatives using molecular descriptors.
2. Materials and Method
2.1. Computational Theory Level
The quantum descriptors used in this work were calculated with Gaussian 09 [16] and its graphical interface GaussView 05. The geometry optimizations and frequency calculations were carried out using Density Functional Theory (DFT) with the B3LYP functional. B3LYP is a hybrid functional that combines Becke's three-parameter exchange functional with the Lee, Yang and Parr correlation functional [17]. This functional has shown its efficiency for the calculation of many molecular properties [18]. The basis set retained is the split-valence double-zeta 6-31G(d,p). This basis set is sufficiently extensive, and the inclusion of polarization functions is important for describing dipole and multipole moments. The B3LYP/6-31G(d,p) level of theory was used to determine the quantum molecular descriptors. The molecular lipophilicity logP of the derivatives was estimated using the ChemSketch software from ACD/Labs [19].
2.2. Molecular Descriptors Used
Four theoretical descriptors were used to develop our QSAR models: the dipole moment µD, the energy of the highest occupied molecular orbital EHOMO, the isotropic polarizability α and the molecular lipophilicity logP.
The dipole moment characterizes the charge distribution of a molecular system in which the barycenter of the positive charges does not coincide with that of the negative charges. It is a vector quantity. It describes the global polarity of the molecule and its intermolecular interactions, such as Van der Waals forces, and helps predict solubility in polar solvents. The dipole moment is an important property that gives an idea of the reactivity of the molecule [20]. It also indicates the stability of a molecule in water: a strong dipole moment reflects low solubility in organic solvents and high solubility in water [21].
The highest occupied molecular orbital (HOMO) plays a fundamental role in the qualitative interpretation of chemical reactivity [22]. It is the outermost orbital containing electrons and tends to behave as an electron donor.
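Another parameter studied is the isotropic polarizability α. It measures the ease with which the electron cloud of a molecular system deforms under the action of an electric field [23]. It is defined by the following relationship [24]:
$$\alpha = \frac{1}{3}\left(\alpha_{xx} + \alpha_{yy} + \alpha_{zz}\right) \qquad (1)$$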
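As a minimal sketch of this definition, assuming Equation (1) is the usual mean of the diagonal components of the polarizability tensor, the following Python function computes α from a 3×3 tensor. The function name and the numerical tensor are hypothetical; they are not taken from the Gaussian output or from Table 2.

```python
def isotropic_polarizability(tensor):
    """Return one third of the trace of a 3x3 polarizability tensor."""
    return (tensor[0][0] + tensor[1][1] + tensor[2][2]) / 3.0


# Hypothetical polarizability tensor in Bohr^3 (illustrative values only).
example_tensor = [
    [300.0, 5.0, 0.0],
    [5.0, 240.0, 2.0],
    [0.0, 2.0, 180.0],
]
alpha_iso = isotropic_polarizability(example_tensor)
print(f"isotropic polarizability = {alpha_iso:.1f} Bohr^3")
```

Only the diagonal terms contribute, so the off-diagonal components of the tensor do not affect the isotropic value.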
Finally, the last descriptor evaluated is molecular lipophilicity, which is very important. It is intimately linked to the notion of partition of a molecule between an aqueous phase and a lipid phase [25]. We now know that this capacity for partitioning of a molecule between two phases partly conditions its biological properties such as transport, passage through membranes, bioavailability (distribution and accumulation), affinity for a receptor, protein binding, pharmacological activity, toxicity, accumulation in aquatic organisms, etc. [19].
2.3. Quantitative Structure-Activity Relationship (QSAR)
The objective of a QSAR study is to establish a mathematical relationship between molecular properties called descriptors and a given biological activity, for a series of similar compounds [26]. The equation of such a relationship, once validated, makes it possible to determine the values of the parameters that correspond to optimal activity and to predict the most promising molecular structure to synthesize and test in the laboratory [27]. It can also be used to predict the properties of molecules, whether already synthesized or not, for which the biological activities are not available. The development of a QSAR model must therefore follow a rigorous scheme in order to achieve a reliable and quantitative result: 1) selection of reliable experimental data [28], 2) calculation of as many molecular descriptors as possible, 3) selection of independent and relevant descriptors [29], 4) construction of the QSAR relationship with the selected descriptors using data analysis tools and 5) validation of the model developed [30]. Various internal validation criteria exist, such as the internal correlation coefficient $R^2$, the adjusted correlation coefficient $R^2_{adj}$, the standard deviation RMCE, the Fisher coefficient F [30], cross-validation [31] and randomization [32]. External validation criteria include the correlation coefficient and the standard deviation RMSEP for the test set, the criteria of Golbraikh and Tropsha [33] and those of Roy et al. [34]. These various criteria make it possible to establish the significance, robustness and reliability of the model developed. The XLSTAT 2014 and Excel 2013 software packages were used to develop the QSAR models and to perform the various calculations.
2.4. Multiple Linear Regression (MLR)
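Multiple linear regression is a statistical tool that models a dependent quantitative variable $Y$ as a linear combination of several independent quantitative explanatory variables $X_j$ ($j = 1, \dots, p$), according to Equation (2) [35].
$$Y = \beta_0 + \sum_{j=1}^{p} \beta_j X_j + \varepsilon \qquad (2)$$
where $\beta_j$ are the regression coefficients and $\varepsilon$ is the model error. These coefficients $\beta_j$ and the variance $\sigma^2$ are estimated by minimizing the least squares criterion. The analysis of variance, which is generally done using an ANOVA table, provides access to various model validation parameters such as $R^2$, $R^2_{adj}$, RMCE and F defined below [36]:
$$R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS} \qquad (3)$$
$$R^2_{adj} = 1 - \left(1 - R^2\right)\frac{n-1}{n-p-1} \qquad (4)$$
$$RMCE = \sqrt{\frac{RSS}{n-p-1}} \qquad (5)$$
$$F = \frac{ESS/p}{RSS/(n-p-1)} \qquad (6)$$
with:
$$TSS = \sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2 \qquad (7)$$
$$ESS = \sum_{i=1}^{n}\left(\hat{y}_i - \bar{y}\right)^2 \qquad (8)$$
$$RSS = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \qquad (9)$$
$$TSS = ESS + RSS \qquad (10)$$
TSS, ESS and RSS are the total, explained and residual sums of squares. n is the number of molecules in the training set (TS) and p the number of descriptors in the model. $y_i$ and $\hat{y}_i$ are the experimental and predicted values of the dependent variable $Y$ for molecule i; $\bar{y}$ is the mean value of the dependent variable for the training set.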
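The fit and its validation parameters can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the XLSTAT procedure used in this work: it assumes the usual ordinary least squares definitions, $R^2 = 1 - RSS/TSS$, $R^2_{adj} = 1 - (1-R^2)(n-1)/(n-p-1)$, $RMCE = \sqrt{RSS/(n-p-1)}$ and $F = (ESS/p)/(RSS/(n-p-1))$. The function names and the small data set (two descriptors, six observations) are hypothetical.

```python
import math


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_mlr(X, y):
    """Least squares coefficients [b0, b1, ..., bp] from the normal equations."""
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coeffs, X):
    """Predicted values b0 + sum_j bj * xj for each row of X."""
    return [coeffs[0] + sum(b * v for b, v in zip(coeffs[1:], r)) for r in X]


def mlr_statistics(y, y_hat, p):
    """R2, adjusted R2, RMCE and F for a model with p descriptors."""
    n = len(y)
    y_mean = sum(y) / n
    tss = sum((yi - y_mean) ** 2 for yi in y)
    ess = sum((fi - y_mean) ** 2 for fi in y_hat)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    r2 = 1.0 - rss / tss
    return {
        "R2": r2,
        "R2_adj": 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1),
        "RMCE": math.sqrt(rss / (n - p - 1)),
        "F": (ess / p) / (rss / (n - p - 1)),
    }


# Hypothetical data: 6 observations, 2 descriptors (not taken from Table 2).
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6], [6, 5]]
y = [0.5, 1.1, 1.4, 2.1, 2.4, 3.2]
coeffs = fit_mlr(X, y)
stats = mlr_statistics(y, predict(coeffs, X), p=2)
print(coeffs, stats)
```

The normal equations are solved directly here to keep the sketch free of dependencies; for real descriptor tables a numerically more robust solver (QR or SVD based) is generally preferable.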
2.5. Artificial Neural Network (ANN)
An artificial neural network (ANN) is a biologically inspired computer algorithm designed to process information in the same way as the human brain [37]. It consists of a number of processing elements (or cells) which represent artificial neurons. Each neuron has inputs, a weight (wi) associated with each input, a transfer function (f) and an output (a) [38] (see Figure 1(a)), which can then branch out to feed a variable number of other neurons [39]. The neurons are interconnected to form the artificial neural network with variable coefficients or weights and are organized into layers: input layer, hidden layers and output layer [40] (see Figure 1(b)).
Artificial neural networks have shown great efficiency in modeling nonlinear relationships [15]. The algorithm of multilayer neural networks (or Multilayer Perceptrons) with backpropagation remains the most productive model at the application level and the most widely used [41]. The MATLAB 2017a program was used to build the artificial neural networks of this work.
Figure 1. (a) Single neuron with 4 inputs and one output and (b) multilayer perceptron.
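The neuron and the layered network of Figure 1 can be sketched as a forward pass in Python. This is a hedged illustration only: the bias term, the layer sizes (four inputs, two hidden neurons, one output) and all weight values are hypothetical, and the hyperbolic tangent is assumed as transfer function.

```python
import math


def neuron(inputs, weights, bias, transfer=math.tanh):
    """Output a = f(sum_i w_i * x_i + bias) of a single artificial neuron."""
    return transfer(sum(w * x for w, x in zip(weights, inputs)) + bias)


def layer(inputs, weight_rows, biases, transfer=math.tanh):
    """Outputs of a layer: one neuron per row of weights."""
    return [neuron(inputs, w, b, transfer) for w, b in zip(weight_rows, biases)]


def mlp_forward(inputs, layers):
    """Propagate inputs through a list of (weight_rows, biases) layers."""
    activations = inputs
    for weight_rows, biases in layers:
        activations = layer(activations, weight_rows, biases)
    return activations


# Hypothetical network: 4 inputs -> 2 hidden neurons -> 1 output.
hidden = ([[0.2, -0.1, 0.4, 0.3], [-0.5, 0.1, 0.2, 0.0]], [0.1, -0.2])
output = ([[0.7, -0.3]], [0.05])
prediction = mlp_forward([0.5, -1.0, 0.25, 0.8], [hidden, output])
print(prediction)
```

Training by backpropagation then consists in adjusting the weights and biases so that the output approaches the experimental value; that step is not shown here.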
3. Results and Discussion
3.1. Analysis of Molecular Descriptors
The work of Vishal et al. [42] and Yejella et al. [43] provided twenty-eight tri-substituted pyrimidine derivatives with anti-inflammatory activity expressed as a percentage. The general structure of these molecules is as shown in Figure 2. The designation codes of the derivatives, the substituents and the percentages of inhibition of inflammation (PI), are collated in Table 1.
The results of the calculations of the various molecular descriptors, namely µD, EHOMO, α and logP, for the 28 molecules are collated in Table 2. This table also contains the anti-inflammatory activity of each derivative of the series, expressed as logAI: the percentages of inhibition of inflammation (PI) were transformed into decimal logarithms logAI according to expression (12).
The decimal logarithm of the anti-inflammatory activity logAI [44] represents the quantity to be explained in this study. It takes into account the experimental dose D (10 or 100 mg/kg) of the molecule injected into the animal, the molar mass M of this molecule and the physiological response of the animal expressed as a percentage of inhibition (PI) of inflammation [45], according to expression (12).
(12)
The analysis of the results in Table 2 reveals that these descriptors vary from one derivative to another and thus depend on the substituents attached to the pyrimidine nucleus.
Figure 2. General structure of pyrimidine derivatives.
Table 1. Designation codes, substituents Ar1, Ar2 and YHn and percentages of inhibition of inflammation (PI) of the 28 pyrimidine derivatives.
Table 2. Anti-inflammatory activities (logAI), dipole moment μD (Debye), energies of the highest occupied molecular orbital EHOMO (eV), isotropic molecular polarizability α (Bohr3) and logP lipophilicity of tri-substituted pyrimidine derivatives.
3.2. Statistical Analysis of Data
We seek to build mathematical models capable of explaining and predicting anti-inflammatory activity based on descriptors of free molecules.
For the QSAR model to be simple and understandable, the descriptors used must be meaningful and interpretable [46]. The selection of candidate descriptors is a crucial step, and the quality of the model depends on their relevance because they must provide information that can explain the response (biological activity). To this end, the descriptors were screened using their variances on the one hand and the Pearson correlation coefficients on the other. The analysis of variances makes it possible to eliminate constant or nearly constant descriptors [28] [47]. The correlation coefficients of the descriptors are calculated taking into account the biological activity expressed by logAI. Each group of descriptors that are strongly correlated with each other is reduced to the one most strongly correlated with the biological activity, because strongly intercorrelated descriptors are redundant: they carry the same information [48]. These two methods, the results of which are presented in Table 3 and Table 4, show that the four descriptors (the dipole moment µD, the energy of the highest occupied molecular orbital EHOMO, the isotropic polarizability α and the lipophilicity logP) vary well from one derivative to another and are linearly independent.
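This two-step screening can be sketched as follows. The sketch is an illustration under assumptions, not the XLSTAT workflow used here: the variance and correlation thresholds, the function names and the toy descriptor values are all hypothetical.

```python
import math


def variance(values):
    """Sample variance (n - 1 in the denominator)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def pearson(x, y):
    """Pearson correlation coefficient between two non-constant series."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def select_descriptors(descriptors, activity, min_variance=1e-6, max_inter_r=0.9):
    """Drop near-constant descriptors, then keep one per correlated group."""
    # Step 1: eliminate constant or nearly constant descriptors.
    kept = {k: v for k, v in descriptors.items() if variance(v) > min_variance}
    # Step 2: visit descriptors from most to least correlated with activity
    # and keep one only if it is not redundant with those already selected.
    ranked = sorted(kept, key=lambda k: abs(pearson(kept[k], activity)), reverse=True)
    selected = []
    for name in ranked:
        if all(abs(pearson(kept[name], kept[s])) < max_inter_r for s in selected):
            selected.append(name)
    return selected


# Hypothetical descriptors for 5 molecules (not taken from Table 2).
descriptors = {
    "d1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "d2": [2.0, 4.0, 6.0, 8.0, 10.1],  # nearly proportional to d1
    "d3": [5.0, 1.0, 4.0, 2.0, 3.0],
    "d4": [7.0, 7.0, 7.0, 7.0, 7.0],   # constant
}
activity = [1.1, 1.9, 3.2, 3.9, 5.1]
selected = select_descriptors(descriptors, activity)
print(selected)
```

In this toy case the constant descriptor is removed in the first step, and only one of the two nearly proportional descriptors survives the second.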
3.3. Prediction of Anti-Inflammatory Activity by Multiple Linear Regression (MLR)
To build and test the multiple linear regression (MLR) model, the initial set of 28 molecules was subdivided into a training set (75%) and a test set (25%) [33] using agglomerative hierarchical clustering (AHC) [49]. The Euclidean distance between the observations, in the space defined by the descriptors, was retained as the dissimilarity criterion and Ward's method as the aggregation criterion [49]. Multiple linear regression applied to the training set with the four descriptors gave Equation (13) below:
(13)
The statistical indicators of this model, including those of the randomization test, all differ markedly from the defined limit values [28] [34]. They thus show that the developed MLR model explains the anti-inflammatory activity of this series of pyrimidine derivatives in a statistically significant and satisfactory manner. This model can thus be considered robust and stable. The predicted values for each set are recorded in Table 5 as well as the residuals between experimental and predicted values.
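The subdivision into training and test sets by hierarchical clustering with Ward's criterion can be sketched as follows. This is a naive illustration under assumptions: the rule used to pick the test molecules (the first member of each cluster) is hypothetical, since the text does not state how they were drawn from the clusters, and the two-dimensional points stand in for standardized descriptor vectors.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def ward_clusters(points, n_clusters):
    """Agglomerative clustering with Ward's criterion; returns index lists."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci = centroid([points[k] for k in clusters[i]])
                cj = centroid([points[k] for k in clusters[j]])
                ni, nj = len(clusters[i]), len(clusters[j])
                # Increase in within-cluster sum of squares if i and j merge.
                cost = ni * nj / (ni + nj) * squared_distance(ci, cj)
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def split_train_test(points, n_test):
    """Hypothetical rule: one test item (the first member) per cluster."""
    clusters = ward_clusters(points, n_test)
    test = sorted(c[0] for c in clusters)
    train = [i for i in range(len(points)) if i not in test]
    return train, test


# Hypothetical standardized descriptor vectors for 8 molecules.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
clusters = ward_clusters(points, 2)
train_idx, test_idx = split_train_test(points, 2)
print(clusters, train_idx, test_idx)
```

For the 28 derivatives, a 25% test set would correspond to seven clusters under this rule. Drawing the test molecules from distinct clusters keeps the test set spread over the descriptor space covered by the training set.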
Table 3. Variances of anti-inflammatory activity (logAI) and various descriptors.
Table 4. Correlation matrix of the 4 calculated descriptors and the logAI anti-inflammatory activity.
Table 5. Experimental logAI values, predicted values and residuals e(MLR) of the multiple linear regression model.
The analysis of these results shows that, for the two sets, the absolute values of the residuals range from 0.01 to 0.69, with a root mean square error (RMCE) of 0.27 over the two sets. This result confirms that the predicted values are close to the experimental values overall. The MLR model therefore predicts the anti-inflammatory activity of this series of derivatives well.
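Residual statistics of this kind can be computed as in the short sketch below. It is an illustration only: the experimental and predicted values are hypothetical (not those of Table 5), and the root mean square is taken over all residuals without a degrees-of-freedom correction, which is an assumption about how the quoted figure was obtained. The mean absolute error is reported alongside it because the two quantities are distinct.

```python
import math


def residual_summary(experimental, predicted):
    """Range of absolute residuals, mean absolute error and RMS error."""
    residuals = [e - p for e, p in zip(experimental, predicted)]
    absolute = [abs(r) for r in residuals]
    return {
        "min_abs": min(absolute),
        "max_abs": max(absolute),
        "mean_abs": sum(absolute) / len(absolute),
        "rms": math.sqrt(sum(r * r for r in residuals) / len(residuals)),
    }


# Hypothetical logAI values (illustrative only).
experimental = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.8, 3.0, 4.3]
summary = residual_summary(experimental, predicted)
print(summary)
```

The root mean square error is never smaller than the mean absolute error, so the two should not be used interchangeably when comparing models.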
3.4. Prediction of Anti-Inflammatory Activity by Artificial Neural Networks (ANN)
A feed-forward backpropagation neural network (a multilayer perceptron) [40] was used, with four inputs corresponding to the four descriptors (μD, EHOMO, α and logP), ten hidden layers and one output. The 28 derivatives were randomly divided into three subsets: training (70%, 20 molecules), validation (15%, 4 molecules) and test (15%, 4 molecules). The training set is used to adjust the connection weights and biases of the model. The validation set monitors the performance of the model throughout the training process and stops training to avoid over-training [38]. The transfer function is the hyperbolic tangent for the hidden and output layers, and the network was trained with the Levenberg-Marquardt algorithm. The performance of the developed model was evaluated by the residual e(ANN) between the predicted and experimental values for each value of logAI. The predicted and experimental values as well as the residuals are presented in Table 6.
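The random 20/4/4 division and the role of the validation set can be sketched in Python as follows. This is a hedged illustration, not the MATLAB procedure used in this work: the seed, the patience value and the sequence of validation errors are hypothetical, and the stopping rule shown (stop once the validation error has failed to improve for a fixed number of consecutive epochs) is one common form of early stopping.

```python
import random


def split_indices(n_samples, n_train, n_val, seed=0):
    """Shuffle sample indices and cut them into train, validation and test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


def early_stopping_epoch(val_errors, patience=3):
    """Return the epoch with the lowest validation error seen before
    `patience` consecutive epochs pass without improvement."""
    best_epoch, best_error, waited = 0, float("inf"), 0
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            best_epoch, best_error, waited = epoch, error, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch


# 28 derivatives divided into 20 / 4 / 4 molecules.
train_idx, val_idx, test_idx = split_indices(28, 20, 4)

# Hypothetical validation errors recorded after each training epoch.
val_errors = [0.90, 0.50, 0.40, 0.45, 0.50, 0.60, 0.10]
stop_at = early_stopping_epoch(val_errors, patience=3)
print(len(train_idx), len(val_idx), len(test_idx), stop_at)
```

The weights retained are those of the best validation epoch, so later improvements on the training set alone do not count; the test set plays no part in this decision.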
These results indicate that, for all three sets, the absolute values of the residuals range from 0.00 to 0.29, with a root mean square error (RMCE) of 0.11. This confirms that the predicted values are very close to the experimental values. The ANN model therefore predicts the anti-inflammatory activity of this series of derivatives very well.
3.5. Comparison of the Two Established Models
Table 7 brings together the values predicted by each of the two models as well as the residuals and the experimental values for the 28 derivatives studied.
The statistical parameters of the two models are collected in Table 8. These parameters show that both models can predict the anti-inflammatory activity of this series of pyrimidine derivatives in a statistically significant and satisfactory way. However, the results obtained with the artificial neural network model are better than those of multiple linear regression, which indicates that the ANN model has a better predictive capacity. Figure 3 shows the fit between predicted and experimental values for the two models: the values predicted by the artificial neural network match the experimental values more closely.
Table 6. Experimental values, predicted values of logAI and e(ANN) residuals of the artificial neural network model.
Table 7. Experimental logAI values, predicted values PredMLR(logAI), PredANN(logAI), residuals e(MLR) and e(ANN) of the MLR and ANN models for the 28 derivatives studied.
Table 8. Statistical parameters of each of the two models, including R2 and MCE.
Figure 3. Similarity between predicted and experimental values of logAI. (a) Similarity between Exp(logAI) and PredMLR(logAI); (b) Similarity between Exp(logAI) and PredANN(logAI).
4. Conclusion
This work allowed us to build two models for predicting the anti-inflammatory activity of a series of tri-substituted pyrimidine derivatives using molecular descriptors, namely the dipole moment µD, the energy of the highest occupied molecular orbital EHOMO, the isotropic polarizability α and the molecular lipophilicity logP. Multiple linear regression (MLR) and artificial neural network (ANN) methods were used to develop these models. The statistical parameters of the multiple linear regression model include RMCE = 0.2831, while those of the artificial neural network model include RMCE = 0.1131. The results obtained with the ANN are better than those obtained with MLR. However, the statistical parameters show that both models have a very good predictive performance for anti-inflammatory activity. In short, the two models explain the anti-inflammatory activity of this series of pyrimidine derivatives in a statistically significant and satisfactory manner. They can be considered robust and stable. In perspective, these two models can be used to predict the anti-inflammatory activity of new pyrimidine derivatives for which no experiments have yet been carried out.