DFT Study, Linear and Nonlinear Multiple Regression in the Prediction of HDAC7 Inhibitory Activities on a Series of Hydroxamic Acids ()
1. Introduction
Histone deacetylases (HDAC) have become essential transcriptional corepressors in a variety of physiological systems. To date, 18 human HDACs have been identified and grouped into four classes. HDAC Class I (HDAC1, 2, 3 and 8), HDAC Class II (HDAC4, 5, 6, 7, 9 and 10), HDAC Class III, also called sirtuins (SIRT1, 2, 3, 4, 5, 6 and 7) and HDAC Class IV (HDAC11). Class II HDACs are subdivided into classes IIa (HDAC4, 5, 7, 9) and class IIb (HDAC6 and 10) [1] . HDACs have also been identified as one of the major players in tumorigenesis and the inhibition of HDAC function has been shown to be an effective strategy in the treatment of cancer [2] . They are ubiquitously expressed and are widely implicated in many chronic diseases such as cancer and inflammation [3] . Class II is closely related to HDAC1 class I. HDAC class IIa has large N-terminal extensions with conserved binding sites for transcription factor 2 (MEF2) and chaperone 14-3-3 protein that make the signal sensitive to HDACs [4] . Regulated phosphorylation of HDAC class IIa provides a mechanism for binding extracellular signals to transcription and plays a key role in many tissues during development and disease. In contrast to other HDACs, class IIa HDACs show relatively restricted expression patterns. HDAC5 and HDAC9 are highly enriched in muscles, heart and brain [5] . HDAC4 is highly expressed in the brain and growth plate of the skeleton [6] , and HDAC7 is enriched in endothelial cells and thymocytes [7] . The class IIa HDACs repress transcription has not been fully elucidated. Highly purified recombinant class IIa HDACs possess only minimal catalytic activity, and the activity of class IIa HDACs purified from mammalian cells has been shown to be due to contaminating class I HDACs [8] . Studies in the cell culture by silencing or over-expressing HDAC7 have shown that HDAC7 is involved in the regulation of cell proliferation, apoptosis, differentiation and migration. Some reports have investigated the role of HDAC7 in cancers, and a high level of HDAC7 protein has been observed in nine different human pancreatic cancer cases [9] . HDAC7 is also overexpressed in breast cancer stem cells and is required to maintain cells [10] . HDAC7 belongs to Class IIa HDACs and has conflicting functions in different types of cancer [11] . Some reports indicate an ambivalent role of HDAC7 in cancers: for example, HDAC7 has oncogenic functions in children with acute lymphoblastic leukemia and pancreatic cancer [12] but acts as a tumor suppressor in acute lymphoblastic leukemia B and Burkitt’s lymphoma [13] . In addition, the high expression of HDAC7 was related to a prognosis in lung cancer [14] . As part of the prediction of biological activities of HDAC7, certain descriptors from quantum chemistry were used during our work. Quantitative Structure Activity Relationship (QSAR) is one of the best methods used to design new therapeutic agents [8] [9] [10] . It makes it possible to correlate quantitatively through a mathematical model the structure or the properties of the compounds with their biological activities. It is increasingly used to reduce the excessive number of experiments, sometimes long and expensive and the cost of drug production by pharmaceutical companies [11] [12] . This QSAR approach has its origins in the studies carried out by Hansch [13] and by Free and Wilson [14] . In this work, the goal is to conduct a descriptive and predictive study of the anticancer activity of a series of eighteen (18) compounds of HDAC7 inhibitors. By implementing quantum chemistry methods, this work aims at modeling the anticancer activities observed, the molecular descriptors being calculated solely from the chemical structure of the compounds, and subsequently predicting the inhibitory concentration of analogous molecules. In the specific case of the QSAR study, twelve (12) histone deacetylase inhibitors (HDACi) were used for the test set and six (6) others from the same series were used for the test. external validation (Table 1). Yao et al. [15] in their work on the synthesis of hydroxamic acids have determined the experimental inhibitory activities of these inhibitors used in our work. of 38 molecules synthesized, only 19 provided satisfactory activity on HDAC7 classified in the following table.
2. Materials and Methods
2.1. Theoretical Calculations
In order to establish a descriptive and predictive theory of the anticancer activity HDAC7 of hydroxamic acids, the methods of Theoretical Chemistry are used at the B3LYP/6-311G (d, p) level. The gradient-corrected functionalities and the hybrid functionals such as B3LYP give better energies and agree with high-level ab initio methods [16] [17] . In this work, to evaluate the quantitative structure-activity relationship between the anticancer activity of HDAC7 and the descriptors, the Gaussian 09 [18] quantum chemistry software was used. The base 6-311G (d, p) being sufficiently extended, considering the polarization functions are important for considering the free doublets of the hetero atoms in order to obtain satisfactory results. The modeling was done using the multilinear regression method implemented in Excel spreadsheets [19] and XLSTAT version 2014 [20] .
2.2. Chemical Descriptors
Some physico-chemical descriptors have been used for the development of QSAR models. In particular, the electronic affinity (AE) and the vibration descriptors that are the vibration frequency ν (O-H) and the vibration frequency ν (N-H). These two vibration descriptors are shown in Figure 1.
It should be noted that the descriptors related to the molecular frontier orbitals have been calculated as part of the Koopmans approximation [21] . The LUMO energy characterizes the sensitivity of the molecule to a nucleophilic attack. The electronic affinity AE is the descriptor that translates the ability of a molecule to capture an electron. This descriptor is obtained from the relation (1) below:
(1)
Figure 1. Vibration descriptors of the hydroxamic acids used ν(O-H) and ν(N-H).
Table 1. Molecular structures of test sets and validation of hydroxamic acids used for QSAR models.
Several studies have shown that geometric descriptors provide better models as well as global responsivity descriptors [22] [23] [24] .
2.3. Statistical Analyzes
2.3.1. Data Analysis
The structures of 18 hydroxamic acid compounds were studied by statistical methods based on Principal Component Analysis (PCA) [25] [26] [27] using the software XLSTAT version 2014 [20] to determine the descriptors that are related. direct with anticancer activity. PCA is a useful statistical technique for summarizing all the information encoded in the structures of the compounds. It is also very useful for understanding the distribution of compounds and for selecting descriptors that are directly related to biological activity [28] . It is an essentially descriptive statistical method that aims to present, in graphical form, the maximum information of physicochemical descriptors. The Ascendant Hierarchical Classification (AHC) aims to partition a set of molecules into homogeneous classes [29] . It organizes molecules, defined by several variables and modalities, by grouping them hierarchically on a dendrogram. It aggregates those that are most like each other by using dissimilarity or distance measurements between molecules to form classes. It is made from the data of molecules and descriptors. AHC has established a typology of molecules based on electronic affinity (AE) and vibration frequencies ν(O-H) and ν(N-H).
2.3.2. Multiple Linear and Nonlinear Regressions (MLR and NMR)
The Multiple Linear Regression (MLR) statistical technique is used to study the relationship between a dependent variable (Biological activity) and several independent variables (descriptors). This statistical method minimizes the differences between the actual and predicted values.
It also allowed to select the descriptors used as input parameters in nonlinear multiple regression (NMR).
Nonlinear multiple regression (NMR) analysis is a technique that improves the structure-activity relationship to quantitatively evaluate biological activity. It considers several parameters. It is the most common tool for studying multidimensional data. It is based on preprogrammed XLSTAT functions as follows:
(2)
where
: are the parameters and
: are the variables.
The (MLR) and the (NRM) were generated using the XLSTAT software version 2014 [20] to predict the anticancer activity IC50 HDAC7. The equations of the different models were evaluated by the coefficient of determination (R2), the mean squared error (S), the Fischer test (F) and the cross-correlation coefficient (
) [30] [31] .
The potential of the inhibitory concentration is calculated according to the following expression:
(3)
2.4. Estimation of the Predictive Capacity of a Model
Histone deacetylase 7 has various inhibitory concentrations ranging from 0.311 to 38.9 μM. This range of concentrations makes it possible to define a quantitative relationship between the cancer activity and the theoretical descriptors of these molecules. The quality of a model is determined based on various statistical analysis criteria including the coefficient of determination R2, the standard deviation S, the correlation coefficients of the cross validation
and Fischer F. R2, S and F relate to the adjustment of calculated and experimental values. They describe the predictive ability within the limits of the model and make it possible to estimate the accuracy of the values calculated on the test set [32] [33] . As for the cross-validation coefficient
, it gives information on the predictive power of the model. This predictive power is called “internal” because it is calculated from the structures used to build this model. The correlation coefficient R2 gives an evaluation of the dispersion of the theoretical values around the experimental values. The quality of the modeling is better when the points are close to the adjustment line [34] . The adjustment of points to this line can be evaluated by the coefficient of determination.
(4)
where:
: Experimental value of anticancer activity,
: Theoretical value of anticancer activity and,
: Mean value of the experimental values of the anticancer activity.
The more the value of R2 will be close to 1, the more the theoretical and experimental values are correlated.
Moreover, the variance
is determined by the relation 4:
(5)
where k is the number of independent variables (descriptors), n is the number of molecules in the test or learning set, and
is the degree of freedom.
The standard deviation or standard deviation S is another statistical indicator used. It allows to evaluate the reliability and the precision of a model:
(6)
The Fisher F test is also used to measure the level of statistical significance of the model, that is, the quality of the choice of descriptors constituting the model.
(7)
The coefficient of determination of the cross-validation
makes it possible to evaluate the accuracy of the prediction on the test set. It is calculated using the following relation:
(8)
2.5. Criterion for Acceptance of a QSAR Model
According to Eriksson et al. [35] [36] , The performance of a mathematical model is characterized by a value of
for a satisfactory model when for the excellent model
. According to them, given a test game, a model will be efficient if the acceptance criterion
is respected.
According to Tropsha et al. [37] [38] [39] , for the external validation set, the predictive power of a model can be obtained from five criteria. These criteria are:
1)
, 2)
, 3)
,
4)
and
, 5)
and
2.6. Applicability Domain
The applicability domain principle helps modelers to specify the scope of proposed models, thereby defining the model’s limitations with respect to its structural domain and chemical space. If an external compound exceeds the defined scope of a model, it is outside the applicability domain of that model and cannot be associated with reliable prediction. There are several methods for determining the applicability domain of a QSAR model, among which we find the lever method that is used the most. If a compound has a residual and a lever that exceeds the threshold h* = 3p/n (where p is the number of descriptors plus 1 and n the number of observations), this compound is considered outside the field of applicability of the elaborate model. The field of applicability will be discussed using the Williams diagram which represents the standardized prediction residuals as a function of the values of the hi levers [40] . For each compound i in the original space of the independent variables (Xi), the value of hi is calculated by the following relation [41] :
(9)
where
With: Xi is the line vector of the descriptors of the compound i, X (n * k − 1) is the matrix of the model deduced from the values of the descriptors of the training set; the index T designates the transposed matrix of the matrix. The critical value of the lever (h*) is set [42] to:
With n, the number of test compounds used; k is the number of the descriptors of the model.
If hi < h*, the probability of agreement between the measured and predicted values of the compound “i” is as high as that of the compounds in the database. Compounds with hi > h* reinforce the model when they belong to the training set but will otherwise have dubious predicted values without being necessarily aberrant, the residues being low [43] .
3. Results and Discussion
The set of descriptor values of the thirteen (13) hydroxamic acid molecules of the test set and the six (6) other molecules of the validation set are presented in Table 2.
3.1. Principal Component Analysis (PCA) and Ascendant Hierarchical Classification (AHC)
All three descriptors for the 19 hydroxamic acid compounds are subjected to the PCA analysis. The two main axes are enough to describe the information provided by the data matrix. The correlations between the three descriptors are presented in Table 3 according to a correlation matrix and in Figure 2 where these descriptors are represented in a correlation circle.
Table 2. Experimental quantum and potential descriptors of test and validation sets.
AE in electron Volt (eV), ν(O-H) in Cm−1 and ν(N-H) in Cm−1, IC50 (μM).
Table 3. Correlation matrix (Pearson (n)) between the different descriptors.
Bold values are different from 0 to a significant level for p < 0.05. Very significant for p < 0.01. Very significant for p < 0.001.
Figure 2. Correlation Circle descriptors and the explained variable.
The two main axes are enough to characterize the different descriptors. In fact, the variance percentages are 50.51% and 33.49% for the F1 and F2 axes, respectively. The total information is estimated at 85%. Principal Component Analysis (PCA) [29] was conducted to identify the link between the different descriptors. Bold values are different from 0 at a significance level of p = 0.05.
The matrix obtained provides information on the negative or positive correlation between the variables. The Pearson correlation coefficients are summarized in Table 3. The resulting matrix provides information on the negative or positive correlation between the variables.
The correlation circle was made to detect the connection between the different descriptors. The analysis of the principal components from the correlation circle (Figure 2) revealed that the F1 axis (50.51% of the variance) seems to represent the vibration frequencies ν (OH) and ν (NH), and the axis F2 (33.49% of the variance) seems to represent the electronic affinity AE.
The AHC of Figure 3 distributes the iHDAC7 in two classes according to the affinity of one compound to another. The two major classes consist of compounds as follows: C1 (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 14, 17) and C2 (11, 12, 13, 15, 16, 18, 19).
Figure 3. Dendrogram of the partition hierarchy of the 19 hydroxamic acids for 3 homogeneous classes.
3.2. Prediction of HDAC7 Anti-Cancer Activity from MLR and NMR Models
The equations of the QSAR models obtained for cancer activity from MLR and NMR as well as the statistical indicators are given in Table 4. It should be emphasized that these models were established using the same test and validation Table 3.
The equations of the different models are obtained by using three descriptors (AE, ν(O-H) and ν(N-H)) determined from the optimized molecules. It is important to note that the negative or positive sign of a model descriptor coefficient reflects the proportionality effect between the evolution of the biological activity and this parameter of the MLR model equation. Thus, the negative sign indicates that when the value of the descriptor is high, the biological activity decreases while the positive sign reflects the opposite effect. For the MLR model, the negative signs of the coefficients of the three descriptors (AE, ν(O-H) and ν(N-H)) indicate that the HDAC7 activity will be improved for low values of these descriptors. The study of the significance of these different models is led by the evaluation of the statistical indicators and by the acceptance criteria of Erickson et al. and Tropsha et al. The values of the statistical indicators determined for each model are reported in Table 5. The values of the statistical indicators listed in this table reflect a good correlation of the inhibitory activity on HDAC7 with the different descriptors.
In these respective models (MLR and NMR), 96.56% and 97.62% of the descriptors (AE, ν(O-H) and ν(N-H)) are considered with the standard deviation S of prediction from 0.488 to 0.663. The significance of these models is given by the Fischer F test, 85.00 to 172.79 respectively for the MLR and NMR. The correlation coefficient of the cross-validation
is 0.918 to 0.938 respectively for the RMNL and RML models. These values reflect excellent models according to Erikson et al. [35] [36] . These models are also acceptable because they agree with the acceptance criteria of these authors:
Table 4. Most Significant QSAR Models for Modeling HDAC7 Activity from MLR and NMR.
Table 5. Statistical indicators of multilinear regression.
The regression line between the experimental and theoretical nematocidal potentials of the test set and the validation set is illustrated in Figure 4. This figure illustrates the correlation between the experimental and theoretical IC50 inhibition concentrations of the test set (blue dots) and the validation game (red dots). These models obtained relate the HDAC7 activity and the theoretical descriptors of hydroxamic acids.
The low values of the standard error of 0.448 to 0.663 respectively of the NMR and RML models attest to the good similarity between the predicted and experimental values of the HDAC7 activity despite some differences recorded (Figure 5).
Verification of Tropsha Criteria
The statistical indicators of the five (5) Tropsha criteria of these two models (MLR and NMR) of the validation set are given in Table 6.
,
,
and
;
and
All values meet the Tropsha criteria, so these models are acceptable for predicting HDAC7 anticancer activity.
However, these two models being a function of three theoretical descriptors, it is essential to determine the contribution of each in the prediction of the anticancer activity HDAC7 of the hydroxamic acids studied. Indeed, the knowledge of this contribution makes it possible to establish the order of priority of the various descriptors and to define the choice of the parameters to be optimized for the realization of a better activity HDAC7.
Table 6. Tropsha criteria for different models.
Figure 4. Regression lines of the different models (MLR and NMR).
Figure 5. Similarity curve of the experimental and predicted values of the RML and NMR models.
3.3. Analysis of the Contribution of Descriptors
The contribution of the four descriptors of this model in the prediction of the anticancer activity HDAC7 of the hydroxamic acids was determined from the software XLSTAT version 2014 [20] . The different contributions are illustrated in Figure 6.
In this study, the electronic affinity (AE) has a nearly identical weight as the vibration frequencies ν(O-H) and ν(N-H). The absence of one of these descriptors in the model could destabilize this one. It should be noted that these quantum descriptors in a global way make a rather important contribution in the prediction of the anticancer activity of HDAC7.
3.4. Applicability Domain Analysis
The values of leverage and standardized residuals of the observables of the model used to develop the applicability domain of this model are shown in Table 7.
Analysis of the data in Table 7 shows that all other observations have their standardized residuals between −1.5 and 1.5. The lifts obtained are all below the critical value
. This fact is elucidated by the Williams diagram (Figure 7). The results of the external validation and the domain of applicability show that the established model can be used reliably for the prediction of the inhibitory concentration of future hydroxamic acids.
4. Conclusion
In this work, the anticancer activity HDAC7 of nineteen (19) hydroxamic acid compounds was correlated with the theoretical descriptors calculated by the DFT methods. The descriptors electronic affinity (AE), vibration frequencies ν(N-H) and ν(O-H) can explain and predict the anticancer activity HDAC7. Statistical methods such as Principal Component Analysis (PCA), Ascending Hierarchical Classification (AHC), Multilinear and Nonlinear Regression were used.
Figure 6. Contribution of descriptors in models.
Figure 7. Williams diagram of the MLR model.
Table 7. Values of the levers and residuals of observations of the model.
Two QSAR models (MLR, NMR) showed that the descriptors used would predict, at an acceptable level of confidence, the inhibitory activity of hydroxamic acids. The use of two different methods in this work was to show on the one hand that from these descriptors, we can predict in a different way the inhibitory activity of hydroxamic acids on HDAC7 and on the other hand, the relevance has these descriptors. However, the MLR model (R2 = 0.9659, S = 0.488, F = 85 and p-value < 0.0001) is an effective tool for predicting HDAC7 anticancer activity. Moreover, the study of the contribution of the descriptors showed that these descriptors are almost equivalent in the prediction of the inhibitory activity of the HDACi7 studied. For this model, the future compounds must have their standardized residue between −1.5 and +1.5 with a threshold lever h* = 0.923. A study of the applicability domain of these models is envisaged. From the field of applicability and quantum descriptors elaborated in this work, we plan to propose new molecules with improved activities.