2D-QSAR Study of a Series of Pyrazoline-Based Anti-Tubercular Agents Using Genetic Function Approximation

A series of new pyrazoline-based heterocycles has recently been synthesized in our group, and some of these compounds display potent anti-tubercular activity against Mycobacterium tuberculosis H37Rv. To further explore the potency of the compounds, a quantitative structure-activity relationship (QSAR) study was carried out using genetic function approximation. Statistically significant ($r^2 = 0.85$) and predictive ($r^2_{pred} = 0.89$ and $r^2_m = 0.74$) QSAR models were developed. The QSAR study indicates that most of the anti-tubercular activity is driven by lipophilicity. Molecular solubility, Jurs and shadow descriptors also influence the biological activity significantly. In addition, the positive contribution of the molecular shadow descriptors suggests that molecules with bulkier substituents are more likely to show enhanced anti-tubercular activity. Since the developed QSAR models are statistically significant and predictive, they can potentially be applied to predict the anti-tubercular activity of new molecules and to prioritize molecules for synthesis.


Introduction
The World Health Organization (WHO) estimates that almost one-third of the world's population (~2 billion people) is infected with tuberculosis [1]. Every year, more than 8 million people develop an active form of this disease, which claims the lives of nearly 2 million. WHO estimated in 2002 that if the worldwide spread of tuberculosis was left unchecked, it would be responsible for nearly 36 million more deaths by 2020. Effective and specific anti-tubercular drugs have still not been found, and classical antibiotics are currently used to treat tuberculosis. However, the effectiveness of such treatment is rather controversial [2]. Multidrug-resistant TB (MDR-TB), a form of TB that does not respond to the first-line TB drugs, and extensively drug-resistant TB (XDR-TB), an MDR-TB with additional resistance to aminoglycosides and fluoroquinolones, have become a serious threat to the control and treatment of tuberculosis. A few cases of totally drug-resistant tuberculosis (TDR-TB) have also been reported, which has raised serious concerns about the existing drug regimen [3]. This implies an urgent need to discover newer anti-tubercular agents with newer molecular mechanisms.
Quantitative structure-activity relationship (QSAR) analysis is one of the most widely used tools for designing newer candidates in several therapeutic areas [4]-[6]. It provides useful insights into the structural features that are responsible for the biological activity and helps to generate a mathematical model that can quantitatively predict the activity of untested compounds. A QSAR study usually leads to a predictive formula by correlating the physicochemical properties of a congeneric series with the biological activity [7].
Earlier, a series of pyrazoline-based benzoxazoles from our laboratory was identified as potent anti-tubercular agents [8] [9]. To further investigate the potency of these molecules as part of a lead optimization program, we carried out a QSAR study using the Genetic Function Approximation (GFA) technique [10]. The GFA algorithm is a novel approach to creating structure-activity models. It searches for QSAR models automatically by combining statistical modeling with genetic algorithm tools. Typically, thousands of candidate models are generated and tested during evolution. However, only the best models survive, and these are used as "parents" to create the next generation of candidate models. Previously, we have successfully applied GFA to generate a variety of QSAR models [4] [5]. Such models provide useful structure-activity insights, which can be used to prioritize synthetic efforts and to guide lead optimization strategies.

Data Set
In the present study, a series of substituted pyrazoline-based compounds reported by Rana et al. as potent anti-tubercular agents was selected [8] [9]. Fifty-four compounds were randomly divided into a training set of thirty-nine compounds and a test set of the remaining fifteen compounds. Structures of all the compounds used for the 2D-QSAR analysis and their anti-tubercular activity (MIC, µg/mL) are given in Table 1. For all the compounds, the experimental values of biological activity (MIC) were used on the negative logarithmic scale (pMIC) to achieve a normal distribution. Structures of all compounds were sketched using the Visualizer module of Discovery Studio 2.1 software (Accelrys Inc., USA). The CHARMM force field was used for the calculation of potential energy. Energy minimization of all the compounds was done using the Smart Minimizer method until the root mean square (RMS) gradient became smaller than 0.001 kcal/(mol·Å). This was followed by geometry optimization with the semi-empirical MOPAC-AM1 method (Austin Model 1).
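The data preparation described above can be sketched in a few lines. This is a minimal illustration, not the authors' procedure: it assumes that pMIC is taken directly as −log10 of the MIC value in µg/mL (the text does not state whether a unit conversion was applied first), and the compound numbering 1 to 54 and the random seed are hypothetical.

```python
import math
import random


def pmic(mic):
    """Convert an MIC value to the negative log10 scale (pMIC).

    Assumption: the logarithm is applied directly to the MIC in ug/mL.
    """
    if mic <= 0:
        raise ValueError("MIC must be positive")
    return -math.log10(mic)


def split_train_test(compound_ids, n_test, seed=0):
    """Randomly split compound identifiers into training and test lists."""
    rng = random.Random(seed)  # the seed is an assumption; the paper gives none
    shuffled = list(compound_ids)
    rng.shuffle(shuffled)
    return sorted(shuffled[n_test:]), sorted(shuffled[:n_test])


compounds = range(1, 55)  # 54 compounds, numbered 1..54 for illustration only
train, test = split_train_test(compounds, n_test=15)
print(len(train), len(test))  # 39 15
```

A lower MIC maps to a higher pMIC, so more potent compounds receive larger values on the modeled scale.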

Descriptor Calculation
The "Calculate Molecular Properties" protocol of Discovery Studio 2.1 was used to calculate various physicochemical descriptors, including structural, thermodynamic, steric, electronic and quantum mechanical descriptors. A correlation matrix of the molecular descriptors was then generated, and highly correlated descriptors (correlation value of 0.6 or above) were discarded from the study. The remaining, least correlated descriptors were used to develop the 2D-QSAR models. The descriptors included in developing the 2D-QSAR models are listed and described in Table 2.
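The correlation-based pruning step can be illustrated with a small sketch. The text does not say which member of a correlated pair was discarded, so the greedy rule below (keep a descriptor only if its absolute Pearson correlation with every descriptor already kept is below 0.6) is an assumption, and the descriptor names and values are invented toy data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)  # undefined for constant columns


def filter_correlated(descriptors, threshold=0.6):
    """Greedy filter: keep a descriptor if |r| < threshold with all kept ones."""
    kept = []
    for name, values in descriptors.items():
        if all(abs(pearson(values, descriptors[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical descriptor columns (one value per compound), for illustration.
toy = {
    "d1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "d2": [2.1, 3.9, 6.2, 8.0, 9.9],  # nearly proportional to d1
    "d3": [3.0, 1.0, 4.0, 1.5, 3.2],
}
print(filter_correlated(toy))  # ['d1', 'd3']
```

The outcome of a greedy filter depends on the order in which descriptors are visited, which is one reason the choice of rule matters when reproducing such a step.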

Regression Analysis
The advantage of GFA is that the data set is modeled to generate a population of equations rather than a single equation for the descriptor-activity correlation. GFA is a variable selection method based on genetic principles, which combines Holland's genetic algorithm with Friedman's multivariate adaptive regression splines. It thus evolves the population of equations that best fit the training set data.
In GFA, a particular number of equations (100 by default) is randomly generated. Pairs of "parent" equations are then chosen randomly from this set of 100 equations, and "crossover" operations are performed at random. The number of crossover operations was set at the default value of 5000, which gave reasonable convergence. The goodness of each progeny equation is assessed by Friedman's lack of fit (LOF) score,

$$\mathrm{LOF} = \frac{\mathrm{LSE}}{\left(1 - \frac{c + d\,p}{m}\right)^{2}}$$

where c is the number of basis functions in the model, LSE is the least-squares error, p is the number of descriptors, d is the smoothing parameter, and m is the number of observations in the training set. The smoothing parameter controls the scoring bias between equations of different sizes; it was set at the default value of 0.5. The length of the equation was fixed to six terms, the population size was set to 100, and the mutation probability was specified as 0.1. The best three equations out of the 100 were chosen based on statistical parameters such as LOF, regression coefficient (r), adjusted regression coefficient ($r_{adj}$), cross-validated regression coefficient ($r_{cv}$) and F-test values.
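Friedman's LOF score is commonly written as LSE / (1 − (c + d·p)/m)², using the symbols defined above. The following sketch evaluates that expression; the LSE values and model sizes in the example are hypothetical and only serve to show how the score penalizes larger equations at equal fit.

```python
def lack_of_fit(lse, c, p, m, d=0.5):
    """Friedman's lack-of-fit score: LSE / (1 - (c + d*p)/m)**2.

    lse: least-squares error, c: number of basis functions,
    p: number of descriptors, m: number of training observations,
    d: smoothing parameter (0.5 is the default quoted in the text).
    """
    shrink = 1.0 - (c + d * p) / m
    if shrink <= 0.0:
        raise ValueError("model too large for the number of observations")
    return lse / shrink ** 2


# Two hypothetical equations with the same error on 39 training compounds.
small = lack_of_fit(lse=0.10, c=4, p=4, m=39)
large = lack_of_fit(lse=0.10, c=6, p=6, m=39)
print(small < large)  # True: the larger equation is penalized
```

Because the denominator shrinks as c and p grow, adding terms only improves the score when the reduction in LSE outweighs the size penalty, which is how the score resists over-fitting.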

Validation Test
Variance inflation factor (VIF) analysis was performed to check the inter-correlation of descriptors. The VIF value is calculated as $1/(1 - r^2)$, where $r^2$ is the multiple correlation coefficient of one molecular descriptor regressed on the remaining descriptors. A VIF value greater than 10 suggests chance correlation and indicates that the information carried by the molecular descriptors is hidden by their inter-correlation [11].
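As a worked illustration of the VIF relation 1/(1 − r²), the sketch below converts between r² and VIF and applies the threshold of 10. The VIF values are those reported for the six descriptors in the Results; the back-calculated r² values are derived here and are not reported in the paper.

```python
def vif(r_squared):
    """Variance inflation factor from the r^2 of one descriptor
    regressed on the remaining descriptors."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must lie in [0, 1)")
    return 1.0 / (1.0 - r_squared)


def r_squared_from_vif(vif_value):
    """Invert VIF = 1 / (1 - r^2)."""
    return 1.0 - 1.0 / vif_value


# VIF values as reported in the Results section.
reported = {
    "ALogP": 2.010,
    "Jurs_RNCG": 1.243,
    "Apol": 2.558,
    "Jurs_DPSA_1": 1.366,
    "Shadow_XZ": 1.520,
    "Molecular_Solubility": 1.585,
}
for name, value in reported.items():
    print(name, round(r_squared_from_vif(value), 3), value < 10)
```

A VIF of 10 corresponds to r² = 0.9, so the threshold flags descriptors for which 90% or more of the variance is explained by the other descriptors.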
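It has been shown that high values of the statistical characteristics r and F and low values of s and LOF are not sufficient criteria for a highly predictive model. Thus, to evaluate the predictive ability of the 2D-QSAR model, the external predictability method described by Roy et al. was used [12]. It was determined by calculating the value of predictive $r^2$,

$$r^2_{pred} = 1 - \frac{\sum \left(Y_{Obs(test)} - Y_{Pred(test)}\right)^2}{\sum \left(Y_{Obs(test)} - \bar{Y}_{training}\right)^2}$$

with the sums taken over the test set compounds.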
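The external validation metrics can be sketched as follows. The predictive r² follows the usual definition (one minus the ratio of the squared prediction errors on the test set to the squared deviations of the observed test values from the training set mean). The r²_m function uses the commonly quoted form r²·(1 − √|r² − r₀²|), with r₀² taken from a regression through the origin; this form is not stated in the text and is an assumption here, as is the choice of which variable goes on which axis. All numbers are invented.

```python
import math


def r2_pred(obs_test, pred_test, train_mean):
    """Predictive r^2 on an external test set."""
    press = sum((o - p) ** 2 for o, p in zip(obs_test, pred_test))
    ss_dev = sum((o - train_mean) ** 2 for o in obs_test)
    return 1.0 - press / ss_dev


def r2_m(obs, pred):
    """Assumed form of the r^2_m metric: r^2 * (1 - sqrt(|r^2 - r0^2|))."""
    n = len(obs)
    mean_o, mean_p = sum(obs) / n, sum(pred) / n
    s_op = sum((o - mean_o) * (p - mean_p) for o, p in zip(obs, pred))
    s_oo = sum((o - mean_o) ** 2 for o in obs)
    s_pp = sum((p - mean_p) ** 2 for p in pred)
    r2 = s_op ** 2 / (s_oo * s_pp)  # squared Pearson correlation
    # Regression of observed on predicted values forced through the origin.
    k = sum(o * p for o, p in zip(obs, pred)) / sum(p * p for p in pred)
    r0_2 = 1.0 - sum((o - k * p) ** 2 for o, p in zip(obs, pred)) / s_oo
    return r2 * (1.0 - math.sqrt(abs(r2 - r0_2)))


# Hypothetical pMIC values for four test compounds.
obs = [4.2, 4.8, 5.1, 5.6]
pred = [4.3, 4.7, 5.3, 5.5]
print(r2_pred(obs, pred, train_mean=4.9), r2_m(obs, pred))
```

Using the training set mean in the denominator means a model is credited only for doing better than the trivial prediction available before the test compounds were seen.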

Results and Discussion
In the present study, 31 descriptors were initially selected for correlation with anti-tubercular activity. The 31 preselected descriptors represented different classes of descriptors, such as quantum mechanical, steric, geometric, thermodynamic, and electronic. The descriptors were correlated with the activity of the training set compounds using the GFA methodology. Initially, 100 2D-QSAR equations with six descriptors each were generated. The results of the best three models are given in Table 3 along with their regression statistics.
For a statistically significant model, it is essential that the descriptors involved in the equation are minimally inter-correlated. In the present study, the inter-correlation of the descriptors used in the selected models was found to be very low. The correlation matrix for the descriptors used is shown in Table 4.
To further check the inter-correlation of descriptors, variance inflation factor (VIF) analysis was performed (as described in Section 2.4). The VIF values of these descriptors were found to be 2.010 (ALogP), 1.243 (Jurs_RNCG), 2.558 (Apol), 1.366 (Jurs_DPSA_1), 1.520 (Shadow_XZ) and 1.585 (Molecular_Solubility). All the VIF values were less than 10, which suggests very low multi-collinearity among the descriptors. The models were also evaluated for their predictive power by internal and external cross-validation. The results for Equation (1) are summarized in Table 5 and Table 6.
Figure 1 and Figure 2 show the plots of observed vs. predicted activity for the training and test set compounds, respectively, as per Equation (1). The models displayed $r^2_{pred}$ and $r^2_m$ values in the acceptable range [12].
The descriptors used in the study were found to have a significant influence on the biological activity, as seen from their high coefficient values. Notably, the activity was found to be governed chiefly by lipophilicity (ALogP). As seen from the positive coefficient, lipophilicity positively influenced the activity. Indeed, compounds with halogens (bromo/chloro; 2, 7, 16, 20) were found to possess high anti-tubercular activity, whereas compounds with polar groups (9-15) were found to be less active.
Jurs descriptors are a group of molecular descriptors that combine electronic and shape information to characterize molecules [13]. They are calculated by mapping atomic partial charges onto the solvent-accessible surface areas of individual atoms. Jurs_RNCG is the charge of the most negative atom divided by the total negative charge. Jurs_DPSA_1 is the partial positive solvent-accessible surface area minus the partial negative solvent-accessible surface area. A critical analysis of the generated equations suggested a negative contribution of these descriptors to biological activity. This means that the charge distribution within the molecules serves as the driving force for intermolecular interactions, and the higher the relative charge, the smaller the interactions. This is exemplified by compounds 2, 20 and 30, where lower values of these descriptors resulted in an increase in activity.
Another set of geometrical descriptors, the molecular shadow descriptors such as Shadow_XZ (the area of the molecular shadow in the XZ plane), also showed a significant contribution to anti-tubercular activity, with a positive coefficient. This shows that molecules with bulkier substituents (2, 20, 30, 35) are more likely to show activity. Consistent with this correlation, compounds 1, 5, 13 and (with one or more H substituents) stood out as less active due to low values of Shadow_XZ. Apol (the sum of the atomic polarizabilities) also contributed positively to anti-tubercular activity. However, its low coefficient signals a low contribution compared to the other descriptors.

Conclusion
The developed 2D-QSAR models were found to be statistically significant, as seen from their regression statistics.
In the expression for predictive $r^2$, $Y_{Obs(test)}$ and $Y_{Pred(test)}$ are the observed and predicted activity values, respectively, of the test set compounds, and $\bar{Y}_{training}$ is the mean activity value of the training set.

Figure 1. Plot of observed vs. predicted pMIC values of training set compounds (as per Equation (1)).

Figure 2. Plot of observed vs. predicted pMIC values of test set compounds (as per Equation (1)).

Table 2. List of descriptors used in the study.

Table 3. Selected 2D-QSAR equations and their regression statistics.

Table 4. Correlation matrix of the descriptors used in the equations.

Table 5. Observed and predicted pMIC values of training set compounds (as per Equation (1)).

Table 6. Observed and predicted pMIC values of test set compounds (as per Equation (1)).