Quantum Chemistry Prediction of Molecular Lipophilicity Using Semi-Empirical AM1 and Ab Initio HF/6-311++G Levels

Reliable prediction of lipophilicity in organic compounds requires the determination of molecular descriptors. In this work, the lipophilicity of a set of twenty-three molecules has been determined using up to eleven quantum descriptors calculated by means of quantum chemistry methods. Following the Quantitative Structure Property Relationship (QSPR) methodology, a first set of fourteen molecules was used as the training set, whereas a second set of nine molecules was used as the test set. Calculations made at the AM1 and HF/6-311++G levels of theory led to a QSPR relation able to predict molecular lipophilicity with over 95% confidence.


Introduction
The information contained in a molecular structure can be accessed and described by means of various physicochemical quantities called descriptors. For decades, many studies have been conducted to determine these descriptors empirically or to compute them, and it is well known that they can indeed describe molecular structures [1] [2] [3]. In quantum chemistry, computed descriptors are naturally favoured. The aim of our work is to determine the molecular descriptors that can reliably predict molecular lipophilicity by quantum chemistry methods. The suitable descriptors are selected from an initial set of eleven, retaining only those that are highly correlated with molecular lipophilicity while being pairwise independent of one another. The whole process leads to the establishment and statistical validation of an efficient QSPR model.

Training and Test Set Molecules
Both the training and test sets are drawn from a sample of twenty-three aromatic compounds with known experimental values [4] of molecular lipophilicity, expressed as $\log P_{exp}$, where $P_{exp}$ is the experimental value of the octanol-water partition coefficient. The training set comprises fourteen molecules and the test set nine molecules (Table 1).
All molecules are coded CAi, with i running from 1 to 23.

Computational Theory Levels and Software
All molecules have been fully optimized with the GAUSSIAN 03 [5] software using the semi-empirical AM1 method and the ab initio HF/6-311++G method. The 6-311++G basis set is sufficient; in particular, the use of both polarization and diffuse functions is not necessary since no intermolecular interactions are studied. Two other software packages, XLSTAT [6] and EXCEL [7], were used according to their specific features for the statistical analysis of the results and for plotting graphs.

Statistical Analysis
A QSPR study needs statistical analysis throughout the validation process. In this work, we used the multiple linear regression method [8] [9], which corresponds to the general equation

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p$$

where $Y$ is the studied property; $X_1, X_2, \ldots, X_p$ are the explanatory variables (descriptors) of the studied property; and $\beta_0, \beta_1, \beta_2, \ldots, \beta_p$ are the model regression coefficients. Excel provides these linear regression equations directly through its regression analysis tool.

The final choice of predictive descriptors is based on two fundamental criteria for selecting a descriptor set, according to Vessereau [10]. The first criterion requires a linear dependency between the studied property and the descriptors: for each descriptor, one must have $R \geq 0.50$, where $R$ is the linear correlation coefficient. The second criterion requires that the descriptors be independent of each other, so one must have $a_{ij} < 0.70$, where $a_{ij}$ is the partial correlation coefficient between descriptors $i$ and $j$. XLSTAT provides these coefficients directly. In the case of simple linear regression [11], the expressions of $R$ and $a_{ij}$ are:

$$R = \frac{\operatorname{cov}(X, Y)}{s_X \, s_Y}, \qquad a_{ij} = \frac{\operatorname{cov}(X_i, X_j)}{s_{X_i} \, s_{X_j}}$$

where $\operatorname{cov}$ denotes the covariance and $s$ the standard deviation. The determination coefficient $R^2$ [12] is given by the following equation:

$$R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}$$

where $TSS$, $ESS$ and $RSS$ are the total, explained and residual sums of squares. The significance of a linear regression equation is assessed from Fisher's coefficient (F) [13]. The higher this coefficient, the better the linear regression equation. The predictive power of a model can be assessed from the five Tropsha criteria [14] [15]. If at least three of the criteria are satisfied, the model is considered efficient in predicting the studied property. The five criteria, Criterion 1 to Criterion 5, are those defined in [14] [15].
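The regression quantities described above (regression coefficients, determination coefficient and Fisher's coefficient) can be illustrated with a short sketch. It is not the Excel or XLSTAT procedure used in this work: the function names `solve_linear_system` and `fit_mlr` are hypothetical, the fit is a plain least-squares solution of the normal equations, and $R^2$ and $F$ are computed from the usual sums of squares, which is an assumption about the exact formulas used by the authors.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_mlr(X, y):
    """Least-squares fit of y = b0 + b1*x1 + ... + bp*xp.

    X is a list of rows (one row of p descriptor values per molecule).
    Returns (beta, r2, f_stat). Assumes n > p + 1 and a non-zero
    residual sum of squares (otherwise F is undefined).
    """
    n, p = len(X), len(X[0])
    A = [[1.0] + list(row) for row in X]  # prepend the intercept column
    ata = [[sum(A[k][i] * A[k][j] for k in range(n)) for j in range(p + 1)]
           for i in range(p + 1)]
    aty = [sum(A[k][i] * y[k] for k in range(n)) for i in range(p + 1)]
    beta = solve_linear_system(ata, aty)

    y_hat = [sum(b * v for b, v in zip(beta, row)) for row in A]
    y_mean = sum(y) / n
    tss = sum((yi - y_mean) ** 2 for yi in y)              # total
    rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # residual
    ess = tss - rss                                        # explained
    r2 = ess / tss
    f_stat = (ess / p) / (rss / (n - p - 1))
    return beta, r2, f_stat
```

Calling `fit_mlr(X, y)` with the descriptor rows of the training set as `X` and the experimental $\log P$ values as `y` would return the coefficients $\beta_0, \ldots, \beta_p$ together with $R^2$ and $F$.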

Molecular Descriptors Selection
Thousands of molecular descriptors are available from the literature and from quantum chemical calculations. For our study, we considered eleven quantum descriptors (Table 2). Table 3 and Table 4 give the values of the quantum descriptors at the AM1 and HF/6-311++G levels, respectively. These values were used to calculate the linear correlation coefficient $R$ and the partial correlation coefficient $a_{ij}$, and to establish the regression models.
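As an illustration of how some descriptors of this kind can be derived from a quantum chemistry output, the sketch below uses common textbook approximations. These are assumptions and not necessarily the definitions adopted in Table 2: electronegativity and hardness are taken from the frontier orbital energies in the Koopmans-type form, softness is taken as the inverse of hardness, and the charge descriptors are computed from a list of Mulliken charges. The function names and the input values in the usage note are hypothetical.

```python
def frontier_descriptors(e_homo, e_lumo):
    """Electronegativity, hardness and softness from orbital energies.

    Assumed (Koopmans-type) definitions, not taken from the paper:
      chi = -(e_homo + e_lumo) / 2
      eta =  (e_lumo - e_homo) / 2
      S   =  1 / eta
    """
    chi = -(e_homo + e_lumo) / 2.0
    eta = (e_lumo - e_homo) / 2.0
    return {"chi": chi, "eta": eta, "S": 1.0 / eta}


def charge_descriptors(charges, elements):
    """Charge-based descriptors from Mulliken net atomic charges.

    Q      : sum of absolute values of the net charges
    q_plus : largest positive charge carried by a hydrogen atom
             (None if no hydrogen carries a positive charge)
    """
    q_total = sum(abs(q) for q in charges)
    h_charges = [q for q, el in zip(charges, elements) if el == "H" and q > 0]
    q_plus = max(h_charges) if h_charges else None
    return {"Q": q_total, "q_plus": q_plus}
```

For example, `frontier_descriptors(-9.0, 1.0)` uses made-up orbital energies (in eV) and `charge_descriptors([-0.4, 0.2, 0.1, 0.1], ["O", "H", "H", "C"])` uses made-up charges; neither corresponds to a molecule of the data set.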
According to Table 5, the rejected descriptors have a correlation coefficient below 0.50 and the selected ones have a coefficient above 0.50. The results are as follows. At the semi-empirical AM1 level, the selected descriptors are $\varepsilon_{HOMO}$, $\varepsilon_B$, $\chi$ and $Q$. At the ab initio HF/6-311++G level, the selected descriptors are $\varepsilon_{HOMO}$, $\varepsilon_B$, $\chi$, $\eta$, $S$, $q^-$ and $Q$. The last step is to verify criterion 2 (Table 6 and Table 7). According to Table 6, the descriptors $\varepsilon_{HOMO}$ and $\chi$ are dependent. This leads us to consider two groups of descriptors at the AM1 level: group 1 ($\varepsilon_{HOMO}$, $\varepsilon_B$ and $Q$) and group 2 ($\varepsilon_B$, $\chi$ and $Q$). According to Table 7, the descriptors $\varepsilon_{HOMO}$ and $\chi$ are also dependent at the HF/6-311++G level, which again leads to two groups. In group 3, the selected descriptors are Energy of the HOMO ($\varepsilon_{HOMO}$), Basicity by hydrogen bonding [16] ($\varepsilon_B$), Chemical hardness ($\eta$), Chemical softness ($S$), Smallest negative charge of the molecule ($q^-$) and Sum of absolute values of Mulliken net electrical charges ($Q$). In the last group, group 4, the selected descriptors are Basicity by hydrogen bonding ($\varepsilon_B$), Chemical electronegativity [17] ($\chi$), Chemical hardness ($\eta$), Chemical softness ($S$), Smallest negative charge of the molecule ($q^-$) and Sum of absolute values of Mulliken net electrical charges ($Q$).
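The two selection criteria (correlation with the property of at least 0.50, mutual correlation between descriptors below 0.70) can be sketched as a simple screening routine. This is an illustrative sketch only: the names `pearson` and `screen_descriptors` are hypothetical, the simple Pearson coefficient is used for both criteria, and absolute values are compared with the thresholds, which is an assumption about how negative correlations are handled.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Simple linear (Pearson) correlation coefficient of two series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def screen_descriptors(descriptors, log_p, r_min=0.50, a_max=0.70):
    """Apply the two selection criteria to a dict {name: values}.

    Criterion 1: keep descriptors with |R| >= r_min against log_p.
    Criterion 2: report pairs of kept descriptors with |a_ij| >= a_max,
                 i.e. pairs that must not enter the same model.
    Returns (kept_names, dependent_pairs).
    """
    kept = [name for name, values in descriptors.items()
            if abs(pearson(values, log_p)) >= r_min]
    dependent = [(a, b) for a, b in combinations(kept, 2)
                 if abs(pearson(descriptors[a], descriptors[b])) >= a_max]
    return kept, dependent
```

Each dependent pair splits the kept descriptors into alternative groups, one containing the first member of the pair and one containing the second, which is how the groups of descriptors are formed in this section.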

Prediction of Lipophilicity at Semi-Empirical Level AM1 (Model 1)
Figure 1 shows that the retained group 2 quantum descriptors are linearly dependent on molecular lipophilicity. The actual plot in Figure 1 is Descriptors $= f(\log P_{exp})$, because several descriptors correspond to a single value of $\log P_{exp}$. The quantum descriptors of group 2 were used to establish Model 1 because they give a more significant regression equation, in the sense of Fisher, than group 1. According to the statistical t-test, the quantum descriptors in Model 1 can be ranked in descending order of importance. Table 8 gives the statistical parameters used to validate Model 1. It shows that Model 1 has a very high predictive capability, since up to 95.60% of the test molecules have their lipophilicities predicted. This means that Model 1 can be used to reliably predict the unavailable lipophilicities of aromatic compounds.
Verification of Tropsha criteria for Model 1.

All values satisfy Tropsha's criteria. Model 1 is retained as a predictive model of molecular lipophilicity. Statistical parameters are gathered in Table 8.

Prediction of Lipophilicity at Ab Initio Level HF/6-311++G (Model 2)
Figure 2 shows that there is indeed a linear dependence between the quantum descriptors of group 4 and the molecular lipophilicity. The quantum descriptors of group 4 were used to establish Model 2 because they give a more significant regression equation, in the sense of Fisher, than group 3. According to the statistical t-test, the importance of the quantum descriptors in Model 2 is in the following descending order: $S > Q > q^- > \eta > \chi$. Table 9 gives the statistical parameters used to validate Model 2. It shows that Model 2 has a low predictive ability ($Q^2_{ext} < 0.60$), since only 59.71% of the test molecules have their lipophilicities predicted. This means that Model 2 cannot be used to reliably predict lipophilicity: it is not efficient in predicting the lipophilicity and is dismissed as a model for lipophilicity prediction. This unsuitable prediction of lipophilicity is most likely due to the use of an extended basis set that includes diffuse functions on all atoms. The use of diffuse functions seems inefficient when calculating lipophilicity. Statistical parameters are gathered in Table 9.

Figure 3 shows that there is indeed a strong correlation between the predicted and the experimental lipophilicity for Model 1. The contrary is observed in Figure 4 for Model 2: the point cloud is widely dispersed and no linear plot could be obtained. This confirms that Model 1 performs well, whereas Model 2 does not.
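The external validation statistic $Q^2_{ext}$ used to compare the two models can be sketched as follows. The formula below is a commonly used definition and is an assumption here, since the expression used by the authors is not reproduced in the text; the function names `q2_ext` and `is_predictive` are hypothetical, and the 0.60 threshold is the one quoted above for Model 2.

```python
def q2_ext(y_test, y_pred, y_train_mean):
    """External predictive coefficient (assumed definition).

    Q2_ext = 1 - sum((y_i - y_pred_i)^2) / sum((y_i - y_train_mean)^2)
    where the sums run over the test set molecules and y_train_mean is
    the mean experimental value of the training set.
    """
    press = sum((yo - yp) ** 2 for yo, yp in zip(y_test, y_pred))
    ss_ref = sum((yo - y_train_mean) ** 2 for yo in y_test)
    return 1.0 - press / ss_ref


def is_predictive(q2, threshold=0.60):
    """A model is regarded as predictive when Q2_ext exceeds the threshold."""
    return q2 > threshold
```

With this definition, the reported values for Model 1 (95.60%) and Model 2 (59.71%), read as $Q^2_{ext}$ values of 0.9560 and 0.5971, fall on opposite sides of the threshold.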

Conclusion
QSPR methodology and quantum chemical methods were used to establish predictive models of molecular lipophilicity. In this work, we identified four groups of quantum descriptors according to the basic criteria usually used for descriptor selection. The results showed that many descriptors correlate strongly with lipophilicity. From these descriptors, we established two lipophilicity prediction models. The statistical analysis led us to retain only the semi-empirical (AM1) based model. The ab initio (HF/6-311++G) based model, on the other hand, was rejected because of its low predictive power. Furthermore, according to the selected model, the main descriptors that strongly influence lipophilicity are the Basicity by hydrogen bonding ($\varepsilon_B$), the Chemical electronegativity ($\chi$) and the Sum of absolute values of Mulliken net electrical charges ($Q$). The inefficiency of the ab initio based model could be due to the use of a high level of theory, and tends to indicate that high levels of theory, and specifically the addition of diffuse functions, are not suitable for molecular lipophilicity calculation. The performance of the semi-empirical based model could indicate that lipophilicity is not strongly linked to electronic effects in molecules.
TSS: Total Sum of Squares; ESS: Explained Sum of Squares; RSS: Residual Sum of Squares.
Smallest negative charge of the molecule ($q^-$); Largest positive charge of the hydrogen atoms of the molecule ($q^+$); Sum of absolute values of Mulliken net electrical charges ($Q$).

In group 1, the selected descriptors are Energy of the HOMO ($\varepsilon_{HOMO}$), Basicity by hydrogen bonding ($\varepsilon_B$) and Sum of absolute values of Mulliken net electrical charges ($Q$). In group 2, the selected descriptors are Basicity by hydrogen bonding ($\varepsilon_B$), Chemical electronegativity ($\chi$) and Sum of absolute values of Mulliken net electrical charges ($Q$). According to Table 7, the descriptors $\varepsilon_{HOMO}$ and $\chi$ are dependent. This leads us to consider two groups of descriptors at the HF/6-311++G level.

The actual plot in Figure 1 is Descriptors $= f(\log P_{exp})$. Indeed, there are several descriptors corresponding to a single value of $\log P_{exp}$.

Figure 3 and Figure 4 represent the following graphs

Table 2.
List of eleven quantum descriptors.

Table 3.
Values of the training set quantum descriptors at the AM1 level.

Table 4.
Values of the test set quantum descriptors at the HF/6-311++G level.

Table 7.
Selection of quantum descriptors according to criterion 2 at the HF/6-311++G level.

The quantum descriptors of group 2 were used for the establishment of Model 1 because they give a more significant regression equation, in the sense of Fisher, than group 1.
$s_{PRESS}$ = 0.5684; $s_{PRESS}$ = 0.6605