A Structural Comparison Approach for Identifying Small Variations in Binding Sites of Homologous Proteins

A method for analyzing the protein site similarity was devised aiming at understanding selectivity of homologous proteins and guiding the design of new drugs. The method is based on calculating Cα distances between selected pocket residues and subsequent analysis by multivariate methods. Five closely related serine proteases, the coagulation factors II, VII, IX, X, and XI, were studied and their pocket similarity was illustrated by PCA clustering. OPLS-DA was then applied to identify the residues responsible for the variation. By combining these two multivariate methods, we could successfully cluster the different proteases according to class and identify the important residues responsible for the observed variation.


Introduction
Drug discovery includes several crucial steps before clinical testing and alluding to an optimal process may render saved time and efforts.Thus identifying and front-loading mitigation of sub-optimal compound properties before clinical phases is highly desirable [1].Next to lack of efficacy, safety issues are the leading cause for drug rejection in late stage clinical trials which represents an estimated 70% of the total cost for a drugs clinical development [2]- [4].Drug promiscuity, to some extent, has been correlated to ligand flexibility, although mainly to structural similarity between targets [5] [6].The problem of structural similarity can furthermore be broken down into two parts, the identification of binding pockets and the aspect of protein similarity.Numerous methods have been developed for identifying druggable targets and predicting their binding pockets.These often involve geometry-based [7], energy-based [8], or physicochemical property-based methods [9], or a combination of these [10], and have been reviewed in detail elsewhere [11] [12].In this study we focus on the issue of structural protein pocket similarity.Early drug target selectivity mitigation by inspection of identified analogue drug targets might provide early compound modification for implementation of more relevant molecular design by ensuing generated information [1].
There are three key steps involved in similarity searches: representation of the binding site, comparison, and scoring.Since proteins can have varying overall structure and still maintain highly conserved function and binding site composition, such comparative methods usually focus on the binding site and more or less neglect the remainder of the protein [13].Binding site representation is a crucial step in the process of similarity detection.The representation can for example be made using the atomic position of the Cα carbons of the residues in the site.It can be complemented with the Cβ carbons, and even pseudo atoms representing an amino acid residue.The latter is done in order to incorporate structural information, yet at the same time ensuring a simplified model over the more complex and time consuming inclusion of the exact positions of all atoms in the representative residues.The binding site can moreover be represented by surface patches [14], pharmacophore features [15], or by physiochemical properties [16].A simplified representation of the site is again favorable in order to reduce the complexity of the comparison.In fact, Fleedman and Labute have shown that Cα carbons give adequate information to enable efficient comparison of binding sites and that use of Cβ (and Cγ) carbons generate essentially identical models to this [17].The use of Cα carbons hence forms the basis of the present study.
Binding site comparison also involves finding the best superposition of the sites involved.This is dependent on how the site is represented, the similarity metric, and the algorithm used for the comparison.Algorithms can for example include alignment methods such as iterative searches [18], geometric matching [19], geometric hashing [20], and clique detection [9] [21] which are not always favorable because of the uncertainty that comes with the alignment.Principal component analysis (PCA) [22] has been used in several fields for interpreting different datasets and analyzing variations in the data.It has also been applied when comparing protein cavities such as in the GRID consensus principal components (GRID/CPCA) approach which does not rely on alignments [23].GRID/CPCA is applied to improve ligand selectivity towards a particular target by identifying potential modifications in the ligand.However, this similarity based method is dependent on the availability of structural data of proteins with an active molecule present in the cavity.PCA has recently also been implemented to characterize and map the cavities of proteins without ligands present [24], and to examine the dynamics of cavity geometry evolution by selecting structures from molecular dynamics simulations (MD) [25], in this way dealing with the protein flexibility issue.
To our knowledge, our current study is the first time orthogonal partial least squares discriminant analysis (OPLS-DA) [26] has been applied to explore binding sites.The present method is intended as an initial step when comparing similar targets, in order to identify even subtle variations in homologous binding sites.To demonstrate the approach, we evaluated coagulation factors II, VII, IX, X, and XI which all share high structural similarity, aiming at discriminating between the different classes.The model is based on calculations of distances between the Cα atoms of selected amino acid residues located in the vicinity of the catalytic site.By applying PCA, the retrieved distance data could be visualized in an easily interpretable fashion separating the protein classes into clusters based on cavity backbone variation.Furthermore, OPLS-DA was implemented to derive more accurate loadings, as compared to the PCA generated ones, in order to identify the specific amino acid residue(s) responsible for the observed clustering in the PCA.Our aim is to develop a method that provides insight regarding similarity between protein cavities in order to guide molecular modeling or drug design based on 3D structures.The strength of the present approach is its straightforwardness regarding both describing and comparing the included target sites, which is of value in the early stage of drug design.

Selection of 3D Structures
A set of 86 protein structures of the five coagulation factors II, VII, IX, X, and XI was used in the study as obtained from the Protein Data Bank (www.rcsb.org)[27] and are named according to their PDB codes.These five proteins were mainly selected because of sharing high structural similarity.Structures having a covalently attached ligand were discarded as it could interfere with the calculations.Structures from different organisms; human, bovine, and mouse were included, as well as dimers, trimers, and tetramers.Both apo and holo forms of the enzymes were included (See Table S1, Supporting Information).

Selection of Amino Acid Residues (AAs)
To incorporate the vicinity of the catalytic site, AAs representing all three binding pockets (S1, S2-Sn, and S1'-S2') were selected (Figure 1), resulting in a total of seven initial AAs.These include the catalytic triad His57, Asp102 and Ser195 (thrombin numbering), and the aspartate residue Asp189, located at the bottom of the S1 pocket, which is responsible for the specificity of the proteases [28].These four residues were furthermore complemented by Leu41 located in the S1'-S2' region, Trp215 in S2 and Tyr228 in the S1 pocket.The selection parameter was set so as to add two AAs before and two after each respective initial residue, e.g., for His57 amino acids 55, 56, and 58, 59 were also included in the distance calculations.This parameter was set in order to sample more of the sites, but one can chose to include as many or as few AAs as desired.In the current study, a total of 35 AAs were thus included, that efficiently span all three regions of the binding pockets.

Calculation of Distances
The Cα atom coordinates of the selected AAs were obtained from their respective PDB files.The distances between the Cα atoms were calculated in an all-against-all fashion (Figure 2) creating a descriptor matrix (X) with all distances given in Ångström (Å).

Multivariate Analysis
Principal component analysis (PCA) [29] was used to compress the systematic variation in the descriptor matrix (X), containing N observations (86 protein structures) and K variables (Cα-Cα distances), into two low-dimensional matrices T (scores matrix) and P' (loadings matrix), and is generally illustrated according to Equation (1).
T is composed of the principal components ( ) , , , n t t t  and represents the variation between the different coagulation factors while P' contains the loadings of the components ( ) , , , k p p p  and represents the variations in the calculated distances and defines the orientation of the PC plane.In this way the combination of T and P' defines the PC model and the overall variation of the descriptor matrix by orthogonal factors.The residual matrix E, containing noise, is discarded from the PC model.
The first extracted PC accounts for the largest variance in the data.To this, additional PCs are added, orthogonal to the previous, to improve the approximation of the data.Each PC was further evaluated based on its eigenvalue, multiple correlation coefficients (R 2 ), and cross-validation (Q 2 ) [22].
In order to identify the amino acid residue(s) responsible for the obtained differences in the scores plot, orthogonal projections to latent structures (OPLS) was applied to remove the Y-orthogonal variation from X as described in Equation ( 2) [30].T P is the predictive score matrix and T P P is the predictive loading matrix for X.The Y-orthogonal score and loading matrices are denoted T O and T O P , respectively.E comprises the residual matrix.The method can in our case be defined as OPLS discriminant analysis (DA) because the response vector, Y, was set as a discriminant defining each protease class in order to find the discrimination of that particular class against all others.The model was further evaluated by cross validation after each generated orthogonal component.
All multivariate calculations were performed in the SIMCA 13.0 software [29] [30].All data was subjected to centering but remained un-scaled before any multivariate calculations.

Target Protein Preparation
In the present study a total of 86 protein structures of the five main coagulation factors FII, FVII, FIX, FX, and FXI were analyzed.The protein pockets were sampled by calculating distances between the Cα atoms of selected AAs.
A first PCA including all coagulation factors was calculated, resulting in five main outliers, 3sqe, 3sqh, and the two dimer structures of 2afq, all from FII, and 1jbu from FVII (Figure 3), plus three additionally possible outliers, 3edx (trimer), 3hk3 (monomer), and 3hk6 (dimer) of FII.By inspection of the loadings, the distances generating the outliers could be explained.For 3sqe and 3sqh the catalytic residue Ser195 was mutated to alanine which gave rise to the deviation [31].The two dimers structure 2afq were found to be altered in several areas of the protein structure leading to overall structural differences.Taking a closer look at the crystallization data it was found that 2afq was expressed in the absence of coordinating Na + .The Na + free environment thus induces a conformational change which ultimately blocks the S1 and S2 pockets making it stand out from the proteins expressed in Na + containing environment [32].The structure of 1jbu was found to be in complex with the exosite binding inhibitory peptide A-183.This peptide occupies the binding site and thereby alters its structure whereby 1jbu displays large differences in the binding region compared to other serine proteases, in particular in the loop which defines the S1 pocket [33].The identification of these five outliers was enabled by PCA using data centering without scaling, making it possible to identify deviating data visualized as outliers in the scores plot.To obtain more relevant resolution the five first outliers were excluded from all subsequent calculations.The second PCA was calculated keeping 3edx, 3hk3, and 3hk6 in the model (Figure 4(A)) since they appeared closer to the remaining clusters.These three structures all share the mutations of Trp215 and Glu217 to Ala, which were two of the AAs included in the Cα distance calculations.The mutated structures assume a conformation similar to the inactive form of thrombin recently shown by Gandhi et al., which explains the deviation in the PCA [34].Removing these from the model (Figure 4(B)) did not result in any significant difference from the previous PCA (Figure 4(A)) with regards to the clustering of the remaining factors.However, in the OPLS scores plot, the structures of FII were divided into two groups where the mutated structures all clustered into one.

Multivariate Analyses for Protein Cavity Characterizations
The recalculated PCA model, after excluding all above mentioned structures seen as outliers, could successfully identify and cluster the factors by class using a 4-component model (Figure 4 ) and the cross-validation, after PC4, gave a cumulative Q 2 value of 0.75 (Table 1).The rather small difference between 2  The relatively large eigenvalues further indicate that there was substantial systematic variation present in the dataset.Aside from the clustering of the individual factors there also seems to be a division between two groups (Figure 4) where FIX and FX make up one of the two.Their similarity is described in literature as sharing the same homology with protein C [35], which explains their resemblance.When inspecting, also the third component of the PCA, further clustering was revealed for all factors, as seen in the 3D scores plot (Figure 5).
In addition to PCA, OPLS-DA was applied to identify unique variations in the distances of the protein binding cavities for each protease class.By removing the non-correlated variation from X the predictive model complexity is reduced which improves the interpretation of the resulting one component giving a more accurate correlation between the factors.All coagulation factors showed to be discriminated mainly by distance variations in the S1 pocket which gave rise to the main variance seen in the original PCA (Figure 4(B)).When assigning FII as the discriminant Y, in addition to S1, variation was also found in the hydrophobic proximal S1'-S2' pocket (Figure 6).The largest variation obtained from the loadings indicate that residues Glu39 and Leu40 are responsible for the discrimination in S1'-S2'.It was also found that, among others, the distance to His230 in the S2-Sn pocket deviates from the rest of the proteases.
In the case of FIX, the OPLS loadings indicate a variation in loop 2, residue Glu217, as the main contributor to the discrimination of this protein.This agrees with the fact that FIX possesses a glutamate at position 219 in loop 2, close to 217, whereas the other serine proteases have glycine in that same position [36].Additionally, since this residue (Glu219) is located right at the entrance of the S1 pocket it may very well be the cause of the difference seen in the PCA.
It has earlier been shown by Shirk et al. that FVII differentiates itself from FII and FX in both the S2-Sn and S1'-S2' pocket [37], which was also confirmed in this study.FVII contains large distance variation in Asp100, Arg101, Asp102, and Ala104 which are all located in the S2-Sn region, and in Leu41 and Glu39 located in the    S1'-S2' region.FX was discriminated in the loop 1 region which constitute parts of the S1 pocket, and in the S2'-Sn' pocket by the same residues as for FVII.Lastly, FXI also displayed a large region of possible selectivity sites where, aside from the S1 pocket, variations were found in distances located in both the S2-Sn and the S1'-S2' pockets.

Method Example Rationale for Early Selectivity Identification
In order to exemplify how one could apply the described approach in drug design based on 3D structures, and achieve drug selectivity, we examine the difference in volumes and side chains of the respective amino acids important for discriminating FII from the rest of the coagulation factors.The most important deviation for FII can be identified as located in the S1 pocket.This appears larger in FII than in the other factors, because Arg187 is located further away from most of the included AAs.This opens up the possibility to design a ligand that reaches further down in the S1 pocket in FII, whereas it would be sterically hindered from binding to the other factors.In addition, residue 39 was identified as a possible "selectivity filter" as it is negatively charged in FII (Glu) as opposite to FXI which has an amino acid with a positively charged side chain (Arg) or VII where a hydrophobic alanine occupies that position.In this case, a ligand with a positively charged group pointing towards Glu39 would provide strong binding in FII.In order to target Leu40 we can consider halogenated ligands, as these have a high propensity to interact with hydrophobic or hydrogen bonding amino acids such as Leu, Phe, Ser and Thr [38].Based on the above analysis, we have thus very easily identified three possible sites where drug design could be directly applied in order to selectively target FII.

Conclusions
PCA and OPLS-DA were applied to analyze the variation between similar protein structures.The approach was able to group the structures into clusters according to the common classifications of serine proteases.Our results show that the Cα coordinates contain sufficient information to distinguish between subtle variations in the proteases with good accuracy.In this case, further details of exact side chain conformation are not needed to make the initial comparison.Adequate information to group the proteases appears to be encoded in the protein backbone as single mutations such as the one of Ser to Ala in 3sqe and 3sqh clearly stand out in the PCA/OPLS analyses.Although this may be seen as a limitation in the sense that the method may be overly sensitive to small variations, it is important to remember that the method is intended for precisely this purpose, namely to identify and expose the smallest variations in structures.
The proposed method complements already existing methods in analysis of closely related binding sites, and at the same time adds the possibility of rapidly and readily identifying residues responsible for variations found in the analysis.The advantage of the present approach is that it be applied even if only backbone information is available, and since it is a structure based method, it is also independent of the availability of active ligands.
Providing detailed knowledge about variations in cavities can assist when selecting targets for molecular modeling studies or for designing ligands, be it highly selective as well as highly promiscuous compounds, based on 3D structures of targets as herein exemplified.Thus an early intervention, using the present approach for identification of possible compound structure selectivity sources, might provide crucial remuneration during following development drug discovery phases.

Figure 2 .
Figure 2. Schematic figure of the protein active site comparison method.Input PDB-files are used for calculating the distances between 3D coordinates of AAs.The difference in distances is illustrated by clusters plotted in a PCA.

Figure 3 .
Figure 3. PCA score plot of t1 vs t 2 based on distances between the selected AAs.The colors represents the different coagulation factors; II (blue), VII (yellow), IX (red), X (turquoise) and XI (purple).
(B)).The first component PC1 explained 40% of the data (R 2 = 0.40).Four PCs together explained 79% ( cross-validation was successful since the model was able to accurately predict the missing data.
multiple correlation coefficients and cross validations, respectively.

Figure 5 .
Figure 5. 3D scores plot of the PCA model B, excluding outliers.The coloring schemes are the same as in Figure 3 and Figure 4.

Table 1 .
Statistical values for the first four principal components (PCs) a .