Discovery and validation of potential drug targets based on the phylogenetic evolution of GPCRs

Target identification is a critical step following the discovery of small molecules that elicit a biological phenotype. G-protein coupled receptors (GPCRs) are among the most important drug targets for the pharmaceutical industry. The present work seeks to provide an in silico model of known GPCR protein fishing technologies in order to rapidly fish out potential drug targets on the basis of the amino acid sequences and seven transmembrane regions (TMs) of GPCRs. Scoring matrices were trained on 22 groups of GPCRs in the GPCRDB database. These models were employed to predict the GPCR proteins in two groups of test sets. On average, the mean correct rate of each TM of 38 GPCRs from the two test sets (T23 and T24) was 62% and 57.5%, respectively, using training set 18; the mean hit rate of each TM of 38 GPCRs from T23 and T24 was 68.1% and 64.7%, respectively. Based on the scoring matrices of PreMod, the mean correct rate of each TM of GPCRs from T23 and T24 was 62% and 62.04%, respectively; the mean hit rate of each TM of GPCRs from T23 and T24 was 67.7% and 68.0%, respectively. The means for GPCRs in T23 based on training set 18 are close to those based on PreMod, whereas the means for GPCRs in T24 based on training set 18 are lower than those based on PreMod. Moreover, the accuracy (“2”) and validity (“2 + 1”) rates of predicting all seven TMs of the 38 GPCRs by the scoring matrices of PreMod are higher than those obtained with training set 18.


INTRODUCTION
G-protein coupled receptors (GPCRs) are among the most important drug targets for the pharmaceutical industry [1]. More than 30% of all marketed therapeutics interact with them. GPCRs are integral membrane proteins that possess seven membrane-spanning domains, or transmembrane helices, with the N-terminus located extracellularly and the C-terminus extending into the cytoplasm. They comprise a large protein family of transmembrane receptors that sense molecules outside the cell and activate intracellular signal transduction pathways and, ultimately, cellular responses. The heterotrimeric G proteins (guanine nucleotide-binding proteins) are signal transducers, attached to the cell surface plasma membrane, that connect receptors to effectors and thus to intracellular signaling pathways [2,3]. The extracellular signals are received by GPCRs that activate the G proteins, which communicate signals from many hormones, neurotransmitters, chemokines, and autocrine and paracrine factors through several distinct intracellular signaling pathways [2]. These pathways interact with one another to form a network that regulates metabolic enzymes, ion channels, transporters, and other components of the cellular machinery controlling a broad range of cellular processes, including transcription, motility, contractility, and secretion. These cellular processes in turn regulate systemic functions such as embryonic development, gonadal development, learning and memory, and organismal homeostasis [2]. G protein-dependent and G protein-independent pathways each have the capacity to initiate numerous intracellular signaling cascades to mediate these effects [4]. G proteins are GTPases (guanosine triphosphatases) that cycle between a GDP-bound form and a GTP-bound form [5]. The GTP-bound G protein is the active form that interacts with downstream effectors and transmits signals, during which the bound GTP is often hydrolyzed to GDP and the G protein recycles into the inactive GDP-bound form [5]. The heterotrimeric G protein complex comprises a Gα subunit, of which there are 4 main families (Gαs, Gαi/o, Gαq/11, and Gα12/13), coupled to a combination of Gβ and Gγ subunits, of which there exist 6 and 12 members, respectively [2,4]. The Gα subunit binds guanine nucleotides, while the Gβγ subunits cannot be dissociated under nondenaturing conditions. The activity of G proteins is regulated mainly through three classes of regulatory proteins: GTPase-activating proteins (GAPs), guanine nucleotide-exchange factors (GEFs), and guanine nucleotide-dissociation inhibitors (GDIs) [6]. Upon activation, the GTP-bound Gα subunit dissociates from the Gβγ subunits and serves as the major signaling messenger by interacting with its signal acceptors (downstream effectors) [2].
Mammalian GPCRs constitute a superfamily of diverse proteins with hundreds of members [7,8]. GPCRs can be grouped into 6 classes based on sequence homology and functional similarity [9,10]: Class A (Rhodopsin-like receptors) [11], Class B (Secretin receptor family) [12], Class C (Metabotropic glutamate/pheromone receptors) [13], Class D (Fungal mating pheromone receptors) [14], Class E (Cyclic AMP receptors) [15], and Class F (Frizzled/Smoothened, F/S) [16,17]. GPCRs act as receptors for a multitude of different signals [8]. One major group, referred to as chemosensory GPCRs (cs-GPCRs), comprises receptors for sensory signals of external origin that are sensed as odors [18,19], pheromones, or tastes [20]. Most other GPCRs respond to endogenous signals, such as peptides, lipids, neurotransmitters, or nucleotides [21,22]. These GPCRs are involved in numerous physiological processes, including the regulation of neuronal excitability, metabolism, reproduction, development, hormonal homeostasis, and behavior [8]. The differential expression of GPCRs in many cell types in the body, together with their structural diversity, has proved important in medicinal chemistry. GPCRs are involved in many diseases, and are also the target of around half of all modern medicinal drugs [23]. Of all currently marketed drugs, >30% are modulators of specific GPCRs [24]. However, only 10% of GPCRs are targeted by these drugs, emphasizing the potential of the remaining 90% of the GPCR superfamily for the treatment of human disease [8].
Additionally, Celera's initial analysis of the human genome found 616 GPCRs [25] and Takeda et al. [26] found 178 intronless nonchemosensory GPCRs, whereas the International Human Genome Sequencing Consortium reported a total of 569 "rhodopsin-like" (i.e., Class A) GPCRs [27]. Vassilatis et al. conducted a comprehensive analysis and reported that the repertoire of GPCRs for endogenous ligands consists of 367 receptors in humans and 392 in mice. Included here are 26 human and 83 mouse GPCRs not previously identified [8]. Phylogenetic analyses cluster 60% of GPCRs according to ligand preference, allowing prediction of ligand types for dozens of orphan receptors. Expression profiling of 100 GPCRs demonstrates that most are expressed in multiple tissues and that individual tissues express multiple GPCRs. Over 90% of GPCRs are expressed in the brain. Strikingly, however, the profiles of most GPCRs are unique, yielding thousands of tissue- and cell-specific receptor combinations for the modulation of physiological processes.
Moreover, diverse members of the GPCR superfamily participate in a variety of physiological functions and are major targets of pharmaceutical drugs. GPCRs are one of the most important target classes in pharmacology and are the target of many blockbuster drugs [28]. The presumably α-helical transmembrane regions (TMs) of GPCRs are probably arranged similarly to those of bacteriorhodopsin (brh) [29]. In addition to the low-resolution electron diffraction [30,31] and high-resolution X-ray crystallography [32] structures of brh, the first crystal structure of a mammalian GPCR, bovine rhodopsin, has been solved [33]. In 2007, the first structure of a human GPCR, the β2-adrenergic receptor, was solved [34,35]. In particular, GPCRs are of enormous importance for the pharmaceutical industry because 52% of all existing medicines act on a GPCR [36]. Very well-known therapeutic drugs such as β-blockers and anti-histamines act on GPCRs. This explains why so many three-dimensional models of GPCRs have been built. Early structural models, such as those of the HIV-1 co-receptor CCR5 (a chemokine receptor) [37,38] and the human thromboxane receptor [39], are based on the atomic coordinates of the brh structure; other models, e.g. that of the human ADP receptor (purinergic receptor P2Y12) [40], were constructed by homology modeling using bovine rhodopsin as a template. All of these modeling studies, combined with bioinformatics and chemoinformatics, make the rational design of novel drugs targeting GPCRs in the human genome feasible [28].
These models would contribute to a better understanding of the structure and function of GPCRs, as well as of ligand-receptor interactions. The present study uses bioinformatics and computational modeling to build theoretical models and folding patterns of GPCRs for the prediction of unknown GPCRs.
The GPCR sequences were taken from a new release of the GPCRDB v.7.6 (http://www.gpcr.org/7tm/htmls/entries.html) based on the latest UniProtKB (Universal Protein Knowledgebase) release of 15-May-2006 (http://www.ebi.ac.uk/swissprot/; http://au.expasy.org/), which contains approximately 764 proteins. Their GPCR family profiles are updated. Their amino acid sequences were from GenBank (http://www.ncbi.nlm.nih.gov/Genbank/index.html) and SWISS-PROT. The secondary structure of protein residues was assigned according to the DSSP method, and the seven TMs were determined based on the GPCR superfamily.

Data Partitioning
The transmembrane domain regions of 764 known GPCRs were each used as a query in TBLASTN searches of the National Center for Biotechnology Information human genome database. Sequences were retrieved from the National Center for Biotechnology Information by their accession numbers (Appendix 1). GPCR Class A, B, and C hidden Markov models were also used as queries to search the International Protein Index proteome database [8]. Grouping of the samples was based on the phylogenetic analysis results of Vassilatis et al. Data sets were partitioned into three sets: training, test, and validation sets. Although protein prediction methodology is almost always reported in terms of training and test sets only, we withheld an external validation set in order to provide an additional rigorous check on model quality. We feel this is necessary since a high statistical correlation on the training and test sets does not necessarily indicate a highly predictive model [41]. To partition our data sets so that each reflects the makeup of the original data set as closely as possible, we took into account the distribution of both feature diversity and biological activity when forming the training, test, and external validation sets. In this way, we maintained the original proportions of categorical bins and structural diversity in each of the three sets.

The Scoring Matrices of Training Sets
Take Group 1 of Class A as an example. In order to represent the TM patterns of GPCRs, a representative nonredundant set of high-resolution GPCR TMs was chosen, as previously reported, to build a training set (Tables 1 and 2). The most consistent sequences in the alignment were selected to constitute a scoring matrix that is used to predict the TM regions. The amino acid sequences of the seven TMs of GPCRs were extracted and aligned using ClustalW; each TM region clusters in one fragment (motif) of about 12, 11, 13, 14, 10, 10, and 12 amino acid residues for TM1-TM7 of Group 1, respectively (Table 2). The coding regions of these amino acid fragments were then used to constitute the scoring matrix, which contains the 4 types of nucleotides (Figure 2).
Take TM1 of GPCRs in Group 1 of Class A as an example. After alignment, the training set consists of 42 GPCR proteins (Table 1). Figure 2 shows the scoring matrix, which was generated by assigning a value of the stimulatory potential to each of the 4 defined nucleotides at each position of the training set. Take adenine (A) as an example. Based on Table 1, adenine is counted at each position at which it occurs, and the sum of the four nucleotides in the training set is 1512 (42 sequences × 36 nucleotides). The score at a position is the count divided by this sum, so counts of 5 and 12 give scores of 5/1512 ≈ 0.003 and 12/1512 ≈ 0.008, respectively, whereas the score is 0 at positions where adenine does not appear (Figure 2). The scores for thymine, cytosine, and guanine may be deduced by analogy. The values of the scoring matrix sum to 1.
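The counting rule described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `build_scoring_matrix` and the four six-nucleotide fragments are hypothetical, not data from Table 1, and the normalization by the total nucleotide count follows the 5/1512 example in the text.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"

def build_scoring_matrix(aligned_cds):
    """Return {nucleotide: [score per position]} for equal-length coding fragments."""
    if not aligned_cds:
        raise ValueError("training set is empty")
    length = len(aligned_cds[0])
    if any(len(seq) != length for seq in aligned_cds):
        raise ValueError("all fragments must have the same length")
    # Total number of nucleotides in the training set (42 * 36 = 1512 in the text).
    total = len(aligned_cds) * length
    matrix = {nt: [0.0] * length for nt in NUCLEOTIDES}
    for pos in range(length):
        counts = Counter(seq[pos] for seq in aligned_cds)
        for nt in NUCLEOTIDES:
            # A nucleotide that never occurs at this position keeps a score of 0.
            matrix[nt][pos] = counts.get(nt, 0) / total
    return matrix

# Hypothetical toy fragments (not taken from the paper's training sets).
toy_training_set = ["ATGCTG", "ATGCTA", "ACGCTG", "ATGTTG"]
toy_matrix = build_scoring_matrix(toy_training_set)
matrix_sum = sum(sum(row) for row in toy_matrix.values())
```

With only A, C, G, and T in the input, all entries add up to 1, which matches the statement that the values of the scoring matrix sum to 1.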

Test Sets
According to set theory [42], the GPCRs chosen above constitute the different training sets (Table 3), and the test sets come from the complement of the training sets within the whole collection of GPCRs (Appendix 1). According to our previous methods [40,43], we defined the coding sequence (CDS) of each TM of a GPCR as a TM-CDS unit composed of nucleotides. First, the TM-CDS units are obtained one by one using the sliding window method from the 5'-terminus of the GPCR's CDS to the 3'-terminus: a sequence of l nucleotides gives rise to l − m + 1 TM-CDS units of m nucleotides each. For example, the coding sequences of TM1 of GPCRs in Group 1 are 12 × 3 nucleotides long, namely m = 36.
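A minimal sketch of the sliding window step, assuming a step size of one nucleotide; the helper name `tm_cds_units` and the short example sequence are hypothetical and serve only to show the l − m + 1 count.

```python
def tm_cds_units(cds, window):
    """Return every window-length fragment of cds, moving 5' to 3' one nucleotide at a time."""
    if window <= 0:
        raise ValueError("window must be positive")
    # A sequence of l nucleotides yields l - window + 1 units (none if it is too short).
    return [cds[i:i + window] for i in range(len(cds) - window + 1)]

# Hypothetical 8-nucleotide sequence scanned with a 3-nucleotide window.
units = tm_cds_units("ATGCATGC", 3)
unit_count = len(units)
```

For TM1 of Group 1 the window would be 36 nucleotides, so a coding sequence of 1000 nucleotides would give 965 candidate units.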

Validation Set
Similarly, we calculated the total scores of the coding sequences of 22 GPCRs located on the sense strand of chromosome 19 using the sliding window method.
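The total score of a window can be illustrated as the sum of the scoring matrix entries selected by its nucleotides. The sketch below rests on assumptions: the names `score_unit` and `best_unit` and the 3-position matrix are hypothetical, and reporting the single highest-scoring window is a simplification, since the text applies a threshold to the scores rather than stating a best-window rule.

```python
def score_unit(unit, matrix):
    """Sum the matrix entries selected by each nucleotide of one TM-CDS unit."""
    total = 0.0
    for pos, nt in enumerate(unit):
        row = matrix.get(nt)
        if row is not None:  # symbols outside A/C/G/T contribute nothing
            total += row[pos]
    return total

def best_unit(cds, matrix):
    """Return (start index, score) of the highest-scoring window in cds."""
    window = len(next(iter(matrix.values())))
    scored = [(score_unit(cds[i:i + window], matrix), i)
              for i in range(len(cds) - window + 1)]
    if not scored:
        raise ValueError("sequence is shorter than the scoring matrix")
    score, start = max(scored, key=lambda pair: pair[0])
    return start, score

# Hypothetical 3-position scoring matrix whose entries sum to 1.
toy_matrix = {
    "A": [0.25, 0.00, 0.00],
    "C": [0.00, 0.00, 0.05],
    "G": [0.05, 0.00, 0.30],
    "T": [0.00, 0.35, 0.00],
}
best_start, best_score = best_unit("CCATGAT", toy_matrix)
```

In this toy case the window starting at index 2 ("ATG") collects the three largest entries and therefore scores highest.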

Assessment of Model Quality
In this study, training model quality is simply the percent correct classification (binning) of the TM segments of GPCRs for the test set [41]. The overall predictive power of a given model is the percent correct classification for the test set (%test) and for the external validation set (%validation), where the external validation set represents native holdout data. More extensive model assessment was accomplished by a "dynamic partitioning" procedure, which provides an error rate for the test and external validation sets.
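As an illustration of percent correct classification, the sketch below uses the paper's three labels (0 for different from, 1 for partially consistent with, and 2 for identical with the actual TM region) and reports the "2" rate and the combined "1 + 2" rate per TM, then averages over TMs. The function name and the label lists are hypothetical and are not results from the tables.

```python
def classification_rates(labels):
    """Return (accuracy %, hit rate %) for a list of 0/1/2 prediction labels."""
    if not labels:
        raise ValueError("no predictions to assess")
    n = len(labels)
    accuracy = 100.0 * sum(1 for x in labels if x == 2) / n       # "2" only
    hit_rate = 100.0 * sum(1 for x in labels if x in (1, 2)) / n  # "1 + 2"
    return accuracy, hit_rate

# Hypothetical labels for four receptors on two TMs (illustration only).
toy_labels = {"TM1": [2, 2, 1, 0], "TM2": [2, 0, 0, 1]}
per_tm = {tm: classification_rates(values) for tm, values in toy_labels.items()}

# Mean over TMs, analogous to averaging the seven per-TM rates of one examination.
mean_accuracy = sum(acc for acc, _ in per_tm.values()) / len(per_tm)
mean_hit_rate = sum(hit for _, hit in per_tm.values()) / len(per_tm)
```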

Statistics
Data are expressed as mean ± standard deviation (S.D.) throughout this paper. Statistical analyses were performed with the F-test by one-way analysis of variance (one-way ANOVA) and with the t-test between the means of two groups of samples. Data were considered significant for P < 0.01 at the 95% confidence limit [44]. The normality of the data was tested with the Shapiro-Wilk statistic because the number of samples was less than 2000 [45]. All statistical testing was conducted at significance level 0.10 and all confidence intervals had confidence level 0.90 unless otherwise noted. All tests and confidence intervals were two-sided. Confidence intervals for normal data were constructed from analysis of covariance models [45]. Here, α = 0.10 requests 90% confidence limits; the default value is 0.05. One-way ANOVA, the test of homogeneity of variances, multiple comparisons (LSD and Tamhane's T2), and tests for normality were performed using SPSS version 11.5 software.
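For readers who want to reproduce a two-group comparison outside SPSS, the following sketch computes a two-sample t statistic with the Python standard library. It assumes the pooled-variance (equal variances) form, which the text does not specify, and the function name and sample values are hypothetical.

```python
import math
from statistics import mean, variance

def pooled_t_statistic(group_a, group_b):
    """Return (t, degrees of freedom) for a pooled-variance two-sample t-test."""
    na, nb = len(group_a), len(group_b)
    if na < 2 or nb < 2:
        raise ValueError("each group needs at least two observations")
    df = na + nb - 2
    # Pooled sample variance; this assumes both groups share one variance.
    pooled = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / df
    t = (mean(group_a) - mean(group_b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, df

# Hypothetical scores for two groups (illustration only).
t_value, dof = pooled_t_statistic([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

The resulting |t| is then compared with the critical value for the chosen significance level and degrees of freedom.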

The Prediction Model Algorithm
In general, our prediction model (PreMod) method employs the scoring matrices combined with descriptors.
Note: The numbers "0", "1", and "2" denote that the helical regions of GPCRs predicted by the scoring matrices of training set 18 are "different from", "partially consistent with (similar to)", and "identical with" the actual transmembrane regions of GPCRs, respectively.

RESULTS
In what follows, we present three primary results, based on application of the methods described above.

Phylogenetic Analysis and Structural Evolution
Table 1 lists the amino acid sequences of TM1 in Group 1 GPCRs, the common 12-residue regions of TM1 obtained by alignment, and the corresponding coding sequences consisting of 36 nucleotides. Table 2 displays the amino acid sequence length and the number of samples that make up the scoring matrix of each transmembrane region of GPCRs in the training datasets after sequence alignment. For the same TM, the sequence length and the number of samples in the scoring matrix differ between training sets; within the same training set, they differ between TMs. Figure 2 illustrates the scoring matrices of the seven TMs (TM1-TM7) of GPCRs in Group 1 of Class A in the training datasets. These matrices are the core of the GPCR prediction system.

Validation of the Models (Scoring Matrices)
Tables 4 and 5 display the scores and the prediction accuracy of the coding sequences of the transmembrane segments of GPCRs in the test sets obtained with the scoring matrices of the different training datasets. All the data can be clearly divided into four categories: T23-"2", T23-"1 + 2", T24-"2", and T24-"1 + 2". The number "2" denotes that the helical regions of GPCRs predicted by the scoring matrices of the training set are identical with the actual TM regions of GPCRs, while the number "1" denotes that the predicted helical regions are partially consistent with the actual TMs. "1 + 2" means the combination of "2" with "1", namely the positive prediction results. There are 22 examinations (corresponding to the 22 training sets) in each category. For instance, if we use test set T23 to examine Group 1 of the training sets, we obtain seven "all hit" ("2") correctness rates (validity), one for each of TM1 to TM7. These 7 correctness rates as a whole can be regarded as examination 1 in T23-"2". In this situation, the mean correctness rate of one examination is the mean of the seven correctness rates (Table 5). The rest may be deduced by analogy.

Statistics Analysis
One-way ANOVA, a powerful and common statistical procedure, was used to determine whether there are significant differences among the mean correctness rates of the examinations. As one-way ANOVA requires, all data that do not follow a normal distribution were eliminated, namely Groups 1, 6, and 10 of T23-"2", Group 16 of T23-"1 + 2", Groups 12, 16, and 22 of T24-"2", and Groups 15, 19, and 22 of T24-"1 + 2". The four results of one-way ANOVA, with F values of 21.931, 9.308, 22.807, and 7.488 for T23-"2", T23-"1 + 2", T24-"2", and T24-"1 + 2", respectively, indicate that there are significant differences between the mean correctness rates of the examinations at the 0.01 level in each category. The test of homogeneity of variances was then applied in order to find a suitable method for multiple comparisons. T23-"1 + 2" and T24-"1 + 2", with P values of 0.7673 and 0.7121, respectively, have homogeneous variances at the 0.05 level, so the LSD method of multiple comparisons was used. On the other hand, the variances of T23-"2" and T24-"2", with P values of 0.0032 and 0.0418, respectively, are not homogeneous, which means that Tamhane's T2 should be chosen as the multiple comparisons method. The results of the multiple comparisons are visualized in Figure 4, where "X" shows that there is a significant difference (at the significance level of 0.05) between the two examinations indicated by the corresponding column and row (the specific values of the multiple comparisons can be found in the supplemental data). Finally, the average correctness rates of the examinations in each category are plotted in Figure 5, from which we can see that training set 18 has the highest mean correctness rate in each of the four categories, although training set 18 does not differ statistically from other training sets such as 2, 3, 4, and 14. The ANOVA results reveal that three scoring matrices are significant, from three training datasets, Groups 18, 14, and 3 (Figure 5).
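The F value reported by one-way ANOVA is the ratio of the between-group mean square to the within-group mean square. The sketch below computes it with the standard library only; the function name and the three small groups are hypothetical and do not reproduce the F values quoted above, which were obtained with SPSS.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of sample groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two groups and more observations than groups")
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations around their group mean.
    ss_within = sum(sum((x - mean(g)) ** 2 for x in g) for g in groups)
    df_between, df_within = k - 1, n - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within

# Hypothetical correctness rates for three examinations (illustration only).
f_value, df_b, df_w = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

A large F relative to the critical value for (df_between, df_within) indicates that at least one group mean differs, after which a multiple comparisons method such as LSD or Tamhane's T2 locates the differing pairs.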
The following t-test results reveal that the mean differences of the scores between the two groups, "2"/"2 + 1" and "0", are statistically significant (P < 0.05) with the exception of TM4 and TM5 (Table 6), based on the scores and the validity of GPCRs in the test sets (Table 3). For TM2, TM3, TM6, and TM7 in particular, the difference between the means of the two groups, "2"/"2 + 1" and "0", is statistically significant (P < 0.01), which means that the probability of the difference being due to chance is less than 0.01. Of the seven TMs, TM2 shows a significant difference between the scores of the two groups with t values from 4.494 to 6.959 (P < 0.001), where the degrees of freedom of the data sets are more than 30 but less than 40 (the critical value of t for the 0.001 level of significance at 30 degrees of freedom is 3.646). On the other hand, the t values show that there are significant differences between the scores of the two groups obtained with the scoring matrix of the training set, except for TM7 (Figure 6). Figure 7 shows the histogram.
The members of a training set come from the same GPCR subfamily. Similarly, the members of one set are human GPCR proteins, whereas those of the other are GPCR proteins from different species.
The comparison reveals that the accuracy of predicting all seven TMs by the former is higher than that of the latter, whereas the hit rate of the former (94.74% and 97.37%) is lower than that of the latter (Table 8). This is the reason that we choose PreMod to predict some potential drug targets.
Note: * The numbers "1" and "2" denote that the helical regions of GPCRs predicted by the scoring matrices of the training sets are "partially consistent with" and "identical with" the actual TM regions of GPCRs, respectively.
Note: "n", "x̄", and "S" denote the sample number, mean, and standard deviation (S.D.) of the scores of GPCRs in the test sets (T23 and T24) obtained with the scoring matrices of training sets 18, 14, and 3.
Note: Here, these numbers refer to all seven TMs.
Table 10 displays the scores and the validity of the coding sequences of each transmembrane segment of 22 GPCRs located on the sense strand of chromosome 19 obtained with the scoring matrices of Group 18 (model 18). Chromosome 19 is composed of four contigs: Contig 1, 2, 3, and 4, containing 3, 0, 11, and 7 GPCRs, respectively. The prediction results for these GPCRs show that there are 19, 8, 9, 5, 6, 6, and 11 positive data for TM1, TM2, TM3, TM4, TM5, TM6, and TM7, respectively. The plot of predicted and actual values of GPCRs in the four contigs of chromosome 19 shows that four TMs (i.e. TM1, TM2, TM6, and TM7) have higher prediction accuracy, while the other three TMs (TM3, TM4, and TM5) give fewer positive results (Figure 8), especially TM1 and TM7 with positive rates of 19/74 and 11/25, respectively. However, the "hit" rate is as high as 20/22 if a hit is counted whenever any one TM fits the actual TMs of the GPCR.
The test of the normality of the scores of the test sets with the Shapiro-Wilk statistic reveals that the data of models 18 and 14 fit the normal distribution; all statistical testing was conducted at significance level 0.10 with all confidence intervals at confidence level 0.90 (Table 9). Here, the W values are between zero and one, and the typical value is 0.10. The significant p-values are greater than the default value of 0.05 at the level of α = 0.10. The batch means pass the Shapiro-Wilk test for multivariate normality. In particular, the mean scores based on model 18 are lower than those based on model 14, while the "2 + 1" model combined with test set T24 is superior to the "2" model combined with test set T23. The lower limit of the 90% confidence interval of the "2 + 1" model combined with test set T24, based on model 18, was taken as the threshold for TM1-7 of GPCRs for the prediction on chromosome 19 (validation set).
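The thresholding idea can be sketched as follows: compute the lower limit of a two-sided 90% confidence interval for the mean score and call a window positive when its score reaches that limit. This is a simplified sketch under stated assumptions: it uses a normal quantile from the standard library in place of the Student t quantile and the covariance-model intervals produced by SPSS, and the function name and scores are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def ci_lower_limit(scores, confidence=0.90):
    """Lower limit of a two-sided confidence interval for the mean (normal approximation)."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores")
    # Two-sided interval: 90% confidence leaves 5% in each tail.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean(scores) - z * stdev(scores) / math.sqrt(n)

# Hypothetical scores of correctly predicted TM windows (illustration only).
toy_scores = [0.30, 0.32, 0.34, 0.36, 0.38]
threshold = ci_lower_limit(toy_scores)

# A candidate window is called positive when its score reaches the threshold.
candidate_score = 0.33
is_positive = candidate_score >= threshold
```

For small samples the t quantile is larger than the normal quantile, so a t-based threshold would sit slightly lower than the one computed here.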
Due to the size of our data sets, pairwise computation of molecular similarities required on the order of a million individual protein segment/segment similarities. Rather than employ the phylogenetic evolution similarity method directly, we employed the scoring matrix approach to infer similarities in these protein sequences.

DISCUSSION
Being the largest family of cell surface receptors, GPCRs play a key role in cellular signaling pathways that regulate many basic physiological processes, such as neurotransmission, secretion, growth, cellular differentiation, and inflammatory and immune responses [46,47]. Protein phosphorylation is an essential type of posttranslational modification that consists of the addition of a phosphate group to serine (S), threonine (T), or tyrosine (Y) [48]. This process is catalyzed by a group of enzymes called kinases, and can be reversed by phosphatases [47]. For GPCRs, phosphorylation is catalyzed by GPCR kinases (GRKs) that recognize the receptors as substrates after agonist binding [49]. This phosphorylation often modifies the cytosolic C-terminal tail and leads to receptor uncoupling from G proteins and arrestin binding, and further results in receptor desensitization and deactivation [50]. Huang et al. have revealed that the exact positions of phosphorylation in a GPCR protein sequence could provide useful clues for drug design and other biotechnology applications [47].
GPCRs are extensively targeted for drug development in humans, especially the biogenic amine-binding GPCRs, which are integral components of the central and peripheral nervous systems of eukaryotes and include receptors that bind the neurotransmitters dopamine, histamine, octopamine, serotonin, tyramine, and acetylcholine [51,52]. Malaria is a devastating infection caused by protozoa of the genus Plasmodium (e.g. P. falciparum). Gamo et al. have reported multiple GPCR-interacting chemistries as promising anti-malarial leads [53]. Analyses using historic assay data revealed that some compounds had activity against drug targets without obvious orthologues in the malarial genome, such as GPCRs, nuclear receptors, ion channels, and transporters. They suggested several novel mechanisms of antimalarial action, such as inhibition of protein kinases and of host-pathogen interaction related targets, which provide new tools to exploit the malarial kinome for drug discovery [53]. More than 100 different GPCRs have been identified in the genomes of multiple insect species, including malaria- and yellow fever-transmitting mosquitoes. Hill et al. used bioinformatics approaches to identify a total of 276 GPCRs from the Anopheles gambiae genome, which are likely to play roles in pathways affecting almost every aspect of the mosquito's life cycle [54]. Meyer et al. used a "genome-to-lead" approach to develop new mode-of-action insecticides for arthropod disease vectors, involving 1) exploitation of an arthropod genome sequence for novel target identification; 2) molecular, biochemical, and pharmacological target validation; 3) chemical library screening; and 4) confirmation of hits and identification of candidate "leads" using secondary in vitro assays and mosquito in vivo assays [52]. They reported the first study to identify Aedes aegypti D1-like dopamine receptor (AaDOP1) antagonists with in vivo toxicity toward mosquitoes.
GPCRs comprise the largest family of validated drug targets, and 30%-50% of approved drugs derive their benefits from selective targeting of GPCRs [55]. Mutations in GPCRs are responsible for over 30 disorders, including cancers, heritable obesity, diabetes insipidus, blindness, endocrine diseases, and diseases involving the melanocortin type 4 receptor and the gonadotropin releasing hormone receptor (GnRHR) [56,57]. Many pathologies associated with misfolded mutant receptors occur because these receptors are retained by the endoplasmic reticulum (ER) and do not reach their normal site of function [57]. Normally, GPCRs are subjected to a stringent quality control system (QCS) in the endoplasmic reticulum [56]. This system consists of both protein chaperones and enzyme-like proteins. The former retain misfolded proteins, while the latter participate in catalysis of the folding process. Moreover, the QCS ensures that only correctly folded proteins enter the pathway leading to the plasma membrane. However, point mutations may result in the production of misfolded, disease-causing proteins that are unable to reach their functional destinations in the cell because they are retained by the QCS, even though they may retain function [56]. On the other hand, pharmacoperone drugs (from "pharmacological chaperone") are small molecules that enter cells and serve as a "molecular scaffold" to promote correct folding of otherwise misfolded mutant proteins so that they route correctly within the cell [58]. Because these drugs are frequently selected from candidates that were originally identified as target-specific antagonists, they also show high target specificity as pharmacoperones, although competition with endogenous ligands is a therapeutic complication. Accordingly, Janovick et al. sought to develop assays that would identify molecules that were not necessarily agonists or antagonists [56]. In principle, the pharmacoperone-rescue approach applies to a diverse array of human diseases that result from protein misfolding, such as cystic fibrosis [59], hypogonadotropic hypogonadism [60], nephrogenic diabetes insipidus [61], retinitis pigmentosa [62], hypercholesterolemia [63], cataracts [64], neurodegenerative diseases (Huntington's [65], Alzheimer's [66], Parkinson's [67]), and particular cancers [68]. Janovick et al. have also explored the molecular mechanism of pharmacoperone rescue of misrouted GPCR mutants using hGnRHR, a useful model for studying pharmacoperones [57]. In particular, there is a naturally occurring and highly conserved salt bridge (E90-K121) in hGnRHR that stabilizes the relation between transmembrane helices 2 and 3, which is required for passage of the receptor through the cellular QCS and to the plasma membrane. This bridge is broken in the naturally occurring hGnRHR mutant E90K, which causes hypogonadotropic hypogonadism because the misfolded mutant receptor fails the cellular QCS and cannot traffic to the plasma membrane [69]. Additionally, pharmacoperone drugs from different chemical classes all interact identically by creating a surrogate bridge for E90-K121. This ligand-mediated bridge plays a key role in the rescue of misrouted GPCR mutants. The method provides the basis for novel primary screens for pharmacoperones, especially to identify structures beyond agonists or antagonists. Non-antagonistic pharmacoperones have a therapeutic advantage since they will not compete with endogenous agonists and may not have to be washed out once rescue has occurred and before activation by endogenous or exogenous agonists [56]. These studies suggest that rational design of these therapeutic agents, e.g. ones that do not compete with endogenous ligands, is likely to assist this therapeutic approach.
GPCRs are a large superfamily of membrane bound signaling proteins that are involved in the regulation of a wide range of physiological functions and constitute the most common target for therapeutic intervention [70].GPCRs are among the most important drug targets for the pharmaceutical industry.Knowledge of the threedimensional structure of a protein is of utmost importance for drug discovery, as it serves as the basis for the identification of novel ligands by means of computational or in silico techniques, such as de novo design and virtual screening.25% of the small molecule drugs approved in 2006 were discovered through structure-based drug discovery (SBDD) [70].Consequently, target identification is a critical step following the discovery of small molecules that elicit a biological phenotype.There are a serial of technologies and approaches applied in new drug targets and biomarker identification, such as proteomics technology, systems biology approach, mi-croRNA technology, and computational methods.Sugahara et al. 
have identified a large number of candidates for the target proteins specific to β1,4-galactosyltransferase-I (β4GalT-I) by comparative analysis of β4-GalT-I-deleted and wild-type mice using the LC/MSbased technique with the isotope-coded glycosylation site-specific tagging (IGOT) of lectin-captured N-glycopeptides [71].Their approach to identify the target proteins in a proteome-scale offers common features and trends in the target proteins, which facilitate understanding of the mechanism that controls assembly of a particular glycan motif on specific proteins.Research on microRNAs (miRNAs) is a promising new research, providing novel insights into the pathogenesis of some diseases, biomarker identification, and treatment.The short (approximately 22 nucleotides), endogenous, widely distributed, single-stranded RNAs target both Mrna degradation and suppression of protein translation based on sequence complementarity between the miRNA and its targeted mRNA [72].During evolution, RNA retroviruses or transgenes invaded the eukaryotic genome and inserted itself in the noncoding regions of DNA, acting as transposon-like jumping genes.MiRNAs are evolutionary conserved in animals and plants, and regulate specific target mRNAs at the post-transcriptional level, which involved in several biological processes, including development, cell differentiation, proliferation and apoptosis [73].MiRNAs may be responsible for regulating the expression of nearly one-third of the genes in the human genome whereas very little is known about their biological functions and functional targets despite the identification of more than 1900 mature human miRNAs.Furthermore, miRNA deregulation often results in an impaired cellular function, and a disturbance of downstream gene regulation and signaling cascades, suggesting their implication in disease etiology.Koskun M et al. 
have identified dysregulated miRNAs in tissue samples of inflammatory bowel disease (IBD) patients, demonstrated similar differences in circulating miRNAs in the serum of IBD patients, and further concluded that miRNAs will aid in the early diagnosis of IBD and in the development of personalized therapies [73]. Additionally, our results represent a generalization of the validation and identification of GPCRs using computational methods. The sequence similarity and protein diversity exhibited intuitive behavior in the clustering when considering the underlying distributions. The computations involving the scoring matrix methods are a substantial test of such an approach, with explicit models built that cover roughly 90% of approved GPCRs in the test sets. Our focus in previous work was methodological, on predicting active sites using the scoring matrix [40], while this approach showed that the scoring matrix methodology quantitatively outperformed molecular modeling methods for predicting target proteins. In the present work, the scoring matrix methods are used to predict potential proteins as well as active sites at the level of the genome or amino acid sequences.
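The scoring matrix idea discussed above can be illustrated with a minimal sketch. It assumes that each matrix entry is the proportion of a residue at an aligned position of a transmembrane region, and that a candidate segment is scored by summing those proportions; the exact formula used in this work may differ. The function names and the toy sequences are hypothetical and serve illustration only.

```python
from collections import Counter


def build_scoring_matrix(aligned_segments):
    """Return one dict per alignment column mapping residue -> proportion.

    Assumption: all segments are pre-aligned to equal length and '-' marks a gap.
    """
    length = len(aligned_segments[0])
    if any(len(seg) != length for seg in aligned_segments):
        raise ValueError("all aligned segments must have the same length")
    matrix = []
    for j in range(length):
        # Collect the residues observed in column j, ignoring gaps.
        column = [seg[j] for seg in aligned_segments if seg[j] != "-"]
        counts = Counter(column)
        total = len(column)
        matrix.append({res: n / total for res, n in counts.items()} if total else {})
    return matrix


def score_window(matrix, window):
    """Sum the column proportions of the residues in a window of matrix width."""
    return sum(col.get(res, 0.0) for col, res in zip(matrix, window))


def best_window(matrix, sequence):
    """Slide the matrix along the sequence; return (start, score) of the best window."""
    width = len(matrix)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the scoring matrix")
    scored = [
        (score_window(matrix, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    # max() keeps the first window when several share the top score.
    top_score, top_start = max(scored, key=lambda pair: pair[0])
    return top_start, top_score


if __name__ == "__main__":
    # Toy aligned segments (made up, not taken from GPCRDB).
    toy_alignment = ["LAVL", "LAIL", "LGVL"]
    toy_matrix = build_scoring_matrix(toy_alignment)
    print(best_window(toy_matrix, "MKLAVLTT"))
```

In such a scheme one matrix would be built per transmembrane region and per training group, and a hit would be declared when the best window score exceeds a chosen threshold; the threshold itself is not specified here.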
In conclusion, the present work seeks to provide an in silico model of known GPCR protein fishing technologies in order to rapidly fish out potential drug targets on the basis of amino acid sequences and seven TMs of GPCRs. Some scoring matrices were trained on 22 groups of GPCRs in the GPCRDB database. These models were employed to predict the GPCR proteins in two groups of test sets. On average, the mean correct rate of each TM of 38 GPCRs from T23 and T24 was found to be 62% and 57.5%, respectively, using training set 18 bigger than those of S 18 and L D L A

) as follows: GPCRs from different human chromosomes ( L DC S ), from the same human chromosomes ( L SC S ) (such as chromosomes 3 and 11), and from different species, based on the phylogenetic trees [8]. The first contains five classes: Class A ( ). Classes B, C, and F/S each contain one group ( . The first is also extracted into one group 18 

Figure 1. The grouping frame of the different training datasets.

Figure 2. The scoring matrices of seven transmembrane regions of GPCRs in Group 1 of the training datasets (TM1-TM7: from top to bottom).

Figure 1 displays the grouping frame of the training datasets (learning dataset, L S), where 22 groups belong to three types. Within the different-chromosome type, there are five classes: Class A (Groups 1-4 and Group 14), Class B (Group 15), Class C (Group 16), Class F (Group 22), and Class O (Groups 5-13 and Group 17). Group 1 contain 39, 27, 22, and 20 GPCRs, respectively. The following test datasets

Matrix) 22 scoring matrices are built based on the 22 groups of training datasets L S from the GPCR superfamily and validated by two groups of test sets T S S 23 and 24 ) by the scoring matrices of training set 18

0.001) except in TM1, TM6, and TM7 of T23 and TM5 of T24 (P < 0.01). However, the probability that the difference between samples in TM4 of D18 is due to sampling error is more than 0.05. Comparison of the t-values in T23 with those in T24 reveals that different scoring matrices have different statistical significance. For the scoring matrix of D18, the mean differences between the scores of the two groups in T24 are larger than those in T23; whereas the mean differences in T23 are larger than those in T24. The reason may be the homology of the samples making up the training sets and test sets. The samples of D18

Statistical graphs reveal that the mean scores of the predicted coding sequences of the seven TMs of GPCRs in the test sets (T23 and T24) by the scoring matrices of A3 except TM2 and TM6, while the predicted scores by A14 are higher than those by D18

Figure 7. The accuracy of the coding sequences of the TMs of GPCRs in the test sets by the scoring matrices of the different training datasets. Left: "2"; Right: "2 + 1". Red and green bars show D18; blue and cyan, A14; and magenta and yellow, A3. Red, blue, and magenta denote T23, while the rest denote T24.

Based on the matrix, we designed a simple algorithm to evaluate the relationship significance of any sequence to the GPCRs' ij this nucleotide denotes the proportion (weighting) it takes in each position, which was calculated as

Table 1. TM1 sequence alignment of GPCRs in Group 1 of Class A by ClustalW.

Table 2. The amino acid sequence length and the sample number of the scoring matrix in the training datasets after sequence alignments.
Note: "Number" means the total sample number of each training dataset; "Re", the amino acid residue length of each transmembrane region; "Sp", the actual sample number of each transmembrane region in each training dataset.

in our implementation, the external validation set is selected to have a high level of diversity; 2) Further partition the 80% identified for model building to form two more sets: training (80%) and test (20%) sets; 3) Select seven TMs of GPCRs as descriptors based on phyloge-

Table 3. The scores and the validity of predicting each transmembrane region of GPCRs in the test sets by the scoring matrices of training sets 18, 14, and 3.

Table 4. The scores of the coding sequences of each transmembrane segment of GPCRs in the test sets by the scoring matrices of the different training datasets.

Table 5. The validity of predicting the coding sequences of each TM segment of GPCRs in the test sets by the scoring matrices of the different training sets.

Table 6. The t-test results based on the scores and the validity of GPCRs in the test sets by training sets 18, 14, and 3.

Table 7. The scores and the validity of the coding sequences of each transmembrane segment of GPCRs in the test sets by the scoring matrices of PreMod.

Table 8. The validity and hit rates of predicting all seven transmembrane regions of GPCRs in the test sets by the scoring matrices of training sets 18, 14, and 3 and PreMod.
Columns: Model; then Accuracy, Validity, Hit, and Miss for each of D18, A14, A3, and PreMod.

Table 9. Normality analysis of the scores of the seven TMs of 38 GPCRs in test set 24 by the scoring matrices of PreMod.

Table 10. The scores and the validity of the coding sequences of each transmembrane segment of 22 GPCRs located in the sense chain of chromosome 19 by the scoring matrices of PreMod.
This is the reason that we chose PreMod to predict some potential drug targets. Twenty-three GPCR proteins in the sense chain of chromosome 19, which constitute the validation set, were predicted and validated by PreMod, with a hit rate of up to 95.65%. Further evaluation is under investigation.
1. G-protein coupled receptors (GPCRs) of the GPCR database constituting the training set/test sets.