Using the improved position specific scoring matrix and ensemble learning method to predict drug-binding residues from protein sequences

Juan Li; Yongqing Zhang; Wenli Qin; Yanzhi Guo; Lezheng Yu; Xuemei Pu; Menglong Li; Jing Sun

doi:10.4236/ns.2012.45043

Natural Science, Vol. 4 No. 5, May 2012

College of Chemistry, Sichuan University, Chengdu, China.
College of Computer Science, Sichuan University, Chengdu, China.

Abstract

Identifying drug-binding residues on the surface of proteins is a vital step in drug discovery and is important for understanding protein function. Most previous studies rely on protein structural information, but the structures of most proteins are not available. In this article, we therefore propose a sequence-based method that combines support vector machine (SVM)-based ensemble learning with an improved position specific scoring matrix (PSSM). To take the local environment of a drug-binding site into account, an improved PSSM profile scaled by a sliding window and a smoothing window was used to improve the prediction result. In addition, a new SVM-based ensemble learning method was developed to deal with the imbalanced data classification problem that commonly arises in binding-site prediction. On a dataset of 985 drug-binding residues, the method achieved an area under the curve (AUC) of 0.9264. Furthermore, on an independent dataset of 349 drug-binding residues used to evaluate the prediction model, the prediction accuracy was 84.68%. These results suggest that our method is effective for predicting drug-binding sites in proteins. The code and all datasets used in this article are freely available at http://cic.scu.edu.cn/bioinformatics/Ensem_DBS.zip.
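The abstract names the two ingredients but not their details, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that smoothing replaces each PSSM row with the sum of the rows in a centered smoothing window, that the sliding window concatenates the smoothed rows around a target residue (zero-padded at the sequence ends), and that the imbalance is handled by pairing all binding residues with equally sized chunks of non-binding residues, one chunk per base classifier, followed by majority voting. The function names (`smooth_pssm`, `window_features`, `balanced_subsets`, `majority_vote`) are hypothetical, and the SVM training step itself is left out.

```python
import random
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def smooth_pssm(pssm: Matrix, smooth_size: int) -> Matrix:
    """Sum each row with its neighbours inside a centered smoothing window.

    Assumption: rows outside the sequence contribute nothing (zero padding).
    """
    half = smooth_size // 2
    n, width = len(pssm), len(pssm[0])
    smoothed = []
    for i in range(n):
        row = [0.0] * width
        for j in range(max(0, i - half), min(n, i + half + 1)):
            for k in range(width):
                row[k] += pssm[j][k]
        smoothed.append(row)
    return smoothed


def window_features(pssm: Matrix, center: int, window_size: int) -> List[float]:
    """Concatenate the rows in a sliding window centered on one residue."""
    half = window_size // 2
    width = len(pssm[0])
    features: List[float] = []
    for j in range(center - half, center + half + 1):
        if 0 <= j < len(pssm):
            features.extend(pssm[j])
        else:
            features.extend([0.0] * width)  # zero padding beyond the termini
    return features


def balanced_subsets(
    positives: Sequence, negatives: Sequence, seed: int = 0
) -> List[Tuple[list, list]]:
    """Pair all positives with chunks of shuffled negatives of similar size.

    Assumption: one base classifier (e.g. an SVM) is trained per pair.
    """
    if not positives:
        raise ValueError("at least one positive sample is required")
    shuffled = list(negatives)
    random.Random(seed).shuffle(shuffled)
    size = len(positives)
    return [
        (list(positives), shuffled[i:i + size])
        for i in range(0, len(shuffled), size)
    ]


def majority_vote(votes: Sequence[int]) -> int:
    """Return 1 (binding) only if more than half of the base classifiers say 1."""
    return 1 if sum(votes) * 2 > len(votes) else 0
```

In this sketch each residue would be encoded with `window_features(smooth_pssm(pssm, s), i, w)`, where the smoothing size `s` and window size `w` are tunable parameters whose values are not given in the abstract.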

Keywords

Drug-Binding Site Prediction; Position Specific Scoring Matrix; Ensemble Learning; Support Vector Machine

Share and Cite:

Li, J., Zhang, Y., Qin, W., Guo, Y., Yu, L., Pu, X., Li, M. and Sun, J. (2012) Using the improved position specific scoring matrix and ensemble learning method to predict drug-binding residues from protein sequences. Natural Science, 4, 304-312. doi: 10.4236/ns.2012.45043.

Conflicts of Interest

The authors declare no conflicts of interest.
