Application of Sparse Bayesian Generalized Linear Model to Gene Expression Data for Classification of Prostate Cancer Subtypes

DOI: 10.4236/ojs.2014.47049


A major limitation of expression profiling is the large number of variables assessed relative to the small sample sizes typically available. In this study, we developed a multinomial probit Bayesian model that uses the double exponential prior to induce shrinkage and reduce the number of covariates in the model [1]. A hierarchical Sparse Bayesian Generalized Linear Model (SBGLM) was developed to facilitate Gibbs sampling that takes into account the progressive nature of the response variable. The method was evaluated using a published dataset (GSE6099) containing 99 prostate cancer cell types in four progressive stages [2]. Initially, 398 genes were selected using ordinal logistic regression with a cutoff of 0.05 after Benjamini and Hochberg FDR correction. The dataset was randomly divided into training (N = 50) and test (N = 49) groups such that each group contained an equal number of each cancer subtype. To obtain more robust results, we performed 50 re-samplings of the training and test groups. Using the top ten genes obtained from SBGLM, we achieved an average classification accuracy of 85% and 80% in the training and test groups, respectively. To functionally evaluate model performance, we used a literature mining approach called Geneset Cohesion Analysis Tool (GCAT) [3]. Examination of the top 100 genes produced an average functional cohesion p-value of 0.007, compared to 0.047 and 0.131 for the classical multi-category logistic regression and Random Forest approaches, respectively. In addition, 96 percent of the SBGLM runs resulted in a GCAT literature cohesion p-value smaller than 0.047. Taken together, these results suggest that the sparse Bayesian multinomial probit model applied to cancer progression data allows for better subclass prediction and produces more functionally relevant gene sets.
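As a rough illustration of the gene pre-filtering and resampling steps described in the abstract, the following Python sketch shows a Benjamini-Hochberg adjustment and a per-subtype random split repeated 50 times. The function names, the toy p-values, and the toy stage labels are assumptions made for illustration only. They are not the authors' code or data, and the SBGLM fitting itself is not shown.

```python
import random
from collections import defaultdict


def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def stratified_split(labels, rng):
    """Split sample indices roughly in half within each class label."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for label in sorted(by_class):
        members = by_class[label][:]
        rng.shuffle(members)
        # Odd-sized classes give their extra sample to the training group.
        cut = (len(members) + 1) // 2
        train.extend(members[:cut])
        test.extend(members[cut:])
    return sorted(train), sorted(test)


# Toy inputs (hypothetical, not taken from GSE6099).
toy_pvalues = [0.01, 0.04, 0.03, 0.005]
selected = [i for i, q in enumerate(bh_adjust(toy_pvalues)) if q < 0.05]

toy_labels = ["stage1"] * 6 + ["stage2"] * 5 + ["stage3"] * 4 + ["stage4"] * 4
rng = random.Random(0)  # fixed seed so the resamplings are reproducible
splits = [stratified_split(toy_labels, rng) for _ in range(50)]
```

Splitting within each label keeps the subtype proportions nearly identical in the training and test groups, which is one plausible way to realize the balanced split the abstract describes.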

Share and Cite:

Madahian, B. , Deng, L. and Homayouni, R. (2014) Application of Sparse Bayesian Generalized Linear Model to Gene Expression Data for Classification of Prostate Cancer Subtypes. Open Journal of Statistics, 4, 518-526. doi: 10.4236/ojs.2014.47049.

Conflicts of Interest

The authors declare no conflicts of interest.


References

[1] Park, T. and Casella, G. (2008) The Bayesian Lasso. Journal of the American Statistical Association, 103, 681-686.
[2] Tomlins, S.A., Mehra, R., et al. (2007) Integrative Molecular Concept Modeling of Prostate Cancer Progression. Nature Genetics, 39, 41-51.
[3] Xu, L., Furlotte, N., Lin, Y., Heinrich, K., Berry, M.W., George, E.O. and Homayouni, R. (2011) Functional Cohesion of Gene Sets Determined by Latent Semantic Indexing of PubMed Abstracts. PLoS ONE, 6, Article ID: e18851.
[4] Cao, J. and Zhang, S. (2010) Measuring Statistical Significance for Full Bayesian Methods in Microarray Analyses. Bayesian Analysis, 5, 413-427.
[5] Devore, J. and Peck, R. (1997) Statistics: The Exploration and Analysis of Data. Duxbury Press, Pacific Grove.
[6] Thomas, J.G., Olson, J.M., Tapscott, S.J. and Zhao, L.P. (2001) An Efficient and Robust Statistical Modeling Approach to Discover Differentially Expressed Genes Using Genomic Expression Profiles. Genome Research, 11, 1227-1236.
[7] Pan, W. (2002) A Comparative Review of Statistical Methods for Discovering Differentially Expressed Genes in Replicated Microarray Experiments. Bioinformatics, 18, 546-554.
[8] Dudoit, S., Fridlyand, J. and Speed, T.P. (2002) Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data. Journal of the American Statistical Association, 97, 77-87.
[9] Troyanskaya, O.G., Garber, M.E., Brown, P., Botstein, D. and Altman, R.B. (2002) Nonparametric Methods for Identifying Differentially Expressed Genes in Microarray Data. Bioinformatics, 18, 1454-1461.
[10] Bae, K. and Mallick, B.K. (2004) Gene Selection Using a Two-Level Hierarchical Bayesian Model. Bioinformatics, 20, 3423-3430.
[11] Logsdon, B.A., Hoffman, G.E. and Mezey, J.G. (2010) A Variational Bayes Algorithm for Fast and Accurate Multiple Locus Genome-Wide Association Analysis. BMC Bioinformatics, 11, 58.
[12] Wu, T.T., Chen, Y.F., Hastie, T., Sobel, E. and Lange, K. (2009) Genome-Wide Association Analysis by Lasso Penalized Logistic Regression. Bioinformatics, 25, 714-721.
[13] Yang, J., Benyamin, B., McEvoy, B.P., Gordon, S., Henders, A.K., Nyholt, D.R., Madden, P.A., Heath, A.C., Martin, N.G., Montgomery, G.W., Goddard, M.E. and Visscher, P.M. (2010) Common SNPs Explain a Large Proportion of the Heritability for Human Height. Nature Genetics, 42, 565-569.
[14] Tibshirani, R. (1996) Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B, 58, 267-288.
[15] Li, J.H., Das, K., Fu, G.F., Li, R.Z. and Wu, R.L. (2011) The Bayesian Lasso for Genome-Wide Association Studies. Bioinformatics, 27, 516-523.
[16] Zou, H. (2006) The Adaptive Lasso and Its Oracle Properties. Journal of the American Statistical Association, 101, 1418-1429.
[17] Nott, D.J. and Leng, C. (2010) Bayesian Projection Approaches to Variable Selection in Generalized Linear Models. Computational Statistics & Data Analysis, 54, 3227-3241.
[18] Yi, N. and Xu, S. (2008) Bayesian LASSO for Quantitative Loci Mapping. Genetics, 179, 1045-1055.
[19] Fan, J. and Li, R. (2001) Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American Statistical Association, 96, 1348-1360.
[20] Ye, J.P., Li, T., Xiong, T. and Janardan, R. (2004) Using Uncorrelated Discriminant Analysis for Tissue Classification with Gene Expression Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 1, 181-190.
[21] Calvo, A., Xiao, N., Kang, J., Best, C.J., Leiva, I., Emmert-Buck, M.R., Jorcyk, C. and Green, J.E. (2002) Alterations in Gene Expression Profiles during Prostate Cancer Progression: Functional Correlations to Tumorigenicity and Down-Regulation of Selenoprotein-P in Mouse and Human Tumors. Cancer Research, 62, 5325-5335.
[22] Dalgin, G.S., Alexe, G., Scanfeld, D., Tamayo, P., Mesirov, J.P., Ganesan, S., DeLisi, C. and Bhanot, G. (2007) Portraits of Breast Cancer Progression. BMC Bioinformatics, 8, 291.
[23] Pyon, Y.S. and Li, J. (2009) Identifying Gene Signatures from Cancer Progression Data Using Ordinal Analysis. BIBM '09. IEEE International Conference on Bioinformatics and Biomedicine, Washington DC, 1-4 November 2009, 136-141.
[24] Hans, C. (2009) Bayesian Lasso Regression. Biometrika, 96, 835-845.
[25] Nelder, J. and Wedderburn, R. (1972) Generalized Linear Models. Journal of the Royal Statistical Society. Series A, 135, 370-384.
[26] McCullagh, P. and Nelder, J. (1989) Generalized Linear Models. Chapman and Hall, London.
[27] Madsen, H. and Thyregod, P. (2011) Introduction to General and Generalized Linear Models. Chapman & Hall/CRC, London.
[28] Pike, M.C., Hill, A.P. and Smith, P.G. (1980) Bias and Efficiency in Logistic Analysis of Stratified Case-Control Studies. International Journal of Epidemiology, 9, 89-95.
[29] Knight, K. and Fu, W. (2000) Asymptotics for Lasso-Type Estimators. The Annals of Statistics, 28, 1356-1378.
[30] Xu, H., Caramanis, C. and Mannor, S. (2010) Robust Regression and Lasso. IEEE Transactions on Information Theory, 56, 3561-3574.
[31] Gilks, W., Richardson, S. and Spiegelhalter, D. (1996) Markov Chain Monte Carlo in Practice. Chapman and Hall, London.
[32] Gelfand, A. and Smith, A.F.M. (1990) Sampling-Based Approaches to Calculating Marginal Densities. Journal of the American Statistical Association, 85, 398-409.
[33] Albert, J. and Chib, S. (1993) Bayesian Analysis of Binary and Polychotomous Response Data. Journal of the American Statistical Association, 88, 669-679.
[34] Benjamini, Y. and Hochberg, Y. (1995) Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B, 57, 289-300.
[35] Karatzoglou, A., Smola, A., Hornik, K. and Zeileis, A. (2004) kernlab—An S4 Package for Kernel Methods in R. Journal of Statistical Software, 11, 1-20.
[36] Liaw, A. and Wiener, M. (2002) Classification and Regression by Random Forest. R News, 2, 18-22.
[37] Breiman, L. (2001) Random Forests. Machine Learning, 45, 5-32.
[38] Boulesteix, A.L., Janitza, S., Kruppa, J. and König, I.R. (2012) Overview of Random Forest Methodology and Practical Guidance with Emphasis on Computational Biology and Bioinformatics. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2, 493-507.
[39] Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L., Gillette, M.A., Paulovich, A., Pomeroy, S.L., Golub, T.R., Lander, E.S. and Mesirov, J.P. (2005) Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles. Proceedings of the National Academy of Sciences of the United States of America, 102, 15545-15550.


Copyright © 2020 by authors and Scientific Research Publishing Inc.


This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.