Visualizing Random Forest’s Prediction Results

Abstract

This paper proposes a new visualization tool for checking the quality of random forest predictions by plotting the proximity matrix as a weighted network. The technique is compared with the traditional multidimensional scaling plot. The paper also introduces a new accuracy index, the proportion of misplaced cases, and compares it with total accuracy, sensitivity, and specificity. In addition, it applies clustering coefficients for weighted graphs to assess how well the random forest algorithm separates two classes. Two datasets were analyzed, one from medical research (breast cancer) and one from psychology research (medical students’ academic achievement), with varying sample sizes and predictive accuracy. Varying the number of observations and the attainable prediction accuracy made it possible to compare how each visualization technique behaves in each situation. The results indicated that the random forest’s predictive performance was easier and more intuitive to interpret using the weighted network of the proximity matrix than using the multidimensional scaling plot. The proportion of misplaced cases was highly related to total accuracy, sensitivity, and specificity. This strategy, together with the computation of Zhang and Horvath’s (2005) clustering coefficient for weighted graphs, can be very helpful in understanding how well a random forest performs in terms of classification.
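As a rough illustration of the clustering coefficient mentioned in the abstract, the sketch below computes a per-node coefficient for a weighted graph using the formula commonly attributed to Zhang and Horvath (2005): for node i, the sum of a_ij * a_jk * a_ki over all j and k, divided by (sum_j a_ij)^2 - sum_j a_ij^2, with weights in [0, 1]. The function name and the toy proximity matrix are assumptions made for this example and are not taken from the paper; in practice the matrix would be the proximity matrix produced by a fitted random forest.

```python
import numpy as np


def zhang_horvath_clustering(weights):
    """Per-node clustering coefficient for a weighted, undirected graph.

    `weights` is a symmetric matrix with entries in [0, 1]; the diagonal
    (self-proximity) is ignored. The function name is hypothetical.
    """
    a = np.array(weights, dtype=float)  # copy, so the caller's data is untouched
    np.fill_diagonal(a, 0.0)            # drop self-loops

    # Numerator: sum over j, k of a_ij * a_jk * a_ki, i.e. the diagonal of A^3.
    numerator = np.diag(a @ a @ a)

    # Denominator: (sum_j a_ij)^2 - sum_j a_ij^2.
    denominator = a.sum(axis=1) ** 2 - (a ** 2).sum(axis=1)

    # Nodes with fewer than two weighted neighbours get a coefficient of 0.
    result = np.zeros_like(numerator)
    np.divide(numerator, denominator, out=result, where=denominator > 0)
    return result


if __name__ == "__main__":
    # Toy proximity matrix (made up for illustration): cases 0-1 and cases 2-3
    # are close to each other, and the two pairs are far apart.
    toy_proximity = [
        [1.0, 0.9, 0.1, 0.1],
        [0.9, 1.0, 0.1, 0.1],
        [0.1, 0.1, 1.0, 0.8],
        [0.1, 0.1, 0.8, 1.0],
    ]
    print(zhang_horvath_clustering(toy_proximity))
```

Computing the numerator as the diagonal of the matrix cube keeps the sketch short, but it costs cubic time in the number of cases, so it is only suited to small or moderate sample sizes.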

Share and Cite:

Golino, H. & Gomes, C. (2014). Visualizing Random Forest’s Prediction Results. Psychology, 5, 2084-2098. doi: 10.4236/psych.2014.519211.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] Antoniou, I. E., & Tsompa, E. T. (2008). Statistical Analysis of Weighted Networks. Discrete Dynamics in Nature and Society, 2008, 16.
http://dx.doi.org/10.1155/2008/375452
[2] Barrat, A., Barthelemy, M., Pastor-Satorras, R., & Vespignani, A. (2004). The Architecture of Complex Weighted Networks. Proceedings of the National Academy of Sciences of the United States of America, 101, 3747-3752.
http://dx.doi.org/10.1073/pnas.0400087101
[3] Bennett, K. P., & Mangasarian, O. L. (1992). Robust Linear Programming Discrimination of Two Linearly Inseparable Sets. Optimization Methods and Software, 1, 23-34.
http://dx.doi.org/10.1080/10556789208805504
[4] Blanch, A., & Aluja, A. (2013). A Regression Tree of the Aptitudes, Personality, and Academic Performance Relationship. Personality and Individual Differences, 54, 703-708.
http://dx.doi.org/10.1016/j.paid.2012.11.032
[5] Borg, I., & Groenen, P. (2005). Modern Multidimensional Scaling: Theory and Applications (2nd ed.). New York: Springer-Verlag.
[6] Borkin, M. A., Vo, A. A., Bylinskii, Z., Isola, P., Sunkavalli, S., Oliva, A., & Pfister, H. (2013). What Makes a Visualization Memorable? IEEE Transactions on Visualization and Computer Graphics, 19, 2306-2315.
http://dx.doi.org/10.1109/TVCG.2013.234
[7] Breiman, L. (2001). Random Forests. Machine Learning, 45, 5-32.
http://dx.doi.org/10.1023/A:1010933404324
[8] Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and Regression Trees. New York: Chapman & Hall.
[9] Cortez, P., & Silva, A. M. G. (2008). Using Data Mining to Predict Secondary School Student Performance. In A. Brito, & J. Teixeira (Eds.), Proceedings of 5th Annual Future Business Technology Conference, Porto, 5-12.
[10] Eloyan, A., Muschelli, J., Nebel, M., Liu, H., Han, F., Zhao, T., Caffo, B. et al. (2012). Automated Diagnoses of Attention Deficit Hyperactive Disorder Using Magnetic Resonance Imaging. Frontiers in Systems Neuroscience, 6.
http://dx.doi.org/10.3389/fnsys.2012.00061
[11] Epskamp, S., Cramer, A. O. J., Waldorp, L. J., Schmittmann, V. D., & Borsboom, D. (2012). Qgraph: Network Visualizations of Relationships in Psychometric Data. Journal of Statistical Software, 48, 1-18.
http://www.jstatsoft.org/v48/i04/
[12] Fruchterman, T. M. J., & Reingold, E. M. (1991). Graph Drawing by Force-Directed Placement. Software: Practice and Experience, 21, 1129-1164.
http://dx.doi.org/10.1002/spe.4380211102
[13] Geurts, P., Irrthum, A., & Wehenkel, L. (2009). Supervised Learning with Decision Tree-Based Methods in Computational and Systems Biology. Molecular Biosystems, 5, 1593-1605.
http://dx.doi.org/10.1039/b907946g
[14] Golino, H. F., & Gomes, C. M. A. (2014). Four Machine Learning Methods to Predict Academic Achievement of College Students: A Comparison Study. Revista E-Psi, 4, 68-101.
[15] Hardman, J., Paucar-Caceres, A., & Fielding, A. (2013). Predicting Students’ Progression in Higher Education by Using the Random Forest Algorithm. Systems Research and Behavioral Science, 30, 194-203.
http://dx.doi.org/10.1002/sres.2130
[16] Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference and Prediction (2nd ed.). New York: Springer.
http://dx.doi.org/10.1007/978-0-387-84858-7
[17] Honarkhah, M., & Caers, J. (2010). Stochastic Simulation of Patterns Using Distance-Based Pattern Modeling. Mathematical Geosciences, 42, 487-517.
http://dx.doi.org/10.1007/s11004-010-9276-7
[18] James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R. New York: Springer.
http://dx.doi.org/10.1007/978-1-4614-7138-7
[19] Kalna, G., & Higham, D. J. (2007). A Clustering Coefficient for Weighted Networks, with Application to Gene Expression Data. Journal of AI Communications-Network Analysis in Natural Sciences and Engineering, 20, 263-271.
[20] Kuhn, M., & Johnson, K. (2013). Applied Predictive Modeling. New York: Springer.
http://dx.doi.org/10.1007/978-1-4614-6849-3
[21] Lemon, J. (2006). Plotrix: A Package in the Red Light District of R. R-News, 6, 8-12.
[22] Liaw, A., & Wiener, M. (2012). randomForest: Breiman and Cutler’s Random Forests for Classification and Regression. R Package Version 4.6-7.
[23] Mangasarian, O. L., & Wolberg, W. H. (1990). Cancer Diagnosis via Linear Programming. SIAM News, 23, 1-18.
[24] Mangasarian, O. L., Setiono, R., & Wolberg, W. H. (1990). Pattern Recognition via Linear Programming: Theory and Application to Medical Diagnosis. In T. F. Coleman, & Y. Y. Li (Eds.), Large-Scale Numerical Optimization (pp. 22-30). Philadelphia, PA: SIAM Publications.
[25] Onnela, J. P., Saramaki, J., Kertesz, J., & Kaski, K. (2005). Intensity and Coherence of Motifs in Weighted Complex Networks. Physical Review E, 71, Article ID: 065103.
http://dx.doi.org/10.1103/PhysRevE.71.065103
[26] Quach, A. T. (2012). Interactive Random Forests Plots. All Graduate Plan B and Other Reports, Paper 134, Utah State University.
[27] R Development Core Team (2011). R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing.
http://www.R-project.org
[28] Seni, G., & Elder, J. F. (2010). Ensemble Methods in Data Mining: Improving Accuracy through Combining Predictions. Morgan & Claypool Publishers.
http://dx.doi.org/10.2200/S00240ED1V01Y200912DMK002
[29] Skogli, E., Teicher, M. H., Andersen, P., Hovik, K., & Øie, M. (2013). ADHD in Girls and Boys—Gender Differences in Co-Existing Symptoms and Executive Function Measures. BMC Psychiatry, 13, 298.
http://dx.doi.org/10.1186/1471-244X-13-298
[30] Steincke, K. K. (1948). Farvel og Tak: Ogsaa en Tilvaerelse IV. København: Fremad.
[31] Venables, W. N., & Ripley, B. D. (2002). Modern Applied Statistics with S (4th ed.). New York: Springer.
http://dx.doi.org/10.1007/978-0-387-21706-2
[32] Wickham, H., Caragea, D., & Cook, D. (2006). Exploring High-Dimensional Classification Boundaries. Proceedings of the 38th Symposium on the Interface of Statistics, Computing Science, and Applications—Interface 2006: Massive Data Sets and Streams, Pasadena, May 24-27 2006.
[33] Wolberg, W. H., & Mangasarian, O. L. (1990). Multisurface Method of Pattern Separation for Medical Diagnosis Applied to Breast Cytology. Proceedings of the National Academy of Sciences of the United States of America, 87, 9193-9196.
[34] Zhang, B., & Horvath, S. (2005). A General Framework for Weighted Gene Co-Expression Network Analysis. Statistical Applications in Genetics and Molecular Biology, 4.
http://dx.doi.org/10.2202/1544-6115.1128

Copyright © 2023 by authors and Scientific Research Publishing Inc.

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.