Explanation vs Performance in Data Mining: A Case Study with Predicting Runaway Projects

Tim MENZIES; Osamu MIZUNO; Yasunari TAKAGI; Tohru KIKUNO

doi:10.4236/jsea.2009.24030

Journal of Software Engineering and Applications, Vol. 2 No. 4, November 2009

Abstract

Often, the explanatory power of a learned model must be traded off against model performance. In the case of predicting runaway software projects, we show that the twin goals of high performance and good explanatory power are achievable after applying a variety of data mining techniques (discretization, feature subset selection, rule covering algorithms). This result is a new high water mark in predicting runaway projects. Measured in terms of precision, this new model is as good as can be expected for our data. Other methods might outperform our result (e.g., by generating a smaller, more explainable model), but no other method could outperform the precision of our learned model.

Keywords

Explanation, Data Mining, Runaway

Share and Cite:

T. MENZIES, O. MIZUNO, Y. TAKAGI and T. KIKUNO, "Explanation vs Performance in Data Mining: A Case Study with Predicting Runaway Projects," Journal of Software Engineering and Applications, Vol. 2 No. 4, 2009, pp. 221-236. doi: 10.4236/jsea.2009.24030.

Conflicts of Interest

The authors declare no conflicts of interest.
