Ensemble-based active learning for class imbalance problem

DOI: 10.4236/jbise.2010.310133


In medical diagnosis, the class imbalance problem is common: although unlabeled data are abundant, obtaining labeled data is difficult and expensive. In this paper, an ensemble-based active learning algorithm is proposed to address the class imbalance problem. Artificial data are generated according to the distribution of the training set to make the ensemble diverse, and random subspace re-sampling is used to reduce the data dimensionality. When selecting member classifiers based on estimated misclassification cost, the minority class is assigned higher misclassification-cost weights, while each test sample carries a variable penalty factor that induces the ensemble to correct its current errors. In experiments on UCI disease datasets, F-value and G-means are used as evaluation criteria instead of classification accuracy. Compared with other ensemble methods, our method achieves the best performance and requires fewer labeled samples.
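As a minimal sketch of the evaluation criteria the abstract names (not the authors' code; function names and the choice of the minority class as the positive label are illustrative assumptions), F-value combines precision and recall on the minority class, while G-means is the geometric mean of minority-class and majority-class recall:

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP/FP/FN/TN, treating `positive` as the minority class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn

def f_value(y_true, y_pred, positive=1, beta=1.0):
    """F-measure: (1+b^2)*P*R / (b^2*P + R); beta=1 gives the usual F1."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def g_means(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity (minority recall) and specificity."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return (sensitivity * specificity) ** 0.5
```

Unlike plain accuracy, both measures stay low when a classifier ignores the minority class, which is why they are preferred on imbalanced medical datasets.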

Share and Cite:

Yang, Y. and Ma, G. (2010) Ensemble-based active learning for class imbalance problem. Journal of Biomedical Science and Engineering, 3, 1022-1029. doi: 10.4236/jbise.2010.310133.

Conflicts of Interest

The authors declare no conflicts of interest.




Copyright © 2020 by authors and Scientific Research Publishing Inc.


This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.