Improving Knowledge Based Spam Detection Methods: The Effect of Malicious Related Features in Imbalance Data Distribution


Spam is no longer just unsolicited commercial email that wastes our time; it also consumes network traffic and mail server storage. Furthermore, spam has become a major component of several attack vectors, including phishing, cross-site scripting, cross-site request forgery and malware infection. Statistics show that the amount of spam carrying malicious content has increased relative to spam advertising legitimate products and services. In this paper, we investigate spam detection with the aim of developing an efficient method for identifying spam email based on analysis of message content. We identify a set of features, a considerable number of which are related to malicious content. Our goal is to study how much these features help classical classifiers identify spam emails. To make the problem more challenging, we developed spam classification models on imbalanced data in which spam emails form the rare class, with only 16.5% of the total emails. Different metrics were used to evaluate the developed models. Results show a noticeable improvement in spam classification models when they are trained on a dataset that includes malicious-related features.
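
The 16.5% spam share is why a single metric such as accuracy is not enough here. The following is a minimal sketch, not the authors' evaluation code: the corpus size of 1000 emails, the always-legitimate baseline, and the particular metrics (precision, recall, F-measure, G-mean) are assumptions chosen for illustration, since the abstract does not name the metrics used.

```python
# Illustrative only: a hypothetical corpus with the 16.5% spam share
# mentioned in the abstract, scored against a trivial baseline.

def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for the positive (spam) class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def metrics(y_true, y_pred):
    """Return accuracy plus metrics that are sensitive to the rare class."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        # Geometric mean of spam recall and legitimate-mail recall.
        "g_mean": (recall * specificity) ** 0.5,
    }


# Assumed corpus: 1000 emails, 165 spam (label 1) and 835 legitimate (label 0).
y_true = [1] * 165 + [0] * 835

# Baseline that labels every email as legitimate.
always_legitimate = [0] * len(y_true)

baseline = metrics(y_true, always_legitimate)
print(baseline)
```

Under these assumptions the baseline scores 0.835 accuracy while detecting no spam at all, so its recall, F-measure and G-mean are all zero. Metrics that account for the rare class expose the failure that accuracy hides.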

Share and Cite:

Alqatawna, J., Faris, H., Jaradat, K., Al-Zewairi, M. and Adwan, O. (2015) Improving Knowledge Based Spam Detection Methods: The Effect of Malicious Related Features in Imbalance Data Distribution. International Journal of Communications, Network and System Sciences, 8, 118-129. doi: 10.4236/ijcns.2015.85014.

Conflicts of Interest

The authors declare no conflicts of interest.



Copyright © 2023 by authors and Scientific Research Publishing Inc.


This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.