Author Gender Prediction in an Email Stream Using Neural Networks

Abstract

With the rapid growth of the Internet in recent years, the ability to analyze and identify its users has become increasingly important. Authorship analysis provides a means of gleaning information about the author of a document, whether it originates from the Internet or elsewhere, including but not limited to the author’s gender. There are well-known linguistic differences between the writing of men and women, and these differences can be used to predict the gender of a document’s author. Capitalizing on these linguistic nuances, this study uses a set of stylometric features and a set of word count features to perform automatic gender discrimination on emails from the popular Enron email dataset. These features are used in conjunction with the Modified Balanced Winnow Neural Network proposed by Carvalho and Cohen, an improvement on the original Balanced Winnow created by Littlestone. Experiments show that the Modified Balanced Winnow can discriminate gender effectively using either stylometric or word count features, with the word count features providing superior results.
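The abstract names the learner but does not spell out its update rule. The following is a minimal sketch of a Balanced Winnow style online classifier with a margin, written as one plausible reading of the modified variant rather than as the authors' implementation. The class name MarginBalancedWinnow, the featurize helper, the parameter values (alpha, beta, theta_pos, theta_neg, threshold, margin), and the toy tokens are all assumptions made for illustration; they are not taken from this paper. The word count features are reduced here to normalized token counts plus a bias term.

```python
from collections import Counter, defaultdict


def featurize(text):
    """Hypothetical word count features: token counts plus a bias term,
    scaled so the values sum to 1 (keeps every value strictly below 1)."""
    counts = Counter(text.lower().split())
    counts["__bias__"] = 1
    total = sum(counts.values())
    return {token: n / total for token, n in counts.items()}


class MarginBalancedWinnow:
    """Sketch of a Balanced Winnow variant with a margin (assumed details).

    Two positive weight vectors are kept: u (evidence for class +1) and
    v (evidence for class -1). Updates are multiplicative and happen only
    when an example is misclassified or falls inside the margin.
    """

    def __init__(self, alpha=1.5, beta=0.5, theta_pos=2.0, theta_neg=1.0,
                 threshold=1.0, margin=1.0):
        self.alpha, self.beta = alpha, beta          # promotion / demotion rates
        self.threshold, self.margin = threshold, margin
        self.u = defaultdict(lambda: theta_pos)      # unseen features start here
        self.v = defaultdict(lambda: theta_neg)

    def score(self, x):
        return sum(val * (self.u[f] - self.v[f]) for f, val in x.items()) - self.threshold

    def predict(self, x):
        return 1 if self.score(x) > 0 else -1

    def update(self, x, y):
        """Process one labelled example (y is +1 or -1); return True if weights changed."""
        if y * self.score(x) > self.margin:
            return False                             # confident and correct: no change
        for f, val in x.items():
            if val <= 0:
                continue
            promote = self.alpha * (1 + val)
            demote = self.beta * (1 - val)
            if y > 0:
                self.u[f] *= promote
                self.v[f] *= demote
            else:
                self.u[f] *= demote
                self.v[f] *= promote
        return True


# Toy stream with made-up tokens; +1 and -1 stand for the two author classes.
stream = [("foo foo", 1), ("bar bar", -1)]
model = MarginBalancedWinnow()
for _ in range(2):
    for text, label in stream:
        model.update(featurize(text), label)

print(model.predict(featurize("foo baz")), model.predict(featurize("bar baz")))
```

Because each example is seen once and only triggers an update when it is wrong or too close to the decision boundary, a learner of this shape suits an email stream, where messages arrive continuously and cannot all be stored for batch training.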

Share and Cite:

W. Deitrick, Z. Miller, B. Valyou, B. Dickinson, T. Munson and W. Hu, "Author Gender Prediction in an Email Stream Using Neural Networks," Journal of Intelligent Learning Systems and Applications, Vol. 4 No. 3, 2012, pp. 169-175. doi: 10.4236/jilsa.2012.43017.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] N. Cheng, R. Chandramouli and K. P. Subbalakshmi, “Author Gender Identification from Text,” Digital Investigation, Vol. 8, No. 1, 2011, pp. 78-88. doi:10.1016/j.diin.2011.04.002
[2] H. Touré, “Brief Remarks to Media,” Commission on Information and Accountability for Women’s and Children’s Health, 2011.
[3] R. Zheng, Y. Qin, Z. Huang and H. Chen, “Authorship Analysis in Cybercrime Investigation,” Proceedings from the 1st NSF/NIJ Symposium, 2003, pp. 59-73.
[4] R. Zheng, J. Li, H. Chen and Z. Huang, “A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques,” Journal of the American Society for Information Science and Technology, Vol. 57, No. 3, 2006, pp. 378-393. doi:10.1002/asi.20316
[5] J. Burrows, “Word Patterns and Story Shapes: The Statistical Analysis of Narrative Style,” Literary and Linguistic Computing, Vol. 2, No. 2, 1987, pp. 61-67. doi:10.1093/llc/2.2.61
[6] A. F. Damerau and A. F. S. Weiss, “Text Mining with Decision Trees and Decision Rules,” Conference on Automated Learning and Discovery, Carnegie-Mellon University, Pittsburgh, 1998.
[7] F. R. Bilous and R. M. Krauss, “Dominance and Accommodation in the Conversation Behaviours of Same- and Mixed-Gender Dyads,” Language and Communication, Vol. 8, No. 3-4, 1988, pp. 183-194. doi:10.1016/0271-5309(88)90016-X
[8] S. Nowson and J. Oberlander, “The Identity of Bloggers: Openness and Gender in Personal Weblogs,” Proceedings of the AAAI Spring Symposia on Computational Approaches to Analyzing Weblogs, California, 2006.
[9] J. D. Burger, J. Henderson, G. Kim and G. Zarella, “Discriminating Gender on Twitter,” Proceedings of the Conference on Empirical Methods in Natural Language Processing, Edinburgh, 27-31 July 2011, pp. 1301-1309. http://www.mitre.org/work/tech_papers/2011/11_0170/11_0170.pdf
[10] M. W. Corney, “Analyzing E-Mail Text Authorship for Forensic Purposes,” Master’s Thesis, Queensland University of Technology, Queensland, 2003.
[11] O. Y. de Vel, M. W. Corney, A. M. Anderson and G. M. Mohay, “Language and Gender Author Cohort Analysis of E-Mail for Computer Forensics,” Proceedings Digital Forensics Research Workshop, Syracuse, 6-8 August 2002.
[12] R. Thomson and T. Murachver, “Predicting Gender from Electronic Discourse,” The British Journal of Social Psychology, Vol. 40, No. 2, 2001, pp. 193-208. doi:10.1348/014466601164812
[13] W. Fan, H. Wang and P. S. Yu, “Active Mining of Data Streams,” Proceedings of the Fourth SIAM International Conference on Data Mining, Lake Buena Vista, 22-24 April 2004, pp. 457-461.
[14] P. Domingos and G. Hulten, “Mining High Speed Data Streams,” University of Washington, Seattle, 2000.
[15] L. Kaelbling, “Enron Email Dataset,” CALO Project. http://www.cs.cmu.edu/~enron/, 2011
[16] J. Shetty and J. Adibi, “Enron Email Dataset Database Schema and Brief Statistical Report,” Information Sciences Institute Technical Report, University of Southern California, 2004.
[17] V. R. Carvalho and W. W. Cohen, “Single-Pass Online Learning: Performance, Voting Schemes and Online Feature Selection,” Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, 20-23 August 2006, pp. 548-553.
[18] I. Dagan, Y. Karov and D. Roth, “Mistake-Driven Learning in Text Categorization,” Conference on Empirical Methods on Natural Language Processing, Providence, 1-2 August 1997, pp. 55-63.
[19] N. Littlestone, “Learning Quickly When Irrelevant Attributes Abound: A New Linear-Threshold Algorithm,” Machine Learning, Vol. 2, No. 4, 1988, pp. 285-318. doi:10.1007/BF00116827

Copyright © 2024 by authors and Scientific Research Publishing Inc.

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.