Gender Identification on Twitter Using the Modified Balanced Winnow

Abstract

With the rapid growth of web-based social networking technologies in recent years, author identification and analysis have proven increasingly useful. Authorship analysis provides information about a document’s author, often including the author’s gender. Men and women are known to write in distinctly different ways, and these differences can be used to predict gender. Building on these distinctions, this study uses a simple stream-based neural network, the Modified Balanced Winnow, to automatically discriminate gender on manually labeled tweets from the Twitter social network. The network was employed in two ways. First, the effectiveness of data stream mining was examined with an extensive list of n-gram features. Second, feature selection techniques were evaluated by drastically reducing the feature list using WEKA’s attribute selection algorithms. The stream mining approach achieved an accuracy of 82.48%, 20.81 percentage points above the baseline prediction. Feature selection improved the results by an additional 16.03 percentage points, to an accuracy of 98.51%.
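The abstract does not spell out the learner's update rule, so the following is only a minimal sketch of a margin-based Balanced Winnow in the spirit of the formulation by Carvalho and Cohen, not the authors' implementation. The class name, the `char_ngrams` helper, the hyperparameter defaults (`alpha`, `beta`, `theta_pos`, `theta_neg`, `margin`), and the use of +1/-1 as class labels are all illustrative assumptions. The threshold (bias) term is omitted for brevity.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Count character n-grams in a string (hypothetical feature extractor)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


class ModifiedBalancedWinnow:
    """Sketch of a mistake-driven, margin-based Balanced Winnow learner."""

    def __init__(self, alpha=1.5, beta=0.5, theta_pos=2.0, theta_neg=1.0, margin=1.0):
        # All default values are assumptions chosen for illustration only.
        self.alpha = alpha    # promotion factor, > 1
        self.beta = beta      # demotion factor, between 0 and 1
        self.margin = margin  # update whenever label * score <= margin
        self.u = defaultdict(lambda: theta_pos)  # "positive" weight per feature
        self.v = defaultdict(lambda: theta_neg)  # "negative" weight per feature

    @staticmethod
    def _normalize(features):
        # L2-normalize non-negative counts so each value lies in [0, 1].
        norm = math.sqrt(sum(v * v for v in features.values()))
        if norm == 0:
            return {}
        return {f: v / norm for f, v in features.items()}

    def score(self, features):
        x = self._normalize(features)
        return sum(val * (self.u[f] - self.v[f]) for f, val in x.items())

    def predict(self, features):
        return 1 if self.score(features) > 0 else -1

    def learn(self, features, label):
        """Process one streamed example (label is +1 or -1).

        Returns True if the weights were updated, False otherwise.
        """
        x = self._normalize(features)
        s = sum(val * (self.u[f] - self.v[f]) for f, val in x.items())
        if label * s > self.margin:
            return False  # classified correctly with enough margin
        for f, val in x.items():
            if label > 0:
                self.u[f] *= self.alpha * (1 + val)  # promote positive weight
                self.v[f] *= self.beta * (1 - val)   # demote negative weight
            else:
                self.u[f] *= self.beta * (1 - val)   # demote positive weight
                self.v[f] *= self.alpha * (1 + val)  # promote negative weight
        return True
```

In a stream setting, each labeled tweet would be converted to features (for example with `char_ngrams(tweet_text)`) and passed once to `learn`, so the model never needs the full data set in memory. Which gender maps to +1 and which to -1 is an arbitrary choice in this sketch.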

Share and Cite:

W. Deitrick, Z. Miller, B. Valyou, B. Dickinson, T. Munson and W. Hu, "Gender Identification on Twitter Using the Modified Balanced Winnow," Communications and Network, Vol. 4 No. 3, 2012, pp. 189-195. doi: 10.4236/cn.2012.43023.

Conflicts of Interest

The authors declare no conflicts of interest.

Copyright © 2024 by authors and Scientific Research Publishing Inc.

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.