TITLE:
A Novel Efficient and Effective Preprocessing Algorithm for Text Classification
AUTHORS:
Lijie Zhu, Difan Luo
KEYWORDS:
Text Classification, Preprocessing, Feature Dimension, Orthogonal Matching Pursuit
JOURNAL NAME:
Journal of Computer and Communications, Vol.11 No.3, March 13, 2023
ABSTRACT: Text classification is an essential task in natural language processing. Preprocessing, which determines how text features are represented, is one of the key steps in a text classification architecture. This paper proposes a novel, efficient and effective preprocessing algorithm with three methods for text classification, and uses the Orthogonal Matching Pursuit algorithm to perform the classification. The main idea of the preprocessing strategy is to combine stopword removal and/or regular filtering with tokenization and lowercase conversion, which effectively reduces the feature dimension and improves the quality of the text feature matrix. Simulation tests on the 20 newsgroups dataset show that, compared with the existing state-of-the-art method, the new method reduces the number of features by 19.85%, 34.35%, 26.25% and 38.67%, improves accuracy by 7.36%, 8.8%, 5.71% and 7.73%, and increases the speed of text classification by 17.38%, 25.64%, 23.76% and 33.38% on the four data sets, respectively.
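
The preprocessing strategy described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that "regular filtering" means filtering tokens with a regular expression; the stopword list, the filter rule, the two sample documents, and the names `preprocess` and `vocabulary` are all hypothetical and chosen only for illustration. The two flags give the three variants implied by "stopword removal and/or regular filtering": stopwords only, filtering only, or both.

```python
import re

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "on"}

# Assumed "regular filtering" rule: keep purely alphabetic tokens of length >= 2.
TOKEN_FILTER = re.compile(r"^[a-z]{2,}$")


def preprocess(text, remove_stopwords=True, regex_filter=True):
    """Tokenize on whitespace, lowercase, then optionally filter tokens."""
    tokens = [t.lower() for t in re.findall(r"\S+", text)]
    if regex_filter:
        # Strip surrounding punctuation, then drop tokens that fail the rule.
        tokens = [t.strip(".,;:!?\"'()[]") for t in tokens]
        tokens = [t for t in tokens if TOKEN_FILTER.match(t)]
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens


def vocabulary(docs, **options):
    """Return the set of distinct features (tokens) across all documents."""
    return {tok for doc in docs for tok in preprocess(doc, **options)}


# Made-up sample documents, not taken from the 20 newsgroups dataset.
DOCS = [
    "The GPU is faster than the CPU in 2023!",
    "A review of the new GPU: price is $499.",
]

baseline = vocabulary(DOCS, remove_stopwords=False, regex_filter=False)
reduced = vocabulary(DOCS)

print(len(baseline), "features with tokenization and lowercasing only")
print(len(reduced), "features after stopword removal and regex filtering")
```

On this toy input the vocabulary shrinks because stopwords, numbers, and punctuation-bearing variants such as `gpu:` no longer count as separate features. The size of the reduction here says nothing about the percentages reported in the abstract, which come from the authors' experiments.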