Automatic Arabic Document Classification Based on the HRWiTD Algorithm

Documents contain a large amount of valuable knowledge on many subjects, and documents from a wide range of sources are now available on the Internet. Automatic, rapid, and accurate classification of these documents with little human interaction has therefore become necessary. In this paper, we introduce a new algorithm, the highest repetition of words in a text document (HRWiTD), for classifying Arabic text automatically. The corpus is divided into a train set and a test set, to which the proposed classification technique is applied. The train set is analyzed for learning, and the learned data are stored in the Learning Dataset file. For each word, the category in which the word has its highest repetition is assigned as the word's category in the Learning Dataset file. This file therefore contains the non-duplicate words from all texts in the train set, each with its highest repetition value and the corresponding category. For each text in the test set, every word is assigned a category by looking it up in the Learning Dataset file, and the category that covers the largest number of words is assigned as the predicted category of the text. The classification accuracy of the HRWiTD algorithm is evaluated with the confusion matrix method. The HRWiTD algorithm has been applied to convergent samples from six categories of Arabic news from the Saudi Press Agency (SPA), on which it achieves an accuracy of 86.84%. We also ran the most popular machine learning algorithms, C5.0, KNN, SVM, NB, and C4.5, on the same corpus; their classification accuracies are 52.86%, 52.38%, 51.90%, 51.90%, and 30%, respectively. Thus, the HRWiTD algorithm gives better classification accuracy than the most popular machine learning algorithms on the selected domain.


Introduction
The Internet is a very effective means of obtaining a huge amount of information in different forms, such as documents. Millions of documents are now available from various sources, and most of them contain valuable information. Manual classification of documents is time-consuming and difficult, especially when people must estimate the category from the information the document contains. Automatic text classification is therefore used to discover the basic information in text documents automatically while saving human effort and time [1].
Automatic text categorization assigns texts to a set of predetermined categories based on their contents. Specifically, it involves filtering and routing, clustering the information in related texts, and then classifying the texts into specified topics [2]. The text classification process is divided into three main phases: first, compiling the training data; second, selecting a set of features to represent the text categories; third, testing the test data with the selected machine learning algorithm [3]. Machine learning (ML) refers to methods that learn automatically, without human intervention, to make accurate predictions or behave intelligently. Text classification (TC) is one of the important areas of ML. TC is a data mining method that assigns predetermined categories to texts, such as web pages, library books, media articles, and galleries, based on their content, and thereby extracts valuable information from large unstructured text resources, as in email filtering (spam or legitimate) [4]. The corpus is divided into a 70% train set and a 30% test set, the split reported in [5] to give the best classification accuracy. The train set is analyzed to obtain a predetermined category for each word in all texts, and the Learning Dataset file is constructed from it. This file is then used to predict the categories of the test set, so each text in the test set is classified based on the learning process [6].
According to recent research, various automated learning algorithms have been applied successfully to Arabic text. The best-known techniques for classifying Arabic text are, from best to worst, the C5.0 classifier, Support Vector Machine (SVM), Naïve Bayes (NB), Decision Tree (C4.5), and K-nearest neighbor (KNN) [5]. These classification techniques are recognized as simple and efficient methods for classifying texts [7] [8]. In this research, these techniques did not achieve satisfactory accuracy: the best average accuracy over all categories is 52.86%, obtained with the C5.0 classifier. The HRWiTD algorithm, by contrast, achieved the best text classification performance, with the highest average accuracy (86.84%) among the techniques compared.
The second section presents relevant work. The third section introduces the proposed work in detail, including the HRWiTD algorithm and the evaluation method used. The fourth section presents the experimental results of the proposed algorithm and of the most popular machine learning algorithms, together with their comparison. The last section is the conclusion.

Related Work
Text classification (TC) in the data mining field is the process of extracting useful knowledge from text by analyzing complex textual data [1]. The TC process is the automatic classification of a set of texts into categories based on their content [9].
In many text mining algorithms, pre-processing is one of the main components of text classification. Typically, the TC framework begins with pre-processing, followed by feature extraction, and finally the classification steps [10]. In detail, the process of classifying texts is divided into nine steps.
Automatic text classification is used to classify texts in many languages, including Arabic. Arabic is the native language of more than 300 million people and is widely spoken around the world [2].
Recently, many studies have been published on machine learning algorithms for the classification of Arabic text. El-Kourdi et al. [2] used Naïve Bayes to classify Arabic documents automatically. Sawaf et al. [11] used a statistical approach based on Maximum Entropy to classify and cluster news articles. Sawaf et al. also described a method based on Association Rules to classify Arabic documents [3]. Al-Harbi et al. [12] compared the SVM algorithm and the Decision Tree algorithm. Al-Kabi and Al-Sinjilawi [13] compared the Vector Space Model and Naïve Bayes for the classification of Arabic documents.
Khreisat [14] compared the KNN and SVM algorithms. Kanaan et al. [15] used a Naïve Bayes classifier to classify Arabic texts distributed equally over many categories. The different machine learning algorithms applied to Arabic texts have produced different classification accuracies, which are presented in [5]. Based on the most frequently used feature selection methods (CHI, TF, DF, IG, and None), the most popular machine learning algorithms for classifying Arabic documents are, in order, C5.0, SVM, NB, C4.5, and KNN [2] [16] [17].

Proposed Work
In this paper, Arabic texts are classified in three main phases: pre-processing, feature extraction, and classification. In the pre-processing stage, noisy data such as numbers, punctuation, kashida, stop words, and diacritics are removed [18]. In the feature extraction stage, features are identified while learning from the train set, and the Learning Dataset file is built. This file contains each word only once, together with its highest repetition value and the category in which that highest repetition occurs. In the classification stage, each text in the test set is classified by the HRWiTD algorithm, which matches the words of the text against the words in the Learning Dataset file to obtain a predicted classification (category) for each word.
When more than two thirds (66.67%) of the words in a text have undefined categories, the classification of the text is ambiguous and it is difficult to determine a particular classification. The "General" category includes all types of texts, some of which may belong to a specific category and some to an unspecified one, so "General" is the best predetermined classification for ambiguous text. In the suggested approach, if the average of the total repetition of all words in a text that have a predetermined classification (category) is greater than one third (33.33%), the expected classification of the text is the category with the largest number of words. Otherwise, the proposed classification is "General".
The accuracy of the HRWiTD algorithm is evaluated with the confusion matrix. This method compares the predicted classification of each text with its actual classification (one of six categories) in the Arabic news corpus (SPA).
This section describes the main stages of the classification of Arabic texts in detail. Figure 1 shows the stages, which include data collection, documents preprocessing, data division, feature extraction, filtering, data representation, the classification algorithm, and performance evaluation.

Data Collection
Data collection is the first and a very important stage in the classification of Arabic texts. We chose an Arabic newswire source, the Saudi Press Agency (SPA), which provides convergent samples of six categories. We chose SPA for two reasons: the actual classification (category) of each text in the corpus is available, and SPA texts are available on the Web. SPA statistics are shown in Table 1.

Documents Preprocessing
Pre-processing improves the classification of text documents by removing worthless data. Such data include numbers, punctuation, kashida, hamza, diacritics, and stop words.
Some words, such as prepositions and pronouns, do not belong to any classification, so we add them to a stop word list (see Table 2). Preprocessing also normalizes text documents by changing Taa Marbouta "ة" to "ا". The ATC Tool is used to remove worthless data from the selected corpus.
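The cleaning steps above can be illustrated with a minimal Python sketch. It is not the ATC Tool: the function name `preprocess`, the tiny `STOP_WORDS` set (the list in Table 2 is not reproduced here), and the Unicode ranges are assumptions made for illustration, and hamza handling is omitted because the text does not specify it.

```python
import re

# Hypothetical stop-word sample; the paper's Table 2 list is not reproduced here.
STOP_WORDS = {"في", "من", "على"}

DIACRITICS = re.compile(r"[\u064B-\u0652]")       # fathatan through sukun
KASHIDA = "\u0640"                                 # tatweel (elongation mark)
NON_LETTERS = re.compile(r"[^\u0621-\u064A\s]")    # anything but Arabic letters/whitespace


def preprocess(text, stop_words=STOP_WORDS, taa_marbouta_target="\u0627"):
    """Return the cleaned word list of one Arabic text."""
    text = DIACRITICS.sub("", text)            # drop diacritics without splitting words
    text = text.replace(KASHIDA, "")           # drop kashida
    text = NON_LETTERS.sub(" ", text)          # digits and punctuation become spaces
    # Taa Marbouta is mapped to the character stated in the text (assumed configurable).
    text = text.replace("\u0629", taa_marbouta_target)
    return [w for w in text.split() if w not in stop_words]
```

Diacritics and kashida are removed before the general filter so that a word containing them is not split into two tokens.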

Data Division
At this stage, the ATC Tool is used to divide the corpus into two partitions, the train set and the test set. The train set contains 70% of the selected corpus and the test set contains 30%; according to [5], this division gives the best classification performance. The user can also select the percentages of the train set and the test set manually.
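A 70/30 division of this kind might be sketched as follows. The per-category (stratified) split, the fixed random seed, and the name `split_corpus` are assumptions for illustration; the text does not say how the ATC Tool selects the texts for each partition.

```python
import random
from collections import defaultdict


def split_corpus(docs, train_ratio=0.7, seed=0):
    """Split (text, category) pairs into train and test lists, category by category."""
    by_category = defaultdict(list)
    for doc in docs:
        by_category[doc[1]].append(doc)

    rng = random.Random(seed)                  # fixed seed keeps the split reproducible
    train, test = [], []
    for category in sorted(by_category):
        items = by_category[category][:]
        rng.shuffle(items)
        cut = round(len(items) * train_ratio)  # number of train texts in this category
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test
```

Passing a different `train_ratio` corresponds to the manual selection of percentages mentioned above.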

Feature Extraction
In this stage, we use the train set and the test set, which may come from an internal or an external source. The ATC Tool extracts the features and generates the repetition list of words. It lists and saves the repetitions of each word in all texts of the train set in a train list file, and likewise lists and saves the repetitions of each word in all texts of the test set in a test list file. In addition, a field is added to both the train list file and the test list file to label the category of each word. The category of a word in the train list is its actual category. The word categories in the test list, on the other hand, are taken from the entries for the same words in the Learning Dataset file.
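As a rough illustration of the train list, the sketch below counts how often each word occurs under each actual category. Aggregating repetitions per category (rather than per individual text) is an assumption, as are the name `build_train_list` and the in-memory `Counter` standing in for the train list file.

```python
from collections import Counter


def build_train_list(train_docs):
    """train_docs: iterable of (tokens, category) pairs.

    Returns a Counter mapping (word, category) to the number of repetitions
    of that word in the train texts of that category.
    """
    counts = Counter()
    for tokens, category in train_docs:
        for word in tokens:
            counts[(word, category)] += 1
    return counts
```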

Filtering
At this stage, the train list file is filtered by removing duplicate words together with their classifications. For each word, the entry with the highest repetition is kept with its related data (repetition count and category), and the other entries for the same word, which have lower repetition, are deleted along with their related data.
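The filtering rule can be sketched in a few lines of Python. The dictionary layout and the name `build_learning_dataset` are hypothetical, and the tie-breaking behaviour (the first entry seen wins) is an assumption, since the text does not say how equal repetitions are resolved.

```python
def build_learning_dataset(train_list):
    """train_list: dict mapping (word, category) to repetitions.

    Returns a dict mapping each word to the (category, repetitions) pair
    with the highest repetition; lower-repetition duplicates are discarded.
    """
    learning = {}
    for (word, category), repetitions in train_list.items():
        # Keep this entry only if the word is new or beats the stored repetition.
        if word not in learning or repetitions > learning[word][1]:
            learning[word] = (category, repetitions)
    return learning
```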

Data Representation (Train Set/Test Set)
At this stage, the train list file produced by the filtering stage is formatted into the Learning Dataset file. The test list file produced by the feature extraction stage is used for classifying texts with the HRWiTD algorithm. The data are represented as an array with n rows and m columns, where the rows correspond to the words in the text and the columns correspond to the repetition and the category.

Classification Algorithm (HRWiTD)
In this step, the Learning Dataset file produced by the data representation stage and the test list file are used by the classification algorithm (HRWiTD).
The test list file stores the predicted classification (obtained from the Learning Dataset file) for all words in each text. The predicted classification file stores the predicted classification of all test texts. Details of the HRWiTD algorithm process are given in Figure 2.
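One possible reading of the classification step is sketched below: each word of a test text is looked up in the Learning Dataset, and the text receives the most frequent word category unless too few words are known. Interpreting the 33.33% rule as the share of word occurrences found in the Learning Dataset is an assumption, as are the name `classify` and the tie-breaking order; Figure 2 remains the authoritative description.

```python
from collections import Counter


def classify(tokens, learning, threshold=1 / 3, fallback="General"):
    """Predict the category of one tokenized text.

    learning maps word -> (category, repetitions), as in the Learning Dataset.
    """
    # One vote per word occurrence whose category is known.
    votes = Counter(learning[w][0] for w in tokens if w in learning)
    known = sum(votes.values())
    # Ambiguous text: the share of categorized words is not above one third.
    if not tokens or known / len(tokens) <= threshold:
        return fallback
    # Category with the largest number of words (ties keep first-seen order).
    return votes.most_common(1)[0][0]
```

For example, a four-word text with three words known as "Sports" or "Economic" is assigned the majority category, while a text with only one known word out of four falls back to "General".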

Performance Evaluation
The performance of the HRWiTD algorithm in classifying texts has been evaluated using the confusion matrix [19]. A confusion matrix is a table that is often used to describe the performance of a classification model on a set of test data for which the true values are known. It allows the performance of an algorithm to be visualized.
It also makes confusion between classes easy to identify, for example when one class is commonly mislabeled as another. Most performance measures are computed from the confusion matrix. The predicted classification is assigned by the HRWiTD algorithm, and the confusion matrix evaluates the performance by comparing the actual and predicted classifications; see Table 3.
Entries in the confusion matrix have the following meaning in the context of our study:
• True negative (TN) is the number of correct predictions that an instance is negative.
• False positive (FP) is the number of incorrect predictions that an instance is positive.
• False negative (FN) is the number of incorrect predictions that an instance is negative.
• True positive (TP) is the number of correct predictions that an instance is positive.
• Total is the sum of all the above variables; see Equation (1).

Total = TN + FP + FN + TP    (1)

Overall, the accuracy (AC) is the proportion of the total number of predictions that were correct. It is determined using Equation (2):

AC = (TN + TP) / Total    (2)

• There are two possible predicted classifications: "Positive" and "Negative". If we are predicting a target classification of a text (e.g., "Sport"), "Positive" means the text belongs to that target classification, and "Negative" means it does not.
• The classifier (HRWiTD algorithm) makes a total of 420 predictions (the test data) out of the 1421 texts for each of the six categories, with 70 texts per category.
• Out of those 420 cases, the classifier predicted "Positive" FP + TP times and "Negative" TN + FN times.
• In reality, FN + TP texts in the table belong to the target classification, and TN + FP texts do not.
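The one-category-at-a-time evaluation described by these definitions can be sketched as follows, computing the four counts for a target category and then the accuracy as (TN + TP) / Total. The function names and the toy labels in the usage note are illustrative assumptions, not the paper's data.

```python
def confusion_counts(actual, predicted, target):
    """Return (TN, FP, FN, TP) for one target category."""
    tn = fp = fn = tp = 0
    for a, p in zip(actual, predicted):
        if p == target and a == target:
            tp += 1          # correctly predicted as the target
        elif p == target:
            fp += 1          # predicted as the target, actually another category
        elif a == target:
            fn += 1          # actually the target, predicted as another category
        else:
            tn += 1          # correctly predicted as not the target
    return tn, fp, fn, tp


def accuracy(tn, fp, fn, tp):
    """Accuracy AC = (TN + TP) / Total, with Total = TN + FP + FN + TP."""
    total = tn + fp + fn + tp
    return (tn + tp) / total
```

With actual labels Sports, Sports, Economic, Social and predictions Sports, Economic, Economic, Sports, the target "Sports" gives one count in each cell and an accuracy of 0.5.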

Experimental Results and Discussion
The HRWiTD algorithm is used to classify Arabic texts. The confusion matrix method was used to determine the classification accuracy of the HRWiTD algorithm, which is 86.84% in this experiment; see Table 5 for details. For comparison, the same data set was used with several well-known classification techniques: models were developed with the C5.0, decision tree C4.5, NB, KNN, and SVM classifiers (the models were created using RapidMiner 5.0) [13].

Conclusion
In summary, this paper classifies Arabic texts automatically using the HRWiTD algorithm. We applied it to 1421 Arabic newswire texts from the Saudi Press Agency (SPA). The corpus includes convergent samples of six categories (culture, economic, public, political, social, and sports). The average overall classification accuracy over the six categories is 86.84%. First, the HRWiTD algorithm needs to be improved to give better results across all text categories; here we cover only six categories, and texts from other categories were assigned to the "General" category. Second, it needs to extend the expe-
The classification of Arabic texts has received great attention in much recent research, owing to the importance of the Arabic language and the large population that speaks it. In this paper, we introduce the HRWiTD algorithm, which automatically analyzes Arabic texts to estimate their classifications (categories). The name of the proposed algorithm abbreviates "highest repetition of words in a text document". The proposed technique is built on three main stages. The pre-processing stage removes noisy data. The feature extraction stage learns from the dataset and builds the Learning Dataset file from the features extracted from the train set; this file contains non-duplicate words with their highest repetition values and categories. The classification stage estimates the classification of texts using the HRWiTD algorithm (the expected classification of a text is the category with the largest number of words). If the average of the total repetition of all words in a text that have a predetermined classification (category) is less than 33.33%, the proposed classification of the text is the "General" category. The HRWiTD algorithm has been applied to convergent samples of six categories, namely culture, economic, public, political, social, and sports, to obtain the best classification accuracy. The selected corpus was obtained from SPA (Saudi Press Agency) and contains 1421 Arabic texts (newswire). It was divided into two sets, a 70% train set and a 30% test set, the division reported in [5] to give the best classification accuracy.
E. Othman, A. Al-Hamadi. DOI: 10.4236/jsea.2018.114011. Journal of Software Engineering and Applications.
We test the performance of the models on the test set and evaluate the accuracy using a cross-validation technique, with the number of validations set in the X-Validation operator. These classifiers were evaluated with two advanced methods for term selection, chi-square (CHI) and information gain (IG), and with different weighting methods (Boolean, Entropy, Frequency, LTC, Relative Frequency, TFC, and TFiDF).
The confusion matrix method is used to evaluate the classification accuracy. The classification technique in this paper is built on three main phases: preprocessing, feature extraction, and classification using the HRWiTD algorithm. The repetition for the predetermined category of each word in the text is calculated. If the average of the total of those words is less than 33.33%, the expected classification of the text is the "General" category; otherwise, the expected classification of the text is the category with the largest number of words. We compared the accuracy of the proposed algorithm (HRWiTD) with that of the most popular techniques: the accuracies of the C5.0, KNN, SVM, NB, and C4.5 classifiers are 52.86%, 52.38%, 51.90%, 51.90%, and 30%, respectively. These techniques performed best when they used the advanced methods for term selection (CHI, IG, None), the different weighting methods (Boolean, Entropy, Frequency, LTC, Relative Frequency, TFC, and TFiDF), and two simple methods for term selection (TF and DF). Thus, we conclude that the best technique for classifying Arabic texts in the selected domain is the HRWiTD algorithm. In addition, the HRWiTD algorithm gives the best classification accuracy for each individual classification except the "General" category. In future work,

Table 1. SPA statistics of the selected corpus.

Table 5. The best classification accuracy results of the C5.0 classifier and the HRWiTD algorithm.