A Statistical Analysis of Textual E-Commerce Reviews Using Tree-Based Methods
1. Introduction
Most Internet data is unstructured, primarily text. Such data originate from social networks, such as Facebook and Twitter, or from corporate sources, such as complaints and opinion polls. A real example is The Women’s Clothing E-Commerce Reviews dataset (see Note 1), which consists of reviews written by real customers. Text Mining, a branch of Data Mining, arose from the need to process and extract information from this large mass of textual data.
Sentiment Analysis comprises the computational treatment of opinions, sentiments and text subjectivity [1]. Its techniques can be roughly categorized into two groups. The first is the Lexicon-based Approach, which associates words or expressions with positive, negative or neutral feelings. In this regard, Aung and Myo [2] proposed to automatically analyze student text feedback by employing the Lexicon-based approach to predict teaching performance levels. Their system reported the opinion results on teachers as strongly positive, moderately positive, weakly positive, strongly negative, moderately negative, weakly negative or neutral. Palanisamy et al. [3] also applied this approach to classify tweets as positive or negative based on the contextual sentiment orientation of the employed words.
The other Sentiment Analysis group comprises the Machine Learning Approach. Within this scheme, the learning model can be obtained by either supervised or unsupervised training. This approach relies on Machine Learning algorithms to treat Sentiment Analysis as a regular classification problem, applying linguistic features as independent variables. In this regard, Onan [4] presented a text mining approach to analyze MOOC reviews through supervised learning methods. Following a MOOC review selection, the training dataset consisted of half negative and half positive reviews. The highest predictive performance among all compared configurations was a classification accuracy of 95.80%. Ko and Seo [5], on the other hand, proposed an unsupervised method that divides documents into sentences and categorizes each sentence through keyword lists for each category and sentence similarity measures. This method can be applied in areas where low-cost text categorization is required, or in the development of training documents.
Under the Machine Learning Approach, documents are first pre-processed in order to transform the original unstructured database into a structured one. One option is to work with term (word) presence or frequency in each document [2]. In a second step, Machine Learning methods are applied to the structured data. The choice of method depends on the final objective: descriptive analyses can be applied first, followed by more elaborate models, such as regression or classification models.
In this context, this study analyzed The Women’s Clothing E-Commerce Reviews database, which consists of consumer comments regarding a particular clothing item and a label designating whether or not the consumer recommends said item. The aim was to correctly and automatically classify customer recommendations based on their textual reports, applying Text Mining techniques and tree-based methods.
The paper is organized as follows. Section 2 cites some related studies, Section 3 describes the main Text Mining process steps and Section 4 presents the applied tree-based methods, namely Classification Tree, Random Forest, Gradient Boosting and XGBoost. Section 5 comprises quality measures for the classification methods, Section 6 describes the database, Section 7 presents the numerical results and analyses and, finally, the conclusions are reported in Section 8.
2. Related Work
Alrehili and Albalawi [6] conducted a study employing a sentiment analysis approach on a set of customer reviews collected from Amazon. Following a manual review classification and text processing, the dataset comprised 1500 records (750 positive reviews and 750 negative reviews) and two attributes: ReviewsText and ReviewType. Six classifiers, namely Naive Bayes, SVM, Random Forest (RF), Bagging using RF, Boosting using RF and Voting, were applied and tested with unigrams, bigrams and trigrams, with and without stop word removal. The best performance was obtained by the Random Forest technique, with an 89.87% accuracy when using unigrams and stop word removal.
Shah et al. [7] performed research on the sentiment analysis of a BBC news data set, where each text was categorized into one of five categories: business, entertainment, politics, sport and tech. Stop words were removed, the text was converted to lowercase and the Porter Stemmer algorithm was used for stemming. The TF-IDF Vectorizer was implemented to transform the text into a numerical representation. It was not clear how many terms were used. The dataset was split into training (75%) and testing (25%) sets and, after splitting, the pipeline was used to implement the classifiers. The results indicate that the Logistic Regression classifier attained the highest accuracy, 97%, and the second best was the Random Forest classifier, with a 93% accuracy. The Logistic Regression algorithm emerged as the most stable classifier for a small data set.
Lin [8] also employed The Women’s Clothing E-Commerce Reviews database to perform a sentiment analysis of customer recommendations, aiming at understanding the correlation between review features and product recommendations based on natural language processing (NLP). Five machine learning algorithms were applied: Logistic Regression, Support Vector Machine (SVM), Random Forest, XGBoost and LightGBM. The best results were achieved by the LightGBM algorithm, which obtained the highest AUC value and accuracy. The Ridge Regression, Linear Kernel SVM and XGBoost algorithms exhibited close performances, with a 94% accuracy.
Comparing our study with the one described above [8], despite the same database and similar classification methods, two important differences are highlighted:
1) The dataset was not split into training and test sets in the Lin study. For this reason, the out-of-sample performance could not be evaluated. A high accuracy on the training set is sometimes the result of overfitting.
2) An in-depth processing of the raw review texts in the Lin study was not conducted. Furthermore, no stop words were removed, nor were terms selected by their frequency. This is an important step in Natural Language Processing (NLP) and was carefully performed in the study presented in this paper.
3. Text Mining
Text Mining is the process of extracting non-trivial patterns or knowledge from unstructured text documents. This process can be categorized into two main steps: refinement, which transforms the original textual database into a numerical database; and the information extraction process, which consists of detecting patterns from the refined database using conventional statistical tools [9].
The main refinement steps of a textual database, also called preprocessing techniques, are: Tokenization; Stop Word Removal; Normalization; Creation of the Document-term Matrix; Term Selection. A brief explanation of each is presented below.
Tokenization is the first preprocessing stage and aims to extract minimum text units from a free text. These units are called tokens and most often refer to a single word.
Stop Words are the most frequent terms in a language. They have no semantic value and only aid in the general understanding of the text. Stop words are typically articles, prepositions, punctuation, conjunctions and pronouns. A pre-established list, called a stoplist, is usually applied. The removal of stop words considerably reduces the number of tokens and improves the analysis to be performed.
Normalization is the process of grouping words that share the same pattern. The main normalization methods are stemming and lemmatization, and further explanations on these terms can be found in [10]. The lemmatization method is applied herein, which, for example, replaces the tokens “calculate”, “calculating” and “calculated” with the term “calculate”.
Term Selection, proposed by [11], establishes a set of significant database terms. Non-significant terms, which have low semantic value, appear at very high or very low frequencies in document sets and are not considered in the analyses.
Following these steps, each document is transformed into a bag of terms. Considering a textual database formed by n documents that together contain p terms, the n×p matrix A, where each element $a_{ij}$ represents the frequency with which term j occurs in document i, is called the Document-term Matrix. Each row of this matrix corresponds to a document and can be understood as an object. Each column corresponds to a term and can be understood as a document attribute.
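The refinement steps above can be illustrated with a minimal Python sketch (the study itself used R packages). The stoplist and lemma dictionary below are tiny hypothetical stand-ins, and the function names `preprocess` and `document_term_matrix` are illustrative assumptions, not part of the original analysis; term selection is omitted.

```python
import re
from collections import Counter

# Toy stand-ins (assumptions): a real analysis would use a full stoplist
# and a full lemmatization dictionary.
STOP_WORDS = {"the", "is", "and", "it", "i", "this", "a"}
LEMMAS = {"fits": "fit", "fitting": "fit", "loved": "love", "returned": "return"}


def preprocess(text):
    """Tokenize, drop stop words and lemmatize one document."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [LEMMAS.get(t, t) for t in tokens if t not in STOP_WORDS]


def document_term_matrix(documents):
    """Return (terms, A), where A[i][j] counts term j in document i."""
    bags = [Counter(preprocess(doc)) for doc in documents]
    terms = sorted(set().union(*bags))
    matrix = [[bag[term] for term in terms] for bag in bags]
    return terms, matrix


docs = [
    "I loved this dress and it fits",
    "The dress is returned",
    "Loved the fitting",
]
terms, A = document_term_matrix(docs)
for row in A:
    print(row)
```

Each row of `A` is one review and each column one lemmatized term, matching the object and attribute interpretation given above.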
4. Tree-Based Methods
Several tree-based methods are applied in classification problems. The first, Classification Trees, is simple and useful for data interpretation, although it is not competitive in terms of prediction accuracy. On the other hand, the Random Forest and Boosting methods grow multiple trees which are then combined to yield a single consensus prediction. Combining many trees can often result in prediction accuracy improvements, at the expense of some loss of interpretation.
4.1. Classification Trees
Considering a universe composed of n objects described by p attributes, each attribute is an independent variable $X_j$, $j = 1, \ldots, p$, and each object i belongs to a known class $Y_i$. A classification method aims to define a mathematical model capable of predicting the class of a new object when its p attributes are known. The quality of a classification method is determined by the proportion of correctly predicted classes (more details are given in Section 5).
The Classification Tree model is a type of classification method. It uses the tree structure to recursively partition the dataset. Once the input data has been split, the prediction is made from a simple classification method in each partition, such as the dominant class (see Figure 1). If the resulting tree has too many nodes, it is still possible to perform a pruning process. The pruning process eliminates some nodes in order to minimize estimation errors outside the sample.
In general, for each node k, starting at the root, a classification tree algorithm can be summarized by the following three steps:
1) Choose an attribute $X_j$ from the available p and a constant $c$ which best separate the objects arriving at node k according to the following partition: $X_j \le c$ and $X_j > c$. This partition defines two new nodes, the children of node k. For each of these two child nodes:
2) If one class prevails in this node, i.e., its prevalence is greater than a pre-defined value, this node becomes a leaf labeled with the prevalent class and the process ends for this node.
3) Otherwise, go back to step 1 considering only the objects that arrived at this child node.
Figure 1. A classification tree example.
The attribute $X_j$ and the constant $c$ mentioned in Step 1 are chosen in order to optimize the class division. Their values result from an optimization problem that seeks to minimize impurity or maximize prevalence in the two new child nodes. Several classification tree algorithms exist, and the main differences between them are the way the dataset is partitioned, i.e., the choice of $X_j$ and $c$; the partition stopping criterion; and the pruning process [12].
For example, the THAID [13] [14], C4.5 [15] and CART [16] algorithms use node impurity measurements and divide a node by searching exhaustively over all attributes X and split points for the split that minimizes the total impurity of the two child nodes. Since the impurity measurements are different for each algorithm, the result is a difference in the way the dataset is partitioned. In addition, while a THAID division stops if the relative decrease in impurity is below a pre-specified threshold, C4.5 and CART first grow an overly large tree and then prune it to a smaller size.
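The exhaustive search in Step 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation of any of the cited algorithms: it uses the Gini impurity (the measure commonly associated with CART), observed attribute values as candidate thresholds, and hypothetical names (`gini`, `best_split`) and toy data.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(rows, labels):
    """Search every attribute j and threshold c for the partition
    X_j <= c / X_j > c with the lowest weighted child impurity."""
    n = len(rows)
    best = None  # (weighted impurity, attribute index, threshold)
    for j in range(len(rows[0])):
        # The largest value is skipped: it would leave the right child empty.
        for c in sorted({row[j] for row in rows})[:-1]:
            left = [y for row, y in zip(rows, labels) if row[j] <= c]
            right = [y for row, y in zip(rows, labels) if row[j] > c]
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or impurity < best[0]:
                best = (impurity, j, c)
    return best


# Toy data (hypothetical): counts of two terms per review; 1 = recommended.
rows = [[0, 2], [0, 1], [1, 0], [2, 0], [0, 3], [3, 1]]
labels = [1, 1, 0, 0, 1, 0]
impurity, attribute, threshold = best_split(rows, labels)
print(impurity, attribute, threshold)
```

Applying `best_split` recursively to the objects in each child node, with a stopping rule on class prevalence, gives the three-step procedure described above.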
4.2. Random Forest (Bagging)
The Random Forest is a classification model created by Breiman [17] in order to improve the prediction of classification tree models. According to Breiman, the Random Forest is a classifier consisting of a collection of M tree-structured classifiers, where each tree is constructed from a smaller dataset composed of $n^{*} \le n$ objects and $p^{*} \le p$ attributes.
The $n^{*}$ objects are selected through a Bagging strategy, such as Bootstrap Sampling [18]. The $p^{*}$ attributes are selected randomly and without replacement. Each combination of these two draws results in a decision tree. After many trees are generated, their results are combined to provide a final prediction: the most popular class among the M predictions.
Just like classification trees, several Random Forest algorithms are available. These differ in the way the sample is selected and also in the adopted classification tree algorithm. In this study, we apply the Liaw and Wiener method [19] through the use of the randomForest package available in the R Program [20], described in the following steps.
1) Draw M bootstrap samples from the original data. Each sample has size $n^{*}$, with $n^{*} \le n$.
2) For each of the bootstrap samples, grow an unpruned classification tree, with the following modification: at each node, rather than choosing the best split among all p attributes, randomly sample $p^{*}$ of the attributes, $p^{*} \le p$, and choose the best split from among those variables.
3) Predict new data by aggregating the predictions of the M trees (i.e., majority votes for classification).
The values of the constants M, $n^{*}$ and $p^{*}$ are defined in advance. For example, in the randomForest package, the default values for classification are $M = 500$, $n^{*} = n$ and $p^{*} = \lfloor \sqrt{p} \rfloor$ (integer part of the square root of p).
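The three steps above can be sketched in Python as follows. This is a simplified illustration, not the randomForest package: each tree is reduced to a single split (a stump) instead of an unpruned tree, Gini impurity and midpoint thresholds are assumed, and all names and the toy data are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def fit_stump(rows, labels, attributes):
    """Best single split X_j <= c over the given attributes (a one-node tree)."""
    majority = Counter(labels).most_common(1)[0][0]
    best = (gini(labels), None, None, majority, majority)
    for j in attributes:
        values = sorted({row[j] for row in rows})
        for low, high in zip(values, values[1:]):
            c = (low + high) / 2  # midpoint between consecutive values
            left = [y for row, y in zip(rows, labels) if row[j] <= c]
            right = [y for row, y in zip(rows, labels) if row[j] > c]
            impurity = (len(left) * gini(left)
                        + len(right) * gini(right)) / len(labels)
            if impurity < best[0]:
                best = (impurity, j, c,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    return best[1:]  # (attribute, threshold, left class, right class)


def predict_stump(stump, row):
    j, c, left_class, right_class = stump
    if j is None:  # no useful split was found: predict the majority class
        return left_class
    return left_class if row[j] <= c else right_class


def fit_forest(rows, labels, n_trees, p_star, seed=0):
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # step 1: bootstrap sample
        attributes = rng.sample(range(p), p_star)   # step 2: random attributes
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], attributes))
    return forest


def predict_forest(forest, row):
    """Step 3: majority vote over the individual predictions."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


rows = [[0, 5], [1, 6], [0, 7], [5, 0], [6, 1], [7, 0]]
labels = [1, 1, 1, 0, 0, 0]
forest = fit_forest(rows, labels, n_trees=25, p_star=1)
print(predict_forest(forest, [0, 6]))
```

In a full Random Forest the attribute subset is redrawn at every node of each tree; with stumps there is only one node, so a single draw per tree plays that role here.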
4.3. Gradient Boosting Tree (Boosting)
The Random Forest method is a bagging algorithm. In these algorithms, trees are grown in parallel, each built on a sample of the original data, and the final prediction aggregates the predictions of all trees. Gradient boosting, on the other hand, employs a sequential approach in obtaining predictions. In the Gradient Boosting method, each decision tree predicts the error of the previous one [21].
Let $T_1(X)$ be a decision tree built on the independent variables $X$ that attains a given accuracy, $e_1$ its error and $Y$ the dependent variable:
$$Y = T_1(X) + e_1 \qquad (1)$$
Next, the residual $e_1$ can be predicted by another decision tree $T_2$, $e_1 = T_2(X) + e_2$. Combining these two steps yields a new model for Y:
$$Y = T_1(X) + T_2(X) + e_2 \qquad (2)$$
In the next step, the new residual $e_2$ can be predicted by another decision tree $T_3$, $e_2 = T_3(X) + e_3$, and then
$$Y = T_1(X) + T_2(X) + T_3(X) + e_3 \qquad (3)$$
Equation (2) is likely to present a greater accuracy than $T_1(X)$ alone, while Equation (3) considers three decision trees [21].
The Gradient Boosting algorithm can be summarized by the following steps, where t is the maximum number of iterations, $\lambda$ is the learning rate and $AUC_{\min}$ is the minimum AUC (Area Under the Curve) required. These three parameters must be defined in advance.
1) Build a Classification Tree $T_1$ that describes the predicted class Y as a function of the independent variables $X_1, \ldots, X_p$ and set $\hat{Y}_1 = T_1(X)$ and $k = 1$.
2) Define the k-th iteration residuals, $e_k = Y - \hat{Y}_k$, and fit a Regression Tree with response variable $e_k$ and predictor variables $X_1, \ldots, X_p$, terming this tree $T_{k+1}$.
3) Update the model, $\hat{Y}_{k+1} = \hat{Y}_k + \lambda T_{k+1}(X)$, and calculate its AUC (Area Under the Curve), called $AUC_{k+1}$. If $k + 1 < t$ and $AUC_{k+1} < AUC_{\min}$, set $k = k + 1$ and go back to step 2. Otherwise, end the algorithm.
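The sequential residual fitting can be sketched in Python as follows. This is a simplified illustration and not the gbm implementation: it assumes a single numeric attribute, starts from the mean of Y rather than from a classification tree, uses one-split regression trees (stumps) fitted by least squares, and runs a fixed number of iterations instead of applying the AUC stopping rule. All names and the toy data are hypothetical.

```python
def fit_regression_stump(xs, residuals):
    """Least-squares split of one numeric attribute: x <= c versus x > c."""
    best = None  # (squared error, threshold, left mean, right mean)
    values = sorted(set(xs))
    for low, high in zip(values, values[1:]):
        c = (low + high) / 2
        left = [r for x, r in zip(xs, residuals) if x <= c]
        right = [r for x, r in zip(xs, residuals) if x > c]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, c, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_iterations, learning_rate):
    """Fit each new stump to the residuals of the current model."""
    base = sum(ys) / len(ys)  # assumption: constant initial model
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_iterations):
        residuals = [y - p for y, p in zip(ys, predictions)]
        c, left_mean, right_mean = fit_regression_stump(xs, residuals)
        stumps.append((c, left_mean, right_mean))
        predictions = [
            p + learning_rate * (left_mean if x <= c else right_mean)
            for x, p in zip(xs, predictions)
        ]
    return base, stumps, predictions


def mse(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 1, 0, 1, 1]
_, _, after_1 = boost(xs, ys, n_iterations=1, learning_rate=0.5)
_, _, after_20 = boost(xs, ys, n_iterations=20, learning_rate=0.5)
print(mse(ys, after_1), mse(ys, after_20))
```

The learning rate scales the contribution of each new tree; smaller values make each step more conservative, so more iterations are needed to reach the same training error.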
4.4. XGBoost (Boosting)
The XGBoost method is an upgraded Gradient Boosting Tree algorithm that can flexibly process sparse data and missing values [8]. The system runs more than ten times faster than existing popular solutions on a single machine and scales to billions of examples in distributed or memory-limited settings. It also incorporates a regularized model to prevent overfitting [22].
5. Quality Measures
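Consider that a database with n objects is tested by a classifier, which predicts one of two possible classes for each object. After the test, a confusion matrix like the one presented in Table 1 can be built.
From the confusion matrix, several classification quality measures can be defined: True Positive Rate (also called Recall or Sensitivity), True Negative Rate (also called Specificity), Accuracy, Precision and F1 score. Denoting by TP, FN, FP and TN the numbers of true positives, false negatives, false positives and true negatives, respectively, each one is defined below.
True Positive Rate: $TPR = TP / (TP + FN)$.
True Negative Rate: $TNR = TN / (TN + FP)$.
Accuracy: $(TP + TN) / n$.
Precision: $TP / (TP + FP)$.
F1 score: $2 \cdot Precision \cdot TPR / (Precision + TPR)$.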
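As a small worked illustration, the Python sketch below computes these measures from the four cells of a confusion matrix using their standard definitions. The function name and the example counts are hypothetical and are not taken from the study's tables.

```python
def quality_measures(tp, fn, fp, tn):
    """Standard quality measures from the four confusion-matrix counts."""
    tpr = tp / (tp + fn)            # True Positive Rate (Recall)
    tnr = tn / (tn + fp)            # True Negative Rate (Specificity)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * tpr / (precision + tpr)
    return {
        "tpr": tpr,
        "tnr": tnr,
        "accuracy": accuracy,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts for 200 test objects.
measures = quality_measures(tp=80, fn=20, fp=10, tn=90)
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```

A classifier with high True Negative Rate but low True Positive Rate, as discussed later for the Classification Tree, would show a large `tnr` and a small `tpr` in this output.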
6. The Dataset
The Women’s Clothing E-Commerce Reviews was used as the dataset for this study and revolves around reviews written by customers. This dataset includes 23,486 rows and 10 feature variables. Each row corresponds to a customer review, and includes the following variables: Clothing ID; Age; Title; Review Text; Rating; Recommended IND; Positive Feedback Count; Division Name; Department Name; and Class Name. Of the 23,486 rows in the database, 19,314 refer to recommended items while the other 4,172 refer to non-recommended items.
Only three variables were considered herein: Review Text, a string variable for the review body; Title, a string variable for the title of the review; and Recommended IND, a binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended. The variables Title and Review Text were concatenated in order to enrich the information available to the analysis. Thus, only the text was used as the source of attributes for the classification methods.
First, the dataset was randomly split into a training set (70%) and a testing set (30%). Since the original database contains many more recommended than non-recommended items, not all of the documents were used. To obtain balanced sets while respecting the 70/30 ratio, the training set contained 5840 documents and the testing set contained 2504 documents.
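The reported counts are consistent with the following arithmetic: 2 × 4,172 = 8,344 = 5,840 + 2,504 documents, i.e., both classes reduced to the size of the smaller one. The Python sketch below shows one way to reproduce these sizes. It assumes (this is not stated in the text) that the majority class was randomly undersampled and that the 70/30 split was applied within each class; the function name is hypothetical.

```python
import random


def balanced_split(labels, train_fraction, seed=0):
    """Undersample every class to the minority size, then split each
    class into training and testing indices."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    per_class = min(len(indices) for indices in by_class.values())
    n_train = int(train_fraction * per_class)
    train, test = [], []
    for indices in by_class.values():
        chosen = rng.sample(indices, per_class)  # random undersampling
        train.extend(chosen[:n_train])
        test.extend(chosen[n_train:])
    return train, test


# Class sizes reported for the dataset: 19,314 recommended (1)
# and 4,172 non-recommended (0) reviews.
labels = [1] * 19314 + [0] * 4172
train, test = balanced_split(labels, train_fraction=0.7)
print(len(train), len(test))
```

With 4,172 documents per class, the integer part of 0.7 × 4,172 is 2,920, which gives 2 × 2,920 = 5,840 training documents and 2 × 1,252 = 2,504 testing documents.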
Figure 2 presents a diagram with the main stages of this research.
7. Results
This study was performed using the R Program [20]. The tidytext [23], tm [24] and textstem [25] packages were used for textual pre-processing. The rpart [26], randomForest [19], gbm [27] and xgboost [22] packages were used for the Classification Tree, Random Forest, Gradient Boosting and XGBoost analyses, respectively.
The pre-processing described in Section 3 was performed for the training dataset. An extra step was performed after the word tokenization, in which the term “not” was concatenated with the next word. Subsequently, stop words were removed and the normalization process was conducted based on Mechura’s English lemmatization list (see Note 2). Following this process, the textual database contained 123,769 terms.
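The extra negation step can be sketched in Python as follows. The function name and the underscore used to join the two tokens are assumptions made for illustration; the text does not specify how the concatenated term was written.

```python
def merge_negations(tokens, negator="not", joiner="_"):
    """Concatenate each occurrence of the negator with the following token."""
    merged = []
    i = 0
    while i < len(tokens):
        if tokens[i] == negator and i + 1 < len(tokens):
            merged.append(tokens[i] + joiner + tokens[i + 1])
            i += 2  # skip the token that was just merged
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = ["this", "dress", "is", "not", "flattering"]
print(merge_negations(tokens))
```

Performing this step before stop word removal keeps the negation attached to the word it modifies, so that, for instance, a negated positive word is counted as a separate term.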
A term selection was carried out to select 100 of these terms. The 10 most frequent terms in all documents and the 10 most frequent terms per category, positive and negative, were analyzed. Some terms were among the 10 most frequent in both categories; since such terms do not help to discriminate between positive and negative reviews, they were considered non-significant and were dismissed. These terms were removed in an iterative process until the top 10 frequent terms in positive reviews and the top 10 frequent terms in negative reviews had no terms in common. The terms removed in this process were as follows: “dress”, “fit”, “love”, “size”, “top”, “wear”, “color”, “fabric”, “cute”, “shirt”, “run”, “pretty”, “beautiful”, “short”, “sweater”, “material”, “nice” and “buy”.
The 10 most frequent terms in all documents and per category, positive and negative, are presented in Figure 3. A naive visual inspection would indicate that the most frequent terms in the positive reviews are more positive than those in the negative reviews, which is confirmed by the count of positive and negative terms in each of these groups. The classification of terms as positive or negative can be assessed with a sentiment lexicon, such as the Bing sentiment lexicon available through the textdata package [28] in the R Program [20]. The positive terms presented in Figure 3 are “perfect”, “flatter”, “soft” and “comfortable”, all more frequent in the positive reviews. Only “disappoint” is a negative term according to the Bing lexicon, while the other terms were considered neutral. We believe that, in the specific e-commerce context, the term “return” can be considered a negative term.
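The iterative removal can be sketched in Python as follows. The function name, the toy frequencies and the use of k = 2 instead of 10 are assumptions made to keep the example short; the sketch only compares the positive and negative top-k lists.

```python
from collections import Counter


def remove_shared_top_terms(pos_counts, neg_counts, k=10):
    """Repeatedly drop terms that are in the top-k of both categories."""
    pos, neg = Counter(pos_counts), Counter(neg_counts)
    removed = []
    while True:
        top_pos = {term for term, _ in pos.most_common(k)}
        top_neg = {term for term, _ in neg.most_common(k)}
        shared = top_pos & top_neg
        if not shared:
            return pos, neg, removed
        for term in sorted(shared):
            removed.append(term)
            pos.pop(term, None)
            neg.pop(term, None)


# Toy term frequencies (hypothetical) for positive and negative reviews.
pos_counts = {"dress": 50, "fit": 40, "perfect": 30, "soft": 20}
neg_counts = {"dress": 45, "return": 35, "fit": 33, "cheap": 10}
pos, neg, removed = remove_shared_top_terms(pos_counts, neg_counts, k=2)
print(removed)
```

Removing a shared term promotes the next most frequent term into the top-k list, which is why the comparison has to be repeated until the two lists are disjoint.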
The 100 most frequent terms were selected from the remaining terms. The final Document-term matrix was a 5618 × 100 matrix, applied as the input data for the classification methods presented below.
Figure 3. Most Frequent Words within the Training Dataset. (a) Complete Data; (b) Positive Label; (c) Negative Label.
A simple Classification Tree (CT) was fitted according to the CART algorithm [16]. The Random Forest method was run with the sample size and the number of attributes sampled at each node held fixed. In order to test the method performance for different values of the parameter M, $M = 100$, 300 and 500 were adopted (RF100, RF300 and RF500). $M = 500$ is the default value of the randomForest package [19].
For the Gradient Boosting algorithm, a fixed learning rate and $t = 100$, 300 and 500 were used (GB100, GB300 and GB500), where t is the number of performed tree iterations. $t = 100$ is the default value for the gbm package [27], but for comparison purposes the same numbers of trees as in the Random Forest method were applied. The same was done for the XGBoost algorithm, with $t = 100$, 300 and 500 (XGB100, XGB300 and XGB500) and a fixed learning rate. For the Gradient Boosting and XGBoost algorithms, the remaining parameters were kept at the default values of the respective R Program [20] packages.
A total of 10 methods were run, as follows: CT, RF100, RF300, RF500, GB100, GB300, GB500, XGB100, XGB300 and XGB500. The results of quality measures for the training dataset are presented in Table 2 and in Figure 4. In the latter, the Classification Tree (CT) is indicated in blue, the Random Forest methods (RF100, RF300 and RF500) in red, the Gradient Boosting methods (GB100, GB300 and GB500) in yellow and the XGBoost methods (XGB100, XGB300 and XGB500) in green.
The three Random Forest methods exhibited the best results for all quality measures. The number of trees did not influence the performance of this method. The XGBoost method ranked second, also with good results. However, unlike the Random Forest method, the higher the number of trees, the better the quality measures of the training data.
Table 2. Evaluation of the different models applied to the training dataset.
The Gradient Boosting algorithm presented quality measures above 73% for all metrics, with the behavior not very sensitive to variations in the number of trees. The Classification Tree method presented the worst results and a high Specificity combined with low Recall value. This indicates that the Classification Tree method can detect negative classes but not positive ones.
Table 3 displays the 10 most relevant terms for each method. The first four most important terms were the same for all methods: “return”, “disappoint”, “perfect” and “comfortable”. In addition, three other terms appear in common in the list of the top 10 most important terms for all methods, i.e., “jeans”, “unflattering” and “look”. Therefore, a consensus is noted between the applied methods concerning important classification terms.
The out-of-sample results are presented in Table 4 and in Figure 5. The color scheme in Figure 5 is the same as in Figure 4.
All methods showed similar results, except for the Classification Tree method. Even the Random Forest and XGBoost, which performed better on the training dataset, no longer exhibit advantages over the other methods. This may indicate that these models are overfitting the training data, especially the Random Forest and XGBoost models with many trees.
Figure 4. Evaluation of the different models applied to the training dataset.
Table 3. The ten most relevant terms per method.
Table 4. Evaluation of the different models applied to the testing dataset.
Figure 5. Evaluation of the different models applied to the test dataset.
The Classification Tree results are also exhibited in Table 4 and Figure 5. This model exhibited the best Specificity value and the worst Recall value. As in the training dataset, the combination of high Specificity and low Recall indicates that this method is not able to identify positive reviews.
It is important to note the good performance of the Gradient Boosting method. Its in-sample quality measures are close to the out-of-sample quality measures, with values above 77% for all metrics, indicating its real classification capability.
8. Conclusions
A Text Mining analysis of consumer reviews, i.e., free text, referring to recommended or non-recommended products, was performed. The goal was to predict whether a product is recommended by the consumer based on their review alone. Tree-based classification methods were applied to a balanced training dataset containing 5840 reviews, with the test dataset containing 2504 reviews. The training and test datasets were randomly selected.
The Text Mining analysis indicated that some very frequent terms were non-significant, as they appeared very frequently in both positive and negative reviews. These terms, eliminated before applying the classification methods, were as follows: “dress”, “fit”, “love”, “size”, “top”, “wear”, “color”, “fabric”, “cute”, “shirt”, “run”, “pretty”, “beautiful”, “short”, “sweater”, “material”, “nice” and “buy”.
The 10 most frequent words in the positive reviews were “perfect”, “flatter”, “soft”, “comfortable”, “bite”, “length”, “jeans”, “purchase”, “pant” and “skirt”. On the other hand, the 10 most frequent words in the negative reviews were “return”, “quality”, “disappoint”, “cut”, “picture”, “waist”, “arm” and “retailer”. A naive visual inspection would seem to indicate that the most frequent terms in the positive reviews are more positive than the most frequent terms in the negative reviews. A simple sentiment analysis demonstrated that, considering the Bing sentiment lexicon, the positive terms among these words were among the most frequent in the positive reviews, and the only negative one was among the most frequent in the negative reviews.
The Classification Tree, Random Forest, Gradient Boosting and XGBoost methods were applied to classify documents from a supervised database. The Random Forest and Gradient Boosting methods are not very sensitive to a varying number of trees, while the XGBoost method is. The higher the number of trees, the more the XGBoost overfitted the training data. The results suggest that the Random Forest overfitted the training data independent of the number of trees, as M < 100 would not be appropriate for this parameter.
The Classification Tree presented high Specificity and low Recall in the test dataset. This means that this method is adequate for the detection of negative reviews, but cannot detect positive reviews. The Gradient Boosting was the most robust method, and the number of trees did not make much difference regarding the results. The Gradient Boosting exhibited similar metrics when comparing the training and test datasets. Considering the 500-tree model, all quality measures were above 77% in the test dataset.
Finally, the first four most important terms were the same for all methods, and three other terms appear in common in the list of the top 10 most important classification terms for all methods. This suggests a consensus between the applied methods regarding important classification terms. The seven common terms in the top 10 most important terms for the classifying methods were “return”, “disappoint”, “perfect”, “comfortable”, “jeans”, “unflattering” and “look”.
NOTES
1. Nick Brooks (2018). The women’s e-commerce clothing reviews. Retrieved 2019 from https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews.
2. Michal Měchura (2018). Lemmatization-lists. Retrieved 2019 from https://github.com/michmech/lemmatization-lists.