Prediction of After-Sales Behavior in E-Commerce Using Machine Learning Models
1. Introduction
The proliferation of e-commerce and the surge in online transactions have made predicting post-purchase behaviors, such as returns, exchanges, and refunds, a critical challenge for e-commerce platforms. According to the China Internet Network Information Center’s 54th Statistical Report on Internet Development in China, as of June 2024, China’s internet user base reached nearly 1.1 billion (1099.67 million), with an increase of 7.42 million from December 2023, and the internet penetration rate reached 78.0% [1]. Online shopping has become an integral part of daily life for many consumers. However, as e-commerce continues to grow, issues such as product returns and refunds have also become more prevalent. These after-sales behaviors not only waste merchants’ human, material, and financial resources but also pose significant challenges to their operations. Consequently, reducing the rate of returns and refunds and minimizing the occurrence of after-sales behaviors have become critical priorities for most e-commerce platforms.
Traditional methods of after-sales service management often rely on human experience and heuristic rules, which can lead to inefficiencies and a lack of personalization. Moreover, with the growth of big data, it is becoming increasingly difficult to process and analyze consumer behavior manually. In this context, leveraging machine learning models to predict after-sales behaviors has become a promising solution. Machine learning models, which can analyze large volumes of consumer data, hold the potential to accurately predict after-sales needs and optimize service strategies.
This study aims to construct a machine learning-based predictive system for after-sales behavior. By applying various machine learning algorithms, such as decision trees, random forests, and XGBoost, the study will predict consumers’ after-sales needs based on historical purchase and service data. The objectives of this research include collecting and analyzing consumer behavior data with a focus on after-sales actions, developing machine learning models to predict after-sales behaviors (e.g., returns, exchanges, complaints), evaluating the performance of these models and comparing the effectiveness of different algorithms, and proposing targeted strategies to improve after-sales services and customer satisfaction. Merchants can dynamically adjust shipping and insurance rates based on machine learning predictions, improving customer satisfaction while minimizing the impact of after-sales activities on operations.
The innovation of this study lies in the development of an automated model that uses machine learning to predict consumer after-sales behavior. The application of the model can reduce the additional costs of online shopping merchants, increase the profit margin of manufacturers, improve the efficiency of after-sales service, reduce the probability of consumers initiating after-sales behavior, and promote effective economic growth.
2. Related Work
In the realm of e-commerce, post-purchase behaviors—including returns, exchanges, and refunds—pose significant operational challenges. As e-commerce platforms seek to optimize after-sales services and mitigate the costs associated with returns, there has been a growing interest in employing machine learning techniques to predict these behaviors. This section reviews relevant studies that highlight the potential of predictive models not only to enhance customer satisfaction but also to inform strategic decision-making in return management.
Yang et al. [2] delved into the role of artificial intelligence in shaping online return policies, particularly in forecasting return volumes through machine learning. Their work underscored the efficacy of machine learning in anticipating return behaviors, thereby enabling businesses to refine their strategies and streamline after-sales services. Building on this, Chen and Ma [3] investigated the determinants of customer satisfaction within online return services. They utilized machine learning to forecast the propensity for returns, underscoring the criticality of comprehending post-purchase consumer dynamics for enhancing satisfaction and optimizing service delivery.
Da Veiga et al. [4] provided a comprehensive review of after-sales attributes in e-commerce, with a focus on returns, exchanges, and refunds. Their study illuminated how AI and machine learning can be leveraged to streamline these processes, bolster customer service, and strengthen retention strategies. Liu [5] explored the application of deep learning to forecast customer satisfaction in cross-border e-commerce, with an emphasis on enhancing return services. This research demonstrated the accuracy of machine learning models in predicting return behaviors, thus amplifying the efficiency of after-sales service processes.
Guan et al. [6] examined the factors influencing consumer satisfaction in the context of fresh produce e-commerce. They adopted a hybrid approach, combining LDA-SEM and XGBoost, to predict consumer satisfaction and return behaviors, thereby showcasing the efficacy of machine learning in enriching post-purchase consumer experiences in the e-commerce sector. Duong et al. [7] conducted a systematic literature review on the application of machine learning techniques to predict product returns in e-commerce. Their study identified various models, including decision trees and random forests, which have been shown to enhance the accuracy of predicting post-purchase behaviors such as returns and exchanges.
Sunarya et al. [8] compared the performance of two prevalent machine learning algorithms, Logistic Regression and Random Forest, in predicting e-commerce customer behavior, with a specific focus on customer churn. Their study utilized an E-commerce Customer dataset to delve into the intricacies of customer interactions and behaviors, assessing the predictive capabilities of each model in forecasting post-purchase behaviors like returns, exchanges, and complaints.
These studies collectively contribute to a burgeoning body of knowledge on consumer behavior prediction in e-commerce. They not only provide a foundation for our research but also highlight the need for further exploration, particularly in the context of diverse consumer behaviors and the evolving landscape of e-commerce. Our study aims to build upon these insights by developing a robust predictive model that can accurately forecast post-purchase behaviors, thereby offering valuable strategies for e-commerce platforms to enhance customer service and operational efficiency.
3. Data Sources and Processing
3.1. Data Introduction
The dataset for this analysis is sourced from the customer shopping order information of the “Damon Home Flagship Store” on TikTok Mall, which has been anonymized to ensure privacy and exported into an Excel table via the merchant interface. This selection, while providing a rich dataset, inherently limits the generalizability of our findings to other e-commerce platforms or product categories due to the unique characteristics of a single store.
We have defined a target column named “After-sales Behavior” within our dataset to categorize orders. An entry of “1” indicates the presence of after-sales behavior, which includes activities such as returns, exchanges, and refunds. An entry of “0” signifies the absence of such behavior, suggesting that the order was completed without any subsequent actions, and the product was received as intended.
We acknowledge that the specific attributes of the “Damon Home Flagship Store,” including its marketing strategies and customer demographics, may introduce biases into our dataset and affect the after-sales behaviors of consumers. To address these potential biases, we have taken measures to ensure that the dataset is as representative as possible within the scope of our study. Despite these efforts, we recognize that our findings may not be entirely transferable to other e-commerce contexts, and we encourage further research to explore the variability of after-sales behaviors across different platforms and product types.
The dataset contains a total of 200,270 orders placed at the store between January 1, 2023 and June 6, 2023; a sample of the order records is shown in Figure 1:
Figure 1. Partial order information diagram.
The original data contains 12 characteristic variables: order number, product category, region, payment method, order submission time, payment completion time, product quantity, product amount, shipping cost, total discount amount, order status, and after-sales status. Among these, the after-sales status is the target variable, taking the values “after-sales application” and “no after-sales application”. The task is to use machine learning models to predict the after-sales status from the remaining variables in the sample data.
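To make this setup concrete, the sketch below shows one way the exported order table could be loaded and the binary target constructed. It is a minimal sketch: the file name and the English column names are illustrative assumptions, since the actual export uses the store’s own headers.

```python
import pandas as pd

# Hypothetical file and column names; the real export uses the store's own headers.
orders = pd.read_excel("damon_home_orders.xlsx")

# Target column "After-sales Behavior": 1 if the order has an after-sales
# application (return, exchange, or refund), 0 otherwise.
orders["after_sales_behavior"] = (
    orders["after_sales_status"].eq("after-sales application").astype(int)
)

# Quick check of the class distribution.
print(orders["after_sales_behavior"].value_counts(normalize=True))
```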
3.2. Data Preprocessing
Data preprocessing is a critical step in machine learning that involves transforming raw data into a format suitable for modeling. This process includes several techniques, such as feature derivation and one-hot encoding, which are essential for converting categorical variables into numerical data [9]. Feature derivation is an important technique in machine learning data preprocessing, as it can improve the performance of the model by converting or combining some variables of the original data into new features so that the model can better capture the information in the data [10]. One-hot encoding, in particular, is a common method used to convert categorical variables into numerical data, which is essential for machine learning models that require numerical input [11].
3.2.1. Feature Derivation
Feature derivation is a critical step in machine learning data preprocessing that can significantly enhance model performance by transforming or combining original data variables into new features that better capture the underlying patterns and information relevant to the prediction task [10]. In this study, we derived new features from the original dataset based on domain knowledge and statistical analysis to improve the predictive power of our machine learning models.
From the original features we derive the three new features shown in Table 1, which improve the accuracy of the model predictions.
Table 1. Derived features and descriptions.
| Original Features | Derived Features | Derived Feature Description |
| --- | --- | --- |
| Area | Regional Development Level | Divided into 5 levels according to the region’s GDP in 2023. 5: Shanghai, Beijing; 4: Guangdong, Jiangsu, Zhejiang, Fujian; 3: Shandong, Hubei, Shaanxi, Chongqing, Inner Mongolia; 2: Hunan, Anhui, Jiangxi, Shanxi; 1: other areas |
| Purchase Time | Shopping Time | Shopping periods defined by the order submission time. 1: 00:00–06:00; 2: 06:00–12:00; 3: 12:00–18:00; 4: 18:00–24:00 |
| Product Amount, Discount Amount | Discount Ratio | The ratio of the total discount amount to the product amount |
We carefully selected features for derivation based on their potential to reveal hidden patterns in consumer behavior that could influence after-sales actions. Each feature was chosen to provide unique insights into consumer decision-making processes. For instance, the “Regional Development Level” feature was derived from the “Area” variable, categorizing regions into different levels based on their GDP in 2023. This feature was included because economic development levels have been shown in studies such as Lichtenstein, Drumwright, and Braig [12], as well as de Mooij and Hofstede [13], to correlate with consumer behavior and financial capabilities, which in turn can affect the likelihood of returns or exchanges. Additionally, the “Shopping Time” feature was derived to capture the time of day when orders were placed, as purchase timing can indicate consumer urgency or convenience-driven behavior, potentially impacting the likelihood of after-sales interactions. Prior research suggests that the time of day when consumers make purchases can be indicative of their decision-making processes and may influence post-purchase actions, such as the propensity for returns or exchanges.
These derived features add information not captured by the original variables and improve the models’ ability to distinguish customers who engage in after-sales behaviors from those who do not, yielding higher accuracy, F1 score, and AUC than models trained on the original features alone.
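A minimal pandas sketch of the three derived features from Table 1 is shown below. The column names and the GDP-tier mapping are illustrative assumptions based on the table, not the study’s exact implementation.

```python
import pandas as pd

# 2023 GDP tiers from Table 1; provinces not listed default to tier 1.
GDP_TIER = {
    "Shanghai": 5, "Beijing": 5,
    "Guangdong": 4, "Jiangsu": 4, "Zhejiang": 4, "Fujian": 4,
    "Shandong": 3, "Hubei": 3, "Shaanxi": 3, "Chongqing": 3, "Inner Mongolia": 3,
    "Hunan": 2, "Anhui": 2, "Jiangxi": 2, "Shanxi": 2,
}

def derive_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Regional development level: map the area to its GDP tier, default 1.
    out["regional_development_level"] = out["area"].map(GDP_TIER).fillna(1).astype(int)
    # Shopping time: bucket the order submission hour into four 6-hour periods.
    hour = pd.to_datetime(out["order_submit_time"]).dt.hour
    out["shopping_time"] = hour // 6 + 1  # 1: 00-06, 2: 06-12, 3: 12-18, 4: 18-24
    # Discount ratio: total discount amount relative to the product amount.
    out["discount_ratio"] = out["discount_amount"] / out["product_amount"]
    return out

orders = derive_features(orders)
```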
3.2.2. One-Hot Encoding
One-hot encoding is a common method for converting categorical variables into numerical data for machine learning models [11]. The core idea is to represent each category as a binary vector whose length equals the total number of categories, with a 1 in the position corresponding to that category and 0 everywhere else. Here, we apply one-hot encoding to the payment method (Figure 2).
Figure 2. Payment method One-Hot Encoding.
After one-hot encoding, the categorical features are represented numerically and can be fed to the machine learning models for training.
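For example, the payment method column can be one-hot encoded in a single pandas call; the column name and prefix below are assumptions.

```python
import pandas as pd

# Expand the payment method into one binary column per category,
# e.g. pay_Alipay, pay_WeChat (actual category names depend on the data).
orders = pd.get_dummies(orders, columns=["payment_method"], prefix="pay", dtype=int)
```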
3.2.3. Outlier Processing
The Z-Score method, a standard score measure, is widely used to identify outliers in a dataset. It calculates the number of standard deviations an element is from the mean, facilitating the detection of data points that deviate significantly from the norm [14].
After computing the Z-scores, we removed the samples whose absolute Z-score exceeds 3. This outlier processing removed 1,300 records, leaving 198,890 samples.
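A sketch of this Z-score filter follows; the list of numeric columns is an assumption for illustration.

```python
# Z-score outlier removal: drop rows whose absolute Z-score exceeds 3
# on any of the numeric columns (column names are assumptions).
numeric_cols = ["product_quantity", "product_amount", "shipping_cost", "discount_amount"]
z_scores = (orders[numeric_cols] - orders[numeric_cols].mean()) / orders[numeric_cols].std()
orders = orders[(z_scores.abs() <= 3).all(axis=1)]
```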
3.2.4. Data Normalization
The Min-Max Normalization method is employed to scale numerical data within the interval [0, 1], which is particularly beneficial for features with varying scales. This method ensures that all features contribute equally to the model, regardless of their original scale or range. For a feature column $x$, the normalization formula is given by:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}},$$

where $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of the column.
This process is crucial as it eliminates the impact of inconsistent dimensions across features, thereby improving the effectiveness of the model [15]. By normalizing the data, we ensure that no single feature dominates the model due to its scale, which is especially important in machine learning algorithms where feature scaling can significantly affect performance.
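A short sketch with scikit-learn’s MinMaxScaler is shown below; in practice the scaler should be fit on the training split only, to avoid leakage into the test data.

```python
from sklearn.preprocessing import MinMaxScaler

# Scale each numeric column to [0, 1]: x' = (x - x_min) / (x_max - x_min).
scaler = MinMaxScaler()
orders[numeric_cols] = scaler.fit_transform(orders[numeric_cols])
```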
In summary, after feature derivation, one-hot encoding, outlier processing, and normalization, we obtain the data required for the subsequent analysis, as shown in Figure 3:
Figure 3. Processed data.
4. Models and Methods
4.1. Logistic Regression
Logistic Regression (LR) is a fundamental model for binary classification problems. It estimates outcome probabilities and is fitted by maximum likelihood. Agresti detailed the statistical foundations of logistic regression, making it a suitable choice for our task due to its simplicity and interpretability [16]. LR predicts the probability that an individual belongs to a given category, such as whether or not an order leads to after-sales behavior, by passing a linear combination of the features through the logistic (sigmoid) function.
The mathematical expression of this function is as follows:

$$P(y = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n)}},$$

where $x_1, \ldots, x_n$ are the input features, $\beta_0, \beta_1, \ldots, \beta_n$ are the model parameters, and $e$ is the base of the natural logarithm. $P(y = 1 \mid \mathbf{x})$ is the probability that the target variable $y$ equals 1 given the features $\mathbf{x}$.
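The sketch below mirrors the formula directly and then hands parameter estimation to scikit-learn; the `class_weight` setting is one possible way to handle the class imbalance discussed in Section 4.5, not necessarily the study’s exact configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid_probability(X: np.ndarray, beta0: float, beta: np.ndarray) -> np.ndarray:
    """P(y = 1 | x) as the sigmoid of a linear combination of the features."""
    return 1.0 / (1.0 + np.exp(-(beta0 + X @ beta)))

# In practice the parameters are fitted by maximum likelihood:
lr = LogisticRegression(max_iter=1000, class_weight="balanced")
# lr.fit(X_train, y_train); lr.predict_proba(X_test)
```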
4.2. XGBoost
eXtreme Gradient Boosting (XGBoost) is an efficient and scalable implementation of the gradient boosting framework. Chen and Guestrin developed XGBoost, which is known for its ability to handle large datasets and provide state-of-the-art performance in many machine learning tasks [17]. The key advantage of XGBoost lies in its ability to model complex, non-linear relationships and handle large datasets efficiently. The algorithm minimizes a combination of loss and regularization terms, effectively preventing overfitting and improving model generalization.
The objective function of XGBoost is given by:

$$\mathcal{L} = \sum_{i} l\bigl(y_i, \hat{y}_i\bigr) + \sum_{k} \Omega(f_k),$$

where $l$ is the loss function and $\Omega$ is the regularization term. The loss function is typically cross-entropy for classification problems, and the regularization term penalizes overly complex models to avoid overfitting.
By integrating features such as parallel computation, tree pruning, and regularization, XGBoost can achieve high predictive accuracy, making it a powerful tool in a wide range of machine learning applications, including e-commerce predictions, customer churn analysis, and medical diagnostics [18] [19].
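A hedged configuration sketch using the xgboost package is shown below; the hyperparameter values are illustrative defaults, not tuned values reported by the study.

```python
from xgboost import XGBClassifier

xgb_model = XGBClassifier(
    n_estimators=300,        # number of boosted trees
    max_depth=6,             # tree depth controls model complexity
    learning_rate=0.1,
    reg_lambda=1.0,          # L2 regularization (the Omega term) on leaf weights
    scale_pos_weight=4.0,    # rough majority/minority ratio for class imbalance
    eval_metric="logloss",   # cross-entropy loss for binary classification
)
# xgb_model.fit(X_train, y_train)
```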
4.3. Decision Tree
The basic idea of a Decision Tree (DT) is to classify or predict data by repeatedly “splitting” on features. When building a decision tree, a feature is first selected for splitting so that the resulting subsets become more “pure” (that is, samples of the same class are grouped together). Splitting continues until a stopping condition is met, such as reaching the maximum tree depth or a leaf node containing fewer samples than a given threshold. Because the dataset used here is large, has many features, requires high computational efficiency, and exhibits class imbalance, this study uses the Gini index as the splitting criterion for the decision tree model.
The Gini Index is also a criterion for selecting the best feature to split on, used in particular by CART models. For a dataset $D$ with $k$ classes, the Gini Index is calculated as:

$$\mathrm{Gini}(D) = 1 - \sum_{i=1}^{k} p_i^2,$$

where $p_i$ is the proportion of the $i$-th class in the dataset $D$. For a feature $A$, the Gini index of splitting $D$ on $A$ is:

$$\mathrm{Gini}_A(D) = \sum_{v} \frac{|D_v|}{|D|}\,\mathrm{Gini}(D_v),$$

where $D_v$ is the subset of the data in which feature $A$ takes the value $v$, and $|D|$ is the total number of samples in the dataset. The feature and threshold that result in the lowest Gini Index are selected for the split at each node of the tree.
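The Gini computation can be expressed in a few lines of NumPy. This is a didactic sketch of the two formulas above, not the implementation used by any particular library.

```python
import numpy as np

def gini(labels: np.ndarray) -> float:
    """Gini(D) = 1 - sum_i p_i^2 for a vector of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p ** 2))

def gini_for_split(feature: np.ndarray, labels: np.ndarray) -> float:
    """Weighted Gini index of splitting the dataset by the values of one feature."""
    total = len(labels)
    score = 0.0
    for v in np.unique(feature):
        mask = feature == v
        score += mask.sum() / total * gini(labels[mask])
    return score
```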
Pseudocode for the decision tree construction is as follows:
Algorithm: Decision Tree
Input: Training set D; attribute set A.
Procedure: Function TreeGenerate(D, A)
1: Create a node node;
2: if all samples in D belong to the same class C then
3:    Label node as a class-C leaf node; return
4: end if
5: if A is empty OR all samples in D have the same value for every attribute in A then
6:    Label node as a leaf node with the most common class in D; return
7: end if
8: Select the best splitting attribute a* from A;
9: for each value v of attribute a* do
10:   Create a branch for node; let D_v be the subset of D where attribute a* takes the value v;
11:   if D_v is empty then
12:     Label the branch node as a leaf node with the most common class in D; return
13:   else
14:     Use TreeGenerate(D_v, A \ {a*}) as the branch node;
15:   end if
16: end for
Output: A decision tree with node as the root.
The Gini Index is computationally efficient and easy to interpret. It is particularly suitable for large datasets and is commonly used in binary classification problems. Unlike entropy-based measures, the Gini Index tends to be faster because it doesn’t involve logarithmic calculations. Moreover, it is robust to class imbalances, which makes it a good fit for many real-world classification tasks, including e-commerce behavior prediction [20] [21].
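As a sketch, such a tree can be configured with scikit-learn using the Gini criterion; the paper does not specify the implementation or the stopping-condition values, so the settings below are illustrative assumptions.

```python
from sklearn.tree import DecisionTreeClassifier

dt_model = DecisionTreeClassifier(
    criterion="gini",        # Gini index as the splitting criterion
    max_depth=10,            # stopping condition: maximum tree depth (illustrative)
    min_samples_leaf=50,     # stopping condition: minimum samples per leaf (illustrative)
    class_weight="balanced", # one option for handling the class imbalance
)
# dt_model.fit(X_train, y_train)
```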
4.4. Random Forest
Random Forest (RF) is an ensemble learning method that improves prediction accuracy and controls overfitting by building multiple decision trees and outputting average results. Random Forest performs well for problems of various data types and sizes, especially when dealing with high-dimensional data and features with complex interactions [22].
Random Forest uses bootstrap resampling to randomly draw n samples with replacement from the original training set to form a new training set. For each bootstrap sample, a decision tree is built from the features, with the best split point determined at each node. This process is repeated many times to generate m decision trees. Finally, the test set is predicted using the bagging strategy with a majority voting mechanism. The learning process of the random forest algorithm is shown in Figure 4 below:
Figure 4. Random forest flowchart.
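A corresponding sketch with scikit-learn’s RandomForestClassifier follows; the number of trees and the other settings are illustrative, not the study’s tuned values.

```python
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(
    n_estimators=200,        # number of bootstrap-sampled trees (m in the text)
    max_features="sqrt",     # random subset of features considered at each split
    class_weight="balanced", # one option for handling the class imbalance
    n_jobs=-1,               # train the trees in parallel
)
# Predictions are combined by majority voting inside rf_model.predict().
```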
4.5. Class Imbalance
Class imbalance is a significant issue in many predictive models, particularly when one class of the target variable, such as after-sales behaviors (e.g., returns and refunds), occurs less frequently than the other class (no after-sales behavior). By counting the target column, we found that orders without after-sales behavior accounted for 80.6% of the data, indicating a clear class imbalance (Figure 5).
This often leads to decreased model performance, as the model tends to be biased towards the majority class. To address this issue and improve the accuracy of our predictions, we apply various techniques to balance the dataset, ensuring that the model can better learn to predict both classes effectively. These methods help mitigate the negative impact of class imbalance and enhance the overall performance of the model.
Figure 5. Proportion of orders with after-sales behavior.
1) Oversampling: increase the number of minority-class samples, for example by duplicating existing samples or, as in this study, by using SMOTE (Synthetic Minority Over-sampling Technique) to synthesize new minority-class samples.
2) Undersampling: reduce the number of majority-class samples to balance the class distribution.
3) Class weighting: when training the logistic regression, decision tree, random forest, and XGBoost models, assign a larger weight to the minority class so that the model pays more attention to it (a code sketch of these three options follows the list).
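The sketch below illustrates the three options, assuming the imbalanced-learn package for the resampling steps and pre-split training arrays `X_train` and `y_train`.

```python
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# 1) Oversampling: synthesize new minority-class samples with SMOTE.
X_over, y_over = SMOTE(random_state=42).fit_resample(X_train, y_train)

# 2) Undersampling: randomly drop majority-class samples.
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X_train, y_train)

# 3) Class weighting: handled inside the models themselves, e.g.
#    class_weight="balanced" for scikit-learn estimators or
#    scale_pos_weight for XGBoost, as in the earlier sketches.
```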
4.6. Evaluation
We divide the dataset into training and testing sets to train and evaluate the models. Using the selected features, we train logistic regression, decision tree, random forest, and XGBoost models respectively. We make predictions for each model and evaluate its performance using metrics such as Accuracy, F1 Score, and AUC.
1) Accuracy: Accuracy is one of the simplest evaluation metrics; it is the proportion of correctly classified samples among all samples.
2) Precision: Precision represents the proportion of samples that are actually positive among all samples predicted to be positive. It can help us evaluate the accuracy of the model in predicting positive samples.
3) Recall: The recall rate indicates the proportion of samples that are correctly predicted as positive among all samples that are actually positive.
4) F1-Score: The F1 score is the harmonic mean of precision and recall, which takes into account the balance between precision and recall. The F1 score ranges from 0 to 1, and the larger the value, the better the model performs in predicting the positive class.
5) Area Under Curve (AUC): AUC is the area under the ROC curve, which indicates the ability of the model to distinguish between different categories. The value of AUC ranges from 0 to 1, and the higher the value, the stronger the model’s ability to distinguish. When the AUC value is 0.5, it means that the model is no different from random guessing; when the AUC value is 1.0, it means that the model can perfectly distinguish between the two categories.
6) Receiver Operating Characteristic Curve (ROC): The ROC curve plots the relationship between the false positive rate (FPR) and the true positive rate (TPR), and the AUC value is the area under the ROC curve.
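Assuming a fitted model and a held-out test split (`model`, `X_test`, `y_test` are placeholders), these metrics can be computed with scikit-learn as follows.

```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
)

y_pred = model.predict(X_test)              # hard class predictions
y_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class

metrics = {
    "accuracy":  accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall":    recall_score(y_test, y_pred),
    "f1":        f1_score(y_test, y_pred),
    "auc":       roc_auc_score(y_test, y_prob),   # AUC uses probabilities, not labels
}
print(metrics)
```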
5. Main Results
In this study, we evaluated four machine learning models—Logistic Regression, Decision Tree, Random Forest, and XGBoost—to predict after-sales behaviors in online shopping, such as returns, exchanges, and refunds. The dataset, which is imbalanced with a higher frequency of non-return behaviors, poses a significant challenge to model performance. To address this, we applied class balancing techniques, including SMOTE oversampling and the use of class weights. The models were rigorously evaluated using five-fold cross-validation, with performance assessed through Accuracy, F1-Score, and ROC curve analysis.
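One way to realize this evaluation protocol is sketched below, using imbalanced-learn’s pipeline so that SMOTE is applied only inside each training fold; the estimator and its settings are illustrative, and `X`, `y` stand for the preprocessed features and target.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# SMOTE is fit on each training fold only; the validation fold stays untouched.
pipeline = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("model", RandomForestClassifier(n_estimators=200, n_jobs=-1)),
])

scores = cross_validate(pipeline, X, y, cv=5, scoring=["accuracy", "f1", "roc_auc"])
print({k: v.mean() for k, v in scores.items() if k.startswith("test_")})
```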
5.1. Model Performance Summary
After rigorous testing, the performance metrics of the four models are encapsulated in the confusion matrices and AUC values presented in Figure 6 and Figure 7. These figures provide a visual representation of each model’s ability to distinguish between after-sales behaviors and non-after-sales behaviors.
The confusion matrices in Figure 6 offer a detailed look at the true positives, false positives, true negatives, and false negatives for each model, serving as a critical tool for assessing classification performance.
Building on the insights from the confusion matrices, Figure 7 displays the ROC curves for our models, measuring their ability to discriminate between different classes. The Area Under the Curve (AUC) is a key metric that quantifies the overall performance of the models, especially in the context of imbalanced datasets.
Figure 6. Confusion matrix diagram of four models.
Figure 7. ROC curve of machine learning prediction results.
Summarizing the data from these visual representations, Table 2 reports the mean accuracy, R2, F1 score, and AUC for each model.
Table 2. Prediction results including accuracy, R2, F1 Score, and AUC.
| Model | Mean Accuracy | Mean R2 | Mean F1 Score | Mean AUC |
| --- | --- | --- | --- | --- |
| Logistic Regression | 0.5794 | 0.5794 | 0.6152 | 0.6126 |
| XGBoost | 0.5184 | 0.5184 | 0.6743 | 0.6582 |
| Decision Tree | 0.5956 | 0.5956 | 0.6166 | 0.6376 |
| Random Forest | 0.6025 | 0.6025 | 0.6262 | 0.6467 |
To provide a comprehensive overview of the models’ performance, we have included a radar chart (Figure 8) that visualizes the mean accuracy, mean R2, mean F1 score, and mean AUC for each model. This chart offers a clear and concise comparison of the models’ performance across different metrics.
The radar chart (Figure 8) reinforces our findings: Random Forest achieves the highest accuracy and performs strongly on F1 score and AUC, while XGBoost leads on F1 score and AUC despite its lower accuracy. XGBoost, however, is less suitable for large-scale data analysis due to its higher complexity and computational cost. Random Forest, on the other hand, offers consistent performance across all metrics and is more efficient in terms of speed and training time, making it the optimal choice for predicting post-purchase behavior in e-commerce contexts.
In summary, Random Forest performed the best overall, achieving a mean accuracy of 0.6025, F1 score of 0.6262, and AUC of 0.6467. Random Forest’s robust performance across all metrics suggests it is the most reliable choice for this problem, demonstrating strong predictive power and stability without overfitting. XGBoost, while having a lower accuracy (0.5184), excelled in F1 score (0.6743) and AUC (0.6582), showing its ability to effectively balance precision and recall, especially in class-imbalanced scenarios. Decision Tree showed comparable results to Logistic Regression, with a slight improvement in accuracy (0.5956) and AUC (0.6376), indicating that it could be useful in cases where model interpretability is critical. Logistic Regression performed reasonably well but had relatively lower AUC and F1 scores compared to the more complex models, with a mean accuracy of 0.5794, suggesting that while it is suitable for simpler tasks, it may not be the best choice for imbalanced datasets or non-linear relationships.
Figure 8. Model performance radar chart.
5.2. Insights and Recommendations
In conclusion, our analysis shows that Random Forest and XGBoost outperform Logistic Regression and Decision Tree in terms of F1 score and AUC, with Random Forest also achieving the highest accuracy. While XGBoost offers the best balance between precision and recall, especially in the presence of class imbalance, it is less suitable for large-scale data analysis such as e-commerce order streams due to its higher complexity and computational cost. Random Forest, on the other hand, performs consistently well across all metrics and is more efficient in terms of speed and training time, offering a lower computational cost. Overall, Random Forest provides the best combination of performance, speed, and efficiency, making it the optimal choice for predicting post-purchase behavior in e-commerce contexts.
6. Conclusions
In this paper, we apply four machine learning models, logistic regression, decision tree, random forest, and XGBoost, to predict post-purchase behavior in online shopping, focusing on returns, exchanges, and refunds. The results show that while logistic regression provides a useful baseline, it handles class imbalance poorly compared to the more complex models. XGBoost performs well in terms of F1 score and AUC and excels in scenarios that require high precision and recall, but it carries a high computational cost. Decision tree models, while effective, are better suited to applications that prioritize model interpretability. The random forest model outperforms the other models with the highest accuracy and stability, making it the most reliable choice for handling complex datasets with class imbalance.
While this study provides valuable insights, there are several limitations to consider. The current dataset is limited in size and may not fully capture the complexities of consumer behavior, especially in the context of more personalized factors. Additionally, the use of oversampling methods like SMOTE may introduce noise or lead to overfitting in some cases. Future research could focus on exploring alternative oversampling strategies, integrating more advanced features (such as customer sentiment or browsing behavior), and refining the machine learning models. Incorporating ensemble methods or hybrid models could further enhance predictive accuracy. Moreover, improving model efficiency, particularly for computationally intensive algorithms like XGBoost, would allow for more scalable real-time applications.
The findings of this study have significant practical implications for e-commerce platforms aiming to optimize their after-sales services. By leveraging machine learning models, merchants can better predict post-purchase behaviors, such as returns and exchanges, and develop targeted strategies to address these behaviors. For example, merchants can dynamically adjust shipping fees and insurance rates based on predicted after-sales behavior, minimizing operational costs while improving customer satisfaction. These strategies can lead to more efficient resource allocation, better customer retention, and overall enhanced service quality.
In conclusion, the application of machine learning models, particularly Random Forest and XGBoost, offers an effective approach to predicting and managing after-sales behavior in online shopping. These models provide valuable insights that can help e-commerce platforms improve their customer service strategies, reduce operational inefficiencies, and ultimately enhance the consumer experience.