Analysis of Changes in Customers’ Market Basket Across Different Branches of a Chain Store Using Association Rules Technique and Its Impact on Product Placement: A Case Study of a Chain Store in Various Areas of Tehran
1. Introduction
To date, researchers have provided many definitions of market basket analysis (MBA). Market basket analysis is a practical method for finding customer purchase patterns by examining items that frequently occur together in the transaction databases of stores [1]. It provides insights into customer consumption patterns and industry trends. Information about customer buying habits can help sellers choose which products or services to offer. The main goal of market basket analysis is to improve the retailer’s commercial position [2]. It empowers marketing and sales organizations to make better-informed decisions about where and how to deploy their efforts and resources. Millions of transactions occur in retail businesses, and the need to analyze them for higher profits has driven the adoption of market basket analysis [3]. In addition, MBA can make shopping easier for visitors because products usually purchased together can be placed near each other [4]. By analyzing the market basket, recurring patterns can be found for offering related products together, thus increasing sales. Related products can also be placed so that customers easily find items they might buy together, which increases customer satisfaction [5]. Moreover, studying consumer behavior is a complex and challenging task that requires a deep understanding of the factors influencing customer decision-making. Customers are influenced by cultural, social, economic, and personal factors, which often makes their behavior difficult to predict. Analyzing and understanding consumer behavior allows retailers to stay ahead of competitors, respond better to market changes, and remain competitive in the ever-changing retail sector [6]. Previous research has shown that basket analysis depends on various factors, including the season [7], dates and occasions, geographical location [8], and type of goods [1].
According to the study by Chen et al., in a multi-store environment, a product may sell out of one store due to geographical, political, or environmental issues.
A 2022 study by Formánek and Sokol indicates that the geographical location of a store is a crucial factor that affects the volume and structure of sales. Understanding the complexity of location effects on sales dynamics and using such information may be a key element of a company’s success in a competitive market environment. Generally, the geographical location of a store can be characterized by geographical-spatial and socio-demographic features. Distinct geographical location factors can have different effects on the sales of various products [7]. Additionally, a study conducted in 2018 in urban and rural areas of Bandar Abbas found that the factors influencing customer satisfaction and purchase in urban areas differ from those in rural areas [9]. In another study, a recommendation system based on the customer market basket was implemented. According to the suggestions and hypotheses stated in that study, implementing the algorithm in different geographical locations may affect its results [2]. Therefore, this paper proposes a comprehensive framework based on machine learning for analyzing consumer behavior trends, gaining insights into the data, and enabling data-driven decision-making. To create the framework, we analyzed the transactions of the store’s customers and identified and studied the market baskets of loyal customers. By providing retailers with insights into consumer behavior, the proposed framework is expected to draw attention to the differences in customer market baskets across their branches. For the present research, note that Tehran has 22 districts and that the chain store under study has branches in all of them. In addition, we assume that different stores can have different product combinations over different periods.
This means that each store can have its own product combination, and the product combination in a store can change dynamically over time. It has been shown that when an organization understands consumers’ buying habits, it becomes easier to improve its business performance indicators [6].
Our research aims to address this point, and the results and recommendations are intended to improve marketing performance indicators such as customer satisfaction and loyalty. In this study, the impact of different geographical areas is examined only in terms of the difference in product associations across branches. The research also focuses on a similar category of goods and a similar brand in all branches. It is centered on applying machine learning technologies in Tehran’s retail sector. This paper is organized as follows: Section 2 reviews and defines the problem in Tehran chain stores. Section 3 discusses the algorithms proposed for examining customer market baskets and the method of clustering loyal customers. Section 4 presents the case study results and discusses the findings. Finally, Section 5 provides conclusions, and Section 6 outlines directions for future work and research.
2. Problem Definition
As explained in the previous section, the current retail environment is highly challenging and competitive due to market diversity, price change pressures from discounts, increased price transparency, and competition among companies. Traditional approaches for strategic pricing differentiation and product-related promotions are now more practical in the retail industry. In this competitive environment, treating customers as the company’s main asset increases the organization’s value. The retail sector is a complex and ever-evolving market that relies heavily on customer behavior, which, as noted in the Introduction, is difficult to predict. Many product pairs that consumers purchase together are generally known. However, because a typical supermarket contains hundreds of items bought by thousands of customers purchasing numerous products, understanding the less obvious related product pairs becomes difficult.
Definition 1: If in branch A of the store item x is placed next to item y, then in branches B and C item x is also placed next to item y.
Definition 2: The decision to place item x next to item y has been made entirely traditionally and without using association rules.
Based on Definitions 1 and 2, it has been identified in one of the chain stores in Tehran that all products on the shelves and floors of the branches are arranged uniformly, and no association rules are used in any of these branches.
3. Methods
3.1. FP-Growth Algorithm
There are many algorithms for Frequent Itemset Mining (FIM). Apriori and FP-Growth are the most fundamental FIM algorithms. Table 1 summarizes the fundamental differences between them. According to studies [10] [11], FP-Growth is recognized as a popular mining algorithm. It scans the database only twice and efficiently discovers all frequent itemsets, particularly compared to the Apriori algorithm. FP-Growth has three strengths.
Firstly, FP-Growth compresses the entire database into a relatively small data structure (the FP-tree), so the database needs to be scanned only twice. Secondly, it uses a frequent pattern growth approach that avoids generating many candidate itemsets. Thirdly, it explores the tree layer by layer to discover frequent itemsets, which reduces computational complexity. Experimental results show that FP-Growth is faster than the Apriori algorithm and several other frequent itemset mining methods [10] [12]. In this paper, we use the FP-Growth algorithm to analyze the transactional data of the chain store under study.
Table 1. Comparison of the two algorithms [13].
FP-Growth | Apriori
Scans the database only twice, making it fast | Scans the database multiple times, making it slow
Used when database data is large | Used when database data is small
Stores a set of conditional FP-trees for each item in memory | Stores a transformed version of the database in memory
Creates a conditional FP-tree for each item | Generates frequent patterns by creating itemsets through pairings, such as one-item sets, two-item sets, and three-item sets
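The two-scan construction of the FP-tree and the recursive mining of conditional pattern bases described in Section 3.1 can be illustrated with a minimal sketch. This is not the implementation used in the study; the function names (`build_tree`, `fp_growth`, `frequent_itemsets`) and the toy transactions are hypothetical, and the code relies only on the Python standard library.

```python
from collections import defaultdict


class Node:
    """One FP-tree node: an item, a count, and links to parent and children."""

    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}


def build_tree(weighted_transactions, min_count):
    """Build an FP-tree from (items, weight) pairs using two passes."""
    # Pass 1: count how often each item occurs.
    counts = defaultdict(int)
    for items, weight in weighted_transactions:
        for item in set(items):
            counts[item] += weight
    frequent = {i: c for i, c in counts.items() if c >= min_count}

    # Pass 2: insert each transaction, frequent items only, most frequent first.
    root = Node(None, None)
    header = defaultdict(list)  # item -> all tree nodes carrying that item
    for items, weight in weighted_transactions:
        ordered = sorted((i for i in set(items) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += weight
            node = child
    return header, frequent


def fp_growth(weighted_transactions, min_count, suffix=frozenset()):
    """Return {itemset: count} for every itemset meeting min_count."""
    header, frequent = build_tree(weighted_transactions, min_count)
    result = {}
    for item, count in frequent.items():
        itemset = suffix | {item}
        result[itemset] = count
        # Conditional pattern base: the prefix path above each node of `item`.
        conditional = []
        for node in header[item]:
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                conditional.append((path, node.count))
        result.update(fp_growth(conditional, min_count, itemset))
    return result


def frequent_itemsets(transactions, min_support):
    """Hypothetical wrapper: min_support is a fraction of all transactions."""
    n = len(transactions)
    counts = fp_growth([(t, 1) for t in transactions], min_support * n)
    return {itemset: c / n for itemset, c in counts.items()}


# Toy baskets, invented for illustration only (not the study's data).
example = [["bread", "eggs", "oil"], ["bread", "eggs"], ["bread", "oil"],
           ["eggs", "oil"], ["bread", "eggs", "oil"]]
for items, support in frequent_itemsets(example, 0.5).items():
    print(sorted(items), round(support, 2))
```

No candidate itemsets are enumerated here: each recursive call only rebuilds a smaller tree from the prefix paths of one item, which is the property the text contrasts with Apriori.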
3.2. Customer Clustering
As reviewed in [14], several methods for customer segmentation exist, most of which are based on customers’ behavioral, psychological, geographic, and demographic information. Behavioral segmentation based on RFM analysis is emphasized because it uses a small set of features. The RFM model is therefore used to cluster the store’s customers. RFM stands for Recency, Frequency, and Monetary Value, and it is becoming a prevalent form of clustering in the retail industry, owing to its simplicity of implementation with minimal help from data scientists and the straightforward, visual interpretation of its results. The three main factors of RFM can be explained as follows:
Recency (R): Represents the time interval between the date of the last purchase and the most recent date in the statistical period. The smaller the time interval, the higher the R score.
Frequency (F): Indicates the number of times a customer has made purchases during the statistical period. The higher the F value, the more loyal the customer is to the company.
Monetary Value (M): Represents the total value of transactions made by the customer during the statistical period. The higher the M value, the more revenue for the company [15].
RFM analysis is a common clustering method for explaining customer purchasing behavior based on transaction data. Valuable customers have the highest frequency and monetary value and the lowest recency interval. These three variables are behavioral variables and can be used as clustering variables by observing customers’ attitudes towards the product, brand, profit, or even loyalty in the database. The RFM scoring process uses quintiles to quantify customer behavior. The first quintile, with the highest value (the smallest for recency), is marked as 5. The next quintile is marked as 4, and so on. Finally, all customers are represented by 555, 554, 553, ..., 112, 111.
The most valuable customer group is 555, while the worst is 111 [16].
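The quintile scoring described above can be sketched as follows. This is an illustrative sketch only: the function names, the tuple layout `(customer_id, recency, frequency, monetary)`, and the sample customers are assumptions, and ties are broken by input order rather than by any rule stated in the paper.

```python
def quintile_scores(values, higher_is_better=True):
    """Assign each value a score from 1 to 5 based on its rank quintile.

    With higher_is_better=False (used for recency), the smallest values
    receive the score 5.
    """
    n = len(values)
    # Order positions from worst to best, so rank 0 receives score 1.
    order = sorted(range(n), key=lambda i: values[i],
                   reverse=not higher_is_better)
    scores = [0] * n
    for rank, idx in enumerate(order):
        scores[idx] = 1 + (5 * rank) // n
    return scores


def rfm_codes(customers):
    """customers: list of (customer_id, recency, frequency, monetary)."""
    ids = [c[0] for c in customers]
    r = quintile_scores([c[1] for c in customers], higher_is_better=False)
    f = quintile_scores([c[2] for c in customers])
    m = quintile_scores([c[3] for c in customers])
    return {cid: f"{ri}{fi}{mi}" for cid, ri, fi, mi in zip(ids, r, f, m)}


# Hypothetical customers, invented for illustration only.
sample = [("a", 0, 10, 500), ("b", 10, 8, 400), ("c", 20, 6, 300),
          ("d", 30, 4, 200), ("e", 40, 2, 100)]
codes = rfm_codes(sample)
```

Loyal customers could then be selected by filtering on the resulting codes, for instance keeping only those whose three digits are all high; the exact cut-off is a modelling choice that the sketch leaves open.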
3.3. Execution Steps of the Model
As shown in Figure 1, the framework development process begins with the implementation of the CRISP-DM model. To better understand the business, the received data were reviewed and analyzed. Each branch has approximately 700,000 transactions within the specified historical period. These transactions include customer purchases across all product categories of the chain store, including household items, health and beauty products, food items, and tools. Therefore, in the first stage, the food product records for each branch were separated using SQL. In the second stage, refrigerated food items were reviewed and cleaned. In the third stage, it was determined that all product names are fully specified along with the brand names; according to the research hypothesis, product brands do not affect the result. At this stage, three refined Excel files were created, each containing the following columns:
Branch Name
Transaction Date
Transaction Amount
Customer Name and Code
Purchased Product
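Before association rules can be mined, rows with the columns listed above have to be grouped into baskets. The sketch below shows one way to do this; it assumes, hypothetically, that all products bought by one customer in one branch on one date form a single basket, which the paper does not state. The function name and sample rows are likewise illustrative.

```python
from collections import defaultdict


def build_baskets(rows):
    """Group (branch, date, customer_code, product) rows into baskets.

    Returns {branch: [sorted list of products, ...]} with one list per
    (branch, customer, date) combination. Treating that combination as one
    basket is an assumption made for this sketch.
    """
    grouped = defaultdict(set)
    for branch, date, customer, product in rows:
        grouped[(branch, customer, date)].add(product)

    baskets = defaultdict(list)
    for (branch, _customer, _date), products in grouped.items():
        baskets[branch].append(sorted(products))
    return dict(baskets)


# Hypothetical rows, invented for illustration only.
rows = [
    ("Branch 1", "2020-01-01", "c1", "bread"),
    ("Branch 1", "2020-01-01", "c1", "eggs"),
    ("Branch 1", "2020-01-02", "c1", "oil"),
    ("Branch 2", "2020-01-01", "c2", "bread"),
]
baskets = build_baskets(rows)
```

Keeping the baskets separated by branch from the start makes it straightforward to run the same mining step on each branch independently.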
Figure 1. Execution steps of the model.
Clustering Loyal Customers
After determining the execution steps of the research and cleaning the transaction data, the next step is to find the loyal customers of the store under study. One of the main reasons for selecting loyal customers is to test the hypothesis that in-person shopping behavior in different geographical areas affects the relationship among purchased products in the customer shopping basket of each branch. Outlier data, such as a customer purchasing from a specific branch only once, can affect the research outcome. Therefore, to avoid this issue, the focus is on loyal customers.
Following the RFM method explained in Section 3.2, we classified customers according to the monetary value of their shopping baskets, the number of purchases they made, and the time since their last purchase, using the available customer data.
Some of these data are shown in Table 2. The first column is the unique ID of the customer in each store, the second column is the shopping frequency, the third column represents the time interval between the date of the last purchase and the most recent date in the statistical period, and the fourth column is the monetary value of the customer’s purchases. Loyal customers for each branch are identified according to Table 2. This research examines the last purchase date over the three months covered by the available data. Customers with high recency scores are more likely to make repeat purchases.
Table 2. An example of the output of the RFM model for surveying loyal customers.
Customer code | Frequency | Recency | Monetary
0040011139811600166 | 1 | 19 | 243,000
0040131398111600012 | 1 | 66 | 27,000
0040111398111600163 | 1 | 59 | 252,000
0040121398111600297 | 3 | 33 | 1,572,000
004011139811600335 | 3 | 9 | 162,000
.... | .... | .... | ....
0040111398111600177 | 1 | 0 | 27,000
0040111398091500045 | 1 | 67 | 54,000
0040081398809115009 | 2 | 9 | 81,000
0040111398091500133 | 4 | 15 | 243,000
4. Results and Discussion
In this study, we aimed to analyze market baskets to observe association rules for frequently purchased food items in each branch of the grocery chain using the FP-Growth algorithm.
Initially, a minimum support value was introduced. The support threshold parameter specifies the minimum coverage that a set of items must have to be confirmed as a frequent itemset. This threshold can be defined as a percentage of the total transactions or as a specific number of transactions [17]. Based on statistical reports and analyses of each item’s repetition frequency, the support threshold in this study was set at 7%.
The number of association rules generated is determined by the confidence threshold parameter. McLennan et al. suggest that in a sparse data set, such as a purchase transaction table, this threshold should be set between 5% and 10% to derive reasonable rules.
Accordingly, a confidence threshold of 7% was chosen.
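To make the roles of the two thresholds concrete, the sketch below computes support, confidence, and lift for a single rule using the standard definitions: support(A → C) = P(A and C), confidence = P(A and C) / P(A), and lift = confidence / P(C). The function names and the toy transactions are hypothetical; the default thresholds of 0.07 mirror the 7% values chosen above.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, and lift of the rule antecedent -> consequent."""
    a, c = set(antecedent), set(consequent)
    baskets = [set(t) for t in transactions]
    n = len(baskets)
    support_a = sum(a <= b for b in baskets) / n          # P(A)
    support_c = sum(c <= b for b in baskets) / n          # P(C)
    support_ac = sum((a | c) <= b for b in baskets) / n   # P(A and C)
    # Guard against division by zero when A or C never occurs.
    confidence = support_ac / support_a if support_a else 0.0
    lift = confidence / support_c if support_c else 0.0
    return {"support": support_ac, "confidence": confidence, "lift": lift}


def passes_thresholds(metrics, min_support=0.07, min_confidence=0.07):
    """Keep a rule only if it meets both thresholds."""
    return (metrics["support"] >= min_support
            and metrics["confidence"] >= min_confidence)


# Toy baskets, invented for illustration only.
toy = [["bread", "eggs"], ["bread", "eggs", "oil"], ["bread"], ["oil"]]
metrics = rule_metrics(toy, ["bread"], ["eggs"])
keep = passes_thresholds(metrics)
```

A lift above one indicates that the consequent appears more often in baskets containing the antecedent than in baskets overall, a lift of exactly one corresponds to statistical independence, and a lift below one indicates that the items co-occur less often than independence would predict.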
In basket analysis, we aim to find groups of items that often appear together and to provide recommendations based on how often items occur in transactions containing other items. Afterward, the results for the three branches are evaluated and compared. The expected outcome is to facilitate an understanding of all consumer transactions at a specific grocery store branch and to identify products that are frequently purchased together at that branch. The results of the FP-Growth algorithm, implemented in Python, included 10,000 recommendations, a portion of which are shown in Figures 2-4. The first column contains the items for which the recommendations are provided, and the second column contains the items often purchased with the leading item.
4.2. Market Basket Analysis Results for Branch 1
In Branch 1, the association rule algorithms produce similar extraction rules with a maximum support of 0.1176, maximum confidence of 1, and a lift of 8.5.
Below are the details of the top 10 rules based on the highest confidence values (Figure 2). Transactions in Branch 1 show the highest association among items such as (bread and pasta → eggs, Rani drink, and wafers) and (bread and pasta → eggs, oil, Rani drink, and wafers).
Figure 2. Output for Branch 1.
4.3. Market Basket Analysis Results for Branch 2
In Branch 2, the association rule algorithms produce similar extraction rules with a maximum support of 0.11 and a maximum confidence of 1. Below are the details of the top 10 rules based on the highest confidence values (Figure 3).
Transactions in the Railway branch show the highest association among items such as (sauce and drinks → sugar and bread) and (sugar and drinks → sauce, beans, and bread).
Figure 3. Output for Branch 2.
4.4. Market Basket Analysis Results for Branch 3
In Branch 3, the association rule algorithms produce similar extraction rules with a maximum support of 0.111 and a maximum confidence of 1. Below are the details of the top 10 rules based on the highest confidence and lift values (Figure 4).
Transactions in the Behrood branch show the highest association among items such as (oil, bread, split peas → chocolate, eggs, Rani drink, and sugar) and (chocolate, eggs, Rani drink, and sugar → oil, cake, and split peas).
Figure 4. Output for Branch 3.
Figure 5. Independent items display.
Independent Items Analysis
The items displayed in Figure 5 do not show any association with each other in Branch 2 at a minimum support of 0.7%. As shown, the confidence values are low, the lift values are less than one, and Zhang’s metric is negative, so no positive association exists between these items; in this paper they are treated as independent.
As seen in Figure 5, oil and bread are independent of each other in Branch 2 at a minimum support of 0.7%, whereas in Branches 1 and 3 bread and oil are among the items that are associated with each other. Therefore, the shopping basket of each branch differs from those of the other branches.
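For reference, Zhang’s metric can be computed from three support values. The sketch below uses the commonly cited form, (P(A and C) − P(A)·P(C)) / max(P(A and C)·(1 − P(A)), P(A)·(P(C) − P(A and C))), which ranges from −1 to 1. The function name and the numeric inputs are hypothetical and are not taken from Figure 5.

```python
def zhang_metric(support_a, support_c, support_ac):
    """Zhang's metric for the rule A -> C, computed from support values.

    Returns a value in [-1, 1]: positive for association, zero for
    independence, negative for dissociation.
    """
    numerator = support_ac - support_a * support_c
    denominator = max(support_ac * (1 - support_a),
                      support_a * (support_c - support_ac))
    if denominator == 0:
        # Degenerate case (for example, an item present in every basket).
        return 0.0
    return numerator / denominator


# Hypothetical support values, invented for illustration only.
independent = zhang_metric(0.5, 0.5, 0.25)   # P(A and C) equals P(A) * P(C)
dissociated = zhang_metric(0.5, 0.5, 0.10)   # co-occur less than expected
associated = zhang_metric(0.5, 0.5, 0.40)    # co-occur more than expected
```

Under this definition a negative value, like those reported for Branch 2, means the two items appear together less often than their individual frequencies would suggest, which is why no placement rule links them in that branch.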
5. Conclusion
As the algorithm results indicate, the associations among items purchased by customers differ from branch to branch of a chain store, implying that branch variation affects the market basket. Therefore, chain stores should pay attention to these branch differences and conduct a separate market basket analysis for each branch. The empirical evaluation in this study shows that the proposed method is computationally efficient. Additionally, we assume that different stores may have different product combinations over different periods. This means that each store can have its own product combination, and the product combination in a store can change dynamically over time.
6. Future Recommendations
The following recommendations are made for future market basket analysis using data mining approaches:
Implement the algorithm across different chain stores with larger data volumes.
Consider factors such as seasonality or product brands.
Investigate the relationship between various items, such as digital products.
Analyze customer basket correlations in online stores.
Conflicts of Interest
The authors declare no conflicts of interest.