A Clustering Approach for Customer Billing Prediction in Mall: A Machine Learning Mechanism
1. Introduction
Machine learning mechanisms are widely used across science and technology, and they can equally be applied to employee or student records. In each case the goal is to predict an outcome from the information available and from past experience. In this article, we concentrate on predicting whether a customer will purchase goods from the mall, based on gender and salary. We have multiple independent variables and a single dependent variable to predict: whether the customer will make a bill on a product.
We apply two algorithms to our data: hierarchical clustering and K-Means clustering. The two mechanisms proceed differently, but both produce a grouping of the same data. Their accuracy differs, and that difference determines which one is the more acceptable choice for designing and implementing the prediction model. Clustering mechanisms fall into two kinds: hard clustering and soft clustering.
1) Soft Clustering:
In soft clustering, a data point is not forced into exactly one cluster. Instead, we determine how well the current data point fits each of the existing clusters, so a point can belong to more than one cluster to differing degrees.
2) Hard Clustering:
In hard clustering, each data point either belongs to a given cluster completely or does not belong to it at all [1] [2] [3]. For example, if we have ten clusters, we need to identify the single cluster to which the data point belongs.
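The contrast between the two kinds of assignment can be sketched in a few lines of Python. This is a minimal illustration, not the article's implementation: the two centroids, the sample point, and the exponential weighting used for the soft memberships are all assumptions chosen for the example.

```python
import math

def hard_assign(point, centroids):
    """Return the index of the single nearest centroid (hard clustering)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

def soft_assign(point, centroids):
    """Return one membership weight per centroid, summing to 1 (soft clustering).

    Weights are proportional to exp(-squared distance); this weighting
    is an assumption made purely for illustration.
    """
    scores = [math.exp(-math.dist(point, c) ** 2) for c in centroids]
    total = sum(scores)
    return [s / total for s in scores]

centroids = [(0.0, 0.0), (2.0, 0.0)]  # two hypothetical cluster centres
point = (0.8, 0.0)                    # a point slightly closer to the first

hard_label = hard_assign(point, centroids)    # one cluster index
soft_weights = soft_assign(point, centroids)  # one weight per cluster
print(hard_label, [round(w, 3) for w in soft_weights])
```

The hard assignment returns a single cluster index, while the soft assignment returns a degree of membership for every cluster.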
Several types of clustering models have been identified. We review them here because we use two kinds of clustering in this work [4] [5] [6] to identify whether the customer will purchase a product.
Connectivity models are the first type. They connect data points based on a category or characteristic the points have in common. For example, if a new data point shares the characteristics of an existing data point, the two are connected in the data space even when they lie far apart.
Centroid models are the second type. Here the similarity of a data point to a cluster is measured by how close the point lies to the cluster's centroid. The smaller the distance between the centroid and the data point, the stronger the connection, and the data point is assigned to the cluster whose centroid is nearest [7] [8] [9].
Distribution models are the third type. They deal with the probability that the data points in a cluster belong to the same distribution, and clusters are formed on the basis of that probability. The distributions may be Gaussian or of any other type.
Density models are the last type. They search the data space for regions where data points are densely packed [7] [8] [9].
Our work differs from previous research in two respects. First, the models are fitted and the plots are drawn using only the required features rather than all the existing features. Second, feature extraction, that is, identifying the most prominent features for the model, is the most important part of what we implemented. We use two kinds of classifiers, and plotting them with the clustering mechanisms is the main focus of the implementation and explanation. The resulting models indicate the optimal model and the features that could improve it in any later extension.
Figure 1 [10] [11] [12] illustrates the structure of the clustering models we deal with.
Here we implement the same mechanisms using two primary clustering mechanisms: K-Means (Figure 5) and hierarchical clustering. Figure 2 shows the distribution model of clustering, Figure 3 the centroid model, and Figure 4 the density model. In K-Means clustering [13] [14], we run the method with k = 2 as the default value and identify the cluster to which each data point belongs. In hierarchical clustering, we form dendrograms of the groups, from which we can identify the category [15] [16] [17]. A sample dendrogram and cluster in the 2D plane follow.
Figure 1. Connectivity model in clustering.
Figure 2. Distribution model of clustering.
The latter part of this article explains the K-Means and hierarchical clustering (Figure 6) mechanisms as applied to the problem stated in the abstract: predicting whether a customer will make a bill in the mall, with age and salary as the main independent variables. The sections that follow describe the flow of the process, then present sample results and plots, and finally conclude with the future scope of the work [15] [16] [17].
Figure 6. Dendrogram for hierarchical clustering.
2. Mechanisms
2.1. K-Means
K-Means is an iterative algorithm that moves toward a local optimum with each iteration. The number of iterations may differ with the value of K considered. In this work, we found the K value to be 2. The steps are as follows [18] [19].
Initially, we need to specify the number of clusters K in the 2D space. Here we take K as 2.
In Figure 7, we can see the K value of two and five different data points in the 2D plane [18] [19].
Next, we assign each data point to one of the available clusters. Here there are two clusters [19] [20] [21], shown in red and white in Figure 7.
Now we compute the centroid of each cluster's data points. The centroids are drawn as crosses in Figure 8: a red cross for the red cluster and a white cross for the white cluster.
Figure 7. K value and the 5 data points in the 2D space.
Figure 8. Centroid designing with the cross symbol.
1) Verify whether each newly computed centroid is the closest centroid to the data points of its own category. If a data point is farther from its own centroid than from the other centroid, re-assign that data point to the cluster of the nearer centroid. This is shown in Figure 9.
If we observe Figure 9, we can see an increase in white-category data points and a decrease in red-category data points. This happens when a data point lies far from the centroid of its own category.
2) Re-compute the centroids from the data points now assigned to each cluster, as a new iteration. The centroid re-computing procedure is as follows [21].
Repeat the previous two steps until no further improvement is identified in the clusters (Figure 10).
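The assign, re-compute, and repeat cycle described in this section can be sketched as follows. This is a minimal from-scratch illustration using only the Python standard library, not the code used to produce the figures. The five data points and the choice of the first two points as starting centroids are assumptions made for the example.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Basic K-Means: assign points, re-compute centroids, repeat until stable."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its member points.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if not members:
                new_centroids.append(old)  # an empty cluster keeps its centroid
                continue
            new_centroids.append(
                tuple(sum(axis) / len(members) for axis in zip(*members))
            )
        if new_centroids == centroids:  # nothing moved, so the clusters are stable
            break
        centroids = new_centroids
    return labels, centroids

# Five hypothetical data points in the 2D plane, with K = 2.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5)]
initial = [points[0], points[1]]  # deliberately poor starting centroids
labels, centroids = kmeans(points, initial)
print(labels, centroids)
```

With these starting centroids, the point (1.5, 2.0) begins in the second cluster and moves to the first once the centroids are re-computed, which is the kind of re-assignment the steps above describe.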
2.2. Hierarchical Clustering
As the name suggests, this mechanism builds a hierarchy of clusters from the data points in the 2D plane. We construct dendrograms over the data points in a few iterations, as in the K-Means algorithm. Each cluster starts with a single data point assigned to it, then merges with the nearest data point in the space to form a group. With every iteration, the clusters and their centroids change [21].
A dendrogram of the clusters is formed at every iteration. The best choice for the number of groups is 4, and the red lines in Figure 11 mark the maximum vertical distance [22].
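The merging process and the "maximum vertical distance" rule can be sketched in plain Python. This is an illustrative sketch under stated assumptions: it uses single linkage (distance between the closest members of two clusters), which the article does not specify, and six hypothetical points rather than the article's data. The function names are invented for the example.

```python
import math

def single_linkage_merges(points):
    """Agglomerative clustering; returns the distance at which each merge happens."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    heights = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
        heights.append(d)
    return heights

def clusters_at_largest_gap(heights, n_points):
    """Cut the dendrogram inside the tallest gap between successive merges.

    Assumes at least three points, so that at least one gap exists.
    """
    gaps = [b - a for a, b in zip(heights, heights[1:])]
    merges_kept = gaps.index(max(gaps)) + 1  # merges that lie below the cut
    return n_points - merges_kept

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]
heights = single_linkage_merges(points)
n_clusters = clusters_at_largest_gap(heights, len(points))
print(heights, n_clusters)
```

On this toy data the three close pairs merge first, the tallest gap follows those merges, and the cut leaves three clusters. The article's own data leads to four groups.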
3. Process Flow
We maintain separate process flows for the K-Means and hierarchical clustering mechanisms. They are described in this article with sample code (Figure 12, Figure 13).
Figure 11. Hierarchical Clustering with the maximum vertical region.
Figure 13. Sample of hierarchical clustering.
1) K-Means
The process consists of the following steps:
a) Import the libraries
b) Import the related dataset in CSV or JSON format
c) Perform Feature scaling
d) Split the dataset into test and train set
e) Use the elbow method to identify the optimal number of clusters
f) Fit K-Means to the dataset
g) Visualize the clusters
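Step e) of the list above, the elbow method, can be sketched as follows. The idea is to run K-Means for several values of k and record the within-cluster sum of squares (WCSS) for each. The sketch uses only the standard library; the nine data points, the helper name `kmeans_wcss`, and the simplistic choice of the first k points as starting centroids are assumptions for illustration, not the article's code.

```python
import math

def kmeans_wcss(points, k, max_iter=100):
    """Run a basic K-Means and return the within-cluster sum of squares (WCSS)."""
    centroids = points[:k]  # simplistic initialisation (an assumption)
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            groups[nearest].append(p)
        updated = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centroids[j]
            for j, g in enumerate(groups)
        ]
        if updated == centroids:  # centroids stopped moving
            break
        centroids = updated
    return sum(
        math.dist(p, centroids[j]) ** 2 for j, g in enumerate(groups) for p in g
    )

# Hypothetical points forming three well-separated groups.
points = [
    (1.0, 1.0), (8.0, 8.0), (1.0, 9.0),
    (1.2, 0.8), (8.2, 7.8), (1.2, 9.2),
    (0.8, 1.2), (7.8, 8.2), (0.8, 8.8),
]
wcss = [kmeans_wcss(points, k) for k in range(1, 6)]
for k, value in enumerate(wcss, start=1):
    print(k, round(value, 2))
```

Plotting WCSS against k, the "elbow" is the value of k after which the curve flattens. For this toy data the drop is steep up to k = 3 and small afterwards.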
2) Hierarchical Clustering
The process consists of the following steps:
a) Import the libraries
b) Import the related dataset
c) Perform feature scaling
d) Split the dataset
e) Use the dendrogram to find the optimal number of clusters
f) Fit hierarchical clustering to the dataset
g) Visualize the clusters
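Steps c) and d), which both process flows share, can be sketched with the standard library alone. The rows below are hypothetical (age, annual income) pairs, not the article's dataset, and the helper names, the 25% test fraction, and the fixed seed are assumptions for the example.

```python
import random
import statistics

def standardize(rows):
    """Scale every column to zero mean and unit (population) standard deviation.

    Assumes no column is constant, otherwise the division would fail.
    """
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stdevs))
        for row in rows
    ]

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and set aside the first part as the test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical (age, annual income) rows.
customers = [(19, 15), (21, 35), (35, 60), (40, 78),
             (52, 88), (23, 18), (31, 40), (64, 19)]
scaled = standardize(customers)          # step c) feature scaling
train, test = train_test_split(scaled)   # step d) train/test split
print(len(train), len(test))
```

Scaling matters here because both clustering mechanisms rely on distances, so a feature with a larger numeric range would otherwise dominate the result.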
4. Results
The following are the results of the two mechanisms used in this architecture: first K-Means, then hierarchical clustering. As discussed earlier, K-Means clustering shows whether the model designed with the selected features yields the desired result. We use the elbow method in this implementation, which predicts whether the customer will make a bill in the mall based on his features.
1) K-Means
For K-Means, there are two sample plots: one identifying the number of possible clusters and one visualizing the resulting groups. They are shown in Figure 14 and Figure 15.
2) Hierarchical Clustering
The following are the outputs we obtained for the hierarchical clustering mechanism.
The first is a sample dendrogram (Figure 16), and the second (Figure 17) shows the clusters of customers based on annual income.
Figure 15. Clusters formed based on customers.
Figure 17. Clusters of the customers based on annual income.
5. Conclusion
We conclude the article with the sample outputs of K-Means clustering and hierarchical clustering. In a few scenarios, a backward elimination process is needed to identify the best features so that the models reach their best accuracy. From the results obtained, we identified K-Means as the best-fit model for the problem addressed. The main reason for the higher accuracy of K-Means is the repeated updating of the centroids as data points are re-assigned. Researchers are interested in whether the path to a purchase in the mall can be identified, but the model here requires only a simple step, feature extraction. Too many features make the model inaccurate and suboptimal. To find the optimal model, we use a confusion matrix to identify the difference between the predicted results and the actual results. The future scope of this research is to identify more optimal features to improve the model's prediction of customer billing.
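The comparison of predicted and actual results through a confusion matrix can be sketched as follows. The label lists are hypothetical and are not the article's results. The sketch also assumes that each cluster has already been mapped to a class label (1 for a customer who made a bill, 0 for one who did not).

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels=(0, 1)):
    """Rows are actual labels, columns are predicted labels."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

# Hypothetical labels: 1 = customer made a bill, 0 = customer did not.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

matrix = confusion_matrix(actual, predicted)
# Correct predictions lie on the diagonal of the matrix.
accuracy = (matrix[0][0] + matrix[1][1]) / len(actual)
print(matrix, accuracy)
```

The off-diagonal entries count the cases where the predicted result differs from the actual one, which is the difference the confusion matrix is meant to expose.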