Study on the Development and Implementation of Different Big Data Clustering Methods ()
1. Introduction
In recent years, we have entered a new era, that of data. Ever since humans have been able to write or even communicate, they have been collecting and transmitting data. According to the International Data Corporation (IDC), it is known approximately that from the first traces of man until 2005, the human species created 135 exabytes of data and 161 exabytes of data were created until 2006 [1] . In 2010, we exceeded 1200 exabytes, in 2015, we were at 7900 exabytes and until 2020, we exceeded 40,000 exabytes [1] . According to another study by IDC in 2018, the data sphere will increase from 33,000 exabytes (amount of data produced until 2018) to 175,000 exabytes in 2025 [2] . The evolution of the data over time is shown in Figure 1.
In the early 2000s, we entered the internet age. Many companies began to understand the value of this data, which was becoming increasingly voluminous. For example, it is possible to predict the onset of a virus such as influenza through human searches on Google. It became clear very quickly that all this data would be used one day. So it was very valuable. There was just not enough time to invent new products using this data for commercial, military or medicinal purposes. Now, with the mass arrival of social networks but also smartphones and connected devices, which were 15 billion in 2015 and which are counted as more than 50 billion today and are predicted to be more than 70 billion in 2025, there will be almost 10 connected devices per human being on average. And all this will generate even more data. All this data is the era of Big Data, which we have only just entered.
The very rapid increase in the amount of data produced brings new challenges in the analysis and storage of this data. Recently, there is a growing interest in key areas such as real-time data mining, which reveal an urgent need to process very large data sets under strict performance constraints. These large masses of data streams are multidimensional and often from different sources. This massive data often brings together content of several types (mixed data), represented by descriptors of different natures: vectors of fixed or variable dimension with real or categorical components, etc. To be able to tap into all the hidden wealth within this avalanche of data, the use of high-quality, high-performance knowledge discovery tools is necessary. Data clustering techniques or algorithms are well known and very powerful knowledge discovery tools for this purpose [3] - [9] . These techniques are shown in Figure 2.
Clustering is an unsupervised learning method for constructing subsets (clusters) whose instances are similar (share the same characteristics) to each other with respect to a given similarity measure and differ from one group to another (dissimilar when they belong to different groups). It is clear that good clustering should achieve high intra-cluster similarity and low inter-cluster similarity [10] .
In unsupervised learning, the data is represented as follows:
with each row representing an observation. By applying a clustering technique to this data, the output is data grouped by similarity. Clustering will place objects in several groups according to their characteristics. Thus, similar objects will be in the same cluster and dissimilar objects will be in different clusters.
In this paper, we will review some of these methods with higher frequency in previous studies such as K-means algorithm, Fuzzy c-means, BIRCH and Expectation Maximization algorithm used for data clustering. The rest of this paper is organized as follows: Section 3 presents and develops these different algorithms in pseudocode, graphical and/or literal form. The strengths and weaknesses of the presented techniques are also discussed. Section 4 presents the implementation results of these algorithms. Section 5 discusses these results. Section 6 concludes the work.
2. Data Clustering Algorithms in Unsupervised Learning
Clustering is the process of organizing objects (data) into groups based on similar characteristics within the members (data points of the group) [11] . Clustering or clustering algorithms analyze and study the similarities in the data provided by the user in order to classify them into groups. There are several clustering algorithms. For each one, one has to choose the method or technique to be used to measure the similarity between two objects that can be compared to two real points in d-dimensional space. The principle of clustering is to let the machine classify our data according to their similarity. Here are some of these algorithms:
2.1. K-Means Algorithm
The K-means clustering algorithm is one of the best known, most proven, most popular and simplest unsupervised machine learning algorithms [12] [13] , which is most often applied to solve clustering problems. The k-means algorithm is an iterative (learning) method to discover a number k of clusters in the input space [14] . The number k is defined a priori [13] [15] [16] and each cluster is represented by a cluster centroid in the feature space. For example, let’s imagine that we want to group the data shown in Figure 3 below into four clusters.
To do this, we will first place four points called centroids (red points in Figure 4) at random among our data. Then, we assign each point of the dataset to the nearest centroid which gives us four clusters, then we move each centroid to the middle of its cluster. Then, we start again, we assign each point of our dataset to the nearest centroid and then we move each centroid to the centre of its cluster and we will continue like this until the centroids converge towards an equilibrium position as in the following figure:
Depending on the initial position of the centroids, it is possible that they converge to the wrong positions. The solution is then to run the K-means clustering algorithm several times in a row, changing the initial position of the centroids each time. For each result, the distance between the points of a cluster and the centre of the latter is measured and the solution for which the sum of these distances is the smallest is retained. The K-means algorithm seeks to minimize a
Figure 4. Examples of clustering with K-means.
cost function called inertia, which represents the distance between the points of a cluster and its centre:
In reality, it seeks to minimize the variance of the clusters. Here is how a K-means clustering algorithm works.
The steps of the K-means algorithm
Step 1. Randomly select k elements (centroids) from the dataset
as a medium
of the initial clusters;
Step 2. Calculate the Euclidean distance between each element of the dataset and each centroid by:
;
Step 3. Assign each element of the dataset to the nearest centroid. We obtain k clusters;
Step 4. Recalculate the centroids of the k clusters by:
;
is the jth cluster;
Step 5. Calculate the total mean square error between all the elements of the dataset and their corresponding cluster centroids by:
;
Step 6. If the centroids converge to an equilibrium position, then the results are obtained and exit otherwise return to step 2.
Advantages and disadvantages of the K-means algorithm
K-means is applicable to large data and also any type of data (even textual), by choosing a good notion of distance. The specification of class number, distance/average and initial draw of class centers built on non-existent objects are some of drawbacks of K-means.
2.2. Fuzzy c-Means (FCM) Algorithm
Fuzzy c-means (FCM) is a clustering method that allows a point to belong to two or more clusters, unlike K-means where only one cluster is assigned to each point [17] . This algorithm necessarily assigns a dataset element to one of the clusters by giving them the credibility measure. With FCM, each element of a cluster has the probability of belonging to the other. Therefore, an object does not have an absolute membership on a particular cluster [11] . This method was developed by Dunn in 1973 [18] and improved by Bezdek in 1981 [19] and is based on fuzzy set theory. The Fuzzy c-means procedure [20] is similar to that of K-means. It consists in minimizing the functional J:
If N is the total number of points to be processed; C: the number of clusters searched for;
the degree of fuzziness and B: the vector of cluster centres,
will be called a membership degree matrix if and only if:
We can then determine B and U using the Lagrange technique. Let us define for each vector
,
by:
If we cancel the partial derivatives with respect to a
and
we get:
and
so that if
, we have:
with:
Finally, we have:
For C and X fixed, if we cancel the partial derivatives
with respect to any direction g of B we obtain:
The steps involved in this algorithm are shown in the form of a flow chart in Figure 5.
The FCM algorithm has been widely used to segment brain volumes from one or more multimodalities [21] [22] . Several variants exist for the Fuzzy c-means
algorithm. Among them are FcE [23] (fuzzy c-elliptotypes), AFc [24] (adaptive fuzzy c-elliptotypes) [25] [26] . Although this algorithm is widely used in some areas as mentioned, it suffers from several weaknesses. Among them are problems related to the degrees of membership which are relative degrees. In other words, the membership of an individual to a class depends on the membership of this individual to other classes [27] . The membership functions constructed are therefore interdependent. Also, the estimates of the centres of the classes do not correspond to the real or typical centres. Another disadvantage is related to the initialization of the class centres in a random way (step 2) which may influence the convergence of J to a local minimum, so the idea lies in the optimal estimation of these prototypes [28] .
2.3. Expectation-Maximization (EM) Algorithm
A general clustering method using statistical principles is to represent the probability density function of the data as a mixture model, which states that the data is a combination of k individual component densities (usually Gaussian), corresponding to k clusters [29] . The EM algorithm is an efficient and popular technique for estimating the parameters of the mixture model [30] . It iteratively refines an initial cluster model to better fit the data and ends with a solution that is locally optimal for the underlying clustering criterion [31] . Like other iterative refinement clustering methods, including the popular k-means algorithm, the EM algorithm is fast and scalable versions are available [32] .
In real-world applications of machine learning, it is very common for many relevant features to be available for learning, but for only a small subset of them to be observable. The expectation-maximization algorithm can be used for latent variables (variables that are not directly observable and are in fact inferred from the values of other observed variables). It is in fact the basis of many unsupervised clustering algorithms in the field of machine learning. The EM algorithm is a parametric estimation method within the general framework of maximum likelihood. It is an iterative algorithm that allows to find the maximum likelihood parameters of a probabilistic model when the latter depends on unobservable latent variables. All variables are assumed to be independent of each other and all data are derived from K joint distributions. The Expectation Maximization algorithm is described in detail below:
Inputs: Data
, k (number of clusters);
Outputs: Set of parameters
with
: the cluster;
Step 1: Randomly select the initial parameters:
;
Step 2: Expectation Stage
For each data
, calculate:
with:
Step 3: Maximization stage
Using the
,
, find the new parameter estimate that maximizes the expected likelihood:
Step 4: Repeat steps 2 and 3 if the parameters (centroids) do not change.
Advantages and disadvantages of the Expectation-Maximization algorithm
The advantage of EM over k-means is that it provides a statistical model of the data and is able to handle the associated uncertainties. However, a problem with its iterative nature is the convergence to a local rather than global optimum. It is sensitive to initial conditions and is not robust. While iterative refinement schemes such as k-means and expectation-maximization (EM) are fast and easily adaptable to large databases [32] , they can only produce convex groups and are sensitive to parameter initialization.
2.4. BIRCH
BIRCH is a scalable clustering method and is designed for clustering very large datasets by integrating hierarchical clustering and other clustering methods such as iterative partitioning [33] . It overcomes the two difficulties of agglomerative clustering methods:
1) Scalability;
2) The inability to undo what was done in the previous step.
Given N d-dimensional data points in a cluster:
where
, we can define the centroid
, the radius R and the diameter D of the cluster as follows:
where R is the average distance between member points and the centroid and D the average distance between pairs within a cluster. They reflect the closeness of the cluster around the centroid. We can also define the Euclidean distance from the centroid D0 and the Manhattan distance of the centroid D1 of two clusters as follows:
when their centroids
and
are given.
The average inter-cluster distance D2 the average intra-cluster distance D3 and the distance of increase in variance D4 when we have N1 d-dimensional data points in the cluster
with
, and N2 data points in another cluster
with
are defined as follows:
To summarize the cluster representations, BIRCH uses two methods, namely the clustering feature (CF) and the clustering feature tree (CF tree). According to [34] , when we have N d-dimensional data points in the cluster,
, the clustering feature (CF) vector of the cluster is defined as
with:
➢ N: the number of data points in the cluster;
➢
: the linear sum of the N data points (i.e.,
);
➢ SS: The square sum of the N data points (i.e.,
).
A CF tree is a height-balanced tree that stores the clustering characteristics for hierarchical clustering and has two parameters namely the branching factor B and the threshold T. The advantage of the BIRCH algorithm is that it can find a good clustering with a single scan and improve the quality with a few more scans, but unfortunately it only processes numerical data. The execution steps of the BIRCH algorithm are shown in Figure 6.
3. Implementation Results of These Algorithms
4. Discussion
We have just applied the four algorithms on the same data set (Figure 7) to see the differences in the results obtained. By analyzing these results, we can see that for some algorithms, there is a small difference in the classification while for others, similarities are clearly evident. For example, as presented in Figure 8, with the K-Means clustering algorithm, we can see that the clusters are clearly separated in such a way that we can even think that there is an imaginary line separating each pair of clusters, which is not the case for the BIRCH clustering algorithm as shown in Figure 9. With the FCM clustering algorithm in Figure 11, the results are almost similar to the results obtained using the K-Means algorithm. The clusters are well compacted and separated. The results of the EM algorithm in Figure 10 are also close to the results of the K-Means, EM and BIRCH algorithm. In general, taking into account the results obtained using these different algorithms on the same data, it can be seen that there is not a big difference. Therefore, it cannot be said that one algorithm is better than another with respect to the results obtained here. Meanwhile, as highlighted while describing each clustering method, each one has its advantages and drawbacks. Consequently, the choice of a specific method depends on the kind of data and the aimed outcomes and/or results.
5. Conclusion
Clustering of large and numerous data is a data classification method used to extract valuable knowledge that can guide decision makers and business managers. The topicality of this work lies in the deep review of the evolution of the volumes, nature and form of data, which, with their structural and functional complexity, render the classical methods of data management of transactional systems almost obsolete. The new methods, namely those based on unsupervised learning, significantly solve a range of problems encountered when using classical methods. In particular, they allow for the improved capitalization of data and knowledge in this digital era focused on the knowledge economy. The study of modern clustering methods has allowed us to identify a real mapping of these methods from the perspective of their design. The implementation of K-means, Fuzzy c-means, BIRCH and Expectation-Maximization methods shows that there is practically no one method that is more effective and efficient than another. Thus, a rational and therefore sound approach may require the combination of a number of these methods in order to take advantage of the benefits that one or the other method offers. Future research will concentrate on studying, stimulating and analyzing how the combination of two or more clustering methods improves and reinforces the model’s efficacy, efficiency and rationality.