LeaDen-Stream: A Leader Density-Based Clustering Algorithm over Evolving Data Stream

Clustering evolving data streams must be performed within limited time while maintaining reasonable quality. Existing micro-clustering-based methods do not consider the distribution of data points inside a micro cluster. We propose LeaDen-Stream (Leader Density-based clustering algorithm over evolving data Stream), a density-based clustering algorithm using leader clustering. The algorithm is based on two-phase clustering. The online phase selects the proper mini-micro or micro-cluster leaders based on the distribution of data points in the micro clusters. Then, the leader centers are sent to the offline phase to form the final clusters. In LeaDen-Stream, by carefully choosing between two kinds of micro leaders, we decrease the time complexity of the clustering while maintaining the cluster quality. A pruning strategy is also used to separate real data from noise by introducing dense and sparse mini-micro and micro-cluster leaders. Our performance study over a number of real and synthetic data sets demonstrates the effectiveness and efficiency of our method.


Introduction
Mining data streams has become more prominent in many applications, including real-time detection of anomalies in computer network traffic, web searches, monitoring environmental sensors, social networks, sensor networks, and cyber-physical systems [1]. In these applications, data streams arrive continuously and evolve significantly over time. Mining data streams is concerned with extracting knowledge structures represented in the stream information. Clustering is a significant data stream mining task [2-6]. However, clustering in a data stream environment has special requirements due to the characteristics of data streams: clustering must be done in limited memory and time, with a single pass over the evolving data stream, and must handle noisy data [7-9].
There are various clustering methods in the literature, such as partitioning and hierarchical methods, which are developed to find spherical-shape clusters. One of the important classes of clustering is density-based clustering, which can discover clusters of non-spherical shape and filter out outliers. Some typical examples of density-based algorithms include DBSCAN [10], OPTICS [11], and DENCLUE [12]. The main idea in these algorithms is to consider the dense areas of points in the data space as clusters, which are separated by low-density areas (noise). Another method of clustering is grid-based clustering, which has fast processing time and is independent of the number of data points. Moreover, some algorithms integrate the grid-based and density-based approaches and are termed density grid-based clustering algorithms [13].
Micro clustering is a remarkable method in stream clustering to compress data streams effectively and to record the temporal locality of data [6]. The micro-cluster was first proposed in [14] for large data sets and subsequently adapted in [5] for data streams. The micro-clustering method for clustering data streams has attracted considerable attention in the literature [5, 15-20].
In [21], a two-level hybrid DBSCAN algorithm, L-DBSCAN, is proposed. First, it scans each point in the data set and finds the leaders at a coarse level in order to reduce time complexity. Then, it uses these leaders to determine density-based clusters at a finer level to reduce the deviation of the result. Furthermore, L-DBSCAN is developed into rough-DBSCAN in [22].
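As an illustration of the leader idea that L-DBSCAN and LeaDen-Stream build on, the following is a minimal sketch of classic single-pass leader clustering. It is not the procedure of [21]; the distance threshold `tau`, the use of Euclidean distance, and the function name are assumptions made for this example only.

```python
import math

def leader_clustering(points, tau):
    """Single-pass leader clustering: each point joins the first leader
    within distance tau, otherwise it becomes a new leader."""
    leaders = []      # leader coordinates, in order of creation
    followers = []    # followers[i] holds the points assigned to leaders[i]
    for p in points:
        for i, leader in enumerate(leaders):
            if math.dist(p, leader) <= tau:
                followers[i].append(p)
                break
        else:
            # No existing leader is close enough, so p starts a new group.
            leaders.append(p)
            followers.append([p])
    return leaders, followers

# Toy data (hypothetical): two well-separated groups of 2-D points.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (0.3, 0.1)]
leaders, followers = leader_clustering(points, tau=1.0)
```

Because every point is compared only with the current leaders, the data are scanned once, which is what makes leaders attractive as a coarse summary before a finer density-based step.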
The remainder of this paper is organized as follows: Section 2 surveys related work. Section 3 introduces basic definitions. In Section 4, we explain the LeaDen-Stream algorithm in detail. We present an experimental study of LeaDen-Stream on real-world and synthetic data sets in Section 5 and conclude the paper in Section 6.

Related Work
Algorithms for clustering data streams are categorized into one-scan and evolving approaches. The one-scan approaches cluster the data stream by scanning it only once, under the assumption that the data arrive in chunks [7, 23]. In evolving approaches, the behavior of the data stream is defined based on a certain time window. The fading window model and the sliding window model are widely adopted in stream mining [5, 9, 15, 17, 24-26].
Most clustering algorithms over evolving data streams have two phases, an approach first introduced by CluStream [5]. CluStream has an online and an offline phase. The online phase keeps summary information, and the offline phase generates clusters based on this synopsis information. However, CluStream, which is based on the k-means approach, finds only spherical clusters. Density-based clustering can overcome this limitation. Therefore, density-based clustering has recently been extended to two-phase clustering [9, 17, 24, 27, 28].
DenStream [17] is a clustering algorithm for evolving data streams. The algorithm extends the micro-cluster concept [5] and introduces outlier and potential micro clusters to distinguish between real data and outliers. DenStream is based on the fading window model, in which the importance of micro clusters is reduced over time if there are no incoming data points.
MR-Stream [9] is an algorithm that can cluster data streams at multiple resolutions. The algorithm partitions the data space into cells and uses a tree-like data structure to keep the space partitioning. The tree data structure keeps the data clustering at different resolutions. Each node has summary information about its parent and children. The algorithm improves the performance of clustering by determining the right time to generate the clusters. D-Stream [24] is a density grid-based algorithm in which the data points are mapped to the corresponding grids and the grids are clustered based on their density. It uses a multi-resolution approach to cluster analysis.
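We compared the time complexity and the clustering quality of the DenStream, MR-Stream, and D-Stream algorithms. The results are shown in Figures 1 and 2. D-Stream has the lowest time complexity; however, its quality is low, since the clustering quality depends on the granularity of the lowest level of the grid structure. DenStream has a higher time complexity than D-Stream; however, it has better memory usage and quality. MR-Stream has the highest time complexity and memory usage, while it has good quality.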
In this paper, we introduce a new algorithm, LeaDen-Stream, which achieves good quality while keeping its time complexity as low as that of D-Stream. We introduce two new concepts, the Mini Micro Leader Cluster and the Micro Leader Cluster. We present a new method in which the granularity of the micro leaders is defined based on the distribution of the data inside them, which is not considered in any of the existing algorithms. For example, in DenStream only the centers of the potential micro clusters are sent to the offline phase. However, if the data points are not distributed uniformly inside a micro cluster, sending only one representative point for each micro cluster reduces accuracy. Therefore, using the Mini Micro Leader Cluster keeps the quality, and using the Micro Leader Cluster decreases the time complexity. Figure 3 shows the Mini Micro Leader Cluster and the Micro Leader Cluster in the micro cluster. The situation is compared with DenStream. We also use the Mahalanobis distance instead of the Euclidean distance for identifying the correct cluster center, which increases the quality of clustering as well.
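To make the difference between the two distances concrete, here is a small sketch of the Mahalanobis distance restricted to two dimensions so that the matrix inverse can be written out by hand. The function name, the toy covariance matrix, and the 2-D restriction are assumptions for illustration; how LeaDen-Stream estimates the covariance is not specified in this text.

```python
import math

def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance of a 2-D point x from `mean` under the 2x2
    covariance matrix `cov` (given as nested lists)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Inverse of a 2x2 matrix, written out explicitly.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form (x - mean)^T * inv * (x - mean).
    q = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.sqrt(q)

# Hypothetical cluster that is spread more widely along the first axis.
cov = [[4.0, 0.0], [0.0, 1.0]]
d_wide = mahalanobis_2d((2.0, 0.0), (0.0, 0.0), cov)    # along the wide axis
d_narrow = mahalanobis_2d((0.0, 2.0), (0.0, 0.0), cov)  # along the narrow axis
```

Both test points are at Euclidean distance 2 from the center, but the point lying along the direction of larger spread is considered closer, which is why this distance can follow elongated clusters better.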

Basic Definitions
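In this section, we introduce the basic definitions on which the LeaDen-Stream algorithm is built.
Definition 1. The Decaying Function: The fading function [29] used in LeaDen-Stream is defined as $f(t) = 2^{-\lambda t}$, where $0 < \lambda < 1$. The weight of the data stream points decreases exponentially over time, i.e., the older a point gets, the less important it becomes. The parameter $\lambda$ controls the importance of the historical data of the stream.
Definition 2. MiniMicroLeaderCluster (MMLC): An MMLC for a group of data points $p_{i1}, \ldots, p_{in}$ with time stamps $T_{i1}, \ldots, T_{in}$ at time $t$ is defined as $\{CF^{1}_{mm}, CF^{2}_{mm}, W_{mm}, C, L\}$.
The maximum weight is reached as $t \to \infty$; its expression is given in [17] and [24].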
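The fading function $f(t) = 2^{-\lambda t}$ with $0 < \lambda < 1$ can be explored numerically with the short sketch below. The helper names and the arrival pattern (exactly one point per time unit) are assumptions for this example. Under that assumption the accumulated weight is a geometric series whose limit is $1 / (1 - 2^{-\lambda})$, which illustrates why a maximum weight exists as $t \to \infty$.

```python
def fading(t, lam):
    """Fading function f(t) = 2^(-lam * t), with 0 < lam < 1."""
    return 2.0 ** (-lam * t)

def decayed_weight(arrival_times, now, lam):
    """Sum of the faded contributions of points that arrived at the given times."""
    return sum(fading(now - t, lam) for t in arrival_times)

lam = 0.25
# Assumption: one point arrives per time unit, from t = 0 up to t = 1000.
weight = decayed_weight(range(0, 1001), now=1000, lam=lam)
# Limit of the geometric series 1 + r + r^2 + ... with r = 2^(-lam).
limit = 1.0 / (1.0 - 2.0 ** (-lam))
```

A larger `lam` makes old points fade faster and lowers this bound, while a smaller `lam` keeps more of the history.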

LeaDen-Stream Clustering Algorithm
We describe the key components of LeaDen-Stream, outlined in Algorithm 1. In LeaDen-Stream, when a new data record x arrives, it is added to a mini-micro or micro leader cluster based on the distribution of the data, in AdjustingLeaderClusters (Algorithm 2). Then, periodically, at every gap time (the minimum time for converting a dense mini-micro leader to a sparse one), we convert sparse mini-micro leader clusters to dense ones and vice versa. We remove the sparse mini-micro and micro leader clusters in PruningLeaderClusters (Algorithm 3).
Our clustering algorithm is divided into two phases:
• Online phase: keeping mini-micro and micro leader clusters
• Offline phase: generating final clusters

Keeping Mini-Micro and Micro Leader Clusters
This phase is triggered when a data point x arrives from the data stream. The procedure is as follows (Algorithm 2, AdjustingLeaderClusters):
1) Find the nearest micro leader cluster to the data point.
2) If such a micro leader cluster exists, find the nearest mini-micro leader cluster to the data point.
(a) If there is such a mini-micro leader cluster, merge the data point into it.
(b) Otherwise, form a new mini-micro leader cluster with x as its center.
3) Otherwise, if there is no such micro leader cluster, form a new micro leader cluster with x as its center.
Furthermore, we prune the mini-micro and micro leader clusters at every gap time in Algorithm 3, PruningLeaderClusters. At pruning time, all the micro leader clusters and their mini-micro leader clusters are checked. Micro and mini-micro leader clusters are kept in a tree structure to make searching and updating easier. Depending on the kinds of mini-micro leader clusters inside a micro leader cluster, different pruning decisions are made:
• All the mini-micro leader clusters are dense: the micro leader cluster center is kept for the offline phase.
• All the mini-micro leader clusters are sparse: the mini-micro leader clusters are removed, as well as their micro leader cluster.
• Some of the mini-micro leader clusters are dense and some are sparse:
1) Remove the sparse mini-micro leader clusters.
2) Keep the centers of the dense mini-micro leader clusters for the offline phase.
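The insertion steps and the three pruning cases above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's Algorithms 2 and 3: Euclidean distance stands in for the Mahalanobis distance, leaders are stored in flat lists instead of a tree, weights are plain counts without fading, and the class names and the parameters `micro_radius`, `mini_radius`, and `dense_threshold` are hypothetical.

```python
import math
from dataclasses import dataclass, field

@dataclass
class MiniMicroLeader:
    center: tuple          # the leader point itself; it is not moved by merges
    weight: float = 1.0    # assumption: plain count, no fading applied

@dataclass
class MicroLeader:
    center: tuple
    minis: list = field(default_factory=list)

def adjust_leader_clusters(x, micros, micro_radius, mini_radius):
    """Insert point x following steps 1-3 (Euclidean distance assumed)."""
    micro = min(micros, key=lambda m: math.dist(x, m.center), default=None)
    if micro is None or math.dist(x, micro.center) > micro_radius:
        # Step 3: no micro leader is close enough, so x starts a new one.
        # Assumption: the new micro leader begins with one mini-micro leader at x.
        micros.append(MicroLeader(center=x, minis=[MiniMicroLeader(center=x)]))
        return
    mini = min(micro.minis, key=lambda m: math.dist(x, m.center), default=None)
    if mini is not None and math.dist(x, mini.center) <= mini_radius:
        mini.weight += 1.0                             # Step 2(a): merge
    else:
        micro.minis.append(MiniMicroLeader(center=x))  # Step 2(b): new mini-micro leader

def prune_leader_clusters(micros, dense_threshold):
    """Drop sparse leaders in place and return the centers kept for the offline phase."""
    centers, kept = [], []
    for micro in micros:
        dense = [m for m in micro.minis if m.weight >= dense_threshold]
        if not dense:
            continue                      # all sparse: remove the micro leader as well
        if len(dense) == len(micro.minis):
            centers.append(micro.center)  # all dense: one representative is enough
        else:
            micro.minis = dense           # mixed: keep only the dense mini-micro leaders
            centers.extend(m.center for m in dense)
        kept.append(micro)
    micros[:] = kept
    return centers

# Toy stream (hypothetical values).
micros = []
stream = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.9, 0.0), (5.0, 5.0)]
for x in stream:
    adjust_leader_clusters(x, micros, micro_radius=1.0, mini_radius=0.3)
centers = prune_leader_clusters(micros, dense_threshold=2.0)
```

In this toy run the first micro leader ends up with one dense and one sparse mini-micro leader, so only the dense one is forwarded, while the isolated point at (5.0, 5.0) is discarded as sparse.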

Generating Final Clusters
The online phase maintains the micro and mini-micro leader clusters. However, a clustering algorithm is still needed to obtain the final clusters. When a clustering request arrives, the DBSCAN algorithm is applied to the micro and mini-micro leader cluster centers to get the final results. Each mini-micro and micro leader center is treated as a virtual point for clustering.
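The offline step can be pictured with the minimal DBSCAN sketch below, applied to a handful of leader centers treated as virtual points. The implementation is a plain textbook DBSCAN written for this example; the parameter values `eps` and `min_pts`, the toy centers, and the choice to ignore leader weights are assumptions, not settings taken from the paper.

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN; returns one label per point, with -1 marking noise."""
    NOISE, UNSEEN = -1, None
    labels = [UNSEEN] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point reached from a core point
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)       # j is a core point, so keep expanding
    return labels

# Hypothetical leader centers handed over by the online phase.
centers = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 5.0), (5.5, 5.0), (9.0, 0.0)]
labels = dbscan(centers, eps=0.6, min_pts=2)
```

Since only the leader centers are clustered, the cost of this step depends on the number of leaders rather than on the number of stream points.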

Experimental Evaluation
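We implemented LeaDen-Stream in Massive Online Analysis (MOA) [30] (Figure 4). In order to evaluate the clustering quality and scalability of the LeaDen-Stream algorithm, both real and synthetic data sets are used. The synthetic data set is depicted in Figure 5. The real data set is the KDD CUP99 Network Intrusion Detection data set (all 34 continuous attributes out of the total 42 available attributes are used). Using the MOA framework, the clustering quality of the LeaDen-Stream algorithm is evaluated and compared with CluStream and DenStream based on purity [31]. The efficiency is measured by the execution time. LeaDen-Stream achieves higher quality than CluStream with lower execution time. Its clustering quality is equal to that of DenStream, while it runs faster than DenStream.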
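Purity [31], the quality measure used in this evaluation, can be computed as in the sketch below: each cluster is credited with the size of its majority class, and the credits are divided by the total number of points. The function name and the toy labels are assumptions for illustration and are not taken from the MOA implementation or from the reported experiments.

```python
from collections import Counter

def purity(cluster_labels, class_labels):
    """Fraction of points whose class is the majority class of their cluster."""
    by_cluster = {}
    for c, y in zip(cluster_labels, class_labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(cluster_labels)

# Toy example (hypothetical): two clusters over six labeled points.
clusters = [0, 0, 0, 1, 1, 1]
classes = ["normal", "normal", "attack", "attack", "attack", "attack"]
score = purity(clusters, classes)
```

Here cluster 0 contributes 2 correctly grouped points and cluster 1 contributes 3, so five of the six points count toward purity.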

Conclusion
In this paper, we have proposed LeaDen-Stream, an algorithm for density-based clustering of evolving data streams using leader clustering. The algorithm runs in two phases. The method determines the data points for offline clustering based on the distribution of the data inside the micro leader clusters. If the data are uniformly distributed, it sends only the micro leaders' centers. However, if the data are non-uniformly distributed, the centers of the dense mini-micro leader clusters are kept for the offline phase instead of the micro leader centers. The pruning strategy is designed to eliminate the sparse mini-micro and micro leader clusters and to keep the dense ones for the offline phase.
Mini-micro and micro leader clusters are used to increase the cluster quality and to decrease the time complexity. Using more than one representative point when some of the mini-micro leader clusters are dense and some are sparse improves the quality of clustering. On the other hand, when all of the mini-micro leader clusters are dense, sending only the micro leader cluster's center is enough for the offline phase, which in turn reduces the time complexity.
Experimental results on a real-world data set as well as a synthetic data set validate the design goals: LeaDen-Stream improves over CluStream in both clustering quality and execution time, and it matches the clustering quality of DenStream while running faster. As future work, we want to automate the selection of the parameters of LeaDen-Stream and examine our algorithm in a sliding window model.

Figure 1. Data stream clustering algorithms execution time comparison.

Figure 2. Data stream clustering algorithms quality comparison.

Figure 3. Mini micro and micro leader clusters.

Lemma 2. The minimum time for converting DMMLC to SMMLC and vice versa is:

Figure 4. LeaDen-Stream in MOA.