This paper focuses on the use of fractal dimension in clustering evolving data streams. Recently, Anuradha *et al.* proposed a new approach based on Relative Change in Fractal Dimension (RCFD) and the damped window model for clustering evolving data streams. Observations on that paper reveal that the formation of quality clusters depends heavily on a suitable choice of threshold value. Anuradha *et al.* used a heuristic approach to fix the threshold value. Although the outcome of that approach is acceptable, it is based purely on random selection and offers no basis for claiming acceptability in general. In this paper, a novel method is proposed to compute the threshold value optimally using a population-based randomized approach known as particle swarm optimization (PSO). Simulations are carried out on two large data sets, the KDD Cup 1999 data set and the Forest Covertype data set, and the resulting cluster quality is compared with that of the fixed-threshold approach. The comparison reveals that the threshold value chosen by Anuradha *et al.* is robust and can be used with confidence.

Clustering partitions data into groups of similar objects, where each cluster can serve as a model of the similarity among its data points. Representing data with too few clusters necessarily loses certain fine details, while a large number of clusters may not give good results. Advances in software and hardware have enabled enormous growth in the generation and storage of data in the most diverse segments of society [

Traditional clustering approaches are not sufficiently flexible to deal with data that continuously evolve with time; thus, in the last few years, many proposals for data stream clustering have been presented [

New concepts may keep evolving in data streams over time. Evolving concepts require data stream processing algorithms to continuously update their models to adapt to the changes [

In data stream mining, algorithms should handle online clustering effectively and should maintain the clusters according to their potential. Time plays a major role in data stream clustering: a data point belonging to a cluster in one time horizon can become an outlier in another, because the most recent data matter most. The algorithm [

The rest of the paper is organized as follows. Section 2 describes the basic concepts used in the algorithm for clustering data streams using correlation fractal dimension [

Fractals are self-similar objects. A cluster is a group of mutually similar objects. Fractal dimension is the measure used for creating a cluster [

Cluster partitions on evolving data streams are often computed over certain time intervals (or windows). There are three well-known window models: landmark window, sliding window and damped window. In order to meet the requirements of evolving data streams, we adopt a basic window technique based on the sliding window model [_{o} and the numbers of points are n_{o}, n_{1}, …, n_{e} at times t_{o}, t_{1}, …, t_{e} respectively, the weight of the current window is calculated as in equation (2).

Definition 1: (Correlation Fractal Dimension) Given a data set containing N data points which show self-similarity in the range of scales (r_{min}, r_{max}), the correlation fractal dimension D_{2} [

where C_{r,i} is the occupancy with which the data points fall in the i^{th} cell when the original space is divided into grid cells with sides of length r. The relative change in fractal dimension is more accurate in describing the change in the pattern of the data set that leads to a new cluster. In order to measure the change in fractal dimension that corresponds to the change in patterns, we formally define the relative change in fractal dimension, RCFD [

Definition 2: (Relative Change in Fractal Dimension, RCFD) Given a data stream S, a cluster c, and a new cluster c’ formed by joining a new set of data points to c, fd(c) is the fractal dimension of the old cluster c and fd(c’) is the fractal dimension of the new cluster c’. The relative change in fractal dimension is defined as follows

Here fd refers to correlation fractal dimension D_{2}, which can reflect the data distribution and indicate the change of data trend.
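As a rough illustration of the two definitions, the sketch below estimates D_{2} by box counting and then computes a relative change between two dimensions. Several points are assumptions rather than the published method: the occupancy C_{r,i} is taken as the fraction of points in cell i, D_{2} is taken as the least-squares slope of log Σ C_{r,i}^2 against log r, and RCFD is taken as |fd(c’) − fd(c)| / fd(c). The function names are hypothetical.

```python
import math
from collections import Counter


def correlation_fractal_dimension(points, radii):
    """Estimate D2 as the slope of log S(r) versus log r.

    S(r) is the sum over grid cells of the squared occupancy, where the
    occupancy is assumed here to be the fraction of points in the cell.
    """
    n = len(points)
    log_r, log_s = [], []
    for r in radii:
        # Map every point to the grid cell (side length r) that contains it.
        cells = Counter(tuple(math.floor(v / r) for v in p) for p in points)
        s = sum((count / n) ** 2 for count in cells.values())
        log_r.append(math.log(r))
        log_s.append(math.log(s))
    # Ordinary least-squares slope of log_s against log_r.
    mean_x = sum(log_r) / len(log_r)
    mean_y = sum(log_s) / len(log_s)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_r, log_s))
    den = sum((x - mean_x) ** 2 for x in log_r)
    return num / den


def rcfd(fd_old, fd_new):
    """Relative change in fractal dimension (assumed form)."""
    return abs(fd_new - fd_old) / fd_old


# Points on a diagonal line should give D2 close to 1,
# points filling a square should give D2 close to 2.
line = [(i / 1024, i / 1024) for i in range(1024)]
grid = [(i / 32, j / 32) for i in range(32) for j in range(32)]
radii = [1 / 2, 1 / 4, 1 / 8, 1 / 16]
d_line = correlation_fractal_dimension(line, radii)
d_grid = correlation_fractal_dimension(grid, radii)
```

Adding line-like points to a square-filling cluster would pull the estimated dimension away from 2, and it is this shift that the RCFD test compares against the threshold.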

The main idea behind FractStream [

Algorithm 1 Online Clustering (p_{1}, p_{2}, … p_{k})

Consider a window of data points p_{1}, p_{2}, … p_{k}
Normalize data points
While stream has not ended
    For all Core Fractal Clusters CFC_{i}
        CFC_{i}’ is the new cluster formed after placing the points in CFC_{i}
        Compute new fractal dimension of CFC_{i}’
    If RCFD < €
        Then merge {p_{1}, p_{2}, … p_{k}} to CFC_{i}.
        Update weight w.
    else
        For all Progressive Fractal Clusters PFC_{i}
            PFC_{i}’ is the new cluster formed after placing the points in PFC_{i}
            Compute new fractal dimension of PFC_{i}’
        If RCFD < €
            Then merge {p_{1}, p_{2}, … p_{k}} to PFC_{i}.
            Update weight w.
        else
            For all Outlier Fractal Clusters OFC_{i}
                OFC_{i}’ is the new cluster formed after placing the points in OFC_{i}
                Compute new fractal dimension of OFC_{i}’
            If RCFD < €
                Then merge {p_{1}, p_{2}, … p_{k}} to OFC_{i}.
                Update weight w.
                If w_{o} (new weight of OFC_{i}) > βµ
                    Then remove OFC_{i} from outlier hash table and create a new Progressive Fractal Cluster PFC_{i}.
                end if
            else
                Create a new Outlier Fractal Cluster with {p_{1}, p_{2}, … p_{k}} and insert it into outlier fractal clusters hash table.
            end if
        end if
    end if
end while

At first, we try to insert the points into every CFC and compute the relative change in fractal dimension, RCFD. The points are inserted into the cluster whose RCFD is within the minimum threshold and removed from the rest of the CFCs. If w is below µ and above βµ, CFC_{i} has become a progressive fractal cluster. Therefore, we remove CFC_{i} from the core fractal buffer and create a new Progressive Fractal Cluster PFC_{i}.

Else, we try to insert the points into every PFC and compute RCFD. The points are inserted into the cluster whose RCFD is within the minimum threshold and removed from the rest of the PFCs, and the weight of the cluster is updated.

Else, we try to merge the points into every OFC and compute RCFD. The points are inserted into the OFC whose RCFD is within the minimum threshold and removed from the rest of the OFCs. We then check w, the new weight of the OFC. If w is above βµ, OFC_{i} has grown into a progressive fractal cluster. Therefore, we remove OFC_{i} from the outlier buffer and create a new Progressive Fractal Cluster PFC_{i}.

Otherwise, we create a new Outlier Fractal Cluster OFC_{i} from the basic window of points and insert it into the outlier buffer. This is because these points do not naturally fit into any existing fractal cluster. These points may be outliers or the seed of a new cluster.
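The cascade described above (core, then progressive, then outlier clusters) can be sketched as follows. This is a simplified illustration under stated assumptions, not the published implementation: the fractal dimension estimator is passed in as a callable `fd`, RCFD is assumed to be |fd(c’) − fd(c)| / fd(c), the weight update simply adds the window size (no decay), and the demotion of a core cluster to a progressive one is omitted. All class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence

Point = Sequence[float]


@dataclass(eq=False)  # compare clusters by identity, not by contents
class FractalCluster:
    points: List[Point] = field(default_factory=list)
    weight: float = 0.0


def try_merge(window, clusters, eps, fd) -> Optional[FractalCluster]:
    """Merge the window into the cluster with the smallest RCFD below eps."""
    best, best_change = None, eps
    for cluster in clusters:
        old = fd(cluster.points)
        if old == 0:
            continue  # relative change is undefined for a zero dimension
        new = fd(cluster.points + list(window))
        change = abs(new - old) / old  # assumed RCFD form
        if change < best_change:
            best, best_change = cluster, change
    if best is not None:
        best.points.extend(window)
        best.weight += len(window)  # simplistic weight update (assumption)
    return best


def assign_window(window, cfcs, pfcs, ofcs, eps, beta_mu,
                  fd: Callable[[List[Point]], float]) -> str:
    """Place one basic window of points; returns which branch was taken."""
    if try_merge(window, cfcs, eps, fd) is not None:
        return "core"
    if try_merge(window, pfcs, eps, fd) is not None:
        return "progressive"
    grown = try_merge(window, ofcs, eps, fd)
    if grown is not None:
        if grown.weight > beta_mu:
            # The outlier cluster has grown into a progressive cluster.
            ofcs.remove(grown)
            pfcs.append(grown)
            return "promoted"
        return "outlier"
    # The window fits nowhere: it seeds a new outlier cluster.
    ofcs.append(FractalCluster(list(window), float(len(window))))
    return "new_outlier"
```

Choosing the cluster with the smallest RCFD is one way to resolve the case where several clusters pass the threshold test; the text itself does not specify a tie-breaking rule.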

The weight of all the clusters needs to be checked periodically, because if no points are added to a cluster its weight decreases as time passes. If the weight of PFC_{i} falls below βµ, it is no longer progressive and should be deleted to release memory for new PFCs. As the data stream advances, the number of Outlier Fractal Clusters also increases. So OFCs that have the potential to grow into PFCs are retained, while real outliers are deleted.

The clustering quality [

where K denotes the number of clusters. _{i} denotes the number of points in cluster i. Intuitively, purity measures how homogeneous the clusters are with respect to the true cluster (class) labels that are known for our data sets.
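A minimal sketch of such a purity measure is given below. It assumes the average-purity form commonly used in stream clustering evaluations: for each cluster, the share of points carrying the dominant class label, averaged over the K clusters and expressed as a percentage. The function name and the input layout (one list of true labels per cluster) are illustrative choices.

```python
from collections import Counter


def purity(clusters):
    """Average purity in percent.

    `clusters` is a list of clusters, each given as the list of true class
    labels of its points. Empty clusters are ignored.
    """
    ratios = []
    for labels in clusters:
        if not labels:
            continue
        # Number of points that carry the dominant class label.
        dominant = Counter(labels).most_common(1)[0][1]
        ratios.append(dominant / len(labels))
    return 100.0 * sum(ratios) / len(ratios)


# Two clusters: one with 2 of 3 points in the dominant class, one pure.
example = purity([["normal", "normal", "attack"], ["attack", "attack"]])
```

A purity of 100 means every cluster contains points of a single class only.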

The existing check of membership in a cluster, based on comparing the Relative Change in Fractal Dimension (RCFD) with a minimum threshold, raises the following concerns.

There may be situations in which RCFD is less than the minimum threshold value € for more than one base cluster.

There may be situations in which incoming data points are unnecessarily placed in a new cluster, i.e. an outlier cluster, instead of being fitted into the right cluster.

To handle the above problems, a technique is proposed to choose the right threshold value. From studies it is seen that the threshold value is set between 0.02 and 2.50. If € is set to a larger value, the chance that an incoming set of data points belongs to more than one base cluster increases. However, in some instances a small value of € may cause unnecessary new clusters to be created as outliers. Hence it is a big challenge to set the value of € properly within the given range for the effective creation of quality clusters. In our proposed approach we adopt a popular evolutionary technique known as particle swarm optimization to compute suitable values.

In this work an attempt is made to compute the most suitable value of the threshold € without compromising the quality of the clusters. Although several classical optimization techniques exist in the literature, such as Newton's method, linear programming, integer programming and quadratic programming, we have chosen a population-based randomized approach to address the above issue because of the widespread use, range of applications and reported effectiveness of such techniques. Particle swarm optimization (PSO) is one of the most thoroughly researched and widely used randomized techniques. In our work PSO is taken as the basic framework for finding the optimal threshold value. The section below gives an introductory explanation of PSO, followed by the approach for finding the threshold value using PSO.

PSO is a population-based search algorithm and is initialized with a population of random solutions called particles [

The i^{th} particle is represented as x_{i} = (x_{i1}, x_{i2}, ..., x_{id}), where d is the dimension.

The best previous position (the position giving the best fitness value) of the i^{th} particle is recorded and represented as P_{i} = (P_{i1}, P_{i2}, ..., P_{id}). This best position is also referred to as pbest.

The index of the best particle among all the particles in the population is represented by the symbol g. The position of this particle is also referred to as gbest.

The rate of position change (velocity) for particle i is represented as V_{i} = (V_{i1}, V_{i2}, ..., V_{id}).

The particle is manipulated according to the following equation:

where w is the inertia weight, c_{1} and c_{2} are two positive constants, and rand( ) is a random function in the range (0,1) [ P_{id} is known as pbest and P_{gd} is known as gbest.
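For reference, the standard inertia-weight form of the PSO update, written with the symbols defined above, is shown below. That this is the exact form used here is an assumption; the numbering follows the later reference to equations (6) and (7) for the velocity and position updates, and the two rand( ) calls are drawn independently.

```latex
\begin{align}
V_{id} &= w\,V_{id} + c_{1}\,\mathrm{rand}()\,(P_{id} - x_{id})
          + c_{2}\,\mathrm{rand}()\,(P_{gd} - x_{id}) \tag{6}\\
x_{id} &= x_{id} + V_{id} \tag{7}
\end{align}
```

The first term carries the particle's previous velocity forward, the second pulls it toward its own best position, and the third pulls it toward the best position found by the swarm.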

Algorithm 2: Global version of PSO used to find the threshold value

Step 1: Initialize a population of particles with random positions and velocities on d dimensions in the problem space, and evaluate the desired optimization fitness function in d variables for each particle.

Step 2: Compare each particle’s fitness evaluation with the particle’s pbest. If the current value is better than pbest, set the pbest value equal to the current value and the pbest location equal to the current location in d-dimensional space.

Step 3: Compare the fitness evaluation with the population’s overall previous best. If the current value is better than gbest, reset gbest to the current particle’s value.

Step 4: Change the velocity and position of the particle according to equations (6) and (7), respectively.

Step 5: Loop to step 2 until a stopping criterion is met, usually a sufficiently good fitness or a maximum number of iterations.

To choose the minimum threshold value optimally, the algorithm described above is used in our approach. The fitness function of the problem under study is the quality of the clusters formed, defined as the purity of the clusters as in equation (5). Each particle initialized in our work is used to evaluate the purity based on the concept of RCFD as described in Section 2. Accordingly, the gbest of the swarm and the pbest of each particle are computed for further processing in subsequent iterations. The algorithm runs for a sufficient number of iterations until the best value is obtained.

The proposed algorithm is implemented to choose the optimal value of the threshold. The range of threshold values is chosen between 0.02 and 2.50, based on the values we observed in related research papers. The current positions and velocities are initialized in this range. The maximum velocity is set to 1 so that particles do not overshoot the selected search region.
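The setup described above can be sketched as a one-dimensional PSO over the threshold. This is an illustrative sketch, not the implementation used in the experiments: the values of w, c_{1} and c_{2}, the swarm size, the iteration count and the seed are placeholder assumptions, and `fitness` is a hypothetical callable that would run the clustering with a given threshold and return its purity. Only the search range 0.02 to 2.50 and the velocity cap of 1 come from the text.

```python
import random


def pso_threshold(fitness, lo=0.02, hi=2.50, n_particles=20, n_iter=100,
                  w=0.7, c1=1.5, c2=1.5, v_max=1.0, seed=0):
    """Search [lo, hi] for the threshold that maximizes `fitness`."""
    rng = random.Random(seed)
    # Step 1: random positions and velocities, evaluated once.
    x = [rng.uniform(lo, hi) for _ in range(n_particles)]
    v = [rng.uniform(-v_max, v_max) for _ in range(n_particles)]
    pbest = list(x)
    pbest_fit = [fitness(xi) for xi in x]
    g = max(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g], pbest_fit[g]

    for _ in range(n_iter):
        for i in range(n_particles):
            # Step 4: velocity and position update, with velocity clamping.
            v[i] = (w * v[i]
                    + c1 * rng.random() * (pbest[i] - x[i])
                    + c2 * rng.random() * (gbest - x[i]))
            v[i] = max(-v_max, min(v_max, v[i]))
            x[i] = max(lo, min(hi, x[i] + v[i]))  # stay inside the range
            # Steps 2 and 3: refresh pbest and gbest (gbest is updated as
            # soon as a better particle is seen).
            f = fitness(x[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = x[i], f
                if f > gbest_fit:
                    gbest, gbest_fit = x[i], f
    return gbest, gbest_fit


# Toy stand-in for the purity-based fitness, peaking at a threshold of 0.5.
best_eps, best_fit = pso_threshold(lambda eps: -(eps - 0.5) ** 2)
```

In the real setting each fitness evaluation is a full clustering run, so the swarm size and iteration count trade search quality against run time.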

The values of w, c_{1} and c_{2} are chosen as in [

From the above result and results derived from [

Clustering quality (Network Intrusion data set, horizon = 1, stream speed = 1000)

Clustering quality (Network Intrusion data set, horizon = 5, stream speed = 1000)

Clustering quality (Forest Covertype data set, horizon = 1, stream speed = 1000)

Clustering quality (Forest Covertype data set, horizon = 5, stream speed = 1000)

In this paper a new approach for suitably choosing a minimum threshold value that helps in identifying the appropriate group or cluster in RCFD based clustering technique is investigated. In [