A State of Art Analysis of Telecommunication Data by k-Means and k-Medoids Clustering Algorithms

Cluster analysis is one of the major data analysis methods widely used for many practical applications in emerging areas of data mining. A good clustering method will produce high quality clusters with high intra-cluster similarity and low inter-cluster similarity. Clustering techniques are applied in different domains to predict future trends of available data and its uses for the real world. This research work is carried out to find the performance of two of the most delegated, partition based clustering algorithms namely k-Means and k-Medoids. A state of art analysis of these two algorithms is implemented and performance is analyzed based on their clustering result quality by means of its execution time and other components. Telecommunication data is the source data for this analysis. The connection oriented broadband data is given as input to find the clustering quality of the algorithms. Distance between the server locations and their connection is considered for clustering. Execution time for each algorithm is analyzed and the results are compared with one another. Results found in comparison study are satisfactory for the chosen application.


Introduction
Data Mining (DM) is a convenient way of extracting patterns, which represents knowledge implicitly stored in large data sets and focuses on issues relating to their feasibility, usefulness, effectiveness and scalability.Data mining approach tions to data miners.An important field of application for data mining techniques is also the World Wide Web.The Web provides the ability to access one of the largest data repositories, which in most cases still remains to be analyzed and understood.Recently, DM techniques are also being applied to social sciences, home land security and counter terrorism.A DM system is therefore composed of a software environment that provides all the functionalities to compose DM applications, and a hardware back-end onto which the DM applications are executed.
Data mining can be performed on various types of databases and information repositories, but the kind of patterns to be found are specified by various data mining functionalities like class/concept description, association, correlation analysis, classification, prediction, cluster analysis etc.Among these, Cluster analysis is one of the major data analysis method widely used for many practical applications in emerging areas [2].Clustering is the process of finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.The quality of a clustering result depends on both the similarity measure used by the method and its implementation and also by its ability to discover some or all of the hidden patterns.There is a number of clustering techniques that have been proposed over the years [3].Different clustering approaches may yield different results.involving both telecommunication data and clustering algorithms".In this paper [11], they have proposed bit-vector based information storage method (BV) for the analysis of telecommunication data.The method reduces the amount of overall data by storing only the deviation from the previous measurement as bit-value.They implemented a simulator to evaluate the efficiency of the proposed methods, and the methods turned out to be fairly efficient compared to normal cases.

Materials and Methods
Clustering is a concept to determine the pattern through map and analysis of available data set according to the need and demand of the business applications.
Clustering is belonging to both data analysis and machine learning major domains.Many methodologies have been proposed in order to organize, to summarize or to simplify a dataset into a set of clusters such that data belonging to a same cluster are similar and data from different clusters are dissimilar [12].Different approaches may yield different results in the clustering techniques.Here, the telecommunication data is analyzed based on the distance between the server locations and their user points.As stated in previous sections, the k-Means and k-Medoids clustering algorithms are compared and analyzed [13].In the comparison study, only the time parameter is taken for analysis.

The k-Means Algorithm
The k-Means is one of the simplest unsupervised learning algorithms that solve the well known clustering problem.The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori [14].This algorithm aims at minimizing an objective function, in this case a squared error function.
The objective function where x c − is a chosen distance measure between a data point j i x and the cluster centre c j , is an indicator of the distance of the n data points from their respective cluster centers.The algorithm is composed of the following steps: Step 1: Place k points into the space represented by the objects that are being clustered.These points represent initial group centroids.
Step 2: Assign each object to the group that has the closest centroid.
Step 3: When all objects have been assigned, recalculate the positions of the k centroids.
Step 4: Repeat Steps 2 and 3 until the centroids no longer move.
This produces a separation of the objects into groups from which the metric to be minimized can be calculated.Although it can be proved that the procedure will always terminate, the k-Means algorithm does not necessarily find the most optimal configuration, corresponding to the global objective function minimum [15].This k-Means is a simple algorithm that has been adapted to many problem domains [16].

The k-Medoids Algorithm
The k-Means algorithm is sensitive to outliers since an object with an extremely large value may substantially distort the distribution of data [17].Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.Thus, the partitioning method can still be performed based on the principle of minimizing the sum of the dissimilarities between each object and its corresponding reference point [18].This forms the basis of the k-Medoids method.The basic strategy of k-Medoids clustering algorithms is to find k clusters in n objects by first arbitrarily finding a representative object (the medoids) for each cluster.Each remaining object is clustered with the medoid to which it is the most similar [19].The algorithm takes the input parameter k, the number of clusters to be partitioned among a set of n objects.A typical k-Mediods algorithm for partitioning based on medoid or central objects is as follows: Input: k: The number of clusters D: A data set containing n objects Output: A set of k clusters that minimizes the sum of the dissimilarities of all the objects to their nearest medoid.
Method: Arbitrarily choose k objects in D as the initial representative objects; Repeat assigneach remaining object to the cluster with the nearest medoid; Randomly select a non medoid object Orandom; Compute the total points S of swaping object O j with Oramdom; if S < 0 then swap O j with Orandom to form the new set of k medoid; Until no change; It attempts to determine k partitions for n objects.After an initial random selection of k medoids, the algorithm repeatedly tries to make a better choice of medoids.Therefore, the algorithm is often called as representative object based algorithm.

Experimental Setup
Data mining concepts are used in different applications as per the need, demand, nature of the problem and domain.In this research, the clustering process is achieved using a distance method.The clustering process is aimed to minimize the expenditure of the business application and increase the benefits of the business.The algorithms are implemented in the real time connection oriented telecommunication data and the results are discussed.In this process, the communication connection structure is evaluated and reconstructed using clustering techniques for the effective data distribution.The data distribution process is affected by the connected server, distance and number of connections available in the specific server.The distance factor create an impact on the creation of the infrastructure using cable, cost of the cable, manpower, maintenance and the data distribution based on the bandwidth.Therefore, the data access points are considered as data points and planned to optimize the network using clustering concepts.After the clustering process, the number of connections for each server is changed.
The data set collected from a broadband service provider at Chennai city.The connection oriented data set contains 285,520 data connection points with 27 servers with locations.The 27 servers are treated as 27 clusters in this work and they are called as data centers.The user points are called as data access points.There are 12 data sets available.One data set for each month.The data set contains information about distance, type of connection (single user, multi user), data transfer capacity (256, 512, 1024, 2048), area code and the server number in which the data points are connected.This representation is based on the connections established from the month of January to December.The collected data consists of the connection establishment month, area and the connected data center, type of the data service and the volume of data used in the year.The total connection of data access points are connected according to the geographical location.This connection is made, based on the demand of the customer which is provided by the service provider.The total number of connected data points for each and every month is given in Table 1.The distance between data centers and its data access points are available in meters.The servers (Data centers) are treated as center point in the clustering process.Normally, the distance between one server and the same server is zero.
The total number of data access points (number of user points before clustering) in all the 27 servers are given in Table 2.The data access points are currently distributed unevenly based on the request of the user at every month and the availability of the nearest data centers.But, this caused the issues of the traffic and the distribution of data.It affects the quality of the server to the user as well as the service provider.In the exiting connection, the data center 27 has 279 connections alone.This is the minimum number of connection in a particular data center.But, the first data center has 14,101 data access points, which is the maximum of all the data centers.After the clustering process, this may vary from server to server.

Clustering Process and Experimental Results
The k-Mean and k-Medoids algorithms are implemented using MATLAB software and the results are discussed in this section.In the implementation process, the data set is processed based on the distance.Initially, first data center and the first month (January) are selected.For the selected data center and for the month, the distance is reconstructed and stored in the process matrix.The reconstruction of distance is made by using the Pythagoras theorem.After the reconstruction of the distance, the data access points are clustered using any one of the taken algorithm.In this process, the number of user points in each and every data center is reassigned.Therefore, each data center has some new number of data access points after the process.The computational time and the number of connections in each server are stored in tables.This means that the starting and ending time of clustering process is stored in tables.Next, by choosing the same first data center, the second month (February) data access points are chosen and clustered using the chosen algorithm.This process is repeated upto the last month (December) data.After processing the 12th month data by choosing the first server, the second server is chosen and the process is repeated up to 27 servers.Hence, the number of connections in each data center is considered as application impact and the process time is considered as computation impact.
The algorithmic steps involved in the clustering process are summarized below.

Experimental Results
The k-Mean clustering algorithm is implemented as per the discussion above.month data is available in the last row, which is found to be 17.925 sec.The data points in each data center points are clustered (distributed) using the k-Means algorithm based on the neighborhood distance.The minimum time taken by the algorithm is 8.050 seconds and the maximum is 37.008 seconds.To avoid lengthy discussion, the number of data points created by the algorithm is not shown.
The first server is chosen as a center point, in the same way the second server is chosen and the clustering process is repeated.
In the similar fashion, the results of k-Medoids algorithms are also given in the Table 3.In this table, it is easy to find the total processing time for the first month data is 34.505 seconds and so on.The average processing time for the 12 month data is 35.732seconds.Both algorithms are executed 12 times (up to choosing 12 servers as center point) and the average results are listed in Table 4.
The average processing time of 12 runs of the k-Means clustering is observed.
The algorithm takes a minimum of 15.919 Seconds in the 7th run and a maximum of 19.796 Seconds in the 5th run.From the same table, a minimum of 31.924seconds and a maximum of 57.659 seconds are taken by k-Medoids algorithm to complete the clustering process.After the clustering process, the number of data points created by both the algorithms is given in Table 5.It is very clear that from the results of the algorithm, the distribution of data points in each and every data center is almost closer to one another.In the distribution process, data access points are equally divided for all the data centers.This makes the requirement of the clustering process.

Summary and Discussions
The total numbers of 285,520 data access points are clustered which are available in 27 data centers by the chosen two algorithms.Based on the distance between the data access points and data centers, the performance and efficiency of the clustering process is analyzed.The k-Means algorithm assigns a minimum of 9853 data points and a maximum of 11,185 data points after the clustering.The minimum and maximum data points assigned by the k-Medoids method is 9808 and 11361 respectively.Figure 1 shows that the minimum and maximum data access points assigned by the proposed two algorithms.Note that the minimum and the maximum data points are created by the k-Medoids method.But, in particular, the difference between minimum and the maximum is very less in both the methods.In Table 4, average execution time for 12 runs is given in last row to both algorithms.Figure 2 is the graphical representation of average time

Conclusion
Cluster analysis is still an active field of development.
The partitioning based algorithms are frequently used by many researchers for various applications in different domains.This research work compares two of the partitioning based clustering techniques namely k-Means and k-Medoids via its performance based on their execution time.The remainder of the paper is structured as follows.The next section provides a comprehensive outline of related work via literature survey.Section 3 describes the basic approach and method of both algorithms.An experimental setup of the telecommunication data and the properties of the same data are discussed in Section 4. Section 5 explores the clustering process and obtained results of the algorithms.Finally, Section 6 contains the concluding remarks of the research work.
complexity.In this research, the most representative algorithms k-Means and the Enhanced k-Means were examined and analyzed based on their basic approach.They found that the elapsed time taken by proposed enhanced k-means is less than k-means algorithm.A work carried out by Benderskaya et al., titled as "Self-organized Clustering and Classification: A Unified Approach via Distributed Chaotic Computing" [9] describes a unified approach to solve clustering and classification problems by means of oscillatory neural networks with chaotic dynamics.The advantages of distributed clusters formation in comparison to centers of clusters estimation are demonstrated.New approach to clustering on-the-fly is proposed.A Novel Approach to Medical Image Segmentation is presented by Shanmugam et al., in their paper [10].In this research, a novel approach is used to segment the 2D echo images of various views.A modified k-Means clustering algorithm, called "Fast SQL k-Means" is proposed using the power of SQL in DBMS environment.In k-Means, Euclidean distance computation is the most time consuming process.Since the entire processing is done with database, additional overhead of import and export of data is not required.The 2D echo images are acquired from the local Cardiology Hospital for conducting the experiments.There are number of research articles utilizing broad band data for the analysis of various types of networks.Also, some of the clustering algorithms are utilized to analyze telecommunication data.One such topic was done by Sung Suk Kim and Sun Ok Yang titled as "Wireless sensor gathering data during long time

1 ) 7 )
Selection of Algorithm from k-Means or k-Medoids 2) Selection of data center 3) Calculate the distance between data access points and servers based on selected data center 4) Selection of monthly data 5) Implementation of the selected algorithm and cluster the distance 6) According to the processed cluster, the data points are reassigned to the data center Observe the cluster process start time and completion time 8) Summaries the number of connection in each data center 9) Implement the step (2) to (8) to all the different data sets 10) Represent the data connection according to the newly assigned data center Usually, clustering approaches yields different kind of results, this depends on the nature and chosen application of the problems.Next section discusses the results and interpretations about the results of the algorithms produced as output in the process.
Many cluster analysis techniques do not have a strong formal basis.Cluster analysis is a rather ad-hoc field.There are a wide variety of clustering techniques.Comparisons among different clustering techniques are difficult.All techniques seem to impose a certain structure on the data and yet few authors describe the type of limitations being imposed.In spite of all these problems, clustering analysis is a useful (and interesting) field.In summary, clustering is an interesting, useful, and challenging problem.It has great potential in applications like object recognition, image segmentation, and information filtering and retrieval.However, it is possible to exploit this potential only after making several carefully chosen designs and application.From the experimental approach, by several executions of the program [4] main advantage of clustering is that interesting patterns and structures can be found directly from very large data sets with little or none of the background knowledge.The cluster results are not subjective, but implementation dependent.Data Clustering has been addressed by many researchers and many clustering approaches have been explored and studied.A variety of data clustering algorithms are developed and applied for many applications domain in the field of data mining.Clustering techniques have been applied to a wide variety of research problems.Hartigan provides an excellent summary of the many published studies reporting the results of cluster analyses[4].For example, in the field of medicine, clustering diseases, cures for diseases, or symptoms of diseases can lead to very useful taxonomies.In the field of psychiatry, the correct diagnosis of clusters of symptoms such as paranoia, schizophrenia, etc. is essential for [5]adays, data clustering has attracted the attention of many researchers in different disciplines.It is an important and useful technique in data analysis.A T. Velmurugan DOI: 10.4236/jcc.2018.61019192Journal of Computer and Communications large number of clustering algorithms have been put forward and investigated.successfultherapy.In archeology, researchers have attempted to establish taxonomies of stone tools, funeral objects, etc. by applying cluster analytic techniques.In general, whenever one needs to classify a "mountain" of information into manageable meaningful piles, cluster analysis is of great utility.Bradley P.S and Fayyad describe refining Initial Points for k-Means Clustering in their paper[5].They said that the practical approaches to clustering use an iterative procedure (e.g.

Table 1 .
Total connections in data set.

Table 2 .
Number of connections in servers (before clustering).

Table 3
is the processing time of both k-Means and k-Medoids algorithm.The total elapsed time by k-Means algorithm to cluster first month data set for the entire 27 data center is given in third column.Processing time for the first month data is 16.653 seconds and so on.Average processing time of all the 12

Table 3 .
Results of clustering algorithms (processing time).

Table 5 .
Total number of connections (after clustering).