*Modalclust* is an R package that performs Hierarchical Mode Association Clustering (HMAC), along with a parallel implementation over several processors. Modal clustering techniques are especially designed to efficiently extract clusters in high dimensions with arbitrary density shapes. Clustering is performed over several resolutions and the results are summarized as a hierarchical tree, thus providing a model-based multi-resolution cluster analysis. We also implement a novel parallel version of HMAC that distributes the clustering job over several processors, dramatically increasing speed, especially for large data sets. The package additionally provides a number of functions for visualizing clusters in high dimensions, which can also be used with other clustering software.

Cluster analysis is a ubiquitous technique in statistical analysis that has been widely used in multiple disciplines for many years. Historically, cluster analysis has been approached either from a fully parametric view, e.g. mixture-model-based clustering, or from a distribution-free view, e.g. linkage-based hierarchical clustering. While the parametric paradigm provides an inferential framework and accounts for sampling variability, it often lacks the flexibility to accommodate complex clusters and is often not scalable to high-dimensional data. On the other hand, distribution-free approaches are usually fast and capable of uncovering complex clusters by making use of different distance measures. However, an inferential framework is distinctly missing from distribution-free clustering techniques. Accordingly, most clustering packages in R also fall into one of these two groups.

This paper describes a software package for cluster analysis that kneads together the strengths of these two seemingly different approaches and develops a framework for parallel implementation of clustering techniques. For most model-based approaches to clustering, the following limitations are well recognized in the literature: 1) the number of clusters has to be specified; 2) the mixing densities have to be specified, and since estimating the parameters of the mixture model is often computationally expensive, we are often forced to limit our choices to simple distributions such as the Gaussian; 3) computational speed is inadequate, especially in high dimensions, which together with the complexity of the proposed model often limits the use of model-based techniques either theoretically or computationally; 4) it is not straightforward to extend model-based clustering to uncover heterogeneity at multiple resolutions, similar to that offered by model-free linkage-based hierarchical clustering.

Influential work towards resolving the first three issues has been carried out in [

The hierarchical mode association clustering—HMAC [

This paper is organized as follows: Section 2 briefly introduces the Modal Expectation Maximization (MEM) algorithm and builds the notion of mode association clustering. Section 3 describes a parallel computing framework for HMAC along with computing time comparisons. Section 4 illustrates the implementation of the clustering functions in the R package Modalclust, along with examples of the plotting functions especially designed for objects of class hmac. Section 5 provides the conclusion and discussion. Comparisons of modal clustering with other popular model-based and model-free techniques are provided in the supplementary document.

The main challenge in using mode-based clustering in high dimensions is the cost of computing modes, which are mathematically evaluated as local maxima of the density function with support on

Modal Expectation Maximization (MEM). Define the mixture density as $f(x) = \sum_{k=1}^{K} \pi_k f_k(x)$. Given any initial value $x^{(0)}$, the MEM algorithm ascends to a local maximum of $f$ by alternating the following two steps until $x^{(r)}$ converges:

1. Let $p_k = \dfrac{\pi_k f_k(x^{(r)})}{f(x^{(r)})}$, $k = 1, \dots, K$.

2. Update $x^{(r+1)} = \arg\max_{x} \sum_{k=1}^{K} p_k \log f_k(x)$.

Details of convergence of the MEM approach can be found in [

In the special case where the components are Gaussian densities with a common covariance matrix, $f_k(x) = \phi(x \mid \mu_k, \Sigma)$, the maximization in Step 2 has the closed-form solution

$x^{(r+1)} = \sum_{k=1}^{K} p_k \mu_k,$

allowing us to avoid the numerical optimization of Step 2.
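To make the iteration concrete, here is a minimal Python sketch of the MEM ascent for a one-dimensional Gaussian kernel density estimate with equal weights $1/n$, where the update reduces to a weighted mean of the data. All names are hypothetical illustrations; Modalclust itself is implemented in R.

```python
import math

def mem_ascend(x0, data, sigma, tol=1e-8, max_iter=500):
    """Climb from x0 to a local mode of a 1-D Gaussian KDE via the MEM
    fixed-point update x_{r+1} = sum_i p_i * x_i, where p_i is the
    posterior weight of kernel i at the current iterate (sketch only)."""
    x = x0
    for _ in range(max_iter):
        # Unnormalized posterior weight of each kernel component at x.
        w = [math.exp(-0.5 * ((x - xi) / sigma) ** 2) for xi in data]
        x_new = sum(wi * xi for wi, xi in zip(w, data)) / sum(w)
        if abs(x_new - x) < tol:
            break
        x = x_new
    return x

# Two well-separated groups: every point climbs to its own group's mode.
data = [0.0, 0.1, -0.1, 5.0, 5.1, 4.9]
modes = {round(mem_ascend(x, data, sigma=0.5), 2) for x in data}
print(modes)  # two distinct modes, near 0.0 and 5.0
```

Each data point climbs to the mode of its own basin of attraction, which is exactly the grouping that mode association clustering uses.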

Now we present the HMAC algorithm. First we scale the data and use a kernel density estimator with a normal kernel to estimate the density of the data,

$f(x) = \frac{1}{n} \sum_{i=1}^{n} \phi(x \mid x_i, \sigma^2 I). \qquad (1.1)$

The variance of the kernel, $\sigma^2$, acts as the smoothing parameter and determines the resolution at which clusters are identified.

1. Given a set of data $\{x_1, \dots, x_n\}$, form the kernel density estimate $f(x)$ as in (1.1).

2. Use each data point $x_i$ as the initial value in the MEM algorithm and ascend to a local mode $M(x_i)$ of $f$.

3. Extract distinctive values from the set $\{M(x_1), \dots, M(x_n)\}$, merging modes that coincide up to numerical precision, to form the set of distinct modes $G$.

4. If $M(x_i) = M(x_j)$, assign $x_i$ and $x_j$ to the same cluster; each distinct mode in $G$ thus defines one cluster.

We note that when the bandwidth $\sigma$ is very small every observation forms its own mode, whereas for a sufficiently large $\sigma$ the density has a single mode containing the entire data set; the interesting clusterings arise at intermediate bandwidths.

1. Start with the data $x_1, \dots, x_n$ as the initial set of cluster representatives, $G^0 = \{x_1, \dots, x_n\}$, and an increasing sequence of bandwidths $\sigma_1 < \sigma_2 < \cdots < \sigma_L$.

2. Set the level $l = 1$.

3. Form kernel density as in (1.1) using bandwidth $\sigma_l$.

4. Cluster the elements in $G^{l-1}$ by the MAC algorithm with respect to this density; the distinct modes found form $G^l$.

5. If two elements of $G^{l-1}$ ascend to the same mode, the clusters they represent are merged at level $l$.

6. Stop if $G^l$ contains a single mode or $l = L$; otherwise increase $l$ by one and return to Step 3.
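Under the same one-dimensional Gaussian-KDE assumptions as before, the hierarchical loop can be sketched as follows; rounding stands in for merging numerically identical modes, and all names are hypothetical illustrations rather than the package's internals.

```python
import math

def climb(x0, data, sigma, tol=1e-8, max_iter=500):
    """MEM fixed-point ascent to a mode of a 1-D Gaussian KDE (sketch)."""
    x = x0
    for _ in range(max_iter):
        w = [math.exp(-0.5 * ((x - xi) / sigma) ** 2) for xi in data]
        x_new = sum(wi * xi for wi, xi in zip(w, data)) / sum(w)
        if abs(x_new - x) < tol:
            break
        x = x_new
    return round(x, 3)  # rounding merges numerically identical modes

def hmac(data, sigmas):
    """At each level, climb from the previous level's modes using the
    next (larger) bandwidth; clusters merge when their modes coincide."""
    reps = sorted(set(data))              # level 0: every point is a mode
    levels = []
    for s in sorted(sigmas):
        reps = sorted({climb(r, data, s) for r in reps})
        levels.append(reps)
    return levels

data = [0.0, 0.2, 0.9, 1.1, 5.0, 5.2]
levels = hmac(data, [0.3, 1.0, 3.0])
print([len(l) for l in levels])  # cluster counts shrink as sigma grows
```

Because each level starts from the previous level's modes, the number of clusters is non-increasing across levels, which is what produces the hierarchical tree.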

In this section we develop the method of parallel computing of HMAC (PHMAC) and its application, together with comparisons of the performance of the parallel and serial approaches. The MAC approach is computationally expensive when the number of objects

Step 1. Sphering transform the data

Step 2. Let

Step 3. Perform HMAC on each of these subsets at the lowest resolution, i.e., using

Step 4. Pool the modes from each subset of data to form

Step 5. Perform HMAC starting from Step 2 and obtain the final hierarchical clustering.

Step 6. Transform

Modes have a natural hierarchy, and it is computationally easy to merge modes from different partitions. In practice, we need to decide how to form the partitions and how many partitions to use. In this section, we provide some guidelines regarding these choices, without exploring their quality in detail. In the absence of any other knowledge, one should partition the data randomly. Other choices include partitioning the data based on certain coordinates that form a natural clustering, and then taking products of a few of those coordinates to build the overall partition. This strategy may increase computational speed by restricting the modes to a relatively homogeneous set of observations. Another option is to sample the data and build partitions based on the modes of the sampled data.

The PHMAC we propose uses parallel computing at the first level of HMAC and serial computing from the second level onwards. Therefore, the number of partitions that minimizes the computational time is a complex function of the number of available processors, the number of observations and the bandwidth of the KDE. Using too many partitions may speed up the first step, but risks producing too many modes for the next level, where hill climbing is performed from the collection of modes from each partition with respect to the overall density. In contrast, for a large

We compare the computing speed of parallel versus serial clustering using 1, 2, 4, 8 and 12 processor cores. Tests were performed on a 64-bit machine with four Quad-Core AMD 8384 processors (2.7 GHz per core) and 16 GB RAM, running Linux CentOS 5 and R version 2.11.0.

Computing times of PHMAC for different data sizes (columns give the number of processors):

| Data size | Dimension | 1 | 2 | 4 | 8 | 12 |
|---|---|---|---|---|---|---|
| n = 2000 | d = 2 | 56.58 | 17.01 | 7.84 | 6.91 | 8.02 |
| n = 2000 | d = 20 | 323.16 | 128.13 | 112.42 | 190.11 | 250.22 |
| n = 2000 | d = 40 | 730.18 | 560.16 | 687.79 | 764.29 | 753.36 |
| n = 10,000 | d = 2 | 3849.83 | 871.33 | 276.88 | 145.61 | 131.22 |
| n = 10,000 | d = 20 | 8410.96 | 1694.82 | 585.33 | 536.32 | 459.88 |
| n = 50,000 | d = 2 | 210295.29 | 71152.82 | 23383.61 | 11959.24 | 4875.64 |

maximum 12 processors. For

The R package Modalclust was created to implement HMAC and PHMAC. It also provides plotting tools that give the user a comprehensive visualization and understanding of the clustering results. Sources, binaries and documentation of Modalclust are available for download from the Comprehensive R Archive Network (http://cran.r-project.org/) under the GNU Public License.

In this section, we demonstrate the usage of the functions and plotting tools that are available in the Modalclust package.

First, we provide an example of performing modal clustering to extract the subpopulations in the logcta20 data. The description of the dataset is given in the package. The scatter plot, along with its smooth density, is provided in

R > install.packages("Modalclust")

R > library("Modalclust")

Using the following commands, we can run the standard (serial) HMAC and the parallel HMAC using two processors on the logcta20 data.

R > logcta20.hmac <- phmac(logcta20, npart=1, parallel=FALSE)

R > logcta20p2.hmac <- phmac(logcta20, npart=2, parallel=TRUE)

The results of both implementations are given in

By default, the function selects an interesting range of ten smoothing parameter values,

R > logcta20.hmac$sigma

[

which are chosen using the spectral degrees of freedom criterion introduced in [

R > logcta20.hmac$level

[

R > logcta20.hmac$n.cluster

[

The user can also supply smoothing levels through the sigmaselect option of phmac. There is also the option of starting the algorithm from user-defined modes instead of the original data points. This option comes in handy if the user wishes to merge clusters obtained from other clustering methods, e.g., EM clustering or k-means.

There are several plotting functions in Modalclust that can be used to visualize the output of phmac. The plotting functions are defined for objects of class hmac, the default class of phmac output. These plot functions are illustrated through a data set named disc2d, which has 400 observations displaying the shape of two half discs. The scatter plot of disc2d, along with its contour plot, is given in

First, we introduce the standard plot function for an object of class “hmac”. This unique and informative plot shows the hierarchical tree obtained from modal clustering. It can be obtained by

R > data("disc2d.hmac")

R > plot(disc2d.hmac)

The dendrogram obtained from the disc2d data is given in

The plot function offers several other options, including starting the tree from a specific level, drawing the tree only up to a desired number of clusters, and comparing the clustering results with user-defined clusters.

There are some other plotting functions that are designed mainly for visualizing clustering results for two dimensional data, although one can provide multivariate extensions of the functions by considering all possible pairwise dimensions. One can obtain the hard clustering of the data for each level using the command

R > hard.hmac(disc2d.hmac)

Alternatively, the user can specify the hierarchical level or the number of desired clusters, and obtain the corresponding cluster membership (hard clustering) of the data. For example, the plot in

R > hard.hmac(disc2d.hmac, n.cluster=2)

R > hard.hmac(disc2d.hmac, level=3)

Another function, which allows the user to visualize the soft clustering of the data, is based on the posterior probabilities of each observation belonging to the clusters at a specified level. For example, the plot in

R > soft.hmac(disc2d.hmac, n.cluster=3)

The plot enables us to visualize the probabilistic clustering of the three-cluster model. A user can specify a probability threshold for separating observations that clearly belong to a cluster from those that lie on the “boundary” of more than one cluster. Points having posterior probability below the user-specified boundlevel (default value 0.4) are flagged as boundary points and colored in gray. In
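The idea behind this soft classification can be sketched as follows, under the simplifying assumption that a cluster's posterior probability is the share of total kernel weight contributed by its member points; the exact computation in soft.hmac may differ, and all names here are hypothetical.

```python
import math

def soft_assign(x, clusters, sigma, boundlevel=0.4):
    """Posterior probability that point x belongs to each cluster,
    treating each cluster's members as the components of its density
    (a sketch of the idea behind soft.hmac, not the package's code)."""
    scores = [sum(math.exp(-0.5 * ((x - c) / sigma) ** 2) for c in pts)
              for pts in clusters]
    total = sum(scores)
    post = [s / total for s in scores]
    # A point whose best posterior falls below boundlevel is a boundary point.
    best = max(range(len(post)), key=lambda k: post[k])
    label = best if post[best] >= boundlevel else None
    return post, label

clusters = [[-0.2, 0.0, 0.2], [1.8, 2.0, 2.2]]
print(soft_assign(0.1, clusters, sigma=0.5)[1])   # clearly in cluster 0
# The midpoint splits its posterior evenly; with a stricter threshold
# it is flagged as a boundary point (label None).
print(soft_assign(1.0, clusters, sigma=0.5, boundlevel=0.6)[1])
```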

R > disc2d.2clust <- hard.hmac(disc2d.hmac, n.cluster=2, plot=FALSE)

R > disc2d.2clust.soft <- soft.hmac(disc2d.hmac, n.cluster=2, plot=FALSE)

Modalclust performs hierarchical model-based clustering allowing for arbitrary density shapes. Parallel computing can dramatically increase speed by splitting the data and running HMAC simultaneously on multiple cores. Plotting functions give the user a comprehensive visualization and understanding of the clustering results. One direction for future work is to further increase computing speed, especially for large data sets. As the discussion in Section 3 shows, parallel computing yields large speedups, but the gain depends on the available hardware; a user with few or no multi-core processors will still spend substantial computing resources clustering large data sets. One potential remedy is to first run k-means or another fast clustering technique and then run HMAC starting from the resulting cluster centers. For example, for a data set with 20,000 observations, we can first run k-means with, say, 200 centers, and then cluster those 200 centers with HMAC. Theoretically this is sub-optimal compared with running HMAC on all points, but in practice it greatly reduces the computing cost while still recovering the right clustering.
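The proposed k-means preprocessing can be sketched in a one-dimensional toy example: compress the data to a handful of centers with Lloyd's algorithm, then run the modal ascent only from those centers. All names are hypothetical; the real package would do this in R.

```python
import math
import random

def climb(x0, data, sigma, tol=1e-8, max_iter=500):
    # MEM fixed-point ascent to a mode of a 1-D Gaussian KDE (sketch).
    x = x0
    for _ in range(max_iter):
        w = [math.exp(-0.5 * ((x - xi) / sigma) ** 2) for xi in data]
        x_new = sum(wi * xi for wi, xi in zip(w, data)) / sum(w)
        if abs(x_new - x) < tol:
            break
        x = x_new
    return round(x, 2)

def kmeans_centers(data, k, iters=25, seed=0):
    # Plain Lloyd's algorithm: compress the data down to k centers.
    centers = random.Random(seed).sample(data, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in data:
            groups[min(range(k), key=lambda j: abs(x - centers[j]))].append(x)
        centers = [sum(g) / len(g) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers

rng = random.Random(1)
data = ([rng.gauss(0, 0.3) for _ in range(200)]
        + [rng.gauss(5, 0.3) for _ in range(200)])
centers = kmeans_centers(data, k=20)   # cheap compression: 400 -> 20 points
modes = sorted({climb(c, data, sigma=0.5) for c in centers})
print(len(modes))  # the 20 centers typically collapse onto the 2 true modes
```

Only the 20 centers perform the expensive hill climbing against the full density, yet they still collapse onto the underlying modes, which is the source of the practical savings described above.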

In addition, we are currently working on an implementation of modal clustering for online or streaming data, where the goal would be to update an existing cluster with the new data without storing all the original data points and allowing for creation of new clusters and merging of existing clusters.
