High Dimensional Cluster Analysis Using Path Lengths

A hierarchical scheme for clustering data is presented which applies to spaces with a high number of dimensions ($N_D > 3$). The data set is first reduced to a smaller set of partitions (multi-dimensional bins). Multiple clustering techniques are used, including spectral clustering; new techniques are also introduced based on the path length between partitions that are connected to one another. A Line-of-Sight algorithm is also developed for clustering. A test bank of 12 data sets with varying properties is used to expose the strengths and weaknesses of each technique. Finally, a robust clustering technique is discussed based on reaching a consensus among the multiple approaches, overcoming the weaknesses found individually.

Clustering is the task of organizing data into groups according to a criterion that is appropriate for the specific application under consideration.
The literature on clustering is extensive, and it is beyond the scope of this paper to provide an adequate review of the topic. The papers [1] [2] [3] [4] provide background on the clustering methods used in this paper, and the book [5] provides a broad overview of clustering methodologies as well as their numerical implementation.
There is no single algorithm that realizes all four of these aspects of data organization (reduction, identification, clustering and grouping). The approach pursued in this paper is to develop a hierarchical scheme leading to a cluster analysis that encompasses the issues raised above and adapts to high-dimensional spaces.
The data analysis scheme presented in this paper uses a blend of traditional data analysis via a multivariate histogram along with standard clustering techniques, such as k-means, k-medoids and spectral clustering. By binning the data onto a multi-dimensional grid, the data are partitioned into regions on the grid which may be connected or separated depending on the character of the data set. Data reduction is realized by retaining only bins that have a population above a user-selected threshold. The resulting multi-dimensional bins are referred to as partitions; the passage to partitions is the data reduction step.
Data identification is the process of assigning known data distributions (parents) to an entangled set of data. Typical examples are found in the literature of Bayesian analysis [6] [7]; however, this pursuit dates back further, to early attempts to understand how to distinguish data drawn from two or more distributions with overlapping tails. In more difficult scenarios, several distributions may overlap within their peak regions, changing the problem to the identification of subdomains of mixed versus non-mixed distributions.
Data clustering traditionally refers to assigning data to subsets based on the proximity of data to one another. The goals of the field of data clustering have expanded from this definition, taking on some of the other roles identified here.
For the purposes of this study, the term clustering will refer both to the overall techniques applied and, where appropriate, to the specific property a set has when its members are close to one another. In the broadest sense, a cluster is simply a label given to data to identify common features.
Data grouping is the process of assigning labels to data, without regard for proximity or parent distributions. An example might be to segregate a class of thirty 2nd-grade children into five subgroups before entering a museum for a tour.
How the larger group is broken apart is unimportant; what matters is merely that the larger group is distributed into smaller groups.
In this study, standard clustering techniques such as k-means, k-medoids and spectral clustering are applied, along with new path-based approaches. After data reduction, data within partitions may be connected in regions where a path length can be calculated along the grid of partitions between any two data. Several new clustering algorithms have been developed using the path length. This paper presents five new variations of approaches to data clustering:
1) Data reduction is achieved by segmenting the data set into partitions.
2) Data clustering is sought using path lengths as a distance metric.
3) Data clustering is achieved using a Line-of-Sight criterion.
4) Spectral clustering is sought using alternatives to the graph Laplacian and the eigenspace formed.
5) A consensus over multiple techniques determines how robustly each datum is assigned to a cluster.
When data consistently cluster in one arrangement across multiple analysis configurations, the data are assigned robustly to their cluster. To determine a robust clustering assignment, a polling technique is used to arrive at a consensus amongst the clustering algorithms. While any one technique has faults, the consensus of techniques overcomes any one failure mode, giving the best all-round identification [8].
This paper is organized as follows: Sections 2 and 3 define the basic component used in this study, the partition. Section 4 shows the calculation of several values used throughout the analysis. Section 5 discusses the Line-of-Sight criterion. Section 6 outlines the strategy taken for this study and lists the comprehensive set of arrays needed for the suite of algorithms; this section also introduces the test bank of data sets used for clustering. Section 7 presents each algorithm, with details left for the appendix. Section 8 shows the results for each clustering algorithm, discussing the strengths and weaknesses of each approach. Section 9 introduces the approach to robust clustering, employing multiple techniques, and how a consensus is reached. Section 10 concludes with suggestions for extending this suite of clustering techniques. Throughout this paper, matrices and vectors are shown in bold face, while components are given subscripts.

Reduction of Data to Partitions
In this study, data refer to collections of real values forming vectors residing in a data space of dimension $N_D$, with $N$ data in total. Along each dimension of the data space, the data are coarsely delineated into a set of bins, with $N_B$ bins per dimension. For each datum, the collection of bin indices forms a bin address vector, giving the unique location of a bin within the data space. Each bin is given a unique serial index, $k$, obtained by a mixed-radix serialization of its address components. Within each bin, multiple data may reside, where $w_k$ is the number of elements in each bin (its population).
The maximal value the single index, $k$, can take is the total number of possible bins in the data space, given by the product of the number of bins along each dimension, $N_B^{N_D}$. Even though a data set may be large, it typically occupies only a small fraction of these bins. Depending on how the data are distributed, the data will most likely reside in small groupings within the data space, leaving much of the domain sparse.
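To make the data reduction step concrete, the following sketch bins a data set onto a coarse multi-dimensional grid and retains only the bins whose population meets a user-selected threshold. This is a minimal illustration of the binning described above, not the authors' implementation; the names `n_bins` and `w_min` are illustrative.

```python
import numpy as np

def reduce_to_partitions(X, n_bins=10, w_min=5):
    """Bin data onto a coarse grid; keep bins with population >= w_min.

    X      : (N, N_D) array of data vectors
    n_bins : number of bins per dimension (N_B)
    w_min  : population threshold for retaining a bin
    Returns the integer bin-address vectors of the retained partitions
    and their populations w_k.
    """
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Bin-address vector for every datum (clip so the maxima land in the last bin).
    addr = np.clip(((X - lo) / (hi - lo) * n_bins).astype(int), 0, n_bins - 1)
    # Serialize each address vector to a unique single index k (mixed-radix encoding).
    strides = n_bins ** np.arange(X.shape[1] - 1, -1, -1)
    k = addr @ strides
    # Population of every occupied bin; thresholding is the data reduction step.
    uniq, counts = np.unique(k, return_counts=True)
    keep = counts >= w_min
    parts = np.stack(np.unravel_index(uniq[keep], (n_bins,) * X.shape[1]), axis=1)
    return parts, counts[keep]

# Example: 10,000 points in 3D reduce to a few hundred partitions.
rng = np.random.default_rng(0)
parts, w = reduce_to_partitions(rng.normal(size=(10_000, 3)), n_bins=12, w_min=4)
```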

Reduction of Partitions to Clusters
The data have been reduced to a set of bins retained for clustering. The bin index $k$ is mapped to a sequential list of indices, $k = 1 \ldots N_P$, where the $N_P$ bins under consideration will be referred to as partitions, with the vector of populations $\mathbf{w}$ addressed by $k$, and the partition data space denoted $\mathbb{P}$. All calculations for this study are performed on the partition data space, $\mathbb{P}$, which represents the integer-based grid of bin locations; the complementary data space consists of the empty or low-population partitions. Clusters are subsets of data grouped based on a common feature. Cluster algorithms use a criterion to delineate data, which are then gathered by some mechanism and assigned to clusters. Traditional definitions rely on proximity of data to one another, yet clustering can also be defined as a simple grouping of the data, which could be based alphabetically, by income, or on some property that is difficult to map numerically, such as an object's shape. Proximity alone can fail to cluster data appropriately when considering data distributed along the tails of distributions far from a centroid, such as a horseshoe. By altering the definition of "proximity" to include distance measures such as path length, clustering can still be viewed as a local grouping. This paper explores multiple clustering algorithms and later sorts the clustering assignments into groupings reached by consensus.

Intermediary Calculations
Several calculations common to multiple techniques require only the partition bin address vectors. These low-level calculations define geometrical features of how the partitions are related to one another. Calculations between two partitions form matrices indexed by $[k,\ell]$. Specific algorithms for each calculation can be found in the supplemental material online. The distances calculated fall into two broad categories: path lengths, where the distance measured is between partitions connected to one another, and global distances, where a connection is not required. Among path lengths, two further distinctions are made: stepwise, where the distance is the sum of values from one partition to the next, and pathwise, where the distance is the sum of values accumulated from the start of the path to the current partition for each step taken. From these values, the true path is found, which tests the LOS criterion.
The following calculations are performed before any paths are sought, as they do not require knowledge of the exact path found, merely the endpoints, which give the dimensions of the convex hull containing the two partitions $[k,\ell]$. These include $\mathbf{w}\mathbf{w}^{\mathsf{T}}$, the outer product of the population vector with itself, minus the weights along the diagonal to account for the self-weighting within a single partition. Further, the difference between the populations of two partitions is also needed, leading to the matrix $\Delta\mathbf{w}$.
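A short sketch of these endpoint-only quantities, under the reading that the diagonal of the outer product is reduced by the weights themselves (so a partition does not pair with itself at full strength); `parts` and `w` are the partition addresses and populations from the reduction step above.

```python
import numpy as np
from scipy.spatial.distance import cdist

def endpoint_matrices(parts, w):
    """Matrices indexed by [k, l] that need only the endpoints of a pair.

    parts : (N_P, N_D) integer partition addresses
    w     : (N_P,) partition populations
    """
    dR  = cdist(parts, parts)          # global Euclidean distance between partitions
    dw  = w[None, :] - w[:, None]      # population-difference matrix, Delta-w
    wwT = np.outer(w, w)               # outer product of the population vector
    np.fill_diagonal(wwT, w * w - w)   # remove the self-weighting on the diagonal
    return dR, dw, wwT
```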

Line-of-Sight (LOS) Criterion
A Line-of-Sight (LOS) criterion is introduced in this paper as a means to cluster data which gives additional significance to data while being independent of proximity. This approach assumes that data within a convex region of other data are likely to be associated together. When seeking the LOS criterion, the data space is divided into two broad subdomains: those partitions filled with sufficient data above threshold, and those partitions containing little or no data. A directed path distance is computed over this grid; this distance has the property that when traversing the grid from $k$ to $\ell$, the distance calculated differs from that of the return trip from $\ell$ to $k$. The asymmetry of this measure proves useful in determining the LOS condition. Figure 1 illustrates how the distance is asymmetric with regard to the path taken. Three conditions must be met if two partitions are LOS:
1) A path must exist between $[k,\ell]$ that does not exit the convex hull, requiring $L2 = L2_T$, the found path length matching the length of the true (direct) path.
2) The path found must take a direct route between $[k,\ell]$, never "turning a corner" around an excluded region.
3) The path found must follow the direct path to within the smallest resolvable deviation.
Dijkstra's algorithm finds the minimal path taken between two points on a grid given an adjacency matrix, NN1, giving the path length, L2.

In this way, the adjacency matrix presented to Dijkstra's algorithm is restricted to first-nearest-neighbor connections, NN1.
(Figure 1(b), Figure 1(c) caption: the top three rows represent the number of steps taken along each dimension from one partition to the next; the next three rows are the coordinates of the partitions along the paths taken, remembering that the path starts with the k-th partition at the origin; the final row is the summed L1 path distance. The axes are labeled $x_i$, ordered from the longest dimension of the convex hull to the shortest.)
When a found path does not remain along the same side of the parallelotope with respect to the true path, the path found "turns a corner" in order to reach the final partition; detecting this case leads to the second criterion. The last criterion uses the results from a second application of Dijkstra's algorithm, now attempting to find the path that minimizes the variance about the true path. By copying the k-th row of this matrix and multiplying it by every row of the logical matrix NN1 > 0, a new adjacency matrix is formed and applied using Dijkstra's algorithm for a third time. At each step, the minimal summed path variance gives the most direct path from $[k,\ell]$, finally giving the path that is LOS between the two partitions, illustrated in Figure 2. The smallest error that can exist is when a path is found that is one step off of the true path near the middle of the path; the difference between the two path measures in this case sets the tolerance, leading to the third criterion for LOS.
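A sketch of the path-length machinery behind the first LOS test, using Dijkstra's algorithm from `scipy` over an NN1 adjacency. Here NN1 is assumed to connect partitions whose addresses differ by at most one step along every dimension, with each step costed by its Euclidean length; this is one plausible reading of the construction above, not the authors' exact definition.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra
from scipy.spatial.distance import cdist

def l2_path_lengths(parts):
    """Summed step lengths (L2) between all connected partitions.

    NN1 connects partitions whose integer addresses differ by at most one
    step along every dimension (Chebyshev distance 1); each step is costed
    by its Euclidean length.
    """
    nn1 = cdist(parts, parts, metric="chebyshev") == 1.0
    steps = np.where(nn1, cdist(parts, parts), 0.0)
    L2 = dijkstra(csr_matrix(steps), directed=False)  # inf where not connected
    return nn1, L2

# Criterion 1 then compares L2[k, l] with the length L2_T of the true
# (direct) discretized path between the same endpoints: equality means the
# minimal path never had to leave the convex hull of the pair.
```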

Strategy
This study applies 26 different clustering techniques to a bank of 12 representative test cases. The data sets forming the test bank are composed of various shapes, both connected and disconnected, as well as point clouds in both 2D and 3D. In each of the point clouds, four Gaussian distributions were placed near one another, with three densely populated regions and a fourth low-density Gaussian which spans the domain. The point clouds were further varied by creating one case each in 2D and 3D where the dense Gaussians are clearly separated, and another two cases, in 2D and 3D, where the three Gaussians overlap. Figure 3 illustrates the test bank, in this order: L, Plus1, Plus2, Concentric1, Concentric2, Flame1, Flame2, Flame3, Data2D-1, Data2D-2, Data3D-1, Data3D-2. Table 2 lists the test bank as well as the features each case is meant to examine. The first test is the simple L discussed in Section 1. The Plus1 and Plus2 cases are extensions of the L case where symmetry is employed, testing how algorithms respond to symmetry as well as to an open region (Plus2). Concentric1 and Concentric2 test how the routines respond to curved domains with symmetry, and whether the domain is connected or not. Flame1, Flame2 and Flame3 test how asymmetry is dealt with, as well as connected versus disconnected regions.
Flame3 also tests how well "tendrils" or filamentary data are handled. As tests of a 2D point cloud, Data2D-1 and Data2D-2 test how well four Gaussian point clouds can be clustered for the case of three separated clusters (Data2D-1) and three close-by clusters (Data2D-2), where the fourth Gaussian is evenly distributed across the domain, simulating noise present in the data. Data3D-1 and Data3D-2 repeat these two tests in three dimensions.

Clustering Algorithms
This section discusses the clustering algorithms used in this paper; Table 3 summarizes their features.
Table 3. Clustering techniques for the 26 algorithms, highlighting requirements, pros and cons in each case. Some algorithms require partitions to be connected in order to search for clustering within the connected region. Clusters can be feature-driven, or can even out the distribution of cluster assignments (balanced grouping). LOS is a criterion for some clustering techniques, which in turn can help identify data distributions. Finally, some algorithms treat isolated partitions on equal footing with larger connected subsets, making the clustering sensitive to these smaller subsets, interpreted as noise. Checks indicate a feature is used, "X" indicates the feature is not required, "-" indicates the parameter is not applicable to the technique, and "*" indicates that population weighting could optionally be applied to the technique; for the results shown in this study, weights were applied to the spectral algorithms, making the Laplacian sensitive to the populations of the partitions.

K-Means and K-Medoids Clustering-KMEANS, KMEDOIDS
K-means is a well-established clustering technique [10] [11], seeking, from a data set, the lowest possible distance from individual data to a set of candidate mean positions of the data, indicative of clusters. Over several passes, the cluster definitions are altered to minimize the distance from each datum to the clusters found. An initial guess of the number of clusters to seek is required. K-means has been discussed thoroughly in the community for its strengths and weaknesses [1].
K-medoids has been proposed to overcome many of the shortcomings of k-means and is similarly well established in the community [5]. In both cases, an initial guess of the number of clusters must be supplied.
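Because the analysis runs on partitions rather than raw data, k-means can be applied to the partition grid locations with the populations as sample weights. A minimal sketch using scikit-learn, assuming a population-weighted variant; the choice of `n_clusters` is the required initial guess.

```python
from sklearn.cluster import KMeans

def kmeans_partitions(parts, w, n_clusters=4, seed=0):
    """KMEANS on partition locations, weighting each partition by w_k."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = km.fit_predict(parts.astype(float), sample_weight=w)
    return labels, km.cluster_centers_
```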

Maxima Clustering-Global and Path Length
In this study, the data have been reduced to a set of partitions, each with an assigned population. The two schemes, MAXGLOB and MAXPATHL, assign data to clusters based on how close a partition is to a significant nearby maximum among the partitions. Treating the weights of the partitions as the heights of a multi-dimensional map, the significance of a nearby maximum is determined by calculating the slopes between any two partitions, where the slope is the ratio of the weight difference, $\Delta w$, to the distance between the two partitions. In the global case, the distance used is the Euclidean distance, $\Delta R$, and for the path-length case, the distance used is the path length, L2. MAXGLOB assigns clusters using the global slopes, while MAXPATHL performs the same assignment along connected paths.
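A sketch of the uphill assignment under these definitions: each partition hops to the partition offering the steepest positive slope and follows such hops to a maximum, whose index becomes the cluster ID. The significance test for maxima mentioned above is omitted here; substituting the path-length matrix L2 for `dR` gives the MAXPATHL variant.

```python
import numpy as np

def maxima_clusters(w, dR):
    """MAXGLOB-style sketch: follow the steepest positive slope to a maximum.

    slope[k, l] = (w_l - w_k) / dR[k, l]; partitions at a local maximum
    point to themselves and become the cluster labels.
    """
    n = len(w)
    with np.errstate(divide="ignore", invalid="ignore"):
        slope = (w[None, :] - w[:, None]) / dR
    np.fill_diagonal(slope, -np.inf)
    step = slope.argmax(axis=1)
    step = np.where(slope.max(axis=1) > 0, step, np.arange(n))  # maxima are fixed points
    for _ in range(n):                      # follow hops until every chain stabilizes
        nxt = step[step]
        if np.array_equal(nxt, step):
            break
        step = nxt
    return step                             # cluster ID = index of the maximum reached
```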

Clustering via Connection-CONN
In cases where local clusters of partitions are sparsely found within the data space, a simple clustering algorithm is to determine which partitions are connected to one another using first-nearest-neighbor steps, NN1. Section 4 discusses path lengths calculated from one partition to another, where those with a finite value are connected. A logical value is set between any two connected partitions, creating the matrix CONN. A unique cluster ID is assigned for each connected set of partitions.
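Under the NN1 adjacency defined earlier, CONN amounts to a connected-components labeling; a minimal sketch with `scipy`:

```python
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def conn_clusters(nn1):
    """CONN: one cluster ID per connected set of partitions under NN1."""
    n_clusters, labels = connected_components(csr_matrix(nn1), directed=False)
    return n_clusters, labels
```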

LOS Clustering with Mutual Visibility-LOS-MUTUAL
The LOS visibility matrix can alternatively be used to cluster partitions with the highest mutual visibility (LOS-MUTUAL) by selecting clusters with the most common visibility. In the example shown, the maximal visibility between partitions is ≈4100; LOS-MAXVIS begins searching at the highest-visibility bin and ends at the first minimum found in the histogram, giving a cluster with visibility from ≈3300 to 4100. A LOS-MUTUAL search seeks a cluster with the greatest mutual visibility by beginning the search at the tallest peak, around ≈2200, whose set size is ≈150 partitions. The cluster is formed between the two minima found on each side of the tallest peak. In both cases, the starting point for the cluster search defines which other partitions are near to the goal, either maximal visibility or greatest mutual visibility. The clusters are formed by grouping LOS partitions around the feature sought in the histogram, where peaks are separated by basins. In this case, all partitions with a visibility between ≈1700 and ≈3000 are included in the first cluster found. As before, once a cluster is found, its partitions are removed from further searches. Clusters formed in this manner find full data distributions first, associating tails over mixed regions with the largest distributions first, giving an alternative to the data identification offered by LOS-MAXVIS.
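A sketch of the histogram bookkeeping described above, for the LOS-MAXVIS-style seed: `vis[k]` counts the partitions LOS-visible from partition k (a row sum of the LOS matrix), and the first cluster collects everything from the maximal visibility down to the first basin of the histogram. The bin count and basin test are illustrative choices, not the authors'.

```python
import numpy as np

def first_cluster_by_visibility(vis, bins=50):
    """Group partitions between the top of the visibility histogram and the
    first basin (local minimum) encountered walking down from the maximum."""
    counts, edges = np.histogram(vis, bins=bins)
    basins = [i for i in range(1, bins - 1)
              if counts[i] <= counts[i - 1] and counts[i] < counts[i + 1]]
    cut = edges[basins[-1]] if basins else edges[0]   # highest-visibility basin
    members = np.flatnonzero(vis >= cut)              # cluster found on this pass
    return members, cut
# Remaining partitions (vis < cut) are removed and the search repeats.
```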

Spectral Clustering
Spectral clustering [14] [15] represents data as a graph, where data become vertices and pairwise affinities become weighted edges; clusters are then found from the eigenvectors of a graph Laplacian constructed from the affinity matrix.
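A compact sketch of one population-weighted variant: a radial-basis affinity over partition locations is scaled by the outer product of the populations (one reading of the weighting noted in Table 3), and k-means is run on the leading eigenvectors of the symmetric normalized Laplacian. The values of `sigma` and the eigenvector range are illustrative choices, not the authors' settings.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_partitions(parts, w, n_clusters=4, sigma=2.0):
    """Population-weighted spectral clustering sketch on the partition grid."""
    A = np.exp(-cdist(parts, parts) ** 2 / (2 * sigma ** 2)) * np.outer(w, w)
    np.fill_diagonal(A, 0.0)                 # no self-affinity
    d = A.sum(axis=1) + 1e-12                # degrees (guard against isolated rows)
    L = np.eye(len(d)) - A / np.sqrt(np.outer(d, d))   # L_sym = I - D^-1/2 A D^-1/2
    _, vecs = eigh(L)                        # eigenvalues ascending
    U = vecs[:, 1:n_clusters + 1]            # skip the trivial constant mode
    U /= np.linalg.norm(U, axis=1, keepdims=True) + 1e-12
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(U)
```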

Clustering by Coarse Position (LMH-POS)
The most obvious form of clustering is to associate a partition solely by its position (LMH-POS), using a coarse binning within the partition space. By setting the number of bins along each dimension to three, the bins are interpreted as being Low, Medium or High for the values represented along each axis. In this case, the sequential partition bin index, k, becomes the cluster ID, with the maximum number of possible clusters at $3^{N_D}$ for the three bins along each axis.
This approach is a coarse designation for clustering as it employs no complicated algorithms, and data with similar values are associated irrespective of all other factors. This approach suffers from many problems in that data in one bin will not be clustered with data from a neighboring bin no matter how close in proximity the two are to one another. Clusters from LMH-POS characterize data in the crudest sense with no refinement for the shape of a distribution or even the relative sizes of the distribution. One advantage to this approach is that it is easy to understand, even while spanning multiple dimensions, making it an easy entry point for a discussion of the data. When handling large data sets, this approach allows for a quick look at where the data reside within the larger space.
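A sketch of LMH-POS for partitions already binned at resolution `n_bins` per axis: each axis is cut into three equal ranges and the 3-ary address is serialized into the cluster ID.

```python
import numpy as np

def lmh_pos(parts, n_bins):
    """LMH-POS: coarse Low/Medium/High binning of partition positions.

    parts holds integer addresses in [0, n_bins); the result is a cluster
    ID in [0, 3**N_D).
    """
    lmh = (parts * 3) // n_bins                       # 0 = Low, 1 = Medium, 2 = High
    return lmh @ (3 ** np.arange(parts.shape[1]))     # serialize the 3-ary address
```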

Results
This section shows a sampling of results from the application of the 26 techniques to the 12 test cases; the strengths and weaknesses of these techniques are exposed through the cases presented.

Discussion of Techniques
Figure 11 shows clustering for the Data3D-2 case, where little symmetry is present in a clean environment at high bin resolution and two distributions overlap. K-medoids (11a) shows 16 clusters found in three main ellipsoids, with k-means (11b) giving similar yet distinct results. CONN (11c) clusters partitions connected to one another, which in a clean environment finds three ellipsoidal distributions; however, some partitions may have been cut off from the main ellipsoids due to the higher data threshold imposed, leading to singleton clusters. MAXPATHL (11d) shows results similar to CONN; however, additional clusters are found due to local maxima in the weighted partitions, leading to one cluster which follows the contour of the merged ellipsoids. LOS-MAXVIS (11e) clusters by maximal visibility first, finding the intersection as a cluster first, followed by clusters based on lesser visibility. With connection-based techniques, the possibility exists that some larger subsets of the data will not be clustered as expected, leading to smaller subsets assigned to clusters otherwise seen as noise (singletons). The radial-basis spectral techniques do not suffer from this confusion, as they do not require a connection between the partitions in order to form clusters, allowing the approach to be sensitive to the larger structure within the domain, creating clusters around centroids within the data.

Robust Clustering over Multiple Algorithms
In this paper, multiple clustering algorithms have been presented and applied to several test cases. Each technique has strengths as well as weaknesses which have been exposed through the cases presented. When using multiple techniques, the possibility exists to leverage the information gathered from all techniques to arrive at a final cluster designation, based on the level of agreement or disagreement found between the algorithms [8]. This approach is comparable to ensemble modeling used in various fields [16] [17]. This section proposes four possible robust ways to gather the cluster information and assign new cluster IDs.
In each approach taken, the cluster information for the partitions is represented by a matrix of cluster IDs, where each row represents the results from a single cluster algorithm and each column is a partition. The values along each row are the cluster IDs assigned to each partition, forming a $C \times P$ matrix, where $C = 26$ and $P$ is the number of partitions. In order to find agreement or disagreement between cluster IDs across many techniques, the columns are sorted so that the cluster IDs appear in ascending order along the first row; for any repeated values in the first row, the columns are further sorted by the next row, continuing until all repeated values are addressed.
(Table 4. Cluster IDs assigned to each partition, one row per algorithm (rows Alg 2 through Alg 6 shown in the excerpt), illustrating the sorted ID matrix used to detect changes across techniques.)
As examples of robust clustering, the last four panels of Figure 12, as well as Table 4, illustrate the process. These show the results of a consensus using all clustering techniques except the LMH-POS algorithm for the Data2D-2 test case, with no minimal population set for the partitions. The LMH-POS technique was excluded because its partition definitions do not align with those of the remaining 25 algorithms. In cases where multiple techniques are compared using differing partition sizes, the robust technique is applied per datum using the same procedures; the sorting, however, is performed over all data instead of partitions. The last four panels of Figure 12 show the following consensus techniques, from left to right: Fractured, Majority Changed (75%), All Changed (100%) and No Overlap. The Fractured robust designation assigns each partition a new cluster ID, starting from one and increasing the cluster ID each time any technique changes its ID, which results in the largest set of clusters found; this approach is the most sensitive to changes in the cluster designations. The Majority Changed robust technique assigns a new cluster ID each time the accumulated number of per-algorithm cluster-ID changes reaches a majority of the total number of algorithms. For each clustering technique, when a change occurs, any further changes from that technique are not registered until a majority is reached, at which point the accumulated sum of changes is reset to zero. This results in a medium-sized set of clusters, where a significant number of algorithms found a change but not all algorithms are required to note the change in ID. In the figure, a 75% majority was required; ideally, the best majority threshold would create the largest number of clusters with the highest average membership. The All Changed robust case is equivalent to the Majority Changed case with a 100% majority threshold. This results in a small-to-medium-sized set of clusters, where every algorithm found a change; the changes need not have occurred at the same partition, merely that the total set of changes across all algorithms eventually required a change of ID.
The No Overlap robust case assigns a new cluster ID only when all algorithms change designation simultaneously, resulting in the smallest set of clusters found: every algorithm must find a change for all partitions in a subset. Ideally, this would happen for each disconnected group of partitions; however, several techniques are "global" in scope and do not require a connection to exist to form clusters, leading to a single large cluster.
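A sketch of the sorting-and-change-counting bookkeeping described above; the `threshold` parameter interpolates between the variants (a value near 1/C approximates the Fractured case, 0.75 gives Majority Changed, 1.0 gives All Changed). The per-algorithm change flags mirror the rule that a technique's repeated changes are not re-counted until the accumulator resets.

```python
import numpy as np

def robust_ids(ids, threshold=0.75):
    """Consensus over a C x P matrix of cluster IDs (one row per algorithm).

    Columns (partitions) are sorted lexicographically; a new robust cluster
    starts when the fraction of algorithms that have changed ID since the
    last reset reaches `threshold`.
    """
    C, P = ids.shape
    order = np.lexsort(ids[::-1])              # first row is the primary sort key
    s = ids[:, order]
    labels = np.zeros(P, dtype=int)
    changed = np.zeros(C, dtype=bool)          # one accumulated flag per algorithm
    cur = 0
    for j in range(1, P):
        changed |= s[:, j] != s[:, j - 1]      # register first change per algorithm
        if changed.mean() >= threshold:
            cur += 1                           # enough algorithms moved: new cluster
            changed[:] = False                 # reset the accumulator
        labels[j] = cur
    out = np.empty(P, dtype=int)
    out[order] = labels                        # restore original partition order
    return out
```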

Conclusions
A study using 26 clustering techniques has been performed over 12 test cases to illustrate both the strengths and weaknesses of clustering algorithms. A robust form of clustering is achieved through consensus over all techniques, helping reduce clustering problems by finding consistent clustering definitions across many approaches. The approach taken by this study utilizes six main ideas to produce a robust clustering analysis:
• Reduce a large data set by binning the space, where the filled bins are the multi-dimensional partitions of the data set, each with a unique serial index, k.
• Algorithms use the path length between any connected partitions as well as traditional distance metrics (L1, L2, etc.).
• A Line-of-Sight (LOS) algorithm is developed to enhance the probability that two data are associated with one another. LOS also provides a new "super" neighborhood definition to be used in graph-based techniques. Data identification is addressed in two differing ways by LOS.
• Spectral clustering using the 2nd and 3rd eigenvectors addresses data grouping better than other methods.
This study shows that high-dimensional, big-data analysis can be reduced to a smaller set of partitions where multiple clustering techniques can be used to sort the data into clusters. While the techniques presented all scale as $O(N_P^2)$, by reducing the data set to partitions these routines are reasonable to perform. The introduction of the LOS criterion created new avenues for cluster seeking. The combination of multiple clustering techniques, various distance metrics and traditional data reduction leads to a robust set of clusters, which worked well in addressing issues of data identification, clustering, as well as grouping.