
A hierarchical scheme for clustering data is presented which applies to spaces with a high number of dimensions. The data set is first reduced to a smaller set of partitions (multi-dimensional bins). Multiple clustering techniques are used, including spectral clustering; however, new techniques are also introduced based on the path length between partitions that are connected to one another. A Line-of-Sight algorithm is also developed for clustering. A test bank of 12 data sets with varying properties is used to expose the strengths and weaknesses of each technique. Finally, a robust clustering technique is discussed based on reaching a consensus among the multiple approaches, overcoming the weaknesses found individually.

Clustering is a fundamental technique and methodology in data analysis and machine learning. The explosion of the field of data science has, consequently, led to an expansion in how this notion is applied. In this respect, it would be more appropriate to refer to clustering as data organization, which would encompass the ideas of 1) data reduction, 2) data identification, 3) data clustering, and 4) data grouping.

Data reduction is the process of converting raw data into a form that is more amenable for the application of a specific analytical and/or computational methodology. Data identification is the process of analysing trends or distributions within the data. Data clustering is the process of associating data through proximity, similarity, or dissimilarity. Data grouping refers to breaking down data into groups according to a criterion that is appropriate for the specific application under consideration.

The literature on clustering is extensive and it is beyond the scope of this paper to provide an adequate review of this topic. The following papers [

There is no single algorithm that realizes all four of these aspects of data organization. The approach to this problem pursued in this paper is to develop a hierarchical scheme leading to a cluster analysis that encompasses the issues raised above and adapts to high dimensional spaces.

The data analysis scheme presented in this paper uses a blend of traditional data analysis via a multivariate histogram along with standard clustering techniques, such as k-means, k-medoids and spectral clustering. By binning the data onto a multi-dimensional grid, data is partitioned into regions on the grid which may be connected or separated depending on the character of the data set. Data reduction is realized by only retaining bins that have a population above a user selected threshold. The resulting multidimensional bins are referred to as partitions. The passage to partitions is the data reduction step.

Data identification is the process of assigning known data distributions (parent) to an entangled set of data. Typical examples are found in the literature of Bayesian analysis [

Data clustering traditionally refers to assigning data to subsets based on the proximity of data to one another. The goals of the field of data clustering have expanded from this definition, taking on some of the other roles identified here. For the purposes of this study, the term clustering will refer to both the overall techniques applied as well as the specific property a set has when its members are close to one another when appropriate. In the broadest sense, a cluster is simply a label given to data to identify common features.

Data grouping is the process of assigning labels to data, without regard for proximity or parent distributions. An example might be to segregate a class of thirty 2^{nd} grade children into five subgroups before entering a museum for a tour. How the larger group is broken apart is unimportant, merely that the larger group is distributed into smaller groups.

In this study, standard clustering techniques are applied such as k-means, k-medoids and spectral clustering, along with new path-based approaches. After data reduction, data within partitions may be connected in regions where a path length can be calculated along the grid of partitions between any two data. Several new clustering algorithms have been developed using the path length. Further, if two partitions are visible to each other by a Line-of-Sight criterion, the relationship between them is given additional significance. These ideas are used, in conjunction with standard clustering techniques, to construct 26 different clustering algorithms.

This paper presents five new variations of approaches to data clustering:

1) Data reduction is achieved by segmenting the data set into partitions.

2) Data clustering is sought using path lengths as a distance metric.

3) Data clustering is achieved using a Line-of-Sight criterion.

4) Spectral clustering is sought using alternatives to the graph Laplacian and the eigenspace formed.

5) Final cluster assignment is accomplished using a consensus among multiple clustering techniques.

An analysis configuration is the set of choices made that determines how a study is performed. The three most important choices are: which of the 26 available clustering techniques to use; which variables describe the data, where each variable is a dimension in the data space; and the number of bins chosen along each dimension. Changes to the resolution with which the data space is partitioned may lead to changes in a datum’s cluster assignment. For each choice of clustering technique, variables used (dimensions), and resolution (binning), each datum is assigned to a cluster. When data consistently cluster in one arrangement across multiple analysis configurations, the data are assigned robustly to their clusters. To determine a robust clustering assignment, a polling technique is used to arrive at a consensus amongst the clustering algorithms. While any one technique has faults, the consensus of techniques overcomes any one failure mode, giving the best all-round identification [
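Section 9 details the consensus scheme used in this paper; purely as an illustrative sketch, one common way to poll clusterings whose label numbers do not align across algorithms is a co-association matrix. The function names, the `quorum` parameter, and the greedy gathering below are assumptions for illustration, not the paper's algorithm:

```python
import numpy as np

def co_association(labelings):
    """Fraction of clusterings in which each pair of data share a cluster.

    labelings: (n_algorithms, n_data) integer cluster IDs; label values
    need not agree across algorithms, only co-membership matters.
    """
    labelings = np.asarray(labelings)
    n_alg, n = labelings.shape
    C = np.zeros((n, n))
    for lab in labelings:
        C += (lab[:, None] == lab[None, :])
    return C / n_alg

def consensus_clusters(labelings, quorum=0.5):
    """Greedily group data whose pairwise co-association meets the quorum."""
    C = co_association(labelings)
    n = C.shape[1]
    cluster = -np.ones(n, dtype=int)
    next_id = 0
    for i in range(n):
        if cluster[i] < 0:
            members = np.flatnonzero((C[i] >= quorum) & (cluster < 0))
            cluster[members] = next_id
            next_id += 1
    return cluster
```

Raising the quorum corresponds to demanding a stronger degree of consensus before two data are kept in the same cluster.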

This paper is organized as follows: Sections 2 and 3 define the basic component used in this study, the partition. Section 4 shows the calculations of several values used throughout the analysis. Section 5 discusses a Line-of-Sight criterion. Section 6 outlines the strategy taken for this study and it lists the comprehensive set of arrays calculated that are needed for the suite of algorithms. This section also introduces a test-bank of data sets used for clustering. Section 7 presents each algorithm, with details left for the appendix. Section 8 shows the results for each clustering algorithm, discussing the strengths and weaknesses of each approach. Section 9 introduces the approach to robust clustering, employing multiple techniques and how a consensus is reached. Section 10 concludes with suggestions for extending this suite of clustering techniques. Throughout this paper, matrices and vectors are shown in bold face, while components are given subscripts.

In this study, data refer to a collection of real-valued vectors, $\mathbf{x} \in \mathbb{R}^{N_D}$, residing in a data space of dimension $N_D$, whose elements total $N$. Along each dimension of the data space, the data are coarsely delineated into a set of bins, $\{ b_i \in 1 \cdots N_{B,i} \}$, where $i = 1 \cdots N_D$ and $N_{B,i}$ is the number of bins along dimension $i$. For each datum, the collection of bin indices forms a bin address vector, $\mathbf{b} \in \mathbb{N}^{N_D}$, giving the unique location of a bin within the data space. Each bin is given a unique index, $\tilde{k}$, serialized by the expression given below. Within each bin, multiple data may reside, where $w_{\tilde{k}}$ is the number of elements in the bin (its population).

$$\tilde{k} = \sum_{i=1}^{N_D} \left[ (b_i - 1) \prod_{q=0}^{i-1} N_{B,q} \right] + 1, \quad \text{where } N_{B,0} = 1. \tag{1}$$

The maximal value the single index $\tilde{k}$ can take is the total number of possible bins in the data space, given by the product $\prod_{i=1}^{N_D} N_{B,i}$. Even though a data set may be large ($\approx 10^9$ points), the number of possible bins can be much larger: a billion data points in a 12-dimensional data space with 10 bins per dimension (very coarse) yields $10^{12}$ possible bins. Depending on how the data is distributed, the data will most likely reside in small groupings within the data space, leaving much of the domain sparse.
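Equation (1) translates directly into code. The following helper (names are illustrative, not from the paper) serializes a 1-based bin address vector $\mathbf{b}$ into the single index $\tilde{k}$:

```python
def bin_index(b, nb):
    """Serialize a 1-based bin address into a single index (Eq. 1).

    b:  bin address per dimension, each entry in 1..nb[i]
    nb: number of bins per dimension, N_B,i
    """
    k = 1
    stride = 1  # running product of N_B,q, with N_B,0 = 1
    for bi, nbi in zip(b, nb):
        k += (bi - 1) * stride
        stride *= nbi
    return k
```

The last address maps to the total number of possible bins, e.g. `bin_index([10, 10], [10, 10])` gives 100.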

The data has been reduced to a set of bins, $D_k = \{ \tilde{k}, w \in \mathbb{N}^2 \}$, identified by an index and a population, retaining only those bins containing data. The number of bins may be further reduced based on the populations of the bins. Low-density bins can be excluded from further study either by setting a threshold ($\Theta_{pop}$) on the minimal number of data per bin, or by setting a threshold ($\Theta_{perc}$) on the cumulative percentage of the low-density bins with respect to the total population of all the data. The set $\tilde{D}$ contains the bins of data which will be considered

$$\tilde{D} = \left\{ \tilde{k}, w \subset D_k \;\middle|\; [\, w_{\tilde{k}} > \Theta_{pop} \,] \text{ or } [\, F(\tilde{k}) > \Theta_{perc} N \,] \right\}, \tag{2}$$

where $F(\tilde{k}) = \sum_{k'=1}^{\tilde{k}} w_{k'}$, with $w_{k'} = \operatorname{sort}(w_{\tilde{k}})$,

for clustering. The bin index, $\tilde{k}$, is mapped to a sequential list of indices, $k = 1 \cdots N_P$; the $N_P$ bins under consideration will be referred to as partitions, with the vector of populations, $\mathbf{w} = \{ w_k \in \mathbb{N} \}$, for each partition addressed by $k$, and the partition data space given by $D_P = \{ k, w \in \mathbb{N}^2 \}$. All calculations for this study are performed on the partition data space, $D_P$, which represents the integer-based grid of bin locations. The complementary data space of either empty or low-population partitions is given by $D_o = \{ k \in D_k \mid k \notin D_P \}$.
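A minimal sketch of this reduction step, assuming only the $\Theta_{pop}$ threshold (the $\Theta_{perc}$ variant is omitted) and data spanning a nonzero range along each dimension; function and variable names are illustrative:

```python
import numpy as np

def reduce_to_partitions(X, n_bins, theta_pop=1):
    """Bin data onto a regular grid; keep bins with population > theta_pop.

    X: (N, N_D) array of data; n_bins: bins per dimension.
    Returns (addresses, weights): the 1-based bin address vectors b and
    populations w of the retained partitions.
    """
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # 1-based bin addresses; clip so the maximum lands in the last bin
    b = np.floor((X - lo) / (hi - lo) * n_bins).astype(int)
    b = np.clip(b, 0, np.asarray(n_bins) - 1) + 1
    addresses, weights = np.unique(b, axis=0, return_counts=True)
    keep = weights > theta_pop
    return addresses[keep], weights[keep]
```

The retained `(addresses, weights)` pairs are exactly the partition data space $D_P$ on which the rest of the analysis operates.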

Clusters are subsets of data grouped based on a common feature. Cluster algorithms use a criterion to delineate data, which are then gathered by some mechanism and assigned to clusters. Traditional definitions rely on proximity of data to one another, yet clustering can also be defined as a simple grouping of the data, which could be based alphabetically, by income, or on some property that is difficult to map numerically, such as an object’s shape. Proximity alone can fail to cluster data appropriately when considering data distributed along tails of distributions far from a centroid, such as a horseshoe. By altering the definition of “proximity” to include distance measures such as path length, clustering can still be viewed as a local grouping. This paper explores multiple clustering algorithms and later sorts the clustering assignments into groupings reached by consensus.

Several calculations are common to multiple techniques and require only the partition bin address vector. These low-level calculations define geometrical features of how the partitions are related to one another. Calculations between two partitions form matrices indexed by $[k, l]$. Specific algorithms for each calculation can be found in the supplemental material online. The distances calculated here fall into two broad categories: path lengths, where the distance measured is between partitions connected to one another, and global, where a connection is not required. Among path lengths, two further distinctions are made: stepwise, where the distance is the sum of values from one partition to the next, and pathwise, where the distance is the sum of values added from the start of the path to the current partition for each step taken. The block of equations shown here is described in the following text.

$$\Delta b_i = b_{i,k} - b_{i,l} \qquad \Delta R = \sqrt{\sum_{i=1}^{N_D} \Delta b_i^2}$$

$$NN1 = I \circ \Delta R, \qquad I \equiv \begin{cases} 0 & |\Delta b_i| > 1, \text{ for any } i \\ 1 & |\Delta b_i| \le 1, \text{ for all } i \end{cases}$$

$$L2 = \sum_{j=k}^{l} NN1_{j,j+1} \;\;(\text{stepwise}) \qquad L2_T = \min\Big[ \sum_{j=k}^{l} NN1_{j,j+1} \Big]$$

$$L1 = \sum_{i=1}^{N_D} |\Delta b_i| \qquad \Sigma L1 = \sum_{j=k}^{l} L1_{kj} \;\;(\text{pathwise})$$

$$\Sigma L1_{min} = \min\Big[ \sum_{j=k}^{l} L1_{kj} \Big] \qquad \Sigma L1_T = \sum_{j=k}^{l} L1_{T,kj}$$

$$\Sigma L1_{VAR} = \sum_{j=k}^{l} \big| \Sigma L1_{kj} - \Sigma L1_{T,kj} \big|^2$$

$$ww^{\mathsf{T}} = \mathbf{w} \otimes \mathbf{w} - \operatorname{diag}(\mathbf{w}) \qquad \Delta w = w_k - w_l$$

The Euclidean distance is calculated between all partitions in $D_P$. First, the difference between two partitions' bin addresses is calculated for each component, $\Delta b_i$. The distance, $\Delta R$, is then calculated from the sum over $\Delta b_i^2$. The first nearest neighbor matrix, $NN1$, defines the distance between any two bins that are in contact with one another. Two partitions are in contact with one another if no bin address component differs by more than one in magnitude, leading to the interpretation that they share a common geometric feature: a point, line, area, etc. The matrix $NN1$ is the adjacency matrix weighted by the Euclidean distance, $\Delta R$. As each partition is a unit hypercube, these distances range from $1$ to $\sqrt{N_D}$.
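The $NN1$ construction can be sketched with NumPy broadcasting; a dense matrix is assumed here for clarity, though a sparse representation would be preferable at scale:

```python
import numpy as np

def nn1_matrix(addresses):
    """Distance-weighted adjacency NN1 between partitions.

    addresses: (N_P, N_D) integer bin address vectors. Two partitions are
    in contact when every address component differs by at most 1.
    """
    db = addresses[:, None, :] - addresses[None, :, :]   # Δb for all pairs
    dR = np.sqrt((db ** 2).sum(axis=-1))                 # ΔR
    contact = (np.abs(db) <= 1).all(axis=-1)             # indicator I
    return np.where(contact, dR, 0.0)                    # NN1 = I ∘ ΔR
```

A zero entry means the pair is not in contact (or is the self-pair, whose distance is zero).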

The path length, $L2$, is the distance between any two partitions taken by stepping from one partition to another through $NN1$ steps, summing $\Delta R$ along the path stepwise, where the initial partition is $k$, the interim partitions are $j$, and the final partition is $l$. Partitions are connected when a path is found; for partitions having no connecting path, the path length is set to $\infty$. The number of steps taken between any two partitions is the Path Count, $PC$.
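A single-source version of this path-length computation can be sketched with a textbook Dijkstra search over the $NN1$ adjacency (the paper's own implementation details are not specified here; names and the dense-matrix assumption are illustrative):

```python
import heapq
import numpy as np

def path_lengths_from(NN1, k):
    """Stepwise path length L2 from partition k to every other partition.

    NN1: dense (N_P, N_P) array; NN1[i, j] > 0 is the ΔR between
    partitions in contact, 0 means not in contact. Partitions with no
    connecting path keep a path length of inf.
    """
    n = NN1.shape[0]
    L2 = np.full(n, np.inf)
    L2[k] = 0.0
    heap = [(0.0, k)]
    while heap:
        d, i = heapq.heappop(heap)
        if d > L2[i]:
            continue  # stale queue entry
        for j in range(n):
            if NN1[i, j] > 0 and d + NN1[i, j] < L2[j]:
                L2[j] = d + NN1[i, j]
                heapq.heappush(heap, (L2[j], j))
    return L2
```

Running this from every $k$ fills the full $L2[k, l]$ matrix, with $\infty$ marking disconnected pairs.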

In order to determine whether two partitions meet a Line-of-Sight (LOS) criterion, only paths that fall within the convex hull formed between the two partitions are considered. For each connecting path found, six values are calculated to test the LOS criteria. The true path length, $L2_T$, assumes a straight path exists between two partitions, giving the stepwise length formed by taking the least number of $NN1$ steps with the smallest $L2$ values possible. The Summed L1 length, $\Sigma L1$, is the summation of the pathwise $L1$ distances taken from the initial partition to each subsequent partition along a path. The Minimal Summed L1 path is the unique path with the least possible $\Sigma L1_{min}$, while the True Summed L1 distance, $\Sigma L1_T$, is taken along the straight path established earlier. Finally, $\Sigma L1_{VAR}$ is the sum of the squared differences between the Summed L1 norm and the True Summed L1 norm along each step of a path. From these values, the true path is found, which tests the LOS criteria.

The following calculations are performed before any paths are sought, as they do not require knowledge of the exact path found, merely the endpoints, which give the dimensions of the convex hull containing the two partitions $[k, l]$: $\Delta b_i$, $NN1$, $\Delta R$, $L2_T$, $L1$, $\Sigma L1_{min}$ and $\Sigma L1_T$. The remaining three employ Dijkstra’s algorithm ( [

Several calculations require that the partitions be weighted by the product of the populations of $[k, l]$, leading to $ww^{\mathsf{T}}$, the outer product of the population vector with itself minus the weights along the diagonal, to account for the self-weighting within a single partition. Further, the difference between two partitions' populations is also needed, giving the matrix $\Delta w$.

A Line of Sight (LOS) criterion is introduced in this paper as a means to cluster data which gives additional significance to data while being independent of proximity. This approach assumes that data within a convex region of other data are likely to be associated together. When seeking the LOS criteria, the data space is divided into two broad subdomains, those partitions filled with sufficient data above threshold, and those partitions containing little or no data (≈empty space). Within the filled regions of space, Dijkstra’s algorithm is employed to find which partitions are connected to each other via a path and to measure the path length. Partitions of the empty set, D o , are viewed as obstacles to paths within D k . By analogy, the empty set serves to prevent LOS just as walls prevent continuity in vision.

The criteria used to establish a Line-of-Sight (LOS) between two partitions relies on the pathwise summation of L1 distances along the path taken from [ k , l ] . This distance has the property that when traversing a grid from [ k , l ] , the distance calculated is different than when returning from [ l , k ] . The asymmetry of this measure proves useful in determining the LOS condition.

1) A path must exist between $[k, l]$ that does not exit the convex hull, requiring $L2 = L2_T$.

2) The path found must take a direct path between $[k, l]$, requiring that $\Sigma L1 \le \Sigma L1_T$.

3) The path found must follow the direct path, requiring $\Sigma L1_{VAR} \le (PC/2)^2$.
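The three criteria can be collected into a single predicate. Interpreting criterion 2 as holding in both directions, following the later discussion of the $\Sigma L1$ asymmetry, is our reading; all names here are illustrative assumptions:

```python
def is_line_of_sight(L2, L2T, sL1_kl, sL1_lk, sL1T, sL1VAR, PC):
    """Check the three LOS criteria for a partition pair [k, l].

    Arguments are the scalar path values for that pair, named as in the
    text; sL1_kl and sL1_lk are the summed L1 in each direction.
    """
    stays_in_hull = (L2 == L2T)                       # criterion 1
    direct = (sL1_kl <= sL1T) and (sL1_lk <= sL1T)    # criterion 2
    follows_direct = sL1VAR <= (PC / 2) ** 2          # criterion 3
    return stays_in_hull and direct and follows_direct
```

All three conditions must hold for the pair to be declared LOS.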

Dijkstra’s algorithm finds the minimal path taken between two points on a grid given an adjacency matrix, $NN1$, giving the path length, $L2$.

To find the pathwise $\Sigma L1$ value, the adjacency matrix is altered by taking the row of $L1$ for the $k$-th partition and multiplying it by every row of the logical matrix $NN1 > 0$; in this way, the adjacency matrix presented to Dijkstra’s algorithm in a second round finds the minimal pathwise value from $[k, l]$. For an open region (no obstacles), the minimal summed L1 path, $\Sigma L1_{min}$, runs along the edge of the parallelotope.

When sufficient obstacles force the paths from [ k , l ] as well as [ l , k ] to be

Steps taken along each dimension (k-th partition is the origin):

| dims | 2D case (min path) | 2D case (max path) | 3D case (min path) | 3D case (max path) |
|---|---|---|---|---|
| x₁ | 1 1 1 1 1 1 1 | 1 1 1 1 1 1 1 | 1 1 1 1 1 1 1 | 1 1 1 1 1 1 1 |
| x₂ | 0 0 0 1 1 1 1 | 1 1 1 1 0 0 0 | 0 0 0 1 1 1 1 | 1 1 1 1 0 0 0 |
| x₃ | – | – | 0 0 0 0 0 1 1 | 1 1 0 0 0 0 0 |

Coordinates of path along each dimension (k-th partition is the origin):

| dims | 2D case (min path) | 2D case (max path) | 3D case (min path) | 3D case (max path) |
|---|---|---|---|---|
| x₁ | 1 2 3 4 5 6 7 | 1 2 3 4 5 6 7 | 1 2 3 4 5 6 7 | 1 2 3 4 5 6 7 |
| x₂ | 0 0 0 1 2 3 4 | 1 2 3 4 4 4 4 | 0 0 0 1 2 3 4 | 1 2 3 4 4 4 4 |
| x₃ | – | – | 0 0 0 0 0 1 2 | 1 2 2 2 2 2 2 |
| ΣL1 | 1 3 6 11 18 27 38 | 2 6 12 20 29 39 50 | 1 3 6 11 18 28 41 | 3 9 17 27 38 50 63 |

In the max-path cases, x₂ holds at 4 (and, in 3D, x₃ holds at 2) once its final coordinate is reached.

along the same side of the parallelotope with respect to the true path, the path found “turns a corner” in order to reach the final partition. In this case, one of the two values, $\Sigma L1_{k,l}$ or $\Sigma L1_{l,k}$, will exceed the true path summed L1, $\Sigma L1_T$, leading to the second criterion. The last criterion uses the results from the second application of Dijkstra’s algorithm, now attempting to find a path that minimizes the variance of $\Sigma L1$ with respect to $\Sigma L1_T$. Calculating the value $(\Sigma L1 - \Sigma L1_T)^2$, then copying the $k$-th row of this matrix and multiplying it by every row of the logical matrix $NN1 > 0$, a new adjacency matrix is formed and Dijkstra’s algorithm is applied a third time. At each step, the minimal summed path variance gives the most direct path from $[k, l]$, finally giving the path that is LOS between the two partitions, illustrated in

middle of the path. In this case, the error is the difference between $\frac{1}{2}n(n+1)$ and $\frac{1}{2}(n-1)n$, where $n = PC/2$, leading to the third criterion for LOS.
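The bound in the third criterion follows from simple arithmetic on these triangular sums; spelling it out (this expansion is ours, using only the quantities defined in the text):

```latex
\frac{1}{2}n(n+1) - \frac{1}{2}(n-1)n
  = \frac{n}{2}\bigl[(n+1) - (n-1)\bigr]
  = n = \frac{PC}{2},
```

so a unit deviation sustained to mid-path contributes a squared error of order $(PC/2)^2$, which motivates the threshold $\Sigma L1_{VAR} \le (PC/2)^2$ in criterion 3.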

This study applies 26 different clustering techniques to a bank of 12 representative test cases. The data sets forming the test bank comprise various shapes, both connected and disconnected, as well as point clouds in both 2D and 3D. In each of the point clouds, four Gaussian distributions were placed near one another, with three densely populated regions and a fourth low-density Gaussian which spans the domain. The point clouds were further varied by creating one case each in 2D and 3D where the dense Gaussians are clearly separated, and another case each in 2D and 3D where the three Gaussians overlap.

the test banks used, in this order: L, Plus1, Plus2, Concentric1, Concentric2, Flame1, Flame2, Flame3, Data2D-1, Data2D-2, Data3D-1, Data3D-2.

Test Bank Data Sets:

| Label | ID | Dim | Size (pixels/pts) | Connected | Symmetry | Plateau | Filamentary | Overlap | Noise |
|---|---|---|---|---|---|---|---|---|---|
| L | (a) | 2D | 1200 × 1200 | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ |
| Plus1 | (b) | 2D | 1200 × 1200 | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| Plus2 | (c) | 2D | 1200 × 1200 | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| Concentric1 | (d) | 2D | 1200 × 1200 | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| Concentric2 | (e) | 2D | 1200 × 1200 | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ |
| Flame1 | (f) | 2D | 1200 × 1200 | ✓ | ✗ | ✓ | ✓ | ✗ | ✗ |
| Flame2 | (g) | 2D | 1200 × 1200 | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ |
| Flame3 | (h) | 2D | 1200 × 1200 | ✓ | ✗ | ✓ | ✓ | ✗ | ✗ |
| Data2D-1 (pt. cloud) | (i) | 2D | 200,000 | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ |
| Data2D-2 (pt. cloud) | (j) | 2D | 200,000 | ✓ | ✗ | ✗ | ✓ | ✓ | ✓ |
| Data3D-1 (pt. cloud) | (k) | 3D | 200,000 | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ |
| Data3D-2 (pt. cloud) | (l) | 3D | 200,000 | ✓ | ✗ | ✗ | ✓ | ✓ | ✓ |

For all cases other than the point clouds, the data is derived from an image, where a binary set of points is established for all 8-bit grey-scale values above 100 (1) or below (0). The image sizes are 1200 × 1200 where possible, unless the aspect ratio prevented that exact size. The point clouds are based on four distributions with a summed total of 200,000 points.

Along with the 26 clustering algorithms applied, four additional cluster assignments are derived from consensus among the 26, leading to 30 differing cluster assignments per test case for a total of 360 figures showing the clustering results. These results are supplied as supplemental figures and can be found on the website. A sampling of these results is shown in Section 8.

This section discusses the clustering algorithms used in this paper. Some techniques are standard approaches, but several are variations on existing techniques with new methods. The new approaches treat the data in terms of partitions, with populations of data serving as weights on the partitions. Also new, the distance metric used is changed from a traditional L2-norm to a path length along a grid of partitions. Along with investigating path-length-based clustering, a Line-of-Sight criterion is also developed. An alternative approach to spectral clustering is also used, utilizing a different set of eigenvectors to establish clusters, along with alternatives to the traditional Laplacian operator. Once all 26 clustering techniques have assigned a cluster identity, an overall cluster identity is given to each datum based on the consensus of the set of techniques, with four algorithms employed differing in the degree of consensus reached.

Clustering Algorithms:

| Label | # | Connected required | Proximity | Weights | Sensitive to noise | Balanced | LOS criteria | Gathering method | Laplacian type | Eigenvectors | Fixed k guess |
|---|---|---|---|---|---|---|---|---|---|---|---|
| KMEANS | 1 | ✗ | ΔR | wwᵀ | ✗ | ✗ | ✗ | weighted | – | – | ✓ |
| KMEDOIDS | 2 | ✗ | ΔR | wwᵀ | ✗ | ✗ | ✗ | weighted | – | – | ✓ |
| MAXGLOB | 3 | ✗ | ΔR | Δw | ✗ | ✗ | ✗ | slopes | – | – | – |
| MAXPATHL | 4 | ✓ | L2 | Δw | ✗ | ✗ | ✗ | slopes | – | – | – |
| CONN | 5 | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | – | – | – | – |
| LOS-MAXVIS | 6 | ✓ | L2, ΣL1 | * | ✗ | ✗ | ✓ | max vis. | – | – | – |
| LOS-MUTUAL | 7 | ✓ | L2, ΣL1 | * | ✗ | ✗ | ✓ | mutual vis. | – | – | – |
| SPECTRAL01 | 8 | ✓ | ✗ | * | ✓ | ✗ | ✗ | 2D histo | NN1 | 1, 2 | – |
| SPECTRAL02 | 9 | ✓ | ✗ | * | ✓ | ✗ | ✗ | kmeans | NN1 | 1, 2 | ✓ |
| SPECTRAL03 | 10 | ✓ | ✗ | * | ✓ | ✗ | ✗ | kmedoids | NN1 | 1, 2 | ✓ |
| SPECTRAL04 | 11 | ✓ | ✗ | * | ✓ | ✓ | ✗ | 2D histo | NN1 | 2, 3 | – |
| SPECTRAL05 | 12 | ✓ | ✗ | * | ✓ | ✓ | ✗ | kmeans | NN1 | 2, 3 | ✓ |
| SPECTRAL06 | 13 | ✓ | ✗ | * | ✓ | ✓ | ✗ | kmedoids | NN1 | 2, 3 | ✓ |
| SPECTRAL07 | 14 | ✓ | ✗ | * | ✓ | ✗ | ✓ | 2D histo | LOS | 1, 2 | – |
| SPECTRAL08 | 15 | ✓ | ✗ | * | ✓ | ✗ | ✓ | kmeans | LOS | 1, 2 | ✓ |
| SPECTRAL09 | 16 | ✓ | ✗ | * | ✓ | ✗ | ✓ | kmedoids | LOS | 1, 2 | ✓ |
| SPECTRAL10 | 17 | ✓ | ✗ | * | ✓ | ✓ | ✓ | 2D histo | LOS | 2, 3 | – |
| SPECTRAL11 | 18 | ✓ | ✗ | * | ✓ | ✓ | ✓ | kmeans | LOS | 2, 3 | ✓ |
| SPECTRAL12 | 19 | ✓ | ✗ | * | ✓ | ✓ | ✓ | kmedoids | LOS | 2, 3 | ✓ |
| SPECTRAL13 | 20 | ✗ | ΔR | * | ✗ | ✗ | ✗ | 2D histo | RAD | 1, 2 | – |
| SPECTRAL14 | 21 | ✗ | ΔR | * | ✗ | ✗ | ✗ | kmeans | RAD | 1, 2 | ✓ |
| SPECTRAL15 | 22 | ✗ | ΔR | * | ✗ | ✗ | ✗ | kmedoids | RAD | 1, 2 | ✓ |
| SPECTRAL16 | 23 | ✗ | ΔR | * | ✗ | ✓ | ✗ | 2D histo | RAD | 2, 3 | – |
| SPECTRAL17 | 24 | ✗ | ΔR | * | ✗ | ✓ | ✗ | kmeans | RAD | 2, 3 | ✓ |
| SPECTRAL18 | 25 | ✗ | ΔR | * | ✗ | ✓ | ✗ | kmedoids | RAD | 2, 3 | ✓ |
| LMH-POS | 26 | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | – | – | – | – |

As the chief data reduction scheme here is to partition the data into multi-dimensional bins, the clustering is performed over the weighted partitions on a grid. Features indicated in the table are: Connected, requiring partitions to be connected in order to cluster; Proximity, using distance as a criterion; Weights, indicating populations affect the result; Sensitive to Noise, indicating methods which fail to find structure within larger connected subsets in the presence of noisy data; Balanced, indicating methods which evenly divide partitions into clusters; LOS criteria, required for some; Gathering, the method used to gather partitions for clustering; Laplacian, the type of Laplacian used for spectral algorithms; Eigenvectors, the modes used in gathering; and Fixed k, requiring an initial guess at the number of clusters.

K-means is a well-established clustering technique [

In this study, the data have been reduced to a set of partitions with a population assigned to each. The two schemes, MAXGLOB and MAXPATHL, assign data to clusters based on how close a partition is to a significant nearby maximum among the partitions. Treating the weights of the partitions as the heights of a multi-dimensional map, the significance of a nearby maximum is determined by calculating the slopes between any two partitions, where the slope is the ratio of the weight difference, Δw, to the distance between the two partitions. In the global case, the distance used is the Euclidean distance, ΔR, and in the path length case, the distance used is the path length, L2. MAXGLOB assigns clusters between partitions that are not required to be connected, while MAXPATHL requires a connection. Initially, local maxima among the partitions are found, which are then categorized into three types: lone peaks, ridges, and plateaus. Once the maxima are classified, a peak and all of the partitions associated with it are assigned a cluster identification number, where the slopes and distances from partition to peak are contributing factors in determining which peaks associate with which partitions. Definitions of local maxima, peaks, and slopes, as well as details of the algorithms for these two techniques, are included in the supplemental material online.
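The slope computation at the heart of MAXGLOB and MAXPATHL reduces to a single weighted ratio; passing ΔR gives the global variant and L2 (with ∞ for disconnected pairs) the path-length variant. This is a sketch of that one step only, with illustrative names; the peak classification and gathering are in the supplemental material:

```python
import numpy as np

def slope_matrix(w, dist):
    """Slope Δw / distance between every pair of partitions.

    w: (N_P,) partition populations; dist: (N_P, N_P) ΔR for MAXGLOB
    or L2 for MAXPATHL. Pairs at zero or infinite distance get slope 0.
    """
    dw = w[:, None] - w[None, :]  # Δw for all pairs
    with np.errstate(divide="ignore", invalid="ignore"):
        S = np.where((dist > 0) & np.isfinite(dist), dw / dist, 0.0)
    return S
```

A large positive `S[k, l]` marks partition `l` as sitting well below a nearby maximum at `k`.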

In cases where local clusters of partitions are sparsely found within the data space, a simple clustering algorithm is to determine which partitions are connected to one another using first nearest neighbor steps, NN1. Section 4 discusses path lengths calculated from one partition to another where those with a finite value are connected. A logical value is set between any two connected partitions creating the matrix CONN. A unique cluster ID is assigned for each connected set of partitions.
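CONN can be sketched as a breadth-first search over the NN1 adjacency, which labels each connected set of partitions (equivalent in effect to grouping partitions with finite path lengths to one another); names are illustrative:

```python
import numpy as np

def conn_clusters(NN1):
    """Assign a cluster ID to each connected set of partitions.

    NN1: (N_P, N_P) adjacency; a nonzero entry means the two partitions
    are in contact via a first nearest neighbor step.
    """
    n = NN1.shape[0]
    cluster_id = -np.ones(n, dtype=int)
    current = 0
    for start in range(n):
        if cluster_id[start] >= 0:
            continue  # already labeled
        stack = [start]
        cluster_id[start] = current
        while stack:
            i = stack.pop()
            for j in np.flatnonzero(NN1[i]):
                if cluster_id[j] < 0:
                    cluster_id[j] = current
                    stack.append(j)
        current += 1
    return cluster_id
```

Each connected component receives its own unique cluster ID, exactly as the CONN matrix prescribes.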

Clustering by Line-of-Sight is motivated by the idea that two data within a convex region of a subset of the data have a higher chance of being correlated than data outside that convex region. Considering a set of data comprised of various types of distributions, it is possible for overlapping regions to form, where the tail of one distribution mingles with the tail of another. In the worst case scenario, peaks of two differing distributions may overlap. Further, distributions may also form along curved paths, where the peak may be far from the tails. Clustering via CONN will associate all data in these distributions, however, checking whether two data lie within a convex hull more closely associates those data with one another. The Line-of-Sight criterion from Section 5 determines which partitions are convex to one another. As examples, Figures 3(i)-3(l) illustrate several distributions which have both convex regions as well as overlapping tails of distributions. In this discussion, the term visibility refers to the number of partitions that are LOS to a specific partition. A detailed discussion of the algorithms used to form clusters based on the LOS criteria is provided by the supplemental material online.

The LOS matrix is formed where each row represents a partition and each column represents all other partitions, with a logical value indicating whether the two are LOS, making the LOS matrix symmetric. Squaring the LOS matrix, LOS^{2}, gives a matrix whose values along each row tally the number of partitions which are mutually LOS to one another. For the L example given next, in the first row, the last three partitions are not LOS to the first, yet they share three partitions that are LOS in common. In order to eliminate the entries in LOS^{2} that are not present in the LOS matrix, a Hadamard product is taken between LOS and LOS^{2}, yielding a third matrix, LLL. To form clusters from the information in LLL, a gathering process finds partitions that meet one of two cluster criteria: maximal visibility finds those partitions that share a high value of visibility and are connected to one another, and greatest mutual visibility finds the largest sets of partitions with a common value, regardless of how high that visibility value is.
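This matrix algebra translates directly into NumPy; the small symmetric LOS matrix below is hypothetical (with partitions taken as visible to themselves), chosen only to exercise the operations:

```python
import numpy as np

# Hypothetical symmetric LOS matrix for five partitions (1 = pair is LOS)
LOS = np.array([[1, 1, 1, 0, 0],
                [1, 1, 1, 1, 0],
                [1, 1, 1, 1, 1],
                [0, 1, 1, 1, 1],
                [0, 0, 1, 1, 1]], dtype=int)

LOS2 = LOS @ LOS   # tallies of mutually visible partitions per pair
LLL = LOS * LOS2   # Hadamard product zeroes entries absent from LOS
```

Entries of `LLL` are nonzero only for pairs that are themselves LOS, with the value counting the partitions both can see.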

A simple example serves to demonstrate how these matrices interact with one another. Consider a small distribution of partitions forming a 6 × 4 grid connected to each other in an “L” configuration, as shown in the accompanying figure. The LOS, LOS^{2} and LLL matrices show which partitions are visible to each other. Note that partition five is visible to partition one, meaning

that partitions can see the edges of one another. From the matrices shown, partitions (3, 4, 5) form a cluster with the maximal visibility, followed by partitions (6, 7, 8, 9) then (1, 2) (LOS-MAXVIS). Partitions (3, 4, 5, 6, 7, 8, 9) form a cluster with the highest mutual visibility followed by (1, 2) with the lowest (LOS-MUTUAL).

The LOS matrix contains, in each row, the logical status of which partitions are LOS to the current partition. Further, the LLL matrix shows the number of mutually visible partitions within LOS of the current partition. From the LLL matrix, two values can be used to determine clustering using LOS. The highest value in the LLL matrix indicates which partitions are within LOS of the most other partitions. These highest-valued LLL partitions have the maximal visibility, LOS-MAXVIS, of the set of partitions that are LOS. An example would be any partition located at an intersection of several distributions of partitions: in the test cases L and Plus1, the corner of the L and the center of the Plus1 have maximal visibility. Clusters formed in this manner find intersections and corners of data distributions preferentially, leading to data identification of the entangled portions of data sets arising from the multiple distributions present.

Clustering by LOS-MAXVIS is achieved by forming a histogram from the visibility values of LLL, shown in

The LLL matrix can alternatively be used to cluster partitions with the highest mutual visibility (LOS-MUTUAL) by selecting clusters with the most common shared LLL value instead of the maximal value. In this manner, clusters are formed around partitions that can mutually see each other the most. From the same LLL histogram, starting from the bin with the most frequent visibility, a cluster is formed by seeking the minima on both sides of the peak nearest the most populated bin. Once the lower and upper bins are found, all partitions with any visibility values in LLL within this range are clustered together. Identified partitions are removed from further searches and the process is repeated until all partitions are assigned. LOS-MUTUAL clustering finds the largest set of partitions that are LOS to each other first, then searches for the next largest set of partitions that does not include the first set, and so on. In the case of the simple L, the most mutually visible partitions are those forming the long arm of the L, with values LLL = 7. For the Data3D2 case, all partitions with a visibility from 1700 up to 3000 are included in the first cluster found. As before, once a cluster is found, its partitions are removed from further searches. Clusters formed in this manner find full data distributions first, associating tails over mixed regions with the largest distributions, giving an alternative to the data identification offered by LOS-MAXVIS.
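The histogram-seeking step of LOS-MUTUAL can be sketched as follows (a minimal sketch assuming the LLL counts are already available as a per-partition array; the bin count and the minimum-seeking walk are simplifications of the procedure described above):

```python
import numpy as np

def los_mutual_clusters(lll, nbins=10):
    """Cluster partitions by shared LLL visibility counts (LOS-MUTUAL sketch).

    lll: 1-D array of mutual-visibility counts, one entry per partition.
    Returns 1-based cluster IDs, one per partition.
    """
    lll = np.asarray(lll, dtype=float)
    ids = np.zeros(len(lll), dtype=int)
    remaining = np.ones(len(lll), dtype=bool)
    next_id = 1
    while remaining.any():
        counts, edges = np.histogram(lll[remaining], bins=nbins)
        peak = int(np.argmax(counts))           # most populated bin
        lo = peak
        while lo > 0 and counts[lo - 1] < counts[lo]:
            lo -= 1                             # walk down to the left minimum
        hi = peak
        while hi < nbins - 1 and counts[hi + 1] < counts[hi]:
            hi += 1                             # walk down to the right minimum
        in_range = remaining & (lll >= edges[lo]) & (lll <= edges[hi + 1])
        ids[in_range] = next_id                 # cluster, then remove from search
        remaining &= ~in_range
        next_id += 1
    return ids
```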

Spectral clustering [

This analysis employs all three clustering techniques in the eigenspace and explores two differing sets of eigenvectors as a base: the lowest pair (1, 2) as well as the next lowest pair (2, 3). Spectral clustering finds clusters of partitions that are connected subdomains; however, when only a single connected domain is found (the clean case), the eigenvectors reveal a modal structure within the connected domain. When showing the modal structure for the first case using eigenvectors (1, 2), the first eigenmode accentuates a single large feature within the eigenspace, while the second eigenvector segments the space into a small number of symmetric regions. When using the next lowest pair of eigenvectors (2, 3), bypassing the lowest eigenmode, the modal structure segregates the partitions differently, clustering them into evenly distributed groups of data. Once the eigenspace has been populated with the partitions, k-means, k-medoids, or a traditional 2D histogram can be used to collect the partitions and assign them to cluster IDs. The strengths and weaknesses of k-means and k-medoids were discussed in Sec. 1. As an alternative approach to finding clusters within the eigenspace, one can simply histogram the 2D eigenspace and assign each non-zero bin a different cluster ID (2DHIST). This approach has the advantage of simplicity and finds exactly the number of clusters that fill bins within the eigenspace, requiring no initial guess of the number of possible clusters, as k-means or k-medoids do. A maximum possible count of clusters is, however, set by the number of bins of the 2D histogram, typically ( k + 2 ) × ( k + 2 ), so that the k-means and k-medoids searches are comparable to the size of the clusters sought.
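As a sketch of the 2DHIST assignment (assuming an adjacency matrix has already been built, e.g. from the LOS criteria or a radial basis; the unnormalized Laplacian and the bin count are illustrative choices, not the paper's exact configuration):

```python
import numpy as np

def spectral_2dhist(A, pair=(1, 2), nbins=6):
    """Embed partitions via Laplacian eigenvectors, then cluster by 2-D binning.

    A    : symmetric adjacency matrix between partitions.
    pair : eigenvector indices spanning the eigenspace
           ((1, 2) = lowest non-trivial pair, (2, 3) = next pair).
    Returns integer cluster IDs, one per partition (row of A).
    """
    A = np.asarray(A, dtype=float)
    L = np.diag(A.sum(axis=1)) - A               # unnormalized graph Laplacian
    w, v = np.linalg.eigh(L)                     # eigenvalues in ascending order
    xy = v[:, list(pair)]                        # 2-D eigenspace coordinates
    # Bin each coordinate; every non-empty 2-D bin becomes a cluster ID.
    bx = np.minimum((nbins * (xy[:, 0] - xy[:, 0].min()) /
                     (np.ptp(xy[:, 0]) + 1e-12)).astype(int), nbins - 1)
    by = np.minimum((nbins * (xy[:, 1] - xy[:, 1].min()) /
                     (np.ptp(xy[:, 1]) + 1e-12)).astype(int), nbins - 1)
    flat = bx * nbins + by
    _, ids = np.unique(flat, return_inverse=True)  # compress to 0..n_clusters-1
    return ids
```

Unlike k-means, no k is supplied; the number of clusters is simply the number of occupied bins in the eigenspace.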

The most obvious form of clustering is to associate a partition solely by its position (LMH-POS) using a coarse binning within the partition space. By setting the number of bins along each dimension to three, the bins are interpreted as being Low, Medium or High for the values represented along each axis. In this case, the sequential partition bin index, k, becomes the cluster ID, with the maximum number of possible clusters at 3^{N_{D}}, for the three bins along each of the N_{D} axes.

This approach is a coarse designation for clustering, as it employs no complicated algorithms, and data with similar values are associated irrespective of all other factors. It suffers from the problem that data in one bin will not be clustered with data from a neighboring bin, no matter how close in proximity the two are to one another. Clusters from LMH-POS characterize data in the crudest sense, with no refinement for the shape of a distribution or even the relative sizes of the distributions. One advantage of this approach is that it is easy to understand, even while spanning multiple dimensions, making it an easy entry point for a discussion of the data. When handling large data sets, this approach allows a quick look at where the data reside within the larger space.
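LMH-POS reduces to a coarse fixed-grid binning, sketched below (the equal-width binning and the row-major serial-index convention are assumptions consistent with the description above):

```python
import numpy as np

def lmh_pos(points, nbins=3):
    """Assign each point a coarse Low/Medium/High position cluster (LMH-POS sketch).

    points: (N, D) array. Each axis is split into `nbins` equal-width bins and
    the serial grid index k becomes the cluster ID (at most nbins**D clusters).
    """
    pts = np.asarray(points, dtype=float)
    lo = pts.min(axis=0)
    span = np.ptp(pts, axis=0) + 1e-12           # guard against zero-width axes
    axis_bin = np.minimum((nbins * (pts - lo) / span).astype(int), nbins - 1)
    # Serial index k over the D-dimensional coarse grid (row-major).
    weights = nbins ** np.arange(pts.shape[1])[::-1]
    return axis_bin @ weights
```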

This section shows a sampling of results from the application of 26 clustering techniques to 12 test cases. The strengths and weaknesses of these techniques are exposed, leading to the conclusion that a consensus approach is reasonable. Ideally, all clustering techniques plus the four robust consensus results for each test case would be presented, leading to 360 figures; due to space limitations, the full set of clustering results is provided in the supplemental material. Throughout this section, the term “noise” refers to data sets where a significant number of isolated small subsets of partitions, including singletons, are present, while the term “clean” refers to data sets without these smaller subsets. In the supplemental material, for each data set, a high data threshold, Θ_{perc} = 2%, and a low threshold, Θ_{perc} = 0%, are applied, showing how clustering is achieved in a clean versus a noisy environment respectively. Data clustering is also shown at two different bin resolutions to illustrate how too fine a resolution may not achieve good clustering. Finally, the figures are grouped into full-page comparisons for a single test case with all 30 techniques shown, as well as single-page comparisons for each technique across all test cases, leading to 2880 figures over 168 pages.

Of the sampled results provided, the first figure shown in each case is for k-medoids clustering using k = 16, unless otherwise stated, in order to give a comparison between an established clustering method and the other approaches. The remaining figures are chosen to demonstrate a particular trait of a clustering technique.

K-means and k-medoids results are well understood for both their strengths and weaknesses. MAXGLOB and MAXPATHL tend to mirror results from k-means and k-medoids, with the exception that MAXPATHL is restricted to clustering within a connected set, making it seek clusters that follow a distribution's shape rather than just using proximity between data. The CONN technique clusters data within a connected set regardless of other criteria. LOS-MAXVIS and LOS-MUTUAL cluster according to data within convex hulls, seeking similar visibility features as part of the gathering criteria to form clusters. Spectral techniques form clusters within the eigenspace formed from two eigenvectors. The adjacency matrix used to form the Laplacian determines the nature of the neighbors used, traditionally first nearest neighbors; however, this study employs both the LOS criteria to define “neighbors” as well as a radial basis. Further, the choice of eigenvectors used to form the eigenspace determines whether a prominent feature is clustered about (using the 1st and 2nd eigenvectors) or a more evenly distributed clustering is achieved (using the 2nd and 3rd eigenvectors). Due to the number of variations in spectral clustering, the techniques are identified by an index given in

Figures 6-8 show the clustering results for the data sets derived from images, where one datum exists for each pixel turned on. After binning, these test cases generally have flat distributions, so the clustering results reflect geometrical features, useful for showing data grouping. Figures 9-12 show the clustering results for simulated ellipsoidal distributions, where the data are unevenly distributed, some with overlapping tails, helpful in illustrating data identification.

number of non-zero elements, favoring the larger clusters first. SPECTRAL07 (9g) clusters ( [

In this paper, multiple clustering algorithms have been presented and applied to several test cases. Each technique has strengths as well as weaknesses which have been exposed through the cases presented. When using multiple techniques, the possibility exists to leverage the information gathered from all techniques to arrive at a final cluster designation, based on the level of agreement or disagreement found between the algorithms [

In each approach taken, the cluster information for the partitions is represented by a matrix of cluster IDs, where each row represents the results from a single clustering algorithm and each column is a partition. The values along each row are the cluster IDs assigned to each partition, forming the matrix CLUS ∈ ℕ^{C×P}, where C = 26 and P is the number of partitions. In order to find agreement or disagreement between cluster IDs across many techniques, the columns are sorted so that the cluster IDs appear in ascending order along the first row. For any repeated values in the first row, the columns are then sorted by the next row in a similar fashion, continuing through further rows until all repeated values are addressed.
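The column sort can be sketched with a lexicographic sort (a sketch; it assumes the CLUS matrix is stored with one row per algorithm, as described above):

```python
import numpy as np

def sort_clus_columns(clus):
    """Reorder the columns (partitions) of a CLUS matrix lexicographically.

    clus: (C, P) matrix of cluster IDs, one row per algorithm. Columns are
    sorted ascending by the first row, with ties broken by later rows, so
    partitions with identical ID patterns end up adjacent.
    """
    clus = np.asarray(clus)
    # np.lexsort treats the LAST key as primary, so reverse the row order.
    order = np.lexsort(clus[::-1])
    return clus[:, order], order
```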

40 Partition Cluster IDs

Alg_{1} | 5 | 7 | 2 | 4 | 7 | 1 | 2 | 4 | 1 | 1 | 2 | 3 | 8 | 6 | 4 | 1 | 5 | 5 | 4 | 1 | 1 | 4 | 8 | 8 | 9 | 6 | 9 | 4 | 1 | 2 | 4 | 4 | 4 | 1 | 7 | 4 | 3 | 3 | 4 |

Alg_{2} | 7 | 7 | 3 | 7 | 7 | 1 | 3 | 7 | 3 | 1 | 3 | 4 | 7 | 7 | 6 | 1 | 7 | 7 | 5 | 1 | 2 | 5 | 7 | 7 | 7 | 7 | 7 | 5 | 1 | 3 | 6 | 7 | 6 | 1 | 7 | 7 | 5 | 3 | 5 |

Alg_{3} | 6 | 6 | 2 | 6 | 7 | 1 | 3 | 6 | 2 | 1 | 3 | 3 | 8 | 6 | 5 | 1 | 6 | 6 | 5 | 1 | 1 | 5 | 8 | 8 | 8 | 6 | 8 | 5 | 1 | 2 | 5 | 6 | 6 | 1 | 6 | 6 | 3 | 3 | 5 |

Alg_{4} | 4 | 5 | 2 | 4 | 6 | 1 | 2 | 4 | 2 | 1 | 2 | 2 | 6 | 4 | 4 | 1 | 4 | 4 | 2 | 1 | 2 | 3 | 6 | 6 | 6 | 5 | 6 | 2 | 1 | 2 | 3 | 4 | 4 | 1 | 5 | 4 | 2 | 2 | 2 |

Alg_{5} | 5 | 6 | 1 | 4 | 6 | 1 | 1 | 5 | 1 | 1 | 1 | 2 | 6 | 6 | 4 | 1 | 6 | 5 | 3 | 1 | 1 | 4 | 6 | 6 | 6 | 6 | 6 | 4 | 1 | 1 | 4 | 4 | 4 | 1 | 6 | 4 | 2 | 1 | 4 |

Alg_{6} | 6 | 8 | 1 | 5 | 8 | 1 | 2 | 5 | 1 | 1 | 3 | 4 | 8 | 7 | 4 | 1 | 7 | 5 | 4 | 1 | 1 | 4 | 9 | 9 | 9 | 7 | 9 | 4 | 1 | 1 | 4 | 5 | 5 | 1 | 7 | 5 | 4 | 3 | 4 |

40 Partition Cluster IDs―Resorted by Partitions in Ascending ID Order

Alg_{5} | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 3 | 3 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 5 | 5 | 5 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 |

Alg_{6} | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 3 | 3 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 5 | 5 | 5 | 5 | 5 | 5 | 6 | 7 | 7 | 7 | 7 | 8 | 8 | 8 | 9 | 9 | 9 |

Alg_{1} | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 3 | 3 | 3 | 3 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 5 | 5 | 5 | 6 | 6 | 7 | 7 | 7 | 8 | 8 | 8 | 9 |

Alg_{3} | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 3 | 3 | 3 | 3 | 3 | 4 | 5 | 5 | 5 | 5 | 5 | 5 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 7 | 8 | 8 | 8 | 8 |

Alg_{4} | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 3 | 3 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 5 | 5 | 5 | 6 | 6 | 6 | 6 | 6 |

Alg_{2} | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 4 | 5 | 5 | 5 | 5 | 5 | 5 | 6 | 6 | 6 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 |

40 Partition Cluster Difference Flags (Logical) for Sorted IDs

Alg_{5} | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

Alg_{6} | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |

Alg_{1} | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |

Alg_{3} | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |

Alg_{4} | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |

Alg_{2} | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

40 Partition Cluster Fractured IDs

Rob_{1} | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 3 | 4 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 12 | 13 | 14 | 15 | 16 | 17 | 17 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 28 | 29 |

40 Partition Cluster Majority Changed IDs

Rob_{2} | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 3 | 3 | 3 | 4 | 4 | 5 | 5 | 6 | 6 | 6 | 6 | 6 | 7 | 7 | 7 | 7 | 7 | 8 | 8 | 8 | 9 | 9 | 9 | 10 | 10 | 11 | 11 | 11 | 11 |

40 Partition Cluster All Changed IDs

Rob_{3} | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 4 | 4 | 4 | 4 | 4 |

40 Partition Cluster No Overlap IDs

Rob_{4} | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

following the partitions from left to right across the page.

As examples of robust clustering, the last four figures from

The Fractured robust designation results from assigning each partition a new cluster ID starting from one and increasing the cluster ID each time any technique changes its ID, which yields the largest set of clusters found. This approach is the most sensitive to changes in the cluster designations. The Majority Changed robust technique assigns a new cluster ID each time the accumulated number of algorithm cluster-ID changes reaches a majority of the total number of algorithms. For each clustering technique, once a change occurs, any further changes from that technique are not registered until the majority is reached, at which point the accumulated sum of changes is reset to zero. This results in a medium-sized set of clusters, where a significant number of algorithms register a change without all algorithms being required to note the change in ID. In the figure, a 75% majority was required; ideally, the best majority threshold would create the largest number of clusters with the highest average membership. The All Changed robust case is equivalent to the Majority Changed case with a 100% majority threshold. This results in a small-to-medium sized set of clusters, where every algorithm found a change; however, the changes need not have been at the same partition number, merely that the total set of changes across all algorithms eventually required a change of ID. The No Overlap robust case assigns a new cluster ID only when all algorithms change designation simultaneously, resulting in the smallest set of clusters, where every algorithm must find a change at the same partition. Ideally, this would happen for each disconnected group of partitions; however, several techniques are “global” in scope and do not require a connection to exist to form clusters, leading to a single large cluster.
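The consensus rules can be sketched from the per-algorithm change flags of the sorted CLUS matrix (a minimal sketch; the mode names and majority bookkeeping are simplified from the description above, with All Changed expressed as Majority Changed at a 100% threshold):

```python
import numpy as np

def robust_ids(clus_sorted, mode="fractured", majority=0.75):
    """Derive consensus cluster IDs from a column-sorted CLUS matrix (sketch).

    clus_sorted: (C, P) matrix of cluster IDs, columns already sorted.
    mode: "fractured"  - new ID whenever ANY algorithm changes its ID,
          "majority"   - new ID once a `majority` fraction of algorithms have
                         registered a change (one change counted per algorithm;
                         majority=1.0 reproduces the All Changed case),
          "no_overlap" - new ID only where ALL algorithms change at once.
    """
    clus = np.asarray(clus_sorted)
    C, P = clus.shape
    flags = np.zeros((C, P), dtype=bool)
    flags[:, 1:] = clus[:, 1:] != clus[:, :-1]   # per-algorithm change flags
    ids = np.ones(P, dtype=int)
    pending = np.zeros(C, dtype=bool)            # changes seen since last new ID
    for p in range(1, P):
        pending |= flags[:, p]
        if mode == "fractured":
            new = flags[:, p].any()
        elif mode == "no_overlap":
            new = flags[:, p].all()
        else:                                    # "majority"
            new = pending.sum() >= majority * C
        if new:
            pending[:] = False                   # reset the accumulated changes
        ids[p] = ids[p - 1] + int(new)
    return ids
```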

Several of the clustering techniques used in this study, such as KMEANS and KMEDOIDS, require either a guess or foreknowledge of the number of clusters sought. Robust clustering can provide a reasonable guess for the k-value by first attaining consensus over all techniques that do not use a k-value: MAXGLOB, MAXPATHL, CONN, LOS-MAXVIS and LOS-MUTUAL. Using the Majority Changed technique with a suitable choice of consensus threshold, the number of clusters found can be used as a k-value, providing a reasonable guess with which to re-run the analysis utilizing the full complement of techniques.

A study using 26 clustering techniques has been performed over 12 test cases to illustrate both the strengths and weaknesses of clustering algorithms. A robust form of clustering is achieved through consensus over all techniques, helping reduce clustering problems by finding consistent clustering definitions across many approaches. The approach taken by this study utilizes six main ideas to produce a robust clustering analysis:

・ Reduce a large data set by binning the space, where the filled bins are the multi-dimensional partitions of the data set, each with a unique serial index, k.

・ Algorithms use the path length between any connected partitions as well as traditional distance metrics (L1, L2, etc.).

・ A Line-of-Sight (LOS) algorithm is developed to enhance the probability that two data are associated with one another. LOS also provides a new “super” neighborhood definition to be used in graph-based techniques. Data identification is addressed in two differing ways by LOS.

・ Spectral clustering using the [

・ Apply multiple clustering techniques to the set of partitions based on first nearest neighbors, distance-weighted factors and geometrical properties of the set.

・ Using a consensus overcomes any one technique's failure modes in favor of the strengths of multiple techniques.

・ Establish a final cluster ID based on the consensus of all techniques employed.

This study shows that high dimensional, big-data analysis can be reduced to a smaller set of partitions where multiple clustering techniques can be used to sort the data into clusters. While the techniques presented are all computationally O(N_{P}^{2}), reducing the data set to partitions makes these routines reasonable to perform. The introduction of the LOS criteria created new avenues for cluster seeking. The combination of multiple clustering techniques, various distance metrics and traditional data reduction leads to a robust set of clusters, which worked well in addressing issues of data identification, clustering and grouping.

SW acknowledges the support of ONR Grant No. N00014-01-1-0769 and EPSRC Grant No. EP/P021123/1. KM acknowledges the support by ONR grants: N00014-15-WX-01814, N00014-16-WX-01705 and N00014-17-WX-01705 as well as the Kinnear Fellowship from the USNA Foundation.

Mcilhany, K. and Wiggins, S. (2018) High Dimensional Cluster Analysis Using Path Lengths. Journal of Data Analysis and Information Processing, 6, 93-125. https://doi.org/10.4236/jdaip.2018.63007

Supplemental material for this study can be found at a GitHub site dedicated to this subject titled: “Data Clustering Using Path Lengths” [