
The visual assessment of tendency (VAT) technique, developed by J. C. Bezdek, R. J. Hathaway and J. M. Huband for visually finding the number of meaningful clusters in data, is very useful, but there is room for improvement. Instead of displaying the ordered dissimilarity matrix (ODM) as a 2D gray-level image for human interpretation, as is done by VAT, we trace the changes in dissimilarities along the diagonal of the ODM. This changes the 2D data structure (a matrix) into 1D arrays, displayed as what we call the tendency curves, which enables one to concentrate on only one variable, namely the height. One of these curves, called the d-curve, clearly shows the existence of cluster structure as patterns in peaks and valleys, which can be caught not only by human eyes but also by the computer. Our numerical experiments showed that the computer can catch cluster structures from the d-curve even in some cases where human eyes see no structure in the visual outputs of VAT. Success in all numerical experiments was obtained using the same (fixed) set of program parameter values.

Clustering is the problem of partitioning a set of objects into c self-similar subsets (clusters) based on available data and some well-defined measure of similarity. The type of clusters found depends strongly on the mathematical model that underlies the clustering algorithm. All clustering algorithms will find any number (up to n) of clusters, even if no meaningful clusters exist. Therefore, before choosing a clustering method, one has to decide whether there are meaningful clusters and, if so, how many there are. This is called assessing clustering tendency.

Numerous formal (statistics-based) and informal techniques for such assessment are discussed in Jain and Dubes [

The object set $O = \{o_1, \dots, o_n\}$ is usually represented in the following two ways. When each object $o_i$ is represented by a vector $x_i \in \mathbb{R}^s$, the set $X = \{x_1, \dots, x_n\}$ is called an object data representation of O. The s components of $x_i$ represent the s features of the object $o_i$. It is in this feature space that people sometimes seek descriptors of the clusters, cluster centers or prototypes, as they are called. Alternatively, when each pair of objects in O is represented by a relationship, it is called relational data. Most of the time, the relationship between $o_i$ and $o_j$ is given by their dissimilarity $r_{ij}$ (a distance or some other measure; see [10,11]). These data items form a symmetric matrix $R = [r_{ij}]_{n \times n}$.

Our method, which we call VATdt (Visual Assessment of cluster Tendency using diagonal tracing), replaces the visual output of the VAT algorithms (the original one or its variations). VAT applies directly to a dissimilarity matrix. If the original data consist of a (symmetric) matrix of pair-wise similarities $S = [s_{ij}]$, then a dissimilarity matrix $R = [r_{ij}]$ can be obtained through a simple transformation such as

$$r_{ij} = s_{\max} - s_{ij},$$

where $s_{\max}$ denotes the largest similarity value. If the original data are represented by object data $X = \{x_1, \dots, x_n\}$, then $r_{ij}$ can be computed as the distance between $x_i$ and $x_j$ measured by some norm or metric in the feature space. Hence the VAT algorithms can always be applied, and so can our VATdt algorithm. They are applicable even if some components of the original data are missing; see [
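The two ways of obtaining a dissimilarity matrix can be sketched as follows. This is a minimal illustration, assuming that the similarity transformation subtracts each similarity from the largest similarity value and that the Euclidean distance is the chosen metric. The function names and the toy data are hypothetical.

```python
import math

def dissimilarity_from_similarity(S):
    """Map similarities to dissimilarities by subtracting from the largest similarity."""
    s_max = max(max(row) for row in S)
    return [[s_max - s for s in row] for row in S]

def dissimilarity_from_objects(X):
    """Pairwise Euclidean distances between feature vectors (one possible metric)."""
    return [[math.dist(x, y) for y in X] for x in X]

# Toy similarity matrix (assumed values, for illustration only).
S = [[1.0, 0.8, 0.1],
     [0.8, 1.0, 0.3],
     [0.1, 0.3, 1.0]]
R_from_S = dissimilarity_from_similarity(S)

# Toy object data: three points in a 2-dimensional feature space.
X = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
R_from_X = dissimilarity_from_objects(X)
```

Either result is a symmetric matrix with a zero diagonal, which is the only input the ordering step needs.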

VAT reorders the points in a data set so that points that are close to one another in the feature space will generally have similar indices (see the example below). Some versions, such as sVAT [

The largest element of R is 1 because the VAT algorithms scale the elements of R.
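The reordering and scaling can be sketched as below. This is not the authors' code: it follows the commonly described VAT ordering, which starts from an object involved in the largest dissimilarity and then repeatedly appends the unordered object closest to any already ordered one. Tie-breaking and the function names are assumptions.

```python
def vat_order(R):
    """Return a VAT-style ordering of the indices of a dissimilarity matrix R."""
    n = len(R)
    # Start from an object that takes part in the largest dissimilarity.
    first = max(range(n), key=lambda i: max(R[i]))
    order = [first]
    remaining = set(range(n)) - {first}
    # nearest[j]: smallest dissimilarity between j and any ordered object.
    nearest = {j: R[first][j] for j in remaining}
    while remaining:
        j = min(remaining, key=lambda q: (nearest[q], q))  # ties broken by index
        order.append(j)
        remaining.remove(j)
        for q in remaining:
            nearest[q] = min(nearest[q], R[j][q])
    return order

def ordered_dissimilarity_matrix(R):
    """Reorder R by the VAT ordering and scale it so its largest element is 1."""
    order = vat_order(R)
    r_max = max(max(row) for row in R) or 1.0
    odm = [[R[p][q] / r_max for q in order] for p in order]
    return odm, order

# Toy data: two groups of points on a line, interleaved in the input order.
points = [0.0, 10.0, 1.0, 11.0, 0.5, 10.5]
R = [[abs(a - b) for b in points] for a in points]
odm, order = ordered_dissimilarity_matrix(R)
```

On this toy input the three points near 0 receive consecutive indices, followed by the three points near 10, so the scaled matrix shows two diagonal blocks.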

VAT displays the ODM on the screen in a straightforward way, as an ordered dissimilarity image (ODI). In the ODI the gray level of pixel (i,j) is proportional to the value of the (i,j) element of the ODM, with pure black if the element is 0 and pure white if it is 1. The idea of VAT is shown in the following example.
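The gray-level mapping can be illustrated with a short sketch, assuming an 8-bit image in which a scaled dissimilarity of 0 maps to pure black (0) and 1 maps to pure white (255). The function name and the number of gray levels are assumptions.

```python
def odi_gray_levels(odm, levels=256):
    """Map scaled dissimilarities in [0, 1] to integer gray levels (0 = black)."""
    top = levels - 1
    # Clamp to [0, 1] first so that small rounding errors cannot leave the range.
    return [[round(top * min(max(r, 0.0), 1.0)) for r in row] for row in odm]

# Toy ordered dissimilarity matrix with two diagonal blocks.
toy_odm = [[0.0, 0.1, 1.0],
           [0.1, 0.0, 0.9],
           [1.0, 0.9, 0.0]]
image = odi_gray_levels(toy_odm)
```

Small dissimilarities inside a cluster then appear as a dark block along the diagonal.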

Example 1. A data set of 20 points containing three well-defined clusters is shown in

The VAT algorithms are certainly very useful, but there is room for improvement. It seems to us that our eyes are not very sensitive to structures in gray-level images. One example is given in

The approach of this paper is to trace changes in dissimilarities along the diagonal of the ODM, the numeric output of VAT that underlies its visual output ODI. This will result in what we call the tendency curves. The borders of clusters in the ODM (or blocks in the ODI) are reflected as certain patterns in peaks and valleys on the tendency curves. To be exact, we will actually use only one of these curves, called the d-curve, which is the difference of two other curves. The patterns on the d-curve can be caught not only by human eyes but also by the computer. It seems that the computer is more sensitive to these patterns than human eyes are to them, or to the gray level patterns in the ODI. For example, the computer caught three clusters in the data set that produced the virtually useless ODI in

Remark: The patterns on the tendency curves only roughly match the block borders in the ODI in position, and the sizes of these blocks do not closely approximate the sizes of clusters in the data, either. This is because the VAT algorithms tend to index each cluster’s most outlying points at the very end, after all the denser cluster cores are indexed. Whenever we say in this paper “catch clusters/blocks”, we mean the program reveals the existence of clusters/blocks. The sizes and members (or memberships) will have to be found by a clustering method, not by a tendency algorithm such as ours.

We will describe our method in detail in §2 below, give numerical examples in §3, and conclude the paper with discussions and future plans in the last section.

We try to catch possible diagonal blocks in the ordered dissimilarity matrix R, the numeric output of VAT. We do so by using various averages of dissimilarities, which are stored as vectors and displayed as curves. The goal is to catch the borders of black blocks in an ODI such as

where This is the average of the elements of row i in the w-band shown in

When the situation is less than ideal, there will be noise, sometimes very “loud” noise, on the r-curve, which may destroy possible patterns on it. To overcome this, we extend the idea of averaging to more rows, which leads to the m-curve, whose i-th element is the average of all elements such that

and

These are the elements in up to m rows above row i, inclusive, that fall in the w-band, corresponding to the region between the two horizontal line segments in

The m-curve often reveals the pattern beneath the noisy r-curve. Since the ODM is scaled so that its largest element is 1, the heights of peaks on the m-curve remain roughly the same from case to case, at least when clusters are well formed.
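As a minimal sketch of the averaging idea, assume that the w-band of row i consists of the up to w elements immediately to the left of the diagonal, and that an empty band contributes the value 0. These index conventions, the function names and the toy matrix are assumptions made for illustration, not the exact definitions (1) and (2).

```python
def moving_band_average(odm, rows, w):
    """For each i, average the band elements of the last `rows` rows up to row i."""
    curve = []
    for i in range(len(odm)):
        values = []
        for k in range(max(0, i - rows + 1), i + 1):
            # Assumed w-band of row k: up to w entries just left of the diagonal.
            values.extend(odm[k][max(0, k - w):k])
        curve.append(sum(values) / len(values) if values else 0.0)
    return curve

def r_curve(odm, w):
    """Single-row band average."""
    return moving_band_average(odm, 1, w)

def m_curve(odm, m, w):
    """Band average pooled over up to m rows ending at row i."""
    return moving_band_average(odm, m, w)

# Toy ODM with two blocks of three objects each (assumed values).
toy = [[0.0, 0.1, 0.1, 1.0, 1.0, 1.0],
       [0.1, 0.0, 0.1, 1.0, 1.0, 1.0],
       [0.1, 0.1, 0.0, 1.0, 1.0, 1.0],
       [1.0, 1.0, 1.0, 0.0, 0.1, 0.1],
       [1.0, 1.0, 1.0, 0.1, 0.0, 0.1],
       [1.0, 1.0, 1.0, 0.1, 0.1, 0.0]]
r = r_curve(toy, w=2)
m = m_curve(toy, m=2, w=2)
```

Under these assumptions the r-curve of the toy matrix stays low inside each block and jumps at row 3, the first row of the second block, whose band still reaches back into the first block. The m-curve shows the same jump in a smoothed form.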

But again there are less-than-ideal situations, in which there are outliers. The VAT algorithms tend to order outliers near the end, so the m-curve tends to drift upward toward the right. This is fine for human eyes but makes it hard for the program to identify peaks and valleys using thresholds. This is why we introduce the M-row moving average, called the M-curve. The M-curve is defined in the same way as the m-curve except with m replaced by M, and it shows long-term trends of the r-curve. We are, however, not interested in the M-curve itself. We use the M-curve to “correct”, or to level, the m-curve by subtracting the former from the latter. It is this difference of the m- and M-curves, which we call the d-curve, that we are interested in. The d-curve retains the shape of the m-curve but is more horizontal, basically lying on the horizontal axis. Furthermore, the M-curve changes more slowly than the m-curve, so when the curves cross from one block into another in the ODM, the M-curve lags behind: it is lower than the m-curve while the m-curve peaks at the border, and higher once the m-curve has dropped back. As a result, the d-curve shows a valley, most likely below the horizontal axis, after a peak. It is the peak-valley, or high-low, patterns on the d-curve that signal the existence of cluster structures. This will become clear in our examples in the section that follows.
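The leveling step can be sketched as follows. The band convention (up to w entries just left of the diagonal), the helper names, and the toy values m = 2, M = 10 and w = 5 are assumptions for illustration only; they are not the parameter values given in (3).

```python
def band_average(odm, rows, w):
    """For each i, average the assumed band elements of rows i-rows+1 .. i."""
    curve = []
    for i in range(len(odm)):
        values = [v for k in range(max(0, i - rows + 1), i + 1)
                  for v in odm[k][max(0, k - w):k]]
        curve.append(sum(values) / len(values) if values else 0.0)
    return curve

def d_curve(odm, m, M, w):
    """Difference of the short (m-row) and long (M-row) moving averages."""
    short = band_average(odm, m, w)
    long_term = band_average(odm, M, w)
    return [a - b for a, b in zip(short, long_term)]

def two_block_odm(n=40, split=20, within=0.1, between=1.0):
    """Idealized ODM: small dissimilarities inside two blocks, large across them."""
    def same_block(i, j):
        return (i < split) == (j < split)
    return [[0.0 if i == j else (within if same_block(i, j) else between)
             for j in range(n)] for i in range(n)]

d = d_curve(two_block_odm(), m=2, M=10, w=5)
peak = max(range(len(d)), key=d.__getitem__)
valley_after_peak = min(d[peak:])
```

On this idealized matrix the d-curve is close to zero inside the first block, rises to a peak just after the block border, and then drops below zero while the slower M-curve still carries the large border values. This is the high-low pattern described above.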

Although the d-curve is the only curve we really need, we will also show the other tendency curves, that is, the r-, m- and M-curves, in the first few examples to show the reader how the idea evolved from an intuitive r-curve to the final, rather technical, d-curve.

Remark: It may seem much more natural to define the i-th element of the r-curve as the average of all such that the object is in the same cluster as and Actually this is what we tried at the very beginning of this work. More precisely, we set in definition (2) at the beginning of the calculation, and once we believed we had found a new cluster, we reset to the index of the element we believed to be the first one in the new cluster. There were several problems. First, neither the VAT algorithms nor our program can accurately locate the borders of clusters in terms of the index values. Second, any possible patterns obtained that way were self-fulfilled: once we reset all curves went back to zero, and then it would look like there was indeed a new cluster. It would literally tear the tendency curves apart, and distort all possible high-low patterns.

In all the examples in this paper, we will use the values

where n is the number of objects in the data set. Here the ceiling function is used for m so that it is at least 1 even if n is very small. These are the values we recommend to possible users of our algorithm when there is no clear reason to change them. A discussion of how the values of these, and two other, parameters were chosen can be found later in the section.

We first give one group of examples in so that we can use their scatterplots to show how well/poorly the clusters are separated. We also give the visual outputs (ODIs) of VAT for comparison. These sets are generated by choosing 8, 4, 3, 2, 1 and 0 in the following settings: 2000 points are generated in three groups from multivariate normal distribution having mean vectors

and

The probabilities for a point to fall into each of the three groups are 0.35, 0.4 and 0.25, respectively. The covariance matrices for all three groups are Note that and form an equilateral triangle of side length
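A data set of this kind can be generated along the following lines. This is only a sketch: the mixing probabilities and the number of points come from the description above, while the placement of the triangle, the `side` argument (a hypothetical stand-in for the separation between the means), and the identity covariance (`sigma = 1`) are assumptions.

```python
import math
import random

def triangle_means(side):
    """Vertices of an equilateral triangle with the given side length (assumed placement)."""
    return [(0.0, 0.0), (side, 0.0), (side / 2.0, side * math.sqrt(3.0) / 2.0)]

def sample_mixture(n=2000, side=8.0, probs=(0.35, 0.4, 0.25), sigma=1.0, seed=0):
    """Draw n points from a three-component normal mixture in the plane."""
    rng = random.Random(seed)
    means = triangle_means(side)
    points, labels = [], []
    for _ in range(n):
        g = rng.choices(range(3), weights=probs)[0]   # pick a group
        mx, my = means[g]
        # Independent coordinates with equal variance (assumed covariance).
        points.append((rng.gauss(mx, sigma), rng.gauss(my, sigma)))
        labels.append(g)
    return points, labels

points, labels = sample_mixture()
```

Shrinking `side` toward 0 moves the three means together, so the groups blend into a single cloud.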

The pictures for (

Now we know what we should look for: peaks followed by valleys, or high-low patterns, on the r and d-curves. Later on we will show that even the r-curve is not good enough and only the d-curve will do the job.

The case is given in

Note that the m-curve goes up with wild oscillations so much in

We use two thresholds to detect high-lows. When the d-curve hits a ceiling, set as 0.04, and then a floor, set as 0, the program reports one new cluster. These ceiling and floor values are satisfied by all cases in our numerical experiments, even those not reported here, where the clusters are reasonably, sometimes only barely, separated.
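The two-threshold rule can be sketched as a small state machine. The ceiling 0.04 and the floor 0 are the values stated above; the function name and the sample curve are hypothetical, and treating "hits" as "reaches or crosses" is an assumption.

```python
def count_high_lows(d, ceiling=0.04, floor=0.0):
    """Count patterns in which the curve reaches the ceiling and then the floor."""
    count = 0
    armed = False  # True once the curve has reached the ceiling
    for value in d:
        if not armed and value >= ceiling:
            armed = True
        elif armed and value <= floor:
            count += 1      # one complete high-low: one new cluster reported
            armed = False
    return count

# Hypothetical d-curve values: two full high-lows and one peak that is too small.
sample_d = [0.0, 0.01, 0.05, 0.02, -0.01, 0.0, 0.03, -0.02, 0.06, 0.01, -0.03]
high_lows = count_high_lows(sample_d)
```

The small peak of 0.03 never reaches the ceiling and is ignored, which is the behavior wanted for minor fluctuations of the curve.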

If we lowered the ceiling and raised the floor, we would be able to catch some of the blended clusters we know we have missed, but it would also increase the chance of “catching” false clusters. We are not saying these values are the best. Any values are arguable, as arguable as the number of clusters is when the clusters are blended. We do not like the idea of tuning parameters to particular examples, and will stick to the same ceiling and floor values throughout this paper. In fact, we will stick to the same set of values for all parameters in our program, that is, the values for the ceiling and floor set here, and those for m, M and w given in (3).

The situation in the case shown in

however, picks up cluster structure from the ODM. It has several high-lows, with two of them large enough to hit both the ceiling and floor, whose peaks are near 600 and 1000 marks on the horizontal axis, respectively. This example shows that our tendency curves are more sensitive than the raw block structure in the 2D display ODI. The largest advantage of the tendency curves is probably the quantization of gray level patterns which enables the computer, not only human eyes, to catch possible patterns.

One may question how many clusters this data set truly has, but it then depends on what one means by “truly”. This may be subjective. We see three clusters in

only saying that our program can be sensitive enough to “see” three clusters in this case.

When goes down to zero, the cluster structure disappears. The scatterplots for (

We now show that, without modifying any parameter values, our VATdt algorithm works on small data sets, too. The data sets in this group of examples are similar to those in Figures 6-10 of Bezdek and Hathaway [

are similar to those in and so is the way they deteriorate as the value of decreases. The d-curves for 4 and 3 are given in

Remark: We remind the reader that the positions of the peaks and valleys do not reflect the sizes of the clusters closely unless the clusters are very well separated.

Does our method always say what it should say? Well, there is not, and there will never be, an infallible method to determine the number of clusters. In many cases, there are no right or wrong answers; it all depends on what one means by “should”. The data set used in

We now give two examples where the points are regularly arranged, on a rectangular grid, and along a pair of concentric circles, respectively. These are similar to the data sets in Figures 12 and 13 of Bezdek and Hathaway [

part (b) has a periodic nature, but no blocks. Bezdek and Hathaway [

Remark: Our program works on the original example in [

In

It is almost a sacred ritual that everybody tries the Iris data in a paper on clustering, so we also tried our program on it. It is well-known that the data consist of values of four features of each of 150 irises (150 points in a four-dimensional feature space). These irises are of three different physical types, 50 from each type, thus the data have three physically labeled classes. But two of the three flower types yield data points that largely overlap in this particular feature space, so many argue that the unlabeled data are naturally clustered into two geometrically well-defined clusters; see [

The computer caught the large high-low on the left and ignored the small one on the right, and reported the existence of two clusters. Once again, one may argue about the correctness of the program ignoring the smaller high-low (and thus about the choice of the ceiling value), just as one can argue about the “correct” number of clusters in the Iris data.

We conclude this section with some comments on the choice of the parameter values of the program. Since enough has been said about the floor and ceiling values, here we only discuss the values of m, M and w. We ended up with the values in (3) from experiments. First, we want as small a value for m as possible so that relatively small clusters will not get lost in the averaging process. But if it gets too small, the m-curve gets noisier and noisier and eventually falls back to the r-curve. We also want it to be a percentage of n so that we do not have to change it to suit data sets of different sizes. Five percent is the smallest we dare go (n often gets below 100, and then we are only looking at the average of a few rows), and it works very well. The performance of the program is not sensitive at all to changes in M. As long as it is several times larger than m, we did not see much difference. The value of w makes a difference only occasionally, and, when it does, only marginally. We tried values from to 5n, and all of them worked fine. All in all, the value worked best, but the difference was insignificant. Thus we decided that a single set of parameter values could successfully be used for all cases, which is a rare situation for clustering procedures involving user-selected parameter values.

One scenario in which we foresee the need to change parameters is when the ratios of the cluster sizes in a data set are so large that (relatively) small clusters get lost in the averaging, causing the d-curve valleys to be too shallow to hit the floor. One will then need to decrease the values of m and w, which may help form larger high-low patterns on the d-curve. We would feel comfortable adjusting the values of the ceiling and floor if there are “clean” high-low patterns on the d-curve, that is, if there are not many zigzags when the curve goes up and down. When changing parameter values, we recommend that the user look at the r-curve, too. One should feel more confident if the r-curve does not show too much noise.

Our VATdt algorithm is meant to replace the straightforward visual display part of the VAT algorithms mentioned in the second paragraph of §1. For that matter, it can start from an ordered dissimilarity matrix produced by any algorithm of that kind. Instead of displaying the matrix as a 2-dimensional gray-level image ODI for human interpretation, VATdt analyzes the matrix by taking averages of various kinds along its diagonal and produces the tendency curves, the most useful of which is the d-curve. This changes 2D data (a matrix) into a 1D array, which is certainly easier for both human eyes and the computer, since the concentration is now on only one variable, the height.

Possible cluster structure is reflected as high-low patterns on the d-curve with a relatively uniform range that enables the computer to catch them with thresholds. The values of thresholds may be arguable, but no more so than the “right” number of clusters that exist in a given data set. For example, some see only one single cluster in

We are truly encouraged by the two examples in Figures 14 and 15, where the ODI images do not have blocks but the d-curve still did the job nicely. An ODI shows blocks only if the data set contains (elliptical) disk-shaped clusters in a 2-dimensional feature space, or ellipsoid- or ball-shaped clusters in feature spaces of higher dimensions. Clusters of other shapes show different patterns in the ODI, whose meaning one can only guess. Our d-curve, however, clearly shows the cluster structures in both cases, where chain-shaped clusters exist.

We plan to further investigate and improve the VATdt algorithm, experimenting with it on clusters of different shapes, even with mixed shapes in the same data set. It also interests us to use different metrics, even dissimilarities that are not metrics. We are mainly interested in cases where structures in the ODM exist but the ODI does not show them clearly, at least not in black blocks.

The author would like to thank Professor Richard J. Hathaway for his numerous helpful thoughts and suggestions, including the final title of this paper.