Clustering Categorical Data Based on Within-Cluster Relative Mean Difference

Clustering of categorical variables has received intensive attention. In datasets with categorical features, some features are more informative for the clustering procedure than others. In this paper, we propose a simple method to find such distinctive features by comparing the pooled within-cluster mean relative difference, partition the data upon these features, and give the subspace of the resulting subgroups. Applications to the zoo data and the soybean data illustrate the performance of the proposed method.


Introduction
Data clustering is a technique for identifying groups such that data instances in the same group are more similar to each other than to instances belonging to different groups. The issue of clustering databases with categorical variables has received intensive attention ([1]-[8]), along with other publications on the same topic. Categorical data arise in many areas of research, in both the social and the natural sciences; this type of variable presents no natural ordering of its possible values, which makes the clustering process difficult.
There are various algorithms available for clustering categorical data, but no single algorithm achieves the best result on all data sets. Some new techniques have been developed recently, for example CACTUS (CAtegorical ClusTering Using Summaries, see [9]) and ROCK (RObust Clustering using linKs, see [10]); neural-network-based approaches (e.g., the self-organizing map) have also been used for this purpose. Arias-Castro and Xiao ([11]) proposed a sparse version of a clustering method. Zhang et al. ([12]) proposed a novel statistical procedure (HD vector) for clustering categorical data based on the frequency distribution of the Hamming distance vector, compared against a uniformly distributed sample. Unfortunately, their categorical sample space remains computationally complex, since a comparison over the total number of possible positions in the categorical sample space is needed. This paper is devoted to finding the distinctive attributes in a categorical dataset using the pooled within-cluster mean relative difference; the data are then clustered upon a single distinctive attribute. At each iteration, our algorithm recognizes one distinctive attribute and identifies one cluster with the minimum within-cluster mean relative difference, which is then removed from the dataset before the next iteration; this procedure repeats until there are no more significant clusters in the remaining data.
The rest of the paper is organized as follows: a motivating example is given in Section 2, and the methodology is discussed in Section 3. The performance of the proposed algorithm is explored on real datasets in Section 4. Section 5 gives our conclusions.

Motivation
Consider the soybean disease dataset from the Machine Learning Repository at the University of California at Irvine. It comprises 47 objects with 35 categorical attributes {v_k}, k = 1, ..., 35. (Fourteen attributes take the same observed value for all objects, so they can be treated as non-informative and suppressed; thereafter only 21 attributes are considered.) As [12] pointed out, these data points come from four clusters: diaporthe stem canker, charcoal rot, rhizoctonia root rot, and phytophthora rot, with sample sizes 10, 10, 10, and 17, respectively. The summary in Table 1 shows that some subgroups possess attributes with an identical value, which can be used to define the subspace; each subgroup also has some attributes whose values differ from those of the other subgroups, called the distinctive attributes. For example, v_18 equals 2 for all objects in subgroup 2 but takes different values in the other subgroups. When this dataset is partitioned upon v_18, the original subgroup 2 can be separated effectively from the database.
From Table 2 we can also see that partitioning the data on different attributes yields subgroups with different numbers of clusters and different within-cluster mean relative differences (defined in Section 3), and therefore different pooled within-cluster mean relative differences W. The attribute that gives the minimum W among all attributes can be considered a distinctive attribute.
Therefore, in this paper we develop a simple clustering method that partitions the data along such distinctive attributes; that is, the clustering procedure for a categorical dataset depends entirely on these attributes.

Methodology
Suppose that we have the data set D = (X_ij), where the element X_ij (i = 1, ..., n and j = 1, ..., p) denotes the j-th attribute of the i-th object.
Notice that each categorical attribute v_k has a finite number of category levels.

Useful Measurements
While Euclidean-based measures can yield satisfactory results for numeric attributes, they are not appropriate for data sets with categorical attributes. Therefore, some alternative measurements must be explored.
The Hamming distance, named after Richard Hamming, is widely used to measure the difference between two equal-length categorical vectors. The Hamming distance between objects x_i and x_j is defined as

d_H(x_i, x_j) = sum_{k=1}^{p} I(x_ik != x_jk),

i.e., the Hamming distance counts the number of attributes at which the corresponding objects differ.
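As a concrete illustration, the count of mismatched positions can be computed directly (a minimal sketch; the function name is ours, not from the paper):

```python
def hamming_distance(x, y):
    """Count the attribute positions at which two equal-length
    categorical vectors differ."""
    if len(x) != len(y):
        raise ValueError("vectors must have equal length")
    return sum(a != b for a, b in zip(x, y))
```

For example, hamming_distance(["a", "b", "c"], ["a", "x", "c"]) gives 1, since the vectors differ only in the second attribute.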
Our proposed method is based on the pooled within-cluster mean difference of the clusters. Intuitively, when a p-dimensional dataset is divided into subgroups C_1, C_2, ..., C_r according to the attribute v_r, this attribute takes a single value within each subgroup and thus carries no information there; therefore the effective dimension d_r of each cluster becomes smaller and smaller.
To quantify dispersion in a way that accounts for this phenomenon, a relative version of dispersion must be adopted.
Provided that we have partitioned the data into clusters C_1, C_2, ..., C_r, let d_k denote the corresponding dimension of cluster C_k (after eliminating the identical attributes). Let

W_k = (1 / d_k) * (2 / (n_k (n_k - 1))) * sum_{x_i, x_j in C_k, i < j} d_H(x_i, x_j)

be the within-cluster mean relative difference (WCMRD) in cluster C_k, where n_k = |C_k|, and let

W = sum_{k=1}^{r} (n_k / n) W_k

be the pooled within-cluster mean relative difference (PWCMRD).
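Under one plausible reading of this definition (the WCMRD of a cluster is the mean pairwise Hamming distance divided by the cluster's effective dimension, i.e., the number of non-constant attributes), the quantity can be sketched as follows; the function name and internal details are our assumptions:

```python
from itertools import combinations

def wcmrd(cluster):
    """Within-cluster mean relative difference: the mean pairwise
    Hamming distance divided by the cluster's effective dimension.
    Attributes that are constant across the cluster carry no
    information and are excluded from the dimension count."""
    if len(cluster) < 2:
        return 0.0
    p = len(cluster[0])
    # Effective dimension d_k: attributes taking more than one value.
    d = sum(1 for j in range(p) if len({x[j] for x in cluster}) > 1)
    if d == 0:
        return 0.0  # all objects in the cluster are identical
    pairs = list(combinations(cluster, 2))
    mismatches = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return mismatches / (len(pairs) * d)
```

A cluster of identical objects has WCMRD 0, while a two-object cluster differing in every non-constant attribute has WCMRD 1, so the measure lies in [0, 1] and can be read as a fraction of differing attributes.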
The idea of our method is to select the distinctive attributes sequentially: at each step we choose the attribute that yields the minimum pooled within-cluster mean relative difference among all attributes, i.e., the attribute v_g with W(v_g) = min_k W(v_k); thereafter, we partition the dataset upon the finite category levels of the selected attribute and give the subspace of each subgroup at each iteration.

Clustering Procedure
Step 1: Initially, the data set D is clustered according to the category levels of each attribute v_k; i.e., for each k, the objects are partitioned so that the objects within a cluster share the same level of v_k.
Step 2: Find a distinctive attribute v_g satisfying W(v_g) = min_k W(v_k), where W(v_k) is the pooled within-cluster mean relative difference of the clusters partitioned upon v_k.
Step 3: Partition the dataset based on v_g, and calculate the corresponding within-cluster mean relative difference W_r for each cluster C_r.
Step 4: While W_r > W_T (where W_T is the threshold predefined to stop the procedure), update the data set D to C_r and return to Step 2.
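The attribute-selection core of the procedure can be sketched as follows. This is a hedged sketch under the same reading of WCMRD as above; the size-weighted pooling and all function names are our assumptions, not the paper's notation:

```python
from collections import defaultdict
from itertools import combinations

def wcmrd(cluster):
    # Mean pairwise Hamming distance relative to the cluster's
    # effective dimension (non-constant attributes only).
    if len(cluster) < 2:
        return 0.0
    p = len(cluster[0])
    d = sum(1 for j in range(p) if len({x[j] for x in cluster}) > 1)
    if d == 0:
        return 0.0
    pairs = list(combinations(cluster, 2))
    mism = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return mism / (len(pairs) * d)

def pooled_wcmrd(data, attr):
    # Partition the data on the levels of one attribute, then average
    # the per-cluster WCMRDs weighted by cluster size (assumed weighting).
    groups = defaultdict(list)
    for x in data:
        groups[x[attr]].append(x)
    n = len(data)
    return sum(len(c) / n * wcmrd(c) for c in groups.values())

def distinctive_attribute(data):
    # Step 2: the attribute minimizing the pooled WCMRD.
    p = len(data[0])
    return min(range(p), key=lambda j: pooled_wcmrd(data, j))
```

Here a dataset is a list of equal-length tuples of category labels; partitioning on the selected attribute (Step 3) is the same `defaultdict` grouping used inside `pooled_wcmrd`.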

The Stop Threshold W_T
The stop threshold W_T can be chosen arbitrarily; in fact, different values of W_T result in different hierarchical clusterings. In this paper, the threshold is taken to be 0.35, meaning that a difference of 35% of the attributes within a cluster is accepted.

Numerical Experiments
In this section, a simulated sample is generated as in reference [12]. The criterion of classification rate (CR) is adopted to assess the accuracy of the assignment; the classification rate measures the accuracy with which an algorithm assigns data points to the correct clusters. With K given clusters,

CR = (1/n) * sum_{i=1}^{K} n_i,

where n_i is the number of data points correctly assigned by the algorithm in cluster i, and n is the total number of data points.
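The CR computation is trivial to express (a minimal sketch; the argument names are ours):

```python
def classification_rate(correct_per_cluster, n_total):
    """CR = (sum of correctly assigned points over the K clusters) / n."""
    return sum(correct_per_cluster) / n_total
```

For instance, if 46 of the 47 soybean objects were assigned correctly with hypothetical per-cluster correct counts of 10, 10, 10, and 16, then classification_rate([10, 10, 10, 16], 47) is 46/47, or about 0.98.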
For the simulated sample, [12] obtains a mean CR of 94.62% with standard deviation 3.14%; we obtain a mean CR of 96.02% with standard deviation 2.57%, a little better than their method.

Soybean (Small) Data
The data set is derived from the UCI Machine Learning Repository (archive.ics.uci.edu/ml/); it contains 47 objects, each with 35 categorical attributes.
Some attributes take exactly the same value for all objects, so after eliminating these redundant attributes, only 21 attributes remain in the data set. Table 3(a) gives the clustering results of the proposed method; it shows that only one object is clustered incorrectly. This partition differs from Table 1 in that we identified v_11 (with PWCMRD 0.6548) instead of v_14 (with PWCMRD 0.6584). Table 3(b) describes the details of the proposed method on the soybean data: all objects except one are assigned correctly. The accuracy of the proposed method is CR = 0.98.
Figure 1 gives the details of the partition of the soybean data. At each step, a distinctive attribute is identified and the data are separated along this attribute; the subgroup with the largest W is then chosen as the target to be separated in the next step.

Zoo Data
The zoo data set is available from the UCI Machine Learning Repository (archive.ics.uci.edu/ml/); it contains 101 objects, each with 16 categorical attributes. Some objects possess exactly the same values on all attributes and can be considered identical, so after eliminating the redundant objects, only 59 objects remain in the data set.
Table 4 gives the clustering results of the proposed method, where "group" denotes the true category of the objects. The performance is poor on groups 5, 6, and 7: one object in group 5 is clustered incorrectly into group 3, and groups 6 and 7 each have objects assigned to a new cluster. Since the objects in group 5 (frog, frog, newt, toad) and group 3 (pitviper, seasnake, slowworm, tortoise, tuatara) are more similar to each other than to the others (11 of the 16 attributes agree), the two groups can roughly be considered as one. After combining these two subgroups, our proposed method yields a precise clustering with only one incorrect assignment.
Table 5 describes the details of the proposed method on the zoo data.

Comparison with HD Vector Method
Zhang et al. ([12]) indicate that their method can achieve good results on both the zoo data and the soybean disease data; a comparison of the HD vector method with the K-modes and AutoClass algorithms shows the superiority of their method. The drawback of their method is a comparison over a number of candidates equal to the total number of possible positions in the categorical sample space. In our proposal, these possible positions are not needed, so the algorithm can be faster than theirs.

Conclusions
Categorical variables are widely explored in different fields to give a natural description of the data.


Table 1. Summary of the soybean data.

Table 2. The attribute-based clustering performance on soybean data.