Clustering Categorical Data Based on Within-Cluster Relative Mean Difference
1. Introduction
Data clustering is a technique for identifying groups of data instances such that instances within the same group are more similar to one another than to instances belonging to different groups. The problem of clustering databases with categorical variables has received intensive attention ([1] - [8]). Categorical data arise in many areas of research, in both the social and natural sciences; this type of variable has no natural ordering of its possible values, which makes the clustering process difficult.
There are various algorithms available for clustering categorical data, but no single algorithm achieves the best result on all data sets. Some new techniques have been developed recently, for example CACTUS (CAtegorical ClusTering Using Summaries, see [9]) and ROCK (RObust Clustering using linKs, see [10]), and neural-network-based approaches (e.g. the self-organizing map) have also been used for this purpose. Arias-Castro and Xiao ([11]) proposed a sparse version of a clustering method. Zhang et al. ([12]) proposed a novel statistical procedure (HD vector) for clustering categorical data based on frequency distributions of the Hamming distance vector, compared against a uniformly distributed sample. Unfortunately, their categorical sample space remains computationally complex, since a comparison over the total number of possible positions in the categorical sample space is needed.
This paper is devoted to finding the distinctive attributes of a categorical dataset using the pooled within-cluster mean relative difference; the data are then clustered upon a single distinctive attribute. At each iteration, our algorithm recognizes one distinctive attribute and identifies the one cluster with the minimum within-cluster mean relative difference, which is then deleted from the dataset before the next iteration; this procedure repeats until no more significant clusters remain in the data.
The rest of the paper is organized as follows: a motivating example is illustrated in Section 2, and the methodology is discussed in Section 3. The performance of the proposed algorithm is explored on real datasets in Section 4, Section 5 compares it with the HD vector method, and Section 6 gives our conclusions.
2. Motivation
Consider the soybean disease dataset from the Machine Learning Repository at the University of California at Irvine, which comprises 47 objects with 35 categorical attributes. (Fourteen attributes take an identical observed value for all objects, so they can be treated as noninformative attributes and be suppressed; thereafter only 21 attributes are considered.) As [12] pointed out, these data points come from four clusters: diaporthe stem canker, charcoal rot, rhizoctonia root rot, and phytophthora rot, with sample sizes 10, 10, 10, and 17 respectively. The summary in Table 1 shows that some subgroups possess attributes with an identical value for all their objects, which can be used to define the subgroup's subspace.
Table 1. Summary of the soybean data.
Each subgroup also has some attributes whose values differ from those of the other subgroups, which are called the distinctive attributes; for example, one attribute equals 2 throughout subgroup 2 while taking different values in the other subgroups. When this dataset is used to give a clustering partition upon that attribute, the original subgroup 2 can be separated effectively from the database.
Also from Table 2, we can see that when one partitions the data using different attributes, the resulting partitions have different numbers of clusters and different within-cluster mean relative differences (defined in Section 3), and therefore different pooled within-cluster mean relative differences. The attribute that gives the minimum pooled within-cluster mean relative difference among all attributes can be considered a distinctive attribute. Therefore, in this paper we devote ourselves to a simple clustering method that partitions the data along such distinctive attributes; that is, the clustering procedure for a categorical dataset fully depends on these attributes.
3. Methodology
Suppose that we have the data set $X = \{X_1, \ldots, X_n\}$ with $X_i = (x_{i1}, \ldots, x_{ip})$, where the element $x_{ij}$ denotes the $j$-th attribute of the $i$-th object. Notice that each categorical attribute $A_j$ ($j = 1, \ldots, p$) has a finite number of category levels $m_j$.
3.1. Useful Measurements
While the Euclidean-based measure could yield satisfactory results for numeric attributes, it is not appropriate for data sets with categorical attributes. Therefore, some alternative measurements must be explored.
Table 2. The attribute-based clustering performance on soybean data.
Hamming distance, named after Richard Hamming, is widely used to measure the difference between two equal-length categorical vectors. The Hamming distance between objects $X_i$ and $X_j$ is defined as
$$d(X_i, X_j) = \sum_{l=1}^{p} \delta(x_{il}, x_{jl}), \qquad \delta(a, b) = \begin{cases} 0, & a = b, \\ 1, & a \neq b, \end{cases} \qquad (3.1)$$
i.e., the Hamming distance measures the number of attributes at which the corresponding objects differ.
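As an illustration, the count in (3.1) can be computed as follows (a minimal sketch; the function name and the example tuples are ours, not the paper's):

```python
def hamming(x, y):
    """Number of attributes at which two equal-length categorical vectors differ."""
    assert len(x) == len(y)
    return sum(a != b for a, b in zip(x, y))

# Two objects with p = 5 categorical attributes; they differ at positions 1 and 3.
hamming(("d1", 2, 0, "a", 1), ("d1", 3, 0, "b", 1))  # -> 2
```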
Our proposed method is based on the pooled within-cluster mean relative difference of the clusters. Intuitively, when a $p$-dimensional dataset is divided into subgroups according to an attribute, that attribute takes the same value throughout each resulting subgroup, so it carries no information within those subgroups; therefore the effective dimension of each cluster becomes smaller and smaller. To capture dispersion in light of this phenomenon, a relative version of dispersion must be adopted.
Provided that we have partitioned the data into $K$ clusters $C_1, \ldots, C_K$ upon attribute $A_j$, denote by $n_k$ the number of objects in $C_k$ and by $p_k$ the corresponding dimension (after eliminating the identical attributes). Let
$$W_k = \frac{2}{n_k(n_k - 1)} \sum_{X_i, X_{i'} \in C_k,\; i < i'} \frac{d(X_i, X_{i'})}{p_k} \qquad (3.2)$$
be the within-cluster mean relative difference (WCMRD) in cluster $C_k$, and
$$W(A_j) = \sum_{k=1}^{K} \frac{n_k}{n} W_k \qquad (3.3)$$
be the pooled within-cluster mean relative difference (PWCMRD).
The idea of our method is to select the distinctive attributes sequentially, choosing at each step the attribute that results in the minimum pooled within-cluster mean relative difference compared with the other attributes, i.e.,
$$A^{*} = \arg\min_{A_j} W(A_j); \qquad (3.4)$$
thereafter, the dataset is partitioned upon the finite categories of the selected attribute, and the subspace of each subgroup is given at each iteration.
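Under our reading of the definitions above, WCMRD is the mean pairwise Hamming distance scaled by the cluster's remaining (non-constant) dimension, and PWCMRD is its sample-size-weighted pool. The attribute selection can then be sketched as follows (function names and the toy data are ours):

```python
from itertools import combinations

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def wcmrd(cluster):
    """WCMRD: mean pairwise Hamming distance divided by the effective
    dimension p_k (attributes constant within the cluster are eliminated)."""
    varying = [j for j in range(len(cluster[0]))
               if len({obj[j] for obj in cluster}) > 1]
    p_k = len(varying)
    if p_k == 0 or len(cluster) < 2:
        return 0.0
    reduced = [tuple(obj[j] for j in varying) for obj in cluster]
    pairs = list(combinations(reduced, 2))
    return sum(hamming(x, y) for x, y in pairs) / (len(pairs) * p_k)

def pwcmrd(data, j):
    """PWCMRD of the partition induced by attribute j."""
    clusters = {}
    for obj in data:
        clusters.setdefault(obj[j], []).append(obj)
    n = len(data)
    return sum(len(c) / n * wcmrd(c) for c in clusters.values())

def distinctive_attribute(data):
    """Attribute index achieving the minimum PWCMRD."""
    p = len(data[0])
    return min(range(p), key=lambda j: pwcmrd(data, j))

# Toy data: attribute 1 induces tight (or singleton) clusters, so it wins.
data = [(1, 'a'), (1, 'a'), (2, 'b'), (2, 'c')]
distinctive_attribute(data)  # -> 1
```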
3.2. Clustering Procedure
Step 1 Initially the data set $X$ is clustered according to the categories of each attribute $A_j$, i.e., the objects are partitioned into $m_j$ clusters such that the objects in each cluster have the same category on $A_j$;
Step 2 Find the distinctive attribute $A^{*}$ that satisfies
$$A^{*} = \arg\min_{A_j} W(A_j), \qquad (3.5)$$
where $W(A_j)$ is the pooled within-cluster mean relative difference of the clusters partitioned upon $A_j$.
Step 3 Partition the dataset based on $A^{*}$, and calculate the corresponding within-cluster mean relative difference $W_k$ for each cluster $C_k$.
Step 4 While $\min_k W_k < WT$ (where $WT$ is the threshold predefined to stop the procedure), remove the cluster $C_{k^{*}}$ attaining the minimum, i.e., update the data set $X$ by $X \setminus C_{k^{*}}$, and repeat Step 1 to Step 3 until all $W_k \geq WT$.
End.
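The four steps can be sketched end to end as follows. This is a minimal illustration under our reading of the procedure (the helper names, the tuple layout of the data, and the first-found tie-breaking are our assumptions, not the authors' implementation):

```python
from itertools import combinations

def wcmrd(cluster):
    """Within-cluster mean relative difference over non-constant attributes."""
    varying = [j for j in range(len(cluster[0]))
               if len({o[j] for o in cluster}) > 1]
    if not varying or len(cluster) < 2:
        return 0.0
    pairs = list(combinations(cluster, 2))
    total = sum(sum(x[j] != y[j] for j in varying) for x, y in pairs)
    return total / (len(pairs) * len(varying))

def partition(data, j):
    """Step 1: split the data by the categories of attribute j."""
    groups = {}
    for obj in data:
        groups.setdefault(obj[j], []).append(obj)
    return list(groups.values())

def cluster_categorical(data, wt=0.35):
    """Steps 1-4: repeatedly pick the attribute with minimum pooled WCMRD,
    peel off the tightest cluster while its WCMRD is below wt, and recurse
    on the remaining data."""
    found = []
    remaining = list(data)
    while len(remaining) > 1:
        p, n = len(remaining[0]), len(remaining)
        # Step 2: distinctive attribute = argmin of pooled WCMRD (3.5).
        def pooled(j):
            return sum(len(c) / n * wcmrd(c) for c in partition(remaining, j))
        j_star = min(range(p), key=pooled)
        # Step 3: partition on the distinctive attribute.
        clusters = partition(remaining, j_star)
        best = min(clusters, key=wcmrd)
        # Step 4: stop once even the tightest cluster is not significant.
        if wcmrd(best) >= wt:
            break
        found.append(best)
        remaining = [o for o in remaining if o not in best]
    if remaining:
        found.append(remaining)
    return found

# Toy data: two groups distinguished by attributes 0 and 3.
data = [(1, 'x', 'p', 0)] * 3 + [(2, 'x', 'p', 1)] * 2
cluster_categorical(data)  # recovers the two groups of sizes 3 and 2
```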
3.3. The Stop Threshold WT
The stop threshold $WT$ can be chosen arbitrarily; in fact, different values of $WT$ result in different hierarchical clusterings. In this paper the threshold is taken to be 0.35, meaning that a relative difference of up to 35% of the attributes within a cluster is accepted.
4. The Performance of the Proposed Method
4.1. Numerical Experiments
In this section, a simulated sample is generated as in reference [12]. The criterion of classification rate (CR) is adopted to assess the accuracy of the assignment; the classification rate measures the accuracy with which an algorithm assigns data points to their correct clusters. With $K$ given clusters,
$$CR = \frac{a}{n},$$
where $a$ is the number of data points that have been correctly assigned by the algorithm and $n$ is the total number of data points.
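Given cluster labels already matched to the true groups, CR reduces to a simple proportion; a sketch (names and example labels are ours):

```python
def classification_rate(true_labels, assigned_labels):
    """CR = a / n: the fraction of data points assigned to their correct cluster."""
    assert len(true_labels) == len(assigned_labels)
    a = sum(t == g for t, g in zip(true_labels, assigned_labels))
    return a / len(true_labels)

classification_rate([1, 1, 2, 2, 3], [1, 1, 2, 3, 3])  # 4 of 5 correct -> 0.8
```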
For the simulated sample, comparing the mean CR and standard deviation reported by [12] with those obtained by our method, our method performs a little better than theirs.
4.2. Soybeansmall Data
The data set is derived from the UCI Machine Learning Repository (archive.ics.uci.edu/ml/); it contains 47 objects, each with 35 categorical attributes. Some attributes take exactly the same value for all objects, so after eliminating these redundant attributes, only 21 attributes are left in the data set.
Table 3(a) gives the clustering results of the proposed method. It shows that only one object is clustered incorrectly. This partition differs from Table 1 in that the attribute identified here (with PWCMRD 0.6548) is not the one used there (with PWCMRD 0.6584). Table 3(b) describes the details of the proposed method on the Soybean data; all objects except one are assigned correctly. The accuracy of the proposed method is thus $CR = 46/47 \approx 0.979$.
Figure 1 gives the details of the partition of the Soybean data. In each step, a distinctive attribute is identified and the data are separated along this attribute; the subgroup with the largest within-cluster mean relative difference is chosen as the target to be separated in the next step.
Table 3. Cluster result of soybean data by the proposed method.
Figure 1. Performance on soybeansmall data.
4.3. Zoo Data
The Zoo data set is available from the UCI Machine Learning Repository (archive.ics.uci.edu/ml/); it contains 101 objects, each with 16 categorical attributes. Some objects possess exactly the same values on all attributes and can therefore be considered identical; after eliminating the redundant objects, only 59 objects are left in the data set.
Table 4 gives the clustering results of the proposed method, where “group” denotes the true category of the objects. It shows that the performance is poor on groups 5, 6, and 7: one object in group 5 is clustered incorrectly into group 3, and groups 6 and 7 each have one object assigned to a new cluster. Since the objects in group 5 (frog, frog, newt, toad) and group 3 (pitviper, seasnake, slowworm, tortoise, tuatara) are more similar to each other than to the other groups (11 of the 16 attributes are identical), the two groups can roughly be considered as one. After combining the two subgroups, our proposed method gives a precise clustering with only one incorrect assignment.
Table 5 describes the details of the proposed method on the zoo data.
5. Comparison with HD Vector Method
Zhang et al. ([12]) indicate that their method can achieve good results on both the zoo data and the Soybean disease data, and a comparison of the HD vector method with the K-modes and AutoClass algorithms shows the superiority of their method. The drawback of their method is that it requires comparisons whose number equals the total number of possible positions in the categorical sample space; in our proposed method these possible positions are not needed, so the algorithm can be faster than theirs.
Table 4. Cluster result of Zoo data by the proposed method.
Table 5. Cluster result of Zoo data by the proposed method.
6. Conclusions
Categorical variables are widely encountered in different fields, and clustering algorithms are needed to deal with this type of data. A method based on the pooled within-cluster mean relative difference is proposed to select distinctive attributes; the data are then clustered upon such distinctive attributes, and the corresponding subspaces are also investigated.
The applications to the zoo data and soybean data (from the UC Irvine Machine Learning Repository) illustrate the performance of the proposed method. The results show high accuracy and simplicity in practical applications.