Clustering on categorical variables has received intensive attention. In a dataset with categorical features, some features are markedly more effective than others for clustering. In this paper, we propose a simple method that finds such distinctive features by comparing the pooled within-cluster mean relative difference, partitions the data upon these features, and gives the subspace of each subgroup. Applications to the zoo data and the soybean data illustrate the performance of the proposed method.

Data clustering is a technique for identifying groups such that data instances in the same group are more similar to each other than to instances belonging to different groups. The issue of clustering databases with categorical variables has received intensive attention ( [

There are various algorithms available for clustering categorical data, but no single algorithm achieves the best result on all data sets. Some new techniques have been developed recently, for example CACTUS (CAtegorical ClusTering Using Summaries, see [

This paper is devoted to finding the distinctive attributes of a categorical dataset using the pooled within-cluster mean relative difference; the data are then clustered upon a single distinctive attribute. At each iteration, our algorithm recognizes one distinctive attribute and identifies only the one cluster with the minimum within-cluster mean relative difference, which is then deleted from the dataset before the next iteration; this procedure repeats until there are no more significant clusters in the remaining data.

The rest of the paper is organized as follows. A motivating example is given in Section 2, and the methodology is discussed in Section 3. The performance of the proposed algorithm is explored on real datasets in Section 4. Section 5 gives our conclusions.

Consider the soybean disease dataset from the Machine Learning Repository at the University of California, Irvine. It comprises 47 objects with 35 categorical attributes $\{v_k\}_{k=1}^{35}$. (Fourteen attributes take the same observed value for all objects, so they can be treated as non-informative and suppressed; only 21 attributes are considered hereafter.) As [

Name of group | Size of group | Subspace | Distinctive attributes |
---|---|---|---|
Diaporthe stem canker | 10 | | |
Charcoal rot | 10 | | |
Rhizoctonia root rot | 10 | | |
Phytophthora rot | 17 | | |

Each subgroup also has some attributes whose values differ from those in the other subgroups; these are called the distinctive attributes. For example, $v_{18}$ equals 2 for all objects in subgroup 2, whereas it takes other values in the other subgroups. When the dataset is partitioned upon $v_{18}$, the original subgroup 2 can be separated effectively from the rest of the data.

Also from

Suppose that we have the data set $X = (X_{ij})$, where the element $X_{ij}$ ($i = 1, \cdots, n$ and $j = 1, \cdots, p$) denotes the $j$-th attribute of the $i$-th object. Notice that each categorical attribute $v_k$ has a finite number of category levels $N(v_k)$.

While the Euclidean-based measure could yield satisfactory results for numeric attributes, it is not appropriate for data sets with categorical attributes. Therefore, some alternative measurements must be explored.

Hamming distance, named after Richard Hamming, is widely used to measure the difference between two equal-length categorical vectors. The Hamming distance between the objects $x_i$ and $x_j$ is defined as:

Feature | Feature | ||||||
---|---|---|---|---|---|---|---|

7 | 2.665 | 3 | 1.45 | ||||

2 | 0.727 | ||||||

2 | 0.994 | ||||||

2 | 0.805 | 4 | 1.052 | ||||

3 | 0.974 | ||||||

3 | 1.02 | 4 | 1.967 | ||||

2 | 0.988 | ||||||

4 | 1.985 | 2 | 0.732 | ||||

2 | 0.896 | ||||||

4 | 1.127 | 2 | 0.773 | ||||

2 | 0.675 | ||||||

2 | 0.909 | 2 | 0.790 | ||||

2 | 0.977 | 2 | 0.803 |

$$ d_{ij} = \sum_{k=1}^{p} \mathbf{1}(x_{ik} \neq x_{jk}) \tag{3.1} $$

i.e., the Hamming distance counts the number of attributes on which the two objects differ.
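As a small illustration of Equation (3.1), the following Python sketch counts the attributes on which two objects differ. The function name `hamming_distance` and the two example objects are hypothetical and are not taken from the paper's data.

```python
from typing import Sequence


def hamming_distance(x: Sequence, y: Sequence) -> int:
    """Number of attributes on which two equal-length categorical vectors differ."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(1 for a, b in zip(x, y) if a != b)


# Two hypothetical objects described by p = 5 categorical attributes.
x_i = ["a", "b", "c", "a", "d"]
x_j = ["a", "c", "c", "b", "d"]

# The objects differ on the 2nd and 4th attributes only.
print(hamming_distance(x_i, x_j))  # 2
```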

Our proposed method is based on the pooled within-cluster mean difference of the clusters. Intuitively, when a $p$-dimensional dataset is divided into subgroups $C_1, C_2, \cdots, C_r$ according to an attribute $v_k$, this attribute takes the same value within each subgroup and so carries no information there; the dimension $d_r$ of a cluster therefore becomes smaller and smaller as the partitioning proceeds. To measure dispersion in a way that accounts for this shrinking dimension, a relative version of dispersion must be adopted.

Provided that we have partitioned the data into $N(v_k)$ clusters $C_1, C_2, \cdots, C_{N(v_k)}$ upon attribute $v_k$, denote by $n_r$ the number of objects in $C_r$ and by $d_r$ the corresponding dimension (after eliminating the attributes that are identical within the cluster). Let

$$ \bar{W}_r = \frac{1}{d_r} \frac{1}{n_r (n_r - 1)} \sum_{x_i, x_j \in C_r} d_{ij} \tag{3.2} $$

be the within-cluster mean relative difference (WCMRD) in cluster $C_r$, and

$$ \bar{W}(v_k) = \sum_{r=1}^{N(v_k)} \bar{W}_r \tag{3.3} $$

be the pooled within-cluster mean relative difference (PWCMRD).
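A minimal Python sketch of Equations (3.2) and (3.3) is given below. The function names `wcmrd`, `partition` and `pwcmrd` and the toy data are hypothetical. Two conventions are assumptions of this sketch rather than statements from the paper: the sum in (3.2) is read as running over ordered pairs of distinct objects, and a cluster with a single object or with no varying attribute is given a WCMRD of 0.

```python
from collections import defaultdict
from itertools import combinations


def wcmrd(cluster):
    """Within-cluster mean relative difference of one cluster, Equation (3.2)."""
    n_r = len(cluster)
    if n_r < 2:
        return 0.0  # assumption: a single object has no dispersion
    p = len(cluster[0])
    # d_r: number of attributes that are not identical within the cluster
    d_r = sum(1 for k in range(p) if len({row[k] for row in cluster}) > 1)
    if d_r == 0:
        return 0.0  # assumption: identical objects have no dispersion
    # Sum of Hamming distances over unordered pairs; doubled for ordered pairs.
    pair_sum = sum(
        sum(a != b for a, b in zip(x, y)) for x, y in combinations(cluster, 2)
    )
    return 2 * pair_sum / (d_r * n_r * (n_r - 1))


def partition(data, k):
    """Group objects by their level on attribute k (0-based index)."""
    groups = defaultdict(list)
    for row in data:
        groups[row[k]].append(row)
    return groups


def pwcmrd(data, k):
    """Pooled WCMRD over the clusters induced by attribute k, Equation (3.3)."""
    return sum(wcmrd(c) for c in partition(data, k).values())


# Hypothetical toy data: 5 objects, 3 categorical attributes.
toy = [
    ("a", "x", "p"),
    ("a", "x", "q"),
    ("a", "y", "q"),
    ("b", "y", "q"),
    ("b", "y", "q"),
]
print(pwcmrd(toy, 0))
```

In the toy data, partitioning on the first attribute gives one cluster of three objects with two varying attributes and one cluster of two identical objects, so only the first cluster contributes to the pooled value.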

The idea of our method is to select the distinctive attributes sequentially: at each iteration, the distinctive attribute is the one that yields the minimum pooled within-cluster mean relative difference compared with the other attributes, i.e.,

$$ v_m = \arg\min_{v_k} \bar{W}(v_k). \tag{3.4} $$

The dataset is then partitioned according to the finite set of levels of the selected attribute, and the subspace of each subgroup is given at each iteration.

Step 1 Initially, the data set $D$ is clustered according to the levels of each $v_k$ ($k = 1, 2, \cdots, p$), i.e., the objects are partitioned into $N(v_k)$ clusters such that the objects in each cluster have the same level on $v_k$;

Step 2 Find a distinctive attribute $v_g$ that satisfies

$$ v_g = \arg\min_{k = 1, 2, \cdots, p} \bar{W}(v_k), \tag{3.5} $$

where $\bar{W}(v_k)$ is the pooled within-cluster mean relative difference of the clusters partitioned upon $v_k$.

Step 3 Partition the dataset based on $v_g$, and calculate the corresponding within-cluster mean relative difference $\bar{W}_r$ for each cluster $C_r$ ($r = 1, 2, \cdots, N(v_g)$).

Step 4 While $\bar{W}_r > W_T$ (where $W_T$ is a predefined threshold used to stop the procedure),

Update the data set $D$ by $C_r$,

Repeat Step 1 and Step 2 until all $\bar{W}_r \le W_T$.

End.
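The steps above can be sketched in Python as follows. This is one possible reading of Steps 1 to 4, not the authors' implementation. All function names are hypothetical. Skipping attributes that are constant in the current data set, treating single-object or identical-object clusters as having zero WCMRD, and breaking ties by the lowest attribute index are assumptions made so that the sketch is well defined and terminates.

```python
from collections import defaultdict
from itertools import combinations


def wcmrd(cluster):
    """Within-cluster mean relative difference, Equation (3.2)."""
    n_r = len(cluster)
    if n_r < 2:
        return 0.0  # assumption: a single object has no dispersion
    p = len(cluster[0])
    d_r = sum(1 for k in range(p) if len({row[k] for row in cluster}) > 1)
    if d_r == 0:
        return 0.0  # assumption: identical objects have no dispersion
    pair_sum = sum(
        sum(a != b for a, b in zip(x, y)) for x, y in combinations(cluster, 2)
    )
    return 2 * pair_sum / (d_r * n_r * (n_r - 1))


def partition(data, k):
    """Step 1: group objects by their level on attribute k."""
    groups = defaultdict(list)
    for row in data:
        groups[row[k]].append(row)
    return groups


def pwcmrd(data, k):
    """Pooled WCMRD of the partition induced by attribute k, Equation (3.3)."""
    return sum(wcmrd(c) for c in partition(data, k).values())


def distinctive_attribute(data):
    """Step 2: index of the attribute with minimum PWCMRD, Equation (3.5)."""
    p = len(data[0])
    # Assumption: attributes that are constant in `data` cannot split it.
    candidates = [k for k in range(p) if len({row[k] for row in data}) > 1]
    if not candidates:
        return None
    return min(candidates, key=lambda k: pwcmrd(data, k))


def cluster_by_distinctive_attributes(data, w_t=0.35):
    """Steps 3 and 4: split until every cluster has WCMRD <= w_t.

    Returns a list of (path, cluster) pairs, where path records the
    (attribute index, level) choices that lead to the cluster.
    """
    finished = []
    pending = [([], list(data))]
    while pending:
        path, d = pending.pop()
        g = distinctive_attribute(d)
        if g is None:
            finished.append((path, d))
            continue
        for level, c in partition(d, g).items():
            step = path + [(g, level)]
            if wcmrd(c) > w_t:
                pending.append((step, c))  # Step 4: D is updated by C_r
            else:
                finished.append((step, c))
    return finished


# Hypothetical toy data with two well-separated groups.
toy = [("a", "x", "p")] * 3 + [("b", "y", "q")] * 3
for path, members in cluster_by_distinctive_attributes(toy):
    print(path, len(members))
```

Each split is made on an attribute with at least two levels in the current data, so every sub-cluster is strictly smaller than its parent and the loop terminates.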

The stopping threshold $W_T$ can be chosen arbitrarily; different values of $W_T$ result in different hierarchical clusterings. In this paper, the threshold is set to 0.35, meaning that a difference in 35% of the attributes within a cluster is accepted.

In the section, a simulated sample is deduced as reference [

$$ CR(K) = \sum_{i=1}^{K} \frac{n_i}{n} $$

where $n_i$ is the number of data points correctly assigned to the $i$-th cluster by an algorithm and $n$ is the total number of data points.
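As a worked illustration of the accuracy measure, the sketch below evaluates $CR(K)$ from a confusion table. The counts are those of the soybean confusion table reported in this paper; reading the diagonal entries as the correctly assigned objects is an assumption of the sketch, and the function name `clustering_accuracy` is hypothetical.

```python
def clustering_accuracy(correct_counts, n):
    """CR(K): sum over clusters of n_i / n."""
    return sum(correct_counts) / n


# Rows: true groups (diap., char., rhiz., phyt.); columns: clusters found.
confusion = [
    [10, 0, 0, 0],
    [0, 10, 0, 0],
    [0, 0, 9, 1],
    [0, 0, 0, 17],
]

n = sum(sum(row) for row in confusion)  # total number of objects
# Assumption: the diagonal holds the correctly assigned objects.
correct = [confusion[i][i] for i in range(len(confusion))]

cr = clustering_accuracy(correct, n)
print(round(cr, 4))  # 46 of 47 objects correct -> 0.9787
```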

For the simulated sample, [

The data set is taken from the UCI Machine Learning Repository (archive.ics.uci.edu/ml/). It contains 47 objects, each with 35 categorical attributes. Some attributes take exactly the same value on all objects; after eliminating these redundant attributes, only 21 attributes are left in the data set.
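The elimination of redundant attributes described above can be sketched as follows. The function name `drop_constant_attributes` and the toy rows are hypothetical; the sketch only assumes that the data are held as a list of equal-length tuples.

```python
def drop_constant_attributes(data):
    """Remove attributes that take the same value on every object.

    Returns the reduced data and the 0-based indices of the kept attributes.
    """
    p = len(data[0])
    keep = [k for k in range(p) if len({row[k] for row in data}) > 1]
    reduced = [tuple(row[k] for k in keep) for row in data]
    return reduced, keep


# Hypothetical toy data: the third attribute is constant.
rows = [("a", "x", "k"), ("b", "x", "k"), ("a", "y", "k")]
reduced, keep = drop_constant_attributes(rows)
print(keep)     # [0, 1]
print(reduced)  # [('a', 'x'), ('b', 'x'), ('a', 'y')]
```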

The Zoo data set is available from the UCI Machine Learning Repository (archive.ics.uci.edu/ml/). It contains 101 objects, each with 16 categorical attributes. Some objects possess exactly the same values on all attributes, so

True | ||||
---|---|---|---|---|

diap. | 10 | 0 | 0 | 0 |

char. | 0 | 10 | 0 | 0 |

rhiz. | 0 | 0 | 9 | 1 |

phyt. | 0 | 0 | 0 | 17 |

Iterate | Distin. Var. | Subspace | ||
---|---|---|---|---|

1 | 10 | 0.2511 | ||

2 | 10 | 0.2632 | ||

3 | 9 | 0.2546 | ||

4 | 18 | 0.3638 |

they can be treated as identical; after eliminating the redundant objects, only 59 objects are left in the data set.
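Removing objects that agree on all attributes can be sketched in a few lines. The function name `drop_duplicate_objects` and the toy rows are hypothetical; keeping the first occurrence of each object is a choice of this sketch.

```python
def drop_duplicate_objects(data):
    """Keep the first occurrence of each distinct object, preserving order."""
    # dict preserves insertion order, so duplicates are dropped in place.
    return list(dict.fromkeys(tuple(row) for row in data))


# Hypothetical toy data: the first and third objects are identical.
rows = [("a", "x"), ("b", "y"), ("a", "x")]
print(drop_duplicate_objects(rows))  # [('a', 'x'), ('b', 'y')]
```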

Zhang et al. ( [

Categorical variables are widely explored in different fields to give a native

Clusters found | |||||||||
---|---|---|---|---|---|---|---|---|---|

Group 1 | 19 | 19 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

Group 2 | 12 | 0 | 12 | 0 | 0 | 0 | 0 | 0 | 0 |

Group 3 | 5 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 |

Group 4 | 5 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 |

Group 5 | 4 | 0 | 0 | 1 | 0 | 3 | 0 | 0 | 0 |

Group 6 | 6 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 1 |

Group 7 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | 1 |

Clusters found | |||||||
---|---|---|---|---|---|---|---|

Group 1 | 19 | 19 | 0 | 0 | 0 | 0 | 0 |

Group 2 | 12 | 0 | 12 | 0 | 0 | 0 | 0 |

Group 3 + 5 | 9 | 0 | 0 | 9 | 0 | 0 | 0 |

Group 4 | 5 | 0 | 0 | 0 | 5 | 0 | 0 |

Group 6 | 6 | 0 | 0 | 0 | 0 | 5 | 1 |

Group 7 | 8 | 0 | 0 | 0 | 0 | 0 | 8 |

Iterate | Distin. Var. | Subspace | ||
---|---|---|---|---|

1 | 12 | 0.1485 | ||

2 | 19 | 0.2498 | ||

3 | 5 | 0.1538 | ||

4 | 5 | 0.1667 | ||

5 | 7 | 0.3 | ||

6 | 2 | 0.1 | ||

7 | 3 | 0.133 | ||

8 | 6 | 0.333 |

clustering algorithm to deal with such data; a method based on the pooled within-cluster mean relative difference is proposed to select distinctive attributes, and the data are then clustered upon these attributes; the subspaces are also investigated.

Applications to the zoo data and the soybean data (from the UC Irvine Machine Learning Repository) illustrate the performance of the proposed method. The results show high accuracy and simplicity in practical applications.

Su, J.X. and Su, C.J. (2017) Clustering Categorical Data Based on Within-Cluster Relative Mean Difference. Open Journal of Statistics, 7, 173-181. https://doi.org/10.4236/ojs.2017.72013