
In this paper, the balanced assignment is studied for classifying a group of entities with multiple attributes into many subgroups without losing similarity. Similarity, or closeness, in clustering is often measured as a distance, and the Mahalanobis distance is considered one of the tools for measuring this closeness. The comparison between the distance criteria is shown by changing a specific assignment standard, and finally comparing them against the MTS method.

Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that entities in the same group are more similar, in some sense, to each other than to those in other groups. A typical assignment problem is a fundamental combinatorial optimization problem: it consists of finding, in a weighted bipartite graph, a matching in which the sum of the weights of the edges is as large as possible. On the other hand, a balanced assignment is a way to make the subgroups equal in the process of distributing entities to multiple subgroups. Allocating entities to subgroups in a well-balanced manner becomes more complicated when each entity has multiple attributes. The results of clustering show that characteristic differences exist within a subgroup, and that similarity exists between subgroups with respect to the attributes. In this study, the properties inherent in an entity are defined as attributes, and the properties that represent the group are defined as characteristics [

The classification of entities with multiple attributes differs from well-known solution methodologies such as the partitioning method or the hierarchical method, and efforts are being made to improve it because of the difficulty of the optimization process. A mathematical model, or an application methodology with constraints, in which a series of processes for finding an appropriate compromise among conflicting attributes is modeled as a multi-criteria function, is increasingly necessary. Among the methods for solving the multi-criteria function, the most commonly used approaches are the weighting method and the goal programming method. The weights or specific numerical goals should be established appropriately for each function using mathematical programming approaches in the optimization process. Barron and Schmidt [

In the clustering process, a distance between entities or attributes is applied in the Ward method and the partitioning method as a measure of clustering accuracy. Here, the distance is not a physical distance but the distance between the attributes of the entities. Usually, the Manhattan distance, the Euclidean distance, and the Mahalanobis distance are applied as distance selections. The Manhattan distance, also known as the taxicab distance, sums the axis-wise distances between two points. The Euclidean distance, on the other hand, is a method to obtain the shortest distance between two points in n-dimensional space; it generalizes the Pythagorean theorem to n dimensions. Finally, the Mahalanobis distance is calculated by considering the correlation of variables, as an index measuring the degree of diffusion of the variables. Since the Mahalanobis distance is very sensitive to standardized variables, this distance can increase significantly even when the standardized variable for the reference group differs only slightly. In this paper, the comparison between the distance criteria is shown by changing a specific assignment standard, and finally comparing them against the MTS method. This paper is a sequel to an earlier paper by Rhee [

The paper is organized as follows. We review related works in Section 2. A balanced assignment is considered with respect to the associated distances for an example in Section 3. In Section 4, the comparison between the suggested distance criteria is shown by changing a specific assignment standard, and the result of the MTS method is checked by comparing it against the given criteria. Finally, Section 5 gives concluding remarks.

In this section, the distances required in the balanced assignment process are introduced, since the effectiveness of clustering is often measured as a distance representing closeness. The choice of distance measure is crucial, as it has a strong influence on the clustering results. Usually, the Euclidean distance is considered to be the common distance measure in clustering. Depending on the type of data and the research questions, a correlation-based distance is often used as an alternative. The methods used for distance measurement include the Manhattan distance, the Euclidean distance, and the Mahalanobis distance. The MTS method is also presented to implement the balanced assignment using these distances. The MTS method is one of the well-known clustering methodologies, and it is considered very helpful for classifying large groups with multiple attributes into many subgroups.

The Manhattan distance is a distance metric between two points in an n-dimensional vector space. It is used extensively in a wide range of fields, from regression analysis to frequency distributions. It was introduced by Hermann Minkowski [

The properties of the Manhattan distance are as follows. First, there exist several paths between two points whose length is equal to the Manhattan distance. Secondly, a straight path with length equal to the Manhattan distance permits only two moves, vertical or horizontal, in one direction at a time. Finally, for a given point, the set of points at a given Manhattan distance lies on a square. The Manhattan distance is frequently applied in regression analysis, especially in linear regression, to find a straight line that fits a given set of points. In solving an underdetermined system of linear equations, the regularization term for the parameter vector is expressed in terms of the Manhattan distance; this approach appears in the signal recovery framework called compressed sensing. The Manhattan distance is also used to assess the differences in discrete frequency distributions. Finally, the Manhattan distance heuristic estimates the minimum number of steps required to find a path to the goal state. The closer the heuristic is to the actual number of steps, the fewer nodes have to be expanded during the search; at the extreme, with a perfect heuristic, only the nodes that are guaranteed to be on the goal path are expanded.
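As a minimal illustration (the two sample points reuse attribute values from the example data later in the paper; the function name is ours), the Manhattan distance is simply the sum of absolute coordinate differences:

```python
def manhattan(x, y):
    """Manhattan (taxicab) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Two entities with three attributes each.
d = manhattan((174, 67, 81), (160, 45, 89))  # |174-160| + |67-45| + |81-89| = 44
```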

The choice of distance measures is very important, as it has a strong influence on the clustering results. For most common clustering software, the default distance measure is the Euclidean distance [

The Euclidean distance is applied under the assumption that the properties of the attributes inherent in an object are consistent. Its properties are that there is a unique path between two points whose length is equal to the Euclidean distance, and that, for a given point and a fixed Euclidean distance, the other point lies on a circle whose radius is that fixed distance. With this distance, Euclidean space becomes a metric space. The Euclidean distance is defined as the shortest distance connecting two points. For example, the distance between two points x = (x₁, x₂, ⋯, x_n) and y = (y₁, y₂, ⋯, y_n) in n dimensions is expressed as ED(x, y) = [Σ_{i=1}^{n} (x_i − y_i)²]^(1/2). Simply, this is a basic distance measurement in which the correlation between attributes is not considered.

The Euclidean distance is frequently used in Euclidean geometry to find the shortest distance between two points in a Euclidean space, that is, the length of the straight line between them. This distance is commonly used in clustering algorithms such as K-means. If the Euclidean distance is chosen, then observations with high feature values will be clustered together, and the same holds true for observations with low feature values. Finally, it is used as a simple metric to measure the similarity between two data points in related areas. Correlation-based distance, by contrast, considers two objects to be similar if their features are highly correlated, even though the observed values may be far apart in terms of Euclidean distance; the distance between two objects is 0 when they are perfectly correlated.
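For illustration, a minimal sketch of both measures (the function names are ours, not from the paper): the Euclidean distance generalizes the Pythagorean theorem to n dimensions, while a correlation-based distance, taken here as 1 minus the Pearson correlation, is zero for perfectly correlated profiles.

```python
import math

def euclidean(x, y):
    """Straight-line distance between two points in n dimensions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def correlation_distance(x, y):
    """1 - Pearson correlation: 0 when the two profiles are perfectly correlated."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

d1 = euclidean((3, 4), (0, 0))                    # classic 3-4-5 triangle: 5.0
d2 = correlation_distance((1, 2, 3), (2, 4, 6))   # perfectly correlated profiles: 0.0
```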

A distance based on the correlation between the data can be more effective for clustering analysis than the distance scales discussed in the previous section. In particular, applying correlation seems desirable when an entity has multiple attributes, because the disadvantages of the Euclidean distance can be compensated for by analyzing the relationships between attributes, and the effect of clustering can be augmented. Clustering by a correlation-based distance or by the Euclidean distance is quite sensitive to outliers, but generally the correlation-based distance is more effective than the Euclidean distance. One of the correlation-based distances in the clustering methodology is the Mahalanobis distance.

The Mahalanobis distance is known to be an appropriate measure of distance between two elliptic distributions having different locations but a common shape, and also as an effective way to compare groups with well-known characteristics against groups whose characteristics are not well known [

MD(x, y) = [(x − y) S⁻¹ (x − y)ᵗ]^(1/2)   (1)

In (1), MD(x, y) represents the Mahalanobis distance between entity x and entity y, where x and y denote object vectors. S⁻¹ denotes the inverse of the covariance matrix, and (x − y)ᵗ is the transpose of (x − y). The Mahalanobis distance applies the covariance matrix as a multivariate measure based on the correlation between attributes, and it is effective when the units of the attributes differ and correlation exists between attributes.
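A minimal numerical sketch of (1) using NumPy (the sample rows reuse a few entities from the example data; a real analysis would use the full data set and the standardization described below):

```python
import numpy as np

# A few sample entities (rows) with three attributes (columns).
data = np.array([
    [174.0, 67.0, 81.0],
    [160.0, 45.0, 89.0],
    [175.0, 61.0, 64.0],
    [163.0, 49.0, 55.0],
    [187.0, 80.0, 90.0],
])

S = np.cov(data, rowvar=False)   # covariance matrix of the attributes
S_inv = np.linalg.inv(S)         # S^-1 in (1)

def mahalanobis(x, y, cov_inv):
    """MD(x, y) = [(x - y) S^-1 (x - y)^t]^(1/2), as in (1)."""
    d = x - y
    return float(np.sqrt(d @ cov_inv @ d))

md = mahalanobis(data[0], data[1], S_inv)
```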

The first step in measuring the Mahalanobis distance is to apply data conversion, a statistical process that provides a kind of reference point for comparing two or more different groups. The standard normal conversion is only applicable if the attributes of the data in the group follow a normal distribution. Assuming that X(μ, σ²) is a random variable with mean μ and variance σ² in a data group, X can be transformed into Y using a simple data conversion without losing its statistical properties. The result of the data conversion can be used for comparison between attributes, since the statistics that represent each data group differ from one another. As can be seen in (2), y_ij indicates the distance measure converted from x_ij, where x̄_i· denotes the average of the attributes in group i and σ_i· the corresponding standard deviation.

y_ij = (x_ij − x̄_i·) / σ_i·   (2)
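The conversion in (2) is ordinary z-score standardization; a minimal sketch, using as sample values the first attribute of a few entities from the example data:

```python
import math

x = [174.0, 160.0, 175.0, 163.0, 187.0]  # one attribute column

mean = sum(x) / len(x)
sigma = math.sqrt(sum((v - mean) ** 2 for v in x) / len(x))  # population std dev

# y_ij = (x_ij - mean) / sigma, as in (2): zero mean, unit variance afterwards
y = [(v - mean) / sigma for v in x]
```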

The MTS method is a pattern information technology, which has been used in different diagnostic applications to help in making quantitative decisions by constructing a multivariate measurement scale using data analytic methods [

The Mahalanobis space should be defined before calculating the Mahalanobis distance. The process of defining the Mahalanobis space begins with the selection of the reference entities and of the other entities for which the Mahalanobis distance is calculated. In the MTS method, the Mahalanobis space is selected using the standardized variables of normal data. This selection is generally effective in clustering when entities with more or less extreme attribute values are chosen rather than entities close to the average. The reason is that the clustering effectiveness is halved if the entity closest to the average is selected as a reference entity, since most entities are located close to the average. Once the Mahalanobis space is established, the number of attributes is reduced using an orthogonal array and the SN (signal-to-noise) ratio to evaluate the contribution of each attribute. Each row of the orthogonal array determines a subset of the original system by including or excluding each attribute. Some statistical processes, such as data standardization and the correlation matrix, are required to obtain the Mahalanobis distance by applying (1) and (2). The inverse of the covariance matrix, obtained through correlation analysis, is then used to convert the data. The correlation coefficient between attribute i and attribute j is known as r_ij = σ_ij / (σ_i × σ_j), and it becomes σ_i = 1, σ_j = 1 under the standard normal data conversion. In addition, the Mahalanobis distance accounts for the variance of each variable and the covariance between variables. Geometrically, it does this by transforming the data into standardized uncorrelated data and computing the ordinary Euclidean distance on the transformed data.

The SN ratio is computed to determine how much each attribute affects the Mahalanobis space. This procedure serves as an evaluation criterion: among the various characteristics affecting the Mahalanobis distance, the low-impact characteristics are reduced and the high-impact characteristics are selected. The SN ratio plays a critical role in determining the influence between an entity and the Mahalanobis space. The quadratic loss function for the smaller-the-better case is used, as seen in (3), since a smaller distance between the Mahalanobis space and the entity means the entity is closer.

SN = −10 log₁₀ [(1/n) Σ_{i=1}^{n} y_i²]   (3)
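Equation (3) is the standard smaller-the-better SN ratio (log base 10); a minimal sketch, where the sample inputs are a few Mahalanobis distances from the example data:

```python
import math

def sn_ratio(y):
    """Smaller-the-better SN ratio, as in (3): -10 * log10((1/n) * sum(y_i^2))."""
    n = len(y)
    return -10.0 * math.log10(sum(v * v for v in y) / n)

sn = sn_ratio([4.79, 12.14, 5.88])
```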

The balanced assignment should be executed so that the characteristics of the subgroups are similar, and so that the attributes included in those characteristics are also similar, under the assumption that the balanced assignment takes into account all attributes specified in the entity [

In this section, the case where an entity contains three attributes is analyzed, and the results of calculating the Mahalanobis distance suggested in the previous section are presented. The Manhattan and Euclidean distances, which are easier to compute than the Mahalanobis distance, are not presented in this section. The collected data is shown in

In order to compute the Mahalanobis distance, it is necessary to define the Mahalanobis space that serves as the reference. In this study, the entities having the most extreme value of each attribute are set as the reference entities, and these form the Mahalanobis space. For each attribute, the entities with the highest and the lowest values are defined as reference points, that is, as the Mahalanobis space. Since the given example consists of three attributes, 6 shaded entities in

Each entity's attributes must be converted using (2), followed by the inverse of the correlation matrix. The Mahalanobis distance between the space and each entity, computed using (1), is shown in

Furthermore, the balanced assignment by the MTS method is accompanied by calculating SN ratios using (3), and by assigning all entities into many subgroups using orthogonal array.
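The paper does not spell out the assignment rule in code; as a purely hypothetical sketch (not the paper's exact MTS procedure), one simple way to balance subgroups on a single distance score is a snake (boustrophedon) draft over the sorted entities, so every subgroup receives a similar spread of values:

```python
def balanced_assign(scores, k):
    """Deal entities, sorted by score, into k subgroups in alternating order.
    Illustrative only; the paper's MTS assignment may differ."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    groups = [[] for _ in range(k)]
    for rank, idx in enumerate(order):
        rnd, pos = divmod(rank, k)
        g = pos if rnd % 2 == 0 else k - 1 - pos  # reverse direction each round
        groups[g].append(idx)
    return groups

groups = balanced_assign([5.0, 1.0, 3.0, 2.0, 4.0, 6.0], 3)
```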

In this section, the comparison between the suggested distance criteria is shown by changing the assignment standard, and finally comparing them against the MTS method. The mean value of the suggested attributes in

No. | attrib 1 | attrib 2 | attrib 3 | No. | attrib 1 | attrib 2 | attrib 3 | No. | attrib 1 | attrib 2 | attrib 3
---|---|---|---|---|---|---|---|---|---|---|---
1 | 174 | 67 | 81 | 11 | 160 | 45 | 89 | 21 | 175 | 68 | 75
2 | 174 | 71 | 76 | 12 | 162 | 52 | 91 | 22 | 176 | 76 | 81
3 | 175 | 61 | 64 | 13 | 163 | 49 | 55 | 23 | 177 | 70 | 87
4 | 175 | 65 | 87 | 14 | 163 | 43 | 92 | 24 | 177 | 68 | 89
5 | 175 | 69 | 65 | 15 | 164 | 55 | 90 | 25 | 178 | 72 | 94
6 | 170 | 55 | 76 | 16 | 166 | 50 | 87 | 26 | 179 | 77 | 83
7 | 171 | 57 | 76 | 17 | 167 | 52 | 85 | 27 | 180 | 89 | 75
8 | 172 | 59 | 87 | 18 | 167 | 54 | 89 | 28 | 182 | 79 | 72
9 | 173 | 61 | 79 | 19 | 168 | 60 | 89 | 29 | 183 | 75 | 81
10 | 174 | 60 | 83 | 20 | 170 | 60 | 91 | 30 | 187 | 80 | 90

No. | A | B | C | D | E | F | No. | A | B | C | D | E | F
---|---|---|---|---|---|---|---|---|---|---|---|---|---
1 | 4.79 | 12.14 | 5.88 | 2.60 | 8.73 | 7.49 | 13 | 1.82 | 15.86 | 4.71 | 3.30 | 9.83 | 12.45
2 | 6.52 | 10.56 | 10.44 | 5.49 | 4.50 | 12.31 | 14 | 2.43 | 17.57 | 2.96 | 1.68 | 12.92 | 8.41
3 | 0.92 | 16.71 | 5.35 | 7.17 | 13.00 | 19.58 | 15 | 13.93 | 6.55 | 11.22 | 13.80 | 22.82 | 12.70
4 | 1.12 | 16.12 | 5.27 | 5.85 | 11.36 | 17.33 | 16 | 5.56 | 17.11 | 3.90 | 1.20 | 14.24 | 4.11
5 | 1.71 | 13.00 | 0.67 | 4.93 | 19.57 | 10.75 | 17 | 10.44 | 5.17 | 12.69 | 10.51 | 9.66 | 14.07
6 | 5.35 | 7.75 | 3.69 | 7.24 | 18.93 | 9.99 | 18 | 6.61 | 9.11 | 7.59 | 4.70 | 9.11 | 8.44
7 | 5.53 | 8.01 | 4.01 | 6.60 | 17.68 | 9.21 | 19 | 7.92 | 16.26 | 12.43 | 4.33 | 2.82 | 11.83
8 | 4.23 | 15.53 | 2.16 | 2.35 | 17.21 | 5.65 | 20 | 6.62 | 18.49 | 5.87 | 0.65 | 11.18 | 3.87
9 | 5.40 | 10.43 | 3.97 | 4.24 | 15.30 | 6.81 | 21 | 7.12 | 20.25 | 4.99 | 0.67 | 14.57 | 2.89
10 | 6.74 | 14.54 | 3.32 | 3.83 | 19.74 | 4.85 | 22 | 8.57 | 17.93 | 10.24 | 1.86 | 5.76 | 6.25
11 | 1.92 | 11.71 | 1.15 | 4.71 | 17.95 | 10.24 | 23 | 12.96 | 13.17 | 13.23 | 5.70 | 8.53 | 6.57
12 | 1.27 | 14.57 | 1.51 | 3.38 | 15.65 | 10.36 | 24 | 13.66 | 20.11 | 9.73 | 3.54 | 17.05 | 1.51

The result of the balanced assignment by applying the MTS method is shown in

The result for the corresponding distance scale is presented in

No. | Subgroup 1 | | | Subgroup 2 | | | Subgroup 3 | |
---|---|---|---|---|---|---|---|---
 | attrib 1 | attrib 2 | attrib 3 | attrib 1 | attrib 2 | attrib 3 | attrib 1 | attrib 2 | attrib 3
1 | 179 | 77 | 83 | 162 | 52 | 91 | 175 | 69 | 65
2 | 178 | 72 | 94 | 167 | 52 | 85 | 171 | 57 | 76
3 | 175 | 61 | 64 | 170 | 60 | 91 | 163 | 43 | 92
4 | 177 | 70 | 87 | 175 | 68 | 75 | 174 | 71 | 76
5 | 182 | 79 | 72 | 170 | 55 | 76 | 180 | 89 | 75
6 | 163 | 49 | 55 | 174 | 67 | 81 | 164 | 55 | 90
7 | 176 | 76 | 81 | 187 | 80 | 90 | 166 | 50 | 87
8 | 168 | 60 | 89 | 173 | 61 | 79 | 175 | 65 | 87
9 | 167 | 54 | 89 | 172 | 59 | 87 | 183 | 75 | 81
10 | 160 | 45 | 89 | 177 | 68 | 89 | 174 | 60 | 83
Mean | 172.5 | 64.3 | 80.3 | 172.7 | 62.2 | 84.4 | 172.5 | 63.4 | 81.2
Vari. | 55.38 | 150.6 | 160.2 | 43.56 | 75.95 | 38.48 | 43.38 | 177.8 | 68.84

Distance | Criterion | mean (Sub. 1) | max−min (Sub. 1) | mean (Sub. 2) | max−min (Sub. 2) | mean (Sub. 3) | max−min (Sub. 3)
---|---|---|---|---|---|---|---
Manhattan Distance | attrib 1 | 21.98 | 17.73 | 20.09 | 16.54 | 23.93 | 26.00
 | attrib 2 | 22.10 | 24.73 | 20.60 | 17.73 | 23.29 | 26.27
 | attrib 3 | 23.43 | 26.04 | 19.75 | 19.33 | 22.77 | 17.60
 | simulation | 21.42 | 23.86 | 22.69 | 22.63 | 21.85 | 18.75
Euclidean Distance | attrib 1 | 14.03 | 14.37 | 12.78 | 14.34 | 15.40 | 21.49
 | attrib 2 | 14.14 | 20.42 | 12.79 | 14.92 | 15.28 | 22.56
 | attrib 3 | 15.13 | 22.49 | 13.59 | 16.63 | 14.45 | 15.44
 | simulation | 13.38 | 20.64 | 14.27 | 22.84 | 14.09 | 16.54
Mahalanobis Distance | attrib 1 | 8.19 | 3.51 | 8.30 | 2.73 | 8.35 | 5.91
 | attrib 2 | 8.21 | 3.90 | 8.34 | 2.76 | 8.42 | 5.72
 | attrib 3 | 7.88 | 1.82 | 8.86 | 3.17 | 8.11 | 6.56
 | simulation | 8.21 | 3.08 | 8.27 | 2.89 | 8.36 | 6.06

The mean, and the difference between the maximum and the minimum values among the subgroups, can be used to analyze which criterion represents a good indicator for the balanced assignment. The Mahalanobis distance is considered the better choice for the given example, even though the difference depends on the criterion for selecting attributes. Finally, the balanced assignment is carried out by applying the MTS method, and its result under each distance criterion is shown in

Distance | mean (Sub. 1) | max−min (Sub. 1) | mean (Sub. 2) | max−min (Sub. 2) | mean (Sub. 3) | max−min (Sub. 3)
---|---|---|---|---|---|---
Manhattan Distance | 21.76 | 22.28 | 22.57 | 20.55 | 21.66 | 22.31
Euclidean Distance | 13.88 | 18.39 | 14.21 | 17.98 | 13.65 | 20.84
Mahalanobis Distance | 8.26 | 3.06 | 8.32 | 2.62 | 8.26 | 3.45

As seen in

In clustering, the distance between entities or attributes is applied as a measure of clustering accuracy. The Manhattan distance, the Euclidean distance, and the Mahalanobis distance are considered as tools for measuring this closeness.

In this paper, the comparison between the distance criteria was shown by changing specific assignment details, and finally comparing them against the MTS method. Since the standards for calculating the distances differ, it is not meaningful to compare them one by one. However, the mean, and the difference between the maximum and the minimum values within each subgroup, can be used to analyze which method represents a good indicator for the balanced assignment. In general, the balanced assignment by the Mahalanobis distance is seen as the better choice, even though the difference depends on the criterion for selecting attributes. Finally, the balanced assignment is carried out by applying the MTS method.

The authors declare no conflicts of interest regarding the publication of this paper.

Rhee, Y. (2019) Comparison and Validation of Distance on the Balanced Assignments of Group having Entities with Multiple Attributes. American Journal of Industrial and Business Management, 9, 1464-1474. https://doi.org/10.4236/ajibm.2019.96096