On the Matrices of Pairwise Frequencies of Categorical Attributes for Objects Classification ()
1. Introduction
The solution to the classification problem is reduced to calculating a function that divides a training sample (TRS) into classes and simultaneously obtains an acceptable classification accuracy for a test sample (TS). In most existing methods, algorithms for calculating these functions have considerable computational complexity [1] [2] [3] . In previous work [4] , the method of invariants (MI) was proposed, where this function is a linear combination of the simplest functions of the values of each feature that qualitatively simplifies the computation algorithm. It was shown in [5] that the MI corresponds to sensory process models of animals, which aim to recognize an object’s class by searching for a prototype in the information accumulated in the brain.
The MI proceeds from the fact that in classification problems, the accuracy of the data plays a special role since the objects, their descriptions, and their classes are correlated, and each type of entity has a randomness component. Therefore, a given data matrix is just one of possible random realization of the matrices that form the set of invariants with respect to the class. This approach is consistent with the concept proposed by L. Zadeh, which says that for most manually solved tasks, high accuracy is not required because the brain perceives only a “trickle of information” about the external world [6] . Moreover, for systems whose complexity exceeds a certain threshold, accuracy and practical sense are almost mutually exclusive characteristics.
In the MI, the range of attribute values after randomization, accompanied by an introduction of an additive component that follows a uniform distribution, is divided along each attribute into equal numbers of intervals, within which the feature values are assumed to be equiprobable. All objects falling within the interval receive an index of the corresponding attribute equal to the interval number.
For each index, one can find lists of numbers of TRS objects of a certain class and then calculate the frequencies of the indices. With some error, these frequencies will be the same for the objects in the TRS and the TS because both samples belong to the same general population. Therefore, it is possible to estimate the probability of the individual attributes of any object in each class. Then, using the simplest formula of the total probability, estimate the probability of an object having a specific set of feature values. Finally, the class of the object is determined based on the maximum likelihood principle.
There is an obvious analogy between indices and categories, the values of which can always be described by a finite sequence of integers 1, 2... Therefore, the MI serves as the basis for this article, in which two algorithms are proposed: one implements the simplest version of the MI developed for quantitative attributes, and the other more fully takes the features of categorical attributes into account.
The efficiency of the new algorithms was tested on five databases [7] .
2. Assumptions and Preliminary Assumptions
The article is devoted to solving classification problems for which all attributes are categorical. The solution is based on two MI assumptions:
· The data matrix has a set of invariants with respect to a class of objects.
· Object classes differ in the attribute probability distributions.
For categorical attributes, the number of values or levels n that individual objects can take is an important characteristic of the problem. In real tasks for quantitative attributes, the value of
, as a rule, considerably exceeds that of
—the corresponding value for categorical attributes. According to the theory proposed by C. Shannon, the information volume per value of a feature increases in proportion to the value of
. Therefore, in tasks involving categorical features, the “information load” of the data often increases several fold. This circumstance manifests in an increase in the number of objects of different classes that have the same attribute values. This reduces the difference between the attribute frequencies for objects of different classes, which can lead to an increase in the number of classification errors.
However, categorical attributes also have “favorable” features. The probability of an object of a certain class is an unknown function of its attributes, which takes into account the interrelations among all the attributes. Usually, this function is nonlinearly dependent on the attribute values of the object. This relationship is indirectly taken into account in the accepted assumption of the MI, since the frequencies of attribute indices are calculated for a particular class of objects. Then, this dependence becomes linear, which greatly simplifies algorithm’s calculation. One algorithm takes the same approach for categorical attributes whose values are, as noted above, analog indices.
The second algorithm considers the peculiarities of categorical attributes in a different way and is based on a new solution to the question of attribute relationships. Usually, the relationship between random variables is estimated using the Pearson correlation coefficient or the rank correlation coefficient. However, in the framework for this method, we are interested in the frequencies of attribute values that take a relatively small number of values. The paper further shows that pairwise frequencies of features allow an approximate assessment of the relationship between the features of objects of the same class (note that, as a rule, only a weak correlation exists between the categorical features of objects in the same class).
However, pairwise frequencies do not allow the determination of the class of TS objects if no object has the same combination of attribute values in the TRS. To classify objects, this algorithm uses an analog of the k-nearest neighbors method: the object is assigned to a class for which the total number of the k-nearest neighbors of the TRS’ objects for each attribute are maximized.
3. Two Algorithms for Solving the Classification Problem
3.1. Statement and Basic Algorithm
Let the vectors
describe the values of categorical attributes objects, which form the TRS
, where y is the vector of the object class labels, M is the number of objects, and missing data are excluded. Without loss of generality, we assume that the values of the attributes
and classes (possibly after preliminary encoding) belong to the sets of integers
and
, respectively, where
is the number of values of attribute
and С is the number of classes. The problem is to classify the TS objects.
We denote s objects by
and the data matrix by
. Consider the algorithm for the basic MI (algorithm 1). Using matrix
, we find lists
of numbers of objects of class
. The sample probability of objects in class i determines the obvious dependence:
. (1)
This dependence allows finding objects whose attribute value
. Let
denote the number of such objects. Then, the frequency of a value j for an attribute k of the TRS object of class i equals
, where
.
Object
arises as a result of appearances of each attribute k with the corresponding value j. Since these events form a complete group of incompatible events, the total probability formula gives an estimate of the probability that an object belongs to class i:
, (2)
where j is the value of attribute k for object
.
Formulas (1) and (2) yield a class probability estimate for the TRS objects. Since TRS and TS belong to a single general population, the formula also determines the frequencies of the TS objects. According to the maximum likelihood principle, the calculated class of the object
is
(3)
3.2. Features of the Model of Probability Density Objects
Essentially, the MI is based on the assumption that a class of objects can be recognized by the probability distribution of its attributes. According to (2), the probability
received its point estimate equal to the average frequencies attributes of object
of class i. Thus, the empirical frequency distribution of features is transformed into the frequency distribution of objects. Therefore, the MI considers the average composition of the attribute distribution as a probability distribution for objects of a particular class.
We investigate the characteristics of this distribution in the case of two attributes that have typical forms of attribute frequency distributions. Our analysis showed that the distributions of each attribute can be considered a sample of the theoretical distributions described by unimodal laws, the maximum of which is located in the middle and the “tails” of the distribution.
Consider the following task. Let objects have two categorical attributes, the values of which describe random variables Y and Z with probability densities
and
, respectively, where
,
, and
and n are parameters. From formula (2), a random variable
is the composition of Y and Z, which simulates the total distribution of the objects. We are interested in the features of this distribution.
Note that the functions
and
allow us to obtain an analytical solution for the distribution composition of the above types of attributes. Since these functions determine the corresponding density distribution, their parameters are related by the following:
,
.
Obviously,
, where
and
are random variables [8] . Given that density
,
, we obtain
. Similarly, we find that
.
The density
is a convolution of the functions
and
:
.
The range of u is divided into segments:
and
. Because
, the lower and upper limits of the integrals are equal to 0 and u for the first segment and
and u for the second segment, respectively. Then, we can obtain the formula for calculating the density:
,
where
,
,
Sub-integral functions are tabulated and not given for the abbreviated entries.
We performed calculations were performed for a wide range of parameters. The results are illustrated in Figure 1, where the density
is determined for the case in which the density
follows a normal distribution, and the
distribution is close to hyperbolic. The figure shows that with respect to the curve
, the ordinates of curve
increase in the region of high values of density
and decrease in sections with low values. Consequently, the function
does not follow a normal distribution. However, confidence intervals of continuous random variables can be estimated only for normal distributions.
From the analysis, it should be noted that the composition distributions of individual attributes result in a poorly predictable distribution for certain classes of objects. Thus, the effectiveness of the various MI algorithms depends on the data characteristics for a particular task and can be tested only empirically.
3.3. Algorithm 2
Algorithm 1 reduces the MI assumption that the individual classes of objects are
Figure 1. Density curves of random variables Y, Z and U (
and
correspond to
and
).
from different distributions of features to classify objects according to the frequencies of the categorical attribute values. However, another variant of the approximate realization of this assumption is also possible.
For any type of attribute, the probability of an arbitrary object
of class i is determined by the following relation:
, (4)
where
is the conditional probability of an attribute
at values
of attributes
. Here,
is found by formula (1).
Consider the features of this dependence for categorical attributes. Here, the elements of the set of Cartesian products of the attributes
and
,
are ordered pairs:
, where
and
. Let
be the number of objects of class i
whose attributes correspond to a pair
. Then, the frequency of the pair
gives the sample estimate probability of the pair for object
of class i. The set of frequencies defines a matrix
,
constructing a matrix of pairwise frequencies (MPF) for the attributes k and
for the TRS objects of class i. There are
MPFs for each class. According to the concept formed by the above matrix, we can define the properties of the TRS and TS objects. Then, from formula (4), we obtain the approximate dependence for estimating the probability that object
belongs to class i
. (5)
In formula (5),
is the element of matrix
that corresponds to the frequency of the attribute pair values k and
of an object in class i. The estimated class of this object is determined by an analog of formula (3):
(6)
3.4. Improving the Accuracy of Algorithm 2
From formula (5), it follows that
if one of the factors
. Such a case occurs when there is no object with the same attribute value among the TRS objects of class i. The total number of possible combinations of attribute values is
and, as a rule,
. Therefore, MPFs often contain zero elements and can be sparse.
If
for all i, then uncertainty arises, since formula (5) “does not work”. Note that when applying algorithm 1, such situations are practically excluded. The MI serves as the basis for eliminating this uncertainty, since it assumes that many data matrices exist that are invariant with respect to a class of objects. It can be assumed that in the case of invariant transformations, the relative position of the attribute values of TRS objects will be preserved near the singular points corresponding to the attribute values of an “undefined” object. Consequently, we can use the idea underlying the k-nearest neighbor method to solve classification problems.
We assume that the “undefined” object has a class to which most of the k-nearest TRS objects belong. Since the concept of distance between objects is not defined in the MI, we will evaluate the “proximity” for each attribute value of an “undefined” object.
Let Z be a set of TS objects for which the class could not be determined using formula (5) and object
. The goal is to find TRS objects of class i whose attributes
are in h neighborhoods of
,
. The numbers of these objects form the set
, and their frequency
is
. Having calculated the frequencies, we can find the average
frequency
of all the attributes of object
in class i. Then, the calculated class of object
is equal to
(7)
where h is a parameter whose domain is the set of integers
, where
.
Let
be an indicator of class i that equals 1 when the calculated class is not equal to the class of object
and 0 otherwise. Then, the number of incorrectly classified objects in the set Z will be equal to
. (8)
The calculated value of parameter h, denoted by
, and the corresponding value
can be found via the minimum number of “undefined” objects:
. (9)
4. The Effectiveness of New Algorithms
The MI serves as a general conceptual framework for formulas (1)-(3) and (4)-(9), which respectively define algorithms 1 and 2 for solving the classification problem. The effectiveness of the algorithms was studied with five databases from the UCI repository; the objects in these databases, the objects had only categorical features. The characteristics of the bases given in Table 1 that cover rather wide ranges of values for the numbers of objects (267 - 20,000), features (3 - 22) and classes (2 - 26).
The dependencies in (3) and (5) are applicable not only for the TS but also for the TRS. Therefore, we calculated the test error rate,
, and the training error rate,
. All the calculations were performed on the basis of the cross-validation procedure. The database was divided into 10 datasets of approximately equal size. The first 9 datasets were used as the TRS, and the remaining dataset was used for testing. This procedure was applied 10 times. Consequently, for each base, a sequence of 10 pairs of TRS and TS variants was considered. For each partitioning variant
, we calculated the error rates
and
.
The
and
curves for different databases are shown in Figure 2 and Figure 3, respectively. The graphs are identified by an ordered pair a_b, where a is the first letter of the database name and b is the algorithm identifier. For these rates, the average values E and the standard deviations St are given in Table 1.
Database Car evaluation and Spect have no “undefined” objects; for them, the functions
were not calculated. Figure 4 depicts the curves
and
Table 1. Table of databases characteristics and calculation results.
Figure 2. Frequency distributions of test errors
for algorithms 1 and 2.
Figure 3. Frequency distributions of learning errors
for algorithms 1 and 2.
Figure 4. Graphs of the function
for Breast Cancer, Haberman’s Survival and Letter Image databases.
that reflect the features of these functions for the Breast Cancer, Haberman’s Survival and Letter Image databases, respectively.
Below, we summarize the main results of the calculations:
1) With some exceptions, the error rate curves do not undergo drastic changes under the sequential changes in the composition of the TRS and TS objects under cross-validation. Both algorithms yield fairly stable results: in most cases, the error variances for TS and TRS are relatively small (
). The most stable results were obtained for algorithm 2, where
for the TS. We note that the number of test errors is typically considerably higher than training errors.
2) Algorithm 2, as a rule, is much more accurate than algorithm 1. This is well illustrated in Figure 2, where almost all the dotted lines corresponding to algorithm 1 are concentrated in the upper part. The resulting conclusion is that considering the pairwise frequencies of attributes makes it possible to more accurately differentiate the latent properties of objects of different classes. For algorithm 2, the minimum values of the mean error E are 0.076 and 0.016 for the test and training samples, respectively.
3) In many cases, the introduction of the function
and a corresponding reduction in the number of “uncertain” objects can lead to significant increases in the efficiency of the MPF and in the accuracy of the solution.
We can conclude that these experiments confirm the operability of both algorithms.
5. Conclusions
The paper proposes two new algorithms based on the MI for classifying objects with categorical features. Both algorithms originate from the same assumption: that the objects in each class differ in attribute probability distribution, but both algorithms use different models to approximate the distributions. Under this assumption, an object class is defined by the individual frequencies of its attribute values rather than by the nonlinear functions of attributes values used in most existing methods. This characteristic explains the comparative simplicity of the proposed algorithms.
It has been established that along with the correlation between categorical attributes, for objects belonging to one class, a functional relationship exists between the attribute values, which is characterized by the frequencies of the pairwise attribute values. This set of frequencies forms an MPF, which is calculated for the TRS objects for each class and attribute. In one of the algorithms, the MPF is used in conjunction with an analog of the k-nearest neighbors method. This addition allows one to determine the class of a TS object when the TRS does not contain objects with the same combination of attribute values.
It can be expected that the MPF can also be applied to solve problems with quantitative attributes because the values (with some error) can be represented by integers corresponding to the data description with a coarser measuring scale.
An experimental examination has shown that algorithm 2, using the MPF, provides more reliable results than does algorithm 1.