On the Matrices of Pairwise Frequencies of Categorical Attributes for Objects Classification

This paper proposes two new algorithms for classifying objects with categorical attributes. These algorithms are derived from the assumption that the attributes of different object classes have different probability distributions. One algorithm classifies objects based on the distribution of the attribute frequencies, and the other classifies objects based on the distribution of the pairwise attribute frequencies described using a matrix of pairwise frequencies. Both algorithms are based on the method of invariants, which offers the simplest dependencies for estimating the probabilities of objects in each class by an average frequency of their attributes. The estimated object class corresponds to the maximum probability. This method reflects the sensory process models of animals and is aimed at recognizing an object class by searching for a prototype in information accumulated in the brain. Because these matrices may be sparse, the solution cannot be determined for some objects. For these objects, an analog of the k-nearest neighbors method is provided in which for each attribute value, the class to which the majority of the k-nearest objects in the training sample belong is determined, and the most likely class value is calculated. The efficiencies of these two algorithms were confirmed on five databases.


Introduction
The solution to the classification problem reduces to calculating a function that divides a training sample (TRS) into classes while simultaneously achieving acceptable classification accuracy on a test sample (TS). In most existing methods, the algorithms for calculating these functions have considerable computational complexity [1] [2] [3]. In previous work [4], the method of invariants (MI) was proposed, in which this function is a linear combination of the simplest functions of the values of each feature, which qualitatively simplifies the computation algorithm. It was shown in [5] that the MI corresponds to the sensory process models of animals, which aim to recognize an object's class by searching for a prototype in the information accumulated in the brain.
The MI proceeds from the fact that in classification problems, the accuracy of the data plays a special role since the objects, their descriptions, and their classes are correlated, and each type of entity has a randomness component. Therefore, a given data matrix is just one possible random realization of the matrices that form a set of invariants with respect to the class. This approach is consistent with the concept proposed by L. Zadeh, which says that for most manually solved tasks, high accuracy is not required because the brain perceives only a "trickle of information" about the external world [6]. Moreover, for systems whose complexity exceeds a certain threshold, accuracy and practical sense are almost mutually exclusive characteristics.
In the MI, the range of attribute values after randomization, accompanied by an introduction of an additive component that follows a uniform distribution, is divided along each attribute into equal numbers of intervals, within which the feature values are assumed to be equiprobable. All objects falling within the interval receive an index of the corresponding attribute equal to the interval number.
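For quantitative attributes, this discretization step can be sketched as follows (a minimal illustration; the function name, the noise amplitude, and the use of NumPy are assumptions for illustration, not part of the original method description):

```python
import numpy as np

def to_indices(values, n_intervals, rng=None):
    """Randomize quantitative values with an additive uniform component
    and map them to interval indices 1..n_intervals."""
    rng = np.random.default_rng() if rng is None else rng
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    # additive uniform randomization, small relative to the value range
    # (the 5% amplitude here is an assumed illustrative choice)
    noisy = values + rng.uniform(-0.05 * span, 0.05 * span, size=values.shape)
    # divide the randomized range into equal intervals
    edges = np.linspace(noisy.min(), noisy.max(), n_intervals + 1)
    # objects falling within an interval receive that interval's number
    idx = np.clip(np.digitize(noisy, edges[1:-1]) + 1, 1, n_intervals)
    return idx
```

With `n_intervals = 4`, every object receives an index between 1 and 4, and all objects falling in the same interval share the same index.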
For each index, one can find the lists of numbers of TRS objects of a certain class and then calculate the frequencies of the indices. With some error, these frequencies will be the same for the objects in the TRS and the TS because both samples belong to the same general population. Therefore, it is possible to estimate the probability of the individual attributes of any object in each class. Then, using the simplest total probability formula, the probability that an object has a specific set of feature values is estimated. Finally, the class of the object is determined based on the maximum likelihood principle.
There is an obvious analogy between indices and categories, the values of which can always be described by a finite sequence of integers 1, 2... Therefore, the MI serves as the basis for this article, in which two algorithms are proposed: one implements the simplest version of the MI developed for quantitative attributes, and the other more fully takes the features of categorical attributes into account.
The efficiency of the new algorithms was tested on five databases [7].

Assumptions and Preliminaries
The article is devoted to solving classification problems in which all attributes are categorical. The solution is based on two MI assumptions:
• The data matrix has a set of invariants with respect to a class of objects.
• Object classes differ in their attribute probability distributions.

Journal of Intelligent Learning Systems and Applications

For categorical attributes, the number of values, or levels, $n$ that an attribute can take is an important characteristic of the problem. In real tasks, the value $n_q$ for quantitative attributes, as a rule, considerably exceeds $n_c$, the corresponding value for categorical attributes. According to the theory proposed by C. Shannon, the information volume per feature value increases in proportion to $\log_2 n$, so a quantitative attribute carries $\log_2 n_q / \log_2 n_c$ times more information per value than a categorical one. Therefore, in tasks involving categorical features, the "information load" carried by each value is often several times smaller. This circumstance manifests in an increase in the number of objects of different classes that have the same attribute values. This reduces the difference between the attribute frequencies for objects of different classes, which can lead to an increase in the number of classification errors.
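As a numerical illustration of this Shannon-based ratio, consider a hypothetical quantitative attribute with $n_q = 256$ levels and a categorical attribute with $n_c = 4$ levels (the concrete numbers are assumptions chosen for illustration only):

```python
import math

def info_ratio(n_q, n_c):
    """Ratio of the information volume per attribute value,
    log2(n_q) / log2(n_c), following Shannon's measure."""
    return math.log2(n_q) / math.log2(n_c)

# A quantitative attribute with 256 levels carries 4 times more
# information per value than a categorical attribute with 4 levels.
ratio = info_ratio(256, 4)
```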
However, categorical attributes also have "favorable" features. The probability of an object of a certain class is an unknown function of its attributes, one which takes into account the interrelations among all the attributes. Usually, this function depends nonlinearly on the attribute values of the object. This relationship is indirectly taken into account in the accepted assumption of the MI, since the frequencies of the attribute indices are calculated for a particular class of objects. The dependence then becomes linear, which greatly simplifies the algorithm's calculations. One of the proposed algorithms takes the same approach for categorical attributes, whose values are, as noted above, analogs of indices.
The second algorithm considers the peculiarities of categorical attributes in a different way and is based on a new solution to the question of attribute relationships. Usually, the relationship between random variables is estimated using the Pearson correlation coefficient or the rank correlation coefficient. However, in the framework for this method, we are interested in the frequencies of attribute values that take a relatively small number of values. The paper further shows that pairwise frequencies of features allow an approximate assessment of the relationship between the features of objects of the same class (note that, as a rule, only a weak correlation exists between the categorical features of objects in the same class).
However, pairwise frequencies do not allow the class of a TS object to be determined if no TRS object has the same combination of attribute values. To classify such objects, this algorithm uses an analog of the k-nearest neighbors method: the object is assigned to the class for which the total number of k-nearest TRS objects over all attribute values is maximized.

Statement and Basic Algorithm
Let the vectors $x = (x^1, \ldots, x^K)$ describe the TRS objects by the values of $K$ categorical attributes, and let $N_i$ of the $N$ TRS objects belong to class $i$. The sample probability of objects in class $i$ is determined by the obvious dependence $p_i = N_i / N$. This dependence allows finding the objects whose attribute $k$ has the value $x^k = j$. Let $r_{kj}^i \ge 0$ denote the number of such objects in class $i$. Then, the frequency of a value $j$ for an attribute $k$ of the TRS objects of class $i$ equals

$$f_{kj}^i = r_{kj}^i / N_i. \quad (1)$$

Object $x$ arises as a result of appearances of each attribute $k$ with the corresponding value $j$. Since these events form a complete group of incompatible events, the total probability formula gives an estimate of the probability that an object belongs to class $i$:

$$p_i(x) = \frac{1}{K} \sum_{k=1}^{K} f_{kj}^i, \quad (2)$$

where $j$ is the value of attribute $k$ for object $x$.
Formulas (1) and (2) yield a class probability estimate for the TRS objects.
Since the TRS and TS belong to a single general population, formula (2) also determines the frequencies for the TS objects. According to the maximum likelihood principle, the calculated class of the object $x$ is

$$i^*(x) = \arg\max_i p_i(x). \quad (3)$$
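The basic algorithm can be sketched in a few lines of Python (a minimal sketch of the approach described above; the function names and the representation of objects as tuples of attribute values are assumptions, not the author's implementation):

```python
from collections import defaultdict

def train_frequencies(X, y):
    """Per-class frequency of value j for attribute k: f = r / N_i,
    where r counts class-i training objects with that value."""
    counts = defaultdict(lambda: defaultdict(int))
    n_objects = defaultdict(int)
    for x, cls in zip(X, y):
        n_objects[cls] += 1
        for k, j in enumerate(x):
            counts[cls][(k, j)] += 1
    return {cls: {kj: r / n_objects[cls] for kj, r in kv.items()}
            for cls, kv in counts.items()}

def classify(x, freqs):
    """Estimate the class probability as the average attribute-value
    frequency and return the class with the maximum estimate."""
    best_cls, best_p = None, -1.0
    for cls, f in freqs.items():
        p = sum(f.get((k, j), 0.0) for k, j in enumerate(x)) / len(x)
        if p > best_p:
            best_cls, best_p = cls, p
    return best_cls
```

For example, with a small TRS of two-attribute objects in two classes, `classify` returns the class whose average attribute-value frequency for the object is largest.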

Features of the Model of the Probability Density of Objects
Essentially, the MI is based on the assumption that a class of objects can be recognized by the probability distribution of its attributes. According to (2), the probability $p_i(x)$ receives a point estimate equal to the average frequency of the attributes of object $x$ in class $i$. Thus, the empirical frequency distribution of features is transformed into a frequency distribution of objects. Therefore, the MI considers the averaged composition of the attribute distributions as a probability distribution for the objects of a particular class. We investigated the characteristics of this distribution in the case of two attributes that have typical forms of attribute frequency distributions. Our analysis showed that the distribution of each attribute can be considered a sample from a theoretical distribution described by a unimodal law whose maximum is located either in the middle or at the "tails" of the distribution range; the attribute values themselves are random variables [8]. The analysis also shows that the composition of the distributions of the individual attributes results in a poorly predictable distribution for particular classes of objects. Thus, the effectiveness of the various MI algorithms depends on the data characteristics of a particular task and can be tested only empirically.

Algorithm 2
Algorithm 1 rests on the MI assumption that the individual classes of objects are distinguished by the frequency distributions of their individual attributes. For any type of attribute, the exact probability of an arbitrary object of class $i$ is determined by the joint distribution of all of its attribute values:

$$p_i(x) = p_i(x^1, \ldots, x^K), \quad (4)$$

where the joint distribution cannot be estimated reliably from a TRS of realistic size. Replacing it with the frequencies $f_{kl,jm}^i$ of the pairs of attribute values $(x^k = j, x^l = m)$ in (4), we obtain the approximate dependence for estimating the probability that object $x$ belongs to class $i$:

$$p_i(x) \approx \frac{2}{K(K-1)} \sum_{k<l} f_{kl,jm}^i, \quad (5)$$

where $j$ and $m$ are the values of attributes $k$ and $l$ for object $x$. In formula (5), the matrices of pairwise frequencies may be sparse; if all the pairwise frequencies of an object equal zero in every class, its class remains undefined. Consequently, we can use the idea underlying the k-nearest neighbors method to solve such classification problems.
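A minimal sketch of the pairwise-frequency estimate and its sparsity-induced "undefined" outcome (the names, the data layout, and the treatment of the all-zero case are assumptions drawn from the description above, not the author's code):

```python
from collections import defaultdict
from itertools import combinations

def train_pairwise(X, y):
    """Matrices of pairwise frequencies (MPF): for each class, the
    frequency of each pair of attribute values (k, j), (l, m), k < l."""
    counts = defaultdict(lambda: defaultdict(int))
    n_objects = defaultdict(int)
    for x, cls in zip(X, y):
        n_objects[cls] += 1
        for (k, jk), (l, jl) in combinations(enumerate(x), 2):
            counts[cls][(k, jk, l, jl)] += 1
    return {cls: {key: r / n_objects[cls] for key, r in kv.items()}
            for cls, kv in counts.items()}

def classify_pairwise(x, mpf):
    """Average pairwise frequency as the class probability estimate;
    returns None when the MPF is too sparse to decide (all zeros)."""
    pairs = list(combinations(enumerate(x), 2))
    best_cls, best_p = None, 0.0
    for cls, f in mpf.items():
        p = sum(f.get((k, jk, l, jl), 0.0)
                for (k, jk), (l, jl) in pairs) / len(pairs)
        if p > best_p:
            best_cls, best_p = cls, p
    return best_cls
```

An object whose attribute-value pairs never occur in the TRS gets an average pairwise frequency of zero in every class, so `classify_pairwise` returns `None` and the object is passed to the k-nearest-neighbors analog.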
We assume that the "undefined" object has a class to which most of the k-nearest TRS objects belong. Since the concept of distance between objects is not defined in the MI, we will evaluate the "proximity" for each attribute value of an "undefined" object.
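This majority-vote procedure might be sketched as follows (an assumption-level sketch: since the MI defines no distance between objects, the per-attribute "k-nearest" objects are taken here simply as up to k training objects sharing the attribute value):

```python
from collections import Counter

def classify_undefined(x, X_train, y_train, k=5):
    """Analog of k-nearest neighbors: for each attribute value of x,
    up to k training objects sharing that value vote with their class;
    the class with the largest total vote over all attributes wins."""
    votes = Counter()
    for attr, value in enumerate(x):
        matches = [cls for xt, cls in zip(X_train, y_train)
                   if xt[attr] == value][:k]
        votes.update(matches)
    return votes.most_common(1)[0][0] if votes else None
```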
Let $Z$ be the set of TS objects for which the class could not be determined using formula (5). For each object $z \in Z$ and each of its attribute values, the class to which the majority of the k-nearest TRS objects belong is determined, and the most likely class value is then calculated over all attributes.

Experimental Results
The effectiveness of the algorithms was studied with five databases from the UCI repository; the objects in these databases had only categorical features. The characteristics of the databases, given in Table 1, cover rather wide ranges of values for the numbers of objects (267 - 20,000), features (3 - 22), and classes (2 - 26).
The dependencies in (3) and (5) are applicable not only to the TS but also to the TRS. Therefore, we calculated both the test error rate, $f_c$, and the training error rate, $f_l$. All the calculations were performed on the basis of a cross-validation procedure. Each database was divided into 10 datasets of approximately equal size.
The first 9 datasets were used as the TRS, and the remaining dataset was used for testing. This procedure was applied 10 times. Consequently, for each database, a sequence of 10 pairs of TRS and TS variants was considered. For each partitioning variant $m \in \{1, \ldots, 10\}$, we calculated the error rates $f_{cm}$ and $f_{lm}$.
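The partitioning scheme can be sketched as follows (a minimal sketch; the interleaved assignment of objects to folds is an assumption, as the paper does not specify how the datasets were formed):

```python
def ten_fold_indices(n):
    """Split n object indices into 10 folds of approximately equal
    size; each fold serves once as the TS while the remaining nine
    folds together form the TRS."""
    folds = [list(range(m, n, 10)) for m in range(10)]
    splits = []
    for m in range(10):
        test = folds[m]
        train = [i for f in folds if f is not test for i in f]
        splits.append((train, test))
    return splits
```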
The $f_{cm}$ and $f_{lm}$ curves for the different databases are shown in Figure 2, and the summary values are given in Table 1.
The Car Evaluation and Spect databases have no "undefined" objects; for them, the functions $F(h)$ were not calculated. Figure 4 depicts the functions $F(h)$ for the remaining databases.
2) Algorithm 2, as a rule, is much more accurate than algorithm 1. This is well illustrated in Figure 2, where almost all the dotted lines corresponding to algorithm 1 are concentrated in the upper part. The resulting conclusion is that considering the pairwise frequencies of attributes makes it possible to more accurately differentiate the latent properties of objects of different classes. For algorithm 2, the minimum values of the mean error E are 0.076 and 0.016 for the test and training samples, respectively.
3) In many cases, the introduction of the function $F(h)$ and the corresponding reduction in the number of "uncertain" objects can lead to significant increases in the efficiency of the matrix of pairwise frequencies (MPF) and in the accuracy of the solution.
We can conclude that these experiments confirm the operability of both algorithms.

Conclusions
The paper proposes two new algorithms based on the MI for classifying objects with categorical features. Both algorithms originate from the same assumption, namely that the objects in each class differ in their attribute probability distributions, but they use different models to approximate these distributions. Under this assumption, an object's class is determined by the individual frequencies of its attribute values rather than by the nonlinear functions of the attribute values used in most existing methods. This characteristic explains the comparative simplicity of the proposed algorithms.
It has been established that, along with the correlation between categorical attributes, a functional relationship exists between the attribute values of objects belonging to one class; this relationship is characterized by the frequencies of the pairwise attribute values. This set of frequencies forms an MPF, which is calculated from the TRS objects for each class and each pair of attributes. In one of the algorithms, the MPF is used in conjunction with an analog of the k-nearest neighbors method.
This addition allows one to determine the class of a TS object when the TRS does not contain objects with the same combination of attribute values.
It can be expected that the MPF can also be applied to solve problems with quantitative attributes because the values (with some error) can be represented by integers corresponding to the data description with a coarser measuring scale.
An experimental examination has shown that algorithm 2, using the MPF, provides more reliable results than does algorithm 1.

Conflicts of Interest
The author declares no conflicts of interest regarding the publication of this paper.