
The paper proposes a solution to the classification problem based on calculating a sequence of matrices of feature indices that approximate invariants of the data matrix. Here, the feature index is the ordinal number of the interval containing the feature value, and the number of intervals is a parameter. Objects with equal indices form granules, including information granules, which correspond to the training-sample objects of a certain class. From the ratios of the information granule lengths, we obtain frequency intervals of each feature that are the same for the corresponding objects of the control sample. Then, for an arbitrary object, we estimate its probability in each class and assign the class that corresponds to the maximum probability. For a sequence of parameter values, we obtain a converging sequence of error rates. An additional effect is created by parameters aimed at increasing the data variety and compressing rare data. The high accuracy and stability of the results obtained with this method have been confirmed on nine data sets from the UCI repository. The proposed method has obvious advantages over existing ones due to the simplicity and universality of the algorithm, as well as the accuracy of the solutions.

The classification problem is the central problem in machine learning, and a considerable and constantly growing number of research papers deal with methods for solving it. Nevertheless, the analysis of modern methods, which is adequately described in [

The problem concerns a set of real objects whose patterns are represented by feature vectors. The composition of the features that describe the objects is, to a certain extent, random, and in some cases the list can be changed or shortened. In addition, the feature values contain random errors of measurement or observation. The influence of the inevitable uncertainty in the relationship between a real object and its model (pattern) is further increased, since the given information is divided between the object and its model, and neither of them is fully defined.

However, the existing methods have been unable to take these factors into account, as they use mathematical tools within the formal framework of pure mathematics. Such approaches have another drawback: to solve the problem, one must assume the existence of a metric in the feature space and of a probability density function for the objects of each class.

At the same time, all objects have feature values that are very similar or equal, so the object classes differ in the probability density functions of the features, but not of the objects. Yet only a generalized function can accurately describe the discontinuous densities of the features. Therefore, each of the existing methods is applicable in a restricted area whose boundaries can be established, as a rule, only experimentally.

Studies in recent decades regarding the principles of information processing in complex systems open up new possibilities for solving the problem and eliminate these gaps in the theory. Most of these works were triggered by soft computing theory and were based on the concept of an organism as a granular system. It is assumed that the system consists of holons or granules, which simultaneously represent a single entity and its part in the larger system on different hierarchical levels [

Granular computing is a paradigm of research in the field of artificial intelligence. It covers multiple process modeling concepts of information processing in various hierarchical systems, as well as new approaches to learning with fuzzy databases [

The present work is based on L. Zadeh's concept of soft computing, according to which the human mind accounts for the "comprehensive inaccuracy of the real world" [

This concept is reflected in the proposed method, which builds upon a new approach to feature description. The range of each feature's values is divided into n equal intervals. Each interval can contain anywhere from zero to several objects, which form granules. The length of a granule depends randomly on the ordinal number of the interval, which is called the index of the corresponding feature. Then each object is uniquely described by its index vector, and the data matrix is transformed into an index matrix. For each index, we can calculate the ratio between the number of objects from a certain class and the total number of objects, which defines the index frequencies and provides a sampling density estimate. These results allow us to establish the class of any object and the value of the parameter n that provides an acceptable margin of error.
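The interval-indexing step described above can be sketched as follows. This is an illustrative reimplementation, not the authors' code; the function and parameter names are hypothetical.

```python
import numpy as np

def index_matrix(X, n):
    """Transform a data matrix into an index matrix: split the value
    range of every feature into n equal intervals and replace each
    value by the ordinal number (0..n-1) of the interval it falls in."""
    X = np.asarray(X, dtype=float)
    lo = X.min(axis=0)
    hi = X.max(axis=0)
    # guard against constant features (zero-width range)
    width = np.where(hi > lo, (hi - lo) / n, 1.0)
    idx = np.floor((X - lo) / width).astype(int)
    # the maximum value lands exactly on the upper boundary; clip it
    # into the last interval
    return np.clip(idx, 0, n - 1)
```

Objects sharing the same index for a feature then belong to the same granule for that feature.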

The effectiveness of the adopted approach is explained by the fact that a given data set defines a hierarchically organized system with relations of whole and part between its elements: features, objects, and classes. The mechanism of operation of such a system is determined by the frequencies of interaction of granules, in accordance with the simplest formulas of probability theory.

Note that the uncertainty of the initial data is taken into account indirectly under all transformations. These transformations lead to random changes in the description of any object, but the relation between the object and its class remains the same. Therefore, we can assume that the granulation is based on an approximate calculation of the invariants of the data matrix, whose role, with a known error, is played by the index matrices.

The article summarizes and develops previously completed studies [

The article is devoted to the classification problem, in which training and control samples of real objects

To identify objects and their patterns q, we will use the sequence of numbers

This problem relates to the field of artificial intelligence and has its own specific peculiarities. Here we are dealing with two entities for an arbitrary object number s: a real-world object

corresponding row of the matrix

Hence, the set

For the implementation of the relevant mapping, we will consider the given data set as a system that processes multi-level data with a hierarchical structure

Let us make two general remarks concerning the calculation of granules.

1) It should be noted that the frequencies of the features do not depend on the technique used to identify the feature values, such as magnitude, number, or any other identifiers. On this basis, we can transform the system of reference for the feature values and use any types of features, including mixed types, in the calculations.

In particular, for a non-quantitative feature, we should establish a relation of partial order on the set of options for its values under an arbitrary rule of their numbering. The value
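Numbering the options of a non-quantitative feature under an arbitrary rule might look like the following sketch; the helper name and the first-seen ordering are assumptions for illustration only.

```python
def encode_nominal(values):
    """Assign ordinal numbers to the options of a non-quantitative
    feature under an arbitrary rule (here: order of first appearance),
    inducing a partial order on its values."""
    codes = {}
    return [codes.setdefault(v, len(codes)) for v in values]
```

Since the index frequencies do not depend on how the values are identified, any consistent numbering serves equally well.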

2) The training set is designed to reduce the level of uncertainty of our knowledge of the properties of objects of each class. The uncertainty is measured by the value of information entropy, and it is obvious that increasing the entropy of each feature improves the solution quality. The maximum value of entropy is equal to

To implement the ideas in both comments, we will first calculate a matrix that has a convenient form and sufficient accuracy to represent the information contained in the conditions of the task. To this end, we will sequentially apply randomization and indexing of the information.

The randomization procedure is designed to transform the feature values

We assume that the interval

Now we can clarify the concept of the index. The value

Under these mappings, the initial data of the problem are significantly transformed, and the boundaries of non-quantitative features become fuzzy. The ongoing changes can be illustrated by the example of the vector of an object with dissimilar features:

It follows from the definition of index that

Then we can partition the combined sample into subsets called granules that contain objects of ordered pairs. Of particular interest is a subset of training sample objects of a certain class, which we will call information granules.
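The partition of the combined sample into granules could be sketched as below; this is a hedged illustration under the assumption that a granule collects, for every (feature, index) pair, the objects whose feature falls into that interval, with the class-specific subsets forming the information granules.

```python
from collections import defaultdict

def build_granules(index_matrix, labels):
    """Partition the sample into granules keyed by the ordered pair
    (feature j, index i): each granule lists the objects whose j-th
    feature has index i, together with their class labels. The
    objects of one class inside a granule form an information
    granule."""
    granules = defaultdict(list)
    for s, row in enumerate(index_matrix):
        for j, i in enumerate(row):
            granules[(j, i)].append((s, labels[s]))
    return granules
```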

To calculate the composition of the granules, we establish on the set

As a result of these transformations, the values of features will be measured on a single scale with a division value equal to one index. Therefore, the matrix

Let

Since the appearance of each of the

From (1) it follows that

The estimation of the class of the object s is found using the obvious formula
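A minimal sketch of the frequency-based class estimate is given below. It assumes, as described earlier, that each (feature, index) granule yields per-class frequencies and that an object's score for a class sums these frequencies over its features in a total-probability style; the function names are illustrative, not the authors' notation.

```python
from collections import defaultdict

def class_frequencies(index_matrix, labels):
    """For each (feature, index) granule, compute the share of each
    class among the training objects it contains: the index
    frequencies."""
    counts = defaultdict(lambda: defaultdict(int))
    for row, y in zip(index_matrix, labels):
        for j, i in enumerate(row):
            counts[(j, i)][y] += 1
    return {key: {k: c / sum(cls.values()) for k, c in cls.items()}
            for key, cls in counts.items()}

def estimate_class(index_row, freq, classes):
    """Sum granule frequencies over all features and return the class
    with the maximum probability estimate. Unseen (feature, index)
    pairs contribute zero."""
    scores = {k: sum(freq.get((j, i), {}).get(k, 0.0)
                     for j, i in enumerate(index_row))
              for k in classes}
    return max(scores, key=scores.get)
```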

The relation

respectively, characterizes the algorithm's ability to learn and to classify objects. The quality of training as a function of the length t of the training sample is estimated by the average error rates

We will consider the issue of convergence for the sequence of error rates. Let the granule

In this case for

Classes can differ significantly in the convergence of error rates of learning

Nevertheless, the high accuracy of training does not guarantee acceptable classification accuracy since training is the first step in determining the class. Within the second step, it is necessary to evaluate the accuracies of solutions for the control sample objects based on relations (1) and (2), according to which any object for each n is characterized by the frequency

Here we face a problem of overtraining: if we restrict the value of n, it is possible to obtain a more accurate solution for the control sample, but the reliability of this result will be decreased because simultaneously the accuracy of learning will be lower. It is obvious that the simplicity of the algorithm reduces the computational complexity and severity of this problem, as well as allows us to estimate the effects of the characteristics of the matrix

The impact of the parameter can be observed in the example of the data set “Car evaluation”, where the variability of the data is low. Here objects are described by six features of nominal and ordinal types, with each feature possessing one of three or four values, and all objects of the same class having the same value for one of the features and only two variant values for other features. The task is complicated by the uneven distribution of the objects by class, since one class has 19 times more objects than the other does. Therefore, when

Now consider the joint impact of n and

The information about data sets from the repository UCI is given in

The algorithm of the method is based on the data grouping that predetermines the decreasing influence of various kinds of outliers. Calculations have shown that the solution is stable for an infinite set of acceptable solutions: small oscillations of the parameters

The calculations have confirmed that for sufficiently large

Data set | Objects | Features | Classes | Feature type | | | |
---|---|---|---|---|---|---|---|---
Abalone | 4177 | 8 | 29 | mixed | 689 | 1 | 1.6 | 0.069
Adult^{b} | 42121* | 14 | 2 | mixed | 3.2 | 10 | 1.5 | 0.034
Breast Cancer | 699 | 9 | 2 | integer | 1.9 | 1 | 6.9 | 0.01
Car evaluation | 1728 | 6 | 4 | mixed | 17.5 | 1 | 1.0 | 0.012
Glass | 214 | 9 | 6 | quant. | 8.5 | 1 | 4.2 | 0.029
Haberman’s Survival | 306 | 3 | 2 | integer | 2.8 | 1 | 13.1 | 0.116
Iris^{a} | 150 | 4 | 3 | quant. | 1 | 0 | 4.0 | 0
Letter Image Recognition | 20000 | 16 | 26 | integer | 1.1 | 1 | 2.6 | 0.076
Wine^{a} | 178 | 13 | 3 | quant. | 1.5 | 0 | 2.3 | 0.016

^{a}Objects were numbered via a random number generator. ^{b} Objects with data errors are not considered.

the lower boundary of the value range of these parameters, under which the error rates are

The results were verified by 10-fold cross-validation, in which splits of combined sample

From

This analysis can greatly simplify the process of solving practical problems, in which, instead of a control sample, a test sample is specified, for which the distribution of classes is unknown. Now, we can use tabulated values of

Another important result was obtained by testing the method experimentally. A direct application of the above algorithm for the data set Adult gives the minimum values of

To mitigate the noted effects of rare information, we introduce an additional procedure of compression, or pre-granulation, which is also based on the above considerations about data reliability. Now, the values
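One plausible form of such a compression step is sketched below: indices whose granules hold too few objects are merged into the nearest well-populated index. This is an assumed reconstruction for illustration; the paper's exact pre-granulation rule is not reproduced here.

```python
import numpy as np

def compress_rare(idx_col, min_count=2):
    """Pre-granulation sketch: merge every index whose granule
    contains fewer than min_count objects into the nearest index
    with at least min_count objects, compressing rare data."""
    idx_col = np.asarray(idx_col)
    vals, counts = np.unique(idx_col, return_counts=True)
    keep = vals[counts >= min_count]
    if keep.size == 0:          # nothing well-populated; leave as is
        return idx_col.copy()
    out = idx_col.copy()
    for v, c in zip(vals, counts):
        if c < min_count:
            nearest = keep[np.abs(keep - v).argmin()]
            out[idx_col == v] = nearest
    return out
```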

After this additional transformation, the solution of the Adult task has remained stable, and the errors are now closer to zero. The corresponding table data calculated at

This procedure has proven effective for a number of other data sets, for example, Letter Image Recognition. It can also be viewed as a way to reduce the values of

The paper proposes a methodology for solving classification problems, the core of which is an approximate calculation of the invariants of the data matrix. The new approach implements the concepts of soft computing and granulation and is biologically inspired. In essence, it reduces to a transformation of all feature measurement scales, in which the feature values, called indexes, are defined on a single scale in new units.

The developed methodology is based on the procedures of randomization and indexation of the data set (and, in some cases, also a pre-granulation procedure), which generate an infinite sequence of index matrices. These matrices are invariants of the data matrix with respect to the class of an object. They provide error-free training and allow us to calculate the object class using the simplest total-probability formulas for any single type or mixed types of features.

The proposed method differs from existing ones in the universality and simplicity of its algorithm and, as a rule, in accuracy that is almost an order of magnitude higher.

The obtained results go beyond the problem of classification and have independent significance for the solutions of the problems of data analysis. It can be expected that the method will receive a hardware implementation, and its extension to multi-level data will lead to the development of effective image recognition systems and information retrieval. The application of the method does not require mathematical education, which increases its innovative potential.

Shats, V.N. (2017) Classification Based on Invariants of the Data Matrix. Journal of Intelligent Learning Systems and Applications, 9, 35-46. https://doi.org/10.4236/jilsa.2017.93004