In supervised learning, an imbalanced number of instances among the classes of a dataset can cause algorithms to classify an instance from the minority class as one from the majority class. To address this problem, the KNN algorithm provides a basis for several balancing methods. These balancing methods are revisited in this work, and a new and simple KNN undersampling approach is proposed. The experiments demonstrated that the KNN undersampling method outperformed other sampling methods. The proposed method also outperformed the results of other studies, indicating that the simplicity of KNN can serve as a base for efficient algorithms in machine learning and knowledge discovery.

When dealing with supervised learning, one of the main problems in classification tasks lies in the treatment of datasets where one or more classes have a minority of the instances. This condition denotes an imbalanced dataset, which causes the algorithm to incorrectly classify an instance from the minority class as belonging to the majority class; in highly skewed datasets, this is also known as the “needle in a haystack” problem [

Learning from imbalanced datasets is still considered an open problem in data mining and knowledge discovery, and demands continued attention from the scientific community [

This work focuses on data adjusting algorithms, and a KNN undersampling (KNN-Und) algorithm is proposed. KNN-Und is a very simple algorithm: basically, it uses the neighbor count to remove instances from the majority class. Despite its simplicity, the classification experiments performed with KNN-Und balancing resulted in better performance of G-Mean [

This paper is organized as follows. In Section 2 a literature review of KNN balancing methods is presented; in Section 3 the KNN-Und methodology is explained in more detail. In Section 4 the experiments conducted in this work are presented, compared and discussed, followed by the conclusions.

Imbalanced Dataset Definition

This section establishes some notation that will be used in this work.

Given the training set T with m examples and n attributes, where

The subset of P created by sampling methods will be denoted S. The pre-processing strategies applied to datasets aim to balance the training set T, such as

Over the years, great effort has been made by the scientific community to solve or mitigate the imbalanced dataset problem. Specifically for KNN, there are several balancing methods based on this algorithm. This section provides a bibliographic review of KNN and its derived algorithms for dataset balancing. The random oversampling and undersampling methods, the class overlapping problem, and evaluation measures will also be reviewed.

The k Nearest Neighbor (KNN) is a supervised classifier algorithm and, despite its simplicity, it is considered one of the top 10 data mining algorithms [

KNN is a nonparametric lazy learning algorithm. It is nonparametric because it does not make any assumptions about the underlying data distribution. Most practical data in the real world does not obey the typical theoretical assumptions (for example, Gaussian mixtures, linear separability, etc.). Nonparametric algorithms like KNN are more suitable in these cases [

It is also considered a lazy algorithm. A lazy algorithm works with a nonexistent or minimal training phase but a costly testing phase. For KNN, this means the training phase is fast, but all the training data is needed during the testing phase, or at least a subset with the most representative data must be present. This contrasts with other techniques like SVM, where all nonsupport vectors can be discarded.

The classification algorithm is performed according to the following steps:

1. Calculate the distance (usually Euclidean) between an instance x_{i} and all instances of the training set T;

2. Select the k nearest neighbors;

3. The x_{i} instance is classified (labeled) with the most frequent class among the k nearest neighbors. It is also possible to use the neighbors' distance to weight the classification decision.
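The three steps can be sketched in a few lines. The following is an illustrative implementation (the function name and toy data are ours, not the paper's):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    """Classify x by majority vote among its k nearest neighbors."""
    # Step 1: Euclidean distance from x to every training instance
    dists = np.linalg.norm(X_train - x, axis=1)
    # Step 2: indices of the k nearest neighbors
    nearest = np.argsort(dists)[:k]
    # Step 3: most frequent class among those neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy training set: two well-separated classes
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X, y, np.array([0.15, 0.1]), k=3))  # → 0
```

A distance-weighted variant would replace the plain vote with, for example, a sum of 1/d contributions per class.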

The value of k is training-data dependent. A small value of k means that noise will have a higher influence on the result. A large value makes the algorithm computationally expensive and defeats the basic philosophy behind KNN: points that are close are likely to have similar densities or classes. Odd values of k are typically found in the literature, normally k = 5 or k = 7, and [

The algorithm may use other distance metrics than Euclidean [

The SMOTE algorithm proposed by [

SMOTE performs the balancing of a set P of minority instances, creating n synthetic instances from each instance
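The core idea of SMOTE, interpolating between a minority instance and one of its minority-class neighbors, can be sketched as below. This is a simplified illustration under our own assumptions (neighbors are searched within the minority set only, and the interpolation gap is uniform in [0, 1)), not the reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def smote(P, n_synth_per_instance=1, k=3):
    """Minimal SMOTE sketch: for each minority instance, create
    synthetic points along the segment toward a random one of its
    k nearest minority neighbors."""
    synthetic = []
    for x in P:
        # k nearest neighbors within the minority set (excluding x itself)
        dists = np.linalg.norm(P - x, axis=1)
        neighbors = np.argsort(dists)[1:k + 1]
        for _ in range(n_synth_per_instance):
            nb = P[rng.choice(neighbors)]
            gap = rng.random()              # random position on the segment
            synthetic.append(x + gap * (nb - x))
    return np.array(synthetic)

P = np.array([[1.0, 1.0], [1.5, 1.2], [1.2, 1.6], [0.8, 1.4]])
S = smote(P, n_synth_per_instance=2, k=2)
print(S.shape)  # → (8, 2)
```

Every synthetic point is a convex combination of two real minority instances, so the new instances stay inside the region already occupied by the minority class.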

An extensive comparison of several oversampling and undersampling methods was performed in [

According to the experiments conducted in [

The ENN method proposed by [

1. Obtain the k nearest neighbors of

2.

3. The process is repeated for every majority instance of the subset N.
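Step 2 is not reproduced above; the sketch below assumes the usual Wilson editing rule, in which a majority instance is removed when the majority vote of its k nearest neighbors disagrees with its label (the function name and data are illustrative):

```python
import numpy as np
from collections import Counter

def enn_undersample(X, y, majority_label, k=3):
    """Assumed ENN rule: drop a majority-class instance when the
    majority vote of its k nearest neighbors disagrees with its label."""
    keep = []
    for i in range(len(X)):
        if y[i] != majority_label:
            keep.append(i)                      # minority instances always kept
            continue
        dists = np.linalg.norm(X - X[i], axis=1)
        neighbors = np.argsort(dists)[1:k + 1]  # exclude the instance itself
        vote = Counter(y[neighbors]).most_common(1)[0][0]
        if vote == y[i]:
            keep.append(i)                      # correctly classified -> keep
    return X[keep], y[keep]

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # clean majority cluster
              [5.0, 5.0],                            # noisy majority instance
              [5.0, 5.1], [5.1, 5.0], [4.9, 5.0]])   # minority cluster
y = np.array([0, 0, 0, 0, 1, 1, 1])
Xc, yc = enn_undersample(X, y, majority_label=0, k=3)
print(len(yc))  # → 6 (the noisy majority instance is removed)
```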

According to the experiments conducted in [

The Neighbor Cleaning Rule (NCL) proposed by [

This is one of the simplest strategies for dataset adjusting, and basically consists of the random removal (undersampling) and addition (oversampling) of instances. For oversampling, instances of the positive set P are randomly selected, duplicated and added to the set T. For undersampling, instances from the negative set N are randomly selected for removal.
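Both random strategies amount to index sampling. A minimal sketch (names and example data are ours):

```python
import numpy as np

rng = np.random.default_rng(42)

def random_undersample(X, y, majority_label, n_remove):
    """Randomly drop n_remove instances of the majority class."""
    maj_idx = np.flatnonzero(y == majority_label)
    drop = rng.choice(maj_idx, size=n_remove, replace=False)
    keep = np.setdiff1d(np.arange(len(y)), drop)
    return X[keep], y[keep]

def random_oversample(X, y, minority_label, n_add):
    """Randomly duplicate n_add instances of the minority class."""
    min_idx = np.flatnonzero(y == minority_label)
    dup = rng.choice(min_idx, size=n_add, replace=True)
    return np.vstack([X, X[dup]]), np.concatenate([y, y[dup]])

X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)           # 8 majority, 2 minority
Xu, yu = random_undersample(X, y, majority_label=0, n_remove=4)
Xo, yo = random_oversample(X, y, minority_label=1, n_add=3)
print((yu == 0).sum(), (yo == 1).sum())   # → 4 5
```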

Although both strategies operate similarly and bring some benefit compared with simply classifying without any preprocessing [

According to the experiments of [

In supervised learning, it is necessary to use some measure to evaluate the results obtained with a classifier algorithm. The confusion matrix from

The confusion matrix is able to represent either two-class or multiclass problems. Nevertheless, the research and literature related to imbalanced datasets is concentrated on two-class problems, also known as binary or binomial problems, in which the less frequent class is named positive, and the remaining classes are merged and named negative.

|  | Positive prediction | Negative prediction |
| --- | --- | --- |
| Positive class | True Positive (TP) | False Negative (FN) |
| Negative class | False Positive (FP) | True Negative (TN) |

Some of the best-known measures derived from this matrix are the error rate (3) and the accuracy (4). Nevertheless, such measures are not appropriate for evaluating imbalanced datasets, because they do not take into account the number of examples distributed among the classes. On the other hand, there are measures that compensate for this disproportion in their calculation. The Precision (5), Recall (7) and F-Measure [
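These measures follow directly from the four cells of the confusion matrix. A small worked example (the numbers are ours) shows why accuracy is misleading under imbalance:

```python
def confusion_measures(tp, fn, fp, tn):
    """Measures derived from the binary confusion matrix:
    error rate, accuracy, precision, recall and F-Measure."""
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                  # a.k.a. sensitivity
    return {
        "error_rate": (fp + fn) / total,
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
    }

# Highly imbalanced example: accuracy looks excellent while recall is poor
m = confusion_measures(tp=10, fn=40, fp=5, tn=945)
print(m["accuracy"], m["recall"])  # → 0.955 0.2
```

With 50 positive and 950 negative examples, a classifier that misses 80% of the positives still reaches 95.5% accuracy, which is exactly the disproportion the text describes.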

In this work, both classes are considered of equal importance; therefore, the measures G-Mean and AUC will be used to evaluate the experiments.

The G-Mean [

The Receiver Operating Characteristics (ROC) chart, also called the ROC Curve, has been applied in signal detection analysis since the Second World War, and more recently in data mining and classification. It consists of a two-dimensional chart where the y-axis refers to Sensitivity, or Recall (7), and the x-axis is calculated as 1 − Specificity (8). According to [

The AUC measure (9) synthesizes as a single scalar the information represented by a ROC chart, and is insensitive to class imbalance.

where
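Since equations (6)–(9) are not reproduced here, the sketch below uses the standard definitions: G-Mean as the geometric mean of sensitivity and specificity, and, for a discrete classifier evaluated at a single (FPR, TPR) point, AUC = (1 + TPR − FPR)/2:

```python
import math

def g_mean(tp, fn, fp, tn):
    """Geometric mean of sensitivity (TPR) and specificity (TNR)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)

def auc_single_point(tp, fn, fp, tn):
    """Area under the ROC 'curve' through a single (FPR, TPR) point:
    (1 + TPR - FPR) / 2, the usual form for a discrete classifier."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return (1 + tpr - fpr) / 2

# Same imbalanced example as before: high accuracy but a mediocre G-Mean
print(round(g_mean(10, 40, 5, 945), 3))  # → 0.446
```

Because both measures multiply or average the per-class rates, a classifier that ignores the minority class scores poorly regardless of how large the majority class is.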

The KNN-Und method removes instances from the majority classes based on their k nearest neighbors, and works according to the steps below:

1. Obtain the k nearest neighbors for

2.

3. The process is repeated for every majority instance of the subset N.

The parameter t defines the minimum count of neighbors around
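Step 2 is not reproduced above, so the sketch below assumes the removal rule suggested by the surrounding description: a majority instance is removed when at least t of its k nearest neighbors belong to the minority class. This is our reading, not the authors' exact code:

```python
import numpy as np

def knn_und(X, y, majority_label, k=3, t=1):
    """KNN-Und sketch (assumed rule): drop a majority-class instance
    when at least t of its k nearest neighbors are minority instances."""
    keep = []
    for i in range(len(X)):
        if y[i] != majority_label:
            keep.append(i)                      # minority instances always kept
            continue
        dists = np.linalg.norm(X - X[i], axis=1)
        neighbors = np.argsort(dists)[1:k + 1]  # exclude the instance itself
        if np.sum(y[neighbors] != majority_label) < t:
            keep.append(i)
    return X[keep], y[keep]

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # majority cluster
              [4.5, 5.0],                            # majority, near minority
              [5.0, 5.1], [5.1, 5.0], [4.9, 5.0]])   # minority cluster
y = np.array([0, 0, 0, 0, 1, 1, 1])
Xk, yk = knn_und(X, y, majority_label=0, k=3, t=1)
print(len(yk))  # → 6
```

Under this rule, with t = 1 a single minority neighbor is enough to trigger removal, whereas ENN only removes an instance when the neighbor vote outnumbers its label, which is consistent with the more aggressive removal behavior described in the text.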

Compared with ENN, KNN-Und has a more aggressive behavior in terms of instance removal, because KNN-Und does not depend on a wrong prediction of KNN to remove an instance

KNN-Und can be considered a very simple algorithm, and has the advantage of being a deterministic method since, unlike other methods, it has no random component. In the literature review, only one study [

In this section, the experiments to validate the applicability of KNN-Und are conducted. The

In all datasets and algorithms that use KNN, the parameter k was determined according to (1). The parameter t was adjusted to control the undersampling level of the KNN-Und method; in most datasets with IR < 2, this parameter was set to t > 1 to prevent excessive undersampling. The

The classification results in terms of AUC and G-Mean are presented in

Those tables show the results averaged over 10 runs and the standard deviation between parentheses for the 33 datasets.

The classification results in terms of AUC with KNN-Und data preparation were compared with our previous work [

The same experiment setup was applied for the C4.5 classifier without balancing (as a baseline comparison) and for the other balancing methods: SMOTE, ENN, NCL and Random Undersampling. The last columns of

This last classifier was included to make a comparison with the evolutionary algorithm EBUS-MS-GM developed in [11].

The best result for each dataset is marked in bold.

Analyzing the AUC results in Table 3, it can be observed that KNN-Und with the C4.5 algorithm outperformed the four other sampling methods in 19 of 33 datasets, with one additional dataset tied. Compared with our previous results published in [10], KNN-Und outperformed in 11 of 15 datasets.

Figure 2 illustrates the results of AUC using the balancing methods with the C4.5 classifier for the 33 datasets. It shows the AUC values of KNN-Und (in green) at or near the top for all datasets.

The results in terms of G-Mean (Table 4) show that KNN-Und outperformed in 20 of 33 datasets, with one tie. Unlike the GA, SMOTE and Random Undersampling methods, KNN-Und, C4.5, ENN, and NCL have a deterministic behavior, which leads to more stable results with standard deviations equal to 0. The second best results were obtained with the NCL algorithm, but excessive undersampling was observed in datasets with IR < 2, which led to G-Mean values of 0.

Table 5 summarizes the count of the best results of the balancing methods with C4.5 classifier in terms of AUC and G-Mean. The KNN-Und has the highest scores.

These results can be explained by the fact that KNN-Und removes instances from the majority classes while cleaning the decision surface, reducing the class overlapping. Figure 3 and Figure 4 show scatter plots of the datasets EcoliIMU and Satimage4, before and after the balancing methods. The points in blue belong to the majority class, and the points in red to the minority class. These plots show the behavior of the methods, as described previously. The SMOTE algorithm performs a homogeneous distribution of synthetic instances around each positive instance, the ENN removes negative instances around the positive instances, and KNN-Und performs a more aggressive removal of negative instances in the decision surface region.

G-Mean and AUC values were not published per dataset for the evolutionary algorithm EBUS-MS-GM in [11], so another comparison was done with the available results, that is, the average and standard deviation of G-Mean and AUC for the 28 evaluated datasets. Table 6 compares the average results of the KNN-Und and EBUS-MS-GM methods. The KNN-Und results are at least 13 points higher than those of EBUS-MS-GM. It is not reasonable to compare standard deviations here, as the 28 datasets have independent results in both cases. One explanation for the high values obtained would be the 1-NN classifier used for comparison, which uses a decision boundary similar to KNN-Und; but the results for KNN-Und with the C4.5 decision tree also had higher average values, showing that KNN-Und can also improve classification results with other algorithms.

This work presented a proposal of an algorithm, KNN-Und, to adjust datasets with an imbalanced number of instances among the classes, also known as imbalanced datasets. The proposed method is an undersampling method based on the KNN algorithm, removing instances from the majority class based on the count of neighbors of different classes. The classification experiments conducted with the KNN Undersampling method on 33 datasets outperformed the results of six other methods: two studies based on evolutionary algorithms, and the SMOTE, ENN, NCL and Random Undersampling methods.

The good results obtained with KNN Undersampling can be explained by the fact that KNN-Und removes instances from the majority classes, reducing the “needle in a haystack” effect while, at the same time, cleaning the decision surface, reducing class overlapping and removing noisy examples. These results indicate that the simplicity of KNN can serve as a base for constructing efficient algorithms in machine learning and knowledge discovery. They also show that the selective removal of instances from the majority class is an interesting path to follow, rather than generating instances to balance datasets. This issue is important nowadays, as datasets approach petabyte sizes with big data, and retaining only the representative data can be better than creating more data.