An Improved Algorithm for Imbalanced Data and Small Sample Size Classification

Traditional classification algorithms perform poorly on imbalanced data sets and on data sets with a small sample size. To deal with these problems, a novel method is proposed that changes the class distribution by adding virtual samples generated by the windowed regression over-sampling (WRO) method. WRO reflects not only the additive effects but also the multiplicative effects between samples. A comparative study between the proposed method and other over-sampling methods, such as the synthetic minority over-sampling technique (SMOTE) and borderline over-sampling (BOS), is provided on UCI data sets and a Fourier transform infrared spectroscopy (FTIR) data set. Experimental results show that the WRO method can achieve better performance than the other methods.


Introduction
Imbalanced data sets [1] can cause traditional data mining algorithms to behave undesirably, because these algorithms do not take the distribution of the data set into consideration. Under extreme imbalance, a trivial learning algorithm may skew the decision boundary toward the minority class, so new minority test samples are likely to be misclassified. Various methods for dealing with this problem have been proposed recently. The first type of method focuses on data processing: removing a number of samples from the majority class (under-sampling) or adding new samples to the minority class (over-sampling). The former methods [2] have the drawback that they may lose relevant information. The latter methods [3] add synthetic samples until the desired class ratio is attained: Chawla et al. [3] over-sample the minority class with the synthetic minority over-sampling technique (SMOTE). Nguyen et al. [4] propose the borderline over-sampling (BOS) method, in which only the minority samples near the borderline are over-sampled. The second type of method focuses on modifying existing classification algorithms. For support vector machines (SVM), proposals such as using different weighting constants for different classes [5] or adjusting the class boundary based on the kernel-alignment idea [6] have been reported. Huang et al. [7] present the biased minimax probability machine (BMPM) to resolve the imbalance problem. Furthermore, there are other effective methods such as cost-sensitive learning [8] and one-class learning [9].
In particular tasks such as face recognition (FR) [10], the number of available training samples is usually much smaller than the dimensionality of the sample space. Consequently, the biggest challenge that all linear discriminant analysis (LDA)-based approaches have to face is the "small sample size" (SSS) problem. Such problems are often ill-posed. There are several ways to address the problem. The first option is to apply linear algebra techniques to solve the numerical problem of inverting the singular within-class scatter (WCS) matrix. The second option is to use feature extraction-based methods, such as the well-known Fisherfaces method [11]. However, the discarded null space may contain significant discriminatory information, and this will further affect the formation of the classifier. The third option is over-sampling: we can over-sample the training samples so that the number of samples is comparable with the dimensionality of the sample space, which makes the WCS matrix nonsingular.
We address the imbalance problem and the SSS problem through data processing. To deal with both problems, we propose a windowed regression over-sampling (WRO) method. In this method, the virtual samples are generated according to the difference between adjacent samples. In contrast to the SMOTE and BOS methods, the difference is estimated by least-squares regression in a local window instead of over the whole sample. Moreover, both additive and multiplicative effects between samples are considered in the WRO algorithm.

Weighting Support Vector Machines for Classification
The objective of SVM training is to find the optimal hyperplane that separates the positive and negative classes with a maximum margin [12]. With different error costs for the two classes, the weighted SVM (WSVM) solves

$$\min_{\omega, b, \xi} \; \frac{1}{2}\|\omega\|^2 + C^{+} \sum_{i:\, y_i = +1} \xi_i + C^{-} \sum_{i:\, y_i = -1} \xi_i$$

subject to

$$y_i\left(\omega \cdot \phi(x_i) + b\right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n,$$

where $\omega$ and $b$ are the weight vector and the bias of the hyperplane, respectively, $\xi_i$ indicates the degree of location violation of the i-th training sample, and $C^{+}$ and $C^{-}$ are the different error costs for the minority and majority classes.

$$K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$$

is a kernel function that makes it possible to compute dot products in the feature space without knowing the mapping $\phi$. In this paper, we use the RBF kernel

$$K(x_i, x_j) = \exp\left(-\gamma \|x_i - x_j\|^2\right),$$

where $\gamma$ is a width parameter that controls the radial scope. There are no guidelines for deciding what the ratio of the minority to majority cost factors should be; we empirically set the cost ratio to the inverse of the imbalance ratio, and that is what we have used in this paper. However, WSVM is sensitive to the minority samples and obtains stronger cues about the orientation of the plane from the minority samples than from the majority samples. If the minority samples are sparse, as in imbalanced data sets, then the boundary may not have the proper shape in the input space [13].
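The RBF kernel and the empirical cost setting described above can be sketched in a few lines. This is only an illustration, not the LIBSVM code used in the experiments. It assumes that the label +1 marks the minority class, that the intended reading of the cost rule is that the minority class receives the larger cost (C+ / C- equal to the number of majority samples divided by the number of minority samples), and it uses the hypothetical names `rbf_kernel`, `class_costs`, and `base_cost`.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def class_costs(labels, base_cost=1.0):
    """Return (C_plus, C_minus) for a two-class problem.

    Assumptions (not stated in this form in the paper): +1 is the
    minority class, -1 the majority class, base_cost is the cost of the
    majority class, and C_plus / C_minus = n_majority / n_minority.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == -1)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    return base_cost * n_neg / n_pos, base_cost


if __name__ == "__main__":
    labels = [1] * 2 + [-1] * 8
    print(class_costs(labels))
    print(rbf_kernel([1.0, 0.0], [0.0, 0.0], gamma=1.0))
```

With a solver that accepts per-class costs, the two returned values would be passed as the costs of the positive and negative class, while C and gamma themselves would still be tuned by cross-validation.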

The Proposed Algorithm
To solve the imbalance problem, an appropriate number of virtual samples is added to the minority class according to the sampling level; to solve the SSS problem, we generate virtual samples so that the size and the dimensionality of the training set are comparable to a certain extent. The basic idea is as follows. Let $X_{n \times m}$ be a sample matrix whose rows and columns correspond to samples and variables, respectively, and denote the $n$ samples as $x_1, x_2, \ldots, x_n$. We produce more virtual samples in the dense region and fewer in the sparse region: we calculate the mean of the samples in the category and denote it as $\bar{x}$, compute the distance $\|x_i - \bar{x}\|_2$, $i = 1, \ldots, n$, between the mean and each sample, and obtain the normalized weight vector $W = (w_1, w_2, \ldots, w_n)$ over the samples. The weight $w_i$ reflects the distribution of the i-th sample in the training set. Given the sampling level $p$, the number of virtual samples $T_i$, $i = 1, \ldots, n$, to generate for each sample is computed from its weight according to Equation (5), where $[\cdot]$ denotes rounding to the nearest integer.
The details of generating virtual samples are as follows. First, for each sample $x_i$, we compute its $k$ nearest neighbors and denote them as $y_{i1}, \ldots, y_{ik}$, then obtain the regression coefficients in a local window according to Equation (6), where $w_j$ is a local window centered at variable $j$, and $a_{ik}$ and $b_{ik}$ are the regression coefficients in the local window in the least-squares sense. With the window sliding over the sample $x_i$ and its neighbor $y_{ik}$ correspondingly, we obtain a series of regression coefficient pairs $a_{ik}$ and $b_{ik}$, as shown in Figure 1. To reduce the impact of noise on the regression coefficients, we smooth them with a Savitzky-Golay filter [14]. Finally, we randomly select a pair of coefficients and combine them with $y_{ik}$ to generate a new sample $x_{\text{new}}$ according to Equation (7).
The WRO algorithm is therefore summarized as follows:
Input: sample matrix $X_{n \times m}$, window width $l$, number of virtual samples to generate $T$, number of neighbors $k$.
Output: virtual samples $x_{\text{new}}$.
1) Compute the number of virtual samples $T_i$ to generate for each sample $x_i$, $i = 1, \ldots, n$, according to Equation (5).
2) Find the $k$ nearest neighbors of each sample $x_i$. Obtain the regression coefficient sets $a_{ik}$ and $b_{ik}$ from the sample $x_i$ and its $k$ nearest neighbors according to Equation (6).
3) Smooth the regression coefficient sets with the Savitzky-Golay filter.
4) Generate new samples according to Equation (7).
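The steps above can be sketched as follows. This is a minimal illustration under explicit assumptions that are not taken from the paper's equations: the weight of a sample is assumed to be proportional to 1 / (1 + distance to the class mean), the allocation is assumed to be round(w_i * T), the local regression is assumed to fit y ≈ a + b·x over the variables inside each window, a new sample is assumed to be a + b·y per variable, and the Savitzky-Golay smoothing step is omitted for brevity. All function names are hypothetical.

```python
import math
import random


def local_window_fit(x, y, width):
    """Per-variable least-squares fit y ~ a + b * x inside a sliding window.

    Returns two lists (a, b), one coefficient pair per variable.
    """
    m, half = len(x), width // 2
    a, b = [], []
    for j in range(m):
        lo, hi = max(0, j - half), min(m, j + half + 1)
        xs, ys = x[lo:hi], y[lo:hi]
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        sxx = sum((u - mx) ** 2 for u in xs)
        sxy = sum((u - mx) * (v - my) for u, v in zip(xs, ys))
        # Flat window: fall back to a purely additive relation (slope 1).
        slope = sxy / sxx if sxx > 1e-12 else 1.0
        b.append(slope)
        a.append(my - slope * mx)
    return a, b


def allocation(samples, total):
    """Split `total` virtual samples over the samples (assumed weighting)."""
    n, m = len(samples), len(samples[0])
    mean = [sum(s[j] for s in samples) / n for j in range(m)]
    # Assumption: samples closer to the class mean get a larger weight.
    inv = [1.0 / (1.0 + math.dist(s, mean)) for s in samples]
    z = sum(inv)
    return [round(total * v / z) for v in inv]


def wro_oversample(samples, total, k=5, width=3, seed=0):
    """Generate virtual samples; smoothing of the coefficients is omitted."""
    rng = random.Random(seed)
    counts = allocation(samples, total)
    new_samples = []
    for i, x in enumerate(samples):
        others = [s for idx, s in enumerate(samples) if idx != i]
        neighbors = sorted(others, key=lambda s: math.dist(s, x))[:k]
        if not neighbors:
            continue
        fits = [local_window_fit(x, y, width) for y in neighbors]
        for _ in range(counts[i]):
            pick = rng.randrange(len(neighbors))
            a, b = fits[pick]
            y = neighbors[pick]
            # Multiplicative (b) and additive (a) effects applied per variable.
            new_samples.append([a[j] + b[j] * y[j] for j in range(len(y))])
    return new_samples


if __name__ == "__main__":
    minority = [[0.0, 0.1, 0.3], [0.2, 0.2, 0.5], [0.1, 0.4, 0.4], [0.3, 0.3, 0.6]]
    for v in wro_oversample(minority, total=6, k=2, width=3):
        print(v)
```

Because of rounding, the number of generated samples equals the sum of the per-sample counts, which may differ slightly from the requested total.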
Many over-sampling algorithms, such as SMOTE and BOS, reflect only the additive effect between samples, whereas WRO also reflects the multiplicative effect, as Equation (7) shows, and all of these effects are computed in a local region rather than over the whole region. WRO can enlarge the decision regions and improve the prediction of the minority class without sacrificing the accuracy on the whole test set.
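For contrast, the purely additive update used by SMOTE can be written in one line: a synthetic point is the original sample plus a random fraction of the difference to one of its minority neighbors. The sketch below shows only this interpolation step (neighbor search is left out), and the function name `smote_point` is hypothetical.

```python
import random


def smote_point(x, neighbor, rng):
    """Return x + gap * (neighbor - x) with gap drawn uniformly from [0, 1)."""
    gap = rng.random()
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]


if __name__ == "__main__":
    rng = random.Random(0)
    print(smote_point([0.0, 0.0], [2.0, 4.0], rng))
```

Every synthetic point lies on the line segment between the sample and its neighbor, so no per-variable scaling is involved.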

Materials
Two data sets from the UCI machine learning repository [15], Glass (7) and Yeast (5), are used in the experiments. The numbers in parentheses indicate which class is chosen as the minority class; all of the remaining classes are combined to create the majority class. We also use 500 Fourier transform infrared (FTIR) spectra as a small sample size data set. The FTIR spectra in the region 4000-650 cm⁻¹ were recorded with a Perkin-Elmer Spectrum GX FTIR spectrometer equipped with the Universal ATR sampling accessory. The details of the UCI data sets and the FTIR data set are provided in Table 1. "Imbalance" indicates the ratio between the majority class and the minority class.

Experimental Results
The programs are written in-house in Matlab R2012a and run on a personal computer with a 2.20 GHz Intel Core 2 processor, 4 GB RAM, and a Windows 7 operating system.

Evaluation Measures
The evaluation measures used for imbalanced sample classification in our experiments are based on the confusion matrix [16]. Table 2 illustrates a confusion matrix for a two-class problem with a positive (minority) and a negative (majority) class. Let TP and FN denote the numbers of positive samples that are classified correctly and incorrectly, and TN and FP the numbers of negative samples that are classified correctly and incorrectly. With this matrix, the G-mean is expressed as

$$G\text{-mean} = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}},$$

which is based on the recalls on both classes. The benefit of selecting this metric is that it measures how balanced the combination scheme is: if a classifier is highly biased toward one class (such as the majority class), the G-mean value is low, and the measure does not depend on the class distribution of the training set. In addition, the F-value combines the recall and the precision on the minority class and measures the overall performance on the minority class. For imbalanced data sets, we apply the G-mean and the F-value as the evaluation measures; for the SSS problem, we apply only the prediction accuracy.
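As a small worked example, the measures can be computed directly from the four confusion-matrix counts. The sketch assumes that the positive class is the minority class and that the F-value weights precision and recall equally (the harmonic mean, i.e. beta = 1), which the text does not state explicitly. The function name `imbalance_metrics` is hypothetical.

```python
import math


def imbalance_metrics(tp, fn, fp, tn):
    """Compute G-mean, F-value (beta = 1 assumed) and accuracy from counts."""
    recall_pos = tp / (tp + fn) if tp + fn else 0.0   # recall on the minority class
    recall_neg = tn / (tn + fp) if tn + fp else 0.0   # recall on the majority class
    precision = tp / (tp + fp) if tp + fp else 0.0
    g_mean = math.sqrt(recall_pos * recall_neg)
    denom = precision + recall_pos
    f_value = 2.0 * precision * recall_pos / denom if denom else 0.0
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return {"g_mean": g_mean, "f_value": f_value, "accuracy": accuracy}


if __name__ == "__main__":
    # A classifier that labels everything as majority: high accuracy, zero G-mean.
    print(imbalance_metrics(tp=0, fn=10, fp=0, tn=90))
    print(imbalance_metrics(tp=8, fn=2, fp=4, tn=86))
```

The first call illustrates why accuracy alone is misleading on imbalanced data: a classifier that never predicts the minority class still reaches 90% accuracy while its G-mean and F-value are zero.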

Experimental Results and Discussions
For the imbalanced data sets, we compare the proposed method WRO with the WSVM method [5] and with other over-sampling methods, namely SMOTE and BOS. For the SSS problem, we compare WRO with the standard SVM and with a PCA feature extraction-based method. The code for SVM and WSVM is taken from the LIBSVM package [17], and the Gaussian RBF kernel is used in the following experiments. We empirically set l = 3 for the width of the sliding window and k = 5 for the number of neighbors in the WRO method. To reduce the effect of randomness in the division of the data and in the sampling, each method is run ten times and the average performance is calculated. Each run consists of: 1) randomly splitting the samples of the two classes into training and testing sets with the ratio 7.5:2.5; 2) for the imbalance problem, over-sampling the minority class samples in the training data with the different methods, and for the SSS problem, over-sampling the samples of both classes in the training data with the different methods; 3) performing 5-fold cross-validation on the over-sampled training data to estimate the optimal parameters C and γ from Equation (3); 4) training the SVM classifier; 5) predicting on the test set. Sampling levels are selected according to the imbalance or the relationship between size and dimension of each data set. These over-sampling levels are described in Table 1.
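The repeated random-split protocol can be sketched as below. It assumes that the 7.5:2.5 split is applied within each class (a stratified split), which the text suggests but does not state outright. The `run_once` callback is a hypothetical placeholder for steps 2 to 5 (over-sampling, cross-validation, training, and prediction) and is expected to return one score per run.

```python
import random


def stratified_split(labels, train_fraction=0.75, rng=None):
    """Return (train_indices, test_indices), splitting each class separately."""
    rng = rng or random.Random()
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(train_fraction * len(idx))
        train += idx[:cut]
        test += idx[cut:]
    return train, test


def repeated_evaluation(labels, run_once, repeats=10, seed=0):
    """Average the score returned by run_once(train, test) over random splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        train, test = stratified_split(labels, 0.75, rng)
        scores.append(run_once(train, test))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    labels = [1] * 8 + [-1] * 40
    # Dummy callback: reports the fraction of samples held out for testing.
    print(repeated_evaluation(labels, lambda tr, te: len(te) / (len(tr) + len(te))))
```

Over-sampling is applied inside `run_once`, that is, only to the training indices, so that no virtual sample derived from a test sample leaks into training.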
Results for Glass are shown in Figure 2. The proposed method WRO achieves a better G-mean than the other three methods (WSVM, SMOTE, BOS) at almost all sampling levels, and as the over-sampling level grows, the F-value of WRO is comparable with that of the other three methods. For the Yeast data set, Figure 3 shows that the three over-sampling methods perform well compared to WSVM in terms of G-mean. This may be due to the serious imbalance (Imbalance = 27) of this data set: WSVM is sensitive to the minority samples and obtains stronger cues about the orientation of the plane from the minority samples than from the majority samples, which causes most of the minority samples to be misclassified. After over-sampling the minority class, the three over-sampling methods improve the results in terms of G-mean, and the F-value is significantly improved with our method WRO, because the precision obtained with WRO is better than that of the other three methods.
Figure 4 shows the SSS classification problem, in which the dimensionality of the sample space is much higher than the number of training samples. Without over-sampling of the training set, the prediction accuracy with SVM is about 86%. After applying PCA and keeping the first ten features, the prediction accuracy with SVM is about 88%. The accuracy is improved with the SMOTE and WRO methods at an appropriate over-sampling level. The choice of the over-sampling level p affects the prediction accuracy of the different over-sampling methods: when p is small, we can get better neighbors for the over-sampling process, so the prediction accuracy can be dramatically improved; when p is large, more training samples are generated, so less information is lost, but more noise is likely to be introduced. Consequently, p is a tradeoff between introducing more noise and losing less information. Nonetheless, our method WRO is comparable with the SMOTE method at almost all values of p.

Conclusion
In this paper, we have addressed the imbalanced data and SSS classification problems. To solve these problems, we propose a new over-sampling method based on windowed regression. Experimental results on two UCI data sets and one FTIR data set demonstrate the effectiveness of the proposed algorithm. A limitation is that the algorithm has many parameters. Moreover, the regression coefficients are solved for in each local window, so the computational efficiency is not high. We plan to study both issues in future work.


Figure 1. Obtaining the regression coefficients with the sliding window between samples.

Figure 2. G-mean and F-value performance on the Glass data set at different sampling levels.

Figure 4. Accuracy on the FTIR data set at different sampling levels.

Table 1. Data sets used for the experiment.