
To address the problem that the hyper-parameters of existing random forest-based classification prediction models depend on empirical settings, which leads to unsatisfactory model performance, we propose a random forest model based on an adaptive particle swarm optimization algorithm for data classification, using the adaptive particle swarm algorithm to optimize the hyper-parameters of the random forest so that the model can better predict unbalanced data. To address the premature convergence problem of the particle swarm optimization algorithm, the population is adaptively divided according to the fitness of the population, and an adaptive update strategy is introduced to enhance the ability of particles to jump out of local optima. The main steps of the model are as follows: normalize the data set, initialize the model on the training set, and then use the particle swarm optimization algorithm to optimize the modeling process to establish a classification model. Experimental results show that our proposed algorithm outperforms traditional algorithms, especially on the F1-measure and ACC evaluation criteria. The results on six KEEL imbalanced data sets demonstrate the advantages of our proposed algorithm.

The problem of unbalanced data classification often arises in the field of data classification, such as in bioinformatics, intrusion detection systems, and classification problems [

The most commonly used methods to solve the class imbalance problem are: 1) resampling methods [

Among them, the Random Forest (RF) algorithm is a bagging ensemble learning algorithm based on the random subspace method, proposed by Breiman et al. [

In response to the poor performance of the random forest model on unbalanced data due to unreasonable hyper-parameter settings, we use the adaptive particle swarm optimization (APSO)-RF model for data classification to obtain high-precision predictions. We use the idea of clustering [

In this section, we introduce related work on RF and PSO techniques.

Classification and Regression Tree (CART) is an inductive learning algorithm for a single classification and regression tree, composed of a root node, leaf nodes, and non-leaf nodes. The decision tree generates paths from the root node to the leaf nodes through regression analysis on the training set and analyzes the path rules; new instances are classified or predicted according to these path rules. CART is based on information entropy and uses the minimum Gini coefficient principle to split nodes. The input space of the training set D = {(x_{1}, y_{1}), (x_{2}, y_{2}), ⋯, (x_{n}, y_{n})} is divided into regions; each sample is recursively assigned to the corresponding region, and a determined output value is obtained. The steps of the algorithm are as follows:

1) Assume that the feature of the independent variable is j and the value of this feature is s, and that the value s divides the space of feature j into two regions, as follows:

R_{1}(j, s) = {x | x^{(j)} ≤ s}, R_{2}(j, s) = {x | x^{(j)} > s} (1)

2) Traverse each candidate segmentation point (j, s) in turn, calculate its loss function (LF), and select the segmentation point with the smallest loss:

LF = min_{j,s} [ min_{c_{1}} Σ_{x_{i} ∈ R_{1}(j,s)} (y_{i} − c_{1})^{2} + min_{c_{2}} Σ_{x_{i} ∈ R_{2}(j,s)} (y_{i} − c_{2})^{2} ] (2)

Among them, c_{1} and c_{2} are the average output values in the regions R_{1} and R_{2}, respectively.

3) Continue splitting each resulting region in turn until no further division is possible.

4) Divide the input space into M parts R_{1}, R_{2}, ⋯, R_{M} to generate the final decision tree as

f(x) = Σ_{m=1}^{M} c_{m} I(x ∈ R_{m}) (3)
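As an illustration, the exhaustive search over the split point (j, s) in Equations (1)-(2) can be sketched in a few lines of Python. This is a minimal sketch: the function name `best_split` and the toy data are our own illustrative choices, not part of the original algorithm description.

```python
import numpy as np

def best_split(X, y):
    """Exhaustively search for the split (j, s) minimizing Eq. (2)."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):                # candidate feature j
        for s in np.unique(X[:, j]):           # candidate threshold s
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            # c1, c2 are the region means; the loss is the summed squared error
            loss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if loss < best[2]:
                best = (j, s, loss)
    return best

# Two clearly separated groups of target values
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([1.0, 1.1, 0.9, 5.0, 5.1, 4.9])
j, s, loss = best_split(X, y)   # best split lies between x = 3 and x = 10
```

Repeating this search recursively inside each resulting region, until no further division is possible, yields the tree of Equation (3).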

RF combines multiple decision trees into a strong classifier on the basis of bagging (shown in

1) Input the training set D.

2) Use Bootstrap sampling to form k training subsets.

3) Randomly extract m features from the original features.

4) Train on each training subset, making the optimal segmentation over the randomly selected m features, to obtain k decision tree prediction results.

5) Vote on the k prediction results and take the prediction with the highest number of votes.
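The five steps above can be sketched as follows. This is an illustrative sketch assuming scikit-learn is available; for brevity, the random subset of m features is drawn once per tree rather than per node, and all names besides the scikit-learn classes are our own.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)  # step 1

k, m = 25, 3                                        # k trees, m random features
trees, feats = [], []
for _ in range(k):
    idx = rng.integers(0, len(X), len(X))           # step 2: Bootstrap sample
    f = rng.choice(X.shape[1], m, replace=False)    # step 3: random feature subset
    t = DecisionTreeClassifier().fit(X[idx][:, f], y[idx])  # step 4: train
    trees.append(t)
    feats.append(f)

# Step 5: majority vote over the k tree predictions
votes = np.stack([t.predict(X[:, f]) for t, f in zip(trees, feats)])
pred = (votes.mean(axis=0) > 0.5).astype(int)
```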

The PSO algorithm simulates a bird in a flock by designing a massless particle with only two attributes: speed and position. Speed represents how fast it moves, and position represents its location in space. Each particle finds the optimal solution in its individual search space and stores it as the current individual extreme value; the current global optimal solution is then found from the individual extreme values of all particles, and each particle in the swarm adjusts its speed and position accordingly. The traditional PSO algorithm is described as follows:

Suppose there is a population of m particles in a d-dimensional search space, and that at time T the particle information is: position X_{i} = [x_{i1}, x_{i2}, ⋯, x_{id}], speed V_{i} = [v_{i1}, v_{i2}, ⋯, v_{id}], personal best position p_{i} = [p_{i1}, p_{i2}, ⋯, p_{id}], and global optimal position p_{g} = [p_{g1}, p_{g2}, ⋯, p_{gd}].

Then, the speed and position information of the particles are updated at time T + 1 by the following formula:

v_{i}^{t+1} = ω v_{i}^{t} + c_{1} r_{1}^{t} (p_{i}^{t} − x_{i}^{t}) + c_{2} r_{2}^{t} (p_{g}^{t} − x_{i}^{t}), x_{i}^{t+1} = x_{i}^{t} + v_{i}^{t+1} (4)

Among them, the inertia weight ω maintains an effective balance between global and local exploration; c_{1} and c_{2} are the learning factors, responsible for adjusting the step length in the exploration directions toward the individual best position and the global optimal position, respectively; and r_{1}, r_{2} are random numbers drawn from a uniform distribution on [0, 1]. To avoid blind search, the speed and position of the particles are usually limited to [−V_{max}, V_{max}] and [−X_{max}, X_{max}].
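A minimal sketch of update rule (4) on a toy sphere function follows. The coefficient values, velocity limit, and the sphere objective are illustrative assumptions, not the paper's experimental settings.

```python
import numpy as np

rng = np.random.default_rng(1)

def sphere(x):                       # toy fitness: minimize the sum of squares
    return (x ** 2).sum(axis=1)

m, d, w, c1, c2, vmax = 30, 5, 0.7, 1.5, 1.5, 1.0
x = rng.uniform(-5, 5, (m, d))
v = np.zeros((m, d))
pbest, pbest_f = x.copy(), sphere(x)            # personal bests p_i
g = pbest[pbest_f.argmin()].copy()              # global best p_g

for _ in range(100):
    r1, r2 = rng.random((m, d)), rng.random((m, d))
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)   # Eq. (4)
    v = np.clip(v, -vmax, vmax)                 # limit speed to [-Vmax, Vmax]
    x = np.clip(x + v, -5, 5)                   # limit position
    f = sphere(x)
    better = f < pbest_f
    pbest[better], pbest_f[better] = x[better], f[better]
    g = pbest[pbest_f.argmin()].copy()
```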

In this section, we introduce the structure of the APSO-RF model in detail. First, the PSO improved by adaptive learning strategies is presented: during the search, the population is adaptively divided into subgroups according to the particle distribution, and within each subgroup two different learning strategies guide the search directions of the two different types of particles. Then, the optimization-based model building process is introduced: by applying APSO to optimize the selected hyper-parameters, the classification model is established.

Relevant studies have shown that population diversity is the key to avoiding premature convergence of PSO; the core guiding principle of our algorithm is clustering. According to the distribution of the particles, the fast search clustering method [ is adopted, and two quantities ρ_{i} and δ_{i} are defined for each particle. ρ_{i}, the local density of particle i, is defined as follows:

ρ_{i} = Σ_{j ≠ i} exp(−(d_{ij}/d_{c})^{2}) (5)

where d_{ij} is the Euclidean distance between particles x_{i} and x_{j}, and d_{c} is the truncation distance. The truncation distance is d_{c} = d_{R*M}, where R represents the proportion and M indicates that the matrix d_{ij} contains M = N(N − 1)/2 values, N being the number of particles. That is, d_{c} is the distance corresponding to the (R * M)-th value of the sorted d_{ij}. Equation (6) gives the expression for the distance δ_{i}, representing the minimum distance from particle i to any other particle with a higher ρ:

δ_{i} = min_{j: ρ_{j} > ρ_{i}} (d_{ij}) (6)

For the particle with the maximum local density ρ, δ_{i} = max_{j} d_{ij}.

According to Equations (5) and (6), if the density of particle x_{i} is the maximum, its δ_{i} is much larger than the δ of its nearest particles. Therefore, the centres of the subgroups consist of particles that have an unusually large distance δ as well as a relatively high density; in other words, particles with larger ρ and δ values are selected as cluster centres. Following this idea, a quantity γ_{i} = ρ_{i} * δ_{i} is used to filter out the particles that may become cluster centres. We arrange the γ_{i} values in descending order and then filter the cluster centres from this order: because the γ value of a top particle tends to be far larger than those of the other particles, it is clearly distinguished from the γ value of the next particle. Referring to [, each remaining particle x_{j} is assigned to the subgroup of the nearest particle whose density ρ is larger than that of x_{j}.
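The subgroup division described above (local density ρ, distance δ, ranking by γ = ρ·δ, then nearest-higher-density assignment) can be sketched as follows. The function name, the fixed number of centres, and the fallback used for the globally densest particle are illustrative choices of this sketch.

```python
import numpy as np

def divide_subgroups(X, R=0.2, n_centers=3):
    """Sketch of the rho/delta subgroup division (Eqs. (5)-(6))."""
    N = X.shape[0]
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # truncation distance d_c: the (R*M)-th smallest of the M = N(N-1)/2 distances
    tri = np.sort(d[np.triu_indices(N, k=1)])
    dc = tri[int(R * len(tri))]
    rho = np.exp(-(d / dc) ** 2).sum(axis=1) - 1.0    # Eq. (5); -1 removes j = i
    delta = np.empty(N)
    for i in range(N):
        higher = np.where(rho > rho[i])[0]
        delta[i] = d[i].max() if len(higher) == 0 else d[i, higher].min()  # Eq. (6)
    gamma = rho * delta
    centers = np.argsort(gamma)[::-1][:n_centers]     # largest gamma -> centres
    label = np.full(N, -1)
    label[centers] = np.arange(n_centers)
    # assign remaining particles, in descending density order, to the subgroup
    # of their nearest neighbour with higher density
    for i in np.argsort(rho)[::-1]:
        if label[i] >= 0:
            continue
        higher = np.where(rho > rho[i])[0]
        if len(higher) == 0:                          # densest non-centre particle
            label[i] = label[centers[d[i, centers].argmin()]]
        else:
            label[i] = label[higher[d[i, higher].argmin()]]
    return label

# three well-separated particle clouds should yield three subgroups
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, (20, 2)) for c in [(0, 0), (5, 5), (0, 5)]])
labels = divide_subgroups(X)
```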

Based on the result of the subgroup division, the particles of each subgroup are divided into ordinary particles and locally optimal particles. Under the primary guidance of the optimal particles, the ordinary particles exert their local search ability, with the update formula given in (7).

x_{i}^{d} = ω x_{i}^{d} + c_{1} rand_{1}^{d} (pbest_{i}^{d} − x_{i}^{d}) + c_{2} rand_{2}^{d} (gbest_{c}^{d} − x_{i}^{d}) (7)

where ω is the inertia weight, c_{1} and c_{2} are the learning factors, rand_{1}^{d} and rand_{2}^{d} are uniformly distributed random numbers in the interval [0, 1], pbest_{i}^{d} is the best position of particle i, and gbest_{c}^{d} is the current best position in subgroup c. To enhance the exchange of information between subgroups, the locally optimal particles are updated mainly by integrating the information of all subgroups, with the update formula given in (8), where C is the number of subgroups.

x_{i}^{d} = ω x_{i}^{d} + c_{1} rand_{1}^{d} (pbest_{i}^{d} − x_{i}^{d}) + c_{2} rand_{2}^{d} ((1/C) Σ_{c=1}^{C} gbest_{c}^{d} − x_{i}^{d}) (8)

Ordinary particles search for local optima, but more importantly they serve as the medium for information exchange between subgroups, modifying the direction of the population search and further improving population diversity. Unlike a learning strategy that gathers too many particles locally within the same subgroup, this strategy integrates the information of the locally optimal particles from different subgroups to obtain more information and help avoid local optima. However, learning from too much information may make the update direction too fuzzy and hinder the convergence of particles. Considering that the locally optimal particles have the greatest probability of finding the optimal solution within their subgroup, their information provides valuable guidance toward the optimal solution. Therefore, the gbest_{c}^{d} of each subgroup uses the averaged information to guide the locally optimal particle update (see (8)). This approach improves the transmission of optimization information among subgroups, further increases population diversity, and prevents particles from falling into local optima.
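The two strategies (7) and (8) differ only in their guidance term: an ordinary particle follows its own subgroup's best, while a locally optimal particle follows the average of all subgroup bests. A small sketch (the coefficient values and array shapes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
w, c1, c2 = 0.7, 1.5, 1.5
d, C = 4, 3
gbest = rng.uniform(-1, 1, (C, d))   # current best position of each subgroup c

def update_ordinary(x, pbest, c):
    """Eq. (7): an ordinary particle is guided by its own subgroup's best gbest[c]."""
    r1, r2 = rng.random(d), rng.random(d)
    return w * x + c1 * r1 * (pbest - x) + c2 * r2 * (gbest[c] - x)

def update_local_best(x, pbest):
    """Eq. (8): a locally optimal particle learns from the mean of all C subgroup bests."""
    r1, r2 = rng.random(d), rng.random(d)
    return w * x + c1 * r1 * (pbest - x) + c2 * r2 * (gbest.mean(axis=0) - x)

x_new = update_ordinary(np.zeros(d), np.ones(d), c=0)
x_new2 = update_local_best(np.zeros(d), np.ones(d))
```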

In order to make the RF model structure match the data features more accurately and obtain accurate classification predictions, we use adaptive particle swarm optimization to control the hyper-parameters of the model structure and build the APSO-RF model (shown in

First, the hyper-parameters in the RF model are taken as the optimization targets, and the position of each particle is randomly initialized in the chosen hyper-parameter value space.

Second, the particles are adaptively divided into subpopulations by calculating each particle's local density and its distance to particles of higher local density. The hyper-parameters of the RF model are then assigned according to the values determined by each particle's position, the validation data are fed into the model for prediction, and the loss function value of the model on the validation set is used as the particle's fitness value.

The fitness is the logistic loss computed from the true values and the predicted probability values. According to the fitness value of each particle, each subgroup is divided into the various types of particles, which are updated with their respective update strategies. When the termination condition is reached, the optimal value in the current parameter space is obtained. Finally, the RF model is constructed with the optimal hyper-parameter values.
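A sketch of the fitness evaluation: a particle position is decoded into RF hyper-parameters and scored by the logistic loss on a validation set. The data set, the split, and the decoding scheme are illustrative assumptions of this sketch, expressed with scikit-learn names.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# a toy imbalanced data set (~15% minority class)
X, y = make_classification(n_samples=400, weights=[0.85], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

def fitness(position):
    """Decode a particle position into RF hyper-parameters and return the
    validation logistic loss, used as the particle's fitness value."""
    n_est, max_feat, max_depth, min_split, min_leaf = position
    rf = RandomForestClassifier(
        n_estimators=int(n_est),
        max_features=min(int(max_feat), X.shape[1]),
        max_depth=int(max_depth),
        min_samples_split=int(min_split),
        min_samples_leaf=int(min_leaf),
        random_state=0,
    ).fit(X_tr, y_tr)
    return log_loss(y_val, rf.predict_proba(X_val))

f = fitness([100, 5, 10, 2, 1])   # one candidate point from the search space
```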

The theory of cross-validation was initiated by Seymour Geisser [

It is commonly used in PLS regression modeling. In the given modeling sample, most of the samples are taken out to build the model, a small portion is set aside and predicted with the model just established, and the prediction errors of this small portion are recorded as their sum of squares.

Cross-validation makes full use of limited data to find appropriate model parameters and prevent overfitting. The main steps of K-fold cross-validation are as follows: the initial sample is divided into K sub-samples; one sub-sample is retained as the validation data, and the other K − 1 sub-samples are used for training. Cross-validation is repeated K times so that each sub-sample is used for validation exactly once, and the K results are averaged to obtain a single estimate. The advantage of this method is that the randomly generated sub-samples are repeatedly used for both training and validation. In our experiments, we used the most common setting, 10-fold cross-validation.
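The K-fold procedure above can be sketched with scikit-learn as follows; this is an illustrative example on synthetic data, not the paper's exact experimental code.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, weights=[0.8], random_state=0)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

scores = []
for tr, va in skf.split(X, y):      # each sub-sample is used for validation once
    model = RandomForestClassifier(random_state=0).fit(X[tr], y[tr])
    scores.append(model.score(X[va], y[va]))
cv_acc = np.mean(scores)            # average of the K = 10 results
```

Stratified folds keep the class ratio of each sub-sample close to that of the full data set, which matters for imbalanced data.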

Although tree-based algorithms are not affected by scaling, feature normalization can greatly improve the accuracy of classifiers. The training set is described as D = {X, Y}, where X = {x_{1}, x_{2}, ⋯, x_{m}} represents an m-dimensional feature space and Y = {0, 1} represents the target value. If x is a certain feature, it is transformed by 0-1 scaling as follows:

x′ = (x − min(x)) / (max(x) − min(x)) (9)

where x′ denotes the standardized value.
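Equation (9) in code (a minimal sketch; the sample values are illustrative):

```python
import numpy as np

def min_max_scale(x):
    """Eq. (9): 0-1 scaling of a feature column."""
    return (x - x.min()) / (x.max() - x.min())

x = np.array([10.0, 20.0, 30.0, 50.0])
x_scaled = min_max_scale(x)   # -> [0.0, 0.25, 0.5, 1.0]
```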

The experimental data of this study are unbalanced data sets obtained from the KEEL data mining platform (see

| Name | Attributes | Examples | IR |
|---|---|---|---|
| ecoli-3 | 7 | 336 | 8.6 |
| glass-1 | 9 | 214 | 1.82 |
| new-thyroid-1 | 5 | 215 | 5.14 |
| page-blocks-0 | 10 | 5472 | 8.79 |
| vehicle-1 | 18 | 846 | 2.9 |
| wisconsin | 9 | 683 | 1.86 |
| yeast-1 | 8 | 1484 | 2.46 |

Data standardization scales data so that it falls into a small specified interval. This removes the unit limitation of the data and turns it into a dimensionless, pure value that can be compared and weighted across different units or orders of magnitude.

Since extensive experiments have shown that 10-fold cross-validation is the most widely used and most effective setting, before verifying the validity of the model we unified the cross-validation across all models to 10-fold cross-validation (see

According to previous RF parameter optimization research, we took a group of hyper-parameters as optimization targets and set their search space. The settings range is shown in

To compare the results of the evaluated models, we use evaluation criteria based on the confusion matrix (see

| Name | Search space |
|---|---|
| n estimators | 50 - 200 |
| max features | 12 - 16 |
| max depth | 350, 400, 450 |
| min samples split | 2, 3 |
| min samples leaf | 1, 5 |

| Predicted \ Actual | 0 | 1 | Total |
|---|---|---|---|
| 0 | TP | FN | TP + FN |
| 1 | FP | TN | FP + TN |
| Total | TP + FP | FN + TN | TP + FP + FN + TN |

True positive (TP) is the number of actual positive samples that are predicted to be of the positive class; true negative (TN) is the number of actual negative samples predicted to be negative; false positive (FP) is the number of actual negative samples predicted to be positive; false negative (FN) is the number of actual positive samples predicted to be negative. Both the F1-measure and the ROC area are comprehensive measures of the ability to deal with unbalanced data sets. The formulas are as follows.

The average accuracy (ACC):

ACC = (TP + TN) / (TP + FP + TN + FN) (10)

The F1-measure takes into account both the precision and the recall of a classification model; it is the harmonic mean of these two indicators and ranges from 0 to 1. The ROC curve is a graph used to judge prediction accuracy: the closer its area is to 1, the closer the predictions are to 100% correct.

F1 = 2 · precision · recall / (precision + recall) (11)

where precision is the proportion of predicted positive cases that are actually positive; it is defined as

precision = TP / (TP + FP) (12)

and recall is the proportion of actual positive cases that are correctly predicted as positive; it is defined as

recall = TP / (TP + FN) (13)
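Formulas (10)-(13) in code, evaluated on hypothetical confusion-matrix counts (the counts are illustrative, not experimental results):

```python
def metrics(tp, fp, tn, fn):
    """ACC, precision, recall and F1 from confusion-matrix counts (Eqs. (10)-(13))."""
    acc = (tp + tn) / (tp + fp + tn + fn)       # Eq. (10)
    precision = tp / (tp + fp)                  # Eq. (12)
    recall = tp / (tp + fn)                     # Eq. (13)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (11)
    return acc, precision, recall, f1

acc, p, r, f1 = metrics(tp=40, fp=10, tn=45, fn=5)
# acc = 0.85, precision = 0.8, recall = 8/9
```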

In order to analyze and verify the performance of the proposed model for unbalanced data classification research, we selected several commonly used machine learning classification models for comparison.

DT: A decision tree classifies instances based on features: each internal node represents a judgement on an attribute, each branch represents the output of a judgement result, and each leaf node represents a classification result. The algorithm loops over all splits and selects the best-partitioned subtree based on the error rate and the cost of misclassification.

Logistic regression (LR): The statistical technique of logistic regression is usually used to solve binary classification problems. Regression analysis is used to describe the relationship between the independent variable x and the dependent variable Y and to predict the dependent variable Y. LR adds a logistic function on the basis of regression.

Multilayer perceptron machine (MPN): It draws on neural principles, where each neuron can be regarded as a learning unit. The MPN is built from many neurons organized into an input layer, hidden layer, and output layer. These neurons take certain features as input and produce output according to their own model. The weight assigned to each attribute varies according to its relative importance and is adjusted iteratively to bring the predicted output closer to the actual target.

Support vector machine (SVM): By mapping the feature vector of an instance to a point in space, the purpose of the SVM is to draw a line to best distinguish the two types of points. The SVM finds the hyperplane that separates the data. To best distinguish the data, the sum of the distances from the closest points on both sides of the hyperplane is required to be as large as possible.

This paper proposes an unbalanced data classification model based on RF optimized by APSO. The main flow of the model is as follows. First, the data pre-processing standardizes the data sets and divides them into training data and test data: the training data for training the model, the test data for prediction. Second, the adaptive PSO algorithm is initialized. The logistic loss function serves as the fitness function, and the fitness value of each particle is calculated. The model continually searches for the optimal parameters according to the fitness values updated by the loss function until the termination condition is reached, at which point the optimal value found is output. The model is then built with the hyper-parameters tuned by APSO. In the end, the trained model is applied to the test set and the indicators are obtained.

First, divide the data sets: the training data for training the model and the validation data for prediction. Initialize the adaptive PSO algorithm, take the logistic loss function as the fitness function, and calculate the fitness value of each particle. Second, build the RF model with the hyper-parameters determined by the current best particle, train and predict on the data sets, and update the fitness values through the loss function. Third, determine the positions of the global optimal particle and the locally optimal particles according to the result of the population division and the fitness values of the particles. Finally, update the positions of the ordinary particles and the locally optimal particles, respectively, and judge whether to terminate: when the maximum number of iterations n is reached, return the optimal hyper-parameter values; otherwise, the model continues training. The optimal hyper-parameters are then used to build the RF model and calculate the indexes.

On the ecoli-3 data set, the RF model performs better than the other types of models, surpassing them on most indicators. RF obtains good results, which shows that the ensemble model can pay more attention to learning unbalanced data sets. Moreover, the APSO-RF model reaches the highest F1-measure value, 93.8%, which is 0.8% higher than that of the NN.

On the glass-1 data set, the results of APSO-RF are satisfactory, as all its evaluation criteria are better than those of the other algorithms. Compared with the SVM, our model improves ACC and F1-measure by 7.3% and 5.8%, respectively. The model with hyper-parameter settings optimized by APSO improves on all indicators, especially ACC and F1-measure, where the improvements over RF are 1.7% and 0.7%, respectively. This shows that the model can still handle the data imbalance problem well while maintaining the level of overall accuracy.

On the new-thyroid-1 data set, the LR model performs better than RF: RF does not optimize its hyper-parameter settings, which makes its learning of the samples insufficient. RF performs better than LR on the F1-measure, indicating that by using the bagging method RF has good generalization ability. The key to this method is handling the imbalance to obtain effective classifiers while ensuring the diversity of the base classifiers. The hyper-parameter-optimized APSO-RF model brings ACC and the other indicators to the top, showing that the improved particle swarm can help the model build a branch structure suitable for the data set by selecting reasonable hyper-parameter settings.

Most models perform well on page-blocks-0, and on the evaluation indicators the APSO-RF algorithm is better than the other algorithms. The model's F1-measure performance ranks at the forefront, indicating that our model is superior to the other algorithms in the classification of unbalanced data.

On the wisconsin data set, APSO-RF is 0.2% better than RF in ACC and 0.6% higher in ACC than the third-highest model, LR; the model also has the best recall rate, indicating that it can distinguish more positive categories.

On yeast-1, our model achieves the best performance on all indicators, and it also performs well in prediction precision and recall. A high precision rate means that the positive examples in the sample are predicted more accurately. This shows that our proposed algorithm is superior to the other algorithms in the classification of the positive class.

On the whole, RF has a better average performance than the other models, which shows that this model can effectively reduce model error and achieve a more accurate unbiased estimate with the help of the ensemble classification strategy. Specifically, traditional classification algorithms usually use classification accuracy as the evaluation criterion and aim to maximize average accuracy; to maximize accuracy, they often sacrifice the performance of the minority class, while RF uses an appropriate induction algorithm that benefits minority-class learning. APSO-RF improves markedly over RF on all indexes, which shows that the optimized hyper-parameters better match the fitness values and yield a tree structure more suitable for unbalanced data, so the precision of the model is higher. The algorithm clearly improves the ability to classify the positive class without losing global classification ability, because APSO optimizes the hyper-parameters reasonably and thereby builds a tree structure that is more suitable for the unbalanced data set. The adaptive particle swarm optimization uses adaptive group division and different updating strategies to guide particle learning, which helps maintain the diversity of the population and prevents the model from falling into local optima early.

Unbalanced data classification is a major challenge in the field of data mining, and RF, as an ensemble learning method, is often used to address it. This paper proposes a particle swarm optimization strategy based on adaptive partitioning, which uses the good global and local search performance of the optimization strategy to optimize the hyper-parameters of the RF and reduce the misclassification of samples in the imbalanced data classification problem. The proposed model is verified on six imbalanced data sets and obtains good prediction results. The results demonstrate that the model has excellent generalization ability and the ability to handle imbalanced data sets.

Our future work will focus on improving the integrated decision tree structure to further enhance the performance of the model.

This work was supported in part by the National Natural Science Foundation of China under Grant 61972227, Grant 61873117 and Grant U1609218, in part by the Natural Science Foundation of Shandong Province under Grant ZR201808160102 and Grant ZR2019MF051, in part by the Primary Research and Development Plan of Shandong Province under Grant GG201710090122, Grant 2017GGX10109, and Grant 2018GGX101013, and in part by the Fostering Project of Dominant Discipline and Talent Team of Shandong Province Higher Education Institutions.

The authors declare no conflicts of interest regarding the publication of this paper.

He, Q.Q. and Qin, C. (2021) Adaptive Optimization Swarm Algorithm Ensemble Model Applied to the Classification of Unbalanced Data. Intelligent Information Management, 13, 251-267. https://doi.org/10.4236/iim.2021.135014