
On the Internet, computers and network equipment are threatened by malicious intrusion, which seriously affects network security. Intrusion behavior evolves quickly, is well concealed, and is random, so traditional intrusion detection systems (IDS) have difficulty preventing attacks effectively. In this paper, an integrated network intrusion detection algorithm that combines the support vector machine (SVM) with AdaBoost is presented. The SVM is used to construct base classifiers, and AdaBoost trains these learning modules and generates the final intrusion detection model by iteratively updating the weights of the samples and of the detection models until either the maximum number of iterations or the target accuracy of the detection model is reached. The effectiveness of the proposed IDS is evaluated on the DARPA99 dataset, with accuracy as the criterion for detection performance. Experimental results show that it achieves better performance than two state-of-the-art IDS.

With the continuous development of network technology and the social economy, people enjoy the convenience that the Internet and computer technology bring while also facing the threat of malicious intrusion. The firewall, as a traditional network security technology, struggles to form an effective defense against ever-evolving means of network intrusion [

Traditional intrusion detection methods are mainly divided into anomaly detection and misuse detection. Anomaly detection mostly uses the expert experience and inference method. Statistical method [

Ensemble learning is a machine learning method based on statistical learning theory that can greatly improve the generalization ability of a learning algorithm. With a limited number of training samples, it can ensure the relative independence of the test data and keep the error small. When ensemble learning is introduced into an IDS, it can maintain good classification accuracy, and hence good detection performance, despite the lack of prior knowledge. Therefore, this paper proposes an intrusion detection ensemble learning algorithm that combines the SVM with the ensemble learning algorithm AdaBoost.

This paper is organized as follows. Section 2 introduces the algorithm principle and system structure. Section 3 proposes an intrusion detection model based on SVM. Section 4 proposes an intrusion detection ensemble learning algorithm based on AdaBoost. Experiment results are shown in Section 5. Finally we conclude in Section 6.

The basic goal of an IDS is to collect and analyze network data in order to detect behavior that breaches the security policy and signs of attack that may exist in the target system. Through active protection, the IDS intercepts malicious intrusions and alerts the administrator before the network is endangered.

For this goal, an integrated IDS by combining SVM with AdaBoost is proposed in the paper and its structure is shown in

The classification algorithm is the focus of this paper. Its idea is to combine the SVM with AdaBoost. First, the SVM learns the network feature data, yielding intrusion detection models that serve as base classifiers. Then, to address the limited accuracy of a single SVM on small samples, the ensemble learning algorithm AdaBoost is introduced to iteratively optimize the SVM-based base classifiers and improve accuracy.

The SVM algorithm proposed by Vapnik and other scholars can effectively handle nonlinear data and limit overfitting. It has a rigorous theoretical and mathematical foundation and does not suffer from local minima. It has strong generalization ability in small-sample learning applications such as network intrusion detection and depends only weakly on the number of samples [

Training the standard SVM is a convex quadratic optimization problem, so the global optimum can always be found. However, as the number of training samples grows, the number of constraints grows with it, which greatly increases the training time and memory requirements and becomes the bottleneck in practical applications.

To improve the training efficiency of the SVM, Suykens changed the constraints and the risk function of the standard SVM and proposed the least squares support vector machine (LS-SVM) [

In a real network environment, each network node receives a large volume of network data, of which only a small part represents intrusion behavior. To reduce the useless data, this paper improves the feature selection strategy of KDDCUP99 (Data Mining and Knowledge Discovery Cup in 1999).

where P_{ik} is a weight factor of a network data t_{k} from dataset d. tf_{ik} is the frequency that t_{k} appears in d. N is the number of data in d. n_{k} is the frequency that t_{k} containing a specific port appears in d. L is the length of t_{k}. The packets with greater weight are selected as the training samples.

Given the training sample set

where b is the threshold. Function approximation problem is equivalent to Equation (3).

where

Considering the linear ε-insensitive loss function has better sparsity, the loss function is shown as Equation (4):
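The equation itself is not reproduced in this text. As a minimal sketch, assuming the standard textbook form of the linear ε-insensitive loss, max(0, |y − f(x)| − ε), the loss can be written as follows. The function name and the default `epsilon` are illustrative assumptions, not values from the paper.

```python
def epsilon_insensitive_loss(y_true: float, y_pred: float, epsilon: float = 0.1) -> float:
    """Linear epsilon-insensitive loss (standard form, assumed here).

    Errors smaller than epsilon cost nothing; larger errors are
    penalized linearly by the amount they exceed epsilon.
    """
    return max(0.0, abs(y_true - y_pred) - epsilon)


# Inside the tube: no penalty. Outside the tube: linear penalty.
inside = epsilon_insensitive_loss(1.0, 1.05, epsilon=0.1)
outside = epsilon_insensitive_loss(1.0, 1.5, epsilon=0.1)
```

Because every sample inside the ε-tube contributes zero loss, only the samples outside it shape the solution, which is the sparsity the text refers to.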

Empirical risk function is shown as Equation (5):

According to the statistical theory, a regression function is determined by the following objective function minimization, which is shown as Equation (6):

where C is the weight parameter that balances the model-complexity term against the training-error term;

By using the Lagrange multiplier method and kernel technology, LS-SVM can be converted to Equation (8) which is shown as below:

where

In order to satisfy the any symmetric function in the Mercer condition, the selection of kernel function

SVM regression functions

Thus, the base classifier of the network intrusion detection system is obtained, and a new network data vector x can be classified by linear decision function which is shown as Equation (11):
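Since the equations of this section are not reproduced in this text, the following is only a sketch of how an LS-SVM base classifier might be trained, assuming the commonly cited LS-SVM formulation in which training reduces to solving one linear system for the bias `b` and the multipliers `alpha`. The RBF kernel, the `sigma` parameter, the ±1 label encoding and all function names are assumptions made for illustration; `C` plays the role of the weight parameter mentioned above.

```python
import math


def rbf_kernel(a, b, sigma=1.0):
    """Gaussian (RBF) kernel, an assumed choice that satisfies the Mercer condition."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def solve_linear_system(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def train_ls_svm(X, y, C=10.0, sigma=1.0):
    """Build and solve [[0, 1^T], [1, K + I/C]] [b; alpha] = [0; y]."""
    n = len(X)
    A = [[0.0] + [1.0] * n]
    for i in range(n):
        row = [1.0] + [rbf_kernel(X[i], X[j], sigma) for j in range(n)]
        row[i + 1] += 1.0 / C  # regularization on the diagonal
        A.append(row)
    solution = solve_linear_system(A, [0.0] + list(y))
    return solution[0], solution[1:]  # b, alpha


def ls_svm_classify(x, X, alpha, b, sigma=1.0):
    """Sign of the kernel expansion: +1 (attack) or -1 (normal), by assumption."""
    score = sum(a * rbf_kernel(x, xi, sigma) for a, xi in zip(alpha, X)) + b
    return 1 if score >= 0 else -1
```

The point of the sketch is the cost profile: training is a single linear solve over the n training samples instead of a constrained quadratic program, which is why the LS-SVM variant is attractive when training time matters.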

In the network, many factors can expose nodes to the threat of intrusion, so the timing and feature information of intrusions exhibit a certain weak randomness. A single SVM has some generalization ability on small samples, but its accuracy on the intrusion detection problem is still not high. AdaBoost is a typical ensemble learning method that can jointly optimize multiple weak base classifiers with relatively low accuracy [

The model based on ensemble learning algorithm AdaBoost is shown in

Training algorithm

Step 1: given the feature sample set, x_{n} is an input vector representing a training sample and y_{n} is its class label. The initial weight of each sample d_{1}, d_{2}, ∙∙∙, d_{n} is set to

Step 2: use the optimization algorithm to tune the connection weights of the SVM and obtain the optimal weights.

Step 3: use the sample set to train the optimized SVM, obtaining the tth intrusion detection model h_{t}.

Step 4: record the intrusion detection model h_{t}, then calculate and save its weight ω_{t}. Apply h_{t} to the samples and calculate the sum of the absolute values of the prediction errors, δ. If δ is less than the set value, or the number of iterations reaches the maximum, stop iterating and go to Step 6; otherwise, go to Step 5.

Step 5: updating the weight

Step 6: getting the final prediction model
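The steps above can be sketched as a boosting loop. This is a minimal illustration under stated assumptions, not the paper's exact procedure: the update formulas are not reproduced in this text, so the sketch substitutes the classical discrete AdaBoost rules (uniform initial weights, model weight 0.5·ln((1 − err)/err), exponential sample re-weighting). The base learner is a linear SVM fitted by subgradient descent on a weighted hinge loss, labels are assumed to be binary (+1 attack, −1 normal), and every function name and hyperparameter is hypothetical.

```python
import math


def svm_predict(w, b, x):
    """Linear decision rule: +1 or -1."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


def train_weighted_svm(X, y, d, epochs=200, lr=0.1, lam=0.01):
    """Linear SVM fitted by subgradient descent on the d-weighted hinge loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [lam * wj for wj in w], 0.0
        for xi, yi, di in zip(X, y, d):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # sample violates the margin
                grad_w = [g - di * yi * xj for g, xj in zip(grad_w, xi)]
                grad_b -= di * yi
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def adaboost_svm(X, y, max_rounds=10, target_error=1e-3):
    n = len(X)
    d = [1.0 / n] * n                              # Step 1: uniform weights (assumed)
    ensemble = []                                  # (omega_t, w_t, b_t) per round
    for _ in range(max_rounds):
        w, b = train_weighted_svm(X, y, d)         # Steps 2-3: fit base model h_t
        preds = [svm_predict(w, b, xi) for xi in X]
        err = sum(di for di, p, yi in zip(d, preds, y) if p != yi)
        if err >= 0.5:                             # no better than chance: stop
            break
        err = max(err, 1e-10)                      # avoid division by zero
        omega = 0.5 * math.log((1.0 - err) / err)  # Step 4: model weight
        ensemble.append((omega, w, b))
        if err <= target_error:                    # Step 4: stopping test
            break
        # Step 5: re-weight the samples, then normalize so the weights sum to 1
        d = [di * math.exp(-omega * yi * p) for di, yi, p in zip(d, y, preds)]
        total = sum(d)
        d = [di / total for di in d]
    return ensemble


def ensemble_predict(ensemble, x):
    """Step 6: weighted vote of all recorded base models."""
    vote = sum(omega * svm_predict(w, b, x) for omega, w, b in ensemble)
    return 1 if vote >= 0 else -1
```

A call such as `adaboost_svm(X, y)` returns the list of weighted base models, and `ensemble_predict` combines them. Note that the classical rule used here raises the weights of misclassified samples, which may differ from the weighting scheme the paper defines.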

Two main factors affect the AdaBoost ensemble learning effect: first, how to distribute the sample weights in each round; second, how to integrate many rules into one effective prediction rule. These two points are reflected by the sample weights and the model weights, respectively.

Through adjusting the sample weights, the effect of the error samples for intrusion detection model can be effectively reduced, and the contribution of the correct sample can be promoted. The acquisition of sample weight is divided into two steps: computation and normalization. The weight is measured by using the absolute value of prediction error; the method is defined as Equation (12):

where E_{t} represents the sum of the weighted variance of the training samples on the tth intrusion detection model h_{t}, and β_{t} is the adjustment coefficient. The adjustment coefficient can be chosen in various ways; this paper adopts the form above to keep the final prediction model stable.

The sum of all sample weights must be 1, so the weights must be normalized; the method is defined as Equation (13):

The weight of intrusion detection model directly influences the output of the final prediction model. In order to enhance the contribution of intrusion detection model with the smaller errors in the final model, we use the absolute value of prediction error to measure the model weight ω_{t}; the method is defined as Equation (14):

where E_{t} represents the sum of the weighted variance of the training samples on the tth intrusion detection model h_{t}, β_{t} is the adjustment coefficient, and ω_{t} is the weight of the tth intrusion detection model h_{t} in the final intrusion detection model.

To verify the effectiveness of the proposed intrusion detection algorithm, computer simulations were carried out. All algorithms are implemented in MATLAB 7.0 on a PC (personal computer) with an Intel P4 processor (2.9 GHz) and 2 GB of RAM. We investigate classification accuracy.

To evaluate IDS performance, most experts and scholars use the DARPA99 data. To make the simulation results credible, this paper uses the same dataset to evaluate the algorithm. The dataset is divided into a training set (5 million connection records) and a test set (311029 connection records). The test set includes some attacks that do not appear in the training set.

This paper extracts 29313 samples with 41 dimensions from the training set, which contain 6059 “Normal”, 3866 “Neptune”, 516 “Portsweep”, 177 “Satan”, 11 “Buffer_overflow” and 2183 “Guess-password” samples, and extracts 124970 samples from the test set, which are divided into 5 test sets. In the experiments, we compare our algorithm with two state-of-the-art algorithms: the BP (Back Propagation) neural network and the SVM. Accuracy is used to evaluate the methods and is defined as Accuracy = (TP + TN)/(TP + FP + TN + FN), where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively. The test results are shown in Tables 1-3.
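The accuracy criterion defined above translates directly into code. The sketch below is illustrative only; the function name and the example counts are hypothetical and are not taken from the experiments.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Accuracy = (TP + TN) / (TP + FP + TN + FN)."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("at least one prediction is required")
    return (tp + tn) / total


# Hypothetical counts: 90 attacks caught, 5 normal connections passed,
# 3 false alarms and 2 missed attacks out of 100 connections.
example = accuracy(tp=90, tn=5, fp=3, fn=2)
```

Accuracy counts every correct decision equally, so on a class with very few samples (such as the 11 “Buffer_overflow” training samples) a handful of errors shifts the percentage noticeably.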

It can be seen from

Compared with

Accuracy (%) | Normal | Neptune | Portsweep | Satan | Buffer_overflow | Guess-password |
---|---|---|---|---|---|---|
Test set 1 | 85.2 | 78.5 | 71.3 | 65.6 | 61.2 | 75.4 |
Test set 2 | 81.5 | 74.6 | 72.5 | 68.3 | 53.5 | 73.6 |
Test set 3 | 81.2 | 75.3 | 75.3 | 72.2 | 48.4 | 75.1 |
Test set 4 | 84.3 | 74.3 | 74.6 | 64.6 | 58.7 | 74.2 |
Test set 5 | 85.6 | 76.8 | 72.3 | 66.6 | 57.3 | 72.3 |

Accuracy (%) | Normal | Neptune | Portsweep | Satan | Buffer_overflow | Guess-password |
---|---|---|---|---|---|---|
Test set 1 | 88.4 | 77.8 | 73.5 | 69.7 | 72.2 | 74.9 |
Test set 2 | 82.5 | 77.6 | 75.5 | 75.3 | 75.5 | 78.6 |
Test set 3 | 83.2 | 76.3 | 74.3 | 78.2 | 62.4 | 76.1 |
Test set 4 | 85.3 | 72.3 | 72.6 | 72.6 | 68.7 | 72.2 |
Test set 5 | 88.6 | 75.8 | 75.3 | 75.6 | 69.3 | 73.3 |

Accuracy (%) | Normal | Neptune | Portsweep | Satan | Buffer_overflow | Guess-password |
---|---|---|---|---|---|---|
Test set 1 | 98.2 | 95.5 | 96.4 | 89.6 | 86.3 | 96.4 |
Test set 2 | 95.5 | 93.6 | 95.4 | 92.3 | 85.5 | 95.6 |
Test set 3 | 97.2 | 96.3 | 93.2 | 87.2 | 87.8 | 97.1 |
Test set 4 | 98.3 | 95.3 | 92.3 | 85.6 | 83.6 | 93.2 |
Test set 5 | 97.6 | 93.8 | 93.1 | 88.6 | 89.2 | 93.3 |

For the other test sets, there is not much difference between the two tables. Overall detection accuracy increases slightly in

As can be seen from

In this paper, we have proposed an efficient intrusion detection system that combines the SVM and AdaBoost algorithms to detect attacks characterized by fast variation, strong concealment and randomness. The IDS uses our proposed ensemble learning algorithm. First, the SVM learns the features of the higher-weight packets, and training the SVM yields an intrusion detection base classifier. Second, the SVM base classifiers are iteratively trained by the ensemble learning algorithm AdaBoost. Finally, the final intrusion detection model is generated. The experimental results show that the proposed algorithm detects attacks with high accuracy, even when the detection targets are small-sample and random. Compared with an IDS based on the SVM or the BP neural network alone, the proposed IDS greatly improves detection accuracy.

However, the weight setting is important for our algorithm. Our future work includes constructing a better weighting function and further improving the generalization ability. It would also be interesting to apply the proposed IDS to real-world scenarios.