TITLE:
Feature Selection for Intrusion Detection Using Random Forest
AUTHORS:
Md. Al Mehedi Hasan, Mohammed Nasser, Shamim Ahmad, Khademul Islam Molla
KEYWORDS:
Feature Selection, KDD’99 Dataset, RRE-KDD Dataset, Random Forest, Permuted Importance Measure
JOURNAL NAME:
Journal of Information Security, Vol.7 No.3, April 7, 2016
ABSTRACT: An intrusion detection system
collects and analyzes information from different areas within a computer or a
network to identify possible security threats, including threats from both
outside and inside the organization. It deals with a large amount of
data that contains various irrelevant and redundant features, which results in
increased processing time and a low detection rate. Therefore, feature selection
should be treated as an indispensable pre-processing step to improve the
overall system performance significantly when mining huge datasets. In this
paper, we focus on a two-step approach to feature selection based
on Random Forest. The first step selects the features with higher variable
importance scores and guides the initialization of the search process for the
second step, which outputs the final feature subset for classification and
interpretation. The effectiveness of this algorithm is demonstrated on the KDD’99
intrusion detection datasets, which are based on the DARPA 98 dataset and provide
labeled data for researchers working in the field of intrusion detection. An
important deficiency of the KDD’99 data set is its huge number of redundant
records, as observed earlier. Therefore, we have derived a data set, RRE-KDD, by
eliminating redundant records from the KDD’99 train and test datasets, so that
classifiers and feature selection methods will not be biased towards more
frequent records. RRE-KDD consists of the KDD99Train+ and KDD99Test+
datasets for training and testing purposes, respectively. The experimental
results show that the proposed Random Forest based approach can select the most
important and relevant features for classification, which, in turn,
not only reduces the number of input features and the processing time but also
increases the classification accuracy.
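
The pipeline described in the abstract (remove redundant records, rank features by importance, then search among the top-ranked features) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation. It assumes scikit-learn and NumPy, uses synthetic data in place of KDD’99, substitutes scikit-learn's `permutation_importance` on a held-out split for the permuted importance measure, and assumes a greedy forward search for the second step, which the abstract does not specify. The names `drop_duplicate_records`, `two_step_selection`, and `top_k` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split


def drop_duplicate_records(X, y):
    """Keep one copy of each distinct (features, label) row.

    Rows come back sorted, not in their original order.
    """
    rows = np.column_stack([X, y])
    unique_rows = np.unique(rows, axis=0)
    return unique_rows[:, :-1], unique_rows[:, -1].astype(int)


def two_step_selection(X_train, y_train, X_val, y_val, top_k=8, seed=0):
    """Step 1 ranks features; step 2 searches among the top_k ranked ones."""

    def fit(cols):
        forest = RandomForestClassifier(n_estimators=50, random_state=seed)
        return forest.fit(X_train[:, cols], y_train)

    # Step 1: rank all features by permutation importance, highest first.
    all_cols = list(range(X_train.shape[1]))
    full_model = fit(all_cols)
    result = permutation_importance(
        full_model, X_val, y_val, n_repeats=5, random_state=seed
    )
    ranking = np.argsort(result.importances_mean)[::-1]
    candidates = [int(i) for i in ranking[:top_k]]

    # Step 2 (assumed strategy): greedy forward search over the candidates,
    # keeping a feature only if validation accuracy strictly improves.
    selected, best_score = [], 0.0
    for col in candidates:
        trial = selected + [col]
        score = fit(trial).score(X_val[:, trial], y_val)
        if score > best_score:
            selected, best_score = trial, score
    return candidates, selected, best_score


# Synthetic stand-in data with 50 deliberately duplicated records.
X_raw, y_raw = make_classification(
    n_samples=300, n_features=12, n_informative=4, n_redundant=2, random_state=0
)
X_dup = np.vstack([X_raw, X_raw[:50]])
y_dup = np.concatenate([y_raw, y_raw[:50]])

X_unique, y_unique = drop_duplicate_records(X_dup, y_dup)
X_train, X_val, y_train, y_val = train_test_split(
    X_unique, y_unique, test_size=0.25, random_state=0, stratify=y_unique
)
candidates, selected, best_score = two_step_selection(
    X_train, y_train, X_val, y_val
)
print("candidate features:", candidates)
print("selected features:", selected)
print("validation accuracy:", round(best_score, 3))
```

Removing duplicates before ranking matters here for the reason the abstract gives: repeated records would otherwise weigh more heavily in both the forest and the importance scores.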