Comparing the Area of Data Mining Algorithms in Network Intrusion Detection

Network-based intrusion detection has become a common benchmark for evaluating machine learning algorithms. Although the KDD Cup'99 dataset suffers from class imbalance across the intrusion classes, it still plays a significant role in evaluating machine learning algorithms. In this work, we utilize the singular value decomposition (SVD) technique for feature dimension reduction. We then reconstruct the features from the reduced features and the selected eigenvectors, and use the reconstruction loss to decide the intrusion class for a given network sample: the intrusion class with the smallest reconstruction loss is accepted as the intrusion class for that sample. The proposed system yields 97.90% accuracy on the KDD Cup'99 dataset for the stated task. We have also analyzed the system on the individual intrusion categories separately. This analysis suggests using an ensemble of multiple classifiers; therefore, we also created a random forest classifier. The random forest classifier performs significantly better than the SVD based system, achieving 99.99% accuracy for intrusion detection on the same training and testing data set.


Singular Value Decomposition Algorithm
The Singular Value Decomposition (SVD) technique has a long and surprising journey. SVD was first used in the social sciences for intelligence testing. Early research in intelligence testing found that tests given to measure different aspects of intelligence, such as verbal and spatial, were often closely correlated.
There are many names by which SVD is known. In the early days, it was called principal component (PC) decomposition, factor analysis, and empirical orthogonal function (EOF) analysis. All these names are mathematically equivalent to each other, but they are treated differently in the literature.
Today, singular value decomposition has spread through many branches of science, in particular psychology and sociology, climate and atmospheric science, and astronomy. It is also extremely useful in machine learning and in both descriptive and predictive statistics. In many machine learning applications, it is useful to find a lower-rank matrix which can represent the data matrix. The singular value decomposition of a matrix X is the factorization of X into the product of three matrices, X = U D V^T, where the matrices U and V are real valued and their columns are orthonormal. The matrix D is diagonal with positive real entries [5].
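As a concrete illustration, the rank-k reconstruction loss that drives the proposed classifier can be sketched with NumPy as below. The function name and the toy data are our own, not part of the evaluated system.

```python
import numpy as np

def svd_reconstruction_loss(X, sample, k):
    """Fit a rank-k SVD basis on X (rows = samples, columns = features)
    and return the reconstruction loss of `sample` in that basis."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    Vk = Vt[:k]                # top-k right singular vectors (feature basis)
    projected = sample @ Vk.T  # reduce to k dimensions
    recon = projected @ Vk     # map back to the original feature space
    return np.linalg.norm(sample - recon)
```

In a reconstruction-loss classifier of this kind, one basis is fitted per intrusion class, and a test sample is assigned to the class whose basis reconstructs it with the smallest loss.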

Random Forest Algorithm
There are many supervised classification algorithms, and an ensemble of them may yield better performance. With this intuition, the Random Forest algorithm creates an ensemble of several decision tree classifiers, called the forest of decision trees. The Random Forest algorithm was proposed by Leo Breiman [6]. All the decision trees in the forest participate and the final result is drawn by majority vote; therefore, a higher number of trees in the forest generally gives more accurate results. We have partitioned the training data samples into K subsets (K = 500 in our work) randomly. For each subset, we have constructed a decision tree. All the decision trees are constructed by randomly selecting m variables (with randomly selected samples in the corresponding subset) and finding the best split on the selected variables. This technique is applied at each node of the decision tree until the node becomes a leaf node. Each decision tree votes for a classification result, and the final classification is decided by the majority vote of the decision trees [7] [8].
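The bagging-plus-majority-vote idea above can be sketched in miniature. The stump-based forest below is a toy illustration (all names and data are ours), not the classifier used in the experiments, which grows full decision trees with random variable selection.

```python
import random
from collections import Counter

def train_stump(samples):
    """Train a depth-1 'tree' (decision stump) on (x, label) pairs:
    pick the threshold on x that minimises misclassifications."""
    best = None
    for t in sorted({x for x, _ in samples}):
        left = Counter(y for x, y in samples if x <= t)
        right = Counter(y for x, y in samples if x > t)
        ly = left.most_common(1)[0][0]
        ry = right.most_common(1)[0][0] if right else ly
        errors = sum(1 for x, y in samples
                     if y != (ly if x <= t else ry))
        if best is None or errors < best[0]:
            best = (errors, t, ly, ry)
    _, t, ly, ry = best
    return lambda x: ly if x <= t else ry

def random_forest(samples, n_trees, seed=0):
    """Bag n_trees stumps on bootstrap samples; predict by majority vote."""
    rng = random.Random(seed)
    trees = [train_stump([rng.choice(samples) for _ in samples])
             for _ in range(n_trees)]
    def predict(x):
        votes = Counter(tree(x) for tree in trees)
        return votes.most_common(1)[0][0]
    return predict
```

Each tree sees a different bootstrap resample of the training data, so individual trees disagree on hard samples; the majority vote smooths out those individual errors.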
Let C1(x), C2(x), ..., CK(x) be the K decision tree classifiers. These classifiers are created from training sets randomly drawn from the training set of the KDDCUP'99 dataset. Let X be the feature vector and Y the class label, and let the indicator function be denoted by I. The margin function measures the average number of votes for the correct attack class in excess of the average number of votes for any other class at a given X and Y:

mg(X, Y) = av_k I(C_k(X) = Y) − max_{j ≠ Y} av_k I(C_k(X) = j).

The larger the margin, the more confident the classification. The generalization error of the system is given as Equation (3):

PE* = P_{X,Y}(mg(X, Y) < 0),   (3)

where P_{X,Y} is the probability over the X, Y space.
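As a numeric illustration of the margin function, the small helper below computes mg(X, Y) from the raw votes of an ensemble on a single sample; the vote lists are invented for illustration.

```python
from collections import Counter

def margin(votes, true_label):
    """Breiman's margin: fraction of trees voting the correct class minus
    the largest fraction voting for any other single class."""
    n = len(votes)
    counts = Counter(votes)
    correct = counts[true_label] / n
    wrong = max((c / n for label, c in counts.items() if label != true_label),
                default=0.0)
    return correct - wrong
```

A negative margin means the ensemble misclassifies the sample, which is why the generalization error in Equation (3) is the probability mass of negative margins.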

Related Work
In information security, machine learning techniques have become attractive to researchers because of their ability to process large volumes of data and to provide classifications without prior knowledge of the data. Therefore, dif- [...] does not occur at a particular time. The model proposed in [11] accurately detected suspicious payload content in network packets through the use of a multinomial one-class Naive Bayes classifier for payload-based anomaly detection (OCPAD).
SVM classifiers have also been used to build IDS systems. For instance, Wagner et al. [12] use one-class classifiers that can detect new anomalous data points which do not belong to the learned class. In particular, they use the one-class SVM classifier proposed by Scholkopf et al. [13]. In such a classifier, the training data is presumed to belong to only one class, and the learning goal during training is to determine a function which is positive when applied to points on the circumscribed boundary around the training points and negative outside. They obtain 92% accuracy on average over all attack classes. Catania et al. [14] proposed a novel approach that provides autonomous labeling of normal traffic, in order to overcome imbalanced class distributions and reduce the presence of attacks in the traffic data used for training an SVM classifier. Amer et al. [15] applied two modifications to make one-class SVMs more suitable for unsupervised anomaly detection: robust one-class SVMs and eta one-class SVMs. Their aim was to make the decision boundary less sensitive to outliers in the data.
Additionally, Wang et al. [16] developed an effective IDS based on an SVM with augmented features. This IDS model integrates the SVM with the logarithm marginal density ratios transformation (LMDRT), a feature transformation technique that transforms the dataset into a new one. The new and concise dataset is used to train the SVM classifier, improving its detection. By evaluating the framework on the widely used NSL-KDD dataset, the authors achieved a fast training speed, high accuracy and detection rates, as well as a low false alarm rate. Kabir et al. [17] proposed an IDS based on a modification of the standard SVM classifier known as the least squares support vector machine (LS-SVM). This alteration is sensitive to outliers and noise in the training dataset when compared to a regular SVM. Their decision-making process is divided into two stages. The first stage is responsible for reducing the dataset dimension by selecting samples, depending on the variability of the data, using an optimum allocation scheme. The second stage uses these representative samples as the input of the LS-SVM. An example of a classification-based IDS is Automated Data Analysis and Mining (ADAM) [18], which provides a test bed for detecting anomalous instances. ADAM exploits a combination of classification techniques and association rule mining to discover attacks in a tcpdump audit trail. Abbes et al. [19] introduce an approach that uses decision trees with protocol analysis for effective intrusion detection. Several authors have combined classification and clustering for network intrusion detection, exploiting the advantages of the two approaches. For example, Muda et al. [20] present a two-stage model for network intrusion detection. In the first stage, k-means clustering is used to generate three clusters: C1 for attack data such as Probe, U2R and R2L; C2 for DoS attack data; and C3 for non-attack data.
In the second stage, a Naive Bayes classifier classifies the data into the five classes Normal, DoS, Probe, R2L and U2R. Another approach based on the combination of k-means clustering and the Iterative Dichotomiser 3 (ID3) decision tree algorithm is proposed in [21]. In this approach, the training data is grouped into k clusters using Euclidean distance similarity. A decision tree is then built using the ID3 algorithm on the instances in each cluster to overcome the shortcomings of the k-means algorithm. The authors claim that the detection accuracy of the k-means + ID3 method is very high, with a low false positive rate on network anomaly data. Artificial Neural Networks (ANNs) are also used in anomaly detection systems, mostly as classifiers. An example of an ANN-based IDS is RT-UNNID [22]. This system is capable of intelligent real-time intrusion detection using unsupervised neural networks (UNN). Subba et al. [23] employed an ANN model to introduce an intelligent agent for classifying whether the underlying patterns of audit records are normal or abnormal, including new and unseen records. Saeed et al. [24] proposed a two-level anomaly-based IDS using a Random Neural Network (RNN) model in an IoT environment. The RNN model was employed to build a behavior profile based on both valid and invalid system input parameters, distinguishing normal and abnormal patterns. Brown et al. [25] proposed a two-class classifier using an evolutionary general regression neural network (E-GRNN) for intrusion detection based on the features of application-layer protocols.

Proposed System
In this work we have utilized two machine learning approaches for the task of network intrusion detection: Singular Value Decomposition (SVD) and Random Forest. The dataset used to evaluate these approaches is the KDDCUP'99 network intrusion detection dataset. We have also compared the approaches on different evaluation metrics; the details of the comparison are given in Section 4. In this section we explain the two approaches used in the work.

Pre-Processing of the Data
1) One-hot encode categorical attributes: Every unique category of an attribute is assigned an id, and this id gives the position of the one in the one-hot encoded vector. For example, if the unique categories are "A", "B", "C", "D", "E", then the one-hot vector for category "B" is 01000, whereas it is 00010 for the category "D".
2) Update incomplete data samples: Some attributes have missing values for some samples in the dataset. We have updated these values with the mean of the attribute within the corresponding label class.
3) Normalize the data: The different features/attributes of the dataset have different units and scales. Two attributes with different units or scales cannot be compared directly, so we normalized the dataset using the z-score [22].
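The three pre-processing steps (one-hot encoding, class-wise mean imputation, z-score normalization) can be sketched with NumPy as follows; function names and toy inputs are illustrative, not the exact pipeline used in the experiments.

```python
import numpy as np

def one_hot(values, categories):
    """Map each categorical value to a one-hot row vector."""
    ids = {c: i for i, c in enumerate(categories)}
    out = np.zeros((len(values), len(categories)))
    for row, v in enumerate(values):
        out[row, ids[v]] = 1.0
    return out

def impute_class_mean(x, labels):
    """Replace NaNs in a numeric column by the mean within the same class."""
    x = x.copy()
    for lbl in np.unique(labels):
        mask = labels == lbl
        x[mask & np.isnan(x)] = np.nanmean(x[mask])
    return x

def zscore(x):
    """z-score normalization: zero mean, unit standard deviation."""
    return (x - x.mean()) / x.std()
```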

Experimental Datasets
The KDDCUP'99 dataset is a processed version of the DARPA dataset created in 1998. It was distributed as part of a competition (the KDDCUP competition in 1999) sponsored by the International Conference on Knowledge Discovery in Databases. The competition required the contestants to create a predictive model that can learn to predict the class label of a computer network connection [9]. The class labels for a computer network connection are legitimate and illegitimate connection. The dataset has a large number of sample records of network connections, covering both normal connections and attack connections. The whole dataset is divided into two mutually exclusive parts, named the train set and the test set. The train set has approximately five million records of computer network connections, whereas the test set has about 0.3 million records.
A computer network connection is a session of data transfer between a pair of computers. Each session is time-stamped and has 41 other attributes. Out of these 41 attributes (features), 32 are of continuous type and the rest are categorical. Besides these attributes, each connection is also labeled either as a normal connection or as an attack connection (the different attack types are mentioned in Table 1). The attributes can be further categorized into four groups: Basic features, Traffic features, Host-based traffic features, and Connection-based content features [26] [27] [28]. Basic features/attributes (refer Table 2) are common to all network connections; they can be used to detect intrusions/attacks targeting service and protocol vulnerabilities. Traffic features/attributes based on a fixed time window (refer Table 3) are calculated over a fixed-duration time window; a two-second window is used to examine the connections which have the same service or destination host as the current connection. Connection-based content features (refer Table 5) may or may not be useful in identifying malicious network activity; they are based on domain knowledge and are helpful in identifying U2R and R2L attacks/intrusions by monitoring statistics disclosed in the payload or in the audit logs.

Evaluation Criteria
We need to compare the performance of two machine learning approaches; therefore, we require evaluation measures that are sensitive as well as robust to the available dataset. It is unlikely to have these properties in a single measure, so we report the performance of the system on several measures.
For a class X there are four types of observation depending upon the prediction and the ground truth. These four observations are listed in Table 6.
There are some performance measures based on these observations; we utilize those listed below. Accuracy: this measure calculates how many times the classifier predicts a class correctly with respect to all test samples. It can also refer to the closeness of a predicted value to a known or true value. It is calculated by Equation (4) [31].
accuracy = (tp + tn) / N,   (4)

where N is the total number of test samples. Precision: this measure calculates how many times the classifier's prediction of a class is correct. It can also refer to the closeness of multiple measurements to each other. It is calculated by Equation (5).
Precision = tp / (tp + fp).   (5)

Recall: this measure calculates how many times the classifier's prediction of a class retrieves the class correctly. In other words, recall is the per-class accuracy of the system. It is calculated by Equation (6).
Recall = tp / (tp + fn).   (6)

F-measure [32]: generally, Precision and Recall for a classifier do not follow each other; if Precision improves, Recall often declines, and vice-versa. A sound classifier needs both measures to be high, so a measure is required that incorporates both of them. This is the F-measure, which combines Precision and Recall. It is calculated by Equation (7):

F_β = (1 + β²) · Precision · Recall / (β² · Precision + Recall).   (7)
In our experiments we use the F1-measure, i.e., β = 1. Since Precision and Recall generally do not follow each other and we need to incorporate both into the performance measurement, we can also consider the area under the curve of the Receiver Operating Characteristic [33].
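A minimal sketch of the measures defined above, computed directly from the tp/tn/fp/fn counts of Table 6 (the function names are ours):

```python
def accuracy(tp, tn, n):
    """Equation (4): fraction of all N test samples predicted correctly."""
    return (tp + tn) / n

def precision(tp, fp):
    """Equation (5): fraction of positive predictions that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Equation (6): fraction of actual positives that are retrieved."""
    return tp / (tp + fn)

def f_measure(p, r, beta=1.0):
    """Equation (7): weighted harmonic mean of Precision and Recall."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)
```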
As the task is a multi-class problem, averaging the evaluation measures over all intrusion classes gives a view of the general results. We use the micro-averaging and macro-averaging approaches for this task.
Macro-averaged measure
The macro-averaged result of a measure A for a multi-class problem can be computed by Equation (8). For each attack class λ we count the true positives (tp_λ), true negatives (tn_λ), false negatives (fn_λ), and false positives (fp_λ) (refer Table 6), and evaluate the measure A for that class. Finally, we take the mean of this measure over all q attack classes:

A_macro = (1/q) Σ_{λ=1..q} A(tp_λ, tn_λ, fn_λ, fp_λ).   (8)

Micro-averaged measure
Similarly, a micro-averaged measure is computed by first summing the counts over all classes and then evaluating the measure once, as in Equation (9):

A_micro = A(Σ_λ tp_λ, Σ_λ tn_λ, Σ_λ fn_λ, Σ_λ fp_λ).   (9)
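The difference between the two averaging schemes can be sketched as follows, using Precision as the measure A; the per-class counts are invented for illustration.

```python
def precision(tp, fp):
    return tp / (tp + fp)

def macro_average(metric, per_class_counts):
    """Equation (8): evaluate the metric per class, then average."""
    values = [metric(**counts) for counts in per_class_counts]
    return sum(values) / len(values)

def micro_average(metric, per_class_counts):
    """Equation (9): sum the counts over all classes, then evaluate once."""
    keys = per_class_counts[0].keys()
    totals = {k: sum(c[k] for c in per_class_counts) for k in keys}
    return metric(**totals)
```

Macro-averaging weighs every class equally, so rare attack classes count as much as frequent ones; micro-averaging weighs every sample equally, so frequent classes dominate. On an imbalanced dataset such as KDDCUP'99 the two can differ substantially.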

Results and Analysis
In this section, we present the results of the Singular Value Decomposition (SVD) based model and the Random Forest (RF) based system. Figure 1 and Figure 2 show the normalized confusion matrices for the SVD model and the RF model, respectively. These confusion matrices show that the system classifies most of the attack types correctly. There are still some attack types for which the performance of the system is not satisfactory. The main reason for this behavior is class imbalance in the training sample set. In the KDDCup'99 dataset, some attack classes have very few samples while others have very many; the ratio of the maximum to the minimum number of samples for an attack type in the dataset is very high. Despite this class imbalance problem, the Random Forest method outperforms the SVD model.

Table 6. Different observations of prediction for a class X.

                                   True class label is X    True class label is not X
Predicted class label is X         tp: true positive        fp: false positive
Predicted class label is not X     fn: false negative       tn: true negative

Journal of Information Security

... and Random Forest models. Figure 5 and Figure 6 show the Recall performance analysis and the F-measure performance analysis over all attack types for the SVD and Random Forest models. These bar charts also show the better performance of the Random Forest method over the SVD method. Finally, we show the overall performance of the systems, depicting the accuracy and F-measure of the Random Forest and SVD methods (refer Figure 7). The Random Forest method outperforms the SVD based system in all performance measures and shows promising behavior for intrusion detection in network connection environments.

Conclusion and Future Scope
In this work, we have tested and analyzed the two classification methods based