Evaluation of Modified Vector Space Representation Using ADFA-LD and ADFA-WD Datasets

Predicting anomalous behaviour of a running process using system call traces is a common practice in the security community and is still an active research area. It is a typical pattern recognition problem and can be addressed with machine learning algorithms. Standard system call datasets were employed to train these algorithms. However, advancements in operating systems have made these datasets outdated and irrelevant. The Australian Defence Force Academy Linux Dataset (ADFA-LD) and the Australian Defence Force Academy Windows Dataset (ADFA-WD) are new-generation system call datasets that contain labelled system call traces for modern exploits and attacks on various applications. In this paper, we evaluate the performance of the Modified Vector Space Representation technique on the ADFA-LD and ADFA-WD datasets using various classification algorithms. Our experimental results show that our method performs well and helps in accurately distinguishing process behaviour through system calls.


Introduction
A system call is a request for a service that a program makes to the kernel. The sequence of system calls can describe the behaviour of a process. System call traces are used in Host-based Intrusion Detection Systems (HIDS) to distinguish normal and malicious processes. A number of data representation techniques are found in the literature (e.g., n-gram model and lookahead pairs [1] [2], sequencegram [3], pairgram [4], etc.) for extracting features from a system call trace for process behaviour classification. By considering the collected system call traces as a set of documents and system calls as words, we can apply classical data representation and classification techniques used in natural language processing (NLP) and information retrieval (IR). Document representation techniques such as the Boolean model and the vector space model have been reported in the literature for extracting features from system call traces. X. Wang et al. [5] used n-grams with the Boolean model for feature extraction and a Support Vector Machine (SVM) with a Gaussian Radial Basis Function (GRBF) kernel for classification. K. Rieck et al. [6] used the vector space model and considered the frequency of a system call in a trace as the weight of that system call. They utilized a polynomial kernel function for classifying the vectors storing the weight of each system call. Y. Liao and V. R. Vemuri [7] used the vector space model for system call trace representation and applied a k-nearest neighbour (kNN) classifier, where nearness was calculated using cosine similarity. However, these approaches do not consider system call sequence information, which would help in better describing the system call behaviour.
Researchers have utilized well-known system call trace datasets, such as the University of New Mexico (UNM) intrusion detection dataset [8] and the DARPA intrusion detection dataset [9], to train machine learning algorithms for process behaviour prediction. However, these datasets were compiled decades ago and are not very relevant for modern operating systems [10]. In 2013, new system call trace datasets, known as the ADFA datasets, were released by G. Creech et al. [10] [11]. The ADFA datasets are considered a new benchmark for evaluating system call based intrusion detection systems. They contain a wide collection of system call traces representing modern vulnerability exploits and attacks.
G. Creech et al. [12] proposed a semantic model for anomaly detection using short sequences of the ADFA-LD dataset. They prepared a dictionary of words and phrases from the dataset and evaluated it with the Hidden Markov Model (HMM), Extreme Learning Machine (ELM) and one-class SVM algorithms. They achieved an accuracy of 90% for ELM and 80% for SVM with a 15% false positive rate (FPR) [12] [13]. For the ADFA-WD evaluation, G. Creech et al. [11] also used HMM, ELM and SVM. They reported 100% accuracy with 25.1% FPR for HMM, 91.7% accuracy with 0.23% FPR for ELM and 99.58% accuracy with 1.78% FPR for SVM. However, learning a dictionary of all possible short sequences is a time-consuming task [14] [15]. Miao Xie et al. [15] applied k-nearest neighbour (kNN) and k-means clustering (kMC) algorithms to the ADFA-LD dataset. They used a frequency-based model for data representation and principal component analysis (PCA) to reduce the dimension of the feature vector. With a combination of kNN and kMC, they achieved an accuracy of 60% with 20% FPR. In another attempt, Miao Xie et al. [13] applied one-class SVM with a short sequence based technique to ADFA-LD. With one-class SVM, they achieved a maximum accuracy of 70% with around 20% FPR.
We modified the approach of X. Wang et al. [5], given for the Boolean model, and proposed the Modified Vector Space Representation in [16] to represent a process system call trace as a feature vector. It is a system call frequency based approach and utilizes the vector space model with n-grams. In [16], we evaluated the proposed method on the system call trace datasets used in [17]. In this paper, we apply the modified vector space representation approach to the ADFA-LD and ADFA-WD datasets and discuss the results obtained with the different classification techniques chosen for evaluation.
The rest of the paper is organized as follows: Section 2 describes classic data representation techniques in the context of system call traces, along with their limitations. Section 3 discusses the modified vector space representation. Section 4 details the datasets, selected algorithms, chosen evaluation metrics and experimental methodology used for evaluation. Section 5 discusses the performance results, followed by the conclusion and references at the end.

System Call Trace Representation
In order to classify process behaviour using a system call trace, one needs to extract features from it. Data representation techniques can be used to convert the system call trace into a feature vector. Common data representation techniques used for system call representation are as follows:

Trivial Representation
The basic representation of a system call trace is to consider it as a string (sequence) of system calls. Consider an operating system with a total of $m$ unique system calls; the set of system calls can then be represented by $U = \{s_1, s_2, s_3, \dots, s_m\}$. Let $F_i$ be a finite sequence of system calls and let $U^*$ represent the set of all possible finite sequences of system calls, so that $F_i \in U^*$; a labelled trace can then be written as the pair $(S_i, L_i)$ with $S_i \in U^*$, where $S_i$ is the system call trace of the $i$th application and $L_i$ is its label (i.e. normal or malware). The memory complexity of such a representation is $O\left(\sum_{i=1}^{N} |S_i|\right)$, where $N$ is the number of traces and $\sum_{i=1}^{N} |S_i|$ represents the total length of the sequences. If the system call sequences are long, this could be a large number.

Boolean Model
A simple representation technique from the area of information retrieval is the Boolean model [18]. It is an exact-match model, which represents a system call trace as a vector indexed by all possible system call numbers. The value at an index is 1 if the system call is present in the given trace and 0 otherwise.
Consider that the total number of system calls in an operating system is $m$. A system call trace $S_i$ can be represented using the Boolean model as a feature vector $(v_{i1}, v_{i2}, \dots, v_{im})$, where $v_{ij} = 1$ if system call $s_j$ appears in $S_i$ and $v_{ij} = 0$ otherwise. This model considers every system call equally important and only marks its presence or absence. It does not assign any additional weight to a system call that appears multiple times in a system call trace.
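The Boolean model can be illustrated with a few lines of Python. This is a minimal sketch rather than the implementation used in the cited works: the function name `boolean_vector`, the toy numbering of system calls (0 to 5) and the example trace are assumptions made for illustration.

```python
def boolean_vector(trace, num_syscalls):
    """Return a 0/1 vector with one entry per system call number."""
    present = set(trace)
    return [1 if call in present else 0 for call in range(num_syscalls)]


# Hypothetical toy OS with 6 system calls, numbered 0..5.
trace = [3, 3, 1, 5, 3]
vector = boolean_vector(trace, 6)
print(vector)
```

Note that system call 3 occurs three times in the toy trace, yet its entry is still 1: repetition is not visible in this model.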

Vector Space Model
The vector space model is another common and powerful technique used in the information retrieval field to represent a document as a set of words [18]. It is also known as the "bag of words" technique, as it assigns a weight to each word in a given document in order to determine how relevant the document is to specific words. Here, the weight assigned to a word is the number of times the word appears in the document. In the context of system call representation, a system call trace is considered as a document and each system call as one word. We can then apply the vector space model to represent a given system call trace as a feature vector.
To represent system call traces using the vector space model (bag of words) representation, let us consider a feature set $B$ as a set of vectors corresponding to the applications' system call traces. The system call trace of an application $i$ can be represented with this model as a vector $B_i = (b_{i1}, b_{i2}, \dots, b_{im})$, where $b_{ij}$ represents the number of times the system call $s_j$ appears in the system call trace sequence $S_i$. The memory complexity of the vector space model representation is similar to that of the Boolean model, i.e. $O(N \times |U|)$ for $N$ traces. Here, $|U|$ is the number of system calls. For example, Linux 3.2 has 349 system calls, so $|U| = 349$. Note that the number of system calls $|U|$ is smaller than the total length of the sequences $\sum_{i=1}^{N} |S_i|$.
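A frequency-based feature vector of this kind can be sketched as follows. The function name `frequency_vector`, the toy system call numbering and the two example traces are assumptions made for illustration only; they are not taken from the datasets.

```python
from collections import Counter


def frequency_vector(trace, num_syscalls):
    """Return how often each system call number occurs in the trace."""
    counts = Counter(trace)
    return [counts[call] for call in range(num_syscalls)]


# Hypothetical toy OS with 6 system calls and N = 2 traces.
traces = [[3, 3, 1, 5, 3], [0, 1, 0]]
matrix = [frequency_vector(t, 6) for t in traces]
for row in matrix:
    print(row)
```

The resulting matrix has one row per trace and one column per system call, which corresponds to the $N \times |U|$ storage cost discussed above.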

Modified Vector Space Representation
The vector space model cannot preserve the relative order of system calls. For example, the feature vectors for the system call traces $S_1$: (open, read, close, exit) and $S_2$: (open, exit, close, read) are identical. The relative order of system calls is important when modelling process behaviour. Loss of system call sequence information can leave a system vulnerable to mimicry attacks [19] [20], where a malware writer interleaves malware system call trace patterns with a benign system call trace. Thus, we consider multiple consecutive system calls as one term. The number of system calls in a term is defined by the term-size. For term-size $l$ and a total of $m$ unique system calls, the n-gram model provides a total of $m^l$ possible unique terms in a feature vector. To represent system call traces using this approach, let $U^l$ be the set of all possible unique terms of length (term-size) $l$. Here, $U^l = \{t_1, t_2, \dots, t_{m^l}\}$, where $t_k$ represents the $k$th term of length $l$ derived through the n-gram model from $U$. The feature set $C$ contains the occurrence count of each term in a given system call trace. For instance, $C_i = (c_{i1}, c_{i2}, \dots, c_{im^l})$, where $c_{ik}$ represents the number of times the term $t_k$ appears in the system call trace $S_i$. The memory requirement for the n-gram with vector space model approach for term length $l$ is $O(N \times |U^l|)$. The representation created using the n-gram model is more costly than the normal vector space model, as $|U| < |U^l|$. In addition, not all features (terms) are present in the system call traces, which means they have zero weight in the feature vector. We can reduce the dimension of the feature vector by considering only those unique terms that are present in the data.
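The loss of ordering information, and how terms of consecutive system calls recover it, can be checked with a short Python sketch. The helper name `term_counts` and the two four-call traces (permutations of open, read, close and exit) are assumptions chosen for illustration.

```python
from collections import Counter


def term_counts(trace, term_size):
    """Count every run of term_size consecutive system calls in the trace."""
    terms = zip(*(trace[i:] for i in range(term_size)))
    return Counter(terms)


s1 = ["open", "read", "close", "exit"]
s2 = ["open", "exit", "close", "read"]

# Term-size 1 is the plain vector space model: order is invisible.
same_unigrams = term_counts(s1, 1) == term_counts(s2, 1)
# Term-size 2 keeps pairs of adjacent calls, so the traces differ.
same_bigrams = term_counts(s1, 2) == term_counts(s2, 2)
print(same_unigrams, same_bigrams)
```

With term-size 1 the two traces are indistinguishable, whereas with term-size 2 no term is shared between them.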
If we consider only those unique terms that appear in the training data, the memory requirement is lower than when considering all possible unique terms generated from $U$. This can be represented by the set of all unique terms of length $l$ occurring in the training data, which can be defined as $U^l_{train}$. However, this representation does not cover system call sequences that were not explored during training and may appear in testing.
The feature vector built by considering only those terms that appeared in the training data is much more compact than the other system call representations. However, it requires prior knowledge of the unique terms in system call traces, which is not always possible. We can easily find the unique terms in the training data. However, during training, we might not have explored all possible usages of an application. It is quite possible that terms that were not present in the training data appear in the testing data.
Modified Vector Space Representation [16] extends the previous representation (which considers unique terms from the training data only) by incorporating a mechanism to handle any unforeseen terms during testing. We deliberately add a system call number, referred to as unknown (unk), to the list; its value is higher than any system call number present in the system call list of the OS. It is used to map terms that are not seen during training but are found in testing. We form terms of length $l$ comprising this unknown system call number, including one term consisting entirely of the unknown system call number. Let $E$ be the set of unknown terms comprising the unk system call number. Hence, the new feature set can be defined over $U^l_{new} = U^l_{train} \cup E$, where $|U^l_{new}| = |U^l_{train}| + |E|$. Here, the number of terms comprising the unk system call, $|E|$, will be very small.
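The idea of reserving features for unforeseen terms can be sketched as below. This is a deliberately simplified illustration and not the procedure of [16]: it assumes that the set of unknown terms contains only the single all-unk term and that every unseen term is mapped to it, whereas the representation described above also forms terms that mix known system calls with unk. All names (`UNK`, `ngrams`, `build_vocabulary`, `to_vector`) and the toy traces are hypothetical.

```python
from collections import Counter

UNK = "unk"  # placeholder standing in for the unknown system call number


def ngrams(trace, term_size):
    """List every run of term_size consecutive system calls."""
    return list(zip(*(trace[i:] for i in range(term_size))))


def build_vocabulary(training_traces, term_size):
    """Index the terms seen in training, plus one all-unk term (assumption)."""
    seen = sorted({t for trace in training_traces for t in ngrams(trace, term_size)})
    vocab = {term: index for index, term in enumerate(seen)}
    vocab[(UNK,) * term_size] = len(vocab)
    return vocab


def to_vector(trace, vocab, term_size):
    """Count known terms; route every unseen term to the all-unk feature."""
    unk_term = (UNK,) * term_size
    counts = Counter(t if t in vocab else unk_term for t in ngrams(trace, term_size))
    vector = [0] * len(vocab)
    for term, count in counts.items():
        vector[vocab[term]] = count
    return vector


train = [["open", "read", "close"], ["open", "write", "close"]]
vocab = build_vocabulary(train, 2)
test_vector = to_vector(["open", "read", "exit", "close"], vocab, 2)
print(len(vocab), test_vector)
```

In this toy run the test trace contains two terms never seen in training, and both are counted under the last (all-unk) feature instead of being discarded.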

Evaluation
In this section we provide details of the datasets, the selected classification algorithms, the evaluation metrics and the experimental methodology used for evaluation.

Datasets
We have evaluated the modified vector space representation with two datasets, namely ADFA-LD (Linux Dataset) and ADFA-WD (Windows Dataset), constructed by G. Creech et al. [10]-[12]. Table 1 describes the number of traces collected from [21] for each category of the ADFA-LD and ADFA-WD datasets. For ADFA-LD, system call traces for specific processes were generated using the auditd [22] Unix program, an auditing utility for collecting security-relevant events. These traces were then filtered by size, with limits of 300 bytes to 6 kB for training data and 300 bytes to 10 kB for validation data [11] [12]. The ADFA-LD dataset was collected under a fully patched Ubuntu 11.04 operating system with kernel 2.6.38. The operating system was running different services such as a web server, database server, SSH server, FTP server, etc. ADFA-LD also incorporates system call traces of different types of attacks. ADFA-WD (Windows Dataset) represents a high-quality collection of DLL access requests and system calls for a variety of hacking attacks [11]. The dataset was collected on a Windows XP SP2 host with the help of the Procmon [23] program. The default firewall was enabled and Norton AV 2013 was installed so that only sophisticated attacks were captured and low-level attacks were ignored. File sharing was enabled and a network printer was configured in the OS environment. It was running applications such as a web server, database server, FTP server, streaming media server, PDF reader, etc. A total of 12 known vulnerabilities in the installed applications were exploited with the help of the Metasploit framework and other custom methods. Table 3 describes the details of each attack class in the ADFA-WD dataset [11].

Algorithms Selected for Experiments
We selected the Weka workbench [24] [25] for the evaluation of the modified vector space representation on the ADFA-LD and ADFA-WD datasets. Weka hosts a number of machine learning algorithms, which can be easily applied to our prepared datasets of varying term-size. We selected nine well-known classification algorithms from six different categories given in Weka. The list of selected algorithms, the selected options for each algorithm and their respective categories in Weka are shown in Table 4.

Experiments Methodology
The datasets were collected from [21] and then converted into the modified vector space representation for various term-sizes. For these experiments we selected term-sizes 1, 2, 3 and 5. For each dataset (i.e. ADFA-LD and ADFA-WD) we ran experiments for binary class as well as multiclass label classification. For binary classification we considered one of two labels for each trace: normal and attack. For multiclass classification, the number of classes and the class labels differ between the two datasets. In ADFA-LD we have a total of 7 class labels, viz. normal, adduser, hydra-ftp, hydra-ssh, java-meterpreter, meterpreter and webshell, while in ADFA-WD we have a total of 13 class labels, viz. normal and V1 to V12. We ran each chosen algorithm with the selected options on the converted data in Weka using the 10-fold cross-validation method. Table 5 describes the number of features extracted from the ADFA-LD and ADFA-WD datasets for varying term-size using the modified vector space representation.
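The partitioning behind k-fold cross-validation can be outlined as follows. This sketch only shows how trace indices are divided into folds; it is not Weka's implementation, which may additionally shuffle and stratify the data, and the function name `k_fold_indices` and the toy size of 25 traces are hypothetical.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    folds = [list(range(start, n_items, k)) for start in range(k)]
    for held_out, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, test


# Toy example: 25 traces split into 10 folds.
splits = list(k_fold_indices(25, k=10))
for train, test in splits[:2]:
    print(len(train), test)
```

Each trace is used for testing exactly once and for training in the remaining nine rounds, and the reported figures are aggregated over all rounds.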

Evaluation Metrics
We have used the following common evaluation metrics that are widely used in the information retrieval area [18]. Figure 1 shows the confusion matrix, which can be used to derive other measures.
Precision: It is the ratio of the number of attack traces predicted as attack traces to the total number of traces predicted as attack traces.
$\text{Precision} = \frac{TP}{TP + FP}$
Area Under the ROC Curve (AUC): It is the area covered by the ROC curve. It is equivalent to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one [26].
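The measures used in this evaluation can be computed from the four confusion matrix counts as sketched below, with the attack class taken as the positive class. The function names, the standard textbook formulas for accuracy and FP rate, and all numbers in the example are illustrative assumptions; the counts and scores are made up and are not results from the experiments. The second function follows the ranking interpretation of AUC stated above.

```python
def binary_metrics(tp, fp, tn, fn):
    """Derive common measures from confusion matrix counts (no zero checks)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "fp_rate": fp / (fp + tn),
    }


def auc_by_ranking(attack_scores, normal_scores):
    """Fraction of (attack, normal) pairs ranked correctly; ties count half."""
    wins = 0.0
    for a in attack_scores:
        for n in normal_scores:
            wins += 1.0 if a > n else 0.5 if a == n else 0.0
    return wins / (len(attack_scores) * len(normal_scores))


# Made-up counts and classifier scores, for illustration only.
m = binary_metrics(tp=80, fp=10, tn=90, fn=20)
auc = auc_by_ranking([0.9, 0.8, 0.4], [0.5, 0.3])
print(m, auc)
```

The pairwise loop is quadratic in the number of traces and is meant only to make the definition concrete.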

Results and Analysis
Figure 2 shows the performance results in terms of accuracy and false positive rate of the selected algorithms with varying term-size on both datasets. The results shown here are the weighted average of the results derived for individual class labels. Detailed experimental (weighted average) results on ADFA-LD and ADFA-WD are given in the Appendix (Tables A1-A4).
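A weighted average over class labels can be sketched as follows, under the assumption that each class is weighted by its number of traces. The function name `weighted_average` and the per-class values and class sizes in the example are made up for illustration and are not taken from the result tables.

```python
def weighted_average(per_class_values, class_sizes):
    """Average per-class values, weighting each class by its trace count."""
    total = sum(class_sizes)
    return sum(v * n for v, n in zip(per_class_values, class_sizes)) / total


# Made-up example: two classes with 300 and 100 traces.
avg = weighted_average([0.9, 0.5], [300, 100])
print(avg)
```

With such weighting, a large class (typically the normal class) dominates the reported figure, which is worth keeping in mind when reading the multiclass results.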
From Figure 2, we can observe that, using the modified vector space representation, all algorithms perform reasonably well. However, IBk and J48 performed best in all experiments.
With the IBk algorithm, we can notice that as we increase the term-size, its performance starts degrading (i.e. accuracy decreases and FP rate increases). These changes are clearly visible in the case of the ADFA-LD dataset. Similar performance results are achieved by J48 in all experiments. However, IBk has a higher FP rate than J48 for term-sizes 3 and 5 on the ADFA-LD dataset.
Comparing IBk and J48 from an application perspective, J48 requires more time to build the decision tree model during training but is faster during the testing phase. In contrast, IBk has no separate training phase: it computes the distance between the test instance and all training instances during the testing phase. Because of this, IBk needs a large amount of memory to store all training instances during the testing phase compared to J48, whose stored model is merely a tree. So, with J48, classifying a test instance in the testing phase is as simple as traversing a limited number of branches (based on feature values) of a decision tree from root to leaf.
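The testing-phase cost of a nearest neighbour classifier with k = 1 can be seen in the following sketch. It is not Weka's IBk code: the function name `nearest_neighbour_label`, the use of Euclidean distance and the toy feature vectors are assumptions made for illustration.

```python
import math


def nearest_neighbour_label(train_vectors, train_labels, query):
    """Return the label of the training vector closest to the query (k = 1)."""
    best_index = min(
        range(len(train_vectors)),
        key=lambda i: math.dist(train_vectors[i], query),
    )
    return train_labels[best_index]


# Toy frequency vectors; every prediction scans all stored training vectors.
train_vectors = [[3, 0, 1], [0, 4, 0]]
train_labels = ["normal", "attack"]
predicted = nearest_neighbour_label(train_vectors, train_labels, [2, 1, 1])
print(predicted)
```

Every prediction touches all stored training vectors, which is why both memory use and testing time grow with the size of the training set, while a decision tree only follows one path from root to leaf.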
On the ADFA-WD dataset, all algorithms perform well for binary classification but perform poorly for multiclass classification. Similar facts can be observed from Figure 3 and Figure 4. Figure 3 shows ROC curves of IBk (k = 1) and J48 with all term-sizes on the ADFA-LD and ADFA-WD datasets for binary classification. Figure 4 shows ROC curves of IBk (k = 1) and J48 with term-size 3 on the ADFA-LD and ADFA-WD datasets for multiclass classification. From Figure 4(c), Figure 4(d) and Table A4, we can observe that IBk and J48 achieve high accuracy for the normal class on ADFA-WD but fail to distinguish among attack classes. A possible cause for this could be the similarity among system call traces of vulnerability exploits launched through Metasploit.

Conclusion
In this work, we have evaluated our proposed modified vector space representation using the ADFA-LD and ADFA-WD system call trace datasets. We extracted features from both datasets using our proposed method for varying term-size. We also considered binary and multiclass classification for evaluation on both datasets. The modified vector space representation (term-sizes 2, 3 and 5) performs as well as the standard vector space model (term-size 1), if not better, in terms of accuracy, FP rate and F-measure. There is no significant difference in the results for varying term-size. However, a higher term-size preserves more system call sequence information, which provides resistance against mimicry attacks. From the evaluation results, we conclude that IBk and J48 perform better on both datasets compared with the other selected algorithms.

The feature set consists of only those terms appearing in the training dataset. Here, the number of terms in the feature set $U^l_{train}$ would be less compared to all possible terms of length $l$ generated from $U$, i.e. $|U^l_{train}| < |U^l|$. For $N$ system call traces in the training dataset, the memory complexity of this representation would be $O(N \times |U^l_{train}|)$.

True Positive (TP): Number of attack traces detected as attack traces. False Positive (FP): Number of normal traces detected as attack traces. True Negative (TN): Number of normal traces detected as normal traces. False Negative (FN): Number of attack traces detected as normal traces.
Receiver Operating Characteristic (ROC) curve: It is a graph of true positive rate against false positive rate. It represents the performance of a binary classifier as its discrimination threshold is varied.

Table 1. Number of system call traces in different categories of the ADFA-LD and ADFA-WD datasets.

Table 2. Attack vectors used to generate the ADFA-LD attack dataset.

Table 3. Vulnerabilities considered to generate the ADFA-WD attack dataset.

Table 4. List of selected algorithms with their options.

Table 5. Number of features extracted from the ADFA-LD and ADFA-WD datasets for term-size 1, 2, 3 and 5.

Recall: It is also known as the True Positive Rate (TPR). It is the ratio of the number of attack traces predicted as attack traces to the total number of actual attack traces.
F-measure: It is a measure that combines precision and recall into a single measure. It is calculated as the harmonic mean of precision and recall.

Table A2. Experiment results for various term-sizes on the ADFA-WD dataset with binary class labels.

Table A3. Experiment results for various term-sizes on the ADFA-LD dataset with multiclass class labels.

Table A4. Experiment results for various term-sizes on the ADFA-WD dataset with multiclass class labels.