Hoeffding Tree Algorithms for Anomaly Detection in Streaming Datasets: A Survey

This survey aims to deliver an extensive and well-constructed overview of using machine learning for the problem of detecting anomalies in streaming datasets. The objective is to assess the effectiveness of Hoeffding Trees as a machine learning solution to the problem of detecting anomalies in streaming cyber datasets. In this survey, we categorize the existing research works on Hoeffding Trees that are feasible for this type of study into three groups: distributed Hoeffding Trees, ensembles of Hoeffding Trees, and existing techniques that use Hoeffding Trees for anomaly detection. These categories are referred to as compositions within this paper and were selected based on their relation to streaming data and the flexibility of their techniques for use within different domains of streaming data. We discuss how combining the techniques of the research works within these compositions can address the anomaly detection problem in streaming cyber datasets. The goal is to show how a combination of techniques from different compositions can solve a prominent problem, anomaly detection.


Introduction
How to cite this paper: Muallem, A., Shetty, S., Pan, J.W., Zhao, J. and Biswal, B.
A wide variety of application domains use streaming data. Streaming data is beneficial in most scenarios where new, dynamic data is continuously generated. Data collection is the first step in working with streaming data, and it can evolve into more sophisticated real-time processing.
Initially, these applications may perform simple data analysis with simple actions as a response, and they may ultimately evolve into more sophisticated types of data analysis, such as applying machine learning algorithms to extract precise information from the data. Data streams pose several challenges for data mining algorithm design [1]. One is the limited use of resources [1].
Most big data, in which data are produced continuously, can be recorded as data streams [2].
Many machine learning algorithms have been developed for sophisticated data analysis on streaming data. These include classification, regression, and clustering algorithms. A model for data streams must address three features of big data: large volume, high velocity, and large variety. A fundamental requirement in data streams is the necessity of dealing with data whose nature or distribution changes over time [1]. Dealing with time-changing data requires strategies for detecting and quantifying change, forgetting stale examples, and revising the model [1].
Decision trees are a type of classification algorithm [3] [4]. A decision tree is learned top-down by recursively replacing leaves with test nodes, starting at the root [1]. The attribute to test at a node is chosen by comparing all available attributes and selecting the best one according to some heuristic measure [1].
Classical decision tree learners are severely limited in the number of examples they can learn from, since they assume that all training examples can be stored simultaneously in memory [1]. The Hoeffding Tree, an incremental, anytime decision tree induction algorithm capable of learning from massive data streams, was developed by Domingos and Hulten [1] [5]. The theory behind Hoeffding Trees is that a small sample can often be enough to choose an optimal splitting attribute [1]. The Hoeffding bound supports this idea mathematically by quantifying the number of observations or examples needed to estimate some statistic within a prescribed precision, here the goodness of an attribute [1]. Hoeffding Trees have sound guarantees of performance, a theoretically interesting feature not shared by other incremental decision tree learners. Figure 1 provides the Hoeffding Tree induction algorithm referenced from [6].
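To make the role of the Hoeffding bound concrete, the following minimal sketch computes the bound and applies the split test used in Hoeffding Tree induction. The function names, the default tie threshold of 0.05, and the example values are illustrative assumptions and are not taken from [1], [5], or [6]. For information gain, the range R is commonly taken as the base-2 logarithm of the number of classes.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Epsilon such that, with probability 1 - delta, the true mean of a
    variable with range `value_range` lies within epsilon of the mean
    observed over n independent examples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def examples_needed(value_range: float, delta: float, epsilon: float) -> int:
    """Smallest n for which the Hoeffding bound is at most epsilon."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2))


def should_split(best_gain: float, second_gain: float, value_range: float,
                 delta: float, n: int, tie_threshold: float = 0.05) -> bool:
    """Split when the best attribute beats the runner-up by more than epsilon,
    or when epsilon is so small that the two are considered tied.
    The tie_threshold default is an illustrative assumption."""
    eps = hoeffding_bound(value_range, delta, n)
    return (best_gain - second_gain > eps) or (eps < tie_threshold)


if __name__ == "__main__":
    # Two-class problem with information gain: R = log2(2) = 1.
    print(hoeffding_bound(1.0, 1e-7, 1000))
    print(examples_needed(1.0, 1e-7, 0.05))
    print(should_split(0.30, 0.10, 1.0, 1e-7, 1000))
```

Under these assumed values (R = 1, delta = 1e-7), about 3,224 examples at a leaf are enough to bring the bound down to 0.05, which illustrates why a small sample can suffice to choose a splitting attribute.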
Anomaly detection for streaming data is an important research topic because it detects vital information, such as intrusions and other failures. There is extensive work on anomaly detection techniques [7] [8] [9] for streaming data. These techniques address fault detection and intrusion detection by exploring methods for identifying anomalies based on the scalability and generality of the data streams, but they do not take changes in the underlying organization of the data into consideration. Concept drift refers to a change over time in the underlying distribution of streaming data [10] [11]. It occurs when the concept about which data are being collected shifts from time to time after a minimum stability period. When exploring various types of anomaly detection, such as intrusion or fault detection, one must explore the foundations of cyber-attacks. Cyber-attacks can create abrupt, incremental, gradual, or recurring changes [12]. In data mining, the concept is the target information that a model is trying to predict [12]. A change of the underlying concept over time is known as concept change [12]. A relatively slow change of concept is referred to as concept drift, and an abrupt change of concept is referred to as concept shift [12]. Machine learning work generally focuses on representing one type of concept drift.
There are many proposed Hoeffding Tree methods for data streams. Most of these are built to deal with distribution changes in data streams, such as concept drift. These models can be a single Hoeffding Tree, decision trees based on the Hoeffding bound, window-based Hoeffding Trees, weight-based Hoeffding Trees, distributed Hoeffding Trees, and/or an ensemble of Hoeffding Trees. Each of these models addresses various aspects of concept drift in data streams or identifies the best approach to classifying data in data streams without concept drift. In many instances, these models are combined with other techniques to improve accuracy, performance, or drift detection.
An Anomaly Detection System can also be known as an Intrusion Detection System, in which intrusions are identified by classifying activities as either normal or anomalous, with a training phase implemented to recognize "new" attacks [13]. In the case of using machine learning for anomaly detection, classification algorithms can be used to determine if uncommon patterns [...]

Let HT be a tree with a single leaf (the root)
for all training examples do
    Sort example into leaf $l$ using HT
    Update sufficient statistics in $l$
    Increment $n_l$, the number of examples seen at $l$
    if $n_l \bmod n_{min} = 0$ and examples seen at $l$ not all of same class then
        Compute $G_l(X_i)$ for each attribute
        Let $X_a$ be attribute with highest $G_l$
        Let $X_b$ be attribute with second-highest $G_l$
        Compute Hoeffding bound $\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2 n_l}}$
        if $X_a \neq X_\emptyset$ and ($G_l(X_a) - G_l(X_b) > \epsilon$ or $\epsilon < \tau$) then
            Replace $l$ with an internal node that splits on $X_a$
            for all branches of the split do
                Add a new leaf with initialized sufficient statistics
            end for
        end if
    end if
end for
Figure 1. Hoeffding Tree induction algorithm [6].

[...] Trees while providing the techniques used for this research work. We also provide the advantages and disadvantages of each research work. In Section 6, we provide a comparison chart listing the evaluation metrics produced by each proposed research work, based on accuracy, Kappa statistic, time, and/or memory, and identifying the dataset that produced each evaluation metric for the given algorithm.
In Figure 2, we present the three compositions of Hoeffding Tree algorithms for streaming datasets within various application domains, including anomaly detection. These application domains include machine learning classification accuracy and performance in streaming data, concept drift detection in streaming data, machine learning distributed processing to reduce execution time, and [...]
Figure 2. Overview of existing Hoeffding Tree algorithms for streaming datasets within various application domains including anomaly detection.

Surveying Distributed Hoeffding Trees
In this section, we survey a proposed research work of distributed Hoeffding Trees based on Spark Streaming and provide the advantages and disadvantages of this research work as well as metrics from our experimental evaluation performed on the proposed research work.

Spark Streaming Hoeffding Trees
Bifet et al. in [14] introduce an application called StreamDM, which they describe as a new open-source data mining and machine learning library based on Spark Streaming. The authors define Spark Streaming, discuss its advantages and disadvantages, and give the reasoning behind their choice of this platform. They note that their design concentrated on implementing an extensible library of advanced stream mining algorithms for Spark Streaming that is readily available to developers, researchers, and others. The authors note that StreamDM benefits from being built on top of Spark Streaming because Spark Streaming exists within the Hadoop open-source environment. They also indicate that they use Scala as the implementation language due to its benefits as well as its compatibility with other platforms. The authors state that their goal is to provide StreamDM as a machine learning library on top of which researchers and developers can easily design additional algorithms. StreamDM currently contains four classification algorithms: Multinomial Naïve Bayes, SGD Learner and Perceptron Classifier using the Stochastic Gradient Descent optimizer for learning various linear models, Hoeffding Decision Trees, and Bagging.

Evaluating StreamDM
In this section, we discuss our evaluation of StreamDM, for which we created experiments to compare it to MOA and WEKA [15]. We also provide evaluation metrics [...]
[Figure 3]
[...] authors' thoughts in [14], it is a great tool for streaming machine learning models, as it provides not only Hoeffding Trees but also SGD Learner, Bagging, and [...]

Surveying Ensembles of Hoeffding Trees
For streaming processes, ensemble classifiers offer a number of advantages over single-classifier designs [16]. One reason is that ensembles can be easily scaled and parallelized [16]. A key strength of ensembles is their ability to adapt quickly to change by pruning the parts of the ensemble that perform poorly, which yields more accurate concept descriptions [16]. In this section, we evaluate several proposed research works on ensembles of Hoeffding Trees and provide the advantages and disadvantages of each proposed work.

Ensembles of Restricted Hoeffding Trees Using Stacking
Bifet et al. in [17] present a classification model based on an ensemble of restricted decision trees, i.e., Hoeffding Trees, using stacking. The authors detail how each tree is built within this algorithm, which involves attributes, the log-odds of the predicted class probabilities, sigmoid perceptrons, perceptron classifiers, and more. The authors also describe how their stacking approach differs from boosting and elaborate on the method they use for forming their ensemble classifier [18]. The authors also note the significance of Hoeffding Trees within their work, since they are working in a data stream scenario.
The authors thoroughly elaborate on the method they present and on why they use ADWIN, which they note is due to its theoretical guarantees on false positives when detecting change. The authors also elaborate on the benefits of ADWIN and on how they use it for dealing with evolving data streams: replacing poorly performing ensemble members when the accuracy of one of the Hoeffding Trees has dropped significantly, and resetting the learning rate. They describe in detail how the ADWIN change detector is used to replace ensemble members whose accuracy has declined, as well as the learning rate reset process.
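As an illustration of how an adaptive-window change detector can flag a drop in a tree's accuracy, the sketch below implements a simplified, naive variant of the ADWIN idea. It is not the implementation used in [17]: it stores the raw window instead of ADWIN's compressed buckets, it drops the whole older sub-window at the first cut it finds, and the class name, defaults, and threshold formula (recalled from the ADWIN literature rather than taken from this survey) are assumptions.

```python
import math
from collections import deque


class SimpleAdwin:
    """Naive ADWIN-style detector: keeps raw values and tests every split."""

    def __init__(self, delta: float = 0.002, max_window: int = 1000):
        self.delta = delta                      # confidence parameter (assumed default)
        self.window = deque(maxlen=max_window)  # most recent observations

    def update(self, value: float) -> bool:
        """Add one observation in [0, 1]; return True if a change was detected."""
        self.window.append(value)
        changed = False
        while self._drop_stale_prefix():
            changed = True
        return changed

    def _drop_stale_prefix(self) -> bool:
        values = list(self.window)
        n = len(values)
        if n < 2:
            return False
        total = sum(values)
        delta_prime = self.delta / n
        head_sum = 0.0
        for split in range(1, n):
            head_sum += values[split - 1]
            n0, n1 = split, n - split
            mean0 = head_sum / n0
            mean1 = (total - head_sum) / n1
            m = 1.0 / (1.0 / n0 + 1.0 / n1)  # harmonic-style combination of sizes
            eps_cut = math.sqrt(math.log(4.0 / delta_prime) / (2.0 * m))
            if abs(mean0 - mean1) > eps_cut:
                # The older part differs significantly: forget it.
                for _ in range(split):
                    self.window.popleft()
                return True
        return False


if __name__ == "__main__":
    detector = SimpleAdwin()
    # Hypothetical error stream of one tree: 0 = correct, 1 = wrong.
    errors = [0.0] * 300 + [1.0] * 100
    hits = [i for i, e in enumerate(errors) if detector.update(e)]
    print(hits)
```

In an ensemble setting, one such detector could monitor the 0/1 error of each member, and a detection could trigger replacing that member or resetting a learning rate, in the spirit of the procedure described above.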
The authors in [17] [...] Trees consisting of 1, 2, 3, and 4 attributes, using an Interleaved Test-Then-Train, or Prequential, evaluation. The authors also note that they performed a separate experiment in which ADWIN-based change detection for resetting the learning rate was disabled. They found a decrease in average accuracy, which they note reinforces the benefit of using ADWIN-based change detection for resetting the learning rate. The authors note that, according to the experimental results, using more trees with bagging using ADWIN is not any more accurate than using fewer trees with bagging using ADWIN, and they therefore believe the positive performance of their proposed algorithm is not related to the use of many trees. They believe the vital factor in the accuracy improvement is their methodology of using attribute subsets and perceptron weighting. During the experimental evaluation, the authors found that their new stacking method increased ensemble diversity: the ADWIN bagging ensembles were tightly clustered, whereas the stacking method produced greater disagreement among the classifiers within the ensemble. To expose this difference, the authors use the Kappa statistic κ [21] and describe its use with their new stacking-based classifier to assess ensemble diversity. The authors also introduce a new strategy to replace ensemble classifiers.
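The Interleaved Test-Then-Train (Prequential) scheme and the Kappa statistic mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: MajorityClassLearner is a hypothetical stand-in for a Hoeffding Tree, the stream is a toy list of labeled examples, and returning 0 when chance agreement equals 1 is a convention chosen here because Kappa is undefined in that case.

```python
from collections import Counter


class MajorityClassLearner:
    """Toy incremental learner (hypothetical stand-in for a Hoeffding Tree)."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn(self, x, y):
        self.counts[y] += 1


def cohen_kappa(pairs):
    """Kappa = (p0 - pc) / (1 - pc) over (first_label, second_label) pairs."""
    n = len(pairs)
    p0 = sum(1 for a, b in pairs if a == b) / n          # observed agreement
    first = Counter(a for a, _ in pairs)
    second = Counter(b for _, b in pairs)
    pc = sum(first[c] * second[c] for c in first) / (n * n)  # chance agreement
    if pc == 1.0:
        return 0.0  # undefined case; reporting 0 is an assumption made here
    return (p0 - pc) / (1.0 - pc)


def prequential(stream, learner):
    """Each example is first used for testing and then for training."""
    pairs = []
    for x, y in stream:
        pairs.append((y, learner.predict(x)))  # test first
        learner.learn(x, y)                    # then train
    accuracy = sum(1 for y, p in pairs if y == p) / len(pairs)
    return accuracy, cohen_kappa(pairs)


if __name__ == "__main__":
    stream = [([0], "normal")] * 8 + [([1], "attack")] * 2
    print(prequential(stream, MajorityClassLearner()))
```

On this toy stream the majority-class learner reaches an accuracy of 0.7 while its Kappa is slightly below zero, which shows why Kappa is informative on imbalanced streams. Applied to the predictions of two classifiers rather than to true and predicted labels, the same statistic measures their agreement, so lower values indicate a more diverse pair.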
We have organized some of the evaluation metrics produced from the authors' experimental evaluation, based on the results illustrated in [17]. These evaluation metrics are included in the comparison chart provided in Table 1 in Section 6.
Advantages of this research: The authors in [17] [...] streaming cyber data sets. Since an ensemble of classifiers is stronger than a single classifier, the method and techniques presented in this research work can be applied to address problems in the domain of anomaly detection and cyber data streams. This idea is explained in more detail in Section 6.
Disadvantages of this research: The authors note some deficiencies in their initial experiments with pruning the ensembles and discarding the least important ensemble member. Their goal is to address this with smarter pruning methods and techniques to increase classification accuracy.

The Accuracy Updated Ensemble (AUE2)
Brzezinski et al. [22] propose [...] The authors note that the Naïve Bayes algorithm was added to their comparison as a reference, since it is an algorithm with no drift reaction mechanism.
The authors indicate that their experimental evaluation used data chunk evaluation, which they explain works similarly to the test-then-train design but uses data chunks instead of single examples [23]. The authors also describe the benefits of this evaluation method and why they used it in their experimental evaluation.
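A minimal sketch of data chunk evaluation, assuming the chunk-wise analogue of test-then-train described above: every example in a chunk is predicted with the current model, and only afterwards is the chunk used for training. The function names, the dictionary-based toy learner, and the example stream with a sudden drift are hypothetical and are not taken from [22] or [23].

```python
def chunked(stream, chunk_size):
    """Yield consecutive lists of chunk_size examples (the last may be shorter)."""
    block = []
    for example in stream:
        block.append(example)
        if len(block) == chunk_size:
            yield block
            block = []
    if block:
        yield block


def data_chunk_evaluation(stream, chunk_size, predict, train):
    """Return the accuracy on each chunk, measured before training on it."""
    accuracies = []
    for block in chunked(stream, chunk_size):
        correct = sum(1 for x, y in block if predict(x) == y)  # test the whole chunk
        accuracies.append(correct / len(block))
        for x, y in block:                                     # then train on it
            train(x, y)
    return accuracies


if __name__ == "__main__":
    memory = {}  # toy learner: remembers the last label seen for each input

    def predict(x):
        return memory.get(x)

    def train(x, y):
        memory[x] = y

    # The labels of "a" and "b" swap after four examples (a sudden drift).
    stream = [("a", 0), ("b", 1), ("a", 0), ("b", 1),
              ("a", 1), ("b", 0), ("a", 1), ("b", 0)]
    print(data_chunk_evaluation(stream, 2, predict, train))
```

With a chunk size of 2, the per-chunk accuracies are 0.0, 1.0, 0.0, and 1.0: the drop on the third chunk reflects the drift, because the whole chunk is tested before the model has seen any of its labels.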
The authors report that their experiments revealed that the NB algorithm performed very poorly in the presence of drifts, followed by a few other evaluated algorithms. They believe NB failed because it lacks a drift reaction mechanism and therefore does not successfully learn from data streams with recurrent gradual drifts. The experiments also revealed that, in the presence of sudden recurring drifts, AUE1 and AUE2 performed best, and their accuracy was considerably affected only by the first type of drift. The authors indicate the experimental evaluation showed the remaining [...]

    [...]
    for all classifiers $C_j \in \mathcal{E} \setminus \{C'\}$ do
        incrementally train classifier $C_j$ with $B_i$
    end for
    if memory_usage($\mathcal{E}$) > $m$ then
        prune (decrease size of) component classifiers
    end if
end for

The authors conclude AUE2 produced the best average classification accuracy while consuming the least amount of memory.
We have organized some of the evaluation metrics produced from the authors' experimental evaluation, based on the results illustrated in [23]. These evaluation metrics are included in the comparison chart provided in Table 1 in Section 6.
Advantages of this research: The authors introduce an algorithm called the Accuracy Updated Ensemble (AUE2), derived from the preliminary version, AUE1, with enhancements and modifications. The authors reiterate their hybrid algorithm's ability to react equally well to various types of drift scenarios, such as gradual, sudden, recurring, mixed, and short-term. They note that their algorithm is inspired by the ensemble weighting mechanism of the Accuracy Weighted Ensemble, combined with the incremental training of component classifiers. The authors report that their experimental evaluation demonstrated that AUE2 can offer substantial classification accuracy in static environments as well as in environments containing various types of drifts [22]. They conclude that, in comparison to the other ensemble approaches, AUE2 produced the best average classification accuracy and the lowest memory consumption. With reference to our survey, we believe the Accuracy Updated Ensemble (AUE2) can be of great benefit in our approach of constructing an ensemble of Hoeffding Trees for anomaly detection within cyber data streams. This proposed approach is explained in more detail in Section 6.
Disadvantages of this research: The authors indicate that AUE2, like the rest of the algorithms, did not react sufficiently to complex combinations of drifts. They note that this could be an interesting topic for further research.

Hoeffding Trees for Anomaly Detection
In this section, we evaluate a type of Hoeffding Tree used for Anomaly Detection and provide the advantages and disadvantages of the proposed research work.

Hoeffding Trees as Adaptive Trees for Real-Time Cyber-Power Event and Intrusion Classification
Adhikari et al. in [12] introduce [...]
We have organized some of the evaluation metrics produced from the authors' experimental evaluation, based on the results illustrated in [12]. These evaluation metrics are included in the comparison chart provided in Table 1 in Section 6.

Literature Review on Ensembles
[...] and training methods. The authors in [26] indicate that their goal is to clarify the characteristics involved in applying ensemble learners in a data stream context by proposing a taxonomy that organizes general techniques, presenting a classification of over 60 ensemble algorithms according to this taxonomy, and discussing current and future trends for ensemble learning in a stream setting, which also includes big data stream processing. The authors in [26] convey that their proposed taxonomy outlines the intersections between ensemble learning on static datasets and on dynamic data streams, while highlighting characteristics of ensemble learning that are unique and beneficial to the data stream learning setting.
For the proposed taxonomy in [26], the authors not only arrange ensemble-related techniques based on diversity, base learner, and combination, but also discuss characteristics that influence ensemble formulation and are unique to data stream learning, which they refer to as "update dynamics". These cover important methods for stream learning, such as strategies to cope with drifts, how learning is performed, and when to remove or add classifiers. The taxonomy presented in [26] is organized around general aspects of algorithms in a data stream learning setting. When such aspects map directly to characteristics of an actual algorithm, they are better represented as values rather than dimensions. For example, cardinality corresponds to a dimension, while fixed and dynamic are values.
The authors in [26] discuss combination techniques of ensemble methods, such as the overall performance of combining ensemble members' predictions and its benefits and hindrances. In this discussion, the authors differentiate between the voting method and the ensemble members' architecture. They discuss the different types of combination architectures, such as Flat, Meta-Learner, Hierarchical, and Network. The meta-level combiner is used within the research work we discussed in Section 3, the Ensembles of Restricted Hoeffding Trees. The authors in [26] also discuss the different types of voting methods, such as majority voting, weighted majority voting, rank voting, classifier selection voting, and relational voting. In addition, the authors in [26] [...] Our survey differs from [26], as we focus on ensembles of Hoeffding Trees in the context of anomaly detection in streaming cyber datasets. The information presented by the authors in [26] can be used as a basis for understanding our work and the need for this type of research on ensemble learning with respect to real-world problems and/or data stream classification in general.
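Two of the voting methods listed above, majority voting and weighted majority voting, can be sketched in a few lines. This is a generic illustration rather than the formulation used in [26]; the function name, the label strings, and the example weights are assumptions.

```python
from collections import defaultdict


def weighted_majority_vote(predictions, weights=None):
    """Return the label with the largest total weight.

    With weights=None every ensemble member counts equally, which reduces
    to plain majority voting. Ties go to the label that was seen first.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    if weights is None:
        weights = [1.0] * len(predictions)
    if len(weights) != len(predictions):
        raise ValueError("one weight per prediction is required")
    scores = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    return max(scores, key=scores.get)


if __name__ == "__main__":
    votes = ["normal", "normal", "attack"]
    print(weighted_majority_vote(votes))                   # plain majority
    print(weighted_majority_vote(votes, [0.2, 0.2, 0.9]))  # weighted by, e.g., recent accuracy
```

In the second call the single "attack" vote wins because its weight (0.9) exceeds the combined weight of the two "normal" votes (0.4), which is how a weighted scheme lets a recently accurate member outvote weaker ones.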
The authors in [26] also describe various existing research works that can help in gaining more in-depth knowledge of the use of ensembles for various problems, including concept drift. Concept drift can be closely related to anomaly detection, as concept drift concerns changes in the underlying distribution of data and anomaly detection concerns normal versus abnormal patterns in data. Nevertheless, there is still a need for research exploring ensembles for anomaly detection as a topic in its own right, since anomaly detection and/or intrusion detection is an immense field that can benefit from these techniques. This introduces our goal in this research survey. The work in [26] presents material that serves as a foundation for constructing the necessary ensemble learning algorithm for this type of research work: ensembles addressing the problem of detecting anomalies in cyber datasets.

Discussions
Hoeffding Tree models with integrated features enhance existing machine learning models for anomaly detection in streaming datasets. This is due to their strong performance on streaming datasets as well as the flexibility to add adaptive, incremental learning, distributed, and drift detection approaches. For the surveyed algorithms mentioned above, a comparison chart, shown in Table 2, has been created indicating the measurements produced by each algorithm in terms of accuracy, Kappa statistic, time, and/or memory. Not all of the surveyed algorithms include measurements for each of these statistics; where a statistic is available, it is included in the chart along with the dataset on which it was obtained. As stated in the introduction, we introduce the use of techniques from one composition to solve the problem of a different composition. Here we discuss how combining some of the compositions presented in the previous sections of this survey can solve the problem of anomaly detection in streaming cyber datasets. An overview of our proposed combination of compositions is depicted in Figure 5.
In Figure 5, the composition techniques we identify as key components are the ensembles of Hoeffding Trees built on a distributed platform. This ensemble contains two types of Hoeffding Trees: general Hoeffding Trees and HAT with ADWIN & DDM (Hoeffding Adaptive Trees with DDM and ADWIN). The ensemble could be based on the methodology presented in Section 3.1, which discusses the Hoeffding Trees using Stacking, or in Section 3.2, which discusses the AUE2 ensemble. In Figure 5, we propose using the AUE2 ensemble architecture for our methodology; however, for an implementation, either ensemble technique can be used and examined further to determine the best fit for the type of work.
In Figure 5, we also show that the ensembles of Hoeffding Trees for the method we suggest are to be developed on StreamDM, the distributed platform for Hoeffding Trees built on top of Spark using Spark Streaming. The StreamDM platform not only allows for distribution during execution but also supports real-time streaming data, since Spark Streaming is a platform for real-time streaming.
A further explanation of Figure 5 is as follows:
• Input (Attack Dataset): This is the attack dataset that is input into the system. Examples are the NSL-KDD dataset, an artificially generated attack dataset, or any other type of dataset containing a single attack or multiple attacks.
• Preprocessing of Data: This is the stage where the data is converted into the format required by the system. An example is using the NSL-KDD [...]
[...] DDM) classify the data within the dataset as normal or anomalous.
• Predict Accuracy and Kappa statistic: To assess the true classification performance of an algorithm, we use the accuracy it produces (the proportion of correctly classified instances) as well as other common statistics such as the Kappa statistic. We choose the Kappa statistic because it is a good indicator of classification performance in streaming data. In our case, we will use these metrics to determine the effectiveness of anomaly detection by the classifiers within the ensemble.
• Anomaly Detection: As stated in the list item above, this serves as a blueprint for how the models (classifiers) within the AUE2 ensemble predict anomalies.
• Performance Evaluation: In this stage, other metrics are taken into consideration, such as the performance of the classifiers.
• AUE2 Classifier: As stated, we will use the AUE2 ensemble approach; this item in the diagram signifies that the complete procedure resides on top of the AUE2 classifier ensemble approach.
From our analysis of the research works surveyed in the previous sections, we believe that, in terms of machine learning, the combination of these compositions can be effective in addressing the problem of anomaly detection in streaming cyber datasets. A pivotal factor in determining whether this type of proposed research study is feasible is the implementation of, and experimentation with, this proposed combination, together with the use of a diverse cyber dataset, which allows the flexibility to include public datasets such as the NSL-KDD dataset, an artificially generated attack dataset, or both.

Conclusions and Suggestions
Anomaly detection in streaming datasets is the ability to handle high volumes of abnormal data patterns in the distribution of data. In this survey, we have discussed different problem compositions that are relevant in varied streaming data application domains within and outside of anomaly detection. These domains are explained in the Introduction section. We note that two of the three compositions are not particularly exhaustive and not focused on anomaly detection, but can be used for addressing the problem. We also note that the anomaly detection problem might be composed in other ways and with other machine learning algorithms [...] Although the focus of this survey paper has been on Hoeffding Trees for streaming datasets and how these can be used for anomaly detection in streaming cyber datasets, many of the techniques discussed here are also applicable to concept drift in streaming datasets.
While the literature on anomaly detection for streaming datasets is rich, there are several research directions that need to be explored in the future. In this survey, we have provided a high-level comparison of various techniques that can be used. An experimental study of these techniques is essential for a more in-depth understanding of their characteristics.