Review of Anomaly Detection Systems in Industrial Control Systems Using Deep Feature Learning Approach

Industrial Control Systems (ICS) or SCADA networks are increasingly targeted by cyber-attacks as their architectures have shifted from proprietary hardware, software and protocols to standard, open-source ones. Furthermore, these systems, which used to be isolated, are now interconnected with corporate networks and the Internet. Among the countermeasures used to mitigate these threats, anomaly detection systems play an important role because they can help detect even unknown attacks. Deep learning, which has gained considerable attention in the last few years owing to excellent results in image, video and natural language processing, is being used for anomaly detection in information security, particularly in SCADA networks. The salient features of the data from SCADA networks are learnt as a hierarchical representation using deep architectures, and those learnt features are used to classify the data as normal or anomalous. This article reviews various architectures, such as Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Stacked Autoencoder (SAE) and Long Short-Term Memory (LSTM), or a combination of those architectures, for anomaly detection in SCADA networks.


Introduction
Industrial Control Systems (ICS) are used to monitor and control industrial systems. ICS used to be isolated from enterprise networks, making attacks against them difficult. Moreover, these systems used proprietary hardware, software and protocols. But as technology has evolved, today's ICS use Commercial-Off-The-Shelf (COTS) software and hardware as well as open protocols such as Ethernet and TCP/IP. The situation has worsened with the interconnection of ICS to enterprise networks and to the Internet. A successful attack against an industrial control system could have a severe impact, ranging from economic damage to loss of human lives [1] [2]. ICS attacks have already targeted water treatment systems, power grids and nuclear power plants [3] [4]. Some of the most famous attacks used the Duqu, Flame [5] and Stuxnet [6] malware. Although many countermeasures are deployed to secure ICS networks, intrusion and anomaly detection systems are important complementary security measures used to protect them.
In recent years, Deep Learning [7] has become a hot topic among researchers, with successes in domains such as natural language processing (NLP) and image and video classification.
One of the most important features of deep learning is the use of unsupervised methods to learn hierarchical features autonomously [7] [11] [12] [13] [14].
The most salient features of the data are learnt without supervision by the deep architecture, and those learnt features are then passed to a classifier that discriminates anomalous data from normal data.
In this paper we review anomaly detection systems for SCADA networks that use a deep feature learning approach.
The next section highlights the concept of unsupervised feature learning, and the third section reviews different anomaly detection systems in SCADA networks that use deep unsupervised feature learning.
In section four, we conclude the review.

Unsupervised Feature Learning
Feature learning consists in modeling the behavior of data from a subset of features by deriving new features from the original ones [15]. In standard machine learning, feature learning from data is a complex task, as it requires domain experts to handcraft the original features so that the machine learning algorithms are fed with the best ones. The learning process can be supervised or unsupervised. Supervised learning also needs human intervention to label the data correctly, which is costly and error prone. To take advantage of the huge amount of unlabeled data, deep learning algorithms can automatically learn important features from data in an unsupervised manner [7] [16]. The main goal of unsupervised feature learning is to map the original feature set into a different representation better suited to a given machine learning task [17]. Deep architectures help in building complex non-linear functions that better fit complex real-world data [18]. Unsupervised feature learning can be done by clustering the data with algorithms such as K-means [19], or by training stacked auto-encoders or convolutional networks [20].
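As an illustration of the clustering route to unsupervised feature learning, the following minimal sketch runs a plain K-means on unlabeled points and then re-describes a point by its distances to the learned centroids. The toy readings, the seeding of centroids with the first k points and the helper names (`kmeans`, `learned_features`) are assumptions made for the example; none of them comes from the reviewed works.

```python
import math

def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm; the first k points seed the centroids
    (an assumption made here so that the run is deterministic)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: group every point with its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for j, members in enumerate(clusters):
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def learned_features(point, centroids):
    """New representation: distance from the point to every learned centroid."""
    return [math.dist(point, c) for c in centroids]

# Toy, made-up readings: two groups of two-dimensional measurements.
data = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0)]
centroids = kmeans(data, k=2)
features = learned_features((0.0, 0.5), centroids)
```

No label is used at any point: the new representation is derived from the structure of the data alone, and a downstream classifier would consume `features` instead of the raw coordinates.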

Review of Unsupervised Feature Learning in SCADA Anomaly Detection Systems

LSTM/Bloom Filter Anomaly Detector
In order to detect anomalies due to data/command injection, reconnaissance or Denial-of-Service (DoS) attacks on a gas pipeline SCADA system, [8] proposes an anomaly detection approach consisting of two detectors (Figure 1). The first one is a packet-level anomaly detector, based on a Bloom filter, which checks a packet signature against its database. The database stores signatures of network and communication patterns, as these are stable in a SCADA system. If the Bloom filter does not contain the signature of the analyzed packet, the packet is considered anomalous. The second detector, an LSTM network, receives the normal packets that pass the Bloom filter for another level of detection; it uses its ability to memorize information over a number of time steps to predict the behavior at the next time step.
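The packet-level check can be sketched with a small Bloom filter. This is only an illustrative sketch: the signature strings, the filter size, the number of hash functions and the use of salted SHA-256 are assumptions, not details taken from [8].

```python
import hashlib

class BloomFilter:
    """Compact set membership test: no false negatives, rare false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, kept simple for readability

    def _positions(self, signature):
        # Derive num_hashes positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{signature}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, signature):
        for pos in self._positions(signature):
            self.bits[pos] = 1

    def __contains__(self, signature):
        return all(self.bits[pos] for pos in self._positions(signature))

def is_anomalous(packet_signature, bloom):
    """A packet whose signature was never stored is flagged as anomalous."""
    return packet_signature not in bloom

# Hypothetical signatures of normal traffic, stored during a learning phase.
normal_signatures = ["read_pressure|plc1->hmi", "write_setpoint|hmi->plc1"]
bloom = BloomFilter()
for signature in normal_signatures:
    bloom.add(signature)
```

A stored signature is always recognized, so normal traffic is never rejected by this stage; an unknown signature is flagged unless all of its positions happen to collide with stored ones, which is the usual false-positive trade-off of a Bloom filter.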
Because some SCADA components have limited memory and computing resources, using a fast and lightweight anomaly detector such as a Bloom filter is of high importance. The LSTM anomaly detector (Figure 2) takes time series as input and learns their important features in order to predict the next data point; it is trained to minimize a softmax-based loss function suited to multi-class classification [7] [21].
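To make the softmax objective concrete, the sketch below computes a softmax over class scores and the corresponding cross-entropy loss for one time step. It assumes the standard softmax cross-entropy formulation; the scores are made up, and the exact loss used in [8] may differ.

```python
import math

def softmax(logits):
    """Turn raw class scores into probabilities that sum to one."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, true_class):
    """Loss for one prediction: negative log-probability of the true class."""
    return -math.log(softmax(logits)[true_class])

# Made-up scores for three possible next signatures.
scores = [2.0, 0.5, -1.0]
loss_when_right = cross_entropy(scores, 0)  # the observed class had the top score
loss_when_wrong = cross_entropy(scores, 2)  # the observed class had the lowest score
```

The loss is small when the network puts most of its probability on the signature that actually follows, and large otherwise, which is what training drives down.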
The evaluation of the combined anomaly detection framework on a gas pipeline SCADA dataset [22] gives an accuracy of 92%, which is higher than other approaches. However, the time required to train the LSTM model, 35 min for 50 epochs, is rather high.

Stacked Auto-Encoder Based Anomaly Detection
Because of the increase in network bandwidth and data, [23] proposes a deep packet inspection approach in order to learn the features needed to detect DoS, Probe, R2L and U2R attacks. The authors used a Deep Neural Network (DNN) whose architecture is a stacked auto-encoder for feature learning, to which a softmax layer is added for classification (Figure 3). The last step of the process is a classification and testing step, in which a test dataset is presented to the fine-tuned network to assess its efficiency. The approach proposed by [23] gives promising results in feature learning and a good detection rate for some classes of attacks. However, it uses the NSL-KDD dataset for the experiment, and such datasets may not reflect the complexity of modern network traffic nor include new complex attacks.

Stacked Auto-Encoder for Anomaly Detection in Smart Grids
Cyber-physical integration exposes smart grids to a large attack surface with potentially severe consequences. Among the countermeasures against such attacks, Intrusion/Anomaly Detection Systems play a key role [24]. Machine learning approaches are used to develop data-driven anomaly detection systems.
However, handcrafted features for machine learning anomaly detectors are costly and ineffective in smart grids [25] [26]. This situation led [27] to use a stacked auto-encoder approach for better feature learning for anomaly detection (Figure 4). The approach has two main phases: the model is first trained off-line, and an online monitoring step then follows. During the first phase, historical data are collected under different system operating conditions for training purposes.
Then, the stacked auto-encoder is used to learn robust, high-level features. Finally, still in the off-line training phase, all the building blocks are stacked and a classifier is appended to them. The resulting architecture is then trained in a supervised manner using back-propagation. After the training process comes the online phase, in which measurements are acquired from SCADA in the transmission system.
These measurements are fed to the deep neural network, and the results of the classification are used for applications such as situational awareness. A testbed simulating a power grid is used to evaluate the proposed approach (Figure 5). The proposed approach, which uses unsupervised feature learning, achieves an accuracy of over 96%, slightly better than the supervised approaches used in the study. Furthermore, it provides adaptive and automatic intrusion detection for smart grid environments.

CNN/LSTM Anomaly Detection in SCADA
The Secure Water Treatment testbed (SWaT) dataset contains up to 36 different cyber-attacks. To evaluate the use of unsupervised feature learning for intrusion detection in such a system, [28] proposes two models using either an LSTM (Figure 6) or a 1D CNN (Figure 7) as feature learner.
They use the mean squared error (MSE) as error function and the Adam optimizer with weight decay. Weight decay, as a regularisation technique, prevents model overfitting, and the Adam optimizer [29] is computationally efficient and requires little memory. The first Deep Neural Network (DNN) architecture is a stacked LSTM with a fully connected layer at the top for classification. With the LSTM model, setting a learning rate (between 0.001 and 0.00001) and a decay value (from 0.9 to 0.99), they were able to test various depths of LSTM layers (from 64 to 2048) and sequence lengths (between 50 and 1000). The 1D CNN architecture adopts the ReLU-MaxPooling scheme. Different kernel sizes were used in the experiments. On top of the convolutional layers, a fully connected layer is added for prediction, and dropout is used to prevent overfitting. The authors tested several variations of this CNN architecture, by adding a batch normalization layer or by replacing the basic CONV-RELU-POOL block with a (CONV-RELU) × N-MAXPOOL architecture. They also replaced the convolutional layers with Inception layers [30], known to provide better performance at lower computational cost. The Inception layers use sparse network connections instead of the full connections used by convolutional layers, hence the reduced computational overhead. The experiments were conducted on the SWaT dataset. The proposed 1D CNN model achieves a detection rate of 89%, which is fairly good but needs to be improved.
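The basic CONV-RELU-POOL block can be illustrated on a short one-dimensional signal. The sketch below is a plain-Python illustration, not the implementation of [28]: the sensor readings and the two-tap kernel are made up, and a real model would learn many kernels per layer.

```python
def conv1d(signal, kernel):
    """Valid (no padding) 1D cross-correlation, as used in CNN layers."""
    width = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(width))
            for i in range(len(signal) - width + 1)]

def relu(values):
    """Keep positive activations and zero out the rest."""
    return [max(0.0, v) for v in values]

def max_pool(values, size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]

# Made-up sensor readings and a hand-picked kernel that responds to increases.
readings = [1.0, 2.0, 4.0, 3.0, 1.0, 0.0, 2.0]
rise_kernel = [-1.0, 1.0]
feature_map = max_pool(relu(conv1d(readings, rise_kernel)))
# feature_map == [2.0, 0.0, 2.0]
```

The convolution responds to rising segments of the signal, the ReLU discards the falling ones, and pooling halves the resolution while keeping the strongest response in each window. The (CONV-RELU) × N-MAXPOOL variant would simply apply `conv1d` and `relu` several times before pooling.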

Conditional Deep Belief Networks for False Data Injection in Smart Grid
As a countermeasure against False Data Injection (FDI) attacks used for electricity theft in smart grids, [31] proposes a detection scheme based on a Conditional Deep Belief Network (CDBN). The proposed CDBN efficiently reveals the high-dimensional temporal behavior features of the unobservable FDI attacks that bypass the SVE mechanism, with an accuracy of over 94% even in the presence of occasional operation faults, meaning that unknown attacks could be detected.

Anomaly Detection Using RBM-Based Deep Autoencoder
Wind turbines usually operate in harsh and variable environments, making various parts subject to failure. This can lead to unavailability and even destruction, causing significant maintenance costs. As a remedy, the authors of [33] present a deep auto-encoder (DAE) approach to detect anomalies early as well as to provide fault analysis of wind turbine parts. The data associated with each wind turbine component are extracted in order to build the DAE model, which is composed of stacked Restricted Boltzmann Machines (RBM) [7]. A DAE based on RBM building blocks is used because of the power of RBMs in capturing the variational potential of the input data [34].

Gas Turbine Combustors Monitoring with Stacked Denoising Auto-Encoder and Extreme Learning Machine
In order to monitor the health of gas turbine combustors and detect abnormal behaviors and incipient faults early, [35] proposes a deep neural network approach.
The proposed model is a Stacked Denoising Auto-encoder (SDAE) [20], to which an Extreme Learning Machine (ELM) [13] is added (Figure 11). The SDAE, used for the unsupervised learning of features, allows more robust feature learning even when the input data are noisy. The features learned by the SDAE are fed to the ELM module for classification.
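The denoising idea can be sketched as follows: the auto-encoder receives a corrupted copy of each sample but is scored against the clean one. Masking noise (randomly zeroing inputs) is assumed here as the corruption, which is a common choice for denoising auto-encoders; the corruption actually used in [35] is not stated in this review, and the sample values are made up.

```python
import random

def corrupt(sample, drop_fraction, rng):
    """Masking noise: each input is set to zero with probability drop_fraction."""
    return [0.0 if rng.random() < drop_fraction else x for x in sample]

def reconstruction_error(original, reconstruction):
    """Mean squared error between the clean sample and the network output."""
    return sum((o - r) ** 2 for o, r in zip(original, reconstruction)) / len(original)

rng = random.Random(42)
clean = [0.8, 0.1, 0.5, 0.9, 0.3]  # made-up normalized sensor values
noisy = corrupt(clean, drop_fraction=0.3, rng=rng)

# A denoising auto-encoder would take `noisy` as input, and its output would be
# compared with `clean`, so it has to rely on correlations between the inputs.
baseline_error = reconstruction_error(clean, noisy)
```

Here `baseline_error` is the error of a network that merely copies its input; training aims to do better than that by filling in the masked values.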
Unlike in other feedforward neural networks, the hidden nodes in an ELM do not need to be trained.
ELM training consists of determining the connections between hidden and output nodes, which is fast [36]. The only ELM design parameter is the number of hidden neurons. To test the proposed approach, the authors used seven months of data from one turbine, containing both normal and abnormal data. In order to demonstrate the effectiveness of unsupervised feature learning for combustor anomaly detection, the authors compare the classification performance obtained with the learned features and with handcrafted features (Figure 12). The results show that the deep learned features give significantly better classification performance than the handcrafted features (detection rates of 99% and 96%, respectively).
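A minimal ELM can be sketched in plain Python to show why training is fast: the hidden weights are drawn at random and frozen, and only the hidden-to-output weights are obtained by solving a linear least-squares problem. The tanh activation, the small ridge term, the toy data and the names `TinyELM` and `solve` are assumptions made for this sketch, not details of [35] or [36].

```python
import math
import random

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

class TinyELM:
    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        # Hidden-layer weights and biases are drawn once and never trained.
        self.w = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
        self.b = [rng.uniform(-1, 1) for _ in range(n_hidden)]
        self.beta = [0.0] * n_hidden

    def hidden(self, x):
        return [math.tanh(sum(wi * xi for wi, xi in zip(row, x)) + bias)
                for row, bias in zip(self.w, self.b)]

    def fit(self, samples, targets, ridge=1e-3):
        h = [self.hidden(x) for x in samples]
        n = len(self.beta)
        # Regularized normal equations: (H^T H + ridge * I) beta = H^T y
        gram = [[sum(row[i] * row[j] for row in h) + (ridge if i == j else 0.0)
                 for j in range(n)] for i in range(n)]
        moment = [sum(row[i] * t for row, t in zip(h, targets)) for i in range(n)]
        self.beta = solve(gram, moment)

    def predict(self, x):
        return sum(b * hv for b, hv in zip(self.beta, self.hidden(x)))

# Toy, made-up learned features with labels 0 (normal) and 1 (abnormal).
samples = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
labels = [0.0, 0.0, 1.0, 1.0]
elm = TinyELM(n_inputs=2, n_hidden=8)
weights_before = [row[:] for row in elm.w]
elm.fit(samples, labels)
sse = sum((elm.predict(x) - t) ** 2 for x, t in zip(samples, labels))
```

Fitting involves a single linear solve and no iterative back-propagation, and `elm.w` is identical before and after `fit`. In this sketch the number of hidden neurons is the main design choice, with the ridge term added only for numerical stability.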

Summary of Studied Approaches
Table 1 shows a summary of the different approaches. The deep architectures are formed with stacked autoencoders, convolutional neural networks, long short-term memories or deep belief networks, or by combining these architectures. Those deep architectures are used to learn the features of SCADA networks, and a softmax layer, fully connected neural network, multilayer perceptron or extreme learning machine is used for the classification. For each approach we highlight the feature learning architecture, the classifier used to discriminate the data, the types of attacks detected and the results in terms of accuracy.
The stacked auto-encoder of [23] has two hidden layers, one with 20 nodes and the second with 10 nodes. The dimension of the learnt features is therefore 10, compared with the 41 original features of the NSL-KDD dataset. The overall process encompasses four steps, starting with a feature learning step with the stacked auto-encoder. It is followed by a first fine-tuning step consisting of a supervised training of the softmax layer, whose input is the compressed representation of the data. The following step is a second fine-tuning step, in which back-propagation training is applied to all the network layers. This second fine-tuning step aims at refining the features of the intermediate layers to make them more relevant to the intrusion detection task, by adjusting the network weights to minimize the loss function.
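The layer dimensions described above can be checked with a small forward-pass sketch. It only shows how a 41-feature record is compressed to 10 features and then classified by a softmax layer. The weights are random and untrained, the sigmoid activation is an assumption, and the choice of five output classes (normal plus the four NSL-KDD attack categories) is also an assumption.

```python
import math
import random

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W @ inputs + b)."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def random_layer(rng, n_in, n_out):
    """Untrained placeholder weights; training would replace these values."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
LAYER_SIZES = [41, 20, 10]  # input features, first and second hidden layers
NUM_CLASSES = 5             # assumption: normal, DoS, Probe, R2L, U2R

encoder = [random_layer(rng, a, b) for a, b in zip(LAYER_SIZES, LAYER_SIZES[1:])]
classifier = random_layer(rng, LAYER_SIZES[-1], NUM_CLASSES)

def encode(record):
    """Stacked encoder: 41 -> 20 -> 10 compressed features."""
    hidden = record
    for weights, biases in encoder:
        hidden = dense(hidden, weights, biases, sigmoid)
    return hidden

def classify(record):
    """Softmax layer on top of the compressed representation."""
    weights, biases = classifier
    return softmax(dense(encode(record), weights, biases, lambda z: z))

record = [rng.random() for _ in range(41)]  # stand-in for one preprocessed record
compressed = encode(record)
probabilities = classify(record)
```

In the process described above, the encoder weights would first be learned without labels, the softmax layer would then be trained on `compressed`, and back-propagation would finally adjust all layers together.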
When an FDI attack bypasses the SVE engine, the Deep Learning-Based Identification (DLBI) scheme tries to detect the tampered data. The proposed Deep Neural Network is a Conditional Deep Belief Network (CDBN) that integrates the standard Deep Belief Network (DBN) with a Conditional Gaussian-Bernoulli RBM (CGBRBM) (Figure 9). The CGBRBM uses real-valued data and can model the impact of previously observed data on the learning of current behavior features. The use of a CDBN allows the analysis of temporal attack patterns [32]. In addition, using a CGBRBM for the first hidden layer and regular RBMs for the other hidden layers reduces the training and execution time of CDBN architectures. An unsupervised approach is used to train the proposed CDBN, and a fully connected layer is added on top of the model, with a binary output node that has a sigmoid activation function. The whole deep neural network structure is then fine-tuned by supervised back-propagation training with labeled data.

Conclusion
The unsupervised feature learning capability of deep learning, which makes it possible to learn important features from the large amount of available SCADA network data and so deliver a high anomaly detection rate, contributes to the rising interest in deep learning approaches. Multiple architectures such as CNN, LSTM, DBN, SAE and SDAE, or a combination of them, are used to learn the features of SCADA data, and classifiers such as a softmax layer, fully connected neural network, ELM, DAE or MLP are used for the classification. In most situations, deep learning approaches outperform standard approaches, but their Achilles' heel remains the long time required for their training. An interesting research direction taken by the scientific community to overcome this shortcoming is the use of distributed deep learning approaches for anomaly detection in Industrial Control Systems. In future work, we will propose a distributed deep learning approach for anomaly detection in Industrial Control Systems.

Table 1. Summary of deep learning unsupervised feature learning in ICS.