Application of Weighted Cross-Entropy Loss Function in Intrusion Detection

When a deep learning model is trained for network intrusion detection, convergence problems of the traditional loss function can cause the model to overfit, which reduces accuracy on the test set. To address this, we first adopt a network architecture that combines the GELU activation function with a deep neural network. Second, we improve the cross-entropy loss function to a weighted cross-entropy loss function and apply it to intrusion detection to improve detection accuracy. To compare experimental results, we select the KDDCup99 data set, which is commonly used in intrusion detection, as the experimental data, and use accuracy, precision, recall and F1-score as evaluation parameters. The experimental results show that the model using the weighted cross-entropy loss function combined with the GELU activation function under the deep neural network architecture improves the evaluation parameters by about 2% compared with the ordinary cross-entropy loss function model. The experiments indicate that the weighted cross-entropy loss function can enhance the model's ability to discriminate samples.


Introduction
An intrusion detection system can be regarded as a form of active defense for computer networks, created to ensure the security of information communication. Since the 2020 epidemic, most people's lives and work have become closely tied to the Internet, and the amount of data has increased dramatically. At the same time, data security problems such as data abuse, attacks and theft have also surged. These security issues pose many challenges and draw more attention to intrusion detection systems.
Machine learning (ML) was applied to intrusion detection early on because it is a fairly intelligent technology that automatically obtains knowledge from massive datasets [1] [2] [3]. An ML-based intrusion detection system (IDS) can detect intrusions better if enough training data is available for learning. ML is also largely independent of domain knowledge, which makes it easier to build models.
Nowadays, machine learning methods are widely used in various types of network intrusion detection, with many analysis methods such as KNN, SVM, decision trees and Bayesian algorithms. With the rapid development of network equipment and related technologies, massive amounts of network data are generated, and traditional machine learning algorithms find it increasingly difficult to classify massive intrusion data in real networks. Deep learning is a new research direction in the field of machine learning [4]. Its network model is a multi-layer perceptron structure containing multiple hidden layers. By combining low-level features into more abstract high-level representations of attribute categories or features, it can discover distributed feature representations of the data.

Related Works
At present, applying deep learning technology to the design of intrusion detection systems can effectively improve the accuracy and efficiency of intrusion detection. Andresini et al. [5] proposed a deep learning method that uses a convolutional neural network (CNN) to give a computer network an effective means of analyzing its traffic for signs of malicious activity. The basic idea is to represent the network flow as a 2D image and use this image representation to train a 2D CNN architecture. Another approach is a deep metric learning method that combines an autoencoder with a triplet network. Khan et al. [10] used a Convolutional Recurrent Neural Network (CRNN) to create a DL-based hybrid intrusion detection framework, HCRNNIDS, that can predict and classify malicious attacks in the network. In HCRNNIDS, the CNN performs convolution to capture local features, and the Recurrent Neural Network (RNN) captures temporal features to improve the performance and prediction of the intrusion detection system. Sajith et al. [11] used computational intelligence algorithms such as the genetic algorithm (GA), genetic programming (GP) and swarm intelligence algorithms to optimize the discovery of interesting rules from dense databases.
Among these examples of integrating deep learning into intrusion detection systems, the choice of loss function often leads to different convergence speeds, which causes the model to overfit during training and reduces accuracy. The DNN + GELU algorithm uses the ReLU and GELU activation functions together in each layer of its neural network to extract different data features, improving the generalization ability and accuracy of the algorithm. The weighted cross-entropy loss function is used to address the drop in test-set accuracy that occurs when the deep learning model overfits because the loss converges at unbalanced speeds.

Deep Neural Network Model
A Deep Neural Network (DNN) can be understood as a neural network with many hidden layers; it is also known as a Deep Feedforward Network (DFN) or Multi-Layer Perceptron (MLP). By position, the layers of a DNN fall into three kinds: the input layer, the hidden layers and the output layer. In general, the first layer is the input layer, the last layer is the output layer, and the layers in between are hidden layers [12]. Besides being layered, a DNN also has two directions of propagation, forward and backward. In the forward pass, the preprocessed data passes from the input layer through n hidden layers to the output layer, and the activated output is compared with the expected result.
The comparison yields an error, which is then propagated back from the output layer through the hidden layers toward the input layer, and the weights are adjusted by gradient descent; this completes one round of neural network training [13]. The structure diagram of the deep neural network is shown in Figure 1.

Fully Connected Layer
The fully connected layer uses a cooperative activation scheme: the output of each layer is divided into two equal parts, and the ReLU activation function and the GELU activation function are applied to the two parts respectively to perform the non-linear transformation. In a neuron, after the inputs are weighted and summed, a function is applied to the result. This function is the activation function. The activation function is a very important part of the neural network. It performs a non-linear transformation on the information received by the neuron and outputs the transformed information to the next layer of neurons. If no activation function is used, the output of each layer is a linear function of the previous layer, so no matter how many layers the neural network has, its output is only a linear combination of the input [14]. With an activation function, we introduce non-linear factors into the neuron. The neural network can then approximate arbitrary non-linear functions, so it can be applied to most non-linear models.
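The claim that stacked layers without an activation collapse into a single linear map can be checked numerically. The following is a minimal sketch in plain Python; the weight matrices `W1`, `W2` and the input `x` are arbitrary illustrative values, not taken from the paper.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def matmul(left, right):
    """Multiply two matrices given as lists of rows."""
    columns = list(zip(*right))
    return [[sum(a * b for a, b in zip(row, col)) for col in columns]
            for row in left]


def relu(vector):
    """Element-wise ReLU."""
    return [max(0.0, value) for value in vector]


# Illustrative weights and input (assumptions for this sketch only).
W1 = [[1.0, 2.0], [0.0, -1.0]]
W2 = [[0.5, 1.0], [2.0, 0.0]]
x = [3.0, -4.0]

# Two linear layers applied one after the other ...
two_layers = matvec(W2, matvec(W1, x))
# ... give the same result as one layer with the combined weights W2 * W1.
one_layer = matvec(matmul(W2, W1), x)

# With a non-linearity in between, the collapse no longer happens.
with_relu = matvec(W2, relu(matvec(W1, x)))

print(two_layers, one_layer, with_relu)
```

The first two results coincide, while the third differs, which is why a non-linear activation is needed for depth to add expressive power.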
The choice of activation function is also very important. For the ELU function, the left side has soft saturation and the right side has no saturation. The average value of the ELU output is close to 0, which makes convergence faster. It reduces the gap between the normal gradient and the unit natural gradient, thereby speeding up learning, and it is also more robust for negative inputs [15]. What we use here, however, is the GELU activation function, the Gaussian error linear unit. The GELU activation function adds the idea of stochastic regularization to the activation, which is equivalent to a probabilistic description of the neuron input. The non-linearity of GELU is the expected value of a stochastic regularization transformation. GELU is therefore also a high-performing activation function. The output curves of the three activation functions are shown in Figure 2.
The GELU function is used here as the activation function of the output layer; its mathematical form is given in Formula (1):

$$\mathrm{GELU}(x) = x\,P(X \le x) = x\,\Phi(x) \tag{1}$$

where $\Phi(x)$ is the cumulative distribution function of the standard Gaussian (normal) distribution at $x$, as in Formula (2):

$$\Phi(x) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{x}{\sqrt{2}}\right)\right] \tag{2}$$

The reason for choosing the GELU activation function is that, according to the central limit theorem, the overall distribution of many independent random variables approximately obeys the normal distribution. Many real situations can therefore be modeled by an approximately normal distribution, so it is reasonable to use the normal distribution function in the activation function. Furthermore, among all possible distributions with the same variance, the normal distribution has the largest uncertainty, that is, the largest entropy.
In the fully connected layer, in addition to the activation function after each layer, a Dropout layer is also added to randomly drop a certain proportion of neurons to prevent overfitting.
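As a rough illustration, the exact GELU (the input multiplied by the standard normal CDF) and the cooperative ReLU/GELU split described earlier can be sketched in plain Python. The function `cooperative_activation` is a hypothetical name, and the choice of which half receives ReLU and which receives GELU is an assumption; the paper only states that the layer output is divided into two equal parts.

```python
import math


def relu(x):
    """Rectified linear unit for a single value."""
    return max(0.0, x)


def gelu(x):
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def cooperative_activation(values):
    """Apply ReLU to the first half and GELU to the second half.

    Assumption: the first half goes to ReLU, the second half to GELU.
    """
    half = len(values) // 2
    return ([relu(v) for v in values[:half]]
            + [gelu(v) for v in values[half:]])


layer_output = [-1.0, 2.0, -1.0, 2.0]
activated = cooperative_activation(layer_output)
print(activated)
```

Unlike ReLU, GELU lets a small negative value through for negative inputs, so the two halves of the layer respond differently to the same pre-activation.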

Adam Adaptive Moment Estimation Optimization
To understand the Adam algorithm, we first need the adaptive gradient algorithm (AdaGrad) and the root mean square propagation algorithm (RMSProp). The basic idea of AdaGrad is to adapt the learning rate for each parameter individually: the step of each parameter is scaled by a coefficient determined by the accumulated sum of squares of its past gradients. In other words, parameters that have already been updated a lot are updated more slowly, and parameters that have been updated little receive a larger learning rate. RMSProp is an improvement of AdaGrad that replaces AdaGrad's sum of historical squared gradients with an average of them. This is not a mean in the strict sense but an exponentially decaying moving average. This average replaces AdaGrad's accumulated gradient and is combined with the current gradient to perform the update.
Assume the loss function is as in Formula (3). Our goal is to learn the values of x and y that make the loss as small as possible. A plot of the loss function is shown in Figure 3. Adam's adaptive moment estimation algorithm adds a moving average of the gradient and bias correction on top of RMSProp. In RMSProp, the squared gradient is smoothed by a smoothing constant, but the gradient itself is not smoothed. In Adam, both the gradient and the squared gradient are smoothed. The smoothed moving averages are denoted by $m_t$ and $v_t$ respectively, and Adam has two decay rates, $\beta_1$ and $\beta_2$. Assuming that at time $t$ the first derivative of the objective function with respect to the parameters is $g_t$, the gradient is calculated as shown in Formula (4). Next, the respective moving averages are calculated as in Formula (5). The final update is given in Formula (6). Here $\eta$ is the learning rate, $\beta_1$ is the exponential decay rate of the first-moment estimate, $\beta_2$ is the exponential decay rate of the second-moment estimate, and $\varepsilon = 10^{-8}$. The $\varepsilon$ in the denominator prevents division by zero in the implementation. In fact, for the learning rate, it is generally
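The update described above can be sketched with the commonly published form of Adam: moving averages of the gradient and of the squared gradient, bias correction, then a scaled step. This is a minimal sketch, not the paper's implementation. The two-variable loss `x**2 + 10 * y**2` is a hypothetical stand-in, since the paper's own Formula (3) is not reproduced here, and the hyperparameter defaults are the usual textbook values rather than the paper's settings.

```python
import math


def loss(x, y):
    # Hypothetical stand-in loss; NOT the paper's Formula (3).
    return x ** 2 + 10.0 * y ** 2


def grad(x, y):
    # Analytic gradient of the stand-in loss.
    return [2.0 * x, 20.0 * y]


def adam_minimize(params, steps=2000, lr=0.01,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize the stand-in loss with the standard Adam update."""
    params = list(params)
    m = [0.0] * len(params)  # moving average of the gradient
    v = [0.0] * len(params)  # moving average of the squared gradient
    for t in range(1, steps + 1):
        g = grad(*params)
        for i in range(len(params)):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] ** 2
            # Bias correction compensates for the zero initialization.
            m_hat = m[i] / (1.0 - beta1 ** t)
            v_hat = v[i] / (1.0 - beta2 ** t)
            # eps keeps the denominator away from zero.
            params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params


x_opt, y_opt = adam_minimize([3.0, -2.0])
print(x_opt, y_opt, loss(x_opt, y_opt))
```

Because each step is divided by the root of the smoothed squared gradient, both coordinates move at a similar pace even though the stand-in loss is ten times steeper in `y` than in `x`.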

Weighted Cross-Entropy Loss Function Evaluation Algorithm
Cross-entropy is an important concept in information theory, mainly used to measure the difference between two probability distributions. To understand cross-entropy, we must first know what the amount of information is. Take the statement "there is sea water in the sea": its amount of information is 0, because it is certain to be true. Now take "the COVID-19 epidemic will be completely over next year": intuitively, this statement carries a lot of information, because whether the epidemic ends next year is highly uncertain, and the statement removes that uncertainty. By definition, then, this statement is very informative (this is only an analogy). In summary, the amount of information of an event decreases as its probability increases: the greater the probability, the smaller the amount of information, and the smaller the probability, the greater the amount of information.
Suppose the probability of a certain event $x$ is $P(x)$; its information content is given by Formula (7):

$$I(x) = -\log P(x) \tag{7}$$

where $I(x)$ is the amount of information and $\log$ is the natural logarithm with base $e$. The expectation of the amount of information is called information entropy, or simply entropy. An expectation is the sum, over all possible outcomes, of the value of each outcome multiplied by its probability. Therefore, information entropy is expressed as in Formula (8):

$$H(X) = -\sum_{i=1}^{n} P(x_i)\log P(x_i) \tag{8}$$

Here $X$ is a discrete random variable and $n$ is the number of possible outcomes. For the same random variable $X$, if there are two separate probability distributions $P(x)$ and $Q(x)$, the difference between them can be measured by the KL divergence, as in Formula (9):

$$D_{KL}(P\,\|\,Q) = \sum_{i=1}^{n} P(x_i)\log\frac{P(x_i)}{Q(x_i)} \tag{9}$$

Expanding the KL divergence and simplifying gives Formula (10):

$$D_{KL}(P\,\|\,Q) = \sum_{i=1}^{n} P(x_i)\log P(x_i) - \sum_{i=1}^{n} P(x_i)\log Q(x_i) = -H(X) + \left[-\sum_{i=1}^{n} P(x_i)\log Q(x_i)\right] \tag{10}$$

The former term is the negative entropy of $P$, and the latter term is the cross-entropy. Here $P$ represents the true distribution of the samples and $Q$ represents the distribution predicted by the model.
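The relationship between entropy, cross-entropy and KL divergence can be checked numerically. The sketch below uses plain Python; the distributions `p` (a one-hot true label) and `q` (a model prediction) are illustrative values only.

```python
import math


def entropy(p):
    """Entropy of a discrete distribution (natural logarithm)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Cross-entropy between true distribution p and prediction q."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """KL divergence from q to p, summed over outcomes with p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [1.0, 0.0, 0.0]  # true label as a one-hot distribution
q = [0.7, 0.2, 0.1]  # illustrative predicted distribution

# KL divergence equals cross-entropy minus entropy.
gap = kl_divergence(p, q) - (cross_entropy(p, q) - entropy(p))
print(entropy(p), cross_entropy(p, q), kl_divergence(p, q), gap)
```

For a one-hot label the entropy term is zero, so minimizing the cross-entropy is the same as minimizing the KL divergence, which is why cross-entropy is used directly as the classification loss.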
To address class imbalance in the data set, we attribute it to an imbalance in learning difficulty, which leads to different convergence speeds across classes. We therefore add weights to the loss function to balance the uneven samples, which gives Formula (12), where $\omega_i$ represents the weight of the loss function when the actual label of the current data is class $i$.
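One common form of a class-weighted cross-entropy multiplies each class term by its weight before summing. The sketch below assumes that form; it is not necessarily identical to the paper's Formula (12). The helper `inverse_frequency_weights` is a hypothetical way to choose the weights, since the paper does not state how its weights were set.

```python
import math


def weighted_cross_entropy(y_true, y_pred, class_weights, eps=1e-12):
    """Mean weighted cross-entropy over a batch.

    y_true: one-hot label vectors; y_pred: predicted probabilities;
    class_weights: one weight per class (assumed form of the loss).
    """
    total = 0.0
    for target, probs in zip(y_true, y_pred):
        for weight, t, p in zip(class_weights, target, probs):
            # Clip the probability so log(0) cannot occur.
            total -= weight * t * math.log(max(p, eps))
    return total / len(y_true)


def inverse_frequency_weights(labels, num_classes):
    """Hypothetical weighting: rarer classes get larger weights."""
    counts = [0] * num_classes
    for label in labels:
        counts[label] += 1
    n = len(labels)
    return [n / (num_classes * c) if c > 0 else 0.0 for c in counts]


y_true = [[1, 0], [0, 1]]
y_pred = [[0.9, 0.1], [0.4, 0.6]]

plain = weighted_cross_entropy(y_true, y_pred, [1.0, 1.0])
weighted = weighted_cross_entropy(y_true, y_pred, [1.0, 2.0])
weights = inverse_frequency_weights([0, 0, 0, 1], num_classes=2)
print(plain, weighted, weights)
```

With all weights equal to 1 the function reduces to ordinary cross-entropy; raising the weight of a minority class makes its misclassifications contribute more to the loss.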

Network Structure
The processed data is fed into the deep neural network, a fully connected neural network extracts the features of the data, and within each layer the ReLU and GELU activation functions are then applied to make the output of that layer non-linear. The structure is shown in Table 1.

Intrusion Detection System Design
Our detection model consists of a data acquisition and processing module, an intrusion detection module, a detection classification module, and a visual analysis module.
Data collection and processing module: obtains the network data set, performs preprocessing operations such as feature extraction, numerical conversion and data normalization, then checks the numerical data distribution and divides the data into a test set and a training set, which are used for model testing and training respectively.
Intrusion detection module: determines the input and output nodes of the deep neural network according to the dimensions of the preprocessed data, then determines the entire network structure and training parameters according to the hidden layers and other parameters, trains the model on the training set, and saves the model for testing after training is complete.
Detection and classification module: runs the model on the test set and classifies the test results.
Visual analysis module: visually displays the distribution of the numerical data and the classification results, and then analyzes them. The model structure diagram is shown in Figure 4.
Journal of Computer and Communications

Data Set Selection
Here we select the KDDCup99 data set, which is common in intrusion detection, for the convenience of comparison experiments. The data set has 42 dimensions, of which 41 are attribute features and 1 is the label. The KDDCup99 data set has been useful for many IDS evaluations and is widely used. It is composed of about 5 million network connection records, each containing 41 features. The simulated attacks can be divided into 4 categories:
Denial of service attack (DOS): The intruder exhausts the resources of the target by exploiting defects in network protocol implementations or by direct brute force. The purpose is to make the target computer or network unable to provide normal service or resource access, so that the target system stops responding or even crashes, causing service interruption.
Port monitoring or scanning attack (Probe): The network intruder collects information about the types of computers on the network, and then gains root access through the firewall of the target host.
Remote to Local Attack (R2L): The network intruder, who does not have a user account on the target host, sends data packets to it and tries to exploit a vulnerability to gain local access, posing as an existing user of the target host.
User to Root Attack (U2R): A commonly used method of network intrusion in which the intruder takes advantage of existing user-level access rights and exploits vulnerabilities to gain root control.
Because of the huge amount of data and memory limits, we use 10% of the full data here. We then use NumPy in Python to compute statistics and obtain the data distribution shown in Table 2.
The KDD data set has a total of 41 attribute features and 1 label feature. The specific information is shown in Table 3.

1) Numerical Processing
For symbolic features, we use one-hot encoding: there are as many bits as there are states, exactly one bit is 1, and all the others are 0. For example, the code for Normal is 10000. Character data is converted to numeric data in this way. The conversion uses a function mapping in which each character value corresponds to a uniquely determined binary code; the mapping takes the original character string of a feature in the network stream data to data in binary encoding format.
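A minimal sketch of this one-hot mapping in plain Python follows. The sample values are typical entries of the KDDCup99 `protocol_type` feature; the function name `one_hot_encode` and the alphabetical ordering of categories are choices made for this illustration.

```python
def one_hot_encode(values):
    """Map each distinct symbol to a unique one-hot bit vector."""
    categories = sorted(set(values))  # fixed, reproducible order
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        bits = [0] * len(categories)
        bits[index[value]] = 1  # exactly one bit is set per value
        encoded.append(bits)
    return categories, encoded


protocols = ["tcp", "udp", "icmp", "tcp"]
categories, encoded = one_hot_encode(protocols)
print(categories)
print(encoded)
```

Each distinct symbol gets its own bit position, so the encoding introduces no artificial ordering between protocol names.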

2) Standardization
Ordinary standardization first calculates the mean $\bar{x}_k$ and the average absolute error (mean absolute deviation) $S_k$ of each attribute, as in Formula (13):

$$\bar{x}_k = \frac{1}{n}\sum_{i=1}^{n} x_{ik}, \qquad S_k = \frac{1}{n}\sum_{i=1}^{n}\left|x_{ik} - \bar{x}_k\right| \tag{13}$$

where $x_{ik}$ represents the k-th attribute of the i-th record, $S_k$ represents the average absolute error of the k-th attribute, $\bar{x}_k$ represents the mean value of the k-th attribute, and $n$ is the number of records. Each data record is then standardized as in Formula (14):

$$Z_{ik} = \frac{x_{ik} - \bar{x}_k}{S_k} \tag{14}$$

where $Z_{ik}$ represents the k-th attribute value of the i-th record after standardization. The Z-Score used here amounts to a further calculation after ordinary standardization: it divides the difference between a score and the mean by the standard deviation. After converting raw scores in normally distributed data to Z-Scores, we can look up the area between the mean and the Z-Score in a table of areas under the normal curve, and thereby obtain the percentile rank of the original score in the data set. Z-Score is a way to see the relative position of a score within the distribution. The specific formula is Formula (15):

$$z = \frac{x - \mu}{\sigma} \tag{15}$$

where $\mu$ is the mean value of all data, $\sigma$ is the standard deviation, $x$ is the original data, and $z$ is the distance between the original score and the population mean, measured in units of standard deviation.

3) Normalization
Each standardized value is then normalized to the interval [0, 1], as in Formula (16):

$$x^{*} = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{16}$$

where $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of each data item, $x$ is the value of the original data, and $x^{*}$ is the normalized data.
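The two scaling steps, Z-Score standardization followed by min-max normalization to [0, 1], can be sketched per feature column in plain Python. The sample column `durations` is illustrative, and the use of the population standard deviation (dividing by n) is an assumption, since the paper does not specify which variant it uses.

```python
def zscore(column):
    """Z-Score: subtract the mean, divide by the standard deviation."""
    n = len(column)
    mean = sum(column) / n
    # Population standard deviation (assumption: divide by n).
    std = (sum((x - mean) ** 2 for x in column) / n) ** 0.5
    if std == 0:
        return [0.0] * n  # constant column: nothing to scale
    return [(x - mean) / std for x in column]


def min_max(column):
    """Rescale values linearly to the interval [0, 1]."""
    low, high = min(column), max(column)
    if high == low:
        return [0.0] * len(column)
    return [(x - low) / (high - low) for x in column]


durations = [0.0, 2.0, 4.0, 6.0, 8.0]  # illustrative feature column
standardized = zscore(durations)
scaled = min_max(standardized)
print(standardized)
print(scaled)
```

Constant columns are mapped to zeros to avoid dividing by zero; the 10% subset of KDDCup99 is known to contain such columns (for example `num_outbound_cmds`).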

4) Divide the Data Set
After the data is preprocessed, 20% is randomly selected as the test set, and the remaining 80% is used as the training set. The data after the split is shown in Table 4.
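A random 80/20 split of this kind can be sketched with the standard library alone. The function name `train_test_split` and the fixed `seed` are choices made for this illustration so that the split is reproducible; the paper does not state a seed.

```python
import random


def train_test_split(records, test_ratio=0.2, seed=42):
    """Randomly assign a share of the records to the test set."""
    indices = list(range(len(records)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = int(round(len(records) * test_ratio))
    test_indices = set(indices[:n_test])
    train = [r for i, r in enumerate(records) if i not in test_indices]
    test = [r for i, r in enumerate(records) if i in test_indices]
    return train, test


records = list(range(100))  # stand-in for preprocessed records
train, test = train_test_split(records)
print(len(train), len(test))
```

Every record lands in exactly one of the two sets, so the test set stays unseen during training.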

Lab Environment
In order to build the model and train the parameters smoothly and effectively in the intrusion detection algorithm experiment, we use the Keras deep learning framework of TensorFlow. The specific hardware environment and software environment of the experiment are shown in Table 5.

1) Attack Type Exploration
First, we break down the statistics of the 4 common attack types in the data set, as shown in Table 6.
We then add these attack types and the "Normal" type to a dictionary to match the predicted attack column "target". Using the visual analysis tool Matplotlib, we map the class names according to the column where the predicted attack is located, use the value_counts() function to check the unique values in the target column, and visually display the number of occurrences of each label, as shown in Figure 5.
Here we find that there is an extra "." at the end of each attack name, so we use this format when matching the actual attack type. We map the actual attack type to another column named "target_type" and visually display the actual attack type statistics, as shown in Figure 6.
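The label mapping can be sketched as follows. This is an illustration rather than the authors' code: the dictionary `ATTACK_CATEGORY` lists only a few of the KDDCup99 attack names, and the sketch strips the trailing "." before the lookup, whereas the text above keeps the "." in the names being matched. The counting step plays the role that value_counts() plays in the text.

```python
from collections import Counter

# Partial mapping from raw attack names to the four categories plus
# normal. Only a few of the KDDCup99 attack names are listed here.
ATTACK_CATEGORY = {
    "normal": "normal",
    "smurf": "dos", "neptune": "dos", "back": "dos",
    "ipsweep": "probe", "portsweep": "probe", "satan": "probe",
    "guess_passwd": "r2l", "warezclient": "r2l",
    "buffer_overflow": "u2r", "rootkit": "u2r",
}


def to_category(raw_label):
    """Strip the trailing '.' and look up the attack category."""
    name = raw_label.rstrip(".")
    return ATTACK_CATEGORY.get(name, "unknown")


raw_labels = ["normal.", "smurf.", "smurf.", "ipsweep.",
              "guess_passwd.", "rootkit."]
target_type = [to_category(label) for label in raw_labels]
counts = Counter(target_type)
print(target_type)
print(counts)
```

Names missing from the dictionary fall back to "unknown", which makes gaps in the mapping visible in the counts instead of raising an error.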

2) Classification Feature Exploration
Here we use the info() function in Python to check whether there are missing values in any column of the data set. We found no missing data. We then obtain the names of all categorical columns: "target_type", "service", "flag", "target" and "protocol_type", noting that the "target" column here is our prediction and "target_type" is the packet data. We then determine whether there is any other binary data. We found that there are also "land", "logged_in", "root_shell", "num_outbound_cmds", "is_host_login" and "is_guest_login". The meanings they represent are as follows.

3) Numeric Feature Exploration
Identify the remaining numeric features by excluding the categorical columns.
Here we use the standard deviation to measure their degree of dispersion.
Standard deviation is a measure of the degree of dispersion of data distribution,

Experimental Data Comparison
This article uses Accuracy, Precision, Recall and F1-Score to evaluate the model. The four parameters are defined in Formulas (17):

$$\begin{aligned}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN}, &
\text{Precision} &= \frac{TP}{TP + FP}, \\
\text{Recall} &= \frac{TP}{TP + FN}, &
\text{F1-Score} &= \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
\end{aligned} \tag{17}$$

where $TP$ (True Positive) is the number of attack samples judged as attacks, $TN$ (True Negative) is the number of normal samples judged as normal, $FP$ (False Positive) is the number of normal samples judged as attacks, and $FN$ (False Negative) is the number of attack samples judged as normal.
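The four evaluation parameters can be computed from the confusion counts as in the sketch below. It treats the task as binary (attack versus normal) for simplicity; the label strings and sample predictions are illustrative, and returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def confusion_counts(y_true, y_pred, positive="attack"):
    """Count TP, TN, FP and FN with 'attack' as the positive class."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth != positive and pred != positive:
            tn += 1
        elif truth != positive and pred == positive:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def evaluate(y_true, y_pred):
    """Return accuracy, precision, recall and F1-score as a dict."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


y_true = ["attack", "attack", "attack", "normal",
          "normal", "normal", "normal", "attack"]
y_pred = ["attack", "attack", "normal", "normal",
          "normal", "attack", "normal", "attack"]
metrics = evaluate(y_true, y_pred)
print(confusion_counts(y_true, y_pred), metrics)
```

For the five-class KDDCup99 labels, the same counts would be computed per class and then averaged; how the paper averages them is not stated.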
After the number of epochs was determined, all parameter settings of the loss function experiment are as shown in Table 7.
Below we compare and analyze the experimental data of the ordinary cross-entropy loss function model and the weighted cross-entropy loss function model. The data are shown in Table 8 and Figure 9.
From the comparison table and chart above, the weighted cross-entropy loss function outperforms the ordinary cross-entropy loss function on accuracy and on the other evaluation values.
Next, we look at the training experiment with the weighted cross-entropy loss function. The experimental parameters are given above; the experimental data are shown in Table 9.
The early stopping method is used here, because when training deep neural networks we usually hope to obtain the best generalization performance.
Figure 9. Comparison of experimental data of two models.
The results of the visual analysis are shown in Figure 10 and Figure 11.
It can be seen from the above that after the model is trained, the accuracy curve is in a relatively balanced state, which shows that the fluctuation range of the model is not large and relatively stable during the training process.
We then experimentally compare this model with other models, as shown in Table 10 and Figure 12. The data show that the model achieves some improvement over the other models. This may partly be due to overfitting caused by excessively strong training related to gradient optimization, but the comparison indicates that the accuracy has indeed improved.

Conclusion
Using the DNN + GELU model architecture and improving the cross-entropy function to a weighted cross-entropy loss function, we construct a new intrusion detection system and apply the weighted loss function to improve the accuracy of the model. To demonstrate the role of the weighted loss function, this paper compares the model with other models on KDDCup99, a commonly used intrusion detection data set, which makes the comparison more convincing. The data analysis indicates that the weighted loss function can improve the accuracy of model recognition. However, the batch_size and number of epochs used in training here are fixed. How the training accuracy changes when these variables change remains to be tested, and the choice of optimizer may also affect the training accuracy of the model. These are the problems that this work will address later.