Optimal Flame Detection of Fires in Videos Based on Deep Learning and the Use of Various Optimizers

Deep learning has recently attracted considerable attention for building fast, automatic and accurate systems for image identification and classification. In this work, the focus is on transfer learning and the evaluation of the state-of-the-art VGG16 and VGG19 deep convolutional neural networks for fire image classification. Five optimization approaches based on gradient descent (Adagrad, Adam, AdaMax, Nadam and RMSProp) were studied for parameter updating. By varying the learning rate, the training/validation split of the image base and the number of epochs, the advantages and disadvantages of each approach are compared in order to minimize the cost function. The results of the comparison are presented in tables. In our experiments, the Adam optimizer with the VGG16 architecture trained for 300 and 500 epochs steadily improves its accuracy as the number of epochs increases, without degrading performance. The optimizers were evaluated on the basis of the AUC of the ROC curve. The best configuration achieves a test accuracy of 96%, which puts it ahead of the other architectures.


Introduction
The growing use of video surveillance systems to monitor industry, public spaces and the environment in general is necessary for the protection of goods and the detection of pollution, fires, etc. Indeed, in recent years, many works related to fire detection through the analysis and processing of video images have flooded the literature. However, the fire detection problem does not seem to have a universal answer, as evidenced by some work [1]- [6]. In our view, the major difficulty of fire detection is related to several factors: for example, smoke is often confused with clouds and fog, and flames are confused with traffic lights, vehicle lights, fireworks and fluorescence phenomena. The work of [7], a direct precursor to this study, explored fire detection and localization based on both full-frame binary fire detection and superpixels, using experimentally defined CNN architectural variants derived from the InceptionV1 [8] and AlexNet [9] architectures. InceptionV1-OnFire achieved 89% detection accuracy for superpixel-based detection, while FireNet achieved 93% for full-frame binary fire detection [7]. In this paper, we show that it is possible to obtain fire detection results comparable to recent work on temporal dependence [10] [11] [12] by moving beyond the earlier non-temporal approach of Chenebert et al. [6] and using a CNN model. The main objective of this work is to implement a classification method to detect the presence of fire, in order to avoid large-scale damage caused by fires, and to do so with precision. Thus, in this work, we implemented a classification method based on the VGGnet convolutional neural networks (VGG16 and VGG19) using transfer learning. We classified our training image base into two classes, namely fire and non-fire.
For this purpose, different parameters of the convolutional neural networks were explored dynamically in order to optimize fire classification and detection: the learning rate, the training/validation split of the image base, the number of epochs, and the optimization algorithm (Adagrad, Adam, AdaMax, Nadam and RMSProp).
We used the AUC of the ROC curve to evaluate our classification methods, which yielded a best classification rate of 96%.

Dataset and Operating Protocol
Since this work addresses fire detection, we set up a database of images divided into three parts (training, test, validation). The image base used in our study consists of 20,000 images: 10,000 fire images and 10,000 non-fire images. The images have dimensions of 150 × 150 pixels for the VGGnet networks. Data normalization is performed by dividing all pixel values by 255 to make them compatible with the initial network values. The training/validation splits used in this work are 60%/40%, 70%/30% and 80%/20%. The choice of split ratio is based on previous work: 60%/40% [13], 70%/30% [14] [15] [16] and 80%/20% [17].
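As an illustration, a minimal NumPy sketch of the preprocessing described above (pixel normalization by 255 and a ratio-based train/validation split). This is not the authors' pipeline; the function names and the tiny random stand-in for the 20,000-image base are purely illustrative.

```python
import numpy as np

def normalize(images):
    """Scale raw 8-bit pixel values into [0, 1] by dividing by 255."""
    return images.astype(np.float32) / 255.0

def split(images, labels, train_ratio, seed=0):
    """Shuffle and split into training/validation sets by the given ratio."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(images))
    cut = int(train_ratio * len(images))
    train_idx, val_idx = idx[:cut], idx[cut:]
    return (images[train_idx], labels[train_idx]), (images[val_idx], labels[val_idx])

# Toy stand-in for the 150 x 150 RGB image base (the real base has 20,000 images).
images = np.random.randint(0, 256, size=(100, 150, 150, 3), dtype=np.uint8)
labels = np.array([0, 1] * 50)  # 0 = non-fire, 1 = fire

images = normalize(images)
(train_x, train_y), (val_x, val_y) = split(images, labels, train_ratio=0.7)
print(train_x.shape, val_x.shape)  # (70, 150, 150, 3) (30, 150, 150, 3)
```

The same `split` call with `train_ratio=0.6` or `0.8` reproduces the other two ratios studied here.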
The dataset (Figure 1) was tested on the VGG networks with five gradient descent optimizers. The VGG16 and VGG19 networks have been used in many research works [18] [19] [20] [21] and have given excellent results. Several works have been devoted to the study of different optimizers [22] [23] to evaluate the performance and convergence of the models created. In the following sections, we explain the different VGGnet architectures and how the different optimizers used work.

VGGnet Model
The VGG network architecture was initially proposed by Simonyan and Zisserman [24]. The 16-layer (VGG16) and 19-layer (VGG19) VGG architectures served as the basis for their submission to the ImageNet Challenge 2014, where the Visual Geometry Group (VGG) team secured first and second place in the localization and classification tasks.

VGG 16
VGG16 consists of five convolution blocks. The first block contains two stacked convolution layers with 64 filters. The second block consists of two stacked convolution layers with 128 filters and is separated from the first block by a max pooling layer. The third block consists of three stacked convolution layers with 256 filters and is separated from the second block by another max pooling layer. The fourth and fifth blocks have the same architecture as the third, but with 512 filters. The convolution filters used in this network are of size 3 × 3 with stride 1. A flattening layer is then added between the convolution blocks and the dense layers, converting the 3D feature volume into a 1D vector. The last block consists of two dense layers of 4096 neurons each, followed by the classification layer. The last layer uses a softmax activation, which ensures that the output probabilities sum to one. ReLU is used as the activation function throughout the network.
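The shrinking of the feature maps through the five blocks can be traced with a short sketch. Assuming 150 × 150 inputs (as in this study), "same" 3 × 3 convolutions and 2 × 2/stride-2 max pooling, the helper functions below (illustrative names, not library code) compute the spatial size after each block:

```python
def conv_out(size, kernel=3, stride=1, pad=1):
    """Spatial size after a 3x3 'same' convolution (padding 1, stride 1)."""
    return (size + 2 * pad - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2):
    """Spatial size after a 2x2 max pooling with stride 2 (no padding)."""
    return (size - kernel) // stride + 1

# The five VGG16 blocks: (number of conv layers, number of filters).
blocks = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]

size = 150  # input images are 150 x 150 pixels in this study
for convs, filters in blocks:
    for _ in range(convs):
        size = conv_out(size)   # 'same' convolutions keep the spatial size
    size = pool_out(size)       # each block ends with max pooling
    print(f"{filters} filters -> {size} x {size}")

flat = size * size * 512
print("flattened vector length:", flat)  # 4 * 4 * 512 = 8192
```

So with 150 × 150 inputs the convolutional base ends at roughly 4 × 4 × 512, and flattening yields an 8192-dimensional vector feeding the dense layers.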

VGG19
The VGG19 architecture consists of five blocks of convolutional layers followed by three fully connected layers. The convolutional layers use 3 × 3 kernels with a stride of 1 and a padding of 1, so that each activation map keeps the same spatial dimensions as the previous layer. A rectified linear unit (ReLU) activation [25] is applied immediately after each convolution, and a max pooling operation is periodically used to reduce the spatial dimensions. The max pooling layers use 2 × 2 kernels with a stride of 2 and no padding, so that each spatial dimension of the previous layer's activation map is halved. Two fully connected layers with 4096 ReLU-activated units each are then used before the final fully connected layer of 1000 units with softmax activation. The convolutional blocks can be considered as feature extraction layers; the activation maps they generate are called bottleneck features.

Optimizers
The model learns (trains) on a dataset by comparing the actual label of each input (available in the training set) to the predicted label, thus minimizing the cost function. Hypothetically, if the cost function is zero, the model has learned the dataset perfectly. In practice, an optimization algorithm is needed to approach the minimum of the cost function. The following sections discuss the different optimization algorithms introduced in the literature to minimize the cost function.

Adaptive Gradient Descent Optimizers (AdaGrad)
To scale the learning rate of each weight, the AdaGrad optimization algorithm [26] was introduced to apply different updates to different weights. It performs smaller updates for parameters associated with frequently occurring features, and larger updates for parameters associated with features that occur seldom. For brevity, we use $g_t$ to denote the gradient at time step $t$. AdaGrad accumulates the squares of past gradients and divides the learning rate by the square root of this accumulation:

$$G_t = G_{t-1} + g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \, g_t$$

where $\eta$ is the learning rate and $\epsilon$ is a small constant that avoids division by zero.

Adaptive Moment Estimation (Adam)
The Adam algorithm maintains exponentially decaying averages of past gradients, $m_t$, and of past squared gradients, $v_t$:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

$m_t$ and $v_t$ are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients respectively, hence the name of the method. The authors found that the first- and second-order moments are very small during initial training, close to 0, because the $\beta$ values are large, so a bias correction is applied:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

where $\beta^t$ denotes $\beta$ raised to the power $t$; at the beginning of training the estimates are thus corrected by dividing by $(1 - \beta^t)$, and after several rounds of training this denominator approaches 1. The final update equation is:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \, \hat{m}_t$$

The default value of $\beta_1$ is 0.9, the default value of $\beta_2$ is 0.999, and $\epsilon$ is $10^{-8}$. Experience shows that Adam performs very well in practice and has advantages over other adaptive learning algorithms [27].
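As an illustration, a minimal NumPy sketch of the Adam update described above. This is not the authors' training code; the toy quadratic objective and the learning rate are chosen purely for demonstration.

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update following the moment estimates and bias correction."""
    m = b1 * m + (1 - b1) * g            # first-moment estimate m_t
    v = b2 * v + (1 - b2) * g**2         # second-moment estimate v_t
    m_hat = m / (1 - b1**t)              # bias correction
    v_hat = v / (1 - b2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize the toy quadratic f(theta) = theta^2 (gradient 2*theta) from theta = 1.
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.05)
print(float(theta[0]))  # close to the minimum at 0
```

Note that `t` starts at 1, otherwise the bias-correction denominators would be zero on the first step.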

Adaptive Moment Estimation Extension (AdaMax)
The AdaMax algorithm [28] is an extension of the Adam algorithm based on the infinity norm. In the Adam algorithm, the factor $v_t$ in the update rule scales the gradient inversely proportionally to the $\ell_2$ norm of the past gradients (through $v_{t-1}$) and of the current gradient. This can be generalized from the $\ell_2$ norm to the $\ell_p$ norm:

$$v_t = \beta_2^p v_{t-1} + (1 - \beta_2^p) |g_t|^p$$

The authors of AdaMax [29] show that $v_t$ with $\ell_\infty$ converges to a more stable value. To avoid confusion with Adam, $u_t$ is used to denote this infinity norm:

$$u_t = \max(\beta_2 u_{t-1}, |g_t|)$$

The resulting AdaMax update rule is:

$$\theta_{t+1} = \theta_t - \frac{\eta}{u_t} \, \hat{m}_t$$

Since $u_t$ is never 0, there is no need to add an $\epsilon$ to the denominator as in Adam. The usual default parameter values are $\eta = 0.002$, $\beta_1 = 0.9$ and $\beta_2 = 0.999$.
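A NumPy sketch of the AdaMax step, highlighting the one change from Adam: the infinity-norm accumulator replaces the square root of the second moment. The toy objective and learning rate are illustrative only.

```python
import numpy as np

def adamax_step(theta, g, m, u, t, lr=0.002, b1=0.9, b2=0.999):
    """One AdaMax update: the infinity norm u_t replaces sqrt(v_hat)."""
    m = b1 * m + (1 - b1) * g
    u = np.maximum(b2 * u, np.abs(g))    # u_t = max(beta2 * u_{t-1}, |g_t|)
    m_hat = m / (1 - b1**t)              # bias-corrected first moment
    theta = theta - lr * m_hat / u       # no epsilon needed: u_t > 0
    return theta, m, u

# Minimize the toy quadratic f(theta) = theta^2 from theta = 1.
theta, m, u = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    theta, m, u = adamax_step(theta, 2 * theta, m, u, t, lr=0.05)
print(float(theta[0]))  # close to the minimum at 0
```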

Nesterov-Accelerated Adaptive Moment (Nadam)
Nadam [30] is an extension of the Adam algorithm that combines it with Nesterov momentum. The Nadam update equation is:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \left( \beta_1 \hat{m}_t + \frac{(1 - \beta_1) g_t}{1 - \beta_1^t} \right)$$

where $\eta$ is the learning rate hyperparameter, $\beta_1$ controls how much information is kept from the previous update, and $m_t$ is the first moment.
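A NumPy sketch of the Nadam step: identical to Adam except for the Nesterov-style look-ahead term in the numerator. Again, the toy quadratic and learning rate are illustrative only.

```python
import numpy as np

def nadam_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Nadam update: Adam with a Nesterov-style look-ahead term."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    look_ahead = b1 * m_hat + (1 - b1) * g / (1 - b1**t)
    theta = theta - lr * look_ahead / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize the toy quadratic f(theta) = theta^2 from theta = 1.
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    theta, m, v = nadam_step(theta, 2 * theta, m, v, t, lr=0.05)
print(float(theta[0]))  # close to the minimum at 0
```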

Root Mean Square Propagation Algorithm (RMSProp)
The main drawback of AdaGrad is that the learning rate decreases monotonically because each added term is positive. After many epochs, the learning rate is so low that it stops updating the weights. RMSProp was introduced to solve the problem of monotonic learning rate decay [31].
RMSprop divides the learning rate by an exponentially decaying average of squared gradients:

$$E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \, g_t$$

Hinton suggests that $\gamma$ should be set to 0.9, while a good default value for the learning rate $\eta$ is 0.001.
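A NumPy sketch of the RMSProp step: unlike AdaGrad, the accumulator decays, so the effective learning rate does not shrink monotonically. The toy objective is illustrative only.

```python
import numpy as np

def rmsprop_step(theta, g, avg, lr=0.001, gamma=0.9, eps=1e-8):
    """One RMSProp update using a decaying average of squared gradients."""
    avg = gamma * avg + (1 - gamma) * g**2   # E[g^2]_t, decays unlike AdaGrad's G_t
    theta = theta - lr * g / np.sqrt(avg + eps)
    return theta, avg

# Minimize the toy quadratic f(theta) = theta^2 from theta = 1.
theta, avg = np.array([1.0]), np.zeros(1)
for _ in range(2000):
    theta, avg = rmsprop_step(theta, 2 * theta, avg, lr=0.01)
print(float(theta[0]))  # close to the minimum at 0
```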

Overcoming Overfitting
Overfitting usually involves memorizing the training dataset and typically results in poor performance on the test dataset: performance on the training set can be excellent while performance on the test set remains quite poor. The loss of the network's generalization capability can be due to many factors, such as the capacity of the network or the nature of the training dataset itself. Several techniques have been introduced in the literature to overcome overfitting (Dropout, image augmentation, etc.).
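As an illustration of the first of these techniques, a minimal NumPy sketch of inverted dropout (not framework code; the rate and shapes are illustrative). Image augmentation, the other technique mentioned, would instead apply random flips, rotations or shifts to the training images.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: randomly zero units and rescale the survivors
    so the expected activation is unchanged at test time."""
    if not training or rate == 0.0:
        return activations
    keep = 1.0 - rate
    mask = rng.random(activations.shape) < keep   # True with probability `keep`
    return activations * mask / keep

rng = np.random.default_rng(0)
x = np.ones((4, 8))
y = dropout(x, rate=0.5, rng=rng)
print(y)  # roughly half the entries are 0, the rest are 2.0
```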

Fire Detection Using a Reformed VGGNet Model
This paper uses a fire detection method based on a deep convolutional neural network by proposing a reformed VGG model. With transfer learning, the pre-trained model parameters initialize the convolution layers, and the model is fine-tuned to solve the fire presence classification problem. In addition, we propose to use a GlobalAveragePooling layer instead of the Flatten layer, as it provides a more compact representation of the feature vector.
Our model mainly uses transfer learning to transfer the parameters of the pre-trained VGG-16 and VGG-19 models to the convolution, pooling and fully connected layers of the fire detection model, replaces the original classification layer with a 2-label Softmax layer, and fits a classification model with good accuracy, as shown in Figure 2.
The main operating process is as follows: 1) Input a fire image sample. The images are extracted from the positive (fire) and negative (non-fire) image base as the training sample set.
2) Pre-processing: all fire and non-fire images were collected into a dataset and scaled to a fixed resolution of 150 × 150 pixels, suitable for further processing in the deep learning pipeline.
3) Build the new improved models: using the VGGNet models (VGG16 and VGG19), the FC layers are optimized with reduced parameters, and the Softmax classification layer of the original model is replaced with a two-label Softmax classifier (Figure 2). At the end of the workflow, the overall performance of each deep learning classifier is evaluated based on the metrics described in the following section.
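The replacement of Flatten by GlobalAveragePooling can be illustrated with a NumPy sketch. Assuming the roughly 4 × 4 × 512 bottleneck volume that the VGG convolutional base produces for a 150 × 150 input (an assumption of this sketch, not a figure from the paper), averaging over the spatial dimensions yields a far smaller vector than flattening, which sharply reduces the parameters of the dense layers that follow:

```python
import numpy as np

# Simulated bottleneck features for one input: height x width x channels.
features = np.random.rand(4, 4, 512)

flattened = features.reshape(-1)      # Flatten: 4 * 4 * 512 = 8192 values
gap = features.mean(axis=(0, 1))      # GlobalAveragePooling: one mean per channel

print(flattened.shape, gap.shape)  # (8192,) (512,)
```

A dense layer of 4096 units on top of the flattened vector would need 8192 × 4096 weights, versus 512 × 4096 on top of the pooled vector, which is one motivation for the choice.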

Model Evaluation Criteria
The fire detection model can be assessed in terms of effectiveness and reliability. These two aspects are generally measured by accuracy and test time; the former includes three indicators: precision, recall and error rate. In terms of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN), precision is defined as (1), recall as (2) and error rate as (3):

$$\text{Precision} = \frac{TP}{TP + FP} \quad (1)$$

$$\text{Recall} = \frac{TP}{TP + FN} \quad (2)$$

$$\text{Error rate} = \frac{FP + FN}{TP + FP + TN + FN} \quad (3)$$

The evaluation metrics are summarized in Table 1.
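Equations (1)-(3) translate directly into code. The confusion counts below are hypothetical, chosen only to exercise the formulas:

```python
def precision(tp, fp):
    """Eq. (1): fraction of predicted fires that are real fires."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Eq. (2): fraction of real fires that are detected."""
    return tp / (tp + fn)

def error_rate(tp, fp, tn, fn):
    """Eq. (3): fraction of all images that are misclassified."""
    return (fp + fn) / (tp + fp + tn + fn)

# Hypothetical confusion counts on a balanced fire / non-fire test set.
tp, fp, tn, fn = 95, 4, 96, 5
print(round(precision(tp, fp), 3))           # 0.96
print(round(recall(tp, fn), 3))              # 0.95
print(round(error_rate(tp, fp, tn, fn), 3))  # 0.045
```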

Experimentation and Results of the Proposed Approach
The following section details the results obtained when training the two network architectures with the five selected optimizers and three learning rates, namely 10^−3, 10^−4 and 10^−5. Many experiments were carried out on the datasets in order to determine the behavior of each optimizer with each network architecture and to find the best possible combination. The performance of each optimizer with the VGG16 architecture is presented in Table 2 and Table 4, and that of VGG19 in Table 3 and Table 5. The training/validation splits used are 60%/40%, 70%/30% and 80%/20%. Performance is measured by the AUC of the ROC curve, and optimizers are ranked on the basis of their validation AUC. The image size was held constant at 150 × 150, a batch size of 32 images was used, and the number of epochs was either 300 or 500. Table 2 shows the results of the VGG16 architecture for 300 epochs, taking into account the different dataset splits (60%/40%, 70%/30% and 80%/20%). The highest AUC, 96%, was obtained by the Adam optimizer with the 60%/40% split and the highest learning rate, 10^−3. At the same time, the lowest AUC, 49.81%, was obtained by the AdaGrad optimizer, which did not converge at all with the 80%/20% split. For the medium learning rate 10^−4, the RMSprop optimizer obtained the highest AUC, 94.67%, with the 60%/40% split; the other splits yielded lower values. The AdaGrad optimizer obtained the lowest AUC, 55.56%. For the lowest learning rate 10^−5, the Nadam optimizer with the 70%/30% split obtained the highest AUC, 89.29%, and the AdaGrad optimizer obtained the lowest, 50.44%. Overall, the highest learning rate 10^−3 gave the best results, followed by the medium learning rate 10^−4. Throughout our experiments, the AdaGrad optimizer achieved the lowest AUC for all learning rates and splits used.
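The AUC used to rank the optimizers can be computed without drawing the ROC curve, via its rank-statistic interpretation: the probability that a randomly chosen fire image receives a higher score than a randomly chosen non-fire image. A short self-contained sketch (the scores and labels are made up for illustration):

```python
def auc(scores, labels):
    """AUC via the Mann-Whitney U statistic: the probability that a
    randomly chosen positive scores higher than a negative (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy classifier scores and ground-truth labels (1 = fire, 0 = non-fire).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]
print(auc(scores, labels))  # 8/9 ≈ 0.889
```

An AUC of 0.5 corresponds to a non-converged classifier (as observed here for AdaGrad at some configurations), while 1.0 is a perfect ranking.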
Table 4 shows the results of the VGG16 architecture for 500 epochs, taking into account the different dataset splits (60%/40%, 70%/30% and 80%/20%). For the highest learning rate 10^−3, the highest AUC, 95.43%, was obtained by the Adam optimizer with the 70%/30% split. At the same time, the lowest AUC, 51.14%, was obtained by the AdaGrad optimizer, which did not converge at all with the 70%/30% split and the learning rate 10^−5. For the medium learning rate 10^−4, the Nadam optimizer obtained the highest AUC, 95.04%, with the 70%/30% split, while the AdaGrad optimizer obtained the lowest, 59.25%. For the lowest learning rate 10^−5, the Adam optimizer with the 80%/20% split obtained the highest AUC, 91.44%. Overall, the highest learning rate 10^−3 gave the best results, followed by the medium learning rate 10^−4; the lowest values were obtained with the learning rate 10^−5. Throughout our experiments, the AdaGrad optimizer achieved the lowest AUC for all learning rates and splits used.
Table 3 shows the results of the VGG19 architecture for 300 epochs, taking into account the different dataset splits (60%/40%, 70%/30% and 80%/20%). The highest AUC, 95.58%, was obtained by the Nadam optimizer with the 60%/40% split and the highest learning rate 10^−3. Jointly, the lowest AUC, 50.75%, was obtained by the AdaGrad optimizer, which did not converge at all with the 60%/40% split and the lowest learning rate 10^−5. For the medium learning rate 10^−4, the Nadam optimizer obtained the highest AUC, 94.69%, with the 80%/20% split, and the AdaGrad optimizer obtained the lowest, 57.21%. For the lowest learning rate 10^−5, the RMSprop optimizer obtained the highest AUC, 90.50%, with the 70%/30% split. Overall, the highest learning rate 10^−3 gave the best results, followed by the medium learning rate 10^−4.
The Nadam optimizer gave the highest values with the learning rate 10^−3 for all three splits in our work. For the learning rate 10^−4, Nadam obtained the highest AUC values for the 60%/40% and 80%/20% splits, while RMSprop obtained the highest value for the 70%/30% split. Throughout our experiments, the AdaGrad optimizer obtained the lowest AUC for all learning rates and splits used. Table 5 shows the results of the VGG19 architecture for 500 epochs. The highest AUC, 95.61%, was obtained by the Nadam optimizer with the 70%/30% split and the highest learning rate 10^−3. At the same time, the lowest AUC, 48.96%, was obtained by the AdaGrad optimizer, which did not converge at all with the 60%/40% split and the lowest learning rate 10^−5. For the medium learning rate 10^−4, the Nadam optimizer obtained the highest AUC, 95.11%, with the 70%/30% split, and the AdaGrad optimizer obtained the lowest, 62.50%, with the 70%/30% split. For the lowest learning rate 10^−5, the RMSprop optimizer achieved the highest AUC, 91.68%, with the 70%/30% split. Overall, the highest learning rate 10^−3 gave the best results, followed by the medium learning rate 10^−4. Throughout our experiments, the AdaGrad optimizer achieved the lowest AUC for all learning rates and splits used.

Discussion
From our experimental approach and the results obtained, some interesting conclusions can be drawn about the behavior of the CNNs, the number of epochs and the optimizers studied in this work. Taking into account the choice of optimizer, the number of epochs, the splits used and their relationship with the learning rate, the experimental results confirm that the choice of learning rate and split can lead to unstable behavior of the training process. This is particularly evident, for some of the networks and optimizers considered, at the smallest learning rate used in the experiments. When LR = 10^−5, the training of the VGG16 and VGG19 networks gave low performance values with all the optimizers and splits studied, resulting in poor model performance. However, with a high learning rate, i.e. LR = 10^−3, we obtained the highest values of our experiments regardless of the CNN used. For the medium learning rate LR = 10^−4, we obtained lower values, some close to those obtained with LR = 10^−3. As explained in the previous sections, this can be explained by the fact that the weights of the network change abruptly from one epoch to the next. Moving to higher LR values allowed the training process to converge in all studied configurations. Overall, the results do not match the theoretical expectation that a lower LR value allows smoother convergence, since a lower LR requires more time (epochs) than a higher LR value. Another interesting observation concerns the importance of hyperparameters. While this is a topic of fundamental importance in the field of deep learning, it is particularly evident in the results of the experimental phase. In particular, all studied architectures produced comparable performance when the best configuration of learning rate and optimizer (which differs for each architecture type) was considered.
In other words, the choice of hyperparameters not only plays a critical role in determining model performance, but with well-chosen hyperparameters the examined CNNs become indistinguishable in terms of performance. Our results confirm those of Sharma and Venugopalan [32]. We think this is an interesting observation that further emphasizes the importance of hyperparameter tuning. Focusing on the optimizers and the different splits used (60%/40%, 70%/30% and 80%/20%), the Adam optimizer produced the best performance with the 10^−3 learning rate for VGG16, and Nadam produced the best performance for VGG19. Conversely, Adam, Nadam and RMSprop obtained the best performance on the considered CNNs when LR = 10^−5 (except Nadam and RMSprop on the VGG19 architecture, where the best performance is obtained with LR = 10^−4). Overall, the best result on the considered dataset was obtained by the Adam optimizer with the VGG16 network. However, the differences in performance between the best configurations of each network are not statistically significant. Each optimizer behaved differently depending on the architecture considered: for VGG16, Adam outperformed Nadam and RMSprop; for VGG19, Nadam outperformed Adam and RMSprop.
Given a specific network, each optimizer requires a different time to converge (i.e. to complete the defined number of epochs). In particular, Adam was the optimizer that gave the highest AUC for VGG16, and Nadam gave the highest value for VGG19; whether VGG16 or VGG19 was used, the AdaGrad optimizer gave poor convergence results. This result is consistent with that of Lydia and Francis [33], in which the authors studied alternatives and hyperparameters to improve the performance of gradient descent algorithms. The authors explained that good performance can be achieved with these optimizers if they are trained on different image datasets. The best-performing network obtained with transfer learning (the VGG16 architecture with the Adam optimizer, the 60%/40% split and a learning rate of 10^−3) achieved an AUC of 96%.

Conclusion
In this paper, a comparative study of five optimization algorithms on three dataset splits and three learning rates, using two convolutional neural network architectures (VGG16 and VGG19), has been performed. Our results reveal that the performance of each optimizer varies with the dataset split, the learning rate and the number of epochs, confirming the effect of hyperparameters on the performance of different optimizers. Based on the experiments conducted, Adam exhibited superior and robust performance on the VGG16 network for 300 epochs compared to the other optimization algorithms; similarly, Nadam showed superior performance for VGG19 for 300 epochs. The same result was observed with VGG16 for 500 epochs and VGG19 for 500 epochs. These results show that the Adam and Nadam optimizers are apparently well suited to the dataset and models examined in this study. In this study, two neural network models and three dataset splits were used to perform all our experiments. It would be interesting to use more than three data proportions in different domains and to study the comparative effects of these optimizers on a number of different CNN architectures and deep learning models in order to achieve better generalization. This could be the subject of future work.