YOLOv8 for Fire and Smoke Recognition: An Algorithm Integrated with the Convolutional Block Attention Module
1. Introduction
Fire and smoke have become significant threats due to their high frequency and destructive nature. Their rapid spread, particularly in combustible-dense areas such as residential zones, airports, and forests, poses a challenge for swift control. Consequently, timely and accurate fire detection is crucial for preventing large-scale disasters. Traditionally, research has focused on contact-based fire detection sensors like smoke, temperature, and particle sensors, which are cost-effective and easy to deploy. However, these systems are suitable mainly for small areas and have considerable limitations in larger settings. Since they require direct activation by fire temperature or smoke, there is a potential delay in optimal fire extinguishing time. Compared to sensor-based methods, vision-based fire detection offers numerous advantages, including rapid response, extensive coverage, and environmental robustness, leading to its increasing popularity.
In recent years, research on fire and smoke detection has predominantly focused on video image detection algorithms, which fall into two categories: traditional classifier-based and deep learning-based smoke and fire detection. The former approach first employs feature extraction methods such as SIFT [1] and HOG [2] to extract characteristics of fires and smoke, including brightness, color, texture, and edges. These features are then fed into classifiers such as SVMs, Bayesian networks, and BP neural networks, which are trained to determine the presence of fires and smoke in images, as discussed in [3]. However, this methodology relies on manually crafted algorithms to extract low-level image features and then optimize the results. It is therefore time-consuming, performing poorly and too slowly for real-time fire and smoke detection. Furthermore, occlusion and interference often cause numerous false positives and background detection errors. These methods are therefore ineffective for timely, efficient detection and alarm signaling in the early stages of tunnel fires.
Deep learning-based fire detection algorithms excel at extracting more abstract, high-level features of fires and smoke, demonstrating superior performance compared to traditional classifier-based methods. These algorithms are characterized by their high efficiency and accuracy. Frizzi S. [4] introduces a convolutional neural network capable of automatically recognizing fires in videos. This network uses convolutional layers to extract features, pooling layers to reduce feature map dimensions and computational complexity, and fully connected layers to amalgamate all features before outputting to a classifier. Compared to manual feature extraction methods, these algorithms significantly improve accuracy and speed. However, their reliance on two-dimensional convolution overlooks the dynamic characteristics of fires and smoke. Moreover, due to dataset limitations, they are primarily effective in recognizing only red fires.
Cao Y. [5] and D. Nguyen M. [6] explore the application of Recurrent Neural Networks (RNNs) in fire detection tasks, utilizing their ability to relate features of the same object across different frames and thereby retain long-term memory of video information. Long Short-Term Memory networks (LSTMs), a variant of RNNs, address the vanishing-gradient problem of traditional RNN models. When applied to fire detection, LSTMs can simultaneously extract spatial and temporal features of flames and smoke, achieving high accuracy and recall while meeting real-time processing requirements. However, LSTMs are difficult to train owing to their numerous fully connected layers, long time spans, deep network architecture, and large parameter counts. Panagiotis et al. [7] propose a fire detection method using an enhanced Faster R-CNN [8], which employs multi-dimensional texture analysis for feature extraction. This approach enables more accurate recognition of various types of flame images and adapts well to noise and lighting variations. However, the extensive texture feature extraction increases the algorithm’s complexity, and the two-stage design of Faster R-CNN, which must first generate candidate regions, achieves high precision and accurate localization only at the cost of a complex model structure and slower detection speed.
The YOLO [9] series represents a benchmark in single-stage detection algorithms. Cao et al. [10] introduced a fire and smoke detection model named SE_RFB_YOLO, which is based on the YOLOv3 [11] framework. This model incorporates a channel-based attention mechanism that enhances detection efficiency. Additionally, Cai W et al. [12] developed a smoke detection model named YOLO-SMOKE by embedding an efficient channel attention mechanism into YOLOv3 and modifying its loss function; this approach enhances the accuracy and robustness of the algorithm.
Numerous studies have already demonstrated the superiority of the YOLO series algorithms in the detection of smoke and flames. The YOLOv8 algorithm represents a further advancement by the original creators of YOLOv5, building upon its predecessors. To enhance the accuracy and robustness of smoke detection, this paper introduces a modified version of this algorithm, YOLOv8-CBAM, which incorporates the CBAM [13] (Convolutional Block Attention Module) into YOLOv8. Experiments conducted on a smoke and flame dataset and comparative analyses with YOLOv5, YOLOv6, and YOLOv8 have shown that YOLOv8-CBAM achieves a 2.3% increase in accuracy for smoke and flame detection, surpassing the performance of other methods.
The structure of this paper is organized as follows: Section 1 provides an introduction, setting the stage for the study. Section 2 elaborates on the fundamental principles of the YOLOv8-CBAM network framework. Section 3 presents comparative experiments with other smoke and flame detection algorithms, demonstrating the superiority of the YOLOv8-CBAM network. Finally, Section 4 offers a summary of the content and findings of this paper.
2. YOLOv8-CBAM
As illustrated in Figure 1, the YOLOv8-CBAM architecture integrates three CBAM (Convolutional Block Attention Module) units into the base structure of YOLOv8.
2.1. YOLOv8n
The YOLOv8 algorithm is primarily composed of three parts: Backbone, Neck, and Head, as depicted in Figure 2. The Backbone primarily consists of multiple modules such as CBS, C2f, and SPPF, which are responsible for feature extraction from images. CBS represents a simple convolutional layer. The C2f module, drawing inspiration from the C3 module in YOLOv5 and the ELAN concept in YOLOv7 [14] , is designed to ensure a richer gradient flow of information while maintaining a reduced number of parameters; its structure is also shown in Figure 2. The Neck part facilitates the integration of high-resolution and high-semantic information by merging high-level and low-level features. Finally, the Head, composed of multiple detection heads, is responsible for decoupling the refined feature information from the Neck, determining the position and category of the target object. The Backbone and Neck extract feature information but are incapable of performing localization tasks, which is the primary function of the Head.
2.2. CBAM
Given the often subtle and unstable movement characteristics of fires and smoke in certain scenarios, accurately detecting them poses a significant challenge for detection algorithms. To address this, the present study proposes the integration of the Convolutional Block Attention Module (CBAM) attention mechanism during the feature extraction phase of YOLOv8. CBAM combines channel and spatial attention mechanisms, effectively identifying key features in images while suppressing irrelevant noise. This dual attention mechanism notably enhances the accuracy and efficiency of detection, especially in complex and dynamic fire scenarios, making CBAM an essential tool in advanced image-based fire detection systems.
As depicted in Figure 3, CBAM consists of two modules: the Channel Attention Module (CAM), which implements channel attention mechanisms, and the Spatial Attention Module (SAM), which employs spatial attention mechanisms.
Figure 4 and Figure 5 respectively provide detailed illustrations of the basic structures of the Channel Attention Module (CAM) and the Spatial Attention Module (SAM).
Let the input feature map be denoted as $A \in \mathbb{R}^{C \times H \times W}$. As illustrated in Figure 5, A first undergoes channel attention processing to obtain B, which then passes through spatial attention to yield the final activated feature map C. This process is mathematically represented in Equation (1):

$$B = M_c(A) \otimes A, \quad C = M_s(B) \otimes B \tag{1}$$

In this context, $M_c$ and $M_s$ denote the channel and spatial attention maps, and the symbol $\otimes$ represents element-wise multiplication. When the dimensions of the operands do not match, the spatial attention values are broadcast along the channel dimension, while the channel attention values are broadcast along the spatial dimensions.
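To make the two attention stages of Equation (1) concrete, the following is a minimal NumPy sketch of the CBAM forward pass. This is an illustration only: the shared-MLP weights are random stand-ins for learned parameters, and the 7×7 convolution of the original SAM is simplified to an average of the two pooled maps.

```python
import numpy as np

def channel_attention(a, reduction=16, seed=0):
    """Channel Attention Module (CAM): pool spatial dims, weight channels.

    `a` has shape (C, H, W). The shared-MLP weights below are random
    stand-ins for learned parameters (illustration only).
    """
    c = a.shape[0]
    rng = np.random.default_rng(seed)
    w1 = rng.standard_normal((c // reduction, c)) * 0.1
    w2 = rng.standard_normal((c, c // reduction)) * 0.1

    def mlp(v):                                   # shared two-layer MLP
        return w2 @ np.maximum(w1 @ v, 0.0)

    avg = a.mean(axis=(1, 2))                     # global average pool -> (C,)
    mx = a.max(axis=(1, 2))                       # global max pool     -> (C,)
    m_c = 1.0 / (1.0 + np.exp(-(mlp(avg) + mlp(mx))))   # sigmoid -> (C,)
    return a * m_c[:, None, None]                 # broadcast over H, W

def spatial_attention(b):
    """Spatial Attention Module (SAM): pool channels, weight locations.

    The 7x7 convolution of the original SAM is simplified here to an
    average of the two pooled maps, to keep the sketch dependency-free.
    """
    avg = b.mean(axis=0)                          # (H, W)
    mx = b.max(axis=0)                            # (H, W)
    m_s = 1.0 / (1.0 + np.exp(-(avg + mx) / 2.0)) # sigmoid -> (H, W)
    return b * m_s[None, :, :]                    # broadcast over channels

def cbam(a):
    """Eq. (1): channel attention first, then spatial attention."""
    return spatial_attention(channel_attention(a))
```

Since both attention maps lie in (0, 1), the output preserves the input's shape while attenuating less informative channels and locations.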
2.3. Loss Function Optimization
The loss function of YOLOv8 comprises three components, as expressed in Equation (2):

$$L = \lambda_{1} L_{CIoU} + \lambda_{2} L_{cls} + \lambda_{3} L_{DFL} \tag{2}$$

In this equation, $L_{CIoU}$, $L_{cls}$, and $L_{DFL}$ represent the bounding box regression loss, classification loss, and Distribution Focal Loss (DFL), respectively, and $\lambda_{1}$, $\lambda_{2}$, $\lambda_{3}$ are their weighting coefficients. The bounding box regression loss is the Complete Intersection over Union (CIoU) loss, with the full calculation detailed in Equation (3):
$$L_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v, \quad v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2, \quad \alpha = \frac{v}{(1 - IoU) + v} \tag{3}$$

In this context, $\alpha$ is a weighting function, $v$ measures the similarity in aspect ratios, $IoU$ is the Intersection over Union of the predicted and actual boxes, and $\rho(\cdot)$ denotes the Euclidean distance. $b$ and $b^{gt}$ are the center points of the predicted and actual boxes, respectively, and $c$ represents the diagonal length of the smallest enclosing box that contains both the predicted and actual boxes.
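The CIoU computation in Equation (3) can be sketched in a few lines of dependency-free code (boxes given as (x1, y1, x2, y2) corner coordinates; an illustration, not the YOLOv8 implementation):

```python
import numpy as np

def ciou_loss(box_p, box_g):
    """CIoU loss per Eq. (3): 1 - IoU + center-distance penalty + alpha*v."""
    # Intersection and union
    xi1, yi1 = max(box_p[0], box_g[0]), max(box_p[1], box_g[1])
    xi2, yi2 = min(box_p[2], box_g[2]), min(box_p[3], box_g[3])
    inter = max(0.0, xi2 - xi1) * max(0.0, yi2 - yi1)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    iou = inter / (area_p + area_g - inter)

    # Squared distance rho^2 between box centers
    cxp, cyp = (box_p[0] + box_p[2]) / 2, (box_p[1] + box_p[3]) / 2
    cxg, cyg = (box_g[0] + box_g[2]) / 2, (box_g[1] + box_g[3]) / 2
    rho2 = (cxp - cxg) ** 2 + (cyp - cyg) ** 2

    # Squared diagonal c^2 of the smallest enclosing box
    cw = max(box_p[2], box_g[2]) - min(box_p[0], box_g[0])
    ch = max(box_p[3], box_g[3]) - min(box_p[1], box_g[1])
    c2 = cw ** 2 + ch ** 2

    # Aspect-ratio consistency v and its weight alpha
    wp, hp = box_p[2] - box_p[0], box_p[3] - box_p[1]
    wg, hg = box_g[2] - box_g[0], box_g[3] - box_g[1]
    v = (4 / np.pi ** 2) * (np.arctan(wg / hg) - np.arctan(wp / hp)) ** 2
    alpha = v / ((1 - iou) + v + 1e-9)  # small epsilon guards division by zero

    return 1 - iou + rho2 / c2 + alpha * v
```

For a perfect prediction the loss is zero; any center offset or aspect-ratio mismatch adds a positive penalty on top of 1 − IoU.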
While CIoU effectively incorporates distance, overlap area, center-point deviation, and aspect ratio in bounding box regression, thus avoiding the issue in DIoU where boxes with identical IoU and coinciding center points cannot be distinguished, it does not account for the directional mismatch between the actual and predicted boxes. This paper therefore adopts WIoUv3 for the bounding box regression loss. The WIoUv1 loss is computed as in Equation (4):
$$L_{WIoUv1} = R_{WIoU} \cdot L_{IoU} \tag{4}$$

The calculation formulas for $R_{WIoU}$ and $L_{IoU}$ are given in Equation (5) and Equation (6):

$$R_{WIoU} = \exp\left(\frac{(x - x^{gt})^2 + (y - y^{gt})^2}{\left(W_g^2 + H_g^2\right)^{*}}\right) \tag{5}$$

$$L_{IoU} = 1 - IoU \tag{6}$$
In these formulas, $(x, y)$ and $(x^{gt}, y^{gt})$ respectively represent the center coordinates of the predicted and actual bounding boxes, while $W_g$ and $H_g$ denote the width and height of the smallest enclosing box; the superscript $*$ indicates that the term is detached from the computation graph so that it produces no gradients that would hinder convergence. WIoUv3 augments WIoUv1 with a non-monotonic focusing coefficient $r$, calculated as in Equation (7), Equation (8) and Equation (9):

$$L_{WIoUv3} = r \cdot L_{WIoUv1} \tag{7}$$

$$r = \frac{\beta}{\delta \alpha^{\beta - \delta}} \tag{8}$$

$$\beta = \frac{L_{IoU}^{*}}{\overline{L_{IoU}}} \in [0, +\infty) \tag{9}$$

Here $\beta$ is the outlier degree of an anchor box, $\overline{L_{IoU}}$ is the running mean of $L_{IoU}$ over training, and $\alpha$ and $\delta$ are hyper-parameters controlling the shape of the focusing coefficient.
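The WIoUv3 computation in Equations (4)–(9) can be sketched as follows. This is an illustration under assumed inputs: `iou_mean` stands in for the running mean of $L_{IoU}$, and the `alpha`/`delta` defaults are placeholders, not the settings used in this paper.

```python
import numpy as np

def wiou_v3_loss(iou, ctr_p, ctr_g, wg, hg, iou_mean, alpha=1.9, delta=3.0):
    """Sketch of WIoUv3 following Eqs. (4)-(9).

    iou      : IoU of predicted and actual boxes
    ctr_p/g  : (x, y) centers of predicted and actual boxes
    wg, hg   : width/height of the smallest enclosing box
    iou_mean : running mean of L_IoU over training (assumed given here)
    """
    l_iou = 1.0 - iou                                   # Eq. (6)
    dist2 = (ctr_p[0] - ctr_g[0]) ** 2 + (ctr_p[1] - ctr_g[1]) ** 2
    r_wiou = np.exp(dist2 / (wg ** 2 + hg ** 2))        # Eq. (5), denominator detached in training
    l_v1 = r_wiou * l_iou                               # Eq. (4)
    beta = l_iou / iou_mean                             # Eq. (9), outlier degree
    r = beta / (delta * alpha ** (beta - delta))        # Eq. (8), non-monotonic focusing
    return r * l_v1                                     # Eq. (7)
```

Anchor boxes of average quality (beta near delta) receive the largest gradient gain, while very easy and very hard (outlier) boxes are down-weighted.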
3. Experimentation
3.1. Experimental Environment and Dataset
This study’s experiments were conducted on a system running the Windows 11 operating system, powered by an Intel(R) Core i5-13490F CPU and an NVIDIA GeForce RTX 4070 Ti GPU. The deep learning framework employed was PyTorch. After preparing the experimental dataset and setting up the experimental environment, iterative training was conducted using the proposed YOLOv8-CBAM model, along with other networks for comparative purposes. The dataset used in this study combines the public smoke dataset described in the literature [15] with additional images collected through web scraping and publicly available online resources. This comprehensive dataset includes images of smoke and fires from various scenarios.
3.2. Experimental Evaluation Criteria
To accurately assess the model’s effectiveness in detecting fires and smoke, this study employs precision, recall, mean Average Precision (mAP), and model forward inference time as key performance metrics.
Precision evaluates the model’s accuracy and is defined as the proportion of correct positive predictions out of all positive predictions made, as shown in Equation (10).
Recall assesses the model’s comprehensiveness by measuring the proportion of correct positive predictions out of all actual positive instances, as depicted in Equation (11).
mAP is one of the most crucial performance evaluation metrics in the field of object detection, used to gauge the model’s accuracy and comprehensiveness across multiple categories. The calculation process for mAP is outlined in Equation (12).
$$P = \frac{TP}{TP + FP} \tag{10}$$

$$R = \frac{TP}{TP + FN} \tag{11}$$

$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i, \quad AP_i = \int_0^1 P_i(R)\,dR \tag{12}$$

In these equations, $TP$, $FP$, and $FN$ denote true positives, false positives, and false negatives, respectively; $AP_i$ is the area under the precision-recall curve for category $i$; and $N$ is the number of categories.
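These metrics translate directly into code. In the sketch below, `ap_per_class` would hold the per-class areas under the precision-recall curves, obtained from the detector's ranked predictions:

```python
def precision(tp, fp):
    """Eq. (10): correct positive predictions over all positive predictions."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Eq. (11): correct positive predictions over all actual positives."""
    return tp / (tp + fn)

def mean_average_precision(ap_per_class):
    """Eq. (12): mAP is the mean of the per-class average precisions,
    each being the area under that class's precision-recall curve."""
    return sum(ap_per_class) / len(ap_per_class)
```

For a two-class task such as smoke and fire, mAP is simply the mean of the two class-wise AP values.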
3.3. Experimental Results and Analysis
During training, the initial learning rate was set to 0.0001, with a batch size of 16 and the number of iterations fixed at 300. Both training and testing images were resized to a dimension of 640 × 640. Figure 6 and Figure 7 indicate that the model’s training tended to stabilize after 100 iterations. Notably, during the final 10 iterations of training, the Mosaic augmentation was disabled, resulting in a significant downward trend in the curve. This demonstrates the effectiveness of the Mosaic augmentation in enhancing the model’s performance.
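For reference, the hyper-parameters above correspond to an Ultralytics CLI invocation along the following lines. This is a hedged sketch: `fire_smoke.yaml` is a hypothetical placeholder for the dataset configuration file, and `close_mosaic=10` disables Mosaic augmentation for the final 10 epochs.

```shell
# Hypothetical training invocation matching the reported settings;
# fire_smoke.yaml stands in for the actual dataset configuration.
yolo train model=yolov8n.pt data=fire_smoke.yaml \
    epochs=300 batch=16 imgsz=640 lr0=0.0001 close_mosaic=10
```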
3.3.1. CBAM Comparative Experiment
As previously mentioned, this paper integrates the CBAM attention mechanism into the backbone network of YOLOv8n. To accurately evaluate the enhancement effect of CBAM on the existing algorithm, testing was extended beyond the original dataset to include images derived from real tunnel fire videos recorded in various complex scenarios, as depicted in Figure 8.
3.3.2. Comparison Experiment with Other Models
To further evaluate the performance of the proposed method in smoke and fire detection, this study conducted a comparative analysis with widely used existing algorithms, including YOLOv5, YOLOv6, and the original YOLOv8. The results of this comparison are presented in Table 1. Compared to YOLOv5, YOLOv6, and YOLOv8, the proposed algorithm achieved a substantial improvement in accuracy for smoke and fire detection, with an increase of approximately 2.3 to 2.7 percentage points. Specifically, mean Average Precision at 50% IoU (mAP50) and mAP50-95 increased by 1.8 to 2.3 percentage points and 1.3 to 2 percentage points, respectively.
To provide a more visual demonstration of the model’s performance, this paper selected four images for inference computation. As shown in Figure 9, each image represents a scenario with challenging or deceptive smoke and fire detection. The first image, depicting a sunset, was mistakenly identified as fire by YOLOv5. The second image, correctly identifying smoke, was accurately recognized only by the model trained with YOLOv8-CBAM. In the third image, featuring multiple fires, other models either failed to detect them or produced overly large bounding boxes, lacking precision. The fourth image, representing a fiery sky, was incorrectly classified by all models except the proposed one. These instances clearly demonstrate the superiority of the algorithm proposed in this paper.
Table 1. Comparison of training results of different improved models.
Figure 9. Comparison of detection results.
4. Conclusion
In this study, an enhanced smoke and fire detection algorithm based on the improved YOLOv8 framework and integrated with the Convolutional Block Attention Module (CBAM) demonstrated significant effectiveness in dealing with the complexities of shape, texture, and color in flames and smoke. The introduction of CBAM strengthened the algorithm’s feature extraction capability, making the network more efficient in detecting two specific categories: smoke and fire. Additionally, the employment of the WIoU function optimized network loss and accelerated model convergence. Extensive training experiments conducted on a smoke and fire dataset indicated that the proposed algorithm substantially improved average precision compared to existing methods. However, the research also has limitations, such as the algorithm’s adaptability in more complex smoke and fire scenarios not being fully validated. Future research will focus on exploring detection algorithms in more challenging smoke and fire environments to further validate and optimize the method proposed in this paper.