SSE-Ship: A SAR Image Ship Detection Model with Expanded Detection Field of View and Enhanced Effective Feature Information
1. Introduction
SAR image ship detection is an important but challenging task in maritime target detection, which requires networks to locate ships in SAR images. Ship detection benefits many applications, for example, maritime disaster relief and marine safety monitoring, where suspicious targets must be located quickly and effectively so that appropriate measures can be taken. Benefiting from the effective feature representation of convolutional neural networks in deep learning, many methods [1] [2] [3] [4] [5] have achieved better results. However, there are still some challenges to accurate ship detection. As shown in Figure 1(a), ships and non-ship objects have different semantics, but they have similar features (e.g., bright white dots). It is difficult to distinguish them without a better combination of image contextual information. On the other hand, since small target ships in SAR images have a single feature and large-size ships have more local features, it is difficult to detect all ships accurately if the image contains ships of multiple sizes and the effective feature information is not enhanced, as shown in Figure 1(b). Therefore, expanding the detection field of view and enhancing the effective feature information are necessary to solve the problems of ship combination and of confusion between ships and non-ship objects. Existing ship detection methods either perform fast detection of small ship targets only [5] - [12] or build deeper networks for accurate detection of medium and large size ships [4] [13] - [18] , but they do not address the above two types of problems, which leads to inaccurate detection in different scenarios.
Other common problems in ship detection are multi-ship combination movements and dock interference. As shown in Figure 1(c), a ship that is closely integrated with the quay is difficult to identify. In Figure 1(d), multiple ships are combined together, causing the overall structure to lose the typical features of a ship.
Figure 1. Case illustration of ship detection. (a) Detection results without combining image contextual information. Ships in the red boxes will be missed due to loss of connection to context. (b) Detection results without enhanced feature information. It is easy to miss the detection of small-sized ships in the red box. (c) A scene where the ship and the coastal pier are fully integrated. (d) The case of multiple ships traveling in combination.
To overcome these drawbacks, this paper proposes the SSE-Ship model, a new ship target detector that combines image context and enhances the effective feature information of the feature map. The model enhances the effective feature information while modeling the context over the whole image range. Specifically, an image containing one or more ships is used as input. First, image features of different depths are extracted from the SAR ship images using a CNN backbone network. Second, the STCSPB network proposed in this paper generates a new global memory feature map by combining the relationships of each feature layer. Third, the channel attention mechanism SE Attention is introduced to enhance the effective feature information in the feature map and generate the feature map used for prediction. Finally, a multi-task loss function consisting of a classification loss and a regression loss is constructed. The results show that the SSE-Ship detection model largely outperforms the existing methods. Specifically, it achieves P = 0.944, R = 0.940, and mAP_0.5:0.95 = 0.647 on the SSDD [19] dataset, mAP_0.5:0.95 = 0.656 and FPS = 50 on the SAR-Ship [20] dataset, and mAP_0.5 = 0.978 and mAP_0.5:0.95 = 0.667 on the HRSID [21] dataset.
2. Related Work
The SSDD dataset was first proposed by Li et al. [19] and provides the corresponding ground-truth ship boxes and label information. The SAR-Ship dataset was first proposed by Wang et al. [20] and is widely used in training maritime target detection models. The HRSID [21] dataset is a dataset for ship detection, semantic segmentation and instance segmentation tasks, first proposed by the University of Electronic Science and Technology in January 2020.
Unlike simple small target detection [22] [23] [24] [25] and deep network construction [18] [21] [26] [27] , the goal of SSE-Ship detection is to accurately detect and locate all ships in SAR images by linking contextual modeling and enhancing effective feature information. [28] modeled global semantic information using a Swin Transformer-based model. [29] proposed a multiple attention mechanism interaction and scale enhancement network for SAR ship instance segmentation.
In this paper, YOLOv7 is used as the key component. YOLOv7 [30] is the seventh version of the YOLO algorithm, which is also improved based on the algorithm idea of YOLOv5. According to the network width and depth difference, YOLOv7 is further subdivided into YOLO7m, YOLOv7x, Yolov7-Tiny, and other versions. First, the algorithm resizes the input image to 640 × 640 and inputs it to the Backbone network. Second, three-layer feature maps with different sizes are chosen as the output through the Head layer network. Finally, the prediction result is obtained.
3. Method
3.1. Network Architecture
Figure 2 illustrates the overall design architecture of the SSE-Ship model proposed in this paper. SSE-Ship consists of four main components: 1) a backbone network for extracting feature maps of different depths from the input image, 2) a global fuser STCSPB network for generating global memory feature maps, 3) a feature enhancement block SEA Block for enhancing the effective feature information in the feature maps, and 4) a multi-task loss function for computing classification and regression errors.
Figure 2. Overall architecture of the SSE-Ship model. It consists of 4 main components: backbone network, image contextual feature information fuser, feature enhancement block and multitasking loss function.
The backbone network, with YOLOv7 [30] as the key component, takes the input image and finally outputs three feature maps of different depths.
3.2. STCSPB Module
Swin Transformer [31] employs a hierarchical Transformer design that improves efficiency by confining the self-attention computation to non-overlapping local windows while still allowing cross-window connections. There are four stages in Swin Transformer, each containing a Block. The SAR image datasets used in this paper are relatively simple and do not require extensive attention computation. Therefore, this paper adopts a single block (Swin-T Block) from Swin Transformer as the main component of the STCSPB module. The STCSPB structure is shown in Figure 3.
Figure 4 shows the Swin-T Block structure. The first part comprises two Layer Normalization (LN) layers, a Window-based Multi-head Self-Attention (W-MSA) module, and a Multi-Layer Perceptron (MLP). The W-MSA module divides the image into non-overlapping windows to reduce the model’s computation. In the second part, to solve the cross-window information interaction problem, the W-MSA of the first part is replaced by Shifted-Window-based Multi-head Self-Attention (SW-MSA), and the remaining components again use LN and MLP with residual connections.
Figure 4. Swin-T Block structure diagram.
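The window partition used by W-MSA and the shift that SW-MSA applies before re-partitioning can be illustrated with a minimal NumPy sketch. It covers only the partitioning, not the attention computation or the attention mask that Swin Transformer applies after shifting. The function names, the 8 × 8 toy feature map, and the window size of 4 are assumptions chosen for illustration and are not taken from the SSE-Ship implementation.

```python
import numpy as np

def window_partition(x, window_size):
    """Split an (H, W, C) feature map into non-overlapping windows.

    Returns an array of shape (num_windows, window_size, window_size, C).
    Assumes H and W are divisible by window_size.
    """
    h, w, c = x.shape
    x = x.reshape(h // window_size, window_size, w // window_size, window_size, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window_size, window_size, c)

def cyclic_shift(x, window_size):
    """Roll the map by half a window so that the next partition
    straddles the borders of the previous windows."""
    shift = window_size // 2
    return np.roll(x, shift=(-shift, -shift), axis=(0, 1))

# Toy 8 x 8 single-channel feature map (hypothetical values).
feature_map = np.arange(8 * 8, dtype=np.float32).reshape(8, 8, 1)

# W-MSA would run self-attention inside each of these windows.
windows = window_partition(feature_map, window_size=4)

# SW-MSA would run self-attention inside windows of the shifted map.
shifted_windows = window_partition(cyclic_shift(feature_map, 4), window_size=4)
```

In this sketch the first shifted window covers rows and columns 2 to 5 of the original map, so it mixes pixels from all four original windows, which is how the shifted partition lets information cross window borders.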
3.3. Network Structure Improvement
The SE (Squeeze and Excitation) [32] module first squeezes the feature map obtained by the convolution to extract the channel-level global features. Then, the global features are subject to the Excitation operation, and the weight of each channel is obtained by learning the relationships between channels. Finally, the output feature is obtained by multiplying the original feature map by the learned channel weights. Figure 5 shows the SE Attention structure.
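$F_{tr}$ in Figure 5 represents the convolution operation. The input convolution kernel is $V = [v_1, v_2, \ldots, v_C]$, where $v_c$ represents the c-th convolution kernel. The output is $U = [u_1, u_2, \ldots, u_C]$. $u_c$ is described by Formula (1), where $*$ represents the convolution operator, $x^s$ is the s-th of the $C'$ channels of the input $X$, and $v_c^s$ represents the 2-D convolution kernel acting on the s-th channel.
$$u_c = v_c * X = \sum_{s=1}^{C'} v_c^s * x^s \tag{1}$$
$F_{sq}$ is the Squeeze operation, implemented as global average pooling. It encodes the entire spatial feature on a channel into a global feature. The Squeeze operation is shown in Formula (2), where $H$ and $W$ are the height and width of the feature map.
$$z_c = F_{sq}(u_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j) \tag{2}$$
The Squeeze operation obtains the global description features. Next, the Excitation operation $F_{ex}$ is used to capture the relationship between channels. It mainly adopts the sigmoid function $\sigma$. The Excitation operation is shown in Formula (3).
$$s = F_{ex}(z, W) = \sigma(g(z, W)) = \sigma(W_2 \, \delta(W_1 z)) \tag{3}$$
where $\delta$ denotes the ReLU activation, $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$, $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$, and $r$ is the channel reduction ratio. In order to reduce the complexity of the model and improve its generalization ability, the operation contains two fully connected layers and employs the ReLU activation.
Finally, each learned channel’s sigmoid activation value (0~1) is multiplied by the original feature on U, as shown in Formula (4).
$$\tilde{x}_c = F_{scale}(u_c, s_c) = s_c \cdot u_c \tag{4}$$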
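The Squeeze, Excitation, and rescaling steps described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the SEA Block of the paper: the channel count, the reduction ratio r = 4, the random weights, and the omission of bias terms are all assumptions made for the example.

```python
import numpy as np

def se_block(u, w1, w2):
    """Apply squeeze-and-excitation to a feature map u of shape (H, W, C).

    w1 has shape (C // r, C) and w2 has shape (C, C // r); biases are omitted.
    Returns the rescaled feature map and the per-channel weights.
    """
    z = u.mean(axis=(0, 1))                    # Squeeze: global average pooling, shape (C,)
    hidden = np.maximum(w1 @ z, 0.0)           # first fully connected layer + ReLU
    s = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # second fully connected layer + sigmoid
    return u * s, s                            # rescale each channel by its weight

rng = np.random.default_rng(0)
C, r = 8, 4                                    # hypothetical channel count and reduction ratio
u = rng.standard_normal((5, 5, C))
w1 = rng.standard_normal((C // r, C))
w2 = rng.standard_normal((C, C // r))

out, scale = se_block(u, w1, w2)
```

The output keeps the shape of the input; only the relative magnitude of each channel changes, which is why the block can be inserted into an existing network without altering the surrounding layers.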
Figure 5. SE Attention structure diagram.
3.4. Loss Function
The loss calculation consists of two parts: the classification loss $L_{cls}$ between the ship target prediction and the real target, and the regression loss $L_{reg}$ of the ship target detection frame. The F_S loss function in this paper is composed of cross entropy classification loss and Smooth L1 regression loss, as shown in Formula (5).
$$F_S = L_{cls} + L_{reg} \tag{5}$$
This paper uses cross entropy loss to calculate the classification loss of the model. The classification cross entropy loss function is shown in Formula (6).
$$L_{cls} = -\sum_{i} \sum_{j=1}^{c} y_{ij} \, (1 - p_{ij})^{\gamma} \, \log(p_{ij}) \tag{6}$$
γ is a parameter in the range of [0, 5]; when γ is 0, the function reduces to the original CE loss function. c represents the number of categories, and $y_{ij}$ indicates whether sample i belongs to category j: $y_{ij} = 1$ if it does, otherwise $y_{ij} = 0$. $p_{ij}$ represents the probability that sample i belongs to category j. This paper uses the SoftMax function to obtain the probability $p_{ij}$ of samples belonging to each category.
For the regression loss of the prediction box, this paper uses the Smooth L1 loss function. The ship detection in this paper belongs to a single sample. If x is defined as the difference between the predicted value and the true value, the corresponding Smooth L1 loss function can be expressed as Formula (7).
$$\mathrm{Smooth}_{L1}(x) = \begin{cases} 0.5x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases} \tag{7}$$
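As a rough illustration of the two loss terms, the sketch below assumes that the classification term takes the standard focal-loss form (which reduces to cross entropy when γ = 0, as the text states) and that the regression term is the standard Smooth L1 loss. The mean reduction over samples, the unweighted sum of the two terms, and the toy inputs are assumptions for the example and may differ from the paper's implementation.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max subtraction for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def focal_loss(logits, labels, gamma=2.0):
    """Focal-style classification loss; gamma = 0 gives plain cross entropy."""
    p = softmax(logits)
    p_true = p[np.arange(len(labels)), labels]   # probability of the true class
    return float(np.mean(-((1.0 - p_true) ** gamma) * np.log(p_true)))

def smooth_l1(pred, target):
    """Smooth L1 loss on the difference x = pred - target."""
    x = np.abs(pred - target)
    return float(np.mean(np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)))

# Toy inputs (hypothetical): two samples, two classes, two box offsets.
logits = np.array([[2.0, 0.0], [0.5, 1.5]])
labels = np.array([0, 1])
cls_loss = focal_loss(logits, labels, gamma=2.0)
reg_loss = smooth_l1(np.array([0.5, 3.0]), np.array([0.0, 0.0]))
total_loss = cls_loss + reg_loss                 # assumed unweighted sum
```

With γ > 0 the factor (1 − p)^γ shrinks the contribution of samples that are already classified with high confidence, so the classification term is never larger than the plain cross entropy on the same inputs.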
4. Experiment
4.1. Implementation Details
The experiments were conducted on the YOLOv7 [30] backbone network. First, the initialized network was trained using the COCO-format dataset. Second, the model was trained for 30 epochs using an SGD optimizer with a batch size of 16. The initial learning rate of the backbone network is set to 0.02 and the momentum to 0.9. The normalized mean value of the dataset images is [0.1559097, 0.15591368, 0.15588938] and the variance is [0.10875329, 0.10876005, 0.10869534]. All experiments in this paper were conducted on an NVIDIA GeForce RTX 3060 GPU.
4.2. Datasets
In this paper, ship detection models are trained and tested on SSDD [19] dataset and SAR-Ship [20] dataset. HRSID [21] is used as the experimental dataset for quantitative analysis of the models. In order to meet the same format of the three datasets, the image labels are uniformly set to COCO data format in this paper.
The SAR-Ship dataset comprises 102 views of GF-3 satellite data provided by the China Resources Satellite Application Center and 108 views of Sentinel-1 satellite data provided by the ESA. The labeled data are provided by the research team of the Institute of Aerospace Information Innovation, Chinese Academy of Sciences.
The dataset first processes the original 16-bit complex data into an 8-bit digital image by performing amplitude value generation, bit depth quantization, and grayscale stretching processing on the source data. Then, a ship slice with a pixel size of 256 × 256 is constructed by cropping and filtering. Finally, the LabelImg target annotation software generates the corresponding ship label box information text for each ship slice.
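The preprocessing chain described above (amplitude generation, quantization to 8 bits with grayscale stretching, and cropping into 256 × 256 slices) can be sketched as follows. The percentile-based stretch, the non-overlapping tiling, and the synthetic complex input are assumptions made for illustration; the dataset authors' exact stretching rule and their ship-based filtering of slices are not reproduced here.

```python
import numpy as np

def complex_to_uint8(slc, low_pct=1.0, high_pct=99.0):
    """Convert complex SAR samples to an 8-bit grayscale image.

    Takes the amplitude, linearly stretches it between two percentiles
    (an assumed stretching rule), and quantizes the result to 0..255.
    """
    amplitude = np.abs(slc)
    lo, hi = np.percentile(amplitude, [low_pct, high_pct])
    stretched = np.clip((amplitude - lo) / max(hi - lo, 1e-12), 0.0, 1.0)
    return np.round(stretched * 255.0).astype(np.uint8)

def tile(image, size=256):
    """Cut an image into non-overlapping size x size slices, dropping partial edges."""
    h, w = image.shape
    return [image[r:r + size, c:c + size]
            for r in range(0, h - size + 1, size)
            for c in range(0, w - size + 1, size)]

# Synthetic complex scene standing in for real SAR data (hypothetical).
rng = np.random.default_rng(0)
slc = rng.standard_normal((512, 768)) + 1j * rng.standard_normal((512, 768))

image8 = complex_to_uint8(slc)
chips = tile(image8)
```

Clipping at percentiles rather than at the raw minimum and maximum keeps a few very bright scatterers from compressing the rest of the scene into the darkest gray levels.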
The dataset contains 20,000 images obtained by SAR under different environmental conditions and background complexities. The multi-dimensional characteristics of ships include spectrum, shape, size, and spatial distribution. Ships appear gray-white in the remote sensing images, similar in color to many shore buildings. The typical shapes of small and medium-sized ships are point-shaped, I-shaped, and patch-shaped when photographed by remote sensing satellites. Ships occupy a small proportion of pixels in satellite image datasets. As shown in Figure 6, the spatial distribution of ships is sparse but denser at the wharf.
Figure 6. Typical samples of the training set. From right to left, the background complexity of the ship data gradually increases. From top to bottom, the ship data are obtained under larger, smaller, and severe environmental disturbances. It also includes data on different ship shapes (point, I, and plaque).
4.3. Evaluation Metric
In order to evaluate the algorithm’s performance based on the validation dataset, this paper employs the precision rate (P), the recall rate (R), and the average precision (AP) as evaluation indicators.
The basic parameters that construct the target detection evaluation index are TP (True Positive), FP (False Positive), and FN (False Negative).
TP is the number of targets predicted as positive that are actually positive. FP is the number of targets predicted as positive that are actually negative. FN is the number of targets predicted as negative that are actually positive.
1) P (Precision) represents the proportion of correctly identified ships among all ship predictions, as shown in Formula (8).
$$P = \frac{TP}{TP + FP} \tag{8}$$
2) R (Recall) represents the proportion of correctly identified ships among all ground-truth ship boxes, as shown in Formula (9).
$$R = \frac{TP}{TP + FN} \tag{9}$$
3) AP (Average Precision) is an essential indicator for evaluating the model performance, as shown in Formula (10).
$$AP = \int_0^1 P(R) \, dR \tag{10}$$
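The three indicators can be computed from detection counts as in the minimal sketch below. The AP function approximates the area under the precision-recall curve with a simple rectangular sum over detections sorted by confidence; this is an assumption for illustration, since benchmark toolkits such as the COCO evaluator use interpolated variants, and the counts and hit list are made-up values.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def average_precision(is_true_positive, num_ground_truth):
    """Area under the precision-recall curve (non-interpolated rectangular sum).

    is_true_positive lists the detections sorted by descending confidence;
    each entry is True when the detection matches a ground-truth box.
    """
    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    for hit in is_true_positive:
        if hit:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / num_ground_truth
        ap += precision * (recall - prev_recall)  # rectangle between recall steps
        prev_recall = recall
    return ap

# Hypothetical example: 8 correct detections, 2 false alarms, 2 missed ships.
p, r = precision_recall(tp=8, fp=2, fn=2)

# Hypothetical ranked detections against 4 ground-truth ships.
ap = average_precision([True, False, True, True], num_ground_truth=4)
```

The mAP_0.5:0.95 values reported in the paper average AP over several IoU thresholds; the sketch covers only a single threshold, with the matching of detections to ground-truth boxes assumed to be done beforehand.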
5. Discussion and Analysis
5.1. Comparison to State-of-the-Art
This paper first shows in Table 1 the main quantitative comparisons of SSE-Ship with SAR image ship detection methods in each detection type. Since the detection mechanisms of the major classes of target detection algorithms differ, a classification comparison is made in this paper.
It can be seen in Table 1 that SSE-Ship performs well on both SSDD and SAR-Ship datasets compared to existing algorithms. In the SSDD dataset, SSE-Ship improves 3.6% on AP_0.5:0.95 compared to CRAS YOLO [40] and 1.5% on AP_0.5:0.95 compared to CRTransSAR [38] . In addition, SSE-Ship performs better on the SAR-Ship dataset. This is attributed to two main reasons: 1) SAR-Ship is a larger dataset than SSDD, which is important for the training of the STCSPB module. 2) The SAR-Ship dataset contains information about ships in more scenarios, which effectively improves the generalization and robustness of the model.
5.2. Ablation Study
In this section, we conduct a large number of experiments to validate the effectiveness of our proposed SSE-Ship. The ablation experiments are performed using the backbone model of YOLOv7, and the results are reported on the SAR-Ship dataset.
5.2.1. STCSPB Ablation Study
STCSPB, as the core part of the context fuser, has the ability to correlate long-range contextual information. To verify its effectiveness, we use STCSPB for comparative analysis with feature fusion networks of other detection algorithms, as shown in Table 2.
As can be seen from Table 2, the STCSPB module proposed in this paper has outstanding performance. Although it is slightly lower than the fuser FPN [45] of the two-stage algorithm on mAP_0.5, there is a 2.2% gain in mAP_0.5:0.95 compared to FPN [45] .
Table 1. Quantitative comparison of SSDD and SAR-Ship sets.
5.2.2. SE Attention
For SE Attention, we use heat maps to verify the effectiveness of introducing channel attention, as shown in Figure 7. Figure 7 uses typical large ships, docked ships and combined ships as validation images. The model with SE Attention extracts more effective feature information than the model without SE Attention.
5.3. Loss Function
In order to verify the effectiveness of the F_S loss function on the SSE-Ship model, this paper conducted three sets of comparative experiments, as shown in Figure 8. Figure 8(a) shows that Focal loss has significant advantages for the classification loss, and the model converges quickly. Figure 8(b) shows that Smooth L1 loss gives full play to its advantages in model training. In Figure 8(c), CE loss and Focal loss are each combined with Smooth L1 loss. Combining Focal loss with Smooth L1 loss works better than combining CE loss with Smooth L1 loss, although the total losses of the two combinations are similar.
Figure 8. Comparative experimental diagram of loss function
5.4. Model Inference
In this section, the robustness and efficiency of the model are analyzed in detail through the model inference process and results.
5.4.1. Model Robustness and Generalization
In order to illustrate the robustness of the SSE-Ship algorithm, the ship detection results for some of the SAR images are shown in Figure 9. Figure 9(a) shows the detection results for dense small targets, where YOLOv7 may miss ships. As shown in Figure 9(b), YOLOv7 is unsuitable for detecting large target ships in the SAR dataset. Figures 9(c)-(e) show that the YOLOv7 algorithm produces false detections under interference from nearby objects and environmental clutter. The SSE-Ship algorithm has fewer false detections and missed detections when facing interference from nearby objects, environmental clutter, dense small targets, large targets, and other ship detection situations. To sum up, SSE-Ship has good generalization ability and robustness. In addition, it handles well the significant differences in ship size and the uneven spatial distribution of ships.
5.4.2. Model Size and Efficiency
At present, the indicators commonly used to evaluate the size and efficiency of a model include Memory, Parameters (Params), and Frames Per Second (FPS). Memory is the number of bytes that the model needs to access, which indicates the model’s demand for storage bandwidth. Params is the total number of parameters in the model, which is used to evaluate the model size. FPS is the number of images the model processes per second during inference, which is used to evaluate the overall efficiency of the model.
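A minimal sketch of how Params and FPS can be measured is given below. The layer shapes and the stand-in `infer` function are hypothetical placeholders rather than the SSE-Ship network; with a real detector, the same timing loop would wrap the model's forward pass, and GPU execution would additionally need device synchronization before reading the clock.

```python
import time

def count_params(layer_shapes):
    """Total number of parameters given the shape of each weight tensor."""
    total = 0
    for shape in layer_shapes:
        n = 1
        for dim in shape:
            n *= dim
        total += n
    return total

def measure_fps(infer, images, warmup=2):
    """Images processed per second by the callable infer, after a short warm-up."""
    for image in images[:warmup]:
        infer(image)                       # warm-up runs are not timed
    start = time.perf_counter()
    for image in images:
        infer(image)
    elapsed = time.perf_counter() - start
    return len(images) / max(elapsed, 1e-9)

# Hypothetical layer: a 3x3 convolution with 3 input and 64 output channels, plus its bias.
params = count_params([(64, 3, 3, 3), (64,)])

# Stand-in "model" that just sums the pixel values of a fake image.
fake_images = [[1, 2, 3]] * 100
fps = measure_fps(lambda image: sum(image), fake_images)
```

Excluding warm-up iterations from the timing avoids counting one-off costs such as lazy initialization, which would otherwise lower the reported FPS.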
In this section, the SSE-Ship detection model is analyzed in detail using different model efficiency evaluation indicators, as shown in Table 3. CRTransSar and PVT-SAR, as improved two-stage detection algorithms, have improved the overall detection accuracy (Table 1), but their parameter counts have increased significantly and their inference speed has decreased significantly. As a one-stage algorithm with the advantage of fast detection, FASC-Net performs well in small target detection, but its accuracy in large target detection is low (Table 1). In the performance comparison of the detection models, the FPS index shows that the inference speed of the proposed model is well controlled, Params shows that its size is slightly higher, and Memory shows that its storage bandwidth requirement is normal.
Figure 9. Comparison chart of detection results.
Table 3. Evaluation table of model performance index.
To sum up, the indicators of SSE-Ship compared with other detection algorithms are within the feasible range. SSE-Ship increases the detection accuracy of medium and large objects while maintaining the detection efficiency of small objects, and there is no large consumption in the inference performance and model size of the model.
5.5. Quantitative Analysis
To evaluate the performance of the SSE-Ship model in practical applications, this paper uses HRSID as a quantitative data set. The dataset contains the ship characteristics in multiple scenes such as real clouds, rain, building interference and different SAR shooting scales. All detection models use the pre-training weight of SAR-Ship dataset to train the HRSID dataset. The quantitative results are shown in Table 4. The SSE-Ship model in this paper has significant advantages over other methods.
Table 4. Quantitative analysis result.
6. Conclusions
The method proposed in this paper is used to detect and locate ships at sea. As a ship-centered detection task, it is relevant to real-world maritime livelihood safety and to ship monitoring in no-navigation zones.
Meanwhile, it is reasonable to use the SSE-Ship model to detect ships at sea in SAR images. In practical applications, the size and noise of ships can vary greatly due to the different distances and environments of SAR shots. SSE-Ship can not only identify complex types of ships according to the context, but also effectively detect multi-scale ships, i.e., it ensures a high detection rate for small targets and improves the detection accuracy for large targets.
Acknowledgements
This paper was funded by the Graduate Innovation Fund of Sichuan University of Science & Engineering (Y2022180).