A Novel SAR Image Ship Small Targets Detection Method

To satisfy the practical requirements of high real-time performance and low computational complexity in synthetic aperture radar (SAR) image small ship target detection, this paper proposes a small ship target detection method based on an improved You Only Look Once version 3 (YOLOv3). The main contributions of this study are threefold. First, the feature extraction network of the original YOLOv3 algorithm is replaced with the convolutional layers of the VGG16 network. Second, general convolution is transformed into depthwise separable convolution, thereby reducing the computational cost of the algorithm. Third, a residual network structure is introduced into the feature extraction network to reuse shallow target feature information, which enhances the detailed features of the target and improves the accuracy of small target detection. To evaluate the performance of the proposed method, extensive experiments are conducted on a public SAR image dataset. The effectiveness of the proposed algorithm is verified for ship targets with complex backgrounds and for small ship targets in SAR images. Results show that precision and recall improved by 5.31% and 2.77%, respectively, compared with the original YOLOv3. Furthermore, the proposed model not only significantly reduces the computational effort but also improves the detection accuracy for small ship targets.


Related Work
With the maturity of deep learning technology, an increasing number of innovative algorithms have been developed in remote sensing image target detection research.

Remote Sensing Image Target Detection
In Ref. [9], the fusion of optical and SAR data sets is performed using two different approaches. The authors conclude that in refugee camp areas, the results of the independent analyses can be improved significantly by the proposed fusion approaches. In Ref. [10], the presented algorithm works comparatively well on images of the ocean in freezing temperatures and strong wind conditions, which are common in the Amundsen Sea. In Ref. [11], continuous learning of a residual convolutional neural network applicable to middle- and high-resolution optical remote sensing images was proposed; it demonstrated good recognition accuracy for airport targets in optical remote sensing images with complex backgrounds. In Ref. [12], an improved remote sensing image target detection method was proposed based on Faster R-CNN, which yields better detection results owing to the fusion of multiscale features and features extracted by the convolutional neural network of the rotating regional network. An improved convolutional neural network-based method for SAR image target recognition was proposed in a previous study [13]. After a linear weighted combination of the classification results of the original image and its multi-resolution representations, the test samples were classified based on the combined results, and validation on the MSTAR dataset demonstrated the effectiveness and robustness of the method. The FCD-EMD algorithm [14] combines detailed information in different directions so that its results are more accurate than those of individual methods. Furthermore, it can reduce the effect of speckle noise in SAR images via feature selection.

SAR Image Ship Target Detection
In Ref. [15], an improved convolutional neural network-based SAR image ship target detection algorithm was proposed to detect multiscale ship targets in multiple scenes; the algorithm showed good adaptability to the detection of ship targets of different sizes in complex scenes. In Ref. [16], a ship detection algorithm based on a deep feature pyramid and a cascade detector was proposed.
The feature extraction network of the original target detection algorithm was improved, a cascade structure was used to adjust the network, and good detection results were obtained. In Ref. [17], an improved detection method was proposed based on a regional full convolution network, which can suppress the effect of speckle noise, effectively extract the features of ships, and yield a good detection effect. In Ref. [18], a neural network with a hybrid algorithm of CNN and multilayer perceptron (CNN-MLP) was suggested for image classification. In this proposal, the algorithm is trained with real SAR images from the Sentinel-1 and RADARSAT-2 satellites and performs better on object classification than the state of the art. In Ref. [19], the authors propose a modified topology that utilizes superpixels (SPs) in lieu of rectangular sliding windows to define CFAR guard bands and background. The aim is to achieve better target exclusion from the background band and reduced false detections.
Many advances have been made in previous works, but few studies have focused on lightweight processing of small target detection algorithms. This paper presents a lightweight neural network for small target detection. The proposed method not only improves detection accuracy but also reduces computational complexity.

Methods
The feature extraction network adopted in this paper contains several convolutional layers with different scales and different numbers of convolution kernels; the specific scales and numbers of kernels are shown in Figure 1. The feature extraction network used by YOLOv3 is Darknet-53. That network has a total of 52 convolutional layers, 5 downsampling convolution operations, and 23 residual structures, namely skip connections. To extract richer image feature information, the number of channels grows layer by layer, up to 1024 channels, which increases the number of parameters to learn. The network structure proposed in this paper has fewer convolutional layers, and each layer has fewer feature channels, with at most 512 channels.

Reducing Computational Complexity
In the detection of SAR image targets, the computation of model parameters can be further reduced on the basis of the original algorithm, which further improves the real-time performance of the algorithm. In addition, lightweight processing of parameters is currently an important direction of neural network research. For large networks, researchers hope to further reduce the computational burden of the model without weakening feature extraction. The current mainstream method is separable convolution [20]. This convolution operation is divided into two steps: depthwise convolution and pointwise convolution. To illustrate the idea, consider a simple convolution operation. Assume that A is an m × m matrix and B is a p × p matrix; then B (assuming it is separable) can be represented as Formula (1):

B = M_{p×1} N_{1×p}    (1)

where M_{p×1} is a matrix of size p × 1 and N_{1×p} is a matrix of size 1 × p. The convolution of the matrices A and B can then be represented as Formula (2):

A * B = A * (M_{p×1} N_{1×p}) = (A * M_{p×1}) * N_{1×p}    (2)

where * denotes the convolution operation. The above equation can be generalized to a tensor convolution operation. Assume that A_{k×k×m} is a tensor of size k × k × m and B_{p×p×n×s} is a tensor of size p × p × n × s; the convolution of the tensors A and B can then be decomposed in the same way into a depthwise convolution followed by a pointwise convolution, as in Formula (3):

A_{k×k×m} * B_{p×p×n×s} = (A_{k×k×m} * D_{p×p×n}) * P_{1×1×n×s}    (3)

where D denotes the depthwise kernels and P the pointwise kernels. The separable convolution operation is shown in Figure 2. During depthwise convolution, each of the three input channels is convolved with its own kernel, yielding three feature maps. Pointwise convolution is then performed on them: four 1 × 1 × 3 convolution kernels are applied to the feature maps of the three input channels, generating four feature maps. As described above, the first step of separable convolution requires that the number of channels of the convolution kernel equal the number of input channels.
The second step requires convolution kernels of size 1 × 1, whose number equals the preset number of output channels. Although the conventional convolution operation is divided into two steps, the amount of computation is greatly reduced while comparable results are obtained. It can be seen from Figure 2 that a general convolution operation is changed into a depthwise convolution and a pointwise convolution, whose computational complexity is much lower. The time complexity of each convolutional layer in a convolutional neural network can be expressed as Formula (4) [21].
O(M² × K² × C_in × C_out)    (4)

where M represents the size of the output feature map, K represents the size of the convolution kernel, C_in represents the number of input channels, and C_out represents the number of output channels. For separable convolution, the complexity becomes O(M² × K² × C_in + M² × C_in × C_out), so its ratio to standard convolution is 1/C_out + 1/K², which for a 3 × 3 kernel is roughly 1/9. Taking the first convolutional layer as an example, it can be seen that the computational complexity of the convolutional layer is greatly reduced, to only about 1/8 to 1/9 of the original. In this way, the real-time performance of the whole network is enhanced, while the feature extraction performance is not affected.
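The two steps above can be sketched directly in NumPy. The function below is a minimal illustration (stride 1, "valid" padding, random data), not the paper's actual implementation; the shapes mirror the Figure 2 setting of three input channels and four output channels, and the final lines evaluate the cost ratio implied by Formula (4):

```python
import numpy as np

def depthwise_separable_conv(x, dw_kernels, pw_kernels):
    """Two-step separable convolution (stride 1, 'valid' padding).

    x          : (H, W, C_in) input feature map
    dw_kernels : (K, K, C_in) one spatial kernel per input channel (step 1)
    pw_kernels : (C_in, C_out) 1 x 1 kernels that mix channels (step 2)
    """
    H, W, C_in = x.shape
    K = dw_kernels.shape[0]
    oh, ow = H - K + 1, W - K + 1

    # Step 1, depthwise: every input channel is filtered independently.
    dw = np.zeros((oh, ow, C_in))
    for c in range(C_in):
        for i in range(oh):
            for j in range(ow):
                dw[i, j, c] = np.sum(x[i:i+K, j:j+K, c] * dw_kernels[:, :, c])

    # Step 2, pointwise: a 1 x 1 convolution is a per-pixel matrix product.
    return dw @ pw_kernels                      # shape (oh, ow, C_out)

# Figure 2 setting: 3 input channels, 4 output channels, 3 x 3 kernels.
x = np.random.rand(8, 8, 3)
y = depthwise_separable_conv(x, np.random.rand(3, 3, 3), np.random.rand(3, 4))
print(y.shape)                                  # (6, 6, 4)

# Cost comparison from Formula (4): standard convolution needs
# M^2*K^2*C_in*C_out multiplications; the separable version needs
# M^2*K^2*C_in (depthwise) + M^2*C_in*C_out (pointwise).
M, K, c_in, c_out = 6, 3, 64, 128
standard = M * M * K * K * c_in * c_out
separable = M * M * K * K * c_in + M * M * c_in * c_out
print(f"ratio = {separable / standard:.4f}")    # ratio = 0.1189
```

The printed ratio equals 1/C_out + 1/K², matching the 1/8 to 1/9 reduction cited in the text for 3 × 3 kernels.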

Reusing the Feature Information of Shallow Layer
Usually, the shallow features of a neural network mainly contain the detail information of the image, while the deep features mainly carry semantic information. Deep semantic information easily loses the feature information of small targets. Moreover, as the number of network layers increases, the training accuracy tends to saturate, after which the network falls into degradation. When a large number of samples is used to train a deep neural network, the learning mechanism of the network, the chain rule, easily causes the gradient to approach zero, namely gradient vanishing. Assume that the output of layer i of the network is given by Formula (5):

a_i = f_i(a_{i−1})    (5)

where f_i denotes the mapping of layer i and a_{i−1} is the output of the previous layer. For a deep network with n layers, the final output is shown in Formula (6):

a_n = f_n(f_{n−1}(… f_1(x)))    (6)

Taking the derivative by the chain rule yields a product of the derivatives of the activation functions of all layers; if each factor is less than 1, then as the number of layers increases, the gradient update information decays, producing gradient vanishing.
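A toy calculation makes the decay implied by Formula (6) concrete. The per-layer derivative value of 0.9 below is purely an assumed illustrative number, not taken from any particular activation function:

```python
# By the chain rule, the gradient reaching the first layer is a product
# of n per-layer derivative factors. If each factor is below 1, the
# product shrinks geometrically with depth.
per_layer_derivative = 0.9  # assumed, illustrative value of |f'|

for n in (10, 50, 100):
    gradient_factor = per_layer_derivative ** n
    print(f"n = {n:3d}  gradient factor = {gradient_factor:.2e}")
```

Already at 100 layers the factor drops below 1e-4, so weight updates in the shallow layers become negligible.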
In order to enhance the feature information of small targets, this paper uses skip connection to form residual network structure [22], which not only effectively prevents network degradation, but also enhances the detail information of target features. By using identity mapping, the feature information of the shallow layer is directly input to the deeper convolution layer, which preserves more target details and helps to improve the detection accuracy of small targets. The structure of skip connection is shown in Figure 3.
Taking Figure 3 as an example, in the forward propagation of a neural network, a^[l] represents the output of layer l. In a general neural network, it must pass through layer l + 1 to reach layer l + 2. In a residual block, a^[l] not only passes through layer l + 1 but is also carried by the skip connection and added at layer l + 2, as shown in Formula (7):

a^[l+2] = g(z^[l+2] + a^[l])    (7)

where z^[l+2] is the linear output of layer l + 2 and g is the activation function. In this case, the derivative gains an additional identity mapping term. For example, for a residual node of the form f(x) + x, the derivative is shown in Formula (8):

d(f(x) + x)/dx = df/dx + 1    (8)

It can be seen that even if the original derivative df/dx approaches 0, the gradient can still propagate backward effectively, greatly reducing the impact of gradient vanishing. From the perspective of forward propagation, as the number of network layers increases, the image information contained in the feature maps becomes less and less. The skip connection introduces features of the lower layers, ensuring that the features of the higher layers contain more detail information of the targets. Figure 4 shows the network model of the improved method, with the emphasis on the improved parts of the feature extraction network. There are four skip connections and five downsampling layers. When the input SAR image first passes through the feature extraction network, the residual structure prevents gradient vanishing and reuses the shallow feature information of the target to enhance the features of small targets. The separable convolution operation greatly reduces the computation while maintaining the detection performance of the network. After these improvements, the experiments in the third part of this paper verify that, compared with the original YOLOv3, the proposed method improves the accuracy of SAR image small target detection.
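The "+1" term of Formula (8) can be checked numerically. The sketch below uses tanh as a stand-in for the residual branch (an assumption for illustration; the paper's residual branch is a convolutional stack) and evaluates the derivative at a saturated point where df/dx is nearly zero:

```python
import numpy as np

def f(x):
    # Stand-in for the residual branch; tanh saturates for large |x|,
    # so its derivative approaches zero there.
    return np.tanh(x)

def residual(x):
    # Residual node of Formulas (7)-(8): output = f(x) + x.
    return f(x) + x

# Central-difference derivatives at a saturated point (x = 5).
x0, h = 5.0, 1e-6
plain_grad = (f(x0 + h) - f(x0 - h)) / (2 * h)              # df/dx, nearly 0
res_grad = (residual(x0 + h) - residual(x0 - h)) / (2 * h)  # df/dx + 1
print(round(plain_grad, 4), round(res_grad, 4))             # 0.0002 1.0002
```

Even where the plain branch's gradient has all but vanished, the residual node still passes a gradient close to 1, which is exactly why the skip connection eases backpropagation through deep networks.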

Experimental Environment
The experiments in this paper run on the Ubuntu 16.04 operating system; the code runs on Python 3.6, and model training runs on a Titan Xp GPU (12 GB of video memory) with CUDA 10.0 and cuDNN 7.0.
SSDD [23] is a classic publicly available data set dedicated to SAR image ship target detection. It can be used for training and testing detection algorithms and has been used by more than 30 universities and research institutes. For each ship target, the detection algorithm predicts the bounding box of the target and gives its confidence. The number of iterations was 200, the learning rate was set to 0.001, and the momentum was set to 0.9. SGD (stochastic gradient descent) was used as the optimization algorithm, with a weight attenuation coefficient of 0.00004.
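As a minimal sketch of the reported optimization settings (learning rate 0.001, momentum 0.9, attenuation coefficient 0.00004), the momentum SGD update rule can be run on a toy one-parameter quadratic loss; the loss function and starting point below are assumed for illustration only and have nothing to do with the actual network:

```python
# Momentum SGD with the paper's hyperparameters, applied to the toy
# loss L(w) = w^2 for 200 iterations (as in the paper's schedule).
lr, momentum, weight_decay = 0.001, 0.9, 0.00004

w, v = 5.0, 0.0                        # parameter and velocity (assumed)
for _ in range(200):
    grad = 2 * w + weight_decay * w    # dL/dw plus the attenuation term
    v = momentum * v + grad            # accumulate velocity
    w = w - lr * v                     # parameter update
print(w)                               # w has moved toward the minimum 0
```

The momentum term accumulates past gradients, so the effective step size is roughly lr / (1 − momentum), which is why a small base learning rate of 0.001 still makes steady progress.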

Experimental Evaluation Criteria
In this paper, precision, recall, and F1 (a criterion that comprehensively measures precision and recall) are used for quantitative analysis of the detection results. Precision and recall are defined in Formulas (9) and (10):

Precision = TP / (TP + FP)    (9)

Recall = TP / (TP + FN)    (10)

where TP, FP, and FN denote the numbers of true positive, false positive, and false negative samples, respectively. Precision reflects false alarms in detection: the higher the precision, the fewer the false alarms. Recall reflects missed detections: the higher the recall, the fewer the missed targets. The definition of F1 is shown in Formula (11):

F1 = 2 × Precision × Recall / (Precision + Recall)    (11)

F1 is an indicator that comprehensively measures precision and recall; the higher it is, the better the detection effect.
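Formulas (9) through (11) can be computed directly from the detection counts. The counts in the example below are hypothetical, chosen only to show the arithmetic:

```python
def precision_recall_f1(tp, fp, fn):
    """Formulas (9)-(11): precision = TP/(TP+FP), recall = TP/(TP+FN),
    F1 = 2*P*R/(P+R)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

# Hypothetical detection counts, for illustration only.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=15)
print(f"precision = {p:.3f}, recall = {r:.3f}, F1 = {f1:.3f}")
# precision = 0.900, recall = 0.857, F1 = 0.878
```

Note that F1 is the harmonic mean of precision and recall, so it always lies between the two and is pulled toward the smaller value.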

Experimental Results and Analysis
The SSDD data set was used to train the original YOLOv3 network, and part of the test set was used to test and evaluate the trained model. The experiment is divided into six cases, as shown in Table 1. For a fair comparison, the training and test data sets used for the original YOLOv3 algorithm and the algorithm in this paper were kept consistent.

Detection Results in Complex Background
There are many ships in ports, docks, and inlets, so higher detection accuracy is required in these areas. The method in this paper achieves relatively high precision and recall for the detection of ship targets against complex backgrounds. Some detection results are shown in Figure 5.

Detection Results for Small Ship Targets
In real applications, there are often small ships on the sea surface, or large numbers of densely packed small ship targets. In these cases, the detection algorithm needs good sensitivity to small-scale targets so that it can detect them accurately. The method presented in this paper performs well in both precision and recall for small ship targets, as shown in Figure 6. As can be seen from Table 2, the proposed method outperforms the other two compared algorithms, including the method of Y. Song et al.: precision is 5.31% higher than that of the original algorithm, recall is 2.77% higher, and F1 is about 4.24% higher.

Conclusion
This paper proposes an improved YOLOv3 algorithm for SAR image target detection, which not only reduces algorithm complexity but also improves the precision and recall of SAR image target detection. Our key idea is to use the convolutional layers of the VGG16 network as the feature extraction network and to convert conventional convolution operations into separable convolutions.
We then introduce skip connections into the network. After these improvements to the feature extraction network, the SSDD data set was used to train the proposed network, and the test set was used to verify the trained model. The detection performance and experimental results obtained were better than those of the original YOLOv3 algorithm. In future work, further research will focus on the training strategy and on network structure optimization.