Deep Convolutional Feature Fusion Model for Multispectral Maritime Imagery Ship Recognition

By combining visible and infrared object information, multispectral data is a promising data source for automatic maritime ship recognition. In this paper, in order to take advantage of deep convolutional neural networks and multispectral data, we model the multispectral ship recognition task as a convolutional feature fusion problem and propose a feature fusion architecture called Hybrid Fusion. We fine-tune the VGG-16 model pre-trained on ImageNet with three-channel single-spectral images and four-channel multispectral images, and use existing regularization techniques to avoid over-fitting. Hybrid Fusion is investigated together with three other feature fusion architectures. Each fusion architecture consists of a visible image and an infrared image feature extraction branch, in which the pre-trained and fine-tuned VGG-16 models are taken as feature extractors. In each fusion architecture, image features of the two branches are first extracted from the same layer or from different layers of the VGG-16 model. Subsequently, the features extracted from the two branches are flattened and concatenated to produce a multispectral feature vector, which is finally fed into a classifier to perform the ship recognition task. Furthermore, based on these fusion architectures, we also evaluate the recognition performance of a feature vector normalization method and of three combinations of feature extractors. Experimental results on the visible and infrared ship (VAIS) dataset show that the best Hybrid Fusion achieves 89.6% mean per-class recognition accuracy on daytime paired images and 64.9% on nighttime infrared images, outperforming the state-of-the-art method by 1.4% and 3.9%, respectively.


Introduction
By integrating complementary information from visible (VIS) and infrared (IR) images, multispectral data has recently received much attention in machine learning and computer vision [1] [2] [3] [4] [5]. VIS images are sensitive to illumination variation and unfavourable weather conditions, which degrade the performance of computer vision systems built on them. A thermal camera can ameliorate this problem, but it cannot provide images with the same high resolution as a visible camera, and its image quality often decreases during daytime due to high background temperature. Therefore, multispectral images have been successfully applied to face recognition [6] [7] [8] [9], and in recent years, by exploiting deep learning, have also been widely applied to object recognition [10], person re-identification [11], pedestrian detection [12], and object tracking [13].
After the breakthrough research by Krizhevsky et al. [14], deep convolutional neural networks (CNN) have achieved remarkable success on a large variety of tasks and quickly became the dominant tool in computer vision. Meanwhile, several well-known deep CNN models have been reported, such as the Oxford VGG model [15], the Google Inception model [16] and the Microsoft ResNet model [17]. One factor in the dramatic improvement in performance of deep CNN is that challenging training datasets with millions of labeled examples have been harvested from the web, such as ImageNet [18]. However, a large-scale training set is expensive or difficult to collect in the real world, and training a large neural network on a small dataset leads to poor performance due to overfitting. The lack of large-scale training sets has forced the computer vision community to find practical workarounds. Much recent effort [19] [20] [21] has been dedicated to methods that fine-tune well-known pre-trained deep CNN models or directly take these models as feature extractors. Research in vision tasks based on multispectral data follows the same trend, e.g., action recognition [22], pedestrian detection [23], and object recognition [10]. In previous works on multispectral data, whether fine-tuning after feature fusion or directly extracting features without fine-tuning, the features of the VIS and IR images are produced at the same layer of the pre-trained deep CNN model. However, due to the aforementioned differences between VIS and IR images, features extracted from the same layer may not be the best for both, so such feature fusion cannot fully exploit multispectral data. Therefore, how to properly fuse the features of VIS and IR images in a pre-trained or fine-tuned deep CNN model to achieve the best performance remains an open problem.
In this paper, we focus on using the pre-trained or fine-tuned deep CNN model to perform multispectral ship recognition on the VAIS dataset [10]. Our main contributions are as follows. First, we model the multispectral ship recognition task as a convolutional feature fusion problem and propose a feature fusion architecture called Hybrid Fusion. Second, we investigate Hybrid Fusion together with three other feature fusion architectures, in which the pre-trained and fine-tuned VGG-16 models are taken as feature extractors. Third, we fine-tune the pre-trained VGG-16 model on both single-spectral and multispectral images, and exploit three existing regularization techniques to avoid over-fitting. Fourth, the best Hybrid Fusion achieves 89.6% mean per-class recognition accuracy on the daytime paired images of the VAIS dataset, outperforming the state-of-the-art method by 1.4%, and also achieves 64.9% on nighttime and 68.6% on all-time IR images.

Related Work
Object recognition with deep convolutional feature fusion. Initializing with transferred features, whether they come from the low-level, middle-level or high-level layers of a pre-trained deep CNN, can improve generalization performance even after substantial fine-tuning on a new task [24]. Schwarz et al. [25] presented a feature fusion model for multi-modal object recognition, in which a pre-trained AlexNet model [14] is exploited to extract features from the last two fully connected layers. An extension of this fusion model further improves object recognition accuracy by fine-tuning the pre-trained AlexNet with multi-modal training data [19]. Furthermore, Zia et al. [26] proposed a hybrid 2D/3D convolutional neural network initialized with the pre-trained VGG-16 model [15], and fused the features separately extracted from the fully connected layers of three network architectures. Another interesting work [27] presented an unsupervised feature learning framework, in which the pre-trained VGG-f model [28] is taken as a feature extractor and a recursive neural network [29] is then used to reduce the dimension of the extracted features and to learn high-level features.
The aforementioned methods focus on convolutional feature fusion for general multi-modal object recognition; convolutional features have also been fused for maritime ship recognition. The authors of [42] proposed a decision-level fusion of convolutional neural networks using a probabilistic model, in which features are extracted from the last convolutional activation map of the pre-trained VGG-19 model. Zhang et al. [43] presented a multi-feature structure fusion method based on spectral regression discriminant analysis (SF-SRDA), combining structural fusion with linear discriminant analysis, and used the pre-trained VGG-19 and ResNet-152 [17] models to achieve promising results. These works achieve good ship recognition performance. However, they do not consider the differences between the individual convolutional layers of the pre-trained or fine-tuned models when recognizing ships in different spectral images; our work considers these differences and proposes a Hybrid Fusion model based on them.

Proposed Feature Fusion Method
Intuitively, VIS and IR images provide complementary visual information to each other in depicting ship objects. Encouraged by the recent tremendous advances in deep learning techniques, and inspired by work on multispectral pedestrian detection [44], we explore the effectiveness of using the VGG-16 model pre-trained on the ImageNet dataset and fine-tuned on the VAIS dataset to perform multispectral ship recognition. The structure of our method is shown in Figure 1.
The proposed fusion framework mainly includes four stages. 1) Image preprocessing: as the pre-trained VGG-16 model expects 224×224 pixel, three-channel images as input, we simply clone the single IR channel three times. Meanwhile, both VIS and IR images are resized to 224×224 using nearest-neighbor interpolation.
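For concreteness, this preprocessing step might look as follows; this is a minimal sketch of our own (assuming OpenCV and NumPy), not the authors' released code, and the function names are hypothetical.

```python
import cv2
import numpy as np

def preprocess_vis(vis_bgr):
    """Resize a 3-channel VIS image to the 224x224 input expected by VGG-16."""
    return cv2.resize(vis_bgr, (224, 224), interpolation=cv2.INTER_NEAREST)

def preprocess_ir(ir_gray):
    """Clone the single IR channel three times, then resize to 224x224."""
    ir_3c = np.stack([ir_gray, ir_gray, ir_gray], axis=-1)
    return cv2.resize(ir_3c, (224, 224), interpolation=cv2.INTER_NEAREST)
```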
2) Feature extraction: there are two feature extraction branches, a visible image branch (denoted VGG-16-VIS) and an infrared image branch (denoted VGG-16-IR), as shown in Figure 1. Each branch takes the pre-trained or fine-tuned VGG-16 model as its feature extractor. The image features of the two branches are extracted from the same layer or from different layers of the VGG-16 model, according to the feature fusion architecture.
3) Feature fusion: the features extracted from the two branches are flattened into feature vectors and then concatenated to produce a multispectral feature vector representing the maritime ship. 4) Classification: before the fused feature vector is fed into a linear SVM classifier for the final prediction, it is normalized by the l2-norm (L2) normalization method. In Hybrid Fusion, the feature vectors should be normalized before fusion because of the large gap between feature values at different layers.
Additionally, the training samples of VIS and IR images in the VAIS dataset are used to fine-tune the pre-trained VGG-16 model end to end, separately for each spectrum. The two fine-tuned VGG-16 models are then also taken as feature extractors.

Feature Fusion Architecture
Because features at different levels of VGG-16 correspond to different levels of semantic information and fine visual details [45], feature fusion at different layers leads to different recognition results. Therefore, the multispectral ship recognition task is modelled as a convolutional feature fusion problem, i.e., determining which feature fusion architecture gives the best recognition performance. We propose a feature fusion architecture called Hybrid Fusion, which combines a high-level feature of the VIS image with a middle-level feature of the IR image, and we investigate Hybrid Fusion as well as Early Fusion, Halfway Fusion and Late Fusion.
These fusion architectures integrate the two branches' convolutional features at different layers of the VGG-16 model, as shown in Figure 2. Each branch processes a single spectral image.
Early Fusion combines the feature maps of the VIS and IR images immediately after the first or second convolutional layer (C1 and C2 layers), each followed by a Max Pool layer (this fusion architecture is omitted from Figure 2). Since the C1 and C2 layers capture low-level visual features such as color, corners and line segments, this architecture fuses features at the low level.
Halfway Fusion also implements feature fusion at convolutional layers. Different from Early Fusion, it fuses the features after the third, fourth or fifth convolutional layer (C3 - C5 layers), each followed by a Max Pool layer, as shown in Figure 2.
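As an illustration, features at different depths can be exposed with the Keras functional API, as in the sketch below. The mapping of C3 to 'block3_pool' and F6 to 'fc1' follows the layer names in keras.applications.VGG16 and is our assumption for illustration; in practice each branch would use its own (pre-trained or fine-tuned) copy of the model.

```python
from keras.applications.vgg16 import VGG16
from keras.models import Model

# Load VGG-16 pre-trained on ImageNet, including the fully connected layers.
base = VGG16(weights='imagenet', include_top=True)

# Hybrid Fusion (F6C3): a high-level feature (F6, 'fc1') for the VIS branch
# and a middle-level feature (C3, 'block3_pool') for the IR branch.
vis_extractor = Model(inputs=base.input, outputs=base.get_layer('fc1').output)
ir_extractor = Model(inputs=base.input, outputs=base.get_layer('block3_pool').output)
```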

Feature Fusion Method
After extracting the two-branch convolutional features from different levels of the VGG-16 model, the features are fused by concatenation. This fusion method concatenates the two input feature vectors along the feature dimension, so the dimensionality of the fused vector is the sum of the dimensionalities of the two inputs.
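The fusion itself reduces to a one-line operation; the following sketch (our illustration, with hypothetical function names) flattens each branch's feature map to a vector and joins the two vectors end to end.

```python
import numpy as np

def concat_fusion(feat_vis, feat_ir):
    """Flatten both branch features and concatenate them along the feature dimension."""
    return np.concatenate([feat_vis.ravel(), feat_ir.ravel()])
```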

Normalization and Classification
In order to evaluate the multispectral ship recognition performance of the four feature fusion architectures, the fused feature vector is normalized by the L2 normalization method and then fed into a linear SVM classifier for the final prediction.
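A minimal sketch of this stage is given below, using scikit-learn (whose SVC wraps LibSVM, consistent with the implementation details reported later); the exact calls are our assumption, with C = 10 and a linear kernel as reported in the implementation details.

```python
from sklearn.preprocessing import normalize
from sklearn.svm import SVC

def train_classifier(fused_train, y_train):
    """L2-normalize each fused feature vector, then fit a linear SVM."""
    X = normalize(fused_train, norm='l2')
    clf = SVC(C=10, kernel='linear')
    clf.fit(X, y_train)
    return clf
```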

Dataset
To investigate our four feature fusion architectures for ship category recognition, we use the publicly available VAIS dataset [10]. Table 1 shows the number of training and test samples for each class. Following the baseline method [10], the same training and test data are used, and the mean per-class recognition accuracy is taken as the evaluation measure in the experiments.
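For clarity, mean per-class recognition accuracy averages the recall of each class, so that small classes such as tug count equally with large ones. A short sketch of the computation:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def mean_per_class_accuracy(y_true, y_pred):
    """Average of per-class recall: diagonal of the confusion matrix over row sums."""
    cm = confusion_matrix(y_true, y_pred)
    return np.mean(np.diag(cm) / cm.sum(axis=1))
```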

Implementation Platform and Details
Our processing platform is a personal computer with Ubuntu 16.04, a single Intel Core i7-7700K CPU (4.20 GHz) and 16 GB of random access memory (RAM). An NVIDIA GTX1080Ti GPU is used for deep CNN computations. The computation environment is Keras with the TensorFlow backend, a high-level neural network application programming interface written in Python. Our experiment is divided into two stages: features are first extracted and stored, and then fed into a linear SVM classifier. We use the LibSVM toolbox [47], which is packaged as a module of scikit-learn, as the classifier; the penalty coefficient C is set to 10 and the kernel function is set to linear. Due to limited RAM, we did not perform feature fusion experiments at the first convolutional layer, but the experiments at the second convolutional layer can reflect the performance of Early Fusion. Besides the VIS and IR images, the four-channel (4C) VIS-IR images are also taken as inputs to fine-tune the model. In the fine-tuning experiments, the initial learning rate is set to 0.001 for VIS images and 0.0001 for IR images and 4C VIS-IR images. A stochastic gradient descent optimizer is used, with momentum 0.9 and decay 0.00001. Training runs for 50 epochs with a batch size of 32. Random horizontal and vertical flips are used for online data augmentation. Dropout is applied after the second fully connected layer with a rate of 0.5. L2 weight decay is applied on the last fully connected layer with a value of 0.1.
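The following is a hedged reconstruction of this fine-tuning setup, written against the 2018-era Keras API (whose SGD optimizer accepted lr and decay arguments); the exact model assembly is our assumption, not the authors' code.

```python
from keras.applications.vgg16 import VGG16
from keras.layers import Dense, Dropout, Flatten
from keras.models import Model
from keras.optimizers import SGD
from keras.preprocessing.image import ImageDataGenerator
from keras.regularizers import l2

num_classes = 6  # the six VAIS ship categories

base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
x = Flatten()(base.output)
x = Dense(4096, activation='relu')(x)            # FC1
x = Dense(4096, activation='relu')(x)            # FC2
x = Dropout(0.5)(x)                              # dropout after FC2, rate 0.5
out = Dense(num_classes, activation='softmax',
            kernel_regularizer=l2(0.1))(x)       # L2 weight decay on the last FC layer
model = Model(base.input, out)

# lr = 0.001 for VIS images (0.0001 for IR and 4C VIS-IR images).
model.compile(optimizer=SGD(lr=0.001, momentum=0.9, decay=1e-5),
              loss='categorical_crossentropy', metrics=['accuracy'])

# Online augmentation: random horizontal and vertical flips; 50 epochs, batch size 32.
datagen = ImageDataGenerator(horizontal_flip=True, vertical_flip=True)
# model.fit_generator(datagen.flow(x_train, y_train, batch_size=32), epochs=50)
```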

Evaluation of the Pre-Trained and Fine-Tuned Models
Firstly, we evaluate the effects of the existing regularization techniques when fine-tuning the VGG-16 model. Table 2 compares recognition performance with and without the regularization techniques; accuracy is reported as the average value together with the standard deviation over 10 groups of fine-tuning experiments. Figure 3 and Figure 4 give the accuracy and loss curves of fine-tuning the VGG-16 model on VIS and IR images, respectively, in one group of experiments.
As shown in Table 2, using data augmentation greatly improves the recognition accuracy on 4C VIS-IR images, and combining the three regularization techniques achieves the best results. However, compared to using data augmentation for VIS and IR images, fine-tuning with dropout or L2 weight decay gives a slightly higher average accuracy and a smaller standard deviation. Combining two or more regularization techniques does not significantly improve the performance of the fine-tuned model, and combining dropout with data augmentation even leads to model degradation when fine-tuning on VIS images. Furthermore, it can be observed from Figure 3 and Figure 4 that the over-fitting problem is worse on IR images than on VIS images. Over-fitting on VIS images is easy to overcome by using any of the three regularization techniques, as shown in Figure 3. However, data augmentation and dropout make the accuracy and loss fluctuate too much for IR images, and the model being fine-tuned is difficult to converge, as shown in Figure 4.

Figure 3. Columns 1-2: accuracy and loss curves of the model fine-tuned on VIS images without data augmentation; columns 3-4: with data augmentation.

Figure 4. Columns 1-2: accuracy and loss curves of the model fine-tuned on IR images without data augmentation; columns 3-4: with data augmentation.

Secondly, we analyze the feature representation ability of different layers of the pre-trained VGG-16 model for VIS and IR images. As the horizontal axis of Figure 5(a) shows, C2 is a low-level layer, C3 - C5 are middle-level layers, and F6 - F7 are high-level layers. VIS images obtain better feature representations at the high-level layers (see the black line with squares in Figure 5(a)), because the pre-trained VGG-16 model is trained on a large-scale dataset of VIS images. However, for ship recognition, IR images obtain richer features at the middle-level layers (see the black dotted line with squares in Figure 5(a)) than at the high-level layers. The main reason is that IR images differ from VIS images in having high contrast, low resolution and insufficient details. Meanwhile, we evaluate the effect of the two feature vector normalization methods on recognition accuracy; L2 normalization brings a clear improvement for IR image features (see Figure 5(a)). The main reason may be that IR images have more noise and are blurrier than VIS images, and L2 normalization eliminates the influence of these small values.
Thirdly, we evaluate the recognition performance of the fine-tuned models.
For convenience, the pre-trained VGG-16 model without fine-tuning is denoted NOFT, the model fine-tuned on VIS images is denoted FTVIS, and the model fine-tuned on IR images is denoted FTIR. Figure 5(b) and Figure 5(c) compare the fine-tuned and pre-trained models without and with normalization. Fine-tuning on VIS images does not obviously improve the performance of the layers of the VGG-16 model (see the red line with circles in Figure 5(b) and Figure 5(c)). Fine-tuning on IR images also does not obviously improve the performance of the C2 - C4 layers, which indicates that the low-level and middle-level layers of the pre-trained VGG-16 model have strong generalization ability. However, the recognition accuracies of the C5, F6 and F7 layers of the fine-tuned FTVIS and FTIR models are better than those of the NOFT model (see the blue dotted line with diamonds and the red dotted line with circles in Figure 5(b) and Figure 5(c)). Thus, NOFT is taken as the feature extractor for VIS images, while NOFT, FTIR and FTVIS are all considered as feature extractors for IR images. The resulting three combinations of feature extractors for VIS and IR images are investigated, as shown in Table 3.

Evaluation of Four Fusion Architectures
Firstly, we investigate the recognition performance of Early Fusion, Halfway Fusion and Late Fusion using the L2 normalization method along with the three extractor combinations. Because these three architectures extract and fuse features at the same layer, the feature vector is normalized after fusion, before being fed to the SVM classifier. Figure 6 shows the recognition accuracy of the three fusion architectures with L2 normalization. For an intuitive comparison, the accuracy of the corresponding single-spectral features is also shown.

Comparison with Other Reported Methods
We compare the proposed Hybrid Fusion with four methods on paired images: 1) the baseline method (CNN + Gnostic Fields) [10], 2) Multimodal CNN [41], 3) DyFusion [42], 4) SF-SRDA [43]; with three methods on the VIS images of the paired images: 5) MFL (feature-level) + ELM [38], 6) CNN + Gabor + MS-CLBP [36], 7) ME-CNN [40]; and with one method on all-time IR images: 8) ELM-CNN [31]. Table 5 shows the comparison results using the mean per-class recognition accuracy as the evaluation measure. As shown in Table 5, Hybrid Fusion (F6C3) is 2.2% higher than the baseline method and outperforms the state-of-the-art (DyFusion) by 1.4% in daytime, and it boosts the baseline method by 3.9% on nighttime IR images. In addition, normalized confusion matrices for Hybrid Fusion (F6C3) of combination 2 are shown in Figure 7. As shown in Figure 7(a), all categories except medium-other and tug are above 92% accuracy. Medium-other achieves only 64% because it is often confused with passenger and small ships.
Besides, tug achieves only 60% in Hybrid Fusion (F6C3) because it has fewer training samples than the other classes (see Table 1) and is also confused with passenger and small ships. Nighttime IR images provide only the contour and few details of a ship due to blur, low resolution and a wide pixel-value range, so it is difficult to classify ship categories from them. In the normalized confusion matrix on nighttime IR images shown in Figure 7(b), accuracy decreases for most categories. Notably, some ships are misclassified when using either the VIS or the IR image alone, but are correctly classified when using multispectral images.

Discussion
It is not easy to fine-tune the VGG-16 and ResNet [17] models on a small-scale dataset. Besides, researchers can leverage unsupervised feature learning methods, such as principal component analysis, to reduce feature dimension, and can also embed Network-in-Network (NIN) [51] structures when fine-tuning the well-known pre-trained deep CNN models. Furthermore, the baseline method and the state-of-the-art method [42] adopt decision-level fusion for ship recognition, extracting features from the last fully connected layer of the pre-trained VGG-16 model and from the last convolutional layer of the pre-trained VGG-19 model, respectively. Based on our experimental results, features extracted from the same layer of a pre-trained deep CNN model are not the best for both VIS and IR images. We believe our findings can be further investigated in decision-level fusion.
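For contrast with our feature-level fusion, the sketch below shows one simple decision-level fusion rule, averaging the two branches' class probabilities; this is a common baseline rule of our own choosing, not the probabilistic model of [42].

```python
import numpy as np

def average_decision_fusion(prob_vis, prob_ir):
    """Combine per-class probabilities from the VIS and IR branches by averaging."""
    return np.argmax((prob_vis + prob_ir) / 2.0, axis=-1)
```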

Conclusion
In this paper, we take advantage of the deep CNN model and multispectral data by modelling the multispectral ship recognition task as a convolutional feature fusion problem and proposing a feature fusion architecture called Hybrid Fusion. Experimental results on the VAIS dataset show that the best Hybrid Fusion achieves 89.6% mean per-class recognition accuracy on daytime paired images and 64.9% on nighttime IR images, outperforming the state-of-the-art method.

Funding Statement
This work is partly supported by the National Natural Science Foundation of China.

Data Availability Statement
The Excel data used to support the findings of this study are included within the supplementary information files.