New Fusion Approach of Spatial and Channel Attention for Semantic Segmentation of Very High Spatial Resolution Remote Sensing Images
1. Introduction
Semantic segmentation of remote sensing images is crucial for analyzing land cover and land use, especially for assessing anthropization effects in rural and urban environments and managing natural disasters [1] - [6] . Traditional methods, relying on grayscale or color analysis [7] [8] and texture or similarity features [9] , fall short in precise pixel-level classification, particularly for high spatial resolution images (≤4 m). The challenge in such images lies in representing land cover classes at the scale of objects, with varying spectral distributions between rural and urban areas [2] . Rural areas predominantly feature large natural objects, while urban areas exhibit high variability in man-made objects. Analyzing these images over large areas necessitates detailed spectral analysis alongside considering the spatial and semantic context for better discrimination [4] [5] .
CNNs are widely used for semantic segmentation, excelling in local information extraction [10] . However, for remote sensing, considering the overall context and long-range dependencies is crucial to avoid ambiguity [4] [11] [12] [13] . Recently, attention mechanisms have gained importance in computer vision, particularly for tasks like classification, detection, object localization, and segmentation [14] [15] . The optimal performance in classification and object detection is achieved by integrating classical CNNs with attention mechanisms [16] . Consequently, various attention mechanisms are combined in these architectures, typically operating at distinct spatial resolution levels [4] [17] [18] [19] .
This paper introduces SCGLU-Net, a semantic segmentation model for remote sensing images in complex urban and rural environments. Inspired by MACU-Net [20] , our hybrid architecture combines a CNN encoder with specific transformers as decoders. We use asymmetric convolution [20] to analyze local context and reduce computational complexity. SCGLU-Net differs by introducing propagate attention to enhance relevant descriptors from the encoder during multi-scale fusion. The SCGL block, influenced by [21] , introduces a unique attention layer that combines channel and spatial attention simultaneously, addressing the local and global semantic context. The SCGL block, shown in Figure 1(b), considers interactions between spatial and channel descriptors. In this SCGL block, spatial attention aggregates features into a regular grid of super-tokens to enable the use of self-attention [15] at high resolutions when estimating spatial attention [22] . The model uses a fine refinement head (FRH) to merge spatial and channel information at the original image resolution. Performance testing is conducted on the WHDLD and DLRSD datasets, which contain complex scenes in various environments [20] [23] . The main contributions of this study are:
· Introduction of propagate attention, an attention mechanism to prioritize relevant information and reduce artifacts from an encoder layer during integration into the multi-scale fusion proposal in the decoder.
· Introduction of the SCGL block, incorporating channel and spatial attention in a single block. Unlike conventional methods, this block allows simultaneous interaction capture between spatial and channel descriptors, overcoming the limitation on self-attention use at higher spatial resolutions due to quadratic complexity.
Figure 1. Illustration of attention block in (a) MACU-Net and (b) SCGLU-Net.
· To address the imbalance between target and non-target areas and mitigate classifier bias towards the background class, a combination of focal loss and Dice loss functions is employed.
The remainder of this paper is structured as follows. Section 2 reviews previous work to motivate the new approach. Section 3 describes in detail the architecture of the asymmetric convolution, propagate attention, SCGL, and FRH blocks that constitute the core of the proposed model. In Section 4, we present the results of our experiments and an analysis of the performances obtained compared with those of the most widely used methods in the literature. The paper ends with a conclusion followed by perspectives.
2. Related Work
Remote sensing image segmentation has progressed rapidly since the early 2000s, driven by the introduction of high and very high spatial resolution satellites such as IKONOS, QuickBird, and GeoEye. The fine spatial resolution of these images presented challenges for traditional pixel-based analysis [7] [9] [10] [24] [25] [26] , leading to the development of new classification algorithms. Those algorithms proved insufficient due to their inability to handle the internal variability of complex scenes [27] [28] .
Inspired by works such as those in [29] [30] , CNNs have become the standard for semantic segmentation in remote sensing due to their ability to extract spatial information. Two families of architectures have emerged: those based on pyramidal spatial pooling, such as PSPNet [31] and DeepLab [17] , and those based on the U-Net architecture [32] . U-Net employs an encoder-decoder structure with skip connections to concatenate information from the corresponding encoder layer and the layer below, allowing for multi-scale information capture and improving urban semantic segmentation [33] [34] [35] . Unlike U-Net family models, models like PSPNet and DeepLab use spatial pyramid pooling to aggregate multi-scale information from a fine-to-coarse level. Despite success on the PASCAL-VOC dataset [36] , these models require pre-trained encoders and face limitations with very high-resolution images due to their limited consideration of the global spatial context. Another limitation in the segmentation of fine remote sensing images is that these models account only for the local spatial semantic context. To address the problem of global semantic context and improve performance in the semantic segmentation of remote sensing images, hybrid CNNs have been proposed; in these architectures, CNN models are combined with attention mechanisms, particularly in the decoder. Thus, several authors proposed various attention mechanisms like additive attention, self-attention, atrous convolution, and spatial and channel attention modules to enhance urban semantic segmentation [13] [37] [38] [39] . More recently, MACU-Net [20] , featuring a densely connected CNN with CBAM-like channel attention [40] , outperforms pure CNNs by increasing the mIoU score by over 1.5%. However, these attentions are built around the convolution product and are therefore highly dependent on the local context.
Recently, transformers [15] have been adapted to computer vision, demonstrating excellence in classification tasks [16] and long-term dependency modeling [41] [42] . Two architectural trends have emerged for the semantic segmentation of very high spatial resolution images. Pure transformers, serving as both encoder and decoder in [43] [44] , suffer from increasing computational complexity. The second trend involves a Transformer-based encoder and CNN-based decoder [45] [46] . Despite addressing local spatial and global semantic contexts, these models face increased complexity due to the quadratic computational complexity of transformers in the encoder. In [12] the authors show that optimal performance in object classification and detection tasks was achieved by combining classical CNNs with transformers. An alternative approach employs a CNN-based encoder and transformer-based decoder [21] [38] [47] , featuring multi-scale feature fusion and a blend of attention mechanisms at different spatial resolution scales [4] [18] [21] [47] . Transformers and attention mechanisms are used separately in various processes and at the deepest spatial resolution levels [4] [18] [19] . However, in [21] , the authors highlight the significant performance boost achieved by considering interactions between spatial and channel features, which is overlooked in certain architectures. We propose a model that combines the benefits of pure transformers and hybrid architectures, featuring a CNN-based encoder and a transformer-based decoder within the MACU-Net framework. The model introduces a new attention mechanism to capture the energy of spatial features in constructing feature maps for each network layer. Additionally, the decoder integrates a mechanism to combine channel and spatial attention interaction at different spatial resolution levels, enhancing the model’s capacity to consider both local and global semantic contexts in scenes.
3. Methods
In this section, we provide an in-depth analysis of the key components of the architecture. We begin by highlighting the architectural differences from the MACU-Net model. The focus then shifts to a detailed examination of attention mechanisms, particularly those employed in the decoder. The section is organized into sub-sections covering a general presentation of the architecture (3.1), a review of propagate attention (3.2), an exploration of the Spatial-Channel-Global-Local (SCGL) block (3.3), a study of the FRH block (3.4), and an estimation of the loss function (3.5).
3.1. Structure of SCGLU-Net
The new model is inspired by the MACU-Net architecture [20] presented in Figure 2, a densely connected Convolutional Neural Network (CNN) with an encoder-decoder structure. In Figure 3 we describe the architecture of our new model. Like MACU-Net, the new model's encoder employs asymmetric convolution blocks (ACB) [20] to enhance representation power and capture local context with lower computational complexity [39] . The encoder extracts descriptor maps at various spatial resolutions from coarse to fine while increasing the channel dimensions. The principle of the ACB block is illustrated in Figure 4. In the new architecture, the transition between encoder layers involves k ACB blocks, followed by size reduction using max-pooling with a factor of 2; k = 2 for the transitions from layer 1 to 2 and from layer 2 to 3, and k = 3 for the transitions from layer 3 to 4 and from layer 4 to 5.
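As a minimal illustration of the ACB principle (a sketch with NumPy and naive loops; kernel names are illustrative, not from the paper), the snippet below checks that summing the outputs of a square 3 × 3 branch and two asymmetric 1 × 3 and 3 × 1 branches is equivalent, by linearity of convolution, to a single convolution with a fused 3 × 3 kernel. This is the property that lets an ACB enrich the representation during training without extra inference cost.

```python
import numpy as np

def conv2d(x, k):
    """Naive 'valid' 2D cross-correlation of a single-channel map x with kernel k."""
    kh, kw = k.shape
    H, W = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
k_sq = rng.standard_normal((3, 3))   # square 3x3 branch
k_h = rng.standard_normal((1, 3))    # horizontal 1x3 branch
k_v = rng.standard_normal((3, 1))    # vertical 3x1 branch

# Embed the asymmetric kernels in 3x3 frames aligned on the centre row/column.
k_h3 = np.zeros((3, 3)); k_h3[1, :] = k_h[0]
k_v3 = np.zeros((3, 3)); k_v3[:, 1] = k_v[:, 0]

branch_sum = conv2d(x, k_sq) + conv2d(x, k_h3) + conv2d(x, k_v3)
fused = conv2d(x, k_sq + k_h3 + k_v3)   # single fused 3x3 kernel
assert np.allclose(branch_sum, fused)
```

The same additivity argument holds per channel and per filter in a full convolutional layer, which is why the three branches can be folded into one kernel at inference time.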
The main difference between our model and MACU-Net is the decoder architecture. In MACU-Net, the decoder utilizes deconvolution and channel attention processes to reconstruct the original image’s segmentation mask whereas in SCGLU-Net a combination of different attention mechanisms is used to reconstruct the segmentation mask. To capture global interactions, the transition from the deepest encoder layer to the decoder involves Multi-Head Self-Attention (MSA) [15] followed by 2 ACB blocks. Inspired by previous work [18] showing the performance benefits of combining multiple attention mechanisms, the decoder utilizes the new Spatial-Channel-Global-Local block (SCGL), which combines spatial and channel attention simultaneously at local and global scales. This block allows interactions between spatial and channel descriptors to be taken into account. Two ACB blocks follow each SCGL block before transposed convolution. Local attention uses 3 × 3 or 5 × 5 kernel convolutions, while global context is built around self-attention mechanisms, with pixels clustered into a regular grid of super-pixels at each spatial resolution level [22] . The multi-scale information fusion process shown in Figure 5 introduces a novel attention mechanism called propagate attention, which enables the encoder to extract feature
Figure 5. Fusion information block in SCGLU-Net architecture.
maps with both local and global context information. Information fusion from the encoder and lower decoder layers occurs at different spatial scales, using weighted summation according to Equation (1):

X_l = Σ_i α_i E_i^l + Σ_j β_j D_j^l (1)

In this equation, X_l represents the fused features at the input of layer l, E_i^l represents the feature tensor coming from encoder layer i weighted by the propagate attention, D_j^l represents the features coming from decoder layer j, and α_i and β_j are real numbers whose sum equals 1.
The approach ensures that the information coming from each encoder feature map is weighted based on its importance; the weights are learned during the training phase. The final layer includes a fine refinement head (FRH) block to combine spatial and channel information at the original image resolution, capturing the semantic context from lower layers. The next sections outline the key blocks that form the core of the new model.
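The weighted fusion rule above can be sketched as follows (a NumPy illustration with hypothetical names, assuming the feature maps have already been resampled to a common shape and that the learned weights are normalized to sum to 1):

```python
import numpy as np

def fuse(encoder_feats, decoder_feats, alphas, betas):
    """Weighted summation of feature maps, in the spirit of Equation (1).
    encoder_feats / decoder_feats: lists of (H, W, D) arrays already resampled
    to a common shape; alphas / betas: learned weights summing to 1."""
    w = np.concatenate([alphas, betas])
    assert np.isclose(w.sum(), 1.0), "weights must sum to 1"
    feats = encoder_feats + decoder_feats
    return sum(wi * f for wi, f in zip(w, feats))

enc = [np.ones((4, 4, 2)), 2 * np.ones((4, 4, 2))]   # two encoder contributions
dec = [4 * np.ones((4, 4, 2))]                        # one decoder contribution
fused = fuse(enc, dec, alphas=np.array([0.25, 0.25]), betas=np.array([0.5]))
# 0.25*1 + 0.25*2 + 0.5*4 = 2.75 everywhere
```

In the actual network the weights are trainable parameters and the encoder terms are first modulated by propagate attention; this sketch only shows the convex-combination structure of the fusion.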
3.2. Propagate Attention
This attention is used in our model to fuse information coming from the encoder with information coming from the decoder. It aims to favor the most relevant spatial features at each spatial scale after downsampling, since in our model the input data of each decoder layer is a combination of features coming from the decoder layer below and from all the encoder layers above. Indeed, as indicated by the authors in [18] , spatial features at a higher spatial resolution scale have a greater impact during the information merging process. An approach that uses only convolution products followed by pooling to propagate features from coarse-to-fine spatial levels captures only one view of the spatial context and only guarantees the translational invariance of the network. Propagate attention fills this gap by taking into account, at each spatial scale, both the global context and the local context with similar computational complexity. It is inspired by the attention proposed by the authors in [18] to improve the capacities of residual blocks. Let E_i^l be the feature map at layer l coming from encoder layer i through the ACB block. It is a 4D tensor of dimension B × H × W × D, where B is the number of samples in the batch, H and W are the spatial dimensions of layer l, and D is the number of channels (the depth of the feature map). At layer l, the propagate attention is calculated according to Equation (2), Equation (3), and Equation (4):
(2)
(3)
(4)
The tensor Z was built from E_i^l by freezing the spatial dimensions H and W with a global average pooling product combined with a self-attention mechanism, as described in Equation (2) and Equation (3). In Equation (3), we use a 1D convolution with a kernel size of 5 to infer the Query (Q) and Key (K) tensors; V is the input tensor reshaped to match, and the resulting tensor is subsequently reshaped back to its original layout. In Equation (4), ⊙ denotes element-wise matrix multiplication and v is the unbiased variance of the tensor, calculated along the spatial dimensions H and W; the centered version of the tensor (around its mean) is also used.
Figure 6 illustrates the flowchart of propagate attention.
3.3. Spatial-Channel-Global-Local Block (SCGL)
The Spatial-Channel-Global-Local (SCGL) block comprises a channel attention block followed by a spatial attention block, enabling consideration of spatial scale and channel dimension changes in the input feature maps for multi-scale information fusion. Inspired by the SACM in [48] , the channel attention introduces a new branch to estimate interactions between channels, improving upon SACM [48] , which neglects interactions between spatial and channel features. The spatial attention is split into two branches: one capturing local spatial interactions and preserving details, the other capturing long-range dependencies and the global semantic context for scene interpretation. Figure 7 illustrates the new channel attention flowchart, while Figure 8 depicts the flowchart of the spatial attention.
3.3.1. Channel Attention in SCGL
In the SCGL block, channel attention is the summation of the local channel attention and the global channel attention, according to Equation (5) and Equation (6):
(5)
(6)
The local channel attention term is the result of the local transformation branch and is given by Equation (7):
(7)
The global channel attention term is the result of the global branch and is formulated with Equation (8) and Equation (9):
Figure 6. Propagate attention flowchart.
Figure 7. Channel attention in SCGL Block.
(8)
(9)
In Equation (7), ⊙ denotes element-wise matrix multiplication, ⊗ denotes matrix multiplication, and the transformation T_P(X), with P = W or H, is formulated according to Equation (10):
(10)
In this equation, T_P is the transformation operator for the X tensor in which dimension P is permuted with the channel dimension D, and f is a 1 × 1 2D-convolution product with one filter followed by a batch normalization to obtain a vector of dimension P.
In Equation (8), global average pooling is used to freeze the spatial dimensions before self-attention estimates the long-range dependencies in Equation (9). In Equation (9), the Query (Q) and Key (K) tensors are estimated with Equation (11) and Equation (12).
(11)
(12)
In Equation (11) and Equation (12), a 1D-convolution with a kernel size of 5 is used, followed by a sigmoid function to keep the obtained values in the interval [0; 1]. V is the tensor Z rescaled to the appropriate dimension, and ⊗ represents the classical matrix product.
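Since the full Q/K formulation is given only by the equations above, the sketch below is a simplified, ECA-style stand-in for the global channel attention pipeline: global average pooling freezes the spatial dimensions, a 1D convolution of kernel size 5 runs across the channel axis, and a sigmoid keeps the gates in [0, 1] before reweighting the channels. All function names are illustrative, not the paper's.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def global_channel_gate(x, kernel):
    """Simplified global channel attention: GAP over H, W, then a 1D
    convolution across channels (kernel size 5) and a sigmoid gate."""
    z = x.mean(axis=(0, 1))            # (D,) spatial dims frozen by GAP
    pad = len(kernel) // 2
    zp = np.pad(z, pad)                # same-length 1D convolution
    gate = sigmoid(np.array([np.dot(zp[i:i + len(kernel)], kernel)
                             for i in range(len(z))]))
    return x * gate                    # reweight each channel in [0, 1]

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 8, 16))    # (H, W, D) feature map
y = global_channel_gate(x, rng.standard_normal(5))
assert y.shape == x.shape
```

The paper's version additionally derives Query and Key tensors from the pooled vector and combines them with a matrix product; this sketch keeps only the GAP → 1D-conv → sigmoid backbone shared by both.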
3.3.2. Spatial Attention in SCGL
Spatial attention in SCGL is built around two branches that work in parallel. One branch builds spatial attention with the local context and is based on the convolution product; the other captures the global context and is based on a multi-head self-attention mechanism. Spatial attention is formulated according to Equation (13):
(13)
In this equation, X refers to the output tensor of the channel attention in SCGL. The principle of spatial attention is illustrated in Figure 8.
Local attention is inspired by the work of [49] , who showed that, for semantic segmentation, atrous convolution is better suited to dense prediction without loss of resolution than the classical convolution product, which is more suitable for classification tasks. The local spatial attention was estimated according to Equation (14) and Equation (15):
(14)
(15)
with
(16)
(17)
In Equation (16), a 1 × 1 2D-convolution product is used to obtain the weight associated with each spatial feature.
In Equation (17), the transformation relies on three 2D-convolution products with 3 × 3 kernel size and D/4 filters, with dilation rates of 1, 3, and 5. Moreover, to reduce the computational complexity, and as proposed in [50] , the 3 × 3 convolution with dilation 1 is implemented as a combination of asymmetric 2D-convolutions with 3 × 1 and 1 × 3 kernel sizes.
Global spatial attention was estimated according to Equation (18):
(18)
In this equation, S denotes the super-token features tensor and Q is the mapping matrix of the features tensor X into the super-token features tensor S. H and W are the spatial dimensions of X, D is the channel dimension, and m is the number of super-tokens. MSA represents the multi-head self-attention mechanism introduced in [15] ; as its authors indicated, it is the best approach to capture long-range dependencies and take the global context into account [4] [18] . To limit the effects of computational complexity, we base our multi-head self-attention on an adaptation of super-pixel clustering as proposed in [51] [22] .
The MSA term is estimated according to Equation (19):
(19)
with A(S) the attention map including the relative position embeddings [52] . The Query, Key, and Value vectors are the results of linear transformations by weight matrices obtained with a 1 × 1 2D-convolution. The vector R represents the embedded relative position vector, introduced to improve self-attention performance and guarantee its permutation equivariance; its formulation is identical to that indicated in [53] . In Equation (19), to reduce the complexity, for each token only its 3 × 3 surrounding super-tokens are used to compute the attention. In practice, we use the PyTorch Unfold and Fold functions to extract and combine the corresponding 3 × 3 super-tokens, respectively. The relative position embeddings tensor R is constructed for each element by determining the relative distance of its position to each position in the 3 × 3 super-token neighborhood around it. Each element thus receives two distances, a row offset and a column offset, as shown in Figure 9. The row and column offsets are associated with a row embedding and a column embedding, respectively; the two embeddings are concatenated to form R.
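The row/column offset construction can be made concrete with a small sketch (pure Python, illustrative names): each neighbour (a, b) of a centre position (i, j) is encoded by the pair of relative distances (a − i, b − j), which then index the learned row and column embedding tables.

```python
# Row/column offsets of the 3x3 neighbourhood around a position (i, j):
# each neighbour (a, b) is encoded by the offset pair (a - i, b - j).
def relative_offsets(i, j, neighbours):
    return [(a - i, b - j) for (a, b) in neighbours]

centre = (5, 5)
nb = [(r, c) for r in range(4, 7) for c in range(4, 7)]  # 3x3 neighbourhood
offsets = relative_offsets(*centre, nb)
# nine pairs, every offset component in {-1, 0, 1}
```

Because only the offsets are embedded, the same nine embedding pairs are shared by every position in the map, which keeps the relative-position table small and the attention translation-aware.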
The tensor S and the matrix Q are constructed iteratively:
1) Linear normalization of the input tensor X with a 1 × 1 2D-convolution with D filters according to Equation (20):
(20)
2) Creation of the initial super-token tensor by calculating the local average of the tokens on a regular sliding window of size h × w, such that m = (H/h) × (W/w). The token tensor is the features tensor reshaped from its B × H × W × D form into a sequence of N = H × W tokens of depth D.
3) At each iteration t, the matrix Q is estimated according to Equation (21):
(21)
where D is the number of channels. The set of super-tokens is then updated by the weighted sum of tokens according to Equation (22):
(22)
Global spatial attention is obtained by resizing the updated super-token tensor back into a tensor with the spatial dimensions of X. For clarity, we have presented the results for a single attention head. In practice, multiple attention heads are used by partitioning the features map depthwise into N groups to learn multiple distinct representations of the input tensor; the final result is the concatenation of the results obtained for each attention head. For our model, after our experiments we retained the distribution presented in Table 1 below, which shows, for each decoder layer, the number of attention heads, the depth of the input
Figure 9. The principle of relative distances. The row offset is in blue color, the column offset is in red color.
Table 1. Parameters for building spatial global attention.
tensor, the spatial dimension of the tensor, as well as the spatial dimension of the super-tokens and number of iterations to estimate them.
The table above summarizes the parameters for constructing the global spatial attention of the SCGL block from the deepest layer (layer 5) to the uppermost layer (layer 2). The output layer, based on the fine refinement head described in the next section, is not listed. Moving from lower to higher layers, the number of heads decreases as the channel count is halved, while the super-token size doubles and is limited to 16 × 16 [16] for global relation extraction. Three iterations were used to estimate the super-tokens, as increasing this number did not significantly improve results in our experiments.
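The iterative super-token construction of steps 2) and 3) can be sketched as follows (a NumPy toy that ignores the 3 × 3 neighbourhood restriction, the learned projections, and the multi-head split; names are illustrative): initial super-tokens are grid averages over h × w cells, then soft assignments and weighted updates alternate for a few iterations.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def super_tokens(x, h, w, iters=3):
    """Toy super-token clustering: x is (H, W, D); super-tokens start as
    h x w grid averages, then are refined by soft token assignments."""
    H, W, D = x.shape
    tokens = x.reshape(H * W, D)                          # (N, D) token matrix
    s = x.reshape(H // h, h, W // w, w, D).mean(axis=(1, 3)).reshape(-1, D)
    for _ in range(iters):
        q = softmax(tokens @ s.T / np.sqrt(D))            # (N, m) assignment map
        s = (q / q.sum(axis=0, keepdims=True)).T @ tokens  # weighted update
    return q, s

rng = np.random.default_rng(2)
q, s = super_tokens(rng.standard_normal((8, 8, 4)), h=4, w=4)
assert s.shape == (4, 4)              # m = (8/4) * (8/4) = 4 super-tokens
```

Self-attention then operates on the m super-tokens instead of the N = H × W raw tokens, which is what makes the global branch affordable at high spatial resolutions.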
3.4. Fine Refinement Head Block (FRH)
This block, inspired by [4] , merges rich semantic data from lower network layers with spatial descriptors from the original image. It comprises two branches: one focuses on channel interactions, using the channel attention of the Convolutional Block Attention Module (CBAM) [40] , whose attention map is generated through a weight-shared network; the other addresses spatial interactions through depth-wise convolution. The two attentions are combined by summation, processed by two asymmetric convolution blocks (ACB), and a 1 × 1 2D-convolution produces the segmentation mask. Unlike the original module, this approach avoids over-sampling and linear interpolation, reducing errors. Figure 10 illustrates a visual representation of the fine refinement head (FRH) block.
3.5. Loss Function
To address the challenge of vanishing gradients in deep networks, particularly in semantic segmentation of remote sensing images with unbalanced classes, a robust loss function is crucial for optimal convergence during training. To mitigate the impact of class imbalance, the focal loss introduced by Lin et al. in 2017 [54] is employed, defined by Equation (23):
L_Focal = −α (1 − p_t)^γ log(p_t) (23)

with γ = 2 and α = 0.25, where p_t is the probability of the pixel belonging to the object class, estimated from a softmax function at the network output. This modified cross-entropy penalizes over-represented classes, reducing their impact on
loss estimation bias. Notably, for γ = 0, the focal loss reduces to an α-weighted cross-entropy. Additionally, to ensure accurate localization of the various object categories and to consider interactions between classes, the Dice loss [55] is used to minimize the information loss between the reconstructed and original masks. This loss function is formalized by Equation (24):
L_Dice = 1 − (2 Σ ŷ·y) / (Σ ŷ + Σ y) (24)

where ŷ is the mask tensor predicted by the network and y is the ground-truth mask tensor. The final loss function is the sum of the focal loss and the Dice loss, as defined in Equation (25):
L = L_Focal + L_Dice (25)
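The combined loss can be sketched in NumPy as follows (a minimal binary-mask version with illustrative names; the actual training loss is computed per class on softmax outputs):

```python
import numpy as np

def focal_loss(p_t, gamma=2.0, alpha=0.25):
    """Focal loss per pixel; p_t is the predicted probability of the true class."""
    return -alpha * (1.0 - p_t) ** gamma * np.log(p_t)

def dice_loss(y_pred, y_true, eps=1e-7):
    """Soft Dice loss over a binary mask."""
    inter = (y_pred * y_true).sum()
    return 1.0 - (2.0 * inter + eps) / (y_pred.sum() + y_true.sum() + eps)

y_true = np.array([[1, 1], [0, 0]], dtype=float)
y_pred = np.array([[0.9, 0.8], [0.1, 0.2]], dtype=float)
p_t = np.where(y_true == 1, y_pred, 1.0 - y_pred)   # prob of the true class
total = focal_loss(p_t).mean() + dice_loss(y_pred, y_true)

# Sanity check: with gamma = 0 and alpha = 1 the focal term is plain cross-entropy.
assert focal_loss(np.array(0.99), gamma=0.0, alpha=1.0) == -np.log(0.99)
```

The focal term down-weights easy, well-classified pixels (large p_t), while the Dice term is a set-overlap measure that is insensitive to how many background pixels dominate the image, which is why the two complement each other on unbalanced masks.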
4. Experiments and Results
To assess our model's effectiveness in semantic segmentation of high-resolution remote sensing images, we tested it on two datasets with diverse urban and rural complex scenes and compared its performance against state-of-the-art algorithms from the scientific literature. Two sets of experiments were conducted: the first focused on metrics such as mIoU, Precision, Recall, and mean Pixel Accuracy (mPA) for result comparison; the second evaluated the model's computational efficiency in terms of complexity (GFLOPs), required memory (MB), number of parameters (M), and inference speed (FPS). The following sections detail the datasets, the experiments, and the analysis of the obtained results.
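For reference, the mIoU metric used throughout the comparisons can be computed from a confusion matrix as below (a standard sketch, not the paper's code; the toy labels are illustrative):

```python
import numpy as np

def miou(pred, gt, n_classes):
    """Mean intersection-over-union computed from a confusion matrix."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for p, g in zip(pred.ravel(), gt.ravel()):
        cm[g, p] += 1
    inter = np.diag(cm)                                  # per-class intersection
    union = cm.sum(axis=0) + cm.sum(axis=1) - inter      # per-class union
    ious = inter / np.maximum(union, 1)
    return ious.mean()

gt = np.array([0, 0, 1, 1])
pred = np.array([0, 1, 1, 1])
# class 0: inter 1, union 2 -> 0.5 ; class 1: inter 2, union 3 -> 2/3
score = miou(pred, gt, 2)
```

mPA is computed analogously from the same confusion matrix, averaging the per-class ratio of correctly classified pixels (the diagonal over the row sums).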
4.1. Datasets
The first dataset, WHDLD, is a public dataset provided by Wuhan University [20] [23] [56] . It is composed of 4940 RGB color images of 256 × 256 pixels acquired by the Gaofen-1 and ZY-3 satellite sensors over the urban area of Wuhan, with a spatial resolution of 2 m. The segmentation masks represent 6 classes of objects, namely bare soil, buildings, pavement, roads, vegetation, and water. For our experiments, the data was randomly partitioned into training, validation, and test subsets according to the ratio 0.7:0.1:0.2. Figure 11 shows images and labels from the WHDLD dataset.
As for the DLRSD dataset, it contains 2100 RGB color images of 256 × 256 pixels [23] [56] . Its segmentation masks represent 17 classes of objects encountered in both rural and urban areas, including airplanes, bare soil, buildings, cars, chaparral, docks, mobile homes, pavement, sand, sea, ships, tanks, trees, and water. The images used to build this dataset come from the UC Merced Land Use dataset proposed by [57] , which includes 2100 images divided into 21 land use classes of 100 images each, with a spatial resolution of 0.3 m. For our experiments, the data was randomly separated into training, validation, and test subsets according to the ratio 0.7:0.1:0.2. Figure 12 shows images and labels from the DLRSD dataset.
These two datasets contain a large number of objects at different scales within the same image: cars and trees occupy fewer than 20 × 20 pixels, while buildings, lakes, roads, etc. span more than 200 × 200 pixels, with chaotic distributions and fuzzy borders. This makes it difficult to classify pixels at the boundaries between neighboring objects.
4.2. Experimental Hypotheses
To study the performance of our algorithm, the test environment included the Pop!_OS 22.04 operating system, CUDA 12, PyTorch 1.13, and Python 3.10. During the training phase, the input image size of the different models was fixed at 256 × 256 pixels; the optimizer was Adam, introduced by [58] , and the learning rate was 0.0003 for WHDLD and 0.0001 for DLRSD, with a cosine annealing decay strategy [59] . All experiments were run on an NVIDIA GeForce RTX 3070 Max-Q GPU with 8 GB of VRAM. The datasets were randomly separated into 3 subsets: 70% of the data for training, 10% for validation, and 20% for testing. The loss function to be minimized is the summation of the Dice loss proposed by [60] and the focal loss [54] , as shown in the previous section, in order to mitigate the impact of unbalanced data. The efficiency of our model has been compared with the algorithms which, to our knowledge, are among the most efficient for semantic segmentation of satellite images, through metrics such as mean Intersection over Union (mIoU), mean Pixel Accuracy (mPA), Precision (P), and Recall (R) by class [18] [61] . The quantitative evaluation was done by comparing our model to models used for semantic segmentation of satellite images on the WHDLD or DLRSD datasets. Among these models, we have:
1) CNN models for semantic segmentation: U-Net [32] , U-Net3+ [35] , and MultilabelRSIR [56] .
2) Models using pyramidal spatial pooling: DeepLabv3+ [17] , PSPNet [31] , DPPNet [62] , and the Segment Anything Model (SAM) in Ref. [63] .
3) CNN-based attention networks: MACU-Net introduced in [20] , MAU-Net in [18] , the multi-scale network with HL module provided by [19] , AttU-Net (U-Net with additive attention) [37] , and CAU-Net [64] .
4) Fully transformer-based networks with a transformer-based decoder: SegFormer introduced by [44] , the HrViT multi-scale vision transformer [65] , the TMNet multi-branch transformer [66] , and Fursformer [67] .
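The cosine annealing decay strategy [59] used for the learning rate in the setup above can be sketched as follows (a single-cycle sketch with illustrative names; PyTorch's CosineAnnealingLR implements the same curve):

```python
import math

def cosine_annealing(step, total_steps, lr_max, lr_min=0.0):
    """Cosine annealing of the learning rate over one cycle, as in SGDR [59]."""
    return lr_min + 0.5 * (lr_max - lr_min) * (
        1 + math.cos(math.pi * step / total_steps))

# Decay from the WHDLD base rate 3e-4 down to 0 over 100 steps.
lrs = [cosine_annealing(s, 100, 3e-4) for s in range(101)]
assert math.isclose(lrs[0], 3e-4)
assert all(a >= b for a, b in zip(lrs, lrs[1:]))   # monotonically decaying
```

The schedule keeps the rate near its maximum early on and flattens near zero at the end, which tends to stabilize the final epochs of training.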
4.3. Results and Analysis
The mIoU, Precision, and Recall metrics by object class, as well as their mean values, are summarized for the WHDLD dataset in Table 2 and for DLRSD in Table 3. Table 4 presents the global results. These results show the high capacity of our model to correctly locate the objects present in the scene. The results on mean pixel accuracy
Table 2. mIoU, Precision (P), Recall (R) and mPA in (%) results by object class for WHDLD dataset.
Table 3. mIoU, Precision (P), Recall (R) and mPA results by object class for DLRSD dataset.
Table 4. Global statistics for WHDLD and DLRSD datasets.
(mPA) by class, around 76.43% for WHDLD and 79.56% for DLRSD, show that although there are misclassifications, in general the pixels are mainly represented within the object when it is correctly located.
4.3.1. Comparison Results for WHDLD Dataset
Figure 13 shows the visual segmentation results of our model compared to MACU-Net. The use of several attentions and the choice of a loss function that takes into account the less represented pixel classes improve the segmentation results. Large homogeneous areas are relatively well identified by the two models, even if, as in row 1, our model better discerns the edges and shapes of the buildings, which is not the case with MACU-Net. Moreover, the original model misclassifies two objects of quite similar classes by confusing the pavement with the road. Our model is
Figure 13. WHDLD test visualization results of MACU-Net and SCGLU-Net.
more sensitive to objects that are very poorly represented in an image: as we can see in row 2, the original model does not identify the buildings that are less represented in the image than other objects, while our model manages to detect their presence. In row 3, the original model fails to discriminate fine objects contained in large objects, such as the presence of water within bare soil, which our model detects. For the DLRSD dataset, Figure 14 illustrates the segmentation results of our model compared to MACU-Net. The previous observations are confirmed: in row 1, the original model classifies the objects present in the scene very poorly, while in row 2, the vehicles are not identified because they are relatively small compared to the mobile-home and grass areas. In row 3, the original model is not able to sufficiently discriminate between two close classes such as bare soil and pavement. Unlike the original model, our model exhibits relatively better performance in each of these situations. To measure the efficiency of our algorithm in the experiments conducted on the WHDLD and DLRSD datasets, we compared the results obtained with those given by state-of-the-art approaches. The following tables present the mIoU, Precision, Recall, and mPA
Figure 14. DLRSD test visualization results of MACU-Net and SCGLU-Net.
metrics for WHDLD in Table 5 and DLRSD in Table 6.
As shown in Table 5, our approach performs better than all the other models in terms of mIoU, with a gain of +1.54% over the best model, AttU-Net, and +4.37% over U-Net, which gives the weakest results. The methods combining several attention mechanisms and transformers give the best mIoU results, followed by those with spatial pyramid pooling and pure CNNs. Regarding Precision (P), the gains are 7.28% over multilabelRSIR, which has the worst performance at 69.12%, and 1.80% over U-Net3+ at 74.60%. Only CAU-Net gives better precision than our approach, and the two are relatively close: 76.40% for our approach versus 76.57% for CAU-Net. We can argue that considering the local semantic context and long-range dependencies simultaneously increases the model's capacity to detect object classes in the image, unlike an approach that considers only the local context. In terms of Recall (R), our approach outperforms all the models, with gains of 3.72% over PSPNet, which has the worst performance, and 1.22% over AttU-Net. This result shows that our model is among the best at correctly identifying objects and assigning them to the correct class
Table 5. Performances on WHDLD dataset. The best values are in bold.
Table 6. Performances on DLRSD dataset.
despite the complexity of the scene. In terms of mPA, our approach outperforms all the models, with gains of 6.59% over MAU-Net, which has the worst performance, and 1.75% over Rmg + HL. These results show the ability of our model to correctly classify each object in the WHDLD dataset: our model assigns class labels to pixels better than any other algorithm, and its object localization also remains better than that of the original model. When we compare our model with those using attention mechanisms, we see that transformers increase mIoU by between 1.04% for HrViT and 2% for SegFormer over MACU-Net, which uses channel attention, whereas our model increases mIoU by 3.73%. Comparing mIoU between our model and those with channel or spatial attention alone, we observe that such models increase mIoU by between 0.96% for MAU-Net and 2% for AttU-Net over MACU-Net. Consequently, combining attention mechanisms and transformers greatly improves the ability of networks to identify objects in datasets such as WHDLD, which contains many large-area objects.
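The metrics discussed above can all be derived from a per-pixel confusion matrix. As a minimal illustrative sketch (not the paper's evaluation code), mIoU and mPA might be computed as follows:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Per-pixel confusion matrix: rows = ground truth, cols = prediction."""
    idx = n_classes * y_true.ravel() + y_pred.ravel()
    return np.bincount(idx, minlength=n_classes ** 2).reshape(n_classes, n_classes)

def miou_and_mpa(cm):
    """Mean IoU and mean pixel accuracy from a confusion matrix."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp           # predicted as class c but actually another class
    fn = cm.sum(axis=1) - tp           # class c predicted as something else
    iou = tp / np.maximum(tp + fp + fn, 1)
    pa = tp / np.maximum(tp + fn, 1)   # per-class pixel accuracy
    return iou.mean(), pa.mean()

# Toy 2x2 "image" with 2 classes: one pixel of class 0 is mislabeled as class 1
y_true = np.array([[0, 0], [1, 1]])
y_pred = np.array([[0, 1], [1, 1]])
cm = confusion_matrix(y_true, y_pred, n_classes=2)
miou, mpa = miou_and_mpa(cm)
```

Here mIoU averages IoU over classes, so rare classes weigh as much as large homogeneous areas, which is why it is sensitive to the under-represented objects discussed above.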
4.3.2. Comparison Results for DLRSD Dataset
For the DLRSD dataset, the results also show that our algorithm outperforms all the others in every metric. Compared to the best competing models, SCGLU-Net increases mIoU by 3.32% over Fursformer, Precision by 0.88% over MACU-Net, Recall by 5.62% over CAU-Net, and mPA by 0.70% over Rmg + HL. Unlike models such as MAU-Net, DeepLabV3+, U-Net, U-Net3+, PSPNet, AttU-Net, and SegFormer, whose performance deteriorates because of the large number of object categories and their large-scale variability, our model, like CAU-Net and Rmg + HL, improves in all metrics. This is because models like MAU-Net or MACU-Net do not take abrupt changes between object scales into account. By modeling the interactions between spatial and channel features at different spatial and channel resolutions, our model remains sensitive to these changes, unlike Rmg + HL, where such interactions are defined only in the lowest layers. PSPNet and U-Net obtain the worst performance in all metrics, while U-Net3+ and DeepLabV3+, which combine multi-scale fusion with standard convolution and atrous convolution respectively, show a notable improvement, with respective gains of 5.09% and 3.19% in mIoU, 8.29% and 3.99% in Precision, 0.9% in Recall for U-Net3+ (with similar results for DeepLabV3+), and 5.12% and 3.85% in mPA. Multi-scale fusion alone is therefore not sufficient to discriminate the objects in datasets like DLRSD. Compared with models using multiple attention mechanisms, and more particularly with our model, the performance of the pure-CNN and spatial-pyramid-pooling models is much lower: against the best models in these families, our algorithm gains 6.12% in mIoU and 4.52% in Precision over U-Net3+, as well as 7.17% in Recall and 3.37% in mPA.
In all metrics, the attention- and transformer-based models give the best performance among all model families. The results show that attention helps models improve their ability to detect object classes and their locations in the images. Compared to the other models with attention and transformers, the gain of our approach ranges from 2.5% for Rmg + HL to 7.12% for AttU-Net, owing to its combination of different kinds of attention and transformers.
A study of the results on these two datasets shows that introducing multi-scale information fusion and attention mechanisms greatly increases the capacity of CNN networks to segment very high spatial resolution images. However, the combined use of several types of attention, although it improves performance, is not sufficient for images in which objects of widely varying sizes interact, as in DLRSD. In this case, taking the interactions between spatial and channel features into account markedly improves the results and the sensitivity of the network.
4.3.3. Comparison of Network Efficiency
We compared our SCGLU-Net with efficient segmentation networks on the WHDLD test set in terms of mIoU, GPU memory footprint, number of parameters, and complexity (FLOPs). The comparison results are listed in Table 7. When comparing the number of parameters and the complexity of each method, our approach performs moderately well in both aspects, indicating that SCGLU-Net does not simply pile up computational effort to obtain high accuracy. In terms of complexity relative to attention models, our approach
Table 7. Quantitative comparison results on the WHDLD test set with state-of-the-art models. The complexity and number of parameters are measured for a 256 × 256 input on a single NVIDIA GTX 3070 GPU. The best values are in bold.
with 66.62 FLOPs lies between pure transformers such as SegFormer, HrViT, and TMNet and hybrid CNNs with channel or spatial attention such as MAU-Net, CAU-Net, and Rmg + HL. In terms of the number of parameters, despite its complexity, our model needs fewer parameters than modern transformers like SegFormer and HrViT, with a better mIoU by 3.45% compared to SegFormer and 3.71% compared to HrViT. Among hybrid CNNs with attention mechanisms, except for MACU-Net and MAU-Net, SCGLU-Net needs fewer parameters than AttU-Net, CAU-Net, and Rmg + HL. This is because SCGLU-Net combines transformers and attention mechanisms and benefits from the advantages of both. Compared to pure CNNs and spatial-pyramid-pooling models, our model uses 32.8 M fewer parameters than DPPNet, 43.85 M fewer than PSPNet, 30.88 M fewer than DeepLabV3+, and 23.65 M fewer than multilabelRSIR, while increasing mIoU by 3.65%, 5.87%, 3.01%, and 3.41% respectively. Among pure CNNs, our model needs slightly more parameters (+1.47 M) than U-Net3+ and far fewer (−70.61 M) than U-Net, with mIoU gains of 2% over U-Net3+ and 4.37% over U-Net. These results show that combining channel and spatial attention is more efficient, in both computation and parameter count, than using only the local context for image segmentation. The complexity in FLOPs confirms this tendency: except for U-Net and DeepLabV3+, SCGLU-Net has lower complexity than all the pure-CNN and spatial-pyramid-pooling models, while achieving the best mIoU.
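The parameter and FLOP figures compared in Table 7 follow the standard accounting for convolution layers. As a back-of-the-envelope sketch (the layer shapes below are illustrative, not SCGLU-Net's actual configuration):

```python
def conv2d_cost(c_in, c_out, k, h, w, bias=True):
    """Parameters and FLOPs of a k x k convolution applied to an
    h x w feature map (stride 1, 'same' padding)."""
    params = c_in * c_out * k * k + (c_out if bias else 0)
    flops = 2 * c_in * c_out * k * k * h * w  # factor 2: multiply + add per MAC
    return params, flops

# Illustrative 3x3 convolution mapping 64 -> 128 channels on a 256 x 256 map
params, flops = conv2d_cost(64, 128, 3, 256, 256)
```

Such per-layer counts, summed over the network, are what tools report as the model's parameter count and complexity; they show why full-resolution layers (large h, w) dominate FLOPs even when their parameter counts are modest.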
4.4. Ablation Study
To assess the impact of each proposed attention mechanism on our model’s performance, ablation experiments were conducted on the WHDLD and DLRSD datasets. The evaluation focused on mIoU, as well as complexity (FLOPs), memory (MB), and model speed (FPS). Results are summarized in Table 8 for WHDLD and Table 9 for DLRSD. In these experiments, U-Net served as the baseline, lacking any attention mechanism and considering only the local context through convolution, in contrast to MACU-Net with its ACB convolution, densely connected architecture, and CAB channel attention mechanism.
The baseline is the U-Net architecture, which models only the local contextual information in the decoder. Its loss function is the classical categorical cross-entropy.
Propagate attention: We add Propagate attention at the inputs of the skip connections in the U-Net architecture to apply attention to the features coming from the encoder layers. Propagate attention yields a modest mIoU increase of 0.17% for WHDLD and 0.23% for DLRSD, with a relatively low impact on complexity and on memory requirements in terms of parameters.
Baseline + Propagate attention + Channel attention: Adding channel attention increases mIoU by 1.05% for WHDLD and 1.28% for DLRSD. Channel attention also impacts memory requirements, because the number of
Table 8. Ablation studies on WHDLD dataset.
Table 9. Ablation studies on DLRSD dataset.
parameters increases by 61% over baseline + Propagate attention and by 86% over the baseline for DLRSD; for WHDLD, these increases are 0.34% over baseline + Propagate attention and 86% over the baseline. The addition of channel attention does not notably affect the model’s speed (FPS), which remains identical to that of baseline + Propagate attention for WHDLD and changes by only 0.6% for DLRSD.
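To illustrate the kind of mechanism being ablated here, the following is a generic squeeze-and-excitation-style channel gate in NumPy. It is a hedged sketch of the general idea, not the paper's actual CAB block, and the weights and shapes are made up for illustration:

```python
import numpy as np

def channel_attention(x, w1, w2):
    """SE-style channel gate on a (C, H, W) feature map:
    global average pool -> bottleneck MLP -> sigmoid -> rescale channels."""
    z = x.mean(axis=(1, 2))                # squeeze: one descriptor per channel, (C,)
    h = np.maximum(w1 @ z, 0.0)            # excitation hidden layer (ReLU)
    s = 1.0 / (1.0 + np.exp(-(w2 @ h)))    # per-channel weights in (0, 1)
    return x * s[:, None, None]            # reweight each channel

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))         # C = 8 channels (illustrative)
w1 = rng.standard_normal((2, 8))           # bottleneck: reduction ratio 4
w2 = rng.standard_normal((8, 2))
y = channel_attention(x, w1, w2)
```

The extra parameters live entirely in the small bottleneck matrices `w1` and `w2`, which is consistent with channel attention adding parameters while leaving the per-pixel computation, and hence the speed, largely unchanged.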
Baseline + Propagate attention + Spatial attention: Adding only spatial attention after Propagate attention does not improve the performance of the model, as shown by the marginal mIoU changes for WHDLD and DLRSD. Spatial attention is also responsible for increases in memory requirements and in complexity (FLOPs). For our two datasets, adding only spatial attention decreases the inference speed of the model by 2.85% for WHDLD and by 4.6% for DLRSD. Consequently, spatial attention used alone is not optimal in our model at large resolutions because of its quadratic complexity.
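The quadratic cost noted above is easy to quantify: full self-attention over N = H × W tokens builds an N × N affinity matrix, whereas aggregating pixels onto a coarse super-token grid (as in the SCGL block) shrinks N before attention is applied. A minimal sketch with illustrative sizes:

```python
def attention_matrix_cells(h, w):
    """Number of cells in the full self-attention affinity matrix over h*w tokens."""
    n = h * w
    return n * n

# Full-resolution attention on a 256 x 256 feature map...
full = attention_matrix_cells(256, 256)
# ...versus attention over a 16 x 16 grid of super-tokens
coarse = attention_matrix_cells(16, 16)
ratio = full // coarse  # reduction factor in affinity-matrix size
```

With these illustrative sizes the affinity matrix shrinks by a factor of 65,536, which is why super-token aggregation makes self-attention usable at high resolutions while plain spatial attention slows the model down.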
Baseline + Propagate attention + SCGL: The impact of this block is significant. In terms of memory requirements, the SCGL block increases baseline memory by 275.43 MB for the two datasets. This is less than the 940.3 MB sum of the memory requirements of channel and spatial attention taken individually over the baseline for WHDLD and DLRSD. Also, with the SCGL block, speed increases by 60% for WHDLD and by 53% for DLRSD over the baseline. Concerning mIoU, we notice a significant increase of 2.85% for WHDLD and 4% for DLRSD. These results show that combining spatial and channel attention in both a local and a global manner is more suitable than using either attention alone.
Baseline + Propagate attention + SCGL + FRH: As shown in Table 8 and Table 9, adding the FRH block brings a smaller mIoU increase of 0.15% for WHDLD and 0.28% for DLRSD, and raises the complexity of the model, with a notable increase in memory requirements of 639 MB over baseline + SCGL and a drop in inference speed of ≈0.80% for the two datasets. The use of a standard 2D convolution can explain this increase, since this convolution requires many computations to produce its results.
5. Conclusions
Semantic segmentation of high-resolution remote sensing images poses challenges due to the complexity and variability of the scenes, which require considering both the local semantic context and long-range dependencies. Our proposed hybrid architecture, SCGLU-Net, addresses this issue by integrating a CNN encoder with a decoder that combines transformers and channel attention mechanisms. The SCGL block within this architecture processes spatial and channel attention both locally and globally, capturing the interactions between descriptors, which is an advancement over conventional methods. Additionally, the architecture introduces Propagate attention in the multi-scale fusion to selectively retain pertinent information from the encoder, mitigating the artifacts observed in concatenation-based approaches. The mIoU scores on the WHDLD and DLRSD datasets demonstrate enhanced segmentation capabilities for CNN networks on very fine spatial resolution images, with controlled computational complexity.
Future work aims to enhance object boundary segmentation through the integration of a self-attention module.