Abstract:
Semantic segmentation is an important technology for remote-sensing image processing and has been widely applied in many fields. Although existing remote-sensing image segmentation models, such as convolutional neural network (CNN)- and transformer-based methods, have achieved great success in this domain, several challenges remain, including the difficulty of fully preserving detailed feature maps in the encoder and of dynamically capturing global contextual information. To address these challenges, a novel remote-sensing image segmentation method, the dynamic optimized detail-aware network (DODNet), is proposed based on a CNN-transformer hybrid framework. First, a ResNeXt-50 network is employed as the backbone of the encoder, and a multi-subtraction perception module (MSPM) is designed to collect spatial detail differences between multiscale feature maps, effectively reducing redundant information. This module integrates multidirectional depth-wise separable convolutions with parallel dilated convolutions to enhance feature representation. By performing pixel-wise subtraction after upsampling and spatial alignment, difference feature maps are generated that capture regions of significant variation, effectively preserving boundaries and other detailed information in remote-sensing images while improving the model's perception of small objects. Then, a dynamic information fusion block (DIFB), which combines a global bi-level routing self-attention branch with a local attention branch to better capture global and local information, is designed for the decoder. The global bi-level routing self-attention branch uses a learnable regional routing network to filter out low-association background areas and then performs a fine-grained attention calculation within the retained semantic key windows. This scheme effectively addresses the dual challenges of background interference and computational efficiency in remote-sensing image segmentation. The local attention branch uses multiscale convolutions to compensate for local information that the global branch finds difficult to capture. Finally, a new channel-spatial attention module, the unified feature extractor (UFE), is proposed to obtain semantic and contextual information by serially fusing channel and spatial attention mechanisms. In the channel attention stage, a one-dimensional depth-wise separable convolution extracts channel features, and dual-path average pooling along the width and height directions replaces traditional global pooling. In the spatial attention stage, a multiscale convolution fusion strategy is introduced, and spatial attention weights are generated through instance normalization, so the module attends more closely to local features and foreground objects. To verify the effectiveness and accuracy of the proposed method, comparative experiments and ablation studies were designed and conducted on three open benchmark datasets: Vaihingen, Potsdam, and LoveDA. Quantitative and visual analyses of the results show that DODNet outperforms ten state-of-the-art segmentation methods in terms of the F1 score, overall accuracy (OA), and mean intersection over union (mIoU).
In particular, the mIoU values reached 84.96%, 87.64%, and 52.43% on Vaihingen, Potsdam, and LoveDA, respectively, verifying the strong ability of the proposed DODNet to handle segmentation problems involving complex background interference, large intra-class variance, and high inter-class similarity. Minimal illustrative sketches of the three proposed modules follow.
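To make the MSPM's subtraction step concrete, the following is a minimal PyTorch sketch of pixel-wise subtraction between spatially aligned multiscale features. The class name, channel width, 1x1 alignment projections, the absolute value, and the depth-wise separable refinement are our assumptions for illustration; the abstract only specifies upsampling, spatial alignment, and pixel-wise subtraction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubtractionPerception(nn.Module):
    """Minimal sketch of the MSPM's difference step (names/channels assumed).

    Upsamples a coarse feature map to a fine map's resolution, subtracts
    pixel-wise so that regions of significant variation (boundaries,
    small objects) stand out, then lightly refines the difference map.
    """

    def __init__(self, channels: int = 64):
        super().__init__()
        # Project both inputs to a common channel width before subtraction
        # (assumed; the abstract only states that the maps are aligned).
        self.align_fine = nn.Conv2d(channels, channels, kernel_size=1)
        self.align_coarse = nn.Conv2d(channels, channels, kernel_size=1)
        # Depth-wise separable 3x3 refinement of the difference map (assumed).
        self.refine = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.Conv2d(channels, channels, 1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, fine: torch.Tensor, coarse: torch.Tensor) -> torch.Tensor:
        # Upsample the coarse map so both maps are spatially aligned.
        coarse_up = F.interpolate(coarse, size=fine.shape[-2:],
                                  mode="bilinear", align_corners=False)
        # Pixel-wise subtraction highlights where the two scales disagree.
        diff = torch.abs(self.align_fine(fine) - self.align_coarse(coarse_up))
        return self.refine(diff)
```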
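The DIFB's global branch is described as bi-level routing self-attention: a coarse routing stage keeps only the most associated windows, and token-level attention then runs inside those retained windows. The sketch below follows that two-stage pattern; the window size, top-k value, mean-pooled region descriptors, and single-head formulation are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class BiLevelRoutingAttention(nn.Module):
    """Minimal sketch of the DIFB's global branch (hyper-parameters assumed).

    Step 1 (routing): region-level queries/keys score how strongly each
    window is associated with every other window; only the top-k regions
    are kept, filtering out low-association background.
    Step 2 (fine-grained): token-level attention is computed only inside
    the retained windows, bounding the computational cost.
    """

    def __init__(self, dim: int = 64, window: int = 8, topk: int = 4):
        super().__init__()
        self.window, self.topk, self.scale = window, topk, dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) with H and W divisible by the window size.
        B, H, W, C = x.shape
        w = self.window
        nH, nW = H // w, W // w
        n_reg, reg_len = nH * nW, w * w
        # Partition into non-overlapping windows: (B, n_reg, w*w, C).
        x = x.view(B, nH, w, nW, w, C).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(B, n_reg, reg_len, C)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Region-level routing: mean-pooled queries/keys per window.
        q_reg, k_reg = q.mean(dim=2), k.mean(dim=2)        # (B, n_reg, C)
        affinity = q_reg @ k_reg.transpose(1, 2)           # (B, n_reg, n_reg)
        idx = affinity.topk(self.topk, dim=-1).indices     # keep top-k regions
        # Gather keys/values of the retained regions for every query window.
        idx_exp = idx[..., None, None].expand(-1, -1, -1, reg_len, C)
        k_g = torch.gather(k[:, None].expand(-1, n_reg, -1, -1, -1), 2, idx_exp)
        v_g = torch.gather(v[:, None].expand(-1, n_reg, -1, -1, -1), 2, idx_exp)
        k_g = k_g.reshape(B, n_reg, self.topk * reg_len, C)
        v_g = v_g.reshape(B, n_reg, self.topk * reg_len, C)
        # Fine-grained attention inside the retained windows only.
        attn = (q @ k_g.transpose(-2, -1) * self.scale).softmax(dim=-1)
        out = self.proj(attn @ v_g)                        # (B, n_reg, w*w, C)
        out = out.view(B, nH, nW, w, w, C).permute(0, 1, 3, 2, 4, 5)
        return out.reshape(B, H, W, C)
```

For instance, `BiLevelRoutingAttention()(torch.randn(2, 32, 32, 64))` attends within 16 windows per image while each window only reads the 4 windows it routes to, rather than all 16.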
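The UFE is described as serial channel and spatial attention, with dual-path average pooling along the width and height directions and a one-dimensional depth-wise separable convolution in the channel stage, followed by multiscale convolution fusion with instance normalization in the spatial stage. The sketch below assembles those stated pieces; the kernel sizes, the concatenation of the two pooled paths, and the sigmoid gating are our assumptions.

```python
import torch
import torch.nn as nn

class UnifiedFeatureExtractor(nn.Module):
    """Minimal sketch of the UFE (kernel sizes and fusion details assumed).

    Channel stage: average-pool along width and along height separately
    (dual-path pooling instead of one global pool), extract channel
    features with a 1-D depth-wise separable convolution, and gate the
    input channels. Spatial stage: fuse multiscale convolutions, then
    produce spatial weights through instance normalization.
    """

    def __init__(self, channels: int = 64):
        super().__init__()
        # 1-D depth-wise separable conv over the pooled channel sequences.
        self.dw1d = nn.Conv1d(channels, channels, 3, padding=1, groups=channels)
        self.pw1d = nn.Conv1d(channels, channels, 1)
        # Multiscale spatial branch (3x3 and 5x5 kernels assumed).
        self.spatial3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.spatial5 = nn.Conv2d(channels, channels, 5, padding=2)
        self.inorm = nn.InstanceNorm2d(channels, affine=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # --- Channel attention: dual-path pooling along H and along W. ---
        pool_h = x.mean(dim=3)                         # (B, C, H): over width
        pool_w = x.mean(dim=2)                         # (B, C, W): over height
        feat = torch.cat([pool_h, pool_w], dim=2)      # (B, C, H+W)
        feat = self.pw1d(self.dw1d(feat))              # 1-D depthwise-separable
        ch_attn = torch.sigmoid(feat.mean(dim=2, keepdim=True))  # (B, C, 1)
        x = x * ch_attn.unsqueeze(-1)                  # gate channels
        # --- Spatial attention: multiscale fusion + instance norm. ---
        sp = self.inorm(self.spatial3(x) + self.spatial5(x))
        return x * torch.sigmoid(sp)
```

Running the channel stage before the spatial stage matches the abstract's "serial fusion" of the two mechanisms; the instance-normalized spatial weights emphasize local features and foreground objects within each image.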