Visible-Infrared Fusion Object Detection Based on Feature Enhancement and Alignment Fusion
Graphical Abstract
Abstract
Single-modal object detectors have developed rapidly and achieved remarkable results in recent years. However, these detectors still face significant limitations, primarily because they cannot exploit the complementary information inherent in multimodal images. Visible-infrared object detection overcomes challenges such as poor visibility under low-light conditions by fusing information from visible and infrared images so that the two modalities complement each other. However, precisely aligning feature maps from different modalities and efficiently fusing modality-specific information remain key challenges in this field. Although various methods have been proposed to address these issues, handling modality differences, enhancing the complementarity of cross-modal information, and achieving efficient feature fusion are still bottlenecks for high-performance object detectors. To overcome these challenges, this paper proposes a visible-infrared object detection method called F3M-Det, which significantly improves detection performance by enhancing, aligning, and fusing cross-modal features. The core idea of F3M-Det is to fully exploit the complementarity between visible and infrared images to strengthen the model's ability to understand and process cross-modal information. Specifically, F3M-Det primarily consists of a feature extraction backbone, a Feature Enhancement Module (FEM), a Feature Alignment Module (FAM), and a Feature Fusion Module (FFM). The FEM uses cross-modal attention to substantially enhance the expressive power of both visible and infrared image features; by capturing subtle differences and complementary information between the modalities, it helps F3M-Det achieve higher detection accuracy. To reduce the computational cost of computing cross-attention over global feature maps while retaining the useful information in the inputs, the FEM employs multi-scale feature pooling to reduce the dimensionality of the feature maps. Next, the FAM aligns feature maps from different modalities by combining global information with local details, ensuring that features captured from different viewpoints and scales are accurately aligned. This reduces modality differences and improves the comparability of cross-modal information, allowing the model to handle misalignment between modalities in complex environments and enhancing F3M-Det's robustness and generalization ability. Finally, the FFM performs efficient fusion of cross-modal features. It uses a frequency-aware mechanism to suppress irrelevant modality differences during fusion while preserving useful complementary information, thereby improving the quality of the fused features. The FFM is also employed for cross-scale feature fusion (SFFM) to reduce information loss. F3M-Det uses YOLOv5 as its baseline: a dual-stream backbone network is built on CSPDarknet, and the FPN structure and detection head are inherited from YOLOv5. To validate the effectiveness of the proposed F3M-Det, we conducted comprehensive experimental evaluations on two widely used datasets: the unaligned DVTOD dataset and the aligned LLVIP dataset.
The experimental results show that F3M-Det outperforms existing single-modal and visible-infrared object detection methods on both datasets, demonstrating its superiority in cross-modal feature alignment and fusion. In addition, ablation studies were conducted to investigate the contribution of each module to F3M-Det's performance. The results confirm the importance of each proposed module in improving detection accuracy, further validating the effectiveness and superiority of F3M-Det.
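For readers who want a more concrete picture of the FEM described above, the following is a minimal PyTorch sketch of cross-modal attention computed over multi-scale-pooled keys and values. This is not the paper's implementation: the class name PooledCrossModalAttention, the pool sizes, the number of heads, the residual connection, and the weight sharing between the two attention directions are all illustrative assumptions; only the general idea of full-resolution queries from one modality attending to a reduced token set pooled from the other modality follows the description given above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PooledCrossModalAttention(nn.Module):
    """Illustrative feature-enhancement block: queries come from one modality
    at full resolution, while keys/values come from the other modality after
    multi-scale average pooling, which shrinks the token set that
    cross-attention is computed over."""

    def __init__(self, channels, num_heads=4, pool_sizes=(1, 3, 6)):
        super().__init__()
        self.pool_sizes = pool_sizes
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def _pool_tokens(self, x):
        # x: (B, C, H, W) -> (B, sum(s*s), C) pooled tokens
        tokens = []
        for s in self.pool_sizes:
            p = F.adaptive_avg_pool2d(x, s)              # (B, C, s, s)
            tokens.append(p.flatten(2).transpose(1, 2))  # (B, s*s, C)
        return torch.cat(tokens, dim=1)

    def forward(self, feat_q, feat_kv):
        # feat_q, feat_kv: (B, C, H, W) feature maps from the two modalities
        B, C, H, W = feat_q.shape
        q = feat_q.flatten(2).transpose(1, 2)  # (B, H*W, C) full-resolution queries
        kv = self._pool_tokens(feat_kv)        # (B, N_pooled, C) compressed keys/values
        out, _ = self.attn(self.norm(q), kv, kv)
        out = out.transpose(1, 2).reshape(B, C, H, W)
        return feat_q + out                    # residual enhancement of the query modality


# Enhance both streams symmetrically (visible attends to infrared and vice versa);
# a real implementation might use two separate modules instead of shared weights.
rgb, ir = torch.randn(1, 256, 40, 40), torch.randn(1, 256, 40, 40)
fem = PooledCrossModalAttention(256)
rgb_enhanced, ir_enhanced = fem(rgb, ir), fem(ir, rgb)
```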
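The abstract does not specify how the frequency-aware mechanism inside the FFM is realized, so the sketch below should be read only as one plausible interpretation: each modality's features are split into a low-frequency component (a local average) and a high-frequency residual, and the two modalities are mixed per band with learned gates before a 1x1 projection. The class name FrequencyAwareFusion, the low-pass filter, and the gating scheme are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class FrequencyAwareFusion(nn.Module):
    """Hypothetical frequency-aware fusion of two modality feature maps:
    low-frequency (structure) and high-frequency (detail) bands are fused
    separately with gates predicted from the concatenated inputs."""

    def __init__(self, channels, kernel_size=5):
        super().__init__()
        self.lowpass = nn.AvgPool2d(kernel_size, stride=1, padding=kernel_size // 2)
        # two spatial gates, one per frequency band
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, 2, 1), nn.Sigmoid())
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, feat_rgb, feat_ir):
        low_rgb, low_ir = self.lowpass(feat_rgb), self.lowpass(feat_ir)
        high_rgb, high_ir = feat_rgb - low_rgb, feat_ir - low_ir
        g = self.gate(torch.cat([feat_rgb, feat_ir], dim=1))  # (B, 2, H, W)
        g_low, g_high = g[:, 0:1], g[:, 1:2]
        low = g_low * low_rgb + (1 - g_low) * low_ir        # shared scene structure
        high = g_high * high_rgb + (1 - g_high) * high_ir   # modality-specific detail
        return self.proj(low + high)


ffm = FrequencyAwareFusion(256)
fused = ffm(torch.randn(1, 256, 40, 40), torch.randn(1, 256, 40, 40))
```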