Abstract:
Monocular 3D object detection is a critical perception task in autonomous driving systems, offering advantages in cost efficiency and ease of deployment over multi-camera or LiDAR-based solutions. However, existing monocular 3D object detection methods suffer from low target recognition accuracy because of three key challenges: the insufficient multiscale information capture of backbone networks, large errors in 3D depth prediction, and the limited depth information encoding ability of transformers. To address these issues comprehensively, this study proposes a novel monocular 3D object detection algorithm for autonomous driving vehicles, named MonoDBDU. First, a bidirectional attention gated feature fusion (BAGFF) module is introduced to overcome the limitations of traditional backbone networks in capturing multiscale features, namely inadequate feature fusion, severe interference from redundant features, and unreasonable weight allocation among features of different scales, all of which weaken the representation of multiscale targets. This module is deeply integrated with the ResNeSt50 network to form the BAGFF-ResNeSt50 backbone, the core feature extraction unit of MonoDBDU. By combining the strong feature extraction capability of ResNeSt50, endowed by its split-attention mechanism, with the superior multiscale feature fusion of the BAGFF module, the backbone overcomes the insufficient multiscale information capture of traditional backbones and provides high-quality multiscale feature inputs for subsequent 3D depth prediction and feature encoding. Second, to overcome the limitations of conventional depth prediction methods, in which a fixed bin distribution cannot adapt to the dynamically changing depth of complex autonomous driving scenes and uncertainty is ignored during prediction, a dynamic bins depth predictor with depth uncertainty (DBDU) is designed. Its dynamic bin generation unit adaptively adjusts the number, interval range, and distribution density of depth bins according to the statistics of the multiscale features of the input image, allocating bin resources precisely to the depth distribution characteristics of each scene. Its depth uncertainty modeling unit explicitly models the uncertainty of depth prediction through probabilistic modeling and outputs depth estimates together with uncertainty coefficients that quantify their reliability, providing uncertainty guidance for subsequent feature encoding and target regression and improving the accuracy and robustness of monocular 3D depth prediction. Finally, to address the limitations of traditional transformers with fixed-size multihead self-attention, namely high parameter volume, heavy computational complexity, and limited depth information encoding ability, a lightweight depth uncertainty transformer (DU-Transformer) is proposed as the core feature encoding module of MonoDBDU. Dividing feature maps into nonoverlapping windows and performing attention interaction between adjacent windows significantly reduces the computational load and parameter volume of self-attention.
Meanwhile, a depth uncertainty modeling branch is introduced to enhance the encoding weight of high-reliability depth features and suppress the interference of low-reliability ones through attention mechanisms, achieving precise fusion of depth information and semantic features. In extensive simulation experiments on the KITTI dataset with an IoU (intersection over union) threshold of 0.7, MonoDBDU improves the 3D average precision (AP3D) by 0.57 percentage points over MonoLSS and the BEV (bird's eye view) average precision (APBEV) by 1.17 percentage points over MonoDETR on the hard test subset. On the Waymo Open Dataset under the same IoU threshold of 0.7, MonoDBDU outperforms MonoLSS by 1.09 and 1.80 percentage points in AP3D and APBEV on the hard subset, respectively. In real-world vehicle experiments, MonoDBDU achieves an mAP of 23.6% in the BEV perspective projection, highlighting its practicality and effectiveness in real-world autonomous driving scenarios.