Abstract:
Monocular 3D object detection is a critical perception task in autonomous driving systems, offering advantages in cost efficiency and ease of deployment over multi-camera or LiDAR-based solutions. However, existing monocular 3D object detection methods suffer from low target recognition accuracy because of three key challenges: the insufficient multiscale information capture of backbone networks, large errors in 3D depth prediction, and the limited depth information encoding ability of transformers. To address these issues comprehensively, this study proposes a novel monocular 3D object detection algorithm for autonomous driving vehicles, named MonoDBDU. First, a bidirectional attention gated feature fusion (BAGFF) module is introduced to overcome the limitations of traditional backbone networks in capturing multiscale features, namely inadequate feature fusion, severe interference from redundant features, and unreasonable weight allocation among features of different scales, all of which weaken the representation of multiscale targets. This module is deeply integrated with the ResNeSt50 network to form the BAGFF-ResNeSt50 backbone, the core feature extraction unit of MonoDBDU. By combining the strong feature extraction capability of ResNeSt50, endowed by its split-attention mechanism, with the superior multiscale feature fusion of the BAGFF module, the backbone overcomes the insufficient multiscale information capture of traditional backbones and provides high-quality multiscale feature inputs for subsequent 3D depth prediction and feature encoding. Second, to overcome the limitations of conventional depth prediction methods, in which a fixed bin distribution cannot adapt to the dynamically changing depth of complex autonomous driving scenes and uncertainty is ignored during prediction, a dynamic bins depth predictor with depth uncertainty (DBDU) is designed. Its dynamic bin generation unit adaptively adjusts the number, interval range, and distribution density of depth bins according to the statistics of the multiscale features of the input image, allocating bin resources precisely to the depth distribution characteristics of each scene. Its depth uncertainty modeling unit explicitly models the uncertainty of depth prediction through probabilistic modeling and outputs depth estimates together with uncertainty coefficients that quantify their reliability, providing uncertainty guidance for subsequent feature encoding and target regression and improving the accuracy and robustness of monocular 3D depth prediction. Finally, to address the limitations of traditional transformers with fixed-size multihead self-attention, namely high parameter volume, heavy computational complexity, and limited depth information encoding ability, a lightweight depth uncertainty transformer (DU-Transformer) is proposed as the core feature encoding module of MonoDBDU. Dividing feature maps into nonoverlapping windows and performing attention interaction between adjacent windows significantly reduces the computational load and parameter volume of self-attention.
Meanwhile, a depth uncertainty modeling branch is introduced to enhance the encoding weight of high-reliability depth features and suppress the interference of low-reliability ones through attention mechanisms, achieving precise fusion of depth information and semantic features. In extensive simulation experiments on the KITTI dataset with an IoU (intersection over union) threshold of 0.7, MonoDBDU improves the 3D average precision (AP3D) by 0.57 percentage points over MonoLSS and the BEV (bird's eye view) average precision (APBEV) by 1.17 percentage points over MonoDETR on the hard test subset. On the Waymo Open Dataset under the same IoU threshold of 0.7, MonoDBDU outperforms MonoLSS by 1.09 and 1.80 percentage points in AP3D and APBEV on the hard subset, respectively. In real-world vehicle experiments, MonoDBDU achieves an mAP of 23.6% in the BEV perspective projection, highlighting its practicality and effectiveness in real-world autonomous driving scenarios.