Traffic Environment Perception Algorithm Based on Multi-Task Feature Fusion and Orthogonal Attention

  • Abstract: Designing collaborative multi-task perception algorithms for autonomous driving scenes remains challenging. To address this, a deep convolutional neural network algorithm, MTEPN, is proposed to simultaneously perform three visual tasks: vehicle object detection, drivable road area extraction, and lane line segmentation. First, a CSPDarkNet network extracts basic features from traffic scene images; next, a feature aggregation module, C2f-K, is designed to obtain more fine-grained global image features; then an orthogonal attention mechanism, HWAttention, is proposed to reduce computation while enhancing spatial-scale image features; a cross-task information aggregation module, CFAS, is further introduced to fuse complementary pattern information across tasks; finally, decoupled task head modules realize the three perception objectives. Experiments on the public BDD100k dataset show that the proposed algorithm achieves a mean average precision (mAP) of 79.4% for object detection and a mean intersection-over-union (mIoU) of 92.4% for drivable area extraction, both exceeding mainstream multi-task perception algorithms of comparable parameter scale, while lane line segmentation accuracy (IoU) reaches a second-best value of 27.2%. The model has only 7.9M parameters and processes a single frame in 24.3 ms, offering good overall performance. The code will be released at https://github.com/XMUT-Vsion-Lab/MTEPN.
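The abstract only names the modules; until the code is released at the repository above, the exact interfaces are unknown. The following is a minimal PyTorch sketch of the overall shape described there, a shared backbone feeding three decoupled task heads. The ResNet-18 stand-in backbone (the paper uses CSPDarkNet), the layer sizes, and the head designs are illustrative assumptions, not the MTEPN implementation.

```python
# Minimal sketch (not the released MTEPN code): a shared backbone feeding three
# decoupled task heads, as described in the abstract. The ResNet-18 stand-in
# backbone, layer sizes, and head designs are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision

class MultiTaskPerceptionSketch(nn.Module):
    def __init__(self, num_det_outputs=5):
        super().__init__()
        # Stand-in backbone; the paper uses CSPDarkNet, which torchvision does not provide.
        resnet = torchvision.models.resnet18(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # -> (B, 512, H/32, W/32)
        # Decoupled heads: each task gets its own branch on the shared feature map.
        self.det_head = nn.Conv2d(512, num_det_outputs, kernel_size=1)            # dense box/objectness map
        self.drivable_head = nn.Sequential(                                        # drivable-area mask
            nn.Conv2d(512, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 1), nn.Upsample(scale_factor=32, mode="bilinear"))
        self.lane_head = nn.Sequential(                                            # lane-line mask
            nn.Conv2d(512, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 1), nn.Upsample(scale_factor=32, mode="bilinear"))

    def forward(self, x):
        feats = self.backbone(x)
        return {
            "detection": self.det_head(feats),
            "drivable_area": self.drivable_head(feats),
            "lane_line": self.lane_head(feats),
        }

if __name__ == "__main__":
    model = MultiTaskPerceptionSketch()
    outs = model(torch.randn(1, 3, 384, 640))
    print({k: tuple(v.shape) for k, v in outs.items()})
```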

     

    Abstract: In autonomous driving, the design and implementation of collaborative multi-task perception algorithms remains challenging, chiefly because of the need for real-time processing, effective feature sharing across diverse tasks, and seamless information fusion; addressing these concerns is critical to the safety and efficiency of autonomous systems navigating complex traffic environments. To tackle these issues, we introduce MTEPN, a deep convolutional neural network designed to perform three visual tasks concurrently: vehicle object detection, drivable road area extraction, and lane line segmentation. By integrating these tasks into a unified model, MTEPN strengthens the perceptual capability of autonomous driving systems in real-world settings. MTEPN is built on the CSPDarkNet backbone, which extracts fundamental features from traffic scene images; its lateral connections enhance feature extraction and provide a robust basis for the subsequent multi-task processing, a crucial step because the quality of these features bounds the performance of the whole system. On top of the backbone, we introduce a multi-channel deformable feature aggregation module, C2f-K, which captures fine-grained global image features through cross-layer information fusion; by integrating features across scales, C2f-K suppresses background noise and interference and improves the model's understanding of complex scenes. To further raise efficiency and accuracy, we propose an orthogonal attention mechanism, HWAttention, which reduces computational load while amplifying salient spatial features; by focusing on critical regions of interest, it keeps the model efficient under real-time constraints across varied environments. A further contribution is the cross-task information aggregation module CFAS, which promotes complementarity between tasks by implicitly modeling the global context relationships among the visual tasks; fusing this complementary pattern information deepens feature sharing and raises the recognition accuracy of each individual task, in contrast to traditional methods that treat the tasks in isolation. Finally, a decoupled task head module processes the three perception objectives independently, which increases the model's flexibility and allows task-specific optimization.
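The abstract states only that HWAttention attends along orthogonal spatial dimensions to cut computation. A common way to realize this is height/width-factorized (axial) attention, whose cost grows roughly with H²W + HW² rather than (HW)². The sketch below assumes that formulation; it is not the paper's exact definition of HWAttention.

```python
# Minimal sketch of an orthogonal (height/width-factorized) attention block.
# This axial-attention formulation is an assumption, not MTEPN's HWAttention definition.
import torch
import torch.nn as nn

class OrthogonalAttentionSketch(nn.Module):
    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.h_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.w_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):                                       # x: (B, C, H, W)
        b, c, h, w = x.shape
        # Attention along the height axis: each column is an independent sequence.
        cols = x.permute(0, 3, 2, 1).reshape(b * w, h, c)       # (B*W, H, C)
        cols = cols + self.h_attn(cols, cols, cols, need_weights=False)[0]
        x = cols.reshape(b, w, h, c).permute(0, 3, 2, 1)        # back to (B, C, H, W)
        # Attention along the width axis: each row is an independent sequence.
        rows = x.permute(0, 2, 3, 1).reshape(b * h, w, c)       # (B*H, W, C)
        rows = rows + self.w_attn(rows, rows, rows, need_weights=False)[0]
        x = rows.reshape(b, h, w, c).permute(0, 3, 1, 2)        # back to (B, C, H, W)
        # Final normalization over the channel dimension.
        return self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

if __name__ == "__main__":
    attn = OrthogonalAttentionSketch(channels=64)
    print(attn(torch.randn(2, 64, 24, 40)).shape)               # torch.Size([2, 64, 24, 40])
```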
Experimental evaluations on the public BDD100k dataset show that MTEPN achieves a mean average precision (mAP) of 79.4% for vehicle detection and a mean intersection-over-union (mIoU) of 92.4% for drivable area extraction, both surpassing existing mainstream multi-task perception algorithms of comparable parameter scale, while lane line segmentation accuracy (IoU) reaches a second-best value of 27.2%. MTEPN keeps a modest parameter count of only 7.9 million and processes a single frame in just 24.3 ms, demonstrating its suitability for real-time autonomous driving, where both speed and accuracy are paramount. The code will be made publicly available at https://github.com/XMUT-Vsion-Lab/MTEPN.
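For reference, the reported mIoU and IoU figures are pixel-level intersection-over-union scores. The snippet below shows the generic computation for a single binary mask with a small worked example; it is not taken from the paper's evaluation code.

```python
# Worked example of the pixel IoU metric behind the reported mIoU/IoU figures
# (generic binary-mask computation, not the paper's evaluation script).
import numpy as np

def pixel_iou(pred: np.ndarray, target: np.ndarray) -> float:
    """IoU between two binary masks of the same shape."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return float(intersection) / union if union > 0 else 1.0

# Toy 4x4 masks: 3 overlapping pixels out of 5 predicted-or-true pixels -> IoU = 0.6
pred = np.array([[1, 1, 0, 0],
                 [1, 1, 0, 0],
                 [0, 0, 0, 0],
                 [0, 0, 0, 0]])
target = np.array([[1, 1, 0, 0],
                   [1, 0, 0, 0],
                   [1, 0, 0, 0],
                   [0, 0, 0, 0]])
print(pixel_iou(pred, target))  # 0.6
```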

     
