* Corresponding author, E-mail: zhongguima@ustb.edu.cn, wangzj@ustb.edu.cn. Chinese Journal of Engineering. DOI: 10.13374/j.issn2095-9389.2024.04.25.001


  • Object detection, a fundamental task in computer vision, has achieved remarkable success in domains such as autonomous driving, robotics, and facial recognition thanks to advances in convolutional neural networks. To pursue the best possible performance, however, state-of-the-art object detection models usually incorporate a very large number of parameters, and their size has reached the limits allowed by modern hardware. This heavy design hinders their deployment on resource-constrained devices. To overcome this challenge, researchers have proposed numerous model compression techniques, including network pruning, lightweight architecture design, neural network quantization, and knowledge distillation. Among these, knowledge distillation stands out because it transfers knowledge from a large teacher model to a compact student model without modifying the network structure, enabling the student to achieve performance comparable to that of its teacher. However, most distillation techniques have been developed for image classification, and their performance on object detection is often subpar. Object detection requires simultaneously localizing and classifying multiple target objects within natural images. These objects often vary in scale, exhibit intricate inter-class relationships, and are dispersed across different locations. The contribution of a region to distillation may differ between the center and the surroundings of a bounding box, and between the foreground and the background. Consequently, incorporating knowledge distillation into object detection models poses significant challenges. To address these issues, this paper proposes a novel attention-based knowledge distillation framework for object detection that strikes a better balance between efficiency and accuracy.
The main contributions of this paper are as follows. First, this paper proposes using category semantic attention to localize the foreground semantic regions of each class in the feature maps output by the neck (feature pyramid) of the teacher detector, transferring crucial positional information for each class to the student model. Distilling on the feature pyramid outputs also addresses the issue of multi-scale targets. To reduce the discrepancy between teacher and student feature maps, the feature maps used for distillation are normalized to zero mean and unit variance. Second, category semantic distillation gives insufficient consideration to background information, severs the relationships between foreground and background regions, and overlooks the relationships among targets of different classes. To address this, the paper uses a criss-cross attention mechanism to capture long-range dependencies between target pixels in the teacher model. This information is then transferred to the student detector to further enhance its performance. Combining the two distillation techniques yields the overall approach, termed Category Semantic and Global Relations (CSGR) distillation. The former focuses on the crucial foreground positions of each class, while the latter captures global relationships among target pixels across different classes. To validate the effectiveness and generalization of the proposed method, extensive experiments are conducted on challenging benchmarks including SODA10M, PASCAL VOC, and MiniCOCO. Across various object detectors, student models trained with CSGR distillation show notable improvements over those trained from scratch.
Additionally, compared with other baseline methods, the proposed approach delivers competitive improvements in mAP without significantly increasing the number of parameters or FLOPs during distillation training. The proposed method therefore achieves a better balance between accuracy and efficiency.
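
The feature normalization step described above can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes per-channel standardization (the abstract does not say whether the statistics are computed per channel or over the whole map), a squared-error feature loss, and a generic non-negative spatial weight list `attention` standing in for the category semantic attention. The function names `normalize_channel` and `weighted_distill_loss` are hypothetical, and feature maps are represented as plain Python lists to keep the example dependency-free.

```python
from statistics import fmean, pstdev


def normalize_channel(values, eps=1e-6):
    """Standardize one flattened channel to zero mean and (near) unit variance.

    Assumption: statistics are computed per channel; the source only states
    that distilled feature maps have zero mean and unit variance.
    """
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation
    return [(v - mean) / (std + eps) for v in values]


def weighted_distill_loss(teacher, student, attention, eps=1e-12):
    """Attention-weighted squared error between normalized feature maps.

    teacher, student: list of channels, each a flat list of H*W activations.
    attention: flat list of H*W non-negative weights (a hypothetical stand-in
    for a spatial attention map derived from the teacher).
    """
    total = 0.0
    for t_ch, s_ch in zip(teacher, student):
        t_norm = normalize_channel(t_ch)
        s_norm = normalize_channel(s_ch)
        # Positions with larger attention contribute more to the loss.
        total += sum(a * (t - s) ** 2
                     for a, t, s in zip(attention, t_norm, s_norm))
    # Average over channels and total attention mass.
    return total / (len(teacher) * sum(attention) + eps)


if __name__ == "__main__":
    teacher = [[1.0, 2.0, 3.0, 4.0], [10.0, 0.0, 5.0, 5.0]]
    # A student whose activations differ only by scale and offset.
    student = [[5.0 * v + 2.0 for v in ch] for ch in teacher]
    attention = [1.0, 1.0, 1.0, 1.0]
    print(weighted_distill_loss(teacher, student, attention))
```

Because both feature maps are standardized before comparison, a student whose activations differ from the teacher's only by a positive scale and an offset incurs a loss close to zero. This is one way to read the abstract's motivation for normalization: differences in activation magnitude between a large teacher and a compact student are not penalized.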