Abstract
In hazardous environments such as cement aggregate production plants, workers are required to wear safety equipment, including helmets, masks, and reflective vests, to mitigate the risks posed by heavy dust and flying debris. Non-compliance with safety gear requirements nevertheless remains common and contributes to frequent workplace accidents, while manual supervision is inefficient under such environmental conditions. The deployment of AI-based video analysis for real-time safety wear detection has therefore become increasingly important. This task presents significant challenges, however, particularly the presence of small objects and multi-scale targets in complex scenes, which compromise detection accuracy, increase false negative rates, and hinder real-time performance.

To address these issues, this study proposes a novel multi-scale small object detection algorithm, ODE-YOLO, built upon the YOLOv8 architecture. The core innovation lies in integrating the Omni-Dimensional Dynamic Convolution (ODConv) module into the shallow layers of the backbone to enhance feature extraction for small objects, and embedding an improved attention mechanism, iEMA (inverted Efficient Multi-scale Attention), within the neck network to strengthen multi-scale feature representation while preserving real-time inference performance. The EMA module, known for its multi-scale parallel structure and spatial attention capabilities, is modified with an inverted residual mobile block (iRMB) to form iEMA. This structure balances efficiency and accuracy by reusing features, reducing computation, and avoiding the costly matrix operations of traditional self-attention. The combination of ODConv and iEMA allows the model to better capture contextual cues across varying object scales, especially for hard-to-detect categories such as masks and unhelmeted heads.

A customized dataset comprising 9,877 labeled instances was created from surveillance footage of multiple workstations in a cement plant, covering various time periods and camera angles, with six categories: vest, no-vest, helmet, head, mask, and no-mask. Statistical analysis revealed a strong presence of small and scale-diverse targets, with some classes occupying less than 0.5% of the image area. Training was conducted with PyTorch 2.0.0 on an NVIDIA RTX 3090 GPU, and a comprehensive series of experiments was carried out, including attention mechanism comparisons, ablation studies, and benchmarking against state-of-the-art models such as YOLOv5n, YOLOv10n, Faster R-CNN, Mask R-CNN, and RT-DETR-L.

The results demonstrate that the proposed ODE-YOLO outperforms the other YOLO variants and R-CNN models in mean average precision (mAP@0.5 = 0.868) and small object detection precision (AP@0.5 = 0.722 for the mask class), while maintaining a lightweight model size (11.3 MB) and fast inference (2.2 ms per image). The iEMA attention mechanism outperformed other mainstream attention modules (SE, CBAM, CA), improving mask detection precision by 28.5% over the baseline. Ablation experiments confirmed the individual and combined contributions of ODConv and iEMA to both accuracy and speed, demonstrating their synergistic effect. Visual inspection of real-world test images showed that ODE-YOLO achieved balanced detection across object scales without missed detections or misclassifications, making it well suited to real-time deployment in production environments.
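To make the architectural idea concrete, the sketch below shows, in simplified PyTorch, how an attention block following the iEMA recipe (multi-scale attention wrapped in an inverted-residual, iRMB-style structure) could be inserted on a neck feature map. The class names (SimpleMultiScaleAttention, InvertedResidualAttention) are hypothetical, the attention branch is a simplified stand-in rather than the paper's exact EMA implementation, and the ODConv backbone layers are not shown.

```python
# Illustrative sketch only (not the authors' code): a simplified multi-scale
# attention block embedded in an inverted-residual wrapper, standing in for the
# iEMA idea of pairing EMA-style attention with an iRMB structure.
import torch
import torch.nn as nn


class SimpleMultiScaleAttention(nn.Module):
    """Lightweight channel + spatial re-weighting over two receptive fields."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        hidden = max(channels // reduction, 4)
        # Channel branch: global context -> per-channel gate.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, hidden, 1),
            nn.SiLU(),
            nn.Conv2d(hidden, channels, 1),
            nn.Sigmoid(),
        )
        # Spatial branch: 3x3 and 5x5 depthwise convs approximate two scales.
        self.spatial_3 = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.spatial_5 = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        self.spatial_gate = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel_gate(x)
        s = self.spatial_3(x) + self.spatial_5(x)
        return x * self.spatial_gate(s)


class InvertedResidualAttention(nn.Module):
    """Expand -> depthwise conv -> attention -> project, with a skip connection."""

    def __init__(self, channels: int, expansion: int = 2):
        super().__init__()
        hidden = channels * expansion
        self.expand = nn.Sequential(nn.Conv2d(channels, hidden, 1), nn.SiLU())
        self.depthwise = nn.Sequential(
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden), nn.SiLU()
        )
        self.attn = SimpleMultiScaleAttention(hidden)
        self.project = nn.Conv2d(hidden, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.project(self.attn(self.depthwise(self.expand(x))))


if __name__ == "__main__":
    # A neck feature map (batch, channels, height, width) keeps its shape,
    # so the block can be dropped between existing neck stages.
    feat = torch.randn(1, 128, 40, 40)
    block = InvertedResidualAttention(128)
    print(block(feat).shape)  # torch.Size([1, 128, 40, 40])
```

The skip connection and depthwise convolutions keep the added cost small, which is consistent with the abstract's emphasis on preserving real-time inference while strengthening multi-scale feature representation.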
In conclusion, this study introduces a robust and efficient algorithm tailored for safety wear detection in industrial scenarios characterized by multi-scale and small-object challenges. ODE-YOLO provides a practical tool for enhancing workplace safety supervision: it offers timely alerts for non-compliance and supports safety management personnel in mitigating risks and preventing accidents.