Abstract:
Accurate recognition of working conditions can optimize the zinc flotation process and improve its efficiency. Traditionally, this recognition relies heavily on manual observation of froth appearance, a method prone to human error and subjective judgment. To address this issue and improve recognition accuracy, a sparse attention convolution-ViT model is proposed. The model uses machine vision to relate froth visual features to working conditions, based on real-time froth images from industrial sites, and aims to recognize zinc flotation working conditions in real time, thereby providing guidance for operations. First, it combines the strengths of convolutional neural networks (CNNs) and vision transformers (ViT) to extract both local and global features from froth images. Specifically, CNNs are adept at capturing local features, such as the texture, color, and fine details of the froth, while ViT excels at identifying global features, such as the froth size distribution. By combining these two architectures, the sparse attention convolution-ViT model analyzes the froth images comprehensively. To enhance the global feature processing of froth images, a sparse multi-head attention mechanism is introduced into the ViT component. This mechanism allows the model to process global features at different sparsity levels, reducing computational cost and improving the model’s adaptability to different froth appearances. Each attention head in the sparse multi-head attention mechanism targets a different aspect of the global features, allowing the model to extract diverse information from the froth images while maintaining efficiency. Furthermore, an attention-gated unit is introduced to refine the feature processing. This unit adaptively weights the features extracted from the image, enhancing model interpretability and optimizing feature transfer.
By effectively capturing the relevant features, the attention-gated unit helps the model focus on the critical froth image features that indicate the working conditions. Experimental results demonstrated the effectiveness of the proposed sparse attention convolution-ViT model in recognizing zinc flotation working conditions. The model achieved a recognition accuracy of 88.62% on the zinc flotation froth image dataset, surpassing traditional CNN and ViT models. Ablation experiments highlighted the critical role of the sparse multi-head attention mechanism and the attention-gated unit, which contributed accuracy improvements of 0.92% and 2.63%, respectively. Moreover, gradient-weighted class activation mapping (Grad-CAM) was used to visualize feature weights, confirming the model’s capability to characterize froth images by identifying both local and global features. These results underscore the model’s potential to provide reliable real-time recognition of working conditions, supporting optimization of the flotation process and improving efficiency and resource utilization in zinc flotation.
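
The two mechanisms named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify how sparsity is imposed or how the gate is computed, so top-k masking per head and an element-wise sigmoid gate are assumptions made here, and all names (`topk_sparse_attention`, `sparse_multi_head_attention`, `attention_gate`, `keeps`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; entries set to -inf receive zero weight.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def topk_sparse_attention(q, k, v, keep):
    """Single-head attention in which each query attends only to its
    `keep` highest-scoring keys (assumed form of sparsity)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])            # (n_tokens, n_tokens)
    # Per-row threshold: the keep-th largest score in that row.
    kth = np.partition(scores, -keep, axis=-1)[:, -keep][:, None]
    masked = np.where(scores >= kth, scores, -np.inf)  # drop all other keys
    weights = softmax(masked, axis=-1)
    return weights @ v, weights

def sparse_multi_head_attention(x, wq, wk, wv, keeps):
    """x: (n_tokens, d_model); wq, wk, wv: (n_heads, d_model, d_head).
    `keeps` gives one top-k value per head, so heads differ in sparsity."""
    outs = []
    for h, keep in enumerate(keeps):
        out, _ = topk_sparse_attention(x @ wq[h], x @ wk[h], x @ wv[h], keep)
        outs.append(out)
    return np.concatenate(outs, axis=-1)               # (n_tokens, n_heads * d_head)

def attention_gate(features, w_gate, b_gate):
    """Assumed gate: a sigmoid of a linear map yields weights in (0, 1)
    that rescale each feature element-wise."""
    gate = 1.0 / (1.0 + np.exp(-(features @ w_gate + b_gate)))
    return gate * features

# Toy usage with random patch tokens standing in for froth image patches.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 32))                     # 16 patches, d_model = 32
wq, wk, wv = (rng.normal(size=(4, 32, 8)) for _ in range(3))
attended = sparse_multi_head_attention(tokens, wq, wk, wv, keeps=(2, 4, 8, 16))
gated = attention_gate(attended, rng.normal(size=(32, 32)) * 0.1, np.zeros(32))
```

In this sketch the head with `keep=16` attends to every patch (dense attention), while the head with `keep=2` retains only the two strongest patch-to-patch relations, which is one way to read "different sparsity levels" across heads.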