ViTAU: Facial paralysis recognition and analysis based on vision transformer and facial action units

高嘉; 蔡文浩; 赵俊莉; 段福庆

doi:10.13374/j.issn2095-9389.2024.05.06.003

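Summary: Facial nerve paralysis (FNP), commonly known as Bell’s palsy or facial paralysis, significantly affects patients’ daily lives and mental health, and its timely recognition and diagnosis are essential for early treatment and rehabilitation. With the rapid development of deep learning and computer vision, automatic recognition of facial paralysis has become feasible, offering a more accurate and objective means of diagnosis. Existing studies focus mainly on overall changes in the face and overlook the importance of facial detail. Different parts of the face do not influence the recognition result equally, yet these studies have not distinguished and analyzed the individual facial regions in detail. This study introduces an innovative method that combines a Vision transformer (ViT) model with an action unit (AU) region detection network for automatic facial paralysis recognition and regional analysis. The ViT model uses its self-attention mechanism to determine whether facial paralysis is present, while the AU-based strategy applies a pyramid convolutional neural network to feature maps extracted from a StyleGAN2 model to analyze the affected regions. In experiments on the YouTube Facial Palsy (YFP) and extended Cohn Kanade (CK+) datasets, the combined method achieved a facial paralysis recognition accuracy of 99.4% and an affected-region recognition accuracy of 81.36%, respectively. Comparison with the latest methods demonstrates the effectiveness of the proposed automatic facial paralysis recognition method.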

Abstract: Facial nerve paralysis (FNP), commonly known as Bell’s palsy or facial paralysis, significantly affects patients’ daily lives and mental well-being. Timely identification and diagnosis are crucial for early treatment and rehabilitation. With the rapid advancement of deep learning and computer vision technologies, automatic recognition of facial paralysis has become feasible, offering a more accurate and objective diagnostic approach. Current research focuses primarily on global facial changes and often neglects finer facial details, so it does not adequately analyze how different facial regions affect the recognition result. This study proposes an innovative method that combines the vision transformer (ViT) model with an action unit (AU) facial region detection network to automatically recognize and analyze facial paralysis. First, the ViT model uses its self-attention mechanism, which captures the global contextual information required for recognition, to determine whether facial paralysis is present. Next, AU data are analyzed to assess facial muscle activity, allowing a deeper evaluation of the affected areas. To determine the specific affected regions, we use the pixel2style2pixel (pSp) encoder and the StyleGAN2 generator to encode and decode images and extract feature maps that represent facial characteristics. These maps are then processed by a pyramid convolutional neural network interpreter to generate heatmaps. By minimizing the mean squared error between the predicted and ground-truth heatmaps, the network learns to identify the regions affected by paralysis. The proposed method thus integrates ViT with facial AUs: a ViT-based facial paralysis recognition network enhances the extraction of local region features through its self-attention mechanism, enabling precise recognition, and facial AU data support a detailed regional analysis for patients identified with facial paralysis. Experiments on the YouTube Facial Palsy (YFP) and extended Cohn-Kanade (CK+) datasets demonstrate the efficacy of the approach, which achieves an accuracy of 99.4% for facial paralysis recognition and 81.36% for affected-region recognition. Comparison with the latest techniques highlights the effectiveness of the automatic recognition method and indicates its potential for clinical application. Furthermore, to facilitate observation of the affected regions, we developed a visualization method that intuitively displays them, helping patients and healthcare professionals understand the condition and communicate about treatment and rehabilitation strategies. In conclusion, the proposed method illustrates the value of combining advanced deep learning techniques with a detailed analysis of facial AUs to improve the automatic recognition of facial paralysis. By addressing the limitations of previous research, it provides a more nuanced understanding of how specific facial areas are affected, which can support improved diagnosis, treatment, and patient care. This approach not only improves the accuracy of facial paralysis detection but also contributes to facial medical imaging.
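The heatmap objective mentioned in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that each ground-truth heatmap is a 2-D Gaussian centered on an AU location; the abstract does not state how the target heatmaps are built. The function names `gaussian_heatmap` and `heatmap_mse`, the 8×8 grid, and the value of `sigma` are hypothetical and are not taken from the paper.

```python
import math


def gaussian_heatmap(height, width, center, sigma=1.5):
    """Build a 2-D Gaussian heatmap that peaks (value 1.0) at center = (row, col).

    Assumption: Gaussian targets are a common choice for region heatmaps;
    the paper's actual target construction is not given in the abstract.
    """
    cy, cx = center
    return [
        [
            math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


def heatmap_mse(predicted, target):
    """Mean squared error averaged over all pixels of two same-sized heatmaps."""
    same_rows = len(predicted) == len(target)
    if not same_rows or any(len(p) != len(t) for p, t in zip(predicted, target)):
        raise ValueError("heatmaps must have the same shape")
    total = 0.0
    count = 0
    for p_row, t_row in zip(predicted, target):
        for p, t in zip(p_row, t_row):
            total += (p - t) ** 2
            count += 1
    return total / count


# Hypothetical example: one target AU region and two candidate predictions.
target = gaussian_heatmap(8, 8, center=(2, 5))
aligned = gaussian_heatmap(8, 8, center=(2, 5))   # prediction on the right region
shifted = gaussian_heatmap(8, 8, center=(5, 2))   # prediction on the wrong region

print(heatmap_mse(aligned, target))
print(heatmap_mse(shifted, target))
```

A prediction that peaks on the wrong region yields a larger error than one aligned with the target, which is the signal such a loss provides during training. In a deep learning framework the same quantity would normally be computed with the framework's built-in mean squared error loss over tensors.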
