Abstract:
In recent years, leveraging large-scale pre-trained deep neural network models to obtain better generalization and task-specific performance has become a major trend in deep-learning-based artificial intelligence. However, the high architectural complexity, large storage overhead, and heavy computational cost of these models make them difficult to deploy on resource-constrained edge hardware platforms. To address this issue, model compression and acceleration techniques have emerged. Among them, model quantization is a key enabler of large-scale commercial deployment: by reducing the bit width of the network parameters and intermediate outputs, it compresses the model and accelerates inference. This paper surveys model quantization methods from multiple perspectives, summarizes and evaluates the advantages and disadvantages of different approaches, and finally discusses open problems in neural network quantization and directions for future development.
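To make the core idea concrete, the following is a minimal sketch of uniform affine quantization, the basic scheme underlying many of the methods this survey covers. It is an illustrative example, not any specific method from the paper: a float32 tensor is mapped to 8-bit integers via a scale and zero-point, then dequantized back, trading a small reconstruction error for a 4x reduction in storage.

```python
import numpy as np

def quantize(x, num_bits=8):
    # Uniform affine quantization: map float values onto the
    # integer grid [0, 2^b - 1] using a scale and a zero-point.
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    # Recover approximate float values from the integer codes.
    return scale * (q.astype(np.float32) - zero_point)

# Hypothetical weight tensor, standing in for one layer of a network.
weights = np.random.randn(4, 4).astype(np.float32)
q, s, z = quantize(weights)
recon = dequantize(q, s, z)
# Per-element reconstruction error is bounded by roughly one scale step.
max_err = np.abs(weights - recon).max()
```

Storing `q` (uint8) instead of `weights` (float32) cuts memory by 4x; lower bit widths (4-bit, binary) push the trade-off between compression and accuracy further, which is where the surveyed methods differ.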