Voiceprint recognition method based on SE-DR-Res2Block

李平; 高清源; 夏宇; 张小勇; 曹毅

doi:10.13374/j.issn2095-9389.2022.09.19.001

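Abstract: To address the insufficient feature expression ability and weak generalization ability of traditional Res2Net-based models in voiceprint recognition, a feature extraction module that combines dense and residual connections, SE-DR-Res2Block (Squeeze and excitation with dense and residual connected Res2Block), is proposed. First, the structure of the ECAPA-TDNN (Emphasized channel attention, propagation and aggregation in time delay neural network) network, which uses the traditional Res2Block, is introduced together with the dense connection and its working principle. Then, to achieve more efficient feature extraction, dense connections are used to mine features more fully, and residual and dense connections are combined on the basis of the SE-Block (Squeeze and excitation block) to yield a more efficient feature extraction module, SE-DR-Res2Block. The module obtains combinations of different growth rates and multiple receptive fields at a finer granularity, which yields multi-scale feature representations and maximizes feature reuse, so that information from features at different layers is extracted effectively and feature information at more scales is captured. Finally, to verify the effectiveness of the module, voiceprint recognition experiments were carried out on the Voxceleb1 and SITW (Speakers in the wild) datasets using SE-Res2Block (Squeeze and excitation Res2Block), FULL-SE-Res2Block (Fully connected squeeze and excitation Res2Block), SE-DR-Res2Block, and FULL-SE-DR-Res2Block (Fully connected squeeze and excitation with dense and residual connected Res2Block) in different network models. The experimental results show that the ECAPA-TDNN network model using SE-DR-Res2Block achieves best equal error rates of 2.24% and 3.65% on the two datasets, respectively, which verifies the feature expression ability of the module; the results on different test sets also verify its good generalization ability.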
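The trade-off between dense and residual connections inside a Res2-style block can be made concrete with a small counting exercise. The sketch below is not the authors' SE-DR-Res2Block. It assumes that a dense-style branch concatenates its own channel group with the outputs of all earlier branches, while a residual-style branch adds its group to the previous output. The function name `branch_conv_params`, the `dense_branches` argument, and the values `width = 64` and `scale = 8` are hypothetical choices made only for illustration.

```python
def branch_conv_params(width, scale, dense_branches=frozenset(), kernel_size=3):
    """Count the weights of the per-branch 1D convolutions in a Res2-style block.

    Assumed layout (illustrative, not taken from the paper):
    branch 1 is passed through unchanged; branch k (2 <= k <= scale) applies a
    convolution with `width` output channels. Its input is either
    x_k + y_{k-1} (residual-style addition, `width` channels) or the
    concatenation of x_k with y_1..y_{k-1} (dense-style, k * width channels).
    """
    total = 0
    for k in range(2, scale + 1):
        in_channels = k * width if k in dense_branches else width
        total += in_channels * width * kernel_size  # bias terms are ignored
    return total


# Hypothetical sizes: 512 channels split into 8 groups of 64.
width, scale = 64, 8
all_residual = branch_conv_params(width, scale)
all_dense = branch_conv_params(width, scale, dense_branches=set(range(2, scale + 1)))
# Hypothetical mix: only branches 2 to 4 are dense, the rest stay residual.
mixed = branch_conv_params(width, scale, dense_branches={2, 3, 4})

print(all_residual, mixed, all_dense)  # 86016 159744 430080
```

Under these assumptions the fully dense variant needs five times as many branch weights as the fully residual one, and keeping some branches residual lands in between. This is the kind of parameter growth that motivates mixing the two connection types.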

Abstract: To address the insufficient feature expression ability and weak generalization ability of the traditional Res2Net model in voiceprint recognition, this paper proposes a feature extraction module, SE-DR-Res2Block, that combines dense and residual connections. Low-level semantic features carry spatial information and focus on detail, whereas high-level semantic features focus on global information and abstract features. Combining the two compensates for the loss of detail caused by abstraction. First, each layer in the dense connection structure takes the feature outputs of all previous layers as its input, which realizes feature reuse. Second, the structure and working principle of the ECAPA-TDNN network, which uses the traditional Res2Block, are introduced. To achieve more efficient feature extraction, dense connections are used to mine features more fully. Based on the SE-Block, a more efficient feature extraction module, SE-DR-Res2Block, is proposed by combining residual and dense connections. Unlike the traditional SE-Block structure, convolutional layers are used here instead of fully connected layers, because they both reduce the number of parameters to be trained and allow weight sharing, thereby reducing overfitting. In this way, feature information from different layers is extracted effectively, which yields multiscale feature representations and maximizes feature reuse. When feature information is gathered at more scales, however, a large number of dense structures leads to a dramatic increase in parameters and computational complexity. Replacing part of the dense structures with residual structures prevents this increase in parameter count while largely maintaining performance.
Finally, to verify the effectiveness of the module, SE-Res2Block, FULL-SE-Res2Block, SE-DR-Res2Block, and FULL-SE-DR-Res2Block are applied in different network models, and voiceprint recognition experiments are conducted on the Voxceleb1 and SITW (Speakers in the Wild) datasets. A performance comparison of Res2Net-50 models with different modules on the Voxceleb1 dataset shows that SE-DR-Res2Net-50 achieves the best equal error rate, 3.51%, which also validates the adaptability of the module to different networks. The modules are further compared across different networks and different datasets. The experimental results show that the ECAPA-TDNN network model using SE-DR-Res2Block achieves best equal error rates of 2.24% and 3.65% on Voxceleb1 and SITW, respectively. This verifies the feature expression ability of the module, and the results on different test sets also confirm its good generalization ability.
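The equal error rate (EER) reported above is the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how such a value can be estimated from verification scores with a simple threshold sweep. The function name `equal_error_rate` and the toy scores are invented for illustration and are not data from the paper, and evaluation toolkits typically interpolate the error curves rather than picking the closest threshold as done here.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Estimate the EER by sweeping a decision threshold over all observed scores.

    A trial is accepted when its score is >= threshold.
    FAR = fraction of impostor trials accepted,
    FRR = fraction of target trials rejected.
    The estimate is the mean of FAR and FRR at the threshold where they are closest.
    """
    best_gap, eer = float("inf"), None
    for threshold in sorted(set(target_scores) | set(impostor_scores)):
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy similarity scores (made up): same-speaker trials and different-speaker trials.
targets = [0.9, 0.8, 0.7, 0.35]
impostors = [0.6, 0.3, 0.2, 0.1]
eer = equal_error_rate(targets, impostors)
print(f"EER = {eer:.2%}")  # EER = 25.00%
```

With these toy scores, a threshold of 0.6 accepts one of four impostor trials and rejects one of four target trials, so both error rates are 25%.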

Voiceprint recognition method based on SE-DR-Res2Block