Research on automatic speech recognition based on DL–T and transfer learning

ZHANG Wei; LIU Chen; FEI Hongbo; LI Wei; YU Jinghu; CAO Yi

doi:10.13374/j.issn2095-9389.2020.01.12.001

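Abstract

To address the high prediction error rate and slow convergence of RNN–T in speech recognition, this paper proposes an acoustic modeling method based on DL–T. First, the RNN–T acoustic model is introduced. Second, a new acoustic modeling method, DL–T, is proposed by combining DenseNet with an LSTM network. It extracts high-dimensional information from the raw speech, which strengthens feature reuse, and it alleviates gradient problems, which eases the flow of information through deep layers, so it offers both a low prediction error rate and fast convergence. Then, to further improve the accuracy of the acoustic model, a transfer learning method suited to DL–T is proposed. Finally, to validate these methods, speech recognition experiments are carried out with the DL–T acoustic model on the Aishell–1 dataset. The results show that DL–T reduces the prediction error rate by 12.52% relative to RNN–T and that the final error rate of the model reaches 10.34%. DL–T therefore significantly improves on RNN–T in both prediction error rate and convergence speed.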

Abstract: Speech is a natural and effective means of communication and is widely used in information communication and human–machine interaction. In recent years, various algorithms have been applied to make such communication efficient. Automatic speech recognition (ASR), one of the key technologies in this field, converts the analog signal of input speech into the corresponding text. ASR approaches fall into two categories: those based on the hidden Markov model (HMM) and those based on end-to-end (E2E) models. Compared with the former, E2E models have a simpler modeling process and are easier to train, so research has moved toward developing E2E models for ASR. However, HMM-based speech recognition technologies have some disadvantages in terms of prediction error rate, generalization ability, and convergence speed. This study therefore starts from the recurrent neural network–transducer (RNN–T), a typical E2E acoustic model that can model the dependencies between outputs and can be optimized jointly with a language model (LM). To solve the problems of high prediction error rate and slow convergence in RNN–T, a new acoustic model, DL–T, based on DenseNet (dense convolutional network)–LSTM (long short-term memory)–Transducer, is proposed. First, RNN–T is briefly introduced. Then, combining the merits of DenseNet and LSTM, the DL–T acoustic model is presented. DL–T extracts high-dimensional speech features and alleviates gradient problems, which gives it a low character error rate (CER) and fast convergence. In addition, a transfer learning method suited to DL–T is proposed. Finally, to validate these methods, DL–T is evaluated on speech recognition using the Aishell–1 dataset. The experimental results show that DL–T reduces the CER by 12.52% relative to RNN–T and that the final CER is 10.34%, which demonstrates the lower CER and faster convergence of DL–T.
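The CER and the relative reduction quoted in the abstract are usually computed as in the minimal sketch below. It assumes the common definition of CER (character-level edit distance divided by reference length); the abstract does not spell out its scoring procedure. The function names and the sample strings and numbers are hypothetical and are not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, insertions, deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline, improved):
    """Relative error reduction of `improved` with respect to `baseline`."""
    return (baseline - improved) / baseline


# Hypothetical values for illustration only, not results from the paper.
print(cer("今天天气很好", "今天天汽很好"))  # one substitution in six characters
print(relative_reduction(0.20, 0.15))
```

A relative reduction is a ratio of error rates, not a difference in percentage points, so it should be read together with the absolute CER it refers to.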
