贺学剑, 陈安琪, 郭志强, 王致茹, 陈群. 基于渐进机器学习的中文问句匹配方法[J]. 工程科学学报. DOI: 10.13374/j.issn2095-9389.2023.11.05.002
引用本文: 贺学剑, 陈安琪, 郭志强, 王致茹, 陈群. 基于渐进机器学习的中文问句匹配方法[J]. 工程科学学报. DOI: 10.13374/j.issn2095-9389.2023.11.05.002
A Question Matching Method Based on Gradual Machine Learning[J]. Chinese Journal of Engineering. DOI: 10.13374/j.issn2095-9389.2023.11.05.002
Citation: A Question Matching Method Based on Gradual Machine Learning[J]. Chinese Journal of Engineering. DOI: 10.13374/j.issn2095-9389.2023.11.05.002

基于渐进机器学习的中文问句匹配方法

A Question Matching Method Based on Gradual Machine Learning

  • 摘要: 问句匹配旨在判断不同问句的意图是否相近。近年来,随着大型预训练语言模型的发展,利用其挖掘问句对在语义层面隐含的匹配信息,取得了目前为止最好的性能。然而,由于基于独立同分布假设,在真实场景中,这些深度学习模型的性能仍然受制于训练数据的充足程度和目标数据与训练数据之间的分布漂移。本文提出一种基于渐进机器学习的中文问句匹配算法方法。该方法基于渐进机器学习框架,从不同角度提取问句特征,构建融合各类特征信息的因子图,然后通过迭代的因子推理实现从易到难的渐进学习。在特征建模中,我们设计并实现了两种类型特征的提取:(1)基于TF-IDF的关键词特征;(2)基于DNN的深度语义特征。最后,我们通过通用的基准中文数据集LCQMC和BQ corpus验证了所提方法的有效性。实验表明,相比于单纯的深度学习模型,基于渐进机器学习的方法可以有效提升问句匹配的准确率,且其性能优势随着标签训练数据的减少而增大。

     

    Abstract: Question matching aims to determine whether the intentions of two different questions are similar. In recent years, with the development of large-scale pre-trained language models, the state-of-the-art performance of question matching has been achieved by these deep models. However, due to the assumption of independent and identical distribution (I.I.D), the performance of these deep models in real scenarios is still limited by the adequacy of training data and the distribution drift between target data and training data. In this article, we propose a novel method for question matching based on the paradigm of gradual machine learning. It extracts diverse semantic features from different perspectives, and then constructs a factor graphs by fusing the extracted feature information to enable gradual learning from easy to hard. In feature modeling, we design and implement two types of features: 1) TF-IDF based keyword features; 2) DNN based deep features. Finally, we have validated the efficacy of the proposed method by a comparative study on the open-sourced benchmark datasets, LCQMC and BQ corpus. Our extensive experiments show that compared to pure deep learning models, our proposed method can effectively improve the accuracy of question matching, and its performance advantage increases with the decrease of labeled training data.

     

/

返回文章
返回