LIANG Yanpeng, LIU Xueer, MA Zhonggui, LI Zhuo. Causal image-text retrieval embedded with consensus knowledge[J]. Chinese Journal of Engineering, 2024, 46(2): 317-328. DOI: 10.13374/j.issn2095-9389.2023.05.28.001

Causal image-text retrieval embedded with consensus knowledge

  • Abstract: Cross-modal image-text retrieval is the task of retrieving data in one modality (e.g., images) given a query in another modality (e.g., text). Its key problem is how to accurately measure the similarity between the image and text modalities, which plays a crucial role in reducing the visual-semantic gap between the two heterogeneous modalities of vision and language. The traditional retrieval paradigm relies on deep learning to extract feature representations of images and texts and maps them into a common representation space for matching. However, this approach depends largely on surface correlations in the data, cannot uncover the true causal relationships behind the data, and faces challenges in representing high-level semantic information and in interpretability. To this end, causal inference and embedded consensus knowledge are introduced on the basis of deep learning, and a causal image-text retrieval method embedded with consensus knowledge is proposed. Specifically, causal intervention is introduced into the visual feature extraction module: common-sense causal visual features are learned by replacing correlations with causal relationships and are then concatenated with the original visual features to obtain the final visual representation. To remedy the method's insufficient text representation, the more powerful text feature extraction model BERT (bidirectional encoder representations from transformers) is adopted, and consensus knowledge shared between the two modalities is embedded to perform consensus-level representation learning on the image and text features. Experiments on the MS-COCO dataset and cross-dataset experiments from MS-COCO to Flickr30k demonstrate that the proposed method achieves consistent improvements in recall and mean recall on bidirectional image-text retrieval tasks.
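To make the causal-intervention step concrete, the following is a minimal sketch of how a causal visual branch of this kind is commonly realized: region features attend over a fixed confounder dictionary to approximate the backdoor adjustment P(Y|do(X)) = Σ_z P(Y|X, z)P(z), and the resulting causal features are concatenated with the original bottom-up-attention features. This is an illustrative reconstruction, not the authors' code; the class `CausalVisualBranch`, the feature dimensions, and the uniform prior over confounders are all assumptions.

```python
# Minimal sketch (assumptions, not the paper's implementation) of a causal
# visual branch: backdoor adjustment via attention over a confounder dictionary,
# with the causal features concatenated onto the original region features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalVisualBranch(nn.Module):
    def __init__(self, feat_dim=2048, num_confounders=80):
        super().__init__()
        # Confounder dictionary Z, e.g., per-class mean region features (fixed).
        self.register_buffer("z_dict", torch.randn(num_confounders, feat_dim))
        # Prior P(z); uniform here, class frequencies in practice (assumption).
        self.register_buffer("p_z", torch.full((num_confounders,), 1.0 / num_confounders))
        self.query = nn.Linear(feat_dim, feat_dim)
        self.key = nn.Linear(feat_dim, feat_dim)

    def forward(self, regions):                       # regions: (B, R, D)
        q = self.query(regions)                       # (B, R, D)
        k = self.key(self.z_dict)                     # (Z, D)
        attn = torch.einsum("brd,zd->brz", q, k) / k.shape[-1] ** 0.5
        # Weight attention over confounders by P(z): the do-operator
        # approximation averages over z rather than conditioning on it.
        attn = F.softmax(attn, dim=-1) * self.p_z     # (B, R, Z)
        causal = attn @ self.z_dict                   # (B, R, D)
        # Final representation: original features || causal features.
        return torch.cat([regions, causal], dim=-1)   # (B, R, 2D)
```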

     

    Abstract: Cross-modal image-text retrieval involves retrieving relevant images or texts based on a query from the opposite modality. Its primary challenge lies in precisely quantifying the similarity used for matching features across the two distinct modalities, which plays an important role in mitigating the visual-semantic disparities between the heterogeneous visual and linguistic domains. The task has extensive applications in areas such as e-commerce product search and medical image retrieval. Traditional retrieval paradigms depend on deep learning to extract feature representations from images and texts: cross-modal image-text retrieval learns semantic feature representations of the disparate modal data by harnessing the formidable feature-extraction ability of deep networks and subsequently maps them into a shared semantic space for semantic alignment. However, this approach primarily depends on superficial data correlations and lacks the capacity to reveal the latent causal relationships underpinning the data. Moreover, owing to the inherent black-box nature of deep learning, the interpretability of model predictions often eludes human comprehension, and an undue reliance on training data distributions impairs the generalization performance of the model. Consequently, existing methods face the challenge of representing high-level semantic information while maintaining interpretability. Causal inference, which endeavors to ascertain the causal effect of specific phenomena by isolating confounding factors through intervention, offers a novel avenue for enhancing the generalization capability and interpretability of deep models, and researchers have recently sought to combine visual and linguistic tasks with its principles. Accordingly, we introduce causal inference and embed consensus knowledge into the bedrock of deep learning and propose a novel causal image-text retrieval method with embedded consensus knowledge. Specifically, causal intervention is introduced into the visual feature extraction module, replacing correlated relationships with causal counterparts to learn common causal visual features. These features are then fused with the original visual features acquired through bottom-up attention, resulting in the final visual feature representation. To address the shortfall in textual feature representation, this study adopts the potent text encoder BERT (bidirectional encoder representations from transformers). Consensus knowledge shared between the two modalities is further embedded, allowing consensus-level representation learning of image-text features. Empirical validation on the MS-COCO dataset and cross-dataset experiments from MS-COCO to Flickr30k substantiate the capacity of the proposed method to consistently enhance recall and mean recall in bidirectional image-text retrieval tasks. In summary, this approach endeavors to bridge the gap between visual and textual representations by combining causal inference principles with shared consensus knowledge within a deep learning framework, thereby promising enhanced generalization and interpretability.
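As a concrete companion to the abstract, the sketch below shows the shared-embedding-space matching and the Recall@K / mean-recall protocol it refers to, with BERT as the text encoder via the Hugging Face `transformers` library. It is a hedged sketch under assumptions: the projection dimensions, the mean pooling over regions, and the use of the [CLS] token are illustrative choices, and the consensus-knowledge module is omitted.

```python
# Illustrative sketch (assumptions, not the paper's code) of matching in a
# shared embedding space and computing Recall@K / mean recall (mR).
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
text_proj = nn.Linear(768, 1024)   # project BERT features into the shared space
img_proj = nn.Linear(4096, 1024)   # project fused visual features (D || D above)

def embed_text(captions):
    toks = tokenizer(captions, padding=True, truncation=True, return_tensors="pt")
    out = bert(**toks).last_hidden_state[:, 0]    # [CLS] token as sentence feature
    return F.normalize(text_proj(out), dim=-1)

def embed_image(fused_regions):                   # (B, R, 4096) fused features
    pooled = fused_regions.mean(dim=1)            # simple average pooling (assumption)
    return F.normalize(img_proj(pooled), dim=-1)

def recall_at_k(img_emb, txt_emb, ks=(1, 5, 10)):
    # Cosine similarity, since both sides are L2-normalized; diagonal entries
    # of the similarity matrix are the ground-truth image-text pairs.
    sims = img_emb @ txt_emb.t()                  # (N, N)
    ranks = sims.argsort(dim=1, descending=True)
    gold = torch.arange(sims.shape[0]).unsqueeze(1)
    recalls = {k: (ranks[:, :k] == gold).any(dim=1).float().mean().item() for k in ks}
    recalls["mR"] = sum(recalls.values()) / len(ks)   # mean recall over the K values
    return recalls
```

The same `recall_at_k` call with `sims` transposed gives the text-to-image direction, which is how bidirectional retrieval results such as those on MS-COCO and Flickr30k are typically reported.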

     
