Abstract:
Image harmonization adjusts appearance features, such as lighting and color, so that the foreground of a composite image is consistent with its background; with the rapid development of image processing technologies, it has become a focal point of attention in both academia and industry. The central challenge in this research area is to design harmonization methods that achieve both local content integrity and global style consistency. Traditional methods rely primarily on matching low-level features, such as gradients and color histograms, and thus maintain good color coherence; however, they lack semantic awareness of the contextual relationship between the foreground and background, which produces unrealistic results when content and style disagree. In recent years, deep-learning-based harmonization methods have made significant progress. Pixel-wise matching methods use convolutional encoder-decoder models to learn transformations from background to foreground pixel features, but the limited receptive fields of convolutional neural networks (CNNs) restrict them to local regional references, making it difficult to propagate the overall background information into the foreground. Region-based matching methods instead treat the foreground and background as two different styles or domains; although they achieve globally consistent harmonization results, they often overlook the spatial differences between the two regions. Recently, state-space models (SSMs), particularly the Mamba model built on the selective state-space formulation, have advanced rapidly: Mamba uses a selective scanning mechanism to capture global relationships with linear complexity and has demonstrated excellent performance across a range of computer vision tasks. However, because its one-dimensional selective scan flattens two-dimensional feature maps, Mamba cannot maintain spatial dependencies between adjacent features and therefore lacks local consistency. In this study, we draw inspiration from the operating principles of CNNs and transformer models and introduce both global and local features into the Mamba model to build an image harmonization model with global-local context awareness. Specifically, we propose a novel learning-based image harmonization model called GLIHamba (global-local context image harmonization based on Mamba). Its core components are a local feature sequence extractor (LFSE) and a global feature sequence extractor (GFSE). The LFSE preserves the locality of adjacent features in the high-dimensional feature arrays, explicitly enforcing consistency among spatially neighboring features along the channel dimension and thereby guaranteeing the local content integrity and consistency of the harmonized result. The GFSE, in contrast, compresses features across all spatial dimensions to maintain the overall style consistency of the image. Experimental results demonstrate that GLIHamba outperforms previous CNN- and transformer-based methods on image harmonization tasks: on the iHarmony4 dataset, it achieves a PSNR of 39.76 dB, and it also performs well on real-scene data.
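For intuition about the linear-complexity scan mentioned above, the following is a minimal sequential reference of a Mamba-style selective scan, not the fused kernel used in practice; the single-channel setting, shapes, and function name are our own illustrative assumptions:

```python
import torch

def selective_scan_reference(u, delta, A, B, C):
    """Illustrative single-channel selective scan (not the official
    mamba-ssm kernel). Shapes: u, delta: (L,); A: (N,); B, C: (L, N).
    Recurrence: h_t = exp(delta_t * A) * h_{t-1} + delta_t * B_t * u_t
    Readout:    y_t = <C_t, h_t>. Total cost is linear in L because
    each step touches only the fixed-size hidden state h.
    """
    L, N = B.shape
    h = torch.zeros(N)
    ys = []
    for t in range(L):
        # Input-dependent (selective) discretization of the SSM dynamics.
        h = torch.exp(delta[t] * A) * h + delta[t] * B[t] * u[t]
        ys.append(torch.dot(C[t], h))
    return torch.stack(ys)

# Example: a sequence of L = 6 tokens with a hidden state of size N = 4.
L, N = 6, 4
y = selective_scan_reference(
    torch.randn(L), torch.rand(L), -torch.rand(N),  # negative A for stability
    torch.randn(L, N), torch.randn(L, N))
print(y.shape)  # torch.Size([6])
```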
In summary, the proposed GLIHamba model offers a novel solution to the challenges of image harmonization by integrating global and local context awareness, achieving superior performance compared with existing methods.
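As a rough illustration of the two extractors described in the abstract, one possible PyTorch sketch follows; the abstract does not specify this code, so the module structure, kernel size, and projection are hypothetical:

```python
import torch
import torch.nn as nn

class LFSE(nn.Module):
    """Hypothetical local feature sequence extractor: unfolds k x k
    neighborhoods so each sequence token carries its spatial neighbors
    along the channel dimension, preserving local dependencies."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.unfold = nn.Unfold(kernel_size, padding=kernel_size // 2)
        self.proj = nn.Linear(channels * kernel_size ** 2, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) -> (B, H*W, C*k*k) -> (B, H*W, C)
        tokens = self.unfold(x).transpose(1, 2)
        return self.proj(tokens)

class GFSE(nn.Module):
    """Hypothetical global feature sequence extractor: flattens all
    spatial positions into one sequence so a selective-scan block can
    relate every foreground token to the full background."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) -> (B, H*W, C)
        return x.flatten(2).transpose(1, 2)

# Usage: both extractors emit (B, L, C) sequences ready for an SSM block.
feat = torch.randn(2, 64, 32, 32)
local_seq = LFSE(64)(feat)   # (2, 1024, 64), locality kept in each token
global_seq = GFSE()(feat)    # (2, 1024, 64), raster-scan global sequence
```

The design intent in this sketch is that the LFSE token already contains its spatial neighborhood, so a 1D scan over the sequence cannot sever adjacent-pixel dependencies, while the GFSE sequence exposes the entire background to every foreground position.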