Hamba: Image Harmonization based on Selective State Space Model

Abstract: In recent years, deep learning models incorporating Transformer components have pushed the performance boundaries of image editing tasks, including image harmonization. Unlike Convolutional Neural Networks (CNNs), which rely on static local filters, Transformers employ a self-attention mechanism that enables adaptive non-local filtering to sensitively capture long-range context. This sensitivity, however, comes at the cost of substantial model complexity, which can hinder learning efficiency, especially on imaging datasets of moderate scale. Here, we propose Hamba, a novel network for image harmonization that leverages Selective State Space Modeling (SSM) to capture long-range context effectively while maintaining local precision. To this end, Hamba constructs a U-shaped network from VSS blocks, in which SSM layers are applied along multiple spatial dimensions to learn contextual relationships, and a local-global feature sequence extractor establishes connections between the semantic and stylistic features of the foreground and background in composite images. Our results demonstrate that Hamba outperforms state-of-the-art CNN-based and Transformer-based methods.
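
As a rough illustration of the selective-scan recurrence that Mamba-style VSS blocks are built on, the sketch below runs a single-channel selective SSM in NumPy, where the projections B, C and the step size delta are input-dependent (the "selective" part). This is a minimal sketch under assumed shapes and parameter names; it is not Hamba's actual implementation.

```python
# Minimal selective state-space scan (Mamba-style), for illustration only.
# All shapes and names here are assumptions, not Hamba's real code.
import numpy as np

def selective_scan(x, A, B, C, delta):
    """Sequential selective SSM over one feature channel.

    x:     (L,)   input sequence, e.g. one flattened scan path of image tokens
    A:     (N,)   diagonal state-transition parameters, shared across time
    B, C:  (L, N) input-dependent projections (the "selective" part)
    delta: (L,)   input-dependent step sizes
    Returns y: (L,) output sequence.
    """
    L, N = B.shape
    h = np.zeros(N)                      # hidden state
    y = np.empty(L)
    for t in range(L):
        A_bar = np.exp(delta[t] * A)     # zero-order-hold discretization of A
        B_bar = delta[t] * B[t]          # simple Euler discretization of B
        h = A_bar * h + B_bar * x[t]     # state update
        y[t] = C[t] @ h                  # readout
    return y

# Toy usage: a length-16 sequence with a 4-dimensional hidden state.
rng = np.random.default_rng(0)
L, N = 16, 4
x = rng.standard_normal(L)
A = -np.abs(rng.standard_normal(N))      # negative values for stability
B = rng.standard_normal((L, N))
C = rng.standard_normal((L, N))
delta = np.abs(rng.standard_normal(L)) * 0.1
print(selective_scan(x, A, B, C, delta).shape)   # -> (16,)
```

A 2D variant would apply this scan along several orderings of the flattened spatial grid (e.g. row-wise and column-wise, forward and backward), which is presumably what employing SSM layers "across multiple spatial dimensions" refers to.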
