Abstract:
In recent years, deep learning models incorporating Transformer components have pushed the performance boundaries of image editing tasks, including image harmonization. Unlike Convolutional Neural Networks (CNNs), which rely on static local filters, Transformers employ a self-attention mechanism that performs adaptive, non-local filtering and thereby captures long-range context. However, this capability comes at the cost of significant model complexity, which can hinder learning efficiency, especially on medium-scale image datasets. Here, we propose Hamba, a novel image harmonization network that leverages Selective State Space Modeling (SSM) to capture long-range context effectively while maintaining local precision. To this end, Hamba adopts a U-shaped architecture built from VSS blocks, in which SSM layers are applied across multiple spatial dimensions to learn contextual relationships. Moreover, it relates the semantic and stylistic features of the foreground and background in composite images through a local-global feature sequence extractor. Our findings demonstrate that Hamba outperforms state-of-the-art CNN-based and Transformer-based methods.
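As an illustrative aside (not part of the paper's method or released code), the sketch below shows, under assumed names and shapes, the core mechanism the abstract refers to: a selective state-space scan in which the step size and the input/output matrices depend on the current token, so long-range context over a flattened feature sequence is aggregated at cost linear in sequence length rather than the quadratic cost of self-attention.

# Illustrative sketch only: a minimal selective state-space (S6-style) scan in
# NumPy. All names, shapes, and the simplified discretization are assumptions
# for illustration, not Hamba's implementation.
import numpy as np

def selective_ssm_scan(x, A, W_delta, W_B, W_C):
    """x: (L, D) sequence of D-dim features over L tokens (e.g. flattened patches).
    A: (D, N) fixed state matrix with negative entries (for stability).
    W_delta, W_B, W_C: projections that make Delta, B, C input-dependent."""
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                            # hidden state per channel
    y = np.zeros_like(x)
    for t in range(L):
        xt = x[t]                                   # (D,)
        delta = np.log1p(np.exp(xt @ W_delta))      # (D,) softplus -> positive step size
        B = xt @ W_B                                # (N,) input-dependent input matrix
        C = xt @ W_C                                # (N,) input-dependent output matrix
        A_bar = np.exp(delta[:, None] * A)          # (D, N) zero-order-hold discretization
        B_bar = delta[:, None] * B[None, :]         # (D, N) simplified (Euler) discretization
        h = A_bar * h + B_bar * xt[:, None]         # selective recurrence
        y[t] = h @ C                                # (D,) read-out
    return y

# Example: scan a flattened 16x16 feature map with 8 channels and state size 4.
rng = np.random.default_rng(0)
L, D, N = 16 * 16, 8, 4
x = rng.standard_normal((L, D)).astype(np.float32)
A = -np.abs(rng.standard_normal((D, N)))
out = selective_ssm_scan(x, A,
                         rng.standard_normal((D, D)) * 0.1,
                         rng.standard_normal((D, N)) * 0.1,
                         rng.standard_normal((D, N)) * 0.1)
print(out.shape)  # (256, 8)

In a VSS-style block, such scans are typically run along several spatial traversal orders of the feature map and their outputs merged, which is what gives the model non-local context without attention's quadratic complexity.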