Improving Image Editability in Image Generation with Layer-wise Memory

CVPR 2025

Seoul National University, Republic of Korea

Abstract

Most real-world image editing tasks require multiple sequential edits to achieve the desired result. Current editing approaches, primarily designed for single-object modifications, struggle with sequential editing, especially with maintaining previous edits while adapting new objects naturally into the existing content. These limitations significantly hinder complex editing scenarios where multiple objects need to be modified while preserving their contextual relationships. We address this fundamental challenge through two key proposals: enabling rough mask inputs that preserve existing content while naturally integrating new elements, and supporting consistent editing across multiple modifications. Our framework achieves this through layer-wise memory, which stores latent representations and prompt embeddings from previous edits. We propose Background Consistency Guidance, which leverages memorized latents to maintain scene coherence, and Multi-Query Disentanglement in cross-attention, which ensures natural adaptation to existing content. To evaluate our method, we present a new benchmark dataset incorporating semantic alignment metrics and interactive editing scenarios. Through comprehensive experiments, we demonstrate superior performance in iterative image editing tasks with minimal user effort, requiring only rough masks while maintaining high-quality results throughout multiple editing steps.

Our framework enables interactive image generation with enhanced control yet simple user input, requiring only a rough mask and a text prompt at each step of iterative scene editing.

Improving Editability

Sequential image editing is crucial since real-world image generation tasks typically involve iterative refinements, where new elements are continuously added or modified to progressively achieve a desired scene. Existing methods, primarily optimized for single-step edits, struggle with maintaining consistency when sequential edits are applied. Our work addresses this by enabling intuitive, iterative editing using rough masks and textual prompts, preserving previously edited content while naturally integrating new elements, as illustrated in the figure below.



Our Framework

To address the challenges of sequential image editing, we propose a novel framework consisting of Background Consistency Guidance (BCG) and Multi-Query Disentangled Cross-Attention (MQD), which together enable iterative image editing with minimal user effort. Additionally, we propose a benchmark for the scenario of inserting multiple objects with masks, enabling evaluation of sequential edits performed with inpainting masks.
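
As a rough illustration of how a layer-wise memory could be organized, the sketch below stores, for each edit step, its mask, the resulting latent, and the prompt embedding. This is our own minimal, hypothetical structure (all names, e.g. LayerEntry and LayerwiseMemory, are ours), not the implementation released with the paper.

from dataclasses import dataclass, field
from typing import List
import torch

@dataclass
class LayerEntry:
    # One edit step: the rough mask, the latent after the edit, and the prompt embedding.
    mask: torch.Tensor        # (1, 1, H, W) binary mask of the edited region
    latent: torch.Tensor      # (1, C, h, w) latent of the scene after this edit
    prompt_emb: torch.Tensor  # (1, T, D) text embedding of this edit's prompt

@dataclass
class LayerwiseMemory:
    # Ordered memory of all edits so far; index 0 holds the initial scene.
    layers: List[LayerEntry] = field(default_factory=list)

    def push(self, mask: torch.Tensor, latent: torch.Tensor, prompt_emb: torch.Tensor) -> None:
        self.layers.append(LayerEntry(mask, latent, prompt_emb))

    def latest_latent(self) -> torch.Tensor:
        # Most recent composite latent, used as the starting point for the next edit.
        return self.layers[-1].latent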


BCG and MQD

Thanks to Background Consistency Guidance (BCG) and Multi-Query Disentanglement (MQD), our framework significantly enhances sequential image editing. BCG ensures faster editing by efficiently leveraging stored latents to preserve visual coherence without redundant computations. MQD naturally integrates new objects into existing scenes by disentangling cross-attention based on mask order and semantic context. Together, these components enable rapid, consistent, and intuitive multi-step editing with minimal user effort.
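
To make the two components concrete, the sketch below gives one plausible reading of them: BCG as blending the memorized background latent back into the current latent outside the edit mask at each denoising step, and MQD as routing each spatial query to the text keys and values of the layer it belongs to, with newer masks taking priority. The function names, tensor shapes, and exact blending rule are our assumptions for illustration, not the released code.

import torch

def background_consistency_guidance(z_t, z_mem_t, mask):
    # z_t     : (1, C, h, w) current denoising latent
    # z_mem_t : (1, C, h, w) memorized latent from previous edits, noised to the same step t
    # mask    : (1, 1, h, w) 1 inside the region being edited, 0 elsewhere
    # Outside the edit mask, stay tied to the memorized latent so the background is preserved.
    return mask * z_t + (1.0 - mask) * z_mem_t

def multi_query_disentangled_attention(q, keys, values, masks):
    # q      : (HW, D) image-feature queries at one cross-attention layer
    # keys   : list of (T, D) per-layer text keys, in editing order (oldest first)
    # values : list of (T, D) per-layer text values
    # masks  : list of (HW,) boolean masks; later layers take priority where they overlap
    out = torch.zeros_like(q)
    assigned = torch.zeros(q.shape[0], dtype=torch.bool)
    for k, v, m in reversed(list(zip(keys, values, masks))):
        pick = m & ~assigned  # queries belonging to this layer and not yet claimed
        if pick.any():
            attn = torch.softmax(q[pick] @ k.T / k.shape[-1] ** 0.5, dim=-1)
            out[pick] = attn @ v
            assigned |= pick
    return out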

MultiEdit-Bench

MultiEdit-Bench is a comprehensive benchmark specifically designed to evaluate iterative image editing tasks, addressing limitations of previous benchmarks that primarily focused on single-step edits. It introduces scenarios with layered compositions and varying occlusion levels to assess the semantic and visual alignment of sequential edits. By incorporating detailed layer-wise prompts and masks generated with GPT-4, MultiEdit-Bench effectively measures a model’s capability to maintain spatial consistency and semantic accuracy across multiple editing steps. This benchmark thus provides robust evaluation criteria for advanced editing frameworks.
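
For intuition, a MultiEdit-Bench-style sample might pair a scene prompt with an ordered list of per-step prompts and masks, where later steps can occlude earlier ones. The field names and file paths below are hypothetical and only illustrate the layered structure; they are not the released benchmark format.

# A hypothetical sample layout (all field names and paths are ours, for illustration only).
sample = {
    "scene_prompt": "a quiet street at dusk",
    "edit_steps": [
        {"step": 1, "prompt": "a red bus parked by the curb",
         "mask": "masks/step1_bus.png", "occludes": []},
        {"step": 2, "prompt": "a cyclist passing in front of the bus",
         "mask": "masks/step2_cyclist.png", "occludes": [1]},
        {"step": 3, "prompt": "a street lamp behind the bus",
         "mask": "masks/step3_lamp.png", "occludes": []},
    ],
}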

Results

Our qualitative results demonstrate that our method clearly surpasses existing editing models by maintaining background consistency and integrating objects according to user intent across multiple sequential edits. HD-Painter fails to preserve the background (the bus) and does not integrate the inserted object well into the scene, while Blended Latent Diffusion (BLD) fails to generate the desired objects.

For more details, please check out the paper (to be released).


Qualitative Comparison with State-of-the-Art Editing Models

Our method shows superior results over the baselines (i.e., Blended Latent Diffusion, HD-Painter) in other examples as well.

More Qualitative Comparison

Improved Editing: Object Deletion

Our framework further improves editability by enabling efficient object deletion, particularly in scenarios involving overlapping elements. By utilizing stored latent representations from previous editing steps, we selectively erase objects while preserving foreground integrity and background coherence. Leveraging Background Consistency Guidance (BCG) and Multi-Query Disentanglement (MQD), object removal is achieved swiftly and naturally without requiring precise masks or additional forward passes, significantly enhancing editing flexibility.
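
A simplified sketch of this idea, under our own assumptions about how the memory is stored, is to restore the latent saved just before the object was inserted, but only inside that object's mask; in the actual framework, BCG and MQD would then re-harmonize the region. The function below is illustrative, not the paper's code.

import torch

def delete_object(memory_layers, layer_idx):
    # memory_layers : list of (mask, latent) pairs in editing order; each latent is the
    #                 (1, C, h, w) scene latent *after* that edit, and each mask is the
    #                 (1, 1, h, w) region that edit modified. layer_idx >= 1 is assumed.
    mask, _ = memory_layers[layer_idx]
    base_latent = memory_layers[layer_idx - 1][1]  # scene just before the object was added
    current_latent = memory_layers[-1][1]          # scene with every edit applied so far
    # Restore the pre-insertion content under the object's mask, keep everything else.
    return mask * base_latent + (1.0 - mask) * current_latent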

For more details, please check out the paper (to be released).


Quantitative Comparison with State-of-the-Art
Qualitative Comparison with State-of-the-Art

BibTeX

@inproceedings{dkm2025improving,
  author    = {Kim, Daneul and Lee, Jaeah and Park, Jaesik},
  title     = {Improving Image Editability in Image Generation with Layer-wise Memory},
  booktitle = {CVPR},
  year      = {2025},
}