BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:Asia/Tokyo
X-LIC-LOCATION:Asia/Tokyo
BEGIN:STANDARD
TZOFFSETFROM:+0900
TZOFFSETTO:+0900
TZNAME:JST
DTSTART:18871231T000000
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20250110T023312Z
LOCATION:Hall B5 (2)\, B Block\, Level 5
DTSTART;TZID=Asia/Tokyo:20241204T130000
DTEND;TZID=Asia/Tokyo:20241204T131100
UID:siggraphasia_SIGGRAPH Asia 2024_sess116_papers_308@linklings.com
SUMMARY:DiffUHaul: A Training-Free Method for Object Dragging in Images
DESCRIPTION:Technical Papers\n\nOmri Avrahami (Hebrew University of Jerusalem), Rinon Gal (Tel Aviv University), Gal Chechik (NVIDIA), Ohad Fried (The Interdisciplinary Center Herzliya), Dani Lischinski (Hebrew University of Jerusalem), and Arash Vahdat and Weili Nie (NVIDIA)\n\nText-to-image diffusion models have proven effective for solving many image editing tasks.\nHowever, the seemingly straightforward task of seamlessly relocating objects within a scene remains surprisingly challenging. Existing methods addressing this problem often struggle to function reliably in real-world scenarios due to a lack of spatial reasoning.\nIn this work, we propose a training-free method, dubbed DiffUHaul, that harnesses the spatial understanding of a localized text-to-image model for the object dragging task.\nBlindly manipulating the layout inputs of the localized model tends to yield poor editing performance due to the intrinsic entanglement of object representations in the model. To this end, we first apply attention masking in each denoising step to make the generation more disentangled across different objects, and we adopt a self-attention sharing mechanism to preserve the high-level object appearance. Furthermore, we propose a new diffusion anchoring technique: in the early denoising steps, we interpolate the attention features between the source and target images to smoothly fuse new layouts with the original appearance; in the later denoising steps, we pass the localized features from the source images to the interpolated images to retain fine-grained object details. To adapt DiffUHaul to real-image editing, we apply DDPM self-attention bucketing, which better reconstructs real images with the localized model.\nFinally, we introduce an automated evaluation pipeline for this task and showcase the efficacy of our method. Our results are reinforced through a user preference study.\n\nRegistration Category: Full Access, Full Access Supporter\n\nLanguage Format: English Language\n\nSession Chair: Dani Lischinski (Hebrew University of Jerusalem, Google)
URL:https://asia.siggraph.org/2024/program/?id=papers_308&sess=sess116
END:VEVENT
END:VCALENDAR