BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:Australia/Melbourne
X-LIC-LOCATION:Australia/Melbourne
BEGIN:DAYLIGHT
TZOFFSETFROM:+1000
TZOFFSETTO:+1100
TZNAME:AEDT
DTSTART:19721003T020000
RRULE:FREQ=YEARLY;BYMONTH=10;BYDAY=1SU
END:DAYLIGHT
BEGIN:STANDARD
DTSTART:19721003T020000
TZOFFSETFROM:+1100
TZOFFSETTO:+1000
TZNAME:AEST
RRULE:FREQ=YEARLY;BYMONTH=4;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20260114T163633Z
LOCATION:Darling Harbour Theatre\, Level 2 (Convention Centre)
DTSTART;TZID=Australia/Melbourne:20231212T093000
DTEND;TZID=Australia/Melbourne:20231212T124500
UID:siggraphasia_SIGGRAPH Asia 2023_sess209_papers_304@linklings.com
SUMMARY:Break-A-Scene: Extracting Multiple Concepts from a Single Image
DESCRIPTION:Omri Avrahami (The Hebrew University of Jerusalem), Kfir Aberm
 an (Google Research), Ohad Fried (Reichman University), Daniel Cohen-Or (T
 el Aviv University), and Dani Lischinski (The Hebrew University of Jerusal
 em)\n\nText-to-image model personalization aims to introduce a user-provid
 ed concept to the model, allowing its synthesis in diverse contexts. Howev
 er, current methods primarily focus on the case of learning a single conce
 pt from multiple images with variations in backgrounds and poses, and stru
 ggle when adapted to a different scenario. In this work, we introduce the 
 task of textual scene decomposition: given a single image of a scene that 
 may contain several concepts, we aim to extract a distinct text token for 
 each concept, enabling fine-grained control over the generated scenes. To 
 this end, we propose augmenting the input image with masks that indicate t
 he presence of target concepts. These masks can be provided by the user or
  generated automatically by a pre-trained segmentation model. We then pres
 ent a novel two-phase customization process that optimizes a set of dedica
 ted textual embeddings (handles), as well as the model weights, striking a
  delicate balance between accurately capturing the concepts and avoiding o
 verfitting. We employ a masked diffusion loss to enable handles to generat
 e their assigned concepts, complemented by a novel loss on cross-attention
  maps to prevent entanglement. We also introduce union-sampling, a trainin
 g strategy aimed at improving the ability to combine multiple concepts in
  generated images. We use several automatic metrics to quantitatively
  compa
 re our method against several baselines, and further affirm the results us
 ing a user study. Finally, we showcase several applications of our method.
 \n\nRegistration Category: Full Access, Enhanced Access, Trade Exhibitor, 
 Experience Hall Exhibitor\n\n
URL:https://asia.siggraph.org/2023/full-program?id=papers_304&sess=sess209
END:VEVENT
END:VCALENDAR