BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:Australia/Melbourne
X-LIC-LOCATION:Australia/Melbourne
BEGIN:DAYLIGHT
TZOFFSETFROM:+1000
TZOFFSETTO:+1100
TZNAME:AEDT
DTSTART:19721003T020000
RRULE:FREQ=YEARLY;BYMONTH=10;BYDAY=1SU
END:DAYLIGHT
BEGIN:STANDARD
DTSTART:19721003T020000
TZOFFSETFROM:+1100
TZOFFSETTO:+1000
TZNAME:AEST
RRULE:FREQ=YEARLY;BYMONTH=4;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20240214T070249Z
LOCATION:Meeting Room C4.11\, Level 4 (Convention Centre)
DTSTART;TZID=Australia/Melbourne:20231215T105000
DTEND;TZID=Australia/Melbourne:20231215T110000
UID:siggraphasia_SIGGRAPH Asia 2023_sess135_papers_304@linklings.com
SUMMARY:Break-A-Scene: Extracting Multiple Concepts from a Single Image
DESCRIPTION:Technical Papers, TOG\n\nOmri Avrahami (The Hebrew University of Jerusalem), Kfir Aberman (Google Research), Ohad Fried (Reichman University), Daniel Cohen-Or (Tel Aviv University), and Dani Lischinski (The Hebrew University of Jerusalem)\n\nText-to-image model personalization aims to introduce a user-provided concept to the model, allowing its synthesis in diverse contexts. However, current methods primarily focus on the case of learning a single concept from multiple images with variations in backgrounds and poses, and struggle when adapted to a different scenario. In this work, we introduce the task of textual scene decomposition: given a single image of a scene that may contain several concepts, we aim to extract a distinct text token for each concept, enabling fine-grained control over the generated scenes. To this end, we propose augmenting the input image with masks that indicate the presence of target concepts. These masks can be provided by the user or generated automatically by a pre-trained segmentation model. We then present a novel two-phase customization process that optimizes a set of dedicated textual embeddings (handles), as well as the model weights, striking a delicate balance between accurately capturing the concepts and avoiding overfitting. We employ a masked diffusion loss to enable handles to generate their assigned concepts, complemented by a novel loss on cross-attention maps to prevent entanglement. We also introduce union-sampling, a training strategy aimed to improve the ability of combining multiple concepts in generated images. We use several automatic metrics to quantitatively compare our method against several baselines, and further affirm the results using a user study. Finally, we showcase several applications of our method.\n\nRegistration Category: Full Access\n\nSession Chair: Chongyang Ma (ByteDance)
URL:https://asia.siggraph.org/2023/full-program?id=papers_304&sess=sess135
END:VEVENT
END:VCALENDAR