BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:Australia/Melbourne
X-LIC-LOCATION:Australia/Melbourne
BEGIN:DAYLIGHT
TZOFFSETFROM:+1000
TZOFFSETTO:+1100
TZNAME:AEDT
DTSTART:19721003T020000
RRULE:FREQ=YEARLY;BYMONTH=10;BYDAY=1SU
END:DAYLIGHT
BEGIN:STANDARD
DTSTART:19721003T020000
TZOFFSETFROM:+1100
TZOFFSETTO:+1000
TZNAME:AEST
RRULE:FREQ=YEARLY;BYMONTH=4;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20240214T070249Z
LOCATION:Meeting Room C4.11\, Level 4 (Convention Centre)
DTSTART;TZID=Australia/Melbourne:20231215T105000
DTEND;TZID=Australia/Melbourne:20231215T110000
UID:siggraphasia_SIGGRAPH Asia 2023_sess135_papers_304@linklings.com
SUMMARY:Break-A-Scene: Extracting Multiple Concepts from a Single Image
DESCRIPTION:Technical Papers, TOG\n\nOmri Avrahami (The Hebrew University of Jerusalem), Kfir Aberman (Google Research), Ohad Fried (Reichman University), Daniel Cohen-Or (Tel Aviv University), and Dani Lischinski (The Hebrew University of Jerusalem)\n\nText-to-image model personalization aims to introduce a user-provided concept to the model, allowing its synthesis in diverse contexts. However, current methods primarily focus on the case of learning a single concept from multiple images with variations in backgrounds and poses, and struggle when adapted to a different scenario. In this work, we introduce the task of textual scene decomposition: given a single image of a scene that may contain several concepts, we aim to extract a distinct text token for each concept, enabling fine-grained control over the generated scenes. To this end, we propose augmenting the input image with masks that indicate the presence of target concepts. These masks can be provided by the user or generated automatically by a pre-trained segmentation model. We then present a novel two-phase customization process that optimizes a set of dedicated textual embeddings (handles), as well as the model weights, striking a delicate balance between accurately capturing the concepts and avoiding overfitting. We employ a masked diffusion loss to enable handles to generate their assigned concepts, complemented by a novel loss on cross-attention maps to prevent entanglement. We also introduce union-sampling, a training strategy aimed to improve the ability of combining multiple concepts in generated images. We use several automatic metrics to quantitatively compare our method against several baselines, and further affirm the results using a user study. Finally, we showcase several applications of our method.\n\nRegistration Category: Full Access\n\nSession Chair: Chongyang Ma (ByteDance)
URL:https://asia.siggraph.org/2023/full-program?id=papers_304&sess=sess135
END:VEVENT
END:VCALENDAR