BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:Australia/Melbourne
X-LIC-LOCATION:Australia/Melbourne
BEGIN:DAYLIGHT
TZOFFSETFROM:+1000
TZOFFSETTO:+1100
TZNAME:AEDT
DTSTART:19721003T020000
RRULE:FREQ=YEARLY;BYMONTH=10;BYDAY=1SU
END:DAYLIGHT
BEGIN:STANDARD
DTSTART:19721003T020000
TZOFFSETFROM:+1100
TZOFFSETTO:+1000
TZNAME:AEST
RRULE:FREQ=YEARLY;BYMONTH=4;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20240214T070240Z
LOCATION:Darling Harbour Theatre\, Level 2 (Convention Centre)
DTSTART;TZID=Australia/Melbourne:20231212T093000
DTEND;TZID=Australia/Melbourne:20231212T124500
UID:siggraphasia_SIGGRAPH Asia 2023_sess209_papers_304@linklings.com
SUMMARY:Break-A-Scene: Extracting Multiple Concepts from a Single Image
DESCRIPTION:Technical Papers\n\nOmri Avrahami (The Hebrew University of Jerusalem), Kfir Aberman (Google Research), Ohad Fried (Reichman University), Daniel Cohen-Or (Tel Aviv University), and Dani Lischinski (The Hebrew University of Jerusalem)\n\nText-to-image model personalization aims to introduce a user-provided concept to the model, allowing its synthesis in diverse contexts. However, current methods primarily focus on the case of learning a single concept from multiple images with variations in backgrounds and poses, and struggle when adapted to a different scenario. In this work, we introduce the task of textual scene decomposition: given a single image of a scene that may contain several concepts, we aim to extract a distinct text token for each concept, enabling fine-grained control over the generated scenes. To this end, we propose augmenting the input image with masks that indicate the presence of target concepts. These masks can be provided by the user or generated automatically by a pre-trained segmentation model. We then present a novel two-phase customization process that optimizes a set of dedicated textual embeddings (handles), as well as the model weights, striking a delicate balance between accurately capturing the concepts and avoiding overfitting. We employ a masked diffusion loss to enable handles to generate their assigned concepts, complemented by a novel loss on cross-attention maps to prevent entanglement. We also introduce union-sampling, a training strategy aimed at improving the ability to combine multiple concepts in generated images. We use several automatic metrics to quantitatively compare our method against several baselines, and further affirm the results using a user study. Finally, we showcase several applications of our method.\n\nRegistration Category: Full Access, Enhanced Access, Trade Exhibitor, Experience Hall Exhibitor
URL:https://asia.siggraph.org/2023/full-program?id=papers_304&sess=sess209
END:VEVENT
END:VCALENDAR
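
As a reference for consuming a feed entry like the one above, here is a minimal Python sketch: it unfolds RFC 5545 line folding (long properties such as DESCRIPTION are normally wrapped onto continuation lines that begin with a space), collects the VEVENT properties, and resolves DTSTART through the IANA TZID via zoneinfo rather than the embedded VTIMEZONE. The filename break_a_scene.ics is hypothetical and the parser is deliberately incomplete; a real application would use a full library such as the icalendar package, which also handles text escaping and recurrence rules.

    import re
    from datetime import datetime
    from zoneinfo import ZoneInfo

    def unfold(ics_text):
        # RFC 5545 folding: a CRLF followed by a space or tab continues the previous line.
        return re.sub(r"\r?\n[ \t]", "", ics_text).splitlines()

    def vevent_properties(lines):
        # Collect NAME;PARAMS:VALUE properties that appear inside BEGIN:VEVENT ... END:VEVENT.
        props, inside = {}, False
        for line in lines:
            if line == "BEGIN:VEVENT":
                inside = True
            elif line == "END:VEVENT":
                inside = False
            elif inside and ":" in line:
                name_and_params, value = line.split(":", 1)
                name, _, params = name_and_params.partition(";")
                props[name] = (params, value)
        return props

    with open("break_a_scene.ics", encoding="utf-8") as f:  # hypothetical filename
        props = vevent_properties(unfold(f.read()))

    # DTSTART;TZID=Australia/Melbourne:20231212T093000 -> timezone-aware datetime
    params, value = props["DTSTART"]
    tzid = dict(p.split("=", 1) for p in params.split(";"))["TZID"]
    start = datetime.strptime(value, "%Y%m%dT%H%M%S").replace(tzinfo=ZoneInfo(tzid))

    print(props["SUMMARY"][1])  # Break-A-Scene: Extracting Multiple Concepts from a Single Image
    print(start.isoformat())    # 2023-12-12T09:30:00+11:00 (AEDT)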
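
The abstract in the DESCRIPTION mentions a masked diffusion loss that restricts the reconstruction objective to the regions covered by each concept's mask. Below is one plausible form of such a loss, sketched in PyTorch; it is an illustration inferred from the abstract, not the authors' released implementation, and the tensor shapes (noise predictions of shape B x C x H x W with a single-channel mask) are assumptions.

    import torch

    def masked_diffusion_loss(eps_pred, eps_true, mask):
        # eps_pred, eps_true: (B, C, H, W) predicted and ground-truth noise
        # mask: (B, 1, H, W) binary mask marking the target concept(s)
        err = (eps_pred - eps_true) ** 2 * mask          # mask broadcast over channels
        denom = mask.expand_as(err).sum().clamp(min=1.0) # number of masked elements
        return err.sum() / denom

    # Usage sketch: loss = masked_diffusion_loss(model_output, noise, concept_mask)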