BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:Asia/Tokyo
X-LIC-LOCATION:Asia/Tokyo
BEGIN:STANDARD
TZOFFSETFROM:+0900
TZOFFSETTO:+0900
TZNAME:JST
DTSTART:18871231T000000
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20250110T023312Z
LOCATION:Hall B5 (2)\, B Block\, Level 5
DTSTART;TZID=Asia/Tokyo:20241205T144500
DTEND;TZID=Asia/Tokyo:20241205T145600
UID:siggraphasia_SIGGRAPH Asia 2024_sess134_papers_608@linklings.com
SUMMARY:Still-Moving: Customized Video Generation without Customized Video Data
DESCRIPTION:Technical Papers\n\nHila Chefer (Google Research, Tel Aviv University); Shiran Zada, Roni Paiss, Ariel Ephrat, Omer Tov, and Michael Rubinstein (Google Research); Lior Wolf (Tel Aviv University); Tali Dekel (Google Research, Weizmann Institute of Science); Tomer Michaeli (Google Research, Technion – Israel Institute of Technology); and Inbar Mosseri (Google Research)\n\nCustomizing text-to-image (T2I) models has seen tremendous progress recently, particularly in areas such as personalization, stylization, and conditional generation. However, expanding this progress to video generation is still in its infancy, primarily due to the lack of customized video data.\nIn this work, we introduce Still-Moving, a novel generic framework for customizing a text-to-video (T2V) model without requiring any customized video data. The framework applies to the prominent T2V design where the video model is built over a text-to-image (T2I) model (e.g., via inflation). We assume access to a customized version of the T2I model, trained only on still image data (e.g., using DreamBooth or StyleDrop).\nNaively plugging the weights of the customized T2I model into the T2V model often leads to significant artifacts or insufficient adherence to the customization data.\nTo overcome this issue, we train lightweight Spatial Adapters that adjust the features produced by the injected T2I layers.\nImportantly, our adapters are trained on "frozen videos" (i.e., repeated images), constructed from image samples generated by the customized T2I model. This training is facilitated by a novel Motion Adapter module, which allows us to train on such static videos while preserving the motion prior of the video model. At test time, we remove the Motion Adapter modules and leave in only the trained Spatial Adapters. This restores the motion prior of the T2V model while adhering to the spatial prior of the customized T2I model.\nWe demonstrate the effectiveness of our approach on diverse tasks including personalized, stylized, and conditional generation. In all evaluated scenarios, our method seamlessly integrates the spatial prior of the customized T2I model with a motion prior supplied by the T2V model.\n\nRegistration Category: Full Access, Full Access Supporter\n\nLanguage Format: English Language\n\nSession Chair: Nanxuan Zhao (Adobe Research)
URL:https://asia.siggraph.org/2024/program/?id=papers_608&sess=sess134
END:VEVENT
END:VCALENDAR