BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:Asia/Tokyo
X-LIC-LOCATION:Asia/Tokyo
BEGIN:STANDARD
TZOFFSETFROM:+0900
TZOFFSETTO:+0900
TZNAME:JST
DTSTART:18871231T000000
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20250110T023312Z
LOCATION:Hall B5 (2)\, B Block\, Level 5
DTSTART;TZID=Asia/Tokyo:20241205T154300
DTEND;TZID=Asia/Tokyo:20241205T155400
UID:siggraphasia_SIGGRAPH Asia 2024_sess134_papers_785@linklings.com
SUMMARY:TrailBlazer: Trajectory Control for Diffusion-Based Video Generation
DESCRIPTION:Technical Papers\n\nWan-Duo Kurt Ma (Victoria University of Wellington)\, J. P. Lewis (NVIDIA Research)\, and W. Bastiaan Kleijn (Victoria University of Wellington)\n\nLarge text-to-video (T2V) models such as Sora have the potential to revolutionize visual effects and the creation of some types of movies. Current T2V models require tedious trial-and-error experimentation to achieve desired results\, however. This motivates the search for methods to directly control desired attributes. In this work\, we take a step toward this goal\, introducing a method for high-level\, temporally-coherent control over the basic trajectories and appearance of objects. Our algorithm\, TrailBlazer\, allows the general positions and (optionally) appearance of objects to be controlled simply by keyframing approximate bounding boxes and (optionally) their corresponding prompts. Importantly\, our method does not require a pre-existing control video signal that already contains an accurate outline of the desired motion\, yet the synthesized motion is surprisingly natural with emergent effects including perspective and movement toward the virtual camera as the box size increases. The method is efficient\, making use of a pre-trained T2V model and requiring no training or fine-tuning\, with negligible additional computation. Specifically\, the bounding box controls are used as soft masks to guide manipulation of the self-attention and cross-attention modules in the video diffusion model. While our visual results are limited by those of the underlying model\, the algorithm may generalize to future models that use standard self- and cross-attention components.\n\nRegistration Category: Full Access\, Full Access Supporter\n\nLanguage Format: English Language\n\nSession Chair: Nanxuan Zhao (Adobe Research)
URL:https://asia.siggraph.org/2024/program/?id=papers_785&sess=sess134
END:VEVENT
END:VCALENDAR