BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:Australia/Melbourne
X-LIC-LOCATION:Australia/Melbourne
BEGIN:DAYLIGHT
TZOFFSETFROM:+1000
TZOFFSETTO:+1100
TZNAME:AEDT
DTSTART:19721003T020000
RRULE:FREQ=YEARLY;BYMONTH=10;BYDAY=1SU
END:DAYLIGHT
BEGIN:STANDARD
DTSTART:19721003T020000
TZOFFSETFROM:+1100
TZOFFSETTO:+1000
TZNAME:AEST
RRULE:FREQ=YEARLY;BYMONTH=4;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20240214T070250Z
LOCATION:Meeting Room C4.8\, Level 4 (Convention Centre)
DTSTART;TZID=Australia/Melbourne:20231215T120300
DTEND;TZID=Australia/Melbourne:20231215T121300
UID:siggraphasia_SIGGRAPH Asia 2023_sess156_papers_504@linklings.com
SUMMARY:Interactive Story Visualization with Multiple Characters
DESCRIPTION:Technical Communications, Technical Papers\n\nYuan Gong (Tsinghua University); Youxin Pang (MAIS & NLPR, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences); Xiaodong Cun and Menghan Xia (Tencent); Yingqing He (Hong Kong University of Science and Technology); Haoxin Chen, Longyue Wang, Yong Zhang, Xintao Wang, and Ying Shan (Tencent); and Yujiu Yang (Tsinghua University)\n\nAccurate story visualization requires several necessary elements, such as identity consistency across frames, alignment between plain text and visual content, and a reasonable layout of objects in images. Most previous works endeavor to meet these requirements by fitting a text-to-image (T2I) model to a set of videos in the same style and with the same characters, e.g., the FlintstonesSV dataset. However, the learned T2I models typically struggle to adapt to new characters, scenes, and styles, and often lack the flexibility to revise the layout of the synthesized images.\nThis paper proposes a system for generic interactive story visualization, capable of handling multiple novel characters and supporting the editing of layout and local structure. It is developed by leveraging the prior knowledge of large language and T2I models, trained on massive corpora. The system comprises four interconnected components: story-to-prompt generation (S2P), text-to-layout generation (T2L), controllable text-to-image generation (C-T2I), and image-to-video animation (I2V). First, the S2P module converts concise story information into the detailed prompts required by subsequent stages. Next, T2L generates diverse and reasonable layouts based on the prompts, offering users the ability to adjust and refine the layout to their preference. The core component, C-T2I, creates images guided by layouts, sketches, and actor-specific identifiers to maintain consistency and detail across visualizations. Finally, I2V enriches the visualization process by animating the generated images.\nExtensive experiments and a user study validate the effectiveness of the proposed system and the flexibility of its interactive editing.\n\nRegistration Category: Full Access\n\nSession Chair: Sergi Pujades (National Institute for Research in Computer Science and Automation (INRIA), Université Grenoble Alpes)
URL:https://asia.siggraph.org/2023/full-program?id=papers_504&sess=sess156
END:VEVENT
END:VCALENDAR
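
For reference, a minimal sketch of how the VEVENT above could be read programmatically. It assumes Python with the third-party icalendar package installed and a hypothetical local filename; it is illustrative only and is not part of the calendar data.

# Parse the .ics file and print the key fields of each event.
from icalendar import Calendar

with open("siggraphasia2023_papers_504.ics", "rb") as f:  # hypothetical filename
    cal = Calendar.from_ical(f.read())

for event in cal.walk("VEVENT"):
    print(event.get("SUMMARY"))      # talk title
    print(event.decoded("DTSTART"))  # start time as a datetime (Australia/Melbourne per TZID)
    print(event.get("LOCATION"))     # room and level

Because DTSTART/DTEND carry a TZID parameter resolved by the embedded VTIMEZONE, a conforming parser returns local Melbourne times rather than UTC.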