BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:Asia/Tokyo
X-LIC-LOCATION:Asia/Tokyo
BEGIN:STANDARD
TZOFFSETFROM:+0900
TZOFFSETTO:+0900
TZNAME:JST
DTSTART:18871231T000000
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20250110T023312Z
LOCATION:Hall B7 (1)\, B Block\, Level 7
DTSTART;TZID=Asia/Tokyo:20241204T113100
DTEND;TZID=Asia/Tokyo:20241204T114300
UID:siggraphasia_SIGGRAPH Asia 2024_sess114_papers_382@linklings.com
SUMMARY:Autonomous Character-Scene Interaction Synthesis from Text Instruction
DESCRIPTION:Technical Papers\n\nNan Jiang (Peking University, Beijing Institute for General Artificial Intelligence); Zimo He (Peking University); Zi Wang (Beijing University of Posts and Telecommunications); Hongjie Li (Peking University); Yixin Chen and Siyuan Huang (Beijing Institute for General Artificial Intelligence); and Yixin Zhu (Peking University)\n\nSynthesizing human motions in 3D environments, particularly those with complex activities such as locomotion, hand-reaching, and human-object interaction, presents substantial demands for user-defined waypoints and stage transitions. These requirements pose challenges for current models, leading to a notable gap in automating the animation of characters from simple human inputs. This paper addresses this challenge by introducing a comprehensive framework for synthesizing multi-stage scene-aware interaction motions directly from a single text instruction and goal location. Our approach employs an auto-regressive diffusion model to synthesize the next motion segment, along with an autonomous scheduler predicting the transition for each action stage. To ensure that the synthesized motions are seamlessly integrated within the environment, we propose a scene representation that considers the local perception both at the start and the goal location. We further enhance the coherence of the generated motion by integrating frame embeddings with language input. Additionally, to support model training, we present a comprehensive motion-captured dataset comprising 16 hours of motion sequences in 120 indoor scenes covering 40 types of motions, each annotated with precise language descriptions. Experimental results demonstrate the efficacy of our method in generating high-quality, multi-stage motions closely aligned with environmental and textual conditions.\n\nRegistration Category: Full Access, Full Access Supporter\n\nLanguage Format: English Language\n\nSession Chair: Kai Wang (Amazon)
URL:https://asia.siggraph.org/2024/program/?id=papers_382&sess=sess114
END:VEVENT
END:VCALENDAR