BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:Asia/Tokyo
X-LIC-LOCATION:Asia/Tokyo
BEGIN:STANDARD
TZOFFSETFROM:+0900
TZOFFSETTO:+0900
TZNAME:JST
DTSTART:18871231T000000
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20250110T023312Z
LOCATION:Hall B7 (1)\, B Block\, Level 7
DTSTART;TZID=Asia/Tokyo:20241205T170500
DTEND;TZID=Asia/Tokyo:20241205T171600
UID:siggraphasia_SIGGRAPH Asia 2024_sess138_papers_215@linklings.com
SUMMARY:TALK-Act: Enhance Textural-Awareness for 2D Speaking Avatar Reenac
 tment with Diffusion Model
DESCRIPTION:Technical Papers\n\nJiazhi Guan (Tsinghua University); Quanwei
  Yang (University of Science and Technology of China); Kaisiyuan Wang, Han
 g Zhou, Shengyi He, Zhiliang Xu, Haocheng Feng, Errui Ding, and Jingdong W
 ang (Baidu); Hongtao Xie (University of Science and Technology of China); 
 Youjian Zhao (Tsinghua University); and Ziwei Liu (Nanyang Technological U
 niversity (NTU))\n\nRecently, 2D speaking avatars have increasingly partic
 ipated in everyday scenarios due to the fast development of facial animati
 on techniques. However, most existing works neglect the explicit control o
 f human bodies. In this paper, we propose to drive not only the faces but 
 also the torso and gesture movements of a speaking figure. Inspired by rec
 ent advances in diffusion models, we propose the Motion-Enhanced Textural-
 Aware ModeLing for SpeaKing Avatar Reenactment (TALK-Act) framework, which
  enables high-fidelity avatar reenactment from only short footage of monoc
 ular video. Our key idea is to enhance the textural awareness with explici
 t motion guidance in diffusion modeling. Specifically, we carefully constr
 uct 2D and 3D structural information as intermediate guidance. While recen
 t diffusion models adopt a side network for control information injection,
  they fail to synthesize temporally stable results even with person-specif
 ic fine-tuning. We propose a Motion-Enhanced Textural Alignment module to 
 enhance the bond between driving and target signals. Moreover, we build a 
 Memory-based Hand-Recovering module to help with the difficulties in hand-
 shape preserving. After pre-training, our model can achieve high-fidelity 
 2D avatar reenactment with only 30 seconds of person-specific data. Extens
 ive experiments demonstrate the effectiveness and superiority of our propo
 sed framework.\n\nRegistration Category: Full Access, Full Access Supporte
 r\n\nLanguage Format: English Language\n\nSession Chair: Hongbo Fu (Hong K
 ong University of Science and Technology)
URL:https://asia.siggraph.org/2024/program/?id=papers_215&sess=sess138
END:VEVENT
END:VCALENDAR
