BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:Australia/Melbourne
X-LIC-LOCATION:Australia/Melbourne
BEGIN:DAYLIGHT
TZOFFSETFROM:+1000
TZOFFSETTO:+1100
TZNAME:AEDT
DTSTART:19721003T020000
RRULE:FREQ=YEARLY;BYMONTH=10;BYDAY=1SU
END:DAYLIGHT
BEGIN:STANDARD
DTSTART:19721003T020000
TZOFFSETFROM:+1100
TZOFFSETTO:+1000
TZNAME:AEST
RRULE:FREQ=YEARLY;BYMONTH=4;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20260114T163633Z
LOCATION:Darling Harbour Theatre\, Level 2 (Convention Centre)
DTSTART;TZID=Australia/Melbourne:20231212T093000
DTEND;TZID=Australia/Melbourne:20231212T124500
UID:siggraphasia_SIGGRAPH Asia 2023_sess209_papers_501@linklings.com
SUMMARY:Emotional Speech-Driven Animation with Content-Emotion Disentangle
 ment
DESCRIPTION:Radek Daněček (Max Planck Institute for Intelligent Systems); 
 Kiran Chhatre (KTH Royal Institute of Technology); Shashank Tripathi, Yand
 ong Wen, and Michael Black (Max Planck Institute for Intelligent Systems);
  and Timo Bolkart (Max Planck Institute for Intelligent Systems)\n\n
 To be w
 idely adopted, 3D facial avatars need to be animated easily, realistically
 , and directly, from speech signals. While the best recent methods generat
 e 3D animations that are synchronized with the input audio, they largely i
 gnore the impact of emotions on facial expressions. Instead, their focus i
 s on modeling the correlations between speech and facial motion, resulting
  in animations that are unemotional or do not match the input emotion. We 
 observe that there are two contributing factors resulting in facial animat
 ion - the speech and the emotion. We exploit these insights in EMOTE (Expr
 essive Model Optimized for Talking with Emotion), which generates 3D talki
 ng head avatars that maintain lip sync while enabling explicit control ove
 r the expression of emotion. Due to the absence of high-quality aligned em
 otional 3D face datasets with speech, EMOTE is trained from an emotional v
 ideo dataset (i.e., MEAD). To achieve this, we match speech content betwee
 n generated sequences and target videos differently from emotion content. 
 Specifically, we train EMOTE with additional supervision in the form of a 
 lip-reading objective to preserve the speech-dependent content (spatially 
 local and high temporal frequency), while utilizing emotion supervision at
  the sequence level (spatially global and low frequency). Furthermore,
  we em
 ploy a content-emotion exchange mechanism to supervise different emotions
  on the same audio, while maintaining the lip motion synchronized w
 ith the speech. To employ deep perceptual losses without getting undesirab
 le artifacts, we devise a motion prior in the form of a temporal VAE.
  Extensiv
 e qualitative, quantitative, and perceptual evaluations demonstrate that E
 MOTE produces state-of-the-art speech-driven facial animations, with lip s
 ync on par with the best methods while offering additional, high-quality e
 motional control.\n\nRegistration Category: Full Access, Enhanced Access, 
 Trade Exhibitor, Experience Hall Exhibitor\n\n
URL:https://asia.siggraph.org/2023/full-program?id=papers_501&sess=sess209
END:VEVENT
END:VCALENDAR