BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:Australia/Melbourne
X-LIC-LOCATION:Australia/Melbourne
BEGIN:DAYLIGHT
TZOFFSETFROM:+1000
TZOFFSETTO:+1100
TZNAME:AEDT
DTSTART:19721003T020000
RRULE:FREQ=YEARLY;BYMONTH=10;BYDAY=1SU
END:DAYLIGHT
BEGIN:STANDARD
DTSTART:19721003T020000
TZOFFSETFROM:+1100
TZOFFSETTO:+1000
TZNAME:AEST
RRULE:FREQ=YEARLY;BYMONTH=4;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20240214T070245Z
LOCATION:Meeting Room C4.11\, Level 4 (Convention Centre)
DTSTART;TZID=Australia/Melbourne:20231213T174500
DTEND;TZID=Australia/Melbourne:20231213T175500
UID:siggraphasia_SIGGRAPH Asia 2023_sess127_papers_501@linklings.com
SUMMARY:Emotional Speech-Driven Animation with Content-Emotion Disentanglement
DESCRIPTION:Technical Communications, Technical Papers\n\nRadek Daněček (Max Planck Institute for Intelligent Systems); Kiran Chhatre (KTH Royal Institute of Technology); Shashank Tripathi, Yandong Wen, and Michael Black (Max Planck Institute for Intelligent Systems); and Timo Bolkart (Max Planck Institute for Intelligent Systems)\n\nTo be widely adopted, 3D facial avatars need to be animated easily, realistically, and directly from speech signals. While the best recent methods generate 3D animations that are synchronized with the input audio, they largely ignore the impact of emotions on facial expressions. Instead, their focus is on modeling the correlations between speech and facial motion, resulting in animations that are unemotional or do not match the input emotion. We observe that there are two contributing factors in facial animation: the speech and the emotion. We exploit these insights in EMOTE (Expressive Model Optimized for Talking with Emotion), which generates 3D talking head avatars that maintain lip sync while enabling explicit control over the expression of emotion. Due to the absence of high-quality aligned emotional 3D face datasets with speech, EMOTE is trained from an emotional video dataset (i.e., MEAD). To achieve this, we match speech content between generated sequences and target videos differently from emotion content. Specifically, we train EMOTE with additional supervision in the form of a lip-reading objective to preserve the speech-dependent content (spatially local and high temporal frequency), while utilizing emotion supervision at the sequence level (spatially global and low frequency). Furthermore, we employ a content-emotion exchange mechanism to supervise different emotions on the same audio while keeping the lip motion synchronized with the speech. To employ deep perceptual losses without introducing undesirable artifacts, we devise a motion prior in the form of a temporal VAE. Extensive qualitative, quantitative, and perceptual evaluations demonstrate that EMOTE produces state-of-the-art speech-driven facial animations, with lip sync on par with the best methods while offering additional, high-quality emotional control.\n\nRegistration Category: Full Access\n\nSession Chair: Jernej Barbic (University of Southern California)
URL:https://asia.siggraph.org/2023/full-program?id=papers_501&sess=sess127
END:VEVENT
END:VCALENDAR