BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:Australia/Melbourne
X-LIC-LOCATION:Australia/Melbourne
BEGIN:DAYLIGHT
TZOFFSETFROM:+1000
TZOFFSETTO:+1100
TZNAME:AEDT
DTSTART:19721003T020000
RRULE:FREQ=YEARLY;BYMONTH=10;BYDAY=1SU
END:DAYLIGHT
BEGIN:STANDARD
DTSTART:19721003T020000
TZOFFSETFROM:+1100
TZOFFSETTO:+1000
TZNAME:AEST
RRULE:FREQ=YEARLY;BYMONTH=4;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20240214T070245Z
LOCATION:Meeting Room C4.11\, Level 4 (Convention Centre)
DTSTART;TZID=Australia/Melbourne:20231213T174500
DTEND;TZID=Australia/Melbourne:20231213T175500
UID:siggraphasia_SIGGRAPH Asia 2023_sess127_papers_501@linklings.com
SUMMARY:Emotional Speech-Driven Animation with Content-Emotion Disentanglement
DESCRIPTION:Technical Communications, Technical Papers\n\nRadek Daněček (Max Planck Institute for Intelligent Systems); Kiran Chhatre (KTH Royal Institute of Technology); Shashank Tripathi, Yandong Wen, and Michael Black (Max Planck Institute for Intelligent Systems); and Timo Bolkart (Max Planck Institute for Intelligent Systems)\n\nTo be widely adopted, 3D facial avatars need to be animated easily, realistically, and directly from speech signals. While the best recent methods generate 3D animations that are synchronized with the input audio, they largely ignore the impact of emotions on facial expressions. Instead, their focus is on modeling the correlations between speech and facial motion, resulting in animations that are unemotional or do not match the input emotion. We observe that there are two contributing factors in facial animation: the speech and the emotion. We exploit these insights in EMOTE (Expressive Model Optimized for Talking with Emotion), which generates 3D talking head avatars that maintain lip sync while enabling explicit control over the expression of emotion. Due to the absence of high-quality aligned emotional 3D face datasets with speech, EMOTE is trained from an emotional video dataset (i.e., MEAD). To achieve this, we match speech content between generated sequences and target videos differently from emotion content. Specifically, we train EMOTE with additional supervision in the form of a lip-reading objective to preserve the speech-dependent content (spatially local and high temporal frequency), while utilizing emotion supervision at the sequence level (spatially global and low frequency). Furthermore, we employ a content-emotion exchange mechanism to supervise different emotions on the same audio while keeping the lip motion synchronized with the speech. To employ deep perceptual losses without introducing undesirable artifacts, we devise a motion prior in the form of a temporal VAE. Extensive qualitative, quantitative, and perceptual evaluations demonstrate that EMOTE produces state-of-the-art speech-driven facial animations, with lip sync on par with the best methods while offering additional, high-quality emotional control.\n\nRegistration Category: Full Access\n\nSession Chair: Jernej Barbic (University of Southern California)
URL:https://asia.siggraph.org/2023/full-program?id=papers_501&sess=sess127
END:VEVENT
END:VCALENDAR