BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:Australia/Melbourne
X-LIC-LOCATION:Australia/Melbourne
BEGIN:DAYLIGHT
TZOFFSETFROM:+1000
TZOFFSETTO:+1100
TZNAME:AEDT
DTSTART:19721003T020000
RRULE:FREQ=YEARLY;BYMONTH=10;BYDAY=1SU
END:DAYLIGHT
BEGIN:STANDARD
DTSTART:19721003T020000
TZOFFSETFROM:+1100
TZOFFSETTO:+1000
TZNAME:AEST
RRULE:FREQ=YEARLY;BYMONTH=4;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20240214T070241Z
LOCATION:Darling Harbour Theatre\, Level 2 (Convention Centre)
DTSTART;TZID=Australia/Melbourne:20231212T093000
DTEND;TZID=Australia/Melbourne:20231212T124500
UID:siggraphasia_SIGGRAPH Asia 2023_sess209_papers_501@linklings.com
SUMMARY:Emotional Speech-Driven Animation with Content-Emotion Disentanglement
DESCRIPTION:Technical Papers\n\nRadek Daněček (Max Planck Institute for Intelligent Systems); Kiran Chhatre (KTH Royal Institute of Technology); Shashank Tripathi, Yandong Wen, and Michael Black (Max Planck Institute for Intelligent Systems); and Timo Bolkart (Max Planck Institute for Intelligent Systems)\n\nTo be widely adopted, 3D facial avatars need to be animated easily, realistically, and directly from speech signals. While the best recent methods generate 3D animations that are synchronized with the input audio, they largely ignore the impact of emotions on facial expressions. Instead, they focus on modeling the correlations between speech and facial motion, resulting in animations that are unemotional or do not match the input emotion. We observe that there are two contributing factors in facial animation: the speech and the emotion. We exploit these insights in EMOTE (Expressive Model Optimized for Talking with Emotion), which generates 3D talking-head avatars that maintain lip sync while enabling explicit control over the expression of emotion. Due to the absence of high-quality emotional 3D face datasets aligned with speech, EMOTE is trained from an emotional video dataset (i.e., MEAD). To achieve this, we match the speech content between generated sequences and target videos differently from the emotion content. Specifically, we train EMOTE with additional supervision in the form of a lip-reading objective to preserve the speech-dependent content (spatially local and of high temporal frequency), while applying emotion supervision at the sequence level (spatially global and of low frequency). Furthermore, we employ a content-emotion exchange mechanism to supervise different emotions on the same audio while keeping the lip motion synchronized with the speech. To employ deep perceptual losses without introducing undesirable artifacts, we devise a motion prior in the form of a temporal VAE. Extensive qualitative, quantitative, and perceptual evaluations demonstrate that EMOTE produces state-of-the-art speech-driven facial animations, with lip sync on par with the best methods while offering additional, high-quality emotional control.\n\nRegistration Category: Full Access, Enhanced Access, Trade Exhibitor, Experience Hall Exhibitor
URL:https://asia.siggraph.org/2023/full-program?id=papers_501&sess=sess209
END:VEVENT
END:VCALENDAR