Incorporating Visual Prosody into Speech Animation

Information

  • Start date: October 2013
  • Programme: PhD
  • Mode of Study: Full Time
  • Studentship Length: 3 years

How to Apply

Fees & Funding

  • Funding Status: Competition Funded Project (EU Students Only)
  • Funding Conditions:

    Funding is available to EU students. If funding is awarded for this project, it will cover tuition fees and a stipend for UK students. EU students may be eligible for full funding, or for tuition fees only, depending on the funding source.

  • Fees: Fees Information

Entry Requirements

  • Acceptable First Degree:

    Computer Science or Mathematics

  • Minimum Entry Standard: 2:1

Project Description

Speech animation involves lip-synching face models to (acoustic) speech. This is a difficult problem because viewers are very sensitive to even minor discrepancies in the animation. Consequently, practical applications of speech animation (e.g. computer games and animated movies) resort to either expensive motion-capture or having animators create sequences by hand, which is a slow and expensive process. There has been recent research into methods for generating speech animation automatically, and state-of-the-art systems can generate plausible speech movements. However, when compared with sequences animated from motion-capture, no automated system can yet produce truly human-like speech.

Part of the problem is that most systems are trained on speech spoken without expression (no emotion) and with the actor maintaining a constant head pose (directly facing the camera). Thus, the resulting animated speech lacks all of the non-verbal cues that usually accompany speech. One might argue that, rather than focussing only on speech movements that are a direct result of speech production, the overall perception of the realism of the speech might be improved by adding other non-verbal, visual prosodic cues (head nods, eye gaze and eyebrow movements) to the animation.

This study will involve an analysis of the non-verbal cues that accompany real speech, in which we will apply computer vision techniques to track facial movements in video sequences. Using real data, both high-level rule-based methods and lower-level feature-based methods will be investigated for re-synthesising these cues. Using formal subjective testing, we can then quantify the improvement in the perceived realism of the speech gained by adding these non-speech movements to the animation.
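
By way of illustration, the Python sketch below shows one possible way of extracting per-frame facial landmarks from a video using OpenCV together with dlib's frontal face detector and 68-point shape predictor. The filenames, the choice of libraries and the 68-point model are assumptions made for this example only, not a prescribed part of the project.

    import cv2
    import dlib

    # Assumed inputs: an example video and dlib's publicly available
    # 68-point landmark model (both placeholders for this sketch).
    VIDEO_PATH = "speaker.mp4"
    MODEL_PATH = "shape_predictor_68_face_landmarks.dat"

    detector = dlib.get_frontal_face_detector()
    predictor = dlib.shape_predictor(MODEL_PATH)

    cap = cv2.VideoCapture(VIDEO_PATH)
    tracks = []  # one entry per frame: list of (x, y) landmarks, or None

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector(gray, 1)  # upsample once to catch smaller faces
        if faces:
            shape = predictor(gray, faces[0])
            tracks.append([(shape.part(i).x, shape.part(i).y) for i in range(68)])
        else:
            tracks.append(None)  # no face found in this frame

    cap.release()
    detected = sum(t is not None for t in tracks)
    print(f"Landmarks tracked in {detected} of {len(tracks)} frames")

Per-frame landmark trajectories of this kind would form the raw data from which head, gaze and eyebrow movements can be analysed and later re-synthesised.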

This project will continue an ongoing collaboration between UEA and Disney Research (Pittsburgh), and the results will directly inform the animation industry.

References

K. Munhall, J. Jones, D. Callan, T. Kuratate and E. Vatikiotis-Bateson. Visual Prosody and Speech Intelligibility: Head Movement Improves Auditory Speech Perception. Psychological Science, 15:133-137, 2004.

H.P. Graf, E. Cosatto, V. Strom and F. Huang. Visual Prosody: Facial Movements Accompanying Speech. In Proceedings of Automatic Face and Gesture Recognition, pp. 396-401, 2002.

S. Al Moubayed, J. Beskow, B. Granström and D. House. Audio-Visual Prosody: Perception, Detection and Synthesis of Prominence. Towards Autonomous, Adaptive and Context-Aware Multimodal Interfaces: Theoretical and Practical Issues, pp. 55-71, 2010.

M. Swerts and E. Krahmer. Visual Prosody of Newsreaders: Effects of Information Structure, Emotional Content and Intended Audience on Facial Expressions. Journal of Phonetics, 38(2):197-206, 2010.

M. Sargin, E. Erzin, Y. Yemez, A. Tekalp, T. Erdem, C. Erdem and M. Ozkan. Prosody-Driven Head-Gesture Animation. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 677-680, 2007.



Apply online