Creating Expressive Speech Animation

We are developing techniques for driving expressive speech animation automatically from both text and voice. There are several strands to this work:

1. Deriving a better animation unit for creating speech animation. Typically the visual unit of speech is assumed to be the viseme (visual phoneme), but we instead take a gestural approach and extract units based on facial behaviour. This gives a more intuitive and natural description of the visual signal during speech.

Here we show an animated speech sequence driven by our new gestural units. On the left is a reference animation based on static lip-shape targets, which might be the starting point for hand animation; our new approach is shown on the right.
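To make the contrast concrete, here is a small illustrative sketch in Python. It is not the system itself: the single lip-opening parameter, the phone timings and the use of velocity zero crossings as segment boundaries are assumptions made purely for illustration. It contrasts interpolating static viseme-style targets with segmenting a measured trajectory into dynamic gestural units.

    # A toy illustration (not our system) contrasting viseme-style synthesis,
    # which interpolates static lip-shape targets, with a gestural description,
    # which segments a continuous articulatory trajectory into dynamic units.
    import numpy as np

    def viseme_style(targets, times, fps=100):
        """Linearly interpolate static lip-shape targets placed at phone centres."""
        t = np.arange(0.0, times[-1], 1.0 / fps)
        return t, np.interp(t, times, targets)

    def gesture_segments(trajectory, fps=100):
        """Split a measured lip-opening trajectory into gestural units, one per
        opening or closing movement, using velocity zero crossings as boundaries."""
        velocity = np.gradient(trajectory) * fps
        crossings = np.where(np.diff(np.sign(velocity)) != 0)[0]
        bounds = np.concatenate(([0], crossings, [len(trajectory) - 1]))
        return [trajectory[a:b + 1] for a, b in zip(bounds[:-1], bounds[1:])]

    # Viseme view: static targets at (made-up) phone centres, interpolated in time.
    t, synthetic = viseme_style(targets=[0.1, 0.8, 0.2, 0.6],
                                times=[0.0, 0.15, 0.30, 0.45])

    # Gestural view: units extracted from an observed trajectory
    # (a sine wave stands in for tracked lip-opening data).
    observed = 0.5 + 0.4 * np.sin(2 * np.pi * 3.0 * t)
    units = gesture_segments(observed)
    print(f"{len(units)} gestural units extracted from {len(observed)} frames")

The point of the sketch is only that the gestural view describes movements rather than poses: each unit spans a whole opening or closing motion instead of a single target shape.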

2. Understanding how movements of the facial features due to both speech and expression are combined to produce expressive conversational signals. The goal is to model these components independently, so that when training a speech animation system, there is no requirement to observe all speech in all expressive contexts. We can train a system using a large corpus of ‘neutral’ speech, and subsequently we need only a few expressive sequences to transform this neutral speech into an expressive equivalent.

The example below shows an original expressive (sad) sequence (left), time-aligned to a corresponding original neutral speech sequence (right), and the result of transforming the neutral sequence so that it carries the same style as the original expressive sequence (middle).
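In outline, this kind of transformation can be sketched as: time-align a neutral parameter sequence to an expressive one, learn a mapping from the aligned frame pairs, and apply that mapping to new neutral sequences. The sketch below is a heavily simplified stand-in; the choice of facial parameters, the dynamic time warping on Euclidean distance and the per-dimension affine map are all assumptions for illustration, not the actual model.

    # A heavily simplified sketch of the neutral-to-expressive idea: align a
    # neutral parameter sequence to an expressive one with dynamic time warping,
    # learn a simple mapping from the aligned frames, then apply it to new
    # neutral sequences. All specifics below are illustrative assumptions.
    import numpy as np

    def dtw_path(a, b):
        """Dynamic time warping alignment between two (frames x dims) sequences."""
        n, m = len(a), len(b)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = np.linalg.norm(a[i - 1] - b[j - 1])
                cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
        path, i, j = [], n, m          # backtrack from the end of both sequences
        while i > 0 and j > 0:
            path.append((i - 1, j - 1))
            step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
            i, j = (i - 1, j - 1) if step == 0 else (i - 1, j) if step == 1 else (i, j - 1)
        return path[::-1]

    def learn_style(neutral, expressive):
        """Fit a per-dimension affine map neutral -> expressive on aligned frames."""
        pairs = dtw_path(neutral, expressive)
        x = np.array([neutral[i] for i, _ in pairs])
        y = np.array([expressive[j] for _, j in pairs])
        scale = y.std(axis=0) / (x.std(axis=0) + 1e-8)
        offset = y.mean(axis=0) - scale * x.mean(axis=0)
        return scale, offset

    def apply_style(neutral, scale, offset):
        """Transform a neutral sequence so it takes on the learned style."""
        return neutral * scale + offset

    # Toy data: 10-dimensional facial parameters for one paired sad/neutral example.
    rng = np.random.default_rng(0)
    neutral_seq = rng.normal(size=(120, 10))
    sad_seq = 1.3 * rng.normal(size=(140, 10)) - 0.5
    scale, offset = learn_style(neutral_seq, sad_seq)
    sad_version = apply_style(rng.normal(size=(90, 10)), scale, offset)

In practice the mapping would be richer than a per-dimension scale and offset, but even this toy version shows why only a few expressive sequences are needed: the style mapping is learned from aligned neutral/expressive pairs rather than from a full expressive corpus.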

3. Mapping speech animation between different characters so that it remains realistic and believable.

This example shows our technique for mapping facial movements from one face model to another. In this sequence the animation is performance driven, but it could equally be generated automatically.
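As a rough illustration of this kind of cross-character mapping, the sketch below fits an affine map between the parameter spaces of two face models from a small set of corresponding poses (for example blendshape weights or model parameters) and then pushes an animation through it frame by frame. The paired training poses and the purely affine map are assumptions made for the example; they are not a description of our technique.

    # A rough sketch of cross-character mapping: learn an affine map between
    # the parameter spaces of two face models from corresponding poses, then
    # push an animation through it frame by frame. The paired poses and the
    # affine form of the map are assumptions made for this example only.
    import numpy as np

    def fit_retarget_map(source_poses, target_poses):
        """Least-squares affine map from source-model to target-model parameters.
        Both arguments are (num_poses x dims) arrays of corresponding poses."""
        X = np.hstack([source_poses, np.ones((len(source_poses), 1))])  # bias column
        M, *_ = np.linalg.lstsq(X, target_poses, rcond=None)
        return M

    def retarget(animation, M):
        """Apply the learned map to every frame of a (frames x dims) animation."""
        X = np.hstack([animation, np.ones((len(animation), 1))])
        return X @ M

    # Toy example: 20 corresponding poses in 12- and 15-dimensional model spaces.
    rng = np.random.default_rng(1)
    source_poses = rng.normal(size=(20, 12))
    target_poses = source_poses @ rng.normal(size=(12, 15)) + 0.1 * rng.normal(size=(20, 15))
    M = fit_retarget_map(source_poses, target_poses)
    mapped = retarget(rng.normal(size=(200, 12)), M)  # e.g. performance-driven frames

The affine map is only the simplest possible choice here; what the example illustrates is mapping through corresponding poses of the two characters rather than copying parameters directly from one model to the other.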
