In this project we study the problem of mapping from acoustic to visual speech, with the goal of automatically generating accurate, perceptually natural speech animation from an audio speech signal. Automatic speech animation (or audio-to-visual speech conversion) has applications in TV, film and games production, low-bandwidth multimodal communication, and speech and language therapy.

We introduce a sliding window deep neural network (SW-DNN) that learns, from a large audio-visual speech dataset, a mapping from a window of acoustic features to a window of visual features. Overlapping visual predictions are averaged to generate continuous, smoothly varying speech animation. Prediction is fast and yields lower mean squared error, a higher correlation coefficient, and more perceptually realistic animation than a baseline HMM inversion technique.
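To illustrate the sliding-window prediction and overlap-averaging described above, here is a minimal sketch in NumPy. The window sizes, visual feature dimension, and the `predict_window` callable standing in for a trained SW-DNN are illustrative placeholders, not the exact configuration used in the paper.

```python
import numpy as np

def sliding_window_predict(acoustic, predict_window,
                           audio_win=11, visual_win=11, visual_dim=30):
    """Slide an acoustic window along the utterance, predict a window of
    visual features at each frame, and average the overlapping predictions.

    acoustic       : (T, A) array of acoustic feature frames
    predict_window : callable mapping a flattened (audio_win * A,) input
                     to a flattened (visual_win * visual_dim,) output
    """
    T = acoustic.shape[0]
    half_a, half_v = audio_win // 2, visual_win // 2

    # Pad the acoustic sequence so a full window exists at every frame.
    padded = np.pad(acoustic, ((half_a, half_a), (0, 0)), mode="edge")

    visual_sum = np.zeros((T, visual_dim))
    visual_count = np.zeros((T, 1))

    for t in range(T):
        window = padded[t:t + audio_win].ravel()            # acoustic context
        pred = predict_window(window).reshape(visual_win, visual_dim)

        # Scatter the predicted visual window around frame t.
        lo, hi = max(0, t - half_v), min(T, t + half_v + 1)
        visual_sum[lo:hi] += pred[lo - (t - half_v):hi - (t - half_v)]
        visual_count[lo:hi] += 1

    # Averaging the overlapping windows produces a smooth visual trajectory.
    return visual_sum / visual_count
```

Because each visual frame is predicted from several neighbouring acoustic windows and then averaged, the output trajectory varies smoothly without an explicit post-smoothing step.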

In this movie we show rendered examples of speech animation predicted from the audio signal of held-out test sentences. On the left we show the tracked facial motion, in the centre the facial motion predicted by our SW-DNN approach, and on the right the facial motion predicted by the baseline HMM inversion technique.

References

S. Taylor, A. Kato, I. Matthews and B. Milner. "Audio-to-Visual Speech Conversion using Deep Neural Networks". To appear in Interspeech, 2016.

Research Team

Dr Sarah Taylor and Dr Ben Milner

Collaborators

Iain Matthews (Disney Research)