Speech animation, the process of animating a human-like model to give the impression that it is talking, most commonly relies on the work of skilled animators or on performance capture. These approaches are time-consuming, expensive, and do not scale. This work develops algorithms for content-driven speech animation: models that learn visual actions from data, without semantic labelling, to predict realistic speech animation from recorded audio. We achieve these goals by forming a multi-modal corpus that represents the style of speech we want to model: speech that is natural, expressive, and prosodic. This allows us to train deep recurrent neural networks to predict compelling animation. We first develop methods to predict the rigid head pose of a speaker. Head pose is not wholly determined by speech, so our methods produce a large variety of plausible head-pose trajectories from a single utterance. We also apply our methods to predict the head pose of a listener in conversation, using only the voice of the speaker.
Finally, we show how to predict the lip sync, facial expression, and rigid head pose of the speaker simultaneously, solely from speech.
Some examples of head pose prediction
A short, high-level video overview of the thesis.
- In this work we show how to predict head pose from speech using a data-driven approach.
- We record expressive speech from multiple camera views.
- We fit Active Appearance Models to each of the video sequences and track the actor's performance. Then we derive a sparse 3D mesh that describes the actor's head pose.
- We further reduce dimensionality using principal component analysis.
- To parametrise the audio, we use the log energies of 40 filter banks.
- We jointly learn rigid head pose and facial expression using deep recurrent neural networks.
- To view the predictions from our model, we animate the same sparse 3D mesh.
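The PCA step above can be sketched as follows. This is a minimal numpy illustration, not the thesis code: the number of frames, the size of the sparse mesh, and the 95% variance threshold are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for tracked mesh data: 500 frames, each a flattened sparse
# 3D mesh of 30 vertices (30 * 3 = 90 coordinates per frame).
frames = rng.normal(size=(500, 90))

# Centre the data, then use SVD to obtain the principal components.
mean = frames.mean(axis=0)
centred = frames - mean
U, S, Vt = np.linalg.svd(centred, full_matrices=False)

# Keep enough components to explain, say, 95% of the variance.
explained = (S ** 2) / np.sum(S ** 2)
k = int(np.searchsorted(np.cumsum(explained), 0.95)) + 1

# Project each frame into the low-dimensional space ...
codes = centred @ Vt[:k].T          # shape: (500, k)
# ... and reconstruct it for animation.
recon = codes @ Vt[:k] + mean       # shape: (500, 90)
```

The low-dimensional `codes` are what a model would actually predict; reconstruction through the retained components recovers an animatable mesh.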
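The audio parametrisation described above (log energies of 40 filter banks) can be sketched in plain numpy. The sample rate, frame length, hop size, and mel spacing of the filters are assumptions for illustration, not details taken from the thesis.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_filterbanks(signal, sr=16000, n_fft=512, hop=160, n_banks=40):
    # Frame the signal, window it, and take the power spectrum per frame.
    n_frames = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(n_fft)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2

    # Build 40 triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_banks + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_banks, n_fft // 2 + 1))
    for i in range(n_banks):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    # Log-compress the filter-bank energies (epsilon avoids log(0)).
    return np.log(power @ fbank.T + 1e-10)

# One second of noise as a stand-in signal: one 40-D vector per frame.
feats = log_filterbanks(np.random.default_rng(1).normal(size=16000))
```

Each row of `feats` is a 40-dimensional feature vector for one audio frame, the kind of input a recurrent network consumes one step at a time.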
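The joint prediction idea can be illustrated with a toy recurrent network: at every time step, one output vector holds both rigid head pose and facial expression parameters. All sizes here are illustrative assumptions, and this minimal vanilla RNN stands in for the deep recurrent networks used in the work.

```python
import numpy as np

rng = np.random.default_rng(2)
# Assumed sizes: 40 audio features in; 6 pose values (rotation +
# translation) and 10 expression coefficients out, predicted jointly.
n_in, n_hidden, n_pose, n_expr = 40, 64, 6, 10

W_xh = rng.normal(scale=0.1, size=(n_in, n_hidden))
W_hh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
W_hy = rng.normal(scale=0.1, size=(n_hidden, n_pose + n_expr))

def predict(features):
    """Run the RNN over a (T, 40) feature sequence; return (T, 16) outputs."""
    h = np.zeros(n_hidden)
    outputs = []
    for x in features:
        h = np.tanh(x @ W_xh + h @ W_hh)   # recurrent state update
        outputs.append(h @ W_hy)           # joint pose + expression readout
    return np.asarray(outputs)

out = predict(rng.normal(size=(97, n_in)))
pose, expression = out[:, :n_pose], out[:, n_pose:]
```

Predicting both streams from a shared hidden state is what "jointly learn" means in practice: the recurrent state must carry the information relevant to head pose and expression at once.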