Research summary and findings so far

Lip-reading is the process of transcribing speech from the movements of a speaker's lips, and it is widely used as a means of communication by people with hearing difficulties.

Recent work in this area has explored automating this process, with the aim of building a recognition system driven entirely by lip movements. However, recognition accuracy has so far been relatively poor, and this work has highlighted several underlying problems.

Fig. 1 A block diagram identifying where the confusion model is used to correct the noisy output transcriptions produced by the standard lip-reading recogniser.

Lip movements carry much less information about speech than the acoustic signal, because many of the features that distinguish sounds in audio (e.g. voicing and some places of articulation) cannot be seen. As a result, certain sets of speech units that sound very different are indistinguishable in video. For example, the words "pat" and "bat" sound different but look identical on the lips, whereas "mat" and "gnat" are difficult to distinguish from audio alone but are visually distinct.
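
To make this many-to-one mapping concrete, the sketch below groups phonemes into coarse viseme classes. The grouping and symbols here are illustrative, not the project's actual inventory: phonemes in the same class look alike on the lips, so a visual-only recogniser cannot separate them.

```python
# Illustrative viseme classes (invented for this sketch): phonemes in the
# same class share a visual appearance and are confusable on video.
VISEME_CLASSES = {
    "bilabial": {"p", "b", "m"},       # "pat" / "bat" / "mat" look alike
    "labiodental": {"f", "v"},
    "alveolar": {"t", "d", "n", "s", "z"},
    "velar": {"k", "g", "ng"},
}

# Invert the table: phoneme -> viseme class.
PHONEME_TO_VISEME = {
    phoneme: viseme
    for viseme, phonemes in VISEME_CLASSES.items()
    for phoneme in phonemes
}

def visually_confusable(a: str, b: str) -> bool:
    """Two phonemes are confusable on video if they share a viseme class."""
    va, vb = PHONEME_TO_VISEME.get(a), PHONEME_TO_VISEME.get(b)
    return va is not None and va == vb

print(visually_confusable("p", "b"))  # True:  "pat" vs "bat" look identical
print(visually_confusable("m", "n"))  # False: "mat" vs "gnat" differ visually
```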

This project models confusions between visual speech units to improve lip-reading recognition. We consider three types of confusion: substitutions (replacing one phoneme with another), insertions (inserting a phoneme into a sequence) and deletions (removing a phoneme from a sequence).
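
As a concrete illustration of these three operations, the sketch below scores a candidate correction against a noisy recogniser output using a weighted edit distance. The cost values are invented for illustration; in practice such weights would be derived from recogniser error statistics.

```python
import math

# Invented costs: visually similar substitutions are cheap, everything
# else is expensive. Insertions and deletions get flat penalties.
SUB_COST = {("p", "b"): 0.1, ("t", "d"): 0.2}
INS_COST = 1.0
DEL_COST = 1.0

def sub_cost(a: str, b: str) -> float:
    if a == b:
        return 0.0
    return SUB_COST.get((a, b), SUB_COST.get((b, a), 2.0))

def confusion_distance(noisy, candidate):
    """Cheapest sequence of substitutions, insertions and deletions
    turning the noisy recogniser output into the candidate transcription."""
    n, m = len(noisy), len(candidate)
    d = [[math.inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i < n:            # delete noisy[i]
                d[i + 1][j] = min(d[i + 1][j], d[i][j] + DEL_COST)
            if j < m:            # insert candidate[j]
                d[i][j + 1] = min(d[i][j + 1], d[i][j] + INS_COST)
            if i < n and j < m:  # substitute (or match for free)
                d[i + 1][j + 1] = min(d[i + 1][j + 1],
                                      d[i][j] + sub_cost(noisy[i], candidate[j]))
    return d[n][m]

# "bat" misrecognised as "pat": one cheap substitution.
print(confusion_distance(["p", "ae", "t"], ["b", "ae", "t"]))  # 0.1
```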

Fig. 2 We use a composition of multiple weighted finite-state transducers to correct the noisy output from the standard recogniser. In this case, the noisy output of the standard recogniser is modelled as a transducer (P*), which is composed with a simple substitution confusion model (C) to correct any errors.
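
To illustrate the idea behind Fig. 2, the sketch below implements a toy version of this composition in plain Python over the tropical semiring (arc costs add along a path, and the cheapest accepting path wins). The symbols, weights, and one-word lexicon are invented for illustration; a toy lexicon acceptor L stands in for the other transducers in the full pipeline (Fig. 1), since composing P* with C alone would leave the unchanged output as the cheapest path.

```python
from collections import defaultdict
import heapq

# A machine is (arcs, start, finals); arcs maps a state to a list of
# (in_sym, out_sym, weight, next_state) tuples.

def chain(symbols):
    """Linear-chain transducer mapping a symbol sequence to itself."""
    arcs = defaultdict(list)
    for i, s in enumerate(symbols):
        arcs[i].append((s, s, 0.0, i + 1))
    return arcs, 0, {len(symbols)}

def flower(inventory, confusions):
    """One-state substitution model C: free identity arcs plus
    cheap confusion arcs (in_sym -> out_sym at the given cost)."""
    arcs = defaultdict(list)
    for s in inventory:
        arcs[0].append((s, s, 0.0, 0))
    for a, b, w in confusions:
        arcs[0].append((a, b, w, 0))
    return arcs, 0, {0}

def compose(m1, m2):
    """Epsilon-free weighted composition: pair up states, matching the
    outputs of the first machine against the inputs of the second."""
    (f1, s1, fin1), (f2, s2, fin2) = m1, m2
    arcs = defaultdict(list)
    stack, seen = [(s1, s2)], {(s1, s2)}
    while stack:
        q1, q2 = stack.pop()
        for i1, o1, w1, n1 in f1.get(q1, []):
            for i2, o2, w2, n2 in f2.get(q2, []):
                if o1 == i2:
                    arcs[(q1, q2)].append((i1, o2, w1 + w2, (n1, n2)))
                    if (n1, n2) not in seen:
                        seen.add((n1, n2))
                        stack.append((n1, n2))
    finals = {q for q in seen if q[0] in fin1 and q[1] in fin2}
    return arcs, (s1, s2), finals

def best_path(m):
    """Dijkstra search for the cheapest accepting path; returns its output."""
    arcs, start, finals = m
    heap, best, tie = [(0.0, 0, start, [])], {start: 0.0}, 1
    while heap:
        cost, _, q, out = heapq.heappop(heap)
        if q in finals:
            return cost, out
        for _, o, w, nq in arcs.get(q, []):
            if cost + w < best.get(nq, float("inf")):
                best[nq] = cost + w
                heapq.heappush(heap, (cost + w, tie, nq, out + [o]))
                tie += 1
    return float("inf"), []

# "bat" misread as "pat": compose the noisy output P* with C, then with L.
P = chain(["p", "ae", "t"])
C = flower({"p", "b", "ae", "t"}, [("p", "b", 0.1), ("b", "p", 0.1)])
L = chain(["b", "ae", "t"])  # toy lexicon: only "bat" is a valid word
cost, output = best_path(compose(compose(P, C), L))
print(cost, output)  # 0.1 ['b', 'ae', 't']
```

In practice the composition is carried out with an FST toolkit rather than hand-rolled code, but the mechanics are the same: each transducer constrains or rescores the paths of the previous one, and the shortest path through the composed machine gives the corrected transcription.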


Howell, D., Cox, S. and Theobald, B., "Confusion Modelling for Automated Lip-Reading using Weighted Finite-State Transducers", in Proceedings of the International Conference on Auditory-Visual Speech Processing (AVSP 2013), pp. 197-203, 2013.

Research Team

Mr. Dominic Howell, Prof. Stephen Cox, Dr. Barry Theobald