Lip-reading uses information from lip movements to transcribe speech, and is widely used as a means of communication by people with hearing difficulties.
Recent work in this area has explored the automation of this process, with the aim of building a recognition system driven entirely by lip movements. However, this work has produced relatively poor results, highlighting several problems.
Fig. 1 A block diagram identifying where the confusion model is used to correct the noisy output transcriptions produced by the standard lip-reading recogniser
In comparison with the acoustic speech signal, lip movements carry much less information about speech, because many features that carry information in audio (e.g. voicing, some places of articulation) cannot be seen. In addition, certain sets of speech units that sound very different cannot be discriminated in video. For example, the words "pat" and "bat" sound different but are visually indistinct, whereas "mat" and "gnat" are difficult to distinguish using audio alone but are visually distinct.
This project focuses on modelling confusions between visual speech units to improve lip-reading recognition. We consider three types of confusion: substitutions (replacing one phoneme with another), insertions (inserting a phoneme into a sequence), and deletions (removing a phoneme from a sequence).
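The three confusion types above are the standard edit operations of a Levenshtein alignment. As a minimal illustration (our sketch, not the paper's code), the following aligns a recognised phoneme sequence against a reference transcription and counts substitutions, insertions, and deletions:

```python
def count_confusions(ref, hyp):
    """Align hypothesis `hyp` against reference `ref` by dynamic
    programming and return (substitutions, insertions, deletions)."""
    n, m = len(ref), len(hyp)
    # d[i][j] = minimum edit cost between ref[:i] and hyp[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i
    for j in range(1, m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + cost,  # match / substitution
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j] + 1)         # deletion
    # Back-trace to count each error type.
    subs = ins = dels = 0
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            if ref[i - 1] != hyp[j - 1]:
                subs += 1
            i, j = i - 1, j - 1
        elif j > 0 and d[i][j] == d[i][j - 1] + 1:
            ins += 1
            j -= 1
        else:
            dels += 1
            i -= 1
    return subs, ins, dels
```

For instance, aligning the recogniser output "b ae t" against the reference "p ae t" yields one substitution and no insertions or deletions.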
Fig. 2 We use a composition of multiple weighted finite-state transducers to correct the noisy output from the standard recogniser. In this case, the noisy output from the standard recogniser is modelled as a transducer (P*), and a simple substitution confusion model (C) corrects any errors.
Howell, D., Cox, S. and Theobald, B., Confusion Modelling for Automated Lip-Reading using Weighted Finite-State Transducers. In Proceedings of the International Conference on Auditory-Visual Speech Processing (AVSP) 2013, pp. 197–203, 2013.