Current approaches to lip-reading borrow techniques that have been successful in audio speech recognition – a network of HMMs (hidden Markov models). But in lip-reading there are no phonemes; there are only gestures (sometimes called visemes), which are poorly defined.
It seems human lip-readers scan a visual sequence looking for characteristic gestures (e.g. the "F" lip shape). These are infrequent compared to phonemes but are reasonably reliable. So, in contrast to audio recognition, where there is a dense stream of phonemes, we have a stream of unknown elements interspersed with sparse gestures. This is a classic temporal learning problem and is more analogous to event detection or change-point detection than to speech recognition.
This project proposes to build classifiers to recognise these sparse events and hence augment or replace the HMM recogniser, so improving the robustness of computer lip-reading.
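To illustrate the sparse-event view described above (this is not the proposed classifier, just a minimal sketch on synthetic data), one can picture a per-frame gesture score from some classifier, with detections kept only where the score clears a threshold and detections are separated by a minimum gap in frames. The score stream, event positions, threshold, and gap are all hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a per-frame gesture score: mostly low-level noise
# (the "unknown elements") with a few injected high-score frames standing
# in for the sparse, reliable gestures.
scores = rng.normal(0.0, 0.1, size=300)
true_events = [50, 140, 260]        # hypothetical frames where a gesture occurs
for t in true_events:
    scores[t] = 1.0

def detect_events(scores, threshold=0.5, min_gap=10):
    """Sparse event detection: keep frames whose score clears the threshold,
    suppressing any detection closer than min_gap frames to the previous one."""
    events = []
    for t in np.flatnonzero(scores > threshold):
        if not events or t - events[-1] >= min_gap:
            events.append(int(t))
    return events

print(detect_events(scores))
```

In this toy setting the detector recovers the injected event frames; in the real task the score stream would come from a trained gesture classifier, and the detected events would be used to constrain or replace the HMM decoding.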