Traditional speech processing has focussed on the audio signal produced by a talker and has been successful in developing a variety of audio-speech applications. In this project, the visual component of the speech signal (the face and, in particular, lip movement) is examined in addition to the audio speech component.
While less information is present in visual speech features than in audio speech features, visual features have the distinct advantage that they are not contaminated by acoustic noise. The project began by analysing the relationship between audio and visual speech features, with the aim of extracting noise-free audio information from a combination of noise-contaminated audio features and visual features. These visually-derived noise-free features were then applied to speech processing applications such as speech enhancement and speaker separation.
The project is still underway and some promising results have been achieved so far. Here, an example is shown of how the technique can separate the speech of two speakers mixed together. Three spectrograms are shown. The first is that of speech from a single speaker. The second is that of the same speech mixed with speech from another speaker at 0 dB. The technique aims to restore the speech from the first speaker, and the third is the spectrogram of the recovered audio.
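To make the 0 dB mixing condition and the idea of a binary mask concrete, the following is a minimal sketch in Python with NumPy. It is not the project's method: the mask here is an oracle mask computed from the two clean signals, whereas the project estimates its masks from visual speech features. The function names (`mix_at_snr`, `stft`, `binary_mask`), the frame length and the hop size are all assumptions chosen for illustration, and two sine tones stand in for the two speakers.

```python
import numpy as np


def mix_at_snr(target, interferer, snr_db):
    """Scale `interferer` so the target-to-interferer power ratio equals
    `snr_db`, then add it to `target`. Returns (mixture, scaled_interferer)."""
    p_target = np.mean(target ** 2)
    p_interferer = np.mean(interferer ** 2)
    gain = np.sqrt(p_target / (p_interferer * 10 ** (snr_db / 10)))
    scaled = gain * interferer
    return target + scaled, scaled


def stft(x, frame_len=256, hop=128):
    """Complex short-time Fourier transform with a Hann window.
    Returns an array of shape (frames, frequency bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)


def binary_mask(target_spec, interferer_spec):
    """Oracle binary mask (assumption: both clean signals are available).
    A time-frequency cell is 1 where the target is the stronger source."""
    return (np.abs(target_spec) > np.abs(interferer_spec)).astype(float)


# Hypothetical stand-ins for the two speakers: two tones at 8 kHz sampling.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
speaker_1 = np.sin(2 * np.pi * 440 * t)
speaker_2 = 0.3 * np.sin(2 * np.pi * 1000 * t)

# At 0 dB both sources have equal power in the mixture.
mixture, speaker_2_scaled = mix_at_snr(speaker_1, speaker_2, snr_db=0.0)

spec_1 = stft(speaker_1)
spec_2 = stft(speaker_2_scaled)
spec_mix = stft(mixture)

# Keep only the cells dominated by the first speaker.
mask = binary_mask(spec_1, spec_2)
spec_recovered = mask * spec_mix

error_before = np.linalg.norm(spec_mix - spec_1)
error_after = np.linalg.norm(spec_recovered - spec_1)
print(f"spectral error before masking: {error_before:.2f}")
print(f"spectral error after masking:  {error_after:.2f}")
```

In this toy case the masked spectrogram lies much closer to that of the first source than the unmasked mixture does, which mirrors the relationship between the second and third spectrograms described above. With real speech the two sources overlap in time and frequency, so a binary mask recovers the target only approximately.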
- Khan, F., Milner, B. Speaker Separation Using Visual Speech Features and Single-channel Audio, Interspeech, Lyon, France, 2013
- Khan, F., Milner, B. Speaker Separation using Visually-derived Binary Masks, AVSP, Annecy, France, 2013