The aim of speech enhancement is to reduce or remove the effect of noise on the quality and intelligibility of speech. Conventional speech enhancement methods apply a two-stage procedure: the noise signal is first estimated from the noisy speech, and this noise estimate is then filtered out of the noisy speech. Such methods are effective at removing stationary noise. In non-stationary noise, however, annoying artefacts known as musical noise remain in the signal wherever the noise has been under- or over-estimated.
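To make the two-stage procedure concrete, the sketch below uses basic magnitude spectral subtraction. This is a deliberately simple stand-in for the conventional family of methods, not the log MMSE estimator and not the method developed in this project. The function name, the frame length, the spectral floor and the assumption that the first few frames contain only noise are all illustrative choices rather than details taken from the project.

```python
import numpy as np

def spectral_subtraction(noisy, frame_len=256, noise_frames=5, floor=0.01):
    """Illustrative two-stage enhancement: estimate the noise, then subtract it.

    Assumes (hypothetically) that the first `noise_frames` frames are speech-free.
    """
    hop = frame_len // 2
    # Periodic Hann window: copies at 50% overlap sum to one in the signal interior.
    window = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(frame_len) / frame_len)
    n_frames = 1 + (len(noisy) - frame_len) // hop

    # Split the signal into overlapping, windowed frames and move to the frequency domain.
    frames = np.stack([noisy[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectra = np.fft.rfft(frames, axis=1)
    magnitude, phase = np.abs(spectra), np.angle(spectra)

    # Stage 1: estimate the noise magnitude spectrum from the leading frames.
    noise_mag = magnitude[:noise_frames].mean(axis=0)

    # Stage 2: subtract the estimate, keeping a small spectral floor so that
    # magnitudes never become negative.
    clean_mag = np.maximum(magnitude - noise_mag, floor * magnitude)

    # Resynthesise with the noisy phase and overlap-add the frames.
    clean_frames = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)
    enhanced = np.zeros(len(noisy))
    for i in range(n_frames):
        enhanced[i * hop:i * hop + frame_len] += clean_frames[i]
    return enhanced
```

Because the noise estimate here is a fixed average, any frame in which the true noise differs from it is under- or over-subtracted. The isolated spectral peaks that survive the subtraction are what is heard as musical noise.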
This project does not attempt to filter noisy speech to remove noise. Instead, the aim is to reconstruct a clean speech signal from a set of acoustic speech features extracted from the noisy speech. There are two challenges to such an approach. First, a suitable model of speech, driven by a set of acoustic speech features, must be developed. The speech model should only be able to reconstruct speech, not noise, and must not reduce quality or intelligibility. Second, a method of robust feature extraction is required to obtain the acoustic features of clean speech from the noisy speech.
Results have shown this method to be very effective at processing speech affected by a variety of noises, including babble (many background speakers talking at once), street noise, in-car noise and machine gun noise. The following examples compare the model-based approach to a state-of-the-art conventional method (log MMSE) for the task of removing street noise mixed with speech at 5 dB SNR.
- Harding, P. and Milner, B. Enhancing Speech by Reconstruction from Robust Acoustic Features. In Thirteenth Annual Conference of the International Speech Communication Association, 2012
- Harding, P. and Milner, B. On the use of Machine Learning Methods for Speech and Voicing Classification. In Thirteenth Annual Conference of the International Speech Communication Association, 2012
- Harding, P. and Milner, B. Speech Enhancement by Reconstruction from Cleaned Acoustic Features. In Twelfth Annual Conference of the International Speech Communication Association, 2011
examples_clean [.png / .wav] - clean speech
examples_noisy [.png / .wav] - speech mixed with street noise at 5 dB SNR
examples-logmmse [.png / .wav] - noisy speech processed by log MMSE
examples_model [.png / .wav] - noisy speech processed by the model-based enhancement method
The .png files are narrowband spectrograms, whilst the .wav files are the audio examples corresponding to the spectrograms.
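The noisy example is described as speech mixed with street noise at 5 dB SNR. As a rough illustration of what such mixing involves, the sketch below scales a noise signal so that the ratio of average speech power to average noise power matches a target SNR. The function name `mix_at_snr` is hypothetical, and measuring power over the whole signal is an assumption; the project may have measured speech level differently (for example, over active speech only).

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` at a target SNR (in dB), using whole-signal average power."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)

    # SNR_dB = 10 * log10(speech_power / (gain^2 * noise_power)), solved for gain.
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise

# Usage with synthetic stand-ins: a tone in place of speech, white noise in place of street noise.
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 200 * np.arange(16000) / 16000)
noise = rng.normal(size=16000)
noisy = mix_at_snr(speech, noise, snr_db=5.0)
```

At 5 dB the speech power is only about three times the noise power, so the noise is clearly audible in the mixture.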