Although many methods exist for estimating the fundamental frequency of speech, they tend to be prone to errors. These errors are often caused by the weakness of the fundamental frequency relative to its harmonics or by unusual movements of the vocal cords, and can lead traditional methods to report double or half the true fundamental frequency. The objective of this research is to train a model of fundamental frequency and of the way it changes over time as speakers' voices rise and fall. Because the models are trained only on real data, in which pitch doubling and halving never occur, these errors should be eliminated.
The training speech signal is split into short frames, and each frame is converted into a feature vector; these vectors are used to train a model for each fundamental frequency present. By looking at how this frequency changes from one frame to the next, we can also build a temporal model. Unknown speech is then split into frames and matched against both the frequency models and the temporal model to produce a ‘best fit’ contour of fundamental frequency.
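The matching step can be illustrated with a minimal sketch. It assumes Viterbi decoding over a small set of pitch states and, for brevity, a first-order temporal model that depends only on the previous frame. The function names, state labels, and all probabilities below are hypothetical and chosen purely for illustration; they are not taken from the research itself.

```python
import math

def split_into_frames(signal, frame_len, hop):
    """Split a sequence of samples into fixed-length frames, `hop` samples apart."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def viterbi(log_obs, log_trans, log_init):
    """Return the most likely state sequence.

    log_obs[t][s]   : log-likelihood of frame t under state s (frequency models)
    log_trans[p][s] : log-probability of moving from state p to s (temporal model)
    log_init[s]     : log-probability of starting in state s
    """
    n_states = len(log_init)
    score = [log_init[s] + log_obs[0][s] for s in range(n_states)]
    back = []  # back[t-1][s] = best predecessor of state s at frame t
    for t in range(1, len(log_obs)):
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + log_trans[p][s])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + log_obs[t][s])
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical states: 0 = unvoiced, 1 = 100 Hz, 2 = 200 Hz.
# Illustrative per-frame likelihoods; frame 2 slightly favours the octave above.
obs = [[0.1, 0.8, 0.1],
       [0.1, 0.8, 0.1],
       [0.1, 0.4, 0.5],
       [0.1, 0.8, 0.1]]
# Illustrative temporal model: staying in the same state is most likely.
trans = [[0.8, 0.1, 0.1],
         [0.1, 0.8, 0.1],
         [0.1, 0.1, 0.8]]
init = [1 / 3, 1 / 3, 1 / 3]

log_obs = [[math.log(p) for p in row] for row in obs]
log_trans = [[math.log(p) for p in row] for row in trans]
log_init = [math.log(p) for p in init]

framewise = [max(range(3), key=lambda s: row[s]) for row in obs]
contour = viterbi(log_obs, log_trans, log_init)
frames = split_into_frames(list(range(10)), frame_len=4, hop=2)
```

In this toy case, picking the best state for each frame independently jumps to the 200 Hz state at the third frame, while the decoded contour stays at 100 Hz because the temporal model makes an isolated octave jump unlikely.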
The first video is an example showing how the fundamental frequency rises and falls during normal speech. The second is a three-dimensional model showing the temporal changes. The x axis represents the pitch at time t-2, the y axis the pitch at time t-1, and the z axis the pitch at time t. Silence and unvoiced frames occupy the first two positions along each axis. The size of each circle represents the number of occurrences of that tri-pitch, with the large circles near the origin representing silence-silence-silence.
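The counts behind such a plot can be sketched as follows, assuming each frame has already been labelled with an integer state index. The encoding used here (0 = silence, 1 = unvoiced, 2 and above = pitch bins) and the example sequence are hypothetical, chosen only to mirror the axis layout described above.

```python
from collections import Counter

def tri_pitch_counts(states):
    """Count each (state at t-2, state at t-1, state at t) triple in a sequence."""
    return Counter(zip(states, states[1:], states[2:]))

# Hypothetical frame labels: silence, a short voiced stretch, then silence again.
labels = [0, 0, 0, 0, 2, 3, 3, 1, 0, 0, 0]
counts = tri_pitch_counts(labels)
most_common_triple, occurrences = counts.most_common(1)[0]
```

Each key of `counts` corresponds to one point (x, y, z) in the plot and its value to the circle size; in this toy sequence the silence-silence-silence triple is the most frequent.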
Taylor, J. and Milner, B. Modelling and estimation of the fundamental frequency of speech using a hidden Markov model. To appear in Proc. of the Annual Conference of the International Speech Communication Association (Interspeech), Lyon, France, 2013.