In traditional speaker identification (SID) systems, features extracted from speech utterances are used to build speaker models during training, and these models are then used to infer a speaker's identity at test time. However, the speech waveform also carries a great deal of information about other speaker characteristics. In this project, we propose a multi-layered approach to SID that makes use of gender identification (GID) and accent identification (AID). The first objective of the project was to build a robust GID system. The resulting classifier is an unsupervised method for automatic gender identification from speech. We first design a baseline gender classifier based on MFCC features, and then add a second classifier that uses context-dependent but text-independent pitch features. The outputs of the two classifiers are compared, and any disagreements in gender classification are resolved by a novel pitch-shifting mechanism applied to the utterances. We show that the acoustic-context classifier provides very good gender identification results, and that these are further enhanced by the pitch-shifting process. Furthermore, this enhancement is preserved across a set of different corpora.
Figure 1: I-vector based projections
The second research objective was to devise an unsupervised AID algorithm that is accurate enough to be used for SID purposes. Our work is built on the I-vector model. In traditional MAP adaptation, we can create an utterance-specific GMM by adapting a UBM with the utterance data. The resulting GMM is speaker, channel and utterance dependent. The entire GMM can be described by its mean super-vector, which is of very high dimensionality. We can then apply PCA to this super-vector space to obtain a low-dimensional representation. The basis of the resulting subspace is known as the eigenvoices, and the coordinates of the super-vector relative to this basis are the theoretical I-vector. The components of the I-vector represent high-level characteristics of the utterance, irrespective of phonetic content (the UBM models phonetic variability). The problem with this concept is that the GMM super-vector is not observable. The I-vector model addresses this by constructing a prior model of latent super-vector variation through simple factor analysis over the Baum-Welch statistics of an utterance. We can then achieve class-dependent modelling (accent-dependent in our case) by discriminative linear projections such as LDA, as in Figure 1.
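The idealised pipeline described above (PCA over super-vectors, followed by an LDA projection) can be illustrated with a minimal sketch. It assumes the super-vectors are directly observable, which the text notes is not the case in practice, so it stands in for the concept only and is not the factor-analysis I-vector extractor used in this work. The data are synthetic, and the names `pca_basis`, `lda_projection`, the dimensions and the nearest-class-mean scoring are all hypothetical choices made for illustration.

```python
import numpy as np


def pca_basis(vectors, n_components):
    """Return the data mean and the top principal directions (one per row)."""
    mean = vectors.mean(axis=0)
    # Rows of vt are the principal directions of the centred data.
    _, _, vt = np.linalg.svd(vectors - mean, full_matrices=False)
    return mean, vt[:n_components]


def lda_projection(x, labels, n_components):
    """Return a (dim, n_components) matrix of Fisher LDA directions."""
    dim = x.shape[1]
    overall_mean = x.mean(axis=0)
    s_within = np.zeros((dim, dim))
    s_between = np.zeros((dim, dim))
    for c in np.unique(labels):
        xc = x[labels == c]
        mc = xc.mean(axis=0)
        s_within += (xc - mc).T @ (xc - mc)
        d = (mc - overall_mean)[:, None]
        s_between += len(xc) * (d @ d.T)
    # Directions maximising between-class relative to within-class scatter.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(s_within) @ s_between)
    order = np.argsort(-eigvals.real)
    return eigvecs[:, order[:n_components]].real


# Synthetic stand-in for super-vectors: 3 "accent" classes, 60 utterances each.
rng = np.random.default_rng(0)
n_classes, per_class, sv_dim = 3, 60, 200
labels = np.repeat(np.arange(n_classes), per_class)
class_centres = rng.normal(0.0, 3.0, size=(n_classes, sv_dim))
supervectors = class_centres[labels] + rng.normal(0.0, 1.0, size=(len(labels), sv_dim))

# Step 1: low-dimensional coordinates relative to the PCA ("eigenvoice") basis.
mean, basis = pca_basis(supervectors, n_components=20)
ivectors = (supervectors - mean) @ basis.T

# Step 2: class-dependent (here, accent-dependent) projection via LDA.
w = lda_projection(ivectors, labels, n_components=n_classes - 1)
projected = ivectors @ w

# Simple nearest-class-mean scoring in the projected space (training data only).
class_means = np.stack([projected[labels == c].mean(axis=0) for c in range(n_classes)])
dists = np.linalg.norm(projected[:, None, :] - class_means[None, :, :], axis=2)
predicted = dists.argmin(axis=1)
accuracy = float((predicted == labels).mean())
print(f"training accuracy on synthetic data: {accuracy:.2f}")
```

Because the synthetic classes are well separated, this says nothing about accuracy on real accent data. It only shows how a discriminative projection is stacked on top of the low-dimensional coordinates.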
We have presented a comprehensive analysis of I-vector based classifiers for classifying unlabelled acoustic data by native British accent [2, 3]. We demonstrated the different behaviours of various dimensionality reduction techniques that have previously been used in problems such as speaker and language classification. Our results (Figure 2) show that a fusion of I-vector based systems gives state-of-the-art performance (in comparison with previous results by Hanani et al.) for unlabelled classification of British accent speech data, reaching ∼81% accuracy. This AID system was also employed in accent- and speaker-adapted ASR systems in collaborative work with the University of Birmingham. The final stage of our work will involve applying our unsupervised GID and AID classifiers to improve SID results.
Figure 2: State-of-the-art unsupervised AID
[1] DeMarco, A. and Cox, S.J., An accurate and robust gender identification algorithm, in INTERSPEECH, 2011.
[2] DeMarco, A. and Cox, S.J., Iterative classification of regional British accents in i-vector space, in MLSLP, 2012.
[3] DeMarco, A. and Cox, S.J., Native accent classification via i-vectors and speaker compensation fusion, in INTERSPEECH, 2013.
[4] Najafian, M., DeMarco, A., et al., Supervised and unsupervised adaptation to regional accented speech using limited data for automatic speech recognition, submitted to ICASSP 2014.