Is it sensible to classify Mandarin, for example, using an English recogniser? Why English phonemes? What is the optimum performance that could be obtained for ALID and VLID assuming 100% accurate phonetic recognition? It is this last question that we have used as the starting point for our work.
We have taken the standard text used for this work, the United Nations Universal Declaration of Human Rights, and converted it into the International Phonetic Alphabet (IPA) for a corpus of 22 languages. By examining the relative occurrence of IPA symbols (initially unigrams and bigrams), we can see how distinguishable various languages are under varying smoothing assumptions (smoothing is the process by which zero-count phonemes are replaced with a small probability of occurrence).
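As a sketch of the smoothing step just described, the following Python fragment computes add-epsilon-smoothed unigram frequencies over a fixed symbol alphabet. The function name, the epsilon value, and the toy symbol strings are illustrative assumptions, not material from this work:

```python
from collections import Counter

def smoothed_unigrams(symbols, alphabet, eps=1e-4):
    """Relative frequencies of IPA symbols over a fixed alphabet.
    Zero-count phonemes receive a small probability eps instead of
    zero (add-epsilon smoothing), so every symbol stays possible."""
    counts = Counter(symbols)
    # Normalise so the smoothed probabilities still sum to 1.
    denom = len(symbols) + eps * len(alphabet)
    return {s: (counts[s] + eps) / denom for s in alphabet}

# Toy example: two short transcriptions over a shared 6-symbol alphabet
# (plain letters stand in for IPA symbols here).
alphabet = ["p", "t", "k", "a", "i", "u"]
lang_a = smoothed_unigrams(list("patatakapa"), alphabet)
lang_b = smoothed_unigrams(list("tikutikuti"), alphabet)

# A symbol absent from lang_a ("i") still has a small non-zero
# probability, so distribution comparisons remain well defined.
```

With both distributions smoothed in this way, any standard divergence or distance between them can be computed without division-by-zero or log-of-zero problems.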
In our current work we concentrate on audio language identification using vector quantisation. After quantising the training and testing data, we can measure how similar the two datasets are. By analogy with the language tree, we assume that the smaller the distance between two data clusters, the more likely it is that they represent the same language.
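A minimal sketch of the distance measure behind Fig. 1, assuming the chi-square distance is computed between normalised histograms of vector-quantised codeword counts; the histogram values below are illustrative, not data from this work:

```python
def chi_square_distance(h1, h2):
    """Chi-square distance between two normalised histograms,
    e.g. codeword-occupancy histograms produced by vector
    quantisation of two speech datasets. A smaller distance is
    taken to indicate more similar (possibly the same) languages."""
    return 0.5 * sum((a - b) ** 2 / (a + b)
                     for a, b in zip(h1, h2) if a + b > 0)

def normalise(counts):
    """Convert raw codeword counts into a probability histogram."""
    total = sum(counts)
    return [c / total for c in counts]

# Toy codeword counts for three datasets over a 4-entry codebook.
train = normalise([40, 30, 20, 10])
same_lang = normalise([38, 32, 19, 11])
other_lang = normalise([10, 20, 30, 40])
```

Under this assumption, a test utterance whose codeword histogram lies close to a language's training histogram would be classified as that language.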
Fig. 1 Histogram of vector quantised chi-square distance between English, Mandarin and Arabic