There are two basic approaches to speech synthesis: model-based systems attempt some degree of modelling of the speech production process to generate speech, whereas data-driven systems store and re-constitute speech.
Concatenative speech synthesis (CSS), an example of the second approach, in which fragments of pre-stored speech are "spliced" together to form a new utterance, is currently the most successful technique. CSS can produce high-quality speech, but the result is still discernible as synthetic. When a large database of speech sounds is available, a major problem is selecting the units to concatenate. First, an appropriate unit length must be chosen (e.g. phoneme, phoneme-in-a-context-window, syllable, word, phrase). Then the "best" sequence of units, judged using an appropriate "distance" metric between units, must be selected from those available in the database. A standard approach is to use the Viterbi algorithm to search for the optimum sequence of units. However, although this produces a globally optimal solution (given the choice of distance metric), some individual transitions can still sound poor subjectively, and even a short "bad" section in a spoken phrase can lead listeners to rate the synthesis poorly.
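The Viterbi search over candidate units can be illustrated with a minimal sketch. It is not the project's implementation: the function name select_units, the split of the distance metric into a per-unit target cost and a between-unit join cost, and the toy pitch-based join cost are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

U = TypeVar("U")  # a candidate unit; its representation is left open


def select_units(
    candidates: Sequence[Sequence[U]],
    target_cost: Callable[[int, U], float],
    join_cost: Callable[[U, U], float],
) -> Tuple[List[U], float]:
    """Viterbi search: return the minimum-total-cost unit sequence and its cost.

    candidates[t] lists the database units available for target position t.
    Both cost functions are hypothetical stand-ins for the "distance" metric.
    """
    if not candidates or any(len(c) == 0 for c in candidates):
        raise ValueError("every target position needs at least one candidate")

    # best[j]: lowest cost of any path ending in candidate j at the current position
    best = [target_cost(0, u) for u in candidates[0]]
    # back[t - 1][j]: index of the best predecessor of candidate j at position t
    back: List[List[int]] = []

    for t in range(1, len(candidates)):
        prev_units = candidates[t - 1]
        new_best, pointers = [], []
        for u in candidates[t]:
            costs = [best[i] + join_cost(p, u) for i, p in enumerate(prev_units)]
            i_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[i_min] + target_cost(t, u))
            pointers.append(i_min)
        best = new_best
        back.append(pointers)

    # Trace back from the cheapest final candidate to recover the full path.
    j = min(range(len(best)), key=best.__getitem__)
    total = best[j]
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [candidates[t][j] for t, j in enumerate(path)], total


# Toy example: units are (label, pitch in Hz) pairs, and the join cost is the
# pitch mismatch at the boundary. Real systems use richer spectral and prosodic costs.
candidates = [
    [("a", 100), ("a", 140)],
    [("b", 150), ("b", 105)],
    [("c", 110), ("c", 200)],
]
units, total = select_units(
    candidates,
    target_cost=lambda t, u: 0.0,
    join_cost=lambda p, u: abs(p[1] - u[1]),
)
```

Because the search minimises only the summed cost, a path can still contain one expensive join if the rest of the path is cheap enough to compensate, which is the weakness described above.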
A possible approach to alleviating this problem is to preserve syllabic continuity by using syllable rather than phoneme concatenation, because joins within syllables are more intrusive than joins between them. However, there are at least 10 000 different syllables in English, which makes recording and storing all of them impractical, so a way of synthesizing missing syllables must be found. Another problem is the assignment of stress and duration to units. An isolated word carries some information about stress and duration (e.g. CONtent vs. conTENT), which is known as lexical stress. At a higher level, however, stress depends both on the syntactic structure of the phrase or sentence being uttered and on its semantics (e.g. consider the variation in meaning possible by stressing different words in the phrase "I thought she was married"). For synthesis of an arbitrary phrase this is a huge problem, but if the domain of synthesis is restricted (e.g. reading traffic directions, giving information on TV programmes), then a model of how stress interacts with syntax and semantics is feasible.
Prof. Stephen Cox, Ian Read