Visual speech synthesis systems fall into one of two categories: graphics-based systems and image-based systems.
Graphics-based systems represent points on the surface of the face as vertices in a 3-D space and approximate the surface itself by connecting the vertices. Animation is usually achieved by applying a set of rules to deform the mesh in some controlled manner. Image-based systems generally adopt a sample-based synthesis approach, where real facial images corresponding to synthesis units are extracted from a database and are concatenated to produce new sequences.
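The rule-based deformation used by graphics-based systems can be sketched as follows. This is a minimal, hypothetical illustration (not any particular system's animation rules): each vertex of a 3-D mesh carries a weight describing how strongly a rule affects it, and a "jaw open" rule rotates vertices about the x-axis in proportion to that weight. All names and values here are illustrative.

```python
import numpy as np

# Hypothetical sketch of rule-based mesh deformation in a graphics-based
# system: vertices in 3-D space are moved by a parameterised rule.

def jaw_open(vertices, weights, angle):
    """Rotate each vertex about the x-axis by angle * weight (radians).

    `weights` encodes how strongly the rule affects each vertex, e.g.
    chin vertices near 1.0, forehead vertices near 0.0."""
    out = vertices.copy()
    for i, w in enumerate(weights):
        a = angle * w
        c, s = np.cos(a), np.sin(a)
        y, z = out[i, 1], out[i, 2]
        out[i, 1] = c * y - s * z
        out[i, 2] = s * y + c * z
    return out

# A tiny face-mesh fragment: three vertices (x, y, z).
verts = np.array([[0.0,  1.0, 0.0],    # forehead: unaffected by the rule
                  [0.0,  0.0, 0.5],    # lip: partially affected
                  [0.0, -1.0, 0.2]])   # chin: fully affected
weights = np.array([0.0, 0.5, 1.0])
opened = jaw_open(verts, weights, angle=0.3)
```

In a full system many such rules (lip rounding, eyebrow raise, and so on) would be composed, and the deformed vertices re-rendered each frame.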
In terms of photorealism, image-based systems are more realistic than graphics-based systems. Provided the correct images are extracted from the database, and the blending between the extracted images results in natural-looking facial movements, videorealism can also be achieved. The photorealism of graphics-based systems can be improved by texture mapping a facial image onto the model surface; however, even models with very complex animation rules still do not convince viewers that they are looking at a real person.
Recently, near-videorealistic systems have been proposed that extract models of the face from video sequences using computer vision techniques and apply these models to the problem of synthesising visual speech. In this work, models of the shape and appearance variation of the face are used.
Shape and appearance models are examples of generative models; that is, models capable of reproducing examples of the object from which they are derived. The models are used in the flexible appearance model search algorithm to track the face of a talker. For each frame in a video sequence, the tracker outputs a set of model parameters that gives the synthetic equivalent of the actual face in the corresponding video frame.
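The generative property described above can be sketched with a simple PCA-style linear model, in the spirit of statistical shape and appearance models. This is an assumed, minimal illustration rather than the method used in this work: training shapes are flattened landmark vectors, the model is a mean plus a few principal modes of variation, tracking amounts to projecting a face onto the model (recovering parameters), and synthesis amounts to reconstructing a face from those parameters. All names and the toy data are hypothetical.

```python
import numpy as np

# Minimal sketch of a linear generative shape model (PCA-style):
#   shape ≈ mean + params @ modes

def build_shape_model(shapes, n_modes):
    """Learn mean and principal modes of variation from training shapes."""
    mean = shapes.mean(axis=0)
    centred = shapes - mean
    # Principal modes via SVD of the centred data matrix.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    modes = vt[:n_modes]              # rows are orthonormal modes
    return mean, modes

def generate_shape(mean, modes, params):
    """Synthesise a shape instance from model parameters."""
    return mean + params @ modes

def project_shape(mean, modes, shape):
    """Tracking step: recover the parameters that best describe a shape."""
    return modes @ (shape - mean)

# Toy data: 50 training shapes, 10 landmarks (x, y) => 20-D vectors.
rng = np.random.default_rng(0)
shapes = rng.normal(size=(50, 20))
mean, modes = build_shape_model(shapes, n_modes=5)

params = project_shape(mean, modes, shapes[0])   # face -> parameters
approx = generate_shape(mean, modes, params)     # parameters -> synthetic face
```

The same mean-plus-modes construction extends to appearance (pixel intensities), which is how a per-frame parameter vector from the tracker can be rendered back into a synthetic equivalent of the talker's face.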