NATO Advanced Study Institute 


Modification of Audible and Visual Speech

Michele Covell
Interval Research Corporation

Speech is one of the most common and richest methods that people use to communicate with one another. Our facility with this communication form makes speech a good interface for communicating with or via computers. At the same time, our familiarity with speech makes it difficult to generate synthetic but natural-sounding speech and synthetic but natural-looking lip-synced faces. One way to reduce the apparent unnaturalness of synthetic audible and visual speech is to modify natural (human-produced) speech. This approach relies on examples of natural speech and on simple models of how to take those examples apart and to put them back together to create new utterances.

We discuss two such techniques in depth. The first technique, Mach1, changes the overall timing of an utterance with little loss in comprehensibility and without changing the wording, the emphasis, or the identity of the voice. This ability to speed up (or slow down) speech will make speech a more malleable channel of communication. It gives the listener control over the amount of time that she spends listening to a given oration, even if the presentation of that material is prerecorded. The second technique, Video Rewrite, synthesizes images of faces that are lip-synced to a given utterance. This tool could be useful for reducing the data rate for video conferencing, as well as for providing photorealistic avatars.



