Justine Cassell
MIT Media Lab
20 Ames Street
Cambridge, MA 02139-4307

In this paper I describe ongoing research that seeks to provide a common framework for the generation and interpretation of spontaneous gesture in the context of speech. I present a testbed for this framework in the form of a program that generates speech, gesture, and facial expression from underlying rules specifying (a) what speech and gesture are generated on the basis of a given communicative intent, (b) how communicative intent is distributed across communicative modalities, and (c) where one can expect to find gestures with respect to the other communicative acts. Finally, I describe a system that has the capacity to interpret communicative facial, gestural, intonational, and verbal behaviors.

I am addressing in this paper one very particular use of the term "gesture" -- that is, hand gestures that co-occur with spoken language. Why such a narrow focus, given that so much of the work on gesture in the human-computer interface community has focused on gestures as their own language -- gestures that might replace the keyboard or mouse or speech as a direct command language? Because I don't believe that everyday human users have any more experience with, or natural affinity for, a "gestural language" than they have with DOS commands. We have plenty of experience with actions, and the manipulation of objects. But gestures of the type that Väänänen & Böhm (1993) define as "body movements which are used to convey some information from one person to another" are in fact primarily found in association with spoken language (90% of gestures are found in the context of speech according to McNeill, 1992). Thus if our goal is to get away from learned, pre-defined interaction techniques and create natural interfaces for normal human users, we should concentrate on the type of gestures that come naturally to normal humans.

Spontaneous (that is, unplanned, unselfconscious) gesture accompanies speech in most communicative situations, and in most cultures (despite the common belief to the contrary). People even gesture while they are speaking on the telephone (Rimé, 1982). We know that listeners attend to such gestures, and that they use them to form a mental representation of the communicative intent of the speaker.

What kinds of meanings are conveyed by gesture? How do listeners extract these meanings? Will it ever be possible to build computers that can extract the meanings from human gesture in such a way that the computers can understand natural human communication (including speech, gesture, intonation, facial expression, etc.)? When computers can interpret gestures, will they also be able to display them, such that an autonomous communicating agent will act as the interlocutor in the computer? We imagine computers that communicate as we do, producing and understanding gesture, speech, intonation and facial expression, thereby taking seriously the currently popular metaphor of the computer as conversational partner.