Semantic Priming of Speech Recognition Using Visual Context

We are developing a multimodal processing system called Fuse to explore the effects of visual context on the performance of speech recognition. We propose that a speech recognizer with access to visual input may "second-guess" what a person says based on the visual context of the utterance, thereby increasing speech recognition accuracy. To implement this idea, several problems of grounding language in vision (and vice versa) must be addressed. The current version of the system consists of a medium-vocabulary speech recognition system; a machine-vision system that perceives objects on a tabletop; a language-acquisition component that learns mappings from words to objects and spatial relations; and a linguistically driven focus of visual attention. A corpus of naturally spoken, fluent speech was used to evaluate system performance; utterances ranged from simple constructions such as "the vertical red block" to more complex ones such as "the large green block beneath the red block." We found that integrating visual context reduces the error rate of the speech recognizer by over 30 percent. We are currently investigating the implications of this improved recognition rate for the overall speech understanding accuracy of the system. This work has applications in contextual natural language understanding for intelligent user interfaces. For example, in wearable computing applications, awareness of the user's physical context may be leveraged to make better predictions of the user's speech, supporting robust verbal command and control.
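The priming idea can be illustrated with a minimal sketch. The following is not the Fuse implementation; it is a hypothetical example, assuming the recognizer exposes an n-best list of (transcript, acoustic log-probability) pairs and the vision system exposes a set of words grounded in the visible scene. The function name, the `alpha` interpolation weight, and the grounding score are all illustrative assumptions:

```python
import math

def rescore_with_context(nbest, visible_words, alpha=0.5):
    """Re-rank n-best speech hypotheses using visual context.

    nbest: list of (transcript, acoustic_log_prob) pairs.
    visible_words: words grounded in objects/attributes currently in view.
    alpha: interpolation weight between acoustic and visual evidence
           (a hypothetical knob, not a value from the paper).
    Returns the top-scoring transcript after rescoring.
    """
    rescored = []
    for text, acoustic_lp in nbest:
        words = text.lower().split()
        # Count words that refer to something visible in the scene.
        grounded = sum(1 for w in words if w in visible_words)
        # Smoothed log score favoring hypotheses grounded in the scene.
        context_lp = math.log((grounded + 1) / (len(words) + 1))
        score = (1 - alpha) * acoustic_lp + alpha * context_lp
        rescored.append((text, score))
    return max(rescored, key=lambda p: p[1])[0]

# A scene of colored blocks primes the scene-consistent hypothesis,
# even though its raw acoustic score is slightly lower.
scene = {"red", "green", "block", "large", "vertical"}
hyps = [("the large green block", -12.0),
        ("the large clean clock", -11.5)]
print(rescore_with_context(hyps, scene))  # -> the large green block
```

The design choice here mirrors the abstract's claim: visual context acts as a prior that breaks acoustic ties in favor of utterances consistent with what is in view.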