Publication

Improving automatic speech recognition through head pose driven visual grounding

April 26, 2014

People

Soroush Vosoughi

Former Postdoctoral Associate

Abstract

In this paper, we present a multimodal speech recognition system for real-world scene description tasks. Given a visual scene, the system dynamically biases its language model based on the content of the scene and the visual attention of the speaker. Visual attention is used to focus on likely objects within the scene. Given a spoken description, the system then uses the visually biased language model to process the speech. The system uses head pose as a proxy for the visual attention of the speaker. Readily available, standard computer vision algorithms are used to recognize the objects in the scene, and automatic real-time head pose estimation is performed using depth data captured via a Microsoft Kinect. The system was evaluated with multiple participants. Overall, incorporating visual information into the speech recognizer greatly improved speech recognition accuracy. The rapidly decreasing cost of 3D sensing technologies such as the Kinect allows systems based on similar underlying principles to be used for many speech recognition tasks where visual information is available.
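The abstract does not specify how the language model is biased, so the following is only an illustrative sketch of the general idea, not the paper's method. It assumes a unigram language model, a Gaussian falloff of attention with angular distance from the head yaw, and a linear interpolation weight `lam`; the function names, the `sigma` parameter, and the toy scene are all hypothetical.

```python
import math


def attention_weights(object_angles, head_yaw, sigma=15.0):
    """Weight each object by how close it lies to the head direction.

    Assumption: attention falls off as a Gaussian of the angular
    distance (in degrees) between the object and the head yaw.
    """
    raw = {
        name: math.exp(-((angle - head_yaw) ** 2) / (2 * sigma ** 2))
        for name, angle in object_angles.items()
    }
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}


def biased_unigram(base_lm, object_words, weights, lam=0.5):
    """Interpolate a base unigram model with a scene-driven distribution.

    Each object's attention weight is split evenly across the words
    associated with it (a hypothetical choice), then mixed with the
    base model using interpolation weight `lam`.
    """
    scene = {}
    for obj, w in weights.items():
        words = object_words[obj]
        for word in words:
            scene[word] = scene.get(word, 0.0) + w / len(words)
    vocab = set(base_lm) | set(scene)
    return {
        word: (1 - lam) * base_lm.get(word, 0.0) + lam * scene.get(word, 0.0)
        for word in vocab
    }


# Toy scene: object positions as horizontal angles in degrees (hypothetical).
object_angles = {"cup": -20.0, "ball": 10.0}
object_words = {"cup": ["cup", "mug"], "ball": ["ball"]}
base_lm = {"cup": 0.1, "mug": 0.1, "ball": 0.1, "the": 0.7}

weights = attention_weights(object_angles, head_yaw=10.0)
lm = biased_unigram(base_lm, object_words, weights)
```

With the speaker's head turned toward the ball, words tied to the ball gain probability mass relative to the base model, while the result remains a valid probability distribution. A real recognizer would apply the same kind of reweighting to an n-gram or neural language model rather than a unigram table.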

CHI2014_vosoughi.pdf

On Effects of Caregiver Speech on Early Child Language Acquisition Using a Naturalistic, Dense and Longitudinal Corpus

Effects of Caregiver Prosody on Child Language Acquisition.

An Automatic Child-Directed Speech Detector for the Study of Child Language Development

A longitudinal study of prosodic exaggeration in child-directed speech
