A Trainable Spoken Language Understanding System for Visual Object Selection

Sept. 16, 2002


Deb Roy, Peter Gorniak, Niloy Mukherjee, Josh Juster


We present a trainable, visually-grounded, spoken language understanding system. The system acquires a grammar and vocabulary from a “show-and-tell” procedure in which visual scenes are paired with verbal descriptions. The system is embodied in a table-top mounted active vision platform. During training, a set of objects is placed in front of the vision system. Using a laser pointer, the system points to objects in random sequence, prompting a human teacher to provide spoken descriptions of the selected objects. The descriptions are transcribed and used to automatically acquire a visually-grounded vocabulary and grammar. Once trained, a person can interact with the system by verbally describing objects placed in front of the system. The system recognizes and robustly parses the speech and points, in real-time, to the object which best fits the visual semantics of the spoken description.

Related Content