Newt: A Visually Grounded Speech Understanding System 

Newt is a trainable, visually grounded spoken language understanding system. The system acquires a grammar and vocabulary from a "show-and-tell" procedure in which visual scenes are paired with verbal descriptions. The system is embodied in a tabletop-mounted active vision platform. During training, a set of objects is placed in front of the vision system. Using a laser pointer, the system points to objects in random sequence, prompting a human teacher to provide spoken descriptions of the selected objects. The descriptions are transcribed and used to automatically acquire a visually grounded vocabulary and grammar. Once the system is trained, a person can interact with it by verbally describing objects placed in front of it. The system recognizes and robustly parses the speech and points, in real time, to the object that best fits the visual semantics of the spoken description.
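
The final selection step can be illustrated with a toy sketch. This is not Newt's actual model, which is described in the paper cited below. It assumes that each known word is grounded in one visual feature through a Gaussian, that a description is scored by summing the log-likelihoods of its known words, and that unknown words are skipped. The names `WORD_MODELS`, `word_log_likelihood`, and `select_object`, the feature names, and all numeric values are hypothetical.

```python
import math

# Hypothetical grounded vocabulary: word -> (feature name, mean, std).
# The numbers are invented for illustration; a trained system would
# estimate such parameters from paired scenes and descriptions.
# Hue is treated as a plain scalar here, ignoring its circular nature.
WORD_MODELS = {
    "red": ("hue", 0.0, 0.1),
    "green": ("hue", 0.33, 0.1),
    "large": ("size", 0.8, 0.2),
    "small": ("size", 0.2, 0.2),
}

def word_log_likelihood(word, obj):
    """Log of a Gaussian density over the feature the word is grounded in."""
    feature, mean, std = WORD_MODELS[word]
    z = (obj[feature] - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2 * math.pi))

def select_object(description, objects):
    """Return the index of the object that best fits the description."""
    # Words outside the vocabulary (e.g. "the", "one") are simply ignored.
    words = [w for w in description.lower().split() if w in WORD_MODELS]
    scores = [sum(word_log_likelihood(w, obj) for w in words) for obj in objects]
    return max(range(len(objects)), key=scores.__getitem__)

# A hypothetical scene of three objects with normalized features.
scene = [
    {"hue": 0.02, "size": 0.75},  # large red object
    {"hue": 0.01, "size": 0.25},  # small red object
    {"hue": 0.35, "size": 0.70},  # large green object
]

chosen = select_object("the large red one", scene)
```

Summing per-word log-likelihoods treats the words as independent evidence, a simplification chosen to keep the sketch short; it does not model the acquired grammar at all.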

Deb Roy, Peter Gorniak, Niloy Mukherjee, and Josh Juster. (International Conference on Spoken Language Processing, 2002). A Trainable Spoken Language Understanding System for Visual Object Selection. pdf (86K)

Video of Training and Understanding (Quicktime, 27 MB)