Learning Visually Grounded Words and Syntax of Natural Spoken Language

Jan. 23, 2002


Deb Roy


Properties of the physical world have shaped human evolutionary design and given rise to physically grounded mental representations. These grounded representations provide the foundation for higher-level cognitive processes, including language. Most natural language processing machines to date lack grounding. This paper advocates the creation of physically grounded language learning machines as a path toward scalable systems which can conceptualize and communicate about the world in human-like ways. As steps in this direction, two experimental language acquisition systems are presented. The first system, CELL, is able to learn acoustic word forms and associated shape and color categories from fluent untranscribed speech paired with video camera images. In evaluations, CELL has successfully learned from spontaneous infant-directed speech. A version of CELL has been implemented in a robotic embodiment which can verbally interact with human partners. The second system, DESCRIBER, acquires a visually grounded model of natural language which it uses to generate spoken descriptions of objects in visual scenes. Input to DESCRIBER’s learning algorithm consists of computer-generated scenes paired with natural language descriptions produced by a human teacher. DESCRIBER learns a three-level language model which encodes syntactic and semantic properties of phrases, word classes, and words. The system learns from a simple ‘show-and-tell’ procedure, and once trained, is able to generate semantically appropriate, contextualized, and syntactically well-formed descriptions of objects in novel scenes.
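To give a flavor of the grounding idea, the sketch below is a toy illustration (not CELL's actual algorithm, which operates on untranscribed acoustics and camera images): word tokens are associated with visual categories purely by cross-modal co-occurrence statistics across paired episodes. All episode data, category names, and the `best_referent` helper are invented for illustration.

```python
from collections import Counter

# Toy cross-modal association: each "episode" pairs an utterance
# (already tokenized, unlike CELL's raw audio) with visual attributes.
episodes = [
    (["the", "red", "ball"],            {"color": "red",  "shape": "ball"}),
    (["a", "red", "cup"],               {"color": "red",  "shape": "cup"}),
    (["the", "blue", "ball"],           {"color": "blue", "shape": "ball"}),
    (["see", "the", "blue", "cup"],     {"color": "blue", "shape": "cup"}),
]

word_counts = Counter()          # how often each word occurs
pair_counts = Counter()          # how often a word co-occurs with a visual category
for words, scene in episodes:
    referents = set(scene.values())
    for w in words:
        word_counts[w] += 1
        for r in referents:
            pair_counts[(w, r)] += 1

def best_referent(word):
    """Return (referent, P(referent | word)) with the highest co-occurrence probability."""
    candidates = [(r, c / word_counts[word])
                  for (w, r), c in pair_counts.items() if w == word]
    return max(candidates, key=lambda x: x[1])

print(best_referent("red"))    # "red" co-occurs with red scenes in every episode
print(best_referent("ball"))
```

Content words like "red" and "ball" acquire sharp associations, while function words like "the" co-occur diffusely with everything, which is one intuition behind using cross-modal regularity to separate referential from non-referential speech.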