Word Learning

A Robot Which Learns Shape and Color Names

Toco is a robot with microphone and camera input that learns spoken labels for shapes and colors. This is a difficult task because no vocabulary or visual categories are built in. Based solely on untranscribed spontaneous speech paired with video, the system learns an audio-visual lexicon. Each lexical item learned by Toco is represented by a statistical model of its speech sounds (a Hidden Markov Model) paired with statistical models of shape and color categories. The meanings of words are thus "grounded" in camera observations. Once Toco has learned, he can understand novel speech input (by finding objects which match the meaning of the speech) and generate speech which describes novel visual observations.
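
To make the idea of a grounded lexical item concrete, here is a minimal Python sketch, not Toco's actual implementation. It assumes each visual category is a diagonal Gaussian over a small feature vector, and it replaces the HMM speech model with a plain text label. The names VisualCategory, LexicalItem, and best_match, as well as the feature values, are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class VisualCategory:
    """Diagonal Gaussian over a visual feature vector (assumed model)."""
    mean: list
    var: list

    def log_likelihood(self, features):
        # Sum of per-dimension Gaussian log densities.
        return sum(
            -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
            for x, m, v in zip(features, self.mean, self.var)
        )


@dataclass
class LexicalItem:
    """A word paired with its visual meaning."""
    label: str  # stands in for the HMM speech model, omitted here
    category: VisualCategory


def best_match(item, objects):
    """Return the name of the object whose features best fit the word's category."""
    return max(objects, key=lambda name: item.category.log_likelihood(objects[name]))


# Hypothetical color features (R, G, B in [0, 1]) for two objects in view.
red = LexicalItem("red", VisualCategory(mean=[0.9, 0.1, 0.1], var=[0.01, 0.01, 0.01]))
scene = {"ball": [0.85, 0.15, 0.1], "cube": [0.1, 0.2, 0.9]}
print(best_match(red, scene))  # prints "ball"
```

Understanding speech then amounts to recognizing which lexical item was spoken and picking the object that scores highest under its visual category. Generating a description would run the same pairing in the other direction, choosing the item whose category best explains the observed object.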

Learning from Infant-directed Spontaneous Speech Paired with Video

We tested our model of Cross-modal Early Lexical Learning (CELL) on a corpus of infant-directed speech. CELL learned object names from the speech of six different caregivers. In a comparison with a speech-only learning system, CELL achieved superior learning rates by integrating visual observations in the word learning task.
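
One way to see why visual context can help is to score candidate word and category pairings by how strongly they co-occur across paired observations. The Python sketch below uses mutual information for that score. It is an illustration under simplifying assumptions, not CELL's actual procedure: words are assumed to be already segmented tokens (CELL works from untranscribed speech), visual categories are assumed to be given as labels, and the function name and toy episodes are hypothetical.

```python
import math


def mutual_information(pairs, word, category):
    """Mutual information (in bits) between 'word was heard' and 'category was seen'."""
    n = len(pairs)
    mi = 0.0
    for w in (True, False):
        for c in (True, False):
            joint = sum(
                1 for words, cats in pairs
                if (word in words) == w and (category in cats) == c
            ) / n
            p_w = sum(1 for words, _ in pairs if (word in words) == w) / n
            p_c = sum(1 for _, cats in pairs if (category in cats) == c) / n
            if joint > 0:  # joint > 0 implies both marginals are > 0
                mi += joint * math.log2(joint / (p_w * p_c))
    return mi


# Toy episodes: (words heard, visual categories in view).
episodes = [
    ({"look", "ball"}, {"BALL"}),
    ({"the", "ball"}, {"BALL"}),
    ({"look", "dog"}, {"DOG"}),
    ({"the", "dog"}, {"DOG"}),
]

print(mutual_information(episodes, "ball", "BALL"))  # 1.0
print(mutual_information(episodes, "look", "BALL"))  # 0.0
```

In this toy data, "ball" is perfectly predictive of the BALL category, while "look" appears equally often with both categories and carries no information about either. A learner restricted to speech alone has no such signal for separating object names from other frequent words.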



Video: Toco learns names (MPEG, 26.1MB)


Deb Roy. (2005). Grounding words in perception and action: computational insights. Trends in Cognitive Sciences, 9(8), 389-396.

Deb Roy. (2003). Grounded Spoken Language Acquisition: Experiments in Word Learning. IEEE Transactions on Multimedia, 5(2), 197-209.

Deb Roy and Alex Pentland. (2002). Learning Words from Sights and Sounds: A Computational Model. Cognitive Science, 26(1), 113-146.