Publication

A Trainable Visually-Grounded Spoken Language Generation System

June 1, 2002

People

Deb Roy

Professor of Media Arts and Sciences

Abstract

A spoken language generation system has been developed that learns to describe objects in computer-generated visual scenes. The system is trained by a ‘show-and-tell’ procedure in which visual scenes are paired with natural language descriptions. Learning algorithms acquire probabilistic structures that encode the visual semantics of phrase structure, word classes, and individual words. Using these structures, a planning algorithm integrates syntactic, semantic, and contextual constraints to generate natural and unambiguous descriptions of objects in novel scenes. The output of the generation system is synthesized using word-based concatenative synthesis drawing from the original training speech corpus. In evaluations of semantic comprehension by human judges, the performance of automatically generated spoken descriptions was comparable to that of human-generated descriptions.

describer_icslp02.pdf
