Describer: A Trainable Visually-Grounded Spoken Language Generation System

Describer learns to describe objects in computer-generated visual scenes. The system is trained by a "show-and-tell" procedure in which visual scenes are paired with natural language descriptions. A set of learning algorithms acquires probabilistic structures that encode the visual semantics of phrase structure, word classes, and individual words. Using these structures, a planning algorithm integrates syntactic, semantic, and contextual constraints to generate natural and unambiguous descriptions of objects in novel scenes. The learning system generalizes from its training data and can generate expressions that never occurred during training. The output of the generation system is rendered as speech by word-based concatenative synthesis, drawing on the original training speech corpus. In evaluations of semantic comprehension by human judges, automatically generated spoken descriptions performed comparably to human-generated descriptions.
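To make the idea of a contextually unambiguous description concrete, here is a minimal sketch in Python. It is not Describer's planning algorithm, which relies on learned probabilistic structures. It assumes hand-coded discrete attributes and a fixed modifier order, and it simply searches for the shortest phrase that matches the target object and no other object in the scene. The scene contents and the names `MODIFIERS`, `HEAD`, and `describe` are hypothetical, chosen only for illustration.

```python
from itertools import combinations

# Hypothetical toy scene: each object is a dict of discrete attributes.
# (Describer itself learns word meanings from visual features; here the
# attribute labels are simply given.)
scene = [
    {"size": "large", "color": "green", "shape": "rectangle"},
    {"size": "small", "color": "green", "shape": "square"},
    {"size": "large", "color": "red", "shape": "rectangle"},
]

# Assumed fixed phrase structure: optional modifiers, then a head noun.
MODIFIERS = ["size", "color"]
HEAD = "shape"


def describe(scene, target_index):
    """Return the shortest phrase matching only the target, or None."""
    target = scene[target_index]
    # Try phrases with 0, 1, 2, ... modifiers, shortest first.
    for n in range(len(MODIFIERS) + 1):
        for mods in combinations(MODIFIERS, n):
            attrs = mods + (HEAD,)
            # Indices of all objects the candidate phrase could refer to.
            referents = [
                i for i, obj in enumerate(scene)
                if all(obj[a] == target[a] for a in attrs)
            ]
            if referents == [target_index]:
                return "the " + " ".join(target[a] for a in attrs)
    return None  # every candidate phrase also matches another object


print(describe(scene, 0))  # prints "the green rectangle"
```

In this toy scene "the rectangle" and "the large rectangle" both match two objects, so the search settles on "the green rectangle". The point of the sketch is only that a description is chosen relative to the other objects present, which is the kind of contextual constraint the abstract refers to.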

Deb Roy. (in press). Learning Words and Syntax for a Visual Description Task. Computer Speech and Language.

Deb Roy. (in review). A Trainable Visually-Grounded Spoken Language Generation System. Submitted to the International Conference on Spoken Language Processing.

Sample output generated by Describer in response to novel images (target objects indicated by yellow arrows). The sentences generated by Describer did not occur in the training data. The grammatical structure acquired from training examples is used generatively to describe novel scenes using novel word sequences.

This material is based upon work supported by the National Science Foundation under Grant No. 0083032. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.