Research Group Projects and Descriptions

Cognitive Machines
Principal Investigator: Deb Roy

The goal of the Cognitive Machines group is to create systems that engage in fluid, situated, meaningful communication with human partners. We seek to understand and model the processes by which words are grounded in the physical world as a result of embodied perception, action, and learning. These models are applied to create situated human-machine interfaces. We also use our computational models as a source of predictions and possible accounts for a number of cognitive phenomena including aspects of children's language acquisition, concept formation, and attention.

Behavior Capture from Thousands of People Online
Jeff Orkin and Deb Roy

The Restaurant Game is a multiplayer simulation that captures the behavior and language of thousands of people playing the roles of waitresses and customers. We are developing machine-learning algorithms that mine game-play logs to acquire generative models of human language, behavior, and social roles. These models will power synthetic conversational characters that interact with humans in training simulations, games, and other virtual worlds.

BlitzScribe: Speech Analysis for the Human Speechome Project
Deb Roy, Jethran Guinness and Brandon Roy

BlitzScribe is a new approach to speech transcription driven by the demands of today's massive multimedia corpora. High-quality annotations are essential for indexing and analyzing many multimedia datasets; in particular, our study of language development for the Human Speechome Project depends on speech transcripts. Unfortunately, automatic speech transcription is inadequate for many natural speech recordings, and traditional approaches to manual transcription are extremely labor intensive and expensive. BlitzScribe uses a semi-automatic approach, combining human and machine effort to dramatically improve transcription speed. Automatic methods identify and segment speech in dense, multitrack audio recordings, allowing us to build streamlined user interfaces that maximize human productivity. The first version of BlitzScribe is already about 4 to 6 times faster than existing systems. We are exploring user-interface design, machine-learning, and pattern-recognition techniques to build a human-machine collaborative system that will make massive transcription tasks feasible and affordable.
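The description does not say how speech is identified and segmented. As a rough illustration only, the sketch below uses a simple frame-energy threshold, a common baseline for finding speech-like regions, and is not the BlitzScribe method. The function name `segment_speech` and its parameters (`frame_len`, `threshold`, `min_frames`) are hypothetical.

```python
def segment_speech(samples, frame_len=160, threshold=0.01, min_frames=3):
    """Return (start, end) sample offsets of high-energy runs.

    Illustrative energy-threshold baseline, not the BlitzScribe algorithm.
    """
    segments = []
    start = None  # index of the first frame of the current run, if any
    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        # Mean squared amplitude of the frame.
        energy = sum(s * s for s in frame) / frame_len
        if energy >= threshold:
            if start is None:
                start = i
        elif start is not None:
            # The run just ended; keep it only if it is long enough.
            if i - start >= min_frames:
                segments.append((start * frame_len, i * frame_len))
            start = None
    # Close a run that extends to the end of the recording.
    if start is not None and n_frames - start >= min_frames:
        segments.append((start * frame_len, n_frames * frame_len))
    return segments
```

In a semi-automatic workflow, segments like these would be handed to a human transcriber one at a time, so that listening effort is spent only on audio likely to contain speech.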

HeadLock: Video Analysis for the Human Speechome Project
Philip DeCamp and Deb Roy

HeadLock is a semi-automated system for head pose annotation that explores how human-computer interfaces can be combined with computer vision technologies to efficiently extract behavioral information from video recordings. For images with limited resolution, the orientation of a head is often the best approximation for gaze direction, a crucial component in analyzing the rich interactions and behaviors of humans. The goal of HeadLock is to reduce the cost of extracting head pose from video by several orders of magnitude by developing machine-perception technologies that can perform robust head pose estimation with minimal constraints on resolution and camera angle.

Human Speechome Project
Deb Roy, Philip DeCamp, Brandon Roy, Jethran Guinness, Rony Kubat and Stefanie Tellex

The Human Speechome Project is an effort to observe and computationally model the longitudinal language development of a single child at an unprecedented scale. To achieve this, we are recording, storing, visualizing, and analyzing communication and behavior patterns in over 400,000 hours of home video and speech recordings. The tools that are being developed for mining and learning from thousands of terabytes of multimedia data could open up new business opportunities for a broad range of industries, from security to Internet commerce.

Alumni Contributor(s): Alexia Salata and Michael Fleischman

SLIMD
Deb Roy, Stefanie Tellex, Kleovoulos Tsourides and Gregory Marton

SLIMD is a prototype system for retrieving video clips via natural language queries. An automated video analysis system tracks the movements of people in surveillance video and stores track data in a database. A natural language interpretation system converts queries such as "to the kitchen counter" into semantic path filters, which are used to find tracks in the database that match the description. The system can be used to search large video archives for specific human behaviors and can be adapted to other forms of geospatial data such as GPS logs.
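To make the idea of a semantic path filter concrete, here is a minimal sketch, assuming tracks are stored as lists of (x, y) floor positions and that named landmarks map to rectangular regions. The names `Region`, `LANDMARKS`, `parse_query`, and `search`, the landmark coordinates, and the rule for "to the ..." are all hypothetical; SLIMD's actual representation is not described here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float]
Track = List[Point]


@dataclass(frozen=True)
class Region:
    """Axis-aligned rectangle on the floor plan (hypothetical)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def contains(self, p: Point) -> bool:
        x, y = p
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


# Assumed landmark table; the coordinates are made up for illustration.
LANDMARKS: Dict[str, Region] = {
    "kitchen counter": Region(4.0, 0.0, 6.0, 1.0),
}


def parse_query(query: str) -> Callable[[Track], bool]:
    """Turn a phrase such as 'to the kitchen counter' into a track predicate."""
    q = query.strip().lower()
    prefix = "to the "
    if q.startswith(prefix) and q[len(prefix):] in LANDMARKS:
        region = LANDMARKS[q[len(prefix):]]
        # "to X": the track starts outside X and ends inside X.
        return lambda track: (
            bool(track)
            and region.contains(track[-1])
            and not region.contains(track[0])
        )
    raise ValueError(f"unsupported query: {query!r}")


def search(tracks: Dict[str, Track], query: str) -> List[str]:
    """Return the ids of stored tracks that satisfy the query's path filter."""
    matches = parse_query(query)
    return [track_id for track_id, track in tracks.items() if matches(track)]
```

The same predicate-over-paths design would carry over to GPS logs by swapping floor coordinates for latitude and longitude and rectangles for geofenced areas.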

Sports Video Search Using Situated Natural Language Processing
Michael Fleischman and Deb Roy

The quantity and availability of video content are soaring due to the combination of television networks and the Internet. The aim of this project is to develop more effective means to manage, search, and translate video content. We are developing algorithms that interpret language in video (speech and closed caption text) by exploiting aspects of the non-linguistic context, or situation, conveyed by the accompanying video. We model situations by automatically finding patterns within low-level audio/video features that represent events. Event patterns are then mapped to words spoken in the video in order to create a “grounded” dictionary of word meanings. Our research focuses on sports video, in particular, on Major League Baseball games. We are exploring applications in multimedia search and video-based machine translation.
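One simple way to picture a grounded dictionary is as a table that links each word to the event patterns it tends to occur with. The sketch below uses plain co-occurrence counts as a stand-in; it is not the project's actual mapping method. The function name `build_grounded_dictionary` and the event ids (`"E1"`, `"E2"`) are hypothetical, and event detection is assumed to have already happened.

```python
from collections import Counter, defaultdict


def build_grounded_dictionary(clips):
    """Map each word to a distribution over co-occurring event patterns.

    `clips` is an iterable of (event_ids, words) pairs, one per video clip.
    Illustrative co-occurrence baseline, not the authors' algorithm.
    """
    counts = defaultdict(Counter)
    for event_ids, words in clips:
        # Count each word/event pairing at most once per clip.
        for word in set(words):
            for event in set(event_ids):
                counts[word][event] += 1

    dictionary = {}
    for word, event_counts in counts.items():
        total = sum(event_counts.values())
        # Normalize counts into relative frequencies per word.
        dictionary[word] = {e: n / total for e, n in event_counts.items()}
    return dictionary
```

A search application could then rank clips for a query word by how strongly their detected events are associated with that word in the dictionary.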

Trisk: A Conversational Robot
Deb Roy, Kai-yuh Hsiao, Stefanie Tellex, Rony Daniel Kubat, Soroush Vosoughi, Thananat Jitapunkul and Kleovoulos Tsourides

Trisk is a humanoid robot that integrates speech input, visual perception, and active touch in order to interact with humans and its environment. It can understand and obey natural language commands, and will soon be able to answer questions. The robot is a platform for designing new algorithms and multimodal knowledge representations for sensory-motor grounded language use. This research takes steps towards social robots that can coordinate activities with human partners using natural language and gesture.
