The Human Speechome Project

We seek to better understand how children learn the meaning of words through analysis of observational recordings of child-caregiver interactions in natural contexts. Currently available corpora greatly under-sample crucial early stages of child development. As a result, our understanding of language acquisition hinges on surprisingly sparse and incomplete data. Motivated by this basic problem, Deb Roy has begun a pilot project in which he is recording his son's development at home by gathering approximately 10 hours of high-fidelity audio and video on a daily basis from birth to age three. The resulting corpus, which already contains over 100,000 hours of multi-track recordings, constitutes the most comprehensive record of a child's development made to date. This data provides many new opportunities to understand the fine-grained dynamics of language development.

A principal challenge of the project is to efficiently transcribe and annotate the massive corpus. New software algorithms and human-computer interfaces will be developed that enable a small team of researchers to quickly and accurately code the raw data semi-automatically. Using these software tools, we plan to study and computationally model the early words uttered by the child by tracing back to the contexts in which they were used by adults speaking to him.

For most children, language development is steady, progressive, and, to a casual observer, effortless. But for some children -- those with developmental delays due to biological or environmental causes -- language is a major developmental hurdle. Understanding the regularities in home environments is essential to understanding mechanisms of language acquisition, causes of delay, and ultimately appropriate intervention procedures. We believe this project will shed new light on fundamental aspects of how child-caregiver social interactions shape language acquisition.

Although there are clear limits to what may be concluded from studying a single child, in the time-honored tradition of longitudinal case studies dating back to Piaget, the findings from this project may guide more extensive follow-on observational and experimental studies. Beyond the Speechome corpus, the development of an effective semi-automated data coding and analysis methodology may enable scientists to leverage high-density audio-visual corpora to address numerous open questions in the behavioral sciences.

Privacy Statement

Audio and video recording of children in their homes is a widely used, well-established method in developmental psychology, with mature ethical norms. Our project is distinct due to the unusual sampling density of the recordings. There is no plan to distribute or publish the complete original recordings due to privacy considerations, although we will explore ways to work with other researchers by sharing appropriately coded and selected portions of the full corpus.

Press Archive


The birth of a word, 2011 TED talk by Professor Roy.

H2.0 presentation on the Speechome project by Professor Roy.

Sample video image from the kitchen. [JPG, 101K]

Timelapse video of a day of life at home. [QuickTime, 3.5 megs]

Evolution of "water" over several months. [WAV, 3.5 megs]

Video collage of "ball" over several months. [QuickTime, 1.2 megs]

Video visualization of caregiver and child interaction. [high resolution PNG (3M) | low resolution PNG (145K)]

Dynamic generation of video visualization. [QuickTime, 3.6 megs]

Photo/video credit: MIT Media Lab


Soroush Vosoughi and Deb Roy. (2012). An Automatic Child-Directed Speech Detector for the Study of Child Language Development. Proceedings of Interspeech 2012. Portland, Oregon. pdf (1.5MB)

Brandon C. Roy, Michael C. Frank, and Deb Roy. (2012). Relating Activity Contexts to Early Word Learning in Dense Longitudinal Data. Proceedings of the 34th Annual Meeting of the Cognitive Science Society. Sapporo, Japan. pdf (912KB)

Soroush Vosoughi and Deb Roy. (2012). A longitudinal study of prosodic exaggeration in child-directed speech. Proceedings of the 6th International Conference on Speech Prosody. Shanghai, China. pdf (200KB)

Philip DeCamp, George Shaw, Rony Kubat and Deb Roy. (2010). An Immersive System for Browsing and Visualizing Surveillance Video. Proceedings of ACM Multimedia 2010. Florence, Italy. pdf (3.5MB)

Meredith Meyer, Philip DeCamp, Bridgette Hard, Dare Baldwin and Deb Roy. (2010). Assessing Behavioral and Computational Approaches to Naturalistic Action Segmentation. Proceedings of the 32nd Annual Cognitive Science Conference. Portland, Oregon. pdf (388KB)

Brandon C. Roy, Soroush Vosoughi, and Deb Roy. (2010). Automatic Estimation of Transcription Accuracy and Difficulty. Proceedings of Interspeech 2010. Makuhari, Japan. pdf (1.7MB)

Soroush Vosoughi, Brandon C. Roy, Michael C. Frank, and Deb Roy. (2010). Contributions of Prosodic and Distributional Features of Caregivers' Speech in Early Word Learning. Proceedings of the 32nd Annual Cognitive Science Conference. Portland, Oregon. pdf (348KB)

Soroush Vosoughi, Brandon C. Roy, Michael C. Frank, and Deb Roy. (2010). Effects of Caregiver Prosody on Child Language Acquisition. Proceedings of the 5th International Conference on Speech Prosody. Chicago, IL. pdf (344KB)

Deb Roy. (2009). New Horizons in the Study of Child Language Acquisition. Proceedings of Interspeech 2009. Brighton, England. pdf (1.4MB)

Brandon C. Roy and Deb Roy. (2009). Fast transcription of unstructured audio recordings. Proceedings of Interspeech 2009. Brighton, England. pdf (276K)

Rony Kubat, Daniel Mirman and Deb Roy. (2009). Semantic context effects on color categorization. Proceedings of the 31st Annual Cognitive Science Society Meeting. pdf (392K)

Brandon C. Roy, Michael C. Frank and Deb Roy. (2009). Exploring word learning in a high-density longitudinal corpus. Proceedings of the 31st Annual Meeting of the Cognitive Science Society. pdf (820K)

Philip DeCamp and Deb Roy. (2009). A Human-Machine Collaborative Approach to Tracking Human Movement in Multi-Camera Video. Proceedings of the 2009 International Conference on Content-based Image and Video Retrieval (CIVR). pdf (1.0MB)

Rony Kubat, Philip DeCamp, Brandon Roy, and Deb Roy. (2007). TotalRecall: Visualization and Semi-Automatic Annotation of Very Large Audio-Visual Corpora. Ninth International Conference on Multimodal Interfaces (ICMI 2007). pdf (491K)

Philip DeCamp. (2007). HeadLock: Wide-Range Head Pose Estimation for Low Resolution Video. M.Sc. in Media Arts and Sciences Thesis. pdf (24.4M)

Brandon Roy. (2007). Human-Machine Collaboration for Rapid Speech Transcription. M.Sc. in Media Arts and Sciences Thesis. pdf (13.1M)

Michael Fleischman, Philip DeCamp, and Deb Roy. (2006). Mining Temporal Patterns of Movement for Video Content Classification. Proceedings of the 8th ACM SIGMM International Workshop on Multimedia Information Retrieval. pdf (323K)

Deb Roy, Rupal Patel, Philip DeCamp, Rony Kubat, Michael Fleischman, Brandon Roy, Nikolaos Mavridis, Stefanie Tellex, Alexia Salata, Jethran Guinness, Michael Levit, Peter Gorniak. (2006). The Human Speechome Project. Proceedings of the 28th Annual Cognitive Science Conference. pdf (756K)