The Audio Notebook
Paper and Pen Interaction with Structured Speech

Lisa Joy Stifelman

Submitted to the Program in Media Arts and Sciences,
School of Architecture and Planning, on August 8, 1997,
in partial fulfillment of the requirements
for the degree of Doctor of Philosophy
at the Massachusetts Institute of Technology


This dissertation addresses the problem a listener faces when attempting to capture information presented during a lecture, meeting, interview, or conversation. Listeners must divide their attention between the talker and their notetaking activity. A tape recording can capture exactly what is said and how it is said, but finding information on a tape is time-consuming and often frustrating. This thesis combines user interaction and acoustic processing techniques to enable a listener to quickly access any portion of an audio recording. Audio recordings are structured using two techniques: user structuring based on notetaking activity, and acoustic structuring based on a talker's changes in pitch, pausing, and energy. By bringing audio interaction techniques together with discourse theory and acoustic processing, this dissertation defines a new approach for navigation in the audio domain.

The first phase of research involved the design, implementation, testing, and use of the Audio Notebook. The Audio Notebook combines the familiarity of taking notes with paper and pen with the advantages of an audio recording. This device augments an ordinary paper notebook, synchronizing the user's handwritten notes with a digital audio recording. The user's natural activity (writing and turning pages) implicitly indexes and structures the audio for later retrieval. Interaction techniques were developed for spatial and time-based navigation through the audio recordings. Several students and reporters were observed using the Audio Notebook during a five-month field study. The study showed that the interaction techniques enabled a range of usage styles, from detailed review to high-speed skimming of the audio. The study also pointed out areas where additional information was needed to improve the correlation between the user's notes and audio recordings, and to suggest structure where little or none was generated by the user's activity.
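The idea of notes implicitly indexing the audio can be illustrated with a minimal sketch. This is not the Audio Notebook's implementation; the class `NoteIndex`, its methods, and the representation of a pen stroke as a page number, a vertical position, and an audio time in seconds are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class NoteIndex:
    """Hypothetical index from pen strokes to positions in a recording."""

    # page number -> list of (y_position, audio_time_seconds), in writing order
    strokes: dict = field(default_factory=dict)

    def record(self, page, y, audio_time):
        """Store the audio time at which a stroke was written on a page."""
        self.strokes.setdefault(page, []).append((y, audio_time))

    def lookup(self, page, y):
        """Return the audio time of the stroke closest to y on the page.

        Returns None when nothing was written on that page.
        """
        entries = self.strokes.get(page)
        if not entries:
            return None
        _, audio_time = min(entries, key=lambda entry: abs(entry[0] - y))
        return audio_time


index = NoteIndex()
index.record(page=1, y=10.0, audio_time=5.0)
index.record(page=1, y=50.0, audio_time=42.0)
index.record(page=2, y=10.0, audio_time=90.0)

# Selecting near the second note on page 1 yields the time it was written.
start_time = index.lookup(page=1, y=45.0)
```

In this sketch a page turn simply selects a different list of strokes, which is one plausible way that writing and page turns together could structure the recording.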

In the second phase of research, an acoustic study of discourse structure was performed using a small multi-speaker corpus of lectures. Based on this study, acoustic processing techniques were designed and implemented for predicting the locations of major phrases and discourse segment beginnings. These acoustic structuring techniques were incorporated into the Audio Notebook, creating two new ways of interacting with the audio: audio snap-to-grid and topic suggestions. Using phrase detection, the Audio Notebook "snaps" back to the nearest phrase beginning when users make selections in their notes. Topic suggestions are displayed along an audio scrollbar, providing navigational landmarks for the listener. User activity and acoustic structuring complement each other, allowing listeners to quickly and easily navigate through an audio recording and locate portions of interest. Thus, rather than replacing real-world objects like paper and pen, we can successfully augment them, combining the advantages of the physical world with the capabilities of digital technology.
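Audio snap-to-grid can be sketched as a search over detected phrase beginnings. The function below is an illustration only: the name `snap_back`, the sorted list of phrase start times in seconds, and the reading of "snaps back" as "the latest phrase beginning at or before the selected time" are assumptions, and the phrase detection itself is not shown.

```python
from bisect import bisect_right


def snap_back(phrase_starts, selected_time):
    """Return the latest phrase start at or before selected_time.

    phrase_starts must be sorted in ascending order (seconds).
    If the selection precedes every detected phrase, fall back to the
    start of the recording (0.0), which is an assumption of this sketch.
    """
    i = bisect_right(phrase_starts, selected_time)
    if i == 0:
        return 0.0
    return phrase_starts[i - 1]


# Hypothetical phrase beginnings produced by some phrase detector.
phrase_starts = [0.0, 3.2, 7.5, 12.1]

# A selection that maps to 9.0 s starts playback at the phrase beginning 7.5 s.
playback_start = snap_back(phrase_starts, 9.0)
```

Starting playback at a phrase boundary rather than at the raw selection time avoids beginning in the middle of a phrase, which is the benefit the snap-to-grid description suggests.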
