Reference: MIT Media Laboratory Technical Report. May, 1993.
This work was completed as a final project for Computational Models of Discourse taught by Prof. Barbara Grosz.
 
 

User Repairs of Speech Recognition Errors:
An Intonational Analysis

Lisa J. Stifelman

Speech Research Group
MIT Media Laboratory
May, 1993

1 Introduction

Many researchers have the goal of creating human-computer spoken interaction that more closely resembles human-human conversation. A serious challenge to this goal is the error-prone nature of current speech recognition technology. Human communication is not error-free either; human conversations are filled with misunderstandings and false starts and stops. An important difference is that humans have the ability to come to a common understanding, while current human-computer spoken interactions provide only limited capabilities, if any, for repair.

Most of the research on repair, from both a psychological and computational standpoint, has concentrated on self-repair (e.g., [9] [19] [21]). Self-repair refers to cases in which speakers correct errors or the appropriateness of their own speech. Some research has also addressed "other-repair," cases in which the other conversational participant attempts to repair the speaker's error [26]. However, very little research has addressed repairs that involve the speaker correcting a mistake made by the listener in hearing or interpreting what the speaker has said (Figure 2). This is a problem that occurs frequently in human-computer spoken interactions: the computer (hearer) does not correctly hear, interpret, or take action on the words spoken by the user (speaker).

There are two approaches for dealing with speech recognition errors in human-computer dialogues. One approach is to attempt to automatically detect and correct errors in the recognized speech through syntactic, semantic, pragmatic, and discourse level analyses (e.g., [27] [31] [32]). The problem is that even if speech recognition reached the accuracy level of human recognition, errors would still occur. Therefore, the second approach to the problem involves providing the ability for users to "repair" errors made by the recognizer. In current spoken language interactions, when a speech recognition error occurs, the interaction breaks down severely. The challenge is to "design systems for error" by allowing the user to engage in an interactive dialogue with the system to repair the error, the way they would with a human conversational partner.

Several researchers are currently taking the "repair" approach to the problem of dealing with speech recognition errors [1] [3] [4]. Brennan and Cahn are applying Clark and Schaeffer's contribution model [5][6][7][8] to provide a cognitive and computational architecture for dialogue and repair that could be used in human-computer conversational systems. According to this model, a conversation consists of a series of "presentations" and "acceptances." A conversational participant not only presents an utterance but must receive evidence that it has been accepted by the other conversant. In applying this theory to human-computer spoken interaction, a question arises as to how to detect when a "presentation" has not been accepted, i.e., when a user is attempting to repair an error made by the recognizer.

Shriberg addresses the problem of automatically detecting and correcting self-repairs [28]. Shriberg, unlike previous researchers (e.g., [15]), is working on methods for detecting repairs that do not rely on the presence of an explicit "edit" signal (see section 2.3). This is critical because repairs will not always be accompanied by an "editing expression" [20]. For example, in a data set of self-repairs used by Levelt [19], 43% of the repairs were not marked by an editing expression. Repairs can also be marked prosodically rather than lexically. Shriberg studied the intonational characteristics of self-repairs in order to explore whether this information would be useful for automatic detection.

The same problem that exists for self-repairs occurs for user repairs of speech recognition errors. Once again, repairs do not always contain an editing expression; in the corpus of data used for this study, only 4% of repairs contained editing expressions. This paper reports on an intonational analysis of user repairs of speech recognition errors. The goal of this work is to determine whether repair utterances are marked prosodically, and if so, how these results can be applied for automatic detection. Current spoken language systems respond to user attempts at repair as separate commands, with no understanding of how the user's repair utterance relates to the previous dialogue. Identifying user repair utterances is a step toward supporting the establishment of mutual understanding between user and computer.

2 Background

2.1 Self- and Other-Repair

The term "repair" has generally been used to refer to self-repair or self-correction. "Speakers monitor what they are saying and how they are saying it. When they make a mistake, or express something in a less felicitous way, they may interrupt themselves and make a repair" ([20], p. 458).

Schegloff, Jefferson, and Sacks distinguish the use of the term "correction" from the use of "repair" [26]. "Correction" is said to refer to "the replacement of an 'error' or 'mistake' by what is 'correct'." However, "repairs" are not limited to errors or simple replacement. For example, repairs are often used by a speaker to address the appropriateness of an utterance rather than an error. According to Schegloff et al., the term "repair" refers to this more general domain of occurrences. Also, rather than referring to errors, they use the term "trouble source" or "repairable."

Schegloff discusses both self- and other-repair. Other-repair refers to the repair of disfluencies in the speaker's speech by the hearer. A distinction is made between the process of initiating a repair and its outcome. Repair-initiations can occur at a number of different locations: within the turn that contains the trouble, in the transition between turns, in the very next turn, or in a third turn. The person who initiates a repair isn't always the one who accomplishes it. A repair can be self- or other-initiated, and the outcome can be a self- or other-repair. In addition, once initiated, repairs can also fail.

One interesting point made by Schegloff regards multiple attempts to make a repair. "If more than one other-initiated sequence is needed, the other-initiations are used in order of increasing strength" ([26], p. 369). In the example shown in Figure 1, the other-correction technique of saying "You mean" followed by a possible understanding, used the second time, is considered stronger than the technique of repeating a portion of the trouble source, used the first time.

A: I have a: -- cousin teaches there.
D: Where.
A: Uh; Columbia
D: Columbia?                 <- repeat of trouble source
A: Uh, huh.
D: You mean, Manhattan?      <- "You mean" plus understanding

Figure 1: Example of multiple other-initiated repairs. The "strength" of the repair increases in the second instance (taken from [26]).

2.2 Taxonomy of Repair

This paper addresses user repair of speech recognition errors. These repairs do not fall into either the category of self-repair or other-repair. While other-repair refers to mistakes made by the speaker that are corrected by the hearer, user repair of recognition errors deals with mistakes made by the hearer (in this case the recognizer) that are corrected by the speaker (user of the system). Therefore, it seems a new taxonomy of "repair" is needed to cover all of these types. A proposed taxonomy is given in Figure 2. Commonly referenced types of repair, self- and other-repair, correspond to those errors made by the speaker. The two additional types of repair correspond to errors made by the hearer. The types of repair studied in this paper are in the category of "recognition repairs."
                        Makes Error
                  Speaker           Hearer

Repairs  Speaker  Self-Repair       Recognition Repair
Error    Hearer   Other-Repair

Figure 2: Taxonomy of Repair: This paper addresses "Recognition Repairs" in human-computer dialogues.

2.3 Editing Expressions

Levelt defines three phases of self-repair: interruption of speech, an editing phase, and the repair itself [19]. The editing phase is said to consist of a hesitation pause and in some cases an editing term or editing expression. Examples of non-lexical editing terms are "uh", "ah", and "oh." An example is shown below (taken from [19]):

I saw .. uh .. twelve people at the party

Examples of "overt" or lexical editing expressions used in self-repair are "that is," "rather," and "I mean." For example (taken from [19]):

He hit Mary .. that is .. Bill did.

There are also differences between the editing terms used in repairs of appropriateness versus those for errors. Examples of editing terms used in error repairs are "no" and "sorry." In addition, editing terms occur more frequently in error repairs than in repairs of appropriateness (in the data set used in [19], an editing term was present in only 28% of the appropriateness repairs but in 58% of error repairs).

Editing expressions used in speaker repair of hearer errors have not received as much attention as editing expressions used in self-repair. Brennan and Cahn provide sample dialogues in which repairs are preceded by "No I said" or "No I meant." Figure 3 is an example taken from the data set used in the intonational study reported on by this paper.

System: What date will you be returning on?
User: September twenny-nine
System: Here are the flights for september twenty
User: No I said september twenty nine

Figure 3: Sample dialogue with editing expression. The other editing expressions used in correcting the recognizer were "No" and "I was referring to" (see section 3.2.3).

2.4 Prosody of Self-Repair

"Spontaneous self-corrections in speech pose a communication problem; the speaker must make clear to the listener not only that the original utterance was faulty, but where it was faulty and how the fault is to be corrected. Prosodic marking of correctionsómaking the prosody of the repair noticeably different from that of the original utteranceóoffers a resource which the speaker can exploit to provide the listener with such information" ([21], p. 205). By Cutlerís definition, a repair is considered "marked" if there is a "noticeable" change in pitch, amplitude, and/or duration between the original and the repair utterance [9]. The differences can be either positive or negative, however, a tendency was found for repairs to be of higher pitch, greater intensity, and of longer duration. Cutler considers a repair to be "unmarked" if it does not exhibit any of these changes, even if it is preceded by a pause. In the corpus of self-repairs used by Cutler, only 38% of lexical error corrections were marked [9]. In a subsequent study, 53% of repairs for error were found to be marked [21].

Cutler found several examples of repairs that were unmarked on a first repair attempt that was unsuccessful but marked on the second attempt [9]. This seems to correspond to Schegloff's theory about the increasing "strength" in the selection of strategies for repair. One way of strengthening a repair other than changing its lexical nature (e.g., adding "You mean...") is to mark the repair prosodically. Cutler does not, however, distinguish the degree of marking; marking is treated as a binary characteristic.

Other factors found to affect whether or not a repair is prosodically marked are the type of repair (error vs. appropriateness) and the size of the semantic domain in which the error and repair contrast. According to Levelt and Cutler, repairs tend to be marked more often for errors than for appropriateness [21]. This corresponds to Levelt's finding that editing terms are used more frequently for repairing errors than for appropriateness. By using an editing term and/or prosodically marking a repair, the speaker is attempting to distinguish this information for the listener. In addition, a repair is more likely to be marked if the element in error can be replaced by one of a small set of alternatives (e.g., morning vs. evening). However, when the error and replacement are antonyms, this effect may be due to the degree of opposition rather than simply the number of items.

Levelt and Cutler draw several important conclusions regarding the use of prosody to mark repairs. First, "in the absence of [a] lexical joint between repair and original utterance, the listener may very well use intonational cues to match the repair to the trouble item" ([21], p. 214). Secondly, since repairs were marked more often for corrections of error (53%) than for corrections of appropriateness (19%), it is argued that marking is used to express rejection. Lastly, prosodic marking of repairs was found to be similar to prosodic marking in general (for non-repairs). If this is the case, then it will prove difficult to use prosodic information for automatic detection of such repairs.

2.5 Automatic Detection of Self-Repairs

Shriberg et al. performed a study of the syntactic, semantic, and acoustic characteristics of self-repairs in human-computer dialogue. The goal was to automatically locate (and correct) repairs without relying on an explicit edit signal. The corpus analyzed was composed of spontaneous speech containing self-corrections taken from the ATIS (Air Travel Information System) domain (Figure 4). Most of the data came from Wizard of Oz style sessions, with a small subset collected using SRI's spoken language system. Four techniques were employed in an attempt to detect the repairs: pattern matching, syntactic, semantic, and acoustic analysis. Pattern matching is performed first in order to narrow down the potential set of repairs, and then the other analyses are applied to separate actual repairs from false positives.

(1) show me flights | daily flights
(2) I want to leave | depart before ...
(3) flights for one | one person

Figure 4: Examples of self-repairs analyzed by Shriberg [28]. (Shriberg uses a vertical bar (|) to indicate the location of the repair.)

Acoustic comparisons were performed for patterns of the form in examples (1) and (3) in Figure 4, since comparisons could be made between the matching words. Cue words like "no" (e.g., "I want to go to Boston, no Dallas") were also studied in an attempt to determine prosodic differences between cue and sentential uses of these words (as in [17]). Comparisons were made between repairs and false positives for pauses, duration, and F0 data.

For repairs containing a single repeated word with no intervening insertions, a difference was found for the interval between the matching words (e.g., one <interval> one): the average interval between words was 380 msec for repairs and 42 msec for false positives. A reduction in duration (mean = 53.4 msec) was also found between the first and second instance of the word. F0 data was not found to be helpful in separating repairs from false positives. For repairs containing an insertion and a single repeated word, pauses (silence ≥ 200 msec) generally occurred before insertions for repairs and after for false positives. However, there weren't always pauses before or after the insertion, so cases without leading or following pauses remain ambiguous. Secondly, there was a trend for the average F0 of the inserted word to be higher than that of the preceding word in the repair cases, but lower in the false positives. Lastly, repair and sentential uses of cue words were distinguished by differences in F0 rise/fall, pausing, and lexical stress: repairs exhibited an F0 fall and a preceding or following silent pause, while sentential uses exhibited an F0 rise, lexical stress, and the lack of a preceding or following pause.
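
The following Python sketch illustrates the flavor of this combination of pattern matching and timing information; it is not Shriberg's implementation. Word timestamps are assumed to come from a recognizer or forced alignment, and the 200 msec decision threshold is only an illustrative value lying between the reported averages for false positives (42 msec) and repairs (380 msec).

# Illustrative sketch: flag "word <interval> word" patterns and use the silent
# interval between the matching words to separate likely repairs from false
# positives (e.g., "flights for one ... one person"). Not Shriberg's system.

from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, start_sec, end_sec)

def candidate_repeat_repairs(words: List[Word],
                             interval_threshold: float = 0.200) -> List[int]:
    """Return indices i where words[i] and words[i+1] match lexically and the
    silent interval between them exceeds the threshold."""
    candidates = []
    for i in range(len(words) - 1):
        text1, _, end1 = words[i]
        text2, start2, _ = words[i + 1]
        if text1.lower() == text2.lower() and (start2 - end1) >= interval_threshold:
            candidates.append(i)
    return candidates

# Hypothetical timings for "flights for one ... one person" with a 380 msec gap
words = [("flights", 0.00, 0.45), ("for", 0.46, 0.60), ("one", 0.61, 0.90),
         ("one", 1.28, 1.55), ("person", 1.56, 2.00)]
print(candidate_repeat_repairs(words))  # -> [2]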

It is important to note that Shriberg's use of acoustic analysis was secondary to the application of pattern matching techniques. Acoustic data was used in conjunction with syntactic and semantic analyses to identify false positives in the set of potential repairs identified by pattern matching. Shriberg argues that "acoustics alone cannot tackle the problem of locating repairs, since any prosodic patterns found in repairs will most certainly be found in fluent speech" ([28], p. 4). However, in conjunction with other techniques, Shriberg found acoustic information to be useful.

3 Intonational Analysis

3.1 Introduction

The problem of automatically detecting repairs in human-computer dialogue applies not only to self-repairs but also to user repairs of recognition errors. The purpose of this study, similar to the one performed by Shriberg, was to determine if prosodic information could be used for detecting user repairs of recognition errors. Comparisons are made between an original utterance and a repair utterance spoken after a misrecognition.

A number of hypotheses were made about the intonational characteristics of recognition repairs. Based on findings from studies of the prosody of self-repair, it is expected that the repair utterances (words, phrases, and entire commands) would be spoken at a slower rate, and be higher in energy and pitch than the original. An open question is which quantitative measures are most appropriate: average or peak energy, average F0 or F0 range. Each of these calculations was made so that the measurement yielding the greatest distinctions could be determined. It is expected that F0 range will provide a better measure than the F0 average used by Shriberg.

In addition to duration, energy, and F0 data, how might the accenting and phrasing of the repair phrase differ from that of the original utterance? One hypothesis is that the word(s) being repaired might be marked by intermediate or intonational phrase boundaries [25]. It is also predicted that the repair phrase will be preceded and followed by a larger silence interval than in the original utterance. In terms of accenting, based on Terken and Nooteboom's theory that "new" information is accented and "given" information deaccented, it might be expected that accented words in the original utterance would be deaccented if repeated in the repair utterance [30]. However, even though the words are being repeated, they have not been correctly recognized and so are not already mutually believed. Based on Pierrehumbert and Hirschberg's hypothesis about the relationship between accenting and predication, one would predict that the repeated words would be accented (H* in Pierrehumbert's notation [24]), and therefore not distinguishable from the accenting in the original utterance. This latter hypothesis seems more plausible and intuitive since, from a qualitative standpoint, it makes sense that the repair words would be accented.

Another hypothesis, based on research in self-repair, is that for multiple attempts at correcting the same error, a "stronger" distinction will be made between the original utterance and repair utterance on the second attempt. This is based on Schegloff's theory that the selection of repair techniques chosen by the initiator (in this case the speaker) will increase in strength after a failed repair. In addition, Cutler also found that repairs that were not prosodically marked on a first attempt were marked on a second one.

3.2 Method

3.2.1 Data Collection

The corpus of data used in this study was taken from spontaneous speech data collected by the Spoken Language Systems Group at the Massachusetts Institute of Technology for the ATIS domain [18]. The original experiments involved some user sessions with a "Wizard" and others with MIT's spoken language system. Only those sessions performed using the MIT speech recognition system are used in this analysis.

Transcripts and speech data were obtained for user sessions with the speech recognizer. Each speech recognition error and user strategy for repair was classified and marked in the transcription file for each session (see section 3.2.2). According to Brennan's feedback model, the errors addressed by this study all occurred while in the "attending" state: the recognizer failed to correctly identify some of the words spoken by the user [2].

The speech data used in this analysis was digitized at 16 kHz, 16 bits per sample and recorded using a Sennheiser close-talking microphone.

3.2.2 Data Classification

Figures 5 and 6 show the classifications for recognition errors and user repair strategies used in this study. Note that in cases where multiple attempts are needed to correct a single error, different strategies may be employed in subsequent repair utterances.

Substitution A word or phrase in the user utterance is substituted with a different word or phrase in the grammar. In Example 4.2 of the data set, the word "prices" is misrecognized as the word "classes."

Insertion A word or phrase that was not a part of the actual user utterance is inserted by the speech recognizer. In Example 5.4 of the data set (section 3.2.3), the word "morning" is mistakenly inserted.

Deletion A word or phrase that was a part of the actual user utterance is left out of the recognition result. In Example 1.3 of the data set, the word "nine" is deleted from the result.

Partial Multiple errors (substitutions, insertions, deletions) occur in the recognized output. This includes multiple substitutions or combinations of different error types in some cases. In Example 9.2 of the data set, although not apparent from the system's response, it recognized "How much is flight" but misrecognized the remainder of the utterance.

Complete All or nearly all of the user's utterance has been misrecognized. In Example 2.4 of the data set, the system misrecognizes most of the user's repair utterance.

Figure 5: Classification of speech recognition errors in the speech corpus.
Note that the categorizations used here differ from those defined by the Multi-Site ATIS Data Collection Working Group (MADCOW) [10] [18]. For example, the MADCOW definitions for user responses are: (1) new information, (2) repeat, (3) rephrase, and (4) unevaluable. For the purposes of this study, only repairs were analyzed, so only categories (2) and (3) apply. In addition, these categories are broken down more specifically in order to focus closely on user repair strategies. The MADCOW breakdown for a system "answer" is: (1) correct, (2) incorrect, or (3) partially correct. Again, since this study is focused on recognition errors, only categories (2) and (3) apply. In addition, rather than simply labeling responses as "incorrect," the type of error has been identified.

Exact Repeat The user repeats the entire command using exactly the same wording as the original command. This strategy is used in Example 4.3 of the data set.

Partial Repeat The user repeats only the portion of the original command that was misrecognized. A partial repeat is sometimes preceded by an editing expression. Example 1.4 shows a case in which an editing expression was employed, and Example 2.5 (a second repair attempt) shows the repeated portion alone.

Reword The user reissues the command using a different wording. Some rewords are further classified into those that are simplified, more specific, or broken into multiple queries. Queries that are simplified eliminate any unnecessary words while those that are made more specific tend to have added words.

Figure 6: Classification of repair strategies employed by users.
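
These classifications were assigned by inspecting the session transcripts. Purely for illustration, the single-error categories of Figure 5 can also be thought of in terms of an alignment between what the user said and what the recognizer reported; the rough Python sketch below uses the standard difflib module to produce such an alignment and is not part of the method used in this study.

# Illustrative only: label recognition errors by aligning the reference (what
# the user said) against the hypothesis (what the recognizer reported). The
# errors in this study were classified by hand from the transcripts.

from difflib import SequenceMatcher

def label_errors(reference, hypothesis):
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    labels = []
    for op, i1, i2, j1, j2 in SequenceMatcher(None, ref, hyp).get_opcodes():
        if op == "replace":
            labels.append(("substitution", ref[i1:i2], hyp[j1:j2]))
        elif op == "delete":
            labels.append(("deletion", ref[i1:i2], []))
        elif op == "insert":
            labels.append(("insertion", [], hyp[j1:j2]))
    return labels

# e.g., the substitution in Example 4.2 ("prices" recognized as "classes");
# the hypothesis string here is invented for illustration.
print(label_errors("what are the prices of these flights",
                   "what are the classes of these flights"))
# -> [('substitution', ['prices'], ['classes'])]
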
3.2.3 Speech Corpus

The following are all of the examples analyzed in this study. For a numerical break down of original utterances, repair utterances, and repair strategies, see section 3.2.4.

Example 3 shows one of the three editing expressions found in this data set; in this case, however, the repair was not due to a recognition error. In this example, the user is responding to the information that the return date on the flight is restricted to after September 2, while the user wants to return on August 30. The dialogue has failed in the "intending" state of the Brennan feedback model. One question that arises is whether this should be considered a cue or non-cue use of the word "no." It is not being used to indicate a recognition error, but it could be said to represent an error of "intention."

Example 1

1.1 System What date will you be returning on?  
1.2 User September twenty nine  
1.3 System Here are the continental flights from denver to boston...on Sunday September 20... Deletion
1.4 User No I said September twenty nine Partial repeat with an editing expression

Example 2

2.1 User I'd like the cheapest fare possible going on continental in flight seventeen sixty five  
2.2 System I heard you refer to flight number 76, which I don't recognize in this context. One-way or round-trip? Substitution
2.3 User I was referring to flight seventeen sixty five Partial repeat with an editing expression
2.4 System I heard you refer to flight number 545, which I don't recognize in this context. One-way or round-trip? Complete
2.5 User flight seventeen sixty five round trip Partial repeat 

Example 3

3.1 System I'll book continental flight 1765 from boston to denver for you. Your restrictions require you to return on or after Wednesday September 2 and on or before Thursday October 1...  
3.2 User No I want to leave august thirtieth Editing expression

Example 4

4.1 User What are the prices of these flights?  
4.2 System Here are the fare classes for the continental flights from boston to denver... Substitution
4.3 User What are the prices of these flights? Exact repeat

Example 5

5.1 User Show me flights from pittsburgh to philadelphia tomorrow leaving around six p.m.  
5.2 System Here is the flight from pittsburgh to philadelphia leaving at 6:00 p.m. ... Substitution
5.3 User Show me flights from pittsburgh to philadelphia tomorrow leaving around six p.m. Exact repeat
5.4 System There are no flights from pittsburgh to philadelphia leaving between 5:30 p.m. and 6:30 p.m. in the morning on Tuesday June 30 Insertion
5.5 User Show me the flights leaving from pittsburgh to philadelphia leaving at six p.m. in the evening Reword (more specific)
5.6 System Here is the flight from pittsburgh to philadelphia leaving at 6:00 p.m.  
5.7 User Show me other flights leaving around six p.m.  
5.8 System I'll show it to you again (system got "at" instead of "around" again) Substitution
5.9 User Show me other flights leaving around six p.m. Exact repeat 

Example 6

6.1 User What is the return date on that flight?  
6.2 System Here is the departure time for...continental flight from boston to denver... Partial error
6.3 User What is the latest possible return date? Reword
6.4 System Here is the latest round trip continental flight from boston to denver ... Substitution
6.5 User What is the return date on that flight? Exact repeat 

 

Example 7

7.1 User What meal is served on this flight?  
7.2 System Here is the airline for the flight from atlanta to baltimore... Partial error
7.3 User Will there be food on the flight? Reword
7.4 System Here are the fares for the flights from atlanta to baltimore... Substitution
7.5 User What type of meal will be served on this flight? Reword

Example 8

8.1 User Which of the flights serves breakfast?  
8.2 System I'll show them to you again Complete
8.3 User Can you tell me which of those flights serve breakfast? Reword

Example 9

9.1 User How much is flight nineteen eighty four?  
9.2 System I'll show them to you again Partial error
9.3 User How much is flight one nine eight four? Reword

3.2.4 Data Distribution

There were a total of 45 user sessions with the speech recognition system, containing a total of 388 user commands. There were 103 misrecognitions (of the types described in section 3.2.2). This only includes errors in which the recognizer incorrectly identified some or all of the words in the user's utterance. Errors in interpretation (e.g., reference resolution) and intention (e.g., the action taken by the system was not the one intended by the user) are not included. In addition, an attempt was made to exclude those errors that the system was able to recover from using domain knowledge or by detecting the missing information and querying the user. The breakdown of error types is shown in Figure 7.


Error Type      Number of Occurrences
Substitutions   44
Insertions      5
Deletions       6
Partial         32
Complete        16
Total           103

Figure 7: Breakdown of the distribution of recognition errors.

There were 67 user repair utterances. The total number of repair utterances does not equal the total number of errors, because some errors were either undetected or ignored by the user, and in some instances the session would end on an error without the user completing the task. The breakdown of repair strategies is shown in Figure 8. Note that reworded utterances may contain some of the same wording as the original utterance, but have been modified by an addition or deletion of words.

Repair Strategy                                Number of Occurrences
Exact repeat                                   8
Partial repeat with an editing expression      2
Partial repeat without an editing expression   5
Reword                                         52
Total                                          67

Figure 8: Breakdown of the distribution of repair strategies.

A portion of the 67 repair utterances and their associated original utterances were selected for this analysis (examples given in section 3.2.3). A total of 20 utterances were analyzed: 8 original utterances and 12 repair utterances. The examples are composed of 5 pairs (original utterance, repair utterance), 3 sets of three utterances (original utterance, first repair attempt, second repair attempt), and 1 repair that did not have an associated original utterance. The breakdown of repair strategies analyzed is shown in Figure 9. These repairs are shown in context in section 3.2.3.

Repair Strategy                                Number of Occurrences
Exact repeat                                   4
Editing expressions (2 with partial repeats)   3
Partial repeat without an editing expression   1
Reword                                         4
Total                                          12

Figure 9: Breakdown of the distribution of strategies used in the repair utterances analyzed.

It is also interesting to note that editing expressions were used in only 4.4% of all repair utterances (the three editing expressions selected for analysis were the only editing expressions found in the entire speech corpus of 67 repair utterances). It is unclear whether this is due to the user's a priori perception of what they can say to the recognition system, or whether the user has adapted in some way [29].

3.2.5 Analysis Techniques

The speech data was analyzed using the Entropic Waves software package. First, pitch tracks and energy values were calculated and the data was labeled using Waves. Word, repair, and editing expression onsets and offsets were labeled by inspection of the waveform and the spectrogram across the time interval in question. Accenting and phrasing were also marked using Pierrehumbert's theory of English intonation [24] [25]. On a first pass, words were labeled only as accented or deaccented (*/-); accenting was not differentiated across the six tones defined by Pierrehumbert. In addition, intermediate and intonational phrase boundaries were marked but phrase accent and boundary tones were not identified. On a second pass, an attempt was made to specify pitch accents and phrase and boundary tones. However, it is important to note that the author is inexperienced at making these judgments and further analysis would be needed to validate these results. For cases in which the presence of a phrase boundary was questionable, a second person was called upon to review the data.

Once the data was labeled, label files and Waves parameter files containing F0 and energy values were analyzed to make the following calculations: rate (wpm), duration (msec), peak RMS energy, average RMS energy, F0 minimum, F0 maximum, F0 range, F0 average, and pause lengths between words. These calculations were made for each word, repair phrase, editing expression, and across the entire utterance. Mean and maximum pause lengths were calculated across repair words, editing expressions, and entire utterances. For the F0 calculations, only values with a probability of voicing greater than 0.2 were used. Reports were generated for each of the 20 utterances displaying these calculations as well as accenting and phrasing information. Differences between the original utterance and repair utterance were then calculated and summarized.
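
A minimal sketch of these per-word calculations is given below (in Python), assuming the label and parameter files have already been parsed into simple structures: each word as (text, start, end) in seconds, and each analysis frame as (time, F0, probability of voicing, RMS). The function and type names are illustrative and do not correspond to the Entropic Waves API.

# Sketch of the per-word measurements described above, assuming frames of
# (time_sec, f0_hz, prob_voicing, rms) and word boundaries in seconds have
# already been read from the label and parameter files. Names are illustrative.

from dataclasses import dataclass
from typing import List, Tuple

Frame = Tuple[float, float, float, float]  # (time, F0, prob_voicing, RMS)

@dataclass
class WordMeasures:
    word: str
    duration_ms: float
    rate_wpm: float
    peak_rms: float
    avg_rms: float
    f0_min: float
    f0_max: float
    f0_range: float
    f0_avg: float

def measure_word(word: str, start: float, end: float,
                 frames: List[Frame], voicing_threshold: float = 0.2) -> WordMeasures:
    span = [f for f in frames if start <= f[0] <= end]
    rms = [f[3] for f in span]
    # Only F0 values with probability of voicing greater than 0.2 are used.
    f0 = [f[1] for f in span if f[2] > voicing_threshold]
    duration = end - start
    return WordMeasures(
        word=word,
        duration_ms=duration * 1000.0,
        rate_wpm=(60.0 / duration) if duration > 0 else 0.0,  # one word per interval
        peak_rms=max(rms) if rms else 0.0,
        avg_rms=sum(rms) / len(rms) if rms else 0.0,
        f0_min=min(f0) if f0 else 0.0,
        f0_max=max(f0) if f0 else 0.0,
        f0_range=(max(f0) - min(f0)) if f0 else 0.0,
        f0_avg=(sum(f0) / len(f0)) if f0 else 0.0,
    )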

Qualitative judgments were also made in terms of the ability to "hear" the differences between the original and repair utterances.

3.3 Results

Sections 3.3.1-3.3.3 discuss analysis within each repair strategy and sections 3.3.4-3.3.6 discuss analysis across repair strategies for different groups of data. For each repair strategy, comparisons are made across the repaired portions of the commands alone (sections titled Analysis of Repair Words) and over the entire command (sections titled Analysis of Repair Commands).

3.3.1 Exact Repeats

Four exact repeats were compared with the associated original utterance: Examples 4.3, 5.3, 5.9, and 6.5. Example 6.5 represents a second attempt at correcting the recognition error (a different repair strategy was used on the first attempt). In Example 5, two exact repeats occur in the same session within close succession of one another (see section 3.3.7).

3.3.1.1 Analysis of Repair Words

First, an analysis was performed on the specific word(s) being repaired within the repeated utterance. Each of the exact repeats analyzed was spoken in response to a substitution error. Figure 10 shows the original words spoken by the user and the words substituted by the recognizer. Note that in Example 5, the recognizer has made the same mistake twice. Example 6 is not quite parallel to Examples 4 and 5 since multiple words have been misrecognized, and the mapping of the original to the substituted words is less clear given the system feedback in 6.4.

Exact Repeat   Word(s) spoken by user   Word(s) substituted
4.1-4.3        prices                   classes
5.1-5.3        around                   at
5.7-5.9        around                   at
6.1-6.5        return date              between Denver

Figure 10: Summary of the repair words analyzed for exact repeat utterances.

The most consistent differences were found between the duration of the words the first time spoken (original utterance) and the second time (repair utterance) (e.g., "prices" in 4.1 versus "prices" in 4.3). Differences in duration and word rate are given in Figure 11. Note that differences in duration are valid in this case since the words are repeated exactly and in the same context. However, these differences are normalized by calculating word rate and percent differences. The duration of the words increased by an average of 51.4%, corresponding to a decrease in speaking rate of 52.9 wpm.
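
As a concrete illustration of this normalization, the small Python sketch below computes the Δ duration, percent change, and Δ rate columns that appear in Figure 11; the absolute durations used in the example call are back-calculated from the differences reported there for "prices" and are only illustrative.

# Sketch of the comparison between the original and repeated tokens of a
# repair word. The example durations are back-calculated from the "prices"
# row of Figure 11 (Δ duration = 258 msec, 63.4% change) for illustration.

def duration_and_rate_differences(orig_ms, repair_ms):
    delta_ms = repair_ms - orig_ms                        # Δ duration (msec)
    pct_change = 100.0 * delta_ms / orig_ms               # % change in duration
    delta_wpm = 60000.0 / repair_ms - 60000.0 / orig_ms   # Δ rate (wpm), single word
    return delta_ms, pct_change, delta_wpm

print(duration_and_rate_differences(407.0, 665.0))
# -> approximately (258.0, 63.4, -57.2)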

Exact Repeat   Word(s) repeated   Δ Duration (msec)   %Change in Duration   Δ Rate (wpm)
4.1-4.3        prices             258                 63.4                  -57.2
5.1-5.3        around             99                  33.1                  -49.9
5.7-5.9        around             320                 88.2                  -77.4
6.1-6.5        return date        165                 20.9                  -27.0
mean           --                 210                 51.4                  -52.9
sd             --                 97                  30.3                  20.8

Figure 11: Duration and rate differences between the exactly repeated repair words and the original.

Differences were also calculated for peak and average RMS energy, F0 range, and F0 average (Figures 12-13). Among these measures, a consistent direction of change was only found for F0 minimum and F0 average. Peak energy had a tendency to increase, while average energy tended to decrease. The decrease in average energy could be due, in part, to the overall increase in duration. In addition, the increase in energy may be limited to the stressed syllable only. There was a decrease in F0 minimum (mean = -15.5) and F0 average (mean = -4.1) in all cases. F0 range did not show a consistent direction of change. However, the differences in F0 range constitute an average change of 176.2%, whereas the F0 average constitutes only 3.5%.

Exact Repeat   Word(s) repeated   Δ Peak RMS   Δ Avg RMS
4.1-4.3        prices             -356.7       -677.9
5.1-5.3        around             62.0         -345.8
5.7-5.9        around             53.0         -733.1
6.1-6.5        return date        645.3        171.7
mean           --                 100.9        -396.3
sd             --                 412.1        415.5

Figure 12: Energy differences between exactly repeated repair words and the original.

Exact Repeat   Word(s) repeated   Δ F0 min   Δ F0 max   Δ F0 range   Δ F0 avg
4.1-4.3        prices             -11.2      10.4       21.5         -9.8
5.1-5.3        around             -0.9       -6.8       -5.9         -2.4
5.7-5.9        around             -35.2      13.1       48.3         -1.4
6.1-6.5        return date        -14.7      -25.5      -10.7        -2.8
mean           --                 -15.5      -2.2       13.3         -4.1
sd             --                 14.4       17.9       27.3         3.8

Figure 13: F0 differences between exactly repeated repair words and the original.

3.3.1.2 Analysis of Repair Commands

An analysis was next performed for the entire repeated command, comparing the first instance to the second. Again, the strongest differences were found for speaking rate (Figure 14). The overall speaking rate for the repeated utterance decreases by an average of 27.8 wpm (a mean increase in duration of 16%). However, there is one case in which the speaking rate increases. In this case, the first instance of the command was spoken at an unusually slow rate, given that it was the first command in the session.


Exact Repeat   Δ Duration (msec)   %Change in Duration   Δ Rate (wpm)
4.1-4.3        683                 39.1                  -68.0
5.1-5.3        -621                -12.9                 22.0
5.7-5.9        576                 24.7                  -40.0
6.1-6.5        301                 13.2                  -25.0
mean           235                 16.0                  -27.8
sd             593                 22.0                  37.7

Figure 14: Speaking rate differences between the exactly repeated command and the original. These measures are calculated across the entire command.

Energy and F0 differences between the original and repeated utterance, calculated across the entire command, are given in Figures 15 and 16. The differences in peak and average energy are not consistent in the direction of change, but as found at the word level, peak energy tends to increase in most cases while average energy tends to decrease, again perhaps due to the overall lengthening of the command.

The only measure with a consistent direction of change is average F0 (mean = -3.9 Hz), as was found at the word level of analysis. Although the direction of change is not consistent, the average difference for F0 range (111.3%) is larger than that for F0 average (3.4%).


Exact Repeat   Δ Peak RMS   Δ Avg RMS
4.1-4.3        322.5        -207.9
5.1-5.3        466.8        -173.7
5.7-5.9        -2269.4      -443.7
6.1-6.5        432.6        216.8
mean           -261.9       -152.1
sd             1339.8       273.7

Figure 15: Energy differences between the exactly repeated command and the original. These measures are calculated across the entire command.

Exact Repeat   Δ F0 min   Δ F0 max   Δ F0 range   Δ F0 avg
4.1-4.3        -7.1       -0.3       6.7          -5.9
5.1-5.3        4.7        -23.5      -28.1        -5.7
5.7-5.9        -29.0      226.9      256.0        -1.5
6.1-6.5        0.8        -25.5      -26.2        -2.4
mean           -7.7       44.4       52.1         -3.9
sd             15.1       122.2      136.9        2.3

Figure 16: F0 differences between the exactly repeated command and the original. These measures are calculated across the entire command.

3.3.1.3 Accents, Phrasing, and Pause Data

On preliminary analysis, pitch accents do not appear to be useful for differentiating the original utterance from the exactly repeated repair utterance. For example, "prices" is accented using H* in both the original and repeated utterances. The difference is in the extent of the emphasis on the word being repaired and this is indicated in part by an increase in duration.

Phrasing differs between the original and repeated utterance in some cases. In two examples (4.3 and 5.9), the word being repaired is followed by an intonational phrase boundary in the repeated utterance but not in the original. There also appears to be a weak intermediate phrase boundary preceding the repair words in the repeated utterance. Further analysis of more examples would be necessary to determine if repair words tend to be contained in a separate intermediate or intonational phrase. This finding could indicate whether phrasing is used to mark the portion of the original utterance that was misrecognized by the system.

Overall, mean and maximum pause lengths did not differ noticeably between the original and repeated utterance. Silence intervals before and after repair words increased only slightly from the original to the repeated utterance (the pause before increased by an average of 19 msec and after by 55 msec).

3.3.2 Partial Repeats

Three partial repeats, repairs in which the misrecognized portion of the original command is repeated, were analyzed (Examples 1.4, 2.3, and 2.5). In two of the three cases, the repeated portion is preceded by an editing expression ("No I said" in Example 1.4 and "I was referring to" in Example 2.3). In addition, the two repair utterances in Example 2 represent first (2.3) and second (2.5) attempts at repair. In this case, the same repair strategy is employed on both attempts, except that on the first attempt an editing expression is employed (see section 3.3.7).

3.3.2.1 Analysis of Repair Words

This section presents an analysis performed on the repeated portion of the utterance. The repeated portion alone is compared between the original and repair utterance. Note that in Example 2.5, since no editing expression is used, the repair portion of the utterance is equal to the entire utterance. Figure 17 shows each of the repeated portions of the utterances for the examples analyzed.

Partial Repeat   Repeated Words                Editing Expression
1.2-1.4          september twenty nine         No I said
2.1-2.3          flight seventeen sixty five   I was referring to
2.1-2.5          flight seventeen sixty five   --

Figure 17: Summary of the repair portions analyzed for partial repeat utterances.

As in the exact repeat cases, an analysis of partial repeats across the repeated portion alone indicates a consistent decrease in speaking rate from the first to the second instance (mean = -33.7 wpm), also corresponding to an average increase in duration of 27.6%. However, these differences are smaller than those found for the exact repeats.
Partial Repeat   Repeated Words                Δ Duration (msec)   %Change in Duration   Δ Rate (wpm)
1.2-1.4          september twenty nine         444                 42.0                  -50.0
2.1-2.3          flight seventeen sixty five   378                 23.0                  -28.0
2.1-2.5          flight seventeen sixty five   293                 17.8                  -23.0
mean             --                            372                 27.6                  -33.7
sd               --                            76                  12.8                  14.4

Figure 18: Duration and rate differences between the repeated words and the original.

For the partial repeats, the direction of change in peak and average energy is not consistent. While both increase for the two partial repeats in Example 2, they show a decrease in Example 1.4. This differs from the data for exact repeats, in which peak and average energy differences tended to be in opposite directions (peak energy increasing and average energy decreasing).
Partial Repeat   Repeated Words                Δ Peak RMS   Δ Avg RMS
1.2-1.4          september twenty nine         -342.7       -323.4
2.1-2.3          flight seventeen sixty five   1439.6       329.4
2.1-2.5          flight seventeen sixty five   690.2        129.4
mean             --                            595.7        45.1
sd               --                            894.9        334.5

Figure 19: Energy differences between the repeated words and the original.

The directions of change for F0 maximum, F0 range, and F0 average are consistent for the partial repeat cases. There is a tendency for the F0 maximum and F0 range to decrease while the F0 average tends to increase. There is an average compression in F0 range of 62.3 Hz between the first and second time the words were spoken. This corresponds to an average decrease of 31.3%. The change in F0 average constitutes less than a 1% increase.
Partial Repeat   Repeated Words                Δ F0 min   Δ F0 max   Δ F0 range   Δ F0 avg
1.2-1.4          september twenty nine         -3.5       -32.9      -29.4        2.4
2.1-2.3          flight seventeen sixty five   82.5       -25.2      -107.7       4.3
2.1-2.5          flight seventeen sixty five   11.1       -38.6      -49.7        12.1
mean             --                            30.0       -32.2      -62.3        6.3
sd               --                            46.0       6.7        40.6         5.1

Figure 20: F0 differences between the repeated words and the original.

3.3.2.2 Analysis of Repair Commands

This section presents comparisons between utterances containing partial repeats and the associated original utterances. Comparisons are performed across the entire command, which includes an editing expression in two of the three cases.

Figure 21 shows the differences in overall speaking rate between the original and repair utterance. The differences in speaking rate vary quite a bit, from a 7.0 wpm decrease to a 29.0 wpm increase. This is in contrast to comparisons made for the partial repeat portions alone, which all exhibited a decrease in word rate. The repeated portion of Example 2.3 showed a decrease in speaking rate of 28 wpm, while the comparison across the entire command results in the exact opposite effect: an increase of 29 wpm. A closer analysis reveals that this difference is due to the fast rate at which the editing expression is spoken. The absolute speaking rate for the repeated portion of the repair utterance is 118 wpm, while the editing expression is spoken at over twice that rate, 271 wpm. This high word rate for the editing expression "I was referring to" is due to the shortening of the word "to" (a clitic). This is also the case for Example 1.4: the repair portion of the command is spoken at 120 wpm while the editing expression "No I said" is spoken at a rate of 262 wpm. This could also be an artifact of comparing word rates over small portions of utterances. Example 2.5 contains no editing expression.

Partial Repeat   Δ Rate (wpm)
1.2-1.4          -7.0
2.1-2.3          29.0
2.1-2.5          -53.0
mean             -10.3
sd               41.1

Figure 21: Speaking rate differences between the repair utterance containing a partial repeat and the original command. These measures are calculated across the entire command.

A comparison of energy values reveals an increase in peak and average energy for the repair utterances containing editing expressions, but a decrease for the one without an editing expression. An analysis of F0 values shows a compression in range (mean = -58.2 Hz) for each of the three examples, although the amount is variable (sd = 62.7). This represents an average decrease in F0 range of 31.1%.

Note that since only part of the command is repeated, comparisons shown below (Figures 22-23) are between two different utterances (e.g., "september twenty nine" compared to "no I said september twenty nine"). The value of these comparisons is questionable since the difference in wording between the original utterance and repair command represents a confounding factor.

Partial Repeat   Δ Peak RMS   Δ Avg RMS
1.2-1.4          233.8        36.9
2.1-2.3          2218.7       449.4
2.1-2.5          -382.4       -221.3
mean             690.0        88.3
sd               1359.2       338.3

Figure 22: Energy differences between the repair utterance containing a partial repeat and the original command. These measures are calculated across the entire command.
Partial Repeat   Δ F0 min   Δ F0 max   Δ F0 range   Δ F0 avg
1.2-1.4          -3.5       -26.3      -22.8        1.8
2.1-2.3          95.0       -35.6      -130.6       2.4
2.1-2.5          6.4        -14.8      -21.2        -4.4
mean             32.6       -25.6      -58.2        -0.1
sd               54.2       10.4       62.7         3.8

Figure 23: F0 differences between the repair utterance containing a partial repeat and the original command. These measures are calculated across the entire command.

3.3.2.3 Accents, Phrasing, and Pause Data

Once again, pitch accents were not found to be a useful metric for distinguishing the original from the repair utterance containing a partial repeat. In the two cases in which editing expressions were used, only one case had a weak intermediate phrase boundary between the editing expression and the repeated portion of the repair (following "No I said" in Example 1.4). No noticeable differences were found for pauses before and after the repair portions of the utterance, or for overall mean pause length.

3.3.3 Rewords

Four repairs using a reword strategy were analyzed: Examples 6.3, 7.3, 7.5, and 9.3. In Example 6.3, the words "latest possible" are added as a description of "return date." In addition, note the use of ellipsis: the words "on that flight" have been eliminated in the reword. In Example 7, the misrecognized portion of the original command, "What meal is served," is first reworded to "Will there be food" and in a second attempt to "What type of meal will be served." In Example 9.3, the reworded utterance 'spells out' the misrecognized flight number, one digit at a time.

Note that in each of these examples, the portion of the original utterance that has been misrecognized is reworded while the remainder of the utterance remains similar or identical to the original. In Example 7, the initial portion of the utterance is always followed with "on the/this flight," and in Example 9, both utterances start with "How much is flight."

3.3.3.1 Analysis of Repair Words

An analysis was performed to compare the original and reworded portions of the commands (Figure 24). The reworded portion was chosen for comparison since this represents the portion of the phrase that was misrecognized and that the user is attempting to repair.

Reword    Reworded Portion                   Original Portion
6.1-6.3   latest possible return date        return date on that flight
7.1-7.3   Will there be food                 What meal is served
7.1-7.5   What type of meal will be served   What meal is served
9.1-9.3   one nine eight four                nineteen eighty four

Figure 24: Summary of reword examples analyzed. Only the reworded portions of the commands are given in this table. For the complete commands, see section 3.2.3.

In three of the four cases, the speaking rate decreases, as in the analyses of repair portions of partial and exact repeat examples (Figure 25).
Reword    Δ Rate (wpm)
6.1-6.3   -25.0
7.1-7.3   31.0
7.1-7.5   -78.0
9.1-9.3   -28.0
mean      -25.0
sd        44.5

Figure 25: Speaking rate differences between the original and reworded portions of the command alone.

Energy and F0 comparison data is given in Figures 26-27. Note that the value of these comparisons is questionable since the difference in wording between the original utterance and repair command represents a confounding factor.
Reword    Δ Peak RMS   Δ Avg RMS
6.1-6.3   -1420.7      -658.9
7.1-7.3   -332.6       892.7
7.1-7.5   286.0        -440.3
9.1-9.3   2089.8       343.1
mean      155.6        34.1
sd        1469.8       716.0

Figure 26: Energy differences between the original and reworded portions of the command alone.
Reword    Δ F0 min   Δ F0 max   Δ F0 range   Δ F0 avg
6.1-6.3   -32.1      -60.1      -27.9        -18.9
7.1-7.3   5.0        -7.0       -12.1        3.9
7.1-7.5   -17.9      55.1       73.0         -7.6
9.1-9.3   -13.1      167.6      180.6        6.9
mean      -14.5      38.9       53.4         -3.9
sd        15.3       97.9       95.7         11.8

Figure 27: F0 differences between the original and reworded portions of the command alone.

3.3.3.2 Analysis of Repair Commands

For each of the four reword examples, comparisons were made between the initial utterance and the reworded utterance across the entire command. Figure 28 shows the differences in speaking rate between the original and reworded command. As with the repeated utterances, the speaking rate tends to decrease for the repair utterance. For reworded repairs, there is an average decrease of 45.3 wpm. However, there is a large standard deviation since one of the examples shows a slight increase in rate and another a very large decrease in speaking rate.

Reword    Δ Rate (wpm)
6.1-6.3   -60.0
7.1-7.3   11.0
7.1-7.5   -100.0
9.1-9.3   -32.0
mean      -45.3
sd        46.7

Figure 28: Speaking rate differences between the original and reworded command. These measures are calculated across the entire command.

Energy and F0 comparison data is given in Figures 29-30. Again, given that these are comparisons between reworded rather than repeated utterances, the value of this data is questionable.
Reword    Δ Peak RMS   Δ Avg RMS
6.1-6.3   380.7        -113.2
7.1-7.3   -332.6       865.5
7.1-7.5   286.0        -323.1
9.1-9.3   -40.9        230.6
mean      73.3         164.95
sd        325.3        519.8

Figure 29: Energy differences between the original and reworded command. These measures are calculated across the entire command.
Reword    Δ F0 min   Δ F0 max   Δ F0 range   Δ F0 avg
6.1-6.3   1.7        -29.8      -31.4        -4.4
7.1-7.3   12.7       1.1        -11.6        11.3
7.1-7.5   -7.9       55.1       63.0         -5.1
9.1-9.3   -13.1      163.3      176.4        3.1
mean      -1.65      47.4       49.1         1.2
sd        11.4       84.8       94.1         7.7

Figure 30: F0 differences between the original and reworded command. These measures are calculated across the entire command.

3.3.3.3 Accents, Phrasing, and Pause Data

Three of the four reworded utterances show differences in phrasing between the original and reworded utterance (Examples 7.3, 7.5, and 9.3). In these examples the reworded portion of the utterance (shown in Figure 24) is followed by an intonational phrase boundary. In addition, each of the reworded portions in these examples may be preceded by an intermediate phrase accent, although the evidence is not strong enough to be conclusive. The most pronounced differences in phrasing between the original and the repair utterance are in Example 7. In this example, the repair utterance is broken into three intonational phrases ("what type of meal..." "will be served..." "on this flight") separated by LL% phrase accents and boundary tones. The original command is composed of a single intonational phrase, possibly containing an intermediate phrase accent after the first word "what."

Accenting and pause data were not found to be useful metrics of change; however, one of the intonational phrase boundaries following a repair coincided with a large pause (750 msec).

3.3.4 Summary Across Repair Strategies

The differences between original and repair utterances appear to be strongest for speaking rate (Figures 31-32). These differences were found at both the local (comparison across the repaired words only) and at the global level (comparisons across the entire command). The decrease in word rate is higher at the local than the global level for the repeat repair strategies. This may indicate that the user employs intonational changes more for the particular words being repaired than for other parts of a repair utterance.

The reworded repair utterances show a large decrease in speaking rate (mean = -45.3 wpm) when compared to the original. This may be due to the differing number of words in the original and repair utterance (whereas exact repeats have the same number of words as the original). Perhaps syllables per second would be a better measure than words per minute (this measure was used by Hirschberg and Grosz [16]). Another measure that can be used is "speech activity." This is a measure of the duration of speech over the total duration (speech plus silence) [14].
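
For reference, both alternative measures are simple to compute from the same word-boundary labels; the Python sketch below is only illustrative, and the syllable counts would have to come from a lexicon or from hand labeling.

# Two alternative rate measures: syllables per second, and "speech activity"
# (speech time over total time, including silence). Illustrative sketch only.

from typing import List, Tuple

LabeledWord = Tuple[str, float, float, int]  # (text, start_sec, end_sec, n_syllables)

def syllables_per_second(words: List[LabeledWord]) -> float:
    total_syllables = sum(w[3] for w in words)
    total_time = words[-1][2] - words[0][1]  # utterance span, pauses included
    return total_syllables / total_time if total_time > 0 else 0.0

def speech_activity(words: List[LabeledWord]) -> float:
    speech_time = sum(end - start for _, start, end, _ in words)
    total_time = words[-1][2] - words[0][1]
    return speech_time / total_time if total_time > 0 else 0.0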

There appears to be an interaction effect between the level of comparison (words vs. commands) and the repair strategy (exact vs. partial vs. reword). However, this is most likely an artifact of using a words per minute rate measure in comparing reworded utterances. Looking at only the repeats, no significant difference was found between the average decrease in rate for repair words alone (mean = -44.6) and the average decrease for entire repair commands (mean = -20.3) (t = -1.71, df = 6, p < 0.14).
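
This paired comparison should be roughly reproducible from the per-example rate differences reported in Figures 11, 14, 18, and 21; the sketch below assumes the SciPy library is available (it was not part of the original analysis).

# Paired t-test over the seven repeat repairs (4 exact + 3 partial), using the
# Δ rate values reported in Figures 11, 14, 18, and 21. SciPy is assumed.

from scipy import stats

delta_rate_words = [-57.2, -49.9, -77.4, -27.0, -50.0, -28.0, -23.0]
delta_rate_commands = [-68.0, 22.0, -40.0, -25.0, -7.0, 29.0, -53.0]

t, p = stats.ttest_rel(delta_rate_words, delta_rate_commands)
print(f"t = {t:.2f}, df = {len(delta_rate_words) - 1}, p = {p:.2f}")
# Expected to roughly reproduce the reported t = -1.71, df = 6, p < 0.14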

Strategy         Mean Δ Rate, Repair Words   Mean Δ Rate, Repair Commands
Exact repeat     -52.9                       -27.8
Partial repeat   -33.7                       -10.3
Reword           -25.0                       -45.3
mean             -37.2                       -27.8
sd               14.3                        17.5

Figure 31: Summary across repair strategies for rate differences at the word and command level.
Figure 32: Plot of mean speaking rate differences across repair strategies, for comparisons made across repaired words only and entire repair commands.

Differences were also found in F0 range for partial repeats (Figure 33). A compression in F0 range was found for comparisons performed across the repeated portion alone and across the entire command. Again, these differences are slightly larger for the comparisons across the specific portion of the command being repaired. The direction of change in F0 range was not consistent for exact repeat and reword strategies.
Partial Repeats

Comparison Across   Mean Δ F0 range (Hz)
Repair Words        -62.3
Repair Command      -58.2
mean                -60.3
sd                  2.9

Figure 33: Summary of differences found in F0 range for partial repeats.

3.3.5 First vs. Second Repair Attempts

In four of the example dialogues in the data set, the user makes multiple attempts at correcting a recognition error. In some cases the user employs the same repair strategy in each attempt, while in others a different strategy is used in the second attempt. Figure 34 gives a summary of repair strategy use for first and second attempts. Note that Example 5 differs from the other examples in that the second attempt represents the second time the user is correcting the same error ("at" vs. "around") within the same session (5.1-5.3 and 5.7-5.9).

Example   First Attempt                            Second Attempt
2         Partial repeat with editing expression   Partial repeat without editing expression
5         Exact repeat                             Exact repeat
6         Reword                                   Exact repeat
7         Reword                                   Reword

Figure 34: Repair strategies for first and second attempts at correcting an error.

The question is whether or not the differences between the original and repair utterances are greater for the second attempt at repair than for the first. Repair strategies can be ordered based on their apparent "strength" in marking a correction ([26] provides such an ordering for other-repair techniques). From strongest to weakest in terms of providing cues for the listener, based on intuition, this ordering might be: partial repeat with an editing expression, exact repeat, partial repeat without an editing expression, and reword. Given this ordering, qualitatively, the "strength" of the user's selection of repair strategy increases from the first to the second attempt in Example 6, decreases in Example 2, and remains the same in the remaining examples.

A quantitative comparison of first and second repair attempts is given in Figures 35 and 36. A comparison of speaking rate for the repair portion of the utterance alone shows a greater decrease for the second attempt (mean = -51.4 wpm) than for the first (mean = -18.0). However, this difference is not statistically significant (t = 1.28, df = 3, p < 0.29). In Example 7, the first attempt shows an increase in rate (31.0 wpm) while the second shows a large decrease (-78.0). Again, this difference is not statistically significant (t = 1.74, df = 3, p < 0.18).

The same effect occurs when comparing the entire repair command to the original for first and second attempts. In all but one of the first attempts, the speaking rate increases, while the rate decreases for second attempts (mean = -54.5 wpm). A comparison of F0 range shows a compression in range across the repair words in the first attempt (mean = -38.4 Hz, sd = 47.1), while a compression of range is only evident in two of the four second attempts. The same effect is found when comparing first and second attempts across the entire repair command.

 
                Repair words Δ rate (wpm)       Repair commands Δ rate (wpm)
Example         1st attempt    2nd attempt      1st attempt    2nd attempt
2                  -28.0          -23.0             29.0           53.0
5                  -49.9          -77.4             22.0           40.0
6                  -25.0          -27.0            -60.0           25.0
7                   31.0          -78.0             11.0         -100.0
Mean               -18.0          -51.4              0.5           54.5
sd                  34.5           30.5             41.0           32.4
Figure 35: Rate differences for first and second attempts.
 
                Repair words Δ F0 range (Hz)    Repair commands Δ F0 range (Hz)
Example         1st attempt    2nd attempt      1st attempt    2nd attempt
2                 -107.7          -49.7           -130.6          -21.2
5                   -5.9           48.3            -28.1          256.0
6                  -27.9          -10.7            -31.4          -26.2
7                  -12.1           73.0            -11.6           63.0
Mean               -38.4           15.2            -50.4           67.9
sd                  47.1           55.7             54.1          131.9
Figure 36: Differences in F0 range for first and second attempts.

3.3.6 Qualitative Analysis

This section discusses differences in intonation between the original and repair utterance that might be detected by a human listener.

In some of the examples, the original utterance contains much greater coarticulation effects than the repair. In Example 1, for instance, the words "twenty" and "nine" are coarticulated as "twennine" in the original utterance, whereas in the repair utterance they are clearly separated and fully articulated. These differences were not captured by the quantitative analysis. Even in the repair utterances, breaks between words were not usually more than a few milliseconds. For words that were coarticulated, the end point of one word is the same as the starting point of the next, so the break between words is calculated to be zero. However, breaks are also zero in cases where words are not so severely coarticulated.

In some cases, there was a particular emphasis on one word in the repair. In Example 1 the word "nine" sounded the most distinct, and in Example 2 the word "five" sounded the most distinct in each of the two attempts. In Example 1 this emphasis may be used to indicate that "nine" is the word that was incorrectly deleted by the recognizer (it recognized "twenty" instead of "twenty nine"). However, in Example 2 the entire flight number is misrecognized, so there is no particular reason for "five" to be emphasized more than the remainder of the utterance. In both examples the distinct words fall at the end of the repair portion of the utterance, one at the end of the entire utterance, and one followed by an intonational phrase boundary. Therefore, they may seem distinct due to increased final lengthening effects, rather than due to a specific attempt to correct those particular words.

In terms of first versus second attempts at repair, a "strengthening" in repair strategy is apparent when listening to the first and second attempts at correcting the "at" versus "around" recognition error. In the first attempt, the repair utterance is not clearly distinguishable from the original, although there is slightly greater emphasis on "around" in the repair utterance. On the next attempt to correct another "at" vs. "around" error, the emphasis on the word "around" is much more apparent when listening to the speech (this matches the quantitative findings shown in section 3.3.5). However, the remainder of the repair utterance is not clearly distinguishable from the original.

Two of the most noticeable corrections are in Examples 7 and 9. In Example 7.5, the separation of this repair utterance into three separate phrases can be heard quite distinctly. In Example 9, the user "spells out" the flight number, clearly distinguishing each digit. However, while this is apparent when listening to the utterance, the quantitative data indicates only a 12 msec increase in the average length of silence intervals in the utterance and a 20 msec increase when measured across the repair phrase alone.

3.4 Discussion

3.4.1 Rate

It was hypothesized that repair words and commands would show a decrease in speaking rate when compared to the original command. A decrease in rate was found when comparing the original and repair utterances. This decrease was found for each of the repair strategies, both at the word level (mean = -37.2 wpm) and at the command level (mean = -27.8 wpm).

In order to determine whether the average decrease in speaking rate is statistically significant, control conditions are needed. For example, the change in rate between an original utterance and a repair utterance could be compared to the average change of rate found between pairs of utterances in general (where repairs do not occur). In addition, more samples of repairs would be needed. Since only a small number of examples of each repair strategy were tested, there was a large amount of variance in some cases.

Speaking rate was the most consistent difference between the original and repair utterances. The user slowed their rate of speaking, as when talking to a child or to someone who does not speak the same language. A talker's rate of speaking therefore seems to be affected, in part, by their perception of the hearer's competence.

In summarizing across repair strategies, stronger effects (defined as a larger decrease in rate) were found for comparisons at the word level (i.e., when comparing only the particular words being corrected) than for the entire command, although these differences were not statistically significant. This was found for exact repeat and partial repeat utterances, but not for rewords. Differences in speaking rate between reworded and original commands may appear larger when rate is measured in words per minute because the two commands contain different numbers of words. This difference could be normalized by calculating rate in syllables per second.
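As an illustration of this normalization, the sketch below (Python) computes rate both in words per minute and in syllables per second for an original and a reworded command. The syllable counter is a crude vowel-group heuristic and the durations are hypothetical; the example is only meant to show why the syllable-based measure is less sensitive to rewording.

    # Sketch: normalizing speaking rate so reworded commands can be compared
    # fairly. Rate in words per minute depends on how many words the speaker
    # chose; syllables per second is less sensitive to rewording. The syllable
    # counter is a crude vowel-group heuristic, used only for illustration.
    import re

    def count_syllables(word: str) -> int:
        groups = re.findall(r"[aeiouy]+", word.lower())
        return max(1, len(groups))

    def rate_wpm(words: list[str], duration_sec: float) -> float:
        return len(words) / (duration_sec / 60.0)

    def rate_sps(words: list[str], duration_sec: float) -> float:
        return sum(count_syllables(w) for w in words) / duration_sec

    original = "show me flights from boston to denver on september twenty nine".split()
    reword   = "i want the september twenty ninth flights from boston to denver".split()
    # Durations (seconds) are hypothetical.
    print(rate_wpm(original, 3.1), rate_wpm(reword, 3.4))
    print(rate_sps(original, 3.1), rate_sps(reword, 3.4))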

3.4.2 Energy and Fundamental Frequency

In addition to a decrease in speaking rate, it was also hypothesized that energy and pitch would be higher in the repair utterance than in the original, based on findings by Cutler and Levelt for self-repairs [21]. However, Levelt and Cutler did not define a quantitative criterion for "higher energy" and "higher pitch." For exact repeats, there was a tendency for peak energy to increase (3 out of 4 cases) and average energy to decrease (3 out of 4 cases) when comparing across the words being repaired and across commands. The decrease in average energy is probably due to the considerable increase in duration of the repeated utterance (mean = 51.4% for repair words, mean = 16% for commands). The direction of change for peak and average energy was not consistent, however, for any of the repair types. Energy was calculated from word onset to word offset for each word in a command. Hirschberg and Grosz calculated energy for the accented syllable only [16]. Calculating energy in this manner may have proven more useful in eliciting consistent changes in energy between the original and repeat.
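A minimal sketch of the two measurement choices is given below (Python with NumPy): RMS energy computed from word onset to offset, versus over the accented syllable only. The signal, sampling rate, and time stamps are hypothetical stand-ins, not data from this study.

    # Sketch: two ways of measuring a word's energy, given onset/offset times.
    # The analysis above measured energy from word onset to offset; an
    # alternative (following Hirschberg and Grosz) is to measure over the
    # accented syllable only. Signal, sample rate, and times are hypothetical.
    import numpy as np

    def rms_db(signal: np.ndarray, start_s: float, end_s: float, sr: int) -> float:
        seg = signal[int(start_s * sr):int(end_s * sr)].astype(float)
        rms = np.sqrt(np.mean(seg ** 2) + 1e-12)
        return 20.0 * np.log10(rms + 1e-12)

    sr = 16000
    signal = np.random.randn(sr * 3)          # stand-in for a 3-second utterance
    word_onset, word_offset = 1.20, 1.65      # e.g. "nine", hypothetical times
    syl_onset, syl_offset = 1.20, 1.50        # accented syllable, hypothetical

    print("word-level energy (dB):     ", rms_db(signal, word_onset, word_offset, sr))
    print("accented-syllable only (dB):", rms_db(signal, syl_onset, syl_offset, sr))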

The most interesting result found when comparing F0 range for the original and repair utterance was for the partial repeats. The repeated portion of the command was compressed in range by an average of 60.3 Hz. For the partial repeats that are preceded by an editing expression, this could be the speaker's way of signaling to the listener, "now here's the part you got wrong." Again, control conditions would be necessary to test the significance of this difference. Grosz and Hirschberg found a significant compression in pitch range for parenthetical statements [11]. Parentheticals generally coincided with an increase in speaking rate, while repairs are associated with a slower speaking rate. However, parentheticals that were uttered at a slower rate were found to coincide with a lower pitch range (< 196 Hz) than those spoken rapidly. Repair portions of the partial repeats had an average pitch range of 100.3 Hz, while the same portions of the original utterance had a pitch range of 145.5 Hz. Perhaps the decrease in pitch range for repair portions of the partial repeats is associated with the decrease in speaking rate.
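The F0-range comparison itself is straightforward; the sketch below (Python with NumPy) computes the change in F0 range between the original and repair renderings of the corrected words. The F0 arrays are hypothetical; in practice they would come from a pitch tracker restricted to voiced frames over the words in question.

    # Sketch: measuring compression of F0 range between an original and a
    # repair rendering of the corrected words. The arrays are hypothetical
    # voiced-frame F0 values in Hz.
    import numpy as np

    def f0_range(f0_voiced_hz: np.ndarray) -> float:
        return float(np.max(f0_voiced_hz) - np.min(f0_voiced_hz))

    f0_original = np.array([118.0, 142.0, 180.0, 210.0, 165.0, 130.0])  # Hz
    f0_repair = np.array([120.0, 135.0, 158.0, 172.0, 150.0, 128.0])    # Hz

    delta_range = f0_range(f0_repair) - f0_range(f0_original)
    print(f"delta F0 range = {delta_range:.1f} Hz")   # negative => compression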

Overall F0 range (used by [11] for example) appears to be a more useful measure for eliciting differences than F0 average (used by [28]).

In general, the direction of change for the energy and F0 measurements was not consistent. Other groupings of the data were investigated, but no logical grouping could be found that improved the consistency. This is probably due to the small number of samples tested for each repair strategy; more data is needed to determine consistent trends. Levelt and Cutler defined intonational "marking" of self-repairs as a noticeable change in duration, energy, and fundamental frequency, the direction of which could be positive or negative. However, in order to work toward applying these results to the automatic detection of repairs, consistent trends must be identified and the changes must be significant. Therefore, further research is necessary: more data must be gathered, analyzed, and compared with control conditions in order to determine the significance of the differences found.

3.4.3 Accenting and Phrasing

It was hypothesized that words being repaired might be marked by a preceding and/or following phrase boundary and pause. In five of the seven exact repeat and reworded repairs, the specific portion of the utterance containing the repair is offset by phrasing. In each case, the repair words are followed by an intonational phrase boundary, and possibly preceded by an intermediate phrase accent. This effect was not found for partial repeats: the editing expression and repair were separated by a weak intermediate phrase accent, and in only one of the two examples. More data is needed to determine how likely an editing expression is to be followed by an intermediate phrase accent and/or boundary tone. However, a difference between exact repeats and rewords on the one hand and partial repeats on the other would not be surprising. Exact repeats and rewords both represent a reissue of the entire command, while partial repeats reissue only the portion of the command in error. For exact repeats and rewords, the speaker needs to indicate to the listener not only that a correction is being made, but which portion of the repair utterance contains the corrected words. In addition, partial repeats showed a compression in pitch range from the original utterance that was not found for exact repeats or rewords.

For those five repairs that were possibly set off in a separate intermediate phrase, the repair portions are preceded by a silent interval (mean = 85 msec) and followed by one as well (mean = 242 msec). For the most part, silent intervals between words in a command were small (under a few milliseconds) in both original and repair commands.
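These silent intervals can be derived directly from word boundary times, as in the sketch below; the word timings shown are illustrative, whereas in this study they came from hand-labeled boundaries.

    # Sketch: deriving inter-word silent intervals from word boundary times,
    # and the pause preceding the repaired words. Timings (in seconds) are
    # hypothetical.
    words = [
        ("no",        0.00, 0.22),
        ("i",         0.26, 0.34),
        ("said",      0.35, 0.62),
        ("september", 0.71, 1.35),
        ("twenty",    1.36, 1.80),
        ("nine",      1.81, 2.40),
    ]
    repair_span = (3, 5)   # indices of the repaired words ("september" ... "nine")

    gaps_ms = [(words[i + 1][1] - words[i][2]) * 1000.0 for i in range(len(words) - 1)]
    print("inter-word silences (ms):", [round(g, 1) for g in gaps_ms])

    before_ms = (words[repair_span[0]][1] - words[repair_span[0] - 1][2]) * 1000.0
    print(f"silence preceding repair words: {before_ms:.0f} ms")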

Accenting information did not prove useful for differentiating original from repair utterances. A large number of words in both the original and repair phrases were labeled as accented, so differentiation could only be made by the type of accent (H*, L*, complex). Some words sounded greater in energy and duration, but their pitch tracks looked rather flat, making accent judgments difficult. Most of the accents appeared to be simple H* accents, but this judgment should be validated by other labelers.

3.4.4 First vs. Second Attempts

Based on theory and research for self- and other-repair, it was hypothesized that the "strength" of the differences found between the original and repair utterance would increase from the first to the second attempt at correcting the error. In terms of the repair "technique" chosen by the user, this was true in only one of the four examples containing multiple repair attempts. In two of the examples, the same technique was used for both attempts. The sessions reviewed for this paper seemed to show a tendency for users to re-employ the same repair technique throughout the interaction; however, this effect was not specifically measured.

Quantitative measures of repairs revealed some differences between first and second attempts. Second attempts showed a larger decrease in speaking rate and so, on this basis, might be considered "stronger" than first attempts at marking a repair. Pitch range tended to be compressed across repair words for first attempts, but changes in pitch range were not consistent for second attempts. Since a consistent pattern has not been established for changes in energy and F0, it is difficult to use these measures as a basis for judging the relative "strength" of a repair attempt.

4 Conclusions

4.1 Combining Techniques for Automatic Detection

More data must be collected and analyzed in order to determine the intonational features that most reliably identify a repair utterance. Even once these features are determined, their utility for automatic detection is questionable if they are applied in isolation. As Shriberg argued, the intonational patterns found for repairs are likely to occur for non-repairs as well [28]. For example, in several cases the words being repaired appeared to be in a separate intermediate phrase. However, this effect occurs for non-repairs as well (in Example 5.1, the word "tomorrow" is in its own intermediate phrase). Therefore, intonational differences must be combined to form a set of distinctive features. The combination of differences in phrasing, speaking rate, energy, and fundamental frequency may help to distinguish repairs more reliably.
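One way such a combination might look is sketched below as a simple rule-based score over the kinds of differences examined in this paper. The feature names and thresholds are illustrative assumptions, not values established by this study.

    # Sketch: combining several intonational differences into a single decision
    # about whether an utterance is a repair of the previous one. Thresholds
    # and weights are illustrative assumptions only.
    from dataclasses import dataclass

    @dataclass
    class IntonationalDiff:
        delta_rate_wpm: float       # repair minus original; negative = slower
        delta_f0_range_hz: float    # negative = compressed range
        repair_words_phrased: bool  # corrected words set off by a phrase boundary
        has_editing_expression: bool

    def looks_like_repair(d: IntonationalDiff) -> bool:
        score = 0
        if d.delta_rate_wpm < -20.0:
            score += 1
        if d.delta_f0_range_hz < -30.0:
            score += 1
        if d.repair_words_phrased:
            score += 1
        if d.has_editing_expression:
            score += 2   # lexical cue, weighted more heavily
        return score >= 2

    print(looks_like_repair(IntonationalDiff(-45.0, -60.0, True, False)))   # True
    print(looks_like_repair(IntonationalDiff(5.0, 10.0, False, False)))     # False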

Intonational features must also be combined with other techniques for detecting repairs. Shriberg used acoustic analysis as a secondary measure, after first applying pattern matching techniques to narrow down the number of potential self-repairs [28]. Pattern matching involved, for example, looking at sequences of identical words and matching single words surrounding a cue word. Similar techniques might be applicable to the problem of detecting user repairs of recognition errors. However, Shriberg's analysis was performed using Wizard of Oz data that does not contain recognition errors. Applying these pattern matching techniques to data collected with a speech recognition system will be much more difficult, since recognition errors will degrade their effectiveness. The use of pattern matching in combination with acoustic analysis may still prove useful, but further research is necessary to make this determination. A first step could be to apply Shriberg's pattern matching techniques for self-repair to data collected using an actual spoken language system.
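A rough illustration of word-level pattern matching between consecutive user utterances is sketched below; it finds the longest shared word sequence after stripping a leading editing expression. This is only an illustration of the general idea, not the procedure used in [28], and the cue-word list is an assumption.

    # Sketch: a simple pattern match between consecutive user utterances, in
    # the spirit of the word-matching heuristics described above, adapted from
    # self-repair to recognition repair. An exact repeat matches everything; a
    # partial repeat matches a substring.
    from difflib import SequenceMatcher

    EDIT_EXPRESSIONS = {"no", "nope", "wrong"}   # assumed cue words

    def match_repair(prev_words: list[str], curr_words: list[str]):
        # Strip a leading editing expression such as "no" before matching.
        content = curr_words[1:] if curr_words and curr_words[0] in EDIT_EXPRESSIONS else curr_words
        m = SequenceMatcher(None, prev_words, content).find_longest_match(
            0, len(prev_words), 0, len(content))
        overlap = m.size / max(1, len(content))
        return overlap, prev_words[m.a:m.a + m.size]

    prev = "show flights on september twenty nine".split()
    curr = "no i said september twenty nine".split()
    print(match_repair(prev, curr))   # (0.6, ['september', 'twenty', 'nine'])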

4.2 Designing Systems for Error

Current spoken language systems (e.g., MIT's spoken language system for the ATIS domain) do not attempt to determine when the user's utterance is in response to a recognition error made by the system. For example, given the user repair utterance "No I said september twenty nine," robust parsing techniques [27] would simply pick out the content words "september twenty nine." Even if the system recognizes this repair utterance correctly, the response would not be natural. For example, the interaction would be as follows:

1. System: What date?
2. User: September twenty nine
3. System: Here are the flights for September twenty...
4. User: September twenty NINE!
5. System: Here are the flights for September twenty nine...

In many of the interactions, responses like (5) give no explicit indication to the user that the system "understands" that it has made a mistake and is now correcting it. The response would be the same if instead of (4) the user had said:

4a. User: "And I would also like to see flights for September twenty nine."
5. System: Here are the flights for September twenty nine...

However, utterance (4a) is a completely different request than (4). In the first case, the user's utterance "September twenty nine" replaces the expression "September twenty," while in the second case the expression "September twenty nine" is added to the list of potential referents [12]. Therefore, interpreting these two utterances identically will lead to further problems with reference resolution. Given the repair utterance (4), if the next utterance is (6) "Book me on the earliest flight," the reference to "earliest flight" is unambiguous: it refers to the earliest flight on September twenty ninth. Given utterance (4a), the reference to "earliest flight" may refer to the earliest flight on the twenty ninth, or to the earliest flight on either day. Therefore, the system's response to (4a) might be followed by a response like (7). However, such a response would be inappropriate given user utterance (4).

6. User: "Book me on the earliest flight."
7. System: The earliest flight on September 20 or September 29?

Another difference between user utterances (4) and (4a) relates to the user's intention or discourse purpose [13]. Given utterance (4), the user's intention is to determine the flights available on a single day. However, given utterance (4a), the user's intention may be to decide between travel on two different days or to make travel arrangements for both days. Another problem is that while the system response in (5) allows the interaction to continue, this kind of feedback makes it difficult for the user and system to establish what is mutually believed.

If the system could detect that utterance (4) is a repair, a more appropriate response might be:

5a. System: Oh, you said September twenty NINTH; here are those flights...

or

5b. System: Oops, I thought you said September twentieth; here are the flights for September twenty NINTH instead...

The detection of repairs might also enable the system to avoid making the same mistake twice. In Example 5 of the speech corpus (section 3.2.3), the system substitutes the word "at" for the word "around." The user corrects the error, after first repeating and then rewording the original command. In the next user command the word "around" is also used, and it is misrecognized a second time. Once again the user repeats the entire command in an attempt to correct the error. If the system had the ability to detect the first "at" to "around" correction, perhaps this information could be used to "break the chain" of errors.
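A minimal sketch of such bookkeeping is given below: once a repair of "at" to "around" has been detected, the substitution is remembered and can be used to bias later hypotheses. The interface is purely illustrative; no existing system's API is implied, and a real system would re-score recognition alternatives rather than rewrite them outright.

    # Sketch: remembering a detected correction so a post-processor can avoid
    # repeating the same substitution. Purely illustrative; a real system
    # would re-score alternatives rather than blindly rewrite words.
    corrections: dict[str, str] = {}

    def record_correction(misrecognized: str, intended: str) -> None:
        corrections[misrecognized] = intended

    def apply_corrections(hypothesis: list[str]) -> list[str]:
        return [corrections.get(w, w) for w in hypothesis]

    record_correction("at", "around")   # learned from the user's detected repair
    print(apply_corrections("arriving at four pm".split()))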

4.3 Summary

Current spoken language systems are designed for correct recognition; when recognition errors occur, the interaction breaks down severely. Human conversations, in contrast, are designed to handle errors: when people make speech errors or misunderstand what someone has said, they are able to recognize that an error has occurred and work together toward repairing it and establishing a mutual understanding. Previous research on repair has focused mostly on repair of errors made by the speaker (i.e., self- and other-repair) rather than those made by the hearer (i.e., recognition repair). This paper has attempted to quantify acoustic features of these "recognition repairs" that may be identified and exploited in future human-computer conversational systems.

References

1. S.E. Brennan. A cognitive architecture for dialog and repair. In Working Notes of the AAAI Fall Symposium Series, Symposium: Discourse Structure in Natural Language Understanding and Generation, pages 3-5, 1991.

2. S.E. Brennan and E.A. Hulteen. A dynamic feedback model for spoken language interaction. Unpublished draft, 1992.

3. J.E. Cahn. A computational architecture for dialog and repair. In Working Notes of the AAAI Fall Symposium Series, Symposium: Discourse Structure in Natural Language Understanding and Generation, pages 5-7, 1991.

4. J.E. Cahn. A computational architecture for mutual understanding in dialog. Technical Report #92-4. MIT Media Laboratory, 1992.

5. H.H. Clark and S.E. Brennan. Grounding in communication. In J. Levine, L.B. Resnick and S.D. Teasley, editors, Perspectives on socially shared cognition, pages 127-149. APA, 1991.

6. H.H. Clark and E.F. Schaefer. Collaborating on contributions to conversations. Language and Cognitive Processes, 2(1):19-41, 1987.

7. H.H. Clark and E.F. Schaefer. Contributing to discourse. Cognitive Science, 13:259-294, 1989.

8. H.H. Clark and D. Wilkes-Gibbs. Referring as a collaborative process. Cognition, 22:1-39, 1986.

9. A. Cutler. Speaker's conceptions of the function of prosody. In A. Cutler and D.R. Ladd, editors, Prosody: Models and Measurements, chapter 7, pages 79-91. Springer-Verlag, 1983.

10. D. Goodine, L. Hirschman, J. Polifroni, S. Seneff and V. Zue. Evaluating interactive spoken language systems. In Proceedings of Second International Conference on Spoken Language Processing, 1992.

11. B. Grosz and J. Hirschberg. Some intonational characteristics of discourse structure. In Proceedings of the Conference on Spoken Language Processing, 1992.

12. B. Grosz, A. Joshi and S. Weinstein. Providing a unified account of definite noun phrases in discourse. In Proceedings of the 21st Annual Meeting of the Association for Computational Linguistics, pages 44-50, 1983.

13. B. Grosz and C. Sidner. Attention, intentions, and the structure of discourse. Computational Linguistics, 12(3):175-204, 1986.

14. J. Gruber. A comparison of measured and calculated speech temporal parameters relevant to speech activity detection. IEEE Transactions on Communications, COM-30(4):728-738, 1982.

15. D. Hindle. Deterministic parsing of syntactic non-fluencies. In Proceedings of the 20th Annual Meeting of the Association for Computational Linguistics, pages 123-128, 1983.

16. J. Hirschberg and B. Grosz. In Proceedings of the DARPA Workshop on Spoken Language Systems, pages 441-446, 1992.

17. J. Hirschberg and D. Litman. Now let's talk about now: Identifying cue phrases intonationally. In Proceedings of the 25th Annual Meeting of the Association for Computational Linguistics, pages 163-171, 1987.

18. L. Hirschman, M. Bates, D. Dahl, W. Fisher, J. Garofolo, D. Pallett, K. Hunike-Smith, P. Price, A. Rudnicky and E. Tzoukermann. Multi-site data collection and evaluation in spoken language understanding. In Proceedings of DARPA Human Language Technology Workshop, 1993.

19. W.J.M. Levelt. Monitoring and self-repair in speech. Cognition, 14:41-104, 1983.

20. W.J.M. Levelt. Self-monitoring and self-repair. Speaking: From intention to articulation, chapter 12, pages 458-499, 1991.

21. W.J.M. Levelt and A. Cutler. Prosodic marking in speech repair. Journal of Semantics, 2(2):205-217, 1983.

22. J.D. Moore. Indexing and exploiting a discourse history to generate context-sensitive explanations.

23. J.D. Moore and C.L. Paris. Planning text for advisory dialogues. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, pages 203-211, 1989.

24. J. Pierrehumbert. The phonology and phonetics of English intonation. Ph.D. Thesis. Massachusetts Institute of Technology, 1980.

25. J. Pierrehumbert and J. Hirschberg. The meaning of intonational contours in the interpretation of discourse. Intentions in Communication, chapter 14, pages 271-311. MIT Press, 1990.

26. E.A. Schegloff, G. Jefferson and H. Sacks. The preference for self-correction in the organization of repair in conversation. Language, 53(2):361-382, 1977.

27. S. Seneff. Robust parsing for spoken language systems. International Conference on Acoustics, Speech, and Signal Processing, 1992.

28. E. Shriberg, J. Bear and J. Dowding. Automatic detection and correction of repairs in human-computer dialog. In Proceedings of the DARPA Workshop on Spoken Language Systems, 1992.

29. E. Shriberg, E. Wade and P. Price. Human-machine problem solving using spoken language systems (SLS): Factors affecting performance and user satisfaction. In Proceedings of the DARPA Workshop on Spoken Language Systems, 1992.

30. J. Terken and S.G. Nooteboom. Opposite effects of accentuation and deaccentuation on verification for given and new information. Language and Cognitive Processes, 2(3/4):145-163, 1987.

31. S. Young and M. Matessa. Using pragmatic and semantic knowledge to correct parsing of spoken language utterances. Technical Report #CMU-CS-92-120A. Carnegie Mellon University, 1992.

32. S.R. Young and M. Matessa. MINDS-II feedback architecture: Detection and correction of speech misrecognitions. Technical Report #CMU-CS-92-120A. Carnegie Mellon University, 1992.