If you have an unusual name because it's rare,
ethnic, or because your parents got creative with
its spelling, you're probably used to people
butchering it when they address or introduce you.
Some people admit it's useful to have an uncommon
name, since they use it to screen out
telemarketers ("Hello, er, um, Mr. 'Bucket'?" "No,
Goodbye!" answers Mr. Buquet).
While few of us want to help telemarketers (at
least, when they call us), we all want the best
voice dialing, directory assistance, reverse
directories, security, access to email, directions
and ordering systems; in other words, we want
improvements in the full range of voice activated
services, to enable them to be fast, accurate and
personalized. To that end, it helps to know the
difference between Bucket and Buquet.
Having one's name mispronounced is a common
malady, more common than many people realize. It's
hard to pronounce names. As difficult as it is for
people, imagine how hard it is for computers!
Achieving better performance has been the quest
for several research teams, some working over 20
years.
- Excuse Me: Was that Cook or Koch?
Names are hard to pronounce because of several
factors. Foremost is the sheer quantity of them
and the preponderance of rare names. Smith, the
most common US surname, accounts for 1% of the
country. Figure 1 shows how the rank order of
names associates with the population. The graph is
an analysis of databases obtained from Donnelley
Publishing; it corresponds well with Social
Security information. The 50% mark is near the
2000th name - that is, about 2000 names (Smith
through Arroyo, Gannon, Worthington) “covers” half
of the population. There are a total of 2 million
surnames. The data graphically shows how many rare
names there are: 20% of us have names rarer than
the top 50,000 (Boetcher, Marchioni, Yehle). As
many as one in 100 households have a unique family
name, such as the single households of Adeyooye,
Caioppo, Xoumphayvient, and Zabdy. Paradoxically,
it’s quite common to have a rare name!
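The coverage figures above follow directly from a rank-frequency list. As a minimal sketch (using made-up counts, not the Donnelley or Social Security data), cumulative coverage can be computed like this:

```python
def coverage_rank(counts, target=0.5):
    """How many top-ranked names cover `target` fraction of the population?
    `counts` must be sorted in descending order."""
    total = sum(counts)
    running = 0
    for rank, count in enumerate(counts, start=1):
        running += count
        if running / total >= target:
            return rank
    return len(counts)

# A made-up long-tailed distribution: a few big names, many singletons.
counts = [1000, 400, 300, 200, 100] + [1] * 100
print(coverage_rank(counts, 0.5))  # the top 2 names cover half this toy population
```

Run against a real surname database, the same calculation yields the 2000-name and 50,000-name figures quoted above.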
The second factor is that our names derive from
dozens of languages and nearly every country in
the world, resulting in an unusually high level of
ethnic diversity. Perhaps the most important
factor is the many variations for some names. One
person's "wrong" rendition may be another person's
"correct" pronunciation, because of the influences
of large ethnic populations in a region, how much
assimilation has occurred over time, and even
personal preferences for pronunciations. (Given
names [first names] and business names show
similar variation.)
Let's explore one factor: regional
variations. Regional variants for name
pronunciations are influenced by immigrant
settlement patterns. It's common knowledge some
cities are “strongholds” for certain ethnic
populations; however, few people realize the
extent to which name distributions vary from place
to place.
Figure 2 shows the most common names in
some cities. Few have surname rankings resembling
the national ranking. (Chicago is the exception
here: 18 of its top 20 names are in the national
top 20 list; in the other cities, only 50-75% of
their top 20 names are in the national top 20.)
In NYC, the most common names include many
Hispanic, Jewish and Asian names. In San
Francisco, the top names include Asian and Irish
names not commonly found elsewhere. In Boston,
Irish names predominate; in Shreveport, French
names; and in Milwaukee, German names. The
implication is Germanic names (such as Greulich,
Doetsch, Wendschlag) more likely retain
pronunciations close to the original German in
Milwaukee, but not elsewhere. Elsewhere, an
Anglicization/assimilation process occurs. Quite
often the Anglicization is only partial; that is,
the name is not pronounced exactly in accordance
with how English words are pronounced. Instead,
the original pronunciation and ethnic origin
"flavor" the Anglicized result.
Whether the general public uses an ethnic
pronunciation is determined by a complex interplay
of many processes. It is influenced by the
majority culture’s interest in a minority's
culture, values and even its foods. The details of
this influence are studied by the field of
socio-linguistics and are beyond the scope of this
article. Needless to say, the full range of
accommodation is seen in different areas of the
country and at different time periods. Some
immigrants seemingly reject their native culture
and, to "blend in," accept full assimilation:
their names are said according to "standard"
English pronunciation rules. Some even adopt common US
names, knowing the general populace can't easily
render "correct" pronunciations of Chandrashekhar,
Nahekeaopono or Kaweiuokalani. Other immigrant
populations succeed in interesting and educating
the public around them in many aspects of their
culture, which spills over into adoption of more
native pronunciations of their names.
Some names have a great number of variations.
Koch, for instance, has at least six distinctly
different pronunciations that vary regionally. The
challenge for TTS systems is to use the best
pronunciation for names like Koch, which has no
majority pronunciation - only 30 or 40% of the
population uses the most common pronunciation. The
ASR implications are even more important: the
programs must recognize most of those variant
pronunciations - otherwise recognition fails when
these variations are encountered.
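One common way for ASR systems to handle a name like Koch is a multi-pronunciation lexicon: each name lists every accepted variant, and recognition succeeds if the spoken form matches any of them. A toy sketch (the phoneme strings and variant list are illustrative, not any deployed system's data):

```python
# Toy multi-pronunciation lexicon; phoneme strings are illustrative.
VARIANTS = {
    "koch": ["K AA K", "K OW K", "K AA CH", "K AA SH", "K UH K", "K AO K"],
}

def accepts(name, heard):
    """True if `heard` matches any listed pronunciation of `name`."""
    return heard in VARIANTS.get(name.lower(), [])

print(accepts("Koch", "K UH K"))  # the "Cook" variant is accepted
```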
The regional variations mentioned are, at this
point, too complex for most TTS systems; but ASR
systems are beginning to account for regional
variability.
- HOW DO THESE PROGRAMS WORK AND WHY DO THEY
FAIL?
Understanding how these systems work will help
explain some of the "failures" or errors one
occasionally sees.
For pronunciations, all ASR and TTS systems
roughly work the same way. Most rely on a large
dictionary; many also contain rules for "out of
vocabulary” words and names. (Even the base
dictionaries in most systems were originally
generated “back in the lab” by running a
rule-based system; a dictionary is used in
deployed systems to save run time.)
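That dictionary-first, rules-as-fallback strategy can be sketched as follows; the entries and the placeholder rule function are illustrative, not any real system's data:

```python
# Toy lexicon; the entries and phoneme strings are illustrative.
LEXICON = {
    "smith": "S M IH TH",
    "buquet": "B UW K EY",
}

def naive_rules(name):
    """Placeholder for a letter-to-sound rule system: here, one symbol
    per letter. Real systems apply ordered, context-sensitive rules."""
    return " ".join(name.upper())

def pronounce(name):
    # Dictionary first; fall back to rules for out-of-vocabulary names.
    return LEXICON.get(name.lower(), naive_rules(name))

print(pronounce("Buquet"))    # dictionary hit
print(pronounce("Boetcher"))  # out of vocabulary -> rules
```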
Rules: Rule-based systems embody basic
knowledge of how words and names are pronounced.
The rules distinguishing most "hard" from "soft"
c's might be written as:
c[eiy] > s   # soft c when followed by e, i, y, as in center, city, cycle
c[ao]  > k   # hard c when followed by a or o, as in cat, cot
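In code, those two rules might be implemented as simple pattern substitutions; this is a toy fragment, not a production rule engine:

```python
import re

def apply_c_rules(word):
    """Apply just the two rules above; everything else is left untouched.
    Real rule sets are far larger and context-sensitive."""
    word = re.sub(r"c(?=[eiy])", "s", word)  # c[eiy] > s
    word = re.sub(r"c(?=[ao])", "k", word)   # c[ao]  > k
    return word

for w in ("center", "city", "cycle", "cat", "cot"):
    print(w, "->", apply_c_rules(w))
```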
The rules can become quite complex, with
context sensitivity determined by a rough
ethnographic classification for the name. Rule
systems vary in their ability to predict
pronunciations. The best actually pronounce names
better than humans (1). In their lifetime most
people make acquaintance with only a few thousand
names, a small fraction of the number researched
for the best programs.
Dictionaries: It is impossible for most
companies to authenticate the pronunciations of
tens or hundreds of thousands of names; therefore,
linguists created many dictionary entries using
their intuition of how names might be pronounced.
Errors arise because no person's intuition is
fully accurate. Is Kreamer a homonym for Kramer?
Should Cremer also be "Kremmer"? Furthermore,
dictionaries used in many systems contain
inconsistencies because multiple people edited
them.
Some recognition errors reveal mistakes in the
underlying rule-based system. A VP noticed one ASR
system wouldn't recognize his name, Wagener, until
he said it as "wage ner." This was caused by an
incorrect analysis of compound words, or their
morphology. Just as humans implicitly see the
compounds in Wineberg, Winegarden and Winemiller,
so too must pronunciation systems in order to
properly recognize the silent "e," the long "i"
vowel, etc. An inappropriate context for
compound-name rules would pronounce Winegrad as
"wine grad" (and was the likely reason for the
system mistaking Wagener as "wage ner"). Another
revealing error was the TTS system that pronounced
Malone as "mal wun." Here the pronunciation of the
numeral "1" (one) was matched to the letters in
Malone, possibly after an inappropriate
morphological analysis.
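The compound analysis described above can be sketched as a greedy split against a list of known name components (the list here is purely illustrative). A careful system rejects splits that leave an unparseable remainder; forcing a match for "wine" anyway is exactly where the "wine grad" style of error creeps in:

```python
# Purely illustrative component list.
COMPONENTS = {"wine", "berg", "garden", "miller", "wage"}

def split_compound(name):
    """Greedy left-to-right split into known components; None if no full parse."""
    name = name.lower()
    if not name:
        return []
    for i in range(len(name), 0, -1):
        if name[:i] in COMPONENTS:
            rest = split_compound(name[i:])
            if rest is not None:
                return [name[:i]] + rest
    return None

print(split_compound("wineberg"))  # ['wine', 'berg']
print(split_compound("winegrad"))  # None: "grad" is no component, so no parse
```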
- EVALUATION METRICS
How can one be assured a system performs well -
what are the metrics for evaluation? With most
features of a system (TTS: intelligibility and
naturalness; ASR: word accuracy, task completion
rates, noise immunity), one can be confident the
results from a small, established test are
representative of the whole system (limited, of
course, by the variability associated with small
sample-set size). However, there is a problem with
establishing a standard test set of names (for
example, including Keogh, Riordan, and
O'Shaughnesy): all developers can place in their
system dictionaries the troublesome names on the
test and the system immediately appears to have
better test performance (recognizing or
synthesizing more accurately) than it does in
actual service. (By the time you have reached this
point in the article, every developer has already
corrected “Buquet,” and the other names mentioned,
but they probably won't admit it!)
Thus, fixed, published vocabulary tests will
yield inaccurate results. Better metrics derive
from actively updated, private lists. It is
certainly time consuming and expensive for each
company to develop, verify and refine lists on its
own, but I argue there are few shortcuts to honest
results. If your company is interested in
developing tests for recognition or synthesis of
names, the following are guidelines to consider:
Test set: A key requirement is a wide sampling
of names, both common and uncommon. The US Census
Bureau publishes lists of names, but without any
rare names. Don't rely on published sets of “name
zingers" - most software fixed those particular
items long before you read about them. (This
doesn't imply there are no more problematic names:
there are many more!) You can also mine your
contact list and company directories for unusual
names.
Pronunciations: The best test gathers judgments
from many people about which pronunciations are
possibly right and which are definitely wrong.
Find out how the names are actually
pronounced - do not guess or use your intuition;
many people are surprised how often their
intuitions are wrong. When obtaining
representative pronunciations, be aware that
multiple pronunciations may exist. Your experience
with Devon or Gautier may not match the actual
commonest pronunciations nationally, or within
your community of interest. (An anecdote from the
deployment of a Reverse Directory service
reinforces this point: a Chicago subscriber
complained the system mispronounced his name,
Koch. When told his pronunciation (Cook) was less
common in Chicago than our system default, he was
surprised to learn there were other
pronunciations! As per-line customization wasn't
possible, Mr. "Cook" accepted that our system
chose the best single default.)
Scoring: For ASR purposes, an obvious scoring
metric is recognition success or a low recognition
"distance." Error sensitivity grows with the
number of items in the dictionary: for name
dialing, a contact list of 100 people is less
demanding than an auto-attendant choosing among a
5,000-name company directory, and Automated
Directory Assistance is several orders of magnitude harder
still.
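The recognition "distance" mentioned above is often computed as an edit (Levenshtein) distance between phoneme sequences. A standard dynamic-programming sketch, with illustrative phoneme strings:

```python
def edit_distance(a, b):
    """Levenshtein distance over token sequences (classic dynamic programming)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

ref = "K OW K".split()  # one Koch variant (illustrative phonemes)
hyp = "K UH K".split()  # a "Cook"-like variant
print(edit_distance(ref, hyp))  # 1: a single vowel substitution
```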
For TTS applications, most researchers agree a
3-tiered scoring scheme is adequate. While
category labels may differ ["excruciatingly
correct," "sensible," "outright wrong" vs.
"clearly acceptable," "somewhere in between,"
"clearly bad," etc], most capture the
pronunciation differences among a) a person saying
their own name, b) other people saying the name,
and c) no one saying the name this way.
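The 3-tier scheme amounts to a simple decision rule; the labels below follow the article, while the pronunciation sets are illustrative placeholders:

```python
def score(pron, own, others):
    """own: pronunciations used by bearers of the name themselves;
    others: additional pronunciations the public uses."""
    if pron in own:
        return "clearly acceptable"
    if pron in others:
        return "somewhere in between"
    return "clearly bad"

# Illustrative sets for a name like Koch:
own = {"K UH K"}      # how this bearer says it ("Cook")
others = {"K AA K"}   # how others commonly say it
print(score("K UH K", own, others))  # clearly acceptable
print(score("M AA L", own, others))  # clearly bad
```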
The pronunciation accuracy of the best systems
typically exceeds 99% for common names and is
better than 92-94% for very rare names. Depending
on the exact mix of common and uncommon names in
your test, frequency-weighted results can be
better than 96-97%. This is not typical
performance, but the best systems are better than
humans. Accuracies this high could mean your
ASR/TTS system’s performance would be limited by
other factors, and not by the difficult area of
proper-name pronunciation.
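Frequency weighting of the kind described can be computed as follows; the frequencies and outcomes are made up for illustration:

```python
def weighted_accuracy(results):
    """results: list of (population_frequency, was_pronounced_correctly)."""
    total = sum(freq for freq, _ in results)
    correct = sum(freq for freq, ok in results if ok)
    return correct / total

# Made-up mix: the common names come out right, two rare names wrong.
results = [(1000, True), (500, True), (10, False), (1, False)]
print(round(weighted_accuracy(results), 3))  # 0.993
```

Because common names dominate the weights, a system can miss many rare names and still post a high frequency-weighted score, which is why the unweighted rare-name figure is worth reporting separately.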
- PROGRAMS IN ACTUAL USE
The best programs are used in a variety of
applications. Some involve a map company’s need to
provide accurate pronunciations for streets and
town names to its customers; others improve the
accuracy of voice dialing on mobile phones (the
rules were modeled by a company providing software
for embedded chips). Targus Information Systems
uses the software within its SpeechCapture Express
application, increasing the capture rate of names
and addresses obtained with speech recognition
transactions. Sprint’s Voice Command uses the
software for high accuracy TTS playback of names
of people and businesses voice dialed by their
customers. Applications for the software for
company auto-attendants and automating Directory
Assistance are currently planned; as such software
continues to be refined, more applications and
widespread use are likely.
- WHAT WILL THE FUTURE BRING?
How are these systems being improved? What is
at the forefront of research? There are three
major thrusts in the field. The first is improving
the dictionaries in scope (number of
entries), breadth (number of variations), and
overall accuracy (eliminating spurious guesses).
As companies gain experience with pronunciations
via auto-attendants and automation trials for
Directory Assistance, the best dictionaries will
begin to reflect, for each name, its pronunciation
variants and their relative frequencies. The
second is continuing to improve the accuracy of
handcrafted rule-based systems, augmented by small
exception dictionaries. These are currently
capable of achieving the highest accuracy when
they properly imitate the underlying rules that
people use. A third approach involves automated
learning theory: let computer algorithms analyze
the underlying relationships in a pronunciation
dictionary and devise the most efficient and
predictive rule set. This promising line of
inquiry, automatically developing rule sets
without human intervention, is very attractive.
The current state of the art obtains 20-30%
errors, which is still too high. Automated
learning algorithms are likely hampered by the
inconsistencies and errors still found in most
public research dictionaries (2). It remains to be
seen whether automated learning can achieve higher
accuracies than earlier attempts (neural networks,
analogy systems and the like). So far, the
handcrafted systems have a definite edge.
Footnotes: 1. M.F. Spiegel, “Proper Name
Pronunciations for Speech Technology
Applications,” Int’l Journal of Speech Technology,
in press, 2003.
2. A. F. Llitjós, "Improving Pronunciation
Accuracy of Proper Names with Language Origin
Classes," CMU Masters Thesis, Dept of Computer
Science, 2001. Available at
http://www.cs.cmu.edu/~aria/papers/mthesis-cmu.pdf
Murray Spiegel has been active in the speech
community for 20 years and directs speech
application research at Telcordia Technologies. He
can be reached at spiegel@research.telcordia.com.