Experience in User Research, Interaction Design (UI), Usability Testing, Gestural Interface, Mobile Design, iPhone & iPad Apps Designs, E-commerce Websites & Enterprise Web Applications
Speech & Voice
Speech Interface Design / Voice Dialogue Design
.
1. Some Speech Technology Terms
2. Defining the Problem
3. Design Constraints
4. Techniques
5. Technique: Explicit Prompts
6. Technique: Implicit Prompts
7. Technique: Incremental and Expanded Prompts
8. Technique: Tapering
9. Technique: Hints
10. Techniques Specific to Multimodal Systems
11. Summing Up
_____________________________________________________________________________ …
Some Speech Technology Terms
Following are definitions of speech-related terms used in this Article as well as some suggestions for learning more about speech technologies.
Speech Recognition
The capability of a computer to take spoken input from the user and convert it to text.
Speech Recognizer
A software system (sometimes with hardware components) that processes a digital audio signal from a file, microphone, or telephone. The processing produces the best fit of possible word sequences based on a frequency table or predefined grammar specification. The output is one or more strings of text, often with associated measures of how closely the string matches the specification.
Discrete Speech
Some speech recognizers only allow users to speak one word or short phrase at a time, such as Calendar or Send-to-back.
Continuous Speech
Other speech recognizers allow users to speak connected streams of words, although usually only one sentence at a time. For example, I would like my calendar, or Send this to the back.
Vocabulary
The words or phrases a speech recognizer can hear. Both discrete and continuous speech recognizers have a word list, sometimes called a lexicon.
Grammar
A specification used by some continuous speech recognizers for how words in the vocabulary can be strung together.
Dictation
The use of speech for entering free text, as in a memo or the body of an electronic mail message.Commercially available dictation systems currently all use discrete speech recognition, but with very large vocabularies. Dictation systems are distinct from command-and-control systems, which are designed to allow the user to issue commands and to control application behavior.
Speech Output
The computer can either play recorded audio messages or can convert text to speech using a speech synthesizer. The recorded audio provides higher quality output, but a synthesizer is almost always used when the content of the output is not known ahead of time.
Prompt
A recorded or synthesized message produced by the system for the user.
Speech-only Interface
An application interface that has no other input or output mechanism other than speech. Telephone-based applications are the most common speech-only interfaces.
Multimodal Application
In the context of this article, an application that uses any number of input and output modalities, including speech.
As speech recognition technology improves, many software designers are being challenged to design speech user interfaces. One of the most challenging issues is How do users know what they can say?
If you think about it, a speech-only interface poses the same problems as a command-line interface. The functionality of applications is hidden, and the boundaries of what can and cannot be done are invisible.
Graphical interfaces were invented largely to make hidden functionality visible to the user. In a speech-only environment, such as over a telephone, it is not possible to display menus, show options, or highlight buttons. Instead, other techniques must be used to guide users through a successful interaction.
_____________________________________________________________________________ …
Defining the Problem
Before discussing the solutions, let me provide you with some additional insight into the problem. One of the applications that I designed is a speech-only interface to a shared calendar [4]. Here is what an interaction over the telephone with the SpeechActs Calendar application might sound like:
Comp: Currently, you have “lunch with Julie.” Later, at 2:30,
you have “Staff Meeting in Conference Room B.”
User: What’s on my calendar the Monday after Easter?
Comp: On Monday, April 8, you have “Call Fidelity Monitor,”
you have “Barbara Munch’s birthday.”
User: What does Paul have?
Comp: On Monday, April 8, Paul has “Chris goes to Wisconsin.”
From 10 o’clock to 10:30, “Meeting with Bob.”
User: What about the next day?
Comp: On Tuesday, April 9, Paul has no appointments.
The dialogue has progressed smoothly so far, but now the user gets into trouble:
User: What do I have tomorrow morning?
Comp: Sorry?
Although this query about “tomorrow morning” sounds perfectly reasonable given the prior dialogue, the user has just crossed the functional boundary of what the application supports. In designing the calendar application, we opted to support browsing calendars only in units of whole days and handle “out of grammar” requests, such as this one, with our error recovery interface. The decision to constrain the grammar was driven primarily by our desire to maintain reasonable recognition performance. Perhaps the decision seems arbitrary, but it’s a slippery slope. Once you allow users to talk about “morning,” “afternoon,” and “evening,” they’ll expect to be able to say things like “early afternoon,” “after 6 p.m.,” “around noon,” and so forth. At some point, the constraints of the speech recognition technology dictate that only a limited set of utterances be allowed. Designers must draw the line someplace.
The calendar example illustrates one side of the problem: Users assume they can say things that the application does not support. The flip side of the problem is that users don’t know to say things that the application does support. Here’s an example from another one of my group’s projects: an Automated Customer Service Representative. With this application, users can view the Lands’ End clothing catalog on an interactive television display.
When they dial a telephone number, they converse with the Automated Rep, which helps them browse the catalogue and select items for purchase. When the top level display is presented the user hears:
Comp: Thank you for calling Lands’ End. Which of these items are
you interested in?
The user will do fine by answering the question directly:
User: I’m interested in woman’s clothing.
The application, however, will accept a much richer set of responses. These range from specific:
User: I’d like a large men’s cotton button-down shirt.
…to quite vague:
User: Do you have any red sweaters?
One solution, of course, is to create a much more detailed initial prompt that spells out the range of options available to the user. The disadvantage of this approach is that speech output is slow and temporal. Not only are long prompts costly in terms of the time they take, but users often only remember the end of what was said. In the following discussion, hints, suggestions, and other techniques are described that attempt to provide more satisfying solutions to the problem of letting users know the range of available options.
To summarize, the problem of knowing what to say to a speech application has two components. Users can assume the computer will be able to understand more than is actually possible, and users can be unaware of functionality that is available.
_____________________________________________________________________________ …
Design Constraints
Like all design problems, the techniques available to guide users in speaking well-formed input are subject to technological constraints. A major constraint on designing a speech application involves the features supported by the speech recognizer being used for the project. Some recognizers can support a fairly large vocabulary of continuous speech, whereas others can support only a small vocabulary (but typically with greater reliability). Other recognizers require discrete words or phrases rather than continuous speech. In addition, some recognizers support a feature, important to interface design, called “barge-in.” This refers to a user’s ability to interrupt speech output and still be understood by the speech recognizer. A second design constraint is the environment in which the speech recognizer is being used. Many speech applications are designed to work over a telephone, but others are designed to be used in conjunction with a graphical interface or as part of a consumer device.
Another constraint involves the user profile. Some speech applications are intended for infrequent use by a large user population. For example, a number of phone companies offer speech recognition for accepting or rejecting a collect call. Other speech applications are designed for frequent use by a small user population, and others fall somewhere in the middle of this spectrum.
_____________________________________________________________________________ …
Techniques
The techniques I focus on for helping users of speech applications produce well-formed spoken input all involve design of prompts. Although other techniques such as printed documentation and on-line help can be indispensable, prompt design is at the heart of effective speech interface design.
Matt Marx of AlTech offers some useful vocabulary for analyzing different strategies for prompt design. He suggests that prompts fall along a continuum from implicit to explicit. Troy Kamphuis of Nuance Communications illustrates three points along this continuum with these possible prompts for a banking application:
Comp1: Welcome to ABC Bank. What would you like to do?
Comp2: Welcome to ABC Bank. You can check an account balance,
transfer funds, or pay a bill. What would you like to do?
Comp3: Welcome to ABC Bank. You can check an account balance,
transfer funds or pay a bill. Say one of the following choices:
check balance, transfer funds, or pay bills.
Implicit prompts, such as the first prompt in the set above, are usually inappropriate for small vocabulary recognizers because they invite open-ended responses.
_____________________________________________________________________________ …
Technique: Explicit Prompts
Most designers constrained by small vocabularies stick to the explicit end of the spectrum, illustrated by the third prompt in the set. The most explicit prompts are called directive prompts [3]. These tell the user the exact words they should say. Amir Mane of AT&T points out that listing options in this way is most effective if the recognizer supports barge-in. He provides this example:
Comp: Please choose: leave a message, return my call, or talk
to the secretary.
While the prompt is playing, the user can interrupt and say one of the choices. Since users can say the option as soon as they hear it, this technique mitigates the problem of users forgetting the exact wording of the choices by the time a long prompt ends. It also means, however, that the user may not hear all the possible choices. This problem is not so bad if the list is short, as in the preceding example, or if it acts as a reminder, as it does in AlTech`s prototype Music Mall application. Notice in this example that all the prompts are explicit, but the first is less explicit than the second two:
Comp: Please say the name of the artist you want.
User: The Beatles.
Comp: I have a lot of albums by the Beatles; please say the name
of the album you want. For example, you could say “Magical
Mystery Tour” or “Abbey Road.” For a full list of albums I
have by The Beatles, say “list.”
User: List.
Comp: Here are the albums I have by the Beatles. Say the one you want
as soon as you hear it. “With the Beatles.” “Beatles For Sale.”
“Revolver.” “The White Album.” “Please Please Me….
User (interrupting): The White Album.
Comp: I think you said, “The White Album.” Is that correct?
User: Yes.
Comp: Okay. To sample “The White Album”…..
The AlTech recognizer used in the Music Mall application is capable of handling a large vocabulary (more than 2,300 artist names). From a technical standpoint, the recognizer could also handle a flexible grammar, but the designer made the choice to tightly constrain the user input instead of allowing more flexible input such as:
Comp: Hey man, I’d like to hear the Beatles, okay.
The constrained approach significantly helps to minimize errors, which is essential for an application that has to work well for first-time callers.
_____________________________________________________________________________ …
Technique: Implicit Prompts
Although flexible grammars, coupled with large lists of options, tend to increase error rates, this type of input has an enormous potential benefit to users. Most people working on research-oriented applications are attempting to design some degree of conversational interaction. Perhaps the best examples of conversational speech interfaces are those developed as part of a competition funded by ARPA to create a speech interface to an air travel information system. In many of these systems, the dialogue with the user is a mixed-initiative interaction. Here’s a hypothetical example from the MIT Galaxy air travel system:
User: I’d like to fly from Boston to London.
Comp: What date will you be traveling on?
User: I want to leave next Friday.
Comp: Here are the flights from Boston to London on Friday, May 31 …
Notice that the user begins the conversation. In this example, the user does not provide a fully qualified request. If some information is missing, the application formulates a query to collect the missing data. The application would have accepted a request such as:
User: I’d like to fly from Boston to London next Friday afternoon
on British Airways.
or
User: Tell me about British Airways flights leaving Boston for London
a week from Friday.
or
User: Is there a flight next Friday on British Airways from Boston to
Heathrow in London?
The grammar for this application allows a wide range of grammatical constructs. In addition, think about the list of options in every place in the sentence that a city name, airline, or date can go. This application probably has too high a perplexity (number of equally likely word choices at a given time) to be practical over today’s fairly low-quality telephone lines. Because of this current technological constraint, most people designing conversational interfaces use cleverly designed prompts to encourage users to speak sentences that conform to a somewhat more constrained grammar. The following example from MIT’s Wheels classified advertisements application illustrates the use of a suggestion:
Comp: There are 100 convertibles. Can you specify a make or a make and model?
Users are not required to specify a make or model at this prompt. The system will accept queries such as
User: I’d like a red one with a manual transmission.
If they don’t know what to say, however, the suggestion helps the user to formulate a query that the system is likely to understand. Another common technique on the implicit end of the spectrum is to use subtle discourse cues in the prompts. In conversations, people tend to mimic the style of their conversational partners in terms of vocabulary used, length of utterance, and grammatical constructs. For example, in the Speech Acts Calendar, the user might ask about a person whose name is ambiguous:
User: What’s on Bob’s calendar tomorrow?
Comp: Did you mean Bob Kuhns or Bob Sproull?
User: I meant Bob Sproull.
This “Did you mean…” disambiguation prompt encourages users to answer with just a name or with something like “I meant…” or “I mean…” It is certainly possible for the user to say something outside the grammar, but the aforementioned prompt makes it much less likely that the user would say something like:
User: Sproull’s the one I want.
So far, we have examined purely explicit or purely implicit prompt designs. Not surprisingly, some of the most effective prompt design techniques fall somewhere in the middle of this continuum.
_____________________________________________________________________________ …
Technique: Incremental and Expanded Prompts
An incremental prompt involves providing an implicit prompt, pausing to wait for user input, and then providing a more explicit prompt if the user does not provide input or says something that cannot be interpreted. Here is how this technique could be applied to the banking prompts described earlier:
Comp: Welcome to ABC Bank. What would you like to do?
User: (silence)
Comp: You can check an account balance, transfer funds or pay a bill.
What would you like to do?
User: (silence)
Comp: Say one of the following choices: check balance, transfer funds or
pay bills.
For some applications, it is not possible to provide users with an exhaustive list of everything they can legally say. This holds true for GTE’s telephone directory system, Speech Connect. In this application, information about several thousand businesses is available. Because of the large vocabulary, the incremental prompt guides users more than it provides them with exact utterances:
Comp: Welcome to Speech Connect. What business please?
User: (silence)
Comp: Please say either the name or the type of business you are calling,
or say “list” for a list of business types or “help” for a help
message.
A slight variation on incremental prompting is prompt expansion. In this case, the user responds to the implicit prompt by saying something that the system cannot interpret. The next prompt she hears is an expanded or more explicit version of the first. Here’s an example from a telephone answering machine application designed at Philips Research Laboratories [1]:
Comp: You have heard all messages, please give an instruction.
User: Mmmh, I don’t know.
Comp: The accepted speech commands are “replay,” “delete,” “new
announcement,” and “turn off.”
Voice Processing Corporation’s Phone-Wizard telephone answering machine application uses a somewhat less explicit prompt to follow a recognition error:
Comp: Who’s calling, please?
User: I’m calling long distance. This is Jack Armstrong.
Comp: Please say your first and last names.
Incremental and expanded prompts provide help for users in a way that does not slow down the interaction if the user provides well-formed input at the outset.
_____________________________________________________________________________ …
Technique: Tapering
Another technique, called tapering, is aimed at shortening the interaction for users as they gain experience with a system. The user is first given an explicit prompt. Later in the interaction, the prompt is tapered to something usually more implicit. In a voice directory application designed at Bellcore, users first encounter this prompt:
Comp: Please say the first and last name of the person you want.
If the user stays on the line to make a second request, the prompt is shortened to
Comp: Say the name of the person you want.
All the SpeechActs applications that my group has designed make extensive use of tapering. For example, in the SpeechActs electronic mail application, users can record outgoing messages:
Comp: Please record your message after the tone. Pause for several
seconds when done.
Since many users record multiple messages during a session, this lengthy instructional message would start to sound repetitive very quickly. The next time they ask to record a message in the same session, the prompt is shortened to:
Comp: Record then pause.
_____________________________________________________________________________ …
Technique: Hints
A variation on tapering is use of hints. For example, novice users of Wildfire’s telephone assistant are provided with prompts that involve an implicit, conversational question, followed by an explicit hint.
Comp: Hello. You’ve reached the office of Tony Lovell. If you are he,
press the star key. Otherwise, please say your full name.
User: (presses *)
Comp: What’s the pass code? Say the digits one by one or use touchstones
followed by the pound key.
User: 4 5 4 5 1 2.
Comp: Oh, hi! You have 2 new messages. The first is from Rich Miner.
I’d recommend saying What’s-it-say, Describe-it, or Next-item.
User: What’s it say?
Comp: (message is played) I’d suggest Give-them-a-call, Send-a-reply,
or Throw-it-away.
User: Give them a call.
Comp: Rich Miner at work? yes or no?
User: Yes.
Comp: Dialing… press 44 when you want to end the call.
Hints are different from explicit prompts in two ways. Rather than comprehensively listing all choices, they include only a few of the most useful or common options. Also, hints are designed to be removed. The Wildfire system keeps track, across sessions, of how many times a user has successfully responded to each individual prompt in the system. After a certain number of successful attempts, the hints are removed from that prompt, leaving only the implicit question that the user has already become accustomed to hearing. The previous dialogue would be much shorter for a frequent user of the system:
Comp: Hello. You’ve reached the office of Tony Lovell. Please say your
full name.
User: It’s me Wildfire.
Comp: What’s the pass code?
User: 4 5 4 5 1 2.
Comp: Oh, hi! You have 2 new messages. The first is from Rich Miner.
User: What’s it say?
Comp: (message is played)
User: Give them a call.
Comp: Rich Miner at work?
User: Yes.
Comp: Dialing…
This personalized approach provides a powerful means of offering help for commands that are infrequently used and tapering prompts that are frequently encountered.
_____________________________________________________________________________ …
Techniques Specific to Multimodal Systems
Thus far, the discussion has focused on speech-only systems. Although letting users know what they can say may not be as problematic in a multimodal application, the problem nonetheless exists.
In both consumer devices and desktop dictation systems, one strategy is what Francis Ganong from Kurzweil Applied Intelligence calls “say what you see.” In the Kurzweil Voice for Windows dictation product, the rule of thumb is that users can say any command that is visible in a menu or dialog box. In a voice-controlled car radio designed at Philips Research Laboratories , for example, users can speak the names of the labels on the dedicated buttons for “bass,” “treble,” or “volume.” Another technique is to use the display on the radio to give users clues about acceptable command words.
This may help suggest to users that they may speak names of other categories such as “pop” or “classical.”
A larger display, such as the one available in a dictation system, allows the designer to provide users with an explicit list of all the currently available command words and phrases. The Kurzweil Voice product shows a scrollable list of currently active words, organized hierarchically. This technique is most effective when the word list is relatively short.
User: (selects “Info” button to display properties for a selected hotel)
Comp: Display information about the Hilton.
In another example, the user might use the mouse or pen to filter a hotel database based on attributes such as “pool,” “pets,” or “fireplace.”
User: (clicks on “pool” and selects the filter option)
Comp: I only want hotels with a pool.
This technique of educating users with verbal feedback while they’re carrying out tasks has the advantage of providing concrete examples of acceptable utterances in a context that the user is likely to understand and remember.
_____________________________________________________________________________ …
Summing Up
Letting users know what to say in a speech interface is challenging, whether or not a graphical interface is present. Here’s a summary of the prompt design techniques in use in today’s speech-only systems that help users formulate acceptable speech input:
· Explicit prompts help to constrain user input when only a limited vocabulary is available.
· Implicit prompts incorporating discourse strategies help shape user input in conversational systems.
· Incremental or expanded prompts are fast when user input is well formed and provide help when it is not.
· Tapering helps instruct users the first time they encounter a specific prompt and then speeds interaction when the same activity is encountered again.
· Hints help guide inexperienced users and speed interaction as they gain experience.
And in multimodal systems:
· A simple convention, “say what you see,” can guide user input.
· The display can be used to suggest or explicitly indicate what users can say.
· Speech output corresponding to users’ actions can provide contextual examples of spoken input.
Although these techniques can be effective, there is no substitute for user experience. Either through training or by exposure to an interface, the more users interact with a system, the more likely they are to know what to say. Patti Price of SRI reports that “some of our experiments have shown that the second 10 minutes of interaction have about half the error rate of the first 10 minutes of interaction.” Not only do users learn the allowable vocabulary and grammar of a system, but they tend to adapt their manner of speaking in a way that the computer is more likely to understand. For example, users learn to not speak before the computer is ready to listen and they learn to not pause in the middle of a sentence (which causes the computer to process only what was said before the pause).
User experience coupled with thoughtful prompt design does not guarantee a successful speech interface. Other design issues play an equally important role. These include crafting grammars or command languages, providing appropriate feedback, and recovering from errors. In addition, a consistent “sound and feel” across speech applications can go a long way in helping users know what to say.
Web Sites with General Information about Speech Technology
Comp. Speech FAQ
http://www.speech.cs.cmu.edu/comp.speech/
Commercial Speech
http://www.tiac.net/users/rwilcox/speech.html
My Bycicle routes