Human communication is full of subtle signals, such as facial expressions, body language, emphasis, and intonation. Pepper cannot imitate many of these subtleties: Pepper’s face is static, Pepper’s gestures are not as flexible as a human’s, and Pepper’s voice relies on text-to-speech software. Since Pepper doesn’t have a wide range of non-verbal expression, Pepper communicates best through speech.
WHY?
Spoken and written language are two distinct methods of communication: we do not speak the way we write. Unlike written communication, spoken language is informal and flexible; this is how Pepper speaks.
HOW?
Keep in mind that Pepper speaks just like us. Pepper’s speech uses everyday spoken expressions, including the small phrases and sounds that link ideas together (that is to say, logical connectors). It’s important to write the way we speak. Pepper shouldn’t read out text written for publication, such as online content.
EXAMPLE
Don’t:
“Hello, my name is Pepper, and I am a humanoid robot. I am fully equipped to be able to communicate with humankind. I am connected to the Internet. I have sensors and much more.”
Do:
“Hi! I’m Pepper. I’m designed to communicate with people, just like you!”
WHY?
Even when replying to unexpected responses, the user has to get an appropriate reaction from Pepper.
HOW?
For closed-ended questions with a “yes/no” response, please write an output for each case. Don’t hesitate to add a “u1” for answers such as “I don’t know,” “I don’t care,” and “As you want.” You can find concepts for these in the lexicon, or create your own if you don’t find what you need.
EXAMPLE
When a user must choose between three games, Pepper can understand at least “Game1,” “Game2,” and “Game3.” But it makes Pepper appear smarter if Pepper can also understand, “First one,” “Second one,” “Third one,” “I don’t know,” “As you want,” “I don’t want to play,” and similar phrases.
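The idea of accepting ordinals, "no preference" answers, and refusals alongside the game names can be sketched in Python. This is only an illustration, not how Pepper's dialog engine works: the phrase lists and the `interpret_choice` function are hypothetical, and a real application would rely on the concepts in its own lexicon.

```python
import re

# Hypothetical phrase lists for illustration only.
GAMES = ["Guess Animals", "Fun Quiz", "Music Boxes"]
ORDINALS = ["first", "second", "third"]
NO_PREFERENCE = ["i don't know", "i don't care", "as you want"]
DECLINE = ["i don't want to play"]


def normalize(text: str) -> str:
    """Lowercase, straighten apostrophes, drop punctuation, collapse spaces."""
    text = text.lower().replace("’", "'")
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    return " ".join(text.split())


def interpret_choice(utterance: str) -> str:
    """Return a game name, 'no_preference', 'decline', or 'unknown'."""
    said = normalize(utterance)
    # Exact game names come first.
    for game in GAMES:
        if normalize(game) in said:
            return game
    # "First one", "second one", ... map to the game at that position.
    for index, ordinal in enumerate(ORDINALS):
        if re.search(rf"\b{ordinal}\b", said):
            return GAMES[index]
    if any(phrase in said for phrase in NO_PREFERENCE):
        return "no_preference"
    if any(phrase in said for phrase in DECLINE):
        return "decline"
    return "unknown"
```

Each return value would get its own written output, so that Pepper has an appropriate reaction even for "unknown".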
WHY?
People can’t retain much information given in long sentences. Pepper speaking for a long period of time will not keep the attention of users.
HOW?
Short sentences are easier to understand. Get straight to the point and be explicit.
EXAMPLE
Don’t:
“I love games and I’m pretty sure that you love games too, so I propose that you play with me. Lucky you, I know a lot of games, you can choose from three games: Guess Animals, Fun Quiz, and Music Boxes. Which game do you want to play?”
Do:
“Let’s play a game together! You can choose between Guess Animals, Fun Quiz, and Music Boxes. Which one should we try?”
WHY?
Pepper’s goal is to be understood by as many people as possible. Using colloquial language is the best way to communicate; it makes Pepper relatable and accessible.
HOW?
There’s no need to use sophisticated language. It’s better to consistently use colloquial language. Avoid slang unless the client specifically requests it and it is compatible with the brand.
EXAMPLE
Don’t: “Hello, my friend. How do you do today?”
Do: “Hi! How are you doing?”
WHY?
Don’t use complicated or technical language that users may not understand. We want as many people as possible to easily understand Pepper.
HOW?
Using technical vocabulary may cause Pepper to lose a user’s attention. Opt for user-friendly language and dialogue.
EXAMPLE
For a verbal error notification regarding, for example, motor stiffness:
Don’t: “I need to remove the stiffness in my motors.”
Do: “Let me take a moment to rest.”
WHY?
When Pepper speaks for too long without a pause, users don’t keep up and may not retain the information.
HOW?
Use pauses throughout your text to help the user follow along. Pause values (pau=xxx) are expressed in milliseconds.
Tip: To know where a pause is needed, read your text out loud to detect where one seems most natural and where you need to breathe in.
EXAMPLE
Don’t: “My name is Pepper and I’m a humanoid robot. I’m 47 inches tall and I was created at the SoftBank Robotics lab in Paris.”
Do: “My name is Pepper. I’m a humanoid robot and I’m 47 inches tall. I was born at SoftBank Robotics, in Paris.”
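As a rough sketch of how pauses could be added to a text automatically, the Python function below inserts a pause after every sentence except the last. It assumes the text-to-speech engine accepts a pause tag written as \pau=N\ with N in milliseconds; check the tag syntax of your engine before relying on it. The function name `add_pauses` and the default of 300 ms are hypothetical choices.

```python
import re


def add_pauses(text: str, pause_ms: int = 300) -> str:
    """Insert a pause tag between sentences.

    Assumption: the TTS engine reads a tag of the form \\pau=N\\,
    where N is a duration in milliseconds.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    tag = f" \\pau={pause_ms}\\ "
    return tag.join(sentences)


if __name__ == "__main__":
    print(add_pauses("My name is Pepper. I’m a humanoid robot."))
```

A uniform pause after each sentence is a simplification. Reading the text out loud remains the better way to decide where a pause belongs and how long it should be.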
WHY?
Any mispronunciation in a sentence deteriorates the quality of the verbal interaction and makes it harder to understand Pepper.
HOW?
Check how TTS reads each word. If a word is not pronounced properly:
EXAMPLE
In English TTS, “NAO” is mispronounced. A skin is necessary to ensure that every time Pepper says NAO, it is properly pronounced as “now”:
s:({*} Nao {*}) ^replace(Nao, now, 1)
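The same substitution idea can be illustrated outside the dialog engine with a small Python sketch. It is not the skin mechanism itself: the `PRONUNCIATIONS` table and the `fix_pronunciation` function are hypothetical, and they simply rewrite a text before it is handed to text-to-speech.

```python
import re

# Hypothetical table: written form -> spelling that the TTS pronounces correctly.
PRONUNCIATIONS = {"NAO": "now"}


def fix_pronunciation(text: str) -> str:
    """Replace each listed word (whole words only, any case) with its spoken spelling."""
    for written, spoken in PRONUNCIATIONS.items():
        pattern = rf"\b{re.escape(written)}\b"
        text = re.sub(pattern, spoken, text, flags=re.IGNORECASE)
    return text


if __name__ == "__main__":
    print(fix_pronunciation("My friend Nao is a robot too."))
```

Matching whole words only keeps names such as "Naomi" untouched.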
WHY?
A grammatically “positive” question is worded so that the listener can respond “Yes” to indicate an affirmative answer. A grammatically “negative” question is worded so that the listener must respond “No” to affirm, and “Yes” to deny or reject. In other words, negative questions switch the “yes/no” response order of regular (i.e. positive) questions to a less intuitive “no/yes” order. Positive questions are efficient and less ambiguous.
HOW?
Formulate Pepper’s questions in a positive way, making plain to the user that saying “yes” or “no” will cause a consistent and predictable outcome. If an unambiguous formulation is difficult to think of, check that Pepper is clearly only asking for a single decision per question.
EXAMPLE
To confirm whether the user wishes to delete an application:
Don’t:
Are you sure you don’t want to keep this application? (negative: “yes” means “do NOT delete”; “no” could mean either “I’m NOT sure” OR “DO delete”)
Do:
Do you want to delete this application? (positive: “yes” means “DO delete”, and “no” means “do NOT delete”)
WHY?
Yes/no questions are easiest for both Pepper and the users: the number of possible answers is small (two), and the binary nature of the choice is obvious. For questions with a larger number of possible answers, however, users may get lost in the interaction if they are not sure of the scope in which they can act. A clear explanation of possible answers to a given question can improve the transparency of the interaction and reduce vocal interaction failures.
HOW?
In field observations, users match the syntactic structure of their answers to that of Pepper’s questions: that is, the user will often echo Pepper’s wording with their own. Pepper can thus teach the user how to answer by using the specific wording and syntactic structure of the possible responses in the question. This helps the users understand how to speak with Pepper.
A versatile and natural way to express a request is with “verb-object” structure: “do X (to) Y”. This construction allows us to use the same wording to describe an action from the user’s point of view and the robot’s without much substitution.
EXAMPLE
Don’t:
Pepper: “{Do you} want to listen to music, play a game, or notify someone that you are here?”
User: “{I} want to notify someone that I am here.”
Do:
Pepper: “{Do you} want to listen to music, play a game, or notify someone?”
User: “{I} want to listen to music.”
WHY?
Pepper can teach users its vocal commands by saying them, displaying them on the tablet, or both. Because the tablet is used to display various types of information, the vocal commands should be easy to distinguish visually. Quotes explicitly help users understand whether something is sayable or not.
HOW?
Enclose every vocal command within quotes. To highlight the commands further, add a visual indicator showing that they can be spoken.
EXAMPLE
Don’t display on the tablet:
You can say: play a game.
Do display the vocal command with quotes:
“Play a game”.
WHY?
Vocal commands that follow a predictable and consistent grammatical pattern are easier for users to remember and use.
HOW?
Standardize the syntactic structures for each vocal command and their variants on the same level: a sequence of “verb + noun” (“Take photo”), for example, or “adjective + noun” (“Funny face”).
EXAMPLE
When defining the vocal commands in a menu:
WHY?
When Pepper verbally lists more than three items, it is hard for users to remember what each thing was by the time of their next turn.
HOW?
For a list of two or three items, use a simple syntactic form, “Do you…, …, … or…?”, and insert the choices in the same order as they are displayed on the tablet.
For a longer list, it is best to write a more open-ended question, according to the context: “Which game do you want to play?”, or “What shop are you looking for?”. Pepper should not enumerate more than three items at a time, to avoid taxing short-term memory. This three-item constraint should not be avoided by simply splitting a longer list of options into sequential chunks of 3: users naturally assume that they can only ask about the things Pepper mentioned in the current turn. Presenting further options for the same event in a subsequent turn will cause confusion.
EXAMPLE
For a short list of 2 or 3 possible lobby activities, like “playing a game”, “listening to music”, or “notifying someone”:
For a longer list, like you can find in a Store Locator App in a mall:
Don’t:
“In the mall, I can locate for you Adidas, Aesop, Allsaints, Apple (+154 other stores)…. What store do you want me to locate?”
Do:
“What store do you want me to locate?” and display the list on the tablet.
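The rule of enumerating at most three items, and switching to an open-ended question beyond that, can be sketched as a small Python helper. The function name `build_question` and its parameters are hypothetical; the options are assumed to be verb phrases that fit after "Do you want to".

```python
from typing import Sequence

MAX_SPOKEN_ITEMS = 3  # limit taken from the guideline above


def build_question(options: Sequence[str], open_question: str) -> str:
    """Enumerate two or three options; otherwise ask the open-ended question."""
    if 2 <= len(options) <= MAX_SPOKEN_ITEMS:
        if len(options) == 2:
            listed = f"{options[0]} or {options[1]}"
        else:
            listed = f"{options[0]}, {options[1]}, or {options[2]}"
        return f"Do you want to {listed}?"
    # Longer lists are not spoken; they belong on the tablet instead.
    return open_question


if __name__ == "__main__":
    lobby = ["listen to music", "play a game", "notify someone"]
    print(build_question(lobby, "What do you want to do?"))
```

The options should be passed in the same order as they appear on the tablet, so that speech and display stay aligned.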
WHY?
To ensure good recognition of each vocal command, avoid commands that are phonetically close to each other (i.e. that have many overlapping or shared sounds). Every command needs to have the same chance of being triggered by the users.
If commands are insufficiently distinct, Pepper could favor one command and make it difficult to access the others.
HOW?
To find out if commands are phonetically distinct enough, say them out loud: if you notice that the commands have many sounds in common, they are most likely too close. In that case, find synonyms to express the same thing, or rephrase the voice command as a sentence starting with a verb.
EXAMPLE
When defining the vocal commands in Pepper’s questions:
Don’t:
“Are you a male or a female?”
“Male” and “female” sound too similar for Pepper to tell the users’ answers apart reliably.
Do:
“Are you a boy or a girl?”
Boy and girl are phonetically different enough to be distinguishable by Pepper in users’ answers.
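Saying the commands out loud remains the reference test, but a quick automated screen can catch obvious candidates. The Python sketch below compares spellings with the standard library's `difflib.SequenceMatcher`. Spelling similarity is only a crude stand-in for phonetic similarity, and the 0.6 threshold is a hypothetical value that would need tuning against real recognition tests.

```python
from difflib import SequenceMatcher
from itertools import combinations

SIMILARITY_THRESHOLD = 0.6  # hypothetical cut-off, to be tuned on the robot


def similarity(a: str, b: str) -> float:
    """Spelling similarity between 0.0 (nothing shared) and 1.0 (identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def confusable_pairs(commands):
    """Return the pairs of commands whose spellings look too close."""
    return [
        (a, b)
        for a, b in combinations(commands, 2)
        if similarity(a, b) >= SIMILARITY_THRESHOLD
    ]


if __name__ == "__main__":
    print(confusable_pairs(["male", "female"]))
    print(confusable_pairs(["boy", "girl"]))
```

Any pair this flags is worth reading aloud and, if needed, replacing with a synonym or a sentence starting with a verb.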
WHY?
Sometimes users respond to instructions from Pepper without listening to what the outcome will be. Pepper should make an action’s goal clear before encouraging a user to take the action.
HOW?
Explain what will happen when an action is taken before encouraging the user to take the action. This helps users understand what an action will do before they take it. It also means that users hear the vocal command to say immediately before their turn to speak.
EXAMPLE
Don’t:
Press the start button to begin
Do:
To begin, press the start button.
To begin, tell me “Let’s start”.
WHY?
It can be difficult for users to know what to say and how to formulate a request to a robot. If the vocal command begins with a verb in command (imperative) form, it’s easiest for the users to understand what exactly to request and to expect from the robot because the verb represents the action.
HOW?
Pick the verb which is the most representative of what the app is doing.
It’s better when the verb represents the action from the user side and not from the app or robot side.
EXAMPLE
When defining the vocal commands the users will use:
WHY?
The tablet is very useful for laying out possible user actions by displaying buttons or vocal commands. However, because it’s Pepper’s tablet and the command will be said by the user, the presence of pronouns like “you”, “I”, “your”, and “mine” can cause cognitive dissonance. The burden is then on the user to resolve whether the displayed pronoun refers to themself, or to Pepper.
HOW?
Avoid using personal pronouns in the vocal commands and in their display on the tablet. Try to find a different way to express the same thing without a personal pronoun.
Remember that Pepper can still understand pronouns in the variants if the users feel more comfortable expressing their intent that way.
EXAMPLE
When defining the vocal commands suggested on the tablet:
Don’t display:
“Play with Pepper”, “Play with me”, “Play with you”
Do display:
“Let’s play together”
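A simple check along these lines can flag displayed commands that contain a personal pronoun. This Python sketch is only illustrative: the pronoun list is a hypothetical English-only starting point, and `pronouns_in` is not part of any Pepper tool.

```python
import re

# Hypothetical, English-only list of pronouns to avoid in displayed commands.
PERSONAL_PRONOUNS = {"i", "me", "my", "mine", "you", "your", "yours"}


def pronouns_in(command: str) -> list:
    """Return the personal pronouns found in a displayed vocal command."""
    words = re.findall(r"[a-z]+", command.lower())
    return [word for word in words if word in PERSONAL_PRONOUNS]


if __name__ == "__main__":
    for command in ["Play with me", "Play with you", "Let’s play together"]:
        print(command, pronouns_in(command))
```

The check applies to what is displayed and suggested. Variants that Pepper merely understands can still contain pronouns.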
WHY?
Short vocal commands (comprising just one or two syllables) are not long enough to be recognized reliably. Pepper may detect such a command too often, including when the user did not say it, which lowers accuracy. For best speech recognition performance, it is wise to have at least three syllables for each vocal command and its variants.
HOW?
Avoid using a single word or a short word as a command. A simple way to increase the strength of the vocal commands is to add a verb or an adjective to your keyword.
Don’t forget to have consistent wording in tablet content, Pepper’s speech, and vocal commands!
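To screen a list of commands for the three-syllable minimum, a rough syllable estimate is often enough. The Python sketch below counts groups of vowels per word, which is a heuristic for English only and will miscount some words. The names `estimate_syllables` and `is_long_enough` are hypothetical.

```python
import re

MIN_SYLLABLES = 3  # minimum recommended in the guideline above


def estimate_syllables(phrase: str) -> int:
    """Rough English estimate: count groups of consecutive vowels in each word."""
    cleaned = phrase.lower().replace("'", "").replace("’", "")
    total = 0
    for word in re.findall(r"[a-z]+", cleaned):
        count = len(re.findall(r"[aeiouy]+", word))
        # A trailing silent "e" (as in "take") usually adds no syllable.
        if word.endswith("e") and not word.endswith("le") and count > 1:
            count -= 1
        total += max(count, 1)
    return total


def is_long_enough(command: str) -> bool:
    """True when the command reaches the recommended minimum length."""
    return estimate_syllables(command) >= MIN_SYLLABLES


if __name__ == "__main__":
    for command in ["Stop", "Photo", "Take photo", "Play a game"]:
        print(command, estimate_syllables(command), is_long_enough(command))
```

Commands that fall short can be strengthened by adding a verb or an adjective, for example turning "Photo" into "Take photo".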
EXAMPLE
When defining the vocal commands the users will use:
WHY?
Pepper can only understand the language that it is currently speaking, so it’s really difficult or impossible for Pepper to understand a vocal command in a different language than the one expected.
HOW?
Translate or find synonyms in the expected language. If the vocal command is technically a foreign word or expression, but often used in the expected language, and well-recognized when you test it, it is likely OK to keep.
EXAMPLE
When defining the vocal commands the users will use in an English conversation:
WHY?
Communicating with Pepper is easiest and most satisfying when the user can use conversational and intuitive triggers based on human-human interactions. If you have to explain a voice command, something’s wrong: it means the command has to be rethought and redefined.
HOW?
Instead of listing commands for possible actions, ask a simple and clear question to clarify that it’s the user’s turn to speak.
EXAMPLE
Don’t:
“To save your work at any time, say save and finish, but if you want to continue editing say save and continue”
Do:
Human: “Pepper I’m done.”
Pepper: “Do you want to save that?”
Human: “Sure”
Pepper: “Got it. Do you want to exit the application now?”
Human: “Yes please”