Pepper communicates mainly by voice¶

Human communication is full of subtle signals, such as facial expressions, body language, emphasis, and intonation. Pepper cannot imitate many of these subtleties: Pepper’s face is static, Pepper’s gestures are not as flexible as a human’s, and Pepper’s voice relies on text-to-speech software. Since Pepper doesn’t have a wide range of non-verbal expression, the better way for Pepper to communicate is through speech.

3A - Pepper uses natural language¶

Pepper uses spoken language, not written language¶

WHY?

Spoken and written language are two distinct modes of communication – we do not speak the way we write. Unlike written language, spoken language is informal and flexible; this is how Pepper speaks.

HOW?

Keep in mind that Pepper speaks just like us. Pepper’s speech includes everyday spoken expressions, and even filler phrases or sounds that act as logical connectors. It’s important to write the way we speak. Pepper shouldn’t simply read out text written for publication, such as online content.

EXAMPLE

Don’t:

“Hello, my name is Pepper, and I am a humanoid robot. I am fully equipped to be able to communicate with humankind. I am connected to the Internet. I have sensors and much more.”

Do:

“Hi! I’m Pepper. I’m designed to communicate with people, just like you!”

Pepper has to understand more answers than Pepper suggests¶

WHY?

Even when the user gives an unexpected response, Pepper has to react appropriately.

HOW?

For closed-ended questions with a “yes/no” response, write an output for each case. Don’t hesitate to add a “u1” rule for answers such as “I don’t know,” “I don’t care,” and “As you want.” You can find concepts for these in the lexicon, or create your own if you don’t find what you need.

EXAMPLE

When a user must choose between three games, Pepper can understand at least “Game1,” “Game2,” and “Game3.” But it makes Pepper appear smarter if Pepper can also understand, “First one,” “Second one,” “Third one,” “I don’t know,” “As you want,” “I don’t want to play,” and similar phrases.
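As a rough illustration of accepting more answers than Pepper suggests, the sketch below maps several spoken variants onto one intent. It is plain Python, not QiChat, and the intent names, variant lists, and function names are all hypothetical; on the robot this would normally be handled with dialog rules and lexicon concepts.

```python
import re

# Hypothetical intents for a three-game menu. The variant lists are
# illustrative only and are not taken from any Pepper lexicon.
INTENT_VARIANTS = {
    "game1": ["game one", "first one", "the first"],
    "game2": ["game two", "second one", "the second"],
    "game3": ["game three", "third one", "the third"],
    "undecided": ["i don't know", "i don't care", "as you want"],
    "decline": ["i don't want to play", "no thanks"],
}


def normalize(utterance):
    """Lowercase the text, unify apostrophes, and drop punctuation."""
    text = utterance.lower().replace("’", "'")
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    return " ".join(text.split())


def match_intent(utterance):
    """Return the first intent whose variant appears in the utterance, else None."""
    text = normalize(utterance)
    for intent, variants in INTENT_VARIANTS.items():
        if any(variant in text for variant in variants):
            return intent
    return None
```

For example, `match_intent("Hmm, the second one please!")` returns `"game2"`, while an answer that matches nothing returns `None`, which is the point where Pepper would need a fallback reply.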

3B - Pepper’s language is easy to understand¶

Opt for short sentences, which may be easier to understand¶

WHY?

People can’t retain much information given in long sentences. Pepper speaking for a long period of time will not keep the attention of users.

HOW?

Short sentences are easier to understand. Get straight to the point and be explicit.

EXAMPLE

Don’t:

“I love games and I’m pretty sure that you love games too, so I propose that you play with me. Lucky you, I know a lot of games, you can choose from three games: Guess Animals, Fun Quiz, and Music Boxes. Which game do you want to play?”

Do:

“Let’s play a game together! You can choose between Guess Animals, Fun Quiz, and Music Boxes. Which one should we try?”

Use colloquial language¶

WHY?

Pepper’s goal is to be understood by as many people as possible. Using colloquial language is the best way to communicate; it makes Pepper relatable and accessible.

HOW?

There’s no need to use sophisticated language. It’s better to use colloquial language consistently. Avoid slang unless the client specifically requests it and it fits the brand.

EXAMPLE

Don’t: “Hello, my friend. How do you do today?”

Do: “Hi! How are you doing?”

Adapt the vocabulary to the audience¶

WHY?

Don’t use complicated or technical language that users may not understand. We want as many people as possible to easily understand Pepper.

HOW?

Using technical vocabulary may cause Pepper to lose a user’s attention. Opt for user-friendly language and dialogue.

EXAMPLE

For a verbal error notification about, for example, motor stiffness:

Don’t: “I need to remove the stiffness in my motors.”

Do: “Let me take a moment to rest.”

Pepper is easier to understand when you add pauses in Pepper’s speech¶

WHY?

When Pepper speaks for too long without a pause, users don’t keep up and may not retain the information.

HOW?

Use pauses throughout your text; they help the user follow what Pepper says. Insert a pause with the pau=xxx tag, where xxx is the duration in milliseconds.

Tip: To know where a pause is needed, read your text out loud to detect where one seems most natural and where you need to breathe in.

EXAMPLE

Don’t: “My name is Pepper and I’m a humanoid robot. I’m 47 inches tall and I was created at the SoftBank Robotics lab in Paris.”

Do: “My name is Pepper. I’m a humanoid robot and I’m 47 inches tall. I was born at SoftBank Robotics, in Paris.”
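A minimal sketch of adding pauses automatically, assuming the pause tag is written between backslashes as `\pau=xxx\` (the text above only shows `pau=xxx`, so the delimiters are an assumption). The function name `add_pauses` and the default of 300 ms are hypothetical choices for illustration.

```python
import re


def add_pauses(text, pause_ms=300):
    """Insert a pause tag between sentences (not after the last one)."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Assumed tag syntax: backslash-delimited, duration in milliseconds.
    tag = "\\pau={}\\".format(pause_ms)
    return (" " + tag + " ").join(sentences)
```

Reading the result out loud is still the best check: a fixed pause after every sentence is only a starting point, and some pauses will need to be longer or removed.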

Pepper must pronounce every word in each sentence correctly¶

WHY?

Any mispronunciation in a sentence deteriorates the quality of the verbal interaction and makes it harder to understand Pepper.

HOW?

Check how TTS reads each word. If a word is not pronounced properly:

Rewrite it another way. An abnormal spelling of the word may be necessary.
Use skins. A skin allows you to keep changes over time.
Replace the word with a synonym.

EXAMPLE

In English TTS, “NAO” is mispronounced. A skin is necessary to ensure that every time Pepper says “NAO”, it is pronounced properly, as “now”:

s:({*} Nao {*}) ^replace(Nao, now, 1)
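The same respelling idea can be sketched outside the dialog engine, for instance to preview text before sending it to TTS. This is plain Python rather than a skin, and the table `RESPELLINGS` and the function `apply_respellings` are hypothetical names; only the NAO to “now” pair comes from the example above.

```python
import re

# Hypothetical respelling table: word as written -> spelling the TTS reads well.
RESPELLINGS = {"NAO": "now"}


def apply_respellings(text, respellings=RESPELLINGS):
    """Replace whole words found in the table, ignoring case."""
    lookup = {word.lower(): spoken for word, spoken in respellings.items()}
    if not lookup:
        return text
    # \b keeps the match on whole words, so "nanotech" is left alone.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in lookup) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda match: lookup[match.group(1).lower()], text)
```

Keeping the substitutions in one table serves the same purpose as a skin: the fix is recorded once and applied every time the word appears.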

Pepper asks positive questions, rather than negative or double-negative questions ¶

WHY?

A grammatically “positive” question is worded so that the listener can respond “Yes” to indicate an affirmative answer. A grammatically “negative” question is worded so that the listener must respond “No” to affirm, and “Yes” to deny or reject. In other words, negative questions switch the “yes/no” response order of regular (i.e. positive) questions to a less intuitive “no/yes” order. Positive questions are efficient and less ambiguous.

HOW?

Formulate Pepper’s questions in a positive way, making plain to the user that saying “yes” or “no” will cause a consistent and predictable outcome. If an unambiguous formulation is difficult to think of, check that Pepper is clearly only asking for a single decision per question.

EXAMPLE

To confirm whether the user wishes to delete an application:

Don’t:

Are you sure you don’t want to keep this application? (negative: “yes” means “do NOT delete”; “no” could mean either “I’m NOT sure” OR “DO delete”)

Do:

Do you want to delete this application? (positive: “yes” means “DO delete”, and “no” means “do NOT delete”)

3C - Pepper manages the vocal commands in the interaction¶

Pepper explicitly lays out possible answers ¶

WHY?

Yes/No questions are easiest for both Pepper and the users: the number of possible answers is small (two), and the binary nature of the choice is obvious. For questions with a larger number of possible answers, however, users may get lost in the interaction if they are not sure of the scope in which they can act. A clear explanation of possible answers to a given question can improve the transparency of the interaction and reduce vocal interaction failures.

HOW?

In field observations, users match the syntactic structure of their answers to that of Pepper’s questions: that is, the user will often echo Pepper’s wording in their own answer. Pepper can thus teach the user how to answer by phrasing the question with the specific wording and syntactic structure of the possible responses. This helps users understand how to speak with Pepper.

A versatile and natural way to express a request is with “verb-object” structure: “do X (to) Y”. This construction allows us to use the same wording to describe an action from the user’s point of view and the robot’s without much substitution.

EXAMPLE

Don’t:

Pepper: “{Do you} want to listen to music, play a game, or notify someone that you are here?”

User: “{I} want to notify someone that I am here.”

Do:

Pepper: “{Do you} want to listen to music, play a game, or notify someone?”

User: “{I} want to listen to music.”

Use quotes to display vocal commands on the tablet ¶

WHY?

Pepper can teach users its vocal commands by saying them, displaying them on the tablet, or both. Because the tablet is used to display various types of information, the vocal commands should be easy to distinguish visually. Quotes explicitly signal to users whether something can be said aloud.

HOW?

Enclose every vocal command within quotes. To further graphically highlight the commands, have some visual indicator of say-ability.

EXAMPLE

Don’t display on the tablet:

You can say: play a game.

Do display the vocal command with quotes:

“Play a game”.

Standardize the way Pepper says and displays the vocal commands ¶

WHY?

Vocal commands that follow a predictable and consistent grammatical pattern are easier for users to remember and use.

HOW?

Standardize the syntactic structures for each vocal command and their variants on the same level: a sequence of “verb + noun” (“Take photo”), for example, or “adjective + noun” (“Funny face”).

EXAMPLE

When defining the vocal commands in a menu:

Don’t: use heterogeneous formulations: “Play a game” / “The story of the 7 dwarfs” / “Arcadia Dance”
Do: use verb + noun, for example: “Play a game” / “Tell a story” / “Do a dance”

Pepper verbally offers the user no more than three choices at a time ¶

WHY?

When Pepper verbally lists more than three items, it is hard for users to remember what each thing was by the time of their next turn.

HOW?

For a list of two or three items, use a simple syntactic form, “Do you…, …, … or…?”, and insert the choices in the same order as they are displayed on the tablet.

For a longer list, it is best to write a more open-ended question, according to the context: “Which game do you want to play?”, or “What shop are you looking for?”. Pepper should not enumerate more than three items at a time, to avoid taxing short-term memory. Do not work around this three-item limit by splitting a longer list of options into sequential chunks of three: users naturally assume that they can only ask about the things Pepper mentioned in the current turn, so presenting further options for the same choice in a subsequent turn will cause confusion.

EXAMPLE

For a short list of 2 or 3 possible lobby activities, like “playing a game”, “listening to music”, or “notifying someone”:

Do: “Do you want to play a game, to listen to some music or to notify someone?”

For a longer list, such as the one in a store locator app in a mall:

Don’t:

“In the mall, I can locate for you Adidas, Aesop, Allsaints, Apple (+154 other stores)…. What store do you want me to locate?”

Do:

“What store do you want me to locate?” and display the list on the tablet.
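The three-item rule can be sketched as a small helper that enumerates short lists and falls back to an open-ended question for longer ones. The names `MAX_SPOKEN_CHOICES` and `build_question` are hypothetical, and the “Do you want to … or …?” template is only one possible wording.

```python
MAX_SPOKEN_CHOICES = 3  # limit taken from the guideline above


def build_question(choices, open_question):
    """Enumerate two or three choices aloud; otherwise ask the open question.

    `choices` should be in the same order as they appear on the tablet.
    """
    if len(choices) < 2 or len(choices) > MAX_SPOKEN_CHOICES:
        # Too many (or too few) items to list: rely on the tablet display.
        return open_question
    head = ", ".join(choices[:-1])
    return "Do you want to {} or {}?".format(head, choices[-1])
```

With the three lobby activities, this yields a single enumerated question; with a mall's full store list, it returns only the open question, leaving the list itself to the tablet.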

Pepper’s vocal commands and their variants must be easily phonetically distinguishable ¶

WHY?

To ensure good recognition of each vocal command, avoid commands that are phonetically close to each other (i.e. that have many overlapping or shared sounds). Every command needs to have the same chance of being triggered by users.

If commands are insufficiently distinct, Pepper could favor one command and make it difficult to access the others.

HOW?

To find out if commands are phonetically distinct enough, say them out loud: if you notice that the commands have many sounds in common, they are most likely too close. In that case, find synonyms to express the same thing, or rephrase the voice command as a sentence starting with a verb.

EXAMPLE

When defining the vocal commands in Pepper’s questions:

Don’t:

“Are you a male or a female?”

“Male” and “female” sound too similar for Pepper to tell users’ answers apart reliably.

Do:

“Are you a boy or a girl?”

Boy and girl are phonetically different enough to be distinguishable by Pepper in users’ answers.
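As a rough first pass before saying the commands out loud, the sketch below flags pairs of commands whose spellings overlap heavily. Spelling is only a crude stand-in for sound, the 0.7 threshold is an arbitrary assumption, and the function name `too_similar_pairs` is hypothetical; it does not replace testing on the robot.

```python
from difflib import SequenceMatcher
from itertools import combinations


def too_similar_pairs(commands, threshold=0.7):
    """Return (a, b, ratio) for command pairs with highly overlapping spelling."""
    flagged = []
    for a, b in combinations(commands, 2):
        # ratio() is 1.0 for identical strings and 0.0 for no shared characters.
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            flagged.append((a, b, round(ratio, 2)))
    return flagged
```

Here `too_similar_pairs(["male", "female"])` flags the pair with a ratio of 0.8, while `too_similar_pairs(["boy", "girl"])` returns an empty list.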

Pepper presents a command after explaining its purpose, not before it ¶

WHY?

Sometimes users respond to instructions from Pepper without listening to what the outcome will be. Pepper should make an action’s goal clear before encouraging a user to take the action.

HOW?

Explain what will happen when an action is taken before encouraging the user to take the action. This helps users understand what an action will do before they take it. It also means users hear the vocal command immediately before their turn to speak.

EXAMPLE

Don’t:

Press the start button to begin.

Do:

To begin, press the start button.

To begin, tell me “Let’s start”.

Whenever possible, Pepper’s vocal commands begin with an imperative verb and are user-oriented ¶

WHY?

It can be difficult for users to know what to say and how to formulate a request to a robot. If the vocal command begins with a verb in command (imperative) form, it is easier for users to understand exactly what to request and what to expect from the robot, because the verb represents the action.

HOW?

Pick the verb which is the most representative of what the app is doing.

It’s better when the verb represents the action from the user side and not from the app or robot side.

EXAMPLE

When defining the vocal commands the users will use:

Don’t opt for: “Open news application.”
Do opt for: “Tell me the news”, “Play a game”, “Set a timer”, “Go to sleep”, “Speak louder”, etc.

As much as possible, avoid using personal pronouns on the tablet and in the vocal commands ¶

WHY?

The tablet is very useful for laying out possible user actions by displaying buttons or vocal commands. However, because it’s Pepper’s tablet and the command will be said by the user, the presence of pronouns like “you”, “I”, “your”, and “mine” can cause cognitive dissonance. The burden is then on the user to resolve whether the displayed pronoun refers to themself, or to Pepper.

HOW?

Avoid using personal pronouns in the vocal commands and on their display on the tablet. Try to find a different way to express the same thing without a personal pronoun.

Remember that Pepper can still understand pronouns in the variants if the users feel more comfortable expressing their intent that way.

EXAMPLE

When defining the vocal commands suggested on the tablet:

Don’t display:

“Play with Pepper”, “Play with me”, “Play with you”

Do display:

“Let’s play together”

Whenever possible, each of Pepper’s vocal commands and variants should have at least three syllables ¶

WHY?

Short vocal commands (comprising just one or two syllables) are not long enough to be recognized reliably. Pepper may detect such a command too often, including when the user did not say it, which lowers accuracy. For best speech recognition performance, it is wise to have at least three syllables for each vocal command and its variants.

HOW?

Avoid using a single word or a short word as a command. A simple way to increase the strength of the vocal commands is to add a verb or an adjective to your keyword.

Don’t forget to have consistent wording in tablet content, Pepper’s speech, and vocal commands!

EXAMPLE

When defining the vocal commands the users will use:

Don’t: “Game”
Do: “Play a game”
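A quick way to screen a command list for the three-syllable guideline is a syllable estimate. The sketch below uses a simple English-only vowel-group heuristic, so its counts are approximations; the function names are hypothetical.

```python
import re


def count_syllables(word):
    """Estimate syllables in one English word by counting vowel groups."""
    word = re.sub(r"[^a-z]", "", word.lower())
    if not word:
        return 0
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a final "e" as silent, except in "-le" endings such as "table".
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def command_syllables(command):
    """Sum the estimated syllables of every word in a command."""
    return sum(count_syllables(word) for word in command.split())


def is_long_enough(command, minimum=3):
    """Check a command against the three-syllable guideline."""
    return command_syllables(command) >= minimum
```

With this heuristic, “Game” is estimated at one syllable and fails the check, while “Play a game” is estimated at three and passes.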

Pepper’s vocal commands should not be written in any language other than the expected one ¶

WHY?

Pepper can only understand the language that it is currently speaking, so it’s really difficult or impossible for Pepper to understand a vocal command in a different language than the one expected.

HOW?

Translate or find synonyms in the expected language. If the vocal command is technically a foreign word or expression, but often used in the expected language, and well-recognized when you test it, it is likely OK to keep.

EXAMPLE

When defining the vocal commands the users will use in an English conversation:

Don’t opt for: “I have a rendez-vous”
Do opt for: “I have a date”

Don’t “over-teach” voice interaction: speaking should be intuitive, and as natural as a conversation between two humans can be ¶

WHY?

Communicating with Pepper is easiest and most satisfying when the user can use conversational and intuitive triggers based on human-to-human interactions. If you have to explain a voice command, something’s wrong: it means the command has to be rethought and redefined.

HOW?

Instead of listing commands for possible actions, ask a simple and clear question to clarify that it’s the user’s turn to speak.

EXAMPLE

Don’t:

“To save your work at any time, say ‘save and finish’, but if you want to continue editing, say ‘save and continue’.”

Do:

Human: “Pepper I’m done.”

Pepper: “Do you want to save that?”

Human: “Sure”

Pepper: “Got it. Do you want to exit the application now?”

Human: “Yes please”

Take-Aways¶

3A - Pepper uses natural language¶

Pepper uses spoken language, not written language.
Pepper has to understand more answers than Pepper suggests.

3B - Pepper’s language is easy to understand¶

Opt for short sentences, which may be easier to understand.
Use colloquial language.
Adapt the vocabulary to the audience.
Pepper is easier to understand when you add pauses in Pepper’s speech.
Pepper must pronounce every word in each sentence correctly.
Pepper asks positive questions, rather than negative or double-negative questions.

3C - Pepper manages the vocal commands in the interaction¶

Pepper explicitly lays out possible answers.
Use quotes to display vocal commands on the tablet.
Standardize the way Pepper says and displays the vocal commands.
Pepper verbally offers the user no more than three choices at a time.
Pepper’s vocal commands and their variants must be easily phonetically distinguishable.
Pepper presents a command after explaining its purpose, not before it.
Whenever possible, Pepper’s vocal commands begin with an imperative verb and are user-oriented.
As much as possible, avoid using personal pronouns on the tablet and in the vocal commands.
Whenever possible, each of Pepper’s vocal commands and variants should have at least three syllables.
Pepper’s vocal commands should not be written in any language other than the expected one.
Don’t “over-teach” voice interaction: speaking should be intuitive, and as natural as a conversation between two humans can be.