For developers who need to build a custom experience for the Amazon Alexa voice assistant, the key is creating a Custom Alexa Skill. This is accomplished via the Alexa Skills Kit (ASK), a collection of APIs, tools, documentation, and code examples. We have already seen how to get started with creating skills, as well as the next step needed once you are registered. Let's move on to a deeper understanding of voice interaction.
What Is A Skill?
A skill runs in the cloud and gets called by the Alexa service after the user has made a verbal request to their Echo device. The skill takes a JSON payload as a request and returns a JSON response over a secure (HTTPS) Internet connection. The request contains information about the commands the user gave when invoking your skill. The response contains text which ends up being spoken to the user (via text-to-speech), as well as whether to stop listening to the user or to expect a further response from them.
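As a rough sketch, a skill's JSON response might look like the following. This is a trimmed-down example, not the complete payload; the spoken text goes in outputSpeech, and shouldEndSession is the flag that tells Alexa whether to stop listening or wait for a reply from the user:

```json
{
  "version": "1.0",
  "response": {
    "outputSpeech": {
      "type": "PlainText",
      "text": "I predict good fortune in your future."
    },
    "shouldEndSession": true
  }
}
```

Setting shouldEndSession to false would keep the session open and prompt the user for a follow-up.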
The Non-Code Part
The tricky part for those who are new to Skills development is that the developer must specify the voice interaction model in the Alexa Developer Portal. We've all seen code before, and any web developer is familiar with the technologies demonstrated in the ASK examples. It is this voice interaction information which is tricky for developers who have not (yet) worked on a voice-based application. So, let's look at this more closely.
What Are Voice Commands?
In a voice user interface we don't really speak of commands directly. This is because a command is usually just one way of initiating an action. In a graphical user interface, for example, we might have a button to start some action.
The challenge with voice is that there are often many ways to say something which all have the same intention. For example, if we were congratulating a loved one for something they said, we could say any of the following:
- "Good job!"
- "Right on, you did good."
- "Hey fantastic job."
- "Glad you got it! "
and on, and on. You get the idea. So, with a voice user interface, we don't speak of commands specifically, but Intentions. We would create an Intention, which is something the user says to Alexa, called "CongratulationsIntention" (it can be any string). It should then be able to respond not just to "Congratulations", but to as many ways as possible of saying something that means the same thing. We just group them all into a single Intention.
The first part of the voice interaction model is the Intent Schema. This is a JSON file which describes the intentions your skill will respond to. Most likely your skill will have at least one unique intention, but there are also default intentions like AMAZON.StopIntent and AMAZON.RepeatIntent. A very simple Intent Schema (usually named IntentSchema.json) would look something like this:
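A minimal sketch, using the intent names from this article alongside two of the built-in intents, might look like this:

```json
{
  "intents": [
    { "intent": "CongratulationsIntention" },
    { "intent": "AMAZON.StopIntent" },
    { "intent": "AMAZON.RepeatIntent" }
  ]
}
```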
Just an array of intents. For those who are following along with their advanced ideas for Skills, you will immediately wonder "What about when I want to query information from the user?". For example, what if my skill needs to know the color of something.
No problem: this is also specified in the intent file and is called a Slot. When you specify a slot, you must also specify all the values it could take. These are known as custom slot types. Amazon has been at this for a while, so there are already built-in slot types, including dates, durations, numbers, times, US cities, US states, people's names, and many more. In fact, Amazon greatly expanded this list recently, which should be a boon to all developers, as it is one less thing we have to come up with.
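As a sketch, an intent that takes slots might be declared like this. SetColorIntent, Color, and LIST_OF_COLORS are hypothetical names used here for illustration; AMAZON.DATE is one of the built-in slot types:

```json
{
  "intents": [
    {
      "intent": "SetColorIntent",
      "slots": [
        { "name": "Color", "type": "LIST_OF_COLORS" },
        { "name": "When", "type": "AMAZON.DATE" }
      ]
    }
  ]
}
```

The list of possible values for a custom type like LIST_OF_COLORS is entered separately in the developer portal, not in the schema itself.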
The second thing we need are the utterances. This is literally just a list of the individual phrases a user might speak to trigger an intention. For example:
GetFutureIntent tell me the future
GetFutureIntent what is happening
GetFutureIntent what is going to happen
GetFutureIntent what is the future like
GetFutureIntent predict the future
GetFutureIntent what do I have in store for me
GetFutureIntent tell me my future
GetFutureIntent what is the future
GetFutureIntent what is my future
You can see the name of the intent is first on each line, followed by an example phrase that might be used. You want to think of as many ways as possible, because this allows the Alexa natural language understanding system to respond to a larger number of phrases. When you specify custom slots, you do need to describe where those values would appear in the utterances, which is done with curly braces.
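For example, with a hypothetical SetColorIntent that has a slot named Color, the utterances might read:

```
SetColorIntent my favorite color is {Color}
SetColorIntent set the color to {Color}
SetColorIntent change it to {Color}
```

Alexa fills in the slot value from whatever the user actually said in that position and passes it along in the request to your skill.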
The default intentions don't require any utterances to be specified.
For those of you working on Smart Home Skills, you don't have to worry about any of this voice model, because Amazon has already defined all the voice data used when interacting with home automation. However, it doesn't hurt to understand how this works, even if you are working on Smart Home Skills.
Specifying Your Voice Interaction Model
All of this Voice Interaction Model is entered into the Amazon Developer Portal when creating or modifying your Skill. I wish there were a tool which could upload or update these values from a command line, but that is not yet available. The web interface works fine, but it leaves open the possibility of a mismatch between your source repository and what is entered into the portal: you probably now have two versions of the intent schema and utterances files, along with a file for each custom slot type.
It is easy to update a local copy of one of these files and forget to update the corresponding value in the developer portal, or vice versa, leaving one copy outdated.
Although Alexa Skills Kit development is new and sometimes confusing, I think the largest challenge is understanding the voice interaction model: this idea of intentions, slots, and utterances. If you can get these principles down, it will make it a lot easier when you jump into the code that needs to be written for your skill.