A massive amount of amazing technology powers voice assistants like Alexa, Siri, Cortana and Google Home. These devices combine technologies that have each taken years, sometimes decades, to improve. There are also a lot of three-letter acronyms to ‘simplify’ the names of these technologies. While an acronym makes a technology easier to talk about, it can be hard to follow for those who have not worked with it much.
This list serves as a dictionary of the technologies that power voice computing and their related acronyms.
- ASK - Alexa Skills Kit, the software development kit for Amazon Alexa. It is the collection of programming interfaces, tools (in the form of the developer portal), documentation and code examples for creating Alexa Skills.
- ASR - Automatic Speech Recognition, converts words spoken by the user, captured as digital audio, into text strings that are more easily usable by your application.
- AVS - Alexa Voice Service, the web service used to give Alexa capabilities to third-party devices. It lets you voice-enable your product with Alexa.
- AWS - Amazon Web Services, Amazon’s secure and scalable cloud environment. Amazon’s examples all demonstrate how to add Alexa capability using AWS.
- ML - Machine Learning, allows a computer to develop new knowledge by learning from very large sets of data, incrementally learning from each piece of data as well as from how the data relates as a whole.
- NLU - Natural Language Understanding, the ability to make sense of what a voice command really means, along with the parts of speech it contains. The first work in NLU was done in 1964. Services like Amazon Lex or Microsoft LUIS take a phrase of speech and return the commands (or intents) from the text.
- SSML - Speech Synthesis Markup Language, an XML-based markup language (similar in style to HTML) which lets you define precisely how to pronounce text as well as embed digital audio. This is used for text-to-speech situations, so that your application has better control over pronunciation. For example, tomato can be pronounced 'to-MAY-to' or 'to-MAH-to', and you can control that with SSML. You can also add pauses to the voice output this way. Although SSML is a standard, note that there are different versions in use, and they are not fully compatible.
- TTS - Text to Speech, converts the text strings a computer would use into speech in the form of digital audio. Services like Amazon Polly take a string of text and convert it into speech in the form of an audio file such as an MP3. At the time of writing, you have a choice of 47 different voices and 24 different spoken languages (English, Spanish, German, etc.) for your resulting audio output. Operating systems like iOS, macOS and Windows also provide libraries that have text-to-speech capability.
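To make the SSML entry above a little more concrete, here is a minimal Python sketch that builds a small SSML document with two pronunciations of "tomato" and a pause between them. The helper names (`phoneme`, `pause`, `speak`) are made up for this example, and the IPA transcriptions are only illustrative. The `speak`, `phoneme` and `break` elements come from the W3C SSML specification, but a given text-to-speech service may support only a subset of them, so treat this as a sketch rather than something guaranteed to work with every engine.

```python
from xml.etree import ElementTree
from xml.sax.saxutils import escape, quoteattr


def phoneme(word: str, ipa: str) -> str:
    """Wrap a word in a <phoneme> element with an explicit IPA pronunciation."""
    return f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(word)}</phoneme>'


def pause(milliseconds: int) -> str:
    """Return a <break> element that pauses the voice output."""
    return f'<break time="{milliseconds}ms"/>'


def speak(*parts: str) -> str:
    """Join already-built SSML fragments inside the root <speak> element."""
    return "<speak>" + "".join(parts) + "</speak>"


# Hypothetical example sentence; the IPA strings are illustrative assumptions.
ssml = speak(
    escape("You say "),
    phoneme("tomato", "təˈmeɪtoʊ"),
    pause(500),
    escape(" and I say "),
    phoneme("tomato", "təˈmɑːtəʊ"),
    escape("."),
)

# Parsing only confirms the markup is well-formed XML,
# not that a particular speech engine will accept it.
root = ElementTree.fromstring(ssml)
print(ssml)
```

The resulting string is what you would hand to a text-to-speech service in place of plain text, after checking which SSML tags that service actually supports.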
Because these are all based on voice, many of these technologies are related to each other. Most importantly, today's voice assistants use many of them together to form something even better.
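As a rough illustration of how these pieces fit together, the Python sketch below chains toy stand-ins for ASR, NLU and TTS. Everything in it is hypothetical: the function names, the `Intent` class and the `SetTimer` and `GetWeather` intents are invented for this example, the "audio" is just UTF-8 text, and a real assistant would call actual speech and language services at each step.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Intent:
    """What the user wants (name) plus any extracted values (slots)."""
    name: str
    slots: dict = field(default_factory=dict)


def recognize_speech(audio: bytes) -> str:
    """ASR stand-in: a real recognizer decodes audio; here the 'audio' is UTF-8 text."""
    return audio.decode("utf-8").lower().strip()


def understand(text: str) -> Intent:
    """NLU stand-in: map a phrase to an intent using a couple of hand-written rules."""
    match = re.search(r"timer for (\d+) minutes?", text)
    if match:
        return Intent("SetTimer", {"minutes": int(match.group(1))})
    if "weather" in text:
        return Intent("GetWeather")
    return Intent("Unknown")


def respond(intent: Intent) -> str:
    """Application logic: turn the intent into the text to be spoken."""
    if intent.name == "SetTimer":
        return f"Timer set for {intent.slots['minutes']} minutes."
    if intent.name == "GetWeather":
        return "I can't check the weather in this sketch."
    return "Sorry, I didn't catch that."


def synthesize(text: str) -> bytes:
    """TTS stand-in: a real engine returns audio such as MP3; here we return encoded text."""
    return text.encode("utf-8")


def handle(audio: bytes) -> bytes:
    """Run the whole chain: audio in, ASR, NLU, response, TTS, audio out."""
    return synthesize(respond(understand(recognize_speech(audio))))


reply = handle(b"Set a timer for 10 minutes")
print(reply.decode("utf-8"))
```

The point of the sketch is the shape of the chain: each stage has a narrow input and output, which is why the real technologies can be swapped or combined.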
This list will be updated with new definitions as they come up. Know of a missing one? Leave a comment below, and I’ll be sure to add it.