Build a Speech Translation App on Deno With Azure and Vonage

Published June 09, 2020 by Ben Greenberg

Vonage recently released Automatic Speech Recognition (ASR) as a new feature on the Voice API, which is a great reason to build an entertaining new voice application to leverage this new capability!

In this tutorial, we will build a voice application running on Deno that will:

  1. Receive a phone call
  2. Accept the speech said by the caller at the prompt
  3. Convert that speech into text using Vonage ASR
  4. Translate it into a randomly chosen language using Microsoft Azure
  5. Speak back both the original English text and the newly translated text
  6. If the newly translated text has an accompanying Vonage Voice available, that voice will be used

We are building with Deno as our runtime environment because Deno lets us build a server-side app in TypeScript with lightweight dependencies. It allows us to integrate only the external code we actually need and to grant it only the runtime permissions we want it to have.

There are several possible providers we can integrate with for text translation. In this tutorial, we will build using the Microsoft Azure Speech Translation API.

Let’s get started!

tl;dr If you would like to skip ahead and just run the app, you can find a fully working version on GitHub.

Prerequisites

To build this application, you will need several items in place before we can start implementing it:

  • Deno installed on your machine
  • A Vonage API account (covered in the next section)
  • A Microsoft Azure account with a Speech Translation API subscription key
  • ngrok, or a similar tool, to make your local server externally accessible

Once you have all that taken care of, we can move on to begin our application implementation.

Vonage API Account

To complete this tutorial, you will need a Vonage API account. If you don’t have one already, you can sign up today and start building with free credit. Once you have an account, you can find your API Key and API Secret at the top of the Vonage API Dashboard.

This tutorial also uses a virtual phone number. To purchase one, go to Numbers > Buy Numbers and search for one that meets your needs.


Creating the Folder Structure

The first step is to create the folder structure for our application. It will look like this at the end:
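A sketch of the final layout, using the file names we create throughout the rest of the tutorial:

```
speech-translation-app/
├── .env
├── server.ts
├── data/
│   ├── languages.ts
│   └── voices.ts
└── services/
    ├── language_picker.ts
    ├── translate.ts
    ├── voice_picker.ts
    └── auth/
        └── token.ts
```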

For this tutorial, we will call the root folder of the application speech-translation-app, but you can name it whatever you would like. Once you have created the root folder, change directory into it and create the data, services, and services/auth subfolders.

Inside the root folder create server.ts and .env files by running touch server.ts .env from inside the root directory.

Perform a similar action inside the data, services, and services/auth folders, executing touch to create the files shown in the directory tree above.

Creating the Deno Server

In your preferred code editor, open up the server.ts file from the root directory that you created in the last step.

Within this file, we will instantiate an HTTP server, provide it with its routes, and control the flow of the application.

We are going to use Opine as our web framework for the server. Opine is a minimalist framework for Deno, ported from ExpressJS. If you are familiar with ExpressJS, the constructs in Opine will feel familiar.

To use Opine, we need to import it at the top of our file. Deno, unlike NodeJS, does not use node_modules or another similar package management system. As a result, each package brought into your application is imported directly from its source:

Once we have made opine available to use, we can instantiate an instance of it and create the skeleton structure for our routes:

The three GET requests enumerated in the server correspond to three unique webhooks from the Vonage Voice API. The first is where the API sends an incoming call. The second is where the API sends the speech converted to text by the Vonage Automatic Speech Recognition feature. Lastly, the third route is where all of the event data for the lifecycle of the call is sent.

We need to provide logic for each of these three routes that will control the way our application functions:

  • The incoming call route will handle taking in the caller’s voice input and sending it to the Vonage API for text conversion.
  • The second route will receive the text and send it to the Azure Speech Translation API to translate it into a second language. It will also play back to the caller the original and translated messages.
  • The final route will receive all call lifecycle event data and acknowledge the receiving of the data.

Defining the Routes

Let’s build the logic for the incoming call /webhooks/answer route.

Inside the route we need to assign the caller ID (UUID) to a variable so we can use it later. The UUID is a necessary component of the ASR request:
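In isolation, with a stand-in for the live request object (the real value comes from opine's req.query inside the handler), that assignment looks like:

```typescript
// req.query is opine's parsed query string; shown here with a stand-in
// value in place of a live Vonage answer webhook request
const req = { query: { uuid: "aaaaaaaa-bbbb-cccc-dddd-0123456789ab" } };
const uuid = req.query.uuid;
```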

Next, we need to respond with an HTTP status code of 200 and send back in response a Nexmo Call Control Object (NCCO), which is a JSON object containing the set of instructions we wish the Vonage API to perform:
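A sketch of that NCCO. The host value and prompt text are placeholders, and in the live route the uuid comes from req.query.uuid:

```typescript
const host = "your-subdomain.ngrok.io"; // placeholder for your externally reachable host
const uuid = "aaaaaaaa-bbbb-cccc-dddd-0123456789ab"; // normally taken from req.query.uuid

const ncco = [
  {
    action: "talk",
    text: "Welcome to the Vonage universal translator. Please say something.",
    bargeIn: true,
  },
  {
    action: "input",
    // asrWebhook: where Vonage sends the finished speech-to-text conversion
    eventUrl: [`https://${host}/webhooks/asr`],
    eventMethod: "GET",
    speech: {
      uuid: [uuid],
      language: "en-US",
    },
  },
];
```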

As you can see, the NCCO is composed of two actions: talk and input.

The talk action welcomes the caller and asks them to say something. It also sets a parameter bargeIn to equal true, which allows the caller to start speaking before the message has finished.

The input action is where we accept the caller’s voice input. In this action we define a few unique parameters:

  • eventUrl: Where to send the completed speech-to-text conversion. In the action we define the URL as a variable called asrWebhook, which we will create later.
  • eventMethod: The HTTP verb to use when sending the completed conversion; in this case, GET.
  • action: The base parameter for all NCCO actions. Its value is the action you wish to perform, in this case input.
  • speech: A parameter whose value is an object containing the UUID of the caller and the language of the speech being converted into text.

All together, this first GET route looks like the following:

The second route we need to define is the /webhooks/asr route, which will receive the converted speech to text from the Vonage API and act upon it.

There are a few values we want to assign to variables. The first is the result of the ASR conversion, which comes to us as an array of objects sorted in descending order of confidence. The second variable will hold the text from the object with the highest confidence.

We instantiate the second variable as an empty string and assign its value based on whether Vonage ASR was able to pick up the speech offered by the caller. If the speech was recognized, that value is used. If it was not, a default value is provided and a message explaining why is logged to the console.

In the last two variables, we assign the randomly chosen language the speech will be translated into and pick the voice that will speak the translation. We then log the language and voice information to the console:
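A self-contained sketch of those assignments, with stand-in data. In the live route, results is parsed from the speech query parameter Vonage ASR sends to this webhook, and the two lists come from the data files we define shortly:

```typescript
// Illustrative stand-in for the parsed ASR payload (highest confidence first)
const results = [
  { text: "hello world", confidence: "0.91" },
  { text: "hello word", confidence: "0.62" },
];

let text = "";
if (results && results.length > 0) {
  // Use the highest-confidence transcription
  text = results[0].text;
} else {
  console.log("Vonage ASR did not recognize any speech");
  text = "I could not understand what you said";
}

// Abridged stand-ins for data/languages.ts and data/voices.ts
const languageList = [
  { name: "Spanish", code: "es-ES" },
  { name: "German", code: "de-DE" },
];
const voicesList = [
  { name: "Conchita", code: "es-ES" },
  { name: "Marlene", code: "de-DE" },
];

// Randomly choose a target language, then a matching voice
// (Salli, the American English voice, is the fallback)
const language = languageList[Math.floor(Math.random() * languageList.length)];
const voice = voicesList.find((v) => v.code === language.code) ??
  { name: "Salli", code: "en-US" };
console.log(`Translating into ${language.name}, spoken by ${voice.name}`);
```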

Then, we set the response HTTP status code to 200 like in the first route, and we respond with another NCCO JSON object:
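A sketch of that second NCCO. The placeholder values stand in for what the route computes from the ASR results, the pickers, and the translateText() helper we define below:

```typescript
// Placeholder values; in the route these come from the ASR results,
// the picker helpers, and the translateText() function
const text = "hello world";
const languageName = "Spanish";
const translatedText = "hola mundo";
const voiceName = "Conchita";

const ncco = [
  { action: "talk", text: `This is what you said: ${text}` },
  { action: "talk", text: `This is your message translated into ${languageName}` },
  // voiceName says the translation in the language's designated voice
  { action: "talk", text: translatedText, voiceName: voiceName },
];
```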

This NCCO object contains three talk actions, each one of them with variables and functions that need to be created. We will do that after we have finished defining the routes.

The first talk action says back to the caller their original message in English as it was understood during the automatic speech recognition conversion.

The second talk action tells the caller what language their message was translated into.

The third talk action says to the caller their newly translated message. It also leverages the voiceName parameter to say the translated message in the language’s designated voice, if one is available for that language.

The last route we need to define will be a short one. This is the one that will receive the rest of the event webhook data for the call. In this tutorial we are not going to do anything with that data besides acknowledging we received it. We acknowledge it by sending back a 204 HTTP status code, which is the equivalent of saying the message was successful and there is no content to respond with:

With the server defined, we are ready to build the helper functions that we invoked in the server routes.

Creating the Services and Data

Let’s navigate to the top of the server.ts file again and add a few more import statements for functions and data that we will define:
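The imports mirror the folder structure we created earlier (relative paths in Deno must include the .ts extension):

```typescript
import { languageList } from "./data/languages.ts";
import { voicesList } from "./data/voices.ts";
import { translateText } from "./services/translate.ts";
import { voicePicker } from "./services/voice_picker.ts";
import { languagePicker } from "./services/language_picker.ts";
```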

As the snippet above indicates, we need to create the following five items:

  • languageList: An array of possible languages to translate the message into
  • voicesList: An array of possible voices to speak the translated message with
  • translateText: The function to translate the text into the second language
  • voicePicker: The function to choose a voice to speak the translated text with
  • languagePicker: The function to choose a language to translate the text into

We will build each of these now.

Defining the Data

First, let’s add some data to our application.

We need to add two items of data: a list of languages and a list of voices to speak those languages.

The list of supported languages is derived from the Vonage ASR Guide. The list of voice names is likewise derived from the Vonage Voice API Guide.

Open up the data/languages.ts file and we will add an array of objects to it:
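An abridged sketch of that file. The entries here are illustrative; the full list in the repository mirrors the supported languages in the Vonage ASR guide:

```typescript
// Abridged: consult the Vonage ASR guide for the full, current list
export const languageList = [
  { name: "Spanish", code: "es-ES" },
  { name: "German", code: "de-DE" },
  { name: "French", code: "fr-FR" },
  { name: "Italian", code: "it-IT" },
  { name: "Japanese", code: "ja-JP" },
  { name: "Korean", code: "ko-KR" },
  { name: "Brazilian Portuguese", code: "pt-BR" },
  { name: "Russian", code: "ru-RU" },
  { name: "Hindi", code: "hi-IN" },
  { name: "Egyptian Arabic", code: "ar-EG" },
];
```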

This represents the list of supported languages at the time of publication of this tutorial. The list is subject to change, and the guide on the website should be consulted for the most up to date information.

Next, open up the data/voices.ts file and add an array of objects to this file as well. Like the language list, the data here represents the list of voice names at the time of publication. There is often more than one voice per language. For the sake of this tutorial, we have removed duplicate language voices to keep it to one voice per language:
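An abridged sketch, with one voice per language. The voice names follow the Vonage Voice API guide, but check the guide for the current list:

```typescript
// Abridged: one voice per language, names per the Vonage Voice API guide
export const voicesList = [
  { name: "Salli", code: "en-US" },
  { name: "Conchita", code: "es-ES" },
  { name: "Marlene", code: "de-DE" },
  { name: "Celine", code: "fr-FR" },
  { name: "Carla", code: "it-IT" },
  { name: "Mizuki", code: "ja-JP" },
  { name: "Seoyeon", code: "ko-KR" },
  { name: "Vitoria", code: "pt-BR" },
  { name: "Tatyana", code: "ru-RU" },
  { name: "Aditi", code: "hi-IN" },
  { name: "Zeina", code: "ar" },
];
```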

Defining the Service Functions

Our application uses several functions defined in services/ that provide the core functionality. We will build each of them out at this point.

To utilize the Microsoft Azure Speech Translation API, we must authenticate to the API using a two-step process. The first step is obtaining a JSON Web Token (JWT) from the token creation API endpoint that we then use in the second step when we make an HTTP call to the translation API endpoint.

Open up the services/auth/token.ts file and in it we will create the functionality to obtain a JWT from Azure. Please note this depends upon you successfully creating an account on Microsoft Azure and receiving your API key. The function reads the API key from an environment variable in our .env file, which we will define later in this tutorial:

The getToken() function accepts a key parameter and, along with the Microsoft Azure URL endpoint defined in your dotenv file, makes a fetch() request sending your API key. The value that comes back is your JWT, which is returned from the function. At the very top of the file, we import a dotenv loader module from Deno that lets us read the values in the .env file.

If the key is undefined or if there is no value for the azureEndpoint, the function will return early and provide an explanation in the console for what was missing.

Once we have the token from getToken(), we are ready to use it to build a helper function to call the translation API and get the translated text back.

Open up the services/translate.ts file and in that file we will create a translateText() function:

This function, like the one before it, reads from our .env file to obtain the Azure API key that we defined. It takes two arguments: the two-letter language code and the converted speech-to-text.

The function then creates two variables: token and response. The former invokes the getToken() function passing the Azure API key as its argument. The latter invokes a fetch() POST request to the Azure speech translation API endpoint using the two-letter language code as part of the query parameters. The JWT generated by the getToken() function is passed into the Authorization header. The body of the POST request is the converted speech to text made into a JSON string.

The response from the request is held in the translation variable, and the actual translated text, contained inside translation[0]["translations"][0]["text"], is returned by the function.

We have two remaining functions to create before we can move on to define our .env environment variables.

The first of the two remaining functions we will build will randomly choose a language from the languages list for the text to be translated into.

Open up services/language_picker.ts and add the following code:
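A minimal sketch of the picker:

```typescript
type Language = { name: string; code: string };

export const languagePicker = (languages: Language[]): Language => {
  // Math.random() returns [0, 1); scaling by the list length and flooring
  // produces a valid random index into the array
  return languages[Math.floor(Math.random() * languages.length)];
};
```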

The function uses a bit of math to randomly pick an index from the languages list and return the value of the object in that index.

The last function we will build chooses a Vonage voice to speak the translated message with, if one exists for the language. If one does not exist, it returns the Salli voice, which represents American English. We also ensure that if the chosen language is one of the regional dialects of Arabic, the chosen voice is one of the Vonage Arabic voices.

Open up services/voice_picker.ts and add the following into it:
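A sketch of the picker. The voice names (Zeina for Arabic, Salli as the American English fallback) follow the Vonage Voice API guide:

```typescript
type Voice = { name: string; code: string };
type Language = { name: string; code: string };

const fallback: Voice = { name: "Salli", code: "en-US" };

export const voicePicker = (voices: Voice[], language: Language): Voice => {
  // Regional Arabic dialects (e.g. ar-EG, ar-SA) don't map one-to-one onto
  // a voice code, so route them all to the Arabic voice explicitly
  if (language.code.split("-")[0] === "ar") {
    return voices.find((v) => v.name === "Zeina") ?? fallback;
  }
  // Otherwise match on the language code, falling back to Salli
  return voices.find((v) => v.code === language.code) ?? fallback;
};
```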

That does it for all the functions! If you made it this far, we are almost at the finish line.

The last items we need to take care of are to assign the values to our .env environment variables and to provision a Vonage virtual phone number.

Defining the Environment Variables

There are three values we need to assign in the .env file:

  • AZURE_SUBSCRIPTION_KEY
  • AZURE_ENDPOINT
  • VONAGE_ASR_WEBHOOK

The first two are our Azure API key and Azure URL endpoint, respectively.

The third is the webhook URL for the data returned by the Vonage Automatic Speech Recognition feature. This needs to be an externally accessible URL. A good tool to use during development is ngrok, which makes your local environment externally available. You can find a guide to setting up ngrok locally on our developer website.

Provisioning a Vonage Virtual Phone Number

There are two ways to provision a Vonage virtual phone number. Once you have a Vonage developer account, you can purchase a phone number either through the dashboard or by using the Nexmo NodeJS CLI. We will do so here using the CLI.

To install the CLI you can use either yarn or npm: yarn global add nexmo-cli or npm install nexmo-cli -g. After installation you need to provide it with your API credentials obtained from the dashboard:
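For example, replacing the placeholders with your own key and secret:

```shell
nexmo setup <api_key> <api_secret>
```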

Now that your CLI is set up, you can use it to search for available numbers in your country. To do so, run the following with your two-letter country code; the example below shows a number search in the United States. Make sure to add the --voice flag to return only voice-enabled numbers:
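```shell
nexmo number:search US --voice
```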

Once you have found a number you want, you can also purchase it with the CLI:
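Substituting a number returned by the search:

```shell
nexmo number:buy <number>
```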

You will be asked to type confirm after you submit the command to officially purchase the number.

Since we are creating a voice app, we also need to create a Vonage Application. This, too, can be done with the CLI, and once finished, we can link the recently provisioned phone number to the application. You can also supply the answer webhook and event webhook URLs when creating the application. If you are developing locally, now is a good time to start your ngrok server and supply the ngrok URLs:
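Something like the following, substituting your own ngrok subdomain (app:create takes a name, an answer URL, and an event URL):

```shell
nexmo app:create "speech-translation-app" https://<your-subdomain>.ngrok.io/webhooks/answer https://<your-subdomain>.ngrok.io/webhooks/events
```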

The command will return the application ID to you: Application created: asdasdas-asdd-2344-2344-asdasdasd345. We will now use that ID to link the application to the phone number:
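Using your purchased number and the application ID from the previous step:

```shell
nexmo link:app <your_number> <application_id>
```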

Once you have finished those commands, you are ready to run your application!

Running the Application

To use your application, start both your ngrok server and your Deno web server. To start the Deno application, run the following from the root folder:
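Deno requires explicit permission flags; this sketch grants the file, environment, and network access the app needs (the exact flags may vary with your setup):

```shell
deno run --allow-read --allow-env --allow-net server.ts
```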

Now that it is running, you can give your Vonage provisioned phone number a call and follow the prompt to say a message. Your message will be converted into text using the Vonage Automatic Speech Recognition feature, translated into a random second language using Microsoft Azure, and then said back to you. Enjoy!
