Translating speech from German to English on a voice call

Build a Babel Fish with Nexmo and the Microsoft Translator Speech API

Published March 14, 2018 by Naomi Pentrel

If you have been on the internet in the past few months, chances are you saw Google’s real-time translation Pixel Buds: a technology quite like the Babel fish in The Hitchhiker’s Guide to the Galaxy, which can translate any sentient speech for its wearer and thus enables them to communicate with virtually every being. The Google Pixel Buds come at a price, of course, so why not build our own? That’s what Danielle and I thought at the latest hackference. We went on to create a Nexmo Babel fish that lets two people talk on the phone, with each party hearing a translated version of what the other party says.

Image of a Babel fish from The Hitchhiker's Guide to the Galaxy

In this blog post, we will go over how this Babel fish system works step by step, starting with the required setup and configuration. Then, we will set up a Nexmo number for handling incoming calls. Following this, we will implement a Python server which will receive speech via a WebSocket and route the incoming speech from the Nexmo number to the Microsoft Translator Speech API. We will use the Translator Speech API to handle the transcription and translation. On top of this, we will implement logic to manage a bi-directional dialogue and to instruct the Nexmo number to speak the translations. For ease of implementation, both parties will have to call our service’s Nexmo number. Below, you can see a high-level system diagram of how an instance of speech from either side gets processed. Note that throughout this tutorial I will use the example of a German/British English conversation.

Diagram that shows how a message passes through the system. A German caller speaks a message in German which Nexmo passes through to a Python server. The Python server sends the German audio to the Microsoft Speech API. The Speech API responds by sending the English translation as text to the Python server. The Python server then sends a request to Nexmo to speak the English message to the British caller. At this point the British caller hears the translated message in English.

If you would prefer to just see the code, it is available on GitHub here.


You will need to have Python (2.x or 3.x) and the HTTP tunnelling software ngrok installed to be able to follow along. We will list the commands you need to install everything else as we go.

Getting Started

Set Up Your Environment

Let’s get started with our DIY Babel fish solution by setting up a virtual environment for this project using Virtualenv. Virtualenv allows us to isolate the dependencies of this project from our other projects. Go ahead and create a directory for this project and copy the following list of dependencies into a file in your project directory named requirements.txt:

To create and activate your virtual environment, run the following commands in your terminal:

At this point, please start ngrok in a separate terminal window by running the command below. ngrok will allow us to expose our localhost at port 5000 to incoming requests. You will need to keep ngrok running in the background for this to work. You can read more about connecting ngrok with Nexmo here.

Once you run the above command, your terminal should look similar to the screenshot below. You will need the forwarding URL when configuring your Nexmo application and number in the next steps.

Screenshot of ngrok running in a terminal and displaying a forwarding URL of the form “”.

Acquire a Nexmo Number

For our translation service, we need to acquire a Nexmo number. Sign up for a developer account at if you haven’t yet. Next, head over to to purchase a number with voice capability.

Screen capture of a user buying a number using the Nexmo buy numbers menu. A user selects their country, Voice as the feature, and mobile as the type and clicks on the search button. The user then clicks on buy for the first number that comes up and confirms the purchase.

Create a Nexmo Application

Go to your applications and add a new application. Use the ngrok forwarding URL for both the Event URL and the Answer URL, adding /event as the path for the Event URL and /ncco as the path for the Answer URL. We will set these endpoints up later. Generate a public/private key pair via the user interface and store the key on your computer.

Screen capture of a user creating an application using the Nexmo application menu. A user clicks on add new application. In the form that appears the user enters babelfish as the application name and fills in the Event URL and the Answer URL. The user then clicks on the Generate public/private key pair link, saves the key when prompted, and finally clicks on create application.

The last step for our number setup is to link the number you purchased earlier to your application. Use the application dashboard to link the number.

Obtain keys for Microsoft’s Translator Speech API

The other service we are going to need to set up is Microsoft’s Translator Speech API. Sign up for a free Microsoft Azure account at and afterwards go to and create a Translator Speech API resource. You will need the key it generates for the next step.

Screen capture of a user setting up the Microsoft Translator Speech API. A user types translator speech into the Marketplace search on the Microsoft Azure portal. The user then clicks on the Translator Speech API option that comes up and clicks on the create button on the API overview screen. The user then fills in the form for the resource using babelfish as the name, Pay-as-you-go as the subscription, F0 (10 Hours of audio input) as the pricing tier, and babelfish-resource as the resource group name. After checking the box that the user has 'read and understood the notice' and checking add to dashboard, the user clicks on create and is redirected to the dashboard. After the deployment finishes, the user clicks on the deployed resource and is presented with a resource dashboard. On the resource dashboard under the section grab the keys the user clicks on keys and copies key 1.

Manage Secrets and Config

Now that we have our Nexmo number and our Translator Speech API key, all we need to do is set up a secrets and a config file with all these important details so that we don’t have to keep writing them and can keep them separately managed. Store the below in in your project folder and replace the placeholder values with your values.

Afterwards store the below in in your project folder and again replace the placeholder values with your values. Note that you can choose other languages than the ones below. You can also alter these at any point later.

Tutorial Steps

Below we will first go through how to authenticate with the Translator Speech API. Then we will set up our Tornado Web server using a supplied template. Following this, we will implement the CallHandler, the EventHandler, and the WSHandler. The CallHandler will handle incoming calls to the Nexmo number for us. On top of that, the EventHandler will be used to handle events that Nexmo sends, such as a call starting or completing. With each event, Nexmo sends information about the actor who started or completed the call. We will use this information to store who is in a specific call. The WSHandler will meanwhile be used to open the WebSocket through which Nexmo and our Python server will communicate. The Python server will create snippets of audio and send them to the Translator Speech API. The handler will use the information that the EventHandler gathers to route messages correctly. Each section below will explain these concepts further and show the respective implementation.

Authenticate with Microsoft’s Translator Speech API

To use the Translator Speech API we need to get a token which we name the MICROSOFT_TRANSLATION_SPEECH_CLIENT_SECRET. Luckily Microsoft provides a Python AzureAuthClient which we will use without change. Please copy the below and save it in a file called in your project directory.

Create a Server

WebSocket is a communications protocol that gives us a two-way communication channel over a single TCP connection. Nexmo’s Voice API lets you connect phone calls to such WebSocket endpoints. We will use the Tornado web framework as it implements the WebSocket protocol for us.

If you have been following along and named all the files as described, you can start with the below Tornado Web server setup. This code handles all our imports, sets up the Nexmo client and the azure auth client, and starts a server on port 5000. Note that this server does not do anything useful yet. It has three endpoints: ncco, event, and socket which call the CallHandler, EventHandler, and WSHandler respectively. We will implement the handlers in the following sections.

Create a file named in your project directory and copy this code into it.

Implement the CallHandler

To connect phone calls to WebSocket endpoints, Nexmo’s Voice API uses a Nexmo Call Control Object (NCCO) or an API call. When someone calls your Nexmo number, Nexmo will issue a GET request to the Answer URL you provided when setting up your Nexmo Voice Application. We pointed our application to our server, which now needs to answer this request by returning an NCCO. This NCCO should instruct Nexmo to give a short welcome message to the caller and then connect the caller to the WebSocket.

Diagram that shows the interactions between a user, Nexmo, and the web server. When the user calls the Nexmo number, Nexmo sends a GET request to the web server's /ncco endpoint. The web server responds with an NCCO that instructs Nexmo to open a socket with the web server.

Go ahead and save the following NCCO into a file called ncco.json within your project directory. It contains a template that will perform the required actions. However, it includes some placeholder variables ($hostname, $whoami, and $cid) which we will need to replace later when we use it.

In the template for the server, the section reproduced below sets up the mapping between the /ncco endpoint and the CallHandler. This mapping ensures that when the /ncco endpoint receives a GET request, the CallHandler's get method is executed by the server.

When the server executes the method, it returns an assembled NCCO using the code below. To begin with, we gather data from the query (i.e. the GET request) in a data variable. We also store the conversation_uuid for later use. In this case, there is a print statement so that you can see the conversation_uuid when you are testing your server. In the next step, the code loads the NCCO from the ncco.json file we created. To complete the loaded NCCO, we substitute the placeholder variables ($hostname, $cid, and $whoami) with the gathered values from the data variable. After the substitution, we are ready to send it back to Nexmo.
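The substitution step described above can be sketched with string.Template from the Python standard library, which uses the same $name placeholder syntax. This is only an illustration: the template contents, the helper name build_ncco, and the example values are my assumptions, and the ncco.json used in this project may differ in its details.

```python
import json
from string import Template

# Assumed NCCO template: a talk action followed by a connect action that
# points Nexmo at our WebSocket endpoint. Placeholders use $name syntax.
NCCO_TEMPLATE = """
[
  {"action": "talk", "text": "Please wait while we connect you."},
  {
    "action": "connect",
    "eventUrl": ["http://$hostname/event"],
    "from": "$whoami",
    "endpoint": [
      {
        "type": "websocket",
        "uri": "ws://$hostname/socket",
        "content-type": "audio/l16;rate=16000",
        "headers": {"whoami": "$whoami", "cid": "$cid"}
      }
    ]
  }
]
"""


def build_ncco(hostname, whoami, cid):
    """Fill in the three placeholders and return the NCCO as a Python list."""
    filled = Template(NCCO_TEMPLATE).substitute(
        hostname=hostname, whoami=whoami, cid=cid
    )
    return json.loads(filled)


# Example values are placeholders, not real hosts or numbers.
ncco = build_ncco("example.ngrok.io", "447700900000", "CON-1234")
print(json.dumps(ncco, indent=2))
```

Parsing the filled template with json.loads before sending it is a cheap way to catch a broken template early.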

Replace the CallHandler from the template above with this code:

Whenever someone now calls the Nexmo number, Nexmo will send a GET request to our /ncco endpoint and the CallHandler will assemble and send the NCCO. Nexmo will then perform the actions as laid out in the NCCO. In this case, that means the caller will hear “Please wait while we connect you.”. Afterwards, Nexmo will attempt to connect the call to the provided socket endpoint. It also provides Nexmo with the event endpoint to be used. If you start your server now by running python in your terminal window, you will find that you will hear the message but the call will end after it. This is because we haven’t implemented the EventHandler or the WSHandler. Let’s do that now!

Implement the EventHandler

The EventHandler handles events that Nexmo sends. We are interested in any incoming calls and therefore check any incoming request to see whether its body contains a direction and whether that direction is incoming. If it is, we will want to store the uuid and finish the request context. The call_id_by_conversation_id dictionary will be used for routing messages between the callers in the WSHandler.
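The check described above can be sketched as a plain function, independent of Tornado. The function name record_incoming_call is hypothetical, and I am assuming that Nexmo marks incoming calls with a direction value of "inbound" and sends the uuid and conversation_uuid fields in a JSON body.

```python
import json

# Maps a conversation UUID to the UUID of the call leg (assumed structure).
call_id_by_conversation_id = {}


def record_incoming_call(raw_body, registry):
    """Store the call uuid for inbound-call events; ignore everything else."""
    body = json.loads(raw_body)
    # Assumption: incoming calls carry direction == "inbound".
    if body.get("direction") != "inbound":
        return False
    if "uuid" not in body or "conversation_uuid" not in body:
        return False
    registry[body["conversation_uuid"]] = body["uuid"]
    return True


# A made-up event body for illustration.
event = json.dumps({
    "direction": "inbound",
    "uuid": "call-1",
    "conversation_uuid": "CON-1",
    "from": "447700900001",
})
record_incoming_call(event, call_id_by_conversation_id)
print(call_id_by_conversation_id)
```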

Replace the EventHandler from the template with this code:

Implement the WSHandler

The CallHandler and the EventHandler have allowed our application to set up the call. The WSHandler will now take care of the audio stream of the call. The speech on the primary caller’s side will be transcribed and translated by the Translator Speech API, and the resulting text will be spoken by a Nexmo voice on the other end of the line. The second person can thus hear the caller in a language they understand and afterwards respond. The Translator Speech API will translate the response in turn so that the first person hears it in their language. This workflow is the bit we will now implement.

When the Nexmo Voice API connects to a WebSocket, Nexmo sends an initial HTTP GET request to the endpoint. Our server responds with an HTTP 101 to switch protocols, after which Nexmo and the server exchange WebSocket messages over the same TCP connection. This connection upgrade is handled for us by Tornado. Whenever someone makes a call to our Nexmo number, Nexmo will open a WebSocket for the duration of the call. When a WebSocket is opened and, eventually, closed, the Tornado framework will call the open and on_close methods below. We do not need to do anything in either case, but we will print messages so that we can follow what is going on when we run the server.
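Tornado performs this handshake for us, but it can help to see what the protocol switch involves. As a small sketch, the one computed value in the server's 101 response is the Sec-WebSocket-Accept header, which RFC 6455 derives from the client's Sec-WebSocket-Key. The function name websocket_accept is my own.

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for the opening handshake.
WEBSOCKET_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def websocket_accept(sec_websocket_key):
    """Return the Sec-WebSocket-Accept value for a given client key."""
    raw = (sec_websocket_key + WEBSOCKET_GUID).encode("ascii")
    digest = hashlib.sha1(raw).digest()
    return base64.b64encode(digest).decode("ascii")


# Sample key taken from RFC 6455.
accept = websocket_accept("dGhlIHNhbXBsZSBub25jZQ==")
print(accept)  # s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```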

Now that we have an open connection, Nexmo will send messages that we handle in the on_message method. The first message we will receive from Nexmo will be plain text with metadata. Upon receiving this message, we will set the whoami property of the WSHandler to be able to identify the speaker. Afterwards, we will create a wave header that we will send to the Translator Speech API. To send messages to the Translator Speech API, we will create a translator_future. Depending on the caller, i.e. the person who the message comes from, we will create the translator_future with the respective language variables so that the API knows from which language to translate into which other language.

A translator_future is another WebSocket that connects to the Translator Speech API. We use it to pass on the messages we receive from the Nexmo Voice API. After its creation, the translator_future is stored in the variable ws and used to send the wave header we created before. Each subsequent message from Nexmo will be a binary message. These binary messages are passed to the Translator Speech API using the translator_future which processes the audio and returns the transcribed translation.
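The message flow described above can be sketched without Tornado or a network connection. Everything here is a simplified stand-in: FakeTranslatorSocket, MessageRouter, PRIMARY_NUMBER, LANGUAGE1, and LANGUAGE2 are hypothetical names, the metadata is assumed to arrive as JSON with a whoami field, and the sketch assumes Python 3, where text messages are str and binary messages are bytes.

```python
import json

# Hypothetical config values for a German/British English conversation.
PRIMARY_NUMBER = "447700900000"  # placeholder number of the first caller
LANGUAGE1, LANGUAGE2 = "de-DE", "en-GB"


class FakeTranslatorSocket:
    """Stand-in for the WebSocket to the Translator Speech API."""

    def __init__(self, from_lang, to_lang):
        self.from_lang, self.to_lang = from_lang, to_lang
        self.sent = []

    def write_message(self, data, binary=False):
        self.sent.append((data, binary))


class MessageRouter:
    """Mimics the on_message flow: metadata first, audio afterwards."""

    def __init__(self):
        self.whoami = None
        self.ws = None

    def on_message(self, message):
        if isinstance(message, str):
            # First message: plain-text metadata identifying the caller.
            self.whoami = json.loads(message)["whoami"]
            if self.whoami == PRIMARY_NUMBER:
                self.ws = FakeTranslatorSocket(LANGUAGE1, LANGUAGE2)
            else:
                self.ws = FakeTranslatorSocket(LANGUAGE2, LANGUAGE1)
            # The real handler would send the wave header here.
            self.ws.write_message(b"<wave header>", binary=True)
        elif self.ws is not None:
            # Every later message is raw audio, forwarded unchanged.
            self.ws.write_message(message, binary=True)


router = MessageRouter()
router.on_message(json.dumps({"whoami": "447700900000"}))
router.on_message(b"\x00\x01" * 160)
print(router.ws.from_lang, "->", router.ws.to_lang, len(router.ws.sent))
```

Choosing the language pair once, when the metadata arrives, means every later audio frame can be forwarded without inspecting it.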

When we initialize the translator_future, we state that when the Translator Speech API has processed our message it should call the method speech_to_translation_completed. This method will, upon receiving a message, check that the message is not empty and then speak the message in the language voice of the receiver of the message. It will only speak the message for the other caller, not for the person who initially spoke. Additionally, we will print the translation to the terminal.
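As a rough sketch of that callback's routing logic, the function below speaks a non-empty translation on every call leg except the speaker's own. The name route_translation, the conversation_id_by_number mapping, and the shape of the API response (a JSON object with recognition and translation fields) are assumptions made for illustration.

```python
import json


def route_translation(raw_message, speaker, conversation_id_by_number,
                      call_id_by_conversation_id, speak):
    """Speak a non-empty translation on every call leg except the speaker's."""
    if raw_message is None:
        return []
    translation = json.loads(raw_message).get("translation", "")
    if not translation:
        return []
    spoken_to = []
    for number, conversation_id in conversation_id_by_number.items():
        if number == speaker:
            continue  # never play the translation back to its author
        call_id = call_id_by_conversation_id.get(conversation_id)
        if call_id is not None:
            speak(call_id, translation)
            spoken_to.append(call_id)
    return spoken_to


# Record the speak calls instead of contacting Nexmo.
calls = []
result = route_translation(
    json.dumps({"recognition": "Guten Morgen", "translation": "Good morning"}),
    speaker="4915100000000",
    conversation_id_by_number={"4915100000000": "CON-1", "447700900001": "CON-2"},
    call_id_by_conversation_id={"CON-1": "call-1", "CON-2": "call-2"},
    speak=lambda call_id, text: calls.append((call_id, text)),
)
print(calls)
```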

Replace the WSHandler from the template with this code:

In the above we use a function called make_wave_header to create the header that the Translator Speech API expects. The code used to create a WAV header was copied from the Python-Speech-Translate project and is reproduced below.
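To show what such a header contains, here is a simplified stand-in that builds a standard 44-byte RIFF/WAVE header for 16-bit mono PCM audio. It is not the function from the Python-Speech-Translate project: the frame_rate parameter and the choice to write zero for the unknown stream lengths are my assumptions.

```python
import struct


def make_wave_header(frame_rate, channels=1, bytes_per_sample=2):
    """Build a minimal PCM WAV header for a stream of unknown length."""
    block_align = channels * bytes_per_sample
    byte_rate = frame_rate * block_align
    return b"".join([
        b"RIFF",
        struct.pack("<L", 0),    # total size, unknown for a live stream
        b"WAVE",
        b"fmt ",
        struct.pack("<L", 16),   # size of the fmt chunk for plain PCM
        struct.pack(
            "<HHLLHH",
            1,                   # audio format 1 = uncompressed PCM
            channels,
            frame_rate,
            byte_rate,
            block_align,
            bytes_per_sample * 8,  # bits per sample
        ),
        b"data",
        struct.pack("<L", 0),    # data size, unknown for a live stream
    ])


header = make_wave_header(16000)
print(len(header), header[:4], header[8:12])
```

All multi-byte fields are little-endian, which is why every struct format string starts with "<".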

Copy the make_wave_header function to the end of your file:

Lastly, the speak function used above is a simple wrapper around the nexmo_client method send_speech. As you can see below, it will print some information that may be useful to you when running the code and then use the Nexmo API to instruct Nexmo to speak a given text with a given voice_name.
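A minimal sketch of such a wrapper follows. So that it runs offline, it takes the client as an argument and uses FakeNexmoClient, a hypothetical stand-in for the real Nexmo client. I am assuming that send_speech takes the call uuid followed by keyword parameters, and that "Amy" is a valid British English voice name.

```python
class FakeNexmoClient:
    """Stand-in for the real Nexmo client so the sketch runs offline."""

    def __init__(self):
        self.requests = []

    def send_speech(self, uuid, **params):
        # Record the request instead of calling the Nexmo API.
        self.requests.append((uuid, params))
        return {"message": "Talk started", "uuid": uuid}


def speak(client, uuid, text, voice_name):
    """Log what will be spoken, then ask Nexmo to speak it into the call."""
    print("speaking to {}: {!r} with voice {}".format(uuid, text, voice_name))
    return client.send_speech(uuid, text=text, voice_name=voice_name)


nexmo_client = FakeNexmoClient()
response = speak(nexmo_client, "call-2", "Good morning", "Amy")
```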

Copy the speak function below to the end of your file.


If you followed along, you have now successfully built your own Babel fish! If you haven’t followed along you can find the final code here.

Run it by typing python into your terminal. Now team up with a fellow human (or use two phones) and call your Nexmo number from two lines. You should hear your welcome message and then be able to talk to each other in your two chosen languages.

Let us recap: We began by setting up our environment, as well as our Nexmo Application and the Microsoft Translator Speech API. Then, we built our Tornado web server, which allowed us to use WebSockets to handle voice calls and pass the speech of the voice call on to the Translator Speech API. The API then transcribes and translates the speech for us. Upon receiving the result, we spoke the message in the new language. Our service handles bi-directional calls thanks to our routing logic: after connecting two callers, it translates either person’s speech before relaying it, thus enabling them to communicate in their chosen languages.

And there we have it! Our working Babel fish! I’m afraid our DIY Babel fish does not look quite as endearing as the one from the movie, but it is a working alternative.

If you have any questions, please reach out on @naomi_pen or find me on

Where Next?

If you’re interested in exploring this further, why not implement logic that allows users to choose languages at the beginning of the call? Such logic might also remove the necessity for hard-coding our primary phone number. For a fun project you could also explore making this work for conference calls and creating transcripts for each call. Lastly, you might want to work on the security of your service and not let random people call it. You could achieve this by only letting a certain number (or several) use your service and adding logic to initiate a second leg of the call from within the call, which would allow you to invite other users without giving them the privilege of using your Babel fish service. I would love to hear what you build on Twitter @naomi_pen!
