MultiTrack Call Transcription with Split Recording

Published December 03, 2018 by Michael Heap

Back in April we announced that split recording was available as part of the Nexmo Voice API. Split recording allows you to record a conversation in stereo, one participant in each channel. This makes common use cases such as transcription much easier to handle.

However, there was one downside to split recording: if you had more than two participants, the first participant would be in channel 0 and everyone else would be mixed together in channel 1, which meant we lost the ability to transcribe what each person said individually.

What if I told you that Nexmo now supports not one, not two, but three(!) separate channels in a recording? Would that make you happy? How about if I told you we could support four? Five? We’re pleased to announce that available immediately, Nexmo supports up to 32 channels in every single recording. You read that correctly – we can provide 32 separate channels of audio, each containing a single participant for you to process however you like.

Just like last time, we’re going to walk through a simple use case together. In this scenario, Alice needs to discuss a work project with Bob and Charlie, and they’ve all agreed that it’d be a good idea to record the call. To achieve this, Alice has created a small Node.js application that uses Nexmo to connect her to Bob and Charlie and record the conversation.

All of the code in this post is available on GitHub.

Bootstrapping Your Application

The first thing we need to do is create a new application and install all of our dependencies. To do this, we’ll use npm to initialise a project and install express and body-parser to handle our HTTP requests, and dotenv to handle our application configuration. We’ll also need to install the nexmo client so that we can access our call recording once we receive a notification that it’s available, and @google-cloud/speech to transcribe the audio.
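Assuming you have Node.js installed, the setup looks like this:

```sh
npm init -y
npm install express body-parser dotenv nexmo @google-cloud/speech
```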

Once you’ve installed all of your dependencies, you’ll need to create an application and rent a number that people can call. If you don’t already have a Nexmo account, you can create an account via our dashboard.

Nexmo sends HTTP requests to your application whenever an event occurs: when a call starts ringing, when it’s answered, or when a call recording is available, to name a few. To deliver these requests, Nexmo needs to be able to reach your application, which is difficult when it’s running locally on your laptop. To expose your local application to the internet, you can use a tool called ngrok. For more information, you can read our introduction to ngrok blog post.

Expose your server now by running ngrok http 3000. You should see some text that looks similar to http://2afec62c.ngrok.io -> localhost:3000. The first section is your ngrok URL, and is what you need for the rest of this post.

We’re finally ready to create a Nexmo application and link a number to it. You’ll need to replace http://2afec62c.ngrok.io with your own ngrok URL in these examples:
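With the Nexmo CLI, the steps look roughly like the following sketch; the application name and phone number are placeholders, and the exact flags may vary with your CLI version:

```sh
# Create an application with answer and event webhooks, saving the private key locally
nexmo app:create "multichannel-recording" http://2afec62c.ngrok.io/webhooks/answer http://2afec62c.ngrok.io/webhooks/event --keyfile private.key

# Buy a number and link it to the application
nexmo number:buy --country_code US
nexmo link:app 14155550123 YOUR_APPLICATION_ID
```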

If you now make a call to the number that you purchased, Nexmo will make a request to http://[id].ngrok.io/webhooks/answer to find out how to handle the call. As we haven’t built that application yet, the call will fail.

Handling an Inbound Call

Let’s build our application to handle inbound calls. 

Create the index.js file and enter the code shown below. This requires all of our dependencies, configures the Nexmo client and creates a new express instance:
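A minimal sketch of that bootstrap code is below. The NEXMO_APPLICATION_ID variable name is an assumption that we’ll match in our .env file later, and the dummy apiKey/apiSecret values exist only to satisfy the constructor:

```javascript
require('dotenv').config();

const express = require('express');
const bodyParser = require('body-parser');
const Nexmo = require('nexmo');

const app = express();
app.use(bodyParser.json());
app.use(bodyParser.urlencoded({ extended: true }));

// private.key was saved to disk when we created the Nexmo application
const nexmo = new Nexmo({
  apiKey: 'not_used', // required by the constructor, but not needed for this app
  apiSecret: 'not_used',
  applicationId: process.env.NEXMO_APPLICATION_ID,
  privateKey: './private.key'
});

app.listen(3000, () => console.log('Listening on port 3000'));
```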

Next, we need to create our /webhooks/answer endpoint which will return an NCCO and tell Nexmo how to handle our call. In our NCCO we use two connect actions to automatically dial Bob and Charlie’s phone numbers and add them to the conversation, and a record action that will tell Nexmo to record the call.
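Here’s a sketch of that endpoint, assuming Bob’s and Charlie’s numbers live in environment variables that we’ll define in our .env file shortly:

```javascript
app.get('/webhooks/answer', (req, res) => {
  return res.json([
    {
      // Record the whole conversation, one participant per channel
      action: 'record',
      eventUrl: [`${req.protocol}://${req.get('host')}/webhooks/recording`],
      split: 'conversation',
      channels: 3
    },
    {
      action: 'connect',
      endpoint: [{ type: 'phone', number: process.env.BOB_PHONE_NUMBER }]
    },
    {
      action: 'connect',
      endpoint: [{ type: 'phone', number: process.env.CHARLIE_PHONE_NUMBER }]
    }
  ]);
});
```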

The record action is where the magic happens. We tell Nexmo to split the audio into separate channels by setting split: conversation and that the audio should be split into three channels by setting channels: 3.

We’ll need to provide Bob’s and Charlie’s phone numbers for this to work, but let’s finish creating our endpoints before we configure our application.

Create the /webhooks/event endpoint which will be informed when events are triggered on the call. For now, all we’re going to do is log the parameters we received and acknowledge that we received it by sending back a 204 response.
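That endpoint only needs a couple of lines:

```javascript
app.post('/webhooks/event', (req, res) => {
  // Log every call event Nexmo sends us and acknowledge receipt
  console.log(req.body);
  return res.status(204).send();
});
```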

Finally, we need to implement our /webhooks/recording endpoint. This URL was defined in our NCCO in the record action and will be notified when a recording is available. We’ll automatically transcribe the audio later, but for now let’s log the request parameters so that we can see what’s available.
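For now it mirrors the event endpoint:

```javascript
app.post('/webhooks/recording', (req, res) => {
  // Log the recording details so we can see what Nexmo provides
  console.log(req.body);
  return res.status(204).send();
});
```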

At this point there’s only one thing left to do – populate our .env file to provide the information that our application needs. You’ll need your application_id, as well as some phone numbers for Bob and Charlie to help you test. Create a .env file and add the following, making sure to replace the application ID and phone numbers with your own values:
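Assuming the variable names used in the sketches above, the file looks like this (every value here is a placeholder):

```
NEXMO_APPLICATION_ID=aaaaaaaa-bbbb-cccc-dddd-0123456789ab
BOB_PHONE_NUMBER=14155550100
CHARLIE_PHONE_NUMBER=14155550101
GOOGLE_APPLICATION_CREDENTIALS=./google_creds.json
```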

Make sure to add the Google credentials line, even though we haven’t created that file yet. We’ll create it in the next section.

Once you’ve done this you can test your application by running node index.js and then calling the Nexmo number you purchased. It should automatically call the two numbers that you added to the .env file and once everyone hangs up, your /webhooks/recording endpoint should receive a recording URL.

Connecting to Google

We’re going to use Google’s Speech-To-Text service to transcribe our call recording. To get started with that, you’ll need to generate Google Cloud credentials in the Google console. Download the JSON file they provide, rename it to google_creds.json and place it alongside index.js. This is the file that the Google SDK will try to read to fetch your authentication credentials.

To transcribe our audio data, we’re going to connect to the Google Speech API, passing in the expected language, the number of channels in the recording and the audio stream itself. Update your transcribeRecording method to contain the following code:
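Here’s a sketch of that method, assuming it receives a single options object; Google’s enableSeparateRecognitionPerChannel flag is what gives us a per-channel transcript:

```javascript
const transcribeRecording = ({ audio, channels, languageCode }) => {
  const client = new speech.SpeechClient();

  const request = {
    config: {
      languageCode,
      audioChannelCount: channels,
      // Ask Google to transcribe each channel independently
      enableSeparateRecognitionPerChannel: true
    },
    audio: {
      // The Speech API expects base64-encoded audio content
      content: audio.toString('base64')
    }
  };

  return client.recognize(request);
};
```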

This returns a promise from the Google Cloud Speech SDK, which will resolve when the transcription is available. Before we can use the speech SDK, we need to require it, so add the following to your require section at the top of your file:
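```javascript
const speech = require('@google-cloud/speech');
```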

At this point we’re ready to transcribe our audio. The final step is to fetch the recording from Nexmo when we receive a request to /webhooks/recording and feed the audio into Google’s transcription service. To do this, we use the nexmo.files.get method and pass the audio returned into our transcribeRecording method:
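Putting it together, and assuming the recording webhook body includes a recording_url field:

```javascript
app.post('/webhooks/recording', (req, res) => {
  // Fetch the recording audio from Nexmo using our application credentials
  nexmo.files.get(req.body.recording_url, (err, audio) => {
    if (err) {
      console.error(err);
      return;
    }

    transcribeRecording({ audio, channels: 3, languageCode: 'en-US' })
      .then(([response]) => {
        response.results.forEach((result) => {
          // channelTag identifies which participant is speaking
          console.log(`Channel ${result.channelTag}: ${result.alternatives[0].transcript}`);
        });
      })
      .catch(console.error);
  });

  return res.status(204).send();
});
```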

As well as passing the audio as a parameter, we tell our transcribeRecording method that there are 3 channels in the audio and that we want to transcribe using the en-US language model. Once the promise is resolved by the Google speech SDK, we read the results and output a transcription of the conversation, including which channel the audio came from.

If you call your Nexmo number now, you’ll be connected to both Bob and Charlie. Once the call is done, hang up and wait for Nexmo to send you a recording webhook. When it arrives, we send that audio off to Google and the transcription appears in the console, labelled with the channel each line came from.

Conclusion

We’ve built a conference calling system with automatic transcription in just 76 lines of code. Not only does it automatically transcribe the call, but it transcribes each channel separately, allowing you to know who said what on the call. For more information about the options available when recording a call, you can view the record action in our NCCO reference, or see an example implementation of the NCCO required in multiple languages.
