Nexmo Voice API Demo: Bridging Telephony and AI Platforms with WebSockets

Published November 30, 2016 by Glen Kunene

Using the Nexmo Voice API’s native WebSocket support, Developer Advocate Sam Machin demonstrates how the new API can easily bridge between telephony and powerful platforms like IBM Watson. Watch him stream one-way call audio directly to a browser, an innovative technique with endless possibilities.

The video (running time: 5 minutes) was recorded at TADSummit 2016. Read a full transcript of the demo below.

Transcript: Sam Machin’s “Dangerous” Nexmo Voice API Demo @ TADSummit 2016

JAMES BODY (Head of R&D, Truphone): So, Sam is not actually going to show us a hacked Ubuntu phone. Sam is going to show us a bit of Nexmo hackery. Sam said he was pressured to do a demo. He said ‘you know what, I’ve got so much going on. I can’t do it. I’m just timing out.’ But then five minutes before we started, Sam came running up and said ‘I’ve done my other stuff. I can now do a quick Dangerous Demo for you.’ So by the power of Nexmo and the brain of Sam, here we go.

SAM MACHIN (Developer Advocate, Nexmo): Just going to get the resolution a bit better.

JB: Perfectionist as always, not only doing the demo right but making the resolution right as well.

SM: Okay so yeah. As James said, this is something actually I’ve been working on as part of the project. So this is some stuff we announced 10 days ago. We demoed a thing at the Watson Developer Conference with IBM.

[IBM worked with Nexmo to demonstrate the ways Intu can be integrated with both Watson and third-party APIs to bring an additional dimension to cognitive interactions via voice-enabled experiences. The Watson and Nexmo demo also used the Nexmo Voice API’s support of WebSockets.]

They’ve got a whole AI bot platform [Project Intu] and we have built a connector to that, which is quite cool. What I’ll show here is the technology that’s actually enabling this connector.

So, we have a voice API much like many of the other voice APIs that allow you to build the obvious modules — talk to say things, and play audio, and collect input, and connect to endpoints. Those endpoints typically are phone numbers, SIP endpoints, that kind of thing. What we’ve now also built is a WebSocket as an endpoint. So this is sending the audio of a call as binary WebSocket messages to a server.
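[In the Nexmo Voice API, an endpoint like this is configured through the NCCO (Nexmo Call Control Object) your answer webhook returns. A minimal sketch of a connect action pointing at a WebSocket endpoint — the relay URL is a placeholder, and the exact field values are assumptions based on a typical configuration:]

```javascript
// Sketch of an NCCO that an answer webhook would return as JSON,
// telling the Voice API to connect the inbound call's audio to a
// WebSocket endpoint instead of a phone number or SIP URI.
const ncco = [
  {
    action: 'connect',
    endpoint: [
      {
        type: 'websocket',
        uri: 'wss://example.com/socket',        // hypothetical relay server
        'content-type': 'audio/l16;rate=16000', // raw 16-bit linear PCM
      },
    ],
  },
];

// The webhook response body is just this array serialized as JSON.
console.log(JSON.stringify(ncco));
```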

The kind of cool thing with this — forget too much about the browser stuff — is that we’re now bridging between the telephony world of SIP and phone calls and sessions and ringing and answering, and all these bot AI platforms, which don’t really understand the idea of a call and SIP. What they’re used to exposing is WebSockets. They take a straight open TCP connection, and you start streaming audio to them and back from them. So, that’s the language that Microsoft Cognitive Services speaks, that Watson speaks, that Amazon Alexa speaks — all these internet technology platforms. So, what we’ve really got here is we’re bridging the two worlds.

So, the little demo I have is I’m making a call into our voice platform and then I’m going to connect that WebSocket into a little relay server I’ve built, which is really just taking the WebSocket in, and I’m going to connect from a browser to the same WebSocket server.
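[The core of a relay server like the one Sam describes is a fan-out: binary audio frames arriving on the telephony-side WebSocket get re-broadcast to every connected browser listener. A sketch of that logic with the sockets mocked out — in a real server the listeners would be live WebSocket connections (e.g. via a library such as the `ws` package):]

```javascript
// Browser listeners currently connected to the relay.
const listeners = new Set();

function addListener(socket) {
  listeners.add(socket);
}

// One inbound audio frame from the call is pushed to every listener,
// giving the one-way broadcast described in the demo.
function relayFrame(frame) {
  for (const socket of listeners) {
    socket.send(frame);
  }
}

// Demonstration with mock sockets that just record what they receive.
const received = [];
addListener({ send: (f) => received.push(f) });
addListener({ send: (f) => received.push(f) });
relayFrame(Buffer.from([0x01, 0x02]));
console.log(received.length); // both listeners got the frame
```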

So, I’m going to push the audio down to the browser, which we all know we can do with WebRTC. But you’ll notice this is running in Safari, so this is not WebRTC because Safari doesn’t do WebRTC. This is WebSockets and web audio, two very standard web technologies. If I connect that and then make the call — please, don’t anyone else dial the number because it’s not really set up for multiple calls yet. Hopefully, fingers crossed…

“…this is running in Safari, so this is not WebRTC because Safari doesn’t do WebRTC. This is WebSockets and web audio, two very standard web technologies.”

[Automated voice greeting from Nexmo Voice API]

That’s our voice API, and now we’ve bridged to the WebSocket, and that’s coming down. [Echo] 1-2-3. 1-2-3. So there’s maybe a second of latency on this. This is kind of going through a bit of a hack server at the moment. But the interesting thing with this is this is now just a one-way audio stream. So we’re not doing conversation. We’re not so much doing this work for a real-time conversation thing, but imagine the kind of things like All-Hands calls and stuff when you have, like, five people talking and 200 people listening. We’re broadcasting the audio into the browser, and I’m just pushing it out through the Web Audio visualizer, which is analyzing it.
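[Before the Web Audio API can play the frames arriving over the WebSocket, the raw 16-bit PCM has to be converted to the Float32 samples in [-1, 1] that Web Audio buffers use. A sketch of that conversion, assuming little-endian 16-bit linear PCM — in the browser, the resulting array would be copied into an AudioBuffer and scheduled on an AudioContext:]

```javascript
// Convert a frame of 16-bit linear PCM into Float32 samples in [-1, 1].
function pcm16ToFloat32(arrayBuffer) {
  const pcm = new Int16Array(arrayBuffer);
  const floats = new Float32Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) {
    // Negative samples divide by 32768, positive by 32767, so both
    // ends of the 16-bit range map inside [-1, 1].
    floats[i] = pcm[i] / (pcm[i] < 0 ? 32768 : 32767);
  }
  return floats;
}

// Example: full-scale negative, silence, full-scale positive.
const frame = new Int16Array([-32768, 0, 32767]).buffer;
console.log(Array.from(pcm16ToFloat32(frame))); // [-1, 0, 1]
```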

And the quite cool thing: the actual code for this entire page is a single-page demo — you’ve got some CSS up here, a bit of HTML for the layout, and a little chunk of JavaScript. The whole thing is 86 lines with no libraries, no plugins, nothing. Just using native WebSockets, native web audio.

“…no libraries, no plugins, nothing. Just using native WebSockets, native web audio.”

JB: Wow, that is powerful. That is amazing.

SM: So really it’s the technology that’s got the potential here. The demo is pretty crude.

JB: I think you’re going to have a number of people from the audience coming and finding you afterwards, because that is so useful. There’s a million and one things you can do with that. Brill. Thank you.
