
What Is VoiceXML?

Published February 21, 2015 by Tim Lytle

If you’re just getting started building phone-enabled applications, you may be wondering what exactly VoiceXML is used for. In this post I’ll give an overview of what VoiceXML is, what it’s used for, and the general components of a VoiceXML document.

In a nutshell, VoiceXML is like HTML for a phone call: it describes what the call should “look” like when a call connects a person with an interactive voice response (IVR) system.

While HTML is a standard that specifies how information is exchanged between humans and computers through a web browser, VoiceXML (VXML) is a similar standard for connecting humans and computers through a phone call. More than likely, you’ve been on the human side of that equation when calling your bank, contacting customer service, or checking the status of a package.

It’s so similar to its web counterpart that platforms that process VoiceXML are called ‘browsers’. Like HTML, VoiceXML is a standard governed by the World Wide Web Consortium (W3C). That means building a VXML application doesn’t lock you in to a vendor. And if you have an existing VoiceXML application, bringing it to Nexmo requires few, if any, modifications.

But how does this all work with Nexmo’s Voice API? VoiceXML documents control the flow of a phone call, while requests to Nexmo’s API start outbound calls. For inbound calls, the API sends a callback, just as it does for inbound SMS.

Unlike traditional VoiceXML platforms, Nexmo fetches the VoiceXML document from the URL you provide. For outbound calls that URL is part of your request to initiate a call. For inbound calls, you define that URL for each inbound number on your account.

Because the VoiceXML documents are fetched for each call rather than hosted on our platform, as they would be with a traditional IVR provider, you’re able to dynamically generate content and easily add voice support to existing systems. For example, the same infrastructure that provides order status over the web can be extended to provide the same information over the phone.

So let’s take a look at what makes a VoiceXML document. I’ll cover the major components and how they relate to each other, and include links to our VoiceXML Quickstart guides as well as the W3C’s VoiceXML specification.


Forms Are The Building Blocks

The primary element of a VXML document is the <form> element. That’s not to be confused with the ‘root’ element: just as an HTML document starts with <html>, a VXML document starts with <vxml>. Where web developers structure a page with <div> elements (or, these days, more semantic elements), VXML uses the <form> element to define what are called “voice dialogs.”

A form is the phone equivalent of a dialog box and can provide information to a user using the <prompt> element, or request information from the user with a <field> element. A VXML document can contain multiple forms, and they can be navigated to and from during the call.
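To make that concrete, here is a minimal sketch of a document with two forms, one that only speaks and one that asks a question. The form ids, the field name, and the wording are made up for illustration, and the type="digits" built-in grammar is an assumption: the specification leaves built-in types optional, so check that your platform supports it.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">

  <!-- A form that only provides information, then moves to the next form -->
  <form id="welcome">
    <block>
      <prompt>Thanks for calling the order status line.</prompt>
      <goto next="#get_order"/>
    </block>
  </form>

  <!-- A form that requests information; the answer is stored in "order_number" -->
  <form id="get_order">
    <field name="order_number" type="digits">
      <prompt>Please say or enter your order number.</prompt>
    </field>
  </form>

</vxml>
```

Note that a prompt sits inside a form item such as <block> or <field>, not directly under <form>.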

Just like HTML forms, VXML forms can submit the collected data to a URL. The data collected by the form can also be used in a prompt to notify the user of what they selected, confirm their choice, and make the prompts dynamic.
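As a rough sketch, a field can read its value back to the caller and then post it to your server. The URL and field name here are hypothetical, and the example assumes your endpoint replies with another VoiceXML document.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="get_order">
    <field name="order_number" type="digits">
      <prompt>Please say or enter your order number.</prompt>
      <filled>
        <!-- Use the collected value in a prompt to confirm what was heard -->
        <prompt>Looking up order <value expr="order_number"/>.</prompt>
        <!-- Send the value to a hypothetical URL, much like an HTML form post -->
        <submit next="https://example.com/order-status"
                namelist="order_number" method="post"/>
      </filled>
    </field>
  </form>
</vxml>
```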

While text-to-speech (TTS) allows for dynamic prompts, audio files can be referenced for a higher-quality, although static, prompt. Text-to-speech can easily be mixed with recorded audio to allow dynamic data, or be used as a fallback should the audio not be available. This makes it possible to build an IVR quickly, referencing audio files that will be added later.
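Here is a small sketch of mixing the two. The text inside <audio> is spoken with TTS only if the file can't be fetched or played, and the sentence after it is always TTS. The audio URL and the delivery_day variable are placeholders.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <!-- In a real application this value would come from your own system -->
  <var name="delivery_day" expr="'Tuesday'"/>

  <form id="status">
    <block>
      <prompt>
        <!-- Recorded audio, with TTS fallback text as the element content -->
        <audio src="https://example.com/audio/order-shipped.wav">
          Your order has shipped.
        </audio>
        <!-- Dynamic data is spoken with TTS -->
        It should arrive on <value expr="delivery_day"/>.
      </prompt>
    </block>
  </form>
</vxml>
```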

Grammar Is For Listening

Forms and fields define what data is collected, but the <grammar> element is used to define what to listen for. The simplest of grammars is a list of words to recognize; however, more complex grammars may define optional words, weight certain phrases more heavily than others, or map multiple phrases to the same output value.

There are a few different formats that can be used to define grammars, and they can be either inline to the document or external. While grammars tend to live inside a field, it’s also possible to define global grammars that are active throughout the entire call.

Grammars are not limited to spoken responses; they are used to capture dialed digits as well.
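As an illustration, the field below carries two inline grammars in the W3C's XML grammar format, one for speech and one for key presses. The department names are invented, and the tag-format shown (literal semantic tags) is an assumption, since support for semantic tags varies between platforms.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="route">
    <field name="department">
      <prompt>Say sales or support, or press 1 for sales, 2 for support.</prompt>

      <!-- Spoken input -->
      <grammar mode="voice" xml:lang="en-US" version="1.0" root="dept"
               tag-format="semantics/1.0-literals">
        <rule id="dept" scope="public">
          <one-of>
            <!-- A weight makes this phrase more likely to be matched -->
            <item weight="2.0">sales <tag>sales</tag></item>
            <!-- Two phrases mapped to the same output value -->
            <item>support <tag>support</tag></item>
            <item>help desk <tag>support</tag></item>
          </one-of>
          <!-- An optional trailing word -->
          <item repeat="0-1">please</item>
        </rule>
      </grammar>

      <!-- Dialed digits mapped to the same values -->
      <grammar mode="dtmf" version="1.0" root="dept_keys"
               tag-format="semantics/1.0-literals">
        <rule id="dept_keys" scope="public">
          <one-of>
            <item>1 <tag>sales</tag></item>
            <item>2 <tag>support</tag></item>
          </one-of>
        </rule>
      </grammar>
    </field>
  </form>
</vxml>
```

Either way the caller answers, the field ends up holding "sales" or "support", so the rest of the document doesn't need to care how the input arrived.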

Sometimes recognizing a phrase is not enough, and you need to capture what the caller actually says. For those cases, such as a voicemail system, we have the <record> element. Like a field, its home is in a form, but instead of a value based on a grammar, its value is the recorded audio of the response.

When the data is submitted to a URL, the recording can be sent just like a file upload from an HTML form. It’s also possible to record what the caller says while a grammar is listening, sending both the recognized response and the audio that was spoken.
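A bare-bones voicemail form might look something like the sketch below. The URL, time limits, and audio type are illustrative, and the audio formats a platform accepts vary. For recording alongside recognition, VoiceXML 2.1 defines a recordutterance property, but check whether your platform supports it.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="voicemail">
    <!-- The value of "message" is the recorded audio, not a recognized phrase -->
    <record name="message" beep="true" maxtime="60s" finalsilence="3s"
            dtmfterm="true" type="audio/x-wav">
      <prompt>Please leave a message after the beep.</prompt>
      <noinput>
        I didn't hear anything.
        <reprompt/>
      </noinput>
    </record>

    <!-- Upload the recording the same way an HTML form uploads a file -->
    <block>
      <submit next="https://example.com/voicemail" namelist="message"
              method="post" enctype="multipart/form-data"/>
    </block>
  </form>
</vxml>
```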

Being Logical About Events

When the caller’s response doesn’t match a grammar, or when there’s no response at all, the VXML platform will throw a matching event. VXML has a few shorthand elements that catch those common events. It’s also possible to catch other events. For example, when a caller disconnects, that event could be caught and the call data logged.
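Here is a sketch of both styles: the <noinput> and <nomatch> shorthands on a field, and a general <catch> for the hangup event. The prompts are made up, and type="boolean" assumes your platform provides the built-in yes/no grammar.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="confirm">
    <field name="answer" type="boolean">
      <prompt>Would you like to hear your order status?</prompt>

      <!-- Shorthand for catching the "noinput" event -->
      <noinput>
        Sorry, I didn't hear you.
        <reprompt/>
      </noinput>

      <!-- Shorthand for catching the "nomatch" event -->
      <nomatch>Sorry, I didn't understand. Please say yes or no.</nomatch>

      <!-- After the third failed attempt, give up politely -->
      <nomatch count="3">
        Let's try again another time. Goodbye.
        <exit/>
      </nomatch>
    </field>

    <!-- Any other event can be caught by name, such as the caller hanging up -->
    <catch event="connection.disconnect.hangup">
      <log>Caller hung up before answering.</log>
      <exit/>
    </catch>
  </form>
</vxml>
```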

You can also throw custom events, allowing more complex interaction between a form and the rest of the VXML document. If there are events, then there must be some kind of logic support. The <if> element (along with the related <else> and <elseif> elements) provides conditional logic that can be tied to the results of a form field, or any other variable.

VXML uses ECMAScript (JavaScript) to express the evaluated condition, but scripting is not limited to those three elements. The <script> element, much like the HTML script element, allows inline or external code, and ECMAScript may be used throughout the VXML to modify variables, control flow, create dynamic prompts and more.
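Putting those pieces together, the sketch below validates a field with a script function, branches on the result, and throws a custom event that a document-level handler catches. The six-digit rule, the function name, and the event name com.example.invalid_order are all invented for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <!-- Inline ECMAScript, available to every expression in the document -->
  <script>
    <![CDATA[
      function isValidOrder(digits) {
        return digits.length === 6;
      }
    ]]>
  </script>

  <!-- Handler for the custom event thrown below -->
  <catch event="com.example.invalid_order">
    <prompt>That order number doesn't look right.</prompt>
    <goto next="#get_order"/>
  </catch>

  <form id="get_order">
    <field name="order_number" type="digits">
      <prompt>Please enter your six digit order number.</prompt>
      <filled>
        <if cond="isValidOrder(order_number)">
          <prompt>Thank you.</prompt>
        <elseif cond="order_number.length &lt; 6"/>
          <prompt>That was too short.</prompt>
          <!-- Clearing the field makes the form ask for it again -->
          <clear namelist="order_number"/>
        <else/>
          <throw event="com.example.invalid_order"/>
        </if>
      </filled>
    </field>
  </form>
</vxml>
```

Because the conditions live in XML attributes, a less-than sign has to be written as &lt;.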

Getting Organized

A single VXML document can have multiple forms, and multiple fields within each form. You can define what form to transition to once one is complete, or even transition based on the value of a single field. You can also transition to other VXML documents, or specific forms in those documents.
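For instance, a menu-style field can send the caller to another form in the same document or to a form in a different document. The form ids and the agent.vxml URL are placeholders, and I've used <option> elements as a shortcut so the field doesn't need a separate grammar.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="main_menu">
    <field name="choice">
      <prompt>Say order status or agent, or press 1 or 2.</prompt>
      <!-- Each option defines what to listen for and the resulting value -->
      <option dtmf="1" value="status">order status</option>
      <option dtmf="2" value="agent">agent</option>
      <filled>
        <if cond="choice == 'status'">
          <!-- Another form in this document -->
          <goto next="#order_status"/>
        <else/>
          <!-- A specific form in another document -->
          <goto next="https://example.com/agent.vxml#queue"/>
        </if>
      </filled>
    </field>
  </form>

  <form id="order_status">
    <block>
      <prompt>Your order is on its way.</prompt>
    </block>
  </form>
</vxml>
```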

Using a ‘root document’, it is possible to share variables and scripts, and define global grammars and event handling across multiple VXML documents. Subdialogs can be used as application building blocks, wrapping common activities into reusable components.

Subdialogs are not aware of the main application, so they can be used to provide generic third-party services to other VXML applications. Things like address confirmation or voiceprint identification can be packaged as a subdialog and provided as a service.
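As a sketch of how the two fit together, the document below names a root document with the application attribute and calls a subdialog. Everything specific is an assumption for illustration: the file app-root.vxml and its company variable, the subdialog URL, and the postal_code parameter and street value it exchanges.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- The application attribute points at the shared root document -->
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml"
      application="app-root.vxml">
  <form id="checkout">
    <block>
      <!-- Assumes app-root.vxml declares <var name="company" .../> -->
      <prompt>Welcome to <value expr="application.company"/>.</prompt>
    </block>

    <!-- The subdialog only sees what is passed in with <param> -->
    <subdialog name="address"
               src="https://example.com/dialogs/confirm-address.vxml">
      <param name="postal_code" expr="'94105'"/>
      <filled>
        <!-- Assumes the subdialog ends with <return namelist="street"/> -->
        <prompt>Thanks. Shipping to <value expr="address.street"/>.</prompt>
      </filled>
    </subdialog>
  </form>
</vxml>
```

On the other side, the subdialog's form would declare a matching <var name="postal_code"/> to receive the parameter and hand results back with <return>.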

When It’s Time To Go


When a form is submitted and no VXML is returned, when there’s no more VXML to process, or when the <exit> element is encountered, the call ends.

However, the <transfer> element can also be used to connect the call to another party. This allows VXML to be used in cases where data is collected before connecting to other systems, like a call center, or to present the user with a directory and connect them to the right party.
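A simple bridged transfer could be sketched like this. The phone number is fictional and the timeout is arbitrary. With a bridged transfer the form item variable reports how the call ended, using values such as busy and noanswer from the specification.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="connect">
    <transfer name="agent_call" dest="tel:+12025550100"
              bridge="true" connecttimeout="20s">
      <prompt>Connecting you to an agent.</prompt>
      <filled>
        <!-- Runs when the transfer attempt finishes -->
        <if cond="agent_call == 'busy' || agent_call == 'noanswer'">
          <prompt>Sorry, no one is available right now. Goodbye.</prompt>
        </if>
        <exit/>
      </filled>
    </transfer>
  </form>
</vxml>
```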

With that, it’s time for this post to <exit>. Or, if you want to <transfer> to our upcoming VXML webinar and learn more, register (Americas, EMEA, or APAC) and we’ll send you a reminder, no need to stay on the line.
