Nexmo Launches SSML Support for More Natural Text to Speech

Published January 04, 2018 by Glen Kunene

Great news for brands who engage customers with text to speech (TTS): those interactions don’t have to sound like a robot struggling to mimic human speech. Thanks to the new Speech Synthesis Markup Language (SSML) feature in the Nexmo Voice API, TTS can sound like a human being, with the typical characteristics of natural spoken language.

SSML is an XML-based markup language that developers can use to programmatically augment TTS audio so it’s easier for end users to understand. With SSML, developers can control characteristics of TTS output such as voice, pitch, rate, and pronunciation.

Used effectively, SSML can transform robotic-sounding synthetic speech into natural-sounding language. This capability can enhance the user experience of any TTS application, but it is particularly useful in use cases such as two-factor authentication (2FA) where the audio has to relay digits.

TTS Is Not Inherently Context-aware

The main problem with basic TTS is that it cannot resolve the ambiguity around how certain content should be spoken in different contexts. TTS is not inherently aware of context and will pronounce the provided input the same way every time. This is where SSML comes in. It lets developers encode the context in markup so that playback adjusts accordingly.

Take digits, for example. Given this number—4155551212—as input, how should a TTS reader pronounce it? There is more than one way:

  • “4 billion, 155 million, 551 thousand, 212”
  • “Four one five” [pause] “five five five” [pause] “one two one two”

Which is correct? It depends on the context. Does the number represent a PIN code? A phone number? An amount? Each of these should be pronounced differently. SSML enables developers to encode the content in a way that instructs the reader to pronounce it properly for the given context.
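As a rough illustration of the idea, the sketch below wraps the same digits in a <say-as> tag with different interpret-as values. The helper names (say_as, speak) are hypothetical, and the values "telephone", "cardinal", and "digits" are assumptions: supported values vary by speech engine, so check the SSML documentation before relying on them.

```python
from xml.sax.saxutils import escape, quoteattr


def say_as(text: str, interpret_as: str) -> str:
    """Wrap text in a <say-as> tag (hypothetical helper, not a Nexmo API)."""
    return f"<say-as interpret-as={quoteattr(interpret_as)}>{escape(text)}</say-as>"


def speak(*parts: str) -> str:
    """Join already-built SSML fragments inside a <speak> root element."""
    return "<speak>" + "".join(parts) + "</speak>"


number = "4155551212"

# Same input, three contexts. The interpret-as values are assumptions;
# engines differ in which values they accept.
as_phone = speak("Call us at ", say_as(number, "telephone"))
as_amount = speak("The total is ", say_as(number, "cardinal"))
as_pin = speak("Your code is ", say_as(number, "digits"))

print(as_phone)
```

The input string never changes here; only the markup around it does, which is how the developer passes the context to the reader.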

The Complexities of Spoken Language

Spoken language, which most people take for granted, is actually an intricate application of many auditory techniques. We apply concepts such as pausing, phonemes, prosody and language mixing in our spoken communication without even realizing how complex the exchange we’re having is. The ability to simply recite written text is a far cry from being able to communicate clearly through spoken language. Hearing content recited without these speech techniques properly applied can make what should be a simple message confusing or off-putting.

Some words have more than one pronunciation, for instance. Two native English speakers can pronounce vitamin, aluminum, privacy, schedule, and garage differently depending on their country of origin, or even on the region of that country where they were born. Using SSML to specify the common pronunciations in a target region can make a TTS exchange sound natural to the local customer base.

This becomes even more personal to the customer when the TTS system needs to speak their name. Like other words, many names can be spelled one way but pronounced another. Julia can begin with the “J” sound in one country but with the “Y” sound in another. Programming a TTS application to use the proper phonetic pronunciation will avoid mispronouncing names and improve the customer experience.
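Here is a minimal sketch of how a per-region pronunciation could be applied to a name with the <phoneme> tag. The lookup table, the region codes, and the greet helper are hypothetical, and the IPA strings are approximate transcriptions included only for illustration. Whether a given engine accepts the "ipa" alphabet is also an assumption to verify.

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical lookup: (name, region) -> approximate IPA transcription.
PRONUNCIATIONS = {
    ("Julia", "en-US"): "ˈdʒuːliə",  # begins with the "J" sound
    ("Julia", "de-DE"): "ˈjuːlia",   # begins with the "Y" sound
}


def phoneme(text: str, ipa: str) -> str:
    """Wrap text in a <phoneme> tag carrying an IPA pronunciation."""
    return f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(text)}</phoneme>'


def greet(name: str, region: str) -> str:
    """Build a greeting, using a phonetic override when one is known."""
    ipa = PRONUNCIATIONS.get((name, region))
    spoken = phoneme(name, ipa) if ipa else escape(name)
    return f"<speak>Hello, {spoken}.</speak>"


print(greet("Julia", "de-DE"))
print(greet("Julia", "fr-FR"))  # no entry, so the engine's default is used
```

Falling back to the plain name when no entry exists keeps the greeting usable for names the table does not cover.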

SSML provides tags for controlling speech elements like these in TTS audio. Key tags include:

  • <break> – indicates a configurable pause in the speech
  • <lang> – indicates the language of a specific word or phrase
  • <phoneme> – provides a phonetic pronunciation for the indicated text
  • <prosody> – controls the volume, rate, and pitch of the text delivery
  • <say-as> – indicates how the input text should be interpreted
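To show how a few of these tags combine, the sketch below builds a 2FA-style message that slows the delivery with <prosody> and inserts a <break> between groups of digits. All function names, the group size, and the pause length are illustrative assumptions, and whether space-separated digits are read one at a time depends on the speech engine. How the resulting string is handed to the Voice API is not shown here.

```python
from xml.sax.saxutils import escape


def break_tag(ms: int) -> str:
    """A pause of the given length in milliseconds."""
    return f'<break time="{ms}ms"/>'


def lang(text: str, code: str) -> str:
    """Mark a phrase as belonging to another language."""
    return f'<lang xml:lang="{code}">{escape(text)}</lang>'


def spoken_code(code: str, group: int = 3, pause_ms: int = 400) -> str:
    """Space out the digits, pause between groups, and slow the delivery."""
    groups = [code[i:i + group] for i in range(0, len(code), group)]
    spaced = [escape(" ".join(g)) for g in groups]
    return f'<prosody rate="slow">{break_tag(pause_ms).join(spaced)}</prosody>'


ssml = "<speak>Your verification code is " + spoken_code("482913") + "</speak>"
print(ssml)
```

Grouping the digits mirrors how people read codes aloud, and the pause gives the listener time to write each group down.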

SSML also enables other cool effects, such as dynamic range compression, whispering, and vocal tract length adjustment, that can further enhance the caller’s experience.

Alternatives to SSML?

To deliver truly natural pronunciation and intonation, there’s no substitute for recorded human speech. That would be the most natural-sounding audio to play to customers, but it would eliminate the cost-effectiveness and scalability of synthetic speech. A business would need to hire speech talent to record all the anticipated responses—in all languages and dialects—to equal the functional message delivery of TTS.

While recorded speech may be feasible for some applications (in fact, the Nexmo Voice API supports streaming audio files into live calls), most applications will require TTS to be cost-effective, and we’ve already covered the natural language shortcomings of basic TTS. The combination of TTS and SSML-enabled control, as offered by Nexmo, delivers more natural speech while remaining cost-effective.

If you’d like to try this cool new feature, check out the SSML documentation and start making your TTS applications speak more naturally. Your customers will be delighted.
