Text-to-speech functionality in web applications using the iSpeech API

“Turn right onto Main Street”. If you are using a car navigation system like Garmin or TomTom, you will be all too familiar with phrases like this, spoken to you by the device mounted to your dashboard, magically guiding you through the maze of local roads and highways to your destination. Such text-to-speech capabilities are usually associated with native software implementations, not so much with web applications. With the iSpeech API, that changes. You can now build web applications that implement text-to-speech functionality with the power of JavaScript and HTML5.

The iSpeech demo page gives you an idea of the power of the API: it provides a regular text input and turns any text you enter into spoken words. Let’s take a closer look at what is required to make this work in a web application.

Step 1: Sign up for a Developer Account and set up an API key

You need a developer account to use the API. After creating your account and logging in, you can create a new API key. Fill in the form and select “Desktop, Web, Other” as the application type so that you can use iSpeech’s REST API.

Step 2: Trying out the API

After creating an API key, you should find it in your list of available API keys. Clicking on “Settings” leads you to a form where you can set certain parameters for the text-to-speech functionality, like file formats, bit rates and frequency as well as add some more information about the app that uses the key. For this step, we just want to try out the API and get some text converted into speech, so the first thing we need to do is familiarize ourselves with how the API request is constructed. iSpeech provides good documentation on that and gives us the general layout of the request:

http://api.ispeech.org/api/rest?apikey=YOURAPIKEYHERE&action=convert&text=This+is+the+text+I+want+to+convert

Now we only need to replace YOURAPIKEYHERE with the API key we created in Step 1. We can then either copy the request URL into the address bar of our browser or use it as the source of an HTML5 audio element embedded in our HTML document.
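As a rough sketch of the second option, the request URL can be assembled in JavaScript and handed to an audio element. The helper names buildConvertUrl and speak are hypothetical and not part of iSpeech; only the endpoint and the apikey, action and text parameters come from the request layout above.

```javascript
// Endpoint taken from the request layout shown above.
const ISPEECH_ENDPOINT = "http://api.ispeech.org/api/rest";

// Hypothetical helper: builds the "convert" request URL for a piece of text.
function buildConvertUrl(apiKey, text) {
  const url = new URL(ISPEECH_ENDPOINT);
  url.searchParams.set("apikey", apiKey);
  url.searchParams.set("action", "convert");
  // URLSearchParams serializes spaces as "+", matching the layout above.
  url.searchParams.set("text", text);
  return url.toString();
}

// Hypothetical helper (browser only): plays the converted text.
function speak(apiKey, text) {
  const audio = new Audio(buildConvertUrl(apiKey, text));
  // In modern browsers play() returns a promise that rejects
  // if playback is blocked, e.g. by an autoplay policy.
  return audio.play();
}
```

In a page, calling speak("YOURAPIKEYHERE", "Turn right onto Main Street") from a click handler should be enough to hear the result, assuming the key is valid and the browser can play the returned format.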

Step 3: Tweaking the Request Parameters

The attentive reader might have noticed that an audio element like the one from step 2 needs several fallback versions in addition to the MP3 format that the iSpeech API generates by default. This is necessary to satisfy the different codec requirements of modern browsers. To achieve this, the iSpeech API has a format parameter that lets us specify the format in which we would like the audio to be returned. A complete list of supported formats can be found in the iSpeech API documentation.

If we want the Ogg Vorbis format, we can request it like this:

http://api.ispeech.org/api/rest?apikey=YOURAPIKEYHERE&action=convert&text=This+is+the+text+I+want+to+convert&format=ogg
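To illustrate the fallback idea, here is a minimal sketch that generates audio markup with one source per format. The helper names are hypothetical, and "mp3" as a value for the format parameter is an assumption on my part; "ogg" is the value used in the request above.

```javascript
// MIME types for the <source> "type" attribute. The "mp3" format value is
// an assumption; "ogg" is the value used in the request above.
const MIME_TYPES = new Map([
  ["mp3", "audio/mpeg"],
  ["ogg", "audio/ogg"],
]);

// Escapes characters that are unsafe inside a double-quoted HTML attribute.
function escapeAttribute(value) {
  return value
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Hypothetical helper: returns <audio> markup with one <source> per format.
function buildAudioMarkup(apiKey, text, formats = ["mp3", "ogg"]) {
  const sources = formats.map((format) => {
    const type = MIME_TYPES.get(format);
    if (!type) throw new Error(`No MIME type known for format "${format}"`);
    const url = new URL("http://api.ispeech.org/api/rest");
    url.searchParams.set("apikey", apiKey);
    url.searchParams.set("action", "convert");
    url.searchParams.set("text", text);
    url.searchParams.set("format", format);
    return `  <source src="${escapeAttribute(url.toString())}" type="${type}">`;
  });
  return ["<audio controls>", ...sources, "</audio>"].join("\n");
}
```

The browser walks through the source elements in order and plays the first one it supports, so the preferred format should come first in the list.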

We can also specify a different voice using the “voice” parameter:

http://api.ispeech.org/api/rest?apikey=YOURAPIKEYHERE&action=convert&text=This+is+the+text+I+want+to+convert&format=ogg&voice=auenglishfemale

It is also possible to slow down or speed up the voice by providing a “speed” parameter with values between -10 (very slow) and 10 (very fast). However, I noticed that not all voices provided by iSpeech support this parameter.

http://api.ispeech.org/api/rest?apikey=YOURAPIKEYHERE&action=convert&text=This+is+the+text+I+want+to+convert&format=ogg&speed=-5
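The optional parameters can be collected in a small helper so that the URL does not have to be edited by hand. This is only a sketch: the function name buildConvertUrlWithOptions and the options object are my own invention, and the range check simply mirrors the -10 to 10 range mentioned above.

```javascript
// Hypothetical helper: builds a convert request with optional parameters.
// The parameter names (format, voice, speed) are the ones used in the URLs above.
function buildConvertUrlWithOptions(apiKey, text, options = {}) {
  const { format, voice, speed } = options;

  if (speed !== undefined) {
    // NaN fails both comparisons, so it is rejected as well.
    const inRange = typeof speed === "number" && speed >= -10 && speed <= 10;
    if (!inRange) {
      throw new RangeError("speed must be a number between -10 and 10");
    }
  }

  const url = new URL("http://api.ispeech.org/api/rest");
  url.searchParams.set("apikey", apiKey);
  url.searchParams.set("action", "convert");
  url.searchParams.set("text", text);
  // Only send the parameters that were actually provided.
  if (format !== undefined) url.searchParams.set("format", format);
  if (voice !== undefined) url.searchParams.set("voice", voice);
  if (speed !== undefined) url.searchParams.set("speed", String(speed));
  return url.toString();
}
```

For example, buildConvertUrlWithOptions("YOURAPIKEYHERE", "This is the text I want to convert", { format: "ogg", speed: -5 }) produces the same request as the URL above.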

Sounds great (literally), but…

By this point it is probably obvious that being able to use this text-to-speech API from within HTML documents using regular web technologies is pretty powerful. However, there is a catch: cost. The web API key that we created is free only for the first 500 words converted to audio by the API (you can check your key’s total word balance on the settings page of the key). If you exceed that limit, you have to pay: $50 for 1000 word credits, to be exact. In practice, this means that you can only feasibly use this API in some sort of premium app setting where the revenue generated covers these costs. While I would love this technology to be free, I can understand that it is not, considering the amount of work that goes into developing such a piece of software.
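To get a feeling for how quickly this adds up, here is a back-of-the-envelope sketch. It assumes that one word credit covers one converted word and that credits can only be bought in blocks of 1000; both are my assumptions, not something stated by iSpeech.

```javascript
const FREE_WORDS = 500;      // free allowance mentioned above
const BLOCK_SIZE = 1000;     // word credits per purchase (assumed to be sold in whole blocks)
const BLOCK_PRICE_USD = 50;  // price per block mentioned above

// Hypothetical helper: estimates the cost in USD for a total number of
// converted words, assuming one credit per word.
function estimateCostUsd(totalWords) {
  const billableWords = Math.max(0, totalWords - FREE_WORDS);
  const blocks = Math.ceil(billableWords / BLOCK_SIZE);
  return blocks * BLOCK_PRICE_USD;
}
```

Under these assumptions, a navigation-style app that speaks 10,500 words in total would already cost $500.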

The upside, however, is that in addition to the REST API, iSpeech also offers SDKs for pretty much all popular mobile platforms, such as the iPhone, and API keys generated for these do not have any word quota imposed on them. Maybe another good reason to start learning Objective-C.