Wavenet TTS API Interface

Joel Dueck <joel at jdueck dot net>

 (require wavenet) package: wavenet

A Racket interface for Google’s Wavenet text-to-speech engine.

The functions in this module make HTTP requests to a Google Cloud API (see endpoint). You will need a valid API key from Google in order to make use of this package.

The source code is on GitHub and licensed under the Blue Oak Model License 1.0.0.

Here’s an example program:

#lang racket
 
(require wavenet
         racket/gui/base)
 
(api-key (file->string "api.rktd"))
 
;; One way to pick a voice. (Use voice-names to list available voices.)
(define eliza (select-voice "en-GB-Wavenet-F (FEMALE)"))
 
;; Another way to pick a voice. Statically defining a voice allows us to avoid
;; an extra API call to fetch voices every time the program is run. But you
;; can’t just make up your own values!
(define british-dude
  #hasheq((languageCodes . ("en-GB"))
         (name . "en-GB-Wavenet-B")
         (naturalSampleRateHertz . 24000)
         (ssmlGender . "MALE")))
 
;; Turn your sound up and call this function!
(define (say text)
  (synthesize text british-dude #:output-file "temp.mp3")
  (play-sound "temp.mp3" #t))

parameter

(api-key) → string?

(api-key key-string) → void?
  key-string : string?
 = #f
A parameter for your Google Cloud API key. You must set this before calling voice-names, select-voice or synthesize, or an exception will be raised.

You should store this key in a separate file and make sure to exclude that file from Git (or whatever version control system you use). Then you can load it up at runtime:

(api-key (file->string "api.rktd"))
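One thing to watch with this approach: file->string returns the whole file, so a trailing newline added by an editor ends up inside the key string. Whether that matters to the API is not stated here; the sketch below simply assumes that trimming is harmless. load-api-key is a hypothetical helper, not part of this package, and the key value is made up.

```racket
#lang racket

;; Hypothetical helper (not provided by wavenet): read a key file and
;; strip surrounding whitespace such as a trailing newline.
(define (load-api-key path)
  (string-trim (file->string path)))

;; Demonstration with a throwaway file and a made-up key value.
(define demo-path (make-temporary-file "api-key-~a.rktd"))
(display-to-file "not-a-real-key\n" demo-path #:exists 'truncate)
(define demo-key (load-api-key demo-path))
(delete-file demo-path)
```

With the package loaded, this would be used as (api-key (load-api-key "api.rktd")).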

parameter

(endpoint) → string?

(endpoint uri) → void?
  uri : string?
 = "https://texttospeech.googleapis.com/v1/"
A parameter holding the URL to use for API calls. You can change it if you wish to use a different version of the API.
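Because endpoint is an ordinary Racket parameter, it can be changed for a limited stretch of code with parameterize rather than set globally. The sketch below uses a local stand-in parameter so that it runs without the package; the voices-url helper and the way it builds a URL are assumptions for illustration, not a description of what wavenet does internally.

```racket
#lang racket

;; Stand-in for wavenet's `endpoint` parameter (same default value).
(define endpoint (make-parameter "https://texttospeech.googleapis.com/v1/"))

;; Hypothetical helper: build a request URL from the current endpoint.
(define (voices-url)
  (string-append (endpoint) "voices"))

(define default-url (voices-url))

;; The new value applies only inside the body of parameterize.
(define beta-url
  (parameterize ([endpoint "https://texttospeech.googleapis.com/v1beta1/"])
    (voices-url)))
```

After the parameterize form returns, (endpoint) yields the default value again.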

procedure

(voice-names [prefix]) → (listof string?)

  prefix : string? = ""
Returns a list of names of voices available for you to use.

If prefix is provided, only names that begin with prefix will be included in that list. Voice names have a standard format (for example, "en-AU-Wavenet-A (FEMALE)"), so prefix is good for narrowing the list to particular languages, or language/engine combinations.

Each time your program is run, the first call to either voice-names or select-voice will generate an API call to endpoint to fetch information about voices currently available from Google Cloud. If api-key is not set, or if the HTTP response code is anything other than 200, an exn:fail:user exception is raised. Subsequent calls to these two functions will refer to a local cache of this information instead of making another API call.
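The effect of prefix can be pictured as a plain string-prefix filter over the full list of names. The following is a minimal sketch under that assumption; names-with-prefix is a hypothetical stand-in, and the sample list is illustrative rather than fetched from the API.

```racket
#lang racket

;; Illustrative sample only; the real list comes from the API.
(define sample-names
  '("en-AU-Wavenet-A (FEMALE)"
    "en-GB-Wavenet-B (MALE)"
    "en-GB-Wavenet-F (FEMALE)"))

;; Keep only the names that begin with `prefix`.
;; The default "" is a prefix of every string, so it keeps everything.
(define (names-with-prefix names [prefix ""])
  (filter (λ (name) (string-prefix? name prefix)) names))

(define british-names (names-with-prefix sample-names "en-GB"))
(define all-names (names-with-prefix sample-names))
```

With the package itself, the equivalent call would be (voice-names "en-GB").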

procedure

(select-voice voice-name) → voice?

  voice-name : string?
Returns the voice identified by the voice-name argument; this argument must match one of the names returned by voice-names.

Each time your program is run, the first call to either voice-names or select-voice will generate an API call to endpoint to fetch information about voices currently available from Google Cloud. If api-key is not set, or if the HTTP response code is anything other than 200, an exn:fail:user exception is raised. Subsequent calls to these two functions will refer to a local cache of this information instead of making another API call.
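The fetch-once behavior described above is a common memoization pattern. Here is a rough sketch of the idea, with a counter standing in for the network call; fetch-voices, voice-cache and cached-voices are hypothetical names, and the package's actual cache may be organized differently.

```racket
#lang racket

(define fetch-count 0)   ; counts simulated API calls

;; Stand-in for the network request; returns a made-up voice list.
(define (fetch-voices)
  (set! fetch-count (add1 fetch-count))
  '("en-GB-Wavenet-B (MALE)" "en-GB-Wavenet-F (FEMALE)"))

(define voice-cache #f)  ; #f until the first lookup

;; Fetch on first use, then reuse the stored result.
(define (cached-voices)
  (unless voice-cache
    (set! voice-cache (fetch-voices)))
  voice-cache)

(define first-call (cached-voices))
(define second-call (cached-voices))
```

Both calls return the same list, and the simulated request runs only once.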

procedure

(synthesize text
            voice-or-name
            [#:output-file filename]) → (or/c bytes? integer?)
  text : string?
  voice-or-name : (or/c voice? string?)
  filename : (or/c #f path-string?) = #f
Makes an API request to endpoint to synthesize text into MP3 audio using voice-or-name, which must be either a voice or a string matching one of the names returned by voice-names. Note that the API limits the length of text; as of this writing the limit is 5,000 characters per request.
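For longer passages, one option is to split the text into pieces that each fit under the limit and synthesize them one at a time. The helper below is a hypothetical sketch, not something this package provides: it breaks on whitespace, takes the limit as an optional argument defaulting to 5000, and counts characters with string-length.

```racket
#lang racket

;; Split `text` into pieces of at most `limit` characters, breaking between
;; whitespace-separated words. Runs of whitespace (including newlines) are
;; collapsed to single spaces. A word longer than `limit` is cut hard.
(define (chunk-text text [limit 5000])
  (define (hard-split word)
    (if (<= (string-length word) limit)
        (list word)
        (cons (substring word 0 limit)
              (hard-split (substring word limit)))))
  (define words (append-map hard-split (string-split text)))
  (for/fold ([chunks '()]
             [current ""]
             #:result (reverse (if (string=? current "")
                                   chunks
                                   (cons current chunks))))
            ([word (in-list words)])
    (cond
      ;; Start the first chunk.
      [(string=? current "") (values chunks word)]
      ;; The word still fits (plus one separating space).
      [(<= (+ (string-length current) 1 (string-length word)) limit)
       (values chunks (string-append current " " word))]
      ;; Close the current chunk and start a new one.
      [else (values (cons current chunks) word)])))
```

Each resulting string could then be passed to synthesize in turn; joining the audio from several requests is outside the scope of this sketch.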

If api-key is not set, or if the HTTP response code is anything other than 200, an exn:fail:user exception is raised.
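A caller that wants to recover from such a failure can catch it with with-handlers and the exn:fail:user? predicate. The sketch below simulates the failure with raise-user-error (which raises exn:fail:user) so it runs without network access; failing-request and try-request are hypothetical names.

```racket
#lang racket

;; Stand-in for a wavenet call that fails.
;; `raise-user-error` raises an exn:fail:user exception.
(define (failing-request)
  (raise-user-error 'synthesize "simulated failure: HTTP 403"))

;; Run `thunk`; return its result, or #f after reporting the message.
(define (try-request thunk)
  (with-handlers ([exn:fail:user?
                   (λ (e)
                     (eprintf "request failed: ~a\n" (exn-message e))
                     #f)])
    (thunk)))

(define failed-result (try-request failing-request))
(define ok-result (try-request (λ () #"audio")))
```

With the real package, the thunk would wrap a call such as (synthesize text british-dude).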

If #:output-file is specified, the bytes of the MP3 audio are saved to that file, silently overwriting it if it exists already, and the return value is the number of bytes written out. Otherwise the MP3 audio bytes are themselves returned.
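The two return conventions can be mimicked with a few lines of ordinary file I/O. This is a minimal sketch of the described behavior, not the package's own code: deliver-audio is a hypothetical helper, and the bytes are placeholder data rather than real MP3 audio.

```racket
#lang racket

;; Hypothetical helper mirroring the return convention described above:
;; with a filename, write the bytes (replacing any existing file) and
;; return the byte count; without one, return the bytes themselves.
(define (deliver-audio audio-bytes [filename #f])
  (cond
    [filename
     (call-with-output-file filename
       #:exists 'truncate/replace
       (λ (out) (write-bytes audio-bytes out)))
     (bytes-length audio-bytes)]
    [else audio-bytes]))

(define fake-audio #"not really MP3 data")
(define demo-file (make-temporary-file "audio-~a.mp3"))
(define written (deliver-audio fake-audio demo-file))
(define on-disk (file->bytes demo-file))
(delete-file demo-file)
(define in-memory (deliver-audio fake-audio))
```

Because the return type depends on whether #:output-file is given, code that calls synthesize should know which form it is using before treating the result as a count or as audio data.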

struct

(struct voice (languageCodes
    name
    naturalSampleRateHertz
    ssmlGender))
  languageCodes : (listof string?)
  name : string?
  naturalSampleRateHertz : integer?
  ssmlGender : string?
A struct-like type containing information about a single voice available for speech synthesis. In reality voice is a hash-view, i.e. a hash that can be accessed with struct-like accessor functions. Because it is a hash table, it can be easily marshaled to and from a JSON representation of the same data.

Examples:
> (require wavenet json)
> (define british-dude (voice '("en-GB") "en-GB-Wavenet-B" 24000 "MALE"))
> british-dude
'#hasheq((languageCodes . ("en-GB"))
         (name . "en-GB-Wavenet-B")
         (naturalSampleRateHertz . 24000)
         (ssmlGender . "MALE"))
> (hash-ref british-dude 'name)
"en-GB-Wavenet-B"
> (voice-name british-dude)
"en-GB-Wavenet-B"
> (display (jsexpr->string british-dude))
{"languageCodes":["en-GB"],"name":"en-GB-Wavenet-B","naturalSampleRateHertz":24000,"ssmlGender":"MALE"}
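Going the other direction, a voice saved as JSON can be read back with string->jsexpr from the json library, which yields a hasheq keyed by symbols. The sketch below uses only json, so it runs without this package; it assumes the resulting hash is usable wherever a statically defined voice hash would be.

```racket
#lang racket

(require json)

(define voice-json
  (string-append
   "{\"languageCodes\":[\"en-GB\"],\"name\":\"en-GB-Wavenet-B\","
   "\"naturalSampleRateHertz\":24000,\"ssmlGender\":\"MALE\"}"))

;; string->jsexpr yields a hasheq whose keys are symbols.
(define parsed (string->jsexpr voice-json))

(define parsed-name (hash-ref parsed 'name))
(define parsed-codes (hash-ref parsed 'languageCodes))

;; Writing the hash back out and reading it again gives an equal value.
(define round-trip (string->jsexpr (jsexpr->string parsed)))
```

This makes it practical to fetch a voice once, store its JSON in a file, and load it on later runs without another API call for the voice list.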