An Overview of Text to Speech

Introduction

This page provides a very brief outline of text to speech. Text to speech, or TTS for short, is the automatic conversion of text into speech. The device or computer program that does this is a TTS synthesiser, and if it is a computer program it is often called a TTS Engine. You do not, of course, need to know all about TTS to be able to use a TTS Engine like Orpheus, but please read on to find out more.

A typical TTS engine comprises several components. Depending on the nature of a particular TTS engine, these include those which:

Resolve uncertainties (ambiguities) in the text, convert common abbreviations like Mr to Mister, and recognise dates and currency amounts as distinct from ordinary numbers
Convert the text to its corresponding speech sounds (phonemes) using a pronunciation dictionary (lexicon), pronunciation rules, statistical methods that predict the most likely pronunciations, or a combination of all these
Generate information, often called prosody, which describes how to speak the sounds, for instance their duration and the pitch of the voice
Convert the phonemes and prosody to an audio speech signal
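
As a rough illustration, the first three stages listed above can be sketched in Python. Everything here is an assumption made for the example: the function names, the tiny abbreviation table and lexicon, the phoneme symbols and the flat prosody values are hypothetical and are not taken from Orpheus or any other engine. The last stage, turning phonemes and prosody into audio, is left out.

```python
import re

# Hypothetical, deliberately tiny tables for illustration only.
ABBREVIATIONS = {"Mr": "Mister", "Dr": "Doctor"}
LEXICON = {
    "mister": ["M", "IH", "S", "T", "ER"],
    "smith": ["S", "M", "IH", "TH"],
}


def normalise(text):
    """Split the text into words, expand known abbreviations, lower-case."""
    words = re.findall(r"[A-Za-z]+", text)
    return [ABBREVIATIONS.get(word, word).lower() for word in words]


def to_phonemes(words):
    """Look each word up in the lexicon; spell out unknown words by letter."""
    phonemes = []
    for word in words:
        phonemes.extend(LEXICON.get(word, list(word.upper())))
    return phonemes


def add_prosody(phonemes, duration_ms=80, pitch_hz=120.0):
    """Attach the same (flat) duration and pitch to every phoneme."""
    return [(phoneme, duration_ms, pitch_hz) for phoneme in phonemes]


def front_end(text):
    """Run normalisation, phoneme conversion and prosody generation in turn."""
    return add_prosody(to_phonemes(normalise(text)))
```

A real engine would replace each of these functions with something far more elaborate, but the order of the stages is the same.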

There are other sub-systems that may also be used, but all TTS engines have the equivalent of the above. Some of these can be very sophisticated and may use complex natural language processing methods, especially to resolve ambiguities in the text or to help with pronunciations that depend on grammar or the context of the text. Consider the difficulty of determining the pronunciation of words like "row" and "bow" in phrases like:

"There was a row."

and

"The violin player took a bow."

We human readers can sometimes struggle to work out these pronunciations, so it is no wonder that TTS engines sometimes get them wrong too!

The last stage in the synthesis process, the one that turns the phonemes and prosody of the speech into the speech signal you hear, is often used to label the type of the TTS Engine. This is probably because this stage broadly determines what the synthesised speech sounds like.

The last stage may fall into one of a number of broad classes. Formant synthesis uses a relatively simple system to choose values for a small number of parameters that control a mathematical model of speech sounds. A set of parameters is picked for each speech sound and the sets are then joined up to make the speech. This stream of parameters is then turned into synthetic speech using the model.

Articulatory synthesis also mathematically models speech production, but models the speech production mechanism itself using a complex physical model of the human vocal tract.

Concatenative synthesis does not use these sorts of model directly, but instead uses a database of fragments, or units, of recorded and coded speech and extracts from it the best string of units to stitch together to form the synthetic speech.

A bit more about each of these follows.

Formant Synthesis

Formant synthesis systems synthesise speech using an acoustic model of the speech signal. This means that they model the speech spectrum and its changes in time as we speak, rather than the production mechanisms themselves. Formant synthesis systems are sometimes referred to as synthesis-by-rule systems or more usually formant synthesisers. Commercial TTS engines using formant synthesis have been around for many years. DecTalk, Apollo, Orpheus and Eloquence are well known TTS engines that use formant synthesis.

Formant synthesis is not a very computationally intensive process, especially for today's computing systems. The strength of formant synthesis is its relative simplicity and the small memory footprint needed for the engine and its voice data. This can be important for embedded and mobile computing applications. Another less often reported strength is that the speech is intelligible, and can remain highly so under difficult listening conditions. This is partly because, although the speech is not natural sounding, all instances of a particular speech sound are much the same. It is thought that, with training, this sameness may help some listeners spot sounds in speech at unnaturally fast talking rates.

The weakness of rule-based formant synthesis is that the speech does not sound natural. This is because of the simplicity of the models used; it is very difficult, if not impossible, to model those subtleties of speech that give rise to a perception of naturalness.
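
To give a feel for what an acoustic model of this kind involves, here is a minimal sketch in Python. It assumes a 16 kHz sample rate, a crude impulse-train voice source, fixed bandwidths and a cascade of three second-order digital resonators, one per formant. The formant frequencies are approximately the classic published averages for adult male vowels. None of this describes how Orpheus or the other engines named above actually work, and all names in the code are hypothetical.

```python
import math

SAMPLE_RATE = 16000  # samples per second (assumed)

# Approximate textbook formant frequencies in Hz; illustrative only.
FORMANTS = {
    "AA": (730, 1090, 2440),  # vowel as in "father"
    "IY": (270, 2290, 3010),  # vowel as in "see"
}
BANDWIDTHS = (60, 90, 150)  # assumed fixed formant bandwidths in Hz


def resonator(signal, freq_hz, bandwidth_hz):
    """Second-order (two-pole) digital resonator with unity gain at 0 Hz."""
    t = 1.0 / SAMPLE_RATE
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(pitch_hz, duration_ms):
    """Crude voiced source: one unit impulse per pitch period."""
    period = int(SAMPLE_RATE / pitch_hz)
    n_samples = SAMPLE_RATE * duration_ms // 1000
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]


def synthesise_vowel(vowel, pitch_hz=120.0, duration_ms=200):
    """Pass the source through one resonator per formant, in cascade."""
    signal = impulse_train(pitch_hz, duration_ms)
    for freq_hz, bandwidth_hz in zip(FORMANTS[vowel], BANDWIDTHS):
        signal = resonator(signal, freq_hz, bandwidth_hz)
    return signal
```

Calling synthesise_vowel("IY") returns a list of 3200 samples that would still need scaling before being written to an audio file. The whole voice is described by a handful of numbers per sound, which is why the memory footprint is so small, and also why the result sounds artificial.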

Articulatory Synthesis

Articulatory synthesisers model human speech production mechanisms directly rather than the sounds generated; in some cases they might give more natural sounding speech than formant synthesis. They classify speech in terms of movements of the articulators (the tongue, lips and velum) and the vibrations of the vocal cords. Text to be synthesised is converted from a phonetic and prosodic description into a sequence of such movements, and the synchronisation between the movements is calculated. A complex computational model of the physics of a human vocal tract is then used to generate a speech signal under control of these movements. Articulatory synthesis is a computationally intensive process and is not widely available outside the laboratory.

Concatenative Synthesis and Unit Selection

A broad class of TTS engines uses a database of recorded and coded speech units from which to synthesise the speech. These are often termed concatenative synthesisers, and the process concatenative synthesis. Depending on the process used to pick the units to be spoken, these types of TTS engine may also be referred to as unit selection synthesisers or be said to use unit selection.

Unit selection is a very popular technique in modern TTS engines and has the ability to create highly natural sounding speech. In unit selection a recorded database is analysed and labelled to define the speech units. These can be half phonemes (demi-phones), phonemes, diphones (two adjacent half phonemes), syllables, demi-syllables, words, whole phrases, or statistically selected arbitrary pieces of speech.

Typically, a set of cost functions is then used. The unit selection process then picks units that minimise the overall cost of their selection. A variety of cost calculations may be used. However, they all measure a concept of 'distance' between a speech unit and its environment in the database and the ideal unit at the point in the speech for which the unit is a candidate. The concept of distance includes such things as the units' durations, pitch, the identity of adjacent units, and the smoothness of the joins to adjacent units in the resulting synthesis.
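
The idea of minimising an overall cost can be sketched as a small dynamic programming search in Python. This is only an illustration under stated assumptions: the Unit fields, the two cost functions and their weightings are hypothetical, pitch alone stands in for the smoothness of a join, and every target phoneme is assumed to have at least one candidate in the database. Real engines use much richer features.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    phoneme: str
    duration_ms: float
    pitch_hz: float


def target_cost(unit, want_duration_ms, want_pitch_hz):
    """Distance between a candidate unit and the ideal unit at this point."""
    return (abs(unit.duration_ms - want_duration_ms)
            + abs(unit.pitch_hz - want_pitch_hz)) / 10.0


def join_cost(left, right):
    """Crude measure of how rough the join between two units would be."""
    return abs(left.pitch_hz - right.pitch_hz) / 10.0


def select_units(targets, database):
    """targets: non-empty list of (phoneme, duration_ms, pitch_hz) tuples.

    Returns (chosen units, total cost) for the cheapest sequence.
    """
    candidates = [[u for u in database if u.phoneme == phoneme]
                  for phoneme, _, _ in targets]
    # best[i][j] = (cheapest cost of a path ending in candidate j, back-pointer)
    best = []
    for i, (_, duration_ms, pitch_hz) in enumerate(targets):
        row = []
        for unit in candidates[i]:
            cost = target_cost(unit, duration_ms, pitch_hz)
            if i == 0:
                row.append((cost, None))
            else:
                path_cost, back = min(
                    (best[i - 1][k][0] + join_cost(previous, unit), k)
                    for k, previous in enumerate(candidates[i - 1])
                )
                row.append((path_cost + cost, back))
        best.append(row)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    chosen = []
    for i in range(len(targets) - 1, -1, -1):
        chosen.append(candidates[i][j])
        j = best[i][j][1]
    chosen.reverse()
    return chosen, total
```

For example, with a database of two recordings of each of two phonemes, the search picks the pair whose combined target and join costs are lowest:

```python
database = [Unit("a", 100, 100), Unit("a", 80, 140),
            Unit("b", 90, 105), Unit("b", 90, 150)]
units, cost = select_units([("a", 100, 100), ("b", 90, 100)], database)
# units == [Unit("a", 100, 100), Unit("b", 90, 105)], cost == 1.0
```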

Generally, the selected units will not perfectly match the required duration and pitch and have to be adjusted. Sometimes, however, the required 'prosodic' adjustments are so small that they do not need to be made, because the database holds enough variants of each unit to satisfy the needs of normal synthesis.

The great advantage of unit selection is that it generates natural sounding voices. With a large, well-prepared database and sophisticated methods of selecting units, and in situations where few or no prosodic adjustments are required, the naturalness can be stunning.

If adjustment is required, naturalness and voice quality can be affected, and the greater the adjustment needed, the more of a problem this becomes. Users of unit selection TTS engines who want to use speech at fast talking rates may be surprised at the reduction in voice quality that can then arise.

Although concatenative synthesis is not generally computationally intensive, the unit selection process can be. A huge number of combinations of units may have to be tried and rejected as too costly before the best are found. The search for the lowest cost units has to be done before an utterance can be spoken. This computation can give rise to a discernible delay, or latency, before speaking begins. This is especially true if the voice database is large. For some applications the delay may be unacceptable. For this reason, TTS engine designers employ special techniques to shorten the search with minimal effect on the speech quality.
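
This page does not go into those techniques. One simple possibility, given here as an assumption rather than as a description of any particular engine, is pruning: before the full search, keep only the few candidates for each position that are closest to the ideal unit. The function name and the beam_width parameter below are hypothetical.

```python
import heapq


def prune_candidates(candidates, target_cost, beam_width=3):
    """Keep only the beam_width candidates with the lowest target cost."""
    return heapq.nsmallest(beam_width, candidates, key=target_cost)


# Example: candidates described only by pitch in Hz, ideal pitch 100 Hz.
pitches = [95, 180, 120, 101, 60]
kept = prune_candidates(pitches, lambda pitch: abs(pitch - 100), beam_width=2)
# kept == [101, 95]
```

Fewer candidates per position means far fewer joins to evaluate, at the risk of occasionally discarding a unit that would have joined more smoothly.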
