An Overview of Text to Speech
This page provides a very brief outline of text to speech.
Text to speech, or TTS for short, is the automatic conversion of text
into speech. The device or computer program that does this is a TTS
synthesiser, and if it is a computer program it is often called a TTS Engine.
You do not, of course, need to know all about TTS to be able to use a TTS Engine
like Orpheus, but please read on to find out more.
A typical TTS engine comprises several components. Depending on the
nature of a particular TTS engine, these include those which:
- Resolve uncertainties (ambiguities) in the text, and convert common
abbreviations like Mr to Mister, or recognise dates and currencies as
differing from just ordinary numbers
- Convert the text to its corresponding speech sounds (phonemes) using a
pronunciation dictionary (lexicon), pronunciation rules, statistical methods
predicting most likely pronunciations, or a combination of all these
- Generate information, often called prosody, which describes how to
speak the sounds, for instance their duration and the pitch of the voice
- Convert the phonemes and prosody to an audio speech signal
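As an illustration of the first of these components, here is a minimal sketch in Python of text normalisation. It is not how Orpheus or any particular engine works: the abbreviation table, the function names and the restriction to numbers below 100 are all assumptions made to keep the example short.

```python
import re

# Hypothetical abbreviation table; a real engine would need a far larger one
# and would have to cope with abbreviations that have several expansions.
ABBREVIATIONS = {"Mr": "Mister", "Mrs": "Missus", "Dr": "Doctor"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer from 0 to 99 (larger numbers are out of scope)."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else TENS[tens] + " " + ONES[ones]


def expand_pounds(match):
    """Turn a matched amount such as '£5' into 'five pounds'."""
    n = int(match.group(1))
    return number_to_words(n) + (" pound" if n == 1 else " pounds")


def normalise(text):
    """Expand abbreviations, small currency amounts and small numbers."""
    # Abbreviations, with or without a trailing full stop.
    pattern = r"\b(" + "|".join(ABBREVIATIONS) + r")\b\.?"
    text = re.sub(pattern, lambda m: ABBREVIATIONS[m.group(1)], text)
    # Currency must be handled before plain numbers so the unit is kept.
    text = re.sub(r"£(\d{1,2})\b", expand_pounds, text)
    # Any remaining one or two digit numbers.
    text = re.sub(r"\b(\d{1,2})\b",
                  lambda m: number_to_words(int(m.group(1))), text)
    return text


print(normalise("Mr Smith paid £5 for 12 eggs."))
```

Note that the currency rule runs before the plain number rule. If the order were reversed, the amount would be read as an ordinary number and the pound sign would be left behind.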
There are other sub-systems that may also be used, but all TTS engines have
the equivalent of the above. Some of these can be very sophisticated and
may use complex natural language processing methods, especially
to resolve ambiguities in the text or to determine pronunciations that depend
on grammar or the context of the text. Consider the difficulty of determining
the pronunciation of words like "row" and "bow" in phrases like:
"There was a row."
"The violin player took a bow."
Even human readers can have problems at times working these pronunciations out,
so it is no wonder that TTS engines sometimes get them wrong too!
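To see why this is hard, here is a toy sketch in Python of one very simple approach: counting cue words in the surrounding sentence. The cue-word lists and the function name are invented for illustration, and real engines may use much more sophisticated grammatical or statistical methods.

```python
import re

# Hypothetical cue words for each pronunciation, for illustration only.
HOMOGRAPHS = {
    "row": {
        "rhymes with 'go'": {"boat", "oars", "seats", "front", "back"},
        "rhymes with 'cow'": {"argument", "angry", "shouting", "neighbours"},
    },
    "bow": {
        "rhymes with 'go'": {"arrow", "violin", "ribbon", "strings"},
        "rhymes with 'cow'": {"audience", "applause", "curtsey", "stage"},
    },
}


def guess_pronunciation(word, sentence):
    """Return (pronunciation, confident) for a homograph in a sentence.

    `confident` is False when the cue words give no clear winner, in which
    case the first listed pronunciation is returned as a default.
    """
    context = set(re.findall(r"[a-z']+", sentence.lower())) - {word}
    scores = {pron: len(cues & context)
              for pron, cues in HOMOGRAPHS[word].items()}
    best = max(scores, key=scores.get)
    ranked = sorted(scores.values(), reverse=True)
    return best, ranked[0] > ranked[1]


print(guess_pronunciation("row", "There was a row."))
print(guess_pronunciation("bow", "The violin player took a bow."))
```

For "There was a row." the sketch finds no cue words at all and can only fall back on a default. For the violin player it confidently picks the pronunciation that rhymes with "go" because it sees "violin", even though the player may simply have bowed to the audience.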
The last stage in the synthesis process, the one that turns the phonemes and prosody of the
speech into the speech signal you hear, is often used to label the type of the TTS
Engine. This is probably because this stage broadly determines what the synthesised
speech sounds like.
The last stage may fall into one of a number of broad classes.
Formant synthesis uses a relatively simple system to
select values for a small number of parameters that control a mathematical
model of speech sounds. A set of parameter values is picked for each speech
sound, and the sets are then joined up to form a stream. This stream of
parameters is then turned into synthetic speech using the model.
Articulatory synthesis also mathematically models
speech production, but models the speech production mechanism itself using a
complex physical model of the human vocal tract.
Concatenative synthesis does not use these sorts of
model directly, but instead uses a database of fragments, or units, of recorded
and coded speech and extracts from it the best string of units to stitch together
to form the synthetic speech.
A bit more about each of these follows.
Formant Synthesis
Formant synthesis systems synthesise speech using an acoustic model of
the speech signal. This means that they model the speech spectrum and
its changes in time as we speak, rather than the production mechanisms
themselves. Formant synthesis systems are sometimes referred to as
synthesis-by-rule systems or more usually formant synthesisers.
Commercial TTS engines using formant synthesis have been around for many years.
DecTalk, Apollo, Orpheus and Eloquence are well known TTS engines that use formant
synthesis.
Formant synthesis is not a very computationally intensive process, especially
for today's computing systems. The strength of formant synthesis is its relative
simplicity and the small memory footprint needed for the engine and its voice data.
This can be important for embedded and mobile computing applications. Another
less often reported strength is that the speech is intelligible and can be highly so
under difficult listening conditions. This is partly because, although the speech
is not natural sounding, all instances of a particular speech sound are somewhat the same.
It is thought that with training, this sameness may help some listeners spot sounds in
speech at unnaturally fast talking rates.
The weakness of rule-based formant synthesis is that the speech does not sound natural.
This is because of the simplicity of the models used; it is very difficult, if not impossible,
to model those subtleties of speech that give rise to a perception of naturalness.
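The idea of controlling an acoustic model with a few parameters can be sketched in Python as follows. This is only a rough illustration, not the design of any of the engines named above. It assumes a pulse train as the voice source and a cascade of standard two-pole digital resonators, one per formant. The sample rate, pitch and formant values are rough guesses chosen for the example.

```python
import math

SAMPLE_RATE = 16000  # samples per second (assumed for this sketch)


def resonator_coefficients(freq_hz, bandwidth_hz):
    """Coefficients of a two-pole resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / SAMPLE_RATE
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * t)
         * math.cos(2.0 * math.pi * freq_hz * t))
    a = 1.0 - b - c  # gives unity gain at zero frequency
    return a, b, c


def resonate(signal, freq_hz, bandwidth_hz):
    """Pass a signal through one formant resonator."""
    a, b, c = resonator_coefficients(freq_hz, bandwidth_hz)
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(pitch_hz, duration_s):
    """A very crude voice source: one pulse per pitch period."""
    period = int(SAMPLE_RATE / pitch_hz)
    n = round(SAMPLE_RATE * duration_s)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def synthesise_vowel(formants, pitch_hz=120.0, duration_s=0.3):
    """Shape the source with one resonator per (frequency, bandwidth) pair."""
    signal = impulse_train(pitch_hz, duration_s)
    for freq_hz, bandwidth_hz in formants:
        signal = resonate(signal, freq_hz, bandwidth_hz)
    return signal


# Rough, illustrative parameters for an "ah"-like vowel.
AH = [(700.0, 80.0), (1200.0, 90.0), (2600.0, 120.0)]
samples = synthesise_vowel(AH)
```

The whole vowel is described by a pitch, a duration and three pairs of numbers, which suggests why the voice data for a formant synthesiser can be so small. A real engine would also vary these parameters smoothly from one speech sound to the next.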
Articulatory Synthesis
Articulatory synthesisers model human speech production
mechanisms directly rather than the sounds generated; in some cases they might
give more natural sounding speech than formant synthesis. They classify
speech in terms of movements of the articulators (the tongue, lips and velum)
and the vibrations of the vocal cords. Text to be synthesised is converted
from a phonetic and prosodic description into a sequence of such movements, and
the synchronisation between the movements is calculated. A complex computational
model of the physics of a human vocal tract is then used to generate a speech signal
under control of these movements. Articulatory synthesis is a computationally
intensive process and is not widely available outside the laboratory.
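A full articulatory model is far beyond a short example, but the simplest physical approximation of a vocal tract gives a flavour of the approach. The Python sketch below treats the tract as a uniform tube closed at one end (the vocal cords) and open at the other (the lips), and uses the textbook quarter-wavelength formula for its resonances. The tube length and speed of sound are assumed round figures, and nothing here reflects how an actual articulatory synthesiser is built.

```python
SPEED_OF_SOUND = 350.0  # metres per second, an approximate figure for warm, moist air


def tube_resonances(length_m, count=3):
    """Resonant frequencies (Hz) of a uniform tube closed at one end.

    Uses the quarter-wavelength formula F_n = (2n - 1) * c / (4 * L).
    """
    return [(2 * n - 1) * SPEED_OF_SOUND / (4.0 * length_m)
            for n in range(1, count + 1)]


# An assumed vocal tract length of 17.5 cm gives resonances of roughly
# 500, 1500 and 2500 Hz.
print(tube_resonances(0.175))
```

A real vocal tract is not a uniform tube. Its shape changes continuously as the articulators move, which is why the full physical models are so much more computationally demanding.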
Concatenative Synthesis and Unit Selection
A broad class of TTS engines use a database of recorded and coded speech
units from which to synthesise the speech. These are often termed concatenative
synthesisers, and the process concatenative synthesis.
Depending on the process used to pick the units to be spoken, these types of TTS engine
may also be referred to as unit selection synthesisers, or be said to use
unit selection.
Unit selection is a very popular technique in modern TTS engines and has
the ability to create highly natural sounding speech. In unit selection a recorded
database is analysed and labelled to define the speech units. These can be
half phonemes (demi-phones), phonemes, diphones (two adjacent half
phonemes), syllables, demi-syllables, words, whole phrases, or statistically selected
arbitrary pieces of speech.
Typically, a set of cost functions is then used. The unit selection process then picks
units that minimise the overall cost of their selection. A variety of cost calculations
may be used. However, they all measure a concept of 'distance' between a speech unit
and its environment in the database and the ideal unit at the point in the speech for which
the unit is a candidate. The concept of distance includes such things as the units' durations,
pitch, the identity of adjacent units, and the smoothness of the joins to adjacent units
in the resulting synthesis.
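The search for the lowest overall cost can be sketched in Python as a dynamic programming (Viterbi-style) search. This is a minimal illustration under stated assumptions: the `Unit` fields, the two cost functions and their weights are invented for the example, and a real engine would use richer features and much larger databases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    name: str        # the speech sound this unit carries
    duration: float  # seconds
    pitch: float     # Hz (a single value per unit, as a simplification)


def target_cost(unit, ideal):
    """Distance between a database unit and the ideal unit at this point."""
    return (abs(unit.duration - ideal.duration) / ideal.duration
            + abs(unit.pitch - ideal.pitch) / ideal.pitch)


def join_cost(left, right):
    """Penalty for a rough join; here simply the pitch jump between units."""
    return abs(left.pitch - right.pitch) / 100.0


def select_units(ideals, database):
    """Return (units, total_cost) minimising target plus join costs."""
    if not ideals:
        return [], 0.0
    candidates = [[u for u in database if u.name == t.name] for t in ideals]
    if any(not c for c in candidates):
        raise ValueError("no candidate unit for some target")
    # best[i][j] = (lowest cost of any path ending in candidates[i][j],
    #               index of the previous unit on that path)
    best = [[(target_cost(u, ideals[0]), None) for u in candidates[0]]]
    for i in range(1, len(ideals)):
        row = []
        for u in candidates[i]:
            tc = target_cost(u, ideals[i])
            row.append(min((best[i - 1][k][0] + join_cost(p, u) + tc, k)
                           for k, p in enumerate(candidates[i - 1])))
        best.append(row)
    # Trace back from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(ideals) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


database = [
    Unit("h", 0.08, 118.0), Unit("h", 0.10, 150.0),
    Unit("e", 0.12, 121.0), Unit("e", 0.12, 155.0),
    Unit("l", 0.07, 119.0),
]
ideals = [Unit("h", 0.10, 120.0), Unit("e", 0.12, 120.0), Unit("l", 0.07, 120.0)]
path, total = select_units(ideals, database)
```

In this small example the second "h" unit has exactly the ideal duration, yet it is not chosen, because its pitch is far from the ideal and it would join badly to its neighbour. The choice depends on the cost of the whole sequence, not on each unit taken alone.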
Generally, the selected units will not perfectly match the required duration and pitch
and have to be adjusted. Sometimes, however, the required 'prosodic' adjustments are so small
that they do not need to be made, there being enough variants of each unit in the database
to satisfy the needs of normal synthesis.
The great advantage of unit selection is that it generates natural sounding voices.
With a large, well prepared database and sophisticated methods of selecting units,
and in situations where few or no prosodic adjustments are required,
the naturalness can be stunning.
If adjustment is required, naturalness and voice quality can be affected, and it is
more of a problem the greater the adjustment needed. Users of unit selection TTS engines
who want to use speech at fast talking rates may be surprised at the reduction in
voice quality that can then arise.
Although concatenative synthesis is not generally computationally intensive, the unit
selection process can be. A huge number of combinations of units may have to be tried
and rejected as too costly before the best are found. The search for lowest cost units
has to be done before an utterance can be spoken. This computation can give rise to
a discernible delay, or latency, before speaking begins. This is especially
true if the voice database is large. For some applications the delay may be
unacceptable. For this reason, TTS engine designers employ special techniques to shorten
the search with minimal effect on the speech quality.
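The techniques used by particular engines are not described here. As one plausible example, the Python sketch below shows pre-selection: before the full search, only the few candidates that best match the ideal unit at each point are kept. The function names and the figures are assumptions made for illustration.

```python
import heapq


def preselect(candidates, target_cost, keep=20):
    """Keep only the `keep` candidates with the lowest target cost."""
    return heapq.nsmallest(keep, candidates, key=target_cost)


def join_evaluations(candidate_counts):
    """Number of join costs a full search computes for the given
    number of candidates at each point in the utterance."""
    return sum(a * b for a, b in zip(candidate_counts, candidate_counts[1:]))


# Ten units to be spoken, each with 200 candidates in the database,
# compared with the same utterance after keeping the best 20 per point.
print(join_evaluations([200] * 10))
print(join_evaluations([20] * 10))
```

Cutting the candidates at each point by a factor of ten cuts the join calculations by a factor of a hundred. The price is that a unit discarded early might have been part of the best overall sequence, so the saving in time can cost a little in speech quality.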