SPEECH SYNTHESIS AT HIGHER SPEAKING RATES


John M. Rye


Abstract

Synthetic speech is used widely with adaptive software to enable visually impaired computer users to have the same level of computer access as their sighted counterparts. Some users prefer, and others need, to be able to listen to synthesized texts spoken rapidly. This paper discusses some issues relevant to the design and operation of speech synthesizers used in these systems, with particular emphasis on the ability to produce rapid but intelligible speech.

Introduction

Synthetic speech has been used for a number of years as a final output stage in adaptive systems for visually impaired PC users. Visually impaired users make widely varying use of their PCs. At one end of the spectrum an occasional user may read a book, or a 'phone bill for instance, with their home PC and special attachments such as an optical character recogniser. At the other end the professional PC user, a programmer for instance, may use a talking PC for much of their working day. Many people of course vary the amount of time at a PC or terminal according to need or fancy.

Intelligibility and naturalness are accepted as being important desirable aspects of synthesizer performance. Intelligibility is the obvious requirement that any synthetic speech should meet, but a high degree of naturalness has been widely accepted as easing listening strain.

The example of the visually impaired professional PC user is important; this type of user tends to make demands on the speech synthesis not often taken into consideration when studying the effectiveness of synthetic speech. The most obvious extra requirement is speed.

The Need for Speed

There are two fairly obvious cases when a visually impaired PC user may want their synthesizer to speak quickly.

The first is simply reading a document. Most sighted readers would readily accept that they can read a text faster than they can speak it. There is therefore no reason why a visually impaired user should not want to listen to a document at the fastest rate at which they can process the audio and linguistic information it contains. Whilst it is not claimed that linguistic information received acoustically and linguistic information received visually are processed in the same way or at the same rate, it is suggested that both can be processed faster than a normal speaking rate.

Secondly, when using a screen reader and synthesizer to follow typing and cursor movements, the user may want to hear texts, such as selected menu items, spoken quickly to speed up their overall use of the PC. A fast response from the speech system is important here and has to be taken into account in the design of the screen reader, synthesizer driver and synthesizer itself. The speed of initial response is important, but what is spoken must also be intelligible.

Experience suggests that frequent synthesizer users train themselves to their particular synthesizer and become familiar with its speech quality and idiosyncrasies. With this self-training they are often able to process speech synthesized at much greater rates than a human can produce.

Speaking Quickly

When we speak quickly we tend to shorten longer sounds more than shorter sounds. As the speech becomes even more rapid it becomes impossible for the articulators to reach some of their intended target positions. Consequently the associated sounds become altered and blur into one another. At these speeds linguistically unimportant sounds sometimes disappear altogether. Shorter alternative pronunciations may also be used. As an example, in British English, the word temperature is sometimes pronounced temp'eture, even by BBC weather announcers at normal speed. Similarly, US English speakers might consider alternative pronunciations of the word particular.
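The observation that longer sounds shorten more than shorter ones can be sketched with a simple floor-based duration rule. This is an illustration only, not the rule used by any particular synthesizer: the function name, the inherent durations and the minimum durations are all assumed values chosen for the example.

```python
def shortened_duration(inherent_ms, minimum_ms, rate_factor):
    """Compress only the part of a segment that lies above its minimum.

    rate_factor is 1.0 at the normal rate and grows as speech gets faster.
    All numbers used with this function are illustrative assumptions.
    """
    if rate_factor < 1.0:
        raise ValueError("this sketch only covers speeding up")
    compressible = inherent_ms - minimum_ms
    return minimum_ms + compressible / rate_factor


# Hypothetical segments: (label, inherent duration, minimum duration) in ms.
segments = [("long vowel", 200.0, 60.0), ("short stop", 70.0, 50.0)]

for label, inherent, minimum in segments:
    fast = shortened_duration(inherent, minimum, rate_factor=2.0)
    saved = 100.0 * (inherent - fast) / inherent
    print(f"{label}: {inherent:.0f} ms -> {fast:.0f} ms ({saved:.0f}% shorter)")
```

Because the long segment has more duration above its floor, it gives up a larger share of its length at the same rate factor, which is the non-linear behaviour described above.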

The transitions between the spoken phonetic segments are not themselves shortened, at least in part because the articulators can move only at limited speeds.

It is highly desirable for a synthesizer to model non-linear shortening of the phonetic segments at normal speaking rates in order to maintain naturalness as speed increases. However, there is sometimes a requirement, for example in the case of the visually impaired professional PC user, to synthesize at unnaturally fast rates. If we carried on shortening the sounds but not the transitions as the rate went up, then many sounds would be blurred into their neighbors or disappear altogether. This leads to the hypothesis that increasing the speed of interpolation between phonetic segments as the synthesizer's talking rate increases may improve intelligibility in unnaturally fast speech.
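A little arithmetic shows why fixed transitions become a problem at high rates. The sketch below assumes that one full transition time is consumed inside each segment and that transitions start to be quickened above a hypothetical rate factor called `onset`. The durations, the value of `onset` and both function names are assumptions made for illustration.

```python
def steady_portion_ms(segment_ms, transition_ms):
    """Time left at the target once the transition has been paid for.

    Assumes one full transition time is consumed inside each segment.
    """
    return max(0.0, segment_ms - transition_ms)


def scaled_transition_ms(natural_ms, rate_factor, onset=1.5):
    """Keep the natural transition up to `onset`, then shorten it with rate.

    `onset` (the rate factor at which steepening begins) is a made-up value.
    """
    if rate_factor <= onset:
        return natural_ms
    return natural_ms * onset / rate_factor


natural_transition = 40.0  # ms, illustrative only
for rate_factor, segment_ms in [(1.0, 100.0), (2.0, 50.0), (3.0, 33.0)]:
    fixed = steady_portion_ms(segment_ms, natural_transition)
    quick = steady_portion_ms(
        segment_ms, scaled_transition_ms(natural_transition, rate_factor)
    )
    print(f"rate x{rate_factor}: steady {fixed:.0f} ms fixed, {quick:.0f} ms quickened")
```

With the fixed transition, the steady portion of the shortest segment vanishes entirely. With the quickened transition, some time at the target remains.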

High Voice Pitch or Low Pitch?

The pitch of the synthesized voice is primarily a matter of taste for the listener. However, if we are to consider only the intelligibility at a high speaking rate, then the selection of overall voice pitch may have a bearing.

At very low pitch the voice pulses are relatively far apart in time, so they do not sample the synthesizer's phonetic segments very often. The perception of short voiced sounds in which the vocal tract parameters, for instance formant frequencies, are varying most rapidly may then be impaired. Conversely, too high a pitch value may produce voice harmonics too far apart to sample the synthesized vocal tract resonances effectively, again resulting in a loss of intelligibility.
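Both effects can be put into rough numbers. The sketch below counts the voice pulses that fall within a short segment, and the harmonics that fall within the bandwidth of one resonance. The pitch values, segment length, resonance frequency and bandwidth are illustrative assumptions, and counting harmonics inside a fixed band is only a crude stand-in for how well a resonance is sampled.

```python
import math


def pulses_in_segment(f0_hz, segment_ms):
    """Approximate number of voice pulses falling within a segment."""
    return f0_hz * segment_ms / 1000.0


def harmonics_in_band(f0_hz, centre_hz, bandwidth_hz):
    """Count harmonics of f0 lying within a resonance's bandwidth."""
    low = centre_hz - bandwidth_hz / 2.0
    high = centre_hz + bandwidth_hz / 2.0
    first = max(1, math.ceil(low / f0_hz))   # lowest harmonic number in band
    last = math.floor(high / f0_hz)          # highest harmonic number in band
    return max(0, last - first + 1)


# Hypothetical case: a 30 ms voiced segment and a 500 Hz resonance, 100 Hz wide.
for f0 in (80.0, 300.0):
    pulses = pulses_in_segment(f0, 30.0)
    harmonics = harmonics_in_band(f0, 500.0, 100.0)
    print(f"f0 {f0:.0f} Hz: {pulses:.1f} pulses, {harmonics} harmonic(s) in band")
```

Under these assumed values the low voice places only two or three pulses in the segment, while the high voice places no harmonic inside the resonance band at all.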

Implications for Synthesizer Design

To speed up the interpolation from one phonetic segment to the next, it is necessary to control the variation in time of the synthesis vocal tract model parameters, such as formant frequencies. Consequently, formant or LPC based synthesizers are better equipped to achieve this kind of alteration. Systems that synthesize from gross segments, for instance by waveform concatenation of diphones, are capable of a high degree of naturalness and intelligibility at normal rates, but cannot directly achieve this kind of manipulation.
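As a minimal sketch of what controlling a parameter track means, the function below generates one formant frequency track for a single segment, with the transition time as an explicit argument. Linear interpolation, the 5 ms frame step and the frequency values are assumptions made for illustration. A real formant or LPC synthesizer may well use smoother trajectories.

```python
def formant_track(start_hz, end_hz, segment_ms, transition_ms, frame_ms=5.0):
    """Linear formant track across one segment, sampled every frame_ms.

    The track moves from start_hz to end_hz over transition_ms and then
    holds end_hz for the rest of the segment.
    """
    if transition_ms <= 0:
        raise ValueError("transition_ms must be positive")
    n_frames = int(segment_ms // frame_ms)
    track = []
    for i in range(n_frames):
        t = i * frame_ms
        progress = min(1.0, t / transition_ms)
        track.append(start_hz + (end_hz - start_hz) * progress)
    return track


# A hypothetical 40 ms segment at a fast speaking rate.
slow = formant_track(500.0, 1500.0, segment_ms=40.0, transition_ms=40.0)
quick = formant_track(500.0, 1500.0, segment_ms=40.0, transition_ms=20.0)
print("frames at target, natural transition:", slow.count(1500.0))
print("frames at target, quickened transition:", quick.count(1500.0))
```

With the natural transition the track never reaches its target within the segment. Halving the transition time leaves several frames at the target, which is the kind of manipulation a parametric synthesizer can perform directly.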

It does not matter whether the synthesizer uses phonemes or diphones, so long as the interpolation between phonemes or the transitional section of the diphones can be quickened.

Tests

Much of the justification for altering phonetic element interpolations comes from informal tests, observations and a little intuition. However, to test the effects of steepening the segment interpolation, a software speech synthesizer has been modified to control the overall phonetic segment interpolation rate. Test phrases will then be synthesized at various speaking rates and interpolation rates and played to a panel of listeners. The results from the panel will be analyzed to see if increasing the rate of interpolation improves intelligibility at fast speaking rates. The details and results will be reported.
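One possible way to lay out such a test is sketched below. The rate values, the example phrase and the position-by-position word scoring are all assumptions made for illustration. They do not describe the actual test conditions or the scoring method used with the panel.

```python
from itertools import product


def words_correct(reference, response):
    """Fraction of reference words a listener reproduced in the same position."""
    ref = reference.lower().split()
    heard = response.lower().split()
    hits = sum(1 for r, h in zip(ref, heard) if r == h)
    return hits / len(ref) if ref else 0.0


speaking_rates = [1.0, 2.0, 3.0]       # hypothetical speaking rate factors
interpolation_rates = [1.0, 1.5, 2.0]  # hypothetical steepening factors

# Every test phrase would be synthesized once per (speaking, interpolation) pair.
conditions = list(product(speaking_rates, interpolation_rates))
print(len(conditions), "conditions per test phrase")

# Scoring one made-up listener response against a made-up phrase.
print(words_correct("open the file menu", "open a file menu"))
```

Crossing every speaking rate with every interpolation rate allows the effect of steepening to be compared at each speed separately.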

It will also be demonstrated that at normal speaking rates, increasing the interpolation rate introduces distortion into the speech, reducing its naturalness and intelligibility.

Discussion

It would seem that the following properties are useful in a speech synthesizer intended for use as an adaptive aid for visually impaired computer users.

a) Fast but intelligible speech. Many users want rapid speech in order to follow speedy typing or when reading text.

b) Speed of response; the user needs to hear the speech associated with a key press rapidly.

c) Individual users need to be able to tailor their synthesizer to match their taste in voice style, but certain selections, particularly in voice pitch, may affect intelligibility at speed.

There would, however, appear to be a conflict between the desirability of natural-sounding synthesis on the one hand and the need for rapid intelligible speech on the other. Certain synthesis techniques can help resolve this conflict by gradually shifting from a model of the natural phonetic segment interpolation process to a more specialized one as speed increases.