White Paper: Look who’s talking


The biggest problem with speech synthesis is that the generated speech sounds artificial. Linking pre-recorded samples can help avoid this and boost user acceptance.

Text-to-Speech (TTS) has been around for more than 15 years. It is commonly used in applications where a large amount of information has to be spoken or where the information changes frequently. Despite this history, TTS has not yet made the quantum leap required to achieve broad acceptance. Largely, this is because artificially synthesized speech sounds artificial, and listeners do not accept non-human-sounding voices as readily as human ones.

Ideally, a high-quality TTS engine will generate speech that is almost indistinguishable from human speech. For instance, it may be based on concatenation algorithms, where actual human voice segments are stored and used to convert any text into speech. In-depth, language-specific linguistic knowledge provides intelligent pronunciation of a wide range of variable input. Such an engine is used in Lernout & Hauspie's RealSpeak.

Such an engine is ideal for industries, such as telephony, which require a very high-quality voice. Many market segments, such as banking and airlines, conduct much of their business by phone and therefore rely on dynamic information retrieval from databases. Concatenation technology can cost-effectively and efficiently handle the spoken responses required for dial-in banking, reservations and the other applications employed in these industries. As a result, usage of TTS technology is expected to increase, first in the telecommunications market and eventually in other industries, such as automotive and consumer electronics.

TTS engines differ in their CPU and memory requirements and in their feature sets. Between them they support a wide range of platforms and languages, so customers can choose an engine to fit their application or requirement set. Scalable engines may be incorporated into products ranging from small embedded devices to large-scale telephony systems.

What is Text-to-Speech?

Text-to-Speech is synthetic or computer generated speech. Typed text is converted into speech using various algorithms.

Human speech is difficult to produce artificially. At the simplest level, a speech synthesizer must mimic the entire vocal tract to achieve the clarity and naturalness of real human speech.

During early attempts at synthetic speech, memory was extremely expensive, and this expense shaped the design of early synthesis engines. A memory-efficient synthesis-by-rule system, most commonly known as a formant engine, was popular. A formant synthesizer creates entirely synthetic speech; no human recordings are used. An advantage of the formant synthesizer is that the pitch and duration of words may be varied. However, in general, the sound quality is inferior. Although this technique has been used with some success when speaking numbers, most listeners perceive this type of text-to-speech engine as sounding robotic.
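
To make the formant idea concrete, here is a minimal sketch of rule-based synthesis with no recordings: an impulse train at the chosen pitch is passed through one two-pole resonator per formant. It is an illustration only, not the design of any particular engine. The function names, the 16 kHz sample rate and the formant values in AH are assumptions chosen for the example, not measured data.

```python
import math


def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Two-pole digital resonator that emphasises energy around freq_hz."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalises the gain at 0 Hz to 1
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synth_vowel(pitch_hz, duration_s, formants, sample_rate=16000):
    """Impulse-train source at pitch_hz, filtered by one resonator per formant."""
    n = round(duration_s * sample_rate)
    period = max(1, round(sample_rate / pitch_hz))
    signal = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    return signal


# Assumed (frequency, bandwidth) pairs for an "ah"-like vowel; illustrative only.
AH = [(700, 80), (1200, 90), (2600, 120)]

# Pitch and duration are plain parameters, so both can be varied freely.
low = synth_vowel(pitch_hz=100, duration_s=0.2, formants=AH)
high = synth_vowel(pitch_hz=200, duration_s=0.4, formants=AH)
```

Because the whole sound is computed from a handful of numbers, the memory cost is tiny, which is the trade-off described above: flexible and compact, but with a buzzy, robotic quality.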

A popular technique today is to store actual speech segments. Phonemes are the smallest units of speech that distinguish one utterance from another. Speech segments known as diphones, each running from the middle of one phoneme to the middle of the next, are cut from recordings of human speech. Because each diphone spans a transition between two sounds, a diphone set captures the co-articulation effects that occur in a particular language, and the diphones are concatenated to produce words and sentences. The use of diphones, in combination with various synthesis techniques, produces speech that is fairly intelligible and requires relatively little computing power.
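
The concatenation step can be sketched as follows: a phoneme sequence is turned into a sequence of diphone names, and the stored samples for each name are joined end to end. This is a simplified illustration under stated assumptions. The phoneme labels, the "_" silence symbol and the inventory dictionary are hypothetical, and a real engine would also smooth the joins and adjust pitch and timing.

```python
def to_diphones(phonemes):
    """Pair each phoneme with its successor, padding both ends with silence '_'."""
    padded = ["_"] + list(phonemes) + ["_"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def concatenate(diphones, inventory):
    """Join the stored sample lists for each diphone into one waveform."""
    samples = []
    for name in diphones:
        samples.extend(inventory[name])
    return samples


# Hypothetical phoneme labels for the word "cat".
units = to_diphones(["k", "ae", "t"])
print(units)  # ['_-k', 'k-ae', 'ae-t', 't-_']

# Stand-in inventory: two dummy samples per diphone instead of real recordings.
inventory = {name: [0.0, 0.0] for name in units}
waveform = concatenate(units, inventory)
```

Note that three phonemes yield four diphones, because the transitions into and out of silence are units in their own right.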

In speech synthesis, the input is standard text or a phonetic spelling, and the output is a spoken version of the text. A two-phase process is usually employed: (1) the text is converted into a phonetic representation with markers for stress and other pronunciation guides; and (2) the phonetic representation is spoken. The computation can be done on a DSP, a microprocessor or both.
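
The two-phase process can be sketched in a few lines. In this illustration, phase 1 looks words up in a tiny pronunciation lexicon and phase 2 is a stand-in that returns the unit stream a synthesizer back end would render. The lexicon entries, the phoneme labels, the "'" stress marker and the "|" pause marker are all assumptions made for the example, not the notation of any real engine.

```python
# Hypothetical mini-lexicon: word -> phonemes, with "'" marking the stressed syllable.
LEXICON = {
    "hello": ["h", "ax", "'l", "ow"],
    "world": ["'w", "er", "l", "d"],
}


def text_to_phonetic(text):
    """Phase 1: normalise the text and look up each word's phonetic form."""
    phonetic = []
    for word in text.lower().split():
        word = word.strip(".,!?")
        if word not in LEXICON:
            raise KeyError(f"no pronunciation for {word!r}")
        phonetic.append(LEXICON[word])
    return phonetic


def speak(phonetic):
    """Phase 2 stand-in: flatten the words into one unit stream with pause markers."""
    stream = []
    for word in phonetic:
        stream.extend(word)
        stream.append("|")  # word boundary / short pause
    return stream


stream = speak(text_to_phonetic("Hello, world."))
print(stream)
```

Keeping the phases separate means the language-specific work (phase 1) can change independently of the signal generation (phase 2), which is also why the two can run on different processors.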

Why use Text-to-Speech?

Text-to-speech is emerging as a major feature in telecommunication systems. Several factors are involved, including increased computer power, the deregulation of telephony networks, and a general acceptance of text-to-speech as a practical tool for business and consumers.

The immense increase in computer power has greatly enhanced text-to-speech in all applications. It has made text-to-speech systems for telecommunications much less expensive, on a per-port basis, by allowing host-based solutions to become a reality without the need for a high-cost, dedicated DSP resource board.

Deregulation of the telephony industry has also been a principal driving force for text-to-speech technology. The telecommunications industry has gone from a series of monopolies to a competitive market. This competition has created a great need for differentiation among companies offering similar services, and text-to-speech gives them an opportunity to offer real-value applications and enhanced services to their customers. Competition has also created a need for efficiency and increased productivity among telephone companies. Some have cut costs by using text-to-speech to automate reverse directory assistance, replacing live operators.

Historically, the primary use of text-to-speech in telephone applications has been to save money. The leading applications have been the automation of telephone company operator services and call center operations. In both applications, the fundamental motivation has been to handle more calls with fewer operators. Although these applications will continue to be a major segment of the text-to-speech telecommunications market, applications that use text-to-speech to make money have started to emerge. These enhanced services, such as retrieval of email over the telephone, are typically sold to subscribers and generate an incremental revenue stream.

Consumers, including business users, now expect greater automation and access to a variety of telecommunication services. For many tasks, automation is practical and enjoyable. Calling for train schedules, airline departures and arrivals, and entertainment listings is easier and more convenient with automated systems that incorporate text-to-speech.

The advantage of a concatenation-type engine

Users of poor- to fair-quality text-to-speech applications have had to "train" their ears to fully understand synthetic speech. This training period can last from three to 10 minutes. Without the training, synthetic speech is difficult to understand and causes fatigue. Previous research has shown that poor- to fair-quality synthetic speech is less well received than natural speech. Luce et al. (1983) concluded that this was because synthetic speech increases the effort involved in listening to and storing presented information.

Lernout & Hauspie has actively worked on the improvement and refinement of the basic text-to-speech algorithms to enhance the intelligibility and naturalness of text-to-speech engines. These improvements have resulted in the concatenation engine used in L&H RealSpeak.

In the past, speech companies focused on systems with a relatively small footprint, in terms of both CPU and memory, and on an excellent trade-off between speech quality and that footprint. However, for some applications, limitations of CPU and memory are becoming less important, so it was a logical step to focus on other issues, such as improving speech quality. The resulting substantial increase in voice quality will establish a new standard for text-to-speech. This next-generation text-to-speech engine will be used in server-based telephony services, among others.

Compiled by Craig Hinton

(c) Lernout & Hauspie
