Voice Web technology promises speech-activated Internet access

Ian Hugo investigates the potential for voice recognition technology to interact with the Web

Have you ever been stopped by the police for using a mobile phone while driving? Or, while driving, have you ever fiddled surreptitiously with your PDA to find an address, tried to tune the radio for traffic information, or the like? Suppose you could do all those things and a lot more, simply by using your voice? Wouldn't that be worth something?

The same applies to any hands-off situation and, indeed, to many hands-on situations. Voice is the means by which human beings most commonly and conveniently communicate. The major source of information for many of us, our computers and the Internet, remains locked behind a keyboard - but not perhaps for much longer. Voice Web technology promises to allow access to information on the Internet by voice for many practical tasks within the next two to five years.

Voice interaction with machines is probably best epitomised in people's minds by the malevolent HAL computer in Stanley Kubrick's film "2001: A Space Odyssey". Unfortunately, voice technology thus far has not matched up to HAL's level of performance, the principal problems being in natural language processing. This is difficult Chomsky territory, but steady if unspectacular progress is being made.

Technology Analysis

The technology under the microscope here is voice Web technology, to be incorporated into voice Web servers and browsers. The aim is to allow information sources such as the Internet to be accessed by means of voice technology. Other important aspects of the technology are the development of dialogues and context-dependent knowledge of user intentions.

Most readers will be familiar with the universally despised call-centre voice interface, whereby the caller waits interminably for a menu to be read out and intermittently presses buttons, all the while wishing they could talk to someone. This technology is known, incidentally, as IVR (Interactive Voice Response).

It is important to understand that call-centre development is driven by the desire to reduce costs rather than to create new markets. The determining factor is the worst possible service the user will tolerate rather than what he/she will buy. That distinction divorces it completely from the line of research being undertaken by Hewlett-Packard, a leader in voice Web, which focuses on improved quality of response.

HP's research has two main goals. First, improvement of natural language processing capabilities to allow users to cut through standard menus by asking directly for the information or routing required. Second, multimodality - that is, the incorporation of voice, text and images in combination to provide the optimal response at the user's terminal device.

Voice response from the Web requires pages of information to be marked up separately in a voice mark-up language, and these pages must co-exist with pages of text and images. Standardisation (as with HTML) must therefore be a critical factor, so bodies concerned with voice Web standards are important.

Key here is the W3C Voice Browsing Activity Group, whose multimodality subgroup is chaired by HP. This group is scheduled to publish an initial voice browsing standard later this year, setting out minimal standard dialogues for Internet browsing that are not dissimilar to IVR. Essentially, instead of intermittently pressing buttons, you could say "Yes" at appropriate points to a similarly voiced menu of options for any Web page.

This may sound like simply compounding the call-centre problem but a blind person might think differently.
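
To make the "say Yes" style of dialogue concrete, here is a minimal Python sketch. It is not taken from any W3C draft: the function name run_voice_menu and the idea of passing in a list of already-recognised replies are assumptions made purely for illustration.

```python
def run_voice_menu(options, answers):
    """Offer each option in turn; return the first one answered with "yes".

    `answers` is a stand-in for recognised speech: one reply per option.
    Both names are hypothetical, chosen for this sketch only.
    """
    for option, answer in zip(options, answers):
        # A real voice browser would speak `option` aloud and wait for a reply here.
        if answer.strip().lower() == "yes":
            return option
    # No option was accepted by the caller.
    return None


choice = run_voice_menu(["News", "Traffic", "Weather"], ["no", "Yes", "no"])
```

The user still has to sit through the options in order, which is exactly the limitation the later, natural-language round of standardisation is meant to remove.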

While a necessary initial step, this is not the end-game of HP's research efforts. HP is looking to a second round of standardisation, probably a couple of years ahead, which will go much further in multimodal dialogue definition and natural language processing. These developments will be crucial to the success of this technology.

Developments needed in natural language processing are helped enormously if good contextual information is present, since each context can be inferred to have its own, often quite limited, keyword vocabulary. If you need traffic information, words such as bedroom, bath and meals can reasonably be assumed to be superfluous. On the other hand, if you are trying to book accommodation ahead while en route, words such as accident, congestion and road repairs are similarly irrelevant.
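
The effect of context on vocabulary can be sketched in a few lines of Python. The context names, the CONTEXT_VOCABULARY table and the filter_keywords function are all assumptions for illustration; the keywords themselves are the ones used in the examples above.

```python
# Hypothetical per-context keyword vocabularies, using the article's example words.
CONTEXT_VOCABULARY = {
    "traffic": {"accident", "congestion", "road", "repairs"},
    "accommodation": {"bedroom", "bath", "meals"},
}


def filter_keywords(words, context):
    """Keep only the recognised words that belong to the active context."""
    vocabulary = CONTEXT_VOCABULARY[context]
    return [word for word in words if word.lower() in vocabulary]


heard = ["Any", "congestion", "or", "accident", "near", "bath"]
traffic_keywords = filter_keywords(heard, "traffic")
```

In the traffic context "bath" is discarded, so the recogniser only has to distinguish between a handful of plausible keywords rather than the whole language.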

An important part of the HP research is therefore to define scenarios in which voice Web technology could be used. Identified scenarios include in-vehicle, in shops, on foot and at home. The contextual information narrows the form of dialogue, keyword vocabulary and the set of user intentions that need to be taken into consideration.

HP Labs has an experimental technology codenamed Maverick in which one scenario is a Web music shop. Users can control and navigate using speech and/or a mouse and a keyboard. All these behaviours are encoded in a new mark-up language. This technology gives a good indication of what interaction with the Web of the future might feel like.

A further constraint within which voice Web technology must work is defined by the telecoms services available, whether wired or, more probably in the voice Web situation, wireless. However, these appear adequate for most purposes at the moment and will surely improve in the future. The key is to understand what kind of service, at any one time, they are capable of supporting.

Finally, probable delivery mechanisms for voice Web are mobile phones or microphones attached to PCs or PDAs (most of which have speaker facilities). For multimodal response, particularly in hands-off situations, PDAs would seem to have a distinct edge.

In the pipeline

Hewlett-Packard's principal centres for research are at Palo Alto in the US and at Bristol in England. Further research labs are located in Cambridge (Massachusetts), Grenoble in France, Haifa in Israel, and Tokyo.

HP's research can be broadly grouped into four categories:

  • Internet security (cryptography, and so forth)

  • Printing and publishing systems, including digitisation, storage and trading of media such as pictures, video, film and music and expertise in digital media rights

  • Service provider systems, including data centre systems, mobile e-services and interactive entertainment

  • Imaging systems, lightweight and low-power materials, image and sound compression algorithms, very small embedded cameras

    Unusually, HP positions its labs at the forefront of its public relations and corporate marketing activities rather than hiding them in the background. Newly installed CEO Carly Fiorina highlighted them in her keynote address to the Comdex conference at the end of last year as the engine of innovation for the company. She said: "I am confident because of the inventive spirit of HP and our people, who are best personified, perhaps, by what I found in HP Labs."


    Speech processing technology has a number of components:

    Speech Recognition

    This is the conversion of acoustic signals received from voice input into a sentence of distinct words by matching segments of the signals with a stored library of phonemes (irreducible units of sound). Problems include elimination of background noise and variations in pronunciation by individual and circumstance, such as a cold, sore throat and accent. Training systems to recognise a specific voice is still needed for the most accurate results.
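
As a deliberately simplified illustration of matching signal segments against a stored library, the Python sketch below assigns each segment to its nearest reference phoneme. Everything in it is an assumption: the two-number feature vectors are made up, and real recognisers use far richer acoustic features and statistical models rather than a plain nearest-neighbour lookup.

```python
import math

# Hypothetical phoneme library: each phoneme maps to a made-up feature vector.
PHONEME_LIBRARY = {
    "s": (0.9, 0.1),
    "iy": (0.2, 0.8),
    "t": (0.6, 0.5),
}


def match_segments(segments):
    """Label each segment with the phoneme whose reference vector is closest."""
    return [
        min(PHONEME_LIBRARY, key=lambda p: math.dist(PHONEME_LIBRARY[p], segment))
        for segment in segments
    ]


phonemes = match_segments([(0.85, 0.15), (0.25, 0.75)])
```

Background noise and accent show up in this picture as segments that drift away from every reference vector, which is why training on a specific voice still helps.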

    Language Understanding

    This involves parsing each sentence received from the voice recognition system into its grammatical components (verb, subject, object, and so on) and formatting these into semantic frames. A semantic frame is a command-like structure containing a clause, topic and predicate. A major sensitivity here is the need for coherent logical phrasing of the key, usually initial, sentence in any conversation. Garbled sentences can be clarified by iterative Q/A sessions, but clear and logically expressed requests are ideally needed.
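
One way to picture a semantic frame is as a small record. The Python sketch below is an assumption throughout: the field meanings, the build_frame helper and the list of request verbs are invented for illustration, and the grammatical parse is taken as already done.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticFrame:
    clause: str                                    # kind of utterance, e.g. "request"
    topic: str                                     # what the utterance is about
    predicate: dict = field(default_factory=dict)  # constraints on the topic


REQUEST_VERBS = {"find", "show", "book"}  # hypothetical list


def build_frame(parsed):
    """Build a frame from a sentence already split into grammatical roles."""
    clause = "request" if parsed["verb"] in REQUEST_VERBS else "statement"
    return SemanticFrame(
        clause=clause,
        topic=parsed["object"],
        predicate=dict(parsed.get("modifiers", {})),
    )


# "Find me a hotel in Bristol", after parsing into roles.
frame = build_frame(
    {"verb": "find", "subject": "I", "object": "hotel",
     "modifiers": {"city": "Bristol"}}
)
```

A garbled opening sentence would leave one of these fields empty, and that is the gap an iterative question-and-answer session would try to fill.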

    Language Generation

    This is the processing of semantic frames for inclusion in natural language responses or for translation into SQL or other computer languages in order to retrieve the information necessary for a response. Retrieved information is placed into further semantic frames which are assembled to build a response.
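
The translation of a frame into SQL might look something like the Python sketch below, which uses the standard sqlite3 module. The table, its columns, the hotel names and the frame_to_sql function are all made up for illustration; they do not describe any HP system.

```python
import sqlite3

# Hypothetical schema: which topics (tables) and constraints (columns) are allowed.
ALLOWED = {"hotels": {"city"}}


def frame_to_sql(topic, predicate):
    """Turn a frame's topic and predicate into a parameterised SELECT."""
    if topic not in ALLOWED or not set(predicate) <= ALLOWED[topic]:
        raise ValueError("frame does not match the known schema")
    columns = sorted(predicate)
    where = " AND ".join(f"{column} = ?" for column in columns)
    sql = f"SELECT name FROM {topic}" + (f" WHERE {where}" if where else "")
    return sql, [predicate[column] for column in columns]


# Invented sample data, held in memory only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hotels (name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO hotels VALUES (?, ?)",
    [("Avon Inn", "Bristol"), ("Cam Lodge", "Cambridge")],
)

sql, params = frame_to_sql("hotels", {"city": "Bristol"})
rows = conn.execute(sql, params).fetchall()
```

The rows that come back would then be placed into a further frame and handed to the response builder. Checking names against a fixed schema and passing values as parameters keeps spoken input from being pasted directly into the query.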

    Language Synthesis

    Speech synthesisers render the responses in natural language. Problems here are less to do with the understanding of meaning than with cosmetic but nonetheless important factors such as avoiding Mickey Mouse voices or sounding too much like a robot.
