Have you ever been stopped by the police for using a mobile phone while driving? Or, while driving, have you ever fiddled surreptitiously with your PDA to find an address, tried to tune the radio for traffic information, or the like? Suppose you could do all those things and a lot more, simply by using your voice? Wouldn't that be worth something?
The same applies to any hands-off situation and, indeed, to many hands-on situations. Voice is the means by which human beings most commonly and conveniently communicate. The major source of information for many of us, our computers and the Internet, remains locked behind a keyboard - but perhaps not for much longer. Voice Web technology promises to allow access to information on the Internet by voice for many practical tasks within the next two to five years.
Voice interaction with machines is probably best epitomised in people's minds by the malevolent HAL computer in Stanley Kubrick's film 2001. Unfortunately, voice technology thus far has not matched HAL's level of performance, the principal problems lying in natural language processing. This is difficult, Chomskyan territory, but steady if unspectacular progress is being made.
Voice Web technology does not appear to have the potential for sudden, dramatic discontinuities in the way we carry out processes. It does, however, have the potential to add an important new means of accessing existing services, which will require significant work on the part of service providers. Anyone who does not track this technology closely could be badly caught out.
The technology under the microscope here is voice Web technology, to be incorporated into voice Web servers and browsers. The aim is to allow information sources such as the Internet to be accessed by means of voice technology. Other important aspects of the technology are the development of dialogues and context-dependent knowledge of user intentions.
Most readers will be familiar with the universally despised call-centre voice interface, whereby the caller waits interminably for a menu to be read out and intermittently presses buttons, all the while wishing they could talk to someone. This technology is known, incidentally, as IVR (Interactive Voice Response).
It is important to understand that call-centre development is driven by the desire to reduce costs rather than to create new markets. The determining factor is the worst possible service the user will tolerate rather than what he/she will buy. That distinction divorces it completely from the line of research being undertaken by Hewlett-Packard, a leader in voice Web, which focuses on improved quality of response.
HP's research has two main goals. First, improvement of natural language processing capabilities to allow users to cut through standard menus by asking directly for the information or routing required. Second, multimodality - that is, the incorporation of voice, text and images in combination to provide the optimal response at the user's terminal device.
Voice response from the Web requires pages of information to be marked up separately in a voice mark-up language and for these pages to co-exist with pages of text and images. Standardisation (as with HTML) must therefore be a critical factor, so bodies concerned with voice Web standards are important.
Key here is the W3C Voice Browsing Activity Group, whose multimodality subgroup is chaired by HP. This group is scheduled to publish an initial voice browsing standard later this year, defining minimal standard dialogues for Internet browsing not dissimilar to IVR. Essentially, instead of intermittently pressing buttons, you could say "Yes" at appropriate points to a similarly voiced menu of options for any Web page. This may sound like simply compounding the call-centre problem, but a blind person might think differently.
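To make the idea concrete, here is a toy sketch in Python of the kind of minimal yes/no menu dialogue such a standard might define. All the names and the menu contents are invented for illustration; this is not any real W3C or VoiceXML API, and the list of answers simply stands in for a speech recogniser's output.

```python
# Toy sketch of a minimal voice-menu dialogue of the kind an initial
# voice-browsing standard might define. Names are illustrative only.

def run_menu(options, answers):
    """Read out each option in turn; a spoken 'yes' selects it.

    options -- list of (prompt, target_url) pairs
    answers -- sequence of user responses ('yes'/'no'), standing in
               for a speech recogniser's output
    """
    responses = iter(answers)
    for prompt, url in options:
        print(f"Say yes to choose: {prompt}")
        if next(responses, "no").strip().lower() == "yes":
            return url
    return None  # the user declined every option

menu = [("Traffic news", "/traffic"),
        ("Weather", "/weather"),
        ("Accommodation", "/hotels")]
```

Running `run_menu(menu, ["no", "yes"])` declines the traffic option and selects "/weather" - exactly the "say yes at appropriate points" interaction described above.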
While a necessary initial step, this is not the end-game of HP's research efforts. HP is looking to a second round of standardisation, probably a couple of years ahead, which will go much further in multimodal dialogue definition and natural language processing. These developments will be crucial to the success of this technology.
Developments needed in natural language processing are helped enormously if good contextual information is present, since each context can be inferred to have its own, often quite limited, keyword vocabulary. If you need traffic information, words such as bedroom, bath and meals can reasonably be assumed to be superfluous. On the other hand, if you are trying to book accommodation ahead while en route, words such as accident, congestion and road repairs are similarly irrelevant.
An important part of the HP research is therefore to define scenarios in which voice Web technology could be used. Identified scenarios include in-vehicle, in shops, on foot and at home. The contextual information narrows the form of dialogue, keyword vocabulary and the set of user intentions that need to be taken into consideration.
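The contextual narrowing described above can be sketched very simply: each scenario carries its own small keyword vocabulary, and recogniser hypotheses falling outside it are discarded. The scenario names and vocabularies below are invented for the example.

```python
# Illustrative sketch of context-dependent vocabularies: each usage
# scenario keeps its own limited keyword list, so recognition
# hypotheses outside that list can be discarded early.

SCENARIO_VOCAB = {
    "in-vehicle": {"traffic", "accident", "congestion", "roadworks", "route"},
    "at-home":    {"bedroom", "bath", "meals", "booking", "availability"},
}

def filter_hypotheses(scenario, hypotheses):
    """Keep only recognition hypotheses that fit the active scenario."""
    vocab = SCENARIO_VOCAB.get(scenario, set())
    return [word for word in hypotheses if word in vocab]
```

For instance, `filter_hypotheses("in-vehicle", ["traffic", "bath", "congestion"])` keeps "traffic" and "congestion" but drops "bath", which has no place in an in-vehicle dialogue.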
HP Labs has an experimental technology codenamed Maverick, in which one scenario is a Web music shop. Users can control and navigate using speech and/or a mouse and keyboard. All these behaviours are encoded in a new mark-up language. This technology gives a good indication of what interaction with the Web of the future might feel like.

A further constraint within which voice Web technology must work is defined by the telecoms services available, whether wired or, more probably in the voice Web situation, wireless. However, these appear adequate for most purposes at the moment and will surely improve in the future. The key is to understand what kind of service, at any one time, they are capable of supporting.
Finally, the probable delivery mechanisms for voice Web are mobile phones, or microphones attached to PCs or PDAs (most of which have loudspeaker facilities). For multimodal response, particularly in hands-off situations, PDAs would seem to have a distinct edge.
The potential market for voice Web lies with two sectors: those who need it and those who would like it. The former consists primarily of hands-off situations and would use adequate if not ideal technology; for the latter, cosmetic or lifestyle factors will be paramount.
Part of HP's research into voice Web consists of constructing scenarios, in which devices incorporating the technology would be used. An indication of the size of the in-vehicle market alone, given appropriate devices and services, can be gleaned from statistics from the AA.
According to the AA, there were 22 million cars and 144,000 commercial vehicles registered in the UK in late 1998 (the latest available figures), although a significant part of the car population will be used for commercial purposes too. The commercial sector is important because any identifiable benefits or savings from the use of voice Web will have an immediate commercial case. Moreover, market take-up should be earlier because adequate rather than preferred technology will serve the purpose.
In-vehicle technology is relevant only when people are actually in their vehicles and probably only when making long journeys. There appear to be no statistics for this but the AA estimates there are 144,000 vehicles every day on the northern sector of the M25, from which reasonable inferences on journey length and vehicles in motion at any one time could be made.
All these figures are for the UK, or parts of it. The worldwide market is clearly huge for any successful deployment of the technology. Moreover, all the major vehicle manufacturers appear to be working on voice customisation - the incorporation of optimal voice detection systems. And this is just one of the scenarios HP is working on.
The two- to five-year timescale is key in this case. Elements of the technology required for voice Web interaction are already available but most need improvement. Also, time is needed for standards to be created and implemented and for Web pages to be produced in voice mark-up language. It's probably unwise to expect anything really noticeable to happen within three years, but after that significant changes could be discernibly under way.
Numerous question marks have to be set against voice Web technology but almost all relate to timescales rather than viability or potential demand. There can be little doubt that HP is at the leading edge of the technology and will stay there, so any doubts relate to when rather than whether. Realisable benefits from technological development will be incremental rather than discontinuous, so key to exploitation will be positioning and understanding of opportunities as they occur.
If all this works to plan, how does HP gain? HP is in the fortunate position of being a seller of base technology, hardware devices and services. It is arguably the major IT supplier that has done most for the creation, implementation and promulgation of open standards (think of printer interfaces and Unix, for instance). It should therefore have little difficulty in selling a good implementation of the base technology. It also has a considerable presence in various hardware markets, making the company well positioned to exploit the market for voice Web devices.
In the pipeline
Hewlett-Packard's principal centres for research are at Palo Alto in the US and at Bristol in England. Further research labs are located in Cambridge (Massachusetts), Grenoble in France, Haifa in Israel, and Tokyo.
In this article, we look at work being carried out at the Bristol laboratories, which covers all the fields of research listed below, although the focus of this feature is voice Web technology.
HP's research can be broadly grouped into four categories:
Unusually, HP positions its labs at the forefront of its public relations and corporate marketing activities rather than hiding them in the background. Newly installed CEO Carly Fiorina highlighted them in her keynote address to the Comdex conference at the end of last year as the engine of innovation for the company. She said: "I am confident because of the inventive spirit of HP and our people, who are best personified, perhaps, by what I found in HP Labs."
Speech processing technology has a number of components:
Voice recognition. This is the conversion of acoustic signals received from voice input into a sentence of distinct words by matching segments of the signals with a stored library of phonemes (irreducible units of sound). Problems include the elimination of background noise and variations in pronunciation by individual and circumstance, such as a cold, a sore throat or an accent. Training systems to recognise a specific voice is still needed for the most accurate results.
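A deliberately simplified sketch of that matching step: each incoming acoustic segment is labelled with the nearest template in a stored phoneme library. Real recognisers use statistical models over far richer acoustic features; the two-value templates and the phoneme set here are invented purely for illustration.

```python
# Toy phoneme matching: label each acoustic segment with the library
# phoneme whose template is closest by squared distance. Template
# values are invented for illustration.

PHONEME_LIBRARY = {          # phoneme -> toy 2-value acoustic template
    "k":  (0.9, 0.1),
    "ae": (0.2, 0.8),
    "t":  (0.7, 0.3),
}

def nearest_phoneme(segment):
    """Return the phoneme whose stored template best matches the segment."""
    def dist(template):
        return sum((a - b) ** 2 for a, b in zip(segment, template))
    return min(PHONEME_LIBRARY, key=lambda p: dist(PHONEME_LIBRARY[p]))

def decode(segments):
    """Convert a sequence of acoustic segments into phoneme labels."""
    return [nearest_phoneme(s) for s in segments]
```

With these invented templates, `decode([(0.85, 0.15), (0.25, 0.75), (0.65, 0.35)])` yields the phoneme sequence ["k", "ae", "t"].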
Natural language processing. This involves parsing each sentence received from the voice recognition system into its grammatical components (verb, subject, object, and so on) and formatting these into semantic frames. A semantic frame is a command-like structure containing a clause, topic and predicate. A major sensitivity here is the need for coherent logical phrasing of the key, usually initial, sentence in any conversation. Garbled sentences can be clarified by iterative Q/A sessions, but clear and logically expressed requests are ideal.
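A rough sketch of what producing such a frame might look like, using an invented frame layout (clause, topic, predicate) and a tiny keyword grammar; a real system would use a proper grammatical parser rather than keyword spotting.

```python
# Toy conversion of a user sentence into a semantic frame. The frame
# fields and the keyword "grammar" are invented for illustration.

def to_frame(sentence):
    """Build a crude semantic frame (clause, topic, predicate)."""
    words = sentence.lower().rstrip("?.!").split()
    topics = {"traffic", "weather", "accommodation"}
    topic = next((w for w in words if w in topics), None)
    clause = "question" if words[0] in {"what", "is", "are", "how"} else "request"
    # For this sketch the predicate is just the normalised sentence.
    return {"clause": clause, "topic": topic, "predicate": " ".join(words)}
```

Here `to_frame("What is the traffic like ahead?")` yields a frame with clause "question" and topic "traffic", while `to_frame("Book accommodation for tonight")` yields clause "request" and topic "accommodation".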
Semantic processing. This is the processing of semantic frames for inclusion in natural language responses or for translation into SQL or other computer languages in order to retrieve the information necessary for a response. Retrieved information is placed into further semantic frames which are assembled to build a response.
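The translation step might be sketched as follows; the table names, the schema and the frame-to-table mapping are all invented for the example, and a real system would map frames onto its own database schema.

```python
# Toy translation of a semantic frame into a parameterised SQL query.
# Table and column names are invented for illustration.

def frame_to_sql(frame):
    """Map a frame's topic onto a query against an assumed schema."""
    table = {"traffic": "traffic_reports",
             "accommodation": "hotels"}.get(frame["topic"])
    if table is None:
        raise ValueError(f"no table mapped for topic {frame['topic']!r}")
    return f"SELECT * FROM {table} WHERE region = :region"
```

For a frame with topic "traffic", this produces the query "SELECT * FROM traffic_reports WHERE region = :region", whose results would then be packed into further semantic frames for the response.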
Speech synthesis. Speech synthesisers render the responses in natural language. The problems here are less to do with the understanding of meaning than with cosmetic but nonetheless important factors, such as avoiding Mickey Mouse voices or sounding too much like a robot.