The gift of the gab

Most PC users already talk to their computers, usually with four-letter words. So voice recognition technology could be just the thing.

In 1950, Alan M Turing proposed that a machine can be considered "intelligent" when a person following a conversation between a computer and a human, without seeing the participants, cannot decide which is the person and which is the machine.

Enter the off-world of Blade Runner, where the line blurs between human and replicant, and you don't know whether you are talking to a man or a machine. This is the hyped world of voice technology.

Speech is the most accessible, mobile and easy-to-use interface that exists today. Speech recognition, by "understanding" what an end user is saying, enables users to interact with computers using voice commands. By allowing voice access to information anytime, anywhere, from any device, speech recognition can give companies a more effective way to communicate with customers, save money and facilitate e-business.

As the number of mobile workers grows and the desire for instant information increases, so will demand for voice-enabled unified messaging, e-mail, address books, production schedules, inventory details, sales records, and other database information.

Until recently, corporate take-up of speech technology, such as IBM's ViaVoice, has been slow, but the market has changed dramatically. The network revolution has required companies to increase their customer reach, provide better service, and handle more transactions - all while reducing costs.

At the same time, users need consistent, convenient access to the internet wherever they are. The IT environment is primed for a change to a new conversational user interface.

The biggest uptake of speech recognition technology has been in the travel, finance, and telecommunication industries, each of which is trying to improve service and cut costs in call centres. A Yankee Group report found that call centre solutions can reduce the cost from five dollars per minute for a live agent to between 10 and 30 cents per minute for a virtual agent.
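
To put the Yankee Group figures in proportion, the short sketch below works out the saving they imply. The per-minute costs come from the report as quoted above; the monthly call volume of 10,000 minutes is a purely hypothetical figure chosen for illustration, and the function names are our own.

```python
# Per-minute costs in US dollars, as quoted from the Yankee Group report.
LIVE_AGENT_COST = 5.00
VIRTUAL_AGENT_COSTS = (0.10, 0.30)  # low and high ends of the quoted range

# Hypothetical call volume, assumed only for this illustration.
EXAMPLE_MINUTES_PER_MONTH = 10_000


def saving_fraction(live_cost: float, virtual_cost: float) -> float:
    """Return the share of the live-agent cost that a virtual agent saves."""
    return (live_cost - virtual_cost) / live_cost


def monthly_saving(minutes: int, live_cost: float, virtual_cost: float) -> float:
    """Return the dollar saving for a given number of call minutes."""
    return minutes * (live_cost - virtual_cost)


if __name__ == "__main__":
    for virtual_cost in VIRTUAL_AGENT_COSTS:
        share = saving_fraction(LIVE_AGENT_COST, virtual_cost)
        dollars = monthly_saving(
            EXAMPLE_MINUTES_PER_MONTH, LIVE_AGENT_COST, virtual_cost
        )
        print(f"virtual agent at ${virtual_cost:.2f}/min: "
              f"saves {share:.0%}, or ${dollars:,.0f} on the example volume")
```

On the quoted figures, the saving per minute works out at between 94% and 98% of the live-agent cost.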

"We are now seeing 'contact centres' that extend e-business into the voice self-service and assisted space; because companies cannot afford to have two sets of IT staff - one focussing on voice, one on the Web," said Sunil Soares, program director of product management, IBM Voice Systems.

"We have seen all our contact centre markets growing at double-digit rates. We believe that the market that is growing the fastest is actually speech recognition. The speech recognition market impacts the interactive voice response (IVR, in which callers use a touch-tone telephone to interact with a database) market in a fundamental way, because you are changing the interface.

"It's no longer DTMF (dual-tone multi-frequency, the signalling used by touch-tone phones), it's speech." Big Blue is angling to make a big splash in the speech technology market and announced its commitment over the next two quarters by giving its portfolio of voice recognition systems a new umbrella term, "Conversational Services".

IBM will roll out products that include speech translation, multi-modal interfaces, middleware, natural language understanding, text-to-speech, and biometrics.

Duncan Ross, IBM sales and marketing manager for EMEA Voice Systems, explained: "The big challenge in technology development is the human factors involved with developing speech applications, and getting the accuracy right. No longer do you have to train the program to your voice. It has to be speaker independent and be able to understand different accents. We have moved from 'phoneme' analysis, which dissects the smallest intelligible segment of sound in a word, to 'trigram' analysis, which statistically models three words in a sentence. This makes the application more intelligent.

"With the WebSphere voice server, speech recognition happens on the server. Because we do not have to cram the technology onto a mobile device, having it on the server opens up what we can do.

"The voice server allows users access to information and enables them to conduct transactions using their voice, which is a more natural and pleasant experience. It is a part of WebSphere, so it is a natural extension for a company that's already web-enabled their business."

IBM has launched WebSphere infrastructure software that includes a speech recognition engine, Voice Server, a VoiceXML browser, and HTML-to-VoiceXML features. VoiceXML (VXML), an XML-based language, enables speech-recognition systems to access Web-based information. Supported by AT&T, IBM, Motorola and Lucent Technologies, VXML has been accepted as the industry standard, enabling phone callers to access the same Web-based data as an internet browser.
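
The "trigram" analysis Ross describes earlier, in which three-word sequences are modelled statistically, can be illustrated with a toy example. The sketch below is not IBM's implementation: the sample sentences, function names and candidate words are all invented for illustration. It counts how often each word follows a given pair of words, then uses those counts to choose between two similar-sounding candidates.

```python
from collections import Counter, defaultdict

# Invented sample utterances; a real system would train on far more data.
SAMPLE_CALLS = [
    "check my account balance",
    "check my account details",
    "check my account balance please",
    "what is my account balance",
]


def train_trigrams(sentences):
    """Count, for every pair of consecutive words, which word follows it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Pad so that the first real word also has a two-word context.
        words = ["<s>", "<s>"] + sentence.lower().split() + ["</s>"]
        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            counts[(w1, w2)][w3] += 1
    return counts


def next_word_probability(counts, w1, w2, w3):
    """Estimate P(w3 | w1, w2) from raw counts; 0.0 for an unseen context."""
    context = counts.get((w1, w2))
    if not context:
        return 0.0
    return context[w3] / sum(context.values())


def pick_candidate(counts, w1, w2, candidates):
    """Choose the candidate word that is most likely after w1 and w2."""
    return max(candidates,
               key=lambda word: next_word_probability(counts, w1, w2, word))


if __name__ == "__main__":
    model = train_trigrams(SAMPLE_CALLS)
    # Suppose the recogniser cannot tell these two words apart by sound alone.
    print(pick_candidate(model, "my", "account", ["ballots", "balance"]))
```

In this toy data, "balance" follows "my account" in three of the four samples, so the model prefers it over the acoustically similar alternative. That is the sense in which word context can help make an application "more intelligent".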

'Soft knowledge'
David Bradshaw, vice-president of consulting for North America at Ovum, says: "VXML makes it easier to build speech applications, but it's not quite as simple as that, because to make these applications work for users, you need 'soft knowledge'. It's not based on hard knowledge and hard-and-fast rules, but trade craft and experimentation to get the application to work right."

Microsoft is also taking this market very seriously. Researchers in the Speech Technology group have developed the 'Dr. Who' engine, which uses continuous speech recognition and spoken language understanding not only to enable speech input but also to understand questions. Representatives from Microsoft were unavailable for comment.

But Bradshaw claimed: "Microsoft is working mostly behind the scenes, partnering with and investing in main players, but having no direct play in this space. Intel is apparently taking a similar approach."

But is the technology up to scratch? Simone Roberts, senior analyst at the Yankee Group, believes the technology has come a long way in recent years. "The concept is not out of this world, but the technology has just not been up to it. Currently, speech vendors are claiming 95 to 98% accuracy or quality of service levels for their technologies.

"But based on wireless connections, the accuracy level would be much lower. For wireless access, speech recognition technologies must be able to cope with interference, increased background noise and limited wireless reception."

Limitations advantage
Perfection is not all that it's cracked up to be, Roberts argues, and there are advantages to limitations in technology: "There are benefits for the machine sound. It alerts the users that they are not talking to a human, making them talk clearer and use simpler language to assist the machine."

So is there a future for speech recognition? Duncan Ross believes it will be part of a bigger picture. "I think the big challenges are enabling total flexibility of how an end user can get things done: multi-modal interaction. Now, if you use a phone, you get the information back as speech. But maybe I would like to get it back through text or table.

"The end users should be able to get the information they want; we need to be more flexible in conversational methods. Development is aimed towards all different styles of human interface picking the most appropriate style at that time."

This theme is echoed by his colleague Soares: "In countries like India there are over 1 billion people, with less than 1% having web access, so if you are a content provider you look around and you will see that the most ubiquitous device is the phone. Probably every small village has one or two phones. You can therefore see a business case where you can use speech recognition plus VoiceXML to source web content to people who don't have PCs."

Case study: T. Rowe Price
T. Rowe Price, a Baltimore fund complex, began rolling out a voice-response system designed by IBM that lets callers ask a computer for their account balances, and other basic information, using ordinary spoken sentences.

The voice system, claims IBM, recognises 35,000 English phrases and accommodates a range of US speech patterns. T. Rowe Price and IBM bill their set-up as a 'natural-language understanding' (NLU) system able to interpret a broad variety of ways to say the same thing. NLU, part of WebSphere voice server and DirectTalk, is a sophisticated system that 'hears' the sentence, breaks it down into its component parts, and picks up key words and phrases that have been programmed to connect with certain enquiry or transaction functions.

The system then checks whether the caller's request is valid according to the rules of his or her plan, and responds appropriately. The new service offers callers 24-hour access to all their accounts.

Callers can transfer to a representative at any time during the call by saying 'I'd like to speak to a representative.'
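
The flow described above, in which the system spots key phrases, checks the request against the caller's plan and offers a hand-off to a representative, can be sketched in a few lines. This is a deliberately simplified illustration and not the IBM or T. Rowe Price system: the phrase table, function names and result labels are hypothetical, and plain substring matching stands in for genuine natural-language understanding.

```python
# Hypothetical table linking key phrases to functions. The hand-off to a
# representative is listed first so that it always takes priority.
FUNCTION_PHRASES = {
    "transfer_to_representative": ("representative",),
    "account_balance": ("balance", "how much"),
}


def interpret(utterance, allowed_functions):
    """Map a spoken sentence (already converted to text) to an outcome.

    allowed_functions is the set of functions the caller's plan permits;
    modelling the plan rules as a simple set is an assumption of this sketch.
    """
    text = utterance.lower()
    for function, phrases in FUNCTION_PHRASES.items():
        if any(phrase in text for phrase in phrases):
            # A caller can always ask for a human, whatever the plan allows.
            if function == "transfer_to_representative":
                return function
            if function in allowed_functions:
                return function
            return "not_permitted"
    return "not_understood"


if __name__ == "__main__":
    plan = {"account_balance"}
    print(interpret("What's my account balance?", plan))
    print(interpret("How much do I have in my plan?", plan))
    print(interpret("I'd like to speak to a representative.", plan))
```

Two differently worded balance requests reach the same function here, which is the behaviour the "natural-language understanding" label is meant to convey. A production system would of course need far richer language handling than keyword matching.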

"With this latest technology, we've jumped to the next generation of voice recognition, one that will significantly enhance a participant's ability to manage their retirement investments. It does away with lengthy menu options, making it easy, simple-to-use, flexible, and fast," said Charlie Vieth, president, T. Rowe Price Retirement Plan Services.

T. Rowe Price plans to offer the service to 1.2 million of its retirement-plan participants by the year end, and hopes to get a competitive advantage with a cutting-edge, easy-to-use phone system. The company received substantial discounts for helping IBM develop the technology, but declined to discuss prices.

IBM is currently negotiating trials with top banks and telecommunication companies in the UK, and is looking to bring out the NLU system in UK English shortly.
