Most PC users already talk to their computers, usually with
four-letter words. So voice recognition technology could be just
the thing.
In 1950, Alan M Turing proposed that a machine can be considered
"intelligent" when a person following an interaction between a
computer and a human, without seeing the participants, cannot
decide which is the person, and which is the machine.
Enter the off-world of Blade Runner, where the line blurs between
human and replicant, and you don't know whether you are talking to
a man or a machine. This is the hyped world of voice
technology.
Speech is the most accessible, mobile and easy-to-use interface
that exists today. Speech recognition, by "understanding" what an
end user is saying, enables users to interact with computers using
voice commands. Allowing voice access to information anytime,
anywhere, from any device, speech recognition can provide a more
effective way for companies to communicate with customers, save
money and facilitate e-business.
As the number of mobile workers grows and the desire for instant
information increases, demand will grow for voice-enabled
unified messaging, e-mail, address books, production schedules,
inventory details, sales records, and other database information.
Until recently, corporate take-up of speech technology, such as
IBM's ViaVoice, has been slow, but the market has changed
dramatically. The network revolution has required companies to
increase their customer reach, provide better service, and handle
more transactions - all while reducing costs.
At the same time, users need consistent, convenient access to the
internet wherever they are. The IT environment is primed for
changing to a new conversational user interface.
The biggest uptake of speech recognition technology has been seen
in the travel, finance, and telecommunication industries, where
companies are trying to improve service and cut call centre costs.
A Yankee Group report found that call centre solutions can reduce
the cost of a call from five dollars per minute for a live agent to
between 10 and 30 cents per minute for a virtual agent.
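To put those figures in proportion, the short Python sketch below works out the saving per minute and as a percentage. It uses only the numbers quoted from the report; the function and variable names are illustrative, and treating the virtual-agent cost as a per-minute rate is an assumption.

```python
# Figures quoted from the Yankee Group report, in US dollars.
# Assumption: the virtual-agent cost is also a per-minute rate.
LIVE_AGENT_COST = 5.00
VIRTUAL_AGENT_COST_RANGE = (0.10, 0.30)


def saving_per_minute(live: float, virtual: float) -> float:
    """Return the absolute saving per minute of call time."""
    return live - virtual


def saving_percent(live: float, virtual: float) -> float:
    """Return the saving as a percentage of the live-agent cost."""
    return (live - virtual) / live * 100


if __name__ == "__main__":
    for virtual in VIRTUAL_AGENT_COST_RANGE:
        print(
            f"virtual agent at ${virtual:.2f}/min: "
            f"saves ${saving_per_minute(LIVE_AGENT_COST, virtual):.2f}/min "
            f"({saving_percent(LIVE_AGENT_COST, virtual):.0f}%)"
        )
```

On the quoted range, the saving works out at roughly 94 to 98% of the live-agent cost.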
"We are now seeing 'contact centres' that extend e-business into
the voice self-service and assisted space; because companies cannot
afford to have two sets of IT staff - one focussing on voice, one
on the Web," said Sunil Soares, program director of product
management, IBM Voice Systems.
"We have seen all our contact centre markets growing at
double-digit rate. We believe that the market that is growing the
fastest is actually speech recognition. The speech recognition
market impacts the interactive voice response (IVR: using a touch-tone
telephone to interact with a database) market in a fundamental way,
because you are changing the interface.
"It's no longer DTMF (dual-tone multi-frequency, the signalling used
by touch-tone phones), it's speech."
Big Blue is angling to make a big splash in the speech technology
market and announced its commitment over the next two quarters by
giving its portfolio of voice recognition systems a new umbrella
term, "Conversational Services".
IBM will roll out products, which will include speech translation,
multi-modal interfaces, middleware, natural language understanding,
text-to-speech, and biometrics.
Duncan Ross, IBM sales and marketing manager for EMEA Voice
Systems, explained: "The big challenge in technology development is
the human factors involved with developing speech application, and
getting the accuracy right. No longer do you have to train the
program to your voice. It has to be speaker independent and be able
to understand different accents. We have moved from 'phoneme'
analysis, which dissects the smallest intelligible segment of sound
in a word, to 'trigram' analysis, which statistically models three
words in a sentence. This makes the application more
intelligent.
"With the WebSphere voice server, speech recognition happens on the
server. Because we do not have to cram the technology onto a mobile
device, having it on the server opens up what we can do.
"The voice server allows users access to information and enables
them to conduct transactions using their voice, which is a more
natural and pleasant experience. It is a part of WebSphere, so it
is a natural extension for a company that's already web-enabled
their business."
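The 'trigram' analysis Ross mentions can be illustrated with a toy Python sketch. It counts which word tends to follow each pair of words in some training text, then uses those counts to choose between two similar-sounding candidates. This is only a minimal illustration of the general idea, not IBM's implementation: the training sentences, function names and the unsmoothed frequency estimate are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def train_trigrams(sentences):
    """Count which word follows each pair of preceding words."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Pad with start markers so the first real word has a two-word context.
        words = ["<s>", "<s>"] + sentence.lower().split() + ["</s>"]
        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            counts[(w1, w2)][w3] += 1
    return counts


def probability(counts, w1, w2, w3):
    """Estimate P(w3 | w1, w2) by relative frequency (no smoothing)."""
    following = counts.get((w1, w2))
    if not following:
        return 0.0
    return following[w3] / sum(following.values())


def pick_word(counts, w1, w2, candidates):
    """Choose the candidate the model finds most likely after w1 w2."""
    return max(candidates, key=lambda w: probability(counts, w1, w2, w))


# Toy training text, invented for illustration.
corpus = [
    "check my account balance",
    "check my account balance please",
    "transfer my account to bonds",
    "what is my account balance",
]
model = train_trigrams(corpus)

# Suppose the recogniser is unsure whether it heard "balance" or "ballots".
print(pick_word(model, "my", "account", ["ballots", "balance"]))  # prints "balance"
```

A real system would train on far more text and smooth the estimates so that unseen word sequences are not ruled out entirely.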
IBM has launched WebSphere infrastructure software that includes a
speech recognition engine, Voice Server, a VoiceXML browser, and
HTML-to-VoiceXML features. VoiceXML (VXML), a language derived from
XML, enables speech-recognition systems to access Web-based
information. Supported by AT&T, IBM, Motorola and Lucent
Technologies, VXML has been accepted as the industry standard,
enabling phone callers to access the same Web-based data as an
internet browser.
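For readers who have not seen the markup, the Python sketch below builds a minimal VoiceXML document that speaks a single prompt to a caller. The element names (vxml, form, block, prompt) come from the VoiceXML specification; the form id and the wording of the prompt are hypothetical, and later versions of the specification also expect an XML namespace declaration, which is omitted here for brevity.

```python
import xml.etree.ElementTree as ET


def build_balance_dialog(balance_text: str) -> str:
    """Return a tiny VoiceXML document that speaks one prompt."""
    vxml = ET.Element("vxml", version="1.0")
    form = ET.SubElement(vxml, "form", id="balance")  # hypothetical form id
    block = ET.SubElement(form, "block")
    prompt = ET.SubElement(block, "prompt")
    prompt.text = balance_text
    # Serialise the element tree to a plain string of markup.
    return ET.tostring(vxml, encoding="unicode")


print(build_balance_dialog("Your account balance is one thousand dollars."))
```

A voice browser would fetch a document like this from a web server in the same way that a desktop browser fetches HTML, then read the prompt aloud.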
'Soft knowledge'
David Bradshaw, vice-president of consulting for North America at
Ovum, says: "VXML makes it easier
to build speech applications, but it's not quite as simple as that,
for to make these applications work for users, you need 'soft
knowledge'. It's not based on hard knowledge and hard fast rules,
but trade craft and experimentation to get the application to work
right."
Microsoft is also taking this market very seriously. Researchers in
its Speech Technology group have developed the 'Dr. Who' engine,
which uses continuous speech recognition and spoken language
understanding not only to accept speech input but also to
understand questions. Representatives from Microsoft were
unavailable for comment.
But Bradshaw claimed: "Microsoft is working mostly behind the
scenes, partnering with and investing in main players, but having
no direct play in this space. Intel is apparently taking a similar
approach."
But is the technology up to scratch? Simone Roberts, senior analyst
at the Yankee Group, believes the technology has come a long way in
recent years. "The concept is not out of this world, but the
technology has just not been up to it. Currently, speech vendors
are claiming 95 to 98% accuracy or quality of service levels for
their technologies.
"But based on wireless connections, the accuracy level would be
much lower. For wireless access, speech recognition technologies
must be able to cope with interference, increased background noise
and limited wireless reception."
Limitations advantage
Perfection is not all that it's cracked up to be, Roberts argues,
and there are advantages to
limitations in technology: "There are benefits for the machine
sound. It alerts the users that they are not talking to a human,
making them talk clearer and use simpler language to assist the
machine."
So is there a future for speech recognition? Duncan Ross believes
it will be part of a bigger picture. "I think the big challenges
are enabling total flexibility of how an end user can get things
done: multi-modal interaction. Now, if you use a phone, you get the
information back as speech. But maybe I would like to get it back
through text or table.
"The end users should be able to get the information they want; we
need to be more flexible in conversational methods. Development is
aimed towards all different styles of human interface picking the
most appropriate style at that time."
It is a theme supported by his colleague Soares: "In countries
like India there are over 1 billion people, with less than 1%
having web access, so if you are a content provider you look around
and you will see that the most ubiquitous device is the phone.
Probably every small village has one or two phones. You can
therefore see a business case where you can use speech recognition
plus VoiceXML to source web content to people who don't have
PCs."
Case study: T. Rowe Price
T. Rowe Price, a Baltimore fund complex, began rolling out a
voice-response system designed by
IBM that lets callers ask a computer for their account balances,
and other basic information, using ordinary spoken sentences.
The voice system, claims IBM, recognises 35,000 English phrases and
accommodates a range of US speech patterns. T. Rowe Price and IBM
bill their set-up as a 'natural-language understanding' (NLU)
system able to interpret a broad variety of different ways to say
the same things. NLU, part of WebSphere voice server and
DirectTalk, is a sophisticated system that 'hears' the sentence,
breaks it down into its component parts, and picks up key words and
phrases that have been programmed to connect with certain enquiry
or transaction functions.
The system then checks to see if the caller's request is valid,
according to the rules of his or her plan, and then responds
appropriately. The new service offers callers 24-hour access to all
their accounts.
Callers can transfer to a representative at any time during the
call by saying 'I'd like to speak to a representative.'
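The flow described above (pick out key words, connect them to a function, check the request against the plan's rules, respond) can be sketched in a few lines of Python. This is a deliberately crude illustration, not the IBM system: the key phrases, function names, plan names and rules are all hypothetical, and real natural-language understanding does far more than match single words.

```python
import re

# Hypothetical key words mapped to hypothetical function names.
KEY_PHRASES = {
    "representative": "transfer_to_agent",
    "balance": "account_balance",
    "transfer": "fund_transfer",
}

# Hypothetical plan rules: which functions each plan permits.
PLAN_RULES = {
    "basic": {"account_balance", "transfer_to_agent"},
    "full": {"account_balance", "fund_transfer", "transfer_to_agent"},
}


def find_function(utterance: str):
    """Return the first function whose key word occurs in the utterance."""
    words = re.findall(r"[a-z']+", utterance.lower())
    for phrase, function in KEY_PHRASES.items():
        if phrase in words:
            return function
    return None


def handle(utterance: str, plan: str) -> str:
    """Map the utterance to a function, then check it against the plan rules."""
    function = find_function(utterance)
    if function is None:
        return "Sorry, I did not understand that."
    if function not in PLAN_RULES[plan]:
        return "Sorry, your plan does not allow that request."
    return f"OK: {function}"


print(handle("I'd like to speak to a representative.", "basic"))
print(handle("What's my account balance?", "basic"))
print(handle("Please transfer my savings to bonds", "basic"))
```

The third request is recognised but refused, because the hypothetical "basic" plan does not permit fund transfers.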
"With this latest technology, we've jumped to the next generation
of voice recognition, one that will significantly enhance a
participant's ability to manage their retirement investments. It
does away with lengthy menu options, making it easy, simple-to-use,
flexible, and fast," said Charlie Vieth, president, T. Rowe Price
Retirement Plan Services.
T. Rowe Price plans to offer the service to 1.2 million of its
retirement-plan participants by the year end, and hopes to get a
competitive advantage with a cutting-edge, easy-to-use phone
system. The company received substantial discounts for helping IBM
develop the technology, but declined to discuss prices.
IBM is currently negotiating trials with top banks and
telecommunication companies in the UK, and is looking to bring out
the NLU system in UK English shortly.