IBM has renamed its portfolio of voice recognition systems
Conversational Services.
The company will roll out products covering speech translation,
multimodal interfaces, middleware, natural-language understanding
(NLU), text-to-speech and biometrics.
IBM will soon introduce one of the first products to use visual
cues, such as lip and mouth movements, to help interpret speech,
according to Dr David Nahamoo, senior manager of the human language
technologies department at IBM's Thomas J Watson Research Center.
He added that the product is already in beta with a number of
enterprises and will be available in about two years' time.
The visual recognition system could be of benefit in customer
relationship management applications, for example, where call
centre personnel would be able to interpret a customer's body
language, Nahamoo said.
"The face is sending a message: happiness, sadness, anger. The
challenge is how you model that and integrate it on top of the
other [speech] technologies," he added.
In the short term, IBM's visual recognition system uses a
microphone, a camera to monitor lip and mouth movement, and a set
of business rules built into the recognition system.
"It might have a policy that if the face is not looking at the
camera, the system understands that the person is not talking to me
and so the computer can eliminate the sounds as noise," Nahamoo
said. Also, if the lips are not moving but the system is picking up
words or sounds, that information is filtered out as extraneous.
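The two rules described above can be sketched in a few lines of
Python. This is a minimal illustration, not IBM's implementation:
the Frame structure, its field names and the accept_audio and
filter_frames functions are all hypothetical, and a real system
would have to derive the two visual flags from camera analysis.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    """One synchronised audio/video observation (hypothetical structure)."""
    audio_text: str       # words or sounds picked up by the microphone
    facing_camera: bool   # is the face looking at the camera?
    lips_moving: bool     # are the lips moving?


def accept_audio(frame: Frame) -> bool:
    """Apply the two rules from the article to a single observation."""
    # Rule 1: a face turned away means the person is not addressing
    # the system, so the sound is treated as noise.
    if not frame.facing_camera:
        return False
    # Rule 2: sound picked up while the lips are still is extraneous.
    if not frame.lips_moving:
        return False
    return True


def filter_frames(frames: List[Frame]) -> List[str]:
    """Keep only the audio that passes both visual rules."""
    return [f.audio_text for f in frames if accept_audio(f)]
```

For example, of three observations where only the first shows a
speaker facing the camera with moving lips, only the first would be
passed on to the speech recogniser.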
Some of these technologies will be especially useful in noisy
environments, such as a moving car or on a stock market trading
floor, noted Nigel Beck, IBM's director of Voice Systems.
"If the vocabulary in the system is small enough, it can recognise
some words even in noise, and can be trained for digits in
something as noisy as a 10-decibel environment," Beck said.
The system builds a template over time for each movement of the
lips and converts the information into the basic ones and zeroes
that computers understand.
The unit of visual analysis is called a "viseme", analogous to a
phoneme, the smallest intelligible segment of sound in a word. A
viseme is the smallest intelligible segment of a lip gesture;
strung together with other visemes, it allows the system to
recognise the movements in aggregate as a word.
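The idea of assembling visemes into words can be illustrated with
a simple lookup. This is only a sketch under stated assumptions:
the viseme labels, the VISEME_LEXICON table and the words_for
function are invented for illustration and do not reflect how
IBM's system represents lip gestures. The lookup returns a list
because several words can look alike on the lips, which is one
reason to combine the camera with the microphone.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical lexicon: each key is a sequence of viseme labels and
# each value lists the words that sequence could represent.
VISEME_LEXICON: Dict[Tuple[str, ...], List[str]] = {
    ("lips_closed", "jaw_open", "tongue_to_ridge"): ["pat", "bat", "mat"],
    ("lips_rounded", "jaw_open", "tongue_to_ridge"): ["what"],
}


def words_for(visemes: Sequence[str]) -> List[str]:
    """Return the candidate words for a sequence of viseme labels.

    An unknown sequence yields an empty list rather than an error.
    """
    return list(VISEME_LEXICON.get(tuple(visemes), []))
```

In this toy table the first sequence maps to three candidate words,
so the audio signal would still be needed to choose between them.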
In other recent developments, last week IBM officials displayed a
prototype add-on sled that will fit onto the back of a Palm
handheld. The speech sled contains a DSP (digital signal
processing) chip and memory for translating speech to text, and can
be used to issue commands to a contact database or appointments
calendar, as well as for voice-activated phone dialling.
IBM is also focusing efforts on its WebSphere middleware products.
Last month it introduced its WebSphere Translation Server, which
can translate about 500 words per second. An official said that
Deutsche Bank will be one of the first customers.
Finally, IBM is using a technique called phrase splicing so that
users will not have to listen to a mechanised voice.