Humans have been “talking” to computers on screen since Star Trek and 2001: A Space Odyssey in the 1960s, so one might reasonably argue that we have Hollywood to thank for our perception of where speech recognition technology should be today.
We can’t yet record a piece of speech, run it through speech recognition technology and get a full (completely accurate) transcript, but we’re not far off, suggests Peter Mahoney, senior VP and chief marketing officer of Boston-based Nuance.
In a series of press briefings held this week, the maker of the ‘Dragon NaturallySpeaking’ speech recognition product has sought to explain where we are with this technology today and detail just how close we are to a potential “paradigm” shift for developers and users alike.
“We’ve made huge advances, but we still have a few challenges before we can reach full ‘Star Trek’-ness,” said Mahoney as he explained how challenging speech recognition can be when it comes to handling so-called “homonyms” across the 51 languages that his company’s technology supports.
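NOTE: Wikipedia’s “homonyms” definition is short and succinct: “In linguistics, a homonym is one of a group of words that share the same spelling and the same pronunciation but have different meanings.” Strictly speaking, there, their and they’re are homophones (same pronunciation, different spelling), but they pose the same problem for a recognizer: the audio alone cannot tell them apart.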
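To see why words such as there, their and they’re are awkward for a recognizer, consider that the sound is identical for all three, so the spelling has to be chosen from the surrounding words. The following is a minimal Python sketch of that idea only; the toy corpus, the `pick_spelling` helper and the simple bigram scoring are illustrative assumptions and say nothing about how Nuance’s engines actually work.

```python
from collections import Counter

# Toy training text (hypothetical); a real system would learn from far more data.
CORPUS = (
    "they left their coats over there . "
    "they're late because their train was late . "
    "put it over there . they're here ."
).split()

# Bigram model: count how often each word is followed by each other word.
BIGRAMS = Counter(zip(CORPUS, CORPUS[1:]))

# Words that sound identical, so the audio alone cannot separate them.
HOMOPHONES = {"there", "their", "they're"}


def pick_spelling(previous_word, next_word):
    """Return the homophone that best fits between its two neighbours."""
    def score(candidate):
        # Add up how often the candidate was seen after the previous word
        # and before the next word in the toy corpus.
        return BIGRAMS[(previous_word, candidate)] + BIGRAMS[(candidate, next_word)]

    # Sorting first makes tie-breaking deterministic.
    return max(sorted(HOMOPHONES), key=score)


print(pick_spelling("over", "."))      # context: "over ___ ."
print(pick_spelling("left", "coats"))  # context: "left ___ coats"
```

With so little data the sketch only works for contexts it has already seen, which hints at why accuracy depends so heavily on the amount and variety of language a recognizer has been trained on.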
So are we at a tipping point for voice?
• Algorithmic advances have pushed real speech recognition forward by a quantum leap.
• NLU (natural language understanding) innovations have progressed significantly.
• Mobile device proliferation (think Apple Siri) has led to more users potentially using speech recognition (Nuance has a working relationship with Apple, but won’t say any more).
• Computational advances in terms of processing power, parallelism and multi-core have also helped.
• Mixed modality designs — a combination of images and voice makes it easier for users to interact with voice-based systems.
So does speech recognition really work now? Many of us will remember having used it over the last decade and having to resort to ‘robot speech’ in order to get any of this technology to actually understand what we are saying.
Nuance’s Mahoney says that his company has shipped mobile speech technology on more than 500 million devices to date and that its systems automate more than 10 billion caller interactions each year.
“The human voice is an incredibly rich, natural and efficient means of communication. Nuance builds solutions that enable computers, phones, tablets, automobiles, TVs and consumer electronics to understand the human voice, providing a natural interface between man and machine,” he said.
Nuance talks extensively about the “nuances” (yes, I know!) of engineering individual algorithmic engines to accommodate different languages and different accents, but it gets even more complicated than whether you happen to have a Mancunian or Liverpudlian lilt.
Imagine you are a British Indian from a Punjabi background but living in Glasgow. The intonations make the brain spin. If you happen to have a cleft palate, some kind of speech impediment or even learning difficulties, the software has even more hurdles to clear.
“There’s a lot of work to be done connecting to the different data domains in each region,” explains Mahoney. OK so the company started by focusing on US English and connecting to Facebook — but as it now looks to China’s Facebook equivalents and other social networks and services in the region, a new challenge presents itself.
So where is this technology used today?
Consumers are typically a bit more “forgiving” with regard to speech technology. But in fields like healthcare we find that application-specific use cases demand extremely high levels of accuracy.
Nuance says that speech will now change the role of the medical transcriptionist from that of a writer to that of an editor. Medical is a huge field for speech, and software application developers have shown particular interest in supporting hospital and healthcare staff who are notoriously short of time and notoriously poor at ‘written admin and reporting’.
The company will also combine its sister imaging solutions business with its speech knowledge and work with multi-function printer (MFP) manufacturers to offer speech as an interface to command and control some core MFP functions at the device.
… and of course Nuance has “cracked the accuracy” (its words not mine) of the Dragon desktop software offering.
The company says that it has slashed user “enrolment times” (the amount of time it takes to install and train the software), improved experiences by offering noise-cancelling microphones and generally improved the total engineering mechanics of its product.
Text to speech is equally important here. The Kindle uses Nuance’s code to “read” books aloud.
So have we reached the Star Trek point yet — just how close are we today?
Mahoney suggests that we could be close to full “Star Trek”-ness (or, to give it its proper name, “robust natural language” capability) in six to ten languages by the end of this year.