Where we once considered punch cards the cutting edge of human-computer interaction, years of industry innovation have brought us to a point where speech recognition technology is advanced enough that I am writing this blog without touching my keyboard.
Note: let's not forget the intermediary years between punch cards and speech recognition, when we were perfectly happy with keyboards, mice and touchpads of various kinds.
After visiting Nuance in Boston last week and interviewing several of the company’s executives on the subject of speech recognition, it seems only fair to put Dragon NaturallySpeaking through its paces and speak this blog straight into a Word document to be later posted online.
Dragon is now genuinely powerful, and although a few niggles crop up in any spoken paragraph, I think that with a little training (of both the computer and myself, the user) I could become far more comfortable with this input method – although I will have to get used to thinking as I speak rather than thinking as I type, which is not as easy as it sounds.
To extend my analysis of Nuance and its work with natural language understanding, I also spoke to another vendor, connecting with Dr. Ahmed Bouzid, Angel's senior director of product and strategy.
Note: Angel is a subsidiary of MicroStrategy and a provider of on-demand customer engagement solutions.
I asked Dr. Ahmed how the speech recognition software application developer community differs now compared to five years ago –
Is it easier to recruit now that the talent pool is richer?
Dr. Ahmed Bouzid — Demand for speech scientists today far outstrips supply. The advent of Siri has been a galvanising event that has awakened the world to the possibilities of highly usable speech/voice user interfaces on the smart device. Evidence is the emergence of a whole new crop of Voice Assistants such as Evi, Cluzee, Eva, Ask Ziggy, and a couple of dozen more on all three of the major mobile OS platforms (Apple, Android, and Microsoft). Available speech scientists (or software developers) today are indeed very difficult to find. I would say that the vast majority of them are taken up, not surprisingly, by Apple, Google, Microsoft, but also by AT&T and Nuance.
Who is winning in the speech vs touch battle?
Dr. Ahmed Bouzid — I think it is a mistake to view speech and touch as mutually competing interfaces. Speech is a highly compelling interface, but only in the right circumstances. You don't want to use speech in a noisy place, or in a setting where you are not able to engage your device privately (e.g., a financial transaction). On the other hand, when you are driving, you do not want to take your eyes away from the road — or your hands off the wheel. For that setting, speech is ideal. So, I would say that the winner is going to be whoever is able to understand that value is not inherent in any given interface but rather in how that interface is introduced in the user's interaction stream. I would venture that today, Siri is not there yet, nor are any of the Assistant Apps out there. None of the speech/voice-enabled assistants today combine speech and touch in a compelling way that empowers the user to do what they want, the way they want it, and when they want it.
Will speech data now become part of the big data mountain in the cloud?
Dr. Ahmed Bouzid — Yes, indeed. The recorded voice is a highly rich collection of data points, and at least for now, such data is transferred over the network for processing. Since the arrival of Siri, Apple has collected billions of audio snippets from actual people asking actual questions (serious as well as silly). Google has done the same and in fact has developed a highly accurate speech recognition engine as a result of the audio data that it has collected over the last few years. Microsoft also runs its speech engine in the cloud and similarly has a treasure trove of audio data. Such data will not only enable these companies to continue refining their speech engines, but may also push these engines to be resilient enough to be highly accurate in data mining other audio (e.g., podcasts).
Note: we still have some way to go with speech recognition, but the advancements that have been made make the technology really quite impressive and fascinating (I'm still talking, not typing). We will still get problems with Homer names — homonyms, that's better — when words sound the same… as that live mistake just shows. But this has to be part of the way we start to use computing devices more in the future, wouldn't you agree?
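That live homonym slip illustrates the core problem nicely: the acoustics alone can't separate words that sound the same, so recognizers lean on the surrounding words. As a toy sketch of the idea (the word pairs and scores below are made-up illustrations, not any real engine's language model), a recognizer might pick between acoustically identical candidates by scoring each against the previous word:

```python
# Toy illustration of homophone disambiguation via context.
# All bigram scores are invented for this example; real speech
# engines use far larger statistical language models.
BIGRAM_SCORES = {
    ("over", "there"): 0.60,
    ("over", "their"): 0.10,
    ("over", "they're"): 0.05,
    ("lost", "their"): 0.70,
    ("lost", "there"): 0.05,
    ("lost", "they're"): 0.05,
}

def pick_homophone(prev_word, candidates):
    """Return the candidate that scores highest after prev_word."""
    return max(candidates, key=lambda w: BIGRAM_SCORES.get((prev_word, w), 0.0))

print(pick_homophone("over", ["their", "there", "they're"]))  # there
print(pick_homophone("lost", ["their", "there", "they're"]))  # their
```

The same acoustic input ("their"/"there"/"they're") resolves differently depending on context, which is exactly why dictation software improves as it learns more about how you phrase things.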