Since Apple developed Siri, great strides have been made in the science of voice recognition. Will we soon be throwing away our mice and keyboards and simply talking to our computers? Or will the problems I have with Alexa continue to haunt voice recognition?
My wife and I are like all married couples at breakfast. We do not speak to each other. She sits at one side of the dining room table, eating porridge and reading the news. I sit at the other side consuming toast and sport. The silence is only broken by the occasional click of a mouse and the dog’s plaintive wailing for my crusts.
But all that could be about to change. Technology is marching on and, increasingly, voice commands are being used to control our computers and mobile devices. Could the peace and quiet of our breakfast be about to be shattered?
The History – AUDREY and the Shoebox
It is tempting to think that voice recognition and text-to-speech are relatively new developments. In fact, the story of voice recognition dates back to the middle of the last century, when Bell Laboratories came up with AUDREY, vacuum-tube circuitry housed in a six-foot-high relay rack. Despite its bulk, AUDREY could understand spoken numbers, recognising them with 97% accuracy.
Ten years on, IBM unveiled the Shoebox machine at the 1962 World's Fair. Shoebox could understand no fewer than 16 words, including commands such as 'add' and 'total' – effectively making it the world's first calculator powered by voice recognition.
By the middle of the 70s – with some of the initial research funded by the US Department of Defense – the HARPY voice recognition system could recognise around 1,000 words. But the real goal was to move from recognition to prediction, and for machines to develop more normal patterns of speech.
By 1990 we had seen the release of the first consumer-grade voice recognition product: Dragon Dictate, originally priced at $9,000 (around $17,000/£12,000 in today's terms). By 1997 Dragon NaturallySpeaking could understand natural speech at 100 words per minute.
But then developments stalled: it was not until 2010 that voice recognition began to take off again – and this time in a big way.
Google, Siri and friends
Apple first introduced the iPhone in 2007 and, in 2008, Google launched its Voice Search app, allowing voice recognition to move forward rapidly once more. Smartphones were in many ways the ideal vehicle for voice recognition: talking to your phone seemed much more natural than talking to a computer, and there was an obvious incentive to develop hands-free technology.
Google’s approach was perfected by Apple, which introduced Siri to the world in 2011: an AI-driven personal assistant that relies on cloud computing to predict what you are saying. Siri was, arguably, the first piece of voice recognition software to attempt a ‘personality’ – apart from HAL, of course – a development we now see in devices like Amazon’s Alexa, of whom more later…
The future: what’s coming next?
Voice control and recognition are now becoming central to our everyday lives. Personal assistants such as Alexa and Cortana should soon be able to integrate a sales pitch into a natural-sounding conversation: ask Siri where to get a good pizza and, rather than seeing an ad served via a platform like AdWords, you will hear the special offer from your local pizza shop woven into the conversation.
Voice recognition powered by AI can now be used to transcribe phone calls – so no more arguing about who promised to buy dog food on the way home – and even to predict the outcome of a conversation, based on the tone of voice and the words being used. Marriage guidance? The computer will just listen to you speaking to each other for five minutes…
In theory, we will also use voice recognition to control our alarm systems, lighting systems, heating and our kitchen appliances. It will also play a major part in the workplace and has clear applications in fields such as medicine. Advocates of the technology also see a time when we will drive our cars using voice recognition, allowing drivers and passengers to be completely hands-free. Given some of the things I say when I’m driving I will stick with the clutch and the accelerator for now…
Typing or Talking: which will win?
Voice recognition is going to have major implications and change the lives of millions of people: blind and partially sighted people are an obvious example. Commercially it has applications too: if you are an illiterate farmer in the third world, a written weather forecast is of no use to you – a spoken one could make the difference between a successful harvest and your crops being ruined.
The spoken web could also help the one-in-five adults in Europe and the US with poor reading skills – but how close are we to having a real, everyday conversation with a device like Alexa or Cortana?
“To understand that pizza is served at an Italian restaurant is easy,” says Nils Lenke, head of research at Nuance. “But to have a conversation with you on every single level, that’s still far out. AI just isn’t smart enough yet.”
Transcribing your voice into text – automatic speech recognition – is a tough problem to solve. I do know people who use Dragon Dictate very successfully; I know an equal number who have put it on the shelf, breathed a sigh of relief and gone back to their keyboard.
For most consumers, voice-to-text and text-to-voice are probably at a crossroads. I dictate texts into my phone and professionally I proofread articles I’ve written using TTS Reader (which is free and which I unreservedly recommend).
But Alexa and Siri? They bore me to tears. We bought our daughter an Amazon Echo Dot when she moved into her flat: it’s now in the drawer. Ours sits proudly in the kitchen and receives one instruction a day: “Alexa, timer, 20 minutes.”
The next big leap forward will come when voice recognition develops a personality. Yes, I suspect that one day I will cave in and start dictating these articles to my computer, but do I really want to go home and have a conversation with Alexa? My wife has her faults, but I quite like discussing the day’s events with her. Once we start speaking to each other…