Today, whether you own an Android phone or an iPhone, you're often just telling the device what you want. Siri is sassy and will tickle your funny bone if you're bored, while saying 'Okay Google' unleashes the predictive magic of Google Now. Microsoft is preparing to give us Cortana. We've come to take it for granted that our phones will come with a little personality, but the roadmap for the future goes a lot further than a few canned jokes.
We caught up with Sunny Rao, the MD of Nuance Communications India and South East Asia, and chatted about the developments in speech recognition, frustrations with using speech-to-text software and how the way we interact with our devices is about to change forever.
Rao speaks like a person who has been talking to machines for a long time - his speech is clear, and there's a small space around each word for maximum clarity. Over tea, we're able to discuss how voice recognition is being used around the world, and how he sees the future of the technology shaping up. And naturally, we talked about the movie Her. Edited excerpts follow.
NDTV Gadgets: We see more and more devices like phones and wearables using voice recognition but sometimes it's really inaccurate and at other times it can be amazing. Why is that?
Sunny Rao: There are two streams in the technology - one is to make it highly speaker-dependent, and the other is to make it as speaker-independent as possible. If I kept the speech recognition only on your device, it would be a more speaker-dependent technology. The tradeoff for doing that is that for the first maybe 3-4 times you use it, it won't be very reliable. It's imperative that you use it beyond those 3-4 times, and that's where the chasm is, where people may stop.
A speaker-independent system gets you off the ground reliably from the start, because we have this mass of data, and that is why we're going to move to a more speaker-independent model. [This is particularly important for devices that multiple people will use.] Devices like tablets, for example, are typically family devices; you want to have multiple people using them. So we're embedding our voice biometric technology on these devices, so that when you say 'send email', it can tell who you are based on your speech pattern and voice, and it brings up your profile and your email.
How does the biometric technology help?
We're coming to an age where tablets will have multiple user profiles, so your emails should not be accessible to somebody else in the family. I have kids, and there have been times when they've sent emails from my corporate account... So we combine voice biometrics with voice recognition technology.
It's the same with TV. With smart TVs, we've done it with LG and Samsung, and almost all new TVs now have some amount of speech recognition, and we'll see more and more voice biometric technology as well. There are two reasons: one is that you want to load your profile, not someone else's in the family, and the other is that you don't need to go through passwords or anything. The way you interact with the device is also your password; we call it one-shot recognition.
How accurate is the voice recognition though?
Over the years the accuracy has become phenomenally high. It's a great benchmark that doctors are using it, since medicine is a critical environment, and legal is another critical environment where people are using it. We're at the stage where Dragon Medical and Dragon Legal give you 99% accuracy.
Today you have high courts in India which use our technology to dictate all the judgements; all judicial officers under the Karnataka High Court use Dragon Legal. In consumer versions [the vocabulary is less predictable, so] on day one, the quantum of data that you have is very small. That is opening up now too. If you look at many of the virtual assistants that are available on phones, they have improved in the last 12 months, and that's a result of more and more people using them. The critical mass has been hit, so now you're going to see very accurate solutions coming out.
How do you see this technology evolving in the future?
You're going to reach a point where you wake up and you talk to your TV and ask for a traffic report. I'll ask my TV, "How's the traffic today on the way to work?" It'll check the traffic and show me the best route, and that's going to go to my phone, and then also to my car, maintaining the continuity of the transaction, both within one device and across devices as well. We want to cover all four screens that are available and have a transaction that crosses all these devices. And your voice is the key. I walk into the car and it says, "good morning", and I reply with "good morning", and it knows that I'm getting in and adjusts the steering column and chair for me, while if my wife gets in, it can adjust the seat according to her preferences.
Dictation is only one component of it though. The other component is to make what you've said more intelligently understood. This is done with Natural Language Processing, which means talking to your device in a human-like fashion. It should be context aware, and able to do semantic analysis.
Sort of like the movie Her... Do you think - I mean, aside from the AI stuff - that they did a good job with showing how we'll be using our computers in the future?
I think more and more people will use speech from a productivity perspective. Her really looked at it in terms of using your computer, but I think what you're going to find is that you'll have very discrete and disparate devices in your home, all talking to each other, and that is how you'll really communicate. So you'll have microphones in your roof, your refrigerator, your microwave oven, all of those beginning to talk to each other. You'll be able to walk into the kitchen and start saying things to the toaster! Hopefully your wife doesn't think that you're mad, but you know, get into it, and communicate. I think that Her was a great representation of the possibilities of what it can do. I don't think that you'd sit in front of a computer to do all that; you'll be mobile while you're doing it.