Today, whether you own an Android phone or iPhone, you're often just
telling the device what you want. Siri is sassy and will tickle your
funny bone if you are bored, while saying 'Okay Google' unleashes the
predictive magic of Google Now. Microsoft is preparing to give us
Cortana. We've come to take it for granted
that our phones will come with a little personality, but the roadmap for
the future goes a lot further than a few canned jokes. We caught
up with Sunny Rao, the MD of Nuance Communications India and South East
Asia, and chatted about the developments in speech recognition,
frustrations with using speech-to-text software and how the way we
interact with our devices is about to change forever. Rao speaks
like a person who has been talking to machines for a long time - his
speech is clear, and there's a small space around each word for maximum
clarity. Over tea, we're able to discuss how voice recognition is being
used around the world, and how he sees the future of the technology
shaping up. And naturally, we talked about the movie Her. Edited excerpts:
NDTV Gadgets: We see more and more devices like
phones and wearables using voice recognition but sometimes it's really
inaccurate and at other times it can be amazing. Why is that?
Rao: There are two streams in the technology - one is to make it highly
speaker dependent, and the other is to make it as speaker independent
as possible. If I kept the speech recognition only on your device, it
would be a more speaker dependent technology. The tradeoff for doing
that is that for the first 3-4 times you use it, it won't be
very reliable. It's imperative that you use it beyond those 3-4 times, and
that's where the chasm is, where people may stop.
Speaker independent gets you off the ground from the word go quite reliably, because we
have this mass amount of data, which is why we're going to move to a more
speaker independent model. [This is particularly important for devices
that multiple people will use.] Devices like tablets, for example, are
typically family devices; you want to have multiple people using them. So
we're embedding our voice biometric technology on these devices, so when
you say 'send email', it brings up your profile and your email - it can
tell who you are based on your speech pattern and voice.
How does the biometric technology help?
We're coming to an age where tablets will have multiple user profiles, so
your emails should not be accessed by somebody else in the family. I
have kids, and there have been times when they've sent mails from my
corporate account... So we combine voice biometrics with voice
recognition. It's the same with TV - with smart TVs,
we've done it with LG and Samsung, and almost all new TVs now have some
amount of speech recognition, and we'll get more and more voice biometric
technology as well. There are two reasons: one is that you want to load
your profile, not someone else's in the family, and also you then don't
need to go through passwords or anything. The way you interact with
the device is also your password; we call it one-shot recognition.
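The one-shot idea Rao describes - your voice selecting your profile with no separate password step - can be sketched as a toy speaker-identification routine. Everything below (the embedding vectors, the similarity threshold, the profile names) is hypothetical and not Nuance's actual method; real voice biometric systems extract far richer acoustic features, but the matching step is similar in spirit.

```python
import math

# Hypothetical illustration: each family member has an enrolled "voiceprint"
# (here, a toy fixed-length embedding vector). An incoming utterance is
# matched to the closest enrolled profile by cosine similarity.
PROFILES = {
    "dad": [0.9, 0.1, 0.3],
    "mom": [0.2, 0.8, 0.5],
    "kid": [0.4, 0.4, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def identify(utterance_embedding, threshold=0.85):
    """Return the best-matching profile, or None if nobody is close enough."""
    best, score = max(
        ((name, cosine(utterance_embedding, vp)) for name, vp in PROFILES.items()),
        key=lambda item: item[1],
    )
    return best if score >= threshold else None

# A voice close to "dad"'s enrolled print selects his profile in one shot.
print(identify([0.88, 0.15, 0.28]))  # → dad
```

The threshold is what keeps this from being pure convenience: a voice that matches nobody well enough returns no profile at all, which is the "your interaction is also your password" property.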
How accurate is the voice recognition though?
Over the years the accuracy has gone phenomenally high. It's a great
benchmark that doctors are using it - medicine is a critical environment, and legal
is another critical environment where people are using it. We're at the
stage where Dragon Medical and Dragon Legal give you 99% accuracy.
You have high courts in India which use our technology to dictate all
the judgements; all judicial officers under the Karnataka High Court
use Dragon Legal. In consumer versions [the vocabulary is less
predictable, so] on day one the quantum of data that you have is very
limited. That is opening up now too: if you look at many of the virtual
assistants that are available on phones, they have improved in the last
12 months, and that's a result of more and more people using them.
Critical mass has been hit, so now you're going to see very accurate
solutions coming out.
How do you see this technology evolving in the future?
We're going to reach a point where you wake up and you talk to your TV and
ask for a traffic report. I'll ask my TV, 'How's the traffic today on the
way to work?' It'll check the traffic and show me the best route, and
that's going to go to my phone, and then also to my car, maintaining the
continuity of the transaction - both within that device and across
devices. We want to cover all four screens that are available
and have a transaction that crosses all these devices. And your voice is
the key. I walk into the car and it says 'Good morning', and I reply
with 'Good morning' - it knows that I'm getting in, and adjusts the
steering column and chair for me, while if my wife gets in, it can
adjust the seat according to her preferences.
Dictation is only
one component of it though. The other component is making what you've
said more intelligently understood. This is done with Natural Language
Processing, which means talking to your device in a human-like fashion.
It should be context aware, and able to do semantic analysis.
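A minimal sketch of what "context aware" could mean in practice, using Rao's own traffic example. This is a toy rule-based parser, nothing like Nuance's actual NLP stack: it maps an utterance to a structured intent plus slots, and lets a follow-up turn inherit the intent from the previous turn. The intent names and patterns are invented for illustration.

```python
import re

def parse(utterance, context=None):
    """Turn an utterance into {"intent": ..., "destination": ...},
    reusing the previous turn's intent for elliptical follow-ups."""
    context = context or {}
    # Full request: "how's the traffic ... on the way to work"
    m = re.search(r"traffic .*?(?:to|on the way to) (\w+)", utterance)
    if m:
        return {"intent": "traffic_report", "destination": m.group(1)}
    # Follow-up like "and on the way home": no intent word of its own,
    # so inherit the intent from the previous turn - the context-aware part.
    m = re.search(r"(?:and )?(?:on the way|what about) (\w+)", utterance)
    if m and context.get("intent") == "traffic_report":
        return {"intent": "traffic_report", "destination": m.group(1)}
    return {"intent": "unknown"}

first = parse("how's the traffic today on the way to work")
follow_up = parse("and on the way home", context=first)
```

The second call only makes sense because the first turn's result is carried along as context - the same continuity Rao wants to maintain not just within one device, but as the conversation hops from TV to phone to car.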
Kind of like the movie Her... Do you think - I mean, aside from the AI stuff
- that they did a good job with showing how we'll be using our
computers in the future?
I think more and more people will use
speech from a productivity perspective. Her really looked at it in terms
of using your computer, but I think what you're going to find is that
you'll have very discrete and disparate devices in your home, all
talking to each other, and that is how you'll really communicate - so
the ability to have microphones in your roof, your refrigerator, your
microwave oven, all of those beginning to talk to each other. You'll
be able to walk into the kitchen and start saying things to the
toaster! Hopefully your wife doesn't think that you're mad, but you
know, get into it, and communicate. I think that Her was a great
representation of the possibilities of what it can do; I don't think
that you'd sit in front of a computer to do all that - you'll be mobile
while you're doing it.