Context-aware and emotionally intelligent: The future of voice assistants

VoiceBox's Sydney lab is working on voice tech that can hold a conversation and know how you're feeling

Hello Dave.

“Look Dave, I can see you’re really upset about this.”

It is one of the creepiest lines in 2001: A Space Odyssey, uttered by the murderous artificial intelligence system HAL 9000.

The supercomputer of the film’s doomed spaceship Discovery One, voiced by Douglas Rain, broke the sci-fi trope that fictional robots and virtual assistants should be slaves to cold logic and incapable of perceiving human emotion.

The real-life voice assistants of today, however, have so far failed to shake this expectation. Snarl at Siri or worriedly weep to Cortana or Google Now and the response given will be no different to that given to a typed question.

Tactless tech

“Conventional systems focus on artificial intelligence. But there’s other types of intelligence – emotional intelligence,” says Dr. Mark Johnson of VoiceBox Technologies. “We’re trying to build systems that will be much more context aware, have context sensitivity. Existing technologies really don’t do that.”

The Macquarie University professor was last month appointed as VoiceBox’s chief scientific officer and head of its Advanced Technology Division, which is opening in Sydney.

An awareness of context and the ability to sustain conversations, not simply respond to questions, are essential if voice assistants (referred to internally by VoiceBox as ‘agents’) are to become more useful and ubiquitous, the company says. A little tact wouldn’t go amiss either.

“Systems that are available in your home and car will be confronted with a lot of individual and family emotions and they have to react in a smart way,” says Phil Cohen, VoiceBox’s chief scientist, based in Seattle.

“If it always sounds perky even if someone has passed away, it’s going to sound like a very odd thing.”

The company, which partners with car manufacturers, telcos and consumer technology giants, is working on ways its agents can read the tone and timbre of user voices and respond accordingly. It’s also exploring how recording visual cues like gesticulations, head posture and direction of gaze can improve contextual understanding and the appropriateness of a response. It’s not as simple as mirroring what is seen and heard.

“People imitate the behaviour and attitude and stance of their conversational partners but if a system is confronted with very emotional and angry language, you don’t want it to be snarky in response,” says Cohen.

“You may want a calm and reassuring response which might then calm and reassure the person it’s engaged with.”

That could be a particularly useful feature in vehicles – a common placement for the company’s technology. VoiceBox counts Renault, Fiat Chrysler and Toyota among its partners.

“On a long trip you may have additional goals,” says Johnson. “Tracking the context to a much greater degree than is currently done allows you to do them. You might imagine chatbots to help try and keep the driver engaged. If you detect they’re not as alert as they should be you can respond appropriately.”

‘I don’t understand’

While existing voice assistants are good at executing basic commands, they lack the ability to get a handle on the higher purpose of a conversation, says Johnson.

“They’re really what is called single intent. You can ask it to do one thing: navigate me to here. What’s the weather? But you can’t do anything that’s more complicated. You can’t do things that involve a sustained series of queries over a longer period of time.

“It’s not intelligent. If something goes wrong, it can’t propose a plan to achieve your higher level goal, because it’s not trying to detect that.”

VoiceBox believes it has started to crack this limitation, and will add up to 10 of the “best people in the field” to its Sydney lab this year to further improve its offering.

The team will use machine learning and deep neural networks to learn complex patterns and determine responses, and will draw on the multilingual make-up of Sydney’s residents as a source of data.

Johnson and Cohen believe that in less than a decade, humans will be interacting with a wide range of internet-of-things-enabled devices via voice conversation.

“There is a threshold. Once you start doing something and it provides enough utility…you won’t ever go back,” says Cohen.

“Most of these devices are not going to have screens or keyboard,” says Johnson. “You’re going to want to interact with speech. It can radically change the environment we live in.

“I think we’re really at the edge of a revolution in the way computers interact with humans.”