We spoke to Larry Heck, the former CEO of Viv Labs and SVP/Head of Bixby North America at Samsung, and now the Rhesa S. Farmer, Jr. Advanced Computing Concepts Chair, Georgia Research Alliance Eminent Scholar, and Professor of Electrical and Computer Engineering at the Georgia Institute of Technology. Read his thoughts on everything from virtual assistants and natural language processing to the evolution of speech recognition and how to build feedback loops into your conversation intelligence products.
At Symbl.ai, our mission is to help developers easily build and deploy conversation intelligence with ready-to-go toolkits. Part of that mission involves sharing our knowledge so developers can truly understand the magic of conversation-centric AI and all its exciting applications. So, we’re going beyond the tech stack and exploring the exceptional minds at the forefront of conversation intelligence to bring you fresh perspectives and a peek inside the industry.
To start, we spoke with one of the most accomplished contributors in speech processing, contextual AI, and deep neural networks: Larry Heck.
For a bit more background, Larry Heck was recently the CEO of Viv Labs and SVP & Head of Bixby North America at Samsung. He also led a dialogue research effort for Google Assistant, founded Microsoft’s Cortana, held research and executive roles at Yahoo! and Nuance Communications, and has served on the Symbl.ai Technology Advisory Board for two years, helping our team scale conversation intelligence and unlock its business potential.
As a pioneer in speech technology and with decades of industry experience to share, we tapped into Larry Heck’s thoughts on all things conversation AI — from the evolution of context and speech recognition, to the art of building conversation intelligence products. Let’s dig in.
What are your thoughts on human to human and human to machine conversations?
The human to human conversation is something I’ve been thinking about for a long time. Back in my early days at Microsoft, the team I assembled was looking at the problem of learning conversations in a different way: more like how a child learns to have a conversation by sitting at the dinner table with their parents.
The child primarily learns through observation, and they can choose when to join the conversation. That ability to opt in when you have something relevant to say is important from a technology perspective.
Right now, personal assistants don’t have the option to observe how people interact or decide when to join in. All that understanding happens offline. So the problem of understanding meetings opens up entirely new opportunities to teach technology how to gradually understand those conversations over time.
It’s just a wonderful problem to me. I think human to human conversation, particularly in meetings, is going to have a big impact, not only on solving industry problems today, but also on conversational learning on the human-machine side.
Why build human to human understanding with AI in the first place?
There are a number of applications where you’d want to bring technology into the conversation — to augment it, to help it, to assist it. But you don’t want that technology in the critical path or getting in the way.
Let’s say you’re talking to a friend about where you want to eat. The assistant is invited to listen, but it doesn’t get in the way. It’s just there and available. And when I say to my friend, “I wonder if there are any good Chinese restaurants near Los Altos,” the assistant can contribute to that human-to-human conversation by proactively pulling up a map and showing Chinese restaurants near Los Altos. It could even suggest a place that’s highly rated.
So, in this case, my friend and I benefit from the machine being a quiet participant in the meeting until it has something to add. I think it’s super compelling: human to human conversations with augmentation.
“Augmentation technology is enriching the communication channel between humans; it’s like a whole new way of thinking about how people talk with each other.”
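To make that “quiet participant” idea concrete, here is a minimal sketch in Python of an assistant that observes a human-to-human conversation and only surfaces a suggestion when it spots something it can act on. The keyword-based intent check and the function names are hypothetical stand-ins for a real NLU model and UI layer, not how any particular assistant is built:

```python
import re
from typing import Optional

# Hypothetical keyword-based stand-in for a real intent model.
RESTAURANT_PATTERN = re.compile(
    r"\b(any good|wonder if there are any)\b.*\brestaurants?\b.*\bnear\s+(?P<place>[\w\s]+)",
    re.IGNORECASE,
)

def detect_actionable_intent(utterance: str) -> Optional[dict]:
    """Return a suggestion payload only when the assistant has something to add."""
    match = RESTAURANT_PATTERN.search(utterance)
    if match:
        return {"intent": "find_restaurants", "location": match.group("place").strip()}
    return None  # Nothing relevant to contribute: stay quiet.

def observe(utterance: str) -> None:
    """Listen to the human-to-human conversation without interrupting it."""
    suggestion = detect_actionable_intent(utterance)
    if suggestion is not None:
        # Surface the result passively (e.g., show a map) rather than speaking up.
        print(f"[assistant] showing {suggestion['intent']} results near {suggestion['location']}")

# The assistant stays silent on the first utterance and acts on the second.
observe("Did you finish the slides for tomorrow?")
observe("I wonder if there are any good Chinese restaurants near Los Altos")
```

The key design choice is that the default path is silence: unless the observer finds something actionable, it contributes nothing to the conversation.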
What do you think about passive vs. active conversation intelligence systems?
I think, over time, systems can be a bit more interruptive, because they’ll know when it’s appropriate to interrupt the conversation or provide relevant information. But until the technology is ready, conversation systems should be more passive.
In your opinion, why hasn’t that evolution in speech technology happened yet?
I think part of it comes down to technical challenges and the other part is business focus. In the early days of Cortana, we weren’t really thinking of going out mobile-first. But then Siri was acquired by Apple and launched on the iPhone, so there was a lot of business pressure to respond to that.
I believe a similar scenario happened at Google when Alexa launched. That’s kind of why the technology has gone this way, rather than on the human to human side and augmentation of meetings. But there are definitely some technical challenges that have been in the way. One of those is open microphone.
Open microphone is a close cousin of open conversation, where you have a lot of disfluencies, partial sentences, and all kinds of complexity. That complexity makes not only natural language understanding difficult, but speech recognition as well.
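As a toy illustration of why open conversation is hard, here is a small Python sketch that strips a few obvious fillers and repeated words from a transcript before it reaches the understanding layer. Real systems model disfluencies statistically rather than with a fixed list, so treat this purely as a sketch of the problem:

```python
import re

# A handful of common English fillers; an assumption for illustration only.
FILLER_RE = re.compile(r"\b(?:um+|uh+|er+|you know|i mean)\b,?\s*", re.IGNORECASE)
REPEAT_RE = re.compile(r"\b(\w+)(?: \1\b)+", re.IGNORECASE)

def strip_simple_disfluencies(transcript: str) -> str:
    """Remove obvious fillers and collapse stutters such as 'we we'."""
    cleaned = FILLER_RE.sub("", transcript)
    cleaned = REPEAT_RE.sub(r"\1", cleaned)
    return re.sub(r"\s+", " ", cleaned).strip()

print(strip_simple_disfluencies("so um I was uh thinking we we could push the the launch"))
# -> "so I was thinking we could push the launch"
```

Fillers and stutters are the easy part; partial sentences, restarts, and overlapping speakers are what make open microphone genuinely hard for both recognition and understanding.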
What are your thoughts on advancing speech recognition?
When I was working on it in the ‘80s, there was a debate on how to advance speech recognition technology. One side of that debate said, well, we can barely recognize digits properly so we should just focus on digits. The other side said we should work on the broader problem and eventually that will help digit recognition.
Happily, the latter won. And I think it was a lesson learned early on in speech recognition, which continues to this day. As you go up the tech stack and go broader with all the different kinds of context, it can all be brought back to improve speech recognition.
Speaking of context, how do you view its progression in AI?
I think things are really starting to accelerate in terms of the technology stack, primarily because we’re getting better at recognizing which forms of context are important and how to leverage them without requiring a lot of manual supervision and data labeling.
So I think we’re going to continue to expand, especially in terms of climbing up the technology stack, away from low-level word transcriptions toward understanding conversations and predicting where the conversation will (and should) go next through inference and reasoning. Much of this will be enabled by injecting knowledge into the technology stack — maybe the business knowledge behind what two people are talking about in a meeting.
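As a rough sketch of what injecting knowledge into the stack can look like, here is a Python example that re-ranks a recognizer’s n-best hypotheses by boosting those containing terms from a hypothetical business glossary. The glossary, the boost weight, and the scores are all assumptions for illustration; production systems typically learn this kind of contextual biasing rather than hand-tuning it:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str
    acoustic_score: float  # higher is better, e.g., a log-probability from the recognizer

# Assumed domain terms the meeting participants actually use.
BUSINESS_GLOSSARY = {"bixby", "viv", "cortana"}
CONTEXT_BOOST = 2.0  # reward per matched glossary term (hand-picked for illustration)

def rerank(hypotheses: list[Hypothesis]) -> Hypothesis:
    """Pick the hypothesis that best balances acoustics with business context."""
    def score(h: Hypothesis) -> float:
        tokens = set(h.text.lower().split())
        return h.acoustic_score + CONTEXT_BOOST * len(tokens & BUSINESS_GLOSSARY)
    return max(hypotheses, key=score)

best = rerank([
    Hypothesis("let's demo big b tomorrow", -4.1),
    Hypothesis("let's demo bixby tomorrow", -5.0),
])
print(best.text)  # the in-domain reading wins despite a lower acoustic score
```

The same pattern generalizes upward: the higher you climb in the stack, the more kinds of context (participants, documents, prior meetings) you can fold into that score.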
How should a business go about building robust conversation intelligence systems?
In a business setting, you have to be able to effectively communicate that this type of technology can provide value from day one — but it’s always evolving. If you can provide value early on, even when the technology is perhaps not as polished as it needs to be, you can fine-tune it and provide more and more value over time. That’s a win.
When I led the R&D team at Nuance, we learned the importance of that feedback loop pretty quickly. The core technology team needed to be connected directly to the customers and work on building that feedback loop first.
“Anybody that wants to get into this space of deploying conversation technology has to recognize that the problem is never solved. It’s always evolving.”
How should businesses think about getting that feedback?
It’s important to note that we can’t exclusively rely on feedback to drive the innovation of the product. The product has to be good enough out of the box that the customer gets some value from it, and they feel like it’s worth investing time to provide that feedback.
So spend time on actually getting them motivated to use the product, then make sure the feedback loop goes really fast. You have to show them incremental value from the feedback they provided about your technology. Otherwise, again, they lose motivation.
In my experience, you can do a lot with a small number of customers. Even in the single digits. You can learn what’s really important and then scale your product from there. It creates a network effect where your core technology gets better from all the different participants, and that network effect is magical when it starts to happen.
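For what that fast feedback loop might look like in practice, here is a minimal Python sketch, assuming a hypothetical setup where customers can accept or correct what your system produces. Corrections are queued as training examples, and a per-customer acceptance rate gives you a simple number for showing the incremental value their feedback is creating:

```python
from collections import defaultdict
from typing import Optional

corrections = []             # (customer, predicted, corrected) examples for the next model update
accepted = defaultdict(int)  # outputs each customer kept as-is
fixed = defaultdict(int)     # outputs each customer had to correct

def record_feedback(customer: str, predicted: str, corrected_text: Optional[str]) -> None:
    """Log whether a customer accepted an output or corrected it."""
    if corrected_text is None or corrected_text == predicted:
        accepted[customer] += 1
    else:
        fixed[customer] += 1
        corrections.append((customer, predicted, corrected_text))

def acceptance_rate(customer: str) -> float:
    """The number you report back to show incremental value over time."""
    total = accepted[customer] + fixed[customer]
    return accepted[customer] / total if total else 0.0

record_feedback("acme", "schedule follow up friday", "schedule follow-up Friday")
record_feedback("acme", "send over the pricing deck", None)
print(f"acme acceptance rate: {acceptance_rate('acme'):.0%}")
```

Even with a handful of customers, watching that rate climb after each model update is the “incremental value” signal that keeps them motivated to keep providing feedback.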
Lastly, what are you most excited about over the next two to three years?
When I was in grad school, my goal was to be able to have a natural conversation with my computer every day when I came into the office. After 29 years, I’m going back to Georgia Tech as a professor and shifting gears into long-term research. So, personally, I’m excited about actually realizing that dream from grad school.
I really think it’s attainable to have a conversation with my computer, and not only have it recognize what I’m saying and understand what I mean, but have the ability to make higher level inferences and reasoning. And also have knowledge about me and about the world.
I want to see that kind of evolution in this technology.
Learn more about Larry Heck
To learn more about Larry Heck and his impressive work, check out these sources:
Larry Heck Talks Bixby and 30 Years in Voice
Bixby Developer Day 2018 Korea: The Future of Personal Assistants
Making Personal Assistants Smarter with Samsung and Bixby
Larry Heck’s Google Scholar page