Human-to-human (H2H) conversations are more complex than the conversations we have with voice assistants and other machines. Machine learning and AI applications need high-quality audio and more training before they can pick up on all the cues the human brain relies on to interpret a conversation in context.

More than four billion voice assistants are estimated to be in use worldwide today. Many people have adapted by speaking clearly and making their intent easy to pick up on for even the slowest of artificial intelligences, at least when they’re giving commands to voice assistants or smart televisions.

However, when we speak to other humans, there is still a long list of pitfalls. Whether on stage, one-on-one, or in group meetings, we communicate in ways that are hard to capture and convert into text.

Human-to-human conversations are valuable in many situations, but only if they’re captured properly. Otherwise, a great deal of information is lost. To help you get it right, here are the biggest challenges of capturing H2H conversations.

The problem with integrating audio streams

The first challenge in capturing human-to-human conversations is usually the integration with a voice or speech API. For example, if you want to help a colleague with a hearing impairment participate more fully in meetings, your code needs to secure a high-quality audio stream before it can offer live captioning.

Low sampling rates, background noise, and large chunk sizes can all lower transcription accuracy, which can confuse the meeting attendees who rely heavily on live captioning. Live captioning has been performed by humans for decades and has only recently started to see real competition from AI.
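To make the link between sampling rate and chunk size concrete, here is a minimal Python sketch that works out how much audio a single chunk holds. It assumes raw, uncompressed 16-bit mono PCM; the function name and the example figures are illustrative assumptions, not the requirements of any particular speech API.

```python
def chunk_duration_ms(chunk_bytes: int, sample_rate_hz: int,
                      bytes_per_sample: int = 2, channels: int = 1) -> float:
    """Return how many milliseconds of audio a raw PCM chunk contains."""
    if min(chunk_bytes, sample_rate_hz, bytes_per_sample, channels) <= 0:
        raise ValueError("all arguments must be positive")
    # Bytes produced per second of audio for this stream format.
    bytes_per_second = sample_rate_hz * bytes_per_sample * channels
    return 1000.0 * chunk_bytes / bytes_per_second


# A 3,200-byte chunk of 16-bit mono audio sampled at 16 kHz.
print(chunk_duration_ms(3200, 16000))  # 100.0
# The same byte count at 8 kHz covers twice as long a stretch of speech.
print(chunk_duration_ms(3200, 8000))   # 200.0
```

A check like this can help you see how long each chunk keeps the speech API waiting for new audio before you start streaming.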

Even if your meeting is only recorded so that you can capture key points and divide tasks within your team afterward, poor sound quality and audio enhancements can make it hard to get a precise transcript.

Scripted vs. unscripted conversations

Plenty of human-to-human conversations are scripted in some form. When a telemarketer calls you, a lot of that conversation will be scripted ahead of time. With every response you give the telemarketer, an AI can pick up on your intent and assist the telemarketer with the best response to guide your conversation to the desired outcome.

In this scenario, the two humans probably don’t know each other. This makes it easier to follow the script and prepare for an ideal capture of the conversation, because there’s no common understanding established prior to the call.

But what if your colleague calls you to talk about that new solution you’re working on together? You already have a common area of expertise and other common references that can make it a challenge to understand the context of the conversation.

Perhaps you’re both excited about this task. While you might be quick to pick up on your colleague’s enthusiasm subconsciously, an AI needs to know more, such as what certain spikes and breaks in your colleague’s voice mean and what your colleague is referring to with “this”, “that”, or “them”. Questions can also be problematic for an AI if they don’t provide much specific or absolute information.

Human conversations are complex

If your colleague were to snap his fingers or tap on his desk to catch your attention, you would pick up on it and understand it. But an AI would interpret these important cues as background noise and ignore them. Even though our body language is a highly visual input, humans can also pick up on it acoustically. It’s a big challenge for AI to make inferences based on:

  • Intonation
  • Body language
  • Acronyms and abbreviations
  • Unintelligible verbal cues and interjections
  • Jargon and lingo within groups or industries
  • Callbacks to a common history between the speakers

Human participants in a conversation can usually understand the cues or quickly learn to. But if an artificial intelligence is to learn, it first needs to overcome some technical challenges. A foundational one is to segment the conversation correctly by its participants.

Hold it! Who speaks there?

Speaker diarization – the process of breaking the conversation up into segments containing just one speaker – is critical to ensuring conversation understanding.

One way of keeping the Diarization Error Rate (DER) low is to have a human listen in on the output and grade the diarization. There are two major issues with this approach. Firstly, humans would need to listen to the conversations at close to normal speed in order to make a manual diarization. Secondly, human biases have proven to be a huge source of errors in this task.
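For reference, DER is conventionally computed as the share of speech time that was attributed incorrectly: false-alarm speech, missed speech, and speech assigned to the wrong speaker, divided by the total speech time in the reference transcript. The Python sketch below illustrates that formula; the function name and the example durations are assumptions made for illustration.

```python
def diarization_error_rate(false_alarm_s: float, missed_s: float,
                           confusion_s: float, total_speech_s: float) -> float:
    """Return DER as a fraction of the reference speech time.

    false_alarm_s:  non-speech that was labeled as speech
    missed_s:       speech that was not detected at all
    confusion_s:    speech attributed to the wrong speaker
    total_speech_s: total speech time in the reference annotation
    """
    if total_speech_s <= 0:
        raise ValueError("total_speech_s must be positive")
    return (false_alarm_s + missed_s + confusion_s) / total_speech_s


# Hypothetical 10-minute meeting with 60 seconds of diarization errors.
der = diarization_error_rate(false_alarm_s=12.0, missed_s=18.0,
                             confusion_s=30.0, total_speech_s=600.0)
print(f"DER: {der:.1%}")  # DER: 10.0%
```

Because false alarms count against the system, DER can exceed 100% when a lot of non-speech is labeled as speech.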

A better solution would be to build an automatic diarization solution with enough speech data to construct speaker models, a background noise model, and a verification system with previous recordings of the speakers.

However, this demands prior consent and might not be feasible for the diarization of meeting recordings with several guests, or recordings of sensitive conversations, like those between doctors and patients. In this case, the AI could be trained on the known speakers and use the exclusion method to batch the other speakers’ parts of the conversation.
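One way to picture the exclusion method is the Python sketch below. It assumes each segment of the conversation has already been turned into a numeric voice embedding by some upstream model, and that reference embeddings exist for the known speakers. The function names, the two-dimensional vectors, and the 0.75 threshold are all hypothetical values chosen for illustration, not part of any specific product.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def label_by_exclusion(segment_embeddings: list[list[float]],
                       known_speakers: dict[str, list[float]],
                       threshold: float = 0.75,
                       other_label: str = "other") -> list[str]:
    """Label each segment with the best-matching known speaker.

    Segments that match no known speaker closely enough are batched
    together under other_label instead of being identified.
    """
    labels = []
    for embedding in segment_embeddings:
        best_name, best_score = other_label, threshold
        for name, reference in known_speakers.items():
            score = cosine_similarity(embedding, reference)
            if score >= best_score:
                best_name, best_score = name, score
        labels.append(best_name)
    return labels


# Hypothetical example: only the doctor has consented to enrollment.
known = {"doctor": [1.0, 0.0]}
segments = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.2]]
print(label_by_exclusion(segments, known))  # ['doctor', 'other', 'doctor']
```

In this sketch the patient never needs a stored voice profile: every segment that does not match the doctor is simply grouped under the fallback label.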

How machines can capture conversations to augment human capabilities

Now that we have touched on the biggest challenges of capturing H2H conversations, let’s look at some of the ways AI can capture our conversations to augment human capabilities and the work that we do.  

In our example with live captioning, we touched on the issue of securing sufficient quality in the captured audio. When this challenge is overcome, an AI will most likely be better than humans at live captioning. Some public speakers have a habit of talking very fast and using phrases in other languages, like “sic transit gloria mundi”, to drive home a point.

An AI can work faster and even translate speech in near real-time to let the audience know that the speaker wants to emphasize the transitory nature of life and earthly honors. Most humans would trip up in a situation like that.

The human brain is good at recognizing patterns, even ones that aren’t really there. AI is faster and shows less bias, which makes it efficient at:

  • Fact checking in real-time
  • Delivering correct medical diagnoses
  • Indexing recorded conversations or video calls
  • Learning new skills and games, and even attaining PhDs
  • Assisting on sales calls by suggesting actions and follow-ups

When we leverage the speed of AI we can become more efficient at most things without working harder. Doctors could save more lives, journalists could expose more inaccuracies, and none of us would ever need to wrestle with our scribbled meeting notes. Perhaps even political debates could be more nuanced.

AI could, for example, speed up classes on any topic and make them mutual learning opportunities for both teacher and students. Whenever there is doubt, or an interesting question that the teacher wants to read up on, the AI can either deliver the short answer in real time or name the best resources for all attendees to read before the next class.

If you want to learn more about the challenges of capturing H2H conversations and the benefits of overcoming them, check out these resources.

Additional reading

Team Symbl
