The Technical Shortcomings of LLMs in Analyzing Human-to-Human Interactions

Large Language Models (LLMs), with their vast capabilities, have revolutionized the field of Natural Language Processing (NLP). While impressive at analyzing text content, they face substantial challenges in deeply understanding spoken human language, particularly in real-life contexts such as calls, meetings, and interviews. This article delves into some of these limitations, contrasting them with the nuanced manner in which humans interpret conversations. Specifically, this article considers the challenges that arise from discounting multimodal data, and in particular conversational cues from audio and voice modalities. Such cues are key components of multi-party human conversations, and play a big role in expert human understanding of interactions. Future articles will consider other open challenges inherent in using LLMs for human conversations.

Human Interaction: Beyond Words

Human interactions, especially in professional settings like calls and meetings, are laden with complexities. The use of tone, emphasis, pauses, and body language can convey subtle intentions or emotions. Moreover, the understanding of these interactions is often influenced by culture, social norms, and personal experiences. The spontaneity and fluidity of spoken language pose significant challenges for machines.

Temporal Dependency

  • Overlapping Speech: In human conversations, especially during meetings or interviews, individuals might talk over each other. This overlap poses challenges for LLMs, as most models are trained on isolated speech segments and cannot accurately segregate simultaneous speakers.
  • Interruptions and Back-Channeling: Common verbal cues like ‘uh-huh’ or ‘right’ signify active listening in human interaction. LLMs typically lack the context to interpret these cues correctly, often treating them as noise or irrelevant information; the sketch after this list shows how both back-channels and overlapping turns disappear into a flat transcript.
  • Sequence Analysis Limitations: Most LLMs use sequential processing, which may ignore the temporal dynamics of speech. Human conversation isn’t always linear and can involve frequent shifts in topics or references to previous statements, something LLMs struggle to follow.
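To make the first two points concrete, here is a small illustrative sketch in Python; the speakers, timings, and utterances are invented. It shows how a diarized conversation with an overlapping turn and a back-channel collapses into a flat transcript, which is roughly the form in which a text-only model receives it, and how keeping the temporal structure at least makes the overlap detectable.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str   # who is talking
    start: float   # seconds from the start of the call
    end: float
    text: str

# Invented example: B back-channels, then overlaps A's turn.
segments = [
    Segment("A", 0.0, 3.2, "So the renewal is scheduled for next quarter"),
    Segment("B", 1.1, 1.4, "uh-huh"),                        # back-channel, not a turn
    Segment("B", 2.8, 4.0, "wait, which quarter exactly?"),  # overlaps A
]

# Flattening to plain text discards overlap, timing, and the back-channel's
# supportive function -- the information the bullets above describe.
print(" ".join(seg.text for seg in segments))

# Keeping timestamps and speakers makes the overlap explicit.
for i, a in enumerate(segments):
    for b in segments[i + 1:]:
        if a.speaker != b.speaker and a.start < b.end and b.start < a.end:
            print(f"overlap: {a.speaker}/{b.speaker} from {max(a.start, b.start)}s to {min(a.end, b.end)}s")
```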

Prosodic Features

  • Rhythm and Timing: The flow of speech, including the rate and duration of syllables, can convey emotions or emphasis. Existing LLMs mostly focus on textual information and lose these temporal characteristics.
  • Pitch Analysis: Pitch variations express different emotions or questions. LLMs often miss these nuances as pitch requires analyzing frequency variations, a feature not commonly incorporated into text-based models.
  • Volume and Emphasis: The loudness of speech can signify importance or emotion, yet LLMs typically lack the capability to interpret this auditory feature; the sketch after this list shows the kind of acoustic measurements involved.
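As a rough illustration of what a text-only pipeline discards, the minimal sketch below uses the open-source librosa library to compute two of the prosodic signals mentioned above: a pitch contour and frame-level energy. The file path is a placeholder, and this is only one of many ways such features could be extracted.

```python
import numpy as np
import librosa  # pip install librosa

# "call.wav" is a placeholder path to a short, single-speaker audio clip.
y, sr = librosa.load("call.wav", sr=16000)

# Pitch contour (fundamental frequency) via the pYIN tracker: rising pitch at
# the end of a phrase often marks a question; wide variation often marks arousal.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)

# Frame-level RMS energy as a rough proxy for loudness and emphasis.
rms = librosa.feature.rms(y=y)[0]

print(f"mean pitch: {np.nanmean(f0):.1f} Hz, pitch range: {np.nanmax(f0) - np.nanmin(f0):.1f} Hz")
print(f"mean energy: {rms.mean():.4f}, peak-to-mean energy ratio: {rms.max() / rms.mean():.2f}")
```

None of these measurements survive transcription, which is precisely why a model that only sees text cannot recover them.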

Contextual Understanding

  • Sarcasm and Metaphors: Understanding figurative language requires a deep comprehension of cultural and linguistic context, an area where LLMs often fall short.
  • Cultural References: References specific to certain cultures or communities can easily be missed by LLMs unless they are specifically trained on diverse and representative datasets.
  • Intention Recognition: Decoding the hidden intentions or sentiments behind words necessitates a complex interplay of linguistic and extralinguistic knowledge, an intricate task that remains a challenge for LLMs.

LLMs in Calls, Meetings, and Interviews

The potential of business conversations is still largely untapped, and AI-based approaches often fall short of human expectations in very noticeable ways. First, inaccurate transcriptions caused by poor handling of accents and dialects can lead to errors in analysis and understanding. LLMs are usually trained on standardized, majority-population accents, making them prone to mistakes when confronted with regional variations. They can also be confused by homophones, and background noise often exacerbates transcription inaccuracies.

Misinterpretation of emotions is another significant issue. In contexts such as sales calls or performance reviews, a proper understanding of the emotional state of participants is vital. However, LLMs frequently lack the subtlety to gauge emotions correctly, possibly leading to misguided conclusions that can affect business relationships or even an employee’s career progression.

Most importantly, there is a pronounced lack of contextual awareness. Truly understanding the essence of a conversation, especially in professional environments, requires knowledge of the participants, their objectives, and the overall background of the discussion. Presently, LLMs are generally incapable of integrating this crucial context, resulting in superficial or misguided interpretations. These shortcomings illustrate the complexity of human interaction and the significant work still needed before LLMs can adequately analyze and comprehend spoken language in professional settings.
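One partial mitigation for the last of these shortcomings is to supply the missing context explicitly rather than hoping the model infers it. The sketch below packs participant roles, the meeting objective, and known background into the prompt ahead of the transcript; the template and field names are illustrative assumptions, not a prescribed format, and this narrows rather than closes the gap.

```python
# Illustrative only: every name, role, and field below is invented.
context = {
    "participants": {"Dana": "account executive", "Lee": "prospect, VP of Operations"},
    "objective": "renewal negotiation for an annual contract",
    "background": ["pricing tier agreed on the last call", "security review still pending"],
}

transcript = "Dana: So, about the renewal...\nLee: Right, uh-huh. Which quarter exactly?"

prompt = (
    "You are analyzing a business call.\n"
    f"Participants and roles: {context['participants']}\n"
    f"Objective: {context['objective']}\n"
    f"Known background: {context['background']}\n\n"
    f"Transcript:\n{transcript}\n\n"
    "Summarize the call and list concrete follow-ups, taking the roles and "
    "objective above into account."
)
print(prompt)  # this string would then be sent to whichever LLM is in use
```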

Technical and Ethical Considerations

Closing the gap between current LLM capabilities and a nuanced understanding of human-to-human interactions in spoken language is both a technical and an ethical endeavor. Integrating emotion recognition algorithms into LLMs could significantly enhance their ability to perceive feelings and sentiments, mirroring human-like comprehension. However, such integration raises substantial privacy concerns and warrants careful handling of personal and sensitive information. In addition, adopting multimodal analysis, which combines speech with visual cues, may further enhance understanding.

This approach is promising but also complex, requiring significant advancements in audio, image and video processing technologies to be effective. Perhaps most crucially, any move towards achieving a human-like understanding must be undertaken with a firm commitment to ethical principles. This includes the establishment of rigorous guidelines and oversight mechanisms to prevent potential misuse, bias, or other unintended consequences that could arise from these more advanced and intricate systems. The balance between technical innovation and ethical responsibility remains a delicate and essential aspect of this ongoing pursuit.
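To make the emotion-recognition idea slightly more concrete, the sketch below attaches a per-utterance emotion label, as an acoustic classifier might produce, to the transcript text before it reaches an LLM. `classify_emotion`, the label, and the file path are hypothetical placeholders rather than a real model or API, and the privacy concerns discussed above apply in full to any pipeline of this shape.

```python
# Hypothetical fusion sketch: audio-derived emotion labels merged into the
# transcript so a text-only LLM can at least see them.
def classify_emotion(audio_path: str) -> str:
    """Stand-in for a real speech-emotion classifier; returns a canned label."""
    return "frustrated"  # placeholder output so the sketch runs end to end

utterances = [
    {"speaker": "Lee", "text": "We have been waiting on this fix for weeks.",
     "audio": "lee_0012.wav"},  # invented file path
]

annotated = [
    f"{u['speaker']} [{classify_emotion(u['audio'])}]: {u['text']}" for u in utterances
]
print("\n".join(annotated))
# -> Lee [frustrated]: We have been waiting on this fix for weeks.
```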

Future Perspectives: Towards a More Holistic Understanding

Addressing these shortcomings of LLMs in understanding spoken language will require collaboration among speech scientists, AI engineers, and behavioral experts, among others. Investment in domain-specific training and in developing models tailored to specific tasks, such as calls or interviews, could enhance performance.

The challenge of equipping LLMs with the ability to deeply analyze human-to-human interactions within calls, meetings, and interviews is both a technical and philosophical undertaking. While the field has seen remarkable advancements, the road to truly mimicking and ultimately achieving human-like understanding is still uncharted.

Nuanced understanding of spoken language, especially in professional settings, remains a frontier for AI. It calls for innovative approaches, grounded in both technical excellence and ethical considerations, to bridge the gap between machine capabilities and the complex world of human interaction.

Nebula: LLM for Human Conversations

To create a language model that takes some of the above shortcomings into account, the Symbl.ai team has built Nebula, a proprietary large language model trained to perform generative tasks on human conversations. Nebula excels at modeling the nuanced details of a conversation and using those details to generate high-quality, human-like responses tailored to the specific task at hand (e.g., summarization and follow-ups for meetings, or analysis for sales calls and interviews).

You can interact with Nebula using the Model Playground or the Model API. The Playground allows you to test the model with conversations and tasks without writing code, while the Model API enables you to interact with the model programmatically.
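For a sense of what programmatic access typically looks like, here is a generic HTTP sketch in Python. The endpoint URL, header name, environment variable, and payload fields are placeholders rather than Nebula's actual interface; the Model API documentation defines the real names and schema.

```python
import os
import requests  # pip install requests

# Placeholder values: consult the Symbl.ai Model API docs for the real
# endpoint, authentication header, and request schema.
NEBULA_ENDPOINT = "https://example.invalid/nebula/generate"  # placeholder URL
API_KEY = os.environ["NEBULA_API_KEY"]                       # hypothetical env var

payload = {
    "prompt": (
        "Summarize the following sales call and list follow-up actions.\n\n"
        "Dana: Thanks for joining...\nLee: Happy to be here..."
    ),
    # Generation settings (length, temperature, etc.) would go here using
    # whatever parameter names the Model API actually defines.
}

response = requests.post(
    NEBULA_ENDPOINT,
    headers={"ApiKey": API_KEY, "Content-Type": "application/json"},
    json=payload,
    timeout=30,
)
response.raise_for_status()
print(response.json())
```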

As of now, Nebula ingests transcript data, and you can generate transcripts using Symbl’s tried and tested speech recognition API. We are excited to embark on this path of building the first LLM that truly unlocks the meaning of conversations across contexts, domains, and modalities.

Kartik Talamadupula
Director of AI Research

Kartik Talamadupula is a research scientist who has spent over a decade applying AI techniques to business problems in automation, human-AI collaboration, and NLP.