Enhance Real-Time Engagement by Using the Raw Audio Stream

You can enhance real-time engagement by accessing the raw audio stream coming from real-time communication platforms.

We’ve reached the point with online communication where the questions surrounding Real-Time Engagement (RTE) or Real-Time Communication (RTC) are no longer about how to develop these platforms. Instead, it’s become a matter of understanding how to better augment the experiences RTE and RTC provide.

RTE is the ability to interact with others over a stream as it happens. For example, this could be dynamic communication with customers (which can result in their instant gratification). If customers have a question about your product you can provide an answer immediately as you’re streaming, rather than via email a few days later.

RTC is any kind of conversation that happens live between two or more people, like face-to-face conversations or talking on the phone.

The most exciting aspect of RTC and RTE is the ability to unlock the data that lies within these conversations as they happen. To do this, you need access to the raw audio stream.

What is the raw audio stream?

The raw audio stream (RAS) is the unaltered audio captured by RTC platforms like Twilio and Agora.io. It’s audio that hasn’t been compressed, so it’s easier to read and analyze because nothing has been lost. Compressed audio has to balance quality against file size and, as a result, some data is lost when the file is compressed down. This, in turn, affects the ability of an AI system to successfully interpret what’s being said. Raw audio, on the other hand, is unaltered, which means you have access to everything that was captured in its purest state.
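To see why raw audio carries more information, compare the data rates. The size of uncompressed PCM follows directly from sample rate, bit depth, and channel count; the compressed bitrate below is a typical voice-codec figure used purely for illustration:

```python
# Raw PCM audio is just a sequence of samples: its size is fully
# determined by sample rate, bit depth, and channel count.
def pcm_bytes_per_second(sample_rate_hz, bits_per_sample, channels):
    return sample_rate_hz * (bits_per_sample // 8) * channels

# 16 kHz, 16-bit mono -- a common format for speech recognition.
raw_rate = pcm_bytes_per_second(16_000, 16, 1)

# A 32 kbps compressed voice codec carries about 4,000 bytes/s,
# discarding detail the encoder judged expendable.
compressed_rate = 32_000 // 8

print(raw_rate, compressed_rate)  # 32000 4000
```

Roughly an eight-fold difference in this example: everything the codec throws away is information an AI system never gets to see.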

The RAS is the key source of data you need to allow AI and other systems to add value to RTC scenarios. Take, for example, a raw video stream. If you don’t have access to the raw video stream, you can’t add layers of augmented reality (AR) or virtual reality (VR), because those layers react to changes in the raw video stream. For example, a mouth opening on a face can trigger an AR or VR event on the raw video stream. But AR and VR are only one way to take advantage of these changes within the raw stream.

State changes in the raw audio stream are incredibly important for real-time insights and algorithms that operate on conversation data. Live captioning can only take you so far. If you really want to impress, you’ll need to capture specific state changes during a conversation, like a sudden change of tone or a leap into a new topic. None of this is even a remote possibility without tracking state changes in the raw audio stream.
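As a minimal illustration of tracking state changes, the sketch below flags sudden jumps in frame energy across raw PCM frames. A real system would use far richer features (pitch, spectral shape, the speech content itself), but the core idea of watching frame-to-frame change is the same:

```python
import math

def rms(frame):
    """Root-mean-square energy of one frame of PCM samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def detect_energy_jumps(frames, ratio=3.0):
    """Return indices of frames whose energy jumps sharply versus the
    previous frame -- a crude stand-in for a 'sudden change of tone'."""
    jumps = []
    prev = None
    for i, frame in enumerate(frames):
        level = rms(frame)
        if prev is not None and prev > 0 and level / prev >= ratio:
            jumps.append(i)
        prev = level
    return jumps

# Two quiet frames followed by a loud one: the jump at index 2 is flagged.
quiet = [100] * 160
loud = [5000] * 160
print(detect_energy_jumps([quiet, quiet, loud]))  # [2]
```

None of this is possible on compressed, opaque audio; you need the raw samples to compute even this simple feature in real time.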

In much the same way that latency is no longer a concern for RTC, the primary concern around transcription is no longer accuracy, but what the technology built on top of it can provide.

How can you access the RAS?

In terms of devices, Apple’s HomePod, Google’s Home, and Amazon’s Echo don’t offer open APIs for developers to access raw audio data in streams.

Even among the major RTE or RTC companies, there’s often little to no access whatsoever to the raw audio stream. However, if you would like access, consider building something with Zoom or Agora.io. Here’s a quick rundown of what you can expect from each of these platforms:

  • Agora.io’s support for access to raw audio streams is well documented. You can use the input/outputs for API payload or response data to enable development around the raw audio streams.
  • Zoom is apparently the only other company that provides developers with access to raw audio streams as part of its suite of developer tools. Its software development kits (SDKs), for instance, provide access to the raw audio streams on iOS and Android, as well as on the web.
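In both platforms, access generally follows an observer pattern: you register a callback, and the SDK pushes each captured frame of raw PCM to it. The sketch below shows that pattern generically; the class and method names are illustrative, not the actual Agora.io or Zoom API:

```python
# A minimal observer pattern mirroring how RTC SDKs commonly expose raw
# audio: the SDK pushes fixed-size PCM frames to registered callbacks.
# (RawAudioPipeline and its methods are hypothetical names.)
class RawAudioPipeline:
    def __init__(self):
        self._observers = []

    def register(self, callback):
        # Analogous to registering an audio frame observer with an RTC SDK.
        self._observers.append(callback)

    def push_frame(self, pcm_frame):
        # The SDK would call this for every captured frame (e.g. every 10 ms).
        for callback in self._observers:
            callback(pcm_frame)

received = []
pipeline = RawAudioPipeline()
pipeline.register(received.append)
pipeline.push_frame(b"\x00\x01" * 160)  # one 10 ms frame of 16 kHz mono PCM
print(len(received))  # 1
```

Once your callback receives frames, anything downstream (analysis, forwarding to an AI service, recording) becomes possible.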

What you can do with the RAS

RTE and RTC are increasingly being used to go beyond the basics. The goal is no longer just to provide an accurate, real-time transcription of a live event. We’ve reached the point where there’s less concern about precision and more about how you can use your audio and video streams to enhance the experience of those live events.

As an example, our team accessed the lowest levels of the audio streams flowing through Agora.io’s APIs and SDKs to enable the creation of experiences derived from Conversation AI.

Symbl.ai provides a wide array of capabilities beyond transcription, such as sentiment analysis and analytics (including metrics around talk time, silence, and overlap in conversations).

Along with that, Symbl.ai lets developers familiar with SDKs like Agora.io’s create experiences that go far beyond what transcription accuracy alone can deliver.

We’ve built Symbl.ai’s software into a Swift/Objective-C iOS application, a Kotlin/Java Android application, and web applications running on Node.js and JavaScript with WebSockets. The examples demonstrate how you can get started with the RAS on different platforms and lay the foundation for doing more, like using our Conversation API to help you manage and process your conversation data.
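Once you have raw frames, a common first step before streaming them over a WebSocket is to slice the PCM buffer into fixed-duration chunks. The sketch below shows only that chunking step; the 16 kHz / 16-bit mono format and 100 ms chunk size are assumptions for illustration, not requirements of any particular API:

```python
def chunk_pcm(pcm_bytes, sample_rate_hz=16_000, bytes_per_sample=2, chunk_ms=100):
    """Split a raw PCM buffer into fixed-duration chunks suitable for
    sending one at a time over a streaming connection."""
    chunk_size = sample_rate_hz * bytes_per_sample * chunk_ms // 1000
    return [pcm_bytes[i:i + chunk_size]
            for i in range(0, len(pcm_bytes), chunk_size)]

# One second of silence at 16 kHz / 16-bit mono splits into ten 100 ms chunks.
one_second = bytes(16_000 * 2)
chunks = chunk_pcm(one_second)
print(len(chunks), len(chunks[0]))  # 10 3200
```

Each chunk would then be sent as a binary WebSocket message to the streaming endpoint your platform provides.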

Our flexible APIs allow you to create a system that lets you access the data found in the raw audio stream for real-time insights into the conversation, capturing action items, and even creating better experiences for students in e-learning situations.

Starting with a platform that provides you with easy and direct access to the RAS means you don’t have to waste time developing your own communications platform, nor do you have to waste time figuring out how to gain access. Instead, you can focus on building out the functionality that helps drive engagement and enhances human-to-human communication in real time.

Ready to open up the raw streams and explore the possibilities? You can learn more in our documentation.
