Use Cases

Trade-offs in Building Speaker Separation Into Your Application for Advanced Speech Analytics

Separating overlapped speech is a type of advanced speech analytics known as “speaker separation.” You can do it either in real time (using active speaker events over a telephony API, or a streaming API over WebSocket) or asynchronously, after the event (using a recorded file with speaker timestamp events, an async audio API with a speaker-separated channels feature, or speaker diarization). The best option depends on the capability of your customer’s platform, balanced against their budget and business needs.

What is speaker separation?

Speaker separation is a method of distinguishing between different speakers in an audio stream for the purposes of conversation intelligence. The ability to separate and identify different speakers allows for a more accurate and detailed analysis, either in real time or after the event (known as async).

What are the different speaker separation options in real-time and async?

You can use a conversation intelligence platform for speaker separation — one that provides real-time and async contextual AI capabilities through plug-and-play APIs.

Real-time speaker separation

There are two ways you can do speaker separation in real-time:

1. You can use active speaker events. If the call is connected to the Telephony API through PSTN or SIP, and your meeting platform has access to the active speaker talking timestamps, you can create timestamps for those active speaker events and push them back over the telephony API connection.

2. You can use a streaming API with WebSockets. If your meeting platform has access to the audio streams through the PBX, the backend, or a client running a browser app (for example, WebRTC), you can pass each speaker’s stream through its own WebSocket connection using the Streaming API. This is the most accurate way to get per-speaker speech-to-text, since each speaker gets their own stream. Even if speakers talk at the same time in a meeting, accuracy isn’t affected.

If you don’t have access to the streams in real time, then you’ll need to do your speaker separation asynchronously with the Async APIs.
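
To make the per-speaker streaming idea concrete, here’s a minimal Python sketch that routes each speaker’s audio chunks to a separate stream. The class and names are illustrative, not a vendor SDK; an in-memory list stands in for each speaker’s WebSocket connection.

```python
from collections import defaultdict

# Minimal sketch of "one stream per speaker": each speaker's audio chunks
# go to their own channel, so overlapping speech never mixes. A real
# integration would hold one WebSocket connection per speaker; here an
# in-memory list stands in for each connection.
class PerSpeakerStreams:
    def __init__(self) -> None:
        self._streams = defaultdict(list)  # speaker_id -> audio chunks

    def send(self, speaker_id: str, audio_chunk: bytes) -> None:
        # Real code would send over that speaker's WebSocket instead.
        self._streams[speaker_id].append(audio_chunk)

    def chunks_for(self, speaker_id: str) -> list:
        return self._streams[speaker_id]
```

Because no two speakers ever share a channel, simultaneous speech cannot degrade per-speaker transcription.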

Async speaker separation

There are three ways you can approach speaker separation after the audio/video stream has ended.

1. Speaker events – If you have the speaker timestamp events, you can first process the recorded file using the Async Audio/Video API to generate a conversationId; once the operation is complete, you can add the speaker timestamps with a PUT request to the Speaker API in the required format.
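
The exact request body depends on the vendor’s Speaker API; a hypothetical speaker-events payload (all field names here are illustrative) might look like:

```json
{
  "speakerEvents": [
    {
      "type": "started_speaking",
      "user": { "name": "Robert Bartheon" },
      "offset": { "seconds": 0 }
    },
    {
      "type": "stopped_speaking",
      "user": { "name": "Robert Bartheon" },
      "offset": { "seconds": 12 }
    }
  ]
}
```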


2. If you have a recorded file that’s already split into separate channels, you can use the Async Audio API with the speaker-separated channels feature by setting the flag enableSeparateRecognitionPerChannel to true and supplying channelMetadata with each channel’s speaker details. This is very similar to the real-time WebSocket approach, except that instead of streaming live, the audio has already been recorded with a separate channel for each participant in the file.

Here’s a metadata example for a two-channel file:

 "channelMetadata": [
   {
     "channel": 1,
     "speaker": {
       "name": "Robert Bartheon",
       "email": ""
     }
   },
   {
     "channel": 2,
     "speaker": {
       "name": "Arya Stark",
       "email": ""
     }
   }
 ]

3. If you have the recorded file but no separate channels and no timeline of speaker events, you can use a feature called speaker diarization on the recorded file. You’ll need to set the flag enableSpeakerDiarization to true and specify the number of participants on the call using the flag diarizationSpeakerCount.
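
As a rough sketch, the two diarization flags above could be composed into a request query string like this (the helper function is an illustration, not a vendor SDK; check your API reference for the exact contract):

```python
from urllib.parse import urlencode

def diarization_params(speaker_count: int) -> str:
    """Build the query string enabling speaker diarization for an
    async audio request, given the number of call participants."""
    params = {
        "enableSpeakerDiarization": "true",          # turn diarization on
        "diarizationSpeakerCount": str(speaker_count),  # expected speakers
    }
    return urlencode(params)
```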

What is the accuracy/cost trade-off for each speaker separation technique?

The most accurate method is real-time WebSockets with separate audio streams. If you’re operating asynchronously, then speaker-separated channels will give you the most accurate results.

However, the most accurate options also cost more, since each speaker’s channel must be processed and analyzed separately.

The next most accurate way is to use the active speaker events feature or the async speaker events (i.e. with timestamps of talking events). The two are very similar, except that one is real-time and the other is async. The only disadvantage is that if two people talk at the same time, you’ll get less accurate results: when two or more people speak at once, the platform providing the data can miss details in the overlapping conversation and can’t separate the individual speakers.

The last option, speaker diarization, doesn’t provide access to channels or speaker events. So, while speaker identification isn’t possible, you can still perform speaker separation. You’ll likely get more errors here, but you can still achieve a decent level of accuracy.

Which option is best depends on the capability of your customer’s platform, balanced against their budget and business needs.

Examples of speaker separation in action

  • Call centers can get real-time insights into the agent and the customer if the call has two separate channels. This is very powerful because if, for example, an agent speaks to a customer and then passes them on, the next agent will have the benefit of that information and real-time analytics to serve the customer more helpfully and efficiently.
  • Call centers can also benefit from async speaker diarization if they have single-channel recordings with no speaker timestamp events. This gives businesses the ability to analyze thousands of recordings, distinguish between agents and customers, and get transcripts and conversation insights they can use to improve their business. Moving all recorded audio to text can also save on data storage costs.
  • Some meeting platforms, like Zoom, don’t use separate channels but instead generate a TIMELINE file that includes the speaker time events. At the end of the recorded call, you can first process the recorded file using the Async API, then get speaker separation by converting the TIMELINE file into the speaker events format and sending it to the same processed conversation via the Speaker API.
  • You can use meeting platforms with real-time WebSockets so that each participant gets their own channel. This means that different people can talk about different things, and their distinct action items are accurately captured and reviewable in real time. If you don’t have WebSocket capability for separate streams, you can use a telephony API and active speaker events (depending on the platform).
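
The TIMELINE-to-speaker-events conversion described above can be sketched in Python. Both schemas here are illustrative stand-ins, not the actual Zoom or vendor formats:

```python
def timeline_to_speaker_events(timeline):
    """Convert a chronologically sorted timeline of entries like
    {"ts": <seconds>, "speaker": <name>} into started/stopped speaking
    events. A change of speaker closes the previous turn; the final
    turn is left open for brevity."""
    events = []
    current_speaker = None
    for entry in timeline:
        name, ts = entry["speaker"], entry["ts"]
        if current_speaker is not None and current_speaker != name:
            # The previous speaker stops when a new speaker takes over.
            events.append({"type": "stopped_speaking",
                           "name": current_speaker, "offset": ts})
            current_speaker = None
        if current_speaker is None:
            events.append({"type": "started_speaking",
                           "name": name, "offset": ts})
            current_speaker = name
    return events
```

The resulting event list can then be submitted to the Speaker API against the already-processed conversation.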

Learn more about the Speaker Events API, which enables the most accurate speech-to-text for speaker separation in your audio and video streams.

Additional reading