You can add machine learning to the data endpoints of VoIP or SIP systems to analyze speech patterns in real time and enrich the conversation with insights such as caller intent, emotion, and mood. This is especially valuable for call center apps or any voice-enabled application that handles human-to-human interaction at scale.

Access the audio data in your application

Most VoIP systems use the Session Initiation Protocol (SIP) for call signaling and carry the audio itself over the Real-time Transport Protocol (RTP). Even if yours uses a different signaling protocol, such as the Media Gateway Control Protocol (MGCP), a back-to-back user agent (B2BUA) sitting in the call path can send the call audio to your machine learning (ML) system. The system can then feed valuable insights into internal or external conversations.
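As a rough illustration of what "accessing the audio" can mean at the packet level, here is a minimal sketch that parses the fixed RTP header defined in RFC 3550 and returns the audio payload. It assumes you already receive raw RTP datagrams (for example from a B2BUA that forks media to you); the `RtpPacket` class and `parse_rtp` function are hypothetical names for this example, not part of any particular SDK.

```python
import struct
from dataclasses import dataclass


@dataclass
class RtpPacket:
    version: int
    payload_type: int
    marker: bool
    sequence: int
    timestamp: int
    ssrc: int
    payload: bytes


def parse_rtp(data: bytes) -> RtpPacket:
    """Parse the 12-byte fixed RTP header (RFC 3550) and slice out the payload."""
    if len(data) < 12:
        raise ValueError("RTP packet shorter than the 12-byte fixed header")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", data[:12])
    version = b0 >> 6
    if version != 2:
        raise ValueError(f"unsupported RTP version {version}")
    has_padding = bool(b0 & 0x20)
    has_extension = bool(b0 & 0x10)
    csrc_count = b0 & 0x0F
    offset = 12 + 4 * csrc_count  # skip optional CSRC identifiers
    if has_extension:
        if len(data) < offset + 4:
            raise ValueError("truncated RTP header extension")
        # Extension header: 16-bit profile id, 16-bit length in 32-bit words.
        _, ext_words = struct.unpack("!HH", data[offset:offset + 4])
        offset += 4 + 4 * ext_words
    end = len(data)
    if has_padding:
        end -= data[-1]  # last byte holds the padding length
    if offset > end:
        raise ValueError("RTP header fields exceed packet length")
    return RtpPacket(version, b1 & 0x7F, bool(b1 & 0x80), seq, ts, ssrc, data[offset:end])
```

The sequence number and timestamp are what you would later use to reorder packets, and the payload type tells you which codec to decode before handing audio to a model.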

Pulling pre-conversation data from your IVR or virtual assistant

Before a VoIP call begins, you can extract useful metadata about it, such as who’s calling, from where, and indications of the caller’s intent. Businesses often use this to help prepare their staff for the next call. You can also use a live packet sniffer to capture SIP packets and pull available data like the source IP address, caller ID, and extension number, then look up the caller’s previous calls in your own records.

This helps you predict who’s calling and whether to route the caller to a certain employee or team.
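To make the metadata extraction concrete, here is a minimal sketch that pulls caller details out of the headers of a SIP INVITE. The sample message, the `parse_sip_headers` and `caller_metadata` helpers, and the returned field names are all made up for illustration. The sketch also ignores compact header forms and folded header lines, which a production parser would need to handle.

```python
import re

# Hypothetical INVITE, using documentation-reserved addresses.
SAMPLE_INVITE = (
    "INVITE sip:support@example.com SIP/2.0\r\n"
    "Via: SIP/2.0/UDP 203.0.113.7:5060;branch=z9hG4bK776asdhds\r\n"
    'From: "Alice" <sip:+15551230001@example.com>;tag=1928301774\r\n'
    "To: <sip:support@example.com>\r\n"
    "Call-ID: a84b4c76e66710@203.0.113.7\r\n"
    "CSeq: 314159 INVITE\r\n"
    "Content-Length: 0\r\n"
    "\r\n"
)


def parse_sip_headers(message: str) -> dict:
    """Return lower-cased header names mapped to their first value."""
    head = message.split("\r\n\r\n", 1)[0]
    lines = head.split("\r\n")
    headers = {"request_line": lines[0]}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep:
            headers.setdefault(name.strip().lower(), value.strip())
    return headers


def caller_metadata(message: str) -> dict:
    """Extract display name, caller ID, source host, and Call-ID from an INVITE."""
    headers = parse_sip_headers(message)
    from_match = re.search(
        r'(?:"([^"]*)"\s*)?<sip:([^@>;]+)@([^>;]+)>', headers.get("from", "")
    )
    via_match = re.search(r"SIP/2\.0/\w+\s+([^;:\s]+)", headers.get("via", ""))
    return {
        "display_name": from_match.group(1) if from_match else None,
        "caller_id": from_match.group(2) if from_match else None,
        "domain": from_match.group(3) if from_match else None,
        "source_host": via_match.group(1) if via_match else None,
        "call_id": headers.get("call-id"),
    }
```

The caller ID returned here is the key you would use to look up previous calls or account details before deciding where to route the call.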

For human-to-human conversations over a VoIP connection, many companies funnel callers through an interactive voice response (IVR) system, also known as a phone tree. Your voice command or push of a button is interpreted by a programmable voice AI. When you push a button, the system detects the dual-tone multi-frequency (DTMF) tones that the keypress generates.
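DTMF detection is commonly implemented with the Goertzel algorithm, which measures signal power at a handful of known frequencies. The sketch below is a simplified illustration, assuming 8 kHz mono samples as floats: it picks the strongest row and column frequency and maps them to a key. The function names are invented for this example, and a real detector would also apply energy thresholds and twist checks so that speech or silence is not mistaken for a keypress.

```python
import math

ROW_FREQS = (697, 770, 852, 941)      # Hz, standard DTMF low group
COL_FREQS = (1209, 1336, 1477, 1633)  # Hz, standard DTMF high group
KEYPAD = ("123A", "456B", "789C", "*0#D")


def goertzel_power(samples, freq, sample_rate):
    """Return the signal power at one frequency using the Goertzel recurrence."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / sample_rate)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2


def detect_dtmf(samples, sample_rate=8000):
    """Pick the strongest row and column tone and return the matching key."""
    row = max(range(4), key=lambda i: goertzel_power(samples, ROW_FREQS[i], sample_rate))
    col = max(range(4), key=lambda i: goertzel_power(samples, COL_FREQS[i], sample_rate))
    return KEYPAD[row][col]


def make_tone(key, sample_rate=8000, duration=0.05):
    """Synthesize a DTMF tone for one key (used here only to exercise the detector)."""
    for r, row in enumerate(KEYPAD):
        c = row.find(key)
        if c != -1:
            f_row, f_col = ROW_FREQS[r], COL_FREQS[c]
            break
    else:
        raise ValueError(f"not a DTMF key: {key!r}")
    n = int(sample_rate * duration)
    return [
        math.sin(2 * math.pi * f_row * t / sample_rate)
        + math.sin(2 * math.pi * f_col * t / sample_rate)
        for t in range(n)
    ]
```

For example, `detect_dtmf(make_tone("5"))` runs the detector on a synthetic 50 ms keypress.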

You’ve probably spent time in an IVR yourself and been asked to “Press 1 if you are a new customer” or “Say ‘invoice’ to be connected to an employee in our accounting department.” The concept is meant to save time and route callers to the employees best suited to help them. But as you may have experienced, it has limitations.

When the call is put through to a human operator, your ML model can make real-time inferences about caller intent from the audio stream and surface those insights on-screen to help the human agent improve the interaction. This is particularly useful for customer service, sales calls, and support applications where one of the key performance indicators may be to keep conversations short to avoid a long queue or to identify responses that drive upsell opportunities.

You can also implement predictive ML models to recommend the “next best action” (NBA). These models find patterns before or during the call, based on historic data and the characteristics of the ongoing conversation, to determine which actions are most likely to lead to the desired outcome.
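As a deliberately simple stand-in for a predictive model, the sketch below recommends a next best action from historic outcomes alone: it estimates a smoothed success rate for each action per caller intent and picks the highest. The intents, action names, and record format are hypothetical. A real NBA model would also use features from the live conversation.

```python
from collections import defaultdict


def fit_action_stats(history):
    """history: iterable of (intent, action, succeeded) records from past calls."""
    stats = defaultdict(lambda: [0, 0])  # (intent, action) -> [successes, attempts]
    for intent, action, succeeded in history:
        entry = stats[(intent, action)]
        entry[0] += int(succeeded)
        entry[1] += 1
    return stats


def next_best_action(stats, intent):
    """Return the action with the best smoothed success rate for this intent."""
    candidates = {
        action: (successes + 1) / (attempts + 2)  # Laplace smoothing for small samples
        for (seen_intent, action), (successes, attempts) in stats.items()
        if seen_intent == intent
    }
    if not candidates:
        return None  # no history for this intent
    return max(candidates, key=candidates.get)
```

The smoothing keeps an action that succeeded once out of one attempt from outranking an action with a long, mostly successful track record.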

During the call – using machine learning to enhance the conversation

When you’re in a conversation with another human, AI can assist you by analyzing the caller’s speech patterns in real time, recognizing their current mood and any changes in it. In a call center, this helps agents avoid making a bad situation worse and lets them wrap up calls more quickly and to the satisfaction of the caller.

For this to work, you need to dedicate enough bandwidth to protect your VoIP calls against packet loss, which helps preserve audio quality and the in-order arrival of packets in real time. You may also want to invest in offline-trained ML models for packet loss concealment, which helps mask issues like delayed or completely missing packets of voice data.
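To show where concealment fits, here is a minimal sketch of the non-ML baseline: put packets back in sequence order and fill each gap by repeating the previous frame. The function name and the `(sequence_number, payload)` input format are assumptions for this example, and the sketch ignores the 16-bit wraparound of RTP sequence numbers. An ML-based concealer would replace the "repeat last frame" step with a model that predicts the missing audio.

```python
def reorder_and_conceal(packets):
    """Return payloads in sequence order, repeating the last frame over gaps.

    packets: iterable of (sequence_number, payload) pairs, possibly out of
    order and with missing sequence numbers.
    """
    by_seq = dict(packets)  # a duplicate sequence number keeps the last copy
    if not by_seq:
        return []
    first, last = min(by_seq), max(by_seq)
    ordered = []
    previous = by_seq[first]
    for seq in range(first, last + 1):
        frame = by_seq.get(seq)
        if frame is None:
            frame = previous  # conceal the lost packet with the prior frame
        ordered.append(frame)
        previous = frame
    return ordered
```

In a live system this logic sits behind a jitter buffer that waits a few tens of milliseconds before declaring a packet lost.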

You can leverage AI in real time for several types of customer conversations where it’s important to optimize engagement and amplify the interaction:

  • Offer suggestions on how the agent can proceed to make the customer happy: Using models for emotional analysis, machine learning can pick up on the emotional weight of each sentence and suggest responses, for example, “offer refund,” “suggest follow-up,” or, “tell the customer that you will pass this feedback on.”
  • Take care of tasks that humans don’t need to be involved with: For example, pulling up the calendar of an employee whose name was mentioned, booking a meeting, composing an itinerary based on the conversation, and sending out the calendar invite, shaving precious seconds if not minutes off the call.
  • Translate in real time: With a speech-to-speech translation model like Google’s Translatotron, calls in any language could be answered by the same customer service agent. This makes it easier to handle calls from customers in several countries without delay, cutting the cost of building new call centers in other countries or hiring multilingual agents. Such models are still experimental, and speech-to-speech translation needs to mature further before you can rely on it for highly accurate translation. In the meantime, you should notify your callers that your system uses machine translation so they’re more forgiving of any problems in understanding.
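The first item in the list above can be sketched as a two-step pipeline: score the emotional weight of a sentence, then map the score to a suggested response. The example below uses a tiny hand-written word list purely as a placeholder for a trained emotion model. The word lists, thresholds, and function names are all assumptions, and the suggestion strings are the examples quoted in the list.

```python
import re

# Placeholder lexicons; a real system would use a trained emotion model instead.
NEGATIVE_WORDS = {"angry", "terrible", "broken", "frustrated", "slow", "worst"}
POSITIVE_WORDS = {"great", "thanks", "helpful", "love", "perfect"}


def sentiment_score(sentence: str) -> float:
    """Return a score in [-1, 1]; negative values mean negative sentiment."""
    words = re.findall(r"[a-z']+", sentence.lower())
    positive = sum(word in POSITIVE_WORDS for word in words)
    negative = sum(word in NEGATIVE_WORDS for word in words)
    return (positive - negative) / max(1, positive + negative)


def suggest_response(sentence: str) -> str:
    """Map the sentiment of one transcribed sentence to an agent suggestion."""
    score = sentiment_score(sentence)
    if score <= -0.5:
        return "offer refund"
    if score < 0:
        return "suggest follow-up"
    return "tell the customer that you will pass this feedback on"
```

In practice the input would be the live transcript of the caller's last sentence, and the suggestion would appear on the agent's screen.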

Processing call data with ML after the call ends

When conversation intelligence is continually used on your VoIP data, the AI can keep learning more about your customers. What are their moods? How do they relate to specific issues? What are their most common objections?

Using the backlog of customer problems, including conversations and solutions from your voice calls, your AI can be trained to answer frequently asked questions right there in the phone queue.

This can be particularly helpful if your AI discovers a surge of one specific question or a range of questions on a specific topic. It can then suggest that you set up automatic responses using virtual assistants, augment your existing knowledge base, or build better decision trees for your IVR. In a call center application, where average call handling time is the key metric, all questions on a specific issue can be routed to one or more experts on that topic, freeing up other agents to handle other calls.
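One simple way to spot such a surge, sketched below, is to compare each topic's share of recent calls with its share in a baseline period. The function name, the thresholds, and the assumption that every call has already been labeled with a single topic are illustrative choices, not a description of any specific product.

```python
from collections import Counter


def find_surging_topics(baseline_topics, recent_topics, min_count=5, ratio=2.0):
    """Return topics whose share of recent calls is at least `ratio` times baseline.

    baseline_topics, recent_topics: lists with one topic label per call.
    """
    baseline = Counter(baseline_topics)
    recent = Counter(recent_topics)
    baseline_total = max(1, len(baseline_topics))
    recent_total = max(1, len(recent_topics))
    surging = []
    for topic, count in recent.items():
        if count < min_count:
            continue  # too few calls to call it a surge
        recent_share = count / recent_total
        # Add-one smoothing so brand-new topics do not divide by zero.
        baseline_share = (baseline[topic] + 1) / (baseline_total + 1)
        if recent_share >= ratio * baseline_share:
            surging.append(topic)
    return sorted(surging)
```

Topics returned here are candidates for an automated answer in the queue or for routing to a dedicated expert.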

Building new ML models for your call recordings

Where businesses store call recordings in their own stack, a conversation intelligence system can be used to audit the calls for specific entities or key phrases, redact any sensitive data or personally identifiable information (PII), and identify coaching opportunities using analytics like pace, talk time, overlap, and sentiment across the conversation.

It’s also important to identify these characteristics for every speaker involved in the call, so speaker separation and identification are an important part of the overall ML system. You can use off-the-shelf conversational AI APIs or open-source models to build this system on both voice and text data asynchronously.
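Two of these post-call steps are easy to sketch once you have a transcript and speaker-separated segments: redacting obvious PII and computing talk time and overlap per speaker. The patterns below only catch email addresses and phone-like numbers, and the `(speaker, start, end)` segment format and function names are assumptions for this example. Production redaction typically relies on a trained entity recognizer rather than regular expressions alone.

```python
import re

# Illustrative patterns only; they will miss many kinds of PII.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact_pii(text: str) -> str:
    """Replace each matched span with a bracketed label such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def talk_stats(segments):
    """Return (talk time per speaker, total cross-speaker overlap) in seconds.

    segments: list of (speaker, start_sec, end_sec) from speaker separation.
    """
    talk_time = {}
    for speaker, start, end in segments:
        talk_time[speaker] = talk_time.get(speaker, 0.0) + (end - start)
    overlap = 0.0
    for i, (speaker_a, start_a, end_a) in enumerate(segments):
        for speaker_b, start_b, end_b in segments[i + 1:]:
            if speaker_a != speaker_b:
                overlap += max(0.0, min(end_a, end_b) - max(start_a, start_b))
    return talk_time, overlap
```

A high overlap figure relative to call length is one signal that speakers are interrupting each other, which can be flagged as a coaching opportunity.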

Symbl offers Async APIs for voice, video, and text that you can use to aggregate insights and analyze several aspects of a conversation offline:

  • Metadata like speakers, contact information, and the title of the conversation.
  • All members, transcripts and messages in the conversation as well as the topics discussed.
  • Any questions or requests for information that went unanswered in the call with identified speakers.
  • Appointments or follow-ups.

This could benefit call center agents, sales executives, and knowledge workers by creating transcripts, automatically reporting call metrics, and giving each participant a personalized list of tasks to complete, leading to better call outcomes and higher productivity.

How AI adds to your VoIP security

You’re not the only one who can sniff packets and make VoIP work better for your purposes. VoIP systems have been hacked for years. Hackers mainly target VoIP systems to make money, save money with free calls, or steal data.

With ML you could set up a system to detect and prevent VoIP hacking and continuously train the model to get better at it. Some attacks will be hard to trace, but others leave clues. Fuzzing, a functional protocol testing technique that attackers also use to probe for weaknesses, sends a higher than usual number of packets and leaves traces of unusually high data consumption. You can turn the same technique to your advantage: manual fuzzing takes a lot of time, but with a tool like Google’s free and open source ClusterFuzz, you can find the bugs in your code before VoIP crashes become widespread within your application.
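A very simple baseline for spotting "unusually high" traffic, before any model is trained, is a z-score against recent history. The sketch below assumes you already collect packets-per-second counts per source; the function name and the threshold of three standard deviations are illustrative choices.

```python
import statistics


def is_unusual_rate(history, current, z_threshold=3.0):
    """Return True if `current` deviates from `history` by more than z_threshold sigmas.

    history: recent packets-per-second counts for one source (non-empty).
    """
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return current != mean  # flat history: any change is unusual
    return abs(current - mean) / stdev > z_threshold
```

Flags like this can trigger an alert directly or become one input feature for a trained anomaly detection model.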

Your ML models can also be trained to recognize other patterns that characterize security attacks. These include eavesdropping, audio injection, caller ID spoofing, and VoIP phishing. Some will involve a series of very short calls. Others will take up capacity on the network without connecting to an agent at the company, because the call to or from the customer is redirected to a hacker. The data use alone will drive up your expenses, but that’s nothing compared to the cost of a successful hack.
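The "series of very short calls" pattern can be turned into a rule or a model feature with a sliding window over call records. The sketch below is rule-based rather than learned, and the record format, function name, and thresholds are assumptions for illustration.

```python
from collections import defaultdict


def flag_short_call_bursts(calls, max_duration=5.0, window=60.0, threshold=10):
    """Return the set of sources that made `threshold` or more short calls within `window` seconds.

    calls: iterable of (source, start_time_sec, duration_sec) records.
    """
    short_call_starts = defaultdict(list)
    for source, start, duration in calls:
        if duration <= max_duration:
            short_call_starts[source].append(start)

    flagged = set()
    for source, times in short_call_starts.items():
        times.sort()
        left = 0
        for right, current in enumerate(times):
            # Shrink the window until it spans at most `window` seconds.
            while current - times[left] > window:
                left += 1
            if right - left + 1 >= threshold:
                flagged.add(source)
                break
    return flagged
```

The per-source count inside the window can also be fed to a classifier alongside other signals, such as call destinations or time of day.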

Team Symbl