What is the Best Use-Case for the Message Objects in Real-Time Transcription with Symbl.ai over a WebSocket?

Symbl.ai’s WebSocket provides access to real-time transcription of audio streamed through Symbl.ai in more than one form of message object. Depending upon your use case, you may choose one message object over another. The WebSocket, for instance, delivers two types of message objects: recognition_result and message_response. Generally, recognition_result is best for live captioning, whereas message_response is better suited to transcripts and anything built on top of them. In the following guide you walk through the differences between these message objects.

Sign up

The first step to exploring the differences is to register for an account at Symbl.ai (i.e., https://platform.symbl.ai/). Grab both your appId and your appSecret.

Authenticate

With both your appId and your appSecret you authenticate either with a cURL command or with Postman so that you receive your x-api-key. Here is an example with cURL:

curl -k -X POST "https://api.symbl.ai/oauth2/token:generate" \
-H "accept: application/json" \
-H "Content-Type: application/json" \
-d "{ \"type\": \"application\", \"appId\": \"<appId>\", \"appSecret\": \"<appSecret>\"}"

After authenticating you receive your x-api-key.

Ideally a token server would handle authentication (with code that makes a RESTful API call to generate the token) so that neither the appSecret nor the appId is ever exposed. However, cURL gets you set up immediately. With the x-api-key handy, you are now ready to establish a WebSocket connection for performing live transcription.
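
For reference, here is a minimal sketch of such a token server in Node.js. It assumes Node 18+ (for the global fetch), that Express is installed, and that the oauth2/token:generate response carries an accessToken field; the /token route and the APP_ID and APP_SECRET environment variable names are illustrative only.

// Minimal token server sketch (Node.js 18+, Express assumed installed).
// APP_ID and APP_SECRET are hypothetical environment variable names.
const express = require('express');
const app = express();

app.get('/token', async (req, res) => {
  try {
    const response = await fetch('https://api.symbl.ai/oauth2/token:generate', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        type: 'application',
        appId: process.env.APP_ID,
        appSecret: process.env.APP_SECRET,
      }),
    });
    // Assumes the response JSON contains an accessToken field.
    const { accessToken } = await response.json();
    res.json({ accessToken });
  } catch (err) {
    res.status(500).json({ error: 'Token generation failed' });
  }
});

app.listen(3000, () => console.log('Token server listening on port 3000'));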

Run the WebSocket

Add your x-api-key as the value of the const called accessToken in the first line of the full WebSocket code sample below:

/**
 * The JWT token you get after authenticating with our API.
 * Check the Authentication section of the documentation for more details.
 */
const accessToken = ""
const uniqueMeetingId = btoa("user@example.com")
const symblEndpoint = `wss://api.symbl.ai/v1/realtime/insights/${uniqueMeetingId}?access_token=${accessToken}`;
const ws = new WebSocket(symblEndpoint);
// Fired when a message is received from the WebSocket server
ws.onmessage = (event) => {
  // You can find the conversationId in event.message.data.conversationId;
  const data = JSON.parse(event.data);
  if (data.type === 'message' && data.message.hasOwnProperty('data')) {
    console.log('conversationId', data.message.data.conversationId);
  }
  if (data.type === 'message_response') {
    for (let message of data.messages) {
      console.log('Transcript (more accurate): ', message.payload.content);
    }
  }
  if (data.type === 'topic_response') {
    for (let topic of data.topics) {
      console.log('Topic detected: ', topic.phrases)
    }
  }
  if (data.type === 'insight_response') {
    for (let insight of data.insights) {
      console.log('Insight detected: ', insight.payload.content);
    }
  }
  if (data.type === 'message' && data.message.hasOwnProperty('punctuated')) {
    console.log('Live transcript (less accurate): ', data.message.punctuated.transcript)
  }
  console.log(`Response type: ${data.type}. Object: `, data);
};
// Fired when the WebSocket closes unexpectedly due to an error or lost connection
ws.onerror  = (err) => {
  console.error(err);
};
// Fired when the WebSocket connection has been closed
ws.onclose = (event) => {
  console.info('Connection to websocket closed');
};
// Fired when the connection succeeds.
ws.onopen = (event) => {
  ws.send(JSON.stringify({
    type: 'start_request',
    meetingTitle: 'Websockets How-to', // Conversation name
    insightTypes: ['question', 'action_item'], // Will enable insight generation
    config: {
      confidenceThreshold: 0.5,
      languageCode: 'en-US',
      speechRecognition: {
        encoding: 'LINEAR16',
        sampleRateHertz: 44100,
      }
    },
    speaker: {
      userId: 'example@symbl.ai',
      name: 'Example Sample',
    }
  }));
};
const stream = await navigator.mediaDevices.getUserMedia({ audio: true, video: false });
/**
 * The callback function which fires after a user gives the browser permission to use
 * the computer's microphone. Starts a recording session which sends the audio stream to
 * the WebSocket endpoint for processing.
 */
const handleSuccess = (stream) => {
  const AudioContext = window.AudioContext;
  const context = new AudioContext();
  const source = context.createMediaStreamSource(stream);
  const processor = context.createScriptProcessor(1024, 1, 1);
  const gainNode = context.createGain();
  source.connect(gainNode);
  gainNode.connect(processor);
  processor.connect(context.destination);
  processor.onaudioprocess = (e) => {
    // convert to 16-bit payload
    const inputData = e.inputBuffer.getChannelData(0);
    const targetBuffer = new Int16Array(inputData.length);
    for (let index = 0; index < inputData.length; index++) {
      targetBuffer[index] = 32767 * Math.min(1, inputData[index]);
    }
    // Send audio stream to websocket.
    if (ws.readyState === WebSocket.OPEN) {
      ws.send(targetBuffer.buffer);
    }
  };
};
handleSuccess(stream);

After adding your x-api-key, open a Chrome browser. Inside Chrome, press Option + Command + J to open the console alongside your browser’s tab. Paste the WebSocket code into the console and press Enter. The browser tab should ask for permission to access your microphone. After granting permission, speak. Messages begin accumulating in real time as you speak. The messages your voice creates are either recognition_result or message_response objects. Let’s take a look at both.

recognition_result

The WebSocket streams messages as JSON objects whose type is set to message. These message objects contain the recognition_result. Message objects of type recognition_result arrive in two forms: first, repeatedly, while the isFinal property is set to false; then, once, when it is set to true. Messages are of an indeterminate length: you cannot determine the length of a message until the utterance it measures terminates. At the point of termination, the message is complete.

If isFinal is set to false

Here is an example of a recognition_result whose isFinal property is set to false:

{
  "type": "message",
  "message": {
    "type": "recognition_result",
    "isFinal": false,
    "payload": {},
    "punctuated": {
      "transcript": "Testing."
    },
    "user": {
      "userId": "example@symbl.ai",
      "name": "Matt Sample",
      "id": "1a93a562-8450-4753-9dbb-36123822dbb0"
    }
  },
  "timeOffset": 4058
}

If a recognition_result message’s isFinal property is set to false, the message itself is incomplete. Since it is not “complete,” the transcription data is subject to change. As the speaker speaks new words, ever newer recognition_result objects whose isFinal property is set to false accumulate; a new result arrives for each new word the speaker speaks, until finally the recognition_result, or message, begins to terminate. As the message terminates, the message completes. As soon as the message completes, a new recognition_result arises: a message object whose isFinal property is set to true, because the recognition_result is no longer waiting for the speaker to speak.

If isFinal is set to true

The recognition_result that arises is a message object whose isFinal property is set to true, since the speaker has finished speaking his or her message. Here is an example of a recognition_result whose isFinal property is set to true:

{
  "type": "message",
  "message": {
    "type": "recognition_result",
    "isFinal": true,
    "payload": {},
    "punctuated": {
      "transcript": "Testing."
    },
    "user": {
      "userId": "example@symbl.ai",
      "name": "Matt Sample",
      "id": "1a93a562-8450-4753-9dbb-36123822dbb0"
    }
  },
  "timeOffset": 4058
}

A closer examination of a recognition_result whose isFinal property is set to true reveals the depth of analysis captured at the initial stage of automatic speech recognition with Symbl.ai. Here is an example:

{
  "type": "message",
  "message": {
    "type": "recognition_result",
    "isFinal": true,
    "payload": {
      "raw": {
        "alternatives": [
          {
            "words": [
              {
                "word": "Is",
                "startTime": {
                  "seconds": "113",
                  "nanos": "200000000"
                },
                "endTime": {
                  "seconds": "113",
                  "nanos": "400000000"
                }
              },
              {
                "word": "a",
                "startTime": {
                  "seconds": "113",
                  "nanos": "400000000"
                },
                "endTime": {
                  "seconds": "113",
                  "nanos": "500000000"
                }
              },
              {
                "word": "lot",
                "startTime": {
                  "seconds": "113",
                  "nanos": "500000000"
                },
                "endTime": {
                  "seconds": "113",
                  "nanos": "800000000"
                }
              },
              {
                "word": "of",
                "startTime": {
                  "seconds": "113",
                  "nanos": "800000000"
                },
                "endTime": {
                  "seconds": "113",
                  "nanos": "900000000"
                }
              },
              {
                "word": "noise",
                "startTime": {
                  "seconds": "113",
                  "nanos": "900000000"
                },
                "endTime": {
                  "seconds": "114",
                  "nanos": "500000000"
                }
              },
              {
                "word": "but",
                "startTime": {
                  "seconds": "114",
                  "nanos": "700000000"
                },
                "endTime": {
                  "seconds": "115",
                  "nanos": "200000000"
                }
              },
              {
                "word": "it'll",
                "startTime": {
                  "seconds": "116",
                  "nanos": "000000000"
                },
                "endTime": {
                  "seconds": "116",
                  "nanos": "700000000"
                }
              },
              {
                "word": "show",
                "startTime": {
                  "seconds": "117",
                  "nanos": "200000000"
                },
                "endTime": {
                  "seconds": "117",
                  "nanos": "600000000"
                }
              },
              {
                "word": "you",
                "startTime": {
                  "seconds": "117",
                  "nanos": "600000000"
                },
                "endTime": {
                  "seconds": "118",
                  "nanos": "000000000"
                }
              },
              {
                "word": "like",
                "startTime": {
                  "seconds": "118",
                  "nanos": "000000000"
                },
                "endTime": {
                  "seconds": "118",
                  "nanos": "400000000"
                }
              },
              {
                "word": "you",
                "startTime": {
                  "seconds": "118",
                  "nanos": "400000000"
                },
                "endTime": {
                  "seconds": "118",
                  "nanos": "500000000"
                }
              },
              {
                "word": "can",
                "startTime": {
                  "seconds": "118",
                  "nanos": "500000000"
                },
                "endTime": {
                  "seconds": "118",
                  "nanos": "700000000"
                }
              },
              {
                "word": "see",
                "startTime": {
                  "seconds": "118",
                  "nanos": "700000000"
                },
                "endTime": {
                  "seconds": "118",
                  "nanos": "800000000"
                }
              },
              {
                "word": "the",
                "startTime": {
                  "seconds": "118",
                  "nanos": "800000000"
                },
                "endTime": {
                  "seconds": "119",
                  "nanos": "000000000"
                }
              },
              {
                "word": "words",
                "startTime": {
                  "seconds": "119",
                  "nanos": "000000000"
                },
                "endTime": {
                  "seconds": "119",
                  "nanos": "300000000"
                }
              },
              {
                "word": "that",
                "startTime": {
                  "seconds": "119",
                  "nanos": "300000000"
                },
                "endTime": {
                  "seconds": "119",
                  "nanos": "500000000"
                }
              },
              {
                "word": "are",
                "startTime": {
                  "seconds": "119",
                  "nanos": "500000000"
                },
                "endTime": {
                  "seconds": "119",
                  "nanos": "600000000"
                }
              },
              {
                "word": "popping",
                "startTime": {
                  "seconds": "119",
                  "nanos": "600000000"
                },
                "endTime": {
                  "seconds": "120",
                  "nanos": "000000000"
                }
              },
              {
                "word": "up",
                "startTime": {
                  "seconds": "120",
                  "nanos": "000000000"
                },
                "endTime": {
                  "seconds": "120",
                  "nanos": "300000000"
                }
              },
              {
                "word": "like",
                "startTime": {
                  "seconds": "120",
                  "nanos": "300000000"
                },
                "endTime": {
                  "seconds": "120",
                  "nanos": "500000000"
                }
              },
              {
                "word": "this",
                "startTime": {
                  "seconds": "120",
                  "nanos": "500000000"
                },
                "endTime": {
                  "seconds": "120",
                  "nanos": "700000000"
                }
              },
              {
                "word": "is",
                "startTime": {
                  "seconds": "120",
                  "nanos": "700000000"
                },
                "endTime": {
                  "seconds": "120",
                  "nanos": "900000000"
                }
              },
              {
                "word": "like",
                "startTime": {
                  "seconds": "120",
                  "nanos": "900000000"
                },
                "endTime": {
                  "seconds": "121",
                  "nanos": "400000000"
                }
              },
              {
                "word": "really",
                "startTime": {
                  "seconds": "121",
                  "nanos": "500000000"
                },
                "endTime": {
                  "seconds": "121",
                  "nanos": "800000000"
                }
              },
              {
                "word": "good",
                "startTime": {
                  "seconds": "121",
                  "nanos": "800000000"
                },
                "endTime": {
                  "seconds": "122",
                  "nanos": "200000000"
                }
              },
              {
                "word": "for",
                "startTime": {
                  "seconds": "122",
                  "nanos": "200000000"
                },
                "endTime": {
                  "seconds": "122",
                  "nanos": "400000000"
                }
              },
              {
                "word": "closed",
                "startTime": {
                  "seconds": "122",
                  "nanos": "400000000"
                },
                "endTime": {
                  "seconds": "122",
                  "nanos": "800000000"
                }
              },
              {
                "word": "captioning.",
                "startTime": {
                  "seconds": "122",
                  "nanos": "800000000"
                },
                "endTime": {
                  "seconds": "123",
                  "nanos": "600000000"
                }
              }
            ],
            "transcript": "Is a lot of noise but it'll show you like you can see the words that are popping up like this is like really good for closed captioning.",
            "confidence": 0.8649077415466309
          }
        ]
      }
    },
    "punctuated": {
      "transcript": "Is a lot of noise but it'll show you like you can see the words that are popping up like this is like really good for closed captioning."
    }
  },
  "timeOffset": 117128
}

As you can imagine, the above recognition_result whose isFinal property is set to true tells the entire story. The message contains the words the speaker spoke in his or her utterance, along with the seconds and nanos at which each word starts and ends. If you ask in reverse how the recognition_result whose isFinal property is set to false looked at the first, second, third, or nth word, it looked almost the same, except it did not yet contain whatever words came after that point, prior to the time when the message terminated.
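
If you need word-level timings relative to the start of the stream, a small helper like the following (a sketch, not part of any Symbl.ai SDK) converts the seconds and nanos fields shown above into milliseconds:

// Convert a { seconds, nanos } pair from a recognition_result word into milliseconds.
// Both fields arrive as strings in the payload above, so they are parsed first.
const toMilliseconds = ({ seconds, nanos }) =>
  parseInt(seconds, 10) * 1000 + parseInt(nanos, 10) / 1e6;

// Example using the first word from the payload above:
const word = {
  word: 'Is',
  startTime: { seconds: '113', nanos: '200000000' },
  endTime: { seconds: '113', nanos: '400000000' },
};
console.log(toMilliseconds(word.startTime)); // 113200
console.log(toMilliseconds(word.endTime));   // 113400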

You can tell from the way the recognition_result whose isFinal is false continues to update in real time, word by word, until the result becomes final, that live captioning is the best use case for recognition_result. The recognition_result updates in real time just as the speech event itself develops in real time, as the sketch below illustrates. What happens next?
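
To make the captioning use case concrete, here is a minimal sketch of a recognition_result handler that could replace the corresponding branches of the ws.onmessage handler in the full example above. The caption element and its id are assumptions for illustration, not part of Symbl.ai’s API:

// Assumes an element such as <div id="caption"></div> exists in the page.
const captionEl = document.getElementById('caption');

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.type === 'message' && data.message.type === 'recognition_result') {
    // While isFinal is false, keep overwriting the caption with the latest hypothesis.
    captionEl.textContent = data.message.punctuated.transcript;
    if (data.message.isFinal) {
      // The utterance is complete; treat this transcript as the settled caption.
      console.log('Final caption:', data.message.punctuated.transcript);
    }
  }
};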

message_response

Although you may think that the only real-time use case for automated speech recognition is live captioning, there are many others. One of those use cases is transcripts. The message_response object, although it arrives after the recognition_result, is the best fit for transcripts, since message_response is not only an algorithmically layered, derivative analysis of the underlying speech events from recognition_result but also the processed end result of Symbl.ai’s algorithms. That means that in addition to transcripts, you now receive real-time Artificial Intelligence.

There is a short delay between the recognition_result whose isFinal is true and the message_response that follows it. Since the delay between the time the recognition_result ends and the time the message_response completes is almost undetectable, compare the following JSON response data for the same message:

{
  "type": "message_response",
  "messages": [
    {
      "payload": {
        "content": "Is a lot of noise but it will show you like you can see the words that are popping up like this is like really good for closed captioning.",
        "contentType": "text/plain"
      },
      "id": "95dc684b-23a8-472f-a767-d831ad838558",
      "channel": {
        "id": "realtime-api"
      },
      "metadata": {
        "disablePunctuation": true,
        "originalContent": "Is a lot of noise but it'll show you like you can see the words that are popping up like this is like really good for closed captioning.",
        "words": "[
          {\"word\":\"Is\",\"startTime\":\"2021-07-14T23:14:04.674Z\",\"endTime\":\"2021-07-14T23:14:04.874Z\"},{\"word\":\"a\",\"startTime\":\"2021-07-14T23:14:04.874Z\",\"endTime\":\"2021-07-14T23:14:04.974Z\"},{\"word\":\"lot\",\"startTime\":\"2021-07-14T23:14:04.974Z\",\"endTime\":\"2021-07-14T23:14:05.274Z\"},{\"word\":\"of\",\"startTime\":\"2021-07-14T23:14:05.274Z\",\"endTime\":\"2021-07-14T23:14:05.374Z\"},{\"word\":\"noise\",\"startTime\":\"2021-07-14T23:14:05.374Z\",\"endTime\":\"2021-07-14T23:14:05.974Z\"},{\"word\":\"but\",\"startTime\":\"2021-07-14T23:14:06.174Z\",\"endTime\":\"2021-07-14T23:14:06.674Z\"},{\"word\":\"it'll\",\"startTime\":\"2021-07-14T23:14:07.474Z\",\"endTime\":\"2021-07-14T23:14:08.174Z\"},{\"word\":\"show\",\"startTime\":\"2021-07-14T23:14:08.674Z\",\"endTime\":\"2021-07-14T23:14:09.074Z\"},{\"word\":\"you\",\"startTime\":\"2021-07-14T23:14:09.074Z\",\"endTime\":\"2021-07-14T23:14:09.474Z\"},{\"word\":\"like\",\"startTime\":\"2021-07-14T23:14:09.474Z\",\"endTime\":\"2021-07-14T23:14:09.874Z\"},{\"word\":\"you\",\"startTime\":\"2021-07-14T23:14:09.874Z\",\"endTime\":\"2021-07-14T23:14:09.974Z\"},{\"word\":\"can\",\"startTime\":\"2021-07-14T23:14:09.974Z\",\"endTime\":\"2021-07-14T23:14:10.174Z\"},{\"word\":\"see\",\"startTime\":\"2021-07-14T23:14:10.174Z\",\"endTime\":\"2021-07-14T23:14:10.274Z\"},{\"word\":\"the\",\"startTime\":\"2021-07-14T23:14:10.274Z\",\"endTime\":\"2021-07-14T23:14:10.474Z\"},{\"word\":\"words\",\"startTime\":\"2021-07-14T23:14:10.474Z\",\"endTime\":\"2021-07-14T23:14:10.774Z\"},{\"word\":\"that\",\"startTime\":\"2021-07-14T23:14:10.774Z\",\"endTime\":\"2021-07-14T23:14:10.974Z\"},{\"word\":\"are\",\"startTime\":\"2021-07-14T23:14:10.974Z\",\"endTime\":\"2021-07-14T23:14:11.074Z\"},{\"word\":\"popping\",\"startTime\":\"2021-07-14T23:14:11.074Z\",\"endTime\":\"2021-07-14T23:14:11.474Z\"},{\"word\":\"up\",\"startTime\":\"2021-07-14T23:14:11.474Z\",\"endTime\":\"2021-07-14T23:14:11.774Z\"},{\"word\":\"like\",\"startTime\":\"2021-07-14T23:14:11.774Z\",\"endTime\":\"2021-07-14T23:14:11.974Z\"},{\"word\":\"this\",\"startTime\":\"2021-07-14T23:14:11.974Z\",\"endTime\":\"2021-07-14T23:14:12.174Z\"},{\"word\":\"is\",\"startTime\":\"2021-07-14T23:14:12.174Z\",\"endTime\":\"2021-07-14T23:14:12.374Z\"},{\"word\":\"like\",\"startTime\":\"2021-07-14T23:14:12.374Z\",\"endTime\":\"2021-07-14T23:14:12.874Z\"},{\"word\":\"really\",\"startTime\":\"2021-07-14T23:14:12.974Z\",\"endTime\":\"2021-07-14T23:14:13.274Z\"},{\"word\":\"good\",\"startTime\":\"2021-07-14T23:14:13.274Z\",\"endTime\":\"2021-07-14T23:14:13.674Z\"},{\"word\":\"for\",\"startTime\":\"2021-07-14T23:14:13.674Z\",\"endTime\":\"2021-07-14T23:14:13.874Z\"},{\"word\":\"closed\",\"startTime\":\"2021-07-14T23:14:13.874Z\",\"endTime\":\"2021-07-14T23:14:14.274Z\"},{\"word\":\"captioning.\",\"startTime\":\"2021-07-14T23:14:14.274Z\",\"endTime\":\"2021-07-14T23:14:15.074Z\"}]",
        "originalMessageId": "95dc684b-23a8-472f-a767-d831ad838558"
      },
      "dismissed": false,
      "duration": {
        "startTime": "2021-07-14T23:14:04.674Z",
        "endTime": "2021-07-14T23:14:15.074Z"
      }
    }
  ],
  "sequenceNumber": 6
}

The difference is not only one of form. The message_response contains an array called words in which the data no longer carries seconds or nanos but full timestamps; in the message_response you receive timestamped data rather than mere words. There is also a difference in the result you obtain. The message_response contains an id, and the message it identifies is addressable by Symbl.ai’s real-time Artificial Intelligence algorithms through the conversationId. At this point you have the power to extend your speech analysis beyond message_response’s transcription to contextual insights such as whether the message contains a sentiment, a topic, a question, an action item, or a follow-up.
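
A transcript-oriented handler, then, might accumulate message_response payloads rather than overwrite a caption. The sketch below could replace the message_response branch of the ws.onmessage handler in the full example above; the transcript array and the way it is rendered are assumptions for illustration:

// Accumulate finalized messages into a running transcript.
const transcript = [];

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.type === 'message_response') {
    for (const message of data.messages) {
      transcript.push({
        id: message.id,
        text: message.payload.content,
        startTime: message.duration.startTime,
        endTime: message.duration.endTime,
      });
    }
    console.log('Transcript so far:', transcript.map((m) => m.text).join(' '));
  }
};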

Conclusion

In conclusion, Symbl.ai’s WebSocket connection implements a method called onmessage. The method responds to live speech events in real time with immediate updates. At first, recognition_result begins its rolling analysis of the live speech events, updating its analysis of the utterance as it unfolds. As the message begins to end, the recognition_result begins to finalize, until the message terminates. After the message terminates, the recognition_result completes, and a recognition_result whose isFinal is true produces an array of words. The best use case for recognition_result is live captioning, since the recognition_result updates immediately on a word-by-word basis.

After the recognition_result is finalized, Symbl.ai’s algorithms produce a message_response. The message_response contains an array called words with timestamped data rather than mere words. With the message_response you have the power not only to capture transcripts but also to extend your speech analysis beyond message_response’s transcription to contextual insights such as whether the message contains a sentiment, a topic, a question, an action item, or a follow-up.

What’s Next

After determining which message object’s characteristics meet your desired outcomes, you can align your choice of object with your specific use case.

As stated above, there is also real-time Artificial Intelligence; you gain access to these insights through response objects similar to message_response, namely topic_response and insight_response. With these objects you can configure your application to generate topics or create new experiences around your conversations with contextual insights within the WebSocket connection, without a call to another API, as the sketch below suggests.
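
As a rough sketch, the same kind of handler shown earlier can route these objects to whatever UI or storage you choose. The topics and insights collections are assumptions for illustration, and the use of insight.type assumes each insight carries its type (question or action_item), as enabled in the start_request above:

// Route topic and insight objects from the same WebSocket into simple collections.
const topics = [];
const insights = [];

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.type === 'topic_response') {
    for (const topic of data.topics) {
      topics.push(topic.phrases);
    }
  }
  if (data.type === 'insight_response') {
    for (const insight of data.insights) {
      // Assumes insight.type distinguishes questions from action items.
      insights.push({ type: insight.type, text: insight.payload.content });
    }
  }
};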

Join Our Community

Symbl.ai invites developers to reach out to us via email at developer@symbl.ai, join our Slack channels, participate in our hackathons, fork our Postman public workspace, or git clone our repos at Symbl.ai’s GitHub.