The main variables for accurate audio transcriptions are the sampling rate (frequency) and the quality of the audio. You can improve accuracy further if you pre-feed custom vocabulary, set up separate streams of audio, identify dialects and accents, keep your audio clean, beware of noise cancellation, avoid automatic gain control (AGC), avoid audio clipping, and position the speaker close to the microphone.

Why is video and audio transcription accuracy important?

When you provide video and audio transcriptions for your clients, you’ll want them to be as close to the spoken word as possible to ensure they’re correct, helpful, and professional.

The main variables that can affect audio transcription accuracy are the frequency and quality of your audio. The type of video or audio file that you’re transcribing and how it has been created will affect how much you can improve the resulting transcription.

The three most commonly transcribed types of audio or video

As the next sections show, the audio sampling rate can range from 8 to 48 kHz, depending on the type of stream you use to produce transcriptions. The higher the sampling rate, the more accurate your transcriptions will tend to be.
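If you want to check which end of that range a recording falls in, the following minimal sketch reads the sampling rate from a WAV file header using Python's standard `wave` module. The helper names (`wav_sample_rate`, `in_supported_range`, `highest_captured_frequency`) are illustrative, and the 8 to 48 kHz bounds are simply the range quoted above.

```python
import wave


def wav_sample_rate(path: str) -> int:
    """Return the sampling rate in Hz stored in a WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def in_supported_range(rate_hz: int, low_hz: int = 8_000, high_hz: int = 48_000) -> bool:
    """Check a rate against the 8 to 48 kHz range mentioned in the article."""
    return low_hz <= rate_hz <= high_hz


def highest_captured_frequency(rate_hz: int) -> float:
    """Nyquist limit: a sampled signal can only represent content below rate / 2."""
    return rate_hz / 2
```

For example, an 8 kHz telephone stream can only carry audio content up to 4 kHz, while a 16 kHz stream reaches 8 kHz, which is one reason higher sampling rates tend to help recognition.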

1. Recorded files

You can use recorded files, either audio or video, and create transcriptions after the event. You can process the files with the platform’s Async API and then use its Conversation API to produce the transcription.

2. Real-time WebSocket-based integration

WebSocket is a protocol for establishing two-way communication streams over the Internet. If you use a WebSocket API, you can create transcriptions of conversations in real time. WebSockets let clients and servers communicate in real time without the overhead of repeated, high-latency, bandwidth-intensive HTTP API calls. The platform supports most common audio formats with a sample rate range of 8 to 48 kHz. The recommended audio format is Opus because it provides the most flexibility in terms of audio transport. Opus also has packet-loss resilience mechanisms, like the Forward Error Correction (FEC) feature, which work especially well in low-bandwidth scenarios.
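Real-time streaming usually means cutting the audio into short, fixed-duration frames before each one is encoded and sent over the socket. The sketch below shows only that chunking step, under stated assumptions: raw 16-bit PCM input, a 20 ms frame (one of the frame durations Opus supports), and a hypothetical helper name `pcm_frames`. It does not encode Opus or call any vendor API.

```python
from typing import Iterator


def pcm_frames(
    pcm: bytes,
    sample_rate_hz: int,
    frame_ms: int = 20,
    sample_width_bytes: int = 2,
    channels: int = 1,
) -> Iterator[bytes]:
    """Yield consecutive fixed-duration frames from raw PCM audio.

    The last frame may be shorter if the audio length is not a multiple
    of the frame size.
    """
    samples_per_frame = sample_rate_hz * frame_ms // 1000
    bytes_per_frame = samples_per_frame * sample_width_bytes * channels
    if bytes_per_frame <= 0:
        raise ValueError("frame size must be at least one sample")
    for start in range(0, len(pcm), bytes_per_frame):
        yield pcm[start:start + bytes_per_frame]
```

At 16 kHz mono, a 20 ms frame holds 320 samples, or 640 bytes, so one second of audio becomes 50 frames.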

3. Real-time SIP and PSTN-based integration (also known as telephony)

Session Initiation Protocol (SIP) is the foundation of voice over internet protocol (VoIP). It enables businesses to make voice and video calls over internet-connected networks. If you use SIP, your chosen meeting assistant can connect to the stream and listen like another user. Using a SIP line provides a higher sample rate (Zoom’s sample rate is 16-48 kHz, which is very good) and consequently provides more accurate transcriptions.

The Public Switched Telephone Network (PSTN) is the traditional worldwide circuit-switched telephone network that carries your calls when you dial in from a landline phone. It includes privately-owned and government-owned infrastructure. The audio sample rate for PSTN is a maximum of 8 kHz.

8 tips and tricks to improve transcription accuracy

Often you won’t have any control over the types of audio or video stream your client provides you with to create transcription capabilities. The type of stream will be a major determinant of the accuracy of the transcription. However, there are other ways that you can optimize your speech recognition results.

1. Boost accuracy with custom vocabularies

If your subject matter is specialized or technical, you can help out the machine by pre-feeding it with some custom vocabulary that it might expect to hear. An example of custom vocabulary in a medical context would be specific medical terminology or abbreviations.

2. Set up different streams of audio

If there are a lot of people on the same channel, you have two options: write your code so that it creates a separate audio/video stream for each speaker, or leave everything as one stream.

If you create different audio streams for each speaker, you’ll provide better speech recognition accuracy and handling of streams. There is a cost factor to consider in this scenario if there are lots of people on the recording because it means you’ll need a better infrastructure and more resources to manage all the different streams.

Let’s say you have three speakers and a large audience in an interactive webinar. You could attribute a single stream to each of the speakers so they can be individually identified and then create one additional stream for the whole audience. In this scenario, all questions would be displayed in a transcription as “audience”.
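The webinar scenario can be sketched as a merge of per-stream transcripts into one labeled, time-ordered transcript. This is an illustrative data model only: the `Segment` class, the `merge_streams` function, and the input shape (a label mapped to a list of start time and text pairs) are assumptions, not the output format of any particular transcription API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start_s: float  # start time in seconds from the beginning of the event
    label: str      # stream label, e.g. a speaker name or "audience"
    text: str


def merge_streams(streams: dict[str, list[tuple[float, str]]]) -> list[Segment]:
    """Combine per-stream (start_s, text) pairs into one time-ordered transcript."""
    segments = [
        Segment(start_s, label, text)
        for label, items in streams.items()
        for start_s, text in items
    ]
    return sorted(segments, key=lambda segment: segment.start_s)


def format_transcript(segments: list[Segment]) -> list[str]:
    """Render each segment as 'label: text'."""
    return [f"{segment.label}: {segment.text}" for segment in segments]
```

Because every segment carries the label of the stream it came from, attribution comes for free: anything on the shared audience stream is labeled "audience" without any speaker identification step.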

3. Identify dialects and accents

If you know where the speakers on the audio stream are from, then you can identify the accent or dialect that they are using (e.g. American vs. Scottish accents). By pre-teaching your model to identify and contextually understand different relevant accents and dialects, you can avoid simple errors and improve the accuracy of your transcription.

4. Keep your audio clean

It’s best to provide audio for transcription that is as clean as possible. Excessive background noise and echoes can reduce transcription accuracy. The balance between speech and noise is measured by the signal-to-noise ratio (SNR): the level of the wanted signal (here, recognizable speech) relative to the level of unwanted noise in the audio stream, usually expressed in decibels.

A low SNR can negatively affect your system’s performance by limiting operating range or affecting receiver sensitivity. When you understand how to calculate and manage it, you’ll be able to create a robust, accurate system for real-life situations.
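SNR is commonly expressed in decibels as 10 · log10(signal power / noise power). The following minimal sketch estimates it from two sample sequences. It assumes you can isolate a stretch that is mostly speech and a stretch that is only background noise (for example, a pause between sentences); the function names are illustrative.

```python
import math
from typing import Sequence


def mean_power(samples: Sequence[float]) -> float:
    """Average of the squared sample values."""
    if not samples:
        raise ValueError("samples must not be empty")
    return sum(sample * sample for sample in samples) / len(samples)


def snr_db(speech: Sequence[float], noise: Sequence[float]) -> float:
    """Signal-to-noise ratio in decibels: 10 * log10(P_speech / P_noise)."""
    speech_power = mean_power(speech)
    noise_power = mean_power(noise)
    if noise_power == 0:
        return math.inf   # no measurable noise at all
    if speech_power == 0:
        return -math.inf  # no measurable speech at all
    return 10 * math.log10(speech_power / noise_power)
```

A speech segment with ten times the amplitude of the noise floor has one hundred times its power, which works out to roughly 20 dB.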

5. Beware of noise cancellation

If you are considering noise-canceling techniques, you should be aware that they may result in information loss and reduced accuracy. If you’re unsure whether the techniques you are considering will do this, it’s best to avoid noise cancellation.

6. Don’t use automatic gain control

A disadvantage of automatic gain control (AGC) is that when you record something with both quiet and loud periods, the AGC tends to make the quiet passages louder and the loud passages quieter, compressing the dynamic range. The result can be reduced audio quality if the signal is not re-expanded on playback.
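To see the compression effect in numbers, here is a deliberately naive AGC model that scales each block of samples to the same target level. Real AGC implementations adjust gain gradually with attack and release times; this toy version, with the hypothetical names `rms` and `naive_agc`, only illustrates how the difference between quiet and loud passages disappears.

```python
import math
from typing import Sequence


def rms(block: Sequence[float]) -> float:
    """Root-mean-square level of a block of samples."""
    return math.sqrt(sum(sample * sample for sample in block) / len(block))


def naive_agc(samples: Sequence[float], block_size: int, target_rms: float = 0.25) -> list[float]:
    """Scale each block so that its RMS level equals target_rms."""
    output: list[float] = []
    for start in range(0, len(samples), block_size):
        block = samples[start:start + block_size]
        level = rms(block)
        gain = target_rms / level if level > 0 else 1.0  # leave silence untouched
        output.extend(sample * gain for sample in block)
    return output
```

If a quiet passage at an RMS level of 0.05 is followed by a loud one at 0.5, the input has a 10:1 level difference. After this processing both passages sit at 0.25, so the original contrast is gone.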

7. Position the speaker close to the microphone whenever possible

The proximity of the speaker to the microphone will affect the audio quality. You can disrupt your audio if your microphone is too close to your mouth or too far away. It’s usually best to position your face about two inches away from the microphone. Hold your hand in front of your face with your fingers pointed up and spread naturally – that’s about the right distance. Any closer and the microphone will pick up your mouth sounds. Any farther and the microphone will pick up room sounds. Try to maintain a constant distance from the microphone throughout the recording.

8. Avoid audio clipping

Audio clipping is a form of waveform distortion. When you push an amplifier beyond its maximum limit, it goes into overdrive. The overdriven signal causes the amplifier to attempt to produce an output voltage beyond its capability, which is when clipping occurs. If your audio is clipping, you are overloading your audio interface or recording device. In doing so, you have run out of headroom in your recording equipment. There are ways to avoid audio clipping, like using any attenuator technology built into your camera or recorder, and/or creating a safety channel.
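One simple way to spot clipping in a recording is to count how many samples sit at the extreme values the sample format can hold. The sketch below assumes signed integer PCM samples (16-bit by default). The function name and the `tolerance` parameter are illustrative, and what fraction counts as a problem is a judgment call for your own material.

```python
from typing import Sequence


def clipped_fraction(
    samples: Sequence[int],
    sample_width_bytes: int = 2,
    tolerance: int = 0,
) -> float:
    """Fraction of samples at (or within `tolerance` of) digital full scale."""
    if not samples:
        return 0.0
    bits = 8 * sample_width_bytes
    full_scale_max = 2 ** (bits - 1) - 1   # 32767 for 16-bit audio
    full_scale_min = -(2 ** (bits - 1))    # -32768 for 16-bit audio
    clipped = sum(
        1
        for sample in samples
        if sample >= full_scale_max - tolerance or sample <= full_scale_min + tolerance
    )
    return clipped / len(samples)
```

A healthy recording should return a value at or very near zero. Long runs of full-scale samples suggest the input gain was too high and the waveform peaks were flattened.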

How do you know how good your transcription is?

To analyze the quality of the transcription, you can measure the word error rate (WER) and/or the sentence error rate (SER). To have a basis for comparison, a human also needs to transcribe the same audio or video stream.
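WER is conventionally computed as (substitutions + deletions + insertions) divided by the number of words in the human reference transcript, using a word-level edit distance. The following minimal sketch implements that definition. It assumes a very simple normalization (lowercasing and splitting on whitespace); production evaluations usually also normalize punctuation and numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")

    # previous_row[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current_row.append(
                min(
                    previous_row[j] + 1,                        # deletion
                    current_row[j - 1] + 1,                     # insertion
                    previous_row[j - 1] + substitution_cost,    # match or substitution
                )
            )
        previous_row = current_row
    return previous_row[-1] / len(ref_words)
```

For example, if the human reference is "the cat sat on the mat" and the machine output is "the cat sat on mat", one of the six reference words was dropped, giving a WER of about 0.17. Note that WER can exceed 1.0 when the hypothesis contains many inserted words.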

There is no set norm for what counts as a good accuracy percentage. Based on what is achievable with today’s technology, 80% or above can be considered a very good level of accuracy. The platform provides real-time and asynchronous transcription capabilities that can help you achieve better accuracy in your transcriptions. It is a conversation intelligence platform with a secure and scalable infrastructure that provides programmable APIs and SDKs for developers and doesn’t require building or training a machine learning model.

If you use the platform for your audio transcriptions, you’ll usually be able to achieve up to 90% audio transcription accuracy. Of course, your accuracy always depends on the audio quality you have to work with, but by applying the tips and tricks in this article you can better optimize your results.
