Speech recognition accuracy refers to how well AI can identify and respond to human speech. You can improve it for both real-time and asynchronous speech recognition by adjusting a few critical variables, such as codecs, channels, sampling rate, and audio quality.
Source: Speech Recognition Is Not Solved
One of the big challenges is the sheer number of variations that exist within human speech. Within any given language, you have to contend with regional differences like pronunciation and accents that can affect how accurately a voice is picked up. The age of the speaker can also affect the results, with virtual assistants like Alexa struggling to keep up with young children or capture the muddled speech of older adults.
What happens as a result of these errors can be a mix of humorous and frustrating. On one end, Siri can activate and start transcribing a tuba. On the other, a virtual assistant can’t understand what you’re trying to say. Transcriptions that capture pure nonsense instead of what was actually said are particularly frustrating because you have to go back and listen to the original audio to find out what was said.
A lot of the problem stems from the fact that you need to train your speech recognition algorithms with hundreds of hours of audio. If the audio you used to train the AI comes from a quiet, professionally recorded source, it’s going to struggle with anything that doesn’t reflect that source, like background noise at a party.
But it’s not just accents and how the system was trained that affect the quality of speech recognition. There are also a few technical factors that can make or break the accuracy of your speech recognition.
What factors can impact the accuracy of speech recognition?
From a technical perspective, audio is a tricky thing. It’s not just a matter of grabbing an audio file and using it. Everything from how the audio was captured to the format it’s stored in can have a huge impact on the usability.
Codecs are devices or programs that are used to compress audio files down to a more usable size. Each format (MP3, WAV, FLAC, OGG) uses a different amount of compression when processing files.
There are two types of compression. The first is lossless compression, where a file is reduced in size without losing any audio quality. FLAC is a common example, while WAV files typically go a step further and store the audio uncompressed. Either way, these files are larger than lossy ones because no audio data is thrown away and, as a result, the quality is higher.
The second is lossy compression, where audio quality is sacrificed for the sake of a smaller file. MP3s are a great example of this. Most audio we deal with uses the MP3 file format because the files are small. They’re great for everyday use, but they lack quality.
You might be tempted to use light and easy file formats like MP3 because they’re easier to work with and faster to process, but they can also sabotage the quality of your results. Ideally, choose a high-quality codec, such as G.711, to boost the accuracy of your automatic speech recognition (ASR).
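To check what you are actually feeding your ASR, it can help to inspect the file before you send it. The sketch below uses Python's standard `wave` module to write uncompressed 16-bit mono PCM and read its properties back. The helper names `write_pcm_wav` and `describe_wav` are made up for this illustration, and it only covers WAV files, not MP3, FLAC, or G.711.

```python
import struct
import wave


def write_pcm_wav(target, samples, sample_rate=16000):
    """Write 16-bit mono PCM samples (ints from -32768 to 32767) as a WAV file.

    `target` can be a file path or a writable, seekable binary file object.
    """
    with wave.open(target, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)  # samples per second
        wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))


def describe_wav(source):
    """Return the basic properties of a WAV file as a dictionary."""
    with wave.open(source, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": wav.getframerate(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }
```

If `describe_wav` reports a lower sample rate or a different channel count than you expected, the audio was probably converted somewhere upstream, and that is worth tracking down before blaming the recognizer.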
Audio quality is a major part of ASR. Codecs provide a high-quality audio format to work with. But if there is a lack of quality at the source, you end up with issues like a high word error rate.
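Word error rate (WER) is usually computed as the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference text, divided by the number of words in the reference. Here is a minimal sketch of that calculation. The function name `word_error_rate` is our own, and the simple lowercase-and-split tokenization is an assumption; real evaluations usually normalize punctuation and numbers as well.

```python
def word_error_rate(reference, hypothesis):
    """Compute WER as word-level edit distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word is missing
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of five gives a WER of 0.2.
print(word_error_rate("turn on the kitchen lights", "turn on the chicken lights"))
```

Measuring WER on a small set of your own recordings before and after a change (a new codec, a different microphone) is a simple way to see whether the change actually helped.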
While human brains can focus their attention on a single voice in a crowded room, AI can’t. Things like background noise, too many speakers talking at once, and echoes can mess up AI’s ability to focus on a conversation.
You can improve your AI’s chances by filtering out some background noises, like TVs, radios, or traffic, with an open-source tool like Audacity before the audio reaches the recognition engine, although doing that can reduce the audio’s quality.
If you’re just starting to add communication to your app, or building one from scratch, adopting a Video API that comes with noise reduction can be a differentiating foundation not just for the experience today, but also for any AI capabilities you add later on.
The best way to describe audio channels is using home audio equipment, like stereos or surround sound. When something is recorded in stereo, it’s captured in a way that lets you hear it from both a left speaker and a right speaker. Surround sound uses six channels. Mono uses a single channel.
When recording, channels are separated not by the sound they produce, but by the sound they capture. For example, if two people are speaking on the phone, each phone line should be recorded as a separate channel.
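If a two-person call was recorded as a single stereo file with one speaker per channel, you can pull the channels apart before transcription. The following is a rough sketch using only Python's standard library. It assumes 16-bit stereo PCM in a WAV container, and `split_stereo` is a name invented for this example.

```python
import array
import sys
import wave


def split_stereo(source):
    """Return (left, right) arrays of 16-bit samples from a stereo WAV file.

    `source` can be a file path or a readable binary file object.
    """
    with wave.open(source, "rb") as wav:
        if wav.getnchannels() != 2 or wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit stereo audio")
        frames = wav.readframes(wav.getnframes())

    samples = array.array("h", frames)  # "h" = signed 16-bit integers
    if sys.byteorder == "big":
        samples.byteswap()              # WAV data is stored little-endian

    # Stereo PCM is interleaved: L0, R0, L1, R1, ...
    return samples[0::2], samples[1::2]
```

Each returned array can then be written out or streamed as its own mono channel, so every speaker is transcribed separately.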
If you can’t capture separate channels, then you need tighter control over the recording itself. This means limiting the number of people talking at once (ideally to one) and recording in a quiet or soundproof space.
You can use something like the real-time WebSocket API with speaker-separated audio. However, price is an important factor to consider here. If you use different channels, the cost is multiplied by the number of streams you use. So, while this is the best way to integrate audio, businesses might want to use a single stream with either automated speaker diarization or external speaker events to save on costs.
You can learn more about speaker diarization here.
General best practices to consider
If you’re looking to take the quality of your ASR efforts up to the next level, here are a few more tips:
- Sampling rate: The number of samples that are taken per second from a signal. You want to capture audio with a rate of 16,000 Hz or higher.
- Buffer size: Also known as the audio chunk. Try to use a single audio buffer of around 100 milliseconds for balanced latency (the delay between the creation of the sound and it being recorded).
- Background noise: Do everything you can to reduce background noise without sacrificing quality. Use audio that has been recorded in a controlled environment with speakers as close to the mic as possible. Avoid audio clipping since it causes the audio to break up, and be wary of noise reduction tools, as they can reduce audio quality by turning background noise into a hiss or buzz.
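To make the numbers in these tips concrete, here is a small sketch that works out how many bytes a 100-millisecond buffer holds and estimates how much of a recording is clipped. Both helper names are hypothetical, and the defaults assume 16-bit mono PCM. Treating any sample at full scale as clipped is a simplification.

```python
def chunk_size_bytes(sample_rate_hz, chunk_ms=100, channels=1, sample_width_bytes=2):
    """Number of bytes in one audio buffer of `chunk_ms` milliseconds."""
    frames = sample_rate_hz * chunk_ms // 1000
    return frames * channels * sample_width_bytes


def clipped_fraction(samples, full_scale=32767):
    """Fraction of 16-bit samples sitting at (or beyond) full scale."""
    if not samples:
        return 0.0
    clipped = sum(1 for s in samples if abs(s) >= full_scale)
    return clipped / len(samples)


# 16,000 Hz mono 16-bit audio: 1,600 samples per 100 ms, 2 bytes each.
print(chunk_size_bytes(16000))  # 3200
```

If `clipped_fraction` comes back noticeably above zero, the input gain is probably too high and the recording is worth redoing at a lower level rather than repairing afterwards.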
If you want to learn more about improving the accuracy of speech recognition or dive deeper into anything we covered in this post, here’s where you can learn more.