With speech being such a natural and fundamental form of communication, speech recognition is among the most exciting and important applications of AI.

A large reason for this is that speech-activated applications feel natural and intuitive to users, offering a gentle learning curve and a level of comfort that facilitates fast adoption. Consequently, voice-activated technology provided some of the first examples of widely used AI applications: the general public has been familiar with automated customer service systems for decades. More recently, digital assistants such as Apple's Siri and Google's Assistant have had a large impact and become a daily fixture for tens of millions of people globally.

With this in mind, this guide explores the process of building your own end-to-end speech recognition model. We look at how these models work, their common applications, their various architectures, and how to train and evaluate them.

What is an End-to-End Speech Recognition Model?

An end-to-end speech recognition model is a deep learning model that takes an aural speech signal as its input and outputs a textual transcript.

However, in contrast to prior "traditional" speech recognition systems, which are composed of multiple components that process audio and text separately, i.e., feature extraction, acoustic modeling, and language modeling, end-to-end speech models map the audio input directly to its corresponding text output. This eliminates the need for explicit intermediate representations, reducing the complexity of developing the model while improving its performance and accuracy.

How do End-to-End Speech Recognition Models work?

Here is an overview of the end-to-end speech recognition process, broken down into several key stages.

  1. Acoustic Signal Processing: an acoustic signal input, i.e. the analogue waveform of the audio, is captured by a microphone and converted to digital data.
  2. Feature Extraction: relevant features are extracted from the digital data, including its pitch, intensity, and spectral characteristics, i.e., the audio signal's unique frequency properties and patterns. Spectral characteristics can be represented by a Mel spectrogram, a representation of the signal's frequency content over time on the perceptually scaled Mel scale, or by Mel-Frequency Cepstral Coefficients (MFCCs), which compactly summarize the signal's short-term spectral envelope and capture the audio's higher-level characteristics (a short feature-extraction sketch follows this list).
  3. Acoustic Model Encoding: a statistical model, typically a neural network, maps the extracted features to phonetic or sub-word representations, capturing the relationship between the acoustics and the text.
  4. Language Model Encoding: a statistical model predicts the probability of word, or sub-word token, sequences occurring in the language. The language model complements the acoustic model by incorporating linguistic, grammatical, and syntactic context to help distinguish between words that sound similar but mean different things.
  5. Decoding: the acoustic and language models work in tandem to generate output that represents the most probable text transcription of the audio signal. A decoding algorithm, such as beam search or greedy decoding, traverses possible word sequences and identifies those that are most likely to represent the audio input. 
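
To make the feature-extraction step concrete, here is a minimal sketch using the librosa library. The file path and parameter values (a 16 kHz sample rate, 80 Mel bands, 13 MFCCs) are illustrative assumptions rather than requirements of any particular model.

```python
# Minimal feature-extraction sketch (assumes: pip install librosa).
import librosa

# Load the audio as a mono signal resampled to 16 kHz.
waveform, sample_rate = librosa.load("speech_sample.wav", sr=16000, mono=True)

# Mel spectrogram: the signal's frequency content over time on the Mel scale.
mel_spec = librosa.feature.melspectrogram(y=waveform, sr=sample_rate, n_mels=80)
log_mel = librosa.power_to_db(mel_spec)

# MFCCs: a compact summary of the short-term spectral envelope.
mfccs = librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=13)

print(log_mel.shape, mfccs.shape)  # (n_mels, frames), (n_mfcc, frames)
```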

Determining the Use Case for Your Speech Recognition System

Before you begin building your end-to-end speech recognition model, it is essential to clearly define its use case.

This is a fundamental part of the process because it will factor into your choice of model architecture and, more importantly, the model size, i.e., the number of parameters, to aim for. Generally, the more capable you want your speech model to be, e.g., the larger its vocabulary, the greater its adaptability, or the more languages it supports, the more trained parameters it requires.

These aspects then trickle down to the training process because they determine the amount of data you will need; the more capable your model, the larger the required training dataset. In turn, the more training data you have, the longer it will take to train the model and the more computational resources, i.e., memory, storage, electricity, etc., are required.

Applications of End-to-End Speech Recognition Models

Popular uses for speech recognition systems include: 

  • Digital Personal Assistants: for many people, virtual assistants like Siri and Alexa are likely the first thing that comes to mind when it comes to voice-activated technology; they use speech recognition models to process and respond to spoken queries and commands.
  • Home Automation: voice-activated digital assistants act as the central hub in “smart” homes that use speech recognition in home automation tasks. This includes controlling security devices, lighting, air conditioning and other appliances.
  • Customer Service: speech models are used in automated customer service lines to handle simple queries and tasks, or to streamline the process of directing the customer to the appropriate human agent.
  • Transcription Services: speech models can turn verbal expressions into written text, making them ideal for tasks such as transcribing interviews or meeting minutes, as well as for creating content.
  • Translation Tasks: if a speech model is multilingual, it can receive spoken input in one language and transcribe it into the written text of another language, or even multiple languages.
  • Accessibility Features: speech recognition models can enhance the accessibility of a vast range of digital solutions, making them far more usable for people with physical or cognitive impairments.

Defining the End-to-End Speech Recognition Model’s Architecture

After identifying the use case for your end-to-end speech recognition model, the next stage is choosing a neural network architecture to match your intended use case.

Fortunately, instead of having to build the required neural networks from scratch, there are engines, frameworks, and toolkits that streamline the process of building end-to-end speech recognition models. Let us look at two of the most widely used frameworks – DeepSpeech and ESPnet.  

DeepSpeech is a prominent speech recognition engine that uses the TensorFlow deep learning framework as its foundation. Developed by Mozilla, it is based on the Deep Speech model proposed in an influential paper from Chinese technology giant Baidu, which is credited with popularizing the idea of end-to-end speech recognition models. It provides bindings for several programming languages, including Python, JavaScript, and C, and is lightweight enough to enable deployment on resource-constrained devices.

DeepSpeech employs a combination of convolutional neural networks (CNNs), to identify patterns and extract high-level features from the audio, and recurrent neural networks (RNNs), to capture the audio's temporal dependencies and predict the textual output. It offers both pre-trained models and the components and infrastructure to develop your own end-to-end speech recognition systems, including comprehensive libraries for data preprocessing, training, and evaluation.
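
To give a feel for the engine, the sketch below runs inference with a pre-trained DeepSpeech model through its Python bindings. The model and scorer filenames follow the 0.9.x release naming and, along with the WAV path, are assumptions for illustration; DeepSpeech expects 16-bit, 16 kHz, mono PCM audio.

```python
# Minimal DeepSpeech inference sketch (assumes: pip install deepspeech).
import wave

import numpy as np
import deepspeech

model = deepspeech.Model("deepspeech-0.9.3-models.pbmm")
model.enableExternalScorer("deepspeech-0.9.3-models.scorer")  # optional language-model scorer

# Read a 16-bit, 16 kHz, mono WAV file into a NumPy buffer.
with wave.open("speech_sample.wav", "rb") as wav:
    audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

print(model.stt(audio))  # most probable transcription
```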

ESPnet is an open-source framework developed by researchers at Johns Hopkins University for building end-to-end speech processing models. Built on top of the PyTorch and Chainer neural network libraries, its modular and extensible design enables the development of models for automatic speech recognition (ASR), text-to-speech (TTS), and speech translation tasks. Much like DeepSpeech, you can start from a selection of pre-trained models or customize one of several RNN or transformer-based architectures.
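
A comparable ESPnet2 inference sketch might look like the following; the pre-trained model tag is a placeholder, and the exact API surface can differ between ESPnet versions, so treat this as a rough outline.

```python
# Rough ESPnet2 inference sketch (assumes: pip install espnet espnet_model_zoo soundfile).
import soundfile
from espnet_model_zoo.downloader import ModelDownloader
from espnet2.bin.asr_inference import Speech2Text

# Download and unpack a pre-trained ASR model (the tag is a placeholder).
downloader = ModelDownloader()
speech2text = Speech2Text(**downloader.download_and_unpack("<pretrained-asr-model-tag>"))

# Load mono audio at the model's expected sample rate and decode it.
speech, rate = soundfile.read("speech_sample.wav")
text, *_ = speech2text(speech)[0]  # best hypothesis from the n-best list
print(text)
```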

Building a Data Pipeline

After selecting the architecture for your end-to-end speech recognition model, it’s time to build a data pipeline for training your model. This is a crucial part of the process because the quality of your training data determines the quality, i.e., the capabilities and accuracy, of your speech recognition model.  To underscore the importance of this step, it is not uncommon to see data curation listed before selecting a neural network architecture in a list of steps for creating a deep learning model. 

Building a data pipeline involves four steps:

  • Collecting data
  • Preprocessing the training data
  • Data augmentation
  • Dividing the data into training and evaluation sets 

Let us consider each step in greater detail.   

Collecting Data

The first stage of creating your data pipeline is compiling a training dataset. Speech recognition datasets contain two types of data: spoken audio data that serves as input and text transcripts that represent the target output labels. 

Training an end-to-end speech recognition model requires a lot of data, ranging from tens to thousands of hours of audio. Generally, the more sophisticated you need your model to be, the more data you will need to amass. Additionally, if you're training your model for a domain-specific purpose, for use in the engineering field, for instance, you'll also need specific, typically smaller datasets with the appropriate content for fine-tuning the model after its initial pre-training.

Fortunately, there are existing datasets compiled by the AI research community that are publicly available; here are some of the most commonly used datasets for training speech recognition systems: 

  • LibriSpeech: around 1,000 hours of English-language recordings from audiobooks read by a diverse range of speakers. The Multilingual LibriSpeech (MLS) dataset is also available, which contains roughly 44,500 hours of English audio and about 6,000 hours of audio in French, Spanish, Dutch, and several other languages.
  • Mozilla Common Voice: a popular open-source dataset that features over 20,000 hours of recordings of sentences read in dozens of languages, offering a diverse selection of speakers, accents, and recording conditions.
  • Switchboard-1: composed of approximately 2,400 telephone conversations (roughly 260 hours of speech) between pairs of speakers on various topics.

Preprocessing the Training Data

Next, we need to preprocess our compiled dataset to make it easier for the model to ingest and to make training more efficient. This involves addressing variations between data instances to ensure they share the consistent attributes and dimensions the model expects. A minimal preprocessing sketch follows the list below.

Typical preprocessing tasks include:

  • Filtering the audio signal to reduce background noise
  • Establishing uniform attributes for the data, including:
    • Duration: truncating longer sequences, e.g.,  based on silence or pauses, or padding shorter ones, e.g., with zero-value samples
    • Sampling rate
    • Bit depth 
    • Channels
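
Here is a minimal preprocessing sketch using librosa and NumPy; the target sample rate, clip duration, and silence threshold are illustrative assumptions.

```python
# Minimal preprocessing sketch (assumes: pip install librosa numpy).
import librosa
import numpy as np

TARGET_SR = 16000                    # uniform sampling rate
TARGET_LEN = int(TARGET_SR * 10.0)   # uniform 10-second clip length

def preprocess(path: str) -> np.ndarray:
    # Load as mono and resample to the target rate.
    waveform, _ = librosa.load(path, sr=TARGET_SR, mono=True)
    # Trim leading/trailing silence to drop uninformative frames.
    waveform, _ = librosa.effects.trim(waveform, top_db=30)
    # Pad short clips with zero-value samples, truncate long ones.
    if len(waveform) < TARGET_LEN:
        waveform = np.pad(waveform, (0, TARGET_LEN - len(waveform)))
    else:
        waveform = waveform[:TARGET_LEN]
    return waveform
```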

Data Augmentation

Data augmentation is a technique used to artificially increase the size and diversity of your dataset. This strategy is especially helpful when data is scarce or when your model is overfitting.

Common data augmentation techniques for speech recognition include the following (a short augmentation sketch appears after the list):

  • Changing the pitch
  • Altering the speed
  • Adding background noise
  • Adding reverb
  • Lengthening or compressing the audio signal 
  • Resampling, i.e., changing the sampling rate
  • Time shifting, i.e., moving the audio signal by a small percentage to introduce variations
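
The sketch below applies a few of these augmentations with librosa and NumPy; the pitch-shift amount, stretch rate, noise level, and shift fraction are arbitrary illustrative values.

```python
# Minimal augmentation sketch (assumes: pip install librosa numpy).
import librosa
import numpy as np

def augment(waveform: np.ndarray, sample_rate: int) -> list:
    augmented = []
    # Shift the pitch up by two semitones.
    augmented.append(librosa.effects.pitch_shift(y=waveform, sr=sample_rate, n_steps=2))
    # Speed the audio up by 10% (changes duration, not pitch).
    augmented.append(librosa.effects.time_stretch(y=waveform, rate=1.1))
    # Add low-level Gaussian background noise.
    augmented.append(waveform + 0.005 * np.random.randn(len(waveform)))
    # Time shift: roll the signal forward by 1% of its length.
    augmented.append(np.roll(waveform, int(0.01 * len(waveform))))
    return augmented
```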

Dividing the Data into Training And Evaluation Sets 

Using the same data to both train and evaluate your speech model can result in overfitting: the model performs well on instances it has effectively memorized instead of learning to generalize to new data. For this reason, you should set aside a portion of your dataset for the later evaluation stage.
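
A minimal way to hold out an evaluation split from a manifest of (audio, transcript) pairs might look like the sketch below; the 90/10 ratio and the manifest structure are assumptions for illustration.

```python
# Minimal train/evaluation split sketch.
import random

def split_dataset(pairs, eval_fraction=0.1, seed=42):
    """pairs: list of (audio_path, transcript) tuples."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - eval_fraction))
    return shuffled[:cut], shuffled[cut:]  # (training set, held-out evaluation set)
```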

Training Your Speech Recognition Model

Training an end-to-end speech recognition model involves passing training data through its neural network to iteratively adjust its parameters, i.e., the weights and biases within the network. The objective of this process is for the model to learn the characteristics of the audio input data well enough to predict its corresponding text output labels. This process is composed of two stages: forward propagation (also called the forward pass) and backward propagation.

During forward propagation, the audio input enters the speech recognition model, which extracts relevant features and patterns from the audio signal. As the input progresses through the neural network's hidden layers, the model captures increasingly higher-level representations, gaining a richer and deeper understanding of the speech signal from which to transcribe it accurately. The forward pass continues until the network's final layer outputs a text transcription based on the input audio data and the model's parameters.

Backward propagation is where the model's parameters are updated based on the error between its predictions and the correct output labels. The gradients of the model's parameters, i.e., the direction and magnitude of the adjustments required to improve the speech model's accuracy, are computed and propagated backwards through the network. These adjustments are made to minimize the loss function, a function that measures the difference between the predicted text output and the actual "ground truth" transcription from the training data.

A commonly used loss function is connectionist temporal classification (CTC), which aligns the audio input with the textual output sequence. A key strength of CTC is that it performs this alignment automatically, without you needing to explicitly or manually stipulate the alignment as part of the labeled training data. Backward propagation continues iteratively until the speech model's parameters converge to a point where the loss function is minimized and, consequently, the model is optimized for accurate speech recognition.
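
To illustrate how CTC fits into a training step, here is a minimal PyTorch sketch using the built-in CTCLoss; the tensor shapes, vocabulary size, and random tensors are placeholders standing in for a real acoustic model's outputs and real transcripts.

```python
# Minimal CTC training-step sketch (assumes: pip install torch).
import torch
import torch.nn as nn

batch_size, time_steps, num_classes = 4, 200, 29   # e.g., 28 characters + 1 CTC blank
target_length = 30

# Stand-in for the acoustic model's per-frame outputs, shaped (T, N, C).
logits = torch.randn(time_steps, batch_size, num_classes, requires_grad=True)
log_probs = logits.log_softmax(dim=2)

# Stand-in for ground-truth transcripts as token IDs (index 0 is reserved for blank).
targets = torch.randint(1, num_classes, (batch_size, target_length), dtype=torch.long)
input_lengths = torch.full((batch_size,), time_steps, dtype=torch.long)
target_lengths = torch.full((batch_size,), target_length, dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # backward propagation: gradients flow back through the network
print(loss.item())
```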

Fine-Tuning Your Model

After training your model with general-purpose audio data, it will (pending successful evaluation) be capable of general speech recognition tasks.  Making your end-to-end speech recognition model effective at your specific use case, however, may require you to fine-tune it – with further training on domain or task-specific data. Much like the initial training process, known as pre-training, fine-tuning is an iterative process in which the speech model’s parameters are updated to minimize the loss function according to the newly introduced fine-tuning dataset. 

There are two types of fine-tuning: full fine-tuning and transfer learning:

  • Full Fine-Tuning: the most comprehensive fine-tuning method whereby all of the model’s parameters are updated. Though likely to yield the best results, it requires more data, memory, and time.
  • Transfer Learning: this method leverages the existing capabilities the speech model developed during pre-training and transfers them to the desired use case. Transfer learning requires many or all of the base model's neural network layers to be "frozen" to limit which parameters can be altered. The unfrozen layers are then fine-tuned on new data, which requires smaller datasets, less time, and fewer computational resources than full fine-tuning. Parameter-efficient fine-tuning (PEFT) is a commonly used family of transfer learning methods. A minimal layer-freezing sketch follows below.
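
As a rough sketch of transfer learning by layer freezing in PyTorch, assume a generic encoder-decoder speech model whose encoder we want to keep fixed; the `encoder` attribute name and the learning rate are hypothetical.

```python
# Minimal layer-freezing sketch for transfer learning (assumes: pip install torch).
import torch

def freeze_encoder(model: torch.nn.Module) -> None:
    # Freeze the pre-trained encoder so its parameters are not updated.
    for param in model.encoder.parameters():
        param.requires_grad = False

def build_finetune_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    # Pass only the unfrozen (trainable) parameters to the optimizer.
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=1e-4)
```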

Evaluating Your Speech Recognition Model 

After training and fine-tuning your speech model, it is time to evaluate whether it can carry out its intended use case, and to what level. For evaluation, you'll need data different from that used to train the model, to avoid overfitting. This is often called a holdout dataset because it was "held out" from the training data to test the model later in the process.

One of the most effective ways to measure the performance of your end-to-end speech recognition model is with evaluation metrics. Commonly used evaluation metrics for speech models include:  

  • Word Error Rate (WER): compares the model's output transcription with the ground truth transcription on a per-word basis. It is calculated as:
    WER = ((S + D + I) / N) x 100, where:
    • S = number of substitutions
    • D = number of deletions
    • I = number of insertions
    • N = number of words in the ground truth transcription

The resulting figure is the WER expressed as a percentage; the lower the WER, the better the model's performance. (A minimal WER computation sketch follows the metric list below.)

  • Token Error Rate (TER): similar to WER but measures errors at a sub-word level, allowing for a more precise measure of accuracy.  
  • Character Error Rate (CER): measures errors at a character level, offering higher precision than WER and TER. 
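
For illustration, here is a small, self-contained WER computation based on word-level edit distance; in practice, libraries such as jiwer provide the same metric.

```python
# Minimal WER computation via word-level edit distance.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dp[i - 1][j] + 1
            insertion = dp[i][j - 1] + 1
            dp[i][j] = min(substitution, deletion, insertion)
    return 100.0 * dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # one deletion -> ~16.7
```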

Conclusion

To briefly recap, the process of building an end-to-end speech recognition system is as follows:

  • Determining the model’s use case 
  • Defining the speech model’s architecture
  • Building a data pipeline
  • Training the speech recognition model
  • Evaluating the model 

While speech recognition models have seen considerable advancements in recent years, the inherent challenges of processing audio, including persistent background interference, different accents, and distinct vocal tonality and cadence, have seen them progress at a slower rate than other areas of deep learning, such as language models. Consequently, the use of end-to-end speech models is not as democratized as other AI applications, as only companies with vast resources can afford to build models performant enough for consistent real-world use.

Fortunately, because speech recognition is a key component of human-to-machine communication, and gives rise to so many use cases, it’s an area that AI vendors and researchers are sure to throw their resources behind.

Kartik Talamadupula
Director of AI Research

Kartik Talamadupula is a research scientist who has spent over a decade applying AI techniques to business problems in automation, human-AI collaboration, and NLP.