A Guide to Transformer Architecture

Transformers were introduced in the now-seminal paper Attention Is All You Need (2017) by Vaswani et al., a research team from Google and the University of Toronto. Though initially developed for machine translation, transformers have since provided the foundation for a variety of large language models (LLMs) and other machine learning models, and have been applied to a wide range of real-world use cases.

In this guide, we take an in-depth look at the transformer architecture, including its core components, what distinguishes it from its predecessors, and how it works.

What Is the Transformer Architecture?

A transformer is a type of neural network architecture capable of learning context and relationships from sequential data such as text. This makes it applicable to a wide range of natural language processing (NLP) tasks such as:

- Machine translation
- Classification tasks
- Question answering (QA)
- Text generation
- Text summarization
- Sentiment analysis
- Conversational agents, i.e., chatbots
- Semantic search

Although there are many variations of the transformer architecture, one of the most common ways to classify transformers is as encoder-decoder, encoder-only, and decoder-only.

- Encoder-Decoder Transformers: also called sequence-to-sequence transformers, they encode an input sequence and decode it into an output sequence. Well-known examples are the original transformer model and the Text-to-Text Transfer Transformer (T5) model.
- Encoder-Only Transformers: these models only encode input and do not undertake decoding. Well-known examples are Bidirectional Encoder Representations from Transformers (BERT), also developed by Google, and its many variations such as RoBERTa.
- Decoder-Only Transformers: in contrast to encoder-only models, decoder-only transformers specialize in generating output one token at a time from the tokens that precede it. Prominent examples are the Generative Pre-trained Transformer (GPT) family of models by OpenAI, e.g., the models behind ChatGPT.

In this guide, we are going to focus on the encoder-decoder transformer.

Shortcomings of Previous Neural Network Architectures

To better appreciate the capabilities of the transformer architecture, let us now take a look at two of its prominent predecessors, recurrent neural networks (RNNs) and long short-term memory (LSTM).

Recurrent Neural Networks (RNNs)

An RNN processes input sequences a token at a time in cyclic iterations. The network’s input layer receives the first token of the input, which is then passed to hidden internal layers that process and output it for the next iterative step. This output, along with the next token from the sequence, is fed back into the neural network – so the output at every step is dependent on previous outputs as well as the current input. This process is repeated for every token in the input prompt.

Additionally, the RNN maintains a hidden state – in the form of a vector that stores the context and dependencies between the tokens it has learned so far – effectively acting as the network’s memory.
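
The recurrence described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `rnn_step`, the tanh activation, and the toy weights are all assumptions made for the example.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """One recurrent step: combine the current input with the previous hidden state."""
    size = len(h_prev)
    h_new = []
    for i in range(size):
        total = b[i]
        total += sum(w_x[i][j] * x[j] for j in range(len(x)))
        total += sum(w_h[i][j] * h_prev[j] for j in range(size))
        h_new.append(math.tanh(total))
    return h_new

# Hypothetical toy weights: 2-dimensional inputs, 2-dimensional hidden state.
w_x = [[0.5, -0.3], [0.1, 0.8]]
w_h = [[0.2, 0.0], [0.0, 0.2]]
b = [0.0, 0.0]

sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three token vectors
h = [0.0, 0.0]  # initial hidden state (the network's "memory")
for x in sequence:
    h = rnn_step(x, h, w_x, w_h, b)  # each step needs the result of the previous one
```

Because every call to `rnn_step` needs the hidden state produced by the call before it, the loop cannot be run in parallel across tokens.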

Long Short-Term Memory (LSTM)

An LSTM is a type of RNN that improves upon the conventional memory mechanism through cell states. These allow an LSTM to selectively retain or forget particular aspects of previous input according to their importance.

A cell contains three gates, each of which outputs values between 0 and 1 that signify how much information should be “let through” the gate:
- Forget gate: indicates what current state information can be forgotten
- Input gate: what new information should be added to the state
- Output gate: what information stored in the current state should be output

Each cell takes a token, the previous cell state, and the previous cell’s output to generate a new cell state and an output.
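
As a rough sketch of how the three gates interact, the following Python uses scalar values instead of vectors to keep the arithmetic visible. The function `lstm_step`, the parameter dictionary, and all weights are hypothetical; the "candidate" term is the standard LSTM proposal for new information, which the gate descriptions above do not name explicitly.

```python
import math

def sigmoid(z):
    """Squash a value into the range 0..1, as each gate does."""
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step. `p` maps gate names to (w_x, w_h, bias) tuples."""
    def pre(name):
        w_x, w_h, bias = p[name]
        return w_x * x + w_h * h_prev + bias

    f = sigmoid(pre("forget"))   # how much of the old cell state to keep
    i = sigmoid(pre("input"))    # how much new information to admit
    o = sigmoid(pre("output"))   # how much of the cell state to expose
    candidate = math.tanh(pre("candidate"))  # proposed new information

    c_new = f * c_prev + i * candidate  # updated cell state
    h_new = o * math.tanh(c_new)        # output passed to the next cell
    return h_new, c_new

# Hypothetical parameters chosen only for illustration.
params = {
    "forget": (0.5, 0.1, 1.0),
    "input": (0.6, -0.2, 0.0),
    "output": (0.3, 0.4, 0.0),
    "candidate": (0.9, 0.2, 0.0),
}
h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:
    h, c = lstm_step(x, h, c, params)
```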

Both RNNs and LSTMs have two main drawbacks:

- They process input sequentially: each step in the process depends on the previous ones, resulting in longer training and inference times. This also fails to make efficient use of GPUs, which are designed for parallel computation.
- Inability to handle long-term dependencies: this is where the network becomes less effective at keeping track of data points that are far apart in an input sequence; generally, the longer the input sequence, the higher the chance that contextual information is lost.

Although LSTMs are designed to mitigate this problem, they only do so up to a point, and they still often struggle to retain context over longer input sequences. The likelihood of retaining the context from a token positioned far away from the current token decreases exponentially with the distance from it, due to the vanishing gradient problem, i.e., gradients becoming increasingly smaller during backward propagation.
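
To see why the decay is exponential, consider a deliberately simplified model in which the gradient is multiplied by the same factor at every step it travels backward. The factor 0.9 and the function name `gradient_scale` are assumptions for illustration; real networks have a different factor at each step.

```python
def gradient_scale(per_step_factor, distance):
    """Scale of a gradient after flowing back through `distance` steps."""
    return per_step_factor ** distance

for distance in (1, 10, 50, 100):
    print(distance, gradient_scale(0.9, distance))
```

Under this assumption, a token 100 steps away contributes a gradient roughly 0.9^100, or about 0.00003 times the size of an adjacent one, which is why distant context has so little influence on what the network learns.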

In contrast, the transformer’s self-attention mechanism allows it to process all the tokens of an input sequence in parallel, resulting in faster training and inference. Consequently, the transformer architecture makes efficient use of a GPU’s processing abilities and is more scalable than its predecessors, because you can add more GPUs to increase computational power.

Secondly, because self-attention relates every token to every other token directly, and the positional encoding mechanism within the transformer tracks the position of each token, there is no need for recurrence or hidden state vectors. This makes it easier for the network to handle longer-range dependencies, which allows for larger context windows.

Components of the Transformer Architecture

Embedding Layer

This is where input enters the transformer. The input is broken down into tokens, which for English text average around four characters or 0.75 words each, and the tokens are turned into numerical representations called embeddings that the model is better able to understand and process.
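
The following sketch shows the two steps in miniature: mapping text to token IDs and looking up a vector for each ID. Everything here is a stand-in: the whitespace tokenizer, the four-entry vocabulary, and the randomly initialized `embedding_table` are hypothetical, whereas real models use subword tokenizers and embeddings learned during training.

```python
import random

# Hypothetical toy vocabulary; real tokenizers split text into subword units.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}
d_model = 4  # embedding size (illustrative; real models use far larger values)

random.seed(0)
# One vector per vocabulary entry; in a real model these are learned in training.
embedding_table = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in vocab]

def tokenize(text):
    """Whitespace tokenizer standing in for a real subword tokenizer."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

def embed(token_ids):
    """Look up the embedding vector for each token ID."""
    return [embedding_table[i] for i in token_ids]

token_ids = tokenize("The cat sat")
embeddings = embed(token_ids)  # one d_model-sized vector per token
```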

Positional Encoder

This adds information to each token’s embedding to indicate its position within the sequence, without recurrence or maintaining an internal state. This is typically achieved by using an alternating set of sine and cosine functions to generate a unique positional signal for each token. Sine and cosine functions are well-suited to this purpose because they repeat their patterns over a regular interval, which is ideal for capturing sequential relationships, while being orthogonal to each other, which keeps their signals distinct.
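
A minimal sketch of the sinusoidal scheme from the original paper follows, with sine on even dimensions and cosine on odd ones. The function names `positional_encoding` and `add_positions` are chosen for this example only.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd dimensions."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Each sine/cosine pair shares one frequency.
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(embeddings):
    """Add the positional signal to each token embedding element-wise."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(emb, pos)] for emb, pos in zip(embeddings, pe)]

pe = positional_encoding(seq_len=4, d_model=6)
```

Position 0 always encodes to alternating 0s and 1s (sin 0 and cos 0), and each later position produces a different pattern across the dimensions.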

Self-Attention Mechanism

The transformer’s self-attention component systematically compares token embeddings against each other to determine their similarity and relevance. This results in a weighted representation of the input which captures the appropriate patterns and relationships between the tokens, which the transformer can use to calculate the most probable output.

Both the encoder and decoder feature attention mechanisms: each encoder layer contains a single self-attention layer, while each decoder layer contains two attention layers, a masked self-attention layer and an encoder-decoder (cross) attention layer.

Encoder

The encoder’s purpose is to take the input sequence and convert, or encode, it into a weighted embedding that the decoder can use to generate output.

As opposed to a single encoder, transformers contain several encoders in a stack – with the original transformer featuring a stack of six encoders, for example. This increases the transformer’s efficacy, as each encoding layer captures different aspects of the input to enhance its understanding and, subsequently, the model’s predictive capabilities.

Decoder

The decoder takes the weighted embedding output by the encoder, generates the most probable output tokens, and decodes them into readable output.

As with the encoder, the transformer architecture contains a stack of decoders, mirroring the number of encoders, e.g., six in the original design.

How Does the Transformer Architecture Work?

Let us now turn our attention to how the transformer architecture works in more detail.

In short, an encoder-decoder transformer works by having the encoder take a given input sequence and convert it into numerical representations, i.e., embeddings that capture each token in its context. These encoded representations are then passed to the decoder, which uses them to generate output as a series of embeddings that are ultimately converted into text.

This process encompasses the following steps:

Generation of Input Embeddings

The input prompt is fed into the encoder, which tokenizes it and converts it into a series of embeddings.

Addition of Positional Encodings

The transformer generates positional encodings and adds them to the input embeddings for each token to provide information about their position within the sequence.

Multi-Head Self-Attention

The next stage is the self-attention layer, in which the encoder develops an understanding of the input sequence and assigns each token an attention score, i.e., how much importance a token should receive. This process is referred to as multi-head attention because the attention mechanism features multiple heads, each of which attends to the input sequence independently and in parallel. This lets the model capture different kinds of relationships between tokens at once, increasing its capabilities.

This part of the process is divided into several sub-stages:

i. Calculation of Queries, Keys, and Values

First, each embedding is projected into three vectors: a query, a key, and a value.

- Query: akin to a question that each token asks, i.e., what the current token is looking for in other tokens to gain a better understanding of its own context.
- Key: this provides information about a token that helps other tokens in the sequence better understand it, and how relevant the current token is to them.
- Value: the actual content or meaning of the token that other tokens in the sequence will use to update their own embeddings.

This enables the transformer to compare every token against every other token to determine its context and relative importance within the input sequence. The queries, keys, and values are calculated through linear transformations using parameters learned during the model’s training. Each head in the encoder has its own set of these parameters and performs its attention calculations in parallel with the other heads, rather than sequentially.
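
The linear transformations amount to three matrix multiplications. In the sketch below, the helper `matmul`, the input matrix `X`, and the projection matrices `W_q`, `W_k`, and `W_v` are hypothetical hand-picked values for a single head; in a trained model the projection matrices are learned.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Two token embeddings with d_model = 3 (hypothetical values).
X = [[1.0, 0.0, 1.0],
     [0.0, 2.0, 0.0]]

# Projection matrices (d_model x d_k); in a real model these are learned.
W_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
W_k = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
W_v = [[1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]

Q = matmul(X, W_q)  # what each token is looking for
K = matmul(X, W_k)  # what each token offers for matching
V = matmul(X, W_v)  # the content each token contributes
```

Each row of `Q`, `K`, and `V` belongs to one token, so all tokens are projected in a single multiplication rather than one at a time.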

ii. Creation of Score Matrix

The encoder creates a score matrix by taking the dot product of each token’s query with the key of every token in the sequence (including its own), i.e., multiplying their corresponding elements together and summing the results. The score matrix determines the level of emphasis each token should place on other tokens: the higher the score, the greater the emphasis.

Additionally, the scores are scaled down by dividing them by the square root of the dimension of the key vectors. This results in more stable gradients, as the dot products can grow large when that dimension is high.
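
As a small worked example of the scoring and scaling steps, the sketch below computes every query-key dot product and divides by the square root of the key dimension. The function name `scaled_scores` and the toy `Q` and `K` values are assumptions for illustration.

```python
import math

def scaled_scores(Q, K):
    """Dot product of every query with every key, divided by sqrt(d_k)."""
    d_k = len(K[0])
    return [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k)
             for k_row in K] for q_row in Q]

Q = [[2.0, 0.0], [0.0, 2.0]]  # one query vector per token
K = [[0.0, 2.0], [2.0, 0.0]]  # one key vector per token
scores = scaled_scores(Q, K)  # scores[i][j]: how strongly token i attends to token j
```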

iii. Application of Softmax Function

A softmax function is applied to each row of the score matrix to produce a set of attention scores that add up to 1. This distributes the attention among all the tokens and makes it easier to compare their relative importance. For instance, if one token has a softmax score of 0.6 and another has a score of 0.2, the first token receives three times as much attention as the second.
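
A softmax can be written in a few lines. This sketch subtracts the largest score before exponentiating, a common numerical-stability measure that does not change the result; the input scores are arbitrary example values.

```python
import math

def softmax(row):
    """Convert a row of scores into weights that sum to 1."""
    largest = max(row)
    exps = [math.exp(s - largest) for s in row]  # shift by the max for stability
    total = sum(exps)
    return [e / total for e in exps]

weights = softmax([2.0, 1.0, 0.1])  # higher scores receive larger weights
```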

iv. Multiplying Softmax Attention Scores with Value Embeddings

The softmax-adjusted attention scores are used to weight the value vectors: each token’s output embedding is the sum of all the tokens’ values, each multiplied by its attention score.

Finally, the outputs of all the attention heads are concatenated and fed into a final linear transformation layer for further refinement, producing one embedding per token that reflects its context within the entire input sequence.
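
The sketch below illustrates the weighted sum over value vectors and the joining of two heads' outputs. The names `attend` and `concat_heads` and all numbers are hypothetical; for brevity both heads share one value matrix `V`, whereas real heads each have their own, and the final linear transformation that follows concatenation is omitted.

```python
def attend(weights, V):
    """Each output row is the attention-weighted sum of the value vectors."""
    return [[sum(w * v[d] for w, v in zip(w_row, V)) for d in range(len(V[0]))]
            for w_row in weights]

def concat_heads(head_outputs):
    """Join the per-head outputs for each token side by side."""
    return [sum(rows, []) for rows in zip(*head_outputs)]

V = [[2.0, 1.0], [0.0, 2.0]]  # one value vector per token
head_1 = attend([[0.5, 0.5], [0.0, 1.0]], V)    # attention weights from head 1
head_2 = attend([[1.0, 0.0], [0.25, 0.75]], V)  # attention weights from head 2
combined = concat_heads([head_1, head_2])       # one longer vector per token
```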

Normalization and Residual Connections

After the self-attention layer, the input passes through a normalization layer to ensure the embeddings fall within a reasonable range. This helps to stabilize the model and expedite the training process by preventing very small or very large gradients, i.e., vanishing or exploding gradients.

Vanishing gradients are problematic as they often result in small updates to the model’s parameters during training – which prolongs the process. This is especially the case in neural networks with many layers, as gradients tend to diminish during backwards propagation, i.e., the model correcting its parameters through its loss function. So, the more layers between the output and input nodes, the greater the potential for vanishing gradients.

Conversely, exploding gradients cause overly drastic changes to the model during training and prevent it from converging on the optimal output. Additionally, if gradients grow too large, they can result in overflow errors as the model attempts to save them to memory – halting training entirely. Both vanishing and exploding gradients can cause underfitting or overfitting, where the model exhibits poor performance on the training data and/or evaluation datasets.

Additionally, the encoder and decoder feature residual connections that add the input of a layer to that layer’s output, so data can flow more efficiently through a neural network, particularly one with many layers. These connections enable the neural network to learn to predict the difference, or residual, between the input and corresponding output, instead of the output itself. Like normalization, this helps to mitigate the vanishing gradient problem and enables faster and more effective training.

This normalization and residual step takes place twice in each encoder layer: after the multi-head attention layer and again after the feed-forward layer, before the result is passed to the next encoder layer or, from the final layer, to the decoder.
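
A simplified sketch of this "add and normalize" step follows, using the ordering from the original transformer, LayerNorm(x + Sublayer(x)). The function names and the `double` stand-in for a sub-layer are hypothetical, and the learned scale and shift parameters of a full layer normalization are omitted.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer):
    """Residual connection followed by normalization: LayerNorm(x + sublayer(x))."""
    out = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, out)])

# Hypothetical stand-in for an attention or feed-forward sub-layer.
double = lambda vec: [2.0 * v for v in vec]
y = add_and_norm([1.0, 2.0, 3.0], double)
```

Because the input `x` is added back to the sub-layer's output, the sub-layer only has to learn the residual, and gradients have a direct path around it during backward propagation.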

Feed-Forward Network

After passing through the self-attention mechanism and being normalized, the input reaches the feed-forward network. The purpose of this step is for the model to capture the input sequence’s higher-level features so it can learn more intricate relationships from the data.

It is composed of three layers:
- Linear Transformation: each token is multiplied by a weight matrix and added to a bias vector (both learned through training) allowing it to better fit the data and learn its more complex underlying relationships.
- Activation Function: this introduces non-linearity into the network, further enabling it to model complex patterns that mirror real-world relationships, which are not simply linear. The original transformer uses the Rectified Linear Unit (ReLU), which works by directly outputting the input when it is a positive value while outputting zero if it is negative, creating a non-linear relationship between input and output.
- Linear Transformation: similar to the first layer transformation, but with its own set of weights and biases.
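
The three layers above can be sketched for a single token vector as follows. The helper names and the small weight matrices are assumptions chosen so the arithmetic is easy to follow; real models learn these weights and use much larger dimensions.

```python
def linear(x, W, b):
    """Multiply a vector by a weight matrix and add a bias vector."""
    return [sum(xi * w for xi, w in zip(x, col)) + bj for col, bj in zip(zip(*W), b)]

def relu(x):
    """Pass positive values through unchanged and replace negatives with zero."""
    return [max(0.0, v) for v in x]

def feed_forward(x, W1, b1, W2, b2):
    """Linear transformation, ReLU activation, second linear transformation."""
    return linear(relu(linear(x, W1, b1)), W2, b2)

# Hypothetical weights: d_model = 2, hidden size = 3.
W1 = [[1.0, -1.0, 0.5], [0.0, 1.0, -0.5]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.5, -0.5]

out = feed_forward([2.0, 1.0], W1, b1, W2, b2)
```

Here the first layer produces [2.0, -1.0, 0.5], ReLU zeroes out the negative entry, and the second layer maps the result back to two dimensions.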

Decoder

Following its conversion into a weighted numerical representation, the input is passed to the transformer’s decoder, which uses it to generate the appropriate output sequence.

Much of the decoder’s workflow mirrors that of the encoder – with a few key differences, as outlined below:
- Create Output Embeddings: the decoder takes the output tokens generated so far (during training, the target sequence shifted one position to the right) and converts them into embeddings.
- Output Positional Encoding: positional encodings are added to the output embeddings to incorporate data about their position within the sequence.
- Attention: in contrast to the encoder, the decoder features two attention layers:
  - Masked Multi-Head Attention: similar to the self-attention mechanism within the encoder, with the addition of causal masking, which prevents the present token from attending to future tokens.
  - Encoder-Decoder Multi-Head Attention: also known as cross-attention, each token in the output sequence calculates attention scores against all tokens in the input sequence. This allows the decoder to better establish relationships between the input and output tokens. More specifically, the output of the previous masked self-attention layer provides the queries, while the encoder’s output provides the keys and values. Causal masking is not needed here, because the entire input sequence is available to the decoder.
- Normalization and Residual Connections: these appear in the decoder three times: after each attention layer and after the feed-forward network.
- Feed-Forward Network: the output passes through a feed-forward layer to introduce non-linearity to the output.
- Output Projection: the refined output from the previous layers is projected into a vector that is as large as the number of output possibilities, i.e., the vocabulary of the output language.
- Output Probability Calculations: the projected output is fed into a softmax function to convert its values into probabilities. In the simplest decoding strategy, the token with the highest probability for each position in the sequence is selected as output.
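
Two of the decoder-specific steps in the list above, causal masking and output probability calculation, can be sketched as follows. The function names, scores, logits, and three-word vocabulary are all hypothetical, and picking the single most probable token is only the simplest of several possible decoding strategies.

```python
import math

def softmax(row):
    """Convert a row of scores into weights that sum to 1."""
    largest = max(row)
    exps = [math.exp(s - largest) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_mask(scores):
    """Set scores for future positions (column > row) to negative infinity."""
    return [[s if j <= i else float("-inf") for j, s in enumerate(row)]
            for i, row in enumerate(scores)]

def greedy_pick(logits, vocab):
    """Turn logits into probabilities and return the most probable token."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return vocab[best], probs[best]

# Masked self-attention: each position may attend only to itself and earlier ones.
scores = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
weights = [softmax(row) for row in causal_mask(scores)]

# Output step: one logit per vocabulary entry, converted to probabilities.
vocab = ["le", "chat", "noir"]  # hypothetical output vocabulary
token, prob = greedy_pick([0.1, 2.0, 0.3], vocab)
```

After masking, the softmax assigns zero weight to every future position, so the first token attends only to itself while the last token attends to all three.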

What are the Limitations of the Transformer Architecture?

Despite its many advantages, the transformer architecture isn’t perfect and still has its shortcomings.

- Limited Context Length: while transformers effectively mitigate the long-term dependency issues exhibited by RNNs and LSTMs, they do not eliminate them completely. When the context length, i.e., the maximum size of the input, grows past a certain point, transformers still struggle to recall relevant information in the middle of the sequence.
- Large Resource Requirements: the transformer’s computational complexity makes it resource-intensive, requiring large amounts of memory and storage. In contrast to RNNs, for which computational demands scale linearly with the length of an input sequence, the nature of the self-attention mechanism means that memory requirements scale quadratically with sequence length. Additionally, the larger the neural network, the harder it is to deploy on resource-constrained devices, making its use there less feasible.
- Longer Training Times: although transformers train efficiently in parallel, the size and complexity of transformer models mean that training them can still take a long time and a great deal of compute. Transformers also require large datasets, which must be labeled for supervised tasks such as machine translation, to ensure effective training.
- Lack of Transparency: transformer models are often described as “black boxes” because it is difficult to interpret their internal reasoning and explain how they arrived at certain predictions.
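
The quadratic scaling noted under Large Resource Requirements follows from the score matrix having one entry for every pair of tokens. The short sketch below counts those entries; the function name `attention_score_count` and the sequence lengths are illustrative assumptions, and actual memory use also depends on the number of layers, the number of heads, and the numeric precision.

```python
def attention_score_count(seq_len, num_heads=1):
    """Number of pairwise attention scores computed for one attention layer."""
    return num_heads * seq_len * seq_len

for n in (1_000, 2_000, 4_000):
    print(n, attention_score_count(n))
```

Doubling the sequence length quadruples the number of scores: 1,000 tokens produce one million scores per head, while 2,000 tokens produce four million.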

Conclusion

The development of the transformer architecture represented a landmark moment in the field of machine learning and laid the groundwork for many subsequent innovations in the field of AI. However, as powerful as transformers have proven to be, it is likely that we are only scratching the surface of their potential.

Addressing the shortcomings of the transformer architecture is one of the key focuses of AI researchers. Considerable, and encouraging, efforts have been made to make transformers less computationally demanding, e.g., through quantization, and to effectively extend their context length without sacrificing accuracy.

More notably, significant research is being devoted to improving the self-attention mechanism itself, with the aim of a solution that scales sub-quadratically with sequence length. This would be a significant breakthrough that, in mitigating the limitations of the transformer architecture, would open the door to a vast range of possibilities in AI.

Kartik Talamadupula

Director of AI Research

Kartik Talamadupula is a research scientist who has spent over a decade applying AI techniques to business problems in automation, human-AI collaboration, and NLP.