As large language models (LLMs) become increasingly integral to more industries, companies must decide which of the growing number of LLMs on the market best suits their workflows and long-term goals. One of the most crucial factors to consider when weighing up their options is a model’s context length.
In this guide, we explore the concept of context length, why it is important, and the benefits and drawbacks of differing context lengths.
What is Context Length and Why is it Important?
An LLM’s context length is the maximum amount of information it can take as input for a query. In other words, the larger the context length, also referred to as the context window (with the terms used interchangeably throughout), the more information a user can enter into a prompt to generate a response.
Now, while it’s common (and natural) to think of context length in terms of words, language models actually measure content in tokens. On average, a token corresponds to roughly four characters of English text, or about ¾ of a word; so 100 tokens is roughly equivalent to 75 words.
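These rules of thumb can be sketched in a couple of lines of Python. Note these helpers are illustrative estimates only, not a real tokeniser; actual token counts vary by model and by the text itself:

```python
def estimate_tokens(text):
    """Rough token estimate using the ~4-characters-per-token rule of thumb."""
    return max(1, round(len(text) / 4))

def estimate_tokens_from_words(word_count):
    """Rough token estimate using the ~0.75-words-per-token rule of thumb."""
    return round(word_count / 0.75)
```

For example, `estimate_tokens_from_words(75)` returns 100, matching the 100-tokens-to-75-words conversion above.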
With that in mind, here are the context lengths of some of the most prominent LLMs.
- Llama: 2K
- Llama 2: 4K
- GPT-3.5-turbo: 4K. However, GPT-3.5-16k has a context length of 16K.
- GPT-4: 8K. Similarly, GPT-4-32k has a context window of 32K.
- Mistral 7B: 8K
- PaLM 2: 8K. However, Google has reported that their new Gemini multimodal model has a 32K context window.
- Claude: 9K
- Claude 2: 100K (in beta at the time of writing).
Context length is significant because it helps determine an LLM’s functionality and efficacy in several ways:
- Input Scope and Complexity: the larger the context length, the greater an LLM’s ability to handle more detailed and complex inputs, which determines how it can be used and to what degree. A 4K context window, as found in GPT-3.5 or Llama 2, for example, is equivalent to around six pages of text, while a context length of 32K amounts to roughly 49 pages. A summarisation task, for instance, is limited by these respective sizes. Put another way, context length is a huge determiner of an LLM’s suitability for a task.
- Coherence: in the absence of true memory, a model’s context length determines how much prior input it can recall, which affects the coherence and accuracy of its output.
- Accuracy: the greater the size of the context window, the more potential there is for the model to provide a relevant response by leveraging a more comprehensive understanding of the input.
Different Context Lengths: Pros and Cons
Now that we’ve looked at the significance of context lengths, let’s move on to the benefits and drawbacks of short and long context lengths respectively.
Faster Response Times
A smaller context means that an LLM has less input to process, resulting in faster response generation. This makes for performant GenAI solutions and positive user experiences.
Lower Resource Requirements
With less input to process, smaller context windows don’t require as much computational power, memory, or electricity as larger ones. Not only is this more cost-effective, but it also makes such models easier to deploy and accessible to a wider range of users on a larger variety of devices.
Lack of “Memory”
LLMs are stateless, so they don’t retain any of the context input by a user. What appears to be their memory is really their context window, meaning they can’t access any information outside of it. A prominent example is an AI chatbot, which can only effectively recall the parts of a conversation that fall within its context length. If a user refers to a part of the conversation outside the context window, such as initial instructions given at the beginning of the chat, the model may stray off-topic or offer incoherent or irrelevant information. This both limits the number of use cases for an LLM and reduces its efficacy in tasks it should feasibly be applied to.
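In practice, chat applications work around this by keeping only the most recent messages that fit the window, so older turns silently fall out of “memory”. A minimal sketch of that truncation logic (the function name is hypothetical, and the token counter is pluggable, e.g. a real tokeniser or the rough 4-characters-per-token estimate):

```python
def fit_history(messages, max_tokens, count_tokens):
    """Keep the most recent messages whose combined token count fits the window.

    messages: list of message strings, oldest first.
    count_tokens: callable returning a token count for a string.
    """
    kept, total = [], 0
    for msg in reversed(messages):       # walk newest -> oldest
        cost = count_tokens(msg)
        if total + cost > max_tokens:
            break                        # older messages fall outside the window
        kept.append(msg)
        total += cost
    return list(reversed(kept))          # restore chronological order
```

With a window of two tokens and a toy one-token-per-message counter, `fit_history(["first", "second", "third"], 2, lambda m: 1)` drops the oldest message and keeps `["second", "third"]`.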
Lack of Contextual Understanding
As a user is limited in the amount of input they can enter into an LLM, the model may not sufficiently understand its given context and return inaccurate or incomplete output. To mitigate this, users may have to repeat input or rely on more precise and detailed contexts, i.e., prompt engineering, which could include examples of desired output, i.e., few-shot prompting.
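To make the few-shot idea concrete, such a prompt can be assembled programmatically: a task description, a handful of worked examples, then the actual query. This is an illustrative sketch (the helper name and the Input/Output format are assumptions, not a standard API):

```python
def build_few_shot_prompt(task, examples, query):
    """Assemble a few-shot prompt: task description, worked examples, then the query."""
    lines = [task, ""]
    for example_input, example_output in examples:
        lines.append(f"Input: {example_input}")
        lines.append(f"Output: {example_output}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")               # the model completes from here
    return "\n".join(lines)
```

Note that every example consumes tokens from the context window, which is precisely why short windows constrain this technique.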
Larger Range of Applications
A larger context window can accept and process larger inputs and consequently can be applied to more use cases. The greater the context length, the better it can handle:
- Documents, extending to research papers or even books
- Large databases
- Large codebases
- Multiple data sources
Additionally, with a larger “memory”, a model can be applied to more complicated, multi-step tasks.
The larger the context provided to it, the greater the chance an LLM will understand the user’s request and generate appropriate feedback. Returning to the example of a chatbot, an LLM with a larger window has access to more – or, ideally, all of – a user’s conversation and will produce relevant, coherent, and consistent responses as a result.
With the ability to receive more data at once and understand it more comprehensively, larger context windows save users time. This is not only the result of being able to undertake a larger number of tasks but also because they can process more information at a time – and don’t require iterative steps to complete a desired task.
The “Missing Middle”
While achieving longer context lengths is a significant challenge in itself, research has shown that maintaining a model’s accuracy across an extended context window is just as difficult.
When conducting tests on a selection of LLMs with context lengths ranging from 2K to 100K, researchers at Stanford University (Liu et al, 2023) found that models consistently struggled to accurately recall information in the middle of the context. Conversely, the LLMs performed better at processing information at the beginning of the context (primacy bias) and the end of the context (recency bias), producing a U-shaped curve when the position of the relevant context was mapped against accuracy. This performance bottleneck is often referred to as the “missing middle” with respect to context lengths.
In the case of document summarisation, this would result in inaccuracy or incoherency with respect to the middle of the inputted document. Or, once again returning to a chatbot use case, would cause inconsistencies if queried about information towards the middle of the conversation.
Greater Computational Requirements
The larger the context window, the greater its compute and memory requirements. The resource requirements of the self-attention mechanism within transformers, which determines the relationships between tokens, scale quadratically with context length. Doubling a context length from 4K to 8K, for instance, requires roughly 4x the memory and computation for attention. This imposes infrastructural constraints on both the developers of LLMs and the base of users who could viably run their models.
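The quadratic growth follows directly from the shape of the attention score matrix, which holds one entry per pair of tokens. A back-of-the-envelope sketch (ignoring constant factors such as heads, layers, and hidden size):

```python
def attention_matrix_entries(context_length):
    """Self-attention compares every token with every other token, so the
    QK^T score matrix has n * n entries for a context of n tokens."""
    return context_length ** 2

# Doubling the window from 4K to 8K quadruples the attention matrix:
ratio = attention_matrix_entries(8192) / attention_matrix_entries(4096)  # 4.0
```

The same n-squared factor applies to the memory needed to hold the score matrix, which is why context length is such an expensive dial to turn.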
Because LLMs with large context lengths require increasing amounts of memory and compute, this often results in increased inference latency. This then impacts the real-time performance of applications built on top of the model, resulting in sub-optimal, or simply negative, user experiences.
Training and Deployment Challenges
The larger the desired context length, the longer the sequences the model must be trained on, which increases the memory and compute required per training step and results in longer training times. This makes it infeasible for organisations or researchers with limited resources to train or fine-tune their own LLMs.
Similarly, when it comes to deploying LLMs with large context windows, model developers must consider their user base, which will be constrained by the computational power and memory of their devices. Devices like smartphones, tablets, and lower-end workstations and laptops may not be able to run LLMs with larger context lengths, limiting the applicability of such models in certain scenarios.
Finding the Missing Middle: Solutions and Challenges
Now that we’ve explored the considerable benefits of longer context lengths, can anything be done about the problem of the “missing middle” exhibited by the current generation of LLMs with larger context windows? Fortunately, recent research into the concept of context length extrapolation, i.e., extending the context window of an LLM beyond its pre-trained limit, has yielded encouraging results. However, this isn’t as simple as entering longer inputs into the model and requires the modification of the transformer’s self-attention mechanism within particular LLMs.
The Attention Mechanism and Positional Encoding
Now, to delve into the particulars of how researchers amended the attention mechanism, it’s prudent to provide a brief recap of how transformers handle input.
Firstly, to process the context entered into it, a transformer needs to convert it into a numerical representation that the model can understand. This is known as input embedding and involves each token comprising the context being mapped onto a multi-dimensional vector space – in which the distance between vectors represents the relationship between tokens, e.g., their relevance or similarity to each other.
However, because the transformer architecture within the current generation of LLMs processes input in parallel, as opposed to sequentially like its predecessors, recurrent neural networks (RNNs), it needs to add a vector to each token, called a positional embedding. The positional embedding allows the transformer to track the sequential order of the context without storing it within the neural network, reducing its memory requirements. This design also means the model has no short-term memory of its own, which is precisely why the size of the context window matters so much. The process of assigning a positional embedding to each input token is called positional encoding.
RoPE and Position Interpolation
There are several types of positional encoding mechanisms inherent to different LLMs. These include absolute encoding, which encodes the definite position of a token with a single embedding; sinusoidal encoding, as first put forth in Attention Is All You Need (Vaswani et al, 2017), the seminal Google research paper that introduced the transformer architecture; and relative encoding, which encodes a token’s position relative to every other token, using n embeddings (where n is the number of tokens in the context) to describe their relationships.
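For reference, the sinusoidal scheme from Vaswani et al can be sketched in a few lines: each position is encoded by interleaved sines and cosines at geometrically spaced frequencies. A simplified illustration, not an optimised implementation:

```python
import math

def sinusoidal_encoding(position, d_model):
    """Sinusoidal positional encoding from 'Attention Is All You Need':
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    enc = []
    for i in range(0, d_model, 2):
        freq = 1.0 / (10000 ** (i / d_model))
        enc.append(math.sin(position * freq))
        enc.append(math.cos(position * freq))
    return enc[:d_model]   # trim the last value if d_model is odd
```

Position 0, for example, encodes to alternating zeros and ones, and every other position gets a unique, smoothly varying fingerprint.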
However, it’s a particular type of relative encoding, rotary positional embedding (RoPE), found in models such as Llama 2 and PaLM, that provides the basis for context length extrapolation. RoPE works by rotating a token’s vector according to its position for more accurate distances in vector space – while retaining some of the efficiency of absolute encoding as the embedding won’t change if the word’s position within the sequence doesn’t change.
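The rotation idea can be illustrated on a single two-dimensional slice of an embedding. This is a heavily simplified, single-frequency sketch (real RoPE applies a different rotation frequency to each pair of embedding dimensions); its key property is that the dot product between two rotated vectors depends only on the *difference* between their positions:

```python
import math

def rope_rotate(x, y, position, theta=1.0):
    """Rotate one 2-D slice of a token embedding by an angle proportional
    to the token's position (simplified single-frequency RoPE)."""
    angle = position * theta
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))
```

Rotating the same vector to positions 3 and 5, then taking the dot product, yields cos(2), i.e. a value determined purely by the relative offset of 2, which is exactly what makes RoPE a relative encoding.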
Researchers (Chen et al, 2023) introduced a mechanism called position interpolation (PI) to LLMs with RoPE to successfully extend the context window, achieving strong results on several tasks that require long context. Instead of extrapolating the encoding to positions outside the model’s pre-trained context length (2K for the original Llama models), PI repositions the positional embeddings within the original context length. This is achieved by simply multiplying each position by a scaling factor s, where s = L/L*, with L being the pre-trained context length and L* being the desired context length.
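The rescaling itself is essentially a one-liner. A minimal sketch of the idea, with illustrative function and parameter names:

```python
def interpolate_position(position, pretrained_len, target_len):
    """Position interpolation: rescale positions so that a target_len-token
    sequence is squeezed into the model's pre-trained positional range."""
    s = pretrained_len / target_len   # s = L / L*, with s < 1 when extending
    return position * s
```

For example, extending a 2K model to 4K gives s = 0.5, so token position 4096 is mapped to position 2048, the edge of the range the model was trained on, rather than to an unseen position beyond it.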
With this approach, the researchers saw good performance on a variety of tasks, such as document summarisation and passkey retrieval, up to a context length of 32K. Similar results were attained by kaiokendev (2023), also up to a context length of 32K, and by Arka Pal et al (2023), who successfully extrapolated to a length of 8K.
Alternative RoPE Extensions
However, despite these positive results, the RoPE-plus-PI method isn’t a comprehensive solution to the missing middle, because interpolating the positional embeddings caused the model to perform worse on tasks with shorter contexts, on which it scored highly before the context window extrapolation. This is theorised to be a result of the model losing important high-frequency information, i.e., tokens that are very similar and close together in vector space. This information is lost when the interpolation pushes the tokens closer together and changes the rotation determined by the RoPE encoding.
To combat this, researchers (Peng et al, 2023) developed an alternative RoPE extension for Llama models called YaRN (Yet another RoPE extensioN), which utilises a more targeted type of interpolation called neural tangent kernel (NTK)-by-parts. NTK-by-parts addresses the problem of high-frequency information loss by applying the scaling factor selectively: interpolating lower-frequency information while leaving high-frequency information untouched. With YaRN, models fine-tuned for longer context lengths achieved their original benchmark scores on shorter context tasks while also showing strong results in various tests up to a context length of 128K.
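The selective-scaling idea can be sketched as a per-frequency blend. This is a heavily simplified illustration of the NTK-by-parts intuition only (YaRN’s actual ramp function, thresholds, and attention temperature scaling differ); the function name and the `low`/`high` cutoffs are assumptions for the sketch:

```python
def ntk_by_parts_scale(wavelength, context_len, scale, low=1.0, high=32.0):
    """How much to interpolate one RoPE frequency, based on how its
    wavelength compares to the context length.

    r = context_len / wavelength counts how many full rotations this
    dimension completes across the context. High-frequency dimensions
    (many rotations) keep their original resolution; low-frequency
    dimensions are fully interpolated; in between, blend linearly.
    """
    r = context_len / wavelength
    if r > high:          # high frequency: do not interpolate
        gamma = 1.0
    elif r < low:         # low frequency: interpolate fully
        gamma = 0.0
    else:                 # transition band: linear ramp
        gamma = (r - low) / (high - low)
    # Effective per-dimension position scaling: 1.0 means untouched,
    # 1/scale means fully interpolated (as in plain PI).
    return gamma * 1.0 + (1.0 - gamma) * (1.0 / scale)
```

A very short wavelength (fast-rotating, high-frequency dimension) thus keeps a scaling of 1.0, while a wavelength longer than the context collapses to the plain PI factor 1/scale, which is exactly the discrimination that protects the high-frequency information PI was destroying.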
While larger context lengths present more options, increased efficiency, and a greater number of use cases, LLMs with large context windows typically lack the accuracy and reliability of their smaller counterparts, and there is still a lot of work to be done. Fortunately, LLM development is still in its nascent stage, and many AI solution providers and researchers are committed to finding viable solutions, and are doing so at an encouraging pace.
References
- Liu et al: Lost in the Middle: How Language Models Use Long Contexts (2023)
- Vaswani et al: Attention Is All You Need (2017)
- Chen et al: Extending Context Window of Large Language Models via Positional Interpolation (2023)
- kaiokendev: Things I’m Learning While Training SuperHOT (2023)
- Arka Pal et al: Giraffe: Adventures in Expanding Context Lengths in LLMs (2023)
- Peng et al: YaRN: Efficient Context Window Extension of Large Language Models (2023)