When selecting the best large language model for your organisation’s needs, there are many factors to consider. Since a model’s parameter count correlates strongly with its capabilities, looking at the size of an LLM is a sensible starting point. Similarly, you might look at its results on common benchmarks or inference performance tests – these give you a quantitative measure of performance and indicate how the LLMs that pique your interest compare against each other.

However, after selecting an LLM that seems to best suit your requirements, there are other ways to further mould a language model to fit your particular needs – hyperparameters. In fact, your choice of hyperparameters and how you choose to configure them could be the difference between an LLM failing to meet your expectations and exceeding them. 

With this in mind, let’s take a look at the concept of LLM hyperparameters, why they’re important, and how particular hyperparameters affect a language model output. 

What are LLM hyperparameters and why are they important?

Hyperparameters are configurations you can use to govern the process of training an LLM. Unlike the model’s parameters, or weights, hyperparameters aren’t altered by the training data as it’s passed through; instead, they’re external to the model and set before training begins. As a result, even though they govern the LLM’s training process, they don’t become part of the resulting base model, and you can’t determine which hyperparameters were used to train a model after the fact.

An LLM’s hyperparameters are important because they offer a controllable way to tweak a model’s behaviour to produce the outcome desired for a particular use case. Instead of going through the considerable effort and expense of developing a bespoke model, the process of hyperparameter tuning offers the chance to reconfigure a base model so it performs more in line with your expectations.

Exploring Different LLM Hyperparameters 

Let’s move on to looking at some of the most commonly used LLM hyperparameters and the effect they have on a language model’s output. 

Model Size

The first hyperparameter to consider is the size of the LLM you want to use. Generally speaking, larger models are more performant and more capable of handling complex tasks, as they have more layers within their neural networks. As a result, they have more weights that can be learned from the training data, allowing them to better capture the linguistic and logical relationships between tokens.

However, a larger LLM costs more, requires larger datasets to train and more computational resources to run, and typically runs more slowly than smaller models. Additionally, the larger a model becomes, the more prone it is to overfitting, where a model becomes too attuned to its training data and fails to generalise consistently to previously unseen data.

Conversely, a small base LLM can perform as well as its larger equivalents on simple tasks while requiring fewer resources to both train and run. This is especially the case if the model has been quantized, i.e., compressed by reducing the precision of its weights, and/or fine-tuned, i.e., further trained with additional data. Additionally, the smaller an LLM, the easier it is to deploy and the more feasible it becomes on less powerful hardware, e.g., devices without several high-powered GPUs.

Ultimately, the optimal size of an LLM is dependent on the nature of the use case you’re looking to apply it to. The more complex the task – and the more computational resources and training data you have at your disposal – the larger your model can be. 

Number of Epochs

An epoch refers to a complete pass of an LLM over an entire training dataset. As a hyperparameter, the number of epochs determines how many times the model sees the training data, which in turn shapes its capabilities.

A greater number of epochs can help a model increase its understanding of a language and its semantic relationships. However, too many epochs can result in overfitting – where the model is too specific to the training data and struggles with generalisation. Alternatively, too few epochs can cause underfitting, where the LLM hasn’t learned enough from its training data to correctly configure its weights and biases.

Learning Rate 

Learning rate is a fundamental LLM hyperparameter that controls how much the model’s weights are updated in response to the calculated loss function, i.e., the measure of how far its predictions deviate from the correct outputs, during training. On one hand, a higher learning rate expedites the training process but may result in instability and overfitting. On the other hand, a lower learning rate increases stability and improves generalisation during inference – but lengthens training time.

Additionally, it’s often beneficial to reduce an LLM’s learning rate as its training progresses through the use of a learning rate schedule. Three of the most common learning rate schedules are time-based decay, step decay, and exponential decay.

  • Time-based decay: reduces the learning rate according to a preset time value. 
  • Step decay: decreases the learning rate by a fixed decay factor every few epochs.
  • Exponential decay: reduces the learning rate proportionally to its current value every epoch (all three schedules are sketched in code below).
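To make these schedules concrete, here’s a minimal sketch in Python of the three decay strategies described above. The initial learning rate, decay factors, and step interval are illustrative values only, not recommendations.

```python
import math

# Illustrative learning rate schedules; the constants are arbitrary examples.

def time_based_decay(initial_lr: float, epoch: int, decay: float = 0.01) -> float:
    # Learning rate shrinks as a function of elapsed epochs (time).
    return initial_lr / (1.0 + decay * epoch)

def step_decay(initial_lr: float, epoch: int, drop: float = 0.5, epochs_per_drop: int = 10) -> float:
    # Learning rate is cut by a fixed factor every `epochs_per_drop` epochs.
    return initial_lr * (drop ** (epoch // epochs_per_drop))

def exponential_decay(initial_lr: float, epoch: int, k: float = 0.1) -> float:
    # Learning rate decays proportionally to its current value each epoch.
    return initial_lr * math.exp(-k * epoch)

for epoch in (0, 10, 20, 30):
    print(epoch,
          round(time_based_decay(0.001, epoch), 6),
          round(step_decay(0.001, epoch), 6),
          round(exponential_decay(0.001, epoch), 6))
```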

Batch Size

An LLM’s batch size determines how many training samples the model processes before updating its weights. Training requires dividing the dataset into batches, and larger batches accelerate training because fewer weight updates are needed per epoch. However, smaller batches require less memory and compute power and can help an LLM learn from each data point of a corpus more thoroughly. Given these computational demands, batch size is often restricted by your hardware capabilities.
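Epochs, learning rate, and batch size all appear in even the simplest training loop. The sketch below is purely schematic – a toy linear model standing in for an LLM – but it shows where each of these hyperparameters sits relative to the weight updates.

```python
import random

# Toy "dataset": (input, target) pairs following a simple linear relationship.
data = [(x / 100, 2.0 * (x / 100) + 1.0) for x in range(100)]

# Hyperparameters, set before training begins.
num_epochs = 20      # number of complete passes over the dataset
batch_size = 10      # samples processed per weight update
learning_rate = 0.1  # how strongly each update moves the weights

w, b = 0.0, 0.0  # the model's (toy) parameters, learned from the data

for epoch in range(num_epochs):                # one epoch = one full pass over the data
    random.shuffle(data)
    for i in range(0, len(data), batch_size):  # one weight update per batch
        batch = data[i:i + batch_size]
        grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
        grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

print(f"learned w={w:.2f}, b={b:.2f} (target: w=2.00, b=1.00)")
```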

Max Output Tokens

Also often referred to as max sequence length, this is the maximum number of tokens that an LLM can generate as its output. While the number of tokens a model can ultimately output is determined by its architecture, this can be further configured as a hyperparameter to influence an LLM’s response. 

Typically, the higher you set the max output tokens, the more coherent and contextually relevant the model’s response will be. The more output tokens an LLM is allowed to use in formulating a response, the better able it is to express its ideas and comprehensively address the ideas given to it in the input prompt. Naturally, however, this comes with a price – as the longer the output, the more inference is performed by the model – increasing computational and memory demands. 

Conversely, setting a lower max token limit requires less processing power and memory, but by potentially not giving the model sufficient room to craft the optimal response, you leave the door open to incoherence and errors. That said, there are scenarios in which setting a lower maximum sequence length proves beneficial, such as: 

  • When trying to boost other aspects of an LLM’s performance, such as throughput or latency, by lowering inference time.
  • Similarly, to better control inference costs, you might cap the length of a model’s response. 
  • To constrain the amount of generated text so it conforms to a particular format, i.e., for a specific GenAI application.
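As a concrete illustration, here is how an output cap might be applied when generating text locally with the Hugging Face transformers library (an assumption for this sketch; other serving stacks expose an equivalent setting). The small gpt2 model is used purely so the example can run on modest hardware.

```python
# Minimal sketch, assuming the Hugging Face transformers library is installed.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("I went to meet a friend", return_tensors="pt")

# Capping max_new_tokens bounds the length (and therefore the inference cost)
# of the response; raising it gives the model more room to elaborate.
output_ids = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```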

Decoding Type

Within the transformer architecture that underpins most modern LLMs, there are two stages to inference: encoding and decoding. Encoding is where the user’s input prompt is converted into vector embeddings, i.e., numerical representations of words, that the model can process to generate the best response.

Decoding, on the other hand, is where the selected output is first converted from vector embeddings into tokens before being presented to the user as a response. There are two main types of decoding: greedy and sampling. With greedy decoding, the model simply chooses the token with the highest probability at each step during inference. 

Sampling decoding, in contrast, sees the model choose a subset of potential tokens and select a token at random from it to add to the output text. This creates more variability, or randomness, in how tokens are selected, which is a desirable trait in creative applications of language models. Understandably, however, opting for sampling decoding increases the risk of incorrect or nonsensical responses.
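The difference between the two decoding strategies can be shown with a few lines of Python over a hypothetical next-token distribution (the tokens and probabilities below are made up for illustration):

```python
import random

# Hypothetical next-token probabilities produced by a model at one decoding step.
token_probs = {"at": 0.30, "for": 0.25, "to": 0.22, "in": 0.15, "on": 0.12}

# Greedy decoding: always pick the single most probable token.
greedy_choice = max(token_probs, key=token_probs.get)

# Sampling decoding: draw a token at random, weighted by its probability.
sampled_choice = random.choices(list(token_probs), weights=list(token_probs.values()), k=1)[0]

print("greedy:", greedy_choice)    # always "at"
print("sampled:", sampled_choice)  # varies from run to run
```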

Top-k and Top-p Sampling

When you opt for sampling rather than greedy decoding, you’ll have an additional two hyperparameters with which to influence a model’s output: top-k and top-p sampling values.  

The top-k sampling value is an integer, typically ranging from 1 to 100 (with a common default of 50), that restricts sampling to the k tokens with the highest probabilities. To better illustrate how top-k sampling works, let’s use a brief example.

Let’s say you have the sentence “I went to meet a friend…”.

Now, out of the vast number of ways to end this sentence, let’s look at the five examples provided below – each beginning with a different token:

  1. at the library 
  2. for a brief work lunch
  3. to discuss our shared homework assignment
  4. in the centre of the city 
  5. on the other side of town 

From there, let’s assign each of the initial tokens a probability, as follows:

Token    Probability
at       0.30
for      0.25
to       0.22
in       0.15
on       0.12

Now, if we set the top-k sampling value to 2, the model will only add at and for to the sampling subset from which it selects an output token. Setting it to 5, by contrast, would mean all five options could be considered. So, in short, the higher the top-k value, the greater the potential variety in output.

Alternatively, the top-p sampling value is a decimal number in the range of 0.0 to 1.0 that configures a model to sample the tokens with the highest probabilities until the sum of those probabilities reaches the set value.

Returning to the above table, if the top-p sampling value is set to 0.7, once again, at and for will be the only tokens included in the subset, as their combined probabilities are 0.55 (0.30 + 0.25). As at, for, and to have a cumulative probability of 0.77 (0.30 + 0.25 + 0.22), this breaches the set threshold of 0.7 and to is excluded from the subset as a result. As with top-k sampling, the higher the value, the more varied the output. 

Lastly, in the event both sampling values are set, top-k takes precedence – with all probabilities outside the set threshold set to 0. 
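The two filters can be sketched directly against the example distribution above. Note that the top-p filter below follows the behaviour described in this article; some implementations also keep the token whose probability pushes the cumulative total past the threshold.

```python
token_probs = {"at": 0.30, "for": 0.25, "to": 0.22, "in": 0.15, "on": 0.12}

def top_k_filter(probs: dict, k: int) -> dict:
    # Keep only the k most probable tokens.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])

def top_p_filter(probs: dict, p: float) -> dict:
    # Keep the most probable tokens while their cumulative probability stays within p.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, total = {}, 0.0
    for token, prob in ranked:
        if total + prob > p:
            break
        kept[token] = prob
        total += prob
    return kept

print(top_k_filter(token_probs, 2))    # {'at': 0.3, 'for': 0.25}
print(top_p_filter(token_probs, 0.7))  # {'at': 0.3, 'for': 0.25}
```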

Temperature

Temperature performs a similar function to the above-described top-k and top-p sampling values, providing a way to vary the range of possible output tokens and influence the model’s “creativity”. It is represented by a decimal number between 0.0 (which is effectively the same as greedy decoding, whereby the token with the highest probability is added to the output) and 2.0 (maximum creativity). 

The temperature hyperparameter influences output by changing the shape of the token probability distribution. For low temperatures, the difference between probabilities is amplified, so tokens with higher probabilities become even more likely to be output compared to less-likely tokens. Consequently, you should set a lower temperature value when you want your model to generate more predictable or dependable responses.

In contrast, high temperatures cause token probabilities to converge closer to one another, so less likely or unusual tokens receive an increased chance of being output. In light of this, you should set a higher temperature value when you want to increase the randomness and creativity of responses.
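The reshaping effect is easiest to see by applying a temperature-scaled softmax to a handful of made-up raw scores (logits); the values below are purely illustrative:

```python
import math

# Hypothetical raw scores (logits) for a few candidate tokens.
logits = {"at": 2.0, "for": 1.8, "to": 1.7, "in": 1.3, "on": 1.1}

def softmax_with_temperature(logits: dict, temperature: float) -> dict:
    # Lower temperatures sharpen the distribution; higher temperatures flatten it.
    scaled = {token: value / temperature for token, value in logits.items()}
    max_val = max(scaled.values())  # subtract the max for numerical stability
    exps = {token: math.exp(value - max_val) for token, value in scaled.items()}
    total = sum(exps.values())
    return {token: exp / total for token, exp in exps.items()}

for temp in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temp)
    print(temp, {token: round(p, 2) for token, p in probs.items()})
```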

Stop Sequences

Aside from the max output tokens hyperparameter, the other way to influence the length of an LLM’s response is by specifying a stop sequence, i.e., a string composed of one or more characters, which automatically stops a model’s output. A common example of a stop sequence is a period (full stop).

Alternatively, you can specify the end of a sequence by setting a stop token limit – which is an integer value rather than a string. For instance,  if the stop token limit is set to 1, the generated output will stop at a sentence. If it’s set to 2, on the other hand, the response will be constrained to a paragraph. 

As with the max output tokens hyperparameter, a reason you might set a stop sequence or stop token limit is to gain greater control over inference, which may matter if budget is a consideration.
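As a simple illustration, the sketch below truncates already-generated text at a stop sequence. In practice, inference servers apply the check during decoding and halt generation as soon as the sequence appears, which is what saves the inference cost.

```python
def apply_stop_sequence(text: str, stop: str) -> str:
    # Truncate the output at the first occurrence of the stop sequence.
    index = text.find(stop)
    return text if index == -1 else text[:index + len(stop)]

generated = "I went to meet a friend at the library. We talked for hours about work."
print(apply_stop_sequence(generated, "."))  # "I went to meet a friend at the library."
```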

Frequency and Presence Penalties

A frequency, or repetition, penalty is an LLM hyperparameter, expressed as a decimal between -2.0 and 2.0, that indicates to a model that it should refrain from using the same tokens too often. It works by lowering the probabilities of tokens that were recently added to a response, so they’re less likely to be repeated, producing a more diverse output.

The presence penalty works in a similar way but is applied as a flat penalty to any token that has been used at least once, while the frequency penalty scales with how often a specific token has been used. In other words, the frequency penalty discourages repetition, while the presence penalty encourages a wider assortment of tokens.
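One common formulation, used by several inference APIs, subtracts both penalties directly from the token scores (logits) before sampling; the sketch below assumes that convention and uses made-up scores:

```python
from collections import Counter

def apply_penalties(logits: dict, generated_tokens: list,
                    frequency_penalty: float, presence_penalty: float) -> dict:
    # The frequency penalty scales with how often a token has already appeared;
    # the presence penalty is a flat deduction once a token has appeared at all.
    counts = Counter(generated_tokens)
    return {
        token: score
               - counts[token] * frequency_penalty
               - (presence_penalty if counts[token] > 0 else 0.0)
        for token, score in logits.items()
    }

logits = {"cat": 2.0, "dog": 1.5, "fish": 1.0}
print(apply_penalties(logits, ["cat", "cat", "dog"],
                      frequency_penalty=0.5, presence_penalty=0.3))
# "cat" is penalised most (it appeared twice); "fish" is untouched.
```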

What is LLM Hyperparameter Tuning?

LLM hyperparameter tuning is the process of adjusting different hyperparameters during the training process with the goal of finding the combination that generates the optimal output. However, this inevitably involves considerable trial and error: meticulously tracking the application of each hyperparameter and recording the corresponding effect on the output. Performing this manually is time-consuming, so methods of automated LLM hyperparameter tuning have emerged to streamline the process considerably.

The three most common methods of automated hyperparameter tuning are random search, grid search, and Bayesian Optimisation.

  • Random Search: as the name suggests, this method randomly selects and evaluates combinations of hyperparameters from a range of values. It is simple and capable of traversing a large parameter space efficiently, but this simplicity comes at a cost: it may miss the optimal combination of hyperparameters, and a broad search can still consume considerable compute.
  • Grid Search: in contrast to random search, this method exhaustively evaluates every possible combination of hyperparameters from a range of values. While it is resource-intensive, it offers a systematic approach that is guaranteed to find the best combination within the values searched (both approaches are sketched in code after this list).
  • Bayesian Optimisation: differs from the above two methods in that it employs a probabilistic model to predict the performance of different hyperparameters and chooses the most promising ones to evaluate next. This makes it an efficient tuning method that can handle large parameter spaces while being less resource-intensive than grid search. The downside, however, is that it’s more complex to set up and less reliable at identifying the optimal set of hyperparameters than grid search.
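To make the first two strategies concrete, here is a schematic sketch over a toy search space; the `evaluate` function is a stand-in for the expensive step of training (or fine-tuning) and scoring a model with a given configuration.

```python
import itertools
import random

# Toy search space; in practice each combination means training and evaluating a model.
search_space = {
    "learning_rate": [1e-5, 3e-5, 1e-4],
    "batch_size": [8, 16, 32],
    "num_epochs": [1, 2, 3],
}

def evaluate(config: dict) -> float:
    # Placeholder objective; a real pipeline would return a validation metric.
    return random.random()

# Grid search: exhaustively try every combination (27 runs here).
grid = [dict(zip(search_space, values)) for values in itertools.product(*search_space.values())]
best_grid = max(grid, key=evaluate)

# Random search: sample a fixed budget of random combinations (5 runs here).
samples = [{name: random.choice(options) for name, options in search_space.items()} for _ in range(5)]
best_random = max(samples, key=evaluate)

print("grid search pick:  ", best_grid)
print("random search pick:", best_random)
```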

Another advantage offered by automated hyperparameter tuning is that it makes the development of multiple language models, each with a unique combination of hyperparameters, more feasible.  By training them on the same dataset, you’re then in a position to compare their output and determine which is best for your desired use case. Similarly, each model tuned on a different set of hyperparameters and value ranges could prove better suited to different use cases.

Conclusion

Though often falling under the broader category of fine-tuning, hyperparameter tuning is an important discipline that should be considered separately – and as an important part of an AI strategy. By configuring the different hyperparameters detailed in this guide, and observing how your chosen LLM modifies its output in response, you can improve the performance of base models to better suit your desired real-world scenarios.

 

Kartik Talamadupula
Director of AI Research

Kartik Talamadupula is a research scientist who has spent over a decade applying AI techniques to business problems in automation, human-AI collaboration, and NLP.