With a wide range of capabilities, such as text generation, question answering, and summarization, large language models (LLMs) are attracting ever-increasing numbers of organizations keen to integrate them into their business processes. However, one of the most significant barriers to the adoption of generative AI tools is their lack of applicability to a particular domain or to the specific workflows an industry may have in place. While appreciating LLMs’ general language capabilities, organizational stakeholders may conclude that the current generation of language models isn’t suitable for their unique requirements.

Fortunately, a key solution to the problem of a lack of specificity in LLMs can be found in fine-tuning. Understanding the principles behind fine-tuning an LLM, as well as its potential benefits and implications, should be an essential part of every organization’s AI strategy.

With that in mind, this guide explores the concept of fine-tuning, how the process works, its pros and cons, potential use cases, and the different ways you can fine-tune an LLM. 

What is Fine-Tuning and How Does It Work?

Fine-tuning is the process of further training a pre-trained base LLM, or foundational model, for a specific task or knowledge domain.  By fine-tuning an LLM on a domain or task-specific dataset, which is considerably smaller and better curated than the larger corpus on which it was initially trained, you can improve its performance on specific use cases.

Pre-training an LLM is achieved through the unsupervised learning of vast amounts (typically, terabytes) of unstructured data from a variety of sources on the internet. This is often referred to as big web data, with a prominent example of this being the Common Crawl dataset. 

This results in a foundational model with a detailed understanding of language, represented internally by a vast set of parameters. These parameters encode the linguistic patterns and relationships between words as weights assigned across the layers of the LLM’s neural network. The parameters and the magnitudes of their weights are how an LLM determines the probability of the next token to output in response to a given input prompt.
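
To make this concrete, the short sketch below shows how a pre-trained causal LLM turns its learned weights into a next-token probability distribution. It assumes the Hugging Face transformers library and uses the small public gpt2 checkpoint purely as a stand-in base model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a small pre-trained base model and its tokenizer (gpt2 as a stand-in).
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # raw scores over the vocabulary

# The learned weights determine these scores; softmax turns the scores at the
# final position into a probability distribution over the next token.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
print(tokenizer.convert_ids_to_tokens(top.indices.tolist()), top.values.tolist())
```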

At this stage, while the pre-trained model has considerable general knowledge of language, it lacks certain kinds of specialized knowledge. Similarly, while pre-trained models are capable of producing coherent, contextually relevant responses, these responses are typically document-style answers as opposed to the conversational responses expected of an AI assistant. Fine-tuning bridges the gap between generic pre-trained models and the unique requirements of specific generative AI applications. 

How the Process of Fine-Tuning an LLM Works

During fine-tuning, a base LLM is trained on a new labeled dataset tailored towards a particular task or domain. In contrast to the enormous dataset the model was pre-trained on, the fine-tuning dataset is smaller and curated by humans. As the LLM is fed this previously unseen data, it makes predictions about the correct output based on its pre-training. Because it hasn’t been exposed to this specialized data, however, many of the model’s predictions will be incorrect; it then calculates the difference between its predictions and the correct outputs using a loss function, and this difference is the model’s loss, or error.

Next, the LLM uses an optimization algorithm, such as gradient descent, to determine which parameters need to be adjusted to produce more accurate predictions. The optimization algorithm uses the loss to identify which parameters contributed to the predictive error and to what extent: the parameters most responsible for the error are adjusted more, while those less responsible are adjusted less. Over several iterations of this process on the dataset, the LLM keeps adjusting its parameters towards a neural network configuration that minimizes the loss for the given dataset and, by extension, the task or domain for which the model is being adapted.
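
As a rough illustration of this loop, the sketch below runs a single fine-tuning step on one labeled example, again assuming the Hugging Face transformers library and gpt2 as a placeholder base model; the example text, learning rate, and optimizer choice are illustrative, not prescriptive.

```python
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")      # placeholder base model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
optimizer = AdamW(model.parameters(), lr=5e-5)

# One curated, labeled example (in practice: many thousands, processed in batches).
text = "Question: What is the boiling point of water at sea level?\nAnswer: 100 degrees Celsius."
batch = tokenizer(text, return_tensors="pt")

model.train()
outputs = model(**batch, labels=batch["input_ids"])  # compare predictions to the correct tokens
loss = outputs.loss                                  # the gap between predictions and labels
loss.backward()                                      # attribute the error to each parameter (gradients)
optimizer.step()                                     # gradient descent: nudge the responsible parameters
optimizer.zero_grad()
print(float(loss))
```

Repeating this over many batches and epochs is what gradually reshapes the network’s weights towards the new task or domain.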


The Two Types of Fine-Tuning

While there are a variety of techniques or methodologies, which are detailed later in this guide, there are generally two kinds of fine-tuning – full fine-tuning and transfer learning: 

  • Full Fine-Tuning: a process during which all of a base model’s parameters are updated, creating a new version with altered weights. While this is the most comprehensive way to adapt a pre-trained LLM to a new task or domain, it is also the most resource-intensive. As with pre-training, full fine-tuning requires sufficient compute power and memory to process and store all the adapted parameters, gradient changes, loss values, and other necessary components that are updated during the process.

    On a similar note, full fine-tuning creates a new iteration of the base LLM for every task or domain you train it for – with each being the same size as the original. If you plan on creating models for different use cases or end up creating several iterations of a fine-tuned LLM, this can rapidly increase your storage requirements.
  • Transfer Learning: also referred to as repurposing, transfer learning involves training a foundational model for a task different from the one on which it was originally trained. Since the LLM has already learned significant linguistic knowledge during pre-training, certain features can be extracted and adapted for another use case or domain. To achieve this, transfer learning involves “freezing” most, if not all, of the base model’s neural network layers to limit the extent to which their parameters can be adjusted. The remaining layers, or, in some cases, brand-new layers, are then fine-tuned with the domain- or task-specific data, as sketched in the example after this list.
    With far fewer parameters to adjust, transfer learning can be carried out with smaller fine-tuning datasets and requires less time and computational resources. This makes it a favorable approach for organizations constrained by budget, time, or the availability of sufficient labeled data.
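
The sketch below illustrates the freezing step described above, assuming a BERT-style encoder from Hugging Face transformers being repurposed for a three-class classification task; the model name and label count are placeholders.

```python
from transformers import AutoModelForSequenceClassification

# Load a pre-trained encoder with a brand-new, randomly initialized classification head.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

# Freeze the pre-trained layers so their parameters cannot be adjusted during fine-tuning.
for param in model.bert.parameters():
    param.requires_grad = False

# Only the new head remains trainable; everything the base model learned is preserved.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)   # e.g. ['classifier.weight', 'classifier.bias']
```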

What are the Benefits and Challenges of Fine-Tuning?

Having explored what fine-tuning is, the next consideration is why you should fine-tune an LLM and the challenges involved in doing so. To address this, let’s look at the benefits and challenges of fine-tuning foundational models.

The Benefits of Fine-Tuning 

  • More Performant LLMs: a fine-tuned LLM is capable of a wider range of tasks and applicable to more use cases than a model that’s merely pre-trained. Moreover, a fine-tuned model typically performs at a higher level than it did after its initial training, producing more accurate, informative output that is more closely aligned with user expectations.
  • Task or Domain-Specificity: training an LLM on the distinctive language patterns, terminology, and contextual nuances of a particular domain or a particular task makes it more useful for its specified purpose. Fine-tuning base models on datasets tailored towards certain industries significantly increases the value they can offer organizations within that field.
  • Customization: by training an LLM to adopt your organization’s tone of voice and terminology in its generated content, you can ensure your generative AI applications offer the same experience your customers have become accustomed to. This helps ensure a consistent user experience across all forms and channels of communication and maintains, if not increases, customer satisfaction levels as you incorporate generative AI into your business processes.
  • Lower Resource Consumption: in some cases, fine-tuned models consume far fewer computational and storage resources than pre-trained LLMs. The smaller a model is, the less it will cost to run and the more options you’ll have regarding its deployment. On top of this, smaller (i.e., fewer parameters) fine-tuned base models can outperform larger general-purpose models, depending on the use case. 
  • Enhanced Data Privacy and Security: while organizations would like to train a model with proprietary or customer data to generate more accurate output, they must consider how an LLM could subsequently leak said data once it has learned it. Fine-tuning allows companies to better control the data to which the model is exposed, giving them the benefit of a task or domain-adapted LLM while maintaining data security and compliance.

The Challenges of Fine-Tuning 

  • Can Be Costly: fine-tuning, particularly full fine-tuning, is computationally expensive, requiring increasing amounts of compute, memory, and storage space as models get larger. Naturally, this cost grows with each additional fine-tuned model you require.
  • Time-Intensive: gathering and cleaning data, feeding it into models, evaluating outputs, and so on all take time, so fine-tuning can be a lengthy process.
  • Hard to Source Data: another factor that can contribute to the cost of fine-tuning is sourcing appropriate data for the intended use case or knowledge domain. Insufficient or noisy data can reduce an LLM’s performance and reliability and prevent proper fine-tuning, so it is crucial to ensure fine-tuning data is both adequate and properly formatted, which can prove difficult.
  • Catastrophic Forgetting: when fine-tuning for a specific task, a base model may “lose” the general knowledge it had previously acquired as its parameters are altered. This is referred to as catastrophic forgetting, and it compromises a model’s performance on a broader variety of tasks in favor of specificity.

Fine-Tuning Use Cases

Here are some specific use cases in which a fine-tuned LLM is beneficial.

  • Language Translation: exposing an LLM to a language that it had little or no initial training in will enhance its language translation capabilities.  
  • Specialized Knowledge Base: pre-training an LLM with vast swathes of text from the internet provides it with good general knowledge (albeit some of it being questionable); fine-tuning it with a specific dataset allows you to create a specialized knowledge base that can be applied to a variety of advanced use cases.   
  • Conversational AI: the better a base model is trained on data from a particular domain, the better it will perform at engaging in conversations regarding said domain. This results in chatbots and other conversational AI applications holding more authentic-sounding and informative conversations tailored to a specific task or industry.
  • Summarization: being more familiar with a domain’s knowledge, language, and document structure, a fine-tuned LLM is more capable of interpreting and summarizing documents pertaining to that domain. While a pre-trained LLM might not be capable of accurately summarizing a scientific research paper, for example, a model fine-tuned on advanced scientific knowledge could provide a valid summary.
  • Sentiment Analysis: fine-tuning an LLM to understand the subtle differences in a particular country or region’s language or dialect will help it perform better at sentiment analysis. With a firmer grasp of what someone is saying, a fine-tuned LLM can better interpret their mood and extract metadata from an interaction that would otherwise go unnoticed.

LLM Fine-tuning Techniques

With a better understanding of the advantages and drawbacks of fine-tuning, as well as viable use cases, let’s turn our attention to the different ways in which an LLM can be fine-tuned.

Supervised Fine-Tuning

Supervised fine-tuning is a collection of techniques in which an LLM is trained on a task-specific labeled dataset, i.e., one where every input has an associated correct label or output. The objective is for the LLM to learn the difference between its output and the correct labels in the fine-tuning data and adjust its parameters to perform better on the use case or domain for which it is being fine-tuned.
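
As a concrete, purely illustrative example of such a labeled dataset, the sketch below writes a handful of prompt/completion pairs for a hypothetical customer-support assistant to a JSONL file, a common format for supervised fine-tuning data; the field names, file name, and examples are assumptions, not a required schema.

```python
import json

# Each example pairs an input with the correct, human-curated output.
examples = [
    {"prompt": "How do I reset my password?",
     "completion": "Go to Settings > Security and choose 'Reset password'."},
    {"prompt": "Can I change my billing date?",
     "completion": "Yes. Open the Billing page and pick a new date from the calendar."},
]

with open("finetune_data.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```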

Types of supervised fine-tuning include:

  • Task-Specific Fine-Tuning: where an LLM is trained for a particular use case or knowledge domain, allowing its parameters to adapt to that task’s specific requirements and intricacies. Task-specific fine-tuning is especially useful for maximizing a model’s performance on a single, well-defined task and can be implemented with smaller datasets. The main downside of task-specific fine-tuning, however, is the risk of catastrophic forgetting, where a model’s parameters are adjusted to the point that it is unable to perform tasks it was capable of after pre-training.
  • Multi-Task Fine-Tuning: this technique sees a model fine-tuned on multiple related tasks simultaneously with a mixed dataset, improving its performance on several use cases concurrently. This method of fine-tuning leverages the similarities and differences across different tasks to produce an LLM with varied capabilities, while also avoiding catastrophic forgetting. A notable drawback of multi-task fine-tuning, however, is that it requires large amounts of labeled data.
  • Sequential Fine-Tuning: this method involves training an LLM on multiple related tasks in sequence. This approach lends itself well to iterative fine-tuning, where a model is trained in stages to make it progressively better suited to a particular use case. For example, you could use sequential fine-tuning to train a pre-trained LLM into one adapted for use in the legal field, before further fine-tuning it for (say) employment law.
  • Few-Shot Fine-Tuning: this approach involves providing the model with a few examples, or shots, within the given prompt to help it adapt to a new task. By providing additional context, and perhaps the desired output format, directly in the prompt, the examples guide the LLM towards the desired response; strictly speaking, this steers the model in context rather than updating its weights. A short illustration follows this list.
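
To illustrate the few-shot approach from the last item above, the sketch below assembles a prompt containing two labeled examples, or shots, followed by the new input; the task and example texts are purely illustrative, and no model parameters are changed.

```python
# Two in-context examples ("shots") plus the new input to classify.
shots = [
    ("The delivery was fast and the packaging was great.", "positive"),
    ("The product broke after two days.", "negative"),
]
query = "Support was helpful but the wait time was long."

prompt = "Classify the sentiment of each review.\n\n"
for text, label in shots:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"

print(prompt)  # this prompt is sent to the LLM as-is; the examples steer its answer
```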

Reinforcement Learning from Human Feedback (RLHF)

RLHF is a fine-tuning methodology that uses human feedback to train an algorithm that will fine-tune an LLM for a particular task or domain.

At a high level, the LLM is given a prompt and generates two answers, forming prompt-generation pairs. A human evaluator then rates these outputs, and this preference signal indicates which answers are preferable and is used to train the model to generate higher-quality outputs.

The first step in RLHF is training a Reward Model (RM): a separate model designed to take a prompt and a generated response and assign the pair a numerical score based on human preference. To create the RM, human evaluators provide it with prompt-generation pairs and labeling instructions identifying the preferred output. By training the RM in this way, it learns to rate its given inputs according to the labeling instructions given to it by the human evaluators.
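
The core of reward model training is a pairwise preference loss: the RM should score the human-preferred response higher than the rejected one. The sketch below shows that objective in isolation, assuming the RM has already produced scalar scores for the two responses; the function name, tensor names, and values are illustrative.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    # Minimize -log(sigmoid(r_chosen - r_rejected)) so that preferred responses
    # receive higher scores than rejected ones.
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Example scores the RM assigned to a preferred and a rejected response.
loss = reward_model_loss(torch.tensor([1.3]), torch.tensor([0.2]))
print(loss.item())
```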

Next, prompts are input into the pre-trained LLM intended for fine-tuning. Its output is then fed into the RM, which gives it a numerical rating. Having calculated a reward for the output, this reward is passed to an algorithm called Proximal Policy Optimization (PPO), which updates the parameters of the model being fine-tuned, steering its subsequent output towards human-defined expectations.

RLHF leverages the expertise of human evaluators to ensure LLMs produce more accurate responses and develop more refined capabilities. This process contributed to the development of GenAI applications that conform to human expectations and apply to real-life scenarios. 

Parameter Efficient Fine-Tuning (PEFT)

One of the main issues with full fine-tuning of LLMs is the amount of resources it requires. As LLMs increase in size, the compute and memory required to train them make conventional hardware unfeasible, instead necessitating specialized systems equipped with several GPUs. Similarly, as each fine-tuned LLM ends up the same size as the original pre-trained base model, storing them becomes increasingly costly – especially if you create several model iterations.

PEFT is a transfer learning technique that addresses the challenges of full fine-tuning by reducing the number of parameters that are adjusted when fine-tuning an LLM. It involves freezing all of the pre-trained model’s existing parameters while adding a small number of new parameters to be adjusted during fine-tuning. With far fewer parameters to alter, often only a small fraction of those in the pre-trained LLM, PEFT significantly lowers the computational resources and time required to fine-tune an LLM – and can be achieved with a far smaller dataset.

Just as importantly, PEFT helps mitigate the problem of catastrophic forgetting: by freezing the pre-trained LLM’s original weights, it is unlikely to forget its prior capabilities. 

Low-Rank Adaptation (LoRA)

There are several ways to implement PEFT, with the most frequently used being LoRA: a technique that facilitates model fine-tuning by tracking changes to its parameters instead of updating them directly. LoRA’s key mechanism is low-rank decomposition, in which the matrix representing how a layer’s weights have been modified is decomposed into two smaller matrices. As these matrices are of a lower rank, i.e., have fewer dimensions, they require less compute to operate on and less memory to store. This makes fine-tuning more efficient and feasible on conventional hardware.
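
A minimal sketch of that decomposition is shown below: the original weight matrix stays frozen, while the update is tracked as the product of two much smaller trainable matrices, B (d × r) and A (r × k). The class name, rank, and scaling factor are illustrative; libraries such as Hugging Face’s peft package implement this pattern for real models.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer and adds a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze the original weights
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # r x k, trainable
        self.B = nn.Parameter(torch.zeros(d, r))         # d x r, trainable, starts at zero
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Original output plus the low-rank update applied to the input.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(trainable, "trainable out of", total)   # a small fraction of the full layer
```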

Direct Preference Optimization (DPO)

DPO was proposed by Rafailov et al. (Direct Preference Optimization: Your Language Model is Secretly a Reward Model, 2023) as a simpler and less resource-intensive alternative to RLHF.

Although it shares some similarities with the initial stages of RLHF, i.e., inputting curated prompt-generation pairs into a pre-trained base model, DPO does away with the concept of the reward model. Instead, it implements a parameterized version of the reward mechanism directly in the training objective: the preferable answer from each response pair is labeled positive and the inferior answer is labeled negative, and the pre-trained LLM’s parameters are adjusted to make the positively labeled outputs more likely and to veer away from those labeled negative.
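
That objective can be written down directly. The sketch below shows the DPO loss from Rafailov et al., assuming we already have summed log-probabilities of the preferred (“chosen”) and inferior (“rejected”) responses under both the model being fine-tuned and a frozen reference copy; the variable names, beta value, and example numbers are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1) -> torch.Tensor:
    # How much more (in log space) the policy prefers each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Push the policy towards the chosen response and away from the rejected one.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss.item())
```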

Research has revealed that DPO offers better or comparable performance to RLHF while consuming fewer computational resources and without the complexity inherent to RLHF. In particular, DPO fine-tuning was shown to match or outperform RLHF in summarization and single-turn dialog tasks. 

Conclusion

With their broad understanding of language and vast general knowledge, foundational LLMs have proven to be a revelation in a wide variety of industries. However, rapidly growing numbers of organizations are discovering that base models merely offer a glimpse into the potential of how AI can transform their business – and, in turn, the significant value that a fine-tuned LLM can offer.  

Fortunately, much like LLMs themselves, the concept of fine-tuning is a nascent one. As fine-tuning methods grow in sophistication, they will push the boundaries of what language models are capable of. This in turn will result in a greater number of novel use cases, increased awareness and adoption of generative AI, and further innovation – creating a virtuous cycle that accelerates advancements in the field.

Kartik Talamadupula
Director of AI Research

Kartik Talamadupula is a research scientist who has spent over a decade applying AI techniques to business problems in automation, human-AI collaboration, and NLP.