While AI has been around in various forms for decades and has had mainstream applications, such as chatbots and virtual assistants (e.g., Alexa), it was ChatGPT that undoubtedly sparked the AI revolution we are currently experiencing. With the most recent estimates placing its user base at over 180 million people, ChatGPT is not only the most popular AI application but one of the most popular applications overall – and the fastest-growing consumer application in history.

However, despite its immense capabilities and advantages, ChatGPT – or, more specifically, GPT, the large language model (LLM) that powers the application – has a few limitations, particularly when it comes to commercial use.

Firstly, it lacks specialized knowledge. Naturally, this is to be expected: GPT, which stands for Generative Pre-trained Transformer, can’t be expected to know everything, especially when the overall field of human knowledge is growing so rapidly. More practically, GPT also has a knowledge cutoff based on when its training process ended; for example, the latest GPT-4o models’ knowledge ends in October 2023.

Secondly, and more importantly, there are limitations around the use of private and/or proprietary data. On one hand, there’s no guarantee that GPT will understand an organization’s distinct data formats or the nature of requests made by users – resulting in diminished efficacy at more specialized tasks. On the other hand, there’s the problem of OpenAI using the sensitive data fed into GPT to train future models. Consequently, companies that enter private data into GPT may be unintentionally sharing sensitive information – making them non-compliant with data privacy laws.

Despite these challenges, as organizations become increasingly aware of the productivity-enhancing and cost-saving potential of generative AI, they are motivated to integrate LLMs like GPT with their distinct workflows and proprietary and private data. This is where LLM fine-tuning comes in. 

Fine-tuning is the process of taking a pre-trained base LLM and further training it on a specialized dataset for a particular task or knowledge domain. The pre-training stage involves inputting vast amounts of unstructured data from various sources into an LLM. Fine-tuning, conversely, requires a smaller, better-curated, and labeled domain- or task-specific dataset.

With all this in mind, this post takes you through the process of fine-tuning GPT for conversational data. We will detail how to access OpenAI’s interface, load the appropriate dataset, fine-tune your choice of model, monitor its progress, and make improvements, if necessary. 

How to Fine-Tune GPT on Conversational Data: Step-by-Step

Set Up Development Environment

First, you need to prepare your development environment by installing the OpenAI software development kit (SDK). We will be using the Python version of the SDK for our code examples, but it is also available in Node.js and .NET. We are also installing the python-dotenv package, which we will use to load our environment variables.

pip install openai python-dotenv

# For Python Version 3 and above

pip3 install openai python-dotenv


From there, you can import the OpenAI class and instantiate it as shown below. This returns a client object that is used to access the OpenAI interface and acts as a wrapper around calls to its various APIs.

import os
from dotenv import load_dotenv
from openai import OpenAI

# Load the variables defined in the .env file into the environment
load_dotenv()

client = OpenAI(
  api_key=os.environ['OPENAI_API_KEY'],
)


To access OpenAI’s APIs, you’ll also need an API key, which you can obtain by signing up for its developer platform. Above, load_dotenv() reads the key from a .env file like the one shown below, and os.environ retrieves it from the environment.

# .env 

OPENAI_API_KEY=your_openai_api_key


Choose a Model to Fine-Tune

With your environment established, you need to choose the model you want to fine-tune; OpenAI currently offers the following models for fine-tuning: 

  • gpt-4o-mini-2024-07-18
  • gpt-3.5-turbo
  • davinci-002
  • babbage-002

When looking at OpenAI’s pricing, you can see that the newest model, gpt-4o-mini, is the second-lowest priced after babbage-002 – despite being the most current model with the largest context length. This is because gpt-4o-mini is a scaled-down version of GPT with fewer parameters, which results in a lower computational load and, consequently, lower costs. In contrast, gpt-3.5-turbo and davinci-002 are larger models with a greater number of parameters and a more complex architecture – hence their higher training prices. Ultimately, your choice of model will depend on the specific needs of your conversational use case – and your allotted budget.

Prepare the Datasets

Having set up your environment and decided on the model you want to fine-tune, next comes the vital step of preparing your fine-tuning data. For this example, we’re going to use the Anthropic_HH_Golden dataset, hosted on Hugging Face, which is an excellent resource for datasets, as well as many other aspects of AI application development.

The Anthropic dataset is suitable because it contains a large variety of conversational data to use in our fine-tuning use case. Note that OpenAI expects fine-tuning data in JSON Lines (JSONL) format. For the legacy completions models (babbage-002 and davinci-002), each line is a prompt-completion pair, as shown below:

{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
…
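
For the chat models (gpt-4o-mini and gpt-3.5-turbo), each line must instead follow the Chat Completions messages format:

{"messages": [{"role": "system", "content": "<system text>"}, {"role": "user", "content": "<prompt text>"}, {"role": "assistant", "content": "<ideal generated text>"}]}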


Lastly, this dataset is already conveniently divided into training and evaluation subsets, which saves us the effort of splitting it ourselves. Dividing the dataset in this way ensures that the model encounters different data during the fine-tuning and evaluation stages, which helps to prevent overfitting, i.e., where the model can’t generalize to unseen data. 

To download the dataset, you must clone its git repository onto your device with the following command:

git clone https://huggingface.co/datasets/Unified-Language-Model-Alignment/Anthropic_HH_Golden
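
The cloned dataset will not match these formats out of the box: in HH-style datasets, each record’s "chosen" field typically holds a full dialogue transcript of alternating "Human:" and "Assistant:" turns. Below is a minimal conversion sketch under that assumption; the file names and field names are assumptions you should verify against the cloned repository before running it.

import json
import re

def transcript_to_messages(transcript):
    """Split an HH-style transcript into Chat Completions messages."""
    messages = []
    # Split only at turn boundaries so multi-paragraph turns stay intact
    for turn in re.split(r"\n\n(?=Human:|Assistant:)", transcript.strip()):
        if turn.startswith("Human:"):
            messages.append({"role": "user", "content": turn[len("Human:"):].strip()})
        elif turn.startswith("Assistant:"):
            messages.append({"role": "assistant", "content": turn[len("Assistant:"):].strip()})
    return messages

def convert(src_path, dst_path):
    """Convert one JSONL file of transcripts into the chat fine-tuning format."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            record = json.loads(line)
            messages = transcript_to_messages(record["chosen"])
            if messages:
                dst.write(json.dumps({"messages": messages}) + "\n")

# Hypothetical file names -- check the cloned repository for the actual ones
convert("Anthropic_HH_Golden/train.jsonl", "training.jsonl")
convert("Anthropic_HH_Golden/test.jsonl", "evaluation.jsonl")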


Upload the Training Datasets

Once you have prepared your datasets, the next step is uploading them using the Files API. The code snippet below uploads the training and evaluation datasets we created earlier and uses them to create a pair of file objects that will be used during the fine-tuning process. 

training_dataset = client.files.create(
  file=open("training.jsonl", "rb"),
  purpose="fine-tune"
)

evaluation_dataset = client.files.create(
  file=open("evaluation.jsonl", "rb"),
  purpose="fine-tune"
)


Printing the file object allows you to examine its structure, an example of which is shown below:  

print(training_dataset)

#Response

{
  "id": "file-vGysBmsoqVn9c2TKB2D3s9",
  "object": "file",
  "bytes": 120000,
  "created_at": 1677610602,
  "filename": "training.jsonl",
  "purpose": "fine-tune"
}


The key thing of note here is the “id” attribute, which is used to uniquely identify the file object. 

Create a Fine-Tuning Job

After loading your datasets and turning them into the required file objects, it’s now time to create a fine-tuning job through the fine-tuning API. In addition to creating a job, the fine-tuning API enables you to retrieve an existing job, check the status of a job, cancel a job, or list all existing jobs. 
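
For reference, these operations map onto the Python SDK as shown below (the job ID is a placeholder):

# List the 10 most recent fine-tuning jobs
client.fine_tuning.jobs.list(limit=10)

# Retrieve a specific job by its ID
client.fine_tuning.jobs.retrieve("ftjob-abc123")

# Cancel a running job
client.fine_tuning.jobs.cancel("ftjob-abc123")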

The only required parameters are model, the name of the model you want to fine-tune (which can be one of OpenAI’s base models, discussed earlier, or an existing fine-tuned model), and training_file, the ID of the file object that was returned when uploading the training file in the previous step. You can copy and paste the ID in explicitly, or access the object’s id attribute directly, as shown below.

Additionally, as we have loaded an evaluation dataset, we will use it to create our fine-tuning job object, resulting in the code below. 

ft_job = client.fine_tuning.jobs.create(
  model="model_name",
  training_file=training_dataset.id,
  validation_file=evaluation_dataset.id,
)


You can also choose to include hyperparameters when initially creating your fine-tuning job. OpenAI presently allows the configuration of three hyperparameters:

  • Number of epochs: the number of complete passes through the entire fine-tuning dataset
  • Learning rate multiplier:  a scaling factor applied to the base learning rate, which changes the speed at which the model’s weights are updated during fine-tuning
  • Batch size: the number of fine-tuning samples processed simultaneously before the model’s parameters are updated

However, it is recommended that you start fine-tuning without explicitly defining any hyperparameters, as OpenAI’s API will automatically configure them based on the size of the dataset. If you nonetheless want to include hyperparameters to establish finer control over the fine-tuning process, the code from above would look as follows (using example values):

ft_job = client.fine_tuning.jobs.create(
  model="model_name",
  training_file=training_dataset.id,
  validation_file=evaluation_dataset.id,
  hyperparameters={
    "n_epochs": 10,
    "batch_size": 8,
    "learning_rate_multiplier": 0.3
  }
)


Similar to loading the fine-tuning data, the above operation will return a fine-tuning job object. This also has an id that remains important throughout the process as it’s used to reference and access the fine-tuning job for subsequent tasks. Once a fine-tuning job is complete, you will receive confirmation via email; the amount of time this requires will differ, depending on your choice of model and the size of the dataset.

Check the Status of Your Model During Fine-Tuning 

While your model is being fine-tuned, i.e., the job is in progress, you can check its ongoing status by requesting a list of events. OpenAI provides the following training metrics during the fine-tuning process:

  • Training loss: measures how well the model’s predictions match the target values in the training dataset; the lower the training loss, the better the model’s performance.
  • Training token accuracy: the percentage of tokens, i.e., segments of output, correctly predicted by the model.
  • Valid loss: measures the model’s performance on the evaluation (or validation) dataset, assessing its ability to generalize to unseen data points.
  • Valid token accuracy: the percentage of tokens correctly predicted by the model for the evaluation dataset – indicating how accurately it generalizes to new data.

You can request a list of events with the code below. The limit parameter defines how many events to return; if left undefined, OpenAI returns 10 by default.

events = client.fine_tuning.jobs.list_events(
  fine_tuning_job_id=ft_job.id,
  limit=2
)

# Each event includes a human-readable message, e.g., reported metrics
for event in events.data:
  print(event.message)


Access Your Fine-Tuned Model

Although the fine-tuned model should be ready once the job is complete, it may take a few minutes for it to become available for use. If you cannot find your model through its id, or requests to your model time out, it is probably still loading and will be available shortly.

When a fine-tuning job has completed successfully, however, you will be able to use the retrieve function to look it up by its id. You will see that the fine_tuned_model attribute now contains the name of the model, where it was previously null. Additionally, the status attribute should now read “succeeded”.

The code below shows how to retrieve a fine-tuning job by its id and the structure of the object you will receive in response. Again, while you could explicitly enter the fine-tuning job’s id, here, we are accessing it via the id attribute of the object. 

ft_retrieve = client.fine_tuning.jobs.retrieve(ft_job.id)

print(ft_retrieve)

#Response

{
  "object": "fine_tuning.job",
  "id": "ftjob-abc123",
  "model": "davinci-002",
  "created_at": 1692661014,
  "finished_at": 1692661190,
  "fine_tuned_model": "ft:davinci-002:my-org:custom_suffix:7q8mpxmy",
  "organization_id": "org-123",
  "result_files": [
      "file-abc123"
  ],
  "status": "succeeded",
  "validation_file": "file-sFseAwXoqWn8c2ZDB24j4",
  "training_file": "file-vGysBmsoqVn9c2TKB2D3s9",
  "hyperparameters": {
      "n_epochs": 4,
      "batch_size": 1,
      "learning_rate_multiplier": 1.0
  },
  "trained_tokens": 5768,
  "integrations": [],
  "seed": 0,
  "estimated_finish": 0
}


You can now use this model by passing its name as a parameter to the Chat Completions API (for gpt-3.5-turbo and gpt-4o-mini) or within the OpenAI Playground to test its capabilities. Alternatively, if you’ve selected babbage-002 or davinci-002, you would use the legacy Completions API.

completion = client.chat.completions.create(
  model="your fine-tuned model",
  messages=[
    {"role": "system", "content": "insert context here"},
    {"role": "user", "content": "insert prompt here"}
  ]
)

print(completion.choices[0].message)


Access Model Checkpoints 

As well as producing a final model when a fine-tuning job is complete, model checkpoints are created at the end of each training epoch. Each checkpoint is a complete model that can be used in the same way as a fully fine-tuned model. 

Checkpointing is highly beneficial as it provides a layer of fault recovery: giving you a jumping-off point if your model crashes or the training process is interrupted for any reason – which increases in likelihood with the size of your model. Similarly, they give you a place to revert to if your model’s performance decreases with extra training, e.g., starts to overfit. All in all, model checkpointing adds structure and security to the fine-tuning process and allows for more experimentation. 

To access model checkpoints, the fine-tuning job must first finish successfully; you can confirm its completion by querying the status of the job, as shown in the previous step. From there, querying the checkpoints endpoint with the fine-tuning job’s id will produce a list of checkpoints associated with that job. As with requesting events, the limit parameter defines how many checkpoints to return – otherwise, it is 10 by default.

client.fine_tuning.jobs.list_checkpoints(
  fine_tuning_job_id=ft_job.id,
  limit=2
)


For each checkpoint object, the fine_tuned_model_checkpoint field is populated with the name of the model checkpoint, as shown below.

{
  "object": "list",
  "data": [
    {
      "object": "fine_tuning.job.checkpoint",
      "id": "ftckpt_zc4Q7MP6XxulcVzj4MZdwsAB",
      "created_at": 1721764867,
      "fine_tuned_model_checkpoint": "ft:gpt-4o-mini-2024-07-18:my-org:custom-suffix:96olL566:ckpt-step-2000",
      "metrics": {
        "full_valid_loss": 0.134,
        "full_valid_mean_token_accuracy": 0.874
      },
      "fine_tuning_job_id": "ftjob-abc123",
      "step_number": 2000
    },
    {
      "object": "fine_tuning.job.checkpoint",
      "id": "ftckpt_enQCFmOTGj3syEpYVhBRLTSy",
      "created_at": 1721764800,
      "fine_tuned_model_checkpoint": "ft:gpt-4o-mini-2024-07-18:my-org:custom-suffix:7q8mpxmy:ckpt-step-1000",
      "metrics": {
        "full_valid_loss": 0.167,
        "full_valid_mean_token_accuracy": 0.781
      },
      "fine_tuning_job_id": "ftjob-abc123",
      "step_number": 1000
    }
  ],
  "first_id": "ftckpt_zc4Q7MP6XxulcVzj4MZdwsAB",
  "last_id": "ftckpt_enQCFmOTGj3syEpYVhBRLTSy",
  "has_more": true
}
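
Because each checkpoint is a complete model, you can pass its fine_tuned_model_checkpoint name to the Chat Completions API exactly as you would the final fine-tuned model:

completion = client.chat.completions.create(
  model="ft:gpt-4o-mini-2024-07-18:my-org:custom-suffix:96olL566:ckpt-step-2000",
  messages=[
    {"role": "user", "content": "insert prompt here"}
  ]
)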


Improving Your Model

If, after testing your model, it doesn’t perform as expected or isn’t as consistently accurate as you anticipated, it is time to improve it. OpenAI enables you to refine your model in three ways:

  • Quality: improving the quality of the fine-tuning data
    • Double-check all data points are formatted correctly
    • If your model struggles with particular types of prompts, add data points that directly demonstrate how the model should respond to them.
    • Refine your dataset’s diversity, i.e., ensure it has examples that reflect an accurate range of prompts and responses. 
  • Quantity: increasing the size of the dataset
    • The more complex the task for which you are fine-tuning, the more data you’re likely to require.
    • Increasing the size of the dataset means it is likely to contain a greater number of unconventional data points, i.e., edge cases, allowing the model to learn to generalize to them more effectively. 
    • Increasing the size of the dataset is also likely to remediate overfitting, as the model has more data from which to learn its true, underlying relationships – as opposed to just learning the correct responses. 
  • Hyperparameters: adjusting the hyperparameters of the fine-tuning job. Here are some guidelines on when to increase or decrease each hyperparameter (a worked example follows this list):
    • Number of epochs 
      • Increase if: the model is underfitting, i.e., underperforming on both training and validation data; the model is converging slowly, i.e., the model’s training and valid loss is decreasing but has not stabilized.
      • Decrease if: the model is overfitting, i.e., performing well on training data but not the evaluation dataset; the model converges early in the training process but loss increases after additional epochs. 
    • Learning rate multiplier
      • Increase if: the model is converging slowly; you’re working with a particularly large dataset.
      • Decrease if: the model’s loss fluctuates considerably, i.e., oscillation; it is overfitting.
    • Batch size:
      • Increase if: fine-tuning is proceeding successfully – you can probably afford to increase the batch size to accelerate the process; or the model’s loss is oscillating, as larger batches produce smoother, more stable updates.
      • Decrease if: the model is converging poorly (smaller batches allow models to learn the data more thoroughly); the model is overfitting and other hyperparameter adjustments prove ineffective. 
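
For instance, if the validation metrics point to overfitting, you might relaunch the job with fewer epochs and a lower learning rate multiplier (the values below are purely illustrative):

ft_job_v2 = client.fine_tuning.jobs.create(
  model="model_name",
  training_file=training_dataset.id,
  validation_file=evaluation_dataset.id,
  hyperparameters={
    "n_epochs": 2,
    "learning_rate_multiplier": 0.1
  }
)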

Conclusion

In summary, the steps for fine-tuning GPT on conversational data include:

  • Setting up your development environment
  • Choosing a model to fine-tune
  • Preparing the datasets
  • Uploading the training datasets
  • Creating a fine-tuning job
  • Checking the status of your model during fine-tuning
  • Accessing your fine-tuned model
  • Accessing model checkpoints
  • Improving your model

Fine-tuning is an intricate process but can transform the efficacy of generative AI applications when applied correctly. We encourage you to develop your understanding and competency through further experimentation, including configuring different hyperparameters, loading different datasets, and fine-tuning the other models OpenAI has available.

Alternatively, if you’d prefer to sidestep the process of fine-tuning an LLM altogether, Nebula LLM is specialized to support your organization’s conversational use cases. 

Nebula LLM is Symbl.ai’s proprietary large language model specialized for human interactions. Fine-tuned on well-curated datasets containing over 100,000 business interactions across sales, customer success, and customer service and on 50 conversational tasks such as chain of thought reasoning, Q&A, conversation scoring, intent detection, and others, Nebula is ideal for tasks and workflows involving conversational data: 

  • Automated Customer Support: Nebula LLM can be used to equip chatbots with more authentic, engaging, and helpful conversational capabilities.
  • Real-time Agent Assistance: extract key insights and trends to help human agents on live calls and enhance customer support, including generating conversational summaries, generating responses to objections, and suggesting follow-up actions.
  • Call Scoring: score the important conversations taking place within your organization, based on performance criteria such as communication and engagement, question handling, and forward motion to assess a human agent’s performance and enable targeted coaching. 

To learn more about the model, sign up for access to the Nebula Playground.
