Introduction
In recent months, I’ve had the enriching opportunity to immerse myself in the vibrant discourse around AI/ML at various conferences. As a product manager, my conversations tend toward the pragmatic aspects of leveraging AI, and I keep encountering the same set of questions about using Retrieval-Augmented Generation (RAG) and fine-tuning to enhance Large Language Models (LLMs). The curiosity isn’t purely technical; it extends just as much to the financial side.
This blog aims to offer a comparative look at the technical aspects and costs of fine-tuning and RAG across various models.
Large language models (LLMs) are powerful tools that can generate natural language texts for various applications, such as chatbots, summarization, translation, and more. However, LLMs are not perfect and may have some limitations, such as:
- LLMs may not have enough knowledge or domain expertise for specific tasks or datasets.
- LLMs may generate inaccurate, inconsistent, or harmful outputs that do not match the user’s expectations or needs.
To overcome these challenges, two common techniques are used to enhance the performance and capabilities of LLMs: fine-tuning and retrieval-augmented generation (RAG).
Fine-tuning is the process of re-training a pre-trained LLM on a specific task or dataset to adapt it for a particular application. For example, if you want to build a chatbot that can answer questions about movies, you can fine-tune an LLM like GPT-4 with a dataset of movie reviews and trivia. This way, the LLM can learn the relevant vocabulary, facts, and style for the movie domain.
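To make this concrete, here is a minimal sketch of kicking off such a fine-tuning job with the OpenAI Python SDK. The training file name (`movie_qa.jsonl`) and the choice of `gpt-3.5-turbo` as the base model are illustrative assumptions, not a production recipe.

```python
# Minimal sketch: launch a fine-tuning job via the OpenAI Python SDK (v1.x).
# Assumes OPENAI_API_KEY is set and "movie_qa.jsonl" (hypothetical) holds
# chat-formatted training examples, one JSON object per line.
from openai import OpenAI

client = OpenAI()

# 1. Upload the training data
training_file = client.files.create(
    file=open("movie_qa.jsonl", "rb"),
    purpose="fine-tune",
)

# 2. Start the fine-tuning job on a base model
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)
```

Once the job completes, the resulting fine-tuned model ID can be used in place of the base model name in chat completion calls.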
RAG is a framework that integrates information retrieval (search) into LLM text generation. The user’s prompt is used to retrieve external “context” information from a data store; that context is then combined with the original prompt to build a richer prompt containing information the LLM would not otherwise have had. For example, if you want to build a chatbot that can answer questions about a specific topic, you can use RAG to query your domain-specific knowledge base and pass the retrieved articles to the LLM as additional input. This way, the LLM can ground its answers in the most current, reliable, and pertinent facts for any query.
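As a rough illustration of that flow, the sketch below embeds a tiny in-memory knowledge base, retrieves the passages most similar to a question, and passes them to the model as context. The sample documents, model names, and the simple cosine-similarity retrieval are stand-in assumptions for a real vector database.

```python
# Minimal RAG sketch: embed documents, retrieve the closest ones for a query,
# and include them in the prompt. Illustrative only; a real system would use
# a vector database and proper chunking.
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

documents = [
    "Amazon S3 Standard storage is billed per GB-month.",
    "Spot instances can substantially reduce EC2 compute costs.",
    "A VPC isolates your cloud resources in a private network.",
]

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data])

doc_vectors = embed(documents)

def retrieve(query, k=2):
    q = embed([query])[0]
    scores = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

question = "How is S3 storage billed?"
context = "\n".join(retrieve(question))

answer = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": f"Answer using only this context:\n{context}"},
        {"role": "user", "content": question},
    ],
)
print(answer.choices[0].message.content)
```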
So, when should you use fine-tuning versus RAG for your LLM application? Here are some factors to consider:
| Considerations | Fine-tuning | RAG (Retrieval-Augmented Generation) |
| --- | --- | --- |
| Cost | High: Requires substantial computational resources and potentially specialized hardware like high-end GPUs or TPUs. | Moderate: Lower than fine-tuning, as it requires less labeled data and computing resources. The main cost is the setup of embedding and retrieval systems. |
| Complexity | High: Demands a deep understanding of deep learning and NLP, plus expertise in data preprocessing, model configuration, and evaluation. | Moderate: Requires coding and architectural skills, but is less complex than fine-tuning. |
| Accuracy | High: Enhances domain-specific understanding, leading to higher accuracy in predictions or generated outputs. | Variable: Excels at up-to-date responses and minimizing hallucinations; accuracy may vary by domain and task. |
| Domain Specificity | High: Can impart domain-specific terminology and nuances to the LLM. | Moderate: May not capture domain-specific patterns, vocabulary, and nuances as effectively as a fine-tuned model. |
| Up-to-date Responses | Low: Becomes a fixed snapshot of its training dataset and requires regular retraining for evolving data. | High: Can ensure updated responses by retrieving information from external, up-to-date documents. |
| Transparency | Low: Functions more like a ‘black box’, obscuring its reasoning. | Moderate to High: Identifies the documents it retrieves, enhancing user trust and comprehension. |
| Avoidance of Hallucinations | Moderate: Can reduce hallucinations by focusing on domain-specific data, but unfamiliar queries may still cause erroneous outputs. | High: Reduces hallucinations by anchoring responses in retrieved documents, effectively fact-checking the LLM’s responses. |
Is Fine-Tuning LLMs or Implementing RAG Expensive?
Both fine-tuning and RAG involve some costs and challenges that need to be considered before implementing them for your LLM application. Here are some examples:
- The cost of compute power: Both fine-tuning and RAG require significant amounts of compute power to train and run your models. Depending on the size and complexity of your models and data, this may incur substantial expenses for cloud services or hardware resources. For example, according to OpenAI’s pricing, fine-tuning GPT-3.5 Turbo costs $0.008 per 1K tokens for training and $0.012 per 1K tokens for input usage. Running RAG also requires additional compute power for the embedding models and vector databases used for information retrieval.
- The cost of data acquisition and maintenance: Both fine-tuning and RAG require high-quality data that is relevant and up to date for your task or domain. Depending on the availability and accessibility of such data, this may involve expenses for data collection, cleaning, labeling, storage, and updating. For example, according to AWS’s pricing, storing 1TB of data on Amazon S3 costs $23.55 per month, and 1TB of data transfer out of Amazon EC2 costs $90 per month.
- The technical feasibility and complexity: Both fine-tuning and RAG require advanced technical skills and knowledge to implement and optimize your models and data. Depending on the level of customization and sophistication you want to achieve, this may involve challenges such as choosing the right model architecture, hyperparameters, loss function, evaluation metrics, embedding methods, vector databases, and so on. For example, according to a blog post by Experts Exchange, implementing RAG involves several steps: loading data, chunking data, embedding data, indexing data, serving data, and generating responses (a toy chunking sketch follows this list).
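As a small illustration of one of those steps, here is what “chunking data” might look like; the chunk size, overlap, and file name are arbitrary assumptions rather than recommendations.

```python
# Toy chunking step for a RAG pipeline: split a long document into
# overlapping, fixed-size character chunks before embedding and indexing.
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping character chunks."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

with open("cloud_docs.txt") as f:  # hypothetical knowledge-base file
    pieces = chunk_text(f.read())
print(f"{len(pieces)} chunks ready for embedding")
```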
Simulating an Example
To illustrate how fine-tuning and RAG can be used for an LLM application, let’s simulate an example of building a chatbot that answers questions about cloud computing. The sample numbers below are based on current market offerings for fine-tuning, vector databases, and compute power, along with sample timelines. They are for illustrative purposes only and may not reflect the actual costs and timelines for your specific application.
Below is the simulated example for both fine-tuning and RAG with GPT-3.5 Turbo, GPT-4, LLAMA 2, and Claude 2, assuming 10 million tokens. The computations are based on a scenario where compute resources run for 15 days, 24 hours a day.
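To make the arithmetic easy to check, here is a small helper that reproduces the back-of-the-envelope calculations in the breakdowns below. All rates are the sample figures quoted in this post and are illustrative only.

```python
# Reproduces the sample cost arithmetic used in this post (illustrative rates).
def token_cost(rate_per_1k, tokens):
    """Cost of processing `tokens` at a per-1K-token rate."""
    return rate_per_1k * tokens / 1_000

def compute_cost(rate_per_hour, days=15, hours_per_day=24):
    """Compute-power cost for a continuous run."""
    return rate_per_hour * days * hours_per_day

TOKENS = 10_000_000  # 10 million tokens

# GPT-3.5 Turbo fine-tuning: training + input + output usage, plus GPU time
gpt35_finetune = (
    token_cost(0.008, TOKENS)    # $80  training
    + token_cost(0.012, TOKENS)  # $120 input usage
    + token_cost(0.016, TOKENS)  # $160 output usage
    + compute_cost(0.5)          # $180 compute
)
print(gpt35_finetune)  # 540.0
```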
Fine-tuning:
GPT-3.5 Turbo:
- LLM/Embedding Cost: training at $0.0080 per 1K tokens × 10,000 (1K-token units in 10 million tokens) = $80, plus input usage at $0.0120 per 1K tokens × 10,000 = $120, plus output usage at $0.0160 per 1K tokens × 10,000 = $160; Total = $360
- Compute Power Cost: $0.5 per hour × 15 days × 24 hours = $180
- Total Cost: LLM/Embedding Cost + Compute Power Cost = $360 + $180 = $540
Claude 2 (Fine-tuning):
- LLM/Embedding Cost: $1.63 per million tokens (for prompt) × 10 million tokens = $16.30 + $5.51 per million tokens (for completion) × 10 million tokens = $55.10; Total = $71.40
- Compute Power Cost: $0.5 per hour × 15 days × 24 hours = $180
- Total Cost: LLM/Embedding Cost + Compute Power Cost = $71.40 + $180 = $251.40
RAG:
GPT-3.5 Turbo:
- LLM Usage Cost: $280 (input usage $120 + output usage $160, using the rates from the fine-tuning example above)
- Embedding Model Cost: $0.0001 per 1K tokens × 10,000 = $1
- Vector Database Cost: $70 (Standard Plan for Pinecone)
- Compute Power Cost: $0.6 per hour (GPU + CPU) × 24 hours × 15 days = $216
- Total Monthly Operating Cost: $280 + $1 + $70 + $216 = $567
GPT-4 (RAG):
- LLM/Embedding Cost: input usage $0.03 per 1K tokens × 10,000 = $300 + output usage $0.06 per 1K tokens × 10,000 = $600; Total = $900
- Vector Database Cost (Pinecone): $70 (Standard Plan)
- Compute Power Cost: ($0.5 per hour for GPU + $0.1 per hour for CPU) × 15 days × 24 hours = $216
- Total Cost: LLM/Embedding Cost + Vector Database Cost + Compute Power Cost = $900 + $70 + $216 = $1,186
Claude 2:
- LLM/Embedding Cost: $11.02 per million tokens (for prompt) × 10 million tokens = $110.20 + $32.68 per million tokens (for completion) × 10 million tokens = $326.80; Total = $437
- Vector Database Cost (Pinecone): $70 (Standard Plan)
- Compute Power Cost: ($0.5 per hour for GPU + $0.1 per hour for CPU) × 15 days × 24 hours = $216
- Total Cost: LLM/Embedding Cost + Vector Database Cost + Compute Power Cost = $437 + $70 + $216 = $723
LLAMA 2:
- LLM/Embedding Cost: Free
- Vector Database Cost (Pinecone): $70 (Standard Plan)
- Compute Power Cost: ($0.5 per hour for GPU + $0.1 per hour for CPU) × 15 days × 24 hours = $216
- Total Cost: LLM/Embedding Cost + Vector Database Cost + Compute Power Cost = $0 + $70 + $216 = $286
Now, let’s present the above calculations in a tabular format for easier comparison:
| Component | Fine-tuning: GPT-3.5 Turbo | Fine-tuning: Claude 2 | Fine-tuning: LLAMA 2 | RAG: GPT-3.5 Turbo | RAG: Claude 2 | RAG: LLAMA 2 | RAG: GPT-4 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LLM/Embedding Cost ($) | 360 | 71.40 | 0 | 281 | 437 | 0 | 900 |
| Vector Database Cost ($) | N/A | N/A | N/A | 70 | 70 | 70 | 70 |
| Compute Power Cost ($) | 180 | 180 | 180 | 216 | 216 | 216 | 216 |
| Total Cost ($) | 540 | 251.40 | 180 | 567 | 723 | 286 | 1,186 |
| Duration | 15 days | 15 days | 15 days | Monthly | Monthly | Monthly | Monthly |
The comparison highlights varying cost structures across the models for fine-tuning and RAG. In this simulation, GPT-3.5 Turbo is the costliest fine-tuning option, while RAG with LLAMA 2 is notably economical; GPT-4 carries the highest RAG cost, reflecting its higher per-token rates. Compute power is a recurring cost in both setups. The choice between models and setups hinges on budget and the desired balance between customization and broad topical coverage. Fine-tuning typically demands more compute power and time than RAG, but it may produce more accurate and customized responses for the cloud computing domain; RAG requires less compute power and time, and its responses can draw on a broader and more up-to-date range of sources and topics.