Until recently, building a large language model (LLM) from scratch was a difficult and involved process, reserved for larger organizations able to afford the considerable computational resources and highly skilled engineers required. 

Today, with an ever-growing collection of knowledge and resources, developing a custom LLM is increasingly feasible. Organizations of all sizes can harness the power of a bespoke language model to develop highly-specialized generative AI applications that will boost their productivity, enhance their efficiency and sharpen their competitive edge.

In this guide, we detail how to build your own LLM from the ground up – from architecture definition and data curation to effective training and evaluation techniques. 

Determine the Use Case For Your LLM 

The first – and arguably most important – step in building an LLM from scratch is defining what it will be used for: what its purpose will be. 

This is crucial for several reasons, with the first being how it influences the size of the model. In general, the more complicated the use case, the more capable the required model – and the larger it needs to be, i.e., the more parameters it must have. 

In turn, the more parameters a model has, the more training data you will need. The LLM’s intended use case also determines the type of training data you will need to curate. Once you have a better idea of how big your LLM needs to be, you will also have more insight into the computational resources required, i.e., memory, storage space, etc.  

In an ideal scenario, clearly defining your intended use case will also clarify why you need to build your own LLM from scratch – as opposed to fine-tuning an existing base model.  

Key reasons for creating your own LLM can include: 

  • Domain-Specificity: training your LLM with industry-specific data that aligns with your organization’s distinct operations and workflow. 
  • Greater Data Security: incorporating sensitive or proprietary information without fear of how it will be stored and used by an open-source or proprietary model. 
  • Ownership and Control: by retaining control over the model and its confidential training data, you can improve your own LLM over time – as your knowledge grows and your needs evolve.

Create Your Model Architecture

Having defined the use case for your LLM, the next stage is defining the architecture of its neural network. This is the heart, or engine, of your model and will determine its capabilities and how well it performs at its intended task. 

The transformer architecture is the best choice for building LLMs because of its ability to capture underlying patterns and relationships from data, handle long-range dependencies in text, and process input of variable lengths. Additionally, its self-attention mechanism allows it to process different parts of input in parallel, allowing it to utilize hardware, i.e., graphics processing units (GPUs), more efficiently than architectures that preceded it, e.g., recurrent neural networks (RNNs) and long short-term memory (LSTMs). Consequently, the transformer has emerged as the current state-of-the-art neural network architecture and has been incorporated into leading LLMs since its introduction in 2017. 

Previously, an organization would have had to develop the components of a transformer on its own, which required both considerable time and specialized knowledge. Fortunately, today, there are frameworks specifically designed for neural network development that provide these components out of the box – with PyTorch and TensorFlow being two of the most prominent.  

PyTorch is a deep learning framework developed by Meta and is renowned for its simplicity and flexibility, which makes it ideal for prototyping. TensorFlow, created by Google, is a more comprehensive framework with an expansive ecosystem of libraries and tools that enable the production of scalable, production-ready machine learning models. 

Creating The Transformer’s Components

Embedding Layer 

This is where input enters the model and is converted into a series of vector representations that can be more efficiently understood and processed.

This occurs over several steps:

  • A tokenizer breaks down the input into tokens. In some cases, each token is a word but the current favored approach is to divide input into sub-word tokens of approximately four characters or ¾ words. 
  • Each token is assigned an integer ID and saved in a dictionary to dynamically build a vocabulary. 
  • Each integer is converted into a multi-dimensional vector, called an embedding, with each characteristic or feature of the token represented by one of the vector’s dimensions.  

A transformer has two embedding layers: one within the encoder for creating input embeddings and the other inside the decoder for creating output embeddings. 
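
To make these steps concrete, here is a minimal PyTorch sketch of the embedding layer. The whitespace tokenizer and the sizes (max_vocab_size, d_model) are illustrative assumptions; production LLMs use sub-word tokenizers such as byte-pair encoding (BPE) instead.

```python
import torch
import torch.nn as nn

# Minimal sketch: naive whitespace tokenizer + embedding lookup.
vocab: dict[str, int] = {}  # token -> integer ID, built dynamically

def tokenize(text: str) -> list[int]:
    """Break the input into tokens and assign each an integer ID."""
    ids = []
    for token in text.lower().split():
        if token not in vocab:
            vocab[token] = len(vocab)   # next free integer ID
        ids.append(vocab[token])
    return ids

d_model = 512            # embedding dimension, as in the original transformer
max_vocab_size = 10_000  # illustrative upper bound on the vocabulary

embedding = nn.Embedding(num_embeddings=max_vocab_size, embedding_dim=d_model)

token_ids = torch.tensor([tokenize("building an llm from scratch")])
input_embeddings = embedding(token_ids)   # shape: (1, 5, 512)
```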

Positional Encoder

Instead of utilizing recurrence or maintaining an internal state to track the position of tokens within a sequence, the transformer generates positional encodings and adds them to each embedding. This is a key strength of the transformer architecture, as it can process tokens in parallel instead of sequentially while still keeping track of long-range dependencies. 

Like embeddings, a transformer creates positional encoding for both input and output tokens in the encoder and decoder, respectively.  
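
Below is a sketch of the sinusoidal positional encoding used in the original transformer paper; d_model and max_len are illustrative, and learned positional embeddings are a common alternative.

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Sinusoidal positional encodings, as in the original transformer."""

    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)   # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)    # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)    # odd dimensions
        self.register_buffer("pe", pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); add the encoding for each position
        return x + self.pe[: x.size(1)]
```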

Self-Attention Mechanism 

This is the most crucial component of the transformer – and what distinguishes it from other network architectures – as it is responsible for comparing each embedding against others to determine their similarity and semantic relevance. The self-attention layer generates a weighted representation of the input that captures the underlying relationships between tokens, which is used to calculate the most probable output.

At each self-attention layer, the input is projected across several smaller dimensional spaces known as heads – hence the term multi-head attention. Each head independently focuses on a different aspect of the input sequence in parallel, enabling the LLM to develop a richer understanding of the data in less time. The original transformer used eight attention heads, but you may decide on a different number based on your objectives. However, the more attention heads you use, the greater the computational resources required, so your choice will be constrained by the available hardware. 

Multiple attention heads enhance a model’s performance as well as its reliability: if one of the heads fails to capture important information from the input, the other heads can compensate, resulting in a more robust training process.

Both the encoder and decoder contain self-attention components: the encoder has one multi-head attention layer while the decoder has two. 
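
As a quick sketch, PyTorch provides multi-head attention out of the box; the sizes below mirror the original transformer (eight heads over a 512-dimensional embedding), and the random input tensor is a stand-in for real embeddings.

```python
import torch
import torch.nn as nn

d_model, num_heads = 512, 8   # eight heads, as in the original transformer

# Each head attends to the sequence in a smaller (d_model / num_heads)-dimensional subspace.
self_attention = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)

x = torch.randn(1, 10, d_model)   # (batch, seq_len, d_model) - dummy embeddings
attn_output, attn_weights = self_attention(query=x, key=x, value=x)
print(attn_output.shape)    # torch.Size([1, 10, 512])
print(attn_weights.shape)   # torch.Size([1, 10, 10]) - weights averaged over heads
```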

Feed-Forward Network

This layer captures the higher-level features, i.e., more complex and detailed characteristics, of the input sequence, so the transformer can recognize the data’s more intricate underlying relationships. It consists of three sub-layers (a code sketch follows the list below):  

  • First Linear Layer: this takes the input and projects it onto a higher-dimensional space (e.g., 512 to 2048 in the original transformer) to store more detailed representations.
  • Non-Linear Activation Function: this introduces non-linearity into the model, which helps in learning more realistic and nuanced relationships. A commonly used activation function is the Rectified Linear Unit (ReLU). 
  • Second Linear Layer: transforms the higher-dimensional representation back to the original dimensionality, compressing the additional information from the higher-dimensional space back to a lower-dimensional space while retaining the most relevant aspects. 
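
Here is the sketch referenced above: a minimal PyTorch version of this sub-layer structure, using the original transformer’s dimensions (512 → 2048 → 512) and ReLU; the dropout value is an illustrative assumption.

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise feed-forward network: expand, apply non-linearity, compress."""

    def __init__(self, d_model: int = 512, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # first linear layer: project up (512 -> 2048)
            nn.ReLU(),                  # non-linear activation
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),   # second linear layer: project back down
        )

    def forward(self, x):
        return self.net(x)
```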

Normalization Layers

This layer ensures that activations remain within a reasonable range and helps mitigate vanishing or exploding gradients, stabilizing the language model and allowing for a smoother training process.   

In particular, the transformer architecture utilizes layer normalization, which normalizes the activations of each token across its features at every layer – as opposed to batch normalization, for example, which normalizes each feature across all the examples in a batch. Layer normalization is ideal for transformers because it maintains the relationships between the features of each token and does not interfere with the self-attention mechanism. 

Residual Connections

Also called skip connections, they feed the output of one layer directly into the input of another, so data flows through the transformer more efficiently. By preventing information loss, they enable faster and more effective training.

During forward propagation, i.e., as training data is fed into the model, residual connections provide an additional pathway that ensures the original data is preserved and can bypass the transformations at that layer. Conversely, during backward propagation, i.e., when the model adjusts its parameters according to its loss function, residual connections help gradients flow more easily through the network, helping to mitigate vanishing gradients – where gradients become increasingly smaller as they pass through more layers.
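
The sketch below combines the two previous ideas into a reusable “add & norm” wrapper, following the original post-norm transformer layout; the dropout value is illustrative.

```python
import torch.nn as nn

class ResidualNorm(nn.Module):
    """Wraps a sublayer (attention or feed-forward) with a residual
    connection followed by layer normalization (post-norm layout)."""

    def __init__(self, d_model: int = 512, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # The input x skips over the sublayer and is added back to its output.
        return self.norm(x + self.dropout(sublayer(x)))
```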

Assembling the Encoder and Decoder

Once you have created the transformer’s individual components, you can assemble them to create an encoder and decoder. 

Encoder 

The role of the encoder is to take the input sequence and convert it into a weighted embedding that the decoder can use to generate output. 

The encoder is constructed as follows (a code sketch follows the list):

  • Embedding layer
  • Positional encoder
    • Residual connection that feeds into normalization layer 
  • Self-attention mechanism
  • Normalization layer
    • Residual connection that feeds into normalization layer 
  • Feed-Forward network 
  • Normalization layer 
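
Here is the sketch referenced above: one encoder block assembled from the components defined earlier, reusing the FeedForward and ResidualNorm classes from the previous sketches (the embedding layer and positional encoder are applied once, before the first block).

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder block, assembled from the component sketches above
    (nn.MultiheadAttention plus the FeedForward and ResidualNorm classes)."""

    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ff = FeedForward(d_model)           # from the feed-forward sketch
        self.addnorm1 = ResidualNorm(d_model)    # add & norm around self-attention
        self.addnorm2 = ResidualNorm(d_model)    # add & norm around the feed-forward network

    def forward(self, x):
        # Self-attention sub-layer, wrapped in a residual connection + layer norm
        x = self.addnorm1(x, lambda t: self.self_attn(t, t, t)[0])
        # Feed-forward sub-layer, wrapped the same way
        return self.addnorm2(x, self.ff)
```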

Decoder

The decoder takes the weighted embedding produced by the encoder and uses it to generate output, i.e., the tokens with the highest probability based on the input sequence. 

The decoder has a similar architecture to the encoder, with a couple of key differences: 

  • It has two self-attention layers, while the encoder has one.
  • It employs two types of self-attention
    • Masked Multi-Head Attention: uses a causal masking mechanism to prevent comparisons against future tokens.  
    • Encoder-Decoder Multi-Head Attention: each output token calculates attention scores against all input tokens, better establishing the relationship between the input and output for greater accuracy. Unlike masked self-attention, this cross-attention mechanism does not need a causal mask: it attends to the encoder’s representation of the input, which contains no future output tokens. 

This results in the following decoder structure (a code sketch follows the list): 

  • Embedding layer
  • Positional encoder
    • Residual connection that feeds into normalization layer 
  • Masked self-attention mechanism
  • Normalization layer
    • Residual connection that feeds into normalization layer 
  • Encoder-Decoder self-attention mechanism
  • Normalization layer
    • Residual connection that feeds into normalization layer 
  • Feed-Forward network 
  • Normalization layer 
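
Here is the sketch referenced above: one decoder block mirroring this structure, again reusing the FeedForward and ResidualNorm classes from earlier sketches.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One decoder block: masked self-attention, encoder-decoder (cross)
    attention, and a feed-forward network, each wrapped in add & norm."""

    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ff = FeedForward(d_model)
        self.addnorms = nn.ModuleList(ResidualNorm(d_model) for _ in range(3))

    def forward(self, x, encoder_output):
        # Causal mask: True entries mark positions a token may NOT attend to,
        # i.e., every position that comes after it.
        seq_len = x.size(1)
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )
        x = self.addnorms[0](x, lambda t: self.self_attn(t, t, t, attn_mask=causal_mask)[0])
        # Cross-attention: queries come from the decoder, keys/values from the encoder output.
        x = self.addnorms[1](x, lambda t: self.cross_attn(t, encoder_output, encoder_output)[0])
        return self.addnorms[2](x, self.ff)
```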

Combine the Encoder and Decoder to Complete the Transformer

Having defined the components and assembled the encoder and decoder, you can combine them to produce a complete transformer.

However, transformers do not contain a single encoder and decoder – but rather a stack of each of equal depth, e.g., six of each in the original transformer. Stacking encoders and decoders in this manner increases the transformer’s capabilities, as each layer captures different characteristics and underlying patterns from the input, enhancing the LLM’s performance. 
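
Rather than stacking the hand-built layers yourself, you can also lean on PyTorch’s built-in module, which follows the same stacked encoder-decoder design; the configuration below mirrors the original transformer. Note that it does not include the embedding layers, positional encoding, or the final output projection over the vocabulary, which you still provide yourself.

```python
import torch.nn as nn

# Configuration mirroring the original transformer: 6 encoder and 6 decoder
# layers, d_model = 512, 8 attention heads, feed-forward dimension 2048.
transformer = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=6,
    num_decoder_layers=6,
    dim_feedforward=2048,
    dropout=0.1,
    batch_first=True,
)
# Embedding layers, positional encoding, and the final linear + softmax
# projection over the vocabulary are added around this module.
```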

Data Curation

Once you have built your LLM, the next step is compiling and curating the data that will be used to train it. 

This is an especially vital part of the process of building an LLM from scratch because the quality of data determines the quality of the model. While other aspects, such as the model architecture, training time, and training techniques can be adjusted to improve performance, bad data cannot be overcome. 

Consequences of low-quality training data include:  

  • Inaccuracy: a model trained on incorrect data will produce inaccurate answers 
  • Bias: any inherent bias in the data will be learned by the model 
  • Unpredictability: the model may produce incoherent or nonsensical answers, and it will be difficult to determine why
  • Poor resource utilization: ultimately, poor-quality data prolongs the training process and incurs higher computational, personnel, and energy costs. 

As well as requiring high-quality data, for your model to properly learn linguistic and semantic relationships to carry out natural language processing tasks, you also need vast amounts of data. As stated earlier, a general rule of thumb is that the more performant and capable you want your LLM to be, the more parameters it requires  – and the more data you must curate. 

To illustrate this, here are a few existing LLMs and the amount of data, in tokens, used to train them:

Model          # of parameters    # of tokens
GPT-3          175 billion        0.5 trillion
Llama 2        70 billion         2 trillion
Falcon 180B    180 billion        3.5 trillion

For better context, 100,000 tokens equate to roughly 75,000 words – or an entire novel. So GPT-3, for instance, was trained on the equivalent of 5 million novels’ worth of data. 

Characteristics of a High-Quality Dataset

Let us look at the main characteristics to consider when curating training data for your LLM.

  • Filtered for inaccuracies 
  • Minimal biases and harmful speech 
  • Cleaned: the data has been filtered for:
    • Misspellings
    • Cross-domain homographs
    • Spelling variations
    • Contractions
    • Punctuation
    • Boilerplate text 
    • Markup, e.g., HTML 
    • Non-textual components, e.g., emojis
  • Deduplication: removing repeated information, as it could increase bias in the model (see the cleaning and deduplication sketch after this list)
  • Privacy redaction: removing confidential or sensitive data
  • Diverse: containing data from a wide range of formats and subjects, e.g., academic writing, prose, website text, coding samples, mathematics, etc.
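
As referenced in the list, here is a toy sketch of the cleaning and deduplication steps; the regular expressions and the ASCII filter are crude illustrative assumptions, and production pipelines are far more thorough (language identification, quality classifiers, privacy redaction, fuzzy deduplication, and so on).

```python
import hashlib
import re

def clean(text: str) -> str:
    """Minimal cleaning pass: strip HTML markup, non-textual symbols such as
    emojis, and excess whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)                    # drop HTML/markup tags
    text = text.encode("ascii", errors="ignore").decode()   # crude emoji/symbol filter
    return re.sub(r"\s+", " ", text).strip()                # collapse whitespace

def deduplicate(documents: list[str]) -> list[str]:
    """Exact deduplication via content hashing; large corpora usually also
    apply fuzzy methods such as MinHash to catch near-duplicates."""
    seen, unique = set(), []
    for doc in documents:
        digest = hashlib.sha1(doc.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```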

Another crucial component of creating an effective training dataset is retaining a portion of your curated data for evaluating the model. If you use the same data with which you trained your LLM to evaluate it, you run the risk of overfitting the model – where it becomes familiar with a particular set of data and fails to generalize to new data. 

Where Can You Source Data For Training an LLM?

There are several places to source training data for your language model. Depending on the amount of data you need, it is likely that you will draw from each of the sources outlined below.

  • Existing Public Datasets: data that has previously been used to train LLMs and has been made available for public use. Prominent examples include:
    • Common Crawl: a dataset containing terabytes of raw web data extracted from billions of pages. It also has widely used variations or subsets, including RefinedWeb and C4 (Colossal Clean Crawled Corpus). 
    • The Pile: a popular text corpus that contains data from 22 data sources across 5 categories:
      • Academic Writing: e.g., arXiv
      • Online or Scraped Resources: e.g., Wikipedia
      • Prose: e.g., Project Gutenberg
      • Dialog: e.g., YouTube subtitles
      • Miscellaneous: e.g., GitHub
    • StarCoder: the training dataset behind the StarCoder code models – close to 800GB of coding samples in a variety of programming languages. 
    • Hugging Face: an online resource hub and community that features over 100,000 public datasets.  
  • Private Datasets: a personally curated dataset that you create in-house or purchase from an organization that specializes in dataset curation.  
  • Directly From the Internet: naturally, scraping data directly from websites en masse is an option – but this is ill-advised because the data won’t be cleaned, is likely to contain inaccuracies and biases, and could feature confidential information. There are also likely to be data ownership issues with such an approach.

Training Your Custom LLM

The training process for LLMs involves passing vast amounts of textual data through the neural network so that it can learn its parameters, i.e., weights and biases. Each training iteration is composed of two steps: forward and backward propagation. 

During forward propagation, training data is fed into the LLM, which learns the language patterns and semantics required to predict output accurately during inference. The output of each layer of the neural network serves as the input to the next layer, until the final output layer, which generates a predicted output based on the input sequence and its learned parameters.

Meanwhile, backward propagation updates the LLM’s parameters based on its prediction errors. The model’s gradients, i.e., the direction and extent to which each parameter should be adjusted to reduce the error, are propagated backwards through the network. The parameters of each layer are then adjusted in a way that minimizes the loss function: the function that calculates the difference between the target output and the actual output, providing a quantitative measure of performance. 

This process iterates over multiple batches of training data, and over several epochs, i.e., complete passes through the dataset, until the model’s parameters converge – that is, until further training no longer meaningfully improves its accuracy. 
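
A minimal sketch of this loop in PyTorch. It assumes a `model` that maps a batch of token IDs to next-token logits of shape (batch, seq_len, vocab_size) and a `train_loader` that yields (inputs, targets) pairs of token IDs; the optimizer, learning rate, and epoch count are illustrative.

```python
import torch
import torch.nn as nn

def train(model, train_loader, epochs: int = 3, lr: float = 3e-4):
    """Minimal training loop: forward pass, loss, backward pass, parameter update."""
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    for epoch in range(epochs):                  # one epoch = a full pass over the dataset
        for inputs, targets in train_loader:     # iterate over batches
            logits = model(inputs)                              # forward propagation
            loss = loss_fn(logits.view(-1, logits.size(-1)),    # compare predictions
                           targets.view(-1))                    # against target tokens
            optimizer.zero_grad()
            loss.backward()                      # backward propagation: compute gradients
            optimizer.step()                     # adjust parameters to reduce the loss
```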

How Long Does It Take to Train an LLM From Scratch?

The training process for every model will be different – so there is no set amount of time taken to train an LLM. The amount of training time will depend on a few key factors:

  • The complexity of the desired use case
  • The amount, complexity, and quality of available training data
  • Available computational resources

Training an LLM for a relatively simple task on a small dataset may only take a few hours, while training for more complex tasks with a large dataset could take months.

Additionally, two challenges you will need to mitigate while training your LLM are underfitting and overfitting. Underfitting can occur when your model is not trained for long enough, and the LLM has not had sufficient time to capture the relationships in the training data. Conversely, training an LLM for too long can result in overfitting – where it learns the patterns in the training data too well, and doesn’t generalize to new data.  In light of this, the best time to stop training the LLM is when it consistently produces the expected outcome – and makes accurate predictions on previously unseen data.

LLM Training Techniques 

Parallelization

Parallelization is the process of distributing training tasks across multiple GPUs, so they are carried out simultaneously. This both expedites training times in contrast to using a single processor and makes efficient use of the parallel processing abilities of GPUs. 

There are several different parallelization techniques, which can be combined for optimal results (a data-parallel sketch follows the list): 

  • Data Parallelization: the most common approach, which sees the training data divided into shards and distributed over several GPUs. 
  • Tensor Parallelization: divides the matrix multiplications performed by the transformer into smaller calculations that are performed simultaneously on multiple GPUs.
  • Pipeline Parallelization: distributes the transformer layers over multiple GPUs to be processed in parallel.
  • Model Parallelization: distributes the model across several GPUs and uses the same data for each – so each GPU handles one part of the model instead of a portion of the data. 
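
Here is the data-parallelization sketch referenced above, using PyTorch’s DistributedDataParallel; `model`, `dataset`, the launch command, and the batch size are illustrative assumptions.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def setup_data_parallel(model, dataset, batch_size: int = 8):
    """Shard the data across GPUs and wrap the model so gradients are
    synchronized after every step. Assumes the script is launched with
    `torchrun --nproc_per_node=<num_gpus> train.py` (one process per GPU)."""
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    model = DDP(model.cuda(local_rank), device_ids=[local_rank])
    sampler = DistributedSampler(dataset)   # each process sees a different shard of the data
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
    return model, loader
```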

Gradient Checkpointing  

Gradient checkpointing is a technique used to reduce the memory requirements of training LLMs. It is a valuable training technique because it makes it more feasible to train LLMs on devices with restricted memory capacity. In addition, by mitigating out-of-memory errors, gradient checkpointing helps make the training process more stable and reliable.

Typically, during forward propagation, the model’s neural network produces a series of intermediate activations: output values from each layer that are needed later, during backward propagation, to compute gradients. With gradient checkpointing, all intermediate activations are still calculated, but only a subset of them – those at defined checkpoints – are stored in memory.

During backward propagation, the intermediate activations that were not stored are recalculated. However, instead of recomputing them from the very start of the network, each missing activation only needs to be recomputed from the nearest stored checkpoint. Although gradient checkpointing reduces memory requirements, the tradeoff is that it increases processing overhead: the more activations that have to be recomputed, the greater the overhead. 
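
A sketch of how this looks in PyTorch, using torch.utils.checkpoint to treat each transformer block as a checkpoint boundary (the block-level granularity is an illustrative choice).

```python
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(nn.Module):
    """Wrap a stack of transformer blocks so that activations inside each
    block are recomputed during backward propagation instead of stored."""

    def __init__(self, layers: nn.ModuleList):
        super().__init__()
        self.layers = layers

    def forward(self, x):
        for layer in self.layers:
            # Only the block boundaries (checkpoints) keep their activations;
            # everything inside `layer` is recomputed when gradients are needed.
            x = checkpoint(layer, x, use_reentrant=False)
        return x
```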

LLM Hyperparameters

Hyperparameters are configurations that you can use to influence how your LLM is trained. In contrast to parameters, hyperparameters are set before training begins and aren’t changed by the training data. Tuning hyperparameters is an essential part of the training process because it provides a controllable and measurable method of altering your LLM’s behavior to better align with your expectations and defined use case.

Notable hyperparameters include:

  • Batch Size: a batch is a collection of instances from the training data, which are fed into the model at a particular timestep. Larger batches require more memory but also accelerate the training process as you get through more data at each interval. Conversely, smaller batches use less memory but prolong training. Generally, it is best to go with the largest data batch your hardware will allow while remaining stable, but finding this optimal batch size requires experimentation. 
  • Learning Rate: how quickly the LLM updates its parameters in response to the loss function during training. A higher learning rate expedites training but could cause instability and overfitting. A lower learning rate, in contrast, is more stable and can improve generalization – but lengthens the training process. 
  • Temperature: adjusts the range of possible output at inference time to determine how “creative” the LLM is. Typically represented by a value between 0.0 and 2.0, a lower temperature generates more predictable output, while a higher value increases the randomness and creativity of responses (see the sampling sketch after this list).
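
A short sketch of where each of these settings shows up in practice; the values, `train_dataset`, and `model` are illustrative assumptions.

```python
import torch
from torch.utils.data import DataLoader

def configure_training(model, train_dataset):
    """Batch size and learning rate as set in the data loader and optimizer."""
    train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)   # batch size
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)              # learning rate
    return train_loader, optimizer

def sample_next_token(logits: torch.Tensor, temperature: float = 0.7) -> int:
    """Temperature-scaled sampling at inference time: dividing the logits by
    the temperature sharpens (T < 1) or flattens (T > 1) the next-token distribution."""
    probs = torch.softmax(logits / max(temperature, 1e-5), dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```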
      

Fine-Tuning Your LLM 

After training your LLM from scratch with larger, general-purpose datasets, you will have a base, or pre-trained, language model. To prepare your LLM for your chosen use case, you will likely have to fine-tune it. Fine-tuning is the process of further training a base LLM with a smaller, task- or domain-specific dataset to enhance its performance on a particular use case.

Fine-tuning methods broadly fall into two categories: full fine-tuning and transfer learning.

  • Full Fine-Tuning: where all of the base model’s parameters are updated, creating a new version with altered weighting. This is the most comprehensive way to train an LLM for a specific task or domain – but requires more time and resources.
  • Transfer Learning: this involves leveraging the significant language knowledge acquired by the model during pre-training and adapting it to a specific domain or use case. Transfer learning typically requires many or all of the base LLM’s neural network layers to be “frozen” to limit which parameters can be tuned. The remaining unfrozen layers – often newly added ones – are then fine-tuned with the smaller dataset, requiring less time and computational resources than full fine-tuning (as sketched below).
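
Here is a minimal transfer-learning sketch: freeze the base model’s layers and leave only selected parameters trainable. The parameter-name prefix and learning rate are illustrative assumptions that depend on how your model is defined.

```python
import torch.nn as nn

def freeze_for_transfer_learning(model: nn.Module, trainable_prefixes=("lm_head",)):
    """Freeze every parameter except those whose names start with one of
    `trainable_prefixes` (e.g., a newly added task head)."""
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(p) for p in trainable_prefixes)

# The optimizer then only receives the unfrozen parameters, e.g.:
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-5)
```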

Evaluating Your Bespoke LLM

After training and fine-tuning your LLM, it is time to test whether it performs as expected for its intended use case. This will allow you to determine whether your LLM is ready for deployment or requires further training. 

For this, you will need previously unseen evaluation datasets that reflect the kind of information the LLM will be exposed to in a real-world scenario. As mentioned above, this dataset needs to differ from the one used to train the LLM to prevent it from overfitting to particular data points instead of genuinely capturing its underlying patterns. 
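
Alongside benchmarks (covered next), a simple first check is the model’s loss or perplexity on that held-out set; the sketch below assumes the same (inputs, targets) batch format as the training-loop sketch earlier.

```python
import math
import torch
import torch.nn as nn

@torch.no_grad()
def evaluate_perplexity(model, eval_loader) -> float:
    """Compute average loss on a held-out set and report it as perplexity
    (lower is better)."""
    loss_fn = nn.CrossEntropyLoss()
    total_loss, num_batches = 0.0, 0
    model.eval()
    for inputs, targets in eval_loader:
        logits = model(inputs)
        total_loss += loss_fn(logits.view(-1, logits.size(-1)), targets.view(-1)).item()
        num_batches += 1
    return math.exp(total_loss / num_batches)
```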

LLM Benchmarks 

An objective way to evaluate your bespoke LLM is through the use of benchmarks: standardized tests developed by various members of the AI research and development community. LLM benchmarks provide a standardized way to test the performance of your LLM – and compare it against existing language models. Also, each benchmark includes its own dataset, satisfying the requirement of using different datasets than during training to help avoid overfitting.  

Some of the most widely used benchmarks for evaluating LLM performance include: 

  • ARC: a question-answer (QA) benchmark designed to evaluate knowledge and reasoning skills. 
  • HellaSwag: uses sentence completion exercises to test commonsense reasoning and natural language inference (NLI) capabilities. 
  • MMLU: a comprehensive benchmark comprised of 15,908 questions across 57 tasks that measure natural language understanding (NLU), i.e., how well an LLM understands language and, subsequently, can solve problems.  
  • TruthfulQA: measuring a model’s ability to generate truthful answers, i.e., its propensity to “hallucinate”. 
  • GSM8K: measures multi-step mathematical abilities through a collection of 8,500 grade-school-level math word problems. 
  • HumanEval: measures an LLM’s ability to generate functionally correct code. 
  • MT Bench: evaluates a language model’s ability to effectively engage in multi-turn dialogues – like those engaged in by chatbots. 

Conclusion

In summary, the process of building an LLM from scratch can roughly be broken down into five stages:

  • Determining the use case for your LLM: the purpose of your custom language model 
  • Creating your model architecture: developing the individual components and combining them to create a transformer
  • Data curation: sourcing the data necessary to train your model
  • Training: pre-training and fine-tuning your model 
  • Evaluation: testing your model to see if it works as intended; evaluating its overall performance with benchmarks 

Understanding what’s involved in developing a bespoke LLM grants you a more realistic perspective of the work and resources required – and if it is a viable option.   

However, though the barriers to entry for developing a language model from scratch have been significantly lowered, it is still a considerable undertaking. So, it is crucial to determine if building an LLM is absolutely essential – or if you can reap the same benefits with an existing solution. 

Kartik Talamadupula
Director of AI Research

Kartik Talamadupula is a research scientist who has spent over a decade applying AI techniques to business problems in automation, human-AI collaboration, and NLP.