The capabilities of Large Language Models (LLMs) have grown in leaps and bounds in recent years, making them more user-friendly and applicable in a growing number of use cases. However, as LLMs have increased in intelligence and complexity, their number of parameters, i.e., weights and activations, which determines their capacity to learn from and process data, has also grown. For example, GPT-3.5 has around 175 billion parameters, while the current state-of-the-art GPT-4 is reported to have in excess of 1 *trillion* parameters.

However, the larger an LLM, the more memory it requires. This means it is only feasible to run large LLMs on high-specification hardware with the requisite number of GPUs – which limits deployment options and, consequently, how readily LLM-based solutions can be adopted. Fortunately, machine learning researchers are devising a growing range of solutions to meet the challenge of growing model sizes – with one of the most prominent being quantization.

In this guide, we explore the concept of quantization, including how it works, why it is important and advantageous, and different techniques for quantizing language models.

## What is Quantization and Why is it Important?

Quantization is a model compression technique that converts the weights and activations within an LLM from a high-precision data representation to a lower-precision data representation, i.e., from a data type that can hold more information to one that holds less. A typical example of this is the conversion of data from a 32-bit floating-point number (FP32) to an 8-bit or 4-bit integer (INT8 or INT4).

A great analogy for understanding quantization is image compression. Compressing an image involves reducing its size by removing some of its information, i.e., bits of data. Now, while decreasing the size of an image typically reduces its quality (ideally, to a still-acceptable degree), it also means that more images can be saved on a given device, and that they require less time and bandwidth to transfer or display to a user. In a similar way, quantizing an LLM increases its portability and the number of ways it can be deployed – albeit with an acceptable sacrifice in detail or precision.

Quantization is an important process within machine learning because reducing the number of bits required for each of a model’s weights adds up to a significant decrease in its overall size. Consequently, quantization produces LLMs that consume less memory, require less storage space, are more energy-efficient, and are capable of faster inference. This all adds up to the critical advantage of enabling LLMs to run on a wider range of devices, including single GPUs, instead of expensive hardware featuring multiple GPUs, and, in some cases, even CPUs.
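To get a feel for the scale of these savings, here is a back-of-the-envelope calculation of the memory needed just to store a model's weights at different bit widths. The parameter count comes from the GPT-3.5 figure mentioned above; real deployments also need memory for activations, the KV cache, and runtime overhead, so treat these as lower bounds.

```python
# Rough memory needed just to store model weights at different precisions.
# This ignores activations, KV cache, and runtime overhead.

def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Bytes needed for the weights alone, converted to gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

n_params = 175e9  # GPT-3.5-scale model
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(n_params, bits):,.1f} GB")
# FP32 needs ~700 GB for the weights alone, while INT4 needs ~87.5 GB --
# the difference between a multi-GPU cluster and a (large) single node.
```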

## How Does Quantization Work?

In essence, the quantization process involves mapping weights stored in high-precision values to lower-precision data types. Now, while this is relatively straightforward in some cases, e.g., mapping a 64-bit or 32-bit float to a 16-bit float, as they share a representation scheme, it is more difficult in other instances. For example, quantizing a 32-bit float to a 4-bit integer is complex because an INT4 can only represent 16 distinct values – compared to the vast range of values an FP32 can represent.

To achieve quantization, we need to find the optimum way to project our range of FP32 weight values, which we'll label [min, max], onto the INT4 space. One method of implementing this is the *affine quantization* scheme, shown in the formula below:

*x_q = round(x/S + Z)*

**where**:

- *x_q* – the quantized INT4 value that corresponds to the FP32 value *x*
- *S* – the scaling factor, a positive FP32 value
- *Z* – the zero-point: the INT4 value that corresponds to 0 in the FP32 space
- *round* – refers to the rounding of the resultant value to the nearest integer

To find the [min, max] of our FP32 value range, however, we must first calibrate the model using a smaller calibration dataset. The [min, max] could be determined in a number of ways, with a common solution being to set them to the minimum and maximum observed values. Subsequently, all values that sit outside this range are "clipped" – i.e., mapped to *min* and *max*, respectively.
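The affine scheme above can be sketched in a few lines of NumPy. This is a toy illustration of the formula and the clipping step, not a production quantizer, which would typically also support per-channel scales and symmetric variants:

```python
import numpy as np

def affine_quantize(x, n_bits=4):
    """Quantize float weights using the affine scheme x_q = round(x/S + Z)."""
    qmin, qmax = 0, 2**n_bits - 1            # e.g. 0..15 for unsigned 4-bit
    x_min, x_max = x.min(), x.max()          # calibration: observed range
    S = (x_max - x_min) / (qmax - qmin)      # scaling factor
    Z = round(float(-x_min / S))             # zero-point: integer mapped to 0.0
    # Round, then clip anything that falls outside the representable range
    x_q = np.clip(np.round(x / S + Z), qmin, qmax).astype(np.int8)
    return x_q, S, Z

def affine_dequantize(x_q, S, Z):
    """Approximate reconstruction of the original float values."""
    return (x_q.astype(np.float32) - Z) * S

weights = np.array([-1.0, -0.3, 0.0, 0.4, 1.0], dtype=np.float32)
w_q, S, Z = affine_quantize(weights)
w_hat = affine_dequantize(w_q, S, Z)         # close to, but not exactly, weights
```

Note that 0.0 in the FP32 space round-trips exactly (it maps to the zero-point *Z*), which matters because zero-valued weights and padding are common in practice.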

That said, an issue with this approach and others like it is that outliers, i.e., extreme values, can have a disproportionate impact on scaling: the full range of the lower-precision data type isn't used effectively, which lowers the quantized model's accuracy. The solution to this is quantizing in blocks, whereby the weights are divided into groups of 64 or 128, for example. Each block is then quantized individually, mitigating the effect of outliers and increasing precision.

Something to factor in, however, is that while an LLM’s weights and activations will be quantized to reduce its size, they will be dequantized at inference time, so the necessary computations during forward and backward propagation can be performed with a higher-precision data type. This means that the scaling factors for each block must also be stored. Consequently, the more blocks that are used during the quantization process, the higher the accuracy – but the higher the number of scaling factors that must also be saved.
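The following sketch shows block-wise quantization with a simple absmax scheme (one of several possible per-block schemes), and illustrates both points above: a single outlier only distorts its own block, and each block's scaling factor must be stored alongside the quantized weights:

```python
import numpy as np

def blockwise_absmax_quantize(w, block_size=64, n_bits=4):
    """Quantize in blocks: each block gets its own FP32 scaling factor,
    so an outlier in one block cannot distort the scale of the others."""
    w = w.reshape(-1, block_size)
    scales = np.abs(w).max(axis=1, keepdims=True)   # one scale per block
    qmax = 2 ** (n_bits - 1) - 1                    # symmetric range, e.g. +/-7
    w_q = np.round(w / scales * qmax).astype(np.int8)
    return w_q, scales.squeeze(1)                   # scales must be stored too

np.random.seed(0)
w = np.random.randn(1024).astype(np.float32)
w[10] = 50.0                                        # a single large outlier
w_q, scales = blockwise_absmax_quantize(w)
# Only block 0 is scaled by 50; the other 15 blocks keep fine resolution.
# Storage overhead: one 32-bit scale per 64 weights = 0.5 extra bits/weight.
```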

### The Two Types of LLM Quantization: PTQ and QAT

While there are several quantization techniques, the most notable of which we detail later in this guide, generally speaking, LLM quantization falls into two categories:

- **Post-Training Quantization (PTQ)**: this refers to techniques that quantize an LLM after it has already been trained. PTQ is easier to implement than QAT, as it requires less training data and is faster. However, it can also result in reduced model accuracy from lost precision in the value of the weights.
- **Quantization-Aware Training (QAT)**: this refers to methods of fine-tuning on data with quantization in mind. In contrast to PTQ techniques, QAT integrates the weight conversion process, i.e., calibration, range estimation, clipping, rounding, etc., into the training stage. This often results in superior model performance, but is more computationally demanding.

## What Are the Advantages and Disadvantages of Quantized LLMs?

Let us look at the pros and cons of quantization.

### Pros

- **Smaller Models**: by reducing the size of their weights, quantization produces smaller models. This allows them to be deployed in a wider variety of circumstances, such as on less powerful hardware, and reduces storage costs.
- **Increased Scalability**: the lower memory footprint of quantized models also makes them more scalable. As quantized models have fewer hardware constraints, organizations can more feasibly expand their IT infrastructure to accommodate their use.
- **Faster Inference**: the lower bit widths used for weights, and the resulting lower memory bandwidth requirements, allow for more efficient computations.

### Cons

- **Loss of Accuracy**: undoubtedly, the most significant drawback of quantization is a potential loss of accuracy in output. Converting the model's weights to a lower precision is likely to degrade its performance – and the more "aggressive" the quantization technique, i.e., the lower the bit width of the target data type, e.g., 4-bit, 3-bit, etc., the greater the risk of loss of accuracy.

## Different Techniques for LLM Quantization

Now that we have covered what quantization is and why it is beneficial, let us turn our attention to different quantization methods and how they work.

### QLoRA

Low-Rank Adaptation (LoRA) is a Parameter-Efficient Fine-Tuning (PEFT) technique that reduces the memory requirements of further training a base LLM by freezing its weights and fine-tuning a small set of additional weights, called adapters. Quantized Low-Rank Adaptation (QLoRA) takes this a step further by quantizing the original weights within the base LLM to 4-bit: reducing the memory requirements of an LLM to make it feasible to run on a single GPU.

QLoRA carries out quantization through two key mechanisms: the 4-bit NormalFloat (NF4) data type and Double Quantization.

- **NF4**: a 4-bit data type used in machine learning that normalizes each weight to a value between -1 and 1, giving a more accurate representation of low-precision weight values than a conventional 4-bit float. However, while NF4 is used to store the quantized weights, QLoRA uses an additional data type, brainfloat16 (BFloat16), also designed specifically for machine learning, to carry out the calculations during forward and backward propagation.
- **Double Quantization (DQ)**: a process of quantizing the quantization constants for additional memory savings. QLoRA quantizes weights in blocks of 64, and while this facilitates precise 4-bit quantization, you also have to account for the scaling factor of each block – which increases the amount of memory required. DQ addresses this issue by performing a second round of quantization on the scaling factors: the 32-bit scaling factors are gathered into blocks of 256 and quantized to 8-bit. Consequently, where the 32-bit scaling factor for each block previously added 0.5 bits per weight, DQ brings this down to only 0.127 bits. Though seemingly insignificant, across a 65B LLM, for example, this saves around 3 GB of memory.
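The Double Quantization arithmetic above is easy to verify directly. Using the block sizes described (64 weights per quantization block, scaling factors regrouped into blocks of 256 and quantized to 8-bit, plus a second-level FP32 constant per group):

```python
# Per-weight storage cost of the block scaling factors, before and after
# Double Quantization, using the QLoRA block sizes described above.

block_size = 64            # weights per first-level quantization block

# Without DQ: one FP32 scaling factor per block of 64 weights
bits_without_dq = 32 / block_size                       # = 0.5 bits/weight

# With DQ: first-level scales stored as 8-bit, grouped into blocks of 256,
# each group keeping one FP32 second-level scaling factor
bits_with_dq = 8 / block_size + 32 / (block_size * 256)

print(f"{bits_without_dq:.3f} -> {bits_with_dq:.3f} bits per weight")
# prints "0.500 -> 0.127 bits per weight", matching the figures in the text
```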

### PRILoRA

Pruned and Rank-Increasing Low-Rank Adaptation (PRILoRA) is a fine-tuning technique recently proposed by researchers that aims to increase LoRA's efficiency through the introduction of two additional mechanisms: the linear distribution of ranks and ongoing importance-based A-weight pruning.

Returning to the concept of low-rank decomposition, LoRA achieves fine-tuning by combining two matrices: *W*, which contains the model's original weights, and *AB*, which represents all the changes made to the model by training the additional weights, i.e., adapters. The *AB* matrix can be decomposed into two smaller matrices of lower rank, *A* and *B* – hence the term low-rank decomposition. However, while the low rank *r* is the same across all of the LLM's layers in LoRA, PRILoRA linearly increases the rank for each layer. For example, the researchers who developed PRILoRA started with r = 4 and increased the rank up to r = 12 for the final layer – producing an average rank of 8 across all layers.
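The parameter savings of the low-rank decomposition are worth seeing concretely. The sketch below uses an illustrative hidden dimension of 4096 and the average rank of 8 mentioned above; the matrix names mirror the text (*W*, *A*, *B*), and the zero initialization of *B* reflects the common LoRA convention that the adapter contributes nothing before training:

```python
import numpy as np

d = 4096   # illustrative hidden dimension of a square weight matrix W
r = 8      # low rank (the average rank cited for PRILoRA)

W = np.zeros((d, d), dtype=np.float32)            # frozen base weights (stand-in)
A = (np.random.randn(r, d) * 0.01).astype(np.float32)  # trainable factor A
B = np.zeros((d, r), dtype=np.float32)            # B starts at zero: BA = 0

delta = B @ A             # the low-rank update, same shape as W
effective = W + delta     # weights actually used at inference

full_params = d * d               # trainable params if fine-tuning W directly
lora_params = A.size + B.size     # trainable params with the adapters: 2*d*r
print(f"trainable: {lora_params:,} (LoRA) vs {full_params:,} (full fine-tune)")
```

For this layer, the adapters train 65,536 parameters instead of roughly 16.8 million – a ~256x reduction, which is where LoRA's memory savings come from.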

Second, PRILoRA performs pruning on the *A* matrix, eliminating the lowest, i.e., least significant, weights every 40 steps throughout the fine-tuning process. The lowest weights are determined through the use of an importance matrix, which stores both the temporary magnitude of weights and the collected statistics related to the input for each layer. Pruning the A matrix in this way reduces the number of weights that have to be processed, which reduces both the time required to fine-tune an LLM and the memory requirements of the fine-tuned model.

Although still a work in progress, PRILoRA showed very encouraging results on benchmark tests conducted by researchers. This included outperforming full fine-tuning methods on 6 out of 8 evaluation datasets while achieving better results than LoRA on all datasets.

### GPTQ

GPTQ (Generative Pre-trained Transformer Quantization) is a quantization technique designed to reduce the size of models so they can run on a single GPU. GPTQ works through a form of layer-wise quantization: an approach that quantizes the model one layer at a time, with the aim of finding the quantized weights that minimize output error – the mean squared error (MSE) between the outputs of the original, full-precision layer and the quantized layer.

First, all the model’s weights are converted into a matrix, which is worked through 128 columns at a time via a process called lazy batch updating. This involves quantizing the weights in a batch, calculating the MSE, and updating the weights to values that minimize it. After the calibration batch has been processed, all the remaining weights in the matrix are updated in accordance with the MSE of the initial batch – and then all the individual layers are recombined to produce a quantized model.

GPTQ employs a mixed INT4/FP16 quantization method in which a 4-bit integer is used to quantize weights, while activations remain in the higher-precision float16 data type. Subsequently, during inference, the model’s weights are dequantized in real time so computations are carried out in float16.
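The mixed-precision inference path can be sketched as follows. This is a simplified illustration of the general idea (INT4 weights expanded to FP16 on the fly so the matmul runs in FP16), not GPTQ's actual packed-storage kernels, and the weight values and scale are made up for the example:

```python
import numpy as np

def dequantize_to_fp16(w_q, scale):
    """Expand low-bit integer weights back to float16 at compute time,
    so the matrix multiply itself runs in FP16."""
    return w_q.astype(np.float16) * np.float16(scale)

# Hypothetical quantized layer: 4-bit values stored in an int8 array
# for simplicity (real kernels pack two 4-bit values per byte).
w_q = np.array([[-7, 3],
                [ 2, -1]], dtype=np.int8)
scale = 0.05                                   # made-up per-tensor scale

x = np.array([1.0, 2.0], dtype=np.float16)     # FP16 activations
y = x @ dequantize_to_fp16(w_q, scale)         # computation happens in FP16
```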

### GGML/GGUF

GGML (which is said to stand for Georgi Gerganov Machine Learning, after its creator, or GPT-Generated Model Language) is a C-based machine learning library designed for the quantization of Llama models so they can run on a CPU. More specifically, the library allows you to save quantized models in the GGML binary format, which can be executed on a broader range of hardware.

GGML quantizes models through a process called the k-quant system, which uses value representations of different bit widths depending on the chosen quant method. First, the model’s weights are divided into blocks of 32, with each block assigned a scaling factor derived from the largest weight value in the block.

Depending on the selected quant method, the most important weights are quantized to a higher-precision data type, while the rest are assigned to a lower-precision type. For example, the q2_k quant method converts the largest weights to 4-bit integers and the remaining weights to 2-bit. Alternatively, the q5_0 and q8_0 quant methods convert all weights to 5-bit and 8-bit integer representations, respectively. You can view GGML’s full range of quant methods by looking at the model cards in this code repo.

GGUF (GPT-Generated Unified Format), meanwhile, is a successor to GGML and is designed to address its limitations – most notably, enabling the quantization of non-Llama models. GGUF is also extensible: allowing for the integration of new features while retaining compatibility with older LLMs.

To run GGML or GGUF models, however, you need to use a C/C++ library called llama.cpp – which was also developed by GGML’s creator Georgi Gerganov. llama.cpp is capable of reading models saved in the .GGML or .GGUF format and enables them to run on CPU devices as opposed to requiring GPUs.

### AWQ

Conventionally, a model’s weights are quantized irrespective of the data they process during inference. In contrast, Activation-Aware Weight Quantization (AWQ) accounts for the activations of the model, i.e., the most significant features of the input data, and how they are distributed during inference. By tailoring the precision of the model’s weights to the particular characteristics of the input, you can minimize the loss of accuracy caused by quantization.

The first stage of AWQ is using a calibration data subset to collect activation statistics from the model, i.e., identifying which weights see the largest activations during inference. These are known as salient weights and typically comprise less than 1% of the total. The salient weights are skipped during quantization, remaining in the FP16 data type to preserve accuracy, while the rest of the weights are quantized to INT3 or INT4 to reduce memory requirements across the rest of the LLM.
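The selection step can be sketched as below. This is a simplified illustration of the idea of splitting weights by activation statistics; the function name and the per-weight granularity are illustrative (AWQ itself operates on weight channels and applies per-channel scaling rather than simply skipping quantization):

```python
import numpy as np

def split_salient(weights, act_magnitude, salient_frac=0.01):
    """Rank weights by the average activation magnitude they see (from a
    calibration set), keep the top ~1% ("salient") in FP16, and mark the
    rest for low-bit quantization. Illustrative, per-weight granularity."""
    n_salient = max(1, int(len(weights) * salient_frac))
    order = np.argsort(act_magnitude)[::-1]       # most-activated first
    mask = np.zeros(len(weights), dtype=bool)
    mask[order[:n_salient]] = True
    salient = weights[mask].astype(np.float16)    # kept at higher precision
    rest = weights[~mask]                         # would be INT3/INT4-quantized
    return salient, rest, mask

np.random.seed(0)
weights = np.random.randn(1000).astype(np.float32)
acts = np.abs(np.random.randn(1000)).astype(np.float32)  # calibration stats
salient, rest, mask = split_salient(weights, acts)        # 10 salient, 990 rest
```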

## Conclusion

Quantization is an integral part of the LLM landscape. By compressing the size of language models, quantization techniques such as QLoRA and GPTQ help to increase the adoption of LLMs. Freed from the vast memory requirements of full-precision models, organizations, AI researchers, and individuals alike have more opportunities to experiment with the rapidly growing range of LLMs. This will result in more discoveries, more use cases, and even wider adoption – helping to advance the generative AI field as a whole.