With a growing number of large language models (LLMs) on offer, choosing a model that best suits your needs is crucial to the success of your generative AI strategy. The wrong choice can consume considerable time and resources and even possibly lead to a premature conclusion that AI can’t, in fact, enhance your organisation’s efficiency and productivity.

Although there are several ways to determine an LLM’s capabilities, such as benchmarking, as detailed in our previous guide, one of the methods most applicable to real-world use is measuring a model’s inference speed, i.e., how quickly it generates responses.

With this in mind, this guide explores LLM inference performance monitoring, including how inference works, the metrics used to measure an LLM’s speed, and how some of the most popular models on the market perform. 

What is LLM Inference Performance Monitoring and Why is it Important?

LLM inference is the process of entering a prompt and generating a response from an LLM. It involves a language model drawing conclusions or making predictions to generate an appropriate output based on the patterns and relationships to which it was exposed during training. 

Accordingly, LLM inference performance monitoring is the process of measuring the speed and response times of a model. Measuring LLM inference is essential as it allows you to assess an LLM’s efficiency, reliability, and consistency – all of which are crucial in determining its ability to perform in real-world scenarios and provide the intended value within an acceptable timeframe. Conversely, lacking the means to correctly evaluate LLMs leaves organisations and individuals with blind spots and an inability to properly distinguish one model from another. This is likely to lead to wasted time and resources down the line, as a language model proves ill-suited for its intended use case.

How LLM Inference Works

To better understand the metrics used to measure a model’s latency, let’s first briefly examine how an LLM performs inference, which involves two stages: a prefill phase and a decoding phase. 

Firstly, in the prefill phase, the LLM processes the text from a user’s input prompt by converting it into a series of prompt, or input, tokens. A token is a unit of text that represents a word or a portion of a word. For English text, a token works out to roughly 0.75 words, or about four characters. The exact mechanism that an LLM uses to divide text into tokens, i.e., its tokenizer, varies between models. Once generated, each token is turned into a vector embedding: a numerical representation that the model can understand and make inferences from. These embeddings are then processed by the LLM in order to generate an appropriate output for the user.
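
To make the prefill step more concrete, here’s a minimal sketch of tokenization using the Hugging Face transformers library, with GPT-2’s tokenizer chosen purely for illustration (the exact sub-word splits and IDs will differ for other models):

```python
from transformers import AutoTokenizer

# GPT-2's tokenizer is used purely as an illustration; every LLM ships its own.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

prompt = "Measuring LLM inference speed"
token_ids = tokenizer.encode(prompt)                  # prompt (input) tokens as integer IDs
tokens = tokenizer.convert_ids_to_tokens(token_ids)   # the sub-word pieces those IDs represent

print(tokens)      # sub-word units; the exact split depends on the tokenizer
print(token_ids)   # the IDs the model maps to vector embeddings during prefill
```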

From here, during the decoding phase, the LLM generates a series of vector embeddings that represent its response to the given input prompt. These are converted into completion, or output, tokens, which are generated one at a time until the model reaches a stopping criterion, such as the token limit or an entry in a list of stop words, at which point it generates a special end token to signal the end of token generation. As LLMs generate one token per forward propagation, i.e., per pass or iteration, the number of propagations a model requires to complete a response equals its number of completion tokens.
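
The decoding phase can be sketched as a simple greedy loop, again using GPT-2 as a stand-in model. This is only an illustrative outline; real serving stacks add KV caching, sampling, and batching, but the one-token-per-forward-pass structure is the same:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The three most common latency metrics are", return_tensors="pt").input_ids

max_new_tokens = 20  # stopping criterion: a token limit
with torch.no_grad():
    for _ in range(max_new_tokens):
        logits = model(input_ids).logits                          # one forward pass per completion token
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy: pick the most likely next token
        input_ids = torch.cat([input_ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:              # stopping criterion: the end-of-sequence token
            break

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))
```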

What Are the Most Important LLM Inference Performance Metrics?

To evaluate the inference capabilities of a large language model, the metrics that we’re most interested in are latency and throughput.

Latency

Latency is a measure of how long it takes for an LLM to generate a response to a user’s prompt. It provides a way to evaluate a language model’s speed and is mainly responsible for forming a user’s impression of how fast or efficient a generative AI application is. Consequently, low latency is important for use cases that involve real-time interactions, such as chatbots and AI copilots, but less so for offline processes. There are several ways to measure a model’s latency, including: 

  • Time To First Token (TTFT)
  • Time Per Output Token (TPOT)
  • Total generation time

TTFT is the length of time it takes for the user to start receiving a response from a model after entering their prompt. It’s determined by the time it takes to process the user’s input and generate the first completion token. Factors that influence TTFT include:

  • Network speed: a system’s general bandwidth and, similarly, how congested the network is at the time of inference.  
  • Input sequence length: the longer the prompt, the more processing required by the model before it can output the first token.   
  • Model size: as a general rule, the larger the model, i.e., the more parameters it has, the more computations it performs to generate a response, which prolongs the TTFT. 

TPOT, alternatively, is the average time it takes to generate a completion token for each user querying the model at a given time. This can also occasionally be referred to as inter-token latency (ITL). 

Total generation time refers to the end-to-end latency of an LLM: from when a prompt is originally entered by the user to when they receive the completed output from the model; often, when people refer to latency, they’re actually referring to total generation time. It can be calculated as follows:

  • Total generation time = TTFT + (TPOT × number of generated tokens)
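
As a minimal sketch (with hypothetical function and field names), all three latency metrics can be derived from client-side timestamps recorded as tokens arrive:

```python
def latency_metrics(request_time: float, token_times: list[float]) -> dict:
    """Derive TTFT, TPOT, and total generation time from client-side timestamps.

    request_time: wall-clock time at which the prompt was sent.
    token_times:  wall-clock time at which each completion token arrived.
    """
    ttft = token_times[0] - request_time                    # time to first token
    total = token_times[-1] - request_time                  # total generation time
    tpot = (total - ttft) / max(len(token_times) - 1, 1)    # average gap between subsequent tokens
    return {"ttft_s": ttft, "tpot_s": tpot, "total_generation_s": total}

# Worked example of the formula above: with a TTFT of 0.5 s, a TPOT of 0.02 s,
# and 200 generated tokens, total generation time is roughly 0.5 + 0.02 * 200 = 4.5 s.
```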

An LLM’s total generation time varies according to a number of key factors:

  • Output Length: this is the most important factor, because models generate output one token at a time; it’s also why an LLM’s TPOT should be measured. 
  • Prefill Time: the time it takes for the model to complete the prefill stage, i.e., how long it takes to process all the input tokens from the user’s entered prompt before it can generate the first completion token. 
  • Queuing Time: there may be times when an LLM can’t keep up with user requests because of its hardware constraints – namely a lack of GPU memory. This means some input requests will be placed in a queue before they’re processed. This is the reason behind TTFT being such a commonly recorded metric, as it offers insight into how well the model’s server can handle varying numbers of user requests and, subsequently, how it might perform in a real-world setting.

Something else to consider when measuring latency is the concept of a cold start. When an LLM is invoked after previously being inactive, i.e., scaled to zero, it causes a “cold” start, as the model’s server must create an instance to process the request. This has a considerable effect on latency measurements – particularly TTFT and total generation time – so it’s crucial to check whether a model’s published inference monitoring results include a cold start or not.

Throughput

An LLM’s throughput provides a measure of how many requests it can process or how much output it can produce in a given time span. Throughput is typically measured in two ways: requests per second or tokens per second.

  • Requests per second: this metric is dependent on the model’s total generation time and how many requests are being made at the same time, i.e., how well the model handles concurrency. However, total generation time varies based on how long the model’s input and output are.
  • Tokens per second: because requests per second are influenced by total generation time, which itself depends on the length of the model’s output and, to a lesser extent, its input, tokens per second is a more commonly used metric for measuring throughput. Much like TTFT, the tokens per second metric is integral to the perceived speed of an LLM.

Additionally, tokens per second could refer to:

  • Total tokens per second: both input and output tokens
  • Output tokens per second: only generated completion tokens

Typically, total tokens per second is considered the more definitive measure of model throughput, while output tokens per second is applicable to measuring the performance of LLMs for use in real-time applications. 
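
As a rough sketch (with hypothetical record and function names), both flavours of throughput can be computed from a log of completed requests and the wall-clock length of the measurement window:

```python
from dataclasses import dataclass

@dataclass
class CompletedRequest:        # hypothetical record of one finished request
    prompt_tokens: int
    completion_tokens: int

def throughput(requests: list[CompletedRequest], wall_clock_s: float) -> dict:
    total_tokens = sum(r.prompt_tokens + r.completion_tokens for r in requests)
    output_tokens = sum(r.completion_tokens for r in requests)
    return {
        "requests_per_s": len(requests) / wall_clock_s,
        "total_tokens_per_s": total_tokens / wall_clock_s,    # input + output tokens
        "output_tokens_per_s": output_tokens / wall_clock_s,  # generated tokens only
    }

# e.g. 120 requests totalling 90,000 tokens (60,000 of them generated) over 60 s
# gives 2 requests/s, 1,500 total tokens/s, and 1,000 output tokens/s.
```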

Request Batching

One of the most effective and increasingly employed methods for increasing an LLM’s throughput is batching. Instead of loading the model’s parameters for each user prompt, batching involves collecting as many inputs as possible to process at once – so parameters have to be loaded less frequently. However, while this makes the most efficient use of a GPU and improves throughput, it does so at the expense of latency – as the users who made the earliest requests in a batch have to wait until the whole batch is processed before receiving a response. What’s more, the larger the batch size, the greater the increase in latency, although there are limits on how large a batch can get before it causes memory overflow. 

Types of batching techniques include:  

  • Static batching: also called naïve batching, this is the default batching method, in which multiple prompts are gathered together and responses are only returned once all the requests in the batch are complete.
  • Continuous batching: also known as in-flight batching; as opposed to waiting for all the prompts within a batch to be completed, this form of batching groups requests at the iteration level. As a result, once a request has been completed, a new one can replace it, making it more compute-efficient.  
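
As a minimal sketch of static batching using the transformers library (GPT-2 as a stand-in model), several prompts are padded to a common length and decoded together in a single generate call; continuous batching, by contrast, is implemented at the scheduler level by serving frameworks such as vLLM and isn’t shown here:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token   # GPT-2 has no dedicated padding token
tokenizer.padding_side = "left"             # left-pad so generation continues from the real text
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompts = [
    "Batching improves GPU utilisation because",
    "The trade-off between throughput and latency is",
]
batch = tokenizer(prompts, return_tensors="pt", padding=True)

# Every decoding step now serves all requests in the batch with a single forward pass.
outputs = model.generate(**batch, max_new_tokens=30, pad_token_id=tokenizer.eos_token_id)
for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text)
```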

What Are the Challenges of LLM Inference Performance Monitoring?

As beneficial as it is to gain insight into a model’s latency and throughput, obtaining this data isn’t always straightforward. Some of the challenges associated with measuring LLM inference include: 

  • Lack of testing consistency: there can be differences in the way inference tests are conducted, such as the type (and quantity) of GPUs used, the number and nature of prompts, whether inference is run locally or through an API, etc. These can all affect a model’s inference metrics and make it tricky to draw like-for-like comparisons when tests were conducted under different conditions. 
  • Different token lengths per model: inference performance tests typically present results in terms of token-based metrics, e.g., tokens per second – but token lengths vary per LLM. This means metrics aren’t always comparable across model types. 
  • Lack of data: quite simply, inference metrics may not be available for particular models as they weren’t published by their vendors – and no one has sufficiently tested them yet.

How Do Popular LLMs Perform on These Metrics?

Now that we’ve covered how LLMs perform inference and how it’s measured, let’s turn our attention to how some of the most popular models score on various inference metrics. 

To start, let’s look at tests performed by AI research hub Artificial Analysis, which publishes ongoing performance and benchmark tests for a collection of widely used LLMs. Although the site publishes a wide variety of inference metrics, we’re focusing on three:

  • Throughput (tokens per second)
  • Latency (total response time (TRT)): in this case, the number of seconds it takes to output 100 tokens
  • Latency (time to first chunk (TTFC)): the site opts to use TTFC as opposed to TTFT because some API hosts send tokens out in chunks rather than individually. 

Another important note: for the TRT and TTFC metrics, with the exception of Gemini Pro, Claude 2.0, and Mistral Medium, the figures below are the mean across multiple API hosts. In the case of the three OpenAI GPT models, this is the average of two API hosts, OpenAI and Azure. In contrast, for Mixtral 8x7B and Llama 2 Chat, the average is derived from eight and nine API hosting providers, respectively.

Model                 Throughput (tokens per second)   Latency, TRT (seconds)   Latency, TTFC (seconds)
Mixtral 8x7B          95                               2.66                     0.6
GPT-3.5 Turbo         92                               1.85                     0.65
Gemini Pro            86                               3.6                      2.6
Llama 2 Chat (70B)    82                               3.16                     0.88
Claude 2.0            27                               4.8                      0.9
GPT-4                 22                               7.35                     1.9
GPT-4 Turbo           20                               7.05                     1.05
Mistral Medium        19                               6.2                      0.3

In addition to the summary provided above, the site features other inference measurements, including latency and throughput over time and the costs of inference.  

The site GPT for Work features a latency tracker that continually monitors the performance of the APIs for several models from OpenAI and Azure OpenAI (GPT-4 and GPT-3.5) and Anthropic (Claude 1.0 and 2.0). It publishes the average latency of each model over a 48-hour period, based on the following settings (a rough reproduction sketch follows the list):

  • Generating a maximum of 512 tokens
  • A temperature of 0.7 
  • 10-minute measurement intervals
  • Data from three locations
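
A measurement along these lines can be approximated with a short script. The sketch below uses the OpenAI Python client’s streaming interface with the same token limit and temperature; the model name is a placeholder, and a real tracker would repeat the call on a schedule and from multiple locations:

```python
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def measure_once(model: str = "gpt-4o-mini") -> dict:  # model name is a placeholder
    start = time.perf_counter()
    ttfc = None
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Briefly explain what TTFT measures."}],
        max_tokens=512,
        temperature=0.7,
        stream=True,
    )
    for chunk in stream:
        if ttfc is None and chunk.choices and chunk.choices[0].delta.content:
            ttfc = time.perf_counter() - start   # time to first chunk
    total = time.perf_counter() - start          # total generation time
    return {"ttfc_s": ttfc, "total_s": total}

print(measure_once())
```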

Lastly, let’s look at the results of a more comprehensive study conducted by the machine learning operations organisation Predera. Now, while this study only centres on two models, Mistral Instruct and Llama 2 (though both the 7B and 70B versions of Llama 2 are tested further along in the experiment), it provides a larger array of inference metrics, including: 

  • Throughput (tokens per second)
  • Throughput (requests per second)
  • Average latency (seconds)
  • Average latency per token (seconds)
  • Average latency per output token (seconds)
  • Total time (seconds)

Additionally, the experiment sees inference performed across a varying number of GPUs used in parallel – in this case, NVIDIA L4 Tensor Core GPUs. This offers an indication of each LLM’s scalability. Lastly, these results are based on feeding each model 1,000 prompts.

1 x L4 GPU

Model                 Tokens/s    Requests/s    Avg latency (s)    Avg latency/token (s)    Avg latency/output token (s)    Total time (s)
Llama2-7B             558.54      1.17          449.88             1.71                     10.87                           897.23
Mistral-7B-instruct   915.48      1.89          277.19             0.97                     7.12                            552.44

2 x L4 GPUs

Model                 Tokens/s    Requests/s    Avg latency (s)    Avg latency/token (s)    Avg latency/output token (s)    Total time (s)
Llama2-7B             1265.17     2.65          179.85             0.63                     3.81                            397.65
Mistral-7B-instruct   1625.08     3.35          153.09             0.50                     2.65                            339.51

4 x L4 GPUs

Model                 Tokens/s    Requests/s    Avg latency (s)    Avg latency/token (s)    Avg latency/output token (s)    Total time (s)
Llama2-7B             1489.99     3.12          147.36             0.48                     2.57                            324.71
Mistral-7B-instruct   1742.70     3.59          136.49             0.44                     2.68                            285.03

8 x L4 GPUs

Model                 Tokens/s    Requests/s    Avg latency (s)    Avg latency/token (s)    Avg latency/output token (s)    Total time (s)
Llama2-7B             1401.18     2.93          153.09             0.50                     2.65                            339.51
Mistral-7B-instruct   1570.70     3.24          149.67             0.48                     2.90                            316.74
Llama2-70B            –           1.00          475.59             1.62                     9.21                            996.86

The first thing you’ll notice from the results above is that, as you’d reasonably expect, each model’s inference metrics improve across the board as more GPUs are utilized – until they reach 8 GPUs, at which point each model’s performance is worse than with 4 GPUs. This suggests that the models are only scalable up to a point: beyond it, dividing inference between additional GPUs offers little benefit while requiring additional time to distribute the workload. 

You’ll also notice that Llama2-70B only features in the experiment when 8 GPUs are used. This is because a model requires enough memory to store its stated number of parameters multiplied by the size of the data type in which those parameters are stored. In the case of Llama-2-70B, which stores parameters as 16-bit floating-point numbers, this equates to 70 billion x 2 bytes = 140 GB. As an L4 GPU has 24 GB of memory, the fewest units that could accommodate the 70B model is 6 – though, in keeping with the theme of doubling the number of GPUs used each time, it was run on 8. 
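
That back-of-the-envelope calculation can be written out explicitly. Note that it only accounts for the raw weights; activations, the KV cache, and other overheads need additional memory on top:

```python
params_billions = 70     # Llama 2 70B parameter count
bytes_per_param = 2      # 16-bit floating-point weights
gpu_memory_gb = 24       # memory of a single NVIDIA L4

weights_gb = params_billions * bytes_per_param   # 70 x 2 = 140 GB of weights
min_gpus = -(-weights_gb // gpu_memory_gb)       # ceiling division: 140 / 24 -> 6 GPUs

print(f"{weights_gb} GB of weights requires at least {int(min_gpus)} x {gpu_memory_gb} GB L4 GPUs")
```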

Conclusion

Inference performance monitoring provides a good indication of an LLM’s speed and is an effective method for comparing models against each other. However, when looking to select the most appropriate model for your organisation’s long-term objectives, it’s prudent to treat inference metrics as one determining factor rather than the sole basis for your choice of LLM.

As detailed in this guide, the latency and throughput figures published for different models can be influenced by several things, such as the type and number of GPUs used and the nature of the prompt used during tests. Moreover, even the type of recorded metrics can differ – all of which makes it difficult to get the most comprehensive understanding of a model’s capabilities. 

Plus, as alluded to at the start of this guide, there are benchmarking tests, such as HumanEval, which tests a model’s coding abilities, and MMLU, which assesses a model’s natural language understanding, that provide insight into how an LLM performs at specific tasks. Researching how a language model performs at various benchmarking tests in addition to its inference speed is a robust strategy for identifying the best LLM for your particular needs. 

Kartik Talamadupula
Director of AI Research

Kartik Talamadupula is a research scientist who has spent over a decade applying AI techniques to business problems in automation, human-AI collaboration, and NLP.