With an ever-increasing number of LLMs becoming available, it’s crucial for organisations and users to quickly navigate the growing landscape and determine which models best suit their needs. One of the most reliable ways to achieve this is through an understanding of benchmark scores.
With this in mind, this guide delves into the concept of LLM benchmarks, what the most common benchmarks are and what they entail, and what the drawbacks are of solely relying on benchmarks as an indicator of a model’s performance.
What are LLM Benchmarks and Why are They Important?
An LLM benchmark is a standardised performance test used to evaluate various capabilities of AI language models. A benchmark usually consists of a dataset, a collection of questions or tasks, and a scoring mechanism. After undergoing the benchmark’s evaluation, models are usually awarded a score from 0 to 100.
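To make the three ingredients concrete, here is a minimal sketch of a multiple-choice benchmark, assuming accuracy is the scoring mechanism and is reported on a 0 to 100 scale. The toy dataset, the `score_multiple_choice` function, and the `always_second_option` stand-in model are all hypothetical and are not taken from any real benchmark.

```python
from typing import Callable, Sequence

# Hypothetical toy dataset: each item has a question, answer options,
# and the index of the correct option.
TOY_DATASET = [
    {"question": "Which gas do plants absorb?",
     "choices": ["Oxygen", "Carbon dioxide", "Nitrogen", "Helium"],
     "answer_index": 1},
    {"question": "What is 2 + 3?",
     "choices": ["4", "5", "6", "7"],
     "answer_index": 1},
    {"question": "Which planet is closest to the Sun?",
     "choices": ["Venus", "Earth", "Mercury", "Mars"],
     "answer_index": 2},
    {"question": "At what temperature in Celsius does water freeze?",
     "choices": ["0", "10", "32", "100"],
     "answer_index": 0},
]


def score_multiple_choice(
    questions: Sequence[dict],
    model_answer: Callable[[str, Sequence[str]], int],
) -> float:
    """Scoring mechanism (assumed here to be plain accuracy), scaled to 0-100."""
    if not questions:
        raise ValueError("benchmark dataset is empty")
    correct = sum(
        1
        for q in questions
        if model_answer(q["question"], q["choices"]) == q["answer_index"]
    )
    return 100.0 * correct / len(questions)


def always_second_option(question: str, choices: Sequence[str]) -> int:
    """Stand-in for a real model: always picks the second option."""
    return 1


print(score_multiple_choice(TOY_DATASET, always_second_option))  # 50.0
```

A real evaluation would replace `always_second_option` with a call to the model under test and would use the benchmark's published dataset and scoring rules.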
Benchmarks are valuable to organisations, namely product managers and developers, and to users because they provide an objective indication of an LLM’s performance. A common, standardised collection of assessments makes it easier to compare one model against another and, ultimately, to select the best one for your proposed use case.
Additionally, benchmarks are incredibly useful to LLM developers and AI researchers as they provide a quantitative consensus on what constitutes good performance. Benchmark scores reveal where a model excels and, conversely and more importantly, where it falls short. Subsequently, developers can compare their model’s performance against their competition and make necessary improvements. The transparency that well-constructed benchmarks foster allows those in the LLM field to build off each other’s progress – accelerating the overall advancement of language models in the process.
Popular LLM Benchmarks
Here’s a selection of the most commonly used LLM benchmarks, along with their pros and cons.
AI2 Reasoning Challenge (ARC) is a question-answer (QA) benchmark that’s designed to test an LLM’s knowledge and reasoning skills. ARC’s dataset consists of 7787 four-option multiple-choice science questions that range from a 3rd to 9th-grade difficulty level. ARC’s questions are divided into Easy and Challenge sets that test different types of knowledge such as factual, definition, purpose, spatial, process, experimental, and algebraic.
ARC was devised to be a more comprehensive and difficult benchmark than previous QA benchmarks, such as the Stanford Question Answering Dataset (SQuAD) or the Stanford Natural Language Inference (SNLI) corpus, which only tended to measure a model’s ability to extract the correct answer from a passage. To achieve this, the ARC corpus provides distributed evidence: it typically contains most of the information required to answer a question but spreads the pertinent details throughout a passage. This requires a language model to solve ARC questions through its knowledge and reasoning abilities instead of explicitly memorising the answers.
Pros and cons of the ARC benchmark
- Varied and challenging dataset
- Pushes AI vendors to improve QA abilities – not just through fact retrieval but by integrating information from several sentences.
- Only consists of scientific questions
The HellaSwag benchmark (short for Harder Endings, Longer contexts, and Low-shot Activities for Situations with Adversarial Generations) tests the commonsense reasoning and natural language inference (NLI) capabilities of LLMs through sentence completion exercises. HellaSwag is a successor to the SWAG benchmark; each exercise is composed of a segment of a video caption as an initial context and four possible endings, of which only one is correct.
Each question revolves around common, real-world physical scenarios that are designed to be easily answerable for humans (with an average score of around 95%) but challenging for NLP models.
HellaSwag’s corpus was created through a process called adversarial filtering, an algorithm that increases the complexity by generating deceptive wrong answers, called adversarial endings, which contain words and phrases relevant to the context – but defy conventional knowledge about the world. These adversarial endings are such that they immediately stand out to most people but often prove difficult for LLMs.
Pros and Cons of the HellaSwag Benchmark
- Similar to ARC, it evaluates a model’s common sense and reasoning, as opposed to mere ability to recall facts
- Thoroughly curated dataset: all easily completed contexts from the SWAG dataset were discarded and human assistants sifted through the adversarial endings and chose the best 70,000.
- General knowledge – doesn’t test common sense reasoning for specialised domains.
Massive Multitask Language Understanding (MMLU) is a broad but important benchmark that measures an LLM’s natural language understanding (NLU), i.e., how well it understands language and, subsequently, its ability to solve problems with the knowledge to which it was exposed during training. MMLU was devised to challenge models on their NLU capabilities – in contrast to the NLP tasks on which a growing number of models were excelling at the time.
The MMLU dataset consists of 15,908 questions divided into 57 tasks drawn from a variety of online sources that test both qualitative and quantitative analysis. Its questions cover STEM (science, technology, engineering and mathematics), humanities (language arts, history, sociology, performing and visual arts, etc.), social sciences, and other subjects from an elementary to an advanced professional level. This in itself was a departure from other NLU benchmarks at the time of its release (like SuperGLUE), which focused on basic knowledge rather than the specialised knowledge covered by MMLU.
Pros and Cons of the MMLU Benchmark
- Tests a broad range of subjects at various levels of difficulty
- Broad corpus helps identify areas of general knowledge in which models are deficient
- Limited information on how corpus was constructed
- Dataset is shown to have numerous errors
While an LLM may be capable of producing coherent and well-constructed responses, it doesn’t necessarily mean they’re accurate. The TruthfulQA benchmark attempts to address this, i.e., language models’ tendency to hallucinate, by measuring a model’s ability to generate truthful answers to questions.
There could be several reasons why an LLM produces inaccurate responses. Chief among them is a lack of training data on particular subjects, which leaves the model unable to generate a truthful answer. Similarly, the LLM could have been trained on low-quality data that was full of inaccuracies. Alternatively, false answers may have been incentivised by faulty training objectives during the model’s training: these are known as imitative falsehoods.
TruthfulQA’s dataset is designed in such a way as to encourage models to choose imitative falsehoods instead of true answers. It assesses the truthfulness of an LLM’s response by how much it describes the literal truth about the real world. Consequently, answers that stem from a particular belief system or works of fiction present in the training data are considered false. Additionally, TruthfulQA measures how informative an answer is – to avoid LLMs attaining high scores by simply responding sceptically with “I don’t know” or “I’m not sure”.
The TruthfulQA corpus consists of 817 questions across 38 categories, such as finance, health, and politics. To calculate a score, each model is put through two tasks. The first requires the model to generate answers to a series of questions. Each response is scored between 0 and 1 by human evaluators, where 0 is false and 1 is true. For the second task, instead of generating an answer, the LLM must choose true or false for a series of multiple-choice questions, and its choices are tallied. The two scores are then combined to produce a final result.
Pros and cons of the TruthfulQA benchmark
- Diverse dataset
- Tests LLMs for hallucinations and encourages model accuracy
- Corpus covers general knowledge, so not a great indicator of truthfulness for specialised domains
WinoGrande is a benchmark that evaluates an LLM’s commonsense reasoning abilities and is based on the Winograd Schema Challenge (WSC) machine learning tests. The benchmark presents a series of pronoun resolution problems in which two near-identical sentences have two possible answers, and the correct answer changes based on a trigger word. A classic WSC example is “The trophy doesn’t fit into the brown suitcase because it’s too large”: swapping the trigger word “large” for “small” changes the referent of “it” from the trophy to the suitcase.
WinoGrande’s dataset contains 44,000 well-designed, crowdsourced problems – which is a significant increase from the 273 problems in the WSC. Additionally, the AFLITE algorithm, which is based on HellaSwag’s adversarial filtering algorithm, was applied to the dataset to both increase its complexity and reduce any inherent bias, i.e., annotation artefacts.
Pros and Cons of the WinoGrande Benchmark
- Large crowdsourced and algorithmically-curated dataset.
- The presence of annotation artefacts in the dataset. Annotation artefacts are patterns within the data that unintentionally reveal information about the target label. Although AFLITE is designed to remove these, it’s not 100% accurate due to the size of the corpus.
The GSM8K (which stands for Grade School Math 8K) benchmark measures a model’s multi-step mathematical reasoning abilities. It contains a corpus of around 8,500 grade-school-level math word problems devised by humans, which is divided into around 7,500 training problems and 1,000 test problems.
Each problem takes two to eight steps to solve, and each step involves a fairly simple calculation using the basic arithmetic operators (+, −, ×, ÷). The solution to each problem is written in natural language form as opposed to a mathematical expression.
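Because solutions are written in natural language, scoring a model usually means pulling a final number out of free text and comparing it with the reference. The sketch below assumes the reference solution ends with a line of the form `#### <number>`, as in the published GSM8K files, and assumes (as a simplification) that the last number in the model’s output is its final answer. The helper names and the example problem are hypothetical.

```python
import re
from typing import Optional

# Matches integers and decimals, allowing thousands separators such as 1,000.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def reference_answer(solution: str) -> str:
    """Final answer from a reference solution ending in '#### <number>'."""
    return solution.split("####")[-1].strip().replace(",", "")


def predicted_answer(model_output: str) -> Optional[str]:
    """Assumption: the last number in the output is the model's final answer."""
    matches = NUMBER.findall(model_output)
    return matches[-1].replace(",", "") if matches else None


def is_correct(model_output: str, solution: str) -> bool:
    """Compare the extracted prediction with the reference numerically."""
    predicted = predicted_answer(model_output)
    if predicted is None:
        return False
    return float(predicted) == float(reference_answer(solution))


# Made-up problem in the same style: 3 boxes of 12 pencils, 6 given away.
solution = "3 * 12 = 36 pencils in total.\n36 - 6 = 30 pencils left.\n#### 30"
output = "The 3 boxes hold 36 pencils, and after giving away 6 Tom has 30."
print(is_correct(output, solution))  # True
```

Taking the last number is a fragile heuristic; a model that restates part of the question after its answer would be marked wrong, so real evaluation harnesses tend to constrain the output format instead.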
Pros and Cons of the GSM8K Benchmark
- Tests mathematical reasoning, which remains a critical weakness in modern language models.
- Problems are framed with high linguistic diversity
- Problems are relatively simple to solve, so the benchmark could be obsolete fairly soon
The General Language Understanding Evaluation (GLUE) benchmark tests an LLM’s NLU capabilities and was notable upon its release for its variety of assessments. SuperGLUE improves upon GLUE with a more diverse and challenging collection of tasks that assess a model’s performance across eight subtasks and two metrics, with their average providing an overall score.
Here’s a summary of the SuperGLUE benchmark’s subtasks and metrics:
- Boolean Questions (BoolQ): yes/no QA task
- CommitmentBank (CB): three-class textual entailment task
- Choice of Plausible Alternatives (COPA): causal reasoning task
- Multi-Sentence Reading Comprehension (MultiRC): true/false QA task
- Reading Comprehension with Commonsense Reasoning Dataset (ReCoRD): multiple-choice QA task
- Recognizing Textual Entailment (RTE): two-class classification task
- Word-in-Context (WiC): binary classification task
- Winograd Schema Challenge (WSC): pronoun resolution problems
- Broad Coverage Diagnostics: to automatically test an LLM’s linguistic, common sense, and general world knowledge
- Analysing Gender Bias in Models: an analytical tool for detecting a model’s social biases
Pros and cons of the SuperGLUE benchmark
- A thorough and diverse range of tasks that test a model’s NLU capabilities
- A smaller range of models is tested against SuperGLUE than against similar benchmarks such as MMLU
HumanEval (also often referred to as HumanEval-Python) is a benchmark designed to measure a model’s ability to generate functionally correct code; it consists of the HumanEval dataset and the pass@k metric.
The HumanEval dataset was carefully designed and contains 164 diverse coding challenges, each of which includes several unit tests (7.7 on average). The pass@k metric calculates the probability that at least one of k generated code samples passes the coding challenge’s unit tests, given that there are c correct samples from n generated samples.
In the past, the BLEU (bilingual evaluation understudy) metric was used to assess the textual similarity of model-generated coding solutions compared with human ones. The problem with this approach, however, is that it doesn’t evaluate the functional correctness of the generated solution; for more complex problems, a solution could still be functionally correct while appearing textually different from the solution produced by a person. HumanEval addresses this by using unit tests to evaluate a code sample’s functionality in a similar way to how humans would.
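The description of pass@k above can be written as a short function. This is a minimal sketch assuming the standard combinatorial form, pass@k = 1 − C(n − c, k) / C(n, k): the chance that a random draw of k samples from the n generated ones contains at least one of the c correct ones. The function name `pass_at_k` and the example numbers are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without
    replacement from n generated samples of which c are correct,
    passes the unit tests."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # C(n - c, k) counts the draws that contain only incorrect samples;
    # math.comb returns 0 when k > n - c, which gives a result of 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers: 10 samples generated for one problem, 3 of them correct.
print(round(pass_at_k(n=10, c=3, k=1), 4))  # 0.3
print(round(pass_at_k(n=10, c=3, k=5), 4))  # 0.9167
```

A benchmark-level score would then be the mean of this per-problem value across all of the coding challenges.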
Pros and cons of the HumanEval benchmark
- A good indication of a model’s coding capability
- Unit testing mirrors the way humans evaluate code functionality
- The HumanEval dataset doesn’t comprehensively capture how coding models are used in practice. For instance, it doesn’t test for aspects such as writing tests, code explanation, code infilling, or docstring generation.
MT-Bench is a benchmark that evaluates a language model’s capability to effectively engage in multi-turn dialogues. By simulating the back-and-forth conversations that LLMs would have in real-life situations, MT-Bench provides a way to measure how effectively chatbots follow instructions and the natural flow of conversations.
MT-Bench was developed through the use of Chatbot Arena: a crowd-sourced platform that allows users to evaluate a variety of chatbots by entering a prompt and comparing two responses side-by-side. Users could then vote for which model provided the best response, and the votes were recorded and tallied to produce a leaderboard of the best-performing LLMs.
Through this process, the researchers behind Chatbot Arena identified eight main types of user prompts: writing, roleplay, extraction, reasoning, math, coding, knowledge I (STEM), and knowledge II (humanities). Subsequently, they devised 10 multi-turn questions per category, to create a total set of 80 questions. While the results from Chatbot Arena are subjective, MT-Bench is intended to complement it with a more objective measure of a model’s conversational capabilities.
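As a rough illustration of how side-by-side votes can become a ranking, the sketch below tallies each model’s share of the comparisons it won. This is a deliberately simple win-rate tally and an assumption on our part: the platform’s actual ranking method may be more sophisticated, and the model names and votes are made up.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Vote = Tuple[str, str, str]  # (model_a, model_b, winner)


def win_rate_leaderboard(votes: Iterable[Vote]) -> List[Tuple[str, float]]:
    """Rank models by the fraction of their comparisons that they won."""
    wins: Counter = Counter()
    battles: Counter = Counter()
    for model_a, model_b, winner in votes:
        if winner not in (model_a, model_b):
            raise ValueError("winner must be one of the two compared models")
        battles[model_a] += 1
        battles[model_b] += 1
        wins[winner] += 1
    rates = {model: wins[model] / battles[model] for model in battles}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Hypothetical votes, for illustration only.
votes = [
    ("model-x", "model-y", "model-x"),
    ("model-x", "model-z", "model-x"),
    ("model-y", "model-z", "model-z"),
    ("model-x", "model-y", "model-y"),
]
for model, rate in win_rate_leaderboard(votes):
    print(f"{model}: {rate:.2f}")
```

A raw win rate ignores the strength of each opponent, which is one reason a rating system that accounts for who was beaten is often preferred for this kind of data.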
Pros and Cons of the MT-Bench Benchmark
- Measures a model’s ability to answer subsequent, related questions
- Though carefully curated, the dataset is small
- Hard to simulate the broad and unpredictable nature of conversations
What are LLM leaderboards?
Although understanding what various benchmarks signify about a particular LLM’s performance is important, it is also necessary to establish how models compare against each other to choose the best one for your needs: this is where LLM leaderboards come in.
An LLM leaderboard is a published list of benchmark results for each language model. The designers of each benchmark tend to maintain their own LLM leaderboards, but there are also independent leaderboards that evaluate models on a series of benchmarks for a more comprehensive assessment of their abilities.
The best example of these are the leaderboards featured on HuggingFace that evaluate and rank a large variety of open-source LLMs based on six of the benchmarks detailed above – ARC, HellaSwag, MMLU, TruthfulQA, WinoGrande, and GSM8K.
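A multi-benchmark leaderboard has to collapse several scores into one ranking. The sketch below assumes a simple unweighted mean of the six benchmark scores, which is one plausible aggregation rather than a statement of how any particular leaderboard is computed. The model names and scores are invented placeholders, not real results.

```python
from statistics import mean
from typing import Dict, List, Tuple

BENCHMARKS = ("ARC", "HellaSwag", "MMLU", "TruthfulQA", "WinoGrande", "GSM8K")


def rank_models(results: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
    """Rank models by the unweighted mean of their benchmark scores."""
    rows = []
    for model, scores in results.items():
        missing = [b for b in BENCHMARKS if b not in scores]
        if missing:
            raise ValueError(f"{model} is missing scores for {missing}")
        rows.append((model, mean(scores[b] for b in BENCHMARKS)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Placeholder scores on a 0-100 scale, for illustration only.
results = {
    "model-a": {"ARC": 60.0, "HellaSwag": 80.0, "MMLU": 60.0,
                "TruthfulQA": 40.0, "WinoGrande": 70.0, "GSM8K": 20.0},
    "model-b": {"ARC": 50.0, "HellaSwag": 70.0, "MMLU": 65.0,
                "TruthfulQA": 45.0, "WinoGrande": 75.0, "GSM8K": 55.0},
}
for model, average in rank_models(results):
    print(f"{model}: {average:.1f}")
```

Note how model-a leads on three individual benchmarks yet ranks second overall because of its low GSM8K score: a single average can hide exactly the per-task differences that matter for a specific use case.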
What are the Problems with LLM Benchmarking?
While LLM benchmarks are undoubtedly helpful in general, the best practice is to use them as a guide – or at best a strong indicator – of a model’s capabilities, and not a definitive one.
The first reason for this is benchmark leakage, which refers to instances where an LLM is trained on the same data contained in benchmark datasets. As a result, the model has a chance to learn the solutions to the challenges posed by particular benchmarks, i.e., overfitting, instead of actually solving them – so the model’s score isn’t truly reflective of its capabilities in the aspect being assessed. Worse, as a benchmark becomes increasingly established and widespread, models can be trained or fine-tuned to score highly, and become less useful on the specific tasks being measured – a manifestation of Goodhart’s law.
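One common heuristic for spotting possible leakage is to check how many word n-grams from a benchmark item also appear in the training text. The sketch below is a simplified illustration of that idea, not a method described by any specific benchmark; the window size of five words, the function names, and the example strings are all assumptions.

```python
from typing import Set, Tuple


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """All runs of n consecutive lower-cased words in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(benchmark_item: str, training_text: str, n: int = 5) -> float:
    """Fraction of the item's n-grams that also occur in the training text."""
    item_ngrams = word_ngrams(benchmark_item, n)
    if not item_ngrams:
        return 0.0  # item is shorter than n words, so nothing to compare
    shared = item_ngrams & word_ngrams(training_text, n)
    return len(shared) / len(item_ngrams)


# Made-up strings for illustration.
item = "which gas do plants absorb from the air during photosynthesis"
leaked = ("quiz answers: which gas do plants absorb from the air "
          "during photosynthesis carbon dioxide")
clean = "plants need sunlight water and air to grow well in most gardens"

print(overlap_ratio(item, leaked))  # 1.0
print(overlap_ratio(item, clean))   # 0.0
```

A high ratio only flags an item for closer inspection; paraphrased or translated copies of a benchmark question would slip past an exact-match check like this one.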
Another issue with LLM benchmarks is that a model’s score may not accurately reflect how it will perform when applied to a real-world use case. Benchmarking takes place in controlled settings, with a finite dataset, and with evaluators looking for a model to fulfil specific criteria. This stands in contrast to the unpredictability and variety of real-world applications, which makes it difficult to predict the “in-field” performance of a particular LLM.
Conversation-based LLMs are a prominent example of this, as it is hard for benchmarks to account for the unpredictability of conversations. While benchmarks mainly measure a model’s performance on self-contained short tasks, it is difficult to determine how long a real conversation is going to be. Although MT-Bench is designed to assess an LLM’s ability to engage in conversation, it only does so for a limited number of questions – and uses a small, albeit carefully crafted, dataset.
Additionally, on the subject of datasets, as benchmarking is mostly carried out with corpora containing broad, general knowledge, it is hard to determine how well a model will perform in a specific or specialised domain. Subsequently, the more specific the use case, the less applicable a benchmark score is likely to be.
Benchmarking offers a clear and objective way to assess an LLM’s capabilities and compare how well models perform in relation to each other. As the use of LLMs is projected to rapidly increase over the next few years, benchmarks will become increasingly important for helping organisations decide which language models best suit their long-term objectives. For all their benefits, LLM benchmarks aren’t perfect – which is why machine learning researchers are constantly developing and refining methods of measuring a model’s performance.