The predictive capabilities of AI models have progressed rapidly in the last few years and continue to push new boundaries. However, two issues persistently plague AI developers and researchers: overfitting and underfitting. Moreover, as the field progresses and models are tasked with processing increasingly complex, high-dimensional datasets (as with multi-modal models, for instance), overfitting and underfitting become ever more likely. It is thus crucial to understand these challenges and develop strategies to mitigate them.

In this guide, we explore the concepts of overfitting and underfitting: the reasons they occur, how to diagnose instances of both, and techniques to mitigate them to enhance a model’s performance.

Explaining Bias and Variance

To best explain the concepts of overfitting and underfitting, we first need to understand two concepts that are central to them – bias and variance. 

Bias is a statistical concept that, in the context of AI, denotes a model's inability to accurately identify the patterns within a dataset, measured as the difference between its predicted outputs and the correct output labels. The amount of bias present in a model depends on the number of incorrect assumptions made during training: the more assumptions made, the higher the potential bias. Bias can occur for several reasons, such as an overly simplistic model or a training dataset that is too small and/or lacks sufficient variety.

Variance, meanwhile, measures how sensitive an AI model is to changes in its training data. The greater the difference between a model's performance on its training dataset and on subsequent (test) datasets, the higher the variance. Variance can be caused by several factors, including a model that is too complex and non-optimal feature selection.
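
To make these two quantities concrete, here is a minimal sketch (assuming scikit-learn and NumPy, with a synthetic sine-wave task invented purely for illustration) that estimates a model's bias and variance empirically by retraining it on many resampled training sets:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def true_fn(x):
    # The "ground truth" pattern the model is trying to learn.
    return np.sin(x)

x_test = np.linspace(0, 6, 50)
all_preds = []
for _ in range(200):  # retrain on 200 freshly sampled training sets
    x_train = rng.uniform(0, 6, 40)
    y_train = true_fn(x_train) + rng.normal(0, 0.3, 40)  # noisy labels
    model = DecisionTreeRegressor().fit(x_train.reshape(-1, 1), y_train)
    all_preds.append(model.predict(x_test.reshape(-1, 1)))

preds = np.array(all_preds)                  # shape: (200 runs, 50 test points)
avg_pred = preds.mean(axis=0)
bias_sq = np.mean((avg_pred - true_fn(x_test)) ** 2)  # squared bias
variance = np.mean(preds.var(axis=0))                 # variance across runs
print(f"bias^2 ~= {bias_sq:.3f}, variance ~= {variance:.3f}")
```

A flexible model such as the unpruned decision tree above will typically show low bias but high variance; swapping in a much simpler model reverses the pattern.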

What Is Overfitting and How Does It Occur?

Overfitting describes a situation in which a model consistently makes accurate predictions on its training dataset, but fails to perform as well on testing data.

For a real-world analogy of overfitting, picture a student preparing for an exam with a comprehensive set of practice questions and answers. The student studies the practice questions so thoroughly that they come to memorize the answers. However, when they eventually take the exam and are faced with a set of unfamiliar questions, they perform poorly – because they studied and memorized the answers, but not the methodology or reasoning behind how the answers were generated. 

In the same way, overfitting occurs when a model learns to produce the correct outputs for its training data – even going as far as to capture its noise and outliers – but fails to learn the underlying relationship between the inputs and the outputs.

Overfitting is caused by the combination of a model having low bias and high variance – which means an overfitted model makes few assumptions and is very sensitive to changes in its training data. 

How Can You Tell if a Model Is Overfitted?

To diagnose whether a model is overfitted, you must pay attention to two key metrics: training error rate and testing error rate. The model's training error rate measures how often it makes inaccurate predictions on its training dataset, while the testing (or generalization) error rate measures its performance on a separate testing or validation dataset.

You can tell that overfitting is occurring if the model consistently produces a low training error rate and a higher testing error rate, i.e., it performs better on its training dataset than on the testing dataset. Another indicator of overfitting is instability: the model shows a disproportionately large difference in performance in response to small changes in the training data, i.e., high variance.
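
As a minimal illustration of this check (assuming scikit-learn; the dataset and model here are synthetic stand-ins, not a recommendation):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# An unconstrained decision tree can effectively memorize its training set.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_error = 1 - model.score(X_train, y_train)  # training error rate
test_error = 1 - model.score(X_test, y_test)     # testing (generalization) error rate
print(f"train error: {train_error:.2f}, test error: {test_error:.2f}")
# A near-zero training error alongside a noticeably higher testing error
# is the classic signature of overfitting.
```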

What Are the Implications of Overfitting?

To explore the effects of an overfitted model, let us look at an example featuring the relationship between a bird's wingspan and its weight.

In most instances, a bird's wingspan will positively correlate with its weight, i.e., the larger the bird, the wider its wingspan. However, this won't always be the case, due to outlying data points for which a smaller species of bird has a wider wingspan. If the model overfits to the training data, including the outliers, it won't accurately capture the true relationship within the dataset, and it will produce inaccurate predictions on subsequent evaluation sets.
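
To make this concrete, here is an illustrative sketch with made-up wingspan and weight figures (the numbers are invented for demonstration): a needlessly flexible polynomial chases the training outliers, while a simple linear fit captures the genuine trend:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
wingspan = rng.uniform(0.2, 2.0, 30)              # wingspan in metres
weight = 2.5 * wingspan + rng.normal(0, 0.2, 30)  # weight in kg, roughly linear
weight[:3] += 2.0                                 # a few outlier birds

X_train, y_train = wingspan[:20].reshape(-1, 1), weight[:20]
X_test, y_test = wingspan[20:].reshape(-1, 1), weight[20:]

for degree in (1, 15):  # a simple fit vs. an overly flexible one
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

The degree-15 model bends to pass near the outliers, so its training error is low while its testing error balloons; the linear fit generalizes far better.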

The implication of this is that, despite showing early promise, the model is too inconsistent to apply to real-life use cases. Although it may perform well on new data that is similar to the data it was trained on, the model is too unpredictable to be relied upon for a wider range (or distribution) of unseen data.

What Is Underfitting and How Does It Occur?

In contrast to overfitting, underfitting refers to a situation in which a model shows poor predictive abilities for both its training data and testing data.

Returning to our student analogy, underfitting is akin to a student who studied the practice questions and answers but failed to comprehend them. Consequently, when they take their exam and are faced with new questions, they perform poorly – and would likely have performed poorly even if the questions were all already part of the practice set – because they never grasped the subject in the first place.

Similarly, an underfitted model – whether due to its simplicity, a lack of training time, or an insufficient dataset – is unable to identify and learn the relationship between the input data and the target output labels, resulting in inconsistent or poor predictive capabilities. 

Underfitting is caused by high bias, typically accompanied by low variance: the model makes too many simplifying assumptions and is too insensitive to the details of its training data to capture the underlying patterns.

How Can You Tell if a Model Is Underfitted?

When a model is underfitted, it will exhibit both a high training error rate and a high testing error rate, reflecting the twin facts that it hasn't learned the relationships within the training data and cannot generalize to the previously unseen data in the validation set.

Since an underfitted model is characterized by poor predictive ability on its training data, it is typically easier to diagnose than an overfitted model. 
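
A minimal sketch of this diagnosis (assuming scikit-learn and a synthetic nonlinear dataset invented for illustration) shows a deliberately oversimple model producing high error on both splits:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, (200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)  # a clearly nonlinear pattern

X_train, y_train, X_test, y_test = X[:150], y[:150], X[150:], y[150:]

# A straight line is too simple for a sine wave: it underfits.
model = LinearRegression().fit(X_train, y_train)
print("train MSE:", mean_squared_error(y_train, model.predict(X_train)))
print("test MSE: ", mean_squared_error(y_test, model.predict(X_test)))
# Both errors come out high, the signature of underfitting.
```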

What Are the Implications of Underfitting?

Returning to our example dataset relating a bird's wingspan to its weight: an underfitted model will fail to capture the relationship at all, and will therefore make inconsistent predictions.

Unfortunately, the implication of this is that the model is even less suitable for real-world use than the already inconsistent overfitted model. Furthermore, as underfitting stems from high bias – the model has failed to learn the relationship in the first place – it will likely take more work to mitigate than an overfitted model, where variance is the main inhibitor of performance.

Overfitting and Underfitting: How to Achieve Balance Between the Two

To maximize a model's predictive abilities across a wide range of data and ensure its reliable performance in real-world use cases, one must strike a balance between the model being overfitted and underfitted. To achieve this, we must aim for the model to exhibit low bias and low variance.

This is possible by systematically employing various techniques designed to reduce bias and/or variance, respectively, until the model’s training and testing error rates are comparable. We explore some of those techniques in greater detail below.
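
Before examining individual techniques, here is a minimal sketch (assuming scikit-learn's validation_curve helper and a synthetic task) of how one might sweep a model's complexity and look for the point where training and validation performance are comparable:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
depths = [1, 2, 4, 8, 16]
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5)

for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"max_depth={d:2d}  train acc={tr:.2f}  val acc={va:.2f}")
# Small depths underfit (both scores low); large depths overfit (training
# accuracy high, validation accuracy lagging). Aim for the setting where the
# two scores are both high and close together.
```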

Mitigating Overfitting

To address an overfitted model, we need to lower variance. Here are some of the main ways to do so: 

  • Increase the size of the training dataset
    If a dataset is too small, it will lack the required diversity to sufficiently represent its underlying data distribution. This could result in overfitting, as the model tries to map a probabilistic relationship onto the limited number of available data points. In contrast, by providing the model with a larger dataset, it has access to a greater variety of data with which to learn the relationships between data points – and will pay less attention to details specific to just the training dataset.
  • Improve the quality of the training dataset
    In addition to its quantity, the quality of the training data plays a considerable role in how well a model can identify the underlying relationships in the data. Ways of increasing the quality of training data include:
    • Removing inaccurate data: a model's output can only be as accurate as its input, so it is crucial to remove incorrect data from the training corpus. This prevents the model from learning from errors within the training data and making inaccurate predictions based on them.
    • Ensuring the training data is representative of the broader dataset: the training data must be diverse and unbiased so it accurately represents the data involved in the use case to which the model will be applied. If not, the model may struggle with generalizing to previously unseen data points.
    • Handling outliers: instances that vary considerably from the majority of the data can cause a model to overfit if it attempts to perfectly map them to a probabilistic relationship. This is because the model is capturing the noise in the data instead of genuine patterns.

      On the other hand, removing all outliers might prevent the model from learning aspects of the underlying relationship or from generalizing successfully. With this in mind, one must decide how representative the outliers are of instances in the overall dataset, and whether to remove them. In practice, it is prudent to remove varying numbers of outliers (perhaps based on their deviation from the mean) and observe whether the model performs better on the test dataset.
  • Remove irrelevant features
    If the training data contains an excessive number of features, i.e., the measurable characteristics from which the output is predicted, the unnecessary complexity can cause the model to make inaccurate predictions.
  • Reduce the number of training epochs
    If a model is trained for too long, it could start to learn the training data itself instead of the underlying distribution – resulting in overfitting. Instead, one must identify the optimal number of training epochs by monitoring the training and validation error rates and stopping when they are comparable, i.e., when the model finds the balance between its accuracy on the training data and the test data (the second sketch after this list shows a simple early-stopping loop).
  • Regularization
    This refers to a family of techniques that reduce a model's complexity by reducing or nullifying the influence of particular parameters, or weights. Since overfitting often makes a model exaggerate the importance of certain weights, regularization lessens their influence on the output, leading to better performance on testing datasets.

    Common methods of regularization include: 
    • L1 regularization: also called lasso regularization, this involves adding the sum of the absolute values of the model’s weights to the loss function, i.e., the quantified error between the predicted and actual output.
    • L2 regularization: also referred to as ridge regression, this sees the sum of the squared values of the weights added to the loss function (squaring ensures that negative weights still contribute positively).

      Both L1 and L2 make use of an alpha coefficient, a non-negative value that determines the strength of regularization. The closer the alpha is to 0, the weaker the regularization and the greater the chance of continuing to overfit; the larger the alpha, the more heavily the weights are suppressed and the greater the risk of underfitting.

      The difference between L1 and L2 regularization is that L1 reduces the value of certain weights to zero, creating a sparser model that lends itself well to feature selection. In contrast, L2 regularization reduces the value of weights more uniformly, which is better for models and networks with codependent features. The two methods can also be combined, in a process called elastic net regularization, to gain their respective benefits; a short sketch of all three follows this list.
    • Dropout regularization: during each training pass, every node has a probability, p, of being rendered inactive and consequently not factoring into the output for that pass (see the second sketch after this list).
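
To illustrate the first two methods, here is a minimal sketch using scikit-learn's Lasso, Ridge, and ElasticNet estimators on a synthetic regression task (the alpha values are arbitrary examples, not recommendations):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1: drives some weights to exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks all weights more uniformly
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)  # blend of L1 and L2

print("nonzero lasso weights:", int(np.sum(lasso.coef_ != 0)), "of", X.shape[1])
print("nonzero ridge weights:", int(np.sum(ridge.coef_ != 0)), "of", X.shape[1])
# Raising alpha strengthens the penalty (risking underfitting); lowering it
# weakens the penalty (risking continued overfitting).
```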

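And here is a minimal early-stopping and dropout sketch, assuming PyTorch and using random tensors purely as stand-in data:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X_train, y_train = torch.randn(200, 20), torch.randn(200, 1)  # stand-in data
X_val, y_val = torch.randn(50, 20), torch.randn(50, 1)

model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),
    nn.Dropout(p=0.5),              # each unit dropped with probability 0.5
    nn.Linear(64, 1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(200):
    model.train()                   # training mode: dropout is active
    optimizer.zero_grad()
    loss_fn(model(X_train), y_train).backward()
    optimizer.step()

    model.eval()                    # evaluation mode: dropout is disabled
    with torch.no_grad():
        val_loss = loss_fn(model(X_val), y_val).item()
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1             # validation loss failed to improve
    if bad_epochs >= patience:      # early stopping: halt before the model
        break                       # starts memorizing the training set
```
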
Mitigating Underfitting

To address an underfitted model, bias needs to be lowered. Some of the mitigation methods, particularly those concerning the training data, overlap with those for overfitting, while others, such as increasing model complexity, pull in the opposite direction. Here are a few ways of accomplishing this.

  • Increase the size of the training dataset
    If the training dataset is too small for the model to learn its patterns, this will result in underfitting. Consequently, a larger training set gives the model more opportunities to learn the relationships between data points and make more accurate predictions.

    Additionally, enlarging the dataset increases the diversity of the data, helping to decrease any inherent bias in it. As a result, the model itself will exhibit less bias, which is a key factor in mitigating underfitting.
  • Improve the quality of the training dataset
    Just as noisy or uncleaned data can cause overfitting, it can also cause a model to underfit. Removing inaccurate instances from the dataset is the most fundamental way to enhance its quality and potentially mitigate underfitting.

    Additionally, one must decide how to handle outlying data points with respect to how accurately they reflect genuine deviations from the mean in the broader context of the use case. On the one hand, removing the noise can prevent underfitting; on the other, it may result in the model failing to adequately learn the true extent of the relationships within the data.
  • Increase the model’s complexity
    If a model isn’t sophisticated enough, it may not be capable of identifying patterns within the dataset, leading to underfitting. By increasing the number of layers within its architecture, and/or the number of nodes in each layer, the model gains more parameters with which to capture the patterns in its training data, resulting in greater accuracy.
  • Improve feature selection
    Similarly, a model may underfit because the input features of the training data aren’t descriptive or expressive enough for the model to consistently predict the correct output label. With this in mind, adding more features and/or making certain features more prominent will help increase the model’s complexity and yield more consistently correct predictions. 
  • More training epochs
    In the same way that an insufficient dataset can prevent a model from determining underlying patterns, training a model for an insufficient amount of time can have the same effect. Therefore, simply increasing the number of epochs could help mitigate underfitting and enhance a model’s performance. 
  • Decrease regularization
    While regularization can mitigate overfitting, if applied too rigorously, i.e., with too high an alpha coefficient, the model’s weights can be suppressed so heavily that it can no longer identify the underlying patterns. Thus, by decreasing the level of regularization, one can restore the model’s expressiveness and prevent underfitting (a sketch combining several of these fixes follows this list).
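
As an illustration of the last three methods combined, here is a minimal sketch (assuming scikit-learn and a synthetic sine-wave dataset invented for demonstration) in which adding expressive polynomial features and lowering the regularization alpha turns an underfitted model into a better-fitting one:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, (300, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, 300)
X_train, y_train, X_test, y_test = X[:200], y[:200], X[200:], y[200:]

# Underfit: a plain linear model with heavy regularization.
underfit = Ridge(alpha=100.0).fit(X_train, y_train)
# Better: polynomial features (more complexity) plus a much lighter alpha.
better = make_pipeline(PolynomialFeatures(degree=7),
                       Ridge(alpha=0.1)).fit(X_train, y_train)

for name, m in [("underfit", underfit), ("better", better)]:
    print(name,
          mean_squared_error(y_train, m.predict(X_train)),
          mean_squared_error(y_test, m.predict(X_test)))
# The underfitted model shows high error on both splits; the more expressive,
# lightly regularized model lowers both.
```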

Conclusion

Overfitting and underfitting are two of the most persistent challenges in developing AI models; fortunately, there is increasing awareness of effective strategies for solving these issues.

First and foremost, with the size and quality of a model’s training data being so crucial to its predictive capabilities, AI researchers and vendors are making a stronger effort to ensure their data is well sourced and cleaned. With progress in the application of AI to various tasks, it has become easier to obtain higher-quality datasets – this already goes a considerable way towards mitigating overfitting and underfitting.

Additionally, as model monitoring tools become more sophisticated, it becomes easier to diagnose instances of overfitting or underfitting and pinpoint the most effective method of addressing its cause. This reduces the amount of trial and error involved in finding the optimal balance between bias and variance, and expedites the process of developing accurate and more performant AI models. 

Kartik Talamadupula
Director of AI Research

Kartik Talamadupula is a research scientist who has spent over a decade applying AI techniques to business problems in automation, human-AI collaboration, and NLP.