Machine learning (ML) seeks to understand the behavior of variables by revealing patterns in data. Quantitative researchers take these patterns and develop models to make predictions. Unfortunately for developers, patterns (or trends) can break. A break in a pattern is something out of the ordinary—an anomaly. To identify anomalies in data, you use an ML process called anomaly detection.
An outlier is a value in a random sample or collection of observations that is abnormally far from other values. The distance that’s classified as abnormal depends on the dataset and use case. Not all outliers are anomalies, but to identify anomalies in a dataset, you need to find the outliers.
Outliers fall into three broad categories:
- Global or point outliers: An unusual event that differs drastically from all other data points. For example, imagine an ML model trained to predict a house’s price based on factors like size and location. A global outlier here might be a house that’s significantly larger or smaller than the other houses in the dataset or one located in an unusual location compared to the other houses.
- Contextual or conditional outliers: An unusual event within a specific context. Consider an ML model trained to predict the likelihood of a customer moving to a competitor’s product based on age, income, or other factors. In this case, a contextual outlier might be a young customer with a high income who is significantly different from other young customers in the dataset. This outlier may not be unusual when considered in the context of the entire dataset, but it would be unusual when considered only within the subset of young customers.
- Collective outliers: A unique situation where one anomaly is not alarming, but a collection of such anomalies is. Imagine an ML model trained to predict the success of a marketing campaign based on such factors as the target audience’s demographics and the type of campaign. Here, a collective outlier might be a group of audience segments with similar demographics targeted with the same type of campaign but with a significantly lower response rate than other comparable audience segments.
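To make the difference between a global and a contextual outlier concrete, the following sketch scores a small, made-up customer table twice: once against the whole dataset and once within each age group. The data, the column names, and the z-score cutoff of 2 are illustrative assumptions and are not part of the inflation dataset used later in this tutorial.

```python
import pandas as pd

# Hypothetical customer data: incomes in thousands, grouped by age band.
customers = pd.DataFrame({
    "age_group": ["young"] * 7 + ["older"] * 7,
    "income": [30, 32, 35, 31, 33, 34, 90,
               80, 85, 90, 95, 88, 92, 86],
})

THRESHOLD = 2.0  # assumed cutoff on the absolute z-score

# Global view: compare each income with the whole dataset.
global_z = (customers["income"] - customers["income"].mean()) / customers["income"].std()
customers["global_outlier"] = global_z.abs() > THRESHOLD

# Contextual view: compare each income only with customers in the same age group.
group_z = customers.groupby("age_group")["income"].transform(
    lambda s: (s - s.mean()) / s.std()
)
customers["contextual_outlier"] = group_z.abs() > THRESHOLD

print(customers[customers["contextual_outlier"]])
```

In this toy table, the young customer earning 90 is unremarkable against the full dataset but stands out within the young group, which is the pattern that defines a contextual outlier.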
You can use anomaly detection algorithms to help prevent worst-case scenarios by quickly and accurately identifying unusual data points or patterns in a dataset. There are different anomaly detection algorithms, including supervised, unsupervised, and semi-supervised.
- Supervised algorithms are trained using labeled data and can accurately identify anomalous data points based on their deviation from the normal data.
- Unsupervised algorithms use statistical techniques to identify unusual data points based on their deviation from the overall data distribution.
- Semi-supervised algorithms combine supervised and unsupervised learning elements and are trained using a mix of labeled and unlabeled data.
This hands-on article will show you how to implement anomaly detection in Python.
To follow along with this tutorial, you’ll need:
- Access to the complete notebook to recreate the graphs and models from scratch
- NumPy 1.22 or later installed — for details on NumPy and the percentile method, refer to the official documentation
- Python 3.10.8
- Pandas 1.5 or higher
- Matplotlib 3 or higher
- Seaborn 0.12 or higher
- Scikit-learn 1.1 or higher
Implementing Anomaly Detection in Python
Anomaly detection has become integral to any data analysis project, providing critical and actionable information in various application domains. It requires high detection accuracy and speed to avoid potentially catastrophic errors. This tutorial looks at the economic phenomenon of inflation to illustrate this.
Using aggregated inflation data from the International Monetary Fund (IMF) (dataset and documentation on the World Bank website), you’ll apply benchmark models to see how they perform on it, identify what an outlier means in this scenario, and visualize these outliers. Finally, you’ll compare the predictions of two ML anomaly detection algorithms: Local Outlier Factor and Isolation Forest.
First, download the file in your preferred format and open it in a code editor. World Bank data often prepends a brief textual description of the dataset, so we recommend opening the file in a text editor first and manually deleting the first four rows to prevent issues when processing it with Python.
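If you'd rather not edit the file by hand, a possible alternative is to let pandas skip the leading rows with the `skiprows` parameter of `read_csv`. The sketch below uses a tiny in-memory stand-in for the download; its contents, including the four metadata rows, are hypothetical, so check how many rows your own file needs to skip.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded CSV: four metadata rows
# come before the real header row.
raw_csv = (
    '"Data Source","World Development Indicators"\n'
    '"Last Updated Date","2022-12-22"\n'
    '"Note","hypothetical metadata row"\n'
    '"Note","another hypothetical metadata row"\n'
    '"Country Name","Country Code","1960","1961"\n'
    '"Aruba","ABW",1.2,2.3\n'
    '"Chile","CHL",3.4,4.5\n'
)

# skiprows=4 ignores the first four lines, so the fifth line becomes the header.
data = pd.read_csv(io.StringIO(raw_csv), skiprows=4)

# Median of every numeric (year) column across countries.
yearly_median = data.select_dtypes(include="number").median()
print(yearly_median)
```

With a real file, you would pass the file path instead of the `io.StringIO` object.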
Note that the data isn’t aggregated by default. The details of the aggregation method that IMF economists used appear at the top right of the chart: their quantitative researchers took the median of all the data to compute the world’s average inflation rate. Because the downloaded file contains the unaggregated data, you’ll need to perform that transformation yourself.
To run the code snippet in a Jupyter Notebook, start by installing the pandas library. You can use the !pip install pandas command in the Jupyter Notebook or the pip install pandas command in a terminal or command prompt.
Then, open the Jupyter Notebook in your environment and create a new cell. Type or paste the code snippet below into the cell and run it by clicking the Run button or pressing Shift + Enter.
import numpy as np
import pandas as pd

data = pd.read_csv('inflation_world_historic.csv')

# compute the median value of all numeric columns in data
world_yearly_median = data.select_dtypes(include=np.number).median()
# keep every entry except the last one
world_median_data = pd.DataFrame(world_yearly_median[:-1])

Next, reshape your new DataFrame to a panel data form, with the year and its median value as columns. The resulting DataFrame should have 62 rows and two columns. To do this, run the following piece of code in a new cell in your Jupyter Notebook:

world_median_data.reset_index(inplace=True)
world_median_data.columns = ['year', 'median']

The first rows of the resulting DataFrame look like this:

    year    median
0   1960  1.945749
1   1961  2.102977
2   1962  2.669962
3   1963  2.876322
4   1964  3.328408
...  ...       ...
Exploratory Data Analysis
Now, you’ll use this dataset to see how the world’s median inflation rate has evolved since 1960. Before choosing a model to apply, it’s important to understand your data. The pandas describe() method is a handy command to compute measures of centrality (e.g., mean and median) and measures of spread (e.g., percentile and standard deviation) of the DataFrame columns. Append this method to your DataFrame name to create the following command:

world_median_data.describe()

The output is shown below.
          median
count  62.000000
mean    5.978251
std     3.562202
min     1.548692
25%     3.187885
50%     4.452406
75%     8.948548
max    17.104530
There’s a significant difference between the 75th percentile and the maximum values, hinting at the existence of outliers. This isn’t so obvious if you consider the lower bounds of the data because there isn’t a significant difference between the minimum and the 25th percentile.
To define an outlier, you need to set a benchmark for what you believe are anomalous data points. In this case, an outlier is greater than the 95th percentile or less than the 5th percentile.
Run the following code in a new code cell. This computes the 95th and 5th percentiles and creates a new column called colors that assigns an r (for red) to the median values that are outside those percentile values or a b (for blue) for all other values:
import numpy as np

percentile_95 = np.percentile(world_median_data['median'], 95, method='median_unbiased')
percentile_5 = np.percentile(world_median_data['median'], 5, method='median_unbiased')

# boolean masks (one value per row) for the points outside each threshold
greater_than_95 = world_median_data['median'] > percentile_95
smaller_than_5 = world_median_data['median'] < percentile_5

# create column named colors to store the color of each point based on the condition above
world_median_data['colors'] = np.where(greater_than_95, 'r', np.where(smaller_than_5, 'r', 'b'))
Now that you’ve computed the thresholds based on the specified percentiles, use them to visualize the outliers in your dataset. Do this in the same canvas as the inliers using the matplotlib.pyplot module. Run the following code in a new cell:
import matplotlib.pyplot as plt

world_median_data.plot.scatter(
    x='year', y='median',
    # translate the single-letter codes into Matplotlib color names
    c=world_median_data['colors'].apply(lambda x: dict(r='red', b='blue')[x]),
    figsize=(10, 5),
    title='Median World Inflation Rate',
    xlabel='Year', ylabel='Median Inflation Rate')
plt.xticks(rotation=65);
The output chart will look like this:
You’ve identified six outlying points that fall outside of the 5th to 95th percentile range. But this doesn’t mean that all six are anomalies. This visualization suggests that the dataset has both global and contextual outliers. The inflation rate in 1974 was considerably higher than in any other year, a characteristic of a global outlier. Additionally, the inflation rates of 2014 and 2015 differ from nearby points but not from the rest of the data, which is characteristic of contextual outliers.
To continue analyzing your economic data, you can use a linear regression model. A linear regression model is a statistical model used to predict the value of a continuous dependent variable based on the value of one or more independent variables.
First, ensure you have the scikit-learn module installed. Do this by running the following command in a notebook cell (or drop the leading ! to run it in your terminal):
!pip install scikit-learn
Next, create a new code cell in your Jupyter Notebook. There, import the LinearRegression class from the sklearn.linear_model module and create an instance of the LinearRegression class by assigning it to a variable, such as LinearRegression_model:
from sklearn.linear_model import LinearRegression

LinearRegression_model = LinearRegression()
Now, you’ll fit this linear regression model to your economic data. You already have a pandas DataFrame called world_median_data that contains two columns: year and median. To fit the model to this data, use the fit() method on the LinearRegression_model object and pass in the year column as the input variable and the median column as the output variable.
Run the following line in a new cell:

LinearRegression_model.fit(world_median_data[['year']], world_median_data['median'])
Next, use the LinearRegression_model object to predict the data’s median inflation rate for each year. You’ll use these predictions to update your chart, with a prediction line on top of the scatter plot, the year column as the x-values, and the predictions as the y-values. The line is red, and the x-axis tick labels are rotated for easier readability.
ax = world_median_data.plot.scatter(
    x='year', y='median',
    c=world_median_data['colors'].apply(lambda x: dict(r='red', b='blue')[x]),
    alpha=.6, figsize=(10, 5),
    title='Median World Inflation Rate',
    xlabel='Year', ylabel='Median Inflation Rate')
reg_prediction = LinearRegression_model.predict(world_median_data[['year']])
ax.plot(world_median_data['year'], reg_prediction, c='r')
plt.xticks(rotation=65);
The output visualization from the code will look like this:
You’ll see that there are numerous points far from the regression line. These points could potentially be considered outliers using distance-based methods, but the linear model fails to accurately capture the underlying pattern of the data. As a result, it’s difficult to precisely identify real anomalies using simple linear regression.
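As a rough illustration of the distance-based idea mentioned above, the sketch below flags points whose residual (vertical distance from the regression line) exceeds two standard deviations of all residuals. The synthetic series, the planted spike, and the two-standard-deviation cutoff are assumptions chosen for the example. On the real inflation data, where the linear fit is poor, the same rule would be much less reliable.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, perfectly linear series with one planted spike (assumed data).
years = np.arange(2000, 2020).reshape(-1, 1)
rates = 2.0 + 0.1 * (years.ravel() - 2000)
rates[7] += 5.0  # the year 2007 gets an artificial jump

model = LinearRegression().fit(years, rates)

# Residual = observed value minus the value predicted by the line.
residuals = rates - model.predict(years)

# Assumed rule of thumb: flag residuals larger than two standard deviations.
threshold = 2 * residuals.std()
flagged_years = years.ravel()[np.abs(residuals) > threshold]

print(flagged_years)
```

This only works when the line describes the bulk of the data well, which is why the tutorial moves on to dedicated anomaly detection algorithms.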
Anomaly detection algorithms can help address the challenge of dealing with outliers in data. Instead of adjusting the model or normalizing the data, which can result in the loss of information, anomaly detection algorithms can identify unusual data points that may affect the model’s performance. In addition to improving model performance, you can use them to identify patterns and trends that may not be immediately apparent.
Since you don’t have labeled data to specify which inflation rates are outliers, you’ll use unsupervised ML anomaly detection algorithms to analyze the dataset. Specifically, you’ll compare the performance of two common unsupervised algorithms: Local Outlier Factor and Isolation Forest. Unsupervised algorithms are well-suited to this type of dataset because they don’t require prior knowledge or labels to identify patterns and anomalies in the data.
Local Outlier Factor
Local Outlier Factor (LOF) was introduced as a new approach to outlier detection in 2000 by Breunig and colleagues. Previously, anomaly detection focused on binary classification tasks, where a classifier is trained to detect whether something is an outlier based on the sample data.
This approach had shortcomings, which the new approach overcame by assigning a value that denotes the degree to which something is an outlier: the LOF. The key innovation of this method is the locality feature. Data points are compared to a local neighborhood, rather than the whole dataset, to assess how isolated they are. The neighborhood of each point is defined by its k nearest neighbors, and the comparison is based on an estimate of local density.
The LOF algorithm identifies unusual data points in a dataset, comparing the density of a point in a dataset with the densities of its neighbors. Then, it assigns an “anomaly score” to each point based on this comparison.
One key parameter in the LOF algorithm is the value of k, which determines the number of nearest neighbors used to compute the k-distance and k-neighborhood. A larger value of k will result in a more conservative estimate of local density, while a smaller value may be more sensitive to local deviations from the mean density. Choosing an appropriate value for k is important for achieving sufficient performance with the LOF algorithm.
In addition to k, the LOF algorithm uses a distance function (such as Euclidean or Manhattan distance) to calculate the reachability distance. Different distance functions can have different properties, and choosing an appropriate distance function is important for achieving good performance with the LOF algorithm.
To compute the LOF anomaly score, you divide the average local reachability density (LRD) of the sample’s k-neighbors by the LRD of the sample itself. The LRD of a point is the inverse of its average reachability distance to its k-neighbors. Normal observations will have a score close to 1, while points with higher values will be considered more “outlier-like.” It’s important to understand how these features work and how the anomaly score is calculated so you can use the LOF algorithm effectively.
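The calculation can be sketched from scratch with NumPy. This is a naive illustration for one-dimensional data, not scikit-learn's implementation: the function name `local_outlier_factor`, the sample values, and k=3 are assumptions, and the code assumes the data contains no duplicate values.

```python
import numpy as np

def local_outlier_factor(values, k):
    """Return the LOF score of each value in a 1-D array (naive O(n^2) sketch)."""
    x = np.asarray(values, dtype=float)
    dist = np.abs(x[:, None] - x[None, :])        # pairwise distances
    np.fill_diagonal(dist, np.inf)                # a point is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]   # indices of the k nearest neighbors
    d_to_neighbors = np.take_along_axis(dist, neighbors, axis=1)
    k_distance = d_to_neighbors[:, -1]            # distance to the k-th neighbor

    # reach_dist(p, o) = max(k_distance(o), d(p, o)) for each neighbor o of p
    reach_dist = np.maximum(k_distance[neighbors], d_to_neighbors)

    # Local reachability density: inverse of the mean reachability distance.
    lrd = 1.0 / reach_dist.mean(axis=1)

    # LOF: average LRD of the neighbors divided by the point's own LRD.
    return lrd[neighbors].mean(axis=1) / lrd

# Assumed sample: a cluster of moderate rates plus one extreme value.
rates = [1.9, 2.1, 2.7, 2.9, 3.3, 3.6, 4.5, 4.8, 17.1]
lof = local_outlier_factor(rates, k=3)
print(np.round(lof, 2))
```

The extreme value receives a score far above 1 because its neighbors sit in a much denser region than it does, while the clustered values stay near 1.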
It’s also worth noting that the LOF algorithm is sensitive to noise and can be influenced by large, dense clusters in the data. Additionally, the LOF algorithm can be computationally expensive, especially for large datasets, because a naive implementation calculates distances between all pairs of points. Despite these limitations, the LOF algorithm has been widely applied to anomaly detection tasks, including outlier detection in high-dimensional data.
To implement this algorithm in Python, you’ll use the neighbors module from scikit-learn. First, instantiate a LOF model with the default parameters to benchmark its performance by running this code in a new code cell:
from sklearn.neighbors import LocalOutlierFactor

model_LOF = LocalOutlierFactor()

Next, fit the model to your data, compute the anomaly score for each data point, and predict whether each year’s inflation rate is an anomaly (-1) or not (1). Use the following code to fit, predict, and store the computed scores from your model on your dataset:

LOF_predictions = model_LOF.fit_predict(world_median_data[['median']])
model_LOF_scores = model_LOF.negative_outlier_factor_
world_median_data['LOF_anomaly_scores'] = model_LOF_scores
world_median_data['LOF_anomaly'] = LOF_predictions

Use this new dataset to plot the detected anomalies by running the following code in a new code cell:

# on Matplotlib 3.6 and later, this style is named "seaborn-v0_8"
plt.style.use("seaborn")
fig, ax = plt.subplots(1, figsize=(9, 5), sharex=True, sharey=False)
ax.scatter(world_median_data['year'], world_median_data['median'],
           c=world_median_data['LOF_anomaly'], cmap='RdBu', alpha=0.5)
ax.set_title("Local Outlier Factor Anomaly Detection")

# label each detected anomaly with its year
anomalies = world_median_data[world_median_data['LOF_anomaly'] == -1]
for year, value in zip(anomalies['year'], anomalies['median']):
    ax.annotate(year, xy=(year, value), xytext=(year, value + 0.2))

ax.get_xaxis().set_visible(False)
Your output should look like this:
By default, the scikit-learn LOF model only identifies 1974 and 1980 as outliers. This scenario illustrates one of the major drawbacks of this algorithm: It’s highly sensitive to the parameter values.
The LocalOutlierFactor model in scikit-learn uses a default value of k=20 for the number of nearest neighbors and the Euclidean distance function to calculate the reachability distance. Small changes in these parameter values can significantly affect the results, potentially leading to the identification of different sets of outliers. This can make it difficult to achieve consistent results and accurately interpret the output of the LOF model.
To illustrate how sensitive the LOF algorithm is to parameter values, try using a different value for the number of nearest neighbors. In the code below, you’ll create a new instance of the LocalOutlierFactor model with a value of n_neighbors=10. Run this code in a new code cell in your Jupyter Notebook:
model_LOF_10 = LocalOutlierFactor(n_neighbors=10)
By using a value of n_neighbors=10, you’re instructing the model to consider the 10 nearest neighbors of each point when calculating the anomaly scores. This may produce different results than the default model with a value of n_neighbors=20. This shows you how changing the parameter values can affect the results of the LOF algorithm.
To compare the results of the LOF model with different parameter values, you can rerun the previous code to plot the anomalies detected using Seaborn. This lets you visually compare how and if predictions differ between the two models. And it can help you understand how changing the parameter values can affect the performance of the LOF algorithm.
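Before plotting, it can help to compare the models numerically. The sketch below fits LOF with several values of n_neighbors and counts how many points each model flags. It runs on synthetic data (a normal cluster plus four planted extreme values) rather than the inflation dataset, and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic stand-in for the 'median' column: 55 typical values plus 4 extremes.
rng = np.random.default_rng(42)
values = np.concatenate(
    [rng.normal(4.0, 1.0, 55), [12.0, 14.0, 15.0, 17.0]]
).reshape(-1, 1)

flag_counts = {}
all_labels = {}
for k in (20, 10, 5):
    model = LocalOutlierFactor(n_neighbors=k)
    labels = model.fit_predict(values)   # -1 marks an anomaly, 1 an inlier
    all_labels[k] = labels
    flag_counts[k] = int((labels == -1).sum())
    print(f"n_neighbors={k}: {flag_counts[k]} points flagged")
```

To apply the same comparison to the tutorial's data, you would pass world_median_data[['median']] in place of the synthetic array.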
Your result should look like this:
As you can see, the LocalOutlierFactor model with n_neighbors=10 used the world median inflation rate to flag the high inflation levels of the 1970s and early 1980s. This is an encouraging result, as the previous economic analysis had also identified these years as outliers.
However, in contrast to the previous analysis, neither model flags 2015 and 2016 as anomalies. This discrepancy highlights the importance of carefully analyzing the results of different models and considering multiple approaches when performing economic analysis. While it’s possible to fine-tune the parameters of the LocalOutlierFactor model to improve its performance, this can be time-consuming.
Fortunately, recent advances in the field of automated machine learning (AutoML) now permit the automatic tuning of hyperparameters in models like the LocalOutlierFactor, as demonstrated by Xu and colleagues. This technology could streamline identifying and addressing anomalies in economic data.
Isolation Forest
The Isolation Forest (iForest) algorithm is a tree-based model designed to address the drawbacks of density-based approaches like LOF. iForest builds an ensemble of isolation trees, or iTrees, for a given dataset. Anomalies are identified as instances with shorter average path lengths on the iTrees, as they’re “few and different” and, therefore, more susceptible to isolation than normal points.
In isolation trees, normal points are typically isolated towards the tree’s leaf nodes, otherwise known as the deeper end of the tree. In contrast, anomalous points are isolated closer to the tree’s root. This is because normal points are more similar to other points in their subgroup and require more splits to be fully isolated, while anomalous points are less similar and can be isolated with fewer splits.
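The intuition can be checked with a small simulation. The sketch below is a simplified, one-dimensional stand-in for an isolation tree, not scikit-learn's implementation: it repeatedly picks a random split point and keeps the side that contains the target value, counting the splits needed until the target is alone. The function names, the sample values, and the number of repetitions are assumptions made for the illustration.

```python
import numpy as np

def isolation_depth(values, target, rng):
    """Count the random splits needed to isolate `target` from `values`."""
    subset = np.asarray(values, dtype=float)
    depth = 0
    # Stop once the target is alone (or only identical values remain).
    while len(subset) > 1 and subset.min() < subset.max():
        split = rng.uniform(subset.min(), subset.max())
        # Keep only the side of the split that contains the target.
        subset = subset[subset < split] if target < split else subset[subset >= split]
        depth += 1
    return depth

# Assumed sample: a cluster of moderate rates plus one extreme value.
rates = [1.9, 2.1, 2.7, 2.9, 3.3, 3.6, 4.5, 4.8, 17.1]
rng = np.random.default_rng(0)

def average_depth(target, n_trees=500):
    """Average the isolation depth of `target` over many random trees."""
    return np.mean([isolation_depth(rates, target, rng) for _ in range(n_trees)])

depth_outlier = average_depth(17.1)
depth_inlier = average_depth(3.3)
print(f"extreme value: {depth_outlier:.2f} splits, typical value: {depth_inlier:.2f} splits")
```

The extreme value is usually cut off by the very first split, while a value inside the cluster needs several more, which is the difference in path length that iForest turns into an anomaly score.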
Benefits of iForest
One key advantage of iForest is its speed. Because it doesn’t need to compute distances between points, iForest can often identify anomalies more efficiently than LOF, which is particularly important when working with large datasets or identifying anomalies in real time. iForest can also be more accurate in identifying anomalies, particularly in datasets with high dimensionality or complex relationships between features.
Another benefit of iForest is its scalability. It can handle large datasets with several features, which can be a challenge for density-based approaches like LOF. This makes iForest a good choice for datasets with a large number of dimensions or for applications where it’s necessary to analyze data from multiple sources.
In addition, iForest is relatively simple to implement and does not require the tuning of many hyperparameters. This can make it easier to use and more practical for users with limited expertise in machine learning.
Overall, iForest is a powerful tool for identifying anomalies in a wide range of datasets. While it may not always be the best choice for every situation, it is worth considering for its improved performance, accuracy, and scalability compared to density-based approaches like LOF.
To implement this algorithm in Python, use the scikit-learn library’s ensemble module. First, instantiate an IsolationForest model with the default parameters in a new code cell to benchmark its performance:
from sklearn.ensemble import IsolationForest model_IF = IsolationForest()
Next, fit the model to your data sample and make predictions. As in LOF, a value of -1 indicates an anomaly. One of the key components of the iForest model is the decision function. In scikit-learn, decision_function() returns an anomaly score for each data point that is derived from the point’s average path length across the trees and shifted so that negative scores correspond to anomalies. The model uses these scores to predict whether a given data point is an anomaly in the sample.
The decision function is an important aspect of the iForest model because it plays a key role in determining the algorithm’s accuracy and effectiveness. Because the scores are based on the average path length of each data point, they allow the iForest model to identify anomalies that may not be apparent using other methods. This makes iForest a powerful tool for identifying unusual points or patterns in data, and it has been widely used in a variety of applications including fraud detection, network intrusion detection, and anomaly detection in time-series data.
Run this code in a new code cell to fit, predict, and add the scores computed to the original DataFrame:
model_IF.fit(world_median_data[['median']])
world_median_data['IF_anomaly_scores'] = model_IF.decision_function(world_median_data[['median']])
world_median_data['IF_anomaly'] = model_IF.predict(world_median_data[['median']])
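To see how the scores and the predicted labels relate, the sketch below fits an IsolationForest on synthetic data (an assumed normal cluster plus two planted extreme values) and checks that the points labeled -1 are the ones with a negative decision_function score. The random_state value is only there to make the example repeatable.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the 'median' column: 60 typical values plus 2 extremes.
rng = np.random.default_rng(0)
values = np.append(rng.normal(4.0, 1.0, 60), [15.0, 17.0]).reshape(-1, 1)

forest = IsolationForest(random_state=0).fit(values)
scores = forest.decision_function(values)  # negative scores indicate anomalies
labels = forest.predict(values)            # -1 for anomalies, 1 for inliers

# The labels are a thresholded view of the scores.
labels_match_scores = np.array_equal(labels == -1, scores < 0)
print("labels follow the sign of the scores:", labels_match_scores)
print("score of the most extreme value:", round(float(scores[-1]), 3))
```

Keeping the raw scores alongside the labels, as the tutorial's DataFrame does, lets you rank points by how anomalous they are instead of relying only on the binary prediction.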
Next, plot the anomalies predicted by model_IF by rerunning the code you used to plot the anomalies detected by LOF using Seaborn. Make sure to only update the name of the models so that the graphs are comparable:
The default parameters of IsolationForest predicted many more outliers than your economic analysis led you to expect. One important parameter of the model is “contamination,” which represents the proportion of outliers in the dataset. Liu and colleagues suggest a contamination level of 5% in their paper. This value is used to define the threshold of the sample scores when the model is fit to the data.
Tuning model parameters is a common ML practice, as it can help improve model performance. In the case of IsolationForest, experimenting with different contamination levels can help you identify the value that produces the most accurate outlier predictions. It’s important to note that the optimal contamination level may vary depending on the specific characteristics of your dataset. So, it’s a good idea to try a range of values and evaluate the model’s performance to determine the best value for a given dataset.
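One way to run such an experiment is to loop over a few contamination levels and count how many points each model flags. The sketch below does this on synthetic data; the candidate levels, the planted extreme values, and the fixed random_state are assumptions for illustration, and the counts on the inflation dataset will differ.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the 'median' column: 95 typical values plus 5 extremes.
rng = np.random.default_rng(1)
values = np.append(
    rng.normal(4.0, 1.0, 95), [12.0, 13.5, 15.0, 16.0, 17.0]
).reshape(-1, 1)

flagged = {}
for contamination in (0.02, 0.05, 0.10):
    # The same random_state keeps the trees identical, so only the threshold moves.
    forest = IsolationForest(contamination=contamination, random_state=0)
    labels = forest.fit_predict(values)
    flagged[contamination] = int((labels == -1).sum())
    print(f"contamination={contamination}: {flagged[contamination]} points flagged")
```

Because contamination sets the score threshold, raising it can only keep or increase the number of flagged points for a given set of trees. The useful question is which level matches what you know about the data.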
In the spirit of experimentation, instantiate a new iForest with a 5% contamination level and compare the results. To do this, run the following code in a new code cell:
model_IF_05 = IsolationForest(contamination=0.05)
Your output will be:
The results of model_IF_05 are significantly better than those output by the default iForest model. This model recognizes four anomalies and agrees with the second LOF model on three out of four. One interesting finding is that 2015 is flagged as an anomaly — a possible mistake of the LOF model.
The iForest algorithm has several significant benefits for anomaly detection. One of its main advantages is that it’s especially effective at detecting anomalies in small samples, high-dimensional data, and noisy data. The model can identify anomalies by isolating individual observations and separating them from most of the data.
The ensemble architecture underlying the model also has several advantages. Ensemble models can improve the accuracy and robustness of predictions by aggregating the predictions of multiple models. In the case of iForest, the ensemble is composed of multiple isolation trees trained on random subsets of the data. This allows iForest to capture a diverse range of patterns and features in the data, which can improve the model’s ability to identify anomalies.
This article introduced two unsupervised machine learning algorithms, available in Python, that offer advanced techniques for identifying anomalies in data: LOF and iForest. These algorithms are easy to implement and evaluate with Python’s extensive collection of libraries and tools for data analysis and machine learning.
Remember to clearly define what constitutes an outlier in the context of your project and experiment with multiple methods to find the best fit for your data. Anomaly detection is an important stage in any data pipeline, and Python makes it a straightforward and valuable process. Additionally, Python’s versatility and accessibility, along with the support of a strong community of developers and users, make it a powerful and convenient choice for implementing anomaly detection algorithms.