If you like Skforecast , help us giving a star on GitHub! ⭐️
More about forecasting
Anomaly detection involves the identification of data points, patterns or events that deviate significantly from expected behaviour. In the context of time series data, these anomalies, often referred to as outliers, can arise from various sources such as changes in the underlying data generating process, measurement errors or unexpected external events. Detecting anomalies in time series is particularly challenging due to the inherent characteristics of these data, including time dependencies, trends, seasonality and noise. These factors can obscure the distinction between true anomalies and normal variation.
Despite these challenges, anomaly detection in time series is critical. Undetected anomalies can significantly affect the performance of forecasting models by biasing their predictions and compromising their accuracy. Therefore, identifying and carefully analysing anomalies is essential to maintaining model reliability and improving the quality of insights derived from the data.
This article explores a specific approach to anomaly detection in time series: measuring the residuals of a forecasting model. Residuals represent the differences between observed values and the corresponding model predictions, and provide a valuable perspective for identifying anomalies that the model is struggling to accurately predict.
✎ Note
There are numerous approaches to anomaly detection in time series, including statistical methods, seasonal-trend decomposition, machine learning techniques like classification and clustering, as well as autoencoders, among others. Each method comes with its own strengths and limitations, and the choice of approach should align with the specific characteristics of the data and the objectives of the analysis.The libraries used in this document are:
# Data processing
# ==============================================================================
import numpy as np
import pandas as pd
from skforecast.datasets import fetch_dataset
# Plots
# ==============================================================================
import matplotlib.pyplot as plt
from skforecast.plot import set_dark_theme
# Modelling and Forecasting
# ==============================================================================
import skforecast
import sklearn
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from feature_engine.datetime import DatetimeFeatures
from feature_engine.creation import CyclicalFeatures
from skforecast.recursive import ForecasterRecursive
from skforecast.model_selection import TimeSeriesFold
from skforecast.model_selection import backtesting_forecaster
from skforecast.preprocessing import RollingFeatures
# Warnings configuration
# ==============================================================================
import warnings
warnings.filterwarnings('once')
print('Versión skforecast:', skforecast.__version__)
print('Versión sklearn:', sklearn.__version__)
# Download data
# ==============================================================================
data = fetch_dataset("ett_m1")
print(f"Data shape: {data.shape}")
data.head(3)
# Select data
# ==============================================================================
data = data.loc["20170101":, ["OT"]]
Few synthetic anomalies are added to the time series data to illustrate the detection process.
# Synthetic anomalies
# ==============================================================================
rng = np.random.default_rng(4631789)
n_anomalies = 20
anomalies_loc = data.index[rng.uniform(low=0, high=len(data), size=n_anomalies).astype(int)]
data.loc[anomalies_loc, "OT"] = data.loc[anomalies_loc, "OT"] * rng.choice([0.5, 2], size=n_anomalies)
# Plot data
# ==============================================================================
set_dark_theme()
fig, ax = plt.subplots(figsize=(7, 3))
ax.plot(data["OT"])
ax.scatter(anomalies_loc, data.loc[anomalies_loc, "OT"], color='orange', label='Anomalies', zorder=2)
ax.set_title('OT')
ax.legend();
# Data partition: plit train-test
# ==============================================================================
end_train = '2018-04-03 23:59:00'
data_train = data.loc[: end_train, :]
data_test = data.loc[end_train:, :]
anomalies_loc_train = [date for date in anomalies_loc if date <= pd.to_datetime(end_train)]
anomalies_loc_test = [date for date in anomalies_loc if date > pd.to_datetime(end_train)]
print(f"Dates train : {data_train.index.min()} --- {data_train.index.max()} (n={len(data_train)})")
print(f" Anomalies train: {len(anomalies_loc_train)}")
print(f"Dates test : {data_test.index.min()} --- {data_test.index.max()} (n={len(data_test)})")
print(f" Anomalies test : {len(anomalies_loc_test)}")
# Plot partitions
# ==============================================================================
fig, ax = plt.subplots(figsize=(7, 3))
ax.plot(data_train['OT'], label='Train')
ax.plot(data_test['OT'], label='Test')
ax.scatter(anomalies_loc, data.loc[anomalies_loc, "OT"], color='orange', label='Anomalies', zorder=3)
ax.set_title('OT')
ax.legend();
Additional exogenous features are created based on:
Lagged values of the exogenous series.
Rolling statistics of the exogenous series.
Calendar features.
# Calendar features
# ==============================================================================
features_to_extract = [
'year',
'month',
'week',
'day_of_week',
'hour'
]
calendar_transformer = DatetimeFeatures(
variables = 'index',
features_to_extract = features_to_extract,
drop_original = False,
)
# Cliclical encoding of calendar features
# ==============================================================================
features_to_encode = [
"month",
"week",
"day_of_week",
"hour",
]
max_values = {
"month": 12,
"week": 52,
"day_of_week": 7,
"hour": 24,
}
cyclical_encoder = CyclicalFeatures(
variables = features_to_encode,
max_values = max_values,
drop_original = True
)
exog_transformer = make_pipeline(
calendar_transformer,
cyclical_encoder
)
display(exog_transformer)
data = exog_transformer.fit_transform(data)
# Remove rows with NaNs created by lag features
data = data.dropna()
exog_features = data.columns.difference(['OT']).tolist()
display(data.head(3))
A forecaster is trained to allow its predictions to be compared with observed values. The more effectively the model learns the underlying patterns in the data, the more accurate its predictions become. This improved accuracy enhances confidence that deviations between observed and predicted values represent genuine anomalies.
# Create and forecaster
# ==============================================================================
window_features = RollingFeatures(stats=['mean', 'min', 'max'], window_sizes=24)
forecaster = ForecasterRecursive(
regressor = HistGradientBoostingRegressor(random_state=15926, loss='quantile', quantile=0.5),
lags = [1, 3, 4, 5, 6, 13, 14, 15, 16, 17],
differentiation = 1,
transformer_y = StandardScaler(),
window_features = window_features,
)
forecaster.fit(
y = data.loc[:end_train, 'OT'],
exog = data.loc[:end_train, exog_features]
)
forecaster
After the model is trained, predictions are generated for both the training set and the test set. This is achieved through a backtesting process. For this example it is assumed that 24 steps ahead are predicted in each iteration.
# Backtesting on training data
# ==============================================================================
cv = TimeSeriesFold(
initial_train_size = None, # This will use all data, assuming that the model has been trained
steps = 24, # All hours of next day
differentiation = 1,
)
metric_train, predictions_train = backtesting_forecaster(
forecaster = forecaster,
y = data.loc[:end_train, 'OT'],
exog = data.loc[:end_train, exog_features],
cv = cv,
metric = 'mean_absolute_error',
n_jobs = 'auto',
verbose = False,
show_progress = True
)
results_train = predictions_train.copy()
results_train['y_true'] = data.loc[:end_train, 'OT']
results_train['residual'] = results_train['y_true'] - results_train['pred']
results_train['residual_abs'] = np.abs(results_train['residual'])
results_train.head(3)
In the training set, there are a total of 16 anomalies. Next, the 16 largest residuals are identified, and the corresponding time points are extracted. These time points are then compared with the actual anomalies to determine the model's ability to detect them.
# Plot predictions vs real value and residuals on training data
# ==============================================================================
fig, axs = plt.subplots(
2, 1, figsize=(7, 5), gridspec_kw={"height_ratios": [2, 1]}, sharex=True
)
axs[0].plot(data.loc[:end_train, "OT"], label="Real value")
axs[0].plot(predictions_train["pred"], label="Prediction")
axs[0].scatter(
anomalies_loc_train,
data.loc[anomalies_loc_train, "OT"],
zorder=2,
color="orange",
label="Anomalies",
)
axs[0].set_title("Backtesting on train data")
axs[0].legend()
axs[1].plot(results_train["residual_abs"], label="Absolute residuals", zorder=2)
axs[1].set_title("Absolute residuals")
top_residuals_train = (
results_train.sort_values("residual_abs", ascending=False).head(len(anomalies_loc_train))
)
top_residuals_train["is_anomaly"] = top_residuals_train.index.isin(anomalies_loc_train)
axs[1].scatter(
anomalies_loc_train,
results_train.loc[anomalies_loc_train, "residual_abs"],
zorder=2,
color="orange",
label="Anomalies",
)
plt.tight_layout()
display(top_residuals_train)
top_residuals_train["is_anomaly"].value_counts()
print(f"Percentage of true positives: {100 * top_residuals_train['is_anomaly'].mean():.2f}%")
print(f"Percentage of false positives: {100 * (~top_residuals_train['is_anomaly']).mean():.2f}%")
Same process is repeated for the test set, where there are a total of 4 anomalies.
# Backtesting on test
# ==============================================================================
cv = TimeSeriesFold(
initial_train_size = len(data_train),
steps = 24, # all hours of next day
refit = False,
differentiation = 1,
)
metric_test, predictions_test = backtesting_forecaster(
forecaster = forecaster,
y = data.loc[:, 'OT'],
exog = data.loc[:, exog_features],
cv = cv,
metric = 'mean_absolute_error',
n_jobs = 'auto',
verbose = False,
show_progress = True
)
results_test = predictions_test.copy()
results_test['y_true'] = data.loc[end_train:, 'OT']
results_test['residual'] = results_test['y_true'] - results_test['pred']
results_test['residual_abs'] = np.abs(results_test['residual'])
results_test.head(3)
# Plot predictions vs real value and residuals on test data
# ==============================================================================
fig, axs = plt.subplots(
2, 1, figsize=(8, 5), gridspec_kw={"height_ratios": [2, 1]}, sharex=True
)
axs[0].plot(data.loc[end_train:, "OT"], label="Real value")
axs[0].plot(results_test["pred"], label="Prediction")
axs[0].scatter(
anomalies_loc_test,
data.loc[anomalies_loc_test, "OT"],
color="orange",
zorder=2,
label="Anomalies",
)
axs[0].set_title("Backtesting on test data")
axs[0].legend()
axs[1].plot(results_test["residual_abs"], label="Absolute residuals")
axs[1].set_title("Absolute residuals")
axs[1].scatter(
anomalies_loc_test,
results_test.loc[anomalies_loc_test, "residual_abs"],
color="orange",
zorder=2,
label="Anomalies",
)
plt.tight_layout()
top_residuals_test = results_test.sort_values('residual_abs', ascending=False).head(len(anomalies_loc_test))
top_residuals_test['is_anomaly'] = top_residuals_test.index.isin(anomalies_loc_test)
display(top_residuals_test)
top_residuals_test['is_anomaly'].value_counts()
In both the training and test sets, the observations with the largest residuals are true anomalies. However, when selecting as many top residuals as there are anomalies in the data, only 50% of these are true anomalies, the rest being false positives.
This approach has a significant limitation: anomalies in the data can distort predictions if they are included in the lags used by the model. This often results in as many false positives as true anomalies, as the accuracy of the model is compromised by the anomalies themselves. One way to mitigate this problem is to iteratively remove/replace the detected anomalies from the data and retrain the model. This process can be repeated until the model no longer detects anomalies, at which point the remaining anomalies are likely to be real.
⚠ Warning
Removing anomalies should be done with caution. Since they have been identified because the model is unable to predict them accurately, removing them is likely to improve the model's performance. However, if the anomalies are not due to errors or measurement problems, removing them could result in the loss of valuable information. Furthermore, if similar anomalies are expected to occur in the future, the model will not be able to learn from them and its performance will be compromised.import session_info
session_info.show(html=False)
How to cite this document
If you use this document or any part of it, please acknowledge the source, thank you!
Time series anomaly detection by Joaquín Amat Rodrigo and Javier Escobar Ortiz, available under Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0 DEED) at https://cienciadedatos.net/documentos/py62-time-series-anomaly-detection.html
How to cite skforecast
If you use skforecast for a publication, we would appreciate it if you cite the published software.
Zenodo:
Amat Rodrigo, Joaquin, & Escobar Ortiz, Javier. (2024). skforecast (v0.14.0). Zenodo. https://doi.org/10.5281/zenodo.8382788
APA:
Amat Rodrigo, J., & Escobar Ortiz, J. (2024). skforecast (Version 0.14.0) [Computer software]. https://doi.org/10.5281/zenodo.8382788
BibTeX:
@software{skforecast, author = {Amat Rodrigo, Joaquin and Escobar Ortiz, Javier}, title = {skforecast}, version = {0.14.0}, month = {11}, year = {2024}, license = {BSD-3-Clause}, url = {https://skforecast.org/}, doi = {10.5281/zenodo.8382788} }
This work by Joaquín Amat Rodrigo and Javier Escobar Ortiz is licensed under a Attribution-NonCommercial-ShareAlike 4.0 International.
Allowed:
Share: copy and redistribute the material in any medium or format.
Adapt: remix, transform, and build upon the material.
Under the following terms:
Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
NonCommercial: You may not use the material for commercial purposes.
ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.