Data leakage, also known as information leakage, occurs when data outside the training data set inadvertently influences the model during the training process. This problem leads to overly optimistic performance estimates during model evaluation and hampers the model's ability to generalise to new data. In time series forecasting, data leakage often occurs when the model has access to future data points that are not available at the time of prediction.
Data leakage is particularly critical when working with pre-trained models, such as foundation models for time series, as users are typically not involved in the model training process. To prevent data leakage, most authors report the series used to pre-train the model. This allows users to ensure that the data they use to evaluate the model has not been seen during training. However, keeping the evaluation series out of the pre-training data is not the only requirement to prevent data leakage. For time series prediction, the model must also be prevented from accessing any data, from any series, that falls within or after the time period used for testing. If a series seen during training is highly correlated with the series used for testing, the model could inadvertently access information about the test period that it should not have.
The risk of this type of data leakage is particularly high when using models that are trained with thousands of time series, since it is likely that some of them will be highly correlated with the series the user wants to forecast.
To illustrate this phenomenon, an experiment is performed in which two models are trained on several highly correlated time series. Both models are tested on a time series that is excluded from their training sets. For one model, the test period has already been observed during training through the other series. For the other model, the test period is completely excluded from the training data. If the first model performs notably better, it indicates that it is using information it should not have access to, confirming the presence of data leakage.
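The idea can be sketched on synthetic data. The example below is a minimal illustration, not the experiment reported in this document: the series names, the level shift, the noise level and the least-squares model are all assumptions made for the sketch. Two series share a level shift that only occurs in the test period, and a time-dependent feature (standing in for a calendar feature such as the year) is the only predictor.

```python
import numpy as np

rng = np.random.default_rng(0)
n, split = 100, 80                      # 80 training steps, 20 test steps (assumed)
t = np.arange(n)
shift = (t >= split).astype(float)      # shared level shift, only in the test period

# Synthetic, highly correlated series (hypothetical values)
bus = 10 + 5 * shift + rng.normal(0, 0.1, n)    # available for training
metro = 10 + 5 * shift + rng.normal(0, 0.1, n)  # target, never used for training

# Design matrix: intercept plus the time-dependent feature
X = np.column_stack([np.ones(n), shift])


def fit_predict(X_train, y_train, X_new):
    """Fit ordinary least squares and predict for X_new."""
    coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    return X_new @ coef


# Model 1: trained only on the period before the test period
pred_clean = fit_predict(X[:split], bus[:split], X[split:])
# Model 2: trained on the full period of the other series (includes the test period)
pred_leaky = fit_predict(X, bus, X[split:])

mae_clean = np.abs(metro[split:] - pred_clean).mean()
mae_leaky = np.abs(metro[split:] - pred_leaky).mean()
print(f"MAE without leakage: {mae_clean:.3f}")
print(f"MAE with leakage   : {mae_leaky:.3f}")
```

In this sketch, the model fitted on the full period of the other series learns the effect of the shift and transfers it to the target series, even though the target itself was never used for training.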
To prevent data leakage and ensure a fair evaluation, the model must not have access to any data from the time period designated for testing.
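When the training series are available, this condition can be checked before evaluating a model. The helper below is a minimal sketch under that assumption: `series_overlapping_test` is a hypothetical function written for this illustration, not part of skforecast, and the toy dates and series names are made up.

```python
import pandas as pd


def series_overlapping_test(series_train: pd.DataFrame, test_start) -> list:
    """Return the names of training series observed at or after test_start."""
    test_start = pd.Timestamp(test_start)
    overlapping = []
    for name in series_train.columns:
        # Last date with an actual (non-missing) observation for this series
        last_obs = series_train[name].dropna().index.max()
        if pd.notna(last_obs) and last_obs >= test_start:
            overlapping.append(name)
    return overlapping


# Toy data: series_a runs into the test period, series_b stops before it
index = pd.date_range("2024-07-25", "2024-08-05", freq="D")
toy = pd.DataFrame({"series_a": 1.0, "series_b": 1.0}, index=index)
toy.loc["2024-08-01":, "series_b"] = float("nan")

leaking = series_overlapping_test(toy, test_start="2024-08-01")
print(leaking)
```

Any series returned by the check contains observations from the test period and would need to be truncated before training for the evaluation to be fair.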
The libraries used in this document are:
# Data manipulation
# ==============================================================================
import pandas as pd
from skforecast.datasets import fetch_dataset
# Plotting
# ==============================================================================
import matplotlib.pyplot as plt
from skforecast.plot import set_dark_theme
# Modeling
# ==============================================================================
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from skforecast.recursive import ForecasterRecursiveMultiSeries
# Configuration
# ==============================================================================
import warnings
warnings.filterwarnings("once")
The data used in this document contains the daily number of passengers of several public transport services in Madrid.
# Download data
# ==============================================================================
data = pd.read_csv(
"https://raw.githubusercontent.com/skforecast/skforecast-datasets/refs/heads/"
"main/data/public-transport-madrid.csv"
)
data["date"] = pd.to_datetime(data["date"])
data = data.set_index("date")
data = data.asfreq("D")
data = data.drop(columns=["total"])
data.head()
# Plot data
# ==============================================================================
set_dark_theme()
fig, ax = plt.subplots(figsize=(7, 3.5))
data.plot(ax=ax, legend=True)
ax.set_title('Users of public transport in Madrid')
ax.set_xlabel('')
ax.legend(loc='upper left');
# Correlation matrix between series
# ==============================================================================
correlation = data.corr()
correlation
Both the plot and the correlation matrix show that the series are highly correlated.
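With many series, reading the full correlation matrix becomes impractical, and it can help to list only the pairs above a threshold. The snippet below is a minimal sketch on toy data: the function name `correlated_pairs`, the 0.8 threshold and the toy values are assumptions chosen for illustration, not values taken from the Madrid data set.

```python
import pandas as pd


def correlated_pairs(df: pd.DataFrame, threshold: float = 0.8) -> list:
    """List (series_1, series_2, correlation) for pairs with |corr| >= threshold."""
    corr = df.corr()
    cols = list(corr.columns)
    pairs = []
    for i, first in enumerate(cols):
        for second in cols[i + 1:]:          # upper triangle only, no self-pairs
            r = float(corr.loc[first, second])
            if abs(r) >= threshold:
                pairs.append((first, second, r))
    return pairs


# Toy series: series_b is a linear function of series_a, series_c is unrelated
toy = pd.DataFrame({
    "series_a": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    "series_b": [1, 3, 5, 7, 9, 11, 13, 15, 17, 19],
    "series_c": [1, -1, 1, -1, 1, -1, 1, -1, 1, -1],
})

pairs = correlated_pairs(toy, threshold=0.8)
print(pairs)
```

Any series that appears in such a list next to the target series is a candidate source of indirect leakage if its test-period values are used for training.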
# Exogenous features from calendar
# ==============================================================================
exog = pd.DataFrame(index=data.index)
exog['day'] = exog.index.day
exog['month'] = exog.index.month
exog['quarter'] = exog.index.quarter
exog['dayofweek'] = exog.index.dayofweek
exog['dayofyear'] = exog.index.dayofyear
exog['year'] = exog.index.year
exog.head(4)
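Two forecasting models are trained using all the available series except metro. The forecasting performance of both models is then evaluated on the excluded series, metro, in two different scenarios: in the first, the model has seen neither the target series nor any other series during the test period; in the second, the model has not seen the target series but has seen the other series during the test period.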
# Split data into train-test
# ==============================================================================
end_train = '2024-07-31 23:59:00'
data_train = data.loc[:end_train, :].copy()
data_test = data.loc[end_train:, :].copy()
exog_train = exog.loc[:end_train, :].copy()
exog_test = exog.loc[end_train:, :].copy()
print(f"Train dates : {data_train.index.min()} --- {data_train.index.max()} (n={len(data_train)})")
print(f"Test dates : {data_test.index.min()} --- {data_test.index.max()} (n={len(data_test)})")
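Because label-based slicing with `.loc` includes both endpoints, it is worth confirming that the two partitions do not share any row. The sketch below uses a small toy frame with made-up dates and values (an assumption for illustration) to show that, on a daily index, a cut-off at 23:59 falls between two observations, so the train and test slices do not overlap.

```python
import pandas as pd

# Toy daily series around the cut-off date (hypothetical values)
index = pd.date_range("2024-07-29", "2024-08-03", freq="D")
toy = pd.DataFrame({"y": range(len(index))}, index=index)

end_train = "2024-07-31 23:59:00"
toy_train = toy.loc[:end_train]   # rows up to and including 2024-07-31
toy_test = toy.loc[end_train:]    # rows from 2024-08-01 onwards

# The partitions must be disjoint and must cover the whole data set
assert toy_train.index.max() < toy_test.index.min()
assert len(toy_train) + len(toy_test) == len(toy)
print(toy_train.index.max(), toy_test.index.min())
```

If the cut-off were written as a date that exists in the index, for example '2024-07-31', that day would appear in both slices, which is why the cut-off in this document includes a time component.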
Model that has not seen the target series or any other series in the test period
# Train forecaster
# ==============================================================================
lags = [1, 7, 14, 364]
forecaster_1 = ForecasterRecursiveMultiSeries(
regressor = Ridge(random_state=951),
lags = lags,
transformer_series = StandardScaler(),
encoding = None,
forecaster_id = 'forecaster_no_leakage'
)
forecaster_1.fit(series=data_train.drop(columns='metro'), exog=exog_train)
# Predictions
# ==============================================================================
predictions_1 = forecaster_1.predict(
steps = len(data_test),
last_window = data_train[['metro']],
exog = exog_test,
levels = ["metro"]
)
predictions_1 = predictions_1.rename(columns={'metro': 'pred_model_1'})
predictions_1.head(3)
Model that has not seen the target series but has seen the test period of the other series
# Train forecaster
# ==============================================================================
lags = [1, 7, 14, 364]
forecaster_2 = ForecasterRecursiveMultiSeries(
regressor = Ridge(random_state=951),
lags = lags,
transformer_series = StandardScaler(),
encoding = None,
forecaster_id = 'forecaster_with_leakage'
)
forecaster_2.fit(series=data.drop(columns='metro'), exog=exog)
# Predictions
# ==============================================================================
predictions_2 = forecaster_2.predict(
steps = len(data_test),
last_window = data_train[['metro']],
exog = exog_test,
levels = ["metro"]
)
predictions_2 = predictions_2.rename(columns={'metro': 'pred_model_2'})
predictions_2.head(3)
# Results
# ==============================================================================
results = pd.concat([data_test['metro'], predictions_1, predictions_2], axis=1)
fig, ax = plt.subplots(figsize=(7, 3.5))
data_test['metro'].plot(ax=ax, label='True')
results['pred_model_1'].plot(ax=ax, label='No leakage')
results['pred_model_2'].plot(ax=ax, label='With leakage')
ax.set_title('Users of public transport in Madrid')
ax.set_xlabel('')
ax.legend()
plt.show();
# Prediction error for each model
# ==============================================================================
mae_1 = (results['metro'] - results['pred_model_1']).abs().mean()
mae_2 = (results['metro'] - results['pred_model_2']).abs().mean()
improvement = (mae_1 - mae_2) / mae_1
print(f"MAE model {forecaster_1.forecaster_id}: {mae_1}")
print(f"MAE model {forecaster_2.forecaster_id}: {mae_2}")
print(f"Improvement: {100 * improvement:.2f}%")
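The two quantities computed above are the mean absolute error (MAE) of each model and the relative reduction in MAE of the second model with respect to the first. As a worked example with made-up numbers (the values and function names below are illustrative assumptions, not results from this experiment):

```python
def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_improvement(mae_reference, mae_candidate):
    """Fraction by which the candidate reduces the error of the reference."""
    return (mae_reference - mae_candidate) / mae_reference


# Hypothetical observations and predictions
y_true = [100.0, 120.0, 110.0]
pred_a = [90.0, 130.0, 100.0]    # every prediction is off by 10
pred_b = [95.0, 125.0, 105.0]    # every prediction is off by 5

mae_a = mean_absolute_error(y_true, pred_a)
mae_b = mean_absolute_error(y_true, pred_b)
improvement = relative_improvement(mae_a, mae_b)
print(f"MAE a: {mae_a}, MAE b: {mae_b}, improvement: {100 * improvement:.2f}%")
```

A positive improvement means the second set of predictions has a lower error than the first; in the leakage experiment, a large positive value indicates that the model exposed to the test period benefits from it.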
Although neither model has directly seen the target series, the second model has been exposed to the test period through the other series. Since some of these series are highly correlated with the target series, the model has indirectly accessed future information, enabling it to make better predictions than the first model.
This example illustrates how data leakage can occur even when the target series is not part of the training data. To prevent information leakage and ensure a fair evaluation, the model must not have access to any data from the time period designated for testing.
import session_info
session_info.show(html=False)
How to cite this document
If you use this document or any part of it, please acknowledge the source, thank you!
Data leakage in pre-trained forecasting models by Joaquín Amat Rodrigo and Javier Escobar Ortiz, available under Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0 DEED) at https://cienciadedatos.net/documentos/py63-data-leakage-pre-trained-forecasting-models.html
How to cite skforecast
If you use skforecast for a publication, we would appreciate it if you cite the published software.
Zenodo:
Amat Rodrigo, Joaquin, & Escobar Ortiz, Javier. (2024). skforecast (v0.14.0). Zenodo. https://doi.org/10.5281/zenodo.8382788
APA:
Amat Rodrigo, J., & Escobar Ortiz, J. (2024). skforecast (Version 0.14.0) [Computer software]. https://doi.org/10.5281/zenodo.8382788
BibTeX:
@software{skforecast,
  author  = {Amat Rodrigo, Joaquin and Escobar Ortiz, Javier},
  title   = {skforecast},
  version = {0.14.0},
  month   = {11},
  year    = {2024},
  license = {BSD-3-Clause},
  url     = {https://skforecast.org/},
  doi     = {10.5281/zenodo.8382788}
}
This work by Joaquín Amat Rodrigo and Javier Escobar Ortiz is licensed under an Attribution-NonCommercial-ShareAlike 4.0 International license.
Allowed:
Share: copy and redistribute the material in any medium or format.
Adapt: remix, transform, and build upon the material.
Under the following terms:
Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
NonCommercial: You may not use the material for commercial purposes.
ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.