Data leakage in pre-trained forecasting models

Joaquin Amat Rodrigo, Javier Escobar Ortiz
December, 2024

Introduction

Data leakage, also known as information leakage, occurs when data outside the training data set inadvertently influences the model during the training process. This problem leads to overly optimistic performance estimates during model evaluation and hampers the model's ability to generalise to new data. In time series forecasting, data leakage often occurs when the model has access to future data points that are not available at the time of prediction.

Data leakage is particularly critical when working with pre-trained models, such as foundational models for time series, because users are typically not involved in the model training process. To prevent data leakage, most authors report the series used to pre-train the model. This allows users to ensure that the data they use to evaluate the model has not been seen during training. However, keeping the evaluation series out of the pre-training set is not the only requirement. For time series forecasting, the model must also be prevented from seeing any data, from any series, that falls within the time frame used for testing. If a series seen during training is highly correlated with the series used for testing, the model could indirectly access information about the test period that it should not have.

The risk of this type of data leakage is particularly high when using models that are trained with thousands of time series, since it is likely that some of them will be highly correlated with the series the user wants to forecast.

To illustrate this phenomenon, an experiment is performed in which two models are trained on several highly correlated time series. Both models are then tested on a time series that is excluded from the training set. For one model, the training data of the other series cover the test period; for the other model, the test period is completely excluded from the training data. If the first model performs notably better, it is using information it should not have access to, confirming the presence of data leakage.

To prevent data leakage and ensure a fair evaluation, the model must not have access to any data from the time period designated for testing.
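
The requirement above can be turned into a simple pre-evaluation check. The following is a minimal sketch, assuming the last training timestamp of each pre-training series is known (for example, from the model documentation). The function name `find_temporal_leakage` and the dates are hypothetical and chosen only for illustration.

```python
import pandas as pd


def find_temporal_leakage(train_ends, test_start):
    """Return the names of training series whose last timestamp is not
    strictly before the start of the test period."""
    test_start = pd.Timestamp(test_start)
    return [
        name
        for name, end in train_ends.items()
        if pd.Timestamp(end) >= test_start
    ]


# Hypothetical last training timestamps of each pre-training series
train_ends = {
    "bus": "2024-07-31",
    "road": "2024-12-15",
    "train": "2024-07-31",
}

leaking = find_temporal_leakage(train_ends, test_start="2024-08-01")
print(leaking)
```

Any series returned by the check extends into the test period, so an evaluation that starts at `test_start` could be affected by leakage even if the target series itself was never used for training.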

Libraries

The libraries used in this document are:

In [1]:
# Data manipulation
# ==============================================================================
import pandas as pd
from skforecast.datasets import fetch_dataset

# Plotting
# ==============================================================================
import matplotlib.pyplot as plt
from skforecast.plot import set_dark_theme

# Modeling
# ==============================================================================
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from skforecast.recursive import ForecasterRecursiveMultiSeries

# Configuration
# ==============================================================================
import warnings
warnings.filterwarnings("once")

Data

The data used in this document contain the daily number of passengers of several public transport services in Madrid.

In [15]:
# Download data
# ==============================================================================
data = pd.read_csv(
    "https://raw.githubusercontent.com/skforecast/skforecast-datasets/refs/heads/"
    "main/data/public-transport-madrid.csv"
)
data["date"] = pd.to_datetime(data["date"])
data = data.set_index("date")
data = data.asfreq("D")
data = data.drop(columns=["total"])
data.head()
Out[15]:
              metro      bus    road   train
date
2023-01-01   685684   319488  155714  174991
2023-01-02  1581661  1024836  588003  446467
2023-01-03  1781186  1151845  662751  510268
2023-01-04  1846531  1160892  681347  517539
2023-01-05  1842966  1087828  615698  487856
In [3]:
# Plot data
# ==============================================================================
set_dark_theme()
fig, ax = plt.subplots(figsize=(7, 3.5))
data.plot(ax=ax, legend=True)
ax.set_title('Users of public transport in Madrid')
ax.set_xlabel('')
ax.legend(loc='upper left');
In [4]:
# Correlation matrix between series
# ==============================================================================
correlation = data.corr()
correlation
Out[4]:
          metro       bus      road     train
metro  1.000000  0.962036  0.963832  0.972725
bus    0.962036  1.000000  0.988552  0.982577
road   0.963832  0.988552  1.000000  0.980971
train  0.972725  0.982577  0.980971  1.000000

The plot and the correlation matrix of the series show that the series are highly correlated.
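
When many series are involved, reading the full correlation matrix is impractical. As a rough sketch, the series most correlated with a target can be listed with a small helper. The function name `correlated_with_target`, the 0.9 threshold, and the toy data are assumptions made for illustration; they are not part of the original analysis.

```python
import pandas as pd


def correlated_with_target(data, target, threshold=0.9):
    """Return the series whose absolute Pearson correlation with the
    target column is greater than or equal to the threshold."""
    corr = data.corr()[target].drop(target)
    selected = corr[corr.abs() >= threshold]
    return selected.sort_values(ascending=False)


# Small synthetic example (not the Madrid data)
toy = pd.DataFrame({
    "target": [1, 2, 3, 4, 5, 6],
    "scaled": [3, 5, 7, 9, 11, 13],        # exact linear function of target
    "alternating": [1, -1, 1, -1, 1, -1],  # only weakly related to target
})

risky = correlated_with_target(toy, target="target")
print(risky)
```

Series flagged this way are the ones most likely to carry information about the target, so their time coverage deserves particular attention when defining the test period.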

In [5]:
# Exogenous features from calendar
# ==============================================================================
exog = pd.DataFrame(index=data.index)
exog['day'] = exog.index.day
exog['month'] = exog.index.month
exog['quarter'] = exog.index.quarter
exog['dayofweek'] = exog.index.dayofweek
exog['dayofyear'] = exog.index.dayofyear
exog['year'] = exog.index.year
exog.head(4)
Out[5]:
            day  month  quarter  dayofweek  dayofyear  year
date
2023-01-01    1      1        1          6          1  2023
2023-01-02    2      1        1          0          2  2023
2023-01-03    3      1        1          1          3  2023
2023-01-04    4      1        1          2          4  2023

Modelling

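Two forecasting models are trained using all the available series except metro. The forecasting performance of each model is then evaluated on the excluded series, metro, in two different scenarios:

  • The model has not seen the target series or any other series during the test period.

  • The model has not seen the target series, but has seen the test period of the other series.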

In [6]:
# Split data into train-test
# ==============================================================================
end_train = '2024-07-31 23:59:00'
data_train = data.loc[:end_train, :].copy()
data_test  = data.loc[end_train:, :].copy()
exog_train = exog.loc[:end_train, :].copy()
exog_test  = exog.loc[end_train:, :].copy()

print(f"Train dates : {data_train.index.min()} --- {data_train.index.max()}  (n={len(data_train)})")
print(f"Test dates  : {data_test.index.min()} --- {data_test.index.max()}  (n={len(data_test)})")
Train dates : 2023-01-01 00:00:00 --- 2024-07-31 00:00:00  (n=578)
Test dates  : 2024-08-01 00:00:00 --- 2024-12-15 00:00:00  (n=137)

Model that has not seen the target series or any other series in the test period

In [7]:
# Train forecaster
# ==============================================================================
lags = [1, 7, 14, 364]
forecaster_1 = ForecasterRecursiveMultiSeries(
                   regressor          = Ridge(random_state=951),
                   lags               = lags,
                   transformer_series = StandardScaler(),
                   encoding           = None,
                   forecaster_id      = 'forecaster_no_leakage'
               )

forecaster_1.fit(series=data_train.drop(columns='metro'), exog=exog_train)
In [8]:
# Predictions
# ==============================================================================
predictions_1 = forecaster_1.predict(
                    steps       = len(data_test),
                    last_window = data_train[['metro']],
                    exog        = exog_test,
                    levels      = ["metro"]
                )
predictions_1 = predictions_1.rename(columns={'metro': 'pred_model_1'})
predictions_1.head(3)
Out[8]:
pred_model_1
2024-08-01 1.451714e+06
2024-08-02 1.478629e+06
2024-08-03 1.141787e+06

Model that has not seen the target series but has seen the test period of the other series

In [9]:
# Train forecaster
# ==============================================================================
lags = [1, 7, 14, 364]
forecaster_2 = ForecasterRecursiveMultiSeries(
                   regressor          = Ridge(random_state=951),
                   lags               = lags,
                   transformer_series = StandardScaler(),
                   encoding           = None,
                   forecaster_id      = 'forecaster_with_leakage'
               )

forecaster_2.fit(series=data.drop(columns='metro'), exog=exog)
In [10]:
# Predictions
# ==============================================================================
predictions_2 = forecaster_2.predict(
                    steps       = len(data_test),
                    last_window = data_train[['metro']],
                    exog        = exog_test,
                    levels      = ["metro"]
                )
predictions_2 = predictions_2.rename(columns={'metro': 'pred_model_2'})
predictions_2.head(3)
Out[10]:
pred_model_2
2024-08-01 1.474107e+06
2024-08-02 1.511256e+06
2024-08-03 1.164568e+06
In [11]:
# Results
# ==============================================================================
results = pd.concat([data_test['metro'], predictions_1, predictions_2], axis=1)
fig, ax = plt.subplots(figsize=(7, 3.5))
data_test['metro'].plot(ax=ax, label='True')
results['pred_model_1'].plot(ax=ax, label='No leakage')
results['pred_model_2'].plot(ax=ax, label='With leakage')
ax.set_title('Users of public transport in Madrid')
ax.set_xlabel('')
ax.legend()
plt.show();
In [12]:
# Prediction error for each model
# ==============================================================================
mae_1 = (results['metro'] - results['pred_model_1']).abs().mean()
mae_2 = (results['metro'] - results['pred_model_2']).abs().mean()
improvement = (mae_1 - mae_2) / mae_1

print(f"MAE model {forecaster_1.forecaster_id}: {mae_1}")
print(f"MAE model {forecaster_2.forecaster_id}: {mae_2}")
print(f"Improvement: {100 * improvement:.2f}%")
MAE model forecaster_no_leakage: 439142.00606688287
MAE model forecaster_with_leakage: 248235.84805851904
Improvement: 43.47%
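
As a quick sanity check, the relative improvement can be recomputed directly from the two MAE values printed above. The variable names in this sketch are illustrative.

```python
# MAE values copied from the printed output above
mae_no_leakage = 439142.00606688287
mae_with_leakage = 248235.84805851904

# Relative reduction in MAE with respect to the model without leakage
improvement = (mae_no_leakage - mae_with_leakage) / mae_no_leakage
print(f"Improvement: {100 * improvement:.2f}%")
```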

Although neither model has directly seen the target series, the second model has been exposed to the test period of the other series. Since these series are highly correlated with the target series, the model has indirectly accessed future information, enabling it to make better predictions than the first model.

This example illustrates how data leakage can occur even when the target series is not part of the training data. To prevent information leakage and ensure a fair evaluation, the model must not have access to any data from the time period designated for testing.
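
When the training data are under the user's control, one way to apply this rule is to truncate every training series at the start of the test period. The following is a minimal sketch, assuming the series are stored as columns of a DataFrame with a datetime index. The helper name `truncate_before` and the toy data are hypothetical.

```python
import pandas as pd


def truncate_before(series, test_start):
    """Keep only the observations strictly earlier than test_start."""
    return series.loc[series.index < pd.Timestamp(test_start)]


# Toy data: two series covering six consecutive days
index = pd.date_range("2024-07-29", periods=6, freq="D")
series = pd.DataFrame({"bus": range(6), "road": range(10, 16)}, index=index)

train = truncate_before(series, test_start="2024-08-01")
print(train.index.max())
```

With pre-trained models, where retraining is usually not an option, the same cutoff has to be applied in the opposite direction: the test period should start after the last date covered by any of the pre-training series.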

Session information

In [13]:
import session_info
session_info.show(html=False)
-----
matplotlib          3.9.0
pandas              2.2.3
session_info        1.0.0
skforecast          0.14.0
sklearn             1.5.2
-----
IPython             8.25.0
jupyter_client      8.6.2
jupyter_core        5.7.2
notebook            6.4.12
-----
Python 3.12.4 | packaged by Anaconda, Inc. | (main, Jun 18 2024, 15:03:56) [MSC v.1929 64 bit (AMD64)]
Windows-11-10.0.26100-SP0
-----
Session information updated at 2025-01-04 20:18

Citation

How to cite this document

If you use this document or any part of it, please acknowledge the source. Thank you!

Data leakage in pre-trained forecasting models by Joaquín Amat Rodrigo and Javier Escobar Ortiz, available under Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0 DEED) at https://cienciadedatos.net/documentos/py63-data-leakage-pre-trained-forecasting-models.html

How to cite skforecast

If you use skforecast for a publication, we would appreciate it if you cite the published software.

Zenodo:

Amat Rodrigo, Joaquin, & Escobar Ortiz, Javier. (2024). skforecast (v0.14.0). Zenodo. https://doi.org/10.5281/zenodo.8382788

APA:

Amat Rodrigo, J., & Escobar Ortiz, J. (2024). skforecast (Version 0.14.0) [Computer software]. https://doi.org/10.5281/zenodo.8382788

BibTeX:

@software{skforecast,
  author  = {Amat Rodrigo, Joaquin and Escobar Ortiz, Javier},
  title   = {skforecast},
  version = {0.14.0},
  month   = {11},
  year    = {2024},
  license = {BSD-3-Clause},
  url     = {https://skforecast.org/},
  doi     = {10.5281/zenodo.8382788}
}


Creative Commons Licence

This work by Joaquín Amat Rodrigo and Javier Escobar Ortiz is licensed under an Attribution-NonCommercial-ShareAlike 4.0 International license.

Allowed:

  • Share: copy and redistribute the material in any medium or format.

  • Adapt: remix, transform, and build upon the material.

Under the following terms:

  • Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

  • NonCommercial: You may not use the material for commercial purposes.

  • ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.