Probabilistic forecasting: prediction intervals for multi-step time series forecasting

Introduction

When trying to anticipate future values, most forecasting models aim to predict the most likely future value, a technique known as point forecasting. While useful, it does not provide information about the model's confidence or the uncertainty associated with the prediction.

Probabilistic forecasting, as opposed to point-forecasting, is a family of techniques that allow for predicting the expected distribution of the outcome rather than a single value. This approach offers richer insights by enabling the creation of prediction intervals —ranges within which the true value is likely to fall. More formally, a prediction interval defines the interval within which the true value of the response variable is expected to be found with a given probability.

Estimating prediction intervals in forecasting is challenging, since many well-established methods for regression and one-step-ahead forecasts are not directly applicable when predicting multiple steps ahead. Additionally, there is a trade-off between two key metrics: coverage and interval width. Ideally, we want to achieve a certain level of coverage (e.g. 80%) while keeping the prediction intervals as narrow as possible and the model's prediction error as low as possible.

This document is inspired by Carl McBride Ellis (2024). Probabilistic Forecasting I: Temperature. Kaggle competition. In the competition, predictions are made one-step-ahead, meaning each prediction is generated for the next time step based on known previous values. In this adaptation, however, a multi-step-ahead approach is used, where multiple future time steps are predicted simultaneously. This method is more challenging and requires a more sophisticated model but is often necessary for real-world applications.

The task is to estimate the prediction intervals for the target variable, labeled OT, over the next 28 days. Each day, the next 24 hours are predicted, meaning 24 values are forecasted each day. This process is repeated daily for 28 days, resulting in a total of 672 predictions (28 × 24) by the end of the period. Symmetric prediction intervals with coverage levels of 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, and 95% are estimated, and their coverage is evaluated.

Skforecast implements several methods for probabilistic forecasting:

💡 Tip

This is the first in a series of documents on probabilistic forecasting.

Libraries

The libraries used in this notebook are:

# Data processing
# ==============================================================================
import numpy as np
import pandas as pd
from skforecast.datasets import fetch_dataset

# Plots
# ==============================================================================
import matplotlib.pyplot as plt
from skforecast.plot import set_dark_theme, plot_prediction_intervals, plot_residuals
from statsmodels.graphics.tsaplots import plot_pacf
from statsmodels.tsa.stattools import pacf
import plotly.graph_objects as go
import plotly.io as pio
import plotly.offline as poff
pio.templates.default = "seaborn"
pio.renderers.default = 'notebook' 
poff.init_notebook_mode(connected=True)

# Modelling and Forecasting
# ==============================================================================
import skforecast
from sklearn.linear_model import Ridge
from sklearn.feature_selection import RFECV
from sklearn.pipeline import make_pipeline
from feature_engine.datetime import DatetimeFeatures
from feature_engine.creation import CyclicalFeatures
from feature_engine.timeseries.forecasting import LagFeatures, WindowFeatures
from skforecast.recursive import ForecasterRecursive
from skforecast.model_selection import TimeSeriesFold, bayesian_search_forecaster, backtesting_forecaster
from skforecast.feature_selection import select_features
from skforecast.preprocessing import RollingFeatures
from skforecast.metrics import calculate_coverage

# Warnings configuration
# ==============================================================================
import warnings
warnings.filterwarnings('once')

print('Versión skforecast:', skforecast.__version__)

Versión skforecast: 0.15.0

Data

Data from an electricity transformer station was collected between July 2016 and July 2018 (2 years × 365 days × 24 hours × 4 intervals per hour = 70,080 data points). Each data point consists of 8 features, including the date of the point, the predictive value "Oil Temperature (OT)", and 6 different types of external power load features: High UseFul Load (HUFL), High UseLess Load (HULL), Middle UseFul Load (MUFL), Middle UseLess Load (MULL), Low UseFul Load (LUFL), Low UseLess Load(LULL).

Zhou, Haoyi & Zhang, Shanghang & Peng, Jieqi & Zhang, Shuai & Li, Jianxin & Xiong, Hui & Zhang, Wancai. (2020). Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. 10.48550/arXiv.2012.07436. https://github.com/zhouhaoyi/ETDataset

# Load data
# ==============================================================================
data = fetch_dataset(name="ett_m2")
data = data.resample(rule="1h", closed="left", label="right").mean()
data

ett_m2
------
Data from an electricity transformer station was collected between July 2016 and
July 2018 (2 years x 365 days x 24 hours x 4 intervals per hour = 70,080 data
points). Each data point consists of 8 features, including the date of the
point, the predictive value "Oil Temperature (OT)", and 6 different types of
external power load features: High UseFul Load (HUFL), High UseLess Load (HULL),
Middle UseFul Load (MUFL), Middle UseLess Load (MULL), Low UseFul Load (LUFL),
Low UseLess Load (LULL).
Zhou, Haoyi & Zhang, Shanghang & Peng, Jieqi & Zhang, Shuai & Li, Jianxin &
Xiong, Hui & Zhang, Wancai. (2020). Informer: Beyond Efficient Transformer for
Long Sequence Time-Series Forecasting.
[10.48550/arXiv.2012.07436](https://arxiv.org/abs/2012.07436).
https://github.com/zhouhaoyi/ETDataset
Shape of the dataset: (69680, 7)

	HUFL	HULL	MUFL	MULL	LUFL	LULL	OT
date
2016-07-01 01:00:00	38.784501	10.88975	34.753500	8.55100	4.12575	1.26050	37.838250
2016-07-01 02:00:00	36.041249	9.44475	32.696001	7.13700	3.59025	0.62900	36.849250
2016-07-01 03:00:00	38.240000	11.41350	35.343501	9.10725	3.06000	0.31175	35.915750
2016-07-01 04:00:00	37.800250	11.45525	34.881000	9.28850	3.04400	0.60750	32.839375
2016-07-01 05:00:00	36.501750	10.49200	33.708250	8.65150	2.64400	0.00000	31.466125
...	...	...	...	...	...	...	...
2018-06-26 16:00:00	39.517250	11.45500	50.616000	12.08950	-10.64275	-1.28475	47.744249
2018-06-26 17:00:00	39.140499	11.32975	49.865250	11.84825	-10.33100	-1.35400	48.183498
2018-06-26 18:00:00	42.428501	12.85850	53.591250	13.33575	-10.95475	-1.41800	47.853999
2018-06-26 19:00:00	43.036000	12.87925	54.241250	13.33575	-10.91200	-1.41800	46.535750
2018-06-26 20:00:00	40.543500	11.16175	51.721500	11.76800	-11.22400	-1.41800	45.601625

17420 rows × 7 columns

# Plot target time series
# ==============================================================================
set_dark_theme()
plt.rcParams['lines.linewidth'] = 0.5
fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(data['OT'])
ax.set_title('OT');

# Plot Exogenous features
# ==============================================================================
cols_to_plot = ['HUFL', 'HULL', 'MUFL', 'MULL', 'LUFL', 'LULL']
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
colors = colors[1:] + ["violet", "magenta"]    
fig, axs = plt.subplots(6, 1, figsize=(8, 8), sharex=True)
for i, col in enumerate(cols_to_plot):
    axs[i].plot(data[col], label=col, color=colors[i])
    axs[i].legend(loc='lower left', fontsize=8)
    axs[i].tick_params(axis='both', labelsize=8)

# Partial autocorrelation values and plots
# ==============================================================================
n_lags = 24 * 7
cols_to_plot = ['OT', 'HUFL', 'HULL', 'MUFL', 'MULL', 'LUFL', 'LULL']
pacf_df = []
fig, axs = plt.subplots(4, 2, figsize=(8, 6))
axs = axs.ravel()
for i, col in enumerate(cols_to_plot):
    pacf_values = pacf(data[col], nlags=n_lags)
    pacf_values = pd.DataFrame({
        'lag': range(1, len(pacf_values)),
        'value': pacf_values[1:],
        'variable': col
    })
    pacf_df.append(pacf_values)
    
    plot_pacf(data[col], lags=n_lags, ax=axs[i])
    axs[i].set_title(col, fontsize=10)
    axs[i].set_ylim(-0.5, 1.1)

plt.tight_layout()

# Top n lags with highest partial autocorrelation per variable
# ==============================================================================
n = 10
top_lags = set()
for pacf_values in pacf_df:
    variable = pacf_values['variable'].iloc[0]
    lags = pacf_values.nlargest(n, 'value')['lag'].tolist()
    top_lags.update(lags)
    print(f"{variable}: {lags}")

top_lags = list(top_lags)
print(f"\nAll lags: {top_lags}")

OT: [1, 15, 16, 14, 3, 6, 17, 13, 4, 5]
HUFL: [1, 11, 15, 19, 14, 17, 18, 12, 23, 9]
HULL: [1, 3, 2, 15, 12, 23, 11, 14, 24, 4]
MUFL: [1, 15, 11, 14, 17, 19, 12, 18, 9, 16]
MULL: [1, 3, 2, 24, 11, 15, 23, 12, 17, 4]
LUFL: [1, 2, 20, 18, 9, 17, 10, 16, 19, 21]
LULL: [1, 3, 14, 4, 18, 11, 9, 42, 19, 12]

All lags: [1, 2, 3, 4, 5, 6, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 23, 24, 42]

The target series and the external series exhibit similar dynamics with several lagged correlations that can be used as predictors.

Feature engineering

Additional exogenous features are created based on:

Lagged values of the exogenous series.
Rolling statistics of the exogenous series.
Calendar features.

⚠ Warning

Throughout this document, it is assumed that the aditional series are known into the future. This means its values are known at the time of prediction allowing to use them exogenous variables in the forecasting model.

# Calendar features
# ==============================================================================
features_to_extract = ['year', 'month', 'week', 'day_of_week', 'hour']
calendar_transformer = DatetimeFeatures(
    variables           = 'index',
    features_to_extract = features_to_extract,
    drop_original       = False,
)

# Lags of exogenous variables
# ==============================================================================
lag_transformer = LagFeatures(
    variables = ["HUFL", "MUFL", "MULL", "HULL", "LUFL", "LULL"],
    periods   = top_lags,
)

# Rolling features for exogenous variables
# ==============================================================================
wf_transformer = WindowFeatures(
    variables = ["HUFL", "MUFL", "MULL", "HULL", "LUFL", "LULL"],
    window    = ["1D", "7D"],
    functions = ["mean", "max", "min"],
    freq      = "1h",
)

# Cliclical encoding of calendar features
# ==============================================================================
features_to_encode = ['year', 'month', 'week', 'day_of_week', 'hour']
max_values = {
    "month": 12,
    "week": 52,
    "day_of_week": 7,
    "hour": 24,
}
cyclical_encoder = CyclicalFeatures(
                       variables     = features_to_encode,
                       max_values    = max_values,
                       drop_original = True
                   )

exog_transformer = make_pipeline(
                       calendar_transformer,
                       lag_transformer,
                       wf_transformer,
                       cyclical_encoder
                   )
display(exog_transformer)

data = exog_transformer.fit_transform(data)
# Remove rows with NaNs created by lag features
data = data.dropna()
exog_features = data.columns.difference(['OT']).tolist()
display(data.head(3))

Pipeline(steps=[('datetimefeatures',
                 DatetimeFeatures(drop_original=False,
                                  features_to_extract=['year', 'month', 'week',
                                                       'day_of_week', 'hour'],
                                  variables='index')),
                ('lagfeatures',
                 LagFeatures(periods=[1, 2, 3, 4, 5, 6, 9, 10, 11, 12, 13, 14,
                                      15, 16, 17, 18, 19, 20, 21, 23, 24, 42],
                             variables=['HUFL', 'MUFL', 'MULL', 'HULL', 'LUFL',
                                        'LULL'])),
                ('windowfeatures',
                 WindowFeatures(freq='1h', functions=['mean', 'max', 'min'],
                                variables=['HUFL', 'MUFL', 'MULL', 'HULL',
                                           'LUFL', 'LULL'],
                                window=['1D', '7D'])),
                ('cyclicalfeatures',
                 CyclicalFeatures(drop_original=True,
                                  max_values={'day_of_week': 7, 'hour': 24,
                                              'month': 12, 'week': 52},
                                  variables=['year', 'month', 'week',
                                             'day_of_week', 'hour']))])

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Pipeline

?Documentation for PipelineiNot fitted

Pipeline(steps=[('datetimefeatures',
                 DatetimeFeatures(drop_original=False,
                                  features_to_extract=['year', 'month', 'week',
                                                       'day_of_week', 'hour'],
                                  variables='index')),
                ('lagfeatures',
                 LagFeatures(periods=[1, 2, 3, 4, 5, 6, 9, 10, 11, 12, 13, 14,
                                      15, 16, 17, 18, 19, 20, 21, 23, 24, 42],
                             variables=['HUFL', 'MUFL', 'MULL', 'HULL', 'LUFL',
                                        'LULL'])),
                ('windowfeatures',
                 WindowFeatures(freq='1h', functions=['mean', 'max', 'min'],
                                variables=['HUFL', 'MUFL', 'MULL', 'HULL',
                                           'LUFL', 'LULL'],
                                window=['1D', '7D'])),
                ('cyclicalfeatures',
                 CyclicalFeatures(drop_original=True,
                                  max_values={'day_of_week': 7, 'hour': 24,
                                              'month': 12, 'week': 52},
                                  variables=['year', 'month', 'week',
                                             'day_of_week', 'hour']))])

DatetimeFeatures

DatetimeFeatures(drop_original=False,
                 features_to_extract=['year', 'month', 'week', 'day_of_week',
                                      'hour'],
                 variables='index')

LagFeatures

LagFeatures(periods=[1, 2, 3, 4, 5, 6, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
                     19, 20, 21, 23, 24, 42],
            variables=['HUFL', 'MUFL', 'MULL', 'HULL', 'LUFL', 'LULL'])

WindowFeatures

WindowFeatures(freq='1h', functions=['mean', 'max', 'min'],
               variables=['HUFL', 'MUFL', 'MULL', 'HULL', 'LUFL', 'LULL'],
               window=['1D', '7D'])

CyclicalFeatures

CyclicalFeatures(drop_original=True,
                 max_values={'day_of_week': 7, 'hour': 24, 'month': 12,
                             'week': 52},
                 variables=['year', 'month', 'week', 'day_of_week', 'hour'])

	HUFL	HULL	MUFL	MULL	LUFL	LULL	OT	year	HUFL_lag_1	MUFL_lag_1	...	LULL_window_7D_max	LULL_window_7D_min	month_sin	month_cos	week_sin	week_cos	day_of_week_sin	day_of_week_cos	hour_sin	hour_cos
date
2016-07-02 19:00:00	33.6330	9.08900	29.874751	6.95600	3.753	0.61025	28.719500	2016	32.5440	29.017001	...	1.2605	0.0	-0.5	-0.866025	1.224647e-16	-1.0	-0.974928	-0.222521	-0.965926	0.258819
2016-07-02 20:00:00	31.7065	7.72750	27.884500	5.63600	3.753	0.67425	29.103875	2016	33.6330	29.874751	...	1.2605	0.0	-0.5	-0.866025	1.224647e-16	-1.0	-0.974928	-0.222521	-0.866025	0.500000
2016-07-02 21:00:00	31.8110	7.81125	27.362000	5.18025	4.435	1.42075	29.598500	2016	31.7065	27.884500	...	1.2605	0.0	-0.5	-0.866025	1.224647e-16	-1.0	-0.974928	-0.222521	-0.707107	0.707107

3 rows × 184 columns

Forecasting intervals using bootstrapped residuals

Data partition

# Split train-validation-test
# ==============================================================================
end_train = '2017-10-01 23:59:00'
end_validation = '2018-04-03 23:59:00'
data_train = data.loc[: end_train, :]
data_val   = data.loc[end_train:end_validation, :]
data_test  = data.loc[end_validation:, :]

print(f"Dates train      : {data_train.index.min()} --- {data_train.index.max()}  (n={len(data_train)})")
print(f"Dates validacion : {data_val.index.min()} --- {data_val.index.max()}  (n={len(data_val)})")
print(f"Dates test       : {data_test.index.min()} --- {data_test.index.max()}  (n={len(data_test)})")

Dates train      : 2016-07-02 19:00:00 --- 2017-10-01 23:00:00  (n=10949)
Dates validacion : 2017-10-02 00:00:00 --- 2018-04-03 23:00:00  (n=4416)
Dates test       : 2018-04-04 00:00:00 --- 2018-06-26 20:00:00  (n=2013)

# Plot partitions
# ==============================================================================
fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(data_train['OT'], label='Train')
ax.plot(data_val['OT'], label='Validation')
ax.plot(data_test['OT'], label='Test')
ax.set_title('OT')
ax.legend();

# Plot partitions after differencing
# ==============================================================================
fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(data_train['OT'].diff(1), label='Train')
ax.plot(data_val['OT'].diff(1), label='Validation')
ax.plot(data_test['OT'].diff(1), label='Test')
ax.set_title('Differenced OT')
ax.legend();

Forecaster

# Create forecaster
# ==============================================================================
window_features = RollingFeatures(stats=['mean', 'min', 'max'], window_sizes=24)
forecaster = ForecasterRecursive(
                 regressor       = Ridge(random_state=15926),
                 lags            = top_lags,
                 differentiation = 1,
                 window_features = window_features,
                 binner_kwargs   = {'n_bins': 10}
             )

Feature selection

# Feature selection (autoregressive and exog) with scikit-learn RFECV
# ==============================================================================
regressor = Ridge(random_state=15926)

selector = RFECV(
    estimator=regressor, step=1, cv=3, min_features_to_select=25, n_jobs=-1
)

selected_lags, selected_window_features, selected_exog = select_features(
    forecaster      = forecaster,
    selector        = selector,
    y               = data.loc[:end_train, 'OT'],
    exog            = data.loc[:end_train, exog_features],
    select_only     = None,
    force_inclusion = None,
    subsample       = 0.5,
    random_state    = 123,
    verbose         = True,
)

Recursive feature elimination (RFECV)
-------------------------------------
Total number of records available: 10906
Total number of records used for feature selection: 5453
Number of features available: 208
    Lags            (n=22)
    Window features (n=3)
    Exog            (n=183)
Number of features selected: 107
    Lags            (n=14) : [1, 2, 3, 4, 5, 6, 9, 12, 15, 17, 20, 23, 24, 42]
    Window features (n=2) : ['roll_min_24', 'roll_max_24']
    Exog            (n=91) : ['HUFL', 'HUFL_lag_1', 'HUFL_lag_12', 'HUFL_lag_13', 'HUFL_lag_15', 'HUFL_lag_19', 'HUFL_lag_2', 'HUFL_lag_20', 'HUFL_lag_23', 'HUFL_lag_4', 'HUFL_lag_5', 'HUFL_lag_9', 'HUFL_window_1D_mean', 'HULL', 'HULL_lag_1', 'HULL_lag_12', 'HULL_lag_14', 'HULL_lag_15', 'HULL_lag_2', 'HULL_lag_20', 'HULL_lag_21', 'HULL_lag_23', 'HULL_lag_3', 'HULL_lag_4', 'HULL_window_1D_mean', 'LUFL', 'LUFL_lag_1', 'LUFL_lag_10', 'LUFL_lag_15', 'LUFL_lag_19', 'LUFL_lag_2', 'LUFL_lag_20', 'LUFL_lag_23', 'LUFL_lag_3', 'LUFL_lag_4', 'LUFL_lag_5', 'LUFL_window_1D_mean', 'LULL', 'LULL_lag_1', 'LULL_lag_12', 'LULL_lag_13', 'LULL_lag_14', 'LULL_lag_18', 'LULL_lag_19', 'LULL_lag_20', 'LULL_lag_21', 'LULL_lag_23', 'LULL_lag_24', 'LULL_lag_4', 'LULL_lag_5', 'LULL_lag_6', 'LULL_window_1D_max', 'LULL_window_1D_min', 'MUFL', 'MUFL_lag_1', 'MUFL_lag_11', 'MUFL_lag_12', 'MUFL_lag_13', 'MUFL_lag_15', 'MUFL_lag_2', 'MUFL_lag_20', 'MUFL_lag_23', 'MUFL_lag_4', 'MUFL_lag_9', 'MUFL_window_1D_mean', 'MULL', 'MULL_lag_1', 'MULL_lag_11', 'MULL_lag_12', 'MULL_lag_13', 'MULL_lag_14', 'MULL_lag_15', 'MULL_lag_17', 'MULL_lag_19', 'MULL_lag_2', 'MULL_lag_20', 'MULL_lag_3', 'MULL_lag_4', 'MULL_lag_5', 'MULL_lag_9', 'MULL_window_1D_mean', 'MULL_window_1D_min', 'MULL_window_7D_mean', 'day_of_week_sin', 'hour_cos', 'hour_sin', 'month_cos', 'month_sin', 'week_cos', 'week_sin', 'year']

# Set the selected lags
# ==============================================================================
forecaster.set_lags(selected_lags)

Hiperparameters search

# Hyperparameters search
# ==============================================================================
cv = TimeSeriesFold(
         initial_train_size = len(data.loc[:end_train, :]),
         steps              = 24,  # all hours of next the day 
         differentiation    = 1,
     )

def search_space(trial):
    search_space  = {
        'alpha': trial.suggest_float('alpha', 1e-7, 1e+3, log=True)
    }
    return search_space

results_search, frozen_trial = bayesian_search_forecaster(
                                   forecaster   = forecaster,
                                   y            = data.loc[:end_validation, 'OT'],
                                   exog         = data.loc[:end_validation, selected_exog],
                                   search_space = search_space,
                                   cv           = cv,
                                   metric       = 'mean_absolute_error',
                                   n_trials     = 20
                               )
best_params = results_search['params'].iloc[0]
best_lags = results_search['lags'].iloc[0]
results_search.head()

  0%|          | 0/20 [00:00<?, ?it/s]

`Forecaster` refitted using the best-found lags and parameters, and the whole data set: 
  Lags: [ 1  2  3  4  5  6  9 12 15 17 20 23 24 42] 
  Parameters: {'alpha': 1.1075004417904626e-07}
  Backtesting metric: 2.381157334071585

	lags	params	mean_absolute_error	alpha
0	[1, 2, 3, 4, 5, 6, 9, 12, 15, 17, 20, 23, 24, 42]	{'alpha': 1.1075004417904626e-07}	2.381157	1.107500e-07
1	[1, 2, 3, 4, 5, 6, 9, 12, 15, 17, 20, 23, 24, 42]	{'alpha': 1.1602023537361883e-07}	2.381157	1.160202e-07
2	[1, 2, 3, 4, 5, 6, 9, 12, 15, 17, 20, 23, 24, 42]	{'alpha': 1.3232531712227085e-07}	2.381157	1.323253e-07
3	[1, 2, 3, 4, 5, 6, 9, 12, 15, 17, 20, 23, 24, 42]	{'alpha': 1.8131052405964636e-07}	2.381157	1.813105e-07
4	[1, 2, 3, 4, 5, 6, 9, 12, 15, 17, 20, 23, 24, 42]	{'alpha': 6.876338479023381e-07}	2.381157	6.876338e-07

Forecasting with prediction intervals

After selecting the optimal model, including its features and hyperparameters, a backtesting process is performed using the validation set. This process generates out-of-sample residuals, which are then used to estimate prediction intervals for the test set.

# Backtesting on validation data to obtain out-sample residuals
# ==============================================================================
cv = TimeSeriesFold(
         initial_train_size = len(data.loc[:end_train, :]),
         steps              = 24,  # all hours of next day
         differentiation    = 1,
     )

metric_val, predictions_val = backtesting_forecaster(
                                  forecaster = forecaster,
                                  y          = data.loc[:end_validation, 'OT'],
                                  exog       = data.loc[:end_validation, selected_exog],
                                  cv         = cv,
                                  metric     = 'mean_absolute_error'
                              )
display(metric_val)
fig, ax = plt.subplots(figsize=(8, 3))
data.loc[end_train:end_validation, 'OT'].plot(ax=ax, label='Real value')
predictions_val['pred'].plot(ax=ax, label='prediction')
ax.set_title("Backtesting on validation data")
ax.legend();

  0%|          | 0/184 [00:00<?, ?it/s]

	mean_absolute_error
0	2.381157

# Out-sample residuals distribution
# ==============================================================================
residuals = data.loc[predictions_val.index, 'OT'] - predictions_val['pred']
print(pd.Series(np.where(residuals < 0, 'negative', 'positive')).value_counts())
plt.rcParams.update({'font.size': 8})
_ = plot_residuals(residuals=residuals, figsize=(7, 4))

positive    2457
negative    1959
Name: count, dtype: int64

# Store out-sample residuals in the forecaster
# ==============================================================================
forecaster.fit(y=data.loc[:end_train, 'OT'], exog=data.loc[:end_train, selected_exog])
forecaster.set_out_sample_residuals(
    y_true = data.loc[predictions_val.index, 'OT'], 
    y_pred = predictions_val['pred']
)

✎ Note

When storing new residuals in the forecaster, the residuals are binned according to the magnitude of the predictions they correspond to. Later, in the bootstrapping process, the residuals are sampled from the corresponding bin, ensuring that the distribution of the residuals is consistent with the predictions. This process allows for more accurate prediction intervals, as the residuals are more closely aligned with the predictions they correspond to resulting in better coverage with narrower intervals. For more information about how residuals are used in interval estimation visit Probabilistic forecasting: prediction intervals and prediction distribution.

# Intervals of the residual bins (conditioned on predicted values) and number of residuals stored
# ==============================================================================
for k, v in forecaster.binner_intervals_.items():
    print(f"{k}: {v}  n={len(forecaster.out_sample_residuals_by_bin_[k])}")

0: (-3.871638467653412, -1.1238482614862022)  n=417
1: (-1.1238482614862022, -0.714034097520269)  n=611
2: (-0.714034097520269, -0.5147665419615066)  n=510
3: (-0.5147665419615066, -0.3512020918321923)  n=459
4: (-0.3512020918321923, -0.19876133540594054)  n=374
5: (-0.19876133540594054, -0.013725466917961171)  n=274
6: (-0.013725466917961171, 0.2943710646100325)  n=328
7: (0.2943710646100325, 0.7696661155395645)  n=426
8: (0.7696661155395645, 1.4793364746966304)  n=516
9: (1.4793364746966304, 5.609482525672902)  n=500

# Distribution of the residual by bin
# ==============================================================================
out_sample_residuals_by_bin_df = pd.DataFrame(
    dict(
        [(k, pd.Series(v))
         for k, v in forecaster.out_sample_residuals_by_bin_.items()]
    )
)

fig, ax = plt.subplots(figsize=(6, 3))
out_sample_residuals_by_bin_df.boxplot(ax=ax)
ax.set_title("Distribution of residuals by bin", fontsize=12)
ax.set_xlabel("Bin", fontsize=10)
ax.set_ylabel("Residuals", fontsize=10)
plt.show();

# Backtesting with prediction intervals in test data using out-sample residuals
# ==============================================================================
cv = TimeSeriesFold(
         initial_train_size = len(data.loc[:end_validation, :]),
         steps              = 24,  # all hours of next day
         differentiation    = 1
     )

metric, predictions = backtesting_forecaster(
                          forecaster              = forecaster,
                          y                       = data['OT'],
                          exog                    = data[selected_exog],
                          cv                      = cv,
                          metric                  = 'mean_absolute_error',
                          interval                = [10, 90],  # 80% prediction interval
                          n_boot                  = 150,
                          interval_method         = 'bootstrapping',
                          use_in_sample_residuals = False,  # Use out-sample residuals
                          use_binned_residuals    = True,   # Intervals conditioned to the predicted values
                      )
predictions

  0%|          | 0/84 [00:00<?, ?it/s]

	pred	lower_bound	upper_bound
2018-04-04 00:00:00	32.748341	32.289988	33.384481
2018-04-04 01:00:00	31.947731	31.186928	33.132068
2018-04-04 02:00:00	31.052885	30.214739	33.000788
2018-04-04 03:00:00	30.108430	29.027782	32.823787
2018-04-04 04:00:00	29.229240	27.894726	32.262539
...	...	...	...
2018-06-26 16:00:00	49.968142	39.049217	56.065172
2018-06-26 17:00:00	49.287497	38.649894	56.572044
2018-06-26 18:00:00	48.319028	38.308019	56.107946
2018-06-26 19:00:00	47.041985	38.269496	55.357029
2018-06-26 20:00:00	45.751569	37.076566	54.628951

2013 rows × 3 columns

# Plot intervals
# ==============================================================================
plt.rcParams['lines.linewidth'] = 1
fig, ax = plt.subplots(figsize=(9, 4))
plot_prediction_intervals(
    predictions     = predictions,
    y_true          = data_test,
    target_variable = "OT",
    initial_x_zoom  = None,
    title           = "Real value vs prediction in test data",
    xaxis_title     = "Date time",
    yaxis_title     = "OT",
    ax              = ax
)
fill_between_obj = ax.collections[0]
fill_between_obj.set_facecolor('white')
fill_between_obj.set_alpha(0.3)

# Predicted interval coverage (on test data)
# ==============================================================================
coverage = calculate_coverage(
    y_true      = data.loc[end_validation:, 'OT'],
    lower_bound = predictions["lower_bound"], 
    upper_bound = predictions["upper_bound"]
)
print(f"Predicted interval coverage: {round(100 * coverage, 2)} %")

# Area of the interval
# ==============================================================================
area = (predictions["upper_bound"] - predictions["lower_bound"]).sum()
print(f"Area of the interval: {round(area, 2)}")

display(metric)

Predicted interval coverage: 81.57 %
Area of the interval: 20843.63

	mean_absolute_error
0	2.871771

# Interactive plot
# ==============================================================================
fig = go.Figure([
        go.Scatter(name='Prediction', x=predictions.index, y=predictions['pred'], mode='lines'),
        go.Scatter(name='Real value', x=data_test.index, y=data_test['OT'], mode='lines'),
        go.Scatter(
            name='Upper Bound', x=predictions.index, y=predictions['upper_bound'],
            mode='lines', marker=dict(color="#444"), line=dict(width=0), showlegend=False
        ),
        go.Scatter(
            name='Lower Bound', x=predictions.index, y=predictions['lower_bound'],
            marker=dict(color="#444"), line=dict(width=0), mode='lines',
            fillcolor='rgba(68, 68, 68, 0.3)', fill='tonexty', showlegend=False
        )
    ])
fig.update_layout(
    title="Real value vs prediction in test data",
    xaxis_title= "Date time",
    yaxis_title= "OT",
    width=800,
    height=400,
    margin=dict(l=10, r=10, t=35, b=10),
    hovermode="x",
    legend=dict(orientation="h", yanchor="top", y=1.1, xanchor="left", x=0.001)
)
fig.show()

Estimation of multiple intervals

The backtesting_forecaster function not only allows to estimate a single prediction interval, but also to estimate multiple quantiles (percentiles) from which multiple prediction intervals can be constructed. This is useful to evaluate the quality of the prediction intervals for a range of probabilities. Furthermore, it has almost no additional computational cost compared to estimating a single interval.

Next, several percentiles are predicted and from these, prediction intervals are created for different nominal coverage levels - 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90% and 95% - and their actual coverage is assessed.

quantiles = [2.5, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 97.5]
intervals = [[2.5, 97.5], [5, 95], [10, 90], [15, 85], [20, 80], [30, 70], [35, 65], [40, 60], [45, 55]]
observed_coverages = []
observed_areas = []

metric, predictions = backtesting_forecaster(
                          forecaster              = forecaster,
                          y                       = data['OT'],
                          exog                    = data[selected_exog],
                          cv                      = cv,
                          metric                  = 'mean_absolute_error',
                          interval                = quantiles,
                          n_boot                  = 150,
                          interval_method         = 'bootstrapping',
                          use_in_sample_residuals = False,  # Use out-sample residuals
                          use_binned_residuals    = True,   # Intervals conditioned to the predicted values
                      )
predictions.head()

  0%|          | 0/84 [00:00<?, ?it/s]

	pred	p_2.5	p_5	p_10	p_15	p_20	p_25	p_30	p_35	p_40	...	p_55	p_60	p_65	p_70	p_75	p_80	p_85	p_90	p_95	p_97.5
2018-04-04 00:00:00	32.748341	31.922109	32.053411	32.289988	32.337571	32.490408	32.551956	32.600155	32.743772	32.842704	...	33.016446	33.049557	33.107951	33.172141	33.212129	33.288902	33.323562	33.384481	33.482723	33.717116
2018-04-04 01:00:00	31.947731	30.421991	30.969322	31.186928	31.302311	31.665845	31.811999	31.950245	32.050338	32.124354	...	32.407196	32.524998	32.643774	32.748498	32.892010	32.960220	33.040611	33.132068	33.368347	33.726517
2018-04-04 02:00:00	31.052885	28.930792	29.592743	30.214739	30.354561	30.649688	30.910788	31.054912	31.251281	31.386422	...	31.860174	32.005682	32.082505	32.271244	32.453232	32.584171	32.770079	33.000788	33.416257	34.215743
2018-04-04 03:00:00	30.108430	27.136373	27.518602	29.027782	29.386043	29.762199	29.972688	30.335793	30.611191	30.772863	...	31.240968	31.489245	31.663091	31.794914	32.007040	32.219158	32.405931	32.823787	33.422536	35.872185
2018-04-04 04:00:00	29.229240	25.818478	26.956306	27.894726	28.732302	28.886822	29.192374	29.471908	29.727138	30.040276	...	30.739496	30.878859	31.072958	31.173531	31.382835	31.583442	31.869380	32.262539	32.897774	35.302005

5 rows × 22 columns

# Calculate coverage and area for each interval
# ==============================================================================
for interval in intervals:
    observed_coverage = calculate_coverage(
                            y_true      = data.loc[end_validation:, 'OT'],
                            lower_bound = predictions[f"p_{interval[0]}"], 
                            upper_bound = predictions[f"p_{interval[1]}"]
                        )
    observed_area = (predictions[f"p_{interval[1]}"] - predictions[f"p_{interval[0]}"]).sum()
    observed_coverages.append(100 * observed_coverage)
    observed_areas.append(observed_area)

results = pd.DataFrame({
              'Interval': intervals,
              'Nominal coverage': [interval[1] - interval[0] for interval in intervals],
              'Observed coverage': observed_coverages,
              'Area': observed_areas
          })
results.round(1)

	Interval	Nominal coverage	Observed coverage	Area
0	[2.5, 97.5]	95.0	93.9	33464.8
1	[5, 95]	90.0	90.1	27347.4
2	[10, 90]	80.0	81.6	20843.6
3	[15, 85]	70.0	73.9	16713.4
4	[20, 80]	60.0	64.6	13495.6
5	[30, 70]	40.0	44.6	8339.2
6	[35, 65]	30.0	33.6	6079.6
7	[40, 60]	20.0	21.8	3962.8
8	[45, 55]	10.0	11.3	1968.4

The empirical coverage of all intervals is very close to the expected coverage (nominal coverage). This indicates that the intervals are well-calibrated and that the model is capable of estimating the uncertainty associated with the predictions.

Session information

import session_info
session_info.show(html=False)

-----
feature_engine      1.8.3
matplotlib          3.9.4
numpy               2.1.3
optuna              4.2.1
pandas              2.2.3
plotly              6.0.0
session_info        1.0.0
skforecast          0.15.0
sklearn             1.6.1
statsmodels         0.14.4
-----
IPython             8.32.0
jupyter_client      8.6.3
jupyter_core        5.7.2
-----
Python 3.12.9 | packaged by Anaconda, Inc. | (main, Feb  6 2025, 18:49:16) [MSC v.1929 64 bit (AMD64)]
Windows-11-10.0.26100-SP0
-----
Session information updated at 2025-03-13 12:11

Citation

How to cite this document

If you use this document or any part of it, please acknowledge the source, thank you!

Probabilistic forecasting: prediction intervals for multi-step time series forecasting by Joaquín Amat Rodrigo and Javier Escobar Ortiz, available under Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0 DEED) at https://cienciadedatos.net/documentos/py60-probabilistic-forecasting-prediction-intervals-multi-step-forecasting.html

How to cite skforecast

If you use skforecast for a publication, we would appreciate if you cite the published software.

Zenodo:

Amat Rodrigo, Joaquin, & Escobar Ortiz, Javier. (2024). skforecast (v0.15.0). Zenodo. https://doi.org/10.5281/zenodo.8382788

APA:

Amat Rodrigo, J., & Escobar Ortiz, J. (2024). skforecast (Version 0.15.0) [Computer software]. https://doi.org/10.5281/zenodo.8382788

BibTeX:

@software{skforecast, author = {Amat Rodrigo, Joaquin and Escobar Ortiz, Javier}, title = {skforecast}, version = {0.15.0}, month = {03}, year = {2025}, license = {BSD-3-Clause}, url = {https://skforecast.org/}, doi = {10.5281/zenodo.8382788} }

Did you like the article? Your support is important

Your contribution will help me to continue generating free educational content. Many thanks! 😊

This work by Joaquín Amat Rodrigo and Javier Escobar Ortiz is licensed under a Attribution-NonCommercial-ShareAlike 4.0 International.

Allowed:

Share: copy and redistribute the material in any medium or format.
Adapt: remix, transform, and build upon the material.

Under the following terms:

Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
NonCommercial: You may not use the material for commercial purposes.
ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

Probabilistic forecasting: prediction intervals for multi-step time series forecasting

Joaquín Amat Rodrigo, Javier Escobar Ortiz

November, 2024 (last update March, 2025)

Introduction

Libraries

Data

Feature engineering

Forecasting intervals using bootstrapped residuals

Data partition

Forecaster

Feature selection

Hiperparameters search

Forecasting with prediction intervals

Estimation of multiple intervals

Session information

Citation