If you like Skforecast, help us by giving it a star on GitHub! ⭐️

Probabilistic forecasting: prediction intervals for multi-step forecasting

Joaquín Amat Rodrigo, Javier Escobar Ortiz
November 2024

Introduction

Most forecasting models try to predict the most likely value of a time series, a technique known as point forecasting. Although useful, it provides no information about the uncertainty associated with the predictions.

Probabilistic forecasting, unlike point forecasting, is a family of techniques that predict the expected distribution rather than a single value. This approach provides richer information because it makes it possible to estimate prediction intervals, that is, ranges within which the actual value is likely to fall. More formally, a prediction interval defines the range expected to contain the true value of the response variable with a given probability.
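In formal notation, a prediction interval with lower bound $l_t$ and upper bound $u_t$ for the value $y_t$ of the series at time $t$ satisfies

$$P(l_t \leq y_t \leq u_t) = 1 - \alpha,$$

where $\alpha$ is the significance level. For example, an 80 % interval ($\alpha = 0.2$) is expected to contain the true value 80 % of the time; in the bootstrapping approach used later in this document, its bounds are estimated as the 10th and 90th percentiles of the simulated prediction distribution.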

Estimating prediction intervals in forecasting is challenging, since many well-established methods for regression and single-step forecasting (where only the next value of the series is predicted) do not carry over directly to multi-step forecasting (where several future values are predicted). In addition, a balance must be struck between two key metrics: coverage and interval width. In practice, the goal is to reach a given coverage level (for example, 80 %) while keeping the prediction intervals as narrow as possible and the model's prediction error as low as possible. A small sketch of these two metrics follows.
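The following self-contained sketch (toy simulated data, not part of the case study) computes the two quantities used throughout this document: the empirical coverage, i.e., the fraction of true values falling inside the interval, and the interval width.

# Coverage vs. width trade-off (illustrative sketch with simulated data)
# ==============================================================================
import numpy as np

rng = np.random.default_rng(123)
y_true = rng.normal(loc=0, scale=1, size=1000)

# Symmetric interval targeting 80 % coverage: 10th and 90th percentiles
lower, upper = np.percentile(y_true, [10, 90])
coverage = np.mean((y_true >= lower) & (y_true <= upper))
print(f"Empirical coverage: {coverage:.1%}, interval width: {upper - lower:.2f}")

Widening an interval always increases its coverage, so coverage alone is not enough to compare models: two models with the same coverage are not equally useful if one achieves it with much wider intervals.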

This document is inspired by Carl McBride Ellis (2024), Probabilistic Forecasting I: Temperature, a Kaggle competition. In the competition, predictions are made one step ahead (single-step), meaning that each prediction is generated for the next time step based on known previous values. In this adaptation, however, a multi-step approach is used, in which several future time steps are predicted. This scenario is more complex but often necessary in real-world applications.

The task is to estimate prediction intervals for the target variable, named OT, for the next 28 days. Each day, the following 24 hours are predicted, i.e., 24 values are forecast per day. This process is repeated daily for 28 days, resulting in a total of 672 predictions (28 × 24) by the end of the period. Symmetric prediction intervals with nominal coverage levels of 10 %, 20 %, 30 %, 40 %, 60 %, 70 %, 80 %, 90 % and 95 % are estimated, and their empirical coverage is evaluated.
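A symmetric interval with nominal coverage c corresponds to the percentile pair [(100 − c)/2, (100 + c)/2] of the prediction distribution. As a quick reference, this snippet computes the pairs used later in the document:

# Mapping from nominal coverage to symmetric percentile pairs
# ==============================================================================
coverages = [10, 20, 30, 40, 60, 70, 80, 90, 95]
intervals = [[(100 - c) / 2, (100 + c) / 2] for c in coverages]
print(intervals)  # [[45.0, 55.0], [40.0, 60.0], ..., [5.0, 95.0], [2.5, 97.5]]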

Libraries

The libraries used in this notebook are:

In [1]:
# Data processing
# ==============================================================================
import numpy as np
import pandas as pd

# Plots
# ==============================================================================
import matplotlib.pyplot as plt
from skforecast.plot import set_dark_theme
from skforecast.plot import plot_prediction_intervals
from statsmodels.graphics.tsaplots import plot_pacf
from statsmodels.tsa.stattools import pacf
from skforecast.plot import plot_residuals
import plotly.graph_objects as go
import plotly.io as pio
import plotly.offline as poff
pio.templates.default = "seaborn"
pio.renderers.default = 'notebook' 
poff.init_notebook_mode(connected=True)

# Modeling and forecasting
# ==============================================================================
import skforecast
import sklearn
from sklearn.linear_model import Ridge
from sklearn.feature_selection import RFECV
from sklearn.pipeline import make_pipeline
from feature_engine.datetime import DatetimeFeatures
from feature_engine.creation import CyclicalFeatures
from feature_engine.timeseries.forecasting import LagFeatures
from feature_engine.timeseries.forecasting import WindowFeatures
from skforecast.recursive import ForecasterRecursive
from skforecast.model_selection import TimeSeriesFold
from skforecast.model_selection import bayesian_search_forecaster
from skforecast.model_selection import backtesting_forecaster
from skforecast.feature_selection import select_features
from skforecast.preprocessing import RollingFeatures

# Warnings
# ==============================================================================
import warnings
warnings.filterwarnings('once')

print('skforecast version:', skforecast.__version__)
print('sklearn version:', sklearn.__version__)
skforecast version: 0.14.0
sklearn version: 1.5.1

Data

Data from an electricity transformer station were collected between July 2016 and July 2018 (2 years × 365 days × 24 hours × 4 intervals per hour = 70,080 values). For each date, 7 variables are available: the target variable Oil Temperature (OT) and 6 external power load features: High UseFul Load (HUFL), High UseLess Load (HULL), Middle UseFul Load (MUFL), Middle UseLess Load (MULL), Low UseFul Load (LUFL) and Low UseLess Load (LULL).

Zhou, Haoyi & Zhang, Shanghang & Peng, Jieqi & Zhang, Shuai & Li, Jianxin & Xiong, Hui & Zhang, Wancai. (2020). Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. 10.48550/arXiv.2012.07436. https://github.com/zhouhaoyi/ETDataset

In [2]:
# Data download
# ==============================================================================
data = pd.read_csv("https://raw.githubusercontent.com/skforecast/skforecast-datasets/main/data/ETTm2.csv")
data['date'] = pd.to_datetime(data['date'])
data = data.set_index('date')
data = data.asfreq('15min')
data = data.resample(rule="1h", closed="left", label="right").mean()
data
Out[2]:
HUFL HULL MUFL MULL LUFL LULL OT
date
2016-07-01 01:00:00 38.784501 10.88975 34.753500 8.55100 4.12575 1.26050 37.838250
2016-07-01 02:00:00 36.041249 9.44475 32.696001 7.13700 3.59025 0.62900 36.849250
2016-07-01 03:00:00 38.240000 11.41350 35.343501 9.10725 3.06000 0.31175 35.915750
2016-07-01 04:00:00 37.800250 11.45525 34.881000 9.28850 3.04400 0.60750 32.839375
2016-07-01 05:00:00 36.501750 10.49200 33.708250 8.65150 2.64400 0.00000 31.466125
... ... ... ... ... ... ... ...
2018-06-26 16:00:00 39.517250 11.45500 50.616000 12.08950 -10.64275 -1.28475 47.744249
2018-06-26 17:00:00 39.140499 11.32975 49.865250 11.84825 -10.33100 -1.35400 48.183498
2018-06-26 18:00:00 42.428501 12.85850 53.591250 13.33575 -10.95475 -1.41800 47.853999
2018-06-26 19:00:00 43.036000 12.87925 54.241250 13.33575 -10.91200 -1.41800 46.535750
2018-06-26 20:00:00 40.543500 11.16175 51.721500 11.76800 -11.22400 -1.41800 45.601625

17420 rows × 7 columns

In [3]:
# Plot target series
# ==============================================================================
set_dark_theme()
plt.rcParams['lines.linewidth'] = 0.5
fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(data['OT'], label='OT');
In [4]:
# Plot exogenous series
# ==============================================================================
cols_to_plot = [
    'HUFL', 'HULL', 'MUFL', 'MULL', 'LUFL', 'LULL'
]
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
colors = colors[1:] + ["violet", "magenta"]    
fig, axs = plt.subplots(6, 1, figsize=(8, 8), sharex=True)
for i, col in enumerate(cols_to_plot):
    axs[i].plot(data[col], label=col, color=colors[i])
    axs[i].legend(loc='lower left', fontsize=8)
    axs[i].tick_params(axis='both', labelsize=8)
In [5]:
# Partial autocorrelation
# ==============================================================================
n_lags = 24 * 7
cols_to_plot = [
    'OT', 'HUFL', 'HULL',
    'MUFL', 'MULL',
    'LUFL', 'LULL'
]
pacf_df = []
fig, axs = plt.subplots(4, 2, figsize=(8, 6))
axs = axs.ravel()
for i, col in enumerate(cols_to_plot):
    pacf_values = pacf(data[col], nlags=n_lags)
    pacf_values = pd.DataFrame({
        'lag': range(1, len(pacf_values)),
        'value': pacf_values[1:],
        'variable': col
    })
    pacf_df.append(pacf_values)
    plot_pacf(data[col], lags=n_lags, ax=axs[i])
    axs[i].set_title(col, fontsize=10)
    axs[i].set_ylim(-0.5, 1.1)
plt.tight_layout()
In [6]:
# Top n lags with highest partial autocorrelation per variable
# ==============================================================================
n = 10
top_lags = set()
for pacf_values in pacf_df:
    variable = pacf_values['variable'].iloc[0]
    lags = pacf_values.nlargest(n, 'value')['lag'].tolist()
    top_lags.update(lags)
    print(f"{variable}: {lags}")
top_lags = list(top_lags)
print(f"\nAll lags: {top_lags}")
OT: [1, 15, 16, 14, 3, 6, 17, 13, 4, 5]
HUFL: [1, 11, 15, 19, 14, 17, 18, 12, 23, 9]
HULL: [1, 3, 2, 15, 12, 23, 11, 14, 24, 4]
MUFL: [1, 15, 11, 14, 17, 19, 12, 18, 9, 16]
MULL: [1, 3, 2, 24, 11, 15, 23, 12, 17, 4]
LUFL: [1, 2, 20, 18, 9, 17, 10, 16, 19, 21]
LULL: [1, 3, 14, 4, 18, 11, 9, 42, 19, 12]

All lags: [1, 2, 3, 4, 5, 6, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 23, 24, 42]

The target series and the exogenous series exhibit similar dynamics, with several lags that can be used as predictors.

Feature engineering

Additional exogenous features are created based on:

  • Lagged values of the exogenous series.

  • Rolling statistics of the exogenous series.

  • Calendar features.

Warning

This document assumes that the additional series are known into the future; that is, their values are available at prediction time, which allows them to be used as exogenous variables in the forecasting model.

In [7]:
# Calendar features
# ==============================================================================
features_to_extract = [
    'year',
    'month',
    'week',
    'day_of_week',
    'hour'
]
calendar_transformer = DatetimeFeatures(
    variables           = 'index',
    features_to_extract = features_to_extract,
    drop_original       = False,
)

# Lags of the exogenous variables
# ==============================================================================
lag_transformer = LagFeatures(
    variables = ["HUFL", "MUFL", "MULL", "HULL", "LUFL", "LULL"],
    periods   = top_lags,
)

# Rolling statistics of the exogenous variables
# ==============================================================================
wf_transformer = WindowFeatures(
    variables   = ["HUFL", "MUFL", "MULL", "HULL", "LUFL", "LULL"],
    window      = ["1D", "7D"],
    functions   = ["mean", "max", "min"],
    freq        = "1h",
)

# Cyclical encoding of calendar features
# ==============================================================================
features_to_encode = [
    "month",
    "week",
    "day_of_week",
    "hour",
]
max_values = {
    "month": 12,
    "week": 52,
    "day_of_week": 7,
    "hour": 24,
}
cyclical_encoder = CyclicalFeatures(
                        variables     = features_to_encode,
                        max_values    = max_values,
                        drop_original = True
                   )

exog_transformer = make_pipeline(
                        calendar_transformer,
                        lag_transformer,
                        wf_transformer,
                        cyclical_encoder
                   )
display(exog_transformer)

data = exog_transformer.fit_transform(data)
# Remove missing values generated by the lag features
data = data.dropna()
exog_features = data.columns.difference(['OT']).tolist()
display(data.head(3))
Pipeline(steps=[('datetimefeatures',
                 DatetimeFeatures(drop_original=False,
                                  features_to_extract=['year', 'month', 'week',
                                                       'day_of_week', 'hour'],
                                  variables='index')),
                ('lagfeatures',
                 LagFeatures(periods=[1, 2, 3, 4, 5, 6, 9, 10, 11, 12, 13, 14,
                                      15, 16, 17, 18, 19, 20, 21, 23, 24, 42],
                             variables=['HUFL', 'MUFL', 'MULL', 'HULL', 'LUFL',
                                        'LULL'])),
                ('windowfeatures',
                 WindowFeatures(freq='1h', functions=['mean', 'max', 'min'],
                                variables=['HUFL', 'MUFL', 'MULL', 'HULL',
                                           'LUFL', 'LULL'],
                                window=['1D', '7D'])),
                ('cyclicalfeatures',
                 CyclicalFeatures(drop_original=True,
                                  max_values={'day_of_week': 7, 'hour': 24,
                                              'month': 12, 'week': 52},
                                  variables=['month', 'week', 'day_of_week',
                                             'hour']))])
HUFL HULL MUFL MULL LUFL LULL OT year HUFL_lag_1 MUFL_lag_1 ... LULL_window_7D_max LULL_window_7D_min month_sin month_cos week_sin week_cos day_of_week_sin day_of_week_cos hour_sin hour_cos
date
2016-07-02 19:00:00 33.6330 9.08900 29.874751 6.95600 3.753 0.61025 28.719500 2016 32.5440 29.017001 ... 1.2605 0.0 -0.5 -0.866025 1.224647e-16 -1.0 -0.974928 -0.222521 -0.965926 0.258819
2016-07-02 20:00:00 31.7065 7.72750 27.884500 5.63600 3.753 0.67425 29.103875 2016 33.6330 29.874751 ... 1.2605 0.0 -0.5 -0.866025 1.224647e-16 -1.0 -0.974928 -0.222521 -0.866025 0.500000
2016-07-02 21:00:00 31.8110 7.81125 27.362000 5.18025 4.435 1.42075 29.598500 2016 31.7065 27.884500 ... 1.2605 0.0 -0.5 -0.866025 1.224647e-16 -1.0 -0.974928 -0.222521 -0.707107 0.707107

3 rows × 184 columns

Prediction intervals with bootstrapped residuals

Data partitioning

In [8]:
# Split train-validation-test
# ==============================================================================
end_train = '2017-10-01 23:59:00'
end_validation = '2018-04-03 23:59:00'
data_train = data.loc[: end_train, :]
data_val   = data.loc[end_train:end_validation, :]
data_test  = data.loc[end_validation:, :]

print(f"Dates train      : {data_train.index.min()} --- {data_train.index.max()}  (n={len(data_train)})")
print(f"Dates validacion : {data_val.index.min()} --- {data_val.index.max()}  (n={len(data_val)})")
print(f"Dates test       : {data_test.index.min()} --- {data_test.index.max()}  (n={len(data_test)})")
Dates train      : 2016-07-02 19:00:00 --- 2017-10-01 23:00:00  (n=10949)
Dates validation : 2017-10-02 00:00:00 --- 2018-04-03 23:00:00  (n=4416)
Dates test       : 2018-04-04 00:00:00 --- 2018-06-26 20:00:00  (n=2013)
In [9]:
# Plot partitions
# ==============================================================================
fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(data_train['OT'], label='Train')
ax.plot(data_val['OT'], label='Validation')
ax.plot(data_test['OT'], label='Test')
ax.set_title('OT')
ax.legend();
In [10]:
# Plot partitions after differencing
# ==============================================================================
fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(data_train['OT'].diff(1), label='Train')
ax.plot(data_val['OT'].diff(1), label='Validation')
ax.plot(data_test['OT'].diff(1), label='Test')
ax.set_title('Differenced OT')
ax.legend();

Forecaster

In [11]:
# Create forecaster
# ==============================================================================
window_features = RollingFeatures(stats=['mean', 'min', 'max'], window_sizes=24)
forecaster = ForecasterRecursive(
                 regressor       = Ridge(random_state=15926),
                 lags            = top_lags,
                 differentiation = 1,
                 window_features = window_features,
                 binner_kwargs   = {'n_bins': 10}
             )
forecaster
Out[11]:

ForecasterRecursive

General Information
  • Regressor: Ridge
  • Lags: [ 1 2 3 4 5 6 9 10 11 12 13 14 15 16 17 18 19 20 21 23 24 42]
  • Window features: ['roll_mean_24', 'roll_min_24', 'roll_max_24']
  • Window size: 43
  • Exogenous included: False
  • Weight function included: False
  • Differentiation order: 1
  • Creation date: 2024-12-01 19:16:52
  • Last fit date: None
  • Skforecast version: 0.14.0
  • Python version: 3.12.5
  • Forecaster id: None
Exogenous Variables
    None
Data Transformations
  • Transformer for y: None
  • Transformer for exog: None
Training Information
  • Training range: Not fitted
  • Training index type: Not fitted
  • Training index frequency: Not fitted
Regressor Parameters
    {'alpha': 1.0, 'copy_X': True, 'fit_intercept': True, 'max_iter': None, 'positive': False, 'random_state': 15926, 'solver': 'auto', 'tol': 0.0001}
Fit Kwargs
    {}


Feature selection

In [12]:
# Feature selection (autoregressive and exogenous) with scikit-learn RFECV
# ==============================================================================
regressor = Ridge(random_state=15926)

selector = RFECV(
    estimator=regressor, step=1, cv=3, min_features_to_select=25, n_jobs=-1
)

selected_lags, selected_window_features, selected_exog = select_features(
    forecaster      = forecaster,
    selector        = selector,
    y               = data.loc[:end_train, 'OT'],
    exog            = data.loc[:end_train, exog_features],
    select_only     = None,
    force_inclusion = None,
    subsample       = 0.5,
    random_state    = 123,
    verbose         = True,
)
Recursive feature elimination (RFECV)
-------------------------------------
Total number of records available: 10906
Total number of records used for feature selection: 5453
Number of features available: 208
    Lags            (n=22)
    Window features (n=3)
    Exog            (n=183)
Number of features selected: 107
    Lags            (n=14) : [1, 2, 3, 4, 5, 6, 9, 12, 15, 17, 20, 23, 24, 42]
    Window features (n=2) : ['roll_min_24', 'roll_max_24']
    Exog            (n=91) : ['HUFL', 'HUFL_lag_1', 'HUFL_lag_12', 'HUFL_lag_13', 'HUFL_lag_15', 'HUFL_lag_19', 'HUFL_lag_2', 'HUFL_lag_20', 'HUFL_lag_23', 'HUFL_lag_4', 'HUFL_lag_5', 'HUFL_lag_9', 'HUFL_window_1D_mean', 'HULL', 'HULL_lag_1', 'HULL_lag_12', 'HULL_lag_14', 'HULL_lag_15', 'HULL_lag_2', 'HULL_lag_20', 'HULL_lag_21', 'HULL_lag_23', 'HULL_lag_3', 'HULL_lag_4', 'HULL_window_1D_mean', 'LUFL', 'LUFL_lag_1', 'LUFL_lag_10', 'LUFL_lag_15', 'LUFL_lag_19', 'LUFL_lag_2', 'LUFL_lag_20', 'LUFL_lag_23', 'LUFL_lag_3', 'LUFL_lag_4', 'LUFL_lag_5', 'LUFL_window_1D_mean', 'LULL', 'LULL_lag_1', 'LULL_lag_12', 'LULL_lag_13', 'LULL_lag_14', 'LULL_lag_18', 'LULL_lag_19', 'LULL_lag_20', 'LULL_lag_21', 'LULL_lag_23', 'LULL_lag_24', 'LULL_lag_4', 'LULL_lag_5', 'LULL_lag_6', 'LULL_window_1D_max', 'LULL_window_1D_min', 'MUFL', 'MUFL_lag_1', 'MUFL_lag_11', 'MUFL_lag_12', 'MUFL_lag_13', 'MUFL_lag_15', 'MUFL_lag_2', 'MUFL_lag_20', 'MUFL_lag_23', 'MUFL_lag_4', 'MUFL_lag_9', 'MUFL_window_1D_mean', 'MULL', 'MULL_lag_1', 'MULL_lag_11', 'MULL_lag_12', 'MULL_lag_13', 'MULL_lag_14', 'MULL_lag_15', 'MULL_lag_17', 'MULL_lag_19', 'MULL_lag_2', 'MULL_lag_20', 'MULL_lag_3', 'MULL_lag_4', 'MULL_lag_5', 'MULL_lag_9', 'MULL_window_1D_mean', 'MULL_window_1D_min', 'MULL_window_7D_mean', 'day_of_week_sin', 'hour_cos', 'hour_sin', 'month_cos', 'month_sin', 'week_cos', 'week_sin', 'year']
In [13]:
# Set the selected lags
# ==============================================================================
forecaster.set_lags(selected_lags)

Hyperparameter search

In [14]:
# Hyperparameter search
# ==============================================================================
cv = TimeSeriesFold(
        initial_train_size = len(data.loc[:end_train, :]),
        steps              = 24,  # all hours of next day
        differentiation    = 1,
     )

def search_space(trial):
    search_space  = {
        'alpha': trial.suggest_float('alpha', 1e-7, 1e+3, log=True)
    }
    return search_space

results_search, frozen_trial = bayesian_search_forecaster(
                                   forecaster    = forecaster,
                                   y             = data.loc[:end_validation, 'OT'],
                                   exog          = data.loc[:end_validation, selected_exog],
                                   search_space  = search_space,
                                   cv            = cv,
                                   metric        = 'mean_absolute_error',
                                   n_trials      = 20,
                                   random_state  = 123,
                                   return_best   = True,
                                   n_jobs        = 'auto',
                                   verbose       = False,
                                   show_progress = True
                               )
best_params = results_search['params'].iloc[0]
best_lags = results_search['lags'].iloc[0]
results_search.head()
`Forecaster` refitted using the best-found lags and parameters, and the whole data set: 
  Lags: [ 1  2  3  4  5  6  9 12 15 17 20 23 24 42] 
  Parameters: {'alpha': 1.1075004417904626e-07}
  Backtesting metric: 2.3811573340717134
Out[14]:
lags params mean_absolute_error alpha
0 [1, 2, 3, 4, 5, 6, 9, 12, 15, 17, 20, 23, 24, 42] {'alpha': 1.1075004417904626e-07} 2.381157 1.107500e-07
1 [1, 2, 3, 4, 5, 6, 9, 12, 15, 17, 20, 23, 24, 42] {'alpha': 1.1602023537361842e-07} 2.381157 1.160202e-07
2 [1, 2, 3, 4, 5, 6, 9, 12, 15, 17, 20, 23, 24, 42] {'alpha': 1.3232531712227085e-07} 2.381157 1.323253e-07
3 [1, 2, 3, 4, 5, 6, 9, 12, 15, 17, 20, 23, 24, 42] {'alpha': 1.8131052405964636e-07} 2.381157 1.813105e-07
4 [1, 2, 3, 4, 5, 6, 9, 12, 15, 17, 20, 23, 24, 42] {'alpha': 6.876338479023381e-07} 2.381157 6.876338e-07

Predicting intervals

After selecting the best model, including its predictors and hyperparameters, a backtesting process is run on the validation set. This process generates out-of-sample residuals, which are then used to estimate the prediction intervals for the test set.

In [15]:
# Backtesting on validation data to obtain out-of-sample residuals
# ==============================================================================
cv = TimeSeriesFold(
        initial_train_size = len(data.loc[:end_train, :]),
        steps              = 24,  # all hours of next day
        differentiation    = 1,
     )

metric_val, predictions_val = backtesting_forecaster(
                                forecaster    = forecaster,
                                y             = data.loc[:end_validation, 'OT'],
                                exog          = data.loc[:end_validation, selected_exog],
                                cv            = cv,
                                metric        = 'mean_absolute_error',
                                n_jobs        = 'auto',
                                verbose       = False,
                                show_progress = True
                              )
display(metric_val)
fig, ax = plt.subplots(figsize=(8, 3))
data.loc[end_train:end_validation, 'OT'].plot(ax=ax, label='Real value')
predictions_val['pred'].plot(ax=ax, label='prediction')
ax.set_title("Backtesting on validation data")
ax.legend();
mean_absolute_error
0 2.381157
In [16]:
# Distribution of the out-of-sample residuals
# ==============================================================================
residuals = data.loc[predictions_val.index, 'OT'] - predictions_val['pred']
print(pd.Series(np.where(residuals < 0, 'negative', 'positive')).value_counts())
plt.rcParams.update({'font.size': 8})
_ = plot_residuals(residuals=residuals, figsize=(7, 4))
positive    2457
negative    1959
Name: count, dtype: int64
In [17]:
# Store out-of-sample residuals in the forecaster
# ==============================================================================
forecaster.fit(y=data.loc[:end_train, 'OT'], exog=data.loc[:end_train, selected_exog])
forecaster.set_out_sample_residuals(
    y_true = data.loc[predictions_val.index, 'OT'], 
    y_pred = predictions_val['pred']
)

✎ Note

When new residuals are stored in the forecaster, they are binned according to the magnitude of the predictions they correspond to. Later, during the bootstrapping process, residuals are sampled from the corresponding bin, ensuring that the distribution of the sampled residuals is consistent with the predictions. This yields more accurate prediction intervals, since the residuals are better aligned with the predictions they correspond to, resulting in better coverage with narrower intervals. For more details on how residuals are used in interval estimation, see Probabilistic forecasting: prediction intervals and prediction distribution.
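The following is a simplified, self-contained sketch of this idea (simulated data; it illustrates the mechanism, not skforecast's actual internals): residuals are assigned to bins according to the magnitude of the prediction they belong to, and each new prediction bootstraps residuals only from its own bin.

# Simplified sketch of binned bootstrapping (illustrative, simulated data)
# ==============================================================================
import numpy as np

rng = np.random.default_rng(123)
preds = rng.uniform(0, 50, size=500)
resids = rng.normal(scale=1 + preds / 25)  # residual spread grows with the prediction
bin_edges = np.quantile(preds, np.linspace(0, 1, 11))  # 10 bins, as n_bins=10 above
bin_idx = np.clip(np.digitize(preds, bin_edges) - 1, 0, 9)

# Interval for a new prediction: sample residuals only from its own bin
new_pred = 40.0
k = int(np.clip(np.digitize(new_pred, bin_edges) - 1, 0, 9))
sampled = rng.choice(resids[bin_idx == k], size=150, replace=True)
lower, upper = new_pred + np.percentile(sampled, [10, 90])
print(f"80% interval around {new_pred}: [{lower:.2f}, {upper:.2f}]")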
In [18]:
# Intervals of the residual bins (conditioned on the predicted values) and number of stored residuals
# ==============================================================================
for k, v in forecaster.binner_intervals_.items():
    print(f"{k}: {v}  n={len(forecaster.out_sample_residuals_by_bin_[k])}")
0.0: (-3.8716384676533124, -1.1238482614863585)  n=417
1.0: (-1.1238482614863585, -0.7140340975203685)  n=611
2.0: (-0.7140340975203685, -0.5147665419614711)  n=510
3.0: (-0.5147665419614711, -0.3512020918319507)  n=459
4.0: (-0.3512020918319507, -0.1987613354056137)  n=374
5.0: (-0.1987613354056137, -0.013725466918245388)  n=274
6.0: (-0.013725466918245388, 0.29437106461036644)  n=328
7.0: (0.29437106461036644, 0.7696661155396924)  n=426
8.0: (0.7696661155396924, 1.479336474696538)  n=516
9.0: (1.479336474696538, 5.6094825256755385)  n=500
In [19]:
# Distribution of the out-of-sample residuals by bin
# ==============================================================================
out_sample_residuals_by_bin_df = pd.DataFrame(
    dict(
        [(k, pd.Series(v))
         for k, v in forecaster.out_sample_residuals_by_bin_.items()]
    )
)

fig, ax = plt.subplots(figsize=(6, 3))
out_sample_residuals_by_bin_df.boxplot(ax=ax)
ax.set_title("Distribution of residuals by bin", fontsize=12)
ax.set_xlabel("Bin", fontsize=10)
ax.set_ylabel("Residuals", fontsize=10)
plt.show();
In [20]:
# Backtesting with prediction intervals on test data using out-of-sample residuals
# ==============================================================================
cv = TimeSeriesFold(
        initial_train_size = len(data.loc[:end_validation, :]),
        steps              = 24,  # all hours of next day
        differentiation    = 1
     )

metric, predictions = backtesting_forecaster(
                          forecaster              = forecaster,
                          y                       = data['OT'],
                          exog                    = data[selected_exog],
                          cv                      = cv,
                          metric                  = 'mean_absolute_error',
                          interval                = [10, 90],  # 80% prediction interval
                          n_boot                  = 150,
                          use_in_sample_residuals = False,  # Use out-of-sample residuals
                          use_binned_residuals    = True,
                          n_jobs                  = 'auto',
                          verbose                 = False,
                          show_progress           = True
                     )
predictions
Out[20]:
pred lower_bound upper_bound
2018-04-04 00:00:00 32.748341 32.289988 33.384481
2018-04-04 01:00:00 31.947731 31.186928 33.132068
2018-04-04 02:00:00 31.052885 30.214739 33.000788
2018-04-04 03:00:00 30.108430 29.027782 32.823787
2018-04-04 04:00:00 29.229240 27.894726 32.262539
... ... ... ...
2018-06-26 16:00:00 49.968142 39.049217 56.065172
2018-06-26 17:00:00 49.287497 38.649894 56.572044
2018-06-26 18:00:00 48.319028 38.308019 56.107946
2018-06-26 19:00:00 47.041985 38.269496 55.357029
2018-06-26 20:00:00 45.751569 37.076566 54.628951

2013 rows × 3 columns

In [21]:
# Plot prediction intervals
# ==============================================================================
plt.rcParams['lines.linewidth'] = 1
fig, ax = plt.subplots(figsize=(9, 4))
plot_prediction_intervals(
    predictions     = predictions,
    y_true          = data_test,
    target_variable = "OT",
    initial_x_zoom  = None,
    title           = "Real value vs prediction in test data",
    xaxis_title     = "Date time",
    yaxis_title     = "OT",
    ax              = ax
)
fill_between_obj = ax.collections[0]
fill_between_obj.set_facecolor('white')
fill_between_obj.set_alpha(0.3)

# Coverage of the predicted interval
# ==============================================================================
def empirical_coverage(y, lower_bound, upper_bound):
    """
    Calculate the empirical coverage of a given interval
    """
    return np.mean(np.logical_and(y >= lower_bound, y <= upper_bound))

coverage = empirical_coverage(
    y = data.loc[end_validation:, 'OT'],
    lower_bound = predictions["lower_bound"], 
    upper_bound = predictions["upper_bound"]
)
print(f"Predicted interval coverage: {round(100 * coverage, 2)} %")

# Area of the interval
# ==============================================================================
area = (predictions["upper_bound"] - predictions["lower_bound"]).sum()
print(f"Area of the interval: {round(area, 2)}")

display(metric)
Predicted interval coverage: 81.57 %
Area of the interval: 20843.63
mean_absolute_error
0 2.871771
In [22]:
# Interactive plot
# ==============================================================================
fig = go.Figure([
        go.Scatter(name='Prediction', x=predictions.index, y=predictions['pred'], mode='lines'),
        go.Scatter(name='Real value', x=data_test.index, y=data_test['OT'], mode='lines'),
        go.Scatter(
            name='Upper Bound', x=predictions.index, y=predictions['upper_bound'],
            mode='lines', marker=dict(color="#444"), line=dict(width=0), showlegend=False
        ),
        go.Scatter(
            name='Lower Bound', x=predictions.index, y=predictions['lower_bound'],
            marker=dict(color="#444"), line=dict(width=0), mode='lines',
            fillcolor='rgba(68, 68, 68, 0.3)', fill='tonexty', showlegend=False
        )
    ])
fig.update_layout(
    title="Real value vs prediction in test data",
    xaxis_title= "Date time",
    yaxis_title= "OT",
    width=800,
    height=400,
    margin=dict(l=10, r=10, t=35, b=10),
    hovermode="x",
    legend=dict(orientation="h", yanchor="top", y=1.1, xanchor="left", x=0.001)
)
fig.show()

Estimating multiple intervals

Prediction intervals are estimated for several nominal coverage levels: 10 %, 20 %, 30 %, 40 %, 60 %, 70 %, 80 %, 90 % and 95 %, and their empirical coverage is evaluated.

In [23]:
# Prediction intervals for different nominal coverages
# ==============================================================================
nominal_coverages = [[2.5, 97.5], [5, 95], [10, 90], [15, 85], [20, 80], [30, 70], [35, 65], [40, 60], [45, 55]]
observed_coverages = []
observed_areas = []

for coverage in nominal_coverages:
    metric, predictions = backtesting_forecaster(
                            forecaster              = forecaster,
                            y                       = data['OT'],
                            exog                    = data[selected_exog],
                            cv                      = cv,
                            metric                  = 'mean_absolute_error',
                            interval                = coverage,  # Nominal coverage
                            n_boot                  = 150,
                            use_in_sample_residuals = False,  # Use out-of-sample residuals
                            use_binned_residuals    = True,
                            n_jobs                  = 'auto',
                            verbose                 = False,
                            show_progress           = True
                         )
    observed_coverage = empirical_coverage(
        y = data.loc[end_validation:, 'OT'],
        lower_bound = predictions["lower_bound"], 
        upper_bound = predictions["upper_bound"]
    )
    observed_area = (predictions["upper_bound"] - predictions["lower_bound"]).sum()
    observed_coverages.append(100 * observed_coverage)
    observed_areas.append(observed_area)

results = pd.DataFrame({
            'Interval': nominal_coverages,
            'Nominal coverage': [coverage[1] - coverage[0] for coverage in nominal_coverages],
            'Observed coverage': observed_coverages,
            'Area': observed_areas
         })
results.round(1)
Out[23]:
Interval Nominal coverage Observed coverage Area
0 [2.5, 97.5] 95.0 93.9 33464.8
1 [5, 95] 90.0 90.1 27347.4
2 [10, 90] 80.0 81.6 20843.6
3 [15, 85] 70.0 73.9 16713.4
4 [20, 80] 60.0 64.6 13495.6
5 [30, 70] 40.0 44.6 8339.2
6 [35, 65] 30.0 33.6 6079.6
7 [40, 60] 20.0 21.8 3962.8
8 [45, 55] 10.0 11.3 1968.4

The empirical coverage of all the intervals is very close to the expected nominal coverage. This indicates that the intervals are well calibrated and that the model is able to estimate the uncertainty associated with its predictions.
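As a final visual check, a short sketch (reusing the `results` DataFrame and matplotlib from the cells above) plots observed against nominal coverage; a well-calibrated model should fall close to the diagonal:

# Calibration plot: observed vs. nominal coverage (sketch)
# ==============================================================================
fig, ax = plt.subplots(figsize=(5, 4))
ax.plot(results['Nominal coverage'], results['Observed coverage'], 'o-', label='Model')
ax.plot([0, 100], [0, 100], '--', color='gray', label='Perfect calibration')
ax.set_xlabel('Nominal coverage (%)')
ax.set_ylabel('Observed coverage (%)')
ax.set_title('Interval calibration')
ax.legend();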

Session information

In [24]:
import session_info
session_info.show(html=False)
-----
feature_engine      1.8.1
matplotlib          3.9.2
numpy               2.0.0
optuna              3.6.1
pandas              2.2.3
plotly              5.24.1
session_info        1.0.0
skforecast          0.14.0
sklearn             1.5.1
statsmodels         0.14.3
-----
IPython             8.27.0
jupyter_client      8.6.3
jupyter_core        5.7.2
notebook            6.4.12
-----
Python 3.12.5 | packaged by Anaconda, Inc. | (main, Sep 12 2024, 18:27:27) [GCC 11.2.0]
Linux-5.15.0-1072-aws-x86_64-with-glibc2.31
-----
Session information updated at 2024-12-01 19:26

How to cite

How to cite this document?

If you use this document or any part of it, we would appreciate a citation. Thank you very much!

Probabilistic forecasting: prediction intervals for multi-step forecasting by Joaquín Amat Rodrigo and Javier Escobar Ortiz, available under an Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0 DEED) license at https://www.cienciadedatos.net/documentos/py60-forecasting-probabilistico-intervalos-prediccion-forecasting-multi-step.html

How to cite skforecast?

If you use skforecast in your research or publication, we would appreciate a citation. Thank you very much!

Zenodo:

Amat Rodrigo, Joaquin, & Escobar Ortiz, Javier. (2024). skforecast (v0.14.0). Zenodo. https://doi.org/10.5281/zenodo.8382788

APA:

Amat Rodrigo, J., & Escobar Ortiz, J. (2024). skforecast (Version 0.14.0) [Computer software]. https://doi.org/10.5281/zenodo.8382788

BibTeX:

@software{skforecast, author = {Amat Rodrigo, Joaquin and Escobar Ortiz, Javier}, title = {skforecast}, version = {0.14.0}, month = {11}, year = {2024}, license = {BSD-3-Clause}, url = {https://skforecast.org/}, doi = {10.5281/zenodo.8382788} }

Did you like the article? Your support matters

Maintaining a website involves considerable costs; your contribution helps me keep producing free educational content. Thank you very much! 😊

Creative Commons Licence

This document, created by Joaquín Amat Rodrigo and Javier Escobar Ortiz, is licensed under an Attribution-NonCommercial-ShareAlike 4.0 International license.

You are free to:

  • Share: copy and redistribute the material in any medium or format.

  • Adapt: remix, transform, and build upon the material.

Under the following terms:

  • Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

  • NonCommercial: You may not use the material for commercial purposes.

  • ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.