• Introduction
  • Libraries
  • Data
  • Feature engineering
  • Prediction intervals with bootstrapped residuals
    • Data partition
    • Forecaster
    • Feature selection
    • Hyperparameter search
  • Interval prediction
  • Estimating multiple intervals
  • Session information
  • Citation instructions


More about forecasting at: cienciadedatos.net


Introduction

Most forecasting models aim to predict the most likely value of a time series, a technique known as point-forecasting. Although useful, it provides no information about the uncertainty associated with the predictions.

Probabilistic forecasting, in contrast to point-forecasting, is a family of techniques that predict the expected distribution rather than a single value. This approach provides richer information, since it allows the estimation of prediction intervals, i.e., ranges within which the true value is likely to lie. More formally, a prediction interval defines the interval within which the true value of the response variable is expected to be found with a given probability.
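
Formally, for a nominal coverage level 1 − α, a prediction interval [l̂, û] for the value h steps ahead satisfies P(l̂ ≤ y(t+h) ≤ û) = 1 − α. For example, an 80 % interval (α = 0.2) is expected to contain the true value in 80 % of cases.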

Estimating prediction intervals in forecasting is challenging, since many well-established methods for regression and single-step forecasting (where only the next value of the series is predicted) do not carry over directly to multi-step forecasting (where several future values are predicted). In addition, a balance must be struck between two key properties: coverage and interval width. In practice, one tries to achieve a given coverage level (for example, 80 %) while keeping the prediction intervals as narrow as possible and the model's forecast error as low as possible.
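
To make this trade-off concrete, the following minimal sketch computes the empirical coverage and the mean width of an interval from toy arrays. It is for illustration only (the helper coverage_and_width and the toy values are not part of skforecast; later in this document the skforecast function calculate_coverage is used for this purpose):

# Empirical coverage and mean width of an interval (illustrative sketch)
# ==============================================================================
import numpy as np

def coverage_and_width(y_true, lower, upper):
    """Fraction of true values inside [lower, upper] and mean interval width."""
    y_true, lower, upper = map(np.asarray, (y_true, lower, upper))
    inside = (y_true >= lower) & (y_true <= upper)
    return inside.mean(), (upper - lower).mean()

cov, width = coverage_and_width(
    y_true = [10, 12, 11, 15, 9],
    lower  = [9, 10, 10, 11, 8],
    upper  = [12, 13, 12, 14, 11]
)
print(f"coverage = {cov:.0%}, mean width = {width:.1f}")
coverage = 80%, mean width = 2.8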

This document is inspired by Carl McBride Ellis (2024), Probabilistic Forecasting I: Temperature, a Kaggle competition. In the competition, predictions are made one step ahead (single-step), meaning that each prediction is generated for the next time step based on known previous values. In this adaptation, however, a multi-step approach is used, in which several future time steps are predicted. This scenario is more complex but is often required in real-world applications.

The task is to estimate prediction intervals for the target variable, named OT, for the next 28 days. Each day, the following 24 hours are predicted, meaning that 24 values are forecast per day. This process is repeated daily for 28 days, resulting in a total of 672 predictions (28 × 24) at the end of the period. Symmetric prediction intervals are estimated at coverage levels of 10 %, 20 %, 30 %, 40 %, 50 %, 60 %, 70 %, 80 %, 90 % and 95 %, and their empirical coverage is evaluated.
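
Since the intervals are symmetric, each nominal coverage level maps to a pair of quantile bounds, as the following short sketch illustrates:

# Symmetric quantile bounds for each nominal coverage (illustrative)
# ==============================================================================
coverages = [10, 20, 30, 40, 50, 60, 70, 80, 90, 95]
bounds = [[(100 - c) / 2, 100 - (100 - c) / 2] for c in coverages]
print(bounds)
[[45.0, 55.0], [40.0, 60.0], [35.0, 65.0], [30.0, 70.0], [25.0, 75.0], [20.0, 80.0], [15.0, 85.0], [10.0, 90.0], [5.0, 95.0], [2.5, 97.5]]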

Skforecast implements several methods for probabilistic forecasting; this document uses prediction intervals estimated from bootstrapped residuals.

Libraries

The libraries used in this notebook are:

# Data processing
# ==============================================================================
import numpy as np
import pandas as pd
from skforecast.datasets import fetch_dataset

# Plots
# ==============================================================================
import matplotlib.pyplot as plt
from skforecast.plot import set_dark_theme, plot_prediction_intervals, plot_residuals
from statsmodels.graphics.tsaplots import plot_pacf
from statsmodels.tsa.stattools import pacf
import plotly.graph_objects as go
import plotly.io as pio
import plotly.offline as poff
pio.templates.default = "seaborn"
pio.renderers.default = 'notebook' 
poff.init_notebook_mode(connected=True)

# Modeling and forecasting
# ==============================================================================
import skforecast
from sklearn.linear_model import Ridge
from sklearn.feature_selection import RFECV
from sklearn.pipeline import make_pipeline
from feature_engine.datetime import DatetimeFeatures
from feature_engine.creation import CyclicalFeatures
from feature_engine.timeseries.forecasting import LagFeatures, WindowFeatures
from skforecast.recursive import ForecasterRecursive
from skforecast.model_selection import TimeSeriesFold, bayesian_search_forecaster, backtesting_forecaster
from skforecast.feature_selection import select_features
from skforecast.preprocessing import RollingFeatures
from skforecast.metrics import calculate_coverage

# Warnings
# ==============================================================================
import warnings
warnings.filterwarnings('once')

print('skforecast version:', skforecast.__version__)
skforecast version: 0.15.0

Data

Data from an electricity transformer station was collected between July 2016 and July 2018 (2 years × 365 days × 24 hours × 4 intervals per hour = 70,080 values). For each timestamp, 7 variables are available: the target value "Oil Temperature (OT)" and 6 external power-load variables: High UseFul Load (HUFL), High UseLess Load (HULL), Middle UseFul Load (MUFL), Middle UseLess Load (MULL), Low UseFul Load (LUFL), Low UseLess Load (LULL).

Zhou, Haoyi & Zhang, Shanghang & Peng, Jieqi & Zhang, Shuai & Li, Jianxin & Xiong, Hui & Zhang, Wancai. (2020). Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. 10.48550/arXiv.2012.07436. https://github.com/zhouhaoyi/ETDataset

# Download data
# ==============================================================================
data = fetch_dataset(name="ett_m2")
data = data.resample(rule="1h", closed="left", label="right").mean()
data
ett_m2
------
Data from an electricity transformer station was collected between July 2016 and
July 2018 (2 years x 365 days x 24 hours x 4 intervals per hour = 70,080 data
points). Each data point consists of 8 features, including the date of the
point, the predictive value "Oil Temperature (OT)", and 6 different types of
external power load features: High UseFul Load (HUFL), High UseLess Load (HULL),
Middle UseFul Load (MUFL), Middle UseLess Load (MULL), Low UseFul Load (LUFL),
Low UseLess Load (LULL).
Zhou, Haoyi & Zhang, Shanghang & Peng, Jieqi & Zhang, Shuai & Li, Jianxin &
Xiong, Hui & Zhang, Wancai. (2020). Informer: Beyond Efficient Transformer for
Long Sequence Time-Series Forecasting.
[10.48550/arXiv.2012.07436](https://arxiv.org/abs/2012.07436).
https://github.com/zhouhaoyi/ETDataset
Shape of the dataset: (69680, 7)
HUFL HULL MUFL MULL LUFL LULL OT
date
2016-07-01 01:00:00 38.784501 10.88975 34.753500 8.55100 4.12575 1.26050 37.838250
2016-07-01 02:00:00 36.041249 9.44475 32.696001 7.13700 3.59025 0.62900 36.849250
2016-07-01 03:00:00 38.240000 11.41350 35.343501 9.10725 3.06000 0.31175 35.915750
2016-07-01 04:00:00 37.800250 11.45525 34.881000 9.28850 3.04400 0.60750 32.839375
2016-07-01 05:00:00 36.501750 10.49200 33.708250 8.65150 2.64400 0.00000 31.466125
... ... ... ... ... ... ... ...
2018-06-26 16:00:00 39.517250 11.45500 50.616000 12.08950 -10.64275 -1.28475 47.744249
2018-06-26 17:00:00 39.140499 11.32975 49.865250 11.84825 -10.33100 -1.35400 48.183498
2018-06-26 18:00:00 42.428501 12.85850 53.591250 13.33575 -10.95475 -1.41800 47.853999
2018-06-26 19:00:00 43.036000 12.87925 54.241250 13.33575 -10.91200 -1.41800 46.535750
2018-06-26 20:00:00 40.543500 11.16175 51.721500 11.76800 -11.22400 -1.41800 45.601625

17420 rows × 7 columns

# Plot of the series
# ==============================================================================
set_dark_theme()
plt.rcParams['lines.linewidth'] = 0.5
fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(data['OT'], label='OT');
# Plot of the exogenous series
# ==============================================================================
cols_to_plot = [
    'HUFL', 'HULL', 'MUFL', 'MULL', 'LUFL', 'LULL'
]
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
colors = colors[1:] + ["violet", "magenta"]    
fig, axs = plt.subplots(6, 1, figsize=(8, 8), sharex=True)
for i, col in enumerate(cols_to_plot):
    axs[i].plot(data[col], label=col, color=colors[i])
    axs[i].legend(loc='lower left', fontsize=8)
    axs[i].tick_params(axis='both', labelsize=8)
# Partial autocorrelation
# ==============================================================================
n_lags = 24 * 7
cols_to_plot = ['OT', 'HUFL', 'HULL', 'MUFL', 'MULL', 'LUFL', 'LULL']
pacf_df = []
fig, axs = plt.subplots(4, 2, figsize=(8, 6))
axs = axs.ravel()
for i, col in enumerate(cols_to_plot):
    pacf_values = pacf(data[col], nlags=n_lags)
    pacf_values = pd.DataFrame({
        'lag': range(1, len(pacf_values)),
        'value': pacf_values[1:],
        'variable': col
    })
    pacf_df.append(pacf_values)
    plot_pacf(data[col], lags=n_lags, ax=axs[i])
    axs[i].set_title(col, fontsize=10)
    axs[i].set_ylim(-0.5, 1.1)
plt.tight_layout()
# Top n lags with highest partial autocorrelation per variable
# ==============================================================================
n = 10
top_lags = set()
for pacf_values in pacf_df:
    variable = pacf_values['variable'].iloc[0]
    lags = pacf_values.nlargest(n, 'value')['lag'].tolist()
    top_lags.update(lags)
    print(f"{variable}: {lags}")
top_lags = list(top_lags)
print(f"\nAll lags: {top_lags}")
OT: [1, 15, 16, 14, 3, 6, 17, 13, 4, 5]
HUFL: [1, 11, 15, 19, 14, 17, 18, 12, 23, 9]
HULL: [1, 3, 2, 15, 12, 23, 11, 14, 24, 4]
MUFL: [1, 15, 11, 14, 17, 19, 12, 18, 9, 16]
MULL: [1, 3, 2, 24, 11, 15, 23, 12, 17, 4]
LUFL: [1, 2, 20, 18, 9, 17, 10, 16, 19, 21]
LULL: [1, 3, 14, 4, 18, 11, 9, 42, 19, 12]

All lags: [1, 2, 3, 4, 5, 6, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 23, 24, 42]

The target series and the exogenous series exhibit similar dynamics, with several lags that can be used as predictors.

Feature engineering

Additional exogenous features are created based on:

  • Lagged values (lags) of the exogenous series.

  • Rolling statistics of the exogenous series.

  • Calendar features.

Warning

This document assumes that the additional series are known into the future. That is, their values are available at prediction time, which allows them to be used as exogenous variables in the forecasting model.

# Calendar features
# ==============================================================================
features_to_extract = ['year', 'month', 'week', 'day_of_week', 'hour']
calendar_transformer = DatetimeFeatures(
    variables           = 'index',
    features_to_extract = features_to_extract,
    drop_original       = False,
)

# Lags of the exogenous variables
# ==============================================================================
lag_transformer = LagFeatures(
    variables = ["HUFL", "MUFL", "MULL", "HULL", "LUFL", "LULL"],
    periods   = top_lags,
)

# Rolling statistics of the exogenous variables
# ==============================================================================
wf_transformer = WindowFeatures(
    variables   = ["HUFL", "MUFL", "MULL", "HULL", "LUFL", "LULL"],
    window      = ["1D", "7D"],
    functions   = ["mean", "max", "min"],
    freq        = "1h",
)

# Cyclical encoding of calendar features
# ==============================================================================
features_to_encode = ['year', 'month', 'week', 'day_of_week', 'hour']
max_values = {
    "month": 12,
    "week": 52,
    "day_of_week": 7,
    "hour": 24,
}
cyclical_encoder = CyclicalFeatures(
                        variables     = features_to_encode,
                        max_values    = max_values,
                        drop_original = True
                   )

exog_transformer = make_pipeline(
                        calendar_transformer,
                        lag_transformer,
                        wf_transformer,
                        cyclical_encoder
                   )
display(exog_transformer)

data = exog_transformer.fit_transform(data)
# Remove missing values introduced when creating the lags
data = data.dropna()
exog_features = data.columns.difference(['OT']).tolist()
display(data.head(3))
Pipeline(steps=[('datetimefeatures',
                 DatetimeFeatures(drop_original=False,
                                  features_to_extract=['year', 'month', 'week',
                                                       'day_of_week', 'hour'],
                                  variables='index')),
                ('lagfeatures',
                 LagFeatures(periods=[1, 2, 3, 4, 5, 6, 9, 10, 11, 12, 13, 14,
                                      15, 16, 17, 18, 19, 20, 21, 23, 24, 42],
                             variables=['HUFL', 'MUFL', 'MULL', 'HULL', 'LUFL',
                                        'LULL'])),
                ('windowfeatures',
                 WindowFeatures(freq='1h', functions=['mean', 'max', 'min'],
                                variables=['HUFL', 'MUFL', 'MULL', 'HULL',
                                           'LUFL', 'LULL'],
                                window=['1D', '7D'])),
                ('cyclicalfeatures',
                 CyclicalFeatures(drop_original=True,
                                  max_values={'day_of_week': 7, 'hour': 24,
                                              'month': 12, 'week': 52},
                                  variables=['year', 'month', 'week',
                                             'day_of_week', 'hour']))])
HUFL HULL MUFL MULL LUFL LULL OT year HUFL_lag_1 MUFL_lag_1 ... LULL_window_7D_max LULL_window_7D_min month_sin month_cos week_sin week_cos day_of_week_sin day_of_week_cos hour_sin hour_cos
date
2016-07-02 19:00:00 33.6330 9.08900 29.874751 6.95600 3.753 0.61025 28.719500 2016 32.5440 29.017001 ... 1.2605 0.0 -0.5 -0.866025 1.224647e-16 -1.0 -0.974928 -0.222521 -0.965926 0.258819
2016-07-02 20:00:00 31.7065 7.72750 27.884500 5.63600 3.753 0.67425 29.103875 2016 33.6330 29.874751 ... 1.2605 0.0 -0.5 -0.866025 1.224647e-16 -1.0 -0.974928 -0.222521 -0.866025 0.500000
2016-07-02 21:00:00 31.8110 7.81125 27.362000 5.18025 4.435 1.42075 29.598500 2016 31.7065 27.884500 ... 1.2605 0.0 -0.5 -0.866025 1.224647e-16 -1.0 -0.974928 -0.222521 -0.707107 0.707107

3 rows × 184 columns

Prediction intervals with bootstrapped residuals

Data partition

# Split train-validation-test
# ==============================================================================
end_train = '2017-10-01 23:59:00'
end_validation = '2018-04-03 23:59:00'
data_train = data.loc[: end_train, :]
data_val   = data.loc[end_train:end_validation, :]
data_test  = data.loc[end_validation:, :]

print(f"Dates train      : {data_train.index.min()} --- {data_train.index.max()}  (n={len(data_train)})")
print(f"Dates validacion : {data_val.index.min()} --- {data_val.index.max()}  (n={len(data_val)})")
print(f"Dates test       : {data_test.index.min()} --- {data_test.index.max()}  (n={len(data_test)})")
Dates train      : 2016-07-02 19:00:00 --- 2017-10-01 23:00:00  (n=10949)
Dates validation : 2017-10-02 00:00:00 --- 2018-04-03 23:00:00  (n=4416)
Dates test       : 2018-04-04 00:00:00 --- 2018-06-26 20:00:00  (n=2013)
# Plot of the partitions
# ==============================================================================
fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(data_train['OT'], label='Train')
ax.plot(data_val['OT'], label='Validation')
ax.plot(data_test['OT'], label='Test')
ax.set_title('OT')
ax.legend();
# Plot of the partitions after differencing
# ==============================================================================
fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(data_train['OT'].diff(1), label='Train')
ax.plot(data_val['OT'].diff(1), label='Validation')
ax.plot(data_test['OT'].diff(1), label='Test')
ax.set_title('Differenced OT')
ax.legend();

Forecaster

# Create forecaster
# ==============================================================================
window_features = RollingFeatures(stats=['mean', 'min', 'max'], window_sizes=24)
forecaster = ForecasterRecursive(
                 regressor       = Ridge(random_state=15926),
                 lags            = top_lags,
                 differentiation = 1,
                 window_features = window_features,
                 binner_kwargs   = {'n_bins': 10}  # bins used to group residuals by predicted value
             )

Feature selection

# Feature selection (autoregressive and exogenous) with scikit-learn RFECV
# ==============================================================================
regressor = Ridge(random_state=15926)

selector = RFECV(
    estimator=regressor, step=1, cv=3, min_features_to_select=25, n_jobs=-1
)

selected_lags, selected_window_features, selected_exog = select_features(
    forecaster      = forecaster,
    selector        = selector,
    y               = data.loc[:end_train, 'OT'],
    exog            = data.loc[:end_train, exog_features],
    select_only     = None,
    force_inclusion = None,
    subsample       = 0.5,
    random_state    = 123,
    verbose         = True,
)
Recursive feature elimination (RFECV)
-------------------------------------
Total number of records available: 10906
Total number of records used for feature selection: 5453
Number of features available: 208
    Lags            (n=22)
    Window features (n=3)
    Exog            (n=183)
Number of features selected: 107
    Lags            (n=14) : [1, 2, 3, 4, 5, 6, 9, 12, 15, 17, 20, 23, 24, 42]
    Window features (n=2) : ['roll_min_24', 'roll_max_24']
    Exog            (n=91) : ['HUFL', 'HUFL_lag_1', 'HUFL_lag_12', 'HUFL_lag_13', 'HUFL_lag_15', 'HUFL_lag_19', 'HUFL_lag_2', 'HUFL_lag_20', 'HUFL_lag_23', 'HUFL_lag_4', 'HUFL_lag_5', 'HUFL_lag_9', 'HUFL_window_1D_mean', 'HULL', 'HULL_lag_1', 'HULL_lag_12', 'HULL_lag_14', 'HULL_lag_15', 'HULL_lag_2', 'HULL_lag_20', 'HULL_lag_21', 'HULL_lag_23', 'HULL_lag_3', 'HULL_lag_4', 'HULL_window_1D_mean', 'LUFL', 'LUFL_lag_1', 'LUFL_lag_10', 'LUFL_lag_15', 'LUFL_lag_19', 'LUFL_lag_2', 'LUFL_lag_20', 'LUFL_lag_23', 'LUFL_lag_3', 'LUFL_lag_4', 'LUFL_lag_5', 'LUFL_window_1D_mean', 'LULL', 'LULL_lag_1', 'LULL_lag_12', 'LULL_lag_13', 'LULL_lag_14', 'LULL_lag_18', 'LULL_lag_19', 'LULL_lag_20', 'LULL_lag_21', 'LULL_lag_23', 'LULL_lag_24', 'LULL_lag_4', 'LULL_lag_5', 'LULL_lag_6', 'LULL_window_1D_max', 'LULL_window_1D_min', 'MUFL', 'MUFL_lag_1', 'MUFL_lag_11', 'MUFL_lag_12', 'MUFL_lag_13', 'MUFL_lag_15', 'MUFL_lag_2', 'MUFL_lag_20', 'MUFL_lag_23', 'MUFL_lag_4', 'MUFL_lag_9', 'MUFL_window_1D_mean', 'MULL', 'MULL_lag_1', 'MULL_lag_11', 'MULL_lag_12', 'MULL_lag_13', 'MULL_lag_14', 'MULL_lag_15', 'MULL_lag_17', 'MULL_lag_19', 'MULL_lag_2', 'MULL_lag_20', 'MULL_lag_3', 'MULL_lag_4', 'MULL_lag_5', 'MULL_lag_9', 'MULL_window_1D_mean', 'MULL_window_1D_min', 'MULL_window_7D_mean', 'day_of_week_sin', 'hour_cos', 'hour_sin', 'month_cos', 'month_sin', 'week_cos', 'week_sin', 'year']
# Set the selected lags
# ==============================================================================
forecaster.set_lags(selected_lags)

Hyperparameter search

# Hyperparameter search
# ==============================================================================
cv = TimeSeriesFold(
        initial_train_size = len(data.loc[:end_train, :]),
        steps              = 24,  # all hours of the next day
        differentiation    = 1,
     )

def search_space(trial):
    search_space  = {
        'alpha': trial.suggest_float('alpha', 1e-7, 1e+3, log=True)
    }
    return search_space

results_search, frozen_trial = bayesian_search_forecaster(
                                   forecaster   = forecaster,
                                   y            = data.loc[:end_validation, 'OT'],
                                   exog         = data.loc[:end_validation, selected_exog],
                                   search_space = search_space,
                                   cv           = cv,
                                   metric       = 'mean_absolute_error',
                                   n_trials     = 20
                               )
best_params = results_search['params'].iloc[0]
best_lags = results_search['lags'].iloc[0]
results_search.head()
`Forecaster` refitted using the best-found lags and parameters, and the whole data set: 
  Lags: [ 1  2  3  4  5  6  9 12 15 17 20 23 24 42] 
  Parameters: {'alpha': 1.1075004417904626e-07}
  Backtesting metric: 2.381157334071585
lags params mean_absolute_error alpha
0 [1, 2, 3, 4, 5, 6, 9, 12, 15, 17, 20, 23, 24, 42] {'alpha': 1.1075004417904626e-07} 2.381157 1.107500e-07
1 [1, 2, 3, 4, 5, 6, 9, 12, 15, 17, 20, 23, 24, 42] {'alpha': 1.1602023537361883e-07} 2.381157 1.160202e-07
2 [1, 2, 3, 4, 5, 6, 9, 12, 15, 17, 20, 23, 24, 42] {'alpha': 1.3232531712227085e-07} 2.381157 1.323253e-07
3 [1, 2, 3, 4, 5, 6, 9, 12, 15, 17, 20, 23, 24, 42] {'alpha': 1.8131052405964636e-07} 2.381157 1.813105e-07
4 [1, 2, 3, 4, 5, 6, 9, 12, 15, 17, 20, 23, 24, 42] {'alpha': 6.876338479023381e-07} 2.381157 6.876338e-07

Interval prediction

Once the best model has been selected, including its predictors and hyperparameters, a backtesting process is run on the validation set. This process generates out-of-sample residuals, which are then used to estimate the prediction intervals for the test set.

# Backtesting on validation data to obtain out-of-sample residuals
# ==============================================================================
cv = TimeSeriesFold(
        initial_train_size = len(data.loc[:end_train, :]),
        steps              = 24,  # all hours of the next day
        differentiation    = 1,
     )

metric_val, predictions_val = backtesting_forecaster(
                                  forecaster = forecaster,
                                  y          = data.loc[:end_validation, 'OT'],
                                  exog       = data.loc[:end_validation, selected_exog],
                                  cv         = cv,
                                  metric     = 'mean_absolute_error'
                              )
display(metric_val)
fig, ax = plt.subplots(figsize=(8, 3))
data.loc[end_train:end_validation, 'OT'].plot(ax=ax, label='Real value')
predictions_val['pred'].plot(ax=ax, label='prediction')
ax.set_title("Backtesting on validation data")
ax.legend();
mean_absolute_error
0 2.381157
# Distribution of the out-of-sample residuals
# ==============================================================================
residuals = data.loc[predictions_val.index, 'OT'] - predictions_val['pred']
print(pd.Series(np.where(residuals < 0, 'negative', 'positive')).value_counts())
plt.rcParams.update({'font.size': 8})
_ = plot_residuals(residuals=residuals, figsize=(7, 4))
positive    2457
negative    1959
Name: count, dtype: int64
# Store out-of-sample residuals in the forecaster
# ==============================================================================
forecaster.fit(y=data.loc[:end_train, 'OT'], exog=data.loc[:end_train, selected_exog])
forecaster.set_out_sample_residuals(
    y_true = data.loc[predictions_val.index, 'OT'], 
    y_pred = predictions_val['pred']
)

✎ Note

When new residuals are stored in the forecaster, they are binned according to the magnitude of the predictions they correspond to. Later, during the *bootstrapping* process, residuals are sampled from the appropriate bin, ensuring that the distribution of the residuals is consistent with the predictions. This yields more accurate prediction intervals, since the residuals are better aligned with the predictions they correspond to, resulting in better coverage with narrower intervals. For more information on how residuals are used in interval estimation, see Probabilistic forecasting: prediction intervals and prediction distribution.
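
As intuition for this step, the following simplified sketch shows how resampled residuals turn a point forecast into an interval. It is illustrative only, and the arrays point_forecast and residuals_pool are hypothetical: the actual implementation propagates each bootstrapped path through the recursive prediction process and samples the residuals from the bin matching each predicted value.

# Simplified sketch of bootstrapped prediction intervals (illustrative only)
# ==============================================================================
rng = np.random.default_rng(123)
point_forecast = np.array([30.0, 31.5, 33.0])  # hypothetical 3-step forecast
residuals_pool = rng.normal(0, 2, size=500)    # hypothetical out-of-sample residuals

# Each bootstrap iteration adds resampled residuals to the point forecast
n_boot = 150
boot_preds = point_forecast + rng.choice(
    residuals_pool, size=(n_boot, len(point_forecast)), replace=True
)

# An 80% interval is formed by the 10th and 90th percentiles across iterations
lower = np.percentile(boot_preds, 10, axis=0)
upper = np.percentile(boot_preds, 90, axis=0)
print(np.column_stack([lower, point_forecast, upper]))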
# Intervals of the residual bins (conditioned on the predicted values) and number of stored residuals
# ==============================================================================
for k, v in forecaster.binner_intervals_.items():
    print(f"{k}: {v}  n={len(forecaster.out_sample_residuals_by_bin_[k])}")
0: (-3.871638467653412, -1.1238482614862022)  n=417
1: (-1.1238482614862022, -0.714034097520269)  n=611
2: (-0.714034097520269, -0.5147665419615066)  n=510
3: (-0.5147665419615066, -0.3512020918321923)  n=459
4: (-0.3512020918321923, -0.19876133540594054)  n=374
5: (-0.19876133540594054, -0.013725466917961171)  n=274
6: (-0.013725466917961171, 0.2943710646100325)  n=328
7: (0.2943710646100325, 0.7696661155395645)  n=426
8: (0.7696661155395645, 1.4793364746966304)  n=516
9: (1.4793364746966304, 5.609482525672902)  n=500
# Distribution of the out-of-sample residuals by bin
# ==============================================================================
out_sample_residuals_by_bin_df = pd.DataFrame(
    dict(
        [(k, pd.Series(v))
         for k, v in forecaster.out_sample_residuals_by_bin_.items()]
    )
)

fig, ax = plt.subplots(figsize=(6, 3))
out_sample_residuals_by_bin_df.boxplot(ax=ax)
ax.set_title("Distribution of residuals by bin", fontsize=12)
ax.set_xlabel("Bin", fontsize=10)
ax.set_ylabel("Residuals", fontsize=10)
plt.show();
# Backtesting with prediction intervals on test data using out-of-sample residuals
# ==============================================================================
cv = TimeSeriesFold(
        initial_train_size = len(data.loc[:end_validation, :]),
        steps              = 24,  # all hours of the next day
        differentiation    = 1
     )

metric, predictions = backtesting_forecaster(
                          forecaster              = forecaster,
                          y                       = data['OT'],
                          exog                    = data[selected_exog],
                          cv                      = cv,
                          metric                  = 'mean_absolute_error',
                          interval                = [10, 90],  # 80% prediction interval
                          n_boot                  = 150,
                          interval_method         = 'bootstrapping',
                          use_in_sample_residuals = False,  # use out-of-sample residuals
                          use_binned_residuals    = True,   # intervals conditioned on predicted values
                      )
predictions
pred lower_bound upper_bound
2018-04-04 00:00:00 32.748341 32.289988 33.384481
2018-04-04 01:00:00 31.947731 31.186928 33.132068
2018-04-04 02:00:00 31.052885 30.214739 33.000788
2018-04-04 03:00:00 30.108430 29.027782 32.823787
2018-04-04 04:00:00 29.229240 27.894726 32.262539
... ... ... ...
2018-06-26 16:00:00 49.968142 39.049217 56.065172
2018-06-26 17:00:00 49.287497 38.649894 56.572044
2018-06-26 18:00:00 48.319028 38.308019 56.107946
2018-06-26 19:00:00 47.041985 38.269496 55.357029
2018-06-26 20:00:00 45.751569 37.076566 54.628951

2013 rows × 3 columns

# Plot of the prediction intervals
# ==============================================================================
plt.rcParams['lines.linewidth'] = 1
fig, ax = plt.subplots(figsize=(9, 4))
plot_prediction_intervals(
    predictions     = predictions,
    y_true          = data_test,
    target_variable = "OT",
    initial_x_zoom  = None,
    title           = "Real value vs prediction in test data",
    xaxis_title     = "Date time",
    yaxis_title     = "OT",
    ax              = ax
)
fill_between_obj = ax.collections[0]
fill_between_obj.set_facecolor('white')
fill_between_obj.set_alpha(0.3)

# Coverage of the predicted interval
# ==============================================================================
coverage = calculate_coverage(
    y_true      = data.loc[end_validation:, 'OT'],
    lower_bound = predictions["lower_bound"], 
    upper_bound = predictions["upper_bound"]
)
print(f"Predicted interval coverage: {round(100 * coverage, 2)} %")

# Area of the interval
# ==============================================================================
area = (predictions["upper_bound"] - predictions["lower_bound"]).sum()
print(f"Area of the interval: {round(area, 2)}")

display(metric)
Predicted interval coverage: 81.57 %
Area of the interval: 20843.63
mean_absolute_error
0 2.871771
# Interactive plot
# ==============================================================================
fig = go.Figure([
        go.Scatter(name='Prediction', x=predictions.index, y=predictions['pred'], mode='lines'),
        go.Scatter(name='Real value', x=data_test.index, y=data_test['OT'], mode='lines'),
        go.Scatter(
            name='Upper Bound', x=predictions.index, y=predictions['upper_bound'],
            mode='lines', marker=dict(color="#444"), line=dict(width=0), showlegend=False
        ),
        go.Scatter(
            name='Lower Bound', x=predictions.index, y=predictions['lower_bound'],
            marker=dict(color="#444"), line=dict(width=0), mode='lines',
            fillcolor='rgba(68, 68, 68, 0.3)', fill='tonexty', showlegend=False
        )
    ])
fig.update_layout(
    title="Real value vs prediction in test data",
    xaxis_title= "Date time",
    yaxis_title= "OT",
    width=800,
    height=400,
    margin=dict(l=10, r=10, t=35, b=10),
    hovermode="x",
    legend=dict(orientation="h", yanchor="top", y=1.1, xanchor="left", x=0.001)
)
fig.show()
[Interactive plotly figure: Real value vs prediction in test data — x axis: Date time, y axis: OT]

Estimating multiple intervals

Prediction intervals are estimated for several nominal coverage levels: 10 %, 20 %, 30 %, 40 %, 50 %, 60 %, 70 %, 80 %, 90 % and 95 %, and their observed coverage is evaluated.

# Prediction intervals for different nominal coverages
# ==============================================================================
quantiles = [2.5, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 97.5]
intervals = [[2.5, 97.5], [5, 95], [10, 90], [15, 85], [20, 80], [30, 70], [35, 65], [40, 60], [45, 55]]
observed_coverages = []
observed_areas = []

metric, predictions = backtesting_forecaster(
                          forecaster              = forecaster,
                          y                       = data['OT'],
                          exog                    = data[selected_exog],
                          cv                      = cv,
                          metric                  = 'mean_absolute_error',
                          interval                = quantiles,
                          n_boot                  = 150,
                          interval_method         = 'bootstrapping',
                          use_in_sample_residuals = False,  # use out-of-sample residuals
                          use_binned_residuals    = True,   # intervals conditioned on predicted values
                      )
predictions.head()
pred p_2.5 p_5 p_10 p_15 p_20 p_25 p_30 p_35 p_40 ... p_55 p_60 p_65 p_70 p_75 p_80 p_85 p_90 p_95 p_97.5
2018-04-04 00:00:00 32.748341 31.922109 32.053411 32.289988 32.337571 32.490408 32.551956 32.600155 32.743772 32.842704 ... 33.016446 33.049557 33.107951 33.172141 33.212129 33.288902 33.323562 33.384481 33.482723 33.717116
2018-04-04 01:00:00 31.947731 30.421991 30.969322 31.186928 31.302311 31.665845 31.811999 31.950245 32.050338 32.124354 ... 32.407196 32.524998 32.643774 32.748498 32.892010 32.960220 33.040611 33.132068 33.368347 33.726517
2018-04-04 02:00:00 31.052885 28.930792 29.592743 30.214739 30.354561 30.649688 30.910788 31.054912 31.251281 31.386422 ... 31.860174 32.005682 32.082505 32.271244 32.453232 32.584171 32.770079 33.000788 33.416257 34.215743
2018-04-04 03:00:00 30.108430 27.136373 27.518602 29.027782 29.386043 29.762199 29.972688 30.335793 30.611191 30.772863 ... 31.240968 31.489245 31.663091 31.794914 32.007040 32.219158 32.405931 32.823787 33.422536 35.872185
2018-04-04 04:00:00 29.229240 25.818478 26.956306 27.894726 28.732302 28.886822 29.192374 29.471908 29.727138 30.040276 ... 30.739496 30.878859 31.072958 31.173531 31.382835 31.583442 31.869380 32.262539 32.897774 35.302005

5 rows × 22 columns

# Compute observed coverage and area for each interval
# ==============================================================================
for interval in intervals:
    observed_coverage = calculate_coverage(
                            y_true      = data.loc[end_validation:, 'OT'],
                            lower_bound = predictions[f"p_{interval[0]}"], 
                            upper_bound = predictions[f"p_{interval[1]}"]
                        )
    observed_area = (predictions[f"p_{interval[1]}"] - predictions[f"p_{interval[0]}"]).sum()
    observed_coverages.append(100 * observed_coverage)
    observed_areas.append(observed_area)

results = pd.DataFrame({
              'Interval': intervals,
              'Nominal coverage': [interval[1] - interval[0] for interval in intervals],
              'Observed coverage': observed_coverages,
              'Area': observed_areas
          })
results.round(1)
Interval Nominal coverage Observed coverage Area
0 [2.5, 97.5] 95.0 93.9 33464.8
1 [5, 95] 90.0 90.1 27347.4
2 [10, 90] 80.0 81.6 20843.6
3 [15, 85] 70.0 73.9 16713.4
4 [20, 80] 60.0 64.6 13495.6
5 [30, 70] 40.0 44.6 8339.2
6 [35, 65] 30.0 33.6 6079.6
7 [40, 60] 20.0 21.8 3962.8
8 [45, 55] 10.0 11.3 1968.4

The empirical coverage of all the intervals is very close to the expected nominal coverage. This indicates that the intervals are well calibrated and that the model is able to estimate the uncertainty associated with its predictions.
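
As a quick visual check of calibration, the nominal and observed coverages from the results table above can be compared against the diagonal of perfect calibration (a short sketch based on the objects already created in this document):

# Nominal vs. observed coverage (calibration check)
# ==============================================================================
fig, ax = plt.subplots(figsize=(4, 4))
ax.plot(results['Nominal coverage'], results['Observed coverage'], 'o-', label='Observed')
ax.plot([0, 100], [0, 100], '--', color='gray', label='Perfect calibration')
ax.set_xlabel('Nominal coverage (%)')
ax.set_ylabel('Observed coverage (%)')
ax.set_title('Interval calibration')
ax.legend();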

Session information

import session_info
session_info.show(html=False)
-----
feature_engine      1.8.3
matplotlib          3.9.4
numpy               2.1.3
optuna              4.2.1
pandas              2.2.3
plotly              6.0.0
session_info        1.0.0
skforecast          0.15.0
sklearn             1.6.1
statsmodels         0.14.4
-----
IPython             8.32.0
jupyter_client      8.6.3
jupyter_core        5.7.2
-----
Python 3.12.9 | packaged by Anaconda, Inc. | (main, Feb  6 2025, 18:49:16) [MSC v.1929 64 bit (AMD64)]
Windows-11-10.0.26100-SP0
-----
Session information updated at 2025-03-13 12:51

Citation instructions

How to cite this document?

If you use this document or any part of it, we would appreciate it if you cite it. Thank you very much!

Probabilistic forecasting: prediction intervals for multi-step forecasting by Joaquín Amat Rodrigo and Javier Escobar Ortiz, available under an Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0 DEED) license at https://www.cienciadedatos.net/documentos/py60-forecasting-probabilistico-intervalos-prediccion-forecasting-multi-step.html

How to cite skforecast?

If you use skforecast, we would greatly appreciate it if you cite it. Thank you very much!

Zenodo:

Amat Rodrigo, Joaquin, & Escobar Ortiz, Javier. (2024). skforecast (v0.15.0). Zenodo. https://doi.org/10.5281/zenodo.8382788

APA:

Amat Rodrigo, J., & Escobar Ortiz, J. (2024). skforecast (Version 0.15.0) [Computer software]. https://doi.org/10.5281/zenodo.8382788

BibTeX:

@software{skforecast,
  author  = {Amat Rodrigo, Joaquin and Escobar Ortiz, Javier},
  title   = {skforecast},
  version = {0.15.0},
  month   = {03},
  year    = {2025},
  license = {BSD-3-Clause},
  url     = {https://skforecast.org/},
  doi     = {10.5281/zenodo.8382788}
}

Did you like the article? Your support matters

Your contribution will help me continue creating free educational content. Thank you very much! 😊

Become a GitHub Sponsor

Creative Commons Licence

This work by Joaquín Amat Rodrigo and Javier Escobar Ortiz is licensed under an Attribution-NonCommercial-ShareAlike 4.0 International license.

You are free to:

  • Share: copy and redistribute the material in any medium or format.

  • Adapt: remix, transform, and build upon the material.

Under the following terms:

  • Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

  • NonCommercial: You may not use the material for commercial purposes.

  • ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.