  • Introduction
  • Libraries and data
  • Plotting categorical time series
  • Create and train forecaster
  • Prediction
  • Predict probabilities
  • Feature importances
  • Backtesting
  • Hyperparameter tuning
  • Probability calibration
    • Calibration curves
    • Calibration
  • Session information
  • Citation

Introduction

Recursive multi-step forecasting is a technique used to predict future values in a time series by using previous predictions as inputs for subsequent forecasts.

In the context of categorical time series, where the values belong to discrete categories rather than continuous values, the model is trained to predict the probability distribution over the possible categories for each future time step. The most likely category for the current time step is then used as input for predicting the next time step, and this process continues recursively.
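
To make the recursion concrete, below is a minimal sketch of the idea (illustrative only, not skforecast's internal implementation): a multiclass classifier is trained on integer-encoded lags of a toy series, and each predicted class is fed back into the lag window to produce the next step.

# Sketch: recursive multi-step forecasting with a classifier
# ==============================================================================
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(123)
y = rng.integers(0, 3, size=500)  # toy series with integer-encoded classes 0, 1, 2

# Lag matrix: predict y[t] from [y[t-3], y[t-2], y[t-1]]
n_lags = 3
X = np.column_stack([y[i:len(y) - n_lags + i] for i in range(n_lags)])
y_target = y[n_lags:]
clf = LogisticRegression(max_iter=1000).fit(X, y_target)

# Recursive prediction: append each predicted class to the lag window
window = list(y[-n_lags:])
predictions = []
for _ in range(5):
    pred = clf.predict(np.array(window).reshape(1, -1))[0]
    predictions.append(pred)
    window = window[1:] + [pred]
print(predictions)  # five integer-encoded classes; decode back to labels as a final step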

Although the process is conceptually very similar to that used for forecasting continuous time series, additional challenges arise due to the need to encode and decode categorical variables, and to ensure that the underlying machine learning model can handle categorical data effectively.

Skforecast provides the class ForecasterRecursiveClassifier to facilitate recursive multi-step forecasting for categorical time series, automatically managing the encoding and decoding of categorical variables, and integrating seamlessly with various machine learning models that support classification tasks.

Libraries and data

# Libraries
# ==============================================================================
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
from matplotlib.colors import ListedColormap, BoundaryNorm
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.calibration import CalibratedClassifierCV, CalibrationDisplay, calibration_curve
from skforecast.datasets import fetch_dataset
from skforecast.preprocessing import RollingFeaturesClassification
from skforecast.recursive import ForecasterRecursiveClassifier
from skforecast.model_selection import (
    TimeSeriesFold, OneStepAheadFold, backtesting_forecaster, bayesian_search_forecaster
)
from skforecast.plot import set_dark_theme
# Data download
# ==============================================================================
data = fetch_dataset(name='vic_electricity_classification')
data
╭──────────────────────── vic_electricity_classification ────────────────────────╮
│ Description:                                                                   │
│ Hourly electricity demand for Victoria, Australia classified into three        │
│ categories: 'low', 'medium' and 'high' according to the 20th and 80th          │
│ percentiles. The dataset also includes temperature, holiday indicator and hour │
│ of the day as features.                                                        │
│                                                                                │
│ Source:                                                                        │
│ O'Hara-Wild M, Hyndman R, Wang E, Godahewa R (2022). tsibbledata: Diverse      │
│ Datasets for 'tsibble'. https://tsibbledata.tidyverts.org/,                    │
│ https://github.com/tidyverts/tsibbledata/.                                     │
│ https://tsibbledata.tidyverts.org/reference/vic_elec.html                      │
│                                                                                │
│ URL:                                                                           │
│ https://raw.githubusercontent.com/skforecast/skforecast-                       │
│ datasets/main/data/vic_electricity_classification.csv                          │
│                                                                                │
│ Shape: 8736 rows x 5 columns                                                   │
╰────────────────────────────────────────────────────────────────────────────────╯
                     Demand  Temperature  Holiday  Hour_of_day   Demand_raw
Time
2014-01-01 01:00:00     low        24.80      1.0            1  3730.065980
2014-01-01 02:00:00  medium        25.90      1.0            2  3858.473927
2014-01-01 03:00:00  medium        25.30      1.0            3  3851.415438
2014-01-01 04:00:00  medium        23.65      1.0            4  3838.972926
2014-01-01 05:00:00  medium        20.70      1.0            5  3860.540970
...                     ...          ...      ...          ...          ...
2014-12-30 20:00:00     low        12.10      0.0           20  3527.232855
2014-12-30 21:00:00  medium        12.30      0.0           21  3846.439766
2014-12-30 22:00:00  medium        14.05      0.0           22  3961.529675
2014-12-30 23:00:00  medium        16.60      0.0           23  4059.288011
2014-12-31 00:00:00  medium        17.70      0.0            0  4071.800790

8736 rows × 5 columns

# Distribution of classes in the series
# ==============================================================================
print(f"Distribution of classes:\n{data['Demand'].value_counts(normalize=True)}")
Distribution of classes:
Demand
medium    0.599931
low       0.200092
high      0.199977
Name: proportion, dtype: float64
# Split train-test
# ==============================================================================
end_train = '2014-10-31 23:59:00'
data_train = data[:end_train].copy()
data_test  = data[end_train:].copy()

print(
    f"Train dates : {data_train.index.min()} --- "
    f"{data_train.index.max()}  (n={len(data_train)})"
)
print(
    f"Test dates  : {data_test.index.min()} --- "
    f"{data_test.index.max()}  (n={len(data_test)})"
)
Train dates : 2014-01-01 01:00:00 --- 2014-10-31 23:00:00  (n=7295)
Test dates  : 2014-11-01 00:00:00 --- 2014-12-31 00:00:00  (n=1441)

Plotting categorical time series

Visualizing categorical time series can be challenging due to the discrete nature of the data. Most standard plotting libraries require numerical inputs, which means that categorical data must be encoded into a numerical format for visualization purposes.

# Data preprocessing
# ==============================================================================
# Encode categorical variable as numerical for plotting
encode_mapping = {'low': 0, 'medium': 1, 'high': 2}
print(f"Encoding map: {encode_mapping}")
data['encoded_value'] = data['Demand'].map(encode_mapping).astype(float)
data_train['encoded_value'] = data_train['Demand'].map(encode_mapping).astype(float)
data_test['encoded_value'] = data_test['Demand'].map(encode_mapping).astype(float)
data.head(3)
Encoding map: {'low': 0, 'medium': 1, 'high': 2}
                     Demand  Temperature  Holiday  Hour_of_day   Demand_raw  encoded_value
Time
2014-01-01 01:00:00     low         24.8      1.0            1  3730.065980            0.0
2014-01-01 02:00:00  medium         25.9      1.0            2  3858.473927            1.0
2014-01-01 03:00:00  medium         25.3      1.0            3  3851.415438            1.0
# Plot: line plot
# ==============================================================================
set_dark_theme()
fig, ax = plt.subplots(figsize=(8, 3))
data_train['encoded_value'].plot(ax=ax, label='train')
data_test['encoded_value'].plot(ax=ax, label='test')
y_ticks = list(encode_mapping.values())
y_labels = list(encode_mapping.keys())
ax.set_yticks(y_ticks)
ax.set_yticklabels(y_labels)
ax.set_title("Categorical Time Series (States)")
ax.set_xlabel("Date")
ax.legend();
# Plot: 1-row heatmap-style stripe
# ==============================================================================
colors = ["skyblue", "gold", "tomato"]
cmap = ListedColormap(colors)
norm = BoundaryNorm([0, 1, 2, 3], cmap.N)
stripe_height = 20
state_array = np.tile(data["encoded_value"].values, (stripe_height, 1))
train_mask = data.index <= pd.to_datetime(end_train)
train_indices = np.where(train_mask)[0]
test_indices = np.where(~train_mask)[0]

fig, ax = plt.subplots(figsize=(8, 2.5))
ax.imshow(
    state_array,
    aspect="auto",
    cmap=cmap,
    norm=norm,
    extent=[0, len(data) - 1, 0, stripe_height],
)

ax.hlines(y=30, xmin=train_indices[0], xmax=train_indices[-1], color="#30a2da", linewidth=4)
ax.hlines(y=30, xmin=test_indices[0], xmax=test_indices[-1], color="#fc4f30", linewidth=4)
ax.set_yticks([])
ax.set_frame_on(False)
tick_indices = np.linspace(0, len(data) - 1, 10, dtype=int)
ax.set_xticks(tick_indices)
ax.set_xticklabels([data.index[i].strftime("%Y-%m") for i in tick_indices], rotation=0, ha="center")
ax.set_xlabel("Date")
ax.set_title("Categorical Daily States Over Time", pad=8)

cbar = fig.colorbar(
    plt.cm.ScalarMappable(cmap=cmap, norm=norm),
    ax=ax,
    orientation="horizontal",
    ticks=[0.5, 1.5, 2.5],
    pad=0.3,       # space between plot and colorbar
    fraction=0.05,  # thickness
    aspect=20       # length-to-height ratio
)
cbar.outline.set_visible(False)
cbar.ax.set_xticklabels(["low", "medium", "high"], fontsize=8)
cbar.set_label("", labelpad=2, fontsize=9)

legend_elements = [
    Patch(facecolor="#30a2da", label="Train"),
    Patch(facecolor="#fc4f30", label="Test"),
]
ax.legend(handles=legend_elements, loc="upper right", frameon=False, ncol=2, borderaxespad=0.5);
# Plot: states timeline
# ==============================================================================
fig, ax = plt.subplots(figsize=(8, 3))
color_map = {"low": "skyblue", "medium": "gold", "high": "tomato"}
for state, color in color_map.items():
    mask = data["Demand"] == state
    ax.plot(
        data.index[mask], [state] * mask.sum(), '|', color=color, markersize=8, label=state
    )
ax.set_yticks(y_ticks)
ax.set_title("Categorical States Timeline")
ax.set_xlabel("Date")
ax.legend()
plt.tight_layout()
plt.show()

Create and train forecaster

When working with categorical time series, a common challenge arises: not only is the target variable categorical, but the lagged values used as predictors are also categorical, as they represent previous observations of the target variable.

Most classification algorithms can handle categorical target variables; however, many cannot directly process categorical predictors. In such cases, it is necessary to encode the categorical features into a numerical format prior to training the model.

The features_encoding parameter in the ForecasterRecursiveClassifier class specifies how the categorical features derived from the target series (e.g., lagged values and window features) are encoded before fitting the underlying machine learning model. The parameter accepts the following options:

  • auto: Use a categorical data type if the classifier supports native categorical features (such as LightGBM, CatBoost, XGBoost, or HistGradientBoostingClassifier). Otherwise, fall back to numerical encoding.

  • categorical: Enforce the use of a categorical data type. This requires a classifier capable of handling native categorical features.

  • ordinal: Apply ordinal encoding (0, 1, 2, …). In this case, the classifier interprets the encoded values as numeric, implicitly assuming an ordinal relationship among the categories (for example, 'low' < 'medium' < 'high').

Proper encoding of categorical features is crucial to ensure that the model can accurately interpret and leverage the information contained in previous observations of the target series.
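
As a rough illustration of what these options imply for the training matrix, the following sketch (plain pandas, outside of skforecast) builds two lag columns from a small categorical series and represents them both ways:

# Sketch: ordinal vs. categorical representation of lagged predictors
# ==============================================================================
import pandas as pd

y = pd.Series(['low', 'medium', 'high', 'medium', 'low', 'medium'])
X = pd.DataFrame({'lag_1': y.shift(1), 'lag_2': y.shift(2)}).dropna()

# 'ordinal': integer codes, so the model sees the categories as ordered numbers
categories = ['low', 'medium', 'high']
X_ordinal = X.apply(lambda col: pd.Categorical(col, categories=categories).codes)
print(X_ordinal)

# 'categorical': pandas category dtype, consumed natively by estimators with
# categorical support, e.g. HistGradientBoostingClassifier with
# categorical_features='from_dtype' (no order is imposed on the categories)
X_categorical = X.astype(pd.CategoricalDtype(categories=categories))
print(X_categorical.dtypes)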

# Create and fit forecaster with rolling window features
# ==============================================================================
forecaster = ForecasterRecursiveClassifier(
                 estimator         = HistGradientBoostingClassifier(random_state=123),
                 lags              = 24,
                 window_features   = RollingFeaturesClassification(stats=['proportion'], window_sizes=72),
                 features_encoding = 'auto', 
             )

forecaster.fit(y=data_train['Demand'], exog=data_train[['Temperature', 'Holiday', 'Hour_of_day']])
forecaster

ForecasterRecursiveClassifier

General Information
  • Estimator: HistGradientBoostingClassifier
  • Lags: [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]
  • Window features: ['roll_proportion_72']
  • Window size: 72
  • Series name: Demand
  • Exogenous included: True
  • Weight function included: False
  • Creation date: 2025-11-28 23:21:44
  • Last fit date: 2025-11-28 23:21:48
  • Skforecast version: 0.19.0
  • Python version: 3.13.9
  • Forecaster id: None
Classification Information
  • Classes: ['high', 'low', 'medium']
  • Class encoding: {'high': 0, 'low': 1, 'medium': 2}
Exogenous Variables
  • Exogenous names: Temperature, Holiday, Hour_of_day
  • Transformer for exog: None
Training Information
  • Training range: [Timestamp('2014-01-01 01:00:00'), Timestamp('2014-10-31 23:00:00')]
  • Training index type: DatetimeIndex
  • Training index frequency: h
Estimator Parameters
    {'categorical_features': 'from_dtype', 'class_weight': None, 'early_stopping': 'auto', 'interaction_cst': None, 'l2_regularization': 0.0, 'learning_rate': 0.1, 'loss': 'log_loss', 'max_bins': 255, 'max_depth': None, 'max_features': 1.0, 'max_iter': 100, 'max_leaf_nodes': 31, 'min_samples_leaf': 20, 'monotonic_cst': None, 'n_iter_no_change': 10, 'random_state': 123, 'scoring': 'loss', 'tol': 1e-07, 'validation_fraction': 0.1, 'verbose': 0, 'warm_start': False}
Fit Kwargs
    {}


Prediction

# Predict
# ==============================================================================
predictions = forecaster.predict(steps=24, exog=data_test[['Temperature', 'Holiday', 'Hour_of_day']])
predictions
2014-11-01 00:00:00    medium
2014-11-01 01:00:00    medium
2014-11-01 02:00:00    medium
2014-11-01 03:00:00    medium
2014-11-01 04:00:00    medium
2014-11-01 05:00:00    medium
2014-11-01 06:00:00    medium
2014-11-01 07:00:00    medium
2014-11-01 08:00:00    medium
2014-11-01 09:00:00      high
2014-11-01 10:00:00      high
2014-11-01 11:00:00    medium
2014-11-01 12:00:00    medium
2014-11-01 13:00:00    medium
2014-11-01 14:00:00    medium
2014-11-01 15:00:00    medium
2014-11-01 16:00:00    medium
2014-11-01 17:00:00       low
2014-11-01 18:00:00       low
2014-11-01 19:00:00       low
2014-11-01 20:00:00    medium
2014-11-01 21:00:00    medium
2014-11-01 22:00:00    medium
2014-11-01 23:00:00    medium
Freq: h, Name: pred, dtype: object
# Plot predictions
# ==============================================================================
fig, ax = plt.subplots(figsize=(7, 3))
data_test.loc[predictions.index, 'encoded_value'].plot(ax=ax, label='test')
(
    predictions
    .to_frame()
    .assign(encoded_value=predictions.map(encode_mapping))['encoded_value']
    .plot(ax=ax, label='predictions')
)
y_ticks = list(encode_mapping.values())
y_labels = list(encode_mapping.keys())
ax.set_yticks(y_ticks)
ax.set_yticklabels(y_labels)
ax.set_title("Categorical Time Series (States)")
ax.set_xlabel("Date")
ax.set_ylabel("State")
ax.legend();
# Prediction metrics
# ==============================================================================
accuracy = accuracy_score(
               y_true = data_test.loc[predictions.index, 'Demand'],
               y_pred = predictions
           )
print(f"Test accuracy: {accuracy}")
print("Classification Report \n---------------------")
print(
    classification_report(
        data.loc[predictions.index, 'Demand'],
        predictions
    )
)
Test accuracy: 0.6666666666666666
Classification Report 
---------------------
              precision    recall  f1-score   support

        high       0.00      0.00      0.00         0
         low       1.00      0.33      0.50         9
      medium       0.68      0.87      0.76        15

    accuracy                           0.67        24
   macro avg       0.56      0.40      0.42        24
weighted avg       0.80      0.67      0.67        24

Predict probabilities

ForecasterRecursiveClassifier includes the predict_proba method, which returns the predicted probability of each category at each forecasted time step. Internally, this method calls the predict_proba method of the underlying classifier used inside the forecaster.

⚠️ Warning

It is important to note that these probabilities come from the classifier's predict_proba method, so their calibration and reliability depend on the specific model used. For example, in scikit-learn, predict_proba returns scores that approximate probabilities, but these are not always well calibrated or truly probabilistic. See the section on Probability Calibration for more details.

# Predict probabilities
# ==============================================================================
predictions = forecaster.predict_proba(steps=24, exog=data_test[['Temperature', 'Holiday', 'Hour_of_day']])
predictions
                     high_proba  low_proba  medium_proba
2014-11-01 00:00:00    0.001643   0.001311      0.997046
2014-11-01 01:00:00    0.002286   0.004299      0.993415
2014-11-01 02:00:00    0.000829   0.001231      0.997940
2014-11-01 03:00:00    0.000601   0.002429      0.996970
2014-11-01 04:00:00    0.001399   0.000247      0.998354
2014-11-01 05:00:00    0.000314   0.000007      0.999679
2014-11-01 06:00:00    0.000883   0.000022      0.999095
2014-11-01 07:00:00    0.026483   0.000058      0.973459
2014-11-01 08:00:00    0.212137   0.000529      0.787334
2014-11-01 09:00:00    0.559296   0.000397      0.440307
2014-11-01 10:00:00    0.869440   0.000120      0.130440
2014-11-01 11:00:00    0.010950   0.000221      0.988829
2014-11-01 12:00:00    0.000274   0.000349      0.999376
2014-11-01 13:00:00    0.000247   0.000121      0.999632
2014-11-01 14:00:00    0.000193   0.000016      0.999792
2014-11-01 15:00:00    0.000051   0.000086      0.999863
2014-11-01 16:00:00    0.000686   0.007753      0.991561
2014-11-01 17:00:00    0.000244   0.958417      0.041339
2014-11-01 18:00:00    0.000035   0.995288      0.004677
2014-11-01 19:00:00    0.000057   0.976199      0.023744
2014-11-01 20:00:00    0.000317   0.474168      0.525515
2014-11-01 21:00:00    0.037326   0.001260      0.961414
2014-11-01 22:00:00    0.261726   0.001186      0.737088
2014-11-01 23:00:00    0.023844   0.000232      0.975924

Feature importances

# Native feature importance of the estimator used in the forecaster
# ==============================================================================
forecaster.get_feature_importances()

Backtesting

The backtesting process for classification forecasters follows the same principles as that of regression forecasters. The key difference lies in the evaluation metrics used to assess model performance. Instead of metrics like Mean Squared Error (MSE) or Mean Absolute Error (MAE), classification forecasters use metrics such as accuracy, precision, and recall to evaluate how well the model predicts categorical outcomes.

The returned predictions will include the most likely category for each forecasted time step in the column pred as well as the predicted probabilities for each category in separate columns.

# Backtesting forecaster
# ==============================================================================
forecaster = ForecasterRecursiveClassifier(
                estimator         = HistGradientBoostingClassifier(random_state=123),
                lags              = 24,
                window_features   = RollingFeaturesClassification(stats=['proportion'], window_sizes=[72]),
                features_encoding = 'auto',
             )

cv = TimeSeriesFold(steps = 24, initial_train_size = end_train)

metric, predictions = backtesting_forecaster(
                          forecaster    = forecaster,
                          y             = data['Demand'],
                          exog          = data[['Temperature', 'Holiday', 'Hour_of_day']],
                          cv            = cv,
                          metric        = 'accuracy_score',
                          n_jobs        = 'auto',
                          verbose       = False,
                          show_progress = True
                      )
display(metric)
display(predictions)
   accuracy_score
0        0.843858

                     fold    pred  high_proba  low_proba  medium_proba
2014-11-01 00:00:00     0  medium    0.001643   0.001311      0.997046
2014-11-01 01:00:00     0  medium    0.002286   0.004299      0.993415
2014-11-01 02:00:00     0  medium    0.000829   0.001231      0.997940
2014-11-01 03:00:00     0  medium    0.000601   0.002429      0.996970
2014-11-01 04:00:00     0  medium    0.001399   0.000247      0.998354
...                   ...     ...         ...        ...           ...
2014-12-30 20:00:00    59     low    0.000042   0.746859      0.253099
2014-12-30 21:00:00    59     low    0.000015   0.990844      0.009141
2014-12-30 22:00:00    59  medium    0.000097   0.460920      0.538982
2014-12-30 23:00:00    59  medium    0.000414   0.009459      0.990127
2014-12-31 00:00:00    60  medium    0.001948   0.002009      0.996042

1441 rows × 5 columns

# Summary of classification metrics on test set
# ==============================================================================
print("Classification Report \n---------------------")
print(
    classification_report(
        data.loc[predictions.index, 'Demand'],
        predictions['pred']
    )
)
Classification Report 
---------------------
              precision    recall  f1-score   support

        high       0.50      0.36      0.42       107
         low       0.84      0.83      0.84       364
      medium       0.87      0.90      0.89       970

    accuracy                           0.84      1441
   macro avg       0.74      0.70      0.71      1441
weighted avg       0.84      0.84      0.84      1441

# Plot predictions
# ==============================================================================
fig, ax = plt.subplots(figsize=(7, 3))
data.loc[predictions.index, 'encoded_value'].plot(ax=ax, label='test')
predictions.assign(encoded_value=predictions['pred'].map(encode_mapping))['encoded_value'].plot(ax=ax, label='predictions')
y_ticks = list(encode_mapping.values())
y_labels = list(encode_mapping.keys())
ax.set_yticks(y_ticks)
ax.set_yticklabels(y_labels)
ax.set_title("Categorical Time Series (States)")
ax.set_xlabel("Date")
ax.set_ylabel("State")
ax.legend();

Hyperparameter tuning

Hyperparameter tuning for ForecasterRecursiveClassifier can be performed using the functions available in the skforecast.model_selection module, such as grid_search_forecaster, random_search_forecaster and bayesian_search_forecaster. These functions allow for systematic exploration of hyperparameter combinations to identify the best configuration for the forecaster based on specified evaluation metrics suitable for classification tasks.

When performing hyperparameter tuning, it is important to use a train-validation-test split approach. This involves dividing the dataset into three distinct subsets:

  • Training set: Used to fit the model.
  • Validation set: Used to tune the hyperparameters.
  • Test set: Used to evaluate the model's performance.

This approach helps to prevent overfitting and ensures that the model generalizes well to unseen data.

# Train-validation-test split
# ==============================================================================
end_train = '2014-08-31 23:59:00'
end_val   = '2014-10-31 23:59:00'
print(
    f"Train dates      : {data.index.min()} --- {data.loc[:end_train].index.max()}"
    f"  (n={len(data.loc[:end_train])})"
)
print(
    f"Validation dates : {data.loc[end_train:].index.min()} --- {data.loc[:end_val].index.max()}"
    f"  (n={len(data.loc[end_train:end_val])})"
)
print(
    f"Test dates       : {data.loc[end_val:].index.min()} --- {data.index.max()}"
    f"  (n={len(data.loc[end_val:])})"
)
print("")
Train dates      : 2014-01-01 01:00:00 --- 2014-08-31 23:00:00  (n=5831)
Validation dates : 2014-09-01 00:00:00 --- 2014-10-31 23:00:00  (n=1464)
Test dates       : 2014-11-01 00:00:00 --- 2014-12-31 00:00:00  (n=1441)

# Bayesian search hyperparameters and lags with Optuna
# ==============================================================================
estimator = HistGradientBoostingClassifier(
                early_stopping = True,
                random_state   = 123,
                verbose        = 0
            )
forecaster = ForecasterRecursiveClassifier(
                estimator         = estimator,
                lags              = 14,  # Placeholder, the value will be overwritten
                window_features   = RollingFeaturesClassification(stats=['proportion'], window_sizes=[72]),
                features_encoding = 'auto',
             )

# Lags used as predictors
lags_grid = [24, (1, 2, 3, 23, 24, 25, 47, 48, 49)]

# Search space
def search_space(trial):
    search_space = {
        'max_iter'          : trial.suggest_int('max_iter', 100, 500, step=100),
        'max_depth'         : trial.suggest_int('max_depth', 3, 10),
        'learning_rate'     : trial.suggest_float('learning_rate', 0.01, 0.5),
        'l2_regularization' : trial.suggest_float('l2_regularization', 0.0, 1.0),
        'lags'              : trial.suggest_categorical('lags', lags_grid)
    }
    return search_space

# Folds
cv_search = OneStepAheadFold(initial_train_size = end_train)

results, best_trial = bayesian_search_forecaster(
                          forecaster          = forecaster,
                          y                   = data.loc[:end_val, 'Demand'],
                          exog                = data.loc[:end_val, ['Temperature', 'Holiday', 'Hour_of_day']],
                          search_space        = search_space,
                          cv                  = cv_search,
                          metric              = 'accuracy_score',
                          kwargs_create_study = {'direction': 'maximize'},
                          n_trials            = 20,
                          random_state        = 123,
                          return_best         = True,
                          n_jobs              = 'auto',
                          verbose             = False,
                          show_progress       = True,
                      )
results.head(4)
╭─────────────────────────── OneStepAheadValidationWarning ────────────────────────────╮
│ One-step-ahead predictions are used for faster model comparison, but they may not    │
│ fully represent multi-step prediction performance. It is recommended to backtest the │
│ final model for a more accurate multi-step performance estimate.                     │
│                                                                                      │
│ Category : skforecast.exceptions.OneStepAheadValidationWarning                       │
│ Location :                                                                           │
│ C:\Users\Joaquin\miniconda3\envs\skforecast_19_py13\Lib\site-packages\skforecast\mod │
│ el_selection\_utils.py:607                                                           │
│ Suppress : warnings.simplefilter('ignore', category=OneStepAheadValidationWarning)   │
╰──────────────────────────────────────────────────────────────────────────────────────╯
`Forecaster` refitted using the best-found lags and parameters, and the whole data set: 
  Lags: [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] 
  Parameters: {'max_iter': 300, 'max_depth': 7, 'learning_rate': 0.06910804633525863, 'l2_regularization': 0.8263408005068332}
  One-step-ahead metric: 0.9030054644808743
                                                lags                                             params  accuracy_score  max_iter  max_depth  learning_rate  l2_regularization
0  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...  {'max_iter': 300, 'max_depth': 7, 'learning_ra...        0.903005     300.0        7.0       0.069108           0.826341
1  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...  {'max_iter': 400, 'max_depth': 5, 'learning_ra...        0.902322     400.0        5.0       0.121157           0.551315
2  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...  {'max_iter': 400, 'max_depth': 4, 'learning_ra...        0.902322     400.0        4.0       0.138751           0.566709
3  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...  {'max_iter': 400, 'max_depth': 7, 'learning_ra...        0.902322     400.0        7.0       0.126432           0.964954
# Best model
# ==============================================================================
best_params = results.loc[0, 'params']
best_lags   = results.loc[0, 'lags']
forecaster = ForecasterRecursiveClassifier(
                 estimator         = HistGradientBoostingClassifier(random_state=123, **best_params),
                 lags              = best_lags,
                 window_features   = RollingFeaturesClassification(stats=['proportion'], window_sizes=[72]),
                 features_encoding = 'auto',
             )
# Backtest final model on test data
# ==============================================================================
cv = TimeSeriesFold(steps = 24, initial_train_size = end_val)
metric, predictions = backtesting_forecaster(
                          forecaster = forecaster,
                          y          = data['Demand'],
                          exog       = data[['Temperature', 'Holiday', 'Hour_of_day']],
                          cv         = cv,
                          metric     = 'accuracy_score'
                      )
display(metric)
predictions
   accuracy_score
0        0.840389

                     fold    pred  high_proba  low_proba  medium_proba
2014-11-01 00:00:00     0  medium    0.000427   0.000960      0.998613
2014-11-01 01:00:00     0  medium    0.000605   0.004756      0.994639
2014-11-01 02:00:00     0  medium    0.000057   0.000384      0.999559
2014-11-01 03:00:00     0  medium    0.000072   0.000702      0.999225
2014-11-01 04:00:00     0  medium    0.000441   0.000269      0.999289
...                   ...     ...         ...        ...           ...
2014-12-30 20:00:00    59     low    0.000307   0.744962      0.254731
2014-12-30 21:00:00    59     low    0.000014   0.995959      0.004027
2014-12-30 22:00:00    59  medium    0.000163   0.487490      0.512347
2014-12-30 23:00:00    59  medium    0.000055   0.002139      0.997806
2014-12-31 00:00:00    60  medium    0.000395   0.001593      0.998012

1441 rows × 5 columns

# Summary of classification metrics on test set
# ==============================================================================
print("Classification Report \n---------------------")
print(
    classification_report(
        data.loc[predictions.index, 'Demand'],
        predictions['pred']
    )
)
Classification Report 
---------------------
              precision    recall  f1-score   support

        high       0.50      0.54      0.52       107
         low       0.84      0.82      0.83       364
      medium       0.88      0.88      0.88       970

    accuracy                           0.84      1441
   macro avg       0.74      0.75      0.74      1441
weighted avg       0.84      0.84      0.84      1441

Probability calibration

Well-calibrated classifiers are probabilistic classifiers for which the output of the predict_proba method can be interpreted directly as a confidence level. For instance, a well-calibrated (binary) classifier should classify samples such that, among the samples to which it assigned a predict_proba value close to, say, 0.8, approximately 80% actually belong to the positive class.

Before showing how to re-calibrate a classifier, we first need a way to assess how well a classifier is calibrated.

Calibration curves

The following definition of calibration curves is taken from scikit-learn documentation:

"Calibration curves, also referred to as reliability diagrams, compare how well the probabilistic predictions of a binary classifier are calibrated. It plots the frequency of the positive label (to be more precise, an estimation of the conditional event probability ) on the y-axis against the predicted probability predict_proba of a model on the x-axis. The tricky part is to get values for the y-axis. In scikit-learn, this is accomplished by binning the predictions such that the x-axis represents the average predicted probability in each bin. The y-axis is then the fraction of positives given the predictions of that bin, i.e. the proportion of samples whose class is the positive class (in each bin)."

Since calibration curves are defined for binary classifiers, in the case of multi-class classification problems, one-vs-rest calibration curves are used. This means that for each class, the samples belonging to that class are considered as positive samples, while all other samples are treated as negative samples.
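
To make the binning explicit, the following sketch reproduces the computation by hand on a synthetic binary example that is well calibrated by construction; the values should essentially match those returned by sklearn's calibration_curve with the default uniform binning strategy.

# Sketch: computing reliability-diagram values by hand
# ==============================================================================
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(123)
y_prob = rng.uniform(0, 1, size=1000)                         # predicted probabilities
y_true = (rng.uniform(0, 1, size=1000) < y_prob).astype(int)  # P(y=1 | p) = p by construction

bin_edges = np.linspace(0, 1, 11)  # 10 equal-width bins
bin_ids = np.digitize(y_prob, bin_edges[1:-1])
prob_pred = np.array([y_prob[bin_ids == i].mean() for i in range(10)])  # x-axis
prob_true = np.array([y_true[bin_ids == i].mean() for i in range(10)])  # y-axis

# Compare with sklearn's implementation (boundary handling may differ negligibly)
sk_true, sk_pred = calibration_curve(y_true=y_true, y_prob=y_prob, n_bins=10)
print(np.round(prob_true - sk_true, 6))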

# Calibration plots: one-vs-rest
# ==============================================================================
for level, code in {'low': 0, 'medium': 1, 'high': 2}.items():
    fig, ax = plt.subplots(figsize=(6, 2.5))
    y_prob = predictions.loc[:, f'{level}_proba'].rename(code)
    y_true = (data.loc[predictions.index, 'encoded_value'] == code).astype(int)
    curve_prob_true, curve_prob_pred = calibration_curve(
        y_true=y_true,
        y_prob=y_prob,
        n_bins=10
    )
    disp = CalibrationDisplay(prob_true=curve_prob_true, prob_pred=curve_prob_pred, y_prob=y_prob)
    disp.plot(name=level, ax=ax, color='C'+str(code))
    ax.set_title(f"{level}")
    ax.set_xlabel("Mean predicted probability")
    ax.set_ylabel("Fraction of positives")
    ax.get_lines()[0].set_color('white')

The calibration curves show that the classifier is not well calibrated, as they deviate significantly from the diagonal line representing perfect calibration. For example, for the class 'low', when the classifier predicts a probability of 0.8, the actual frequency of the positive class is only around 0.5, indicating that the model is overconfident in its predictions for that class.

Calibration

The calibration of a classification model refers to the process of adjusting the predicted probabilities so that they align with the actual observed outcomes. In other words, calibration corrects the model when it tends to systematically overestimate or underestimate probabilities.

A model is said to be perfectly calibrated if, for any predicted probability $p \in [0, 1]$, the proportion of correctly classified observations among those predicted with probability $p$ is exactly $p$. Formally:

$$P(\hat{Y} = Y \mid \hat{P} = p) = p, \qquad \forall \, p \in [0, 1]$$

where:

  • $\hat{Y}$ is the predicted class,
  • $Y$ is the true class, and
  • $\hat{P}$ is the predicted probability of the positive (or target) class.

For example, if we look at all cases where the model predicts a probability of $\hat{P} = 0.8$, we would expect that about 80% of those cases truly belong to the positive class.

When a model is perfectly calibrated, plotting the observed frequencies against the predicted probabilities produces a diagonal line (the identity function) across the range $[0, 1]$:

$$\text{Observed frequency} = \text{Predicted probability}$$

The CalibratedClassifierCV class in scikit-learn is used to calibrate a classifier. Since skforecast is compatible with scikit-learn, a CalibratedClassifierCV can be passed directly to ForecasterRecursiveClassifier as the estimator.
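
For intuition, the method='sigmoid' option used below corresponds to Platt scaling: a sigmoid mapping from the base classifier's scores to calibrated probabilities, fitted on held-out data. A simplified binary sketch of the idea (CalibratedClassifierCV implements a refined version of this, with cross-validation and a regularized sigmoid fit):

# Sketch: sigmoid (Platt) calibration on a held-out set
# ==============================================================================
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=123)
X_fit, X_cal, y_fit, y_cal = train_test_split(X, y, test_size=0.5, random_state=123)

# Base classifier fitted on one half of the data
clf = HistGradientBoostingClassifier(random_state=123).fit(X_fit, y_fit)

# Uncalibrated scores on the held-out calibration half
scores = clf.predict_proba(X_cal)[:, 1].reshape(-1, 1)

# One-feature logistic regression acts as the sigmoid recalibration map
calibrator = LogisticRegression().fit(scores, y_cal)
calibrated = calibrator.predict_proba(scores)[:, 1]
print(np.c_[scores[:3, 0], calibrated[:3]])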

# Forecaster with calibrated classifier
# ==============================================================================
estimator = CalibratedClassifierCV(
                estimator = HistGradientBoostingClassifier(
                                **best_params,
                                random_state=123
                            ),
                method   = 'sigmoid',
                ensemble = False, # False to speed up prediction
                cv       = 3,
            )

forecaster = ForecasterRecursiveClassifier(
                 estimator         = estimator,
                 lags              = best_lags,
                 window_features   = RollingFeaturesClassification(stats=['proportion'], window_sizes=[72]),
                 features_encoding = 'auto',
             )


cv = TimeSeriesFold(steps = 24, initial_train_size = end_val)
metric, predictions = backtesting_forecaster(
                          forecaster = forecaster,
                          y          = data['Demand'],
                          exog       = data[['Temperature', 'Holiday', 'Hour_of_day']],
                          cv         = cv,
                          metric     = 'accuracy_score'
                      )
display(metric)
predictions
   accuracy_score
0        0.844552

                     fold    pred  high_proba  low_proba  medium_proba
2014-11-01 00:00:00     0  medium    0.025318   0.039298      0.935384
2014-11-01 01:00:00     0  medium    0.022445   0.088729      0.888826
2014-11-01 02:00:00     0  medium    0.011331   0.042600      0.946069
2014-11-01 03:00:00     0  medium    0.012608   0.061464      0.925928
2014-11-01 04:00:00     0  medium    0.028538   0.016181      0.955280
...                   ...     ...         ...        ...           ...
2014-12-30 20:00:00    59     low    0.002938   0.632561      0.364501
2014-12-30 21:00:00    59     low    0.001848   0.906687      0.091465
2014-12-30 22:00:00    59  medium    0.001536   0.482725      0.515739
2014-12-30 23:00:00    59  medium    0.005040   0.074950      0.920010
2014-12-31 00:00:00    60  medium    0.022506   0.054675      0.922819

1441 rows × 5 columns

# Calibration plots: one-vs-rest
# ==============================================================================
for level, code in {'low': 0, 'medium': 1, 'high': 2}.items():
    fig, ax = plt.subplots(figsize=(6, 2.5))
    y_prob = predictions.loc[:, f'{level}_proba'].rename(code)
    y_true = (data.loc[predictions.index, 'encoded_value'] == code).astype(int)
    prob_true, prob_pred = calibration_curve(
        y_true=y_true,
        y_prob=y_prob,
        n_bins=10
    )
    disp = CalibrationDisplay(prob_true=prob_true, prob_pred=prob_pred, y_prob=y_prob)
    disp.plot(name=level, ax=ax, color='C'+str(code))
    ax.set_title(f"{level}")
    ax.set_xlabel("Mean predicted probability")
    ax.set_ylabel("Fraction of positives")
    ax.get_lines()[0].set_color('white')

After applying calibration techniques, the calibration curves are much closer to the diagonal line, indicating that the predicted probabilities are now more aligned with the actual observed frequencies. For instance, for the class 'low', when the classifier predicts a probability of 0.8, the actual frequency of the positive class is now around 0.75, showing a significant improvement compared to before calibration.

Session information

import session_info
session_info.show(html=False)
-----
matplotlib          3.10.8
numpy               2.3.4
optuna              4.6.0
pandas              2.3.3
session_info        v1.0.1
skforecast          0.19.0
sklearn             1.7.2
-----
IPython             9.7.0
jupyter_client      8.6.3
jupyter_core        5.9.1
-----
Python 3.13.9 | packaged by conda-forge | (main, Oct 22 2025, 23:12:41) [MSC v.1944 64 bit (AMD64)]
Windows-11-10.0.26100-SP0
-----
Session information updated at 2025-11-28 23:22

Citation

How to cite this document

If you use this document or any part of it, please acknowledge the source. Thank you!

Forecasting of categorical time series by Joaquín Amat Rodrigo and Javier Escobar Ortiz, available under Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0 DEED) at https://cienciadedatos.net/documentos/py72-classification-forecasting-categorical-series.html

How to cite skforecast

If you use skforecast for a publication, we would appreciate it if you cite the published software.

Zenodo:

Amat Rodrigo, Joaquin, & Escobar Ortiz, Javier. (2024). skforecast (v0.19.0). Zenodo. https://doi.org/10.5281/zenodo.8382788

APA:

Amat Rodrigo, J., & Escobar Ortiz, J. (2024). skforecast (Version 0.19.0) [Computer software]. https://doi.org/10.5281/zenodo.8382788

BibTeX:

@software{skforecast,
  author  = {Amat Rodrigo, Joaquin and Escobar Ortiz, Javier},
  title   = {skforecast},
  version = {0.19.0},
  month   = {11},
  year    = {2025},
  license = {BSD-3-Clause},
  url     = {https://skforecast.org/},
  doi     = {10.5281/zenodo.8382788}
}


Did you like the article? Your support is important

Your contribution will help me to continue generating free educational content. Many thanks! 😊

Become a GitHub Sponsor


This work by Joaquín Amat Rodrigo and Javier Escobar Ortiz is licensed under an Attribution-NonCommercial-ShareAlike 4.0 International license.

Allowed:

  • Share: copy and redistribute the material in any medium or format.

  • Adapt: remix, transform, and build upon the material.

Under the following terms:

  • Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

  • NonCommercial: You may not use the material for commercial purposes.

  • ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.