• Introduction
  • Libraries
  • Data
  • Exogenous variables
    • Building characteristics
    • Calendar features
    • Meteorological features
    • Rolling features for exogenous variables
    • Categorical features
  • Modelling and forecasting
  • Hyperparameter tuning
  • Backtesting on test data
  • Feature selection
  • Clustering time series
  • Session information
  • Citation


Introduction

When faced with scenarios involving the prediction of hundreds or thousands of time series, a crucial decision arises: should one develop individual models for each series, or should one use a unified model to handle them all at once?

In Single-Series Modeling (Local Forecasting Model), a separate predictive model is created for each time series. While this method provides a comprehensive understanding of each series, its scalability can be challenged by the need to create and maintain hundreds or thousands of models.

Multi-Series Modeling (Global Forecasting Model) involves building a single predictive model that considers all time series simultaneously. It attempts to capture the core patterns that govern the series, thereby mitigating the potential noise that each series might introduce. This approach is computationally efficient, easy to maintain, and can yield more robust generalizations across time series, albeit potentially at the cost of sacrificing some individual insights.

This document shows how to forecast more than 1,000 time series with a single model, including exogenous features, some of which take different values for each series.

Libraries

Libraries used in this document.

# Data management
# ==============================================================================
import numpy as np
import pandas as pd

# Plots
# ==============================================================================
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-darkgrid')

# Forecasting
# ==============================================================================
import skforecast
import lightgbm
from lightgbm import LGBMRegressor
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import make_column_transformer
from sklearn.feature_selection import RFECV
from skforecast.recursive import ForecasterRecursiveMultiSeries
from skforecast.model_selection import TimeSeriesFold, OneStepAheadFold
from skforecast.model_selection import backtesting_forecaster_multiseries
from skforecast.model_selection import bayesian_search_forecaster_multiseries
from skforecast.feature_selection import select_features_multiseries
from skforecast.preprocessing import RollingFeatures
from skforecast.preprocessing import series_long_to_dict
from skforecast.preprocessing import exog_long_to_dict
from feature_engine.datetime import DatetimeFeatures
from feature_engine.creation import CyclicalFeatures
from feature_engine.timeseries.forecasting import WindowFeatures
from skforecast.datasets import fetch_dataset

# Configuration
# ==============================================================================
import warnings
warnings.filterwarnings('once')

print('skforecast version:', skforecast.__version__)
print('lightgbm version:', lightgbm.__version__)
skforecast version: 0.15.1
lightgbm version: 4.6.0

Data

The data used in this document has been obtained from The Building Data Genome Project 2 (https://github.com/buds-lab/building-data-genome-project-2). The dataset contains information on the energy consumption of more than 1,500 buildings. The time series cover two full years (2016 and 2017) with hourly measurements of electricity, heating and cooling water, steam, and irrigation meters. The dataset also includes information on the characteristics of the buildings and the weather conditions. The data has been aggregated to a daily resolution and, of the different energy sources, only electricity has been considered.

# Load data
# ==============================================================================
data = fetch_dataset(name='bdg2_daily')
print("Data shape:", data.shape)
data.head(3)
bdg2_daily
----------
Daily energy consumption data from the The Building Data Genome Project 2 with
building metadata and weather data. https://github.com/buds-lab/building-data-
genome-project-2
Miller, C., Kathirgamanathan, A., Picchetti, B. et al. The Building Data Genome
Project 2, energy meter data from the ASHRAE Great Energy Predictor III
competition. Sci Data 7, 368 (2020). https://doi.org/10.1038/s41597-020-00712-x
Shape of the dataset: (1153518, 17)
Data shape: (1153518, 17)
building_id meter_reading site_id primaryspaceusage sub_primaryspaceusage sqm lat lng timezone airTemperature cloudCoverage dewTemperature precipDepth1HR precipDepth6HR seaLvlPressure windDirection windSpeed
timestamp
2016-01-01 Bear_assembly_Angel 12808.1620 Bear Entertainment/public assembly Entertainment/public assembly 22117.0 37.871903 -122.260729 US/Pacific 6.1750 1.666667 -5.229167 0.0 0.0 1020.891667 68.750000 3.070833
2016-01-02 Bear_assembly_Angel 9251.0003 Bear Entertainment/public assembly Entertainment/public assembly 22117.0 37.871903 -122.260729 US/Pacific 8.0875 NaN -1.404167 0.0 0.0 1017.687500 76.666667 3.300000
2016-01-03 Bear_assembly_Angel 14071.6500 Bear Entertainment/public assembly Entertainment/public assembly 22117.0 37.871903 -122.260729 US/Pacific 10.1125 NaN 1.708333 -6.0 -2.0 1011.491667 91.666667 3.120833
# Overall range of available dates
# ==============================================================================
print(
    f"Rage of dates available : {data.index.min()} --- {data.index.max()}  "
    f"(n_days={(data.index.max() - data.index.min()).days})"
)
Range of dates available : 2016-01-01 00:00:00 --- 2017-12-31 00:00:00  (n_days=730)
# Range of available dates per building
# ==============================================================================
available_dates_per_series = (
    data
    .dropna(subset="meter_reading")
    .reset_index()
    .groupby("building_id")
    .agg(
        min_index=("timestamp", "min"),
        max_index=("timestamp", "max"),
        n_values=("timestamp", "nunique")
    )
)
display(available_dates_per_series)
print(f"Unique lenght of series: {available_dates_per_series.n_values.unique()}")
min_index max_index n_values
building_id
Bear_assembly_Angel 2016-01-01 2017-12-31 731
Bear_assembly_Beatrice 2016-01-01 2017-12-31 731
Bear_assembly_Danial 2016-01-01 2017-12-31 731
Bear_assembly_Diana 2016-01-01 2017-12-31 731
Bear_assembly_Genia 2016-01-01 2017-12-31 731
... ... ... ...
Wolf_public_Norma 2016-01-01 2017-12-31 731
Wolf_retail_Harriett 2016-01-01 2017-12-31 731
Wolf_retail_Marcella 2016-01-01 2017-12-31 731
Wolf_retail_Toshia 2016-01-01 2017-12-31 731
Wolf_science_Alfreda 2016-01-01 2017-12-31 731

1578 rows × 3 columns

Unique length of series: [731]

All time series have the same length, starting on January 1, 2016 and ending on December 31, 2017, and the exogenous variables have few missing values. Skforecast does not require that the time series have the same length, and missing values are allowed as long as the underlying regressor can handle them, as is the case for LightGBM, XGBoost, and HistGradientBoostingRegressor.

# Missing values per feature
# ==============================================================================
data.isna().mean().mul(100).round(2)
building_id               0.00
meter_reading             0.00
site_id                   0.00
primaryspaceusage         1.20
sub_primaryspaceusage     1.20
sqm                       0.00
lat                      14.83
lng                      14.83
timezone                  0.00
airTemperature            0.02
cloudCoverage             7.02
dewTemperature            0.03
precipDepth1HR            0.02
precipDepth6HR            0.02
seaLvlPressure            9.56
windDirection             0.02
windSpeed                 0.02
dtype: float64

Exogenous variables

Exogenous variables are variables external to the time series that can be used as features to improve the forecast. In this case, the exogenous variables used are the characteristics of the buildings, calendar features, and weather conditions.

⚠ Warning

Exogenous variables must be known at the time of the forecast. For example, if temperature is used as an exogenous variable, the temperature value for the next day must be known at the time of the forecast; otherwise, forecasting will not be possible.

⚠ Warning

When creating new features in a multi-series dataset, it is important to avoid mixing information from different series. It is recommended to use the groupby and apply methods to create these features.
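
A minimal sketch of why this matters, with two toy series stacked in long format (hypothetical series_id and value columns): a rolling mean computed on the raw column leaks across the series boundary, while the groupby version does not.

```python
import pandas as pd

# Two toy series stacked in long format
df = pd.DataFrame({
    "series_id": ["a"] * 4 + ["b"] * 4,
    "value": [1, 2, 3, 4, 100, 200, 300, 400],
})

# Incorrect: the window at the start of series 'b' mixes in
# the last value of series 'a'
leaky = df["value"].rolling(window=2).mean()

# Correct: grouping first keeps each window inside a single series
safe = df.groupby("series_id")["value"].transform(lambda s: s.rolling(2).mean())

print(leaky.iloc[4])  # 52.0 (mean of 4 and 100, contaminated)
print(safe.iloc[4])   # nan  (first row of series 'b')
```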

Building characteristics

One of the key attributes associated with each building is its designated use. This feature may play a crucial role in influencing the energy consumption pattern, as distinct uses can significantly impact both the quantity and timing of energy consumption.

# Number of buildings
# ==============================================================================
print(f"Number of buildings: {data['building_id'].nunique()}")
print(f"Number of building types: {data['primaryspaceusage'].nunique()}")
print(f"Number of building subtypes: {data['sub_primaryspaceusage'].nunique()}")
Number of buildings: 1578
Number of building types: 16
Number of building subtypes: 104

For certain type and subtype categories, there are a limited number of buildings in the dataset. Types with fewer than 100 buildings and subtypes with fewer than 50 buildings are grouped into the "Other" category.

# Group infrequent categories
# ==============================================================================
infrequent_types = (
    data
    .drop_duplicates(subset=['building_id'])['primaryspaceusage']
    .value_counts()
    .loc[lambda x: x < 100]
    .index
    .tolist()
)
infrequent_subtypes = (
    data
    .drop_duplicates(subset=['building_id'])['sub_primaryspaceusage']
    .value_counts()
    .loc[lambda x: x < 50]
    .index
    .tolist()
)

data['primaryspaceusage'] = np.where(
    data['primaryspaceusage'].isin(infrequent_types),
    'Other',
    data['primaryspaceusage']
)
data['sub_primaryspaceusage'] = np.where(
    data['sub_primaryspaceusage'].isin(infrequent_subtypes),
    'Other',
    data['sub_primaryspaceusage']
)

display(data.drop_duplicates(subset=['building_id'])['primaryspaceusage'].value_counts())
display(data.drop_duplicates(subset=['building_id', 'sub_primaryspaceusage'])['sub_primaryspaceusage'].value_counts())
primaryspaceusage
Education                        604
Office                           296
Entertainment/public assembly    203
Public services                  166
Lodging/residential              149
Other                            141
Name: count, dtype: int64
sub_primaryspaceusage
Other                          612
Office                         295
College Classroom              131
College Laboratory             116
K-12 School                    109
Dormitory                       91
Primary/Secondary Classroom     84
Education                       67
Library                         54
Name: count, dtype: int64

Calendar features

# Calendar features
# ==============================================================================
features_to_extract = [
    'month',
    'week',
    'day_of_week',
]
calendar_transformer = DatetimeFeatures(
                            variables           = 'index',
                            features_to_extract = features_to_extract,
                            drop_original       = False,
                       )
data = calendar_transformer.fit_transform(data)
print(f"Data shape: {data.shape}")
data.head(3)
Data shape: (1153518, 20)
building_id meter_reading site_id primaryspaceusage sub_primaryspaceusage sqm lat lng timezone airTemperature cloudCoverage dewTemperature precipDepth1HR precipDepth6HR seaLvlPressure windDirection windSpeed month week day_of_week
timestamp
2016-01-01 Bear_assembly_Angel 12808.1620 Bear Entertainment/public assembly Other 22117.0 37.871903 -122.260729 US/Pacific 6.1750 1.666667 -5.229167 0.0 0.0 1020.891667 68.750000 3.070833 1 53 4
2016-01-02 Bear_assembly_Angel 9251.0003 Bear Entertainment/public assembly Other 22117.0 37.871903 -122.260729 US/Pacific 8.0875 NaN -1.404167 0.0 0.0 1017.687500 76.666667 3.300000 1 53 5
2016-01-03 Bear_assembly_Angel 14071.6500 Bear Entertainment/public assembly Other 22117.0 37.871903 -122.260729 US/Pacific 10.1125 NaN 1.708333 -6.0 -2.0 1011.491667 91.666667 3.120833 1 53 6
# Cyclical encoding of calendar features
# ==============================================================================
features_to_encode = [
    "month",
    "week",
    "day_of_week",
]
max_values = {
    "month": 12,
    "week": 52,
    "day_of_week": 6,
}
cyclical_encoder = CyclicalFeatures(
                        variables     = features_to_encode,
                        max_values    = max_values,
                        drop_original = False
                   )

data = cyclical_encoder.fit_transform(data)
print(f"Data shape: {data.shape}")
data.head(3)
Data shape: (1153518, 26)
building_id meter_reading site_id primaryspaceusage sub_primaryspaceusage sqm lat lng timezone airTemperature ... windSpeed month week day_of_week month_sin month_cos week_sin week_cos day_of_week_sin day_of_week_cos
timestamp
2016-01-01 Bear_assembly_Angel 12808.1620 Bear Entertainment/public assembly Other 22117.0 37.871903 -122.260729 US/Pacific 6.1750 ... 3.070833 1 53 4 0.5 0.866025 0.120537 0.992709 -8.660254e-01 -0.5
2016-01-02 Bear_assembly_Angel 9251.0003 Bear Entertainment/public assembly Other 22117.0 37.871903 -122.260729 US/Pacific 8.0875 ... 3.300000 1 53 5 0.5 0.866025 0.120537 0.992709 -8.660254e-01 0.5
2016-01-03 Bear_assembly_Angel 14071.6500 Bear Entertainment/public assembly Other 22117.0 37.871903 -122.260729 US/Pacific 10.1125 ... 3.120833 1 53 6 0.5 0.866025 0.120537 0.992709 -2.449294e-16 1.0

3 rows × 26 columns

✎ Note

For more information about calendar features and cyclical encoding visit Calendar features and Cyclical features in time series forecasting.
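
The transformation applied by CyclicalFeatures can be sketched with plain NumPy: each periodic feature is projected onto the unit circle so that the end of the cycle is adjacent to its start (a toy example assuming the same max_values used above).

```python
import numpy as np

# Project the month onto the unit circle: month 12 lands next to month 1,
# a proximity that a plain integer encoding cannot express
month = np.array([1, 6, 12])
max_value = 12

month_sin = np.sin(2 * np.pi * month / max_value)
month_cos = np.cos(2 * np.pi * month / max_value)

print(np.round(month_sin, 3))
print(np.round(month_cos, 3))
```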

Meteorological features

Meteorological variables are recorded at the site level, meaning that weather data varies by building location, even at the same timestamp. In other words, while the exogenous variables are consistent across all series, their values differ by location.

# Values of meteorological features for a given date at each site
# ==============================================================================
data.loc["2016-01-01"].groupby("site_id", observed=True).agg(
    {
        "airTemperature": "first",
        "cloudCoverage": "first",
        "dewTemperature": "first",
        "precipDepth1HR": "first",
        "precipDepth6HR": "first",
        "seaLvlPressure": "first",
        "windDirection": "first",
        "windSpeed": "first",
    }
)
airTemperature cloudCoverage dewTemperature precipDepth1HR precipDepth6HR seaLvlPressure windDirection windSpeed
site_id
Bear 6.175000 1.666667 -5.229167 0.0 0.0 1020.891667 68.750000 3.070833
Bobcat -11.595833 0.000000 -17.041667 0.0 0.0 1034.466667 260.869565 3.033333
Bull 7.612500 4.000000 0.708333 -2.0 -1.0 1031.762500 130.416667 3.712500
Cockatoo -2.000000 4.000000 -4.066667 0.0 0.0 NaN 281.333333 4.826667
Crow -1.787500 NaN -3.595833 15.0 13.0 1011.670833 236.250000 3.666667
Eagle 3.187500 1.000000 -4.487500 0.0 0.0 1017.445833 287.500000 3.470833
Fox 10.200000 1.000000 -2.804167 0.0 0.0 1018.512500 47.083333 0.470833
Gator 23.291667 5.600000 19.625000 -3.0 0.0 1018.663636 150.416667 2.520833
Hog -5.583333 1.142857 -9.741667 -2.0 -1.0 1019.545833 260.416667 4.758333
Lamb 6.913043 0.000000 5.434783 0.0 0.0 NaN 123.181818 8.017391
Moose -1.787500 NaN -3.595833 15.0 13.0 1011.670833 236.250000 3.666667
Mouse 5.387500 0.000000 3.879167 0.0 0.0 1016.941667 116.666667 4.470833
Panther 23.291667 5.600000 19.625000 -3.0 0.0 1018.663636 150.416667 2.520833
Peacock 5.558333 0.000000 -1.541667 0.0 0.0 1019.783333 120.000000 1.275000
Rat 5.633333 4.727273 -2.112500 0.0 0.0 1020.450000 314.782609 3.816667
Robin 5.387500 0.000000 3.879167 0.0 0.0 1016.941667 116.666667 4.470833
Shrew 5.387500 0.000000 3.879167 0.0 0.0 1016.941667 116.666667 4.470833
Swan 6.054167 0.400000 -2.591667 0.0 0.0 1021.216667 154.000000 1.783333
Wolf 5.716667 6.625000 3.208333 0.0 1.0 1007.625000 140.000000 8.875000

Skforecast allows you to include different exogenous variables and/or different values for each series in the dataset (more details in the next section).

Rolling features for exogenous variables

# Rolling features of meteorological features
# ==============================================================================
wf_transformer = WindowFeatures(
    variables      = ["airTemperature", "windSpeed"],
    window         = ["7D", "14D"],
    functions      = ["mean"],
    freq           = "D",
    missing_values = "ignore",
    drop_na        = False,
)
data = (
    data
    .groupby("building_id")
    .apply(wf_transformer.fit_transform, include_groups=False)
    .reset_index()
)
data = data.set_index("timestamp")
print(f"Data shape: {data.shape}")
data.head(3)
Data shape: (1153518, 30)
building_id meter_reading site_id primaryspaceusage sub_primaryspaceusage sqm lat lng timezone airTemperature ... month_sin month_cos week_sin week_cos day_of_week_sin day_of_week_cos airTemperature_window_7D_mean windSpeed_window_7D_mean airTemperature_window_14D_mean windSpeed_window_14D_mean
timestamp
2016-01-01 Bear_assembly_Angel 12808.1620 Bear Entertainment/public assembly Other 22117.0 37.871903 -122.260729 US/Pacific 6.1750 ... 0.5 0.866025 0.120537 0.992709 -8.660254e-01 -0.5 NaN NaN NaN NaN
2016-01-02 Bear_assembly_Angel 9251.0003 Bear Entertainment/public assembly Other 22117.0 37.871903 -122.260729 US/Pacific 8.0875 ... 0.5 0.866025 0.120537 0.992709 -8.660254e-01 0.5 6.17500 3.070833 6.17500 3.070833
2016-01-03 Bear_assembly_Angel 14071.6500 Bear Entertainment/public assembly Other 22117.0 37.871903 -122.260729 US/Pacific 10.1125 ... 0.5 0.866025 0.120537 0.992709 -2.449294e-16 1.0 7.13125 3.185417 7.13125 3.185417

3 rows × 30 columns

Categorical features

LightGBM can handle categorical variables internally without any preprocessing. To allow automatic detection of categorical features based on pandas data types in a Forecaster, categorical variables must first be encoded as integers (ordinal encoding) and then stored as the category type. This is necessary because skforecast uses a numeric array internally to speed up the computation, and LightGBM requires the categorical features to have the category dtype to detect them automatically. It is also necessary to indicate the categorical features when creating the forecaster, using fit_kwargs = {'categorical_feature': 'auto'} for automatic detection or passing an explicit list of column names.
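
The two steps described above can be sketched on a toy column (hypothetical building_use data, not the dataset's transformer): ordinal-encode the strings, then cast the result to the category dtype so that automatic detection can pick it up.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"building_use": ["Office", "Education", "Office", "Lodging"]})

# Step 1: encode categories as integers (unknown categories mapped to -1)
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoded = encoder.fit_transform(df[["building_use"]]).ravel()

# Step 2: store as category dtype so LightGBM's automatic detection
# (categorical_feature='auto') recognizes the column
df["building_use"] = pd.Series(encoded, index=df.index).astype("category")

print(df["building_use"].dtype)  # category
```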

⚠ Warning

The four main gradient boosting frameworks (LightGBM, scikit-learn's HistGradientBoostingRegressor, XGBoost, and CatBoost) can handle categorical features directly within the model. However, each framework has its own configuration, benefits, and potential problems. To fully understand how to use them, it is highly recommended to refer to the skforecast user guide.
# Transformer: ordinal encoding
# ==============================================================================
# A ColumnTransformer is used to transform categorical (not numerical) features
# using ordinal encoding. Numeric features are left untouched. Missing values
# are coded as -1. If a new category is found in the test set, it is encoded
# as -1.
categorical_features = ['primaryspaceusage', 'sub_primaryspaceusage', 'timezone']
transformer_exog = make_column_transformer(
                       (
                           OrdinalEncoder(
                               dtype=float,
                               handle_unknown="use_encoded_value",
                               unknown_value=np.nan,
                               encoded_missing_value=np.nan
                           ),
                           categorical_features
                       ),
                       remainder="passthrough",
                       verbose_feature_names_out=False,
                   ).set_output(transform="pandas")
transformer_exog
ColumnTransformer(remainder='passthrough',
                  transformers=[('ordinalencoder',
                                 OrdinalEncoder(dtype=<class 'float'>,
                                                handle_unknown='use_encoded_value',
                                                unknown_value=nan),
                                 ['primaryspaceusage', 'sub_primaryspaceusage',
                                  'timezone'])],
                  verbose_feature_names_out=False)

When creating a Forecaster with LGBMRegressor, it is necessary to specify how to handle the categorical columns using the fit_kwargs argument. This is because the categorical_feature argument is only specified in the fit method of LGBMRegressor, and not during its initialization.

Modelling and forecasting

ForecasterRecursiveMultiSeries allows modeling time series of different lengths and with different exogenous variables. When the series have different lengths, the data must be transformed into a dictionary whose keys are the series names and whose values are the series themselves. The series_long_to_dict function takes a DataFrame in "long format" and returns a dict of pandas Series. Similarly, when the exogenous variables differ between series (in values or in variables), the data must be transformed into a dictionary whose keys are the series names and whose values are the exogenous variables. The exog_long_to_dict function takes a DataFrame in "long format" and returns a dict of exogenous variables (pandas Series or pandas DataFrames).

When all series have the same length and the same exogenous variables, it is not necessary to use dictionaries. Series can be passed as unique DataFrame with each series in a column, and exogenous variables can be passed as a DataFrame with the same length as the series.
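
A toy sketch of that wide layout (hypothetical series names, not the building data): when every series shares the same index, the long format can simply be pivoted into one column per series.

```python
import pandas as pd

# Two toy series with identical date ranges, stacked in long format
idx = pd.date_range("2024-01-01", periods=3, freq="D")
long_df = pd.DataFrame({
    "timestamp": list(idx) * 2,
    "series_id": ["a"] * 3 + ["b"] * 3,
    "value": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# One column per series over a shared DatetimeIndex: the layout that can
# be passed directly as the series argument when no dictionaries are needed
wide_df = long_df.pivot(index="timestamp", columns="series_id", values="value")
print(wide_df.shape)  # (3, 2)
```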

✎ Note

For more information about modelling series of different lengths and using different exogenous variables, visit Global Forecasting Models: Time series with different lengths and different exogenous variables.
# Exogenous features selected for modeling
# ==============================================================================
exog_features = [
    "primaryspaceusage",
    "sub_primaryspaceusage",
    "timezone",
    "sqm",
    "airTemperature",
    "cloudCoverage",
    "dewTemperature",
    "precipDepth1HR",
    "precipDepth6HR",
    "seaLvlPressure",
    "windDirection",
    "windSpeed",
    "day_of_week_sin",
    "day_of_week_cos",
    "week_sin",
    "week_cos",
    "month_sin",
    "month_cos",
    "airTemperature_window_7D_mean",
    "windSpeed_window_7D_mean",
    "airTemperature_window_14D_mean",
    "windSpeed_window_14D_mean",
]
# Transform series and exog to dictionaries
# ==============================================================================
series_dict = series_long_to_dict(
    data      = data.reset_index(),
    series_id = 'building_id',
    index     = 'timestamp',
    values    = 'meter_reading',
    freq      = 'D'
)

exog_dict = exog_long_to_dict(
    data      = data[exog_features + ['building_id']].reset_index(),
    series_id = 'building_id',
    index     = 'timestamp',
    freq      = 'D'
)

To train the models, search for optimal hyperparameters, and evaluate their predictive performance, the data is divided into three separate sets: training, validation, and test.

# Partition data in train and test
# ==============================================================================
data = data.sort_index()
end_train = '2017-08-31 23:59:00'
end_validation = '2017-10-31 23:59:00'
series_dict_train = {k: v.loc[: end_train,] for k, v in series_dict.items()}
series_dict_valid = {k: v.loc[end_train: end_validation,] for k, v in series_dict.items()}
series_dict_test = {k: v.loc[end_validation:,] for k, v in series_dict.items()}
exog_dict_train = {k: v.loc[: end_train,] for k, v in exog_dict.items()}
exog_dict_valid = {k: v.loc[end_train: end_validation,] for k, v in exog_dict.items()}
exog_dict_test = {k: v.loc[end_validation:,] for k, v in exog_dict.items()}

print(
    f"Rage of dates available : {data.index.min()} --- {data.index.max()} "
    f"(n_days={(data.index.max() - data.index.min()).days})"
)
print(
    f"  Dates for training    : {data.loc[: end_train, :].index.min()} --- {data.loc[: end_train, :].index.max()} "
    f"(n_days={(data.loc[: end_train, :].index.max() - data.loc[: end_train, :].index.min()).days})"
)
print(
    f"  Dates for validation  : {data.loc[end_train:end_validation, :].index.min()} --- {data.loc[end_train:end_validation, :].index.max()} "
    f"(n_days={(data.loc[end_train:end_validation, :].index.max() - data.loc[end_train:end_validation, :].index.min()).days})"
)
print(
    f"  Dates for test        : {data.loc[end_validation:, :].index.min()} --- {data.loc[end_validation:, :].index.max()} "
    f"(n_days={(data.loc[end_validation:, :].index.max() - data.loc[end_validation:, :].index.min()).days})"
)
Range of dates available : 2016-01-01 00:00:00 --- 2017-12-31 00:00:00 (n_days=730)
  Dates for training    : 2016-01-01 00:00:00 --- 2017-08-31 00:00:00 (n_days=608)
  Dates for validation  : 2017-09-01 00:00:00 --- 2017-10-31 00:00:00 (n_days=60)
  Dates for test        : 2017-11-01 00:00:00 --- 2017-12-31 00:00:00 (n_days=60)

Hyperparameter tuning

Hyperparameter and lag tuning involves systematically testing different values or combinations of hyperparameters (and/or lags) to find the optimal configuration that gives the best performance. The skforecast library provides two different methods to evaluate each candidate configuration:

  • Backtesting: In this method, the model predicts several steps ahead in each iteration, using the same forecast horizon and retraining frequency strategy that would be used if the model were deployed. This simulates a real forecasting scenario where the model is retrained and updated over time.

  • One-Step Ahead: Evaluates the model using only one-step-ahead predictions. This method is faster because it requires fewer iterations, but it only tests the model's performance in the immediate next time step (t+1).

Each method uses a different evaluation strategy, so they may produce different results. However, in the long run both methods are expected to converge to similar selections of optimal hyperparameters. Since the one-step-ahead approach evaluates only t+1 predictions, it is recommended to backtest the final model for a more accurate estimate of its multi-step performance.
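
As a rough sketch of the backtesting procedure with refit=False (toy data and a naive last-value model in plain NumPy; this is not the skforecast implementation), the model parameters stay fixed while the prediction origin advances with the observed data:

```python
import numpy as np

y = np.arange(20, dtype=float)   # toy series
initial_train_size = 10
steps = 5                        # forecast horizon per fold

predictions = []
for start in range(initial_train_size, len(y), steps):
    # refit=False: the "model" is not retrained, but its lags are
    # updated with the data observed up to the fold's origin
    origin_value = y[start - 1]          # naive model: repeat last value
    horizon = min(steps, len(y) - start)
    predictions.extend([origin_value] * horizon)

mae = np.mean(np.abs(np.array(predictions) - y[initial_train_size:]))
print(mae)  # 3.0
```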

# Create forecaster
# ==============================================================================
window_features = RollingFeatures(stats=['mean', 'min', 'max'], window_sizes=7)
forecaster = ForecasterRecursiveMultiSeries(
                regressor        = LGBMRegressor(random_state=8520, verbose=-1),
                lags             = 14,
                window_features  = window_features,
                transformer_exog = transformer_exog,
                fit_kwargs       = {'categorical_feature': categorical_features},
                encoding         = "ordinal"
            )
# Bayesian search with OneStepAheadFold
# ==============================================================================
def search_space(trial):
    search_space  = {
        'lags'            : trial.suggest_categorical('lags', [31, 62]),
        'n_estimators'    : trial.suggest_int('n_estimators', 200, 800, step=100),
        'max_depth'       : trial.suggest_int('max_depth', 3, 8, step=1),
        'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 25, 500),
        'learning_rate'   : trial.suggest_float('learning_rate', 0.01, 0.5),
        'feature_fraction': trial.suggest_float('feature_fraction', 0.5, 0.8, step=0.1),
        'max_bin'         : trial.suggest_int('max_bin', 50, 100, step=25),
        'reg_alpha'       : trial.suggest_float('reg_alpha', 0, 1, step=0.1),
        'reg_lambda'      : trial.suggest_float('reg_lambda', 0, 1, step=0.1)
    }

    return search_space

cv = OneStepAheadFold(initial_train_size=608) # size of the training set

results_search, best_trial = bayesian_search_forecaster_multiseries(
    forecaster        = forecaster,
    series            = {k: v.loc[:end_validation,] for k, v in series_dict.items()},
    exog              = {k: v.loc[:end_validation, exog_features] for k, v in exog_dict.items()},
    cv                = cv,
    search_space      = search_space,
    n_trials          = 10,
    metric            = "mean_absolute_error",
    suppress_warnings = True
)

best_params = results_search.at[0, 'params']
best_lags = results_search.at[0, 'lags']
results_search.head(3)
`Forecaster` refitted using the best-found lags and parameters, and the whole data set: 
  Lags: [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
 49 50 51 52 53 54 55 56 57 58 59 60 61 62] 
  Parameters: {'n_estimators': 600, 'max_depth': 6, 'min_data_in_leaf': 188, 'learning_rate': 0.1590191866233202, 'feature_fraction': 0.6, 'max_bin': 100, 'reg_alpha': 0.9, 'reg_lambda': 0.5}
  Backtesting metric: 261.5074045473127
  Levels: ['Bear_assembly_Angel', 'Bear_assembly_Beatrice', 'Bear_assembly_Danial', 'Bear_assembly_Diana', 'Bear_assembly_Genia', 'Bear_assembly_Harry', 'Bear_assembly_Jose', 'Bear_assembly_Roxy', 'Bear_assembly_Ruby', 'Bear_education_Alfredo', '...', 'Wolf_office_Emanuel', 'Wolf_office_Haydee', 'Wolf_office_Joana', 'Wolf_office_Nadia', 'Wolf_office_Rochelle', 'Wolf_public_Norma', 'Wolf_retail_Harriett', 'Wolf_retail_Marcella', 'Wolf_retail_Toshia', 'Wolf_science_Alfreda']

levels lags params mean_absolute_error__weighted_average mean_absolute_error__average mean_absolute_error__pooling n_estimators max_depth min_data_in_leaf learning_rate feature_fraction max_bin reg_alpha reg_lambda
0 [Bear_assembly_Angel, Bear_assembly_Beatrice, ... [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... {'n_estimators': 600, 'max_depth': 6, 'min_dat... 261.507405 261.507405 261.507405 600.0 6.0 188.0 0.159019 0.6 100.0 0.9 0.5
1 [Bear_assembly_Angel, Bear_assembly_Beatrice, ... [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... {'n_estimators': 400, 'max_depth': 5, 'min_dat... 265.896416 265.896416 265.896416 400.0 5.0 437.0 0.132723 0.6 100.0 0.5 0.6
2 [Bear_assembly_Angel, Bear_assembly_Beatrice, ... [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... {'n_estimators': 500, 'max_depth': 5, 'min_dat... 266.299734 266.299734 266.299734 500.0 5.0 227.0 0.163008 0.6 100.0 1.0 0.5

Backtesting on test data

# Backtesting on test set
# ==============================================================================
cv = TimeSeriesFold(
    initial_train_size = 608 + 60, # training + validation
    steps              = 7,
    refit              = False
)
metrics, predictions = backtesting_forecaster_multiseries(
                          forecaster        = forecaster,
                          series            = series_dict,
                          exog              = exog_dict,
                          cv                = cv,
                          metric            = 'mean_absolute_error',
                          suppress_warnings = True
                      )

display(predictions.head())
display(metrics)
level pred
2017-10-30 Bear_assembly_Angel 9561.254764
2017-10-30 Bear_assembly_Beatrice 1112.779749
2017-10-30 Bear_assembly_Danial 4172.709133
2017-10-30 Bear_assembly_Diana 26.412851
2017-10-30 Bear_assembly_Genia 7780.739532
levels mean_absolute_error
0 Bear_assembly_Angel 1244.039948
1 Bear_assembly_Beatrice 244.728577
2 Bear_assembly_Danial 281.653933
3 Bear_assembly_Diana 39.561433
4 Bear_assembly_Genia 563.580736
... ... ...
1576 Wolf_retail_Toshia 519.948486
1577 Wolf_science_Alfreda 220.802361
1578 average 346.329085
1579 weighted_average 346.329085
1580 pooling 346.329085

1581 rows × 2 columns
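The last three rows of the metrics table are aggregations of the per-building errors. As an illustrative sketch (not skforecast internals), the three aggregation strategies can be reproduced with NumPy on toy data. Note that for MAE, `weighted_average` and `pooling` always coincide, and `average` also matches when every series contributes the same number of predictions, as in this backtest.

```python
import numpy as np

# Toy example (assumed data): two series with different numbers of observations
y_true = {"series_a": np.array([10.0, 12.0, 11.0]),   # 3 observations
          "series_b": np.array([100.0, 105.0])}       # 2 observations
y_pred = {"series_a": np.array([11.0, 11.0, 10.0]),
          "series_b": np.array([90.0, 100.0])}

mae_per_series = {k: np.mean(np.abs(y_true[k] - y_pred[k])) for k in y_true}
n_obs = {k: len(y_true[k]) for k in y_true}

# 'average': unweighted mean of the per-series metrics
average = np.mean(list(mae_per_series.values()))

# 'weighted_average': per-series metrics weighted by number of observations
weighted_average = (
    sum(mae_per_series[k] * n_obs[k] for k in y_true) / sum(n_obs.values())
)

# 'pooling': all residuals pooled together, metric computed once
pooled_errors = np.concatenate([np.abs(y_true[k] - y_pred[k]) for k in y_true])
pooling = pooled_errors.mean()

print(average, weighted_average, pooling)  # 4.25 3.6 3.6
```

With unequal series lengths, `average` diverges from the other two because short series get the same weight as long ones.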

# Aggregated metric for all buildings
# ==============================================================================
average_metric_all_buildings = metrics.query("levels == 'average'")["mean_absolute_error"].item()
errors_all_buildings = pd.merge(
    left     = data[['building_id', 'meter_reading']].reset_index(),
    right    = predictions.rename_axis("timestamp").reset_index(),
    left_on  = ['timestamp', 'building_id'],
    right_on = ['timestamp', 'level'],
    how      = 'inner',
    validate = "1:1"
).assign(error=lambda df: df['meter_reading'] - df['pred'])

sum_abs_errors_all_buildings = errors_all_buildings['error'].abs().sum()
sum_bias_all_buildings = errors_all_buildings['error'].sum()
print(f"Average mean absolute error for all buildings: {average_metric_all_buildings:.0f}")
print(f"Sum of absolute errors for all buildings (x 10,000): {sum_abs_errors_all_buildings/10000:.0f}")
print(f"Bias (x 10,000): {sum_bias_all_buildings/10000:.0f}")
Average mean absolute error for all buildings: 346
Sum of absolute errors for all buildings (x 10,000): 3443
Bias (x 10,000): -382
predictions
level pred
2017-10-30 Bear_assembly_Angel 9561.254764
2017-10-30 Bear_assembly_Beatrice 1112.779749
2017-10-30 Bear_assembly_Danial 4172.709133
2017-10-30 Bear_assembly_Diana 26.412851
2017-10-30 Bear_assembly_Genia 7780.739532
... ... ...
2017-12-31 Wolf_public_Norma 1961.736704
2017-12-31 Wolf_retail_Harriett 634.324994
2017-12-31 Wolf_retail_Marcella -80.704692
2017-12-31 Wolf_retail_Toshia 821.663849
2017-12-31 Wolf_science_Alfreda 1528.388111

99414 rows × 2 columns

# Plot predictions vs real value for 2 random buildings
# ==============================================================================
rng = np.random.default_rng(14793)
n_buildings = 2
selected_buildings = rng.choice(data['building_id'].unique(), size=n_buildings, replace=False)

fig, axs = plt.subplots(n_buildings, 1, figsize=(7, 4.5), sharex=True)
axs = axs.flatten()

for i, building in enumerate(selected_buildings):
    data.query("building_id == @building").loc[predictions.index, 'meter_reading'].plot(ax=axs[i], label='test')
    predictions.query("level == @building")['pred'].plot(ax=axs[i], label='predictions')
    axs[i].set_title(f"Building {building}", fontsize=10)
    axs[i].set_xlabel("")
    axs[i].legend()

fig.tight_layout()
plt.show();

Feature selection

Feature selection is the process of selecting a subset of relevant features (variables, predictors) for use in model construction. Feature selection techniques are used for several reasons: to simplify models to make them easier to interpret, to reduce training time, to avoid the curse of dimensionality, to improve generalization by reducing overfitting (formally, variance reduction), and others.

Skforecast is compatible with the feature selection methods implemented in the scikit-learn library. There are several methods for feature selection, but the most common are:

  • Recursive feature elimination (RFE)

  • Sequential Feature Selection (SFS)

  • Feature selection based on threshold (SelectFromModel)
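As a complementary illustration of the threshold-based approach (RFECV is demonstrated on the real data later in this section), here is a minimal scikit-learn sketch on synthetic data. The estimator and threshold are arbitrary choices for the example, not part of the original workflow; any estimator exposing `feature_importances_` or `coef_` can be used.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 100 candidate features, only 5 carry signal
X, y = make_regression(n_samples=500, n_features=100, n_informative=5, random_state=123)

# Keep only features whose importance is at or above the median importance
selector = SelectFromModel(
    estimator = RandomForestRegressor(n_estimators=50, random_state=123),
    threshold = "median"
)
selector.fit(X, y)
n_selected = int(selector.get_support().sum())
print(f"Features selected: {n_selected} of {X.shape[1]}")
```

`selector.transform(X)` then returns the reduced feature matrix, and `get_support()` gives the boolean mask of retained columns.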

💡 Tip

Feature selection is a powerful tool for improving the performance of machine learning models. However, it is computationally expensive and can be time-consuming. Since the goal is to find the best subset of features, not the best model, it is not necessary to use the entire data set or a highly complex model. Instead, it is recommended to use a small subset of the data and a simple model. Once the best subset of features has been identified, the model can then be trained using the entire dataset and a more complex configuration.
# Feature selection (autoregressive and exog) with scikit-learn RFECV
# ==============================================================================
regressor = LGBMRegressor(n_estimators=100, max_depth=5, random_state=15926, verbose=-1)
selector = RFECV(estimator=regressor, step=1, cv=3, n_jobs=1)
selected_lags, selected_window_features, selected_exog = select_features_multiseries(
    forecaster      = forecaster,
    selector        = selector,
    series          = {k: v.loc[:end_validation,] for k, v in series_dict.items()},
    exog            = {k: v.loc[:end_validation, exog_features] for k, v in exog_dict.items()},
    select_only     = None,
    force_inclusion = None,
    subsample       = 0.2,
    random_state    = 123,
    verbose         = True,
)
╭─────────────────────────────── MissingValuesWarning ───────────────────────────────╮
│ NaNs detected in `X_train`. Some regressors do not allow NaN values during training. │
│ If you want to drop them, set `forecaster.dropna_from_series = True`.                │
│                                                                                      │
│ Category : MissingValuesWarning                                                      │
│ Location :                                                                           │
│ /home/ubuntu/anaconda3/envs/skforecast_15_py12/lib/python3.12/site-packages/skforeca │
│ st/recursive/_forecaster_recursive_multiseries.py:1212                               │
│ Suppress : warnings.simplefilter('ignore', category=MissingValuesWarning)            │
╰──────────────────────────────────────────────────────────────────────────────────────╯
Recursive feature elimination (RFECV)
-------------------------------------
Total number of records available: 959424
Total number of records used for feature selection: 191884
Number of features available: 87
    Lags            (n=62)
    Window features (n=3)
    Exog            (n=22)
Number of features selected: 56
    Lags            (n=36) : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17, 18, 19, 21, 22, 28, 29, 30, 33, 34, 35, 42, 44, 49, 50, 51, 54, 56, 57, 60, 62]
    Window features (n=3) : ['roll_mean_7', 'roll_min_7', 'roll_max_7']
    Exog            (n=17) : ['sub_primaryspaceusage', 'timezone', 'sqm', 'airTemperature', 'cloudCoverage', 'dewTemperature', 'seaLvlPressure', 'windDirection', 'windSpeed', 'day_of_week_sin', 'day_of_week_cos', 'week_sin', 'week_cos', 'airTemperature_window_7D_mean', 'windSpeed_window_7D_mean', 'airTemperature_window_14D_mean', 'windSpeed_window_14D_mean']
# Backtesting forecaster with selected features
# ==============================================================================
forecaster = ForecasterRecursiveMultiSeries(
                regressor        = LGBMRegressor(**best_params, random_state=8520, verbose=-1),
                lags             = selected_lags,
                window_features  = window_features,
                transformer_exog = transformer_exog,
                fit_kwargs       = {'categorical_feature': categorical_features},
                encoding         = "ordinal"
            )
cv = TimeSeriesFold(
    initial_train_size = 608 + 60, # training + validation
    steps              = 7,
    refit              = False
)
metrics, predictions = backtesting_forecaster_multiseries(
                          forecaster        = forecaster,
                          series            = series_dict,
                          exog              = {k: v[exog_features] for k, v in exog_dict.items()},
                          cv                = cv,
                          metric            = 'mean_absolute_error',
                          suppress_warnings = True
                      )

display(predictions.head())
display(metrics)
level pred
2017-10-30 Bear_assembly_Angel 9297.191358
2017-10-30 Bear_assembly_Beatrice 996.638961
2017-10-30 Bear_assembly_Danial 4039.385147
2017-10-30 Bear_assembly_Diana 9.180618
2017-10-30 Bear_assembly_Genia 7354.941371
levels mean_absolute_error
0 Bear_assembly_Angel 1221.926524
1 Bear_assembly_Beatrice 218.906181
2 Bear_assembly_Danial 325.030420
3 Bear_assembly_Diana 37.261934
4 Bear_assembly_Genia 642.969646
... ... ...
1576 Wolf_retail_Toshia 514.892785
1577 Wolf_science_Alfreda 228.344802
1578 average 343.951047
1579 weighted_average 343.951047
1580 pooling 343.951047

1581 rows × 2 columns

The number of features included in the model has been reduced without compromising the model's performance. This simplifies the model and speeds up training.

Clustering time series

The idea behind modeling multiple series at the same time is to be able to capture the main patterns that govern the series, thereby reducing the impact of the potential noise that each series may have. This means that series that behave similarly may benefit from being modeled together. One way to identify potential groups of series is to perform a clustering study prior to modeling. If clear groups are identified as a result of clustering, it is appropriate to model each of them separately.

Clustering is an unsupervised analysis technique that groups a set of observations into clusters that contain observations that are considered homogeneous, while observations in different clusters are considered heterogeneous. Algorithms that cluster time series can be divided into two groups: those that use a transformation to create features prior to clustering (feature-driven time series clustering), and those that work directly on the time series (elastic distance measures).

  • Feature-driven time series clustering: Features describing structural characteristics are extracted from each time series and then fed into standard clustering algorithms. These features are obtained by applying statistical operations that best capture the underlying characteristics: trend, seasonality, periodicity, serial correlation, skewness, kurtosis, chaos, nonlinearity, and self-similarity.

  • Elastic distance measures: This approach works directly on the time series, adjusting or "realigning" the series in comparison to each other. The best known of this family of measures is Dynamic Time Warping (DTW).
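As a hypothetical sketch of the feature-driven approach, the snippet below simulates a dictionary of series (standing in for the `series_dict` used in this article), extracts a small feature vector per series (mean, standard deviation, lag-1 autocorrelation), and clusters the vectors with KMeans. The simulated data and the choice of features are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Simulated series (assumed data): 30 series drawn from 3 different regimes
rng = np.random.default_rng(123)
series_dict = {
    f"building_{i}": pd.Series(rng.normal(loc=i % 3, scale=1.0, size=200))
    for i in range(30)
}

# Extract a feature vector per series: mean, std, lag-1 autocorrelation
features = pd.DataFrame({
    name: [s.mean(), s.std(), s.autocorr(lag=1)]
    for name, s in series_dict.items()
}, index=["mean", "std", "autocorr_1"]).T

# Standardize the features and cluster them
X = StandardScaler().fit_transform(features)
clusters = KMeans(n_clusters=3, random_state=123, n_init=10).fit_predict(X)
features["cluster"] = clusters
print(features["cluster"].value_counts())
```

If the resulting clusters are clearly separated, a separate global forecaster could be fitted per cluster instead of one model for all series.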

For a detailed example of how time series clustering can improve forecasting models, see Global Forecasting Models: Comparative Analysis of Single and Multi-Series Forecasting Modeling.
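As a complement, the following is a minimal pure-NumPy implementation of the DTW distance mentioned above; real projects would typically use an optimized library, and the series here are synthetic.

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Classic O(n*m) dynamic time warping distance (illustrative, no window constraint)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Each cell accumulates the cheapest of the three allowed moves
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

# A time-shifted copy of a series stays close under DTW, because the
# alignment is elastic, while a rigid point-wise comparison penalizes the shift
t = np.linspace(0, 2 * np.pi, 50)
s1 = np.sin(t)
s2 = np.sin(t - 0.5)          # same shape, shifted in time
print(dtw_distance(s1, s2))   # small: DTW realigns the shift
print(np.abs(s1 - s2).sum())  # larger: point-wise comparison
```

A DTW distance matrix computed this way can be fed into distance-based clustering algorithms such as hierarchical clustering or k-medoids.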

Session information

import session_info
session_info.show(html=False)
-----
feature_engine      1.8.3
lightgbm            4.6.0
matplotlib          3.10.1
numpy               2.0.2
optuna              4.2.1
pandas              2.2.3
session_info        1.0.0
skforecast          0.15.1
sklearn             1.5.2
-----
IPython             9.0.2
jupyter_client      8.6.3
jupyter_core        5.7.2
notebook            6.4.12
-----
Python 3.12.9 | packaged by Anaconda, Inc. | (main, Feb  6 2025, 18:56:27) [GCC 11.2.0]
Linux-5.15.0-1077-aws-x86_64-with-glibc2.31
-----
Session information updated at 2025-03-28 15:22

Citation

How to cite this document

If you use this document or any part of it, please acknowledge the source, thank you!

Forecasting at scale: modeling thousand time series with a single global model by Joaquín Amat Rodrigo and Javier Escobar Ortiz, available under a CC BY-NC-SA 4.0 license at https://www.cienciadedatos.net/documentos/py59-scalable-forecasting-model.html

How to cite skforecast

If you use skforecast for a publication, we would appreciate it if you cited the published software.

Zenodo:

Amat Rodrigo, Joaquin, & Escobar Ortiz, Javier. (2025). skforecast (v0.15.1). Zenodo. https://doi.org/10.5281/zenodo.8382788

APA:

Amat Rodrigo, J., & Escobar Ortiz, J. (2025). skforecast (Version 0.15.1) [Computer software]. https://doi.org/10.5281/zenodo.8382788

BibTeX:

@software{skforecast, author = {Amat Rodrigo, Joaquin and Escobar Ortiz, Javier}, title = {skforecast}, version = {0.15.1}, month = {03}, year = {2025}, license = {BSD-3-Clause}, url = {https://skforecast.org/}, doi = {10.5281/zenodo.8382788} }


Did you like the article? Your support is important

Your contribution will help me to continue generating free educational content. Many thanks! 😊

Become a GitHub Sponsor

Creative Commons Licence

This work by Joaquín Amat Rodrigo and Javier Escobar Ortiz is licensed under an Attribution-NonCommercial-ShareAlike 4.0 International license.

Allowed:

  • Share: copy and redistribute the material in any medium or format.

  • Adapt: remix, transform, and build upon the material.

Under the following terms:

  • Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

  • NonCommercial: You may not use the material for commercial purposes.

  • ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.