Skforecast: Time series forecasting with python and scikit learn

Skforecast: time series forecasting with Python and Scikit-learn

Joaquín Amat Rodrigo, Javier Escobar Ortiz
February, 2021 (last update September 2023)

Introduction

A time series is a succession of chronologically ordered data spaced at equal or unequal intervals. The forecasting process consists of predicting the future value of a time series, either by modeling the series solely based on its past behavior (autoregressive) or by using other external variables.

This document describes how to use Scikit-learn regression models to perform forecasting on time series. Specifically, it introduces Skforecast, a simple library that contains the classes and functions necessary to adapt any Scikit-learn regression model to forecasting problems.

More examples in skforecast-examples.

Machine learning for forecasting

In order to apply machine learning models to forecasting problems, the time series has to be transformed into a matrix in which each value is related to the time window (lags) that precedes it.

In a time series context, a lag with respect to a time step $t$ is defined as the values of the series at previous time steps. For example, lag 1 is the value at time step $t − 1$ and lag $m$ is the value at time step $t − m$.


Time series transformation into a matrix of 5 lags and a vector with the value of the series that follows each row of the matrix.

This type of transformation also allows to include additional variables.

Time series transformation including an exogenous variable.

Once data have been rearranged into the new shape, any regression model can be trained to predict the next value (step) of the series. During model training, every row is considered a separate data instance, where values at lags 1, 2, ... $p$ are considered predictors for the target quantity of the time series at time step $p + 1$.

Multi-Step Time Series Forecasting

When working with time series, it is seldom needed to predict only the next element in the series ($t_{+1}$). Instead, the most common goal is to predict a whole future interval (($t_{+1}$), ..., ($t_{+n}$)) or a far point in time ($t_{+n}$). Several strategies allow generating this type of prediction.

Recursive multi-step forecasting

Since the value $t_{n-1}$ is required to predict $t_{n}$, and $t_{n-1}$ is unknown, a recursive process is applied in which, each new prediction, is based on the previous one. This process is known as recursive forecasting or recursive multi-step forecasting and can be easily generated with the ForecasterAutoreg and ForecasterAutoregCustom classes.

Recursive multi-step prediction process diagram to predict 3 steps into the future using the last 4 lags of the series as predictors.

Direct multi-step forecasting

Direct multi-step forecasting consists of training a different model for each step of the forecast horizon. For example, to predict the next 5 values of a time series, 5 different models are trained, one for each step. As a result, the predictions are independent of each other.

Direct multi-step prediction process diagram to predict 3 steps into the future using the last 4 lags of the series as predictors.


The main complexity of this approach is to generate the correct training matrices for each model. The ForecasterAutoregDirect class of the skforecast library automates this process. It is also important to bear in mind that this strategy has a higher computational cost since it requires the train of multiple models. The following diagram shows the process for a case in which the response variable and two exogenous variables are available.

Transformation of a time series into matrices to train a direct multi-step forecasting model


Multiple output forecasting

Some machine learning models, such as long short-term memory (LSTM) neural network, can predict simultaneously several values of a sequence (one-shot). This strategy is not currently implemented in skforecast library.

Libraries

The libraries used in this document are:

In [1]:
# Data manipulation
# ==============================================================================
import numpy as np
import pandas as pd

# Plots
# ==============================================================================
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
plt.rcParams['lines.linewidth'] = 1.5
plt.rcParams['font.size'] = 10

# Modeling and Forecasting
# ==============================================================================
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler

from skforecast.ForecasterAutoreg import ForecasterAutoreg
from skforecast.ForecasterAutoregCustom import ForecasterAutoregCustom
from skforecast.ForecasterAutoregDirect import ForecasterAutoregDirect
from skforecast.model_selection import grid_search_forecaster
from skforecast.model_selection import backtesting_forecaster
from skforecast.utils import save_forecaster
from skforecast.utils import load_forecaster

# Warnings configuration
# ==============================================================================
import warnings
# warnings.filterwarnings('ignore')

Data

A time series is available with the monthly expenditure (millions of dollars) on corticosteroid drugs that the Australian health system had between 1991 and 2008. It is intended to create an autoregressive model capable of predicting future monthly expenditures.

The data used in the examples of this document have been obtained from the magnificent book Forecasting: Principles and Practice by Rob J Hyndman and George Athanasopoulos.

In [2]:
# Data download
# ==============================================================================
url = 'https://raw.githubusercontent.com/JoaquinAmatRodrigo/skforecast/master/data/h2o_exog.csv'
data = pd.read_csv(url, sep=',')

The column date has been stored as a string. To convert it to datetime the pd.to_datetime() function can be use. Once in datetime format, and to make use of Pandas functionalities, it is set as an index. Also, since the data is monthly, the frequency is set as Monthly Started 'MS'.

In [3]:
# Data preparation
# ==============================================================================
data = data.rename(columns={'fecha': 'date'})
data['date'] = pd.to_datetime(data['date'], format='%Y-%m-%d')
data = data.set_index('date')
data = data.rename(columns={'x': 'y'})
data = data.asfreq('MS')
data = data.sort_index()
data.head()
Out[3]:
y exog_1 exog_2
date
1992-04-01 0.379808 0.958792 1.166029
1992-05-01 0.361801 0.951993 1.117859
1992-06-01 0.410534 0.952955 1.067942
1992-07-01 0.483389 0.958078 1.097376
1992-08-01 0.475463 0.956370 1.122199

When setting a frequency with the asfreq() method, Pandas fills the gaps that may exist in the time series with the value of Null to ensure the indicated frequency. Therefore, it should be checked if missing values have appeared after this transformation.

In [4]:
print(f'Number of rows with missing values: {data.isnull().any(axis=1).mean()}')
Number of rows with missing values: 0.0

Although it is unnecessary, since a frequency has been established, it is possible to verify that the time series is complete.

In [5]:
# Verify that a temporary index is complete
# ==============================================================================
(data.index == pd.date_range(start=data.index.min(),
                             end=data.index.max(),
                             freq=data.index.freq)).all()
Out[5]:
True
In [6]:
# Fill gaps in a temporary index
# ==============================================================================
# data.asfreq(freq='30min', fill_value=np.nan)

The last 36 months are used as the test set to evaluate the predictive capacity of the model.

In [7]:
# Split data into train-test
# ==============================================================================
steps = 36
data_train = data[:-steps]
data_test  = data[-steps:]

print(f"Train dates : {data_train.index.min()} --- {data_train.index.max()}  (n={len(data_train)})")
print(f"Test dates  : {data_test.index.min()} --- {data_test.index.max()}  (n={len(data_test)})")

fig, ax = plt.subplots(figsize=(6, 2.5))
data_train['y'].plot(ax=ax, label='train')
data_test['y'].plot(ax=ax, label='test')
ax.legend();
Train dates : 1992-04-01 00:00:00 --- 2005-06-01 00:00:00  (n=159)
Test dates  : 2005-07-01 00:00:00 --- 2008-06-01 00:00:00  (n=36)

Recursive autoregressive forecasting

ForecasterAutoreg

With the ForecasterAutoreg class, a model is created and trained from a RandomForestRegressor regressor with a time window of 6 lags. This means that the model uses the previous 6 months as predictors.

In [8]:
# Create and train forecaster
# ==============================================================================
forecaster = ForecasterAutoreg(
                regressor = RandomForestRegressor(random_state=123),
                lags      = 6
             )

forecaster.fit(y=data_train['y'])
forecaster
Out[8]:
================= 
ForecasterAutoreg 
================= 
Regressor: RandomForestRegressor(random_state=123) 
Lags: [1 2 3 4 5 6] 
Transformer for y: None 
Transformer for exog: None 
Window size: 6 
Weight function included: False 
Differentiation order: None 
Exogenous included: False 
Type of exogenous variable: None 
Exogenous variables names: None 
Training range: [Timestamp('1992-04-01 00:00:00'), Timestamp('2005-06-01 00:00:00')] 
Training index type: DatetimeIndex 
Training index frequency: MS 
Regressor parameters: {'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'squared_error', 'max_depth': None, 'max_features': 1.0, 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': 123, 'verbose': 0, 'warm_start': False} 
fit_kwargs: {} 
Creation date: 2023-09-07 14:55:22 
Last fit date: 2023-09-07 14:55:23 
Skforecast version: 0.10.0 
Python version: 3.11.4 
Forecaster id: None 

Predictions

Once the model is trained, the test data is predicted (36 months into the future).

In [9]:
# Predictions
# ==============================================================================
steps = 36
predictions = forecaster.predict(steps=steps)
predictions.head(5)
Out[9]:
2005-07-01    0.878756
2005-08-01    0.882167
2005-09-01    0.973184
2005-10-01    0.983678
2005-11-01    0.849494
Freq: MS, Name: pred, dtype: float64
In [10]:
# Plot
# ==============================================================================
fig, ax = plt.subplots(figsize=(6, 2.5))
data_train['y'].plot(ax=ax, label='train')
data_test['y'].plot(ax=ax, label='test')
predictions.plot(ax=ax, label='predictions')
ax.legend();

Prediction error in the test set

The error that the model makes in its predictions is quantified. In this case, the metric used is the mean squared error (mse).

In [11]:
# Test error
# ==============================================================================
error_mse = mean_squared_error(
                y_true = data_test['y'],
                y_pred = predictions
            )

print(f"Test error (mse): {error_mse}")
Test error (mse): 0.07326833976120374

Hyperparameter tuning

The trained ForecasterAutoreg uses a 6 lag time window and a Random Forest model with the default hyperparameters. However, there is no reason why these values are the most suitable. Skforecast library provides the grid_search_forecaster function. It compares the results obtained with multiple combinations of hyperparameters and lags, and identify the best one.

🖉 Note

The computational cost of hyperparameter tuning depends heavily on the backtesting approach chosen to evaluate each hyperparameter combination. In general, the duration of the tuning process increases with the number of re-trains involved in the backtesting. To effectively speed up the prototyping phase, it is highly recommended to adopt a two-step strategy. First, use refit=False during the initial search to narrow down the range of values. Then, focus on the identified region of interest and apply a tailored backtesting strategy that meets the specific requirements of the use case. For additional tips on backtesting strategies, refer to Hyperparameter tuning and lags selection
In [12]:
# Hyperparameter grid search
# ==============================================================================
steps = 36
forecaster = ForecasterAutoreg(
                regressor = RandomForestRegressor(random_state=123),
                lags      = 12 # This value will be replaced in the grid search
             )

# Lags used as predictors
lags_grid = [10, 20]

# Regressor's hyperparameters
param_grid = {'n_estimators': [100, 500],
              'max_depth': [3, 5, 10]}

results_grid = grid_search_forecaster(
                        forecaster         = forecaster,
                        y                  = data_train['y'],
                        param_grid         = param_grid,
                        lags_grid          = lags_grid,
                        steps              = steps,
                        refit              = False,
                        metric             = 'mean_squared_error',
                        initial_train_size = int(len(data_train)*0.5),
                        fixed_train_size   = False,
                        return_best        = True,
                        n_jobs             = 'auto',
                        verbose            = False
               )
Number of models compared: 12.
`Forecaster` refitted using the best-found lags and parameters, and the whole data set: 
  Lags: [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20] 
  Parameters: {'max_depth': 3, 'n_estimators': 500}
  Backtesting metric: 0.021992765856921042

In [13]:
# Grid Search results
# ==============================================================================
results_grid
Out[13]:
lags params mean_squared_error max_depth n_estimators
7 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... {'max_depth': 3, 'n_estimators': 500} 0.021993 3 500
9 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... {'max_depth': 5, 'n_estimators': 500} 0.022114 5 500
11 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... {'max_depth': 10, 'n_estimators': 500} 0.022224 10 500
8 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... {'max_depth': 5, 'n_estimators': 100} 0.022530 5 100
6 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... {'max_depth': 3, 'n_estimators': 100} 0.022569 3 100
10 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... {'max_depth': 10, 'n_estimators': 100} 0.023400 10 100
0 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] {'max_depth': 3, 'n_estimators': 100} 0.063144 3 100
1 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] {'max_depth': 3, 'n_estimators': 500} 0.064775 3 500
4 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] {'max_depth': 10, 'n_estimators': 100} 0.066307 10 100
2 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] {'max_depth': 5, 'n_estimators': 100} 0.067151 5 100
3 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] {'max_depth': 5, 'n_estimators': 500} 0.067227 5 500
5 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] {'max_depth': 10, 'n_estimators': 500} 0.068109 10 500

The best results are obtained using a time window of 20 lags and a Random Forest set up of {'max_depth': 3, 'n_estimators': 500}.

Final model

Finally, a ForecasterAutoreg is trained with the optimal configuration found by validation. This step is not necessary if return_best = True is specified in the grid_search_forecaster function.

In [14]:
# Create and train forecaster with the best hyperparameters
# ==============================================================================
regressor = RandomForestRegressor(max_depth=3, n_estimators=500, random_state=123)
forecaster = ForecasterAutoreg(
                regressor = regressor,
                lags      = 20
             )

forecaster.fit(y=data_train['y'])
In [15]:
# Predictions
# ==============================================================================
predictions = forecaster.predict(steps=steps)
In [16]:
# Plot
# ==============================================================================
fig, ax = plt.subplots(figsize=(6, 2.5))
data_train['y'].plot(ax=ax, label='train')
data_test['y'].plot(ax=ax, label='test')
predictions.plot(ax=ax, label='predictions')
ax.legend();
In [17]:
# Test error
# ==============================================================================
error_mse = mean_squared_error(
                y_true = data_test['y'],
                y_pred = predictions
            )

print(f"Test error (mse): {error_mse}")
Test error (mse): 0.004392699665157793

The optimal combination of hyperparameters significantly reduces test error.

Backtesting

The process of backtesting consists of evaluating the performance of a predictive model by applying it retrospectively to historical data. Therefore, it is a special type of cross-validation applied to the previous period(s).

🖉 Note

To ensure an accurate evaluation of your model and gain confidence in its predictive performance on new data, it is critical to employ an appropriate backtesting strategy. Factors such as use case characteristics, available computing resources and time intervals between predictions need to be considered to determine which strategy to use. In general, the more closely the backtesting process resembles the actual scenario in which the model is used, the more reliable the estimated metric will be. For more information about backtesting, visit Which strategy should I use?.

Backtesting with refit and increasing training size (fixed origin)

The model is trained each time before making predictions. With this configuration, the model uses all the data available so far. It is a variation of the standard cross-validation but, instead of making a random distribution of the observations, the training set increases sequentially, maintaining the temporal order of the data.

Time series backtesting diagram with an initial training size of 10 observations, a prediction horizon of 3 steps, and retraining at each iteration.

Backtesting with refit and fixed training size (rolling origin)

A technique similar to the previous one but, in this case, the forecast origin rolls forward, therefore, the size of training remains constant. This is also known as time series cross-validation or walk-forward validation.

Time series backtesting diagram with an initial training size of 10 observations, a prediction horizon of 3 steps, and a training set of constant size.

Backtesting with intermittent refit

The model is retrained every $n$ iterations of predictions.

🖉 Note

This strategy usually achieves a good balance between the computational cost of retraining and avoiding model degradation.


Backtesting with intermittent refit.

Backtesting without refit

After an initial train, the model is used sequentially without updating it and following the temporal order of the data. This strategy has the advantage of being much faster since the model is trained only once. However, the model does not incorporate the latest data available, so it may lose predictive capacity over time.

Time series backtesting diagram with an initial training size of 10 observations, a prediction horizon of 3 steps, and no retraining at each iteration.

skforecast library has multiple backtesting strategies implemented. Regardless of which one is used, it is important not to include test data in the search process to avoid overfitting problems.

For this example, a backtesting with refit strategy is followed. Internally, the process that the function applies is:

  • In the first iteration, the model is trained with the observations selected for the initial training (in this case, 87). Then, the next 36 observations are used to validate the predictions of this first model.

  • In the second iteration, the model is retrained by adding, to the initial training set, the previous 36 validation observations (87 + 36). In the same way, the next 36 observations are established as the new validation set.

  • This process is repeated until all available observations are used. Following this strategy, the training set increases in each iteration with as many observations as steps are being predicted.

In [18]:
# Backtesting
# ==============================================================================
steps = 36
n_backtesting = 36*3 # The last 9 years are separated for the backtest

metric, predictions_backtest = backtesting_forecaster(
                                    forecaster         = forecaster,
                                    y                  = data['y'],
                                    initial_train_size = len(data) - n_backtesting,
                                    fixed_train_size   = False,
                                    steps              = steps,
                                    metric             = 'mean_squared_error',
                                    refit              = True,
                                    verbose            = True,
                                    show_progress      = True
                                )

print(f"Backtest error: {metric}")
Information of backtesting process
----------------------------------
Number of observations used for initial training: 87
Number of observations used for backtesting: 108
    Number of folds: 3
    Number of steps per fold: 36
    Number of steps to exclude from the end of each train set before test (gap): 0

Fold: 0
    Training:   1992-04-01 00:00:00 -- 1999-06-01 00:00:00  (n=87)
    Validation: 1999-07-01 00:00:00 -- 2002-06-01 00:00:00  (n=36)
Fold: 1
    Training:   1992-04-01 00:00:00 -- 2002-06-01 00:00:00  (n=123)
    Validation: 2002-07-01 00:00:00 -- 2005-06-01 00:00:00  (n=36)
Fold: 2
    Training:   1992-04-01 00:00:00 -- 2005-06-01 00:00:00  (n=159)
    Validation: 2005-07-01 00:00:00 -- 2008-06-01 00:00:00  (n=36)

Backtest error: 0.010578977232387663
In [19]:
# Plot backtest predictions vs real values
# ==============================================================================
fig, ax = plt.subplots(figsize=(6, 2.5))
data.loc[predictions_backtest.index, 'y'].plot(ax=ax, label='test')
predictions_backtest.plot(ax=ax, label='predictions')
ax.legend();

Predictors importance

Since the ForecasterAutoreg object uses Scikit-learn models, the importance of predictors can be accessed once trained. When the regressor used is a LinearRegression(), Lasso() or Ridge(), the coefficients of the model reflect their importance. In GradientBoostingRegressor() or RandomForestRegressor() regressors, the importance of predictors is based on impurity.

🖉 Note

get_feature_importances() method only returns values if the regressor used within the forecaster has the attribute coef_ or feature_importances_.
In [20]:
# Predictors importances
# ==============================================================================
forecaster.get_feature_importances()
Out[20]:
feature importance
0 lag_1 0.009412
1 lag_2 0.087268
2 lag_3 0.012754
3 lag_4 0.001446
4 lag_5 0.000401
5 lag_6 0.001386
6 lag_7 0.001273
7 lag_8 0.006926
8 lag_9 0.005839
9 lag_10 0.013076
10 lag_11 0.008868
11 lag_12 0.816041
12 lag_13 0.001266
13 lag_14 0.019411
14 lag_15 0.008746
15 lag_16 0.001766
16 lag_17 0.000578
17 lag_18 0.000329
18 lag_19 0.000853
19 lag_20 0.002359

Recursive autoregressive forecasting with exogenous variables

In the previous example, only lags of the predicted variable itself have been used as predictors. In certain scenarios, it is possible to have information about other variables, whose future value is known, so could serve as additional predictors in the model.

Continuing with the previous example, a new variable whose behavior is correlated with the modeled time series and it is wanted to incorporate as a predictor is simulated. The same applies to multiple exogenous variables.

Data

In [21]:
# Data download
# ==============================================================================
url = 'https://raw.githubusercontent.com/JoaquinAmatRodrigo/skforecast/master/data/h2o_exog.csv'
data = pd.read_csv(url, sep=',')

# Data preparation
# ==============================================================================
data = data.rename(columns={'fecha': 'date'})
data['date'] = pd.to_datetime(data['date'], format='%Y-%m-%d')
data = data.set_index('date')
data = data.asfreq('MS')
data = data.sort_index()

fig, ax = plt.subplots(figsize=(6, 2.5))
data['y'].plot(ax=ax, label='y')
data['exog_1'].plot(ax=ax, label='exogenous variable')
ax.legend();
In [22]:
# Split data into train-test
# ==============================================================================
steps = 36
data_train = data[:-steps]
data_test  = data[-steps:]

print(f"Train dates : {data_train.index.min()} --- {data_train.index.max()}  (n={len(data_train)})")
print(f"Test dates  : {data_test.index.min()} --- {data_test.index.max()}  (n={len(data_test)})")
Train dates : 1992-04-01 00:00:00 --- 2005-06-01 00:00:00  (n=159)
Test dates  : 2005-07-01 00:00:00 --- 2008-06-01 00:00:00  (n=36)

ForecasterAutoreg

In [23]:
# Create and train forecaster
# ==============================================================================
forecaster = ForecasterAutoreg(
                regressor = RandomForestRegressor(random_state=123),
                lags      = 8
             )

forecaster.fit(y=data_train['y'], exog=data_train['exog_1'])
forecaster
Out[23]:
================= 
ForecasterAutoreg 
================= 
Regressor: RandomForestRegressor(random_state=123) 
Lags: [1 2 3 4 5 6 7 8] 
Transformer for y: None 
Transformer for exog: None 
Window size: 8 
Weight function included: False 
Differentiation order: None 
Exogenous included: True 
Type of exogenous variable: <class 'pandas.core.series.Series'> 
Exogenous variables names: exog_1 
Training range: [Timestamp('1992-04-01 00:00:00'), Timestamp('2005-06-01 00:00:00')] 
Training index type: DatetimeIndex 
Training index frequency: MS 
Regressor parameters: {'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'squared_error', 'max_depth': None, 'max_features': 1.0, 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': 123, 'verbose': 0, 'warm_start': False} 
fit_kwargs: {} 
Creation date: 2023-09-07 14:55:45 
Last fit date: 2023-09-07 14:55:45 
Skforecast version: 0.10.0 
Python version: 3.11.4 
Forecaster id: None 

Predictions

If the ForecasterAutoreg is trained with an exogenous variable, the value of this variable must be passed to predict(). It is only applicable to scenarios in which future information on the exogenous variable is available.

In [24]:
# Predictions
# ==============================================================================
predictions = forecaster.predict(steps=steps, exog=data_test['exog_1'])
In [25]:
# Plot
# ==============================================================================
fig, ax = plt.subplots(figsize=(6, 2.5))
data_train['y'].plot(ax=ax, label='train')
data_test['y'].plot(ax=ax, label='test')
predictions.plot(ax=ax, label='predictions')
ax.legend();

Prediction error in the test set

In [26]:
# Test error
# ==============================================================================
error_mse = mean_squared_error(
                y_true = data_test['y'],
                y_pred = predictions
            )

print(f"Test error (mse): {error_mse}")
Test error (mse): 0.03989087922533575

Hyperparameter tuning

In [27]:
# Hyperparameter Grid search
# ==============================================================================
steps = 36
forecaster = ForecasterAutoreg(
                regressor = RandomForestRegressor(random_state=123),
                lags      = 12 # This value will be replaced in the grid search
             )

lags_grid = [5, 12, 20]

param_grid = {'n_estimators': [50, 100, 500],
              'max_depth': [3, 5, 10]}

results_grid = grid_search_forecaster(
                    forecaster  = forecaster,
                    y           = data_train['y'],
                    exog        = data_train['exog_1'],
                    param_grid  = param_grid,
                    lags_grid   = lags_grid,
                    steps       = steps,
                    refit       = False,
                    metric      = 'mean_squared_error',
                    initial_train_size = int(len(data_train)*0.5),
                    return_best = True,
                    n_jobs      = 'auto',
                    verbose     = False
               )
Number of models compared: 27.
`Forecaster` refitted using the best-found lags and parameters, and the whole data set: 
  Lags: [ 1  2  3  4  5  6  7  8  9 10 11 12] 
  Parameters: {'max_depth': 10, 'n_estimators': 50}
  Backtesting metric: 0.020919321798477056

In [28]:
# Grid Search results
# ==============================================================================
results_grid.head()
Out[28]:
lags params mean_squared_error max_depth n_estimators
15 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12] {'max_depth': 10, 'n_estimators': 50} 0.020919 10 50
12 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12] {'max_depth': 5, 'n_estimators': 50} 0.021596 5 50
23 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... {'max_depth': 5, 'n_estimators': 500} 0.021759 5 500
20 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... {'max_depth': 3, 'n_estimators': 500} 0.021839 3 500
26 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... {'max_depth': 10, 'n_estimators': 500} 0.021968 10 500

The best results are obtained using a time window of 12 lags and a Random Forest set up of {'max_depth': 10, 'n_estimators': 50}.

Final model

Setting return_best = True in grid_search_forecaster, after the search, the ForecasterAutoreg object has been modified and trained with the best configuration found.

In [29]:
# Predictions
# ==============================================================================
predictions = forecaster.predict(steps=steps, exog=data_test['exog_1'])

# Plot
# ==============================================================================
fig, ax = plt.subplots(figsize=(6, 2.5))
data_train['y'].plot(ax=ax, label='train')
data_test['y'].plot(ax=ax, label='test')
predictions.plot(ax=ax, label='predictions')
ax.legend();
In [30]:
# Test error
# ==============================================================================
error_mse = mean_squared_error(y_true = data_test['y'], y_pred = predictions)
print(f"Test error (mse) {error_mse}")
Test error (mse) 0.0044949967759907675

Recursive autoregressive forecasting with custom predictors

In addition to the lags, it may be interesting to incorporate other characteristics of the time series in some scenarios. For example, the moving average of the last n values could be used to capture the series's trend.

The ForecasterAutoregCustom class behaves very similar to the ForecasterAutoreg class seen in the previous sections, but with the difference that it is the user who defines the function used to create the predictors.

The first example of the document about predicting the last 36 months of the time series is repeated. In this case, the predictors are the first 10 lags and the values' moving average of the last 20 months.

Data

In [31]:
# Data download
# ==============================================================================
url = 'https://raw.githubusercontent.com/JoaquinAmatRodrigo/skforecast/master/data/h2o_exog.csv'
data = pd.read_csv(url, sep=',')

# Data preparation
# ==============================================================================
data = data.rename(columns={'fecha': 'date'})
data['date'] = pd.to_datetime(data['date'], format='%Y-%m-%d')
data = data.set_index('date')
data = data.rename(columns={'x': 'y'})
data = data.asfreq('MS')
data = data.sort_index()

# Split data into train-test
# ==============================================================================
steps = 36
data_train = data[:-steps]
data_test  = data[-steps:]

print(f"Train dates : {data_train.index.min()} --- {data_train.index.max()}  (n={len(data_train)})")
print(f"Test dates  : {data_test.index.min()} --- {data_test.index.max()}  (n={len(data_test)})")
Train dates : 1992-04-01 00:00:00 --- 2005-06-01 00:00:00  (n=159)
Test dates  : 2005-07-01 00:00:00 --- 2008-06-01 00:00:00  (n=36)

ForecasterAutoregCustom

A ForecasterAutoregCustom is created and trained from a RandomForestRegressor regressor. The create_predictor() function, which calculates the first 10 lags and the moving average of the last 20 values, is used to create the predictors.

In [32]:
# Function to calculate predictors from time series
# ==============================================================================
def custom_predictors(y):
    '''
    Create first 10 lags of a time series.
    Calculate moving average with window 20.
    '''
    
    lags = y[-1:-11:-1]
    mean = np.mean(y[-20:])
    predictors = np.hstack([lags, mean])
    
    return predictors

⚠ Warning

When creating the forecaster, the window_size argument must be equal to or greater than the window used by the function that creates the predictors. This value, in this case, is 20.


In [33]:
# Create and train forecaster
# ==============================================================================
forecaster = ForecasterAutoregCustom(
                regressor      = RandomForestRegressor(random_state=123),
                fun_predictors = custom_predictors,
                window_size    = 20
             )

forecaster.fit(y=data_train['y'])
forecaster
Out[33]:
======================= 
ForecasterAutoregCustom 
======================= 
Regressor: RandomForestRegressor(random_state=123) 
Predictors created with function: custom_predictors 
Transformer for y: None 
Transformer for exog: None 
Window size: 20 
Weight function included: False 
Differentiation order: None 
Exogenous included: False 
Type of exogenous variable: None 
Exogenous variables names: None 
Training range: [Timestamp('1992-04-01 00:00:00'), Timestamp('2005-06-01 00:00:00')] 
Training index type: DatetimeIndex 
Training index frequency: MS 
Regressor parameters: {'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'squared_error', 'max_depth': None, 'max_features': 1.0, 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': 123, 'verbose': 0, 'warm_start': False} 
fit_kwargs: {} 
Creation date: 2023-09-07 14:56:10 
Last fit date: 2023-09-07 14:56:10 
Skforecast version: 0.10.0 
Python version: 3.11.4 
Forecaster id: None 

It is possible to access the custom function used to create the predictors.

In [34]:
print(forecaster.source_code_fun_predictors)
def custom_predictors(y):
    '''
    Create first 10 lags of a time series.
    Calculate moving average with window 20.
    '''
    
    lags = y[-1:-11:-1]
    mean = np.mean(y[-20:])
    predictors = np.hstack([lags, mean])
    
    return predictors

Using the method create_train_X_y, is it posible to acces the matrices that are created internally in the training process.

In [35]:
X, y = forecaster.create_train_X_y(y=data_train['y'])
X.head(4)
Out[35]:
custom_predictor_0 custom_predictor_1 custom_predictor_2 custom_predictor_3 custom_predictor_4 custom_predictor_5 custom_predictor_6 custom_predictor_7 custom_predictor_8 custom_predictor_9 custom_predictor_10
date
1993-12-01 0.699605 0.632947 0.601514 0.558443 0.509210 0.470126 0.428859 0.413890 0.427283 0.387554 0.523089
1994-01-01 0.963081 0.699605 0.632947 0.601514 0.558443 0.509210 0.470126 0.428859 0.413890 0.427283 0.552253
1994-02-01 0.819325 0.963081 0.699605 0.632947 0.601514 0.558443 0.509210 0.470126 0.428859 0.413890 0.575129
1994-03-01 0.437670 0.819325 0.963081 0.699605 0.632947 0.601514 0.558443 0.509210 0.470126 0.428859 0.576486
In [36]:
y.head(4)
Out[36]:
date
1993-12-01    0.963081
1994-01-01    0.819325
1994-02-01    0.437670
1994-03-01    0.506121
Freq: MS, Name: y, dtype: float64

Predictions

In [37]:
# Predictions
# ==============================================================================
steps = 36
predictions = forecaster.predict(steps=steps)
In [38]:
# Plot
# ==============================================================================
fig, ax = plt.subplots(figsize=(6, 2.5))
data_train['y'].plot(ax=ax, label='train')
data_test['y'].plot(ax=ax, label='test')
predictions.plot(ax=ax, label='predictions')
ax.legend();

Prediction error in the test set

In [39]:
# Test error
# ==============================================================================
error_mse = mean_squared_error(
                y_true = data_test['y'],
                y_pred = predictions
            )
print(f"Test error (mse): {error_mse}")
Test error (mse): 0.046232546768232

Hyperparameter tuning

When using the grid_search_forecaster function with a ForecasterAutoregCustom, thelags_grid argument is not specified.

In [40]:
# Hyperparameter Grid search
# ==============================================================================
steps = 36
forecaster = ForecasterAutoregCustom(
                regressor      = RandomForestRegressor(random_state=123),
                fun_predictors = custom_predictors,
                window_size    = 20
             )

# Regressor's hyperparameters
param_grid = {'n_estimators': [100, 500],
              'max_depth': [3, 5, 10]}

results_grid = grid_search_forecaster(
                    forecaster         = forecaster,
                    y                  = data_train['y'],
                    param_grid         = param_grid,
                    steps              = steps,
                    refit              = False,
                    metric             = 'mean_squared_error',
                    initial_train_size = int(len(data_train)*0.5),
                    fixed_train_size   = False,
                    return_best        = True,
                    n_jobs             = 'auto',
                    verbose            = False
               ) 
Number of models compared: 6.
`Forecaster` refitted using the best-found lags and parameters, and the whole data set: 
  Lags: custom predictors 
  Parameters: {'max_depth': 3, 'n_estimators': 100}
  Backtesting metric: 0.0696399122820351

In [41]:
# Grid Search results
# ==============================================================================
results_grid.head(5)
Out[41]:
lags params mean_squared_error max_depth n_estimators
0 custom predictors {'max_depth': 3, 'n_estimators': 100} 0.069640 3 100
1 custom predictors {'max_depth': 3, 'n_estimators': 500} 0.070582 3 500
5 custom predictors {'max_depth': 10, 'n_estimators': 500} 0.070979 10 500
2 custom predictors {'max_depth': 5, 'n_estimators': 100} 0.070994 5 100
3 custom predictors {'max_depth': 5, 'n_estimators': 500} 0.071184 5 500

Final model

In [42]:
# Predictions
# ==============================================================================
predictions = forecaster.predict(steps=steps)

# Plot
# ==============================================================================
fig, ax = plt.subplots(figsize=(6, 2.5))
data_train['y'].plot(ax=ax, label='train')
data_test['y'].plot(ax=ax, label='test')
predictions.plot(ax=ax, label='predictions')
ax.legend();
In [43]:
# Test error
# ==============================================================================
error_mse = mean_squared_error(y_true = data_test['y'], y_pred = predictions)
print(f"Test error (mse) {error_mse}")
Test error (mse) 0.035435886844525276

Direct multi-step forecasting

The ForecasterAutoreg and ForecasterAutoregCustom models follow a recursive prediction strategy in which each new prediction builds on the previous one. An alternative is to train a model for each of the steps to be predicted. This strategy, commonly known as direct multi-step forecasting, is computationally more expensive than recursive since it requires training several models. However, in some scenarios, it achieves better results. These kinds of models can be obtained with the ForecasterAutoregDirect class and can include one or multiple exogenous variables.

⚠ Warning

`ForecasterAutoregDirect` may require very long training times, as one model is fitted for each step.

ForecasterAutoregDirect

Unlike when using ForecasterAutoreg or ForecasterAutoregCustom, the number of steps to be predicted must be indicated in the ForecasterAutoregDirect type models. This means that the number of predictions obtained when executing the predict() method is always the same. It is not possible to predict steps beyond the value defined at their creation.

For this example, a linear model with Lasso penalty is used as a regressor. These models require the predictors to be standardized, so it is combined with a StandardScaler.

For more detailed documentation how to use transformers and pipelines, visit: skforecast with scikit-learn and transformers pipelines.

In [44]:
# Create and train forecaster
# ==============================================================================
forecaster = ForecasterAutoregDirect(
                regressor     = Lasso(random_state=123),
                transformer_y = StandardScaler(),
                steps         = 36,
                lags          = 8
             )

forecaster
Out[44]:
======================= 
ForecasterAutoregDirect 
======================= 
Regressor: Lasso(random_state=123) 
Lags: [1 2 3 4 5 6 7 8] 
Transformer for y: StandardScaler() 
Transformer for exog: None 
Weight function included: False 
Window size: 8 
Maximum steps predicted: 36 
Exogenous included: False 
Type of exogenous variable: None 
Exogenous variables names: None 
Training range: None 
Training index type: None 
Training index frequency: None 
Regressor parameters: {'alpha': 1.0, 'copy_X': True, 'fit_intercept': True, 'max_iter': 1000, 'positive': False, 'precompute': False, 'random_state': 123, 'selection': 'cyclic', 'tol': 0.0001, 'warm_start': False} 
fit_kwargs: {} 
Creation date: 2023-09-07 14:56:21 
Last fit date: None 
Skforecast version: 0.10.0 
Python version: 3.11.4 
Forecaster id: None 
In [45]:
# Hyperparameter Grid search
# ==============================================================================
from skforecast.exceptions import LongTrainingWarning
warnings.simplefilter('ignore', category=LongTrainingWarning)

forecaster = ForecasterAutoregDirect(
                regressor     = Lasso(random_state=123),
                transformer_y = StandardScaler(),
                steps         = 36,
                lags          = 8 # This value will be replaced in the grid search
             )

param_grid = {'alpha': np.logspace(-5, 5, 10)}
lags_grid = [5, 12, 20]

results_grid = grid_search_forecaster(
                    forecaster         = forecaster,
                    y                  = data_train['y'],
                    param_grid         = param_grid,
                    lags_grid          = lags_grid,
                    steps              = 36,
                    refit              = False,
                    metric             = 'mean_squared_error',
                    initial_train_size = int(len(data_train)*0.5),
                    fixed_train_size   = False,
                    return_best        = True,
                    n_jobs             = 'auto',
                    verbose            = False
                )
Number of models compared: 30.
`Forecaster` refitted using the best-found lags and parameters, and the whole data set: 
  Lags: [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20] 
  Parameters: {'alpha': 0.2782559402207126}
  Backtesting metric: 0.017733661183101296

In [46]:
# Grid Search results
# ==============================================================================
results_grid.head()
Out[46]:
lags params mean_squared_error alpha
24 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... {'alpha': 0.2782559402207126} 0.017734 0.278256
14 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12] {'alpha': 0.2782559402207126} 0.020040 0.278256
13 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12] {'alpha': 0.021544346900318843} 0.023779 0.021544
12 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12] {'alpha': 0.001668100537200059} 0.027246 0.001668
11 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12] {'alpha': 0.00012915496650148838} 0.027471 0.000129

The best results are obtained using a time window of 12 lags and a Lasso setting {'alpha': 0.021544}.

In [47]:
# Predictions
# ==============================================================================
predictions = forecaster.predict()

# Gráfico
# ==============================================================================
fig, ax = plt.subplots(figsize=(6, 2.5))
data_train['y'].plot(ax=ax, label='train')
data_test['y'].plot(ax=ax, label='test')
predictions.plot(ax=ax, label='predictions')
ax.legend();
In [48]:
# Test error
# ==============================================================================
error_mse = mean_squared_error(y_true = data_test['y'], y_pred = predictions)
print(f"Test error (mse) {error_mse}")
Test error (mse) 0.010065596959809572

Prediction intervals

A prediction interval defines the interval within which the true value of $y$ is expected to be found with a given probability.

Rob J Hyndman and George Athanasopoulos, list in their book Forecasting: Principles and Practice multiple ways to estimate prediction intervals, most of which require that the residuals (errors) of the model are distributed in a normal way. When this property cannot be assumed, bootstrapping can be resorted to, which only assumes that the residuals are uncorrelated. This is the method used in the Skforecast library. For more information visit skforecast probabilistic forecasting.

Diagram of how to create prediction intervals using bootstrapping.


In [49]:
# Data download
# ==============================================================================
url = 'https://raw.githubusercontent.com/JoaquinAmatRodrigo/skforecast/master/data/h2o_exog.csv'
data_raw = pd.read_csv(url, sep=',')

# Data preparation
# ==============================================================================
data = data_raw.copy()
data = data.rename(columns={'fecha': 'date'})
data['date'] = pd.to_datetime(data['date'], format='%Y-%m-%d')
data = data.set_index('date')
data = data.rename(columns={'x': 'y'})
data = data.asfreq('MS')
data = data.sort_index()

# Split data into train-test
# ==============================================================================
steps = 36
data_train = data[:-steps]
data_test  = data[-steps:]

print(f"Train dates : {data_train.index.min()} --- {data_train.index.max()}  (n={len(data_train)})")
print(f"Test dates  : {data_test.index.min()} --- {data_test.index.max()}  (n={len(data_test)})")
Train dates : 1992-04-01 00:00:00 --- 2005-06-01 00:00:00  (n=159)
Test dates  : 2005-07-01 00:00:00 --- 2008-06-01 00:00:00  (n=36)
In [50]:
# Create and train forecaster
# ==============================================================================
forecaster = ForecasterAutoreg(
                    regressor = LinearRegression(),
                    lags = 15
             )

forecaster.fit(y=data_train['y'])

# Prediction intervals
# ==============================================================================
predictions = forecaster.predict_interval(
                    steps    = steps,
                    interval = [1, 99],
                    n_boot   = 500
              )

predictions.head(5)
Out[50]:
pred lower_bound upper_bound
2005-07-01 0.962317 0.836348 1.072365
2005-08-01 0.976595 0.786105 1.093848
2005-09-01 1.140783 0.976097 1.262243
2005-10-01 1.179538 1.009499 1.306744
2005-11-01 1.221138 1.059447 1.348892
In [51]:
# Prediction error
# ==============================================================================
error_mse = mean_squared_error(
                y_true = data_test['y'],
                y_pred = predictions.iloc[:, 0]
            )

print(f"Test error (mse): {error_mse}")

# Plot
# ==============================================================================
fig, ax = plt.subplots(figsize=(6, 2.5))
data_test['y'].plot(ax=ax, label='test')
predictions['pred'].plot(ax=ax, label='prediction')
ax.fill_between(
    predictions.index,
    predictions['lower_bound'],
    predictions['upper_bound'],
    color = 'red',
    alpha = 0.2
)
ax.legend();
Test error (mse): 0.010536580526702774