Intermittent demand forecasting with skforecast

If you like  Skforecast ,  help us giving a star on   GitHub! ⭐️

Intermittent demand forecasting with skforecast

Joaquín Amat Rodrigo, Javier Escobar Ortiz
April, 2023 (last update November 2024)

Introduction

Intermittent demand forecasting is a statistical method used to predict the demand for products that have sporadic or irregular sales patterns. These types of products are characterized by periods of high demand followed by periods of little or no demand. Intermittent demand forecasting is used in the manufacturing, retail, and healthcare industries to manage inventory levels, optimize production schedules, and reduce out-of-stocks and excess inventory costs.

Regular intermittent demand refers to predictable demand patterns with known intervals between demand periods. This type of intermittent demand occurs with a degree of regularity, such as a product that has seasonal demand or a product that is ordered every month on a certain date. Forecasting of regular intermittent demand can be done using statistical methods, such as Croston's method, which is a common technique used to forecast demand in these situations. However, these methods often do not take into account important exogenous variables that can provide valuable information about demand intervals. As a result, machine learning-based forecasting can be an appealing alternative as it can incorporate these variables and improve accuracy.

On the other hand, irregular intermittent demand refers to unpredictable demand patterns, with no known intervals between demand periods. This type of intermittent demand is random and unpredictable, such as a product that is ordered only a few times a year and in varying quantities. Forecasting irregular intermittent demand is challenging because there is no predictable pattern to the demand. Traditional forecasting methods may not work effectively in these situations and alternative forecasting methods, such as bootstrapping or simulation, may be required.

In summary, regular intermittent demand has a predictable pattern whereas irregular intermittent demand does not. Forecasting regular intermittent demand is easier than forecasting irregular intermittent demand due to the predictability of the demand pattern.

This document demonstrates how the Python library skforecast can be used to forecast regular intermittent demand scenarios. By using this library, the machine learning model can focus on learning to predict demand during periods of interest, while avoiding the influence of periods when there is no demand.

Libraries

In [1]:
# Data processing
# ==============================================================================
import numpy as np
import pandas as pd

# Plots
# ==============================================================================
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio
import plotly.offline as poff
pio.templates.default = "seaborn"
poff.init_notebook_mode(connected=True)

# Modeling
# ==============================================================================
import skforecast
import sklearn
from lightgbm import LGBMRegressor
from sklearn.metrics import mean_absolute_error
from skforecast.recursive import ForecasterRecursive
from skforecast.model_selection import TimeSeriesFold
from skforecast.model_selection import grid_search_forecaster
from skforecast.model_selection import backtesting_forecaster

# Warnings configuration
# ==============================================================================
import warnings
warnings.filterwarnings('once')

color = '\033[1m\033[38;5;208m' 
print(f"{color}Version skforecast: {skforecast.__version__}")
print(f"{color}Version scikit-learn: {sklearn.__version__}")
print(f"{color}Version pandas: {pd.__version__}")
print(f"{color}Version numpy: {np.__version__}")
Version skforecast: 0.14.0
Version scikit-learn: 1.5.1
Version pandas: 2.2.3
Version numpy: 2.0.2

Data

The data used in this example represents the number of users who visited a store during its opening hours from Monday to Friday, between 7:00 and 20:00. Therefore, any predictions outside this period are not useful and can either be ignored or set to 0.

In [2]:
# Downloading data
# ======================================================================================
url = ('https://raw.githubusercontent.com/JoaquinAmatRodrigo/Estadistica-machine'
       '-learning-python/master/data/intermittent_demand.csv')
data = pd.read_csv(url, sep=',')
data['date_time'] = pd.to_datetime(data['date_time'], format='%Y-%m-%d %H:%M:%S')
data = data.set_index('date_time')
data = data.asfreq('h')
data = data.sort_index()
data.head(3)
Out[2]:
users
date_time
2011-01-01 00:00:00 0.0
2011-01-01 01:00:00 0.0
2011-01-01 02:00:00 0.0
In [3]:
# Split train-val-test
# ======================================================================================
end_train = '2012-03-31 23:59:00'
end_validation = '2012-08-31 23:59:00'
data_train = data.loc[: end_train, :]
data_val   = data.loc[end_train:end_validation, :]
data_test  = data.loc[end_validation:, :]

print(f"Dates train      : {data_train.index.min()} --- {data_train.index.max()}  (n={len(data_train)})")
print(f"Dates validation : {data_val.index.min()} --- {data_val.index.max()}  (n={len(data_val)})")
print(f"Dates test       : {data_test.index.min()} --- {data_test.index.max()}  (n={len(data_test)})")
Dates train      : 2011-01-01 00:00:00 --- 2012-03-31 23:00:00  (n=10944)
Dates validation : 2012-04-01 00:00:00 --- 2012-08-31 23:00:00  (n=3672)
Dates test       : 2012-09-01 00:00:00 --- 2012-12-31 23:00:00  (n=2928)
In [4]:
# Plot time series
# ======================================================================================
fig = go.Figure()
trace1 = go.Scatter(x=data_train.index, y=data_train['users'], name="train", mode="lines")
trace2 = go.Scatter(x=data_val.index, y=data_val['users'], name="validation", mode="lines")
trace3 = go.Scatter(x=data_test.index, y=data_test['users'], name="test", mode="lines")
fig.add_trace(trace1)
fig.add_trace(trace2)
fig.add_trace(trace3)
fig.update_layout(
    title="Time series of users",
    xaxis_title="Date time",
    yaxis_title="Users",
    width  = 750,
    height = 350,
    margin=dict(l=20, r=20, t=35, b=20),
    legend=dict(
        orientation="h",
        yanchor="top",
        y=1,
        xanchor="left",
        x=0.001
    )
)
fig.show()
In [5]:
# Boxplot for weekly seasonality
# ==============================================================================
data['week_day'] = data.index.day_of_week + 1
fig = px.box(
        data,
        x="week_day",
        y="users",
        title = 'Distribution of users per day of week',
        width=600,
        height=300
     )
median_values = data.groupby('week_day')['users'].median()
fig.add_trace(
    go.Scatter(
        x=median_values.index,
        y=median_values.values,
        mode='lines+markers',
        line=dict(color='blue', dash='dash'),
        showlegend=False
    )
)
fig.update_layout(margin=dict(l=20, r=20, t=35, b=20))
fig.show()
In [6]:
# Boxplot for daily seasonality
# ==============================================================================
data['hour_day'] = data.index.hour + 1
fig = px.box(
        data,
        x="hour_day",
        y="users",
        title = 'Distribution of users per hour of day',
        width=600,
        height=300
     )
median_values = data.groupby('hour_day')['users'].median()
fig.add_trace(
    go.Scatter(
        x=median_values.index,
        y=median_values.values,
        mode='lines+markers',
        line=dict(color='blue', dash='dash'),
        showlegend=False
    )
)
fig.update_layout(margin=dict(l=20, r=20, t=35, b=20))
fig.show()

Define a custom metric to evaluate the model

To accurately evaluate the performance of the model, it is crucial to define a metric that closely reflects the business scenario in which the model will be used. Specifically, in this case, the model's performance should be optimized during weekdays from 9:00 to 20:00.

Fortunately, skforecast provides the flexibility to use custom functions as metrics for backtesting and model tuning.

In [7]:
def custom_metric(y_true, y_pred):
    """
    Calculate the mean absolute error using only the predicted values for weekdays
    from 9:00 AM to 8:00 PM
    """

    day_of_week = y_true.index.day_of_week
    hour_of_day = y_true.index.hour
    mask = day_of_week.isin([0, 1, 2, 3, 4]) | ((hour_of_day > 7) | (hour_of_day < 20))
    metric = mean_absolute_error(y_true[mask], y_pred[mask])
    
    return metric

Forecasting

In [8]:
# Create forecaster
# ==============================================================================
forecaster = ForecasterRecursive(
                 regressor = LGBMRegressor(
                                 learning_rate = 0.1,
                                 max_depth     = 5,
                                 n_estimators  = 500,
                                 random_state  = 123,
                                 verbose       = -1
                             ),
                 lags = 24
             )
forecaster
Out[8]:

ForecasterRecursive

General Information
  • Regressor: LGBMRegressor(max_depth=5, n_estimators=500, random_state=123, verbose=-1)
  • Lags: [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]
  • Window features: None
  • Window size: 24
  • Exogenous included: False
  • Weight function included: False
  • Differentiation order: None
  • Creation date: 2024-11-04 16:23:36
  • Last fit date: None
  • Skforecast version: 0.14.0
  • Python version: 3.12.5
  • Forecaster id: None
Exogenous Variables
    None
Data Transformations
  • Transformer for y: None
  • Transformer for exog: None
Training Information
  • Training range: Not fitted
  • Training index type: Not fitted
  • Training index frequency: Not fitted
Regressor Parameters
    {'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 1.0, 'importance_type': 'split', 'learning_rate': 0.1, 'max_depth': 5, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 500, 'n_jobs': None, 'num_leaves': 31, 'objective': None, 'random_state': 123, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'subsample': 1.0, 'subsample_for_bin': 200000, 'subsample_freq': 0, 'verbose': -1}
Fit Kwargs
    {}

🛈 API Reference    🗎 User Guide

In [9]:
# Backtesting test period
# ==============================================================================
cv = TimeSeriesFold(
        steps              = 36,
        initial_train_size = len(data.loc[:end_validation]),
        refit              = False
     )
metric, predictions = backtesting_forecaster(
                          forecaster    = forecaster,
                          y             = data['users'],
                          cv            = cv,
                          metric        = custom_metric,
                          verbose       = False,
                          show_progress = True
                      )
metric
Out[9]:
custom_metric
0 108.045136
In [10]:
# Set to zero predictions for closed hours
# ==============================================================================
hour = data_test.index.hour
day_of_week = data_test.index.day_of_week
closed_hours = (hour < 7) | (hour > 20)
closed_days = day_of_week.isin([5, 6])
is_closed = (closed_hours) | (closed_days)
predictions[is_closed] = 0
In [11]:
# Plot predictions vs real value
# ======================================================================================
fig = go.Figure()
trace1 = go.Scatter(x=data_test.index, y=data_test['users'], name="test", mode="lines")
trace2 = go.Scatter(x=predictions.index, y=predictions['pred'], name="prediction", mode="lines")
fig.add_trace(trace1)
fig.add_trace(trace2)
fig.update_layout(
    title="Real value vs predicted in test data",
    xaxis_title="Date time",
    yaxis_title="Users",
    width  = 750,
    height = 350,
    margin=dict(l=20, r=20, t=35, b=20),
    legend=dict(
        orientation="h",
        yanchor="top",
        y=1.1,
        xanchor="left",
        x=0.001
    )
)
fig.show()

When analysing the estimated predictions, it is clear that the model struggles to accurately capture the patterns of store opening and closing times. In addition, the influence of lags leads to an underestimation of the first few hours of the day and the days following the closing days.

In [12]:
# Distribution of residuals by day of week and hour of day
# ==============================================================================
residuals = (predictions['pred'] - data_test['users']).to_frame('residuals')
residuals['week_day'] = residuals.index.day_of_week + 1
residuals['hour_day'] = residuals.index.hour + 1

fig = px.box(
        residuals,
        x="week_day",
        y="residuals",
        title = 'Distribution of residuals per hour of day',
        width=600,
        height=300
     )
fig.update_layout(margin=dict(l=20, r=20, t=35, b=20))
fig.show()

fig = px.box(
        residuals,
        x="hour_day",
        y="residuals",
        title = 'Distribution of residuals per hour of day',
        width=600,
        height=300
     )
fig.update_layout(margin=dict(l=20, r=20, t=35, b=20))
fig.show()

Inform the model when the store is closed

Exogenous variables can be used in a forecasting model to provide additional information and improve the model's ability to detect patterns. This approach offers the advantage of incorporating external factors that could influence the accuracy of the forecast, leading to a more reliable and accurate forecasting model. The exog argument makes it incredibly easy to integrate exogenous variables into the forecaster.

In [13]:
# Create exogenous variable
# ==============================================================================
hour = data.index.hour
day_of_week = data.index.day_of_week
closed_hours = (hour < 7) | (hour > 20)
closed_days = day_of_week.isin([5, 6])
is_closed = (closed_hours) | (closed_days)
data['is_closed'] = is_closed.astype(int)

end_train = '2012-03-31 23:59:00'
end_validation = '2012-08-31 23:59:00'
data_train = data.loc[: end_train, :]
data_val   = data.loc[end_train:end_validation, :]
data_test  = data.loc[end_validation:, :]
In [14]:
# Backtesting test period
# ==============================================================================
metric, predictions = backtesting_forecaster(
                        forecaster    = forecaster,
                        y             = data['users'],
                        exog          = data['is_closed'],
                        cv            = cv,
                        metric        = custom_metric,
                        verbose       = False,
                        show_progress = True
                    )

metric
Out[14]:
custom_metric
0 52.93065
In [15]:
# Set to zero predictions for closed hours
# ==============================================================================
hour = data_test.index.hour
day_of_week = data_test.index.day_of_week
closed_hours = (hour < 7) | (hour > 20)
closed_days = day_of_week.isin([5, 6])
is_closed = (closed_hours) | (closed_days)
predictions[is_closed] = 0
In [16]:
# Plot predictions vs real value
# ======================================================================================
fig = go.Figure()
trace1 = go.Scatter(x=data_test.index, y=data_test['users'], name="test", mode="lines")
trace2 = go.Scatter(x=predictions.index, y=predictions['pred'], name="prediction", mode="lines")
fig.add_trace(trace1)
fig.add_trace(trace2)
fig.update_layout(
    title="Real value vs predicted in test data",
    xaxis_title="Date time",
    yaxis_title="Users",
    width  = 750,
    height = 350,
    margin=dict(l=20, r=20, t=35, b=20),
    legend=dict(
        orientation="h",
        yanchor="top",
        y=1.1,
        xanchor="left",
        x=0.001
    )
)
fig.show()

Incorporating an exogenous variable into the model led to a significant improvement in forecasting accuracy, with the error being halved. This highlights the importance of using external factors to enhance the performance of the forecasting model.

Model tunning

Hyperparameter tuning is a critical aspect of developing accurate and effective machine learning models. Hyperparameters are values that cannot be learned from data and must be set by the user before the model is trained. These hyperparameters can have a significant impact on the performance of the model, and careful tuning can improve its accuracy and generalisation to new data. In the case of forecasting models, the lags included in the model can be considered as an additional hyperparameter.

Hyperparameter tuning involves systematically testing different values or combinations of hyperparameters (including lags) to find the optimal configuration that gives the best results. The skforecast library offers several hyperparameter tuning strategies, including grid search, random search, and Bayesian search, to identify the optimal combination of lags and hyperparameters that achieve the best prediction performance.

In [17]:
# Grid search hyperparameters and lags
# ==============================================================================
forecaster = ForecasterRecursive(
                 regressor = LGBMRegressor(
                                 learning_rate = 0.1,
                                 max_depth     = 5,
                                 n_estimators  = 500,
                                 random_state  = 123,
                                 verbose       = -1
                             ),
                 lags = 24
             )

# Lags
lags_grid = [24, 48, 72]

# Huperparameters
param_grid = {
    "n_estimators": [50, 100, 500],
    "max_depth": [5, 10, 15],
    "learning_rate": [0.01, 0.1],
}

# Folds
cv_search = TimeSeriesFold(
                steps              = 36,
                initial_train_size = len(data.loc[:end_train]),
                refit              = False
            )

results_grid = grid_search_forecaster(
                   forecaster  = forecaster,
                   y           = data.loc[:end_validation, 'users'],
                   exog        = data.loc[:end_validation, 'is_closed'],
                   cv          = cv_search,
                   param_grid  = param_grid,
                   lags_grid   = lags_grid,
                   metric      = custom_metric,
                   return_best = True,
                   verbose     = False
               )
`Forecaster` refitted using the best-found lags and parameters, and the whole data set: 
  Lags: [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72] 
  Parameters: {'learning_rate': 0.1, 'max_depth': 15, 'n_estimators': 100}
  Backtesting metric: 39.30052275988041
In [18]:
# Backtesting test period
# ==============================================================================
metric, predictions = backtesting_forecaster(
                          forecaster    = forecaster,
                          y             = data['users'],
                          exog          = data['is_closed'],
                          cv            = cv,
                          metric        = custom_metric,
                          verbose       = False,
                          show_progress = True
                      )
metric
Out[18]:
custom_metric
0 39.861555
In [19]:
# Plot predictions vs real value
# ======================================================================================
fig = go.Figure()
trace1 = go.Scatter(x=data_test.index, y=data_test['users'], name="test", mode="lines")
trace2 = go.Scatter(x=predictions.index, y=predictions['pred'], name="prediction", mode="lines")
fig.add_trace(trace1)
fig.add_trace(trace2)
fig.update_layout(
    title="Real value vs predicted in test data",
    xaxis_title="Date time",
    yaxis_title="Users",
    width  = 750,
    height = 350,
    margin=dict(l=20, r=20, t=35, b=20),
    legend=dict(
        orientation="h",
        yanchor="top",
        y=1.1,
        xanchor="left",
        x=0.001
    )
)
fig.show()

An improved set of hyperparameters has been discovered through grid search, resulting in a further reduction of forecast errors.

Session information

In [20]:
import session_info
session_info.show(html=False)
-----
lightgbm            4.5.0
numpy               2.0.2
pandas              2.2.3
plotly              5.24.1
session_info        1.0.0
skforecast          0.14.0
sklearn             1.5.1
-----
IPython             8.27.0
jupyter_client      8.6.3
jupyter_core        5.7.2
notebook            6.4.12
-----
Python 3.12.5 | packaged by Anaconda, Inc. | (main, Sep 12 2024, 18:27:27) [GCC 11.2.0]
Linux-5.15.0-1071-aws-x86_64-with-glibc2.31
-----
Session information updated at 2024-11-04 16:26

Citation

How to cite this document

If you use this document or any part of it, please acknowledge the source, thank you!

Intermittent demand forecasting with skforecast by Joaquín Amat Rodrigo and Javier Escobar Ortiz, available under Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0 DEED) at https://www.cienciadedatos.net/documentos/py48-intermittent-demand-forecasting.html

How to cite skforecast

If you use skforecast for a publication, we would appreciate it if you cite the published software.

Zenodo:

Amat Rodrigo, Joaquin, & Escobar Ortiz, Javier. (2024). skforecast (v0.14.0). Zenodo. https://doi.org/10.5281/zenodo.8382788

APA:

Amat Rodrigo, J., & Escobar Ortiz, J. (2024). skforecast (Version 0.14.0) [Computer software]. https://doi.org/10.5281/zenodo.8382788

BibTeX:

@software{skforecast, author = {Amat Rodrigo, Joaquin and Escobar Ortiz, Javier}, title = {skforecast}, version = {0.14.0}, month = {11}, year = {2024}, license = {BSD-3-Clause}, url = {https://skforecast.org/}, doi = {10.5281/zenodo.8382788} }

%%html


Did you like the article? Your support is important

Website maintenance has high cost, your contribution will help me to continue generating free educational content. Many thanks! 😊


Creative Commons Licence
This work by Joaquín Amat Rodrigo and Javier Escobar Ortiz is licensed under a Attribution-NonCommercial-ShareAlike 4.0 International.