More about forecasting in cienciadedatos.net
- ARIMA and SARIMAX models with python
- Time series forecasting with machine learning
- Forecasting time series with gradient boosting: XGBoost, LightGBM and CatBoost
- Forecasting time series with XGBoost
- Global Forecasting Models: Multi-series forecasting
- Global Forecasting Models: Comparative Analysis of Single and Multi-Series Forecasting Modeling
- Probabilistic forecasting
- Forecasting with deep learning
- Forecasting energy demand with machine learning
- Forecasting web traffic with machine learning
- Intermittent demand forecasting
- Modelling time series trend with tree-based models
- Bitcoin price prediction with Python
- Stacking ensemble of machine learning models to improve forecasting
- Interpretable forecasting models
- Mitigating the Impact of Covid on Forecasting Models
- Forecasting time series with missing values

Introduction
In early 2025, Kaggle launched the Forecasting Sticker Sales competition as part of its Playground series (Season 5, Episode 1), providing a hands-on challenge for time series forecasting enthusiasts.
Contestants were tasked with forecasting daily sales for five Kaggle-branded products across six countries and three store types, resulting in 90 different time series. The forecasting period spanned from 2017 to 2019, with historical sales data from 2010 to 2016 available for training. Performance was evaluated using Mean Absolute Percentage Error (MAPE), highlighting the importance of accurate predictions across a diverse and moderately granular dataset.
This document serves as an introductory example of how to use the skforecast Python library to build predictive models and generate predictions. It will guide the reader through the main steps of a typical forecasting project, including data preparation, model training, and evaluation. While the focus is on simplicity and clarity, achieving peak performance would require more advanced feature engineering and iterative model refinement.
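As a quick reference for the competition metric, the following is a minimal sketch of how MAPE can be computed with NumPy. The `mape` helper and the sales figures are hypothetical and only illustrate the formula; they are not competition data.

```python
import numpy as np


def mape(y_true, y_pred):
    """Mean absolute percentage error, returned as a fraction (0.10 means 10%)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Each absolute error is scaled by the actual value before averaging
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))


# Hypothetical daily sales and forecasts, invented for illustration only
actual = [100.0, 200.0, 400.0]
forecast = [110.0, 180.0, 400.0]
score = mape(actual, forecast)
print(round(score, 4))
```

Because every error is divided by the actual value, MAPE is undefined when an actual value is zero and penalizes errors on low-volume series more heavily than the same absolute error on high-volume series.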
Libraries
The libraries used in this notebook are:
# Data management
# ==============================================================================
import pandas as pd
import numpy as np
from itertools import product
# Plots
# ==============================================================================
from matplotlib import pyplot as plt
from skforecast.plot import set_dark_theme
# Modelling
# ==============================================================================
from skforecast.recursive import ForecasterRecursiveMultiSeries
from skforecast.model_selection import bayesian_search_forecaster_multiseries, OneStepAheadFold
from skforecast.preprocessing import series_long_to_dict, exog_long_to_dict
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OrdinalEncoder
from lightgbm import LGBMRegressor
import holidays
from feature_engine.datetime import DatetimeFeatures
from feature_engine.creation import CyclicalFeatures
# Warnings
# ==============================================================================
import warnings
from skforecast.exceptions import MissingValuesWarning
warnings.simplefilter('ignore', category=MissingValuesWarning)
Data
The data sets used in this notebook are from the Kaggle competition "Forecasting Sticker Sales" (Walter Reade and Elizabeth Park, 2025).
# Data
# ==============================================================================
data_train = pd.read_csv('train.csv')
data_test = pd.read_csv('test.csv')
display(data_train.head())
display(data_test.head())
 | id | date | country | store | product | num_sold |
---|---|---|---|---|---|---|
0 | 0 | 2010-01-01 | Canada | Discount Stickers | Holographic Goose | NaN |
1 | 1 | 2010-01-01 | Canada | Discount Stickers | Kaggle | 973.0 |
2 | 2 | 2010-01-01 | Canada | Discount Stickers | Kaggle Tiers | 906.0 |
3 | 3 | 2010-01-01 | Canada | Discount Stickers | Kerneler | 423.0 |
4 | 4 | 2010-01-01 | Canada | Discount Stickers | Kerneler Dark Mode | 491.0 |
 | id | date | country | store | product |
---|---|---|---|---|---|
0 | 230130 | 2017-01-01 | Canada | Discount Stickers | Holographic Goose |
1 | 230131 | 2017-01-01 | Canada | Discount Stickers | Kaggle |
2 | 230132 | 2017-01-01 | Canada | Discount Stickers | Kaggle Tiers |
3 | 230133 | 2017-01-01 | Canada | Discount Stickers | Kerneler |
4 | 230134 | 2017-01-01 | Canada | Discount Stickers | Kerneler Dark Mode |
# Convert 'date' column to datetime
# ==============================================================================
data_train['date'] = pd.to_datetime(data_train['date'])
data_test['date'] = pd.to_datetime(data_test['date'])
# Create a new column 'unique_id' that identifies each time series as the combination
# of columns 'country', 'store', and 'product'
# ==============================================================================
# Note: spaces inside the store and product names are kept in the identifier
data_train['unique_id'] = (
    data_train['country'] + '_' +
    data_train['store'] + '_' +
    data_train['product']
)
data_test['unique_id'] = (
    data_test['country'] + '_' +
    data_test['store'] + '_' +
    data_test['product']
)
display(data_train.head())
display(data_test.head())
 | id | date | country | store | product | num_sold | unique_id |
---|---|---|---|---|---|---|---|
0 | 0 | 2010-01-01 | Canada | Discount Stickers | Holographic Goose | NaN | Canada_Discount Stickers_Holographic Goose |
1 | 1 | 2010-01-01 | Canada | Discount Stickers | Kaggle | 973.0 | Canada_Discount Stickers_Kaggle |
2 | 2 | 2010-01-01 | Canada | Discount Stickers | Kaggle Tiers | 906.0 | Canada_Discount Stickers_Kaggle Tiers |
3 | 3 | 2010-01-01 | Canada | Discount Stickers | Kerneler | 423.0 | Canada_Discount Stickers_Kerneler |
4 | 4 | 2010-01-01 | Canada | Discount Stickers | Kerneler Dark Mode | 491.0 | Canada_Discount Stickers_Kerneler Dark Mode |
 | id | date | country | store | product | unique_id |
---|---|---|---|---|---|---|
0 | 230130 | 2017-01-01 | Canada | Discount Stickers | Holographic Goose | Canada_Discount Stickers_Holographic Goose |
1 | 230131 | 2017-01-01 | Canada | Discount Stickers | Kaggle | Canada_Discount Stickers_Kaggle |
2 | 230132 | 2017-01-01 | Canada | Discount Stickers | Kaggle Tiers | Canada_Discount Stickers_Kaggle Tiers |
3 | 230133 | 2017-01-01 | Canada | Discount Stickers | Kerneler | Canada_Discount Stickers_Kerneler |
4 | 230134 | 2017-01-01 | Canada | Discount Stickers | Kerneler Dark Mode | Canada_Discount Stickers_Kerneler Dark Mode |
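The identifiers shown in the tables above still contain spaces. In pandas, `Series.replace` matches whole cell values, whereas `Series.str.replace` operates on substrings. The sketch below, built on made-up values, shows the difference in case identifiers without spaces are ever needed.

```python
import pandas as pd

# Toy identifiers, invented for illustration; the second element is a single space
ids = pd.Series(['Canada_Discount Stickers_Kaggle', ' '])

# Series.replace only changes elements that are exactly equal to ' '
whole_value = ids.replace(' ', '_')

# Series.str.replace changes every space inside each string
substring = ids.str.replace(' ', '_', regex=False)

print(whole_value.tolist())
print(substring.tolist())
```

If the substring version were adopted in the notebook, every hard-coded series name used later would have to be written with underscores as well.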
Range of available dates
# Unique countries, stores and products
# ==============================================================================
print('Number of unique time series:', data_train['unique_id'].nunique())
print('Unique countries :', data_train['country'].unique())
print('Unique stores :', data_train['store'].unique())
print('Unique products :', data_train['product'].unique())
Unique countries: ['Canada' 'Finland' 'Italy' 'Kenya' 'Norway' 'Singapore']
Unique stores: ['Discount Stickers' 'Stickers for Less' 'Premium Sticker Mart']
Unique products: ['Holographic Goose' 'Kaggle' 'Kaggle Tiers' 'Kerneler' 'Kerneler Dark Mode']
Number of unique time series: 90
# Date range in the training and test sets
# ==============================================================================
print('Date range in the training set :', data_train['date'].min(), 'to', data_train['date'].max())
print('Date range in the test set :', data_test['date'].min(), 'to', data_test['date'].max())
Date range in the training set: 2010-01-01 00:00:00 to 2016-12-31 00:00:00
Date range in the test set : 2017-01-01 00:00:00 to 2019-12-31 00:00:00
# Date range available in the training set for each time series
# ==============================================================================
date_range = data_train.groupby('unique_id')['date'].agg(['min', 'max', 'count'])
date_range = date_range.rename(columns={'min': 'start_date', 'max': 'end_date'})
date_range
unique_id | start_date | end_date | count |
---|---|---|---|
Canada_Discount Stickers_Holographic Goose | 2010-01-01 | 2016-12-31 | 2557 |
Canada_Discount Stickers_Kaggle | 2010-01-01 | 2016-12-31 | 2557 |
Canada_Discount Stickers_Kaggle Tiers | 2010-01-01 | 2016-12-31 | 2557 |
Canada_Discount Stickers_Kerneler | 2010-01-01 | 2016-12-31 | 2557 |
Canada_Discount Stickers_Kerneler Dark Mode | 2010-01-01 | 2016-12-31 | 2557 |
... | ... | ... | ... |
Singapore_Stickers for Less_Holographic Goose | 2010-01-01 | 2016-12-31 | 2557 |
Singapore_Stickers for Less_Kaggle | 2010-01-01 | 2016-12-31 | 2557 |
Singapore_Stickers for Less_Kaggle Tiers | 2010-01-01 | 2016-12-31 | 2557 |
Singapore_Stickers for Less_Kerneler | 2010-01-01 | 2016-12-31 | 2557 |
Singapore_Stickers for Less_Kerneler Dark Mode | 2010-01-01 | 2016-12-31 | 2557 |
90 rows × 3 columns
# Plot 3 time series
# ==============================================================================
set_dark_theme()
series_to_plot = [
'Italy_Stickers for Less_Kerneler Dark Mode',
'Singapore_Stickers for Less_Holographic Goose',
'Italy_Stickers for Less_Kaggle'
]
for series in series_to_plot:
    fig, ax = plt.subplots(1, 1, figsize=(7, 2.5))
    data_train.query('unique_id == @series').plot(
        x='date',
        y=['num_sold'],
        ax=ax,
        title=series,
        linewidth=0.3,
        legend=False,
    )
    plt.show()
Missing values in target variable
Only 9 series have missing values in the target variable num_sold, most of them for the "Holographic Goose" product. Since the challenge description gives no information about the missing values, a decision must be made about how to interpret them:
- Missing values mean that the product was not sold on that day, so the sales are 0.
- Missing values mean that the sales of the product are unknown, so they could be 0 or any other value.
# Ensure all time series are complete without intermediate gaps
# ==============================================================================
data_train = (
data_train
.groupby('unique_id')
.apply(lambda group: group.set_index('date').asfreq('D', fill_value=np.nan), include_groups=False)
.reset_index()
)
data_train
 | unique_id | date | id | country | store | product | num_sold |
---|---|---|---|---|---|---|---|
0 | Canada_Discount Stickers_Holographic Goose | 2010-01-01 | 0 | Canada | Discount Stickers | Holographic Goose | NaN |
1 | Canada_Discount Stickers_Holographic Goose | 2010-01-02 | 90 | Canada | Discount Stickers | Holographic Goose | NaN |
2 | Canada_Discount Stickers_Holographic Goose | 2010-01-03 | 180 | Canada | Discount Stickers | Holographic Goose | NaN |
3 | Canada_Discount Stickers_Holographic Goose | 2010-01-04 | 270 | Canada | Discount Stickers | Holographic Goose | NaN |
4 | Canada_Discount Stickers_Holographic Goose | 2010-01-05 | 360 | Canada | Discount Stickers | Holographic Goose | NaN |
... | ... | ... | ... | ... | ... | ... | ... |
230125 | Singapore_Stickers for Less_Kerneler Dark Mode | 2016-12-27 | 229764 | Singapore | Stickers for Less | Kerneler Dark Mode | 1016.0 |
230126 | Singapore_Stickers for Less_Kerneler Dark Mode | 2016-12-28 | 229854 | Singapore | Stickers for Less | Kerneler Dark Mode | 1062.0 |
230127 | Singapore_Stickers for Less_Kerneler Dark Mode | 2016-12-29 | 229944 | Singapore | Stickers for Less | Kerneler Dark Mode | 1178.0 |
230128 | Singapore_Stickers for Less_Kerneler Dark Mode | 2016-12-30 | 230034 | Singapore | Stickers for Less | Kerneler Dark Mode | 1357.0 |
230129 | Singapore_Stickers for Less_Kerneler Dark Mode | 2016-12-31 | 230124 | Singapore | Stickers for Less | Kerneler Dark Mode | 1312.0 |
230130 rows × 7 columns
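To make the gap-filling step and the two interpretations of missing values concrete, here is a minimal sketch on a toy series (the dates and values are invented). It shows how `asfreq('D')` inserts a row with NaN for a missing day, and how the "sales are 0" interpretation would differ from leaving the value unknown.

```python
import pandas as pd

# Toy series with a missing calendar day (2010-01-03), invented for illustration
toy = pd.DataFrame({
    'date': pd.to_datetime(['2010-01-01', '2010-01-02', '2010-01-04']),
    'num_sold': [10.0, 12.0, 9.0],
})

# Reindex to a complete daily calendar; days without data become NaN
complete = toy.set_index('date').asfreq('D')
n_missing = int(complete['num_sold'].isna().sum())

# Interpretation 1: missing means "no sales", so NaN becomes 0
as_zero = complete['num_sold'].fillna(0)

# Interpretation 2: missing means "unknown", so NaN is kept as it is
as_unknown = complete['num_sold']

print(complete)
print('Missing days:', n_missing)
```

Keeping NaN preserves the distinction between "no sales" and "no information", at the cost of requiring a regressor that accepts missing values.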
# Percentage of missing values in each series
# ==============================================================================
missing_pct = (
data_train
.groupby('unique_id')
.apply(lambda group: group['num_sold'].isna().mean() * 100, include_groups=False)
.sort_values(ascending=False)
.reset_index(name='missing_values_pct')
)
missing_pct.query('missing_values_pct > 0')
 | unique_id | missing_values_pct |
---|---|---|
0 | Canada_Discount Stickers_Holographic Goose | 100.000000 |
1 | Kenya_Discount Stickers_Holographic Goose | 100.000000 |
2 | Kenya_Stickers for Less_Holographic Goose | 53.109112 |
3 | Canada_Stickers for Less_Holographic Goose | 51.153696 |
4 | Kenya_Premium Sticker Mart_Holographic Goose | 25.263981 |
5 | Canada_Premium Sticker Mart_Holographic Goose | 14.861165 |
6 | Kenya_Discount Stickers_Kerneler | 2.463825 |
7 | Canada_Discount Stickers_Kerneler | 0.039108 |
8 | Kenya_Discount Stickers_Kerneler Dark Mode | 0.039108 |
Series containing only missing values are excluded from the training set, as they provide no useful information. For the remaining series, missing values are not imputed because the chosen regressor, LightGBM, is capable of handling NaN values directly.
# Drop time series with 100% of missing values
# ==============================================================================
series_to_drop = ['Canada_Discount Stickers_Holographic Goose', 'Kenya_Discount Stickers_Holographic Goose']
data_train = data_train.query("unique_id not in @series_to_drop").copy()
✎ Note
As described in the competition discussion, better results can be obtained by filling missing values with a random number between 1 and the minimum value within the country (Kenya = 5, Canada = 200). However, this approach seems to be based on repeated submissions to the competition rather than on a solid statistical foundation. The approach taken in this notebook is to leave the missing values as they are and let the model learn from the observed data. This is a common practice in time series forecasting, as it allows the model to learn from the patterns in the data without introducing artificial values. Nevertheless, the reader can try it with the following code:
# Note: the two series below are dropped in the previous step, so this code must
# run before that step. np.random.randint excludes the upper bound.
# Fill series "Canada_Discount Stickers_Holographic Goose" with random values between 1 and 200
# ==============================================================================
# mask = data_train['unique_id'] == 'Canada_Discount Stickers_Holographic Goose'
# data_train.loc[mask, 'num_sold'] = np.random.randint(1, 200, size=sum(mask))
# Fill series "Kenya_Discount Stickers_Holographic Goose" with random values between 1 and 5
# ==============================================================================
# mask = data_train['unique_id'] == 'Kenya_Discount Stickers_Holographic Goose'
# data_train.loc[mask, 'num_sold'] = np.random.randint(1, 5, size=sum(mask))
Feature engineering
Logarithmic transformation
Series are transformed using the logarithm function. This transformation is particularly useful for series with a long tail, as it reduces the impact of extreme values and helps to normalize the distribution of the data. Furthermore, it helps to avoid negative predictions, which is a common issue when using machine learning models.
The logarithmic transformation, log(1 + x) as computed by np.log1p, is applied to the target variable num_sold and the resulting values are stored in a new column called log_num_sold. Once the predictions are made, the inverse transformation is applied to return them to the original scale of the data.
# Transform the target variable 'num_sold' to log scale
# ==============================================================================
data_train['log_num_sold'] = np.log1p(data_train['num_sold'])
data_train.head()
 | unique_id | date | id | country | store | product | num_sold | log_num_sold |
---|---|---|---|---|---|---|---|---|
2557 | Canada_Discount Stickers_Kaggle | 2010-01-01 | 1 | Canada | Discount Stickers | Kaggle | 973.0 | 6.881411 |
2558 | Canada_Discount Stickers_Kaggle | 2010-01-02 | 91 | Canada | Discount Stickers | Kaggle | 881.0 | 6.782192 |
2559 | Canada_Discount Stickers_Kaggle | 2010-01-03 | 181 | Canada | Discount Stickers | Kaggle | 1003.0 | 6.911747 |
2560 | Canada_Discount Stickers_Kaggle | 2010-01-04 | 271 | Canada | Discount Stickers | Kaggle | 744.0 | 6.613384 |
2561 | Canada_Discount Stickers_Kaggle | 2010-01-05 | 361 | Canada | Discount Stickers | Kaggle | 707.0 | 6.562444 |
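The round trip between the original and the logarithmic scale can be sketched as follows. The example assumes that `np.expm1` is used as the inverse of `np.log1p`; the array mixes one value from the table above (973) with made-up values.

```python
import numpy as np

# 973 comes from the table above; the other values are invented for illustration
sales = np.array([0.0, 5.0, 973.0, np.nan])

# Forward transformation: log(1 + x), defined for zero sales, NaN stays NaN
log_sales = np.log1p(sales)

# Inverse transformation: exp(x) - 1 returns values to the original scale
restored = np.expm1(log_sales)

print(np.round(log_sales, 6))
print(restored)
```

Since `np.expm1` is always greater than -1, predictions made on the log scale cannot map back to large negative sales, although a negative log-scale prediction still maps to a value slightly below zero.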
Exogenous variables
In addition to historical sales data, incorporating additional variables—also known as exogenous variables—can enhance the model's performance. For example, calendar-related variables such as the month, day of the week, and holidays can provide valuable context.
# Generate holidays for each country and year
# ==============================================================================
countries = ['Canada', 'Finland', 'Italy', 'Kenya', 'Norway', 'Singapore']
years = [2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]
all_holidays = []
for country_code, year in product(countries, years):
    try:
        country_holidays = holidays.country_holidays(country_code, years=year)
        all_holidays.extend([
            {
                'country': country_code,
                'date': pd.to_datetime(date),
                'holiday_name': name
            }
            for date, name in country_holidays.items()
        ])
    except NotImplementedError:
        print(f"Country '{country_code}' is not supported by the holidays library.")
df_holidays = (
pd.DataFrame(all_holidays)
.groupby(['country', 'date'], as_index=False)
.agg({'holiday_name': ', '.join})
)
df_holidays['is_holiday'] = 1
df_holidays.sort_values(by=['country', 'date'], inplace=True)
To account for the anticipated and delayed impact of holidays on sales, it is recommended to include shifted versions of the holiday variable. This involves creating new indicators that show whether a holiday occurred a given number of days before each date (lag) or will occur a given number of days after it (next).
Since the df_holidays DataFrame only lists holiday dates, it must first be expanded to include all calendar dates for each country before calculating these shifted variables.
# Build complete date range per country
date_range = pd.date_range(start=f"{min(years)}-01-01", end=f"{max(years)}-12-31")
all_dates = pd.MultiIndex.from_product([countries, date_range], names=['country', 'date']).to_frame(index=False)
# Merge with holidays
df_calendar = all_dates.merge(df_holidays, on=['country', 'date'], how='left')
df_calendar['is_holiday'] = df_calendar['is_holiday'].fillna(0).astype(int)
# Create lagged and next features for holidays
for i in [1, 2, 5, 7, 9]:
    df_calendar[f'is_holiday_lag_{i}'] = df_calendar.groupby('country')['is_holiday'].shift(i).fillna(0).astype(int)
    df_calendar[f'is_holiday_next_{i}'] = df_calendar.groupby('country')['is_holiday'].shift(-i).fillna(0).astype(int)
df_calendar.head()
 | country | date | holiday_name | is_holiday | is_holiday_lag_1 | is_holiday_next_1 | is_holiday_lag_2 | is_holiday_next_2 | is_holiday_lag_5 | is_holiday_next_5 | is_holiday_lag_7 | is_holiday_next_7 | is_holiday_lag_9 | is_holiday_next_9 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Canada | 2010-01-01 | New Year's Day | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | Canada | 2010-01-02 | NaN | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | Canada | 2010-01-03 | NaN | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | Canada | 2010-01-04 | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | Canada | 2010-01-05 | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
# Add calendar features and encode them with cyclical encoding
# ==============================================================================
features_to_extract = [
'month',
'week',
'day_of_week',
]
calendar_transformer = DatetimeFeatures(
variables = 'date',
features_to_extract = features_to_extract,
drop_original = False,
)
df_calendar = calendar_transformer.fit_transform(df_calendar)
df_calendar.columns = df_calendar.columns.str.replace('date_', '', regex=False)
features_to_encode = [
"month",
"week",
"day_of_week",
]
max_values = {
"month": 12,
"week": 52,
"day_of_week": 6,
}
cyclical_encoder = CyclicalFeatures(
variables = features_to_encode,
max_values = max_values,
drop_original = True
)
df_calendar = cyclical_encoder.fit_transform(df_calendar)
df_calendar.head()
 | country | date | holiday_name | is_holiday | is_holiday_lag_1 | is_holiday_next_1 | is_holiday_lag_2 | is_holiday_next_2 | is_holiday_lag_5 | is_holiday_next_5 | is_holiday_lag_7 | is_holiday_next_7 | is_holiday_lag_9 | is_holiday_next_9 | month_sin | month_cos | week_sin | week_cos | day_of_week_sin | day_of_week_cos |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Canada | 2010-01-01 | New Year's Day | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.5 | 0.866025 | 0.120537 | 0.992709 | -8.660254e-01 | -0.5 |
1 | Canada | 2010-01-02 | NaN | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.5 | 0.866025 | 0.120537 | 0.992709 | -8.660254e-01 | 0.5 |
2 | Canada | 2010-01-03 | NaN | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.5 | 0.866025 | 0.120537 | 0.992709 | -2.449294e-16 | 1.0 |
3 | Canada | 2010-01-04 | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.5 | 0.866025 | 0.120537 | 0.992709 | 0.000000e+00 | 1.0 |
4 | Canada | 2010-01-05 | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.5 | 0.866025 | 0.120537 | 0.992709 | 8.660254e-01 | 0.5 |
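The cyclical encoding can be reproduced by hand to see where the numbers in the table come from. The sketch below assumes the encoding is sin(2πx / max_value) and cos(2πx / max_value), which is consistent with the values shown above; the `cyclical_encode` helper is hypothetical and is not part of feature_engine.

```python
import numpy as np


def cyclical_encode(values, max_value):
    """Map values onto the unit circle using max_value as the period (assumed formula)."""
    values = np.asarray(values, dtype=float)
    angle = 2.0 * np.pi * values / max_value
    return np.sin(angle), np.cos(angle)


# January (month = 1) with a period of 12, as in the table above
month_sin, month_cos = cyclical_encode([1], 12)

# Monday (0) and Sunday (6) with a period of 6, as configured in max_values
dow_sin, dow_cos = cyclical_encode([0, 6], 6)

print(month_sin, month_cos)
print(dow_sin, dow_cos)
```

With a period of 6 and days numbered 0 to 6, Monday and Sunday land on the same point of the circle, as the rows for 2010-01-03 and 2010-01-04 in the table suggest. If the two days should be distinguishable, a period of 7 may be worth trying.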
# Add all exogenous features to the training set
# ==============================================================================
exog_features = [
'is_holiday',
'is_holiday_lag_1',
'is_holiday_lag_2',
'is_holiday_lag_5',
'is_holiday_lag_7',
'is_holiday_lag_9',
'is_holiday_next_1',
'is_holiday_next_2',
'is_holiday_next_5',
'month_sin',
'month_cos',
'week_sin',
'week_cos',
'day_of_week_sin',
'day_of_week_cos',
'country',
'store',
'product',
]
data_train = data_train.merge(
right = df_calendar.drop(columns=['holiday_name']),
how = 'left',
left_on = ['country', 'date'],
right_on = ['country', 'date'],
validate = 'many_to_one'
)
data_train.head()
 | unique_id | date | id | country | store | product | num_sold | log_num_sold | is_holiday | is_holiday_lag_1 | ... | is_holiday_lag_7 | is_holiday_next_7 | is_holiday_lag_9 | is_holiday_next_9 | month_sin | month_cos | week_sin | week_cos | day_of_week_sin | day_of_week_cos |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Canada_Discount Stickers_Kaggle | 2010-01-01 | 1 | Canada | Discount Stickers | Kaggle | 973.0 | 6.881411 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0.5 | 0.866025 | 0.120537 | 0.992709 | -8.660254e-01 | -0.5 |
1 | Canada_Discount Stickers_Kaggle | 2010-01-02 | 91 | Canada | Discount Stickers | Kaggle | 881.0 | 6.782192 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0.5 | 0.866025 | 0.120537 | 0.992709 | -8.660254e-01 | 0.5 |
2 | Canada_Discount Stickers_Kaggle | 2010-01-03 | 181 | Canada | Discount Stickers | Kaggle | 1003.0 | 6.911747 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0.5 | 0.866025 | 0.120537 | 0.992709 | -2.449294e-16 | 1.0 |
3 | Canada_Discount Stickers_Kaggle | 2010-01-04 | 271 | Canada | Discount Stickers | Kaggle | 744.0 | 6.613384 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0.5 | 0.866025 | 0.120537 | 0.992709 | 0.000000e+00 | 1.0 |
4 | Canada_Discount Stickers_Kaggle | 2010-01-05 | 361 | Canada | Discount Stickers | Kaggle | 707.0 | 6.562444 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0.5 | 0.866025 | 0.120537 | 0.992709 | 8.660254e-01 | 0.5 |
5 rows × 25 columns
Modeling
Since there are 90 different time series, two different approaches can be taken: modeling each time series separately, known as local forecasting, or modeling all time series together, known as global forecasting.
In this case, a global forecasting approach is taken, which typically better fulfills the requirements of real-world applications, where a large number of time series exist and training a single model for each series is computationally infeasible. Furthermore, the global forecasting models implemented in skforecast are able to forecast new time series that were not present in the training data, which will be useful for predicting the "Holographic Goose" product.
Skforecast accepts different data structures when creating global forecasting models (ForecasterRecursiveMultiSeries):
- If all series have the same length and share the same exogenous variables, the data can be passed as a pandas DataFrame where each column represents a time series and each row corresponds to a time step. The DataFrame index should be a datetime index.
- If the series have different lengths, the data must be passed as a dictionary. The keys of the dictionary represent the names of the series, and the values are the series themselves. To facilitate this, the series_long_to_dict function can be used: it takes a DataFrame in "long format" and returns a dictionary of pandas Series. Similarly, if the exogenous variables differ (in values or type) across series, the data must also be provided as a dictionary. In this case, the exog_long_to_dict function is used, converting a "long format" DataFrame into a dictionary of exogenous variables (either pandas Series or pandas DataFrames).
In this scenario, the holiday-related exogenous variables differ for each series, as they are specific to the country where the product is sold. Therefore, the data must be passed as a dictionary.
# Transform series and exog to dictionaries
# ==============================================================================
series_dict = series_long_to_dict(
data = data_train,
series_id = 'unique_id',
index = 'date',
values = 'log_num_sold',
freq = 'D'
)
exog_dict = exog_long_to_dict(
data = data_train[exog_features + ['date', 'unique_id']],
series_id = 'unique_id',
index = 'date',
freq = 'D'
)
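As a rough illustration of what a long-to-dictionary conversion produces, the sketch below builds an equivalent structure with plain pandas on a toy DataFrame. It is not the skforecast implementation, only an approximation of the resulting data structure, and the values are invented.

```python
import pandas as pd

# Toy long-format data: series 'B' starts one day later than series 'A'
long_df = pd.DataFrame({
    'unique_id': ['A', 'A', 'A', 'B', 'B'],
    'date': pd.to_datetime([
        '2010-01-01', '2010-01-02', '2010-01-03', '2010-01-02', '2010-01-03'
    ]),
    'log_num_sold': [6.9, 6.8, 6.9, 3.1, 3.3],
})

# One pandas Series per identifier, indexed by date with daily frequency
series_by_id = {
    name: group.set_index('date')['log_num_sold'].asfreq('D').rename(name)
    for name, group in long_df.groupby('unique_id')
}

for name, series in series_by_id.items():
    print(name, len(series), series.index.min().date())
```

Each entry keeps its own start date and length, which is what allows series of different lengths to be stored together.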
When training a forecaster using exogenous variables, it is necessary to provide the exogenous variables for the prediction period. These variables must follow the same structure observed during training. Therefore, the exogenous variables for the test set must also be provided as a dictionary.
# Prepare exogenous variables for the test set
# ==============================================================================
data_test = data_test.merge(
df_calendar.drop(columns=['holiday_name']),
how = 'left',
left_on = ['country', 'date'],
right_on = ['country', 'date'],
validate = 'many_to_one'
)
exog_dict_pred = exog_long_to_dict(
data = data_test[exog_features + ['date', 'unique_id']],
series_id = 'unique_id',
index = 'date',
freq = 'D'
)
Feature encoding
The exogenous variables country, store, and product are categorical. Depending on the regressor used, it may be necessary to encode them. In this case, the LightGBM regressor can handle categorical variables directly. However, to ensure they are treated consistently in the training and prediction phases, the variables are first encoded as integers and then stored as pandas category type. For more details on how to encode exogenous variables, please refer to the Feature Engineering section of the user guide.
# Categorical encoding
# ==============================================================================
# A ColumnTransformer is used to transform categorical (not numerical) features
# using ordinal encoding. Numeric features are left untouched. Missing values
# are coded as -1. If a new category is found in the test set, it is encoded
# as -1.
categorical_features = ['country', 'store', 'product']
transformer_exog = make_column_transformer(
(
OrdinalEncoder(
dtype=int,
handle_unknown="use_encoded_value",
unknown_value=-1,
encoded_missing_value=-1
),
categorical_features
),
remainder="passthrough",
verbose_feature_names_out=False,
).set_output(transform="pandas")
The encoder will be passed to the forecaster, so it can be used during the prediction phase.
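The behavior of the ordinal encoder with categories that were not seen during fitting can be checked on a small example. The sketch below uses only scikit-learn and toy data, and leaves out the column transformer for brevity.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data: 'Kenya' does not appear in the data used to fit the encoder
train = pd.DataFrame({'country': ['Canada', 'Finland', 'Italy']})
new = pd.DataFrame({'country': ['Italy', 'Kenya']})

encoder = OrdinalEncoder(
    dtype=int,
    handle_unknown='use_encoded_value',
    unknown_value=-1,
)
encoder.fit(train)

# Known categories get their integer code, unknown ones get -1
codes = encoder.transform(new).ravel().tolist()
print(codes)
```

Categories are numbered in sorted order, so 'Italy' receives code 2 and the unseen 'Kenya' receives -1.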
Forecaster training
# Create forecaster
# ==============================================================================
forecaster = ForecasterRecursiveMultiSeries(
regressor = LGBMRegressor(random_state=8520, verbose=-1),
lags = 31,
encoding = "ordinal_category",
transformer_exog = transformer_exog,
fit_kwargs = {'categorical_feature': categorical_features}
)
forecaster
ForecasterRecursiveMultiSeries
General Information
- Regressor: LGBMRegressor
- Lags: [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31]
- Window features: None
- Window size: 31
- Series encoding: ordinal_category
- Exogenous included: False
- Weight function included: False
- Series weights: None
- Differentiation order: None
- Creation date: 2025-05-19 14:35:45
- Last fit date: None
- Skforecast version: 0.16.0
- Python version: 3.12.9
- Forecaster id: None
Exogenous Variables
- None
Data Transformations
- Transformer for series: None
- Transformer for exog: ColumnTransformer(remainder='passthrough', transformers=[('ordinalencoder', OrdinalEncoder(dtype=<class 'int'>, encoded_missing_value=-1, handle_unknown='use_encoded_value', unknown_value=-1), ['country', 'store', 'product'])], verbose_feature_names_out=False)
Training Information
- Series names (levels): None
- Training range: None
- Training index type: Not fitted
- Training index frequency: Not fitted
Regressor Parameters
-
{'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 1.0, 'importance_type': 'split', 'learning_rate': 0.1, 'max_depth': -1, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 100, 'n_jobs': None, 'num_leaves': 31, 'objective': None, 'random_state': 8520, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'subsample': 1.0, 'subsample_for_bin': 200000, 'subsample_freq': 0, 'verbose': -1}
Fit Kwargs
-
{'categorical_feature': ['country', 'store', 'product']}
To find the best model hyperparameters, a Bayesian search is performed using the bayesian_search_forecaster_multiseries function. This method is combined with the OneStepAheadFold validation strategy and uses mean absolute percentage error (MAPE) as the evaluation metric. For more details on the validation strategy, see the Model Evaluation and Tuning section of the documentation.
Since hyperparameter searches should not be performed on the test set, the training data is split into two parts: a training set and a validation set. The training set is used to train the model, and the validation set is used to evaluate its performance.
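A date-based split on a daily index can be sketched as follows, using the training date range reported earlier and 2015-12-31 as the split date. Whether a given validation utility expects the number of observations or a date for its initial training size is an assumption that should be checked against the documentation of the installed version.

```python
import pandas as pd

# Daily index covering the training data (2010-01-01 to 2016-12-31)
index = pd.date_range('2010-01-01', '2016-12-31', freq='D')
end_train = pd.Timestamp('2015-12-31')

# Number of observations up to and including the last training date
n_train_obs = int((index <= end_train).sum())

# Difference in days between the first date and the last training date
days_between = (end_train - index.min()).days

# Remaining observations form the validation set (the whole year 2016)
n_validation_obs = len(index) - n_train_obs

print(n_train_obs, days_between, n_validation_obs)
```

The difference in days is one less than the number of observations up to and including the split date, so the two quantities are not interchangeable when an exact split point matters.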
# Bayesian search with OneStepAheadFold
# ==============================================================================
end_train = '2015-12-31 00:00:00'
start_validation = '2016-01-01 00:00:00'
initial_train_size = (pd.to_datetime(end_train) - pd.to_datetime(data_train['date'].min())).days
def search_space(trial):
    search_space = {
        'lags' : trial.suggest_categorical('lags', [1, 14, 21, 60]),
        'n_estimators' : trial.suggest_int('n_estimators', 200, 800, step=100),
        'max_depth' : trial.suggest_int('max_depth', 3, 8, step=1),
        'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 25, 500),
        'learning_rate' : trial.suggest_float('learning_rate', 0.01, 0.5),
        'feature_fraction': trial.suggest_float('feature_fraction', 0.5, 0.8, step=0.1),
        'max_bin' : trial.suggest_int('max_bin', 50, 100, step=25),
        'reg_alpha' : trial.suggest_float('reg_alpha', 0, 1, step=0.1),
        'reg_lambda' : trial.suggest_float('reg_lambda', 0, 1, step=0.1),
        'linear_tree' : trial.suggest_categorical('linear_tree', [True, False]),
    }
    return search_space
cv = OneStepAheadFold(initial_train_size=initial_train_size)
results_search, best_trial = bayesian_search_forecaster_multiseries(
forecaster = forecaster,
series = series_dict,
exog = exog_dict,
cv = cv,
search_space = search_space,
n_trials = 20,
metric = "mean_absolute_percentage_error",
suppress_warnings = True
)
best_params = results_search.at[0, 'params']
best_lags = results_search.at[0, 'lags']
results_search.head(3)
`Forecaster` refitted using the best-found lags and parameters, and the whole data set:
  Lags: [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60]
  Parameters: {'n_estimators': 800, 'max_depth': 4, 'min_data_in_leaf': 190, 'learning_rate': 0.21343356904845662, 'feature_fraction': 0.7, 'max_bin': 75, 'reg_alpha': 0.1, 'reg_lambda': 1.0, 'linear_tree': True}
  Backtesting metric: 0.008537774021838387
  Levels: ['Canada_Discount Stickers_Kaggle', 'Canada_Discount Stickers_Kaggle Tiers', 'Canada_Discount Stickers_Kerneler', 'Canada_Discount Stickers_Kerneler Dark Mode', 'Canada_Premium Sticker Mart_Holographic Goose', 'Canada_Premium Sticker Mart_Kaggle', 'Canada_Premium Sticker Mart_Kaggle Tiers', 'Canada_Premium Sticker Mart_Kerneler', 'Canada_Premium Sticker Mart_Kerneler Dark Mode', 'Canada_Stickers for Less_Holographic Goose', '...', 'Singapore_Premium Sticker Mart_Holographic Goose', 'Singapore_Premium Sticker Mart_Kaggle', 'Singapore_Premium Sticker Mart_Kaggle Tiers', 'Singapore_Premium Sticker Mart_Kerneler', 'Singapore_Premium Sticker Mart_Kerneler Dark Mode', 'Singapore_Stickers for Less_Holographic Goose', 'Singapore_Stickers for Less_Kaggle', 'Singapore_Stickers for Less_Kaggle Tiers', 'Singapore_Stickers for Less_Kerneler', 'Singapore_Stickers for Less_Kerneler Dark Mode']
index | levels | lags | params | mean_absolute_percentage_error__weighted_average | mean_absolute_percentage_error__average | mean_absolute_percentage_error__pooling | n_estimators | max_depth | min_data_in_leaf | learning_rate | feature_fraction | max_bin | reg_alpha | reg_lambda | linear_tree |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | [Canada_Discount Stickers_Kaggle, Canada_Disco... | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | {'n_estimators': 800, 'max_depth': 4, 'min_dat... | 0.008538 | 0.008490 | 0.008490 | 800 | 4 | 190 | 0.213434 | 0.7 | 75 | 0.1 | 1.0 | True |
1 | [Canada_Discount Stickers_Kaggle, Canada_Disco... | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | {'n_estimators': 400, 'max_depth': 4, 'min_dat... | 0.008619 | 0.008503 | 0.008503 | 400 | 4 | 158 | 0.186406 | 0.6 | 100 | 0.0 | 1.0 | True |
2 | [Canada_Discount Stickers_Kaggle, Canada_Disco... | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | {'n_estimators': 800, 'max_depth': 8, 'min_dat... | 0.008635 | 0.008543 | 0.008543 | 800 | 8 | 127 | 0.116468 | 0.7 | 100 | 1.0 | 0.0 | True |
Prediction
Once the model is trained, it can be used to make predictions. Note that when a forecaster is trained with exogenous variables, their values must also be provided for the prediction period through the exog parameter of the predict method.
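Because a missing exogenous date only surfaces at prediction time, it can help to check coverage beforehand. The following is a minimal sketch in plain pandas, not part of skforecast; the dictionary `exog_demo`, its series names, and the dates are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical exogenous data: one DataFrame per series, daily frequency.
exog_demo = {
    "series_a": pd.DataFrame(
        {"is_holiday": 0},
        index=pd.date_range("2016-12-25", "2017-01-10", freq="D"),
    ),
    "series_b": pd.DataFrame(
        {"is_holiday": 0},
        index=pd.date_range("2016-12-25", "2017-01-05", freq="D"),
    ),
}

last_train_date = pd.Timestamp("2016-12-31")
steps_demo = 7

# Dates needed for prediction: the `steps_demo` days right after training ends.
required_dates = pd.date_range(
    start=last_train_date + pd.Timedelta(days=1), periods=steps_demo, freq="D"
)

# A series is covered only if every required date is present in its exog index.
coverage = {
    name: bool(required_dates.isin(df.index).all())
    for name, df in exog_demo.items()
}
print(coverage)
```

In this toy case, `series_b` is flagged because its exogenous data ends before the last required date.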
Before generating predictions for the final test set, for which the true sales are not available, it is important to assess the model’s performance on a validation set. This intermediate evaluation shows how well the model generalizes and whether it is ready for the final test.
To conduct this assessment, the model is trained with all available data up to "2015-12-31 00:00:00" and predictions are generated for the following year (2016). These predictions are then compared with the actual sales to evaluate performance.
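The split relies on label-based slicing of a DatetimeIndex, which in pandas includes the end label. A small sketch, using a made-up series and a hypothetical `end_train_demo` date, shows that the training and validation partitions do not overlap:

```python
import pandas as pd

# Toy daily series spanning the end of 2015 and the start of 2016.
series_demo = pd.Series(
    range(10), index=pd.date_range("2015-12-27", periods=10, freq="D")
)
end_train_demo = "2015-12-31"

# `.loc[:label]` on a DatetimeIndex keeps the end label (inclusive slice).
train_part = series_demo.loc[:end_train_demo]

# Validation starts one day after the end of training.
start_validation_demo = pd.Timestamp(end_train_demo) + pd.Timedelta(days=1)
validation_part = series_demo.loc[start_validation_demo:]

print(train_part.index.max(), validation_part.index.min())
```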
# Train the forecaster
# ==============================================================================
forecaster.fit(
series = {k: v.loc[:end_train] for k, v in series_dict.items()},
exog = {k: v.loc[:end_train] for k, v in exog_dict.items()},
suppress_warnings = True,
)
# Predictions for the validation set
# ==============================================================================
steps = (data_train['date'].max() - pd.to_datetime(end_train)).days
print('Number of steps to predict:', steps)
# Select the exogenous variables for the validation dates
exog_dict_validation = {k: v.loc[start_validation:] for k, v in exog_dict.items()}
predictions_validation = forecaster.predict(steps=steps, exog=exog_dict_validation)
predictions_validation.head(4)
Number of steps to predict: 366
MissingValuesWarning
`last_window` has missing values. Most of machine learning models do not allow missing values. Prediction method may fail.
Category : MissingValuesWarning
Location : /home/joaquin/miniconda3/envs/skforecast_16_py12/lib/python3.12/site-packages/skforecast/utils/utils.py:989
Suppress : warnings.simplefilter('ignore', category=MissingValuesWarning)
date | level | pred |
---|---|---|
2016-01-01 | Canada_Discount Stickers_Kaggle | 6.500966 |
2016-01-01 | Canada_Discount Stickers_Kaggle Tiers | 6.415851 |
2016-01-01 | Canada_Discount Stickers_Kerneler | 5.710596 |
2016-01-01 | Canada_Discount Stickers_Kerneler Dark Mode | 5.894394 |
Since the model was trained on the logarithm of the target variable, the predictions are also on a logarithmic scale. To return to the original scale of the data, the inverse transformation is applied with np.expm1, which computes exp(x) - 1.
# Reverse the log transformation of the predictions
# ==============================================================================
predictions_validation['pred'] = np.expm1(predictions_validation['pred'])
predictions_validation.head(4)
date | level | pred |
---|---|---|
2016-01-01 | Canada_Discount Stickers_Kaggle | 664.784257 |
2016-01-01 | Canada_Discount Stickers_Kaggle Tiers | 610.460865 |
2016-01-01 | Canada_Discount Stickers_Kerneler | 301.051077 |
2016-01-01 | Canada_Discount Stickers_Kerneler Dark Mode | 361.996929 |
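The back-transformation above can be illustrated with a short round trip. This sketch assumes the target was transformed with `np.log1p` before training, which the use of `np.expm1` suggests; the `sales` values are made up.

```python
import numpy as np

# Made-up sales values, including a zero.
sales = np.array([0.0, 316.0, 706.0])

# log1p computes log(1 + x), so it is defined for x = 0.
log_sales = np.log1p(sales)

# expm1 computes exp(y) - 1 and undoes log1p.
recovered = np.expm1(log_sales)

print(log_sales)
print(recovered)
```

Using `log1p` rather than a plain logarithm avoids infinite values when a series contains days with zero sales.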
Next, the predictions are compared against the actual sales data to evaluate the model's performance.
# Compare predictions with the real values
# ==============================================================================
predictions_validation = predictions_validation.reset_index(names= 'date')
predictions_validation = predictions_validation.merge(
data_train[['unique_id', 'date', 'num_sold']],
left_on = ['level', 'date'],
right_on = ['unique_id', 'date'],
how = 'left',
validate = '1:1'
)
predictions_validation = predictions_validation[['date', 'unique_id', 'pred', 'num_sold']]
predictions_validation.head(4)
index | date | unique_id | pred | num_sold |
---|---|---|---|---|
0 | 2016-01-01 | Canada_Discount Stickers_Kaggle | 664.784257 | 706.0 |
1 | 2016-01-01 | Canada_Discount Stickers_Kaggle Tiers | 610.460865 | 634.0 |
2 | 2016-01-01 | Canada_Discount Stickers_Kerneler | 301.051077 | 316.0 |
3 | 2016-01-01 | Canada_Discount Stickers_Kerneler Dark Mode | 361.996929 | 404.0 |
# Calculate MAPE in the validation set
# ==============================================================================
# MAPE is undefined when the real value (the denominator) is 0, so records
# with 0 or missing values in `num_sold` are excluded from the calculation.
mask_not_zero = predictions_validation['num_sold'] != 0
mask_not_nan = predictions_validation['num_sold'].notna()
mask = mask_not_zero & mask_not_nan
mape_validation = mean_absolute_percentage_error(
y_true = predictions_validation.loc[mask, 'num_sold'],
y_pred = predictions_validation.loc[mask, 'pred'],
)
print('Overall MAPE in the validation set :', mape_validation)
# MAPE per time series
# ==============================================================================
mape_validation_per_series = (
predictions_validation
.query('num_sold != 0 and num_sold.notna()')
.groupby('unique_id')
.apply(lambda group: mean_absolute_percentage_error(
y_true = group['num_sold'],
y_pred = group['pred'],
), include_groups=False)
.sort_values()
.reset_index(name='mape')
)
mape_validation_per_series
Overall MAPE in the validation set : 0.0806041750456532
index | unique_id | mape |
---|---|---|
0 | Canada_Discount Stickers_Kaggle | 0.043041 |
1 | Norway_Discount Stickers_Kerneler | 0.045692 |
2 | Canada_Discount Stickers_Kerneler Dark Mode | 0.046368 |
3 | Singapore_Discount Stickers_Kaggle | 0.047748 |
4 | Finland_Premium Sticker Mart_Kerneler | 0.048505 |
... | ... | ... |
83 | Kenya_Premium Sticker Mart_Kaggle Tiers | 0.159288 |
84 | Kenya_Premium Sticker Mart_Holographic Goose | 0.166229 |
85 | Norway_Discount Stickers_Holographic Goose | 0.193311 |
86 | Singapore_Discount Stickers_Holographic Goose | 0.214695 |
87 | Italy_Discount Stickers_Holographic Goose | 0.257877 |
88 rows × 2 columns
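To make the metric and the masking explicit, the MAPE can also be computed by hand as the mean of |actual - predicted| / |actual| over the valid rows. The sketch below uses a made-up DataFrame named `demo`; it mirrors the filtering used above but is not the article's code.

```python
import numpy as np
import pandas as pd

# Made-up data: one row with zero sales and one with missing sales.
demo = pd.DataFrame({
    "num_sold": [100.0, 200.0, 0.0, np.nan],
    "pred":     [110.0, 180.0, 5.0, 50.0],
})

# NaN != 0 evaluates to True, so missing values need their own mask.
valid = (demo["num_sold"] != 0) & demo["num_sold"].notna()

y_true = demo.loc[valid, "num_sold"]
y_pred = demo.loc[valid, "pred"]

# Mean absolute percentage error, expressed as a fraction (not multiplied by 100).
mape_demo = ((y_true - y_pred).abs() / y_true.abs()).mean()
print(mape_demo)
```

Both valid rows have a 10% error, so the result is 0.1, on the same fractional scale as the values reported above (0.08 corresponds to 8%).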
The following plots show the predictions and the actual sales for four of the series.
set_dark_theme()
series_to_plot = [
'Italy_Stickers for Less_Kerneler Dark Mode',
'Singapore_Stickers for Less_Holographic Goose',
'Italy_Discount Stickers_Holographic Goose',
'Kenya_Premium Sticker Mart_Holographic Goose'
]
for series in series_to_plot:
    fig, ax = plt.subplots(1, 1, figsize=(7, 3))
    predictions_validation.query('unique_id == @series').plot(
        x='date',
        y=['num_sold', 'pred'],
        ax=ax,
        title=series,
        linewidth=0.7,
    )
    plt.show()
Finally, the model is trained using all available data, and the predict method is used to generate predictions for all series over the next three years (1095 days). All predictions are made at once, starting immediately after the last date of the training data; the model is not updated with new observations between steps.
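Predicting the whole horizon at once with a recursive forecaster means that each prediction is fed back as a lag for the next step. The following toy sketch illustrates the idea with a hand-written linear autoregressive rule; the function `recursive_forecast` and its coefficients are hypothetical and do not reproduce skforecast's internals.

```python
import numpy as np

def recursive_forecast(last_window, coefs, intercept, steps):
    """Forecast `steps` values with a toy linear autoregressive model.

    coefs[0] multiplies lag 1 (the most recent value), coefs[1] lag 2, etc.
    """
    window = list(last_window)
    n_lags = len(coefs)
    predictions = []
    for _ in range(steps):
        # Take the last n_lags values, most recent first.
        lags = window[-n_lags:][::-1]
        y_hat = intercept + float(np.dot(coefs, lags))
        predictions.append(y_hat)
        # The prediction replaces the unknown future observation.
        window.append(y_hat)
    return np.array(predictions)

toy_predictions = recursive_forecast(
    last_window=[1.0, 2.0], coefs=[0.5, 0.5], intercept=0.0, steps=3
)
print(toy_predictions)
```

Because later steps depend on earlier predictions rather than on observed values, errors can accumulate along a long horizon such as this one.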
# Train the forecaster with all available data
# ==============================================================================
forecaster.fit(series = series_dict, exog = exog_dict)
forecaster
MissingValuesWarning
NaNs detected in `y_train`. They have been dropped because the target variable cannot have NaN values. Same rows have been dropped from `X_train` to maintain alignment. This is caused by series with interspersed NaNs.
Category : MissingValuesWarning
Location : /home/joaquin/miniconda3/envs/skforecast_16_py12/lib/python3.12/site-packages/skforecast/recursive/_forecaster_recursive_multiseries.py:1191
Suppress : warnings.simplefilter('ignore', category=MissingValuesWarning)
MissingValuesWarning
NaNs detected in `X_train`. Some regressors do not allow NaN values during training. If you want to drop them, set `forecaster.dropna_from_series = True`.
Category : MissingValuesWarning
Location : /home/joaquin/miniconda3/envs/skforecast_16_py12/lib/python3.12/site-packages/skforecast/recursive/_forecaster_recursive_multiseries.py:1213
Suppress : warnings.simplefilter('ignore', category=MissingValuesWarning)
ForecasterRecursiveMultiSeries
General Information
- Regressor: LGBMRegressor
- Lags: [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60]
- Window features: None
- Window size: 60
- Series encoding: ordinal_category
- Exogenous included: True
- Weight function included: False
- Series weights: None
- Differentiation order: None
- Creation date: 2025-05-19 14:35:45
- Last fit date: 2025-05-19 14:38:33
- Skforecast version: 0.16.0
- Python version: 3.12.9
- Forecaster id: None
Exogenous Variables
- is_holiday, is_holiday_lag_1, is_holiday_lag_2, is_holiday_lag_5, is_holiday_lag_7, is_holiday_lag_9, is_holiday_next_1, is_holiday_next_2, is_holiday_next_5, month_sin, month_cos, week_sin, week_cos, day_of_week_sin, day_of_week_cos, country, store, product
Data Transformations
- Transformer for series: None
- Transformer for exog: ColumnTransformer(remainder='passthrough',
transformers=[('ordinalencoder',
OrdinalEncoder(dtype=
, encoded_missing_value=-1, handle_unknown='use_encoded_value', unknown_value=-1), ['country', 'store', 'product'])], verbose_feature_names_out=False)
Training Information
- Series names (levels): Canada_Discount Stickers_Kaggle, Canada_Discount Stickers_Kaggle Tiers, Canada_Discount Stickers_Kerneler, Canada_Discount Stickers_Kerneler Dark Mode, Canada_Premium Sticker Mart_Holographic Goose, Canada_Premium Sticker Mart_Kaggle, Canada_Premium Sticker Mart_Kaggle Tiers, Canada_Premium Sticker Mart_Kerneler, Canada_Premium Sticker Mart_Kerneler Dark Mode, Canada_Stickers for Less_Holographic Goose, Canada_Stickers for Less_Kaggle, Canada_Stickers for Less_Kaggle Tiers, Canada_Stickers for Less_Kerneler, Canada_Stickers for Less_Kerneler Dark Mode, Finland_Discount Stickers_Holographic Goose, Finland_Discount Stickers_Kaggle, Finland_Discount Stickers_Kaggle Tiers, Finland_Discount Stickers_Kerneler, Finland_Discount Stickers_Kerneler Dark Mode, Finland_Premium Sticker Mart_Holographic Goose, Finland_Premium Sticker Mart_Kaggle, Finland_Premium Sticker Mart_Kaggle Tiers, Finland_Premium Sticker Mart_Kerneler, Finland_Premium Sticker Mart_Kerneler Dark Mode, Finland_Stickers for Less_Holographic Goose, ..., Norway_Premium Sticker Mart_Holographic Goose, Norway_Premium Sticker Mart_Kaggle, Norway_Premium Sticker Mart_Kaggle Tiers, Norway_Premium Sticker Mart_Kerneler, Norway_Premium Sticker Mart_Kerneler Dark Mode, Norway_Stickers for Less_Holographic Goose, Norway_Stickers for Less_Kaggle, Norway_Stickers for Less_Kaggle Tiers, Norway_Stickers for Less_Kerneler, Norway_Stickers for Less_Kerneler Dark Mode, Singapore_Discount Stickers_Holographic Goose, Singapore_Discount Stickers_Kaggle, Singapore_Discount Stickers_Kaggle Tiers, Singapore_Discount Stickers_Kerneler, Singapore_Discount Stickers_Kerneler Dark Mode, Singapore_Premium Sticker Mart_Holographic Goose, Singapore_Premium Sticker Mart_Kaggle, Singapore_Premium Sticker Mart_Kaggle Tiers, Singapore_Premium Sticker Mart_Kerneler, Singapore_Premium Sticker Mart_Kerneler Dark Mode, Singapore_Stickers for Less_Holographic Goose, Singapore_Stickers for Less_Kaggle, Singapore_Stickers for Less_Kaggle Tiers, Singapore_Stickers for Less_Kerneler, Singapore_Stickers for Less_Kerneler Dark Mode
- Training range: 'Canada_Discount Stickers_Kaggle': ['2010-01-01', '2016-12-31'], 'Canada_Discount Stickers_Kaggle Tiers': ['2010-01-01', '2016-12-31'], 'Canada_Discount Stickers_Kerneler': ['2010-01-01', '2016-12-31'], 'Canada_Discount Stickers_Kerneler Dark Mode': ['2010-01-01', '2016-12-31'], 'Canada_Premium Sticker Mart_Holographic Goose': ['2010-01-01', '2016-12-31'], ..., 'Singapore_Stickers for Less_Holographic Goose': ['2010-01-01', '2016-12-31'], 'Singapore_Stickers for Less_Kaggle': ['2010-01-01', '2016-12-31'], 'Singapore_Stickers for Less_Kaggle Tiers': ['2010-01-01', '2016-12-31'], 'Singapore_Stickers for Less_Kerneler': ['2010-01-01', '2016-12-31'], 'Singapore_Stickers for Less_Kerneler Dark Mode': ['2010-01-01', '2016-12-31']
- Training index type: DatetimeIndex
- Training index frequency: D
Regressor Parameters
- {'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 1.0, 'importance_type': 'split', 'learning_rate': 0.21343356904845662, 'max_depth': 4, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 800, 'n_jobs': None, 'num_leaves': 31, 'objective': None, 'random_state': 8520, 'reg_alpha': 0.1, 'reg_lambda': 1.0, 'subsample': 1.0, 'subsample_for_bin': 200000, 'subsample_freq': 0, 'verbose': -1, 'min_data_in_leaf': 190, 'feature_fraction': 0.7, 'max_bin': 75, 'linear_tree': True, 'device': 'cpu'}
Fit Kwargs
- {'categorical_feature': ['country', 'store', 'product']}
# Feature importance (top 7)
# ==============================================================================
importance = forecaster.get_feature_importances()
importance.head(7)
index | feature | importance |
---|---|---|
13 | lag_14 | 581 |
6 | lag_7 | 572 |
55 | lag_56 | 428 |
75 | week_sin | 418 |
76 | week_cos | 369 |
0 | lag_1 | 319 |
1 | lag_2 | 312 |
# Prediction of test set
# ==============================================================================
steps = (data_test['date'].max() - data_test['date'].min()).days + 1
print('Number of steps to predict:', steps)
predictions = forecaster.predict(steps=steps, exog=exog_dict_pred, suppress_warnings=True)
# Reverse the log transformation of the predictions
# ==============================================================================
predictions['pred'] = np.expm1(predictions['pred'])
predictions.head(4)
Number of steps to predict: 1095
date | level | pred |
---|---|---|
2017-01-01 | Canada_Discount Stickers_Kaggle | 938.299735 |
2017-01-01 | Canada_Discount Stickers_Kaggle Tiers | 715.476380 |
2017-01-01 | Canada_Discount Stickers_Kerneler | 418.295782 |
2017-01-01 | Canada_Discount Stickers_Kerneler Dark Mode | 501.718288 |
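The horizon used above is the inclusive number of days between the first and last test dates, hence the `+ 1`. A quick check with a daily date range from 2017-01-01 to 2019-12-31 (the range shown in the prediction outputs; `test_dates` is a stand-in for the test set dates):

```python
import pandas as pd

# Stand-in for the dates of the test set.
test_dates = pd.date_range("2017-01-01", "2019-12-31", freq="D")

# Subtracting the endpoints counts the gaps between days, so 1 is added
# to include both the first and the last date.
steps_demo = (test_dates.max() - test_dates.min()).days + 1
print(steps_demo, len(test_dates))
```

None of the three years is a leap year, so the horizon is 3 × 365 = 1095 daily steps.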
Two of the series were excluded from the training set because they contained only missing values. Skforecast can forecast series that were not seen during training, but their predictions are not returned by default. To obtain them, the last_window argument of the predict method must be provided. Here, last_window is filled with NaN because no past values exist for these series.
# Predict series not seen during training
# ==============================================================================
last_window_unseen_series = pd.DataFrame(
data = np.nan,
index = pd.date_range(end='2016-12-31', periods=forecaster.window_size, freq='D'),
columns = ['Canada_Discount Stickers_Holographic Goose', 'Kenya_Discount Stickers_Holographic Goose']
)
predictions_unseen_Series = forecaster.predict(
steps = steps,
last_window = last_window_unseen_series,
exog = exog_dict_pred,
suppress_warnings = True
)
predictions_unseen_Series['pred'] = np.expm1(predictions_unseen_Series['pred'])
predictions_unseen_Series
date | level | pred |
---|---|---|
2017-01-01 | Canada_Discount Stickers_Holographic Goose | 130.095322 |
2017-01-01 | Kenya_Discount Stickers_Holographic Goose | 13.561558 |
2017-01-02 | Canada_Discount Stickers_Holographic Goose | 131.262703 |
2017-01-02 | Kenya_Discount Stickers_Holographic Goose | 15.972330 |
2017-01-03 | Canada_Discount Stickers_Holographic Goose | 163.919315 |
... | ... | ... |
2019-12-29 | Kenya_Discount Stickers_Holographic Goose | 102.435982 |
2019-12-30 | Canada_Discount Stickers_Holographic Goose | 344.756188 |
2019-12-30 | Kenya_Discount Stickers_Holographic Goose | 90.246888 |
2019-12-31 | Canada_Discount Stickers_Holographic Goose | 335.771299 |
2019-12-31 | Kenya_Discount Stickers_Holographic Goose | 91.815570 |
2190 rows × 2 columns
Submission results
predictions_all = pd.concat([predictions, predictions_unseen_Series])
submission = data_test.merge(
predictions_all.reset_index(names=['date']),
how = 'left',
left_on = ['date', 'unique_id'],
right_on = ['date', 'level'],
validate = 'one_to_one'
)
submission = submission.loc[:, ['id', 'pred']]
submission = submission.rename(columns={'pred': 'num_sold'})
submission.to_csv('submission.csv', index=False)
submission
index | id | num_sold |
---|---|---|
0 | 230130 | 130.095322 |
1 | 230131 | 938.299735 |
2 | 230132 | 715.476380 |
3 | 230133 | 418.295782 |
4 | 230134 | 501.718288 |
... | ... | ... |
98545 | 328675 | 415.009558 |
98546 | 328676 | 3036.446657 |
98547 | 328677 | 2080.632355 |
98548 | 328678 | 1366.599418 |
98549 | 328679 | 1640.340384 |
98550 rows × 2 columns
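The 98550 rows correspond to 90 series × 1095 days. Before uploading, it can be worth confirming that every test row received a prediction. The sketch below reproduces the merge on made-up frames (`test_demo` and `preds_frame` are hypothetical) and runs three basic checks; it is one possible set of sanity checks, not a requirement of the competition.

```python
import pandas as pd

dates = pd.to_datetime(["2017-01-01", "2017-01-01", "2017-01-02", "2017-01-02"])

# Made-up test set and predictions for two series and two days.
test_demo = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "date": dates,
    "unique_id": ["A", "B", "A", "B"],
})
preds_frame = pd.DataFrame({
    "date": dates,
    "level": ["A", "B", "A", "B"],
    "pred": [10.0, 20.0, 11.0, 21.0],
})

# validate='one_to_one' raises an error if a key is duplicated on either side.
submission_demo = test_demo.merge(
    preds_frame,
    how="left",
    left_on=["date", "unique_id"],
    right_on=["date", "level"],
    validate="one_to_one",
)

n_missing = int(submission_demo["pred"].isna().sum())           # rows without prediction
same_length = len(submission_demo) == len(test_demo)            # no rows added or lost
all_non_negative = bool((submission_demo["pred"] >= 0).all())   # sales cannot be negative

print(n_missing, same_length, all_non_negative)
```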
# Upload results to Kaggle
# ==============================================================================
# !pip install kaggle
# !kaggle competitions submit -c playground-series-s5e1 -f submission.csv -m "uploading submission"
Session information
import session_info
session_info.show(html=False)
-----
feature_engine 1.8.3
holidays 0.72
lightgbm 4.6.0
matplotlib 3.10.1
numpy 2.2.5
optuna 3.6.2
pandas 2.2.3
session_info v1.0.1
skforecast 0.16.0
sklearn 1.6.1
-----
IPython 9.1.0
jupyter_client 8.6.3
jupyter_core 5.7.2
notebook 6.5.7
-----
Python 3.12.9 | packaged by Anaconda, Inc. | (main, Feb 6 2025, 18:56:27) [GCC 11.2.0]
Linux-6.11.0-25-generic-x86_64-with-glibc2.39
-----
Session information updated at 2025-05-19 14:38
Citation
How to cite this document
If you use this document or any part of it, please acknowledge the source, thank you!
A Step-by-Step Guide to Global Time Series Forecasting Using Kaggle Sticker Sales Data by Joaquín Amat Rodrigo and Javier Escobar Ortiz, available under Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0 DEED) at https://cienciadedatos.net/documentos/py66-forecasting-sticker-sales-kaggle.html
How to cite skforecast
If you use skforecast for a publication, we would appreciate it if you cite the published software.
Zenodo:
Amat Rodrigo, Joaquin, & Escobar Ortiz, Javier. (2024). skforecast (v0.16.0). Zenodo. https://doi.org/10.5281/zenodo.8382788
APA:
Amat Rodrigo, J., & Escobar Ortiz, J. (2024). skforecast (Version 0.16.0) [Computer software]. https://doi.org/10.5281/zenodo.8382788
BibTeX:
@software{skforecast, author = {Amat Rodrigo, Joaquin and Escobar Ortiz, Javier}, title = {skforecast}, version = {0.16.0}, month = {05}, year = {2025}, license = {BSD-3-Clause}, url = {https://skforecast.org/}, doi = {10.5281/zenodo.8382788} }
This work by Joaquín Amat Rodrigo and Javier Escobar Ortiz is licensed under an Attribution-NonCommercial-ShareAlike 4.0 International license.
Allowed:
- Share: copy and redistribute the material in any medium or format.
- Adapt: remix, transform, and build upon the material.
Under the following terms:
- Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- NonCommercial: You may not use the material for commercial purposes.
- ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.