Missing values in time series forecasting
In many real forecasting use cases, historical data are available but the time series is incomplete. Missing values are a major problem because most forecasting algorithms require a complete series in order to train a model.
A common strategy to overcome this problem is to impute the missing values before training the model, for example with a moving average. However, the quality of the imputations may be poor and degrade the training of the model. One way to improve on plain imputation is to combine it with weighted time series forecasting, which reduces the weight of the imputed observations and therefore their influence during model training.
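As a minimal illustration of the idea (a sketch using plain scikit-learn, not the workflow applied later in this document), most regressors accept a sample_weight argument in fit, so imputed observations can simply be given a weight of 0:
# Minimal sketch: zero-weighting imputed observations (illustrative only)
# ==============================================================================
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

y = pd.Series([1.0, 2.0, np.nan, np.nan, 5.0, 6.0])
y_imputed = y.interpolate(method='linear')       # impute the gap, e.g. linearly
weights   = np.where(y.isna(), 0, 1)             # weight 0 for imputed values, 1 otherwise

X = np.arange(len(y)).reshape(-1, 1)             # dummy feature, just for the example
model = LinearRegression()
model.fit(X, y_imputed, sample_weight=weights)   # imputed rows do not influence the fit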
This document shows two examples of how skforecast makes it easy to apply this strategy.
Libraries
# Data manipulation
# ==============================================================================
import numpy as np
import pandas as pd
from skforecast.datasets import fetch_dataset
# Plots
# ==============================================================================
import matplotlib.pyplot as plt
from skforecast.plot import set_dark_theme
# Modeling and Forecasting
# ==============================================================================
import sklearn
import skforecast
from lightgbm import LGBMRegressor
from skforecast.recursive import ForecasterRecursive
from skforecast.model_selection import TimeSeriesFold
from skforecast.model_selection import backtesting_forecaster
# Warnings configuration
# ==============================================================================
import warnings
color = '\033[1m\033[38;5;208m'
print(f"{color}Version skforecast: {skforecast.__version__}")
print(f"{color}Version scikit-learn: {sklearn.__version__}")
print(f"{color}Version pandas: {pd.__version__}")
print(f"{color}Version numpy: {np.__version__}")
Version skforecast: 0.15.1
Version scikit-learn: 1.5.2
Version pandas: 2.2.3
Version numpy: 2.0.2
✎ Note
In this document, a forecaster of type ForecasterRecursive is used. The same strategy can be applied with any other forecaster available in skforecast.
Data
# Data download
# ==============================================================================
data = fetch_dataset('bicimad')
data
bicimad
-------
This dataset contains the daily users of the bicycle rental service (BiciMad) in the city of Madrid (Spain) from 2014-06-23 to 2022-09-30. The original data was obtained from: Portal de datos abiertos del Ayuntamiento de Madrid https://datos.madrid.es/portal/site/egob
Shape of the dataset: (3022, 1)
| date | users |
|---|---|
| 2014-06-23 | 99 |
| 2014-06-24 | 72 |
| 2014-06-25 | 119 |
| 2014-06-26 | 135 |
| 2014-06-27 | 149 |
| ... | ... |
| 2022-09-26 | 12340 |
| 2022-09-27 | 13888 |
| 2022-09-28 | 14239 |
| 2022-09-29 | 11574 |
| 2022-09-30 | 12957 |

3022 rows × 1 columns
# Generating gaps with missing values
# ==============================================================================
gaps = [
['2020-09-01', '2020-10-10'],
['2020-11-08', '2020-12-15'],
]
for gap in gaps:
    data.loc[gap[0]:gap[1]] = np.nan
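As a quick check (an illustrative snippet, not part of the original notebook), the number of observations turned into missing values can be counted directly:
# Count the missing values introduced by the two gaps (illustrative check)
# ==============================================================================
print(f"Missing values introduced: {data['users'].isna().sum()}")  # 40 + 38 = 78 days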
# Split data into train-test
# ==============================================================================
data = data.loc['2020-06-01':'2021-06-01'].copy()
end_train = '2021-03-01'
data_train = data.loc[: end_train, :]
data_test = data.loc[end_train:, :]
print(f"Dates train : {data_train.index.min()} --- {data_train.index.max()} (n={len(data_train)})")
print(f"Dates test : {data_test.index.min()} --- {data_test.index.max()} (n={len(data_test)})")
Dates train : 2020-06-01 00:00:00 --- 2021-03-01 00:00:00 (n=274)
Dates test  : 2021-03-01 00:00:00 --- 2021-06-01 00:00:00 (n=93)
# Time series plot
# ==============================================================================
set_dark_theme()
fig, ax = plt.subplots(figsize=(7, 3))
data_train.users.plot(ax=ax, label='train', linewidth=1)
data_test.users.plot(ax=ax, label='test', linewidth=1)
for gap in gaps:
    ax.plot(
        [pd.to_datetime(gap[0]), pd.to_datetime(gap[1])],
        [data.users[pd.to_datetime(gap[0]) - pd.Timedelta(days=1)],
         data.users[pd.to_datetime(gap[1]) + pd.Timedelta(days=1)]],
        color     = 'red',
        linestyle = '--',
        label     = 'gap'
    )
ax.set_xlabel("")
ax.set_title('Number of users BiciMAD')
handles, labels = plt.gca().get_legend_handles_labels()
by_label = dict(zip(labels, handles))
ax.legend(by_label.values(), by_label.keys(), loc='lower right');
Impute missing values
# Value imputation using linear interpolation
# ======================================================================================
data['users_imputed'] = data['users'].interpolate(method='linear')
data_train = data.loc[: end_train, :]
data_test = data.loc[end_train:, :]
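As a quick check (an illustrative snippet, not part of the original notebook), a few dates inside the first gap can be inspected to see the interpolated values next to the original missing ones:
# Inspect some imputed values inside the first gap (illustrative check)
# ==============================================================================
print(data.loc['2020-09-01':'2020-09-04', ['users', 'users_imputed']])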
Using imputed values in model training
# Create recursive multi-step forecaster (ForecasterRecursive)
# ==============================================================================
forecaster = ForecasterRecursive(
regressor = LGBMRegressor(random_state=123, verbose=-1),
lags = 14
)
# Backtesting: predict next 7 days at a time.
# ==============================================================================
cv = TimeSeriesFold(
steps = 7,
initial_train_size = len(data.loc[:end_train]),
refit = True,
fixed_train_size = False
)
metric, predictions = backtesting_forecaster(
forecaster = forecaster,
y = data.users_imputed,
cv = cv,
metric = 'mean_absolute_error',
)
display(metric)
predictions.head(4)
|   | mean_absolute_error |
|---|---|
| 0 | 2151.339364 |
|   | pred |
|---|---|
| 2021-03-02 | 9679.561409 |
| 2021-03-03 | 10556.841280 |
| 2021-03-04 | 8922.423792 |
| 2021-03-05 | 8874.277159 |
Give weight of zero to imputed values
To minimize the influence of the imputed values on the model, a custom function is defined to create weights following these rules:
- Weight of 0 if the index date has been imputed or is within 14 days ahead of an imputed day.
- Weight of 1 otherwise.
If an observation has a weight of 0, it has no influence at all during model training.
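The reason is that the regressor weights each observation's contribution to the training loss. As a minimal formulation, assuming the squared-error objective used by default for LightGBM regression:

$$\mathcal{L}(\theta) = \sum_{i} w_i \left( y_i - \hat{y}_i(\theta) \right)^2$$

Any observation with $w_i = 0$ therefore contributes nothing to the loss or to its gradient.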
✎ Note
Imputed values should participate in the training process neither as a target nor as a predictor (lag). Therefore, values within a window as large as the number of lags used should also be excluded.
# Custom function to create weights
# ==============================================================================
def custom_weights(index):
    """
    Return 0 for dates within the gaps shifted 14 days forward (so that lag
    windows containing imputed values are also excluded), 1 otherwise.
    """
    gaps = [
        ['2020-09-01', '2020-10-10'],
        ['2020-11-08', '2020-12-15'],
    ]
    missing_dates = [
        pd.date_range(
            start = pd.to_datetime(gap[0]) + pd.Timedelta('14d'),
            end   = pd.to_datetime(gap[1]) + pd.Timedelta('14d'),
            freq  = 'D'
        )
        for gap in gaps
    ]
    missing_dates = pd.DatetimeIndex(np.concatenate(missing_dates))
    weights = np.where(index.isin(missing_dates), 0, 1)

    return weights
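As a quick sanity check (an illustrative snippet, not part of the original workflow), the number of training observations that receive a weight of 0 can be inspected:
# Number of training observations with weight 0 (illustrative check)
# ==============================================================================
weights = custom_weights(data_train.index)
print(f"Observations with weight 0: {(weights == 0).sum()} of {len(weights)}")  # expected: 78 of 274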
Again, a ForecasterRecursive is trained, but this time including the custom_weights function.
# Create recursive multi-step forecaster (ForecasterRecursive)
# ==============================================================================
forecaster = ForecasterRecursive(
regressor = LGBMRegressor(random_state=123, verbose=-1),
lags = 14,
weight_func = custom_weights
)
# Backtesting: predict next 7 days at a time.
# ==============================================================================
metric, predictions = backtesting_forecaster(
forecaster = forecaster,
y = data.users_imputed,
cv = cv,
metric = 'mean_absolute_error',
)
display(metric)
predictions.head(4)
|   | mean_absolute_error |
|---|---|
| 0 | 1904.830714 |
|   | pred |
|---|---|
| 2021-03-02 | 10524.159747 |
| 2021-03-03 | 10087.283682 |
| 2021-03-04 | 8882.926166 |
| 2021-03-05 | 9474.810215 |
Giving a weight of 0 to the imputed values (excluding them from model training) improves the forecasting performance: the mean absolute error drops from 2151 to 1905.
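For reference, the relative improvement can be computed directly from the two backtesting metrics shown above (an illustrative snippet; the values are copied from the outputs):
# Relative improvement in MAE when zero-weighting imputed observations
# ==============================================================================
mae_imputed  = 2151.339364   # backtesting with imputed values, all weights equal to 1
mae_weighted = 1904.830714   # backtesting with weight 0 on the imputed windows
print(f"MAE reduction: {(mae_imputed - mae_weighted) / mae_imputed:.1%}")  # ~11.5%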
Session information
import session_info
session_info.show(html=False)
-----
lightgbm            4.6.0
matplotlib          3.10.1
numpy               2.0.2
pandas              2.2.3
session_info        1.0.0
skforecast          0.15.1
sklearn             1.5.2
-----
IPython             9.0.2
jupyter_client      8.6.3
jupyter_core        5.7.2
notebook            6.4.12
-----
Python 3.12.9 | packaged by Anaconda, Inc. | (main, Feb 6 2025, 18:56:27) [GCC 11.2.0]
Linux-5.15.0-1077-aws-x86_64-with-glibc2.31
-----
Session information updated at 2025-03-26 16:48
Citation
How to cite this document
If you use this document or any part of it, please acknowledge the source. Thank you!
Forecasting time series with missing values by Joaquín Amat Rodrigo and Javier Escobar Ortiz, available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license at https://www.cienciadedatos.net/documentos/py46-forecasting-time-series-missing-values
How to cite skforecast
If you use skforecast for a publication, we would appreciate it if you cite the published software.
Zenodo:
Amat Rodrigo, Joaquin, & Escobar Ortiz, Javier. (2024). skforecast (v0.15.1). Zenodo. https://doi.org/10.5281/zenodo.8382788
APA:
Amat Rodrigo, J., & Escobar Ortiz, J. (2024). skforecast (Version 0.15.1) [Computer software]. https://doi.org/10.5281/zenodo.8382788
BibTeX:
@software{skforecast,
  author  = {Amat Rodrigo, Joaquin and Escobar Ortiz, Javier},
  title   = {skforecast},
  version = {0.15.1},
  month   = {03},
  year    = {2025},
  license = {BSD-3-Clause},
  url     = {https://skforecast.org/},
  doi     = {10.5281/zenodo.8382788}
}
This work by Joaquín Amat Rodrigo and Javier Escobar Ortiz is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.
Allowed:
- Share: copy and redistribute the material in any medium or format.
- Adapt: remix, transform, and build upon the material.

Under the following terms:
- Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- NonCommercial: You may not use the material for commercial purposes.
- ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.