Clustering time series to improve forecasting models

If you like  Skforecast ,  help us giving a star on   GitHub! ⭐️

Clustering Time Series to Improve Forecasting Models

Joaquín Amat Rodrigo, Javier Escobar Ortiz
January, 2025

Introduction

Global Forecasting Models involves the creation of a single forecasting model that considers all time series simultaneously. This approach attempts to capture the underlying patterns common to all series, thereby reducing the impact of noise present in individual series. It offers computational efficiency, ease of maintenance and robust generalisation across time series. Global forecasting assumes that time series with similar behaviour can benefit from being modelled together. When working with hundreds or thousands of series, initial clustering can help maximise model performance.

Clustering is an unsupervised analysis technique that groups a set of observations into clusters that contain observations that are considered homogeneous, while observations in different clusters are considered heterogeneous. Algorithms that cluster time series can be divided into two groups: those that use a transformation to create features prior to clustering (feature-driven time series clustering), and those that work directly on the time series (elastic distance measures).

  • Feature-driven time series clustering: Features describing structural characteristics are extracted from each time series, then these features are fed into arbitrary clustering algorithms. This features are obtained by applying statistical operations that best capture the underlying characteristics: trend, seasonality, periodicity, serial correlation, skewness, kurtosis, chaos, nonlinearity, and self-similarity.

  • Elastic distance measures: This approach works directly on the time series, adjusting or "realigning" the series in comparison to each other. The best known of this family of measures is Dynamic Time Warping (DTW).

If after performing the clustering analysis, distinct groups of series are identified, two strategies can be used:

  • Model all series together, while adding a feature that indicates the cluster to which each series belongs.

  • Build several global models, with each model tailored to a specific group of series.

To help understand the benefits of clustering, this document focuses on examining and comparing the results obtained when predicting the energy consumption of over a thousand buildings. Although the primary use of each building is available in the dataset, it may not reflect groups with similar patterns of energy use, so additional groups are created using clustering methods. A total of 4 experiments are performed:

  • Modelling all buildings together with a unique global model strategy.

  • Modelling groups of buildings based on their primary use (a global model per primary use).

  • Modelling groups of buildings based on time series feature clustering (a global model per cluster).

  • Modelling groups of buildings based on Dynamic Time Warping (DTW) clustering (a global model per cluster).

Libraries

Libraries used in this document.

In [1]:
# Data management
# ==============================================================================
import numpy as np
import pandas as pd
from datetime import datetime
from skforecast.datasets import fetch_dataset

# Plots
# ==============================================================================
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
plt.style.use('seaborn-v0_8-darkgrid')

# Forecasting
# ==============================================================================
import lightgbm
from lightgbm import LGBMRegressor
from sklearn.preprocessing import StandardScaler
import skforecast
from skforecast.recursive import ForecasterRecursiveMultiSeries
from skforecast.model_selection import TimeSeriesFold
from skforecast.model_selection import backtesting_forecaster_multiseries

# Feature extraction
# ==============================================================================
import tsfresh
from tsfresh import extract_features
from tsfresh.feature_extraction.settings import ComprehensiveFCParameters
from tsfresh.feature_extraction.settings import from_columns

# Clustering
# ==============================================================================
import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import sktime
from sktime.clustering.k_means import TimeSeriesKMeans

# Warnings configuration
# ==============================================================================
import warnings
warnings.filterwarnings('once')

color = '\033[1m\033[38;5;208m' 
print(f"{color}Version skforecast: {skforecast.__version__}")
print(f"{color}Version scikit-learn: {sklearn.__version__}")
print(f"{color}Version lightgbm: {lightgbm.__version__}")
print(f"{color}Version tsfresh: {tsfresh.__version__}")
print(f"{color}Version sktime: {sktime.__version__}")
print(f"{color}Version pandas: {pd.__version__}")
print(f"{color}Version numpy: {np.__version__}")
Version skforecast: 0.14.0
Version scikit-learn: 1.5.1
Version lightgbm: 4.5.0
Version tsfresh: 0.20.3
Version sktime: 0.35.0
Version pandas: 2.2.3
Version numpy: 2.0.0

Data

Data used in this document has been obtained from the Kaggle competition Addison Howard, Chris Balbach, Clayton Miller, Jeff Haberl, Krishnan Gowri, Sohier Dane. (2019). ASHRAE - Great Energy Predictor III. Kaggle.

Three files are used to create the modeling data set:

  • weather_train.csv and weather_test.csv: These files contain weather-related data for each building, including outdoor air temperature, dew temperature, relative humidity, and other weather parameters. The weather data is crucial for understanding the impact of external conditions on building energy usage.

  • building_metadata.csv: This file provides metadata for each building in the dataset, such as building type, primary use, square footage, floor count, and year built. This information helps in understanding the characteristics of the buildings and their potential influence on energy consumption patterns.

  • train.csv: the train dataset contains the target variable, i.e., the energy consumption data for each building, along with the timestamps for the energy consumption readings. It also includes the corresponding building and weather IDs to link the information across different datasets.

The three files have been preprocessed to remove buildings with less than 85% of values not equal to NaN or zero, to use only the electricity meter, and to aggregate the data to a daily frequency.

In [2]:
# Load preprocessed data
# ==============================================================================
data = fetch_dataset('ashrae_daily')
data.head()
ashrae_daily
------------
Daily energy consumption data from the ASHRAE competition with building metadata
and weather data.
Kaggle competition Addison Howard, Chris Balbach, Clayton Miller, Jeff Haberl,
Krishnan Gowri, Sohier Dane. (2019). ASHRAE - Great Energy Predictor III.
Kaggle. https://www.kaggle.com/c/ashrae-energy-prediction/overview
Shape of the dataset: (444324, 10)
Out[2]:
building_id meter_reading site_id primary_use square_feet air_temperature dew_temperature sea_level_pressure wind_direction wind_speed
timestamp
2016-01-01 id_105 1102.766933 1 Education 50623 3.8 2.4 1020.9 240.0 3.1
2016-01-01 id_106 17.606200 1 Education 5374 3.8 2.4 1020.9 240.0 3.1
2016-01-01 id_107 8233.625107 1 Education 97532 3.8 2.4 1020.9 240.0 3.1
2016-01-01 id_108 4289.478432 1 Education 81580 3.8 2.4 1020.9 240.0 3.1
2016-01-01 id_109 3911.921989 1 Education 56995 3.8 2.4 1020.9 240.0 3.1
In [3]:
# Fill missing values of air_temperature and wind_speed using fowrard and backward fill
# ==============================================================================
# Imputation must be done separately for each building
data = data.sort_values(by=['building_id', 'timestamp'])
data['air_temperature'] = data.groupby('building_id')['air_temperature'].ffill().bfill()
data['wind_speed'] = data.groupby('building_id')['wind_speed'].ffill().bfill()
data = data.sort_index()

print(
    f"Rage of dates available : {data.index.min()} --- {data.index.max()}  "
    f"(n_days={(data.index.max() - data.index.min()).days})"
)
Rage of dates available : 2016-01-01 00:00:00 --- 2016-12-31 00:00:00  (n_days=365)

Exploratory data analysis

Building primary use

One of the key attributes associated with each building is its designated use. This feature may play a crucial role in influencing the energy consumption pattern, as distinct uses can significantly impact both the quantity and timing of energy consumption.

In [4]:
# Number of buildings and type of buildings based on primary use
# ==============================================================================
n_building = data['building_id'].nunique()
n_type_building = data['primary_use'].nunique()
range_datetime = [data.index.min(), data.index.max()]

print(f"Number of buildings: {n_building}")
print(f"Number of building types: {n_type_building}")
display(data.drop_duplicates(subset=['building_id'])['primary_use'].value_counts())
Number of buildings: 1214
Number of building types: 16
primary_use
Education                        463
Office                           228
Entertainment/public assembly    157
Public services                  150
Lodging/residential              112
Other                             19
Healthcare                        19
Parking                           13
Warehouse/storage                 12
Manufacturing/industrial          10
Services                           9
Food sales and service             5
Technology/science                 5
Retail                             5
Utility                            4
Religious worship                  3
Name: count, dtype: int64

For certain primary use categories, there is a limited number of buildings within the dataset. To streamline the analysis, categories with fewer than 50 buildings are grouped into the "Other" category.

In [5]:
# Types of buildings (primary use) with less than 50 samples are grouped as "Other".
# ==============================================================================
infrequent_categories = (
    data
    .drop_duplicates(subset=['building_id'])['primary_use']
    .value_counts()
    .loc[lambda x: x < 50]
    .index
    .tolist()
)
print("Infrequent categories:")
print("======================")
print('\n'.join(infrequent_categories))

data['primary_use'] = np.where(
    data['primary_use'].isin(infrequent_categories),
    'Other',
    data['primary_use']
)
Infrequent categories:
======================
Other
Healthcare
Parking
Warehouse/storage
Manufacturing/industrial
Services
Food sales and service
Technology/science
Retail
Utility
Religious worship

Next, a plot is created showing the energy consumption for a randomly selected building within each respective category, and a plot of all available time series for each category.

In [6]:
# Time series for 1 random selected building per group
# ==============================================================================
fig, axs = plt.subplots(nrows=3, ncols=2, figsize=(8, 5.5), sharex=True, sharey=False)
sample_ids = (
    data
    .groupby('primary_use')['building_id']
    .apply(lambda x: x.sample(1, random_state=333))
    .tolist()
)
axs = axs.flatten()

for i, building_id in enumerate(sample_ids):
    data_sample = data[data['building_id'] == building_id]
    building_type = data_sample['primary_use'].unique()[0]
    data_sample.plot(
        y        = 'meter_reading',
        ax       = axs[i],
        legend   = False,
        title    = f"Building: {building_id}, type: {building_type}",
        fontsize = 8
    )
    axs[i].set_xlabel("")
    axs[i].set_ylabel("")
    # Scientific notation for y axis
    axs[i].ticklabel_format(axis='y', style='sci', scilimits=(0, 0))
    axs[i].title.set_size(9)

fig.suptitle('Energy consumption for 6 random buildings', fontsize=12)
fig.tight_layout()
plt.show()
In [7]:
# Energy consumption by type of building (one gray line per building)
# ==============================================================================
fig, axs = plt.subplots(nrows=3, ncols=2, figsize=(8, 5.5), sharex=True, sharey=False)
axs = axs.flatten()

for i, building_type in enumerate(data['primary_use'].unique()):
    data_sample = data[data['primary_use'] == building_type]
    data_sample = data_sample.pivot_table(
                      index   = 'timestamp',
                      columns = 'building_id',
                      values  = 'meter_reading',
                      aggfunc = 'mean'
                  )
    data_sample.plot(
        legend   = False,
        title    = f"Type: {building_type}",
        color    = 'gray',
        alpha    = 0.2,
        fontsize = 8,
        ax       = axs[i]
    )
    data_sample.mean(axis=1).plot(
        ax       = axs[i],
        color    = 'blue',
        fontsize = 8
    )
    axs[i].set_xlabel("")
    axs[i].set_ylabel("")
    # Scientific notation for y axis
    axs[i].ticklabel_format(axis='y', style='sci', scilimits=(0,0))
    axs[i].title.set_size(9)

    # Limit the axis to 5 times the maximum mean value to improve visualization
    axs[i].set_ylim(
        bottom = 0,
        top    = 5 * data_sample.mean(axis=1).max()
    )

fig.tight_layout()
fig.suptitle('Energy consumption by type of building', fontsize=12)
plt.subplots_adjust(top=0.9)
plt.show()

The graph reveals that there is significant variability in consumption patterns among buildings of the same purpose. This suggests that there may be scope for improving the criteria by which buildings are grouped.

Clustering time series

The idea behind modeling multiple series at the same time is to be able to capture the main patterns that govern the series, thereby reducing the impact of the potential noise that each series may have. This means that series that behave similarly may benefit from being modeled together. One way to identify potential groups of series is to perform a clustering study prior to modeling. If clear groups are identified as a result of clustering, it is appropriate to model each of them separately.

Clustering is an unsupervised analysis technique that groups a set of observations into clusters that contain observations that are considered homogeneous, while observations in different clusters are considered heterogeneous. Algorithms that cluster time series can be divided into two groups: those that use a transformation to create features prior to clustering (feature-driven time series clustering), and those that work directly on the time series (elastic distance measures).

  • Feature-driven time series clustering: Features describing structural characteristics are extracted from each time series, then these features are fed into arbitrary clustering algorithms. This features are obtained by applying statistical operations that best capture the underlying characteristics: trend, seasonality, periodicity, serial correlation, skewness, kurtosis, chaos, nonlinearity, and self-similarity.

  • Elastic distance measures: This approach works directly on the time series, adjusting or "realigning" the series in comparison to each other. The best known of this family of measures is Dynamic Time Warping (DTW).

For a more detailed review of time series clustering, see A review and evaluation of elastic distance functions for time series clustering.

Clustering based on time series features

Feature creation

tsfresh is a powerful Python library for feature engineering from time-series and sequential data, including statistical measures, Fourier coefficients, and various other time-domain and frequency-domain features. It provides a systematic approach to automate the calculation of features and select the most informative ones.

To start, the default configuration of tsfresh is used, which calculates all available features. The easiest way to access this configuration is to use the functions provided by the ComprehensiveFCParameters class. It creates a dictionary which is a mapping from string (the name of each feature) to a list of dictionary parameters to be used when the function with that name is called.

In [8]:
# Default features and settings created by tsfresh
# ==============================================================================
default_features = ComprehensiveFCParameters()
print("Name of the features extracted by tsfresh:")
print("=========================================")
print('\n'.join(list(default_features)))
Name of the features extracted by tsfresh:
=========================================
variance_larger_than_standard_deviation
has_duplicate_max
has_duplicate_min
has_duplicate
sum_values
abs_energy
mean_abs_change
mean_change
mean_second_derivative_central
median
mean
length
standard_deviation
variation_coefficient
variance
skewness
kurtosis
root_mean_square
absolute_sum_of_changes
longest_strike_below_mean
longest_strike_above_mean
count_above_mean
count_below_mean
last_location_of_maximum
first_location_of_maximum
last_location_of_minimum
first_location_of_minimum
percentage_of_reoccurring_values_to_all_values
percentage_of_reoccurring_datapoints_to_all_datapoints
sum_of_reoccurring_values
sum_of_reoccurring_data_points
ratio_value_number_to_time_series_length
sample_entropy
maximum
absolute_maximum
minimum
benford_correlation
time_reversal_asymmetry_statistic
c3
cid_ce
symmetry_looking
large_standard_deviation
quantile
autocorrelation
agg_autocorrelation
partial_autocorrelation
number_cwt_peaks
number_peaks
binned_entropy
index_mass_quantile
cwt_coefficients
spkt_welch_density
ar_coefficient
change_quantiles
fft_coefficient
fft_aggregated
value_count
range_count
approximate_entropy
friedrich_coefficients
max_langevin_fixed_point
linear_trend
agg_linear_trend
augmented_dickey_fuller
number_crossing_m
energy_ratio_by_chunks
ratio_beyond_r_sigma
linear_trend_timewise
count_above
count_below
lempel_ziv_complexity
fourier_entropy
permutation_entropy
query_similarity_count
mean_n_absolute_max

Many of these features are calculated using different values for their arguments.

In [9]:
# Default configuration for feature "partial_autocorrelation"
# ==============================================================================
default_features['partial_autocorrelation']
Out[9]:
[{'lag': 0},
 {'lag': 1},
 {'lag': 2},
 {'lag': 3},
 {'lag': 4},
 {'lag': 5},
 {'lag': 6},
 {'lag': 7},
 {'lag': 8},
 {'lag': 9}]

To access the detailed view of each feature and the parameter values included in the default configuration, use the following code:

In [10]:
# Default configuration for all extracted feature
# ==============================================================================
# from pprint import pprint
# pprint(default_features)

Once the configuration of features has been defined, the next step is to extract them from the time series. The function extract_features() of tsfresh is used for this purpose. This function receives as input the time series and the configuration of the features to be extracted. The output is a dataframe with the extracted features.

In [11]:
# Feature extraction
# ==============================================================================
ts_features = extract_features(
    timeseries_container  = data[['building_id', 'meter_reading']].reset_index(),
    column_id             = "building_id",
    column_sort           = "timestamp",
    column_value          = "meter_reading",
    default_fc_parameters = default_features,
    impute_function       = tsfresh.utilities.dataframe_functions.impute,
    n_jobs                = 4
)

print("Shape of ts_features:", ts_features.shape)
ts_features.head(3)
Feature Extraction:   0%|          | 0/20 [00:00<?, ?it/s]Feature Extraction: 100%|██████████| 20/20 [01:26<00:00,  4.34s/it]
Shape of ts_features: (1214, 783)
Out[11]:
meter_reading__variance_larger_than_standard_deviation meter_reading__has_duplicate_max meter_reading__has_duplicate_min meter_reading__has_duplicate meter_reading__sum_values meter_reading__abs_energy meter_reading__mean_abs_change meter_reading__mean_change meter_reading__mean_second_derivative_central meter_reading__median ... meter_reading__fourier_entropy__bins_5 meter_reading__fourier_entropy__bins_10 meter_reading__fourier_entropy__bins_100 meter_reading__permutation_entropy__dimension_3__tau_1 meter_reading__permutation_entropy__dimension_4__tau_1 meter_reading__permutation_entropy__dimension_5__tau_1 meter_reading__permutation_entropy__dimension_6__tau_1 meter_reading__permutation_entropy__dimension_7__tau_1 meter_reading__query_similarity_count__query_None__threshold_0.0 meter_reading__mean_n_absolute_max__number_of_maxima_7
id_1000 1.0 0.0 0.0 1.0 5.424863e+05 8.444178e+08 235.994991 0.332875 -0.255837 1515.625595 ... 0.170467 0.260704 0.780228 1.515622 2.520615 3.496026 4.341908 5.045132 0.0 2144.607384
id_1001 1.0 0.0 0.0 1.0 2.543133e+05 4.114736e+08 212.298183 -0.049314 0.026097 304.500150 ... 0.090729 0.090729 0.700821 1.752526 3.038890 4.327469 5.278212 5.735413 0.0 2871.142857
id_1002 1.0 0.0 0.0 1.0 1.409981e+06 6.255919e+09 960.832035 3.226027 -0.308379 3649.003765 ... 0.488746 0.765782 2.650153 1.747629 3.035159 4.431478 5.394148 5.780678 0.0 7882.572432

3 rows × 783 columns

As a result of the extraction process, 783 features have been created for each time series (building_id in this use case). The returned dataframe has as its index the column specified in the column_id argument of extract_features.

The default extraction of tsfresh produces a huge number of features. However, only a few of these may be of interest in each use case. To select the most relevant ones, tsfresh includes an automated selection process based on hypothesis tests (FeatuRE Extraction based on Scalable Hypothesis tests). In this process, the features are individually and independently evaluated for their significance in predicting the target under investigation.

Warning

The selection process used by tsfresh is based on the significance of each feature in accurately predicting the target variable. To perform this process, a target variable is required – for example, the type of building associated with a given time series. However, there are cases where a target variable is not readily available. In such cases, alternative strategies can be used:
  • Instead of calculating all the standard features, focus only on those that are likely to be relevant to the specific application, based on expert knowledge.
  • Exclude features based on criteria such as low variance and high correlation. This helps refine the set of features to consider, focusing on those that provide the most meaningful information for analysis.
  • Use techniques such as PCA, t-SNE, or auto-encoders to reduce dimensionality.

In this case, since the goal is to identify groups of buildings with similar energy consumption patterns, regardless of the building type, the automated selection process is not used.

Some created features may have missing values for some of the time series. Since most clustering algorithms do not allow for missing values, these are excluded.

In [12]:
# Remove features with missing values
# ==============================================================================
ts_features = ts_features.dropna(axis=1, how='any')
print(f"Number of features after selection and removing missing values: {ts_features.shape[1]}")
Number of features after selection and removing missing values: 783

Once the final feature matrix is ready, it may be useful to create a new dictionary to store the selected features and the parameters used to calculate them. This can be easily done using the from_columns function.

In [13]:
# Dictionary with selected features and their configuration
# ==============================================================================
ts_features_info = from_columns(ts_features)
# pprint(ts_features_info['meter_reading'])

K-means clustering

A K-means clustering method is used to group the buildings. Since clustering is known to be negatively affected by high dimensionality, and since several hundred features have been created for each building, a PCA is used to reduce the dimensionality of the data before the k-means is applied.

In [14]:
# Scale features so all of them have mean 0 and std 1
# ==============================================================================
scaler = StandardScaler().set_output(transform="pandas")
ts_features_scaled = scaler.fit_transform(ts_features)
ts_features_scaled.head(2)
Out[14]:
meter_reading__variance_larger_than_standard_deviation meter_reading__has_duplicate_max meter_reading__has_duplicate_min meter_reading__has_duplicate meter_reading__sum_values meter_reading__abs_energy meter_reading__mean_abs_change meter_reading__mean_change meter_reading__mean_second_derivative_central meter_reading__median ... meter_reading__fourier_entropy__bins_5 meter_reading__fourier_entropy__bins_10 meter_reading__fourier_entropy__bins_100 meter_reading__permutation_entropy__dimension_3__tau_1 meter_reading__permutation_entropy__dimension_4__tau_1 meter_reading__permutation_entropy__dimension_5__tau_1 meter_reading__permutation_entropy__dimension_6__tau_1 meter_reading__permutation_entropy__dimension_7__tau_1 meter_reading__query_similarity_count__query_None__threshold_0.0 meter_reading__mean_n_absolute_max__number_of_maxima_7
id_1000 0.028712 -0.119173 -0.649265 0.933043 -0.306332 -0.136331 -0.144129 0.102783 0.114077 -0.301155 ... -0.254284 -0.331247 -0.741751 -1.961882 -1.922583 -1.994894 -1.987187 -1.477824 0.0 -0.303740
id_1001 0.028712 -0.119173 -0.649265 0.933043 -0.406251 -0.138365 -0.173691 0.010218 0.257708 -0.453833 ... -0.744451 -0.917204 -0.843649 0.680139 0.864391 0.743247 0.601810 0.510372 0.0 -0.241759

2 rows × 783 columns

In [15]:
# Plot variance reduction as a function of the number of PCA components
# ==============================================================================
pca = PCA()
pca.fit(ts_features_scaled)
n_components = np.argmax(np.cumsum(pca.explained_variance_ratio_) > 0.85) + 1

fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(6, 2.5))
ax.plot(np.cumsum(pca.explained_variance_ratio_))
ax.set_title('Variance explained by PCA components')
ax.set_xlabel('Number of components')
ax.set_ylabel('Cumulative explained variance')
ax.axhline(y=0.85, color='red', linestyle='--')
ax.axvline(x=n_components, color='red', linestyle='--')
ax.text(
    x=n_components + 10,
    y=0.4,
    s=f"n_components = {n_components}",
    color='red',
    verticalalignment='center'
)

plt.show();

The dimensionality of the original 783 features is reduced by using the first 68 principal components (explaining 85% of the variance).

In [16]:
# PCA with as many components as necessary to explain 85% of the variance
# ==============================================================================
pca = PCA(n_components=0.85)
pca_projections = pca.fit_transform(ts_features_scaled)

# Create a data frame with the projections. Each column is a principal component
# and each row is the id of the building.
pca_projections = pd.DataFrame(
    pca_projections,
    index   = ts_features_scaled.index,
    columns = [f"PC{i}" for i in range(1, pca.n_components_ + 1)]
)

print(f"Number of components selected: {pca.n_components_}")
print(f"Explained variance: {pca.explained_variance_ratio_.sum():.3f}")
pca_projections.head(3)
Number of components selected: 68
Explained variance: 0.851
Out[16]:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 ... PC59 PC60 PC61 PC62 PC63 PC64 PC65 PC66 PC67 PC68
id_1000 -3.645479 -1.691632 6.208614 -1.057665 -0.261976 3.046202 -3.506316 -3.401727 1.613060 -1.374571 ... 1.277602 0.057423 -1.093630 0.707782 0.184354 -0.612926 1.226124 0.538833 0.887622 -0.194430
id_1001 -3.325130 4.390626 -5.022388 3.768395 4.646361 -1.631149 -0.657330 3.772956 -2.057336 2.860316 ... -0.328626 1.070108 2.019594 1.286460 -3.258346 -0.690189 1.131247 0.110061 0.517350 0.232683
id_1002 3.042578 1.729933 1.835861 -4.895404 2.387857 1.261218 -0.031567 0.435050 2.554299 3.283560 ... 1.452407 -2.381626 1.123819 -0.592596 -1.790022 1.866666 -0.493726 0.678684 -0.484853 3.647496

3 rows × 68 columns

One of the inherent challenges of clustering is determining the optimal number of clusters, since there is no ground truth or predefined number of clusters in unsupervised learning. Several heuristics have been proposed to guide this selection, and in this case, the elbow method is used.

The elbow method involves plotting the total within-cluster sum of squares (WSS) against the number of clusters (k). The optimal number of clusters is typically identified at the "elbow" of the curve, where the rate of decrease in WSS begins to decrease significantly. This suggests that adding more clusters beyond this point provides little improvement in clustering performance.

In [17]:
# Optimal number of clusters
# ==============================================================================
# Identify the optimal number of clusters using the elbow method
range_n_clusters = range(1, 30)
inertias = []
size_of_clusters = []

for n_clusters in range_n_clusters:
    kmeans = KMeans(
                 n_clusters   = n_clusters, 
                 n_init       = 20, 
                 random_state = 963852
             )
    kmeans.fit(pca_projections)
    inertias.append(kmeans.inertia_)
    size_of_clusters.append(np.unique(kmeans.labels_, return_counts=True)[1].tolist())

inertias = pd.Series(inertias, index=range_n_clusters)
dicrease_in_inertia = inertias.pct_change() * -100

fig, axs = plt.subplots(1, 2, figsize=(9, 2.5))
inertias.plot(marker='o', ax=axs[0])
axs[0].xaxis.set_major_locator(ticker.MaxNLocator(integer=True))
axs[0].set_title("Cluster inertia vs number of clusters")
axs[0].set_xlabel('Number of clusters')
axs[0].set_ylabel('Intra-cluster variance (inertia)')

dicrease_in_inertia.plot(kind='bar', ax=axs[1])
axs[1].set_title("Decrease in inertia")
axs[1].set_xlabel('Number of clusters')
axs[1].set_ylabel('Decrease in inertia (%)')
fig.tight_layout()
plt.show();

The chart shows that after 7 clusters the decrease in inertia slows down, therefore 7 clusters may be a good choice.

In addition to analyzing the evolution of the inertia (intra-variance), it is important to check the size of the clusters being created. The presence of small clusters may indicate overfitting or anomalous samples that do not fit well into any of the groups.

In [18]:
# Distribution of cluster sizes
# ==============================================================================
for n_clusters, sizes in zip(range_n_clusters, size_of_clusters):
    print(f"Size of clusters (n = {n_clusters}): {sizes}")
Size of clusters (n = 1): [1214]
Size of clusters (n = 2): [1201, 13]
Size of clusters (n = 3): [1186, 27, 1]
Size of clusters (n = 4): [1140, 4, 4, 66]
Size of clusters (n = 5): [1105, 4, 1, 97, 7]
Size of clusters (n = 6): [466, 1, 654, 6, 82, 5]
Size of clusters (n = 7): [81, 465, 8, 1, 5, 653, 1]
Size of clusters (n = 8): [662, 466, 6, 4, 1, 1, 73, 1]
Size of clusters (n = 9): [163, 4, 530, 1, 405, 26, 1, 81, 3]
Size of clusters (n = 10): [139, 552, 1, 1, 4, 81, 2, 22, 411, 1]
Size of clusters (n = 11): [562, 1, 5, 5, 1, 126, 413, 1, 18, 1, 81]
Size of clusters (n = 12): [382, 1, 1, 4, 11, 80, 1, 400, 107, 6, 220, 1]
Size of clusters (n = 13): [394, 1, 4, 120, 22, 82, 1, 1, 1, 213, 5, 1, 369]
Size of clusters (n = 14): [370, 1, 394, 1, 81, 1, 1, 4, 26, 1, 1, 123, 1, 209]
Size of clusters (n = 15): [367, 7, 396, 9, 1, 1, 4, 1, 24, 82, 1, 1, 218, 1, 101]
Size of clusters (n = 16): [247, 103, 4, 1, 1, 23, 80, 1, 1, 1, 1, 9, 213, 338, 12, 179]
Size of clusters (n = 17): [1, 105, 4, 11, 384, 366, 24, 2, 1, 1, 17, 7, 1, 214, 1, 2, 73]
Size of clusters (n = 18): [196, 247, 88, 1, 1, 39, 1, 199, 1, 1, 1, 1, 1, 80, 339, 5, 11, 2]
Size of clusters (n = 19): [24, 16, 1, 298, 1, 2, 190, 95, 81, 1, 39, 1, 2, 1, 253, 1, 1, 202, 5]
Size of clusters (n = 20): [93, 254, 1, 57, 1, 82, 344, 1, 1, 1, 141, 16, 2, 1, 16, 5, 1, 1, 195, 1]
Size of clusters (n = 21): [317, 12, 1, 1, 277, 1, 1, 1, 5, 1, 8, 1, 2, 49, 99, 5, 1, 78, 1, 251, 102]
Size of clusters (n = 22): [245, 1, 1, 1, 17, 1, 99, 101, 2, 1, 1, 1, 5, 15, 84, 1, 4, 141, 57, 1, 244, 191]
Size of clusters (n = 23): [249, 1, 1, 210, 1, 77, 1, 1, 101, 1, 8, 1, 24, 180, 5, 1, 80, 1, 8, 1, 1, 1, 260]
Size of clusters (n = 24): [8, 71, 99, 4, 1, 305, 1, 1, 16, 107, 12, 241, 1, 1, 1, 1, 1, 1, 194, 1, 1, 25, 114, 7]
Size of clusters (n = 25): [77, 146, 1, 2, 94, 1, 220, 1, 1, 1, 1, 20, 7, 1, 12, 8, 1, 5, 1, 1, 231, 2, 178, 39, 163]
Size of clusters (n = 26): [78, 1, 140, 1, 16, 235, 95, 1, 16, 1, 4, 1, 250, 57, 1, 1, 1, 186, 7, 1, 116, 1, 1, 1, 1, 1]
Size of clusters (n = 27): [16, 192, 1, 1, 1, 139, 1, 248, 1, 1, 13, 77, 1, 1, 1, 4, 1, 95, 7, 1, 57, 1, 311, 1, 40, 1, 1]
Size of clusters (n = 28): [76, 223, 1, 6, 4, 141, 1, 200, 240, 16, 1, 1, 1, 1, 8, 1, 40, 1, 93, 1, 7, 1, 1, 1, 1, 89, 57, 1]
Size of clusters (n = 29): [1, 89, 1, 1, 1, 9, 214, 8, 126, 16, 1, 1, 1, 57, 1, 1, 203, 1, 1, 1, 1, 1, 26, 124, 100, 1, 148, 76, 3]

When 7 clusters are created, the size of the clusters is not well balanced. Before modeling the time series, the smaller clusters (less than 20 observations) are combined into a single cluster named "other".

In [19]:
# Train clustering model with 6 clusters and assign each building to a cluster
# ==============================================================================
kmeans = KMeans(n_clusters=7, n_init=20, random_state=963852)
kmeans.fit(pca_projections)
clusters = kmeans.predict(pca_projections)
clusters = pd.DataFrame({
    'building_id': pca_projections.index,
    'cluster_base_on_features': clusters.astype(str)
})

# Merge cluster minor clusters into a single cluster
threshold = 20
cluster_size = clusters['cluster_base_on_features'].value_counts()
cluster_size = cluster_size[cluster_size < threshold].index.tolist()
clusters['cluster_base_on_features'] = np.where(
    clusters['cluster_base_on_features'].isin(cluster_size),
    'Other',
    clusters['cluster_base_on_features']
)
display(clusters.head(3))
clusters['cluster_base_on_features'].value_counts().to_frame(name='Number of series')
building_id cluster_base_on_features
0 id_1000 5
1 id_1001 1
2 id_1002 5
Out[19]:
Number of series
cluster_base_on_features
5 653
1 465
0 81
Other 15
In [20]:
# Add the predicted cluster to the data frame with the building information
# ==============================================================================
data = pd.merge(
           data.reset_index(),  # To avoid losing the index
           clusters,
           on       ='building_id',
           how      ='left',
           validate = 'm:1'
       ).set_index('timestamp')

data.head(3)
Out[20]:
building_id meter_reading site_id primary_use square_feet air_temperature dew_temperature sea_level_pressure wind_direction wind_speed cluster_base_on_features
timestamp
2016-01-01 id_1000 1012.500694 10 Education 121146 -8.9 NaN NaN NaN 3.1 5
2016-01-01 id_970 0.000000 9 Other 346056 7.8 NaN NaN NaN 3.1 5
2016-01-01 id_1066 639.270004 12 Education 55800 1.9 -1.2 1016.2 200.0 5.0 5

Once each building has been assigned to a cluster, it is useful to examine the characteristics of the grouped buildings. For example, the distribution of the primary use of the buildings.

In [21]:
# Percentaje of building types per cluster
# ==============================================================================
primary_usage_per_cluster = (
    data
    .groupby('cluster_base_on_features')['primary_use']
    .value_counts(normalize=True)
    .unstack()
    .fillna(0)
)

primary_usage_per_cluster = (100 * primary_usage_per_cluster).round(1)
primary_usage_per_cluster
Out[21]:
primary_use Education Entertainment/public assembly Lodging/residential Office Other Public services
cluster_base_on_features
0 59.3 11.1 1.2 16.0 2.5 9.9
1 26.0 15.3 18.7 9.2 11.6 19.1
5 43.8 11.6 3.7 25.4 7.4 8.1
Other 53.3 6.7 0.0 40.0 0.0 0.0

The results suggest (the table should be read horizontally) that the clustering process based on features extracted from the time series generates groups that differ from those formed by the main purpose of the building.

Clustering based on elastic distance (DTW)

Dynamic Time Warping (DTW) is a technique that measures the similarity between two temporal sequences, which may vary in speed. In essence, it's an elastic distance measure that allows the time series to be stretched or compressed to align with each other optimally. Clustering with DTW involves grouping time series data based on their dynamic time-warped distances, ensuring that time series within the same cluster have similar shapes and patterns, even if they are out of phase in time or have different lengths.

The class TimeSeriesKMeans class from the Sktime library enables the application of K-means clustering with a variety of distance metrics, including DTW, Euclidean, ERP, EDR, LCSS, squared, DDTW, WDTW, and WDDTW. Many of these metrics are elastic distances, making the method well-suited for time series data.

Sktime requires time series to be structured in a long format with a multi-index. The outermost index level represents the series ID, while the innermost level corresponds to the datetime.

In [22]:
# Convert data to long format with a multiindex: (building_id, timestamp)
# ==============================================================================
data_long = (
    data
    .reset_index()
    .loc[:, ['timestamp', 'building_id', 'meter_reading']]
    .set_index(['building_id', 'timestamp'])
    .sort_index(ascending=True)
)
data_long
Out[22]:
meter_reading
building_id timestamp
id_1000 2016-01-01 1012.500694
2016-01-02 1158.500099
2016-01-03 983.000099
2016-01-04 1675.750496
2016-01-05 1586.250694
... ... ...
id_999 2016-12-27 2627.250000
2016-12-28 2667.250000
2016-12-29 2495.000000
2016-12-30 2059.500000
2016-12-31 1899.500000

444324 rows × 1 columns

Warning

The next cell has a run time of approximately 10 minutes.
In [23]:
# Fit clustering model
# ==============================================================================
model = TimeSeriesKMeans(
            n_clusters   = 4,
            metric       = "dtw",
            max_iter     = 10,
            random_state = 123456
        )

model.fit(data_long)
Out[23]:
TimeSeriesKMeans(max_iter=10, n_clusters=4, random_state=123456)
Please rerun this cell to show the HTML repr or trust the notebook.
In [24]:
# Cluster prediction
# ==============================================================================
clusters = model.predict(data_long)
clusters = pd.DataFrame({
               'building_id': data_long.index.get_level_values('building_id').drop_duplicates(),
               'cluster_base_on_dtw': clusters.astype(str),
           })

# Size of each cluster
clusters['cluster_base_on_dtw'].value_counts().sort_index().to_frame(name='Number of series')
Out[24]:
Number of series
cluster_base_on_dtw
0 22
1 717
2 347
3 128
In [25]:
# Add the predicted cluster to the data frame with the building information
# ==============================================================================
data = pd.merge(
           data.reset_index(),  # To avoid losing the index
           clusters,
           on       = 'building_id',
           how      = 'left',
           validate = 'm:1'
       ).set_index('timestamp')

data.head(3)
Out[25]:
building_id meter_reading site_id primary_use square_feet air_temperature dew_temperature sea_level_pressure wind_direction wind_speed cluster_base_on_features cluster_base_on_dtw
timestamp
2016-01-01 id_1000 1012.500694 10 Education 121146 -8.9 NaN NaN NaN 3.1 5 1
2016-01-01 id_970 0.000000 9 Other 346056 7.8 NaN NaN NaN 3.1 5 1
2016-01-01 id_1066 639.270004 12 Education 55800 1.9 -1.2 1016.2 200.0 5.0 5 1

✎ Note

Sktime provides additional clustering algorithms, such as TimeSeriesKMeansTslearn, TimeSeriesKMedoids, and TimeSeriesKShapes that worth exploring. Other great libraries for clustering time series are: + DTAIDistance + Tslearn + AEON
In [26]:
# Save data for modelling to avoid repeating the previous steps
# ==============================================================================
data.to_parquet('data_modelling.parquet')

Modelling and forecasting

After establishing the three grouping criteria (building usage, clustering based on time series features, and clustering based on dynamic time warping), multi-series global models are trained. The evaluation that follows focus on determining how effectively these models can forecast daily data for the final two weeks. During this assessment, three distinct metrics are used:

  • The average value of the Mean Absolute Error (MAE) for all buildings.

  • The absolute sum of the errors, i.e. the absolute deviation between the predicted value and the actual consumption for all buildings.

  • The bias for the predictions summed over all buildings.

In addition to the lagged values of the own time series, the day of the week (sine-cosine encoded), the outdoor temperature and the wind speed are included as exogenous variables.

✎ Note

For a more detailed explanation of time series model validation, readers are encouraged to consult the Backtesting user guide. For more information about calendar features and cyclical encoding visit Calendar features and Cyclical features in time series.

To train the models, search for optimal hyperparameters, and evaluate their predictive performance, the data is divided into three separate sets: training, validation, and test.

In [27]:
# Load data for modelling
# ==============================================================================
data = pd.read_parquet('data_modelling.parquet')

# Split data in train, validation and test
# ==============================================================================
end_train = '2016-07-31 23:59:00'
end_validation = '2016-09-30 23:59:00'
data_train = data.loc[: end_train, :].copy()
data_val   = data.loc[end_train:end_validation, :].copy()
data_test  = data.loc[end_validation:, :].copy()

print(
    f"Rage of dates available : {data.index.min()} --- {data.index.max()} "
    f"(n_days={(data.index.max() - data.index.min()).days})"
)
print(
    f"  Dates for training    : {data_train.index.min()} --- {data_train.index.max()} "
    f"(n_days={(data_train.index.max() - data_train.index.min()).days})"
)
print(
    f"  Dates for validation  : {data_val.index.min()} --- {data_val.index.max()} "
    f"(n_days={(data_val.index.max() - data_val.index.min()).days})"
)
print(
    f"  Dates for test        : {data_test.index.min()} --- {data_test.index.max()} "
    f"(n_days={(data_test.index.max() - data_test.index.min()).days})"
)
Rage of dates available : 2016-01-01 00:00:00 --- 2016-12-31 00:00:00 (n_days=365)
  Dates for training    : 2016-01-01 00:00:00 --- 2016-07-31 00:00:00 (n_days=212)
  Dates for validation  : 2016-08-01 00:00:00 --- 2016-09-30 00:00:00 (n_days=60)
  Dates for test        : 2016-10-01 00:00:00 --- 2016-12-31 00:00:00 (n_days=91)
In [28]:
# Table of results for all models
# ==============================================================================
table_results = pd.DataFrame(columns=['model', 'mae', 'abs_error', 'bias', 'elapsed_time'])
table_results = table_results.set_index('model')
table_results = table_results.astype({'mae': float, 'abs_error': float, 'bias': float, 'elapsed_time': object})
In [29]:
# Function to add calendar features to a DataFrame
# ==============================================================================
def day_of_week_cyclical_encoding(data):
    """
    Calculate cyclical encoding for the day of week using a 
    DataFrame with a datetime index.
    """
    data['day_of_week'] = data.index.dayofweek
    data['sin_day_of_week'] = np.sin(2*np.pi*data['day_of_week']/7)
    data['cos_day_of_week'] = np.cos(2*np.pi*data['day_of_week']/7)
    
    return data
In [30]:
# Exogenous variables included in the model
# ==============================================================================
exog_features = ['sin_day_of_week', 'cos_day_of_week', 'air_temperature', 'wind_speed']

Global multi-series model for all buildings

A global model for all buildings is trained and tested using the skforecast class ForecasterRecursiveMultiSeries. This forecaster allows the user to pass the time series in several formats, including a dataFrame with the time series organized as columns.

For more information about using series of different lengths, or different exogenous variables for each series, see Global Forecasting Models.

In [31]:
# Reshape data to have one column per building
# ==============================================================================
data_pivot = data.pivot_table(
                 index    = 'timestamp',
                 columns  = 'building_id',
                 values   = 'meter_reading',
                 aggfunc  = 'mean' 
             )
data_pivot.columns = data_pivot.columns.astype(str)
data_pivot = data_pivot.asfreq('D').sort_index()

# Add calendar features
data_pivot = day_of_week_cyclical_encoding(data_pivot)

# Add temperature and wind speed as exogenous variables
data_pivot = data_pivot.merge(
                 data[['air_temperature', 'wind_speed']].resample('D').mean(),
                 left_index  = True,
                 right_index = True,
                 how         = 'left',
                 validate    = '1:m'
             )

data_pivot.head(3)
Out[31]:
id_1000 id_1001 id_1002 id_1003 id_1004 id_1007 id_1008 id_1009 id_1010 id_1013 ... id_993 id_995 id_996 id_998 id_999 day_of_week sin_day_of_week cos_day_of_week air_temperature wind_speed
timestamp
2016-01-01 1012.500694 142.999700 3114.500198 2294.750893 7188.003021 682.750099 2007.500893 507.630098 109.000300 996.750298 ... 27809.0 585.0 2643.0 1757.750298 2401.250595 4 -0.433884 -0.900969 6.416639 4.040115
2016-01-02 1158.500099 141.000801 4110.000000 1750.000198 8494.001007 661.750000 2025.250793 593.370001 111.000000 947.750504 ... 29067.0 639.0 2771.0 1756.000496 2244.370499 5 -0.974928 -0.222521 6.366474 4.530395
2016-01-03 983.000099 137.000300 2965.000000 1455.750492 7130.001007 713.250099 1922.000496 561.630098 111.000301 940.180004 ... 28937.0 605.0 2705.0 1716.000893 1878.750893 6 -0.781831 0.623490 6.555272 3.273064

3 rows × 1219 columns

In [32]:
# Forecaster multi-series to model all buildings at the same time
# ==============================================================================
exog_features = ['sin_day_of_week', 'cos_day_of_week', 'air_temperature', 'wind_speed']

params_lgbm = {
    'n_estimators': 500,
    'learning_rate': 0.01,
    'max_depth': 10,
    'random_state': 8520,
    'verbose': -1
}
forecaster = ForecasterRecursiveMultiSeries(
                 regressor = LGBMRegressor(**params_lgbm),
                 lags      = 31,
                 encoding  = 'ordinal_category'
             )

start = datetime.now()

cv = TimeSeriesFold(
        steps              = 7,
        initial_train_size = len(data_pivot.loc[:end_validation, :]),
        refit              = False
     )
metric, predictions = backtesting_forecaster_multiseries(
                          forecaster            = forecaster,
                          series                = data_pivot.filter(regex='^id_'), # Select only buildings
                          exog                  = data_pivot[exog_features],
                          cv                    = cv,
                          metric                = 'mean_absolute_error',
                          add_aggregated_metric = False,
                          verbose               = False,
                          show_progress         = True
                      )

end = datetime.now()

mean_metric_all_buildings = metric['mean_absolute_error'].mean()
errors_all_buildings = predictions - data_pivot.loc[predictions.index, predictions.columns]
sum_abs_errors_all_buildings = errors_all_buildings.abs().sum().sum()
sum_bias_all_buildings = errors_all_buildings.sum().sum()

table_results.loc['Global model', 'mae'] = mean_metric_all_buildings
table_results.loc['Global model', 'abs_error'] = sum_abs_errors_all_buildings
table_results.loc['Global model', 'bias'] = sum_bias_all_buildings
table_results.loc['Global model', 'elapsed_time'] = end - start

print("")
print(f"Average mean absolute error for all buildings: {mean_metric_all_buildings:.0f}")
print(f"Sum of absolute errors for all buildings (x 10,000): {sum_abs_errors_all_buildings/10000:.0f}")
print(f"Bias (x 10,000): {sum_bias_all_buildings/10000:.0f}")
Average mean absolute error for all buildings: 419
Sum of absolute errors for all buildings (x 10,000): 4684
Bias (x 10,000): 1205
In [33]:
# Plot predictions vs real value for 2 random buildings
# ==============================================================================
rng = np.random.default_rng(147)
selected_buildings = rng.choice(data['building_id'].unique(), size=2, replace=False)

fig, axs = plt.subplots(2, 1, figsize=(6, 4), sharex=True)
axs = axs.flatten()

for i, building in enumerate(selected_buildings):
    data_test.query("building_id == @building")['meter_reading'].plot(ax=axs[i], label='test')
    predictions[building].plot(ax=axs[i], label='One model per building')
    axs[i].set_title(f"Building {building}", fontsize=10)
    axs[i].set_xlabel("")
    axs[i].legend()

fig.tight_layout()
plt.show();

Global multi-series model by building usage

In [34]:
# Forecaster multi-series models for buildings grouped by primary usage
# ==============================================================================
predictions_all_buildings = []
metrics_all_buildings = []

params_lgbm = {
    'n_estimators': 500,
    'learning_rate': 0.01,
    'max_depth': 10,
    'random_state': 8520,
    'verbose': -1
}
forecaster = ForecasterRecursiveMultiSeries(
                 regressor = LGBMRegressor(**params_lgbm),
                 lags      = 31,
                 encoding  = "ordinal_category"
             )

start = datetime.now()

for primary_usage in data['primary_use'].unique():

    print(
        f"Training and testing model for primary usage: {primary_usage} "
        f"(n = {data[data['primary_use'] == primary_usage]['building_id'].nunique()})"
    )

    # Create a subset based on primary use clusters
    data_subset = data[data['primary_use'] == primary_usage]
    data_subset = data_subset.pivot_table(
                      index   = 'timestamp',
                      columns = 'building_id',
                      values  = 'meter_reading',
                      aggfunc = 'mean'
                  )
    data_subset.columns = data_subset.columns.astype(str)
    data_subset = data_subset.asfreq('D').sort_index()

    # Add calendar features
    data_subset = day_of_week_cyclical_encoding(data_subset)
    data_subset = data_subset.merge(
                      data[['air_temperature', 'wind_speed']].resample('D').mean(),
                      left_index  = True,
                      right_index = True,
                      how         = 'left',
                      validate    = '1:m'
                  )

    cv = TimeSeriesFold(
            steps              = 7,
            initial_train_size = len(data_subset.loc[:end_validation, :]),
            refit              = False
         )
    metric, predictions = backtesting_forecaster_multiseries(
                              forecaster            = forecaster,
                              series                = data_subset.filter(regex='^id_'),
                              exog                  = data_subset[['sin_day_of_week', 'cos_day_of_week']],
                              cv                    = cv,
                              metric                = 'mean_absolute_error',
                              add_aggregated_metric = False,
                              verbose               = False,
                              show_progress         = False
                          )
    predictions_all_buildings.append(predictions)
    metrics_all_buildings.append(metric)

end = datetime.now()

predictions_all_buildings = pd.concat(predictions_all_buildings, axis=1)
metrics_all_buildings = pd.concat(metrics_all_buildings, axis=0)
errors_all_buildings = (
    predictions_all_buildings - data_pivot.loc[predictions_all_buildings.index, predictions_all_buildings.columns]
)
mean_metric_all_buildings = metrics_all_buildings['mean_absolute_error'].mean()
sum_abs_errors_all_buildings = errors_all_buildings.abs().sum().sum()
sum_bias_all_buildings = errors_all_buildings.sum().sum()

table_results.loc['Global model per primary usage', 'mae'] = mean_metric_all_buildings
table_results.loc['Global model per primary usage', 'abs_error'] = sum_abs_errors_all_buildings
table_results.loc['Global model per primary usage', 'bias'] = sum_bias_all_buildings
table_results.loc['Global model per primary usage', 'elapsed_time'] = end - start

print("")
print(f"Average mean absolute error for all buildings: {mean_metric_all_buildings:.0f}")
print(f"Sum of absolute errors for all buildings (x 10,000): {sum_abs_errors_all_buildings/10000:.0f}")
print(f"Bias (x 10,000): {sum_bias_all_buildings/10000:.0f}")
Training and testing model for primary usage: Education (n = 463)
Training and testing model for primary usage: Other (n = 104)
Training and testing model for primary usage: Entertainment/public assembly (n = 157)
Training and testing model for primary usage: Office (n = 228)
Training and testing model for primary usage: Lodging/residential (n = 112)
Training and testing model for primary usage: Public services (n = 150)

Average mean absolute error for all buildings: 405
Sum of absolute errors for all buildings (x 10,000): 4523
Bias (x 10,000): 1149
In [35]:
# Add predictions to the already existing plot (not showing the plot yet)
# ==============================================================================
for i, building in enumerate(selected_buildings):
    predictions_all_buildings[building].plot(ax=axs[i], label='Global model per primary usage')
    axs[i].legend()

Global multi-series model by cluster (features)

A global model for each cluster based on time series features is trained and tested.

In [36]:
# Forecaster multi-series models for buildings grouped by time series features
# ==============================================================================
predictions_all_buildings = []
metrics_all_buildings = []

params_lgbm = {
    'n_estimators': 500,
    'learning_rate': 0.01,
    'max_depth': 10,
    'random_state': 8520,
    'verbose': -1
}
forecaster = ForecasterRecursiveMultiSeries(
                 regressor = LGBMRegressor(**params_lgbm),
                 lags      = 31,
                 encoding  = "ordinal_category"
             )

start = datetime.now()

for cluster in data['cluster_base_on_features'].unique():
    
    print(
        f"Training and testing model for cluster: {cluster} "
        f"(n = {data[data['cluster_base_on_features'] == cluster]['building_id'].nunique()})"
    )

    # Create subset based on features clusters
    data_subset = data[data['cluster_base_on_features'] == cluster]
    data_subset = data_subset.pivot_table(
                        index   = 'timestamp',
                        columns = 'building_id',
                        values  = 'meter_reading',
                        aggfunc = 'mean'
                  )
    data_subset.columns = data_subset.columns.astype(str)
    data_subset = data_subset.asfreq('D').sort_index()

    # Add calendar features
    data_subset = day_of_week_cyclical_encoding(data_subset)
    data_subset = data_subset.merge(
                      data[['air_temperature', 'wind_speed']].resample('D').mean(),
                      left_index  = True,
                      right_index = True,
                      how         = 'left',
                      validate    = '1:m'
                  )
    cv = TimeSeriesFold(
            steps              = 7,
            initial_train_size = len(data_subset.loc[:end_validation, :]),
            refit              = False
         )
    metric, predictions = backtesting_forecaster_multiseries(
                              forecaster            = forecaster,
                              series                = data_subset.filter(regex='^id_'),
                              exog                  = data_subset[['sin_day_of_week', 'cos_day_of_week']],
                              cv                    = cv,
                              metric                = 'mean_absolute_error',
                              add_aggregated_metric = False,
                              verbose               = False,
                              show_progress         = False
                          )
    predictions_all_buildings.append(predictions)
    metrics_all_buildings.append(metric)

end = datetime.now()

predictions_all_buildings = pd.concat(predictions_all_buildings, axis=1)
metrics_all_buildings = pd.concat(metrics_all_buildings, axis=0)
errors_all_buildings = (
    predictions_all_buildings - data_pivot.loc[predictions_all_buildings.index, predictions_all_buildings.columns]
)
mean_metric_all_buildings = metrics_all_buildings['mean_absolute_error'].mean()
sum_abs_errors_all_buildings = errors_all_buildings.abs().sum().sum()
sum_bias_all_buildings = errors_all_buildings.sum().sum()

table_results.loc['Global model per cluster (features)', 'mae'] = mean_metric_all_buildings
table_results.loc['Global model per cluster (features)', 'abs_error'] = sum_abs_errors_all_buildings
table_results.loc['Global model per cluster (features)', 'bias'] = sum_bias_all_buildings
table_results.loc['Global model per cluster (features)', 'elapsed_time'] = end - start

print("")
print(f"Average mean absolute error for all buildings: {mean_metric_all_buildings:.0f}")
print(f"Sum of absolute errors for all buildings (x 10,000): {sum_abs_errors_all_buildings/10000:.0f}")
print(f"Bias (x 10,000): {sum_bias_all_buildings/10000:.0f}")
Training and testing model for cluster: 5 (n = 653)
Training and testing model for cluster: 0 (n = 81)
Training and testing model for cluster: 1 (n = 465)
Training and testing model for cluster: Other (n = 15)

Average mean absolute error for all buildings: 383
Sum of absolute errors for all buildings (x 10,000): 4274
Bias (x 10,000): 1072
In [37]:
# Add predictions to the already existing plot (not showing the plot yet)
# ==============================================================================
for i, building in enumerate(selected_buildings):
    predictions_all_buildings[building].plot(ax=axs[i], label='Global model per cluster (features)')
    axs[i].legend()

Global multi-series model by cluster (elastic distance DTW)

A global model for each cluster based on elastic distance (DTW) is trained and tested.

In [38]:
# Forecaster multi-series models for buildings grouped by DTW
# ==============================================================================
predictions_all_buildings = []
metrics_all_buildings = []

params_lgbm = {
    'n_estimators': 500,
    'learning_rate': 0.01,
    'max_depth': 10,
    'random_state': 8520,
    'verbose': -1
}
forecaster = ForecasterRecursiveMultiSeries(
                 regressor = LGBMRegressor(**params_lgbm),
                 lags      = 31,
                 encoding  = 'ordinal_category'
             )

start = datetime.now()

for cluster in data['cluster_base_on_dtw'].unique():

    print(
        f"Training and testing model for cluster: {cluster} "
        f"(n = {data[data['cluster_base_on_dtw'] == cluster]['building_id'].nunique()})"
    )

    # Create subset based on DTW clusters
    data_subset = data[data['cluster_base_on_dtw'] == cluster]
    data_subset = data_subset.pivot_table(
                      index   = 'timestamp',
                      columns = 'building_id',
                      values  = 'meter_reading',
                      aggfunc = 'mean'
                  )
    data_subset.columns = data_subset.columns.astype(str)
    data_subset = data_subset.asfreq('D').sort_index()

    # Add calendar features
    data_subset = day_of_week_cyclical_encoding(data_subset)
    data_subset = data_subset.merge(
                      data[['air_temperature', 'wind_speed']].resample('D').mean(),
                      left_index  = True,
                      right_index = True,
                      how         = 'left',
                      validate    = '1:m'
                  )
    cv = TimeSeriesFold(
            steps              = 7,
            initial_train_size = len(data_subset.loc[:end_validation, :]),
            refit              = False
        )
    metric, predictions = backtesting_forecaster_multiseries(
                              forecaster            = forecaster,
                              series                = data_subset.filter(regex='^id_'),
                              exog                  = data_subset[['sin_day_of_week', 'cos_day_of_week']],
                              cv                    = cv,
                              metric                = 'mean_absolute_error',
                              add_aggregated_metric = False,
                              verbose               = False,
                              show_progress         = False
                          )
    predictions_all_buildings.append(predictions)
    metrics_all_buildings.append(metric)

end = datetime.now()

predictions_all_buildings = pd.concat(predictions_all_buildings, axis=1)
metrics_all_buildings = pd.concat(metrics_all_buildings, axis=0)
errors_all_buildings = (
    predictions_all_buildings - data_pivot.loc[predictions_all_buildings.index, predictions_all_buildings.columns]
)
mean_metric_all_buildings = metrics_all_buildings['mean_absolute_error'].mean()
sum_abs_errors_all_buildings = errors_all_buildings.abs().sum().sum()
sum_bias_all_buildings = errors_all_buildings.sum().sum()

table_results.loc['Global model per cluster (DTW)', 'mae'] = mean_metric_all_buildings
table_results.loc['Global model per cluster (DTW)', 'abs_error'] = sum_abs_errors_all_buildings
table_results.loc['Global model per cluster (DTW)', 'bias'] = sum_bias_all_buildings
table_results.loc['Global model per cluster (DTW)', 'elapsed_time'] = end - start

print("")
print(f"Average mean absolute error for all buildings: {mean_metric_all_buildings:.0f}")
print(f"Sum of absolute errors for all buildings (x 10,000): {sum_abs_errors_all_buildings/10000:.0f}")
print(f"Bias (x 10,000): {sum_bias_all_buildings/10000:.0f}")
Training and testing model for cluster: 1 (n = 717)
Training and testing model for cluster: 2 (n = 347)
Training and testing model for cluster: 3 (n = 128)
Training and testing model for cluster: 0 (n = 22)

Average mean absolute error for all buildings: 377
Sum of absolute errors for all buildings (x 10,000): 4207
Bias (x 10,000): 1044
In [39]:
# Add predictions to the already existing plot (not showing the plot yet)
# ==============================================================================
for i, building in enumerate(selected_buildings):
    predictions_all_buildings[building].plot(ax=axs[i], label='Global model per cluster (DTW)')
    axs[i].legend()

Results

In [40]:
# Table of results
# ==============================================================================
table_results['elapsed_time'] = table_results['elapsed_time'].astype(str).str[:7]
table_results.style.highlight_min(axis=0, color='green').format(precision=0)
Out[40]:
  mae abs_error bias elapsed_time
model        
Global model 419 46842049 12052318 0:00:26
Global model per primary usage 405 45233247 11487379 0:00:31
Global model per cluster (features) 383 42744087 10720536 0:00:33
Global model per cluster (DTW) 377 42073831 10441483 0:00:31
In [41]:
# Plot predictions vs real value for 2 random buildings for all models
# ==============================================================================
handles, labels = axs[0].get_legend_handles_labels()
fig.legend(handles, labels, loc='upper center', bbox_to_anchor=(0.5, 0.05), ncol=2)
for ax in axs:
    ax.legend().remove()
fig
Out[41]:

Conclusion

This document has shown that clustering time series can be a valuable tool for improving forecasting models. By grouping time series with similar patterns, the impact of noise can be reduced, leading to more accurate predictions. The results of this study demonstrate that the choice of clustering algorithm can significantly impact the performance of the forecasting model. In this case, the dynamic time warping (DTW) algorithm outperformed the feature-driven clustering method and the primary use of the building as a grouping criterion.

Further research

This analysis has provided interesting insights into the effectiveness combining clustering and forecasting models. Possible next steps could include:

  • Manually review buildings with high error. Identify if there is a group for which the model is not performing well.

  • Add more exogenous features: see skforecast's user guide Calendar Features for calendar and sunlight features that typically affect energy consumption.

  • Optimize lags and hyperparameters: Use Grid, Random or Bayesian search to find the best model configuration.

  • Try other machine learning algorithms.

Session information

In [42]:
import session_info
session_info.show(html=False)
-----
lightgbm            4.5.0
matplotlib          3.9.2
numpy               2.0.0
pandas              2.2.3
session_info        1.0.0
skforecast          0.14.0
sklearn             1.5.1
sktime              0.35.0
tsfresh             0.20.3
-----
IPython             8.27.0
jupyter_client      8.6.3
jupyter_core        5.7.2
notebook            6.4.12
-----
Python 3.12.5 | packaged by Anaconda, Inc. | (main, Sep 12 2024, 18:27:27) [GCC 11.2.0]
Linux-5.15.0-1075-aws-x86_64-with-glibc2.31
-----
Session information updated at 2025-01-29 09:26

Citation

How to cite this document

If you use this document or any part of it, please acknowledge the source, thank you!

Clustering Time Series to Improve Forecasting Models by Joaquín Amat Rodrigo and Javier Escobar Ortiz, available under a CC BY-NC-SA 4.0 at https://www.cienciadedatos.net/documentos/py64-clustering-time-series-forecasting.html

How to cite skforecast

If you use skforecast for a publication, we would appreciate it if you cite the published software.

Zenodo:

Amat Rodrigo, Joaquin, & Escobar Ortiz, Javier. (2024). skforecast (v0.14.0). Zenodo. https://doi.org/10.5281/zenodo.8382788

APA:

Amat Rodrigo, J., & Escobar Ortiz, J. (2024). skforecast (Version 0.14.0) [Computer software]. https://doi.org/10.5281/zenodo.8382788

BibTeX:

@software{skforecast, author = {Amat Rodrigo, Joaquin and Escobar Ortiz, Javier}, title = {skforecast}, version = {0.14.0}, month = {11}, year = {2024}, license = {BSD-3-Clause}, url = {https://skforecast.org/}, doi = {10.5281/zenodo.8382788} }


Did you like the article? Your support is important

Your contribution will help me to continue generating free educational content. Many thanks! 😊

Become a GitHub Sponsor Become a GitHub Sponsor

Creative Commons Licence

This work by Joaquín Amat Rodrigo and Javier Escobar Ortiz is licensed under a Attribution-NonCommercial-ShareAlike 4.0 International.

Allowed:

  • Share: copy and redistribute the material in any medium or format.

  • Adapt: remix, transform, and build upon the material.

Under the following terms:

  • Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

  • NonCommercial: You may not use the material for commercial purposes.

  • ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.