Global Forecasting Models: Comparative Analysis of Single and Multi-Series Forecasting Modeling

If you like  Skforecast ,  help us giving a star on   GitHub! ⭐️

Global Forecasting Models: Comparative Analysis of Single and Multi-Series Forecasting Modeling

Joaquín Amat Rodrigo, Javier Escobar Ortiz
December, 2023 (last update June 2024)

Introduction

When faced with scenarios involving the prediction of hundreds or thousands of time series, a crucial decision arises: should one develop individual models for each series, or should one use a unified model to handle them all at once?

In Single-Series Modeling (Local Forecasting Model), a separate predictive model is created for each time series. While this method provides a comprehensive understanding of each series, its scalability can be challenged by the need to create and maintain hundreds or thousands of models.

Multi-Series Modeling (Global Forecasting Model) involves building a single predictive model that considers all time series simultaneously. It attempts to capture the core patterns that govern the series, thereby mitigating the potential noise that each series might introduce. This approach is computationally efficient, easy to maintain, and can yield more robust generalizations across time series, albeit potentially at the cost of sacrificing some individual insights.

To help understand the benefits of each forecasting strategy, this document focuses on examining and comparing the results obtained when predicting the energy consumption of over a thousand buildings using the ASHRAE - Great Energy Predictor III dataset available on Kaggle. Efficient management of energy consumption in residential buildings is essential for sustainable urban development, therefore accurate forecasting of energy consumption can play a crucial role in optimizing resource allocation and promoting energy conservation initiatives.

Global forecasting modeling assumes that series that behave similarly can benefit from being modeled together. Although the primary use of each building is available in the dataset, it may not reflect groups with similar patterns of energy use, so additional groups are created using clustering methods. A total of 5 experiments are performed:

  • Modelling each building individually.

  • Modelling all buildings together with a unique global model strategy.

  • Modelling groups of buildings based on their primary use (a global model per primary use).

  • Modelling groups of buildings based on time series feature clustering (a global model per cluster).

  • Modelling groups of buildings based on Dynamic Time Warping (DTW) clustering (a global model per cluster).

Energy consumption is predicted for each building on a weekly basis (daily resolution) for 13 weeks following each strategy. The effectiveness of each approach is evaluated using several performance metrics, including Mean Absolute Error (MAE), Absolute Error, and Bias. The goal of the study is to identify the most effective approach both for overall predictions and for a specific group of buildings.

✎ Note

Read the Conclusions First?

If you prefer a quick overview before diving into the details, consider starting with the Conclusions section. This approach allows you to tailor your reading to your interests and time constraints, and summarizes our findings and insights. After reading the conclusions, you may find certain sections particularly relevant or interesting. Feel free to navigate directly to those parts of the article for a deeper understanding.

✎ Note

This document focuses on one use case. For a detailed description of the many features that skforecast provides for building global forecasting models, see Global Forecasting Models: Modeling Multiple Time Series with Machine Learning.

Libraries

Libraries used in this document.

In [49]:
# Data management
# ==============================================================================
import numpy as np
import pandas as pd
from datetime import datetime
from skforecast.datasets import fetch_dataset

# Plots
# ==============================================================================
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
plt.style.use('seaborn-v0_8-darkgrid')
from tqdm.auto import tqdm

# Forecasting
# ==============================================================================
import lightgbm
from lightgbm import LGBMRegressor
from sklearn.preprocessing import StandardScaler
import skforecast
from skforecast.ForecasterAutoreg import ForecasterAutoreg
from skforecast.model_selection import backtesting_forecaster
from skforecast.ForecasterAutoregMultiSeries import ForecasterAutoregMultiSeries
from skforecast.model_selection_multiseries import backtesting_forecaster_multiseries

# Feature extraction
# ==============================================================================
import tsfresh
from tsfresh import extract_features
from tsfresh import select_features
from tsfresh.feature_extraction.settings import ComprehensiveFCParameters
from tsfresh.feature_extraction.settings import from_columns

# Clustering
# ==============================================================================
import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import tslearn
from tslearn.clustering import TimeSeriesKMeans

# Configuración
# ==============================================================================
import warnings
warnings.filterwarnings('once')

print('Versión skforecast:', skforecast.__version__)
print('Versión tsfresh:', tsfresh.__version__)
print('Versión tslearn:', tslearn.__version__)
print('Versión sklearn:', sklearn.__version__)
Versión skforecast: 0.12.1
Versión tsfresh: 0.20.2
Versión tslearn: 0.6.3
Versión sklearn: 1.4.2

Data

Data used in this document has been obtained from the Kaggle competition Addison Howard, Chris Balbach, Clayton Miller, Jeff Haberl, Krishnan Gowri, Sohier Dane. (2019). ASHRAE - Great Energy Predictor III. Kaggle.

Three files are used to create the modeling data set:

  • weather_train.csv and weather_test.csv: These files contain weather-related data for each building, including outdoor air temperature, dew temperature, relative humidity, and other weather parameters. The weather data is crucial for understanding the impact of external conditions on building energy usage.

  • building_metadata.csv: This file provides metadata for each building in the dataset, such as building type, primary use, square footage, floor count, and year built. This information helps in understanding the characteristics of the buildings and their potential influence on energy consumption patterns.

  • train.csv: the train dataset contains the target variable, i.e., the energy consumption data for each building, along with the timestamps for the energy consumption readings. It also includes the corresponding building and weather IDs to link the information across different datasets.

The three files have been preprocessed to remove buildings with less than 85% of values not equal to NaN or zero, to use only the electricity meter, and to aggregate the data to a daily frequency. See the Download and preprocess raw data section for more details.

In [50]:
# Load preprocessed data
# ==============================================================================
url = "https://drive.google.com/file/d/1fMsYjfhrFLmeFjKG3jenXjDa5s984ThC/view?usp=sharing"
file_id = url.split('/')[-2]
url = 'https://drive.google.com/uc?id=' + file_id
data = pd.read_parquet(url)
print("Data shape:", data.shape)
Data shape: (444324, 10)
In [3]:
# Fill missing values of air_temperature and wind_speed using fowrard and backward fill
# ==============================================================================
# Imputation must be done separately for each building
data = data.sort_values(by=['building_id', 'timestamp'])
data['air_temperature'] = data.groupby('building_id')['air_temperature'].ffill().bfill()
data['wind_speed'] = data.groupby('building_id')['wind_speed'].ffill().bfill()
data = data.sort_index()

print(
    f"Rage of dates available : {data.index.min()} --- {data.index.max()}  "
    f"(n_days={(data.index.max() - data.index.min()).days})"
)
Rage of dates available : 2016-01-01 00:00:00 --- 2016-12-31 00:00:00  (n_days=365)

Exploratory data analysis

Building primary use

One of the key attributes associated with each building is its designated use. This feature may play a crucial role in influencing the energy consumption pattern, as distinct uses can significantly impact both the quantity and timing of energy consumption.

In [4]:
# Number of buildings and type of buildings based on primary use
# ==============================================================================
n_building = data['building_id'].nunique()
n_type_building = data['primary_use'].nunique()
range_datetime = [data.index.min(), data.index.max()]

print(f"Number of buildings: {n_building}")
print(f"Number of building types: {n_type_building}")
display(data.drop_duplicates(subset=['building_id'])['primary_use'].value_counts())
Number of buildings: 1214
Number of building types: 16
primary_use
Education                        463
Office                           228
Entertainment/public assembly    157
Public services                  150
Lodging/residential              112
Healthcare                        19
Other                             19
Parking                           13
Warehouse/storage                 12
Manufacturing/industrial          10
Services                           9
Food sales and service             5
Retail                             5
Technology/science                 5
Utility                            4
Religious worship                  3
Name: count, dtype: int64

For certain primary use categories, there is a limited number of buildings within the dataset. To streamline the analysis, categories with fewer than 100 buildings are grouped into the "Other" category.

In [5]:
# Types of buildings (primary use) with less than 100 samples are grouped as "Other".
# ==============================================================================
infrequent_categories = (
    data
    .drop_duplicates(subset=['building_id'])['primary_use']
    .value_counts()
    .loc[lambda x: x < 100]
    .index
    .tolist()
)
print(f"Infequent categories:")
print("======================")
print('\n'.join(infrequent_categories))

data['primary_use'] = np.where(
    data['primary_use'].isin(infrequent_categories),
    'Other',
    data['primary_use']
)
Infequent categories:
======================
Healthcare
Other
Parking
Warehouse/storage
Manufacturing/industrial
Services
Food sales and service
Retail
Technology/science
Utility
Religious worship

Next, a plot is created showing the energy consumption for a randomly selected building within each respective category, and a plot of all available time series for each category.

In [6]:
# Time series for 1 random selected building per group
# ==============================================================================
fig, axs = plt.subplots(nrows=3, ncols=2, figsize=(8, 5.5), sharex=True, sharey=False)
sample_ids = (
    data
    .groupby('primary_use')['building_id']
    .apply(lambda x: x.sample(1, random_state=333))
    .tolist()
)
axs = axs.flatten()

for i, building_id in enumerate(sample_ids):
    data_sample = data[data['building_id'] == building_id]
    building_type = data_sample['primary_use'].unique()[0]
    data_sample.plot(
        y        = 'meter_reading',
        ax       = axs[i],
        legend   = False,
        title    = f"Building: {building_id}, type: {building_type}",
    )
    axs[i].set_xlabel("")
    axs[i].set_ylabel("")
    # Scientific notation for y axis
    axs[i].ticklabel_format(axis='y', style='sci', scilimits=(0, 0))
    axs[i].title.set_size(10)

fig.suptitle('Energy consumption for 6 random buildings', fontsize=16)
fig.tight_layout()
plt.show()
In [7]:
# Energy consumption by type of building (one gray line per building)
# ==============================================================================
fig, axs = plt.subplots(nrows=3, ncols=2, figsize=(8, 5.5), sharex=True, sharey=False)
axs = axs.flatten()

for i, building_type in enumerate(data['primary_use'].unique()):
    data_sample = data[data['primary_use'] == building_type]
    data_sample = data_sample.pivot_table(
                      index   = 'timestamp',
                      columns = 'building_id',
                      values  = 'meter_reading',
                      aggfunc = 'mean'
                  )
    data_sample.plot(
        legend = False,
        title  = f"Type: {building_type}",
        color  = 'gray',
        alpha  = 0.2,
        ax     = axs[i]
    )
    data_sample.mean(axis=1).plot(
        ax     = axs[i],
        color  = 'blue',
    )
    axs[i].set_xlabel("")
    axs[i].set_ylabel("")
    # Scientific notation for y axis
    axs[i].ticklabel_format(axis='y', style='sci', scilimits=(0,0))

    # Limit the axis to 5 times the maximum mean value to improve visualization
    axs[i].set_ylim(
        bottom = 0,
        top    = 5 * data_sample.mean(axis=1).max()
    )

fig.tight_layout()
fig.suptitle('Energy consumption by type of building', fontsize=16)
plt.subplots_adjust(top=0.9)
plt.show()

The graph reveals that there is significant variability in consumption patterns among buildings of the same purpose. This suggests that there may be scope for improving the criteria by which buildings are grouped.

Clustering of buildings by the energy consumption patterns

The idea behind modeling multiple series at the same time is to be able to capture the main patterns that govern the series, thereby reducing the impact of the potential noise that each series may have. This means that series that behave similarly may benefit from being modeled together. One way to identify potential groups of series is to perform a clustering study prior to modeling. If clear groups are identified as a result of clustering, it is appropriate to model each of them separately.

Clustering is an unsupervised analysis technique that groups a set of observations into clusters that contain observations that are considered homogeneous, while observations in different clusters are considered heterogeneous. Algorithms that cluster time series can be divided into two groups: those that use a transformation to create features prior to clustering (feature-driven time series clustering), and those that work directly on the time series (elastic distance measures).

  • Feature-driven time series clustering: Features describing structural characteristics are extracted from each time series, then these features are fed into arbitrary clustering algorithms. This features are obtained by applying statistical operations that best capture the underlying characteristics: trend, seasonality, periodicity, serial correlation, skewness, kurtosis, chaos, nonlinearity, and self-similarity.

  • Elastic distance measures: This approach works directly on the time series, adjusting or "realigning" the series in comparison to each other. The best known of this family of measures is Dynamic Time Warping (DTW).

For a more detailed review of time series clustering, see A review and evaluation of elastic distance functions for time series clustering.

Clustering based on time series features

Feature creation

tsfresh is a powerful Python library for feature engineering from time-series and sequential data, including statistical measures, Fourier coefficients, and various other time-domain and frequency-domain features. It provides a systematic approach to automate the calculation of features and select the most informative ones.

To start, the default configuration of tsfresh is used, which calculates all available features. The easiest way to access this configuration is to use the functions provided by the ComprehensiveFCParameters class. It creates a dictionary which is a mapping from string (the name of each feature) to a list of dictionary parameters to be used when the function with that name is called.

In [8]:
# Default features and settings created by tsfresh
# ==============================================================================
default_features = ComprehensiveFCParameters()
print("Name of the features extracted by tsfresh:")
print("=========================================")
print('\n'.join(list(default_features)))
Name of the features extracted by tsfresh:
=========================================
variance_larger_than_standard_deviation
has_duplicate_max
has_duplicate_min
has_duplicate
sum_values
abs_energy
mean_abs_change
mean_change
mean_second_derivative_central
median
mean
length
standard_deviation
variation_coefficient
variance
skewness
kurtosis
root_mean_square
absolute_sum_of_changes
longest_strike_below_mean
longest_strike_above_mean
count_above_mean
count_below_mean
last_location_of_maximum
first_location_of_maximum
last_location_of_minimum
first_location_of_minimum
percentage_of_reoccurring_values_to_all_values
percentage_of_reoccurring_datapoints_to_all_datapoints
sum_of_reoccurring_values
sum_of_reoccurring_data_points
ratio_value_number_to_time_series_length
sample_entropy
maximum
absolute_maximum
minimum
benford_correlation
time_reversal_asymmetry_statistic
c3
cid_ce
symmetry_looking
large_standard_deviation
quantile
autocorrelation
agg_autocorrelation
partial_autocorrelation
number_cwt_peaks
number_peaks
binned_entropy
index_mass_quantile
cwt_coefficients
spkt_welch_density
ar_coefficient
change_quantiles
fft_coefficient
fft_aggregated
value_count
range_count
approximate_entropy
friedrich_coefficients
max_langevin_fixed_point
linear_trend
agg_linear_trend
augmented_dickey_fuller
number_crossing_m
energy_ratio_by_chunks
ratio_beyond_r_sigma
linear_trend_timewise
count_above
count_below
lempel_ziv_complexity
fourier_entropy
permutation_entropy
query_similarity_count
mean_n_absolute_max

Many of these features are calculated using different values for their arguments.

In [9]:
# Default configuration for feature "partial_autocorrelation"
# ==============================================================================
default_features['partial_autocorrelation']
Out[9]:
[{'lag': 0},
 {'lag': 1},
 {'lag': 2},
 {'lag': 3},
 {'lag': 4},
 {'lag': 5},
 {'lag': 6},
 {'lag': 7},
 {'lag': 8},
 {'lag': 9}]

To access the detailed view of each feature and the parameter values included in the default configuration, use the following code:

In [10]:
# Default configuration for all extracted feature
# ==============================================================================
# from pprint import pprint
# pprint(default_features)

Once the configuration of features has been defined, the next step is to extract them from the time series. The function extract_features() of tsfresh is used for this purpose. This function receives as input the time series and the configuration of the features to be extracted. The output is a dataframe with the extracted features.

In [11]:
# Feature extraction
# ==============================================================================
ts_features = extract_features(
    timeseries_container  = data[['building_id', 'meter_reading']].reset_index(),
    column_id             = "building_id",
    column_sort           = "timestamp",
    column_value          = "meter_reading",
    default_fc_parameters = default_features,
    impute_function       = tsfresh.utilities.dataframe_functions.impute,
    n_jobs                = 4
)

print("Shape of ts_features:", ts_features.shape)
ts_features.head(3)
Feature Extraction: 100%|██████████| 20/20 [01:10<00:00,  3.53s/it]
Shape of ts_features: (1214, 783)
Out[11]:
meter_reading__variance_larger_than_standard_deviation meter_reading__has_duplicate_max meter_reading__has_duplicate_min meter_reading__has_duplicate meter_reading__sum_values meter_reading__abs_energy meter_reading__mean_abs_change meter_reading__mean_change meter_reading__mean_second_derivative_central meter_reading__median ... meter_reading__fourier_entropy__bins_5 meter_reading__fourier_entropy__bins_10 meter_reading__fourier_entropy__bins_100 meter_reading__permutation_entropy__dimension_3__tau_1 meter_reading__permutation_entropy__dimension_4__tau_1 meter_reading__permutation_entropy__dimension_5__tau_1 meter_reading__permutation_entropy__dimension_6__tau_1 meter_reading__permutation_entropy__dimension_7__tau_1 meter_reading__query_similarity_count__query_None__threshold_0.0 meter_reading__mean_n_absolute_max__number_of_maxima_7
id_1000 1.0 0.0 0.0 1.0 5.424863e+05 8.444178e+08 235.994991 0.332875 -0.255837 1515.625595 ... 0.170467 0.260704 0.780228 1.515622 2.520615 3.496026 4.341908 5.045132 0.0 2144.607384
id_1001 1.0 0.0 0.0 1.0 2.543133e+05 4.114736e+08 212.298183 -0.049314 0.026097 304.500150 ... 0.090729 0.090729 0.700821 1.752526 3.047478 4.335947 5.282052 5.727711 0.0 2871.142857
id_1002 1.0 0.0 0.0 1.0 1.409981e+06 6.255919e+09 960.832035 3.226027 -0.308379 3649.003765 ... 0.488746 0.765782 2.650153 1.747629 3.035159 4.431478 5.394148 5.780678 0.0 7882.572432

3 rows × 783 columns

As a result of the extraction process, 783 features have been created for each time series (building_id in this use case). The returned dataframe has as its index the column specified in the column_id argument of extract_features.

The default extraction of tsfresh produces a huge number of features. However, only a few of these may be of interest in each use case. To select the most relevant ones, tsfresh includes an automated selection process based on hypothesis tests (FeatuRE Extraction based on Scalable Hypothesis tests). In this process, the features are individually and independently evaluated for their significance in predicting the target under investigation.

Warning

The selection process used by tsfresh is based on the significance of each feature in accurately predicting the target variable. To perform this process, a target variable is required – for example, the type of building associated with a given time series. However, there are cases where a target variable is not readily available. In such cases, alternative strategies can be used:
  • Instead of calculating all the standard features, focus only on those that are likely to be relevant to the specific application, based on expert knowledge.
  • Exclude features based on criteria such as low variance and high correlation. This helps refine the set of features to consider, focusing on those that provide the most meaningful information for analysis.
  • Use techniques such as PCA, t-SNE, or auto-encoders to reduce dimensionality.

As feature selection is a critical step that affects the information available for the next steps of the analysis, it is highly recommended to understand the parameters that control the behavior of select_features().

  • X: dataframe with the features created with extract_features. It can contain both binary or real-valued features at the same time.
  • y: Target vector which is needed to test which features are relevant. Can be binary or real-valued.
  • test_for_binary_target_binary_feature: Which test to be used when the target is binary and feature is binary. Currently unused, default value 'fisher'.
  • test_for_binary_target_real_feature: Which test to be used when the target is binary and feature is continious (real value). Default value 'mann'.
  • test_for_real_target_binary_feature: Which test to be used when the target is continious (real value) and feature is binary.Currently unused, default value 'mann'.
  • test_for_real_target_real_feature: Which test to be used when the target is continious (real value) and feature is continious (real value) .Currently unused, default value 'kendall'.
  • fdr_level: The FDR level that should be respected, this is the theoretical expected percentage of irrelevant features among all created features. Default value 0.05.
  • hypotheses_independent: Can the significance of the features be assumed to be independent? Normally, this should be set to False as the features are never independent (e.g. mean and median). Default value False.
  • n_jobs: Number of processes to use during the p-value calculation. Default value 4.
  • show_warnings: Show warnings during the p-value calculation (needed for debugging of calculators). Default value False.
  • chunksize: The size of one chunk that is submitted to the worker process for the parallelisation. Where one chunk is defined as the data for one feature. If you set the chunksize to 10, then it means that one task is to filter 10 features. If it is set it to None, depending on distributor, heuristics are used to find the optimal chunksize. If you get out of memory exceptions, you can try it with the dask distributor and a smaller chunksize. Default value None.
  • ml_task: The intended machine learning task. Either ‘classification’, ‘regression’ or ‘auto’. Defaults to ‘auto’, meaning the intended task is inferred from y. If y has a boolean, integer or object dtype, the task is assumed to be classification, else regression. Default value 'auto'.
  • multiclass: Whether the problem is multiclass classification. This modifies the way in which features are selected. Multiclass requires the features to be statistically significant for predicting n_significant features. Default value False.
  • n_significant: The number of classes for which features should be statistically significant predictors to be regarded as ‘relevant’. Only specify when multiclass=True. Default value 1.

Warning

The order of the index returned in the features DataFrame is not the same as the order of the columns in the original DataFrame. Therefore, the data passed to the argument y in select_features must be sorted to ensure the correct association between the features and the target variable.
In [12]:
# Select relevant features
# ==============================================================================
target = (
    data[['building_id', 'primary_use']]
    .drop_duplicates()
    .set_index('building_id')
    .loc[ts_features.index, :]
    ['primary_use']
)

assert ts_features.index.equals(target.index)

ts_features_selected = select_features(
    X = ts_features,
    y = target,
    fdr_level = 0.001 # Very restrictive filter
)

ts_features_selected.index.name = 'building_id'
print(f"Number of features before selection: {ts_features.shape[1]}")
print(f"Number of features after selection: {ts_features_selected.shape[1]}")
ts_features_selected.head()
Number of features before selection: 783
Number of features after selection: 457
Out[12]:
meter_reading__fft_coefficient__attr_"real"__coeff_52 meter_reading__fft_coefficient__attr_"real"__coeff_54 meter_reading__fft_coefficient__attr_"abs"__coeff_52 meter_reading__fft_coefficient__attr_"abs"__coeff_54 meter_reading__fft_coefficient__attr_"imag"__coeff_52 meter_reading__fft_coefficient__attr_"real"__coeff_53 meter_reading__cwt_coefficients__coeff_13__w_2__widths_(2, 5, 10, 20) meter_reading__fft_coefficient__attr_"abs"__coeff_51 meter_reading__partial_autocorrelation__lag_8 meter_reading__fft_coefficient__attr_"real"__coeff_51 ... meter_reading__fft_aggregated__aggtype_"skew" meter_reading__fft_coefficient__attr_"real"__coeff_70 meter_reading__fft_coefficient__attr_"imag"__coeff_9 meter_reading__fft_aggregated__aggtype_"centroid" meter_reading__fft_coefficient__attr_"imag"__coeff_96 meter_reading__fft_coefficient__attr_"real"__coeff_19 meter_reading__number_peaks__n_10 meter_reading__ratio_beyond_r_sigma__r_0.5 meter_reading__index_mass_quantile__q_0.6 meter_reading__large_standard_deviation__r_0.25
building_id
id_1000 -43074.519495 15596.933094 47584.386245 15727.156152 20220.276570 17067.039047 1079.240903 17715.628669 -0.382448 -15096.668908 ... 1.201055 159.119510 -3191.297602 38.152192 1106.287865 -2352.230884 17.0 0.696721 0.612022 0.0
id_1001 -7628.816978 -3911.939557 15873.642987 5032.296389 13920.261966 5925.520985 6.877930 13676.871037 -0.120301 2475.852684 ... 0.969957 -4138.987801 4257.388873 48.266915 -772.445893 -4623.990926 16.0 0.724044 0.581967 1.0
id_1002 -19168.921389 12392.682219 36069.190566 15379.633893 30553.869818 30522.230317 -643.816123 11208.568040 -0.055535 -6106.134034 ... 0.775965 -4475.278772 14149.232678 54.305618 -11976.100060 22628.461765 16.0 0.729508 0.576503 0.0
id_1003 -30259.389852 14004.997345 30879.572926 23761.466488 -6157.706543 12778.366490 544.708620 29662.381954 -0.171598 -20781.529348 ... 1.306953 -4272.822854 -1846.209559 35.748746 317.775756 1231.871456 16.0 0.620219 0.581967 0.0
id_1004 -251461.502332 90357.768462 267714.847953 90357.948689 91860.506529 66282.694494 3956.360745 80049.075009 -0.162121 -67609.908104 ... 1.023724 35709.830232 -37068.728064 42.108670 -17303.296473 14107.779485 14.0 0.636612 0.560109 0.0

5 rows × 457 columns

Some created features may have missing values for some of the time series. Since most clustering algorithms do not allow for missing values, these are excluded.

In [13]:
# Remove features with missing values
# ==============================================================================
ts_features_selected = ts_features_selected.dropna(axis=1, how='any')
print(f"Number of features after selection and removing missing values: {ts_features_selected.shape[1]}")
Number of features after selection and removing missing values: 457

Once the final feature matrix is ready, it may be useful to create a new dictionary to store the selected features and the parameters used to calculate them. This can be easily done using the from_columns function.

In [14]:
# Dictionary with selected features and their configuration
# ==============================================================================
selected_features_info = from_columns(ts_features_selected)
# pprint(selected_features_info['meter_reading'])

K-means clustering

A K-means clustering method is used to group the buildings. Since clustering is known to be negatively affected by high dimensionality, and since several hundred features have been created for each building, a PCA is used to reduce the dimensionality of the data before the k-means is applied.

In [15]:
# Scale features so all of them have mean 0 and std 1
# ==============================================================================
scaler = StandardScaler().set_output(transform="pandas")
ts_features_selected_scaled = scaler.fit_transform(ts_features_selected)
ts_features_selected_scaled.head(2)
Out[15]:
meter_reading__fft_coefficient__attr_"real"__coeff_52 meter_reading__fft_coefficient__attr_"real"__coeff_54 meter_reading__fft_coefficient__attr_"abs"__coeff_52 meter_reading__fft_coefficient__attr_"abs"__coeff_54 meter_reading__fft_coefficient__attr_"imag"__coeff_52 meter_reading__fft_coefficient__attr_"real"__coeff_53 meter_reading__cwt_coefficients__coeff_13__w_2__widths_(2, 5, 10, 20) meter_reading__fft_coefficient__attr_"abs"__coeff_51 meter_reading__partial_autocorrelation__lag_8 meter_reading__fft_coefficient__attr_"real"__coeff_51 ... meter_reading__fft_aggregated__aggtype_"skew" meter_reading__fft_coefficient__attr_"real"__coeff_70 meter_reading__fft_coefficient__attr_"imag"__coeff_9 meter_reading__fft_aggregated__aggtype_"centroid" meter_reading__fft_coefficient__attr_"imag"__coeff_96 meter_reading__fft_coefficient__attr_"real"__coeff_19 meter_reading__number_peaks__n_10 meter_reading__ratio_beyond_r_sigma__r_0.5 meter_reading__index_mass_quantile__q_0.6 meter_reading__large_standard_deviation__r_0.25
building_id
id_1000 0.015397 0.217408 -0.038323 -0.005531 -0.029957 -0.033146 0.249012 0.010032 -1.155843 -0.167445 ... -0.452856 -0.101645 -0.077438 0.309209 0.068678 -0.203444 1.337482 0.463180 0.227108 -0.523328
id_1001 0.272683 -0.459299 -0.244963 -0.278789 -0.120070 -0.220976 -0.310700 -0.077230 0.322673 0.342703 ... -0.800787 -0.469541 0.194937 1.179068 -0.112485 -0.282370 0.881898 0.658756 -0.458116 1.910848

2 rows × 457 columns

In [16]:
# Plot variance reduction as a function of the number of PCA components
# ==============================================================================
pca = PCA()
pca.fit(ts_features_selected_scaled)
n_components = np.argmax(np.cumsum(pca.explained_variance_ratio_) > 0.85) + 1

fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(6, 2.5))
ax.plot(np.cumsum(pca.explained_variance_ratio_))
ax.set_title('Variance explained by PCA components')
ax.set_xlabel('Number of components')
ax.set_ylabel('Cumulative explained variance')
ax.axhline(y=0.85, color='red', linestyle='--')
ax.axvline(x=n_components, color='red', linestyle='--')
ax.text(
    x=n_components+10,
    y=0.4,
    s=f"n_components = {n_components}",
    color='red',
    verticalalignment='center'
)

plt.show();

The dimensionality of the original 457 features is reduced by using the first 30 principal components (explaining 85% of the variance).

In [17]:
# PCA with as many components as necessary to explain 85% of the variance
# ==============================================================================
pca = PCA(n_components=0.85)
pca_projections = pca.fit_transform(ts_features_selected_scaled)

# Create a data frame with the projections. Each column is a principal component
# and each row is the id of the building.
pca_projections = pd.DataFrame(
    pca_projections,
    index   = ts_features_selected_scaled.index,
    columns = [f"PC{i}" for i in range(1, pca.n_components_ + 1)]
)

print(f"Number of components selected: {pca.n_components_}")
print(f"Explained variance: {pca.explained_variance_ratio_.sum():.3f}")
pca_projections.head(3)
Number of components selected: 30
Explained variance: 0.852
Out[17]:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 ... PC21 PC22 PC23 PC24 PC25 PC26 PC27 PC28 PC29 PC30
building_id
id_1000 -3.375609 -2.516596 6.588551 -1.155265 -1.412404 3.698966 0.831889 2.794919 -1.124592 0.481617 ... 0.374781 0.401220 -0.149135 -1.119731 0.844226 -0.810427 0.929896 0.063537 -0.145631 -0.104035
id_1001 -3.347753 3.874328 -4.978124 -2.118414 5.253900 -2.415342 1.447210 1.089728 1.738567 2.160865 ... 1.362968 0.322453 2.432154 1.557620 -2.760159 -0.523866 0.076542 2.007141 2.520836 0.317391
id_1002 3.975189 3.421043 4.809103 3.881389 3.255394 -2.851151 1.332397 -0.746149 0.215048 -0.402993 ... 1.865017 -0.747655 0.544906 0.743551 0.506844 -1.993740 1.011375 1.519550 1.313071 1.132661

3 rows × 30 columns

One of the inherent challenges of clustering is determining the optimal number of clusters, since there is no ground truth or predefined number of clusters in unsupervised learning. Several heuristics have been proposed to guide this selection, and in this case, the elbow method is used.

The elbow method involves plotting the total within-cluster sum of squares (WSS) against the number of clusters (k). The optimal number of clusters is typically identified at the "elbow" of the curve, where the rate of decrease in WSS begins to decrease significantly. This suggests that adding more clusters beyond this point provides little improvement in clustering performance.

In [18]:
# Optimal number of clusters
# ==============================================================================
# Identify the optimal number of clusters using the elbow method
range_n_clusters = range(1, 20)
inertias = []
size_of_clusters = []

for n_clusters in range_n_clusters:
    kmeans = KMeans(
                 n_clusters   = n_clusters, 
                 n_init       = 20, 
                 random_state = 963852
             )
    kmeans.fit(pca_projections)
    inertias.append(kmeans.inertia_)
    size_of_clusters.append(list(np.unique(kmeans.labels_, return_counts=True)[1]))

inertias = pd.Series(inertias, index=range_n_clusters)
dicrease_in_inertia = inertias.pct_change() * -100

fig, axs = plt.subplots(1, 2, figsize=(7, 2.5))
inertias.plot(marker='o', ax=axs[0])
axs[0].xaxis.set_major_locator(ticker.MaxNLocator(integer=True))
axs[0].set_title("Cluster inertia vs number of clusters")
axs[0].set_xlabel('Number of clusters')
axs[0].set_ylabel('Intra-cluster variance (inertia)')

dicrease_in_inertia.plot(kind='bar', ax=axs[1])
axs[1].set_title("Decrease in inertia")
axs[1].set_xlabel('Number of clusters')
axs[1].set_ylabel('Decrease in inertia (%)')
fig.tight_layout()
plt.show();