When faced with scenarios involving the prediction of hundreds or thousands of time series, a crucial decision arises: should one develop individual models for each series, or should one use a unified model to handle them all at once?
In Single-Series Modeling (Local Forecasting Model), a separate predictive model is created for each time series. While this method provides a comprehensive understanding of each series, its scalability can be challenged by the need to create and maintain hundreds or thousands of models.
Multi-Series Modeling (Global Forecasting Model) involves building a single predictive model that considers all time series simultaneously. It attempts to capture the core patterns that govern the series, thereby mitigating the potential noise that each series might introduce. This approach is computationally efficient, easy to maintain, and can yield more robust generalizations across time series, albeit potentially at the cost of sacrificing some individual insights.
To help understand the benefits of each forecasting strategy, this document focuses on examining and comparing the results obtained when predicting the energy consumption of over a thousand buildings using the ASHRAE - Great Energy Predictor III dataset available on Kaggle. Efficient management of energy consumption in residential buildings is essential for sustainable urban development, therefore accurate forecasting of energy consumption can play a crucial role in optimizing resource allocation and promoting energy conservation initiatives.
Global forecasting modeling assumes that series that behave similarly can benefit from being modeled together. Although the primary use of each building is available in the dataset, it may not reflect groups with similar patterns of energy use, so additional groups are created using clustering methods. A total of 5 experiments are performed:
Modelling each building individually.
Modelling all buildings together with a unique global model strategy.
Modelling groups of buildings based on their primary use (a global model per primary use).
Modelling groups of buildings based on time series feature clustering (a global model per cluster).
Modelling groups of buildings based on Dynamic Time Warping (DTW) clustering (a global model per cluster).
Energy consumption is predicted for each building on a weekly basis (daily resolution) for 13 weeks following each strategy. The effectiveness of each approach is evaluated using several performance metrics, including Mean Absolute Error (MAE), Absolute Error, and Bias. The goal of the study is to identify the most effective approach both for overall predictions and for a specific group of buildings.
💡 Tip
This is the first in a series of documents on global forecasting models.
✎ Note
This document focuses on one use case. For a detailed description of the many features that skforecast provides for building global forecasting models, see Global Forecasting Models: Modeling Multiple Time Series with Machine Learning.
Libraries used in this document:
# Data management
# ==============================================================================
import numpy as np
import pandas as pd
from datetime import datetime
from skforecast.datasets import fetch_dataset
# Plots
# ==============================================================================
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
plt.style.use('seaborn-v0_8-darkgrid')
from tqdm.auto import tqdm
# Forecasting
# ==============================================================================
import lightgbm
from lightgbm import LGBMRegressor
from sklearn.preprocessing import StandardScaler
import skforecast
from skforecast.recursive import ForecasterRecursive
from skforecast.model_selection import backtesting_forecaster
from skforecast.recursive import ForecasterRecursiveMultiSeries
from skforecast.model_selection import TimeSeriesFold
from skforecast.model_selection import backtesting_forecaster_multiseries
# Feature extraction
# ==============================================================================
import tsfresh
from tsfresh import extract_features
from tsfresh import select_features
from tsfresh.feature_extraction.settings import ComprehensiveFCParameters
from tsfresh.feature_extraction.settings import from_columns
# Clustering
# ==============================================================================
import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import tslearn
from tslearn.clustering import TimeSeriesKMeans
# Warnings configuration
# ==============================================================================
import warnings
warnings.filterwarnings('once')
color = '\033[1m\033[38;5;208m'
print(f"{color}Version skforecast: {skforecast.__version__}")
print(f"{color}Version scikit-learn: {sklearn.__version__}")
print(f"{color}Version lightgbm: {lightgbm.__version__}")
print(f"{color}Version tsfresh: {tsfresh.__version__}")
print(f"{color}Version tslearn: {tslearn.__version__}")
print(f"{color}Version pandas: {pd.__version__}")
print(f"{color}Version numpy: {np.__version__}")
⚠ Warning
At the time of writing this document, tslearn is only compatible with numpy versions lower than 2.0. If you have a higher version, you can downgrade it by running the following command: pip install numpy==1.26.4
Data used in this document has been obtained from the Kaggle competition Addison Howard, Chris Balbach, Clayton Miller, Jeff Haberl, Krishnan Gowri, Sohier Dane. (2019). ASHRAE - Great Energy Predictor III. Kaggle.
Three files are used to create the modeling data set:
weather_train.csv and weather_test.csv: These files contain weather-related data for each building, including outdoor air temperature, dew temperature, relative humidity, and other weather parameters. The weather data is crucial for understanding the impact of external conditions on building energy usage.
building_metadata.csv: This file provides metadata for each building in the dataset, such as building type, primary use, square footage, floor count, and year built. This information helps in understanding the characteristics of the buildings and their potential influence on energy consumption patterns.
train.csv: the train dataset contains the target variable, i.e., the energy consumption data for each building, along with the timestamps for the energy consumption readings. It also includes the corresponding building and weather IDs to link the information across different datasets.
The three files have been preprocessed to keep only buildings in which at least 85% of the values are neither NaN nor zero, to use only the electricity meter, and to aggregate the data to a daily frequency. See the Download and preprocess raw data section for more details.
# Load preprocessed data
# ==============================================================================
data = fetch_dataset('ashrae_daily')
data.head()
# Fill missing values of air_temperature and wind_speed using forward and backward fill
# ==============================================================================
# Imputation must be done separately for each building
data = data.sort_values(by=['building_id', 'timestamp'])
data['air_temperature'] = data.groupby('building_id')['air_temperature'].ffill().bfill()
data['wind_speed'] = data.groupby('building_id')['wind_speed'].ffill().bfill()
data = data.sort_index()
print(
f"Rage of dates available : {data.index.min()} --- {data.index.max()} "
f"(n_days={(data.index.max() - data.index.min()).days})"
)
One of the key attributes associated with each building is its designated use. This feature may play a crucial role in influencing the energy consumption pattern, as distinct uses can significantly impact both the quantity and timing of energy consumption.
# Number of buildings and type of buildings based on primary use
# ==============================================================================
n_building = data['building_id'].nunique()
n_type_building = data['primary_use'].nunique()
range_datetime = [data.index.min(), data.index.max()]
print(f"Number of buildings: {n_building}")
print(f"Number of building types: {n_type_building}")
display(data.drop_duplicates(subset=['building_id'])['primary_use'].value_counts())
For certain primary use categories, there is a limited number of buildings within the dataset. To streamline the analysis, categories with fewer than 100 buildings are grouped into the "Other" category.
# Types of buildings (primary use) with less than 100 samples are grouped as "Other".
# ==============================================================================
infrequent_categories = (
data
.drop_duplicates(subset=['building_id'])['primary_use']
.value_counts()
.loc[lambda x: x < 100]
.index
.tolist()
)
print(f"Infrequent categories:")
print("======================")
print('\n'.join(infrequent_categories))
data['primary_use'] = np.where(
data['primary_use'].isin(infrequent_categories),
'Other',
data['primary_use']
)
Next, a plot is created showing the energy consumption for a randomly selected building within each respective category, and a plot of all available time series for each category.
# Time series for 1 random selected building per group
# ==============================================================================
fig, axs = plt.subplots(nrows=3, ncols=2, figsize=(8, 5.5), sharex=True, sharey=False)
sample_ids = (
data
.groupby('primary_use')['building_id']
.apply(lambda x: x.sample(1, random_state=333))
.tolist()
)
axs = axs.flatten()
for i, building_id in enumerate(sample_ids):
data_sample = data[data['building_id'] == building_id]
building_type = data_sample['primary_use'].unique()[0]
data_sample.plot(
y = 'meter_reading',
ax = axs[i],
legend = False,
title = f"Building: {building_id}, type: {building_type}",
fontsize = 8
)
axs[i].set_xlabel("")
axs[i].set_ylabel("")
# Scientific notation for y axis
axs[i].ticklabel_format(axis='y', style='sci', scilimits=(0, 0))
axs[i].title.set_size(9)
fig.suptitle('Energy consumption for 6 random buildings', fontsize=12)
fig.tight_layout()
plt.show()
# Energy consumption by type of building (one gray line per building)
# ==============================================================================
fig, axs = plt.subplots(nrows=3, ncols=2, figsize=(8, 5.5), sharex=True, sharey=False)
axs = axs.flatten()
for i, building_type in enumerate(data['primary_use'].unique()):
data_sample = data[data['primary_use'] == building_type]
data_sample = data_sample.pivot_table(
index = 'timestamp',
columns = 'building_id',
values = 'meter_reading',
aggfunc = 'mean'
)
data_sample.plot(
legend = False,
title = f"Type: {building_type}",
color = 'gray',
alpha = 0.2,
fontsize = 8,
ax = axs[i]
)
data_sample.mean(axis=1).plot(
ax = axs[i],
color = 'blue',
fontsize = 8
)
axs[i].set_xlabel("")
axs[i].set_ylabel("")
# Scientific notation for y axis
axs[i].ticklabel_format(axis='y', style='sci', scilimits=(0,0))
axs[i].title.set_size(9)
# Limit the axis to 5 times the maximum mean value to improve visualization
axs[i].set_ylim(
bottom = 0,
top = 5 * data_sample.mean(axis=1).max()
)
fig.tight_layout()
fig.suptitle('Energy consumption by type of building', fontsize=12)
plt.subplots_adjust(top=0.9)
plt.show()
The graph reveals that there is significant variability in consumption patterns among buildings of the same purpose. This suggests that there may be scope for improving the criteria by which buildings are grouped.
The idea behind modeling multiple series at the same time is to be able to capture the main patterns that govern the series, thereby reducing the impact of the potential noise that each series may have. This means that series that behave similarly may benefit from being modeled together. One way to identify potential groups of series is to perform a clustering study prior to modeling. If clear groups are identified as a result of clustering, it is appropriate to model each of them separately.
Clustering is an unsupervised analysis technique that groups a set of observations into clusters that contain observations that are considered homogeneous, while observations in different clusters are considered heterogeneous. Algorithms that cluster time series can be divided into two groups: those that use a transformation to create features prior to clustering (feature-driven time series clustering), and those that work directly on the time series (elastic distance measures).
Feature-driven time series clustering: Features describing structural characteristics are extracted from each time series, then these features are fed into arbitrary clustering algorithms. These features are obtained by applying statistical operations that best capture the underlying characteristics: trend, seasonality, periodicity, serial correlation, skewness, kurtosis, chaos, nonlinearity, and self-similarity.
Elastic distance measures: This approach works directly on the time series, adjusting or "realigning" the series in comparison to each other. The best known of this family of measures is Dynamic Time Warping (DTW).
For a more detailed review of time series clustering, see A review and evaluation of elastic distance functions for time series clustering.
tsfresh is a powerful Python library for feature engineering from time-series and sequential data, including statistical measures, Fourier coefficients, and various other time-domain and frequency-domain features. It provides a systematic approach to automate the calculation of features and select the most informative ones.
To start, the default configuration of tsfresh is used, which calculates all available features. The easiest way to access this configuration is to use the ComprehensiveFCParameters class. It creates a dictionary that maps each feature name to a list of parameter dictionaries to be used when the function with that name is called.
# Default features and settings created by tsfresh
# ==============================================================================
default_features = ComprehensiveFCParameters()
print("Name of the features extracted by tsfresh:")
print("=========================================")
print('\n'.join(list(default_features)))
Many of these features are calculated using different values for their arguments.
# Default configuration for feature "partial_autocorrelation"
# ==============================================================================
default_features['partial_autocorrelation']
To access the detailed view of each feature and the parameter values included in the default configuration, use the following code:
# Default configuration for all extracted feature
# ==============================================================================
# from pprint import pprint
# pprint(default_features)
Once the configuration of features has been defined, the next step is to extract them from the time series. The function extract_features()
of tsfresh is used for this purpose. This function receives as input the time series and the configuration of the features to be extracted. The output is a dataframe with the extracted features.
# Feature extraction
# ==============================================================================
ts_features = extract_features(
timeseries_container = data[['building_id', 'meter_reading']].reset_index(),
column_id = "building_id",
column_sort = "timestamp",
column_value = "meter_reading",
default_fc_parameters = default_features,
impute_function = tsfresh.utilities.dataframe_functions.impute,
n_jobs = 4
)
print("Shape of ts_features:", ts_features.shape)
ts_features.head(3)
As a result of the extraction process, 783 features have been created for each time series (building_id in this use case). The returned dataframe has as its index the column specified in the column_id argument of extract_features.
The default extraction of tsfresh produces a huge number of features. However, only a few of these may be of interest in each use case. To select the most relevant ones, tsfresh includes an automated selection process based on hypothesis tests (FeatuRE Extraction based on Scalable Hypothesis tests). In this process, the features are individually and independently evaluated for their significance in predicting the target under investigation.
⚠ Warning
The selection process used by tsfresh is based on the significance of each feature in accurately predicting the target variable. To perform this process, a target variable is required – for example, the type of building associated with a given time series. However, there are cases where a target variable is not readily available; in such cases, alternative strategies can be used.
As feature selection is a critical step that affects the information available for the next steps of the analysis, it is highly recommended to understand the parameters that control the behavior of select_features():
X: DataFrame with the features created with extract_features. It can contain both binary and real-valued features at the same time.
y: Target vector which is needed to test which features are relevant. Can be binary or real-valued.
test_for_binary_target_binary_feature: Which test to use when the target is binary and the feature is binary. Currently unused. Default value 'fisher'.
test_for_binary_target_real_feature: Which test to use when the target is binary and the feature is continuous (real-valued). Default value 'mann'.
test_for_real_target_binary_feature: Which test to use when the target is continuous (real-valued) and the feature is binary. Currently unused. Default value 'mann'.
test_for_real_target_real_feature: Which test to use when the target is continuous (real-valued) and the feature is continuous (real-valued). Currently unused. Default value 'kendall'.
fdr_level: The FDR level that should be respected; this is the theoretical expected percentage of irrelevant features among all created features. Default value 0.05.
hypotheses_independent: Whether the significance of the features can be assumed to be independent. Normally this should be set to False, as the features are never independent (e.g. mean and median). Default value False.
n_jobs: Number of processes to use during the p-value calculation. Default value 4.
show_warnings: Show warnings during the p-value calculation (needed for debugging of calculators). Default value False.
chunksize: The size of one chunk that is submitted to the worker process for the parallelisation, where one chunk is defined as the data for one feature. If the chunksize is set to 10, one task is to filter 10 features. If set to None, heuristics are used to find the optimal chunksize depending on the distributor. If you get out-of-memory exceptions, you can try the dask distributor and a smaller chunksize. Default value None.
ml_task: The intended machine learning task, either 'classification', 'regression' or 'auto'. Defaults to 'auto', meaning the intended task is inferred from y: if y has a boolean, integer or object dtype, the task is assumed to be classification, otherwise regression.
multiclass: Whether the problem is multiclass classification. This modifies the way in which features are selected. Multiclass requires the features to be statistically significant for predicting n_significant features. Default value False.
n_significant: The number of classes for which features should be statistically significant predictors to be regarded as 'relevant'. Only specify when multiclass=True. Default value 1.
⚠ Warning
The order of the index returned in the features DataFrame is not the same as the order of the columns in the original DataFrame. Therefore, the data passed to the argument y in select_features must be sorted to ensure the correct association between the features and the target variable.
# Select relevant features
# ==============================================================================
target = (
data[['building_id', 'primary_use']]
.drop_duplicates()
.set_index('building_id')
.loc[ts_features.index, :]
['primary_use']
)
assert ts_features.index.equals(target.index)
ts_features_selected = select_features(
X = ts_features,
y = target,
fdr_level = 0.001 # Very restrictive filter
)
ts_features_selected.index.name = 'building_id'
print(f"Number of features before selection: {ts_features.shape[1]}")
print(f"Number of features after selection: {ts_features_selected.shape[1]}")
ts_features_selected.head()
Some created features may have missing values for some of the time series. Since most clustering algorithms do not allow for missing values, these are excluded.
# Remove features with missing values
# ==============================================================================
ts_features_selected = ts_features_selected.dropna(axis=1, how='any')
print(f"Number of features after selection and removing missing values: {ts_features_selected.shape[1]}")
Once the final feature matrix is ready, it may be useful to create a new dictionary to store the selected features and the parameters used to calculate them. This can be easily done using the from_columns
function.
# Dictionary with selected features and their configuration
# ==============================================================================
selected_features_info = from_columns(ts_features_selected)
# pprint(selected_features_info['meter_reading'])
A K-means clustering method is used to group the buildings. Since clustering is known to be negatively affected by high dimensionality, and since several hundred features have been created for each building, a PCA is used to reduce the dimensionality of the data before the k-means is applied.
# Scale features so all of them have mean 0 and std 1
# ==============================================================================
scaler = StandardScaler().set_output(transform="pandas")
ts_features_selected_scaled = scaler.fit_transform(ts_features_selected)
ts_features_selected_scaled.head(2)
# Plot variance reduction as a function of the number of PCA components
# ==============================================================================
pca = PCA()
pca.fit(ts_features_selected_scaled)
n_components = np.argmax(np.cumsum(pca.explained_variance_ratio_) > 0.85) + 1
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(6, 2.5))
ax.plot(np.cumsum(pca.explained_variance_ratio_))
ax.set_title('Variance explained by PCA components')
ax.set_xlabel('Number of components')
ax.set_ylabel('Cumulative explained variance')
ax.axhline(y=0.85, color='red', linestyle='--')
ax.axvline(x=n_components, color='red', linestyle='--')
ax.text(
x=n_components+10,
y=0.4,
s=f"n_components = {n_components}",
color='red',
verticalalignment='center'
)
plt.show();
The dimensionality of the original 457 features is reduced by using the first 30 principal components (explaining 85% of the variance).
# PCA with as many components as necessary to explain 85% of the variance
# ==============================================================================
pca = PCA(n_components=0.85)
pca_projections = pca.fit_transform(ts_features_selected_scaled)
# Create a data frame with the projections. Each column is a principal component
# and each row is the id of the building.
pca_projections = pd.DataFrame(
pca_projections,
index = ts_features_selected_scaled.index,
columns = [f"PC{i}" for i in range(1, pca.n_components_ + 1)]
)
print(f"Number of components selected: {pca.n_components_}")
print(f"Explained variance: {pca.explained_variance_ratio_.sum():.3f}")
pca_projections.head(3)
One of the inherent challenges of clustering is determining the optimal number of clusters, since there is no ground truth or predefined number of clusters in unsupervised learning. Several heuristics have been proposed to guide this selection, and in this case, the elbow method is used.
The elbow method involves plotting the total within-cluster sum of squares (WSS) against the number of clusters (k). The optimal number of clusters is typically identified at the "elbow" of the curve, where the rate of decrease in WSS slows down noticeably. This suggests that adding more clusters beyond this point provides little improvement in clustering performance.
# Optimal number of clusters
# ==============================================================================
# Identify the optimal number of clusters using the elbow method
range_n_clusters = range(1, 20)
inertias = []
size_of_clusters = []
for n_clusters in range_n_clusters:
kmeans = KMeans(
n_clusters = n_clusters,
n_init = 20,
random_state = 963852
)
kmeans.fit(pca_projections)
inertias.append(kmeans.inertia_)
size_of_clusters.append(list(np.unique(kmeans.labels_, return_counts=True)[1]))
inertias = pd.Series(inertias, index=range_n_clusters)
decrease_in_inertia = inertias.pct_change() * -100
fig, axs = plt.subplots(1, 2, figsize=(7, 2.5))
inertias.plot(marker='o', ax=axs[0])
axs[0].xaxis.set_major_locator(ticker.MaxNLocator(integer=True))
axs[0].set_title("Cluster inertia vs number of clusters")
axs[0].set_xlabel('Number of clusters')
axs[0].set_ylabel('Intra-cluster variance (inertia)')
decrease_in_inertia.plot(kind='bar', ax=axs[1])
axs[1].set_title("Decrease in inertia")
axs[1].set_xlabel('Number of clusters')
axs[1].set_ylabel('Decrease in inertia (%)')
fig.tight_layout()
plt.show();
The chart shows that after 9 clusters the decrease in inertia slows down; therefore, 9 clusters may be a good choice.
In addition to analyzing the evolution of the inertia (intra-variance), it is important to check the size of the clusters being created. The presence of small clusters may indicate overfitting or anomalous samples that do not fit well into any of the groups.
# Distribution of cluster sizes
# ==============================================================================
for n_clusters, sizes in zip(range_n_clusters, size_of_clusters):
print(f"Size of clusters (n = {n_clusters}): {sizes}")
When 9 clusters are created, the cluster sizes are not well balanced. Before modeling the time series, the smaller clusters (fewer than 20 observations) are combined into a single cluster named "Other".
# Train clustering model with 9 clusters and assign each building to a cluster
# ==============================================================================
kmeans = KMeans(n_clusters=9, n_init=20, random_state=963852)
kmeans.fit(pca_projections)
clusters = kmeans.predict(pca_projections)
clusters = pd.DataFrame({
'building_id': pca_projections.index,
'cluster_base_on_features': clusters.astype(str)
})
# Merge minor clusters into a single cluster
threshold = 20
cluster_size = clusters['cluster_base_on_features'].value_counts()
cluster_size = cluster_size[cluster_size < threshold].index.tolist()
clusters['cluster_base_on_features'] = np.where(
clusters['cluster_base_on_features'].isin(cluster_size),
'Other',
clusters['cluster_base_on_features']
)
clusters['cluster_base_on_features'].value_counts()
# Add the predicted cluster to the data frame with the building information
# ==============================================================================
data = pd.merge(
data.reset_index(), # To avoid losing the index
clusters,
on ='building_id',
how ='left',
validate = 'm:1'
).set_index('timestamp')
data.head(3)
Once each building has been assigned to a cluster, it is useful to examine the characteristics of the grouped buildings. For example, the distribution of the primary use of the buildings.
# Percentage of building types per cluster
# ==============================================================================
primary_usage_per_cluster = (
data
.groupby('cluster_base_on_features')['primary_use']
.value_counts(normalize=True)
.unstack()
.fillna(0)
)
primary_usage_per_cluster = (100 * primary_usage_per_cluster).round(1)
primary_usage_per_cluster
The results suggest (the table should be read horizontally) that the clustering process based on features extracted from the time series generates groups that differ from those formed by the main purpose of the building.
DTW is a technique that measures the similarity between two temporal sequences, which may vary in speed. In essence, it's an elastic distance measure that allows the time series to be stretched or compressed to align with each other optimally. Clustering with DTW involves grouping time series data based on their dynamic time-warped distances, ensuring that time series within the same cluster have similar shapes and patterns, even if they are out of phase in time or have different lengths. This method is particularly useful in various applications like speech recognition, data mining, and financial analysis, where the alignment of sequences in time is crucial for accurate pattern recognition and analysis.
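To build intuition about how an elastic measure differs from a point-wise one, the following minimal sketch compares the Euclidean distance with the DTW distance from tslearn.metrics on two synthetic sine waves, one of them shifted in phase. The toy series are created only for this illustration and are not part of the dataset.
# Minimal sketch: Euclidean vs DTW distance on two shifted toy series
# ==============================================================================
from tslearn.metrics import dtw

s1 = np.sin(np.linspace(0, 2 * np.pi, 50))        # reference pattern
s2 = np.sin(np.linspace(0, 2 * np.pi, 50) - 0.5)  # same pattern, shifted in phase

euclidean_distance = np.linalg.norm(s1 - s2)
dtw_distance = dtw(s1, s2)

print(f"Euclidean distance: {euclidean_distance:.2f}")
print(f"DTW distance      : {dtw_distance:.2f}")
# DTW realigns the two series before comparing them, so the shifted series is
# considered far closer than the point-wise Euclidean comparison suggests.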
The data needs to be transformed from a narrow (long) format to a wide (pivot) format in order to use a scikit-learn StandardScaler
, which scales the time series so that they all have zero mean and unit variance. Each column represents a building.
# Pivot table with time series for each building
# ==============================================================================
data_pivot = data.pivot_table(
index = 'timestamp',
columns = 'building_id',
values = 'meter_reading',
aggfunc = 'mean'
)
data_pivot = data_pivot.fillna(value=0)
data_pivot.head(3)
# Scale the time series
# ==============================================================================
data_pivot = pd.DataFrame(
StandardScaler().fit_transform(data_pivot),
index = data_pivot.index,
columns = data_pivot.columns
)
data_pivot.head(3)
TimeSeriesKMeans
from the tslearn library is a specialized clustering algorithm designed for time series data. It extends the traditional K-means clustering approach to specifically handle the shapes and patterns inherent in time series datasets. It groups similar time series into clusters based on their temporal features, accounting for aspects like trends, seasonality, and various time-dependent patterns.
⚠ Warning
The next cell has a run time of approximately 15 minutes.
# Fit clustering model
# ==============================================================================
model = TimeSeriesKMeans(
n_clusters = 4,
metric = "dtw",
max_iter = 100,
random_state = 123456
)
# TimeSeriesKMeans assumes there is one time series per row.
model.fit(data_pivot.transpose())
# Cluster prediction
# ==============================================================================
clusters = model.predict(data_pivot.transpose())
clusters = pd.DataFrame({
'building_id': data_pivot.columns,
'cluster_base_on_dtw': clusters.astype(str),
})
# Size of each cluster
print("Size of each cluster:")
clusters['cluster_base_on_dtw'].value_counts().sort_index()
The 4 clusters generated by Dynamic Time Warping (DTW) are well-balanced in size.
# Add the predicted cluster to the data frame with the building information
# ==============================================================================
data = pd.merge(
data.reset_index(), # To avoid losing the index
clusters,
on = 'building_id',
how = 'left',
validate = 'm:1'
).set_index('timestamp')
data.head(3)
✎ Note
DTAIDistance is also a great Python library for computing distances between time series. It is based on C++ and Cython, so it is very fast, and it implements many distance metrics and clustering methods. It is a good alternative to tslearn.
# Save data for modelling to avoid repeating the previous steps
# ==============================================================================
data.to_parquet('data_modelling.parquet')
After establishing the three grouping criteria (building usage, clustering based on time series features, and clustering based on dynamic time warping), both single-series and global multi-series models are trained. The evaluation that follows concentrates on determining how effectively these models can forecast the daily data of the 13-week test period. During this assessment, three distinct metrics are used:
The average value of the Mean Absolute Error (MAE) across all buildings.
The sum of the absolute errors, i.e. the absolute deviation between the predicted value and the actual consumption, summed over all buildings.
The bias of the predictions summed over all buildings. A minimal sketch of how these three metrics are aggregated is shown right after this list.
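To make explicit how these three metrics are aggregated later in the document, the sketch below computes them from a hypothetical DataFrame of per-building errors (prediction minus real value). The numbers are made up purely for illustration.
# Minimal sketch: aggregate metrics from a toy DataFrame of per-building errors
# ==============================================================================
errors_example = pd.DataFrame({
    'id_1': [10.0, -5.0, 2.0],   # prediction - real value for building id_1
    'id_2': [-3.0, 4.0, -1.0],   # prediction - real value for building id_2
})

average_mae = errors_example.abs().mean(axis=0).mean()  # mean of per-building MAEs
abs_error   = errors_example.abs().sum().sum()          # total absolute deviation
bias        = errors_example.sum().sum()                # > 0 over-predicts, < 0 under-predicts

print(f"Average MAE: {average_mae:.2f} | Absolute error: {abs_error:.0f} | Bias: {bias:.0f}")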
In addition to the lagged values of each series itself, the day of the week (sine-cosine encoded), the outdoor temperature and the wind speed are included as exogenous variables.
✎ Note
For a more detailed explanation of time series model validation, readers are encouraged to consult the Backtesting user guide. For more information about calendar features and cyclical encoding visit Calendar features and Cyclical features in time series.
To train the models, search for optimal hyperparameters, and evaluate their predictive performance, the data is divided into three separate sets: training, validation, and test.
# Load data for modelling
# ==============================================================================
data = pd.read_parquet('data_modelling.parquet')
# Split data in train, validation and test
# ==============================================================================
end_train = '2016-07-31 23:59:00'
end_validation = '2016-09-30 23:59:00'
data_train = data.loc[: end_train, :].copy()
data_val = data.loc[end_train:end_validation, :].copy()
data_test = data.loc[end_validation:, :].copy()
print(
f"Rage of dates available : {data.index.min()} --- {data.index.max()} "
f"(n_days={(data.index.max() - data.index.min()).days})"
)
print(
f" Dates for training : {data_train.index.min()} --- {data_train.index.max()} "
f"(n_days={(data_train.index.max() - data_train.index.min()).days})"
)
print(
f" Dates for validation : {data_val.index.min()} --- {data_val.index.max()} "
f"(n_days={(data_val.index.max() - data_val.index.min()).days})"
)
print(
f" Dates for test : {data_test.index.min()} --- {data_test.index.max()} "
f"(n_days={(data_test.index.max() - data_test.index.min()).days})"
)
⚠ Warning
Although 1214 buildings are available for modeling, to keep model training within a reasonable time frame a subset of, for example, 600 randomly selected buildings can be used. The reader is encouraged to adapt the number of buildings if necessary and check whether the conclusions hold.
# Sample 600 buildings
# ==============================================================================
# rng = np.random.default_rng(12345)
# buildings = data['building_id'].unique()
# buildings_selected = rng.choice(
# buildings,
# size = 600,
# replace = False
# )
#
# data = data.query("building_id in @buildings_selected")
# data_train = data_train.query("building_id in @buildings_selected")
# data_val = data_val.query("building_id in @buildings_selected")
# data_test = data_test.query("building_id in @buildings_selected")
# Table of results for all models
# ==============================================================================
table_results = pd.DataFrame(columns=['model', 'mae', 'abs_error', 'bias', 'elapsed_time'])
table_results = table_results.set_index('model')
table_results = table_results.astype({'mae': float, 'abs_error': float, 'bias': float, 'elapsed_time': object})
# Function to add calendar features to a DataFrame
# ==============================================================================
def day_of_week_cyclical_encoding(data):
"""
Calculate cyclical encoding for the day of week using a
DataFrame with a datetime index.
"""
data['day_of_week'] = data.index.dayofweek
data['sin_day_of_week'] = np.sin(2*np.pi*data['day_of_week']/7)
data['cos_day_of_week'] = np.cos(2*np.pi*data['day_of_week']/7)
return data
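As a quick check of what this helper produces, the following minimal sketch applies it to an arbitrary one-week date range chosen only for illustration.
# Minimal sketch: cyclical encoding of the day of week for one example week
# ==============================================================================
example_week = pd.DataFrame(index=pd.date_range('2016-01-04', periods=7, freq='D'))
example_week = day_of_week_cyclical_encoding(example_week)
print(example_week.round(2))
# In the encoded space, the distance between Sunday (6) and Monday (0) is the same
# as between any two consecutive days, unlike the raw 0-6 scale.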
# Exogenous variables included in the model
# ==============================================================================
exog_features = ['sin_day_of_week', 'cos_day_of_week', 'air_temperature', 'wind_speed']
An individual forecasting model is trained and evaluated for each building.
# Train and test a model for each building
# ==============================================================================
predictions_all_buildings = {}
metrics_all_buildings = {}
errors_all_buildings = {}
# Define forecaster
params_lgbm = {
'n_estimators': 500,
'learning_rate': 0.01,
'max_depth': 4,
'random_state': 8520,
'verbose': -1
}
forecaster = ForecasterRecursive(
regressor = LGBMRegressor(**params_lgbm),
lags = 31,
)
# Train and predict for each building
start = datetime.now()
for building in tqdm(data['building_id'].unique(), desc='Modelling buildings'):
# Get data for the building
data_building = data[data['building_id'] == building]
data_building = data_building.asfreq('D').sort_index()
# Add calendar features
data_building = day_of_week_cyclical_encoding(data_building)
# Backtesting
try:
cv = TimeSeriesFold(
steps = 7,
initial_train_size = len(data_building.loc[:end_validation, :]),
refit = False
)
metric, predictions = backtesting_forecaster(
forecaster = forecaster,
y = data_building['meter_reading'],
exog = data_building[exog_features],
cv = cv,
metric = 'mean_absolute_error',
verbose = False,
show_progress = False
)
predictions_all_buildings[building] = predictions['pred']
metrics_all_buildings[building] = metric.at[0, 'mean_absolute_error']
errors_all_buildings[building] = (
predictions['pred'] - data_building.loc[predictions.index, 'meter_reading']
)
except Exception as e:
print(f"Error modelling building {building}: {e}")
end = datetime.now()
predictions_all_buildings = pd.DataFrame(predictions_all_buildings)
errors_all_buildings = pd.DataFrame(errors_all_buildings)
mean_metric_all_buildings = pd.Series(metrics_all_buildings).mean()
sum_abs_errors_all_buildings = errors_all_buildings.abs().sum(axis=1).sum()
sum_bias_all_buildings = errors_all_buildings.sum().sum()
table_results.loc['One model per building', 'mae'] = mean_metric_all_buildings
table_results.loc['One model per building', 'abs_error'] = sum_abs_errors_all_buildings
table_results.loc['One model per building', 'bias'] = sum_bias_all_buildings
table_results.loc['One model per building', 'elapsed_time'] = end - start
print("")
print(f"Average mean absolute error for all buildings: {mean_metric_all_buildings:.0f}")
print(f"Sum of absolute errors for all buildings (x 10,000): {sum_abs_errors_all_buildings / 10000:.0f}")
print(f"Bias (x 10,000): {sum_bias_all_buildings / 10000:.0f}")
# Plot predictions vs real value for 2 random buildings
# ==============================================================================
rng = np.random.default_rng(147)
selected_buildings = rng.choice(data['building_id'].unique(), size=2, replace=False)
fig, axs = plt.subplots(2, 1, figsize=(6, 4), sharex=True)
axs = axs.flatten()
for i, building in enumerate(selected_buildings):
data_test.query("building_id == @building")['meter_reading'].plot(ax=axs[i], label='test')
predictions_all_buildings[building].plot(ax=axs[i], label='One model per building')
axs[i].set_title(f"Building {building}", fontsize=10)
axs[i].set_xlabel("")
axs[i].legend()
fig.tight_layout()
plt.show();
A global model for all buildings is trained and tested using the skforecast class ForecasterRecursiveMultiSeries. This forecaster allows the user to pass the time series in several formats, including a DataFrame with the time series organized as columns.
For more information about using series of different lengths, or different exogenous variables for each series, see Global Forecasting Models.
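The sketch below illustrates one of those alternative formats: a dictionary of pandas Series, which is the format described in the linked guide for series of different lengths. It is a toy example with synthetic series and made-up names, not part of the analysis in this document.
# Minimal sketch (illustration only): series of different lengths passed as a dict
# ==============================================================================
rng = np.random.default_rng(123)
series_dict = {
    'id_a': pd.Series(
        rng.normal(100, 10, 90),
        index=pd.date_range('2016-01-01', periods=90, freq='D'),
        name='id_a'
    ),
    'id_b': pd.Series(
        rng.normal(80, 5, 60),
        index=pd.date_range('2016-02-01', periods=60, freq='D'),
        name='id_b'
    ),
}
forecaster_sketch = ForecasterRecursiveMultiSeries(
    regressor = LGBMRegressor(random_state=8520, verbose=-1),
    lags      = 14,
    encoding  = 'ordinal_category'
)
forecaster_sketch.fit(series=series_dict)
forecaster_sketch.predict(steps=7)  # by default, predictions are returned for every series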
# Reshape data to have one column per building
# ==============================================================================
data_pivot = data.pivot_table(
index = 'timestamp',
columns = 'building_id',
values = 'meter_reading',
aggfunc = 'mean'
)
data_pivot.columns = data_pivot.columns.astype(str)
data_pivot = data_pivot.asfreq('D').sort_index()
# Add calendar features
data_pivot = day_of_week_cyclical_encoding(data_pivot)
# Add temperature and wind speed as exogenous variables
data_pivot = data_pivot.merge(
data[['air_temperature', 'wind_speed']].resample('D').mean(),
left_index = True,
right_index = True,
how = 'left',
validate = '1:m'
)
data_pivot.head(3)
# Forecaster multi-series to model all buildings at the same time
# ==============================================================================
exog_features = ['sin_day_of_week', 'cos_day_of_week', 'air_temperature', 'wind_speed']
params_lgbm = {
'n_estimators': 500,
'learning_rate': 0.01,
'max_depth': 10,
'random_state': 8520,
'verbose': -1
}
forecaster = ForecasterRecursiveMultiSeries(
regressor = LGBMRegressor(**params_lgbm),
lags = 31,
encoding = 'ordinal_category'
)
start = datetime.now()
cv = TimeSeriesFold(
steps = 7,
initial_train_size = len(data_pivot.loc[:end_validation, :]),
refit = False
)
metric, predictions = backtesting_forecaster_multiseries(
forecaster = forecaster,
series = data_pivot.filter(regex='^id_'), # Select only buildings
exog = data_pivot[exog_features],
cv = cv,
metric = 'mean_absolute_error',
add_aggregated_metric = False,
verbose = False,
show_progress = True
)
end = datetime.now()
mean_metric_all_buildings = metric['mean_absolute_error'].mean()
errors_all_buildings = predictions - data_pivot.loc[predictions.index, predictions.columns]
sum_abs_errors_all_buildings = errors_all_buildings.abs().sum().sum()
sum_bias_all_buildings = errors_all_buildings.sum().sum()
table_results.loc['Global model', 'mae'] = mean_metric_all_buildings
table_results.loc['Global model', 'abs_error'] = sum_abs_errors_all_buildings
table_results.loc['Global model', 'bias'] = sum_bias_all_buildings
table_results.loc['Global model', 'elapsed_time'] = end - start
print("")
print(f"Average mean absolute error for all buildings: {mean_metric_all_buildings:.0f}")
print(f"Sum of absolute errors for all buildings (x 10,000): {sum_abs_errors_all_buildings/10000:.0f}")
print(f"Bias (x 10,000): {sum_bias_all_buildings/10000:.0f}")
# Add predictions to the already existing plot (not showing the plot yet)
# ==============================================================================
for i, building in enumerate(selected_buildings):
predictions_all_buildings[building].plot(ax=axs[i], label='Global model')
axs[i].legend()
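A global model is trained and tested for each group of buildings that share the same primary use.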
# Forecaster multi-series models for buildings grouped by primary usage
# ==============================================================================
predictions_all_buildings = []
metrics_all_buildings = []
params_lgbm = {
'n_estimators': 500,
'learning_rate': 0.01,
'max_depth': 10,
'random_state': 8520,
'verbose': -1
}
forecaster = ForecasterRecursiveMultiSeries(
regressor = LGBMRegressor(**params_lgbm),
lags = 31,
encoding = "ordinal_category"
)
start = datetime.now()
for primary_usage in data['primary_use'].unique():
print(
f"Training and testing model for primary usage: {primary_usage} "
f"(n = {data[data['primary_use'] == primary_usage]['building_id'].nunique()})"
)
# Create a subset based on primary use clusters
data_subset = data[data['primary_use'] == primary_usage]
data_subset = data_subset.pivot_table(
index = 'timestamp',
columns = 'building_id',
values = 'meter_reading',
aggfunc = 'mean'
)
data_subset.columns = data_subset.columns.astype(str)
data_subset = data_subset.asfreq('D').sort_index()
# Add calendar features
data_subset = day_of_week_cyclical_encoding(data_subset)
data_subset = data_subset.merge(
data[['air_temperature', 'wind_speed']].resample('D').mean(),
left_index = True,
right_index = True,
how = 'left',
validate = '1:m'
)
cv = TimeSeriesFold(
steps = 7,
initial_train_size = len(data_subset.loc[:end_validation, :]),
refit = False
)
metric, predictions = backtesting_forecaster_multiseries(
forecaster = forecaster,
series = data_subset.filter(regex='^id_'),
exog = data_subset[['sin_day_of_week', 'cos_day_of_week']],
cv = cv,
metric = 'mean_absolute_error',
add_aggregated_metric = False,
verbose = False,
show_progress = False
)
predictions_all_buildings.append(predictions)
metrics_all_buildings.append(metric)
end = datetime.now()
predictions_all_buildings = pd.concat(predictions_all_buildings, axis=1)
metrics_all_buildings = pd.concat(metrics_all_buildings, axis=0)
errors_all_buildings = (
predictions_all_buildings - data_pivot.loc[predictions_all_buildings.index, predictions_all_buildings.columns]
)
mean_metric_all_buildings = metrics_all_buildings['mean_absolute_error'].mean()
sum_abs_errors_all_buildings = errors_all_buildings.abs().sum().sum()
sum_bias_all_buildings = errors_all_buildings.sum().sum()
table_results.loc['Global model per primary usage', 'mae'] = mean_metric_all_buildings
table_results.loc['Global model per primary usage', 'abs_error'] = sum_abs_errors_all_buildings
table_results.loc['Global model per primary usage', 'bias'] = sum_bias_all_buildings
table_results.loc['Global model per primary usage', 'elapsed_time'] = end - start
print("")
print(f"Average mean absolute error for all buildings: {mean_metric_all_buildings:.0f}")
print(f"Sum of absolute errors for all buildings (x 10,000): {sum_abs_errors_all_buildings/10000:.0f}")
print(f"Bias (x 10,000): {sum_bias_all_buildings/10000:.0f}")
# Add predictions to the already existing plot (not showing the plot yet)
# ==============================================================================
for i, building in enumerate(selected_buildings):
predictions_all_buildings[building].plot(ax=axs[i], label='Global model per primary usage')
axs[i].legend()
A global model for each cluster based on time series features is trained and tested.
# Forecaster multi-series models for buildings grouped by time series features
# ==============================================================================
predictions_all_buildings = []
metrics_all_buildings = []
params_lgbm = {
'n_estimators': 500,
'learning_rate': 0.01,
'max_depth': 10,
'random_state': 8520,
'verbose': -1
}
forecaster = ForecasterRecursiveMultiSeries(
regressor = LGBMRegressor(**params_lgbm),
lags = 31,
encoding = "ordinal_category"
)
start = datetime.now()
for cluster in data['cluster_base_on_features'].unique():
print(
f"Training and testing model for cluster: {cluster} "
f"(n = {data[data['cluster_base_on_features'] == cluster]['building_id'].nunique()})"
)
# Create subset based on features clusters
data_subset = data[data['cluster_base_on_features'] == cluster]
data_subset = data_subset.pivot_table(
index = 'timestamp',
columns = 'building_id',
values = 'meter_reading',
aggfunc = 'mean'
)
data_subset.columns = data_subset.columns.astype(str)
data_subset = data_subset.asfreq('D').sort_index()
# Add calendar features
data_subset = day_of_week_cyclical_encoding(data_subset)
data_subset = data_subset.merge(
data[['air_temperature', 'wind_speed']].resample('D').mean(),
left_index = True,
right_index = True,
how = 'left',
validate = '1:m'
)
cv = TimeSeriesFold(
steps = 7,
initial_train_size = len(data_subset.loc[:end_validation, :]),
refit = False
)
metric, predictions = backtesting_forecaster_multiseries(
forecaster = forecaster,
series = data_subset.filter(regex='^id_'),
exog = data_subset[['sin_day_of_week', 'cos_day_of_week']],
cv = cv,
metric = 'mean_absolute_error',
add_aggregated_metric = False,
verbose = False,
show_progress = False
)
predictions_all_buildings.append(predictions)
metrics_all_buildings.append(metric)
end = datetime.now()
predictions_all_buildings = pd.concat(predictions_all_buildings, axis=1)
metrics_all_buildings = pd.concat(metrics_all_buildings, axis=0)
errors_all_buildings = (
predictions_all_buildings - data_pivot.loc[predictions_all_buildings.index, predictions_all_buildings.columns]
)
mean_metric_all_buildings = metrics_all_buildings['mean_absolute_error'].mean()
sum_abs_errors_all_buildings = errors_all_buildings.abs().sum().sum()
sum_bias_all_buildings = errors_all_buildings.sum().sum()
table_results.loc['Global model per cluster (features)', 'mae'] = mean_metric_all_buildings
table_results.loc['Global model per cluster (features)', 'abs_error'] = sum_abs_errors_all_buildings
table_results.loc['Global model per cluster (features)', 'bias'] = sum_bias_all_buildings
table_results.loc['Global model per cluster (features)', 'elapsed_time'] = end - start
print("")
print(f"Average mean absolute error for all buildings: {mean_metric_all_buildings:.0f}")
print(f"Sum of absolute errors for all buildings (x 10,000): {sum_abs_errors_all_buildings/10000:.0f}")
print(f"Bias (x 10,000): {sum_bias_all_buildings/10000:.0f}")
# Add predictions to the already existing plot (not showing the plot yet)
# ==============================================================================
for i, building in enumerate(selected_buildings):
predictions_all_buildings[building].plot(ax=axs[i], label='Global model per cluster (features)')
axs[i].legend()
A global model for each cluster based on elastic distance (DTW) is trained and tested.
# Forecaster multi-series models for buildings grouped by DTW
# ==============================================================================
predictions_all_buildings = []
metrics_all_buildings = []
params_lgbm = {
'n_estimators': 500,
'learning_rate': 0.01,
'max_depth': 10,
'random_state': 8520,
'verbose': -1
}
forecaster = ForecasterRecursiveMultiSeries(
regressor = LGBMRegressor(**params_lgbm),
lags = 31,
encoding = 'ordinal_category'
)
start = datetime.now()
for cluster in data['cluster_base_on_dtw'].unique():
print(
f"Training and testing model for cluster: {cluster} "
f"(n = {data[data['cluster_base_on_dtw'] == cluster]['building_id'].nunique()})"
)
# Create subset based on DTW clusters
data_subset = data[data['cluster_base_on_dtw'] == cluster]
data_subset = data_subset.pivot_table(
index = 'timestamp',
columns = 'building_id',
values = 'meter_reading',
aggfunc = 'mean'
)
data_subset.columns = data_subset.columns.astype(str)
data_subset = data_subset.asfreq('D').sort_index()
# Add calendar features
data_subset = day_of_week_cyclical_encoding(data_subset)
data_subset = data_subset.merge(
data[['air_temperature', 'wind_speed']].resample('D').mean(),
left_index = True,
right_index = True,
how = 'left',
validate = '1:m'
)
cv = TimeSeriesFold(
steps = 7,
initial_train_size = len(data_subset.loc[:end_validation, :]),
refit = False
)
metric, predictions = backtesting_forecaster_multiseries(
forecaster = forecaster,
series = data_subset.filter(regex='^id_'),
exog = data_subset[['sin_day_of_week', 'cos_day_of_week']],
cv = cv,
metric = 'mean_absolute_error',
add_aggregated_metric = False,
verbose = False,
show_progress = False
)
predictions_all_buildings.append(predictions)
metrics_all_buildings.append(metric)
end = datetime.now()
predictions_all_buildings = pd.concat(predictions_all_buildings, axis=1)
metrics_all_buildings = pd.concat(metrics_all_buildings, axis=0)
errors_all_buildings = (
predictions_all_buildings - data_pivot.loc[predictions_all_buildings.index, predictions_all_buildings.columns]
)
mean_metric_all_buildings = metrics_all_buildings['mean_absolute_error'].mean()
sum_abs_errors_all_buildings = errors_all_buildings.abs().sum().sum()
sum_bias_all_buildings = errors_all_buildings.sum().sum()
table_results.loc['Global model per cluster (DTW)', 'mae'] = mean_metric_all_buildings
table_results.loc['Global model per cluster (DTW)', 'abs_error'] = sum_abs_errors_all_buildings
table_results.loc['Global model per cluster (DTW)', 'bias'] = sum_bias_all_buildings
table_results.loc['Global model per cluster (DTW)', 'elapsed_time'] = end - start
print("")
print(f"Average mean absolute error for all buildings: {mean_metric_all_buildings:.0f}")
print(f"Sum of absolute errors for all buildings (x 10,000): {sum_abs_errors_all_buildings/10000:.0f}")
print(f"Bias (x 10,000): {sum_bias_all_buildings/10000:.0f}")
# Add predictions to the already existing plot (not showing the plot yet)
# ==============================================================================
for i, building in enumerate(selected_buildings):
predictions_all_buildings[building].plot(ax=axs[i], label='Global model per cluster (DTW)')
axs[i].legend()
# Table of results
# ==============================================================================
table_results['elapsed_time'] = table_results['elapsed_time'].astype(str).str[:7]
table_results.style.highlight_min(axis=0, color='green').format(precision=0)
# Plot predictions vs real value for 2 random buildings for all models
# ==============================================================================
handles, labels = axs[0].get_legend_handles_labels()
fig.legend(handles, labels, loc='upper center', bbox_to_anchor=(0.5, 0.05), ncol=2)
for ax in axs:
ax.legend().remove()
fig
The decision between using individual models for each building or a global model for all can greatly impact the outcome. Our examination compares these two strategies, highlighting the trade-offs between the forecasting capabilities and computational efficiency.
Predictive Capability: Global models outperform individual models in terms of mean absolute error, suggesting that they better capture the common patterns across series, leading to more accurate forecasts.
Computational Efficiency: Global models prove to be more efficient, requiring less time than running individual models for each building. This highlights the advantage of the global model in time-sensitive applications where rapid training and prediction are valued.
Best Model: For this use case, among the global models, the "Global Model per Cluster (Features)" has the best overall performance, with the lowest mean absolute error and absolute error.
Further research
This analysis has provided interesting insights into the effectiveness of global versus individual models in time series forecasting. Possible next steps could include:
Manually review buildings with high error. Identify if there is a group for which the model is not performing well.
Add more exogenous features: see skforecast's user guide Calendar Features for calendar and sunlight features that typically affect energy consumption.
Optimize lags and hyperparameters: Use grid, random or Bayesian search to find the best model configuration (a minimal sketch follows this list).
Try other machine learning algorithms.
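As an illustration of the last two points, a few candidate configurations can be compared by reusing backtesting_forecaster_multiseries over the validation period only. The small grid below is an arbitrary example; skforecast also provides dedicated search functions that automate this process.
# Minimal sketch: manual comparison of a few hyperparameter candidates using only
# the training and validation data (arbitrary grid chosen for illustration)
# ==============================================================================
candidate_params = [
    {'n_estimators': 500, 'learning_rate': 0.01, 'max_depth': 10},
    {'n_estimators': 900, 'learning_rate': 0.05, 'max_depth': 6},
]
results_search = []
for params in candidate_params:
    forecaster = ForecasterRecursiveMultiSeries(
        regressor = LGBMRegressor(**params, random_state=8520, verbose=-1),
        lags      = 31,
        encoding  = 'ordinal_category'
    )
    cv = TimeSeriesFold(
        steps              = 7,
        initial_train_size = len(data_pivot.loc[:end_train, :]),
        refit              = False
    )
    metric, _ = backtesting_forecaster_multiseries(
        forecaster            = forecaster,
        series                = data_pivot.filter(regex='^id_').loc[:end_validation, :],
        exog                  = data_pivot.loc[:end_validation, exog_features],
        cv                    = cv,
        metric                = 'mean_absolute_error',
        add_aggregated_metric = False,
        verbose               = False,
        show_progress         = False
    )
    results_search.append({**params, 'mae': metric['mean_absolute_error'].mean()})
pd.DataFrame(results_search).sort_values('mae')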
The data used in this document was obtained from the Kaggle competition Addison Howard, Chris Balbach, Clayton Miller, Jeff Haberl, Krishnan Gowri, Sohier Dane. (2019). ASHRAE - Great Energy Predictor III. Kaggle. Raw data are available for download after registration at Kaggle platform.
Three files are used:
weather_train.csv and weather_test.csv: These files contain weather-related data for each building, including outside air temperature, dew point temperature, relative humidity and other weather parameters. The weather data is critical for understanding the impact of external conditions on building energy use.
building_metadata.csv: This file contains metadata for each building in the dataset, such as building type, primary use, square footage, number of floors, and year built. This information helps to understand the characteristics of the buildings and their potential influence on energy consumption patterns.
train.csv: The train dataset contains the target variable, i.e. the energy consumption data for each building, together with the time stamps for the energy consumption readings. It also includes the corresponding building and weather IDs to link the information across different datasets.
Only buildings in which more than 85% of the values are different from NaN or zero are selected for this study. Among the different types of energy consumption measures, only the electricity meter is used. Finally, the data is aggregated to a daily frequency.
# Raw data downloaded from kaggle
# ======================================================================================
# Building metadata
building_metadata = pd.read_csv("building_metadata.csv")
print("Building metadata shape:", building_metadata.shape)
display(building_metadata.head(3))
# Consumption data
train = pd.read_csv("train.csv")
train['building_id'] = train['building_id'].astype(np.int16)
train['meter'] = train['meter'].astype(np.int8)
train['meter_reading'] = train['meter_reading'].astype(np.float32)
print("Consumption data shape:", train.shape)
display(train.head(3))
# Weather data
weather_train = pd.read_csv("weather_train.csv")
print("Weather data shape:", weather_train.shape)
display(weather_train.head(3))
# Preprocessing
# ======================================================================================
print("")
print("Size of data sets before filtering:")
print("==================================")
print(f"Number of rows in train: {train.shape[0]}")
print(f"Number of rows in building_metadata: {building_metadata.shape[0]}")
print(f"Number of rows in weather: {weather_train.shape[0]}")
print(f"Number of buildings: {train['building_id'].nunique()}")
# The variable meter indicates the type of meter that measures energy consumption. There
# are 4 types of meter: {0: electricity, 1: chilled water, 2: steam, 3: hot water}. Only
# the electricity meter, which is the most common, is selected.
train = train[train['meter'] == 0]
train = train.drop(columns=['meter'])
# Select only buildings with 85% of values different from 0 or NaN
train = train.pivot_table(
index = 'timestamp',
columns = 'building_id',
values = 'meter_reading',
aggfunc = 'mean',
fill_value = np.nan
)
pct_nan_per_col = train.isnull().mean(axis=0)
pct_zero_per_col = (train == 0).mean(axis=0)
selected_buildings = train.columns[(pct_nan_per_col + pct_zero_per_col) < 0.15]
train = train.loc[:, selected_buildings]
train = train.melt(ignore_index=False, value_name='meter_reading').reset_index()
# Exclude information about unselected buildings
selected_buildings = train['building_id'].unique()
building_metadata = building_metadata[building_metadata['building_id'].isin(selected_buildings)]
selected_sites = building_metadata['site_id'].unique()
weather_train = weather_train[weather_train['site_id'].isin(selected_sites)]
# Optimise memory usage
train["timestamp"] = pd.to_datetime(train["timestamp"])
building_metadata["primary_use"] = building_metadata["primary_use"].astype("category")
weather_train["timestamp"] = pd.to_datetime(weather_train["timestamp"])
# Exclude columns with more than 80% of missing values
threshold = 0.8
building_metadata = building_metadata.dropna(thresh=building_metadata.shape[0]*threshold, axis=1)
weather_train = weather_train.dropna(thresh=weather_train.shape[0]*threshold, axis=1)
# Save data
train.to_parquet('train_clean.parquet', index=False)
building_metadata.to_parquet('building_metadata_clean.parquet', index=False)
weather_train.to_parquet('weather_train_clean.parquet', index=False)
print("")
print("Size of data sets after filtering:")
print("==================================")
print(f"Number of rows in train: {train.shape[0]}")
print(f"Number of rows in building_metadata: {building_metadata.shape[0]}")
print(f"Number of rows in weather: {weather_train.shape[0]}")
print(f"Number of buildings: {train['building_id'].nunique()}")
# Group data from hourly to daily
train_daily = (
train
.groupby(['building_id', pd.Grouper(key='timestamp', freq='D')])['meter_reading']
.sum()
.reset_index()
.sort_values(['building_id', 'timestamp'])
)
weather_train_daily = (
weather_train
.groupby(['site_id', pd.Grouper(key='timestamp', freq='D')])
.mean()
.reset_index()
.sort_values(['site_id', 'timestamp'])
)
# Merge data in a unique dataframe
data = pd.merge(train_daily, building_metadata, on='building_id', how='left')
data = pd.merge(data, weather_train, on=['site_id', 'timestamp'], how='left')
data = data.sort_values(['timestamp', 'building_id'])
data = data.set_index('timestamp')
data['building_id'] = 'id_' + data['building_id'].astype(str)
# Fill NaN values in temperature and wind speed with forward fill
data['air_temperature'] = data['air_temperature'].ffill()
data['wind_speed'] = data['wind_speed'].ffill()
print("")
print("Size of data sets after grouping by day:")
print("========================================")
print(f"Number of rows in train: {train_daily.shape[0]}")
print(f"Number of rows in weather: {weather_train_daily.shape[0]}")
print(f"Number of buildings: {train_daily['building_id'].nunique()}")
print("")
print("Size of final table:")
print("========================================")
print(f"Number of rows in final table: {data.shape[0]}")
# Save data
train_daily.to_parquet('train_clean_daily.parquet', index=False)
weather_train_daily.to_parquet('weather_train_clean_daily.parquet', index=False)
data.to_parquet('data_clean_daily.parquet', index=True)
import session_info
session_info.show(html=False)
Hyndman, R.J., & Athanasopoulos, G. (2021) Forecasting: principles and practice, 3rd edition, OTexts: Melbourne, Australia.
Time Series Analysis and Forecasting with ADAM Ivan Svetunkov
Joseph, M. (2022). Modern time series forecasting with Python: Explore industry-ready time series forecasting using modern machine learning and Deep Learning. Packt Publishing.
Holder, C., Middlehurst, M. & Bagnall, A. A review and evaluation of elastic distance functions for time series clustering. Knowl Inf Syst (2023)
Forecasting: theory and practice
How to cite this document
If you use this document or any part of it, please acknowledge the source, thank you!
Global Forecasting Models: Comparative Analysis of Single and Multi-Series Forecasting Modeling by Joaquín Amat Rodrigo and Javier Escobar Ortiz, available under a CC BY-NC-SA 4.0 at https://www.cienciadedatos.net/documentos/py53-global-forecasting-models.html
How to cite skforecast
If you use skforecast for publication, we would appreciate it if you cite the published software.
Zenodo:
Amat Rodrigo, Joaquin, & Escobar Ortiz, Javier. (2024). skforecast (v0.14.0). Zenodo. https://doi.org/10.5281/zenodo.11222585
APA:
Amat Rodrigo, J., & Escobar Ortiz, J. (2024). skforecast (v0.14.0). Zenodo. https://doi.org/10.5281/zenodo.11222585
BibTeX:
@software{skforecast, author = {Amat Rodrigo, Joaquin and Escobar Ortiz, Javier}, title = {skforecast}, version = {0.14.0}, month = {08}, year = {2024}, license = {BSD-3-Clause}, url = {https://skforecast.org/}, doi = {10.5281/zenodo.11222585} }
This work by Joaquín Amat Rodrigo and Javier Escobar Ortiz is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.