• Introduction
  • Data
  • Graphic exploration
    • Time series plot
    • Seasonality plots
    • Autocorrelation plots
  • Baseline
    • ForecasterEquivalentDate
  • Recursive multi-step forecasting
    • ForecasterRecursive
  • Backtesting
  • Exogenous variables
  • Hyperparameter tuning
    • ForecasterRecursive
  • Feature selection
  • Probabilistic forecasting: prediction intervals
  • Model explainability
    • Model-specific feature importance
    • SHAP values
  • Direct multi-step forecasting
  • Daily forecasts generated in advance
  • Conclusion
  • Session information
  • Citation


More information about forecasting at cienciadedatos.net


Introduction

Energy demand forecasting plays an essential role in managing and planning the resources needed to generate, distribute, and consume electricity. Predicting energy demand is a complex task influenced by many factors, such as weather patterns, economic conditions, and social behavior. This article shows how to build forecasting models with machine learning to predict energy demand.

Time series and forecasting

A time series is a sequence of data ordered chronologically, spaced at equal or irregular intervals. Forecasting consists of predicting the future values of a time series, either by modeling the series solely on its own past behavior (autoregressive) or by also incorporating other external variables.

When working with time series, one rarely needs to predict only the next element in the series (t+1). Instead, the most common goal is to forecast a whole future interval ((t+1), ..., (t+n)) or a point far ahead in time (t+n). Several strategies exist to generate this type of forecast; skforecast implements the following for univariate time series:

  • Recursive multi-step forecasting: since predicting t+n requires the value of t+n-1, which is itself unknown, a recursive process is applied in which each new prediction is based on the previous one. This process is known as recursive forecasting or recursive multi-step forecasting and is easily implemented with the ForecasterRecursive class (see the sketch after this list).


  • Direct multi-step forecasting: this method trains a separate model for each step of the forecast horizon. For example, to predict the next 5 values of a time series, 5 different models are trained, one per step, so the predictions of each step are independent of the others. The whole process is automated by the ForecasterDirect class.


  • Multiple output forecasting: some machine learning models, such as long short-term memory (LSTM) neural networks, can predict several values of a sequence simultaneously (one-shot). This strategy is implemented in the ForecasterRnn class.
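As a concrete illustration of the recursive strategy, here is a minimal, self-contained sketch (plain NumPy, with a deliberately naive one-step "model"; not skforecast's actual implementation) showing how each prediction is fed back as an input for the next step:

# Minimal sketch of recursive multi-step forecasting (illustrative only)
# ==============================================================================
import numpy as np

def recursive_forecast(y, predict_one_step, steps, n_lags):
    # `predict_one_step` is any function mapping the last `n_lags` values
    # to the next value (hypothetical placeholder model).
    history = list(y[-n_lags:])
    predictions = []
    for _ in range(steps):
        next_value = predict_one_step(np.array(history[-n_lags:]))
        predictions.append(next_value)  # store the prediction...
        history.append(next_value)      # ...and feed it back as a new "lag"
    return np.array(predictions)

# Naive one-step model: the mean of the last 3 observed/predicted values
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(recursive_forecast(y, predict_one_step=np.mean, steps=4, n_lags=3))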

The libraries used in this document are:

# Data processing
# ==============================================================================
import numpy as np
import pandas as pd
from astral.sun import sun
from astral import LocationInfo
from skforecast.datasets import fetch_dataset
from feature_engine.datetime import DatetimeFeatures
from feature_engine.creation import CyclicalFeatures
from feature_engine.timeseries.forecasting import WindowFeatures

# Plots
# ==============================================================================
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from skforecast.plot import plot_residuals
import plotly.graph_objects as go
import plotly.io as pio
import plotly.offline as poff
pio.templates.default = "seaborn"
poff.init_notebook_mode(connected=True)
plt.style.use('seaborn-v0_8-darkgrid')
plt.rcParams.update({'font.size': 8})

# Modelling and forecasting
# ==============================================================================
import skforecast
import lightgbm
import sklearn
from lightgbm import LGBMRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import RFECV
from skforecast.recursive import ForecasterEquivalentDate, ForecasterRecursive
from skforecast.direct import ForecasterDirect
from skforecast.model_selection import TimeSeriesFold, bayesian_search_forecaster, backtesting_forecaster
from skforecast.feature_selection import select_features
from skforecast.preprocessing import RollingFeatures
from skforecast.plot import calculate_lag_autocorrelation, plot_residuals
from skforecast.metrics import calculate_coverage
import shap


# Warnings configuration
# ==============================================================================
import warnings
warnings.filterwarnings('once')

color = '\033[1m\033[38;5;208m' 
print(f"{color}Version skforecast: {skforecast.__version__}")
print(f"{color}Version scikit-learn: {sklearn.__version__}")
print(f"{color}Version lightgbm: {lightgbm.__version__}")
print(f"{color}Version pandas: {pd.__version__}")
print(f"{color}Version numpy: {np.__version__}")
Version skforecast: 0.17.0
Version scikit-learn: 1.7.1
Version lightgbm: 4.6.0
Version pandas: 2.3.1
Version numpy: 2.2.6

Data

The data used in this example is a time series of electricity demand (MW) for the state of Victoria, Australia, from 2011-12-31 to 2014-12-31. The data were obtained from the R tsibbledata package. The dataset contains 5 columns and 52,608 complete records. The columns are:

  • Time: date and time of the record.
  • Date: date of the record.
  • Demand: electricity demand (MW).
  • Temperature: temperature in Melbourne, the capital of Victoria.
  • Holiday: indicator of whether the day is a public holiday.
# Data download
# ==============================================================================
data = fetch_dataset(name='vic_electricity', raw=True)
data.info()
vic_electricity
---------------
Half-hourly electricity demand for Victoria, Australia
O'Hara-Wild M, Hyndman R, Wang E, Godahewa R (2022). tsibbledata: Diverse
Datasets for 'tsibble'. https://tsibbledata.tidyverts.org/,
https://github.com/tidyverts/tsibbledata/.
https://tsibbledata.tidyverts.org/reference/vic_elec.html
Shape of the dataset: (52608, 5)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52608 entries, 0 to 52607
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Time         52608 non-null  object 
 1   Demand       52608 non-null  float64
 2   Temperature  52608 non-null  float64
 3   Date         52608 non-null  object 
 4   Holiday      52608 non-null  bool   
dtypes: bool(1), float64(2), object(2)
memory usage: 1.7+ MB

The Time column is stored as a string. It is converted to datetime with the pd.to_datetime() function and then set as the index to take advantage of pandas functionality. Since the data were recorded every 30 minutes, the frequency '30min' is also specified.

# Data preprocessing
# ==============================================================================
data = data.copy()
data['Time'] = pd.to_datetime(data['Time'], format='%Y-%m-%dT%H:%M:%SZ')
data = data.set_index('Time')
data = data.asfreq('30min')
data = data.sort_index()
data.head(2)
Demand Temperature Date Holiday
Time
2011-12-31 13:00:00 4382.825174 21.40 2012-01-01 True
2011-12-31 13:30:00 4263.365526 21.05 2012-01-01 True

When working with time series, the first step is to verify that the series is complete, that is, that it has no missing values.

# Verify that the time index is complete
# ==============================================================================
start_date = data.index.min()
end_date = data.index.max()
complete_date_range = pd.date_range(start=start_date, end=end_date, freq=data.index.freq)
is_index_complete = (data.index == complete_date_range).all()
print(f"Index complete: {is_index_complete}")
print(f"Number of rows with missing values: {data.isnull().any(axis=1).mean()}")
Index complete: True
Fraction of rows with missing values: 0.0
# Fill gaps in the time index, if any
# ==============================================================================
# data.asfreq(freq='30min', fill_value=np.nan)
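As a quick, hypothetical illustration of how asfreq() exposes such gaps, the toy series below is missing one 30-minute slot, which asfreq() fills with NaN:

# Toy example (hypothetical data): the 01:00 slot is missing and becomes NaN
# ==============================================================================
import pandas as pd
toy = pd.Series(
    [1.0, 2.0, 4.0],
    index=pd.to_datetime(['2020-01-01 00:00', '2020-01-01 00:30', '2020-01-01 01:30'])
)
print(toy.asfreq('30min'))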

Although the data are at 30-minute intervals, the goal of this example is to build models that forecast hourly electricity demand, so the data must be aggregated. This type of transformation is straightforward with a pandas DatetimeIndex and the resample() method.

It is very important to use the arguments closed='left' and label='right' correctly so that no future information leaks into training (data leakage). Suppose the available timestamps are 10:10, 10:30, 10:45, 11:00, 11:12 and 11:30. To obtain the hourly average, the value labeled 11:00 must be computed only from the values at 10:10, 10:30 and 10:45, and the value labeled 12:00 only from those at 11:00, 11:12 and 11:30.

Diagram of temporal aggregation that does not include future information.

The 11:00 average does not include the value observed at 11:00 itself, because in practice that value is not yet available at that moment.
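The toy snippet below reproduces that exact example (the six timestamps from the paragraph above) as a sanity check that closed='left' and label='right' assign each hourly average only past values:

# Toy check of the aggregation rule described above
# ==============================================================================
import pandas as pd
ts = pd.Series(
    [1, 2, 3, 4, 5, 6],
    index=pd.to_datetime([
        '2020-01-01 10:10', '2020-01-01 10:30', '2020-01-01 10:45',
        '2020-01-01 11:00', '2020-01-01 11:12', '2020-01-01 11:30'
    ])
)
# 11:00 averages 10:10-10:45 (mean 2.0); 12:00 averages 11:00-11:30 (mean 5.0)
print(ts.resample('h', closed='left', label='right').mean())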

# Aggregation to 1-hour intervals
# ==============================================================================
# The Date column is dropped so that it does not cause errors during aggregation.
data = data.drop(columns="Date")
data = (
    data
    .resample(rule="h", closed="left", label="right")
    .agg({
        "Demand": "mean",
        "Temperature": "mean",
        "Holiday": "mean",
    })
)
data
Demand Temperature Holiday
Time
2011-12-31 14:00:00 4323.095350 21.225 1.0
2011-12-31 15:00:00 3963.264688 20.625 1.0
2011-12-31 16:00:00 3950.913495 20.325 1.0
2011-12-31 17:00:00 3627.860675 19.850 1.0
2011-12-31 18:00:00 3396.251676 19.025 1.0
... ... ... ...
2014-12-31 09:00:00 4069.625550 21.600 0.0
2014-12-31 10:00:00 3909.230704 20.300 0.0
2014-12-31 11:00:00 3900.600901 19.650 0.0
2014-12-31 12:00:00 3758.236494 18.100 0.0
2014-12-31 13:00:00 3785.650720 17.200 0.0

26304 rows × 3 columns

The dataset starts on 2011-12-31 14:00:00 and ends on 2014-12-31 13:00:00. To work with complete days, the first 10 and the last 14 records are discarded so that the series starts on 2012-01-01 00:00:00 and ends on 2014-12-30 23:00:00. In addition, to optimize the model's hyperparameters and assess its predictive capacity, the data are split into three sets: training, validation, and test.

# Split data into train-val-test
# ==============================================================================
data = data.loc['2012-01-01 00:00:00':'2014-12-30 23:00:00', :].copy()
end_train = '2013-12-31 23:59:00'
end_validation = '2014-11-30 23:59:00'
data_train = data.loc[: end_train, :].copy()
data_val   = data.loc[end_train:end_validation, :].copy()
data_test  = data.loc[end_validation:, :].copy()

print(f"Train dates      : {data_train.index.min()} --- {data_train.index.max()}  (n={len(data_train)})")
print(f"Validation dates : {data_val.index.min()} --- {data_val.index.max()}  (n={len(data_val)})")
print(f"Test dates       : {data_test.index.min()} --- {data_test.index.max()}  (n={len(data_test)})")
Train dates      : 2012-01-01 00:00:00 --- 2013-12-31 23:00:00  (n=17544)
Validation dates : 2014-01-01 00:00:00 --- 2014-11-30 23:00:00  (n=8016)
Test dates       : 2014-12-01 00:00:00 --- 2014-12-30 23:00:00  (n=720)

Graphic exploration

Graphical exploration of a time series is an effective way to identify trends, patterns, and seasonal variations, and thus to guide the choice of the most appropriate forecasting model.

Time series plot

Full time series

# Interactive time series plot
# ==============================================================================
fig = go.Figure()
fig.add_trace(go.Scatter(x=data_train.index, y=data_train['Demand'], mode='lines', name='Train'))
fig.add_trace(go.Scatter(x=data_val.index, y=data_val['Demand'], mode='lines', name='Validation'))
fig.add_trace(go.Scatter(x=data_test.index, y=data_test['Demand'], mode='lines', name='Test'))
fig.update_layout(
    title='Hourly energy demand',
    xaxis_title="Time",
    yaxis_title="Demand",
    legend_title="Partition:",
    width=800,
    height=400,
    margin=dict(l=20, r=20, t=35, b=20),
    legend=dict(orientation="h", yanchor="top", y=1, xanchor="left", x=0.001)
)
#fig.update_xaxes(rangeslider_visible=True)
fig.show()
[Interactive plot: hourly energy demand, colored by train / validation / test partition]

The plot above shows that electricity demand has annual seasonality. There is a peak in July and pronounced demand peaks between January and March.

Section of the time series

Because the variance of the series is large, possible intraday patterns are hard to see in a single plot of the whole series.

# Zoom of the time series
# ==============================================================================
zoom = ('2013-05-01 14:00:00', '2013-06-01 14:00:00')
fig = plt.figure(figsize=(8, 4))
grid = plt.GridSpec(nrows=8, ncols=1, hspace=0.6, wspace=0)
main_ax = fig.add_subplot(grid[:3, :])
data.Demand.plot(ax=main_ax, c='black', alpha=0.5, linewidth=0.5)
min_y = min(data.Demand)
max_y = max(data.Demand)
main_ax.fill_between(zoom, min_y, max_y, facecolor='blue', alpha=0.5, zorder=0)
main_ax.set_title(f'Electricity demand: {data.index.min()}, {data.index.max()}', fontsize=10)
main_ax.set_xlabel('')
zoom_ax = fig.add_subplot(grid[5:, :])
data.loc[zoom[0]: zoom[1]].Demand.plot(ax=zoom_ax, color='blue', linewidth=1)
zoom_ax.set_title(f'Electricity demand: {zoom}', fontsize=10)
zoom_ax.set_xlabel('')
plt.subplots_adjust(hspace=1)
plt.show();

Zooming in on the series reveals a clear weekly seasonality: consumption is higher during the working week (Monday to Friday) and lower on weekends. Each day's consumption is also clearly correlated with that of the previous day.

Seasonality plots

Seasonality plots are a useful tool for identifying seasonal patterns and trends in a time series. They are built by averaging the values of each season over time and then plotting these averages against time.

# Annual, weekly and daily seasonality
# ==============================================================================
fig, axs = plt.subplots(2, 2, figsize=(8, 5), sharex=False, sharey=True)
axs = axs.ravel()

# Demand distribution by month
data['month'] = data.index.month
data.boxplot(column='Demand', by='month', ax=axs[0], flierprops={'markersize': 3, 'alpha': 0.3})
data.groupby('month')['Demand'].median().plot(style='o-', linewidth=0.8, ax=axs[0])
axs[0].set_ylabel('Demand')
axs[0].set_title('Demand distribution by month', fontsize=9)

# Demand distribution by day of the week
data['week_day'] = data.index.day_of_week + 1
data.boxplot(column='Demand', by='week_day', ax=axs[1], flierprops={'markersize': 3, 'alpha': 0.3})
data.groupby('week_day')['Demand'].median().plot(style='o-', linewidth=0.8, ax=axs[1])
axs[1].set_ylabel('Demand')
axs[1].set_title('Demand distribution by week day', fontsize=9)

# Demand distribution by hour of the day
data['hour_day'] = data.index.hour + 1
data.boxplot(column='Demand', by='hour_day', ax=axs[2], flierprops={'markersize': 3, 'alpha': 0.3})
data.groupby('hour_day')['Demand'].median().plot(style='o-', linewidth=0.8, ax=axs[2])
axs[2].set_ylabel('Demand')
axs[2].set_title('Demand distribution by the hour of the day', fontsize=9)

# Demand distribution by day of the week and hour of the day
mean_day_hour = data.groupby(["week_day", "hour_day"])["Demand"].mean()
mean_day_hour.plot(ax=axs[3])
axs[3].set(
    title       = "Mean Demand during week",
    xticks      = [i * 24 for i in range(7)],
    xticklabels = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    xlabel      = "Day and hour",
    ylabel      = "Demand"
    )
axs[3].title.set_size(10)

fig.suptitle("Seasonality plots", fontsize=12)
fig.tight_layout()

There is an annual seasonality, with higher (median) demand in June, July and August and high demand peaks in November, December, January, February and March. There is a weekly seasonality, with lower demand on weekends. There is also a daily seasonality, with demand decreasing between 16:00 and 21:00.

Autocorrelation plots

Autocorrelation plots are a useful tool for determining the order of an autoregressive model. The autocorrelation function (ACF) measures the correlation between a time series and its lagged values. The partial autocorrelation function (PACF) measures the correlation between a time series and its lagged values while controlling for the effect of all shorter lags. These plots help identify which lags should be included in an autoregressive model.

# Autocorrelation plot
# ==============================================================================
fig, ax = plt.subplots(figsize=(5, 2))
plot_acf(data['Demand'], ax=ax, lags=60)
plt.show()
# Partial autocorrelation plot
# ==============================================================================
fig, ax = plt.subplots(figsize=(5, 2))
plot_pacf(data['Demand'], ax=ax, lags=60)
plt.show()
# Top 10 lags with the highest absolute partial autocorrelation
# ==============================================================================
calculate_lag_autocorrelation(
    data    = data['Demand'],
    n_lags  = 60,
    sort_by = "partial_autocorrelation_abs"
).head(10)
lag partial_autocorrelation_abs partial_autocorrelation autocorrelation_abs autocorrelation
0 1 0.949526 0.949526 0.949490 0.949490
1 25 0.761400 -0.761400 0.731620 0.731620
2 2 0.657961 -0.657961 0.836792 0.836792
3 26 0.634868 0.634868 0.622458 0.622458
4 24 0.308085 -0.308085 0.785622 0.785622
5 19 0.291113 0.291113 0.302308 0.302308
6 27 0.281261 -0.281261 0.488351 0.488351
7 21 0.269274 0.269274 0.537173 0.537173
8 20 0.201192 0.201192 0.414709 0.414709
9 9 0.184975 0.184975 0.037677 0.037677

The autocorrelation plots show a strong correlation between demand at a given hour and demand at the preceding hours, as well as at the same hour on previous days. This type of correlation indicates that autoregressive models can work well in this scenario.

Baseline

When facing a forecasting problem, it is important to establish a baseline model. This is usually a very simple model used to assess whether more complex models are worth implementing.

Skforecast makes it easy to create a baseline with the ForecasterEquivalentDate class. This model, also known as seasonal naive forecasting, simply returns the value observed in the same period of the previous season (e.g., the same working day of the previous week, the same hour of the previous day, etc.).

Based on the exploratory analysis above, the baseline model will predict each hour using the value of the same hour on the previous day.
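For intuition only, this baseline is conceptually close to shifting the series by 24 hours (a rough manual equivalent, not the class's exact implementation, which also handles offsets and aggregation):

# Rough manual equivalent of the baseline: prediction = value 24 hours earlier
# ==============================================================================
baseline_naive = data['Demand'].shift(24)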

✏️ Note

In the following code cells, a baseline forecaster is trained and its predictive capacity is evaluated with a backtesting process. If this concept is new to you, do not worry, it is explained in detail later in this document. For now, it suffices to know that backtesting trains the model on a portion of the data and evaluates its predictive capacity on data it has not seen. The resulting error metric will serve as the reference for the more complex models developed later.

# Create baseline: value of the same hour of the previous day
# ==============================================================================
forecaster = ForecasterEquivalentDate(
                 offset    = pd.DateOffset(days=1),
                 n_offsets = 1
)

# Train forecaster
# ==============================================================================
forecaster.fit(y=data.loc[:end_validation, 'Demand'])
forecaster

ForecasterEquivalentDate

General Information
  • Regressor: NoneType
  • Offset: <DateOffset: days=1>
  • Number of offsets: 1
  • Aggregation function: mean
  • Window size: 24
  • Creation date: 2025-08-20 00:17:52
  • Last fit date: 2025-08-20 00:17:52
  • Skforecast version: 0.17.0
  • Python version: 3.12.9
  • Forecaster id: None
Training Information
  • Training range: [Timestamp('2012-01-01 00:00:00'), Timestamp('2014-11-30 23:00:00')]
  • Training index type: DatetimeIndex
  • Training index frequency: h

🛈 API Reference    🗎 User Guide

# Backtesting
# ==============================================================================
cv = TimeSeriesFold(
        steps              = 24,
        initial_train_size = len(data.loc[:end_validation]),
        refit              = False
)
metric, predictions = backtesting_forecaster(
                          forecaster = forecaster,
                          y          = data['Demand'],
                          cv         = cv,
                          metric     = 'mean_absolute_error'
)
metric
  0%|          | 0/30 [00:00<?, ?it/s]
mean_absolute_error
0 308.375272

The baseline model's error serves as a reference to assess whether more complex models are worth implementing.

Recursive multi-step forecasting

A recursive autoregressive model (ForecasterRecursive) is trained with a gradient boosting regressor (LGBMRegressor) to forecast energy demand for the next 24 hours.

The features are the demand values of the previous 24 hours (24 lags) and the moving average of the previous 3 days. The regressor's hyperparameters are left at their default values.

# Create forecaster
# ==============================================================================
window_features = RollingFeatures(stats=["mean"], window_sizes=24 * 3)
forecaster = ForecasterRecursive(
                 regressor       = LGBMRegressor(random_state=15926, verbose=-1),
                 lags            = 24,
                 window_features = window_features
)

# Train forecaster
# ==============================================================================
forecaster.fit(y=data.loc[:end_validation, 'Demand'])
forecaster

ForecasterRecursive

General Information
  • Regressor: LGBMRegressor
  • Lags: [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]
  • Window features: ['roll_mean_72']
  • Window size: 72
  • Series name: Demand
  • Exogenous included: False
  • Weight function included: False
  • Differentiation order: None
  • Creation date: 2025-08-20 00:17:52
  • Last fit date: 2025-08-20 00:17:52
  • Skforecast version: 0.17.0
  • Python version: 3.12.9
  • Forecaster id: None
Exogenous Variables
    None
Data Transformations
  • Transformer for y: None
  • Transformer for exog: None
Training Information
  • Training range: [Timestamp('2012-01-01 00:00:00'), Timestamp('2014-11-30 23:00:00')]
  • Training index type: DatetimeIndex
  • Training index frequency: h
Regressor Parameters
    {'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 1.0, 'importance_type': 'split', 'learning_rate': 0.1, 'max_depth': -1, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 100, 'n_jobs': None, 'num_leaves': 31, 'objective': None, 'random_state': 15926, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'subsample': 1.0, 'subsample_for_bin': 200000, 'subsample_freq': 0, 'verbose': -1}
Fit Kwargs
    {}

🛈 API Reference    🗎 User Guide

Backtesting

To obtain a robust estimate of the model's predictive capacity, a backtesting process is run. Backtesting consists of generating a prediction for each observation in the test set, following the same procedure that would be used in production, and then comparing the predicted values with the actual values.

Backtesting is done with the backtesting_forecaster() function. In this example, the model is trained with data from 2012-01-01 00:00 to 2014-11-30 23:00, and predictions for the following 24 hours are generated at 23:59 each day. The error metric is the mean absolute error (MAE).

It is highly recommended to review the documentation of the backtesting_forecaster() function to understand its capabilities and make full use of its potential for analyzing a model's predictive ability.

# Backtesting
# ==============================================================================
metric, predictions = backtesting_forecaster(
                          forecaster    = forecaster,
                          y             = data['Demand'],
                          cv            = cv,
                          metric        = 'mean_absolute_error',
                          verbose       = True, # Set to False to avoid printing
)
Information of folds
--------------------
Number of observations used for initial training: 25560
Number of observations used for backtesting: 720
    Number of folds: 30
    Number skipped folds: 0 
    Number of steps per fold: 24
    Number of steps to exclude between last observed data (last window) and predictions (gap): 0

Fold: 0
    Training:   2012-01-01 00:00:00 -- 2014-11-30 23:00:00  (n=25560)
    Validation: 2014-12-01 00:00:00 -- 2014-12-01 23:00:00  (n=24)
Fold: 1
    Training:   No training in this fold
    Validation: 2014-12-02 00:00:00 -- 2014-12-02 23:00:00  (n=24)
Fold: 2
    Training:   No training in this fold
    Validation: 2014-12-03 00:00:00 -- 2014-12-03 23:00:00  (n=24)
Fold: 3
    Training:   No training in this fold
    Validation: 2014-12-04 00:00:00 -- 2014-12-04 23:00:00  (n=24)
Fold: 4
    Training:   No training in this fold
    Validation: 2014-12-05 00:00:00 -- 2014-12-05 23:00:00  (n=24)
Fold: 5
    Training:   No training in this fold
    Validation: 2014-12-06 00:00:00 -- 2014-12-06 23:00:00  (n=24)
Fold: 6
    Training:   No training in this fold
    Validation: 2014-12-07 00:00:00 -- 2014-12-07 23:00:00  (n=24)
Fold: 7
    Training:   No training in this fold
    Validation: 2014-12-08 00:00:00 -- 2014-12-08 23:00:00  (n=24)
Fold: 8
    Training:   No training in this fold
    Validation: 2014-12-09 00:00:00 -- 2014-12-09 23:00:00  (n=24)
Fold: 9
    Training:   No training in this fold
    Validation: 2014-12-10 00:00:00 -- 2014-12-10 23:00:00  (n=24)
Fold: 10
    Training:   No training in this fold
    Validation: 2014-12-11 00:00:00 -- 2014-12-11 23:00:00  (n=24)
Fold: 11
    Training:   No training in this fold
    Validation: 2014-12-12 00:00:00 -- 2014-12-12 23:00:00  (n=24)
Fold: 12
    Training:   No training in this fold
    Validation: 2014-12-13 00:00:00 -- 2014-12-13 23:00:00  (n=24)
Fold: 13
    Training:   No training in this fold
    Validation: 2014-12-14 00:00:00 -- 2014-12-14 23:00:00  (n=24)
Fold: 14
    Training:   No training in this fold
    Validation: 2014-12-15 00:00:00 -- 2014-12-15 23:00:00  (n=24)
Fold: 15
    Training:   No training in this fold
    Validation: 2014-12-16 00:00:00 -- 2014-12-16 23:00:00  (n=24)
Fold: 16
    Training:   No training in this fold
    Validation: 2014-12-17 00:00:00 -- 2014-12-17 23:00:00  (n=24)
Fold: 17
    Training:   No training in this fold
    Validation: 2014-12-18 00:00:00 -- 2014-12-18 23:00:00  (n=24)
Fold: 18
    Training:   No training in this fold
    Validation: 2014-12-19 00:00:00 -- 2014-12-19 23:00:00  (n=24)
Fold: 19
    Training:   No training in this fold
    Validation: 2014-12-20 00:00:00 -- 2014-12-20 23:00:00  (n=24)
Fold: 20
    Training:   No training in this fold
    Validation: 2014-12-21 00:00:00 -- 2014-12-21 23:00:00  (n=24)
Fold: 21
    Training:   No training in this fold
    Validation: 2014-12-22 00:00:00 -- 2014-12-22 23:00:00  (n=24)
Fold: 22
    Training:   No training in this fold
    Validation: 2014-12-23 00:00:00 -- 2014-12-23 23:00:00  (n=24)
Fold: 23
    Training:   No training in this fold
    Validation: 2014-12-24 00:00:00 -- 2014-12-24 23:00:00  (n=24)
Fold: 24
    Training:   No training in this fold
    Validation: 2014-12-25 00:00:00 -- 2014-12-25 23:00:00  (n=24)
Fold: 25
    Training:   No training in this fold
    Validation: 2014-12-26 00:00:00 -- 2014-12-26 23:00:00  (n=24)
Fold: 26
    Training:   No training in this fold
    Validation: 2014-12-27 00:00:00 -- 2014-12-27 23:00:00  (n=24)
Fold: 27
    Training:   No training in this fold
    Validation: 2014-12-28 00:00:00 -- 2014-12-28 23:00:00  (n=24)
Fold: 28
    Training:   No training in this fold
    Validation: 2014-12-29 00:00:00 -- 2014-12-29 23:00:00  (n=24)
Fold: 29
    Training:   No training in this fold
    Validation: 2014-12-30 00:00:00 -- 2014-12-30 23:00:00  (n=24)

  0%|          | 0/30 [00:00<?, ?it/s]
# Plot predictions vs real values
# ======================================================================================
fig = go.Figure()
trace1 = go.Scatter(x=data_test.index, y=data_test['Demand'], name="test", mode="lines")
trace2 = go.Scatter(x=predictions.index, y=predictions['pred'], name="prediction", mode="lines")
fig.add_trace(trace1)
fig.add_trace(trace2)
fig.update_layout(
    title="Real value vs predicted in test data",
    xaxis_title="Date time",
    yaxis_title="Demand",
    width=800,
    height=400,
    margin=dict(l=20, r=20, t=35, b=20),
    legend=dict(orientation="h", yanchor="top", y=1.01, xanchor="left", x=0)
)
fig.show()
[Interactive plot: real vs predicted demand in the test data]
# Backtesting error
# ==============================================================================
metric
mean_absolute_error
0 225.521306

The autoregressive model achieves a lower MAE than the baseline model, meaning that it predicts next-day electricity demand more accurately.

Exogenous variables

So far, only lagged values of the series have been used as features. However, other variables can also be included as features. These are known as exogenous variables (features), and their use can improve the model's predictive capacity. A key point to keep in mind is that the values of the exogenous variables must be known at prediction time.

Common examples of exogenous variables are calendar information (such as day of the week, month, year, or holidays), weather variables (such as temperature, humidity, wind speed), and economic variables (such as inflation or interest rates).

⚠️ Warning

Exogenous variables must be known at prediction time. For example, if temperature is used as an exogenous variable, the temperature of the next hour must be known in order to make the prediction. If it is not known, the prediction cannot be computed.

Weather variables should be used with caution. In production, future weather conditions are not known but are themselves forecast by meteorological services. These forecasts carry their own errors, which propagate into the demand model and degrade its predictive capacity. To anticipate the model's real-world performance, it is advisable to train it with the weather forecasts that would have been available at the time rather than with the observed values.

Next, exogenous variables are created from calendar information, sunrise and sunset times, temperature, and holidays. They are added to the training, validation, and test sets and used as features of the autoregressive model.

# Calendar features
# ==============================================================================
features_to_extract = [
    'month',
    'week',
    'day_of_week',
    'hour'
    ]
calendar_transformer = DatetimeFeatures(
    variables           = 'index',
    features_to_extract = features_to_extract,
    drop_original       = True,
    )
calendar_features = calendar_transformer.fit_transform(data)[features_to_extract]

# Sunlight features
# ==============================================================================
location = LocationInfo(
    latitude  = -37.8,
    longitude = 144.95,
    timezone  = 'Australia/Melbourne'
    )
sunrise_hour = [
    sun(location.observer, date=date, tzinfo=location.timezone)['sunrise']
    for date in data.index
]
sunset_hour = [
    sun(location.observer, date=date, tzinfo=location.timezone)['sunset']
    for date in data.index
]
sunrise_hour = pd.Series(sunrise_hour, index=data.index).dt.round("h").dt.hour
sunset_hour = pd.Series(sunset_hour, index=data.index).dt.round("h").dt.hour
sun_light_features = pd.DataFrame({
                        'sunrise_hour': sunrise_hour,
                        'sunset_hour': sunset_hour
                     })
sun_light_features['daylight_hours'] = (
    sun_light_features['sunset_hour'] - sun_light_features['sunrise_hour']
    )
sun_light_features["is_daylight"] = np.where(
    (data.index.hour >= sun_light_features["sunrise_hour"])
    & (data.index.hour < sun_light_features["sunset_hour"]),
    1,
    0,
    )

# Holiday features
# ==============================================================================
holiday_features = data[['Holiday']].astype(int)
holiday_features['holiday_previous_day'] = holiday_features['Holiday'].shift(24)
holiday_features['holiday_next_day'] = holiday_features['Holiday'].shift(-24)

# Rolling windows of temperature
# ==============================================================================
wf_transformer = WindowFeatures(
    variables   = ["Temperature"],
    window      = ["1D", "7D"],
    functions   = ["mean", "max", "min"],
    freq        = "h",
)
temp_features = wf_transformer.fit_transform(data[['Temperature']])


# Merge all exogenous variables
# ==============================================================================
assert all(calendar_features.index == sun_light_features.index)
assert all(calendar_features.index == temp_features.index)
assert all(calendar_features.index == holiday_features.index)
exogenous_features = pd.concat([
                         calendar_features,
                         sun_light_features,
                         temp_features,
                         holiday_features
                     ], axis=1)

# The rolling means introduce missing values at the beginning of the series, and holiday_next_day introduces missing values at the end.
exogenous_features = exogenous_features.iloc[7 * 24:, :]
exogenous_features = exogenous_features.iloc[:-24, :]
exogenous_features.head(3)
month week day_of_week hour sunrise_hour sunset_hour daylight_hours is_daylight Temperature Temperature_window_1D_mean Temperature_window_1D_max Temperature_window_1D_min Temperature_window_7D_mean Temperature_window_7D_max Temperature_window_7D_min Holiday holiday_previous_day holiday_next_day
Time
2012-01-08 00:00:00 1 1 6 0 6 21 15 0 20.575 24.296875 29.0 19.875 23.254018 39.525 14.35 0 0.0 0.0
2012-01-08 01:00:00 1 1 6 1 6 21 15 0 22.500 24.098958 29.0 19.875 23.215774 39.525 14.35 0 0.0 0.0
2012-01-08 02:00:00 1 1 6 2 6 21 15 0 25.250 23.923958 29.0 19.875 23.173214 39.525 14.35 0 0.0 0.0

Some calendar features, such as hour and day, are cyclical: the hour of the day runs from 0 to 23, so 23:00 is really only one hour away from 00:00. The same applies to the month of the year, since December is only one month away from January. Transforming these features with trigonometric functions such as sine and cosine expresses their cyclical patterns without artificial discontinuities in the representation. This approach, known as cyclic encoding, can noticeably improve a model's predictive capacity.
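Under the hood, this encoding maps each value onto the unit circle; the following minimal sketch is equivalent in spirit to what the CyclicalFeatures transformer computes for the hour of the day:

# Manual cyclic encoding of the hour (0-23) onto the unit circle
# ==============================================================================
import numpy as np
hour = np.arange(24)
hour_sin = np.sin(2 * np.pi * hour / 24)
hour_cos = np.cos(2 * np.pi * hour / 24)
# In this representation 23:00 and 00:00 end up close together, as desired
print(np.hypot(hour_sin[23] - hour_sin[0], hour_cos[23] - hour_cos[0]))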

# Cyclical encoding of calendar and sunlight features
# ==============================================================================
features_to_encode = [
    "month",
    "week",
    "day_of_week",
    "hour",
    "sunrise_hour",
    "sunset_hour",
]
max_values = {
    "month": 12,
    "week": 52,
    "day_of_week": 6,
    "hour": 24,
    "sunrise_hour": 24,
    "sunset_hour": 24,
}
cyclical_encoder = CyclicalFeatures(
    variables     = features_to_encode,
    max_values    = max_values,
    drop_original = False
)

exogenous_features = cyclical_encoder.fit_transform(exogenous_features)
exogenous_features.head(3)
month week day_of_week hour sunrise_hour sunset_hour daylight_hours is_daylight Temperature Temperature_window_1D_mean ... week_sin week_cos day_of_week_sin day_of_week_cos hour_sin hour_cos sunrise_hour_sin sunrise_hour_cos sunset_hour_sin sunset_hour_cos
Time
2012-01-08 00:00:00 1 1 6 0 6 21 15 0 20.575 24.296875 ... 0.120537 0.992709 -2.449294e-16 1.0 0.000000 1.000000 1.0 6.123234e-17 -0.707107 0.707107
2012-01-08 01:00:00 1 1 6 1 6 21 15 0 22.500 24.098958 ... 0.120537 0.992709 -2.449294e-16 1.0 0.258819 0.965926 1.0 6.123234e-17 -0.707107 0.707107
2012-01-08 02:00:00 1 1 6 2 6 21 15 0 25.250 23.923958 ... 0.120537 0.992709 -2.449294e-16 1.0 0.500000 0.866025 1.0 6.123234e-17 -0.707107 0.707107

3 rows × 30 columns

In many cases, exogenous variables do not affect the target in isolation; their effect depends on the values of other variables. For example, the effect of temperature on electricity demand depends on the time of day. Interactions between variables can be captured by multiplying them to create new features, which is easily done with scikit-learn's PolynomialFeatures class.

# Interactions between exogenous variables
# ==============================================================================
transformer_poly = PolynomialFeatures(
                        degree           = 2,
                        interaction_only = True,
                        include_bias     = False
                    ).set_output(transform="pandas")
poly_cols = [
    'month_sin', 
    'month_cos',
    'week_sin',
    'week_cos',
    'day_of_week_sin',
    'day_of_week_cos',
    'hour_sin',
    'hour_cos',
    'sunrise_hour_sin',
    'sunrise_hour_cos',
    'sunset_hour_sin',
    'sunset_hour_cos',
    'daylight_hours',
    'is_daylight',
    'holiday_previous_day',
    'holiday_next_day',
    'Temperature_window_1D_mean',
    'Temperature_window_1D_min',
    'Temperature_window_1D_max',
    'Temperature_window_7D_mean',
    'Temperature_window_7D_min',
    'Temperature_window_7D_max',
    'Temperature',
    'Holiday'
    ]
poly_features = transformer_poly.fit_transform(exogenous_features[poly_cols])
poly_features = poly_features.drop(columns=poly_cols)
poly_features.columns = [f"poly_{col}" for col in poly_features.columns]
poly_features.columns = poly_features.columns.str.replace(" ", "__")
assert all(poly_features.index == exogenous_features.index)
exogenous_features = pd.concat([exogenous_features, poly_features], axis=1)
exogenous_features.head(3)
month week day_of_week hour sunrise_hour sunset_hour daylight_hours is_daylight Temperature Temperature_window_1D_mean ... poly_Temperature_window_7D_mean__Temperature_window_7D_min poly_Temperature_window_7D_mean__Temperature_window_7D_max poly_Temperature_window_7D_mean__Temperature poly_Temperature_window_7D_mean__Holiday poly_Temperature_window_7D_min__Temperature_window_7D_max poly_Temperature_window_7D_min__Temperature poly_Temperature_window_7D_min__Holiday poly_Temperature_window_7D_max__Temperature poly_Temperature_window_7D_max__Holiday poly_Temperature__Holiday
Time
2012-01-08 00:00:00 1 1 6 0 6 21 15 0 20.575 24.296875 ... 333.695156 919.115056 478.451417 0.0 567.18375 295.25125 0.0 813.226875 0.0 0.0
2012-01-08 01:00:00 1 1 6 1 6 21 15 0 22.500 24.098958 ... 333.146354 917.603460 522.354911 0.0 567.18375 322.87500 0.0 889.312500 0.0 0.0
2012-01-08 02:00:00 1 1 6 2 6 21 15 0 25.250 23.923958 ... 332.535625 915.921295 585.123661 0.0 567.18375 362.33750 0.0 998.006250 0.0 0.0

3 rows × 306 columns

# Select exogenous variables to include in the model
# ==============================================================================
exog_features = []
# Columns ending with _sin or _cos
exog_features.extend(exogenous_features.filter(regex='_sin$|_cos$').columns.tolist())
# Columns starting with Temperature_
exog_features.extend(exogenous_features.filter(regex='^Temperature_.*').columns.tolist())
# Columns starting with Holiday_
exog_features.extend(exogenous_features.filter(regex='^Holiday_.*').columns.tolist())
# Include original features
exog_features.extend(['Temperature', 'Holiday'])
# Merge target and exogenous variables into the same DataFrame
# ==============================================================================
data = data[['Demand']].merge(
           exogenous_features[exog_features],
           left_index  = True,
           right_index = True,
           how         = 'inner'  # Only dates for which all variables are available
)
data = data.astype('float32')

# Split data into train-val-test
data_train = data.loc[: end_train, :].copy()
data_val   = data.loc[end_train:end_validation, :].copy()
data_test  = data.loc[end_validation:, :].copy()

The model is backtested again, this time including the exogenous variables as features.

# Backtest model
# ==============================================================================
metric, predictions = backtesting_forecaster(
                          forecaster = forecaster,
                          y          = data['Demand'],
                          exog       = data[exog_features],
                          cv         = cv,
                          metric     = 'mean_absolute_error'
)
display(metric)
predictions.head()
  0%|          | 0/22 [00:00<?, ?it/s]
mean_absolute_error
0 141.615022
pred
2014-12-08 00:00:00 4973.909908
2014-12-08 01:00:00 4976.217860
2014-12-08 02:00:00 4987.034012
2014-12-08 03:00:00 5007.016313
2014-12-08 04:00:00 5027.765102

Including the exogenous variables as features further improves the model's predictive capacity.

Hyperparameter tuning

The ForecasterRecursive object above used the first 24 lags and an LGBMRegressor with default hyperparameters. These are not necessarily the best choices. To find the best hyperparameters, a Bayesian search is run with the bayesian_search_forecaster() function. The search is also based on backtesting, training a model with each combination of hyperparameters and lags. Note that the hyperparameter search must be done on the validation set; the test data must never be used for it.

💡 Tip

Hyperparameter search can take a long time, especially with a backtesting-based validation strategy (TimeSeriesFold). A faster alternative is a validation strategy based on one-step-ahead predictions (OneStepAheadFold). Although faster, it may be less accurate than backtesting. See backtesting vs one-step-ahead for details.
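As a sketch of that faster alternative (assuming the OneStepAheadFold API of recent skforecast versions), the cross-validation object passed to bayesian_search_forecaster() would simply be swapped:

# Hedged sketch: one-step-ahead validation instead of backtesting
# ==============================================================================
from skforecast.model_selection import OneStepAheadFold

cv_fast = OneStepAheadFold(initial_train_size=len(data[:end_train]))
# cv_fast could then replace cv_search in the call to bayesian_search_forecaster()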

# Create forecaster to be tuned
# ==============================================================================
forecaster = ForecasterRecursive(
                 regressor       = LGBMRegressor(random_state=15926, verbose=-1),
                 lags            = 24,  # Placeholder, replaced during the search
                 window_features = window_features
)

# Lags used as predictors
lags_grid = [24, (1, 2, 3, 23, 24, 25, 47, 48, 49)]

# Regressor hyperparameter search space
def search_space(trial):
    search_space  = {
        'n_estimators' : trial.suggest_int('n_estimators', 300, 1000, step=100),
        'max_depth'    : trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.5),
        'reg_alpha'    : trial.suggest_float('reg_alpha', 0, 1),
        'reg_lambda'   : trial.suggest_float('reg_lambda', 0, 1),
        'lags'         : trial.suggest_categorical('lags', lags_grid)
    } 
    return search_space

# Folds for training and validation
cv_search = TimeSeriesFold(
                steps              = 24,
                initial_train_size = len(data[:end_train]),
                refit              = False,
            )

results_search, frozen_trial = bayesian_search_forecaster(
                                   forecaster   = forecaster,
                                   y            = data.loc[:end_validation, 'Demand'],
                                   exog         = data.loc[:end_validation, exog_features],
                                   cv           = cv_search,
                                   metric       = 'mean_absolute_error',
                                   search_space = search_space,
                                   n_trials     = 10,  # Increase for a more exhaustive search
                                   return_best  = True
)
  0%|          | 0/10 [00:00<?, ?it/s]
`Forecaster` refitted using the best-found lags and parameters, and the whole data set: 
  Lags: [ 1  2  3 23 24 25 47 48 49] 
  Parameters: {'n_estimators': 800, 'max_depth': 9, 'learning_rate': 0.050765544282894995, 'reg_alpha': 0.7636828414433382, 'reg_lambda': 0.243666374536874}
  Backtesting metric: 128.29025410611806
# Search results
# ==============================================================================
best_params = results_search.at[0, 'params']
best_params = best_params | {'random_state': 15926, 'verbose': -1}
best_lags = results_search.at[0, 'lags']
results_search.head(3)
lags params mean_absolute_error n_estimators max_depth learning_rate reg_alpha reg_lambda
0 [1, 2, 3, 23, 24, 25, 47, 48, 49] {'n_estimators': 800, 'max_depth': 9, 'learnin... 128.290254 800.0 9.0 0.050766 0.763683 0.243666
1 [1, 2, 3, 23, 24, 25, 47, 48, 49] {'n_estimators': 800, 'max_depth': 5, 'learnin... 129.402763 800.0 5.0 0.121157 0.551315 0.719469
2 [1, 2, 3, 23, 24, 25, 47, 48, 49] {'n_estimators': 600, 'max_depth': 8, 'learnin... 130.719469 600.0 8.0 0.099421 0.175452 0.531551

Because return_best=True was used, the forecaster object is automatically updated with the best configuration found and trained on the whole dataset. This final model can then be used to predict future values of new data.

# Best model
# ==============================================================================
forecaster

ForecasterRecursive

General Information
  • Regressor: LGBMRegressor
  • Lags: [ 1 2 3 23 24 25 47 48 49]
  • Window features: ['roll_mean_72']
  • Window size: 72
  • Series name: Demand
  • Exogenous included: True
  • Weight function included: False
  • Differentiation order: None
  • Creation date: 2025-08-20 00:18:01
  • Last fit date: 2025-08-20 00:20:19
  • Skforecast version: 0.17.0
  • Python version: 3.12.9
  • Forecaster id: None
Exogenous Variables
    month_sin, month_cos, week_sin, week_cos, day_of_week_sin, day_of_week_cos, hour_sin, hour_cos, sunrise_hour_sin, sunrise_hour_cos, sunset_hour_sin, sunset_hour_cos, poly_month_sin__month_cos, poly_month_sin__week_sin, poly_month_sin__week_cos, poly_month_sin__day_of_week_sin, poly_month_sin__day_of_week_cos, poly_month_sin__hour_sin, poly_month_sin__hour_cos, poly_month_sin__sunrise_hour_sin, poly_month_sin__sunrise_hour_cos, poly_month_sin__sunset_hour_sin, poly_month_sin__sunset_hour_cos, poly_month_cos__week_sin, poly_month_cos__week_cos, ..., poly_day_of_week_cos__sunset_hour_sin, poly_day_of_week_cos__sunset_hour_cos, poly_hour_sin__hour_cos, poly_hour_sin__sunrise_hour_sin, poly_hour_sin__sunrise_hour_cos, poly_hour_sin__sunset_hour_sin, poly_hour_sin__sunset_hour_cos, poly_hour_cos__sunrise_hour_sin, poly_hour_cos__sunrise_hour_cos, poly_hour_cos__sunset_hour_sin, poly_hour_cos__sunset_hour_cos, poly_sunrise_hour_sin__sunrise_hour_cos, poly_sunrise_hour_sin__sunset_hour_sin, poly_sunrise_hour_sin__sunset_hour_cos, poly_sunrise_hour_cos__sunset_hour_sin, poly_sunrise_hour_cos__sunset_hour_cos, poly_sunset_hour_sin__sunset_hour_cos, Temperature_window_1D_mean, Temperature_window_1D_max, Temperature_window_1D_min, Temperature_window_7D_mean, Temperature_window_7D_max, Temperature_window_7D_min, Temperature, Holiday
Data Transformations
  • Transformer for y: None
  • Transformer for exog: None
Training Information
  • Training range: [Timestamp('2012-01-08 00:00:00'), Timestamp('2014-11-30 23:00:00')]
  • Training index type: DatetimeIndex
  • Training index frequency: h
Regressor Parameters
    {'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 1.0, 'importance_type': 'split', 'learning_rate': 0.050765544282894995, 'max_depth': 9, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 800, 'n_jobs': None, 'num_leaves': 31, 'objective': None, 'random_state': 15926, 'reg_alpha': 0.7636828414433382, 'reg_lambda': 0.243666374536874, 'subsample': 1.0, 'subsample_for_bin': 200000, 'subsample_freq': 0, 'verbose': -1}
Fit Kwargs
    {}

🛈 API Reference    🗎 User Guide

Once the best combination of hyperparameters has been identified (using the validation data), the model's predictive capacity is evaluated on the test set.

# Backtest final model on test data
# ==============================================================================
metric, predictions = backtesting_forecaster(
                          forecaster = forecaster,
                          y          = data['Demand'],
                          exog       = data[exog_features],
                          cv         = cv,
                          metric     = 'mean_absolute_error'
)
display(metric)
predictions.head()
  0%|          | 0/22 [00:00<?, ?it/s]
mean_absolute_error
0 132.742342
pred
2014-12-08 00:00:00 4975.449681
2014-12-08 01:00:00 5002.267029
2014-12-08 02:00:00 5025.141643
2014-12-08 03:00:00 5057.270548
2014-12-08 04:00:00 5087.195752

After optimizing lags and hyperparameters, the prediction error is further reduced.

Feature selection

Feature selection is the process of choosing the subset of the most relevant features for modeling. It helps reduce overfitting, improve accuracy, and shorten training time. Since the underlying regressors of skforecast follow the scikit-learn API, scikit-learn's feature selection methods can be applied directly with the select_features() function. The two most popular methods are recursive feature elimination and sequential feature selection.

💡 Tip

Feature selection is a powerful tool for improving the performance of machine learning models, but it is computationally expensive and can be time-consuming. Since the goal is to find the best subset of features rather than the best model, it is not necessary to use the full dataset or a highly complex model. It is recommended to use a small subset of the data and a simple model instead; once the best subset of features has been identified, the model can be trained with the full dataset and a more complex configuration.

# Create forecaster
# ==============================================================================
regressor = LGBMRegressor(
                n_estimators = 100,
                max_depth    = 4,
                random_state = 15926,
                verbose      = -1
)

forecaster = ForecasterRecursive(
                 regressor       = regressor,
                 lags            = best_lags,
                 window_features = window_features
)

# Recursive feature elimination with cross-validation
# ==============================================================================
warnings.filterwarnings("ignore", message="X does not have valid feature names.*")
selector = RFECV(
    estimator = regressor,
    step      = 1,
    cv        = 3,
)
lags_select, window_features_select, exog_select = select_features(
    forecaster      = forecaster,
    selector        = selector,
    y               = data_train['Demand'],
    exog            = data_train[exog_features],
    select_only     = None,
    force_inclusion = None,
    subsample       = 0.5,  # Subsample of the data to speed up the process
    random_state    = 123,
    verbose         = True,
)
Recursive feature elimination (RFECV)
-------------------------------------
Total number of records available: 17304
Total number of records used for feature selection: 8652
Number of features available: 96
    Lags            (n=9)
    Window features (n=1)
    Exog            (n=86)
Number of features selected: 45
    Lags            (n=9) : [1, 2, 3, 23, 24, 25, 47, 48, 49]
    Window features (n=1) : ['roll_mean_72']
    Exog            (n=35) : ['week_cos', 'hour_sin', 'hour_cos', 'poly_month_sin__day_of_week_cos', 'poly_month_sin__hour_cos', 'poly_month_cos__hour_cos', 'poly_week_sin__hour_sin', 'poly_week_sin__hour_cos', 'poly_week_cos__day_of_week_sin', 'poly_week_cos__day_of_week_cos', 'poly_week_cos__hour_sin', 'poly_week_cos__hour_cos', 'poly_day_of_week_sin__day_of_week_cos', 'poly_day_of_week_sin__hour_sin', 'poly_day_of_week_sin__hour_cos', 'poly_day_of_week_sin__sunrise_hour_sin', 'poly_day_of_week_cos__hour_sin', 'poly_day_of_week_cos__hour_cos', 'poly_day_of_week_cos__sunrise_hour_cos', 'poly_hour_sin__hour_cos', 'poly_hour_sin__sunrise_hour_cos', 'poly_hour_sin__sunset_hour_sin', 'poly_hour_sin__sunset_hour_cos', 'poly_hour_cos__sunrise_hour_sin', 'poly_hour_cos__sunrise_hour_cos', 'poly_hour_cos__sunset_hour_sin', 'poly_hour_cos__sunset_hour_cos', 'Temperature_window_1D_mean', 'Temperature_window_1D_max', 'Temperature_window_1D_min', 'Temperature_window_7D_mean', 'Temperature_window_7D_max', 'Temperature_window_7D_min', 'Temperature', 'Holiday']

Scikit-learn's RFECV starts by training a model with all available features and obtaining each feature's importance (via coef_ or feature_importances_). At each iteration, the least important features are removed and the model's performance with the remaining features is measured by cross-validation. The process continues until removing further features no longer improves, or begins to degrade, the model's performance (based on the selected metric), or until min_features_to_select is reached.

The final result is a subset of features, identified by cross-validation, that offers the best balance between model simplicity and predictive capacity.
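As a standalone illustration of that mechanism (synthetic data, scikit-learn only, independent of the forecaster above), RFECV keeps only the informative columns:

# Minimal standalone RFECV illustration on synthetic data
# ==============================================================================
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

X_demo, y_demo = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=123)
selector_demo = RFECV(estimator=LinearRegression(), step=1, cv=3)
selector_demo.fit(X_demo, y_demo)
print(selector_demo.support_)     # boolean mask of the selected features
print(selector_demo.n_features_)  # number of features kept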

The forecaster is then retrained and evaluated using the best subset of features.

# Create forecaster with the selected features
# ==============================================================================
forecaster = ForecasterRecursive(
                regressor       = LGBMRegressor(**best_params),
                lags            = lags_select,
                window_features = window_features
)
# Backtest forecaster with the selected features
# ==============================================================================
metric, predictions = backtesting_forecaster(
                            forecaster = forecaster,
                            y          = data['Demand'],
                            exog       = data[exog_select],
                            cv         = cv,
                            metric     = 'mean_absolute_error'
)                
metric
  0%|          | 0/22 [00:00<?, ?it/s]
mean_absolute_error
0 131.391247

The number of features has been reduced without degrading the model's performance, yielding a simpler model that trains faster and is less prone to overfitting.

Probabilistic forecasting: prediction intervals

A prediction interval defines the range within which the true value of the target variable can be expected to fall with a given probability. Skforecast implements several methods for probabilistic forecasting.

The following code shows how to generate prediction intervals for the autoregressive model. The predict_interval() method generates an interval for each predicted step, and backtesting_forecaster() generates intervals for the whole test set. The interval argument specifies the pair of quantiles to compute; here it is set to [5, 95], that is, the 5% and 95% quantiles, which corresponds to a theoretical coverage of 90%.

# Create and train forecaster
# ==============================================================================
forecaster = ForecasterRecursive(
                 regressor       = LGBMRegressor(**best_params),
                 lags            = lags_select,
                 window_features = window_features,
                 binner_kwargs   = {"n_bins": 5}
             )
forecaster.fit(
    y    = data.loc[:end_train, 'Demand'],
    exog = data.loc[:end_train, exog_select],
    store_in_sample_residuals = True
)
# Prediction intervals
# ==============================================================================
# Since the model was trained with exogenous variables, they must also be provided at prediction time.
predictions = forecaster.predict_interval(
                  exog     = data.loc[end_train:, exog_select],
                  steps    = 24,
                  interval = [5, 95],
                  method  = 'conformal',
              )
predictions.head()
pred lower_bound upper_bound
2014-01-01 00:00:00 3659.284654 3619.432029 3699.137278
2014-01-01 01:00:00 3809.478172 3769.625548 3849.330797
2014-01-01 02:00:00 3941.153515 3895.670015 3986.637016
2014-01-01 03:00:00 3973.933981 3928.450480 4019.417482
2014-01-01 04:00:00 4062.377324 4016.893823 4107.860824

By default, intervals are computed from in-sample residuals (residuals on the training set). However, this can produce intervals that are too narrow (overly optimistic). To avoid this, the set_out_sample_residuals() method can be used to supply out-of-sample residuals computed on the validation set through backtesting.

If the predicted values are also passed to set_out_sample_residuals(), the bootstrapping of residuals can be conditioned on (binned by) the range of the predicted values, which helps improve interval coverage while keeping the intervals as narrow as possible.

# Backtesting on validation data to obtain out-of-sample residuals
# ==============================================================================
cv = TimeSeriesFold(
        steps              = 24,
        initial_train_size = len(data.loc[:end_train]),
        refit              = False,
)
_, predictions_val = backtesting_forecaster(
                         forecaster = forecaster,
                         y          = data.loc[:end_validation, 'Demand'],
                         exog       = data.loc[:end_validation, exog_select],
                         cv         = cv,
                         metric     = 'mean_absolute_error'
                     )
  0%|          | 0/334 [00:00<?, ?it/s]
# Distribution of out-of-sample residuals
# ============================================================================== 
residuals = data.loc[predictions_val.index, 'Demand'] - predictions_val['pred']
print(pd.Series(np.where(residuals < 0, 'negative', 'positive')).value_counts())
plt.rcParams.update({'font.size': 8})
_ = plot_residuals(residuals=residuals, figsize=(7, 4))
negative    4429
positive    3587
Name: count, dtype: int64
# Store out-of-sample residuals in the forecaster
# ============================================================================== 
forecaster.set_out_sample_residuals(
    y_true = data.loc[predictions_val.index, 'Demand'],
    y_pred = predictions_val['pred']
)

Backtesting is then run on the test set to estimate the prediction intervals. use_in_sample_residuals is set to False so that the previously stored out-of-sample residuals are used, and use_binned_residuals is set to True so that the bootstrapped residuals are selected conditional on the range of the predicted values.

# Backtesting with prediction intervals on test data using out-of-sample residuals
# ============================================================================== 
cv = TimeSeriesFold(
        steps              = 24,
        initial_train_size = len(data.loc[:end_validation]),
        refit              = False,
)
metric, predictions = backtesting_forecaster(
                            forecaster              = forecaster,
                            y                       = data['Demand'],
                            exog                    = data[exog_select],
                            cv                      = cv,
                            metric                  = 'mean_absolute_error',
                            interval                = [5, 95],
                            interval_method         = 'conformal',
                            use_in_sample_residuals = False,  # Use out-of-sample residuals
                            use_binned_residuals    = True,   # Bin residuals by predicted value
                       )
predictions.head(5)
  0%|          | 0/29 [00:00<?, ?it/s]
pred lower_bound upper_bound
2014-12-01 00:00:00 5609.728357 5141.917122 6077.539593
2014-12-01 01:00:00 5570.278476 5102.467241 6038.089711
2014-12-01 02:00:00 5557.446741 5089.635506 6025.257976
2014-12-01 03:00:00 5574.087926 5106.276691 6041.899161
2014-12-01 04:00:00 5655.740166 5187.928931 6123.551402
# Plot prediction intervals vs real values
# ============================================================================== 
fig = go.Figure([
    go.Scatter(
        name='Prediction', x=predictions.index, y=predictions['pred'], mode='lines',
    ),
    go.Scatter(
        name='Real value', x=data_test.index, y=data_test['Demand'], mode='lines',
    ),
    go.Scatter(
        name='Upper Bound', x=predictions.index, y=predictions['upper_bound'],
        mode='lines', marker=dict(color="#444"), line=dict(width=0), showlegend=False
    ),
    go.Scatter(
        name='Lower Bound', x=predictions.index, y=predictions['lower_bound'],
        marker=dict(color="#444"), line=dict(width=0), mode='lines',
        fillcolor='rgba(68, 68, 68, 0.3)', fill='tonexty', showlegend=False
    )
])
fig.update_layout(
    title="Real value vs predicted in test data",
    xaxis_title="Date time",
    yaxis_title="Demand",
    width=800,
    height=400,
    margin=dict(l=20, r=20, t=35, b=20),
    hovermode="x",
    legend=dict(orientation="h", yanchor="top", y=1.1, xanchor="left", x=0.001)
)
fig.show()
[Interactive plot: real demand vs predictions with the 90% prediction interval in the test data]
# Predicted interval coverage (on test data)
# ============================================================================== 
coverage = calculate_coverage(
              y_true       = data.loc[end_validation:, "Demand"],
              lower_bound  = predictions["lower_bound"],
              upper_bound  = predictions["upper_bound"]
           )
area = (predictions['upper_bound'] - predictions['lower_bound']).sum()
print(f"Total area of the interval: {round(area, 2)}")
print(f"Predicted interval coverage: {round(100 * coverage, 2)} %")
Total area of the interval: 335515.26
Predicted interval coverage: 86.64 %

The observed coverage of the interval is close to the expected theoretical coverage (90%).
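For intuition, the empirical coverage returned by calculate_coverage() is simply the fraction of true values that fall inside their interval; a manual equivalent using the objects above:

# Manual equivalent of the coverage metric
# ==============================================================================
y_true = data.loc[predictions.index, 'Demand']
inside = (y_true >= predictions['lower_bound']) & (y_true <= predictions['upper_bound'])
print(f"Coverage: {100 * inside.mean():.2f} %")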

✏️ Note

For a detailed description of skforecast's probabilistic forecasting features, see: Probabilistic forecasting with machine learning.

Model explainability

Many modern machine learning models, such as ensemble methods, are essentially black boxes: it is difficult to understand why they make a particular prediction. Explainability techniques aim to reveal a model's inner workings, increasing transparency, building trust, and helping meet regulatory requirements across domains. Improving model explainability not only helps understand model behavior, but can also uncover biases, improve performance, and enable stakeholders to make more informed decisions based on machine learning insights.

skforecast is compatible with several popular model explainability methods: model-specific feature importance, SHAP values, and partial dependence plots.

# Create and train forecaster
# ==============================================================================
forecaster = ForecasterRecursive(
                 regressor       = LGBMRegressor(**best_params),
                 lags            = lags_select,
                 window_features = window_features
             )
forecaster.fit(
    y    = data.loc[:end_validation, 'Demand'],
    exog = data.loc[:end_validation, exog_select]
)

Model-specific feature importance

# Model-specific feature importance
# ============================================================================== 
feature_importances = forecaster.get_feature_importances()
feature_importances.head(10)
feature importance
0 lag_1 2279
43 Temperature 1285
1 lag_2 1167
5 lag_25 1111
2 lag_3 1047
4 lag_24 975
3 lag_23 757
39 Temperature_window_1D_min 713
38 Temperature_window_1D_max 689
29 poly_hour_sin__hour_cos 683

⚠️ Warning

The get_feature_importances() method only returns values if the forecaster's regressor has either a coef_ or a feature_importances_ attribute, which is the default in scikit-learn.

SHAP values

SHAP (SHapley Additive exPlanations) is a game-theory-based explanation method that quantifies the contribution of each feature to an individual prediction. SHAP values help understand why the model makes a particular prediction.

# Training matrices used by the forecaster to fit the internal regressor
# ============================================================================== 
X_train, y_train = forecaster.create_train_X_y(
                       y    = data_train['Demand'],
                       exog = data_train[exog_select]
)
display(X_train.head(3))
display(y_train.head(3))
lag_1 lag_2 lag_3 lag_23 lag_24 lag_25 lag_47 lag_48 lag_49 roll_mean_72 ... poly_hour_cos__sunset_hour_sin poly_hour_cos__sunset_hour_cos Temperature_window_1D_mean Temperature_window_1D_max Temperature_window_1D_min Temperature_window_7D_mean Temperature_window_7D_max Temperature_window_7D_min Temperature Holiday
Time
2012-01-11 00:00:00 5087.274902 4925.770996 4661.530762 4988.975098 4963.185059 4896.571289 5039.643066 4958.630371 4886.074219 4386.356154 ... -0.707107 0.707107 16.140625 19.575001 13.175 19.321875 29.0 13.175 15.150 0.0
2012-01-11 01:00:00 5107.675781 5087.274902 4925.770996 5006.663086 4988.975098 4963.185059 5090.203125 5039.643066 4958.630371 4398.891398 ... -0.683013 0.683013 16.042707 19.575001 13.175 19.291666 29.0 13.175 15.425 0.0
2012-01-11 02:00:00 5126.337891 5107.675781 5087.274902 4977.723633 5006.663086 4988.975098 5108.837402 5090.203125 5039.643066 4410.526191 ... -0.612372 0.612372 15.939584 19.575001 13.175 19.255507 29.0 13.175 14.800 0.0

3 rows × 45 columns

Time
2012-01-11 00:00:00    5107.675781
2012-01-11 01:00:00    5126.337891
2012-01-11 02:00:00    5103.370605
Freq: h, Name: y, dtype: float32
# Create SHAP explainer (for tree-based models)
# ============================================================================== 
shap.initjs()
explainer = shap.TreeExplainer(forecaster.regressor)

# Sample 50% of the data to speed up the calculation
# ==============================================================================
rng = np.random.default_rng(seed=785412)
sample = rng.choice(X_train.index, size=int(len(X_train)*0.5), replace=False)
X_train_sample = X_train.loc[sample, :]
shap_values = explainer.shap_values(X_train_sample)


# SHAP summary plot (top 10 features)
# ==============================================================================
shap.summary_plot(shap_values, X_train_sample, max_display=10, show=False)
fig, ax = plt.gcf(), plt.gca()
ax.set_title("SHAP Summary plot")
ax.tick_params(labelsize=8)
fig.set_size_inches(6, 4.5)

SHAP values can also be used to explain the predictions generated during backtesting. By setting return_predictors=True in backtesting_forecaster(), each prediction is returned together with the values of the predictors (lags, window features, and exogenous variables) used to compute it, so that individual backtest predictions can be inspected with SHAP.

# Backtesting returning the predictors
# ==============================================================================
cv = TimeSeriesFold(
        steps              = 24,
        initial_train_size = len(data.loc[:end_validation]),
)
_, predictions = backtesting_forecaster(
                        forecaster        = forecaster,
                        y                 = data['Demand'],
                        exog              = data[exog_select],
                        cv                = cv,
                        metric            = 'mean_absolute_error',
                        return_predictors = True,
                )       
  0%|          | 0/29 [00:00<?, ?it/s]


predictions.head(3)
pred fold lag_1 lag_2 lag_3 lag_23 lag_24 lag_25 lag_47 lag_48 ... poly_hour_cos__sunset_hour_sin poly_hour_cos__sunset_hour_cos Temperature_window_1D_mean Temperature_window_1D_max Temperature_window_1D_min Temperature_window_7D_mean Temperature_window_7D_max Temperature_window_7D_min Temperature Holiday
2014-12-01 00:00:00 5609.728357 0 5634.690430 5519.464844 5250.484863 4326.276855 4102.013672 3851.330322 4056.273926 4058.317627 ... -0.866025 0.500000 28.779167 33.849998 23.6 18.208036 33.849998 11.15 20.000000 0.0
2014-12-01 01:00:00 5570.278476 0 5609.728357 5634.690430 5519.464844 4560.871094 4326.276855 4102.013672 4060.130127 4056.273926 ... -0.836516 0.482963 28.556250 33.849998 20.0 18.224703 33.849998 11.15 20.350000 0.0
2014-12-01 02:00:00 5557.446741 0 5570.278476 5609.728357 5634.690430 4782.341309 4560.871094 4326.276855 4071.491943 4060.130127 ... -0.750000 0.433013 28.193750 33.849998 20.0 18.242857 33.849998 11.15 22.299999 0.0

3 rows × 47 columns

# Waterfall plot for a single prediction generated during backtesting
# ==============================================================================
predictions = predictions.astype(data[exog_select].dtypes) # Ensure consistent dtypes
iloc_predicted_date = predictions.index.get_loc('2014-12-16 05:00:00')
shap_values_single = explainer(predictions.iloc[:, 2:])
shap.plots.waterfall(shap_values_single[iloc_predicted_date], show=False)
fig, ax = plt.gcf(), plt.gca()
fig.set_size_inches(8, 3.5)
ax_list = fig.axes
ax = ax_list[0]
ax.tick_params(labelsize=8)
plt.show()
# Force plot for a single prediction generated during backtesting
# ==============================================================================
shap.force_plot(
    base_value  = shap_values_single.base_values[iloc_predicted_date],
    shap_values = shap_values_single.values[iloc_predicted_date],
    features    = predictions.iloc[iloc_predicted_date, 2:],
)
[Interactive SHAP force plot for the selected prediction]

Direct multi-step forecasting

The ForecasterRecursive and ForecasterRecursiveCustom models follow a recursive strategy, in which each new prediction builds on the previous one. An alternative multi-step strategy is to train a separate model for each step to be predicted, known as direct multi-step forecasting. Although computationally more expensive than the recursive strategy, because it requires training several models, it can achieve better results.

# Direct forecaster
# ==============================================================================
forecaster = ForecasterDirect(
                 regressor       = LGBMRegressor(**best_params),
                 steps           = 24,
                 lags            = lags_select,
                 window_features = window_features
             )

# Backtest model
# ==============================================================================
metric, predictions = backtesting_forecaster(
                          forecaster = forecaster,
                          y          = data['Demand'],
                          exog       = data[exog_select],
                          cv         = cv,
                          metric     = 'mean_absolute_error'
)

display(metric)
predictions.head()
  0%|          | 0/29 [00:00<?, ?it/s]
mean_absolute_error
0 119.096333
pred
2014-12-01 00:00:00 5643.441883
2014-12-01 01:00:00 5770.005014
2014-12-01 02:00:00 5844.018982
2014-12-01 03:00:00 6144.759526
2014-12-01 04:00:00 6105.915435

For more information, visit the official skforecast documentation.

Daily forecasts generated in advance

So far, it has been assumed that each day's next-day predictions are generated at 23:59 of the current day. This is unrealistic in practice, since it leaves no time to manage the early hours of the following day. Suppose instead that the predictions for day D+1 must be generated at 11:00 of day D so there is enough time to act on them. This means that at 11:00 of day D, hours [12, 13, ..., 23] of the same day plus hours [0, 1, ..., 23] of day D+1 must be predicted, 36 hours in total, of which only the last 24 predictions are kept. This type of evaluation is easily performed with the backtesting_forecaster() function and its gap argument. In addition, the allow_incomplete_fold argument allows the last fold to be excluded if it does not contain the required number of steps. The process, run once per day, is as follows:

  1. At 11:00 on the first day of the test set, predict the next 36 hours (the remaining 12 hours of the day plus the 24 hours of the following day).
  2. Keep only the predictions for the following day (from position 12 onward).
  3. Add the data up to 11:00 of the next day to the test set.
  4. Repeat the process.

Thus, at 11:00 each day, the model has access to the actual demand values recorded up to that moment.


⚠️ Note

Two details are worth noting:

  • The training data must end right where the gap begins. In this case, initial_train_size must be extended by 12 positions so that the forecast horizon starts at 2014-12-01 12:00:00.
  • Although only the last 24 predictions (steps) are kept for evaluating the model, each fold actually predicts 36 steps (steps + gap).

# End of initial_train_size + 12 positions
# ==============================================================================
data.iloc[:len(data.loc[:end_validation]) + 12].tail(2)
Demand month_sin month_cos week_sin week_cos day_of_week_sin day_of_week_cos hour_sin hour_cos sunrise_hour_sin ... poly_sunrise_hour_cos__sunset_hour_cos poly_sunset_hour_sin__sunset_hour_cos Temperature_window_1D_mean Temperature_window_1D_max Temperature_window_1D_min Temperature_window_7D_mean Temperature_window_7D_max Temperature_window_7D_min Temperature Holiday
Time
2014-12-01 10:00:00 5084.011230 -2.449294e-16 1.0 -0.354605 0.935016 0.0 1.0 0.500000 -0.866025 1.0 ... 3.061617e-17 -0.433013 25.589582 30.950001 20.0 18.573214 33.849998 11.15 19.90 0.0
2014-12-01 11:00:00 4851.066895 -2.449294e-16 1.0 -0.354605 0.935016 0.0 1.0 0.258819 -0.965926 1.0 ... 3.061617e-17 -0.433013 25.129168 29.500000 19.9 18.601191 33.849998 11.15 19.35 0.0

2 rows × 87 columns

# Forecaster
# ==============================================================================
forecaster = ForecasterRecursive(
                 regressor       = LGBMRegressor(**best_params),
                 lags            = lags_select,
                 window_features = window_features
             )

# Backtesting with gap
# ==============================================================================
cv = TimeSeriesFold(
        steps              = 24,
        initial_train_size = len(data.loc[:end_validation]) + 12,
        refit              = False,
        gap                = 12,
)

metric, predictions = backtesting_forecaster(
                          forecaster = forecaster,
                          y          = data['Demand'],
                          exog       = data[exog_select],
                          cv         = cv,
                          metric     = 'mean_absolute_error'
                      )
display(metric)
predictions.head(5)
  0%|          | 0/28 [00:00<?, ?it/s]
mean_absolute_error
0 138.496187
pred
2014-12-02 00:00:00 5459.348705
2014-12-02 01:00:00 5506.327848
2014-12-02 02:00:00 5590.533606
2014-12-02 03:00:00 5688.215164
2014-12-02 04:00:00 5835.673206

As expected, the error increases as the forecast horizon grows from 24 to 36 hours.

Conclusion

Gradient boosting models have proven to be powerful tools for forecasting energy demand. One of their main advantages is how easily they incorporate exogenous variables, which can significantly improve the model's predictive capacity. In addition, explainability techniques make it possible to understand, both qualitatively and quantitatively, how variables and their values influence the predictions. All of these tasks are straightforward with the skforecast library.

Forecaster                            Exogenous variables        Backtesting MAE
ForecasterEquivalentDate (baseline)   No                         308.4
ForecasterRecursive                   No                         225.5
ForecasterRecursive                   Yes                        132.7
ForecasterRecursive                   Yes (feature selection)    131.4
ForecasterDirect                      Yes (feature selection)    119.1

Session information

import session_info
session_info.show(html=False)
-----
astral              3.2
feature_engine      1.8.3
lightgbm            4.6.0
matplotlib          3.10.1
numpy               2.2.6
optuna              3.6.2
pandas              2.3.1
plotly              6.0.1
session_info        v1.0.1
shap                0.47.2
skforecast          0.17.0
sklearn             1.7.1
statsmodels         0.14.4
-----
IPython             9.1.0
jupyter_client      8.6.3
jupyter_core        5.7.2
jupyterlab          4.4.5
notebook            7.4.5
-----
Python 3.12.9 | packaged by Anaconda, Inc. | (main, Feb  6 2025, 18:56:27) [GCC 11.2.0]
Linux-6.14.0-27-generic-x86_64-with-glibc2.39
-----
Session information updated at 2025-08-20 00:23

Citation

How to cite this document

If you use this document or any part of it, we appreciate attribution. Thank you!

Forecasting energy demand with machine learning by Joaquín Amat Rodrigo and Javier Escobar Ortiz, available under an Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0 DEED) license at https://www.cienciadedatos.net/documentos/py29-forecasting-electricity-power-demand-python.html

How to cite skforecast

If you use skforecast for a paper or publication, we would appreciate it if you cite the published software.

Zenodo:

Amat Rodrigo, Joaquin, & Escobar Ortiz, Javier. (2024). skforecast (v0.17.0). Zenodo. https://doi.org/10.5281/zenodo.8382788

APA:

Amat Rodrigo, J., & Escobar Ortiz, J. (2024). skforecast (Version 0.17.0) [Computer software]. https://doi.org/10.5281/zenodo.8382788

BibTeX:

@software{skforecast, author = {Amat Rodrigo, Joaquin and Escobar Ortiz, Javier}, title = {skforecast}, version = {0.17.0}, month = {08}, year = {2025}, license = {BSD-3-Clause}, url = {https://skforecast.org/}, doi = {10.5281/zenodo.8382788} }


Did you like the article? Your support matters

Your support helps me keep creating free educational content. Thank you! 😊

Become a GitHub sponsor

Creative Commons license

This work by Joaquín Amat Rodrigo and Javier Escobar Ortiz is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.

You are free to:

  • Share: copy and redistribute the material in any medium or format.

  • Adapt: remix, transform, and build upon the material.

Under the following terms:

  • Attribution: you must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

  • NonCommercial: you may not use the material for commercial purposes.

  • ShareAlike: if you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.