

Introduction

Most supervised machine learning models applied to regression problems generate a point estimate of the target variable $y$ given a set of predictors $\mathbf{X}$. This estimate typically corresponds to the conditional mean $\mathbb{E}[y \mid \mathbf{X}]$. While perfectly adequate for many tasks, point predictions carry a fundamental limitation: they say nothing about the uncertainty surrounding the predicted value.

Consider a company that manages electricity distribution. The grid must be provisioned to handle peak demand, not just average demand. Knowing that the expected consumption at 10 AM is 15 MWh is useful, but knowing that there is a 5% probability that it exceeds 22.5 MWh is far more actionable for capacity planning. A standard regression model cannot answer the second question.

Figure: distribution of electricity consumption.

More formally, point-estimate models do not provide:

  • Prediction intervals: a range $[L, U]$ that covers the true value with a given probability, e.g., $P(L \leq y \leq U \mid \mathbf{X}) = 0.90$.
  • Probability predictions: $P(y > c \mid \mathbf{X})$ for an arbitrary threshold $c$.
  • Full conditional distribution $p(y \mid \mathbf{X})$, which contains all information about the response variable.

Distributional regression addresses these limitations by modelling the full conditional distribution of $y$ rather than just its mean. The output of the model is no longer a single number, but the parameters of a parametric probability distribution, for example, the mean $\mu$ and standard deviation $\sigma$ of a Normal distribution, both of which can vary as functions of $\mathbf{X}$.

Gradient Boosting Machines for Location, Scale and Shape (GBM-LSS) extend the gradient boosting framework to distributional regression. Instead of fitting a single boosting model to the conditional mean, GBM-LSS fits one gradient boosting model per distributional parameter, guiding each one towards the maximum likelihood estimate of the full distribution. The result is a flexible, non-parametric model for the entire conditional distribution $p(y \mid \mathbf{X})$.

This approach inherits the strengths of modern gradient boosting (handling of non-linear relationships, robustness to outliers with appropriate distributions, compatibility with heterogeneous tabular data, built-in variable selection via shrinkage) while extending its predictions from a single number to a full probability distribution.

This article covers the conceptual foundations, a comparison with alternative probabilistic approaches, the practical Python API, and two worked examples: one on simulated data with known ground truth, and one on a real-world pediatric growth dataset.

Distributional gradient boosting: GBM-LSS

Standard gradient boosting for regression trains a single ensemble of trees that maps features $\mathbf{X}$ to a prediction of the conditional mean $\mathbb{E}[y \mid \mathbf{X}]$. This is fundamentally a point prediction: one number per observation.

GBM-LSS extends this idea so that the output of the model is not one number but an entire probability distribution. To do this, it relies on the concept of distributional regression: rather than modelling only the expected value of $y$, it models all parameters of a chosen parametric distribution as functions of the predictors. For a Normal distribution, that means modelling both the mean $\mu(\mathbf{X})$ and the standard deviation $\sigma(\mathbf{X})$. For a Gamma distribution, it means modelling the shape and rate parameters. Each of those parameters gets its own gradient boosting ensemble, trained jointly to maximise the likelihood of the observed data.

The name Location, Scale and Shape comes from the generic labelling convention used in the GAMLSS framework (Rigby & Stasinopoulos, 2005), which this approach carries over to gradient boosting trees. Location refers to where the distribution is centred ($\mu$), scale to how spread out it is ($\sigma$), and shape to asymmetry and tail behaviour ($\nu$, $\tau$). Not every distribution requires all four: a Normal distribution needs only location and scale, but the framework supports up to four independently modelled parameters. The practical consequence is that the model can capture heteroscedasticity (variance that changes with the predictors), variable skewness, and variable tail weight, none of which a standard GBM can represent.

The two Python implementations covered in this article are XGBoostLSS and LightGBMLSS, both written and maintained by Alexander März. They wrap XGBoost and LightGBM respectively, extending them with distributional regression while keeping the same familiar training API.

How GBM-LSS works

Location, Scale and Shape parameters

Any continuous parametric distribution can be characterised by its parameters. The GBM-LSS framework uses the generic notation $(\mu, \sigma, \nu, \tau)$ for up to four parameters, which loosely map to:

| Parameter | Role | Example |
|---|---|---|
| $\mu$ (Location) | Where the distribution is centred | Mean of Normal, median of Laplace |
| $\sigma$ (Scale) | How spread out the distribution is | Std. dev. of Normal, scale of Gamma |
| $\nu$ (Shape 1) | Asymmetry / skewness | Degrees of freedom of Student-t |
| $\tau$ (Shape 2) | Tail weight / kurtosis | Second shape parameter in four-parameter distributions |

Not all distributions require all four parameters. A Normal distribution needs only $(\mu, \sigma)$; a Student-t needs $(\mu, \sigma, \nu)$; four-parameter distributions are rarely needed in practice. In standard regression, only $\mu$ is modelled as a function of $\mathbf{X}$, while all other parameters are estimated as global constants. GBM-LSS models all distributional parameters as functions of $\mathbf{X}$:

$$\hat{\theta}_k(\mathbf{X}) = h_k^{-1}(F_k(\mathbf{X}))$$

where $h_k$ is a link function that maps the unconstrained output of the boosting model to the valid range of parameter $k$, and $F_k$ is a gradient boosting ensemble.

The training objective: negative log-likelihood

In standard gradient boosting for regression, the model minimises a point-wise loss such as the squared error or the absolute error. In GBM-LSS, the loss is the negative log-likelihood of the assumed distribution:

$$\mathcal{L}(\mathbf{F}) = -\sum_{i=1}^{n} \log p\!\left(y_i \mid \boldsymbol{\theta}(\mathbf{X}_i)\right)$$

where $\mathbf{F} = (F_1, \ldots, F_K)$ is the vector of GBM models, one per distributional parameter, and $K$ is the number of parameters in the chosen distribution. Minimising this loss is equivalent to maximum likelihood estimation of all distributional parameters simultaneously.

Gradient boosting is applied to this objective: at each iteration $t$, a new tree is added to the model for each parameter $k$ by fitting the negative gradient (and Hessian) of $\mathcal{L}$ with respect to the current prediction:

$$r_{i,k}^{(t)} = -\left[\frac{\partial \log p(y_i \mid \boldsymbol{\theta})}{\partial F_k(\mathbf{X}_i)}\right]_{F_k = F_k^{(t-1)}}$$

To compute these derivatives efficiently, frameworks like lightgbmlss leverage PyTorch's automatic differentiation. Gradients and Hessians are computed simultaneously for all parameters under the hood, allowing users to implement custom distributions without manually deriving complex analytical derivatives.
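For intuition, the Gaussian case admits closed-form gradients. Below is a small numpy sketch (an illustration only, not the library's autodiff code) of the pseudo-residuals for $\mu$ with the identity link and $\sigma$ with the log link:

```python
import numpy as np

def gaussian_pseudo_residuals(y, raw_mu, raw_sigma):
    """Negative gradients of the Gaussian NLL with respect to the raw
    (link-scale) boosting outputs F_mu and F_log_sigma."""
    mu = raw_mu                              # identity link
    sigma = np.exp(raw_sigma)                # log link guarantees sigma > 0
    r_mu = (y - mu) / sigma**2               # pulls mu towards y
    r_sigma = (y - mu)**2 / sigma**2 - 1.0   # grows sigma if residuals exceed
    return r_mu, r_sigma                     # the current spread
```

Each boosting round fits one tree to r_mu and one to r_sigma, which are exactly the quantities $r_{i,k}^{(t)}$ defined above.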

The result is $K$ gradient boosting ensembles trained jointly, each responsible for predicting one distributional parameter as a non-linear function of the predictors.

An alternative loss function supported by the library is the Continuous Ranked Probability Score (CRPS) (loss_fn="crps"). CRPS is a proper scoring rule that generalises MAE to probabilistic forecasts and has the same units as $y$. Unlike NLL, CRPS does not require specifying a distributional family — it evaluates forecast quality by comparing the predicted CDF $F$ to the observed outcome $y$:

$$\text{CRPS}(F, y) = \int_{-\infty}^{\infty} \left[F(t) - \mathbf{1}(t \geq y)\right]^2 dt$$

In practice loss_fn="nll" (the default) is the most common choice; loss_fn="crps" can be useful when robustness to distributional misspecification is a priority.
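For a Normal forecast this integral has a closed form (Gneiting & Raftery, 2007), which is handy for sanity-checking CRPS values; a short sketch:

```python
import numpy as np
from scipy import stats

def crps_gaussian(y, mu, sigma):
    """Closed-form CRPS of a Normal(mu, sigma) forecast for outcome y."""
    z = (y - mu) / sigma
    return sigma * (
        z * (2 * stats.norm.cdf(z) - 1)   # penalises mis-centred forecasts
        + 2 * stats.norm.pdf(z)
        - 1 / np.sqrt(np.pi)              # rewards sharper forecasts
    )

# A sharper forecast centred on the outcome scores better (lower CRPS)
print(crps_gaussian(15.0, 15.0, 1.0) < crps_gaussian(15.0, 15.0, 4.0))  # True
```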

Training algorithm

The GBM-LSS training algorithm proceeds as follows:

  1. Initialise each parameter model $F_k^{(0)}$ with the MLE estimate of $\theta_k$ on the training set (a single constant per parameter that maximises the unconditional likelihood).

  2. For each boosting iteration $t = 1, \ldots, T$:

    • Compute the negative gradient (pseudo-residuals) for each parameter $k$.
    • Fit a new regression tree $h_k^{(t)}$ to the pseudo-residuals for parameter $k$.
    • Update $F_k^{(t)} = F_k^{(t-1)} + \eta \cdot h_k^{(t)}$, where $\eta$ is the learning rate.
    • Apply the link function to map the raw prediction to the valid parameter space.
  3. Return the final ensemble $\mathbf{F} = (F_1, \ldots, F_K)$.

After training, a prediction for a new observation $\mathbf{X}^*$ is the personalised distribution:

$$\hat{\boldsymbol{\theta}}(\mathbf{X}^*) = \left(h_1^{-1}(F_1(\mathbf{X}^*)),\ \ldots,\ h_K^{-1}(F_K(\mathbf{X}^*))\right)$$

Each observation receives its own distribution $p(y \mid \hat{\boldsymbol{\theta}}(\mathbf{X}^*))$ rather than a single point estimate.
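To make the loop concrete, here is a deliberately minimal from-scratch version for the Gaussian case, using sklearn decision trees in place of XGBoost/LightGBM. It is an educational toy under simplifying assumptions (plain gradients, no Hessians, no regularisation), not the library's implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gaussian_lss(X, y, n_rounds=100, lr=0.1, max_depth=2):
    # Step 1: initialise each parameter with its unconditional MLE
    mu0, ls0 = y.mean(), np.log(y.std())
    F_mu = np.full(len(y), mu0)   # raw score for mu (identity link)
    F_ls = np.full(len(y), ls0)   # raw score for log(sigma) (log link)
    trees_mu, trees_ls = [], []
    for _ in range(n_rounds):
        sigma = np.exp(F_ls)
        # Step 2a: pseudo-residuals = negative NLL gradients per parameter
        r_mu = (y - F_mu) / sigma**2
        r_ls = (y - F_mu)**2 / sigma**2 - 1.0
        # Step 2b: one tree per distributional parameter
        t_mu = DecisionTreeRegressor(max_depth=max_depth).fit(X, r_mu)
        t_ls = DecisionTreeRegressor(max_depth=max_depth).fit(X, r_ls)
        # Step 2c: shrunken additive update
        F_mu += lr * t_mu.predict(X)
        F_ls += lr * t_ls.predict(X)
        trees_mu.append(t_mu)
        trees_ls.append(t_ls)
    return mu0, ls0, trees_mu, trees_ls, lr

def predict_gaussian_lss(model, X):
    mu0, ls0, trees_mu, trees_ls, lr = model
    F_mu = mu0 + lr * sum(t.predict(X) for t in trees_mu)
    F_ls = ls0 + lr * sum(t.predict(X) for t in trees_ls)
    return F_mu, np.exp(F_ls)     # apply inverse links
```

On heteroscedastic data this recovers a covariate-dependent $\sigma(\mathbf{X})$, which a single-ensemble mean model cannot represent.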

Link functions

Link functions ensure that the raw unconstrained GBM output is mapped to a value in the valid domain of each distributional parameter:

| Parameter domain | Link function | Inverse link | Typical use |
|---|---|---|---|
| $\mathbb{R}$ (unbounded) | Identity: $g(\theta) = \theta$ | $\theta = F$ | Mean of Normal or Laplace |
| $(0, \infty)$ (strictly positive) | Log: $g(\theta) = \log(\theta)$ | $\theta = e^F$ | Scale parameters, Gamma shape |
| $(0, 1)$ (probability) | Logit: $g(\theta) = \log\frac{\theta}{1-\theta}$ | $\theta = \sigma(F)$ | Zero-inflation probability |

For example, when fitting a Normal distribution, the mean $\mu$ uses the identity link (any real number is valid), and the standard deviation $\sigma$ uses the log link (it must be strictly positive). The library handles link function application automatically — the user only needs to choose the distribution family.
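A quick numerical illustration of the three inverse links, mapping the same unconstrained raw scores into each parameter domain (a standalone sketch; the library applies these mappings internally):

```python
import numpy as np

raw = np.array([-2.0, 0.0, 3.0])          # unconstrained boosting outputs

identity = raw                            # mu of a Normal: any real number
positive = np.exp(raw)                    # sigma via log link: always > 0
probability = 1 / (1 + np.exp(-raw))      # pi via logit link: always in (0, 1)

print(positive)       # strictly positive, e.g. exp(-2) ≈ 0.135
print(probability)    # bounded in (0, 1), e.g. sigmoid(3) ≈ 0.953
```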

Available distributions

One of the most consequential modelling decisions is the choice of distributional family. GBM-LSS supports a wide range of parametric distributions. Below is a practical guide organised by the type of response variable.

Continuous symmetric distributions

| Distribution | Parameters | Use case |
|---|---|---|
| Normal | $\mu \in \mathbb{R}$, $\sigma > 0$ | Continuous data, roughly symmetric errors; baseline choice |
| Student-t | $\mu \in \mathbb{R}$, $\sigma > 0$, $\nu > 0$ | Heavy-tailed data, robust to outliers; generalises Normal |
| Laplace | $\mu \in \mathbb{R}$, $\sigma > 0$ | Heavier tails than Normal (exponential decay), sharp peak at mode; analogous to L1 loss |
| Logistic | $\mu \in \mathbb{R}$, $\sigma > 0$ | Slightly heavier tails than Normal |

The Student-t distribution is particularly valuable in practice. When the degrees-of-freedom parameter $\nu$ is large, it approximates a Normal distribution; as $\nu$ decreases, the tails become heavier. This provides robustness to outliers without requiring the analyst to pre-specify outlier thresholds.
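This behaviour is easy to check with scipy: as $\nu$ grows the Student-t quantiles converge to the Normal ones, while small $\nu$ yields much heavier tails:

```python
from scipy import stats

# 99.5% quantile of the standardised distributions
q_norm = stats.norm.ppf(0.995)
q_t3 = stats.t.ppf(0.995, df=3)       # heavy tails: far larger quantile
q_t300 = stats.t.ppf(0.995, df=300)   # nearly indistinguishable from Normal

print(round(q_norm, 3), round(q_t3, 3), round(q_t300, 3))
```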

Continuous asymmetric distributions (positive domain)

| Distribution | Parameters | Use case |
|---|---|---|
| LogNormal | $\mu \in \mathbb{R}$, $\sigma > 0$ | Positive data with right skew; income, prices |
| Gamma | $\mu > 0$, $\sigma > 0$ | Positive data; insurance claims, waiting times |
| Beta | $\mu \in (0,1)$, $\sigma > 0$ | Proportions and rates bounded in $(0, 1)$ |
| Weibull | $\mu > 0$, $\sigma > 0$ | Failure and survival times |
| Gumbel | $\mu \in \mathbb{R}$, $\sigma > 0$ | Extreme values, return levels |


Discrete and count data

| Distribution | Parameters | Use case |
|---|---|---|
| Poisson | $\mu > 0$ | Count data; assumes mean = variance |
| Negative Binomial | $\mu > 0$, $\sigma > 0$ | Overdispersed count data (variance > mean) |
| Zero-Inflated Poisson (ZIP) | $\mu > 0$, $\pi \in (0,1)$ | Counts with excess zeros |
| Zero-Inflated Negative Binomial (ZINB) | $\mu > 0$, $\sigma > 0$, $\pi \in (0,1)$ | Overdispersed counts with excess zeros |


Advanced distribution modeling (Multivariate, Mixtures, and Flows)

Beyond standard univariate distributions, the GBM-LSS framework is generic enough to handle highly complex modeling scenarios:

  • Multivariate Targets: Predicting $D$-dimensional responses simultaneously (e.g., Multivariate Gaussian or Student-t) by modeling not just the means and variances, but also the covariance matrices (using, for example, Cholesky decomposition or Low-Rank approximations) as a function of the predictors.
  • Mixture Distributions: Approximating multi-modal data by combining multiple component distributions (e.g., a Gaussian Mixture where the weights, means, and variances of each component depend on covariates).
  • Normalizing Flows: For extremely complex, non-parametric distributions, the framework supports Normalizing Flows (like Neural Spline Flows) which apply bijective transformations to turn a simple base distribution into a realistic target distribution.

How to choose the right distribution

Domain knowledge: Start with the constraints the response variable must satisfy:

  • Is $y$ strictly positive? → Gamma, LogNormal, Weibull
  • Is $y$ bounded in $[0, 1]$? → Beta
  • Is $y$ an integer (count)? → Poisson, Negative Binomial
  • Does $y$ have a point mass at zero but is otherwise continuous? → Zero-Adjusted distributions (ZAGamma, ZALN, ZABeta)

Exploratory data analysis: Visualise the marginal distribution of $y$:

  • Symmetric, bell-shaped → Normal, Student-t
  • Right-skewed, strictly positive → LogNormal, Gamma
  • Heavy tails on both sides → Student-t
  • Sharp peak → Laplace

Automated selection (lightgbmlss): The lightgbmlss library provides a dist_select helper function (via DistributionClass) that fits each candidate distribution to the target variable by maximum likelihood and returns them ranked by negative log-likelihood (NLL). The distribution with the lowest NLL best explains the observed data. The function also plots the density of the target variable alongside the fitted densities of all candidates.
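The ranking idea behind dist_select can be reproduced manually with scipy by fitting each candidate via maximum likelihood and sorting by NLL. The helper below (rank_distributions is a hypothetical name, a simplified stand-in for the library's function) illustrates it on simulated log-normal data:

```python
import numpy as np
from scipy import stats

def rank_distributions(y, candidates):
    """Fit each scipy distribution to y by MLE and rank by in-sample NLL.
    A simplified, manual analogue of lightgbmlss's dist_select helper."""
    results = []
    for dist in candidates:
        params = dist.fit(y)                     # unconditional MLE
        nll = -np.sum(dist.logpdf(y, *params))   # negative log-likelihood
        results.append((dist.name, nll))
    return sorted(results, key=lambda r: r[1])   # best (lowest NLL) first

rng = np.random.default_rng(123)
y_sim = rng.lognormal(mean=0.0, sigma=0.5, size=2000)
ranking = rank_distributions(y_sim, [stats.norm, stats.lognorm, stats.gamma])
print(ranking[0][0])   # the log-normal family explains this data best
```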

Residual diagnostics: After fitting a candidate model, inspect the PIT (Probability Integral Transform) histogram: the distribution of $F(y_i \mid \hat{\boldsymbol{\theta}}_i)$ across the test set. A well-calibrated model produces a flat histogram; deviations indicate mis-specification:

  • U-shaped → tails too light (model is over-confident)
  • Hump-shaped → tails too heavy (model is under-confident)
  • Skewed → asymmetry of the distribution is mis-specified
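Computing PIT values takes one line once the predicted parameters are known. The sketch below uses a single global $(\mu, \sigma)$ instead of per-observation predictions (a simplifying assumption) to show what a well-calibrated PIT histogram looks like:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
mu, sigma = 15.0, 2.0
y = rng.normal(mu, sigma, size=5000)      # data truly follows the model

# PIT: evaluate the predicted CDF at each observed value
pit = stats.norm.cdf(y, loc=mu, scale=sigma)

# A calibrated model gives a roughly uniform PIT histogram
counts, _ = np.histogram(pit, bins=10, range=(0, 1))
print(counts)   # each of the 10 bins should hold roughly 500 observations
```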

GBM-LSS vs other probabilistic approaches

Several approaches to distributional or probabilistic regression are available in Python. Understanding their trade-offs upfront helps motivate the choice of GBM-LSS.

GBM-LSS vs Quantile Regression

| Aspect | Quantile Regression | GBM-LSS |
|---|---|---|
| Output | Specific quantiles | Full parametric distribution |
| Distributional assumption | None (non-parametric) | Parametric (distribution must be chosen) |
| $P(y > c \mid \mathbf{X})$ for arbitrary $c$ | Requires interpolation | Exact, from the CDF |
| Quantile crossing | Possible | Impossible (parametric coherence) |
| Number of models trained | One per quantile | One per distributional parameter |
| Captures heteroscedasticity | Yes | Yes |
| Captures variable skewness | Partially | Yes |

Choose Quantile Regression when no parametric distribution is a reasonable assumption (e.g., multimodal or mixture-shaped responses), or when only a fixed set of quantile levels is needed.

Choose GBM-LSS when a parametric family is plausible and you want a fully coherent distributional description that supports computing any quantile, any tail probability, and any moment, without training separate models.

GBM-LSS vs NGBoost

| Aspect | NGBoost | GBM-LSS |
|---|---|---|
| Optimisation | Natural gradient (Fisher-information scaled) | Standard gradient (NLL loss) |
| GBDT backend | Custom (ngboost library) | XGBoost / LightGBM (optimised) |
| Speed | Slower | Faster (especially on large datasets) |
| Distributional families | Limited (~5) | Extensive (30+) |
| Feature importance | Standard per-tree importance | SHAP per distributional parameter |
| Active development | Slower (as of 2026) | Active |

In practice, GBM-LSS typically outperforms NGBoost on large tabular datasets in both speed and distributional accuracy, thanks to the highly optimised XGBoost and LightGBM backends. NGBoost's natural gradient update can be advantageous when the likelihood surface is poorly conditioned, but this is rarely decisive in practice.

Summary recommendation

| Situation | Recommended approach |
|---|---|
| Quick baseline, only a few quantile levels needed | Quantile regression (GBM with pinball loss) |
| Small dataset, statistical inference with coefficients | GAMLSS (R) or NGBoost |
| Large tabular dataset, known parametric family | GBM-LSS (XGBoostLSS / LightGBMLSS) |
| Unknown or complex distributional shape | Quantile regression or normalising flows |
| Full Bayesian inference required | PyMC or Stan |

Example 1: Simulated heteroscedastic data

The first example uses simulated data where the variance of the response variable changes as a function of the predictor. This is a canonical case where standard regression fails and distributional modelling shines.

Problem setting: A utility company monitors hourly electricity consumption across thousands of households. The average consumption is roughly constant throughout the day (~15 MWh), but the variability is not: consumption is tightly clustered at night and highly dispersed during peak morning and evening hours. The company wants to:

  1. Predict the full distribution of consumption for each hour.
  2. Compute prediction intervals with 90% coverage.
  3. Estimate the probability that consumption exceeds a threshold at each hour.

Libraries and data

# Libraries
# ==============================================================================
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import lightgbm as lgb
from lightgbmlss.model import LightGBMLSS
from lightgbmlss.distributions.Gaussian import Gaussian
import lightgbmlss.distributions.Gaussian as Gaussian_mod
import lightgbmlss.distributions.StudentT as StudentT_mod
import lightgbmlss.distributions.Gamma as Gamma_mod
import lightgbmlss.distributions.LogNormal as LogNormal_mod
from sklearn.model_selection import train_test_split
from scipy import stats

import warnings
warnings.filterwarnings('once')

plt.style.use('fivethirtyeight')
plt.rcParams['lines.linewidth'] = 1.5
plt.rcParams['font.size'] = 8

✏️ Note

Users can switch from the LightGBM-based model to the XGBoost-based model simply by replacing lightgbmlss with xgboostlss.

# Simulated data
# ==============================================================================
# Simulation slightly modified from the example published in
# XGBoostLSS - An extension of XGBoost to probabilistic forecasting, by Alexander März
np.random.seed(1234)
n = 3000
X = np.linspace(0, 24, n)
y = np.random.normal(
    loc=15,
    scale=(
        1
        + 1.5 * ((4.8 < X) & (X < 7.2))
        + 4.0 * ((7.2 < X) & (X < 12.0))
        + 1.5 * ((12.0 < X) & (X < 14.4))
        + 2.0 * (X > 16.8)
    )
)

The data is generated from $\mathcal{N}(15, \sigma^2(t))$ where $\sigma(t)$ changes with the hour of the day. Standard GBM captures the constant mean well but cannot represent the changing variance.

# Compute the 0.1 and 0.9 quantiles for each simulated value of X
# ==============================================================================
quantile_10 = stats.norm.ppf(
                q = 0.1,
                loc   = 15,
                scale = (
                    1 + 1.5 * ((4.8 < X) & (X < 7.2)) + 4 * ((7.2 < X) & (X < 12))
                    + 1.5 * ((12 < X) & (X < 14.4)) + 2 * (X > 16.8)
                )
             )

quantile_90 = stats.norm.ppf(
                q = 0.9,
                loc   = 15,
                scale = (
                    1 + 1.5 * ((4.8 < X) & (X < 7.2)) + 4 * ((7.2 < X) & (X < 12))
                    + 1.5 * ((12 < X) & (X < 14.4)) + 2 * (X > 16.8)
                )
             )
# Plot
# ==============================================================================
fig, ax = plt.subplots(figsize=(7, 3.5))
ax.scatter(X, y, alpha = 0.3, c = "black", s = 8)
ax.hlines(y=15, xmin=0, xmax=24, colors='tomato', lw=2, label='True mean (15 MWh)')
ax.plot(X, quantile_10, c = "black")
ax.plot(X, quantile_90, c = "black")
ax.fill_between(
    X, quantile_10, quantile_90, alpha=0.3, color='red',
    label='Theoretical quantile interval 0.1-0.9'
)
ax.set_xlabel('Hour of day')
ax.set_ylabel('Consumption (MWh)')
ax.set_title('Hourly electricity consumption')
ax.set_xticks(range(0, 25, 2))
ax.legend()
plt.tight_layout()

Model training

# Train-test split
# ==============================================================================
X_train, X_test, y_train, y_test = train_test_split(
    X.reshape(-1, 1), y, test_size=0.2, random_state=42, shuffle=True
)

# When using XGBoostLSS, data must be transformed into an xgb.DMatrix
# dX_train = xgb.DMatrix(X_train, label=y_train)
# dX_test  = xgb.DMatrix(X_test,  label=y_test)

# When using LightGBMLSS, data must be transformed into an lgb.Dataset
dX_train = lgb.Dataset(X_train, label=y_train)
dX_test  = lgb.Dataset(X_test,  label=y_test)

print(f"Train set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
Train set: 2400 samples
Test set: 600 samples

When fitting a GBM-LSS model, a distributional family must be specified. Each distribution class accepts three key arguments:

  • stabilization : "None" | "MAD" | "L2" — gradient/Hessian stabilization method
  • response_fn : "exp" | "softplus" — inverse link for positive parameters (e.g. scale)
  • loss_fn : "nll" | "crps" — training loss (negative log-likelihood or CRPS)

In this case, we choose the Normal distribution, which has two parameters: $\mu$ (mean) and $\sigma$ (standard deviation). The mean is modelled with the identity link, and the standard deviation with the log link to ensure positivity. The training objective is the negative log-likelihood of the Normal distribution.

# Model specification
# ==============================================================================
model = LightGBMLSS(
    Gaussian(stabilization="None", response_fn="exp", loss_fn="nll")
)

To find the optimal hyperparameters, a Bayesian search is performed using the integration with Optuna provided by the library. The search optimises the validation negative log-likelihood over a predefined hyperparameter space, using early stopping to prevent overfitting. The best model is then refit on the entire training set with the optimal hyperparameters.

# Hyperparameter optimization
# ==============================================================================
# !pip install optuna-integration[lightgbm]

params_dict = {
    "learning_rate":            ["float", {"low": 1e-5,   "high": 1,     "log": True}],
    "max_depth":                ["int",   {"low": 1,      "high": 10,    "log": False}],
    "min_gain_to_split":        ["float", {"low": 1e-8,   "high": 40,    "log": False}],
    "min_sum_hessian_in_leaf":  ["float", {"low": 1e-8,   "high": 500,   "log": True}],
    "subsample":                ["float", {"low": 0.2,    "high": 1.0,   "log": False}],
    "feature_fraction":         ["float", {"low": 0.2,    "high": 1.0,   "log": False}],
    "boosting":                 ["categorical", ["gbdt"]],
}

opt_params = model.hyper_opt(
    hp_dict = params_dict,
    train_set = dX_train,
    num_boost_round=100,       # Number of boosting iterations.
    nfold=5,                   # Number of cv-folds.
    early_stopping_rounds=20,  # Number of early-stopping rounds
    max_minutes=10,            # Time budget in minutes, i.e., stop study after the given number of minutes.
    n_trials=30,               # The number of trials. If this argument is set to None, there is no limitation on the number of trials.
    silence=True,              # Controls the verbosity of each trial, i.e., whether per-trial output is silenced.
    seed=123,                  # Seed used to generate cv-folds.
    hp_seed=6715               # Seed for random number generator used in the Bayesian hyperparameter search.
)
Hyper-Parameter Optimization successfully finished.
  Number of finished trials:  30
  Best trial:
    Value: 2.265606331242457
    Params: 
    learning_rate: 0.9273640481563128
    max_depth: 1
    min_gain_to_split: 0.035986405436204016
    min_sum_hessian_in_leaf: 0.025909830444300425
    subsample: 0.7788823527111574
    feature_fraction: 0.6019028154022705
    boosting: gbdt
    opt_rounds: 31

The model is trained with the best hyperparameters found in the search.

# Train model with optimal hyperparameters
# ==============================================================================
n_rounds = opt_params.pop("opt_rounds")

model.train(
    params = opt_params,
    train_set = dX_train,
    num_boost_round = n_rounds
)

Predictions

lightgbmlss provides three different options for making predictions:

  • samples: draws $n$ samples from the predicted distribution.

  • quantiles: calculates quantiles from the predicted distribution.

  • parameters: returns predicted distributional parameters.

When using prediction type samples, the output is a DataFrame with one row per observation in the test set and n_samples columns, each containing a draw from the predicted distribution for that observation. These represent possible future outcomes consistent with the model's uncertainty.

# Sample from predicted distribution
# ==============================================================================
sample_predictions = model.predict(data=X_test, pred_type="samples", n_samples=50)
sample_predictions
y_sample0 y_sample1 y_sample2 y_sample3 y_sample4 y_sample5 y_sample6 y_sample7 y_sample8 y_sample9 ... y_sample40 y_sample41 y_sample42 y_sample43 y_sample44 y_sample45 y_sample46 y_sample47 y_sample48 y_sample49
0 15.697314 16.262325 16.202932 15.307989 17.619083 13.853810 15.308901 12.309548 15.882831 19.780371 ... 17.041090 16.348074 14.840222 11.370354 12.450369 13.575173 13.577733 16.836323 13.708682 18.007421
1 14.166789 18.025551 23.503204 19.067127 11.026548 14.392353 7.978942 4.036101 25.193146 14.973810 ... 11.270795 14.132236 15.480999 10.183868 19.415047 27.263340 12.544895 14.446023 19.760115 10.920843
2 14.586010 13.869585 15.001597 16.088562 14.989753 15.055332 13.881459 14.251566 15.061149 15.794733 ... 15.848513 15.297975 16.091333 14.388843 15.170838 14.735168 14.798789 13.742496 13.630329 13.089798
3 14.592566 14.624412 14.269474 15.491898 15.569511 16.288342 15.164846 14.566668 13.327274 14.233431 ... 13.844853 15.285441 14.233530 14.675815 15.908656 15.858871 15.202689 15.007872 14.247766 14.850210
4 16.392326 14.859611 19.943062 13.869667 17.015881 15.606420 22.863440 16.727417 20.740311 12.586379 ... 16.827066 13.654144 14.565105 19.044834 13.281296 10.824958 10.942783 11.074723 12.510992 13.662723
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
595 14.199569 12.481850 15.573352 15.122883 15.642666 13.908475 16.400906 15.557431 17.194227 17.689520 ... 15.000471 17.024361 14.361902 15.227952 14.057050 16.765730 15.868464 15.708106 14.718128 13.703343
596 16.862871 15.523841 14.538917 13.587122 15.685340 16.660721 15.864644 14.723064 15.661668 17.678423 ... 14.425453 14.341781 15.765164 14.548565 13.399756 17.723585 16.486721 17.126137 14.486805 17.241503
597 16.452101 14.660017 14.022560 14.965786 15.117435 14.897168 15.493895 15.492521 15.704017 15.135339 ... 14.843552 14.960234 15.990202 14.974794 15.501653 16.243484 15.444390 14.754246 14.522926 15.973195
598 14.150993 16.409111 15.669437 15.129354 20.987328 15.376616 15.166396 15.373619 15.436773 14.702501 ... 14.456417 17.717783 14.643255 11.609344 17.478497 14.445816 14.792369 15.499956 14.302234 15.070537
599 18.794319 22.029449 16.802168 12.680256 8.722343 13.182603 15.911584 6.012594 20.524622 18.866379 ... 15.987385 6.264203 9.161081 5.458413 18.129385 17.038593 17.099699 26.817642 18.101303 17.728611

600 rows × 50 columns

# Prediction of quantiles
# ==============================================================================
quantile_predictions = model.predict(data=X_test, pred_type="quantiles", quantiles=[0.1, 0.9])
quantile_predictions
quant_0.1 quant_0.9
0 12.138686 17.828659
1 8.832257 21.029195
2 13.532447 16.402666
3 13.928053 16.376365
4 11.775336 19.083310
... ... ...
595 13.756510 16.352524
596 13.673839 16.496371
597 13.872277 16.089009
598 12.043160 18.024937
599 9.239491 21.524747

600 rows × 2 columns

# Prediction of distributional parameters
# ==============================================================================
parameter_predictions = model.predict(data=X_test, pred_type="parameters")
parameter_predictions
loc scale
0 14.931164 2.270949
1 15.008944 4.737135
2 14.931164 1.137141
3 15.172076 0.985543
4 15.406059 2.829181
... ... ...
595 15.053152 0.985543
596 15.050743 1.137141
597 15.008944 0.855896
598 14.931164 2.270949
599 15.332772 4.737135

600 rows × 2 columns

# Plot quantile interval 10%-90%
# ==============================================================================
sort_idx = np.argsort(X_test.ravel())
X_test_sorted = X_test.ravel()[sort_idx]
quantile_predictions_sorted = quantile_predictions.iloc[sort_idx]

fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(7, 3.5))
ax.scatter(X, y, alpha = 0.3, c = "black", s = 8)
ax.plot(X, quantile_10, c = "black")
ax.plot(X, quantile_90, c = "black")
ax.fill_between(X, quantile_10, quantile_90, alpha=0.3, color='red',
                label='Theoretical quantile interval 0.1-0.9')
ax.plot(X_test_sorted, quantile_predictions_sorted.iloc[:, 0], c = "blue", label='Predicted quantile interval')
ax.plot(X_test_sorted, quantile_predictions_sorted.iloc[:, 1], c = "blue")

ax.set_xticks(range(0, 25))
ax.set_title('Electricity consumption throughout the day')
ax.set_xlabel('Hour of the day')
ax.set_ylabel('Electricity consumption (MWh)')
plt.legend();

Coverage

One of the metrics used to evaluate prediction intervals is coverage: the percentage of observations that fall within the estimated interval. A well-calibrated model achieves empirical coverage close to the nominal level of the interval.

In this example, quantiles 0.1 and 0.9 have been computed, so the interval has a nominal coverage level of 80%. If the predicted interval is correct, approximately 80% of observations should fall within it.

# Coverage of the predicted interval
# ==============================================================================
lower = quantile_predictions.iloc[:, 0].to_numpy()
upper = quantile_predictions.iloc[:, 1].to_numpy()

coverage = ((y_test >= lower) & (y_test <= upper)).mean()
mean_width = (upper - lower).mean()

print(f'Empirical 80% PI coverage : {coverage:.3f}')
print(f'Mean interval width       : {mean_width:.3f} (MWh)')
Empirical 80% PI coverage : 0.747
Mean interval width       : 6.470 (MWh)

The empirical coverage (74.7%) is slightly below the nominal 80%, indicating that the predicted intervals are marginally too narrow.

Probabilistic prediction

For example, to determine the probability that at 8 a.m. consumption exceeds 22.5 MWh, the distribution parameters are first predicted for hour = 8; then, using those parameters, $P(\text{consumption} \geq 22.5)$ is computed from the cumulative distribution function.

# Predict distribution parameters when hour=8
# ==============================================================================
x_new = np.array([[8.0]])
params_at_8 = model.predict(data=x_new, pred_type="parameters")
print("Predicted distribution parameters when hour=8:")
print(params_at_8)
Predicted distribution parameters when hour=8:
         loc     scale
0  15.008944  4.737135
# Probability that consumption exceeds 22.5 when hour=8
# ==============================================================================
# Column names depend on the chosen distribution (e.g., 'loc'/'scale' for Gaussian)
mu_8 = params_at_8['loc'].to_numpy()[0]
sigma_8 = params_at_8['scale'].to_numpy()[0]

prob_exceed = 1 - stats.norm.cdf(22.5, loc=mu_8, scale=sigma_8)
print(f'P(consumption >= 22.5 MWh | hour=8) = {prob_exceed:.4f}')
P(consumption >= 22.5 MWh | hour=8) = 0.0569

According to the distribution predicted by the model, there is a 5.69% probability that consumption at 8 a.m. will be equal to or above 22.5 MWh.

If this same calculation is performed for the entire hourly range, the probability that consumption exceeds a given value throughout the day can be determined.

# Probability prediction across the full hourly range of consumption exceeding 22.5 MWh
# ==============================================================================
hours = np.linspace(0, 24, 500)
d_hours = hours.reshape(-1, 1)
params_hours = model.predict(data=d_hours, pred_type="parameters")

prob_hours = 1 - stats.norm.cdf(
    22.5,
    loc=params_hours['loc'].to_numpy(),
    scale=params_hours['scale'].to_numpy()
)

fig, ax = plt.subplots(figsize=(7, 3.5))
ax.plot(hours, prob_hours)
ax.set_xlabel('Hour of the day')
ax.set_ylabel('P(consumption ≥ 22.5 MWh)')
ax.set_title('Probability of exceeding 22.5 MWh throughout the day')
ax.set_xticks(range(0, 25, 2))
plt.tight_layout()

Example 2

The dbbmi dataset (Fourth Dutch Growth Study, Fredriks et al. (2000a, 2000b)), available in the R package gamlss.data, contains information on age and body mass index (bmi) of 7,294 Dutch youths aged 0 to 20. The goal is to train a model capable of predicting quantiles of body mass index as a function of age, as this is one of the standards used to detect anomalous cases that may require medical attention.
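As a quick illustration of how predicted quantiles can serve as such a screening standard, observations falling outside a reference band can be flagged for review. The bmi values and quantile bounds below are made up for illustration only:

```python
import numpy as np

# Hypothetical observed bmi values and predicted reference quantiles per child
bmi = np.array([13.9, 22.8, 16.1, 30.5])
q05 = np.array([13.0, 14.0, 14.5, 15.0])   # lower reference quantile (age-specific)
q95 = np.array([18.5, 19.5, 20.5, 21.5])   # upper reference quantile (age-specific)

# Flag observations outside the 90% reference band as candidates for review
anomalous = (bmi < q05) | (bmi > q95)
print(anomalous)  # [False  True False  True]
```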

Libraries and data

# Libraries
# ==============================================================================
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import lightgbm as lgb
from lightgbmlss.model import LightGBMLSS
from lightgbmlss.distributions.Gaussian import Gaussian
from lightgbmlss.distributions.Gamma import Gamma
from lightgbmlss.distributions.LogNormal import LogNormal
from lightgbmlss.distributions.StudentT import StudentT
from lightgbmlss.distributions.distribution_utils import DistributionClass
import lightgbmlss.distributions.Gaussian as Gaussian_mod
import lightgbmlss.distributions.StudentT as StudentT_mod
import lightgbmlss.distributions.Gamma as Gamma_mod
import lightgbmlss.distributions.LogNormal as LogNormal_mod
import warnings

warnings.filterwarnings('once')
plt.style.use('fivethirtyeight')
plt.rcParams['lines.linewidth'] = 1.5
plt.rcParams['font.size'] = 8
# Load data
# ==============================================================================
url = (
    'https://raw.githubusercontent.com/JoaquinAmatRodrigo/'
    'Estadistica-machine-learning-python/master/data/dbbmi.csv'
)
data = pd.read_csv(url)
data.head(3)
    age        bmi
0  0.03  13.235289
1  0.04  12.438775
2  0.04  14.541775
# Plot distribution of bmi values
# ==============================================================================
fig, ax = plt.subplots(figsize=(6, 3))
ax.hist(data.bmi, bins=30, density=True, color='#3182bd', alpha=0.8)
ax.set_title('Distribution of body mass index values')
ax.set_xlabel('bmi')
ax.set_ylabel('density');
# Plot distribution of body mass index as a function of age
# ==============================================================================
fig, axs = plt.subplots(ncols=2, nrows=1, figsize=(8, 4), sharey=True)
data[data.age <= 2.5].plot(
    x    = 'age',
    y    = 'bmi',
    c    = 'black',
    kind = "scatter",
    alpha = 0.1,
    ax   = axs[0]
)
axs[0].set_ylim([10, 30])
axs[0].set_title('Age <= 2.5 years')

data[data.age > 2.5].plot(
    x    = 'age',
    y    = 'bmi',
    c    = 'black',
    kind = "scatter",
    alpha = 0.1,
    ax   = axs[1]
)
axs[1].set_title('Age > 2.5 years')

fig.tight_layout()
plt.subplots_adjust(top = 0.85)
fig.suptitle('Distribution of body mass index as a function of age', fontsize = 12);

This distribution exhibits three important characteristics that must be taken into account when modeling it:

  • The relationship between age and body mass index is neither linear nor constant. There is a notable positive relationship until about 0.25 years of age, then it stabilizes until age 10, and rises sharply again from ages 10 to 20.

  • Variance is heterogeneous (heteroscedasticity), being lower at younger ages and higher at older ages.

  • The distribution of the response variable is not normal; it shows skewness and a positive tail.

Given these characteristics, a model is needed that:

  • Is capable of learning non-linear relationships.

  • Can explicitly model variance as a function of the predictors, since it is not constant.

  • Is capable of learning skewed distributions with a pronounced positive tail.
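The heteroscedasticity claim can be checked numerically by computing the dispersion of bmi within age bins. A minimal sketch on synthetic data shaped like this problem (the real data would be binned the same way):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic stand-in for dbbmi: noise grows with age, mimicking heteroscedasticity
age = rng.uniform(0, 20, size=5_000)
bmi = 14 + 0.4 * age + rng.normal(0, 0.5 + 0.15 * age)
df = pd.DataFrame({'age': age, 'bmi': bmi})

# Standard deviation of bmi per age bin: increasing values reveal heteroscedasticity
bins = pd.cut(df['age'], bins=[0, 2.5, 5, 10, 15, 20])
print(df.groupby(bins, observed=True)['bmi'].std().round(2))
```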

Distribution selection

lightgbmlss provides an automated distribution selection procedure via DistributionClass.dist_select(). The function fits each candidate distribution to the target variable by maximum likelihood and ranks them by negative log-likelihood (NLL). A lower NLL indicates a better fit to the marginal (unconditional) distribution of the response, ignoring the features $\mathbf{X}$.
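To make the idea concrete, the following sketch reproduces the principle with scipy: each candidate distribution is fitted by maximum likelihood to a synthetic right-skewed sample and its NLL is computed. This illustrates the selection criterion, not the internals of dist_select():

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)
# Synthetic right-skewed target, loosely mimicking the bmi variable
y = rng.lognormal(mean=2.85, sigma=0.25, size=3_000)

# Fit each candidate by maximum likelihood and compute its negative log-likelihood
candidates = {'Gaussian': stats.norm, 'Gamma': stats.gamma, 'LogNormal': stats.lognorm}
results = {}
for name, dist in candidates.items():
    params = dist.fit(y)                          # MLE estimate of the parameters
    results[name] = -dist.logpdf(y, *params).sum()

# Rank candidates: lower NLL indicates a better fit to the marginal distribution
for name, nll in sorted(results.items(), key=lambda kv: kv[1]):
    print(f'{name:10s} NLL = {nll:.1f}')
```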

# Distribution selection
# ==============================================================================
lgblss_dist_class = DistributionClass()
candidate_distributions = [Gaussian_mod, StudentT_mod, Gamma_mod, LogNormal_mod]

dist_nll = lgblss_dist_class.dist_select(
    target=data['bmi'].to_numpy(),
    candidate_distributions=candidate_distributions,
    max_iter=50,
    plot=True,
    figure_size=(6, 3)
)
dist_nll
               nll distribution
rank
1     17688.355825    LogNormal
2     17807.554688        Gamma
3     18133.892757     Gaussian
4     18175.127931     StudentT

Based on the NLL ranking, we select the distribution with the lowest value and proceed with model training. For this data set, LogNormal ranks highest, which makes sense given the positive skew and heavy tail observed in the distribution of body mass index.

Model training and hyperparameter tuning

# Train-test split
# ==============================================================================
X_train, X_test, y_train, y_test = train_test_split(
    data['age'].to_numpy().reshape(-1, 1),
    data['bmi'].to_numpy(),
    test_size=0.2,
    random_state=42
)

# Model training with the selected distribution and hyperparameter optimization
# ==============================================================================
# When using XGBoostLSS data must be transformed into a xgb.DMatrix
# d_data = xgb.DMatrix(X_train, label=y_train)

# When using LightGBMLSS data must be transformed into a lgb.Dataset
d_data = lgb.Dataset(X_train, label=y_train)

model = LightGBMLSS(
    LogNormal(stabilization='None', response_fn='exp', loss_fn='nll')
)

params_dict = {
    "learning_rate":            ["float", {"low": 1e-5,   "high": 1,     "log": True}],
    "max_depth":                ["int",   {"low": 1,      "high": 10,    "log": False}],
    "min_gain_to_split":        ["float", {"low": 1e-8,   "high": 40,    "log": False}],
    "min_sum_hessian_in_leaf":  ["float", {"low": 1e-8,   "high": 500,   "log": True}],
    "subsample":                ["float", {"low": 0.2,    "high": 1.0,   "log": False}],
    "feature_fraction":         ["float", {"low": 0.2,    "high": 1.0,   "log": False}],
    "boosting":                 ["categorical", ["gbdt"]],
}

opt_params = model.hyper_opt(
    hp_dict=params_dict,
    train_set=d_data,
    num_boost_round=100,
    nfold=5,
    early_stopping_rounds=10,
    max_minutes=10,
    n_trials=30,
    silence=True,
    seed=123,
    hp_seed=123
)
Hyper-Parameter Optimization successfully finished.
  Number of finished trials:  30
  Best trial:
    Value: 2.0346465443268174
    Params: 
    learning_rate: 0.7526113148050829
    max_depth: 1
    min_gain_to_split: 0.12693190605161014
    min_sum_hessian_in_leaf: 1.17102461864558e-08
    subsample: 0.9928628563764392
    feature_fraction: 0.20415744257468788
    boosting: gbdt
    opt_rounds: 100
# Train model with optimal hyperparameters
# ==============================================================================
n_rounds = opt_params.pop('opt_rounds')
model.train(opt_params, d_data, num_boost_round=n_rounds)

Predictions and prediction intervals

Once the model is trained, the goal is to predict quantiles of BMI as a function of age: not a single expected value, but the shape of the full conditional distribution $p(\text{bmi} \mid \text{age})$ at each age.

Concretely, three quantile curves are predicted over a fine grid of age values:

  • q0.05 and q0.95: the lower and upper bounds of a 90% prediction interval. For any given age, 90% of children are expected to have a BMI within this range.
  • q0.50: the predicted median, a robust measure of central tendency for the right-skewed BMI distribution.

A regular grid of 500 evenly-spaced age values (rather than the observed ages) is used so the resulting curves are smooth, making the heteroscedasticity and non-linearity visually clear across the full age range.
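For a LogNormal distribution, these quantiles have a closed form in terms of the predicted parameters: $q_\alpha = \exp(\mu + \sigma z_\alpha)$, where $z_\alpha$ is the standard Normal quantile, assuming loc ($\mu$) and scale ($\sigma$) refer to the underlying Normal, as in torch's LogNormal. A quick sketch verifying this against scipy, with arbitrary parameter values:

```python
import numpy as np
from scipy import stats

# Hypothetical parameters as a LogNormal GBM-LSS might predict: loc (mu) and
# scale (sigma) of the underlying Normal distribution
mu, sigma = 2.7, 0.12

for q in (0.05, 0.50, 0.95):
    closed_form = np.exp(mu + sigma * stats.norm.ppf(q))       # exp(mu + sigma * z_q)
    via_scipy = stats.lognorm.ppf(q, s=sigma, scale=np.exp(mu))
    print(f'q{q:.2f}: {closed_form:.3f}  (scipy: {via_scipy:.3f})')
```

Note scipy's parameterization: the shape argument `s` is $\sigma$ and `scale` is $e^{\mu}$, so the median is simply $e^{\mu}$.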

# Prediction of the median (50% quantile) and the 90% prediction interval
# ==============================================================================
age_grid = np.linspace(data['age'].min(), data['age'].max(), 500)
quantiles_to_predict = [0.05, 0.5, 0.95]
quantile_grid = model.predict(
    data=age_grid.reshape(-1, 1), pred_type='quantiles', quantiles=quantiles_to_predict
)
quantile_grid.head(3)
   quant_0.05  quant_0.5  quant_0.95
0   12.119442  13.910258   15.860954
1   12.080288  13.831336   15.894178
2   13.200316  14.995238   16.919324
# Plot
# ==============================================================================
q_grid = quantile_grid.to_numpy()  # shape (500, 3): q0.05, q0.50, q0.95
mask_young_grid = age_grid <= 2.5
mask_old_grid   = age_grid > 2.5

fig, axs = plt.subplots(ncols=2, nrows=1, figsize=(9, 4.5), sharey=True)

for ax, gmask, dmask, title in zip(
    axs,
    [mask_young_grid, mask_old_grid],
    [data['age'] <= 2.5, data['age'] > 2.5],
    ['Age ≤ 2.5 years', 'Age > 2.5 years']
):
    ax.scatter(data.loc[dmask, 'age'], data.loc[dmask, 'bmi'], c='black', alpha=0.1, s=6)
    ax.fill_between(
        age_grid[gmask], q_grid[gmask, 0], q_grid[gmask, 2],
        alpha=0.3, color='tomato', label='90% prediction interval'
    )
    ax.plot(age_grid[gmask], q_grid[gmask, 1], color='tomato', lw=2, label='Median (q0.50)')
    ax.plot(age_grid[gmask], q_grid[gmask, 0], color='tomato', lw=1, ls='--', label='q0.05 / q0.95')
    ax.plot(age_grid[gmask], q_grid[gmask, 2], color='tomato', lw=1, ls='--')
    ax.set_title(title)
    ax.set_xlabel('Age (years)')
    ax.set_ylabel('BMI')

axs[0].legend(fontsize=7)
fig.suptitle('Predicted quantiles of BMI by age (LogNormal GBM-LSS)', fontsize=11)
plt.tight_layout()

Finally, the empirical coverage of the 90% prediction interval is calculated on the held-out test set; a well-calibrated model should achieve coverage close to 0.90.

# Prediction of quantiles
# ==============================================================================
quantile_predictions = model.predict(
    data=X_test,
    pred_type="quantiles",
    quantiles=[0.05, 0.95]
)
quantile_predictions
      quant_0.05  quant_0.95
0      13.938657   21.610874
1      17.045627   25.072260
2      14.604642   18.645010
3      13.751138   19.467578
4      14.599775   21.713120
...          ...         ...
1454   14.551365   18.590945
1455   16.931873   25.026875
1456   14.720979   19.045074
1457   16.014579   23.116499
1458   14.618010   18.566463

[1459 rows x 2 columns]

# Coverage of the predicted interval
# ==============================================================================
lower = quantile_predictions.iloc[:, 0].to_numpy()
upper = quantile_predictions.iloc[:, 1].to_numpy()

coverage = ((y_test >= lower) & (y_test <= upper)).mean()
mean_width = (upper - lower).mean()

print(f'Empirical 90% PI coverage : {coverage:.3f}')
print(f'Mean interval width       : {mean_width:.3f}')
Empirical 90% PI coverage : 0.906
Mean interval width       : 6.159

Citation

How to cite this document

If you use this document or any part of it, please acknowledge the source, thank you!

Probabilistic Gradient Boosting with Python: A Guide to GBM-LSS by Joaquín Amat Rodrigo available under Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0 DEED) at https://cienciadedatos.net/documentos/py26b-probabilistic-gradient-boosting-python-gbmlss.html


Did you like the article? Your support is important

Your contribution will help me to continue generating free educational content. Many thanks! 😊

Become a GitHub Sponsor

Creative Commons Licence

This work by Joaquín Amat Rodrigo, Javier Escobar Ortiz is licensed under an Attribution-NonCommercial-ShareAlike 4.0 International license.

Allowed:

  • Share: copy and redistribute the material in any medium or format.

  • Adapt: remix, transform, and build upon the material.

Under the following terms:

  • Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

  • NonCommercial: You may not use the material for commercial purposes.

  • ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.