Introduction¶
Most machine learning methods applied to regression problems model the behavior of a response variable $y$ as a function of one or more predictors $X$, and generate "point estimate" predictions. These predictions correspond to the expected value of $y$ given a specific value of the predictors ($\mathbb{E}[y|X]$).
The main limitation of these models is that they offer no information about the uncertainty associated with the predicted value. They also cannot answer questions such as:
What is the probability that, given a value of $X$, the response variable $y$ is below or above a certain threshold?
What is the probability that, given a value of $X$, the response variable $y$ falls within a given interval?
To answer these questions, probabilistic regression methods are needed — methods capable of modeling the conditional probability distribution of the response variable as a function of the predictors ($P(y|X)$).
Consider the following simulated (and highly simplified) example of the evolution of electricity consumption across all households in a city as a function of the time of day.
The mean electricity consumption is the same throughout the day, $\overline{consumption}$ = 15 MWh; however, its dispersion is not constant (heteroscedasticity).
Now suppose the company responsible for supplying electricity must be able to provision, at any given time, up to 50% extra electricity above average. This means a maximum of 22.5 MWh. Being prepared to supply this additional energy involves staffing and equipment costs, so the company wonders whether it is necessary to be ready to produce that amount all day, or whether it could be avoided during certain hours, thereby saving costs.
A model that only predicts the average cannot answer this question because, both at 2 a.m. and at 8 a.m., the predicted average consumption is around 15 MWh. However, the probability of reaching 22.5 MWh at 2 a.m. is practically zero, whereas at 8 a.m. it is quite plausible.
One way to predict the probability that electricity consumption will exceed a given value at a given hour is to model its probability distribution as a function of the time of day.
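As a sketch of the idea with plain scipy.stats — assuming, purely for illustration, that consumption is normally distributed with mean 15 MWh and a standard deviation of 1 MWh at 2 a.m. and 5 MWh at 8 a.m. (hypothetical values, not model output):

```python
from scipy import stats

# Hypothetical hour-dependent dispersion: same mean, different scale
p_2am = 1 - stats.norm.cdf(22.5, loc=15, scale=1)   # sigma = 1 MWh at 2 a.m.
p_8am = 1 - stats.norm.cdf(22.5, loc=15, scale=5)   # sigma = 5 MWh at 8 a.m.

print(f"P(consumption > 22.5 MWh) at 2 a.m.: {p_2am:.1e}")
print(f"P(consumption > 22.5 MWh) at 8 a.m.: {p_8am:.1%}")
```

The two hours share the same expected value, yet their exceedance probabilities differ by many orders of magnitude — precisely the information a point-estimate model cannot provide.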
"
Natural Gradient Boosting (NGBoost)¶
Natural Gradient Boosting (NGBoost) is a generalization of gradient boosting that is capable of estimating the parameters $\theta$ of a conditional probability distribution, thus enabling probabilistic predictions. Without going into mathematical details, the two main differences from standard gradient boosting are:
It uses natural gradient descent instead of standard gradient descent for model fitting. Unlike the standard gradient, the natural gradient corrects for the curvature of the parameter space of the distribution (using the Fisher information matrix), making the optimization steps invariant to reparameterization and improving convergence.
It estimates the parameters of a parametric probability distribution $P_{\theta}(y|X)$ instead of a point estimate $\mathbb{E}[y|X]$.
For example, if the chosen distribution is a normal, NGBoost models the mean $\mu$ and standard deviation $\sigma$ as functions of the predictors $X$. This way, it is not necessary to assume homoscedasticity (constant variance across the range of $X$).
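The natural-gradient update from the first point can be sketched mathematically. For the log score, $\mathcal{S}(\theta) = -\log P_{\theta}(y)$, the natural gradient rescales the ordinary gradient by the inverse of the Fisher information matrix $\mathcal{I}(\theta)$:

$$\tilde{\nabla}_{\theta}\,\mathcal{S} = \mathcal{I}(\theta)^{-1}\,\nabla_{\theta}\,\mathcal{S}, \qquad \mathcal{I}(\theta) = \mathbb{E}_{y \sim P_{\theta}}\left[\nabla_{\theta}\log P_{\theta}(y)\,\nabla_{\theta}\log P_{\theta}(y)^{\top}\right]$$

Because $\mathcal{I}(\theta)$ captures the local curvature of the space of distributions, the resulting optimization steps do not depend on how the distribution is parameterized. (For other scoring rules, NGBoost applies an analogous rescaling; see the NGBoost paper for details.)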
Natural Gradient Boosting with Python
The ngboost library developed by the Stanford Machine Learning Group implements these models in Python, with an API similar to and compatible with scikit-learn.
In general terms, training an ngboost model requires defining 3 elements:
A type of base learner — by default, decision trees from scikit-learn are used.
A parametric probability distribution.
A scoring rule to guide the fitting process.
The remaining steps are identical to those in a scikit-learn gradient boosting model, including the possibility of optimizing hyperparameters via GridSearchCV and early_stopping_rounds.
The ngboost library is relatively recent and will likely continue to evolve and expand its functionality. Detailed documentation can be found in the NGBoost user guide. Currently, the available distributions and scoring rules are:
Distributions for regression
| Distribution | Parameters | Implemented Scores | Reference |
|---|---|---|---|
| Normal | loc, scale | LogScore, CRPScore | scipy.stats normal |
| LogNormal | s, scale | LogScore, CRPScore | scipy.stats lognormal |
| Exponential | scale | LogScore, CRPScore | scipy.stats exponential |
Distributions for classification
| Distribution | Parameters | Implemented Scores | Reference |
|---|---|---|---|
| k_categorical(K) | p0, p1... p{K-1} | LogScore | Categorical distribution on Wikipedia |
| Bernoulli | p | LogScore | Bernoulli distribution on Wikipedia |
Example 1¶
An electricity supply company has hourly consumption data for all households in a city. To allocate its resources as efficiently as possible, it wants to build a probabilistic model capable of predicting, for each hour of the day, an 80% prediction interval (quantiles 0.1 and 0.9) for consumption. It also wants to compute, throughout the day, the probability that consumption exceeds a given threshold.
Libraries¶
The libraries used in this document are:
# Data handling
# ==============================================================================
import pandas as pd
import numpy as np
# Plotting
# ==============================================================================
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
plt.rcParams['lines.linewidth'] = 1.5
plt.rcParams['font.size'] = 8
# Modeling
# ==============================================================================
from ngboost import NGBRegressor
from ngboost import scores
from ngboost import distns
from sklearn.tree import DecisionTreeRegressor
from scipy import stats
# Warnings configuration
# ==============================================================================
import warnings
warnings.filterwarnings('once')
Data¶
Data are simulated from a normal distribution with different variance depending on the time of day.
# Simulated data
# ==============================================================================
# Simulation slightly modified from the example published in
# "XGBoostLSS – An extension of XGBoost to probabilistic forecasting", by Alexander März
np.random.seed(seed=1234)
n = 3000
# Normal distribution with changing variance
x = np.linspace(start=0, stop= 24, num=n)
y = np.random.normal(
loc = 15,
scale = (
1 + 1.5 * ((4.8 < x) & (x < 7.2)) + 4 * ((7.2 < x) & (x < 12))
+ 1.5 * ((12 < x) & (x < 14.4)) + 2 * (x > 16.8)
)
)
Since the data were simulated using normal distributions, the theoretical quantile values are known for each hour.
# Compute the 0.1 and 0.9 quantiles for each simulated value of x
# ==============================================================================
quantile_10 = stats.norm.ppf(
q = 0.1,
loc = 15,
scale = (
1 + 1.5 * ((4.8 < x) & (x < 7.2)) + 4 * ((7.2 < x) & (x < 12))
+ 1.5 * ((12 < x) & (x < 14.4)) + 2 * (x > 16.8)
)
)
quantile_90 = stats.norm.ppf(
q = 0.9,
loc = 15,
scale = (
1 + 1.5 * ((4.8 < x) & (x < 7.2)) + 4 * ((7.2 < x) & (x < 12))
+ 1.5 * ((12 < x) & (x < 14.4)) + 2 * (x > 16.8)
)
)
# Plot
# ==============================================================================
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(7, 3.5))
ax.scatter(x, y, alpha = 0.3, c = "black", s = 8)
ax.plot(x, quantile_10, c = "black")
ax.plot(x, quantile_90, c = "black")
ax.fill_between(x, quantile_10, quantile_90, alpha=0.3, color='red',
label='Theoretical quantile interval 0.1-0.9')
ax.set_xticks(range(0, 25))
ax.set_title('Electricity consumption throughout the day')
ax.set_xlabel('Hour of the day')
ax.set_ylabel('Electricity consumption (MWh)')
plt.legend();
Model¶
To compute quantiles, the probabilistic distribution of the data must first be obtained. To simplify this demonstration, two assumptions are made:
It is assumed that the distribution is normal (since the data were simulated with this distribution).
Default hyperparameters are used when fitting the model.
In practice, a preliminary analysis is needed to determine both of these points (see Example 2).
# Model fitting
# ==============================================================================
model = NGBRegressor(
Dist = distns.Normal,
Base = DecisionTreeRegressor(criterion='friedman_mse', max_depth=3),
Score = scores.LogScore,
natural_gradient = True,
n_estimators = 500,
learning_rate = 0.01,
minibatch_frac = 1.0,
col_sample = 1.0,
random_state = 123,
)
model.fit(X=x.reshape(-1, 1), Y=y)
[iter 0] loss=2.5299 val_loss=0.0000 scale=2.0000 norm=4.5902
[iter 100] loss=2.2691 val_loss=0.0000 scale=2.0000 norm=4.4540
[iter 200] loss=2.2101 val_loss=0.0000 scale=1.0000 norm=2.2067
[iter 300] loss=2.1927 val_loss=0.0000 scale=2.0000 norm=4.3769
[iter 400] loss=2.1775 val_loss=0.0000 scale=2.0000 norm=4.3366
NGBRegressor(random_state=RandomState(MT19937) at 0x7F6F6CB00A40)
Predictions¶
NGBRegressor models provide two prediction methods:
- predict(): returns a point-estimate prediction, as most models do. This is the expected value of the response variable given the predictor values of the test observation, $\mathbb{E}[y|x=x_i]$.
- pred_dist(): returns an ngboost.distns object containing the conditional distribution of the response variable for the predictor values of the test observation, $P(y|x=x_i)$. The estimated parameters are stored within each of these objects.
# Point estimate prediction
# ==============================================================================
point_predictions = model.predict(X=x.reshape(-1, 1))
point_predictions
array([15.02650577, 15.02650577, 15.02650577, ..., 15.13872404,
21.92369649, 17.73664598])
# Conditional distribution prediction
# ==============================================================================
dist_predictions = model.pred_dist(X=x.reshape(-1, 1))
dist_predictions
<ngboost.distns.normal.Normal at 0x7f6f6c8f65d0>
# Each distns.distn object contains the estimated parameters
dist_predictions[0].params
{'loc': np.float64(15.026505766709645),
'scale': np.float64(0.9610903262391606)}
# Estimated parameters for each hour
# ==============================================================================
df_parameters = pd.DataFrame([dist.params for dist in dist_predictions])
df_parameters['hour'] = x
df_parameters
| loc | scale | hour | |
|---|---|---|---|
| 0 | 15.026506 | 0.961090 | 0.000000 |
| 1 | 15.026506 | 0.961090 | 0.008003 |
| 2 | 15.026506 | 0.961090 | 0.016005 |
| 3 | 15.026506 | 0.961090 | 0.024008 |
| 4 | 15.026506 | 0.961090 | 0.032011 |
| ... | ... | ... | ... |
| 2995 | 15.674938 | 1.789045 | 23.967989 |
| 2996 | 13.786219 | 2.108075 | 23.975992 |
| 2997 | 15.138724 | 0.974876 | 23.983995 |
| 2998 | 21.923696 | 1.180863 | 23.991997 |
| 2999 | 17.736646 | 1.102145 | 24.000000 |
3000 rows × 3 columns
Interval Predictions¶
Once the parameters of the conditional distribution are known, the quantiles of the expected energy consumption for each hour can be computed.
A quantile of order $\tau$ $(0 < \tau < 1)$ of a distribution is the value that marks a cut such that a proportion $\tau$ of population values is less than or equal to that value. For example, the quantile of order 0.36 leaves 36% of values below it, and the quantile of order 0.50 corresponds to 50% (the median of the distribution).
All distributions generated by NGBRegressor are scipy.stats classes, so they provide the methods pdf(), logpdf(), cdf(), ppf(), and rvs() for computing density, log-density, cumulative probability, quantiles, and sampling new values.
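For reference, these are the same method signatures on a plain scipy.stats normal distribution (with illustrative parameters, not values predicted by the model):

```python
from scipy import stats

dist = stats.norm(loc=15, scale=2)   # hypothetical N(15, 2) distribution

density  = dist.pdf(15)              # probability density at x = 15
cum_prob = dist.cdf(15)              # P(y <= 15) = 0.5: for a normal, mean = median
quantile = dist.ppf(0.9)             # quantile of order 0.9
samples  = dist.rvs(size=5, random_state=123)  # sample new values

print(density, cum_prob, quantile)
```

The ppf() call is the one used throughout this document, via stats.norm.ppf(q=..., **dist.params), to turn predicted parameters into quantiles.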
The steps to compute quantile intervals are:
Generate a grid of values covering the entire observed range.
For each grid value, obtain the parameters predicted by the model.
Compute quantiles of the chosen distribution using the predicted parameters.
# Compute the theoretical quantiles to define the central interval accumulating
# 80% of probability using the predicted parameters.
# ==============================================================================
grid_X = np.linspace(start=min(x), stop=max(x), num=1000)
dist_predictions = model.pred_dist(X=grid_X.reshape(-1, 1))
pred_quantile_10 = [stats.norm.ppf(q=0.1, **dist.params) for dist in dist_predictions]
pred_quantile_90 = [stats.norm.ppf(q=0.9, **dist.params) for dist in dist_predictions]
# Plot quantile interval 10%-90%
# ==============================================================================
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(7, 3.5))
ax.scatter(x, y, alpha = 0.3, c = "black", s = 8)
ax.plot(x, quantile_10, c = "black")
ax.plot(x, quantile_90, c = "black")
ax.fill_between(x, quantile_10, quantile_90, alpha=0.3, color='red',
label='Theoretical quantile interval 0.1-0.9')
ax.plot(grid_X, pred_quantile_10, c = "blue", label='Predicted quantile interval')
ax.plot(grid_X, pred_quantile_90, c = "blue")
ax.set_xticks(range(0, 25))
ax.set_title('Electricity consumption throughout the day')
ax.set_xlabel('Hour of the day')
ax.set_ylabel('Electricity consumption (MWh)')
plt.legend();
Coverage¶
One of the metrics used to evaluate prediction intervals is coverage: the percentage of observations that fall within the estimated interval. Ideally, the empirical coverage should match the nominal confidence level of the interval, so the closer the two are, the better.
In this example, quantiles 0.1 and 0.9 have been computed, so the interval has a confidence level of 80%. If the predicted interval is correct, approximately 80% of observations should fall within it.
# Interval predictions for the available observations
# ==============================================================================
dist_predictions = model.pred_dist(X=x.reshape(-1, 1))
pred_quantile_10 = [stats.norm.ppf(q=0.1, **dist.params) for dist in dist_predictions]
pred_quantile_90 = [stats.norm.ppf(q=0.9, **dist.params) for dist in dist_predictions]
# Coverage of the predicted interval
# ==============================================================================
within_interval = (y >= pred_quantile_10) & (y <= pred_quantile_90)
coverage = within_interval.mean()
print(f"Coverage of the predicted interval: {100 * coverage:.2f}%")
Coverage of the predicted interval: 80.03%
Probabilistic Prediction¶
For example, to determine the probability that at 8 a.m. consumption exceeds 22.5 MWh, the distribution parameters are first predicted for $hour=8$ and then, using those parameters, the probability $consumption \geq 22.5$ is computed with the cumulative distribution function.
# Predict distribution parameters when x=8
# ==============================================================================
distribution = model.pred_dist(X=np.array([8]).reshape(-1, 1))
distribution.params
{'loc': array([14.1530422]), 'scale': array([4.52000988])}
# Probability that consumption exceeds 22.5
# ==============================================================================
100 * (1 - stats.norm.cdf(x=22.5, **distribution.params))
array([3.2397633])
According to the distribution predicted by the model, there is a 3.24% probability that consumption at 8 a.m. will be equal to or above 22.5 MWh.
If this same calculation is performed for the entire hourly range, the probability that consumption exceeds a given value throughout the day can be determined.
# Probability prediction across the full hourly range
# ==============================================================================
grid_X = np.linspace(start=0, stop=24, num=500)
dist_predictions = model.pred_dist(X=grid_X.reshape(-1, 1))
probability_predictions = [(1 - stats.norm.cdf(x=22.5, **dist.params)) for dist in dist_predictions]
# Probability plot
# ==============================================================================
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(7, 3))
ax.plot(grid_X, probability_predictions)
ax.set_xticks(range(0, 25))
ax.set_title('Probability that consumption exceeds 22.5 MWh',
fontdict={'fontsize':13})
ax.set_xlabel('Hour of the day')
ax.set_ylabel('Probability');
Example 2¶
The dbbmi dataset (Fourth Dutch Growth Study, Fredriks et al. (2000a, 2000b)), available in the R package gamlss.data, contains information on age and body mass index (bmi) of 7,294 Dutch youths aged 0 to 20. The goal is to train a model capable of predicting quantiles of body mass index as a function of age, as this is one of the standards used to detect anomalous cases that may require medical attention.
Libraries¶
# Data handling
# ==============================================================================
import pandas as pd
import numpy as np
from tqdm import tqdm
# Plotting
# ==============================================================================
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
plt.rcParams['lines.linewidth'] = 1.5
plt.rcParams['font.size'] = 8
# Modeling
# ==============================================================================
from ngboost import NGBRegressor
from ngboost import scores
from ngboost import distns
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV
from scipy import stats
import multiprocessing
# Warnings configuration
# ==============================================================================
import warnings
warnings.filterwarnings('once')
Data¶
# Load data
# ==============================================================================
url = (
'https://raw.githubusercontent.com/JoaquinAmatRodrigo/'
'Estadistica-machine-learning-python/master/data/dbbmi.csv'
)
data = pd.read_csv(url)
data.head(3)
| age | bmi | |
|---|---|---|
| 0 | 0.03 | 13.235289 |
| 1 | 0.04 | 12.438775 |
| 2 | 0.04 | 14.541775 |
# Plot distribution of bmi values
# ==============================================================================
fig, ax = plt.subplots(figsize=(6, 3))
ax.hist(data.bmi, bins=30, density=True, color='#3182bd', alpha=0.8)
ax.set_title('Distribution of body mass index values')
ax.set_xlabel('bmi')
ax.set_ylabel('density');
# Plot distribution of body mass index as a function of age
# ==============================================================================
fig, axs = plt.subplots(ncols=2, nrows=1, figsize=(8, 4), sharey=True)
data[data.age <= 2.5].plot(
x = 'age',
y = 'bmi',
c = 'black',
kind = "scatter",
alpha = 0.1,
ax = axs[0]
)
axs[0].set_ylim([10, 30])
axs[0].set_title('Age <= 2.5 years')
data[data.age > 2.5].plot(
x = 'age',
y = 'bmi',
c = 'black',
kind = "scatter",
alpha = 0.1,
ax = axs[1]
)
axs[1].set_title('Age > 2.5 years')
fig.tight_layout()
plt.subplots_adjust(top = 0.85)
fig.suptitle('Distribution of body mass index as a function of age', fontsize = 12);
This distribution exhibits three important characteristics that must be taken into account when modeling it:
The relationship between age and body mass index is neither linear nor constant. There is a notable positive relationship until about 0.25 years of age, then it stabilizes until age 10, and rises sharply again from ages 10 to 20.
Variance is heterogeneous (heteroscedasticity), being lower at younger ages and higher at older ages.
The distribution of the response variable is not normal; it shows skewness and a positive tail.
Given these characteristics, a model is needed that:
Is capable of learning non-linear relationships.
Can explicitly model variance as a function of the predictors, since it is not constant.
Is capable of learning skewed distributions with a pronounced positive tail.
Parametric Distribution Selection¶
NGBoost models require specifying a parametric distribution type, so the first step is to identify which of the three distributions available in NGBoost (normal, lognormal, and exponential) best fits the data.
Each distribution is fitted and evaluated graphically by overlaying it onto the histogram. Detailed information on how to compare distributions can be found in Distribution fitting and selection with Python.
def fit_plot_distribution(
x: np.ndarray,
distribution_name,
ax=None,
verbose: bool = True,
):
'''
This function overlays the density curve(s) of one or more distributions
on top of the data histogram.
Parameters
----------
x : array_like
Data used to fit the distribution.
distribution_name : str or list
Name of a distribution or list of distribution names available
in `scipy.stats`.
ax : matplotlib.axes.Axes, optional
Axes on which to draw the plot. If None, a new one is created.
verbose : bool, optional
If True, prints fitting results. For multiple distributions,
prints error warnings (default True).
Returns
-------
ax: matplotlib.axes.Axes
Generated plot
'''
# Input validation
x = np.asarray(x)
if x.size == 0:
raise ValueError("Array 'x' cannot be empty")
if not np.all(np.isfinite(x)):
raise ValueError("Array 'x' contains NaN or infinite values")
if isinstance(distribution_name, str):
distribution_name = [distribution_name]
if not isinstance(distribution_name, (list, tuple)):
raise TypeError("Parameter 'distribution_name' must be a string, list, or tuple")
# Create plot
if ax is None:
fig, ax = plt.subplots(figsize=(7,4))
# Plot histogram
ax.hist(x=x, density=True, bins=30, color="#3182bd", alpha=0.5)
ax.plot(x, np.full_like(x, -0.01), '|k', markeredgewidth=1)
ax.set_title('Distribution fit' if len(distribution_name) == 1 else
'Distribution fits')
ax.set_xlabel('x')
ax.set_ylabel('Probability density')
for name in tqdm(distribution_name):
if not hasattr(stats, name):
if verbose:
print(f"⚠️ Distribution '{name}' does not exist in scipy.stats")
continue
try:
distribution = getattr(stats, name)
parameters = distribution.fit(data=x)
is_continuous = isinstance(distribution, stats.rv_continuous)
pdf_func = distribution.pdf if is_continuous else distribution.pmf
x_hat = np.linspace(min(x), max(x), num=100)
y_hat = pdf_func(x_hat, *parameters)
ax.plot(x_hat, y_hat, linewidth=2, label=distribution.name)
except Exception as e:
if verbose:
print(f"⚠️ Error fitting distribution '{name}': {type(e).__name__}: {e}")
ax.legend(loc='best')
return ax
fig, ax = plt.subplots(figsize=(6, 3))
fit_plot_distribution(
x=data.bmi.to_numpy(),
distribution_name=['lognorm', 'norm', 'expon'],
ax=ax
);
The graphical representation provides clear evidence that, among the available distributions, the lognormal is the best fit.
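The visual comparison can be complemented with a quantitative one — for example, comparing the maximized log-likelihood of each fitted candidate. A minimal sketch on simulated right-skewed data (the sample here is synthetic, standing in for the bmi values; the fit() and logpdf() calls are standard scipy.stats API):

```python
from scipy import stats

# Synthetic right-skewed sample standing in for the bmi data (illustrative only)
sample = stats.lognorm.rvs(s=0.5, scale=17, size=1000, random_state=123)

log_likelihoods = {}
for name in ['lognorm', 'norm', 'expon']:
    dist = getattr(stats, name)
    params = dist.fit(sample)                         # maximum likelihood estimates
    log_likelihoods[name] = dist.logpdf(sample, *params).sum()

best = max(log_likelihoods, key=log_likelihoods.get)
print(log_likelihoods)
print(f"Best fit by log-likelihood: {best}")
```

A higher log-likelihood indicates a better fit; when candidates differ in their number of parameters, information criteria such as AIC or BIC are a fairer comparison.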
Model¶
Unlike the previous example, cross-validation is used here to identify the optimal hyperparameters.
X = data.age.to_numpy()
y = data.bmi.to_numpy()
# Hyperparameter grid
# ==============================================================================
b1 = DecisionTreeRegressor(criterion='friedman_mse', max_depth=2)
b2 = DecisionTreeRegressor(criterion='friedman_mse', max_depth=4)
param_grid = {
'Base': [b1, b2],
'n_estimators': [100, 500, 1000]
}
# Cross-validation search
# ==============================================================================
grid = GridSearchCV(
estimator = NGBRegressor(
Dist = distns.LogNormal,
Score = scores.LogScore,
verbose = False
),
param_grid = param_grid,
scoring = 'neg_root_mean_squared_error', # Evaluates point prediction (expected value). To evaluate full distributional quality, neg-log-likelihood or CRPS would be used.
n_jobs = multiprocessing.cpu_count() - 1,
cv = 5,
verbose = 0
)
# Assign result to _ to suppress printed output
_ = grid.fit(X = X.reshape(-1, 1), y = y)
# Grid results
# ==============================================================================
results = pd.DataFrame(grid.cv_results_)
results.filter(regex = '(param.*|mean_t|std_t)')\
.drop(columns = 'params')\
.sort_values('mean_test_score', ascending = False)
| param_Base | param_n_estimators | mean_test_score | std_test_score | |
|---|---|---|---|---|
| 1 | DecisionTreeRegressor(criterion='friedman_mse'... | 500 | -2.254565 | 0.420097 |
| 2 | DecisionTreeRegressor(criterion='friedman_mse'... | 1000 | -2.256332 | 0.360156 |
| 4 | DecisionTreeRegressor(criterion='friedman_mse'... | 500 | -2.291792 | 0.307326 |
| 3 | DecisionTreeRegressor(criterion='friedman_mse'... | 100 | -2.334395 | 0.438422 |
| 5 | DecisionTreeRegressor(criterion='friedman_mse'... | 1000 | -2.372743 | 0.200955 |
| 0 | DecisionTreeRegressor(criterion='friedman_mse'... | 100 | -2.444140 | 0.651819 |
# Best hyperparameters
# ==============================================================================
print("-----------------------------------")
print("Best hyperparameters found")
print("-----------------------------------")
print(grid.best_params_, ":", grid.best_score_, grid.scoring)
-----------------------------------
Best hyperparameters found
-----------------------------------
{'Base': DecisionTreeRegressor(criterion='friedman_mse', max_depth=2), 'n_estimators': 500} : -2.2545651891289116 neg_root_mean_squared_error
# Best model found
# ==============================================================================
model = grid.best_estimator_
Interval Predictions¶
Once the parameters of the conditional distribution are known, the expected body mass index quantiles for each age can be computed.
# Compute the quantiles to define the central interval accumulating
# 80% of probability using the predicted parameters.
# ==============================================================================
grid_X = np.linspace(start=min(data.age), stop=max(data.age), num=2000)
dist_predictions = model.pred_dist(X=grid_X.reshape(-1, 1))
pred_quantile_10 = [stats.lognorm.ppf(q=0.1, **dist.params) for dist in dist_predictions]
pred_quantile_90 = [stats.lognorm.ppf(q=0.9, **dist.params) for dist in dist_predictions]
df_intervals = pd.DataFrame({
'age': grid_X,
'pred_quantile_10': pred_quantile_10,
'pred_quantile_90': pred_quantile_90
})
# Plot quantile interval 10%-90%
# ==============================================================================
fig, axs = plt.subplots(ncols=2, nrows=1, figsize=(10, 4.5), sharey=True)
data[data.age <= 2.5].plot(
x = 'age',
y = 'bmi',
c = 'black',
kind = "scatter",
alpha = 0.1,
ax = axs[0]
)
df_intervals[df_intervals.age <= 2.5].plot(
x = 'age',
y = 'pred_quantile_10',
c = 'red',
kind = "line",
ax = axs[0]
)
df_intervals[df_intervals.age <= 2.5].plot(
x = 'age',
y = 'pred_quantile_90',
c = 'red',
kind = "line",
ax = axs[0]
)
axs[0].set_ylim([10, 30])
axs[0].set_title('Age <= 2.5 years')
data[data.age > 2.5].plot(
x = 'age',
y = 'bmi',
c = 'black',
kind = "scatter",
alpha = 0.1,
ax = axs[1]
)
df_intervals[df_intervals.age > 2.5].plot(
x = 'age',
y = 'pred_quantile_10',
c = 'red',
kind = "line",
ax = axs[1]
)
df_intervals[df_intervals.age > 2.5].plot(
x = 'age',
y = 'pred_quantile_90',
c = 'red',
kind = "line",
ax = axs[1]
)
axs[1].set_title('Age > 2.5 years')
axs[1].get_legend().remove()
fig.tight_layout()
plt.subplots_adjust(top = 0.85)
fig.suptitle('Distribution of body mass index as a function of age', fontsize = 14);
Coverage¶
One of the metrics used to evaluate prediction intervals is coverage: the percentage of observations that fall within the estimated interval. Ideally, the empirical coverage should match the nominal confidence level of the interval, so the closer the two are, the better.
In this example, quantiles 0.1 and 0.9 were computed using a LogNormal distribution, so the interval has a confidence level of 80%. If the model and chosen distribution are correct, approximately 80% of observations should fall within it.
# Interval predictions for the available observations
# ==============================================================================
dist_predictions = model.pred_dist(X=X.reshape(-1, 1))
pred_quantile_10 = [stats.lognorm.ppf(q=0.1, **dist.params) for dist in dist_predictions]
pred_quantile_90 = [stats.lognorm.ppf(q=0.9, **dist.params) for dist in dist_predictions]
# Coverage of the predicted interval
# ==============================================================================
within_interval = (y >= pred_quantile_10) & (y <= pred_quantile_90)
coverage = within_interval.mean()
print(f"Coverage of the predicted interval: {100 * coverage:.2f}%")
Coverage of the predicted interval: 82.11%
Anomalies (Outliers)¶
Knowing the probability distribution is useful for detecting anomalies — observations with values that are very unlikely according to the model.
For a given age, low body mass index values are indicative of potential malnutrition issues, and very high values are indicative of potential overweight problems. Using the trained model, individuals with bmi values that are highly improbable given their age are identified.
# 98% interval predictions for the available observations
# ==============================================================================
X = data.age.to_numpy()
y = data.bmi.to_numpy()
dist_predictions = model.pred_dist(X=X.reshape(-1, 1))
pred_quantile_01 = [stats.lognorm.ppf(q=0.01, **dist.params) for dist in dist_predictions]
pred_quantile_99 = [stats.lognorm.ppf(q=0.99, **dist.params) for dist in dist_predictions]
within_interval = (y >= pred_quantile_01) & (y <= pred_quantile_99)
df_intervals = pd.DataFrame({
'age': X,
'bmi': y,
'pred_quantile_01': pred_quantile_01,
'pred_quantile_99': pred_quantile_99,
'within_interval': within_interval
})
df_intervals.head()
| age | bmi | pred_quantile_01 | pred_quantile_99 | within_interval | |
|---|---|---|---|---|---|
| 0 | 0.03 | 13.235289 | 11.016612 | 17.542151 | True |
| 1 | 0.04 | 12.438775 | 11.016612 | 17.542151 | True |
| 2 | 0.04 | 14.541775 | 11.016612 | 17.542151 | True |
| 3 | 0.04 | 11.773954 | 11.016612 | 17.542151 | True |
| 4 | 0.04 | 15.325614 | 11.016612 | 17.542151 | True |
Individuals whose bmi falls outside the predicted 98% central interval — that is, in the outer 1% of either tail of the distribution estimated for their age — are highlighted in red.
# Anomaly plot
# ==============================================================================
fig, axs = plt.subplots(ncols=2, nrows=1, figsize=(10, 4.5), sharey=True)
df_intervals[df_intervals.age <= 2.5].plot(
x = 'age',
y = 'bmi',
c = 'black',
kind = "scatter",
alpha = 0.1,
ax = axs[0]
)
df_intervals[(df_intervals.age <= 2.5) & (df_intervals.within_interval == False)].plot(
x = 'age',
y = 'bmi',
c = 'red',
kind = "scatter",
label = 'anomaly',
ax = axs[0]
)
axs[0].set_ylim([10, 30])
axs[0].set_title('Age <= 2.5 years')
df_intervals[df_intervals.age > 2.5].plot(
x = 'age',
y = 'bmi',
c = 'black',
kind = "scatter",
alpha = 0.1,
ax = axs[1]
)
df_intervals[(df_intervals.age > 2.5) & (df_intervals.within_interval == False)].plot(
x = 'age',
y = 'bmi',
c = 'red',
kind = "scatter",
ax = axs[1]
)
axs[1].set_title('Age > 2.5 years')
fig.tight_layout()
plt.subplots_adjust(top = 0.85)
fig.suptitle('Distribution of body mass index as a function of age', fontsize = 14);
Session Information¶
import session_info
session_info.show(html=False)
-----
matplotlib 3.10.8
ngboost 0.5.10
numpy 2.1.3
pandas 2.3.3
scipy 1.15.3
session_info v1.0.1
sklearn 1.7.2
tqdm 4.67.3
-----
IPython 9.10.0
jupyter_client 8.8.0
jupyter_core 5.9.1
-----
Python 3.13.11 | packaged by Anaconda, Inc. | (main, Dec 10 2025, 21:28:48) [GCC 14.3.0]
Linux-6.17.0-20-generic-x86_64-with-glibc2.39
-----
Session information updated at 2026-04-13 22:47
Citation Instructions¶
How to cite this document?
If you use this document or any part of it, we appreciate you citing it. Thank you!
Probabilistic gradient boosting with Python: NGBoost by Joaquín Amat Rodrigo, available under an Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0 DEED) license at https://www.cienciadedatos.net/documentos/py26-probabilistic-gradient-boosting-python-ngboost.html
This document created by Joaquín Amat Rodrigo is licensed under Attribution-NonCommercial-ShareAlike 4.0 International.
You are free to:
- Share: copy and redistribute the material in any medium or format.
- Adapt: remix, transform, and build upon the material.

Under the following terms:
- Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- NonCommercial: You may not use the material for commercial purposes.
- ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
