Introduction¶
Identifying the theoretical distribution type that best fits an empirical variable is a critical step in statistical analysis. It is the foundation for conducting parametric hypothesis tests, estimating confidence intervals, and preprocessing data for statistical learning and Machine Learning models (especially in linear or Gaussian models).
In Python, the data analysis ecosystem offers powerful tools for this task. This document explores the functionalities of the scipy.stats module, focusing on how to fit and compare multiple distributions to determine which best describes the observed data.
Fitting a Distribution¶
Fitting a parametric distribution consists of determining the values of its parameters so that the theoretical distribution represents as accurately as possible the behavior of a real dataset. Once these parameters are estimated, the distribution is completely defined and can be used to describe, analyze, or predict the behavior of the studied system. For example, a normal distribution is completely defined by two parameters: its location (mean, $\mu$) and its scale (standard deviation, $\sigma$).
There are several methods for finding the optimal parameters. One of the most widely used, and the one implemented in scipy.stats, is Maximum Likelihood Estimation (MLE). This method seeks the parameters that maximize the joint probability of observing the available data.
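To build intuition for what the fit() method does internally, the following sketch (not part of scipy's API) maximizes the log likelihood of a normal distribution numerically and compares the result with scipy's built-in MLE fit. The simulated sample and the log-scale parameterization of $\sigma$ are illustrative choices.

# Minimal MLE sketch: maximize the normal log likelihood "by hand"
# ==============================================================================
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(42)
sample = rng.normal(loc=10, scale=2, size=1_000)  # simulated data

def neg_log_likelihood(theta):
    # Parameterize sigma on the log scale so the optimizer cannot make it negative
    mu, log_sigma = theta
    return -stats.norm.logpdf(sample, loc=mu, scale=np.exp(log_sigma)).sum()

# Minimizing the negative log likelihood is equivalent to maximizing the likelihood
result = optimize.minimize(neg_log_likelihood, x0=[0.0, 0.0], method='Nelder-Mead')
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)       # close to the true values (10, 2)
print(stats.norm.fit(sample))  # scipy's MLE fit returns (loc, scale): same values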
The scipy.stats library implements more than 90 distributions. You can consult the complete list in its official documentation: https://docs.scipy.org/doc/scipy/reference/stats.html
The above definition is intended to be intuitive. Formally, fitting a parametric distribution can be defined as follows:
Fitting a parametric distribution consists of selecting a parametric family of distributions $\{f(x;\theta), \theta \in \Theta\}$ and estimating the value of the unknown parameter $\theta$ from a random sample $X_1, \dots, X_n$ of a random variable $X$, under the assumption that the distribution of $X$ belongs to that family. The estimation of $\theta$ is performed using a formal statistical criterion, such as the maximum likelihood method, the method of moments, or least squares, so that the distribution $f(x;\hat{\theta})$ is optimal with respect to that criterion.
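To make the idea of an estimation criterion concrete, here is a minimal method-of-moments sketch for a two-parameter gamma distribution on simulated data. Since $\text{mean} = a \cdot \text{scale}$ and $\text{var} = a \cdot \text{scale}^2$, matching the sample moments gives $\hat{a} = \bar{x}^2 / s^2$ and $\widehat{\text{scale}} = s^2 / \bar{x}$. The sample and the comparison against scipy's fit() are illustrative.

# Method-of-moments sketch for a gamma distribution (location fixed at 0)
# ==============================================================================
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.gamma(shape=3.0, scale=2.0, size=5_000)  # simulated data

# Match sample moments to the theoretical ones: mean = a*scale, var = a*scale**2
mean, var = sample.mean(), sample.var()
shape_mom = mean**2 / var
scale_mom = var / mean
print(shape_mom, scale_mom)             # close to the true values (3, 2)

# For comparison, scipy's MLE fit with the location fixed at 0 (floc=0)
print(stats.gamma.fit(sample, floc=0))  # returns (shape, loc, scale)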
In addition to distinguishing between continuous and discrete distributions, it is useful to be able to select them by support, that is, the set of values on which the distribution is defined. For example, if you want to model wind speed, even if you don't know the exact type of distribution, you can restrict the selection to those distributions whose support is contained in $[0,+\infty)$.
# Available distributions in scipy.stats
# ==============================================================================
from scipy import stats
import pandas as pd
import numpy as np

distributions = [
    getattr(stats, d) for d in dir(stats)
    if isinstance(getattr(stats, d), (stats.rv_continuous, stats.rv_discrete))
]

data = []
for dist in distributions:
    try:
        a = getattr(dist, 'a', np.nan)
        b = getattr(dist, 'b', np.nan)
        name = getattr(dist, 'name', str(dist))
        data.append({'distribution': name, 'support_min': a, 'support_max': b})
    except Exception:
        continue

dist_info = pd.DataFrame(data)
dist_info = dist_info.sort_values(by=['support_min', 'support_max']).reset_index(drop=True)
display(dist_info)
| | distribution | support_min | support_max |
|---|---|---|---|
| 0 | levy_l | -inf | 0.0 |
| 1 | weibull_max | -inf | 0.0 |
| 2 | cauchy | -inf | inf |
| 3 | crystalball | -inf | inf |
| 4 | dgamma | -inf | inf |
| ... | ... | ... | ... |
| 127 | pareto | 1.0 | inf |
| 128 | truncpareto | 1.0 | inf |
| 129 | yulesimon | 1.0 | inf |
| 130 | zipf | 1.0 | inf |
| 131 | zipfian | 1.0 | inf |

132 rows × 3 columns
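Building on the dist_info table above, the support columns can be used to preselect candidates. For example, for the wind-speed scenario mentioned earlier, one can keep only the distributions whose support is contained in $[0, +\infty)$ (the exact list depends on the installed scipy version):

# Distributions with support contained in [0, +inf)
# ==============================================================================
positive_support = dist_info[
    (dist_info['support_min'] >= 0) & (dist_info['support_max'] == np.inf)
]
print(positive_support['distribution'].tolist()[:10])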
Comparing Fits¶
Given a specific distribution, the Maximum Likelihood Estimation (MLE) fitting method finds the values of its parameters that best fit the data.
In practice, the distribution followed by the data is often not known in advance. With experience, the analyst can narrow down the candidate distributions (continuous or discrete, positive only, etc.), but a criterion is needed that quantifies the goodness of fit of each distribution and allows comparison between them. A book that explains the characteristics of the main probability distributions is Distributions for Modeling Location, Scale, and Shape Using GAMLSS in R by Robert A. Rigby, Mikis D. Stasinopoulos, Gillian Z. Heller, and Fernanda De Bastiani.
Generalized Akaike Information Criterion (GAIC)
For parametric distributions fitted with a Maximum Likelihood Estimation (MLE) strategy, one way to quantify the goodness of fit of the model, that is, how well it fits the data, is to use the likelihood value achieved by the fit itself. What does this mean?
The likelihood value returned by a distribution for an observation $x_i$ is a measure of the probability with which that distribution could have generated that observation. To facilitate mathematical calculations, the logarithm (log likelihood) is normally used, but the interpretation is the same: the higher the log likelihood, the higher the probability.
If the log likelihood of all observations with which the distribution was fitted is summed, we have a value that measures the "compatibility" between the distribution and the data. Following this idea, the fitted global deviance (GDEV) term is defined, which is nothing more than the model's log likelihood multiplied by $-2$.
$$\text{GDEV} = -2 \log(\text{likelihood})$$

The problem with using GDEV to compare distributions is that it does not account for the degrees of freedom of each one (their flexibility). In general terms, the more parameters a distribution has, the more easily it fits the data and the higher its log likelihood. This means that using GDEV or the log likelihood to compare distributions with different numbers of parameters is not a fair strategy; the one with the most parameters will almost always win, even if it is not really the one that best describes the behavior of the data.
One way to mitigate this problem is through the use of GAIC (generalized Akaike information criterion), which incorporates a penalty $k$ for each parameter that the distribution has:
$$\text{GAIC} = \text{GDEV} + k \times \text{number of parameters} = -2 \log(\text{likelihood}) + k \times \text{number of parameters}$$

Depending on the degree of penalization, model simplicity is favored to a greater or lesser extent. Two standard variants of this metric are the AIC (Akaike Information Criterion) and the BIC (Bayesian Information Criterion), also known as SBC. The first uses a penalization value of $k=2$ and the second $k=\log(\text{number of observations})$.
$$\text{AIC} = -2 \log(\text{likelihood}) + 2 \times \text{number of parameters}$$

$$\text{BIC} = -2 \log(\text{likelihood}) + \log(\text{number of observations}) \times \text{number of parameters}$$

In practice, AIC tends to favor distributions with more parameters (overfitting), while BIC/SBC tends toward the opposite, underfitting. The authors of the package associated with the book Distributions for Modeling Location, Scale, and Shape Using GAMLSS in R recommend using $k$ values in the range $2.5 \leq k \leq 4$, or $k = \sqrt{\log(n)}$ when $n \geq 1000$.
Multiplication by -2 converts the maximization of log likelihood into a minimization problem, so that lower values of GDEV, AIC or BIC indicate better fit.
It is important to note that none of these metrics serve to quantify the quality of fit in an absolute sense, but rather to compare the relative quality between distributions. If all fitted distributions are poor, they provide no warning of it.
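Since GDEV, AIC, and BIC are all particular cases of GAIC with different values of $k$, a single helper function covers them. The sketch below is not part of any library; the numbers plugged in are taken from the gamma fit shown later in this document.

# Minimal GAIC helper: AIC and BIC as particular cases
# ==============================================================================
import numpy as np

def gaic(log_likelihood, n_params, k=2.0):
    """Generalized Akaike Information Criterion: -2*logL + k*n_params."""
    return -2 * log_likelihood + k * n_params

log_l, n_params, n_obs = -15235.64, 3, 1610    # gamma fit on diamond prices (see below)
print(gaic(log_l, n_params, k=2))              # AIC ~ 30477.3
print(gaic(log_l, n_params, k=np.log(n_obs)))  # BIC ~ 30493.4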
Example¶
In this example, two distributions, normal and gamma, are fitted with the objective of modeling the distribution of diamond sale prices. In addition to performing the fits, the results are graphically represented and the goodness-of-fit metrics AIC, BIC, and Log-Likelihood are calculated with the objective of comparing and identifying which distribution fits best.
Libraries¶
The libraries used in this document are:
# Data manipulation
# ==============================================================================
import pandas as pd
import numpy as np
import seaborn as sns
# Plots
# ==============================================================================
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
plt.rcParams['lines.linewidth'] = 1.5
plt.rcParams['font.size'] = 8
# Distribution fitting
# ==============================================================================
from tqdm.auto import tqdm
from scipy import stats
import inspect
from statsmodels.distributions.empirical_distribution import ECDF
# Warnings configuration
# ==============================================================================
import warnings
from contextlib import nullcontext
warnings.filterwarnings('once')
Data¶
For this demonstration, the diamond prices available in the diamonds dataset from the seaborn library are used, specifically the price column.
# Data
# ==============================================================================
data = sns.load_dataset('diamonds')
data = data.loc[data.cut == 'Fair', 'price']
Distribution Fitting¶
Two of the first steps when analyzing a variable are: calculating the main descriptive statistics and representing the observed (empirical) distributions. If the data are stored in a Pandas Series, the main descriptive statistics can be obtained with the describe() method.
# Descriptive statistics
# ==============================================================================
data.describe()
count     1610.000000
mean      4358.757764
std       3560.386612
min        337.000000
25%       2050.250000
50%       3282.000000
75%       5205.500000
max      18574.000000
Name: price, dtype: float64
# Observed (empirical) distribution plots
# ==============================================================================
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))
# Histogram
axs[0].hist(x=data, bins=30, color="#3182bd", alpha=0.5)
axs[0].plot(data, np.full_like(data, -0.01), '|k', markeredgewidth=1)
axs[0].set_title('Empirical distribution of diamond prices')
axs[0].set_xlabel('price')
axs[0].set_ylabel('counts')
# Cumulative Distribution Function
# ecdf (empirical cumulative distribution function)
ecdf = ECDF(x=data)
axs[1].plot(ecdf.x, ecdf.y, color="#3182bd")
axs[1].set_title('Empirical distribution function')
axs[1].set_xlabel('price')
axs[1].set_ylabel('CDF')
plt.tight_layout();
# Normal distribution fitting
#===============================================================================
# 1) Define the distribution type
distribution = stats.norm
# 2) Obtain the parameters with the fit() method
params = distribution.fit(data=data)
# 3) Create a dictionary that includes the name of each parameter
param_names = [p for p in inspect.signature(distribution._pdf).parameters
               if p != 'x'] + ["loc", "scale"]
params_dict = dict(zip(param_names, params))
# 4) Calculate the log likelihood
log_likelihood = distribution.logpdf(data.to_numpy(), *params).sum()
# 5) Calculate the AIC and BIC
aic = -2 * log_likelihood + 2 * len(params)
bic = -2 * log_likelihood + np.log(data.shape[0]) * len(params)
# 6) Plot
x_hat = np.linspace(min(data), max(data), num=100)
y_hat = distribution.pdf(x_hat, *params)
fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(x_hat, y_hat, linewidth=2, label=distribution.name)
ax.hist(x=data, density=True, bins=30, color="#3182bd", alpha=0.5)
ax.plot(data, np.full_like(data, -0.01), '|k', markeredgewidth=1)
ax.set_title('Diamond price distribution')
ax.set_xlabel('price')
ax.set_ylabel('Probability density')
ax.legend();
# 7) Fit information
print('---------------------')
print('Fit results')
print('---------------------')
print(f"Distribution : {distribution.name}")
print(f"Support : {[distribution.a, distribution.b]}")
print(f"Parameters : {params_dict}")
print(f"Log likelihood : {log_likelihood}")
print(f"AIC : {aic}")
print(f"BIC : {bic}")
---------------------
Fit results
---------------------
Distribution : norm
Support : [-inf, inf]
Parameters : {'loc': np.float64(4358.757763975155), 'scale': np.float64(3559.2807303891086)}
Log likelihood : -15449.966194325283
AIC : 30903.932388650566
BIC : 30914.700367566522
The process is repeated, but this time with the gamma distribution.
# Gamma distribution fitting
#===============================================================================
# 1) Define the distribution type
distribution = stats.gamma
# 2) Obtain the parameters with the fit() method
params = distribution.fit(data=data)
# 3) Create a dictionary that includes the name of each parameter
param_names = [p for p in inspect.signature(distribution._pdf).parameters
               if p != 'x'] + ["loc", "scale"]
params_dict = dict(zip(param_names, params))
# 4) Calculate the log likelihood
log_likelihood = distribution.logpdf(data.to_numpy(), *params).sum()
# 5) Calculate the AIC and BIC
aic = -2 * log_likelihood + 2 * len(params)
bic = -2 * log_likelihood + np.log(data.shape[0]) * len(params)
# 6) Plot
x_hat = np.linspace(min(data), max(data), num=100)
y_hat = distribution.pdf(x_hat, *params)
fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(x_hat, y_hat, linewidth=2, label=distribution.name)
ax.hist(x=data, density=True, bins=30, color="#3182bd", alpha=0.5)
ax.plot(data, np.full_like(data, -0.01), '|k', markeredgewidth=1)
ax.set_title('Diamond price distribution')
ax.set_xlabel('price')
ax.set_ylabel('Probability density')
ax.legend();
# 7) Fit information
print('---------------------')
print('Fit results')
print('---------------------')
print(f"Distribution : {distribution.name}")
print(f"Support : {[distribution.a, distribution.b]}")
print(f"Parameters : {params_dict}")
print(f"Log likelihood : {log_likelihood}")
print(f"AIC : {aic}")
print(f"BIC : {bic}")
---------------------
Fit results
---------------------
Distribution : gamma
Support : [0.0, inf]
Parameters : {'a': np.float64(14.399103124683087), 'loc': np.float64(-6244.786483560789), 'scale': np.float64(726.3476320693849)}
Log likelihood : -15235.639828174484
AIC : 30477.27965634897
BIC : 30493.431624722904
Both the AIC and BIC metrics agree that the gamma distribution fits the data better (lower AIC and BIC values). This can be easily corroborated by graphical inspection of the results.
In this case, given that price values can only be positive and have a notable right tail, the gamma distribution was a much better candidate than the normal distribution. However, there are other possible distributions, some of which could be better. In the next example, we show how to automate the search.
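A convenient way to summarize such a comparison is the AIC difference between candidates. As a common rule of thumb (Burnham and Anderson), differences above roughly 10 indicate essentially no support for the weaker model; here the gap is far larger.

# AIC difference between the two fits (values printed above)
# ==============================================================================
aic_norm, aic_gamma = 30903.93, 30477.28
print(aic_norm - aic_gamma)  # ~ 426.7 in favor of gamma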
Fitting and Comparing Multiple Distributions¶
The following example shows how to automate the fitting and comparison of multiple distributions available in scipy.stats. The code should allow:

- Fitting all distributions available in scipy.stats.
- Preselecting a subset of candidate distributions based on their support.
- Displaying the parameters of each fit.
- Calculating AIC and BIC values to select the distribution with the best fit.
- Graphical representation of the results.
Functions¶
Functions used to compare multiple distributions.
def select_distributions(
    family: str = 'realall',
    exclusions: list = None,
    verbose: bool = True
) -> list:
    '''
    This function selects a subset of the distributions available
    in scipy.stats.

    Parameters
    ----------
    family : {'realall', 'realline', 'realplus', 'real0to1', 'discrete'}
        realall: distributions from the `realline` + `realplus` families
        realline: continuous distributions on the support (-inf, +inf)
        realplus: continuous distributions on the support [0, +inf)
        real0to1: continuous distributions on the support [0, 1]
        discrete: discrete distributions
    exclusions : list, optional
        List of distribution names to exclude. By default excludes
        ['levy_stable', 'vonmises', 'erlang', 'studentized_range'].
    verbose : bool
        Whether to display information about the selected distributions
        (the default `True`).

    Returns
    -------
    distributions : list
        List with the selected distributions (the objects).
    '''
    valid_families = ['realall', 'realline', 'realplus', 'real0to1', 'discrete']
    if family not in valid_families:
        raise ValueError(
            f"The 'family' parameter must be one of {valid_families}, "
            f"but received '{family}'"
        )

    # Get all distributions
    distributions = [
        getattr(stats, d) for d in dir(stats)
        if isinstance(getattr(stats, d), (stats.rv_continuous, stats.rv_discrete))
    ]

    # Default exclusions if not specified
    if exclusions is None:
        exclusions = ['levy_stable', 'vonmises', 'erlang', 'studentized_range']
    exclusions_set = set(exclusions)
    distributions = [dist for dist in distributions if dist.name not in exclusions_set]

    dist_info = pd.DataFrame([
        {
            'distribution': dist.name,
            'type': 'continuous' if isinstance(dist, stats.rv_continuous) else 'discrete',
            'support_inf': dist.a,
            'support_sup': dist.b,
            'dist_obj': dist
        }
        for dist in distributions
    ]).sort_values(by=['support_inf', 'support_sup']).reset_index(drop=True)

    supports = {
        'realline': [-np.inf, np.inf],
        'realplus': [0, np.inf],
        'real0to1': [0, 1]
    }

    if family == 'realall':
        # Continuous distributions on (-inf, +inf) or [0, +inf)
        mask = (dist_info['type'] == 'continuous') & (
            ((dist_info['support_inf'] == -np.inf) & (dist_info['support_sup'] == np.inf))
            | ((dist_info['support_inf'] == 0) & (dist_info['support_sup'] == np.inf))
        )
        dist_info = dist_info[mask].reset_index(drop=True)
    elif family in supports:
        mask = (
            (dist_info['type'] == 'continuous')
            & (dist_info['support_inf'] == supports[family][0])
            & (dist_info['support_sup'] == supports[family][1])
        )
        dist_info = dist_info[mask].reset_index(drop=True)
    elif family == 'discrete':
        dist_info = dist_info[dist_info['type'] == 'discrete'].reset_index(drop=True)

    selection = dist_info['dist_obj'].tolist()
    dist_info_display = dist_info.drop(columns=['dist_obj'])

    if not selection:
        print("⚠️ No distributions found for the specified family")

    if verbose:
        print("---------------------------------------------------")
        print("              Selected distributions               ")
        print("---------------------------------------------------")
        with pd.option_context('display.max_rows', None, 'display.max_columns', None):
            print(dist_info_display)

    return selection
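As a quick sanity check (the exact output depends on the installed scipy version), the function can be called on its own before being used inside the comparison routine:

# Continuous distributions defined on [0, +inf)
# ==============================================================================
realplus_dists = select_distributions(family='realplus', verbose=False)
print([d.name for d in realplus_dists][:5])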
def compare_distributions(
    x: np.ndarray,
    family: str = 'realall',
    criterion: str = 'aic',
    exclusions: list = None,
    verbose: bool = True,
    suppress_warnings: bool = True
) -> pd.DataFrame:
    '''
    This function selects and fits a subset of the distributions
    available in scipy.stats. For each distribution it calculates the
    Log Likelihood, AIC and BIC values.

    Parameters
    ----------
    x : array_like
        Data to fit the distribution.
    family : {'realall', 'realline', 'realplus', 'real0to1', 'discrete'}
        realall: distributions from the `realline` + `realplus` families
        realline: continuous distributions on the support (-inf, +inf)
        realplus: continuous distributions on the support [0, +inf)
        real0to1: continuous distributions on the support [0, 1]
        discrete: discrete distributions
    criterion : {'aic', 'bic'}
        Sorting criterion from best to worst fit.
    exclusions : list, optional
        List of distribution names to exclude. By default excludes
        ['levy_stable', 'vonmises', 'erlang', 'studentized_range'].
    verbose : bool
        Whether to display information about the selected distributions
        (the default `True`).
    suppress_warnings : bool
        Whether to suppress numerical warnings during fitting (RuntimeWarning).
        Recommended True for clean output (the default `True`).

    Returns
    -------
    results : pd.DataFrame
        distribution: distribution name.
        log_likelihood: log likelihood of the fit.
        aic: AIC metric.
        bic: BIC metric.
        n_params: number of distribution parameters.
        params: parameters after fitting.
    '''
    valid_criteria = ['aic', 'bic']
    if criterion not in valid_criteria:
        raise ValueError(
            f"The 'criterion' parameter must be one of {valid_criteria}, "
            f"but received '{criterion}'"
        )
    if exclusions is not None and not isinstance(exclusions, (list, tuple, set)):
        raise TypeError("The 'exclusions' parameter must be a list, tuple or set")

    x = np.asarray(x)
    if x.size == 0:
        raise ValueError("Array 'x' cannot be empty")
    if not np.all(np.isfinite(x)):
        raise ValueError("Array 'x' contains NaN or infinite values")

    distributions = select_distributions(
        family=family,
        exclusions=exclusions,
        verbose=False
    )

    results = []
    for distribution in tqdm(distributions):
        if verbose:
            print(f"Fitting distribution: {distribution.name}")
        try:
            if suppress_warnings:
                with warnings.catch_warnings():
                    warnings.simplefilter("ignore")
                    params = distribution.fit(data=x)
            else:
                params = distribution.fit(data=x)

            # Determine if continuous or discrete
            is_continuous = isinstance(distribution, stats.rv_continuous)
            sig_func = distribution._pdf if is_continuous else distribution._pmf
            log_func = distribution.logpdf if is_continuous else distribution.logpmf

            # Extract parameter names
            param_names = [p for p in inspect.signature(sig_func).parameters
                           if p != 'x'] + ["loc", "scale"]
            params_dict = dict(zip(param_names, params))

            # Calculate metrics
            log_likelihood = log_func(x, *params).sum()
            aic = -2 * log_likelihood + 2 * len(params)
            bic = -2 * log_likelihood + np.log(x.shape[0]) * len(params)

            # Store results
            results.append({
                'distribution': distribution.name,
                'log_likelihood': log_likelihood,
                'aic': aic,
                'bic': bic,
                'n_params': len(params),
                'params': params_dict
            })
        except (ValueError, RuntimeError, FloatingPointError) as e:
            if verbose:
                print(f"Error fitting {distribution.name}: {type(e).__name__}: {e}\n")
        except Exception as e:
            if verbose:
                print(f"Unexpected error fitting {distribution.name}: {type(e).__name__}: {e}\n")

    # Verify if any distribution was fitted
    if not results:
        if verbose:
            print("\n⚠️ Could not fit any distribution successfully.")
        results = pd.DataFrame(
            columns=['distribution', 'log_likelihood', 'aic', 'bic', 'n_params', 'params']
        )
    else:
        results = pd.DataFrame(results).sort_values(by=criterion).reset_index(drop=True)

    return results
Data¶
Again, the diamond prices available in the diamonds dataset from the seaborn library are used, specifically the price column.
# Data
# ==============================================================================
data = sns.load_dataset('diamonds')
data = data.loc[data.cut == 'Fair', 'price']
Distribution Fitting¶
# Distribution fitting and comparison
# ==============================================================================
results = compare_distributions(
x=data.to_numpy(),
family='realall',
criterion='aic',
verbose=False,
suppress_warnings=True
)
results
| | distribution | log_likelihood | aic | bic | n_params | params |
|---|---|---|---|---|---|---|
| 0 | recipinvgauss | -1.488310e+04 | 2.977220e+04 | 2.978835e+04 | 3 | {'mu': 0.8564617364318987, 'loc': 75.589562124... |
| 1 | geninvgauss | -1.488228e+04 | 2.977256e+04 | 2.979410e+04 | 4 | {'p': 0.03895738587977514, 'b': 1.373372982791... |
| 2 | norminvgauss | -1.488307e+04 | 2.977415e+04 | 2.979569e+04 | 4 | {'a': 53.047412557163966, 'b': 53.025400276075... |
| 3 | genhyperbolic | -1.488252e+04 | 2.977503e+04 | 2.980195e+04 | 5 | {'p': 0.06748200212521904, 'a': 19.46831101851... |
| 4 | lognorm | -1.488561e+04 | 2.977722e+04 | 2.979337e+04 | 3 | {'s': 0.775749425283541, 'loc': 30.76091834294... |
| ... | ... | ... | ... | ... | ... | ... |
| 78 | exponweib | -1.697787e+04 | 3.396375e+04 | 3.398529e+04 | 4 | {'a': 3.6435974274790723, 'c': 0.1273335852525... |
| 79 | invweibull | -1.799302e+04 | 3.599205e+04 | 3.600820e+04 | 3 | {'c': 0.1931244916343704, 'loc': 336.999783682... |
| 80 | ncx2 | -1.024232e+06 | 2.048472e+06 | 2.048493e+06 | 4 | {'df': 0.0011534145941760432, 'nc': 0.34024851... |
| 81 | truncnorm | -inf | inf | inf | 4 | {'a': 388.84586460519677, 'b': 506.69039671197... |
| 82 | weibull_min | -inf | inf | inf | 3 | {'c': 1.1567328971778637, 'loc': 495.039530173... |

83 rows × 6 columns
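From this table, the name and fitted parameters of the best candidate can be recovered programmatically, for example:

# Best distribution according to the chosen criterion
# ==============================================================================
best = results.iloc[0]
print(best['distribution'], best['params'])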
Plots¶
def fit_plot_distribution(
    x: np.ndarray,
    distribution_name,
    ax=None,
    verbose: bool = True,
):
    '''
    This function overlays the density curve(s) of one or more
    distributions with the data histogram.

    Parameters
    ----------
    x : array_like
        Data to fit the distribution.
    distribution_name : str or list
        Name of a distribution or list of distribution names
        available in `scipy.stats`.
    ax : matplotlib.axes.Axes, optional
        Axis on which to make the plot. If None, a new one is created.
    verbose : bool, optional
        If True, prints the fit results. For multiple distributions,
        prints error warnings (default True).

    Returns
    -------
    ax : matplotlib.axes.Axes
        Created plot.
    '''
    # Input validation
    x = np.asarray(x)
    if x.size == 0:
        raise ValueError("Array 'x' cannot be empty")
    if not np.all(np.isfinite(x)):
        raise ValueError("Array 'x' contains NaN or infinite values")
    if isinstance(distribution_name, str):
        distribution_name = [distribution_name]
    if not isinstance(distribution_name, (list, tuple)):
        raise TypeError("The 'distribution_name' parameter must be a string, list or tuple")

    # Create plot
    if ax is None:
        fig, ax = plt.subplots(figsize=(7, 4))

    # Plot histogram
    ax.hist(x=x, density=True, bins=30, color="#3182bd", alpha=0.5)
    ax.plot(x, np.full_like(x, -0.01), '|k', markeredgewidth=1)
    ax.set_title('Distribution fit' if len(distribution_name) == 1 else 'Distribution fits')
    ax.set_xlabel('x')
    ax.set_ylabel('Probability density')

    for name in tqdm(distribution_name):
        if not hasattr(stats, name):
            if verbose:
                print(f"⚠️ Distribution '{name}' does not exist in scipy.stats")
            continue
        try:
            distribution = getattr(stats, name)
            params = distribution.fit(data=x)
            is_continuous = isinstance(distribution, stats.rv_continuous)
            pdf_func = distribution.pdf if is_continuous else distribution.pmf
            x_hat = np.linspace(min(x), max(x), num=100)
            y_hat = pdf_func(x_hat, *params)
            ax.plot(x_hat, y_hat, linewidth=2, label=distribution.name)
        except Exception as e:
            if verbose:
                print(f"⚠️ Error fitting distribution '{name}': {type(e).__name__}: {e}")

    ax.legend(loc='best')

    return ax
The best distribution is shown according to the AIC criterion.
fig, ax = plt.subplots(figsize=(6, 3))
fit_plot_distribution(
    x=data.to_numpy(),
    distribution_name=results['distribution'][0],
    ax=ax
);
The probability density curves of the five best-fitting distributions are overlaid on the histogram.
fig, ax = plt.subplots(figsize=(6, 3))
fit_plot_distribution(
    x=data.to_numpy(),
    distribution_name=results['distribution'][:5].tolist(),
    ax=ax
);
Density Function, Quantile, and Sampling¶
All continuous distribution objects in scipy.stats have the pdf(), logpdf(), cdf(), ppf(), and rvs() methods for calculating density, log density, cumulative probability, quantiles, and sampling new values, respectively (discrete distributions expose pmf() and logpmf() instead of pdf() and logpdf()). For example, 5 new diamond prices can be simulated according to the johnsonsu distribution.
# Distribution definition
distribution = stats.johnsonsu
# Fitting to obtain parameter values
params = distribution.fit(data.to_numpy())
# Random sampling
distribution.rvs(*params, size=5)
array([12925.69222894, 2831.92278871, 3877.35030327, 872.06750859,
1713.25813818])
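The remaining methods work the same way. For example, with the fitted johnsonsu distribution still in scope, the probability of a price below 5000 and the 95th percentile of prices can be obtained as follows (the values returned depend on the fit above):

# Cumulative probability and quantile with the fitted distribution
# ==============================================================================
print(distribution.cdf(5000, *params))  # P(price <= 5000)
print(distribution.ppf(0.95, *params))  # 95th percentile of price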
Session Information¶
import session_info
session_info.show(html=False)
-----
matplotlib          3.10.8
numpy               2.2.6
pandas              2.3.3
scipy               1.15.3
seaborn             0.13.2
session_info        v1.0.1
statsmodels         0.14.6
tqdm                4.67.1
-----
IPython             9.8.0
jupyter_client      8.7.0
jupyter_core        5.9.1
-----
Python 3.13.11 | packaged by Anaconda, Inc. | (main, Dec 10 2025, 21:28:48) [GCC 14.3.0]
Linux-6.14.0-37-generic-x86_64-with-glibc2.39
-----
Session information updated at 2026-01-16 13:00
Bibliography¶
Rigby, R. A., Stasinopoulos, M. D., Heller, G. Z., & De Bastiani, F. Distributions for Modeling Location, Scale, and Shape Using GAMLSS in R.
Delignette-Muller, M., & Dutang, C. (2015). fitdistrplus: An R Package for Fitting Distributions. Journal of Statistical Software, 64(4). doi:10.18637/jss.v064.i04
Citation Instructions¶
How to cite this document?
If you use this document or any part of it, we would appreciate if you cite it. Thank you very much!
Fitting and Selecting Probability Distributions with Python by Joaquín Amat Rodrigo, available under an Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0 DEED) license at https://www.cienciadedatos.net/documentos/pystats01-fitting-probability-distributions-python.html
This document created by Joaquín Amat Rodrigo is licensed under Attribution-NonCommercial-ShareAlike 4.0 International.
Permitted:

- Share: copy and redistribute the material in any medium or format.
- Adapt: remix, transform, and build upon the material.

Under the following terms:

- Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- Non-Commercial: You may not use the material for commercial purposes.
- Share-Alike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.