Fitting and Selecting Probability Distributions with Python

Introduction

Identifying the theoretical distribution type that best fits an empirical variable is a critical step in statistical analysis. It is the foundation for conducting parametric hypothesis tests, estimating confidence intervals, and preprocessing data for statistical learning and Machine Learning models (especially in linear or Gaussian models).

In Python, the data analysis ecosystem offers powerful tools for this task. This document explores the functionalities of the scipy.stats module, focusing on how to fit and compare multiple distributions to determine which best describes the observed data.

Fitting a Distribution

Fitting a parametric distribution consists of determining the values of its parameters so that the theoretical distribution represents as accurately as possible the behavior of a real dataset. Once these parameters are estimated, the distribution is completely defined and can be used to describe, analyze, or predict the behavior of the studied system. For example, a normal distribution is completely defined by two parameters: its location (mean, $\mu$) and its scale (standard deviation, $\sigma$).
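
For reference, its density function is

$$f(x;\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

and no parameters other than $\mu$ and $\sigma$ appear in it.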

There are several methods for finding the optimal parameters. One of the most widely used, and the one implemented in scipy.stats, is the Maximum Likelihood Estimation (MLE) method, which seeks the parameters that maximize the joint probability of observing the available data.
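
As a minimal sketch (simulated data; the seed and sample size are arbitrary), the following code shows that the fit() method of scipy.stats recovers the closed-form MLE solution for the normal distribution, namely the sample mean and the biased (ddof=0) sample standard deviation:

# MLE check: scipy's fit() vs. the closed-form solution for the normal
# ==============================================================================
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=123)
sample = rng.normal(loc=10, scale=2, size=1_000)

# fit() estimates loc (mu) and scale (sigma) by maximum likelihood
loc_hat, scale_hat = stats.norm.fit(sample)

print(loc_hat, np.mean(sample))   # identical values
print(scale_hat, np.std(sample))  # identical values (ddof=0)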

The scipy.stats library has more than 90 implemented distributions. You can consult the complete list in its official documentation.

The above definition is intended to be intuitive. Formally, fitting a parametric distribution can be defined as follows:

Fitting a parametric distribution consists of selecting a parametric family of distributions $\{f(x;\theta), \theta \in \Theta\}$ and estimating the value of the unknown parameter $\theta$ from a random sample $X_1, \dots, X_n$ of a random variable $X$, under the assumption that the distribution of $X$ belongs to that family. The estimation of $\theta$ is performed using a formal statistical criterion, such as the maximum likelihood method, the method of moments, or least squares, so that the distribution $f(x;\hat{\theta})$ is optimal with respect to that criterion.
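
In symbols, the maximum likelihood criterion selects

$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta \in \Theta} \prod_{i=1}^{n} f(X_i;\theta) = \arg\max_{\theta \in \Theta} \sum_{i=1}^{n} \log f(X_i;\theta)$$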

In addition to distinguishing between continuous and discrete distributions, it is useful to be able to select them by support, that is, the set of values on which the distribution is defined. For example, if you want to model wind speed, even if you don't know the exact type of distribution, you can restrict the selection to those distributions whose support is contained in $[0,+\infty)$.

# Available distributions in scipy.stats
# ==============================================================================
from scipy import stats
import pandas as pd
import numpy as np

distributions = [
    getattr(stats, d) for d in dir(stats)
    if isinstance(getattr(stats, d), (stats.rv_continuous, stats.rv_discrete))
]

data = []
for dist in distributions:
    try:
        # 'a' and 'b' are the lower and upper bounds of the support
        a = getattr(dist, 'a', np.nan)
        b = getattr(dist, 'b', np.nan)
        name = getattr(dist, 'name', str(dist))
        data.append({'distribution': name, 'support_min': a, 'support_max': b})
    except Exception:
        continue

dist_info = pd.DataFrame(data)
dist_info = dist_info.sort_values(by=['support_min', 'support_max']).reset_index(drop=True)
display(dist_info)
    distribution  support_min  support_max
0         levy_l         -inf          0.0
1    weibull_max         -inf          0.0
2         cauchy         -inf          inf
3    crystalball         -inf          inf
4         dgamma         -inf          inf
..           ...          ...          ...
127       pareto          1.0          inf
128  truncpareto          1.0          inf
129    yulesimon          1.0          inf
130         zipf          1.0          inf
131      zipfian          1.0          inf

132 rows × 3 columns

Comparing Fits

Given a specific distribution, the Maximum Likelihood Estimation (MLE) method finds the values of its parameters that best fit the data.

In practice, however, the distribution followed by the data is rarely known in advance. With experience, the analyst can narrow down the candidate distributions (continuous or discrete, positive only, etc.), but a criterion is still needed that quantifies the goodness of fit of each distribution and allows them to be compared. A book that explains the characteristics of the main probability distributions is Distributions for Modeling Location, Scale, and Shape Using GAMLSS in R by Robert A. Rigby, Mikis D. Stasinopoulos, Gillian Z. Heller, and Fernanda De Bastiani.

Generalized Akaike Information Criterion (GAIC)

For parametric distributions fitted by Maximum Likelihood Estimation (MLE), one way to quantify the goodness of fit of the model, that is, how well it describes the data, is to use the likelihood value achieved in the fit itself. What does this mean?

The likelihood value returned by a distribution for an observation $x_i$ is a measure of the probability with which that distribution could have generated that observation. To facilitate mathematical calculations, the logarithm (log likelihood) is normally used, but the interpretation is the same: the higher the log likelihood, the higher the probability.

If the log likelihoods of all the observations used in the fit are summed, we obtain a value that measures the "compatibility" between the distribution and the data. Following this idea, the fitted global deviance (GDEV) is defined, which is simply the model's log likelihood multiplied by $-2$:

$$\text{GDEV} = -2 \log(\text{likelihood})$$

The problem with using GDEV to compare distributions is that it does not account for the degrees of freedom of each one (their flexibility). In general terms, the more parameters a distribution has, the more easily it fits the data and the higher its log likelihood. This means that using GDEV or log likelihood to compare distributions with different numbers of parameters is not a fair strategy; the one with the most parameters will almost always win, even if it is not really the one that best describes the behavior of the data.

One way to mitigate this problem is through the use of GAIC (generalized Akaike information criterion), which incorporates a penalty $k$ for each parameter that the distribution has:

$$\text{GAIC} = \text{GDEV} + k \times \text{number of parameters} = -2 \log(\text{likelihood}) + k \times \text{number of parameters}$$

Depending on the degree of penalization, model simplicity is favored to a greater or lesser extent. Two standard variants of this metric are the AIC (Akaike Information Criterion) and the BIC (Bayesian Information Criterion), also known as SBC. The first uses a penalty of $k=2$ and the second $k=\log(\text{number of observations})$.

$$\text{AIC} = -2 \log(\text{likelihood}) + 2 \times \text{number of parameters}$$

$$\text{BIC} = -2 \log(\text{likelihood}) + \log(\text{number of observations}) \times \text{number of parameters}$$

In practice, AIC tends to favor distributions with more parameters (overfitting), while BIC/SBC tends toward the opposite (underfitting). The authors of the GAMLSS package and of the book Distributions for Modeling Location, Scale, and Shape Using GAMLSS in R recommend using values of $k$ in the range $2.5 \leq k \leq 4$, or $k = \sqrt{\log(n)}$ when $n \geq 1000$.

Multiplying by $-2$ converts the maximization of the log likelihood into a minimization problem, so lower values of GDEV, AIC, or BIC indicate a better fit.
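
To make these formulas concrete, here is a minimal helper (an ad hoc sketch, not part of scipy.stats) checked against the values produced by the normal fit in the example below:

# GAIC helper (illustrative sketch, not part of scipy.stats)
# ==============================================================================
import numpy as np

def gaic(log_likelihood, n_params, k=2.0):
    """Generalized Akaike Information Criterion: -2*logL + k*n_params."""
    return -2 * log_likelihood + k * n_params

# Values taken from the normal fit computed later in this document (n = 1610)
log_likelihood, n_params, n = -15449.966194325283, 2, 1610
print(gaic(log_likelihood, n_params, k=2))          # AIC: 30903.93...
print(gaic(log_likelihood, n_params, k=np.log(n)))  # BIC: 30914.70...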

It is important to note that none of these metrics quantifies the quality of the fit in an absolute sense; they only compare the relative quality of the candidate distributions. If all the fitted distributions are poor, they give no warning of it.

Example

In this example, two distributions, normal and gamma, are fitted in order to model the distribution of diamond sale prices. In addition to performing the fits, the results are represented graphically, and the goodness-of-fit metrics AIC, BIC, and log likelihood are calculated in order to compare the two distributions and identify which one fits best.

Libraries

The libraries used in this document are:

# Data manipulation
# ==============================================================================
import pandas as pd
import numpy as np
import seaborn as sns

# Plots
# ==============================================================================
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
plt.rcParams['lines.linewidth'] = 1.5
plt.rcParams['font.size'] = 8

# Distribution fitting
# ==============================================================================
from tqdm.auto import tqdm
from scipy import stats
import inspect
from statsmodels.distributions.empirical_distribution import ECDF

# Warnings configuration
# ==============================================================================
import warnings
from contextlib import nullcontext
warnings.filterwarnings('once')

Data

For this demonstration, the diamond prices available in the diamonds dataset from the seaborn library are used, specifically the price column.

# Data
# ==============================================================================
data = sns.load_dataset('diamonds')
data = data.loc[data.cut == 'Fair', 'price']

Distribution Fitting

Two of the first steps when analyzing a variable are calculating the main descriptive statistics and plotting the observed (empirical) distribution. If the data are stored in a pandas Series, the main descriptive statistics can be obtained with the describe() method.

# Descriptive statistics
# ==============================================================================
data.describe()
count     1610.000000
mean      4358.757764
std       3560.386612
min        337.000000
25%       2050.250000
50%       3282.000000
75%       5205.500000
max      18574.000000
Name: price, dtype: float64
# Observed (empirical) distribution plots
# ==============================================================================
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))

# Histogram
axs[0].hist(x=data, bins=30, color="#3182bd", alpha=0.5)
axs[0].plot(data, np.full_like(data, -0.01), '|k', markeredgewidth=1)
axs[0].set_title('Empirical distribution of diamond prices')
axs[0].set_xlabel('price')
axs[0].set_ylabel('counts')

# Cumulative Distribution Function
# ecdf (empirical cumulative distribution function)
ecdf = ECDF(x=data)
axs[1].plot(ecdf.x, ecdf.y, color="#3182bd")
axs[1].set_title('Empirical distribution function')
axs[1].set_xlabel('price')
axs[1].set_ylabel('CDF')

plt.tight_layout();
# Normal distribution fitting
#===============================================================================
# 1) Define the distribution type
distribution = stats.norm

# 2) Obtain the parameters with the fit() method
params = distribution.fit(data=data)

# 3) Create a dictionary that includes the name of each parameter
param_names = [p for p in inspect.signature(distribution._pdf).parameters
               if p != 'x'] + ["loc", "scale"]
params_dict = dict(zip(param_names, params))

# 4) Calculate the log likelihood
log_likelihood = distribution.logpdf(data.to_numpy(), *params).sum()

# 5) Calculate the AIC and BIC
aic = -2 * log_likelihood + 2 * len(params)
bic = -2 * log_likelihood + np.log(data.shape[0]) * len(params)

# 6) Plot
x_hat = np.linspace(min(data), max(data), num=100)
y_hat = distribution.pdf(x_hat, *params)

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(x_hat, y_hat, linewidth=2, label=distribution.name)
ax.hist(x=data, density=True, bins=30, color="#3182bd", alpha=0.5)
ax.plot(data, np.full_like(data, -0.01), '|k', markeredgewidth=1)
ax.set_title('Diamond price distribution')
ax.set_xlabel('price')
ax.set_ylabel('Probability density')
ax.legend();

# 7) Fit information
print('---------------------')
print('Fit results')
print('---------------------')
print(f"Distribution   : {distribution.name}")
print(f"Support        : {[distribution.a, distribution.b]}")
print(f"Parameters     : {params_dict}")
print(f"Log likelihood : {log_likelihood}")
print(f"AIC            : {aic}")
print(f"BIC            : {bic}")
---------------------
Fit results
---------------------
Distribution   : norm
Support        : [-inf, inf]
Parameters     : {'loc': np.float64(4358.757763975155), 'scale': np.float64(3559.2807303891086)}
Log likelihood : -15449.966194325283
AIC            : 30903.932388650566
BIC            : 30914.700367566522

The process is repeated, but this time with the gamma distribution.

# Gamma distribution fitting
#===============================================================================
# 1) Define the distribution type
distribution = stats.gamma

# 2) Obtain the parameters with the fit() method
params = distribution.fit(data=data)

# 3) Create a dictionary that includes the name of each parameter
param_names = [p for p in inspect.signature(distribution._pdf).parameters
               if p != 'x'] + ["loc", "scale"]
params_dict = dict(zip(param_names, params))

# 4) Calculate the log likelihood
log_likelihood = distribution.logpdf(data.to_numpy(), *params).sum()

# 5) Calculate the AIC and BIC
aic = -2 * log_likelihood + 2 * len(params)
bic = -2 * log_likelihood + np.log(data.shape[0]) * len(params)

# 6) Plot
x_hat = np.linspace(min(data), max(data), num=100)
y_hat = distribution.pdf(x_hat, *params)

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(x_hat, y_hat, linewidth=2, label=distribution.name)
ax.hist(x=data, density=True, bins=30, color="#3182bd", alpha=0.5)
ax.plot(data, np.full_like(data, -0.01), '|k', markeredgewidth=1)
ax.set_title('Diamond price distribution')
ax.set_xlabel('price')
ax.set_ylabel('Probability density')
ax.legend();

# 7) Fit information
print('---------------------')
print('Fit results')
print('---------------------')
print(f"Distribution   : {distribution.name}")
print(f"Support        : {[distribution.a, distribution.b]}")
print(f"Parameters     : {params_dict}")
print(f"Log likelihood : {log_likelihood}")
print(f"AIC            : {aic}")
print(f"BIC            : {bic}")
---------------------
Fit results
---------------------
Distribution   : gamma
Support        : [0.0, inf]
Parameters     : {'a': np.float64(14.399103124683087), 'loc': np.float64(-6244.786483560789), 'scale': np.float64(726.3476320693849)}
Log likelihood : -15235.639828174484
AIC            : 30477.27965634897
BIC            : 30493.431624722904

Both the AIC and BIC metrics agree that the gamma distribution fits the data better (lower AIC and BIC values). This can be easily corroborated by graphical inspection of the results.

In this case, given that prices can only be positive and show a notable right tail, the gamma distribution was a much better candidate than the normal distribution. However, there are other possible distributions, some of which could be even better. The next example shows how to automate the search.

Fitting and Comparing Multiple Distributions

The following example shows how to automate the fitting and comparison of multiple distributions available in scipy.stats. The code should allow:

  • Fitting all distributions available in scipy.stats.

  • Preselecting a subset of candidate distributions based on their support.

  • Displaying the parameters of each fit.

  • Calculating AIC and BIC values to select the distribution with the best fit.

  • Representing the results graphically.

Functions

Functions used to compare multiple distributions.

def select_distributions(
    family: str = 'realall', 
    exclusions: list = None,
    verbose: bool = True
) -> list:
    '''
    This function selects a subset of the distributions available
    in scipy.stats
    
    Parameters
    ----------
    family : {'realall', 'realline', 'realplus', 'real0to1', 'discrete'}
        realall: distributions from the `realline` + `realplus` families
        realline: continuous distributions on the support (-inf, +inf)
        realplus: continuous distributions on the support [0, +inf)
        real0to1: continuous distributions on the support [0,1]
        discrete: discrete distributions
        
    exclusions : list, optional
        List of distribution names to exclude. By default excludes
        ['levy_stable', 'vonmises', 'erlang', 'studentized_range']
        
    verbose : bool
        Whether to display information about the selected distributions
        (the default `True`).
        
    Returns
    -------
    distributions: list
        list with the selected distributions (the objects).

    '''
    
    valid_families = ['realall', 'realline', 'realplus', 'real0to1', 'discrete']
    if family not in valid_families:
        raise ValueError(f"The 'family' parameter must be one of {valid_families}, "
                         f"but received '{family}'")
    
    # Get all distributions
    distributions = [getattr(stats, d) for d in dir(stats) 
                     if isinstance(getattr(stats, d), (stats.rv_continuous, stats.rv_discrete))]
    
    # Default exclusions if not specified
    if exclusions is None:
        exclusions = ['levy_stable', 'vonmises', 'erlang', 'studentized_range']
    
    exclusions_set = set(exclusions)
    distributions = [dist for dist in distributions if dist.name not in exclusions_set]

    dist_info = pd.DataFrame([
        {
            'distribution': dist.name,
            'type': 'continuous' if isinstance(dist, stats.rv_continuous) else 'discrete',
            'support_inf': dist.a,
            'support_sup': dist.b,
            'dist_obj': dist
        }
        for dist in distributions
    ]).sort_values(by=['support_inf', 'support_sup']).reset_index(drop=True)
    
    supports = {
        'realline': [-np.inf, np.inf],
        'realplus': [0, np.inf],
        'real0to1': [0, 1]
    }
    
    if family == 'realall':
        mask = (dist_info['type'] == 'continuous') & (
            ((dist_info['support_inf'] == -np.inf) & 
             (dist_info['support_sup'] == np.inf)) |
            ((dist_info['support_inf'] == 0) & 
             (dist_info['support_sup'] == np.inf))
        )
        dist_info = dist_info[mask].reset_index(drop=True)
        
    elif family in supports:
        mask = (dist_info['type'] == 'continuous') & \
               (dist_info['support_inf'] == supports[family][0]) & \
               (dist_info['support_sup'] == supports[family][1])
        dist_info = dist_info[mask].reset_index(drop=True)
        
    elif family == 'discrete':
        dist_info = dist_info[
            dist_info['type'] == 'discrete'
        ].reset_index(drop=True)
    
    selection = dist_info['dist_obj'].tolist()
    dist_info_display = dist_info.drop(columns=['dist_obj'])
    
    if not selection:
        print("⚠️  No distributions found for the specified family")
    
    if verbose:
        print("---------------------------------------------------")
        print("       Selected distributions                ")
        print("---------------------------------------------------")
        with pd.option_context('display.max_rows', None, 'display.max_columns', None): 
            print(dist_info_display)
    
    return selection


def compare_distributions(
    x: np.ndarray,
    family: str = 'realall', 
    criterion: str = 'aic',
    exclusions: list = None,
    verbose: bool = True,
    suppress_warnings: bool = True
) -> pd.DataFrame:
    '''
    This function selects and fits a subset of the distributions 
    available in scipy.stats. For each distribution it calculates the 
    Log Likelihood, AIC and BIC values.
    
    Parameters
    ----------
    x : array_like
        data to fit the distribution.
        
    family : {'realall', 'realline', 'realplus', 'real0to1', 'discrete'}
        realall: distributions from the `realline` + `realplus` families
        realline: continuous distributions on the support (-inf, +inf)
        realplus: continuous distributions on the support [0, +inf)
        real0to1: continuous distributions on the support [0,1]
        discrete: discrete distributions
    
    criterion : {'aic', 'bic'}
        sorting criterion from best to worst fit.
    
    exclusions : list, optional
        List of distribution names to exclude. By default excludes
        ['levy_stable', 'vonmises', 'erlang', 'studentized_range']
    
    verbose : bool
        Whether to display information about the selected distributions
        (the default `True`).
    
    suppress_warnings : bool
        Whether to suppress numerical warnings during fitting (RuntimeWarning).
        Recommended True for clean output (the default `True`).
        
    Returns
    -------
    results: pd.DataFrame
        distribution: distribution name.
        log_likelihood: log likelihood of the fit.
        aic: AIC metric.
        bic: BIC metric.
        n_params: number of distribution parameters.
        params: parameters after fitting

    '''
    
    valid_criteria = ['aic', 'bic']
    if criterion not in valid_criteria:
        raise ValueError(f"The 'criterion' parameter must be one of {valid_criteria}, "
                         f"but received '{criterion}'")
    
    if exclusions is not None and not isinstance(exclusions, (list, tuple, set)):
        raise TypeError("The 'exclusions' parameter must be a list, tuple or set")
    
    x = np.asarray(x)
    if x.size == 0:
        raise ValueError("Array 'x' cannot be empty")
    
    if not np.all(np.isfinite(x)):
        raise ValueError("Array 'x' contains NaN or infinite values")
    
    distributions = select_distributions(
        family=family, 
        exclusions=exclusions, 
        verbose=False
    )
    
    results = []
    for distribution in tqdm(distributions):
        
        if verbose:
            print(f"Fitting distribution: {distribution.name}")
        
        try:
            # Optionally silence numerical warnings raised during fitting
            ctx = warnings.catch_warnings() if suppress_warnings else nullcontext()
            with ctx:
                if suppress_warnings:
                    warnings.simplefilter("ignore")
                params = distribution.fit(data=x)
            
            # Determine if continuous or discrete
            is_continuous = isinstance(distribution, stats.rv_continuous)
            sig_func = distribution._pdf if is_continuous else distribution._pmf
            log_func = distribution.logpdf if is_continuous else distribution.logpmf
            
            # Extract parameter names
            param_names = [p for p in inspect.signature(sig_func).parameters 
                          if p != 'x'] + ["loc", "scale"]
            params_dict = dict(zip(param_names, params))
            
            # Calculate metrics
            log_likelihood = log_func(x, *params).sum()
            aic = -2 * log_likelihood + 2 * len(params)
            bic = -2 * log_likelihood + np.log(x.shape[0]) * len(params)
            
            # Store results
            results.append({
                'distribution': distribution.name,
                'log_likelihood': log_likelihood,
                'aic': aic,
                'bic': bic,
                'n_params': len(params),
                'params': params_dict
            })
            
        except (ValueError, RuntimeError, FloatingPointError) as e:
            if verbose:
                print(f"Error fitting {distribution.name}: {type(e).__name__}: {e}\n")
        except Exception as e:
            if verbose:
                print(f"Unexpected error fitting {distribution.name}: {type(e).__name__}: {e}\n")
    
    # Verify if any distribution was fitted
    if not results:
        if verbose:
            print("\n⚠️  Could not fit any distribution successfully.")
        results = pd.DataFrame(
            columns=['distribution', 'log_likelihood', 'aic', 'bic', 'n_params', 'params']
        )
    else:
        results = pd.DataFrame(results).sort_values(by=criterion).reset_index(drop=True)
    
    return results
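
As a standalone illustration of the selection step (the exact list depends on the installed scipy version), select_distributions can be called directly:

# Preselect continuous distributions with support [0, +inf)
# ==============================================================================
candidates = select_distributions(family='realplus', verbose=False)
print(f"Number of candidates: {len(candidates)}")
print([dist.name for dist in candidates[:5]])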

Data

Again, the diamond prices available in the diamonds dataset from the seaborn library are used, specifically the price column.

# Data
# ==============================================================================
data = sns.load_dataset('diamonds')
data = data.loc[data.cut == 'Fair', 'price']

Distribution Fitting

# Distribution fitting and comparison
# ==============================================================================
results = compare_distributions(
            x=data.to_numpy(),
            family='realall',
            criterion='aic',
            verbose=False,
            suppress_warnings=True
          )
results
     distribution  log_likelihood            aic            bic  n_params                                             params
0   recipinvgauss   -1.488310e+04   2.977220e+04   2.978835e+04         3  {'mu': 0.8564617364318987, 'loc': 75.589562124...
1     geninvgauss   -1.488228e+04   2.977256e+04   2.979410e+04         4  {'p': 0.03895738587977514, 'b': 1.373372982791...
2    norminvgauss   -1.488307e+04   2.977415e+04   2.979569e+04         4  {'a': 53.047412557163966, 'b': 53.025400276075...
3   genhyperbolic   -1.488252e+04   2.977503e+04   2.980195e+04         5  {'p': 0.06748200212521904, 'a': 19.46831101851...
4         lognorm   -1.488561e+04   2.977722e+04   2.979337e+04         3  {'s': 0.775749425283541, 'loc': 30.76091834294...
..            ...             ...            ...            ...       ...                                                ...
78      exponweib   -1.697787e+04   3.396375e+04   3.398529e+04         4  {'a': 3.6435974274790723, 'c': 0.1273335852525...
79     invweibull   -1.799302e+04   3.599205e+04   3.600820e+04         3  {'c': 0.1931244916343704, 'loc': 336.999783682...
80           ncx2   -1.024232e+06   2.048472e+06   2.048493e+06         4  {'df': 0.0011534145941760432, 'nc': 0.34024851...
81      truncnorm            -inf            inf            inf         4  {'a': 388.84586460519677, 'b': 506.69039671197...
82    weibull_min            -inf            inf            inf         3  {'c': 1.1567328971778637, 'loc': 495.039530173...

83 rows × 6 columns
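
Since compare_distributions returns an ordinary DataFrame, the winning distribution and its fitted parameters can be recovered directly (a small sketch; in this run recipinvgauss ranked first):

# Best distribution according to the chosen criterion
# ==============================================================================
best_name = results['distribution'].iloc[0]
best_params = results['params'].iloc[0]
print(f"Best distribution : {best_name}")
print(f"Parameters        : {best_params}")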

Plots

def fit_plot_distribution(
    x: np.ndarray, 
    distribution_name, 
    ax=None, 
    verbose: bool = True,
):
    '''
    This function overlays the density curve(s) of one or more 
    distributions with the data histogram.
    
    Parameters
    ----------
    x : array_like
        data to fit the distribution.
        
    distribution_name : str or list
        name of a distribution or list of distribution names 
        available in `scipy.stats`.
    
    ax : matplotlib.axes.Axes, optional
        axis on which to make the plot. If None, a new one is created.
    
    verbose : bool, optional
        If True, prints a warning when a distribution does not exist in
        scipy.stats or cannot be fitted (default True).
        
    Returns
    -------
    ax: matplotlib.axes.Axes
        created plot
        
    '''
    
    # Input validation
    x = np.asarray(x)
    if x.size == 0:
        raise ValueError("Array 'x' cannot be empty")
    if not np.all(np.isfinite(x)):
        raise ValueError("Array 'x' contains NaN or infinite values")
    
    if isinstance(distribution_name, str):
        distribution_name = [distribution_name]
    if not isinstance(distribution_name, (list, tuple)):
        raise TypeError("The 'distribution_name' parameter must be a string, list or tuple")
    
    # Create plot
    if ax is None:
        fig, ax = plt.subplots(figsize=(7,4))
    
    # Plot histogram
    ax.hist(x=x, density=True, bins=30, color="#3182bd", alpha=0.5)
    ax.plot(x, np.full_like(x, -0.01), '|k', markeredgewidth=1)
    ax.set_title('Distribution fit' if len(distribution_name) == 1 else 
                 'Distribution fits')
    ax.set_xlabel('x')
    ax.set_ylabel('Probability density')
    
    for name in tqdm(distribution_name):
        if not hasattr(stats, name):
            if verbose:
                print(f"⚠️  Distribution '{name}' does not exist in scipy.stats")
            continue
        
        try:

            distribution = getattr(stats, name)
            params = distribution.fit(data=x)
            
            is_continuous = isinstance(distribution, stats.rv_continuous)
            pdf_func = distribution.pdf if is_continuous else distribution.pmf
            
            x_hat = np.linspace(min(x), max(x), num=100)
            y_hat = pdf_func(x_hat, *params)
            ax.plot(x_hat, y_hat, linewidth=2, label=distribution.name)
            
        except Exception as e:
            if verbose:
                print(f"⚠️  Error fitting distribution '{name}': {type(e).__name__}: {e}")

    # Add the legend once, after all distributions have been plotted
    ax.legend(loc='best')

    return ax

The distribution with the best fit according to the AIC criterion is shown below.

fig, ax = plt.subplots(figsize=(6, 3))

fit_plot_distribution(
    x=data.to_numpy(),
    distribution_name=results['distribution'][0],
    ax=ax
);

The probability density curves for the 5 best distributions.

fig, ax = plt.subplots(figsize=(6, 3))

fit_plot_distribution(
    x=data.to_numpy(),
    distribution_name=results['distribution'][:5].tolist(),
    ax=ax
);

Density Function, Quantile, and Sampling

All continuous distribution objects in scipy.stats have the pdf(), logpdf(), cdf(), ppf(), and rvs() methods for calculating density, log density, cumulative probability, quantiles, and sampling new values, respectively (discrete distributions expose pmf() and logpmf() instead of pdf() and logpdf()). For example, 5 new diamond price values can be simulated from the fitted johnsonsu distribution.

# Distribution definition
distribution = stats.johnsonsu

# Fitting to obtain parameter values
params = distribution.fit(data.to_numpy())

# Random sampling
distribution.rvs(*params, size=5)
array([12925.69222894,  2831.92278871,  3877.35030327,   872.06750859,
        1713.25813818])
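
The remaining methods reuse the same fitted parameters. For example, cdf() returns cumulative probabilities and ppf() returns quantiles (the threshold 5000 and the level 0.90 below are arbitrary values chosen for illustration):

# Cumulative probability and quantile with the fitted johnsonsu distribution
# ==============================================================================
# Probability that a price is at most 5000
prob = distribution.cdf(5000, *params)

# Price below which 90% of the fitted distribution's mass lies
q90 = distribution.ppf(0.90, *params)

print(f"P(price <= 5000) : {prob:.3f}")
print(f"0.90 quantile    : {q90:.1f}")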

Session Information

import session_info
session_info.show(html=False)
-----
matplotlib          3.10.8
numpy               2.2.6
pandas              2.3.3
scipy               1.15.3
seaborn             0.13.2
session_info        v1.0.1
statsmodels         0.14.6
tqdm                4.67.1
-----
IPython             9.8.0
jupyter_client      8.7.0
jupyter_core        5.9.1
-----
Python 3.13.11 | packaged by Anaconda, Inc. | (main, Dec 10 2025, 21:28:48) [GCC 14.3.0]
Linux-6.14.0-37-generic-x86_64-with-glibc2.39
-----
Session information updated at 2026-01-16 13:00

Bibliography

Rigby, R. A., Stasinopoulos, M. D., Heller, G. Z., & De Bastiani, F. (2019). Distributions for Modeling Location, Scale, and Shape Using GAMLSS in R. Chapman and Hall/CRC.

Delignette-Muller, M. L., & Dutang, C. (2015). fitdistrplus: An R Package for Fitting Distributions. Journal of Statistical Software, 64(4). https://doi.org/10.18637/jss.v064.i04

Citation Instructions

How to cite this document?

If you use this document or any part of it, we would appreciate it if you cite it. Thank you very much!

Fitting and Selecting Probability Distributions with Python by Joaquín Amat Rodrigo, available under an Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0 DEED) license at https://www.cienciadedatos.net/documentos/pystats01-fitting-probability-distributions-python.html



This document, created by Joaquín Amat Rodrigo, is licensed under an Attribution-NonCommercial-ShareAlike 4.0 International license.

Permitted:

  • Share: copy and redistribute the material in any medium or format.

  • Adapt: remix, transform, and build upon the material.

Under the following terms:

  • Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

  • Non-Commercial: You may not use the material for commercial purposes.

  • Share-Alike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.