Distribution Comparison: Cramér-von Mises Test

Introduction

This document presents several strategies to compare distributions and detect whether there are differences between them. In practice, this type of comparison can be useful in cases such as:

  • Determining whether two variables have the same distribution.

  • Determining whether the same variable is distributed identically across two groups.

  • Monitoring models in production: in projects involving the creation and deployment of predictive models, models are trained with historical data assuming that the predictor variables will maintain the same behavior in the near future. How can we detect if this ceases to be true? What if a variable starts behaving differently? Detecting these changes can be used as an alarm to indicate the need to retrain models, either in an automated way or through new studies.

  • Finding variables with different behavior between two scenarios: in industrial settings, it is common to have multiple production lines that supposedly perform exactly the same process. However, one of the lines often generates different results. One way to discover the reason for this difference is by comparing the variables measured on each line to identify which ones differ to a greater degree.

Multiple strategies exist to address these questions. One of the most commonly used approaches is to compare centrality statistics (mean, median, etc.) or dispersion statistics (standard deviation, interquartile range, etc.). This strategy has the advantage of being easy to implement and interpret. However, each of these statistics captures only one type of difference, so depending on which one is used, important changes may go undetected. For example, two very different distributions can have the same mean.
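As a quick illustration of that last point, the following sketch (simulated data, not part of this document's example) builds a unimodal sample and a bimodal sample whose means are both approximately zero; comparing means alone would miss the obvious difference in shape.

# Two clearly different distributions with (approximately) the same mean
# ==============================================================================
import numpy as np

rng = np.random.default_rng(0)
unimodal = rng.normal(loc=0, scale=1, size=10_000)
bimodal  = np.concatenate([
    rng.normal(loc=-3, scale=0.5, size=5_000),
    rng.normal(loc=3, scale=0.5, size=5_000)
])
print(f"Mean unimodal: {unimodal.mean():.3f}")
print(f"Mean bimodal : {bimodal.mean():.3f}")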

Another approach consists of using methods that attempt to quantify the "distance" between distributions, for example the Cramér-von Mises distance, which is sensitive to differences in both the location and the shape of the distribution.

As with most statistical tests and methods, no single one always outperforms the others. Depending on which change in the distribution is most important to detect, one strategy will be better than another.

Cramér-von Mises Distance

The Cramér-von Mises (CvM) distance is a statistical measure quantifying the difference between an empirical cumulative distribution function (ECDF) and a theoretical reference cumulative distribution function (CDF), or between two empirical CDFs, by integrating the squared difference across all values.

Figure. Left: illustration of the Cramér-von Mises distance between a sample and a reference distribution; the red line shows the theoretical cumulative distribution function, the blue line the empirical one, and the CvM distance integrates the squared differences across the entire distribution. Right: illustration of the Cramér-von Mises distance between two samples; the red and blue lines show the empirical cumulative distribution functions of the two samples.

The CvM distance is based on the integrated squared difference between the empirical and theoretical distributions.

The Theoretical Formula

For a sample of size $n$, the test statistic $W^2$ is defined as:

$$W^2 = n \int_{-\infty}^{\infty} [F_n(x) - F(x)]^2 dF(x)$$

Where:

  • $F_n(x)$ is the empirical CDF of the sample.
  • $F(x)$ is the theoretical CDF of the null distribution.

The Practical Calculation

In practice, you don't need to perform integration. If you have $n$ observations sorted in increasing order ($x_1 \le x_2 \le \dots \le x_n$), the formula simplifies to:

$$W^2 = \frac{1}{12n} + \sum_{i=1}^{n} \left[ F(x_i) - \frac{2i - 1}{2n} \right]^2$$

Step-by-Step Process

  • Sort your sample data from smallest to largest.
  • Calculate the theoretical CDF value $F(x_i)$ for each data point (e.g., if testing for normality, find the Z-score and then the percentile).
  • Plug the values into the formula above.
  • Compare the resulting $W^2$ value against critical values from a CvM distribution table or use software to find the $p$-value.
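A minimal sketch of these steps, testing simulated data against a standard normal distribution (the sample and reference distribution are illustrative, not part of the salary example used later):

# One-sample Cramér-von Mises statistic, step by step (illustrative data)
# ==============================================================================
import numpy as np
from scipy.stats import norm, cramervonmises

rng = np.random.default_rng(42)
sample = rng.normal(loc=0, scale=1, size=100)

x = np.sort(sample)     # step 1: sort the data
F = norm.cdf(x)         # step 2: theoretical CDF at each point
n = len(x)
i = np.arange(1, n + 1)
W2 = 1 / (12 * n) + np.sum((F - (2 * i - 1) / (2 * n)) ** 2)  # step 3

# Step 4: compare against scipy's implementation
result = cramervonmises(sample, 'norm')
print(f"Manual W^2: {W2:.6f}")
print(f"scipy W^2 : {result.statistic:.6f}, p-value: {result.pvalue:.4f}")

Both values should agree up to floating-point error, since scipy applies the same computational formula.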

CvM vs. Kolmogorov-Smirnov (KS) Distance

The Kolmogorov-Smirnov (KS) distance is the most famous alternative to CvM. While both use the CDF to measure fit, they differ in how they measure the gap. While the KS distance focuses on the maximum deviation between the empirical and theoretical CDFs, the CvM distance considers the overall squared differences across the entire range of data.

Imagine two curves that are very close for 90% of the range but have one sharp "bump" where they diverge significantly. The KS test will focus entirely on that bump. Conversely, imagine two curves that are slightly different across the entire range, but never deviate sharply at any single point. The KS test might miss this (as no single gap is "large"), but the CvM test will flag it because the small errors "add up" over the entire integral.
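To make this concrete, the following sketch (simulated data, not from this document's example) applies both tests to two samples that differ by a small shift across their whole range. With draws like these, the cumulative nature of CvM tends to yield a smaller p-value than KS, although the exact outcome depends on the samples:

# KS vs. CvM on two slightly shifted samples (illustrative data)
# ==============================================================================
import numpy as np
from scipy.stats import ks_2samp, cramervonmises_2samp

rng = np.random.default_rng(1)
sample_a = rng.normal(loc=0.00, scale=1, size=500)
sample_b = rng.normal(loc=0.15, scale=1, size=500)  # small shift everywhere

ks_result  = ks_2samp(sample_a, sample_b)
cvm_result = cramervonmises_2samp(sample_a, sample_b)
print(f"KS  p-value: {ks_result.pvalue:.4f}")
print(f"CvM p-value: {cvm_result.pvalue:.4f}")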

When should you use CvM over KS?

  • Use KS if you are primarily worried about a "worst-case" difference at a specific point in the data.

  • Use CvM if you want to know whether the distributions differ in any way across their entire range. Because it accumulates evidence over the whole distribution, it is often more powerful than KS for detecting general departures from the reference.

Example

A study aims to determine whether the distribution of salaries in Spain has changed between 1989 and 1990. To do this, the Cramér-von Mises Distance is used.

Libraries

The libraries used in this document are:

# Data processing
# ==============================================================================
import numpy as np
import pandas as pd

# Plots
# ==============================================================================
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
plt.rcParams.update({'font.size': 10})
plt.rcParams['lines.linewidth'] = 1
import seaborn as sns

# Modeling and statistical tests
# ==============================================================================
from statsmodels.distributions.empirical_distribution import ECDF
from scipy.stats import cramervonmises_2samp

# Warnings configuration
# ==============================================================================
import warnings
warnings.filterwarnings("once")

Data

The Snmesp dataset from the R package plm contains a sample of salaries (on a logarithmic scale) paid in Spain during the years 1983 to 1990 (738 observations per year). The CSV used here contains only the observations for 1989 and 1990 (1476 rows in total).

# Data
# ==============================================================================
url = ('https://raw.githubusercontent.com/JoaquinAmatRodrigo/'
       'Estadistica-machine-learning-python/master/data/Snmesp.csv')
data = pd.read_csv(url)
data['year'] = data['year'].astype(str) 
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1476 entries, 0 to 1475
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   year    1476 non-null   object 
 1   salary  1476 non-null   float64
dtypes: float64(1), object(1)
memory usage: 23.2+ KB

Graphical Analysis

# Distribution plots
# ==============================================================================
fig, axs = plt.subplots(nrows=2, ncols=1, figsize=(7, 6))
sns.violinplot(
    x     = data.salary,
    y     = data.year,
    color = ".8",
    ax    = axs[0]
)

sns.stripplot(
    x      = data.salary,
    y      = data.year,
    hue    = data.year,
    size   = 4,
    jitter = 0.1,
    legend = False,
    ax     = axs[0]
)
axs[0].set_title('Salary distribution by year')
axs[0].set_ylabel('year')
axs[0].set_xlabel('salary')

for year in data.year.unique():
    temp_data = data[data.year == year]['salary']
    temp_data.plot.kde(ax=axs[1], label=year)
    axs[1].plot(temp_data, np.full_like(temp_data, 0), '|k', markeredgewidth=1)

axs[1].set_xlabel('salary')
axs[1].legend()
fig.tight_layout();

Empirical Cumulative Distribution Function

The ECDF() class from the statsmodels library fits the empirical cumulative distribution function of a sample. The result is a callable ecdf object that behaves similarly to a predictive model: given a vector of values, it returns their estimated cumulative probabilities.
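For example, with toy data (not the salary data):

# Minimal illustration of ECDF's callable behavior (toy data)
# ==============================================================================
from statsmodels.distributions.empirical_distribution import ECDF

ecdf = ECDF([1, 2, 3, 4])
print(ecdf(2.5))  # 0.5: two of the four observations are <= 2.5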

# Fitting ecdf functions with each sample
# ==============================================================================
ecdf_1989 = ECDF(data.loc[data.year == '1989', 'salary'])
ecdf_1990 = ECDF(data.loc[data.year == '1990', 'salary'])
# Estimation of cumulative probability for each observed salary value
# ==============================================================================
salary_grid = np.sort(data.salary.unique())
cumulative_prob_ecdf_1989 = ecdf_1989(salary_grid)
cumulative_prob_ecdf_1990 = ecdf_1990(salary_grid)
# Graphical representation of ecdf curves and the area between them
# ==============================================================================
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(6, 4))
ax.plot(salary_grid, cumulative_prob_ecdf_1989, label='1989')
ax.plot(salary_grid, cumulative_prob_ecdf_1990, label='1990')
ax.fill_between(
    salary_grid,
    cumulative_prob_ecdf_1989,
    cumulative_prob_ecdf_1990,
    color='gray',
    alpha=0.5,
    label='Area between ECDFs'
)
ax.set_title("Empirical cumulative distribution function of salaries")
ax.set_ylabel("Cumulative probability")
ax.legend();

This same plot can be generated directly with the ecdfplot function from seaborn.
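A sketch of that alternative (styling and the shaded area between curves omitted):

# ECDF plot with seaborn
# ==============================================================================
import seaborn as sns

sns.ecdfplot(data=data, x='salary', hue='year');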

Calculating the Cramér-von Mises Distance

To calculate the Cramér-von Mises distance between two samples, we use the two-sample formulation. The test statistic measures the integrated squared difference between the two empirical CDFs:

$$T = \frac{nm}{n+m} \int_{-\infty}^{\infty} [F_n(x) - G_m(x)]^2 dH(x)$$

Where:

  • $F_n(x)$ and $G_m(x)$ are the empirical CDFs of the two samples.
  • $H(x)$ is the pooled empirical CDF (from combining both samples).
  • $n$ and $m$ are the sample sizes.

In practice, the computational formula involves ranking all observations from both samples and evaluating the empirical CDFs at each observation point. The scipy implementation uses the efficient rank-based formula of Anderson (1962) for this calculation.

# Cramér-von Mises distance (manual calculation)
# ==============================================================================
# Squared differences between the two ECDFs at each point of the grid
squared_diff = (cumulative_prob_ecdf_1989 - cumulative_prob_ecdf_1990) ** 2

# The CvM statistic integrates the squared differences with respect to the
# pooled empirical CDF, which assigns weight 1/(n+m) to each observation;
# summing over the grid of unique values approximates this integral.
n = len(data[data.year == '1989'])
m = len(data[data.year == '1990'])
cvm_distance = (n * m / (n + m)**2) * np.sum(squared_diff)

print(f"Cramér-von Mises distance (approximation): {cvm_distance:.6f}")
Cramér-von Mises distance (approximation): 1.470639
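The small discrepancy between this value and the one reported by scipy in the next section comes from the grid-based approximation above. As a cross-check, the following sketch implements the rank-based formula from Anderson (1962) described in the scipy documentation; assuming there are no ties among the pooled salaries, it should reproduce the statistic returned by cramervonmises_2samp() up to floating-point error.

# Rank-based calculation (Anderson, 1962), as used by scipy
# ==============================================================================
from scipy.stats import rankdata

x = data.loc[data.year == '1989', 'salary'].to_numpy()
y = data.loc[data.year == '1990', 'salary'].to_numpy()
n, m = len(x), len(y)

ranks = rankdata(np.concatenate([x, y]))
r = np.sort(ranks[:n])   # ranks of the 1989 values in the pooled sample
s = np.sort(ranks[n:])   # ranks of the 1990 values in the pooled sample

i = np.arange(1, n + 1)
j = np.arange(1, m + 1)
u = n * np.sum((r - i) ** 2) + m * np.sum((s - j) ** 2)
t = u / (n * m * (n + m)) - (4 * m * n - 1) / (6 * (m + n))
print(f"Rank-based statistic: {t:.6f}")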

Cramér-von Mises Test

Once the Cramér-von Mises distance is calculated, it must be determined whether the value of this distance is large enough, considering the available samples, to conclude that the two distributions are different. This can be achieved by calculating the probability (p-value) of observing equal or greater distances if both samples came from the same distribution.

The Cramér-von Mises statistical test for two samples is available in the cramervonmises_2samp() function from the scipy.stats library. The null hypothesis of this test is that both samples come from the same distribution. Therefore, only when the estimated p-value is small can it be concluded that there is evidence against both samples coming from the same distribution.

Example

Using the same data from the previous example, the Cramér-von Mises test is applied to answer the question of whether the salary distribution has changed between 1989 and 1990.

# Cramér-von Mises test between two samples
# ==============================================================================
result = cramervonmises_2samp(
    data.loc[data.year == '1989', 'salary'],
    data.loc[data.year == '1990', 'salary']
)

print(f"Cramér-von Mises statistic: {result.statistic:.6f}")
print(f"P-value: {result.pvalue:.6f}")
Cramér-von Mises statistic: 1.472183
P-value: 0.000200

Given the very small p-value (0.0002), there is sufficient evidence at any reasonable significance level (e.g., α = 0.05) to reject the null hypothesis and conclude that the salary distribution changed between 1989 and 1990.

Session Information

import session_info
session_info.show(html=False)
-----
matplotlib          3.10.8
numpy               2.4.0
pandas              2.3.3
scipy               1.16.3
seaborn             0.13.2
session_info        v1.0.1
statsmodels         0.14.6
-----
IPython             9.9.0
jupyter_client      8.7.0
jupyter_core        5.9.1
-----
Python 3.13.11 | packaged by conda-forge | (main, Dec  6 2025, 11:10:00) [MSC v.1944 64 bit (AMD64)]
Windows-11-10.0.26100-SP0
-----
Session information updated at 2026-01-05 17:26

Citation Instructions

How to cite this document?

If you use this document or any part of it, we appreciate you citing it. Thank you very much!

Distribution Comparison: Cramér-von Mises Test by Joaquín Amat Rodrigo, available under an Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0 DEED) license at https://cienciadedatos.net/documentos/pystats12-cramer-von-mises-test-python.html

Did you like the article? Your help is important

Your contribution will help me continue generating free educational content. Thank you so much! 😊


Creative Commons Licence

This document created by Joaquín Amat Rodrigo is licensed under Attribution-NonCommercial-ShareAlike 4.0 International.

Permissions:

  • Share: copy and redistribute the material in any medium or format.

  • Adapt: remix, transform, and build upon the material.

Under the following terms:

  • Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

  • Non-Commercial: You may not use the material for commercial purposes.

  • Share-Alike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.