Anomaly Detection with Gaussian Mixture Model (GMM) and Python
Introduction¶
Gaussian Mixture Model (GMM) is a probabilistic model in which observations are considered to follow a probability distribution formed by the combination of multiple normal distributions (components). It can be understood as a generalization of K-means where, instead of assigning each observation to a single cluster, a probability distribution of membership to each component is obtained.
Fitting a GMM model consists of estimating the parameters that define the distribution function of each component: the mean and the covariance matrix. For example, if the model has two components, the mean and covariance matrix of each component must be found. If it's a multidimensional problem, for example with 3 variables, the mean of each component is a vector of 3 values and the covariance matrix is a 3×3 matrix. The most commonly used algorithm for fitting is Expectation-Maximization (EM).
Once the parameters are learned, the probability density of each observation belonging to each component and to the distribution as a whole can be calculated. Observations with very low probability density can be considered anomalies.
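As a minimal sketch of this idea, the following example fits a GMM to synthetic data generated with make_blobs and flags the observations with the lowest density as potential anomalies; the 2% cut-off is an arbitrary choice for illustration.
# Minimal illustration: GMM as a density-based anomaly detector on synthetic data
# ==============================================================================
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
X, _ = make_blobs(n_samples=300, centers=2, cluster_std=1.0, random_state=123)
gmm = GaussianMixture(n_components=2, random_state=123).fit(X)
log_density = gmm.score_samples(X)          # log probability density of each observation
threshold = np.quantile(log_density, 0.02)  # arbitrary cut-off: lowest 2% of densities
anomalies = X[log_density < threshold]
print(f"{len(anomalies)} observations flagged as potential anomalies")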
Probability Density
As it is a probabilistic model, fitting a GMM actually produces a probability density function. The concept of density can be understood as the continuous analog of the probability assigned to each value in a discrete distribution, but instead of being bounded to the range [0, 1], it can take values in [0, +$\infty$). The probability density is a relative measure of likelihood: the higher the density of an observation, the more evidence that the observation belongs to a particular distribution.
Frequently, to facilitate calculations, instead of using the density value, its logarithm is used. The interpretation is the same: the higher its value, the more evidence that the observation belongs to the distribution.
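A quick worked example of why density is not bounded by 1: the density of a narrow normal distribution evaluated at its mean exceeds 1. The sketch below uses scipy.stats, which is installed alongside Scikit-learn.
# Density values are not bounded by 1: pdf of a narrow normal distribution
# ==============================================================================
import numpy as np
from scipy.stats import norm
density_at_mean = norm.pdf(0, loc=0, scale=0.1)    # N(0, 0.1) evaluated at its mean
print(f"Density: {density_at_mean:.2f}")           # approximately 3.99, greater than 1
print(f"Log density: {np.log(density_at_mean):.2f}")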
Practical Aspects
When training a GMM model, the type of covariance matrix must be chosen along with the number of components. Depending on this choice, the shape of the component distributions can be one of the following (a short sketch after the list shows how each option affects the fitted parameters):
- tied: all components share the same covariance matrix.
- diag (diagonal): the size of each component can differ along each dimension, but the components always remain aligned with the axes, i.e., their orientation is restricted.
- spherical: each component has the same size in all dimensions. This allows distributions of different sizes, but all of them spherical.
- full: each component has its own covariance matrix, so it can have any orientation and size.
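As a minimal sketch on synthetic data, the effect of each option can be seen in the shape of the fitted covariances_ attribute:
# Shape of covariances_ for each covariance_type (synthetic data)
# ==============================================================================
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=123)
for cov_type in ['full', 'tied', 'diag', 'spherical']:
    model = GaussianMixture(n_components=3, covariance_type=cov_type, random_state=123).fit(X)
    print(f"{cov_type:<10} covariances_.shape = {np.shape(model.covariances_)}")
# full:      (3, 2, 2)  one full matrix per component
# tied:      (2, 2)     one matrix shared by all components
# diag:      (3, 2)     one variance per dimension and component
# spherical: (3,)       a single variance per component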
Gaussian Mixture Model (GMM) in Python
One of the main implementations of the Gaussian Mixture Model is available in Scikit-learn.
Libraries¶
The libraries used in this document are:
# Data processing
# ==============================================================================
import numpy as np
import pandas as pd
from mat4py import loadmat
from sklearn.datasets import make_blobs
# Graphics
# ==============================================================================
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
plt.rcParams['lines.linewidth'] = 1.5
plt.rcParams['font.size'] = 8
import seaborn as sns
# Preprocessing and modeling
# ==============================================================================
from sklearn.mixture import GaussianMixture
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# Warnings configuration
# ==============================================================================
import warnings
warnings.filterwarnings('once')
Example 1¶
Data¶
The data used has been obtained from Outlier Detection DataSets (ODDS), a repository with data commonly used to compare the ability of different algorithms to identify anomalies (outliers). Shebuti Rayana (2016). ODDS Library [http://odds.cs.stonybrook.edu]. Stony Brook, NY: Stony Brook University, Department of Computer Science.
All these datasets are labeled: it is known whether each observation is an anomaly or not (variable y). Although the methods described in this document are unsupervised, i.e., they do not use the response variable, knowing the true classification makes it possible to evaluate their ability to correctly identify anomalies.
- Cardiotocography dataset:
- Number of observations: 1831
- Number of variables: 21
- Number of outliers: 176 (9.6%)
- y: 1 = outliers, 0 = inliers
- Notes: all variables are centered and scaled (mean 0, sd 1).
- Reference: C. C. Aggarwal and S. Sathe, "Theoretical foundations and algorithms for outlier ensembles." ACM SIGKDD Explorations Newsletter, vol. 17, no. 1, pp. 24–47, 2015. Saket Sathe and Charu C. Aggarwal. LODES: Local Density meets Spectral Outlier Detection. SIAM Conference on Data Mining, 2016.
The data is available in MATLAB format (.mat). To read its content, the loadmat() function from the mat4py package is used.
# Reading data
# ==============================================================================
data = loadmat(filename='cardio.mat')
data_X = pd.DataFrame(data['X'])
data_X.columns = ["col_" + str(i) for i in data_X.columns]
data_y = pd.DataFrame(data['y'])
data_y = data_y.to_numpy().flatten()
Model¶
The sklearn.mixture.GaussianMixture class from Scikit-Learn can be used to train GMM models using the expectation-maximization (EM) algorithm. Its main parameters include:
- n_components: number of components (in this case, clusters) that form the model.
- covariance_type: type of covariance matrix ('full' (default), 'tied', 'diag', 'spherical').
- max_iter: maximum number of iterations allowed.
- random_state: seed to ensure reproducibility of results.
# GMM model definition and training
# ==============================================================================
gmm_model = GaussianMixture(
n_components = 4,
covariance_type = 'full',
random_state = 123
)
gmm_model.fit(X=data_X)
GaussianMixture(n_components=4, random_state=123)
The fitted GaussianMixture object contains, among other attributes: the weight of each component (cluster) in the model (weights_), its mean (means_), and its covariance matrix (covariances_). The structure of the latter depends on the type of matrix used when fitting the model (covariance_type).
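These attributes can be inspected directly on the fitted model; a quick check:
# Fitted component parameters
# ==============================================================================
print("Component weights:", gmm_model.weights_.round(3))
print("Means shape      :", gmm_model.means_.shape)        # (n_components, n_features)
print("Covariances shape:", gmm_model.covariances_.shape)  # (n_components, n_features, n_features) with 'full'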
How to determine the number of components and covariance matrix type?
As it is an unsupervised problem, there is no way to know in advance the optimal number of components or the type of covariance matrix. Fortunately, because it is a probabilistic model, metrics such as the Akaike information criterion (AIC) or the Bayesian information criterion (BIC) can be used to measure how well the model fits the observed data while penalizing model complexity to limit overfitting. In the Scikit-learn implementation, the lower the value of either metric, the better.
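Both metrics can be obtained directly from a fitted model. As a quick check on the model fitted above:
# AIC and BIC of the model fitted above (4 components, 'full')
# ==============================================================================
print(f"AIC: {gmm_model.aic(data_X):.1f}")
print(f"BIC: {gmm_model.bic(data_X):.1f}")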
# GMM model tuning
# ==============================================================================
fig, ax = plt.subplots(figsize=(6, 3))
n_components = range(1, 10)
covariance_types = ['spherical', 'tied', 'diag', 'full']
for covariance_type in covariance_types:
bic_values = []
for i in n_components:
model = GaussianMixture(n_components=i, covariance_type=covariance_type)
model = model.fit(data_X)
bic_values.append(model.bic(data_X))
ax.plot(n_components, bic_values, label=covariance_type)
ax.set_title("BIC Values")
ax.set_xlabel("Number of components")
ax.legend();
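As a complement to the visual inspection, the configuration with the lowest BIC can also be selected programmatically; a minimal sketch over the same grid of values:
# Programmatic selection of the configuration with the lowest BIC
# ==============================================================================
best_bic = np.inf
best_config = None
for covariance_type in ['spherical', 'tied', 'diag', 'full']:
    for n in range(1, 10):
        model = GaussianMixture(
            n_components    = n,
            covariance_type = covariance_type,
            random_state    = 123
        )
        bic = model.fit(data_X).bic(data_X)
        if bic < best_bic:
            best_bic = bic
            best_config = (n, covariance_type)
print("Lowest BIC configuration (n_components, covariance_type):", best_config)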
Based on the BIC values, the optimal model configuration is around 6 components with a full covariance matrix. The model is retrained using these values.
# GMM model training
# ==============================================================================
gmm_model = GaussianMixture(
n_components = 6,
covariance_type = 'full',
random_state = 123,
)
gmm_model.fit(X=data_X)
GaussianMixture(n_components=6, random_state=123)
Prediction¶
GaussianMixture models have several prediction methods. To use them as anomaly detectors, it is important to know the probability density that each observation has according to the model. This value can be obtained with the score_samples() method. The documentation indicates that this method returns the logarithm of the probability, but strictly speaking, it is the logarithm of the probability density. This is why, if the logarithm is reversed with the exponential function, values greater than 1 are obtained.
# Log probability prediction
# ==============================================================================
log_probability_predicted = gmm_model.score_samples(X=data_X)
log_probability_predicted
array([ 11.05595431, 9.18897279, 9.58042572, ..., -5.78051384,
-8.01073077, -20.51751524], shape=(1831,))
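Undoing the logarithm with the exponential function confirms that these are densities, since values above 1 appear; a quick check:
# Exponentiating the log values yields densities, which can exceed 1
# ==============================================================================
probability_density = np.exp(log_probability_predicted)
print(f"Maximum density: {probability_density.max():.2f}")
print(f"Densities greater than 1: {(probability_density > 1).sum()}")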
# Distribution of probabilities (log density)
# ==============================================================================
fig, ax = plt.subplots(figsize=(6, 3))
sns.kdeplot(
data = log_probability_predicted,
fill = True,
linewidth = 1,
color = 'blue',
ax = ax
)
sns.rugplot(
data = log_probability_predicted,
color = 'blue',
ax = ax
)
ax.set_title('Prediction Distribution')
ax.set_xlabel('Log probability density')
plt.tight_layout()
Anomaly Detection¶
Once the probability density of each observation according to the model is calculated, it can be used as a criterion to identify anomalies. Observations with very low predicted probability would be expected to be anomalous.
In practice, if this detection strategy is being used, it is because labeled data is not available, i.e., it is not known which observations are actually anomalies. However, as in this example we have the actual classification, we can verify if anomalous data actually has lower probabilities.
# Distribution of anomaly values
# ==============================================================================
df_results = pd.DataFrame({
'log_probability' : log_probability_predicted,
'anomaly' : data_y
})
fig, ax = plt.subplots(figsize=(4, 3))
sns.boxplot(
x = 'anomaly',
y = 'log_probability',
hue = 'anomaly',
data = df_results,
ax = ax
)
ax.set_title('GMM Prediction Distribution')
ax.set_ylabel('Log probability density')
ax.set_xlabel('classification (0 = normal, 1 = anomaly)')
ax.get_legend().remove();
The distribution of the logarithm of probability (density) in the anomaly group is lower. However, due to considerable overlap, if the n observations with the lowest probability are classified as anomalies, false positive errors would occur.
According to the documentation, the Cardiotocography dataset contains 176 anomalies. See the resulting confusion matrix if the 176 observations with the lowest predicted probability are classified as anomalies.
# Confusion matrix of the final classification
# ==============================================================================
df_results = df_results \
.sort_values('log_probability', ascending=True) \
.reset_index(drop=True)
df_results['classification'] = np.where(df_results.index < 176, 1, 0)  # flag the 176 lowest-density observations
pd.crosstab(
df_results.anomaly,
df_results.classification,
rownames=['actual value'],
colnames=['prediction']
)
| actual value \ prediction | 0 | 1 |
|---|---|---|
| 0.0 | 1532 | 123 |
| 1.0 | 122 | 54 |
Of the observations flagged as anomalies, only 54 (around 31%) actually are. The rate of false positives is very high; the GMM method does not achieve notable results on this dataset.
One of the main limitations of using GMM models as anomaly detectors is that they assume the data follows multivariate normal distributions. In practice, this assumption is difficult to meet, especially as the number of variables increases. A strategy that sometimes manages to mitigate part of this problem is to reduce the dimensionality of the data with PCA and then apply a GMM model with only a few components.
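A minimal sketch of this strategy is shown below; the number of principal components (5) is an arbitrary choice for illustration, not a tuned value.
# Sketch: reduce dimensionality with PCA before fitting the GMM
# ==============================================================================
from sklearn.decomposition import PCA
pca = PCA(n_components=5)                     # arbitrary number of principal components
data_X_pca = pca.fit_transform(data_X)
gmm_pca = GaussianMixture(
    n_components    = 6,
    covariance_type = 'full',
    random_state    = 123
)
gmm_pca.fit(data_X_pca)
log_density_pca = gmm_pca.score_samples(data_X_pca)  # densities in the reduced space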
Iterative Retraining¶
The previous model was trained using all observations, including potential anomalies. Since the goal is to learn the component distributions only with "normal" data, the result can be improved by retraining the model but excluding the n observations with the lowest probability (potential anomalies).
Anomaly detection is repeated but, this time, discarding the observations whose log probability density falls below its 0.01 quantile (the lowest 1% of observations).
# Remove observations with a probability density lower than the 0.01 quantile
# ==============================================================================
quantile = np.quantile(a=log_probability_predicted, q=0.01)
print('Quantile: ', quantile)
data_X_trimmed = data_X.loc[log_probability_predicted > quantile, :].copy()
data_y_trimmed = data_y[log_probability_predicted > quantile].copy()
Quantile: -15.46041004031156
# GMM model training
# ==============================================================================
gmm_model = GaussianMixture(
n_components = 6,
covariance_type = 'full',
random_state = 123,
)
gmm_model.fit(X=data_X_trimmed)
GaussianMixture(n_components=6, random_state=123)
# Confusion matrix of the final classification
# ==============================================================================
df_results = pd.DataFrame({
'log_probability': gmm_model.score_samples(X=data_X),
'anomaly' : data_y
})
df_results = df_results \
.sort_values('log_probability', ascending=True) \
.reset_index(drop=True)
df_results['classification'] = np.where(df_results.index < 176, 1, 0)  # flag the 176 lowest-density observations
pd.crosstab(
df_results.anomaly,
df_results.classification,
rownames=['actual value'],
colnames=['prediction']
)
| actual value \ prediction | 0 | 1 |
|---|---|---|
| 0.0 | 1556 | 99 |
| 1.0 | 98 | 78 |
After discarding the 1% of observations with the lowest density and retraining the model, 78 of the observations flagged as anomalies actually are. The rate of false positives drops from roughly 70% to about 56%.
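The trimming and retraining can also be repeated over several passes. A minimal sketch of such a loop is shown below; the number of passes (3) and the trimming quantile (0.01) are arbitrary choices.
# Sketch: repeating the trim-and-retrain procedure over several passes
# ==============================================================================
X_iter = data_X.copy()
gmm_iter = GaussianMixture(
    n_components    = 6,
    covariance_type = 'full',
    random_state    = 123
)
for _ in range(3):
    gmm_iter.fit(X_iter)
    log_density = gmm_iter.score_samples(X_iter)
    threshold   = np.quantile(log_density, 0.01)
    X_iter      = X_iter.loc[log_density > threshold, :]   # drop the lowest-density 1%
# Final scoring on the complete dataset
log_density_final = gmm_iter.score_samples(data_X)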
Example 2¶
The geyser dataset from the R MASS package contains information about the eruptions of the Old Faithful geyser located in Yellowstone National Park, Wyoming. Specifically, it records information about the duration of 299 eruptions, as well as the time elapsed since the previous one.
Data¶
# Reading data
# ==============================================================================
url = 'https://raw.githubusercontent.com/JoaquinAmatRodrigo/' \
+ 'Estadistica-machine-learning-python/master/data/geyser.csv'
data_X = pd.read_csv(url)
data_X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   waiting   299 non-null    int64
 1   duration  299 non-null    float64
dtypes: float64(1), int64(1)
memory usage: 4.8 KB
Since this is data with only two variables, its distribution can be visualized.
# Plot
# ==============================================================================
fig, ax = plt.subplots(figsize=(4, 4))
ax.scatter(data_X.waiting, data_X.duration)
ax.set_title('Geyser eruption distribution')
ax.set_ylabel('Duration')
ax.set_xlabel('Waiting');
Model¶
In problems with only two dimensions (variables), such as this one, the number of components can often be identified visually. Looking at the data, it seems reasonable to think that there are 3 groups.
# GMM model tuning based on BIC metric
# ==============================================================================
fig, ax = plt.subplots(figsize=(6, 3.84))
n_components = range(1, 6)
covariance_types = ['spherical', 'tied', 'diag', 'full']
for covariance_type in covariance_types:
bic_values = []
for i in n_components:
model = GaussianMixture(n_components=i, covariance_type=covariance_type)
model = model.fit(data_X)
bic_values.append(model.bic(data_X))
ax.plot(n_components, bic_values, label=covariance_type)
ax.set_title("BIC Values")
ax.set_xlabel("Number of components")
ax.legend();
Looking at the BIC metric, its value stabilizes from 4 components onwards. The model is trained with 4 components and a diagonal ('diag') covariance matrix.
# GMM model training
# ==============================================================================
gmm_model = GaussianMixture(
n_components = 4,
covariance_type = 'diag',
random_state = 123,
)
gmm_model.fit(X=data_X)
GaussianMixture(covariance_type='diag', n_components=4, random_state=123)
Probability Predictions¶
# Probability density prediction
# ==============================================================================
log_probability_predicted = gmm_model.score_samples(X=data_X)
# Probability density map
# ==============================================================================
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(5, 4))
# Grid of values within the observed range
x = np.linspace(min(data_X.waiting)*0.8, max(data_X.waiting)*1.2, 1000)
y = np.linspace(min(data_X.duration)*0.8, max(data_X.duration)*1.2, 1000)
xx, yy = np.meshgrid(x, y)
X = pd.DataFrame(
np.stack([xx.ravel(), yy.ravel()], axis=1),
columns=['waiting', 'duration']
)
# Probability density of each grid value
scores = gmm_model.score_samples(X)
scores = np.exp(scores) # Values are in log
ax.scatter(data_X.waiting, data_X.duration, alpha=0.6)
ax.contour(
xx, yy, scores.reshape(xx.shape), alpha=0.6,
levels=np.percentile(scores, np.linspace(0, 100, 10))[1:-1],
)
ax.set_title('Model probability density');
Anomaly Detection¶
The 10 observations with the lowest probability density according to the model are identified.
df_results = data_X.copy()
df_results['log_proba'] = log_probability_predicted
df_results = df_results.sort_values(by='log_proba')
top_10_anomalies = df_results.head(10)
top_10_anomalies
|    | waiting | duration | log_proba |
|----|---------|----------|-----------|
| 60 | 108 | 1.950000 | -9.554223 |
| 148 | 80 | 0.833333 | -9.175420 |
| 11 | 50 | 5.450000 | -8.091603 |
| 57 | 69 | 3.333333 | -8.054387 |
| 242 | 93 | 3.000000 | -7.892902 |
| 131 | 94 | 4.416667 | -7.157670 |
| 40 | 93 | 2.266667 | -6.952211 |
| 160 | 75 | 4.800000 | -6.914526 |
| 61 | 50 | 5.266667 | -6.740258 |
| 269 | 89 | 2.750000 | -6.629788 |
# Probability density map
# ==============================================================================
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(5, 4))
# Grid of values within the observed range
x = np.linspace(min(data_X.waiting)*0.8, max(data_X.waiting)*1.2, 1000)
y = np.linspace(min(data_X.duration)*0.8, max(data_X.duration)*1.2, 1000)
xx, yy = np.meshgrid(x, y)
X = pd.DataFrame(
np.stack([xx.ravel(), yy.ravel()], axis=1),
columns=['waiting', 'duration']
)
# Probability density of each grid value
scores = gmm_model.score_samples(X)
scores = np.exp(scores) # Values are in log
ax.scatter(data_X.waiting, data_X.duration, alpha=0.6)
ax.scatter(top_10_anomalies.waiting, top_10_anomalies.duration, c="red", label='anomaly')
ax.contour(
xx, yy, scores.reshape(xx.shape),
alpha=0.6, cmap='viridis',
levels=np.percentile(scores, np.linspace(0, 100, 10))[1:-1]
)
ax.set_title('Model probability density');
ax.legend();
Session Information¶
import session_info
session_info.show(html=False)
-----
mat4py              0.6.0
matplotlib          3.10.8
numpy               2.2.6
pandas              2.3.3
seaborn             0.13.2
session_info        v1.0.1
sklearn             1.7.2
-----
IPython             9.8.0
jupyter_client      8.7.0
jupyter_core        5.9.1
-----
Python 3.13.11 | packaged by Anaconda, Inc. | (main, Dec 10 2025, 21:28:48) [GCC 14.3.0]
Linux-6.14.0-37-generic-x86_64-with-glibc2.39
-----
Session information updated at 2026-01-14 18:44
Bibliography¶
- Aggarwal, Charu C. Outlier Analysis.
- Aggarwal, Charu C. and Sathe, Saket. Outlier Ensembles: An Introduction.
- Müller, Andreas C. and Guido, Sarah. Introduction to Machine Learning with Python: A Guide for Data Scientists.
- VanderPlas, Jake. Python Data Science Handbook.
Citation Instructions¶
How to cite this document?
If you use this document or any part of it, we would appreciate it if you cite it. Thank you very much!
Anomaly Detection with Gaussian Mixture Model (GMM) and Python by Joaquín Amat Rodrigo, available under an Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0 DEED) license at https://cienciadedatos.net/documentos/py23-anomaly-detection-gmm-python.html
Did you like the article? Your help is important
Your contribution will help me continue generating free educational content. Thank you very much! 😊
This document created by Joaquín Amat Rodrigo is licensed under Attribution-NonCommercial-ShareAlike 4.0 International.
Permissions:
- Share: copy and redistribute the material in any medium or format.
- Adapt: remix, transform, and build upon the material.
Under the following terms:
- Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- NonCommercial: You may not use the material for commercial purposes.
- ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
