Introduction¶
Anomaly detection (outliers) with Principal Component Analysis (PCA) is an unsupervised strategy to identify anomalies when the data is not labeled, that is, the true classification (anomaly - non-anomaly) of the observations is unknown.
While this strategy uses PCA, it does not directly use its result as a way to detect anomalies, but rather employs the reconstruction error produced when reversing the dimensionality reduction. The reconstruction error as a strategy to detect anomalies is based on the following idea: dimensionality reduction methods allow projecting observations into a space of lower dimension than the original space, while trying to preserve as much information as possible. The way they manage to minimize the global loss of information is by searching for a new space in which most observations can be well represented.
The PCA method creates a function that maps the position occupied by each observation in the original space with the one it occupies in the newly generated space. This mapping works in both directions, so it is also possible to go from the new space to the original space. Only those observations that have been well projected will be able to return to the position they occupied in the original space with high precision.
Since the search for this new space is guided by most of the observations, the observations closest to the average will be the ones that can be best projected and, consequently, best reconstructed. Anomalous observations, on the contrary, will be poorly projected and their reconstruction will be worse. It is this reconstruction error (squared) that can be used to identify anomalies.
✏️ Note
To learn more about how PCA works and its application in Python, see Principal Component Analysis with Python.
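As a minimal illustration of this idea (a sketch with synthetic data, not part of the analysis developed in the rest of the document; X_demo, pca_demo, and the other *_demo names are used only here), the following code projects two correlated variables onto a single principal component, reverses the projection, and computes the mean squared reconstruction error of each observation. The observation that does not follow the general pattern receives a much larger error.
# Illustrative sketch: reconstruction error as anomaly score (synthetic data)
# ==============================================================================
import numpy as np
from sklearn.decomposition import PCA
rng = np.random.default_rng(123)
x1 = rng.normal(size=200)
x2 = 2 * x1 + rng.normal(scale=0.3, size=200)
X_demo = np.column_stack([x1, x2])
X_demo = np.vstack([X_demo, [[4, -4]]])  # one observation far from the linear pattern
pca_demo = PCA(n_components=1)
proj_demo = pca_demo.fit_transform(X_demo)
recon_demo = pca_demo.inverse_transform(proj_demo)
error_demo = ((X_demo - recon_demo) ** 2).mean(axis=1)
print("Error of the atypical observation:", round(error_demo[-1], 3))
print("Average error of the rest        :", round(error_demo[:-1].mean(), 3))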
Libraries¶
The libraries used in this document are:
# Data processing
# ==============================================================================
import numpy as np
import pandas as pd
from mat4py import loadmat
# Graphics
# ==============================================================================
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
# Preprocessing and modeling
# ==============================================================================
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from tqdm.notebook import tqdm
# Warnings configuration
# ==============================================================================
import warnings
warnings.filterwarnings('ignore')
Data¶
The data used in this document were obtained from Outlier Detection DataSets (ODDS), a repository of datasets commonly used to compare the ability of different algorithms to identify anomalies (outliers). Shebuti Rayana (2016). ODDS Library [http://odds.cs.stonybrook.edu]. Stony Brook, NY: Stony Brook University, Department of Computer Science.
All these datasets are labeled: it is known whether each observation is an anomaly or not (variable y). Although the methods described in this document are unsupervised, that is, they do not use the response variable, knowing the true classification makes it possible to evaluate their ability to correctly identify anomalies.
Cardiotocography dataset:
- Number of observations: 1831
- Number of variables: 21
- Number of outliers: 176 (9.6%)
- y: 1 = outliers, 0 = inliers
- Notes: all variables are centered and scaled (mean 0, sd 1).
- Reference: C. C. Aggarwal and S. Sathe, "Theoretical foundations and algorithms for outlier ensembles." ACM SIGKDD Explorations Newsletter, vol. 17, no. 1, pp. 24–47, 2015. Saket Sathe and Charu C. Aggarwal. LODES: Local Density meets Spectral Outlier Detection. SIAM Conference on Data Mining, 2016.
Speech dataset:
- Number of observations: 3686
- Number of variables: 400
- Number of outliers: 61 (1.65%)
- y: 1 = outliers, 0 = inliers
- Reference: Learning Outlier Ensembles: The Best of Both Worlds – Supervised and Unsupervised. Barbora Micenkova, Brian McWilliams, and Ira Assent, KDD ODD2 Workshop, 2014.
Shuttle dataset:
- Number of observations: 49097
- Number of variables: 9
- Number of outliers: 3511 (7%)
- y: 1 = outliers, 0 = inliers
- Reference: Abe, Naoki, Bianca Zadrozny, and John Langford. "Outlier detection by active learning." Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2006.
The data is available in MATLAB format (.mat). The loadmat() function from the mat4py package is used to read its content.
# Data reading
# ==============================================================================
data = loadmat(filename='cardio.mat')
X_data = pd.DataFrame(data['X'])
X_data.columns = ["col_" + str(i) for i in X_data.columns]
y_data = pd.DataFrame(data['y'])
y_data = y_data.to_numpy().flatten()
PCA¶
Principal Component Analysis (PCA) is a statistical method that allows simplifying the complexity of sample spaces with multiple dimensions while preserving their information. Suppose there is a sample with $n$ individuals, each with $p$ variables ($X_1$, $X_2$, ..., $X_p$), that is, the space has $p$ dimensions. PCA allows finding a number of underlying factors, $(z < p)$ that explain approximately the same as the original $p$ variables. Where $p$ values were previously needed to characterize each individual, now $z$ values suffice. Each of these $z$ new variables is called a principal component.
Each principal component ($Z_i$) is obtained as a linear combination of the original variables. They can be understood as new variables obtained by combining the original variables in a certain way. The first principal component of a group of variables ($X_1$, $X_2$, ..., $X_p$) is the normalized linear combination of those variables that has the greatest variance. Once the first component ($Z_1$) is calculated, the second ($Z_2$) is calculated by repeating the same process, with the additional condition that it must be uncorrelated with the first component. The process is repeated iteratively until all possible components (min(n-1, p)) are calculated or the process is stopped.
The main limitation that PCA has as a dimensionality reduction method is that it only contemplates linear combinations of the original variables, which means it is not capable of capturing other types of relationships.
Model¶
The sklearn.decomposition.PCA class incorporates the main functionalities needed when working with PCA models. The n_components argument determines the number of components to compute. If None is indicated, all possible components are calculated (min(n_samples, n_features)).
By default, PCA() centers the values but does not scale them. This is important since, if the variables have different dispersion, as in this case, it is necessary to scale them. One way to do this is to combine a StandardScaler() and a PCA() within a pipeline.
# Training PCA model with data scaling
# ==============================================================================
pca_pipeline = make_pipeline(StandardScaler(), PCA(n_components=None))
pca_pipeline.fit(X=X_data)
# Extract the trained model from the pipeline for diagnostics
pca_model = pca_pipeline.named_steps['pca']
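As a quick numerical check of the interpretation given in the previous section (an optional sketch using the pca_pipeline and pca_model objects just created; scaler_model, z1_manual, and z1_pipeline are helper names introduced only for this check), the projection onto the first principal component is simply a linear combination of the centered and scaled variables, and its variance matches the variance explained by that component.
# Check: the first component is a linear combination of the scaled variables
# ==============================================================================
scaler_model = pca_pipeline.named_steps['standardscaler']
X_data_scaled = scaler_model.transform(X_data)
z1_manual = (X_data_scaled - pca_model.mean_) @ pca_model.components_[0]
z1_pipeline = pca_pipeline.transform(X_data)[:, 0]
print("Same projection             :", np.allclose(z1_manual, z1_pipeline))
print("Variance of first component :", z1_pipeline.var(ddof=1))
print("explained_variance_[0]      :", pca_model.explained_variance_[0])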
Diagnostics¶
After obtaining the PCA, its ability to reduce data dimensionality is evaluated by exploring the explained and cumulative variance of each component.
# Percentage of variance explained by each component
# ==============================================================================
print('-------------------------------------------------------')
print('Percentage of variance explained by each component')
print('-------------------------------------------------------')
print(pca_model.explained_variance_ratio_)
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(7, 3.5))
ax.bar(
x = np.arange(pca_model.n_components_) + 1,
height = pca_model.explained_variance_ratio_
)
for x, y in zip(np.arange(pca_model.n_components_) + 1, pca_model.explained_variance_ratio_):
label = round(y, 2)
ax.annotate(
label,
(x,y),
textcoords="offset points",
xytext=(0, 10),
ha='center',
fontsize=8
)
ax.set_xticks(np.arange(pca_model.n_components_) + 1)
ax.set_ylim(0, 1.1)
ax.set_title('Percentage of variance explained by each component')
ax.set_xlabel('Principal component')
ax.set_ylabel('% explained variance');
------------------------------------------------------- Percentage of variance explained by each component ------------------------------------------------------- [2.69831338e-01 1.75787255e-01 8.68938668e-02 6.88155848e-02 5.95308477e-02 4.97905270e-02 4.58906579e-02 4.42857769e-02 3.75980654e-02 3.19494743e-02 2.92016595e-02 2.56986810e-02 1.92184240e-02 1.74753805e-02 1.31181188e-02 8.86498538e-03 6.71579614e-03 5.51775331e-03 2.43704462e-03 1.37876279e-03 2.24565280e-17]
# Cumulative percentage of variance explained
# ==============================================================================
cumulative_variance_ratio = pca_model.explained_variance_ratio_.cumsum()
print('--------------------------------------------------')
print('Cumulative percentage of variance explained')
print('--------------------------------------------------')
print(cumulative_variance_ratio)
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(7, 4))
ax.plot(
np.arange(pca_model.n_components_) + 1,
cumulative_variance_ratio,
marker = 'o'
)
for x, y in zip(np.arange(pca_model.n_components_) + 1, cumulative_variance_ratio):
label = round(y, 2)
ax.annotate(
label,
(x,y),
textcoords="offset points",
xytext=(0, 10),
ha='center',
fontsize=8
)
ax.axvline(x=11, linestyle = '--')
ax.set_ylim(0, 1.1)
ax.set_xticks(np.arange(pca_model.n_components_) + 1)
ax.set_title('Cumulative percentage of variance explained')
ax.set_xlabel('Principal component')
ax.set_ylabel('Cumulative % variance');
-------------------------------------------------- Cumulative percentage of variance explained -------------------------------------------------- [0.26983134 0.44561859 0.53251246 0.60132804 0.66085889 0.71064942 0.75654008 0.80082585 0.83842392 0.87037339 0.89957505 0.92527373 0.94449216 0.96196754 0.97508566 0.98395064 0.99066644 0.99618419 0.99862124 1. 1. ]
With the first 11 components, approximately 90% of the variance observed in the data is explained.
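This figure can be checked directly from the cumulative_variance_ratio computed above (a small optional check; n_selected is just a helper name used here):
# Cumulative variance explained by the first 11 components
# ==============================================================================
n_selected = 11
print(f"Cumulative variance with the first {n_selected} components: "
      f"{cumulative_variance_ratio[n_selected - 1]:.4f}")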
Reconstruction¶
Once the PCA is obtained (matrix of components (eigenvectors), projections, and means), the reconstruction of the initial observations can be obtained using the following equation:
$$\text{reconstruction} = \text{projections} \cdot \text{components}^\top$$
It is important to keep in mind that, if the data has been centered and scaled (which, in general, should be done when applying PCA), this transformation must also be reversed to recover the values on the original scale.
The inverse_transform() method of the PCA class reverses the transformation and reconstructs the initial values. Keep in mind that the reconstruction is only exact if all components are used: inverse_transform() always reconstructs with every component kept when the PCA object was created.
# Reconstruction of projections
# ==============================================================================
projections = pca_pipeline.transform(X_data)
reconstruction = pca_pipeline.inverse_transform(X=projections)
reconstruction = pd.DataFrame(
data = reconstruction,
columns = X_data.columns,
index = X_data.index
)
print('----------------')
print('Original values')
print('----------------')
display(X_data.head(3))
print('')
print('---------------------')
print('Reconstructed values')
print('---------------------')
display(reconstruction.head(3))
---------------- Original values ----------------
| | col_0 | col_1 | col_2 | col_3 | col_4 | col_5 | col_6 | col_7 | col_8 | col_9 | ... | col_11 | col_12 | col_13 | col_14 | col_15 | col_16 | col_17 | col_18 | col_19 | col_20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.004912 | 0.693191 | -0.20364 | 0.595322 | 0.353190 | -0.061401 | -0.278295 | -1.650444 | 0.759072 | -0.420487 | ... | 1.485973 | -0.798376 | 1.854728 | 0.622631 | 0.963083 | 0.301464 | 0.193113 | 0.231498 | -0.289786 | -0.493294 |
| 1 | 0.110729 | -0.079903 | -0.20364 | 1.268942 | 0.396246 | -0.061401 | -0.278295 | -1.710270 | 0.759072 | -0.420487 | ... | 1.485973 | -0.798376 | 1.854728 | 0.278625 | 0.963083 | 0.301464 | 0.129265 | 0.093563 | -0.256385 | -0.493294 |
| 2 | 0.216546 | -0.272445 | -0.20364 | 1.050988 | 0.148753 | -0.061401 | -0.278295 | -1.710270 | 1.106509 | -0.420487 | ... | 1.141780 | -1.332931 | 0.314688 | 2.342663 | -0.488279 | 0.061002 | 0.065417 | 0.024596 | -0.256385 | 1.140018 |
3 rows × 21 columns
--------------------- Reconstructed values ---------------------
| | col_0 | col_1 | col_2 | col_3 | col_4 | col_5 | col_6 | col_7 | col_8 | col_9 | ... | col_11 | col_12 | col_13 | col_14 | col_15 | col_16 | col_17 | col_18 | col_19 | col_20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.004912 | 0.693191 | -0.20364 | 0.595322 | 0.353190 | -0.061401 | -0.278295 | -1.650444 | 0.759072 | -0.420487 | ... | 1.485973 | -0.798376 | 1.854728 | 0.622631 | 0.963083 | 0.301464 | 0.193113 | 0.231498 | -0.289786 | -0.493294 |
| 1 | 0.110729 | -0.079903 | -0.20364 | 1.268942 | 0.396246 | -0.061401 | -0.278295 | -1.710270 | 0.759072 | -0.420487 | ... | 1.485973 | -0.798376 | 1.854728 | 0.278625 | 0.963083 | 0.301464 | 0.129265 | 0.093563 | -0.256385 | -0.493294 |
| 2 | 0.216546 | -0.272445 | -0.20364 | 1.050988 | 0.148753 | -0.061401 | -0.278295 | -1.710270 | 1.106509 | -0.420487 | ... | 1.141780 | -1.332931 | 0.314688 | 2.342663 | -0.488279 | 0.061002 | 0.065417 | 0.024596 | -0.256385 | 1.140018 |
3 rows × 21 columns
It is verified that, using all components, the reconstruction is exact: the reconstructed values match the original data (any small differences are due to floating-point imprecision).
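This can be confirmed numerically (an optional check on the objects created above):
# Numerical check: reconstruction with all components recovers the original data
# ==============================================================================
print("Maximum absolute difference   :", np.abs(X_data - reconstruction).to_numpy().max())
print("Reconstruction equals original:", np.allclose(X_data, reconstruction))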
Reconstruction error¶
The mean squared reconstruction error of an observation is calculated as the average of the squared differences between the original values of its variables and their reconstructed values.
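In symbols, for an observation $i$ with variables $x_{i1}, \dots, x_{ip}$ and reconstructed values $\hat{x}_{i1}, \dots, \hat{x}_{ip}$, the error used throughout this document is:
$$\text{error}_i = \frac{1}{p} \sum_{j=1}^{p} \left( x_{ij} - \hat{x}_{ij} \right)^2$$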
In the previous section, it was shown that, using all components, the reconstruction is total, so the reconstruction error is zero. When the objective of reconstruction is to identify anomalous observations, only a subset of all components must be used, thus allowing differentiation between observations that are easy to reconstruct and those that are difficult.
To facilitate the process, a function is defined that, given a dataset and a selection of components, scales the data, trains a PCA model, and returns the reconstruction of the initial data together with its reconstruction error.
def pca_reconstruction(
X: pd.DataFrame,
X_new: pd.DataFrame | None = None,
components: int | list | np.ndarray | None = None,
verbose : int = 0
):
'''
Function to calculate the reconstruction and error of a dataset
using PCA
Parameters
----------
X: pd.DataFrame
PCA training data.
X_new: pd.DataFrame
Data on which to apply reconstruction. If None,
X is used.
components: int, list, np.ndarray, default=None
Components used for reconstruction.
- int: the first 'components' components are used.
- list or np.ndarray: the components indicated in the list are used
starting from 1.
- None (default): all components are used.
verbose: int, default = 0
If greater than 0, displays additional information on screen.
Returns
-------
reconstruction: pd.DataFrame
Data reconstruction.
reconstruction_error: np.ndarray
Reconstruction error of each observation.
'''
if X_new is None:
X_new = X
max_allowed_components = min(X.shape)
# Component selection, only as many as needed are calculated.
if isinstance(components, int):
if components <= 0 or components > max_allowed_components:
raise ValueError(f"components must be between 1 and {max_allowed_components}")
n_components = components
component_ids = np.arange(components)
elif isinstance(components, (list, np.ndarray, range)):
component_ids = np.asarray(components) - 1
if len(component_ids) == 0:
raise ValueError("The component list cannot be empty")
if np.any(component_ids < 0) or np.any(component_ids >= max_allowed_components):
raise ValueError(f"Component indices must be between 1 and {max_allowed_components}")
component_ids = np.unique(component_ids)
n_components = max(component_ids) + 1
else:
n_components = None
component_ids = None
# Training PCA model with data scaling
pca_pipeline = make_pipeline(StandardScaler(), PCA(n_components=n_components))
pca_pipeline.fit(X=X)
if component_ids is None:
component_ids = np.arange(pca_pipeline['pca'].n_components_)
# Project the data
projections = pca_pipeline.transform(X_new)
# Reconstruction
pca_components = pca_pipeline['pca'].components_
reconstruction = np.dot(
projections[:, component_ids],
pca_components[component_ids, :]
)
# Undo scaling
reconstruction = pca_pipeline['standardscaler'].inverse_transform(reconstruction)
reconstruction = pd.DataFrame(
reconstruction,
columns = X_new.columns,
index = X_new.index
)
# Mean squared reconstruction error
reconstruction_error = reconstruction - X_new
reconstruction_error = reconstruction_error**2
reconstruction_error = reconstruction_error.mean(axis=1)
if verbose > 0:
print(f"Reconstruction with components: {component_ids + 1}")
return reconstruction, reconstruction_error
# Reconstruction with the first 11 components (90% of explained variance)
# ==============================================================================
reconstruction, reconstruction_error = pca_reconstruction(X=X_data, components=11, verbose=1)
Reconstruction with components: [ 1 2 3 4 5 6 7 8 9 10 11]
# Distribution of reconstruction error
# ==============================================================================
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(6, 3.5))
sns.kdeplot(
reconstruction_error,
fill = True,
ax = ax
)
sns.rugplot(reconstruction_error, ax=ax, color='black')
ax.set_title('Distribution of reconstruction errors (PCA)')
ax.set_xlabel('Reconstruction error');
Anomaly detection¶
Once the reconstruction error has been calculated, it can be used as a criterion to identify anomalies. Assuming that dimensionality reduction has been performed so that most of the data (the normal ones) are well represented, those observations with the highest reconstruction error should be the most atypical.
In practice, if this detection strategy is being used, it is because there is no labeled data, that is, it is not known which observations are really anomalies. However, since in this example the true classification is available, it can be verified whether anomalous data really have higher reconstruction errors.
df_results = pd.DataFrame({
'reconstruction_error' : reconstruction_error,
'anomaly' : y_data
})
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(5.5, 3.5))
sns.boxplot(
x = 'anomaly',
y = 'reconstruction_error',
data = df_results,
palette = 'tab10',
ax = ax
)
ax.set_yscale("log")
ax.set_title('Distribution of reconstruction errors (PCA)')
ax.set_ylabel('log(Reconstruction error)')
ax.set_xlabel('classification (0 = normal, 1 = anomaly)');
The distribution of reconstruction errors in the anomaly group (1) is clearly higher. However, since there is overlap, if the n observations with the highest reconstruction error are classified as anomalies, false positives would occur.
According to the documentation, the Cardiotocography dataset contains 176 anomalies. The confusion matrix below shows the result of classifying the 176 observations with the highest reconstruction error as anomalies.
# Confusion matrix of the final classification
# ==============================================================================
df_results = (
df_results
.sort_values('reconstruction_error', ascending=False)
.reset_index(drop=True)
)
df_results['classification'] = np.where(df_results.index <= 176, 1, 0)
pd.crosstab(
df_results.anomaly,
df_results.classification
)
| anomaly | classification = 0 | classification = 1 |
|---|---|---|
| 0.0 | 1545 | 110 |
| 1.0 | 109 | 67 |
Of the observations flagged as anomalies, only 38% (67/176) are true anomalies; the remaining 62% are false positives, a very high proportion.
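The same percentage can be computed directly from df_results (an optional sketch; here precision refers to the fraction of flagged observations that are true anomalies):
# Precision: fraction of flagged observations that are true anomalies
# ==============================================================================
precision = df_results.loc[df_results['classification'] == 1, 'anomaly'].mean()
print(f"Precision: {precision:.2%}")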
Last components¶
In the previous section, the first 11 components out of a total of 21 were used, since with them approximately 90% of the observed variance is explained. Whether this selection of components is appropriate for anomaly detection depends largely on the directions in which the anomalies deviate. Often, anomalous data do not have atypical values in the dominant directions but do in the less dominant ones, so their atypical behavior is captured by the last components. Anomaly detection is repeated, this time discarding the first 3 principal components and reconstructing with components 4 to 21.
The inverse_transform() method of the PCA class does not allow selecting specific components with which to reconstruct the data. Fortunately, reconstruction can be easily done by multiplying the projections by the vectors that define each component.
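For example, using the pipeline fitted earlier and the projections obtained above, the reconstruction with components 4 to 21 could be computed manually as follows (an optional sketch, equivalent in spirit to what the pca_reconstruction function does internally; selected, rec_scaled, rec_subset, and rec_subset_error are helper names introduced only for this illustration):
# Manual reconstruction with components 4 to 21 (illustrative sketch)
# ==============================================================================
selected = np.arange(3, 21)  # 0-based indices of components 4 to 21
rec_scaled = projections[:, selected] @ pca_model.components_[selected, :]
rec_subset = pca_pipeline.named_steps['standardscaler'].inverse_transform(rec_scaled)
rec_subset_error = ((X_data.to_numpy() - rec_subset) ** 2).mean(axis=1)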
# Reconstruction using all except the first 3 components (components 4 to 21)
# ==============================================================================
reconstruction, reconstruction_error = pca_reconstruction(X=X_data, components = range(4, 22), verbose=1)
Reconstruction with components: [ 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21]
df_results = pd.DataFrame({
'reconstruction_error' : reconstruction_error,
'anomaly' : y_data
})
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(5.5, 3.5))
sns.boxplot(
x = 'anomaly',
y = 'reconstruction_error',
data = df_results,
palette = 'tab10',
ax = ax
)
ax.set_yscale("log")
ax.set_title('Distribution of reconstruction errors (PCA)')
ax.set_ylabel('log(Reconstruction error)')
ax.set_xlabel('classification (0 = normal, 1 = anomaly)');
# Confusion matrix of the final classification
# ==============================================================================
df_results = (
df_results
.sort_values('reconstruction_error', ascending=False)
.reset_index(drop=True)
)
df_results['classification'] = np.where(df_results.index <= 176, 1, 0)
pd.crosstab(
df_results.anomaly,
df_results.classification
)
| anomaly | classification = 0 | classification = 1 |
|---|---|---|
| 0.0 | 1601 | 54 |
| 1.0 | 53 | 123 |
After discarding the first 3 components, 123 of the 176 (70%) observations flagged as anomalies are actually anomalies; the proportion of false positives among the flagged observations drops to 30%.
Iterative retraining¶
The previous PCA was obtained using all observations, including potential anomalies. Since the objective is to generate a projection space only with "normal" data, the result can be improved by retraining the PCA but excluding the n observations with the highest reconstruction error (potential anomalies).
Anomaly detection is repeated but, this time, discarding observations with a reconstruction error greater than the 0.9 quantile.
# Remove observations with a reconstruction error greater than the
# 0.9 quantile
# ==============================================================================
quantile = np.quantile(a=reconstruction_error, q=0.9)
X_data_trimmed = X_data.loc[reconstruction_error < quantile, :].copy()
y_data_trimmed = y_data[reconstruction_error < quantile].copy()
# Reconstruction using components 4 to 20
# ==============================================================================
reconstruction2, reconstruction_error2 = pca_reconstruction(
X = X_data_trimmed,
X_new = X_data,
components = range(4, 21)
)
# Confusion matrix of the final classification
# ==============================================================================
df_results = pd.DataFrame({
'reconstruction_error' : reconstruction_error2,
'anomaly' : y_data
})
df_results = (
df_results
.sort_values('reconstruction_error', ascending=False)
.reset_index(drop=True)
)
df_results['classification'] = np.where(df_results.index <= 176, 1, 0)
pd.crosstab(
df_results.anomaly,
df_results.classification
)
| anomaly | classification = 0 | classification = 1 |
|---|---|---|
| 0.0 | 1605 | 50 |
| 1.0 | 49 | 127 |
After discarding the 10% of observations with the highest error and retraining the PCA, 127 of the 176 (72%) observations flagged as anomalies are actually anomalies; the proportion of false positives drops slightly, to 28%.
Bagging PCA¶
The bagging (bootstrap aggregating) method allows reducing the variance of statistical and machine learning models. In general terms, this method consists of training n models, each with a pseudo-sample obtained by resampling from the training data. The final prediction of an observation is the average of the predictions of all the generated models.
This strategy can be combined with PCA for anomaly detection: multiple PCA models are trained, each on a resample of the initial data, and with each of them the reconstruction error of the original observations is calculated. Finally, for each observation, the errors obtained from all the models are averaged.
Functions¶
def bagging_pca_reconstruction(
X: pd.DataFrame,
X_new: pd.DataFrame = None,
iterations: int = 500,
verbose: int = 0,
random_state: int = 61879782
):
'''
Function to calculate the reconstruction error combining the bagging
method with PCA. In each iteration, a PCA model is trained with
a resample of the data and reconstructed with a random set of
components. The component set can have between 1 and half of
the maximum components.
Parameters
----------
X: pd.DataFrame
PCA training data.
X_new: pd.DataFrame
Data on which to apply reconstruction. If None,
X is used.
iterations: int, default=500
Number of bagging iterations.
verbose: int, default=0
Level of detail in screen output. If greater than 0,
the components used in each iteration are shown.
random_state: int, default=61879782
Seed used for reproducibility.
Returns
-------
reconstruction_error: np.ndarray
Average reconstruction error of each observation.
'''
if X_new is None:
X_new = X
if iterations <= 0:
raise ValueError("iterations must be greater than 0")
error_matrix = np.full(shape=(X_new.shape[0], iterations), fill_value=np.nan)
rng = np.random.default_rng(random_state)
for i in tqdm(range(iterations)):
# Resampling
resample = X.sample(frac=0.632, axis=0, random_state=rng.integers(0, 2**31))
resample = resample.sample(frac=0.8, axis=1, random_state=rng.integers(0, 2**31))
# If due to resampling, any column is constant, it is removed
if any(resample.var() == 0):
constant_cols = resample.columns[resample.var() == 0]
resample = resample.drop(columns=constant_cols)
# Verify that enough columns remain
if resample.shape[1] == 0:
if verbose > 0:
print(f"Iteration {i}: all columns are constant, skipping...")
continue
# Random selection of components used in reconstruction
n_components = min(resample.shape)
component_ids = rng.choice(
range(n_components),
size = max(1, int(n_components/2)),
replace = False
)
# Convert to base-1 for pca_reconstruction
component_ids = component_ids + 1
if verbose > 0:
print(f"Components used iteration {i}: {np.sort(component_ids)}")
# Reconstruction error
_, reconstruction_error = pca_reconstruction(
X = resample,
X_new = X_new.loc[:, resample.columns],
components = component_ids,
verbose = 0
)
# Store iteration results
error_matrix[:, i] = reconstruction_error
# Standardize the errors of each iteration and average them across all iterations
error_matrix = StandardScaler().fit_transform(error_matrix)
average_error = np.nanmean(error_matrix, axis=1)
return average_error
Reconstruction error¶
# Calculate reconstruction error
# ==============================================================================
reconstruction_error = bagging_pca_reconstruction(X=X_data, iterations=500, verbose=0)
Anomaly detection¶
# Confusion matrix of the final classification
# ==============================================================================
df_results = pd.DataFrame({
'reconstruction_error' : reconstruction_error,
'anomaly' : y_data
})
df_results = (
df_results
.sort_values('reconstruction_error', ascending=False)
.reset_index(drop=True)
)
df_results['classification'] = np.where(df_results.index <= 176, 1, 0)
pd.crosstab(
df_results.anomaly,
df_results.classification
)
| anomaly | classification = 0 | classification = 1 |
|---|---|---|
| 0.0 | 1583 | 72 |
| 1.0 | 71 | 105 |
With the combination of bagging and PCA, 105 of the 176 observations flagged as anomalies are true anomalies (60%), a notably better result than the initial PCA with the first 11 components (38%), although in this dataset it does not surpass the reconstruction with the last components or the iterative retraining.
Generalized Low Rank Model (GLRM)¶
One of the limitations of the PCA method for use in anomaly detection is its high sensitivity to the presence of anomalous values (outliers). Since the objective is to learn a new space in which most observations (the normal ones) are well represented, it is convenient to minimize the influence of outliers present in the training set.
Generalized low rank models (GLRMs) are a generalization of PCA and matrix factorization that, among other features, allow using cost functions more robust than the squared error used by PCA, for example the absolute error or the Huber loss. They are therefore a potentially more suitable dimensionality reduction method than PCA for anomaly detection.
From Python, this type of model can be accessed thanks to the implementation available in the H2O library. For more information on how to use this excellent library, see Machine Learning with H2O and Python.
✏️ Note
A GLRM with absolute error as a cost function is equivalent to a Robust-PCA
# Libraries
# ==============================================================================
import h2o
from h2o.estimators import H2OGeneralizedLowRankEstimator
Model¶
# Cluster initialization and data transfer
# ==============================================================================
h2o.init()
h2o.remove_all()
h2o.no_progress()
X_data_h2o = h2o.H2OFrame(X_data)
Checking whether there is an H2O instance running at http://localhost:54321. connected.
| H2O_cluster_uptime: | 56 mins 20 secs |
| H2O_cluster_timezone: | Europe/Madrid |
| H2O_data_parsing_timezone: | UTC |
| H2O_cluster_version: | 3.46.0.9 |
| H2O_cluster_version_age: | 1 month and 9 days |
| H2O_cluster_name: | H2O_from_python_joaquin_kyc6yy |
| H2O_cluster_total_nodes: | 1 |
| H2O_cluster_free_memory: | 3.874 Gb |
| H2O_cluster_total_cores: | 8 |
| H2O_cluster_allowed_cores: | 8 |
| H2O_cluster_status: | locked, healthy |
| H2O_connection_url: | http://localhost:54321 |
| H2O_connection_proxy: | {"http": null, "https": null} |
| H2O_internal_security: | False |
| Python_version: | 3.13.11 final |
# Training GLRM model with the maximum number of possible components
# ==============================================================================
glrm_model = H2OGeneralizedLowRankEstimator(
k = min(X_data_h2o.shape),
loss = "absolute",
transform = 'standardize'
)
_ = glrm_model.train(training_frame=X_data_h2o)
# Cumulative percentage of variance
# ==============================================================================
cumulative_variance_ratio = glrm_model.varimp(use_pandas=True).iloc[2, 1:]
print('--------------------------------------------------')
print('Cumulative percentage of variance explained')
print('--------------------------------------------------')
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(7, 4))
ax.plot(
np.arange(len(cumulative_variance_ratio)) + 1,
cumulative_variance_ratio,
marker = 'o'
)
for x, y in zip(np.arange(len(cumulative_variance_ratio)) + 1, cumulative_variance_ratio):
label = round(y, 2)
ax.annotate(
label,
(x,y),
textcoords="offset points",
xytext=(0, 10),
ha='center',
fontsize=8
)
ax.set_ylim(0, 1.1)
ax.set_xticks(np.arange(len(cumulative_variance_ratio)) + 1)
ax.set_title('Cumulative percentage of variance explained')
ax.set_xlabel('Component')
ax.set_ylabel('Cumulative % variance');
-------------------------------------------------- Cumulative percentage of variance explained --------------------------------------------------
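# Training GLRM model with 14 components
# ==============================================================================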
glrm_model = H2OGeneralizedLowRankEstimator(
k = 14,
loss = "absolute",
transform = 'standardize'
)
_ = glrm_model.train(training_frame=X_data_h2o)
Projection and reconstruction¶
With the reconstruct() method of an H2OGeneralizedLowRankEstimator model, the reconstruction of each observation can be obtained directly. Internally, this method first obtains the projection of each observation and then reverses the process.
reconstruction = glrm_model.reconstruct(test_data=X_data_h2o, reverse_transform=True)
As before, the mean squared reconstruction error of each observation is calculated as the average of the squared differences between the original values of its variables and the reconstructed values.
reconstruction_error = reconstruction - X_data_h2o
reconstruction_error = reconstruction_error.as_data_frame()
reconstruction_error = reconstruction_error**2
reconstruction_error = reconstruction_error.mean(axis=1)
Anomaly detection¶
# Confusion matrix of the final classification
# ==============================================================================
df_results = pd.DataFrame({
'reconstruction_error' : reconstruction_error,
'anomaly' : y_data
})
df_results = (
df_results
.sort_values('reconstruction_error', ascending=False)
.reset_index(drop=True)
)
df_results['classification'] = np.where(df_results.index <= 176, 1, 0)
pd.crosstab(
df_results.anomaly,
df_results.classification
)
| anomaly | classification = 0 | classification = 1 |
|---|---|---|
| 0.0 | 1522 | 133 |
| 1.0 | 132 | 44 |
Contrary to what might be expected, in this case a GLRM with an absolute-error cost function does not achieve better results than PCA: only 44 of the 176 flagged observations are true anomalies.
Session information¶
import session_info
session_info.show(html=False)
----- h2o 3.46.0.9 mat4py 0.6.0 matplotlib 3.10.8 numpy 2.2.6 pandas 2.3.3 seaborn 0.13.2 session_info v1.0.1 sklearn 1.7.2 tqdm 4.67.1 ----- IPython 9.8.0 jupyter_client 8.7.0 jupyter_core 5.9.1 ----- Python 3.13.11 | packaged by Anaconda, Inc. | (main, Dec 10 2025, 21:28:48) [GCC 14.3.0] Linux-6.14.0-37-generic-x86_64-with-glibc2.39 ----- Session information updated at 2026-01-02 17:27
Bibliography¶
Outlier Analysis by Charu C. Aggarwal
Outlier Ensembles: An Introduction by Charu C. Aggarwal and Saket Sathe
Introduction to Machine Learning with Python: A Guide for Data Scientists by Andreas C. Müller and Sarah Guido
Python Data Science Handbook by Jake VanderPlas
Citation instructions¶
How to cite this document?
If you use this document or any part of it, we would appreciate you citing it. Thank you very much!
Anomaly Detection with PCA by Joaquín Amat Rodrigo, available under an Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0 DEED) license at https://cienciadedatos.net/documentos/py21-anomaly-detection-pca-python.html
This document created by Joaquín Amat Rodrigo is licensed under Attribution-NonCommercial-ShareAlike 4.0 International.
Permissions:
- Share: copy and redistribute the material in any medium or format.
- Adapt: remix, transform, and build upon the material.
Under the following terms:
- Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- NonCommercial: You may not use the material for commercial purposes.
- ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
