Anomaly Detection with PCA


Introduction

Anomaly detection (outliers) with Principal Component Analysis (PCA) is an unsupervised strategy to identify anomalies when the data is not labeled, that is, the true classification (anomaly - non-anomaly) of the observations is unknown.

While this strategy uses PCA, it does not directly use its result as a way to detect anomalies, but rather employs the reconstruction error produced when reversing the dimensionality reduction. The reconstruction error as a strategy to detect anomalies is based on the following idea: dimensionality reduction methods allow projecting observations into a space of lower dimension than the original space, while trying to preserve as much information as possible. The way they manage to minimize the global loss of information is by searching for a new space in which most observations can be well represented.

The PCA method creates a function that maps the position each observation occupies in the original space to the one it occupies in the newly generated space. This mapping works in both directions, so it is also possible to go from the new space back to the original space. Only those observations that have been well projected will be able to return to the position they occupied in the original space with high precision.

Since the search for this new space is guided by most of the observations, the observations closest to the average will be the ones that can be best projected and, consequently, best reconstructed. Anomalous observations, on the contrary, will be poorly projected and their reconstruction will be worse. It is this reconstruction error (squared) that can be used to identify anomalies.
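
As an illustration of the idea, the following minimal sketch (on synthetic data, not the dataset analyzed below) projects the observations onto a few components, reconstructs them, and flags the rows with the largest squared reconstruction error as candidate anomalies:

# Sketch of the reconstruction-error idea on synthetic data
# ==============================================================================
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_normal = rng.normal(size=(500, 5))
X_normal[:, 4] = X_normal[:, 0] + 0.1 * rng.normal(size=500)  # correlated structure
X_outlier = 6 * rng.normal(size=(5, 5))                       # a few atypical rows
X = np.vstack([X_normal, X_outlier])

pipe = make_pipeline(StandardScaler(), PCA(n_components=2))
projections    = pipe.fit_transform(X)
reconstruction = pipe.inverse_transform(projections)

# Mean squared reconstruction error per observation: large values -> candidate anomalies
error = ((X - reconstruction)**2).mean(axis=1)
print(np.argsort(error)[-5:])  # rows with the largest reconstruction error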

✏️ Note

To learn more about how PCA works and its application in Python, see Principal Component Analysis with Python.

Libraries

The libraries used in this document are:

# Data processing
# ==============================================================================
import numpy as np
import pandas as pd
from mat4py import loadmat

# Graphics
# ==============================================================================
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')

# Preprocessing and modeling
# ==============================================================================
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from tqdm.notebook import tqdm

# Warnings configuration
# ==============================================================================
import warnings
warnings.filterwarnings('ignore')

Data

The data used in this document were obtained from Outlier Detection DataSets (ODDS), a repository of datasets commonly used to compare the ability of different algorithms to identify anomalies (outliers). Shebuti Rayana (2016). ODDS Library [http://odds.cs.stonybrook.edu]. Stony Brook, NY: Stony Brook University, Department of Computer Science.

All these datasets are labeled: it is known whether each observation is an anomaly or not (variable y). Although the methods described in this document are unsupervised, that is, they do not use the response variable, knowing the true classification makes it possible to evaluate their ability to correctly identify anomalies.

  • Cardiotocography dataset:

    • Number of observations: 1831
    • Number of variables: 21
    • Number of outliers: 176 (9.6%)
    • y: 1 = outliers, 0 = inliers
    • Notes: all variables are centered and scaled (mean 0, sd 1).
    • Reference: C. C. Aggarwal and S. Sathe, "Theoretical foundations and algorithms for outlier ensembles." ACM SIGKDD Explorations Newsletter, vol. 17, no. 1, pp. 24–47, 2015. Saket Sathe and Charu C. Aggarwal. LODES: Local Density meets Spectral Outlier Detection. SIAM Conference on Data Mining, 2016.
  • Speech dataset:

    • Number of observations: 3686
    • Number of variables: 400
    • Number of outliers: 61 (1.65%)
    • y: 1 = outliers, 0 = inliers
    • Reference: Learning Outlier Ensembles: The Best of Both Worlds – Supervised and Unsupervised. Barbora Micenkova, Brian McWilliams, and Ira Assent, KDD ODD2 Workshop, 2014.
  • Shuttle dataset:

    • Number of observations: 49097
    • Number of variables: 9
    • Number of outliers: 3511 (7%)
    • y: 1 = outliers, 0 = inliers
    • Reference: Abe, Naoki, Bianca Zadrozny, and John Langford. "Outlier detection by active learning." Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2006.

The data is available in MATLAB format (.mat). The loadmat() function from the mat4py package is used to read its content.

# Data reading
# ==============================================================================
data = loadmat(filename='cardio.mat')
X_data = pd.DataFrame(data['X'])
X_data.columns = ["col_" + str(i) for i in X_data.columns]
y_data = pd.DataFrame(data['y'])
y_data = y_data.to_numpy().flatten()

PCA

Principal Component Analysis (PCA) is a statistical method that simplifies the complexity of sample spaces with many dimensions while preserving their information. Suppose there is a sample of $n$ individuals, each described by $p$ variables ($X_1$, $X_2$, ..., $X_p$), that is, the sample space has $p$ dimensions. PCA makes it possible to find a number of underlying factors ($z < p$) that explain approximately the same as the original $p$ variables. Where $p$ values were previously needed to characterize each individual, now $z$ values suffice. Each of these $z$ new variables is called a principal component.

Each principal component ($Z_i$) is obtained as a linear combination of the original variables; they can be understood as new variables built by combining the original ones in a particular way. The first principal component of a group of variables ($X_1$, $X_2$, ..., $X_p$) is the normalized linear combination of those variables with the greatest variance. Once the first component ($Z_1$) has been calculated, the second ($Z_2$) is obtained by repeating the same process with the added condition that its linear combination must be uncorrelated with the first component. The process is repeated until all possible components (min(n-1, p)) have been calculated or it is decided to stop.
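
In standard notation, the first component can be written as

$$Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \dots + \phi_{p1} X_p, \qquad \sum_{j=1}^{p} \phi_{j1}^{2} = 1$$

where the loadings $\phi_{j1}$ are chosen so that $\operatorname{Var}(Z_1)$ is maximal; each subsequent component maximizes the remaining variance subject to being uncorrelated with the previous ones.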

The main limitation that PCA has as a dimensionality reduction method is that it only contemplates linear combinations of the original variables, which means it is not capable of capturing other types of relationships.

Model

The sklearn.decomposition.PCA class incorporates the main functionalities needed when working with PCA models. The n_components argument determines the number of components calculated. If None is indicated, all possible components are calculated (min(n_samples, n_features)).

By default, PCA() centers the values but does not scale them. This is important since, if the variables have different dispersion, as in this case, it is necessary to scale them. One way to do this is to combine a StandardScaler() and a PCA() within a pipeline.

# Training PCA model with data scaling
# ==============================================================================
pca_pipeline = make_pipeline(StandardScaler(), PCA(n_components=None))
pca_pipeline.fit(X=X_data)

# Extract the trained model from the pipeline for diagnostics
pca_model = pca_pipeline.named_steps['pca']

Diagnostics

After obtaining the PCA, its ability to reduce data dimensionality is evaluated by exploring the explained and cumulative variance of each component.

# Percentage of variance explained by each component
# ==============================================================================
print('-------------------------------------------------------')
print('Percentage of variance explained by each component')
print('-------------------------------------------------------')
print(pca_model.explained_variance_ratio_)

fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(7, 3.5))
ax.bar(
    x      = np.arange(pca_model.n_components_) + 1,
    height = pca_model.explained_variance_ratio_
)

for x, y in zip(np.arange(pca_model.n_components_) + 1, pca_model.explained_variance_ratio_):
    label = round(y, 2)
    ax.annotate(
        label,
        (x,y),
        textcoords="offset points",
        xytext=(0, 10),
        ha='center',
        fontsize=8
    )

ax.set_xticks(np.arange(pca_model.n_components_) + 1)
ax.set_ylim(0, 1.1)
ax.set_title('Percentage of variance explained by each component')
ax.set_xlabel('Principal component')
ax.set_ylabel('% explained variance');
-------------------------------------------------------
Percentage of variance explained by each component
-------------------------------------------------------
[2.69831338e-01 1.75787255e-01 8.68938668e-02 6.88155848e-02
 5.95308477e-02 4.97905270e-02 4.58906579e-02 4.42857769e-02
 3.75980654e-02 3.19494743e-02 2.92016595e-02 2.56986810e-02
 1.92184240e-02 1.74753805e-02 1.31181188e-02 8.86498538e-03
 6.71579614e-03 5.51775331e-03 2.43704462e-03 1.37876279e-03
 2.24565280e-17]
# Cumulative percentage of variance explained
# ==============================================================================
cumulative_variance_ratio = pca_model.explained_variance_ratio_.cumsum()
print('--------------------------------------------------')
print('Cumulative percentage of variance explained')
print('--------------------------------------------------')
print(cumulative_variance_ratio)

fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(7, 4))
ax.plot(
    np.arange(pca_model.n_components_) + 1,
    cumulative_variance_ratio,
    marker = 'o'
)

for x, y in zip(np.arange(pca_model.n_components_) + 1, cumulative_variance_ratio):
    label = round(y, 2)
    ax.annotate(
        label,
        (x,y),
        textcoords="offset points",
        xytext=(0, 10),
        ha='center',
        fontsize=8
    )

ax.axvline(x=11, linestyle = '--')
ax.set_ylim(0, 1.1)
ax.set_xticks(np.arange(pca_model.n_components_) + 1)
ax.set_title('Cumulative percentage of variance explained')
ax.set_xlabel('Principal component')
ax.set_ylabel('Cumulative % variance');
--------------------------------------------------
Cumulative percentage of variance explained
--------------------------------------------------
[0.26983134 0.44561859 0.53251246 0.60132804 0.66085889 0.71064942
 0.75654008 0.80082585 0.83842392 0.87037339 0.89957505 0.92527373
 0.94449216 0.96196754 0.97508566 0.98395064 0.99066644 0.99618419
 0.99862124 1.         1.        ]

With the first 11 components, approximately 90% of the variance observed in the data is explained.

Reconstruction

Once the PCA is obtained (matrix of components (eigenvectors), projections, and means), the reconstruction of the initial observations can be obtained using the following equation:

$$\text{reconstruction} = \text{projections} \cdot \text{components}^\top$$

It is important to keep in mind that, if the data has been centered and scaled (which, in general, should be done when applying PCA), this transformation must also be reversed.

With the inverse_transform() method of the PCA class, the transformation can be reversed and the initial values reconstructed. Note that the reconstruction is complete only if all components are available: inverse_transform() always uses every component with which the PCA object was created.

# Reconstruction of projections
# ==============================================================================
projections = pca_pipeline.transform(X_data)
reconstruction = pca_pipeline.inverse_transform(X=projections)
reconstruction = pd.DataFrame(
                    data    = reconstruction,
                    columns = X_data.columns,
                    index   = X_data.index
)
print('----------------')
print('Original values')
print('----------------')
display(X_data.head(3))

print('')
print('---------------------')
print('Reconstructed values')
print('---------------------')
display(reconstruction.head(3))
----------------
Original values
----------------
col_0 col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9 ... col_11 col_12 col_13 col_14 col_15 col_16 col_17 col_18 col_19 col_20
0 0.004912 0.693191 -0.20364 0.595322 0.353190 -0.061401 -0.278295 -1.650444 0.759072 -0.420487 ... 1.485973 -0.798376 1.854728 0.622631 0.963083 0.301464 0.193113 0.231498 -0.289786 -0.493294
1 0.110729 -0.079903 -0.20364 1.268942 0.396246 -0.061401 -0.278295 -1.710270 0.759072 -0.420487 ... 1.485973 -0.798376 1.854728 0.278625 0.963083 0.301464 0.129265 0.093563 -0.256385 -0.493294
2 0.216546 -0.272445 -0.20364 1.050988 0.148753 -0.061401 -0.278295 -1.710270 1.106509 -0.420487 ... 1.141780 -1.332931 0.314688 2.342663 -0.488279 0.061002 0.065417 0.024596 -0.256385 1.140018

3 rows × 21 columns

---------------------
Reconstructed values
---------------------
col_0 col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9 ... col_11 col_12 col_13 col_14 col_15 col_16 col_17 col_18 col_19 col_20
0 0.004912 0.693191 -0.20364 0.595322 0.353190 -0.061401 -0.278295 -1.650444 0.759072 -0.420487 ... 1.485973 -0.798376 1.854728 0.622631 0.963083 0.301464 0.193113 0.231498 -0.289786 -0.493294
1 0.110729 -0.079903 -0.20364 1.268942 0.396246 -0.061401 -0.278295 -1.710270 0.759072 -0.420487 ... 1.485973 -0.798376 1.854728 0.278625 0.963083 0.301464 0.129265 0.093563 -0.256385 -0.493294
2 0.216546 -0.272445 -0.20364 1.050988 0.148753 -0.061401 -0.278295 -1.710270 1.106509 -0.420487 ... 1.141780 -1.332931 0.314688 2.342663 -0.488279 0.061002 0.065417 0.024596 -0.256385 1.140018

3 rows × 21 columns

Using all the components, the reconstruction is exact: the reconstructed values match the original data (the small differences are due to numerical precision).
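
The same result can be obtained manually, following the equation above and then undoing the centering and scaling. A minimal sketch, reusing the pca_pipeline, pca_model, projections, and reconstruction objects from the cells above (the name manual_reconstruction is only illustrative):

# Manual reconstruction: projections @ components^T (+ PCA mean), then undo scaling
# ==============================================================================
scaler = pca_pipeline.named_steps['standardscaler']

# pca_model.mean_ is ~0 here because StandardScaler already centered the data
manual_reconstruction = projections @ pca_model.components_ + pca_model.mean_
manual_reconstruction = scaler.inverse_transform(manual_reconstruction)

# Matches the result of inverse_transform() up to numerical precision
print(np.allclose(manual_reconstruction, reconstruction.to_numpy()))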

Reconstruction error

The mean squared reconstruction error of an observation is the average, over all its variables, of the squared differences between the original values and the reconstructed values.
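
For an observation $i$ described by $p$ variables, with original values $x_{ij}$ and reconstructed values $\hat{x}_{ij}$:

$$\text{reconstruction error}_i = \frac{1}{p} \sum_{j=1}^{p} \left( x_{ij} - \hat{x}_{ij} \right)^2$$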

In the previous section, it was shown that, using all components, the reconstruction is total, so the reconstruction error is zero. When the objective of reconstruction is to identify anomalous observations, only a subset of all components must be used, thus allowing differentiation between observations that are easy to reconstruct and those that are difficult.

To facilitate the process, a function is defined that, given a dataset and a number of components, scales the data, trains a PCA model, and returns the reconstruction of the data together with its reconstruction error.

def pca_reconstruction(
        X: pd.DataFrame,
        X_new: pd.DataFrame | None = None,
        components: int | list | np.ndarray | None = None,
        verbose : int = 0
    ):
    '''
    Function to calculate the reconstruction and error of a dataset
    using PCA
    
    Parameters
    ----------
    X: pd.DataFrame
        PCA training data.
    X_new: pd.DataFrame
        Data on which to apply reconstruction. If None,
        X is used.
    components: int, list, np.ndarray, default=None
        Components used for reconstruction.
         - int: the first 'components' components are used.
         - list or np.ndarray: the components indicated in the list are used
         starting from 1.
         - None (default): all components are used.
    verbose: int, default = 0
        If greater than 0, displays additional information on screen.
    
    Returns
    -------
    reconstruction: pd.DataFrame
        Data reconstruction.
    reconstruction_error: np.ndarray
        Reconstruction error of each observation.
    '''
    
    if X_new is None:
        X_new = X

    max_allowed_components = min(X.shape)

    # Component selection, only as many as needed are calculated.
    if isinstance(components, int):
        if components <= 0 or components > max_allowed_components:
            raise ValueError(f"components must be between 1 and {max_allowed_components}")
        n_components = components
        component_ids = np.arange(components)
    elif isinstance(components, (list, np.ndarray, range)):
        component_ids = np.asarray(components) - 1
        if len(component_ids) == 0:
            raise ValueError("The component list cannot be empty")
        if np.any(component_ids < 0) or np.any(component_ids >= max_allowed_components):
            raise ValueError(f"Component indices must be between 1 and {max_allowed_components}")
        component_ids = np.unique(component_ids)
        n_components = max(component_ids) + 1
    else:
        n_components  = None
        component_ids = None
    
    # Training PCA model with data scaling
    pca_pipeline = make_pipeline(StandardScaler(), PCA(n_components=n_components))
    pca_pipeline.fit(X=X)

    if component_ids is None:
        component_ids = np.arange(pca_pipeline['pca'].n_components_)
    
    # Project the data
    projections = pca_pipeline.transform(X_new)
    
    # Reconstruction
    pca_components = pca_pipeline['pca'].components_
    
    reconstruction = np.dot(
                        projections[:, component_ids],
                        pca_components[component_ids, :]
                     )
    
    # Undo scaling
    reconstruction = pca_pipeline['standardscaler'].inverse_transform(reconstruction)
    
    reconstruction = pd.DataFrame(
                        reconstruction,
                        columns = X_new.columns,
                        index   = X_new.index
                    )
    
    # Mean squared reconstruction error
    reconstruction_error = reconstruction - X_new
    reconstruction_error = reconstruction_error**2
    reconstruction_error = reconstruction_error.mean(axis=1)
    
    if verbose > 0:
        print(f"Reconstruction with components: {component_ids + 1}")
    
    return reconstruction, reconstruction_error
# Reconstruction with the first 11 components (90% of explained variance)
# ==============================================================================
reconstruction, reconstruction_error = pca_reconstruction(X=X_data, components=11, verbose=1)
Reconstruction with components: [ 1  2  3  4  5  6  7  8  9 10 11]
# Distribution of reconstruction error
# ==============================================================================
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(6, 3.5))
sns.kdeplot(
    reconstruction_error,
    fill    = True,
    ax      = ax
)
sns.rugplot(reconstruction_error,  ax=ax, color='black')
ax.set_title('Distribution of reconstruction errors (PCA)')
ax.set_xlabel('Reconstruction error');

Anomaly detection

Once the reconstruction error has been calculated, it can be used as a criterion to identify anomalies. Assuming that dimensionality reduction has been performed so that most of the data (the normal ones) are well represented, those observations with the highest reconstruction error should be the most atypical.

In practice, if this detection strategy is being used, it is because there is no labeled data, that is, it is not known which observations are really anomalies. However, since in this example the true classification is available, it can be verified whether anomalous data really have higher reconstruction errors.

df_results = pd.DataFrame({
                    'reconstruction_error' : reconstruction_error,
                    'anomaly'             : y_data
                })

fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(5.5, 3.5))
sns.boxplot(
    x       = 'anomaly',
    y       = 'reconstruction_error',
    data    = df_results,
    palette = 'tab10',
    ax      = ax
)
ax.set_yscale("log")
ax.set_title('Distribution of reconstruction errors (PCA)')
ax.set_ylabel('log(Reconstruction error)')
ax.set_xlabel('classification (0 = normal, 1 = anomaly)');

The distribution of reconstruction errors in the anomaly group (1) is clearly higher. However, since the two distributions overlap, classifying the n observations with the highest reconstruction error as anomalies will produce false positives.

According to the documentation, the Cardiotocography dataset contains 176 anomalies. The following confusion matrix shows the result of classifying as anomalies the 176 observations with the highest reconstruction error.

# Confusion matrix of the final classification
# ==============================================================================
df_results = (
    df_results
    .sort_values('reconstruction_error', ascending=False)
    .reset_index(drop=True)
)

df_results['classification'] = np.where(df_results.index <= 176, 1, 0)

pd.crosstab(
    df_results.anomaly,
    df_results.classification
)
classification 0 1
anomaly
0.0 1545 110
1.0 109 67

Of the 176 observations identified as anomalies, only 38% (67/176) actually are anomalies; the remaining 62% are false positives, a very high proportion.

Last components

In the previous section, the first 11 of the 21 components were used, since they explain approximately 90% of the observed variance. Whether this selection of components is appropriate for anomaly detection depends largely on the directions in which the anomalies deviate. Often, anomalous data do not have atypical values in the dominant directions but do have them in the less dominant ones, so their atypical behavior is captured by the last components. The detection is repeated, this time discarding the first 3 principal components from the reconstruction (components 4 to 21 are used).

The inverse_transform() method of the PCA class does not allow selecting specific components with which to reconstruct the data. Fortunately, reconstruction can be easily done by multiplying the projections by the vectors that define each component.
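
A minimal sketch of this idea, reusing the pca_pipeline and pca_model fitted earlier with all the components (the names keep, partial, and partial_error are only illustrative; the helper function defined above is what is actually used for the analysis):

# Partial reconstruction by slicing the component matrix directly
# ==============================================================================
keep = np.arange(3, 21)  # 0-based indices of components 4 to 21

projections = pca_pipeline.transform(X_data)
partial = projections[:, keep] @ pca_model.components_[keep, :]
partial = pca_pipeline.named_steps['standardscaler'].inverse_transform(partial)

# Mean squared reconstruction error per observation
partial_error = ((partial - X_data.to_numpy())**2).mean(axis=1)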

# Reconstruction using all except the first 3 components
# ==============================================================================
reconstruction, reconstruction_error = pca_reconstruction(X=X_data, components = range(4, 22), verbose=1)
Reconstruction with components: [ 4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21]
df_results = pd.DataFrame({
                    'reconstruction_error' : reconstruction_error,
                    'anomaly'             : y_data
                })

fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(5.5, 3.5))
sns.boxplot(
    x     = 'anomaly',
    y     = 'reconstruction_error',
    data  = df_results,
    palette = 'tab10',
    ax    = ax
)
ax.set_yscale("log")
ax.set_title('Distribution of reconstruction errors (PCA)')
ax.set_ylabel('log(Reconstruction error)')
ax.set_xlabel('classification (0 = normal, 1 = anomaly)');
# Confusion matrix of the final classification
# ==============================================================================
df_results = (
    df_results
    .sort_values('reconstruction_error', ascending=False) 
    .reset_index(drop=True)
)

df_results['classification'] = np.where(df_results.index <= 176, 1, 0)
pd.crosstab(
    df_results.anomaly,
    df_results.classification
)
classification 0 1
anomaly
0.0 1601 54
1.0 53 123

By discarding the first 3 components, 123 of the 176 (70%) observations identified as anomalies actually are anomalies; the proportion of false positives drops to 30%.

Iterative retraining

The previous PCA was obtained using all observations, including potential anomalies. Since the objective is to generate a projection space only with "normal" data, the result can be improved by retraining the PCA but excluding the n observations with the highest reconstruction error (potential anomalies).

Anomaly detection is repeated but, this time, discarding observations with a reconstruction error greater than the 0.9 quantile.

# Remove observations with a reconstruction error greater than the
# 0.9 quantile
# ==============================================================================
quantile = np.quantile(a=reconstruction_error,  q=0.9)
X_data_trimmed = X_data.loc[reconstruction_error < quantile, :].copy()
y_data_trimmed = y_data[reconstruction_error < quantile].copy()
# Reconstruction discarding the first 3 components (PCA retrained on the trimmed data)
# ==============================================================================
reconstruction2, reconstruction_error2 = pca_reconstruction(
                                            X     = X_data_trimmed,
                                            X_new = X_data,
                                            components = range(4, 21)
                                        )
# Confusion matrix of the final classification
# ==============================================================================
df_results = pd.DataFrame({
                    'reconstruction_error' : reconstruction_error2,
                    'anomaly'             : y_data
                })

df_results = (
    df_results
    .sort_values('reconstruction_error', ascending=False)
    .reset_index(drop=True)
)

df_results['classification'] = np.where(df_results.index <= 176, 1, 0)

pd.crosstab(
    df_results.anomaly,
    df_results.classification
)
classification 0 1
anomaly
0.0 1605 50
1.0 49 127

After discarding the 10% of observations with the highest error and retraining the PCA, 127 of the 176 (72%) observations identified as anomalies actually are anomalies; the proportion of false positives falls slightly, to 28%.

Bagging PCA

The bagging (bootstrap aggregating) method allows reducing the variance of statistical and machine learning models. In general terms, this method consists of training n models, each with a pseudo-sample obtained by resampling from the training data. The final prediction of an observation is the average of the predictions of all the generated models.

This strategy can be combined with PCA for anomaly detection: multiple PCA models are trained, each on a resample of the initial data, and the reconstruction error of the original observations is calculated with each of them. Finally, the reconstruction errors of each observation are averaged across models.

Functions

def bagging_pca_reconstruction(
        X: pd.DataFrame,
        X_new: pd.DataFrame = None,
        iterations: int = 500,
        verbose: int = 0,
        random_state: int = 61879782
    ):
    '''
    Function to calculate the reconstruction error combining the bagging
    method with PCA. In each iteration, a PCA model is trained with
    a resample of the data and reconstructed with a random set of
    components. The component set can have between 1 and half of
    the maximum components.
    
    Parameters
    ----------
    
    X: pd.DataFrame
        PCA training data.
    X_new: pd.DataFrame
        Data on which to apply reconstruction. If None,
        X is used.
    iterations: int, default=500
        Number of bagging iterations.
    verbose: int, default=0
        Level of detail in screen output. If greater than 0,
        the components used in each iteration are shown.
    random_state: int, default=61879782
        Seed used for reproducibility.
    
    Returns
    -------
    
    reconstruction_error: np.ndarray
        Average reconstruction error of each observation.
    '''

    if X_new is None:
        X_new = X

    if iterations <= 0:
        raise ValueError("iterations must be greater than 0")
        
    error_matrix = np.full(shape=(X_new.shape[0], iterations), fill_value=np.nan)
    rng = np.random.default_rng(random_state)
    
    for i in tqdm(range(iterations)):
        
        # Resampling
        resample = X.sample(frac=0.632, axis=0, random_state=rng.integers(0, 2**31))
        resample = resample.sample(frac=0.8, axis=1, random_state=rng.integers(0, 2**31))
        
        # If due to resampling, any column is constant, it is removed
        if any(resample.var() == 0):
            constant_cols = resample.columns[resample.var() == 0]
            resample = resample.drop(columns=constant_cols)

        # Verify that enough columns remain
        if resample.shape[1] == 0:
            if verbose > 0:
                print(f"Iteration {i}: all columns are constant, skipping...")
            continue
            
            
        # Random selection of components used in reconstruction
        n_components = min(resample.shape)
        component_ids = rng.choice(
                            range(n_components),
                            size    = max(1, int(n_components/2)),
                            replace = False
                         )
        # Convert to base-1 for pca_reconstruction
        component_ids = component_ids + 1
        
        if verbose > 0:
            print(f"Components used iteration {i}: {np.sort(component_ids)}")
            
        # Reconstruction error
        _, reconstruction_error = pca_reconstruction(
                                    X          = resample,
                                    X_new      = X_new.loc[:, resample.columns],
                                    components = component_ids,
                                    verbose    = 0
                                  )   
        
        # Store iteration results
        error_matrix[:, i] = reconstruction_error
        
    # Average value of all iterations (values are scaled beforehand)
    error_matrix = StandardScaler().fit_transform(error_matrix)
    average_error = np.nanmean(error_matrix, axis=1)

    return average_error

Reconstruction error

# Calculate reconstruction error
# ==============================================================================
reconstruction_error = bagging_pca_reconstruction(X=X_data, iterations=500, verbose=0)

Anomaly detection

# Confusion matrix of the final classification
# ==============================================================================
df_results = pd.DataFrame({
                    'reconstruction_error' : reconstruction_error,
                    'anomaly'             : y_data
                })

df_results = (
    df_results
    .sort_values('reconstruction_error', ascending=False)
    .reset_index(drop=True)
)
df_results['classification'] = np.where(df_results.index <= 176, 1, 0)

pd.crosstab(
    df_results.anomaly,
    df_results.classification
)
classification 0 1
anomaly
0.0 1583 72
1.0 71 105

The combination of bagging and PCA identifies 105 of the 176 true anomalies, a notable improvement over the initial PCA based on the first 11 components (67), although it does not surpass the reconstructions that discard the first components (123 and 127).

Generalized Low Rank Model (GLRM)

One of the limitations of the PCA method for use in anomaly detection is its high sensitivity to the presence of anomalous values (outliers). Since the objective is to learn a new space in which most observations (the normal ones) are well represented, it is convenient to minimize the influence of outliers present in the training set.

Generalized low rank models (GLRMs) are a generalization of PCA and matrix factorization that, among other features, allow using cost functions more robust than the squared error used by PCA, for example the absolute error or the Huber loss. They are therefore a potentially more suitable dimensionality reduction method than PCA for anomaly detection.
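
For a reconstruction residual $r = x - \hat{x}$, the cost functions mentioned are, schematically (the Huber threshold $\delta$ is a tuning parameter):

$$L_{\text{quadratic}}(r) = r^2, \qquad L_{\text{absolute}}(r) = |r|, \qquad L_{\text{Huber}}(r) = \begin{cases} \tfrac{1}{2} r^2 & \text{if } |r| \le \delta \\ \delta \left( |r| - \tfrac{1}{2} \delta \right) & \text{if } |r| > \delta \end{cases}$$

Both the absolute and Huber losses grow linearly for large residuals, so a few extreme observations have much less influence on the fitted low-rank representation than with the squared error.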

From Python, it is possible to access this type of models thanks to the implementation available in the H2O library. For more information on how to use this excellent library, see Machine Learning with H2O and Python.

✏️ Note

A GLRM with absolute error as its cost function is equivalent to a Robust PCA.

# Libraries
# ==============================================================================
import h2o
from h2o.estimators import H2OGeneralizedLowRankEstimator

Model

# Cluster initialization and data transfer
# ==============================================================================
h2o.init()
h2o.remove_all()
h2o.no_progress()

X_data_h2o = h2o.H2OFrame(X_data)
Checking whether there is an H2O instance running at http://localhost:54321. connected.
H2O_cluster_uptime: 56 mins 20 secs
H2O_cluster_timezone: Europe/Madrid
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.46.0.9
H2O_cluster_version_age: 1 month and 9 days
H2O_cluster_name: H2O_from_python_joaquin_kyc6yy
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 3.874 Gb
H2O_cluster_total_cores: 8
H2O_cluster_allowed_cores: 8
H2O_cluster_status: locked, healthy
H2O_connection_url: http://localhost:54321
H2O_connection_proxy: {"http": null, "https": null}
H2O_internal_security: False
Python_version: 3.13.11 final
# Training GLRM model with the maximum number of possible components
# ==============================================================================
glrm_model = H2OGeneralizedLowRankEstimator(
                k         = min(X_data_h2o.shape),
                loss      = "absolute",
                transform = 'standardize'
            )
_ = glrm_model.train(training_frame=X_data_h2o)
# Cumulative percentage of variance
# ==============================================================================
cumulative_variance_ratio = glrm_model.varimp(use_pandas=True).iloc[2, 1:]
print('--------------------------------------------------')
print('Cumulative percentage of variance explained')
print('--------------------------------------------------')
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(7, 4))
ax.plot(
    np.arange(len(cumulative_variance_ratio)) + 1,
    cumulative_variance_ratio,
    marker = 'o'
)

for x, y in zip(np.arange(len(cumulative_variance_ratio)) + 1, cumulative_variance_ratio):
    label = round(y, 2)
    ax.annotate(
        label,
        (x,y),
        textcoords="offset points",
        xytext=(0, 10),
        ha='center',
        fontsize=8
    )

ax.set_ylim(0, 1.1)
ax.set_xticks(np.arange(len(cumulative_variance_ratio)) + 1)
ax.set_title('Cumulative percentage of variance explained')
ax.set_xlabel('Component')
ax.set_ylabel('Cumulative % variance');
--------------------------------------------------
Cumulative percentage of variance explained
--------------------------------------------------
# Training GLRM model with 14 components
# ==============================================================================
glrm_model = H2OGeneralizedLowRankEstimator(
                k         = 14,
                loss      = "absolute",
                transform = 'standardize'
            )
_ = glrm_model.train(training_frame=X_data_h2o)

Projection and reconstruction

With the reconstruct() method of an H2OGeneralizedLowRankEstimator model, the reconstruction of each observation can be obtained directly. Internally, this method first obtains the projections of each observation and then reverses the process.

reconstruction = glrm_model.reconstruct(test_data=X_data_h2o, reverse_transform=True)

Once the reconstructions are obtained, the mean squared reconstruction error of each observation is calculated as the average, over all its variables, of the squared differences between the original and reconstructed values.

reconstruction_error = reconstruction - X_data_h2o
reconstruction_error = reconstruction_error.as_data_frame()
reconstruction_error = reconstruction_error**2
reconstruction_error = reconstruction_error.mean(axis=1)

Anomaly detection

# Confusion matrix of the final classification
# ==============================================================================
df_results = pd.DataFrame({
                    'reconstruction_error' : reconstruction_error,
                    'anomaly'             : y_data
                })

df_results = (
    df_results
    .sort_values('reconstruction_error', ascending=False)
    .reset_index(drop=True)
)

df_results['classification'] = np.where(df_results.index <= 176, 1, 0)

pd.crosstab(
    df_results.anomaly,
    df_results.classification
)
classification 0 1
anomaly
0.0 1522 133
1.0 132 44

Contrary to what might be expected, in this case, a GLRM with absolute cost function does not achieve better results than a PCA.

Session information

import session_info
session_info.show(html=False)
-----
h2o                 3.46.0.9
mat4py              0.6.0
matplotlib          3.10.8
numpy               2.2.6
pandas              2.3.3
seaborn             0.13.2
session_info        v1.0.1
sklearn             1.7.2
tqdm                4.67.1
-----
IPython             9.8.0
jupyter_client      8.7.0
jupyter_core        5.9.1
-----
Python 3.13.11 | packaged by Anaconda, Inc. | (main, Dec 10 2025, 21:28:48) [GCC 14.3.0]
Linux-6.14.0-37-generic-x86_64-with-glibc2.39
-----
Session information updated at 2026-01-02 17:27

Bibliography

Outlier Analysis by Charu C. Aggarwal

Outlier Ensembles: An Introduction by Charu C. Aggarwal and Saket Sathe

Introduction to Machine Learning with Python: A Guide for Data Scientists

Python Data Science Handbook by Jake VanderPlas

Citation instructions

How to cite this document?

If you use this document or any part of it, we would appreciate you citing it. Thank you very much!

Anomaly Detection with PCA by Joaquín Amat Rodrigo, available under an Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0 DEED) license at https://cienciadedatos.net/documentos/py21-anomaly-detection-pca-python.html


This document created by Joaquín Amat Rodrigo is licensed under Attribution-NonCommercial-ShareAlike 4.0 International.

Permissions:

  • Share: copy and redistribute the material in any medium or format.

  • Adapt: remix, transform, and build upon the material.

Under the following terms:

  • Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

  • NonCommercial: You may not use the material for commercial purposes.

  • ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.