More on data science at: cienciadedatos.net


Introduction

Isolation Forest is an unsupervised method for detecting anomalies (outliers) in unlabeled datasets, i.e., when it is not known a priori whether an observation is anomalous or normal.

The algorithm uses a set of trees, conceptually similar to Random Forest, but with a different objective: instead of learning predictive rules, it seeks to isolate individual observations. An Isolation Forest model is composed of multiple binary trees called isolation trees.

Each isolation tree is built by recursively separating observations through binary splits. Unlike classification or regression trees, the variables and cutoff points are selected completely randomly, without optimizing any impurity function. Observations with infrequent or extreme values are typically isolated after few splits, while normal observations require longer paths within the tree.

The path length (number of splits needed to isolate an observation) is, therefore, the key element of the algorithm.

Isolation Tree Algorithm


1) Create a root node containing the $N$ training observations.

2) Randomly select an attribute $i$ and a random value $a$ within the observed range of that attribute.

3) Create two new nodes by splitting the observations according to whether $x_i \leq a$ or $x_i > a$.

4) Repeat steps 2 and 3 until the node contains a single observation or a predefined maximum depth is reached (a minimal sketch of this procedure is shown below).
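The following is a minimal, illustrative sketch of this procedure in Python; the helper functions grow_itree() and path_length() are hypothetical names, not part of any library.

# Minimal isolation tree sketch (illustrative, not an optimized implementation)
# ==============================================================================
import numpy as np

def grow_itree(X, depth=0, max_depth=10, rng=None):
    # Recursively grow an isolation tree with random splits
    rng = np.random.default_rng() if rng is None else rng
    # Steps 1 and 4: stop when the node holds a single observation or the depth limit is reached
    if len(X) <= 1 or depth >= max_depth:
        return {'size': len(X)}
    # Step 2: random attribute and random cutoff within its observed range
    i = rng.integers(X.shape[1])
    a = rng.uniform(X[:, i].min(), X[:, i].max())
    # Step 3: binary split x_i <= a vs x_i > a
    mask = X[:, i] <= a
    return {
        'feature': i,
        'cutoff' : a,
        'left'   : grow_itree(X[mask],  depth + 1, max_depth, rng),
        'right'  : grow_itree(X[~mask], depth + 1, max_depth, rng),
    }

def path_length(x, node, depth=0):
    # Number of splits needed to isolate observation x in the tree
    if 'size' in node:
        return depth
    child = 'left' if x[node['feature']] <= node['cutoff'] else 'right'
    return path_length(x, node[child], depth + 1)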



Diagram of an isolation tree. In blue, the path needed to isolate a normal observation. In red, the path to isolate an anomaly.

Isolation Forest Algorithm

The Isolation Forest model is obtained by combining multiple isolation trees, each trained on a random subsample drawn from the original dataset.

For each observation, the model calculates the average path length needed to isolate it across all trees in the forest. The lower this average length, the higher the probability that the observation is an anomaly. In the literature, this magnitude is usually called path length and, informally, is sometimes referred to as "isolation distance", although it does not represent a distance in the mathematical sense.
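In the original paper, this average path length $E(h(x))$ is converted into an anomaly score bounded in $(0, 1]$ by normalizing it with $c(n)$, the expected path length for a sample of $n$ observations (its formula is given in the Practical Considerations section):

$$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}$$

Scores close to 1 point to anomalies, while scores well below 0.5 correspond to normal observations.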

Practical Considerations

Since it is an unsupervised method, there is no way to know the optimal threshold at which an observation should be considered an anomaly. The score assigned to each observation is a relative measure compared to the rest of the observations. In practice, observations whose score falls below a certain quantile are usually considered potential outliers. For example, if it is assumed that 1% of the data are anomalies, the 0.01 quantile of all calculated scores is used as the decision threshold.

When there are many observations, isolating all of them in terminal nodes requires trees with many branches, which translates into a very high computational cost. One way to alleviate this problem is to set a maximum depth up to which each tree can grow. For observations that have not been isolated in individual terminal nodes when this stopping criterion is reached, the path length is completed by adding $c(r)$, the theoretical average number of binary splits needed to isolate an observation among the $r$ observations remaining in the node:

$$c(r) = 2\left(\ln(r-1) + 0.5772\right) - \frac{2(r - 1)}{r}$$
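For reference, this adjustment can be computed with a small helper function (a sketch; c_factor is just an illustrative name, not a library function):

# Expected number of splits c(r) for a node with r observations
# ==============================================================================
import numpy as np

def c_factor(r):
    # Average path length of an unsuccessful search in a binary search tree,
    # using the Euler-Mascheroni constant (~0.5772156649)
    if r <= 1:
        return 0.0
    return 2 * (np.log(r - 1) + 0.5772156649) - 2 * (r - 1) / r

c_factor(256)   # ~10.24 expected splits for a node with 256 observations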

Isolation Forest in Python

Three of the main implementations of Isolation Forest are available in Scikit Learn, H2O, and PyOD. All of them are highly optimized, although there are relevant differences in how they are used:

  • Scikit Learn: when training the model, you must specify the percentage of anomalies expected in the training data (contamination). With this value, the model sets the score threshold below which an observation is classified as an anomaly. The predict() method returns -1 if an observation is an anomaly (outlier) or 1 if it is normal data (inlier). To retrieve the anomaly metric instead of the classification, you must use the score_samples() method. The latter returns the negative of the anomaly score proposed in the original paper, i.e., the normalization of the average path length with its sign flipped.

  • H2O: In H2O's implementation, the predict() method directly returns the isolation score. The user must explicitly define the decision threshold, usually through quantiles calculated on the training observations.

  • PyOD: includes an Isolation Forest implementation, based on Scikit Learn's, but with some additional functionalities (a brief usage sketch is shown after this list).
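As a brief orientation, a minimal PyOD sketch might look like the following (assuming the pyod package is installed; check its documentation for the exact defaults):

# PyOD Isolation Forest sketch (assumes pyod is installed)
# ==============================================================================
import numpy as np
from pyod.models.iforest import IForest

X = np.random.default_rng(123).normal(size=(500, 5))   # toy data

clf = IForest(n_estimators=200, contamination=0.01, random_state=123)
clf.fit(X)

clf.labels_            # 0 = inlier, 1 = outlier (based on contamination)
clf.decision_scores_   # anomaly score of the training data (higher = more abnormal)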

Libraries

The libraries used in this document are:

# Data handling
# ==============================================================================
import numpy as np
import pandas as pd
from mat4py import loadmat
from sklearn.datasets import make_blobs

# Graphics
# ==============================================================================
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
plt.rcParams['lines.linewidth'] = 1.5
plt.rcParams['font.size'] = 8
import seaborn as sns

# Preprocessing and modeling
# ==============================================================================
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

import h2o
from h2o.estimators import H2OExtendedIsolationForestEstimator
from h2o.estimators import H2OGeneralizedLowRankEstimator
from h2o.transforms.preprocessing import H2OScaler

# Warnings configuration
# ==============================================================================
import warnings
warnings.filterwarnings('once')

Isolation Forest

Data

The data used were obtained from Outlier Detection DataSets (ODDS), a repository with data commonly used to compare the ability of different algorithms to identify anomalies (outliers). Shebuti Rayana (2016). ODDS Library. Stony Brook, NY: Stony Brook University, Department of Computer Science.

All these datasets are labeled: it is known whether each observation is an anomaly or not (variable y). Although the methods described in this document are unsupervised, i.e., they do not use the response variable, knowing the true classification allows evaluating their ability to correctly identify anomalies.

  • Cardiotocography dataset:
    • Number of observations: 1831
    • Number of variables: 21
    • Number of outliers: 176 (9.6%)
    • y: 1 = outliers, 0 = inliers
    • Observations: all variables are centered and scaled (mean 0, sd 1).
    • Reference: C. C. Aggarwal and S. Sathe, "Theoretical foundations and algorithms for outlier ensembles." ACM SIGKDD Explorations Newsletter, vol. 17, no. 1, pp. 24–47, 2015. Saket Sathe and Charu C. Aggarwal. LODES: Local Density meets Spectral Outlier Detection. SIAM Conference on Data Mining, 2016.

The data are available in MATLAB format (.mat). To read them, the loadmat() function from the mat4py package is used.

# Data reading
# ==============================================================================
datos = loadmat(filename='cardio.mat')
datos_X = pd.DataFrame(datos['X'])
datos_X.columns = ["col_" + str(i) for i in datos_X.columns]
datos_y = pd.DataFrame(datos['y'])
datos_y = datos_y.to_numpy().flatten()

Model

The sklearn.ensemble.IsolationForest class incorporates the main functionalities needed when working with Isolation Forest models. The main arguments for training this type of model are:

  • n_estimators: number of trees that form the model.

  • max_samples: number of observations used to train each tree.

  • contamination: expected proportion of anomalies in the training data. Based on this value, the threshold is established according to which observations are classified as normal or anomalous.

  • random_state: seed to guarantee reproducibility of results.

We proceed to train a model assuming that 1% of the observations in the training set are anomalous.

# Definition and training of IsolationForest model
# ==============================================================================
modelo_isof = IsolationForest(
                n_estimators  = 1000,
                max_samples   ='auto',
                contamination = 0.01,
                n_jobs        = -1,
                random_state  = 123,
            )

modelo_isof.fit(X=datos_X)
IsolationForest(contamination=0.01, n_estimators=1000, n_jobs=-1,
                random_state=123)

Prediction

IsolationForest has two prediction methods that return different information. With the predict() method, the classification of anomaly (-1) or non-anomaly (1) is directly returned according to the contamination proportion indicated in the model definition.

# Classification prediction
# ==============================================================================
clasificacion_predicha = modelo_isof.predict(X=datos_X)
clasificacion_predicha
array([1, 1, 1, ..., 1, 1, 1], shape=(1831,))

With the score_samples() method, instead of the classification, the anomaly value predicted by the model is obtained. It is important to note that this value is not the average isolation distance, but a normalization of it proposed in the original paper.

As a result of the normalization, and multiplying it by -1, the anomaly values are bounded in the range [-1, 0]. The closer to -1 the value, the greater the evidence of anomaly. Values between -0.5 and 0 are expected for normal observations.

# Anomaly score prediction
# ==============================================================================
score_anomalia = modelo_isof.score_samples(X=datos_X)
score_anomalia
array([-0.41908062, -0.42765109, -0.45293005, ..., -0.47186108,
       -0.46324072, -0.53377418], shape=(1831,))

What is the relationship between predict() and score_samples()?

During model training, it was indicated that the proportion of anomalies (contamination) was 1%. This information is used to identify the anomaly score below which only 1% of the observations fall, i.e., the 0.01 quantile.

cuantil_01 = np.quantile(score_anomalia, q=0.01)
cuantil_01
np.float64(-0.5864598467056028)

This is the value that is automatically stored in the .offset_ attribute based on the indicated contamination proportion.

modelo_isof.offset_
np.float64(-0.5864598467056028)
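Equivalently, scikit-learn's decision_function() is score_samples() shifted by this offset, so observations with a negative decision_function value are exactly those that predict() labels as anomalies:

# Relationship between decision_function, score_samples and offset_
# ==============================================================================
np.allclose(
    modelo_isof.decision_function(X=datos_X),
    modelo_isof.score_samples(X=datos_X) - modelo_isof.offset_
)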
# Distribution of anomaly values
# ==============================================================================
fig, ax = plt.subplots(figsize=(6, 3))
sns.kdeplot(
    data = score_anomalia,
    fill = True,
    ax   = ax
)
sns.rugplot(score_anomalia, ax=ax, color='black')
ax.axvline(cuantil_01, c='red', linestyle='--', label='quantile 0.01')
ax.set_title('Distribution of anomaly values')
ax.set_xlabel('Anomaly score')
ax.legend(loc='upper left');

If the quantile value is used to classify observations, the results obtained are equivalent to those returned by predict().

all(clasificacion_predicha == np.where(score_anomalia < cuantil_01, -1, 1))
True

Anomaly Detection

Once the isolation distance has been calculated, it can be used as a criterion to identify anomalies. Assuming that observations with outlier values in any of their variables are more easily separated from the rest, those observations with the lowest average distance should be the most atypical. Due to the normalization performed in the Scikit Learn implementation, this translates to: the more negative the predicted score, the greater the evidence of anomaly.

In practice, if this detection strategy is being used, it is because no labeled data is available, i.e., it is not known which observations are truly anomalies. However, since in this example we have the true classification, we can verify whether anomalous data indeed receive lower scores.

# Distribution of anomaly values
# ==============================================================================
df_resultados = pd.DataFrame({
                    'score'    : score_anomalia,
                    'anomalia' : datos_y
                })

fig, ax = plt.subplots(figsize=(4, 3))
sns.boxplot(
    x     = 'anomalia',
    y     = 'score',
    hue   = 'anomalia',
    data  = df_resultados,
    ax    = ax
)
ax.set_title('Anomaly score by true classification')
ax.set_ylabel('Anomaly score')
ax.set_xlabel('classification (0 = normal, 1 = anomaly)')
ax.get_legend().remove();

The distribution of anomaly values (score) in the anomaly group is clearly lower (more negative). However, since there is overlap, if the n observations with the lowest score are classified as anomalies, false positive errors would be incurred.

According to the documentation, the Cardiotocography dataset contains 176 anomalies. The following confusion matrix shows the result of classifying as anomalies the 176 observations with the lowest predicted score.

# Confusion matrix of final classification
# ==============================================================================
df_resultados = (
    df_resultados
    .sort_values('score', ascending=True)
    .reset_index(drop=True)
)
df_resultados['clasificacion'] = np.where(df_resultados.index <= 176, 1, 0)
pd.crosstab(
    df_resultados['anomalia'],
    df_resultados['clasificacion'],
    rownames=['true value'],
    colnames=['prediction']
)
prediction      0    1
true value
0.0          1571   84
1.0            83   93

Of the 176 observations classified as anomalies, only 53% (93/176) truly are. The proportion of false positives (47%) is high; the Isolation Forest method does not achieve notable results on this dataset.

Combination of PCA and Isolation Forest

The Isolation Forest algorithm, like many others, is negatively affected as the dimensionality of the dataset increases, i.e., as the number of variables grows. This phenomenon is known as the curse of dimensionality. The underlying reason is that, in a high-dimensional space, observations are so far apart from each other that the difference between normal and anomalous fades away and every observation starts to look like an anomaly.

Added to this is the fact that, in practice, many of the available variables only add noise and are not useful for discriminating between observations.

One way to try to mitigate the problem is to transform the data using a dimensionality reduction method before training the model, for example, using PCA.

Model

Using sklearn's Pipelines, it is easy to combine several models into a single object.

# Pipeline of PCA and IsolationForest
# ==============================================================================
modelo_PCA = PCA(n_components=0.9)
modelo_isof = IsolationForest(
                n_estimators  = 1000,
                max_samples   = 100,
                contamination = 0.01,
                n_jobs        = -1,
                random_state  = 123,
            )
pipeline_pca_isof = make_pipeline(StandardScaler(), modelo_PCA, modelo_isof)
pipeline_pca_isof.fit(X=datos_X)
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('pca', PCA(n_components=0.9)),
                ('isolationforest',
                 IsolationForest(contamination=0.01, max_samples=100,
                                 n_estimators=1000, n_jobs=-1,
                                 random_state=123))])

Prediction

# Score prediction
# ==============================================================================
score_anomalia = pipeline_pca_isof.score_samples(X=datos_X)
score_anomalia
array([-0.41306992, -0.42741431, -0.47682121, ..., -0.4604572 ,
       -0.44811813, -0.50387696], shape=(1831,))

Anomaly Detection

# Confusion matrix of final classification
# ==============================================================================
df_resultados = pd.DataFrame({'score': score_anomalia, 'anomalia' : datos_y})
df_resultados = (
    df_resultados
    .sort_values('score', ascending=True)
    .reset_index(drop=True)
)
df_resultados['clasificacion'] = np.where(df_resultados.index <= 176, 1, 0)
pd.crosstab(
    df_resultados.anomalia,
    df_resultados.clasificacion,
    rownames=['true value'],
    colnames=['prediction']
)
prediction      0    1
true value
0.0          1569   86
1.0            85   91

In this dataset, the combination of PCA and Isolation Forest does not improve the results obtained.

Extended Isolation Forest

The Extended Isolation Forest algorithm is an improved version of its predecessor, Isolation Forest. The original algorithm introduced an innovative way to detect anomalies, but has a bias due to the way trees decide how to split the data.
The extended version corrects this bias, making the original algorithm just a special case of the generalized version.

The bias problem arises because, in the original algorithm, data splitting is done following a pattern similar to binary search trees (BST). At each branching point, the following is chosen:

  • A specific feature.
  • A cutoff value.

This causes the splits to be aligned with the axes and generates bias.

The extended version introduces a random slope at each split. Instead of directly selecting the feature and value, the following is chosen:

  • A random slope $n$ (a vector in the feature space), which defines the inclination of the cutting hyperplane.
  • A random intercept $p$, which determines where that hyperplane passes through the data space.

Each component of the slope $n$ is generated independently from a Gaussian distribution $N(0,1)$, and the intercept $p$ is obtained from a uniform distribution within the limits of the data to be split.

With this approach, a point $x$ is assigned to one side of the split following the rule:

$$ (x - p) \cdot n \leq 0 $$
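A minimal NumPy sketch of a single such split on toy data (illustrative only, not the H2O implementation):

# One random-hyperplane split as used by Extended Isolation Forest (sketch)
# ==============================================================================
import numpy as np

rng   = np.random.default_rng(123)
X_toy = rng.normal(size=(10, 3))                        # toy data with 3 features

n = rng.normal(size=X_toy.shape[1])                     # random slope, one N(0, 1) value per feature
p = rng.uniform(X_toy.min(axis=0), X_toy.max(axis=0))   # random intercept within the data range

mask_left = (X_toy - p) @ n <= 0                        # rule (x - p) · n <= 0
X_left, X_right = X_toy[mask_left], X_toy[~mask_left]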

Thanks to this:

  • Splits are no longer limited to being parallel to the axes.
  • The algorithm detects anomalies more accurately and with less bias.
  • The simplicity and efficiency of the original Isolation Forest is maintained.


Partitions generated by the isolation forest algorithm.


Partitions generated by the extended isolation forest algorithm.
# H2O cluster initialization and data transfer
# ==============================================================================
h2o.init(verbose=False)
h2o.remove_all()
h2o.no_progress()
datos_X_h2o = h2o.H2OFrame(datos_X)
# Definition and training of Extended IsolationForest model
# ==============================================================================
modelo_isof = H2OExtendedIsolationForestEstimator(
                ntrees          = 500,
                sample_size     = 100,
                extension_level = datos_X_h2o.shape[1] - 1
              )

_ = modelo_isof.train(x=datos_X_h2o.columns, training_frame=datos_X_h2o)
# Classification prediction
# ==============================================================================
score_anomalia = modelo_isof.predict(datos_X_h2o)
score_anomalia = score_anomalia.as_data_frame()['anomaly_score']
# Distribution of anomaly values
# ==============================================================================
fig, ax = plt.subplots(figsize=(6, 3))
sns.kdeplot(
    x     = score_anomalia,
    fill  = True,
    ax    = ax
)
sns.rugplot(score_anomalia,  ax=ax, color='black')
ax.set_title('Distribution of anomaly scores')
ax.set_xlabel('Anomaly score');
# Confusion matrix of final classification
# ==============================================================================
df_resultados = pd.DataFrame({ 'score': score_anomalia, 'anomalia' : datos_y})

df_resultados = (
    df_resultados
    .sort_values('score', ascending=False)
    .reset_index(drop=True)
)
df_resultados['clasificacion'] = np.where(df_resultados.index <= 176, 1, 0)
pd.crosstab(
    df_resultados['anomalia'],
    df_resultados['clasificacion'],
    rownames=['true value'],
    colnames=['prediction']
)
prediction      0    1
true value
0.0          1573   82
1.0            81   95

Anomaly detection with Extended Isolation Forest slightly improves the results obtained with the original version of the Isolation Forest algorithm.

PCA and Extended Isolation Forest

# Dimensionality reduction with PCA in H2O followed by Extended Isolation Forest
# ==============================================================================
pipeline_pca = make_pipeline(StandardScaler(), PCA(n_components=0.95))
pipeline_pca.fit(datos_X)
proyecciones = pipeline_pca.transform(datos_X)
proyecciones = h2o.H2OFrame(proyecciones)
# Extended Isolation Forest model definition and training
# ==============================================================================
modelo_isof = H2OExtendedIsolationForestEstimator(
                ntrees          = 1000,
                sample_size     = 100,
                extension_level = proyecciones.shape[1] - 1
              )
_ = modelo_isof.train(x=proyecciones.columns, training_frame=proyecciones)
# Anomaly score prediction
# ==============================================================================
score_anomalia = modelo_isof.predict(proyecciones)
score_anomalia = score_anomalia.as_data_frame()['anomaly_score']
# Confusion matrix of final classification
# ==============================================================================
df_resultados = pd.DataFrame({'score': score_anomalia, 'anomalia': datos_y})
df_resultados = df_resultados.sort_values('score', ascending=False).reset_index(drop=True)
df_resultados['clasificacion'] = np.where(df_resultados.index <= 176, 1, 0)
pd.crosstab(
    df_resultados['anomalia'],
    df_resultados['clasificacion'],
    rownames=['true value'],
    colnames=['prediction']
)
prediction      0    1
true value
0.0          1573   82
1.0            81   95

In this case, combining PCA with Extended Isolation Forest yields results equivalent to those obtained with Extended Isolation Forest alone.

GLRM and Extended Isolation Forest

GLRM (Generalized Low Rank Models) is a dimensionality reduction technique that generalizes classical methods such as PCA, SVD, or NMF. It uses a low-rank factor matrix to approximate the original data, allowing it to capture complex patterns and handle missing data. GLRM is versatile, applicable to various data types, and useful in tasks such as compression, imputation, and anomaly detection.

We proceed to train a GLRM model to reduce the dimensionality of the data before applying Extended Isolation Forest.

# GLRM model
# ==============================================================================
glrm_model = H2OGeneralizedLowRankEstimator(
                k=min(datos_X_h2o.shape),
                loss="absolute",
                transform="standardize"
            )
_ = glrm_model.train(training_frame=datos_X_h2o)

By plotting the cumulative variance as a function of the number of components, it is observed that with 13 components around 88% of the total variance is already explained.

# Component selection based on cumulative variance
# ==============================================================================
varianza_acumulada = (
    glrm_model.varimp(use_pandas=True)
    .set_index("")
    .transpose()
)
varianza_acumulada
      Standard deviation  Proportion of Variance  Cumulative Proportion
pc1             2.324221            2.572383e-01               0.257238
pc2             1.900043            1.719126e-01               0.429151
pc3             1.321502            8.316032e-02               0.512311
pc4             1.174894            6.573215e-02               0.578043
pc5             1.074044            5.493189e-02               0.632975
pc6             1.003130            4.791758e-02               0.680893
pc7             0.952741            4.322454e-02               0.724117
pc8             0.899020            3.848746e-02               0.762605
pc9             0.812392            3.142768e-02               0.794033
pc10            0.747440            2.660318e-02               0.820636
pc11            0.733461            2.561740e-02               0.846253
pc12            0.682764            2.219841e-02               0.868452
pc13            0.515233            1.264118e-02               0.881093
pc14            0.448287            9.569589e-03               0.890662
pc15            0.413305            8.134349e-03               0.898797
pc16            0.365480            6.360729e-03               0.905157
pc17            0.288387            3.960341e-03               0.909118
pc18            0.169435            1.367057e-03               0.910485
pc19            0.033592            5.373524e-05               0.910539
pc20            0.004301            8.808969e-07               0.910539
pc21            0.000486            1.125578e-08               0.910539

We proceed to transform the original data using the first 13 components obtained with GLRM.

# GLRM model with k=13
# ==============================================================================
glrm_model = H2OGeneralizedLowRankEstimator(
                k=13,
                loss="absolute",
                transform="standardize"
            )

_ = glrm_model.train(training_frame=datos_X_h2o)
proyecciones = glrm_model.predict(datos_X_h2o)
# Extended Isolation Forest model
# ==============================================================================
modelo_isof = H2OExtendedIsolationForestEstimator(
                ntrees          = 500,
                sample_size     = 100,
                extension_level = proyecciones.shape[1] - 1
              )

_ = modelo_isof.train(x=proyecciones.columns, training_frame=proyecciones)
# Prediction
# ==============================================================================
score_anomalia = modelo_isof.predict(proyecciones)
score_anomalia = score_anomalia.as_data_frame()['anomaly_score']

# Confusion matrix of final classification
# ==============================================================================
df_resultados = pd.DataFrame({'score': score_anomalia,'anomalia': datos_y})
df_resultados = (
    df_resultados
    .sort_values('score', ascending=False)
    .reset_index(drop=True)
)
df_resultados['clasificacion'] = np.where(df_resultados.index <= 176, 1, 0)
pd.crosstab(
    df_resultados['anomalia'],
    df_resultados['clasificacion'],
    rownames=['true value'],
    colnames=['prediction']
)
prediction      0    1
true value
0.0          1591   64
1.0            63  113

Among all the tested combinations, the one that obtains the best results is GLRM combined with Extended Isolation Forest. Of the 176 observations identified as anomalies, 64% (113/176) are truly anomalous.

Session Information

import session_info
session_info.show(html=False)
-----
h2o                 3.46.0.9
mat4py              0.6.0
matplotlib          3.10.8
numpy               2.2.6
pandas              2.3.3
seaborn             0.13.2
session_info        v1.0.1
sklearn             1.7.2
-----
IPython             9.8.0
jupyter_client      8.7.0
jupyter_core        5.9.1
-----
Python 3.13.11 | packaged by Anaconda, Inc. | (main, Dec 10 2025, 21:28:48) [GCC 14.3.0]
Linux-6.14.0-37-generic-x86_64-with-glibc2.39
-----
Session information updated at 2026-01-21 22:34

Bibliography

Aggarwal, Charu C. Outlier Analysis.

Aggarwal, Charu C. and Sathe, Saket. Outlier Ensembles: An Introduction.

https://www.h2o.ai/blog/anomaly-detection-with-isolation-forests-using-h2o/

Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. "Isolation Forest." Eighth IEEE International Conference on Data Mining (ICDM), 2008.

Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. "Isolation-Based Anomaly Detection." ACM Transactions on Knowledge Discovery from Data (TKDD) 6(1), Article 3, March 2012. https://doi.org/10.1145/2133360.2133363

Citation Instructions

How to cite this document?

If you use this document or any part of it, we appreciate you citing it. Thank you very much!

Anomaly detection with Isolation Forest and Python by Joaquín Amat Rodrigo, available under an Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0 DEED) license at https://www.cienciadedatos.net/documentos/py22-anomaly-detection-isolation-forest-python.html

Did you like the article? Your help is important

Your contribution will help me continue generating free educational content. Thank you very much! 😊

Become a GitHub Sponsor

Creative Commons Licence

This document created by Joaquín Amat Rodrigo is licensed under Attribution-NonCommercial-ShareAlike 4.0 International.

It is allowed:

  • Share: copy and redistribute the material in any medium or format.

  • Adapt: remix, transform, and build upon the material.

Under the following terms:

  • Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

  • NonCommercial: You may not use the material for commercial purposes.

  • ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.