More on data science at: cienciadedatos.net


Introduction

Isolation Forest is an unsupervised method for detecting anomalies (outliers) in unlabeled datasets, i.e., when it is not known a priori whether an observation is anomalous or normal.

The algorithm uses a set of trees, conceptually similar to Random Forest, but with a different objective: instead of learning predictive rules, it seeks to isolate individual observations. An Isolation Forest model is composed of multiple binary trees called isolation trees.

Each isolation tree is built by recursively separating observations through binary splits. Unlike classification or regression trees, the variables and cutoff points are selected completely randomly, without optimizing any impurity function. Observations with infrequent or extreme values are typically isolated after few splits, while normal observations require longer paths within the tree.

The path length (number of splits needed to isolate an observation) is, therefore, the key element of the algorithm.

Isolation Tree Algorithm


1) Create a root node containing the $N$ training observations.

2) Randomly select an attribute $i$ and a random value $a$ within the observed range of that attribute.

3) Create two new nodes by splitting the observations according to whether $x_i \leq a$ or $x_i > a$.

4) Repeat steps 2 and 3 until the node contains a single observation or a predefined maximum depth is reached (a minimal sketch of this procedure is shown below).
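The following is a minimal, illustrative sketch of this procedure in Python; the helper functions grow_itree() and path_length() are hypothetical names, not part of any library.

# Minimal isolation tree sketch (illustrative, not an optimized implementation)
# ==============================================================================
import numpy as np

def grow_itree(X, depth=0, max_depth=10, rng=None):
    # Recursively grow an isolation tree with random splits
    rng = np.random.default_rng() if rng is None else rng
    # Steps 1 and 4: stop when the node holds a single observation or the depth limit is reached
    if len(X) <= 1 or depth >= max_depth:
        return {'size': len(X)}
    # Step 2: random attribute and random cutoff within its observed range
    i = rng.integers(X.shape[1])
    a = rng.uniform(X[:, i].min(), X[:, i].max())
    # Step 3: binary split x_i <= a vs x_i > a
    mask = X[:, i] <= a
    return {
        'feature': i,
        'cutoff' : a,
        'left'   : grow_itree(X[mask],  depth + 1, max_depth, rng),
        'right'  : grow_itree(X[~mask], depth + 1, max_depth, rng),
    }

def path_length(x, node, depth=0):
    # Number of splits needed to isolate observation x in the tree
    if 'size' in node:
        return depth
    child = 'left' if x[node['feature']] <= node['cutoff'] else 'right'
    return path_length(x, node[child], depth + 1)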



Diagram of an isolation tree. In blue, the path needed to isolate a normal observation. In red, the path to isolate an anomaly.

Isolation Forest Algorithm

The Isolation Forest model is obtained by combining multiple isolation trees, each trained on a random subsample drawn from the original dataset.

For each observation, the model calculates the average path length needed to isolate it across all trees in the forest. The lower this average length, the higher the probability that the observation is an anomaly. In the literature, this magnitude is usually called path length and, informally, is sometimes referred to as "isolation distance", although it does not represent a distance in the mathematical sense.
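In the original paper, this average path length $E(h(x))$ is converted into an anomaly score bounded in $(0, 1]$ by normalizing it with $c(n)$, the expected path length for a sample of $n$ observations (its formula is given in the Practical Considerations section):

$$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}$$

Scores close to 1 point to anomalies, while scores well below 0.5 correspond to normal observations.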

Practical Considerations

Since it is an unsupervised method, there is no way to know the optimal threshold at which an observation should be considered an anomaly. The score assigned to each observation is a relative measure compared to the rest of the observations. In practice, observations whose score falls below a certain quantile are usually considered potential outliers. For example, if it is assumed that 1% of the data are anomalies, the 0.01 quantile of all calculated scores is used as the decision threshold.

When there are many observations, isolating all of them in terminal nodes requires trees with many branches, which translates into a very high computational cost. One way to alleviate this problem is to set a maximum depth up to which each tree can grow. For observations that have not been isolated in individual terminal nodes when this stopping criterion is reached, the path length is completed by adding $c(r)$, the theoretical average number of binary splits needed to isolate an observation among the $r$ observations remaining in the node:

$$c(r) = 2\left(\ln(r-1) + 0.5772\right) - \frac{2(r - 1)}{r}$$
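For reference, this adjustment can be computed with a small helper function (a sketch; c_factor is just an illustrative name, not a library function):

# Expected number of splits c(r) for a node with r observations
# ==============================================================================
import numpy as np

def c_factor(r):
    # Average path length of an unsuccessful search in a binary search tree,
    # using the Euler-Mascheroni constant (~0.5772156649)
    if r <= 1:
        return 0.0
    return 2 * (np.log(r - 1) + 0.5772156649) - 2 * (r - 1) / r

c_factor(256)   # ~10.24 expected splits for a node with 256 observations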

Isolation Forest in Python

Three of the main implementations of Isolation Forest are available in Scikit Learn, H2O, and PyOD. All of them are highly optimized, although there are relevant differences in how they are used:

  • Scikit Learn: when training the model, you must specify the percentage of anomalies expected in the training data (contamination). With this value, the model sets the score threshold below which an observation is classified as an anomaly. The predict() method returns -1 if an observation is an anomaly (outlier) or 1 if it is normal data (inlier). To retrieve the anomaly metric instead of the classification, you must use the score_samples() method. The latter returns the negative of the anomaly score proposed in the original paper, i.e., the normalization of the average path length with its sign flipped.

  • H2O: In H2O's implementation, the predict() method directly returns the isolation score. The user must explicitly define the decision threshold, usually through quantiles calculated on the training observations.

  • PyOD: includes an Isolation Forest implementation, based on Scikit Learn's, but with some additional functionalities (a brief usage sketch is shown after this list).
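As a brief orientation, a minimal PyOD sketch might look like the following (assuming the pyod package is installed; check its documentation for the exact defaults):

# PyOD Isolation Forest sketch (assumes pyod is installed)
# ==============================================================================
import numpy as np
from pyod.models.iforest import IForest

X = np.random.default_rng(123).normal(size=(500, 5))   # toy data

clf = IForest(n_estimators=200, contamination=0.01, random_state=123)
clf.fit(X)

clf.labels_            # 0 = inlier, 1 = outlier (based on contamination)
clf.decision_scores_   # anomaly score of the training data (higher = more abnormal)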

Libraries

The libraries used in this document are:

# Data handling
# ==============================================================================
import numpy as np
import pandas as pd
from mat4py import loadmat
from sklearn.datasets import make_blobs

# Graphics
# ==============================================================================
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
plt.rcParams['lines.linewidth'] = 1.5
plt.rcParams['font.size'] = 8
import seaborn as sns

# Preprocessing and modeling
# ==============================================================================
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

import h2o
from h2o.estimators import H2OExtendedIsolationForestEstimator
from h2o.estimators import H2OGeneralizedLowRankEstimator
from h2o.transforms.preprocessing import H2OScaler

# Warnings configuration
# ==============================================================================
import warnings
warnings.filterwarnings('once')

Isolation Forest

Data

The data used were obtained from Outlier Detection DataSets (ODDS), a repository with data commonly used to compare the ability of different algorithms to identify anomalies (outliers). Shebuti Rayana (2016). ODDS Library. Stony Brook, NY: Stony Brook University, Department of Computer Science.

All these datasets are labeled: it is known whether each observation is an anomaly or not (variable y). Although the methods described in this document are unsupervised, i.e., they do not use the response variable, knowing the true classification allows evaluating their ability to correctly identify anomalies.

  • Cardiotocography dataset:
    • Number of observations: 1831
    • Number of variables: 21
    • Number of outliers: 176 (9.6%)
    • y: 1 = outliers, 0 = inliers
    • Observations: all variables are centered and scaled (mean 0, sd 1).
    • Reference: C. C. Aggarwal and S. Sathe, "Theoretical foundations and algorithms for outlier ensembles." ACM SIGKDD Explorations Newsletter, vol. 17, no. 1, pp. 24–47, 2015. Saket Sathe and Charu C. Aggarwal. LODES: Local Density meets Spectral Outlier Detection. SIAM Conference on Data Mining, 2016.

The data are available in MATLAB format (.mat). To read them, the loadmat() function from the mat4py package is used.

# Data reading
# ==============================================================================
datos = loadmat(filename='cardio.mat')
datos_X = pd.DataFrame(datos['X'])
datos_X.columns = ["col_" + str(i) for i in datos_X.columns]
datos_y = pd.DataFrame(datos['y'])
datos_y = datos_y.to_numpy().flatten()

Model

The sklearn.ensemble.IsolationForest class incorporates the main functionalities needed when working with Isolation Forest models. The main arguments for training this type of model are:

  • n_estimators: number of trees that form the model.

  • max_samples: number of observations used to train each tree.

  • contamination: expected proportion of anomalies in the training data. Based on this value, the threshold is established according to which observations are classified as normal or anomalous.

  • random_state: seed to guarantee reproducibility of results.

We proceed to train a model assuming that 1% of the observations in the training set are anomalous.

# Definition and training of IsolationForest model
# ==============================================================================
modelo_isof = IsolationForest(
                n_estimators  = 1000,
                max_samples   ='auto',
                contamination = 0.01,
                n_jobs        = -1,
                random_state  = 123,
            )

modelo_isof.fit(X=datos_X)
IsolationForest(contamination=0.01, n_estimators=1000, n_jobs=-1,
                random_state=123)

Prediction

IsolationForest has two prediction methods that return different information. With the predict() method, the classification of anomaly (-1) or non-anomaly (1) is directly returned according to the contamination proportion indicated in the model definition.

# Classification prediction
# ==============================================================================
clasificacion_predicha = modelo_isof.predict(X=datos_X)
clasificacion_predicha
array([1, 1, 1, ..., 1, 1, 1], shape=(1831,))

With the score_samples() method, instead of the classification, the anomaly value predicted by the model is obtained. It is important to note that this value is not the average isolation distance, but a normalization of it proposed in the original paper.

As a result of the normalization, and multiplying it by -1, the anomaly values are bounded in the range [-1, 0]. The closer to -1 the value, the greater the evidence of anomaly. Values between -0.5 and 0 are expected for normal observations.

# Anomaly score prediction
# ==============================================================================
score_anomalia = modelo_isof.score_samples(X=datos_X)
score_anomalia
array([-0.41908062, -0.42765109, -0.45293005, ..., -0.47186108,
       -0.46324072, -0.53377418], shape=(1831,))

What is the relationship between predict() and score_samples()?

During model training, it was indicated that the proportion of anomalies (contamination) was 1%. This information is used to identify the anomaly score below which only 1% of the observations fall, i.e., the 0.01 quantile.

cuantil_01 = np.quantile(score_anomalia, q=0.01)
cuantil_01
np.float64(-0.5864598467056028)

This is the value that is automatically stored in the .offset_ attribute based on the indicated contamination proportion.

modelo_isof.offset_
np.float64(-0.5864598467056028)
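Equivalently, scikit-learn's decision_function() is score_samples() shifted by this offset, so observations with a negative decision_function value are exactly those that predict() labels as anomalies:

# Relationship between decision_function, score_samples and offset_
# ==============================================================================
np.allclose(
    modelo_isof.decision_function(X=datos_X),
    modelo_isof.score_samples(X=datos_X) - modelo_isof.offset_
)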
# Distribution of anomaly values
# ==============================================================================
fig, ax = plt.subplots(figsize=(6, 3))
sns.kdeplot(
    data = score_anomalia,
    fill = True,
    ax   = ax
)
sns.rugplot(score_anomalia, ax=ax, color='black')
ax.axvline(cuantil_01, c='red', linestyle='--', label='quantile 0.01')
ax.set_title('Distribution of anomaly values')
ax.set_xlabel('Anomaly score')
ax.legend(loc='upper left');

If the quantile value is used to classify observations, the results obtained are equivalent to those returned by predict().

all(clasificacion_predicha == np.where(score_anomalia < cuantil_01, -1, 1))
True

Anomaly Detection

Once the isolation distance has been calculated, it can be used as a criterion to identify anomalies. Assuming that observations with outlier values in any of their variables are more easily separated from the rest, those observations with the lowest average distance should be the most atypical. Due to the normalization performed in the Scikit Learn implementation, this translates to: the more negative the predicted score, the greater the evidence of anomaly.

In practice, if this detection strategy is being used, it is because no labeled data is available, i.e., it is not known which observations are truly anomalies. However, since in this example we have the true classification, we can verify whether anomalous data indeed receive lower scores.

# Distribution of anomaly values
# ==============================================================================
df_resultados = pd.DataFrame({
                    'score'    : score_anomalia,
                    'anomalia' : datos_y
                })

fig, ax = plt.subplots(figsize=(4, 3))
sns.boxplot(
    x     = 'anomalia',
    y     = 'score',
    hue   = 'anomalia',
    data  = df_resultados,
    ax    = ax
)
ax.set_title('Anomaly score by true classification')
ax.set_ylabel('Anomaly score')
ax.set_xlabel('classification (0 = normal, 1 = anomaly)')
ax.get_legend().remove();

The distribution of anomaly values (score) in the anomaly group is clearly lower (more negative). However, since there is overlap, if the n observations with the lowest score are classified as anomalies, false positive errors would be incurred.

According to the documentation, the Cardiotocography dataset contains 176 anomalies. The following confusion matrix shows the result of classifying as anomalies the 176 observations with the lowest predicted score.

# Confusion matrix of final classification
# ==============================================================================
df_resultados = (
    df_resultados
    .sort_values('score', ascending=True)
    .reset_index(drop=True)
)
df_resultados['clasificacion'] = np.where(df_resultados.index <= 176, 1, 0)
pd.crosstab(
    df_resultados['anomalia'],
    df_resultados['clasificacion'],
    rownames=['true value'],
    colnames=['prediction']
)
prediction      0    1
true value
0.0          1571   84
1.0            83   93

Of the 176 observations classified as anomalies, only 53% (93/176) truly are. The proportion of false positives (47%) is high; the Isolation Forest method does not achieve notable results on this dataset.

Combination of PCA and Isolation Forest

The Isolation Forest algorithm, like many others, is negatively affected as the dimensionality of the dataset increases, i.e., as the number of variables grows. This phenomenon is known as the curse of dimensionality. The underlying reason is that, in a high-dimensional space, observations are so far apart from each other that the difference between normal and anomalous fades away and every observation starts to look like an anomaly.

Added to this is the fact that, in practice, many of the available variables only add noise and are not useful for discriminating between observations.

One way to try to mitigate the problem is to transform the data using a dimensionality reduction method before training the model, for example, using PCA.

Model

Using sklearn's Pipelines, it is easy to combine several models into a single object.

# Pipeline of PCA and IsolationForest
# ==============================================================================
modelo_PCA = PCA(n_components=0.9)
modelo_isof = IsolationForest(
                n_estimators  = 1000,
                max_samples   = 100,
                contamination = 0.01,
                n_jobs        = -1,
                random_state  = 123,
            )
pipeline_pca_isof = make_pipeline(StandardScaler(), modelo_PCA, modelo_isof)
pipeline_pca_isof.fit(X=datos_X)
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('pca', PCA(n_components=0.9)),
                ('isolationforest',
                 IsolationForest(contamination=0.01, max_samples=100,
                                 n_estimators=1000, n_jobs=-1,
                                 random_state=123))])

Prediction

# Score prediction
# ==============================================================================
score_anomalia = pipeline_pca_isof.score_samples(X=datos_X)
score_anomalia
array([-0.41306992, -0.42741431, -0.47682121, ..., -0.4604572 ,
       -0.44811813, -0.50387696], shape=(1831,))

Anomaly Detection

# Confusion matrix of final classification
# ==============================================================================
df_resultados = pd.DataFrame({'score': score_anomalia, 'anomalia' : datos_y})
df_resultados = (
    df_resultados
    .sort_values('score', ascending=True)
    .reset_index(drop=True)
)
df_resultados['clasificacion'] = np.where(df_resultados.index <= 176, 1, 0)
pd.crosstab(
    df_resultados.anomalia,
    df_resultados.clasificacion,
    rownames=['true value'],
    colnames=['prediction']
)
prediction      0    1
true value
0.0          1569   86
1.0            85   91

In this dataset, the combination of PCA and Isolation Forest does not improve the results obtained.

Extended Isolation Forest

The Extended Isolation Forest algorithm is an improved version of its predecessor, Isolation Forest. The original algorithm introduced an innovative way to detect anomalies, but has a bias due to the way trees decide how to split the data.
The extended version corrects this bias, making the original algorithm just a special case of the generalized version.

The bias problem arises because, in the original algorithm, data splitting is done following a pattern similar to binary search trees (BST). At each branching point, the following is chosen:

  • A specific feature.
  • A cutoff value.

This causes the splits to be aligned with the axes and generates bias.

The extended version introduces a random slope at each split. Instead of directly selecting the feature and value, the following is chosen:

  • A random slope $n$ (a vector in the feature space), which defines the inclination of the cutting hyperplane.
  • A random intercept $p$, which determines where that hyperplane passes through the data space.

Each component of the slope $n$ is generated independently from a Gaussian distribution $N(0,1)$, and the intercept $p$ is obtained from a uniform distribution within the limits of the data to be split.

With this approach, a point $x$ is assigned to one side of the split following the rule:

$$ (x - p) \cdot n \leq 0 $$
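A minimal NumPy sketch of a single such split on toy data (illustrative only, not the H2O implementation):

# One random-hyperplane split as used by Extended Isolation Forest (sketch)
# ==============================================================================
import numpy as np

rng   = np.random.default_rng(123)
X_toy = rng.normal(size=(10, 3))                        # toy data with 3 features

n = rng.normal(size=X_toy.shape[1])                     # random slope, one N(0, 1) value per feature
p = rng.uniform(X_toy.min(axis=0), X_toy.max(axis=0))   # random intercept within the data range

mask_left = (X_toy - p) @ n <= 0                        # rule (x - p) · n <= 0
X_left, X_right = X_toy[mask_left], X_toy[~mask_left]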

Thanks to this:

  • Splits are no longer limited to being parallel to the axes.
  • The algorithm detects anomalies more accurately and with less bias.
  • The simplicity and efficiency of the original Isolation Forest is maintained.


Partitions generated by the isolation forest algorithm.


Partitions generated by the extended isolation forest algorithm.
# H2O cluster initialization and data transfer
# ==============================================================================
h2o.init(verbose=False)
h2o.remove_all()
h2o.no_progress()
datos_X_h2o = h2o.H2OFrame(datos_X)
# Definition and training of Extended IsolationForest model
# ==============================================================================
modelo_isof = H2OExtendedIsolationForestEstimator(
                ntrees          = 500,
                sample_size     = 100,
                extension_level = datos_X_h2o.shape[1] - 1
              )

_ = modelo_isof.train(x=datos_X_h2o.columns, training_frame=datos_X_h2o)
# Classification prediction
# ==============================================================================
score_anomalia = modelo_isof.predict(datos_X_h2o)
score_anomalia = score_anomalia.as_data_frame()['anomaly_score']
# Distribution of anomaly values
# ==============================================================================
fig, ax = plt.subplots(figsize=(6, 3))
sns.kdeplot(
    x     = score_anomalia,
    fill  = True,
    ax    = ax
)
sns.rugplot(score_anomalia,  ax=ax, color='black')
ax.set_title('Distribution of anomaly scores')
ax.set_xlabel('Anomaly score');
# Confusion matrix of final classification
# ==============================================================================
df_resultados = pd.DataFrame({ 'score': score_anomalia, 'anomalia' : datos_y})

df_resultados = (
    df_resultados
    .sort_values('score', ascending=False)
    .reset_index(drop=True)
)
df_resultados['clasificacion'] = np.where(df_resultados.index <= 176, 1, 0)
pd.crosstab(
    df_resultados['anomalia'],
    df_resultados['clasificacion'],
    rownames=['true value'],
    colnames=['prediction']
)
prediction      0    1
true value
0.0          1573   82
1.0            81   95

Anomaly detection with Extended Isolation Forest slightly improves the results obtained with the original version of the Isolation Forest algorithm.

PCA and Extended Isolation Forest

# Dimensionality reduction with PCA in H2O followed by Extended Isolation Forest
# ==============================================================================
pipeline_pca = make_pipeline(StandardScaler(), PCA(n_components=0.95))
pipeline_pca.fit(datos_X)
proyecciones = pipeline_pca.transform(datos_X)
proyecciones = h2o.H2OFrame(proyecciones)
# Extended Isolation Forest model definition and training
# ==============================================================================
modelo_isof = H2OExtendedIsolationForestEstimator(
                ntrees          = 1000,
                sample_size     = 100,
                extension_level = proyecciones.shape[1] - 1
              )
_ = modelo_isof.train(x=proyecciones.columns, training_frame=proyecciones)
# Anomaly score prediction
# ==============================================================================
score_anomalia = modelo_isof.predict(proyecciones)
score_anomalia = score_anomalia.as_data_frame()['anomaly_score']
# Confusion matrix of final classification
# ==============================================================================
df_resultados = pd.DataFrame({'score': score_anomalia, 'anomalia': datos_y})
df_resultados = df_resultados.sort_values('score', ascending=False).reset_index(drop=True)
df_resultados['clasificacion'] = np.where(df_resultados.index <= 176, 1, 0)
pd.crosstab(
    df_resultados['anomalia'],
    df_resultados['clasificacion'],
    rownames=['true value'],
    colnames=['prediction']
)
prediction      0    1
true value
0.0          1573   82
1.0            81   95

In this case, combining PCA with Extended Isolation Forest yields results equivalent to those obtained with Extended Isolation Forest alone.

GLRM and Extended Isolation Forest

GLRM (Generalized Low Rank Models) is a dimensionality reduction technique that generalizes classical methods such as PCA, SVD, or NMF. It uses a low-rank factor matrix to approximate the original data, allowing it to capture complex patterns and handle missing data. GLRM is versatile, applicable to various data types, and useful in tasks such as compression, imputation, and anomaly detection.

We proceed to train a GLRM model to reduce the dimensionality of the data before applying Extended Isolation Forest.

# GLRM model
# ==============================================================================
glrm_model = H2OGeneralizedLowRankEstimator(
                k=min(datos_X_h2o.shape),
                loss="absolute",
                transform="standardize"
            )
_ = glrm_model.train(training_frame=datos_X_h2o)

By plotting the cumulative variance as a function of the number of components, it is observed that with 13 components around 88% of the total variance is already explained.

# Component selection based on cumulative variance
# ==============================================================================
varianza_acumulada = (
    glrm_model.varimp(use_pandas=True)
    .set_index("")
    .transpose()
)
varianza_acumulada
      Standard deviation  Proportion of Variance  Cumulative Proportion
pc1             2.324221            2.572383e-01               0.257238
pc2             1.900043            1.719126e-01               0.429151
pc3             1.321502            8.316032e-02               0.512311
pc4             1.174894            6.573215e-02               0.578043
pc5             1.074044            5.493189e-02               0.632975
pc6             1.003130            4.791758e-02               0.680893
pc7             0.952741            4.322454e-02               0.724117
pc8             0.899020            3.848746e-02               0.762605
pc9             0.812392            3.142768e-02               0.794033
pc10            0.747440            2.660318e-02               0.820636
pc11            0.733461            2.561740e-02               0.846253
pc12            0.682764            2.219841e-02               0.868452
pc13            0.515233            1.264118e-02               0.881093
pc14            0.448287            9.569589e-03               0.890662
pc15            0.413305            8.134349e-03               0.898797
pc16            0.365480            6.360729e-03               0.905157
pc17            0.288387            3.960341e-03               0.909118
pc18            0.169435            1.367057e-03               0.910485
pc19            0.033592            5.373524e-05               0.910539
pc20            0.004301            8.808969e-07               0.910539
pc21            0.000486            1.125578e-08               0.910539

We proceed to transform the original data using the first 13 components obtained with GLRM.

# GLRM model with k=13
# ==============================================================================
glrm_model = H2OGeneralizedLowRankEstimator(
                k=13,
                loss="absolute",
                transform="standardize"
            )

_ = glrm_model.train(training_frame=datos_X_h2o)
proyecciones = glrm_model.predict(datos_X_h2o)
# Extended Isolation Forest model
# ==============================================================================
modelo_isof = H2OExtendedIsolationForestEstimator(
                ntrees          = 500,
                sample_size     = 100,
                extension_level = proyecciones.shape[1] - 1
              )

_ = modelo_isof.train(x=proyecciones.columns, training_frame=proyecciones)
# Prediction
# ==============================================================================
score_anomalia = modelo_isof.predict(proyecciones)
score_anomalia = score_anomalia.as_data_frame()['anomaly_score']

# Confusion matrix of final classification
# ==============================================================================
df_resultados = pd.DataFrame({'score': score_anomalia,'anomalia': datos_y})
df_resultados = (
    df_resultados
    .sort_values('score', ascending=False)
    .reset_index(drop=True)
)
df_resultados['clasificacion'] = np.where(df_resultados.index <= 176, 1, 0)
pd.crosstab(
    df_resultados['anomalia'],
    df_resultados['clasificacion'],
    rownames=['true value'],
    colnames=['prediction']
)
prediction      0    1
true value
0.0          1591   64
1.0            63  113

Among all the tested combinations, the one that obtains the best results is GLRM combined with Extended Isolation Forest. Of the 176 observations identified as anomalies, 64% (113/176) are truly anomalous.

Session Information

import session_info
session_info.show(html=False)
-----
h2o                 3.46.0.9
mat4py              0.6.0
matplotlib          3.10.8
numpy               2.2.6
pandas              2.3.3
seaborn             0.13.2
session_info        v1.0.1
sklearn             1.7.2
-----
IPython             9.8.0
jupyter_client      8.7.0
jupyter_core        5.9.1
-----
Python 3.13.11 | packaged by Anaconda, Inc. | (main, Dec 10 2025, 21:28:48) [GCC 14.3.0]
Linux-6.14.0-37-generic-x86_64-with-glibc2.39
-----
Session information updated at 2026-01-21 22:34

Bibliography

Aggarwal, Charu C. Outlier Analysis.

Aggarwal, Charu C. and Sathe, Saket. Outlier Ensembles: An Introduction.

https://www.h2o.ai/blog/anomaly-detection-with-isolation-forests-using-h2o/

Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. "Isolation Forest." Eighth IEEE International Conference on Data Mining (ICDM), 2008.

Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. "Isolation-Based Anomaly Detection." ACM Transactions on Knowledge Discovery from Data (TKDD) 6(1), Article 3, March 2012. https://doi.org/10.1145/2133360.2133363

Citation Instructions

How to cite this document?

If you use this document or any part of it, we appreciate you citing it. Thank you very much!

Anomaly detection with Isolation Forest and Python by Joaquín Amat Rodrigo, available under an Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0 DEED) license at https://www.cienciadedatos.net/documentos/py22-anomaly-detection-isolation-forest-python.html

Did you like the article? Your help is important

Your contribution will help me continue generating free educational content. Thank you very much! 😊

Become a GitHub Sponsor

Creative Commons Licence

This document created by Joaquín Amat Rodrigo is licensed under Attribution-NonCommercial-ShareAlike 4.0 International.

It is allowed:

  • Share: copy and redistribute the material in any medium or format.

  • Adapt: remix, transform, and build upon the material.

Under the following terms:

  • Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

  • NonCommercial: You may not use the material for commercial purposes.

  • ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.