
Introduction
A Random Forest model is composed of multiple individual decision trees. Each of these trees is trained with a slightly different sample of the training data, generated using a technique known as bootstrapping. To make predictions on new observations, the predictions of all the trees that make up the model are combined.
Many predictive methods generate global models in which a single equation is applied to the entire sample space. When the use case involves multiple predictors that interact in a complex and nonlinear way, it is very difficult to find a single global model that can reflect the relationship between the variables. Tree-based statistical and machine learning methods encompass a set of supervised non-parametric techniques that segment the predictor space into simple regions, within which it is easier to handle interactions. This feature provides much of their potential.
Tree-based methods have gained recognition as a benchmark in the field of prediction due to the excellent results they offer in a wide range of problems. This document explores how Random Forest models are built and used.
✎ Note
It is important to understand how decision trees work in order to understand Random Forest models. It is also recommended to be familiar with Gradient Boosting models, which share many characteristics with Random Forest.

Advantages
They are able to automatically select the most relevant predictors.
Can be applied to both regression and classification problems.
Trees can, in theory, handle both numerical and categorical predictors without having to create dummy or one-hot-encoded variables. In practice, this depends on the implementation of the algorithm in each library.
As non-parametric methods, they do not require the data to follow a specific distribution.
Generally, they require less data cleaning and preprocessing compared to other statistical learning methods. For example, they do not require standardization.
They are less susceptible to being influenced by outliers.
If the value of a predictor is not available for an observation, a prediction can still be made using the observations from the last reached node. The prediction accuracy will be reduced but at least a prediction can be obtained.
They are very useful in data exploration, allowing for quick and efficient identification of the most important variables (predictors).
Thanks to the Out-of-Bag Error, it is possible to estimate the validation error without resorting to computationally expensive strategies such as cross-validation. This does not apply in the case of time series.
They are suitable for datasets with a large number of observations, demonstrating good scalability.
Disadvantages
The combination of multiple trees reduces interpretability compared to single-tree models.
When dealing with continuous predictors, some information may be lost when categorizing them during node splitting.
The recursive binary splitting technique used to create tree branches identifies and evaluates possible splits for each predictor according to a certain measure (RSS, Gini, entropy, ...). Continuous predictors or qualitative predictors with many levels are more likely to contain, just by chance, an optimal split point, so they tend to be favored in tree creation.
They are not able to extrapolate outside the range observed in the training data.
Random Forest in Python
There are multiple implementations of Random Forest models in Python, with one of the most widely used being the one available in scikit-learn. Although less well known, the main Gradient Boosting libraries such as LightGBM and XGBoost can also be configured to create Random Forest models. All of them are highly optimized and are used in a similar way, but they have differences in their implementation that can lead to different results. In scikit-learn, it is necessary to apply one-hot-encoding to categorical predictors, which is not required in LightGBM and XGBoost. This directly affects the structure of the generated trees and, consequently, the predictive results of the model and the importance attributed to the predictors (see details below).
✎ Note
As of version 1.4, scikit-learn's Random Forest models support missing values.
Random Forest
A Random Forest model is composed of an ensemble of individual decision trees. Each of these trees is trained with a random sample drawn from the original training data using bootstrapping. This means that each tree is trained with a slightly different dataset. In each individual tree, observations are distributed through splits (nodes), shaping the tree structure until reaching a terminal node. The prediction for a new observation is obtained by aggregating the predictions of all the individual trees that make up the model.
To understand how Random Forest models work, it is essential to first become familiar with the concepts of ensemble and bagging.
Ensemble Methods
All statistical learning and machine learning models face the challenge of balancing bias and variance.
The term "bias" refers to how much, on average, a model's predictions deviate from the true values. It reflects the model's ability to capture the true relationship between the predictors and the response variable. For example, if the relationship follows a nonlinear pattern, a linear regression model, regardless of how much data is available, will not be able to model the relationship properly and will have high bias.
On the other hand, the term "variance" refers to how much the model changes depending on the data used for training. Ideally, a model should not change much with small variations in the training data. If it does, it indicates that the model is memorizing the data instead of learning the true relationship between the predictors and the response variable. For example, a tree model with many nodes tends to change its structure even with small variations in the training data, indicating high variance.
As the complexity of a model increases, it becomes more flexible and can better fit the observations, which reduces bias and improves predictive ability. However, after a certain level of complexity, the problem of overfitting arises. This phenomenon occurs when the model fits the training data so closely that it cannot correctly predict new observations. The optimal model is the one that finds a suitable balance between bias and variance.
How are bias and variance controlled in tree-based models? Generally, small trees with few branches tend to have low variance but may not accurately capture the relationship between variables, resulting in high bias. On the other hand, large trees fit the training data very closely, reducing bias but increasing variance. One way to address this problem is with ensemble methods.
Ensemble methods combine multiple models into a new one with the goal of achieving a balance between bias and variance, thus obtaining better predictions than any of the individual original models. Two of the most widely used types of ensemble are:
Bagging: Multiple models are fitted, each with a different subset of the training data. For prediction, all models in the ensemble contribute their prediction. The final value is the mean of all predictions (for continuous variables) or the most frequent class (for categorical variables). Random Forest models fall into this category.
Boosting: Multiple simple models, called weak learners, are fitted sequentially so that each model learns from the errors of the previous one. As with bagging, the final value is the mean of all predictions (for continuous variables) or the most frequent class (for categorical variables). Three of the most commonly used boosting methods are AdaBoost, Gradient Boosting, and Stochastic Gradient Boosting.
Although the ultimate goal is the same—to achieve an optimal balance between bias and variance—there are two important differences:
How they reduce total error. The total error of a model can be decomposed as error = bias^2 + variance + irreducible error. In bagging, models with very low bias but high variance are used; by aggregating them, variance is reduced without significantly increasing bias. In boosting, models with very low variance but high bias are used; by fitting models sequentially, bias is reduced. Therefore, each strategy reduces a different part of the total error.
How variations are introduced in the models that make up the ensemble. In bagging, each model is different from the rest because each is trained with a different sample obtained by bootstrapping. In boosting, models are fitted sequentially and the importance (weight) of the observations changes in each iteration, resulting in different fits.
The key for ensemble methods to achieve better results than any of their individual models is that the models that make them up are as diverse as possible (their errors are not correlated). An analogy that illustrates this concept is the following: imagine a trivia game where teams have to answer questions on various topics. A team made up of many players, each an expert in a different topic, will have a better chance of winning than a team made up of players who are all experts in a single topic or a single player who knows a little about everything.
Next, the bagging strategy, on which the Random Forest model is based, is described in more detail.
Bagging
The term bagging is short for bootstrap aggregation, and refers to the use of repeated sampling with replacement (bootstrapping) to reduce the variance of some statistical learning models, including tree-based models.
Given a set of n independent observations Z_1, ..., Z_n, each with variance σ², the variance of the mean of the observations is σ²/n. In other words, averaging a set of observations reduces variance. Based on this idea, one way to reduce variance and increase the accuracy of a predictive method is to obtain multiple samples from the population, fit a different model to each, and average the resulting predictions (or take the mode for qualitative variables). Since in practice we usually do not have access to multiple samples, the process can be simulated using bootstrapping, thus generating pseudo-samples to fit different models and then aggregate them. This process is known as bagging and is applicable to a wide variety of regression methods.
In the particular case of decision trees, given their low bias and high variance nature, bagging has shown very good results. The way to apply it is:
Generate pseudo-training sets using bootstrapping from the original training sample.
Train a tree with each of the samples from step 1. Each tree is created with almost no restrictions and is not pruned, so it has high variance but low bias. In most cases, the only stopping rule is the minimum number of observations that terminal nodes must have. The optimal value of this hyperparameter can be obtained by comparing the out of bag error or by cross-validation.
For each new observation, obtain the prediction from each of the trees. The final prediction is the mean of the predictions for quantitative variables and the most frequent predicted class (mode) for qualitative variables.
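The following is a minimal from-scratch sketch of these three steps on simulated data, using scikit-learn's DecisionTreeRegressor (the data and variable names are illustrative, not part of the original example):

# Bagging from scratch (illustrative sketch, simulated data)
# ==============================================================================
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(123)
X = rng.normal(size=(100, 3))                       # toy training data
y = 2 * X[:, 0] + rng.normal(size=100)

n_trees = 50
trees = []
for _ in range(n_trees):
    # Step 1: bootstrap pseudo-sample (sampling with replacement)
    idx = rng.integers(0, len(X), size=len(X))
    # Step 2: unpruned tree, grown with almost no restrictions
    tree = DecisionTreeRegressor(min_samples_leaf=1, random_state=0)
    trees.append(tree.fit(X[idx], y[idx]))

# Step 3: final prediction = mean of the individual tree predictions
x_new = rng.normal(size=(1, 3))
prediction = np.mean([tree.predict(x_new)[0] for tree in trees])
print(prediction)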
In the bagging process, the number of trees created is not a critical hyperparameter, since increasing the number does not increase the risk of overfitting. After a certain number of trees, the reduction in test error stabilizes. Nevertheless, each tree takes up memory, so it is not advisable to store more than necessary.
Training Random Forest
The Random Forest algorithm is a modification of the bagging process that improves results by further decorrelating the trees generated in the process.
As mentioned earlier, the benefits of bagging are based on the fact that averaging a set of models reduces variance. This is true as long as the aggregated models are not correlated. If the correlation is high, the reduction in variance that can be achieved is small.
Suppose a dataset has one very influential predictor, along with others that are moderately influential. In this scenario, all or almost all trees created in the bagging process will be dominated by the same predictor and will be very similar to each other. As a result of the high correlation between the trees, bagging will barely reduce variance and thus will not improve the model. Random forest avoids this problem by randomly selecting a subset of the predictors before evaluating each split. In this way, a substantial fraction of the splits will not even consider the influential predictor, allowing other predictors to be selected. By adding this extra step, the trees are further decorrelated, so their aggregation achieves a greater reduction in variance.
Random forest and bagging methods follow the same algorithm, with the only difference being that, in random forest, before each split, m of the p available predictors are randomly selected. The difference in the result will depend on the chosen value of m. If m = p, the results of random forest and bagging are equivalent. Some recommendations are:
- m = sqrt(p), the square root of the total number of predictors, for classification problems.
- m = p/3, one third of the number of predictors, for regression problems.
- If the predictors are highly correlated, small values of m achieve better results.
However, the best way to find the optimal value of m is to evaluate the out-of-bag error or use cross-validation.
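As a sketch, these rules of thumb map onto the max_features argument of the scikit-learn implementations described later (a float is interpreted as a fraction of the total number of predictors):

# Common starting points for the number of predictors evaluated per split
# ==============================================================================
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

model_clf = RandomForestClassifier(max_features='sqrt')  # m = sqrt(p), classification
model_reg = RandomForestRegressor(max_features=1/3)      # m = p/3, regression
model_bag = RandomForestRegressor(max_features=1.0)      # m = p, equivalent to bagging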
As with bagging, random forest does not suffer from overfitting by increasing the number of trees created in the process. After a certain number, the reduction in test error stabilizes.
Random Forest Prediction
The prediction of a Random Forest model is the mean of the predictions of all the trees that compose it.
Suppose there are 10 observations, each with a response variable Y and some predictors X.
| id | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Y | 10 | 18 | 24 | 8 | 2 | 9 | 16 | 10 | 20 | 14 |
| X | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
The following image shows how the model would predict for a new observation. In each tree, the path to the final node is highlighted. In each terminal node, the indices of the training observations that are part of it are detailed.

The value predicted by each tree is the mean of the response variable in the terminal node it reaches; the specific values can be read from the highlighted nodes in the image. The final prediction of the model is the mean of these three individual predictions.
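In scikit-learn this aggregation can be checked directly, since the fitted trees are exposed through the estimators_ attribute of the model. A minimal sketch on simulated data (illustrative, not the ten observations above):

# The forest prediction is the mean of the individual tree predictions
# ==============================================================================
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(123)
X = rng.normal(size=(100, 3))
y = 2 * X[:, 0] + rng.normal(size=100)
model = RandomForestRegressor(n_estimators=10, random_state=123).fit(X, y)

x_new = rng.normal(size=(1, 3))
tree_predictions = np.array([tree.predict(x_new) for tree in model.estimators_])
print(tree_predictions.mean(axis=0))  # same value as model.predict(x_new)
print(model.predict(x_new))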
Although the above is the most common way to obtain predictions from a Random Forest model, there is another approach. The prediction of a regression tree can be seen as a variant of nearest neighbors in which only the observations that are part of the same terminal node as the predicted observation have influence. Following this approach, the tree prediction is defined as the weighted mean of all training observations, where the weight of each observation depends only on whether it is part of the same terminal node. For Random Forest, this is equivalent to the weighted mean of all observations, using as weights the mean of the weight vectors of all trees.
According to the previous image, each of the three trees defines a weight vector over the ten training observations: an observation receives a weight of 1 divided by the size of the terminal node if it shares that node with the new observation, and 0 otherwise. The mean of the three weight vectors gives the final weights, and the prediction is calculated as the weighted mean of all training observations.
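A sketch of this second view using the apply() method, which returns the terminal node reached by each observation in each tree. Note that with scikit-learn the result only approximates model.predict(), because each leaf value is the mean of that tree's bootstrap sample rather than of the original training observations:

# Prediction as a weighted mean of the training observations (sketch)
# ==============================================================================
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(123)
X = rng.normal(size=(100, 3))
y = 2 * X[:, 0] + rng.normal(size=100)
model = RandomForestRegressor(n_estimators=10, random_state=123).fit(X, y)
x_new = rng.normal(size=(1, 3))

leaves_train = model.apply(X)     # terminal node of each training observation, per tree
leaves_new = model.apply(x_new)   # terminal node reached by the new observation, per tree

# Weight per tree: 1/(node size) if the observation shares the terminal node
# with the new observation, 0 otherwise. The final weights are the mean over trees.
weights = np.zeros(len(X))
for t in range(len(model.estimators_)):
    shared = leaves_train[:, t] == leaves_new[0, t]
    weights[shared] += 1 / shared.sum()
weights /= len(model.estimators_)

print(np.dot(weights, y))         # approximates model.predict(x_new)
print(model.predict(x_new))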
Out-of-Bag Error
Due to the nature of the bagging process, it is possible to estimate the test error without the need to use cross-validation methods. The fact that trees are fitted using samples generated by bootstrapping means that, on average, each fit uses only about two-thirds of the original observations. The remaining third is called out-of-bag (OOB).
If for each tree fitted in the bagging process the observations used are recorded, the response for the i-th observation can be predicted using only those trees whose training did not include it, and averaging their predictions (the mode in the case of classification trees). Following this process, predictions can be obtained for all observations and the OOB-mean square error (for regression) or the OOB-classification error (for classification trees) can be calculated. Since the response variable for each observation is predicted using only the trees in which it did not participate, the OOB-error serves as an estimate of the test error. In fact, if the number of trees is high enough, the OOB-error is practically equivalent to the leave-one-out cross-validation error.
This is an added advantage of bagging methods, and therefore of Random Forest, as it avoids the need to use the (computationally expensive) cross-validation process for hyperparameter optimization.
Two limitations in the use of Out-of-Bag Error:
The Out-of-Bag Error is not suitable when observations have a temporal relationship (time series). Since the selection of observations for each training is random, the temporal order is not respected and information from the future would be introduced.
The preprocessing of the training data is done jointly, so the out-of-bag observations may suffer from data leakage. If so, the OOB-error estimates are too optimistic.
In a bootstrap sample, if the training data size is n, each observation has a probability of 1/n of being chosen at each draw. Therefore, the probability of not being chosen in any of the n draws is (1 - 1/n)^n, which converges to e^(-1) ≈ 0.37, approximately one third.
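A quick numerical check of this limit:

# Probability of an observation never entering a bootstrap sample
# ==============================================================================
import numpy as np
for n in [10, 100, 1_000, 100_000]:
    print(n, (1 - 1 / n) ** n)   # tends to exp(-1)
print("limit:", np.exp(-1))      # 0.3678..., approximately one third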
Predictor Importance
While it is true that the bagging (Random Forest) process improves predictive ability compared to single-tree models, this comes at a cost: model interpretability is reduced. Since it is a combination of multiple trees, it is not possible to obtain a simple graphical representation of the model, and it is not straightforward to visually identify the most important predictors. However, new strategies have been developed to quantify predictor importance, making bagging (Random Forest) models a powerful tool not only for prediction but also for exploratory analysis. Two such measures are: permutation importance and node impurity.
Permutation Importance
Identifies the influence of each predictor on a given model evaluation metric (estimated by out-of-bag error or cross-validation). The value associated with each predictor is obtained as follows:
Build the set of trees that make up the model.
Calculate a given error metric (mse, classification error, ...). This is the reference value (error_0).
For each predictor j:
- Permute the values of predictor j, keeping the rest constant, and obtain the model's predictions on the permuted data.
- Recalculate the metric after the permutation; call it error_j.
- Calculate the increase in the metric due to the permutation of predictor j: increment_j = error_j - error_0.
If the permuted predictor was contributing to the model, the model error is expected to increase, as the information provided by that variable is lost. The percentage increase in error due to the permutation of predictor j can be interpreted as the influence of j on the model. Something that often causes confusion is that this increase can be negative. If the variable does not contribute to the model, it is possible that, by shuffling it randomly, the model improves slightly just by chance, so increment_j is negative. In general, these variables can be considered to have an importance close to zero.
Although this strategy is usually the most recommended, some caution is needed in its interpretation. What it quantifies is the influence of the predictors on the model, not their relationship with the response variable. Why is this important? Suppose a scenario where this strategy is used to identify which predictors are related to a person's weight, and two of the predictors are: body mass index (BMI) and height. Since BMI and height are highly correlated (the information they provide is redundant), when one of them is permuted, the impact on the model will be minimal, as the other provides the same information. As a result, these predictors will appear as not very influential even though they are actually highly related to the response variable. One way to avoid such problems is, whenever predictors are excluded from a model, to check the impact on its predictive ability.
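The following is a from-scratch sketch of the procedure on simulated data. In practice, scikit-learn provides it ready-made through sklearn.inspection.permutation_importance, which is used later in this document:

# Permutation importance from scratch (sketch, simulated data)
# ==============================================================================
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import root_mean_squared_error

rng = np.random.default_rng(123)
X = rng.normal(size=(200, 3))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(size=200)   # predictor 2 is irrelevant

model = RandomForestRegressor(random_state=123).fit(X, y)
error_0 = root_mean_squared_error(y, model.predict(X))   # reference error

for j in range(X.shape[1]):
    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])          # permute predictor j
    error_j = root_mean_squared_error(y, model.predict(X_perm))
    print(f"predictor {j}: increment = {error_j - error_0:.3f}")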
Increase in Node Purity
Quantifies the total increase in node purity due to splits involving the predictor (averaged over all trees). The way to calculate it is as follows: at each split in the trees, the decrease achieved in the measure used as the split criterion (Gini index, mse, entropy, ...) is recorded. For each predictor, the average decrease achieved in the set of trees that make up the ensemble is calculated. The higher this average value, the greater the contribution of the predictor to the model.
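In scikit-learn this measure is available, normalized so that the values sum to 1, through the feature_importances_ attribute of any fitted forest. A minimal sketch on simulated data:

# Impurity-based importance (sketch, simulated data)
# ==============================================================================
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(123)
X = rng.normal(size=(200, 3))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(size=200)

model = RandomForestRegressor(random_state=123).fit(X, y)
print(model.feature_importances_)  # normalized mean impurity decrease per predictor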
Regression Example
The Boston dataset contains housing prices in the city of Boston, as well as socio-economic information about the neighborhood in which they are located. The goal is to fit a regression model to predict the median price of a house (MEDV) based on the available variables.
Libraries
The libraries used in this example are:
# Data processing
# ==============================================================================
import numpy as np
import pandas as pd
# Plots
# ==============================================================================
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
plt.rcParams['lines.linewidth'] = 1.5
plt.rcParams['font.size'] = 8
# Preprocessing and modeling
# ==============================================================================
import sklearn
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import root_mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import ParameterGrid
from sklearn.inspection import permutation_importance
from joblib import Parallel, delayed, cpu_count
import optuna
# Warnings configuration
# ==============================================================================
import warnings
optuna.logging.set_verbosity(optuna.logging.WARNING)
print(f"scikit-learn version: {sklearn.__version__}")
scikit-learn version: 1.7.1
Data
The Boston dataset contains housing prices in the city of Boston, as well as socio-economic information about the neighborhood in which they are located. The goal is to fit a regression model to predict the median price of a house (MEDV) based on the available variables.
Number of instances: 506
Number of attributes: 13 (12 numerical/categorical predictors plus the response variable MEDV).
Attribute information (in order):
- CRIM: per capita crime rate by town.
- ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS: proportion of non-retail business acres per town.
- CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
- NOX: nitric oxides concentration (parts per 10 million).
- RM: average number of rooms per dwelling.
- AGE: proportion of owner-occupied units built prior to 1940.
- DIS: weighted distances to five Boston employment centers.
- RAD: index of accessibility to radial highways.
- TAX: full-value property-tax rate per $10,000.
- PTRATIO: pupil-teacher ratio by town.
- LSTAT: % lower status of the population.
- MEDV: Median value of owner-occupied homes in $1000's.
Missing values: None
# Data download
# ==============================================================================
url = (
'https://raw.githubusercontent.com/JoaquinAmatRodrigo/Estadistica-machine-learning-python/'
'master/data/Boston.csv'
)
data = pd.read_csv(url, sep=',')
display(data.head(3))
display(data.info())
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 4.98 | 24.0 |
1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 9.14 | 21.6 |
2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 4.03 | 34.7 |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    int64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    int64
 9   TAX      506 non-null    int64
 10  PTRATIO  506 non-null    float64
 11  LSTAT    506 non-null    float64
 12  MEDV     506 non-null    float64
dtypes: float64(10), int64(3)
memory usage: 51.5 KB
None
Model Fitting
A model is fitted using MEDV as the response variable and all other available variables as predictors. The RandomForestRegressor class from the sklearn.ensemble module allows you to train random forest models for regression problems. The default parameters and hyperparameters used are:
n_estimators=100
criterion='squared_error'
max_depth=None
min_samples_split=2
min_samples_leaf=1
min_weight_fraction_leaf=0.0
max_features=1.0
max_leaf_nodes=None
min_impurity_decrease=0.0
bootstrap=True
oob_score=False
n_jobs=None
random_state=None
verbose=0
warm_start=False
ccp_alpha=0.0
max_samples=None
Among all these, those that stop tree growth, control the number of trees and predictors included, and manage parallelization are particularly important:
- n_estimators: number of trees included in the model.
- max_depth: maximum depth the trees can reach.
- min_samples_split: minimum number of observations a node must have to be split. If a decimal value, it is interpreted as a fraction of the total training observations, ceil(min_samples_split * n_samples).
- min_samples_leaf: minimum number of observations each child node must have for the split to occur. If a decimal value, it is interpreted as a fraction of the total training observations, ceil(min_samples_leaf * n_samples).
- max_leaf_nodes: maximum number of terminal nodes the trees can have.
- max_features: number of predictors considered at each split. Can be:
  - An integer value.
  - A fraction of the total predictors.
  - "sqrt", square root of the total number of predictors.
  - "log2", log2 of the total number of predictors.
  - None, uses all predictors.
- oob_score: whether to calculate the out-of-bag R^2. By default it is False, as it increases training time.
- n_jobs: number of cores used for training. In random forest, trees are fitted independently, so parallelization significantly reduces training time. With -1, all available cores are used.
- random_state: seed for reproducibility. Must be an integer value.
As in any predictive study, it is important not only to fit the model but also to quantify its ability to predict new observations. To make this evaluation, the data is split into two groups: one for training and one for testing.
# Split data into training and test sets
# ==============================================================================
X_train, X_test, y_train, y_test = train_test_split(
data.drop(columns="MEDV"),
data['MEDV'],
test_size = 0.25,
random_state = 123
)
print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")
# Model creation
# ==============================================================================
model = RandomForestRegressor(
n_estimators = 10,
criterion = 'squared_error',
max_depth = None,
max_features = 1,
oob_score = False,
n_jobs = -1,
random_state = 123
)
# Model training
# ==============================================================================
model.fit(X_train, y_train)
Training set size: 379
Test set size: 127
RandomForestRegressor(max_features=1, n_estimators=10, n_jobs=-1, random_state=123)
Model Prediction and Evaluation
Once the model is trained, its predictive performance is evaluated using the test set.
# Test error of the initial model
# ==============================================================================
predictions = model.predict(X=X_test)
rmse = root_mean_squared_error(y_true=y_test, y_pred=predictions)
print(f"The test error (rmse) is: {rmse}")
The test error (rmse) is: 4.17800427058654
Hyperparameter Optimization
The initial model was trained using 10 trees (n_estimators=10) and max_features=1, keeping the rest of the hyperparameters at their default values. Since these are hyperparameters, the most suitable values cannot be known in advance; the way to identify them is by using validation strategies, such as cross-validation.
Random Forest models have the advantage of providing the Out-of-Bag error, which allows an estimate of the test error to be obtained without resorting to cross-validation, which is computationally expensive. In the RandomForestRegressor implementation, the metric returned as oob_score is R^2; if another metric is desired, the out-of-bag predictions stored in the oob_prediction_ attribute can be used to calculate the metric of interest. For a more detailed explanation, see: Grid search for Random Forest models with out-of-bag error and early stopping.
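As a sketch of this idea with the current training data, the out-of-bag RMSE can be computed from oob_prediction_ (the attribute is only created when fitting with oob_score=True; with very few trees, some observations may never fall out-of-bag):

# Out-of-bag RMSE instead of the default oob R^2 (sketch)
# ==============================================================================
model = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=123)
model.fit(X_train, y_train)
oob_rmse = root_mean_squared_error(y_train, model.oob_prediction_)
print(f"Out-of-bag RMSE: {oob_rmse}")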
Keep in mind that when searching for the optimal value of a hyperparameter with two different metrics, the result is rarely the same. The important thing is that both metrics identify the same regions of interest.
Number of Trees
In Random Forest, the number of trees is not a critical hyperparameter, since adding trees can only improve the result. Random Forest does not suffer from overfitting due to an excess of trees. However, adding trees once the improvement has stabilized is a waste of computational resources.
# Validation using Out-of-Bag error
# ==============================================================================
warnings.filterwarnings('ignore')
train_scores = []
oob_scores = []
# Evaluated values
estimator_range = range(1, 150, 5)
# Loop to train a model with each value of n_estimators and extract its training and Out-of-Bag error.
for n_estimators in estimator_range:
model = RandomForestRegressor(
n_estimators = n_estimators,
criterion = 'squared_error',
max_depth = None,
max_features = 1,
oob_score = True,
n_jobs = -1,
random_state = 123
)
model.fit(X_train, y_train)
train_scores.append(model.score(X_train, y_train))
oob_scores.append(model.oob_score_)
# Plot with the evolution of errors
fig, ax = plt.subplots(figsize=(5, 3))
ax.plot(estimator_range, train_scores, label="train scores")
ax.plot(estimator_range, oob_scores, label="out-of-bag scores")
ax.plot(estimator_range[np.argmax(oob_scores)], max(oob_scores),
marker='o', color = "red", label="max score")
ax.set_ylabel("R^2")
ax.set_xlabel("n_estimators")
ax.set_title("Out-of-bag-error evolution vs number of trees")
plt.legend();
print(f"Optimal value of n_estimators: {estimator_range[np.argmax(oob_scores)]}")
warnings.filterwarnings('default')
Optimal value of n_estimators: 66
# Validation using k-cross-validation and neg_root_mean_squared_error
# ==============================================================================
train_scores = []
cv_scores = []
# Evaluated values
estimator_range = range(1, 150, 5)
# Loop to train a model with each value of n_estimators and extract its training and k-cross-validation error.
for n_estimators in estimator_range:
model = RandomForestRegressor(
n_estimators = n_estimators,
criterion = 'squared_error',
max_depth = None,
max_features = 1,
oob_score = False,
n_jobs = -1,
random_state = 123
)
# Training error
model.fit(X_train, y_train)
predictions = model.predict(X=X_train)
rmse = root_mean_squared_error(
y_true = y_train,
y_pred = predictions,
)
train_scores.append(rmse)
# Cross-validation error
scores = cross_val_score(
estimator = model,
X = X_train,
y = y_train,
scoring = 'neg_root_mean_squared_error',
cv = 5
)
# Add cross_val_score() scores and convert to positive
cv_scores.append(-1*scores.mean())
# Plot with the evolution of errors
fig, ax = plt.subplots(figsize=(5, 3))
ax.plot(estimator_range, train_scores, label="train scores")
ax.plot(estimator_range, cv_scores, label="cv scores")
ax.plot(estimator_range[np.argmin(cv_scores)], min(cv_scores),
marker='o', color = "red", label="min score")
ax.set_ylabel("root_mean_squared_error")
ax.set_xlabel("n_estimators")
ax.set_title("cv-error evolution vs number of trees")
plt.legend();
print(f"Optimal value of n_estimators: {estimator_range[np.argmin(cv_scores)]}")
Optimal value of n_estimators: 96
Although the optimal value of the metrics is reached with 66 and 96 trees, the curves indicate that, from 20 trees onward, the model's validation error stabilizes.
Max features
The value of max_features is one of the most important hyperparameters in random forest, as it controls how much the trees are decorrelated from each other.
# Validation using Out-of-Bag error
# ==============================================================================
train_scores = []
oob_scores = []
# Evaluated values
max_features_range = range(1, X_train.shape[1] + 1, 1)
# Loop to train a model with each value of max_features and extract its training and Out-of-Bag error.
for max_features in max_features_range:
model = RandomForestRegressor(
n_estimators = 100,
criterion = 'squared_error',
max_depth = None,
max_features = max_features,
oob_score = True,
n_jobs = -1,
random_state = 123
)
model.fit(X_train, y_train)
train_scores.append(model.score(X_train, y_train))
oob_scores.append(model.oob_score_)
# Plot with the evolution of errors
fig, ax = plt.subplots(figsize=(5, 3))
ax.plot(max_features_range, train_scores, label="train scores")
ax.plot(max_features_range, oob_scores, label="out-of-bag scores")
ax.plot(max_features_range[np.argmax(oob_scores)], max(oob_scores),
marker='o', color = "red")
ax.set_ylabel("R^2")
ax.set_xlabel("max_features")
ax.set_title("Out-of-bag-error evolution vs number of predictors")
plt.legend();
print(f"Optimal value of max_features: {max_features_range[np.argmax(oob_scores)]}")
Optimal value of max_features: 4
# Validation using k-cross-validation and neg_root_mean_squared_error
# ==============================================================================
train_scores = []
cv_scores = []
# Evaluated values
max_features_range = range(1, X_train.shape[1] + 1, 1)
# Loop to train a model with each value of max_features and extract its training and k-cross-validation error.
for max_features in max_features_range:
model = RandomForestRegressor(
n_estimators = 100,
criterion = 'squared_error',
max_depth = None,
max_features = max_features,
oob_score = True,
n_jobs = -1,
random_state = 123
)
# Training error
model.fit(X_train, y_train)
predictions = model.predict(X = X_train)
rmse = root_mean_squared_error(
y_true = y_train,
y_pred = predictions,
)
train_scores.append(rmse)
# Cross-validation error
scores = cross_val_score(
estimator = model,
X = X_train,
y = y_train,
scoring = 'neg_root_mean_squared_error',
cv = 5
)
# Add cross_val_score() scores and convert to positive
cv_scores.append(-1*scores.mean())
# Plot with the evolution of errors
fig, ax = plt.subplots(figsize=(5, 3))
ax.plot(max_features_range, train_scores, label="train scores")
ax.plot(max_features_range, cv_scores, label="cv scores")
ax.plot(max_features_range[np.argmin(cv_scores)], min(cv_scores),
marker='o', color = "red", label="min score")
ax.set_ylabel("root_mean_squared_error")
ax.set_xlabel("max_features")
ax.set_title("cv-error evolution vs number of predictors")
plt.legend();
print(f"Optimal value of max_features: {max_features_range[np.argmin(cv_scores)]}")
Optimal value of max_features: 6
According to the two metrics used, the optimal value of max_features is between 4 and 6.
Grid search
Although individual analysis of hyperparameters is useful for understanding their impact on the model and identifying ranges of interest, the final search should not be done sequentially, as each hyperparameter interacts with the others. It is preferable to use grid search, random search, or bayesian search to analyze several combinations of hyperparameters. More information about search strategies can be found in Machine learning with Python and Scikit-learn.
Out-of-bag error
More details at: Grid search for Random Forest models with out-of-bag error and early stopping.
# Grid of evaluated hyperparameters
# ==============================================================================
param_grid = ParameterGrid(
{'n_estimators': [150],
'max_features': [5, 7, 9],
'max_depth' : [None, 3, 10, 20]
}
)
# Loop to fit a model with each combination of hyperparameters
# ==============================================================================
results = {'params': [], 'oob_r2': []}
for params in param_grid:
model = RandomForestRegressor(
oob_score = True,
n_jobs = -1,
random_state = 123,
**params
)
model.fit(X_train, y_train)
results['params'].append(params)
results['oob_r2'].append(model.oob_score_)
print(f"Model: {params} ✓")
# Results
# ==============================================================================
results = pd.DataFrame(results)
results = pd.concat([results, results['params'].apply(pd.Series)], axis=1)
results = results.drop(columns = 'params')
results = results.sort_values('oob_r2', ascending=False)
results.head(4)
Model: {'max_depth': None, 'max_features': 5, 'n_estimators': 150} ✓
Model: {'max_depth': None, 'max_features': 7, 'n_estimators': 150} ✓
Model: {'max_depth': None, 'max_features': 9, 'n_estimators': 150} ✓
Model: {'max_depth': 3, 'max_features': 5, 'n_estimators': 150} ✓
Model: {'max_depth': 3, 'max_features': 7, 'n_estimators': 150} ✓
Model: {'max_depth': 3, 'max_features': 9, 'n_estimators': 150} ✓
Model: {'max_depth': 10, 'max_features': 5, 'n_estimators': 150} ✓
Model: {'max_depth': 10, 'max_features': 7, 'n_estimators': 150} ✓
Model: {'max_depth': 10, 'max_features': 9, 'n_estimators': 150} ✓
Model: {'max_depth': 20, 'max_features': 5, 'n_estimators': 150} ✓
Model: {'max_depth': 20, 'max_features': 7, 'n_estimators': 150} ✓
Model: {'max_depth': 20, 'max_features': 9, 'n_estimators': 150} ✓
| | oob_r2 | max_depth | max_features | n_estimators |
|---|---|---|---|---|
6 | 0.882461 | 10.0 | 5.0 | 150.0 |
0 | 0.875289 | NaN | 5.0 | 150.0 |
9 | 0.874590 | 20.0 | 5.0 | 150.0 |
10 | 0.874318 | 20.0 | 7.0 | 150.0 |
This search process can be parallelized to take advantage of all CPU cores.
# Grid of evaluated hyperparameters
# ==============================================================================
param_grid = ParameterGrid(
{'n_estimators': [150],
'max_features': [5, 7, 9],
'max_depth' : [None, 3, 10, 20]
}
)
# Parallel loop to fit a model with each combination of hyperparameters
# ==============================================================================
def eval_oob_error(X, y, params, verbose=True):
"""
Function to train a model using given parameters
and return the out-of-bag error
"""
model = RandomForestRegressor(
oob_score = True,
n_jobs = -1,
random_state = 123,
**params
)
model.fit(X, y)
if verbose:
print(f"Model: {params} ✓")
return {'params': params, 'oob_r2': model.oob_score_}
results = Parallel(n_jobs=cpu_count()-1)(
delayed(eval_oob_error)(X_train, y_train, params)
for params in param_grid
)
# Results
# ==============================================================================
results = pd.DataFrame(results)
results = pd.concat([results, results['params'].apply(pd.Series)], axis=1)
results = results.drop(columns = 'params')
results = results.sort_values('oob_r2', ascending=False)
results.head(4)
Model: {'max_depth': None, 'max_features': 5, 'n_estimators': 150} ✓
Model: {'max_depth': 3, 'max_features': 5, 'n_estimators': 150} ✓
Model: {'max_depth': 3, 'max_features': 7, 'n_estimators': 150} ✓
Model: {'max_depth': 10, 'max_features': 5, 'n_estimators': 150} ✓
Model: {'max_depth': None, 'max_features': 9, 'n_estimators': 150} ✓
Model: {'max_depth': None, 'max_features': 7, 'n_estimators': 150} ✓
Model: {'max_depth': 3, 'max_features': 9, 'n_estimators': 150} ✓
Model: {'max_depth': 10, 'max_features': 9, 'n_estimators': 150} ✓
Model: {'max_depth': 10, 'max_features': 7, 'n_estimators': 150} ✓
Model: {'max_depth': 20, 'max_features': 7, 'n_estimators': 150} ✓
Model: {'max_depth': 20, 'max_features': 5, 'n_estimators': 150} ✓
Model: {'max_depth': 20, 'max_features': 9, 'n_estimators': 150} ✓
| | oob_r2 | max_depth | max_features | n_estimators |
|---|---|---|---|---|
6 | 0.882461 | 10.0 | 5.0 | 150.0 |
0 | 0.875289 | NaN | 5.0 | 150.0 |
9 | 0.874590 | 20.0 | 5.0 | 150.0 |
10 | 0.874318 | 20.0 | 7.0 | 150.0 |
# Best hyperparameters by out-of-bag error
# ==============================================================================
print("--------------------------------------------")
print("Best hyperparameters found (oob-r2)")
print("--------------------------------------------")
print(results.iloc[0,0:])
--------------------------------------------
Best hyperparameters found (oob-r2)
--------------------------------------------
oob_r2            0.882461
max_depth        10.000000
max_features      5.000000
n_estimators    150.000000
Name: 6, dtype: float64
Cross-validation
⚠ Warning
Hyperparameter search should be performed using data that the model has not seen. If the same data is used to fit the model and to evaluate it, there is a risk of overfitting the model to the training data. To avoid this problem, two strategies can be used:
Split the dataset into three groups: training, validation, and test. The training set is used to fit the model, the validation set to select the optimal value of the hyperparameters, and the test set to evaluate the predictive ability of the model.
Use cross-validation with the training data. In this case, the training set is divided into k groups and the model is fitted k times, each time with a different group as the validation set. The final metric is the average of the values obtained in each iteration. This strategy is computationally more expensive, as it requires training the model k times, but avoids having to create an additional data split when there is not much data available.
# Grid of evaluated hyperparameters
# ==============================================================================
param_grid = {'n_estimators': [150],
'max_features': [5, 7, 9],
'max_depth' : [None, 3, 10, 20]
}
# Grid search with cross-validation
# ==============================================================================
grid = GridSearchCV(
estimator = RandomForestRegressor(random_state = 123),
param_grid = param_grid,
scoring = 'neg_root_mean_squared_error',
n_jobs = cpu_count() - 1,
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=123),
refit = True,
verbose = 0,
return_train_score = True
)
grid.fit(X=X_train, y=y_train)
# Results
# ==============================================================================
results = pd.DataFrame(grid.cv_results_)
results.filter(regex = '(param.*|mean_t|std_t)') \
.drop(columns = 'params') \
.sort_values('mean_test_score', ascending = False) \
.head(4)
| | param_max_depth | param_max_features | param_n_estimators | mean_test_score | std_test_score | mean_train_score | std_train_score |
|---|---|---|---|---|---|---|---|
6 | 10 | 5 | 150 | -3.320295 | 0.701794 | -1.323499 | 0.061678 |
7 | 10 | 7 | 150 | -3.333357 | 0.696114 | -1.327214 | 0.066324 |
0 | None | 5 | 150 | -3.334407 | 0.715278 | -1.261911 | 0.058096 |
9 | 20 | 5 | 150 | -3.336755 | 0.716904 | -1.261439 | 0.057603 |
# Best hyperparameters found by cross-validation
# ==============================================================================
print("----------------------------------------")
print("Best hyperparameters found (cv)")
print("----------------------------------------")
print(grid.best_params_, ":", grid.best_score_, grid.scoring)
----------------------------------------
Best hyperparameters found (cv)
----------------------------------------
{'max_depth': 10, 'max_features': 5, 'n_estimators': 150} : -3.320294830700744 neg_root_mean_squared_error
Once the best hyperparameters have been identified, the model is retrained using the optimal values. If refit=True is set in GridSearchCV(), this retraining is done automatically and the resulting model is stored in .best_estimator_.
# Test error of the final model
# ==============================================================================
final_model = grid.best_estimator_
predictions = final_model.predict(X=X_test)
rmse = root_mean_squared_error(
y_true = y_test,
y_pred = predictions,
)
print(f"The test error (rmse) is: {rmse}")
The test error (rmse) is: 3.550875871674505
After optimizing the hyperparameters, the model's rmse error is reduced from 4.18 to 3.55. The predictions of the final model deviate on average by 3.55 units (3,550 dollars) from the actual value.
Bayesian search
Grid search and random search can yield good results, especially when the search range is reduced. However, neither takes into account the results obtained so far, which prevents them from focusing the search on the most interesting regions and avoiding unnecessary ones.
An alternative is to use Bayesian optimization methods to search for hyperparameters. In general terms, Bayesian hyperparameter optimization consists of creating a probabilistic model in which the objective function is the model's validation metric (RMSE, AUC, accuracy, etc.). With this strategy, the search is redirected in each iteration to the most promising regions. The ultimate goal is to reduce the number of hyperparameter combinations evaluated by the model, choosing only the best candidates. This approach is especially advantageous when the search space is very large or model evaluation is very slow.
# Bayesian hyperparameter search with optuna
# ==============================================================================
def objective(trial):
params = {
'n_estimators': trial.suggest_int('n_estimators', 100, 1000, step=100),
'max_depth': trial.suggest_int('max_depth', 3, 30),
'min_samples_split': trial.suggest_int('min_samples_split', 2, 100),
'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 100),
'max_features': trial.suggest_float('max_features', 0.2, 1.0),
'ccp_alpha': trial.suggest_float('ccp_alpha', 0.0, 1),
# Fixed parameters
'n_jobs': -1,
'random_state': 4576688,
}
model = RandomForestRegressor(**params)
cross_val_scores = cross_val_score(
estimator = model,
X = X_train,
y = y_train,
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=123),
scoring = 'neg_root_mean_squared_error',
n_jobs=-1
)
score = np.mean(cross_val_scores)
return score
study = optuna.create_study(direction='maximize') # Maximizing because the score is negative
study.optimize(objective, n_trials=30, show_progress_bar=True, timeout=60*5)
print('Best hyperparameters:', study.best_params)
print('Best score:', study.best_value)
Best hyperparameters: {'n_estimators': 300, 'max_depth': 20, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 0.5244309455082777, 'ccp_alpha': 0.16943620121169578} Best score: -3.492724388695251
# Test error of the final model
# ==============================================================================
final_model = RandomForestRegressor(**study.best_params)
final_model.fit(X_train, y_train)
predictions = final_model.predict(X=X_test)
rmse = root_mean_squared_error(y_true=y_test, y_pred=predictions)
print(f"The test error (rmse) is: {rmse}")
The test error (rmse) is: 3.8358066178098786
Predictor importance
Node purity
# Predictor importance based on impurity reduction
# ==============================================================================
feature_importance = pd.DataFrame({
'predictor': data.drop(columns = "MEDV").columns,
'importance': final_model.feature_importances_
})
print("Predictor importance in the model")
print("-------------------------------------------")
feature_importance.sort_values('importance', ascending=False)
Predictor importance in the model
-------------------------------------------
| | predictor | importance |
|---|---|---|
5 | RM | 0.406553 |
11 | LSTAT | 0.345944 |
4 | NOX | 0.048198 |
7 | DIS | 0.046635 |
0 | CRIM | 0.045737 |
2 | INDUS | 0.035020 |
10 | PTRATIO | 0.034027 |
9 | TAX | 0.016427 |
6 | AGE | 0.016364 |
8 | RAD | 0.003311 |
1 | ZN | 0.001469 |
3 | CHAS | 0.000314 |
Permutation
# Predictor importance based on permutation
# ==============================================================================
importance = permutation_importance(
estimator = final_model,
X = X_train,
y = y_train,
n_repeats = 5,
scoring = 'neg_root_mean_squared_error',
n_jobs = cpu_count() - 1,
random_state = 123
)
# Store the results (mean and std) in a dataframe
df_importance = pd.DataFrame(
{k: importance[k] for k in ['importances_mean', 'importances_std']}
)
df_importance['feature'] = X_train.columns
df_importance.sort_values('importances_mean', ascending=False)
| | importances_mean | importances_std | feature |
|---|---|---|---|
11 | 4.838738 | 0.223503 | LSTAT |
5 | 4.209149 | 0.073740 | RM |
4 | 0.716986 | 0.051328 | NOX |
7 | 0.624620 | 0.043498 | DIS |
0 | 0.462018 | 0.037184 | CRIM |
10 | 0.372917 | 0.033469 | PTRATIO |
9 | 0.234956 | 0.017121 | TAX |
2 | 0.223568 | 0.028646 | INDUS |
6 | 0.179764 | 0.021913 | AGE |
8 | 0.048013 | 0.003933 | RAD |
1 | 0.008110 | 0.001410 | ZN |
3 | 0.003015 | 0.000434 | CHAS |
# Plot of predictor importance
# ==============================================================================
fig, ax = plt.subplots(figsize=(3.5, 4))
df_importance = df_importance.sort_values('importances_mean', ascending=True)
ax.barh(
df_importance['feature'],
df_importance['importances_mean'],
xerr=df_importance['importances_std'],
align='center',
alpha=0
)
ax.plot(
df_importance['importances_mean'],
df_importance['feature'],
marker="D",
linestyle="",
alpha=0.8,
color="r"
)
ax.set_title('Predictor importance (train)')
ax.set_xlabel('Increase in error after permutation');
Both strategies identify LSTAT and RM as the most influential predictors, according to the training data.
Classification Example
Libraries
The libraries used in this document are:
# Data processing
# ==============================================================================
import numpy as np
import pandas as pd
import statsmodels.api as sm
# Plots
# ==============================================================================
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
plt.rcParams['lines.linewidth'] = 1.5
plt.rcParams['font.size'] = 8
# Preprocessing and modeling
# ==============================================================================
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import classification_report
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import ParameterGrid
from sklearn.inspection import permutation_importance
from joblib import Parallel, delayed
import optuna
# Warnings configuration
# ==============================================================================
import warnings
Data
The Carseats dataset, originally from the R package ISLR and accessible in Python via statsmodels.datasets.get_rdataset, contains information on child seat sales in 400 different stores. For each of the 400 stores, 11 variables have been recorded. The goal is to build a classification model to predict whether a store has high sales (Sales > 8) or low sales (Sales <= 8) based on all available variables.
✎ Note
List of all available datasets at Rdatasets.

# Data loading
# ==============================================================================
carseats_data = sm.datasets.get_rdataset("Carseats", "ISLR")
data = carseats_data.data
print(carseats_data.__doc__)
Carseats — Sales of Child Car Seats (R Documentation)

Description: A simulated data set containing sales of child car seats at 400 different stores.

Format: A data frame with 400 observations on the following 11 variables.

- Sales: unit sales (in thousands) at each location
- CompPrice: price charged by competitor at each location
- Income: community income level (in thousands of dollars)
- Advertising: local advertising budget for company at each location (in thousands of dollars)
- Population: population size in region (in thousands)
- Price: price company charges for car seats at each site
- ShelveLoc: a factor with levels Bad, Good and Medium indicating the quality of the shelving location for the car seats at each site
- Age: average age of the local population
- Education: education level at each location
- Urban: a factor with levels No and Yes to indicate whether the store is in an urban or rural location
- US: a factor with levels No and Yes to indicate whether the store is in the US or not

Source: Simulated data.

References: James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013) An Introduction to Statistical Learning with applications in R, https://www.statlearning.com, Springer-Verlag, New York
data.head(3)
| | Sales | CompPrice | Income | Advertising | Population | Price | ShelveLoc | Age | Education | Urban | US |
|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 9.50 | 138 | 73 | 11 | 276 | 120 | Bad | 42 | 17 | Yes | Yes |
1 | 11.22 | 111 | 48 | 16 | 260 | 83 | Good | 65 | 10 | Yes | Yes |
2 | 10.06 | 113 | 35 | 10 | 269 | 80 | Medium | 59 | 12 | Yes | Yes |
Since Sales is a continuous variable and the goal of the study is to classify stores according to whether they sell a lot or a little, a new binary variable (0, 1) called high_sales is created.
data['high_sales'] = np.where(data.Sales > 8, 1, 0)
# Once the new response variable is created, drop the original
data = data.drop(columns = 'Sales')
Model fitting and hyperparameter optimization
A random forest classification model is fitted using high_sales as the response variable and all available variables as predictors, and its hyperparameters are then optimized by comparing several configurations with the out-of-bag error.
Unlike the previous example, these data contain categorical variables, so before training the model, it is necessary to apply one-hot encoding. A more detailed description of this process can be found in Machine learning with Python and Scikit-learn.
# Split data into train and test sets
# ==============================================================================
X_train, X_test, y_train, y_test = train_test_split(
data.drop(columns = 'high_sales'),
data['high_sales'],
random_state = 123
)
# One-hot encoding of categorical variables
# ==============================================================================
# Identify the names of numeric and categorical columns
cat_cols = X_train.select_dtypes(include=['object', 'category']).columns.to_list()
numeric_cols = X_train.select_dtypes(include=['float64', 'int']).columns.to_list()
# Apply one-hot encoding only to categorical columns
preprocessor = ColumnTransformer(
[('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False, drop='if_binary'), cat_cols)],
remainder='passthrough',
verbose_feature_names_out=False
).set_output(transform="pandas")
# Once the ColumnTransformer object is defined, use fit() to learn the transformations
# with the training data and apply them to both sets with transform(). Both operations
# can be done at once with fit_transform().
X_train_prep = preprocessor.fit_transform(X_train)
X_test_prep = preprocessor.transform(X_test)
X_train_prep.info()
<class 'pandas.core.frame.DataFrame'>
Index: 300 entries, 170 to 365
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   ShelveLoc_Bad     300 non-null    float64
 1   ShelveLoc_Good    300 non-null    float64
 2   ShelveLoc_Medium  300 non-null    float64
 3   Urban_Yes         300 non-null    float64
 4   US_Yes            300 non-null    float64
 5   CompPrice         300 non-null    int64
 6   Income            300 non-null    int64
 7   Advertising       300 non-null    int64
 8   Population        300 non-null    int64
 9   Price             300 non-null    int64
 10  Age               300 non-null    int64
 11  Education         300 non-null    int64
dtypes: float64(5), int64(7)
memory usage: 30.5 KB
Although `RandomForestClassifier` has default values for its hyperparameters, there is no way to know in advance whether these are the most suitable; they have to be identified with validation strategies such as cross-validation.

Random Forest models have the advantage of providing the out-of-bag error, which allows the test error to be estimated without resorting to computationally expensive cross-validation. In the `RandomForestClassifier` implementation, the metric returned as `oob_score_` is accuracy; if another metric is desired, the `oob_decision_function_` attribute must be used to obtain the predictions and then compute the metric of interest, as illustrated below. For a more detailed explanation, see: Grid search for Random Forest models with out-of-bag error and early stopping.
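As a minimal sketch of this idea (assuming the `X_train_prep` and `y_train` objects created above), a different out-of-bag metric, for example ROC AUC, can be computed from `oob_decision_function_`:

# Alternative out-of-bag metric from oob_decision_function_ (sketch)
# ==============================================================================
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

model_oob = RandomForestClassifier(n_estimators=150, oob_score=True, random_state=123)
model_oob.fit(X_train_prep, y_train)

# oob_decision_function_ stores, for each training observation, the class
# probabilities estimated using only the trees for which it was out-of-bag.
# Column 1 is the probability of class 1. (With very few trees, some rows may
# be NaN if an observation was never out-of-bag.)
oob_proba = model_oob.oob_decision_function_[:, 1]
print(f"OOB accuracy: {model_oob.oob_score_:.3f}")
print(f"OOB ROC AUC : {roc_auc_score(y_train, oob_proba):.3f}")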
Keep in mind that when searching for the optimal value of a hyperparameter with two different metrics, the result is rarely the same. The important thing is that both metrics identify the same regions of interest.
Although individual analysis of hyperparameters is useful for understanding their impact on the model and identifying ranges of interest, the final search should not be done sequentially, as each hyperparameter interacts with the others. It is preferable to use grid search or random search to analyze several combinations of hyperparameters. More information about search strategies can be found in Machine learning with Python and Scikit-learn.
Grid Search (out-of-bag score)
More details at: Grid search for Random Forest models with out-of-bag error and early stopping.
# Grid of evaluated hyperparameters
# ==============================================================================
param_grid = ParameterGrid(
{'n_estimators': [150],
'max_features': [5, 7, 9],
'max_depth' : [None, 3, 10, 20],
'criterion' : ['gini', 'entropy']
}
)
# Loop to fit a model with each combination of hyperparameters
# ==============================================================================
results = {'params': [], 'oob_accuracy': []}
for params in param_grid:
    model = RandomForestClassifier(
        oob_score    = True,
        n_jobs       = -1,
        random_state = 123,
        **params
    )
    model.fit(X_train_prep, y_train)
    results['params'].append(params)
    results['oob_accuracy'].append(model.oob_score_)
    print(f"Model: {params} ✓")
# Results
# ==============================================================================
results = pd.DataFrame(results)
results = pd.concat([results, results['params'].apply(pd.Series)], axis=1)
results = results.sort_values('oob_accuracy', ascending=False)
results = results.drop(columns='params')
results.head(4)
Model: {'criterion': 'gini', 'max_depth': None, 'max_features': 5, 'n_estimators': 150} ✓
Model: {'criterion': 'gini', 'max_depth': None, 'max_features': 7, 'n_estimators': 150} ✓
Model: {'criterion': 'gini', 'max_depth': None, 'max_features': 9, 'n_estimators': 150} ✓
Model: {'criterion': 'gini', 'max_depth': 3, 'max_features': 5, 'n_estimators': 150} ✓
Model: {'criterion': 'gini', 'max_depth': 3, 'max_features': 7, 'n_estimators': 150} ✓
Model: {'criterion': 'gini', 'max_depth': 3, 'max_features': 9, 'n_estimators': 150} ✓
Model: {'criterion': 'gini', 'max_depth': 10, 'max_features': 5, 'n_estimators': 150} ✓
Model: {'criterion': 'gini', 'max_depth': 10, 'max_features': 7, 'n_estimators': 150} ✓
Model: {'criterion': 'gini', 'max_depth': 10, 'max_features': 9, 'n_estimators': 150} ✓
Model: {'criterion': 'gini', 'max_depth': 20, 'max_features': 5, 'n_estimators': 150} ✓
Model: {'criterion': 'gini', 'max_depth': 20, 'max_features': 7, 'n_estimators': 150} ✓
Model: {'criterion': 'gini', 'max_depth': 20, 'max_features': 9, 'n_estimators': 150} ✓
Model: {'criterion': 'entropy', 'max_depth': None, 'max_features': 5, 'n_estimators': 150} ✓
Model: {'criterion': 'entropy', 'max_depth': None, 'max_features': 7, 'n_estimators': 150} ✓
Model: {'criterion': 'entropy', 'max_depth': None, 'max_features': 9, 'n_estimators': 150} ✓
Model: {'criterion': 'entropy', 'max_depth': 3, 'max_features': 5, 'n_estimators': 150} ✓
Model: {'criterion': 'entropy', 'max_depth': 3, 'max_features': 7, 'n_estimators': 150} ✓
Model: {'criterion': 'entropy', 'max_depth': 3, 'max_features': 9, 'n_estimators': 150} ✓
Model: {'criterion': 'entropy', 'max_depth': 10, 'max_features': 5, 'n_estimators': 150} ✓
Model: {'criterion': 'entropy', 'max_depth': 10, 'max_features': 7, 'n_estimators': 150} ✓
Model: {'criterion': 'entropy', 'max_depth': 10, 'max_features': 9, 'n_estimators': 150} ✓
Model: {'criterion': 'entropy', 'max_depth': 20, 'max_features': 5, 'n_estimators': 150} ✓
Model: {'criterion': 'entropy', 'max_depth': 20, 'max_features': 7, 'n_estimators': 150} ✓
Model: {'criterion': 'entropy', 'max_depth': 20, 'max_features': 9, 'n_estimators': 150} ✓
| | oob_accuracy | criterion | max_depth | max_features | n_estimators |
|---|---|---|---|---|---|
| 22 | 0.820000 | entropy | 20.0 | 7 | 150 |
| 13 | 0.820000 | entropy | NaN | 7 | 150 |
| 6 | 0.816667 | gini | 10.0 | 5 | 150 |
| 18 | 0.810000 | entropy | 10.0 | 5 | 150 |
# Best hyperparameters by out-of-bag accuracy
# ==============================================================================
print("--------------------------------------------------")
print("Best hyperparameters found (oob-accuracy)")
print("--------------------------------------------------")
print(results.iloc[0,0:])
--------------------------------------------------
Best hyperparameters found (oob-accuracy)
--------------------------------------------------
oob_accuracy       0.82
criterion       entropy
max_depth          20.0
max_features          7
n_estimators        150
Name: 22, dtype: object
Grid Search (cross-validation)
# Grid of evaluated hyperparameters
# ==============================================================================
param_grid = {
'n_estimators': [150],
'max_features': [5, 7, 9],
'max_depth' : [None, 3, 10, 20],
'criterion' : ['gini', 'entropy']
}
# Grid search with cross-validation
# ==============================================================================
grid = GridSearchCV(
estimator = RandomForestClassifier(random_state = 123),
param_grid = param_grid,
scoring = 'accuracy',
n_jobs = cpu_count() - 1,
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=123),
refit = True,
verbose = 0,
return_train_score = True
)
grid.fit(X=X_train_prep, y=y_train)
# Results
# ==============================================================================
results = pd.DataFrame(grid.cv_results_)
results.filter(regex='(param*|mean_t|std_t)') \
.drop(columns='params') \
.sort_values('mean_test_score', ascending = False) \
.head(4)
| | param_criterion | param_max_depth | param_max_features | param_n_estimators | mean_test_score | std_test_score | mean_train_score | std_train_score |
|---|---|---|---|---|---|---|---|---|
| 21 | entropy | 20 | 5 | 150 | 0.816667 | 0.029814 | 1.0 | 0.0 |
| 12 | entropy | None | 5 | 150 | 0.816667 | 0.029814 | 1.0 | 0.0 |
| 18 | entropy | 10 | 5 | 150 | 0.814444 | 0.030952 | 1.0 | 0.0 |
| 13 | entropy | None | 7 | 150 | 0.812222 | 0.028846 | 1.0 | 0.0 |
# Best hyperparameters found by cross-validation
# ==============================================================================
print("--------------------------------------------")
print("Best hyperparameters found by (cv)")
print("--------------------------------------------")
print(grid.best_params_, ":", grid.best_score_, grid.scoring)
--------------------------------------------
Best hyperparameters found by (cv)
--------------------------------------------
{'criterion': 'entropy', 'max_depth': None, 'max_features': 5, 'n_estimators': 150} : 0.8166666666666668 accuracy
Once the best hyperparameters have been identified, the model is retrained using the optimal values. If `refit=True` is set in `GridSearchCV()`, this retraining is done automatically and the resulting model is stored in `.best_estimator_`.
Prediction and evaluation
Finally, the predictive ability of the final model is evaluated using the test set.
# Model with the best hyperparameters
# ==============================================================================
final_model = grid.best_estimator_
final_model
RandomForestClassifier(criterion='entropy', max_features=5, n_estimators=150, random_state=123)
# Test error of the final model
# ==============================================================================
predictions = final_model.predict(X=X_test_prep)
predictions[:10]
array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
conf_matrix = confusion_matrix(y_true=y_test, y_pred=predictions)
accuracy = accuracy_score(y_true=y_test, y_pred=predictions, normalize=True)
print("Confusion matrix")
print("-------------------")
print(conf_matrix)
print("")
print(f"Test accuracy is: {100 * accuracy} % \n")
fig, ax = plt.subplots(figsize=(3, 3))
ConfusionMatrixDisplay(conf_matrix).plot(ax=ax)
print(
classification_report(
y_true = y_test,
y_pred = predictions
)
)
Confusion matrix
-------------------
[[47  3]
 [17 33]]

Test accuracy is: 80.0 %

              precision    recall  f1-score   support

           0       0.73      0.94      0.82        50
           1       0.92      0.66      0.77        50

    accuracy                           0.80       100
   macro avg       0.83      0.80      0.80       100
weighted avg       0.83      0.80      0.80       100
After optimizing the hyperparameters, a test accuracy of 80% is achieved.
Probabilistic prediction
Most implementations of Random Forest, including scikit-learn, allow probability prediction for classification problems. It is important to understand how these values are calculated to interpret and use them correctly.
In the previous example, applying `.predict()` returns 1 (high sales) or 0 (low sales) for each test observation. However, this gives no information about the confidence with which the model makes each assignment. With `.predict_proba()`, instead of a classification, you obtain the probability with which the model considers that each observation may belong to each of the classes.
# Probability prediction
# ==============================================================================
predictions = final_model.predict_proba(X=X_test_prep)
predictions[:5, :]
array([[0.21333333, 0.78666667],
       [0.28      , 0.72      ],
       [0.14      , 0.86      ],
       [0.27333333, 0.72666667],
       [0.20666667, 0.79333333]])
The result of `.predict_proba()` is an array with one row per observation and as many columns as there are classes in the response variable. The value in the first column corresponds to the probability, according to the model, that the observation belongs to class 0, and so on. The probability value shown for each prediction corresponds to the fraction of observations of each class in the terminal nodes reached by the observation across all the trees.
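As a check of this description, the probabilities returned by `.predict_proba()` can be reproduced manually by averaging the per-tree class fractions. This is a sketch that reuses the `final_model` and `X_test_prep` objects from above:

# Manual reconstruction of predict_proba by averaging the trees (sketch)
# ==============================================================================
import numpy as np

# Each tree returns the fraction of training observations of each class in the
# terminal node reached by the observation; the forest averages these fractions.
tree_probas = np.stack([
    tree.predict_proba(X_test_prep.to_numpy()) for tree in final_model.estimators_
])
manual_proba = tree_probas.mean(axis=0)

# Should match final_model.predict_proba(X_test_prep) up to float precision
print(np.allclose(manual_proba, final_model.predict_proba(X_test_prep)))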
By default, `.predict()` assigns each new observation to the class with the highest probability (in scikit-learn, ties are resolved in favor of the class with the lowest index). However, this may not always be the desired behavior.
# Classification using the class with the highest probability
# ==============================================================================
df_predictions = pd.DataFrame(data=predictions, columns=['0', '1'])
df_predictions['classification_default_0.5'] = np.where(df_predictions['0'] > df_predictions['1'], 0, 1)
df_predictions.head(3)
| | 0 | 1 | classification_default_0.5 |
|---|---|---|---|
| 0 | 0.213333 | 0.786667 | 1 |
| 1 | 0.280000 | 0.720000 | 1 |
| 2 | 0.140000 | 0.860000 | 1 |
Suppose the following scenario: the Christmas campaign is approaching and the owners of the chain want to double the stock of items in those stores expected to have high sales. Since transporting this material to the stores is costly, the manager wants to limit this strategy only to stores for which there is high confidence that they will achieve high sales.
If probabilities are available, a specific cutoff point can be set, for example, classifying as class 1 (high sales) only those stores whose predicted probability for this class is greater than 0.9 (90%). In this way, the final classification is better tailored to the needs of the use case.
# Final classification using a threshold of 0.9 for class 1
# ==============================================================================
df_predictions['classification_custom_0.9'] = np.where(df_predictions['1'] > 0.9, 1, 0)
df_predictions.iloc[4:10, :]
| | 0 | 1 | classification_default_0.5 | classification_custom_0.9 |
|---|---|---|---|---|
| 4 | 0.206667 | 0.793333 | 1 | 0 |
| 5 | 0.393333 | 0.606667 | 1 | 0 |
| 6 | 0.813333 | 0.186667 | 0 | 0 |
| 7 | 0.640000 | 0.360000 | 0 | 0 |
| 8 | 0.873333 | 0.126667 | 0 | 0 |
| 9 | 0.840000 | 0.160000 | 0 | 0 |
How much should you trust these probabilities?
It is very important to keep in mind the difference between the "view" the model has of the world and the real world. Everything a model knows about the real world is what it has learned from the training data, so its "view" is limited. For example, suppose that in the training data all stores in urban areas (`Urban='Yes'`) have high sales regardless of the value of the other predictors. When the model predicts a new observation located in an urban area, it will classify the store as high sales with 100% confidence. However, this does not mean it is unequivocally true; there could be stores in urban areas that do not have high sales, but since they are not present in the training data, the model does not consider this possibility.
With this in mind, the probabilities generated by the model should be interpreted as the model's confidence, from its limited perspective, when making predictions, not as the real-world probability that the event is true.
Predictor importance
Node purity
feature_importance = pd.DataFrame(
{'predictor': X_train_prep.columns,
'importance': final_model.feature_importances_}
)
print("Predictor importance in the model")
print("-------------------------------------------")
feature_importance.sort_values('importance', ascending=False)
Predictor importance in the model
-------------------------------------------
| | predictor | importance |
|---|---|---|
| 9 | Price | 0.221835 |
| 5 | CompPrice | 0.115144 |
| 7 | Advertising | 0.114074 |
| 10 | Age | 0.100719 |
| 6 | Income | 0.097072 |
| 1 | ShelveLoc_Good | 0.086468 |
| 8 | Population | 0.074825 |
| 0 | ShelveLoc_Bad | 0.062682 |
| 11 | Education | 0.062288 |
| 2 | ShelveLoc_Medium | 0.027956 |
| 4 | US_Yes | 0.025334 |
| 3 | Urban_Yes | 0.011603 |
Permutation
# Note: this scorer treats the 0/1 class predictions as numeric values.
# For a classifier, a classification metric such as 'accuracy' would
# usually be preferred here.
importance = permutation_importance(
    estimator    = final_model,
    X            = X_train_prep,
    y            = y_train,
    n_repeats    = 5,
    scoring      = 'neg_root_mean_squared_error',
    n_jobs       = cpu_count() - 1,
    random_state = 123,
)
# Store the results (mean and std) in a dataframe
df_importance = pd.DataFrame(
{k: importance[k] for k in ['importances_mean', 'importances_std']}
)
df_importance['feature'] = X_train_prep.columns
df_importance.sort_values('importances_mean', ascending=False)
| | importances_mean | importances_std | feature |
|---|---|---|---|
| 9 | 0.448637 | 0.026930 | Price |
| 1 | 0.338378 | 0.012934 | ShelveLoc_Good |
| 7 | 0.295299 | 0.011481 | Advertising |
| 5 | 0.257405 | 0.020228 | CompPrice |
| 0 | 0.254924 | 0.018636 | ShelveLoc_Bad |
| 10 | 0.176646 | 0.011373 | Age |
| 6 | 0.130213 | 0.019439 | Income |
| 8 | 0.084207 | 0.015574 | Population |
| 11 | 0.060537 | 0.031654 | Education |
| 4 | 0.034641 | 0.028284 | US_Yes |
| 3 | 0.000000 | 0.000000 | Urban_Yes |
| 2 | 0.000000 | 0.000000 | ShelveLoc_Medium |
# Plot
# ==============================================================================
fig, ax = plt.subplots(figsize=(3.5, 4))
df_importance = df_importance.sort_values('importances_mean', ascending=True)
ax.barh(
df_importance['feature'],
df_importance['importances_mean'],
xerr=df_importance['importances_std'],
align='center',
alpha=0
)
ax.plot(
df_importance['importances_mean'],
df_importance['feature'],
marker="D",
linestyle="",
alpha=0.8,
color="r"
)
ax.set_title('Predictor importance (train)')
ax.set_xlabel('Increase in error after permutation');
Both strategies identify `Price`, `Advertising`, and `ShelveLoc` as the most influential predictors, according to the training data.
Random Forest with XGBoost
XGBoost is mainly used to create gradient boosting models. However, since random forest models use the same tree representation and prediction strategy, XGBoost can be configured to run the random forest algorithm during training. The parameter configuration required to train random forest models is listed below; a sketch of this configuration follows the list.
- `booster` must be set to `gbtree`.
- `subsample` must be set to a value less than 1 to allow random selection of training observations (rows) in each tree.
- One of the `colsample_by*` parameters must be set to a value less than 1 to allow random selection of columns. Usually, `colsample_bynode` is used to randomly select columns at each tree split.
- `num_parallel_tree` indicates the number of trees that make up the model.
- `num_boost_round` should be set to 1 to prevent XGBoost from combining multiple random forest models. This is an argument of the `train()` method, not of the initialization.
- `eta` (alias: `learning_rate`) should be 1.
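The following is a minimal sketch of this configuration using the native `xgboost.train()` API. The data here (`X_demo`, `y_demo`) are hypothetical toy data, created only to make the example self-contained:

# Random forest configuration with the native XGBoost API (sketch)
# ==============================================================================
import numpy as np
import xgboost as xgb

# Hypothetical toy data, used only to illustrate the configuration
rng = np.random.default_rng(123)
X_demo = rng.normal(size=(200, 5))
y_demo = 2 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)
dtrain = xgb.DMatrix(X_demo, label=y_demo)

params = {
    'booster'          : 'gbtree',
    'subsample'        : 0.8,   # row sampling per tree (< 1)
    'colsample_bynode' : 0.8,   # column sampling at each split (< 1)
    'num_parallel_tree': 100,   # number of trees in the forest
    'eta'              : 1.0,   # learning rate must be 1
    'objective'        : 'reg:squarederror',
}

# num_boost_round=1: a single boosting round builds the whole forest instead
# of combining several forests sequentially.
rf_xgb = xgb.train(params, dtrain, num_boost_round=1)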
To facilitate the use of random forest models with XGBoost, there are two classes: `XGBRFClassifier` and `XGBRFRegressor`. These are versions of `XGBClassifier` and `XGBRegressor` whose default parameter values are already set for random forest training instead of gradient boosting:

- `n_estimators` specifies the number of trees. Internally, it is converted to `num_parallel_tree`.
- `learning_rate` is set to 1 by default.
- `colsample_bynode` and `subsample` are set to 0.8 by default.
- `booster` is always `gbtree`.
✎ Note
The main benefit of using the XGBoost library to train random forest models, compared to the native scikit-learn implementation, is speed, along with its ability to handle missing values and categorical variables.
⚠ Warning
The XGBoost implementation has some differences from the scikit-learn implementation that can lead to different results:
- XGBoost uses a second-order approximation to the objective function. This can lead to results that differ from an implementation using the exact value of the objective function.
- XGBoost samples training observations without replacement: each observation can appear at most once in each tree's sample.
from xgboost import XGBRFRegressor
import xgboost
print(f"XGBoost version: {xgboost.__version__}")
# Data download
# ==============================================================================
url = (
'https://raw.githubusercontent.com/JoaquinAmatRodrigo/Estadistica-machine-learning-python/'
'master/data/Boston.csv'
)
data = pd.read_csv(url, sep=',')
data.head(3)
# Split data into train and test sets
# ==============================================================================
X_train, X_test, y_train, y_test = train_test_split(
data.drop(columns = "MEDV"),
data['MEDV'],
random_state = 123,
)
# Model creation
# ==============================================================================
model = XGBRFRegressor(
n_estimators = 100,
tree_method = 'hist',
max_depth = 10,
max_leaves = 5,
learning_rate = 1.0,
subsample = 0.8,
colsample_bynode = 0.8,
reg_lambda = 1e-05,
reg_alpha = 0,
)
# Model training
# ==============================================================================
model.fit(X_train, y_train)
XGBoost version: 3.0.0
XGBRFRegressor(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bytree=None, device=None, early_stopping_rounds=None, enable_categorical=False, eval_metric=None, feature_types=None, feature_weights=None, gamma=None, grow_policy=None, importance_type=None, interaction_constraints=None, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=10, max_leaves=5, min_child_weight=None, missing=nan, monotone_constraints=None, multi_strategy=None, n_estimators=100, n_jobs=None, num_parallel_tree=None, objective='reg:squarederror', random_state=None, ...)
# Predictions
# ==============================================================================
predictions = model.predict(X_test)
# Test error
# ==============================================================================
rmse = root_mean_squared_error(y_true=y_test, y_pred=predictions)
print(f"The test error (rmse) is: {rmse}")
The test error (rmse) is: 4.759722741275837
Random Forest with LightGBM
As with XGBoost, LightGBM can also be configured to train random forest models instead of gradient boosting.
from lightgbm import LGBMRegressor
import lightgbm
print(f"LightGBM version: {lightgbm.__version__}")
# Split data into train and test sets
# ==============================================================================
X_train, X_test, y_train, y_test = train_test_split(
data.drop(columns = "MEDV"),
data['MEDV'],
random_state = 123,
)
# Model creation
# ==============================================================================
model = LGBMRegressor(
boosting_type="rf",
n_estimators=50,
max_depth=10,
colsample_bytree=0.8,
subsample=0.8,
subsample_freq=1,
reg_alpha= 1, # L1 regularization
reg_lambda= 1, # L2 regularization
verbose=-1
)
# Model training
# ==============================================================================
model.fit(X_train, y_train)
LightGBM version: 4.6.0
LGBMRegressor(boosting_type='rf', colsample_bytree=0.8, max_depth=10, n_estimators=50, reg_alpha=1, reg_lambda=1, subsample=0.8, subsample_freq=1, verbose=-1)
# Predictions
# ==============================================================================
predictions = model.predict(X_test)
predictions
# Test error
# ==============================================================================
rmse = root_mean_squared_error(y_true=y_test, y_pred=predictions)
print(f"The test error (rmse) is: {rmse}")
The test error (rmse) is: 4.8841999147856745
Extrapolation with Random Forest
An important limitation of regression trees, and therefore of Random Forest, is that they do not extrapolate outside the training range. When the model is applied to a new observation whose predictor values are higher or lower than those observed during training, the prediction is always the average of the training observations in the terminal node reached, regardless of how far away the value is. See the following example, in which a linear model and a random forest model are trained and then used to predict values of X outside the training range.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
# Simulated data
# ==============================================================================
X = np.linspace(0, 150, 100)
y = (X + 100) + np.random.normal(loc=0.0, scale=5.0, size=X.shape)
X_train = X[(X>=50) & (X<100)]
y_train = y[(X>=50) & (X<100)]
X_test_low = X[X < 50]
y_test_low = y[X < 50]
X_test_high = X[X >= 100]
y_test_high = y[X >= 100]
fig, ax = plt.subplots(figsize=(5, 3))
ax.plot(X_train, y_train, c='black', linestyle='dashed', label = "Train data")
ax.axvspan(50, 100, color='gray', alpha=0.2, lw=0)
ax.plot(X_test_low, y_test_low, c='blue', linestyle='dashed', label = "Test data")
ax.plot(X_test_high, y_test_high, c='blue', linestyle='dashed')
ax.set_title("Train and test data")
plt.legend();
# Linear model
linear_model = LinearRegression()
linear_model.fit(X_train.reshape(-1, 1), y_train)
# Random forest model
rf_model = RandomForestRegressor()
rf_model.fit(X_train.reshape(-1, 1), y_train)
# Predictions
linear_prediction = linear_model.predict(X.reshape(-1, 1))
rf_prediction = rf_model.predict(X.reshape(-1, 1))
fig, ax = plt.subplots(figsize=(5, 3))
ax.plot(X_train, y_train, c='black', linestyle='dashed', label = "Train data")
ax.axvspan(50, 100, color='gray', alpha=0.2, lw=0)
ax.plot(X_test_low, y_test_low, c='blue', linestyle='dashed', label = "Test data")
ax.plot(X_test_high, y_test_high, c='blue', linestyle='dashed')
ax.plot(X, linear_prediction, c='firebrick',
label = "Linear model prediction")
ax.plot(X, rf_prediction, c='orange',
label = "Random forest prediction")
ax.set_title("Extrapolation of a linear model and a random forest model")
plt.legend();
The plot shows that, with the Random Forest model, the predicted value outside the training range is constant no matter how far X is from the training data. One strategy that allows tree-based models to extrapolate is to fit a linear model in each terminal node, although this is not available in scikit-learn; a sketch of the idea is shown below.
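The following is a minimal sketch of that idea for a single tree, reusing the simulated `X`, `X_train`, and `y_train` arrays from the example above. It is an illustration, not an established scikit-learn API:

# Linear models in the terminal nodes of a regression tree (sketch)
# ==============================================================================
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

tree = DecisionTreeRegressor(max_depth=3, random_state=123)
tree.fit(X_train.reshape(-1, 1), y_train)

# Fit one linear model per terminal node (leaf), using the training
# observations that the leaf contains
leaf_ids = tree.apply(X_train.reshape(-1, 1))
leaf_models = {
    leaf: LinearRegression().fit(
        X_train[leaf_ids == leaf].reshape(-1, 1),
        y_train[leaf_ids == leaf]
    )
    for leaf in np.unique(leaf_ids)
}

# Each new value is routed to its leaf and predicted with that leaf's linear
# model; values outside the training range extrapolate along the line of the
# boundary leaf instead of returning a constant mean.
new_leaves = tree.apply(X.reshape(-1, 1))
tree_linear_pred = np.array([
    leaf_models[leaf].predict([[x]])[0] for x, leaf in zip(X, new_leaves)
])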
One-hot encoding in Random Forest
Tree-based models, including Random Forest, are capable of using categorical predictors in their natural form without needing to convert them to dummy variables via one-hot encoding. However, in practice, this depends on the implementation in the library or software used. This directly impacts the structure of the generated trees and, consequently, the predictive results of the model and the calculated importance for the predictors.
In the book Feature Engineering and Selection: A Practical Approach for Predictive Models by Max Kuhn and Kjell Johnson, the authors conduct an experiment comparing models trained with and without one-hot encoding on 5 different datasets. The results show:
There is no clear difference in the predictive ability of the models. Only in the case of stochastic gradient boosting are there slight differences, to the detriment of one-hot encoding.
Model training takes on average 2.5 times longer when one-hot encoding is applied. This is due to the increase in dimensionality caused by the new dummy variables, which forces the algorithm to evaluate many more split points.
Based on these results, the authors recommend not applying one-hot encoding by default, and only checking whether creating dummy variables improves the results once a good model has been obtained.
Another detailed comparison can be found in Are categorical variables getting lost in your random forests? by Nick Dingwall and Chris Potts, where the results show:
When a categorical variable is converted into multiple dummy variables, its importance is diluted, making it harder for the model to learn from it and thus losing predictive power. This effect is greater the more levels the original variable has.
By diluting the importance of categorical predictors, they are less likely to be selected by the model, which distorts the metrics that measure predictor importance.
Currently, in scikit-learn it is necessary to perform one-hot encoding to convert categorical variables into dummy variables. The implementations of XGBoost and LightGBM do allow direct use of categorical variables.
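As a minimal sketch of this capability with LightGBM, it is enough for the categorical columns to have the pandas `category` dtype; the data here are hypothetical toy data created only for the example:

# Native handling of categorical predictors in LightGBM (sketch)
# ==============================================================================
import numpy as np
import pandas as pd
from lightgbm import LGBMClassifier

# Hypothetical toy data: one numeric and one categorical predictor
rng = np.random.default_rng(123)
df = pd.DataFrame({
    'price'    : rng.normal(100, 20, size=500),
    'shelveloc': pd.Categorical(rng.choice(['Bad', 'Medium', 'Good'], size=500)),
})
y = (df['shelveloc'] == 'Good').astype(int)

model_lgbm = LGBMClassifier(
    boosting_type    = 'rf',
    n_estimators     = 50,
    subsample        = 0.8,
    subsample_freq   = 1,
    colsample_bytree = 0.8,
    verbose          = -1
)
# Columns of dtype 'category' are used natively; no one-hot encoding is needed
model_lgbm.fit(df, y)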
Extremely randomized trees
The extremely randomized trees algorithm takes the decorrelation of trees a step further than random forest. As in random forest, at each node split only a random subset of the available predictors is evaluated; in addition, the split point for each candidate predictor is also chosen at random instead of searching for the optimal cut point. In certain scenarios, this method can further reduce variance and improve the model's predictive ability. The algorithm is available in scikit-learn through the `ExtraTreesClassifier` and `ExtraTreesRegressor` classes, as shown in the sketch below.
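A minimal sketch, reusing the Boston train split (`X_train`, `y_train`) from the previous sections:

# Extremely randomized trees with scikit-learn (sketch)
# ==============================================================================
from sklearn.ensemble import ExtraTreesRegressor

extra_trees = ExtraTreesRegressor(
    n_estimators = 100,
    max_features = 0.33,  # random subset of predictors evaluated at each split
    bootstrap    = True,  # by default ExtraTrees trains each tree on the full dataset
    oob_score    = True,  # the out-of-bag error requires bootstrap=True
    random_state = 123
)
extra_trees.fit(X_train, y_train)
print(f"Out-of-bag R^2: {extra_trees.oob_score_:.3f}")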
Random Forest vs Gradient Boosting comparison
Random Forest is often compared to another tree-based algorithm, Gradient Boosting Trees. Both methods are very powerful and the superiority of one or the other depends on the problem to which they are applied. Some aspects to consider are:
Thanks to the out-of-bag error, the Random Forest method does not need to use cross-validation for hyperparameter optimization. This can be a very important advantage when computational requirements are limiting. This feature is also present in Stochastic Gradient Boosting but not in AdaBoost and Gradient Boosting.
Random forest has fewer hyperparameters, making its correct implementation simpler.
If there is a high proportion of irrelevant predictors, Random Forest can be negatively affected, especially as the number of predictors evaluated at each split (`max_features`) decreases. Suppose that, out of 100 predictors, 90 provide no information (they are noise). If `max_features` is set to 3 (three random predictors are evaluated at each split), it is very likely that all 3 selected predictors are among those that provide nothing. Even so, the algorithm will select the best of them and incorporate it into the model. The higher the percentage of irrelevant predictors, the more frequently this occurs, so the trees that make up the Random Forest end up containing irrelevant predictors and the model's predictive ability is reduced. In Gradient Boosting Trees, all predictors are always evaluated at each split, so irrelevant ones tend to be ignored.
In Random Forest, each model in the ensemble is independent of the rest, allowing training to be parallelized.
Adding more trees to a Random Forest does not cause overfitting: beyond a certain number of trees, the error stabilizes instead of increasing.
If hyperparameters are well optimized, Gradient Boosting Trees usually achieve slightly better results.
Session information
import session_info
session_info.show(html=False)
-----
joblib              1.4.2
lightgbm            4.6.0
matplotlib          3.10.1
numpy               2.2.6
optuna              3.6.2
pandas              2.3.1
session_info        v1.0.1
sklearn             1.7.1
statsmodels         0.14.4
xgboost             3.0.0
-----
IPython             9.1.0
jupyter_client      8.6.3
jupyter_core        5.7.2
jupyterlab          4.4.5
notebook            7.4.5
-----
Python 3.12.9 | packaged by Anaconda, Inc. | (main, Feb 6 2025, 18:56:27) [GCC 11.2.0]
Linux-6.14.0-27-generic-x86_64-with-glibc2.39
-----
Session information updated at 2025-08-18 13:15
Bibliography
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R. Springer.
Kuhn, M., and Johnson, K. (2013). Applied Predictive Modeling. Springer.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. Springer.
Efron, B., and Hastie, T. (2016). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge University Press.
Citation instructions
How to cite this document?
If you use this document or any part of it, we appreciate your citation. Thank you very much!
Random Forest with Python by Joaquín Amat Rodrigo, available under an Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0 DEED) license at https://www.cienciadedatos.net/documentos/py08_random_forest_python.html
This document, created by Joaquín Amat Rodrigo and Javier Escobar Ortiz, is licensed under an Attribution-NonCommercial-ShareAlike 4.0 International license.
You are allowed to:

- Share: copy and redistribute the material in any medium or format.
- Adapt: remix, transform, and build upon the material.

Under the following terms:

- Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- Non-Commercial: You may not use the material for commercial purposes.
- ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.