More about machine learning: Machine Learning with Python


Scikit-learn

Python is one of the dominant programming languages in the fields of statistics, data mining, and machine learning. Because it is open-source software, countless users have been able to implement their own algorithms, resulting in a very large number of libraries where virtually all existing machine learning techniques can be found. However, this has a downside: each library has its own syntax, structure, and implementation, which makes learning more difficult. Scikit-learn is an open-source library that unifies the main algorithms and functions under a single framework, greatly facilitating all stages of preprocessing, training, optimization, and validation of predictive models.

✏️ Note

Scikit-learn offers so many possibilities that they can hardly be shown with a single example. In this document, only some of its functionalities are used. When a functionality requires a more detailed explanation, it is placed in an appendix so as not to interrupt the narrative of the analysis. Nevertheless, to fully understand all the functionalities of Scikit-learn, it is recommended to read its documentation. One of the notable characteristics of this library is its high degree of maturity, which makes it suitable for creating predictive models intended for production. This document is also available in versions based on tidymodels and mlr3; other similar projects are caret and H2O, all based on the R programming language.

Introduction

In recent years, interest in and application of machine learning have expanded so much that they have become a discipline applied in virtually all areas of academic and industrial research. The growing number of people dedicated to this discipline has resulted in a whole repertoire of tools for accessing powerful predictive methods. The Python programming language is a clear example of this.

The term machine learning encompasses a set of algorithms that allow identifying patterns present in data and creating structures (models) that represent them. Once generated, these models can be used to predict information about facts or events that have not yet been observed. It is important to remember that machine learning systems "learn" patterns from the data they are trained on; therefore, they can only recognize scenarios similar to those they have seen before. When using systems trained with past data to predict the future, it is assumed that such behavior will continue, which is not always the case.

Although terms such as machine learning, data mining, artificial intelligence, or data science are often used as synonyms, it is important to note that machine learning methods are only part of the many strategies needed to extract, understand, and add value to data. The following document is intended to be an example of the type of problem an analyst usually faces: starting from a more or less processed dataset (data preparation is a key stage that precedes machine learning), the goal is to create a model that can successfully predict the behavior or value of new observations.

Stages of a machine learning problem

  • Define the problem: What is to be predicted? What data is available? What data needs to be obtained?

  • Explore and understand the data that will be used to create the model.

  • Success metric: define an appropriate way to quantify how good the results obtained are.

  • Prepare the strategy to evaluate the model: separate the observations into a training set, a validation set (or cross-validation), and a test set. It is very important to ensure that no information from the test set participates in the model training or optimization process.

  • Preprocess the data: apply the necessary transformations so that the data can be interpreted by the selected machine learning algorithm.

  • Fit a first model capable of exceeding minimum results. For example, in classification problems, the minimum to beat is usually the percentage of the majority class; in a regression model, the mean of the response variable (a minimal baseline sketch is shown after this list).

  • Gradually improve the model, incorporating or creating new variables and optimizing hyperparameters.

  • Evaluate the capacity of the final model with the test set, to estimate its performance when predicting new observations.

  • Train the final model with all available data.
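
As a quick illustration of the baseline idea mentioned in the list above, scikit-learn provides DummyRegressor and DummyClassifier, which simply predict the mean of the response or the majority class. The following sketch uses synthetic data generated on the spot; the arrays and values are illustrative only and are not part of the analysis developed later.

# Baseline models: minimal sketch with illustrative synthetic data
# ==============================================================================
import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.metrics import accuracy_score, root_mean_squared_error

rng = np.random.default_rng(seed=123)
X = rng.normal(size=(100, 3))

# Regression baseline: always predict the mean of the response
y_reg = rng.normal(loc=200, scale=50, size=100)
baseline_reg = DummyRegressor(strategy='mean').fit(X, y_reg)
print(f"Baseline RMSE: {root_mean_squared_error(y_reg, baseline_reg.predict(X))}")

# Classification baseline: always predict the majority class
y_clf = rng.choice(['a', 'b'], size=100, p=[0.8, 0.2])
baseline_clf = DummyClassifier(strategy='most_frequent').fit(X, y_clf)
print(f"Baseline accuracy: {accuracy_score(y_clf, baseline_clf.predict(X))}")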

✏️ Note

Unlike other documents, this one is intended to be a practical example with less theoretical development. The reader will realize how easy it is to apply a wide range of predictive methods with Python and Scikit-learn. However, it is crucial that any analyst understands the theoretical foundations on which each of them is based for a project of this type to be successful. Although they are only briefly described here, they will be accompanied by links where detailed information can be found.



Libraries

The libraries used in this document are:

# Data manipulation
# ==============================================================================
import numpy as np
import pandas as pd

# Plots
# ==============================================================================
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
import statsmodels.api as sm
plt.style.use('bmh')

# Preprocessing and modeling
# ==============================================================================
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PolynomialFeatures
from sklearn.impute import SimpleImputer
from sklearn.model_selection import (
    train_test_split,
    KFold,
    RepeatedKFold,
    cross_val_score,
    cross_validate,
    cross_val_predict,
    GridSearchCV,
    RandomizedSearchCV,
)
from sklearn.metrics import root_mean_squared_error
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor
import optuna

# Miscellaneous
# ==============================================================================
import multiprocessing
from joblib import cpu_count
from itertools import product

# Warnings configuration
# ==============================================================================
import warnings
warnings.filterwarnings('once')

Data

The SaratogaHouses dataset from the mosaicData package in R contains information about the price of 1728 houses located in Saratoga County, New York, USA in 2006. The data can be downloaded in csv format from SaratogaHouses.csv

In addition to the price, it includes 15 additional variables:

  • price: house price.
  • lotSize: square meters of the property.
  • age: age of the house.
  • landValue: land value.
  • livingArea: livable square meters.
  • pctCollege: percentage of the neighborhood with a college degree.
  • bedrooms: number of bedrooms.
  • fireplaces: number of fireplaces.
  • bathrooms: number of bathrooms (the value 0.5 refers to bathrooms without a shower).
  • rooms: number of rooms.
  • heating: type of heating.
  • fuel: type of heating fuel (gas, electricity, or diesel).
  • sewer: type of sewer.
  • waterfront: whether the house has lake views.
  • newConstruction: whether the house is newly constructed.
  • centralAir: whether the house has air conditioning.

The objective is to obtain a model capable of predicting house prices.

url = (
    "https://raw.githubusercontent.com/JoaquinAmatRodrigo/Estadistica-machine-learning-python/"
    "master/data/SaratogaHouses.csv"
)
data = pd.read_csv(url, sep=",")

# Rename columns to be more descriptive
data.columns = [
    "price", "total_sqm", "age", "land_value", "living_sqm",
    "college_pct", "bedrooms", "fireplace", "bathrooms", "rooms", "heating",
    "heating_fuel", "sewer", "lake_view", "new_construction", "air_conditioning"
]

Exploratory Analysis

Before training a predictive model—or even before performing any calculation with a new dataset—it is very important to carry out a descriptive exploration of the data. This process allows a better understanding of what information each variable contains, as well as detecting possible errors. Some common examples are:

  • A column has been stored with the wrong type: for example, a numeric variable that is being recognized as text, or vice versa.

  • A variable contains values that do not make sense: for example, to indicate that the price of a house is not available, a value of 0 or a blank space is entered.

  • A word has been entered instead of a number in a numeric variable.

Additionally, this initial analysis can provide clues about which variables are suitable as predictors in a model (more on this in the following sections).

data.head(4)
price total_sqm age land_value living_sqm college_pct bedrooms fireplace bathrooms rooms heating heating_fuel sewer lake_view new_construction air_conditioning
0 132500 0.09 42 50000 906 35 2 1 1.0 5 electric electric septic No No No
1 181115 0.92 0 22300 1953 51 3 0 2.5 6 hot water/steam gas septic No No No
2 109000 0.19 133 7300 1944 51 4 1 1.0 8 hot water/steam gas public/commercial No No No
3 155000 0.41 13 18700 1944 51 3 1 1.5 5 hot air gas septic No No No

Variable Types

# Type of each variable
# ==============================================================================
# In pandas, the "object" type refers to strings
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   price             1728 non-null   int64  
 1   total_sqm         1728 non-null   float64
 2   age               1728 non-null   int64  
 3   land_value        1728 non-null   int64  
 4   living_sqm        1728 non-null   int64  
 5   college_pct       1728 non-null   int64  
 6   bedrooms          1728 non-null   int64  
 7   fireplace         1728 non-null   int64  
 8   bathrooms         1728 non-null   float64
 9   rooms             1728 non-null   int64  
 10  heating           1728 non-null   object 
 11  heating_fuel      1728 non-null   object 
 12  sewer             1728 non-null   object 
 13  lake_view         1728 non-null   object 
 14  new_construction  1728 non-null   object 
 15  air_conditioning  1728 non-null   object 
dtypes: float64(2), int64(8), object(6)
memory usage: 216.1+ KB


All columns have the appropriate type.

Number of Observations and Missing Values

Along with the study of variable types, it is essential to know the number of available observations and whether all of them are complete.

Missing values are especially important when creating models, as most algorithms do not accept incomplete observations, or are highly influenced by them.

Although imputation of missing values is part of preprocessing—and therefore must be learned only with training data—their identification must be done before splitting the data. This ensures that all necessary imputation strategies can be established appropriately.

# Dataset dimensions (rows, columns)
# ==============================================================================
data.shape
(1728, 16)
# Number of missing values per variable
# ==============================================================================
data.isna().sum().sort_values()
price               0
total_sqm           0
age                 0
land_value          0
living_sqm          0
college_pct         0
bedrooms            0
fireplace           0
bathrooms           0
rooms               0
heating             0
heating_fuel        0
sewer               0
lake_view           0
new_construction    0
air_conditioning    0
dtype: int64

No variable contains missing values. In the missing value imputation section, several imputation strategies are shown when the dataset is incomplete.

Response Variable

When creating a model, it is very important to study the distribution of the response variable, since, after all, it is what is intended to be predicted.

The price variable has an asymmetric distribution with a positive tail, because some houses have a price much higher than the average. This type of distribution is usually better visualized after applying a logarithmic transformation or square root.

fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(6, 6))
sns.kdeplot(data.price, fill=True, color="blue",ax=axes[0])
sns.rugplot(data.price, color="blue", ax=axes[0])
axes[0].set_title("Original distribution", fontsize = 'medium')
axes[0].set_xlabel('price', fontsize='small')
axes[0].set_ylabel('')
axes[0].tick_params(labelsize = 8)

sns.kdeplot(np.sqrt(data.price), fill=True, color="blue", ax=axes[1])
sns.rugplot(np.sqrt(data.price), color="blue", ax=axes[1])
axes[1].set_title("Square root transformation", fontsize = 'medium')
axes[1].set_xlabel('sqrt(price)', fontsize='small') 
axes[1].set_ylabel('')
axes[1].tick_params(labelsize = 8)

sns.kdeplot(np.log(data.price), fill=True, color="blue", ax=axes[2])
sns.rugplot(np.log(data.price), color="blue",ax=axes[2])
axes[2].set_title("Logarithmic transformation", fontsize = 'medium')
axes[2].set_xlabel('log(price)', fontsize='small')
axes[2].set_ylabel('')
axes[2].tick_params(labelsize = 8)

fig.tight_layout()

Some machine learning and statistical learning models require the response variable to be distributed in a certain way. For example, linear regression models (LM) assume that the response, conditional on the predictors, follows a normal distribution; generalized linear models (GLM) assume a distribution belonging to the exponential family.

Numeric Variables

# Numeric variables
# ==============================================================================
data.select_dtypes(include=['float64', 'int']).describe()
price total_sqm age land_value living_sqm college_pct bedrooms fireplace bathrooms rooms
count 1728.000000 1728.000000 1728.000000 1728.000000 1728.000000 1728.000000 1728.000000 1728.000000 1728.000000 1728.000000
mean 211966.705440 0.500214 27.916088 34557.187500 1754.975694 55.567708 3.154514 0.601852 1.900174 7.041667
std 98441.391015 0.698680 29.209988 35021.168056 619.935553 10.333581 0.817351 0.556102 0.658352 2.316453
min 5000.000000 0.000000 0.000000 200.000000 616.000000 20.000000 1.000000 0.000000 0.000000 2.000000
25% 145000.000000 0.170000 13.000000 15100.000000 1300.000000 52.000000 3.000000 0.000000 1.500000 5.000000
50% 189900.000000 0.370000 19.000000 25000.000000 1634.500000 57.000000 3.000000 1.000000 2.000000 7.000000
75% 259000.000000 0.540000 34.000000 40200.000000 2137.750000 64.000000 4.000000 1.000000 2.500000 8.250000
max 775000.000000 12.200000 225.000000 412600.000000 5228.000000 82.000000 7.000000 4.000000 4.500000 12.000000
# Distribution plot for each numeric variable
# ==============================================================================
fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(9, 5))
axes = axes.flat
numeric_cols = data.select_dtypes(include=['float64', 'int']).columns
numeric_cols = numeric_cols.drop('price')

for i, colum in enumerate(numeric_cols):
    sns.histplot(
        data     = data,
        x        = colum,
        stat     = "count",
        kde      = True,
        color    = (list(plt.rcParams['axes.prop_cycle'])*2)[i]["color"],
        line_kws = {'linewidth': 2},
        alpha    = 0.3,
        ax       = axes[i]
    )
    axes[i].set_title(colum, fontsize = 10)
    axes[i].tick_params(labelsize = 6)
    axes[i].set_xlabel("")
    axes[i].set_ylabel("")
    
    
fig.tight_layout()
plt.subplots_adjust(top = 0.9)
fig.suptitle('Distribution of numeric variables', fontsize = 12);

The fireplace variable, although numeric, takes only a few values, and the vast majority of observations belong to only two of them. In cases like this, it is usually convenient to treat the variable as categorical.

# Observed values of fireplace
# ==============================================================================
data.fireplace.value_counts()
fireplace
1    942
0    740
2     42
4      2
3      2
Name: count, dtype: int64
# Convert the fireplace variable to string type
# ==============================================================================
data.fireplace = data.fireplace.astype("str")

Since the objective of the study is to predict house prices, the analysis of each variable is also done in relation to the response variable price. By analyzing the data in this way, ideas can begin to be extracted about which variables are most related to the price and in what way.

# Distribution plot for each numeric variable
# ==============================================================================
fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(9, 5))
axes = axes.flat
numeric_cols = data.select_dtypes(include=['float64', 'int']).columns
numeric_cols = numeric_cols.drop('price')

for i, colum in enumerate(numeric_cols):
    sns.regplot(
        x           = data[colum],
        y           = data['price'],
        color       = "gray",
        marker      = '.',
        scatter_kws = {"alpha":0.4},
        line_kws    = {"color":"r","alpha":0.7, "linewidth":1.5},
        ax          = axes[i]
    )
    axes[i].set_title(f"price vs {colum}", fontsize = 10)
    axes[i].yaxis.set_major_formatter(ticker.EngFormatter())
    axes[i].xaxis.set_major_formatter(ticker.EngFormatter())
    axes[i].tick_params(labelsize = 6)
    axes[i].set_xlabel("")
    axes[i].set_ylabel("")

# Remove empty axes
for i in [8]:
    fig.delaxes(axes[i])
    
fig.tight_layout()
plt.subplots_adjust(top=0.9)
fig.suptitle('Correlation with price', fontsize = 12);

Correlation of Numeric Variables

Some models (LM, GLM, etc.) are negatively affected if they incorporate highly correlated predictors. For this reason, it is convenient to study the degree of correlation between the available variables.

# Correlation between numeric columns
# ==============================================================================
def tidy_corr_matrix(corr_mat):
    '''
    Function to convert a pandas correlation matrix to tidy format
    '''
    corr_mat = corr_mat.stack().reset_index()
    corr_mat.columns = ['variable_1','variable_2','r']
    corr_mat = corr_mat.loc[corr_mat['variable_1'] != corr_mat['variable_2'], :]
    corr_mat['abs_r'] = np.abs(corr_mat['r'])
    corr_mat = corr_mat.sort_values('abs_r', ascending=False)
    
    return corr_mat

corr_matrix = data.select_dtypes(include=['float64', 'int']).corr(method='pearson')
tidy_corr_matrix(corr_matrix).head(10)
variable_1 variable_2 r abs_r
76 rooms living_sqm 0.733666 0.733666
44 living_sqm rooms 0.733666 0.733666
43 living_sqm bathrooms 0.718564 0.718564
67 bathrooms living_sqm 0.718564 0.718564
4 price living_sqm 0.712390 0.712390
36 living_sqm price 0.712390 0.712390
62 bedrooms rooms 0.671863 0.671863
78 rooms bedrooms 0.671863 0.671863
58 bedrooms living_sqm 0.656196 0.656196
42 living_sqm bedrooms 0.656196 0.656196
# Heatmap of the correlation matrix
# ==============================================================================
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(4, 4))

sns.heatmap(
    corr_matrix,
    annot     = True,
    cbar      = False,
    annot_kws = {"size": 6},
    vmin      = -1,
    vmax      = 1,
    center    = 0,
    cmap      = sns.diverging_palette(20, 220, n=200),
    square    = True,
    ax        = ax
)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation = 45,
    horizontalalignment = 'right',
)

ax.tick_params(labelsize = 8)

Categorical Variables

# Categorical variables (object type)
# ==============================================================================
data.select_dtypes(include=['object']).describe()
fireplace heating heating_fuel sewer lake_view new_construction air_conditioning
count 1728 1728 1728 1728 1728 1728 1728
unique 5 3 3 3 2 2 2
top 1 hot air gas public/commercial No No No
freq 942 1121 1197 1213 1713 1647 1093
# Plot for each categorical variable
# ==============================================================================
fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(9, 5))
axes = axes.flat
object_cols = data.select_dtypes(include=['object']).columns

for i, colum in enumerate(object_cols):
    data[colum].value_counts().plot.barh(ax = axes[i])
    axes[i].set_title(colum, fontsize = 10)
    axes[i].tick_params(labelsize = 6)
    axes[i].set_xlabel("")
    axes[i].set_ylabel("")

# Remove empty axes
for i in [7, 8]:
    fig.delaxes(axes[i])
    
fig.tight_layout()
plt.subplots_adjust(top=0.9)
fig.suptitle('Distribution of categorical variables', fontsize = 12);

If one of the levels of a categorical variable has very few observations compared to the other levels, it may happen that, during cross-validation or bootstrapping, some partitions do not contain any observations of that class (zero variance), which can lead to errors.

In these cases, it is usually convenient to:

  • Remove observations from the minority group, if it is a multiclass variable.

  • Remove the variable if it only has two levels.

  • Group minority levels into a single group.

  • Ensure that, in creating the partitions, all groups are represented in each of them.

For this case, caution must be taken with the fireplace variable. Levels 2, 3, and 4 are unified into a new level called "2_plus".

data.fireplace.value_counts().sort_index()
fireplace
0    740
1    942
2     42
3      2
4      2
Name: count, dtype: int64
dic_replace = {'2': "2_plus",
               '3': "2_plus",
               '4': "2_plus"}

data['fireplace'] = (
    data['fireplace']
    .map(dic_replace) 
    .fillna(data['fireplace'])
)

data.fireplace.value_counts().sort_index()
fireplace
0         740
1         942
2_plus     46
Name: count, dtype: int64
# Plot of relationship between price and each categorical variable
# ==============================================================================
fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(9, 5))
axes = axes.flat
object_cols = data.select_dtypes(include=['object']).columns

for i, colum in enumerate(object_cols):
    sns.violinplot(
        x     = colum,
        y     = 'price',
        data  = data,
        color = "white",
        ax    = axes[i]
    )
    axes[i].set_title(f"price vs {colum}", fontsize = 10)
    axes[i].yaxis.set_major_formatter(ticker.EngFormatter())
    axes[i].tick_params(labelsize = 6)
    axes[i].set_xlabel("")
    axes[i].set_ylabel("")

# Remove empty axes
for i in [7, 8]:
    fig.delaxes(axes[i])
    
fig.tight_layout()
plt.subplots_adjust(top=0.9)
fig.suptitle('Price distribution by group', fontsize = 12);

Train-Test Split

Evaluating the predictive capacity of a model consists of checking how close its predictions are to the true values of the response variable.

To be able to quantify it correctly, it is necessary to have a set of observations whose response variable is known, but which the model has not "seen," that is, they have not participated in its training.

For this purpose, the available data is divided into a training set and a test set. The appropriate size of the partitions depends largely on the amount of available data and the certainty needed in error estimation. An 80%-20% split usually yields good results. The split should be done randomly or in a stratified random manner.

# Data split into train and test
# ==============================================================================
X_train, X_test, y_train, y_test = train_test_split(
                                        data.drop(columns=['price']),
                                        data['price'],
                                        train_size   = 0.8,
                                        random_state = 1234,
                                        shuffle      = True
                                    )

It is important to verify that the distribution of the response variable is similar in the training set and the test set.

To help ensure this, the train_test_split() function from scikit-learn allows, in classification problems, specifying through the stratify argument the variable on which to base the split.

This type of stratified split ensures that the training set and the test set are similar in terms of the response variable. However, it does not guarantee that the same happens with the predictors. For example, in a dataset with 100 observations, a binary predictor that has 90 observations of one group and only 10 of another has a high risk that, in one of the partitions, the minority group has no representation.

If this happens in the training set, some algorithms will give an error when applied to the test set, as they will encounter a value they have never seen. This problem can be avoided by removing variables with near-zero variance (see below).
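
A stratified split is not applied in this document, but the following sketch shows one way it could be done. Since price is continuous, one possible approach (not part of the original analysis) is to stratify on a binned version of the response created with pd.qcut; in a classification problem, the class variable itself would be passed to stratify.

# Stratified train-test split (illustrative sketch, not used below)
# ==============================================================================
# Stratifying on a binned version of the response keeps its distribution
# similar in both partitions. The number of bins (4) is arbitrary.
price_bins = pd.qcut(data['price'], q=4, labels=False)

X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
                                                data.drop(columns=['price']),
                                                data['price'],
                                                train_size   = 0.8,
                                                random_state = 1234,
                                                shuffle      = True,
                                                stratify     = price_bins
                                            )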

print("Training partition")
print("------------------")
print(y_train.describe())
Training partition
------------------
count      1382.000000
mean     211436.516643
std       96846.639129
min       10300.000000
25%      145625.000000
50%      190000.000000
75%      255000.000000
max      775000.000000
Name: price, dtype: float64
print("Test partition")
print("--------------")
print(y_test.describe())
Test partition
--------------
count       346.000000
mean     214084.395954
std      104689.155889
min        5000.000000
25%      139000.000000
50%      180750.000000
75%      271750.000000
max      670000.000000
Name: price, dtype: float64

Preprocessing

Preprocessing encompasses all the transformations applied to the data so that the machine learning algorithm can interpret them as effectively as possible.

All data preprocessing must be learned with the training observations and then applied to both the training set and the test set. This is very important so as not to violate the condition that no information from the test observations participates or influences the model fitting.

This principle must also be applied if cross-validation is used (see below). In that case, preprocessing must be done within each validation iteration, so that the estimates made with each validation partition do not contain information from the other partitions.

Although it is not possible to create a single list, below are some of the preprocessing steps that are most often needed.

Missing Value Imputation

The vast majority of algorithms do not accept incomplete observations, so when the dataset contains missing values, you can:

  • Remove those observations that are incomplete.

  • Remove those variables that contain missing values.

  • Try to estimate missing values using the rest of the available information (imputation).

The first two options, although simple, imply losing information. Removing observations can only be applied when there are many and the percentage of incomplete records is very low. In the case of removing variables, the impact will depend on how much information those variables contribute to the model.

When using imputation, it is very important to consider the risk of introducing values in predictors that have a lot of influence on the model. Suppose a medical study in which, when one of the predictors is positive, the model almost always predicts that the patient is healthy. For a patient whose value of this predictor is unknown, the risk that the imputation is wrong is very high, so it is preferable to obtain a prediction based only on the available information. This is another example of the importance of the analyst knowing the problem they are facing so that they can make the best decision.

The sklearn.impute module incorporates several different imputation methods:

  • SimpleImputer: allows imputations using a constant value or a statistic (mean, median, most frequent value) from the same column where the missing value is located.

  • IterativeImputer: allows imputing the value of a column taking into account the rest of the columns. Specifically, it is an iterative process in which, in each iteration, one of the variables is used as the response variable and the rest as predictors. Once the model is obtained, it is used to predict the empty positions of that variable. This process is carried out with each variable and the cycle is repeated max_iter times to gain stability. The sklearn.impute.IterativeImputer implementation allows almost any scikit-learn regressor (KNN, RandomForest, GradientBoosting...) to be used to create the imputation models.

  • KNNImputer: it is a specific case of IterativeImputer in which k-Nearest Neighbors is used as the imputation algorithm.

Despite being a widely used method, imputing with KNN has two problems: its high computational cost makes it applicable only to small or moderately sized datasets, and, when there are categorical variables, the difficulty of measuring "distances" in that context can lead to unrealistic results. For these two reasons, it is often more advisable to use a random forest as the imputation model, IterativeImputer(estimator=RandomForestRegressor()).

With the argument add_indicator=True, a new column is automatically created in which the value 1 indicates which values have been imputed. This can be useful both to identify observations in which some imputation has been performed and to use it as another predictor in the model.
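
The dataset used in this document has no missing values, so none of these imputers are needed here. As an illustrative sketch, this is how they could be configured; note that IterativeImputer still requires the explicit experimental import, and its estimator argument sets the model used for the imputations.

# Imputation sketch (not needed for this dataset, shown only as illustration)
# ==============================================================================
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer
from sklearn.ensemble import RandomForestRegressor

# Impute with the median and add indicator columns flagging imputed positions
imputer_simple = SimpleImputer(strategy='median', add_indicator=True)

# Iterative imputation using a random forest as the internal model
imputer_iterative = IterativeImputer(
                        estimator = RandomForestRegressor(random_state=123),
                        max_iter  = 10
                    )

# Imputation based on the 5 nearest neighbors
imputer_knn = KNNImputer(n_neighbors=5)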

Near-Zero Variance Variables

Predictors that contain a single value (zero-variance) should not be included in the model as they provide no information. It is also not advisable to include predictors that have near-zero variance, that is, predictors that take only a few values, of which some appear very infrequently. The problem with the latter is that they can become zero-variance predictors when observations are divided by cross-validation or bootstrap.

The VarianceThreshold class from the sklearn.feature_selection module identifies and excludes all predictors whose variance does not exceed a certain threshold. In the case of categorical variables, it should be remembered that scikit-learn requires them to be binarized (one-hot encoding or dummy variables) in order to train models. A boolean variable follows a Bernoulli distribution, so its variance can be calculated as: $$\mathrm{Var}[X] = p(1 - p)$$

While the removal of non-informative predictors could be considered a step in the predictor selection process, since it consists of filtering by variance, it must be done before standardizing the data, because afterwards, all predictors have variance 1.
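
The following sketch shows how VarianceThreshold could be applied to this dataset after binarizing the categorical predictors with pd.get_dummies (a shortcut used here only for illustration; in the rest of the document binarization is done with OneHotEncoder inside a ColumnTransformer). The threshold of 0.01 is arbitrary: since the variance of a dummy variable is p(1 - p), it removes binary columns whose minority level appears in roughly less than 1% of the observations.

# Near-zero variance filtering (illustrative sketch)
# ==============================================================================
from sklearn.feature_selection import VarianceThreshold

X_demo = pd.get_dummies(X_train, drop_first=True, dtype=float)

selector  = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X_demo)
print(f"Columns before: {X_demo.shape[1]}, columns after: {X_reduced.shape[1]}")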

Scaling of Numeric Variables

When predictors are numeric, the scale on which they are measured, as well as the magnitude of their variance, can greatly influence the model. Many machine learning algorithms (SVM, neural networks, lasso...) are sensitive to this, so if the predictors are not equalized in some way, those measured on a larger scale or with more variance will dominate the model even if they are not the ones most related to the response variable. There are mainly 2 strategies to avoid this:

  • Centering: consists of subtracting from each value the mean of the predictor to which it belongs. If the data is stored in a dataframe, centering is achieved by subtracting from each value the mean of the column in which it is located. As a result of this transformation, all predictors come to have a mean of zero, that is, the values are centered around the origin. StandardScaler(with_std=False)
  • Normalization (standardization): consists of transforming the data so that all predictors are approximately on the same scale. There are two ways to achieve this:

    • Z-score normalization (StandardScaler): each predictor is divided by its standard deviation after being centered; as a result, each predictor ends up with a mean of zero and a standard deviation of one.

      $$z = \frac{x - \mu}{\sigma}$$

    • Min-max standardization (MinMaxScaler): transform the data so that it is within the range [0, 1].

      $$X_{norm} = \frac{X - X_{min}}{X_{max}-X_{min}}$$

Variables should never be standardized after being binarized (see below).
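
As a minimal sketch, the two scalers can be applied to the numeric columns as shown below; in the rest of the document this step is integrated into a ColumnTransformer so that the parameters are learned only from the training set.

# Scaling sketch: parameters are learned on train and applied to train and test
# ==============================================================================
from sklearn.preprocessing import MinMaxScaler, StandardScaler

numeric_cols = X_train.select_dtypes(include=['float64', 'int']).columns

# Z-score normalization: mean 0 and standard deviation 1
scaler_std  = StandardScaler().fit(X_train[numeric_cols])
X_train_std = scaler_std.transform(X_train[numeric_cols])
X_test_std  = scaler_std.transform(X_test[numeric_cols])

# Min-max scaling to the [0, 1] range
scaler_minmax = MinMaxScaler().fit(X_train[numeric_cols])
X_train_01    = scaler_minmax.transform(X_train[numeric_cols])
X_test_01     = scaler_minmax.transform(X_test[numeric_cols])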

Binarization of Categorical Variables

Binarization (one-hot-encoding) consists of creating new dummy variables with each of the levels of the categorical variables. For example, a variable called color that contains the levels red, green, and blue will become three new variables (color_red, color_green, color_blue), all with the value 0 except the one that matches the observation, which takes the value 1.

By default, the OneHotEncoder class binarizes all variables, so it must be applied only to categorical variables (see how to do it in the ColumnTransformer section). With the argument drop='first', one of the levels is removed to avoid redundancies. Returning to the previous example, it is not necessary to store all three variables, since if color_red and color_green take the value 0, the variable color_blue necessarily takes the value 1. If color_red or color_green take the value 1, then color_blue is necessarily 0. This is important in models that suffer problems if predictors are perfectly correlated (linear models without regularization, neural networks...).

In certain scenarios, a new level may appear in the test data that was not in the training data. If all possible levels are not known in advance, prediction errors can be avoided by indicating OneHotEncoder(handle_unknown='ignore').
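
A minimal sketch of OneHotEncoder applied to the categorical columns is shown below. Here drop='first' is used to illustrate the removal of redundant levels; in the models of this document the encoder is used with handle_unknown='ignore' instead, inside a ColumnTransformer (next section).

# One-hot encoding sketch on the categorical columns
# ==============================================================================
from sklearn.preprocessing import OneHotEncoder

cat_cols = X_train.select_dtypes(include=['object', 'category']).columns

# drop='first' removes one level per variable to avoid perfectly correlated dummies
encoder = OneHotEncoder(drop='first', sparse_output=False)
encoder.fit(X_train[cat_cols])
X_train_ohe = encoder.transform(X_train[cat_cols])
print(encoder.get_feature_names_out())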

The way to preprocess data within the scikit-learn ecosystem is by using ColumnTransformer and pipeline. In addition to those already mentioned, many more preprocessing transformations can be found in the sklearn.preprocessing module.

Pipeline and ColumnTransformer

The ColumnTransformer and make_column_transformer classes from the sklearn.compose module allow combining multiple preprocessing transformations, specifying which columns each one is applied to. Like any transformer, it has a training method (fit) and a transformation method (transform). This allows the learning of transformations to be done only with training observations, and they can be applied later to any dataset. The idea behind this module is as follows:

  1. Define all the transformations (scaling, selection, filtering...) to be applied and to which columns ColumnTransformer(). Column selection can be done by: name, index, boolean mask, slice, regex pattern, column type, or with the selection functions make_column_selector.

  2. Learn the parameters necessary for those transformations with the training observations .fit().

  3. Apply the learned transformations to any dataset .transform().

By default, the result returned by ColumnTransformer is a numpy array, so the column names are lost. Starting from version 1.2.1, scikit-learn includes the possibility for the result of transformers to be a Pandas DataFrame instead of a numpy array. This has the advantage that the result of any transformation directly incorporates the column names. This behavior can be indicated individually per transformer with the .set_output(transform="pandas") method or as a global configuration with set_config(transform_output="pandas").

It should be noted that, if a OneHotEncoder type transformer is used in combination with .set_output(transform="pandas"), it must be indicated that the OneHotEncoder does not return data in sparse matrix form (sparse_output = False).

# Selection of variables by type
# ==============================================================================
# Numeric columns are standardized and one-hot-encoding is applied to 
# categorical columns. To keep columns that are not transformed,
# remainder='passthrough' must be indicated.
numeric_cols = X_train.select_dtypes(include=['float64', 'int']).columns.to_list()
cat_cols = X_train.select_dtypes(include=['object', 'category']).columns.to_list()
preprocessor = ColumnTransformer(
                   [('scale', StandardScaler(), numeric_cols),
                    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False), cat_cols)],
                remainder = 'passthrough',
                verbose_feature_names_out = False
               ).set_output(transform="pandas")
preprocessor
ColumnTransformer(remainder='passthrough',
                  transformers=[('scale', StandardScaler(),
                                 ['total_sqm', 'age', 'land_value',
                                  'living_sqm', 'college_pct', 'bedrooms',
                                  'bathrooms', 'rooms']),
                                ('onehot',
                                 OneHotEncoder(handle_unknown='ignore',
                                               sparse_output=False),
                                 ['fireplace', 'heating', 'heating_fuel',
                                  'sewer', 'lake_view', 'new_construction',
                                  'air_conditioning'])],
                  verbose_feature_names_out=False)

Once the ColumnTransformer object has been defined, with the fit() method the transformations are learned with the training data and applied to both sets with transform().

preprocessor.fit(X_train)
X_train_prep = preprocessor.transform(X_train)
X_test_prep  = preprocessor.transform(X_test)

X_train_prep.head(3)
total_sqm age land_value living_sqm college_pct bedrooms bathrooms rooms fireplace_0 fireplace_1 ... heating_fuel_oil sewer_none sewer_public/commercial sewer_septic lake_view_No lake_view_Yes new_construction_No new_construction_Yes air_conditioning_No air_conditioning_Yes
1571 -0.061549 0.301597 -0.195148 1.320467 0.826826 1.011039 0.905289 2.128665 0.0 1.0 ... 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0
832 -0.195573 -0.407560 -0.277982 0.194931 0.730235 1.011039 1.663079 1.264569 0.0 1.0 ... 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0
1302 0.072475 -0.478476 -0.892092 -0.064313 -1.877728 1.011039 0.147500 -0.463623 1.0 0.0 ... 1.0 0.0 0.0 1.0 1.0 0.0 1.0 0.0 0.0 1.0

3 rows × 26 columns

ColumnTransformer applies its operations in parallel, not sequentially, which means it does not allow applying more than one transformation to the same column. When multiple preprocessing transformations must be applied to the same column, they must first be grouped in a Pipeline, which also combines operations but executes them sequentially, so that the output of one step is the input of the next.

In the following example, the transformations are combined:

  • Numeric columns: missing values are imputed with the median and then standardized.

  • Categorical columns: missing values are imputed with the most frequent value and then one hot encoding is applied.

# Pipeline for preprocessing
# ==============================================================================
numeric_cols = X_train.select_dtypes(include=['float64', 'int']).columns.to_list()
cat_cols = X_train.select_dtypes(include=['object', 'category']).columns.to_list()

# Transformations for numeric variables
numeric_transformer = Pipeline(
                        steps=[
                            ('imputer', SimpleImputer(strategy='median')),
                            ('scaler', StandardScaler())
                        ]
                      )


# Transformations for categorical variables
categorical_transformer = Pipeline(
                            steps=[
                                ('imputer', SimpleImputer(strategy='most_frequent')),
                                ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
                            ]
                          )

preprocessor = ColumnTransformer(
                    transformers=[
                        ('numeric', numeric_transformer, numeric_cols),
                        ('cat', categorical_transformer, cat_cols)
                    ],
                    remainder='passthrough',
                    verbose_feature_names_out = False
               ).set_output(transform="pandas")

preprocessor
ColumnTransformer(remainder='passthrough',
                  transformers=[('numeric',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='median')),
                                                 ('scaler', StandardScaler())]),
                                 ['total_sqm', 'age', 'land_value',
                                  'living_sqm', 'college_pct', 'bedrooms',
                                  'bathrooms', 'rooms']),
                                ('cat',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='most_frequent')),
                                                 ('onehot',
                                                  OneHotEncoder(handle_unknown='ignore',
                                                                sparse_output=False))]),
                                 ['fireplace', 'heating', 'heating_fuel',
                                  'sewer', 'lake_view', 'new_construction',
                                  'air_conditioning'])],
                  verbose_feature_names_out=False)
preprocessor.fit(X_train)
X_train_prep = preprocessor.transform(X_train)
X_test_prep  = preprocessor.transform(X_test)
X_train_prep.head(3)
total_sqm age land_value living_sqm college_pct bedrooms bathrooms rooms fireplace_0 fireplace_1 ... heating_fuel_oil sewer_none sewer_public/commercial sewer_septic lake_view_No lake_view_Yes new_construction_No new_construction_Yes air_conditioning_No air_conditioning_Yes
1571 -0.061549 0.301597 -0.195148 1.320467 0.826826 1.011039 0.905289 2.128665 0.0 1.0 ... 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0
832 -0.195573 -0.407560 -0.277982 0.194931 0.730235 1.011039 1.663079 1.264569 0.0 1.0 ... 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0
1302 0.072475 -0.478476 -0.892092 -0.064313 -1.877728 1.011039 0.147500 -0.463623 1.0 0.0 ... 1.0 0.0 0.0 1.0 1.0 0.0 1.0 0.0 0.0 1.0

3 rows × 26 columns

Creating a Model

The next step, once the training data has been defined, is to select the algorithm to be used. In scikit-learn, this is done by creating an estimator object. This object encapsulates the algorithm name, as well as its parameters and hyperparameters, and provides the fit(X, y) and predict() methods, which allow it to learn from the data and make predictions on new observations. In the following listing you can consult all the algorithms implemented in scikit-learn.

Training

A first linear regression model with ridge regularization is fitted to predict the house price based on all available predictors. All arguments of sklearn.linear_model.Ridge are kept at their default values.

It is important to note that, when a linear regression model incorporates regularization on the coefficients (ridge, lasso, elasticnet), it is necessary to standardize the predictors. To ensure that preprocessing is done only with the training data, the transformations and model fitting are combined in a single pipeline.

# Preprocessing
# ==============================================================================
# Identification of numeric and categorical columns
numeric_cols = X_train.select_dtypes(include=['float64', 'int']).columns.to_list()
cat_cols = X_train.select_dtypes(include=['object', 'category']).columns.to_list()


# Transformations for numeric variables
numeric_transformer = Pipeline(
                        steps=[('scaler', StandardScaler())]
                      )

# Transformations for categorical variables
categorical_transformer = Pipeline(
                            steps=[('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))]
                          )

preprocessor = ColumnTransformer(
                    transformers=[
                        ('numeric', numeric_transformer, numeric_cols),
                        ('cat', categorical_transformer, cat_cols)
                    ],
                    remainder='passthrough',
                    verbose_feature_names_out = False
               ).set_output(transform="pandas")

# Pipeline
# ==============================================================================
# Preprocessing steps and model are combined in the same pipeline
pipe = Pipeline([('preprocessing', preprocessor),
                 ('model', Ridge(random_state=789456))])

# Training
# ==============================================================================
pipe.fit(X=X_train, y=y_train)
Pipeline(steps=[('preprocessing',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('numeric',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  ['total_sqm', 'age',
                                                   'land_value', 'living_sqm',
                                                   'college_pct', 'bedrooms',
                                                   'bathrooms', 'rooms']),
                                                 ('cat',
                                                  Pipeline(steps=[('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse_output=False))]),
                                                  ['fireplace', 'heating',
                                                   'heating_fuel', 'sewer',
                                                   'lake_view',
                                                   'new_construction',
                                                   'air_conditioning'])],
                                   verbose_feature_names_out=False)),
                ('model', Ridge(random_state=789456))])

Validation

The ultimate purpose of a model is to predict the response variable in future observations or in data the model has not "seen" before. The error shown by default after training a model usually corresponds to the training error, that is, the error the model makes when predicting observations it has already used to learn.

While this type of error is useful for analyzing how the model is learning (for example, through residual analysis), it does not represent a realistic estimate of its performance on new observations, as it tends to be overly optimistic. To obtain a more reliable estimate—and before resorting to the test set—validation strategies based on resampling can be used.

Scikit-learn incorporates various validation strategies in the sklearn.model_selection module.

  • All of them receive as the first argument an estimator, which can be directly a model or a pipeline.

  • Error metrics for regression tasks are always returned with a negative sign, so the closer to 0 the value is, the better the fit. This is done so that optimization algorithms, which maximize by default, work correctly.

The simplest way to apply cross-validation is through the cross_val_score() function, which by default uses KFold.

# Cross-validation
# ==============================================================================
cv_scores = cross_val_score(
                estimator = pipe,
                X         = X_train,
                y         = y_train,
                scoring   = 'neg_root_mean_squared_error',
                cv        = 5
            )

print(f"Cross-validation metrics: {cv_scores}")
print(f"Mean cross-validation metrics: {cv_scores.mean()}")
Cross-validation metrics: [-53936.97817183 -53076.74364513 -62746.15219054 -65963.09754244
 -48929.66260476]
Mean cross-validation metrics: -56930.52683094097

It is also possible to use other validation strategies (KFold, RepeatedKFold, LeaveOneOut, LeavePOut, ShuffleSplit) by creating the corresponding splitter from the sklearn.model_selection module and passing it to the cv argument.
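
For example, a ShuffleSplit splitter (randomized train/validation partitions that may overlap across iterations) can be created and passed directly to the cv argument; LeaveOneOut could be used in the same way, although with 1382 training observations it would require as many model fits. The values of n_splits and test_size below are illustrative.

# Validation with ShuffleSplit passed through the cv argument (illustrative)
# ==============================================================================
from sklearn.model_selection import ShuffleSplit

cv_shuffle = ShuffleSplit(n_splits=10, test_size=0.2, random_state=123)
cv_scores_shuffle = cross_val_score(
                        estimator = pipe,
                        X         = X_train,
                        y         = y_train,
                        scoring   = 'neg_root_mean_squared_error',
                        cv        = cv_shuffle
                    )
print(f"Mean cross-validation metric (ShuffleSplit): {cv_scores_shuffle.mean()}")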

Each of these methods works internally in a different way, but they all share the same fundamental idea: fit and evaluate the model repeatedly using different subsets derived from the training set, obtaining an error estimate in each iteration. The average of all these estimates tends to converge towards the actual value of the test set error.$^{\text{Appendix 1}}$

# Repeated cross-validation
# ==============================================================================
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=123)
cv_scores = cross_val_score(
                estimator = pipe,
                X         = X_train,
                y         = y_train,
                scoring   = 'neg_root_mean_squared_error',
                cv        = cv
            )

print(f"Cross-validation metrics: {cv_scores}")
print("")
print(f"Mean cross-validation metrics: {cv_scores.mean()}")
Cross-validation metrics: [-66487.72274814 -56333.87119132 -48992.85039086 -55030.00923381
 -58790.48472729 -62651.67080028 -55779.2103529  -61820.5882113
 -57243.1819794  -48910.70385765 -50033.46929956 -51586.8240037
 -62478.2206031  -65111.86204296 -54900.98944223 -64220.39882006
 -53694.93490461 -64690.99877883 -51242.27642488 -52330.80605289
 -56777.05219272 -58644.60923875 -55573.25900549 -64350.25804164
 -50578.15339623]

Mean cross-validation metrics: -57130.17622962478

The cross_validate function is similar to cross_val_score but allows estimating multiple metrics at once, both for test and train, and returns the results in a dictionary.

# Repeated cross-validation with multiple metrics
# ==============================================================================
cv = RepeatedKFold(n_splits=3, n_repeats=5, random_state=123)
cv_scores = cross_validate(
                estimator = pipe,
                X         = X_train,
                y         = y_train,
                scoring   = ('r2', 'neg_root_mean_squared_error'),
                cv        = cv,
                return_train_score = True
            )

# Convert dictionary to dataframe for easier visualization
cv_scores = pd.DataFrame(cv_scores)
cv_scores
fit_time score_time test_r2 train_r2 test_neg_root_mean_squared_error train_neg_root_mean_squared_error
0 0.017680 0.011029 0.632359 0.679758 -64605.636258 -51789.256510
1 0.019497 0.015855 0.718708 0.636812 -48322.428406 -59983.019196
2 0.018450 0.010004 0.595874 0.689939 -58434.926618 -55213.669996
3 0.011611 0.007364 0.672969 0.655316 -58179.173708 -55331.536238
4 0.011359 0.007863 0.615521 0.683390 -60148.310664 -54366.426683
5 0.013043 0.007721 0.650829 0.664771 -53812.579173 -57561.184588
6 0.011932 0.007122 0.698178 0.647441 -49813.241120 -59221.407345
7 0.013098 0.007118 0.646554 0.673016 -61749.477863 -53221.315710
8 0.012182 0.010252 0.615151 0.682999 -59192.994923 -54895.180113
9 0.012179 0.008770 0.645685 0.670425 -61061.033083 -53843.654748
10 0.012779 0.006467 0.564282 0.696123 -59084.326066 -55269.382399
11 0.011912 0.007219 0.709130 0.637747 -52788.788472 -57942.367392
12 0.012289 0.008397 0.663592 0.652572 -59372.918876 -55251.685765
13 0.011996 0.006569 0.637408 0.671598 -54925.502000 -57014.192113
14 0.010961 0.006674 0.624017 0.679811 -58863.107740 -54881.130021
# Distribution of cross-validation error
# ==============================================================================
fig, ax = plt.subplots(figsize=(6, 3))

sns.kdeplot(
    cv_scores['test_neg_root_mean_squared_error'],
    fill    = True,
    color   = "#30a2da",
    ax      = ax
)
sns.rugplot(
    cv_scores['test_neg_root_mean_squared_error'],
    color   = "black",
    ax      = ax
)
ax.set_title('Distribution of cross-validation error', fontsize = 12)
ax.tick_params(labelsize = 8)
ax.set_xlabel("")
ax.set_ylabel("");

The cross_val_predict function, instead of returning the metrics for each partition, returns the predictions generated in each of them. This is useful for evaluating model residuals and diagnosing its behavior.

It is important to note that, if repeated cross-validation or bootstrapping techniques are used, the same observation can appear multiple times in the validation sets.

# Error (residual) diagnostics in cross-validation predictions
# ========================================================================
# Cross-validation configuration
cv = KFold(n_splits=5, random_state=123, shuffle=True)

# Cross-validation predictions
y_pred_cv = cross_val_predict(estimator=pipe, X=X_train, y=y_train, cv=cv)

# Residual calculation
residuals = y_train - y_pred_cv

# Visualization
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 6))
fig.suptitle('Model residual diagnostics', fontsize=12)
plt.subplots_adjust(top=0.88)

# Standard font size
label_fontsize = 8
title_fontsize = 10
tick_fontsize = 7

# Plot 1: Actual value vs prediction
axes[0, 0].scatter(y_train, y_pred_cv, edgecolors='black', alpha=0.4)
axes[0, 0].plot(
    [y_train.min(), y_train.max()],
    [y_train.min(), y_train.max()],
    '--', color='black', lw=2
)
axes[0, 0].set_title('Predicted vs Actual value', fontsize=title_fontsize)
axes[0, 0].set_xlabel('Actual value', fontsize=label_fontsize)
axes[0, 0].set_ylabel('Predicted value', fontsize=label_fontsize)
axes[0, 0].tick_params(labelsize=tick_fontsize)

# Plot 2: Residuals vs index
axes[0, 1].scatter(np.arange(len(residuals)), residuals, edgecolors='black', alpha=0.4)
axes[0, 1].axhline(0, linestyle='--', color='black', lw=2)
axes[0, 1].set_title('Model residuals', fontsize=title_fontsize)
axes[0, 1].set_xlabel('Index', fontsize=label_fontsize)
axes[0, 1].set_ylabel('Residual', fontsize=label_fontsize)
axes[0, 1].tick_params(labelsize=tick_fontsize)

# Plot 3: Histogram and KDE of residuals
sns.histplot(
    residuals,
    stat='density',
    kde=True,
    color="#30a2da",
    alpha=0.3,
    line_kws={'linewidth': 1},
    ax=axes[1, 0]
)
axes[1, 0].set_title('Residual distribution', fontsize=title_fontsize)
axes[1, 0].set_xlabel('Residual', fontsize=label_fontsize)
axes[1, 0].set_ylabel('Density', fontsize=label_fontsize)
axes[1, 0].tick_params(labelsize=tick_fontsize)

# Plot 4: Q-Q plot of residuals
sm.qqplot(residuals, fit=True, line='q', ax=axes[1, 1], color='#30a2da', alpha=0.4, lw=2)
qq_line = axes[1, 1].lines[1]
qq_line.set_linewidth(1)
qq_line.set_color('black')
axes[1, 1].set_title('Q-Q plot of residuals', fontsize=title_fontsize)
axes[1, 1].set_xlabel('Theoretical quantiles', fontsize=label_fontsize)
axes[1, 1].set_ylabel('Sample quantiles', fontsize=label_fontsize)
axes[1, 1].tick_params(labelsize=tick_fontsize)

fig.tight_layout()

For practical purposes, when applying resampling methods to validate a model, two things must be taken into account: the computational cost of fitting the model multiple times, each time with a different subset of data, and the reproducibility of the partitions. The functions cross_val_score() and cross_val_predict() allow parallelizing the process through the n_jobs argument.

# Parallelized repeated cross-validation (multicore)
# ==============================================================================
cv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=123)
cv_scores = cross_val_score(
                estimator = pipe,
                X         = X_train,
                y         = y_train,
                scoring   = 'neg_root_mean_squared_error',
                cv        = cv,
                n_jobs    = -1 # all available cores
            )

print(f"Mean cross-validation metrics: {cv_scores.mean()}")
Mean cross-validation metrics: -56735.77500472445

The average root_mean_squared_error estimated through cross-validation for the ridge model is 56735. This value will be compared later when the model error is calculated with the test set.

Prediction

Once the model has been trained, either using an estimator directly or a pipeline, new observations can be predicted with the .predict() method. If a pipeline is used, the transformations learned during training are automatically applied.

predictions = pipe.predict(X_test)
# Create a dataframe with predictions and actual value
df_predictions = pd.DataFrame({'price' : y_test, 'prediction' : predictions})
df_predictions.head()
price prediction
903 105000 112491.665253
208 113000 185142.195564
358 110500 168900.516039
1187 159000 139333.302565
319 215000 236967.476884

Test Error

Although validation methods such as KFold or LeaveOneOut provide good estimates of a model's error when predicting new observations, the best way to evaluate a final model is to do so with a test set. This set must be completely separated from the training and optimization process.

Depending on the specific problem, different metrics may be relevant $^{\text{Annex 2}}$. The sklearn.metrics module offers a wide variety of metrics to evaluate the quality of predictions.

# root_mean_squared_error metric
# ==============================================================================
rmse = root_mean_squared_error(
        y_true = y_test,
        y_pred = predictions,
       )
rmse
65372.916389239625

In the validation section, the model's rmse was estimated at 56735 through repeated cross-validation, a value reasonably close to the 65372 obtained with the test set.
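Besides the RMSE, other metrics from the sklearn.metrics module can be computed on the same test predictions. The following is a minimal sketch reusing the y_test and predictions objects created above; the choice of additional metrics is merely illustrative.

# Additional test metrics (illustrative)
# ==============================================================================
from sklearn.metrics import mean_absolute_error, r2_score

mae = mean_absolute_error(y_true=y_test, y_pred=predictions)
r2  = r2_score(y_true=y_test, y_pred=predictions)
print(f"Test MAE: {mae}")
print(f"Test R^2: {r2}")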

Hyperparameters

Many models, including linear regression with Ridge regularization, contain parameters that cannot be learned from the training data and must therefore be set by the analyst. These are known as hyperparameters. The results of a model can depend heavily on the values its hyperparameters take; however, it is not possible to know the appropriate values in advance. Although, with practice, machine learning practitioners develop intuition about which values may work better for each problem, there are no fixed rules. The most common way to find the optimal values is to try different possibilities:


  1. Choose a set of values for the hyperparameter(s).

    • grid search: an exhaustive search is done on a set of values previously defined by the user.

    • random search: random values are evaluated within limits defined by the user.

  2. For each value (combination of values if there is more than one hyperparameter), train the model and estimate its error through a validation method $^{\text{Annex 1}}$.

  3. Finally, fit the model again, this time with all training data and with the best hyperparameters found.


Scikit-learn allows exploring different hyperparameter values through model_selection.GridSearchCV() and model_selection.RandomizedSearchCV().

The Ridge model used so far has a hyperparameter called alpha, which by default has the value 1.0. This hyperparameter controls the penalty applied to the model coefficients. The higher its value, the more restriction is imposed on the coefficients, thus reducing variance, attenuating the effect of correlation between predictors, and minimizing the risk of overfitting.

A Ridge model is fitted again with different values of alpha using repeated cross-validation to identify which one yields the best results.

# Pipe: preprocessing + model
# ==============================================================================
# Identification of numeric and categorical columns
numeric_cols = X_train.select_dtypes(include=['float64', 'int']).columns.to_list()
cat_cols = X_train.select_dtypes(include=['object', 'category']).columns.to_list()


# Transformations for numeric variables
numeric_transformer = Pipeline(
                        steps=[('scaler', StandardScaler())]
                      )

# Transformations for categorical variables
categorical_transformer = Pipeline(
                            steps=[('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))]
                          )

preprocessor = ColumnTransformer(
                    transformers=[
                        ('numeric', numeric_transformer, numeric_cols),
                        ('cat', categorical_transformer, cat_cols)
                    ],
                    remainder='passthrough',
                    verbose_feature_names_out = False
               ).set_output(transform="pandas")

# Preprocessing steps and model are combined in the same pipeline
pipe = Pipeline([('preprocessing', preprocessor),
                 ('model', Ridge())])
# Hyperparameter grid
# ==============================================================================
param_grid = {'model__alpha': np.logspace(-5, 3, 10)}

# Cross-validation search
# ==============================================================================
grid = GridSearchCV(
        estimator  = pipe,
        param_grid = param_grid,
        scoring    = 'neg_root_mean_squared_error',
        n_jobs     = cpu_count() - 1,
        cv         = RepeatedKFold(n_splits = 5, n_repeats = 5), 
        verbose    = 0,
        return_train_score = True
       )

# Result is assigned to _ so it is not printed to screen
_ = grid.fit(X = X_train, y = y_train)
# Grid results
# ==============================================================================
results = pd.DataFrame(grid.cv_results_)
results.filter(regex = '(param.*|mean_t|std_t)')\
    .drop(columns = 'params')\
    .sort_values('mean_test_score', ascending = False)
param_model__alpha mean_test_score std_test_score mean_train_score std_train_score
6 2.154435 -57015.966627 5465.631682 -55903.194589 1434.421966
5 0.278256 -57061.419747 5453.841055 -55890.429453 1434.319041
4 0.035938 -57070.046327 5452.082559 -55890.168259 1434.316012
7 16.681005 -57070.856505 5510.116607 -56163.247029 1433.747581
3 0.004642 -57071.217277 5451.851119 -55890.163778 1434.315957
2 0.000599 -57071.369487 5451.821155 -55890.163702 1434.315956
1 0.000077 -57071.389162 5451.817284 -55890.163701 1434.315956
0 0.000010 -57071.391703 5451.816784 -55890.163701 1434.315956
8 129.154967 -57925.670520 5685.548777 -57352.164605 1429.554577
9 1000.000000 -62521.145397 6344.532757 -62375.934333 1399.738326
# Best hyperparameters
# ==============================================================================
print("--------------------------------")
print("Best hyperparameters found")
print("--------------------------------")
print(f"{grid.best_params_} : {grid.best_score_} ({grid.scoring})")
--------------------------------
Best hyperparameters found
--------------------------------
{'model__alpha': np.float64(2.154434690031882)} : -57015.96662678427 (neg_root_mean_squared_error)
# Cross-validation results plot for each hyperparameter
# ==============================================================================
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 3.84), sharey=True)

# Plot 1
# ------------------------------------------------------------------------------
results.plot('param_model__alpha', 'mean_train_score', ax=axes[0])
results.plot('param_model__alpha', 'mean_test_score', ax=axes[0])
axes[0].fill_between(results.param_model__alpha.astype(float),
                results['mean_train_score'] + results['std_train_score'],
                results['mean_train_score'] - results['std_train_score'],
                alpha=0.2)
axes[0].fill_between(results.param_model__alpha.astype(float),
                results['mean_test_score'] + results['std_test_score'],
                results['mean_test_score'] - results['std_test_score'],
                alpha=0.2)
axes[0].legend()
axes[0].set_xscale('log')
axes[0].set_title('CV error evolution')
axes[0].set_ylabel('neg_root_mean_squared_error');

# Plot 2
# ------------------------------------------------------------------------------
n_splits = grid.n_splits_

results.plot(
    x     = 'param_model__alpha',
    y     = [f'split{i}_train_score' for i in range(n_splits)],
    alpha = 0.3,
    c     = 'blue', 
    ax    = axes[1]
)

results.plot(
    x     = 'param_model__alpha',
    y     = [f'split{i}_test_score' for i in range(n_splits)],
    alpha = 0.3,
    c     = 'red', 
    ax    = axes[1]
)

axes[1].legend(
    (axes[1].get_children()[0], axes[1].get_children()[n_splits]),
    ('training scores', 'test_scores')
)
axes[1].set_xscale('log')
axes[1].set_title('CV error evolution')
axes[1].set_ylabel('neg_root_mean_squared_error');

fig.tight_layout()
fig, ax = plt.subplots(figsize=(6, 3.84))
ax.barh(
    [str(d) for d in results['params']],
    results['mean_test_score'],
    xerr=results['std_test_score'],
    align='center',
    alpha=0
)
ax.plot(
    results['mean_test_score'],
    [str(d) for d in results['params']],
    marker="D",
    linestyle="",
    alpha=0.8,
    color="r"
)
ax.set_title('Hyperparameter Comparison')
ax.set_xlabel('Test neg_root_mean_squared_error');

If refit=True is indicated in GridSearchCV(), after identifying the best hyperparameters, the model is retrained with them and stored in .best_estimator_.

GridSearchCV() performs an exhaustive search, evaluating all parameter combinations. This strategy has the drawback that a lot of time can be spent in regions of little interest before other combinations are evaluated. An alternative is to perform a random search; this way, the search space is explored in a more distributed manner. RandomizedSearchCV() allows this type of strategy; it only requires that the search space of each hyperparameter be indicated (a list of options or a distribution) and the number of random combinations to evaluate.

fig, axs = plt.subplots(nrows = 1, ncols = 2,figsize=(8, 4),
                        sharex = True, sharey = False)

# Exhaustive grid
# ==============================================================================
hyperparameter_1 = np.linspace(start = 0, stop = 100, num=20)
hyperparameter_2 = np.linspace(start = 0, stop = 10, num=20)

# List with all combinations
combinations = [list(x) for x in product(hyperparameter_1, hyperparameter_2)]
combinations = pd.DataFrame.from_records(
                    combinations,
                    columns=['hyperparameter_1', 'hyperparameter_2']
                )

combinations.plot(
    x    = 'hyperparameter_1',
    y    = 'hyperparameter_2',
    kind = 'scatter',
    ax   = axs[0]
)
axs[0].set_title('Uniform distribution')

# Random grid
# ==============================================================================
hyperparameter_1 = np.random.uniform(low = 0, high = 100, size  = 400)
hyperparameter_2 = np.random.uniform(low = 0, high = 10, size  = 400)

combinations = pd.DataFrame(
                    {
                    'hyperparameter_1': hyperparameter_1,
                    'hyperparameter_2': hyperparameter_2,
                    }
                )
combinations.plot(
    x    = 'hyperparameter_1',
    y    = 'hyperparameter_2',
    kind = 'scatter',
    ax   = axs[1]
)
axs[1].set_title('Random distribution');
# Search space for each hyperparameter
# ==============================================================================
param_distributions = {'model__alpha': np.logspace(-5, 3, 100)}

# Cross-validation search
# ==============================================================================
grid = RandomizedSearchCV(
        estimator  = pipe,
        param_distributions = param_distributions,
        n_iter     = 50,
        scoring    = 'neg_root_mean_squared_error',
        n_jobs     = cpu_count() - 1,
        cv         = RepeatedKFold(n_splits = 5, n_repeats = 5), 
        verbose    = 0,
        random_state = 123,
        return_train_score = True
       )

grid.fit(X = X_train, y = y_train)

# Results
results = pd.DataFrame(grid.cv_results_)
results.filter(regex = '(param.*|mean_t|std_t)')\
    .drop(columns = 'params')\
    .sort_values('mean_test_score', ascending = False)\
    .head(1)
param_model__alpha mean_test_score std_test_score mean_train_score std_train_score
33 3.764936 -57016.19437 4221.321925 -55948.199147 1070.076033

Bayesian Optimization

Although these strategies are perfectly valid and generate good results, especially when there are criteria to narrow down the search range, they share a common limitation: none of them takes into account the results obtained so far, which prevents the search from focusing on the regions of greatest interest and avoiding unnecessary ones.

An alternative is hyperparameter search with Bayesian optimization methods. In general terms, Bayesian hyperparameter optimization consists of building a probabilistic model of the objective function, which in this case is the model's validation metric (rmse, auc, precision, ...). With this strategy, the search is redirected in each iteration towards the regions of greatest interest. The ultimate goal is to reduce the number of hyperparameter combinations with which the model is evaluated, choosing only the best candidates. The advantage over the other strategies is therefore greatest when the search space is very large or the model evaluation is very slow.

One of the most widely used libraries for Bayesian optimization is Optuna.

# Hyperparameter search with Optuna
# ==============================================================================
# Adjust Optuna logging level to not show too much information
optuna.logging.set_verbosity(optuna.logging.WARNING)

def objective(trial):
    """
    Objective function for hyperparameter search.
    Adjusts the 'alpha' parameter of the model and evaluates performance using cross-validation.
    """
    # Propose value for 'model__alpha' within logarithmic range
    model__alpha = trial.suggest_float("model__alpha", 1e-6, 1e+3, log=True)
    
    # Set the 'alpha' value in the pipeline
    pipe.set_params(model__alpha=model__alpha)
    
    # Evaluate the model using cross-validation with negative RMSE score
    score = -np.mean(
                cross_val_score(
                    pipe,
                    X_train,
                    y_train,
                    cv=5,
                    n_jobs=-1,
                    scoring="neg_root_mean_squared_error"
                )
            )
    
    return score

# Create and optimize the study with 50 trials
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50, show_progress_bar=True)

# Show results
print(f"Best trial: {study.best_trial}")
print(f"\nBest validation score: {study.best_value}")
print(f"Best hyperparameters: {study.best_params}")
Best trial: FrozenTrial(number=47, state=1, values=[56904.67245507283], datetime_start=datetime.datetime(2025, 11, 30, 20, 2, 55, 753761), datetime_complete=datetime.datetime(2025, 11, 30, 20, 2, 55, 810761), params={'model__alpha': 4.577720439947758}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={'model__alpha': FloatDistribution(high=1000.0, log=True, low=1e-06, step=None)}, trial_id=47, value=None)

Best validation score: 56904.67245507283
Best hyperparameters: {'model__alpha': 4.577720439947758}

Preprocessing Tuning

In most cases, the optimization process focuses on the model's hyperparameters. However, it can also be very interesting to compare different preprocessing transformations. Thanks to pipelines, this can be done in the same way as with hyperparameters. See the following example, where adding interaction terms between predictors is compared against not adding them.

# Pipe: preprocessing + model
# ==============================================================================
# Transformations for numeric variables
numeric_transformer = Pipeline(
                        steps=[('scaler', StandardScaler())]
                      )

# Transformations for categorical variables
categorical_transformer = Pipeline(
                            steps=[('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))]
                          )

preprocessor = ColumnTransformer(
                    transformers=[
                        ('numeric', numeric_transformer, numeric_cols),
                        ('cat', categorical_transformer, cat_cols)
                    ],
                    remainder='passthrough',
                    verbose_feature_names_out = False
               ).set_output(transform="pandas")

# Preprocessing steps and model are combined in the same pipeline.
pipe = Pipeline(
        [('preprocessing', preprocessor),
        ('interactions', PolynomialFeatures(degree=2)),
        ('model', Ridge())])

# Hyperparameter grid
# ==============================================================================
param_grid = {'interactions': [PolynomialFeatures(degree=2), 'passthrough'],
              'model__alpha': np.logspace(-5, 3, 10)}

# Cross-validation search
# ==============================================================================
grid = GridSearchCV(
        estimator  = pipe,
        param_grid = param_grid,
        scoring    = 'neg_root_mean_squared_error',
        n_jobs     = cpu_count() - 1,
        cv         = RepeatedKFold(n_splits = 5, n_repeats = 5), 
        verbose    = 0,
        return_train_score = True
       )

grid.fit(X = X_train, y = y_train)

# Grid results
# ==============================================================================
results = pd.DataFrame(grid.cv_results_)
results.filter(regex = '(param.*|mean_t|std_t)')\
    .drop(columns = 'params')\
    .sort_values('mean_test_score', ascending = False)\
    .head(1)
param_interactions param_model__alpha mean_test_score std_test_score mean_train_score std_train_score
16 passthrough 2.154435 -57035.896444 4011.21685 -55923.521551 1058.94183

Algorithms

In the following sections, different machine learning models are trained with the objective of comparing them and identifying the one that achieves the best result in predicting house prices.

K-Nearest Neighbor (kNN)

K-Nearest Neighbor is one of the simplest machine learning algorithms. Its operation is as follows: to predict an observation, the K observations from the training set that most resemble it (based on their predictors) are identified and the average of the response variable in those observations is used as the predicted value. Given its simplicity, it usually gives worse results than other algorithms, but it is a good reference as a baseline.
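To make the mechanics concrete, the following sketch computes a kNN prediction by hand with NumPy on toy data; the arrays and the value of K are invented purely for illustration.

# Manual illustration of the kNN prediction mechanic (toy data)
# ==============================================================================
import numpy as np

X_toy = np.array([[1.0], [2.0], [3.0], [10.0], [11.0]])  # predictor
y_toy = np.array([100., 110., 120., 300., 310.])          # response
x_new = np.array([2.5])                                   # new observation
k = 3

# Euclidean distance from the new observation to each training observation
distances = np.sqrt(((X_toy - x_new) ** 2).sum(axis=1))
# Indices of the k nearest neighbors
nearest = np.argsort(distances)[:k]
# The prediction is the mean of the response in those neighbors
prediction = y_toy[nearest].mean()
print(prediction)  # mean of 100, 110 and 120 -> 110.0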

# Pipeline: preprocessing + model
# ==============================================================================
# Identification of numeric and categorical columns
numeric_cols = X_train.select_dtypes(include=['float64', 'int']).columns.to_list()
cat_cols = X_train.select_dtypes(include=['object', 'category']).columns.to_list()


# Transformations for numeric variables
numeric_transformer = Pipeline(
                        steps=[('scaler', StandardScaler())]
                      )

# Transformations for categorical variables
categorical_transformer = Pipeline(
                            steps=[('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))]
                          )

preprocessor = ColumnTransformer(
                    transformers=[
                        ('numeric', numeric_transformer, numeric_cols),
                        ('cat', categorical_transformer, cat_cols)
                    ],
                    remainder='passthrough',
                    verbose_feature_names_out = False
               ).set_output(transform="pandas")

# Preprocessing steps and model are combined in the same pipeline.
pipe = Pipeline([('preprocessing', preprocessor),
                 ('model', KNeighborsRegressor())])

# Hyperparameter optimization
# ==============================================================================
# Search space for each hyperparameter
param_distributions = {'model__n_neighbors': np.linspace(1, 100, 500, dtype=int)}

# Random grid search
grid = RandomizedSearchCV(
        estimator  = pipe,
        param_distributions = param_distributions,
        n_iter     = 20,
        scoring    = 'neg_root_mean_squared_error',
        n_jobs     = cpu_count() - 1,
        cv         = RepeatedKFold(n_splits = 5, n_repeats = 3), 
        refit      = True, 
        verbose    = 0,
        random_state = 123,
        return_train_score = True
       )

grid.fit(X = X_train, y = y_train)

# Grid results
# ==============================================================================
results = pd.DataFrame(grid.cv_results_)
results.filter(regex = '(param.*|mean_t|std_t)')\
    .drop(columns = 'params')\
    .sort_values('mean_test_score', ascending = False)\
    .head(1)
param_model__n_neighbors mean_test_score std_test_score mean_train_score std_train_score
12 7 -61169.195218 5654.381605 -52870.939049 1112.83243
# Cross-validation results plot for each hyperparameter
# ------------------------------------------------------------------------------
fig, ax = plt.subplots(figsize=(6, 3.84))
hyperparameter = 'param_model__n_neighbors'
results = results.sort_values(hyperparameter, ascending = False)
metric    = grid.scoring

results.plot(hyperparameter, 'mean_train_score', ax=ax)
results.plot(hyperparameter, 'mean_test_score', ax=ax)
ax.fill_between(results[hyperparameter].astype(int),
                results['mean_train_score'] + results['std_train_score'],
                results['mean_train_score'] - results['std_train_score'],
                alpha=0.2)
ax.fill_between(results[hyperparameter].astype(int),
                results['mean_test_score'] + results['std_test_score'],
                results['mean_test_score'] - results['std_test_score'],
                alpha=0.2)
ax.legend()
ax.set_title('CV error evolution')
ax.set_ylabel(metric);
fig, ax = plt.subplots(figsize=(8, 5))
results = results.sort_values('mean_test_score', ascending = True)
ax.barh(
    [str(d) for d in results['params']],
    results['mean_test_score'],
    xerr=results['std_test_score'],
    align='center',
    alpha=0
)
ax.plot(
    results['mean_test_score'],
    [str(d) for d in results['params']],
    marker="D",
    linestyle="",
    alpha=0.8,
    color="r"
)
ax.set_title('Hyperparameter Comparison')
ax.set_xlabel(metric);

Once the best hyperparameters have been identified, the model is retrained by indicating the optimal values in its arguments. If refit=True is indicated in GridSearchCV(), this retraining is done automatically and the resulting model is stored in .best_estimator_.

# Test error of the final model
# ==============================================================================
final_model = grid.best_estimator_
predictions = final_model.predict(X = X_test)
rmse_knn = root_mean_squared_error(
            y_true  = y_test,
            y_pred  = predictions
           )
print(f"The test error (rmse) is: {rmse_knn}")
The test error (rmse) is: 67303.99640453797

Linear Regression (Ridge and Lasso)

Linear regression is a statistical method that tries to model the linear relationship between a continuous variable (dependent variable, response, or target) and one or more independent variables (regressors or predictors) by fitting a linear equation. It is called simple linear regression when there is only one independent variable and multiple linear regression when there is more than one.

The linear regression model (Legendre, Gauss, Galton and Pearson) considers that, given a set of observations, the mean $\mu$ of the response variable $Y$ is linearly related to the regressor variable(s) $X$ according to the equation:

$$\mu_Y = \beta_0 + \beta_1 x_{1} + \beta_2 x_{2} + ... + \beta_p x_{p}$$

or in matrix notation (incorporating $\beta_0$ into the vector $\mathbf{\beta}$):

$$\mu_Y = \mathbf{X}^T \mathbf{\beta}$$

Additionally, the linear model can include regularization during fitting, which gives rise to the ridge regression, lasso, and elastic-net models.

Scikit-Learn incorporates 3 types of regularization for linear models with the objective of avoiding overfitting, reducing variance, and attenuating the effect of correlation between predictors. Generally, by applying regularization, models with greater predictive power (generalization) are achieved.

  • The lasso model is a least squares linear model that incorporates regularization that penalizes the sum of the absolute values of the regression coefficients $(||\beta||_1 = \sum_{k=1} ^p |\beta_k|)$. This penalty is known as l1 and has the effect of forcing predictor coefficients towards zero. Since a predictor with a zero regression coefficient does not influence the model, lasso manages to select the most influential predictors. The degree of penalty is controlled by the hyperparameter $\lambda$. When $\lambda = 0$, the result is equivalent to a linear model by ordinary least squares. As $\lambda$ increases, the greater the penalty and the more predictors are excluded.

  • The ridge model is a least squares linear model that incorporates regularization that penalizes the sum of the squared coefficients $(||\beta||^2_2 = \sum_{k=1} ^p \beta^2_k)$. This penalty is known as l2 and has the effect of proportionally reducing the value of all model coefficients but without them reaching zero. Like lasso, the degree of penalty is controlled by the hyperparameter $\lambda$.

The main practical difference between lasso and ridge is that the former makes some coefficients exactly zero, thus performing predictor selection, while the latter does not exclude any. This is a notable advantage of lasso in scenarios where not all predictors are important for the model and it is desired that the least influential ones be excluded. On the other hand, when there are highly correlated predictors (linearly), ridge reduces the influence of all of them at once and proportionally, while lasso tends to select one of them, giving it all the weight and excluding the rest. In the presence of correlations, this selection varies a lot with small perturbations (changes in training data), so lasso solutions are very unstable if predictors are highly correlated.
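This behavior can be illustrated with a small sketch on synthetic data: with a comparable penalty, Lasso sets several coefficients exactly to zero while Ridge only shrinks them. The dataset and the alpha values are arbitrary, chosen only for illustration.

# Lasso vs Ridge coefficient shrinkage (illustrative, synthetic data)
# ==============================================================================
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data in which only 3 of the 10 predictors are informative
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=10, random_state=123
)

ridge_demo = Ridge(alpha=10).fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=10).fit(X_demo, y_demo)

print("Coefficients set to zero by Ridge:", np.sum(ridge_demo.coef_ == 0))
print("Coefficients set to zero by Lasso:", np.sum(lasso_demo.coef_ == 0))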

To achieve an optimal balance between these two properties, what is known as elastic net penalty can be used, which combines both strategies.

The elastic net model includes regularization that combines the l1 and l2 penalties $\left( \lambda \left( \alpha ||\beta||_1 + \frac{1}{2}(1- \alpha)||\beta||^2_2 \right) \right)$. The degree to which each penalty contributes is controlled by the hyperparameter $\alpha$. Its value must lie in the interval [0, 1]: when $\alpha = 0$, ridge regression is applied, and when $\alpha = 1$, lasso is applied. The combination of both penalties usually yields good results. A frequently used strategy is to assign almost all the weight to the l1 penalty ($\alpha$ very close to 1) to select predictors, and a little to l2 to provide some stability in case some predictors are correlated.

Finding the best model involves identifying the optimal values of the regularization hyperparameters $\alpha$ and $\lambda$.

# Pipeline: preprocessing + model
# ==============================================================================
# Identification of numeric and categorical columns
numeric_cols = X_train.select_dtypes(include=['float64', 'int']).columns.to_list()
cat_cols = X_train.select_dtypes(include=['object', 'category']).columns.to_list()


# Transformations for numeric variables
numeric_transformer = Pipeline(
                        steps=[('scaler', StandardScaler())]
                      )

# Transformations for categorical variables
categorical_transformer = Pipeline(
                            steps=[('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))]
                          )

preprocessor = ColumnTransformer(
                    transformers=[
                        ('numeric', numeric_transformer, numeric_cols),
                        ('cat', categorical_transformer, cat_cols)
                    ],
                    remainder='passthrough',
                    verbose_feature_names_out = False
               ).set_output(transform="pandas")

# Preprocessing steps and model are combined in the same pipeline.
pipe = Pipeline([('preprocessing', preprocessor),
                 ('model', Ridge())])

# Hyperparameter optimization
# ==============================================================================
# Search space for each hyperparameter
param_distributions = {'model__alpha': np.logspace(-5, 5, 500)}

# Random grid search
grid = RandomizedSearchCV(
        estimator  = pipe,
        param_distributions = param_distributions,
        n_iter     = 20,
        scoring    = 'neg_root_mean_squared_error',
        n_jobs     = multiprocessing.cpu_count() - 1,
        cv         = RepeatedKFold(n_splits = 5, n_repeats = 3), 
        refit      = True, 
        verbose    = 0,
        random_state = 123,
        return_train_score = True
       )

grid.fit(X = X_train, y = y_train)

# Grid results
# ==============================================================================
results = pd.DataFrame(grid.cv_results_)
results.filter(regex = '(param.*|mean_t|std_t)')\
    .drop(columns = 'params')\
    .sort_values('mean_test_score', ascending = False)\
    .head(1)
param_model__alpha mean_test_score std_test_score mean_train_score std_train_score
10 3.396739 -56907.292901 4690.339075 -55945.219309 1227.446654
# Cross-validation results plot for each hyperparameter
# ==============================================================================
fig, ax = plt.subplots(figsize=(6, 3.84))
hyperparameter = 'param_model__alpha'
results = results.sort_values(hyperparameter, ascending = False)
metric    = grid.scoring

results.plot(hyperparameter, 'mean_train_score', ax=ax)
results.plot(hyperparameter, 'mean_test_score', ax=ax)
ax.fill_between(results[hyperparameter].astype(float),
                results['mean_train_score'] + results['std_train_score'],
                results['mean_train_score'] - results['std_train_score'],
                alpha=0.2)
ax.fill_between(results[hyperparameter].astype(float),
                results['mean_test_score'] + results['std_test_score'],
                results['mean_test_score'] - results['std_test_score'],
                alpha=0.2)
ax.legend()
ax.set_title('CV error evolution')
ax.set_ylabel(metric);
# Test error of the final model
# ==============================================================================
final_model = grid.best_estimator_
predictions = final_model.predict(X = X_test)
rmse_lm = root_mean_squared_error(
            y_true  = y_test,
            y_pred  = predictions,
          )
print(f"The test error (rmse) is: {rmse_lm}")
The test error (rmse) is: 65406.82894279458

Random Forest

A Random Forest model is made up of a set of individual decision trees, each trained with a slightly different sample of the training data generated through bootstrapping. The prediction of a new observation is obtained by aggregating the predictions of all the individual trees that make up the model. To learn more about this type of model, visit Random Forest with Python.

# Pipeline: preprocessing + model
# ==============================================================================
# Identification of numeric and categorical columns
numeric_cols = X_train.select_dtypes(include=['float64', 'int']).columns.to_list()
cat_cols = X_train.select_dtypes(include=['object', 'category']).columns.to_list()


# Transformations for numeric variables
numeric_transformer = Pipeline(
                        steps=[('scaler', StandardScaler())]
                      )

# Transformations for categorical variables
categorical_transformer = Pipeline(
                            steps=[('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))]
                          )

preprocessor = ColumnTransformer(
                    transformers=[
                        ('numeric', numeric_transformer, numeric_cols),
                        ('cat', categorical_transformer, cat_cols)
                    ],
                    remainder='passthrough',
                    verbose_feature_names_out = False
               ).set_output(transform="pandas")

# Preprocessing steps and model are combined in the same pipeline.
pipe = Pipeline([('preprocessing', preprocessor),
                 ('model', RandomForestRegressor())])

# Hyperparameter optimization
# ==============================================================================
# Search space for each hyperparameter

param_distributions = {
    'model__n_estimators': [50, 100, 1000, 2000],
    'model__max_features': [3, 5, 7, 1.0],
    'model__max_depth'   : [3, 5, 10, 20]
}

# Random grid search
grid = RandomizedSearchCV(
        estimator  = pipe,
        param_distributions = param_distributions,
        n_iter     = 20,
        scoring    = 'neg_root_mean_squared_error',
        n_jobs     = multiprocessing.cpu_count() - 1,
        cv         = RepeatedKFold(n_splits = 5, n_repeats = 3),
        refit      = True, 
        verbose    = 0,
        random_state = 123,
        return_train_score = True
       )

grid.fit(X = X_train, y = y_train)

# Grid results
# ==============================================================================
results = pd.DataFrame(grid.cv_results_)
results.filter(regex = '(param.*|mean_t|std_t)')\
    .drop(columns = 'params')\
    .sort_values('mean_test_score', ascending = False)\
    .head(1)
param_model__n_estimators param_model__max_features param_model__max_depth mean_test_score std_test_score mean_train_score std_train_score
13 100 7.0 20 -54659.470515 5036.819337 -20520.19286 402.006403
# Test error of the final model
# ==============================================================================
final_model = grid.best_estimator_
predictions = final_model.predict(X = X_test)
rmse_rf = root_mean_squared_error(
            y_true  = y_test,
            y_pred  = predictions,
          )
print(f"The test error (rmse) is: {rmse_rf}")
The test error (rmse) is: 61567.52103489776

Gradient Boosting Trees

A Gradient Boosting Trees model is made up of a set of individual decision trees, trained sequentially, so that each new tree tries to improve the errors of the previous trees. The prediction of a new observation is obtained by aggregating the predictions of all the individual trees that make up the model. To learn more about this type of model, visit Gradient Boosting with Python.

# Pipeline: preprocessing + model
# ==============================================================================
# Identification of numeric and categorical columns
numeric_cols = X_train.select_dtypes(include=['float64', 'int']).columns.to_list()
cat_cols = X_train.select_dtypes(include=['object', 'category']).columns.to_list()


# Transformations for numeric variables
numeric_transformer = Pipeline(
                        steps=[('scaler', StandardScaler())]
                      )

# Transformations for categorical variables
categorical_transformer = Pipeline(
                            steps=[('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))]
                          )

preprocessor = ColumnTransformer(
                    transformers=[
                        ('numeric', numeric_transformer, numeric_cols),
                        ('cat', categorical_transformer, cat_cols)
                    ],
                    remainder='passthrough',
                    verbose_feature_names_out = False
               ).set_output(transform="pandas")

# Preprocessing steps and model are combined in the same pipeline.
pipe = Pipeline([('preprocessing', preprocessor),
                 ('model', GradientBoostingRegressor())])

# Hyperparameter optimization
# ==============================================================================
# Search space for each hyperparameter

param_distributions = {
    'model__n_estimators': [50, 100, 1000, 2000],
    'model__max_features': [3, 5, 7, 1.0],
    'model__max_depth'   : [3, 5, 10, 20],
    'model__subsample'   : [0.5,0.7, 1]
}

# Random grid search
grid = RandomizedSearchCV(
        estimator  = pipe,
        param_distributions = param_distributions,
        n_iter     = 20,
        scoring    = 'neg_root_mean_squared_error',
        n_jobs     = multiprocessing.cpu_count() - 1,
        cv         = RepeatedKFold(n_splits = 5, n_repeats = 3),
        refit      = True, 
        verbose    = 0,
        random_state = 123,
        return_train_score = True
       )

grid.fit(X = X_train, y = y_train)

# Grid results
# ==============================================================================
results = pd.DataFrame(grid.cv_results_)
results.filter(regex = '(param.*|mean_t|std_t)')\
    .drop(columns = 'params')\
    .sort_values('mean_test_score', ascending = False)\
    .head(1)
param_model__subsample param_model__n_estimators param_model__max_features param_model__max_depth mean_test_score std_test_score mean_train_score std_train_score
17 1.0 100 7.0 5 -54175.092368 4480.817421 -24542.710342 521.692074
# Test error of the final model
# ==============================================================================
final_model = grid.best_estimator_
predictions = final_model.predict(X = X_test)
rmse_gbm = root_mean_squared_error(
            y_true  = y_test,
            y_pred  = predictions,
          )
print(f"The test error (rmse) is: {rmse_gbm}")
The test error (rmse) is: 62311.56422295233

Stacking

In general, the term model ensembling refers to combining the predictions of two or more different models with the goal of improving the final predictions. This strategy is based on the assumption that different models trained independently use different aspects of the data to make predictions—that is, each one is able to identify part of the "truth" but not all of it. By combining the perspective of each, a more detailed description of the true underlying structure in the data is obtained. As an analogy, imagine a group of students taking a multidisciplinary exam. Although they all get approximately the same grade, each one will have scored more points on questions about the disciplines in which they excel. If instead of taking the exam individually they take it as a group, each one can contribute in the areas they master most, and the final result will probably be superior to any of the individual results.

The key for ensembling to improve results is model diversity. If all the combined models are similar to each other, they won't be able to compensate for one another. For this reason, one should try to combine models that perform as well as possible individually and are as different from each other as possible.

The simplest ways to combine predictions from multiple models are using the mean for regression problems and the mode for classification problems. However, there are more complex approaches capable of achieving better results:

  • Weight the aggregations by giving different weights to each model, for example, in proportion to the accuracy they have achieved individually.
  • Super Learner (stacked regression): use a new model to decide the best way to combine the predictions of the other models.

Super Learner Algorithm

The Super Learner implementation available in scikit-learn through the StackingRegressor and StackingClassifier classes follows this algorithm:

Ensemble Definition

  1. Define a list of base algorithms (each with the appropriate hyperparameters).

  2. Select the metalearning algorithm that defines how the higher-level model is trained. By default, RidgeCV is used for regression and LogisticRegression for classification.

Ensemble Training

  1. Train each of the base algorithms with the training set.

  2. Perform k-fold cross-validation with each of the base algorithms and store the predictions made in each of the k partitions.

  3. Combine the predictions from step 2 into a single NxL matrix (N = number of observations in the training set, L = number of base models).

  4. Train the metalearner with the response variable and the NxL matrix as predictors.

  5. The final Super Learner is composed of the base models and the metalearning model.

Prediction

  1. Predict the new observation with each of the base models.

  2. Use the predictions from the base models as input to the metalearner to obtain the final prediction.

We proceed to create a stacking with Ridge and RandomForest models, using the best hyperparameters found in the previous sections for each.

# Pipeline: preprocessing + models for the stacking
# ==============================================================================
# Identify numeric and categorical columns
numeric_cols = X_train.select_dtypes(include=['float64', 'int']).columns.to_list()
cat_cols = X_train.select_dtypes(include=['object', 'category']).columns.to_list()

# Transformations for numeric variables
numeric_transformer = Pipeline(
                        steps=[('scaler', StandardScaler())]
                      )

# Transformations for categorical variables
categorical_transformer = Pipeline(
                            steps=[('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))]
                          )

preprocessor = ColumnTransformer(
                    transformers=[
                        ('numeric', numeric_transformer, numeric_cols),
                        ('cat', categorical_transformer, cat_cols)
                    ],
                    remainder='passthrough',
                    verbose_feature_names_out = False
               ).set_output(transform="pandas")

# Combine preprocessing steps and models by creating several pipelines.
pipe_ridge = Pipeline([('preprocessing', preprocessor),
                     ('ridge', Ridge(alpha=3.4))])

pipe_rf = Pipeline([('preprocessing', preprocessor),
                     ('random_forest', RandomForestRegressor(
                                         n_estimators = 1000,
                                         max_features = 7,
                                         max_depth    = 20
                                        )
                     )])
# Definition and training of the StackingRegressor
# ==============================================================================
estimators = [('ridge', pipe_ridge),
              ('random_forest', pipe_rf)]

stacking_regressor = StackingRegressor(estimators=estimators,
                                       final_estimator=RidgeCV())
# Assign the result to _ so it doesn't print to screen
_ = stacking_regressor.fit(X = X_train, y = y_train)
# Test error of the stacking
# ==============================================================================
final_model = stacking_regressor
predictions = final_model.predict(X = X_test)
rmse_stacking = root_mean_squared_error(
                    y_true  = y_test,
                    y_pred  = predictions
                  )
print(f"The test error (rmse) is: {rmse_stacking}")
The test error (rmse) is: 60752.873757270354

Comparison

The test error of all trained models is compared.

error_models = pd.DataFrame({
                        'model': ['knn', 'lm', 'random forest', 'gradient boosting',
                                   'stacking'],
                        'rmse': [rmse_knn, rmse_lm, rmse_rf, rmse_gbm, rmse_stacking]
                     })
error_models = error_models.sort_values('rmse', ascending=False)

fig, ax = plt.subplots(figsize=(6, 3.84))
ax.hlines(error_models.model, xmin=0, xmax=error_models.rmse)
ax.plot(error_models.rmse, error_models.model, "o", color='black')
ax.tick_params(axis='y', which='major', labelsize=12)
ax.set_title('Model Test Error Comparison')
ax.set_xlabel('Test rmse');

Annexes

Annex 1: Validation Methods

Validation methods, also known as resampling, are strategies that allow estimating the predictive capacity of models when applied to new observations, using only the training data. The idea behind all of them is the following: the model is fitted using a subset of observations from the training set and is evaluated (calculating a metric that measures how good the model is, for example, accuracy) with the remaining observations. This process is repeated multiple times and the results are aggregated and averaged. Thanks to the repetitions, possible deviations that may arise from the random allocation of observations are compensated. The difference between methods is usually the way in which the training/validation subsets are generated.

k-Fold-Cross-Validation (CV)

The training observations are divided into k folds (sets) of the same size. The model is fitted with all observations except those in the first fold and is evaluated by predicting the observations of the excluded fold, thus obtaining the first metric. The process is repeated k times, excluding a different fold in each iteration. At the end, k values of the metric are generated, which are aggregated (usually with the mean and standard deviation) generating the final validation estimate.
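As an illustration, the partitions generated by k-Fold can be inspected directly with the KFold class; the toy array below is used only to show which indices go to each fold.

# k-Fold split illustration (toy data)
# ==============================================================================
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(20).reshape(10, 2)
cv = KFold(n_splits=5, shuffle=True, random_state=123)

for i, (train_idx, validation_idx) in enumerate(cv.split(X_demo)):
    print(f"Fold {i}: train={train_idx}, validation={validation_idx}")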

Leave-One-Out Cross-Validation (LOOCV)

LOOCV is a special case of k-Fold-Cross-Validation in which the number k of folds equals the number of observations available in the training set. The model is fitted each time with all observations except one, which is used to evaluate the model. This method involves a very high computational cost—the model is fitted as many times as there are training observations—so in practice, it usually doesn't pay off to use it.
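In scikit-learn, this strategy is available through the LeaveOneOut class, which can be passed as the cv argument of cross_val_score(). A minimal sketch, reusing the pipe, X_train, and y_train objects from the analysis (keep in mind the computational cost mentioned above):

# Leave-One-Out cross-validation sketch
# ==============================================================================
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Warning: the model is fitted as many times as there are training observations
loocv_scores = cross_val_score(
                   estimator = pipe,
                   X         = X_train,
                   y         = y_train,
                   scoring   = 'neg_root_mean_squared_error',
                   cv        = LeaveOneOut(),
                   n_jobs    = -1
               )
print(loocv_scores.mean())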

Repeated k-Fold-Cross-Validation (repeated CV)

It is exactly the same as the k-Fold-Cross-Validation method but repeating the complete process n times. For example, 10-Fold-Cross-Validation with 5 repetitions implies a total of 50 fit-validation iterations, but it is not equivalent to 50-Fold-Cross-Validation.
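The number of fit-validation iterations can be checked directly with the RepeatedKFold class used throughout this document:

# Number of iterations in repeated k-Fold
# ==============================================================================
from sklearn.model_selection import RepeatedKFold

cv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=123)
print(cv.get_n_splits())  # 10 folds x 5 repetitions = 50 iterations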

Leave-Group-Out Cross-Validation (LGOCV)

LGOCV, also known as repeated train/test splits or Monte Carlo Cross-Validation, simply consists of generating multiple random train-test divisions (only two sets per repetition). The proportion of observations going to each set is determined beforehand; 80%-20% usually gives good results. This method, although simpler to implement than CV, requires many repetitions (>50) to generate stable estimates.
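In scikit-learn, this strategy corresponds to the ShuffleSplit class. A minimal sketch with an 80%-20% split and 50 repetitions, reusing the pipe, X_train, and y_train objects from the analysis:

# Monte Carlo cross-validation (repeated train/test splits) sketch
# ==============================================================================
from sklearn.model_selection import ShuffleSplit, cross_val_score

cv = ShuffleSplit(n_splits=50, test_size=0.2, random_state=123)
mc_scores = cross_val_score(
                estimator = pipe,
                X         = X_train,
                y         = y_train,
                scoring   = 'neg_root_mean_squared_error',
                cv        = cv,
                n_jobs    = -1
            )
print(mc_scores.mean())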

Bootstrapping

A bootstrap sample is a sample obtained from the original sample by random sampling with replacement, and of the same size as the original sample. Random sampling with replacement (resampling with replacement) means that, after an observation is drawn, it is made available again for subsequent draws. As a result of this type of sampling, some observations will appear multiple times in the bootstrap sample and others not at all. Observations not selected are called out-of-bag (OOB). For each bootstrapping iteration, a new bootstrap sample is generated, the model is fitted with it, and is evaluated with the out-of-bag observations.


  1. Obtain a new sample of the same size as the original sample through random sampling with replacement.

  2. Fit the model using the new sample generated in step 1.

  3. Calculate the model error using those observations from the original sample that have not been included in the new sample. This error is known as the validation error.

  4. Repeat the process n times and calculate the mean of the n validation errors.

  5. Finally, and after the n repetitions, fit the final model using all the original training observations.
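A minimal sketch of this procedure, using NumPy sampling with replacement and evaluating on the out-of-bag observations; the pipe, X_train, and y_train objects from the analysis are reused and the number of repetitions is arbitrary.

# Bootstrapping validation sketch
# ==============================================================================
import numpy as np
from sklearn.base import clone
from sklearn.metrics import root_mean_squared_error

n_iterations = 25
rng = np.random.default_rng(123)
oob_errors = []

for i in range(n_iterations):
    # 1. Bootstrap sample: same size as the original, sampled with replacement
    boot_idx = rng.choice(len(X_train), size=len(X_train), replace=True)
    oob_idx  = np.setdiff1d(np.arange(len(X_train)), boot_idx)

    # 2. Fit the model with the bootstrap sample
    model = clone(pipe)
    model.fit(X_train.iloc[boot_idx], y_train.iloc[boot_idx])

    # 3. Validation error on the out-of-bag observations
    oob_pred = model.predict(X_train.iloc[oob_idx])
    oob_errors.append(root_mean_squared_error(y_true=y_train.iloc[oob_idx], y_pred=oob_pred))

# 4. Mean of the n validation errors
print(np.mean(oob_errors))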


The nature of the bootstrapping process generates some bias in the estimates, which can be problematic when the training set is small. There are modifications of the original algorithm to correct this problem, such as the .632 and .632+ methods.

There is no validation method that outperforms all others in every scenario; the choice should be based on several factors.

  • If the sample size is small, repeated k-Fold-Cross-Validation is recommended, as it achieves a good bias-variance balance and, since there are not many observations, the computational cost is not excessive.

  • If the main objective is to compare models rather than obtain a precise estimate of the metrics, bootstrapping is recommended since it has less variance.

  • If the sample size is very large, the difference between methods is reduced and computational efficiency becomes more important. In these cases, simple 10-Fold-Cross-Validation is sufficient.

A comparative study of different methods can be found in Comparing Different Species of Cross-Validation.

Annex 2: Metrics

There is a great variety of metrics that allow evaluating how good an algorithm is at making predictions. The suitability of each one depends entirely on the problem in question, and its correct choice will depend on how well the analyst understands the problem they face. Below, some of the most commonly used are described.

Accuracy and Kappa

These two metrics are the most commonly used in binary and multiclass classification problems. Accuracy is the percentage of observations correctly classified out of the total number of predictions. Kappa, or Cohen's Kappa, is the accuracy normalized with respect to the percentage of correct classifications expected by chance. Unlike accuracy, whose range is [0, 1], kappa takes values in [-1, 1]. In problems with imbalanced classes, where the majority group far exceeds the others, kappa is more useful because it avoids the illusion of believing that a model is good when it really only slightly exceeds what would be expected by chance.
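As an illustration, both metrics can be computed with sklearn.metrics; the label vectors below are invented and deliberately imbalanced so that accuracy looks better than kappa.

# Accuracy and Cohen's Kappa (illustrative)
# ==============================================================================
from sklearn.metrics import accuracy_score, cohen_kappa_score

y_true_demo = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred_demo = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

print(f"Accuracy: {accuracy_score(y_true_demo, y_pred_demo)}")
print(f"Kappa   : {cohen_kappa_score(y_true_demo, y_pred_demo)}")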

MSE, RMSE, MAE

These are the most commonly used metrics in regression problems.

MSE (Mean Squared Error) is the mean of the squared errors. It is usually a very good indicator of how the model performs in general, but it has the disadvantage of being in squared units. To improve interpretation, RMSE (Root Mean Squared Error) is often used, which is the square root of MSE and therefore its units are the same as the response variable.

MAE (Mean Absolute Error) is the mean of the absolute errors. The difference from MSE is that the latter squares the errors, which means it penalizes large deviations much more. In general, MSE favors models that perform approximately equally well across all observations, while MAE favors models that predict the vast majority of observations very well even if they are very wrong on a few.
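All three metrics are available in sklearn.metrics; a minimal sketch on invented values:

# MSE, RMSE and MAE (illustrative)
# ==============================================================================
from sklearn.metrics import mean_squared_error, root_mean_squared_error, mean_absolute_error

y_true_demo = [100, 150, 200, 250]
y_pred_demo = [110, 140, 220, 240]

print(f"MSE : {mean_squared_error(y_true_demo, y_pred_demo)}")
print(f"RMSE: {root_mean_squared_error(y_true_demo, y_pred_demo)}")
print(f"MAE : {mean_absolute_error(y_true_demo, y_pred_demo)}")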

Session Information

import session_info
session_info.show(html=False)
-----
joblib              1.5.2
matplotlib          3.10.7
numpy               2.2.6
optuna              4.5.0
pandas              2.3.3
seaborn             0.13.2
session_info        v1.0.1
sklearn             1.7.2
statsmodels         0.14.5
-----
IPython             9.6.0
jupyter_client      8.6.3
jupyter_core        5.9.1
jupyterlab          4.5.0
notebook            6.5.7
-----
Python 3.13.9 | packaged by Anaconda, Inc. | (main, Oct 21 2025, 19:16:10) [GCC 11.2.0]
Linux-6.14.0-36-generic-x86_64-with-glibc2.39
-----
Session information updated at 2025-11-30 20:11

Bibliography

Pedregosa et al. Scikit-learn: Machine Learning in Python. JMLR 12, pp. 2825-2830, 2011.

Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani. An Introduction to Statistical Learning. Springer.

Max Kuhn, Kjell Johnson. Applied Predictive Modeling. Springer.

T. Hastie, R. Tibshirani, J. Friedman. The Elements of Statistical Learning. Springer, 2009.

How to Cite

How to cite this document?

If you use this document or any part of it, we appreciate you citing it. Thank you very much!

Machine learning with Python and Scikit-learn by Joaquín Amat Rodrigo, available under an Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0 DEED) license at https://www.cienciadedatos.net/documentos/py06-machine-learning-python-scikitlearn-en.html



This document created by Joaquín Amat Rodrigo is licensed under Attribution-NonCommercial-ShareAlike 4.0 International.

You are allowed to:

  • Share: copy and redistribute the material in any medium or format.

  • Adapt: remix, transform, and build upon the material.

Under the following terms:

  • Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

  • NonCommercial: You may not use the material for commercial purposes.

  • ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.