More about machine learning: Machine Learning with Python


Scikit-learn

Python is one of the dominant programming languages in the fields of statistics, data mining, and machine learning. Because it is open-source software, countless users have been able to implement their own algorithms, resulting in a very large number of libraries where virtually all existing machine learning techniques can be found. However, this has a downside: each library has its own syntax, structure, and implementation, which makes learning more difficult. Scikit-learn is an open-source library that unifies the main algorithms and functions under a single framework, greatly facilitating all stages of preprocessing, training, optimization, and validation of predictive models.

✏️ Note

Scikit-learn offers so many possibilities that they can hardly be shown with a single example. In this document, only some of its functionalities are used. When a functionality requires a more detailed explanation, it is placed in an appendix so as not to interrupt the narrative of the analysis. Nevertheless, to fully understand all the functionalities of Scikit-learn, it is recommended to read its documentation. One of the notable characteristics of this library is its high degree of maturity, which makes it suitable for creating predictive models intended for production. This document is also available in versions based on tidymodels and mlr3; other similar projects are caret and H2O, all based on the R programming language.

Introduction

In recent years, interest in and application of machine learning have expanded so much that they have become a discipline applied in virtually all areas of academic and industrial research. The growing number of people dedicated to this discipline has resulted in a whole repertoire of tools for accessing powerful predictive methods. The Python programming language is a clear example of this.

The term machine learning encompasses a set of algorithms that allow identifying patterns present in data and creating structures (models) that represent them. Once generated, these models can be used to predict information about facts or events that have not yet been observed. It is important to remember that machine learning systems "learn" patterns from the data they are trained on; therefore, they can only recognize scenarios similar to those they have seen before. When using systems trained with past data to predict the future, it is assumed that such behavior will continue, which is not always the case.

Although terms such as machine learning, data mining, artificial intelligence, or data science are often used as synonyms, it is important to note that machine learning methods are only part of the many strategies needed to extract, understand, and add value to data. The following document is intended to be an example of the type of problem an analyst usually faces: starting from a more or less processed dataset (data preparation is a key stage that precedes machine learning), the goal is to create a model that can successfully predict the behavior or value of new observations.

Stages of a machine learning problem

  • Define the problem: What is to be predicted? What data is available? What data needs to be obtained?

  • Explore and understand the data that will be used to create the model.

  • Success metric: define an appropriate way to quantify how good the results obtained are.

  • Prepare the strategy to evaluate the model: separate the observations into a training set, a validation set (or cross-validation), and a test set. It is very important to ensure that no information from the test set participates in the model training or optimization process.

  • Preprocess the data: apply the necessary transformations so that the data can be interpreted by the selected machine learning algorithm.

  • Fit a first model capable of exceeding minimum results. For example, in classification problems, the minimum to beat is usually the percentage of the majority class; in a regression model, the mean of the response variable (a minimal baseline sketch is shown after this list).

  • Gradually improve the model, incorporating or creating new variables and optimizing hyperparameters.

  • Evaluate the capacity of the final model with the test set, to estimate its performance when predicting new observations.

  • Train the final model with all available data.
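
As a quick illustration of the baseline idea mentioned in the list above, scikit-learn provides DummyRegressor and DummyClassifier, which simply predict the mean of the response or the majority class. The following sketch uses synthetic data generated on the spot; the arrays and values are illustrative only and are not part of the analysis developed later.

# Baseline models: minimal sketch with illustrative synthetic data
# ==============================================================================
import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.metrics import accuracy_score, root_mean_squared_error

rng = np.random.default_rng(seed=123)
X = rng.normal(size=(100, 3))

# Regression baseline: always predict the mean of the response
y_reg = rng.normal(loc=200, scale=50, size=100)
baseline_reg = DummyRegressor(strategy='mean').fit(X, y_reg)
print(f"Baseline RMSE: {root_mean_squared_error(y_reg, baseline_reg.predict(X))}")

# Classification baseline: always predict the majority class
y_clf = rng.choice(['a', 'b'], size=100, p=[0.8, 0.2])
baseline_clf = DummyClassifier(strategy='most_frequent').fit(X, y_clf)
print(f"Baseline accuracy: {accuracy_score(y_clf, baseline_clf.predict(X))}")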

✏️ Note

Unlike other documents, this one is intended to be a practical example with less theoretical development. The reader will realize how easy it is to apply a wide range of predictive methods with Python and Scikit-learn. However, it is crucial that any analyst understands the theoretical foundations on which each of them is based for a project of this type to be successful. Although they are only briefly described here, they will be accompanied by links where detailed information can be found.



Libraries

The libraries used in this document are:

# Data manipulation
# ==============================================================================
import numpy as np
import pandas as pd

# Plots
# ==============================================================================
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
import statsmodels.api as sm
plt.style.use('bmh')

# Preprocessing and modeling
# ==============================================================================
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PolynomialFeatures
from sklearn.impute import SimpleImputer
from sklearn.model_selection import (
    train_test_split,
    KFold,
    RepeatedKFold,
    cross_val_score,
    cross_validate,
    cross_val_predict,
    GridSearchCV,
    RandomizedSearchCV,
)
from sklearn.metrics import root_mean_squared_error
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor
import optuna

# Miscellaneous
# ==============================================================================
import multiprocessing
from joblib import cpu_count
from itertools import product

# Warnings configuration
# ==============================================================================
import warnings
warnings.filterwarnings('once')

Data

The SaratogaHouses dataset from the mosaicData package in R contains information about the price of 1728 houses located in Saratoga County, New York, USA in 2006. The data can be downloaded in csv format from SaratogaHouses.csv

In addition to the price, it includes 15 additional variables:

  • price: house price.
  • lotSize: square meters of the property.
  • age: age of the house.
  • landValue: land value.
  • livingArea: livable square meters.
  • pctCollege: percentage of the neighborhood with a college degree.
  • bedrooms: number of bedrooms.
  • fireplaces: number of fireplaces.
  • bathrooms: number of bathrooms (the value 0.5 refers to bathrooms without a shower).
  • rooms: number of rooms.
  • heating: type of heating.
  • fuel: type of heating fuel (gas, electricity, or diesel).
  • sewer: type of sewer.
  • waterfront: whether the house has lake views.
  • newConstruction: whether the house is newly constructed.
  • centralAir: whether the house has air conditioning.

The objective is to obtain a model capable of predicting house prices.

url = (
    "https://raw.githubusercontent.com/JoaquinAmatRodrigo/Estadistica-machine-learning-python/"
    "master/data/SaratogaHouses.csv"
)
data = pd.read_csv(url, sep=",")

# Rename columns to be more descriptive
data.columns = [
    "price", "total_sqm", "age", "land_value", "living_sqm",
    "college_pct", "bedrooms", "fireplace", "bathrooms", "rooms", "heating",
    "heating_fuel", "sewer", "lake_view", "new_construction", "air_conditioning"
]

Exploratory Analysis

Before training a predictive model—or even before performing any calculation with a new dataset—it is very important to carry out a descriptive exploration of the data. This process allows a better understanding of what information each variable contains, as well as detecting possible errors. Some common examples are:

  • A column has been stored with the wrong type: for example, a numeric variable that is being recognized as text, or vice versa.

  • A variable contains values that do not make sense: for example, to indicate that the price of a house is not available, a value of 0 or a blank space is entered.

  • A word has been entered instead of a number in a numeric variable.

Additionally, this initial analysis can provide clues about which variables are suitable as predictors in a model (more on this in the following sections).

data.head(4)
price total_sqm age land_value living_sqm college_pct bedrooms fireplace bathrooms rooms heating heating_fuel sewer lake_view new_construction air_conditioning
0 132500 0.09 42 50000 906 35 2 1 1.0 5 electric electric septic No No No
1 181115 0.92 0 22300 1953 51 3 0 2.5 6 hot water/steam gas septic No No No
2 109000 0.19 133 7300 1944 51 4 1 1.0 8 hot water/steam gas public/commercial No No No
3 155000 0.41 13 18700 1944 51 3 1 1.5 5 hot air gas septic No No No

Variable Types

# Type of each variable
# ==============================================================================
# In pandas, the "object" type refers to strings
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   price             1728 non-null   int64  
 1   total_sqm         1728 non-null   float64
 2   age               1728 non-null   int64  
 3   land_value        1728 non-null   int64  
 4   living_sqm        1728 non-null   int64  
 5   college_pct       1728 non-null   int64  
 6   bedrooms          1728 non-null   int64  
 7   fireplace         1728 non-null   int64  
 8   bathrooms         1728 non-null   float64
 9   rooms             1728 non-null   int64  
 10  heating           1728 non-null   object 
 11  heating_fuel      1728 non-null   object 
 12  sewer             1728 non-null   object 
 13  lake_view         1728 non-null   object 
 14  new_construction  1728 non-null   object 
 15  air_conditioning  1728 non-null   object 
dtypes: float64(2), int64(8), object(6)
memory usage: 216.1+ KB


All columns have the appropriate type.

Number of Observations and Missing Values

Along with the study of variable types, it is essential to know the number of available observations and whether all of them are complete.

Missing values are especially important when creating models, as most algorithms do not accept incomplete observations, or are highly influenced by them.

Although imputation of missing values is part of preprocessing—and therefore must be learned only with training data—their identification must be done before splitting the data. This ensures that all necessary imputation strategies can be established appropriately.

# Dataset dimensions (rows, columns)
# ==============================================================================
data.shape
(1728, 16)
# Number of missing values per variable
# ==============================================================================
data.isna().sum().sort_values()
price               0
total_sqm           0
age                 0
land_value          0
living_sqm          0
college_pct         0
bedrooms            0
fireplace           0
bathrooms           0
rooms               0
heating             0
heating_fuel        0
sewer               0
lake_view           0
new_construction    0
air_conditioning    0
dtype: int64

No variable contains missing values. In the missing value imputation section, several imputation strategies are shown when the dataset is incomplete.

Response Variable

When creating a model, it is very important to study the distribution of the response variable, since, after all, it is what is intended to be predicted.

The price variable has an asymmetric distribution with a positive tail, because some houses have a price much higher than the average. This type of distribution is usually better visualized after applying a logarithmic transformation or square root.

fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(6, 6))
sns.kdeplot(data.price, fill=True, color="blue",ax=axes[0])
sns.rugplot(data.price, color="blue", ax=axes[0])
axes[0].set_title("Original distribution", fontsize = 'medium')
axes[0].set_xlabel('price', fontsize='small')
axes[0].set_ylabel('')
axes[0].tick_params(labelsize = 8)

sns.kdeplot(np.sqrt(data.price), fill=True, color="blue", ax=axes[1])
sns.rugplot(np.sqrt(data.price), color="blue", ax=axes[1])
axes[1].set_title("Square root transformation", fontsize = 'medium')
axes[1].set_xlabel('sqrt(price)', fontsize='small') 
axes[1].set_ylabel('')
axes[1].tick_params(labelsize = 8)

sns.kdeplot(np.log(data.price), fill=True, color="blue", ax=axes[2])
sns.rugplot(np.log(data.price), color="blue",ax=axes[2])
axes[2].set_title("Logarithmic transformation", fontsize = 'medium')
axes[2].set_xlabel('log(price)', fontsize='small')
axes[2].set_ylabel('')
axes[2].tick_params(labelsize = 8)

fig.tight_layout()

Some machine learning and statistical learning models require the response variable to be distributed in a certain way. For example, linear regression models (LM) assume that the response, conditional on the predictors, follows a normal distribution; generalized linear models (GLM) assume a distribution belonging to the exponential family.

Numeric Variables

# Numeric variables
# ==============================================================================
data.select_dtypes(include=['float64', 'int']).describe()
price total_sqm age land_value living_sqm college_pct bedrooms fireplace bathrooms rooms
count 1728.000000 1728.000000 1728.000000 1728.000000 1728.000000 1728.000000 1728.000000 1728.000000 1728.000000 1728.000000
mean 211966.705440 0.500214 27.916088 34557.187500 1754.975694 55.567708 3.154514 0.601852 1.900174 7.041667
std 98441.391015 0.698680 29.209988 35021.168056 619.935553 10.333581 0.817351 0.556102 0.658352 2.316453
min 5000.000000 0.000000 0.000000 200.000000 616.000000 20.000000 1.000000 0.000000 0.000000 2.000000
25% 145000.000000 0.170000 13.000000 15100.000000 1300.000000 52.000000 3.000000 0.000000 1.500000 5.000000
50% 189900.000000 0.370000 19.000000 25000.000000 1634.500000 57.000000 3.000000 1.000000 2.000000 7.000000
75% 259000.000000 0.540000 34.000000 40200.000000 2137.750000 64.000000 4.000000 1.000000 2.500000 8.250000
max 775000.000000 12.200000 225.000000 412600.000000 5228.000000 82.000000 7.000000 4.000000 4.500000 12.000000
# Distribution plot for each numeric variable
# ==============================================================================
fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(9, 5))
axes = axes.flat
numeric_cols = data.select_dtypes(include=['float64', 'int']).columns
numeric_cols = numeric_cols.drop('price')

for i, colum in enumerate(numeric_cols):
    sns.histplot(
        data     = data,
        x        = colum,
        stat     = "count",
        kde      = True,
        color    = (list(plt.rcParams['axes.prop_cycle'])*2)[i]["color"],
        line_kws = {'linewidth': 2},
        alpha    = 0.3,
        ax       = axes[i]
    )
    axes[i].set_title(colum, fontsize = 10)
    axes[i].tick_params(labelsize = 6)
    axes[i].set_xlabel("")
    axes[i].set_ylabel("")
    
    
fig.tight_layout()
plt.subplots_adjust(top = 0.9)
fig.suptitle('Distribution of numeric variables', fontsize = 12);

The fireplace variable, although numeric, takes only a few values, and the vast majority of observations belong to only two of them. In cases like this, it is usually convenient to treat the variable as categorical.

# Observed values of fireplace
# ==============================================================================
data.fireplace.value_counts()
fireplace
1    942
0    740
2     42
4      2
3      2
Name: count, dtype: int64
# Convert the fireplace variable to string type
# ==============================================================================
data.fireplace = data.fireplace.astype("str")

Since the objective of the study is to predict house prices, the analysis of each variable is also done in relation to the response variable price. By analyzing the data in this way, ideas can begin to be extracted about which variables are most related to the price and in what way.

# Distribution plot for each numeric variable
# ==============================================================================
fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(9, 5))
axes = axes.flat
numeric_cols = data.select_dtypes(include=['float64', 'int']).columns
numeric_cols = numeric_cols.drop('price')

for i, colum in enumerate(numeric_cols):
    sns.regplot(
        x           = data[colum],
        y           = data['price'],
        color       = "gray",
        marker      = '.',
        scatter_kws = {"alpha":0.4},
        line_kws    = {"color":"r","alpha":0.7, "linewidth":1.5},
        ax          = axes[i]
    )
    axes[i].set_title(f"price vs {colum}", fontsize = 10)
    axes[i].yaxis.set_major_formatter(ticker.EngFormatter())
    axes[i].xaxis.set_major_formatter(ticker.EngFormatter())
    axes[i].tick_params(labelsize = 6)
    axes[i].set_xlabel("")
    axes[i].set_ylabel("")

# Remove empty axes
for i in [8]:
    fig.delaxes(axes[i])
    
fig.tight_layout()
plt.subplots_adjust(top=0.9)
fig.suptitle('Correlation with price', fontsize = 12);

Correlation of Numeric Variables

Some models (LM, GLM, etc.) are negatively affected if they incorporate highly correlated predictors. For this reason, it is convenient to study the degree of correlation between the available variables.

# Correlation between numeric columns
# ==============================================================================
def tidy_corr_matrix(corr_mat):
    '''
    Function to convert a pandas correlation matrix to tidy format
    '''
    corr_mat = corr_mat.stack().reset_index()
    corr_mat.columns = ['variable_1','variable_2','r']
    corr_mat = corr_mat.loc[corr_mat['variable_1'] != corr_mat['variable_2'], :]
    corr_mat['abs_r'] = np.abs(corr_mat['r'])
    corr_mat = corr_mat.sort_values('abs_r', ascending=False)
    
    return corr_mat

corr_matrix = data.select_dtypes(include=['float64', 'int']).corr(method='pearson')
tidy_corr_matrix(corr_matrix).head(10)
variable_1 variable_2 r abs_r
76 rooms living_sqm 0.733666 0.733666
44 living_sqm rooms 0.733666 0.733666
43 living_sqm bathrooms 0.718564 0.718564
67 bathrooms living_sqm 0.718564 0.718564
4 price living_sqm 0.712390 0.712390
36 living_sqm price 0.712390 0.712390
62 bedrooms rooms 0.671863 0.671863
78 rooms bedrooms 0.671863 0.671863
58 bedrooms living_sqm 0.656196 0.656196
42 living_sqm bedrooms 0.656196 0.656196
# Heatmap of the correlation matrix
# ==============================================================================
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(4, 4))

sns.heatmap(
    corr_matrix,
    annot     = True,
    cbar      = False,
    annot_kws = {"size": 6},
    vmin      = -1,
    vmax      = 1,
    center    = 0,
    cmap      = sns.diverging_palette(20, 220, n=200),
    square    = True,
    ax        = ax
)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation = 45,
    horizontalalignment = 'right',
)

ax.tick_params(labelsize = 8)

Categorical Variables

# Categorical variables (object type)
# ==============================================================================
data.select_dtypes(include=['object']).describe()
fireplace heating heating_fuel sewer lake_view new_construction air_conditioning
count 1728 1728 1728 1728 1728 1728 1728
unique 5 3 3 3 2 2 2
top 1 hot air gas public/commercial No No No
freq 942 1121 1197 1213 1713 1647 1093
# Plot for each categorical variable
# ==============================================================================
fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(9, 5))
axes = axes.flat
object_cols = data.select_dtypes(include=['object']).columns

for i, colum in enumerate(object_cols):
    data[colum].value_counts().plot.barh(ax = axes[i])
    axes[i].set_title(colum, fontsize = 10)
    axes[i].tick_params(labelsize = 6)
    axes[i].set_xlabel("")
    axes[i].set_ylabel("")

# Remove empty axes
for i in [7, 8]:
    fig.delaxes(axes[i])
    
fig.tight_layout()
plt.subplots_adjust(top=0.9)
fig.suptitle('Distribution of categorical variables', fontsize = 12);

If one of the levels of a categorical variable has very few observations compared to the other levels, it may happen that, during cross-validation or bootstrapping, some partitions do not contain any observations of that class (zero variance), which can lead to errors.

In these cases, it is usually convenient to:

  • Remove observations from the minority group, if it is a multiclass variable.

  • Remove the variable if it only has two levels.

  • Group minority levels into a single group.

  • Ensure that, in creating the partitions, all groups are represented in each of them.

For this case, caution must be taken with the fireplace variable. Levels 2, 3, and 4 are unified into a new level called "2_plus".

data.fireplace.value_counts().sort_index()
fireplace
0    740
1    942
2     42
3      2
4      2
Name: count, dtype: int64
dic_replace = {'2': "2_plus",
               '3': "2_plus",
               '4': "2_plus"}

data['fireplace'] = (
    data['fireplace']
    .map(dic_replace) 
    .fillna(data['fireplace'])
)

data.fireplace.value_counts().sort_index()
fireplace
0         740
1         942
2_plus     46
Name: count, dtype: int64
# Plot of relationship between price and each categorical variable
# ==============================================================================
fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(9, 5))
axes = axes.flat
object_cols = data.select_dtypes(include=['object']).columns

for i, colum in enumerate(object_cols):
    sns.violinplot(
        x     = colum,
        y     = 'price',
        data  = data,
        color = "white",
        ax    = axes[i]
    )
    axes[i].set_title(f"price vs {colum}", fontsize = 10)
    axes[i].yaxis.set_major_formatter(ticker.EngFormatter())
    axes[i].tick_params(labelsize = 6)
    axes[i].set_xlabel("")
    axes[i].set_ylabel("")

# Remove empty axes
for i in [7, 8]:
    fig.delaxes(axes[i])
    
fig.tight_layout()
plt.subplots_adjust(top=0.9)
fig.suptitle('Price distribution by group', fontsize = 12);

Train-Test Split

Evaluating the predictive capacity of a model consists of checking how close its predictions are to the true values of the response variable.

To be able to quantify it correctly, it is necessary to have a set of observations whose response variable is known, but which the model has not "seen," that is, they have not participated in its training.

For this purpose, the available data is divided into a training set and a test set. The appropriate size of the partitions depends largely on the amount of available data and the certainty needed in error estimation. An 80%-20% split usually yields good results. The split should be done randomly or in a stratified random manner.

# Data split into train and test
# ==============================================================================
X_train, X_test, y_train, y_test = train_test_split(
                                        data.drop(columns=['price']),
                                        data['price'],
                                        train_size   = 0.8,
                                        random_state = 1234,
                                        shuffle      = True
                                    )

It is important to verify that the distribution of the response variable is similar in the training set and the test set.

To help ensure this, the train_test_split() function from scikit-learn allows, in classification problems, specifying through the stratify argument the variable on which to base the split.

This type of stratified split ensures that the training set and the test set are similar in terms of the response variable. However, it does not guarantee that the same happens with the predictors. For example, in a dataset with 100 observations, a binary predictor that has 90 observations of one group and only 10 of another has a high risk that, in one of the partitions, the minority group has no representation.

If this happens in the training set, some algorithms will give an error when applied to the test set, as they will encounter a value they have never seen. This problem can be avoided by removing variables with near-zero variance (see below).
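
A stratified split is not applied in this document, but the following sketch shows one way it could be done. Since price is continuous, one possible approach (not part of the original analysis) is to stratify on a binned version of the response created with pd.qcut; in a classification problem, the class variable itself would be passed to stratify.

# Stratified train-test split (illustrative sketch, not used below)
# ==============================================================================
# Stratifying on a binned version of the response keeps its distribution
# similar in both partitions. The number of bins (4) is arbitrary.
price_bins = pd.qcut(data['price'], q=4, labels=False)

X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
                                                data.drop(columns=['price']),
                                                data['price'],
                                                train_size   = 0.8,
                                                random_state = 1234,
                                                shuffle      = True,
                                                stratify     = price_bins
                                            )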

print("Training partition")
print("------------------")
print(y_train.describe())
Training partition
------------------
count      1382.000000
mean     211436.516643
std       96846.639129
min       10300.000000
25%      145625.000000
50%      190000.000000
75%      255000.000000
max      775000.000000
Name: price, dtype: float64
print("Test partition")
print("--------------")
print(y_test.describe())
Test partition
--------------
count       346.000000
mean     214084.395954
std      104689.155889
min        5000.000000
25%      139000.000000
50%      180750.000000
75%      271750.000000
max      670000.000000
Name: price, dtype: float64

Preprocessing

Preprocessing encompasses all the transformations applied to the data so that the machine learning algorithm can interpret them as effectively as possible.

All data preprocessing must be learned with the training observations and then applied to both the training set and the test set. This is very important so as not to violate the condition that no information from the test observations participates or influences the model fitting.

This principle must also be applied if cross-validation is used (see below). In that case, preprocessing must be done within each validation iteration, so that the estimates made with each validation partition do not contain information from the other partitions.

Although it is not possible to create a single list, below are some of the preprocessing steps that are most often needed.

Missing Value Imputation

The vast majority of algorithms do not accept incomplete observations, so when the dataset contains missing values, you can:

  • Remove those observations that are incomplete.

  • Remove those variables that contain missing values.

  • Try to estimate missing values using the rest of the available information (imputation).

The first two options, although simple, imply losing information. Removing observations can only be applied when there are many and the percentage of incomplete records is very low. In the case of removing variables, the impact will depend on how much information those variables contribute to the model.

When using imputation, it is very important to consider the risk of introducing values in predictors that have a lot of influence on the model. Suppose a medical study in which, when one of the predictors is positive, the model almost always predicts that the patient is healthy. For a patient whose value of this predictor is unknown, the risk that the imputation is wrong is very high, so it is preferable to obtain a prediction based only on the available information. This is another example of the importance of the analyst knowing the problem they are facing so that they can make the best decision.

The sklearn.impute module incorporates several different imputation methods:

  • SimpleImputer: allows imputations using a constant value or a statistic (mean, median, most frequent value) from the same column where the missing value is located.

  • IterativeImputer: allows imputing the value of a column taking into account the rest of the columns. Specifically, it is an iterative process in which, in each iteration, one of the variables is used as the response variable and the rest as predictors. Once the model is obtained, it is used to predict the empty positions of that variable. This process is carried out with each variable and the cycle is repeated max_iter times to gain stability. The sklearn.impute.IterativeImputer implementation allows almost any scikit-learn regressor (KNN, RandomForest, GradientBoosting...) to be used to create the imputation models.

  • KNNImputer: it is a specific case of IterativeImputer in which k-Nearest Neighbors is used as the imputation algorithm.

Despite being a widely used method, imputing with KNN has two problems: its high computational cost makes it applicable only to small or moderately sized datasets, and, when there are categorical variables, the difficulty of measuring "distances" in that context can lead to unrealistic results. For these two reasons, it is often more advisable to use a random forest as the imputation model, IterativeImputer(estimator=RandomForestRegressor()).

With the argument add_indicator=True, a new column is automatically created in which the value 1 indicates which values have been imputed. This can be useful both to identify observations in which some imputation has been performed and to use it as another predictor in the model.
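
The dataset used in this document has no missing values, so none of these imputers are needed here. As an illustrative sketch, this is how they could be configured; note that IterativeImputer still requires the explicit experimental import, and its estimator argument sets the model used for the imputations.

# Imputation sketch (not needed for this dataset, shown only as illustration)
# ==============================================================================
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer
from sklearn.ensemble import RandomForestRegressor

# Impute with the median and add indicator columns flagging imputed positions
imputer_simple = SimpleImputer(strategy='median', add_indicator=True)

# Iterative imputation using a random forest as the internal model
imputer_iterative = IterativeImputer(
                        estimator = RandomForestRegressor(random_state=123),
                        max_iter  = 10
                    )

# Imputation based on the 5 nearest neighbors
imputer_knn = KNNImputer(n_neighbors=5)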

Near-Zero Variance Variables

Predictors that contain a single value (zero-variance) should not be included in the model as they provide no information. It is also not advisable to include predictors that have near-zero variance, that is, predictors that take only a few values, of which some appear very infrequently. The problem with the latter is that they can become zero-variance predictors when observations are divided by cross-validation or bootstrap.

The VarianceThreshold class from the sklearn.feature_selection module identifies and excludes all predictors whose variance does not exceed a certain threshold. In the case of categorical variables, it should be remembered that scikit-learn requires them to be binarized (one-hot encoding or dummy variables) in order to train models. A boolean variable follows a Bernoulli distribution, so its variance can be calculated as: $$\mathrm{Var}[X] = p(1 - p)$$

While the removal of non-informative predictors could be considered a step in the predictor selection process, since it consists of filtering by variance, it must be done before standardizing the data, because afterwards, all predictors have variance 1.
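
The following sketch shows how VarianceThreshold could be applied to this dataset after binarizing the categorical predictors with pd.get_dummies (a shortcut used here only for illustration; in the rest of the document binarization is done with OneHotEncoder inside a ColumnTransformer). The threshold of 0.01 is arbitrary: since the variance of a dummy variable is p(1 - p), it removes binary columns whose minority level appears in roughly less than 1% of the observations.

# Near-zero variance filtering (illustrative sketch)
# ==============================================================================
from sklearn.feature_selection import VarianceThreshold

X_demo = pd.get_dummies(X_train, drop_first=True, dtype=float)

selector  = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X_demo)
print(f"Columns before: {X_demo.shape[1]}, columns after: {X_reduced.shape[1]}")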

Scaling of Numeric Variables

When predictors are numeric, the scale on which they are measured, as well as the magnitude of their variance, can greatly influence the model. Many machine learning algorithms (SVM, neural networks, lasso...) are sensitive to this, so if the predictors are not equalized in some way, those measured on a larger scale or with more variance will dominate the model even if they are not the ones most related to the response variable. There are mainly 2 strategies to avoid this:

  • Centering: consists of subtracting from each value the mean of the predictor to which it belongs. If the data is stored in a dataframe, centering is achieved by subtracting from each value the mean of the column in which it is located. As a result of this transformation, all predictors come to have a mean of zero, that is, the values are centered around the origin. StandardScaler(with_std=False)
  • Normalization (standardization): consists of transforming the data so that all predictors are approximately on the same scale. There are two ways to achieve this:

    • Z-score normalization (StandardScaler): each predictor is divided by its standard deviation after being centered; as a result, each predictor ends up with a mean of zero and a standard deviation of one.

      $$z = \frac{x - \mu}{\sigma}$$

    • Min-max standardization (MinMaxScaler): transform the data so that it is within the range [0, 1].

      $$X_{norm} = \frac{X - X_{min}}{X_{max}-X_{min}}$$

Variables should never be standardized after being binarized (see below).
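
As a minimal sketch, the two scalers can be applied to the numeric columns as shown below; in the rest of the document this step is integrated into a ColumnTransformer so that the parameters are learned only from the training set.

# Scaling sketch: parameters are learned on train and applied to train and test
# ==============================================================================
from sklearn.preprocessing import MinMaxScaler, StandardScaler

numeric_cols = X_train.select_dtypes(include=['float64', 'int']).columns

# Z-score normalization: mean 0 and standard deviation 1
scaler_std  = StandardScaler().fit(X_train[numeric_cols])
X_train_std = scaler_std.transform(X_train[numeric_cols])
X_test_std  = scaler_std.transform(X_test[numeric_cols])

# Min-max scaling to the [0, 1] range
scaler_minmax = MinMaxScaler().fit(X_train[numeric_cols])
X_train_01    = scaler_minmax.transform(X_train[numeric_cols])
X_test_01     = scaler_minmax.transform(X_test[numeric_cols])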

Binarization of Categorical Variables

Binarization (one-hot-encoding) consists of creating new dummy variables with each of the levels of the categorical variables. For example, a variable called color that contains the levels red, green, and blue will become three new variables (color_red, color_green, color_blue), all with the value 0 except the one that matches the observation, which takes the value 1.

By default, the OneHotEncoder class binarizes all variables, so it must be applied only to categorical variables (see how to do it in the ColumnTransformer section). With the argument drop='first', one of the levels is removed to avoid redundancies. Returning to the previous example, it is not necessary to store all three variables, since if color_red and color_green take the value 0, the variable color_blue necessarily takes the value 1. If color_red or color_green take the value 1, then color_blue is necessarily 0. This is important in models that suffer problems if predictors are perfectly correlated (linear models without regularization, neural networks...).

In certain scenarios, a new level may appear in the test data that was not in the training data. If all possible levels are not known in advance, prediction errors can be avoided by indicating OneHotEncoder(handle_unknown='ignore').
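
A minimal sketch of OneHotEncoder applied to the categorical columns is shown below. Here drop='first' is used to illustrate the removal of redundant levels; in the models of this document the encoder is used with handle_unknown='ignore' instead, inside a ColumnTransformer (next section).

# One-hot encoding sketch on the categorical columns
# ==============================================================================
from sklearn.preprocessing import OneHotEncoder

cat_cols = X_train.select_dtypes(include=['object', 'category']).columns

# drop='first' removes one level per variable to avoid perfectly correlated dummies
encoder = OneHotEncoder(drop='first', sparse_output=False)
encoder.fit(X_train[cat_cols])
X_train_ohe = encoder.transform(X_train[cat_cols])
print(encoder.get_feature_names_out())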

The way to preprocess data within the scikit-learn ecosystem is by using ColumnTransformer and pipeline. In addition to those already mentioned, many more preprocessing transformations can be found in the sklearn.preprocessing module.

Pipeline and ColumnTransformer

The ColumnTransformer and make_column_transformer classes from the sklearn.compose module allow combining multiple preprocessing transformations, specifying which columns each one is applied to. Like any transformer, it has a training method (fit) and a transformation method (transform). This allows the learning of transformations to be done only with training observations, and they can be applied later to any dataset. The idea behind this module is as follows:

  1. Define all the transformations (scaling, selection, filtering...) to be applied and to which columns ColumnTransformer(). Column selection can be done by: name, index, boolean mask, slice, regex pattern, column type, or with the selection functions make_column_selector.

  2. Learn the parameters necessary for those transformations with the training observations .fit().

  3. Apply the learned transformations to any dataset .transform().

By default, the result returned by ColumnTransformer is a numpy array, so the column names are lost. Starting from version 1.2.1, scikit-learn includes the possibility for the result of transformers to be a Pandas DataFrame instead of a numpy array. This has the advantage that the result of any transformation directly incorporates the column names. This behavior can be indicated individually per transformer with the .set_output(transform="pandas") method or as a global configuration with set_config(transform_output="pandas").

It should be noted that, if a OneHotEncoder type transformer is used in combination with .set_output(transform="pandas"), it must be indicated that the OneHotEncoder does not return data in sparse matrix form (sparse_output = False).

# Selection of variables by type
# ==============================================================================
# Numeric columns are standardized and one-hot-encoding is applied to 
# categorical columns. To keep columns that are not transformed,
# remainder='passthrough' must be indicated.
numeric_cols = X_train.select_dtypes(include=['float64', 'int']).columns.to_list()
cat_cols = X_train.select_dtypes(include=['object', 'category']).columns.to_list()
preprocessor = ColumnTransformer(
                   [('scale', StandardScaler(), numeric_cols),
                    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False), cat_cols)],
                remainder = 'passthrough',
                verbose_feature_names_out = False
               ).set_output(transform="pandas")
preprocessor
ColumnTransformer(remainder='passthrough',
                  transformers=[('scale', StandardScaler(),
                                 ['total_sqm', 'age', 'land_value',
                                  'living_sqm', 'college_pct', 'bedrooms',
                                  'bathrooms', 'rooms']),
                                ('onehot',
                                 OneHotEncoder(handle_unknown='ignore',
                                               sparse_output=False),
                                 ['fireplace', 'heating', 'heating_fuel',
                                  'sewer', 'lake_view', 'new_construction',
                                  'air_conditioning'])],
                  verbose_feature_names_out=False)

Once the ColumnTransformer object has been defined, with the fit() method the transformations are learned with the training data and applied to both sets with transform().

preprocessor.fit(X_train)
X_train_prep = preprocessor.transform(X_train)
X_test_prep  = preprocessor.transform(X_test)

X_train_prep.head(3)
total_sqm age land_value living_sqm college_pct bedrooms bathrooms rooms fireplace_0 fireplace_1 ... heating_fuel_oil sewer_none sewer_public/commercial sewer_septic lake_view_No lake_view_Yes new_construction_No new_construction_Yes air_conditioning_No air_conditioning_Yes
1571 -0.061549 0.301597 -0.195148 1.320467 0.826826 1.011039 0.905289 2.128665 0.0 1.0 ... 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0
832 -0.195573 -0.407560 -0.277982 0.194931 0.730235 1.011039 1.663079 1.264569 0.0 1.0 ... 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0
1302 0.072475 -0.478476 -0.892092 -0.064313 -1.877728 1.011039 0.147500 -0.463623 1.0 0.0 ... 1.0 0.0 0.0 1.0 1.0 0.0 1.0 0.0 0.0 1.0

3 rows × 26 columns

ColumnTransformer applies its operations in parallel, not sequentially, which means it does not allow applying more than one transformation to the same column. When multiple preprocessing transformations must be applied to the same column, they must first be grouped in a Pipeline, which also combines operations but executes them sequentially, so that the output of one step is the input of the next.

In the following example, the transformations are combined:

  • Numeric columns: missing values are imputed with the median and then standardized.

  • Categorical columns: missing values are imputed with the most frequent value and then one hot encoding is applied.

# Pipeline for preprocessing
# ==============================================================================
numeric_cols = X_train.select_dtypes(include=['float64', 'int']).columns.to_list()
cat_cols = X_train.select_dtypes(include=['object', 'category']).columns.to_list()

# Transformations for numeric variables
numeric_transformer = Pipeline(
                        steps=[
                            ('imputer', SimpleImputer(strategy='median')),
                            ('scaler', StandardScaler())
                        ]
                      )


# Transformations for categorical variables
categorical_transformer = Pipeline(
                            steps=[
                                ('imputer', SimpleImputer(strategy='most_frequent')),
                                ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
                            ]
                          )

preprocessor = ColumnTransformer(
                    transformers=[
                        ('numeric', numeric_transformer, numeric_cols),
                        ('cat', categorical_transformer, cat_cols)
                    ],
                    remainder='passthrough',
                    verbose_feature_names_out = False
               ).set_output(transform="pandas")

preprocessor
ColumnTransformer(remainder='passthrough',
                  transformers=[('numeric',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='median')),
                                                 ('scaler', StandardScaler())]),
                                 ['total_sqm', 'age', 'land_value',
                                  'living_sqm', 'college_pct', 'bedrooms',
                                  'bathrooms', 'rooms']),
                                ('cat',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='most_frequent')),
                                                 ('onehot',
                                                  OneHotEncoder(handle_unknown='ignore',
                                                                sparse_output=False))]),
                                 ['fireplace', 'heating', 'heating_fuel',
                                  'sewer', 'lake_view', 'new_construction',
                                  'air_conditioning'])],
                  verbose_feature_names_out=False)
preprocessor.fit(X_train)
X_train_prep = preprocessor.transform(X_train)
X_test_prep  = preprocessor.transform(X_test)
X_train_prep.head(3)
total_sqm age land_value living_sqm college_pct bedrooms bathrooms rooms fireplace_0 fireplace_1 ... heating_fuel_oil sewer_none sewer_public/commercial sewer_septic lake_view_No lake_view_Yes new_construction_No new_construction_Yes air_conditioning_No air_conditioning_Yes
1571 -0.061549 0.301597 -0.195148 1.320467 0.826826 1.011039 0.905289 2.128665 0.0 1.0 ... 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0
832 -0.195573 -0.407560 -0.277982 0.194931 0.730235 1.011039 1.663079 1.264569 0.0 1.0 ... 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0
1302 0.072475 -0.478476 -0.892092 -0.064313 -1.877728 1.011039 0.147500 -0.463623 1.0 0.0 ... 1.0 0.0 0.0 1.0 1.0 0.0 1.0 0.0 0.0 1.0

3 rows × 26 columns

Creating a Model

The next step, once the training data has been defined, is to select the algorithm to be used. In scikit-learn, this is done by creating an estimator object. This object encapsulates the algorithm name, as well as its parameters and hyperparameters, and provides the fit(X, y) and predict() methods, which allow it to learn from the data and make predictions on new observations. In the following listing you can consult all the algorithms implemented in scikit-learn.

Training

A first linear regression model with ridge regularization is fitted to predict the house price based on all available predictors. All arguments of sklearn.linear_model.Ridge are kept at their default values.

It is important to note that, when a linear regression model incorporates regularization on the coefficients (ridge, lasso, elasticnet), it is necessary to standardize the predictors. To ensure that preprocessing is done only with the training data, the transformations and model fitting are combined in a single pipeline.

# Preprocessing
# ==============================================================================
# Identification of numeric and categorical columns
numeric_cols = X_train.select_dtypes(include=['float64', 'int']).columns.to_list()
cat_cols = X_train.select_dtypes(include=['object', 'category']).columns.to_list()


# Transformations for numeric variables
numeric_transformer = Pipeline(
                        steps=[('scaler', StandardScaler())]
                      )

# Transformations for categorical variables
categorical_transformer = Pipeline(
                            steps=[('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))]
                          )

preprocessor = ColumnTransformer(
                    transformers=[
                        ('numeric', numeric_transformer, numeric_cols),
                        ('cat', categorical_transformer, cat_cols)
                    ],
                    remainder='passthrough',
                    verbose_feature_names_out = False
               ).set_output(transform="pandas")

# Pipeline
# ==============================================================================
# Preprocessing steps and model are combined in the same pipeline
pipe = Pipeline([('preprocessing', preprocessor),
                 ('model', Ridge(random_state=789456))])

# Training
# ==============================================================================
pipe.fit(X=X_train, y=y_train)
Pipeline(steps=[('preprocessing',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('numeric',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  ['total_sqm', 'age',
                                                   'land_value', 'living_sqm',
                                                   'college_pct', 'bedrooms',
                                                   'bathrooms', 'rooms']),
                                                 ('cat',
                                                  Pipeline(steps=[('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse_output=False))]),
                                                  ['fireplace', 'heating',
                                                   'heating_fuel', 'sewer',
                                                   'lake_view',
                                                   'new_construction',
                                                   'air_conditioning'])],
                                   verbose_feature_names_out=False)),
                ('model', Ridge(random_state=789456))])

Validation

The ultimate purpose of a model is to predict the response variable in future observations or in data the model has not "seen" before. The error shown by default after training a model usually corresponds to the training error, that is, the error the model makes when predicting observations it has already used to learn.

While this type of error is useful for analyzing how the model is learning (for example, through residual analysis), it does not represent a realistic estimate of its performance on new observations, as it tends to be overly optimistic. To obtain a more reliable estimate—and before resorting to the test set—validation strategies based on resampling can be used.

Scikit-learn incorporates various validation strategies in the sklearn.model_selection module.

  • All of them receive as the first argument an estimator, which can be directly a model or a pipeline.

  • Error metrics for regression tasks are always returned with a negative sign, so the closer to 0 the value is, the better the fit. This is done so that optimization algorithms, which maximize by default, work correctly.

The simplest way to apply cross-validation is through the cross_val_score() function, which by default uses KFold.

# Cross-validation
# ==============================================================================
cv_scores = cross_val_score(
                estimator = pipe,
                X         = X_train,
                y         = y_train,
                scoring   = 'neg_root_mean_squared_error',
                cv        = 5
            )

print(f"Cross-validation metrics: {cv_scores}")
print(f"Mean cross-validation metrics: {cv_scores.mean()}")
Cross-validation metrics: [-53936.97817183 -53076.74364513 -62746.15219054 -65963.09754244
 -48929.66260476]
Mean cross-validation metrics: -56930.52683094097

It is also possible to use other validation strategies (KFold, RepeatedKFold, LeaveOneOut, LeavePOut, ShuffleSplit) by creating the corresponding splitter from the sklearn.model_selection module and passing it to the cv argument.
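
For example, a ShuffleSplit splitter (randomized train/validation partitions that may overlap across iterations) can be created and passed directly to the cv argument; LeaveOneOut could be used in the same way, although with 1382 training observations it would require as many model fits. The values of n_splits and test_size below are illustrative.

# Validation with ShuffleSplit passed through the cv argument (illustrative)
# ==============================================================================
from sklearn.model_selection import ShuffleSplit

cv_shuffle = ShuffleSplit(n_splits=10, test_size=0.2, random_state=123)
cv_scores_shuffle = cross_val_score(
                        estimator = pipe,
                        X         = X_train,
                        y         = y_train,
                        scoring   = 'neg_root_mean_squared_error',
                        cv        = cv_shuffle
                    )
print(f"Mean cross-validation metric (ShuffleSplit): {cv_scores_shuffle.mean()}")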

Each of these methods works internally in a different way, but they all share the same fundamental idea: fit and evaluate the model repeatedly using different subsets derived from the training set, obtaining an error estimate in each iteration. The average of all these estimates tends to converge towards the actual value of the test set error.$^{\text{Appendix 1}}$

# Repeated cross-validation
# ==============================================================================
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=123)
cv_scores = cross_val_score(
                estimator = pipe,
                X         = X_train,
                y         = y_train,
                scoring   = 'neg_root_mean_squared_error',
                cv        = cv
            )

print(f"Cross-validation metrics: {cv_scores}")
print("")
print(f"Mean cross-validation metrics: {cv_scores.mean()}")
Cross-validation metrics: [-66487.72274814 -56333.87119132 -48992.85039086 -55030.00923381
 -58790.48472729 -62651.67080028 -55779.2103529  -61820.5882113
 -57243.1819794  -48910.70385765 -50033.46929956 -51586.8240037
 -62478.2206031  -65111.86204296 -54900.98944223 -64220.39882006
 -53694.93490461 -64690.99877883 -51242.27642488 -52330.80605289
 -56777.05219272 -58644.60923875 -55573.25900549 -64350.25804164
 -50578.15339623]

Mean cross-validation metrics: -57130.17622962478

The cross_validate function is similar to cross_val_score but allows estimating multiple metrics at once, both for test and train, and returns the results in a dictionary.

# Repeated cross-validation with multiple metrics
# ==============================================================================
cv = RepeatedKFold(n_splits=3, n_repeats=5, random_state=123)
cv_scores = cross_validate(
                estimator = pipe,
                X         = X_train,
                y         = y_train,
                scoring   = ('r2', 'neg_root_mean_squared_error'),
                cv        = cv,
                return_train_score = True
            )

# Convert dictionary to dataframe for easier visualization
cv_scores = pd.DataFrame(cv_scores)
cv_scores
fit_time score_time test_r2 train_r2 test_neg_root_mean_squared_error train_neg_root_mean_squared_error
0 0.017680 0.011029 0.632359 0.679758 -64605.636258 -51789.256510
1 0.019497 0.015855 0.718708 0.636812 -48322.428406 -59983.019196
2 0.018450 0.010004 0.595874 0.689939 -58434.926618 -55213.669996
3 0.011611 0.007364 0.672969 0.655316 -58179.173708 -55331.536238
4 0.011359 0.007863 0.615521 0.683390 -60148.310664 -54366.426683
5 0.013043 0.007721 0.650829 0.664771 -53812.579173 -57561.184588
6 0.011932 0.007122 0.698178 0.647441 -49813.241120 -59221.407345
7 0.013098 0.007118 0.646554 0.673016 -61749.477863 -53221.315710
8 0.012182 0.010252 0.615151 0.682999 -59192.994923 -54895.180113
9 0.012179 0.008770 0.645685 0.670425 -61061.033083 -53843.654748
10 0.012779 0.006467 0.564282 0.696123 -59084.326066 -55269.382399
11 0.011912 0.007219 0.709130 0.637747 -52788.788472 -57942.367392
12 0.012289 0.008397 0.663592 0.652572 -59372.918876 -55251.685765
13 0.011996 0.006569 0.637408 0.671598 -54925.502000 -57014.192113
14 0.010961 0.006674 0.624017 0.679811 -58863.107740 -54881.130021
# Distribution of cross-validation error
# ==============================================================================
fig, ax = plt.subplots(figsize=(6, 3))

sns.kdeplot(
    cv_scores['test_neg_root_mean_squared_error'],
    fill    = True,
    color   = "#30a2da",
    ax      = ax
)
sns.rugplot(
    cv_scores['test_neg_root_mean_squared_error'],
    color   = "black",
    ax      = ax
)
ax.set_title('Distribution of cross-validation error', fontsize = 12)
ax.tick_params(labelsize = 8)
ax.set_xlabel("")
ax.set_ylabel("");

The cross_val_predict function, instead of returning the metrics for each partition, returns the predictions generated in each of them. This is useful for evaluating model residuals and diagnosing its behavior.

It is important to note that, if repeated cross-validation or bootstrapping techniques are used, the same observation can appear multiple times in the validation sets.

# Error (residual) diagnostics in cross-validation predictions
# ========================================================================
# Cross-validation configuration
cv = KFold(n_splits=5, random_state=123, shuffle=True)

# Cross-validation predictions
y_pred_cv = cross_val_predict(estimator=pipe, X=X_train, y=y_train, cv=cv)

# Residual calculation
residuals = y_train - y_pred_cv

# Visualization
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 6))
fig.suptitle('Model residual diagnostics', fontsize=12)
plt.subplots_adjust(top=0.88)

# Standard font size
label_fontsize = 8
title_fontsize = 10
tick_fontsize = 7

# Plot 1: Actual value vs prediction
axes[0, 0].scatter(y_train, y_pred_cv, edgecolors='black', alpha=0.4)
axes[0, 0].plot(
    [y_train.min(), y_train.max()],
    [y_train.min(), y_train.max()],
    '--', color='black', lw=2
)
axes[0, 0].set_title('Predicted vs Actual value', fontsize=title_fontsize)
axes[0, 0].set_xlabel('Actual value', fontsize=label_fontsize)
axes[0, 0].set_ylabel('Predicted value', fontsize=label_fontsize)
axes[0, 0].tick_params(labelsize=tick_fontsize)

# Plot 2: Residuals vs index
axes[0, 1].scatter(np.arange(len(residuals)), residuals, edgecolors='black', alpha=0.4)
axes[0, 1].axhline(0, linestyle='--', color='black', lw=2)
axes[0, 1].set_title('Model residuals', fontsize=title_fontsize)
axes[0, 1].set_xlabel('Index', fontsize=label_fontsize)
axes[0, 1].set_ylabel('Residual', fontsize=label_fontsize)
axes[0, 1].tick_params(labelsize=tick_fontsize)

# Plot 3: Histogram and KDE of residuals
sns.histplot(
    residuals,
    stat='density',
    kde=True,
    color="#30a2da",
    alpha=0.3,
    line_kws={'linewidth': 1},
    ax=axes[1, 0]
)
axes[1, 0].set_title('Residual distribution', fontsize=title_fontsize)
axes[1, 0].set_xlabel('Residual', fontsize=label_fontsize)
axes[1, 0].set_ylabel('Density', fontsize=label_fontsize)
axes[1, 0].tick_params(labelsize=tick_fontsize)

# Plot 4: Q-Q plot of residuals
sm.qqplot(residuals, fit=True, line='q', ax=axes[1, 1], color='#30a2da', alpha=0.4, lw=2)
qq_line = axes[1, 1].lines[1]
qq_line.set_linewidth(1)
qq_line.set_color('black')
axes[1, 1].set_title('Q-Q plot of residuals', fontsize=title_fontsize)
axes[1, 1].set_xlabel('Theoretical quantiles', fontsize=label_fontsize)
axes[1, 1].set_ylabel('Sample quantiles', fontsize=label_fontsize)
axes[1, 1].tick_params(labelsize=tick_fontsize)

fig.tight_layout()

For practical purposes, when applying resampling methods to validate a model, two things must be taken into account: the computational cost of fitting the model multiple times, each time with a different subset of data, and the reproducibility of the partitions. The functions cross_val_score() and cross_val_predict() allow parallelizing the process through the n_jobs argument.

# Parallelized repeated cross-validation (multicore)
# ==============================================================================
cv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=123)
cv_scores = cross_val_score(
                estimator = pipe,
                X         = X_train,
                y         = y_train,
                scoring   = 'neg_root_mean_squared_error',
                cv        = cv,
                n_jobs    = -1 # all available cores
            )

print(f"Mean cross-validation metrics: {cv_scores.mean()}")
Mean cross-validation metrics: -56735.77500472445

The average root_mean_squared_error estimated through cross-validation for the ridge model is 56735. This value will be compared later when the model error is calculated with the test set.

Prediction

Once the model has been trained, either using an estimator directly or a pipeline, new observations can be predicted with the .predict() method. If a pipeline is used, the transformations learned during training are automatically applied.

predictions = pipe.predict(X_test)
# Create a dataframe with predictions and actual value
df_predictions = pd.DataFrame({'price' : y_test, 'prediction' : predictions})
df_predictions.head()
price prediction
903 105000 112491.665253
208 113000 185142.195564
358 110500 168900.516039
1187 159000 139333.302565
319 215000 236967.476884

Test Error

Although validation methods such as KFold or LeaveOneOut provide good estimates of a model's error when predicting new observations, the best way to evaluate a final model is to do so with a test set. This set must be completely separated from the training and optimization process.

Depending on the specific problem, different metrics may be relevant $^{\text{Annex 2}}$. The sklearn.metrics module offers a wide variety of metrics to evaluate the quality of predictions.

# root_mean_squared_error metric
# ==============================================================================
rmse = root_mean_squared_error(
        y_true = y_test,
        y_pred = predictions,
       )
rmse
65372.916389239625

In the validation section, the model's rmse was estimated at 56735 through repeated cross-validation, a value reasonably close to the 65372 obtained with the test set.
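Besides the RMSE, other metrics from the sklearn.metrics module can be computed on the same test predictions. The following is a minimal sketch reusing the y_test and predictions objects created above; the choice of additional metrics is merely illustrative.

# Additional test metrics (illustrative)
# ==============================================================================
from sklearn.metrics import mean_absolute_error, r2_score

mae = mean_absolute_error(y_true=y_test, y_pred=predictions)
r2  = r2_score(y_true=y_test, y_pred=predictions)
print(f"Test MAE: {mae}")
print(f"Test R^2: {r2}")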

Hyperparameters

Many models, including linear regression with Ridge regularization, contain parameters that cannot be learned from the training data and must therefore be set by the analyst. These are known as hyperparameters. The results of a model can depend heavily on the values its hyperparameters take; however, it is not possible to know the appropriate values in advance. Although, with practice, machine learning practitioners develop intuition about which values may work better for each problem, there are no fixed rules. The most common way to find the optimal values is to try different possibilities:


  1. Choose a set of values for the hyperparameter(s).

    • grid search: an exhaustive search is done on a set of values previously defined by the user.

    • random search: random values are evaluated within limits defined by the user.

  2. For each value (combination of values if there is more than one hyperparameter), train the model and estimate its error through a validation method $^{\text{Annex 1}}$.

  3. Finally, fit the model again, this time with all training data and with the best hyperparameters found.


Scikit-learn allows exploring different hyperparameter values through model_selection.GridSearchCV() and model_selection.RandomizedSearchCV().

The Ridge model used so far has a hyperparameter called alpha, which by default has the value 1.0. This hyperparameter controls the penalty applied to the model coefficients. The higher its value, the more restriction is imposed on the coefficients, thus reducing variance, attenuating the effect of correlation between predictors, and minimizing the risk of overfitting.

A Ridge model is fitted again with different values of alpha using repeated cross-validation to identify which one yields the best results.

# Pipe: preprocessing + model
# ==============================================================================
# Identification of numeric and categorical columns
numeric_cols = X_train.select_dtypes(include=['float64', 'int']).columns.to_list()
cat_cols = X_train.select_dtypes(include=['object', 'category']).columns.to_list()


# Transformations for numeric variables
numeric_transformer = Pipeline(
                        steps=[('scaler', StandardScaler())]
                      )

# Transformations for categorical variables
categorical_transformer = Pipeline(
                            steps=[('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))]
                          )

preprocessor = ColumnTransformer(
                    transformers=[
                        ('numeric', numeric_transformer, numeric_cols),
                        ('cat', categorical_transformer, cat_cols)
                    ],
                    remainder='passthrough',
                    verbose_feature_names_out = False
               ).set_output(transform="pandas")

# Preprocessing steps and model are combined in the same pipeline
pipe = Pipeline([('preprocessing', preprocessor),
                 ('model', Ridge())])
# Hyperparameter grid
# ==============================================================================
param_grid = {'model__alpha': np.logspace(-5, 3, 10)}

# Cross-validation search
# ==============================================================================
grid = GridSearchCV(
        estimator  = pipe,
        param_grid = param_grid,
        scoring    = 'neg_root_mean_squared_error',
        n_jobs     = cpu_count() - 1,
        cv         = RepeatedKFold(n_splits = 5, n_repeats = 5), 
        verbose    = 0,
        return_train_score = True
       )

# Result is assigned to _ so it is not printed to screen
_ = grid.fit(X = X_train, y = y_train)
# Grid results
# ==============================================================================
results = pd.DataFrame(grid.cv_results_)
results.filter(regex = '(param.*|mean_t|std_t)')\
    .drop(columns = 'params')\
    .sort_values('mean_test_score', ascending = False)
param_model__alpha mean_test_score std_test_score mean_train_score std_train_score
6 2.154435 -57015.966627 5465.631682 -55903.194589 1434.421966
5 0.278256 -57061.419747 5453.841055 -55890.429453 1434.319041
4 0.035938 -57070.046327 5452.082559 -55890.168259 1434.316012
7 16.681005 -57070.856505 5510.116607 -56163.247029 1433.747581
3 0.004642 -57071.217277 5451.851119 -55890.163778 1434.315957
2 0.000599 -57071.369487 5451.821155 -55890.163702 1434.315956
1 0.000077 -57071.389162 5451.817284 -55890.163701 1434.315956
0 0.000010 -57071.391703 5451.816784 -55890.163701 1434.315956
8 129.154967 -57925.670520 5685.548777 -57352.164605 1429.554577
9 1000.000000 -62521.145397 6344.532757 -62375.934333 1399.738326
# Best hyperparameters
# ==============================================================================
print("--------------------------------")
print("Best hyperparameters found")
print("--------------------------------")
print(f"{grid.best_params_} : {grid.best_score_} ({grid.scoring})")
--------------------------------
Best hyperparameters found
--------------------------------
{'model__alpha': np.float64(2.154434690031882)} : -57015.96662678427 (neg_root_mean_squared_error)
# Cross-validation results plot for each hyperparameter
# ==============================================================================
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 3.84), sharey=True)

# Plot 1
# ------------------------------------------------------------------------------
results.plot('param_model__alpha', 'mean_train_score', ax=axes[0])
results.plot('param_model__alpha', 'mean_test_score', ax=axes[0])
axes[0].fill_between(results.param_model__alpha.astype(float),
                results['mean_train_score'] + results['std_train_score'],
                results['mean_train_score'] - results['std_train_score'],
                alpha=0.2)
axes[0].fill_between(results.param_model__alpha.astype(float),
                results['mean_test_score'] + results['std_test_score'],
                results['mean_test_score'] - results['std_test_score'],
                alpha=0.2)
axes[0].legend()
axes[0].set_xscale('log')
axes[0].set_title('CV error evolution')
axes[0].set_ylabel('neg_root_mean_squared_error');

# Plot 2
# ------------------------------------------------------------------------------
n_splits = grid.n_splits_

results.plot(
    x     = 'param_model__alpha',
    y     = [f'split{i}_train_score' for i in range(n_splits)],
    alpha = 0.3,
    c     = 'blue', 
    ax    = axes[1]
)

results.plot(
    x     = 'param_model__alpha',
    y     = [f'split{i}_test_score' for i in range(n_splits)],
    alpha = 0.3,
    c     = 'red', 
    ax    = axes[1]
)

axes[1].legend(
    (axes[1].get_children()[0], axes[1].get_children()[n_splits]),
    ('training scores', 'test_scores')
)
axes[1].set_xscale('log')
axes[1].set_title('CV error evolution')
axes[1].set_ylabel('neg_root_mean_squared_error');

fig.tight_layout()
fig, ax = plt.subplots(figsize=(6, 3.84))
ax.barh(
    [str(d) for d in results['params']],
    results['mean_test_score'],
    xerr=results['std_test_score'],
    align='center',
    alpha=0
)
ax.plot(
    results['mean_test_score'],
    [str(d) for d in results['params']],
    marker="D",
    linestyle="",
    alpha=0.8,
    color="r"
)
ax.set_title('Hyperparameter Comparison')
ax.set_xlabel('Test neg_root_mean_squared_error');

If refit=True is indicated in GridSearchCV(), after identifying the best hyperparameters, the model is retrained with them and stored in .best_estimator_.

GridSearchCV() performs an exhaustive search, evaluating all parameter combinations. This strategy has the drawback that a lot of time can be spent in regions of little interest before other combinations are evaluated. An alternative is to perform a random search; this way, the search space is explored in a more distributed manner. RandomizedSearchCV() allows this type of strategy; it only requires that the search space of each hyperparameter be indicated (a list of options or a distribution) and the number of random combinations to evaluate.

fig, axs = plt.subplots(nrows = 1, ncols = 2,figsize=(8, 4),
                        sharex = True, sharey = False)

# Exhaustive grid
# ==============================================================================
hyperparameter_1 = np.linspace(start = 0, stop = 100, num=20)
hyperparameter_2 = np.linspace(start = 0, stop = 10, num=20)

# List with all combinations
combinations = [list(x) for x in product(hyperparameter_1, hyperparameter_2)]
combinations = pd.DataFrame.from_records(
                    combinations,
                    columns=['hyperparameter_1', 'hyperparameter_2']
                )

combinations.plot(
    x    = 'hyperparameter_1',
    y    = 'hyperparameter_2',
    kind = 'scatter',
    ax   = axs[0]
)
axs[0].set_title('Uniform distribution')

# Random grid
# ==============================================================================
hyperparameter_1 = np.random.uniform(low = 0, high = 100, size  = 400)
hyperparameter_2 = np.random.uniform(low = 0, high = 10, size  = 400)

combinations = pd.DataFrame(
                    {
                    'hyperparameter_1': hyperparameter_1,
                    'hyperparameter_2': hyperparameter_2,
                    }
                )
combinations.plot(
    x    = 'hyperparameter_1',
    y    = 'hyperparameter_2',
    kind = 'scatter',
    ax   = axs[1]
)
axs[1].set_title('Random distribution');
# Search space for each hyperparameter
# ==============================================================================
param_distributions = {'model__alpha': np.logspace(-5, 3, 100)}

# Cross-validation search
# ==============================================================================
grid = RandomizedSearchCV(
        estimator  = pipe,
        param_distributions = param_distributions,
        n_iter     = 50,
        scoring    = 'neg_root_mean_squared_error',
        n_jobs     = cpu_count() - 1,
        cv         = RepeatedKFold(n_splits = 5, n_repeats = 5), 
        verbose    = 0,
        random_state = 123,
        return_train_score = True
       )

grid.fit(X = X_train, y = y_train)

# Results
results = pd.DataFrame(grid.cv_results_)
results.filter(regex = '(param.*|mean_t|std_t)')\
    .drop(columns = 'params')\
    .sort_values('mean_test_score', ascending = False)\
    .head(1)
param_model__alpha mean_test_score std_test_score mean_train_score std_train_score
33 3.764936 -57016.19437 4221.321925 -55948.199147 1070.076033

Bayesian Optimization

Although these strategies are perfectly valid and generate good results, especially when there are criteria to narrow down the search range, they share a common limitation: none of them takes into account the results obtained so far, which prevents the search from focusing on the regions of greatest interest and avoiding unnecessary ones.

An alternative is hyperparameter search with Bayesian optimization methods. In general terms, Bayesian hyperparameter optimization consists of building a probabilistic model of the objective function, which in this case is the model's validation metric (rmse, auc, precision, ...). With this strategy, the search is redirected in each iteration towards the regions of greatest interest. The ultimate goal is to reduce the number of hyperparameter combinations with which the model is evaluated, choosing only the best candidates. The advantage over the other strategies is therefore greatest when the search space is very large or the model evaluation is very slow.

One of the most widely used libraries for Bayesian optimization is Optuna.

# Hyperparameter search with Optuna
# ==============================================================================
# Adjust Optuna logging level to not show too much information
optuna.logging.set_verbosity(optuna.logging.WARNING)

def objective(trial):
    """
    Objective function for hyperparameter search.
    Adjusts the 'alpha' parameter of the model and evaluates performance using cross-validation.
    """
    # Propose value for 'model__alpha' within logarithmic range
    model__alpha = trial.suggest_float("model__alpha", 1e-6, 1e+3, log=True)
    
    # Set the 'alpha' value in the pipeline
    pipe.set_params(model__alpha=model__alpha)
    
    # Evaluate the model using cross-validation with negative RMSE score
    score = -np.mean(
                cross_val_score(
                    pipe,
                    X_train,
                    y_train,
                    cv=5,
                    n_jobs=-1,
                    scoring="neg_root_mean_squared_error"
                )
            )
    
    return score

# Create and optimize the study with 50 trials
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50, show_progress_bar=True)

# Show results
print(f"Best trial: {study.best_trial}")
print(f"\nBest validation score: {study.best_value}")
print(f"Best hyperparameters: {study.best_params}")
Best trial: FrozenTrial(number=47, state=1, values=[56904.67245507283], datetime_start=datetime.datetime(2025, 11, 30, 20, 2, 55, 753761), datetime_complete=datetime.datetime(2025, 11, 30, 20, 2, 55, 810761), params={'model__alpha': 4.577720439947758}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={'model__alpha': FloatDistribution(high=1000.0, log=True, low=1e-06, step=None)}, trial_id=47, value=None)

Best validation score: 56904.67245507283
Best hyperparameters: {'model__alpha': 4.577720439947758}

Preprocessing Tuning

In most cases, the optimization process focuses on the model's hyperparameters. However, it can also be very interesting to compare different preprocessing transformations. Thanks to pipelines, this can be done in the same way as with hyperparameters. See the following example, where adding interaction terms between predictors is compared against not adding them.

# Pipe: preprocessing + model
# ==============================================================================
# Transformations for numeric variables
numeric_transformer = Pipeline(
                        steps=[('scaler', StandardScaler())]
                      )

# Transformations for categorical variables
categorical_transformer = Pipeline(
                            steps=[('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))]
                          )

preprocessor = ColumnTransformer(
                    transformers=[
                        ('numeric', numeric_transformer, numeric_cols),
                        ('cat', categorical_transformer, cat_cols)
                    ],
                    remainder='passthrough',
                    verbose_feature_names_out = False
               ).set_output(transform="pandas")

# Preprocessing steps and model are combined in the same pipeline.
pipe = Pipeline(
        [('preprocessing', preprocessor),
        ('interactions', PolynomialFeatures(degree=2)),
        ('model', Ridge())])

# Hyperparameter grid
# ==============================================================================
param_grid = {'interactions': [PolynomialFeatures(degree=2), 'passthrough'],
              'model__alpha': np.logspace(-5, 3, 10)}

# Cross-validation search
# ==============================================================================
grid = GridSearchCV(
        estimator  = pipe,
        param_grid = param_grid,
        scoring    = 'neg_root_mean_squared_error',
        n_jobs     = cpu_count() - 1,
        cv         = RepeatedKFold(n_splits = 5, n_repeats = 5), 
        verbose    = 0,
        return_train_score = True
       )

grid.fit(X = X_train, y = y_train)

# Grid results
# ==============================================================================
results = pd.DataFrame(grid.cv_results_)
results.filter(regex = '(param.*|mean_t|std_t)')\
    .drop(columns = 'params')\
    .sort_values('mean_test_score', ascending = False)\
    .head(1)
param_interactions param_model__alpha mean_test_score std_test_score mean_train_score std_train_score
16 passthrough 2.154435 -57035.896444 4011.21685 -55923.521551 1058.94183

Algorithms

In the following sections, different machine learning models are trained with the objective of comparing them and identifying the one that achieves the best result in predicting house prices.

K-Nearest Neighbor (kNN)

K-Nearest Neighbor is one of the simplest machine learning algorithms. Its operation is as follows: to predict an observation, the K observations from the training set that most resemble it (based on their predictors) are identified and the average of the response variable in those observations is used as the predicted value. Given its simplicity, it usually gives worse results than other algorithms, but it is a good reference as a baseline.
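To make the mechanics concrete, the following sketch computes a kNN prediction by hand with NumPy on toy data; the arrays and the value of K are invented purely for illustration.

# Manual illustration of the kNN prediction mechanic (toy data)
# ==============================================================================
import numpy as np

X_toy = np.array([[1.0], [2.0], [3.0], [10.0], [11.0]])  # predictor
y_toy = np.array([100., 110., 120., 300., 310.])          # response
x_new = np.array([2.5])                                   # new observation
k = 3

# Euclidean distance from the new observation to each training observation
distances = np.sqrt(((X_toy - x_new) ** 2).sum(axis=1))
# Indices of the k nearest neighbors
nearest = np.argsort(distances)[:k]
# The prediction is the mean of the response in those neighbors
prediction = y_toy[nearest].mean()
print(prediction)  # mean of 100, 110 and 120 -> 110.0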

# Pipeline: preprocessing + model
# ==============================================================================
# Identification of numeric and categorical columns
numeric_cols = X_train.select_dtypes(include=['float64', 'int']).columns.to_list()
cat_cols = X_train.select_dtypes(include=['object', 'category']).columns.to_list()


# Transformations for numeric variables
numeric_transformer = Pipeline(
                        steps=[('scaler', StandardScaler())]
                      )

# Transformations for categorical variables
categorical_transformer = Pipeline(
                            steps=[('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))]
                          )

preprocessor = ColumnTransformer(
                    transformers=[
                        ('numeric', numeric_transformer, numeric_cols),
                        ('cat', categorical_transformer, cat_cols)
                    ],
                    remainder='passthrough',
                    verbose_feature_names_out = False
               ).set_output(transform="pandas")

# Preprocessing steps and model are combined in the same pipeline.
pipe = Pipeline([('preprocessing', preprocessor),
                 ('model', KNeighborsRegressor())])

# Hyperparameter optimization
# ==============================================================================
# Search space for each hyperparameter
param_distributions = {'model__n_neighbors': np.linspace(1, 100, 500, dtype=int)}

# Random grid search
grid = RandomizedSearchCV(
        estimator  = pipe,
        param_distributions = param_distributions,
        n_iter     = 20,
        scoring    = 'neg_root_mean_squared_error',
        n_jobs     = cpu_count() - 1,
        cv         = RepeatedKFold(n_splits = 5, n_repeats = 3), 
        refit      = True, 
        verbose    = 0,
        random_state = 123,
        return_train_score = True
       )

grid.fit(X = X_train, y = y_train)

# Grid results
# ==============================================================================
results = pd.DataFrame(grid.cv_results_)
results.filter(regex = '(param.*|mean_t|std_t)')\
    .drop(columns = 'params')\
    .sort_values('mean_test_score', ascending = False)\
    .head(1)
param_model__n_neighbors mean_test_score std_test_score mean_train_score std_train_score
12 7 -61169.195218 5654.381605 -52870.939049 1112.83243
# Cross-validation results plot for each hyperparameter
# ------------------------------------------------------------------------------
fig, ax = plt.subplots(figsize=(6, 3.84))
hyperparameter = 'param_model__n_neighbors'
results = results.sort_values(hyperparameter, ascending = False)
metric    = grid.scoring

results.plot(hyperparameter, 'mean_train_score', ax=ax)
results.plot(hyperparameter, 'mean_test_score', ax=ax)
ax.fill_between(results[hyperparameter].astype(int),
                results['mean_train_score'] + results['std_train_score'],
                results['mean_train_score'] - results['std_train_score'],
                alpha=0.2)
ax.fill_between(results[hyperparameter].astype(int),
                results['mean_test_score'] + results['std_test_score'],
                results['mean_test_score'] - results['std_test_score'],
                alpha=0.2)
ax.legend()
ax.set_title('CV error evolution')
ax.set_ylabel(metric);
fig, ax = plt.subplots(figsize=(8, 5))
results = results.sort_values('mean_test_score', ascending = True)
ax.barh(
    [str(d) for d in results['params']],
    results['mean_test_score'],
    xerr=results['std_test_score'],
    align='center',
    alpha=0
)
ax.plot(
    results['mean_test_score'],
    [str(d) for d in results['params']],
    marker="D",
    linestyle="",
    alpha=0.8,
    color="r"
)
ax.set_title('Hyperparameter Comparison')
ax.set_xlabel(metric);

Once the best hyperparameters have been identified, the model is retrained by indicating the optimal values in its arguments. If refit=True is indicated in GridSearchCV(), this retraining is done automatically and the resulting model is stored in .best_estimator_.

# Test error of the final model
# ==============================================================================
final_model = grid.best_estimator_
predictions = final_model.predict(X = X_test)
rmse_knn = root_mean_squared_error(
            y_true  = y_test,
            y_pred  = predictions
           )
print(f"The test error (rmse) is: {rmse_knn}")
The test error (rmse) is: 67303.99640453797

Linear Regression (Ridge and Lasso)

Linear regression is a statistical method that tries to model the linear relationship between a continuous variable (dependent variable, response, or target) and one or more independent variables (regressors or predictors) by fitting a linear equation. It is called simple linear regression when there is only one independent variable and multiple linear regression when there is more than one.

The linear regression model (Legendre, Gauss, Galton and Pearson) considers that, given a set of observations, the mean $\mu$ of the response variable $Y$ is linearly related to the regressor variable(s) $X$ according to the equation:

$$\mu_Y = \beta_0 + \beta_1 x_{1} + \beta_2 x_{2} + ... + \beta_p x_{p}$$

or in matrix notation (incorporating $\beta_0$ into the vector $\mathbf{\beta}$):

$$\mu_Y = \mathbf{X}^T \mathbf{\beta}$$

Additionally, the linear model can include regularization during fitting, which gives rise to the ridge regression, lasso, and elastic-net models.

Scikit-Learn incorporates 3 types of regularization for linear models with the objective of avoiding overfitting, reducing variance, and attenuating the effect of correlation between predictors. Generally, by applying regularization, models with greater predictive power (generalization) are achieved.

  • The lasso model is a least squares linear model that incorporates regularization that penalizes the sum of the absolute values of the regression coefficients $(||\beta||_1 = \sum_{k=1} ^p |\beta_k|)$. This penalty is known as l1 and has the effect of forcing predictor coefficients towards zero. Since a predictor with a zero regression coefficient does not influence the model, lasso manages to select the most influential predictors. The degree of penalty is controlled by the hyperparameter $\lambda$. When $\lambda = 0$, the result is equivalent to a linear model by ordinary least squares. As $\lambda$ increases, the greater the penalty and the more predictors are excluded.

  • The ridge model is a least squares linear model that incorporates regularization that penalizes the sum of the squared coefficients $(||\beta||^2_2 = \sum_{k=1} ^p \beta^2_k)$. This penalty is known as l2 and has the effect of proportionally reducing the value of all model coefficients but without them reaching zero. Like lasso, the degree of penalty is controlled by the hyperparameter $\lambda$.

The main practical difference between lasso and ridge is that the former makes some coefficients exactly zero, thus performing predictor selection, while the latter does not exclude any. This is a notable advantage of lasso in scenarios where not all predictors are important for the model and it is desired that the least influential ones be excluded. On the other hand, when there are highly correlated predictors (linearly), ridge reduces the influence of all of them at once and proportionally, while lasso tends to select one of them, giving it all the weight and excluding the rest. In the presence of correlations, this selection varies a lot with small perturbations (changes in training data), so lasso solutions are very unstable if predictors are highly correlated.
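This behavior can be illustrated with a small sketch on synthetic data: with a comparable penalty, Lasso sets several coefficients exactly to zero while Ridge only shrinks them. The dataset and the alpha values are arbitrary, chosen only for illustration.

# Lasso vs Ridge coefficient shrinkage (illustrative, synthetic data)
# ==============================================================================
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data in which only 3 of the 10 predictors are informative
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=10, random_state=123
)

ridge_demo = Ridge(alpha=10).fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=10).fit(X_demo, y_demo)

print("Coefficients set to zero by Ridge:", np.sum(ridge_demo.coef_ == 0))
print("Coefficients set to zero by Lasso:", np.sum(lasso_demo.coef_ == 0))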

To achieve an optimal balance between these two properties, what is known as elastic net penalty can be used, which combines both strategies.

The elastic net model includes regularization that combines the l1 and l2 penalties $\left( \lambda \left( \alpha ||\beta||_1 + \frac{1}{2}(1- \alpha)||\beta||^2_2 \right) \right)$. The degree to which each penalty contributes is controlled by the hyperparameter $\alpha$. Its value must lie in the interval [0, 1]: when $\alpha = 0$, ridge regression is applied, and when $\alpha = 1$, lasso is applied. The combination of both penalties usually yields good results. A frequently used strategy is to assign almost all the weight to the l1 penalty ($\alpha$ very close to 1) to select predictors, and a little to l2 to provide some stability in case some predictors are correlated.

Finding the best model involves identifying the optimal values of the regularization hyperparameters $\alpha$ and $\lambda$.

# Pipeline: preprocessing + model
# ==============================================================================
# Identification of numeric and categorical columns
numeric_cols = X_train.select_dtypes(include=['float64', 'int']).columns.to_list()
cat_cols = X_train.select_dtypes(include=['object', 'category']).columns.to_list()


# Transformations for numeric variables
numeric_transformer = Pipeline(
                        steps=[('scaler', StandardScaler())]
                      )

# Transformations for categorical variables
categorical_transformer = Pipeline(
                            steps=[('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))]
                          )

preprocessor = ColumnTransformer(
                    transformers=[
                        ('numeric', numeric_transformer, numeric_cols),
                        ('cat', categorical_transformer, cat_cols)
                    ],
                    remainder='passthrough',
                    verbose_feature_names_out = False
               ).set_output(transform="pandas")

# Preprocessing steps and model are combined in the same pipeline.
pipe = Pipeline([('preprocessing', preprocessor),
                 ('model', Ridge())])

# Hyperparameter optimization
# ==============================================================================
# Search space for each hyperparameter
param_distributions = {'model__alpha': np.logspace(-5, 5, 500)}

# Random grid search
grid = RandomizedSearchCV(
        estimator  = pipe,
        param_distributions = param_distributions,
        n_iter     = 20,
        scoring    = 'neg_root_mean_squared_error',
        n_jobs     = multiprocessing.cpu_count() - 1,
        cv         = RepeatedKFold(n_splits = 5, n_repeats = 3), 
        refit      = True, 
        verbose    = 0,
        random_state = 123,
        return_train_score = True
       )

grid.fit(X = X_train, y = y_train)

# Grid results
# ==============================================================================
results = pd.DataFrame(grid.cv_results_)
results.filter(regex = '(param.*|mean_t|std_t)')\
    .drop(columns = 'params')\
    .sort_values('mean_test_score', ascending = False)\
    .head(1)
param_model__alpha mean_test_score std_test_score mean_train_score std_train_score
10 3.396739 -56907.292901 4690.339075 -55945.219309 1227.446654
# Cross-validation results plot for each hyperparameter
# ==============================================================================
fig, ax = plt.subplots(figsize=(6, 3.84))
hyperparameter = 'param_model__alpha'
results = results.sort_values(hyperparameter, ascending = False)
metric    = grid.scoring

results.plot(hyperparameter, 'mean_train_score', ax=ax)
results.plot(hyperparameter, 'mean_test_score', ax=ax)
ax.fill_between(results[hyperparameter].astype(float),
                results['mean_train_score'] + results['std_train_score'],
                results['mean_train_score'] - results['std_train_score'],
                alpha=0.2)
ax.fill_between(results[hyperparameter].astype(float),
                results['mean_test_score'] + results['std_test_score'],
                results['mean_test_score'] - results['std_test_score'],
                alpha=0.2)
ax.legend()
ax.set_title('CV error evolution')
ax.set_ylabel(metric);
# Test error of the final model
# ==============================================================================
final_model = grid.best_estimator_
predictions = final_model.predict(X = X_test)
rmse_lm = root_mean_squared_error(
            y_true  = y_test,
            y_pred  = predictions,
          )
print(f"The test error (rmse) is: {rmse_lm}")
The test error (rmse) is: 65406.82894279458

Random Forest

A Random Forest model is made up of a set of individual decision trees, each trained with a slightly different sample of the training data generated through bootstrapping. The prediction of a new observation is obtained by aggregating the predictions of all the individual trees that make up the model. To learn more about this type of model, visit Random Forest with Python.

# Pipeline: preprocessing + model
# ==============================================================================
# Identification of numeric and categorical columns
numeric_cols = X_train.select_dtypes(include=['float64', 'int']).columns.to_list()
cat_cols = X_train.select_dtypes(include=['object', 'category']).columns.to_list()


# Transformations for numeric variables
numeric_transformer = Pipeline(
                        steps=[('scaler', StandardScaler())]
                      )

# Transformations for categorical variables
categorical_transformer = Pipeline(
                            steps=[('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))]
                          )

preprocessor = ColumnTransformer(
                    transformers=[
                        ('numeric', numeric_transformer, numeric_cols),
                        ('cat', categorical_transformer, cat_cols)
                    ],
                    remainder='passthrough',
                    verbose_feature_names_out = False
               ).set_output(transform="pandas")

# Preprocessing steps and model are combined in the same pipeline.
pipe = Pipeline([('preprocessing', preprocessor),
                 ('model', RandomForestRegressor())])

# Hyperparameter optimization
# ==============================================================================
# Search space for each hyperparameter

param_distributions = {
    'model__n_estimators': [50, 100, 1000, 2000],
    'model__max_features': [3, 5, 7, 1.0],
    'model__max_depth'   : [3, 5, 10, 20]
}

# Random grid search
grid = RandomizedSearchCV(
        estimator  = pipe,
        param_distributions = param_distributions,
        n_iter     = 20,
        scoring    = 'neg_root_mean_squared_error',
        n_jobs     = multiprocessing.cpu_count() - 1,
        cv         = RepeatedKFold(n_splits = 5, n_repeats = 3),
        refit      = True, 
        verbose    = 0,
        random_state = 123,
        return_train_score = True
       )

grid.fit(X = X_train, y = y_train)

# Grid results
# ==============================================================================
results = pd.DataFrame(grid.cv_results_)
results.filter(regex = '(param.*|mean_t|std_t)')\
    .drop(columns = 'params')\
    .sort_values('mean_test_score', ascending = False)\
    .head(1)
param_model__n_estimators param_model__max_features param_model__max_depth mean_test_score std_test_score mean_train_score std_train_score
13 100 7.0 20 -54659.470515 5036.819337 -20520.19286 402.006403
# Test error of the final model
# ==============================================================================
final_model = grid.best_estimator_
predictions = final_model.predict(X = X_test)
rmse_rf = root_mean_squared_error(
            y_true  = y_test,
            y_pred  = predictions,
          )
print(f"The test error (rmse) is: {rmse_rf}")
The test error (rmse) is: 61567.52103489776

Gradient Boosting Trees

A Gradient Boosting Trees model is made up of a set of individual decision trees, trained sequentially, so that each new tree tries to improve the errors of the previous trees. The prediction of a new observation is obtained by aggregating the predictions of all the individual trees that make up the model. To learn more about this type of model, visit Gradient Boosting with Python.

# Pipeline: preprocessing + model
# ==============================================================================
# Identification of numeric and categorical columns
numeric_cols = X_train.select_dtypes(include=['float64', 'int']).columns.to_list()
cat_cols = X_train.select_dtypes(include=['object', 'category']).columns.to_list()


# Transformations for numeric variables
numeric_transformer = Pipeline(
                        steps=[('scaler', StandardScaler())]
                      )

# Transformations for categorical variables
categorical_transformer = Pipeline(
                            steps=[('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))]
                          )

preprocessor = ColumnTransformer(
                    transformers=[
                        ('numeric', numeric_transformer, numeric_cols),
                        ('cat', categorical_transformer, cat_cols)
                    ],
                    remainder='passthrough',
                    verbose_feature_names_out = False
               ).set_output(transform="pandas")

# Preprocessing steps and model are combined in the same pipeline.
pipe = Pipeline([('preprocessing', preprocessor),
                 ('model', GradientBoostingRegressor())])

# Hyperparameter optimization
# ==============================================================================
# Search space for each hyperparameter

param_distributions = {
    'model__n_estimators': [50, 100, 1000, 2000],
    'model__max_features': [3, 5, 7, 1.0],
    'model__max_depth'   : [3, 5, 10, 20],
    'model__subsample'   : [0.5,0.7, 1]
}

# Random grid search
grid = RandomizedSearchCV(
        estimator  = pipe,
        param_distributions = param_distributions,
        n_iter     = 20,
        scoring    = 'neg_root_mean_squared_error',
        n_jobs     = multiprocessing.cpu_count() - 1,
        cv         = RepeatedKFold(n_splits = 5, n_repeats = 3),
        refit      = True, 
        verbose    = 0,
        random_state = 123,
        return_train_score = True
       )

grid.fit(X = X_train, y = y_train)

# Grid results
# ==============================================================================
results = pd.DataFrame(grid.cv_results_)
results.filter(regex = '(param.*|mean_t|std_t)')\
    .drop(columns = 'params')\
    .sort_values('mean_test_score', ascending = False)\
    .head(1)
param_model__subsample param_model__n_estimators param_model__max_features param_model__max_depth mean_test_score std_test_score mean_train_score std_train_score
17 1.0 100 7.0 5 -54175.092368 4480.817421 -24542.710342 521.692074
# Test error of the final model
# ==============================================================================
final_model = grid.best_estimator_
predictions = final_model.predict(X = X_test)
rmse_gbm = root_mean_squared_error(
            y_true  = y_test,
            y_pred  = predictions,
          )
print(f"The test error (rmse) is: {rmse_gbm}")
The test error (rmse) is: 62311.56422295233

Stacking

In general, the term model ensembling refers to combining the predictions of two or more different models with the goal of improving the final predictions. This strategy is based on the assumption that different models trained independently use different aspects of the data to make predictions—that is, each one is able to identify part of the "truth" but not all of it. By combining the perspective of each, a more detailed description of the true underlying structure in the data is obtained. As an analogy, imagine a group of students taking a multidisciplinary exam. Although they all get approximately the same grade, each one will have scored more points on questions about the disciplines in which they excel. If instead of taking the exam individually they take it as a group, each one can contribute in the areas they master most, and the final result will probably be superior to any of the individual results.

The key for ensembling to improve results is model diversity. If all the combined models are similar to each other, they won't be able to compensate for one another. For this reason, one should try to combine models that perform as well as possible individually and are as different from each other as possible.

The simplest ways to combine predictions from multiple models are using the mean for regression problems and the mode for classification problems. However, there are more complex approaches capable of achieving better results:

  • Weight the aggregations by giving different weights to each model, for example, in proportion to the accuracy they have achieved individually.
  • Super Learner (stacked regression): use a new model to decide the best way to combine the predictions of the other models.

Super Learner Algorithm

The Super Learner implementation available in scikit-learn through the StackingRegressor and StackingClassifier classes follows this algorithm:

Ensemble Definition

  1. Define a list of base algorithms (each with the appropriate hyperparameters).

  2. Select the metalearning algorithm that defines how the higher-level model is trained. By default, RidgeCV is used for regression and LogisticRegression for classification.

Ensemble Training

  1. Train each of the base algorithms with the training set.

  2. Perform k-fold cross-validation with each of the base algorithms and store the predictions made in each of the k partitions.

  3. Combine the predictions from step 2 into a single NxL matrix (N = number of observations in the training set, L = number of base models).

  4. Train the metalearner with the response variable and the NxL matrix as predictors.

  5. The final Super Learner is composed of the base models and the metalearning model.

Prediction

  1. Predict the new observation with each of the base models.

  2. Use the predictions from the base models as input to the metalearner to obtain the final prediction.

We proceed to create a stacking with Ridge and RandomForest models, using the best hyperparameters found in the previous sections for each.

# Pipeline: preprocessing + models for the stacking
# ==============================================================================
# Identify numeric and categorical columns
numeric_cols = X_train.select_dtypes(include=['float64', 'int']).columns.to_list()
cat_cols = X_train.select_dtypes(include=['object', 'category']).columns.to_list()

# Transformations for numeric variables
numeric_transformer = Pipeline(
                        steps=[('scaler', StandardScaler())]
                      )

# Transformations for categorical variables
categorical_transformer = Pipeline(
                            steps=[('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))]
                          )

preprocessor = ColumnTransformer(
                    transformers=[
                        ('numeric', numeric_transformer, numeric_cols),
                        ('cat', categorical_transformer, cat_cols)
                    ],
                    remainder='passthrough',
                    verbose_feature_names_out = False
               ).set_output(transform="pandas")

# Combine preprocessing steps and models by creating several pipelines.
pipe_ridge = Pipeline([('preprocessing', preprocessor),
                     ('ridge', Ridge(alpha=3.4))])

pipe_rf = Pipeline([('preprocessing', preprocessor),
                     ('random_forest', RandomForestRegressor(
                                         n_estimators = 1000,
                                         max_features = 7,
                                         max_depth    = 20
                                        )
                     )])
# Definition and training of the StackingRegressor
# ==============================================================================
estimators = [('ridge', pipe_ridge),
              ('random_forest', pipe_rf)]

stacking_regressor = StackingRegressor(estimators=estimators,
                                       final_estimator=RidgeCV())
# Assign the result to _ so it doesn't print to screen
_ = stacking_regressor.fit(X = X_train, y = y_train)
# Test error of the stacking
# ==============================================================================
final_model = stacking_regressor
predictions = final_model.predict(X = X_test)
rmse_stacking = root_mean_squared_error(
                    y_true  = y_test,
                    y_pred  = predictions
                  )
print(f"The test error (rmse) is: {rmse_stacking}")
The test error (rmse) is: 60752.873757270354

Comparison

The test error of all trained models is compared.

error_models = pd.DataFrame({
                        'model': ['knn', 'lm', 'random forest', 'gradient boosting',
                                   'stacking'],
                        'rmse': [rmse_knn, rmse_lm, rmse_rf, rmse_gbm, rmse_stacking]
                     })
error_models = error_models.sort_values('rmse', ascending=False)

fig, ax = plt.subplots(figsize=(6, 3.84))
ax.hlines(error_models.model, xmin=0, xmax=error_models.rmse)
ax.plot(error_models.rmse, error_models.model, "o", color='black')
ax.tick_params(axis='y', which='major', labelsize=12)
ax.set_title('Model Test Error Comparison')
ax.set_xlabel('Test rmse');

Annexes

Annex 1: Validation Methods

Validation methods, also known as resampling, are strategies that allow estimating the predictive capacity of models when applied to new observations, using only the training data. The idea behind all of them is the following: the model is fitted using a subset of observations from the training set and is evaluated (calculating a metric that measures how good the model is, for example, accuracy) with the remaining observations. This process is repeated multiple times and the results are aggregated and averaged. Thanks to the repetitions, possible deviations that may arise from the random allocation of observations are compensated. The difference between methods is usually the way in which the training/validation subsets are generated.

k-Fold-Cross-Validation (CV)

The training observations are divided into k folds (sets) of the same size. The model is fitted with all observations except those in the first fold and is evaluated by predicting the observations of the excluded fold, thus obtaining the first metric. The process is repeated k times, excluding a different fold in each iteration. At the end, k values of the metric are generated, which are aggregated (usually with the mean and standard deviation) generating the final validation estimate.
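As an illustration, the partitions generated by k-Fold can be inspected directly with the KFold class; the toy array below is used only to show which indices go to each fold.

# k-Fold split illustration (toy data)
# ==============================================================================
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(20).reshape(10, 2)
cv = KFold(n_splits=5, shuffle=True, random_state=123)

for i, (train_idx, validation_idx) in enumerate(cv.split(X_demo)):
    print(f"Fold {i}: train={train_idx}, validation={validation_idx}")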

Leave-One-Out Cross-Validation (LOOCV)

LOOCV is a special case of k-Fold-Cross-Validation in which the number k of folds equals the number of observations available in the training set. The model is fitted each time with all observations except one, which is used to evaluate the model. This method involves a very high computational cost—the model is fitted as many times as there are training observations—so in practice, it usually doesn't pay off to use it.
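In scikit-learn, this strategy is available through the LeaveOneOut class, which can be passed as the cv argument of cross_val_score(). A minimal sketch, reusing the pipe, X_train, and y_train objects from the analysis (keep in mind the computational cost mentioned above):

# Leave-One-Out cross-validation sketch
# ==============================================================================
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Warning: the model is fitted as many times as there are training observations
loocv_scores = cross_val_score(
                   estimator = pipe,
                   X         = X_train,
                   y         = y_train,
                   scoring   = 'neg_root_mean_squared_error',
                   cv        = LeaveOneOut(),
                   n_jobs    = -1
               )
print(loocv_scores.mean())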

Repeated k-Fold-Cross-Validation (repeated CV)

It is exactly the same as the k-Fold-Cross-Validation method but repeating the complete process n times. For example, 10-Fold-Cross-Validation with 5 repetitions implies a total of 50 fit-validation iterations, but it is not equivalent to 50-Fold-Cross-Validation.
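The number of fit-validation iterations can be checked directly with the RepeatedKFold class used throughout this document:

# Number of iterations in repeated k-Fold
# ==============================================================================
from sklearn.model_selection import RepeatedKFold

cv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=123)
print(cv.get_n_splits())  # 10 folds x 5 repetitions = 50 iterations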

Leave-Group-Out Cross-Validation (LGOCV)

LGOCV, also known as repeated train/test splits or Monte Carlo Cross-Validation, simply consists of generating multiple random train-test divisions (only two sets per repetition). The proportion of observations going to each set is determined beforehand; 80%-20% usually gives good results. This method, although simpler to implement than CV, requires many repetitions (>50) to generate stable estimates.
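In scikit-learn, this strategy corresponds to the ShuffleSplit class. A minimal sketch with an 80%-20% split and 50 repetitions, reusing the pipe, X_train, and y_train objects from the analysis:

# Monte Carlo cross-validation (repeated train/test splits) sketch
# ==============================================================================
from sklearn.model_selection import ShuffleSplit, cross_val_score

cv = ShuffleSplit(n_splits=50, test_size=0.2, random_state=123)
mc_scores = cross_val_score(
                estimator = pipe,
                X         = X_train,
                y         = y_train,
                scoring   = 'neg_root_mean_squared_error',
                cv        = cv,
                n_jobs    = -1
            )
print(mc_scores.mean())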

Bootstrapping

A bootstrap sample is a sample obtained from the original sample by random sampling with replacement, and of the same size as the original sample. Random sampling with replacement (resampling with replacement) means that, after an observation is drawn, it is made available again for subsequent draws. As a result of this type of sampling, some observations will appear multiple times in the bootstrap sample and others not at all. Observations not selected are called out-of-bag (OOB). For each bootstrapping iteration, a new bootstrap sample is generated, the model is fitted with it, and is evaluated with the out-of-bag observations.


  1. Obtain a new sample of the same size as the original sample through random sampling with replacement.

  2. Fit the model using the new sample generated in step 1.

  3. Calculate the model error using those observations from the original sample that have not been included in the new sample. This error is known as the validation error.

  4. Repeat the process n times and calculate the mean of the n validation errors.

  5. Finally, and after the n repetitions, fit the final model using all the original training observations.
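A minimal sketch of this procedure, using NumPy sampling with replacement and evaluating on the out-of-bag observations; the pipe, X_train, and y_train objects from the analysis are reused and the number of repetitions is arbitrary.

# Bootstrapping validation sketch
# ==============================================================================
import numpy as np
from sklearn.base import clone
from sklearn.metrics import root_mean_squared_error

n_iterations = 25
rng = np.random.default_rng(123)
oob_errors = []

for i in range(n_iterations):
    # 1. Bootstrap sample: same size as the original, sampled with replacement
    boot_idx = rng.choice(len(X_train), size=len(X_train), replace=True)
    oob_idx  = np.setdiff1d(np.arange(len(X_train)), boot_idx)

    # 2. Fit the model with the bootstrap sample
    model = clone(pipe)
    model.fit(X_train.iloc[boot_idx], y_train.iloc[boot_idx])

    # 3. Validation error on the out-of-bag observations
    oob_pred = model.predict(X_train.iloc[oob_idx])
    oob_errors.append(root_mean_squared_error(y_true=y_train.iloc[oob_idx], y_pred=oob_pred))

# 4. Mean of the n validation errors
print(np.mean(oob_errors))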


The nature of the bootstrapping process generates some bias in the estimates, which can be problematic when the training set is small. There are modifications of the original algorithm to correct this problem, such as the .632 and .632+ methods.

There is no validation method that outperforms all others in every scenario; the choice should be based on several factors.

  • If the sample size is small, repeated k-Fold-Cross-Validation is recommended, as it achieves a good bias-variance balance and, since there are not many observations, the computational cost is not excessive.

  • If the main objective is to compare models rather than obtain a precise estimate of the metrics, bootstrapping is recommended since it has less variance.

  • If the sample size is very large, the difference between methods is reduced and computational efficiency becomes more important. In these cases, simple 10-Fold-Cross-Validation is sufficient.

A comparative study of different methods can be found in Comparing Different Species of Cross-Validation.

Annex 2: Metrics

There is a great variety of metrics that allow evaluating how good an algorithm is at making predictions. The suitability of each one depends entirely on the problem in question, and its correct choice will depend on how well the analyst understands the problem they face. Below, some of the most commonly used are described.

Accuracy and Kappa

These two metrics are the most commonly used in binary and multiclass classification problems. Accuracy is the percentage of observations correctly classified out of the total number of predictions. Kappa, or Cohen's Kappa, is the accuracy normalized with respect to the percentage of correct classifications expected by chance. Unlike accuracy, whose range is [0, 1], kappa takes values in [-1, 1]. In problems with imbalanced classes, where the majority group far exceeds the others, kappa is more useful because it avoids the illusion of believing that a model is good when it really only slightly exceeds what would be expected by chance.
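As an illustration, both metrics can be computed with sklearn.metrics; the label vectors below are invented and deliberately imbalanced so that accuracy looks better than kappa.

# Accuracy and Cohen's Kappa (illustrative)
# ==============================================================================
from sklearn.metrics import accuracy_score, cohen_kappa_score

y_true_demo = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred_demo = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

print(f"Accuracy: {accuracy_score(y_true_demo, y_pred_demo)}")
print(f"Kappa   : {cohen_kappa_score(y_true_demo, y_pred_demo)}")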

MSE, RMSE, MAE

These are the most commonly used metrics in regression problems.

MSE (Mean Squared Error) is the mean of the squared errors. It is usually a very good indicator of how the model performs in general, but it has the disadvantage of being in squared units. To improve interpretation, RMSE (Root Mean Squared Error) is often used, which is the square root of MSE and therefore its units are the same as the response variable.

MAE (Mean Absolute Error) is the mean of the absolute errors. The difference from MSE is that the latter squares the errors, which means it penalizes large deviations much more. In general, MSE favors models that perform approximately equally well across all observations, while MAE favors models that predict the vast majority of observations very well even if they are very wrong on a few.
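All three metrics are available in sklearn.metrics; a minimal sketch on invented values:

# MSE, RMSE and MAE (illustrative)
# ==============================================================================
from sklearn.metrics import mean_squared_error, root_mean_squared_error, mean_absolute_error

y_true_demo = [100, 150, 200, 250]
y_pred_demo = [110, 140, 220, 240]

print(f"MSE : {mean_squared_error(y_true_demo, y_pred_demo)}")
print(f"RMSE: {root_mean_squared_error(y_true_demo, y_pred_demo)}")
print(f"MAE : {mean_absolute_error(y_true_demo, y_pred_demo)}")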

Session Information

import session_info
session_info.show(html=False)
-----
joblib              1.5.2
matplotlib          3.10.7
numpy               2.2.6
optuna              4.5.0
pandas              2.3.3
seaborn             0.13.2
session_info        v1.0.1
sklearn             1.7.2
statsmodels         0.14.5
-----
IPython             9.6.0
jupyter_client      8.6.3
jupyter_core        5.9.1
jupyterlab          4.5.0
notebook            6.5.7
-----
Python 3.13.9 | packaged by Anaconda, Inc. | (main, Oct 21 2025, 19:16:10) [GCC 11.2.0]
Linux-6.14.0-36-generic-x86_64-with-glibc2.39
-----
Session information updated at 2025-11-30 20:11

Bibliography

Pedregosa et al. Scikit-learn: Machine Learning in Python. JMLR 12, pp. 2825-2830, 2011.

Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani. An Introduction to Statistical Learning. Springer.

Max Kuhn, Kjell Johnson. Applied Predictive Modeling. Springer.

T. Hastie, R. Tibshirani, J. Friedman. The Elements of Statistical Learning. Springer, 2009.

How to Cite

How to cite this document?

If you use this document or any part of it, we appreciate you citing it. Thank you very much!

Machine learning with Python and Scikit-learn by Joaquín Amat Rodrigo, available under an Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0 DEED) license at https://www.cienciadedatos.net/documentos/py06-machine-learning-python-scikitlearn-en.html



This document created by Joaquín Amat Rodrigo is licensed under Attribution-NonCommercial-ShareAlike 4.0 International.

You are allowed to:

  • Share: copy and redistribute the material in any medium or format.

  • Adapt: remix, transform, and build upon the material.

Under the following terms:

  • Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

  • NonCommercial: You may not use the material for commercial purposes.

  • ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.