Introduction¶
Linear regression is a statistical method that aims to model the relationship between a continuous dependent variable and one or more independent variables by fitting a linear equation. Three of the limitations that arise in practice when trying to use this type of model (fitted by ordinary least squares) are:
- They are negatively affected by the incorporation of correlated predictors.
- They do not perform automatic predictor selection; all predictors are incorporated into the model even if they do not provide relevant information. This often complicates the interpretation of the model and reduces its predictive capacity.
- They cannot be fitted when the number of predictors is greater than the number of observations.
One way to mitigate the impact of these problems is to use regularization strategies such as ridge, Lasso, or Elastic Net, which force the model coefficients to tend toward zero, thus minimizing the risk of overfitting, reducing variance, attenuating the effect of correlation between predictors, and reducing the influence of less relevant predictors on the model.
Throughout this document, the theoretical foundations of linear regression with ridge, Lasso, and Elastic Net regularization are progressively described, along with the main practical aspects to consider and examples of how to create this type of model in Python and Scikit-learn.
Linear regression¶
The linear regression model (Legendre, Gauss, Galton, and Pearson) considers that, given a set of observations $\{y_i, x_{i1},...,x_{ip}\}^{n}_{i=1}$, the mean ($\mu$) of the response variable $y$ is linearly related to the predictor variables $x_1$ ... $x_p$ according to the equation:
$$\mu_y = \beta_0 + \beta_1 x_{1} + \beta_2 x_{2} + ... + \beta_p x_{p}$$

The result of this equation is known as the population regression line, and it captures the relationship between the predictors and the mean of the response variable.
Another definition frequently found in statistics textbooks is:
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + ... + \beta_p x_{ip} + \epsilon_i$$

In this case, reference is being made to the value of $y$ for a specific observation $i$. The value of a specific observation will never be exactly equal to the average, which is why the error term $\epsilon$ is added.
In both cases, the interpretation of the model elements is the same:
- $\beta_{0}$: the intercept, which corresponds to the average value of the response variable $y$ when all predictors are zero.
- $\beta_{j}$: the average effect on the response variable of a one-unit increase in the predictor variable $x_{j}$, keeping the other variables constant. These are known as partial regression coefficients.
- $\epsilon$: the residual or error, the difference between the observed value and the value estimated by the model. It captures the effect of all variables that influence $y$ but are not included in the model as predictors.
In the vast majority of cases, the population values $\beta_0$ and $\beta_j$ are unknown, so their estimates $\hat{\beta}_0$ and $\hat{\beta}_j$ are obtained from a sample. Fitting the model consists of estimating, from the available data, the values of the regression coefficients that maximize the likelihood, i.e., those that give rise to the model that most likely could have generated the observed data. The most frequently used method is ordinary least squares (OLS) fitting, which identifies as the best model the line (or plane in multiple regression) that minimizes the sum of the squared vertical deviations between each training data point and the line.
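In compact notation (a standard formulation, not specific to this document, with the predictors collected in a matrix $X$ that includes a column of ones for the intercept), the OLS estimate is the vector of coefficients that minimizes the residual sum of squares:

$$\hat{\beta} = \underset{\beta}{\arg\min} \sum^n_{i=1} \left( y_i - \beta_0 - \sum^p_{j=1} \beta_j x_{ij} \right)^2 = \underset{\beta}{\arg\min} \; ||y - X\beta||^2_2$$

When $X^TX$ is invertible, this problem has the well-known closed-form solution $\hat{\beta} = (X^TX)^{-1}X^Ty$.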
The regression coefficients therefore control the influence of each predictor on the model. However, the magnitude of each coefficient depends on the units in which its predictor is measured, so magnitude alone does not reflect importance. For the coefficients to be comparable (the closer a coefficient is to zero, the less that predictor influences the response variable relative to the rest), all predictors must be standardized before fitting the model.
Regularization¶
Regularization strategies incorporate penalties in ordinary least squares (OLS) fitting with the objective of avoiding overfitting, reducing variance, attenuating the effect of correlation between predictors, and minimizing the influence of less relevant predictors on the model. In general, applying regularization results in models with greater predictive power (generalization).
⚠️ Warning
Since these regularization methods act on the magnitude of the model coefficients, all must be on the same scale, which is why it is necessary to standardize or normalize the predictors before training the model.
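As a minimal sketch of this recommendation (the simulated data and the value alpha=1.0 are illustrative, not part of the example developed later), the scaler can be combined with the regularized model in a scikit-learn pipeline so that the standardization is learned only from the training data:

# Sketch: standardization inside a pipeline (illustrative data)
# ==============================================================================
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 5)) * np.array([1, 10, 100, 0.1, 5])  # predictors on very different scales
y_demo = X_demo @ np.array([1.0, 0.5, 0.01, 3.0, 0.0]) + rng.normal(size=50)

# The scaler is fitted together with the model, so no scaling information leaks from test data
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
pipe.fit(X_demo, y_demo)
print(pipe.named_steps['ridge'].coef_)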
Ridge¶
Ridge regularization penalizes the sum of the squared coefficients $(||\beta||^2_2 = \sum_{j=1} ^p \beta^2_j)$. This penalty is known as l2 and has the effect of shrinking all model coefficients without any of them reaching exactly zero. The degree of penalization is controlled by the hyperparameter $\lambda$. When $\lambda = 0$, the penalty is null and the result is equivalent to an ordinary least squares (OLS) linear model. The larger the value of $\lambda$, the greater the penalty and the smaller the coefficients.
$$\sum^n_{i=1}(y_i - \beta_0 - \sum^p_{j=1} \beta_j x_{ij})^2 + \lambda \sum^p_{j=1} \beta_j^2 = \text{residual sum of squares} + \lambda \sum^p_{j=1} \beta_j^2$$

The main advantage of applying ridge versus ordinary least squares (OLS) fitting is variance reduction. In general, in situations where the relationship between the response variable and the predictors is approximately linear, ordinary least squares estimates have little bias but can still suffer from high variance (small changes in the training data have a large impact on the resulting model). This problem intensifies as the number of predictors introduced into the model approaches the number of training observations, to the point where, if $p>n$, it is not possible to fit the model by ordinary least squares. Using an appropriate value of $\lambda$, the ridge method is capable of reducing variance while barely increasing the bias, thus achieving a lower total error.
The disadvantage of the ridge method is that the final model includes all predictors. This is because, although the penalty forces the coefficients to tend toward zero, they never reach exactly zero (only if $\lambda=\infty$). This method manages to minimize the influence on the model of predictors less related to the response variable, but they will continue to appear in the final model. Although this does not pose a problem for the model's accuracy, it does for its interpretation.
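For reference (a standard result, assuming centered predictors and an unpenalized intercept), the ridge coefficients also have a closed-form expression:

$$\hat{\beta}^{ridge} = (X^TX + \lambda I)^{-1}X^Ty$$

The addition of $\lambda I$ makes the matrix invertible even when predictors are highly correlated or when $p > n$, which is why ridge can be fitted in situations where ordinary least squares cannot.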
Lasso¶
Lasso regularization penalizes the sum of the absolute values of the regression coefficients $(||\beta||_1 = \sum_{j=1} ^p |\beta_j|)$. This penalty is known as l1 and has the effect of forcing the predictor coefficients to tend toward zero. Since a predictor with a regression coefficient of zero does not influence the model, Lasso manages to exclude the least relevant predictors. As with ridge, the degree of penalization is controlled by the hyperparameter $\lambda$. When $\lambda = 0$, the result is equivalent to an ordinary least squares linear model. As $\lambda$ increases, the penalty becomes greater and more predictors are excluded.
$$\sum^n_{i=1}(y_i - \beta_0 - \sum^p_{j=1} \beta_j x_{ij})^2 + \lambda \sum^p_{j=1} |\beta_j| = \text{residual sum of squares} + \lambda \sum^p_{j=1} |\beta_j|$$

Comparison of Ridge and Lasso¶
The main practical difference between lasso and ridge is that the former makes some coefficients exactly zero, thus performing predictor selection, while the latter does not exclude any. This represents a notable advantage of lasso in scenarios where not all predictors are important for the model and it is desired that the least influential ones are excluded.
On the other hand, when there are highly correlated predictors (linearly), ridge reduces the influence of all of them simultaneously and proportionally, while lasso tends to select one of them, giving it all the weight and excluding the rest. In the presence of correlations, this selection varies greatly with small perturbations (changes in training data), so lasso solutions are very unstable if the predictors are highly correlated.
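A small simulated example (toy data, arbitrary penalty values) illustrates this behavior: with two almost identical predictors, ridge shares the coefficient between them, while lasso tends to concentrate the weight on one of the two.

# Sketch: ridge vs lasso with two highly correlated predictors (simulated data)
# ==============================================================================
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(123)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)   # x2 is almost a copy of x1
X_toy = np.column_stack([x1, x2])
y_toy = 3 * x1 + rng.normal(scale=0.5, size=200)

print(Ridge(alpha=1.0).fit(X_toy, y_toy).coef_)  # weight shared between x1 and x2
print(Lasso(alpha=0.1).fit(X_toy, y_toy).coef_)  # weight typically concentrated on one predictor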
To achieve an optimal balance between these two properties, what is known as elastic net penalization can be used, which combines both strategies.
Elastic net¶
Elastic Net includes a regularization that combines the l1 and l2 penalties $(\alpha \lambda ||\beta||_1 + \frac{\lambda}{2}(1- \alpha)||\beta||^2_2)$. The degree to which each penalty contributes is controlled by the hyperparameter $\alpha$, whose value lies in the interval [0,1]. When $\alpha = 0$, ridge is applied, and when $\alpha = 1$, Lasso is applied. The combination of both penalties often yields good results. A frequently used strategy is to assign almost all the weight to the l1 penalty ($\alpha$ very close to 1) to achieve predictor selection, and a little to the l2 penalty to provide some stability in case some predictors are correlated.
$$\frac{\sum^n_{i=1}(y_i - \beta_0 - \sum^p_{j=1} \beta_j x_{ij})^2}{2n} + \lambda \left( \alpha \sum^p_{j=1} |\beta_j| + \frac{1-\alpha}{2} \sum^p_{j=1} \beta_j^2 \right)$$

Lambda Search¶
Finding the best model involves identifying the optimal value of the regularization hyperparameter lambda ($\lambda$). Being a hyperparameter, there is no way to know in advance what value is appropriate. One way to achieve this is to use cross-validation or generalized cross-validation, the latter being an efficient adaptation of leave-one-out cross-validation available for Ridge regularization. With scikit-learn this can be easily achieved in two ways:
- Combine GridSearchCV and a model of type Ridge, Lasso, or ElasticNet.
- Use RidgeCV, LassoCV, or ElasticNetCV, which are estimators that incorporate cross-validation. If no value is specified for the cv argument, RidgeCV applies Generalized Cross-Validation (GCV). If an integer value is provided, standard k-fold cross-validation is used. For example, cv=10 corresponds to 10-fold cross-validation instead of GCV. RidgeCV, LassoCV, and ElasticNetCV use Mean Squared Error (MSE) as the error metric. To use a different metric, GridSearchCV must be used instead.
In general, the versions that incorporate cross-validation are more straightforward and convenient to use; however, there are situations in which they are not suitable. For example, if the data have a temporal order (time series), the cross-validation splits must respect that order, which is achieved by combining sklearn.model_selection.TimeSeriesSplit with GridSearchCV (see the sketch below). For more information on how to use GridSearchCV, see machine learning with scikit-learn.
Whether one method or another is used, predictors must always be standardized or normalized.
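As a minimal sketch of the GridSearchCV route (simulated data; the alpha grid and the number of splits are illustrative), including the TimeSeriesSplit variant mentioned above:

# Sketch: alpha search with GridSearchCV and, for ordered data, TimeSeriesSplit
# ==============================================================================
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 8))
y_demo = 2 * X_demo[:, 0] + rng.normal(size=120)

pipe = make_pipeline(StandardScaler(), Ridge())
param_grid = {'ridge__alpha': np.logspace(-3, 3, 20)}

# Standard k-fold cross-validation
grid = GridSearchCV(pipe, param_grid, cv=5, scoring='neg_root_mean_squared_error')
grid.fit(X_demo, y_demo)
print(grid.best_params_)

# For temporally ordered observations, the splits respect the time order
grid_ts = GridSearchCV(
    pipe,
    param_grid,
    cv      = TimeSeriesSplit(n_splits=5),
    scoring = 'neg_root_mean_squared_error'
)
grid_ts.fit(X_demo, y_demo)
print(grid_ts.best_params_)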
Although the optimal value of $\lambda$ is the one that minimizes the cross-validation error, a common practice is to use, instead, the largest value of $\lambda$ whose cross-validation error is within one standard deviation of the minimum. In this way, a simpler model is obtained (more predictors are excluded) whose predictive capacity is similar to that of the more complex model.
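A minimal sketch of this rule (the arrays below are invented, only to illustrate the selection logic):

# Sketch: one-standard-deviation rule for selecting lambda (illustrative numbers)
# ==============================================================================
import numpy as np

alphas_demo  = np.array([0.01, 0.1, 1.0, 10.0, 100.0])
rmse_cv_demo = np.array([2.10, 2.00, 2.05, 2.30, 3.50])  # mean CV error for each alpha
rmse_sd_demo = np.array([0.15, 0.12, 0.14, 0.20, 0.40])  # standard deviation of the CV error

idx_min   = np.argmin(rmse_cv_demo)
threshold = rmse_cv_demo[idx_min] + rmse_sd_demo[idx_min]
# Largest alpha whose CV error is still within one standard deviation of the minimum
alpha_1sd = alphas_demo[rmse_cv_demo <= threshold].max()
print(alpha_1sd)  # 1.0 with these numbers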
✏️ Note
scikit-learn uses the alpha argument to refer to the regularization parameter lambda ($\lambda$) and the l1_ratio argument to refer to the mixing parameter alpha ($\alpha$) in Elastic Net.
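For example (values purely illustrative), a model with regularization $\lambda = 0.5$ and mixing parameter $\alpha = 0.9$ is written in scikit-learn as:

# Sketch: scikit-learn naming of the Elastic Net parameters (illustrative values)
# ==============================================================================
from sklearn.linear_model import ElasticNet

# lambda = 0.5 is passed as alpha, alpha = 0.9 is passed as l1_ratio
modelo_enet = ElasticNet(alpha=0.5, l1_ratio=0.9)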
Regression example¶
The quality department of a food company is responsible for measuring the fat content of the meat it sells. This study is carried out using chemical analytical techniques, a relatively costly process in terms of time and resources. An alternative that would reduce costs and save time is to use a spectrophotometer (an instrument that measures how much light a material absorbs at different wavelengths) and infer the fat content from its measurements.
Before validating this new technique, the company needs to verify what margin of error it has compared to chemical analysis. To do this, the absorbance spectrum at 100 wavelengths is measured on 215 meat samples, whose fat content is also obtained by chemical analysis, and a model is trained with the objective of predicting fat content based on the values given by the spectrophotometer.
The data meatspec.csv used in this example have been obtained from the magnificent book Linear Models with R, Second Edition.
Libraries¶
The libraries used in this example are:
# Data manipulation
# ==============================================================================
import pandas as pd
import numpy as np
# Plots
# ==============================================================================
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
plt.rcParams['lines.linewidth'] = 1.5
plt.rcParams['font.size'] = 8
import seaborn as sns
# Preprocessing and modeling
# ==============================================================================
from scipy.stats import pearsonr
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import LassoCV
from sklearn.linear_model import ElasticNetCV
# Warnings configuration
# ==============================================================================
import warnings
warnings.filterwarnings('once')
Data¶
The dataset contains 101 columns. The first 100, named $V1$, ..., $V100$, record the absorbance value for each of the 100 analyzed wavelengths (predictors), and the fat column contains the fat content measured by chemical techniques (response variable).
# Data
# ==============================================================================
datos = pd.read_csv('meatspec.csv')
datos = datos.drop(columns = datos.columns[0])
datos.info()
datos.head(3)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 215 entries, 0 to 214
Columns: 101 entries, V1 to fat
dtypes: float64(101)
memory usage: 169.8 KB
| | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | ... | V92 | V93 | V94 | V95 | V96 | V97 | V98 | V99 | V100 | fat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.61776 | 2.61814 | 2.61859 | 2.61912 | 2.61981 | 2.62071 | 2.62186 | 2.62334 | 2.62511 | 2.62722 | ... | 2.98145 | 2.96072 | 2.94013 | 2.91978 | 2.89966 | 2.87964 | 2.85960 | 2.83940 | 2.81920 | 22.5 |
| 1 | 2.83454 | 2.83871 | 2.84283 | 2.84705 | 2.85138 | 2.85587 | 2.86060 | 2.86566 | 2.87093 | 2.87661 | ... | 3.29186 | 3.27921 | 3.26655 | 3.25369 | 3.24045 | 3.22659 | 3.21181 | 3.19600 | 3.17942 | 40.1 |
| 2 | 2.58284 | 2.58458 | 2.58629 | 2.58808 | 2.58996 | 2.59192 | 2.59401 | 2.59627 | 2.59873 | 2.60131 | ... | 2.68951 | 2.67009 | 2.65112 | 2.63262 | 2.61461 | 2.59718 | 2.58034 | 2.56404 | 2.54816 | 8.4 |
3 rows × 101 columns
Relationship between variables¶
The first step when building a multiple linear model is to study the relationship between the variables. This information is critical for identifying which variables may be the best predictors for the model and for detecting collinearity between predictors. As a complement, it is recommended to represent the distribution of each variable using histograms.
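As a quick complement (a sketch not included in the original analysis; the number of bins is arbitrary), the distribution of the response variable can be inspected with a histogram:

# Distribution of the response variable (sketch)
# ==============================================================================
fig, ax = plt.subplots(figsize=(5, 3))
datos['fat'].hist(bins=30, ax=ax)
ax.set_xlabel('fat')
ax.set_ylabel('count')
ax.set_title('Distribution of fat content');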
# Correlation between numeric columns
# ==============================================================================
def tidy_corr_matrix(corr_mat):
'''
Function to convert a pandas correlation matrix to tidy format
'''
corr_mat = corr_mat.stack().reset_index()
corr_mat.columns = ['variable_1','variable_2','r']
corr_mat = corr_mat.loc[corr_mat['variable_1'] != corr_mat['variable_2'], :]
corr_mat['abs_r'] = np.abs(corr_mat['r'])
corr_mat = corr_mat.sort_values('abs_r', ascending=False)
return(corr_mat)
corr_matrix = datos.select_dtypes(include=['float64', 'int']) \
.corr(method='pearson')
display(tidy_corr_matrix(corr_matrix).head(5))
| | variable_1 | variable_2 | r | abs_r |
|---|---|---|---|---|
| 1019 | V11 | V10 | 0.999996 | 0.999996 |
| 919 | V10 | V11 | 0.999996 | 0.999996 |
| 1121 | V12 | V11 | 0.999996 | 0.999996 |
| 1021 | V11 | V12 | 0.999996 | 0.999996 |
| 917 | V10 | V9 | 0.999996 | 0.999996 |
# Heatmap of correlation matrix
# ==============================================================================
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(5, 5))
sns.heatmap(
data = corr_matrix,
square = True,
ax = ax
)
ax.tick_params(labelsize=8)
Many of the variables are highly correlated (absolute correlation > 0.8), which poses a problem when using linear regression models.
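As a quick check (a sketch reusing the tidy_corr_matrix helper defined above; the 0.8 threshold is the one mentioned in the text), the proportion of variable pairs exceeding that correlation can be computed:

# Proportion of variable pairs with |r| > 0.8 (sketch)
# ==============================================================================
corr_tidy = tidy_corr_matrix(corr_matrix)
prop_alta = (corr_tidy['abs_r'] > 0.8).mean()
print(f"Proportion of variable pairs with |r| > 0.8: {prop_alta:.2f}")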
Data splitting¶
In the following sections, several linear models with and without regularization are fitted, with the objective of identifying which of them is capable of better predicting the fat content of meat based on the signals recorded by the spectrophotometer.
To be able to evaluate the predictive capacity of each model, the available observations are divided into two groups: a training set (70%) and a test set (30%).
# Splitting data into train and test
# ==============================================================================
X = datos.drop(columns='fat')
y = datos['fat']
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
train_size = 0.7,
random_state = 1234,
shuffle = True
)
Scaling¶
Linear regression models with Ridge, Lasso, and Elastic Net regularization penalize the magnitude of model coefficients, so it is necessary that all predictors are on the same scale. For this, standardization or normalization can be used. In this example, standardization is used.
# Scaling
# ==============================================================================
scaler = StandardScaler().set_output(transform="pandas")
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Ordinary Least Squares (OLS)¶
# Model creation and training
# ==============================================================================
modelo = LinearRegression()
modelo.fit(X = X_train, y = y_train)
LinearRegression()
# Model coefficients
# ==============================================================================
df_coeficientes = pd.DataFrame(
{'predictor': X_train.columns,
'coef': modelo.coef_.flatten()}
)
fig, ax = plt.subplots(figsize=(8, 3))
ax.stem(df_coeficientes.predictor, df_coeficientes.coef, markerfmt=' ')
plt.xticks(rotation=90, ha='right', size=6)
ax.set_xlabel('variable')
ax.set_ylabel('coefficients')
ax.set_title('Model coefficients');
# Test predictions
# ==============================================================================
predicciones = modelo.predict(X=X_test)
predicciones = predicciones.flatten()
predicciones[:10]
array([36.88226752, 62.47992661, 61.04825535, 9.95161352, 18.11993067,
6.56158193, 28.42860453, 9.18085599, 15.56800749, 16.50461443])
# Test error of the model
# ==============================================================================
rmse_ols = root_mean_squared_error(
y_true = y_test,
y_pred = predicciones,
)
print(f"The test error (rmse) is: {rmse_ols}")
The test error (rmse) is: 3.839667585638014
The predictions of the final model deviate on average 3.84 units from the actual value.
Ridge¶
# Model creation and training (with CV search for optimal alpha value)
# ==============================================================================
# RidgeCV uses Generalized CV (LOOCV-style) to find the optimal alpha
modelo = RidgeCV(
alphas = np.logspace(-20, 2, 200),
scoring = 'neg_root_mean_squared_error',
fit_intercept = True,
store_cv_results = True
)
_ = modelo.fit(X = X_train, y = y_train)
When using regularization, it is useful to evaluate how the coefficients approach zero as the value of alpha increases, as well as the evolution of the cross-validation error as a function of the alpha used.
# Evolution of coefficients as a function of alpha
# ==============================================================================
alphas = modelo.alphas
coefs = []
for alpha in alphas:
modelo_temp = Ridge(alpha=alpha, fit_intercept=True)
modelo_temp.fit(X_train, y_train)
coefs.append(modelo_temp.coef_.flatten())
fig, ax = plt.subplots(figsize=(7, 3.84))
ax.plot(alphas, coefs)
ax.set_xscale('log')
ax.set_xlabel('alpha')
ax.set_ylabel('coefficients')
ax.set_title('Model coefficients as a function of regularization');
plt.axis('tight')
plt.show()
It can be seen how, as the value of alpha increases, the regularization is greater and the value of the coefficients decreases.
# Evolution of error as a function of alpha
# ==============================================================================
# modelo.cv_results_ stores the CV result for each alpha value.
# It has dimensions (n_samples, n_alphas) or (n_samples, n_targets, n_alphas)
cv_results = modelo.cv_results_
# The scoring used is negative root mean squared error, so its sign is inverted
rmse_cv = -1 * cv_results.mean(axis=0)
rmse_sd = cv_results.std(axis=0)
# Identify the optimum and the optimum + 1std
min_rmse = np.min(rmse_cv)
sd_min_rmse = rmse_sd[np.argmin(rmse_cv)]
min_rmse_1sd = np.max(rmse_cv[rmse_cv <= min_rmse + sd_min_rmse])
optimo = modelo.alphas[np.argmin(rmse_cv)].item()
optimo_1sd = modelo.alphas[rmse_cv == min_rmse_1sd].item()
# Plot of error +- 1 standard deviation
fig, ax = plt.subplots(figsize=(7, 3.84))
ax.plot(modelo.alphas, rmse_cv)
ax.fill_between(
modelo.alphas,
rmse_cv + rmse_sd,
rmse_cv - rmse_sd,
alpha=0.2
)
ax.axvline(
x = optimo,
c = "gray",
linestyle = '--',
label = 'optimum'
)
ax.axvline(
x = optimo_1sd,
c = "blue",
linestyle = '--',
label = 'optimum_1sd'
)
ax.set_xscale('log')
ax.set_title('Evolution of CV error as a function of regularization')
ax.set_xlabel('alpha')
ax.set_ylabel('RMSE')
plt.legend();
# Best alpha value found
# ==============================================================================
print(f"Best alpha value found: {modelo.alpha_}")
Best alpha value found: 5.0526310653357e-06
# Model coefficients
# ==============================================================================
df_coeficientes = pd.DataFrame(
{'predictor': X_train.columns,
'coef': modelo.coef_.flatten()}
)
fig, ax = plt.subplots(figsize=(8, 3))
ax.stem(df_coeficientes.predictor, df_coeficientes.coef, markerfmt=' ')
plt.xticks(rotation=90, ha='right', size=6)
ax.set_xlabel('variable')
ax.set_ylabel('coefficients')
ax.set_title('Model coefficients');
Compared to the ordinary least squares model, the ridge coefficients are much smaller in magnitude.
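This can be verified numerically, for example by comparing the largest absolute coefficient of each model (a quick check; the OLS model is refitted here because the modelo variable now holds the ridge model):

# Quick comparison of coefficient magnitudes (OLS vs ridge)
# ==============================================================================
coef_ols = LinearRegression().fit(X_train, y_train).coef_
print(f"Max |coefficient| OLS  : {np.abs(coef_ols).max():.2f}")
print(f"Max |coefficient| ridge: {np.abs(modelo.coef_).max():.2f}")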
# Test predictions
# ==============================================================================
predicciones = modelo.predict(X=X_test)
predicciones = predicciones.flatten()
predicciones[:10]
array([43.18344257, 40.77188055, 51.13840121, 9.98568293, 17.87646644,
7.66526713, 28.10025771, 8.19516254, 14.75080466, 14.43545959])
# Test error of the model
# ==============================================================================
rmse_ridge = root_mean_squared_error(
y_true = y_test,
y_pred = predicciones,
)
print(f"The test error (rmse) is: {rmse_ridge}")
The test error (rmse) is: 2.3915892373049306
The predictions of the final model deviate on average 2.39 units from the actual value.
Lasso¶
# Model creation and training (with CV search for optimal alpha value)
# ==============================================================================
# LassoCV uses 5-fold cross-validation to find the optimal alpha
modelo = LassoCV(
alphas = np.logspace(-5, 3, 200),
cv = 5,
)
_ = modelo.fit(X = X_train, y = y_train)
When using regularization, it is useful to evaluate how the coefficients approach zero as the value of alpha increases, as well as the evolution of the cross-validation error as a function of the alpha used.
# Evolution of coefficients as a function of alpha
# ==============================================================================
alphas = modelo.alphas_
coefs = []
for alpha in alphas:
modelo_temp = Lasso(alpha=alpha, fit_intercept=True)
modelo_temp.fit(X_train, y_train)
coefs.append(modelo_temp.coef_.flatten())
fig, ax = plt.subplots(figsize=(7, 3.84))
ax.plot(alphas, coefs)
ax.set_xscale('log')
ax.set_xlabel('alpha')
ax.set_ylabel('coefficients')
ax.set_title('Model coefficients as a function of regularization');
It can be seen how, as the value of alpha increases, the regularization is greater and more predictors are excluded (their coefficient is 0).
# Number of predictors included (coefficient !=0) as a function of alpha
# ==============================================================================
alphas = modelo.alphas_
n_predictores = []
for alpha in alphas:
modelo_temp = Lasso(alpha=alpha, fit_intercept=True)
modelo_temp.fit(X_train, y_train)
coef_no_cero = np.sum(modelo_temp.coef_.flatten() != 0)
n_predictores.append(coef_no_cero)
fig, ax = plt.subplots(figsize=(7, 3))
ax.plot(alphas, n_predictores)
ax.set_xscale('log')
ax.set_ylim([-15,None])
ax.set_xlabel('alpha')
ax.set_ylabel('n° predictors')
ax.set_title('Predictors included as a function of regularization');
# Evolution of error as a function of alpha
# ==============================================================================
# modelo.mse_path_ stores the cv mse for each alpha value. It has
# dimensions (n_alphas, n_folds)
mse_cv = modelo.mse_path_.mean(axis=1)
mse_sd = modelo.mse_path_.std(axis=1)
# Square root is applied to go from mse to rmse
rmse_cv = np.sqrt(mse_cv)
rmse_sd = np.sqrt(mse_sd)
# Identify the optimum and the optimum + 1std
min_rmse = np.min(rmse_cv)
sd_min_rmse = rmse_sd[np.argmin(rmse_cv)]
min_rmse_1sd = np.max(rmse_cv[rmse_cv <= min_rmse + sd_min_rmse])
optimo = modelo.alphas_[np.argmin(rmse_cv)]
optimo_1sd = modelo.alphas_[rmse_cv == min_rmse_1sd]
# Plot of error +- 1 standard deviation
fig, ax = plt.subplots(figsize=(7, 3.84))
ax.plot(modelo.alphas_, rmse_cv)
ax.fill_between(
modelo.alphas_,
rmse_cv + rmse_sd,
rmse_cv - rmse_sd,
alpha=0.2
)
ax.axvline(
x = optimo,
c = "gray",
linestyle = '--',
label = 'optimum'
)
ax.axvline(
x = optimo_1sd,
c = "blue",
linestyle = '--',
label = 'optimum_1sd'
)
ax.set_xscale('log')
ax.set_ylim([0,None])
ax.set_title('Evolution of CV error as a function of regularization')
ax.set_xlabel('alpha')
ax.set_ylabel('RMSE')
plt.legend();
# Best alpha value found
# ==============================================================================
print(f"Best alpha value found: {modelo.alpha_}")
Best alpha value found: 1e-05
# Best alpha value found + 1sd
# ==============================================================================
min_rmse = np.min(rmse_cv)
sd_min_rmse = rmse_sd[np.argmin(rmse_cv)]
min_rmse_1sd = np.max(rmse_cv[rmse_cv <= min_rmse + sd_min_rmse])
optimo = modelo.alphas_[np.argmin(rmse_cv)]
optimo_1sd = modelo.alphas_[rmse_cv == min_rmse_1sd].item()
print(f"Best alpha value found + 1 standard deviation: {optimo_1sd}")
Best alpha value found + 1 standard deviation: 0.08703591361485166
The model is trained again, this time using the largest alpha value whose cross-validation error is within one standard deviation of the minimum found.
# Best model with optimal alpha + 1sd
# ==============================================================================
modelo = Lasso(alpha=optimo_1sd)
modelo.fit(X_train, y_train)
Lasso(alpha=0.08703591361485166)
# Model coefficients
# ==============================================================================
df_coeficientes = pd.DataFrame(
{'predictor': X_train.columns,
'coef': modelo.coef_.flatten()}
)
# Predictors included in the model (coefficient != 0)
df_coeficientes[df_coeficientes.coef != 0].reset_index(drop=True)
| | predictor | coef |
|---|---|---|
| 0 | V8 | -2.809618 |
| 1 | V9 | -7.678029 |
| 2 | V10 | -6.773139 |
| 3 | V11 | -6.028624 |
| 4 | V12 | -5.284843 |
| 5 | V13 | -4.547397 |
| 6 | V14 | -3.780134 |
| 7 | V15 | -3.064607 |
| 8 | V16 | -2.425411 |
| 9 | V17 | -1.873268 |
| 10 | V18 | -1.218858 |
| 11 | V19 | -0.605048 |
| 12 | V20 | -0.076677 |
| 13 | V36 | 5.480320 |
| 14 | V37 | 16.903281 |
| 15 | V38 | 16.934001 |
| 16 | V39 | 15.131325 |
| 17 | V40 | 10.970831 |
| 18 | V41 | 4.323470 |
| 19 | V49 | -2.789976 |
| 20 | V50 | -9.042049 |
| 21 | V51 | -4.566902 |
| 22 | V52 | -2.128191 |
| 23 | V53 | -0.259053 |
Of the 100 available predictors, the final model only includes 24.
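The same count can be obtained directly from the coefficient vector (a quick check, equivalent to filtering the dataframe above):

# Number of predictors with non-zero coefficient (quick check)
# ==============================================================================
n_incluidos = int(np.sum(modelo.coef_ != 0))
print(f"Number of predictors included in the model: {n_incluidos}")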
fig, ax = plt.subplots(figsize=(8, 3))
ax.stem(df_coeficientes.predictor, df_coeficientes.coef, markerfmt=' ')
plt.xticks(rotation=90, ha='right', size=6)
ax.set_xlabel('variable')
ax.set_ylabel('coefficients')
ax.set_title('Model coefficients');
# Test predictions
# ==============================================================================
predicciones = modelo.predict(X=X_test)
predicciones = predicciones.flatten()
predicciones[:10]
array([34.35041421, 53.75222685, 36.17366435, 11.74672533, 15.1183471 ,
6.35327526, 24.5488085 , 7.48191556, 13.46044097, 18.71293633])
# Test error of the model
# ==============================================================================
rmse_lasso = root_mean_squared_error(
y_true = y_test,
y_pred = predicciones,
)
print("")
print(f"The test error (rmse) is: {rmse_lasso}")
The test error (rmse) is: 3.747834638323956
The predictions of the final model deviate on average 3.75 units from the actual value, using only 24 of the 100 available predictors.
Elastic net¶
# Model creation and training (with CV search for optimal alpha value)
# ==============================================================================
# ElasticNetCV uses 5-fold cross-validation to find the optimal alpha and l1_ratio
modelo = ElasticNetCV(
l1_ratio = [0, 0.1, 0.5, 0.7, 0.9, 0.95, 0.99],
alphas = np.logspace(-10, 3, 200),
cv = 5
)
_ = modelo.fit(X = X_train, y = y_train)
# Evolution of error as a function of alpha and l1_ratio
# ==============================================================================
# modelo.mse_path_ stores the cv mse for each alpha and l1_ratio value.
# It has dimensions (n_l1_ratio, n_alpha, n_folds)
# Mean error of the 5 folds for each alpha and l1_ratio value
mean_error_cv = modelo.mse_path_.mean(axis=2)
# The result is an array of dimensions (n_l1_ratio, n_alpha)
# Convert to a dataframe
df_resultados_cv = pd.DataFrame(
data = mean_error_cv.flatten(),
index = pd.MultiIndex.from_product(
iterables = [modelo.l1_ratio, modelo.alphas_],
names = ['l1_ratio', 'modelo.alphas_']
),
columns = ["mse_cv"]
)
df_resultados_cv['rmse_cv'] = np.sqrt(df_resultados_cv['mse_cv'])
df_resultados_cv = df_resultados_cv.reset_index().sort_values('mse_cv', ascending = True)
df_resultados_cv
| | l1_ratio | modelo.alphas_ | mse_cv | rmse_cv |
|---|---|---|---|---|
| 1351 | 0.99 | 1.366716e-07 | 7.313915 | 2.704425 |
| 1352 | 0.99 | 1.175850e-07 | 7.313916 | 2.704425 |
| 1350 | 0.99 | 1.588565e-07 | 7.313945 | 2.704431 |
| 1353 | 0.99 | 1.011638e-07 | 7.313964 | 2.704434 |
| 1349 | 0.99 | 1.846425e-07 | 7.313992 | 2.704439 |
| ... | ... | ... | ... | ... |
| 628 | 0.70 | 1.482021e+01 | 153.374640 | 12.384452 |
| 627 | 0.70 | 1.722586e+01 | 153.374640 | 12.384452 |
| 629 | 0.70 | 1.275051e+01 | 153.374640 | 12.384452 |
| 630 | 0.70 | 1.096986e+01 | 153.374640 | 12.384452 |
| 631 | 0.70 | 9.437878e+00 | 153.374640 | 12.384452 |
1400 rows × 4 columns
# Best value found for each l1_ratio
# ==============================================================================
fig, ax = plt.subplots(figsize=(7, 3))
df_resultados_cv.groupby('l1_ratio')['rmse_cv'].min().plot(ax = ax)
ax.set_title('Evolution of CV error as a function of l1_ratio')
ax.set_xlabel('l1_ratio')
ax.set_ylabel('rmse_cv');
# Best alpha and l1_ratio_ value found
# ==============================================================================
print(f"Best alpha value found: {modelo.alpha_}")
print(f"Best l1_ratio value found: {modelo.l1_ratio_}")
Best alpha value found: 1.3667163564620074e-07
Best l1_ratio value found: 0.99
# Model coefficients
# ==============================================================================
df_coeficientes = pd.DataFrame(
{'predictor': X_train.columns,
'coef': modelo.coef_.flatten()}
)
fig, ax = plt.subplots(figsize=(8, 3))
ax.stem(df_coeficientes.predictor, df_coeficientes.coef, markerfmt=' ')
plt.xticks(rotation=90, ha='right', size=6)
ax.set_xlabel('variable')
ax.set_ylabel('coefficients')
ax.set_title('Model coefficients');
# Test predictions
# ==============================================================================
predicciones = modelo.predict(X=X_test)
predicciones = predicciones.flatten()
# Test error of the model
# ==============================================================================
rmse_elastic = root_mean_squared_error(
y_true = y_test,
y_pred = predicciones,
)
print("")
print(f"The test error (rmse) is: {rmse_elastic}")
The test error (rmse) is: 5.232998798175906
The predictions of the final model deviate on average 5.23 units from the actual value, using only 22 of the 100 available predictors. Although the error is greater than with Ridge or Lasso, the model is simpler (fewer predictors).
Comparison¶
The test error (rmse) of the 4 models is compared.
df_comparacion = pd.DataFrame({
'modelo': ['OLS', 'Ridge', 'Lasso', 'Elastic-net'],
'test rmse': [rmse_ols, rmse_ridge, rmse_lasso, rmse_elastic]
})
fig, ax = plt.subplots(figsize=(7, 3.84))
df_comparacion.set_index('modelo').plot(kind='barh', ax=ax)
ax.set_xlabel('rmse')
ax.set_ylabel('model')
ax.set_title('Model comparison');
In this case, the best model in terms of predictive error is obtained by applying Ridge regularization. The model with Lasso regularization achieves a slightly lower error than the ordinary least squares (OLS) model while using only 24 predictors instead of 100, offering a good balance between accuracy and interpretability. Elastic Net regularization yields the simplest model (22 predictors) but with a higher error.
Session information¶
import session_info
session_info.show(html=False)
-----
matplotlib      3.10.8
numpy           2.2.6
pandas          2.3.3
scipy           1.15.3
seaborn         0.13.2
session_info    v1.0.1
sklearn         1.7.2
-----
IPython         9.8.0
jupyter_client  8.7.0
jupyter_core    5.9.1
-----
Python 3.13.11 | packaged by Anaconda, Inc. | (main, Dec 10 2025, 21:28:48) [GCC 14.3.0]
Linux-6.14.0-37-generic-x86_64-with-glibc2.39
-----
Session information updated at 2026-01-12 08:58
Bibliography¶
Linear Models with R by Julian J. Faraway
An Introduction to Statistical Learning: with Applications in R (Springer Texts in Statistics)
Applied Predictive Modeling by Max Kuhn and Kjell Johnson
The Elements of Statistical Learning by T. Hastie, R. Tibshirani, J. Friedman
Introduction to Machine Learning with Python: A Guide for Data Scientists
Python Data Science Handbook by Jake VanderPlas
Lever, J., Krzywinski, M. & Altman, N. Regularization. Nat Methods 13, 803–804 (2016)
Citation instructions¶
How to cite this document?
If you use this document or any part of it, we would appreciate it if you cite it. Thank you very much!
Ridge, Lasso, and Elastic Net Regularization with Python by Joaquín Amat Rodrigo, available with an Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0 DEED) license at https://cienciadedatos.net/documentos/py14-ridge-lasso-elastic-net-python-en.html
This document created by Joaquín Amat Rodrigo is licensed under Attribution-NonCommercial-ShareAlike 4.0 International.
Allowed:

- Share: copy and redistribute the material in any medium or format.
- Adapt: remix, transform, and build upon the material.

Under the following terms:

- Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- Non-Commercial: You may not use the material for commercial purposes.
- Share-Alike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
