• Introduction
  • Mathematical Definition
  • Odds and Log Odds
  • Model Interpretation
  • Model Significance
    • Likelihood Ratio
    • Pseudo R2
  • Predictor Significance
  • Categorical Variables as Predictors
  • Prediction
  • Converting Probability to Classification
  • Model Validation
  • Conditions for Logistic Regression
  • Simple Logistic Regression Example
    • Libraries
    • Data
    • Graphical Exploration
    • Model Fitting
      • Scikit-learn
      • Statsmodels
    • Confidence Intervals for the Coefficients
    • Predictions
    • Graphical Representation of the Model
    • Test Accuracy
    • Conclusion
  • Multiple Logistic Regression Example
    • Libraries
    • Data
    • Model Fitting
    • Predictions
    • Test Accuracy
    • Conclusion
  • Session Information
  • References
  • How to Cite




Introduction

Logistic regression is a statistical method that aims to model the probability of a binary qualitative variable (two possible values) as a function of one or more independent variables. The main application of logistic regression is the creation of binary classification models.

It is called simple logistic regression when there is only one independent variable and multiple logistic regression when there is more than one. Depending on the context, the modeled variable is known as the dependent variable or response variable, and the independent variables as regressors, predictors, or features.

Throughout this document, the theoretical foundations of logistic regression, the main practical aspects to consider, and examples of how to create these types of models in Python are progressively described.

Logistic regression models in Python

Two of the most widely used implementations of logistic regression models in Python are scikit-learn and statsmodels. Although both are highly optimized, scikit-learn is oriented mainly towards prediction, so it exposes few of the statistics needed to analyze a model for inference; statsmodels is more complete in this regard.

Mathematical Definition

The linear regression model (Legendre, Gauss, Galton, and Pearson) considers that, given a set of observations $\{y_i, x_{i1}, ..., x_{ip}\}_{i=1}^{n}$, the mean μ of the response variable y is linearly related to the regressor variables x1, ..., xp according to the equation:

$$\mu_y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p$$

The result of this equation is known as the population regression line, and it captures the relationship between the predictors and the mean of the response variable.

Another definition frequently found in statistics books is:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + ... + \beta_p x_{ip} + \epsilon_i$$

In this case, it refers to the value of y for a specific observation i. The value of a single observation will never be exactly equal to the average, hence the error term ϵ is added.

In both cases, the interpretation of the model elements is the same:

  • β0: is the intercept, corresponding to the average value of the response variable y when all predictors are zero.

  • βp: is the average effect on the response variable of increasing the predictor variable xp by one unit, keeping the rest of the variables constant. These are known as partial regression coefficients.

  • ϵ: is the residual or error, the difference between the observed value and the value estimated by the model. It captures the effect of all variables that influence y but are not included in the model as predictors.

In the vast majority of cases, the population values β0 and βp are unknown, so their estimates β^0 and β^p are obtained from a sample. Fitting the model consists of estimating, from the available data, the values of the regression coefficients that maximize the likelihood, i.e., those that result in the model most likely to have generated the observed data.

The most frequently used method is ordinary least squares (OLS), which identifies as the best model the line (or plane if multiple regression) that minimizes the sum of the squared vertical deviations between each training data point and the line.

The term "linear" in regression models refers to the fact that the parameters are incorporated into the equation linearly, not that the relationship between each predictor and the response variable must necessarily follow a straight-line pattern.

What happens when the response variable is binary, only two possible values?

If a qualitative variable with two levels (binary) is coded as 0 and 1, it is mathematically possible to fit a linear regression model by least squares. However, this approach has two problems:

  • By generating a line (hyperplane if there are multiple variables), predicted values different from 0 and 1 can be obtained, which contradicts the definition of the binary response variable.

  • If you want to interpret the model's predictions as class membership probabilities, the condition that all probabilities must be within the interval [0,1] would not be met, as values outside this range could be obtained.

To avoid these problems, logistic regression (David Cox 1958) transforms the value returned by linear regression with a function whose result is always between 0 and 1. There are several functions that meet this description, one of the most used is the logistic function (also known as the sigmoid function):

$$\text{sigmoid} = \sigma(y) = \frac{1}{1 + e^{-y}}$$

For very large positive values of y, e^(-y) is approximately 0, so the value of the sigmoid function approaches 1. For very large negative values of y, e^(-y) tends to infinity, so the value of the sigmoid function approaches 0.
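As a quick numerical illustration (this snippet is not part of the original analysis), the saturation of the sigmoid at the extremes can be checked with NumPy:

# Sigmoid function: behavior at extreme values (illustrative sketch)
# ==============================================================================
import numpy as np

def sigmoid(y):
    """Maps any real value to the interval (0, 1)."""
    return 1 / (1 + np.exp(-y))

print(sigmoid(np.array([-10, -1, 0, 1, 10])))
# [4.53978687e-05 2.68941421e-01 5.00000000e-01 7.31058579e-01 9.99954602e-01]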

Substituting the y in the previous equation with the function of a linear model y=β0+β1x1+...+βpxp gives:

$$P(y=1|X=x) = \frac{e^{\beta_0 + \beta_1 x_1 + ... + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + ... + \beta_p x_p}} \qquad \text{(Annex 1)}$$

where P(y=1|X=x) can be interpreted as the probability that the response variable y takes the value 1 (reference class), given the predictors x1,...,xp.

The resulting model has the regression coefficients in the exponents, so it is not a linear model and cannot be fitted with the strategy described at the beginning of the document. How can this inconvenience be avoided?

The obtained expression is always positive, since the exponential function only takes positive values and the quotient of positive values is always positive. This makes it possible to apply the logarithm:

$$\ln \left( \frac{P(y=1|X=x)}{P(y=0|X=x)} \right) = \beta_0 + \beta_1 x_1 + ... + \beta_p x_p \qquad \text{(Annex 2)}$$

By performing the transformation, the right side yields the equation of a linear model. The left side term turns out to be the logarithm of a probability ratio, known as the log of odds.

As a result of this process, a nonlinear classification problem is converted into a linear regression problem that can be fitted using conventional methods.

Once the model coefficients (β0, β1, ..., βp) are obtained, the probability that a new observation belongs to class y=1 can be calculated with the equation:

$$p(y=1|X=x) = \frac{e^{\beta_0 + \beta_1 x_1 + ... + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + ... + \beta_p x_p}}$$

Modeling the log of odds is the mathematical strategy that allows finding the coefficient values (fitting the model).

Linear regression model vs logistic regression for a binary variable. The gray line represents the regression line (the model).

Two important derivations were omitted above to keep the narrative simple. However, it is worth keeping them in mind to fully understand how these models work.

Annex 1

Combination of the sigmoid function and the equation of a linear model to obtain a model that relates the predictors to the probability that y=1.

$$\begin{aligned}
P(y=1|X=x) &= \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + ... + \beta_p x_p)}} \\
&= \frac{1}{1 + \dfrac{1}{e^{\beta_0 + \beta_1 x_1 + ... + \beta_p x_p}}} \\
&= \frac{e^{\beta_0 + \beta_1 x_1 + ... + \beta_p x_p}}{e^{\beta_0 + \beta_1 x_1 + ... + \beta_p x_p} + 1} \\
&= \frac{e^{\beta_0 + \beta_1 x_1 + ... + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + ... + \beta_p x_p}}
\end{aligned}$$

Annex 2

Logarithmic transformation to linearize the model.

The probability that y=1 given the predictors X=x1,...,xp is (writing $\sum_i \beta_i x_i$ as shorthand for $\beta_0 + \beta_1 x_1 + ... + \beta_p x_p$):

$$P(y=1|X=x) = \frac{e^{\beta_0 + \beta_1 x_1 + ... + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + ... + \beta_p x_p}} = \frac{e^{\sum_i \beta_i x_i}}{1 + e^{\sum_i \beta_i x_i}}$$

Since this is a binary classification case, the probability of y=0 is the complementary probability:

$$P(y=0|X=x) = 1 - P(y=1|X=x) = 1 - \frac{e^{\sum_i \beta_i x_i}}{1 + e^{\sum_i \beta_i x_i}} = \frac{1}{1 + e^{\sum_i \beta_i x_i}}$$

If one probability is divided by the other:

$$\begin{aligned}
\frac{P(y=1|X=x)}{P(y=0|X=x)} &= \frac{\dfrac{e^{\sum_i \beta_i x_i}}{1 + e^{\sum_i \beta_i x_i}}}{\dfrac{1}{1 + e^{\sum_i \beta_i x_i}}} = e^{\sum_i \beta_i x_i} \\
\ln \left( \frac{P(y=1|X=x)}{P(y=0|X=x)} \right) &= \ln \left( e^{\sum_i \beta_i x_i} \right) = \sum_i \beta_i x_i = \beta_0 + \beta_1 x_1 + ... + \beta_p x_p
\end{aligned}$$

Odds and Log Odds

Due to the necessary transformations, logistic regression models the probability that the response variable belongs to the reference group indirectly through the use of the log odds. Although most implementations convert the log odds to probabilities when showing predictions, it is useful to understand their meaning.

Suppose an event has a probability of being true of 0.8 (p = 0.8). Since the variable is binary, the probability of it being false is 1 − 0.8 = 0.2 (q = 0.2). The odds are defined as the ratio between the probability of the event being true and the probability of it being false, p/q. In this case, the odds are 0.8 / 0.2 = 4, which is equivalent to saying that 4 true events are expected for every false event.

Since probabilities are always bounded in the interval [0, 1], odds lie in the range [0, +∞). When the logarithm is applied, the range of values becomes (−∞, +∞).
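These conversions are straightforward to compute; a minimal sketch with the p = 0.8 example above:

# Probability -> odds -> log odds (illustrative sketch)
# ==============================================================================
import numpy as np

p = 0.8                              # probability of the event being true
odds = p / (1 - p)                   # 4.0: four true events per false event
log_odds = np.log(odds)              # ~1.386

# The inverse transformation recovers the probability
p_recovered = np.exp(log_odds) / (1 + np.exp(log_odds))
print(odds, log_odds, p_recovered)   # 4.0, ~1.386, 0.8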

Model Interpretation

The main elements to interpret in a logistic regression model are the coefficients of the predictors:

  • β0 is the intercept. It corresponds to the expected value of the log odds when all predictors are zero. It can be transformed to a probability with the equation e^β0 / (1 + e^β0). After the transformation, its value corresponds to the expected probability of belonging to class 1 when all predictors are zero.

  • βp: the partial regression coefficient of each predictor indicates the average change in the log odds when the predictor variable xp increases by one unit, keeping the rest of the variables constant. This is equivalent to saying that, for each unit increase in xp, the odds are multiplied by e^βp.

Since the relationship between p(y=1) and x is not linear, the regression coefficients βp do not correspond to the change in the probability of y associated with a one-unit increase in x. How much the probability of y increases per unit of x depends on the value of x, i.e., the position on the logistic curve. This is a very important difference compared to the interpretation of coefficients in a linear regression model.

The magnitude of each partial regression coefficient depends on the units in which the corresponding predictor variable is measured, so its magnitude is not associated with the importance of each predictor. To determine the impact of each variable in the model, standardized partial coefficients are used, which are obtained by standardizing (subtracting the mean and dividing by the standard deviation) the predictor variables before fitting the model.
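A minimal sketch of this idea is shown below; the data are simulated only to illustrate that, after standardization, the coefficients of predictors measured in very different units become directly comparable:

# Standardized coefficients (illustrative sketch with simulated data)
# ==============================================================================
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = pd.DataFrame({
    'x1': rng.normal(50, 10, 200),   # predictor measured in large units
    'x2': rng.normal(0, 1, 200)      # predictor measured in small units
})
y = (0.05 * X['x1'] + X['x2'] + rng.normal(0, 1, 200) > 2.5).astype(int)

X_std = StandardScaler().fit_transform(X)        # subtract mean, divide by std
model_std = LogisticRegression(penalty=None, max_iter=1000).fit(X_std, y)
print(dict(zip(X.columns, model_std.coef_.flatten())))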

While regression coefficients are usually the first objective in interpreting a linear model, there are many other aspects (significance of the model as a whole, significance of the predictors...). These are often treated with little detail when the only goal of the model is to make predictions, but they are very relevant if inference is desired, i.e., explaining the relationships between predictors and the response variable. Throughout this document, each of these aspects will be introduced.

Model Significance

Likelihood Ratio

One of the first results to evaluate when fitting a logistic regression model is the result of the likelihood ratio (LLR) significance test. This test answers the question of whether the model as a whole is able to predict the response variable better than expected by chance, or equivalently, whether at least one of the predictors in the model contributes significantly. To perform this analysis, the probability of obtaining the observed values (log likelihood) with the model of interest (m1) is compared to that of a model without predictors (null model m0).

$$LLR = -2 \ln \left( \frac{L(m_0)}{L(m_1)} \right) = 2 \left( \ln L(m_1) - \ln L(m_0) \right)$$

The LLR statistic has a chi-squared distribution with degrees of freedom equal to the difference in degrees of freedom between the two models compared. If compared to the null model, the degrees of freedom equal the number of predictors.

Frequently, the null and alternative hypotheses of this test are described as:

  • H0: β1 = β2 = ... = βp = 0
  • Ha: at least one βi ≠ 0

If the test is significant, it implies that the model is useful, but not necessarily the best. It could happen that some of its predictors are not needed.
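Assuming `model` is a fitted statsmodels Logit result (as in the examples later in this document), the statistic and its p-value can be recovered with a quick sketch:

# Likelihood ratio test from a fitted statsmodels Logit result (sketch)
# ==============================================================================
from scipy import stats

llr_stat = 2 * (model.llf - model.llnull)        # 2*(log L(m1) - log L(m0))
p_value  = stats.chi2.sf(llr_stat, df=model.df_model)
print(llr_stat, p_value)

# statsmodels also exposes these values directly as model.llr and model.llr_pvalue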

Pseudo R2

Unlike linear regression models, logistic models do not have an exact equivalent to R² that determines the variance explained by the model. Different methods, known as pseudo-R², have been developed to approximate the concept of R², but although their values range between 0 and 1, they cannot be considered equivalent. The best known is the one proposed by McFadden:

$$R^2_{McF} = 1 - \frac{\ln \hat{L}(\text{model})}{\ln \hat{L}(\text{null model})}$$

where $\hat{L}$ is the likelihood value of each model.

$R^2_{McF}$ is 0 if the candidate model does not improve on the null model, and 1 if it fits the data perfectly.
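Assuming again a fitted statsmodels Logit result stored in `model`, McFadden's pseudo-R² can be computed directly from the log likelihoods; it is the same value reported as Pseudo R-squ. in the model summary:

# McFadden's pseudo R-squared from a fitted statsmodels Logit result (sketch)
# ==============================================================================
pseudo_r2 = 1 - model.llf / model.llnull
print(pseudo_r2)          # equivalent to model.prsquared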

Predictor Significance

In most cases, although the logistic regression study is applied to a sample, the ultimate goal is to obtain a model that explains the relationship between the variables in the entire population. This means that the generated model is an estimate of the population relationship based on the relationship observed in the sample and, therefore, is subject to variation. For each of the coefficients in the logistic regression equation (βp), its significance (p-value) and confidence interval can be calculated. The most commonly used statistical test is the Wald test.

The significance test for the coefficients (βp) of the logistic model considers the following hypotheses:

  • H0: the predictor xp does not contribute to the model (βp=0), in the presence of the other predictors.

  • Ha: the predictor xp does contribute to the model (βp ≠ 0), in the presence of the other predictors.

In models generated with statsmodels, along with the value of the regression coefficients, the value of the z statistic obtained for each, the p-values, and the corresponding confidence intervals are returned. This allows you to know, in addition to the estimates, whether they are significantly different from 0, i.e., whether they contribute to the model.

Categorical Variables as Predictors

When a categorical variable is introduced as a predictor, one level is considered the reference (usually coded as 0) and the other levels are compared to it. If the categorical predictor has more than two levels, what are known as dummy or one-hot-encoded variables are generated: one variable is created for each level of the categorical predictor, and each can take the value 0 or 1. Each time the model is used to predict a value, only one dummy variable per predictor takes the value 1 (the one matching the value of the predictor in that case), while the rest are 0. The partial regression coefficient βp of each dummy variable indicates the average change in the log odds of the dependent variable y for that level compared to the reference level of the predictor.

The idea of dummy variables is better understood with an example. Suppose a model in which the response variable green eyes (yes/no) is predicted based on the subject's nationality. The nationality variable is qualitative with 3 levels (American, European, and Asian). Although the initial predictor is nationality, a new variable is created for each level, which can be 0 or 1. Thus, the equation of the complete model is:

$$p(\text{green eyes = yes}) = \frac{e^{\beta_0 + \beta_1 \text{american} + \beta_2 \text{european} + \beta_3 \text{asian}}}{1 + e^{\beta_0 + \beta_1 \text{american} + \beta_2 \text{european} + \beta_3 \text{asian}}}$$

If the subject is European, the dummy variables American and Asian are considered 0, so the model for this case becomes:

$$p(\text{green eyes = yes}) = \frac{e^{\beta_0 + \beta_2 \text{european}}}{1 + e^{\beta_0 + \beta_2 \text{european}}}$$
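In practice, the dummy variables are usually generated automatically; a small sketch with pandas and invented nationality data illustrates the encoding (passing drop_first=True would absorb one level into the intercept so that it acts as the reference):

# One-hot encoding of a categorical predictor (sketch with invented data)
# ==============================================================================
import pandas as pd

nationality = pd.Series(['american', 'european', 'asian', 'european'], name='nationality')
dummies = pd.get_dummies(nationality, prefix='nationality', dtype=int)
print(dummies)
#    nationality_american  nationality_asian  nationality_european
# 0                     1                  0                     0
# 1                     0                  0                     1
# 2                     0                  1                     0
# 3                     0                  0                     1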

Prediction

Once a valid model has been generated, it is possible to predict the probability of the response variable y for new values of the predictor variables x. It is important to note that predictions should, a priori, be limited to the range of values within which the observations used to train the model are found. This is important because, although logistic regression models can be extrapolated, only in this region is it certain that the conditions for the model to be valid are met. To calculate predictions, the following equation is used:

$$\hat{p}(y=1|X) = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 x_1 + ... + \hat{\beta}_p x_p}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 x_1 + ... + \hat{\beta}_p x_p}}$$

Converting Probability to Classification

One of the main applications of a logistic regression model is to classify the qualitative variable based on the values taken by the predictors. Since the output of a logistic model is a probability, to achieve classification, it is necessary to set a threshold above which the variable is considered to belong to one of the levels. For example, an observation can be assigned to group 1 if the estimated probability is greater than 0.5 and to group 0 otherwise.
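A minimal sketch of this conversion (the probabilities shown are invented):

# Converting predicted probabilities into classes with a threshold (sketch)
# ==============================================================================
import numpy as np

probabilities = np.array([0.12, 0.47, 0.55, 0.91])
threshold = 0.5                                # can be adjusted to the problem at hand
classes = np.where(probabilities >= threshold, 1, 0)
print(classes)                                 # [0 0 1 1]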

Model Validation

Once the best model that can be created with the available data has been selected, its ability to predict new observations that were not used for training must be checked. This verifies whether the model can be generalized. A commonly used strategy is to randomly split the data into two groups, fit the model with the first group, and estimate the prediction accuracy with the second.

The appropriate partition size depends largely on the amount of data available and the required confidence in the error estimate. An 80%/20% split usually yields good results.

Conditions for Logistic Regression

For a logistic regression model and the conclusions derived from it to be completely valid, the assumptions on which its mathematical development is based must be verified. In practice, these are rarely all met or can be demonstrated to be met, but this does not mean the model is not useful. The important thing is to be aware of them and the impact this has on the conclusions drawn from the model.

No collinearity or multicollinearity:

In multiple logistic regression models, the predictors must be independent; there should be no collinearity between them. Collinearity occurs when a predictor is linearly related to one or more of the other predictors in the model. As a consequence of collinearity, it is not possible to precisely identify the individual effect of each predictor on the response variable, which results in an increase in the variance of the estimated regression coefficients to the point where it becomes impossible to establish their statistical significance. In addition, small changes in the data cause large changes in the coefficient estimates. While true collinearity only exists if the simple or multiple correlation coefficient between predictors is 1 (which rarely occurs in reality), it is common to find so-called near-collinearity or imperfect multicollinearity.

If collinearity is found between predictors, there are two possible solutions. The first is to exclude one of the problematic predictors, trying to keep the one that, in the researcher's judgment, is truly influencing the response variable. This measure usually has little impact on the model's predictive ability since, due to collinearity, the information provided by one predictor is redundant in the presence of the other. The second option is to combine the collinear variables into a single predictor, although at the risk of losing interpretability.

When trying to establish cause-effect relationships, collinearity can lead to very erroneous conclusions, making one believe that a variable is the cause when, in reality, another is influencing that predictor.
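A common way to quantify near-collinearity is the variance inflation factor (VIF). A minimal sketch with statsmodels, assuming `X` is a DataFrame containing only the numeric predictors:

# Variance inflation factor (VIF) for each predictor (sketch)
# ==============================================================================
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_const = sm.add_constant(X)                   # the VIF calculation needs an intercept column
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])],
    index = X.columns
)
print(vif)                                     # values above ~5-10 suggest collinearity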

Linear relationship between numeric predictors and the log odds of the response variable

Each numeric predictor must be linearly related to the log odds of the response variable y while the other predictors are held constant; otherwise, they should not be included in the model.

No autocorrelation (Independence)

The values of each observation are independent of the others. This is especially important to check when working with time measurements. It is recommended to plot the residuals ordered according to the time of observation; if a certain pattern exists, there are signs of autocorrelation. The Durbin-Watson hypothesis test can also be used.
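Assuming `model` is a fitted statsmodels Logit result, the Durbin-Watson statistic can be computed on its response residuals as a quick sketch (values close to 2 suggest no autocorrelation):

# Durbin-Watson statistic on the model residuals (sketch)
# ==============================================================================
from statsmodels.stats.stattools import durbin_watson

residuals = model.resid_response               # observed value - predicted probability
print(durbin_watson(residuals))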

Outliers, high leverage, or influential points

It is important to identify observations that are outliers or may be influencing the model. The easiest way to detect them is through the residuals.

Sample size

This is not a condition per se, but if there are not enough observations, predictors that are not truly influential may appear to be so. A common recommendation is that the number of observations should be at least 10 to 20 times the number of predictors in the model.

Parsimony

This term refers to the fact that the best model is the one capable of explaining the observed variability in the response variable with the greatest precision using the fewest predictors, and therefore, with fewer assumptions.

The vast majority of conditions are checked using the residuals, so the model is usually generated first and then the conditions are validated. In fact, fitting a model should be seen as an iterative process in which the model is fitted, the residuals are evaluated, and it is improved. This continues until an optimal model is reached.

Simple Logistic Regression Example

A study aims to establish a model that allows calculating the probability of obtaining honors at the end of high school based on the grade obtained in mathematics. The honors variable is coded as 0 if the student does not receive honors and 1 if they do.

Libraries

The libraries used in this example are:

# Data processing
# ==============================================================================
import pandas as pd
import numpy as np

# Plots
# ==============================================================================
import matplotlib.pyplot as plt
from matplotlib import style
import seaborn as sns

# Preprocessing and modeling
# ==============================================================================
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.weightstats import ttest_ind

# Matplotlib configuration
# ==============================================================================
plt.rcParams['image.cmap'] = "bwr"
plt.rcParams['savefig.bbox'] = "tight"
style.use('ggplot') or plt.style.use('ggplot')

# Warnings configuration
# ==============================================================================
import warnings
warnings.filterwarnings('once')

Data

# Data
# ==============================================================================
honors = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1,
                     0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1,
                     0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
                     0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
                     1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0,
                     1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1,
                     1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,
                     0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
                     0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0,
                     0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0,
                     0, 0, 0, 0, 1, 0, 0, 0, 1, 1])

mathematics = np.array([
                  41, 53, 54, 47, 57, 51, 42, 45, 54, 52, 51, 51, 71, 57, 50, 43,
                  51, 60, 62, 57, 35, 75, 45, 57, 45, 46, 66, 57, 49, 49, 57, 64,
                  63, 57, 50, 58, 75, 68, 44, 40, 41, 62, 57, 43, 48, 63, 39, 70,
                  63, 59, 61, 38, 61, 49, 73, 44, 42, 39, 55, 52, 45, 61, 39, 41,
                  50, 40, 60, 47, 59, 49, 46, 58, 71, 58, 46, 43, 54, 56, 46, 54,
                  57, 54, 71, 48, 40, 64, 51, 39, 40, 61, 66, 49, 65, 52, 46, 61,
                  72, 71, 40, 69, 64, 56, 49, 54, 53, 66, 67, 40, 46, 69, 40, 41,
                  57, 58, 57, 37, 55, 62, 64, 40, 50, 46, 53, 52, 45, 56, 45, 54,
                  56, 41, 54, 72, 56, 47, 49, 60, 54, 55, 33, 49, 43, 50, 52, 48,
                  58, 43, 41, 43, 46, 44, 43, 61, 40, 49, 56, 61, 50, 51, 42, 67,
                  53, 50, 51, 72, 48, 40, 53, 39, 63, 51, 45, 39, 42, 62, 44, 65,
                  63, 54, 45, 60, 49, 48, 57, 55, 66, 64, 55, 42, 56, 53, 41, 42,
                  53, 42, 60, 52, 38, 57, 58, 65])

data = pd.DataFrame({'honors': honors, 'mathematics': mathematics})
data.head(3)
   honors  mathematics
0       0           41
1       0           53
2       0           54

Graphical Exploration

The first step before generating a simple logistic regression model is to plot the data to get an idea of whether there is a relationship between the independent variable and the response variable.

# Number of observations per class
# ==============================================================================
data.honors.value_counts().sort_index()
honors
0    151
1     49
Name: count, dtype: int64
# Plot
# ==============================================================================
fig, ax = plt.subplots(figsize=(6, 3.84))

sns.violinplot(
        x     = 'honors',
        y     = 'mathematics',
        data  = data,
        ax    = ax
)

ax.set_title('Distribution of math grades by class');
# T-test between classes
# ==============================================================================
res_ttest = ttest_ind(
                x1 = mathematics[honors == 0],
                x2 = mathematics[honors == 1],
                alternative='two-sided'
)
print(f"t={res_ttest[0]}, p-value={res_ttest[1]}")
t=-8.245421127756739, p-value=2.248243794123437e-14

Both the plot and the t-test provide evidence that there is a difference in the math grades of people with and without honors. This information is useful for considering the math grade as a good predictor for the model.

Model Fitting

A model is fitted using honors as the response variable and mathematics as the predictor. As in any predictive study, it is important not only to fit the model but also to quantify its ability to predict new observations. To make this evaluation, the data are split into two groups, one for training and one for testing.

Scikit-learn

# Split the data into train and test sets
# ==============================================================================
X = data[['mathematics']]
y = data['honors']

X_train, X_test, y_train, y_test = train_test_split(
                                        X.values,
                                        y.values,
                                        train_size   = 0.8,
                                        random_state = 1234,
                                        shuffle      = True
)

# Model creation
# ==============================================================================
# To avoid including any regularization in the model, set
# penalty=None
model = LogisticRegression(penalty=None)
model.fit(X = X_train, y = y_train.ravel())
LogisticRegression(penalty=None)
# Model information
# ==============================================================================
print("Intercept:", model.intercept_)
print("Coefficient:", list(zip(X.columns, model.coef_.flatten(), )))
print("Training accuracy:", model.score(X_train, y_train))
Intercept: [-8.98479044]
Coefficient: [('mathematics', np.float64(0.14393266992917017))]
Training accuracy: 0.79375

Once the model is trained, new observations can be predicted.

# Probabilistic predictions
# ==============================================================================
# With .predict_proba() you get, for each observation, the predicted probability
# of belonging to each of the two classes.
predictions = model.predict_proba(X = X_test)
predictions = pd.DataFrame(predictions, columns = model.classes_)
predictions.head(3)
          0         1
0  0.685816  0.314184
1  0.838109  0.161891
2  0.443517  0.556483
# Predictions with final classification
# ==============================================================================
# With .predict() you get, for each observation, the classification predicted by
# the model. This classification corresponds to the class with the highest probability.
predictions = model.predict(X = X_test)
predictions
array([0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

Statsmodels

The implementation of logistic regression in Statsmodels is more complete than that of Scikit-learn because, in addition to fitting the model, it allows you to calculate the statistical tests and analyses needed to verify that the conditions on which these types of models are based are met. Statsmodels has two ways to train the model:

  • By specifying the model formula and passing the training data as a dataframe that includes the response variable and the predictors. This approach is similar to that used in R (a brief sketch is shown after this list).

  • Passing two matrices, one with the predictors and one with the response variable. This is the same as used by Scikit-learn with the difference that a column of 1s must be added to the predictor matrix.
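A brief sketch of the first (formula-based) approach, using the data defined above (for simplicity it is fitted on the full data set):

# Model fitting with the statsmodels formula interface (alternative sketch)
# ==============================================================================
import statsmodels.formula.api as smf

model_formula = smf.logit('honors ~ mathematics', data = data).fit()
print(model_formula.params)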

# Split the data into train and test sets
# ==============================================================================
X = data[['mathematics']]
y = data['honors']

X_train, X_test, y_train, y_test = train_test_split(
                                        X.values,
                                        y.values,
                                        train_size   = 0.8,
                                        random_state = 1234,
                                        shuffle      = True
)
# Model creation using matrices as in scikit-learn
# ==============================================================================
# A column of 1s must be added to the predictor matrix for the model intercept
X_train = sm.add_constant(X_train, prepend=True)
model = sm.Logit(endog=y_train, exog=X_train,)
model = model.fit()
print(model.summary())
Optimization terminated successfully.
         Current function value: 0.451215
         Iterations 6
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                  160
Model:                          Logit   Df Residuals:                      158
Method:                           MLE   Df Model:                            1
Date:                Mon, 18 Aug 2025   Pseudo R-squ.:                  0.2247
Time:                        11:24:18   Log-Likelihood:                -72.194
converged:                       True   LL-Null:                       -93.122
Covariance Type:            nonrobust   LLR p-value:                 9.831e-11
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -8.9848      1.543     -5.821      0.000     -12.010      -5.960
x1             0.1439      0.027      5.387      0.000       0.092       0.196
==============================================================================

The estimated coefficient for the intercept (Intercept or const) is the expected value of the log odds that a student obtains honors with a 0 in mathematics. As one might expect, the odds are very low, e^(-8.9848) = 0.0001254, which corresponds to a probability of obtaining honors of p = e^(-8.9848) / (1 + e^(-8.9848)) ≈ 0.000125.

According to the model, the log odds that a student receives honors is positively related to the score obtained in mathematics (regression coefficient = 0.1439). This means that for each unit increase in the mathematics variable, the log odds of the honors variable is expected to increase by an average of 0.1439 units. Applying the inverse of the natural logarithm (e^0.1439 = 1.155), for each unit increase in the mathematics variable the odds of obtaining honors are multiplied on average by 1.155. This should not be confused with the probability of honors increasing by 15.5%.

Unlike linear regression, where β1 corresponds to the average change in the dependent variable y due to a one-unit increase in the predictor x1, in logistic regression, β1 indicates the change in the log odds due to a one-unit increase in x1, or in other words, multiplies the odds by eβ1. Since the relationship between p(y=1) and x is not linear, the regression coefficients βp do not correspond to the change in the probability of y associated with a one-unit increase in x. How much the probability of y increases per unit of x depends on the value of x, i.e., the position on the logistic curve.

In addition to the estimated values of the model's partial regression coefficients, it is advisable to calculate their corresponding confidence intervals.

Confidence Intervals for the Coefficients

# Confidence intervals for the model coefficients
# ==============================================================================
ci_intervals = model.conf_int(alpha=0.05)
ci_intervals = pd.DataFrame(ci_intervals, columns = ['2.5%', '97.5%'])
ci_intervals
        2.5%      97.5%
0 -12.009997  -5.959608
1   0.091564   0.196301
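Since the coefficients and their intervals are on the log-odds scale, it can be helpful to exponentiate them to read them as multiplicative effects on the odds (odds ratios); a quick sketch:

# Odds ratios: exponentiating the coefficients and their confidence intervals
# ==============================================================================
odds_ratios    = np.exp(model.params)
odds_ratios_ci = np.exp(model.conf_int(alpha=0.05))
print(odds_ratios)
print(odds_ratios_ci)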

Predictions

Once the model is trained, predictions for new data can be obtained. Logistic regression models from statsmodels return the probability of belonging to the reference class.

# Probability prediction
# ==============================================================================
predictions = model.predict(exog = X_train)
predictions[:4]
array([0.0437907 , 0.52073   , 0.05755755, 0.72042378])

To obtain the final classification, probability values greater than 0.5 are converted to 1 and the rest to 0.

# Predicted classification
# ==============================================================================
classification = np.where(predictions<0.5, 0, 1)
classification
array([0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0,
       1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1])

Graphical Representation of the Model

To represent the model graphically, the predicted probability of obtaining honors is plotted across the observed range of the mathematics grade, together with the training observations. This makes it possible to see how the estimated probability changes along the logistic curve.

# Predictions across the entire range of X
# ==============================================================================
# Create a vector with new interpolated values in the range of observations.
grid_X = np.linspace(
            start = min(data.mathematics),
            stop  = max(data.mathematics),
            num   = 200
).reshape(-1,1)

grid_X = sm.add_constant(grid_X, prepend=True)
predictions = model.predict(exog = grid_X)
# Model plot
# ==============================================================================
fig, ax = plt.subplots(figsize=(6, 3.84))

# Points for class 1
ax.scatter(
    X_train[(y_train == 1).flatten(), 1],
    y_train[(y_train == 1).flatten()].flatten()
)
# Points for class 0
ax.scatter(
    X_train[(y_train == 0).flatten(), 1],
    y_train[(y_train == 0).flatten()].flatten()
)
ax.plot(grid_X[:, 1], predictions, color = "gray")
ax.set_title("Logistic Regression Model")
ax.set_ylabel("P(honors = 1 | mathematics)")
ax.set_xlabel("Mathematics grade");

Test Accuracy

The percentage of correct predictions the model achieves on the test observations (accuracy) is calculated.

# Test accuracy of the model 
# ==============================================================================
X_test = sm.add_constant(X_test, prepend=True)
predictions = model.predict(exog = X_test)
classification = np.where(predictions<0.5, 0, 1)
accuracy = accuracy_score(
            y_true    = y_test,
            y_pred    = classification,
            normalize = True
)
print("")
print(f"The test accuracy is: {100*accuracy}%")
The test accuracy is: 87.5%

Conclusion

The logistic model created to predict the probability that a student obtains honors based on their mathematics grade is overall significant (Likelihood ratio p-value = 9.831e-11). The p-value for the mathematics predictor is significant (p-value = 7.17e-08).

$$P(\text{honors}) = \frac{e^{-8.9848 + 0.1439 \cdot \text{mathematics grade}}}{1 + e^{-8.9848 + 0.1439 \cdot \text{mathematics grade}}}$$

The results obtained with the test set indicate that the model is able to correctly classify 87.5% of the observations.

Multiple Logistic Regression Example

The spam dataset, obtained from the UCI Repository Of Machine Learning Databases, contains information on 4601 emails classified as spam and not spam.

For each email, there are 58 variables: the first 48 contain the frequency with which certain words appear in the email text. Variables 49-54 indicate the frequency with which the characters ‘;’, ‘(’, ‘[’, ‘!’, ‘$’ and ‘#’ appear. Variables 55-57 contain the average and maximum length of the sequences of consecutive capital letters and the total number of capital letters. The last variable, type, indicates whether or not the email is spam.

A multiple logistic regression model is created to classify whether an email is spam or not.

Libraries

The libraries used in this example are:

# Data processing
# ==============================================================================
import pandas as pd
import numpy as np

# Plots
# ==============================================================================
import matplotlib.pyplot as plt
from matplotlib import style
import seaborn as sns

# Preprocessing and modeling
# ==============================================================================
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import ConfusionMatrixDisplay
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Matplotlib configuration
# ==============================================================================
plt.rcParams['image.cmap'] = "bwr"
plt.rcParams['savefig.bbox'] = "tight"
style.use('ggplot') or plt.style.use('ggplot')

# Warnings configuration
# ==============================================================================
import warnings
warnings.filterwarnings('ignore')

Data

# Data
# ==============================================================================
url = (
    'https://raw.githubusercontent.com/JoaquinAmatRodrigo/'
    'Estadistica-machine-learning-python/master/data/spam.csv'
)
data = pd.read_csv(url)
data.head(3)
make address all num3d our over remove internet order mail ... charSemicolon charRoundbracket charSquarebracket charExclamation charDollar charHash capitalAve capitalLong capitalTotal type
0 0.00 0.64 0.64 0.0 0.32 0.00 0.00 0.00 0.00 0.00 ... 0.00 0.000 0.0 0.778 0.000 0.000 3.756 61 278 spam
1 0.21 0.28 0.50 0.0 0.14 0.28 0.21 0.07 0.00 0.94 ... 0.00 0.132 0.0 0.372 0.180 0.048 5.114 101 1028 spam
2 0.06 0.00 0.71 0.0 1.23 0.19 0.19 0.12 0.64 0.25 ... 0.01 0.143 0.0 0.276 0.184 0.010 9.821 485 2259 spam

3 rows × 58 columns

The response variable is coded as 1 if it is spam and 0 if it is not, and the number of observations in each class is identified.

data['type'] = np.where(data['type'] == 'spam', 1, 0)

print("Number of observations per class")
print(data['type'].value_counts())
print("")

print("Percentage of observations per class")
print(100 * data['type'].value_counts(normalize=True))
Number of observations per class
type
0    2788
1    1813
Name: count, dtype: int64

Percentage of observations per class
type
0    60.595523
1    39.404477
Name: proportion, dtype: float64

60.6% of emails are not spam and 39.4% are. A useful classification model must be able to correctly predict a percentage of observations above that of the majority class. In this case, the reference threshold to surpass is 60.6%.

Model Fitting

A multiple logistic regression model is fitted to predict whether an email is spam based on all available variables.

# Split the data into train and test sets
# ==============================================================================
X = data.drop(columns = 'type')
y = data['type']

X_train, X_test, y_train, y_test = train_test_split(
                                        X,
                                        y.values,
                                        train_size   = 0.8,
                                        random_state = 1234,
                                        shuffle      = True
)
# Model creation using matrices as in scikit-learn
# ==============================================================================
# A column of 1s must be added to the predictor matrix for the model intercept
X_train = sm.add_constant(X_train, prepend=True)
model = sm.Logit(endog=y_train, exog=X_train,)
model = model.fit()
print(model.summary())
Optimization terminated successfully.
         Current function value: 0.200751
         Iterations 15
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                 3680
Model:                          Logit   Df Residuals:                     3622
Method:                           MLE   Df Model:                           57
Date:                Mon, 18 Aug 2025   Pseudo R-squ.:                  0.7009
Time:                        11:24:20   Log-Likelihood:                -738.76
converged:                       True   LL-Null:                       -2469.6
Covariance Type:            nonrobust   LLR p-value:                     0.000
=====================================================================================
                        coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                -1.5530      0.156     -9.956      0.000      -1.859      -1.247
make                 -0.3077      0.255     -1.205      0.228      -0.808       0.193
address              -0.1464      0.079     -1.860      0.063      -0.301       0.008
all                   0.0882      0.125      0.705      0.481      -0.157       0.333
num3d                 2.2531      1.454      1.550      0.121      -0.596       5.103
our                   0.7254      0.128      5.682      0.000       0.475       0.976
over                  0.7286      0.255      2.859      0.004       0.229       1.228
remove                1.9991      0.332      6.022      0.000       1.348       2.650
internet              0.5267      0.183      2.884      0.004       0.169       0.885
order                 0.6556      0.316      2.076      0.038       0.037       1.274
mail                  0.0833      0.075      1.106      0.269      -0.064       0.231
receive              -0.1917      0.333     -0.576      0.565      -0.844       0.461
will                 -0.1788      0.087     -2.050      0.040      -0.350      -0.008
people               -0.1835      0.251     -0.732      0.464      -0.675       0.308
report                0.1589      0.156      1.016      0.310      -0.148       0.466
addresses             1.0691      0.710      1.507      0.132      -0.322       2.460
free                  1.0272      0.163      6.309      0.000       0.708       1.346
business              1.0484      0.258      4.059      0.000       0.542       1.555
email                 0.0832      0.131      0.634      0.526      -0.174       0.340
you                   0.0760      0.040      1.894      0.058      -0.003       0.155
credit                1.4274      0.750      1.903      0.057      -0.043       2.897
your                  0.2196      0.057      3.826      0.000       0.107       0.332
font                  0.2320      0.192      1.208      0.227      -0.144       0.608
num000                2.3195      0.540      4.296      0.000       1.261       3.378
money                 0.3512      0.146      2.403      0.016       0.065       0.638
hp                   -1.8795      0.320     -5.880      0.000      -2.506      -1.253
hpl                  -1.1604      0.486     -2.387      0.017      -2.113      -0.207
george              -12.9370      2.497     -5.182      0.000     -17.830      -8.044
num650                0.4662      0.208      2.243      0.025       0.059       0.874
lab                  -3.9850      2.593     -1.537      0.124      -9.067       1.097
labs                 -0.1316      0.342     -0.385      0.700      -0.801       0.538
telnet                1.0132      0.884      1.146      0.252      -0.720       2.746
num857                2.3901      3.536      0.676      0.499      -4.541       9.321
data                 -0.7578      0.363     -2.086      0.037      -1.470      -0.046
num415                0.8594      1.678      0.512      0.608      -2.429       4.147
num85                -1.9348      0.870     -2.225      0.026      -3.639      -0.231
technology            0.8951      0.354      2.529      0.011       0.201       1.589
num1999               0.0493      0.204      0.241      0.809      -0.351       0.449
parts                 1.9359      1.232      1.571      0.116      -0.480       4.351
pm                   -0.9294      0.464     -2.001      0.045      -1.840      -0.019
direct               -0.3704      0.432     -0.857      0.391      -1.217       0.477
cs                  -43.9842     27.447     -1.603      0.109     -97.779       9.811
meeting              -3.0673      0.988     -3.104      0.002      -5.004      -1.130
original             -0.8759      0.810     -1.082      0.279      -2.463       0.711
project              -1.6666      0.583     -2.858      0.004      -2.809      -0.524
re                   -0.8289      0.171     -4.856      0.000      -1.163      -0.494
edu                  -1.1653      0.268     -4.355      0.000      -1.690      -0.641
table                -2.0108      1.691     -1.189      0.234      -5.326       1.304
conference           -5.0363      2.136     -2.358      0.018      -9.223      -0.850
charSemicolon        -1.2001      0.491     -2.443      0.015      -2.163      -0.237
charRoundbracket     -0.2822      0.278     -1.015      0.310      -0.827       0.263
charSquarebracket    -0.8608      1.059     -0.813      0.416      -2.937       1.215
charExclamation       0.2735      0.072      3.798      0.000       0.132       0.415
charDollar            5.7530      0.836      6.880      0.000       4.114       7.392
charHash              1.4144      1.287      1.099      0.272      -1.108       3.936
capitalAve            0.0389      0.021      1.842      0.066      -0.002       0.080
capitalLong           0.0048      0.003      1.799      0.072      -0.000       0.010
capitalTotal          0.0014      0.000      5.068      0.000       0.001       0.002
=====================================================================================

Possibly complete quasi-separation: A fraction 0.28 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.

Predictions

Once the model is trained, predictions for new data can be obtained. statsmodels models allow you to calculate the confidence intervals associated with each prediction.

# Predictions with confidence interval 
# ==============================================================================
predictions = model.predict(exog = X_train)

# Predicted classification
# ==============================================================================
classification = np.where(predictions<0.5, 0, 1)
classification
array([0, 0, 0, ..., 1, 1, 0], shape=(3680,))

Test Accuracy

The percentage of correct predictions the model achieves on the test observations (accuracy) is calculated.

# Test accuracy of the model 
# ==============================================================================
X_test = sm.add_constant(X_test, prepend=True)
predictions = model.predict(exog = X_test)
classification = np.where(predictions<0.5, 0, 1)
accuracy = accuracy_score(
            y_true    = y_test,
            y_pred    = classification,
            normalize = True
)
print("")
print(f"The test accuracy is: {100*accuracy}%")
The test accuracy is: 92.83387622149837%
# Confusion matrix of test predictions
# ==============================================================================
confusion_matrix = pd.crosstab(
    y_test.ravel(),
    classification,
    rownames=['Actual'],
    colnames=['Prediction']
    )
confusion_matrix
Prediction    0    1
Actual
0           535   28
1            38  320

Conclusion

The logistic model created to predict the probability that an email is spam is overall significant (likelihood ratio p-value ≈ 0). The percentage of correct classification in the test set is 92.8%, well above the 60.6% obtained by always predicting the majority class.

According to the individual p-value of each predictor, only some of them contribute information to the model. It is advisable to identify the non-informative predictors and exclude them to simplify the model and avoid introducing noise. One way to reduce the influence of predictors that do not contribute to a logistic regression model is to incorporate regularization in its fitting, as sketched below.
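A possible sketch of this idea with scikit-learn is shown below, applying an L1 (lasso) penalty so that uninformative coefficients are shrunk towards zero; the value of C is purely illustrative and should be tuned (for example with LogisticRegressionCV or cross-validation). The 'const' column added above for statsmodels is dropped because scikit-learn fits its own intercept.

# L1-regularized logistic regression to shrink uninformative coefficients (sketch)
# ==============================================================================
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

model_l1 = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty='l1', solver='liblinear', C=0.5, max_iter=1000)
)
model_l1.fit(X_train.drop(columns='const'), y_train)
print("Test accuracy:", model_l1.score(X_test.drop(columns='const'), y_test))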

Session Information

import session_info
session_info.show(html=False)
-----
matplotlib          3.10.1
numpy               2.2.6
pandas              2.3.1
seaborn             0.13.2
session_info        v1.0.1
sklearn             1.7.1
statsmodels         0.14.4
-----
IPython             9.1.0
jupyter_client      8.6.3
jupyter_core        5.7.2
jupyterlab          4.4.5
notebook            7.4.5
-----
Python 3.12.9 | packaged by Anaconda, Inc. | (main, Feb  6 2025, 18:56:27) [GCC 11.2.0]
Linux-6.14.0-27-generic-x86_64-with-glibc2.39
-----
Session information updated at 2025-08-18 11:24

References

An Introduction to Statistical Learning: with Applications in R (Springer Texts in Statistics)

Linear Models with R, Julian J. Faraway

OpenIntro Statistics: Third Edition, David M Diez, Christopher D Barr, Mine Çetinkaya-Rundel

Points of Significance: Logistic regression by Jake Lever, Martin Krzywinski & Naomi Altman

How to Cite

How to cite this document?

If you use this document or any part of it, we appreciate your citation. Thank you very much!

Logistic Regression with Python by Joaquín Amat Rodrigo, available under an Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0 DEED) license at https://cienciadedatos.net/documentos/py17-regresion-logistica-python.html


This document created by Joaquín Amat Rodrigo is licensed under an Attribution-NonCommercial-ShareAlike 4.0 International license.

You are allowed to:

  • Share: copy and redistribute the material in any medium or format.

  • Adapt: remix, transform, and build upon the material.

Under the following terms:

  • Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

  • Non-Commercial: You may not use the material for commercial purposes.

  • ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.