Introduction¶
Linear correlation is a statistical method that allows quantifying the linear relationship between two variables. There are several statistics, called linear correlation coefficients, developed to measure this type of association. Some of the most commonly used are Pearson, Spearman, and Kendall.
Frequently, linear correlation studies precede more complex analyses, such as the creation of regression models. First, it is analyzed whether the variables are correlated and, if so, models are generated.
It is important to note that the existence of correlation between two variables does not necessarily imply causality. The observed association may be due to a third factor (confounder).
Linear Correlation with Python
There are multiple implementations that allow calculating linear correlations in Python. Three of the most widely used are available in the libraries: SciPy, Pandas, and Pingouin. Throughout this document, examples of how to use them are shown.
Linear Correlation Coefficients¶
Covariance¶
To study the linear relationship between two continuous variables, it is necessary to have parameters that allow quantifying such a relationship. One of these parameters is covariance, which measures the degree of joint variation of two random variables.
$$\text{Sample Covariance} = Cov(X,Y)=\dfrac {\sum _{i=1}^{n}\left( x_{i}-\overline {x}\right) \left( y_{i}-\overline {y}\right) }{n-1}$$

where $\overline{x}$ and $\overline{y}$ are the means of each variable, and $x_i$ and $y_i$ are the values of the variables for observation $i$.
Positive values indicate that the two variables change in the same direction, and negative values indicate that they change in opposite directions.
The main limitation of covariance is that its magnitude depends on the scales in which the studied variables are measured. This implies that it cannot be used to compare the degree of association between pairs of variables measured on different scales. One way to avoid this limitation and be able to make comparisons is to standardize the covariance, generating what is known as correlation coefficients.
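As an illustrative sketch (the data below are hypothetical and not part of the original exercise), the sample covariance can be computed directly with NumPy; np.cov returns the covariance matrix, with the covariance of the two variables in the off-diagonal positions.

# Sample covariance with NumPy (illustrative sketch, hypothetical data)
# ==============================================================================
import numpy as np

x = np.array([1.2, 2.4, 3.1, 4.8, 5.0])
y = np.array([2.0, 3.9, 6.1, 8.3, 10.2])

# np.cov divides by n-1 by default, matching the sample formula above.
# Position [0, 1] of the 2x2 matrix is Cov(X, Y).
cov_matrix = np.cov(x, y)
print(f"Sample covariance: {cov_matrix[0, 1]}")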
Linear Correlation Coefficients¶
Linear correlation coefficients are statistics that quantify the linear association between two numerical variables. There are different types, among which Pearson, Spearman's Rho, and Kendall's Tau stand out. They all share the following characteristics:
Their value is in the range [-1, +1], where +1 indicates a perfect positive correlation and -1 indicates a perfect negative correlation.
They are used as a measure of the strength of association between two variables (effect size):
0: null association
0.1: small association
0.3: medium association
0.5: moderate association
0.7: high association
0.9: very high association
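The thresholds above can be turned into a small helper for labeling coefficients; the function below is a hypothetical convenience, not part of any of the libraries discussed.

# Helper to label the strength of a correlation (hypothetical sketch)
# ==============================================================================
def correlation_strength(r):
    '''Map |r| to the qualitative labels listed above (hypothetical helper).'''
    thresholds = [(0.9, 'very high'), (0.7, 'high'), (0.5, 'moderate'),
                  (0.3, 'medium'), (0.1, 'small')]
    for cutoff, label in thresholds:
        if abs(r) >= cutoff:
            return label
    return 'null or negligible'

print(correlation_strength(0.75))   # high
print(correlation_strength(-0.25))  # small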
From a practical point of view, the main differences between these three coefficients are:
Pearson correlation works well with quantitative variables that have a normal or near-normal distribution. It is more sensitive to extreme values than the other two alternatives.
Spearman correlation is used with quantitative variables (continuous or discrete). Instead of directly using the value of each variable, the data are ordered and replaced by their respective rank. It is a non-parametric method widely used when the normality condition necessary to apply Pearson correlation is not satisfied.
Kendall correlation is another non-parametric alternative that, like Spearman correlation, uses the rank ordering of observations. It is recommended when there is little data and many values occupy the same position in the rank, that is, when there are many ties.
Statistical Significance¶
In addition to the value obtained for the correlation coefficient, it is necessary to calculate its statistical significance. No matter how close the value of the correlation coefficient is to $+1$ or $-1$, if it is not significant, there is not enough evidence to affirm that a real correlation exists, since the observed value could be due to simple randomness.
The parametric test of statistical significance used for the correlation coefficient is the t-test, where the t statistic is calculated as:
$$t=\dfrac {r\sqrt{N-2}}{\sqrt{1-r^{2}}}$$

where $r$ is the value of the correlation coefficient and $N$ is the number of available observations. The degrees of freedom are calculated as $df = N-2$.
In this test, the null hypothesis ($H_0$) considers that the variables are independent (population correlation coefficient = 0), and as an alternative hypothesis ($H_a$), that there is a relationship (population correlation coefficient $\neq$ 0).
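As a sketch of this calculation (the values of $r$ and $N$ below are hypothetical), the t statistic and its two-sided p-value can be obtained with the t distribution in SciPy.

# Significance of a correlation coefficient via the t-test (illustrative sketch)
# ==============================================================================
import numpy as np
from scipy import stats

r = 0.65   # hypothetical correlation coefficient
N = 30     # hypothetical number of observations

t_stat = r * np.sqrt(N - 2) / np.sqrt(1 - r**2)
df = N - 2
# Two-sided p-value from the t distribution with N-2 degrees of freedom
p_value = 2 * stats.t.sf(np.abs(t_stat), df=df)
print(f"t = {t_stat:.3f}, df = {df}, p-value = {p_value:.5f}")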
The significance of a correlation coefficient can also be calculated using non-parametric methods such as bootstrapping.
Effect Size¶
Linear correlation, in addition to the value of the correlation coefficient and its significance, also has an associated effect size known as the coefficient of determination $R^2$.
$R^2$ is interpreted as the amount of variance in $Y$ explained by $X$. In the case of Pearson and Spearman coefficients, $R^2$ is obtained by squaring the correlation coefficient. In the case of Kendall, it cannot be calculated this way.
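For example (a minimal sketch with simulated data), with the coefficient returned by SciPy, $R^2$ is simply the square of the statistic.

# Effect size R^2 from a correlation coefficient (illustrative sketch)
# ==============================================================================
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)
x = rng.normal(size=100)
y = 0.8 * x + rng.normal(scale=0.5, size=100)

r, p = stats.pearsonr(x, y)
print(f"r = {r:.3f}, R^2 = {r**2:.3f}")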
Pearson Coefficient¶
The Pearson correlation coefficient is the standardized covariance.
$$\rho=\dfrac {Cov(X,Y)}{\sigma_{x}\sigma_{y}}$$

The above equation corresponds to the population Pearson coefficient ($\rho$). In practice, the entire population is rarely accessible, so its value is estimated from a sample using the sample Pearson coefficient ($r$):

$$r_{xy}=\dfrac {\sum _{i=1}^{n}\left( x_{i}-\overline {x}\right) \left( y_{i}-\overline {y}\right) } {\sqrt {\sum _{i=1}^{n}\left( x_{i}-\overline {x}\right) ^{2}\sum _{i=1}^{n}\left( y_{i}-\overline {y}\right) ^{2}}}$$

Conditions
The conditions that must be met for the Pearson correlation coefficient to be valid are:
The relationship to be studied is linear (otherwise, the Pearson coefficient cannot detect it).
Both variables must be numerical.
Normality: both variables must be normally distributed. In practice, it is usually considered valid even when they moderately deviate from normality.
Homoscedasticity: the variance of $Y$ must be constant throughout the variable $X$. This can be checked if in a scatterplot the values of $Y$ maintain the same dispersion in different regions of variable $X$.
Characteristics
It takes values between [-1, +1], with +1 being a perfect positive linear correlation and -1 a perfect negative linear correlation.
It is independent of the scales in which the variables are measured.
It does not change when increasing linear transformations are applied to the variables; a transformation with negative slope only reverses the sign of the coefficient.
It does not take into account whether the variables are dependent or independent.
The Pearson correlation coefficient does not equal the slope of the regression line.
It is sensitive to outliers, so it is recommended, if they can be justified, to exclude them before performing the calculation.
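To connect the formula with the functions used later in this document, the sample Pearson coefficient can be computed by hand with NumPy and compared against np.corrcoef (a minimal sketch with simulated data).

# Manual calculation of the Pearson coefficient (illustrative sketch)
# ==============================================================================
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)

# Numerator and denominator of the sample formula (the n-1 terms cancel out)
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean())**2) * np.sum((y - y.mean())**2)
)
r_numpy = np.corrcoef(x, y)[0, 1]
print(f"Manual r    : {r_manual:.6f}")
print(f"np.corrcoef : {r_numpy:.6f}")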
Spearman Coefficient (Spearman's rho)¶
The Spearman coefficient is equivalent to the Pearson coefficient but with a prior transformation of the data to ranks. This means that, instead of directly using the value of each variable, the data are ordered and replaced by their respective rank (first, second, third...). It is used as a non-parametric alternative to the Pearson coefficient when the values are ordinal or when the values are continuous but do not satisfy the normality condition.
$$r_{s}=1-\frac{6\sum d_{i}^{2}}{n (n^{2}-1)}$$

where $d_{i}$ is the difference between the ranks of observation $i$ in the two variables and $n$ is the number of observations.
The Spearman coefficient requires that the relationship between variables be monotonic, that is, when one variable grows the other also grows, or when one grows the other decreases (the trend is constant).
By working with the order of observations instead of their actual value, it has the characteristic of being less sensitive than Pearson to extreme values.
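The equivalence between the Spearman coefficient and the Pearson coefficient applied to ranks can be checked directly (a minimal sketch with simulated data; scipy.stats.rankdata assigns average ranks to tied values).

# Spearman coefficient as Pearson applied to ranks (illustrative sketch)
# ==============================================================================
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(size=100)
y = np.exp(x) + rng.normal(scale=0.1, size=100)  # monotonic, non-linear relation

rho_scipy = stats.spearmanr(x, y)[0]
rho_via_ranks = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))[0]
print(f"spearmanr        : {rho_scipy:.6f}")
print(f"pearson on ranks : {rho_via_ranks:.6f}")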
Kendall's Tau Coefficient¶
Kendall correlation is a non-parametric method that, like Spearman correlation, uses the ranks of the observations.
It is another alternative to the Pearson correlation coefficient when the normality condition is not met. It is usually used instead of the Spearman coefficient when the number of observations is small or the values accumulate in a region so that the number of ties when generating the ranking is high.
$$\tau= \frac{C-D}{\frac{1}{2}n(n-1)}$$

where $C$ is the number of concordant pairs (pairs of observations whose ranks are ordered the same way in both variables) and $D$ is the number of discordant pairs (pairs whose ranks are ordered in opposite ways).
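The counting of concordant and discordant pairs can be made explicit with a brute-force double loop and compared against scipy.stats.kendalltau (a minimal sketch; scipy's default tau-b coincides with the simple formula above only when there are no ties).

# Kendall's tau from concordant and discordant pairs (illustrative sketch)
# ==============================================================================
import itertools
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.permutation(20).astype(float)   # no ties in x
y = x + rng.normal(scale=5, size=20)    # continuous noise, ties very unlikely

C = D = 0
for i, j in itertools.combinations(range(len(x)), 2):
    s = np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
    if s > 0:
        C += 1
    elif s < 0:
        D += 1

n = len(x)
tau_manual = (C - D) / (0.5 * n * (n - 1))
tau_scipy = stats.kendalltau(x, y)[0]
print(f"Manual tau : {tau_manual:.6f}")
print(f"kendalltau : {tau_scipy:.6f}")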
Jackknife Correlation¶
The Pearson correlation coefficient is effective in very diverse areas. However, it has the disadvantage of not being robust against outliers even when the normality condition is met. If two variables have a common peak or valley in a single observation, for example due to a reading error, the correlation will be dominated by this record even though there is no real correlation between the two variables. The same can occur in the opposite direction. If two variables are highly correlated except for one observation where the values are very different, then the existing correlation will be masked. One way to avoid this is to resort to Jackknife correlation, which consists of calculating all possible correlation coefficients between two variables if each observation is excluded one at a time. The average of all calculated Jackknife correlations attenuates to some extent the effect of the outlier.
$$\bar{\theta}_{(A,B)} = \text{Average Jackknife correlation (A,B)} = \frac{1}{n} \sum_{i=1}^n \hat r_i$$

where $n$ is the number of observations and $\hat r_i$ is the Pearson correlation coefficient estimated between variables $A$ and $B$, having excluded observation $i$.
In addition to the average, its standard error ($SE$) can be estimated, thus obtaining confidence intervals for the Jackknife correlation and its corresponding p-value.
$$SE = \sqrt{\frac{n-1}{n} \sum_{i=1}^n (\hat r_i - \bar{\theta})^2}$$

95% confidence interval ($Z=1.96$):

$$\text{Average Jackknife correlation (A,B)} \pm 1.96 \times SE$$

$$\left[ \bar{\theta} - 1.96 \sqrt{\frac{n-1}{n} \sum_{i=1}^n (\hat r_i - \bar{\theta})^2} \ , \ \ \bar{\theta} + 1.96 \sqrt{\frac{n-1}{n} \sum_{i=1}^n (\hat r_i - \bar{\theta})^2} \right]$$

P-value for the null hypothesis that $\bar{\theta} = 0$:

$$Z_{calculated} = \frac{\bar{\theta} - H_0}{SE} = \frac{\bar{\theta} - 0}{\sqrt{\frac{n-1}{n} \sum_{i=1}^n (\hat r_i - \bar{\theta})^2}}$$

$$p_{value} = P(Z > Z_{calculated})$$

When this method is used, it is convenient to calculate the difference between the correlation value obtained by Jackknife correlation ($\bar{\theta}$) and the one obtained when all observations are used ($\hat{r}$). This difference is known as the bias. Its magnitude is an indicator of how much the estimation of the correlation between two variables is influenced by an outlier.

$$Bias = (n-1) \times (\bar{\theta} - \hat{r})$$

If the difference between each correlation ($\hat r_i$) estimated in the Jackknife process and the correlation value ($\hat r$) obtained when all observations are used is calculated, the most influential observations can be identified.
When the study requires minimizing the presence of false positives as much as possible, even if false negatives increase, the lowest value among all those calculated in the Jackknife process can be selected as the correlation value.
$$\text{Correlation}= \min \{ \hat r_1, \hat r_2,..., \hat r_n \}$$

Although the Jackknife method increases the robustness of the Pearson correlation, if the outliers are very extreme their influence will still be notable. A graphical representation of the data is always convenient to identify whether there are outliers and remove them. Other robust alternatives are Spearman correlation or the Bootstrapping method.
Partial Correlation¶
As explained, correlation studies the (linear or monotonic) relationship between two variables. It may happen that the relationship shown by two variables is due to a third variable that influences the other two. This phenomenon is known as confounding. For example, if a person's foot size is correlated with their intelligence, a high positive correlation is found; however, this relationship is due to a third variable related to the other two: age.

Partial correlation allows studying the linear relationship between two variables while blocking the effect of a third (or more) variables. If the correlation value of two variables is different from the partial correlation value of those same two variables when a third is controlled, it means that the third variable influences the other two. The partial_corr() function from the pingouin package allows studying partial correlations.
Linear Correlation Example¶
A study aims to analyze whether there is a positive linear correlation between people's height and weight. The data used in this example were obtained from the book Statistical Rethinking by Richard McElreath. The dataset contains information collected by Nancy Howell in the late 1960s about the !Kung San people, who live in the Kalahari Desert between Botswana, Namibia, and Angola.
Libraries¶
The libraries used in this document are:
# Data processing
# ==============================================================================
import pandas as pd
import numpy as np
# Graphics
# ==============================================================================
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import seaborn as sns
# Preprocessing and analysis
# ==============================================================================
import statsmodels.api as sm
import pingouin as pg
from scipy import stats
from scipy.stats import pearsonr
# Configuration warnings
# ==============================================================================
import warnings
warnings.filterwarnings('once')
Data¶
# Data reading
# ==============================================================================
url = ('https://raw.githubusercontent.com/JoaquinAmatRodrigo/' +
'Estadistica-machine-learning-python/master/data/Howell1.csv')
data = pd.read_csv(url)
# Only information from individuals over 18 years old is used.
data = data[data.age > 18]
data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 346 entries, 0 to 543
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   height  346 non-null    float64
 1   weight  346 non-null    float64
 2   age     346 non-null    float64
 3   male    346 non-null    int64
dtypes: float64(3), int64(1)
memory usage: 13.5 KB
Graphical Analysis¶
First, the two variables are represented using a scatter plot to visually assess whether there is a linear or monotonic relationship. If there is not, calculating such correlations is not meaningful.
# Graph
# ==============================================================================
fig, ax = plt.subplots(1, 1, figsize=(6, 3))
ax.scatter(x=data.height, y=data.weight, alpha=0.8)
ax.set_xlabel('Height')
ax.set_ylabel('Weight')
ax.set_title('Height-weight scatter plot');
The scatter plot appears to indicate a positive linear relationship between both variables.
To be able to choose the appropriate correlation coefficient, the type of variables and their distribution must be analyzed. In this case, both variables are continuous quantitative and can be ordered to convert them into a ranking, so, a priori, all three coefficients could be applied. The choice will be made based on the distribution of the observations: normality, homoscedasticity, and presence of outliers.
Normality¶
# Variable distribution graph
# ==============================================================================
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(8, 4), sharey=True)
axs[0].hist(x=data.height, bins=20, color="#3182bd", alpha=0.5)
axs[0].plot(data.height, np.full_like(data.height, -0.01), '|k', markeredgewidth=1)
axs[0].set_title('Height distribution')
axs[0].set_xlabel('height')
axs[0].set_ylabel('counts')
axs[1].hist(x=data.weight, bins=20, color="#3182bd", alpha=0.5)
axs[1].plot(data.weight, np.full_like(data.weight, -0.01), '|k', markeredgewidth=1)
axs[1].set_title('Weight distribution')
axs[1].set_xlabel('weight')
plt.tight_layout();
# Q-Q plot
# ==============================================================================
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(8, 4))
sm.qqplot(
data.height,
fit = True,
line = 'q',
alpha = 0.4,
lw = 2,
ax = axs[0]
)
axs[0].set_title('Height')
axs[0].tick_params(labelsize = 7)
sm.qqplot(
data.weight,
fit = True,
line = 'q',
alpha = 0.4,
lw = 2,
ax = axs[1]
)
axs[1].set_title('Weight')
axs[1].tick_params(labelsize = 7)
plt.tight_layout();
In addition to the graphical study, two statistical tests are used to test the normality of the data: Shapiro-Wilk test and D'Agostino's K-squared test. The latter is included in the statsmodels summary under the name Omnibus.
In both tests, the null hypothesis considers that the data follow a normal distribution, therefore, if the p-value is not lower than the selected reference level alpha, there is no evidence to rule out that the data are normally distributed.
# Shapiro-Wilk test for normality
# ==============================================================================
shapiro_test = stats.shapiro(data.height)
print(f"Variable height: {shapiro_test}")
shapiro_test = stats.shapiro(data.weight)
print(f"Variable weight: {shapiro_test}")
Variable height: ShapiroResult(statistic=np.float64(0.9910705950686569), pvalue=np.float64(0.034413216987494714))
Variable weight: ShapiroResult(statistic=np.float64(0.9911816371339782), pvalue=np.float64(0.0367282884335488))
# D'Agostino's K-squared test for normality
# ==============================================================================
k2, p_value = stats.normaltest(data.height)
print(f"Variable height: Statistic = {k2}, p-value = {p_value}")
k2, p_value = stats.normaltest(data.weight)
print(f"Variable weight: Statistic = {k2}, p-value = {p_value}")
Variable height: Statistic = 7.210790495766356, p-value = 0.02717670115638557
Variable weight: Statistic = 8.402628478646044, p-value = 0.014975881988444982
The graphical analysis and statistical tests show evidence that normality cannot be assumed in either of the two variables. Being strict, this fact excludes the possibility of using the Pearson coefficient, leaving Spearman or Kendall as alternatives. However, given that the distribution does not deviate much from normality and that the Pearson coefficient has some robustness, for practical purposes it could be used as long as this fact is taken into account and communicated in the results. Another possibility is to try to transform the variables to improve their distribution, for example, by applying the logarithm.
# Logarithmic transformation of the data
# ==============================================================================
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(5, 4))
sm.qqplot(
np.log(data.height),
fit = True,
line = 'q',
alpha = 0.4,
lw = 2,
ax = ax
)
ax.set_title('Q-Q plot log(height)', fontsize = 13)
ax.tick_params(labelsize = 7)
shapiro_test = stats.shapiro(np.log(data.height))
print(f"Variable height: {shapiro_test}")
Variable height: ShapiroResult(statistic=np.float64(0.9922663896096799), pvalue=np.float64(0.06946765640621282))
The logarithmic transformation of the height variable achieves a distribution closer to normal.
Homoscedasticity¶
Homoscedasticity implies that the variance remains constant. It can be analyzed graphically by representing the observations in a scatter plot and seeing if they maintain homogeneity in their dispersion along the X axis. A conical shape is a clear indication of lack of homoscedasticity. Two statistical tests used to test homoscedasticity are: Goldfeld-Quandt test and Breusch-Pagan test.
As shown in the scatter plot generated at the beginning of the exercise, no conical pattern is observed and the dispersion is constant.
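Both tests are available in statsmodels; as a sketch (reusing the data and sm objects loaded above), they can be applied to the residuals of a simple regression of weight on height. In both tests the null hypothesis is homoscedasticity, so a p-value above the chosen alpha gives no evidence against constant variance.

# Homoscedasticity tests with statsmodels (illustrative sketch)
# ==============================================================================
from statsmodels.stats.diagnostic import het_breuschpagan, het_goldfeldquandt

X = sm.add_constant(data['height'])
model = sm.OLS(data['weight'], X).fit()

# Breusch-Pagan test on the residuals of the regression
bp_stat, bp_pvalue, _, _ = het_breuschpagan(model.resid, X)
print(f"Breusch-Pagan  : statistic = {bp_stat:.3f}, p-value = {bp_pvalue:.3f}")

# Goldfeld-Quandt test comparing the variance of two subsamples
gq_stat, gq_pvalue, _ = het_goldfeldquandt(data['weight'], X)
print(f"Goldfeld-Quandt: statistic = {gq_stat:.3f}, p-value = {gq_pvalue:.3f}")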
Correlation Coefficients¶
Due to the lack of normality, the results generated by Pearson are not entirely accurate. However, given that the deviation from normality is slight and no outliers are observed, for illustrative purposes, we proceed to calculate all three types of coefficients.
Again, remember that when any of the conditions assumed by a model or statistical test are not met, it does not mean that it must necessarily be discarded, but one must be aware of the implications it has and always report it in the results.
Pandas
Pandas allows calculating the correlation of two Series (columns of a DataFrame). The calculation is done pairwise, automatically eliminating those with NA/null values. A limitation of Pandas is that it does not calculate statistical significance.
# Correlation calculation with Pandas
# ==============================================================================
print('Pearson correlation : ', data['weight'].corr(data['height'], method='pearson'))
print('Spearman correlation: ', data['weight'].corr(data['height'], method='spearman'))
print('Kendall correlation : ', data['weight'].corr(data['height'], method='kendall'))
Pearson correlation : 0.7528177220327669
Spearman correlation: 0.7510966609219974
Kendall correlation : 0.5639709660523899
Scipy.stats
The Scipy.stats implementation does allow calculating statistical significance in addition to the correlation coefficient. The stats.pearsonr() function returns an error if any of the observations contain NA/null values. The stats.spearmanr() and stats.kendalltau() functions do allow automatically excluding them if nan_policy='omit' is specified.
# Correlation and significance calculation with Scipy
# ==============================================================================
r, p = stats.pearsonr(data['weight'], data['height'])
print(f"Pearson correlation : r={r}, p-value={p}")
r, p = stats.spearmanr(data['weight'], data['height'])
print(f"Spearman correlation: r={r}, p-value={p}")
r, p = stats.kendalltau(data['weight'], data['height'])
print(f"Kendall correlation : r={r}, p-value={p}")
Pearson correlation : r=0.752817722032767, p-value=1.8941037794174146e-64
Spearman correlation: r=0.7510966609219974, p-value=5.2882247217804375e-64
Kendall correlation : r=0.5639709660523899, p-value=3.162649137764635e-54
Pingouin
The Pingouin library has one of the most complete implementations. With the corr() function, in addition to the correlation coefficient, its significance, confidence interval, and statistical power among others are obtained.
# Correlation, significance and intervals calculation with pingouin
# ==============================================================================
display(pg.corr(data['weight'], data['height'], method='pearson'))
display(pg.corr(data['weight'], data['height'], method='spearman'))
display(pg.corr(data['weight'], data['height'], method='kendall'))
|          | n   | r        | CI95%        | p-val        | BF10     | power |
|----------|-----|----------|--------------|--------------|----------|-------|
| pearson  | 346 | 0.752818 | [0.7, 0.8]   | 1.894104e-64 | 8.84e+60 | 1.0   |

|          | n   | r        | CI95%        | p-val        | power |
|----------|-----|----------|--------------|--------------|-------|
| spearman | 346 | 0.751097 | [0.7, 0.79]  | 5.288225e-64 | 1.0   |

|          | n   | r        | CI95%        | p-val        | power |
|----------|-----|----------|--------------|--------------|-------|
| kendall  | 346 | 0.563971 | [0.49, 0.63] | 3.162649e-54 | 1.0   |
Conclusion¶
The statistical tests show a correlation between moderate and high (r ≈ 0.56 with Kendall and r ≈ 0.75 with Pearson and Spearman), with clear statistical evidence that the observed relationship is not due to chance ($\text{p-value} \approx 0$).
Jackknife Correlation Example¶
A research team wants to study whether there is a correlation in the presence of two substances (A and B) in river water. To do this, they have performed a series of measurements in which the concentration of the two substances is quantified in 10 independent water samples. It is suspected that the reading instrument suffers some failure that causes some readings to spike, which is why a robust correlation method is to be used. The objective of this example is to illustrate the Jackknife method, so it is assumed that the conditions for Pearson correlation are met.
Data¶
# Simulated data from two variables A and B
a = np.array([12, 9, 6, 7, 2, 5, 4, 0, 1, 8])
b = np.array([3, 5, 1, 9, 5, 3, 7, 2, 10, 5])
# An outlier is introduced
a[5] = 20
b[5] = 16
# Graph
# ==============================================================================
fig, ax = plt.subplots(1, 1, figsize=(6, 3))
ax.plot(a, label='A')
ax.plot(b, label='B')
ax.set_xlabel('Sample ID')
ax.set_ylabel('Concentration')
ax.set_title('Concentration of substances A and B in samples')
ax.legend();
Pearson Correlation¶
We proceed to calculate the Pearson correlation with and without the outlier.
# Correlation with outlier
# ==============================================================================
r, p = stats.pearsonr(a, b)
print(f"Pearson correlation with outlier: r={r}, p-value={p}")
Pearson correlation with outlier: r=0.5173731151689149, p-value=0.12563522982639552
# Correlation without outlier
# ==============================================================================
r, p = stats.pearsonr(np.delete(a, 5), np.delete(b, 5))
print(f"Pearson correlation without outlier: r={r}, p-value={p}")
Pearson correlation without outlier: r=-0.18420184544326054, p-value=0.6351961086690546
It is confirmed that observation number 5 has a great influence on the correlation result, being 0.52 if it is present and -0.18 if it is excluded.
Jackknife Pearson Correlation¶
# Jackknife correlation function
# ==============================================================================
def jackknife_correlation(x, y):
    '''
    This function applies the Jackknife method for calculating the Pearson
    correlation coefficient.

    Parameters
    ----------
    x : 1D np.ndarray, pd.Series
        Variable X.
    y : 1D np.ndarray, pd.Series
        Variable Y.

    Returns
    -------
    dict with keys:
        jackknife_values : 1D np.ndarray
            Correlation value for each Jackknife iteration (observation i excluded).
        average : float
            Average of the Jackknife correlations.
        se : float
            Standard error of the Jackknife estimate.
        bias : float
            Jackknife estimate of the bias.
    '''
    n = len(x)
    # Pearson correlation leaving out observation i, for each i
    jackknife_values = np.array([
        stats.pearsonr(np.append(x[:i], x[i + 1:]),
                       np.append(y[:i], y[i + 1:]))[0]
        for i in range(n)
    ])
    jack_avg  = np.mean(jackknife_values)
    full_corr = stats.pearsonr(x, y)[0]
    se   = np.sqrt(((n - 1) / n) * np.sum((jackknife_values - jack_avg)**2))
    bias = (n - 1) * (jack_avg - full_corr)

    return {
        'jackknife_values': jackknife_values,
        'average': jack_avg,
        'se': se,
        'bias': bias
    }
correlation = jackknife_correlation(x=a, y=b)
print(f"Jackknife correlation: {correlation['average']}")
print(f"Standard error : {correlation['se']}")
print(f"Bias error : {correlation['bias']}")
print(f"Jackknife values : {correlation['jackknife_values']}")
Jackknife correlation: 0.4781856690553489
Standard error : 0.6915097651492328
Bias error : -0.35268701502209404
Jackknife values : [ 0.64719522  0.53705998  0.54597653  0.52827609  0.51215595 -0.18420185
  0.53554935  0.44125573  0.69065085  0.52793883]
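Following the formulas introduced earlier, the 95% confidence interval and the p-value for the Jackknife average, as well as the conservative minimum value, can be obtained from the returned dictionary (a minimal sketch based on the normal approximation; it reuses the correlation result from the previous cell).

# Confidence interval and p-value for the Jackknife correlation (sketch)
# ==============================================================================
avg = correlation['average']
se  = correlation['se']

ci_lower = avg - 1.96 * se
ci_upper = avg + 1.96 * se
z_calculated = (avg - 0) / se
p_value = stats.norm.sf(z_calculated)   # P(Z > Z_calculated)

print(f"95% CI              : ({ci_lower:.4f}, {ci_upper:.4f})")
print(f"Z = {z_calculated:.4f}, p-value = {p_value:.4f}")
print(f"Minimum Jackknife r : {np.min(correlation['jackknife_values']):.4f}")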
To identify if there is any very influential value, the change in the correlation coefficient produced by the exclusion of each observation can be plotted.
correlation_variation = correlation['jackknife_values'] - stats.pearsonr(a, b)[0]
fig, ax = plt.subplots(1, 1, figsize=(6, 3))
ax.plot(correlation_variation)
ax.set_xlabel('Sample ID')
ax.set_ylabel('Correlation variation')
ax.set_title('Variation of correlation when excluding each sample');
The Jackknife correlation method has only dampened a small part of the outlier's influence; however, it has made it possible to identify which observation affects the correlation the most.
Correlation Matrix Example¶
When multiple numerical variables are available, for example in statistical modeling and machine learning problems, it is convenient to study the degree of correlation between the available variables.
One way to do this is through correlation matrices, which show the correlation coefficient for each pair of variables.
# Data
# ==============================================================================
url = ('https://raw.githubusercontent.com/JoaquinAmatRodrigo/' +
'Estadistica-machine-learning-python/master/data/SaratogaHouses.csv')
data = pd.read_csv(url, sep=",")
# Columns are renamed to be more descriptive
data.columns = ["price", "total_sqft", "age", "land_price",
"living_sqft", "college_graduates", "bedrooms",
"fireplaces", "bathrooms", "rooms", "heating",
"heating_consumption", "drainage", "lake_views",
"new_construction", "air_conditioning"]
# Numerical variables
data = data.select_dtypes(include=['float64', 'int'])
# Correlation matrix
# ==============================================================================
corr_matrix = data.corr(method='pearson')
corr_matrix
| | price | total_sqft | age | land_price | living_sqft | college_graduates | bedrooms | fireplaces | bathrooms | rooms |
|---|---|---|---|---|---|---|---|---|---|---|
| price | 1.000000 | 0.158333 | -0.188793 | 0.581266 | 0.712390 | 0.200119 | 0.400349 | 0.376786 | 0.597250 | 0.531170 |
| total_sqft | 0.158333 | 1.000000 | -0.016352 | 0.059222 | 0.163450 | -0.033148 | 0.113982 | 0.085226 | 0.084823 | 0.137604 |
| age | -0.188793 | -0.016352 | 1.000000 | -0.021818 | -0.174242 | -0.037785 | 0.027125 | -0.172022 | -0.361897 | -0.082264 |
| land_price | 0.581266 | 0.059222 | -0.021818 | 1.000000 | 0.423441 | 0.228427 | 0.202449 | 0.211727 | 0.297498 | 0.298865 |
| living_sqft | 0.712390 | 0.163450 | -0.174242 | 0.423441 | 1.000000 | 0.209981 | 0.656196 | 0.473788 | 0.718564 | 0.733666 |
| college_graduates | 0.200119 | -0.033148 | -0.037785 | 0.228427 | 0.209981 | 1.000000 | 0.162919 | 0.246626 | 0.179541 | 0.157068 |
| bedrooms | 0.400349 | 0.113982 | 0.027125 | 0.202449 | 0.656196 | 0.162919 | 1.000000 | 0.284475 | 0.458033 | 0.671863 |
| fireplaces | 0.376786 | 0.085226 | -0.172022 | 0.211727 | 0.473788 | 0.246626 | 0.284475 | 1.000000 | 0.436234 | 0.319894 |
| bathrooms | 0.597250 | 0.084823 | -0.361897 | 0.297498 | 0.718564 | 0.179541 | 0.458033 | 0.436234 | 1.000000 | 0.517585 |
| rooms | 0.531170 | 0.137604 | -0.082264 | 0.298865 | 0.733666 | 0.157068 | 0.671863 | 0.319894 | 0.517585 | 1.000000 |
Correlation matrices have the disadvantage of being considerably large when many variables are available. To facilitate the identification of pairs of variables with high correlations, it is convenient to convert them to long format (tidy).
def tidy_corr_matrix(corr_mat):
'''
Function to convert a pandas correlation matrix to tidy format.
'''
corr_mat = corr_mat.stack().reset_index()
corr_mat.columns = ['variable_1','variable_2','r']
corr_mat = corr_mat.loc[corr_mat['variable_1'] != corr_mat['variable_2'], :]
corr_mat['abs_r'] = np.abs(corr_mat['r'])
corr_mat = corr_mat.sort_values('abs_r', ascending=False)
return corr_mat
tidy_corr_matrix(corr_matrix).head(10)
| | variable_1 | variable_2 | r | abs_r |
|---|---|---|---|---|
| 94 | rooms | living_sqft | 0.733666 | 0.733666 |
| 49 | living_sqft | rooms | 0.733666 | 0.733666 |
| 48 | living_sqft | bathrooms | 0.718564 | 0.718564 |
| 84 | bathrooms | living_sqft | 0.718564 | 0.718564 |
| 4 | price | living_sqft | 0.712390 | 0.712390 |
| 40 | living_sqft | price | 0.712390 | 0.712390 |
| 69 | bedrooms | rooms | 0.671863 | 0.671863 |
| 96 | rooms | bedrooms | 0.671863 | 0.671863 |
| 46 | living_sqft | bedrooms | 0.656196 | 0.656196 |
| 64 | bedrooms | living_sqft | 0.656196 | 0.656196 |
# Heatmap correlation matrix
# ==============================================================================
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(5, 5))
sns.heatmap(
corr_matrix,
annot = True,
cbar = False,
annot_kws = {"size": 8},
vmin = -1,
vmax = 1,
center = 0,
cmap = sns.diverging_palette(20, 220, n=200),
square = True,
ax = ax
)
ax.set_xticklabels(
ax.get_xticklabels(),
rotation = 45,
horizontalalignment = 'right',
)
ax.tick_params(labelsize=10)
Partial Correlation Example¶
We want to study the relationship between the price and weight variables of automobiles. It is suspected that this relationship could be influenced by the engine horsepower variable, since greater vehicle weight requires greater horsepower and, in turn, more powerful engines are more expensive.
Data¶
For this example, the Cars93 dataset available in the R package MASS is used.
# Data reading
# ==============================================================================
url = ('https://raw.githubusercontent.com/JoaquinAmatRodrigo/' +
'Estadistica-machine-learning-python/master/data/Cars93.csv')
data = pd.read_csv(url)
data['log_Price'] = np.log(data.Price)
# Graph
# ==============================================================================
fig, ax = plt.subplots(1, 1, figsize=(6, 3))
ax.scatter(x=data.Weight, y=data.log_Price, alpha=0.8)
ax.set_xlabel('Weight')
ax.set_ylabel('Log(Price)')
ax.set_title('Weight-log(price) scatter plot');
The graph shows a clear linear relationship between a car's weight and the logarithm of its price.
Correlation¶
# Linear correlation calculation
# ==============================================================================
pg.corr(x=data['Weight'], y=data['log_Price'], method='pearson')
| | n | r | CI95% | p-val | BF10 | power |
|---|---|---|---|---|---|---|
| pearson | 93 | 0.763544 | [0.66, 0.84] | 5.640674e-19 | 1.069e+16 | 1.0 |
# Partial linear correlation calculation
# ==============================================================================
pg.partial_corr(data=data, x='Weight', y='log_Price', covar='Horsepower', method='pearson')
| | n | r | CI95% | p-val |
|---|---|---|---|---|
| pearson | 93 | 0.404741 | [0.22, 0.56] | 0.000063 |
Conclusion¶
The correlation between weight and the logarithm of price is high (r = 0.764) and significant ($\text{p-value} \approx 0$). However, when their relationship is studied while controlling for engine horsepower, although it remains significant, the correlation drops considerably (r = 0.405).
It can be stated that the linear relationship between weight and the logarithm of price is influenced by the effect of the engine horsepower variable.
Session Information¶
import session_info
session_info.show(html=False)
-----
matplotlib          3.10.8
numpy               2.4.0
pandas              2.3.3
pingouin            0.5.5
scipy               1.16.3
seaborn             0.13.2
session_info        v1.0.1
statsmodels         0.14.6
-----
IPython             9.9.0
jupyter_client      8.7.0
jupyter_core        5.9.1
-----
Python 3.13.11 | packaged by conda-forge | (main, Dec 6 2025, 11:10:00) [MSC v.1944 64 bit (AMD64)]
Windows-11-10.0.26100-SP0
-----
Session information updated at 2026-01-06 22:07
Bibliography¶
Linear Models with R by Julian J. Faraway
OpenIntro Statistics, Fourth Edition, by David Diez, Mine Çetinkaya-Rundel, Christopher Barr
Introduction to Machine Learning with Python: A Guide for Data Scientists
Points of Significance: Association, correlation and causation, by Naomi Altman & Martin Krzywinski, Nature Methods
Citation Instructions¶
How to cite this document?
If you use this document or any part of it, we would appreciate if you cite it. Thank you very much!
Linear Correlation with Python by Joaquín Amat Rodrigo, available under an Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0 DEED) license at https://www.cienciadedatos.net/documentos/pystats05-linear-correlation-python.html
This document created by Joaquín Amat Rodrigo is licensed under Attribution-NonCommercial-ShareAlike 4.0 International.
Allowed:
- Share: copy and redistribute the material in any medium or format.
- Adapt: remix, transform, and build upon the material.

Under the following terms:
- Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- Non-Commercial: You may not use the material for commercial purposes.
- ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
