More about Data Science and Statistics

Introduction

Statistical methods based on repeated sampling (resampling) are considered non-parametric, since they do not require assumptions about the distribution of the populations studied. They are therefore an alternative to parametric tests (t-test, ANOVA, ...) when the conditions of those tests are not satisfied, or when inference is desired on a parameter other than the mean. Throughout this document, one of the most widely used resampling methods is described and applied: the permutation test.

Permutation Tests

The permutation test is a statistical significance test for studying differences between groups, developed by Ronald Fisher and E.J.G. Pitman in the 1930s. The distribution of the statistic under study (mean, median...) is obtained by calculating its value for every possible reassignment of the observations to the groups. Since all possible arrangements are evaluated, it is an exact test.

To illustrate the idea of permutation tests, a simple experiment is proposed. Suppose a set of subjects distributed into two groups, A and B, of sizes $n_{A}$ and $n_{B}$, whose sample means after conducting the experiment are $\bar{x}_{A}$ and $\bar{x}_{B}$. The goal is to determine if there is a significant difference between the mean of the two groups. This is equivalent to checking if the observed difference is greater than what would be expected solely due to the random assignment of subjects to the two groups, if both samples truly come from the same population.

In statistical terms, the null hypothesis ($H_0$) is that both samples belong to the same distribution and that the observed differences are due only to variations caused by random group assignment. The permutation test allows identifying if there is evidence against this hypothesis. The steps to follow are:


1) Calculate the observed difference between the means of the two groups $(\text{diff}_{observed})$.

2) Pool all observations, without regard to which group they belong to.

3) Generate all possible combinations* in which the observations can be distributed into two groups of sizes $n_{A}$ and $n_{B}$.

4) For each combination, calculate the difference between means $(\text{diff}_{calculated})$. The set of values obtained forms the exact distribution of possible differences under the null hypothesis that both groups come from the same population. This distribution is known as the permutation distribution of the mean difference.

5) Calculate the two-tailed p-value as the proportion of combinations in which the absolute value of the calculated difference is greater than or equal to the absolute value of the observed difference.

In this example, the mean is used as the statistic, but it could be any other (median, variance, ...).
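As an illustration of steps 1-5, the following minimal sketch enumerates every possible combination for two tiny groups. The data and group sizes are invented for the example:

# Exact permutation test on two small hypothetical groups
# ==============================================================================
import itertools
import numpy as np

group_a = np.array([12.0, 15.0, 14.0, 11.0])
group_b = np.array([10.0, 9.0, 11.0])

# 1) Observed difference between means
diff_observed = np.mean(group_a) - np.mean(group_b)

# 2) Pool all observations
pooled = np.concatenate((group_a, group_b))
n_a = len(group_a)

# 3-4) Difference of means for every way of choosing which n_a
#      observations form group A
diffs = []
for idx in itertools.combinations(range(len(pooled)), n_a):
    mask = np.zeros(len(pooled), dtype=bool)
    mask[list(idx)] = True
    diffs.append(pooled[mask].mean() - pooled[~mask].mean())
diffs = np.array(diffs)

# 5) Two-tailed exact p-value
p_value = np.mean(np.abs(diffs) >= np.abs(diff_observed))
print(f"Exact p-value: {p_value:.4f}")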


The only necessary condition for a permutation test is known as exchangeability, according to which all possible permutations have the same probability of occurring under the null hypothesis.

Permutation tests are significance tests and, therefore, are used to calculate p-values. It is important to note that the conclusions drawn are only applicable to experimental designs, that is, studies in which, after selecting the subjects, they are randomly assigned to the different groups.

Note: although the term permutation is used, it really refers to combinations, since the order of observations within each group does not matter.

Monte Carlo Simulation

In permutation tests, the p-value obtained is exact since all possible combinations of the observations are calculated. This becomes complicated or impossible when the sample size is medium or large. For example, there are more than 155 million different combinations to distribute 30 observations into two groups of 15 each.

$$\frac{(m + n)!}{m!n!} = \frac{(15 + 15)!}{15!15!} = 155117520$$
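This count can be verified with Python's standard library:

# Number of ways to split 30 observations into two groups of 15
# ==============================================================================
import math
print(math.comb(30, 15))  # 155117520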

Monte Carlo simulation, also known as a randomization test, consists of using only a random sample of all possible combinations, thus avoiding having to calculate them all. Since not every combination is evaluated, the test is no longer exact but approximate.

The estimation of the p-value through Monte Carlo simulation may present a slight bias. In particular, when the observed value is more extreme than all simulated values, the uncorrected p-value would be exactly zero, which can be problematic from a statistical standpoint. One of the most commonly used corrections for this problem is the one proposed by Davison and Hinkley:

$$p_{value}=\frac{r+1}{k+1}$$

where r is the number of combinations equal to or more extreme than the observed statistic and k is the number of combinations used in the simulation.

This correction ensures that the p-value is never exactly zero, which is more appropriate from the standpoint of statistical inference. Although the correction is negligible when the number of simulations is high, it has the advantage that, if the observed value is more extreme than all of the calculated ones, the p-value is very small but not zero. For computational convenience, a number of permutations ending in 9 (4999, 9999) is usually used.

How many permutations (combinations) are necessary?

There is no predefined number of combinations that guarantees good results. One way to guide the choice is to consider the minimum p-value that can be detected. For example, with 1000 permutations, the smallest possible corrected p-value is $p_{value}=\frac{0+1}{1000+1} \approx 0.001$.

Despite this, the stability and error of the estimation vary greatly depending on the available data. It is commonly recommended to use at least 2000 iterations.
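Under the correction above, the smallest achievable p-value is $\frac{1}{k+1}$ (the case $r = 0$). A quick check for some common choices of $k$:

# Smallest achievable corrected p-value for different numbers of permutations
# ==============================================================================
for k in [999, 1999, 4999, 9999]:
    print(f"k = {k}: minimum p-value = {1 / (k + 1):.6f}")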

How to interpret the obtained p-value?

The p-value is the probability of obtaining a value of the statistic equal to or more extreme than the observed one, if the null hypothesis is true: $P(\text{data} \mid H_0)$. It is important not to confuse this with the probability that the null hypothesis is true, $P(H_0 \mid \text{data})$. If the p-value is very small, it can be interpreted as evidence against the null hypothesis, but not as confirmation of any alternative hypothesis.

Example: Comparing Means

Suppose a study that aims to determine if participation in extracurricular activities increases students' empathic capacity. To do this, the school offers a voluntary program in which each participant is randomly assigned to a "control" group that does not receive extracurricular classes or a "treatment" group that does receive them. At the end of the year, all subjects in the study take an exam that quantifies their empathic capacity. In view of the results, can it be considered that extracurricular classes have an impact on how students relate socially (on average)?

The experimental design of the study employs random assignment of subjects to two groups (treatment and control). This randomness in assignment implies that, on average, the two groups are equal for all characteristics, so the only difference between them is whether or not they receive treatment. To determine if the observed difference is significant, one can study how probable it is to obtain an equal or greater difference if the treatment has no effect, or in other words, determine if the observed difference is greater than would be expected solely due to the variability produced by the random formation of groups.

The null hypothesis considers that the mean of both groups is the same:

$$\mu_{control} = \mu_{treatment}$$

$$\mu_{control} - \mu_{treatment} = 0$$

The alternative hypothesis considers that the mean of both groups is different:

$$\mu_{control} \neq \mu_{treatment}$$

$$\mu_{control} - \mu_{treatment} \neq 0$$

✏️ Note

While the experimental design with random assignment allows for cause-effect conclusions, the results cannot be extrapolated to the entire population, since the subjects were not selected at random but were volunteers.

Libraries

The libraries used in this document are:

# Data processing
# ==============================================================================
import pandas as pd
import numpy as np

# Graphics
# ==============================================================================
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')

# Statistical tests
# ==============================================================================
from scipy import stats

# Miscellaneous
# ==============================================================================
from tqdm import tqdm

# Warning configuration
# ==============================================================================
import warnings
warnings.filterwarnings('ignore')

Data

The data used in this example were obtained from the book Comparing Groups: Randomization and Bootstrap Methods Using R.

# Data
# ==============================================================================
url = 'https://raw.githubusercontent.com/JoaquinAmatRodrigo/Estadistica-con-R/' \
      + 'master/datos/AfterSchool.csv'
data = pd.read_csv(url)
data = data[['Treatment', 'Delinq']]
data = data.rename(columns={'Treatment': 'group', 'Delinq': 'value'})
data['group'] = np.where(data['group'] == 0, 'control', 'treatment')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 356 entries, 0 to 355
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   group   356 non-null    object 
 1   value   356 non-null    float64
dtypes: float64(1), object(1)
memory usage: 5.7+ KB

Descriptive Analysis

# Observed distribution plots
# ==============================================================================
fig, axs = plt.subplots(nrows=2, ncols=1, figsize=(6, 7))
sns.violinplot(
    x     = data.value,
    y     = data.group,
    color = ".8",
    ax    = axs[0]
)
sns.stripplot(
    x    = data.value,
    y    = data.group,
    data = data,
    size = 4,
    jitter  = 0.1,
    palette = 'tab10',
    ax = axs[0]
)
axs[0].set_title('Distribution of values by group')
axs[0].set_ylabel('group')
axs[0].set_xlabel('value');

for group in data.group.unique():
    data_temp = data[data.group == group]['value']
    data_temp.plot.kde(ax=axs[1], label=group)
    axs[1].plot(data_temp, np.full_like(data_temp, 0), '|k', markeredgewidth=1)

axs[1].set_title('Distribution of values by group')
axs[1].set_xlabel('value');
axs[1].legend()

fig.tight_layout();
# Descriptive statistics by group
# ==============================================================================
data.groupby(by='group').describe()
              value
              count       mean       std        min        25%        50%        75%        max
group
control       187.0  50.725591  10.52089  44.463082  44.463082  44.463082  50.933188  89.753823
treatment     169.0  49.018956   8.97423  44.463082  44.463082  44.463082  50.933188  89.753823

The graphical representation shows a clear asymmetric distribution of values, with a pronounced right tail in both groups. No obvious difference is observed between groups. The location statistics (mean and median) of both groups are similar. The same is true for their dispersion.

Judging by the graphical analysis and descriptive statistics, there is no clear evidence that attending extracurricular activities increases the empathic capacity of young people.

Since the observations are not normally distributed, the parametric t-test cannot be applied. As an alternative, a non-parametric test based on resampling is used. Since this is an experimental design in which subjects have been randomly assigned to each group, the appropriate test is the permutation test.

Permutation Test

First, the difference between the means of both groups (observed difference) is calculated.

def diff_mean(x1, x2):
    '''
    Function to calculate the difference of means between two groups.
    
    Parameters
    ----------
    x1 : numpy array
         values from sample 1.
         
    x2 : numpy array
         values from sample 2.
         
    Returns
    -------
    statistic: float
        value of the statistic.
    '''
    
    statistic = np.mean(x1) - np.mean(x2)
    return statistic
diff_observed = diff_mean(
                    x1 = data[data.group == 'control']['value'],
                    x2 = data[data.group == 'treatment']['value']
                )
print(f"Observed difference: {diff_observed}")
Observed difference: 1.706635519197242

Determining if the observed difference is significant is equivalent to asking how probable it is to obtain this difference if the treatment has no effect.

To obtain the exact probability, it is necessary to generate all possible combinations in which 356 subjects can be divided into two groups of 187 and 169, and calculate the difference of means for each one. The number of possible combinations is very high ($3.93 \times 10^{105}$), so it is not feasible to calculate them all. Instead, Monte Carlo simulation is used.
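The size of this number can be checked directly:

# Number of ways to split 356 subjects into two groups of 187 and 169
# ==============================================================================
import math
print(f"{math.comb(356, 187):.2e}")  # ~3.93e+105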

def permutations(x1, x2, statistic_fun, n_iterations=9999, random_state=89776):
    '''
    Function to calculate the value of the statistic in multiple permutations
    of two samples.
    
    Parameters
    ----------
    x1 : numpy array or pandas Series
         values from sample 1.
    x2 : numpy array or pandas Series
         values from sample 2.
    statistic_fun : function
        function that receives the two samples as arguments and returns the value
        of the statistic.
    n_iterations : int
        number of calculated permutations (default `9999`).
    random_state : int
        Seed for the random number generator (default `89776`).
    
    Returns
    -------
    permutation_results: numpy array
        value of the statistic in each permutation.
    '''
    
    rng = np.random.default_rng(random_state)
    x1 = np.asarray(x1)
    x2 = np.asarray(x2)
    
    n_x1 = len(x1)
    data_pool = np.hstack((x1, x2))
    
    permutation_results = np.full(shape=n_iterations, fill_value=np.nan)
    
    for i in tqdm(range(n_iterations)):
        rng.shuffle(data_pool)
        statistic = statistic_fun(data_pool[:n_x1], data_pool[n_x1:])
        permutation_results[i] = statistic
        
    return permutation_results
dist_permut = permutations(
                x1 = data.loc[data.group == 'control', 'value'],
                x2 = data.loc[data.group == 'treatment', 'value'],
                statistic_fun = diff_mean,
                n_iterations  = 9999
              )

The simulated data form what is known as the permutation distribution or Monte Carlo distribution. This distribution represents the expected variation in the difference of means due solely to random group assignment.

# Permutation distribution
# ==============================================================================
fig, ax = plt.subplots(ncols=1, nrows=1, figsize=(7,4))
ax.hist(dist_permut, bins=30, density=True, color='#3182bd', alpha=0.5)
ax.axvline(x=dist_permut.mean(), color='black', label='distribution mean')
ax.axvline(x=diff_observed, color='red', label='observed difference')
ax.axvline(x=-diff_observed, color='red')

ax.set_title('Permutation distribution')
ax.set_xlabel('mean difference')
ax.set_ylabel('density')
ax.legend();
pd.Series(dist_permut).describe()
count    9999.000000
mean        0.014782
std         1.036286
min        -3.832560
25%        -0.698541
50%         0.030300
75%         0.686257
max         3.601623
dtype: float64

Since the hypothesis simulated when generating the permutations was that the treatment is not effective (null hypothesis), the generated distribution is centered at zero (black vertical line).

Finally, the probability (p-value) of obtaining differences equal to or more extreme than the observed one (red vertical lines) is calculated.

# Empirical p-value with and without correction (two-sided)
# ==============================================================================
p_value = (sum(np.abs(dist_permut) >= np.abs(diff_observed)))/len(dist_permut)
p_value_correc = (sum(np.abs(dist_permut) >= np.abs(diff_observed)) + 1)/(len(dist_permut) + 1)
print(f"p-value without correction: {p_value}")
print(f"p-value with correction: {p_value_correc}")
p-value without correction: 0.1079107910791079
p-value with correction: 0.108

Conclusion

The 356 subjects in the study were randomly assigned to a control group (n=187) that did not attend extracurricular classes or a treatment group (n=169) that did attend. A permutation test is used to determine whether there is a significant difference between the average empathic capacity of the two groups. The p-value, calculated through Monte Carlo simulation with the continuity correction suggested by Davison and Hinkley (1997), shows very weak evidence against the null hypothesis that both groups are equal (that the treatment has no effect). Thus, there is no evidence to affirm that attending extracurricular classes improves empathic capacity. Strictly speaking, the result cannot be extrapolated to the student population, since the selection of subjects was not random.

Note: it is important to remember that not rejecting the null hypothesis does not mean affirming it. In other words, not having sufficient evidence to reject that both groups are equal does not mean we can affirm that they are.

Alternative with SciPy

The permutations() function implemented above has a didactic purpose, allowing step-by-step understanding of the permutation test calculation process. However, for analysis with large data volumes or a very high number of iterations, it is not the most efficient option.

The SciPy library includes scipy.stats.permutation_test(), a vectorized and optimized implementation that is significantly faster.

# Permutation test with SciPy
# ==============================================================================
def statistic_means(x, y, axis):
    return np.mean(x, axis=axis) - np.mean(y, axis=axis)

x1 = data.loc[data.group == 'control', 'value']
x2 = data.loc[data.group == 'treatment', 'value']

result_scipy = stats.permutation_test(
    data             = (x1, x2),
    statistic        = statistic_means,
    permutation_type = 'independent',
    vectorized       = True,
    n_resamples      = 9999,
    alternative      = 'two-sided',
    random_state     = 9863
)

print(f"p-value (SciPy): {result_scipy.pvalue:.6f}")
p-value (SciPy): 0.101200

The p-value obtained with SciPy is very similar to that calculated with the custom function; the small differences are due to the random nature of Monte Carlo sampling. The main advantage of scipy.stats.permutation_test() lies in its internal optimization and ease of use.

Example: Comparing Variances

In most cases, comparison between groups focuses on differences in the location of the distributions (mean, median...); however, it may also be of interest to compare whether the variances of the two groups are equal. For this type of study, a permutation test can be used in which the statistic employed is the variance.

Returning to the study described in the previous example, it is known that there is no significant difference in the average scores obtained in empathic capacity. However, we want to evaluate whether there is a difference in their variances. This is interesting because, although extracurricular activities may not be able to increase empathy, they could reduce variability among subjects.

The null hypothesis considers that the variance of both groups is the same:

$$\sigma^2_{control} = \sigma^2_{treatment}$$

$$\sigma^2_{control} - \sigma^2_{treatment} = 0$$

The alternative hypothesis considers that the variance of both groups is different:

$$\sigma^2_{control} \neq \sigma^2_{treatment}$$

$$\sigma^2_{control} - \sigma^2_{treatment} \neq 0$$

Permutation Test

First, the difference between the variances of both groups (observed difference) is calculated.

def diff_var(x1, x2):
    '''
    Function to calculate the difference of variances between two samples.
    
    Parameters
    ----------
    x1 : numpy array
         values from sample 1.
         
    x2 : numpy array
         values from sample 2.
         
    Returns
    -------
    statistic: float
        value of the statistic.
    '''
    
    # np.var uses ddof=0 by default (population variance)
    statistic = np.var(x1) - np.var(x2)
    return statistic
diff_observed = diff_var(
                    x1 = data.loc[data.group == 'control', 'value'],
                    x2 = data.loc[data.group == 'treatment', 'value']
                )
print(f"Observed difference: {diff_observed}")
Observed difference: 30.036953280863997

Through permutations, the distribution of the expected variance difference due solely to random group assignment is obtained, under the null hypothesis.

dist_permut = permutations(
                x1 = data[data.group == 'control']['value'],
                x2 = data[data.group == 'treatment']['value'],
                statistic_fun = diff_var,
                n_iterations  = 9999
              )
# Permutation distribution
# ==============================================================================
fig, ax = plt.subplots(ncols=1, nrows=1, figsize=(7,4))
ax.hist(dist_permut, bins=30, density=True, color='#3182bd', alpha=0.5)
ax.axvline(x=dist_permut.mean(), color='black', label='distribution mean')
ax.axvline(x=diff_observed, color='red', label='observed difference')
ax.axvline(x=-diff_observed, color='red')

ax.set_title('Permutation distribution')
ax.set_xlabel('variance difference')
ax.set_ylabel('density')
ax.legend();
pd.Series(dist_permut).describe()
count    9999.000000
mean        0.402285
std        25.198097
min       -99.478055
25%       -16.731303
50%         0.541239
75%        17.763365
max        92.137734
dtype: float64

The permutation distribution shows that the expected difference between variances, if extracurricular activities have no effect, is very close to zero ($\sim 0.4$, black vertical line).

Finally, the probability (p-value) of obtaining differences equal to or more extreme than the observed one (red vertical lines) is calculated with and without continuity correction.

# Empirical p-value with and without correction
# ==============================================================================
p_value = (sum(np.abs(dist_permut) >= np.abs(diff_observed)))/len(dist_permut)
p_value_correc = (sum(np.abs(dist_permut) >= np.abs(diff_observed)) + 1)/(len(dist_permut) + 1)
print(f"p-value without correction: {p_value}")
print(f"p-value with correction: {p_value_correc}")
p-value without correction: 0.23412341234123413
p-value with correction: 0.2342

Conclusion

The 356 subjects in the study were randomly assigned to a control group (n=187) that did not attend extracurricular classes or a treatment group (n=169) that did attend. A permutation test is used to determine whether there is a significant difference in the variance of empathic capacity between the two groups. The p-value, calculated through Monte Carlo simulation with the continuity correction suggested by Davison and Hinkley (1997), shows very weak evidence against the null hypothesis that both groups are equal (that the treatment has no effect). Thus, there is no evidence to affirm that attending extracurricular classes alters the variability in empathic capacity. Strictly speaking, the result cannot be extrapolated to the student population, since the selection of subjects was not random.

Note: it is important to remember that not rejecting the null hypothesis does not mean affirming it. In other words, not having sufficient evidence to reject that both groups are equal does not mean we can affirm that they are.

Alternative with SciPy

As in the previous example, SciPy allows performing the permutation test to compare variances more efficiently.

# Permutation test with SciPy (variances)
# ==============================================================================
def statistic_variances(x, y, axis):
    return np.var(x, axis=axis) - np.var(y, axis=axis)

x1 = data.loc[data.group == 'control', 'value']
x2 = data.loc[data.group == 'treatment', 'value']

result_scipy = stats.permutation_test(
    data             = (x1, x2),
    statistic        = statistic_variances,
    permutation_type = 'independent',
    vectorized       = True,
    n_resamples      = 9999,
    alternative      = 'two-sided',
    random_state     = 9863
)
print(f"p-value (SciPy): {result_scipy.pvalue:.6f}")
p-value (SciPy): 0.230800

Example: Comparing Qualitative Variables (Proportions)

A study aims to determine if a drug reduces the risk of death after heart surgery. An experiment is designed in which patients who are to undergo surgery are randomly distributed into two groups (control and treatment). In view of the results, can it be said that the drug is effective at a 5% significance level?

group       alive   dead   total
control        11     39      50
treatment      14     26      40
---------   -----   ----   -----
total          25     65      90

The two hypotheses to be tested are:

  • $H_0$: the survival percentage is independent of the treatment; the proportion of survivors is the same in both groups.

$$\text{p(control)} = \text{p(treatment)}$$

$$\text{p(control)} - \text{p(treatment)} = 0$$

  • $H_a$: the survival percentage is different between the control and treatment groups.

$$\text{p(control)} \neq \text{p(treatment)}$$

$$\text{p(control)} - \text{p(treatment)} \neq 0$$

Data

# Data
# ==============================================================================
# Of the 90 individuals, 50 are control and 40 treated. True = alive.
control   = np.array(11 * [True] +  39 * [False])
treatment = np.array(14 * [True] +  26 * [False])

Permutation Test

Through permutations, the sampling distribution of p(control) - p(treatment) is estimated. According to the null hypothesis $H_0$, the probability of survival is the same in both groups. To simulate this, subjects are randomly redistributed (maintaining the size of each group) and the difference in proportions is calculated.

First, the observed difference between the survival proportions of both groups is calculated.

def diff_proportions(x1, x2):
    '''
    Function to calculate the difference of proportions between two samples.
    
    Parameters
    ----------
    x1 : numpy array
         values from sample 1.
         
    x2 : numpy array
         values from sample 2.
         
    Returns
    -------
    statistic: float
        value of the statistic.
    '''
    
    # The mean of a boolean vector is the proportion of Trues
    statistic = np.mean(x1) - np.mean(x2)
    return statistic
diff_observed = diff_proportions(
                    x1 = control,
                    x2 = treatment
                )
print(f"Observed difference: {diff_observed}")
Observed difference: -0.12999999999999998
dist_permut = permutations(
                x1            = control,
                x2            = treatment,
                statistic_fun = diff_proportions,
                n_iterations  = 9999
              )
# Permutation distribution
# ==============================================================================
fig, ax = plt.subplots(ncols=1, nrows=1, figsize=(7,4))
ax.hist(dist_permut, bins=30, density=True, color='#3182bd', alpha=0.5)
ax.axvline(x=dist_permut.mean(), color='black', label='distribution mean')
ax.axvline(x=diff_observed, color='red', label='observed difference')
ax.axvline(x=-diff_observed, color='red')

ax.set_title('Permutation distribution')
ax.set_xlabel('proportion difference between groups')
ax.set_ylabel('density')
ax.legend();
pd.Series(dist_permut).describe()
count    9999.000000
mean       -0.000072
std         0.095380
min        -0.400000
25%        -0.085000
50%         0.005000
75%         0.050000
max         0.320000
dtype: float64

Finally, the probability (p-value) of obtaining differences equal to or more extreme than the observed one (red vertical lines) is calculated, with and without continuity correction. In this case, it is the number of permutations in which the calculated value is less than or equal to -0.13 or greater than or equal to 0.13, divided by the total number of permutations.

# Empirical p-value with and without correction
# ==============================================================================
p_value = (sum(np.abs(dist_permut) >= np.abs(diff_observed)))/len(dist_permut)
p_value_correc = (sum(np.abs(dist_permut) >= np.abs(diff_observed)) + 1)/(len(dist_permut) + 1)
print(f"p-value without correction: {p_value}")
print(f"p-value with correction: {p_value_correc}")
p-value without correction: 0.23402340234023403
p-value with correction: 0.2341
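For a 2x2 table such as this one, the exact permutation test is closely related to Fisher's exact test (both rest on the hypergeometric distribution of the table counts), so scipy.stats.fisher_exact can serve as a reference point for the simulated p-value. A minimal cross-check:

# Cross-check with Fisher's exact test
# ==============================================================================
from scipy import stats

table = [[11, 39],   # control:   alive, dead
         [14, 26]]   # treatment: alive, dead
result = stats.fisher_exact(table, alternative='two-sided')
print(f"p-value (Fisher's exact test): {result.pvalue:.6f}")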

Alternative with SciPy

For the case of proportions, SciPy also offers an efficient implementation of the permutation test.

# Permutation test with SciPy (proportions)
# ==============================================================================
def statistic_proportions(x, y, axis):
    return np.mean(x, axis=axis) - np.mean(y, axis=axis)

result_scipy = stats.permutation_test(
    data             = (control, treatment),
    statistic        = statistic_proportions,
    permutation_type = 'independent',
    vectorized       = True,
    n_resamples      = 9999,
    alternative      = 'two-sided',
    random_state     = 9863
)

print(f"p-value (SciPy): {result_scipy.pvalue:.6f}")
p-value (SciPy): 0.253200

Conclusion

Since the p-value is greater than the significance level $(\alpha = 0.05)$, there is not sufficient evidence to reject the null hypothesis that the proportion of survivors is equal in both groups. Therefore, it cannot be affirmed that the drug is effective.

Comparison Between Permutations and Bootstrapping

Bootstrapping is another strategy based on repeated sampling, widely used for hypothesis testing.

Both permutation tests and bootstrapping tests can be used to study differences between groups. There is a very extensive list of references that debate which of the two methods is most appropriate. In general, all of them agree that the most appropriate method depends on the objective of the inference, and in turn, the possible objectives are limited by the study design followed. The following table contains the different types of designs that can be used to compare two groups and the type of inference (conclusions) that can be made in each:

Random sampling   Random group assignment   Inference objective     Allows determining causality
Yes               No                        Population              No
No                Yes                       Sample                  Yes
Yes               Yes                       Population and Sample   Yes

The main difference between the two methods appears when they are used to calculate p-values. Significance tests (p-value calculation) are based on the null hypothesis that all observations come from the same population. The objective of the test is to determine whether the observed difference between groups is due to a specific factor (the treatment) or only to the variability expected from the nature of a random process.

When randomness is due to the assignment of subjects to different groups, permutation tests are used. The structure of an experiment that can be analyzed through permutation tests is:

  • Selection of study subjects.

  • Random assignment to the different groups.

  • Application of "treatment" and comparison of results.

Permutation tests answer the question: How much variability is expected in a given statistic due solely to the randomness of assignments, if all subjects truly come from the same population? When comparing the mean between two groups, the previous question is equivalent to: What difference between means can be expected depending on how subjects are distributed in the two groups, if they all come from the same population? (Even being all from the same population, since they will not be exactly identical, there will be small differences depending on how they are grouped).

Bootstrapping as a significance test is used when the randomness comes from the process of obtaining the samples rather than from group assignment. It answers the question: How much variability is expected in a given statistic due solely to random sampling, if all subjects truly come from the same population? Because of small differences between the individuals of a population, two random samples drawn from it will not be exactly equal when compared; moreover, the difference will vary from one pair of samples to the next. The structure of an experiment that can be analyzed through bootstrapping is (see the sketch after this list):

  • Two random samples are obtained from two populations.

  • They are compared.
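The following is a minimal sketch of bootstrapping used as a significance test, with simulated data (the samples, sizes, and seed are invented for the example). Under the null hypothesis that both samples come from the same population, the observations are pooled and resampled with replacement, preserving the original sample sizes:

# Bootstrapping as a significance test (sketch with simulated data)
# ==============================================================================
import numpy as np

rng = np.random.default_rng(123)
x1 = rng.normal(loc=50, scale=10, size=60)  # random sample from population 1
x2 = rng.normal(loc=52, scale=10, size=55)  # random sample from population 2

diff_observed = np.mean(x1) - np.mean(x2)

# Under H0 both samples come from the same population: pool the observations
# and resample WITH replacement, preserving the original sample sizes.
pooled = np.concatenate((x1, x2))
n_iterations = 9999
diffs = np.empty(n_iterations)
for i in range(n_iterations):
    resample_1 = rng.choice(pooled, size=len(x1), replace=True)
    resample_2 = rng.choice(pooled, size=len(x2), replace=True)
    diffs[i] = resample_1.mean() - resample_2.mean()

# Two-sided p-value with the Davison and Hinkley correction
p_value = (np.sum(np.abs(diffs) >= np.abs(diff_observed)) + 1) / (n_iterations + 1)
print(f"p-value: {p_value:.4f}")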

In conclusion, although both tests can be used to calculate p-values, their applications do not overlap. Permutation and randomization tests are used for experimental designs, while bootstrapping is used for sampling designs.

It is important to note that none of these methods is immune to the problems that small samples entail.

Session Information

import session_info
session_info.show(html=False)
-----
matplotlib          3.10.8
numpy               2.2.6
pandas              2.3.3
scipy               1.15.3
seaborn             0.13.2
session_info        v1.0.1
tqdm                4.67.1
-----
IPython             9.8.0
jupyter_client      8.7.0
jupyter_core        5.9.1
-----
Python 3.13.11 | packaged by Anaconda, Inc. | (main, Dec 10 2025, 21:28:48) [GCC 14.3.0]
Linux-6.14.0-37-generic-x86_64-with-glibc2.39
-----
Session information updated at 2025-12-26 16:57

Bibliography

Comparing Groups: Randomization and Bootstrap Methods Using R by Andrew S. Zieffler

Bootstrap Methods and Permutation Tests by Tim Hesterberg

An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani

Points of Significance: Sampling distributions and the bootstrap by Anthony Kulesa, Martin Krzywinski, Paul Blainey and Naomi Altman

Bootstrap Methods and their Application by A. C. Davison and D. V. Hinkley, Cambridge University Press (1997)

Citation Instructions

How to cite this document?

If you use this document or any part of it, we kindly ask you to cite it. Thank you very much!

Permutation Test for Hypothesis Testing with Python by Joaquín Amat Rodrigo, available under an Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0 DEED) license at https://cienciadedatos.net/documentos/pystats03-permutation-tests-python.html


This document created by Joaquín Amat Rodrigo is licensed under Attribution-NonCommercial-ShareAlike 4.0 International.

Permissions:

  • Share: copy and redistribute the material in any medium or format.

  • Adapt: remix, transform, and build upon the material.

Under the following terms:

  • Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

  • NonCommercial: You may not use the material for commercial purposes.

  • ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.