Introduction¶
Imagine you manage an e-commerce platform and want to sort your product catalog based on the quality perceived by users. You have a review system where customers can rate each product as "positive" or "negative". A first approach to ranking products is to calculate the percentage of positive reviews ($k/N$) and sort them from highest to lowest. However, this method has a major problem: it does not account for the statistical confidence provided by the sample size (number of reviews).
To illustrate this problem, consider the following three products with their respective reviews:
| Product | Positive reviews | Total reviews | % positive reviews |
|---|---|---|---|
| Bluetooth Headphones | 1 | 1 | 100% |
| Bestseller Novel | 4 | 5 | 80% |
| Gaming Laptop | 95 | 100 | 95% |
The simple percentage of positive reviews places the headphones with a single review ahead of a laptop with 95 positive reviews out of 100. Intuitively we know the laptop is a better product, but the simple percentage completely ignores the statistical confidence provided by the sample size.
This is a classic problem in statistics known as the problem of proportion estimation with small samples, and it appears in many real systems:
- Ranking forum posts or social media content by number of "likes" or "upvotes".
- Ranking restaurants in a city by number of positive reviews.
- Ranking athletes by their success rate (baskets, serves, penalties, etc.).
This document studies and compares these four statistical approaches:
- Bayesian Average (Empirical Bayes): Assumes every product is "average" until it proves otherwise with enough reviews.
- Beta-Binomial Model: The theoretical basis of the Bayesian average, which also allows computing 95% credibility intervals.
- Wilson Interval Lower Bound: Ranks by the reasonable worst-case performance of each product, not by its mean.
- Minimum Threshold + Simple Percentage: Excludes from the ranking products with fewer than $N_{min}$ reviews.
Libraries¶
The libraries used in this document are:
# Data processing
# ==============================================================================
import numpy as np
import pandas as pd
# Statistics
# ==============================================================================
from scipy.stats import norm, beta
# Visualization
# ==============================================================================
import matplotlib.pyplot as plt
Statistical Foundations¶
The Problem with Simple Proportions¶
When a customer leaves a positive or negative review, we are dealing with a Bernoulli trial: the outcome is binary (positive = 1, negative = 0) with an unknown probability $p$ of being positive. The total number of positive reviews $k$ out of $N$ total reviews follows a Binomial distribution:
$$k \sim \text{Binomial}(N,\, p)$$
The most natural estimate of $p$ is the maximum likelihood estimator:
$$\hat{p}_{simple} = \frac{k}{N}$$
This estimator is unbiased and consistent, but it has a critical problem when $N$ is small: its variance is very high.
$$\text{Var}(\hat{p}) = \frac{p(1-p)}{N}$$
With $N=1$, the variance is at its largest for any given $p$, and the estimator can only take the values 0 or 1, so it has no discriminative power.
✏️ Note
The problem is not the calculation itself, but the uncertainty. A product with 1 positive review out of 1 total review has a 100% rate, but that single data point does not tell us whether its true probability of a positive review is 50%, 80%, or 99%. We need more data to reduce that uncertainty.
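To make the variance formula concrete, the following minimal sketch computes the standard error $\sqrt{p(1-p)/N}$ for a few sample sizes. The true rate `p_true = 0.8` is an assumed value chosen only for illustration, and `standard_error` is a hypothetical helper, not part of the original analysis.

```python
import math


def standard_error(p: float, n: int) -> float:
    """Standard deviation of the estimator k/N when the true rate is p."""
    return math.sqrt(p * (1 - p) / n)


# Assumed true positive rate, used only to illustrate the effect of N
p_true = 0.8
for n in (1, 5, 100, 400):
    print(f"N={n:>3}: standard error = {standard_error(p_true, n):.3f}")
```

Under this assumption the standard error falls from 0.40 with a single review to 0.02 with 400 reviews, which is why the raw percentage is only informative once enough reviews have accumulated.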
Bayesian Average (Empirical Bayes)¶
The Empirical Bayes solution starts from an intuitive idea: every new product is, in principle, as good as the catalog average, until its reviews prove otherwise. The formula adds pseudo-counts to each product based on the global mean:
$$\hat{p}_{bayes} = \frac{k + C \cdot \mu}{N + C}$$
Where the parameters are:
| Parameter | Description |
|---|---|
| $k$ | Number of positive reviews for the product |
| $N$ | Total number of reviews for the product |
| $\mu$ | Overall proportion of positive reviews across the entire catalog (weighted global mean: $\sum k_i / \sum N_i$) |
| $C$ | Confidence constant: how much weight we give to the global average. Typically set to the mean of $N$ across the catalog |
Why does it work? The formula can be rewritten as a weighted average between the global mean and the product's observed rate:
$$\hat{p}_{bayes} = \underbrace{\frac{N}{N+C}}_{\text{weight of data}} \cdot \frac{k}{N} \;\;+\;\; \underbrace{\frac{C}{N+C}}_{\text{weight of prior}} \cdot \mu$$
When $N \ll C$, the prior weight dominates and the estimate is pulled toward $\mu$. When $N \gg C$, the data weight dominates and the estimate converges to $k/N$. The parameter $C$ controls exactly the number of reviews a product needs before the system "trusts" its own data over the catalog average.
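The equivalence between the pseudo-count form and the weighted-average form can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the values $\mu = 367/600$ and $C = 75$ are the ones used in this document's example catalog.

```python
def bayesian_average(k: int, n: int, mu: float, c: float) -> float:
    """Pseudo-count form: (k + C*mu) / (N + C)."""
    return (k + c * mu) / (n + c)


def weighted_form(k: int, n: int, mu: float, c: float) -> float:
    """Same estimate written as a weighted average of k/N and mu."""
    data_weight = n / (n + c)
    prior_weight = c / (n + c)
    return data_weight * (k / n) + prior_weight * mu


mu, c = 367 / 600, 75.0  # catalog values from this document's example
for k, n in [(1, 1), (95, 100), (200, 400)]:
    print(
        f"k={k}, N={n}: data weight={n / (n + c):.3f}, "
        f"estimate={bayesian_average(k, n, mu, c):.3f}"
    )
```

With these values the data weight is roughly 1% for a single review, 57% for 100 reviews, and 84% for 400 reviews.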
⚠️ When NOT to use the Bayesian Average?
This method assumes that the global mean μ is a meaningful reference for all products. If the catalog is very heterogeneous (for example, mixing textbooks with electronics and clothing), the global mean loses significance as a prior. In that case, it is preferable to compute μ and C by product category.
💡 Tip: choosing C
A practical and common choice is to use the average number of reviews per product across the entire catalog. This means a new product must accumulate at least as many reviews as the typical product before the system fully trusts its real data. Note that C is a hyperparameter: a larger C means more aggressive shrinkage toward the global mean, while a smaller C lets even products with few reviews rely on their own data.
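The per-category suggestion in the warning above could be sketched as follows. This is one possible reading, not the author's implementation: `category_priors` is a hypothetical helper, the rows are copied from the example catalog used in this document, and each category gets its own weighted mean and its own mean number of reviews as $C$.

```python
from collections import defaultdict

# Rows copied from this document's example catalog:
# (product, category, positive reviews, total reviews)
catalog = [
    ("Bluetooth Headphones", "Electronics", 1, 1),
    ("Gaming Laptop", "Electronics", 95, 100),
    ("Phone Case", "Accessories", 200, 400),
    ("Bestseller Novel", "Books", 4, 5),
    ("Mechanical Keyboard", "Electronics", 2, 5),
    ("Office Chair", "Furniture", 8, 9),
    ("Running Shoes", "Fashion", 45, 60),
    ("Power Bank", "Electronics", 12, 20),
]


def category_priors(rows):
    """Return {category: (mu, C)} computed within each category."""
    k_sum = defaultdict(int)
    n_sum = defaultdict(int)
    count = defaultdict(int)
    for _, category, k, n in rows:
        k_sum[category] += k
        n_sum[category] += n
        count[category] += 1
    return {
        cat: (k_sum[cat] / n_sum[cat], n_sum[cat] / count[cat])
        for cat in n_sum
    }


priors = category_priors(catalog)
for product, category, k, n in catalog:
    mu_cat, c_cat = priors[category]
    score = (k + c_cat * mu_cat) / (n + c_cat)
    print(f"{product:<22} {category:<12} {score:.3f}")
```

Note that a category containing a single product gives no shrinkage at all, because the prior mean equals that product's own rate. In a catalog this small, per-category priors are therefore only meaningful for Electronics.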
Sensitivity of estimates to C: The following table shows how the Bayesian Average estimate of two products changes as $C$ varies. Notice how a larger $C$ forces both products closer to the global mean $\mu$, regardless of their observed rate.
# Sensitivity of Bayesian Average to hyperparameter C
# ==============================================================================
mu_global = 367 / 600 # weighted global mean for this catalog
products_sensitivity = {
    'Bluetooth Headphones': {'k': 1, 'N': 1},
    'Gaming Laptop': {'k': 95, 'N': 100},
}
C_values = [5, 10, 25, 50, 75, 150, 300]
rows = []
for C_val in C_values:
    row = {'C': C_val}
    for name, d in products_sensitivity.items():
        row[name] = (d['k'] + C_val * mu_global) / (d['N'] + C_val)
    rows.append(row)
sensitivity_df = pd.DataFrame(rows).set_index('C')
sensitivity_df.loc['mu (prior)'] = {name: mu_global for name in products_sensitivity}
sensitivity_df.style.format('{:.3f}').set_caption(
'Bayesian Average by C value (last row = global mean μ used as prior)'
)
| C | Bluetooth Headphones | Gaming Laptop |
|---|---|---|
| 5 | 0.676 | 0.934 |
| 10 | 0.647 | 0.919 |
| 25 | 0.627 | 0.882 |
| 50 | 0.619 | 0.837 |
| 75 | 0.617 | 0.805 |
| 150 | 0.614 | 0.747 |
| 300 | 0.613 | 0.696 |
| mu (prior) | 0.612 | 0.612 |
Connection to the Beta-Binomial Model¶
The Bayesian Average formula is not arbitrary: it is exactly the mean of the posterior distribution of a Bayesian model with a Beta prior and Binomial likelihood.
If we assume a prior $p \sim \text{Beta}(\alpha_0, \beta_0)$ with:
$$\alpha_0 = C \cdot \mu \qquad \beta_0 = C \cdot (1 - \mu)$$
After observing $k$ positive reviews out of $N$ total, the posterior distribution is:
$$p \mid k,N \sim \text{Beta}(\alpha_0 + k,\; \beta_0 + N - k)$$
The mean of this posterior distribution is exactly:
$$\mathbb{E}[p \mid k, N] = \frac{\alpha_0 + k}{\alpha_0 + \beta_0 + N} = \frac{k + C \cdot \mu}{N + C}$$
This is the Bayesian Average formula. The additional advantage of knowing the full posterior distribution is that it allows computing 95% credibility intervals analytically using scipy.stats.beta.ppf, without the need for simulation.
✏️ Note: Why is PyMC not needed?
PyMC is a probabilistic programming library that uses MCMC simulation to approximate complex posterior distributions. For this problem, Beta-Binomial conjugacy gives us an exact analytical solution: the posterior distribution has a known closed form. No simulation is needed; it is enough to compute $\alpha_{post}$ and $\beta_{post}$ and evaluate the distribution function with scipy.stats.beta.
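As a quick check of the conjugate update, the sketch below builds the posterior parameters for the Gaming Laptop (95 positives out of 100) and confirms that the posterior mean matches the Bayesian Average. `beta_posterior` is a hypothetical helper name, and $\mu = 367/600$, $C = 75$ are the values from this document's example catalog.

```python
def beta_posterior(k: int, n: int, mu: float, c: float) -> tuple[float, float]:
    """Posterior Beta parameters for a Beta(C*mu, C*(1-mu)) prior."""
    alpha_0 = c * mu
    beta_0 = c * (1 - mu)
    return alpha_0 + k, beta_0 + (n - k)


mu, c = 367 / 600, 75.0  # catalog values from this document's example
alpha_post, beta_post = beta_posterior(95, 100, mu, c)

# Mean of a Beta(a, b) distribution is a / (a + b)
posterior_mean = alpha_post / (alpha_post + beta_post)
bayes_avg = (95 + c * mu) / (100 + c)

print(f"alpha_post={alpha_post:.3f}, beta_post={beta_post:.3f}")
print(f"posterior mean={posterior_mean:.3f}, Bayesian average={bayes_avg:.3f}")
```

Both quantities come out at 0.805, the value reported for the Gaming Laptop in the example.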
Wilson Interval Lower Bound¶
The Wilson interval is a confidence interval for proportions that remains valid even for small $N$, unlike the Wald interval (the classic normal approximation $\hat{p} \pm z\sqrt{\hat{p}(1-\hat{p})/N}$), which can produce values outside $[0, 1]$ and whose coverage degrades severely with few observations. Rather than ranking by the point estimate $k/N$, this method ranks by the lower bound of the interval: the most pessimistic value of $p$ that is still statistically compatible with the observed data at a given confidence level.
The lower bound of the Wilson interval at confidence level $1-\alpha$ is:
$$w^- = \frac{\hat{p} + \frac{z^2}{2N} - z\sqrt{\frac{\hat{p}(1-\hat{p})}{N} + \frac{z^2}{4N^2}}}{1 + \frac{z^2}{N}}$$
where $\hat{p} = k/N$ and $z = z_{1-\alpha/2}$ is the quantile of the standard normal distribution (e.g., $z = 1.96$ for a 95% confidence level).
Intuitive example:
- Bluetooth Headphones (1/1, 100%): high uncertainty → its lower bound is approximately 21%.
- Gaming Laptop (95/100, 95%): low uncertainty → its lower bound is approximately 89%.
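The two bounds above can be reproduced with a small scalar function. This is a minimal sketch using only the standard library (`statistics.NormalDist` supplies the normal quantile); `wilson_lower_bound` is a hypothetical name, and the function assumes at least one review.

```python
import math
from statistics import NormalDist


def wilson_lower_bound(k: int, n: int, confidence: float = 0.95) -> float:
    """Lower bound of the Wilson score interval for k successes in n trials."""
    if n <= 0:
        raise ValueError("n must be at least 1")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = k / n
    centre = p_hat + z**2 / (2 * n)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return (centre - margin) / (1 + z**2 / n)


print(f"Headphones 1/1:  {wilson_lower_bound(1, 1):.3f}")
print(f"Laptop 95/100:   {wilson_lower_bound(95, 100):.3f}")
```

The confidence level is a parameter: a higher level widens the interval and lowers the bound, which makes the ranking more conservative.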
✏️ Note
It is a conservative approach: to move up the ranking, a product must prove it is good by accumulating reviews, not just get lucky with the first ones.
✏️ Note: Why not the Wald interval?
The classic Wald interval $\hat{p} \pm z\sqrt{\hat{p}(1-\hat{p})/N}$ has two failure modes with small samples: (1) it can produce bounds outside $[0, 1]$ (e.g., a product with 1 positive out of 3 reviews gets a lower bound of about −0.20), and it collapses to zero width when $\hat{p}$ is 0 or 1; and (2) its actual coverage probability can be far below the nominal level (e.g., a nominal 95% interval may achieve only 85% actual coverage). Brown, Cai & DasGupta (2001) showed that the Wilson interval has substantially better coverage across a wide range of $p$ and $N$. This is a common reason to prefer Wilson over the simpler Wald formula.
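The failure modes of the Wald interval are easy to reproduce. The following sketch is illustrative only (`wald_interval` is a hypothetical helper): it evaluates the normal-approximation formula for three small samples.

```python
import math
from statistics import NormalDist


def wald_interval(k: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Classic normal-approximation (Wald) interval for a proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = k / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


for k, n in [(1, 3), (8, 9), (1, 1)]:
    lower, upper = wald_interval(k, n)
    print(f"{k}/{n}: [{lower:.3f}, {upper:.3f}]")
```

With 1 positive out of 3 the lower bound is negative, with 8 out of 9 the upper bound exceeds 1, and with 1 out of 1 the interval has zero width, which wrongly suggests complete certainty.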
Minimum Threshold + Simple Percentage¶
The simplest approach consists of:
- Define a threshold $N_{min}$ (for example, 10 reviews).
- Exclude from the ranking all products with $N < N_{min}$. They are shown as "not enough reviews".
- For products that meet the threshold, sort by $\hat{p}_{simple} = k/N$.
This approach is easy to implement and communicate to users. Its main limitation is the hard boundary: a product with 9 reviews is treated identically to one with 0 reviews, while a product with 10 mediocre reviews enters the ranking ahead of a product with 9 excellent ones. The choice of $N_{min}$ is also arbitrary — common values in production systems range from 5 to 50 depending on the category — and it typically requires business context to justify.
The Bayesian and Wilson approaches avoid this binary exclusion by applying a soft penalty: all products appear in the ranking, but those with few reviews are automatically pushed down by the conservative estimate.
Example¶
Data¶
A catalog of 8 products is created to illustrate the behavior of each method.
# Data
# ==============================================================================
data = {
    'product': [
        'Bluetooth Headphones',  # 1/1: new and perfect
        'Gaming Laptop',         # 95/100: excellent veteran
        'Phone Case',            # 200/400: mediocre with long history
        'Bestseller Novel',      # 4/5: few but very good reviews
        'Mechanical Keyboard',   # 2/5: few mixed reviews
        'Office Chair',          # 8/9: just below the 10-review threshold
        'Running Shoes',         # 45/60: good with mid-length history
        'Power Bank',            # 12/20: mediocre with sufficient history
    ],
    'category': [
        'Electronics', 'Electronics', 'Accessories', 'Books',
        'Electronics', 'Furniture', 'Fashion', 'Electronics',
    ],
    'positive_reviews': [1, 95, 200, 4, 2, 8, 45, 12],
    'total_reviews': [1, 100, 400, 5, 5, 9, 60, 20],
}
data = pd.DataFrame(data)
data
| | product | category | positive_reviews | total_reviews |
|---|---|---|---|---|
| 0 | Bluetooth Headphones | Electronics | 1 | 1 |
| 1 | Gaming Laptop | Electronics | 95 | 100 |
| 2 | Phone Case | Accessories | 200 | 400 |
| 3 | Bestseller Novel | Books | 4 | 5 |
| 4 | Mechanical Keyboard | Electronics | 2 | 5 |
| 5 | Office Chair | Furniture | 8 | 9 |
| 6 | Running Shoes | Fashion | 45 | 60 |
| 7 | Power Bank | Electronics | 12 | 20 |
Simple Percentage¶
# Simple percentage
# ==============================================================================
data['pct_simple'] = data['positive_reviews'] / data['total_reviews']
ranking_simple = (
data[['product', 'positive_reviews', 'total_reviews', 'pct_simple']]
.sort_values('pct_simple', ascending=False)
.reset_index(drop=True)
)
ranking_simple
| | product | positive_reviews | total_reviews | pct_simple |
|---|---|---|---|---|
| 0 | Bluetooth Headphones | 1 | 1 | 1.000000 |
| 1 | Gaming Laptop | 95 | 100 | 0.950000 |
| 2 | Office Chair | 8 | 9 | 0.888889 |
| 3 | Bestseller Novel | 4 | 5 | 0.800000 |
| 4 | Running Shoes | 45 | 60 | 0.750000 |
| 5 | Power Bank | 12 | 20 | 0.600000 |
| 6 | Phone Case | 200 | 400 | 0.500000 |
| 7 | Mechanical Keyboard | 2 | 5 | 0.400000 |
With the simple percentage, the Bluetooth Headphones (1/1) rank first with 100%, ahead of the Gaming Laptop (95/100). The Office Chair (8/9, ~89%) also ranks near the top, in third place, despite having very few reviews.
Bayesian Average¶
The first step is to compute the global catalog parameters:
- $\mu$: the average percentage of positive reviews across the entire catalog.
- $C$: the confidence constant, chosen as the mean number of reviews per product.
# Global catalog parameters
# ==============================================================================
mu = data['positive_reviews'].sum() / data['total_reviews'].sum()
C = data['total_reviews'].mean()
print(f'Global average (mu): {mu}')
print(f'Confidence constant (C): {C}')
Global average (mu): 0.6116666666666667
Confidence constant (C): 75.0
✏️ Note: weighted mean vs. arithmetic mean
μ is computed as the weighted global mean $\sum k_i / \sum N_i$, not as the arithmetic mean of individual proportions $\text{mean}(k_i / N_i)$. These are numerically different. The weighted version gives more influence to products with more reviews, which is appropriate: it reflects the true proportion of positive reviews across all user interactions in the catalog. The arithmetic mean would give a product with 1 review the same influence as a product with 10,000 reviews, distorting the prior.
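The difference between the two means can be seen directly on the example catalog. The sketch below uses plain Python lists holding the same counts as the `data` table; the variable names are illustrative.

```python
# Review counts from the example catalog, in the same product order
positive = [1, 95, 200, 4, 2, 8, 45, 12]
total = [1, 100, 400, 5, 5, 9, 60, 20]

# Weighted global mean: all reviews pooled together
weighted_mean = sum(positive) / sum(total)

# Arithmetic mean of per-product proportions: every product counts equally
arithmetic_mean = sum(k / n for k, n in zip(positive, total)) / len(total)

print(f"Weighted mean:   {weighted_mean:.4f}")
print(f"Arithmetic mean: {arithmetic_mean:.4f}")
```

For this catalog the weighted mean is about 0.61 while the arithmetic mean is about 0.74, because the small samples with high rates (1/1, 8/9, 4/5) pull the unweighted average upward.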
The next step is to apply the Bayesian Average formula to each product.
# Bayesian Average
# ==============================================================================
k = data['positive_reviews']
N = data['total_reviews']
data['pct_bayesian'] = (k + C * mu) / (N + C)
ranking_bayesian = (
data[['product', 'positive_reviews', 'total_reviews', 'pct_simple', 'pct_bayesian']]
.sort_values('pct_bayesian', ascending=False)
.reset_index(drop=True)
)
ranking_bayesian
| | product | positive_reviews | total_reviews | pct_simple | pct_bayesian |
|---|---|---|---|---|---|
| 0 | Gaming Laptop | 95 | 100 | 0.950000 | 0.805000 |
| 1 | Running Shoes | 45 | 60 | 0.750000 | 0.673148 |
| 2 | Office Chair | 8 | 9 | 0.888889 | 0.641369 |
| 3 | Bestseller Novel | 4 | 5 | 0.800000 | 0.623437 |
| 4 | Bluetooth Headphones | 1 | 1 | 1.000000 | 0.616776 |
| 5 | Power Bank | 12 | 20 | 0.600000 | 0.609211 |
| 6 | Mechanical Keyboard | 2 | 5 | 0.400000 | 0.598437 |
| 7 | Phone Case | 200 | 400 | 0.500000 | 0.517632 |
The Bayesian Average corrects the ranking as expected. The Gaming Laptop rises to first place: with 100 reviews ($N > C$), its observed rate (95%) carries about 57% of the weight, enough to offset much of the pull of the prior. The Bluetooth Headphones drop from first to fifth: with only 1 review ($N \ll C$), their estimate is almost entirely determined by the prior $\mu$, not by the single positive review. This is the shrinkage effect of Bayesian estimation. The Phone Case illustrates the opposite regime: with 400 reviews ($N \gg C$), the data carry about 84% of the weight, so the estimate (~52%) stays close to its raw proportion (50%), which is well below the catalog average.
Beta-Binomial Model: 95% Credibility Interval¶
The Beta-Binomial model allows us to compute 95% credibility intervals for each product. We sort the ranking by the posterior mean (which coincides with the Bayesian Average), but also display the full interval.
# Beta prior parameters
# ==============================================================================
alpha_0 = C * mu
beta_0 = C * (1 - mu)
# Posterior distribution parameters per product
# ==============================================================================
alpha_post = alpha_0 + k
beta_post = beta_0 + (N - k)
# 95% credibility interval
# ==============================================================================
data['cred_95_lower'] = beta.ppf(0.025, alpha_post, beta_post)
data['cred_95_upper'] = beta.ppf(0.975, alpha_post, beta_post)
cols_beta = ['product', 'total_reviews', 'pct_simple', 'pct_bayesian', 'cred_95_lower', 'cred_95_upper']
(
data[cols_beta]
.sort_values('pct_bayesian', ascending=False)
.reset_index(drop=True)
)
| | product | total_reviews | pct_simple | pct_bayesian | cred_95_lower | cred_95_upper |
|---|---|---|---|---|---|---|
| 0 | Gaming Laptop | 100 | 0.950000 | 0.805000 | 0.743307 | 0.860095 |
| 1 | Running Shoes | 60 | 0.750000 | 0.673148 | 0.592037 | 0.749399 |
| 2 | Office Chair | 9 | 0.888889 | 0.641369 | 0.536522 | 0.739839 |
| 3 | Bestseller Novel | 5 | 0.800000 | 0.623437 | 0.515312 | 0.725716 |
| 4 | Bluetooth Headphones | 1 | 1.000000 | 0.616776 | 0.505611 | 0.722118 |
| 5 | Power Bank | 20 | 0.600000 | 0.609211 | 0.509667 | 0.704397 |
| 6 | Mechanical Keyboard | 5 | 0.400000 | 0.598437 | 0.489657 | 0.702554 |
| 7 | Phone Case | 400 | 0.500000 | 0.517632 | 0.472692 | 0.562431 |
The credibility interval reveals the true uncertainty behind each point estimate:
- Bluetooth Headphones (1 review): the interval ranges from ~51% to ~72%. It is the widest in the catalog. With $C \approx 75$ virtual pseudo-reviews and only 1 real observation, the prior has roughly 75× more weight than the data — the posterior is almost entirely determined by the prior, which explains why the interval, while wide, is centered well above 50%.
- Gaming Laptop (100 reviews): the interval is narrow (~74% to ~86%). With $N > C$, the 100 real reviews outweigh the prior and the system trusts its observed data.
- Phone Case (400 reviews): the interval is the narrowest of all (~47% to ~56%). With $N \gg C$, the posterior is dominated almost entirely by the 400 observed reviews, and the system knows with high precision that this is a mediocre product.
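This is the key insight of Bayesian shrinkage: the ratio $N/C$ determines how far the estimate is pulled toward $\mu$, while the width of the credibility interval depends mainly on the effective sample size $N + C$. Because the prior contributes $C = 75$ pseudo-reviews, even a product with a single review gets an interval about as narrow as one based on 76 observations. These intervals therefore describe uncertainty given the prior, and they mostly reflect the product's own data only once $N \gg C$.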
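To see how $N$ and $C$ jointly affect the spread of the posterior, the sketch below computes the closed-form standard deviation of the Beta posterior, $\sqrt{\alpha\beta / ((\alpha+\beta)^2(\alpha+\beta+1))}$, where $\alpha + \beta = N + C$. `posterior_sd` is a hypothetical helper, and the alternative value $C = 2$ is an assumption added purely for contrast.

```python
import math


def posterior_sd(k: int, n: int, mu: float, c: float) -> float:
    """Standard deviation of the Beta posterior with prior Beta(C*mu, C*(1-mu))."""
    a = c * mu + k
    b = c * (1 - mu) + (n - k)
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))


mu = 367 / 600  # global mean from this document's example
products = {
    "Bluetooth Headphones": (1, 1),
    "Gaming Laptop": (95, 100),
    "Phone Case": (200, 400),
}
for c in (75.0, 2.0):  # 75 is the document's C; 2 is an assumed weak prior
    for name, (k, n) in products.items():
        print(f"C={c:>4}: {name:<22} sd={posterior_sd(k, n, mu, c):.3f}")
```

With $C = 75$ the single-review product has a standard deviation of roughly 0.055, whereas with the weak prior it is roughly four times larger. This suggests that much of the apparent precision for sparsely reviewed products comes from the prior rather than from their own reviews.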
Wilson Interval Lower Bound¶
The Wilson interval lower bound ranks products by their reasonable worst-case performance.
# Wilson Interval Lower Bound
# ==============================================================================
confidence = 0.95 # confidence level
z = norm.ppf(1 - (1 - confidence) / 2) # z = 1.96 for 95%
p_hat = data['pct_simple']
data['wilson_lower'] = (
(p_hat + z**2 / (2 * N) - z * np.sqrt(p_hat * (1 - p_hat) / N + z**2 / (4 * N**2)))
/ (1 + z**2 / N)
)
ranking_wilson = (
data[['product', 'positive_reviews', 'total_reviews', 'pct_simple', 'wilson_lower']]
.sort_values('wilson_lower', ascending=False)
.reset_index(drop=True)
)
ranking_wilson
| | product | positive_reviews | total_reviews | pct_simple | wilson_lower |
|---|---|---|---|---|---|
| 0 | Gaming Laptop | 95 | 100 | 0.950000 | 0.888250 |
| 1 | Running Shoes | 45 | 60 | 0.750000 | 0.627679 |
| 2 | Office Chair | 8 | 9 | 0.888889 | 0.565000 |
| 3 | Phone Case | 200 | 400 | 0.500000 | 0.451235 |
| 4 | Power Bank | 12 | 20 | 0.600000 | 0.386582 |
| 5 | Bestseller Novel | 4 | 5 | 0.800000 | 0.375535 |
| 6 | Bluetooth Headphones | 1 | 1 | 1.000000 | 0.206549 |
| 7 | Mechanical Keyboard | 2 | 5 | 0.400000 | 0.117621 |
The Wilson Interval is the most conservative method. It penalizes uncertainty heavily: the Bluetooth Headphones (1/1) drop nearly to the bottom of the ranking because their lower bound is very low (~21%). The Office Chair (8/9) keeps third place, but its score falls from ~89% to ~57%, and the Bestseller Novel (4/5) slips from fourth to sixth.
⚠️ Edge case: products with zero positive reviews (p̂ = 0)
When $\hat{p} = 0$, the term $\hat{p}(1-\hat{p})$ vanishes and the numerator of the Wilson formula becomes $\frac{z^2}{2N} - z\sqrt{\frac{z^2}{4N^2}} = 0$, so $w^- = 0$ exactly. The lower bound is well-defined in this case. When $\hat{p} = 1$ (all reviews positive), the formula is also well-defined: $w^- = N / (N + z^2)$, and the upper bound $w^+$ equals exactly 1. In production systems, it is common to apply a clamp $\max(0, w^-)$ to absorb floating-point rounding, and to handle products with no reviews at all ($N = 0$) separately, since the formula is undefined there.
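A defensive version of the lower bound that covers these edge cases might look like the sketch below. It is an illustration under stated assumptions rather than a reference implementation: `safe_wilson_lower` is a hypothetical name, and returning 0.0 for products without reviews is a design choice that places them at the bottom of the ranking.

```python
import math
from statistics import NormalDist


def safe_wilson_lower(k: int, n: int, confidence: float = 0.95) -> float:
    """Wilson lower bound with guards for N = 0 and floating-point rounding."""
    if k < 0 or k > n:
        raise ValueError("k must satisfy 0 <= k <= n")
    if n == 0:
        return 0.0  # assumption: unreviewed products get the lowest score
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = k / n
    centre = p_hat + z**2 / (2 * n)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    lower = (centre - margin) / (1 + z**2 / n)
    # Clamp to [0, 1] to absorb tiny rounding errors
    return min(1.0, max(0.0, lower))


for k, n in [(0, 0), (0, 3), (5, 5), (95, 100)]:
    print(f"{k}/{n}: {safe_wilson_lower(k, n):.3f}")
```

An alternative for $N = 0$ would be to return a missing value and show the product as "not enough reviews", which mirrors the minimum threshold approach.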
Minimum Threshold¶
The minimum threshold method excludes products with fewer than 10 reviews.
# Minimum review threshold
# ==============================================================================
N_min = 10
print(f'Products with N >= {N_min} reviews (included in the ranking):')
ranking_threshold = (
data.loc[data['total_reviews'] >= N_min, ['product', 'positive_reviews', 'total_reviews', 'pct_simple']]
.sort_values('pct_simple', ascending=False)
.reset_index(drop=True)
)
display(ranking_threshold)
print(f'\nProducts excluded from the ranking (N < {N_min}):')
excluded = data.loc[data['total_reviews'] < N_min, ['product', 'positive_reviews', 'total_reviews']]
excluded
Products with N >= 10 reviews (included in the ranking):
| | product | positive_reviews | total_reviews | pct_simple |
|---|---|---|---|---|
| 0 | Gaming Laptop | 95 | 100 | 0.95 |
| 1 | Running Shoes | 45 | 60 | 0.75 |
| 2 | Power Bank | 12 | 20 | 0.60 |
| 3 | Phone Case | 200 | 400 | 0.50 |
Products excluded from the ranking (N < 10):
| | product | positive_reviews | total_reviews |
|---|---|---|---|
| 0 | Bluetooth Headphones | 1 | 1 |
| 3 | Bestseller Novel | 4 | 5 |
| 4 | Mechanical Keyboard | 2 | 5 |
| 5 | Office Chair | 8 | 9 |
With $N_{min} = 10$, the Bluetooth Headphones, Bestseller Novel, Mechanical Keyboard, and Office Chair are excluded from the ranking. The latter, with 8 positive reviews out of 9 (89%), does not appear in the ranking despite having the third-highest simple percentage.
Ranking Comparison¶
The following table and bump chart show the position (rank) of each product under each of the four methods. Products with the same position across all methods are stable; products whose rank changes significantly are those most affected by uncertainty in their reviews. Products excluded by the minimum threshold appear as NaN.
# Ranking comparison across methods
# ==============================================================================
rank_df = data[['product']].copy()
rank_df['Simple %'] = (
data['pct_simple']
.rank(ascending=False, method='min')
.astype(int)
)
rank_df['Bayesian Avg'] = (
data['pct_bayesian']
.rank(ascending=False, method='min')
.astype(int)
)
rank_df['Wilson LB'] = (
data['wilson_lower']
.rank(ascending=False, method='min')
.astype(int)
)
included_mask = data['total_reviews'] >= N_min
rank_df['Min Threshold'] = (
data['pct_simple']
.where(included_mask)
.rank(ascending=False, method='min')
)
rank_df = rank_df.set_index('product')
display(rank_df)
# Bump chart
fig, ax = plt.subplots(figsize=(8, 5))
for product, ranks in rank_df.iterrows():
    ax.plot(
        rank_df.columns,
        ranks,
        marker='o',
        linewidth=2,
        label=product
    )
ax.invert_yaxis()
ax.set_ylabel('Rank')
ax.set_title('Ranking comparison across methods')
ax.grid(axis='y', linestyle='--', alpha=0.3)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.legend(
    title='Product',
    bbox_to_anchor=(1.02, 1),
    loc='upper left'
)
plt.tight_layout()
plt.show()
| product | Simple % | Bayesian Avg | Wilson LB | Min Threshold |
|---|---|---|---|---|
| Bluetooth Headphones | 1 | 5 | 7 | NaN |
| Gaming Laptop | 2 | 1 | 1 | 1.0 |
| Phone Case | 7 | 8 | 4 | 4.0 |
| Bestseller Novel | 4 | 4 | 6 | NaN |
| Mechanical Keyboard | 8 | 7 | 8 | NaN |
| Office Chair | 3 | 3 | 3 | NaN |
| Running Shoes | 5 | 2 | 2 | 2.0 |
| Power Bank | 6 | 6 | 5 | 3.0 |
Conclusions¶
The methods studied solve in different ways the same fundamental problem: the instability of the simple percentage with small samples.
| Method | Complexity | When to use it | Advantage | Disadvantage |
|---|---|---|---|---|
| Simple % | Very low | Never as a final ranking | Intuitive and easy to communicate | Overvalues products with few reviews |
| Bayesian Average | Low | Unified ranking where all products appear from day one | Fair to new and established products; easy to interpret | Requires the global mean $\mu$ to be representative of the catalog |
| Beta-Binomial | Low | Same as Bayesian, plus credibility interval | Provides full uncertainty; no MCMC required | Slightly more complex to implement |
| Wilson Interval | Low-medium | Highly competitive environments where consistency is rewarded | Very conservative; avoids the "lucky start" effect | Penalizes new products heavily even if they show a good early signal |
| Minimum Threshold | Very low | Simple systems where temporary exclusion is acceptable | Easy to implement and communicate to users | Completely excludes new products from the ranking |
Recommendation by use case:
- Use the Bayesian Average if you want a unified, fair ranking where all products (new and established) appear from day one without penalizing anyone too harshly. It is a common choice on e-commerce platforms.
- Add the Beta-Binomial credibility interval if you need to communicate uncertainty to the user (for example, showing a confidence bar alongside the star rating).
- Use the Wilson Interval if the context is highly competitive and you want to be strict, such as in comment or app review rankings, where top results have an enormous impact.
- Use the minimum threshold only if implementation simplicity is a requirement and temporary exclusion of new products is acceptable for your business.
Session Information¶
import session_info
session_info.show(html=False)
-----
matplotlib          3.10.8
numpy               1.26.4
pandas              2.2.3
plotly              6.7.0
scipy               1.11.4
session_info        v1.0.1
-----
IPython             9.10.1
jupyter_client      8.8.0
jupyter_core        5.9.1
-----
Python 3.11.15 (main, Mar 11 2026, 17:20:07) [GCC 14.3.0]
Linux-6.17.0-29-generic-x86_64-with-glibc2.39
-----
Session information updated at 2026-06-01 14:10
Citation Instructions¶
How to cite this document?
If you use this document or any part of it, we appreciate you citing it. Thank you!
Statistical Product Ranking: Empirical Bayes, Wilson Interval, and Minimum Threshold by Joaquín Amat Rodrigo, available under an Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0 DEED) license at https://www.cienciadedatos.net/documentos/pystats13-statistical-ranking-python.html
This document created by Joaquín Amat Rodrigo is licensed under Attribution-NonCommercial-ShareAlike 4.0 International.
Permitted:
- Share: copy and redistribute the material in any medium or format.
- Adapt: remix, transform, and build upon the material.
Under the following terms:
- Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- NonCommercial: You may not use the material for commercial purposes.
- ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
