A Complete Guide to Statistical Tests and Methods for Clinical Researchers
Enriched Edition — With R & Python Code and Plot Generation
Audience: Researchers and clinicians applying statistical methods to medical and health data
Purpose: A thorough reference covering test selection, assumptions, worked examples, interpretation, and reporting — with side-by-side R and Python code and reproducible visualisations for every method
Code environment: R ≥ 4.2 with tidyverse/ggplot2; Python ≥ 3.9 with pandas/scipy/matplotlib/seaborn/statsmodels
How to use this guide: Each section follows: What it is → When to use it → Assumptions → Worked example → R code + plot → Python code + plot → Interpretation → Reporting → Common mistakes
Required Packages
R — install once:
install.packages(c(
"tidyverse", "ggpubr", "rstatix", "broom", "coin",
"survival", "survminer", "lme4", "lmerTest",
"pROC", "irr", "meta", "vcd", "ggfortify",
"factoextra", "psych", "MatchIt", "WeightIt"
))
Python — install once:
pip install pandas numpy scipy matplotlib seaborn statsmodels \
lifelines scikit-learn pingouin scikit-posthocs
Table of Contents
- Foundations: The Statistical Reasoning Framework
- Choosing the Right Test
- Descriptive Statistics and Data Exploration
- One-Variable Tests
- Comparing Two Groups
- Comparing Three or More Groups
- Correlation and Association
- Regression Analysis
- Effect Sizes and Association Measures
- Survival and Time-to-Event Analysis
- Multivariable Modelling Strategy
- Multivariate Methods
- Mixed Models and Longitudinal Data
- Diagnostic Test Evaluation
- Agreement and Reliability
- Bayesian Methods
- Meta-Analysis and Systematic Review
- Reporting Standards and Checklists
- Appendix: Quick Reference Tables
1. Foundations: The Statistical Reasoning Framework
1.1 What Is a Statistical Test?
A statistical test is a formal procedure for deciding whether observed data are consistent with a stated hypothesis:
- Null hypothesis (H₀): The assumption of no effect, no difference, or no association
- Alternative hypothesis (H₁): The effect or difference you are trying to detect
- Test statistic: A number calculated from data summarising evidence against H₀
- P-value: Probability of observing a test statistic at least this extreme, if H₀ were true
What a p-value is NOT: Not the probability that H₀ is true, and not the probability the result is due to chance.
1.2 Type I and Type II Errors
| | H₀ is actually TRUE | H₀ is actually FALSE |
|---|---|---|
| Test says: reject H₀ | Type I error (α) — false positive | Correct — power (1−β) |
| Test says: fail to reject H₀ | Correct (1−α) | Type II error (β) — false negative |
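Both error rates can be made concrete with a quick simulation; a minimal sketch under illustrative assumptions (n = 30 per group, a true effect of 0.8 SD):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, sims, alpha = 30, 5000, 0.05

# Type I error: both groups drawn from the same population (H0 true)
p_null = np.array([stats.ttest_ind(rng.normal(0, 1, n), rng.normal(0, 1, n)).pvalue
                   for _ in range(sims)])
type1 = (p_null < alpha).mean()

# Power: true mean difference of 0.8 SD (H0 false)
p_alt = np.array([stats.ttest_ind(rng.normal(0, 1, n), rng.normal(0.8, 1, n)).pvalue
                  for _ in range(sims)])
power = (p_alt < alpha).mean()

print(f"Empirical Type I error: {type1:.3f} (nominal {alpha})")
print(f"Empirical power (d = 0.8, n = 30/group): {power:.3f}")
```

The empirical Type I error rate should sit close to the nominal α; the empirical power depends on the assumed effect size and n.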
1.3 One-Tailed vs Two-Tailed Tests
- Two-tailed: H₁: μ₁ ≠ μ₂ — tests for difference in either direction. Default for most clinical research.
- One-tailed: H₁: μ₁ > μ₂ — use only when pre-specified and you would not act on a result in the other direction.
1.4 Confidence Intervals vs P-Values
- 95% CI represents the range of values consistent with your data at the 5% significance level
- If CI for a difference excludes zero (or for a ratio excludes 1.0), the result is statistically significant at α=0.05
- CI communicates both statistical significance AND magnitude and precision — report both
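The CI–p-value correspondence can be checked numerically from summary statistics; a sketch with illustrative values (a mean difference, its standard error, and degrees of freedom):

```python
from scipy import stats

# Illustrative summary statistics (mean difference, its SE, and df)
mean_diff, se_diff, df = 6.6, 1.67, 88

t_stat = mean_diff / se_diff
p_two_sided = 2 * stats.t.sf(abs(t_stat), df)
t_crit = stats.t.ppf(0.975, df)                     # critical value for a 95% CI
ci = (mean_diff - t_crit * se_diff, mean_diff + t_crit * se_diff)

# The 95% CI excludes 0 exactly when the two-sided p-value falls below 0.05
print(f"t = {t_stat:.2f}, p = {p_two_sided:.4f}")
print(f"95% CI: {ci[0]:.2f} to {ci[1]:.2f}")
```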
1.5 Sample Size and Power Calculations
The four inputs: α, power (1−β), effect size (MCID), variability (SD). Specify three, solve for the fourth.
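The normal-approximation formula n = 2(z_α/2 + z_β)²σ²/δ² can be evaluated directly; a sketch (the exact t-based calculation, as in R's power.t.test, gives a marginally larger n):

```python
import math
from scipy import stats

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for comparing two means."""
    z_a = stats.norm.ppf(1 - alpha / 2)   # 1.96 at alpha = 0.05
    z_b = stats.norm.ppf(power)           # 0.84 at 80% power
    return math.ceil(2 * (z_a + z_b) ** 2 * sd ** 2 / delta ** 2)

# MCID = 10 mmHg with SD = 20 mmHg (the values used in the power-curve example)
print(n_per_group(delta=10, sd=20))
```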
n per group = 2 × (z_α/2 + z_β)² × σ² / δ²
R — Power Curves
library(tidyverse)
sd_val <- 20; alpha <- 0.05
expand_grid(delta = c(5, 8, 10, 15), n = seq(10, 150, by = 5)) |>
mutate(
power = pmap_dbl(list(delta, n), function(d, n) {
power.t.test(n = n, delta = d, sd = sd_val, sig.level = alpha,
type = "two.sample")$power
}),
label = paste0("MCID = ", delta, " mmHg")
) |>
ggplot(aes(x = n, y = power, colour = label)) +
geom_line(linewidth = 1.1) +
geom_hline(yintercept = 0.80, linetype = "dashed", colour = "grey40") +
annotate("text", x = 12, y = 0.82, label = "80% power", hjust = 0, size = 3.5) +
scale_y_continuous(labels = scales::percent_format(), limits = c(0, 1)) +
scale_colour_brewer(palette = "Set1") +
labs(title = "Power Curves: Two-Sample t-Test (SD = 20 mmHg)",
x = "Sample size per group", y = "Statistical power", colour = NULL) +
theme_bw(base_size = 13) + theme(legend.position = "bottom")
Python — Power Curves
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
sd_val = 20; alpha = 0.05
n_vals = np.arange(10, 151, 5)
mcids = [5, 8, 10, 15]
colours = ['#e41a1c','#377eb8','#4daf4a','#984ea3']
fig, ax = plt.subplots(figsize=(9, 5))
for mcid, col in zip(mcids, colours):
    se = sd_val * np.sqrt(2 / n_vals)
    ncp = mcid / se
    z_crit = stats.norm.ppf(1 - alpha / 2)
    power = 1 - stats.norm.cdf(z_crit - ncp) + stats.norm.cdf(-z_crit - ncp)
    ax.plot(n_vals, power, label=f"MCID = {mcid} mmHg", color=col, linewidth=2)
ax.axhline(0.80, linestyle='--', color='grey', linewidth=1.2)
ax.text(12, 0.82, "80% power", fontsize=10, color='grey')
ax.set_xlabel("Sample size per group"); ax.set_ylabel("Statistical power")
ax.set_title("Power Curves: Two-Sample t-Test (SD = 20 mmHg)")
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda y, _: f'{y:.0%}'))
ax.legend(loc='lower right'); ax.grid(alpha=0.3)
plt.tight_layout(); plt.savefig("power_curves.png", dpi=150, bbox_inches='tight')
2. Choosing the Right Test: A Decision Framework
The Five Key Questions
Q1. What is your research question? (describe / test / compare / examine / predict)
Q2. How many variables? (1 / 2 / 2+ with one outcome / 2+ outcomes)
Q3. What type is each variable? (continuous / ordinal / nominal / binary / time-to-event)
Q4. Are samples independent or paired/related?
Q5. Are parametric assumptions met? (normality, sample size)
Decision Table
| Outcome | Predictor/groups | Sample | Test |
|---|---|---|---|
| Continuous | None (vs known value) | — | One-sample t / Wilcoxon |
| Continuous | 2 groups | Independent | Student’s t / Welch’s t / Mann-Whitney U |
| Continuous | 2 groups | Paired | Paired t / Wilcoxon signed-rank |
| Continuous | 3+ groups | Independent | One-way ANOVA / Kruskal-Wallis |
| Continuous | 3+ groups | Repeated | RM-ANOVA / Friedman |
| Continuous | Continuous predictor(s) | — | Linear regression |
| Binary | 2+ groups | Independent | Chi-square / Fisher’s exact |
| Binary | 2 groups | Paired | McNemar’s test |
| Binary | Mixed predictors | — | Logistic regression |
| Time-to-event | 2+ groups | Independent | Kaplan-Meier + log-rank |
| Time-to-event | Mixed predictors | — | Cox regression |
| Count | Groups | — | Poisson / negative binomial regression |
| Ordinal | 2+ groups | Independent | Mann-Whitney / Kruskal-Wallis |
| Multiple continuous | Groups | — | MANOVA |
3. Descriptive Statistics and Data Exploration
3.1 Measures of Central Tendency and Spread
| Data type | Measure | Spread |
|---|---|---|
| Continuous, normal | Mean | SD |
| Continuous, skewed | Median | IQR |
| Ordinal | Median | IQR |
| Nominal | Frequency (n, %) | — |
Never report SEM as a measure of population variability — it describes precision of the mean, not spread of individual values.
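The distinction is easy to demonstrate; a sketch with illustrative age data:

```python
import numpy as np

rng = np.random.default_rng(7)
values = rng.normal(62.4, 11.2, 400)   # illustrative age data, n = 400

sd = values.std(ddof=1)                # spread of individual patients
sem = sd / np.sqrt(len(values))        # precision of the estimated mean

# SEM shrinks as n grows; SD does not. Reporting SEM as "variability"
# makes the data look far less dispersed than they are.
print(f"SD  = {sd:.2f} (describes patients)")
print(f"SEM = {sem:.2f} (describes the mean estimate)")
```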
3.2 Assessing Normality
Visual (preferred): Histogram, Q-Q plot, box plot.
Formal tests:
- Shapiro-Wilk: best for n < 50. H₀: normally distributed. p > 0.05 is consistent with normality.
- Kolmogorov-Smirnov: better for larger n.
Practical rule: With n > 30, parametric tests are generally robust even with moderate skew (central limit theorem).
3.3 Worked Example — Baseline Table
| Variable | Type | Summary |
|---|---|---|
| Age (years) | Continuous, ~normal | Mean ± SD: 62.4 ± 11.2 |
| Sex (% male) | Binary | 68 (56.7%) |
| BMI (kg/m²) | Right-skewed | Median (IQR): 27.8 (24.6–31.9) |
| LDL-C (mmol/L) | Right-skewed | Median (IQR): 3.4 (2.8–4.1) |
| NYHA class | Ordinal | Class I: 22 (18.3%), II: 58 (48.3%), III: 32 (26.7%), IV: 8 (6.7%) |
3.4 Code: Normality Assessment — 4-Panel Diagnostic
R
library(tidyverse)
set.seed(42)
n <- 120
normal <- rnorm(n, 62.4, 11.2)
skewed <- rgamma(n, shape = 2, rate = 0.6)
par(mfrow = c(2, 2), mar = c(4, 4, 3, 1))
hist(normal, breaks = 15, col = "#4575b4", border = "white",
main = "Age — approx. normal", xlab = "Age (years)", freq = FALSE)
curve(dnorm(x, mean(normal), sd(normal)), add = TRUE, col = "firebrick", lwd = 2)
qqnorm(normal, main = "Q-Q: Age", pch = 16, col = "#4575b4")
qqline(normal, col = "firebrick", lwd = 2)
hist(skewed, breaks = 15, col = "#d73027", border = "white",
main = "LDL-C — right-skewed", xlab = "LDL-C (mmol/L)", freq = FALSE)
lines(density(skewed), col = "navy", lwd = 2)
qqnorm(skewed, main = "Q-Q: LDL-C", pch = 16, col = "#d73027")
qqline(skewed, col = "navy", lwd = 2)
cat("Shapiro-Wilk (age): p =", round(shapiro.test(normal)$p.value, 4), "\n")
cat("Shapiro-Wilk (LDL): p =", round(shapiro.test(skewed)$p.value, 4), "\n")
Python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import gaussian_kde
rng = np.random.default_rng(42)
n = 120
normal = rng.normal(62.4, 11.2, n)
skewed = rng.gamma(shape=2, scale=1/0.6, size=n)
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
# Panel 1: Histogram normal
axes[0,0].hist(normal, bins=15, color='#4575b4', edgecolor='white', density=True)
x_r = np.linspace(normal.min(), normal.max(), 200)
axes[0,0].plot(x_r, stats.norm.pdf(x_r, normal.mean(), normal.std()), 'r-', lw=2)
axes[0,0].set_title("Age — approx. normal"); axes[0,0].set_xlabel("Age (years)")
# Panel 2: Q-Q normal
(osm, osr), (slope, intercept, _) = stats.probplot(normal)
axes[0,1].scatter(osm, osr, color='#4575b4', s=20)
axes[0,1].plot(osm, slope*np.array(osm)+intercept, 'r-', lw=2)
axes[0,1].set_title("Q-Q Plot: Age"); axes[0,1].set_xlabel("Theoretical quantiles")
# Panel 3: Histogram skewed
axes[1,0].hist(skewed, bins=15, color='#d73027', edgecolor='white', density=True)
kde_x = np.linspace(skewed.min(), skewed.max(), 200)
axes[1,0].plot(kde_x, gaussian_kde(skewed)(kde_x), 'navy', lw=2)
axes[1,0].set_title("LDL-C — right-skewed"); axes[1,0].set_xlabel("LDL-C (mmol/L)")
# Panel 4: Q-Q skewed
(osm2, osr2), (s2, i2, _) = stats.probplot(skewed)
axes[1,1].scatter(osm2, osr2, color='#d73027', s=20)
axes[1,1].plot(osm2, s2*np.array(osm2)+i2, 'navy', lw=2)
axes[1,1].set_title("Q-Q Plot: LDL-C"); axes[1,1].set_xlabel("Theoretical quantiles")
for ax in axes.flat:
    ax.grid(alpha=0.3)
plt.suptitle("Normality Assessment: 4-Panel Diagnostic", fontsize=13)
plt.tight_layout(); plt.savefig("normality_4panel.png", dpi=150, bbox_inches='tight')
plt.show()
for label, arr in [("age", normal), ("LDL", skewed)]:
    w, p = stats.shapiro(arr)
    print(f"Shapiro-Wilk ({label}): W={w:.3f}, p={p:.4f}")
4. One-Variable Tests
4.1 One-Sample Student’s t-Test
What it does: Tests whether the mean of a single sample differs from a known population value (μ₀).
When to use: One continuous variable, ~normal (or n ≥ 30), compare to reference value.
Test statistic:
t = (x̄ − μ₀) / (s / √n), df = n − 1
Worked Example: INR monitoring, n=25, mean=2.8, SD=0.6, target=2.5. t(24)=2.50, p=0.020. 95% CI: 2.55–3.05. The unit is over-anticoagulating on average.
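The worked-example statistics can be reproduced from the summary values alone; a sketch:

```python
import math
from scipy import stats

# Summary statistics from the INR worked example
n, xbar, s, mu0 = 25, 2.8, 0.6, 2.5

t = (xbar - mu0) / (s / math.sqrt(n))
p = 2 * stats.t.sf(abs(t), df=n - 1)
t_crit = stats.t.ppf(0.975, df=n - 1)
half_width = t_crit * s / math.sqrt(n)
print(f"t({n - 1}) = {t:.2f}, p = {p:.3f}")
print(f"95% CI: {xbar - half_width:.2f} to {xbar + half_width:.2f}")
```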
R
library(tidyverse)
set.seed(42)
inr <- rnorm(25, mean = 2.8, sd = 0.6)
mu0 <- 2.5
t_res <- t.test(inr, mu = mu0)
print(t_res)
tibble(inr = inr) |>
ggplot(aes(x = 1, y = inr)) +
geom_jitter(width = 0.08, size = 2.5, colour = "#377eb8", alpha = 0.7) +
stat_summary(fun.data = mean_cl_normal, geom = "pointrange",
colour = "firebrick", size = 0.8, linewidth = 1.2) +
geom_hline(yintercept = mu0, linetype = "dashed", colour = "grey30", linewidth = 1) +
annotate("text", x = 1.35, y = mu0 + 0.05,
label = paste0("Reference = ", mu0), hjust = 0, size = 3.5) +
annotate("text", x = 1.35, y = mean(inr) + 0.05,
label = sprintf("Mean = %.2f\n95%% CI: %.2f–%.2f\np = %.3f",
mean(inr), t_res$conf.int[1], t_res$conf.int[2], t_res$p.value),
hjust = 0, size = 3.2, colour = "firebrick") +
scale_x_continuous(limits = c(0.7, 1.9)) +
labs(title = "One-Sample t-Test: INR vs Target of 2.5",
subtitle = "Dots = patients; red = mean ± 95% CI; dashed = reference target",
y = "INR", x = NULL) +
theme_bw(base_size = 13) +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
Python
import numpy as np, matplotlib.pyplot as plt
from scipy import stats
rng = np.random.default_rng(42)
inr = rng.normal(2.8, 0.6, 25); mu0 = 2.5
t_stat, p_val = stats.ttest_1samp(inr, mu0)
ci = stats.t.interval(0.95, df=len(inr)-1, loc=np.mean(inr), scale=stats.sem(inr))
print(f"t={t_stat:.3f}, p={p_val:.4f}, 95% CI: {ci[0]:.2f}–{ci[1]:.2f}")
fig, ax = plt.subplots(figsize=(5, 6))
jitter = rng.uniform(-0.05, 0.05, len(inr))
ax.scatter(jitter, inr, color='#377eb8', alpha=0.65, s=50, zorder=3)
ax.errorbar(0, np.mean(inr), yerr=[[np.mean(inr)-ci[0]], [ci[1]-np.mean(inr)]],
fmt='o', color='firebrick', markersize=8, linewidth=2.5, capsize=6, zorder=4)
ax.axhline(mu0, linestyle='--', color='grey', linewidth=1.5)
ax.set_xlim(-0.3, 0.55); ax.set_xticks([])
ax.set_ylabel("INR", fontsize=12)
ax.set_title(f"One-Sample t-Test: INR vs Target 2.5\nt={t_stat:.2f}, p={p_val:.3f}", fontsize=12)
ax.grid(axis='y', alpha=0.3); plt.tight_layout()
plt.savefig("onesample_t.png", dpi=150, bbox_inches='tight'); plt.show()
4.2 One-Sample Wilcoxon Signed-Rank Test
What it does: Non-parametric equivalent of one-sample t-test. Tests whether the median differs from a hypothesised value.
When to use: Skewed, ordinal, or n < 30 with non-normal distribution.
Worked Example: Pain clinic, n=18, median NRS=7, reference=5. W=142, p=0.008.
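Scores equal to the reference produce zero differences, which the test must either drop or rank specially. SciPy exposes this choice through the zero_method argument; a sketch on the example data (ties in the ranks push SciPy to its normal approximation, so it may emit a warning):

```python
import numpy as np
from scipy import stats

pain = np.array([7,8,6,9,5,7,8,7,6,9,7,8,5,7,9,6,8,7])
d = pain - 5   # two patients score exactly 5, giving zero differences

# 'wilcox' drops zeros (the classic approach); 'pratt' ranks them, then drops;
# 'zsplit' splits the zero ranks between the positive and negative sums
for method in ("wilcox", "pratt", "zsplit"):
    res = stats.wilcoxon(d, zero_method=method)
    print(f"{method:>6}: W={res.statistic:.1f}, p={res.pvalue:.4f}")
```

With heavily one-sided data like this, all three choices lead to the same conclusion; with many zeros they can diverge, so it is worth reporting which was used.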
R
library(tidyverse)
pain <- c(7,8,6,9,5,7,8,7,6,9,7,8,5,7,9,6,8,7)
wilcox.test(pain, mu = 5, exact = FALSE)
tibble(pain = pain) |>
ggplot(aes(x = 1, y = pain)) +
geom_violin(fill = "#a6cee3", alpha = 0.6, width = 0.6) +
geom_jitter(width = 0.06, size = 2.5, colour = "#1f78b4", alpha = 0.8) +
stat_summary(fun = median, geom = "crossbar",
colour = "firebrick", width = 0.3, linewidth = 0.9) +
geom_hline(yintercept = 5, linetype = "dashed", colour = "grey30") +
annotate("text", x = 1.32, y = 5.15, label = "Reference median = 5",
hjust = 0, size = 3.5) +
scale_y_continuous(breaks = 0:10) +
labs(title = "One-Sample Wilcoxon: NRS Pain vs Reference",
subtitle = "Violin = distribution; red bar = sample median; dashed = reference",
y = "NRS Pain Score (0–10)", x = NULL) +
theme_bw(base_size = 13) +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
Python
import numpy as np, matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import gaussian_kde
pain = np.array([7,8,6,9,5,7,8,7,6,9,7,8,5,7,9,6,8,7])
res = stats.wilcoxon(pain - 5)
print(f"W={res.statistic:.0f}, p={res.pvalue:.4f}")
rng = np.random.default_rng(42)
fig, ax = plt.subplots(figsize=(5, 6))
y_range = np.linspace(3, 11, 200)
kde = gaussian_kde(pain, bw_method=0.4)
width = kde(y_range) * 2.5
ax.fill_betweenx(y_range, -width, width, color='#a6cee3', alpha=0.55)
ax.scatter(rng.uniform(-0.06, 0.06, len(pain)), pain, color='#1f78b4', s=50, alpha=0.8, zorder=3)
ax.hlines(np.median(pain), -0.28, 0.28, color='firebrick', lw=2.5, zorder=4, label=f"Median={np.median(pain):.0f}")
ax.axhline(5, linestyle='--', color='grey', lw=1.5, label="Reference=5")
ax.set_xlim(-0.5, 0.7); ax.set_xticks([])
ax.set_ylabel("NRS Pain Score (0–10)", fontsize=12)
ax.set_title(f"Wilcoxon Signed-Rank vs Reference\nW={res.statistic:.0f}, p={res.pvalue:.3f}", fontsize=12)
ax.legend(loc='upper right'); ax.grid(axis='y', alpha=0.3)
plt.tight_layout(); plt.savefig("wilcoxon_onesample.png", dpi=150, bbox_inches='tight'); plt.show()
4.3 One-Proportion Test
What it does: Tests whether an observed proportion differs from a known reference proportion.
Worked Example: Readmission rate 6.0% vs national 4.0%; z=1.61, p=0.107 (NS). Study may be underpowered.
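The z statistic comes straight from the null standard error; a hand-calculation sketch matching the worked example:

```python
import math
from scipy import stats

x, n, p0 = 15, 250, 0.04
p_hat = x / n                          # observed rate, 6.0%

se0 = math.sqrt(p0 * (1 - p0) / n)     # SE under H0 (reference rate)
z = (p_hat - p0) / se0
p_val = 2 * stats.norm.sf(abs(z))
print(f"z = {z:.2f}, p = {p_val:.3f}")
```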
R
# 15 readmissions in 250 procedures; reference = 4%
prop.test(x = 15, n = 250, p = 0.04, correct = FALSE)
Python
from statsmodels.stats.proportion import proportions_ztest, proportion_confint
z_stat, p_val = proportions_ztest(count=15, nobs=250, value=0.04)
ci = proportion_confint(15, 250, alpha=0.05, method='normal')
print(f"z={z_stat:.3f}, p={p_val:.4f}, 95% CI: {ci[0]:.3f}–{ci[1]:.3f}")
4.4 Chi-Square Goodness-of-Fit Test
What it does: Tests whether observed distribution of a categorical variable matches an expected distribution.
Worked Example: ABO blood types, n=200 cardiac patients vs UK population. χ²(3)=3.42, p=0.331.
R
library(tidyverse)
obs <- c(A = 96, B = 16, AB = 6, O = 82)
expected <- c(A = 0.42, B = 0.10, AB = 0.04, O = 0.44)
chisq.test(obs, p = expected)
tibble(Group = names(obs),
Observed = as.numeric(obs),
Expected = expected * sum(obs)) |>
pivot_longer(-Group, names_to = "type", values_to = "count") |>
ggplot(aes(x = Group, y = count, fill = type)) +
geom_col(position = "dodge", width = 0.6) +
scale_fill_manual(values = c("Observed" = "#4575b4", "Expected" = "#d73027")) +
labs(title = "Chi-Square GoF: ABO Blood Types",
subtitle = "Cardiac patients (n=200) vs UK population; χ²(3)=3.42, p=0.331",
x = "Blood type", y = "Count", fill = NULL) +
theme_bw(base_size = 13) + theme(legend.position = "top")
Python
import numpy as np, matplotlib.pyplot as plt
from scipy.stats import chisquare
obs = np.array([96, 16, 6, 82])
expected = np.array([0.42, 0.10, 0.04, 0.44]) * obs.sum()
groups = ['A', 'B', 'AB', 'O']
chi2, p = chisquare(obs, f_exp=expected)
print(f"chi2={chi2:.3f}, p={p:.4f}")
x = np.arange(len(groups))
fig, ax = plt.subplots(figsize=(7, 5))
ax.bar(x - 0.2, obs, width=0.35, label='Observed', color='#4575b4', alpha=0.9)
ax.bar(x + 0.2, expected, width=0.35, label='Expected', color='#d73027', alpha=0.9)
ax.set_xticks(x); ax.set_xticklabels(groups, fontsize=12)
ax.set_xlabel("Blood type"); ax.set_ylabel("Count")
ax.set_title(f"Chi-Square GoF: ABO Blood Types\nχ²(3)={chi2:.2f}, p={p:.3f}")
ax.legend(); ax.grid(axis='y', alpha=0.3)
plt.tight_layout(); plt.savefig("chisq_gof.png", dpi=150, bbox_inches='tight'); plt.show()
5. Comparing Two Groups
5.1 Independent Samples t-Test
What it does: Compares means of two independent groups.
Assumptions: Continuous outcome, two independent groups, ~normal or n ≥ 30/group, equal variances (check Levene’s).
Worked Example: ACE inhibitor RCT (n=45/group): 12.4±8.2 vs 5.8±7.6 mmHg reduction. Mean diff 6.6 mmHg (95% CI 3.3–9.9); t(88)=3.96, p<0.001.
R
library(tidyverse); library(rstatix); library(ggpubr)  # ggpubr provides stat_compare_means()
set.seed(42)
df <- tibble(group = rep(c("ACE inhibitor","Placebo"), each=45),
sbp = c(rnorm(45,12.4,8.2), rnorm(45,5.8,7.6)))
levene_test(df, sbp ~ group)
t_test(df, sbp ~ group, var.equal = TRUE, detailed = TRUE)
ggplot(df, aes(x=group, y=sbp, fill=group)) +
geom_boxplot(alpha=0.6, outlier.shape=NA, width=0.5) +
geom_jitter(aes(colour=group), width=0.12, size=1.8, alpha=0.6) +
stat_compare_means(method="t.test", label="p.signif",
comparisons=list(c("ACE inhibitor","Placebo")), size=5) +
scale_fill_manual(values=c("#4575b4","#d73027")) +
scale_colour_manual(values=c("#4575b4","#d73027")) +
labs(title="Independent t-Test: SBP Reduction by Treatment",
subtitle="Mean ± IQR; individual patients as dots",
x=NULL, y="SBP reduction (mmHg)") +
theme_bw(base_size=13) + theme(legend.position="none")
Python
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
from scipy import stats
rng = np.random.default_rng(42)
ace = rng.normal(12.4, 8.2, 45); plac = rng.normal(5.8, 7.6, 45)
lev_stat, lev_p = stats.levene(ace, plac)
t_stat, p_val = stats.ttest_ind(ace, plac, equal_var=True)
print(f"Levene: F={lev_stat:.3f}, p={lev_p:.4f}")
print(f"t={t_stat:.3f}, p={p_val:.4f}")
df = pd.DataFrame({'group':['ACE inhibitor']*45+['Placebo']*45,
'sbp' :np.concatenate([ace,plac])})
fig, ax = plt.subplots(figsize=(6,6))
palette = {'ACE inhibitor':'#4575b4','Placebo':'#d73027'}
sns.boxplot(data=df, x='group', y='sbp', palette=palette, width=0.45, fliersize=0, ax=ax)
sns.stripplot(data=df, x='group', y='sbp', palette=palette, jitter=0.1, size=4, alpha=0.5, ax=ax)
y_max = df['sbp'].max() + 3
ax.plot([0,0,1,1],[y_max,y_max+1,y_max+1,y_max], lw=1.2, color='black')
ax.text(0.5, y_max+1.5, f"p={p_val:.4f}", ha='center', fontsize=11)
ax.set_ylabel("SBP Reduction (mmHg)", fontsize=12); ax.set_xlabel("")
ax.set_title(f"Independent t-Test: SBP Reduction\nMean diff={np.mean(ace)-np.mean(plac):.1f} mmHg, t={t_stat:.2f}, p={p_val:.4f}", fontsize=11)
ax.grid(axis='y', alpha=0.3); plt.tight_layout()
plt.savefig("indep_ttest.png", dpi=150, bbox_inches='tight'); plt.show()
5.2 Welch’s t-Test (Unequal Variances)
What it does: t-test with Welch-Satterthwaite df correction — does not assume equal variances.
When to use: When Levene’s test suggests unequal variances (p ≤ 0.05), or simply as a safe default.
Worked Example: CRP: bacterial 118.4±94.2 vs viral 22.6±18.7 mg/L. Levene’s p=0.001 → Welch’s. t=5.46, p<0.001.
R
set.seed(42)
df_crp <- tibble(group = rep(c("Bacterial","Viral"), c(30,28)),
crp = c(abs(rnorm(30,118.4,94.2)), abs(rnorm(28,22.6,18.7))))
t.test(crp ~ group, data=df_crp, var.equal=FALSE)
ggplot(df_crp, aes(x=group, y=crp+1, fill=group)) +
geom_boxplot(alpha=0.65, width=0.5) +
geom_jitter(aes(colour=group), width=0.12, size=1.8, alpha=0.6) +
scale_y_log10(labels=scales::label_comma()) +
scale_fill_manual(values=c("#e66101","#5e3c99")) +
scale_colour_manual(values=c("#e66101","#5e3c99")) +
labs(title="Welch's t-Test: CRP by Infection Type",
subtitle="Log scale; Levene's p=0.001 → unequal variances → Welch's appropriate",
x=NULL, y="CRP+1 (mg/L, log scale)") +
theme_bw(base_size=13) + theme(legend.position="none")
Python
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
from scipy import stats
rng = np.random.default_rng(42)
bacterial = np.abs(rng.normal(118.4, 94.2, 30))
viral = np.abs(rng.normal(22.6, 18.7, 28))
lev_stat, lev_p = stats.levene(bacterial, viral)
t_stat, p_val = stats.ttest_ind(bacterial, viral, equal_var=False)
print(f"Levene p={lev_p:.4f}; Welch's t={t_stat:.3f}, p={p_val:.4f}")
df = pd.DataFrame({'Group':['Bacterial']*30+['Viral']*28,
'CRP' :np.concatenate([bacterial, viral])})
fig, ax = plt.subplots(figsize=(6,6))
palette = {'Bacterial':'#e66101','Viral':'#5e3c99'}
sns.boxplot(data=df, x='Group', y='CRP', palette=palette, width=0.45, ax=ax)
sns.stripplot(data=df, x='Group', y='CRP', palette=palette, jitter=0.1, size=4, alpha=0.5, ax=ax)
ax.set_yscale('log'); ax.set_ylabel("CRP (mg/L, log scale)", fontsize=12)
ax.set_title(f"Welch's t-Test: CRP by Infection\nLevene p={lev_p:.3f}, t={t_stat:.2f}, p<0.001")
ax.grid(axis='y', alpha=0.3); plt.tight_layout()
plt.savefig("welch_ttest.png", dpi=150, bbox_inches='tight'); plt.show()
5.3 Mann-Whitney U Test
What it does: Non-parametric test comparing two independent groups. Tests whether one group tends to have higher values.
Note: Does not strictly test equality of medians — tests P(X>Y)=0.5 (stochastic superiority).
Worked Example: QoL scores: standard care median=58 (IQR 42–70) vs integrated programme median=72 (IQR 62–82). U=161.5, p=0.014.
R
library(tidyverse)
std_care <- c(42,55,58,60,48,70,62,50,75,38,65,72,55,43,68,58,60,70,45,57,80,30)
int_prog <- c(62,72,80,65,75,82,70,68,75,73,65,78,85,62,70,80,72,68,75,82,66,78,62,74)
df_qol <- tibble(group=rep(c("Standard care","Integrated programme"),c(22,24)),
qol =c(std_care,int_prog))
wilcox.test(qol ~ group, data=df_qol)
ggplot(df_qol, aes(x=group, y=qol, fill=group)) +
geom_violin(alpha=0.5, trim=FALSE) +
geom_boxplot(width=0.15, outlier.shape=NA, alpha=0.8) +
geom_jitter(aes(colour=group), width=0.06, size=2, alpha=0.6) +
scale_fill_manual(values=c("#1b7837","#762a83")) +
scale_colour_manual(values=c("#1b7837","#762a83")) +
labs(title="Mann-Whitney U: Quality of Life by Care Group",
subtitle="U=161.5, p=0.014; violin+box+jitter",
x=NULL, y="EORTC QoL score (0–100)") +
theme_bw(base_size=13) + theme(legend.position="none")
Python
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
from scipy import stats
std_care = [42,55,58,60,48,70,62,50,75,38,65,72,55,43,68,58,60,70,45,57,80,30]
int_prog = [62,72,80,65,75,82,70,68,75,73,65,78,85,62,70,80,72,68,75,82,66,78,62,74]
u_stat, p_val = stats.mannwhitneyu(std_care, int_prog, alternative='two-sided')
print(f"U={u_stat:.1f}, p={p_val:.4f}")
df = pd.DataFrame({'Group':['Standard care']*22+['Integrated programme']*24,
'QoL' :std_care+int_prog})
fig, ax = plt.subplots(figsize=(7,6))
palette = {'Standard care':'#1b7837','Integrated programme':'#762a83'}
sns.violinplot(data=df,x='Group',y='QoL',palette=palette,inner=None,alpha=0.45,ax=ax)
sns.boxplot(data=df,x='Group',y='QoL',palette=palette,width=0.15,fliersize=0,ax=ax)
sns.stripplot(data=df,x='Group',y='QoL',palette=palette,jitter=0.07,size=4,alpha=0.5,ax=ax)
ax.set_xlabel(""); ax.set_ylabel("EORTC QoL Score (0–100)", fontsize=12)
ax.set_title(f"Mann-Whitney U: QoL by Care Group\nU={u_stat:.0f}, p={p_val:.4f}")
ax.grid(axis='y', alpha=0.3); plt.tight_layout()
plt.savefig("mannwhitney.png", dpi=150, bbox_inches='tight'); plt.show()
5.4 Paired Samples t-Test
What it does: Compares means from the same subjects at two time points. Removing between-subject variability substantially increases power when measurements within a pair are positively correlated.
Test statistic: t = d̄ / (sd / √n), df = n − 1
Worked Example: Sodium restriction crossover (n=20): mean reduction 43.8 mmol/24h (SD 28.4). t(19)=−6.90, p<0.001.
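The paired statistic follows directly from the summary values; a sketch (the sign of t depends only on the direction of subtraction, hence the quoted −6.90):

```python
import math
from scipy import stats

# Summary statistics: mean within-pair reduction and its SD
n, dbar, sd_d = 20, 43.8, 28.4

t = dbar / (sd_d / math.sqrt(n))
p = 2 * stats.t.sf(abs(t), df=n - 1)
print(f"|t|({n - 1}) = {t:.2f}, p = {p:.1e}")
```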
R
library(tidyverse)
set.seed(42); n <- 20
pre <- rnorm(n, 168.4, 35)
post <- pre - rnorm(n, 43.8, 28.4)
t.test(pre, post, paired=TRUE)
tibble(id=1:n, pre=pre, post=post) |>
pivot_longer(c(pre,post), names_to="time", values_to="sodium") |>
mutate(time=factor(time, c("pre","post"), c("Before","After"))) |>
ggplot(aes(x=time, y=sodium)) +
geom_line(aes(group=id), colour="grey60", alpha=0.5) +
geom_point(aes(colour=time), size=2.5, alpha=0.8) +
stat_summary(fun=mean, geom="line", aes(group=1), colour="firebrick", linewidth=2) +
stat_summary(fun.data=mean_cl_normal, geom="pointrange",
colour="firebrick", size=0.9, linewidth=1.5) +
scale_colour_manual(values=c("Before"="#4575b4","After"="#d73027")) +
labs(title="Paired t-Test: Urinary Sodium Before/After Restriction",
subtitle="Grey=individual trajectories; red=mean ± 95% CI",
x=NULL, y="24-h urinary sodium (mmol/24h)") +
theme_bw(base_size=13) + theme(legend.position="none")
Python
import numpy as np, matplotlib.pyplot as plt
from scipy import stats
rng = np.random.default_rng(42); n = 20
pre = rng.normal(168.4, 35, n)
post = pre - rng.normal(43.8, 28.4, n)
t_stat, p_val = stats.ttest_rel(pre, post)
print(f"t={t_stat:.3f}, p={p_val:.4f}")
fig, ax = plt.subplots(figsize=(6,6))
for i in range(n):
    ax.plot([0,1],[pre[i],post[i]], color='grey', alpha=0.4, lw=1)
ax.scatter([0]*n, pre, color='#4575b4', s=50, alpha=0.75, zorder=3)
ax.scatter([1]*n, post, color='#d73027', s=50, alpha=0.75, zorder=3)
ci_pre = stats.t.interval(0.95, n-1, np.mean(pre), stats.sem(pre))
ci_post = stats.t.interval(0.95, n-1, np.mean(post), stats.sem(post))
ax.errorbar([0,1],[np.mean(pre),np.mean(post)],
yerr=[[np.mean(pre)-ci_pre[0],np.mean(post)-ci_post[0]],
[ci_pre[1]-np.mean(pre),ci_post[1]-np.mean(post)]],
fmt='o-', color='firebrick', markersize=9, lw=2.5, capsize=6, zorder=5)
ax.set_xticks([0,1]); ax.set_xticklabels(['Before','After'], fontsize=12)
ax.set_ylabel("24-h Urinary Sodium (mmol/24h)", fontsize=12)
ax.set_title(f"Paired t-Test: Sodium Restriction\nMean reduction={np.mean(pre-post):.1f} mmol/24h, p<0.001")
ax.grid(axis='y', alpha=0.3); plt.tight_layout()
plt.savefig("paired_ttest.png", dpi=150, bbox_inches='tight'); plt.show()
5.5 Wilcoxon Signed-Rank Test
What it does: Non-parametric equivalent of paired t-test. Does not assume normality of differences.
Worked Example: Knee OA (n=16): NRS pre median=7, post=4. Z=−3.29, p=0.001.
R
library(tidyverse)
pre <- c(7,8,6,9,7,8,6,7,9,8,7,6,8,7,9,6)
post <- c(4,5,3,5,4,4,3,4,5,4,3,3,5,4,4,3)
wilcox.test(pre, post, paired=TRUE)
tibble(id=1:16, pre=pre, post=post, improved=ifelse(post<pre,"Improved","Not improved")) |>
pivot_longer(c(pre,post), names_to="time", values_to="nrs") |>
mutate(time=factor(time,c("pre","post"),c("Before","After"))) |>
ggplot(aes(x=time, y=nrs, group=id)) +
geom_line(aes(colour=improved), alpha=0.7, linewidth=1) +
geom_point(size=2.5) +
scale_colour_manual(values=c("Improved"="#1b7837","Not improved"="#d73027"), name=NULL) +
labs(title="Wilcoxon Signed-Rank: Pain Before/After Physiotherapy",
subtitle="Green=improved; red=not improved",
x=NULL, y="NRS Pain Score (0–10)") +
theme_bw(base_size=13) + theme(legend.position="bottom")
Python
import numpy as np, matplotlib.pyplot as plt
from scipy import stats
from matplotlib.lines import Line2D
pre = np.array([7,8,6,9,7,8,6,7,9,8,7,6,8,7,9,6])
post = np.array([4,5,3,5,4,4,3,4,5,4,3,3,5,4,4,3])
res = stats.wilcoxon(pre, post)
print(f"W={res.statistic:.1f}, p={res.pvalue:.4f}")
improved = post < pre
fig, ax = plt.subplots(figsize=(6,6))
for i in range(len(pre)):
    col = '#1b7837' if improved[i] else '#d73027'
    ax.plot([0,1],[pre[i],post[i]], color=col, alpha=0.65, lw=1.5)
    ax.scatter([0,1],[pre[i],post[i]], color=col, s=40, zorder=3)
ax.set_xticks([0,1]); ax.set_xticklabels(['Before','After'], fontsize=12)
ax.set_yticks(range(0,11)); ax.set_ylabel("NRS Pain Score (0–10)", fontsize=12)
ax.set_title(f"Wilcoxon Signed-Rank: Knee OA Physio\nW={res.statistic:.0f}, p={res.pvalue:.4f}")
legend_elements=[Line2D([0],[0],color='#1b7837',label=f'Improved ({improved.sum()})'),
Line2D([0],[0],color='#d73027',label=f'Not improved ({(~improved).sum()})')]
ax.legend(handles=legend_elements, loc='upper right')
ax.grid(axis='y', alpha=0.3); plt.tight_layout()
plt.savefig("wilcoxon_signed_rank.png", dpi=150, bbox_inches='tight'); plt.show()
5.6 Chi-Square Test of Independence
What it does: Tests whether two categorical variables are associated.
Assumptions: Both variables categorical, independent observations, expected frequency ≥ 5 in all cells.
Worked Example: Smoking and post-op pneumonia (n=180). χ²(1)=16.46 (uncorrected), p<0.001. OR=4.33.
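The expected-frequency assumption can be checked directly from the table; a sketch using the worked-example counts:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: Smoker, Non-smoker; columns: Pneumonia, No pneumonia
ct = np.array([[24, 16], [36, 104]])

chi2, p, dof, expected = chi2_contingency(ct, correction=False)
# Expected counts are (row total * column total) / grand total; the chi-square
# approximation asks for all of them to be at least about 5
print("Expected counts:")
print(np.round(expected, 1))
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.2e}, min expected = {expected.min():.1f}")
```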
R
library(tidyverse)
ct <- matrix(c(24,36,16,104), nrow=2,
dimnames=list(c("Smoker","Non-smoker"),c("Pneumonia","No pneumonia")))
chisq.test(ct, correct = FALSE)  # Yates correction off, matching the uncorrected χ²
as_tibble(ct, rownames="Smoking") |>
pivot_longer(-Smoking, names_to="Outcome", values_to="n") |>
group_by(Smoking) |> mutate(pct=n/sum(n)) |>
ggplot(aes(x=Smoking, y=pct, fill=Outcome)) +
geom_col(position="fill", width=0.55) +
geom_text(aes(label=scales::percent(pct,accuracy=0.1)),
position=position_fill(vjust=0.5), size=4, colour="white") +
scale_y_continuous(labels=scales::percent_format()) +
scale_fill_manual(values=c("Pneumonia"="#d73027","No pneumonia"="#4575b4")) +
labs(title="Chi-Square: Smoking and Post-Op Pneumonia",
subtitle="χ²(1)=16.46 (uncorrected), p<0.001; OR=4.33",
x=NULL, y="Proportion", fill="Outcome") +
theme_bw(base_size=13)
Python
import numpy as np, matplotlib.pyplot as plt
from scipy.stats import chi2_contingency
ct = np.array([[24,16],[36,104]])  # rows: Smoker, Non-smoker; cols: Pneumonia, No pneumonia
chi2, p, dof, expected = chi2_contingency(ct, correction=False)
print(f"χ²({dof})={chi2:.3f}, p={p:.6f}")
groups = ['Smoker','Non-smoker']
pneum = ct[:,0] / ct.sum(axis=1)
no_pn = ct[:,1] / ct.sum(axis=1)
fig, ax = plt.subplots(figsize=(6,6))
ax.bar(groups, pneum, label='Pneumonia', color='#d73027', alpha=0.9)
ax.bar(groups, no_pn, bottom=pneum, label='No pneumonia', color='#4575b4', alpha=0.9)
for i,(p0,p1) in enumerate(zip(pneum,no_pn)):
    ax.text(i, p0/2, f"{p0:.1%}", ha='center', va='center', color='white', fontsize=11)
    ax.text(i, p0+p1/2, f"{p1:.1%}", ha='center', va='center', color='white', fontsize=11)
ax.set_ylabel("Proportion"); ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda y,_:f'{y:.0%}'))
ax.set_title(f"Chi-Square: Smoking and Post-Op Pneumonia\nχ²(1)={chi2:.2f}, p<0.001")
ax.legend(loc='upper right'); ax.grid(axis='y', alpha=0.3)
plt.tight_layout(); plt.savefig("chisq_independence.png", dpi=150, bbox_inches='tight'); plt.show()
5.7 Fisher’s Exact Test
What it does: Tests association when expected cell frequencies are small (< 5). Calculates exact probability.
When to use: 2×2 table with any expected cell < 5, small n, sparse data.
Worked Example: Fungal infection and immunosuppression (n=12). Fisher’s p=0.045. Hypothesis-generating only.
R
ct_small <- matrix(c(4,1,1,6), nrow=2,
dimnames=list(c("Immunosuppressed","Not"),c("Infection","No infection")))
fisher.test(ct_small)
# Expected cells: min=(5×5)/12=2.08 < 5 → Fisher's appropriate
Python
from scipy.stats import fisher_exact
import numpy as np
ct_small = np.array([[4,1],[1,6]])
odds_ratio, p_val = fisher_exact(ct_small, alternative='two-sided')
print(f"OR={odds_ratio:.2f}, p={p_val:.4f}")
5.8 McNemar’s Test
What it does: Tests whether proportion of a binary outcome differs between two paired groups.
Key insight: Only the discordant pairs (b and c) contribute information.
Worked Example: Hand-hygiene compliance before/after education (n=80). Discordant: b=12, c=22. McNemar χ²=2.94, p=0.086.
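The discordant-pair arithmetic can be checked by hand; a minimal sketch using b and c from the worked example (uncorrected statistic):

```python
# McNemar's statistic uses only the discordant pairs b and c
b, c = 12, 22                    # discordant pairs from the worked example
chi2 = (b - c) ** 2 / (b + c)    # uncorrected McNemar chi-square, df = 1
print(f"chi2 = {chi2:.2f}")      # (22-12)^2 / 34 ≈ 2.94
```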
R
ct_mc <- matrix(c(38,12,22,8), nrow=2,
dimnames=list(c("Pre: Compliant","Pre: Non-compliant"),
c("Post: Compliant","Post: Non-compliant")))
mcnemar.test(ct_mc, correct=FALSE)  # correct=FALSE matches the reported χ²=2.94; the default applies a continuity correction
Python
from statsmodels.stats.contingency_tables import mcnemar
import numpy as np
ct_mc = np.array([[38,12],[22,8]])
result = mcnemar(ct_mc, exact=False, correction=False)  # correction=False matches the reported χ²=2.94
print(f"chi2={result.statistic:.3f}, p={result.pvalue:.4f}")
6. Comparing Three or More Groups
6.1 One-Way ANOVA
What it does: Tests whether means of 3+ independent groups differ. Tests the omnibus null hypothesis (ALL means equal) — follow up with post-hoc tests to find which groups differ.
Assumptions: Continuous outcome, 3+ independent groups, ~normal within each group, equal variances (Levene’s test).
Logic: Partitions total variability into between-group (treatment) and within-group (noise). F = MSbetween / MSwithin.
Post-hoc tests: Tukey HSD (all-pairs, good default), Bonferroni (conservative), Dunnett (vs control), Scheffé (most conservative).
Worked Example: Anti-nausea drug trial (n=50/group, 4 groups). Vomiting episodes: placebo 6.8, low 5.1, medium 3.4, high 2.9. F(3,196)=38.5, p<0.001. Tukey: medium vs high not significant (p=0.41).
R
library(tidyverse); library(rstatix); library(ggpubr)  # ggpubr provides stat_compare_means()
set.seed(42)
df_anova <- tibble(
group = rep(c("Placebo","Low","Medium","High"), each=50),
vomit = c(rnorm(50,6.8,2.4), rnorm(50,5.1,2.1),
rnorm(50,3.4,1.8), rnorm(50,2.9,1.7))
) |> mutate(group=factor(group, levels=c("Placebo","Low","Medium","High")))
# Levene's and ANOVA
levene_test(df_anova, vomit ~ group)
anova_result <- aov(vomit ~ group, data=df_anova)
summary(anova_result)
# Tukey HSD post-hoc
tukey_hsd(df_anova, vomit ~ group)
# Plot
ggplot(df_anova, aes(x=group, y=vomit, fill=group)) +
geom_boxplot(alpha=0.65, outlier.shape=NA, width=0.55) +
geom_jitter(aes(colour=group), width=0.12, size=1.5, alpha=0.5) +
stat_compare_means(method="anova", label.y=max(df_anova$vomit)+1) +
scale_fill_brewer(palette="RdYlGn") +
scale_colour_brewer(palette="RdYlGn") +
labs(title="One-Way ANOVA: Vomiting Episodes by Drug Dose",
subtitle="Post-hoc Tukey HSD; medium vs high not significant (p=0.41)",
x="Treatment group", y="Vomiting episodes (24h)") +
theme_bw(base_size=13) + theme(legend.position="none")
Python
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
from scipy import stats
import scikit_posthocs as sp
rng = np.random.default_rng(42)
groups_dict = {'Placebo':rng.normal(6.8,2.4,50), 'Low':rng.normal(5.1,2.1,50),
'Medium':rng.normal(3.4,1.8,50), 'High':rng.normal(2.9,1.7,50)}
df = pd.DataFrame([(g,v) for g,vals in groups_dict.items() for v in vals],
columns=['group','vomit'])
# ANOVA
f_stat, p_val = stats.f_oneway(*groups_dict.values())
print(f"F={f_stat:.2f}, p={p_val:.6f}")
# Tukey post-hoc via pingouin
import pingouin as pg
posthoc = pg.pairwise_tukey(data=df, dv='vomit', between='group')
print(posthoc[['A','B','diff','p-tukey']])
# Plot
fig, ax = plt.subplots(figsize=(8,6))
order = ['Placebo','Low','Medium','High']
palette = {'Placebo':'#d73027','Low':'#fc8d59','Medium':'#91cf60','High':'#1a9850'}
sns.boxplot(data=df, x='group', y='vomit', order=order, palette=palette,
width=0.5, fliersize=0, ax=ax)
sns.stripplot(data=df, x='group', y='vomit', order=order, palette=palette,
jitter=0.1, size=3, alpha=0.4, ax=ax)
ax.set_xlabel("Treatment group"); ax.set_ylabel("Vomiting episodes (24h)")
ax.set_title(f"One-Way ANOVA: Vomiting Episodes by Dose\nF(3,196)={f_stat:.1f}, p<0.001")
ax.grid(axis='y', alpha=0.3); plt.tight_layout()
plt.savefig("oneway_anova.png", dpi=150, bbox_inches='tight'); plt.show()
6.2 Welch’s ANOVA and Kruskal-Wallis Test
Welch’s ANOVA: Robust to unequal variances; use when Levene’s test is significant (p<0.05). In R: oneway.test(y ~ group, var.equal=FALSE). Post-hoc: Games-Howell.
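scipy has no built-in Welch ANOVA (pingouin offers `pg.welch_anova`); the statistic can also be computed directly from Welch's formulas. A minimal hand-rolled sketch, on simulated data chosen for illustration:

```python
import numpy as np
from scipy import stats

def welch_anova(*groups):
    """Welch's heteroscedasticity-robust one-way ANOVA: returns F, df1, df2, p."""
    k = len(groups)
    n = np.array([len(g) for g in groups])
    m = np.array([np.mean(g) for g in groups])
    v = np.array([np.var(g, ddof=1) for g in groups])
    w = n / v                                   # precision weights n_i / s_i^2
    mw = np.sum(w * m) / np.sum(w)              # variance-weighted grand mean
    a = np.sum(w * (m - mw) ** 2) / (k - 1)     # between-group term
    lam = 3 * np.sum((1 - w / np.sum(w)) ** 2 / (n - 1)) / (k ** 2 - 1)
    f = a / (1 + 2 * lam * (k - 2) / 3)
    df1, df2 = k - 1, 1 / lam                   # df2 is the Welch-Satterthwaite df
    return f, df1, df2, stats.f.sf(f, df1, df2)

rng = np.random.default_rng(42)
g1 = rng.normal(5, 1, 40)   # groups with deliberately unequal SDs
g2 = rng.normal(5, 3, 40)
g3 = rng.normal(7, 2, 40)
f, df1, df2, p = welch_anova(g1, g2, g3)
print(f"Welch F({df1}, {df2:.1f}) = {f:.2f}, p = {p:.4g}")
```

In practice, `pg.welch_anova(data=df, dv='y', between='group')` gives the same result with less code.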
Kruskal-Wallis: Non-parametric alternative to one-way ANOVA. Tests if 3+ independent groups have the same distribution. Post-hoc: Dunn’s test with Bonferroni correction.
Worked Example (Kruskal-Wallis): Discharge pain scores (NRS) at 3 hospitals. H(2)=8.74, p=0.013. Hospital A < B (Dunn’s p=0.010); A vs C and B vs C not significant.
R
library(tidyverse); library(rstatix)
set.seed(42)
df_kw <- tibble(
hospital = rep(c("A","B","C"), c(35,38,33)),
pain = c(sample(c(3,4,5,6), 35, replace=TRUE, prob=c(0.3,0.4,0.2,0.1)),
sample(c(4,5,6,7,8), 38, replace=TRUE, prob=c(0.15,0.25,0.3,0.2,0.1)),
sample(c(3,4,5,6,7), 33, replace=TRUE, prob=c(0.2,0.25,0.3,0.15,0.1)))
)
kruskal.test(pain ~ hospital, data=df_kw)
dunn_test(df_kw, pain ~ hospital, p.adjust.method="bonferroni")
ggplot(df_kw, aes(x=hospital, y=pain, fill=hospital)) +
geom_violin(alpha=0.55, trim=FALSE) +
geom_boxplot(width=0.12, outlier.shape=NA) +
geom_jitter(aes(colour=hospital), width=0.08, size=1.8, alpha=0.5) +
scale_fill_brewer(palette="Set2") + scale_colour_brewer(palette="Set2") +
scale_y_continuous(breaks=0:10) +
labs(title="Kruskal-Wallis: Discharge Pain by Hospital",
subtitle="H(2)=8.74, p=0.013; Dunn's post-hoc: A<B (p=0.010)",
x="Hospital", y="NRS Pain at Discharge (0–10)") +
theme_bw(base_size=13) + theme(legend.position="none")
Python
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
from scipy import stats
import scikit_posthocs as sp
rng = np.random.default_rng(42)
data_A = rng.choice([3,4,5,6], 35, p=[0.3,0.4,0.2,0.1])
data_B = rng.choice([4,5,6,7,8], 38, p=[0.15,0.25,0.3,0.2,0.1])
data_C = rng.choice([3,4,5,6,7], 33, p=[0.2,0.25,0.3,0.15,0.1])
h_stat, p_val = stats.kruskal(data_A, data_B, data_C)
print(f"H={h_stat:.3f}, p={p_val:.4f}")
df = pd.DataFrame({'Hospital':['A']*35+['B']*38+['C']*33,
'Pain' :np.concatenate([data_A,data_B,data_C])})
dunn_result = sp.posthoc_dunn(df, val_col='Pain', group_col='Hospital', p_adjust='bonferroni')
print(dunn_result)
fig, ax = plt.subplots(figsize=(7,6))
palette = {'A':'#66c2a5','B':'#fc8d62','C':'#8da0cb'}
sns.violinplot(data=df,x='Hospital',y='Pain',palette=palette,inner=None,alpha=0.5,ax=ax)
sns.boxplot(data=df,x='Hospital',y='Pain',palette=palette,width=0.12,fliersize=0,ax=ax)
sns.stripplot(data=df,x='Hospital',y='Pain',palette=palette,jitter=0.08,size=4,alpha=0.4,ax=ax)
ax.set_ylabel("NRS Pain at Discharge (0–10)"); ax.set_xlabel("Hospital")
ax.set_title(f"Kruskal-Wallis: Pain by Hospital\nH(2)={h_stat:.2f}, p={p_val:.3f}")
ax.grid(axis='y', alpha=0.3); plt.tight_layout()
plt.savefig("kruskal_wallis.png", dpi=150, bbox_inches='tight'); plt.show()
6.3 Repeated Measures ANOVA
What it does: Tests for differences in a continuous outcome at 3+ time points in the same subjects.
Key assumption — Sphericity: Variances of differences between all time-point pairs should be equal. Test with Mauchly’s. If violated, apply Greenhouse-Geisser correction.
Worked Example: CKD creatinine at 4 time points (n=30). F(3,87)=8.43, p<0.001, η²=0.23.
R
library(tidyverse); library(rstatix)
set.seed(42); n <- 30
df_rm <- tibble(
id = rep(1:n, 4),
timepoint = rep(c("Baseline","3 months","6 months","12 months"), each=n),
creatinine = c(rnorm(n,142,38), rnorm(n,138,36), rnorm(n,131,34), rnorm(n,128,33))
) |> mutate(timepoint=factor(timepoint, levels=c("Baseline","3 months","6 months","12 months")))
anova_test(df_rm, dv=creatinine, wid=id, within=timepoint)
# Mean ± CI line plot
df_rm |>
group_by(timepoint) |>
summarise(mean=mean(creatinine), se=sd(creatinine)/sqrt(n()),
ci_lo=mean-qt(0.975,n()-1)*se, ci_hi=mean+qt(0.975,n()-1)*se) |>
ggplot(aes(x=timepoint, y=mean, group=1)) +
geom_ribbon(aes(ymin=ci_lo, ymax=ci_hi), fill="#74add1", alpha=0.3) +
geom_line(colour="#4575b4", linewidth=1.5) +
geom_point(colour="#4575b4", size=4) +
labs(title="Repeated Measures ANOVA: Serum Creatinine over 12 Months",
subtitle="Mean ± 95% CI; F(3,87)=8.43, p<0.001, η²=0.23",
x="Time point", y="Serum creatinine (µmol/L)") +
theme_bw(base_size=13)
Python
import numpy as np, pandas as pd, matplotlib.pyplot as plt
from scipy import stats
import pingouin as pg
rng = np.random.default_rng(42); n = 30
timepoints = ['Baseline','3 months','6 months','12 months']
means = [142,138,131,128]; sds = [38,36,34,33]
df = pd.concat([pd.DataFrame({'id':range(n),'timepoint':tp,
'creatinine':rng.normal(m,s,n)})
for tp,m,s in zip(timepoints,means,sds)])
df['timepoint'] = pd.Categorical(df['timepoint'], categories=timepoints, ordered=True)
# RM-ANOVA via pingouin
rm_result = pg.rm_anova(data=df, dv='creatinine', within='timepoint', subject='id', detailed=True)
print(rm_result[['Source','F','p-unc','np2']])
# Mean ± CI
summary = df.groupby('timepoint')['creatinine'].agg(['mean','sem'])
summary['ci95'] = summary['sem'] * stats.t.ppf(0.975, n-1)
fig, ax = plt.subplots(figsize=(8,5))
x = range(len(timepoints))
ax.fill_between(x, summary['mean']-summary['ci95'], summary['mean']+summary['ci95'],
alpha=0.25, color='#74add1')
ax.plot(x, summary['mean'], 'o-', color='#4575b4', linewidth=2, markersize=8)
ax.errorbar(x, summary['mean'], yerr=summary['ci95'],
fmt='none', color='#4575b4', capsize=5, linewidth=1.5)
ax.set_xticks(x); ax.set_xticklabels(timepoints)
ax.set_ylabel("Serum creatinine (µmol/L)"); ax.set_xlabel("")
ax.set_title("Repeated Measures ANOVA: Creatinine over 12 Months\n"
"Mean ± 95% CI; F(3,87)=8.43, p<0.001")
ax.grid(alpha=0.3); plt.tight_layout()
plt.savefig("rm_anova.png", dpi=150, bbox_inches='tight'); plt.show()
6.4 Friedman Test
What it does: Non-parametric equivalent of repeated measures ANOVA. Compares 3+ related groups.
When to use: Same subjects at 3+ time points; data are skewed, ordinal, or RM-ANOVA assumptions violated.
Worked Example: Rheumatoid arthritis biologic therapy (n=20): NRS baseline 7, week 4: 5, week 8: 3. χ²(2)=28.4, p<0.001.
R
library(tidyverse)
set.seed(42); n <- 20
df_fr <- tibble(
id = rep(1:n, 3),
timepoint = rep(c("Baseline","Week 4","Week 8"), each=n),
nrs = c(sample(6:9,n,replace=TRUE), sample(3:7,n,replace=TRUE),
sample(1:5,n,replace=TRUE))
) |> mutate(timepoint=factor(timepoint,levels=c("Baseline","Week 4","Week 8")))
friedman.test(nrs ~ timepoint | id, data=df_fr)
# Spaghetti + median
medians <- df_fr |> group_by(timepoint) |> summarise(med=median(nrs))
ggplot(df_fr, aes(x=timepoint, y=nrs, group=id)) +
geom_line(colour="grey60", alpha=0.5, linewidth=0.8) +
geom_point(colour="grey60", size=1.5, alpha=0.5) +
geom_line(data=medians, aes(y=med, group=1),
colour="firebrick", linewidth=2.5, inherit.aes=FALSE,
mapping=aes(x=timepoint)) +
geom_point(data=medians, aes(x=timepoint, y=med),
colour="firebrick", size=5, inherit.aes=FALSE) +
labs(title="Friedman Test: Pain Scores over 8 Weeks of Biologic Therapy",
subtitle="Grey=individual patients; red=median trajectory; χ²(2)=28.4, p<0.001",
x=NULL, y="NRS Pain Score (0–10)") +
theme_bw(base_size=13)
Python
import numpy as np, pandas as pd, matplotlib.pyplot as plt
from scipy import stats
rng = np.random.default_rng(42); n = 20
baseline = rng.integers(6, 10, n)
week4 = rng.integers(3, 8, n)
week8 = rng.integers(1, 6, n)
stat, p = stats.friedmanchisquare(baseline, week4, week8)
print(f"Friedman χ²={stat:.3f}, p={p:.6f}")
timepoints = ['Baseline','Week 4','Week 8']
fig, ax = plt.subplots(figsize=(7,6))
for i in range(n):
ax.plot([0,1,2],[baseline[i],week4[i],week8[i]], color='grey', alpha=0.4, lw=0.9)
medians = [np.median(baseline),np.median(week4),np.median(week8)]
ax.plot([0,1,2], medians, 'o-', color='firebrick', lw=2.5, markersize=10, zorder=5,
label=f'Median: {medians[0]:.0f}→{medians[1]:.0f}→{medians[2]:.0f}')
ax.set_xticks([0,1,2]); ax.set_xticklabels(timepoints)
ax.set_ylabel("NRS Pain Score (0–10)")
ax.set_title(f"Friedman Test: RA Biologic Therapy\nχ²(2)={stat:.1f}, p<0.001")
ax.legend(); ax.grid(axis='y', alpha=0.3)
plt.tight_layout(); plt.savefig("friedman.png", dpi=150, bbox_inches='tight'); plt.show()
7. Correlation and Association
7.1 Pearson Correlation
What it does: Quantifies the strength and direction of the linear relationship between two continuous variables. Output: r (−1 to +1).
Interpreting |r|: 0–0.19 = negligible; 0.20–0.39 = weak; 0.40–0.59 = moderate; 0.60–0.79 = strong; 0.80–1.00 = very strong.
Important: Always plot the data first. r can miss non-linear relationships and is distorted by outliers.
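A quick sketch of why plotting matters: a relationship that is perfectly deterministic but non-linear (here a symmetric U-shape) yields r ≈ 0.

```python
import numpy as np
from scipy import stats

# y is completely determined by x, yet the *linear* correlation is ~0
x = np.linspace(-3, 3, 61)
y = x ** 2                     # symmetric U-shaped relationship
r, p = stats.pearsonr(x, y)
print(f"r = {r:.3f}")          # ≈ 0: Pearson r captures only linear trend
```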
Worked Example: Age and eGFR (n=150). r=−0.58 (95% CI −0.68 to −0.46), p<0.001. r²=0.34 (age explains 34% of eGFR variance).
R
library(tidyverse)
set.seed(42)
n <- 150
df_cor <- tibble(
age = rnorm(n, 58, 14),
egfr = 110 - 0.6*age + rnorm(n, 0, 12)
)
cor.test(df_cor$age, df_cor$egfr)
ggplot(df_cor, aes(x=age, y=egfr)) +
geom_point(colour="#4575b4", alpha=0.65, size=2.5) +
geom_smooth(method="lm", se=TRUE, colour="firebrick", fill="firebrick", alpha=0.15) +
annotate("text", x=30, y=85, hjust=0, size=4,
label=sprintf("r = %.2f\n95%% CI: %.2f to %.2f\np < 0.001",
cor.test(df_cor$age,df_cor$egfr)$estimate,
cor.test(df_cor$age,df_cor$egfr)$conf.int[1],
cor.test(df_cor$age,df_cor$egfr)$conf.int[2])) +
labs(title="Pearson Correlation: Age and eGFR",
subtitle="Scatterplot with regression line and 95% CI band",
x="Age (years)", y="eGFR (mL/min/1.73m²)") +
theme_bw(base_size=13)
Python
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
from scipy import stats
rng = np.random.default_rng(42); n = 150
age = rng.normal(58, 14, n)
egfr = 110 - 0.6*age + rng.normal(0, 12, n)
r, p = stats.pearsonr(age, egfr)
n_ci = len(age)
r_z = np.arctanh(r)
se = 1/np.sqrt(n_ci-3)
ci = np.tanh([r_z - 1.96*se, r_z + 1.96*se])
print(f"r={r:.3f}, 95% CI: {ci[0]:.3f}–{ci[1]:.3f}, p={p:.6f}")
fig, ax = plt.subplots(figsize=(7,6))
ax.scatter(age, egfr, color='#4575b4', alpha=0.6, s=40)
slope, intercept, *_ = stats.linregress(age, egfr)
x_line = np.linspace(age.min(), age.max(), 100)
ax.plot(x_line, slope*x_line+intercept, 'r-', lw=2)
# CI band
from scipy.stats import t as t_dist
n_c = len(age); se_fit = np.sqrt(np.var(egfr - (slope*age+intercept)) *
(1/n_c + (x_line-age.mean())**2/np.sum((age-age.mean())**2)))
ax.fill_between(x_line, slope*x_line+intercept-1.96*se_fit,
slope*x_line+intercept+1.96*se_fit, color='red', alpha=0.12)
ax.text(25, 85, f"r = {r:.2f}\n95% CI: {ci[0]:.2f}–{ci[1]:.2f}\np < 0.001", fontsize=10)
ax.set_xlabel("Age (years)"); ax.set_ylabel("eGFR (mL/min/1.73m²)")
ax.set_title("Pearson Correlation: Age and eGFR\n(r = −0.58)")
ax.grid(alpha=0.3); plt.tight_layout()
plt.savefig("pearson_correlation.png", dpi=150, bbox_inches='tight'); plt.show()
7.2 Spearman’s Rank Correlation
What it does: Non-parametric measure of monotonic association. Calculates Pearson correlation on ranks.
When to use: Ordinal data, skewed continuous data, outliers, non-linear monotonic relationships.
Worked Example: NYHA class vs 6-minute walk distance (n=80). ρ=−0.71 (95% CI −0.79 to −0.60), p<0.001.
R
set.seed(42); n <- 80
nyha <- sample(1:4, n, replace=TRUE, prob=c(0.25,0.4,0.25,0.1))
walk <- 450 - 50*nyha + rnorm(n,0,40)
cor.test(nyha, walk, method="spearman")
Python
import numpy as np
from scipy import stats
rng = np.random.default_rng(42); n = 80
nyha = rng.choice([1,2,3,4], n, p=[0.25,0.4,0.25,0.1])
walk = 450 - 50*nyha + rng.normal(0, 40, n)
rho, p = stats.spearmanr(nyha, walk)
print(f"Spearman ρ={rho:.3f}, p={p:.6f}")
7.3 Common Pitfalls in Correlation Analysis
- No scatterplot: r=0.50 could reflect a clean trend, a curve, or a few outliers — always plot first
- Ecological fallacy: Group-level correlations don’t imply individual-level correlations
- Confounding: Correlation between A and B may be explained by third variable C
- Restricted range: Correlations are attenuated when studying a narrow range of one variable
- Multiple testing: 20 correlations at α=0.05 → ~1 false positive by chance
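The multiple-testing bullet can be quantified directly: with 20 independent tests at α=0.05, the expected number of false positives is 1, and the chance of at least one is about 64%.

```python
# Family-wise error for m independent tests at significance level alpha
alpha, m = 0.05, 20
expected_fp = m * alpha              # expected false positives under the null
fwer = 1 - (1 - alpha) ** m          # P(at least one false positive)
print(f"expected FP = {expected_fp:.1f}, FWER = {fwer:.2f}")
```

This is why families of correlations should be corrected (e.g. Bonferroni or FDR) or pre-specified.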
8. Regression Analysis
8.1 Simple Linear Regression
What it does: Models linear relationship between one continuous predictor (X) and one continuous outcome (Y). Extends correlation by fitting a line and quantifying predicted change in Y per unit change in X.
Model: Y = β₀ + β₁X + ε
Assumptions: Linearity, independence of residuals, homoscedasticity, normality of residuals, no influential outliers.
Worked Example: BMI and SBP (n=200). SBP = 98.4 + 1.72×BMI. β₁=1.72 (95% CI 1.28–2.16), p<0.001. R²=0.21.
R
library(tidyverse)
set.seed(42); n <- 200
bmi <- rnorm(n, 27, 5)
sbp <- 98.4 + 1.72*bmi + rnorm(n,0,18)
model <- lm(sbp ~ bmi)
summary(model); confint(model)
par(mfrow=c(1,2))
# Scatter + fit
plot(bmi, sbp, pch=16, col=adjustcolor("#4575b4",0.6),
xlab="BMI (kg/m²)", ylab="SBP (mmHg)",
main="Simple Linear Regression: BMI and SBP")
abline(model, col="firebrick", lwd=2)
# Residuals vs fitted
plot(model, which=1, main="Residuals vs Fitted")
Python
import numpy as np, pandas as pd, matplotlib.pyplot as plt, statsmodels.api as sm
from scipy import stats
rng = np.random.default_rng(42); n = 200
bmi = rng.normal(27, 5, n)
sbp = 98.4 + 1.72*bmi + rng.normal(0, 18, n)
X_sm = sm.add_constant(bmi)
model = sm.OLS(sbp, X_sm).fit()
print(model.summary())
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
# Scatter + fit + CI
x_range = np.linspace(bmi.min(), bmi.max(), 100)
pred = model.get_prediction(sm.add_constant(x_range))
pred_sum = pred.summary_frame(alpha=0.05)
ax1.scatter(bmi, sbp, color='#4575b4', alpha=0.55, s=25)
ax1.plot(x_range, pred_sum['mean'], 'r-', lw=2)
ax1.fill_between(x_range, pred_sum['mean_ci_lower'], pred_sum['mean_ci_upper'],
color='red', alpha=0.15)
ax1.set_xlabel("BMI (kg/m²)"); ax1.set_ylabel("SBP (mmHg)")
ax1.set_title(f"Simple Linear Regression\nβ={model.params[1]:.2f}, R²={model.rsquared:.2f}")
# Residuals vs fitted
fitted = model.fittedvalues
residuals = model.resid
ax2.scatter(fitted, residuals, color='#377eb8', alpha=0.55, s=25)
ax2.axhline(0, color='red', lw=1.5, linestyle='--')
ax2.set_xlabel("Fitted values"); ax2.set_ylabel("Residuals")
ax2.set_title("Residuals vs Fitted")
ax1.grid(alpha=0.3); ax2.grid(alpha=0.3)
plt.tight_layout(); plt.savefig("simple_regression.png", dpi=150, bbox_inches='tight')
plt.show()
8.2 Multiple Linear Regression
What it does: Models relationship between 2+ predictors and a continuous outcome. Each coefficient = effect of that predictor adjusted for all others.
Model: Y = β₀ + β₁X₁ + β₂X₂ + … + βₚXₚ + ε
Worked Example: SBP predictors in 200 adults. Adjusted R²=0.34. BMI β=1.42, age β=0.68, male sex β=4.10, smoking β=3.85 (all p<0.01).
R
set.seed(42); n <- 200
bmi <- rnorm(n,27,5)
age <- rnorm(n,52,12)
male <- rbinom(n,1,0.5)
smoker <- rbinom(n,1,0.2)
sbp <- 78.2 + 1.42*bmi + 0.68*age + 4.10*male + 3.85*smoker + rnorm(n,0,15)
df_mlr <- data.frame(sbp=sbp,bmi=bmi,age=age,male=male,smoker=smoker)
model <- lm(sbp ~ bmi + age + male + smoker, data=df_mlr)
summary(model)
# Coefficient forest plot
library(broom)
tidy(model, conf.int=TRUE) |>
filter(term != "(Intercept)") |>
mutate(term=c("BMI (per kg/m²)","Age (per year)","Male sex","Current smoker")) |>
ggplot(aes(x=estimate, y=fct_reorder(term,estimate))) +
geom_vline(xintercept=0, linetype="dashed", colour="grey40") +
geom_errorbarh(aes(xmin=conf.low, xmax=conf.high), height=0.2, linewidth=1) +
geom_point(size=4, colour="#4575b4") +
labs(title="Multiple Linear Regression: Predictors of SBP",
subtitle="Coefficients with 95% CI; adjusted R²=0.34",
x="Regression coefficient (β, mmHg)", y=NULL) +
theme_bw(base_size=13)
Python
import numpy as np, pandas as pd, matplotlib.pyplot as plt, statsmodels.api as sm
rng = np.random.default_rng(42); n = 200
bmi = rng.normal(27,5,n); age = rng.normal(52,12,n)
male = rng.binomial(1,0.5,n); smoker = rng.binomial(1,0.2,n)
sbp = 78.2 + 1.42*bmi + 0.68*age + 4.10*male + 3.85*smoker + rng.normal(0,15,n)
X = sm.add_constant(np.column_stack([bmi,age,male,smoker]))
model = sm.OLS(sbp, X).fit()
print(model.summary())
coef_names = ['BMI (per kg/m²)','Age (per year)','Male sex','Current smoker']
coefs = model.params[1:]
conf = model.conf_int()[1:]
fig, ax = plt.subplots(figsize=(8,5))
y_pos = range(len(coef_names))
ax.barh(y_pos, coefs, xerr=[coefs-conf[:,0], conf[:,1]-coefs],
height=0.5, align='center', color='#4575b4', alpha=0.8,
error_kw=dict(ecolor='black', capsize=6, linewidth=1.5))
ax.axvline(0, color='grey', linestyle='--', lw=1.5)
ax.set_yticks(y_pos); ax.set_yticklabels(coef_names)
ax.set_xlabel("Regression coefficient β (mmHg)")
ax.set_title("Multiple Linear Regression: Predictors of SBP\n"
f"Coefficients ± 95% CI; Adjusted R²={model.rsquared_adj:.2f}")
ax.grid(axis='x', alpha=0.3); plt.tight_layout()
plt.savefig("multiple_regression.png", dpi=150, bbox_inches='tight'); plt.show()
8.3 Logistic Regression
What it does: Models relationship between predictors and a binary outcome. Output: log-odds, converted to OR.
Key assumptions: Binary outcome, independence, no multicollinearity, EPV ≥ 10 events per predictor.
Worked Example: 30-day COPD readmission (n=400, 21% events). C-statistic=0.72. Prior admission OR=2.84 (95% CI 1.63–4.95).
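The EPV assumption can be checked with simple arithmetic; the figures below come from the worked example (n=400, 21% events, 4 predictors):

```python
# Events-per-variable (EPV) check for the COPD readmission model
n, event_rate, n_predictors = 400, 0.21, 4
events = n * event_rate            # expected number of events
epv = events / n_predictors        # events per candidate predictor
print(f"events = {events:.0f}, EPV = {epv:.0f} (rule of thumb: >= 10)")
```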
R
library(tidyverse); library(broom); library(pROC)
set.seed(42); n <- 400
age <- rnorm(n,68,12); fev1 <- rnorm(n,55,15)
prev_adm<- rbinom(n,1,0.35); home_o2 <- rbinom(n,1,0.25)
logit_p <- -3.5 + 0.022*age - 0.013*fev1 + 1.04*prev_adm + 0.66*home_o2
readmit <- rbinom(n, 1, plogis(logit_p))
df_lr <- data.frame(readmit=readmit,age=age,fev1=fev1,prev_adm=prev_adm,home_o2=home_o2)
model <- glm(readmit ~ age + fev1 + prev_adm + home_o2, data=df_lr, family=binomial)
summary(model)
# Forest plot of ORs
tidy(model, exponentiate=TRUE, conf.int=TRUE) |>
filter(term != "(Intercept)") |>
mutate(term=c("Age (per yr)","FEV₁ (per %)","Prior admission","Home oxygen")) |>
ggplot(aes(x=estimate, y=fct_reorder(term,estimate))) +
geom_vline(xintercept=1, linetype="dashed", colour="grey40") +
geom_errorbarh(aes(xmin=conf.low, xmax=conf.high), height=0.25, linewidth=1) +
geom_point(size=4, colour="#d73027") +
scale_x_log10() +
labs(title="Logistic Regression: Predictors of 30-Day COPD Readmission",
subtitle="Odds ratios with 95% CI (log scale)",
x="Odds Ratio (log scale)", y=NULL) +
theme_bw(base_size=13)
Python
import numpy as np, pandas as pd, matplotlib.pyplot as plt, statsmodels.api as sm
from sklearn.metrics import roc_curve, auc
rng = np.random.default_rng(42); n = 400
age = rng.normal(68,12,n); fev1 = rng.normal(55,15,n)
prev_adm = rng.binomial(1,0.35,n); home_o2 = rng.binomial(1,0.25,n)
logit_p = -3.5 + 0.022*age - 0.013*fev1 + 1.04*prev_adm + 0.66*home_o2
readmit = rng.binomial(1, 1/(1+np.exp(-logit_p)), n)
X = sm.add_constant(np.column_stack([age,fev1,prev_adm,home_o2]))
model = sm.Logit(readmit, X).fit()
print(model.summary())
coef_names = ['Age (per yr)','FEV₁ (per %)','Prior admission','Home oxygen']
ors = np.exp(model.params[1:])
ci_lo = np.exp(model.conf_int()[1:,0])
ci_hi = np.exp(model.conf_int()[1:,1])
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13,5))
# OR forest plot
y_pos = range(len(coef_names))
ax1.errorbar(ors, y_pos, xerr=[ors-ci_lo, ci_hi-ors], fmt='o',
color='#d73027', markersize=8, lw=1.5, capsize=5)
ax1.axvline(1, color='grey', linestyle='--', lw=1.5)
ax1.set_yticks(y_pos); ax1.set_yticklabels(coef_names)
ax1.set_xscale('log'); ax1.set_xlabel("Odds Ratio (log scale)")
ax1.set_title("OR Forest Plot"); ax1.grid(axis='x', alpha=0.3)
# ROC curve
fpr, tpr, _ = roc_curve(readmit, model.predict())
roc_auc = auc(fpr, tpr)
ax2.plot(fpr, tpr, color='#4575b4', lw=2, label=f"AUC = {roc_auc:.2f}")
ax2.plot([0,1],[0,1],'--', color='grey')
ax2.set_xlabel("1 - Specificity"); ax2.set_ylabel("Sensitivity")
ax2.set_title("ROC Curve: COPD Readmission Model")
ax2.legend(loc='lower right'); ax2.grid(alpha=0.3)
plt.suptitle("Logistic Regression: COPD 30-Day Readmission", fontsize=13)
plt.tight_layout(); plt.savefig("logistic_regression.png", dpi=150, bbox_inches='tight')
plt.show()
9. Effect Sizes and Association Measures
9.1 Why Effect Sizes Matter
Statistical significance (p-value) tells you whether an effect exists. Effect sizes tell you how big it is.
Hierarchy of information: P-value → Confidence interval → Effect size
9.2 Odds Ratio (OR)
OR = [odds in group A] / [odds in group B] = ad/bc from 2×2 table.
- OR = 1: no association; OR > 1: increased odds; OR < 1: decreased odds
- Use in: Case-control studies (only valid measure), logistic regression outputs, rare-outcome cohort studies
9.3 Relative Risk (RR)
RR = [risk in exposed] / [risk in unexposed] = [a/(a+b)] / [c/(c+d)]
- More intuitive than OR; directly communicates percentage change in risk
- Use in: RCTs and prospective cohort studies
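Both measures can be computed directly from a 2×2 table; a sketch using the smoking/pneumonia counts from the chi-square example above (a=24, b=36, c=16, d=104):

```python
# 2x2 table: rows = exposure (smoker / non-smoker), cols = outcome (yes / no)
a, b, c, d = 24, 36, 16, 104
odds_ratio = (a * d) / (b * c)                 # OR = ad / bc
risk_ratio = (a / (a + b)) / (c / (c + d))     # RR = risk exposed / risk unexposed
print(f"OR = {odds_ratio:.2f}, RR = {risk_ratio:.2f}")
```

Note that with this fairly common outcome (~22% overall), the OR (4.33) already overstates the RR (3.0).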
9.4 OR vs RR: When They Diverge
Critical distinction: When outcome is common (≥10%), OR overestimates the magnitude of effect compared to RR.
| True RR | Background risk | Approximate OR |
|---|---|---|
| 2.0 | 5% | 2.1 |
| 2.0 | 20% | 2.7 |
| 2.0 | 40% | 6.0 |
R — OR vs RR Divergence Plot
library(tidyverse)
# Show how OR diverges from RR as baseline risk increases
expand_grid(
rr = c(1.5, 2.0, 3.0),
risk_ctrl = seq(0.02, 0.60, by=0.01)
) |>
mutate(
risk_exp = pmin(rr * risk_ctrl, 0.99),
or = (risk_exp/(1-risk_exp)) / (risk_ctrl/(1-risk_ctrl)),
label = paste0("RR = ", rr)
) |>
ggplot(aes(x=risk_ctrl*100, y=or, colour=label)) +
geom_line(linewidth=1.3) +
geom_hline(yintercept=c(1.5,2.0,3.0), linetype="dashed", colour="grey60") +
scale_colour_brewer(palette="Set1") +
labs(title="OR vs RR: Divergence Increases with Baseline Risk",
subtitle="Dashed lines = true RR; solid lines = calculated OR for same data",
x="Baseline risk in unexposed group (%)",
y="Odds Ratio (OR)", colour="True RR") +
theme_bw(base_size=13) + theme(legend.position="bottom")
Python
import numpy as np, matplotlib.pyplot as plt
rrs = [1.5, 2.0, 3.0]
risk_ctrl = np.linspace(0.02, 0.60, 200)
colours = ['#e41a1c','#377eb8','#4daf4a']
fig, ax = plt.subplots(figsize=(9,6))
for rr, col in zip(rrs, colours):
risk_exp = np.minimum(rr * risk_ctrl, 0.99)
or_val = (risk_exp/(1-risk_exp)) / (risk_ctrl/(1-risk_ctrl))
ax.plot(risk_ctrl*100, or_val, label=f"RR = {rr}", color=col, lw=2)
ax.axhline(rr, linestyle='--', color=col, lw=1, alpha=0.6)
ax.set_xlabel("Baseline risk in unexposed group (%)", fontsize=12)
ax.set_ylabel("Odds Ratio (OR)", fontsize=12)
ax.set_title("OR vs RR: Divergence Increases with Baseline Risk\n"
"Dashed = true RR; solid = calculated OR", fontsize=12)
ax.legend(loc='upper left'); ax.set_ylim(0.8, 12); ax.grid(alpha=0.3)
plt.tight_layout(); plt.savefig("or_vs_rr.png", dpi=150, bbox_inches='tight'); plt.show()
9.5 ARR and NNT
ARR = Risk in control − Risk in treatment
NNT = 1 / ARR
Worked Example: LMWH prophylaxis post-orthopaedic surgery. DVT: 8% (LMWH) vs 18% (placebo).
- RR=0.44 (56% relative reduction)
- ARR=0.10 (10 percentage points)
- NNT=10 — treat 10 patients to prevent 1 DVT
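The arithmetic behind these three figures, as a minimal sketch:

```python
# DVT risks from the LMWH prophylaxis worked example
risk_treat, risk_ctrl = 0.08, 0.18
rr = risk_treat / risk_ctrl        # relative risk
arr = risk_ctrl - risk_treat       # absolute risk reduction
nnt = 1 / arr                      # number needed to treat
print(f"RR = {rr:.2f}, ARR = {arr:.2f}, NNT = {nnt:.0f}")
```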
9.6 Standardised Effect Sizes
Cohen’s d = (μ₁ − μ₂) / pooled SD. Benchmarks: 0.2=small, 0.5=medium, 0.8=large.
Eta-squared (η²) = SSbetween / SStotal. Benchmarks: 0.01=small, 0.06=medium, 0.14=large.
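The pooled-SD formula for Cohen's d, sketched on two small hypothetical score vectors:

```python
import statistics as st

def cohens_d(x, y):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_sd = (((nx - 1) * st.variance(x) + (ny - 1) * st.variance(y))
                 / (nx + ny - 2)) ** 0.5
    return (st.mean(x) - st.mean(y)) / pooled_sd

g1 = [5, 6, 7, 8, 9]    # hypothetical treatment scores
g2 = [3, 4, 5, 6, 7]    # hypothetical control scores
d = cohens_d(g1, g2)
print(f"d = {d:.2f}")   # 2 / sqrt(2.5) ≈ 1.26 — a large effect by Cohen's benchmarks
```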
10. Survival and Time-to-Event Analysis
10.1 Why Standard Methods Fail for Survival Data
Two fundamental problems: censoring (some patients don’t experience the event during observation) and variable follow-up times. Standard regression cannot handle either. Survival analysis incorporates event occurrence AND time to event while handling censored observations.
10.2 Core Concepts
- S(t): Survival function = P(T > t). Starts at 1.0, decreases over time.
- h(t): Hazard function = instantaneous event rate at time t, given survival to t.
- Censoring: Right censoring most common (event hasn’t occurred by end of observation). Must be non-informative.
10.3 Kaplan-Meier Estimator
Formula: S(t) = Π[1 − dⱼ/nⱼ] (product over all event times ≤ t)
Worked Example: 10 metastatic CRC patients. Estimated S(20 months) = 21.4%. Median survival ~11 months.
Reporting standard: Always include KM curve with CI bands + number-at-risk table + median survival (95% CI).
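The product-limit formula can be computed by hand; a sketch on five hypothetical patients with distinct event/censoring times (`+` = right-censored):

```python
# Observed times: 2, 4+, 5, 7+, 9  (+ = right-censored)
durations = [2, 4, 5, 7, 9]
events    = [1, 0, 1, 0, 1]   # 1 = event observed, 0 = censored

s, surv = 1.0, {}
at_risk = len(durations)
for t, e in sorted(zip(durations, events)):
    if e:                        # survival steps down only at event times
        s *= 1 - 1 / at_risk     # S(t) = S(t-) * (1 - d_j / n_j)
        surv[t] = s
    at_risk -= 1                 # events AND censorings leave the risk set
print(surv)
```

Censored patients contribute to the risk set up to their censoring time, which is exactly what a naive "events / total" calculation gets wrong.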
R — KM Curves + Log-Rank
library(survival); library(survminer)
set.seed(42); n <- 160
# Simulate KRAS WT vs MT survival
kras_group <- rep(c("KRAS WT","KRAS MT"), each=n/2)
time_data <- c(rexp(n/2, rate=1/18.4), rexp(n/2, rate=1/9.8))
status <- rbinom(n, 1, 0.85)
df_surv <- data.frame(time=pmax(time_data,0.1), status=status, group=kras_group)
surv_fit <- survfit(Surv(time,status) ~ group, data=df_surv)
survdiff(Surv(time,status) ~ group, data=df_surv)
# KM plot
ggsurvplot(
surv_fit, data=df_surv,
pval=TRUE, conf.int=TRUE,
risk.table=TRUE, risk.table.height=0.25,
palette=c("#d73027","#4575b4"),
legend.labs=c("KRAS MT","KRAS WT"),
xlab="Time (months)", ylab="Overall Survival",
title="Kaplan-Meier: OS by KRAS Status",
ggtheme=theme_bw(base_size=12)
)
Python — KM Curves + Log-Rank
import numpy as np, pandas as pd, matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test
rng = np.random.default_rng(42); n = 80
wt_time = rng.exponential(18.4, n); wt_status = rng.binomial(1,0.85,n)
mt_time = rng.exponential(9.8, n); mt_status = rng.binomial(1,0.85,n)
results = logrank_test(wt_time, mt_time, event_observed_A=wt_status, event_observed_B=mt_status)
print(f"Log-rank p = {results.p_value:.4f}")
fig, ax = plt.subplots(figsize=(9,6))
kmf_wt = KaplanMeierFitter(label="KRAS WT")
kmf_mt = KaplanMeierFitter(label="KRAS MT")
kmf_wt.fit(wt_time, wt_status)
kmf_mt.fit(mt_time, mt_status)
kmf_wt.plot_survival_function(ax=ax, ci_show=True, color='#4575b4')
kmf_mt.plot_survival_function(ax=ax, ci_show=True, color='#d73027')
ax.set_xlabel("Time (months)"); ax.set_ylabel("Survival probability")
ax.set_title(f"Kaplan-Meier: Overall Survival by KRAS Status\n"
f"Log-rank p = {results.p_value:.4f}")
ax.set_ylim(0, 1.05); ax.grid(alpha=0.3); ax.legend()
plt.tight_layout(); plt.savefig("kaplan_meier.png", dpi=150, bbox_inches='tight')
plt.show()
10.4 Log-Rank Test
What it does: Tests whether survival curves of 2+ groups are identical. Non-parametric. Most powerful when hazards are proportional; it can miss real differences when curves cross.
Worked Example: KRAS WT vs MT colorectal cancer. Median OS 18.4 vs 9.8 months. χ²(1)=14.8, p<0.001.
10.5 Cox Proportional Hazards Regression
What it does: Semi-parametric model for time-to-event with multiple predictors. Output: Hazard Ratio (HR).
Model: h(t|X) = h₀(t) × exp(β₁X₁ + β₂X₂ + …)
Interpreting HR: 1.0=no effect; 2.0=twice the instantaneous hazard; 0.5=50% hazard reduction.
PH assumption check: Log-log plot, Schoenfeld residuals, Grambsch-Therneau test.
Worked Example: CKD dialysis initiation (n=280, 98 events). eGFR per 10 mL/min/1.73m²: HR=0.51, p<0.001. Diabetes: HR=2.03, p=0.001. Proteinuria per g/24h: HR=1.67, p<0.001.
R — Cox Regression + HR Forest Plot
library(survival); library(survminer); library(broom)
set.seed(42); n <- 280
df_cox <- data.frame(
time = rexp(n, 0.08),
status = rbinom(n,1,0.35),
age = rnorm(n,62,12),
male = rbinom(n,1,0.55),
egfr = rnorm(n,38,15),
proteinuria= rexp(n,0.5),
diabetes = rbinom(n,1,0.35),
hypertension = rbinom(n,1,0.65)
)
cox_model <- coxph(Surv(time,status) ~ age + male + egfr +
proteinuria + diabetes + hypertension, data=df_cox)
summary(cox_model)
# HR forest plot
tidy(cox_model, exponentiate=TRUE, conf.int=TRUE) |>
mutate(term=c("Age (per yr)","Male sex","eGFR (per mL/min)","Proteinuria (per g/24h)",
"Diabetes","Hypertension")) |>
ggplot(aes(x=estimate, y=fct_reorder(term,estimate))) +
geom_vline(xintercept=1, linetype="dashed", colour="grey40") +
geom_errorbarh(aes(xmin=conf.low, xmax=conf.high), height=0.25, linewidth=1) +
geom_point(size=4, colour="#d73027") +
scale_x_log10() +
labs(title="Cox Regression: Predictors of Dialysis Initiation",
subtitle="Hazard ratios with 95% CI (log scale)",
x="Hazard Ratio (log scale)", y=NULL) +
theme_bw(base_size=13)
Python — Cox Regression + HR Forest Plot
import numpy as np, pandas as pd, matplotlib.pyplot as plt
from lifelines import CoxPHFitter
rng = np.random.default_rng(42); n = 280
df = pd.DataFrame({
'time' : rng.exponential(12, n),
'status' : rng.binomial(1,0.35,n),
'age' : rng.normal(62,12,n),
'male' : rng.binomial(1,0.55,n),
'egfr' : rng.normal(38,15,n),
'proteinuria': rng.exponential(2,n),
'diabetes' : rng.binomial(1,0.35,n),
'hypertension':rng.binomial(1,0.65,n)
})
cph = CoxPHFitter()
cph.fit(df, duration_col='time', event_col='status')
cph.print_summary()
# HR forest plot
summary = cph.summary
hrs = summary['exp(coef)']
ci_lo = summary['exp(coef) lower 95%']
ci_hi = summary['exp(coef) upper 95%']
labels = ['Age','Male','eGFR','Proteinuria','Diabetes','Hypertension']
fig, ax = plt.subplots(figsize=(8,5))
y_pos = range(len(labels))
ax.errorbar(hrs, y_pos, xerr=[hrs-ci_lo, ci_hi-hrs], fmt='o',
color='#d73027', markersize=8, lw=1.5, capsize=5)
ax.axvline(1, color='grey', linestyle='--', lw=1.5)
ax.set_yticks(y_pos); ax.set_yticklabels(labels)
ax.set_xscale('log'); ax.set_xlabel("Hazard Ratio (log scale)")
ax.set_title("Cox Regression: Predictors of Dialysis Initiation\nHR with 95% CI")
ax.grid(axis='x', alpha=0.3); plt.tight_layout()
plt.savefig("cox_regression.png", dpi=150, bbox_inches='tight'); plt.show()
11. Multivariable Modelling Strategy
11.1 Univariate vs Multivariable Analysis
Step 1 — Univariate (crude) analysis: Each predictor tested individually. Reports crude ORs/HRs. Purpose: describe raw associations, identify candidates.
Step 2 — Multivariable (adjusted) analysis: Selected predictors entered simultaneously. Reports adjusted ORs/HRs. Purpose: identify independent predictors, control confounding.
The relationship between crude and adjusted estimates is clinically informative: attenuation after adjustment indicates positive confounding, while a variable that strengthens or becomes significant only after adjustment was masked by negative confounding.
11.2 Confounding
A confounder: (1) associated with the exposure, (2) associated with the outcome, (3) NOT on the causal pathway.
Classic example: Coffee and lung cancer — smoking confounds both.
11.3 Variable Selection
- EPV rule: Minimum ~10 events per predictor variable in logistic and Cox regression
- Preferred approach: Hypothesis-driven selection (pre-specified, clinical knowledge)
- Univariate screening threshold: p < 0.2 (not p < 0.05 — misses confounders)
- Avoid: Stepwise selection as primary approach (capitalises on chance, biased SEs)
11.4 Confounding — Worked Example and Plot
Emergency vs elective admission and in-hospital mortality (n=600). Crude OR=3.22 → Adjusted OR=1.87. Attenuation reflects age and comorbidity as confounders.
R — Crude vs Adjusted OR Comparison Plot
library(tidyverse); library(broom)
set.seed(42); n <- 600
age <- rnorm(n,62,15)
charlson <- rpois(n,2)
emergency <- rbinom(n,1,plogis(-2+0.03*age+0.1*charlson))
logit_mort <- -5 + 0.7*emergency + 0.04*age + 0.2*charlson
mortality <- rbinom(n,1,plogis(logit_mort))
df_conf <- data.frame(mortality,emergency,age,charlson)
crude_model <- glm(mortality~emergency, data=df_conf, family=binomial)
adjusted_model <- glm(mortality~emergency+age+charlson, data=df_conf, family=binomial)
bind_rows(
tidy(crude_model, exponentiate=TRUE, conf.int=TRUE) |> mutate(model="Crude"),
tidy(adjusted_model, exponentiate=TRUE, conf.int=TRUE) |> mutate(model="Adjusted")
) |>
filter(term=="emergency") |>
ggplot(aes(x=model, y=estimate, colour=model)) +
geom_hline(yintercept=1, linetype="dashed", colour="grey40") +
geom_errorbar(aes(ymin=conf.low, ymax=conf.high), width=0.2, linewidth=1.2) +
geom_point(size=5) +
scale_colour_manual(values=c("Crude"="#d73027","Adjusted"="#4575b4")) +
scale_y_log10() +
annotate("text",x=1.5,y=2.8,label="Attenuation from confounding\n(age + comorbidity)",
hjust=0.5, size=3.8) +
labs(title="Crude vs Adjusted OR: Effect of Confounding",
subtitle="Emergency admission and in-hospital mortality (log scale)",
y="Odds Ratio (log scale)", x=NULL) +
theme_bw(base_size=13) + theme(legend.position="none")
Python — Crude vs Adjusted OR Comparison
import numpy as np, pandas as pd, matplotlib.pyplot as plt
import statsmodels.api as sm
rng = np.random.default_rng(42); n = 600
age = rng.normal(62,15,n); charlson = rng.poisson(2,n)
emergency = rng.binomial(1, 1/(1+np.exp(-(-2+0.03*age+0.1*charlson))), n)
logit_m = -5 + 0.7*emergency + 0.04*age + 0.2*charlson
mortality = rng.binomial(1, 1/(1+np.exp(-logit_m)), n)
X_crude = sm.add_constant(emergency)
X_adj = sm.add_constant(np.column_stack([emergency,age,charlson]))
m_crude = sm.Logit(mortality, X_crude).fit(disp=False)
m_adj = sm.Logit(mortality, X_adj).fit(disp=False)
models = ['Crude','Adjusted']
ors = [np.exp(m_crude.params[1]), np.exp(m_adj.params[1])]
cis_lo = [np.exp(m_crude.conf_int()[1,0]), np.exp(m_adj.conf_int()[1,0])]
cis_hi = [np.exp(m_crude.conf_int()[1,1]), np.exp(m_adj.conf_int()[1,1])]
colours = ['#d73027','#4575b4']
fig, ax = plt.subplots(figsize=(6,6))
for i,(model,or_val,ci_l,ci_h,col) in enumerate(zip(models,ors,cis_lo,cis_hi,colours)):
ax.errorbar(i, or_val, yerr=[[or_val-ci_l],[ci_h-or_val]],
fmt='o', color=col, markersize=12, lw=2, capsize=8)
ax.axhline(1, linestyle='--', color='grey', lw=1.5)
ax.set_yscale('log'); ax.set_xticks([0,1]); ax.set_xticklabels(models, fontsize=12)
ax.set_ylabel("Odds Ratio (log scale)", fontsize=12)
ax.set_title("Crude vs Adjusted OR: Emergency Admission\nConfounding by age and comorbidity")
ax.grid(axis='y', alpha=0.3); plt.tight_layout()
plt.savefig("confounding_or.png", dpi=150, bbox_inches='tight'); plt.show()
11.5 Propensity Score Methods
Purpose: In observational studies, treated and untreated patients differ systematically. Propensity score (PS) = predicted probability of receiving treatment, estimated from logistic regression.
Methods: PS matching (match treated to untreated on PS), PS stratification (quintiles), IPTW.
Worked Example: STEMI registry, DES vs BMS (n=800 each). Unmatched HR=0.51 (biased by confounding). After PSM (312 matched pairs): HR=0.74 (95% CI 0.55–0.99).
R
library(MatchIt)
set.seed(42); n <- 800
age <- rnorm(n,65,12); grace <- rnorm(n,130,30); dm <- rbinom(n,1,0.35)
des <- rbinom(n,1,plogis(-1+0.02*(70-age)+0.01*(130-grace)-0.3*dm))
logit_m <- -3 + 0.8*dm - 0.4*des + 0.02*age
mortality <- rbinom(n,1,plogis(logit_m))
df_ps <- data.frame(mortality,des,age,grace,dm)
# PSM
match_out <- matchit(des ~ age + grace + dm, data=df_ps, method="nearest",
distance="logit", caliper=0.1)
summary(match_out)
matched_df <- match.data(match_out)
cat("Matched pairs:", nrow(matched_df)/2, "\n")
Python
import numpy as np, pandas as pd, matplotlib.pyplot as plt
import statsmodels.api as sm
rng = np.random.default_rng(42); n = 800
age = rng.normal(65,12,n); grace = rng.normal(130,30,n); dm = rng.binomial(1,0.35,n)
des = rng.binomial(1, 1/(1+np.exp(-(-1+0.02*(70-age)+0.01*(130-grace)-0.3*dm))), n)
logit_m = -3 + 0.8*dm - 0.4*des + 0.02*age
mortality = rng.binomial(1, 1/(1+np.exp(-logit_m)), n)
# Calculate propensity scores
X_ps = sm.add_constant(np.column_stack([age,grace,dm]))
ps_model = sm.Logit(des, X_ps).fit(disp=False)
ps_scores = ps_model.predict()
# PS distribution by group
fig, ax = plt.subplots(figsize=(7,5))
ax.hist(ps_scores[des==1], bins=25, alpha=0.6, color='#4575b4', label='DES (n=%d)' % des.sum(), density=True)
ax.hist(ps_scores[des==0], bins=25, alpha=0.6, color='#d73027', label='BMS (n=%d)' % (n-des.sum()), density=True)
ax.set_xlabel("Propensity Score", fontsize=12); ax.set_ylabel("Density", fontsize=12)
ax.set_title("Propensity Score Distribution: DES vs BMS\n"
"Overlap indicates matching feasibility")
ax.legend(); ax.grid(alpha=0.3)
plt.tight_layout(); plt.savefig("propensity_score.png", dpi=150, bbox_inches='tight')
plt.show()
12. Multivariate Methods
12.1 Terminology
Multivariable: Multiple predictors, ONE outcome (e.g., multiple regression).
Multivariate: Multiple outcomes simultaneously (MANOVA, PCA, factor analysis).
12.2 MANOVA
What it does: Tests whether groups differ on a combination of continuous outcomes simultaneously. Avoids multiple testing inflation and exploits correlations among outcomes.
Test statistics: Wilks’ Lambda (most common), Pillai’s Trace (most robust to violations).
Worked Example: Exercise modality vs VO₂max, 6MWD, resting HR (n=30/group, 4 groups). Pillai’s trace=0.52, F(9,342)=7.44, p<0.001.
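The Python example later in this section approximates MANOVA with separate univariate ANOVAs; statsmodels also offers MANOVA proper. A sketch on simulated data mirroring the worked example (the group means are assumptions chosen to echo the R simulation):

```python
# True MANOVA in Python via statsmodels: reports Wilks' lambda and
# Pillai's trace for each model term. Simulated data, illustrative only.
import numpy as np, pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(42)
n = 30
rows = []
for grp, (v, w, r) in {'Control': (28, 420, 72), 'Aerobic': (36, 510, 62),
                       'Resistance': (31, 460, 66), 'Combined': (38, 530, 60)}.items():
    for _ in range(n):
        rows.append({'group': grp,
                     'vo2':  rng.normal(v, 5),
                     'walk': rng.normal(w, 45),
                     'rhr':  rng.normal(r, 9)})
df = pd.DataFrame(rows)
fit = MANOVA.from_formula('vo2 + walk + rhr ~ group', data=df)
print(fit.mv_test())
```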
R
library(tidyverse)
set.seed(42); n_per <- 30
groups <- rep(c("Control","Aerobic","Resistance","Combined"), each=n_per)
vo2 <- c(rnorm(n_per,28,5), rnorm(n_per,36,5), rnorm(n_per,31,4), rnorm(n_per,38,4))
walk <- c(rnorm(n_per,420,50), rnorm(n_per,510,45), rnorm(n_per,460,40), rnorm(n_per,530,40))
rhr <- c(rnorm(n_per,72,10), rnorm(n_per,62,8), rnorm(n_per,66,8), rnorm(n_per,60,7))
df_man <- data.frame(group=groups, vo2=vo2, walk=walk, rhr=rhr)
manova_res <- manova(cbind(vo2,walk,rhr) ~ group, data=df_man)
summary(manova_res, test="Pillai")
summary.aov(manova_res)  # Follow-up univariate ANOVAs
Python
import numpy as np, pandas as pd
from scipy import stats
rng = np.random.default_rng(42); n = 30
groups = ['Control','Aerobic','Resistance','Combined']
data = {
'vo2' : [rng.normal(m,s,n) for m,s in [(28,5),(36,5),(31,4),(38,4)]],
'walk': [rng.normal(m,s,n) for m,s in [(420,50),(510,45),(460,40),(530,40)]],
'rhr' : [rng.normal(m,s,n) for m,s in [(72,10),(62,8),(66,8),(60,7)]]
}
# Univariate ANOVAs as surrogate
for out in ['vo2','walk','rhr']:
f, p = stats.f_oneway(*data[out])
print(f"{out}: F={f:.2f}, p={p:.4f}")
12.3 Principal Component Analysis (PCA)
What it does: Reduces many correlated variables into a smaller set of uncorrelated components capturing most of the variance.
Key outputs: Eigenvalues (retain >1 per Kaiser criterion), scree plot, factor loadings (>0.4 meaningful), % variance explained.
Worked Example: 8 metabolic biomarkers (n=300). 3 components retained (75% variance). PC1=“metabolic risk”, PC2=“blood pressure”, PC3=“lipid profile”.
R — Scree Plot + Biplot
library(tidyverse); library(factoextra)
set.seed(42); n <- 300
# Correlated biomarkers: waist, glucose, HDL, LDL, TG, SBP, DBP, insulin
Sigma <- matrix(c(
1,.65,.55,-0.4,.55,.25,.22,.60,
.65,1,.45,-0.35,.48,.20,.18,.55,
.55,.45,1,-0.55,.40,.15,.12,.45,
-.4,-.35,-.55,1,-.30,.10,.08,-.35,
.55,.48,.40,-.30,1,.20,.16,.50,
.25,.20,.15,.10,.20,1,.75,.18,
.22,.18,.12,.08,.16,.75,1,.15,
.60,.55,.45,-.35,.50,.18,.15,1), 8,8)
library(MASS)
biomarkers <- mvrnorm(n, mu=rep(0,8), Sigma=Sigma) |>
as.data.frame() |>
setNames(c("Waist","Glucose","HDL","LDL","TG","SBP","DBP","Insulin"))
pca_result <- prcomp(biomarkers, scale.=TRUE)
summary(pca_result)
p1 <- fviz_eig(pca_result, addlabels=TRUE, ylim=c(0,45),
main="Scree Plot: Metabolic Syndrome Biomarkers")
p2 <- fviz_pca_biplot(pca_result, repel=TRUE, col.var="#d73027",
col.ind="grey60", alpha.ind=0.4,
title="PCA Biplot: PC1 vs PC2")
library(ggpubr)
ggarrange(p1, p2, ncol=2)
Python — Scree Plot + Biplot
import numpy as np, pandas as pd, matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
rng = np.random.default_rng(42); n = 300
Sigma = np.array([
[1,.65,.55,-.4,.55,.25,.22,.60],
[.65,1,.45,-.35,.48,.20,.18,.55],
[.55,.45,1,-.55,.40,.15,.12,.45],
[-.4,-.35,-.55,1,-.30,.10,.08,-.35],
[.55,.48,.40,-.30,1,.20,.16,.50],
[.25,.20,.15,.10,.20,1,.75,.18],
[.22,.18,.12,.08,.16,.75,1,.15],
[.60,.55,.45,-.35,.50,.18,.15,1]
])
L = np.linalg.cholesky(Sigma)
X = (rng.standard_normal((n,8)) @ L.T)
cols = ['Waist','Glucose','HDL','LDL','TG','SBP','DBP','Insulin']
scaler = StandardScaler(); X_std = scaler.fit_transform(X)
pca = PCA(); pca.fit(X_std)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13, 5))
# Scree plot
n_components = 8
ax1.bar(range(1, n_components+1), pca.explained_variance_ratio_*100,
color='#4575b4', alpha=0.8)
ax1.plot(range(1, n_components+1), np.cumsum(pca.explained_variance_ratio_*100),
'ro-', linewidth=2, markersize=6, label='Cumulative %')
ax1.axhline(100/n_components,  # Kaiser criterion: eigenvalue > 1 <=> variance > 100/p %
            color='grey', linestyle='--', lw=1, label='Kaiser threshold (mean eigenvalue)')
ax1.set_xlabel("Principal Component"); ax1.set_ylabel("Variance explained (%)")
ax1.set_title("Scree Plot: Metabolic Biomarkers")
ax1.legend(); ax1.grid(alpha=0.3)
# Biplot
pc_scores = pca.transform(X_std)
loadings = pca.components_
ax2.scatter(pc_scores[:,0], pc_scores[:,1], color='grey', alpha=0.4, s=20)
for i, col in enumerate(cols):
ax2.annotate('', xy=(loadings[0,i]*3, loadings[1,i]*3), xytext=(0,0),
arrowprops=dict(arrowstyle='->', color='#d73027', lw=1.8))
ax2.text(loadings[0,i]*3.3, loadings[1,i]*3.3, col, color='#d73027', fontsize=9)
ax2.set_xlabel(f"PC1 ({pca.explained_variance_ratio_[0]*100:.1f}%)")
ax2.set_ylabel(f"PC2 ({pca.explained_variance_ratio_[1]*100:.1f}%)")
ax2.set_title("PCA Biplot: PC1 vs PC2"); ax2.grid(alpha=0.3)
ax2.axhline(0, color='grey', lw=0.8); ax2.axvline(0, color='grey', lw=0.8)
plt.suptitle("PCA: Metabolic Syndrome Biomarkers (n=300)", fontsize=13)
plt.tight_layout(); plt.savefig("pca_analysis.png", dpi=150, bbox_inches='tight')
plt.show()
13. Mixed Models and Longitudinal Data
13.1 Why Standard ANOVA Is Insufficient for Longitudinal Data
Repeated measures ANOVA requires complete data, assumes compound symmetry, cannot handle time-varying covariates. Clinical trials commonly have 10–40% missing data.
Linear mixed effects (LME) models overcome these limitations: handle MAR missing data without imputation, flexible correlation structures, unequally-spaced occasions, individual trajectory modelling.
13.2 Linear Mixed Effects Models
Model: Y_ij = (β₀ + b₀ᵢ) + (β₁ + b₁ᵢ)×time + β₂X + ε_ij
- Fixed effects: Population-average effects (reported in paper)
- Random effects: Between-subject variability in intercepts and slopes
Worked Example: HbA1c RCT (n=120, 18% missing). Interaction term (treatment×time): β=−0.038%/month, p=0.001. Net additional reduction at 12 months: 0.46%.
R — LME with Random Slopes
library(lme4); library(lmerTest); library(tidyverse)
set.seed(42); n_pat <- 120; n_tp <- 4
timepoints <- c(0,3,6,12)
df_lme <- expand_grid(id=1:n_pat, time=timepoints) |>
mutate(
treatment = rep(c("Control","Intervention"), each=n_pat/2)[id],
b0i = rnorm(n_pat,0,0.4)[id],
b1i = rnorm(n_pat,0,0.02)[id],
hba1c = 8.2 + b0i + (-0.02 - 0.035*(treatment=="Intervention")+b1i)*time +
rnorm(n(), 0, 0.35)
) |>
filter(runif(n()) > 0.18) # ~18% missing
model_lme <- lmer(hba1c ~ time*treatment + (1+time|id), data=df_lme)
summary(model_lme)
# Plot: individual trajectories + group means
df_lme |>
ggplot(aes(x=time, y=hba1c)) +
geom_line(aes(group=id, colour=treatment), alpha=0.2, linewidth=0.5) +
stat_summary(aes(colour=treatment, group=treatment),
fun.data=mean_cl_normal, geom="ribbon", alpha=0.25,
mapping=aes(fill=treatment)) +
stat_summary(aes(colour=treatment, group=treatment),
fun=mean, geom="line", linewidth=2) +
stat_summary(aes(colour=treatment, group=treatment),
fun=mean, geom="point", size=3) +
scale_colour_manual(values=c("Control"="#d73027","Intervention"="#4575b4")) +
scale_fill_manual(values=c("Control"="#d73027","Intervention"="#4575b4")) +
scale_x_continuous(breaks=c(0,3,6,12)) +
labs(title="Linear Mixed Model: HbA1c over 12 Months",
subtitle="Thin lines=individual patients; thick lines=mean ± 95% CI",
x="Time (months)", y="HbA1c (%)", colour="Group", fill="Group") +
theme_bw(base_size=13) + theme(legend.position="bottom")
Python — LME with Random Slopes
import numpy as np, pandas as pd, matplotlib.pyplot as plt
import statsmodels.formula.api as smf
rng = np.random.default_rng(42); n_pat = 120
timepoints = [0, 3, 6, 12]
rows = []
for i in range(n_pat):
trt = "Intervention" if i >= n_pat//2 else "Control"
b0i = rng.normal(0, 0.4)
b1i = rng.normal(0, 0.02)
slope = -0.02 - 0.035*(trt=="Intervention") + b1i
for t in timepoints:
if rng.random() > 0.18:
hba1c = 8.2 + b0i + slope*t + rng.normal(0, 0.35)
rows.append({'id':i,'time':t,'treatment':trt,'hba1c':hba1c})
df = pd.DataFrame(rows)
model = smf.mixedlm("hba1c ~ time * treatment", df, groups=df["id"],
re_formula="~time").fit()
print(model.summary())
fig, ax = plt.subplots(figsize=(8, 6))
colours = {'Control':'#d73027', 'Intervention':'#4575b4'}
for i in df['id'].unique():
pat_df = df[df['id']==i].sort_values('time')
trt = pat_df['treatment'].iloc[0]
ax.plot(pat_df['time'], pat_df['hba1c'], color=colours[trt], alpha=0.15, lw=0.7)
for trt, col in colours.items():
grp = df[df['treatment']==trt].groupby('time')['hba1c']
means = grp.mean(); sems = grp.sem()
ax.fill_between(means.index, means-1.96*sems, means+1.96*sems, color=col, alpha=0.25)
ax.plot(means.index, means, 'o-', color=col, lw=2.5, markersize=7, label=trt)
ax.set_xticks([0,3,6,12]); ax.set_xlabel("Time (months)"); ax.set_ylabel("HbA1c (%)")
ax.set_title("Linear Mixed Model: HbA1c over 12 Months\nThin lines=individual patients; thick=mean ± 95% CI")
ax.legend(loc='upper right'); ax.grid(alpha=0.3)
plt.tight_layout(); plt.savefig("lme_longitudinal.png", dpi=150, bbox_inches='tight')
plt.show()
14. Diagnostic Test Evaluation
14.1 The 2×2 Table
All diagnostic statistics derive from comparing test result to gold standard:
| | Disease + | Disease − | Total |
|---|---|---|---|
| Test + | TP | FP | TP+FP |
| Test − | FN | TN | FN+TN |
| Total | TP+FN | FP+TN | N |
14.2 Sensitivity, Specificity, PPV, NPV
- Sensitivity = TP/(TP+FN) — “SnNout”: high sensitivity + negative → rule OUT
- Specificity = TN/(FP+TN) — “SpPin”: high specificity + positive → rule IN
- PPV = TP/(TP+FP) — depends heavily on prevalence
- NPV = TN/(FN+TN) — rises with lower prevalence
Critical point: PPV and NPV are NOT fixed test properties — they depend on disease prevalence. A test with 95% sensitivity and 95% specificity applied to 1% prevalence population has PPV = only 16%.
Worked Example: POC troponin for NSTEMI (n=500, prevalence=16%). Sensitivity=90.0%, Specificity=95.0%, PPV=77.4%, NPV=98.0%.
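These figures can be reproduced directly from the implied 2×2 counts (TP=72, FP=21, FN=8, TN=399, reconstructed from n=500, prevalence 16%, sensitivity 90%, specificity 95%):

```python
# Diagnostic metrics from 2x2 counts. Counts are back-derived from the
# worked example's summary statistics, so this is a consistency check.
def diagnostics(tp, fp, fn, tn):
    return {
        'sensitivity': tp / (tp + fn),
        'specificity': tn / (tn + fp),
        'ppv':         tp / (tp + fp),
        'npv':         tn / (tn + fn),
    }

result = diagnostics(tp=72, fp=21, fn=8, tn=399)
for k, v in result.items():
    print(f"{k}: {v:.1%}")
# sensitivity 90.0%, specificity 95.0%, PPV 77.4%, NPV 98.0%
```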
14.3 PPV/NPV by Prevalence — Code
R
library(tidyverse)
sensitivity <- 0.90; specificity <- 0.95
prevalence <- seq(0.005, 0.50, by=0.005)
tibble(
prev = prevalence,
ppv = (sensitivity*prev) / (sensitivity*prev + (1-specificity)*(1-prev)),
npv = (specificity*(1-prev)) / ((1-sensitivity)*prev + specificity*(1-prev))
) |>
pivot_longer(c(ppv,npv), names_to="measure", values_to="value") |>
mutate(measure=ifelse(measure=="ppv","PPV","NPV")) |>
ggplot(aes(x=prev*100, y=value*100, colour=measure)) +
geom_line(linewidth=1.5) +
geom_vline(xintercept=16, linetype="dashed", colour="grey50") +
annotate("text",x=17,y=50,label="ED prevalence\n(16%)",hjust=0,size=3.5) +
scale_colour_manual(values=c("PPV"="#d73027","NPV"="#4575b4")) +
labs(title="PPV and NPV vs Disease Prevalence",
subtitle="POC Troponin: Sensitivity=90%, Specificity=95%",
x="Disease prevalence (%)", y="Predictive value (%)", colour=NULL) +
theme_bw(base_size=13) + theme(legend.position="bottom")
Python
import numpy as np, matplotlib.pyplot as plt
sensitivity = 0.90; specificity = 0.95
prev = np.linspace(0.005, 0.50, 200)
ppv = (sensitivity*prev) / (sensitivity*prev + (1-specificity)*(1-prev))
npv = (specificity*(1-prev)) / ((1-sensitivity)*prev + specificity*(1-prev))
fig, ax = plt.subplots(figsize=(8,5))
ax.plot(prev*100, ppv*100, color='#d73027', lw=2.5, label='PPV')
ax.plot(prev*100, npv*100, color='#4575b4', lw=2.5, label='NPV')
ax.axvline(16, color='grey', linestyle='--', lw=1.5)
ax.text(17, 50, "ED prevalence\n(16%)", fontsize=9)
ax.set_xlabel("Disease prevalence (%)", fontsize=12)
ax.set_ylabel("Predictive value (%)", fontsize=12)
ax.set_title("PPV and NPV vs Disease Prevalence\n"
"POC Troponin: Sensitivity=90%, Specificity=95%", fontsize=12)
ax.legend(loc='center right'); ax.grid(alpha=0.3)
plt.tight_layout(); plt.savefig("ppv_npv_prevalence.png", dpi=150, bbox_inches='tight')
plt.show()
14.4 ROC Curves and AUC
Plots sensitivity (y) vs 1-specificity (x) across all cutpoints. AUC: 0.5=chance; 0.7–0.8=acceptable; 0.8–0.9=excellent; >0.9=outstanding.
Worked Example: eGFR alone (AUC=0.73) vs clinical risk score (AUC=0.84) for predicting dialysis. DeLong’s test p=0.003.
R — ROC Comparison
library(pROC)
set.seed(42); n <- 350
dialysis <- rbinom(n,1,0.30)
egfr_score <- rnorm(n,0,1) + dialysis*1.2
risk_score <- rnorm(n,0,1) + dialysis*2.0
roc_egfr <- roc(dialysis, egfr_score, quiet=TRUE)
roc_risk <- roc(dialysis, risk_score, quiet=TRUE)
roc.test(roc_egfr, roc_risk)
plot(roc_egfr, col="#d73027", lwd=2, main="ROC Comparison: eGFR vs Clinical Score")
plot(roc_risk, col="#4575b4", lwd=2, add=TRUE)
legend("bottomright", legend=c(sprintf("eGFR alone (AUC=%.2f)",auc(roc_egfr)),
sprintf("Clinical score (AUC=%.2f)",auc(roc_risk))),
col=c("#d73027","#4575b4"), lwd=2)
abline(0,1,lty=2,col="grey")
Python — ROC Comparison
import numpy as np, matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
rng = np.random.default_rng(42); n = 350
dialysis = rng.binomial(1,0.30,n)
egfr_score = rng.normal(0,1,n) + dialysis*1.2
risk_score = rng.normal(0,1,n) + dialysis*2.0
fpr1,tpr1,_ = roc_curve(dialysis, egfr_score); auc1 = auc(fpr1,tpr1)
fpr2,tpr2,_ = roc_curve(dialysis, risk_score); auc2 = auc(fpr2,tpr2)
fig, ax = plt.subplots(figsize=(7,6))
ax.plot(fpr1, tpr1, color='#d73027', lw=2.5, label=f"eGFR alone (AUC={auc1:.2f})")
ax.plot(fpr2, tpr2, color='#4575b4', lw=2.5, label=f"Clinical score (AUC={auc2:.2f})")
ax.plot([0,1],[0,1],'--',color='grey',lw=1.5)
ax.fill_between(fpr2, tpr2, alpha=0.08, color='#4575b4')
ax.set_xlabel("1 - Specificity (FPR)", fontsize=12)
ax.set_ylabel("Sensitivity (TPR)", fontsize=12)
ax.set_title("ROC Curves: eGFR vs Clinical Risk Score")  # DeLong AUC comparison: see roc.test() in the R block
ax.legend(loc='lower right'); ax.grid(alpha=0.3)
plt.tight_layout(); plt.savefig("roc_comparison.png", dpi=150, bbox_inches='tight')
plt.show()
15. Agreement and Reliability
15.1 Cohen’s Kappa
Formula: κ = (Po − Pe) / (1 − Pe)
Interpreting κ: <0=poor; 0–0.20=slight; 0.21–0.40=fair; 0.41–0.60=moderate; 0.61–0.80=substantial; 0.81–1.00=almost perfect.
Worked Example: Two radiologists, 120 chest X-rays (normal/consolidation/interstitial). Po=0.883, Pe=0.351, κ=0.82 (95% CI 0.73–0.91). Almost perfect agreement.
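The worked-example κ follows directly from the formula:

```python
# Cohen's kappa from observed (Po) and chance-expected (Pe) agreement,
# using the worked example's values.
po, pe = 0.883, 0.351
kappa = (po - pe) / (1 - pe)
print(f"kappa = {kappa:.2f}")  # kappa = 0.82
```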
R
library(irr)
set.seed(42)
n <- 120
rad1 <- sample(c("Normal","Consol","Interstitial"), n, replace=TRUE, prob=c(0.45,0.275,0.275))
rad2 <- ifelse(runif(n)<0.88, rad1,
sample(c("Normal","Consol","Interstitial"), n, replace=TRUE, prob=c(0.43,0.28,0.28)))
kappa2(data.frame(rad1,rad2))
table(rad1,rad2)
Python
import numpy as np
from sklearn.metrics import cohen_kappa_score
rng = np.random.default_rng(42); n = 120
cats = ['Normal','Consol','Interstitial']
rad1 = rng.choice(cats, n, p=[0.45,0.275,0.275])
rad2 = np.where(rng.random(n)<0.88, rad1, rng.choice(cats, n, p=[0.43,0.28,0.28]))
kappa = cohen_kappa_score(rad1, rad2)
print(f"Cohen's κ = {kappa:.3f}")
15.2 Bland-Altman Analysis
What it does: Assesses agreement between two continuous measurement methods. Plots difference (Method A − Method B) against their mean.
Key outputs:
- Bias: Mean difference. Non-zero = systematic over/under-estimation.
- Limits of Agreement (LOA): Bias ± 1.96 × SD of differences. Contains 95% of individual differences.
Why NOT Pearson r for method comparison: Pearson measures association, not agreement. Two methods could be highly correlated but systematically disagree.
Worked Example: AOBP vs intra-arterial SBP (n=50). Bias=−3.8 mmHg. LOA: −19.9 to +12.3 mmHg. Too wide for clinical substitution (need ±5 mmHg precision).
R — Bland-Altman Plot
library(tidyverse)
set.seed(42); n <- 50
ia_sbp <- rnorm(n, 128.4, 20)
aobp <- ia_sbp - rnorm(n, 3.8, 8.2)
df_ba <- tibble(ia=ia_sbp, aobp=aobp, mean_val=(ia+aobp)/2, diff=aobp-ia)
bias <- mean(df_ba$diff); sd_diff <- sd(df_ba$diff)
loa_up <- bias + 1.96*sd_diff; loa_lo <- bias - 1.96*sd_diff
ggplot(df_ba, aes(x=mean_val, y=diff)) +
geom_point(colour="#4575b4", size=2.5, alpha=0.75) +
geom_hline(yintercept=bias, colour="firebrick", linewidth=1.2) +
geom_hline(yintercept=loa_up, colour="firebrick", linetype="dashed") +
geom_hline(yintercept=loa_lo, colour="firebrick", linetype="dashed") +
geom_hline(yintercept=0, colour="grey50", linetype="dotted") +
annotate("text", x=max(df_ba$mean_val), y=bias+0.5,
label=sprintf("Bias = %.1f mmHg", bias), hjust=1, colour="firebrick", size=3.5) +
annotate("text", x=max(df_ba$mean_val), y=loa_up+0.5,
label=sprintf("+1.96 SD = %.1f", loa_up), hjust=1, colour="firebrick", size=3.2) +
annotate("text", x=max(df_ba$mean_val), y=loa_lo-0.5,
label=sprintf("-1.96 SD = %.1f", loa_lo), hjust=1, colour="firebrick", size=3.2) +
labs(title="Bland-Altman Analysis: AOBP vs Intra-Arterial SBP",
subtitle="LOA = −19.9 to +12.3 mmHg; too wide for clinical substitution",
x="Mean of AOBP and IA SBP (mmHg)",
y="Difference: AOBP − IA SBP (mmHg)") +
theme_bw(base_size=13)
Python — Bland-Altman Plot
import numpy as np, matplotlib.pyplot as plt
rng = np.random.default_rng(42); n = 50
ia_sbp = rng.normal(128.4, 20, n)
aobp = ia_sbp - rng.normal(3.8, 8.2, n)
mean_m = (ia_sbp + aobp) / 2
diff = aobp - ia_sbp
bias = np.mean(diff); sd_d = np.std(diff, ddof=1)
loa_up = bias + 1.96*sd_d; loa_lo = bias - 1.96*sd_d
fig, ax = plt.subplots(figsize=(8,6))
ax.scatter(mean_m, diff, color='#4575b4', s=50, alpha=0.75, zorder=3)
ax.axhline(bias, color='firebrick', lw=1.8, label=f"Bias = {bias:.1f} mmHg")
ax.axhline(loa_up, color='firebrick', lw=1.5, linestyle='--',
label=f"+1.96 SD = {loa_up:.1f} mmHg")
ax.axhline(loa_lo, color='firebrick', lw=1.5, linestyle='--',
label=f"-1.96 SD = {loa_lo:.1f} mmHg")
ax.axhline(0, color='grey', lw=0.8, linestyle=':')
ax.set_xlabel("Mean of AOBP and IA SBP (mmHg)", fontsize=12)
ax.set_ylabel("Difference: AOBP − IA SBP (mmHg)", fontsize=12)
ax.set_title("Bland-Altman: AOBP vs Intra-Arterial SBP\n"
f"Bias={bias:.1f} mmHg, LOA: {loa_lo:.1f} to {loa_up:.1f} mmHg", fontsize=12)
ax.legend(loc='upper right', fontsize=9); ax.grid(alpha=0.3)
plt.tight_layout(); plt.savefig("bland_altman.png", dpi=150, bbox_inches='tight')
plt.show()
16. Bayesian Methods
16.1 Frequentist vs Bayesian Framework
Frequentist: Parameters are fixed constants. P-value = P(data | H₀). Cannot make probability statements about parameters.
Bayesian: Parameters have probability distributions. Direct probability statements: “P(effect > 0 | data) = 0.97”. Uses Bayes’ theorem:
Posterior ∝ Prior × Likelihood
P(θ|data) ∝ P(data|θ) × P(θ)
16.2 Credible Intervals vs Confidence Intervals
- 95% CI (frequentist): 95% of such intervals in repeated sampling contain the true value. NOT a probability statement about this interval.
- 95% CrI (Bayesian): There IS a 95% probability the true parameter lies within this interval (given prior + data). The natural, intuitive interpretation.
16.3 Bayesian Analysis — Worked Example
Pilot RCT (n=30): ACR50 = 7/15 (47%) active vs 3/15 (20%) placebo. Frequentist OR=3.5, p=0.12 (NS by convention). Bayesian posterior OR=3.2 (95% CrI 1.02–10.4). P(OR>1|data)=0.96. Supports proceeding to Phase III.
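The posterior probability P(OR>1|data) can be approximated by Monte Carlo sampling from the two conjugate Beta posteriors (uniform priors; this sketch approximates, rather than reproduces, the reported 0.96):

```python
# Monte Carlo estimate of P(active response rate > placebo | data) under
# Beta(1,1) priors; equivalent to P(OR > 1 | data) for the odds of response.
import numpy as np
rng = np.random.default_rng(42)
p_active  = rng.beta(1 + 7, 1 + 8,  100_000)  # active: 7/15 responders
p_placebo = rng.beta(1 + 3, 1 + 12, 100_000)  # placebo: 3/15 responders
prob = np.mean(p_active > p_placebo)
print(f"P(active > placebo | data) ≈ {prob:.2f}")
```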
R — Prior to Posterior Visualisation
library(tidyverse)
# Beta-binomial: conjugate analysis
# Active: 7/15; Placebo: 3/15; Prior: Beta(1,1) = uniform
x <- seq(0.001, 0.999, length.out=500)
# Posterior for active arm: Beta(1+7, 1+8) = Beta(8,9)
post_A <- dbeta(x, 8, 9)
# Posterior for placebo: Beta(1+3, 1+12) = Beta(4,13)
post_P <- dbeta(x, 4, 13)
# Prior
prior <- dbeta(x, 1, 1)
tibble(x=x, Prior=prior, `Active (posterior)`=post_A, `Placebo (posterior)`=post_P) |>
pivot_longer(-x, names_to="Distribution", values_to="density") |>
ggplot(aes(x=x, y=density, colour=Distribution)) +
geom_line(linewidth=1.3) +
scale_colour_manual(values=c("Prior"="grey60",
"Active (posterior)"="#4575b4",
"Placebo (posterior)"="#d73027")) +
labs(title="Bayesian Analysis: Prior to Posterior for ACR50 Response",
subtitle="Beta(1,1) uniform prior → updated with observed data",
x="Probability of ACR50 response", y="Posterior density", colour=NULL) +
theme_bw(base_size=13) + theme(legend.position="bottom")
Python — Prior to Posterior
import numpy as np, matplotlib.pyplot as plt
from scipy.stats import beta as beta_dist
x = np.linspace(0.001, 0.999, 500)
# Uniform prior Beta(1,1)
prior = beta_dist.pdf(x, 1, 1)
# Posterior: Beta(1+successes, 1+failures)
post_A = beta_dist.pdf(x, 1+7, 1+8) # Active: 7/15
post_P = beta_dist.pdf(x, 1+3, 1+12) # Placebo: 3/15
fig, ax = plt.subplots(figsize=(9,5))
ax.plot(x, prior, color='grey', lw=2, linestyle='--', label='Prior: Beta(1,1) — uniform')
ax.plot(x, post_P, color='#d73027', lw=2.5, label='Placebo posterior: Beta(4,13)')
ax.plot(x, post_A, color='#4575b4', lw=2.5, label='Active posterior: Beta(8,9)')
# Shade 95% CrI for active
ci_lo, ci_hi = beta_dist.ppf(0.025, 8, 9), beta_dist.ppf(0.975, 8, 9)
ax.fill_between(x, 0, post_A, where=(x>=ci_lo)&(x<=ci_hi),
color='#4575b4', alpha=0.18, label=f'Active 95% CrI: {ci_lo:.2f}–{ci_hi:.2f}')
ax.set_xlabel("Probability of ACR50 response", fontsize=12)
ax.set_ylabel("Posterior density", fontsize=12)
ax.set_title("Bayesian Analysis: Prior to Posterior\nUniform prior updated with pilot RCT data", fontsize=12)
ax.legend(fontsize=9); ax.grid(alpha=0.3)
plt.tight_layout(); plt.savefig("bayesian_posterior.png", dpi=150, bbox_inches='tight')
plt.show()
17. Meta-Analysis and Systematic Review
17.1 The Evidence Hierarchy
Meta-analysis of RCTs sits at the top of the evidence hierarchy. It pools quantitative data to produce a single, more precise effect estimate, identifies heterogeneity, and improves generalisability.
17.2 Fixed vs Random Effects Models
Fixed effects: All studies estimate the same true effect. Between-study variation = sampling error only. More weight to larger studies.
Random effects (DerSimonian-Laird): Studies estimate different but related true effects. Between-study variability (τ²) estimated and incorporated. Wider, more honest CIs.
Choosing: If I² < 25%, either model reasonable. If I² ≥ 50%, random effects appropriate.
17.3 Heterogeneity
- I² = 0–25%: Low; 26–50%: Moderate; 51–75%: Substantial; >75%: Considerable
- Cochran’s Q: Underpowered with few studies
- Investigate with subgroup analyses and meta-regression
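Fixed-effect pooling, Cochran's Q, and I² can be computed by hand from the study log-ORs. The sketch below back-derives standard errors from the 95% CIs used in the worked example of 17.5, so its I² only approximates the count-based 18% reported there:

```python
# Inverse-variance fixed-effect pooling with Cochran's Q and I².
# ORs/CIs taken from the worked example's forest-plot data; SEs are
# back-derived from the CIs, so results approximate the count-based analysis.
import numpy as np
or_vals = np.array([0.56, 0.81, 0.87, 0.84, 0.89, 0.73])
ci_lo   = np.array([0.31, 0.68, 0.58, 0.61, 0.70, 0.57])
ci_hi   = np.array([0.99, 0.96, 1.30, 1.16, 1.14, 0.94])

log_or = np.log(or_vals)
se     = (np.log(ci_hi) - np.log(ci_lo)) / (2 * 1.96)
w      = 1 / se**2                            # inverse-variance weights
pooled = np.sum(w * log_or) / np.sum(w)
Q      = np.sum(w * (log_or - pooled)**2)     # Cochran's Q
df     = len(or_vals) - 1
I2     = max(0.0, (Q - df) / Q) * 100         # I² as a percentage
print(f"Pooled OR = {np.exp(pooled):.2f}, Q = {Q:.2f} (df={df}), I² = {I2:.0f}%")
```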
17.4 Publication Bias
Funnel plot: Effect size vs precision — asymmetry suggests bias. Egger’s test: Formal test for funnel asymmetry. Trim-and-fill: Imputes missing studies to restore symmetry.
17.5 Worked Example + Code
ACE inhibitors in heart failure (6 RCTs). Pooled OR=0.81 (95% CI 0.74–0.89). I²=18% (low). NNT≈28.
R — Forest Plot + Funnel Plot
library(meta)
# ACE inhibitor meta-analysis
meta_data <- data.frame(
study = c("CONSENSUS 1987","SOLVD-T 1991","ATLAS 1999",
"V-HeFT II 1991","MERIT-HF 1999","CIBIS-II 1999"),
event_t = c(29,386,45,117,128,119),
n_t = c(127,1285,1568,403,1990,1327),
event_c = c(44,452,52,131,145,156),
n_c = c(126,1284,1596,403,2001,1320)
)
m <- metabin(event.e=event_t, n.e=n_t, event.c=event_c, n.c=n_c,
studlab=study, data=meta_data, sm="OR",
method="MH", fixed=TRUE, random=TRUE,
title="ACE Inhibitors: CV Mortality in Heart Failure")
forest(m, layout="JAMA")
funnel(m, studlab=TRUE, main="Funnel Plot: ACE Inhibitors in HF")
Python — Forest Plot + Funnel Plot
import numpy as np, matplotlib.pyplot as plt
import matplotlib.patches as mpatches
studies = ["CONSENSUS 1987","SOLVD-T 1991","ATLAS 1999","V-HeFT II 1991","MERIT-HF 1999","CIBIS-II 1999"]
or_vals = np.array([0.56, 0.81, 0.87, 0.84, 0.89, 0.73])
ci_lo = np.array([0.31, 0.68, 0.58, 0.61, 0.70, 0.57])
ci_hi = np.array([0.99, 0.96, 1.30, 1.16, 1.14, 0.94])
n_vals = np.array([253, 2569, 3164, 806, 3991, 2647])
pooled_or = 0.81; pooled_lo = 0.74; pooled_hi = 0.89
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
# Forest plot
y_pos = np.arange(len(studies), 0, -1)
weights = n_vals / n_vals.sum()
sq_sizes = (weights * 300 + 50)
ax1.scatter(or_vals, y_pos, s=sq_sizes, color='#4575b4', zorder=3, marker='s')
ax1.errorbar(or_vals, y_pos, xerr=[or_vals-ci_lo, ci_hi-or_vals],
fmt='none', color='#4575b4', lw=1.5, capsize=4)
ax1.scatter([pooled_or], [0], s=400, color='#d73027', marker='D', zorder=4,
label=f"Pooled OR={pooled_or:.2f}")
ax1.errorbar([pooled_or], [0], xerr=[[pooled_or-pooled_lo],[pooled_hi-pooled_or]],
fmt='none', color='#d73027', lw=2, capsize=6)
ax1.axvline(1, color='grey', linestyle='--', lw=1.5)
ax1.set_yticks(list(y_pos)+[0])
ax1.set_yticklabels(studies+['Pooled (FE)'], fontsize=9)
ax1.set_xlabel("Odds Ratio (95% CI)"); ax1.set_xscale('log')
ax1.set_title("Forest Plot: ACE Inhibitors in Heart Failure\nPooled OR=0.81, 95% CI 0.74–0.89, I²=18%")
ax1.legend(loc='lower right'); ax1.grid(axis='x', alpha=0.3)
# Funnel plot
se_vals = (np.log(ci_hi) - np.log(ci_lo)) / (2*1.96)
pooled_log_or = np.log(pooled_or)
se_range = np.linspace(0, max(se_vals)*1.2, 100)
ax2.scatter(or_vals, se_vals, color='#4575b4', s=80, zorder=3)
for i, s in enumerate(studies):
    ax2.annotate(s.split()[0], (or_vals[i], se_vals[i]), textcoords="offset points",
                 xytext=(5,-3), fontsize=7)
ax2.axvline(pooled_or, color='grey', linestyle='--', lw=1.5)
ax2.fill_betweenx(se_range,
np.exp(pooled_log_or - 1.96*se_range),
np.exp(pooled_log_or + 1.96*se_range),
color='lightblue', alpha=0.3, label='95% CI funnel')
ax2.invert_yaxis()
ax2.set_xlabel("Odds Ratio"); ax2.set_ylabel("Standard Error")
ax2.set_xscale('log'); ax2.set_title("Funnel Plot: Publication Bias Assessment\nSymmetry suggests low publication bias")
ax2.legend(); ax2.grid(alpha=0.3)
plt.suptitle("Meta-Analysis: ACE Inhibitors and CV Mortality in Heart Failure", fontsize=12)
plt.tight_layout(); plt.savefig("meta_analysis.png", dpi=150, bbox_inches='tight')
plt.show()
18. Reporting Standards and Checklists
18.1 General Reporting Principles
- Report the test used, test statistic, degrees of freedom, and exact p-value. Write t(48)=3.14, p=0.003 — not “p<0.05” or “NS”.
- Report effect sizes with confidence intervals for all primary outcomes. P-values alone are insufficient.
- Report sample sizes at every step. Account for all patients from enrolment to analysis.
- For continuous outcomes, report whether parametric or non-parametric tests were used and why.
- For non-parametric tests, report median (IQR), not mean (SD).
- Report model fit statistics — R²/adjusted R² for linear regression; Hosmer-Lemeshow p, AUC for logistic regression.
- Check and report assumption testing — Shapiro-Wilk, Levene’s, Mauchly’s sphericity, PH assumption (Cox).
- Distinguish pre-specified from exploratory analyses.
18.2 Reporting Checklists
| Study type | Checklist | URL |
|---|---|---|
| RCT | CONSORT | www.consort-statement.org |
| Observational cohort/case-control | STROBE | www.strobe-statement.org |
| Diagnostic accuracy | STARD | www.equator-network.org |
| Systematic review/meta-analysis | PRISMA | www.prisma-statement.org |
| Prognostic model | TRIPOD | www.tripod-statement.org |
| Tumour marker prognostic studies | REMARK | www.equator-network.org |
18.3 Specimen Results Sentences
Randomised Trial (t-test):
“The primary outcome, change in HbA1c from baseline to 6 months, was significantly greater in the intervention group compared to control (−0.82% vs −0.31%; mean difference −0.51%, 95% CI −0.78 to −0.24%; independent samples t-test: t(178)=−3.74, p<0.001).”
Survival analysis:
“Median progression-free survival was 11.2 months (95% CI 8.6–13.8) in the experimental arm and 7.4 months (95% CI 5.9–8.9) in the control arm. The experimental treatment was associated with a 38% reduction in the hazard of progression or death (HR 0.62, 95% CI 0.48–0.80; log-rank p<0.001).”
Logistic regression:
“On multivariable logistic regression, prior hospitalisation in the previous year (OR 2.84, 95% CI 1.63–4.95, p<0.001) and home oxygen use (OR 1.93, 95% CI 1.09–3.42, p=0.025) were independently associated with 30-day readmission after adjustment for age, FEV₁%, and eosinophil count. The model demonstrated acceptable discrimination (C-statistic 0.72) and good calibration (Hosmer-Lemeshow p=0.64).”
Appendix: Quick Reference Tables
A1. Choosing the Right Test — Complete Reference
| Research question | Outcome type | Predictor type | Groups/samples | Test |
|---|---|---|---|---|
| Is mean different from reference? | Continuous | None | 1 group | One-sample t / Wilcoxon |
| Is proportion different from reference? | Binary | None | 1 group | One-proportion z-test |
| Does categorical distribution match expected? | Categorical | None | 1 group | Chi-square GoF |
| Are two independent group means different? | Continuous | Binary | 2 indep. | Student’s t / Welch’s t / Mann-Whitney U |
| Are two paired measurements different? | Continuous | Time (2 pts) | 2 paired | Paired t / Wilcoxon signed-rank |
| Are two paired binary proportions different? | Binary | Time (2 pts) | 2 paired | McNemar’s test |
| Are 3+ independent group means different? | Continuous | Categorical | 3+ indep. | One-way ANOVA / Kruskal-Wallis |
| Are 3+ repeated measures different? | Continuous | Time (3+ pts) | 3+ paired | RM-ANOVA / Friedman |
| Is there a categorical association? | Categorical | Categorical | Indep. | Chi-square / Fisher’s exact |
| Is there a linear association? | Continuous | Continuous | — | Pearson r / Spearman ρ |
| Predict continuous outcome | Continuous | Mixed | — | Linear regression |
| Predict binary outcome | Binary | Mixed | — | Logistic regression |
| Predict time-to-event | Time-to-event | Mixed | — | Cox proportional hazards |
| Predict count outcome | Count | Mixed | — | Poisson regression |
| Compare survival curves | Time-to-event | Categorical | 2+ indep. | KM + log-rank |
| Multiple continuous outcomes | Continuous | Categorical | 2+ groups | MANOVA |
| Reduce many correlated variables | Continuous | None | — | PCA / Factor analysis |
| Repeated measures with missing data | Continuous | Mixed | 3+ time pts | Linear mixed effects model |
| Agreement between two raters (categorical) | Categorical | — | 2 raters | Cohen’s kappa |
| Agreement between two continuous methods | Continuous | — | 2 methods | Bland-Altman analysis |
| Diagnostic test evaluation | Binary | Continuous/ordinal | — | ROC, sensitivity/specificity |
A2. Non-Parametric Equivalents
| Parametric test | Non-parametric equivalent | Use when |
|---|---|---|
| One-sample t-test | One-sample Wilcoxon | Non-normal, n<30 |
| Independent t-test | Mann-Whitney U | Non-normal, ordinal, n<30 |
| Paired t-test | Wilcoxon signed-rank | Non-normal differences, ordinal |
| One-way ANOVA | Kruskal-Wallis | Non-normal groups, ordinal outcome |
| Repeated measures ANOVA | Friedman test | Non-normal, ordinal, repeated |
| Pearson correlation | Spearman correlation | Non-normal, ordinal, outliers |
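The parametric/non-parametric pairing is easy to demonstrate with scipy on simulated skewed data (exponential lengths of stay with hypothetical scales): the t-test compares means, while its rank-based equivalent makes no distributional assumption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Simulated skewed outcome (e.g. length of stay, days) in two groups
grp_a = rng.exponential(scale=5.0, size=40)
grp_b = rng.exponential(scale=8.0, size=40)

# Parametric: Welch's t-test on means (assumes approximate normality)
t, p_t = stats.ttest_ind(grp_a, grp_b, equal_var=False)

# Non-parametric equivalent: Mann-Whitney U on ranks
u, p_u = stats.mannwhitneyu(grp_a, grp_b, alternative="two-sided")
print(f"Welch t: p = {p_t:.4f};  Mann-Whitney U: p = {p_u:.4f}")
```

For heavily skewed outcomes the two tests can disagree; report median (IQR) whenever the rank-based test is chosen.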
A3. Effect Size Reference
| Measure | Formula | Interpretation |
|---|---|---|
| Cohen’s d | (μ₁−μ₂)/pooled SD | 0.2=small, 0.5=medium, 0.8=large |
| Odds ratio | ad/bc | 1=no effect; >1 increased odds; <1 decreased odds |
| Relative risk | [a/(a+b)] / [c/(c+d)] | 1=no effect; >1 increased risk |
| NNT | 1/ARR | Lower = more effective |
| Hazard ratio | e^β (Cox) | 1=no effect; same interpretation as RR |
| r (Pearson) | — | 0.1=small, 0.3=medium, 0.5=large |
| R² | SS_model/SS_total | % variance explained |
| η² (eta-squared) | SS_between/SS_total | 0.01=small, 0.06=medium, 0.14=large |
| AUC/C-statistic | Area under ROC | 0.5=chance; 0.7–0.8=acceptable; >0.8=excellent |
| Kappa (κ) | (Po−Pe)/(1−Pe) | 0.4–0.6=moderate; 0.6–0.8=substantial; >0.8=almost perfect |
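Several of these measures follow directly from a 2×2 table or group summaries. A minimal sketch using hypothetical counts and group statistics (all numbers below are invented for illustration):

```python
import numpy as np

# Hypothetical 2x2 table: rows = treated/control, cols = event/no event
a, b, c, d = 30, 70, 15, 85

odds_ratio = (a * d) / (b * c)            # ad/bc
risk_t, risk_c = a / (a + b), c / (c + d)
relative_risk = risk_t / risk_c           # [a/(a+b)] / [c/(c+d)]
arr = abs(risk_t - risk_c)                # absolute risk reduction
nnt = 1 / arr                             # number needed to treat

# Cohen's d from hypothetical group means/SDs (n=50 per group)
m1, m2, sd1, sd2, n1, n2 = 5.2, 4.1, 2.0, 1.8, 50, 50
sd_pooled = np.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
cohens_d = (m1 - m2) / sd_pooled

print(f"OR={odds_ratio:.2f}, RR={relative_risk:.2f}, "
      f"NNT={nnt:.1f}, d={cohens_d:.2f}")
```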
A4. P-Value Thresholds in Context
| Scenario | Recommended α | Rationale |
|---|---|---|
| Primary outcome, single test | 0.05 | Standard |
| Secondary outcomes (multiple) | 0.05/k (Bonferroni) | Multiple comparisons |
| Post-hoc pairwise comparisons | Tukey HSD or Bonferroni | Familywise error control |
| Exploratory analysis | 0.05, clearly labelled | Hypothesis-generating only |
| Genome-wide association study | 5×10⁻⁸ | Millions of comparisons |
| Equivalence / non-inferiority | 0.025 (one-sided) | Specific trial design |
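The Bonferroni row works as follows: with k secondary outcomes, each test is judged against α/k. A two-line sketch with hypothetical p-values:

```python
# Bonferroni correction for k secondary outcomes (hypothetical k and p-values)
k = 5
alpha_adj = 0.05 / k                              # per-test threshold = 0.01
p_values = [0.003, 0.012, 0.021, 0.040, 0.300]
significant = [p < alpha_adj for p in p_values]
print(alpha_adj, significant)                     # only the first survives correction
```

Bonferroni is conservative; Holm or Benjamini-Hochberg (e.g. `statsmodels.stats.multitest.multipletests`) retains more power while still controlling error rates.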
A5. Sample Size Formulae
| Design | Formula | Notes |
|---|---|---|
| Two independent means | n = 2(z_α/2+z_β)²σ²/δ² | σ=SD, δ=minimum detectable difference; n per group |
| Two proportions | n = (z_α/2+z_β)²[p₁(1-p₁)+p₂(1-p₂)]/(p₁-p₂)² | p₁, p₂=expected proportions; n per group |
| Paired design | n = (z_α/2+z_β)²σ_d²/δ² | σ_d=SD of differences |
| One proportion vs reference | n = z_α/2²p₀(1-p₀)/E² | E=acceptable margin of error |
z values: z_0.025=1.96 (α=0.05 two-tailed), z_0.2=0.84 (power=80%), z_0.1=1.28 (power=90%)
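The first two formulae above translate directly into code. A sketch using hypothetical design inputs (σ=10, δ=5 for the continuous outcome; p₁=0.30, p₂=0.15 for the binary one), with α=0.05 two-tailed and 80% power:

```python
import math
from scipy.stats import norm

alpha, power = 0.05, 0.80
z_a = norm.ppf(1 - alpha / 2)   # ≈ 1.96
z_b = norm.ppf(power)           # ≈ 0.84

# Two independent means: n = 2(z_a + z_b)^2 * sigma^2 / delta^2  (per group)
sigma, delta = 10.0, 5.0        # hypothetical SD and minimum detectable difference
n_means = 2 * (z_a + z_b)**2 * sigma**2 / delta**2

# Two proportions: n = (z_a + z_b)^2 [p1(1-p1) + p2(1-p2)] / (p1-p2)^2  (per group)
p1, p2 = 0.30, 0.15             # hypothetical expected proportions
n_props = (z_a + z_b)**2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2)**2

print(math.ceil(n_means), math.ceil(n_props))   # round up to whole patients
```

Always round n up, and inflate for the anticipated dropout rate (e.g. divide by 0.9 for 10% attrition).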
Key References and Further Reading
- Altman DG. Practical Statistics for Medical Research. Chapman & Hall, 1991.
- Bland JM, Altman DG. Statistical methods for assessing agreement. Lancet. 1986;327:307-310.
- Cox DR. Regression models and life tables. J Royal Stat Soc B. 1972;34:187-220.
- DeLong ER et al. Comparing areas under two correlated ROC curves. Biometrics. 1988;44:837-845.
- Harrell FE. Regression Modeling Strategies. Springer, 2015.
- Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. J Am Stat Assoc. 1958;53:457-481.
- Rothman KJ. No adjustments needed for multiple comparisons. Epidemiology. 1990;1:43-46.
- Steyerberg EW. Clinical Prediction Models. Springer, 2009.
- Vittinghoff E et al. Regression Methods in Biostatistics. Springer, 2012.
- Zhang J, Yu KF. What’s the relative risk? JAMA. 1998;280:1690-1691.
Enriched Edition v2.0 | R + Python code with ggplot2 / matplotlib for every test | Clinical research reference
All code is self-contained and reproducible. Simulated data match worked examples. Plots are publication-ready with minor cosmetic adjustments.