InceptionStatistics: Guide to Statistics with Python and R code

A Complete Guide to Statistical Tests and Methods for Clinical Researchers

Enriched Edition — With R & Python Code and Plot Generation

Audience: Researchers and clinicians applying statistical methods to medical and health data
Purpose: A thorough reference covering test selection, assumptions, worked examples, interpretation, and reporting — with side-by-side R and Python code and reproducible visualisations for every method
Code environment: R ≥ 4.2 with tidyverse/ggplot2; Python ≥ 3.9 with pandas/scipy/matplotlib/seaborn/statsmodels
How to use this guide: Each section follows: What it is → When to use it → Assumptions → Worked example → R code + plot → Python code + plot → Interpretation → Reporting → Common mistakes

Required Packages

R — install once:

install.packages(c(
  "tidyverse", "ggpubr", "rstatix", "broom", "coin", "survival",
  "survminer", "lme4", "lmerTest", "pROC", "irr", "meta", "vcd",
  "ggfortify", "factoextra", "psych", "MatchIt", "WeightIt"
))

Python — install once:

pip install pandas numpy scipy matplotlib seaborn statsmodels \
    lifelines scikit-learn pingouin scikit-posthocs

Table of Contents

  1. Foundations: The Statistical Reasoning Framework
  2. Choosing the Right Test
  3. Descriptive Statistics and Data Exploration
  4. One-Variable Tests
  5. Comparing Two Groups
  6. Comparing Three or More Groups
  7. Correlation and Association
  8. Regression Analysis
  9. Effect Sizes and Association Measures
  10. Survival and Time-to-Event Analysis
  11. Multivariable Modelling Strategy
  12. Multivariate Methods
  13. Mixed Models and Longitudinal Data
  14. Diagnostic Test Evaluation
  15. Agreement and Reliability
  16. Bayesian Methods
  17. Meta-Analysis and Systematic Review
  18. Reporting Standards and Checklists
  19. Appendix: Quick Reference Tables

1. Foundations: The Statistical Reasoning Framework

1.1 What Is a Statistical Test?

A statistical test is a formal procedure for deciding whether observed data are consistent with a stated hypothesis:

  1. Null hypothesis (H₀): The assumption of no effect, no difference, or no association
  2. Alternative hypothesis (H₁): The effect or difference you are trying to detect
  3. Test statistic: A number calculated from data summarising evidence against H₀
  4. P-value: Probability of observing a test statistic at least this extreme, if H₀ were true

What a p-value is NOT: Not the probability that H₀ is true, and not the probability the result is due to chance.
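A quick simulation (illustrative only; the data are synthetic) makes this concrete: when H₀ is true, p-values are uniformly distributed, so roughly 5% of null comparisons cross the 0.05 threshold purely by chance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n = 2000, 30

# Simulate two groups drawn from the SAME distribution (H0 is true by construction)
pvals = np.array([
    stats.ttest_ind(rng.normal(0, 1, n), rng.normal(0, 1, n)).pvalue
    for _ in range(n_sims)
])

# Under H0, p-values are ~uniform: about 5% fall below 0.05 by chance alone
false_positive_rate = np.mean(pvals < 0.05)
print(f"False-positive rate at alpha=0.05: {false_positive_rate:.3f}")
```

This is exactly the Type I error rate α described in the next subsection.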

1.2 Type I and Type II Errors

                             | H₀ is actually TRUE                | H₀ is actually FALSE
Test says: reject H₀         | Type I error (α) — false positive  | Correct — power (1−β)
Test says: fail to reject H₀ | Correct (1−α)                      | Type II error (β) — false negative

1.3 One-Tailed vs Two-Tailed Tests

  • Two-tailed: H₁: μ₁ ≠ μ₂ — tests for difference in either direction. Default for most clinical research.
  • One-tailed: H₁: μ₁ > μ₂ — use only when pre-specified and you would not act on a result in the other direction.

1.4 Confidence Intervals vs P-Values

  • 95% CI represents the range of values consistent with your data at the 5% significance level
  • If CI for a difference excludes zero (or for a ratio excludes 1.0), the result is statistically significant at α=0.05
  • CI communicates both statistical significance AND magnitude and precision — report both
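The duality between the two can be verified directly. A minimal sketch with synthetic one-sample data: the 95% CI excludes the reference value exactly when the two-sided p-value falls below 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
inr = rng.normal(2.8, 0.6, 25)  # synthetic sample
mu0 = 2.5                       # reference value

t_stat, p = stats.ttest_1samp(inr, mu0)
lo, hi = stats.t.interval(0.95, df=len(inr) - 1,
                          loc=inr.mean(), scale=stats.sem(inr))

# The 95% CI excludes mu0 exactly when the two-sided p-value is below 0.05
ci_excludes = (mu0 < lo) or (mu0 > hi)
print(f"p={p:.4f}, 95% CI: {lo:.2f}-{hi:.2f}, CI excludes {mu0}: {ci_excludes}")
```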

1.5 Sample Size and Power Calculations

The four inputs: α, power (1−β), effect size (MCID), variability (SD). Specify three, solve for the fourth.

n per group = 2 × (z_α/2 + z_β)² × σ² / δ²
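The formula can be implemented in a few lines. This is the normal-approximation estimate; exact t-based routines (such as R's power.t.test) give a slightly larger n.

```python
import math
from scipy import stats

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group, two-sample comparison of means."""
    z_a = stats.norm.ppf(1 - alpha / 2)  # z_alpha/2
    z_b = stats.norm.ppf(power)          # z_beta
    return 2 * (z_a + z_b) ** 2 * sd ** 2 / delta ** 2

# MCID 10 mmHg, SD 20 mmHg -> about 63 per group
n = n_per_group(delta=10, sd=20)
print(math.ceil(n))  # 63
```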

R — Power Curves

library(tidyverse)

sd_val <- 20; alpha <- 0.05

expand_grid(delta = c(5, 8, 10, 15), n = seq(10, 150, by = 5)) |>
  mutate(
    power = pmap_dbl(list(delta, n), function(d, n) {
      power.t.test(n = n, delta = d, sd = sd_val,
                   sig.level = alpha, type = "two.sample")$power
    }),
    label = paste0("MCID = ", delta, " mmHg")
  ) |>
  ggplot(aes(x = n, y = power, colour = label)) +
  geom_line(linewidth = 1.1) +
  geom_hline(yintercept = 0.80, linetype = "dashed", colour = "grey40") +
  annotate("text", x = 12, y = 0.82, label = "80% power", hjust = 0, size = 3.5) +
  scale_y_continuous(labels = scales::percent_format(), limits = c(0, 1)) +
  scale_colour_brewer(palette = "Set1") +
  labs(title = "Power Curves: Two-Sample t-Test (SD = 20 mmHg)",
       x = "Sample size per group", y = "Statistical power", colour = NULL) +
  theme_bw(base_size = 13) +
  theme(legend.position = "bottom")

Python — Power Curves

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

sd_val = 20; alpha = 0.05
n_vals = np.arange(10, 151, 5)
mcids = [5, 8, 10, 15]
colours = ['#e41a1c', '#377eb8', '#4daf4a', '#984ea3']

fig, ax = plt.subplots(figsize=(9, 5))
for mcid, col in zip(mcids, colours):
    se = sd_val * np.sqrt(2 / n_vals)
    ncp = mcid / se
    z_crit = stats.norm.ppf(1 - alpha / 2)
    power = 1 - stats.norm.cdf(z_crit - ncp) + stats.norm.cdf(-z_crit - ncp)
    ax.plot(n_vals, power, label=f"MCID = {mcid} mmHg", color=col, linewidth=2)

ax.axhline(0.80, linestyle='--', color='grey', linewidth=1.2)
ax.text(12, 0.82, "80% power", fontsize=10, color='grey')
ax.set_xlabel("Sample size per group"); ax.set_ylabel("Statistical power")
ax.set_title("Power Curves: Two-Sample t-Test (SD = 20 mmHg)")
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda y, _: f'{y:.0%}'))
ax.legend(loc='lower right'); ax.grid(alpha=0.3)
plt.tight_layout(); plt.savefig("power_curves.png", dpi=150, bbox_inches='tight')

2. Choosing the Right Test: A Decision Framework

The Five Key Questions

Q1. What is your research question? (describe / test / compare / examine / predict)
Q2. How many variables? (1 / 2 / 2+ with one outcome / 2+ outcomes)
Q3. What type is each variable? (continuous / ordinal / nominal / binary / time-to-event)
Q4. Are samples independent or paired/related?
Q5. Are parametric assumptions met? (normality, sample size)

Decision Table

Outcome             | Predictor/groups        | Sample      | Test
Continuous          | None (vs known value)   | –           | One-sample t / Wilcoxon
Continuous          | 2 groups                | Independent | Student’s t / Welch’s t / Mann-Whitney U
Continuous          | 2 groups                | Paired      | Paired t / Wilcoxon signed-rank
Continuous          | 3+ groups               | Independent | One-way ANOVA / Kruskal-Wallis
Continuous          | 3+ groups               | Repeated    | RM-ANOVA / Friedman
Continuous          | Continuous predictor(s) | –           | Linear regression
Binary              | 2+ groups               | Independent | Chi-square / Fisher’s exact
Binary              | 2 groups                | Paired      | McNemar’s test
Binary              | Mixed predictors        | –           | Logistic regression
Time-to-event       | 2+ groups               | Independent | Kaplan-Meier + log-rank
Time-to-event       | Mixed predictors        | –           | Cox regression
Count               | Groups                  | –           | Poisson / negative binomial regression
Ordinal             | 2+ groups               | Independent | Mann-Whitney / Kruskal-Wallis
Multiple continuous | Groups                  | –           | MANOVA

3. Descriptive Statistics and Data Exploration

3.1 Measures of Central Tendency and Spread

Data type          | Measure          | Spread
Continuous, normal | Mean             | SD
Continuous, skewed | Median           | IQR
Ordinal            | Median           | IQR
Nominal            | Frequency (n, %) | –

Never report SEM as a measure of population variability — it describes precision of the mean, not spread of individual values.
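A small synthetic illustration of the distinction: as n grows, the SD settles near the true population value while the SEM keeps shrinking.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
for n in (25, 100, 2500):
    x = rng.normal(62.4, 11.2, n)
    # SD describes the spread of individual values; it stabilises near 11.2
    # SEM = SD / sqrt(n) describes the precision of the mean; it shrinks with n
    print(f"n={n:>5}: SD={x.std(ddof=1):5.2f}  SEM={stats.sem(x):5.2f}")
```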

3.2 Assessing Normality

Visual (preferred): Histogram, Q-Q plot, box plot.

Formal tests:

  • Shapiro-Wilk: best for n < 50. H₀: normally distributed. p > 0.05 is consistent with normality.
  • Kolmogorov-Smirnov: better for larger n.

Practical rule: With n > 30, parametric tests are generally robust even with moderate skew (central limit theorem).

3.3 Worked Example — Baseline Table

Variable       | Type                | Summary
Age (years)    | Continuous, ~normal | Mean ± SD: 62.4 ± 11.2
Sex (% male)   | Binary              | 68 (56.7%)
BMI (kg/m²)    | Right-skewed        | Median (IQR): 27.8 (24.6–31.9)
LDL-C (mmol/L) | Right-skewed        | Median (IQR): 3.4 (2.8–4.1)
NYHA class     | Ordinal             | Class I: 22 (18.3%), II: 58 (48.3%), III: 32 (26.7%), IV: 8 (6.7%)

3.4 Code: Normality Assessment — 4-Panel Diagnostic

R

library(tidyverse)
set.seed(42)
n <- 120
normal <- rnorm(n, 62.4, 11.2)
skewed <- rgamma(n, shape = 2, rate = 0.6)

par(mfrow = c(2, 2), mar = c(4, 4, 3, 1))
hist(normal, breaks = 15, col = "#4575b4", border = "white",
     main = "Age — approx. normal", xlab = "Age (years)", freq = FALSE)
curve(dnorm(x, mean(normal), sd(normal)), add = TRUE, col = "firebrick", lwd = 2)
qqnorm(normal, main = "Q-Q: Age", pch = 16, col = "#4575b4")
qqline(normal, col = "firebrick", lwd = 2)
hist(skewed, breaks = 15, col = "#d73027", border = "white",
     main = "LDL-C — right-skewed", xlab = "LDL-C (mmol/L)", freq = FALSE)
lines(density(skewed), col = "navy", lwd = 2)
qqnorm(skewed, main = "Q-Q: LDL-C", pch = 16, col = "#d73027")
qqline(skewed, col = "navy", lwd = 2)

cat("Shapiro-Wilk (age): p =", round(shapiro.test(normal)$p.value, 4), "\n")
cat("Shapiro-Wilk (LDL): p =", round(shapiro.test(skewed)$p.value, 4), "\n")

Python

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import gaussian_kde

rng = np.random.default_rng(42)
n = 120
normal = rng.normal(62.4, 11.2, n)
skewed = rng.gamma(shape=2, scale=1/0.6, size=n)

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Panel 1: Histogram, approx. normal
axes[0, 0].hist(normal, bins=15, color='#4575b4', edgecolor='white', density=True)
x_r = np.linspace(normal.min(), normal.max(), 200)
axes[0, 0].plot(x_r, stats.norm.pdf(x_r, normal.mean(), normal.std()), 'r-', lw=2)
axes[0, 0].set_title("Age — approx. normal"); axes[0, 0].set_xlabel("Age (years)")

# Panel 2: Q-Q plot, approx. normal
(osm, osr), (slope, intercept, _) = stats.probplot(normal)
axes[0, 1].scatter(osm, osr, color='#4575b4', s=20)
axes[0, 1].plot(osm, slope * np.array(osm) + intercept, 'r-', lw=2)
axes[0, 1].set_title("Q-Q Plot: Age"); axes[0, 1].set_xlabel("Theoretical quantiles")

# Panel 3: Histogram, right-skewed
axes[1, 0].hist(skewed, bins=15, color='#d73027', edgecolor='white', density=True)
kde_x = np.linspace(skewed.min(), skewed.max(), 200)
axes[1, 0].plot(kde_x, gaussian_kde(skewed)(kde_x), 'navy', lw=2)
axes[1, 0].set_title("LDL-C — right-skewed"); axes[1, 0].set_xlabel("LDL-C (mmol/L)")

# Panel 4: Q-Q plot, right-skewed
(osm2, osr2), (s2, i2, _) = stats.probplot(skewed)
axes[1, 1].scatter(osm2, osr2, color='#d73027', s=20)
axes[1, 1].plot(osm2, s2 * np.array(osm2) + i2, 'navy', lw=2)
axes[1, 1].set_title("Q-Q Plot: LDL-C"); axes[1, 1].set_xlabel("Theoretical quantiles")

for ax in axes.flat:
    ax.grid(alpha=0.3)
plt.suptitle("Normality Assessment: 4-Panel Diagnostic", fontsize=13)
plt.tight_layout(); plt.savefig("normality_4panel.png", dpi=150, bbox_inches='tight')
plt.show()

for label, arr in [("age", normal), ("LDL", skewed)]:
    w, p = stats.shapiro(arr)
    print(f"Shapiro-Wilk ({label}): W={w:.3f}, p={p:.4f}")

4. One-Variable Tests

4.1 One-Sample Student’s t-Test

What it does: Tests whether the mean of a single sample differs from a known population value (μ₀).

When to use: One continuous variable, ~normal (or n ≥ 30), compare to reference value.

Test statistic:

t = (x̄ − μ₀) / (s / √n) df = n − 1

Worked Example: INR monitoring, n=25, mean=2.8, SD=0.6, target=2.5. t(24)=2.50, p=0.020. 95% CI: 2.55–3.05. Unit is over-anticoagulating on average.

R

library(tidyverse)
set.seed(42)
inr <- rnorm(25, mean = 2.8, sd = 0.6)
mu0 <- 2.5
t_res <- t.test(inr, mu = mu0)
print(t_res)

tibble(inr = inr) |>
  ggplot(aes(x = 1, y = inr)) +
  geom_jitter(width = 0.08, size = 2.5, colour = "#377eb8", alpha = 0.7) +
  stat_summary(fun.data = mean_cl_normal, geom = "pointrange",
               colour = "firebrick", size = 0.8, linewidth = 1.2) +
  geom_hline(yintercept = mu0, linetype = "dashed", colour = "grey30", linewidth = 1) +
  annotate("text", x = 1.35, y = mu0 + 0.05,
           label = paste0("Reference = ", mu0), hjust = 0, size = 3.5) +
  annotate("text", x = 1.35, y = mean(inr) + 0.05,
           label = sprintf("Mean = %.2f\n95%% CI: %.2f–%.2f\np = %.3f",
                           mean(inr), t_res$conf.int[1], t_res$conf.int[2], t_res$p.value),
           hjust = 0, size = 3.2, colour = "firebrick") +
  scale_x_continuous(limits = c(0.7, 1.9)) +
  labs(title = "One-Sample t-Test: INR vs Target of 2.5",
       subtitle = "Dots = patients; red = mean ± 95% CI; dashed = reference target",
       y = "INR", x = NULL) +
  theme_bw(base_size = 13) +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())

Python

import numpy as np, matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(42)
inr = rng.normal(2.8, 0.6, 25); mu0 = 2.5
t_stat, p_val = stats.ttest_1samp(inr, mu0)
ci = stats.t.interval(0.95, df=len(inr) - 1, loc=np.mean(inr), scale=stats.sem(inr))
print(f"t={t_stat:.3f}, p={p_val:.4f}, 95% CI: {ci[0]:.2f}–{ci[1]:.2f}")

fig, ax = plt.subplots(figsize=(5, 6))
jitter = rng.uniform(-0.05, 0.05, len(inr))
ax.scatter(jitter, inr, color='#377eb8', alpha=0.65, s=50, zorder=3)
ax.errorbar(0, np.mean(inr),
            yerr=[[np.mean(inr) - ci[0]], [ci[1] - np.mean(inr)]],
            fmt='o', color='firebrick', markersize=8, linewidth=2.5, capsize=6, zorder=4)
ax.axhline(mu0, linestyle='--', color='grey', linewidth=1.5)
ax.set_xlim(-0.3, 0.55); ax.set_xticks([])
ax.set_ylabel("INR", fontsize=12)
ax.set_title(f"One-Sample t-Test: INR vs Target 2.5\nt={t_stat:.2f}, p={p_val:.3f}", fontsize=12)
ax.grid(axis='y', alpha=0.3); plt.tight_layout()
plt.savefig("onesample_t.png", dpi=150, bbox_inches='tight'); plt.show()

4.2 One-Sample Wilcoxon Signed-Rank Test

What it does: Non-parametric equivalent of one-sample t-test. Tests whether the median differs from a hypothesised value.

When to use: Skewed, ordinal, or n < 30 with non-normal distribution.

Worked Example: Pain clinic, n=18, median NRS=7, reference=5. W=142, p=0.008.

R

library(tidyverse)
pain <- c(7,8,6,9,5,7,8,7,6,9,7,8,5,7,9,6,8,7)
wilcox.test(pain, mu = 5, exact = FALSE)

tibble(pain = pain) |>
  ggplot(aes(x = 1, y = pain)) +
  geom_violin(fill = "#a6cee3", alpha = 0.6, width = 0.6) +
  geom_jitter(width = 0.06, size = 2.5, colour = "#1f78b4", alpha = 0.8) +
  stat_summary(fun = median, geom = "crossbar", colour = "firebrick",
               width = 0.3, linewidth = 0.9) +
  geom_hline(yintercept = 5, linetype = "dashed", colour = "grey30") +
  annotate("text", x = 1.32, y = 5.15, label = "Reference median = 5",
           hjust = 0, size = 3.5) +
  scale_y_continuous(breaks = 0:10) +
  labs(title = "One-Sample Wilcoxon: NRS Pain vs Reference",
       subtitle = "Violin = distribution; red bar = sample median; dashed = reference",
       y = "NRS Pain Score (0–10)", x = NULL) +
  theme_bw(base_size = 13) +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())

Python

import numpy as np, matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import gaussian_kde

pain = np.array([7,8,6,9,5,7,8,7,6,9,7,8,5,7,9,6,8,7])
res = stats.wilcoxon(pain - 5)
print(f"W={res.statistic:.0f}, p={res.pvalue:.4f}")

rng = np.random.default_rng(42)
fig, ax = plt.subplots(figsize=(5, 6))
y_range = np.linspace(3, 11, 200)
kde = gaussian_kde(pain, bw_method=0.4)
width = kde(y_range) * 2.5
ax.fill_betweenx(y_range, -width, width, color='#a6cee3', alpha=0.55)
ax.scatter(rng.uniform(-0.06, 0.06, len(pain)), pain,
           color='#1f78b4', s=50, alpha=0.8, zorder=3)
ax.hlines(np.median(pain), -0.28, 0.28, color='firebrick', lw=2.5, zorder=4,
          label=f"Median={np.median(pain):.0f}")
ax.axhline(5, linestyle='--', color='grey', lw=1.5, label="Reference=5")
ax.set_xlim(-0.5, 0.7); ax.set_xticks([])
ax.set_ylabel("NRS Pain Score (0–10)", fontsize=12)
ax.set_title(f"Wilcoxon Signed-Rank vs Reference\nW={res.statistic:.0f}, p={res.pvalue:.3f}", fontsize=12)
ax.legend(loc='upper right'); ax.grid(axis='y', alpha=0.3)
plt.tight_layout(); plt.savefig("wilcoxon_onesample.png", dpi=150, bbox_inches='tight'); plt.show()

4.3 One-Proportion Test

What it does: Tests whether an observed proportion differs from a known reference proportion.

Worked Example: Readmission rate 6.0% vs national 4.0%; z=1.61, p=0.107 (NS). Study may be underpowered.

R

# 15 readmissions in 250 procedures; reference = 4%
prop.test(x = 15, n = 250, p = 0.04, correct = FALSE)

Python

from statsmodels.stats.proportion import proportions_ztest, proportion_confint

# prop_var=0.04 uses the null proportion for the SE, matching R's prop.test (z=1.61)
z_stat, p_val = proportions_ztest(count=15, nobs=250, value=0.04, prop_var=0.04)
ci = proportion_confint(15, 250, alpha=0.05, method='normal')
print(f"z={z_stat:.3f}, p={p_val:.4f}, 95% CI: {ci[0]:.3f}–{ci[1]:.3f}")

4.4 Chi-Square Goodness-of-Fit Test

What it does: Tests whether observed distribution of a categorical variable matches an expected distribution.

Worked Example: ABO blood types, n=200 cardiac patients vs UK population. χ²(3)=3.42, p=0.331.

R

library(tidyverse)
obs <- c(A = 96, B = 16, AB = 6, O = 82)
expected <- c(A = 0.42, B = 0.10, AB = 0.04, O = 0.44)
chisq.test(obs, p = expected)

tibble(Group = names(obs),
       Observed = as.numeric(obs),
       Expected = expected * sum(obs)) |>
  pivot_longer(-Group, names_to = "type", values_to = "count") |>
  ggplot(aes(x = Group, y = count, fill = type)) +
  geom_col(position = "dodge", width = 0.6) +
  scale_fill_manual(values = c("Observed" = "#4575b4", "Expected" = "#d73027")) +
  labs(title = "Chi-Square GoF: ABO Blood Types",
       subtitle = "Cardiac patients (n=200) vs UK population; χ²(3)=3.42, p=0.331",
       x = "Blood type", y = "Count", fill = NULL) +
  theme_bw(base_size = 13) +
  theme(legend.position = "top")

Python

import numpy as np, matplotlib.pyplot as plt
from scipy.stats import chisquare

obs = np.array([96, 16, 6, 82])
expected = np.array([0.42, 0.10, 0.04, 0.44]) * obs.sum()
groups = ['A', 'B', 'AB', 'O']
chi2, p = chisquare(obs, f_exp=expected)
print(f"chi2={chi2:.3f}, p={p:.4f}")

x = np.arange(len(groups))
fig, ax = plt.subplots(figsize=(7, 5))
ax.bar(x - 0.2, obs, width=0.35, label='Observed', color='#4575b4', alpha=0.9)
ax.bar(x + 0.2, expected, width=0.35, label='Expected', color='#d73027', alpha=0.9)
ax.set_xticks(x); ax.set_xticklabels(groups, fontsize=12)
ax.set_xlabel("Blood type"); ax.set_ylabel("Count")
ax.set_title(f"Chi-Square GoF: ABO Blood Types\nχ²(3)={chi2:.2f}, p={p:.3f}")
ax.legend(); ax.grid(axis='y', alpha=0.3)
plt.tight_layout(); plt.savefig("chisq_gof.png", dpi=150, bbox_inches='tight'); plt.show()

5. Comparing Two Groups

5.1 Independent Samples t-Test

What it does: Compares means of two independent groups.

Assumptions: Continuous outcome, two independent groups, ~normal or n ≥ 30/group, equal variances (check Levene’s).

Worked Example: ACE inhibitor RCT (n=45/group): 12.4±8.2 vs 5.8±7.6 mmHg reduction. Mean diff 6.6 mmHg (95% CI 3.3–9.9); t(88)=3.96, p<0.001.

R

library(tidyverse); library(rstatix); library(ggpubr)  # ggpubr provides stat_compare_means()
set.seed(42)
df <- tibble(group = rep(c("ACE inhibitor", "Placebo"), each = 45),
             sbp = c(rnorm(45, 12.4, 8.2), rnorm(45, 5.8, 7.6)))
levene_test(df, sbp ~ group)
t_test(df, sbp ~ group, var.equal = TRUE, detailed = TRUE)

ggplot(df, aes(x = group, y = sbp, fill = group)) +
  geom_boxplot(alpha = 0.6, outlier.shape = NA, width = 0.5) +
  geom_jitter(aes(colour = group), width = 0.12, size = 1.8, alpha = 0.6) +
  stat_compare_means(method = "t.test", label = "p.signif",
                     comparisons = list(c("ACE inhibitor", "Placebo")), size = 5) +
  scale_fill_manual(values = c("#4575b4", "#d73027")) +
  scale_colour_manual(values = c("#4575b4", "#d73027")) +
  labs(title = "Independent t-Test: SBP Reduction by Treatment",
       subtitle = "Box = median and IQR; individual patients as dots",
       x = NULL, y = "SBP reduction (mmHg)") +
  theme_bw(base_size = 13) +
  theme(legend.position = "none")

Python

import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
from scipy import stats

rng = np.random.default_rng(42)
ace = rng.normal(12.4, 8.2, 45); plac = rng.normal(5.8, 7.6, 45)
lev_stat, lev_p = stats.levene(ace, plac)
t_stat, p_val = stats.ttest_ind(ace, plac, equal_var=True)
print(f"Levene: F={lev_stat:.3f}, p={lev_p:.4f}")
print(f"t={t_stat:.3f}, p={p_val:.4f}")

df = pd.DataFrame({'group': ['ACE inhibitor'] * 45 + ['Placebo'] * 45,
                   'sbp': np.concatenate([ace, plac])})
fig, ax = plt.subplots(figsize=(6, 6))
palette = {'ACE inhibitor': '#4575b4', 'Placebo': '#d73027'}
sns.boxplot(data=df, x='group', y='sbp', palette=palette, width=0.45, fliersize=0, ax=ax)
sns.stripplot(data=df, x='group', y='sbp', palette=palette, jitter=0.1, size=4, alpha=0.5, ax=ax)
y_max = df['sbp'].max() + 3
ax.plot([0, 0, 1, 1], [y_max, y_max + 1, y_max + 1, y_max], lw=1.2, color='black')
ax.text(0.5, y_max + 1.5, f"p={p_val:.4f}", ha='center', fontsize=11)
ax.set_ylabel("SBP Reduction (mmHg)", fontsize=12); ax.set_xlabel("")
ax.set_title(f"Independent t-Test: SBP Reduction\nMean diff={np.mean(ace) - np.mean(plac):.1f} mmHg, "
             f"t={t_stat:.2f}, p={p_val:.4f}", fontsize=11)
ax.grid(axis='y', alpha=0.3); plt.tight_layout()
plt.savefig("indep_ttest.png", dpi=150, bbox_inches='tight'); plt.show()

5.2 Welch’s t-Test (Unequal Variances)

What it does: t-test with Welch-Satterthwaite df correction — does not assume equal variances.

When to use: Levene’s test p ≤ 0.05, or as a safe default.

Worked Example: CRP: bacterial 118.4±94.2 vs viral 22.6±18.7 mg/L. Levene’s p=0.001 → Welch’s. t=5.46, p<0.001.

R

library(tidyverse)
set.seed(42)
df_crp <- tibble(group = rep(c("Bacterial", "Viral"), c(30, 28)),
                 crp = c(abs(rnorm(30, 118.4, 94.2)), abs(rnorm(28, 22.6, 18.7))))
t.test(crp ~ group, data = df_crp, var.equal = FALSE)

ggplot(df_crp, aes(x = group, y = crp + 1, fill = group)) +
  geom_boxplot(alpha = 0.65, width = 0.5) +
  geom_jitter(aes(colour = group), width = 0.12, size = 1.8, alpha = 0.6) +
  scale_y_log10(labels = scales::label_comma()) +
  scale_fill_manual(values = c("#e66101", "#5e3c99")) +
  scale_colour_manual(values = c("#e66101", "#5e3c99")) +
  labs(title = "Welch's t-Test: CRP by Infection Type",
       subtitle = "Log scale; Levene's p=0.001 → unequal variances → Welch's appropriate",
       x = NULL, y = "CRP+1 (mg/L, log scale)") +
  theme_bw(base_size = 13) +
  theme(legend.position = "none")

Python

import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
from scipy import stats

rng = np.random.default_rng(42)
bacterial = np.abs(rng.normal(118.4, 94.2, 30))
viral = np.abs(rng.normal(22.6, 18.7, 28))
lev_stat, lev_p = stats.levene(bacterial, viral)
t_stat, p_val = stats.ttest_ind(bacterial, viral, equal_var=False)
print(f"Levene p={lev_p:.4f}; Welch's t={t_stat:.3f}, p={p_val:.4f}")

df = pd.DataFrame({'Group': ['Bacterial'] * 30 + ['Viral'] * 28,
                   'CRP': np.concatenate([bacterial, viral])})
fig, ax = plt.subplots(figsize=(6, 6))
palette = {'Bacterial': '#e66101', 'Viral': '#5e3c99'}
sns.boxplot(data=df, x='Group', y='CRP', palette=palette, width=0.45, ax=ax)
sns.stripplot(data=df, x='Group', y='CRP', palette=palette, jitter=0.1, size=4, alpha=0.5, ax=ax)
ax.set_yscale('log'); ax.set_ylabel("CRP (mg/L, log scale)", fontsize=12)
ax.set_title(f"Welch's t-Test: CRP by Infection\nLevene p={lev_p:.3f}, t={t_stat:.2f}, p<0.001")
ax.grid(axis='y', alpha=0.3); plt.tight_layout()
plt.savefig("welch_ttest.png", dpi=150, bbox_inches='tight'); plt.show()

5.3 Mann-Whitney U Test

What it does: Non-parametric test comparing two independent groups. Tests whether one group tends to have higher values.

Note: Does not strictly test equality of medians — tests P(X>Y)=0.5 (stochastic superiority).

Worked Example: QoL scores: standard care median=58 (IQR 42–70) vs integrated programme median=72 (IQR 62–82). U=161.5, p=0.014.
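Using the worked example's scores, the quantity the test actually addresses can be estimated directly: dividing U by n₁ × n₂ gives the common-language effect size, an estimate of P(X > Y). (A sketch; this effect-size calculation is ours, not part of the original worked example.)

```python
import numpy as np
from scipy import stats

std_care = [42,55,58,60,48,70,62,50,75,38,65,72,55,43,68,58,60,70,45,57,80,30]
int_prog = [62,72,80,65,75,82,70,68,75,73,65,78,85,62,70,80,72,68,75,82,66,78,62,74]

u, p = stats.mannwhitneyu(std_care, int_prog, alternative='two-sided')
# U counts pairs where a standard-care value exceeds an integrated-programme
# value (ties count half); dividing by n1*n2 estimates P(X > Y)
p_x_gt_y = u / (len(std_care) * len(int_prog))
print(f"U={u:.1f}, estimated P(standard > integrated) = {p_x_gt_y:.2f}")
```

Values well below 0.5 (as here) indicate that standard-care scores tend to be lower.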

R

library(tidyverse)
std_care <- c(42,55,58,60,48,70,62,50,75,38,65,72,55,43,68,58,60,70,45,57,80,30)
int_prog <- c(62,72,80,65,75,82,70,68,75,73,65,78,85,62,70,80,72,68,75,82,66,78,62,74)
df_qol <- tibble(group = rep(c("Standard care", "Integrated programme"), c(22, 24)),
                 qol = c(std_care, int_prog))
wilcox.test(qol ~ group, data = df_qol)

ggplot(df_qol, aes(x = group, y = qol, fill = group)) +
  geom_violin(alpha = 0.5, trim = FALSE) +
  geom_boxplot(width = 0.15, outlier.shape = NA, alpha = 0.8) +
  geom_jitter(aes(colour = group), width = 0.06, size = 2, alpha = 0.6) +
  scale_fill_manual(values = c("#1b7837", "#762a83")) +
  scale_colour_manual(values = c("#1b7837", "#762a83")) +
  labs(title = "Mann-Whitney U: Quality of Life by Care Group",
       subtitle = "U=161.5, p=0.014; violin + box + jitter",
       x = NULL, y = "EORTC QoL score (0–100)") +
  theme_bw(base_size = 13) +
  theme(legend.position = "none")

Python

import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
from scipy import stats

std_care = [42,55,58,60,48,70,62,50,75,38,65,72,55,43,68,58,60,70,45,57,80,30]
int_prog = [62,72,80,65,75,82,70,68,75,73,65,78,85,62,70,80,72,68,75,82,66,78,62,74]
u_stat, p_val = stats.mannwhitneyu(std_care, int_prog, alternative='two-sided')
print(f"U={u_stat:.1f}, p={p_val:.4f}")

df = pd.DataFrame({'Group': ['Standard care'] * 22 + ['Integrated programme'] * 24,
                   'QoL': std_care + int_prog})
fig, ax = plt.subplots(figsize=(7, 6))
palette = {'Standard care': '#1b7837', 'Integrated programme': '#762a83'}
sns.violinplot(data=df, x='Group', y='QoL', palette=palette, inner=None, alpha=0.45, ax=ax)
sns.boxplot(data=df, x='Group', y='QoL', palette=palette, width=0.15, fliersize=0, ax=ax)
sns.stripplot(data=df, x='Group', y='QoL', palette=palette, jitter=0.07, size=4, alpha=0.5, ax=ax)
ax.set_xlabel(""); ax.set_ylabel("EORTC QoL Score (0–100)", fontsize=12)
ax.set_title(f"Mann-Whitney U: QoL by Care Group\nU={u_stat:.0f}, p={p_val:.4f}")
ax.grid(axis='y', alpha=0.3); plt.tight_layout()
plt.savefig("mannwhitney.png", dpi=150, bbox_inches='tight'); plt.show()

5.4 Paired Samples t-Test

What it does: Compares means from the same subjects at two time points. Removes between-subject variability, substantially increasing power.

Test statistic: t = d̄ / (sd / √n), df = n − 1

Worked Example: Sodium restriction crossover (n=20): mean reduction 43.8 mmol/24h (SD 28.4). t(19)=−6.90, p<0.001.

R

library(tidyverse)
set.seed(42); n <- 20
pre <- rnorm(n, 168.4, 35)
post <- pre - rnorm(n, 43.8, 28.4)
t.test(pre, post, paired = TRUE)

tibble(id = 1:n, pre = pre, post = post) |>
  pivot_longer(c(pre, post), names_to = "time", values_to = "sodium") |>
  mutate(time = factor(time, c("pre", "post"), c("Before", "After"))) |>
  ggplot(aes(x = time, y = sodium)) +
  geom_line(aes(group = id), colour = "grey60", alpha = 0.5) +
  geom_point(aes(colour = time), size = 2.5, alpha = 0.8) +
  stat_summary(fun = mean, geom = "line", aes(group = 1),
               colour = "firebrick", linewidth = 2) +
  stat_summary(fun.data = mean_cl_normal, geom = "pointrange",
               colour = "firebrick", size = 0.9, linewidth = 1.5) +
  scale_colour_manual(values = c("Before" = "#4575b4", "After" = "#d73027")) +
  labs(title = "Paired t-Test: Urinary Sodium Before/After Restriction",
       subtitle = "Grey = individual trajectories; red = mean ± 95% CI",
       x = NULL, y = "24-h urinary sodium (mmol/24h)") +
  theme_bw(base_size = 13) +
  theme(legend.position = "none")

Python

import numpy as np, matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(42); n = 20
pre = rng.normal(168.4, 35, n)
post = pre - rng.normal(43.8, 28.4, n)
t_stat, p_val = stats.ttest_rel(pre, post)
print(f"t={t_stat:.3f}, p={p_val:.4f}")

fig, ax = plt.subplots(figsize=(6, 6))
for i in range(n):
    ax.plot([0, 1], [pre[i], post[i]], color='grey', alpha=0.4, lw=1)
ax.scatter([0] * n, pre, color='#4575b4', s=50, alpha=0.75, zorder=3)
ax.scatter([1] * n, post, color='#d73027', s=50, alpha=0.75, zorder=3)
ci_pre = stats.t.interval(0.95, n - 1, np.mean(pre), stats.sem(pre))
ci_post = stats.t.interval(0.95, n - 1, np.mean(post), stats.sem(post))
ax.errorbar([0, 1], [np.mean(pre), np.mean(post)],
            yerr=[[np.mean(pre) - ci_pre[0], np.mean(post) - ci_post[0]],
                  [ci_pre[1] - np.mean(pre), ci_post[1] - np.mean(post)]],
            fmt='o-', color='firebrick', markersize=9, lw=2.5, capsize=6, zorder=5)
ax.set_xticks([0, 1]); ax.set_xticklabels(['Before', 'After'], fontsize=12)
ax.set_ylabel("24-h Urinary Sodium (mmol/24h)", fontsize=12)
ax.set_title(f"Paired t-Test: Sodium Restriction\nMean reduction={np.mean(pre - post):.1f} mmol/24h, p<0.001")
ax.grid(axis='y', alpha=0.3); plt.tight_layout()
plt.savefig("paired_ttest.png", dpi=150, bbox_inches='tight'); plt.show()

5.5 Wilcoxon Signed-Rank Test

What it does: Non-parametric equivalent of paired t-test. Does not assume normality of differences.

Worked Example: Knee OA (n=16): NRS pre median=7, post=4. Z=−3.29, p=0.001.

R

library(tidyverse)
pre <- c(7,8,6,9,7,8,6,7,9,8,7,6,8,7,9,6)
post <- c(4,5,3,5,4,4,3,4,5,4,3,3,5,4,4,3)
wilcox.test(pre, post, paired = TRUE)

tibble(id = 1:16, pre = pre, post = post,
       improved = ifelse(post < pre, "Improved", "Not improved")) |>
  pivot_longer(c(pre, post), names_to = "time", values_to = "nrs") |>
  mutate(time = factor(time, c("pre", "post"), c("Before", "After"))) |>
  ggplot(aes(x = time, y = nrs, group = id)) +
  geom_line(aes(colour = improved), alpha = 0.7, linewidth = 1) +
  geom_point(size = 2.5) +
  scale_colour_manual(values = c("Improved" = "#1b7837", "Not improved" = "#d73027"),
                      name = NULL) +
  labs(title = "Wilcoxon Signed-Rank: Pain Before/After Physiotherapy",
       subtitle = "Green = improved; red = not improved",
       x = NULL, y = "NRS Pain Score (0–10)") +
  theme_bw(base_size = 13) +
  theme(legend.position = "bottom")

Python

import numpy as np, matplotlib.pyplot as plt
from scipy import stats
from matplotlib.lines import Line2D

pre = np.array([7,8,6,9,7,8,6,7,9,8,7,6,8,7,9,6])
post = np.array([4,5,3,5,4,4,3,4,5,4,3,3,5,4,4,3])
res = stats.wilcoxon(pre, post)
print(f"W={res.statistic:.1f}, p={res.pvalue:.4f}")

improved = post < pre
fig, ax = plt.subplots(figsize=(6, 6))
for i in range(len(pre)):
    col = '#1b7837' if improved[i] else '#d73027'
    ax.plot([0, 1], [pre[i], post[i]], color=col, alpha=0.65, lw=1.5)
    ax.scatter([0, 1], [pre[i], post[i]], color=col, s=40, zorder=3)
ax.set_xticks([0, 1]); ax.set_xticklabels(['Before', 'After'], fontsize=12)
ax.set_yticks(range(0, 11)); ax.set_ylabel("NRS Pain Score (0–10)", fontsize=12)
ax.set_title(f"Wilcoxon Signed-Rank: Knee OA Physio\nW={res.statistic:.0f}, p={res.pvalue:.4f}")
legend_elements = [Line2D([0], [0], color='#1b7837', label=f'Improved ({improved.sum()})'),
                   Line2D([0], [0], color='#d73027', label=f'Not improved ({(~improved).sum()})')]
ax.legend(handles=legend_elements, loc='upper right')
ax.grid(axis='y', alpha=0.3); plt.tight_layout()
plt.savefig("wilcoxon_signed_rank.png", dpi=150, bbox_inches='tight'); plt.show()

5.6 Chi-Square Test of Independence

What it does: Tests whether two categorical variables are associated.

Assumptions: Both variables categorical, independent observations, expected frequency ≥ 5 in all cells.

Worked Example: Smoking and post-op pneumonia (n=180). χ²(1)=16.46, p<0.001. OR=4.33.
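Before relying on the chi-square result, the expected-frequency assumption can be checked directly from the table (a quick sketch using the worked example's counts):

```python
import numpy as np
from scipy.stats import chi2_contingency

ct = np.array([[24, 36],     # smokers: pneumonia / no pneumonia
               [16, 104]])   # non-smokers

chi2, p, dof, expected = chi2_contingency(ct, correction=False)
# Rule of thumb: chi-square is valid when all expected counts are >= 5;
# otherwise switch to Fisher's exact test (Section 5.7)
print("Expected counts:\n", np.round(expected, 1))
print(f"Minimum expected count: {expected.min():.1f}")
```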

R

library(tidyverse)
ct <- matrix(c(24, 36, 16, 104), nrow = 2, byrow = TRUE,
             dimnames = list(c("Smoker", "Non-smoker"),
                             c("Pneumonia", "No pneumonia")))
chisq.test(ct, correct = FALSE)  # uncorrected, matching chi2_contingency(correction=False)

as_tibble(ct, rownames = "Smoking") |>
  pivot_longer(-Smoking, names_to = "Outcome", values_to = "n") |>
  group_by(Smoking) |>
  mutate(pct = n / sum(n)) |>
  ggplot(aes(x = Smoking, y = pct, fill = Outcome)) +
  geom_col(position = "fill", width = 0.55) +
  geom_text(aes(label = scales::percent(pct, accuracy = 0.1)),
            position = position_fill(vjust = 0.5), size = 4, colour = "white") +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_fill_manual(values = c("Pneumonia" = "#d73027", "No pneumonia" = "#4575b4")) +
  labs(title = "Chi-Square: Smoking and Post-Op Pneumonia",
       subtitle = "χ²(1)=16.46, p<0.001; OR=4.33",
       x = NULL, y = "Proportion", fill = "Outcome") +
  theme_bw(base_size = 13)

Python

import numpy as np, matplotlib.pyplot as plt
from scipy.stats import chi2_contingency

ct = np.array([[24, 36], [16, 104]])
chi2, p, dof, expected = chi2_contingency(ct, correction=False)
print(f"χ²({dof})={chi2:.3f}, p={p:.6f}")

groups = ['Smoker', 'Non-smoker']
pneum = ct[:, 0] / ct.sum(axis=1)
no_pn = ct[:, 1] / ct.sum(axis=1)
fig, ax = plt.subplots(figsize=(6, 6))
ax.bar(groups, pneum, label='Pneumonia', color='#d73027', alpha=0.9)
ax.bar(groups, no_pn, bottom=pneum, label='No pneumonia', color='#4575b4', alpha=0.9)
for i, (p0, p1) in enumerate(zip(pneum, no_pn)):
    ax.text(i, p0 / 2, f"{p0:.1%}", ha='center', va='center', color='white', fontsize=11)
    ax.text(i, p0 + p1 / 2, f"{p1:.1%}", ha='center', va='center', color='white', fontsize=11)
ax.set_ylabel("Proportion")
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda y, _: f'{y:.0%}'))
ax.set_title(f"Chi-Square: Smoking and Post-Op Pneumonia\nχ²(1)={chi2:.2f}, p<0.001")
ax.legend(loc='upper right'); ax.grid(axis='y', alpha=0.3)
plt.tight_layout(); plt.savefig("chisq_independence.png", dpi=150, bbox_inches='tight'); plt.show()

5.7 Fisher’s Exact Test

What it does: Tests association when expected cell frequencies are small (< 5). Calculates exact probability.

When to use: 2×2 table with any expected cell < 5, small n, sparse data.

Worked Example: Fungal infection and immunosuppression (n=12). Fisher’s p=0.045. Hypothesis-generating only.

R

ct_small <- matrix(c(4, 1, 1, 6), nrow = 2,
                   dimnames = list(c("Immunosuppressed", "Not"),
                                   c("Infection", "No infection")))
fisher.test(ct_small)
# Expected cells: min = (5×5)/12 = 2.08 < 5 → Fisher's appropriate

Python

import numpy as np
from scipy.stats import fisher_exact

ct_small = np.array([[4, 1],
                     [1, 6]])
odds_ratio, p_val = fisher_exact(ct_small, alternative='two-sided')
print(f"OR={odds_ratio:.2f}, p={p_val:.4f}")

5.8 McNemar’s Test

What it does: Tests whether proportion of a binary outcome differs between two paired groups.

Key insight: Only the discordant pairs (b and c) contribute information.

Worked Example: Hand-hygiene compliance before/after education (n=80). Discordant: b=12, c=22. McNemar χ²=2.94, p=0.086.
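The discordant-pair arithmetic can be checked by hand; a minimal sketch using the counts from the worked example:

```python
# McNemar's statistic uses only the discordant pairs b and c.
b, c = 12, 22  # discordant pairs from the hand-hygiene worked example

chi2_uncorrected = (b - c) ** 2 / (b + c)          # (12-22)^2 / 34
chi2_corrected = (abs(b - c) - 1) ** 2 / (b + c)   # Edwards continuity correction

print(f"Uncorrected χ² = {chi2_uncorrected:.2f}")  # 2.94 (matches the worked example)
print(f"Corrected χ²   = {chi2_corrected:.2f}")    # 2.38
```

Note that R's `mcnemar.test()` and statsmodels' `mcnemar()` apply the continuity correction by default, so the reported 2.94 corresponds to the uncorrected statistic.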

R

ct_mc <- matrix(c(38, 12, 22, 8), nrow = 2,
                dimnames = list(c("Pre: Compliant", "Pre: Non-compliant"),
                                c("Post: Compliant", "Post: Non-compliant")))
mcnemar.test(ct_mc, correct = FALSE)  # uncorrected χ² = (b−c)²/(b+c) = 2.94, as reported above

Python

import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

ct_mc = np.array([[38, 12],
                  [22, 8]])
result = mcnemar(ct_mc, exact=False, correction=False)  # match the uncorrected χ² = 2.94
print(f"chi2={result.statistic:.3f}, p={result.pvalue:.4f}")

6. Comparing Three or More Groups

6.1 One-Way ANOVA

What it does: Tests whether means of 3+ independent groups differ. Tests the omnibus null hypothesis (ALL means equal) — follow up with post-hoc tests to find which groups differ.

Assumptions: Continuous outcome, 3+ independent groups, ~normal within each group, equal variances (Levene’s test).

Logic: Partitions total variability into between-group (treatment) and within-group (noise). F = MSbetween / MSwithin.

Post-hoc tests: Tukey HSD (all-pairs, good default), Bonferroni (conservative), Dunnett (vs control), Scheffé (most conservative).

Worked Example: Anti-nausea drug trial (n=50/group, 4 groups). Vomiting episodes: placebo 6.8, low 5.1, medium 3.4, high 2.9. F(3,196)=38.5, p<0.001. Tukey: medium vs high not significant (p=0.41).

R

library(tidyverse); library(rstatix); library(ggpubr)  # ggpubr provides stat_compare_means()

set.seed(42)
df_anova <- tibble(
  group = rep(c("Placebo", "Low", "Medium", "High"), each = 50),
  vomit = c(rnorm(50, 6.8, 2.4), rnorm(50, 5.1, 2.1),
            rnorm(50, 3.4, 1.8), rnorm(50, 2.9, 1.7))
) |>
  mutate(group = factor(group, levels = c("Placebo", "Low", "Medium", "High")))

# Levene's test and ANOVA
levene_test(df_anova, vomit ~ group)
anova_result <- aov(vomit ~ group, data = df_anova)
summary(anova_result)

# Tukey HSD post-hoc
tukey_hsd(df_anova, vomit ~ group)

# Plot
ggplot(df_anova, aes(x = group, y = vomit, fill = group)) +
  geom_boxplot(alpha = 0.65, outlier.shape = NA, width = 0.55) +
  geom_jitter(aes(colour = group), width = 0.12, size = 1.5, alpha = 0.5) +
  stat_compare_means(method = "anova", label.y = max(df_anova$vomit) + 1) +
  scale_fill_brewer(palette = "RdYlGn") +
  scale_colour_brewer(palette = "RdYlGn") +
  labs(title = "One-Way ANOVA: Vomiting Episodes by Drug Dose",
       subtitle = "Post-hoc Tukey HSD; medium vs high not significant (p=0.41)",
       x = "Treatment group", y = "Vomiting episodes (24h)") +
  theme_bw(base_size = 13) +
  theme(legend.position = "none")

Python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import pingouin as pg

rng = np.random.default_rng(42)
groups_dict = {'Placebo': rng.normal(6.8, 2.4, 50),
               'Low':     rng.normal(5.1, 2.1, 50),
               'Medium':  rng.normal(3.4, 1.8, 50),
               'High':    rng.normal(2.9, 1.7, 50)}
df = pd.DataFrame([(g, v) for g, vals in groups_dict.items() for v in vals],
                  columns=['group', 'vomit'])

# ANOVA
f_stat, p_val = stats.f_oneway(*groups_dict.values())
print(f"F={f_stat:.2f}, p={p_val:.6f}")

# Tukey post-hoc via pingouin
posthoc = pg.pairwise_tukey(data=df, dv='vomit', between='group')
print(posthoc[['A', 'B', 'diff', 'p-tukey']])

# Plot
fig, ax = plt.subplots(figsize=(8, 6))
order = ['Placebo', 'Low', 'Medium', 'High']
palette = {'Placebo': '#d73027', 'Low': '#fc8d59', 'Medium': '#91cf60', 'High': '#1a9850'}
sns.boxplot(data=df, x='group', y='vomit', order=order, palette=palette,
            width=0.5, fliersize=0, ax=ax)
sns.stripplot(data=df, x='group', y='vomit', order=order, palette=palette,
              jitter=0.1, size=3, alpha=0.4, ax=ax)
ax.set_xlabel("Treatment group")
ax.set_ylabel("Vomiting episodes (24h)")
ax.set_title(f"One-Way ANOVA: Vomiting Episodes by Dose\nF(3,196)={f_stat:.1f}, p<0.001")
ax.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.savefig("oneway_anova.png", dpi=150, bbox_inches='tight')
plt.show()

6.2 Welch’s ANOVA and Kruskal-Wallis Test

Welch’s ANOVA: Use when group variances are unequal — i.e. when Levene’s test is significant (p<0.05). It does not assume homogeneity of variance. Post-hoc: Games-Howell.

Kruskal-Wallis: Non-parametric alternative to one-way ANOVA. Tests if 3+ independent groups have the same distribution. Post-hoc: Dunn’s test with Bonferroni correction.

Worked Example (Kruskal-Wallis): Discharge pain scores (NRS) at 3 hospitals. H(2)=8.74, p=0.013. Hospital A < B (Dunn’s p=0.010); A vs C and B vs C not significant.

R

library(tidyverse); library(rstatix)

set.seed(42)
df_kw <- tibble(
  hospital = rep(c("A", "B", "C"), c(35, 38, 33)),
  pain = c(sample(c(3,4,5,6),   35, replace = TRUE, prob = c(0.3, 0.4, 0.2, 0.1)),
           sample(c(4,5,6,7,8), 38, replace = TRUE, prob = c(0.15, 0.25, 0.3, 0.2, 0.1)),
           sample(c(3,4,5,6,7), 33, replace = TRUE, prob = c(0.2, 0.25, 0.3, 0.15, 0.1)))
)

kruskal.test(pain ~ hospital, data = df_kw)
dunn_test(df_kw, pain ~ hospital, p.adjust.method = "bonferroni")

ggplot(df_kw, aes(x = hospital, y = pain, fill = hospital)) +
  geom_violin(alpha = 0.55, trim = FALSE) +
  geom_boxplot(width = 0.12, outlier.shape = NA) +
  geom_jitter(aes(colour = hospital), width = 0.08, size = 1.8, alpha = 0.5) +
  scale_fill_brewer(palette = "Set2") +
  scale_colour_brewer(palette = "Set2") +
  scale_y_continuous(breaks = 0:10) +
  labs(title = "Kruskal-Wallis: Discharge Pain by Hospital",
       subtitle = "H(2)=8.74, p=0.013; Dunn's post-hoc: A<B (p=0.010)",
       x = "Hospital", y = "NRS Pain at Discharge (0–10)") +
  theme_bw(base_size = 13) +
  theme(legend.position = "none")

Python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import scikit_posthocs as sp

rng = np.random.default_rng(42)
data_A = rng.choice([3,4,5,6],   35, p=[0.3, 0.4, 0.2, 0.1])
data_B = rng.choice([4,5,6,7,8], 38, p=[0.15, 0.25, 0.3, 0.2, 0.1])
data_C = rng.choice([3,4,5,6,7], 33, p=[0.2, 0.25, 0.3, 0.15, 0.1])

h_stat, p_val = stats.kruskal(data_A, data_B, data_C)
print(f"H={h_stat:.3f}, p={p_val:.4f}")

df = pd.DataFrame({'Hospital': ['A']*35 + ['B']*38 + ['C']*33,
                   'Pain':     np.concatenate([data_A, data_B, data_C])})
dunn_result = sp.posthoc_dunn(df, val_col='Pain', group_col='Hospital', p_adjust='bonferroni')
print(dunn_result)

fig, ax = plt.subplots(figsize=(7, 6))
palette = {'A': '#66c2a5', 'B': '#fc8d62', 'C': '#8da0cb'}
sns.violinplot(data=df, x='Hospital', y='Pain', palette=palette, inner=None, alpha=0.5, ax=ax)
sns.boxplot(data=df, x='Hospital', y='Pain', palette=palette, width=0.12, fliersize=0, ax=ax)
sns.stripplot(data=df, x='Hospital', y='Pain', palette=palette, jitter=0.08, size=4, alpha=0.4, ax=ax)
ax.set_ylabel("NRS Pain at Discharge (0–10)")
ax.set_xlabel("Hospital")
ax.set_title(f"Kruskal-Wallis: Pain by Hospital\nH(2)={h_stat:.2f}, p={p_val:.3f}")
ax.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.savefig("kruskal_wallis.png", dpi=150, bbox_inches='tight')
plt.show()

6.3 Repeated Measures ANOVA

What it does: Tests for differences in a continuous outcome at 3+ time points in the same subjects.

Key assumption — Sphericity: Variances of differences between all time-point pairs should be equal. Test with Mauchly’s. If violated, apply Greenhouse-Geisser correction.

Worked Example: CKD creatinine at 4 time points (n=30). F(3,87)=8.43, p<0.001, η²=0.23.

R

library(tidyverse); library(rstatix)

set.seed(42)
n <- 30
df_rm <- tibble(
  id = rep(1:n, 4),
  timepoint = rep(c("Baseline", "3 months", "6 months", "12 months"), each = n),
  creatinine = c(rnorm(n, 142, 38), rnorm(n, 138, 36), rnorm(n, 131, 34), rnorm(n, 128, 33))
) |>
  mutate(timepoint = factor(timepoint, levels = c("Baseline", "3 months", "6 months", "12 months")))

anova_test(df_rm, dv = creatinine, wid = id, within = timepoint)

# Mean ± 95% CI line plot
df_rm |>
  group_by(timepoint) |>
  summarise(mean = mean(creatinine),
            se = sd(creatinine) / sqrt(n()),
            ci_lo = mean - qt(0.975, n() - 1) * se,
            ci_hi = mean + qt(0.975, n() - 1) * se) |>
  ggplot(aes(x = timepoint, y = mean, group = 1)) +
  geom_ribbon(aes(ymin = ci_lo, ymax = ci_hi), fill = "#74add1", alpha = 0.3) +
  geom_line(colour = "#4575b4", linewidth = 1.5) +
  geom_point(colour = "#4575b4", size = 4) +
  labs(title = "Repeated Measures ANOVA: Serum Creatinine over 12 Months",
       subtitle = "Mean ± 95% CI; F(3,87)=8.43, p<0.001, η²=0.23",
       x = "Time point", y = "Serum creatinine (µmol/L)") +
  theme_bw(base_size = 13)

Python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import pingouin as pg

rng = np.random.default_rng(42)
n = 30
timepoints = ['Baseline', '3 months', '6 months', '12 months']
means = [142, 138, 131, 128]
sds = [38, 36, 34, 33]
df = pd.concat([pd.DataFrame({'id': range(n), 'timepoint': tp,
                              'creatinine': rng.normal(m, s, n)})
                for tp, m, s in zip(timepoints, means, sds)])
df['timepoint'] = pd.Categorical(df['timepoint'], categories=timepoints, ordered=True)

# RM-ANOVA via pingouin
rm_result = pg.rm_anova(data=df, dv='creatinine', within='timepoint', subject='id', detailed=True)
print(rm_result[['Source', 'F', 'p-unc', 'np2']])

# Mean ± 95% CI
summary = df.groupby('timepoint')['creatinine'].agg(['mean', 'sem'])
summary['ci95'] = summary['sem'] * stats.t.ppf(0.975, n - 1)

fig, ax = plt.subplots(figsize=(8, 5))
x = range(len(timepoints))
ax.fill_between(x, summary['mean'] - summary['ci95'], summary['mean'] + summary['ci95'],
                alpha=0.25, color='#74add1')
ax.plot(x, summary['mean'], 'o-', color='#4575b4', linewidth=2, markersize=8)
ax.errorbar(x, summary['mean'], yerr=summary['ci95'], fmt='none',
            color='#4575b4', capsize=5, linewidth=1.5)
ax.set_xticks(x)
ax.set_xticklabels(timepoints)
ax.set_ylabel("Serum creatinine (µmol/L)")
ax.set_xlabel("")
ax.set_title("Repeated Measures ANOVA: Creatinine over 12 Months\n"
             "Mean ± 95% CI; F(3,87)=8.43, p<0.001")
ax.grid(alpha=0.3)
plt.tight_layout()
plt.savefig("rm_anova.png", dpi=150, bbox_inches='tight')
plt.show()

6.4 Friedman Test

What it does: Non-parametric equivalent of repeated measures ANOVA. Compares 3+ related groups.

When to use: Same subjects at 3+ time points; data are skewed, ordinal, or RM-ANOVA assumptions violated.

Worked Example: Rheumatoid arthritis biologic therapy (n=20): NRS baseline 7, week 4: 5, week 8: 3. χ²(2)=28.4, p<0.001.

R

library(tidyverse)

set.seed(42)
n <- 20
df_fr <- tibble(
  id = rep(1:n, 3),
  timepoint = rep(c("Baseline", "Week 4", "Week 8"), each = n),
  nrs = c(sample(6:9, n, replace = TRUE),
          sample(3:7, n, replace = TRUE),
          sample(1:5, n, replace = TRUE))
) |>
  mutate(timepoint = factor(timepoint, levels = c("Baseline", "Week 4", "Week 8")))

friedman.test(nrs ~ timepoint | id, data = df_fr)

# Spaghetti plot + median trajectory
medians <- df_fr |> group_by(timepoint) |> summarise(med = median(nrs))
ggplot(df_fr, aes(x = timepoint, y = nrs, group = id)) +
  geom_line(colour = "grey60", alpha = 0.5, linewidth = 0.8) +
  geom_point(colour = "grey60", size = 1.5, alpha = 0.5) +
  geom_line(data = medians, mapping = aes(x = timepoint, y = med, group = 1),
            colour = "firebrick", linewidth = 2.5, inherit.aes = FALSE) +
  geom_point(data = medians, mapping = aes(x = timepoint, y = med),
             colour = "firebrick", size = 5, inherit.aes = FALSE) +
  labs(title = "Friedman Test: Pain Scores over 8 Weeks of Biologic Therapy",
       subtitle = "Grey=individual patients; red=median trajectory; χ²(2)=28.4, p<0.001",
       x = NULL, y = "NRS Pain Score (0–10)") +
  theme_bw(base_size = 13)

Python

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(42)
n = 20
baseline = rng.integers(6, 10, n)
week4 = rng.integers(3, 8, n)
week8 = rng.integers(1, 6, n)

stat, p = stats.friedmanchisquare(baseline, week4, week8)
print(f"Friedman χ²={stat:.3f}, p={p:.6f}")

timepoints = ['Baseline', 'Week 4', 'Week 8']
fig, ax = plt.subplots(figsize=(7, 6))
for i in range(n):
    ax.plot([0, 1, 2], [baseline[i], week4[i], week8[i]], color='grey', alpha=0.4, lw=0.9)
medians = [np.median(baseline), np.median(week4), np.median(week8)]
ax.plot([0, 1, 2], medians, 'o-', color='firebrick', lw=2.5, markersize=10, zorder=5,
        label=f'Median: {medians[0]:.0f}{medians[1]:.0f}{medians[2]:.0f}')
ax.set_xticks([0, 1, 2])
ax.set_xticklabels(timepoints)
ax.set_ylabel("NRS Pain Score (0–10)")
ax.set_title(f"Friedman Test: RA Biologic Therapy\nχ²(2)={stat:.1f}, p<0.001")
ax.legend()
ax.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.savefig("friedman.png", dpi=150, bbox_inches='tight')
plt.show()

7. Correlation and Association

7.1 Pearson Correlation

What it does: Quantifies the strength and direction of the linear relationship between two continuous variables. Output: r (−1 to +1).

Interpreting |r|: 0–0.19 = negligible; 0.20–0.39 = weak; 0.40–0.59 = moderate; 0.60–0.79 = strong; 0.80–1.00 = very strong.

Important: Always plot the data first. r can miss non-linear relationships and is distorted by outliers.

Worked Example: Age and eGFR (n=150). r=−0.58 (95% CI −0.68 to −0.46), p<0.001. r²=0.34 (age explains 34% of eGFR variance).

R

library(tidyverse)

set.seed(42)
n <- 150
df_cor <- tibble(
  age  = rnorm(n, 58, 14),
  egfr = 110 - 0.6 * age + rnorm(n, 0, 12)
)
ct_res <- cor.test(df_cor$age, df_cor$egfr)
ct_res

ggplot(df_cor, aes(x = age, y = egfr)) +
  geom_point(colour = "#4575b4", alpha = 0.65, size = 2.5) +
  geom_smooth(method = "lm", se = TRUE, colour = "firebrick", fill = "firebrick", alpha = 0.15) +
  annotate("text", x = 30, y = 85, hjust = 0, size = 4,
           label = sprintf("r = %.2f\n95%% CI: %.2f to %.2f\np < 0.001",
                           ct_res$estimate, ct_res$conf.int[1], ct_res$conf.int[2])) +
  labs(title = "Pearson Correlation: Age and eGFR",
       subtitle = "Scatterplot with regression line and 95% CI band",
       x = "Age (years)", y = "eGFR (mL/min/1.73m²)") +
  theme_bw(base_size = 13)

Python

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(42)
n = 150
age = rng.normal(58, 14, n)
egfr = 110 - 0.6 * age + rng.normal(0, 12, n)

r, p = stats.pearsonr(age, egfr)
# 95% CI via Fisher's z-transformation
r_z = np.arctanh(r)
se = 1 / np.sqrt(n - 3)
ci = np.tanh([r_z - 1.96 * se, r_z + 1.96 * se])
print(f"r={r:.3f}, 95% CI: {ci[0]:.3f}{ci[1]:.3f}, p={p:.6f}")

fig, ax = plt.subplots(figsize=(7, 6))
ax.scatter(age, egfr, color='#4575b4', alpha=0.6, s=40)
slope, intercept, *_ = stats.linregress(age, egfr)
x_line = np.linspace(age.min(), age.max(), 100)
ax.plot(x_line, slope * x_line + intercept, 'r-', lw=2)
# 95% CI band for the fitted mean
se_fit = np.sqrt(np.var(egfr - (slope * age + intercept)) *
                 (1/n + (x_line - age.mean())**2 / np.sum((age - age.mean())**2)))
ax.fill_between(x_line, slope * x_line + intercept - 1.96 * se_fit,
                slope * x_line + intercept + 1.96 * se_fit, color='red', alpha=0.12)
ax.text(25, 85, f"r = {r:.2f}\n95% CI: {ci[0]:.2f}{ci[1]:.2f}\np < 0.001", fontsize=10)
ax.set_xlabel("Age (years)")
ax.set_ylabel("eGFR (mL/min/1.73m²)")
ax.set_title("Pearson Correlation: Age and eGFR\n(r = −0.58)")
ax.grid(alpha=0.3)
plt.tight_layout()
plt.savefig("pearson_correlation.png", dpi=150, bbox_inches='tight')
plt.show()

7.2 Spearman’s Rank Correlation

What it does: Non-parametric measure of monotonic association. Calculates Pearson correlation on ranks.

When to use: Ordinal data, skewed continuous data, outliers, non-linear monotonic relationships.

Worked Example: NYHA class vs 6-minute walk distance (n=80). ρ=−0.71 (95% CI −0.79 to −0.60), p<0.001.

R

set.seed(42)
n <- 80
nyha <- sample(1:4, n, replace = TRUE, prob = c(0.25, 0.4, 0.25, 0.1))
walk <- 450 - 50 * nyha + rnorm(n, 0, 40)
cor.test(nyha, walk, method = "spearman")

Python

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 80
nyha = rng.choice([1, 2, 3, 4], n, p=[0.25, 0.4, 0.25, 0.1])
walk = 450 - 50 * nyha + rng.normal(0, 40, n)
rho, p = stats.spearmanr(nyha, walk)
print(f"Spearman ρ={rho:.3f}, p={p:.6f}")

7.3 Common Pitfalls in Correlation Analysis

  1. No scatterplot: r=0.50 could reflect a clean trend, a curve, or a few outliers — always plot first
  2. Ecological fallacy: Group-level correlations don’t imply individual-level correlations
  3. Confounding: Correlation between A and B may be explained by third variable C
  4. Restricted range: Correlations are attenuated when studying a narrow range of one variable
  5. Multiple testing: 20 correlations at α=0.05 → ~1 false positive by chance
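Pitfall 5 is easy to demonstrate by simulation; a minimal sketch (sample size, seed, and number of simulations are arbitrary choices for illustration):

```python
import numpy as np
from scipy import stats

# Run 20 independent correlation tests on pure noise (no true association)
# and count how many reach p < 0.05 by chance, averaged over many repeats.
rng = np.random.default_rng(0)
n_tests, n_obs, n_sims = 20, 50, 500

false_pos = 0
for _ in range(n_sims):
    for _ in range(n_tests):
        x, y = rng.normal(size=n_obs), rng.normal(size=n_obs)
        _, p = stats.pearsonr(x, y)
        false_pos += p < 0.05

# Expectation: 20 × 0.05 = 1 false positive per batch of 20 tests
print(f"Mean false positives per 20 tests: {false_pos / n_sims:.2f}")
```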

8. Regression Analysis

8.1 Simple Linear Regression

What it does: Models linear relationship between one continuous predictor (X) and one continuous outcome (Y). Extends correlation by fitting a line and quantifying predicted change in Y per unit change in X.

Model: Y = β₀ + β₁X + ε

Assumptions: Linearity, independence of residuals, homoscedasticity, normality of residuals, no influential outliers.

Worked Example: BMI and SBP (n=200). SBP = 98.4 + 1.72×BMI. β₁=1.72 (95% CI 1.28–2.16), p<0.001. R²=0.21.

R

library(tidyverse)

set.seed(42)
n <- 200
bmi <- rnorm(n, 27, 5)
sbp <- 98.4 + 1.72 * bmi + rnorm(n, 0, 18)
model <- lm(sbp ~ bmi)
summary(model)
confint(model)

par(mfrow = c(1, 2))
# Scatter + fitted line
plot(bmi, sbp, pch = 16, col = adjustcolor("#4575b4", 0.6),
     xlab = "BMI (kg/m²)", ylab = "SBP (mmHg)",
     main = "Simple Linear Regression: BMI and SBP")
abline(model, col = "firebrick", lwd = 2)
# Residuals vs fitted
plot(model, which = 1, main = "Residuals vs Fitted")

Python

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 200
bmi = rng.normal(27, 5, n)
sbp = 98.4 + 1.72 * bmi + rng.normal(0, 18, n)

X_sm = sm.add_constant(bmi)
model = sm.OLS(sbp, X_sm).fit()
print(model.summary())

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Scatter + fitted line + 95% CI band
x_range = np.linspace(bmi.min(), bmi.max(), 100)
pred = model.get_prediction(sm.add_constant(x_range))
pred_sum = pred.summary_frame(alpha=0.05)
ax1.scatter(bmi, sbp, color='#4575b4', alpha=0.55, s=25)
ax1.plot(x_range, pred_sum['mean'], 'r-', lw=2)
ax1.fill_between(x_range, pred_sum['mean_ci_lower'], pred_sum['mean_ci_upper'],
                 color='red', alpha=0.15)
ax1.set_xlabel("BMI (kg/m²)")
ax1.set_ylabel("SBP (mmHg)")
ax1.set_title(f"Simple Linear Regression\nβ={model.params[1]:.2f}, R²={model.rsquared:.2f}")

# Residuals vs fitted
ax2.scatter(model.fittedvalues, model.resid, color='#377eb8', alpha=0.55, s=25)
ax2.axhline(0, color='red', lw=1.5, linestyle='--')
ax2.set_xlabel("Fitted values")
ax2.set_ylabel("Residuals")
ax2.set_title("Residuals vs Fitted")

ax1.grid(alpha=0.3)
ax2.grid(alpha=0.3)
plt.tight_layout()
plt.savefig("simple_regression.png", dpi=150, bbox_inches='tight')
plt.show()

8.2 Multiple Linear Regression

What it does: Models relationship between 2+ predictors and a continuous outcome. Each coefficient = effect of that predictor adjusted for all others.

Model: Y = β₀ + β₁X₁ + β₂X₂ + … + βₚXₚ + ε

Worked Example: SBP predictors in 200 adults. Adjusted R²=0.34. BMI β=1.42, age β=0.68, male sex β=4.10, smoking β=3.85 (all p<0.01).

R

library(tidyverse); library(broom)  # tidyverse needed for ggplot2/forcats/dplyr below

set.seed(42)
n <- 200
bmi <- rnorm(n, 27, 5)
age <- rnorm(n, 52, 12)
male <- rbinom(n, 1, 0.5)
smoker <- rbinom(n, 1, 0.2)
sbp <- 78.2 + 1.42*bmi + 0.68*age + 4.10*male + 3.85*smoker + rnorm(n, 0, 15)
df_mlr <- data.frame(sbp, bmi, age, male, smoker)

model <- lm(sbp ~ bmi + age + male + smoker, data = df_mlr)
summary(model)

# Coefficient forest plot
tidy(model, conf.int = TRUE) |>
  filter(term != "(Intercept)") |>
  mutate(term = c("BMI (per kg/m²)", "Age (per year)", "Male sex", "Current smoker")) |>
  ggplot(aes(x = estimate, y = fct_reorder(term, estimate))) +
  geom_vline(xintercept = 0, linetype = "dashed", colour = "grey40") +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2, linewidth = 1) +
  geom_point(size = 4, colour = "#4575b4") +
  labs(title = "Multiple Linear Regression: Predictors of SBP",
       subtitle = "Coefficients with 95% CI; adjusted R²=0.34",
       x = "Regression coefficient (β, mmHg)", y = NULL) +
  theme_bw(base_size = 13)

Python

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 200
bmi = rng.normal(27, 5, n)
age = rng.normal(52, 12, n)
male = rng.binomial(1, 0.5, n)
smoker = rng.binomial(1, 0.2, n)
sbp = 78.2 + 1.42*bmi + 0.68*age + 4.10*male + 3.85*smoker + rng.normal(0, 15, n)

X = sm.add_constant(np.column_stack([bmi, age, male, smoker]))
model = sm.OLS(sbp, X).fit()
print(model.summary())

coef_names = ['BMI (per kg/m²)', 'Age (per year)', 'Male sex', 'Current smoker']
coefs = model.params[1:]
conf = model.conf_int()[1:]

fig, ax = plt.subplots(figsize=(8, 5))
y_pos = range(len(coef_names))
ax.barh(y_pos, coefs, xerr=[coefs - conf[:, 0], conf[:, 1] - coefs],
        height=0.5, align='center', color='#4575b4', alpha=0.8,
        error_kw=dict(ecolor='black', capsize=6, linewidth=1.5))
ax.axvline(0, color='grey', linestyle='--', lw=1.5)
ax.set_yticks(y_pos)
ax.set_yticklabels(coef_names)
ax.set_xlabel("Regression coefficient β (mmHg)")
ax.set_title("Multiple Linear Regression: Predictors of SBP\n"
             f"Coefficients ± 95% CI; Adjusted R²={model.rsquared_adj:.2f}")
ax.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.savefig("multiple_regression.png", dpi=150, bbox_inches='tight')
plt.show()

8.3 Logistic Regression

What it does: Models relationship between predictors and a binary outcome. Output: log-odds, converted to OR.

Key assumptions: Binary outcome, independence, no multicollinearity, EPV ≥ 10 events per predictor.

Worked Example: 30-day COPD readmission (n=400, 21% events). C-statistic=0.72. Prior admission OR=2.84 (95% CI 1.63–4.95).

R

library(tidyverse); library(broom)

set.seed(42)
n <- 400
age <- rnorm(n, 68, 12)
fev1 <- rnorm(n, 55, 15)
prev_adm <- rbinom(n, 1, 0.35)
home_o2 <- rbinom(n, 1, 0.25)
logit_p <- -3.5 + 0.022*age - 0.013*fev1 + 1.04*prev_adm + 0.66*home_o2
readmit <- rbinom(n, 1, plogis(logit_p))
df_lr <- data.frame(readmit, age, fev1, prev_adm, home_o2)

model <- glm(readmit ~ age + fev1 + prev_adm + home_o2, data = df_lr, family = binomial)
summary(model)

# Forest plot of ORs (coefficients are per 1-unit change)
tidy(model, exponentiate = TRUE, conf.int = TRUE) |>
  filter(term != "(Intercept)") |>
  mutate(term = c("Age (per year)", "FEV₁ (per %)", "Prior admission", "Home oxygen")) |>
  ggplot(aes(x = estimate, y = fct_reorder(term, estimate))) +
  geom_vline(xintercept = 1, linetype = "dashed", colour = "grey40") +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.25, linewidth = 1) +
  geom_point(size = 4, colour = "#d73027") +
  scale_x_log10() +
  labs(title = "Logistic Regression: Predictors of 30-Day COPD Readmission",
       subtitle = "Odds ratios with 95% CI (log scale)",
       x = "Odds Ratio (log scale)", y = NULL) +
  theme_bw(base_size = 13)

Python

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(42)
n = 400
age = rng.normal(68, 12, n)
fev1 = rng.normal(55, 15, n)
prev_adm = rng.binomial(1, 0.35, n)
home_o2 = rng.binomial(1, 0.25, n)
logit_p = -3.5 + 0.022*age - 0.013*fev1 + 1.04*prev_adm + 0.66*home_o2
readmit = rng.binomial(1, 1 / (1 + np.exp(-logit_p)), n)

X = sm.add_constant(np.column_stack([age, fev1, prev_adm, home_o2]))
model = sm.Logit(readmit, X).fit()
print(model.summary())

coef_names = ['Age (per yr)', 'FEV₁ (per %)', 'Prior admission', 'Home oxygen']
ors = np.exp(model.params[1:])
ci_lo = np.exp(model.conf_int()[1:, 0])
ci_hi = np.exp(model.conf_int()[1:, 1])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13, 5))

# OR forest plot
y_pos = range(len(coef_names))
ax1.errorbar(ors, y_pos, xerr=[ors - ci_lo, ci_hi - ors], fmt='o',
             color='#d73027', markersize=8, lw=1.5, capsize=5)
ax1.axvline(1, color='grey', linestyle='--', lw=1.5)
ax1.set_yticks(y_pos)
ax1.set_yticklabels(coef_names)
ax1.set_xscale('log')
ax1.set_xlabel("Odds Ratio (log scale)")
ax1.set_title("OR Forest Plot")
ax1.grid(axis='x', alpha=0.3)

# ROC curve
fpr, tpr, _ = roc_curve(readmit, model.predict())
roc_auc = auc(fpr, tpr)
ax2.plot(fpr, tpr, color='#4575b4', lw=2, label=f"AUC = {roc_auc:.2f}")
ax2.plot([0, 1], [0, 1], '--', color='grey')
ax2.set_xlabel("1 - Specificity")
ax2.set_ylabel("Sensitivity")
ax2.set_title("ROC Curve: COPD Readmission Model")
ax2.legend(loc='lower right')
ax2.grid(alpha=0.3)

plt.suptitle("Logistic Regression: COPD 30-Day Readmission", fontsize=13)
plt.tight_layout()
plt.savefig("logistic_regression.png", dpi=150, bbox_inches='tight')
plt.show()

9. Effect Sizes and Association Measures

9.1 Why Effect Sizes Matter

Statistical significance (p-value) tells you whether an effect exists. Effect sizes tell you how big it is.

Hierarchy of information (least to most informative): p-value → confidence interval → effect size

9.2 Odds Ratio (OR)

OR = [odds in group A] / [odds in group B] = ad/bc from 2×2 table.

  • OR = 1: no association; >1: increased odds; <1: decreased odds
  • Use in: Case-control studies (only valid measure), logistic regression outputs, rare-outcome cohort studies

9.3 Relative Risk (RR)

RR = [risk in exposed] / [risk in unexposed] = [a/(a+b)] / [c/(c+d)]

  • More intuitive than OR; directly communicates percentage change in risk
  • Use in: RCTs and prospective cohort studies
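Both measures come straight from the 2×2 cell counts; a minimal sketch with hypothetical counts (a, b, c, d invented for illustration):

```python
# Hypothetical 2x2 table:        Outcome+   Outcome-
a, b = 30, 70   # exposed:          a          b
c, d = 15, 85   # unexposed:        c          d

risk_exposed = a / (a + b)      # 0.30
risk_unexposed = c / (c + d)    # 0.15
rr = risk_exposed / risk_unexposed   # 2.00
odds_ratio = (a * d) / (b * c)       # (30*85)/(70*15) = 2.43

print(f"RR = {rr:.2f}, OR = {odds_ratio:.2f}")
# With a common outcome (15-30%), the OR (2.43) already exceeds the RR (2.00).
```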

9.4 OR vs RR: When They Diverge

Critical distinction: When the outcome is common (≥10%), the OR lies further from 1 than the RR, so it overstates the effect if misread as a relative risk.

True RR    Background risk    Approximate OR
2.0        5%                 2.1
2.0        20%                2.7
2.0        40%                6.0

R — OR vs RR Divergence Plot

library(tidyverse)

# Show how OR diverges from RR as baseline risk increases
expand_grid(
  rr = c(1.5, 2.0, 3.0),
  risk_ctrl = seq(0.02, 0.60, by = 0.01)
) |>
  mutate(
    risk_exp = pmin(rr * risk_ctrl, 0.99),
    or = (risk_exp / (1 - risk_exp)) / (risk_ctrl / (1 - risk_ctrl)),
    label = paste0("RR = ", rr)
  ) |>
  ggplot(aes(x = risk_ctrl * 100, y = or, colour = label)) +
  geom_line(linewidth = 1.3) +
  geom_hline(yintercept = c(1.5, 2.0, 3.0), linetype = "dashed", colour = "grey60") +
  scale_colour_brewer(palette = "Set1") +
  labs(title = "OR vs RR: Divergence Increases with Baseline Risk",
       subtitle = "Dashed lines = true RR; solid lines = calculated OR for same data",
       x = "Baseline risk in unexposed group (%)",
       y = "Odds Ratio (OR)", colour = "True RR") +
  theme_bw(base_size = 13) +
  theme(legend.position = "bottom")

Python

import numpy as np
import matplotlib.pyplot as plt

rrs = [1.5, 2.0, 3.0]
risk_ctrl = np.linspace(0.02, 0.60, 200)
colours = ['#e41a1c', '#377eb8', '#4daf4a']

fig, ax = plt.subplots(figsize=(9, 6))
for rr, col in zip(rrs, colours):
    risk_exp = np.minimum(rr * risk_ctrl, 0.99)
    or_val = (risk_exp / (1 - risk_exp)) / (risk_ctrl / (1 - risk_ctrl))
    ax.plot(risk_ctrl * 100, or_val, label=f"RR = {rr}", color=col, lw=2)
    ax.axhline(rr, linestyle='--', color=col, lw=1, alpha=0.6)
ax.set_xlabel("Baseline risk in unexposed group (%)", fontsize=12)
ax.set_ylabel("Odds Ratio (OR)", fontsize=12)
ax.set_title("OR vs RR: Divergence Increases with Baseline Risk\n"
             "Dashed = true RR; solid = calculated OR", fontsize=12)
ax.legend(loc='upper left')
ax.set_ylim(0.8, 12)
ax.grid(alpha=0.3)
plt.tight_layout()
plt.savefig("or_vs_rr.png", dpi=150, bbox_inches='tight')
plt.show()

9.5 ARR and NNT

ARR = Risk in control − Risk in treatment
NNT = 1 / ARR

Worked Example: LMWH prophylaxis post-orthopaedic surgery. DVT: 8% (LMWH) vs 18% (placebo).

  • RR=0.44 (56% relative reduction)
  • ARR=0.10 (10 percentage points)
  • NNT=10 — treat 10 patients to prevent 1 DVT
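The arithmetic of the worked example, step by step:

```python
# Risks from the LMWH prophylaxis worked example above
risk_placebo, risk_lmwh = 0.18, 0.08

rr = risk_lmwh / risk_placebo    # 0.44 -> 56% relative risk reduction
arr = risk_placebo - risk_lmwh   # 0.10 -> 10 percentage-point absolute reduction
nnt = 1 / arr                    # 10 patients treated to prevent 1 DVT

print(f"RR = {rr:.2f}, ARR = {arr:.2f}, NNT = {nnt:.0f}")
```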

9.6 Standardised Effect Sizes

Cohen’s d = (μ₁ − μ₂) / pooled SD. Benchmarks: 0.2=small, 0.5=medium, 0.8=large.

Eta-squared (η²) = SSbetween / SStotal. Benchmarks: 0.01=small, 0.06=medium, 0.14=large.
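Cohen's d follows directly from its definition; a minimal sketch with two small illustrative samples (values invented for the example, not from the guide):

```python
import numpy as np

# Two independent samples (hypothetical data)
g1 = np.array([10.0, 12.0, 11.0, 9.0, 13.0])
g2 = np.array([8.0, 9.0, 10.0, 7.0, 11.0])

# Pooled variance weights each group's sample variance by its degrees of freedom
n1, n2 = len(g1), len(g2)
pooled_var = ((n1 - 1) * g1.var(ddof=1) + (n2 - 1) * g2.var(ddof=1)) / (n1 + n2 - 2)
d = (g1.mean() - g2.mean()) / np.sqrt(pooled_var)

print(f"Cohen's d = {d:.2f}")  # 1.26 -> "large" by the benchmarks above
```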


10. Survival and Time-to-Event Analysis

10.1 Why Standard Methods Fail for Survival Data

Two fundamental problems: censoring (some patients don’t experience the event during observation) and variable follow-up times. Standard regression cannot handle either. Survival analysis incorporates event occurrence AND time to event while handling censored observations.

10.2 Core Concepts

  • S(t): Survival function = P(T > t). Starts at 1.0, decreases over time.
  • h(t): Hazard function = instantaneous event rate at time t, given survival to t.
  • Censoring: Right censoring most common (event hasn’t occurred by end of observation). Must be non-informative.

10.3 Kaplan-Meier Estimator

Formula: S(t) = Π[1 − dⱼ/nⱼ] (product over all event times ≤ t)

Worked Example: 10 metastatic CRC patients. Estimated S(20 months) = 21.4%. Median survival ~11 months.

Reporting standard: Always include KM curve with CI bands + number-at-risk table + median survival (95% CI).
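The product-limit formula can be verified by hand; a minimal sketch on a small hypothetical sample (not the worked example above), stepping one subject at a time through the risk set:

```python
import numpy as np

# S(t) = prod(1 - d_j / n_j) over event times <= t.
# Hypothetical follow-up times in months; event 1 = death, 0 = censored.
# One subject per row, so each event contributes d_j = 1.
times  = np.array([3, 5, 5, 8, 10, 12, 15, 18, 20, 24])
events = np.array([1, 1, 0, 1,  1,  0,  1,  1,  0,  0])

s, at_risk = 1.0, len(times)
for t, e in zip(times, events):   # times already sorted ascending
    if e == 1:
        s *= 1 - 1 / at_risk      # survival drops only at event times
    at_risk -= 1                  # subject leaves the risk set (event or censored)

print(f"S(24 months) = {s:.3f}")  # 0.286 — censored subjects shrink the risk set
                                  # without dropping the curve
```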

R — KM Curves + Log-Rank

library(survival); library(survminer)

set.seed(42)
n <- 160
# Simulate KRAS WT vs MT survival
kras_group <- rep(c("KRAS WT", "KRAS MT"), each = n/2)
time_data <- c(rexp(n/2, rate = 1/18.4), rexp(n/2, rate = 1/9.8))
status <- rbinom(n, 1, 0.85)
df_surv <- data.frame(time = pmax(time_data, 0.1), status = status, group = kras_group)

surv_fit <- survfit(Surv(time, status) ~ group, data = df_surv)
survdiff(Surv(time, status) ~ group, data = df_surv)  # log-rank test

# KM plot with CI bands and number-at-risk table
ggsurvplot(
  surv_fit, data = df_surv,
  pval = TRUE, conf.int = TRUE,
  risk.table = TRUE, risk.table.height = 0.25,
  palette = c("#d73027", "#4575b4"),
  legend.labs = c("KRAS MT", "KRAS WT"),
  xlab = "Time (months)", ylab = "Overall Survival",
  title = "Kaplan-Meier: OS by KRAS Status",
  ggtheme = theme_bw(base_size = 12)
)

Python — KM Curves + Log-Rank

import numpy as np
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

rng = np.random.default_rng(42)
n = 80
wt_time = rng.exponential(18.4, n); wt_status = rng.binomial(1, 0.85, n)
mt_time = rng.exponential(9.8, n);  mt_status = rng.binomial(1, 0.85, n)

results = logrank_test(wt_time, mt_time,
                       event_observed_A=wt_status, event_observed_B=mt_status)
print(f"Log-rank p = {results.p_value:.4f}")

fig, ax = plt.subplots(figsize=(9, 6))
kmf_wt = KaplanMeierFitter(label="KRAS WT")
kmf_mt = KaplanMeierFitter(label="KRAS MT")
kmf_wt.fit(wt_time, wt_status)
kmf_mt.fit(mt_time, mt_status)
kmf_wt.plot_survival_function(ax=ax, ci_show=True, color='#4575b4')
kmf_mt.plot_survival_function(ax=ax, ci_show=True, color='#d73027')
ax.set_xlabel("Time (months)")
ax.set_ylabel("Survival probability")
ax.set_title(f"Kaplan-Meier: Overall Survival by KRAS Status\n"
             f"Log-rank p = {results.p_value:.4f}")
ax.set_ylim(0, 1.05)
ax.grid(alpha=0.3)
ax.legend()
plt.tight_layout()
plt.savefig("kaplan_meier.png", dpi=150, bbox_inches='tight')
plt.show()

10.4 Log-Rank Test

What it does: Tests whether the survival curves of 2+ groups are identical. Non-parametric. Most powerful when hazards are proportional; it can miss real differences when the curves cross.

Worked Example: KRAS WT vs MT colorectal cancer. Median OS 18.4 vs 9.8 months. χ²(1)=14.8, p<0.001.

10.5 Cox Proportional Hazards Regression

What it does: Semi-parametric model for time-to-event with multiple predictors. Output: Hazard Ratio (HR).

Model: h(t|X) = h₀(t) × exp(β₁X₁ + β₂X₂ + …)

Interpreting HR: 1.0=no effect; 2.0=twice the instantaneous hazard; 0.5=50% hazard reduction.

PH assumption check: Log-log plot, Schoenfeld residuals, Grambsch-Therneau test.

Worked Example: CKD dialysis initiation (n=280, 98 events). eGFR per 10 mL/min/1.73m²: HR=0.51, p<0.001. Diabetes: HR=2.03, p=0.001. Proteinuria per g/24h: HR=1.67, p<0.001.

R — Cox Regression + HR Forest Plot

library(survival); library(broom); library(tidyverse)  # tidyverse for the forest plot below

set.seed(42)
n <- 280
df_cox <- data.frame(
  time = rexp(n, 0.08),
  status = rbinom(n, 1, 0.35),
  age = rnorm(n, 62, 12),
  male = rbinom(n, 1, 0.55),
  egfr = rnorm(n, 38, 15),
  proteinuria = rexp(n, 0.5),
  diabetes = rbinom(n, 1, 0.35),
  hypertension = rbinom(n, 1, 0.65)
)

cox_model <- coxph(Surv(time, status) ~ age + male + egfr + proteinuria +
                     diabetes + hypertension, data = df_cox)
summary(cox_model)

# HR forest plot
tidy(cox_model, exponentiate = TRUE, conf.int = TRUE) |>
  mutate(term = c("Age (per yr)", "Male sex", "eGFR (per mL/min)",
                  "Proteinuria (per g/24h)", "Diabetes", "Hypertension")) |>
  ggplot(aes(x = estimate, y = fct_reorder(term, estimate))) +
  geom_vline(xintercept = 1, linetype = "dashed", colour = "grey40") +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.25, linewidth = 1) +
  geom_point(size = 4, colour = "#d73027") +
  scale_x_log10() +
  labs(title = "Cox Regression: Predictors of Dialysis Initiation",
       subtitle = "Hazard ratios with 95% CI (log scale)",
       x = "Hazard Ratio (log scale)", y = NULL) +
  theme_bw(base_size = 13)

Python — Cox Regression + HR Forest Plot

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from lifelines import CoxPHFitter

rng = np.random.default_rng(42)
n = 280
df = pd.DataFrame({
    'time':         rng.exponential(12, n),
    'status':       rng.binomial(1, 0.35, n),
    'age':          rng.normal(62, 12, n),
    'male':         rng.binomial(1, 0.55, n),
    'egfr':         rng.normal(38, 15, n),
    'proteinuria':  rng.exponential(2, n),
    'diabetes':     rng.binomial(1, 0.35, n),
    'hypertension': rng.binomial(1, 0.65, n)
})

cph = CoxPHFitter()
cph.fit(df, duration_col='time', event_col='status')
cph.print_summary()

# HR forest plot (labels follow the covariate order in cph.summary)
summary = cph.summary
hrs = summary['exp(coef)']
ci_lo = summary['exp(coef) lower 95%']
ci_hi = summary['exp(coef) upper 95%']
labels = ['Age', 'Male', 'eGFR', 'Proteinuria', 'Diabetes', 'Hypertension']

fig, ax = plt.subplots(figsize=(8, 5))
y_pos = range(len(labels))
ax.errorbar(hrs, y_pos, xerr=[hrs - ci_lo, ci_hi - hrs], fmt='o',
            color='#d73027', markersize=8, lw=1.5, capsize=5)
ax.axvline(1, color='grey', linestyle='--', lw=1.5)
ax.set_yticks(y_pos)
ax.set_yticklabels(labels)
ax.set_xscale('log')
ax.set_xlabel("Hazard Ratio (log scale)")
ax.set_title("Cox Regression: Predictors of Dialysis Initiation\nHR with 95% CI")
ax.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.savefig("cox_regression.png", dpi=150, bbox_inches='tight')
plt.show()

11. Multivariable Modelling Strategy

11.1 Univariate vs Multivariable Analysis

Step 1 — Univariate (crude) analysis: Each predictor tested individually. Reports crude ORs/HRs. Purpose: describe raw associations, identify candidates.

Step 2 — Multivariable (adjusted) analysis: Selected predictors entered simultaneously. Reports adjusted ORs/HRs. Purpose: identify independent predictors, control confounding.

The relationship between crude and adjusted estimates is clinically informative: attenuation of an effect after adjustment indicates positive confounding, while a variable that becomes significant only after adjustment was masked by negative confounding.

11.2 Confounding

A confounder: (1) associated with the exposure, (2) associated with the outcome, (3) NOT on the causal pathway.

Classic example: Coffee and lung cancer — smoking confounds both.

11.3 Variable Selection

  • EPV rule: Minimum ~10 events per predictor variable in logistic and Cox regression
  • Preferred approach: Hypothesis-driven selection (pre-specified, clinical knowledge)
  • Univariate screening threshold: p < 0.2 (not p < 0.05 — misses confounders)
  • Avoid: Stepwise selection as primary approach (capitalises on chance, biased SEs)
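The EPV rule above is simple arithmetic; a minimal sketch (the event count of 84 is hypothetical, not taken from any example in this guide):

```python
# Events-per-variable (EPV) check for a hypothetical logistic model
n_events = 84        # observed outcome events (hypothetical count)
epv_target = 10      # conventional minimum events per candidate predictor
max_predictors = n_events // epv_target
print(f"{n_events} events support at most ~{max_predictors} predictors")
# Note: each dummy level of a categorical variable and each interaction
# term counts as its own predictor toward this budget.
```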

11.4 Confounding — Worked Example and Plot

Emergency vs elective admission and in-hospital mortality (n=600). Crude OR=3.22 → Adjusted OR=1.87. Attenuation reflects age and comorbidity as confounders.

R — Crude vs Adjusted OR Comparison Plot

library(tidyverse); library(broom)
set.seed(42); n <- 600
age       <- rnorm(n, 62, 15)
charlson  <- rpois(n, 2)
emergency <- rbinom(n, 1, plogis(-2 + 0.03*age + 0.1*charlson))
logit_mort <- -5 + 0.7*emergency + 0.04*age + 0.2*charlson
mortality  <- rbinom(n, 1, plogis(logit_mort))
df_conf <- data.frame(mortality, emergency, age, charlson)

crude_model    <- glm(mortality ~ emergency, data = df_conf, family = binomial)
adjusted_model <- glm(mortality ~ emergency + age + charlson, data = df_conf, family = binomial)

bind_rows(
  tidy(crude_model, exponentiate = TRUE, conf.int = TRUE) |> mutate(model = "Crude"),
  tidy(adjusted_model, exponentiate = TRUE, conf.int = TRUE) |> mutate(model = "Adjusted")
) |>
  filter(term == "emergency") |>
  ggplot(aes(x = model, y = estimate, colour = model)) +
  geom_hline(yintercept = 1, linetype = "dashed", colour = "grey40") +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.2, linewidth = 1.2) +
  geom_point(size = 5) +
  scale_colour_manual(values = c("Crude" = "#d73027", "Adjusted" = "#4575b4")) +
  scale_y_log10() +
  annotate("text", x = 1.5, y = 2.8,
           label = "Attenuation from confounding\n(age + comorbidity)",
           hjust = 0.5, size = 3.8) +
  labs(title = "Crude vs Adjusted OR: Effect of Confounding",
       subtitle = "Emergency admission and in-hospital mortality (log scale)",
       y = "Odds Ratio (log scale)", x = NULL) +
  theme_bw(base_size = 13) + theme(legend.position = "none")

Python — Crude vs Adjusted OR Comparison

import numpy as np, pandas as pd, matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(42); n = 600
age = rng.normal(62, 15, n); charlson = rng.poisson(2, n)
emergency = rng.binomial(1, 1/(1 + np.exp(-(-2 + 0.03*age + 0.1*charlson))), n)
logit_m = -5 + 0.7*emergency + 0.04*age + 0.2*charlson
mortality = rng.binomial(1, 1/(1 + np.exp(-logit_m)), n)

X_crude = sm.add_constant(emergency)
X_adj   = sm.add_constant(np.column_stack([emergency, age, charlson]))
m_crude = sm.Logit(mortality, X_crude).fit(disp=False)
m_adj   = sm.Logit(mortality, X_adj).fit(disp=False)

models = ['Crude', 'Adjusted']
ors    = [np.exp(m_crude.params[1]), np.exp(m_adj.params[1])]
cis_lo = [np.exp(m_crude.conf_int()[1,0]), np.exp(m_adj.conf_int()[1,0])]
cis_hi = [np.exp(m_crude.conf_int()[1,1]), np.exp(m_adj.conf_int()[1,1])]
colours = ['#d73027', '#4575b4']

fig, ax = plt.subplots(figsize=(6, 6))
for i, (model, or_val, ci_l, ci_h, col) in enumerate(zip(models, ors, cis_lo, cis_hi, colours)):
    ax.errorbar(i, or_val, yerr=[[or_val - ci_l], [ci_h - or_val]], fmt='o',
                color=col, markersize=12, lw=2, capsize=8)
ax.axhline(1, linestyle='--', color='grey', lw=1.5)
ax.set_yscale('log'); ax.set_xticks([0, 1]); ax.set_xticklabels(models, fontsize=12)
ax.set_ylabel("Odds Ratio (log scale)", fontsize=12)
ax.set_title("Crude vs Adjusted OR: Emergency Admission\nConfounding by age and comorbidity")
ax.grid(axis='y', alpha=0.3); plt.tight_layout()
plt.savefig("confounding_or.png", dpi=150, bbox_inches='tight'); plt.show()

11.5 Propensity Score Methods

Purpose: In observational studies, treated and untreated patients differ systematically. Propensity score (PS) = predicted probability of receiving treatment, estimated from logistic regression.

Methods: PS matching (match treated to untreated on PS), PS stratification (quintiles), IPTW.

Worked Example: STEMI registry, DES vs BMS (n=800 each). Unmatched HR=0.51 (biased by confounding). After PSM (312 matched pairs): HR=0.74 (95% CI 0.55–0.99).

R

library(MatchIt)
set.seed(42); n <- 800
age   <- rnorm(n, 65, 12); grace <- rnorm(n, 130, 30); dm <- rbinom(n, 1, 0.35)
des   <- rbinom(n, 1, plogis(-1 + 0.02*(70 - age) + 0.01*(130 - grace) - 0.3*dm))
logit_m   <- -3 + 0.8*dm - 0.4*des + 0.02*age
mortality <- rbinom(n, 1, plogis(logit_m))
df_ps <- data.frame(mortality, des, age, grace, dm)

# PSM: nearest-neighbour matching with a 0.1 caliper
match_out <- matchit(des ~ age + grace + dm, data = df_ps,
                     method = "nearest", distance = "logit", caliper = 0.1)
summary(match_out)
matched_df <- match.data(match_out)
cat("Matched pairs:", nrow(matched_df)/2, "\n")

Python

import numpy as np, pandas as pd, matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(42); n = 800
age = rng.normal(65, 12, n); grace = rng.normal(130, 30, n); dm = rng.binomial(1, 0.35, n)
des = rng.binomial(1, 1/(1 + np.exp(-(-1 + 0.02*(70 - age) + 0.01*(130 - grace) - 0.3*dm))), n)
logit_m = -3 + 0.8*dm - 0.4*des + 0.02*age
mortality = rng.binomial(1, 1/(1 + np.exp(-logit_m)), n)

# Calculate propensity scores
X_ps = sm.add_constant(np.column_stack([age, grace, dm]))
ps_model = sm.Logit(des, X_ps).fit(disp=False)
ps_scores = ps_model.predict()

# PS distribution by group
fig, ax = plt.subplots(figsize=(7, 5))
ax.hist(ps_scores[des==1], bins=25, alpha=0.6, color='#4575b4',
        label='DES (n=%d)' % des.sum(), density=True)
ax.hist(ps_scores[des==0], bins=25, alpha=0.6, color='#d73027',
        label='BMS (n=%d)' % (n - des.sum()), density=True)
ax.set_xlabel("Propensity Score", fontsize=12); ax.set_ylabel("Density", fontsize=12)
ax.set_title("Propensity Score Distribution: DES vs BMS\nOverlap indicates matching feasibility")
ax.legend(); ax.grid(alpha=0.3)
plt.tight_layout(); plt.savefig("propensity_score.png", dpi=150, bbox_inches='tight')
plt.show()

12. Multivariate Methods

12.1 Terminology

Multivariable: Multiple predictors, ONE outcome (e.g., multiple regression).
Multivariate: Multiple outcomes simultaneously (MANOVA, PCA, factor analysis).

12.2 MANOVA

What it does: Tests whether groups differ on a combination of continuous outcomes simultaneously. Avoids multiple testing inflation and exploits correlations among outcomes.

Test statistics: Wilks’ Lambda (most common), Pillai’s Trace (most robust to violations).

Worked Example: Exercise modality vs VO₂max, 6MWD, resting HR (n=30/group, 4 groups). Pillai’s trace=0.52, F(9,342)=7.44, p<0.001.

R

library(tidyverse)
set.seed(42); n_per <- 30
groups <- rep(c("Control", "Aerobic", "Resistance", "Combined"), each = n_per)
vo2  <- c(rnorm(n_per, 28, 5),   rnorm(n_per, 36, 5),   rnorm(n_per, 31, 4),   rnorm(n_per, 38, 4))
walk <- c(rnorm(n_per, 420, 50), rnorm(n_per, 510, 45), rnorm(n_per, 460, 40), rnorm(n_per, 530, 40))
rhr  <- c(rnorm(n_per, 72, 10),  rnorm(n_per, 62, 8),   rnorm(n_per, 66, 8),   rnorm(n_per, 60, 7))
df_man <- data.frame(group = groups, vo2 = vo2, walk = walk, rhr = rhr)

manova_res <- manova(cbind(vo2, walk, rhr) ~ group, data = df_man)
summary(manova_res, test = "Pillai")
summary.aov(manova_res)   # Follow-up univariate ANOVAs

Python

import numpy as np, pandas as pd
from scipy import stats
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(42); n = 30
groups = ['Control', 'Aerobic', 'Resistance', 'Combined']
specs = {
    'vo2' : [(28, 5), (36, 5), (31, 4), (38, 4)],
    'walk': [(420, 50), (510, 45), (460, 40), (530, 40)],
    'rhr' : [(72, 10), (62, 8), (66, 8), (60, 7)]
}
df = pd.DataFrame({
    'group': np.repeat(groups, n),
    **{out: np.concatenate([rng.normal(m, s, n) for m, s in spec])
       for out, spec in specs.items()}
})

# MANOVA proper (reports Pillai's trace, Wilks' lambda, etc.)
print(MANOVA.from_formula('vo2 + walk + rhr ~ group', data=df).mv_test())

# Follow-up univariate ANOVAs
for out in ['vo2', 'walk', 'rhr']:
    f, p = stats.f_oneway(*[df.loc[df.group == g, out] for g in groups])
    print(f"{out}: F={f:.2f}, p={p:.4f}")

12.3 Principal Component Analysis (PCA)

What it does: Reduces many correlated variables into a smaller set of uncorrelated components capturing most of the variance.

Key outputs: Eigenvalues (retain >1 per Kaiser criterion), scree plot, factor loadings (>0.4 meaningful), % variance explained.

Worked Example: 8 metabolic biomarkers (n=300). 3 components retained (75% variance). PC1=“metabolic risk”, PC2=“blood pressure”, PC3=“lipid profile”.

R — Scree Plot + Biplot

library(tidyverse); library(factoextra)
set.seed(42); n <- 300
# Correlated biomarkers: waist, glucose, HDL, LDL, TG, SBP, DBP, insulin
Sigma <- matrix(c(
   1,   .65,  .55, -.40,  .55,  .25,  .22,  .60,
  .65,  1,    .45, -.35,  .48,  .20,  .18,  .55,
  .55,  .45,  1,   -.55,  .40,  .15,  .12,  .45,
 -.40, -.35, -.55,  1,   -.30,  .10,  .08, -.35,
  .55,  .48,  .40, -.30,  1,    .20,  .16,  .50,
  .25,  .20,  .15,  .10,  .20,  1,    .75,  .18,
  .22,  .18,  .12,  .08,  .16,  .75,  1,    .15,
  .60,  .55,  .45, -.35,  .50,  .18,  .15,  1), 8, 8)
library(MASS)
biomarkers <- mvrnorm(n, mu = rep(0, 8), Sigma = Sigma) |>
  as.data.frame() |>
  setNames(c("Waist", "Glucose", "HDL", "LDL", "TG", "SBP", "DBP", "Insulin"))

pca_result <- prcomp(biomarkers, scale. = TRUE)
summary(pca_result)
p1 <- fviz_eig(pca_result, addlabels = TRUE, ylim = c(0, 45),
               main = "Scree Plot: Metabolic Syndrome Biomarkers")
p2 <- fviz_pca_biplot(pca_result, repel = TRUE, col.var = "#d73027",
                      col.ind = "grey60", alpha.ind = 0.4,
                      title = "PCA Biplot: PC1 vs PC2")
library(ggpubr)
ggarrange(p1, p2, ncol = 2)

Python — Scree Plot + Biplot

import numpy as np, matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(42); n = 300
Sigma = np.array([
    [1, .65, .55, -.4, .55, .25, .22, .60],
    [.65, 1, .45, -.35, .48, .20, .18, .55],
    [.55, .45, 1, -.55, .40, .15, .12, .45],
    [-.4, -.35, -.55, 1, -.30, .10, .08, -.35],
    [.55, .48, .40, -.30, 1, .20, .16, .50],
    [.25, .20, .15, .10, .20, 1, .75, .18],
    [.22, .18, .12, .08, .16, .75, 1, .15],
    [.60, .55, .45, -.35, .50, .18, .15, 1]
])
L = np.linalg.cholesky(Sigma)
X = rng.standard_normal((n, 8)) @ L.T
cols = ['Waist', 'Glucose', 'HDL', 'LDL', 'TG', 'SBP', 'DBP', 'Insulin']

scaler = StandardScaler(); X_std = scaler.fit_transform(X)
pca = PCA(); pca.fit(X_std)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13, 5))

# Scree plot
n_components = 8
ax1.bar(range(1, n_components+1), pca.explained_variance_ratio_*100,
        color='#4575b4', alpha=0.8)
ax1.plot(range(1, n_components+1), np.cumsum(pca.explained_variance_ratio_*100),
         'ro-', linewidth=2, markersize=6, label='Cumulative %')
# Kaiser criterion: eigenvalue > 1 corresponds to > 100/8 = 12.5% variance here
ax1.axhline(100/n_components, color='grey', linestyle='--', lw=1,
            label='Kaiser threshold (eigenvalue = 1)')
ax1.set_xlabel("Principal Component"); ax1.set_ylabel("Variance explained (%)")
ax1.set_title("Scree Plot: Metabolic Biomarkers")
ax1.legend(); ax1.grid(alpha=0.3)

# Biplot
pc_scores = pca.transform(X_std)
loadings = pca.components_
ax2.scatter(pc_scores[:,0], pc_scores[:,1], color='grey', alpha=0.4, s=20)
for i, col in enumerate(cols):
    ax2.annotate('', xy=(loadings[0,i]*3, loadings[1,i]*3), xytext=(0, 0),
                 arrowprops=dict(arrowstyle='->', color='#d73027', lw=1.8))
    ax2.text(loadings[0,i]*3.3, loadings[1,i]*3.3, col, color='#d73027', fontsize=9)
ax2.set_xlabel(f"PC1 ({pca.explained_variance_ratio_[0]*100:.1f}%)")
ax2.set_ylabel(f"PC2 ({pca.explained_variance_ratio_[1]*100:.1f}%)")
ax2.set_title("PCA Biplot: PC1 vs PC2"); ax2.grid(alpha=0.3)
ax2.axhline(0, color='grey', lw=0.8); ax2.axvline(0, color='grey', lw=0.8)

plt.suptitle("PCA: Metabolic Syndrome Biomarkers (n=300)", fontsize=13)
plt.tight_layout(); plt.savefig("pca_analysis.png", dpi=150, bbox_inches='tight')
plt.show()

13. Mixed Models and Longitudinal Data

13.1 Why Standard ANOVA Is Insufficient for Longitudinal Data

Repeated measures ANOVA requires complete data, assumes compound symmetry, cannot handle time-varying covariates. Clinical trials commonly have 10–40% missing data.

Linear mixed effects (LME) models overcome these limitations: handle MAR missing data without imputation, flexible correlation structures, unequally-spaced occasions, individual trajectory modelling.

13.2 Linear Mixed Effects Models

Model: Y_ij = (β₀ + b₀ᵢ) + (β₁ + b₁ᵢ)×time_ij + β₂X_ij + ε_ij

  • Fixed effects: Population-average effects (reported in paper)
  • Random effects: Between-subject variability in intercepts and slopes

Worked Example: HbA1c RCT (n=120, 18% missing). Interaction term (treatment×time): β=−0.038%/month, p=0.001. Net additional reduction at 12 months: 0.46%.

R — LME with Random Slopes

library(lme4); library(lmerTest); library(tidyverse)
set.seed(42); n_pat <- 120; n_tp <- 4
timepoints <- c(0, 3, 6, 12)
df_lme <- expand_grid(id = 1:n_pat, time = timepoints) |>
  mutate(
    treatment = rep(c("Control", "Intervention"), each = n_pat/2)[id],
    b0i = rnorm(n_pat, 0, 0.4)[id],
    b1i = rnorm(n_pat, 0, 0.02)[id],
    hba1c = 8.2 + b0i +
      (-0.02 - 0.035*(treatment == "Intervention") + b1i)*time +
      rnorm(n(), 0, 0.35)
  ) |>
  filter(runif(n()) > 0.18)   # ~18% missing

model_lme <- lmer(hba1c ~ time*treatment + (1 + time | id), data = df_lme)
summary(model_lme)

# Plot: individual trajectories + group means
df_lme |>
  ggplot(aes(x = time, y = hba1c)) +
  geom_line(aes(group = id, colour = treatment), alpha = 0.2, linewidth = 0.5) +
  stat_summary(aes(colour = treatment, group = treatment, fill = treatment),
               fun.data = mean_cl_normal, geom = "ribbon", alpha = 0.25) +
  stat_summary(aes(colour = treatment, group = treatment), fun = mean,
               geom = "line", linewidth = 2) +
  stat_summary(aes(colour = treatment, group = treatment), fun = mean,
               geom = "point", size = 3) +
  scale_colour_manual(values = c("Control" = "#d73027", "Intervention" = "#4575b4")) +
  scale_fill_manual(values = c("Control" = "#d73027", "Intervention" = "#4575b4")) +
  scale_x_continuous(breaks = c(0, 3, 6, 12)) +
  labs(title = "Linear Mixed Model: HbA1c over 12 Months",
       subtitle = "Thin lines = individual patients; thick lines = mean ± 95% CI",
       x = "Time (months)", y = "HbA1c (%)", colour = "Group", fill = "Group") +
  theme_bw(base_size = 13) + theme(legend.position = "bottom")

Python — LME with Random Slopes

import numpy as np, pandas as pd, matplotlib.pyplot as plt
import statsmodels.formula.api as smf

rng = np.random.default_rng(42); n_pat = 120
timepoints = [0, 3, 6, 12]
rows = []
for i in range(n_pat):
    trt = "Intervention" if i >= n_pat//2 else "Control"
    b0i = rng.normal(0, 0.4)
    b1i = rng.normal(0, 0.02)
    slope = -0.02 - 0.035*(trt == "Intervention") + b1i
    for t in timepoints:
        if rng.random() > 0.18:   # ~18% missing
            hba1c = 8.2 + b0i + slope*t + rng.normal(0, 0.35)
            rows.append({'id': i, 'time': t, 'treatment': trt, 'hba1c': hba1c})
df = pd.DataFrame(rows)

model = smf.mixedlm("hba1c ~ time * treatment", df, groups=df["id"],
                    re_formula="~time").fit()
print(model.summary())

fig, ax = plt.subplots(figsize=(8, 6))
colours = {'Control': '#d73027', 'Intervention': '#4575b4'}
for i in df['id'].unique():
    pat_df = df[df['id'] == i].sort_values('time')
    trt = pat_df['treatment'].iloc[0]
    ax.plot(pat_df['time'], pat_df['hba1c'], color=colours[trt], alpha=0.15, lw=0.7)
for trt, col in colours.items():
    grp = df[df['treatment'] == trt].groupby('time')['hba1c']
    means = grp.mean(); sems = grp.sem()
    ax.fill_between(means.index, means - 1.96*sems, means + 1.96*sems,
                    color=col, alpha=0.25)
    ax.plot(means.index, means, 'o-', color=col, lw=2.5, markersize=7, label=trt)
ax.set_xticks([0, 3, 6, 12]); ax.set_xlabel("Time (months)"); ax.set_ylabel("HbA1c (%)")
ax.set_title("Linear Mixed Model: HbA1c over 12 Months\n"
             "Thin lines = individual patients; thick = mean ± 95% CI")
ax.legend(loc='upper right'); ax.grid(alpha=0.3)
plt.tight_layout(); plt.savefig("lme_longitudinal.png", dpi=150, bbox_inches='tight')
plt.show()

14. Diagnostic Test Evaluation

14.1 The 2×2 Table

All diagnostic statistics derive from comparing test result to gold standard:

             Disease +   Disease −   Total
Test +       TP          FP          TP+FP
Test −       FN          TN          FN+TN
Total        TP+FN       FP+TN       N

14.2 Sensitivity, Specificity, PPV, NPV

  • Sensitivity = TP/(TP+FN) — “SnNout”: high sensitivity + negative → rule OUT
  • Specificity = TN/(FP+TN) — “SpPin”: high specificity + positive → rule IN
  • PPV = TP/(TP+FP) — depends heavily on prevalence
  • NPV = TN/(FN+TN) — rises with lower prevalence

Critical point: PPV and NPV are NOT fixed test properties — they depend on disease prevalence. A test with 95% sensitivity and 95% specificity applied to 1% prevalence population has PPV = only 16%.

Worked Example: POC troponin for NSTEMI (n=500, prevalence=16%). Sensitivity=90.0%, Specificity=95.0%, PPV=77.4%, NPV=98.0%.
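The worked example's figures can be reproduced directly from the implied 2×2 counts (a quick sketch; the counts below are derived from n=500, prevalence 16%, sensitivity 90%, specificity 95%):

```python
# 2x2 counts implied by the worked example:
# 80 diseased (16% of 500), 420 non-diseased
TP, FN = 72, 8       # sensitivity 90% of 80
FP, TN = 21, 399     # specificity 95% of 420

sens = TP / (TP + FN)
spec = TN / (FP + TN)
ppv  = TP / (TP + FP)
npv  = TN / (FN + TN)
print(f"Sensitivity={sens:.1%}  Specificity={spec:.1%}  PPV={ppv:.1%}  NPV={npv:.1%}")
# → Sensitivity=90.0%  Specificity=95.0%  PPV=77.4%  NPV=98.0%
```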

14.3 PPV/NPV by Prevalence — Code

R

library(tidyverse)
sensitivity <- 0.90; specificity <- 0.95
prevalence <- seq(0.005, 0.50, by = 0.005)
tibble(
  prev = prevalence,
  ppv = (sensitivity*prev) / (sensitivity*prev + (1 - specificity)*(1 - prev)),
  npv = (specificity*(1 - prev)) / ((1 - sensitivity)*prev + specificity*(1 - prev))
) |>
  pivot_longer(c(ppv, npv), names_to = "measure", values_to = "value") |>
  mutate(measure = ifelse(measure == "ppv", "PPV", "NPV")) |>
  ggplot(aes(x = prev*100, y = value*100, colour = measure)) +
  geom_line(linewidth = 1.5) +
  geom_vline(xintercept = 16, linetype = "dashed", colour = "grey50") +
  annotate("text", x = 17, y = 50, label = "ED prevalence\n(16%)", hjust = 0, size = 3.5) +
  scale_colour_manual(values = c("PPV" = "#d73027", "NPV" = "#4575b4")) +
  labs(title = "PPV and NPV vs Disease Prevalence",
       subtitle = "POC Troponin: Sensitivity=90%, Specificity=95%",
       x = "Disease prevalence (%)", y = "Predictive value (%)", colour = NULL) +
  theme_bw(base_size = 13) + theme(legend.position = "bottom")

Python

import numpy as np, matplotlib.pyplot as plt

sensitivity = 0.90; specificity = 0.95
prev = np.linspace(0.005, 0.50, 200)
ppv = (sensitivity*prev) / (sensitivity*prev + (1 - specificity)*(1 - prev))
npv = (specificity*(1 - prev)) / ((1 - sensitivity)*prev + specificity*(1 - prev))

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(prev*100, ppv*100, color='#d73027', lw=2.5, label='PPV')
ax.plot(prev*100, npv*100, color='#4575b4', lw=2.5, label='NPV')
ax.axvline(16, color='grey', linestyle='--', lw=1.5)
ax.text(17, 50, "ED prevalence\n(16%)", fontsize=9)
ax.set_xlabel("Disease prevalence (%)", fontsize=12)
ax.set_ylabel("Predictive value (%)", fontsize=12)
ax.set_title("PPV and NPV vs Disease Prevalence\n"
             "POC Troponin: Sensitivity=90%, Specificity=95%", fontsize=12)
ax.legend(loc='center right'); ax.grid(alpha=0.3)
plt.tight_layout(); plt.savefig("ppv_npv_prevalence.png", dpi=150, bbox_inches='tight')
plt.show()

14.4 ROC Curves and AUC

Plots sensitivity (y) vs 1-specificity (x) across all cutpoints. AUC: 0.5=chance; 0.7–0.8=acceptable; 0.8–0.9=excellent; >0.9=outstanding.

Worked Example: eGFR alone (AUC=0.73) vs clinical risk score (AUC=0.84) for predicting dialysis. DeLong’s test p=0.003.

R — ROC Comparison

library(pROC)
set.seed(42); n <- 350
dialysis   <- rbinom(n, 1, 0.30)
egfr_score <- rnorm(n, 0, 1) + dialysis*1.2
risk_score <- rnorm(n, 0, 1) + dialysis*2.0

roc_egfr <- roc(dialysis, egfr_score, quiet = TRUE)
roc_risk <- roc(dialysis, risk_score, quiet = TRUE)
roc.test(roc_egfr, roc_risk)   # DeLong's test

plot(roc_egfr, col = "#d73027", lwd = 2, main = "ROC Comparison: eGFR vs Clinical Score")
plot(roc_risk, col = "#4575b4", lwd = 2, add = TRUE)
legend("bottomright",
       legend = c(sprintf("eGFR alone (AUC=%.2f)", auc(roc_egfr)),
                  sprintf("Clinical score (AUC=%.2f)", auc(roc_risk))),
       col = c("#d73027", "#4575b4"), lwd = 2)
abline(0, 1, lty = 2, col = "grey")

Python — ROC Comparison

import numpy as np, matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(42); n = 350
dialysis = rng.binomial(1, 0.30, n)
egfr_score = rng.normal(0, 1, n) + dialysis*1.2
risk_score = rng.normal(0, 1, n) + dialysis*2.0

fpr1, tpr1, _ = roc_curve(dialysis, egfr_score); auc1 = auc(fpr1, tpr1)
fpr2, tpr2, _ = roc_curve(dialysis, risk_score); auc2 = auc(fpr2, tpr2)

fig, ax = plt.subplots(figsize=(7, 6))
ax.plot(fpr1, tpr1, color='#d73027', lw=2.5, label=f"eGFR alone (AUC={auc1:.2f})")
ax.plot(fpr2, tpr2, color='#4575b4', lw=2.5, label=f"Clinical score (AUC={auc2:.2f})")
ax.plot([0, 1], [0, 1], '--', color='grey', lw=1.5)
ax.fill_between(fpr2, tpr2, alpha=0.08, color='#4575b4')
ax.set_xlabel("1 - Specificity (FPR)", fontsize=12)
ax.set_ylabel("Sensitivity (TPR)", fontsize=12)
ax.set_title("ROC Curves: eGFR vs Clinical Risk Score\nDeLong's test for AUC comparison")
ax.legend(loc='lower right'); ax.grid(alpha=0.3)
plt.tight_layout(); plt.savefig("roc_comparison.png", dpi=150, bbox_inches='tight')
plt.show()

15. Agreement and Reliability

15.1 Cohen’s Kappa

Formula: κ = (Po − Pe) / (1 − Pe)

Interpreting κ: <0=poor; 0–0.20=slight; 0.21–0.40=fair; 0.41–0.60=moderate; 0.61–0.80=substantial; 0.81–1.00=almost perfect.

Worked Example: Two radiologists, 120 chest X-rays (normal/consolidation/interstitial). Po=0.883, Pe=0.351, κ=0.82 (95% CI 0.73–0.91). Almost perfect agreement.
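Plugging the worked example's agreement figures straight into the formula:

```python
# Cohen's kappa from observed (Po) and chance-expected (Pe) agreement
Po, Pe = 0.883, 0.351
kappa = (Po - Pe) / (1 - Pe)
print(f"kappa = {kappa:.2f}")   # → kappa = 0.82
```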

R

library(irr)
set.seed(42); n <- 120
rad1 <- sample(c("Normal", "Consol", "Interstitial"), n, replace = TRUE,
               prob = c(0.45, 0.275, 0.275))
rad2 <- ifelse(runif(n) < 0.88, rad1,
               sample(c("Normal", "Consol", "Interstitial"), n, replace = TRUE,
                      prob = c(0.43, 0.28, 0.28)))
kappa2(data.frame(rad1, rad2))
table(rad1, rad2)

Python

import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(42); n = 120
cats = ['Normal', 'Consol', 'Interstitial']
rad1 = rng.choice(cats, n, p=[0.45, 0.275, 0.275])
rad2 = np.where(rng.random(n) < 0.88, rad1, rng.choice(cats, n, p=[0.43, 0.28, 0.28]))
kappa = cohen_kappa_score(rad1, rad2)
print(f"Cohen's κ = {kappa:.3f}")

15.2 Bland-Altman Analysis

What it does: Assesses agreement between two continuous measurement methods. Plots difference (Method A − Method B) against their mean.

Key outputs:

  • Bias: Mean difference. Non-zero = systematic over/under-estimation.
  • Limits of Agreement (LOA): Bias ± 1.96 × SD of differences. Contains 95% of individual differences.

Why NOT Pearson r for method comparison: Pearson measures association, not agreement. Two methods could be highly correlated but systematically disagree.
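A short simulation makes the point concrete (hypothetical numbers: one method reading a constant 12 units higher than the other):

```python
import numpy as np

rng = np.random.default_rng(1)
method_a = rng.normal(120, 15, 100)                 # hypothetical reference readings
method_b = method_a + 12 + rng.normal(0, 2, 100)    # correlated but biased +12 units

r = np.corrcoef(method_a, method_b)[0, 1]
bias = np.mean(method_b - method_a)
print(f"Pearson r = {r:.3f}, mean bias = {bias:.1f} units")
# Near-perfect correlation despite systematic disagreement of ~12 units —
# correlation cannot detect the offset; Bland-Altman can.
```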

Worked Example: AOBP vs intra-arterial SBP (n=50). Bias=−3.8 mmHg. LOA: −19.9 to +12.3 mmHg. Too wide for clinical substitution (need ±5 mmHg precision).

R — Bland-Altman Plot

library(tidyverse)
set.seed(42); n <- 50
ia_sbp <- rnorm(n, 128.4, 20)
aobp   <- ia_sbp - rnorm(n, 3.8, 8.2)
df_ba <- tibble(ia = ia_sbp, aobp = aobp,
                mean_val = (ia + aobp)/2, diff = aobp - ia)
bias <- mean(df_ba$diff); sd_diff <- sd(df_ba$diff)
loa_up <- bias + 1.96*sd_diff; loa_lo <- bias - 1.96*sd_diff

ggplot(df_ba, aes(x = mean_val, y = diff)) +
  geom_point(colour = "#4575b4", size = 2.5, alpha = 0.75) +
  geom_hline(yintercept = bias, colour = "firebrick", linewidth = 1.2) +
  geom_hline(yintercept = loa_up, colour = "firebrick", linetype = "dashed") +
  geom_hline(yintercept = loa_lo, colour = "firebrick", linetype = "dashed") +
  geom_hline(yintercept = 0, colour = "grey50", linetype = "dotted") +
  annotate("text", x = max(df_ba$mean_val), y = bias + 0.5,
           label = sprintf("Bias = %.1f mmHg", bias), hjust = 1,
           colour = "firebrick", size = 3.5) +
  annotate("text", x = max(df_ba$mean_val), y = loa_up + 0.5,
           label = sprintf("+1.96 SD = %.1f", loa_up), hjust = 1,
           colour = "firebrick", size = 3.2) +
  annotate("text", x = max(df_ba$mean_val), y = loa_lo - 0.5,
           label = sprintf("-1.96 SD = %.1f", loa_lo), hjust = 1,
           colour = "firebrick", size = 3.2) +
  labs(title = "Bland-Altman Analysis: AOBP vs Intra-Arterial SBP",
       subtitle = "LOA = −19.9 to +12.3 mmHg; too wide for clinical substitution",
       x = "Mean of AOBP and IA SBP (mmHg)",
       y = "Difference: AOBP − IA SBP (mmHg)") +
  theme_bw(base_size = 13)

Python — Bland-Altman Plot

import numpy as np, matplotlib.pyplot as plt

rng = np.random.default_rng(42); n = 50
ia_sbp = rng.normal(128.4, 20, n)
aobp   = ia_sbp - rng.normal(3.8, 8.2, n)
mean_m = (ia_sbp + aobp) / 2
diff   = aobp - ia_sbp
bias = np.mean(diff); sd_d = np.std(diff, ddof=1)
loa_up = bias + 1.96*sd_d; loa_lo = bias - 1.96*sd_d

fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(mean_m, diff, color='#4575b4', s=50, alpha=0.75, zorder=3)
ax.axhline(bias, color='firebrick', lw=1.8, label=f"Bias = {bias:.1f} mmHg")
ax.axhline(loa_up, color='firebrick', lw=1.5, linestyle='--',
           label=f"+1.96 SD = {loa_up:.1f} mmHg")
ax.axhline(loa_lo, color='firebrick', lw=1.5, linestyle='--',
           label=f"-1.96 SD = {loa_lo:.1f} mmHg")
ax.axhline(0, color='grey', lw=0.8, linestyle=':')
ax.set_xlabel("Mean of AOBP and IA SBP (mmHg)", fontsize=12)
ax.set_ylabel("Difference: AOBP − IA SBP (mmHg)", fontsize=12)
ax.set_title("Bland-Altman: AOBP vs Intra-Arterial SBP\n"
             f"Bias={bias:.1f} mmHg, LOA: {loa_lo:.1f} to {loa_up:.1f} mmHg", fontsize=12)
ax.legend(loc='upper right', fontsize=9); ax.grid(alpha=0.3)
plt.tight_layout(); plt.savefig("bland_altman.png", dpi=150, bbox_inches='tight')
plt.show()

16. Bayesian Methods

16.1 Frequentist vs Bayesian Framework

Frequentist: Parameters are fixed constants. P-value = P(data at least this extreme | H₀). Cannot make probability statements about parameters.

Bayesian: Parameters have probability distributions. Direct probability statements: “P(effect > 0 | data) = 0.97”. Uses Bayes’ theorem:

Posterior ∝ Prior × Likelihood P(θ|data) ∝ P(data|θ) × P(θ)

16.2 Credible Intervals vs Confidence Intervals

  • 95% CI (frequentist): 95% of such intervals in repeated sampling contain the true value. NOT a probability statement about this interval.
  • 95% CrI (Bayesian): There IS a 95% probability the true parameter lies within this interval (given prior + data). The natural, intuitive interpretation.
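The frequentist coverage interpretation can be checked by simulation — a sketch with an arbitrary true mean and SD, showing that roughly 95% of such intervals capture the truth:

```python
import numpy as np
from scipy import stats

# Repeated-sampling coverage of the frequentist 95% t-interval
rng = np.random.default_rng(42)
true_mu, n_sims, n = 100.0, 2000, 30
covered = 0
for _ in range(n_sims):
    x = rng.normal(true_mu, 15, n)
    lo, hi = stats.t.interval(0.95, n - 1, loc=x.mean(), scale=stats.sem(x))
    covered += (lo <= true_mu <= hi)
print(f"Empirical coverage: {covered/n_sims:.1%}")   # ≈ 95%
```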

16.3 Bayesian Analysis — Worked Example

Pilot RCT (n=30): ACR50 = 7/15 (47%) active vs 3/15 (20%) placebo. Frequentist OR=3.5, p=0.12 (NS by convention). Bayesian posterior OR=3.2 (95% CrI 1.02–10.4). P(OR>1|data)=0.96. Supports proceeding to Phase III.

R — Prior to Posterior Visualisation

library(tidyverse)
# Beta-binomial conjugate analysis
# Active: 7/15; Placebo: 3/15; Prior: Beta(1,1) = uniform
x <- seq(0.001, 0.999, length.out = 500)
post_A <- dbeta(x, 8, 9)    # Active posterior:  Beta(1+7, 1+8) = Beta(8,9)
post_P <- dbeta(x, 4, 13)   # Placebo posterior: Beta(1+3, 1+12) = Beta(4,13)
prior  <- dbeta(x, 1, 1)

tibble(x = x, Prior = prior,
       `Active (posterior)` = post_A,
       `Placebo (posterior)` = post_P) |>
  pivot_longer(-x, names_to = "Distribution", values_to = "density") |>
  ggplot(aes(x = x, y = density, colour = Distribution)) +
  geom_line(linewidth = 1.3) +
  scale_colour_manual(values = c("Prior" = "grey60",
                                 "Active (posterior)" = "#4575b4",
                                 "Placebo (posterior)" = "#d73027")) +
  labs(title = "Bayesian Analysis: Prior to Posterior for ACR50 Response",
       subtitle = "Beta(1,1) uniform prior → updated with observed data",
       x = "Probability of ACR50 response", y = "Posterior density", colour = NULL) +
  theme_bw(base_size = 13) + theme(legend.position = "bottom")

Python — Prior to Posterior

import numpy as np, matplotlib.pyplot as plt
from scipy.stats import beta as beta_dist

x = np.linspace(0.001, 0.999, 500)
prior  = beta_dist.pdf(x, 1, 1)         # Uniform prior Beta(1,1)
post_A = beta_dist.pdf(x, 1+7, 1+8)     # Active: 7/15
post_P = beta_dist.pdf(x, 1+3, 1+12)    # Placebo: 3/15

fig, ax = plt.subplots(figsize=(9, 5))
ax.plot(x, prior, color='grey', lw=2, linestyle='--', label='Prior: Beta(1,1) — uniform')
ax.plot(x, post_P, color='#d73027', lw=2.5, label='Placebo posterior: Beta(4,13)')
ax.plot(x, post_A, color='#4575b4', lw=2.5, label='Active posterior: Beta(8,9)')

# Shade 95% CrI for the active arm
ci_lo, ci_hi = beta_dist.ppf(0.025, 8, 9), beta_dist.ppf(0.975, 8, 9)
ax.fill_between(x, 0, post_A, where=(x >= ci_lo) & (x <= ci_hi),
                color='#4575b4', alpha=0.18,
                label=f'Active 95% CrI: {ci_lo:.2f}–{ci_hi:.2f}')
ax.set_xlabel("Probability of ACR50 response", fontsize=12)
ax.set_ylabel("Posterior density", fontsize=12)
ax.set_title("Bayesian Analysis: Prior to Posterior\nUniform prior updated with pilot RCT data",
             fontsize=12)
ax.legend(fontsize=9); ax.grid(alpha=0.3)
plt.tight_layout(); plt.savefig("bayesian_posterior.png", dpi=150, bbox_inches='tight')
plt.show()
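Posterior probabilities like the P(OR>1|data)=0.96 quoted in 16.3 can be estimated by Monte Carlo from the same conjugate posteriors. A sketch (since OR>1 is equivalent to the active response rate exceeding the placebo rate, we compare posterior draws directly; the exact value depends on the prior and simulation, so it will be in the vicinity of, not identical to, the quoted figure):

```python
import numpy as np

# Monte Carlo: P(active response rate > placebo rate | data)
rng = np.random.default_rng(42)
p_active  = rng.beta(8, 9, 100_000)    # Active posterior:  Beta(1+7, 1+8)
p_placebo = rng.beta(4, 13, 100_000)   # Placebo posterior: Beta(1+3, 1+12)
prob = np.mean(p_active > p_placebo)
print(f"P(p_active > p_placebo | data) = {prob:.2f}")
```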

17. Meta-Analysis and Systematic Review

17.1 The Evidence Hierarchy

Meta-analysis of RCTs sits at the top of the evidence hierarchy. It pools quantitative data to produce a single, more precise effect estimate, identifies heterogeneity, and improves generalisability.

17.2 Fixed vs Random Effects Models

Fixed effects: All studies estimate the same true effect. Between-study variation = sampling error only. More weight to larger studies.

Random effects (DerSimonian-Laird): Studies estimate different but related true effects. Between-study variability (τ²) estimated and incorporated. Wider, more honest CIs.

Choosing: If I² < 25%, either model reasonable. If I² ≥ 50%, random effects appropriate.

17.3 Heterogeneity

  • I² = 0–25%: Low; 26–50%: Moderate; 51–75%: Substantial; >75%: Considerable
  • Cochran’s Q: Underpowered with few studies
  • Investigate with subgroup analyses and meta-regression
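The fixed-effect, DerSimonian-Laird, Q and I² machinery above is short enough to compute by hand. A sketch using the six ORs and 95% CIs from the worked example in 17.5 (approximate inputs read from the forest plot, so the pooled value only approximates the Mantel-Haenszel OR of 0.81 quoted there):

```python
import numpy as np

# Study-level ORs and 95% CIs (from the ACE-inhibitor example in 17.5)
or_vals = np.array([0.56, 0.81, 0.87, 0.84, 0.89, 0.73])
ci_lo   = np.array([0.31, 0.68, 0.58, 0.61, 0.70, 0.57])
ci_hi   = np.array([0.99, 0.96, 1.30, 1.16, 1.14, 0.94])

y  = np.log(or_vals)                            # log odds ratios
se = (np.log(ci_hi) - np.log(ci_lo)) / (2*1.96) # SE back-calculated from CI
w  = 1 / se**2                                  # inverse-variance FE weights

theta_fe = np.sum(w*y) / np.sum(w)              # fixed-effect pooled log-OR
Q  = np.sum(w * (y - theta_fe)**2)              # Cochran's Q
df = len(y) - 1
tau2 = max(0, (Q - df) / (np.sum(w) - np.sum(w**2)/np.sum(w)))  # DL estimator
w_re = 1 / (se**2 + tau2)                       # random-effects weights
theta_re = np.sum(w_re*y) / np.sum(w_re)
I2 = max(0, (Q - df) / Q) * 100 if Q > 0 else 0

print(f"Fixed-effect OR   = {np.exp(theta_fe):.2f}")
print(f"Random-effects OR = {np.exp(theta_re):.2f}, tau^2 = {tau2:.3f}, I^2 = {I2:.0f}%")
```

When τ² = 0 the two models coincide, which is why low-I² meta-analyses give near-identical fixed- and random-effects estimates.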

17.4 Publication Bias

Funnel plot: Effect size vs precision — asymmetry suggests bias. Egger’s test: Formal test for funnel asymmetry. Trim-and-fill: Imputes missing studies to restore symmetry.

17.5 Worked Example + Code

ACE inhibitors in heart failure (6 RCTs). Pooled OR=0.81 (95% CI 0.74–0.89). I²=18% (low). NNT≈28.

R — Forest Plot + Funnel Plot

library(meta)
# ACE inhibitor meta-analysis
meta_data <- data.frame(
  study   = c("CONSENSUS 1987", "SOLVD-T 1991", "ATLAS 1999",
              "V-HeFT II 1991", "MERIT-HF 1999", "CIBIS-II 1999"),
  event_t = c(29, 386, 45, 117, 128, 119),
  n_t     = c(127, 1285, 1568, 403, 1990, 1327),
  event_c = c(44, 452, 52, 131, 145, 156),
  n_c     = c(126, 1284, 1596, 403, 2001, 1320)
)
m <- metabin(event.e = event_t, n.e = n_t, event.c = event_c, n.c = n_c,
             studlab = study, data = meta_data, sm = "OR", method = "MH",
             fixed = TRUE, random = TRUE,
             title = "ACE Inhibitors: CV Mortality in Heart Failure")
forest(m, layout = "JAMA")
funnel(m, studlab = TRUE, main = "Funnel Plot: ACE Inhibitors in HF")

Python — Forest Plot + Funnel Plot

import numpy as np, matplotlib.pyplot as plt

studies = ["CONSENSUS 1987", "SOLVD-T 1991", "ATLAS 1999",
           "V-HeFT II 1991", "MERIT-HF 1999", "CIBIS-II 1999"]
or_vals = np.array([0.56, 0.81, 0.87, 0.84, 0.89, 0.73])
ci_lo   = np.array([0.31, 0.68, 0.58, 0.61, 0.70, 0.57])
ci_hi   = np.array([0.99, 0.96, 1.30, 1.16, 1.14, 0.94])
n_vals  = np.array([253, 2569, 3164, 806, 3991, 2647])
pooled_or = 0.81; pooled_lo = 0.74; pooled_hi = 0.89

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# Forest plot
y_pos = np.arange(len(studies), 0, -1)
weights = n_vals / n_vals.sum()
sq_sizes = weights * 300 + 50
ax1.scatter(or_vals, y_pos, s=sq_sizes, color='#4575b4', zorder=3, marker='s')
ax1.errorbar(or_vals, y_pos, xerr=[or_vals - ci_lo, ci_hi - or_vals],
             fmt='none', color='#4575b4', lw=1.5, capsize=4)
ax1.scatter([pooled_or], [0], s=400, color='#d73027', marker='D', zorder=4,
            label=f"Pooled OR={pooled_or:.2f}")
ax1.errorbar([pooled_or], [0], xerr=[[pooled_or - pooled_lo], [pooled_hi - pooled_or]],
             fmt='none', color='#d73027', lw=2, capsize=6)
ax1.axvline(1, color='grey', linestyle='--', lw=1.5)
ax1.set_yticks(list(y_pos) + [0])
ax1.set_yticklabels(studies + ['Pooled (FE)'], fontsize=9)
ax1.set_xlabel("Odds Ratio (95% CI)"); ax1.set_xscale('log')
ax1.set_title("Forest Plot: ACE Inhibitors in Heart Failure\n"
              "Pooled OR=0.81, 95% CI 0.74–0.89, I²=18%")
ax1.legend(loc='lower right'); ax1.grid(axis='x', alpha=0.3)

# Funnel plot
se_vals = (np.log(ci_hi) - np.log(ci_lo)) / (2*1.96)
pooled_log_or = np.log(pooled_or)
se_range = np.linspace(0, max(se_vals)*1.2, 100)
ax2.scatter(or_vals, se_vals, color='#4575b4', s=80, zorder=3)
for i, s in enumerate(studies):
    ax2.annotate(s.split()[0], (or_vals[i], se_vals[i]),
                 textcoords="offset points", xytext=(5, -3), fontsize=7)
ax2.axvline(pooled_or, color='grey', linestyle='--', lw=1.5)
ax2.fill_betweenx(se_range, np.exp(pooled_log_or - 1.96*se_range),
                  np.exp(pooled_log_or + 1.96*se_range),
                  color='lightblue', alpha=0.3, label='95% CI funnel')
ax2.invert_yaxis()
ax2.set_xlabel("Odds Ratio"); ax2.set_ylabel("Standard Error")
ax2.set_xscale('log')
ax2.set_title("Funnel Plot: Publication Bias Assessment\nSymmetry suggests low publication bias")
ax2.legend(); ax2.grid(alpha=0.3)

plt.suptitle("Meta-Analysis: ACE Inhibitors and CV Mortality in Heart Failure", fontsize=12)
plt.tight_layout(); plt.savefig("meta_analysis.png", dpi=150, bbox_inches='tight')
plt.show()

18. Reporting Standards and Checklists

18.1 General Reporting Principles

  1. Report the test used, test statistic, degrees of freedom, and exact p-value. Write t(48)=3.14, p=0.003 — not “p<0.05” or “NS”.
  2. Report effect sizes with confidence intervals for all primary outcomes. P-values alone are insufficient.
  3. Report sample sizes at every step. Account for all patients from enrolment to analysis.
  4. For continuous outcomes, report whether parametric or non-parametric tests were used and why.
  5. For non-parametric tests, report median (IQR), not mean (SD).
  6. Report model fit statistics — R²/adjusted R² for linear regression; Hosmer-Lemeshow p, AUC for logistic regression.
  7. Check and report assumption testing — Shapiro-Wilk, Levene’s, Mauchly’s sphericity, PH assumption (Cox).
  8. Distinguish pre-specified from exploratory analyses.
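Principle 1 is easy to automate. A minimal sketch (simulated two-arm data; scipy assumed) that builds a report string with the exact statistic, degrees of freedom, and p-value, falling back to "p<0.001" only below that floor:

```python
import numpy as np
from scipy import stats

# Simulated two-arm data, n=25 per arm -- illustrative only
rng = np.random.default_rng(7)
a = rng.normal(0.6, 1.0, 25)
b = rng.normal(0.0, 1.0, 25)

t, p = stats.ttest_ind(a, b)   # Student's t (pooled variance)
df = len(a) + len(b) - 2       # df for the pooled-variance t-test

# Exact p-value, never "p<0.05" or "NS"
report = f"t({df})={t:.2f}, p={p:.3f}" if p >= 0.001 else f"t({df})={t:.2f}, p<0.001"
print(report)
```

Building the string programmatically avoids the rounding and transcription errors that creep in when results are copied by hand.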

18.2 Reporting Checklists

| Study type | Checklist | URL / note |
|---|---|---|
| RCT | CONSORT | www.consort-statement.org |
| Observational cohort/case-control | STROBE | www.strobe-statement.org |
| Diagnostic accuracy | STARD | www.equator-network.org |
| Systematic review/meta-analysis | PRISMA | www.prisma-statement.org |
| Prognostic model | TRIPOD | www.tripod-statement.org |
| Survival analysis (tumour marker studies) | REMARK | — |

18.3 Specimen Results Sentences

Randomised Trial (t-test):

“The primary outcome, change in HbA1c from baseline to 6 months, was significantly greater in the intervention group than in the control group (−0.82% vs −0.31%; mean difference −0.51%, 95% CI −0.78 to −0.24%; independent samples t-test: t(178)=−3.74, p<0.001).”

Survival analysis:

“Median progression-free survival was 11.2 months (95% CI 8.6–13.8) in the experimental arm and 7.4 months (95% CI 5.9–8.9) in the control arm. The experimental treatment was associated with a 38% reduction in the hazard of progression or death (HR 0.62, 95% CI 0.48–0.80; log-rank p<0.001).”
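The numbers in a sentence like this are internally linked: the percent reduction is 100×(1−HR), and the standard error of the log hazard ratio can be recovered from the reported CI. A quick consistency check, using the values from the specimen sentence:

```python
import math

hr, lo, hi = 0.62, 0.48, 0.80        # HR and 95% CI from the report
reduction = (1 - hr) * 100           # percent reduction in hazard

# SE of log HR back-calculated from the CI (CI width = 2 * 1.96 * SE on log scale)
se = (math.log(hi) - math.log(lo)) / (2 * 1.96)
z = math.log(hr) / se                # Wald z on the log-hazard scale
print(f"{reduction:.0f}% reduction, z={z:.2f}")
```

Checks like this are useful when reviewing manuscripts: a reported HR, CI, and p-value that cannot be reconciled this way usually signal a transcription error.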

Logistic regression:

“On multivariable logistic regression, prior hospitalisation in the previous year (OR 2.84, 95% CI 1.63–4.95, p<0.001) and home oxygen use (OR 1.93, 95% CI 1.09–3.42, p=0.025) were independently associated with 30-day readmission after adjustment for age, FEV₁%, and eosinophil count. The model demonstrated acceptable discrimination (C-statistic 0.72) and good calibration (Hosmer-Lemeshow p=0.64).”
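Logistic coefficients live on the log-odds scale; the ORs and CIs in a sentence like this come from exponentiating β ± 1.96×SE. A sketch using the prior-hospitalisation estimate above, with the SE back-calculated from the reported CI:

```python
import math

beta = math.log(2.84)                                  # coefficient on log-odds scale
se = (math.log(4.95) - math.log(1.63)) / (2 * 1.96)    # SE recovered from the 95% CI

or_point = math.exp(beta)
ci = (math.exp(beta - 1.96 * se), math.exp(beta + 1.96 * se))
print(f"OR {or_point:.2f}, 95% CI {ci[0]:.2f}-{ci[1]:.2f}")
```

In practice the β and SE come straight from the fitted model (e.g. statsmodels `Logit` results); only the exponentiation step shown here is needed for reporting.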


Appendix: Quick Reference Tables

A1. Choosing the Right Test — Complete Reference

| Research question | Outcome type | Predictor type | Groups/samples | Test |
|---|---|---|---|---|
| Is mean different from reference? | Continuous | None | 1 group | One-sample t / Wilcoxon |
| Is proportion different from reference? | Binary | None | 1 group | One-proportion z-test |
| Does categorical distribution match expected? | Categorical | None | 1 group | Chi-square GoF |
| Are two independent group means different? | Continuous | Binary | 2 indep. | Student’s t / Welch’s t / Mann-Whitney U |
| Are two paired measurements different? | Continuous | Time (2 pts) | 2 paired | Paired t / Wilcoxon signed-rank |
| Are two paired binary proportions different? | Binary | Time (2 pts) | 2 paired | McNemar’s test |
| Are 3+ independent group means different? | Continuous | Categorical | 3+ indep. | One-way ANOVA / Kruskal-Wallis |
| Are 3+ repeated measures different? | Continuous | Time (3+ pts) | 3+ paired | RM-ANOVA / Friedman |
| Is there a categorical association? | Categorical | Categorical | Indep. | Chi-square / Fisher’s exact |
| Is there a linear association? | Continuous | Continuous | — | Pearson r / Spearman ρ |
| Predict continuous outcome | Continuous | Mixed | — | Linear regression |
| Predict binary outcome | Binary | Mixed | — | Logistic regression |
| Predict time-to-event | Time-to-event | Mixed | — | Cox proportional hazards |
| Predict count outcome | Count | Mixed | — | Poisson regression |
| Compare survival curves | Time-to-event | Categorical | 2+ indep. | KM + log-rank |
| Multiple continuous outcomes | Continuous | Categorical | 2+ groups | MANOVA |
| Reduce many correlated variables | Continuous | None | — | PCA / Factor analysis |
| Repeated measures with missing data | Continuous | Mixed | 3+ time pts | Linear mixed effects model |
| Agreement between two raters (categorical) | Categorical | — | 2 raters | Cohen’s kappa |
| Agreement between two continuous methods | Continuous | — | 2 methods | Bland-Altman analysis |
| Diagnostic test evaluation | Binary | Continuous/ordinal | — | ROC, sensitivity/specificity |

A2. Non-Parametric Equivalents

| Parametric test | Non-parametric equivalent | Use when |
|---|---|---|
| One-sample t-test | One-sample Wilcoxon | Non-normal, n<30 |
| Independent t-test | Mann-Whitney U | Non-normal, ordinal, n<30 |
| Paired t-test | Wilcoxon signed-rank | Non-normal differences, ordinal |
| One-way ANOVA | Kruskal-Wallis | Non-normal groups, ordinal outcome |
| Repeated measures ANOVA | Friedman test | Non-normal, ordinal, repeated |
| Pearson correlation | Spearman correlation | Non-normal, ordinal, outliers |
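Each pair in this table maps to a scipy call. A minimal sketch on simulated right-skewed data (scipy assumed), showing the parametric test next to its non-parametric equivalent:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.lognormal(0.0, 1.0, 30)   # skewed: t-test normality assumption dubious
y = rng.lognormal(0.6, 1.0, 30)

p_t = stats.ttest_ind(x, y, equal_var=False).pvalue   # Welch's t
p_u = stats.mannwhitneyu(x, y).pvalue                 # Mann-Whitney U equivalent
p_w = stats.wilcoxon(x - y).pvalue                    # signed-rank (if x, y were paired)
rho, p_s = stats.spearmanr(x, y)                      # Spearman instead of Pearson
print(f"Welch p={p_t:.3f}, Mann-Whitney p={p_u:.3f}, Spearman rho={rho:.2f}")
```

Running both on the same data is a useful sensitivity analysis: when the parametric and non-parametric p-values diverge sharply, the parametric assumptions deserve a closer look.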

A3. Effect Size Reference

| Measure | Formula | Interpretation |
|---|---|---|
| Cohen’s d | (μ₁−μ₂)/pooled SD | 0.2=small, 0.5=medium, 0.8=large |
| Odds ratio | ad/bc | 1=no effect; >1 increased odds; <1 decreased odds |
| Relative risk | [a/(a+b)] / [c/(c+d)] | 1=no effect; >1 increased risk |
| NNT | 1/ARR | Lower = more effective |
| Hazard ratio | e^β (Cox) | 1=no effect; interpreted analogously to RR |
| r (Pearson) | — | 0.1=small, 0.3=medium, 0.5=large |
| R² | SS_model/SS_total | % variance explained |
| η² (eta-squared) | SS_between/SS_total | 0.01=small, 0.06=medium, 0.14=large |
| AUC/C-statistic | Area under ROC | 0.5=chance; 0.7–0.8=acceptable; >0.8=excellent |
| Kappa (κ) | (Po−Pe)/(1−Pe) | 0.4–0.6=moderate; 0.6–0.8=substantial; >0.8=almost perfect |
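The 2×2-table measures above are one-liners once the cells a, b, c, d are laid out (rows = exposure, columns = event / no event). A worked sketch with hypothetical counts:

```python
# Hypothetical 2x2 table: a,b = exposed event/no-event; c,d = unexposed
a, b, c, d = 30, 70, 15, 85

odds_ratio = (a * d) / (b * c)                 # ad/bc
risk_exp, risk_unexp = a / (a + b), c / (c + d)
relative_risk = risk_exp / risk_unexp          # [a/(a+b)] / [c/(c+d)]
arr = abs(risk_exp - risk_unexp)               # absolute risk difference
nnt = 1 / arr                                  # number needed to treat (or harm)
print(f"OR={odds_ratio:.2f}, RR={relative_risk:.2f}, NNT={nnt:.1f}")
```

Note how the OR (2.43) exceeds the RR (2.00) even at a 30% event rate: the two diverge whenever the outcome is common, which is why ORs from case-control or logistic analyses should not be read as risks.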

A4. P-Value Thresholds in Context

| Scenario | Recommended α / adjustment | Rationale |
|---|---|---|
| Primary outcome, single test | 0.05 | Standard |
| Secondary outcomes (multiple) | 0.05/k (Bonferroni) | Multiple comparisons |
| Post-hoc pairwise comparisons | Tukey HSD or Bonferroni | Familywise error control |
| Exploratory analysis | 0.05, clearly labelled | Hypothesis-generating only |
| Genome-wide association study | 5×10⁻⁸ | Millions of comparisons |
| Equivalence / non-inferiority | 0.025 (one-sided) | Specific trial design |
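The Bonferroni row in code: divide α by the number of secondary comparisons and test each p-value against the corrected threshold (p-values here are hypothetical):

```python
pvals = [0.012, 0.030, 0.041, 0.20]   # hypothetical secondary-outcome p-values
alpha = 0.05
k = len(pvals)

bonf_alpha = alpha / k                # 0.05/4 = 0.0125
significant = [p < bonf_alpha for p in pvals]
print(bonf_alpha, significant)
```

All four p-values sit below 0.05, yet only the first survives correction, which is exactly the situation this table is warning about. For less conservative control, `statsmodels.stats.multitest.multipletests` offers Holm and FDR alternatives.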

A5. Sample Size Formulae

| Design | Formula | Notes |
|---|---|---|
| Two independent means | n = 2(z_α/2+z_β)²σ²/δ² | Per group; σ=SD, δ=minimum detectable difference |
| Two proportions | n = (z_α/2+z_β)²[p₁(1−p₁)+p₂(1−p₂)]/(p₁−p₂)² | p₁, p₂=expected proportions |
| Paired design | n = (z_α/2+z_β)²σ_d²/δ² | σ_d=SD of differences |
| One proportion vs reference | n = z_α/2²p₀(1−p₀)/E² | E=acceptable margin of error |

z values: z_0.025=1.96 (α=0.05 two-tailed), z_0.2=0.84 (power=80%), z_0.1=1.28 (power=90%)
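The two-means formula, written as a function (scipy assumed for the normal quantiles). With σ=1, δ=0.5, α=0.05 and 80% power it reproduces the textbook answer of 63 per group:

```python
import math
from scipy.stats import norm

def n_per_group_two_means(sigma, delta, alpha=0.05, power=0.80):
    """n = 2(z_alpha/2 + z_beta)^2 * sigma^2 / delta^2, rounded up."""
    z_a = norm.ppf(1 - alpha / 2)   # 1.96 for two-tailed alpha=0.05
    z_b = norm.ppf(power)           # 0.84 for 80% power
    return math.ceil(2 * (z_a + z_b) ** 2 * sigma ** 2 / delta ** 2)

print(n_per_group_two_means(sigma=1.0, delta=0.5))  # 63 per group
```

Always round up, never to the nearest integer: a fractional n rounds down to an underpowered study. For production use, `statsmodels.stats.power.TTestIndPower` solves the same problem with the noncentral t rather than the normal approximation.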


Key References and Further Reading

  • Altman DG. Practical Statistics for Medical Research. Chapman & Hall, 1991.
  • Bland JM, Altman DG. Statistical methods for assessing agreement. Lancet. 1986;327:307-310.
  • Cox DR. Regression models and life tables. J Royal Stat Soc B. 1972;34:187-220.
  • DeLong ER et al. Comparing areas under two correlated ROC curves. Biometrics. 1988;44:837-845.
  • Harrell FE. Regression Modeling Strategies. Springer, 2015.
  • Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. J Am Stat Assoc. 1958;53:457-481.
  • Rothman KJ. No adjustments needed for multiple comparisons. Epidemiology. 1990;1:43-46.
  • Steyerberg EW. Clinical Prediction Models. Springer, 2009.
  • Vittinghoff E et al. Regression Methods in Biostatistics. Springer, 2012.
  • Zhang J, Yu KF. What’s the relative risk? JAMA. 1998;280:1690-1691.

Enriched Edition v2.0 | R + Python code with ggplot2 / matplotlib for every test | Clinical research reference
All code is self-contained and reproducible. Simulated data match the worked examples. Plots are publication-ready with minor cosmetic adjustments.
