
A Complete Guide to Statistical Tests and Methods for Clinical Researchers

Audience: Researchers and clinicians applying statistical methods to medical and health data
Purpose: A thorough reference covering test selection, assumptions, worked examples, interpretation, and reporting — from foundational hypothesis tests through advanced methods including survival analysis, multivariable modelling, and meta-analysis
How to use this guide: Each section follows a consistent structure: What it is → When to use it → Assumptions → Step-by-step workflow → Worked example → Interpretation → Reporting → Common mistakes


Table of Contents

  1. Foundations: The Statistical Reasoning Framework
  2. Choosing the Right Test
  3. Descriptive Statistics and Data Exploration
  4. One-Variable Tests
  5. Comparing Two Groups
  6. Comparing Three or More Groups
  7. Correlation and Association
  8. Regression Analysis
  9. Effect Sizes and Association Measures
  10. Survival and Time-to-Event Analysis
  11. Multivariable Modelling Strategy
  12. Multivariate Methods
  13. Mixed Models and Longitudinal Data
  14. Diagnostic Test Evaluation
  15. Agreement and Reliability
  16. Bayesian Methods
  17. Meta-Analysis and Systematic Review
  18. Reporting Standards and Checklists
  19. Appendix: Quick Reference Tables

1. Foundations: The Statistical Reasoning Framework

1.1 What Is a Statistical Test?

A statistical test is a formal procedure for deciding whether observed data are consistent with a stated hypothesis. The process has four components:

  1. Null hypothesis (H₀): The assumption of no effect, no difference, or no association
  2. Alternative hypothesis (H₁): The effect or difference you are trying to detect
  3. Test statistic: A number calculated from your data that summarises the evidence against H₀
  4. P-value: The probability of observing a test statistic at least as extreme as yours, if H₀ were true

What a p-value is NOT: A p-value is not the probability that H₀ is true. It is not the probability that your result is due to chance. These are the two most common misinterpretations in the medical literature.

1.2 Type I and Type II Errors

| | H₀ is actually TRUE | H₀ is actually FALSE |
|---|---|---|
| Test says: reject H₀ | Type I error (false positive) — probability = α | Correct (true positive) — probability = power (1−β) |
| Test says: fail to reject H₀ | Correct (true negative) — probability = 1−α | Type II error (false negative) — probability = β |
  • α (significance level): Conventionally set at 0.05. If α = 0.05, you accept a 5% chance of a false positive.
  • β (Type II error rate): Conventionally ≤0.20, meaning power ≥ 80%.
  • Power: The probability of correctly detecting a true effect. Affected by sample size, effect size, and α.

Clinical implication: In a drug trial, a Type I error means declaring an ineffective drug effective (false positive). A Type II error means missing a truly effective drug (false negative). Both have real patient consequences.

1.3 One-Tailed vs Two-Tailed Tests

  • Two-tailed: Tests for a difference in either direction (H₁: μ₁ ≠ μ₂). Default for most clinical research.
  • One-tailed: Tests for a difference in a specific direction (H₁: μ₁ > μ₂). Use only when you have strong prior justification and would not act on a result in the other direction. One-tailed tests are often viewed with suspicion by reviewers if not pre-specified.

1.4 Confidence Intervals vs P-Values

Confidence intervals (CIs) convey more information than p-values alone:

  • A 95% CI represents the range of values consistent with your data at the 5% significance level
  • If the 95% CI for a difference excludes zero (or for a ratio excludes 1.0), the result is statistically significant at α = 0.05
  • CIs communicate both statistical significance AND the magnitude and precision of the estimate
  • Report both — modern journals increasingly require CIs alongside p-values

Example: A new antihypertensive reduces SBP by 8 mmHg (95% CI: 6 to 10 mmHg, p < 0.001). The CI tells you the reduction is clinically meaningful and precisely estimated. Compare this to: 8 mmHg (95% CI: 0.1 to 16 mmHg, p = 0.048) — statistically significant but very imprecise.

1.5 Sample Size and Power Calculations

Always perform a power calculation before collecting data. The four inputs are:

  1. α — significance level (typically 0.05)
  2. Power (1−β) — typically 0.80 or 0.90
  3. Effect size — the minimum clinically important difference (MCID) you want to detect
  4. Variability — standard deviation (from pilot data or literature)

These four quantities are mathematically linked — specify three to solve for the fourth. Most commonly, you solve for n (sample size).

Example: You want to detect a 10 mmHg difference in SBP between two drug groups. From previous studies, SD ≈ 20 mmHg. With α = 0.05 (two-tailed) and power = 80%:

n per group = 2 × (z_α/2 + z_β)² × σ² / δ² = 2 × (1.96 + 0.84)² × 400 / 100 = 2 × 7.84 × 4 = 62.7 ≈ 63 per group

You need approximately 63 patients per arm, so ~126 total. Always add 10–20% for expected dropout.
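The arithmetic above can be reproduced in a few lines. A minimal sketch using scipy's normal quantiles; the helper `n_per_group` is illustrative, not a library function:

```python
# Sketch of the two-sample size formula from the worked example.
# Assumptions: two-tailed test, equal allocation, known SD.
from math import ceil
from scipy.stats import norm

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate n per arm for detecting a mean difference delta."""
    z_alpha = norm.ppf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_beta = norm.ppf(power)            # ~0.84 for 80% power
    return ceil(2 * (z_alpha + z_beta) ** 2 * sd ** 2 / delta ** 2)

n = n_per_group(delta=10, sd=20)        # the SBP example: 63 per arm
```

Inflating for 15% dropout would give `ceil(63 / 0.85)`, i.e. about 75 per arm.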


2. Choosing the Right Test: A Decision Framework

The Five Key Questions

Before selecting any statistical test, answer these questions in order:

Q1. What is your research question?

  • Describe a population → Descriptive statistics
  • Test a hypothesis about one group → One-sample tests
  • Compare groups → Between-group tests
  • Examine relationships → Correlation / regression
  • Predict an outcome → Regression modelling

Q2. How many variables are involved?

  • 1 variable → One-sample or descriptive
  • 2 variables → Bivariate tests (correlation, two-group comparison)
  • 2+ variables with one outcome → Multivariable regression
  • 2+ outcomes simultaneously → Multivariate methods

Q3. What type is each variable?

  • Continuous: Measured on a scale (BP, weight, age, biomarker levels)
  • Ordinal: Ordered categories (pain scale 1–10, NYHA class I–IV)
  • Nominal/categorical: Unordered categories (blood type, treatment group, sex)
  • Binary: Special case of nominal with exactly two categories (alive/dead, yes/no)
  • Time-to-event: Combined measure of whether and when an event occurred

Q4. Are the samples independent or paired/related?

  • Independent: Different subjects in each group (RCT treatment arms, case-control study)
  • Paired/related: Same subjects measured twice, or matched subjects (crossover trial, matched case-control)

Q5. Are parametric assumptions met?

  • Parametric tests assume approximately normal distribution (or large enough n for CLT to apply), continuous data, and homogeneity of variance where applicable
  • Non-parametric tests make fewer distributional assumptions — use for small samples (<30), skewed distributions, ordinal data, or data with outliers

Decision Table

| Outcome variable | Predictor/groups | Sample type | Test |
|---|---|---|---|
| Continuous | None (1 group vs known value) | | One-sample t-test or Wilcoxon |
| Continuous | 2 groups | Independent | Student’s t or Welch’s t / Mann-Whitney U |
| Continuous | 2 groups | Paired | Paired t-test / Wilcoxon signed-rank |
| Continuous | 3+ groups | Independent | One-way ANOVA / Kruskal-Wallis |
| Continuous | 3+ groups | Repeated | Repeated-measures ANOVA / Friedman |
| Continuous | Continuous predictor(s) | | Linear regression |
| Binary | 2+ groups | Independent | Chi-square / Fisher’s exact |
| Binary | 2 groups | Paired | McNemar’s test |
| Binary | Continuous/mixed predictors | | Logistic regression |
| Time-to-event | 2+ groups | Independent | Kaplan-Meier + log-rank |
| Time-to-event | Continuous/mixed predictors | | Cox regression |
| Count data | Groups | | Poisson / negative binomial regression |
| Ordinal | 2+ groups | Independent | Mann-Whitney / Kruskal-Wallis |
| Multiple continuous outcomes | Groups | | MANOVA |

3. Descriptive Statistics and Data Exploration

3.1 Measures of Central Tendency

Mean: Sum of all values divided by n. Best for normally distributed continuous data.

Median: The middle value when data are sorted. Preferred for skewed data or ordinal scales. Robust to outliers.

Mode: Most frequently occurring value. Rarely used in clinical research except for nominal data.

When to use which:

  • Normally distributed continuous data → Mean (± SD)
  • Skewed continuous data → Median (IQR)
  • Ordinal scales (e.g. pain scores) → Median (IQR)
  • Nominal data → Frequency and percentage

3.2 Measures of Spread

Standard deviation (SD): Average distance of data points from the mean. Use with mean for symmetric data.

Interquartile range (IQR): Difference between 75th and 25th percentiles. Use with median for skewed data.

Range: Min to max. Useful supplementary information but sensitive to outliers.

Standard error of the mean (SEM): SD / √n. Describes precision of the mean estimate, NOT the spread of individual values. Do not use SEM as a measure of variability in a study population — this is a common and misleading error in clinical publications.

3.3 Assessing Normality

Before choosing parametric vs non-parametric tests, assess distributional assumptions:

Visual methods (preferred):

  • Histogram: Look for symmetric bell shape
  • Q-Q plot (quantile-quantile plot): Points should fall along the diagonal line if data are normally distributed
  • Box plot: Check for symmetry and outliers

Formal tests:

  • Shapiro-Wilk test: Best for small samples (n < 50). H₀: data are normally distributed. A p-value > 0.05 is consistent with normality (note: does not prove normality).
  • Kolmogorov-Smirnov test: Better for larger samples.

Practical rule: With n > 30, the central limit theorem (CLT) ensures that the sampling distribution of the mean is approximately normal even if individual data are skewed. Parametric tests are generally robust in this case. For n < 30 with visibly skewed data, use non-parametric alternatives.
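Both the formal test and the Q-Q coordinates are one call each in Python. A sketch with simulated right-skewed data (illustrative only, not study data):

```python
import numpy as np
from scipy.stats import shapiro, probplot

rng = np.random.default_rng(1)
biomarker = rng.lognormal(mean=1.2, sigma=0.3, size=40)   # right-skewed values

# Formal check: H0 = data are normal; a small p suggests non-normality,
# while p > 0.05 is merely consistent with normality
stat, p = shapiro(biomarker)

# Q-Q plot coordinates; pass to matplotlib to inspect visually
(theoretical_q, ordered_data), (slope, intercept, r) = probplot(biomarker)
```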

3.4 Worked Example: Describing a Study Population

Scenario: A clinical trial of a new statin enrols 120 patients. At baseline, data are collected on age, sex, BMI, LDL-cholesterol, and NYHA heart failure class (I–IV).

Appropriate summary statistics:

| Variable | Type | Summary |
|---|---|---|
| Age (years) | Continuous, approximately normal | Mean ± SD: 62.4 ± 11.2 |
| Sex (% male) | Binary | 68 (56.7%) |
| BMI (kg/m²) | Continuous, slightly right-skewed | Median (IQR): 27.8 (24.6–31.9) |
| LDL-C (mmol/L) | Continuous, right-skewed | Median (IQR): 3.4 (2.8–4.1) |
| NYHA class | Ordinal | Class I: 22 (18.3%), Class II: 58 (48.3%), Class III: 32 (26.7%), Class IV: 8 (6.7%) |

Reporting note: In a Table 1 (baseline characteristics), use the format: n (%) for categorical variables; mean ± SD for normally distributed continuous variables; median (IQR) for skewed or ordinal variables.


4. One-Variable Tests

4.1 One-Sample Student’s t-Test

What it does: Tests whether the mean of a single sample differs significantly from a known or hypothesised population value (μ₀).

When to use:

  • One continuous variable
  • Data are approximately normally distributed (or n ≥ 30)
  • You want to compare your sample mean to a reference value

Assumptions:

  • Continuous data
  • Approximate normality or n ≥ 30
  • Observations are independent

Test statistic:

t = (x̄ − μ₀) / (s / √n)

Where x̄ = sample mean, μ₀ = hypothesised mean, s = sample SD, n = sample size. Follows a t-distribution with n−1 degrees of freedom.

Worked Example:

Research question: A cardiology unit wants to know whether the mean INR of their anticoagulated patients differs from the therapeutic target of 2.5.

Data: n = 25 patients, mean INR = 2.8, SD = 0.6

t = (2.8 − 2.5) / (0.6 / √25) = 0.3 / 0.12 = 2.50
df = 24
p-value = 0.020 (two-tailed)
95% CI for difference: 0.05 to 0.55

Interpretation: The mean INR (2.8) is significantly above the target of 2.5 (t(24) = 2.50, p = 0.020). The 95% CI (2.55 to 3.05) excludes 2.5, confirming statistical significance. The unit may be over-anticoagulating their patients on average.
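The worked example can be verified from the summary statistics alone. A sketch with scipy's t distribution; with raw data, `scipy.stats.ttest_1samp(values, popmean=2.5)` does this in one call:

```python
from math import sqrt
from scipy import stats

n, xbar, s, mu0 = 25, 2.8, 0.6, 2.5      # summary data from the INR example
se = s / sqrt(n)                          # 0.12
t = (xbar - mu0) / se                     # 2.50
df = n - 1
p = 2 * stats.t.sf(abs(t), df)            # two-tailed p, about 0.020
t_crit = stats.t.ppf(0.975, df)           # about 2.064
ci = (xbar - t_crit * se, xbar + t_crit * se)   # about (2.55, 3.05)
```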

4.2 One-Sample Wilcoxon Signed-Rank Test

What it does: Non-parametric equivalent of the one-sample t-test. Tests whether the median of a sample differs from a hypothesised value.

When to use:

  • One continuous or ordinal variable
  • Data are skewed, ordinal, or n < 30 with non-normal distribution
  • You want to compare your sample median to a reference value

Worked Example:

Research question: A pain clinic wants to know whether their patients’ median pain score (NRS 0–10) differs from the population median of 5.

Data: n = 18 patients with chronic back pain, median NRS = 7 (IQR: 5–9). Shapiro-Wilk p = 0.003 — data are significantly non-normal.

Procedure: Calculate the difference between each patient’s score and 5. Rank the absolute differences. Sum the positive and negative ranks separately. Use the Wilcoxon W statistic.

Result: W = 142, p = 0.008

Interpretation: Patients’ median pain score (7) is significantly higher than the reference value of 5 (Wilcoxon W = 142, p = 0.008), indicating this population has worse pain than the general reference population.
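With raw scores, the test is one call to `scipy.stats.wilcoxon` on the differences from the reference value. The scores below are illustrative stand-ins, not the study's actual data:

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical NRS scores for 18 patients (median 7), tested against 5;
# zero differences are dropped by scipy's default zero_method
scores = np.array([5, 5, 6, 6, 6, 7, 7, 7, 7, 7, 8, 8, 8, 9, 9, 9, 10, 10])
res = wilcoxon(scores - 5)
```

Every nonzero difference in this illustration is positive, so the result is strongly significant; real data would typically include some scores below the reference value.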

4.3 One-Proportion Test (Z-test for proportion)

What it does: Tests whether an observed proportion differs from a known or hypothesised population proportion.

When to use:

  • One binary/nominal variable
  • You want to compare your proportion to a reference value
  • np ≥ 5 and n(1−p) ≥ 5 (otherwise use exact binomial test)

Worked Example:

Research question: The national readmission rate following elective hip replacement is 4%. A tertiary centre reviews 250 of their own procedures and finds 15 readmissions. Is their rate significantly different?

H₀: p = 0.04 (their rate equals the national rate)

p̂ = 15/250 = 0.060
z = (p̂ − p₀) / √(p₀(1−p₀)/n) = (0.060 − 0.040) / √(0.04 × 0.96 / 250) = 0.020 / 0.0124 = 1.61
p-value = 0.107 (two-tailed)
95% CI for proportion: 0.033 to 0.097

Interpretation: The observed readmission rate (6.0%) is numerically higher than the national rate (4.0%), but this difference is not statistically significant (z = 1.61, p = 0.107). The 95% CI (3.3% to 9.7%) includes 4%, consistent with this conclusion. The study may be underpowered to detect a difference of this magnitude — a power calculation would be warranted.
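A sketch of the z calculation with scipy (statsmodels' `proportions_ztest` performs the same test directly on counts):

```python
from math import sqrt
from scipy.stats import norm

x, n, p0 = 15, 250, 0.04             # readmissions, procedures, national rate
p_hat = x / n                         # 0.060
se0 = sqrt(p0 * (1 - p0) / n)         # SE under H0, about 0.0124
z = (p_hat - p0) / se0                # about 1.61
p = 2 * norm.sf(abs(z))               # two-tailed p, about 0.107
```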

4.4 Chi-Square Goodness-of-Fit Test

What it does: Tests whether the observed distribution of a categorical variable matches an expected (theoretical) distribution.

When to use:

  • One categorical variable with two or more categories
  • You have hypothesised expected frequencies for each category
  • Expected frequency in each cell ≥ 5

Worked Example:

Research question: ABO blood group distribution in the general UK population is approximately: A=42%, B=10%, AB=4%, O=44%. In a sample of 200 cardiac surgery patients, you observe: A=96 (48%), B=16 (8%), AB=6 (3%), O=82 (41%). Is the distribution of blood types in cardiac patients different from the general population?

Expected counts (E = n × p): A=84, B=20, AB=8, O=88

χ² = Σ (O−E)²/E = (96−84)²/84 + (16−20)²/20 + (6−8)²/8 + (82−88)²/88 = 1.714 + 0.800 + 0.500 + 0.409 = 3.423
df = 4 − 1 = 3
p-value = 0.331

Interpretation: The blood type distribution among cardiac surgery patients does not differ significantly from the general population (χ²(3) = 3.42, p = 0.331).
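scipy reproduces this directly from the observed and expected counts:

```python
from scipy.stats import chisquare

observed = [96, 16, 6, 82]                               # A, B, AB, O
expected = [200 * f for f in (0.42, 0.10, 0.04, 0.44)]   # 84, 20, 8, 88
stat, p = chisquare(f_obs=observed, f_exp=expected)      # df = k - 1 = 3
```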


5. Comparing Two Groups

5.1 Independent Samples t-Test (Student’s t-Test)

What it does: Compares the means of two independent groups.

When to use:

  • Continuous outcome variable
  • Two independent groups (different subjects in each group)
  • Approximately normally distributed data in both groups, or n ≥ 30 per group
  • Equal population variances (if not, use Welch’s t-test)

Checking equal variances: Use Levene’s test. If p > 0.05, assume equal variances (Student’s). If p ≤ 0.05, assume unequal variances (Welch’s). In practice, Welch’s t-test is robust and increasingly recommended as the default.

Test statistic (equal variances):

t = (x̄₁ − x̄₂) / (sp × √(1/n₁ + 1/n₂))
where sp = pooled SD = √[((n₁−1)s₁² + (n₂−1)s₂²) / (n₁ + n₂ − 2)]
df = n₁ + n₂ − 2

Worked Example:

Research question: A randomised controlled trial compares a new ACE inhibitor (Group A, n=45) to placebo (Group B, n=45) on 24-hour systolic blood pressure (SBP) reduction after 8 weeks.

| | Group A (ACE inhibitor) | Group B (Placebo) |
|---|---|---|
| n | 45 | 45 |
| Mean SBP reduction (mmHg) | 12.4 | 5.8 |
| SD | 8.2 | 7.6 |

Levene’s test: p = 0.62 → assume equal variances

sp = √[((44 × 8.2²) + (44 × 7.6²)) / 88] = √[(2958.56 + 2541.44) / 88] = √62.5 = 7.906
t = (12.4 − 5.8) / (7.906 × √(1/45 + 1/45)) = 6.6 / (7.906 × 0.2108) = 6.6 / 1.667 = 3.96
df = 88
p-value < 0.001
95% CI for difference: 3.28 to 9.92 mmHg

Interpretation: The ACE inhibitor produced a significantly greater reduction in SBP compared to placebo (mean difference 6.6 mmHg, 95% CI 3.3 to 9.9 mmHg; t(88) = 3.96, p < 0.001). The CI is entirely above zero, confirming the ACE inhibitor is superior.

Reporting template: “The ACE inhibitor group showed a significantly greater reduction in 24-hour SBP compared to placebo (12.4 ± 8.2 vs 5.8 ± 7.6 mmHg; mean difference 6.6 mmHg, 95% CI 3.3 to 9.9 mmHg; p < 0.001).”
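Because only summary statistics are shown, `scipy.stats.ttest_ind_from_stats` is convenient here; with raw data you would call `ttest_ind` instead (and `equal_var=False` in either function gives Welch's test):

```python
from scipy.stats import ttest_ind_from_stats

# Summary data from the trial table above
res = ttest_ind_from_stats(mean1=12.4, std1=8.2, nobs1=45,
                           mean2=5.8, std2=7.6, nobs2=45,
                           equal_var=True)   # Student's t (pooled variance)
```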

5.2 Welch’s t-Test (Unequal Variances)

What it does: Like Student’s t-test but does not assume equal population variances. The degrees of freedom are adjusted (Welch-Satterthwaite correction), resulting in a non-integer df.

When to use: Whenever Levene’s test is significant (p ≤ 0.05), or as a default (Welch’s is generally safer and loses little power when variances are actually equal).

Worked Example:

Research question: Comparing CRP levels (mg/L) between patients with confirmed bacterial infection (n=30) and viral infection (n=28).

| | Bacterial | Viral |
|---|---|---|
| Mean CRP | 118.4 | 22.6 |
| SD | 94.2 | 18.7 |

Levene’s test: p = 0.001 → unequal variances → use Welch’s

t = (118.4 − 22.6) / √(94.2²/30 + 18.7²/28) = 95.8 / √(295.79 + 12.49) = 95.8 / √308.28 = 95.8 / 17.56 = 5.46
df (Welch–Satterthwaite) ≈ 31.4 (non-integer)
p < 0.001
95% CI: 60.7 to 130.9 mg/L

Interpretation: CRP was substantially and significantly higher in bacterial compared to viral infections (118.4 vs 22.6 mg/L; mean difference 95.8 mg/L, 95% CI 60.7 to 130.9; Welch’s t = 5.46, p < 0.001). The large standard deviations and Levene’s test result confirm the appropriateness of Welch’s t-test here.

5.3 Mann-Whitney U Test

What it does: Non-parametric test comparing the distributions of two independent groups. Tests whether one group tends to have higher values than the other.

When to use:

  • Continuous or ordinal outcome
  • Two independent groups
  • Data are skewed, ordinal, or n < 30 with non-normal distribution
  • Particularly appropriate for outcomes like pain scores, quality of life measures, biomarkers with skewed distributions

What it actually tests: The Mann-Whitney U test does not strictly test equality of medians (a common misconception). It tests whether one group’s values tend to be larger than the other’s — formally, P(X > Y) = 0.5. The test is equivalent to asking: “If I randomly picked one observation from each group, is there an equal probability of either being larger?”

Worked Example:

Research question: A palliative care study compares quality of life scores (EORTC QLQ-C30 global scale, 0–100) between patients receiving standard care (n=22) and those receiving a new integrated support programme (n=24) at 3 months. The data are negatively skewed.

| | Standard care | Integrated programme |
|---|---|---|
| n | 22 | 24 |
| Median (IQR) | 58 (42–70) | 72 (62–82) |
| Shapiro-Wilk p | 0.031 | 0.028 |

Both groups fail the normality test → use Mann-Whitney U

Result: U = 161.5, p = 0.014

Interpretation: Quality of life scores were significantly higher in the integrated support programme group compared to standard care (median 72 vs 58; Mann-Whitney U = 161.5, p = 0.014).

Reporting template: “Global quality of life was significantly better in patients receiving the integrated support programme compared to standard care (median 72 [IQR 62–82] vs 58 [IQR 42–70]; Mann-Whitney U = 161.5, p = 0.014).”
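With raw scores the test is a single call; the data below are illustrative stand-ins for the two groups, not the study's values:

```python
import numpy as np
from scipy.stats import mannwhitneyu

standard = np.array([35, 42, 48, 52, 58, 60, 64, 68, 70, 75])   # hypothetical
programme = np.array([55, 62, 66, 70, 74, 78, 80, 82, 85, 90])  # hypothetical
res = mannwhitneyu(standard, programme, alternative='two-sided')
```

Report the U statistic alongside group medians and IQRs, since U by itself has no clinical interpretation.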

5.4 Paired Samples t-Test

What it does: Compares means from the same subjects measured at two time points or under two conditions. Conceptually, it reduces to a one-sample t-test on the differences.

When to use:

  • Same subjects measured twice (before/after design)
  • Matched subjects in a 1:1 design
  • Approximately normally distributed differences (not necessarily the raw values)

Key advantage over independent t-test: Removes between-subject variability, substantially increasing statistical power.

Test statistic:

t = d̄ / (sd / √n)
where d̄ = mean of (post − pre) differences and sd = SD of the differences
df = n − 1

Worked Example:

Research question: A crossover trial tests whether 8 weeks of dietary sodium restriction reduces 24-hour urinary sodium excretion in 20 hypertensive patients. Each patient acts as their own control.

| | Pre (mmol/24h) | Post (mmol/24h) | Difference (Post−Pre) |
|---|---|---|---|
| Mean | 168.4 | 124.6 | −43.8 |
| SD of differences | | | 28.4 |

t = −43.8 / (28.4 / √20) = −43.8 / 6.35 = −6.90
df = 19
p < 0.001
95% CI for mean difference: −57.1 to −30.5 mmol/24h

Interpretation: Sodium restriction significantly reduced 24-hour urinary sodium excretion (mean reduction 43.8 mmol/24h, 95% CI 30.5 to 57.1 mmol/24h; paired t(19) = −6.90, p < 0.001). The CI excludes zero, and the magnitude (43.8 mmol/24h) represents a clinically meaningful reduction.
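The result can be checked from the summary of the differences; a sketch (with raw paired measurements, `scipy.stats.ttest_rel(post, pre)` gives the same test):

```python
from math import sqrt
from scipy import stats

n, dbar, sd_diff = 20, -43.8, 28.4        # summary of (post - pre) differences
se = sd_diff / sqrt(n)                     # about 6.35
t = dbar / se                              # about -6.90
df = n - 1
p = 2 * stats.t.sf(abs(t), df)             # p < 0.001
t_crit = stats.t.ppf(0.975, df)            # about 2.093
ci = (dbar - t_crit * se, dbar + t_crit * se)   # about (-57.1, -30.5)
```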

5.5 Wilcoxon Signed-Rank Test

What it does: Non-parametric equivalent of the paired t-test. Compares two related groups without assuming normality of differences.

When to use:

  • Paired or repeated observations
  • Differences are not normally distributed
  • Ordinal data with paired design

Worked Example:

Research question: A physiotherapy intervention study measures pain scores (NRS 0–10) in 16 patients with knee osteoarthritis before and after 6 weeks of treatment. The differences are not normally distributed (Shapiro-Wilk p = 0.019).

| | Pre-treatment | Post-treatment |
|---|---|---|
| Median (IQR) | 7 (6–9) | 4 (3–6) |

Result: Wilcoxon Z = −3.29, p = 0.001

Interpretation: Pain scores were significantly reduced following physiotherapy (pre-treatment median 7 [IQR 6–9] vs post-treatment median 4 [IQR 3–6]; Wilcoxon signed-rank Z = −3.29, p = 0.001).

5.6 Chi-Square Test of Independence

What it does: Tests whether two categorical variables are associated (i.e., whether the distribution of one variable differs across levels of the other).

When to use:

  • Both variables are categorical (nominal or ordinal)
  • Independent observations
  • Expected frequency in each cell ≥ 5 (if not, use Fisher’s exact test)

Test statistic:

χ² = Σ (O − E)² / E
where E = (row total × column total) / grand total
df = (rows − 1)(columns − 1)

Worked Example:

Research question: Does smoking status (smoker vs non-smoker) differ between patients who develop postoperative pneumonia and those who do not following elective colorectal surgery (n=180)?

| | Pneumonia | No pneumonia | Total |
|---|---|---|---|
| Smoker | 24 | 36 | 60 |
| Non-smoker | 16 | 104 | 120 |
| Total | 40 | 140 | 180 |

Expected counts:

  • Smoker/Pneumonia: (60×40)/180 = 13.3
  • Smoker/No pneumonia: (60×140)/180 = 46.7
  • Non-smoker/Pneumonia: (120×40)/180 = 26.7
  • Non-smoker/No pneumonia: (120×140)/180 = 93.3

All expected counts ≥ 5 → chi-square test appropriate

χ² = (24−13.33)²/13.33 + (36−46.67)²/46.67 + (16−26.67)²/26.67 + (104−93.33)²/93.33 = 8.53 + 2.44 + 4.27 + 1.22 = 16.46
df = 1
p < 0.001

Interpretation: Smoking was significantly associated with postoperative pneumonia (χ²(1) = 16.46, p < 0.001). Smokers had a substantially higher rate of pneumonia (40.0%) compared to non-smokers (13.3%). The odds ratio is 4.33 (95% CI: 2.04–9.21), indicating smokers had over four times the odds of developing pneumonia.
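scipy runs the test straight from the 2×2 table. Two details matter: Yates' continuity correction is applied by default for 2×2 tables, so `correction=False` is needed to match an uncorrected hand calculation, and using exact expected counts gives χ² ≈ 16.46 (hand calculations with rounded expecteds drift slightly):

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[24, 36],      # smokers: pneumonia, no pneumonia
                  [16, 104]])    # non-smokers: pneumonia, no pneumonia
chi2, p, dof, expected = chi2_contingency(table, correction=False)

# Odds ratio directly from the 2x2 cells
odds_ratio = (24 * 104) / (36 * 16)   # about 4.33
```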

5.7 Fisher’s Exact Test

What it does: Tests the association between two categorical variables when expected cell frequencies are small (less than 5). Calculates the exact probability of the observed (or more extreme) table configuration.

When to use:

  • 2×2 contingency table with expected cell frequency < 5 in any cell
  • Small sample sizes
  • Sparse data (rare outcomes)

Worked Example:

Research question: A small case series examines whether an unusual fungal infection is associated with immunosuppressive therapy. Among 12 patients: 5 received immunosuppressants (4 with infection, 1 without), 7 did not (1 with infection, 6 without).

| | Infection | No infection | Total |
|---|---|---|---|
| Immunosuppressed | 4 | 1 | 5 |
| Not immunosuppressed | 1 | 6 | 7 |
| Total | 5 | 7 | 12 |

Smallest expected cell: (5×5)/12 = 2.08 < 5 → use Fisher’s exact test

Fisher’s exact p = 0.072 (two-tailed); one-tailed p = 0.045

Interpretation: The association between immunosuppressive therapy and fungal infection did not reach two-tailed significance in this small series (Fisher’s exact p = 0.072), despite the marked imbalance in the table. Caution: with only 12 patients the analysis is severely underpowered, and these findings should be considered hypothesis-generating.
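scipy's `fisher_exact` returns the sample odds ratio and the exact p-value; in tables this sparse the one- and two-tailed p-values differ noticeably, so state which you report:

```python
from scipy.stats import fisher_exact

table = [[4, 1],   # immunosuppressed: infection, no infection
         [1, 6]]   # not immunosuppressed: infection, no infection
odds_ratio, p_two = fisher_exact(table, alternative='two-sided')
_, p_one = fisher_exact(table, alternative='greater')
```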

5.8 McNemar’s Test

What it does: Tests whether the proportion of a binary outcome differs between two paired groups (same subjects measured twice, or matched pairs).

When to use:

  • Binary outcome (yes/no)
  • Paired or matched design (before/after, matched case-control)

Worked Example:

Research question: Before and after a hand-hygiene education campaign, the same 80 clinical staff are observed for compliance (compliant = yes/no). Did compliance rates change?

| | Post: Compliant | Post: Non-compliant | Total |
|---|---|---|---|
| Pre: Compliant | 38 | 12 | 50 |
| Pre: Non-compliant | 22 | 8 | 30 |
| Total | 60 | 20 | 80 |

The key cells are the discordant pairs: b=12 (compliant pre, not post) and c=22 (not compliant pre, compliant post).

McNemar χ² = (b − c)² / (b + c) = (12 − 22)² / (12 + 22) = 100 / 34 = 2.94
df = 1
p = 0.086
(With the continuity correction, χ² = (|b − c| − 1)² / (b + c) = 81 / 34 = 2.38, p = 0.12; the conclusion is unchanged.)

Interpretation: There was a non-significant trend toward improved hand hygiene compliance following the education campaign (75.0% post-intervention vs 62.5% pre-intervention; McNemar χ² = 2.94, p = 0.086). The campaign did not produce a statistically significant change in this sample.
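statsmodels implements McNemar's test on the paired table; `exact=False, correction=False` reproduces the uncorrected χ² version, while `exact=True` (the default) uses the binomial distribution of the discordant pairs, preferred when b + c is small:

```python
from statsmodels.stats.contingency_tables import mcnemar

table = [[38, 12],   # pre compliant: post compliant, post non-compliant
         [22, 8]]    # pre non-compliant: post compliant, post non-compliant
res = mcnemar(table, exact=False, correction=False)   # chi-square version
```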


6. Comparing Three or More Groups

6.1 One-Way ANOVA

What it does: Tests whether the means of three or more independent groups differ. The word “one-way” refers to one grouping factor. ANOVA tests the overall (“omnibus”) null hypothesis that ALL group means are equal — it does not tell you which groups differ.

When to use:

  • Continuous outcome
  • Three or more independent groups
  • Approximately normal distribution within each group, or large samples
  • Equal variances across groups (if not, use Welch’s ANOVA)

Assumptions:

  1. Normality within each group
  2. Homogeneity of variance (Levene’s test)
  3. Independence of observations

Logic: ANOVA partitions total variability into between-group variability (due to the treatment/grouping) and within-group variability (random noise). The F-statistic is the ratio of these two components.

F = (between-group variance) / (within-group variance) = MSbetween / MSwithin
where:
SSbetween = Σ nⱼ(x̄ⱼ − x̄)², df = k − 1
SSwithin = Σ Σ (xᵢⱼ − x̄ⱼ)², df = N − k
F follows an F-distribution with df₁ = k − 1, df₂ = N − k

Post-hoc tests: If ANOVA is significant, follow-up with pairwise comparisons. Common options:

  • Tukey’s HSD: Controls familywise error rate; compares all possible pairs. Good all-purpose choice.
  • Bonferroni correction: Divides α by number of comparisons. Conservative.
  • Dunnett’s test: Compares each group only to a control group. Use in dose-response studies.
  • Scheffé’s test: Most conservative; appropriate for complex contrasts planned after seeing the data.

Worked Example:

Research question: A multicentre RCT compares three doses of a novel anti-nausea drug (low dose, medium dose, high dose) versus placebo on vomiting episodes in 24 hours following chemotherapy (n=200, 50 per group).

| Group | n | Mean episodes | SD |
|---|---|---|---|
| Placebo | 50 | 6.8 | 2.4 |
| Low dose | 50 | 5.1 | 2.1 |
| Medium dose | 50 | 3.4 | 1.8 |
| High dose | 50 | 2.9 | 1.7 |

Grand mean (x̄) = (6.8+5.1+3.4+2.9)/4 = 4.55

SSbetween = 50×(6.8−4.55)² + 50×(5.1−4.55)² + 50×(3.4−4.55)² + 50×(2.9−4.55)² = 50 × (5.0625 + 0.3025 + 1.3225 + 2.7225) = 50 × 9.41 = 470.5
MSbetween = 470.5 / 3 = 156.8
SSwithin = 49×2.4² + 49×2.1² + 49×1.8² + 49×1.7² = 49 × (5.76 + 4.41 + 3.24 + 2.89) = 49 × 16.30 = 798.7
MSwithin = 798.7 / 196 = 4.075
F = 156.8 / 4.075 = 38.5
p < 0.001

Post-hoc Tukey HSD: All pairwise comparisons are significant (p < 0.05) except: Medium dose vs High dose (mean difference 0.5, p = 0.41).

Interpretation: There were significant differences in vomiting episodes across treatment groups (one-way ANOVA: F(3,196) = 38.5, p < 0.001). Post-hoc analysis (Tukey HSD) showed all active doses were superior to placebo (all p < 0.001), and medium dose was superior to low dose (p = 0.003). There was no significant difference between medium and high doses (p = 0.41), suggesting medium dose may provide the optimal therapeutic benefit with a lower adverse event profile.
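The F computation can be verified from the group summaries alone; a sketch (with raw data per group, `scipy.stats.f_oneway(g1, g2, g3, g4)` does this directly):

```python
import numpy as np
from scipy.stats import f as f_dist

ns = np.array([50, 50, 50, 50])           # group sizes
means = np.array([6.8, 5.1, 3.4, 2.9])    # mean vomiting episodes
sds = np.array([2.4, 2.1, 1.8, 1.7])

grand_mean = np.average(means, weights=ns)            # 4.55
ss_between = np.sum(ns * (means - grand_mean) ** 2)   # 470.5
ss_within = np.sum((ns - 1) * sds ** 2)               # 798.7
k, N = len(ns), int(ns.sum())
F = (ss_between / (k - 1)) / (ss_within / (N - k))    # about 38.5
p = f_dist.sf(F, k - 1, N - k)                        # p < 0.001
```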

6.2 Welch’s ANOVA

What it does: An F-test that does not assume equal population variances across groups. More robust than standard ANOVA when variances are heterogeneous.

When to use: When Levene’s test is significant (p < 0.05), indicating unequal variances across groups.

Post-hoc test: Use Games-Howell (does not assume equal variances) rather than Tukey HSD.

6.3 Kruskal-Wallis Test

What it does: Non-parametric alternative to one-way ANOVA. Tests whether three or more independent groups have the same distribution. Like Mann-Whitney U extended to k groups.

When to use:

  • Continuous or ordinal outcome
  • Three or more independent groups
  • Data are skewed or non-normal within groups
  • Ordinal outcome (e.g. pain scores, Likert scales)

Post-hoc testing: If Kruskal-Wallis is significant, use Dunn’s test with Bonferroni correction for pairwise comparisons.

Worked Example:

Research question: Three hospitals (A, B, C) are compared on patient-reported pain scores (NRS 0–10) at discharge following total knee replacement.

| Hospital | n | Median (IQR) |
|---|---|---|
| A | 35 | 4 (3–6) |
| B | 38 | 6 (4–8) |
| C | 33 | 5 (3–7) |

Data are ordinal and skewed → Kruskal-Wallis

Result: H(2) = 8.74, p = 0.013

Post-hoc (Dunn’s with Bonferroni):

  • A vs B: p = 0.010
  • A vs C: p = 0.320
  • B vs C: p = 0.182

Interpretation: Discharge pain scores differed significantly across the three hospitals (Kruskal-Wallis H(2) = 8.74, p = 0.013). Post-hoc analysis showed Hospital A had significantly lower pain scores than Hospital B (Dunn’s test, p = 0.010) but not Hospital C (p = 0.320). No significant difference was found between Hospitals B and C (p = 0.182).

6.4 Repeated Measures ANOVA

What it does: Tests for differences in a continuous outcome measured at three or more time points in the same subjects.

When to use:

  • Same subjects measured at 3+ time points
  • Continuous outcome
  • Approximately normally distributed data or adequate sample size

Assumption unique to repeated measures: Sphericity — the variances of the differences between all possible pairs of time points should be equal. Tested with Mauchly’s test. If violated, apply Greenhouse-Geisser or Huynh-Feldt epsilon correction to the degrees of freedom.

Worked Example:

Research question: Serum creatinine (μmol/L) is monitored in 30 patients with CKD at baseline, 3 months, 6 months, and 12 months of treatment.

| Time point | Mean creatinine | SD |
|---|---|---|
| Baseline | 142 | 38 |
| 3 months | 138 | 36 |
| 6 months | 131 | 34 |
| 12 months | 128 | 33 |

Mauchly’s test: p = 0.21 (sphericity not violated)

Result: F(3, 87) = 8.43, p < 0.001, η² = 0.225

Post-hoc (pairwise t-tests with Bonferroni):

  • Baseline vs 3 months: p = 0.31 (ns)
  • Baseline vs 6 months: p = 0.012
  • Baseline vs 12 months: p < 0.001
  • 3 months vs 12 months: p = 0.003

Interpretation: Serum creatinine decreased significantly over 12 months (repeated measures ANOVA: F(3,87) = 8.43, p < 0.001, η² = 0.23). Significant reductions from baseline were apparent at 6 months (−11 μmol/L, p = 0.012) and 12 months (−14 μmol/L, p < 0.001).

6.5 Friedman Test

What it does: Non-parametric equivalent of repeated measures ANOVA. Compares three or more related groups.

When to use:

  • Same subjects measured at 3+ time points
  • Data are skewed, ordinal, or assumptions of repeated measures ANOVA are violated

Worked Example:

Research question: Pain scores (NRS 0–10) are compared at 3 time points (baseline, week 4, week 8) in 20 patients with rheumatoid arthritis starting a new biologic therapy. Data are ordinal and skewed.

| Time | Median (IQR) |
|---|---|
| Baseline | 7 (6–9) |
| Week 4 | 5 (3–7) |
| Week 8 | 3 (2–5) |

Friedman χ²(2) = 28.4, p < 0.001

Post-hoc (Wilcoxon with Bonferroni, α adjusted to 0.017):

  • Baseline vs Week 4: p = 0.001
  • Baseline vs Week 8: p < 0.001
  • Week 4 vs Week 8: p = 0.008

Interpretation: Pain scores decreased significantly over the 8-week treatment period (Friedman χ²(2) = 28.4, p < 0.001). All pairwise comparisons showed significant improvement (all p ≤ 0.008).
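The Friedman test above can be sketched in a few lines with `scipy.stats.friedmanchisquare`. The data below are simulated to resemble declining pain scores — they are illustrative values, not the study's raw data:

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Simulated NRS pain scores (0-10) for 20 patients at three time points.
# These are illustrative values, NOT the study's raw data.
rng = np.random.default_rng(42)
baseline = rng.integers(5, 10, size=20)                        # scores 5-9
week4 = np.clip(baseline - rng.integers(0, 4, size=20), 0, 10)
week8 = np.clip(week4 - rng.integers(0, 4, size=20), 0, 10)

# Friedman test: three or more related (paired) samples
stat, p = friedmanchisquare(baseline, week4, week8)
print(f"Friedman chi2(2) = {stat:.2f}, p = {p:.4f}")
```

Each argument is one time point's measurements, with patients in the same order across arguments — the pairing is positional.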


7. Correlation and Association

7.1 Pearson Correlation

What it does: Quantifies the strength and direction of the linear relationship between two continuous variables. Output is the correlation coefficient r, ranging from −1 (perfect negative linear relationship) to +1 (perfect positive linear relationship).

When to use:

  • Both variables continuous
  • Approximately bivariate normal distribution
  • You are interested in linear association

Interpreting r:

| \|r\| value | Interpretation |
|---|---|
| 0.00–0.19 | Negligible/very weak |
| 0.20–0.39 | Weak |
| 0.40–0.59 | Moderate |
| 0.60–0.79 | Strong |
| 0.80–1.00 | Very strong |

Important caveat: Correlation ≠ causation. Always plot the data first (scatterplot) — r can miss non-linear relationships, and can be distorted by outliers.

Worked Example:

Research question: Is there a linear association between age (years) and eGFR (mL/min/1.73m²) in a cohort of 150 adults attending a nephrology clinic?

Result: r = −0.58 (95% CI: −0.68 to −0.46), p < 0.001

Interpretation: There is a moderate to strong negative linear relationship between age and eGFR (r = −0.58, 95% CI −0.68 to −0.46, p < 0.001), indicating that kidney function declines with increasing age in this cohort. Age accounts for approximately 34% of the variance in eGFR (r² = 0.336).
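The confidence interval for r is usually computed via the Fisher z-transform, tanh(atanh(r) ± 1.96/√(n−3)). A minimal sketch, using simulated age/eGFR-style data (illustrative values, not the clinic cohort):

```python
import numpy as np

def pearson_with_ci(x, y, zcrit=1.96):
    """Pearson r with a 95% CI via the Fisher z-transform:
    tanh(atanh(r) +/- zcrit / sqrt(n - 3))."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]
    z, se = np.arctanh(r), 1.0 / np.sqrt(n - 3)
    return r, (np.tanh(z - zcrit * se), np.tanh(z + zcrit * se))

# Simulated data with a negative linear trend, roughly age vs eGFR
rng = np.random.default_rng(1)
age = rng.uniform(20, 80, 150)
egfr = 120 - 0.8 * age + rng.normal(0, 15, 150)

r, (lo, hi) = pearson_with_ci(age, egfr)
print(f"r = {r:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
```

Note the asymmetry of the interval on the r scale — a consequence of transforming back from the z scale.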

7.2 Spearman’s Rank Correlation

What it does: Non-parametric measure of monotonic (not necessarily linear) association between two variables. Calculates the Pearson correlation on the ranks of the data.

When to use:

  • Ordinal data (e.g., disease severity grade, Likert scale responses)
  • Continuous data that are skewed or contain outliers
  • Non-linear but monotonic relationships

Worked Example:

Research question: Is NYHA heart failure class (I–IV, ordinal) associated with 6-minute walk distance (metres) in 80 outpatients?

Result: ρ (rho) = −0.71 (95% CI: −0.79 to −0.60), p < 0.001

Interpretation: There is a strong negative monotonic association between NYHA class and 6-minute walk distance (Spearman ρ = −0.71, p < 0.001): higher NYHA class (worse symptoms) is associated with shorter walk distance.
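Because Spearman's ρ is just Pearson's r computed on ranks, ordinal predictors need no special handling. A sketch with invented NYHA/walk-distance values (illustrative only):

```python
from scipy.stats import spearmanr

# Invented ordinal vs continuous data: NYHA class (1-4) and
# 6-minute walk distance (m). Illustrative, not the study data.
nyha = [1, 1, 2, 2, 2, 3, 3, 3, 4, 4]
walk = [520, 480, 450, 430, 400, 350, 330, 300, 220, 180]

# spearmanr ranks both variables internally (handling ties), then
# computes the Pearson correlation on the ranks
rho, p = spearmanr(nyha, walk)
print(f"Spearman rho = {rho:.2f}, p = {p:.4f}")
```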

7.3 Common Pitfalls in Correlation Analysis

1. Correlation without scatterplot: Always plot the data. An r of 0.50 could reflect a clean linear trend, a curved relationship, or an association driven entirely by a few outliers — you cannot tell from the statistic alone.

2. Ecological fallacy: Correlation at the group level (e.g., countries) does not imply correlation at the individual level.

3. Confounding: A correlation between A and B might be explained by a third variable C that is related to both.

4. Restricted range: Correlations are attenuated when you study a narrow range of one variable (e.g., only severely ill patients). True associations may be understated.

5. Multiple testing: If you test 20 correlations, you expect 1 to be significant by chance at α = 0.05.


8. Regression Analysis

8.1 Simple Linear Regression

What it does: Models the linear relationship between one continuous predictor (X) and one continuous outcome (Y). Extends correlation by fitting a line and quantifying the predicted change in Y per unit change in X.

The model:

Y = β₀ + β₁X + ε

  • β₀ = intercept (value of Y when X = 0)
  • β₁ = slope (change in Y for each 1-unit increase in X)
  • ε = residual error

Key outputs:

  • β₁ (regression coefficient): The slope — how much Y changes per unit increase in X
  • 95% CI for β₁
  • R²: Proportion of variance in Y explained by X
  • Residual diagnostics: Assess model assumptions

Assumptions:

  1. Linearity: the relationship is linear
  2. Independence of residuals
  3. Homoscedasticity: residuals have constant variance across X values
  4. Normality of residuals
  5. No influential outliers

Check assumptions with:

  • Residuals vs fitted plot (linearity, homoscedasticity)
  • Q-Q plot of residuals (normality)
  • Cook’s distance (influential observations)

Worked Example:

Research question: What is the relationship between BMI (kg/m², predictor) and systolic blood pressure (mmHg, outcome) in 200 middle-aged adults?

Result:

SBP = 98.4 + 1.72 × BMI
  • β₁ = 1.72 (95% CI: 1.28 to 2.16), p < 0.001
  • R² = 0.21 (21% of SBP variance explained by BMI)

Interpretation: For every 1 kg/m² increase in BMI, systolic blood pressure increases by an estimated 1.72 mmHg (95% CI 1.28 to 2.16 mmHg, p < 0.001). BMI explains 21% of the variability in SBP in this cohort.
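A minimal fit of this kind of model with `scipy.stats.linregress`, on data simulated to roughly match the worked example's fitted equation (not the study's actual measurements):

```python
import numpy as np
from scipy.stats import linregress

# Simulated BMI/SBP data built around SBP = 98.4 + 1.72*BMI + noise
# (illustrative values, not the study cohort)
rng = np.random.default_rng(0)
bmi = rng.uniform(18, 40, 200)
sbp = 98.4 + 1.72 * bmi + rng.normal(0, 12, 200)

fit = linregress(bmi, sbp)
# fit.slope estimates beta1, fit.intercept estimates beta0,
# and fit.rvalue**2 gives R-squared
print(f"SBP = {fit.intercept:.1f} + {fit.slope:.2f} x BMI, "
      f"R^2 = {fit.rvalue**2:.2f}, p = {fit.pvalue:.2g}")
```

For the residual diagnostics listed above, plot `sbp - (fit.intercept + fit.slope * bmi)` against the fitted values.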

8.2 Multiple Linear Regression

What it does: Models the relationship between two or more predictors and a continuous outcome. Each coefficient represents the effect of that predictor adjusted for all other predictors in the model.

The model:

Y = β₀ + β₁X₁ + β₂X₂ + ... + βₚXₚ + ε

Worked Example:

Research question: What factors independently predict systolic blood pressure in 200 adults? Candidate predictors: BMI, age, sex (female = reference), smoking status (current smoker vs not).

| Predictor | Coefficient (β) | 95% CI | p-value |
|---|---|---|---|
| Intercept | 78.2 | — | — |
| BMI (per kg/m²) | 1.42 | 0.98 to 1.86 | <0.001 |
| Age (per year) | 0.68 | 0.44 to 0.92 | <0.001 |
| Male sex | 4.10 | 1.22 to 6.98 | 0.005 |
| Current smoker | 3.85 | 0.97 to 6.73 | 0.009 |

Adjusted R² = 0.34

Interpretation: After adjustment for other variables, each 1 kg/m² increase in BMI was associated with a 1.42 mmHg increase in SBP (95% CI 0.98–1.86, p < 0.001). Older age, male sex, and current smoking were also independently associated with higher SBP. Together, these four predictors explain 34% of the variance in SBP.

Important: The coefficient for BMI (1.42) differs from the unadjusted coefficient (1.72) because age, sex, and smoking are confounders — they are correlated with BMI and independently predict SBP.

8.3 Logistic Regression

What it does: Models the relationship between one or more predictors and a binary outcome (yes/no, event/no event). Output is the log-odds of the outcome, which is converted to an odds ratio (OR) for interpretation.

The model:

logit(p) = ln(p/(1−p)) = β₀ + β₁X₁ + β₂X₂ + ...

OR for predictor Xj = e^βj

Assumptions:

  1. Binary outcome
  2. Independence of observations
  3. No multicollinearity among predictors
  4. Large enough sample (at least 10 events per predictor variable — the “EPV rule”)
  5. Linearity of continuous predictors with the log-odds (check with Box-Tidwell test)

Worked Example:

Research question: What factors independently predict 30-day readmission (yes/no) following hospital admission for COPD exacerbation? Data from 400 admissions.

Outcome: 30-day readmission (n=84, 21%)

| Predictor | OR | 95% CI | p-value |
|---|---|---|---|
| Age (per 10 years) | 1.24 | 1.05 to 1.47 | 0.012 |
| FEV₁ % predicted (per 10% increase) | 0.82 | 0.71 to 0.95 | 0.007 |
| Previous admission in past year (yes vs no) | 2.84 | 1.63 to 4.95 | <0.001 |
| Home oxygen use (yes vs no) | 1.93 | 1.09 to 3.42 | 0.025 |
| Eosinophil count (per 0.1×10⁹/L) | 0.87 | 0.76 to 0.99 | 0.038 |

Model fit: Hosmer-Lemeshow goodness-of-fit p = 0.64 (good fit); C-statistic (AUC) = 0.72

Interpretation: Previous admission in the past year was the strongest predictor of 30-day readmission (OR 2.84, 95% CI 1.63–4.95, p < 0.001): patients with prior admissions had nearly three times the odds of readmission compared to those without. Each 10% reduction in FEV₁% was associated with a 22% increase in the odds of readmission (OR 0.82 per 10% improvement, i.e. OR 1.22 per 10% deterioration). The model discriminates readmitted from non-readmitted patients with moderate ability (AUC 0.72).

Reporting the C-statistic / AUC: The AUC (area under the ROC curve) for a logistic model represents the probability that a randomly selected patient who was readmitted had a higher predicted probability than a randomly selected patient who was not. Values: 0.5 = no better than chance; 0.7–0.8 = acceptable; 0.8–0.9 = excellent; >0.9 = outstanding.
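The OR = e^β relationship, and the inverse logit that converts a linear predictor back to a probability, can be sketched directly. The coefficients below are hypothetical values chosen to match two of the worked example's odds ratios — they are not actual fitted model output:

```python
import math

# Hypothetical coefficients on the log-odds scale (chosen so that
# exp(beta) reproduces the worked example's ORs; the intercept is
# an assumed value, NOT fitted output)
coefs = {
    "intercept": -2.0,
    "prev_admission": math.log(2.84),   # exp(beta) = OR 2.84
    "home_oxygen": math.log(1.93),      # exp(beta) = OR 1.93
}

# Each predictor's odds ratio is exp(beta)
odds_ratios = {k: math.exp(b) for k, b in coefs.items() if k != "intercept"}

def predicted_probability(prev_admission, home_oxygen):
    """Invert the logit: p = 1 / (1 + exp(-linear predictor))."""
    logit = (coefs["intercept"]
             + coefs["prev_admission"] * prev_admission
             + coefs["home_oxygen"] * home_oxygen)
    return 1.0 / (1.0 + math.exp(-logit))

print(odds_ratios)
print(f"p(readmission | prior admission, home O2) = "
      f"{predicted_probability(1, 1):.2f}")
```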


9. Effect Sizes and Association Measures

9.1 Why Effect Sizes Matter

A statistically significant result (small p-value) tells you that an effect probably exists in the population. It does not tell you whether the effect is clinically meaningful. Effect sizes answer the question: “How big is the effect?”

The hierarchy of information:

  1. P-value: Is there an effect? (binary: yes/no)
  2. Confidence interval: What is the plausible range of the effect?
  3. Effect size: How large is the effect, expressed in a standardised or clinically interpretable way?

9.2 Odds Ratio (OR)

Definition: The ratio of the odds of an outcome in the exposed group to the odds in the unexposed group.

Odds of event in group A = P(event in A) / P(no event in A)
Odds of event in group B = P(event in B) / P(no event in B)
OR = [Odds in A] / [Odds in B]

2×2 contingency table notation:

| | Outcome: Yes | Outcome: No |
|---|---|---|
| Exposed (E+) | a | b |
| Unexposed (E−) | c | d |

OR = (a/b) / (c/d) = ad / bc

95% CI: exp(ln(OR) ± 1.96 × √(1/a + 1/b + 1/c + 1/d))
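The OR and its confidence interval follow directly from the formulas above. A small helper (`odds_ratio_ci` is an illustrative name, not from a library), checked against the surgical-site-infection counts used later in this section:

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """OR and 95% CI from a 2x2 table.

    a, b = exposed with / without outcome; c, d = unexposed with / without.
    The log-OR standard error is sqrt(1/a + 1/b + 1/c + 1/d).
    """
    or_ = (a * d) / (b * c)
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# SSI example counts: 40/60 in diabetics, 20/80 in non-diabetics
or_, lo, hi = odds_ratio_ci(40, 60, 20, 80)
print(f"OR = {or_:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```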

Interpreting OR:

  • OR = 1.0: No association
  • OR > 1.0: Exposure associated with increased odds of outcome
  • OR < 1.0: Exposure associated with decreased odds of outcome

Natural home: Case-control studies and logistic regression models.

9.3 Relative Risk (RR) — also called Risk Ratio

Definition: The ratio of the probability (risk) of an outcome in the exposed group to the probability in the unexposed group.

RR = Risk in exposed / Risk in unexposed = [a/(a+b)] / [c/(c+d)]

Interpreting RR:

  • RR = 1.0: No association
  • RR = 2.0: Exposed group has twice the risk
  • RR = 0.5: Exposed group has half the risk (50% reduction)

Natural home: Cohort studies and RCTs.

9.4 Odds Ratio vs Relative Risk: When to Use Which

This is one of the most commonly confused distinctions in clinical research. Here is the complete framework:

Study design determines feasibility

| Study design | Can you calculate RR? | Can you calculate OR? |
|---|---|---|
| RCT | Yes (directly from data) | Yes (usually only as logistic regression output) |
| Prospective cohort | Yes | Yes |
| Retrospective cohort | Yes | Yes |
| Case-control | No (sampling from outcome group distorts risk) | Yes — OR is the correct measure |
| Cross-sectional | Prevalence ratio (a modified RR) | Yes |

Why can’t you calculate RR from a case-control study? Because you select participants based on the outcome (cases and controls), not based on exposure. The proportion of cases in your sample reflects your sampling ratio, not the true disease risk in the population. The OR is mathematically unaffected by this (it is the same whether you sample 1:1 or 1:4 cases to controls).

Outcome frequency matters

When an outcome is rare (<10%), the OR approximates the RR closely. This is the “rare disease assumption”:

When P(outcome) is small: OR ≈ RR

When an outcome is common (≥10%), the OR will be further from 1.0 than the RR, and they diverge substantially:

| True RR | True risk in unexposed | Approximate OR |
|---|---|---|
| 2.0 | 5% | 2.1 |
| 2.0 | 20% | 2.7 |
| 2.0 | 40% | 6.0 |

Reporting an OR when the outcome is common and calling it a “risk ratio” substantially overstates the effect. This is a pervasive error in the medical literature.

Worked Example:

Scenario: A study of surgical site infection (SSI) after colorectal surgery. Diabetic patients: 40 SSIs in 100 patients (40%). Non-diabetic: 20 SSIs in 100 patients (20%).

RR = (40/100) / (20/100) = 0.40 / 0.20 = 2.0
OR = (40×80) / (60×20) = 3200 / 1200 = 2.67

The OR (2.67) is 33% higher than the RR (2.0). Reporting the OR as if it were a risk ratio would overstate the association. Because this is a cohort study with a common outcome (>10%), report the RR.
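Reproducing this worked example from the raw counts makes the divergence explicit:

```python
# SSI worked example: risks, RR, and OR from the raw 2x2 counts
a, b = 40, 60   # diabetic: SSI yes / no
c, d = 20, 80   # non-diabetic: SSI yes / no

risk_exposed = a / (a + b)        # 0.40
risk_unexposed = c / (c + d)      # 0.20
rr = risk_exposed / risk_unexposed    # ratio of risks
or_ = (a * d) / (b * c)               # ratio of odds

# With a common outcome (>10%), the OR sits noticeably
# further from 1.0 than the RR
print(f"RR = {rr:.2f}, OR = {or_:.2f}")
```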

Logistic regression outputs ORs — when is this a problem?

Logistic regression models produce ORs, not RRs. When:

  • The outcome is rare: OR ≈ RR, report the OR from logistic regression
  • The outcome is common: Use alternatives to estimate RR:
    • Modified Poisson regression (with robust standard errors) — preferred, produces RR directly
    • Log-binomial regression — produces RR directly but can fail to converge
    • OR-to-RR conversion formula (Zhang & Yu, 1998): RR = OR / [(1 − P₀) + (P₀ × OR)] where P₀ = baseline risk in unexposed group

Summary decision rule:

  • Is your study a case-control? → Report the OR (the only valid measure)
  • Is your outcome rare (<10%)? → OR ≈ RR; report the OR from logistic regression
  • Is your outcome common (≥10%)?
    • In a cohort/RCT: calculate and report the RR directly
    • From a logistic model: use modified Poisson regression for the RR, or report the OR with a clear caveat
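The Zhang & Yu conversion formula is a one-liner. Applied to the SSI example's OR of 2.67 with a 20% baseline risk, it recovers approximately the true RR of 2.0:

```python
def or_to_rr(or_, p0):
    """Zhang & Yu (1998) approximation: convert an OR to an RR,
    given baseline risk p0 in the unexposed group.

    RR = OR / ((1 - p0) + p0 * OR)
    """
    return or_ / ((1 - p0) + p0 * or_)

# SSI example: OR 2.67, baseline (unexposed) risk 20%
rr_approx = or_to_rr(2.67, 0.20)
print(f"RR approx = {rr_approx:.2f}")
```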

9.5 Absolute Risk Reduction (ARR) and Number Needed to Treat (NNT)

ARR: The absolute difference in event rates between two groups.

ARR = Risk in control − Risk in treatment = (c/(c+d)) − (a/(a+b))

NNT: How many patients need to be treated to prevent one additional outcome event.

NNT = 1 / ARR

  • NNT < 10: Very effective treatment
  • NNT 10–50: Moderately effective
  • NNT > 100: Marginally effective (may still be worthwhile for serious outcomes)

Worked Example:

Scenario: In a trial of prophylactic low-molecular-weight heparin (LMWH) after major orthopaedic surgery: DVT rate = 8% in LMWH group (a/(a+b) = 0.08), 18% in placebo group (c/(c+d) = 0.18).

RR = 0.08 / 0.18 = 0.44 (56% reduction in relative risk)
ARR = 0.18 − 0.08 = 0.10 (10 percentage points)
NNT = 1 / 0.10 = 10

Interpretation: LMWH reduces the relative risk of DVT by 56% (RR 0.44). In absolute terms, for every 10 patients treated with LMWH, one additional DVT is prevented (NNT = 10). The NNT communicates clinical impact in a way the RR alone does not.
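The LMWH arithmetic as code, showing how the relative and absolute measures come from the same two risks:

```python
# LMWH trial example: DVT risks in the two arms
risk_control = 0.18
risk_treatment = 0.08

rr = risk_treatment / risk_control    # relative risk (0.44)
arr = risk_control - risk_treatment   # absolute risk reduction (0.10)
nnt = 1.0 / arr                       # number needed to treat (10)

print(f"RR = {rr:.2f}, ARR = {arr:.2f}, NNT = {nnt:.0f}")
```

In practice, non-integer NNTs are conventionally rounded up to the next whole patient.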

9.6 Standardised Effect Sizes

When outcomes are measured on different scales and you want to compare effect sizes across studies, use standardised effect sizes:

Cohen’s d: For continuous outcomes (mean difference)

d = (μ₁ − μ₂) / pooled SD

Benchmarks: small d=0.2, medium d=0.5, large d=0.8 (Cohen, 1988 — treat as rough guides only)
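The pooled-SD calculation can be sketched as a small helper (`cohens_d` is an illustrative name, not a library function):

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d using the pooled standard deviation:
    sqrt(((n1-1)*sd1^2 + (n2-1)*sd2^2) / (n1 + n2 - 2))."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2)
                          / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled_sd

# Hypothetical two-arm comparison: means 52 vs 47, common SD 10,
# 50 patients per arm
d = cohens_d(52, 10, 50, 47, 10, 50)
print(f"d = {d:.2f}")   # a "medium" effect by Cohen's benchmarks
```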

Eta-squared (η²): Proportion of variance explained (for ANOVA)

η² = SSbetween / SStotal

Partial η² is preferred for factorial ANOVA.

Omega-squared (ω²): Less biased than η², preferred for meta-analyses.


10. Survival and Time-to-Event Analysis

10.1 Why Standard Methods Fail for Survival Data

Consider a study following 100 patients after cancer surgery for 5 years, tracking whether they are alive or dead. Two problems arise that standard regression cannot handle:

Problem 1: Censoring. Some patients are still alive at study end. Some are lost to follow-up. Some died from an unrelated cause. All three are “censored” — they did not experience the event during observation, but we don’t know when or if they would have. Excluding them wastes information; treating them as non-events introduces bias.

Problem 2: Variable follow-up times. Patients enrolled at different times have different follow-up durations. A patient followed for 6 months contributes different information from one followed for 48 months.

Survival analysis incorporates both the occurrence of events and the time to event, while properly handling censored observations.

10.2 Core Concepts

Survival function S(t): The probability of surviving (i.e., not experiencing the event) beyond time t.

S(t) = P(T > t)

At t = 0: S(0) = 1.0 (everyone is event-free at start)
Over time: S(t) decreases monotonically (or stays flat if no events)

Hazard function h(t): The instantaneous rate of the event at time t, given survival to time t. Sometimes called the “force of mortality.”

Censoring types:

  • Right censoring (most common): The event has not occurred by the end of observation
  • Left censoring: The event occurred before observation started
  • Interval censoring: The event occurred in a known time interval

The critical assumption: Censoring must be non-informative — i.e., the reason for censoring must be unrelated to the probability of experiencing the event. If patients who drop out are more likely to die than those who stay in, estimates will be biased.

10.3 Kaplan-Meier Estimator

What it does: Non-parametrically estimates the survival function S(t) from observed data, accounting for censoring. Produces a step-function survival curve.

The calculation:

S(t) = Π [1 − dj/nj]

where the product is over all event times tj ≤ t:

  • dj = number of events at time tj
  • nj = number at risk just before time tj

Worked Example:

Research question: Estimate overall survival in 10 patients with metastatic colorectal cancer following first-line chemotherapy.

| Patient | Follow-up (months) | Event (death = 1, censored = 0) |
|---|---|---|
| 1 | 3 | 1 |
| 2 | 5 | 1 |
| 3 | 6 | 0 (lost to follow-up) |
| 4 | 8 | 1 |
| 5 | 10 | 1 |
| 6 | 12 | 0 (still alive at study end) |
| 7 | 14 | 1 |
| 8 | 18 | 0 |
| 9 | 20 | 1 |
| 10 | 24 | 0 |

KM calculation:

| Time (months) | Events (d) | At risk (n) | S(t) = S(t_prev) × (1 − d/n) |
|---|---|---|---|
| 0 | — | 10 | 1.000 |
| 3 | 1 | 10 | 1.000 × (1 − 1/10) = 0.900 |
| 5 | 1 | 9 | 0.900 × (1 − 1/9) = 0.800 |
| 8 | 1 | 7* | 0.800 × (1 − 1/7) = 0.686 |
| 10 | 1 | 6 | 0.686 × (1 − 1/6) = 0.571 |
| 14 | 1 | 4** | 0.571 × (1 − 1/4) = 0.429 |
| 20 | 1 | 2*** | 0.429 × (1 − 1/2) = 0.214 |

*Patient 3 (censored at 6 months) removed from the risk set before time 8
**Patient 6 (censored at 12 months) removed before time 14
***Patient 8 (censored at 18 months) leaves the risk set before time 20; Patient 10 (censored at 24 months) is still at risk, leaving n = 2

Interpretation: The estimated probability of surviving beyond 20 months is 21.4%. Median survival — the first event time at which S(t) falls to 0.5 or below — is 14 months, where S(t) drops from 0.571 to 0.429. The KM curve should be presented with number-at-risk tables below the time axis.

Reporting standard: Always include: (1) the KM curve with confidence bands, (2) the number-at-risk table at key time points, (3) median survival with 95% CI for each group.
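The product-limit calculation is simple enough to implement directly. A sketch (`kaplan_meier` is an illustrative helper, not a library routine; dedicated packages add CIs and plotting) that reproduces the 10-patient table above:

```python
import numpy as np

def kaplan_meier(times, events):
    """Kaplan-Meier estimates S(t) at each distinct event time.

    times: follow-up times; events: 1 = event, 0 = censored.
    Returns (event_times, survival_probabilities).
    """
    times = np.asarray(times, float)
    events = np.asarray(events, int)

    s, out_t, out_s = 1.0, [], []
    for t in np.unique(times[events == 1]):       # distinct event times
        n_at_risk = np.sum(times >= t)            # still under observation
        d = np.sum((times == t) & (events == 1))  # events at this time
        s *= 1.0 - d / n_at_risk                  # product-limit update
        out_t.append(float(t))
        out_s.append(float(s))
    return out_t, out_s

# The 10-patient worked example from this section
times = [3, 5, 6, 8, 10, 12, 14, 18, 20, 24]
events = [1, 1, 0, 1, 1, 0, 1, 0, 1, 0]
t, s = kaplan_meier(times, events)
print(list(zip(t, [round(x, 3) for x in s])))
```

Censored patients never trigger a step in S(t); they simply drop out of `n_at_risk` once their follow-up time is passed.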

10.4 Log-Rank Test

What it does: Tests whether the survival curves of two or more groups are identical. The non-parametric equivalent of the t-test for survival data. Uses a weighted sum of differences between observed and expected events at each event time.

When to use:

  • Comparing survival curves of 2+ groups
  • Non-parametric (makes no assumption about the shape of the survival curve)
  • Assumes proportional hazards (the hazard ratio between groups is constant over time)

Worked Example:

Research question: Do patients with KRAS wild-type (WT) colorectal cancer have better overall survival than those with KRAS mutant (MT) cancer following anti-EGFR therapy?

| Group | n | Events | Median OS (months) | 95% CI |
|---|---|---|---|---|
| KRAS WT | 85 | 62 | 18.4 | 14.2–22.6 |
| KRAS MT | 79 | 71 | 9.8 | 7.6–12.0 |

Log-rank test: χ²(1) = 14.8, p < 0.001

Interpretation: Patients with KRAS wild-type tumours had significantly longer overall survival than those with KRAS mutations (median 18.4 vs 9.8 months; log-rank p < 0.001). This finding supports the predictive role of KRAS status for anti-EGFR therapy benefit.

10.5 Cox Proportional Hazards Regression

What it does: The most widely used model for time-to-event data with multiple predictors. Models the hazard (instantaneous risk) as a function of predictor variables. Output is the hazard ratio (HR) — the ratio of hazards between groups.

The model:

h(t|X) = h₀(t) × exp(β₁X₁ + β₂X₂ + ... + βₚXₚ)

Where h₀(t) is the baseline hazard function (unspecified — this is a “semi-parametric” model).

The hazard ratio:

HR for predictor Xj = e^βj

Interpreting HR:

  • HR = 1.0: No association with time-to-event
  • HR = 2.0: Exposed group has twice the instantaneous rate of the event at any given time
  • HR = 0.5: Exposed group has half the hazard (50% risk reduction at any time)

The proportional hazards (PH) assumption: The hazard ratio between two groups is constant over time. This is the key assumption of Cox regression. Check it with:

  • Log-log plot: log(−log(S(t))) vs log(t) — lines should be parallel
  • Schoenfeld residuals plot — no trend over time
  • Grambsch-Therneau test (formal statistical test of PH assumption)

If PH is violated: use time-varying coefficients, stratified Cox model, or parametric models (Weibull, log-logistic).

Worked Example:

Research question: What factors predict time to dialysis initiation in a cohort of 280 CKD patients followed for up to 5 years? Predictors: age, sex, eGFR at baseline, proteinuria (g/24h), diabetes, hypertension.

Events: 98 patients started dialysis; 182 censored

| Predictor | HR | 95% CI | p-value |
|---|---|---|---|
| Age (per 10 years) | 1.18 | 0.98–1.42 | 0.082 |
| Male sex | 1.44 | 0.94–2.20 | 0.094 |
| eGFR at baseline (per 10 mL/min/1.73m² increase) | 0.51 | 0.42–0.62 | <0.001 |
| Proteinuria (per 1 g/24h increase) | 1.67 | 1.38–2.02 | <0.001 |
| Diabetes (yes vs no) | 2.03 | 1.32–3.12 | 0.001 |
| Hypertension (yes vs no) | 1.38 | 0.89–2.13 | 0.150 |

PH assumption checked: Schoenfeld residuals test p=0.38 (no violation)

Interpretation:

  • Each 10 mL/min/1.73m² higher baseline eGFR was associated with a 49% lower hazard of dialysis initiation (HR 0.51, 95% CI 0.42–0.62, p < 0.001).
  • Each 1 g/24h increase in proteinuria was associated with a 67% higher hazard of dialysis initiation (HR 1.67, 95% CI 1.38–2.02, p < 0.001).
  • Patients with diabetes had twice the hazard of dialysis initiation compared to non-diabetic patients (HR 2.03, 95% CI 1.32–3.12, p = 0.001).
  • After adjustment, age, sex, and hypertension were not independently associated with dialysis initiation.

Reporting template: “In multivariable Cox regression, proteinuria (HR 1.67 per 1 g/24h increase, 95% CI 1.38–2.02, p < 0.001) and diabetes (HR 2.03, 95% CI 1.32–3.12, p = 0.001) were independently associated with time to dialysis initiation after adjustment for baseline eGFR and other covariates.”


11. Multivariable Modelling Strategy

11.1 Univariate vs Multivariable Analysis: The Clinical Workflow

Almost all published clinical research involves both steps:

Step 1: Univariate (crude) analysis

  • Each predictor is tested against the outcome individually, without adjustment
  • Reports crude (unadjusted) ORs, HRs, or mean differences
  • Purpose: describe raw associations, identify candidate variables for multivariable model

Step 2: Multivariable (adjusted) analysis

  • Selected predictors are entered simultaneously into a regression model
  • Reports adjusted ORs, HRs, or mean differences, with each predictor’s effect estimated after controlling for the others
  • Purpose: identify independent predictors, control for confounding

The relationship between crude and adjusted estimates is clinically informative. A variable that is significant in univariate but not multivariable analysis was likely confounded. A variable that appears non-significant univariately but significant in multivariable analysis was previously masked by confounders (negative confounding).

11.2 What Is Confounding?

A confounder is a third variable that:

  1. Is associated with the exposure/predictor
  2. Is associated with the outcome
  3. Is NOT an intermediary on the causal pathway between exposure and outcome

Example: A study finds that coffee drinking is associated with lung cancer. But coffee drinkers are also more likely to smoke, and smoking causes lung cancer. Smoking is a confounder. After adjusting for smoking, the association between coffee and lung cancer disappears.

Controlling for confounders:

  • Include them in the regression model (most common)
  • Matching (case-control studies, propensity score matching)
  • Restriction (study only non-smokers)
  • Stratification (analyse smokers and non-smokers separately)

11.3 Selecting Variables for a Multivariable Model

The EPV (events per variable) rule: As a minimum, you need approximately 10 events per predictor variable in logistic and Cox regression to avoid overfitting. With 80 events, include a maximum of 8 predictors.

Approaches to variable selection:

1. Hypothesis-driven selection (preferred in clinical research): Select predictors based on clinical knowledge and prior literature, regardless of statistical significance in univariate analysis. Pre-specify in your protocol.

2. Univariate screening approach:

  • Test each candidate predictor in univariate analysis
  • Include variables with p < 0.2 (or 0.25) as candidates — not p < 0.05, as this misses potentially important confounders
  • Also include clinically important variables regardless of p-value

3. Automated stepwise selection (not recommended as primary approach):

  • Backward elimination, forward selection, or bidirectional stepwise
  • Problems: capitalises on chance, biased SEs and p-values, unreproducible in different samples
  • May be used for exploratory analyses but results should be validated in an independent dataset

11.4 Handling Confounding: A Worked Example

Research question: Is emergency (vs elective) hospital admission associated with in-hospital mortality? Data from 600 admissions.

Univariate analysis:

| Variable | Crude OR | 95% CI | p |
|---|---|---|---|
| Emergency admission (vs elective) | 3.22 | 1.84–5.63 | <0.001 |
| Age (per 10 years) | 1.65 | 1.31–2.08 | <0.001 |
| Charlson comorbidity index | 1.48 | 1.24–1.77 | <0.001 |
| Male sex | 1.29 | 0.76–2.20 | 0.340 |

Multivariable logistic regression:

| Variable | Adjusted OR | 95% CI | p |
|---|---|---|---|
| Emergency admission (vs elective) | 1.87 | 1.01–3.47 | 0.047 |
| Age (per 10 years) | 1.44 | 1.12–1.85 | 0.004 |
| Charlson comorbidity index | 1.36 | 1.12–1.65 | 0.002 |
| Male sex | 1.15 | 0.65–2.03 | 0.627 |

Interpretation: Emergency admission was significantly associated with in-hospital mortality in both univariate (crude OR 3.22) and multivariable (adjusted OR 1.87) analyses. The attenuation from 3.22 to 1.87 indicates that age and comorbidity are confounders — emergency admissions tend to involve older, sicker patients, which partially explains their higher mortality. The adjusted OR represents the “true” independent association after accounting for these differences.

11.5 Propensity Score Methods

The problem: In observational studies, patients who receive a treatment differ systematically from those who don’t. Simply adjusting for confounders in regression may be insufficient when there are many confounders or when the treatment and control groups barely overlap.

Propensity score (PS): The predicted probability of receiving the treatment, given a patient’s observed baseline characteristics. Estimated using logistic regression with treatment as outcome and all confounders as predictors.

Uses of the propensity score:

1. Propensity score matching: Match each treated patient to one (or more) untreated patient(s) with a similar PS. Creates two groups balanced on measured confounders — mimics a randomised trial.

2. PS stratification: Divide patients into quintiles of PS and compare outcomes within each stratum.

3. Inverse probability of treatment weighting (IPTW): Reweight observations so that the weighted sample resembles a randomised trial.

Worked Example:

Research question: Using a registry of 800 STEMI patients, compare 1-year mortality between those who received drug-eluting stent (DES, n=400) vs bare metal stent (BMS, n=400). Patients who received DES were younger, had lower GRACE scores, and fewer comorbidities.

After propensity score matching (caliper width = 0.1 SD of logit PS):

  • 312 matched pairs (DES vs BMS)
  • Baseline characteristics now balanced (standardised differences all <0.10)

Matched analysis: HR for 1-year mortality, DES vs BMS = 0.74 (95% CI 0.55–0.99, p = 0.043)

Compare to: Unmatched analysis: HR = 0.51 (95% CI 0.39–0.67, p < 0.001) — substantially biased by confounding.

Interpretation: After propensity score matching to account for confounders, DES was associated with a 26% reduction in 1-year mortality compared to BMS (HR 0.74, p = 0.043). The unmatched estimate (a 49% reduction) was confounded by the baseline differences between groups.


12. Multivariate Methods

12.1 Terminology Clarification

Multivariable: Multiple predictor variables, ONE outcome (e.g. multiple linear regression)
Multivariate: Multiple outcome variables simultaneously (e.g. MANOVA, PCA)

This distinction is frequently misused in published literature. MANOVA, PCA, and factor analysis are truly “multivariate” methods.

12.2 MANOVA (Multivariate Analysis of Variance)

What it does: Tests whether groups differ on a combination of continuous outcome variables simultaneously. An extension of ANOVA to multiple outcomes.

When to use:

  • 3+ continuous outcome variables that are correlated with each other
  • One or more grouping factors
  • You want to test overall group differences before examining individual outcomes

Why not just run separate ANOVAs?

  1. Multiple testing inflates Type I error (with 5 outcomes at α=0.05, ~22% chance of at least one false positive)
  2. Ignores correlations among outcomes — MANOVA uses these to improve power
  3. MANOVA can detect group differences that no single ANOVA would

MANOVA test statistics: Wilks’ Lambda (most common), Pillai’s trace, Hotelling-Lawley trace, Roy’s largest root. All test the same null hypothesis but differ in robustness to assumption violations. Pillai’s trace is most robust to violations.

Worked Example:

Research question: Does exercise training modality (aerobic vs resistance vs combined vs control, n=30 per group) differentially affect cardiorespiratory fitness across three outcomes: VO₂max (mL/kg/min), 6-minute walk distance (m), and resting heart rate (bpm)?

Outcomes are moderately intercorrelated (r = 0.40–0.65).

MANOVA:

  • Pillai’s trace = 0.52, F(9, 342) = 7.44, p < 0.001

Follow-up univariate ANOVAs (with Bonferroni correction, α = 0.017):

  • VO₂max: F(3,116) = 12.4, p < 0.001
  • 6MWD: F(3,116) = 8.7, p < 0.001
  • Resting HR: F(3,116) = 5.2, p = 0.002

Interpretation: Training modality had a significant multivariate effect on cardiorespiratory fitness outcomes (MANOVA: Pillai’s trace = 0.52, F(9,342) = 7.44, p < 0.001). Follow-up univariate ANOVAs revealed significant effects on all three individual outcomes (all p ≤ 0.002 after Bonferroni correction).

12.3 Principal Component Analysis (PCA)

What it does: A data reduction technique that transforms a set of correlated variables into a smaller set of uncorrelated components (principal components) that capture most of the variance in the original data.

When to use:

  • Many correlated predictor variables (multicollinearity) — reduce before regression
  • Exploratory data analysis of high-dimensional data
  • Visualising patterns in complex datasets

Key outputs:

  • Eigenvalues: Variance explained by each component. Components with eigenvalue > 1 are typically retained (Kaiser criterion).
  • Scree plot: Graph of eigenvalues — look for the “elbow” where the curve flattens.
  • Factor loadings: Correlation between original variables and each component. Loadings > 0.4 are typically considered meaningful.
  • % variance explained: How much of the total variability each component captures.
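The eigendecomposition underlying these outputs can be sketched in numpy. This is a minimal illustration of PCA on the correlation matrix (real analyses would typically use a library routine, and `pca_on_correlation` is an invented helper name); the simulated data share one latent factor among the first three variables:

```python
import numpy as np

def pca_on_correlation(X):
    """PCA via eigendecomposition of the correlation matrix
    (equivalent to PCA on standardised variables).

    Returns eigenvalues (descending) and loadings (columns = components).
    """
    corr = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)   # ascending order
    order = np.argsort(eigvals)[::-1]         # re-sort largest first
    return eigvals[order], eigvecs[:, order]

# Simulated data: 3 variables sharing a latent factor + 2 noise variables
rng = np.random.default_rng(7)
latent = rng.normal(size=(300, 1))
X = np.hstack([
    latent + 0.5 * rng.normal(size=(300, 3)),   # correlated block
    rng.normal(size=(300, 2)),                  # unrelated variables
])

eigvals, loadings = pca_on_correlation(X)
print("Eigenvalues:", np.round(eigvals, 2))
```

Because each standardised variable contributes variance 1, the eigenvalues sum to the number of variables — which is why eigenvalue > 1 (the Kaiser criterion) marks a component that explains more than one variable's worth of variance.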

Worked Example:

Research question: A metabolic syndrome study measures 8 correlated biomarkers in 300 patients: waist circumference, fasting glucose, HDL-C, LDL-C, triglycerides, SBP, DBP, and insulin. Reduce these to a smaller set of components.

PCA results:

| Component | Eigenvalue | % Variance | Cumulative % |
|---|---|---|---|
| PC1 | 3.12 | 39.0% | 39.0% |
| PC2 | 1.84 | 23.0% | 62.0% |
| PC3 | 1.02 | 12.8% | 74.8% |
| PC4–8 | <0.80 each | <10% each | — |

Three components retained (eigenvalue > 1, 75% variance explained).

Loading matrix (simplified):

| Variable | PC1 (“metabolic risk”) | PC2 (“blood pressure”) | PC3 (“lipid profile”) |
|---|---|---|---|
| Waist circumference | 0.78 | 0.12 | 0.21 |
| Fasting glucose | 0.72 | 0.18 | −0.14 |
| Insulin | 0.69 | 0.08 | 0.22 |
| Triglycerides | 0.61 | 0.23 | 0.48 |
| SBP | 0.15 | 0.82 | 0.19 |
| DBP | 0.22 | 0.79 | 0.08 |
| HDL-C | −0.54 | 0.14 | 0.62 |
| LDL-C | 0.28 | 0.20 | 0.71 |

Interpretation: PC1 captures a “central metabolic risk” factor (high waist, glucose, insulin, TG; low HDL). PC2 represents blood pressure. PC3 captures lipid profile. These three components can replace the 8 original variables as predictors in subsequent analyses with minimal information loss.


13. Mixed Models and Longitudinal Data

13.1 Why Standard ANOVA Is Insufficient for Longitudinal Data

Repeated measures ANOVA requires complete data (no missing values), assumes compound symmetry (equal variances and covariances between all time pairs), and cannot handle time-varying covariates. In clinical trials, 10–40% of observations are commonly missing.

Linear mixed effects (LME) models overcome these limitations:

  • Handle missing data (missing-at-random) without imputation
  • Allow flexible correlation structures (not just compound symmetry)
  • Can accommodate unequally spaced measurement occasions
  • Can model individual trajectories (random slopes)

13.2 Linear Mixed Effects Models

The model:

Y_ij = (β₀ + b₀ᵢ) + (β₁ + b₁ᵢ)×time_ij + β₂×X_ij + ε_ij

Where:

  • β₀, β₁ = fixed effects (population-average intercept and slope)
  • b₀ᵢ, b₁ᵢ = random effects for subject i (individual deviations from average)
  • ε_ij = residual error

Fixed effects: Average effects across the population (reported)
Random effects: Between-subject variability in intercepts and/or slopes

Worked Example:

Research question: A 12-month RCT of a lifestyle intervention in type 2 diabetes. HbA1c is measured at baseline, 3, 6, and 12 months in 120 patients (60 intervention, 60 control). 18% of follow-up data are missing (missing-at-random).

LME model: HbA1c ~ time × treatment + age + baseline HbA1c + (1+time|patient)

Key results:

| Effect | Coefficient | SE | 95% CI | p |
|---|---|---|---|---|
| Time (per month, control arm) | −0.021 | 0.008 | −0.037 to −0.005 | 0.011 |
| Treatment × time interaction | −0.038 | 0.011 | −0.059 to −0.017 | 0.001 |
| Age (per year) | 0.012 | 0.007 | −0.002 to 0.026 | 0.089 |

Interpretation: In the control arm, HbA1c decreased by 0.021% per month (reflecting background treatment changes). In the intervention arm, HbA1c decreased by an additional 0.038% per month compared to control (interaction term p = 0.001), yielding a net additional reduction of 0.46% at 12 months. The mixed model used all available data including observations with missing follow-up, reducing bias compared to complete-case analysis.
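The 12-month net effect quoted above comes from scaling the monthly interaction coefficient (and its CI, which scales linearly) by the follow-up duration — a quick check:

```python
# Treatment-by-time interaction from the LME table: extra HbA1c change per month
interaction_per_month = -0.038          # % HbA1c per month (intervention vs control)
ci_low, ci_high = -0.059, -0.017        # 95% CI per month
months = 12

net_effect = interaction_per_month * months      # -0.456 -> ~-0.46% at 12 months
net_ci = (ci_low * months, ci_high * months)     # approx (-0.71, -0.20)
```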


14. Diagnostic Test Evaluation

14.1 The 2×2 Table for Diagnostic Tests

All diagnostic test statistics derive from the 2×2 table comparing test result to the true diagnosis (gold standard):

| | Disease present | Disease absent | Total |
|---|---|---|---|
| Test positive | True positive (TP) | False positive (FP) | TP+FP |
| Test negative | False negative (FN) | True negative (TN) | FN+TN |
| Total | TP+FN | FP+TN | N |

14.2 Sensitivity, Specificity, PPV, NPV

Sensitivity: P(test positive | disease present) = TP / (TP+FN)

  • A highly sensitive test rarely misses disease (few false negatives)
  • “SnNout” — a highly Sensitive test when Negative rules OUT disease

Specificity: P(test negative | disease absent) = TN / (FP+TN)

  • A highly specific test rarely gives false positives
  • “SpPin” — a highly Specific test when Positive rules IN disease

Positive Predictive Value (PPV): P(disease present | test positive) = TP / (TP+FP)

  • Depends heavily on disease prevalence — PPV falls sharply with lower prevalence

Negative Predictive Value (NPV): P(disease absent | test negative) = TN / (FN+TN)

  • NPV rises with lower prevalence

The prevalence dependence of PPV and NPV: Unlike sensitivity and specificity (intrinsic test properties), PPV and NPV depend on the prevalence of disease in the tested population. A test with 95% sensitivity and 95% specificity applied to a population with 1% prevalence has a PPV of only 16.1%.
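This prevalence dependence follows directly from Bayes’ theorem and is easy to verify numerically. A minimal sketch, applying the 95%/95% test from the text at falling prevalence:

```python
def ppv(sens, spec, prev):
    """Positive predictive value via Bayes' theorem."""
    tp = sens * prev                 # P(test+ and diseased)
    fp = (1 - spec) * (1 - prev)     # P(test+ and not diseased)
    return tp / (tp + fp)

def npv(sens, spec, prev):
    """Negative predictive value via Bayes' theorem."""
    tn = spec * (1 - prev)
    fn = (1 - sens) * prev
    return tn / (tn + fn)

# Same 95%/95% test: PPV collapses as prevalence falls, NPV rises
for prev in (0.20, 0.05, 0.01):
    print(f"prevalence {prev:.0%}: PPV {ppv(0.95, 0.95, prev):.1%}, "
          f"NPV {npv(0.95, 0.95, prev):.1%}")
```

At 1% prevalence this reproduces the 16.1% PPV quoted above.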

Worked Example:

Research question: Evaluate a new point-of-care troponin I assay for ruling out NSTEMI in 500 ED patients with chest pain. True diagnosis confirmed by serial high-sensitivity troponin.

| | NSTEMI (n=80) | No NSTEMI (n=420) |
|---|---|---|
| POC troponin positive (≥40 ng/L) | 72 | 21 |
| POC troponin negative (<40 ng/L) | 8 | 399 |

Prevalence = 80/500 = 16%

Sensitivity = 72/80 = 90.0% (95% CI: 81.2–95.6%)
Specificity = 399/420 = 95.0% (95% CI: 92.5–96.9%)
PPV = 72/93 = 77.4% (95% CI: 67.7–85.3%)
NPV = 399/407 = 98.0% (95% CI: 96.1–99.2%)

Interpretation: This POC troponin assay demonstrates high sensitivity (90%) and specificity (95%) for NSTEMI detection. The NPV of 98.0% supports its use as a rule-out strategy — of patients who test negative, 98% truly do not have NSTEMI. The PPV of 77.4% indicates that 23% of positive results will be false positives at this prevalence (16%), so confirmatory testing is needed for positive results.
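All four point estimates can be reproduced from the 2×2 counts. A minimal sketch (confidence intervals omitted — in practice use exact or Wilson intervals from statistical software):

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV, NPV and prevalence from 2x2 counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "ppv": tp / (tp + fp),
        "npv": tn / (fn + tn),
        "prevalence": (tp + fn) / (tp + fp + fn + tn),
    }

# Counts from the POC troponin example: TP=72, FP=21, FN=8, TN=399
m = diagnostic_metrics(72, 21, 8, 399)
```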

14.3 ROC Curves and AUC

What it does: Evaluates a continuous or ordinal diagnostic test across all possible cutpoints. Plots sensitivity (y-axis) against 1-specificity (x-axis) as the threshold varies.

AUC (Area Under the Curve) / C-statistic:

  • 0.5 = no discrimination (no better than chance)
  • 0.7–0.8 = acceptable discrimination
  • 0.8–0.9 = excellent
  • >0.9 = outstanding

Optimal cutpoint: Choose based on clinical need:

  • For rule-out tests (screening): maximise sensitivity (accept lower specificity)
  • For rule-in tests (confirmation): maximise specificity (accept lower sensitivity)
  • Youden’s index (sensitivity + specificity − 1): balanced optimum

Comparing two tests: DeLong’s method for comparing paired AUCs from the same sample.
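The AUC has a useful equivalent definition: it is the probability that a randomly chosen diseased patient scores higher than a randomly chosen non-diseased patient (the Mann–Whitney interpretation). A brute-force sketch with hypothetical biomarker values:

```python
def auc(scores_pos, scores_neg):
    """AUC as P(random positive scores above random negative); ties count 0.5."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical biomarker values in diseased vs non-diseased patients
diseased = [62, 75, 81, 90, 55]
healthy = [40, 52, 61, 70, 48]
print(round(auc(diseased, healthy), 2))  # 0.88
```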

Worked Example:

Research question: Compare eGFR alone vs a clinical risk score (incorporating eGFR + proteinuria + age + diabetes) for predicting dialysis within 3 years in 350 CKD patients.

| Model | AUC | 95% CI |
|---|---|---|
| eGFR alone | 0.73 | 0.67–0.79 |
| Clinical risk score | 0.84 | 0.79–0.89 |
| Difference | +0.11 | p = 0.003 |

Interpretation: The clinical risk score (AUC 0.84) significantly outperforms eGFR alone (AUC 0.73) for predicting 3-year dialysis initiation (DeLong’s test p = 0.003). Adding proteinuria, age, and diabetes to eGFR substantially improves discrimination.


15. Agreement and Reliability

15.1 Cohen’s Kappa

What it does: Measures agreement between two raters (or methods) on categorical outcomes, corrected for chance agreement.

κ = (Po − Pe) / (1 − Pe)

where Po = observed agreement proportion and Pe = expected agreement by chance.

Interpreting kappa (Landis & Koch thresholds — use as rough guides):

| κ | Interpretation |
|---|---|
| <0.00 | Poor (less than chance) |
| 0.00–0.20 | Slight |
| 0.21–0.40 | Fair |
| 0.41–0.60 | Moderate |
| 0.61–0.80 | Substantial |
| 0.81–1.00 | Almost perfect |

Worked Example:

Research question: Two radiologists independently classify 120 chest X-rays as: normal, consolidation, or interstitial change. What is their agreement?

| | Rad 2: Normal | Rad 2: Consolidation | Rad 2: Interstitial | Total |
|---|---|---|---|---|
| Rad 1: Normal | 48 | 4 | 2 | 54 |
| Rad 1: Consolidation | 3 | 28 | 2 | 33 |
| Rad 1: Interstitial | 1 | 2 | 30 | 33 |
| Total | 52 | 34 | 34 | 120 |

Po = (48+28+30)/120 = 106/120 = 0.883

Expected agreement:

  • Pe(normal) = (54×52)/120² = 0.195
  • Pe(consolidation) = (33×34)/120² = 0.078
  • Pe(interstitial) = (33×34)/120² = 0.078
  • Pe = 0.195 + 0.078 + 0.078 = 0.351
κ = (0.883 − 0.351) / (1 − 0.351) = 0.532 / 0.649 = 0.82

Interpretation: There is almost perfect agreement between the two radiologists for chest X-ray classification (κ = 0.82, 95% CI 0.73–0.91).
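The calculation above can be reproduced from the raw agreement table. A minimal sketch (confidence interval omitted):

```python
def cohens_kappa(table):
    """Cohen's kappa from a square agreement table (rows = rater 1, cols = rater 2)."""
    n = sum(sum(row) for row in table)
    k = len(table)
    po = sum(table[i][i] for i in range(k)) / n          # observed agreement
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    pe = sum(row_tot[i] * col_tot[i] for i in range(k)) / n ** 2  # chance agreement
    return (po - pe) / (1 - pe)

# Radiologist agreement table from the worked example
table = [[48, 4, 2],
         [3, 28, 2],
         [1, 2, 30]]
print(round(cohens_kappa(table), 2))  # 0.82
```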

15.2 Bland-Altman Analysis

What it does: Assesses the agreement between two continuous measurement methods. Plots the difference between methods (y-axis) against the mean of the two methods (x-axis). Identifies systematic bias and limits of agreement.

Key outputs:

  • Bias: Mean difference (Method A − Method B). Non-zero bias indicates systematic over- or under-measurement by one method.
  • Limits of agreement (LOA): Bias ± 1.96 × SD of differences. The range within which 95% of differences will fall.
  • Clinical decision: Are the LOA clinically acceptable? If the maximum acceptable difference is ±5 mmHg and the LOA are ±3 mmHg, the methods agree well enough for clinical use.

Why NOT to use correlation for method comparison: Pearson r measures association, not agreement. Two methods could be highly correlated but systematically disagree. Bland-Altman is the correct approach.

Worked Example:

Research question: Compare automated oscillometric blood pressure (AOBP) with gold-standard intra-arterial (IA) SBP measurement in 50 ICU patients.

| Statistic | Value |
|---|---|
| Mean AOBP | 124.6 mmHg |
| Mean IA SBP | 128.4 mmHg |
| Mean difference (AOBP − IA) | −3.8 mmHg |
| SD of differences | 8.2 mmHg |
| Upper LOA (+1.96 SD) | −3.8 + 16.1 = +12.3 mmHg |
| Lower LOA (−1.96 SD) | −3.8 − 16.1 = −19.9 mmHg |

Interpretation: AOBP underestimates IA SBP by a mean of 3.8 mmHg. The limits of agreement range from −19.9 to +12.3 mmHg, meaning in 95% of patients, AOBP will differ from IA by between 20 mmHg below and 12 mmHg above. Given the wide LOA, AOBP cannot reliably substitute for intra-arterial measurement in haemodynamically unstable ICU patients where precision of ±5 mmHg is needed.
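The Bland-Altman quantities are straightforward to compute from paired readings. A minimal sketch with hypothetical data (the worked example above uses different, summarised values):

```python
from statistics import mean, stdev

def bland_altman(a, b):
    """Bias and 95% limits of agreement for two paired measurement methods."""
    diffs = [x - y for x, y in zip(a, b)]
    bias = mean(diffs)               # systematic difference (method A - method B)
    sd = stdev(diffs)                # SD of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical paired SBP readings (method A vs method B), mmHg
a = [120, 118, 135, 128, 142, 125]
b = [124, 121, 139, 130, 148, 127]
bias, lower, upper = bland_altman(a, b)
```

The clinical question is then whether the interval (lower, upper) falls inside the maximum acceptable difference for the measurement task.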


16. Bayesian Methods

16.1 Frequentist vs Bayesian Framework

The fundamental difference:

Frequentist (classical) statistics:

  • Parameters (e.g., true treatment effect) are fixed, unknown constants
  • Probability refers to long-run frequency of events
  • P-value = P(data this extreme | H₀ is true) — does not tell you probability that H₀ is true
  • Cannot make probability statements about parameters

Bayesian statistics:

  • Parameters have probability distributions reflecting uncertainty
  • You start with a prior distribution (beliefs before seeing the data)
  • You update with observed data (the likelihood)
  • You get a posterior distribution (updated beliefs)
  • Can make direct probability statements: “P(true effect > 0 | data) = 0.97”

Bayes’ theorem:

Posterior ∝ Prior × Likelihood

P(θ|data) ∝ P(data|θ) × P(θ)
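For a binomial outcome with a Beta prior, this update has a closed form (conjugacy): a Beta(a, b) prior combined with s responders out of n gives a Beta(a + s, b + n − s) posterior. An illustrative sketch with hypothetical prior parameters (not the analysis from the later worked example):

```python
# Conjugate Beta-binomial update: Beta(a, b) prior on a response rate,
# observe s responders out of n -> posterior Beta(a + s, b + n - s)
def update_beta(a, b, successes, n):
    return a + successes, b + (n - successes)

a0, b0 = 2, 8                           # hypothetical prior: rate centred on 20%
a1, b1 = update_beta(a0, b0, 7, 15)     # observe 7 responders in 15 patients
prior_mean = a0 / (a0 + b0)             # 0.20
post_mean = a1 / (a1 + b1)              # 9/25 = 0.36, pulled between prior and data
```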

16.2 Credible Intervals vs Confidence Intervals

Frequentist 95% CI: In repeated sampling, 95% of such intervals would contain the true parameter. Does NOT mean “95% probability the true value is in this interval” for this specific interval — though clinicians routinely interpret it this way.

Bayesian 95% credible interval (CrI): There IS a 95% probability that the true parameter lies within this interval (given the prior and the data). This is the natural, intuitive interpretation most clinicians want.

16.3 Bayesian Analysis in Practice

Worked Example:

Research question: A pilot RCT tests a new immunotherapy in 30 patients with refractory rheumatoid arthritis (15 active, 15 placebo). ACR50 response rates: 7/15 (47%) active, 3/15 (20%) placebo.

Frequentist analysis:

  • OR = 3.5, 95% CI 0.71–17.3, p = 0.12
  • Conclusion: “Not statistically significant” — ambiguous for a small pilot trial

Bayesian analysis:

  • Prior: Weakly informative prior based on existing biologics literature (modest positive effect expected)
  • Posterior OR = 3.2 (95% CrI: 1.02–10.4)
  • P(OR > 1 | data) = 0.96 → 96% probability that the active treatment has a positive effect
  • P(OR > 2 | data) = 0.72 → 72% probability of at least a doubling of odds of response

Interpretation: While the frequentist analysis is technically non-significant (p = 0.12) — likely due to small sample size — Bayesian analysis incorporating prior evidence indicates a 96% posterior probability that the new treatment outperforms placebo. These findings support proceeding to a full Phase III RCT.
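A posterior probability such as P(OR > 1 | data) can be approximated by Monte Carlo sampling from the posterior of each arm’s response rate. This sketch uses flat Beta(1,1) priors rather than the weakly informative prior used above, so its answer will differ somewhat from the 0.96 quoted:

```python
import random

random.seed(1)  # reproducible draws

def prob_or_gt_1(s_t, n_t, s_c, n_c, draws=100_000):
    """Monte Carlo estimate of P(OR > 1 | data) under flat Beta(1,1) priors."""
    wins = 0
    for _ in range(draws):
        p_t = random.betavariate(1 + s_t, 1 + n_t - s_t)  # active-arm posterior
        p_c = random.betavariate(1 + s_c, 1 + n_c - s_c)  # placebo-arm posterior
        if p_t > p_c:   # OR > 1 exactly when p_t > p_c
            wins += 1
    return wins / draws

# 7/15 responders on active treatment vs 3/15 on placebo
p_superior = prob_or_gt_1(7, 15, 3, 15)   # roughly 0.93-0.94 with flat priors
```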

16.4 Bayes Factors

What it does: A ratio of how well H₁ predicts the data relative to H₀. Can provide evidence for the null — something p-values cannot do.

BF₁₀ = P(data | H₁) / P(data | H₀)

Interpretation:

| BF₁₀ | Evidence for H₁ |
|---|---|
| 1–3 | Anecdotal |
| 3–10 | Moderate |
| 10–30 | Strong |
| 30–100 | Very strong |
| >100 | Extreme |
| <1/3 | Moderate evidence for H₀ |

Clinical use case: Non-inferiority trials — showing that a new (cheaper, safer) treatment is “not meaningfully worse.” A Bayes Factor <1/3 provides positive evidence for the null (no difference) rather than merely failing to reject it.


17. Meta-Analysis and Systematic Review

17.1 The Evidence Hierarchy

Meta-analysis of randomised controlled trials sits at the top of the evidence hierarchy. It pools quantitative data from multiple studies to produce a single, more precise estimate of effect.

Why meta-analysis?

  • Individual trials often underpowered to detect small but clinically important effects
  • Pooling increases precision (narrower CI)
  • Identifies sources of heterogeneity
  • More generalisable than any single study

17.2 Fixed vs Random Effects Models

Fixed effects model:

  • Assumes all studies estimate the same true effect
  • Between-study variation is due to sampling error only
  • Appropriate when studies are functionally identical (same population, intervention, outcome)
  • Gives more weight to larger studies

Random effects model (DerSimonian-Laird):

  • Assumes studies estimate different but related true effects (a distribution of effects)
  • Between-study variability (heterogeneity, τ²) is estimated and incorporated
  • Results in wider, more honest CIs
  • More appropriate for most clinical meta-analyses where populations and protocols vary
  • More weight distributed to smaller studies compared to fixed effects

How to choose: Examine heterogeneity (I²). If I² < 25%, either model is reasonable. If I² ≥ 50% (substantial heterogeneity), random effects model is more appropriate. However, if heterogeneity is very high (I² > 75%), even the random effects pooled estimate should be interpreted cautiously.
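Fixed-effect pooling is an inverse-variance weighted average on the log-OR scale. A minimal sketch with two hypothetical trials, using the Woolf variance for the log-OR:

```python
import math

def fixed_effect_pool(studies):
    """Inverse-variance fixed-effect pooling of 2x2 tables on the log-OR scale.

    Each study is (events_treat, n_treat, events_ctrl, n_ctrl).
    """
    num = den = 0.0
    for a, n1, c, n2 in studies:
        b, d = n1 - a, n2 - c
        log_or = math.log((a * d) / (b * c))
        var = 1 / a + 1 / b + 1 / c + 1 / d   # Woolf variance of the log-OR
        w = 1 / var                            # inverse-variance weight
        num += w * log_or
        den += w
    pooled = num / den
    se = math.sqrt(1 / den)
    return (math.exp(pooled),
            math.exp(pooled - 1.96 * se),
            math.exp(pooled + 1.96 * se))

# Two hypothetical trials: (events/n treatment, events/n control)
or_, lo, hi = fixed_effect_pool([(30, 100, 45, 100), (25, 120, 40, 115)])
```

A random effects analysis adds the between-study variance τ² to each study’s variance before weighting, which is why its CI is wider.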

17.3 Worked Meta-Analysis Example

Research question: What is the effect of ACE inhibitors on cardiovascular mortality in patients with heart failure? Systematic review identified 6 eligible RCTs (study names and effect estimates below are illustrative).

| Study | Control events/n | ACE-I events/n | OR | 95% CI |
|---|---|---|---|---|
| CONSENSUS (1987) | 44/126 | 29/127 | 0.56 | 0.31–0.99 |
| SOLVD-T (1991) | 452/1284 | 386/1285 | 0.81 | 0.68–0.96 |
| ATLAS (1999) | 52/1596 | 45/1568 | 0.87 | 0.58–1.30 |
| V-HeFT II (1991) | 131/403 | 117/403 | 0.84 | 0.61–1.16 |
| MERIT-HF (1999) | 145/2001 | 128/1990 | 0.89 | 0.70–1.14 |
| CIBIS-II (1999) | 156/1320 | 119/1327 | 0.73 | 0.57–0.94 |

Heterogeneity:

  • I² = 18% (low heterogeneity — fixed effects model acceptable)
  • Cochran’s Q = 6.1, p = 0.30

Pooled estimate (fixed effects):

  • Pooled OR = 0.81 (95% CI: 0.74–0.89), p < 0.001

Random effects (for comparison):

  • Pooled OR = 0.81 (95% CI: 0.72–0.91), p < 0.001 (slightly wider CI reflecting residual heterogeneity)

Interpretation: ACE inhibitors are associated with a 19% reduction in the odds of cardiovascular mortality in heart failure patients (pooled OR 0.81, 95% CI 0.74–0.89, p < 0.001). Heterogeneity across studies was low (I² = 18%), supporting the consistency of this effect. With pooled event rates of 14.6% (control) vs 12.3% (ACE-I) across the tabulated trials, this translates to an NNT of approximately 44 over the average trial duration to prevent one cardiovascular death.

17.4 Assessing Heterogeneity

Cochran’s Q test: Tests the null hypothesis that all studies estimate the same true effect. Underpowered with few studies; significant Q indicates heterogeneity.

I² statistic: Proportion of total variation attributable to between-study differences (not sampling error).

  • I² = 0–25%: Low/negligible
  • I² = 26–50%: Moderate
  • I² = 51–75%: Substantial
  • I² > 75%: Considerable
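I² can be computed directly from Cochran’s Q: I² = max(0, (Q − df)/Q), where df = k − 1 for k studies. Using the Q reported in the worked example above:

```python
def i_squared(q, k):
    """I-squared from Cochran's Q and the number of studies k (df = k - 1)."""
    df = k - 1
    return max(0.0, (q - df) / q)   # floored at 0 when Q < df

# From the ACE-inhibitor meta-analysis: Q = 6.1 across 6 studies
print(f"{i_squared(6.1, 6):.0%}")  # 18%
```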

Sources of heterogeneity — investigate with:

  • Subgroup analysis: Does the effect differ by study population, intervention intensity, follow-up duration?
  • Meta-regression: Regress the effect size on study-level moderators (e.g., mean age, % female, baseline risk)

17.5 Publication Bias

The problem: Studies showing significant results are more likely to be published than those showing null results. This means a meta-analysis based on published literature may overestimate the true effect.

Detection:

  • Funnel plot: Plot each study’s effect size against a measure of its precision (typically the standard error, on an inverted axis). Under no bias, the plot is symmetric — smaller studies scatter more widely around the pooled estimate. Asymmetry suggests bias.
  • Egger’s test: Formal regression test for funnel plot asymmetry. p < 0.05 suggests asymmetry (possible publication bias).
  • Trim-and-fill method: Imputes missing studies to restore funnel symmetry and re-estimates the pooled effect. Shows how sensitive the main result is to potential publication bias.

18. Reporting Standards and Checklists

18.1 General Reporting Principles

  1. Always report the test used, the test statistic, degrees of freedom, and exact p-value. Not just “p<0.05” or “NS” — write “t(48) = 3.14, p = 0.003” or “χ²(2) = 8.74, p = 0.013.”

  2. Report effect sizes with confidence intervals for all primary outcomes. P-values alone are insufficient.

  3. Report sample sizes at every step. If 200 enrolled, 180 analysed — state what happened to the other 20 and conduct a sensitivity analysis if possible.

  4. For non-parametric tests, report median (IQR), not mean (SD).

  5. Report model fit statistics for regression models — R²/adjusted R² for linear regression; Hosmer-Lemeshow goodness of fit, AUC/C-statistic for logistic regression; overall model χ² and −2 log-likelihood.

  6. Check and report assumption testing — normality (Shapiro-Wilk), homogeneity of variance (Levene’s), sphericity (Mauchly’s), PH assumption (Cox).

  7. Distinguish pre-specified from exploratory analyses. Post-hoc subgroup analyses should be clearly labelled as exploratory and interpreted with caution.

18.2 Reporting Checklists

| Study type | Checklist |
|---|---|
| RCT | CONSORT (www.consort-statement.org) |
| Observational cohort or case-control | STROBE (www.strobe-statement.org) |
| Diagnostic accuracy study | STARD (www.equator-network.org/reporting-guidelines/stard) |
| Systematic review / meta-analysis | PRISMA (www.prisma-statement.org) |
| Prognostic model development | TRIPOD (www.tripod-statement.org) |
| Tumour marker prognostic study | REMARK |

18.3 Specimen Results Sections

Randomised Trial (t-test result):

“The primary outcome, change in HbA1c from baseline to 6 months, was significantly greater in the intervention group compared to control (−0.82% vs −0.31%; mean difference −0.51%, 95% CI −0.78 to −0.24%; independent samples t-test: t(178) = −3.74, p < 0.001).”

Survival analysis result:

“Median progression-free survival was 11.2 months (95% CI 8.6–13.8) in the experimental arm and 7.4 months (95% CI 5.9–8.9) in the control arm. The experimental treatment was associated with a 38% reduction in the hazard of progression or death (HR 0.62, 95% CI 0.48–0.80; log-rank p < 0.001).”

Logistic regression result:

“On multivariable logistic regression analysis, prior hospitalisation in the previous year (OR 2.84, 95% CI 1.63–4.95, p < 0.001) and home oxygen use (OR 1.93, 95% CI 1.09–3.42, p = 0.025) were independently associated with 30-day readmission after adjustment for age, FEV₁%, and eosinophil count. The model demonstrated acceptable discrimination (C-statistic 0.72) and good calibration (Hosmer-Lemeshow p = 0.64).”


Appendix: Quick Reference Tables

A1. Choosing the Right Test — Complete Reference

| Research question | Outcome type | Predictor type | Groups/samples | Test |
|---|---|---|---|---|
| Is mean different from reference? | Continuous | None | 1 group | One-sample t-test (parametric) / Wilcoxon (non-parametric) |
| Is proportion different from reference? | Binary | None | 1 group | One-proportion z-test |
| Does categorical distribution match expected? | Categorical | None | 1 group | Chi-square goodness of fit |
| Are two independent group means different? | Continuous | Binary | 2 independent | Student’s t / Welch’s t / Mann-Whitney U |
| Are two paired measurements different? | Continuous | Time (2 points) | 2 paired | Paired t-test / Wilcoxon signed-rank |
| Are two paired binary proportions different? | Binary | Time (2 points) | 2 paired | McNemar’s test |
| Are 3+ independent group means different? | Continuous | Categorical | 3+ independent | One-way ANOVA / Welch’s ANOVA / Kruskal-Wallis |
| Are 3+ repeated measures different? | Continuous | Time (3+ points) | 3+ paired | Repeated measures ANOVA / Friedman |
| Are 3+ paired binary proportions different? | Binary | Time (3+ points) | 3+ paired | Cochran’s Q |
| Is there a categorical association? | Categorical | Categorical | Independent | Chi-square / Fisher’s exact |
| Is there a linear association? | Continuous | Continuous | — | Pearson r / Spearman ρ |
| Predict continuous outcome from 1+ predictors | Continuous | Mixed | — | Linear regression |
| Predict binary outcome from 1+ predictors | Binary | Mixed | — | Logistic regression |
| Predict time-to-event from 1+ predictors | Time-to-event | Mixed | — | Cox proportional hazards |
| Predict count outcome from 1+ predictors | Count | Mixed | — | Poisson regression |
| Compare survival curves between groups | Time-to-event | Categorical | 2+ independent | Kaplan-Meier + log-rank |
| Multiple continuous outcomes simultaneously | Continuous | Categorical | 2+ groups | MANOVA |
| Reduce many correlated variables | Continuous | None | — | PCA / Factor analysis |
| Repeated measures with missing data | Continuous | Mixed | 3+ time points | Linear mixed effects model |
| Agreement between two raters (categorical) | Categorical | — | 2 raters | Cohen’s kappa |
| Agreement between two continuous methods | Continuous | — | 2 methods | Bland-Altman analysis |
| Diagnostic test evaluation | Binary | Continuous/ordinal | — | ROC analysis, sensitivity/specificity |

A2. Non-Parametric Equivalents

| Parametric test | Non-parametric equivalent | Use when |
|---|---|---|
| One-sample t-test | One-sample Wilcoxon | Non-normal data, n < 30 |
| Independent t-test | Mann-Whitney U | Non-normal, ordinal, n < 30 |
| Paired t-test | Wilcoxon signed-rank | Non-normal differences, ordinal |
| One-way ANOVA | Kruskal-Wallis | Non-normal groups, ordinal outcome |
| Repeated measures ANOVA | Friedman test | Non-normal, ordinal, repeated |
| Pearson correlation | Spearman correlation | Non-normal, ordinal, outliers |
| MANOVA | — (robust MANOVA) | Non-normal multivariate data |

A3. Effect Size Reference

| Measure | Formula | Interpretation |
|---|---|---|
| Cohen’s d | (μ₁−μ₂)/pooled SD | 0.2 = small, 0.5 = medium, 0.8 = large |
| Odds ratio | ad/bc | 1 = no effect; >1 increased odds; <1 decreased odds |
| Relative risk | [a/(a+b)] / [c/(c+d)] | 1 = no effect; >1 increased risk |
| NNT | 1/ARR | Lower = more effective |
| Hazard ratio | e^β (Cox) | 1 = no effect; same interpretation as RR |
| r (Pearson) | — | 0.1 = small, 0.3 = medium, 0.5 = large |
| R² | SS_model/SS_total | % variance explained |
| η² (eta-squared) | SS_between/SS_total | 0.01 = small, 0.06 = medium, 0.14 = large |
| AUC/C-statistic | Area under ROC | 0.5 = chance; 0.7–0.8 = acceptable; >0.8 = excellent |
| Kappa (κ) | (Po−Pe)/(1−Pe) | 0.4–0.6 = moderate; 0.6–0.8 = substantial; >0.8 = almost perfect |

A4. P-Value Thresholds in Context

| Scenario | Recommended α | Rationale |
|---|---|---|
| Primary outcome, single test | 0.05 | Standard |
| Secondary outcomes (multiple) | 0.05/k (Bonferroni) | Multiple comparisons |
| Post-hoc pairwise comparisons | Tukey HSD or Bonferroni | Familywise error control |
| Exploratory analysis | 0.05, clearly labelled | Hypothesis-generating only |
| Genome-wide association study | 5×10⁻⁸ | Millions of comparisons |
| Equivalence / non-inferiority | 0.025 (one-sided) | Specific trial design |

A5. Sample Size Formulae

| Design | Formula | Notes |
|---|---|---|
| Two independent means | n = 2(z_α/2 + z_β)²σ²/δ² per group | σ = SD, δ = minimum detectable difference |
| Two proportions | n = (z_α/2 + z_β)² [p₁(1−p₁)+p₂(1−p₂)] / (p₁−p₂)² per group | p₁, p₂ = expected proportions |
| Paired design | n = (z_α/2 + z_β)²σ_d²/δ² | σ_d = SD of differences |
| One proportion vs reference | n = z_α/2² p₀(1−p₀)/E² | E = acceptable margin of error |

z values: z_0.025 = 1.96 (α=0.05 two-tailed), z_0.2 = 0.84 (power=80%), z_0.1 = 1.28 (power=90%)
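The two-means formula can be evaluated directly. For example, detecting a 5 mmHg difference in SBP with an assumed SD of 10 mmHg at α = 0.05 and 80% power (hypothetical planning values):

```python
import math

def n_two_means(sigma, delta, z_alpha=1.96, z_beta=0.84):
    """Per-group sample size for comparing two independent means
    (defaults: alpha = 0.05 two-tailed, power = 80%)."""
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)

# Detect a 5 mmHg SBP difference, assumed SD = 10 mmHg
print(n_two_means(sigma=10, delta=5))  # 63 per group
```

For 90% power, substitute z_beta = 1.28.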


Key References and Further Reading

  • Altman DG. Practical Statistics for Medical Research. Chapman & Hall, 1991.
  • Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986;327:307-310.
  • Cox DR. Regression models and life tables. J Royal Stat Soc B. 1972;34:187-220.
  • DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves. Biometrics. 1988;44:837-845.
  • Harrell FE. Regression Modeling Strategies. Springer, 2015.
  • Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. J Am Stat Assoc. 1958;53:457-481.
  • Rothman KJ. No adjustments are needed for multiple comparisons. Epidemiology. 1990;1:43-46.
  • Steyerberg EW. Clinical Prediction Models. Springer, 2009.
  • Vittinghoff E, et al. Regression Methods in Biostatistics. Springer, 2012.
  • Zhang J, Yu KF. What’s the relative risk? A method of correcting the odds ratio in cohort studies of common outcomes. JAMA. 1998;280:1690-1691.

This guide is intended as a methodological reference for applied clinical research. Statistical analysis should always be conducted in consultation with a qualified statistician for complex or novel study designs. Software implementations: R (free, recommended), Stata, SPSS, SAS.

Version 1.0 | Prepared for clinical researchers | Field: Medical / clinical research
