A Complete Guide to Statistical Tests and Methods for Clinical Researchers
Audience: Researchers and clinicians applying statistical methods to medical and health data
Purpose: A thorough reference covering test selection, assumptions, worked examples, interpretation, and reporting — from foundational hypothesis tests through advanced methods including survival analysis, multivariable modelling, and meta-analysis
How to use this guide: Each section follows a consistent structure: What it is → When to use it → Assumptions → Step-by-step workflow → Worked example → Interpretation → Reporting → Common mistakes
Table of Contents
- Foundations: The Statistical Reasoning Framework
- Choosing the Right Test
- Descriptive Statistics and Data Exploration
- One-Variable Tests
- Comparing Two Groups
- Comparing Three or More Groups
- Correlation and Association
- Regression Analysis
- Effect Sizes and Association Measures
- Survival and Time-to-Event Analysis
- Multivariable Modelling Strategy
- Multivariate Methods
- Mixed Models and Longitudinal Data
- Diagnostic Test Evaluation
- Agreement and Reliability
- Bayesian Methods
- Meta-Analysis and Systematic Review
- Reporting Standards and Checklists
- Appendix: Quick Reference Tables
1. Foundations: The Statistical Reasoning Framework
1.1 What Is a Statistical Test?
A statistical test is a formal procedure for deciding whether observed data are consistent with a stated hypothesis. The process has four components:
- Null hypothesis (H₀): The assumption of no effect, no difference, or no association
- Alternative hypothesis (H₁): The effect or difference you are trying to detect
- Test statistic: A number calculated from your data that summarises the evidence against H₀
- P-value: The probability of observing a test statistic at least as extreme as yours, if H₀ were true
What a p-value is NOT: A p-value is not the probability that H₀ is true. It is not the probability that your result is due to chance. These are the two most common misinterpretations in the medical literature.
1.2 Type I and Type II Errors
| | H₀ is actually TRUE | H₀ is actually FALSE |
|---|---|---|
| Test says: reject H₀ | Type I error (false positive) — probability = α | Correct (true positive) — probability = Power (1−β) |
| Test says: fail to reject H₀ | Correct (true negative) — probability = 1−α | Type II error (false negative) — probability = β |
- α (significance level): Conventionally set at 0.05. If α = 0.05, you accept a 5% chance of a false positive when the null hypothesis is true.
- β (Type II error rate): Conventionally ≤0.20, meaning power ≥ 80%.
- Power: The probability of correctly detecting a true effect. Affected by sample size, effect size, and α.
Clinical implication: In a drug trial, a Type I error means declaring an ineffective drug effective (false positive). A Type II error means missing a truly effective drug (false negative). Both have real patient consequences.
1.3 One-Tailed vs Two-Tailed Tests
- Two-tailed: Tests for a difference in either direction (H₁: μ₁ ≠ μ₂). Default for most clinical research.
- One-tailed: Tests for a difference in a specific direction (H₁: μ₁ > μ₂). Use only when you have strong prior justification and would not act on a result in the other direction. One-tailed tests are often viewed with suspicion by reviewers if not pre-specified.
1.4 Confidence Intervals vs P-Values
Confidence intervals (CIs) convey more information than p-values alone:
- A 95% CI represents the range of values consistent with your data at the 5% significance level
- If the 95% CI for a difference excludes zero (or for a ratio excludes 1.0), the result is statistically significant at α = 0.05
- CIs communicate both statistical significance AND the magnitude and precision of the estimate
- Report both — modern journals increasingly require CIs alongside p-values
Example: A new antihypertensive reduces SBP by 8 mmHg (95% CI: 6 to 10 mmHg, p < 0.001). The CI tells you the reduction is clinically meaningful and precisely estimated. Compare this to: 8 mmHg (95% CI: 0.1 to 16 mmHg, p = 0.048) — statistically significant but very imprecise.
1.5 Sample Size and Power Calculations
Always perform a power calculation before collecting data. The four inputs are:
- α — significance level (typically 0.05)
- Power (1−β) — typically 0.80 or 0.90
- Effect size — the minimum clinically important difference (MCID) you want to detect
- Variability — standard deviation (from pilot data or literature)
These four quantities are mathematically linked — specify three to solve for the fourth. Most commonly, you solve for n (sample size).
Example: You want to detect a 10 mmHg difference in SBP between two drug groups. From previous studies, SD ≈ 20 mmHg. With α = 0.05 (two-tailed) and power = 80%:
n per group = 2 × (z_α/2 + z_β)² × σ² / δ²
= 2 × (1.96 + 0.84)² × 400 / 100
= 2 × 7.84 × 4
= 63 per group
You need approximately 63 patients per arm, so ~126 total. Always add 10–20% for expected dropout.
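The arithmetic above can be reproduced in a few lines, assuming SciPy is available for the normal quantiles (a sketch to check hand calculations, not a substitute for dedicated power software):

```python
import math
from scipy.stats import norm

# Inputs from the worked example
alpha = 0.05   # two-tailed significance level
power = 0.80   # 1 - beta
sigma = 20.0   # SD of SBP (mmHg), from prior studies
delta = 10.0   # minimum clinically important difference (mmHg)

z_alpha = norm.ppf(1 - alpha / 2)  # ~1.96
z_beta = norm.ppf(power)           # ~0.84

# n per group = 2 * (z_alpha + z_beta)^2 * sigma^2 / delta^2
n_per_group = 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2
n_rounded = math.ceil(n_per_group)  # always round up
```

With these inputs `n_per_group` comes out just under 63, so 63 per arm before inflating for dropout.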
2. Choosing the Right Test: A Decision Framework
The Five Key Questions
Before selecting any statistical test, answer these questions in order:
Q1. What is your research question?
- Describe a population → Descriptive statistics
- Test a hypothesis about one group → One-sample tests
- Compare groups → Between-group tests
- Examine relationships → Correlation / regression
- Predict an outcome → Regression modelling
Q2. How many variables are involved?
- 1 variable → One-sample or descriptive
- 2 variables → Bivariate tests (correlation, two-group comparison)
- 2+ variables with one outcome → Multivariable regression
- 2+ outcomes simultaneously → Multivariate methods
Q3. What type is each variable?
- Continuous: Measured on a scale (BP, weight, age, biomarker levels)
- Ordinal: Ordered categories (pain scale 1–10, NYHA class I–IV)
- Nominal/categorical: Unordered categories (blood type, treatment group, sex)
- Binary: Special case of nominal with exactly two categories (alive/dead, yes/no)
- Time-to-event: Combined measure of whether and when an event occurred
Q4. Are the samples independent or paired/related?
- Independent: Different subjects in each group (RCT treatment arms, case-control study)
- Paired/related: Same subjects measured twice, or matched subjects (crossover trial, matched case-control)
Q5. Are parametric assumptions met?
- Parametric tests assume approximately normal distribution (or large enough n for CLT to apply), continuous data, and homogeneity of variance where applicable
- Non-parametric tests make fewer distributional assumptions — use for small samples (<30), skewed distributions, ordinal data, or data with outliers
Decision Table
| Outcome variable | Predictor/groups | Sample type | Test |
|---|---|---|---|
| Continuous | None (1 group vs known value) | — | One-sample t-test or Wilcoxon |
| Continuous | 2 groups | Independent | Student’s t or Welch’s t / Mann-Whitney U |
| Continuous | 2 groups | Paired | Paired t-test / Wilcoxon signed-rank |
| Continuous | 3+ groups | Independent | One-way ANOVA / Kruskal-Wallis |
| Continuous | 3+ groups | Repeated | Repeated-measures ANOVA / Friedman |
| Continuous | Continuous predictor(s) | — | Linear regression |
| Binary | 2+ groups | Independent | Chi-square / Fisher’s exact |
| Binary | 2 groups | Paired | McNemar’s test |
| Binary | Continuous/mixed predictors | — | Logistic regression |
| Time-to-event | 2+ groups | Independent | Kaplan-Meier + log-rank |
| Time-to-event | Continuous/mixed predictors | — | Cox regression |
| Count data | Groups | — | Poisson / negative binomial regression |
| Ordinal | 2+ groups | Independent | Mann-Whitney / Kruskal-Wallis |
| Multiple continuous outcomes | Groups | — | MANOVA |
3. Descriptive Statistics and Data Exploration
3.1 Measures of Central Tendency
Mean: Sum of all values divided by n. Best for normally distributed continuous data.
Median: The middle value when data are sorted. Preferred for skewed data or ordinal scales. Robust to outliers.
Mode: Most frequently occurring value. Rarely used in clinical research except for nominal data.
When to use which:
- Normally distributed continuous data → Mean (± SD)
- Skewed continuous data → Median (IQR)
- Ordinal scales (e.g. pain scores) → Median (IQR)
- Nominal data → Frequency and percentage
3.2 Measures of Spread
Standard deviation (SD): Average distance of data points from the mean. Use with mean for symmetric data.
Interquartile range (IQR): Difference between 75th and 25th percentiles. Use with median for skewed data.
Range: Min to max. Useful supplementary information but sensitive to outliers.
Standard error of the mean (SEM): SD / √n. Describes precision of the mean estimate, NOT the spread of individual values. Do not use SEM as a measure of variability in a study population — this is a common and misleading error in clinical publications.
3.3 Assessing Normality
Before choosing parametric vs non-parametric tests, assess distributional assumptions:
Visual methods (preferred):
- Histogram: Look for symmetric bell shape
- Q-Q plot (quantile-quantile plot): Points should fall along the diagonal line if data are normally distributed
- Box plot: Check for symmetry and outliers
Formal tests:
- Shapiro-Wilk test: Best for small samples (n < 50). H₀: data are normally distributed. A p-value > 0.05 is consistent with normality (note: does not prove normality).
- Kolmogorov-Smirnov test: Better for larger samples.
Practical rule: With n > 30, the central limit theorem (CLT) ensures that the sampling distribution of the mean is approximately normal even if individual data are skewed. Parametric tests are generally robust in this case. For n < 30 with visibly skewed data, use non-parametric alternatives.
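As a quick sketch of this workflow on simulated data (purely illustrative), the Shapiro-Wilk test flags a right-skewed sample while returning a much larger p-value for a symmetric one:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=120, scale=15, size=100)  # symmetric data
skewed_sample = rng.exponential(scale=2.0, size=100)     # right-skewed data

# Shapiro-Wilk: H0 = data are normally distributed
w_norm, p_norm = stats.shapiro(normal_sample)
w_skew, p_skew = stats.shapiro(skewed_sample)
# A small p (< 0.05) flags departure from normality; a large p is merely
# *consistent* with normality, not proof of it. Always plot the data too.
```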
3.4 Worked Example: Describing a Study Population
Scenario: A clinical trial of a new statin enrols 120 patients. At baseline, data are collected on age, sex, BMI, LDL-cholesterol, and NYHA heart failure class (I–IV).
Appropriate summary statistics:
| Variable | Type | Summary |
|---|---|---|
| Age (years) | Continuous, approximately normal | Mean ± SD: 62.4 ± 11.2 |
| Sex (% male) | Binary | 68 (56.7%) |
| BMI (kg/m²) | Continuous, slightly right-skewed | Median (IQR): 27.8 (24.6–31.9) |
| LDL-C (mmol/L) | Continuous, right-skewed | Median (IQR): 3.4 (2.8–4.1) |
| NYHA class | Ordinal | Class I: 22 (18.3%), Class II: 58 (48.3%), Class III: 32 (26.7%), Class IV: 8 (6.7%) |
Reporting note: In a Table 1 (baseline characteristics), use the format: n (%) for categorical variables; mean ± SD for normally distributed continuous variables; median (IQR) for skewed or ordinal variables.
4. One-Variable Tests
4.1 One-Sample Student’s t-Test
What it does: Tests whether the mean of a single sample differs significantly from a known or hypothesised population value (μ₀).
When to use:
- One continuous variable
- Data are approximately normally distributed (or n ≥ 30)
- You want to compare your sample mean to a reference value
Assumptions:
- Continuous data
- Approximate normality or n ≥ 30
- Observations are independent
Test statistic:
t = (x̄ − μ₀) / (s / √n)
where x̄ = sample mean, μ₀ = hypothesised mean, s = sample SD, n = sample size. Follows a t-distribution with n−1 degrees of freedom.
Worked Example:
Research question: A cardiology unit wants to know whether the mean INR of their anticoagulated patients differs from the therapeutic target of 2.5.
Data: n = 25 patients, mean INR = 2.8, SD = 0.6
t = (2.8 − 2.5) / (0.6 / √25) = 0.3 / 0.12 = 2.50
df = 24
p-value = 0.020 (two-tailed)
95% CI for difference: 0.05 to 0.55
Interpretation: The mean INR (2.8) is significantly above the target of 2.5 (t(24) = 2.50, p = 0.020). The 95% CI for the mean (2.55 to 3.05) excludes 2.5, confirming statistical significance. The unit may be over-anticoagulating their patients on average.
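The one-sample t-test can be computed from the summary statistics alone; a minimal sketch assuming SciPy:

```python
import math
from scipy import stats

# INR example figures
n, xbar, s, mu0 = 25, 2.8, 0.6, 2.5
se = s / math.sqrt(n)                  # standard error = 0.12
t = (xbar - mu0) / se                  # 2.50
p = 2 * stats.t.sf(abs(t), df=n - 1)   # two-tailed p-value
tcrit = stats.t.ppf(0.975, df=n - 1)   # critical t for a 95% CI
ci = (xbar - tcrit * se, xbar + tcrit * se)  # CI for the mean
```

(`scipy.stats.ttest_1samp` does the same from raw data.)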
4.2 One-Sample Wilcoxon Signed-Rank Test
What it does: Non-parametric equivalent of the one-sample t-test. Tests whether the median of a sample differs from a hypothesised value.
When to use:
- One continuous or ordinal variable
- Data are skewed, ordinal, or n < 30 with non-normal distribution
- You want to compare your sample median to a reference value
Worked Example:
Research question: A pain clinic wants to know whether their patients’ median pain score (NRS 0–10) differs from the population median of 5.
Data: n = 18 patients with chronic back pain, median NRS = 7 (IQR: 5–9). Shapiro-Wilk p = 0.003 — data are significantly non-normal.
Procedure: Calculate the difference between each patient’s score and 5. Rank the absolute differences. Sum the positive and negative ranks separately. Use the Wilcoxon W statistic.
Result: W = 142, p = 0.008
Interpretation: Patients’ median pain score (7) is significantly higher than the reference value of 5 (Wilcoxon W = 142, p = 0.008), indicating this population has worse pain than the general reference population.
4.3 One-Proportion Test (Z-test for proportion)
What it does: Tests whether an observed proportion differs from a known or hypothesised population proportion.
When to use:
- One binary/nominal variable
- You want to compare your proportion to a reference value
- np ≥ 5 and n(1−p) ≥ 5 (otherwise use exact binomial test)
Worked Example:
Research question: The national readmission rate following elective hip replacement is 4%. A tertiary centre reviews 250 of their own procedures and finds 15 readmissions. Is their rate significantly different?
H₀: p = 0.04 (their rate equals the national rate)
p̂ = 15/250 = 0.060
z = (p̂ − p₀) / √(p₀(1−p₀)/n)
= (0.060 − 0.040) / √(0.04 × 0.96 / 250)
= 0.020 / 0.01239
= 1.61
p-value = 0.107 (two-tailed)
95% CI for proportion: 0.033 to 0.097
Interpretation: The observed readmission rate (6.0%) is numerically higher than the national rate (4.0%), but this difference is not statistically significant (z = 1.61, p = 0.107). The 95% CI (3.3% to 9.7%) includes 4%, consistent with this conclusion. The study may be underpowered to detect a difference of this magnitude — a power calculation would be warranted.
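A sketch of the z-test computation (SciPy assumed; the CI quoted above comes from the binomial distribution and is not reproduced here):

```python
import math
from scipy.stats import norm

x, n, p0 = 15, 250, 0.04
p_hat = x / n                        # 0.060
se0 = math.sqrt(p0 * (1 - p0) / n)   # SE under H0 (uses p0, not p_hat)
z = (p_hat - p0) / se0               # ~1.61
p_value = 2 * norm.sf(abs(z))        # two-tailed ~0.107
```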
4.4 Chi-Square Goodness-of-Fit Test
What it does: Tests whether the observed distribution of a categorical variable matches an expected (theoretical) distribution.
When to use:
- One categorical variable with two or more categories
- You have hypothesised expected frequencies for each category
- Expected frequency in each cell ≥ 5
Worked Example:
Research question: ABO blood group distribution in the general UK population is approximately: A=42%, B=10%, AB=4%, O=44%. In a sample of 200 cardiac surgery patients, you observe: A=96 (48%), B=16 (8%), AB=6 (3%), O=82 (41%). Is the distribution of blood types in cardiac patients different from the general population?
Expected counts (E = n × p): A=84, B=20, AB=8, O=88
χ² = Σ (O−E)²/E
= (96−84)²/84 + (16−20)²/20 + (6−8)²/8 + (82−88)²/88
= 1.714 + 0.800 + 0.500 + 0.409
= 3.423
df = 4−1 = 3
p-value = 0.331
Interpretation: The blood type distribution among cardiac surgery patients does not differ significantly from the general population (χ²(3) = 3.42, p = 0.331).
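`scipy.stats.chisquare` performs this goodness-of-fit test directly from the observed counts and expected frequencies:

```python
from scipy.stats import chisquare

observed = [96, 16, 6, 82]                    # A, B, AB, O
population_props = [0.42, 0.10, 0.04, 0.44]   # reference distribution
n = sum(observed)
expected = [n * p for p in population_props]  # [84, 20, 8, 88]

chi2, p = chisquare(observed, f_exp=expected)
# df = (number of categories) - 1 = 3
```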
5. Comparing Two Groups
5.1 Independent Samples t-Test (Student’s t-Test)
What it does: Compares the means of two independent groups.
When to use:
- Continuous outcome variable
- Two independent groups (different subjects in each group)
- Approximately normally distributed data in both groups, or n ≥ 30 per group
- Equal population variances (if not, use Welch’s t-test)
Checking equal variances: Use Levene’s test. If p > 0.05, assume equal variances (Student’s). If p ≤ 0.05, assume unequal variances (Welch’s). In practice, Welch’s t-test is robust and increasingly recommended as the default.
Test statistic (equal variances):
t = (x̄₁ − x̄₂) / (sp × √(1/n₁ + 1/n₂))
where sp = pooled SD = √[((n₁−1)s₁² + (n₂−1)s₂²) / (n₁+n₂−2)]
df = n₁ + n₂ − 2
Worked Example:
Research question: A randomised controlled trial compares a new ACE inhibitor (Group A, n=45) to placebo (Group B, n=45) on 24-hour systolic blood pressure (SBP) reduction after 8 weeks.
| | Group A (ACE inhibitor) | Group B (Placebo) |
|---|---|---|
| n | 45 | 45 |
| Mean SBP reduction (mmHg) | 12.4 | 5.8 |
| SD | 8.2 | 7.6 |
Levene’s test: p = 0.62 → assume equal variances
sp = √[((44 × 8.2²) + (44 × 7.6²)) / 88]
= √[(2958.56 + 2541.44) / 88]
= √[62.50]
= 7.906
t = (12.4 − 5.8) / (7.906 × √(1/45 + 1/45))
= 6.6 / (7.906 × 0.2108)
= 6.6 / 1.667
= 3.96
df = 88
p-value < 0.001
95% CI for difference: 3.29 to 9.91 mmHg
Interpretation: The ACE inhibitor produced a significantly greater reduction in SBP compared to placebo (mean difference 6.6 mmHg, 95% CI 3.3 to 9.9 mmHg; t(88) = 3.96, p < 0.001). The CI is entirely above zero, confirming the ACE inhibitor is superior.
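`scipy.stats.ttest_ind_from_stats` reproduces the pooled-variance t-test directly from summary statistics, which is convenient when raw data are unavailable (e.g. when checking a published result):

```python
from scipy.stats import ttest_ind_from_stats

# Summary statistics from the RCT example
t, p = ttest_ind_from_stats(
    mean1=12.4, std1=8.2, nobs1=45,  # ACE inhibitor arm
    mean2=5.8, std2=7.6, nobs2=45,   # placebo arm
    equal_var=True,                  # Student's (pooled-variance) t-test
)
```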
Reporting template: “The ACE inhibitor group showed a significantly greater reduction in 24-hour SBP compared to placebo (12.4 ± 8.2 vs 5.8 ± 7.6 mmHg; mean difference 6.6 mmHg, 95% CI 3.3 to 9.9 mmHg; p < 0.001).”
5.2 Welch’s t-Test (Unequal Variances)
What it does: Like Student’s t-test but does not assume equal population variances. The degrees of freedom are adjusted (Welch-Satterthwaite correction), resulting in a non-integer df.
When to use: Whenever Levene’s test is significant (p ≤ 0.05), or as a default (Welch’s is generally safer and loses little power when variances are actually equal).
Worked Example:
Research question: Comparing CRP levels (mg/L) between patients with confirmed bacterial infection (n=30) and viral infection (n=28).
| | Bacterial | Viral |
|---|---|---|
| Mean CRP | 118.4 | 22.6 |
| SD | 94.2 | 18.7 |
Levene’s test: p = 0.001 → unequal variances → use Welch’s
t = (118.4 − 22.6) / √(94.2²/30 + 18.7²/28)
= 95.8 / √(295.79 + 12.49)
= 95.8 / √308.28
= 95.8 / 17.56
= 5.46
df (Welch-Satterthwaite) ≈ 31.4 (non-integer)
p < 0.001
95% CI: 60.7 to 130.9 mg/L
Interpretation: CRP was substantially and significantly higher in bacterial compared to viral infections (118.4 vs 22.6 mg/L; mean difference 95.8 mg/L, 95% CI 60.7 to 130.9; Welch’s t = 5.46, p < 0.001). The large standard deviations and Levene’s test result confirm the appropriateness of Welch’s t-test here.
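The same summary-statistics helper handles Welch’s test by setting `equal_var=False`:

```python
from scipy.stats import ttest_ind_from_stats

# CRP summary statistics: bacterial vs viral infection
t, p = ttest_ind_from_stats(
    mean1=118.4, std1=94.2, nobs1=30,
    mean2=22.6, std2=18.7, nobs2=28,
    equal_var=False,  # Welch's t-test: no equal-variance assumption
)
# SciPy applies the Welch-Satterthwaite df correction internally
```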
5.3 Mann-Whitney U Test
What it does: Non-parametric test comparing the distributions of two independent groups. Tests whether one group tends to have higher values than the other.
When to use:
- Continuous or ordinal outcome
- Two independent groups
- Data are skewed, ordinal, or n < 30 with non-normal distribution
- Particularly appropriate for outcomes like pain scores, quality of life measures, biomarkers with skewed distributions
What it actually tests: The Mann-Whitney U test does not strictly test equality of medians (a common misconception). It tests whether one group’s values tend to be larger than the other’s — formally, P(X > Y) = 0.5. The test is equivalent to asking: “If I randomly picked one observation from each group, is there an equal probability of either being larger?”
Worked Example:
Research question: A palliative care study compares quality of life scores (EORTC QLQ-C30 global scale, 0–100) between patients receiving standard care (n=22) and those receiving a new integrated support programme (n=24) at 3 months. The data are negatively skewed.
| | Standard care | Integrated programme |
|---|---|---|
| n | 22 | 24 |
| Median (IQR) | 58 (42–70) | 72 (62–82) |
| Shapiro-Wilk p | 0.031 | 0.028 |
Both groups fail the normality test → use Mann-Whitney U
Result: U = 161.5, p = 0.014
Interpretation: Quality of life scores were significantly higher in the integrated support programme group compared to standard care (median 72 vs 58; Mann-Whitney U = 161.5, p = 0.014).
Reporting template: “Global quality of life was significantly better in patients receiving the integrated support programme compared to standard care (median 72 [IQR 62–82] vs 58 [IQR 42–70]; Mann-Whitney U = 161.5, p = 0.014).”
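With raw data, `scipy.stats.mannwhitneyu` performs the test; the scores below are small hypothetical samples for illustration only, not the trial data above:

```python
from scipy.stats import mannwhitneyu

# Hypothetical QoL scores (0-100) for two small illustrative groups
standard = [40, 42, 50, 55, 58, 60, 65, 70]
integrated = [58, 62, 66, 70, 74, 78, 82, 85]

u, p = mannwhitneyu(standard, integrated, alternative="two-sided")
# u is the U statistic for the first sample; a small U relative to
# n1*n2/2 (= 32 here) means the first group tends to score lower
```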
5.4 Paired Samples t-Test
What it does: Compares means from the same subjects measured at two time points or under two conditions. Conceptually, it reduces to a one-sample t-test on the differences.
When to use:
- Same subjects measured twice (before/after design)
- Matched subjects in a 1:1 design
- Approximately normally distributed differences (not necessarily the raw values)
Key advantage over independent t-test: Removes between-subject variability, substantially increasing statistical power.
Test statistic:
t = d̄ / (sd / √n)
where d̄ = mean of (post − pre) differences
sd = SD of differences
df = n − 1
Worked Example:
Research question: A crossover trial tests whether 8 weeks of dietary sodium restriction reduces 24-hour urinary sodium excretion in 20 hypertensive patients. Each patient acts as their own control.
| Patient | Pre (mmol/24h) | Post (mmol/24h) | Difference (Post−Pre) |
|---|---|---|---|
| Mean | 168.4 | 124.6 | −43.8 |
| SD | — | — | 28.4 |
t = −43.8 / (28.4 / √20)
= −43.8 / 6.35
= −6.90
df = 19
p < 0.001
95% CI for mean difference: −57.1 to −30.5 mmol/24h
Interpretation: Sodium restriction significantly reduced 24-hour urinary sodium excretion (mean reduction 43.8 mmol/24h, 95% CI 30.5 to 57.1 mmol/24h; paired t(19) = −6.90, p < 0.001). The CI excludes zero, and the magnitude (43.8 mmol/24h) represents a clinically meaningful reduction.
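As noted above, the paired t-test reduces to a one-sample t-test on the differences; a sketch from the summary statistics (SciPy assumed, `scipy.stats.ttest_rel` does the same from raw paired data):

```python
import math
from scipy import stats

# Differences (post - pre) from the sodium-restriction crossover trial
n, d_bar, s_d = 20, -43.8, 28.4
se = s_d / math.sqrt(n)               # SE of the mean difference
t = d_bar / se                        # ~ -6.90
p = 2 * stats.t.sf(abs(t), df=n - 1)  # two-tailed
tcrit = stats.t.ppf(0.975, df=n - 1)
ci = (d_bar - tcrit * se, d_bar + tcrit * se)  # 95% CI for mean difference
```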
5.5 Wilcoxon Signed-Rank Test
What it does: Non-parametric equivalent of the paired t-test. Compares two related groups without assuming normality of differences.
When to use:
- Paired or repeated observations
- Differences are not normally distributed
- Ordinal data with paired design
Worked Example:
Research question: A physiotherapy intervention study measures pain scores (NRS 0–10) in 16 patients with knee osteoarthritis before and after 6 weeks of treatment. The differences are not normally distributed (Shapiro-Wilk p = 0.019).
| | Pre-treatment | Post-treatment |
|---|---|---|
| Median (IQR) | 7 (6–9) | 4 (3–6) |
Result: Wilcoxon Z = −3.29, p = 0.001
Interpretation: Pain scores were significantly reduced following physiotherapy (pre-treatment median 7 [IQR 6–9] vs post-treatment median 4 [IQR 3–6]; Wilcoxon signed-rank Z = −3.29, p = 0.001).
5.6 Chi-Square Test of Independence
What it does: Tests whether two categorical variables are associated (i.e., whether the distribution of one variable differs across levels of the other).
When to use:
- Both variables are categorical (nominal or ordinal)
- Independent observations
- Expected frequency in each cell ≥ 5 (if not, use Fisher’s exact test)
Test statistic:
χ² = Σ (O − E)² / E
where E = (row total × column total) / grand total
df = (rows − 1)(columns − 1)
Worked Example:
Research question: Does smoking status (smoker vs non-smoker) differ between patients who develop postoperative pneumonia and those who do not following elective colorectal surgery (n=180)?
| | Pneumonia | No pneumonia | Total |
|---|---|---|---|
| Smoker | 24 | 36 | 60 |
| Non-smoker | 16 | 104 | 120 |
| Total | 40 | 140 | 180 |
Expected counts:
- Smoker/Pneumonia: (60×40)/180 = 13.3
- Smoker/No pneumonia: (60×140)/180 = 46.7
- Non-smoker/Pneumonia: (120×40)/180 = 26.7
- Non-smoker/No pneumonia: (120×140)/180 = 93.3
All expected counts ≥ 5 → chi-square test appropriate
χ² = (24−13.33)²/13.33 + (36−46.67)²/46.67 + (16−26.67)²/26.67 + (104−93.33)²/93.33
= 8.53 + 2.44 + 4.27 + 1.22
= 16.46
(Keep expected counts at full precision when computing — rounding them to one decimal inflates the statistic.)
df = 1
p < 0.001
Interpretation: Smoking was significantly associated with postoperative pneumonia (χ²(1) = 16.46, p < 0.001). Smokers had a substantially higher rate of pneumonia (40.0%) compared to non-smokers (13.3%). The odds ratio is 4.33 (95% CI: 2.04–9.21), indicating smokers had over four times the odds of developing pneumonia.
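`scipy.stats.chi2_contingency` reproduces the test from the 2×2 counts, computing the expected counts at full precision (which gives χ² = 16.46; hand calculations with expected counts rounded to one decimal drift slightly). Note that SciPy applies Yates’ continuity correction to 2×2 tables unless told otherwise:

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[24, 36],    # smokers: pneumonia, no pneumonia
                  [16, 104]])  # non-smokers

# correction=False gives the uncorrected (Pearson) chi-square
chi2, p, dof, expected = chi2_contingency(table, correction=False)

# Sample odds ratio from the 2x2 counts
a, b = table[0]
c, d = table[1]
odds_ratio = (a * d) / (b * c)  # ~4.33
```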
5.7 Fisher’s Exact Test
What it does: Tests the association between two categorical variables when expected cell frequencies are small (less than 5). Calculates the exact probability of the observed (or more extreme) table configuration.
When to use:
- 2×2 contingency table with expected cell frequency < 5 in any cell
- Small sample sizes
- Sparse data (rare outcomes)
Worked Example:
Research question: A small case series examines whether an unusual fungal infection is associated with immunosuppressive therapy. Among 12 patients: 5 received immunosuppressants (4 with infection, 1 without), 7 did not (1 with infection, 6 without).
| | Infection | No infection | Total |
|---|---|---|---|
| Immunosuppressed | 4 | 1 | 5 |
| Not immunosuppressed | 1 | 6 | 7 |
| Total | 5 | 7 | 12 |
Smallest expected cell: (5×5)/12 = 2.08 < 5 → use Fisher’s exact test
Fisher’s exact p = 0.072 (two-tailed); the one-tailed p is 0.045
Interpretation: Despite the large apparent difference in infection rates (4/5 vs 1/7 patients), the association between immunosuppressive therapy and fungal infection did not reach two-tailed significance in this small series (Fisher’s exact p = 0.072). Only the one-tailed p (0.045) crosses 0.05, and a one-tailed test would require pre-specified directional justification. With only 12 patients, these findings should be considered hypothesis-generating at most.
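`scipy.stats.fisher_exact` computes both one- and two-sided p-values; for this table they straddle 0.05, so the (ideally pre-specified) choice of sidedness determines the verdict:

```python
from scipy.stats import fisher_exact

table = [[4, 1],  # immunosuppressed: infection, no infection
         [1, 6]]  # not immunosuppressed

odds_ratio, p_two_sided = fisher_exact(table, alternative="two-sided")
_, p_one_sided = fisher_exact(table, alternative="greater")
# odds_ratio is the sample odds ratio ad/bc = (4*6)/(1*1) = 24
```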
5.8 McNemar’s Test
What it does: Tests whether the proportion of a binary outcome differs between two paired groups (same subjects measured twice, or matched pairs).
When to use:
- Binary outcome (yes/no)
- Paired or matched design (before/after, matched case-control)
Worked Example:
Research question: Before and after a hand-hygiene education campaign, the same 80 clinical staff are observed for compliance (compliant = yes/no). Did compliance rates change?
| | Post: Compliant | Post: Non-compliant | Total |
|---|---|---|---|
| Pre: Compliant | 38 | 12 | 50 |
| Pre: Non-compliant | 22 | 8 | 30 |
| Total | 60 | 20 | 80 |
The key cells are the discordant pairs: b=12 (compliant pre, not post) and c=22 (not compliant pre, compliant post).
McNemar χ² = (b − c)² / (b + c)
= (12 − 22)² / (12 + 22)
= 100 / 34
= 2.94
p = 0.086
(With the continuity correction, χ² = (|b − c| − 1)² / (b + c) = 81/34 = 2.38, p = 0.123.)
Interpretation: There was a non-significant trend toward improved hand hygiene compliance following the education campaign (62.5% compliant pre-intervention vs 75.0% post-intervention; McNemar χ² = 2.94, p = 0.086). The campaign did not produce a statistically significant change in this sample.
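The statistic needs only the discordant counts and the χ² distribution, so it is easy to compute directly (statsmodels also provides a `mcnemar` function); a sketch showing both the uncorrected and continuity-corrected versions:

```python
from scipy.stats import chi2

b, c = 12, 22  # discordant pairs from the hand-hygiene table

# Uncorrected McNemar statistic
stat = (b - c) ** 2 / (b + c)              # 100/34 ~ 2.94
p = chi2.sf(stat, df=1)                    # ~0.086

# With the continuity correction (more conservative)
stat_cc = (abs(b - c) - 1) ** 2 / (b + c)  # 81/34 ~ 2.38
p_cc = chi2.sf(stat_cc, df=1)              # ~0.123
```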
6. Comparing Three or More Groups
6.1 One-Way ANOVA
What it does: Tests whether the means of three or more independent groups differ. The word “one-way” refers to one grouping factor. ANOVA tests the overall (“omnibus”) null hypothesis that ALL group means are equal — it does not tell you which groups differ.
When to use:
- Continuous outcome
- Three or more independent groups
- Approximately normal distribution within each group, or large samples
- Equal variances across groups (if not, use Welch’s ANOVA)
Assumptions:
- Normality within each group
- Homogeneity of variance (Levene’s test)
- Independence of observations
Logic: ANOVA partitions total variability into between-group variability (due to the treatment/grouping) and within-group variability (random noise). The F-statistic is the ratio of these two components.
F = (Between-group variance) / (Within-group variance)
= MSbetween / MSwithin
Where:
SSbetween = Σ nj(x̄j − x̄)² df = k−1
SSwithin = Σ Σ (xij − x̄j)² df = N−k
F ~ F-distribution with df1 = k−1, df2 = N−k
Post-hoc tests: If ANOVA is significant, follow up with pairwise comparisons. Common options:
- Tukey’s HSD: Controls familywise error rate; compares all possible pairs. Good all-purpose choice.
- Bonferroni correction: Divides α by number of comparisons. Conservative.
- Dunnett’s test: Compares each group only to a control group. Use in dose-response studies.
- Scheffé’s test: Most conservative; appropriate for complex contrasts planned after seeing the data.
Worked Example:
Research question: A multicentre RCT compares three doses of a novel anti-nausea drug (low dose, medium dose, high dose) versus placebo on vomiting episodes in 24 hours following chemotherapy (n=200, 50 per group).
| Group | n | Mean episodes | SD |
|---|---|---|---|
| Placebo | 50 | 6.8 | 2.4 |
| Low dose | 50 | 5.1 | 2.1 |
| Medium dose | 50 | 3.4 | 1.8 |
| High dose | 50 | 2.9 | 1.7 |
Grand mean (x̄) = (6.8+5.1+3.4+2.9)/4 = 4.55
SSbetween = 50×(6.8−4.55)² + 50×(5.1−4.55)² + 50×(3.4−4.55)² + 50×(2.9−4.55)²
= 50×(5.0625 + 0.3025 + 1.3225 + 2.7225)
= 50 × 9.41 = 470.5
MSbetween = 470.5 / 3 = 156.8
SSwithin = 49×2.4² + 49×2.1² + 49×1.8² + 49×1.7² = 49×(5.76+4.41+3.24+2.89)
= 49 × 16.30 = 798.7
MSwithin = 798.7 / 196 = 4.075
F = 156.8 / 4.075 = 38.5
p < 0.001
Post-hoc Tukey HSD: All pairwise comparisons are significant (p < 0.05) except Medium dose vs High dose (mean difference 0.5, p = 0.41).
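The F-statistic can be reproduced from the group summaries alone (with raw data, `scipy.stats.f_oneway` does this directly); a sketch mirroring the hand calculation:

```python
import numpy as np
from scipy.stats import f as f_dist

# Group summaries from the anti-nausea trial
ns = np.array([50, 50, 50, 50])
means = np.array([6.8, 5.1, 3.4, 2.9])
sds = np.array([2.4, 2.1, 1.8, 1.7])

k, N = len(ns), ns.sum()
grand_mean = (ns * means).sum() / N                  # 4.55
ss_between = (ns * (means - grand_mean) ** 2).sum()  # 470.5
ss_within = ((ns - 1) * sds ** 2).sum()              # ~798.7
F = (ss_between / (k - 1)) / (ss_within / (N - k))   # ~38.5
p = f_dist.sf(F, k - 1, N - k)
```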
Interpretation: There were significant differences in vomiting episodes across treatment groups (one-way ANOVA: F(3,196) = 38.5, p < 0.001). Post-hoc analysis (Tukey HSD) showed all active doses were superior to placebo (all p < 0.001), and medium dose was superior to low dose (p = 0.003). There was no significant difference between medium and high doses (p = 0.41), suggesting medium dose may provide the optimal therapeutic benefit with a lower adverse event profile.
6.2 Welch’s ANOVA
What it does: An F-test that does not assume equal population variances across groups. More robust than standard ANOVA when variances are heterogeneous.
When to use: When Levene’s test is significant (p < 0.05), indicating unequal variances across groups.
Post-hoc test: Use Games-Howell (does not assume equal variances) rather than Tukey HSD.
6.3 Kruskal-Wallis Test
What it does: Non-parametric alternative to one-way ANOVA. Tests whether three or more independent groups have the same distribution. Like Mann-Whitney U extended to k groups.
When to use:
- Continuous or ordinal outcome
- Three or more independent groups
- Data are skewed or non-normal within groups
- Ordinal outcome (e.g. pain scores, Likert scales)
Post-hoc testing: If Kruskal-Wallis is significant, use Dunn’s test with Bonferroni correction for pairwise comparisons.
Worked Example:
Research question: Three hospitals (A, B, C) are compared on patient-reported pain scores (NRS 0–10) at discharge following total knee replacement.
| Hospital | n | Median (IQR) |
|---|---|---|
| A | 35 | 4 (3–6) |
| B | 38 | 6 (4–8) |
| C | 33 | 5 (3–7) |
Data are ordinal and skewed → Kruskal-Wallis
Result: H(2) = 8.74, p = 0.013
Post-hoc (Dunn’s with Bonferroni):
- A vs B: p = 0.010
- A vs C: p = 0.320
- B vs C: p = 0.182
Interpretation: Discharge pain scores differed significantly across the three hospitals (Kruskal-Wallis H(2) = 8.74, p = 0.013). Post-hoc analysis showed Hospital A had significantly lower pain scores than Hospital B (Dunn’s test, p = 0.010) but not Hospital C (p = 0.320). No significant difference was found between Hospitals B and C (p = 0.182).
6.4 Repeated Measures ANOVA
What it does: Tests for differences in a continuous outcome measured at three or more time points in the same subjects.
When to use:
- Same subjects measured at 3+ time points
- Continuous outcome
- Approximately normally distributed data or adequate sample size
Assumption unique to repeated measures: Sphericity — the variances of the differences between all possible pairs of time points should be equal. Tested with Mauchly’s test. If violated, apply Greenhouse-Geisser or Huynh-Feldt epsilon correction to the degrees of freedom.
Worked Example:
Research question: Serum creatinine (μmol/L) is monitored in 30 patients with CKD at baseline, 3 months, 6 months, and 12 months of treatment.
| Time point | Mean creatinine | SD |
|---|---|---|
| Baseline | 142 | 38 |
| 3 months | 138 | 36 |
| 6 months | 131 | 34 |
| 12 months | 128 | 33 |
Mauchly’s test: p = 0.21 (sphericity not violated)
Result: F(3, 87) = 8.43, p < 0.001, η² = 0.225
Post-hoc (pairwise t-tests with Bonferroni):
- Baseline vs 3 months: p = 0.31 (ns)
- Baseline vs 6 months: p = 0.012
- Baseline vs 12 months: p < 0.001
- 3 months vs 12 months: p = 0.003
Interpretation: Serum creatinine decreased significantly over 12 months (repeated measures ANOVA: F(3,87) = 8.43, p < 0.001, η² = 0.23). Significant reductions from baseline were apparent at 6 months (−11 μmol/L, p = 0.012) and 12 months (−14 μmol/L, p < 0.001).
6.5 Friedman Test
What it does: Non-parametric equivalent of repeated measures ANOVA. Compares three or more related groups.
When to use:
- Same subjects measured at 3+ time points
- Data are skewed, ordinal, or assumptions of repeated measures ANOVA are violated
Worked Example:
Research question: Pain scores (NRS 0–10) are compared at 3 time points (baseline, week 4, week 8) in 20 patients with rheumatoid arthritis starting a new biologic therapy. Data are ordinal and skewed.
| Time | Median (IQR) |
|---|---|
| Baseline | 7 (6–9) |
| Week 4 | 5 (3–7) |
| Week 8 | 3 (2–5) |
Friedman χ²(2) = 28.4, p < 0.001
Post-hoc (Wilcoxon with Bonferroni, α adjusted to 0.017):
- Baseline vs Week 4: p = 0.001
- Baseline vs Week 8: p < 0.001
- Week 4 vs Week 8: p = 0.008
Interpretation: Pain scores decreased significantly over the 8-week treatment period (Friedman χ²(2) = 28.4, p < 0.001). All pairwise comparisons showed significant improvement (all p ≤ 0.008).
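The rank-sum mechanics behind the Friedman statistic can be sketched in a few lines (an illustrative helper, not the software used for the worked example; `friedman_chi2` is a hypothetical name, and this sketch omits the tie correction some packages apply):

```python
def friedman_chi2(data):
    """Friedman chi-square. data: list of per-subject tuples, one value per condition."""
    n, k = len(data), len(data[0])
    rank_sums = [0.0] * k
    for row in data:
        # rank each subject's values across conditions; ties share the mean rank
        order = sorted(range(k), key=lambda j: row[j])
        ranks = [0.0] * k
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1          # mean of tied positions, 1-based
            for m in range(i, j + 1):
                ranks[order[m]] = avg
            i = j + 1
        for j in range(k):
            rank_sums[j] += ranks[j]
    # chi-square statistic: 12/(n k (k+1)) * sum of squared rank sums - 3n(k+1)
    return 12 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)
```

With perfectly consistent ordering across subjects, the statistic reaches its maximum of n(k − 1), which is a useful sanity check.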
7. Correlation and Association
7.1 Pearson Correlation
What it does: Quantifies the strength and direction of the linear relationship between two continuous variables. Output is the correlation coefficient r, ranging from −1 (perfect negative linear relationship) to +1 (perfect positive linear relationship).
When to use:
- Both variables continuous
- Approximately bivariate normal distribution
- You are interested in linear association
Interpreting r:
| r (absolute value) | Interpretation |
|---|---|
| 0.00–0.19 | Negligible/very weak |
| 0.20–0.39 | Weak |
| 0.40–0.59 | Moderate |
| 0.60–0.79 | Strong |
| 0.80–1.00 | Very strong |
Important caveat: Correlation ≠ causation. Always plot the data first (scatterplot) — r can miss non-linear relationships, and can be distorted by outliers.
Worked Example:
Research question: Is there a linear association between age (years) and eGFR (mL/min/1.73m²) in a cohort of 150 adults attending a nephrology clinic?
Result: r = −0.58 (95% CI: −0.68 to −0.46), p < 0.001
Interpretation: There is a moderate to strong negative linear relationship between age and eGFR (r = −0.58, 95% CI −0.68 to −0.46, p < 0.001), indicating that kidney function declines with increasing age in this cohort. Age accounts for approximately 34% of the variance in eGFR (r² = 0.336).
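The confidence interval reported above can be reproduced with the Fisher z-transformation (a sketch; `pearson_ci` is an illustrative helper, and the 1.96 normal quantile for a 95% interval is assumed):

```python
from math import sqrt, atanh, tanh

def pearson_ci(r, n, z_crit=1.96):
    """95% CI for Pearson r via the Fisher z-transformation."""
    z = atanh(r)               # transform r to an approximately normal scale
    se = 1 / sqrt(n - 3)       # standard error on the z scale
    return tanh(z - z_crit * se), tanh(z + z_crit * se)

# values from the worked example: r = -0.58, n = 150
lo, hi = pearson_ci(-0.58, 150)
```

Running this gives approximately (−0.68, −0.46), matching the interval in the worked example.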
7.2 Spearman’s Rank Correlation
What it does: Non-parametric measure of monotonic (not necessarily linear) association between two variables. Calculates the Pearson correlation on the ranks of the data.
When to use:
- Ordinal data (e.g., disease severity grade, Likert scale responses)
- Continuous data that are skewed or contain outliers
- Non-linear but monotonic relationships
Worked Example:
Research question: Is NYHA heart failure class (I–IV, ordinal) associated with 6-minute walk distance (metres) in 80 outpatients?
Result: ρ (rho) = −0.71 (95% CI: −0.79 to −0.60), p < 0.001
Interpretation: There is a strong negative monotonic association between NYHA class and 6-minute walk distance (Spearman ρ = −0.71, p < 0.001): higher NYHA class (worse symptoms) is associated with shorter walk distance.
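Because Spearman’s rho is simply the Pearson correlation of the ranks, it can be sketched directly (an illustrative helper assuming no tied values; real software applies a tie correction):

```python
def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on ranks (no ties assumed)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx = my = (n + 1) / 2                      # mean rank
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = sum((a - mx) ** 2 for a in rx)       # equals the y-rank term when there are no ties
    return num / den
```

A perfectly monotonic decreasing sequence gives rho = −1, regardless of how non-linear the raw values are.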
7.3 Common Pitfalls in Correlation Analysis
1. Correlation without scatterplot: Always plot the data. An r of 0.50 could reflect a clean linear trend, a curved relationship, or a trend driven entirely by a few outliers — you cannot tell from the statistic alone.
2. Ecological fallacy: Correlation at the group level (e.g., countries) does not imply correlation at the individual level.
3. Confounding: A correlation between A and B might be explained by a third variable C that is related to both.
4. Restricted range: Correlations are attenuated when you study a narrow range of one variable (e.g., only severely ill patients). True associations may be understated.
5. Multiple testing: If you test 20 correlations, you expect 1 to be significant by chance at α = 0.05.
8. Regression Analysis
8.1 Simple Linear Regression
What it does: Models the linear relationship between one continuous predictor (X) and one continuous outcome (Y). Extends correlation by fitting a line and quantifying the predicted change in Y per unit change in X.
The model:
Y = β₀ + β₁X + ε
β₀ = intercept (value of Y when X = 0)
β₁ = slope (change in Y for each 1-unit increase in X)
ε = residual error
Key outputs:
- β₁ (regression coefficient): The slope — how much Y changes per unit increase in X
- 95% CI for β₁
- R²: Proportion of variance in Y explained by X
- Residual diagnostics: Assess model assumptions
Assumptions:
- Linearity: the relationship is linear
- Independence of residuals
- Homoscedasticity: residuals have constant variance across X values
- Normality of residuals
- No influential outliers
Check assumptions with:
- Residuals vs fitted plot (linearity, homoscedasticity)
- Q-Q plot of residuals (normality)
- Cook’s distance (influential observations)
Worked Example:
Research question: What is the relationship between BMI (kg/m², predictor) and systolic blood pressure (mmHg, outcome) in 200 middle-aged adults?
Result:
SBP = 98.4 + 1.72 × BMI
- β₁ = 1.72 (95% CI: 1.28 to 2.16), p < 0.001
- R² = 0.21 (21% of SBP variance explained by BMI)
Interpretation: For every 1 kg/m² increase in BMI, systolic blood pressure increases by an estimated 1.72 mmHg (95% CI 1.28 to 2.16 mmHg, p < 0.001). BMI explains 21% of the variability in SBP in this cohort.
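The least-squares slope, intercept, and R² follow directly from the definitions above (a minimal sketch; `simple_ols` is an illustrative helper, not a replacement for a full regression routine with standard errors and diagnostics):

```python
def simple_ols(x, y):
    """Least-squares fit of y = b0 + b1*x; returns (b0, b1, r_squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sxy / sxx                         # slope
    b0 = my - b1 * mx                      # intercept
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return b0, b1, 1 - ss_res / ss_tot     # R^2 = explained variance fraction
```

On perfectly linear data the fit recovers the generating line exactly and R² = 1.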
8.2 Multiple Linear Regression
What it does: Models the relationship between two or more predictors and a continuous outcome. Each coefficient represents the effect of that predictor adjusted for all other predictors in the model.
The model:
Y = β₀ + β₁X₁ + β₂X₂ + ... + βₚXₚ + ε
Worked Example:
Research question: What factors independently predict systolic blood pressure in 200 adults? Candidate predictors: BMI, age, sex (female = reference), smoking status (current smoker vs not).
| Predictor | Coefficient (β) | 95% CI | p-value |
|---|---|---|---|
| Intercept | 78.2 | — | — |
| BMI (per kg/m²) | 1.42 | 0.98 to 1.86 | <0.001 |
| Age (per year) | 0.68 | 0.44 to 0.92 | <0.001 |
| Male sex | 4.10 | 1.22 to 6.98 | 0.005 |
| Current smoker | 3.85 | 0.97 to 6.73 | 0.009 |
Adjusted R² = 0.34
Interpretation: After adjustment for other variables, each 1 kg/m² increase in BMI was associated with a 1.42 mmHg increase in SBP (95% CI 0.98–1.86, p < 0.001). Older age, male sex, and current smoking were also independently associated with higher SBP. Together, these four predictors explain 34% of the variance in SBP.
Important: The coefficient for BMI (1.42) differs from the unadjusted coefficient (1.72) because age, sex, and smoking are confounders — they are correlated with BMI and independently predict SBP.
8.3 Logistic Regression
What it does: Models the relationship between one or more predictors and a binary outcome (yes/no, event/no event). Output is the log-odds of the outcome, which is converted to an odds ratio (OR) for interpretation.
The model:
logit(p) = ln(p/(1−p)) = β₀ + β₁X₁ + β₂X₂ + ...
OR for predictor Xj = e^βj
Assumptions:
- Binary outcome
- Independence of observations
- No multicollinearity among predictors
- Large enough sample (at least 10 events per predictor variable — the “EPV rule”)
- Linearity of continuous predictors with the log-odds (check with Box-Tidwell test)
Worked Example:
Research question: What factors independently predict 30-day readmission (yes/no) following hospital admission for COPD exacerbation? Data from 400 admissions.
Outcome: 30-day readmission (n=84, 21%)
| Predictor | OR | 95% CI | p-value |
|---|---|---|---|
| Age (per 10 years) | 1.24 | 1.05 to 1.47 | 0.012 |
| FEV₁% predicted (per 10% increase) | 0.82 | 0.71 to 0.95 | 0.007 |
| Previous admission in past year (yes vs no) | 2.84 | 1.63 to 4.95 | <0.001 |
| Home oxygen use (yes vs no) | 1.93 | 1.09 to 3.42 | 0.025 |
| Eosinophil count (per 0.1×10⁹/L) | 0.87 | 0.76 to 0.99 | 0.038 |
Model fit: Hosmer-Lemeshow goodness-of-fit p = 0.64 (good fit); C-statistic (AUC) = 0.72
Interpretation: Previous admission in the past year was the strongest predictor of 30-day readmission (OR 2.84, 95% CI 1.63–4.95, p < 0.001): patients with prior admissions had nearly three times the odds of readmission compared to those without. Each 10% reduction in FEV₁% was associated with a 22% increase in the odds of readmission (OR 0.82 per 10% improvement, i.e. OR 1.22 per 10% deterioration). The model discriminates readmitted from non-readmitted patients with moderate ability (AUC 0.72).
Reporting the C-statistic / AUC: The AUC (area under the ROC curve) for a logistic model represents the probability that a randomly selected patient who was readmitted had a higher predicted probability than a randomly selected patient who was not. Values: 0.5 = no better than chance; 0.7–0.8 = acceptable; 0.8–0.9 = excellent; >0.9 = outstanding.
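That concordance definition of the AUC can be computed directly from predicted probabilities (a sketch; `c_statistic` is an illustrative helper using the naive all-pairs method, which is fine for small samples):

```python
def c_statistic(pred_probs, outcomes):
    """C-statistic (AUC): probability that a randomly chosen event has a
    higher predicted risk than a randomly chosen non-event (ties count 1/2)."""
    events = [p for p, y in zip(pred_probs, outcomes) if y == 1]
    non_events = [p for p, y in zip(pred_probs, outcomes) if y == 0]
    score = 0.0
    for pe in events:
        for pn in non_events:
            score += 1.0 if pe > pn else 0.5 if pe == pn else 0.0
    return score / (len(events) * len(non_events))
```

For example, predicted risks [0.9, 0.8, 0.3, 0.2] with outcomes [1, 0, 1, 0] give 3 concordant pairs out of 4, i.e. an AUC of 0.75.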
9. Effect Sizes and Association Measures
9.1 Why Effect Sizes Matter
A statistically significant result (small p-value) tells you that an effect probably exists in the population. It does not tell you whether the effect is clinically meaningful. Effect sizes answer the question: “How big is the effect?”
The hierarchy of information:
- P-value: Is there an effect? (binary: yes/no)
- Confidence interval: What is the plausible range of the effect?
- Effect size: How large is the effect, expressed in a standardised or clinically interpretable way?
9.2 Odds Ratio (OR)
Definition: The ratio of the odds of an outcome in the exposed group to the odds in the unexposed group.
Odds of event in group A = P(event in A) / P(no event in A)
Odds of event in group B = P(event in B) / P(no event in B)
OR = [Odds in A] / [Odds in B]
2×2 contingency table notation:
| | Outcome: Yes | Outcome: No |
|---|---|---|
| Exposed (E+) | a | b |
| Unexposed (E−) | c | d |
OR = (a/b) / (c/d) = ad / bc
95% CI: exp(ln(OR) ± 1.96 × √(1/a + 1/b + 1/c + 1/d))
Interpreting OR:
- OR = 1.0: No association
- OR > 1.0: Exposure associated with increased odds of outcome
- OR < 1.0: Exposure associated with decreased odds of outcome
Natural home: Case-control studies and logistic regression models.
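The OR and its log-based (Woolf) confidence interval from a 2×2 table can be sketched as follows (`odds_ratio` is an illustrative helper; the counts used in the usage line are made up for demonstration):

```python
from math import exp, log, sqrt

def odds_ratio(a, b, c, d, z=1.96):
    """OR from a 2x2 table (a,b = exposed yes/no; c,d = unexposed yes/no)
    with the Woolf log-based 95% CI."""
    or_ = (a * d) / (b * c)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # SE of ln(OR)
    return or_, exp(log(or_) - z * se), exp(log(or_) + z * se)

# hypothetical counts: 10/90 events in the exposed, 5/95 in the unexposed
or_, lo, hi = odds_ratio(10, 90, 5, 95)
```

With these counts the OR is (10×95)/(90×5) ≈ 2.11, with the CI spanning it on the log scale.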
9.3 Relative Risk (RR) — also called Risk Ratio
Definition: The ratio of the probability (risk) of an outcome in the exposed group to the probability in the unexposed group.
RR = Risk in exposed / Risk in unexposed
= [a/(a+b)] / [c/(c+d)]
Interpreting RR:
- RR = 1.0: No association
- RR = 2.0: Exposed group has twice the risk
- RR = 0.5: Exposed group has half the risk (50% reduction)
Natural home: Cohort studies and RCTs.
9.4 Odds Ratio vs Relative Risk: When to Use Which
This is one of the most commonly confused distinctions in clinical research. Here is the complete framework:
Study design determines feasibility
| Study design | Can you calculate RR? | Can you calculate OR? |
|---|---|---|
| RCT | Yes (directly from data) | Yes (though RR is usually preferred; ORs arise naturally from logistic regression output) |
| Prospective cohort | Yes | Yes |
| Retrospective cohort | Yes | Yes |
| Case-control | No (sampling from outcome group distorts risk) | Yes — OR is the correct measure |
| Cross-sectional | Prevalence ratio (modified RR) | Yes |
Why can’t you calculate RR from a case-control study? Because you select participants based on the outcome (cases and controls), not based on exposure. The proportion of cases in your sample reflects your sampling ratio, not the true disease risk in the population. The OR is mathematically unaffected by this (it is the same whether you sample 1:1 or 1:4 cases to controls).
Outcome frequency matters
When an outcome is rare (<10%), the OR approximates the RR closely. This is the “rare disease assumption”:
When P(outcome) is small:
OR ≈ RR
When an outcome is common (≥10%), the OR will be further from 1.0 than the RR, and they diverge substantially:
| True RR | True risk in unexposed | Approximate OR |
|---|---|---|
| 2.0 | 5% | 2.1 |
| 2.0 | 20% | 2.7 |
| 2.0 | 40% | 6.0 |
Reporting an OR when the outcome is common and calling it a “risk ratio” substantially overstates the effect. This is a pervasive error in the medical literature.
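The divergence follows directly from the definitions: given an RR and the baseline risk in the unexposed, the implied OR can be computed (a sketch; `odds_ratio_from_rr` is an illustrative helper):

```python
def odds_ratio_from_rr(rr, p0):
    """OR implied by a given RR and baseline risk p0 in the unexposed group."""
    p1 = rr * p0                                # risk in the exposed
    return (p1 / (1 - p1)) / (p0 / (1 - p0))    # ratio of odds
```

At a 5% baseline risk an RR of 2.0 implies an OR of about 2.1; at 20% it implies about 2.7; at 40% the implied OR climbs to 6.0 — the approximation breaks down badly for common outcomes.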
Worked Example:
Scenario: A study of surgical site infection (SSI) after colorectal surgery. Diabetic patients: 40 SSIs in 100 patients (40%). Non-diabetic: 20 SSIs in 100 patients (20%).
RR = (40/100) / (20/100) = 0.40 / 0.20 = 2.0
OR = (40×80) / (60×20) = 3200 / 1200 = 2.67
The OR (2.67) is 33% higher than the RR (2.0). Reporting the OR as if it were a risk ratio would overstate the association. Because this is a cohort study with a common outcome (>10%), report the RR.
Logistic regression outputs ORs — when is this a problem?
Logistic regression models produce ORs, not RRs. When:
- The outcome is rare: OR ≈ RR, report the OR from logistic regression
- The outcome is common: Use alternatives to estimate RR:
- Modified Poisson regression (with robust standard errors) — preferred, produces RR directly
- Log-binomial regression — produces RR directly but can fail to converge
- OR-to-RR conversion formula (Zhang & Yu, 1998):
RR = OR / [(1 − P₀) + (P₀ × OR)]
where P₀ = baseline risk in unexposed group
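The Zhang & Yu conversion is a one-liner (`or_to_rr` is an illustrative helper name). Applying it to the SSI example above — OR 2.67 with a 20% baseline risk in the non-diabetic group — recovers the true RR of 2.0:

```python
def or_to_rr(or_, p0):
    """Zhang & Yu (1998) conversion: approximate RR from an (adjusted) OR
    and the baseline risk p0 in the unexposed group."""
    return or_ / ((1 - p0) + p0 * or_)

rr = or_to_rr(2.67, 0.20)   # SSI example: OR 2.67, baseline risk 20%
```

Note the conversion is exact only for crude 2×2 data; for adjusted ORs it is an approximation.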
Summary decision rule:
Is your study a case-control? → Report OR (only valid measure)
Is your outcome rare (<10%)? → OR ≈ RR, report OR from logistic regression
Is your outcome common (≥10%)?
In a cohort/RCT: → Calculate and report RR directly
From a logistic model: → Use modified Poisson regression for RR
Or report the OR with a clear caveat
9.5 Absolute Risk Reduction (ARR) and Number Needed to Treat (NNT)
ARR: The absolute difference in event rates between two groups.
ARR = Risk in control − Risk in treatment
= (c/(c+d)) − (a/(a+b))
NNT: How many patients need to be treated to prevent one additional outcome event.
NNT = 1 / ARR
- NNT < 10: Very effective treatment
- NNT 10–50: Moderately effective
- NNT > 100: Marginally effective (may still be worthwhile for serious outcomes)
Worked Example:
Scenario: In a trial of prophylactic low-molecular-weight heparin (LMWH) after major orthopaedic surgery: DVT rate = 8% in LMWH group (a/(a+b) = 0.08), 18% in placebo group (c/(c+d) = 0.18).
RR = 0.08 / 0.18 = 0.44 (56% reduction in relative risk)
ARR = 0.18 − 0.08 = 0.10 (10 percentage points)
NNT = 1 / 0.10 = 10
Interpretation: LMWH reduces the relative risk of DVT by 56% (RR 0.44). In absolute terms, for every 10 patients treated with LMWH, one additional DVT is prevented (NNT = 10). The NNT communicates clinical impact in a way the RR alone does not.
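The three steps of the LMWH example are plain arithmetic and can be verified in a few lines (a sketch using the event rates from the trial above):

```python
# event rates from the worked example
risk_control, risk_treatment = 0.18, 0.08

rr = risk_treatment / risk_control   # relative risk
arr = risk_control - risk_treatment  # absolute risk reduction
nnt = 1 / arr                        # number needed to treat
```

This reproduces RR ≈ 0.44, ARR = 0.10 and NNT = 10.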
9.6 Standardised Effect Sizes
When outcomes are measured on different scales and you want to compare effect sizes across studies, use standardised effect sizes:
Cohen’s d: For continuous outcomes (mean difference)
d = (μ₁ − μ₂) / pooled SD
Benchmarks: small d=0.2, medium d=0.5, large d=0.8 (Cohen, 1988 — treat as rough guides only)
Eta-squared (η²): Proportion of variance explained (for ANOVA)
η² = SSbetween / SStotal
Partial η² is preferred for factorial ANOVA.
Omega-squared (ω²): Less biased than η², preferred for meta-analyses.
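Cohen’s d from two groups’ summary statistics can be sketched as follows (`cohens_d` is an illustrative helper; the pooled SD here uses the standard (n−1)-weighted formula):

```python
from math import sqrt

def cohens_d(m1, s1, n1, m2, s2, n2):
    """Cohen's d from group means, SDs, and sizes, using the pooled SD."""
    pooled_sd = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd
```

With equal SDs of 2 and a mean difference of 2, d = 1.0 — a large effect by Cohen’s benchmarks.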
10. Survival and Time-to-Event Analysis
10.1 Why Standard Methods Fail for Survival Data
Consider a study following 100 patients after cancer surgery for 5 years, tracking whether they are alive or dead. Two problems arise that standard regression cannot handle:
Problem 1: Censoring. Some patients are still alive at study end. Some are lost to follow-up. Both are “censored” — they did not experience the event during observation, and we don’t know when (or whether) they would have. (Death from an unrelated cause is treated as censoring only in cause-specific analyses; in an overall-survival analysis, death from any cause is an event.) Excluding censored patients wastes information; treating them as non-events introduces bias.
Problem 2: Variable follow-up times. Patients enrolled at different times have different follow-up durations. A patient followed for 6 months contributes different information from one followed for 48 months.
Survival analysis incorporates both the occurrence of events and the time to event, while properly handling censored observations.
10.2 Core Concepts
Survival function S(t): The probability of surviving (i.e., not experiencing the event) beyond time t.
S(t) = P(T > t)
At t=0: S(0) = 1.0 (everyone is event-free at start)
Over time: S(t) decreases monotonically (or stays flat if no events)
Hazard function h(t): The instantaneous rate of the event at time t, given survival to time t. Sometimes called the “force of mortality.”
Censoring types:
- Right censoring (most common): The event has not occurred by the end of observation
- Left censoring: The event occurred before observation started
- Interval censoring: The event occurred in a known time interval
The critical assumption: Censoring must be non-informative — i.e., the reason for censoring must be unrelated to the probability of experiencing the event. If patients who drop out are more likely to die than those who stay in, estimates will be biased.
10.3 Kaplan-Meier Estimator
What it does: Non-parametrically estimates the survival function S(t) from observed data, accounting for censoring. Produces a step-function survival curve.
The calculation:
S(t) = Π [1 − dj/nj]
where the product is over all event times tj ≤ t
dj = number of events at time tj
nj = number at risk just before time tj
Worked Example:
Research question: Estimate overall survival in 10 patients with metastatic colorectal cancer following first-line chemotherapy.
| Patient | Follow-up (months) | Event (death=1, censored=0) |
|---|---|---|
| 1 | 3 | 1 |
| 2 | 5 | 1 |
| 3 | 6 | 0 (lost to follow-up) |
| 4 | 8 | 1 |
| 5 | 10 | 1 |
| 6 | 12 | 0 (still alive at study end) |
| 7 | 14 | 1 |
| 8 | 18 | 0 |
| 9 | 20 | 1 |
| 10 | 24 | 0 |
KM calculation:
| Time (months) | Events (d) | At risk (n) | S(t) = S(t-prev) × (1 − d/n) |
|---|---|---|---|
| 0 | — | 10 | 1.000 |
| 3 | 1 | 10 | 1.000 × (1 − 1/10) = 0.900 |
| 5 | 1 | 9 | 0.900 × (1 − 1/9) = 0.800 |
| 8 | 1 | 7* | 0.800 × (1 − 1/7) = 0.686 |
| 10 | 1 | 6 | 0.686 × (1 − 1/6) = 0.571 |
| 14 | 1 | 4** | 0.571 × (1 − 1/4) = 0.429 |
| 20 | 1 | 2*** | 0.429 × (1 − 1/2) = 0.214 |
*Patient 3 (censored at 6 months) removed from risk set before time 8
**Patient 6 (censored at 12 months) removed before time 14
***Patient 8 (censored at 18 months) and Patient 10 (censored at 24 months) reduce the risk set
Interpretation: The estimated probability of surviving beyond 20 months is 21.4%. Median survival (where S(t) first falls below 0.5) falls between 10 and 14 months. The KM curve should be presented with number-at-risk tables below the time axis.
Reporting standard: Always include: (1) the KM curve with confidence bands, (2) the number-at-risk table at key time points, (3) median survival with 95% CI for each group.
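The step-by-step table above can be reproduced with a short sketch of the product-limit formula (`kaplan_meier` is an illustrative helper assuming right censoring only):

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate: returns a list of (event time, S(t)) steps.
    events: 1 = event, 0 = censored."""
    s, curve = 1.0, []
    for t in sorted({ti for ti, e in zip(times, events) if e == 1}):
        n = sum(1 for ti in times if ti >= t)   # at risk just before time t
        d = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        s *= 1 - d / n                          # product-limit update
        curve.append((t, s))
    return curve

# the 10-patient cohort from the worked example
times  = [3, 5, 6, 8, 10, 12, 14, 18, 20, 24]
events = [1, 1, 0, 1, 1, 0, 1, 0, 1, 0]
```

Running `kaplan_meier(times, events)` reproduces the table: steps at months 3, 5, 8, 10, 14 and 20, falling from 0.900 to 0.214.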
10.4 Log-Rank Test
What it does: Tests whether the survival curves of two or more groups are identical. The non-parametric equivalent of the t-test for survival data. Uses a weighted sum of differences between observed and expected events at each event time.
When to use:
- Comparing survival curves of 2+ groups
- Non-parametric (makes no assumption about the shape of the survival curve)
- Assumes proportional hazards (the hazard ratio between groups is constant over time)
Worked Example:
Research question: Do patients with KRAS wild-type (WT) colorectal cancer have better overall survival than those with KRAS mutant (MT) cancer following anti-EGFR therapy?
| Group | n | Events | Median OS (months) | 95% CI |
|---|---|---|---|---|
| KRAS WT | 85 | 62 | 18.4 | 14.2–22.6 |
| KRAS MT | 79 | 71 | 9.8 | 7.6–12.0 |
Log-rank test: χ²(1) = 14.8, p < 0.001
Interpretation: Patients with KRAS wild-type tumours had significantly longer overall survival than those with KRAS mutations (median 18.4 vs 9.8 months; log-rank p < 0.001). This finding supports the predictive role of KRAS status for anti-EGFR therapy benefit.
10.5 Cox Proportional Hazards Regression
What it does: The most widely used model for time-to-event data with multiple predictors. Models the hazard (instantaneous risk) as a function of predictor variables. Output is the hazard ratio (HR) — the ratio of hazards between groups.
The model:
h(t|X) = h₀(t) × exp(β₁X₁ + β₂X₂ + ... + βₚXₚ)
Where h₀(t) is the baseline hazard function (unspecified — this is a “semi-parametric” model).
The hazard ratio:
HR for predictor Xj = e^βj
Interpreting HR:
- HR = 1.0: No association with time-to-event
- HR = 2.0: Exposed group has twice the instantaneous rate of the event at any given time
- HR = 0.5: Exposed group has half the hazard (50% risk reduction at any time)
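Converting a fitted Cox coefficient and its standard error into the reported HR and 95% CI is a direct application of HR = e^β (a sketch; `hazard_ratio` is an illustrative helper and the 1.96 normal quantile is assumed):

```python
from math import exp, log

def hazard_ratio(beta, se, z=1.96):
    """HR and 95% CI from a Cox regression coefficient and its standard error."""
    return exp(beta), exp(beta - z * se), exp(beta + z * se)

# e.g. a coefficient of ln(2) corresponds to a doubling of the hazard
hr, lo, hi = hazard_ratio(log(2.0), 0.2)
```

The CI is constructed on the log scale and exponentiated, which is why reported HR intervals are asymmetric around the point estimate.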
The proportional hazards (PH) assumption: The hazard ratio between two groups is constant over time. This is the key assumption of Cox regression. Check it with:
- Log-log plot: log(−log(S(t))) vs log(t) — lines should be parallel
- Schoenfeld residuals plot — no trend over time
- Grambsch-Therneau test (formal statistical test of PH assumption)
If PH is violated: use time-varying coefficients, stratified Cox model, or parametric models (Weibull, log-logistic).
Worked Example:
Research question: What factors predict time to dialysis initiation in a cohort of 280 CKD patients followed for up to 5 years? Predictors: age, sex, eGFR at baseline, proteinuria (g/24h), diabetes, hypertension.
Events: 98 patients started dialysis; 182 censored
| Predictor | HR | 95% CI | p-value |
|---|---|---|---|
| Age (per 10 years) | 1.18 | 0.98–1.42 | 0.082 |
| Male sex | 1.44 | 0.94–2.20 | 0.094 |
| eGFR at baseline (per 10 mL/min/1.73m² increase) | 0.51 | 0.42–0.62 | <0.001 |
| Proteinuria (per 1 g/24h increase) | 1.67 | 1.38–2.02 | <0.001 |
| Diabetes (yes vs no) | 2.03 | 1.32–3.12 | 0.001 |
| Hypertension (yes vs no) | 1.38 | 0.89–2.13 | 0.150 |
PH assumption checked: Schoenfeld residuals test p=0.38 (no violation)
Interpretation:
- Each 10 mL/min/1.73m² higher baseline eGFR was associated with a 49% lower hazard of dialysis initiation (HR 0.51, 95% CI 0.42–0.62, p < 0.001).
- Each 1 g/24h increase in proteinuria was associated with a 67% higher hazard of dialysis initiation (HR 1.67, 95% CI 1.38–2.02, p < 0.001).
- Patients with diabetes had twice the hazard of dialysis initiation compared to non-diabetic patients (HR 2.03, 95% CI 1.32–3.12, p = 0.001).
- After adjustment, age, sex, and hypertension were not independently associated with dialysis initiation.
Reporting template: “In multivariable Cox regression, proteinuria (HR 1.67 per 1 g/24h increase, 95% CI 1.38–2.02, p < 0.001) and diabetes (HR 2.03, 95% CI 1.32–3.12, p = 0.001) were independently associated with time to dialysis initiation after adjustment for baseline eGFR and other covariates.”
11. Multivariable Modelling Strategy
11.1 Univariate vs Multivariable Analysis: The Clinical Workflow
Almost all published clinical research involves both steps:
Step 1: Univariate (crude) analysis
- Each predictor is tested against the outcome individually, without adjustment
- Reports crude (unadjusted) ORs, HRs, or mean differences
- Purpose: describe raw associations, identify candidate variables for multivariable model
Step 2: Multivariable (adjusted) analysis
- Selected predictors are entered simultaneously into a regression model
- Reports adjusted ORs, HRs, or mean differences, with each predictor’s effect estimated after controlling for the others
- Purpose: identify independent predictors, control for confounding
The relationship between crude and adjusted estimates is clinically informative. A variable that is significant in univariate but not multivariable analysis was likely confounded. A variable that appears non-significant univariately but significant in multivariable analysis was previously masked by confounders (negative confounding).
11.2 What Is Confounding?
A confounder is a third variable that:
- Is associated with the exposure/predictor
- Is associated with the outcome
- Is NOT an intermediary on the causal pathway between exposure and outcome
Example: A study finds that coffee drinking is associated with lung cancer. But coffee drinkers are also more likely to smoke, and smoking causes lung cancer. Smoking is a confounder. After adjusting for smoking, the association between coffee and lung cancer disappears.
Controlling for confounders:
- Include them in the regression model (most common)
- Matching (case-control studies, propensity score matching)
- Restriction (study only non-smokers)
- Stratification (analyse smokers and non-smokers separately)
11.3 Selecting Variables for a Multivariable Model
The EPV (events per variable) rule: As a minimum, you need approximately 10 events per predictor variable in logistic and Cox regression to avoid overfitting. With 80 events, include a maximum of 8 predictors.
Approaches to variable selection:
1. Hypothesis-driven selection (preferred in clinical research): Select predictors based on clinical knowledge and prior literature, regardless of statistical significance in univariate analysis. Pre-specify in your protocol.
2. Univariate screening approach:
- Test each candidate predictor in univariate analysis
- Include variables with p < 0.2 (or 0.25) as candidates — not p < 0.05, as this misses potentially important confounders
- Also include clinically important variables regardless of p-value
3. Automated stepwise selection (not recommended as primary approach):
- Backward elimination, forward selection, or bidirectional stepwise
- Problems: capitalises on chance, biased SEs and p-values, unreproducible in different samples
- May be used for exploratory analyses but results should be validated in an independent dataset
11.4 Handling Confounding: A Worked Example
Research question: Is emergency (vs elective) hospital admission associated with in-hospital mortality? Data from 600 admissions.
Univariate analysis:
| Variable | Crude OR | 95% CI | p |
|---|---|---|---|
| Emergency admission (vs elective) | 3.22 | 1.84–5.63 | <0.001 |
| Age (per 10 years) | 1.65 | 1.31–2.08 | <0.001 |
| Charlson comorbidity index | 1.48 | 1.24–1.77 | <0.001 |
| Male sex | 1.29 | 0.76–2.20 | 0.340 |
Multivariable logistic regression:
| Variable | Adjusted OR | 95% CI | p |
|---|---|---|---|
| Emergency admission (vs elective) | 1.87 | 1.01–3.47 | 0.047 |
| Age (per 10 years) | 1.44 | 1.12–1.85 | 0.004 |
| Charlson comorbidity index | 1.36 | 1.12–1.65 | 0.002 |
| Male sex | 1.15 | 0.65–2.03 | 0.627 |
Interpretation: Emergency admission was significantly associated with in-hospital mortality in both univariate (crude OR 3.22) and multivariable (adjusted OR 1.87) analyses. The attenuation from 3.22 to 1.87 indicates that age and comorbidity are confounders — emergency admissions tend to involve older, sicker patients, which partially explains their higher mortality. The adjusted OR represents the “true” independent association after accounting for these differences.
11.5 Propensity Score Methods
The problem: In observational studies, patients who receive a treatment differ systematically from those who don’t. Simply adjusting for confounders in regression may be insufficient when there are many confounders or when the treatment and control groups barely overlap.
Propensity score (PS): The predicted probability of receiving the treatment, given a patient’s observed baseline characteristics. Estimated using logistic regression with treatment as outcome and all confounders as predictors.
Uses of the propensity score:
1. Propensity score matching: Match each treated patient to one (or more) untreated patient(s) with a similar PS. Creates two groups balanced on measured confounders — mimics a randomised trial.
2. PS stratification: Divide patients into quintiles of PS and compare outcomes within each stratum.
3. Inverse probability of treatment weighting (IPTW): Reweight observations so that the weighted sample resembles a randomised trial.
Worked Example:
Research question: Using a registry of 800 STEMI patients, compare 1-year mortality between those who received drug-eluting stent (DES, n=400) vs bare metal stent (BMS, n=400). Patients who received DES were younger, had lower GRACE scores, and fewer comorbidities.
After propensity score matching (caliper width = 0.1 SD of logit PS):
- 312 matched pairs (DES vs BMS)
- Baseline characteristics now balanced (standardised differences all <0.10)
Matched analysis: HR for 1-year mortality, DES vs BMS = 0.74 (95% CI 0.55–0.99, p = 0.043)
Compare to: Unmatched analysis: HR = 0.51 (95% CI 0.39–0.67, p < 0.001) — substantially biased by confounding.
Interpretation: After propensity score matching to account for confounders, DES was associated with a 26% reduction in 1-year mortality compared to BMS (HR 0.74, p = 0.043). The unmatched estimate (51% reduction) was confounded by the baseline differences between groups.
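The matching step itself can be sketched as greedy 1:1 nearest-neighbour matching within a caliper (`ps_match` is an illustrative helper operating on already-estimated scores; real implementations typically match on the logit of the PS and may use optimal rather than greedy matching):

```python
def ps_match(treated, control, caliper):
    """Greedy 1:1 nearest-neighbour matching on propensity scores.
    Returns (treated_ps, control_ps) pairs whose scores differ by <= caliper."""
    available = sorted(control)
    pairs = []
    for t in sorted(treated):
        if not available:
            break
        best = min(available, key=lambda c: abs(c - t))   # nearest unused control
        if abs(best - t) <= caliper:
            pairs.append((t, best))
            available.remove(best)                        # match without replacement
    return pairs
```

Treated patients with no control inside the caliper are simply left unmatched, which is why matched cohorts (312 pairs above, from 400 treated) are smaller than the original sample.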
12. Multivariate Methods
12.1 Terminology Clarification
Multivariable: Multiple predictor variables, ONE outcome (e.g. multiple linear regression)
Multivariate: Multiple outcome variables simultaneously (e.g. MANOVA, PCA)
This distinction is frequently misused in published literature. MANOVA, PCA, and factor analysis are truly “multivariate” methods.
12.2 MANOVA (Multivariate Analysis of Variance)
What it does: Tests whether groups differ on a combination of continuous outcome variables simultaneously. An extension of ANOVA to multiple outcomes.
When to use:
- 3+ continuous outcome variables that are correlated with each other
- One or more grouping factors
- You want to test overall group differences before examining individual outcomes
Why not just run separate ANOVAs?
- Multiple testing inflates Type I error (with 5 outcomes at α=0.05, ~22% chance of at least one false positive)
- Ignores correlations among outcomes — MANOVA uses these to improve power
- MANOVA can detect group differences that no single ANOVA would
MANOVA test statistics: Wilks’ Lambda (most common), Pillai’s trace, Hotelling-Lawley trace, Roy’s largest root. All test the same null hypothesis but differ in robustness to assumption violations. Pillai’s trace is most robust to violations.
Worked Example:
Research question: Does exercise training modality (aerobic vs resistance vs combined vs control, n=30 per group) differentially affect cardiorespiratory fitness across three outcomes: VO₂max (mL/kg/min), 6-minute walk distance (m), and resting heart rate (bpm)?
Outcomes are moderately intercorrelated (r = 0.40–0.65).
MANOVA:
- Pillai’s trace = 0.52, F(9, 342) = 7.44, p < 0.001
Follow-up univariate ANOVAs (with Bonferroni correction, α = 0.017):
- VO₂max: F(3,116) = 12.4, p < 0.001
- 6MWD: F(3,116) = 8.7, p < 0.001
- Resting HR: F(3,116) = 5.2, p = 0.002
Interpretation: Training modality had a significant multivariate effect on cardiorespiratory fitness outcomes (MANOVA: Pillai’s trace = 0.52, F(9,342) = 7.44, p < 0.001). Follow-up univariate ANOVAs revealed significant effects on all three individual outcomes (all p ≤ 0.002 after Bonferroni correction).
12.3 Principal Component Analysis (PCA)
What it does: A data reduction technique that transforms a set of correlated variables into a smaller set of uncorrelated components (principal components) that capture most of the variance in the original data.
When to use:
- Many correlated predictor variables (multicollinearity) — reduce before regression
- Exploratory data analysis of high-dimensional data
- Visualising patterns in complex datasets
Key outputs:
- Eigenvalues: Variance explained by each component. Components with eigenvalue > 1 are typically retained (Kaiser criterion).
- Scree plot: Graph of eigenvalues — look for the “elbow” where the curve flattens.
- Factor loadings: Correlation between original variables and each component. Loadings > 0.4 are typically considered meaningful.
- % variance explained: How much of the total variability each component captures.
Worked Example:
Research question: A metabolic syndrome study measures 8 correlated biomarkers in 300 patients: waist circumference, fasting glucose, HDL-C, LDL-C, triglycerides, SBP, DBP, and insulin. Reduce these to a smaller set of components.
PCA results:
| Component | Eigenvalue | % Variance | Cumulative % |
|---|---|---|---|
| PC1 | 3.12 | 39.0% | 39.0% |
| PC2 | 1.84 | 23.0% | 62.0% |
| PC3 | 1.02 | 12.8% | 74.8% |
| PC4–8 | <0.80 each | <10% each | — |
Three components retained (eigenvalue > 1, 75% variance explained).
Loading matrix (simplified):
| Variable | PC1 (“metabolic risk”) | PC2 (“blood pressure”) | PC3 (“lipid profile”) |
|---|---|---|---|
| Waist circumference | 0.78 | 0.12 | 0.21 |
| Fasting glucose | 0.72 | 0.18 | −0.14 |
| Insulin | 0.69 | 0.08 | 0.22 |
| Triglycerides | 0.61 | 0.23 | 0.48 |
| SBP | 0.15 | 0.82 | 0.19 |
| DBP | 0.22 | 0.79 | 0.08 |
| HDL-C | −0.54 | 0.14 | 0.62 |
| LDL-C | 0.28 | 0.20 | 0.71 |
Interpretation: PC1 captures a “central metabolic risk” factor (high waist, glucose, insulin, TG; low HDL). PC2 represents blood pressure. PC3 captures lipid profile. These three components can replace the 8 original variables as predictors in subsequent analyses with minimal information loss.
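The arithmetic behind eigenvalues and "% variance explained" can be seen most clearly in the simplest possible case. A minimal pure-Python sketch for two standardised variables with a hypothetical correlation of r = 0.8 (for a 2×2 correlation matrix [[1, r], [r, 1]] the eigenvalues have the closed form 1 + r and 1 − r):

```python
# Minimal sketch: eigenvalues of a 2x2 correlation matrix, illustrating
# how PCA apportions variance between components.
def pca_2var(r):
    """Eigenvalues and % variance explained for two standardised
    variables with correlation r (closed form for the 2x2 case)."""
    eigenvalues = [1 + r, 1 - r]              # total variance = 2
    pct = [ev / 2 * 100 for ev in eigenvalues]
    return eigenvalues, pct

# Two biomarkers correlated at r = 0.8 (hypothetical values):
evals, pct = pca_2var(0.8)
# PC1: eigenvalue 1.8 -> 90% of variance; PC2: eigenvalue 0.2 -> 10%.
# The Kaiser criterion (eigenvalue > 1) would retain only PC1.
```

With many variables the same logic applies to the full correlation matrix; in practice one would use a library routine rather than a closed form.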
13. Mixed Models and Longitudinal Data
13.1 Why Standard ANOVA Is Insufficient for Longitudinal Data
Repeated measures ANOVA requires complete data (no missing values), assumes compound symmetry (equal variances and covariances between all time pairs), and cannot handle time-varying covariates. In clinical trials, 10–40% of observations are commonly missing.
Linear mixed effects (LME) models overcome these limitations:
- Handle missing data (missing-at-random) without imputation
- Allow flexible correlation structures (not just compound symmetry)
- Can accommodate unequally spaced measurement occasions
- Can model individual trajectories (random slopes)
13.2 Linear Mixed Effects Models
The model:
Y_ij = (β₀ + b₀ᵢ) + (β₁ + b₁ᵢ)×time_ij + β₂×X_ij + ε_ij
Where:
- β₀, β₁ = fixed effects (population-average intercept and slope)
- b₀ᵢ, b₁ᵢ = random effects for subject i (individual deviations from average)
- ε_ij = residual error
Fixed effects: Average effects across the population (reported) Random effects: Between-subject variability in intercepts and/or slopes
Worked Example:
Research question: A 12-month RCT of a lifestyle intervention in type 2 diabetes. HbA1c is measured at baseline, 3, 6, and 12 months in 120 patients (60 intervention, 60 control). 18% of follow-up data are missing (missing-at-random).
LME model: HbA1c ~ time × treatment + age + baseline HbA1c + (1+time|patient)
Key results:
| Effect | Coefficient | SE | 95% CI | p |
|---|---|---|---|---|
| Time (per month, control arm) | −0.021 | 0.008 | −0.037 to −0.005 | 0.011 |
| Treatment × time interaction | −0.038 | 0.011 | −0.059 to −0.017 | 0.001 |
| Age (per year) | 0.012 | 0.007 | −0.002 to 0.026 | 0.089 |
Interpretation: In the control arm, HbA1c decreased by 0.021% per month (reflecting background treatment changes). In the intervention arm, HbA1c decreased by an additional 0.038% per month compared to control (interaction term p = 0.001), yielding a net additional reduction of 0.46% at 12 months. The mixed model used all available data including observations with missing follow-up, reducing bias compared to complete-case analysis.
14. Diagnostic Test Evaluation
14.1 The 2×2 Table for Diagnostic Tests
All diagnostic test statistics derive from the 2×2 table comparing test result to the true diagnosis (gold standard):
| | Disease present | Disease absent | Total |
|---|---|---|---|
| Test positive | True positive (TP) | False positive (FP) | TP+FP |
| Test negative | False negative (FN) | True negative (TN) | FN+TN |
| Total | TP+FN | FP+TN | N |
14.2 Sensitivity, Specificity, PPV, NPV
Sensitivity: P(test positive | disease present) = TP / (TP+FN)
- A highly sensitive test rarely misses disease (few false negatives)
- “SnNout” — a highly Sensitive test when Negative rules OUT disease
Specificity: P(test negative | disease absent) = TN / (FP+TN)
- A highly specific test rarely gives false positives
- “SpPin” — a highly Specific test when Positive rules IN disease
Positive Predictive Value (PPV): P(disease present | test positive) = TP / (TP+FP)
- Depends heavily on disease prevalence — PPV falls sharply with lower prevalence
Negative Predictive Value (NPV): P(disease absent | test negative) = TN / (FN+TN)
- NPV rises with lower prevalence
The prevalence dependence of PPV and NPV: Unlike sensitivity and specificity (intrinsic test properties), PPV and NPV depend on the prevalence of disease in the tested population. A test with 95% sensitivity and 95% specificity applied to a population with 1% prevalence has a PPV of only 16.1%.
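This prevalence dependence is just Bayes' theorem. A minimal sketch that reproduces the 16.1% figure quoted above (function name is ours, not a library call):

```python
# Sketch: PPV and NPV as a function of prevalence, via Bayes' theorem.
def predictive_values(sens, spec, prev):
    """PPV and NPV for a test with given sensitivity, specificity,
    and disease prevalence in the tested population."""
    ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    npv = spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
    return ppv, npv

# 95% sensitive, 95% specific test at 1% prevalence:
ppv, npv = predictive_values(0.95, 0.95, 0.01)
# ppv ≈ 0.161 (16.1%) -- most positives are false positives
# npv ≈ 0.9995 -- negatives are almost always truly disease-free
```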
Worked Example:
Research question: Evaluate a new point-of-care troponin I assay for ruling out NSTEMI in 500 ED patients with chest pain. True diagnosis confirmed by serial high-sensitivity troponin.
| | NSTEMI (n=80) | No NSTEMI (n=420) |
|---|---|---|
| POC troponin positive (≥40 ng/L) | 72 | 21 |
| POC troponin negative (<40 ng/L) | 8 | 399 |
Prevalence = 80/500 = 16%
Sensitivity = 72/80 = 90.0% (95% CI: 81.2–95.6%)
Specificity = 399/420 = 95.0% (95% CI: 92.5–96.9%)
PPV = 72/93 = 77.4% (95% CI: 67.7–85.3%)
NPV = 399/407 = 98.0% (95% CI: 96.1–99.2%)
Interpretation: This POC troponin assay demonstrates high sensitivity (90%) and specificity (95%) for NSTEMI detection. The NPV of 98.0% supports its use as a rule-out strategy — of patients who test negative, 98% truly do not have NSTEMI. The PPV of 77.4% indicates that 23% of positive results will be false positives at this prevalence (16%), so confirmatory testing is needed for positive results.
14.3 ROC Curves and AUC
What it does: Evaluates a continuous or ordinal diagnostic test across all possible cutpoints. Plots sensitivity (y-axis) against 1-specificity (x-axis) as the threshold varies.
AUC (Area Under the Curve) / C-statistic:
- 0.5 = no discrimination (no better than chance)
- 0.7–0.8 = acceptable discrimination
- 0.8–0.9 = excellent
- >0.9 = outstanding
Optimal cutpoint: Choose based on clinical need:
- For rule-out tests (screening): maximise sensitivity (accept lower specificity)
- For rule-in tests (confirmation): maximise specificity (accept lower sensitivity)
- Youden’s index (sensitivity + specificity − 1): balanced optimum
Comparing two tests: DeLong’s method for comparing paired AUCs from the same sample.
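The AUC also has a useful probabilistic reading: it equals the probability that a randomly chosen diseased patient has a higher test value than a randomly chosen non-diseased patient (equivalently, the Mann-Whitney U statistic divided by n₁×n₂). A minimal sketch with hypothetical biomarker values:

```python
# Sketch: AUC from its probabilistic definition -- the chance a random
# diseased patient outscores a random non-diseased patient (ties count 1/2).
def auc(diseased, healthy):
    wins = sum((d > h) + 0.5 * (d == h)
               for d in diseased for h in healthy)
    return wins / (len(diseased) * len(healthy))

# Hypothetical test values for 4 diseased and 4 healthy patients:
a = auc(diseased=[3, 5, 6, 8], healthy=[1, 2, 4, 7])
# 12 of the 16 patient pairs rank correctly -> AUC = 0.75
```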
Worked Example:
Research question: Compare eGFR alone vs a clinical risk score (incorporating eGFR + proteinuria + age + diabetes) for predicting dialysis within 3 years in 350 CKD patients.
| Model | AUC | 95% CI |
|---|---|---|
| eGFR alone | 0.73 | 0.67–0.79 |
| Clinical risk score | 0.84 | 0.79–0.89 |
| Difference | +0.11 | p = 0.003 |
Interpretation: The clinical risk score (AUC 0.84) significantly outperforms eGFR alone (AUC 0.73) for predicting 3-year dialysis initiation (DeLong’s test p = 0.003). Adding proteinuria, age, and diabetes to eGFR substantially improves discrimination.
15. Agreement and Reliability
15.1 Cohen’s Kappa
What it does: Measures agreement between two raters (or methods) on categorical outcomes, corrected for chance agreement.
κ = (Po − Pe) / (1 − Pe)
Po = observed agreement proportion
Pe = expected agreement by chance
Interpreting kappa (Landis & Koch thresholds — use as rough guides):
| κ | Interpretation |
|---|---|
| <0.00 | Poor (less than chance) |
| 0.00–0.20 | Slight |
| 0.21–0.40 | Fair |
| 0.41–0.60 | Moderate |
| 0.61–0.80 | Substantial |
| 0.81–1.00 | Almost perfect |
Worked Example:
Research question: Two radiologists independently classify 120 chest X-rays as: normal, consolidation, or interstitial change. What is their agreement?
| | Rad2: Normal | Rad2: Consol. | Rad2: Interstitial | Total |
|---|---|---|---|---|
| Rad1: Normal | 48 | 4 | 2 | 54 |
| Rad1: Consolidation | 3 | 28 | 2 | 33 |
| Rad1: Interstitial | 1 | 2 | 30 | 33 |
| Total | 52 | 34 | 34 | 120 |
Po = (48+28+30)/120 = 106/120 = 0.883
Expected agreement:
- Pe(normal) = (54×52)/120² = 0.195
- Pe(consolidation) = (33×34)/120² = 0.078
- Pe(interstitial) = (33×34)/120² = 0.078
- Pe = 0.195 + 0.078 + 0.078 = 0.351
κ = (0.883 − 0.351) / (1 − 0.351) = 0.532 / 0.649 = 0.82
Interpretation: There is almost perfect agreement between the two radiologists for chest X-ray classification (κ = 0.82, 95% CI 0.73–0.91).
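The same calculation can be scripted directly from the agreement table. A minimal sketch using the radiologist counts above (function name is ours):

```python
# Sketch: Cohen's kappa from a k x k agreement table.
def cohens_kappa(table):
    n = sum(sum(row) for row in table)
    po = sum(table[i][i] for i in range(len(table))) / n   # observed agreement
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    pe = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (po - pe) / (1 - pe)

table = [[48, 4, 2],    # Rad1: normal
         [3, 28, 2],    # Rad1: consolidation
         [1, 2, 30]]    # Rad1: interstitial
kappa = cohens_kappa(table)   # ≈ 0.82, matching the hand calculation
```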
15.2 Bland-Altman Analysis
What it does: Assesses the agreement between two continuous measurement methods. Plots the difference between methods (y-axis) against the mean of the two methods (x-axis). Identifies systematic bias and limits of agreement.
Key outputs:
- Bias: Mean difference (Method A − Method B). Non-zero bias indicates systematic over- or under-measurement by one method.
- Limits of agreement (LOA): Bias ± 1.96 × SD of differences. The range within which 95% of differences will fall.
- Clinical decision: Are the LOA clinically acceptable? If the maximum acceptable difference is ±5 mmHg and the LOA are ±3 mmHg, the methods agree well enough for clinical use.
Why NOT to use correlation for method comparison: Pearson r measures association, not agreement. Two methods could be highly correlated but systematically disagree. Bland-Altman is the correct approach.
Worked Example:
Research question: Compare automated oscillometric blood pressure (AOBP) with gold-standard intra-arterial (IA) SBP measurement in 50 ICU patients.
| Statistic | Value |
|---|---|
| Mean AOBP | 124.6 mmHg |
| Mean IA SBP | 128.4 mmHg |
| Mean difference (AOBP − IA) | −3.8 mmHg |
| SD of differences | 8.2 mmHg |
| Upper LOA (+1.96 SD) | −3.8 + 16.1 = +12.3 mmHg |
| Lower LOA (−1.96 SD) | −3.8 − 16.1 = −19.9 mmHg |
Interpretation: AOBP underestimates IA SBP by a mean of 3.8 mmHg. The limits of agreement range from −19.9 to +12.3 mmHg, meaning in 95% of patients, AOBP will differ from IA by between 20 mmHg below and 12 mmHg above. Given the wide LOA, AOBP cannot reliably substitute for intra-arterial measurement in haemodynamically unstable ICU patients where precision of ±5 mmHg is needed.
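The limits of agreement are simple arithmetic once the bias and SD of the differences are known. A sketch reproducing the AOBP example from its summary statistics (with raw paired data, one would compute the mean and SD of the differences first):

```python
# Sketch: 95% limits of agreement = bias ± 1.96 × SD of paired differences.
def limits_of_agreement(bias, sd_diff):
    return bias - 1.96 * sd_diff, bias + 1.96 * sd_diff

# AOBP − IA example above: bias −3.8 mmHg, SD of differences 8.2 mmHg
lower, upper = limits_of_agreement(-3.8, 8.2)
# lower ≈ −19.9 mmHg, upper ≈ +12.3 mmHg
```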
16. Bayesian Methods
16.1 Frequentist vs Bayesian Framework
The fundamental difference:
Frequentist (classical) statistics:
- Parameters (e.g., true treatment effect) are fixed, unknown constants
- Probability refers to long-run frequency of events
- P-value = P(data this extreme | H₀ is true) — does not tell you probability that H₀ is true
- Cannot make probability statements about parameters
Bayesian statistics:
- Parameters have probability distributions reflecting uncertainty
- You start with a prior distribution (beliefs before seeing the data)
- You update with observed data (the likelihood)
- You get a posterior distribution (updated beliefs)
- Can make direct probability statements: “P(true effect > 0 | data) = 0.97”
Bayes’ theorem:
Posterior ∝ Prior × Likelihood
P(θ|data) ∝ P(data|θ) × P(θ)
16.2 Credible Intervals vs Confidence Intervals
Frequentist 95% CI: In repeated sampling, 95% of such intervals would contain the true parameter. Does NOT mean “95% probability the true value is in this interval” for this specific interval — though clinicians routinely interpret it this way.
Bayesian 95% credible interval (CrI): There IS a 95% probability that the true parameter lies within this interval (given the prior and the data). This is the natural, intuitive interpretation most clinicians want.
16.3 Bayesian Analysis in Practice
Worked Example:
Research question: A pilot RCT tests a new immunotherapy in 30 patients with refractory rheumatoid arthritis (15 active, 15 placebo). ACR50 response rates: 7/15 (47%) active, 3/15 (20%) placebo.
Frequentist analysis:
- OR = 3.5, 95% CI 0.71–17.3, p = 0.12
- Conclusion: “Not statistically significant” — ambiguous for a small pilot trial
Bayesian analysis:
- Prior: Weakly informative prior based on existing biologics literature (modest positive effect expected)
- Posterior OR = 3.2 (95% CrI: 1.02–10.4)
- P(OR > 1 | data) = 0.96 → 96% probability that the active treatment has a positive effect
- P(OR > 2 | data) = 0.72 → 72% probability of at least a doubling of odds of response
Interpretation: While the frequentist analysis is technically non-significant (p = 0.12) — likely due to small sample size — Bayesian analysis incorporating prior evidence indicates a 96% posterior probability that the new treatment outperforms placebo. These findings support proceeding to a full Phase III RCT.
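The posterior machinery can be demonstrated with a simple grid approximation, entirely in pure Python. This sketch uses flat Beta(1,1) priors on each arm's response rate rather than the informative prior in the example above, so its posterior probability differs from the 0.96 reported there:

```python
# Sketch: Bayesian comparison of two response rates by grid approximation.
# Flat priors (NOT the informative prior used in the worked example).
from math import comb

def posterior_grid(k, n, steps=400):
    """Binomial posterior under a flat prior, normalised over a grid."""
    grid = [(i + 0.5) / steps for i in range(steps)]
    w = [comb(n, k) * p**k * (1 - p)**(n - k) for p in grid]
    total = sum(w)
    return grid, [x / total for x in w]

grid, post_a = posterior_grid(7, 15)   # active arm: 7/15 responders
_, post_p = posterior_grid(3, 15)      # placebo arm: 3/15 responders

# P(p_active > p_placebo | data): double sum over the joint grid
prob = sum(wa * wp
           for ga, wa in zip(grid, post_a)
           for gp, wp in zip(grid, post_p) if ga > gp)
# high posterior probability that the active arm is better, despite p = 0.12
```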
16.4 Bayes Factors
What it does: A ratio of how well H₁ predicts the data relative to H₀. Can provide evidence for the null — something p-values cannot do.
BF₁₀ = P(data | H₁) / P(data | H₀)
Interpretation:
| BF₁₀ | Evidence for H₁ |
|---|---|
| 1–3 | Anecdotal |
| 3–10 | Moderate |
| 10–30 | Strong |
| 30–100 | Very strong |
| >100 | Extreme |
| <1/3 | Moderate evidence for H₀ |
Clinical use case: Non-inferiority trials — showing that a new (cheaper, safer) treatment is “not meaningfully worse.” A Bayes Factor <1/3 provides positive evidence for the null (no difference) rather than merely failing to reject it.
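A toy case where the Bayes factor has a closed form makes the "evidence for the null" point concrete. For a binomial outcome with H₀: p = 0.5 versus H₁: p ~ Uniform(0,1), the marginal likelihood under the uniform prior integrates to 1/(n+1), while under H₀ it is C(n,k)·0.5ⁿ (this specific prior choice is ours, for illustration):

```python
# Sketch: exact Bayes factor for binomial data, H0: p = 0.5 vs
# H1: p ~ Uniform(0,1).
from math import comb

def bf10_binomial(k, n):
    m1 = 1 / (n + 1)               # ∫ C(n,k) p^k (1-p)^(n-k) dp = 1/(n+1)
    m0 = comb(n, k) * 0.5 ** n     # likelihood at the point null
    return m1 / m0

bf_null = bf10_binomial(50, 100)   # data sit exactly on H0
bf_alt = bf10_binomial(70, 100)    # data far from p = 0.5
# bf_null ≈ 0.12 (< 1/3): moderate evidence FOR the null
# bf_alt > 100: extreme evidence for H1
```

Note that 50/100 successes yields positive evidence for H₀, which a non-significant p-value could never provide.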
17. Meta-Analysis and Systematic Review
17.1 The Evidence Hierarchy
Meta-analysis of randomised controlled trials sits at the top of the evidence hierarchy. It pools quantitative data from multiple studies to produce a single, more precise estimate of effect.
Why meta-analysis?
- Individual trials often underpowered to detect small but clinically important effects
- Pooling increases precision (narrower CI)
- Identifies sources of heterogeneity
- More generalisable than any single study
17.2 Fixed vs Random Effects Models
Fixed effects model:
- Assumes all studies estimate the same true effect
- Between-study variation is due to sampling error only
- Appropriate when studies are functionally identical (same population, intervention, outcome)
- Gives more weight to larger studies
Random effects model (DerSimonian-Laird):
- Assumes studies estimate different but related true effects (a distribution of effects)
- Between-study variability (heterogeneity, τ²) is estimated and incorporated
- Results in wider, more honest CIs
- More appropriate for most clinical meta-analyses where populations and protocols vary
- More weight distributed to smaller studies compared to fixed effects
How to choose: Examine heterogeneity (I²). If I² < 25%, either model is reasonable. If I² ≥ 50% (substantial heterogeneity), random effects model is more appropriate. However, if heterogeneity is very high (I² > 75%), even the random effects pooled estimate should be interpreted cautiously.
17.3 Worked Meta-Analysis Example
Research question: What is the effect of ACE inhibitors on cardiovascular mortality in patients with heart failure? Systematic review identified 6 eligible RCTs.
| Study | Control events/n | ACE-I events/n | OR | 95% CI |
|---|---|---|---|---|
| CONSENSUS (1987) | 44/126 | 29/127 | 0.56 | 0.31–0.99 |
| SOLVD-T (1991) | 452/1284 | 386/1285 | 0.81 | 0.68–0.96 |
| ATLAS (1999) | 52/1596 | 45/1568 | 0.87 | 0.58–1.30 |
| V-HeFT II (1991) | 131/403 | 117/403 | 0.84 | 0.61–1.16 |
| MERIT-HF (1999) | 145/2001 | 128/1990 | 0.89 | 0.70–1.14 |
| CIBIS-II (1999) | 156/1320 | 119/1327 | 0.73 | 0.57–0.94 |
Heterogeneity:
- I² = 18% (low heterogeneity — fixed effects model acceptable)
- Cochran’s Q = 6.1, p = 0.30
Pooled estimate (fixed effects):
- Pooled OR = 0.81 (95% CI: 0.74–0.89), p < 0.001
Random effects (for comparison):
- Pooled OR = 0.81 (95% CI: 0.72–0.91), p < 0.001 (slightly wider CI reflecting residual heterogeneity)
Interpretation: ACE inhibitors are associated with a 19% reduction in the odds of cardiovascular mortality in heart failure patients (pooled OR 0.81, 95% CI 0.74–0.89, p < 0.001). Heterogeneity across studies was low (I² = 18%), supporting the consistency of this effect. This translates to an NNT of approximately 28 over the average trial duration to prevent one cardiovascular death.
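The fixed-effects pooling above is inverse-variance weighting of the per-study log odds ratios. A sketch using the event counts from the table (small differences from the published OR of 0.81 reflect rounding in the per-study ORs, so no exact match is claimed):

```python
# Sketch: fixed-effect (inverse-variance) pooling of log odds ratios
# from 2x2 event counts.
from math import log, exp, sqrt

# (control events, control n, treatment events, treatment n) per trial
trials = [(44, 126, 29, 127), (452, 1284, 386, 1285), (52, 1596, 45, 1568),
          (131, 403, 117, 403), (145, 2001, 128, 1990), (156, 1320, 119, 1327)]

num = den = 0.0
for ec, nc, et, nt in trials:
    a, b = et, nt - et            # treatment events / non-events
    c, d = ec, nc - ec            # control events / non-events
    log_or = log((a * d) / (b * c))
    var = 1 / a + 1 / b + 1 / c + 1 / d   # Woolf variance of log OR
    num += log_or / var           # weight = 1 / variance
    den += 1 / var

pooled_or = exp(num / den)
se = sqrt(1 / den)
ci = (exp(num / den - 1.96 * se), exp(num / den + 1.96 * se))
# pooled_or close to the 0.81 reported above
```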
17.4 Assessing Heterogeneity
Cochran’s Q test: Tests the null hypothesis that all studies estimate the same true effect. Underpowered with few studies; significant Q indicates heterogeneity.
I² statistic: Proportion of total variation attributable to between-study differences (not sampling error).
- I² = 0–25%: Low/negligible
- I² = 26–50%: Moderate
- I² = 51–75%: Substantial
- I² > 75%: Considerable
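I² follows directly from Q via the Higgins & Thompson formula, I² = max(0, (Q − df)/Q) with df = number of studies − 1. The worked example's numbers check out:

```python
# Sketch: I² from Cochran's Q (Higgins & Thompson).
def i_squared(q, k_studies):
    df = k_studies - 1
    return max(0.0, (q - df) / q)

i2 = i_squared(6.1, 6)   # worked meta-analysis above: Q = 6.1, 6 trials
# → 0.18, i.e. I² = 18%, matching the reported value
```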
Sources of heterogeneity — investigate with:
- Subgroup analysis: Does the effect differ by study population, intervention intensity, follow-up duration?
- Meta-regression: Regress the effect size on study-level moderators (e.g., mean age, % female, baseline risk)
17.5 Publication Bias
The problem: Studies showing significant results are more likely to be published than those showing null results. This means a meta-analysis based on published literature may overestimate the true effect.
Detection:
- Funnel plot: Plot each study’s effect size against a measure of its precision (conventionally the standard error, plotted on an inverted axis). Under no bias, the plot is symmetric — smaller studies scatter more widely around the pooled estimate. Asymmetry suggests bias.
- Egger’s test: Formal regression test for funnel plot asymmetry. p < 0.05 suggests asymmetry (possible publication bias).
- Trim-and-fill method: Imputes missing studies to restore funnel symmetry and re-estimates the pooled effect. Shows how sensitive the main result is to potential publication bias.
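The regression behind Egger's test is ordinary least squares of each study's standardised effect (effect/SE) on its precision (1/SE); the intercept estimates small-study asymmetry. The sketch below uses constructed, not real, data: each effect is inflated by bias_per_se × SE, a deliberate small-study bias that the intercept then recovers exactly (the inferential p-value, which needs a t-distribution, is omitted):

```python
# Sketch: the Egger regression intercept, pure Python OLS.
def egger_intercept(effects, ses):
    y = [e / s for e, s in zip(effects, ses)]   # standardised effects
    x = [1 / s for s in ses]                    # precisions
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx                      # least-squares intercept

true_effect, bias_per_se = 0.5, 1.2
ses = [0.1, 0.2, 0.3, 0.4, 0.5]
effects = [true_effect + bias_per_se * s for s in ses]  # biased small studies
intercept = egger_intercept(effects, ses)
# intercept recovers the injected bias (1.2); it would be ~0 without it
```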
18. Reporting Standards and Checklists
18.1 General Reporting Principles
- Always report the test used, the test statistic, degrees of freedom, and exact p-value. Not just “p < 0.05” or “NS” — write “t(48) = 3.14, p = 0.003” or “χ²(2) = 8.74, p = 0.013.”
- Report effect sizes with confidence intervals for all primary outcomes. P-values alone are insufficient.
- Report sample sizes at every step. If 200 enrolled, 180 analysed — state what happened to the other 20 and conduct a sensitivity analysis if possible.
- For non-parametric tests, report median (IQR), not mean (SD).
- Report model fit statistics for regression models — R²/adjusted R² for linear regression; Hosmer-Lemeshow goodness of fit, AUC/C-statistic for logistic regression; overall model χ² and −2 log-likelihood.
- Check and report assumption testing — normality (Shapiro-Wilk), homogeneity of variance (Levene’s), sphericity (Mauchly’s), proportional hazards assumption (Cox).
- Distinguish pre-specified from exploratory analyses. Post-hoc subgroup analyses should be clearly labelled as exploratory and interpreted with caution.
18.2 Reporting Checklists
| Study type | Checklist |
|---|---|
| RCT | CONSORT (www.consort-statement.org) |
| Observational cohort or case-control | STROBE (www.strobe-statement.org) |
| Diagnostic accuracy study | STARD (www.equator-network.org/reporting-guidelines/stard) |
| Systematic review/meta-analysis | PRISMA (www.prisma-statement.org) |
| Prognostic model development | TRIPOD (www.tripod-statement.org) |
| Survival analysis | REMARK (for tumour marker studies) |
18.3 Specimen Results Sections
Randomised Trial (t-test result):
“The primary outcome, change in HbA1c from baseline to 6 months, was significantly greater in the intervention group compared to control (−0.82% vs −0.31%; mean difference −0.51%, 95% CI −0.78 to −0.24%; independent samples t-test: t(178) = −3.74, p < 0.001).”
Survival analysis result:
“Median progression-free survival was 11.2 months (95% CI 8.6–13.8) in the experimental arm and 7.4 months (95% CI 5.9–8.9) in the control arm. The experimental treatment was associated with a 38% reduction in the hazard of progression or death (HR 0.62, 95% CI 0.48–0.80; log-rank p < 0.001).”
Logistic regression result:
“On multivariable logistic regression analysis, prior hospitalisation in the previous year (OR 2.84, 95% CI 1.63–4.95, p < 0.001) and home oxygen use (OR 1.93, 95% CI 1.09–3.42, p = 0.025) were independently associated with 30-day readmission after adjustment for age, FEV₁%, and eosinophil count. The model demonstrated acceptable discrimination (C-statistic 0.72) and good calibration (Hosmer-Lemeshow p = 0.64).”
Appendix: Quick Reference Tables
A1. Choosing the Right Test — Complete Reference
| Research question | Outcome type | Predictor type | Groups/samples | Test |
|---|---|---|---|---|
| Is mean different from reference? | Continuous | None | 1 group | One-sample t-test (parametric) / Wilcoxon (non-param) |
| Is proportion different from reference? | Binary | None | 1 group | One-proportion z-test |
| Does categorical distribution match expected? | Categorical | None | 1 group | Chi-square goodness of fit |
| Are two independent group means different? | Continuous | Binary | 2 indep. | Student’s t / Welch’s t / Mann-Whitney U |
| Are two paired measurements different? | Continuous | Time (2 points) | 2 paired | Paired t-test / Wilcoxon signed-rank |
| Are two paired binary proportions different? | Binary | Time (2 points) | 2 paired | McNemar’s test |
| Are 3+ independent group means different? | Continuous | Categorical | 3+ indep. | One-way ANOVA / Welch’s ANOVA / Kruskal-Wallis |
| Are 3+ repeated measures different? | Continuous | Time (3+ points) | 3+ paired | Repeated measures ANOVA / Friedman |
| Are 3+ paired binary proportions different? | Binary | Time (3+ points) | 3+ paired | Cochran’s Q |
| Is there a categorical association? | Categorical | Categorical | Indep. | Chi-square / Fisher’s exact |
| Is there a linear association? | Continuous | Continuous | — | Pearson r / Spearman ρ |
| Predict continuous outcome from 1+ predictors | Continuous | Mixed | — | Linear regression |
| Predict binary outcome from 1+ predictors | Binary | Mixed | — | Logistic regression |
| Predict time-to-event from 1+ predictors | Time-to-event | Mixed | — | Cox proportional hazards |
| Predict count outcome from 1+ predictors | Count | Mixed | — | Poisson regression |
| Compare survival curves between groups | Time-to-event | Categorical | 2+ indep. | Kaplan-Meier + log-rank |
| Multiple continuous outcomes simultaneously | Continuous | Categorical | 2+ groups | MANOVA |
| Reduce many correlated variables | Continuous | None | — | PCA / Factor analysis |
| Repeated measures with missing data | Continuous | Mixed | 3+ time pts | Linear mixed effects model |
| Agreement between two raters (categorical) | Categorical | — | 2 raters | Cohen’s kappa |
| Agreement between two continuous methods | Continuous | — | 2 methods | Bland-Altman analysis |
| Diagnostic test evaluation | Binary | Continuous/ordinal | — | ROC analysis, sensitivity/specificity |
A2. Non-Parametric Equivalents
| Parametric test | Non-parametric equivalent | Use when |
|---|---|---|
| One-sample t-test | One-sample Wilcoxon | Non-normal data, n < 30 |
| Independent t-test | Mann-Whitney U | Non-normal, ordinal, n < 30 |
| Paired t-test | Wilcoxon signed-rank | Non-normal differences, ordinal |
| One-way ANOVA | Kruskal-Wallis | Non-normal groups, ordinal outcome |
| Repeated measures ANOVA | Friedman test | Non-normal, ordinal, repeated |
| Pearson correlation | Spearman correlation | Non-normal, ordinal, outliers |
| MANOVA | — (robust MANOVA) | Non-normal multivariate data |
A3. Effect Size Reference
| Measure | Formula | Interpretation |
|---|---|---|
| Cohen’s d | (μ₁−μ₂)/pooled SD | 0.2=small, 0.5=medium, 0.8=large |
| Odds ratio | ad/bc | 1=no effect; >1 increased odds; <1 decreased odds |
| Relative risk | [a/(a+b)] / [c/(c+d)] | 1=no effect; >1 increased risk |
| NNT | 1/ARR | Lower = more effective |
| Hazard ratio | e^β (Cox) | 1=no effect; same interpretation as RR |
| r (Pearson) | — | 0.1=small, 0.3=medium, 0.5=large |
| R² | SS_model/SS_total | % variance explained |
| η² (eta-squared) | SS_between/SS_total | 0.01=small, 0.06=medium, 0.14=large |
| AUC/C-statistic | Area under ROC | 0.5=chance; 0.7–0.8=acceptable; >0.8=excellent |
| Kappa (κ) | (Po−Pe)/(1−Pe) | 0.4–0.6=moderate; 0.6–0.8=substantial; >0.8=almost perfect |
A4. P-Value Thresholds in Context
| Scenario | Recommended α | Rationale |
|---|---|---|
| Primary outcome, single test | 0.05 | Standard |
| Secondary outcomes (multiple) | 0.05/k (Bonferroni) | Multiple comparisons |
| Post-hoc pairwise comparisons | Tukey HSD or Bonferroni | Familywise error control |
| Exploratory analysis | 0.05, clearly labelled | Hypothesis-generating only |
| Genome-wide association study | 5×10⁻⁸ | Millions of comparisons |
| Equivalence / non-inferiority | 0.025 (one-sided) | Specific trial design |
A5. Sample Size Formulae
| Design | Formula | Notes |
|---|---|---|
| Two independent means | n = 2(z_α/2 + z_β)²σ²/δ² | σ = SD, δ = minimum detectable difference |
| Two proportions | n = (z_α/2 + z_β)² [p₁(1-p₁)+p₂(1-p₂)] / (p₁-p₂)² | p₁, p₂ = expected proportions |
| Paired design | n = (z_α/2 + z_β)²σ_d²/δ² | σ_d = SD of differences |
| Estimating a single proportion (precision-based) | n = z_α/2² p₀(1-p₀) / E² | p₀ = expected proportion; E = acceptable margin of error |
z values: z_0.025 = 1.96 (α=0.05 two-tailed), z_0.2 = 0.84 (power=80%), z_0.1 = 1.28 (power=90%)
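As a worked check of the two-independent-means formula with those z-values: to detect a 5-unit difference when SD = 10, at α = 0.05 (two-sided) and 80% power (values here are a hypothetical illustration):

```python
# Sketch: sample size per group for comparing two independent means.
from math import ceil

def n_per_group_two_means(sd, delta, z_alpha=1.96, z_beta=0.84):
    """n = 2 (z_α/2 + z_β)² σ² / δ², rounded up to the next whole patient."""
    return ceil(2 * (z_alpha + z_beta) ** 2 * sd ** 2 / delta ** 2)

n = n_per_group_two_means(sd=10, delta=5)
# 2 × (2.80)² × 100 / 25 = 62.7 → 63 per group
```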
Key References and Further Reading
- Altman DG. Practical Statistics for Medical Research. Chapman & Hall, 1991.
- Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986;327:307-310.
- Cox DR. Regression models and life tables. J Royal Stat Soc B. 1972;34:187-220.
- DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves. Biometrics. 1988;44:837-845.
- Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. J Am Stat Assoc. 1958;53:457-481.
- Rothman KJ. No adjustments are needed for multiple comparisons. Epidemiology. 1990;1:43-46.
- Steyerberg EW. Clinical Prediction Models. Springer, 2009.
- Zhang J, Yu KF. What’s the relative risk? A method of correcting the odds ratio in cohort studies of common outcomes. JAMA. 1998;280:1690-1691.
- Vittinghoff E, et al. Regression Methods in Biostatistics. Springer, 2012.
- Harrell FE. Regression Modeling Strategies. Springer, 2015.
This guide is intended as a methodological reference for applied clinical research. Statistical analysis should always be conducted in consultation with a qualified statistician for complex or novel study designs. Software implementations: R (free, recommended), Stata, SPSS, SAS.
Version 1.0 | Prepared for clinical researchers | Field: Medical / clinical research