A Complete Guide to Statistical Tests and Methods for Clinical Researchers
Audience: Researchers and clinicians applying statistical methods to medical and health data
Purpose: A thorough reference covering test selection, assumptions, worked examples, interpretation, and reporting — from foundational hypothesis tests through advanced methods including survival analysis, multivariable modelling, and meta-analysis
How to use this guide: Each section follows a consistent structure: What it is → When to use it → Assumptions → Step-by-step workflow → Worked example → Interpretation → Reporting → Common mistakes
Table of Contents
- Foundations: The Statistical Reasoning Framework
- Choosing the Right Test
- Descriptive Statistics and Data Exploration
- One-Variable Tests
- Comparing Two Groups
- Comparing Three or More Groups
- Correlation and Association
- Regression Analysis
- Effect Sizes and Association Measures
- Survival and Time-to-Event Analysis
- Multivariable Modelling Strategy
- Multivariate Methods
- Mixed Models and Longitudinal Data
- Diagnostic Test Evaluation
- Agreement and Reliability
- Bayesian Methods
- Meta-Analysis and Systematic Review
- Reporting Standards and Checklists
- Appendix: Quick Reference Tables
1. Foundations: The Statistical Reasoning Framework
1.1 What Is a Statistical Test?
A statistical test is a formal procedure for deciding whether observed data are consistent with a stated hypothesis. The process has four components:
- Null hypothesis (H₀): The assumption of no effect, no difference, or no association
- Alternative hypothesis (H₁): The effect or difference you are trying to detect
- Test statistic: A number calculated from your data that summarises the evidence against H₀
- P-value: The probability of observing a test statistic at least as extreme as yours, if H₀ were true
What a p-value is NOT: A p-value is not the probability that H₀ is true. It is not the probability that your result is due to chance. These are the two most common misinterpretations in the medical literature.
1.2 Type I and Type II Errors
| | H₀ is actually TRUE | H₀ is actually FALSE |
|---|---|---|
| Test says: reject H₀ | Type I error (false positive) — probability = α | Correct (true positive) — probability = Power (1−β) |
| Test says: fail to reject H₀ | Correct (true negative) — probability = 1−α | Type II error (false negative) — probability = β |
- α (significance level): Conventionally set at 0.05. If α = 0.05, you accept a 5% chance of a false positive when the null hypothesis is true.
- β (Type II error rate): Conventionally ≤0.20, meaning power ≥ 80%.
- Power: The probability of correctly detecting a true effect. Affected by sample size, effect size, and α.
Clinical implication: In a drug trial, a Type I error means declaring an ineffective drug effective (false positive). A Type II error means missing a truly effective drug (false negative). Both have real patient consequences.
1.3 One-Tailed vs Two-Tailed Tests
- Two-tailed: Tests for a difference in either direction (H₁: μ₁ ≠ μ₂). Default for most clinical research.
- One-tailed: Tests for a difference in a specific direction (H₁: μ₁ > μ₂). Use only when you have strong prior justification and would not act on a result in the other direction. One-tailed tests are often viewed with suspicion by reviewers if not pre-specified.
1.4 Confidence Intervals vs P-Values
Confidence intervals (CIs) convey more information than p-values alone:
- A 95% CI represents the range of values consistent with your data at the 5% significance level
- If the 95% CI for a difference excludes zero (or for a ratio excludes 1.0), the result is statistically significant at α = 0.05
- CIs communicate both statistical significance AND the magnitude and precision of the estimate
- Report both — modern journals increasingly require CIs alongside p-values
Example: A new antihypertensive reduces SBP by 8 mmHg (95% CI: 6 to 10 mmHg, p < 0.001). The CI tells you the reduction is clinically meaningful and precisely estimated. Compare this to: 8 mmHg (95% CI: 0.1 to 16 mmHg, p = 0.048) — statistically significant but very imprecise.
1.5 Sample Size and Power Calculations
Always perform a power calculation before collecting data. The four inputs are:
- α — significance level (typically 0.05)
- Power (1−β) — typically 0.80 or 0.90
- Effect size — the minimum clinically important difference (MCID) you want to detect
- Variability — standard deviation (from pilot data or literature)
These four quantities are mathematically linked — specify three to solve for the fourth. Most commonly, you solve for n (sample size).
Example: You want to detect a 10 mmHg difference in SBP between two drug groups. From previous studies, SD ≈ 20 mmHg. With α = 0.05 (two-tailed) and power = 80%:
n per group = 2 × (z_α/2 + z_β)² × σ² / δ²
= 2 × (1.96 + 0.84)² × 400 / 100
= 2 × 7.84 × 4
= 63 per group
You need approximately 63 patients per arm, so ~126 total. Always add 10–20% for expected dropout.
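The arithmetic above can be reproduced in a few lines, assuming SciPy is available for the normal quantiles (a sketch to check hand calculations, not a substitute for dedicated power software):

```python
import math
from scipy.stats import norm

# Inputs from the worked example
alpha = 0.05   # two-tailed significance level
power = 0.80   # 1 - beta
sigma = 20.0   # SD of SBP (mmHg), from prior studies
delta = 10.0   # minimum clinically important difference (mmHg)

z_alpha = norm.ppf(1 - alpha / 2)  # ~1.96
z_beta = norm.ppf(power)           # ~0.84

# n per group = 2 * (z_alpha + z_beta)^2 * sigma^2 / delta^2
n_per_group = 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2
n_rounded = math.ceil(n_per_group)  # always round up
```

With these inputs `n_per_group` comes out just under 63, so 63 per arm before inflating for dropout.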
2. Choosing the Right Test: A Decision Framework
The Five Key Questions
Before selecting any statistical test, answer these questions in order:
Q1. What is your research question?
- Describe a population → Descriptive statistics
- Test a hypothesis about one group → One-sample tests
- Compare groups → Between-group tests
- Examine relationships → Correlation / regression
- Predict an outcome → Regression modelling
Q2. How many variables are involved?
- 1 variable → One-sample or descriptive
- 2 variables → Bivariate tests (correlation, two-group comparison)
- 2+ variables with one outcome → Multivariable regression
- 2+ outcomes simultaneously → Multivariate methods
Q3. What type is each variable?
- Continuous: Measured on a scale (BP, weight, age, biomarker levels)
- Ordinal: Ordered categories (pain scale 1–10, NYHA class I–IV)
- Nominal/categorical: Unordered categories (blood type, treatment group, sex)
- Binary: Special case of nominal with exactly two categories (alive/dead, yes/no)
- Time-to-event: Combined measure of whether and when an event occurred
Q4. Are the samples independent or paired/related?
- Independent: Different subjects in each group (RCT treatment arms, case-control study)
- Paired/related: Same subjects measured twice, or matched subjects (crossover trial, matched case-control)
Q5. Are parametric assumptions met?
- Parametric tests assume approximately normal distribution (or large enough n for CLT to apply), continuous data, and homogeneity of variance where applicable
- Non-parametric tests make fewer distributional assumptions — use for small samples (<30), skewed distributions, ordinal data, or data with outliers
Decision Table
| Outcome variable | Predictor/groups | Sample type | Test |
|---|---|---|---|
| Continuous | None (1 group vs known value) | — | One-sample t-test or Wilcoxon |
| Continuous | 2 groups | Independent | Student’s t or Welch’s t / Mann-Whitney U |
| Continuous | 2 groups | Paired | Paired t-test / Wilcoxon signed-rank |
| Continuous | 3+ groups | Independent | One-way ANOVA / Kruskal-Wallis |
| Continuous | 3+ groups | Repeated | Repeated-measures ANOVA / Friedman |
| Continuous | Continuous predictor(s) | — | Linear regression |
| Binary | 2+ groups | Independent | Chi-square / Fisher’s exact |
| Binary | 2 groups | Paired | McNemar’s test |
| Binary | Continuous/mixed predictors | — | Logistic regression |
| Time-to-event | 2+ groups | Independent | Kaplan-Meier + log-rank |
| Time-to-event | Continuous/mixed predictors | — | Cox regression |
| Count data | Groups | — | Poisson / negative binomial regression |
| Ordinal | 2+ groups | Independent | Mann-Whitney / Kruskal-Wallis |
| Multiple continuous outcomes | Groups | — | MANOVA |
3. Descriptive Statistics and Data Exploration
3.1 Measures of Central Tendency
Mean: Sum of all values divided by n. Best for normally distributed continuous data.
Median: The middle value when data are sorted. Preferred for skewed data or ordinal scales. Robust to outliers.
Mode: Most frequently occurring value. Rarely used in clinical research except for nominal data.
When to use which:
- Normally distributed continuous data → Mean (± SD)
- Skewed continuous data → Median (IQR)
- Ordinal scales (e.g. pain scores) → Median (IQR)
- Nominal data → Frequency and percentage
3.2 Measures of Spread
Standard deviation (SD): Average distance of data points from the mean. Use with mean for symmetric data.
Interquartile range (IQR): Difference between 75th and 25th percentiles. Use with median for skewed data.
Range: Min to max. Useful supplementary information but sensitive to outliers.
Standard error of the mean (SEM): SD / √n. Describes precision of the mean estimate, NOT the spread of individual values. Do not use SEM as a measure of variability in a study population — this is a common and misleading error in clinical publications.
3.3 Assessing Normality
Before choosing parametric vs non-parametric tests, assess distributional assumptions:
Visual methods (preferred):
- Histogram: Look for symmetric bell shape
- Q-Q plot (quantile-quantile plot): Points should fall along the diagonal line if data are normally distributed
- Box plot: Check for symmetry and outliers
Formal tests:
- Shapiro-Wilk test: Best for small samples (n < 50). H₀: data are normally distributed. A p-value > 0.05 is consistent with normality (note: does not prove normality).
- Kolmogorov-Smirnov test: Better for larger samples.
Practical rule: With n > 30, the central limit theorem (CLT) ensures that the sampling distribution of the mean is approximately normal even if individual data are skewed. Parametric tests are generally robust in this case. For n < 30 with visibly skewed data, use non-parametric alternatives.
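As a quick sketch of this workflow on simulated data (purely illustrative), the Shapiro-Wilk test flags a right-skewed sample while returning a much larger p-value for a symmetric one:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=120, scale=15, size=100)  # symmetric data
skewed_sample = rng.exponential(scale=2.0, size=100)     # right-skewed data

# Shapiro-Wilk: H0 = data are normally distributed
w_norm, p_norm = stats.shapiro(normal_sample)
w_skew, p_skew = stats.shapiro(skewed_sample)
# A small p (< 0.05) flags departure from normality; a large p is merely
# *consistent* with normality, not proof of it. Always plot the data too.
```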
3.4 Worked Example: Describing a Study Population
Scenario: A clinical trial of a new statin enrols 120 patients. At baseline, data are collected on age, sex, BMI, LDL-cholesterol, and NYHA heart failure class (I–IV).
Appropriate summary statistics:
| Variable | Type | Summary |
|---|---|---|
| Age (years) | Continuous, approximately normal | Mean ± SD: 62.4 ± 11.2 |
| Sex (% male) | Binary | 68 (56.7%) |
| BMI (kg/m²) | Continuous, slightly right-skewed | Median (IQR): 27.8 (24.6–31.9) |
| LDL-C (mmol/L) | Continuous, right-skewed | Median (IQR): 3.4 (2.8–4.1) |
| NYHA class | Ordinal | Class I: 22 (18.3%), Class II: 58 (48.3%), Class III: 32 (26.7%), Class IV: 8 (6.7%) |
Reporting note: In a Table 1 (baseline characteristics), use the format: n (%) for categorical variables; mean ± SD for normally distributed continuous variables; median (IQR) for skewed or ordinal variables.
4. One-Variable Tests
4.1 One-Sample Student’s t-Test
What it does: Tests whether the mean of a single sample differs significantly from a known or hypothesised population value (μ₀).
When to use:
- One continuous variable
- Data are approximately normally distributed (or n ≥ 30)
- You want to compare your sample mean to a reference value
Assumptions:
- Continuous data
- Approximate normality or n ≥ 30
- Observations are independent
Test statistic:
t = (x̄ − μ₀) / (s / √n)
where x̄ = sample mean, μ₀ = hypothesised mean, s = sample SD, n = sample size. Follows a t-distribution with n−1 degrees of freedom.
Worked Example:
Research question: A cardiology unit wants to know whether the mean INR of their anticoagulated patients differs from the therapeutic target of 2.5.
Data: n = 25 patients, mean INR = 2.8, SD = 0.6
t = (2.8 − 2.5) / (0.6 / √25) = 0.3 / 0.12 = 2.50
df = 24
p-value = 0.020 (two-tailed)
95% CI for difference: 0.05 to 0.55
Interpretation: The mean INR (2.8) is significantly above the target of 2.5 (t(24) = 2.50, p = 0.020). The 95% CI for the mean (2.55 to 3.05) excludes 2.5, confirming statistical significance. The unit may be over-anticoagulating their patients on average.
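The one-sample t-test can be computed from the summary statistics alone; a minimal sketch assuming SciPy:

```python
import math
from scipy import stats

# INR example figures
n, xbar, s, mu0 = 25, 2.8, 0.6, 2.5
se = s / math.sqrt(n)                  # standard error = 0.12
t = (xbar - mu0) / se                  # 2.50
p = 2 * stats.t.sf(abs(t), df=n - 1)   # two-tailed p-value
tcrit = stats.t.ppf(0.975, df=n - 1)   # critical t for a 95% CI
ci = (xbar - tcrit * se, xbar + tcrit * se)  # CI for the mean
```

(`scipy.stats.ttest_1samp` does the same from raw data.)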
4.2 One-Sample Wilcoxon Signed-Rank Test
What it does: Non-parametric equivalent of the one-sample t-test. Tests whether the median of a sample differs from a hypothesised value.
When to use:
- One continuous or ordinal variable
- Data are skewed, ordinal, or n < 30 with non-normal distribution
- You want to compare your sample median to a reference value
Worked Example:
Research question: A pain clinic wants to know whether their patients’ median pain score (NRS 0–10) differs from the population median of 5.
Data: n = 18 patients with chronic back pain, median NRS = 7 (IQR: 5–9). Shapiro-Wilk p = 0.003 — data are significantly non-normal.
Procedure: Calculate the difference between each patient’s score and 5. Rank the absolute differences. Sum the positive and negative ranks separately. Use the Wilcoxon W statistic.
Result: W = 142, p = 0.008
Interpretation: Patients’ median pain score (7) is significantly higher than the reference value of 5 (Wilcoxon W = 142, p = 0.008), indicating this population has worse pain than the general reference population.
4.3 One-Proportion Test (Z-test for proportion)
What it does: Tests whether an observed proportion differs from a known or hypothesised population proportion.
When to use:
- One binary/nominal variable
- You want to compare your proportion to a reference value
- np ≥ 5 and n(1−p) ≥ 5 (otherwise use exact binomial test)
Worked Example:
Research question: The national readmission rate following elective hip replacement is 4%. A tertiary centre reviews 250 of their own procedures and finds 15 readmissions. Is their rate significantly different?
H₀: p = 0.04 (their rate equals the national rate)
p̂ = 15/250 = 0.060
z = (p̂ − p₀) / √(p₀(1−p₀)/n)
= (0.060 − 0.040) / √(0.04 × 0.96 / 250)
= 0.020 / 0.01239
= 1.61
p-value = 0.107 (two-tailed)
95% CI for proportion: 0.033 to 0.097
Interpretation: The observed readmission rate (6.0%) is numerically higher than the national rate (4.0%), but this difference is not statistically significant (z = 1.61, p = 0.107). The 95% CI (3.3% to 9.7%) includes 4%, consistent with this conclusion. The study may be underpowered to detect a difference of this magnitude — a power calculation would be warranted.
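A sketch of the z-test computation (SciPy assumed; the CI quoted above comes from the binomial distribution and is not reproduced here):

```python
import math
from scipy.stats import norm

x, n, p0 = 15, 250, 0.04
p_hat = x / n                        # 0.060
se0 = math.sqrt(p0 * (1 - p0) / n)   # SE under H0 (uses p0, not p_hat)
z = (p_hat - p0) / se0               # ~1.61
p_value = 2 * norm.sf(abs(z))        # two-tailed ~0.107
```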
4.4 Chi-Square Goodness-of-Fit Test
What it does: Tests whether the observed distribution of a categorical variable matches an expected (theoretical) distribution.
When to use:
- One categorical variable with two or more categories
- You have hypothesised expected frequencies for each category
- Expected frequency in each cell ≥ 5
Worked Example:
Research question: ABO blood group distribution in the general UK population is approximately: A=42%, B=10%, AB=4%, O=44%. In a sample of 200 cardiac surgery patients, you observe: A=96 (48%), B=16 (8%), AB=6 (3%), O=82 (41%). Is the distribution of blood types in cardiac patients different from the general population?
Expected counts (E = n × p): A=84, B=20, AB=8, O=88
χ² = Σ (O−E)²/E
= (96−84)²/84 + (16−20)²/20 + (6−8)²/8 + (82−88)²/88
= 1.714 + 0.800 + 0.500 + 0.409
= 3.423
df = 4−1 = 3
p-value = 0.331
Interpretation: The blood type distribution among cardiac surgery patients does not differ significantly from the general population (χ²(3) = 3.42, p = 0.331).
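`scipy.stats.chisquare` performs this goodness-of-fit test directly from the observed counts and expected frequencies:

```python
from scipy.stats import chisquare

observed = [96, 16, 6, 82]                    # A, B, AB, O
population_props = [0.42, 0.10, 0.04, 0.44]   # reference distribution
n = sum(observed)
expected = [n * p for p in population_props]  # [84, 20, 8, 88]

chi2, p = chisquare(observed, f_exp=expected)
# df = (number of categories) - 1 = 3
```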
5. Comparing Two Groups
5.1 Independent Samples t-Test (Student’s t-Test)
What it does: Compares the means of two independent groups.
When to use:
- Continuous outcome variable
- Two independent groups (different subjects in each group)
- Approximately normally distributed data in both groups, or n ≥ 30 per group
- Equal population variances (if not, use Welch’s t-test)
Checking equal variances: Use Levene’s test. If p > 0.05, assume equal variances (Student’s). If p ≤ 0.05, assume unequal variances (Welch’s). In practice, Welch’s t-test is robust and increasingly recommended as the default.
Test statistic (equal variances):
t = (x̄₁ − x̄₂) / (sp × √(1/n₁ + 1/n₂))
where sp = pooled SD = √[((n₁−1)s₁² + (n₂−1)s₂²) / (n₁+n₂−2)]
df = n₁ + n₂ − 2
Worked Example:
Research question: A randomised controlled trial compares a new ACE inhibitor (Group A, n=45) to placebo (Group B, n=45) on 24-hour systolic blood pressure (SBP) reduction after 8 weeks.
| | Group A (ACE inhibitor) | Group B (Placebo) |
|---|---|---|
| n | 45 | 45 |
| Mean SBP reduction (mmHg) | 12.4 | 5.8 |
| SD | 8.2 | 7.6 |
Levene’s test: p = 0.62 → assume equal variances
sp = √[((44 × 8.2²) + (44 × 7.6²)) / 88]
= √[(2958.56 + 2541.44) / 88]
= √[62.50]
= 7.906
t = (12.4 − 5.8) / (7.906 × √(1/45 + 1/45))
= 6.6 / (7.906 × 0.2108)
= 6.6 / 1.667
= 3.96
df = 88
p-value < 0.001
95% CI for difference: 3.29 to 9.91 mmHg
Interpretation: The ACE inhibitor produced a significantly greater reduction in SBP compared to placebo (mean difference 6.6 mmHg, 95% CI 3.3 to 9.9 mmHg; t(88) = 3.96, p < 0.001). The CI is entirely above zero, confirming the ACE inhibitor is superior.
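`scipy.stats.ttest_ind_from_stats` reproduces the pooled-variance t-test directly from summary statistics, which is convenient when raw data are unavailable (e.g. when checking a published result):

```python
from scipy.stats import ttest_ind_from_stats

# Summary statistics from the RCT example
t, p = ttest_ind_from_stats(
    mean1=12.4, std1=8.2, nobs1=45,  # ACE inhibitor arm
    mean2=5.8, std2=7.6, nobs2=45,   # placebo arm
    equal_var=True,                  # Student's (pooled-variance) t-test
)
```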
Reporting template: “The ACE inhibitor group showed a significantly greater reduction in 24-hour SBP compared to placebo (12.4 ± 8.2 vs 5.8 ± 7.6 mmHg; mean difference 6.6 mmHg, 95% CI 3.3 to 9.9 mmHg; p < 0.001).”
5.2 Welch’s t-Test (Unequal Variances)
What it does: Like Student’s t-test but does not assume equal population variances. The degrees of freedom are adjusted (Welch-Satterthwaite correction), resulting in a non-integer df.
When to use: Whenever Levene’s test is significant (p ≤ 0.05), or as a default (Welch’s is generally safer and loses little power when variances are actually equal).
Worked Example:
Research question: Comparing CRP levels (mg/L) between patients with confirmed bacterial infection (n=30) and viral infection (n=28).
| | Bacterial | Viral |
|---|---|---|
| Mean CRP | 118.4 | 22.6 |
| SD | 94.2 | 18.7 |
Levene’s test: p = 0.001 → unequal variances → use Welch’s
t = (118.4 − 22.6) / √(94.2²/30 + 18.7²/28)
= 95.8 / √(295.79 + 12.49)
= 95.8 / √308.28
= 95.8 / 17.56
= 5.46
df (Welch-Satterthwaite) ≈ 31.4 (non-integer)
p < 0.001
95% CI: 60.7 to 130.9 mg/L
Interpretation: CRP was substantially and significantly higher in bacterial compared to viral infections (118.4 vs 22.6 mg/L; mean difference 95.8 mg/L, 95% CI 60.7 to 130.9; Welch’s t = 5.46, p < 0.001). The large standard deviations and Levene’s test result confirm the appropriateness of Welch’s t-test here.
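The same summary-statistics helper handles Welch’s test by setting `equal_var=False`:

```python
from scipy.stats import ttest_ind_from_stats

# CRP summary statistics: bacterial vs viral infection
t, p = ttest_ind_from_stats(
    mean1=118.4, std1=94.2, nobs1=30,
    mean2=22.6, std2=18.7, nobs2=28,
    equal_var=False,  # Welch's t-test: no equal-variance assumption
)
# SciPy applies the Welch-Satterthwaite df correction internally
```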
5.3 Mann-Whitney U Test
What it does: Non-parametric test comparing the distributions of two independent groups. Tests whether one group tends to have higher values than the other.
When to use:
- Continuous or ordinal outcome
- Two independent groups
- Data are skewed, ordinal, or n < 30 with non-normal distribution
- Particularly appropriate for outcomes like pain scores, quality of life measures, biomarkers with skewed distributions
What it actually tests: The Mann-Whitney U test does not strictly test equality of medians (a common misconception). It tests whether one group’s values tend to be larger than the other’s — formally, P(X > Y) = 0.5. The test is equivalent to asking: “If I randomly picked one observation from each group, is there an equal probability of either being larger?”
Worked Example:
Research question: A palliative care study compares quality of life scores (EORTC QLQ-C30 global scale, 0–100) between patients receiving standard care (n=22) and those receiving a new integrated support programme (n=24) at 3 months. The data are negatively skewed.
| | Standard care | Integrated programme |
|---|---|---|
| n | 22 | 24 |
| Median (IQR) | 58 (42–70) | 72 (62–82) |
| Shapiro-Wilk p | 0.031 | 0.028 |
Both groups fail the normality test → use Mann-Whitney U
Result: U = 161.5, p = 0.014
Interpretation: Quality of life scores were significantly higher in the integrated support programme group compared to standard care (median 72 vs 58; Mann-Whitney U = 161.5, p = 0.014).
Reporting template: “Global quality of life was significantly better in patients receiving the integrated support programme compared to standard care (median 72 [IQR 62–82] vs 58 [IQR 42–70]; Mann-Whitney U = 161.5, p = 0.014).”
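With raw data, `scipy.stats.mannwhitneyu` performs the test; the scores below are small hypothetical samples for illustration only, not the trial data above:

```python
from scipy.stats import mannwhitneyu

# Hypothetical QoL scores (0-100) for two small illustrative groups
standard = [40, 42, 50, 55, 58, 60, 65, 70]
integrated = [58, 62, 66, 70, 74, 78, 82, 85]

u, p = mannwhitneyu(standard, integrated, alternative="two-sided")
# u is the U statistic for the first sample; a small U relative to
# n1*n2/2 (= 32 here) means the first group tends to score lower
```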
5.4 Paired Samples t-Test
What it does: Compares means from the same subjects measured at two time points or under two conditions. Conceptually, it reduces to a one-sample t-test on the differences.
When to use:
- Same subjects measured twice (before/after design)
- Matched subjects in a 1:1 design
- Approximately normally distributed differences (not necessarily the raw values)
Key advantage over independent t-test: Removes between-subject variability, substantially increasing statistical power.
Test statistic:
t = d̄ / (sd / √n)
where d̄ = mean of (post − pre) differences
sd = SD of differences
df = n − 1
Worked Example:
Research question: A crossover trial tests whether 8 weeks of dietary sodium restriction reduces 24-hour urinary sodium excretion in 20 hypertensive patients. Each patient acts as their own control.
| Patient | Pre (mmol/24h) | Post (mmol/24h) | Difference (Post−Pre) |
|---|---|---|---|
| Mean | 168.4 | 124.6 | −43.8 |
| SD | — | — | 28.4 |
t = −43.8 / (28.4 / √20)
= −43.8 / 6.35
= −6.90
df = 19
p < 0.001
95% CI for mean difference: −57.1 to −30.5 mmol/24h
Interpretation: Sodium restriction significantly reduced 24-hour urinary sodium excretion (mean reduction 43.8 mmol/24h, 95% CI 30.5 to 57.1 mmol/24h; paired t(19) = −6.90, p < 0.001). The CI excludes zero, and the magnitude (43.8 mmol/24h) represents a clinically meaningful reduction.
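As noted above, the paired t-test reduces to a one-sample t-test on the differences; a sketch from the summary statistics (SciPy assumed, `scipy.stats.ttest_rel` does the same from raw paired data):

```python
import math
from scipy import stats

# Differences (post - pre) from the sodium-restriction crossover trial
n, d_bar, s_d = 20, -43.8, 28.4
se = s_d / math.sqrt(n)               # SE of the mean difference
t = d_bar / se                        # ~ -6.90
p = 2 * stats.t.sf(abs(t), df=n - 1)  # two-tailed
tcrit = stats.t.ppf(0.975, df=n - 1)
ci = (d_bar - tcrit * se, d_bar + tcrit * se)  # 95% CI for mean difference
```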
5.5 Wilcoxon Signed-Rank Test
What it does: Non-parametric equivalent of the paired t-test. Compares two related groups without assuming normality of differences.
When to use:
- Paired or repeated observations
- Differences are not normally distributed
- Ordinal data with paired design
Worked Example:
Research question: A physiotherapy intervention study measures pain scores (NRS 0–10) in 16 patients with knee osteoarthritis before and after 6 weeks of treatment. The differences are not normally distributed (Shapiro-Wilk p = 0.019).
| | Pre-treatment | Post-treatment |
|---|---|---|
| Median (IQR) | 7 (6–9) | 4 (3–6) |
Result: Wilcoxon Z = −3.29, p = 0.001
Interpretation: Pain scores were significantly reduced following physiotherapy (pre-treatment median 7 [IQR 6–9] vs post-treatment median 4 [IQR 3–6]; Wilcoxon signed-rank Z = −3.29, p = 0.001).
5.6 Chi-Square Test of Independence
What it does: Tests whether two categorical variables are associated (i.e., whether the distribution of one variable differs across levels of the other).
When to use:
- Both variables are categorical (nominal or ordinal)
- Independent observations
- Expected frequency in each cell ≥ 5 (if not, use Fisher’s exact test)
Test statistic:
χ² = Σ (O − E)² / E
where E = (row total × column total) / grand total
df = (rows − 1)(columns − 1)
Worked Example:
Research question: Does smoking status (smoker vs non-smoker) differ between patients who develop postoperative pneumonia and those who do not following elective colorectal surgery (n=180)?
| | Pneumonia | No pneumonia | Total |
|---|---|---|---|
| Smoker | 24 | 36 | 60 |
| Non-smoker | 16 | 104 | 120 |
| Total | 40 | 140 | 180 |
Expected counts:
- Smoker/Pneumonia: (60×40)/180 = 13.3
- Smoker/No pneumonia: (60×140)/180 = 46.7
- Non-smoker/Pneumonia: (120×40)/180 = 26.7
- Non-smoker/No pneumonia: (120×140)/180 = 93.3
All expected counts ≥ 5 → chi-square test appropriate
χ² = (24−13.33)²/13.33 + (36−46.67)²/46.67 + (16−26.67)²/26.67 + (104−93.33)²/93.33
= 8.53 + 2.44 + 4.27 + 1.22
= 16.46
(Keep expected counts at full precision when computing — rounding them to one decimal inflates the statistic.)
df = 1
p < 0.001
Interpretation: Smoking was significantly associated with postoperative pneumonia (χ²(1) = 16.46, p < 0.001). Smokers had a substantially higher rate of pneumonia (40.0%) compared to non-smokers (13.3%). The odds ratio is 4.33 (95% CI: 2.04–9.21), indicating smokers had over four times the odds of developing pneumonia.
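`scipy.stats.chi2_contingency` reproduces the test from the 2×2 counts, computing the expected counts at full precision (which gives χ² = 16.46; hand calculations with expected counts rounded to one decimal drift slightly). Note that SciPy applies Yates’ continuity correction to 2×2 tables unless told otherwise:

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[24, 36],    # smokers: pneumonia, no pneumonia
                  [16, 104]])  # non-smokers

# correction=False gives the uncorrected (Pearson) chi-square
chi2, p, dof, expected = chi2_contingency(table, correction=False)

# Sample odds ratio from the 2x2 counts
a, b = table[0]
c, d = table[1]
odds_ratio = (a * d) / (b * c)  # ~4.33
```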
5.7 Fisher’s Exact Test
What it does: Tests the association between two categorical variables when expected cell frequencies are small (less than 5). Calculates the exact probability of the observed (or more extreme) table configuration.
When to use:
- 2×2 contingency table with expected cell frequency < 5 in any cell
- Small sample sizes
- Sparse data (rare outcomes)
Worked Example:
Research question: A small case series examines whether an unusual fungal infection is associated with immunosuppressive therapy. Among 12 patients: 5 received immunosuppressants (4 with infection, 1 without), 7 did not (1 with infection, 6 without).
| | Infection | No infection | Total |
|---|---|---|---|
| Immunosuppressed | 4 | 1 | 5 |
| Not immunosuppressed | 1 | 6 | 7 |
| Total | 5 | 7 | 12 |
Smallest expected cell: (5×5)/12 = 2.08 < 5 → use Fisher’s exact test
Fisher’s exact p = 0.072 (two-tailed); the one-tailed p is 0.045
Interpretation: Despite the large apparent difference in infection rates (4/5 vs 1/7 patients), the association between immunosuppressive therapy and fungal infection did not reach two-tailed significance in this small series (Fisher’s exact p = 0.072). Only the one-tailed p (0.045) crosses 0.05, and a one-tailed test would require pre-specified directional justification. With only 12 patients, these findings should be considered hypothesis-generating at most.
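`scipy.stats.fisher_exact` computes both one- and two-sided p-values; for this table they straddle 0.05, so the (ideally pre-specified) choice of sidedness determines the verdict:

```python
from scipy.stats import fisher_exact

table = [[4, 1],  # immunosuppressed: infection, no infection
         [1, 6]]  # not immunosuppressed

odds_ratio, p_two_sided = fisher_exact(table, alternative="two-sided")
_, p_one_sided = fisher_exact(table, alternative="greater")
# odds_ratio is the sample odds ratio ad/bc = (4*6)/(1*1) = 24
```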
5.8 McNemar’s Test
What it does: Tests whether the proportion of a binary outcome differs between two paired groups (same subjects measured twice, or matched pairs).
When to use:
- Binary outcome (yes/no)
- Paired or matched design (before/after, matched case-control)
Worked Example:
Research question: Before and after a hand-hygiene education campaign, the same 80 clinical staff are observed for compliance (compliant = yes/no). Did compliance rates change?
| | Post: Compliant | Post: Non-compliant | Total |
|---|---|---|---|
| Pre: Compliant | 38 | 12 | 50 |
| Pre: Non-compliant | 22 | 8 | 30 |
| Total | 60 | 20 | 80 |
The key cells are the discordant pairs: b=12 (compliant pre, not post) and c=22 (not compliant pre, compliant post).
McNemar χ² = (b − c)² / (b + c)
= (12 − 22)² / (12 + 22)
= 100 / 34
= 2.94
p = 0.086
(With the continuity correction, χ² = (|b − c| − 1)² / (b + c) = 81/34 = 2.38, p = 0.123.)
Interpretation: There was a non-significant trend toward improved hand hygiene compliance following the education campaign (62.5% compliant pre-intervention vs 75.0% post-intervention; McNemar χ² = 2.94, p = 0.086). The campaign did not produce a statistically significant change in this sample.
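The statistic needs only the discordant counts and the χ² distribution, so it is easy to compute directly (statsmodels also provides a `mcnemar` function); a sketch showing both the uncorrected and continuity-corrected versions:

```python
from scipy.stats import chi2

b, c = 12, 22  # discordant pairs from the hand-hygiene table

# Uncorrected McNemar statistic
stat = (b - c) ** 2 / (b + c)              # 100/34 ~ 2.94
p = chi2.sf(stat, df=1)                    # ~0.086

# With the continuity correction (more conservative)
stat_cc = (abs(b - c) - 1) ** 2 / (b + c)  # 81/34 ~ 2.38
p_cc = chi2.sf(stat_cc, df=1)              # ~0.123
```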
6. Comparing Three or More Groups
6.1 One-Way ANOVA
What it does: Tests whether the means of three or more independent groups differ. The word “one-way” refers to one grouping factor. ANOVA tests the overall (“omnibus”) null hypothesis that ALL group means are equal — it does not tell you which groups differ.
When to use:
- Continuous outcome
- Three or more independent groups
- Approximately normal distribution within each group, or large samples
- Equal variances across groups (if not, use Welch’s ANOVA)
Assumptions:
- Normality within each group
- Homogeneity of variance (Levene’s test)
- Independence of observations
Logic: ANOVA partitions total variability into between-group variability (due to the treatment/grouping) and within-group variability (random noise). The F-statistic is the ratio of these two components.
F = (Between-group variance) / (Within-group variance)
= MSbetween / MSwithin
Where:
SSbetween = Σ nj(x̄j − x̄)² df = k−1
SSwithin = Σ Σ (xij − x̄j)² df = N−k
F ~ F-distribution with df1 = k−1, df2 = N−k
Post-hoc tests: If ANOVA is significant, follow up with pairwise comparisons. Common options:
- Tukey’s HSD: Controls familywise error rate; compares all possible pairs. Good all-purpose choice.
- Bonferroni correction: Divides α by number of comparisons. Conservative.
- Dunnett’s test: Compares each group only to a control group. Use in dose-response studies.
- Scheffé’s test: Most conservative; appropriate for complex contrasts planned after seeing the data.
Worked Example:
Research question: A multicentre RCT compares three doses of a novel anti-nausea drug (low dose, medium dose, high dose) versus placebo on vomiting episodes in 24 hours following chemotherapy (n=200, 50 per group).
| Group | n | Mean episodes | SD |
|---|---|---|---|
| Placebo | 50 | 6.8 | 2.4 |
| Low dose | 50 | 5.1 | 2.1 |
| Medium dose | 50 | 3.4 | 1.8 |
| High dose | 50 | 2.9 | 1.7 |
Grand mean (x̄) = (6.8+5.1+3.4+2.9)/4 = 4.55
SSbetween = 50×(6.8−4.55)² + 50×(5.1−4.55)² + 50×(3.4−4.55)² + 50×(2.9−4.55)²
= 50×(5.0625 + 0.3025 + 1.3225 + 2.7225)
= 50 × 9.41 = 470.5
MSbetween = 470.5 / 3 = 156.8
SSwithin = 49×2.4² + 49×2.1² + 49×1.8² + 49×1.7² = 49×(5.76+4.41+3.24+2.89)
= 49 × 16.30 = 798.7
MSwithin = 798.7 / 196 = 4.075
F = 156.8 / 4.075 = 38.5
p < 0.001
Post-hoc Tukey HSD: All pairwise comparisons are significant (p < 0.05) except Medium dose vs High dose (mean difference 0.5, p = 0.41).
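The F-statistic can be reproduced from the group summaries alone (with raw data, `scipy.stats.f_oneway` does this directly); a sketch mirroring the hand calculation:

```python
import numpy as np
from scipy.stats import f as f_dist

# Group summaries from the anti-nausea trial
ns = np.array([50, 50, 50, 50])
means = np.array([6.8, 5.1, 3.4, 2.9])
sds = np.array([2.4, 2.1, 1.8, 1.7])

k, N = len(ns), ns.sum()
grand_mean = (ns * means).sum() / N                  # 4.55
ss_between = (ns * (means - grand_mean) ** 2).sum()  # 470.5
ss_within = ((ns - 1) * sds ** 2).sum()              # ~798.7
F = (ss_between / (k - 1)) / (ss_within / (N - k))   # ~38.5
p = f_dist.sf(F, k - 1, N - k)
```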
Interpretation: There were significant differences in vomiting episodes across treatment groups (one-way ANOVA: F(3,196) = 38.5, p < 0.001). Post-hoc analysis (Tukey HSD) showed all active doses were superior to placebo (all p < 0.001), and medium dose was superior to low dose (p = 0.003). There was no significant difference between medium and high doses (p = 0.41), suggesting medium dose may provide the optimal therapeutic benefit with a lower adverse event profile.
6.2 Welch’s ANOVA
What it does: An F-test that does not assume equal population variances across groups. More robust than standard ANOVA when variances are heterogeneous.
When to use: When Levene’s test is significant (p < 0.05), indicating unequal variances across groups.
Post-hoc test: Use Games-Howell (does not assume equal variances) rather than Tukey HSD.
6.3 Kruskal-Wallis Test
What it does: Non-parametric alternative to one-way ANOVA. Tests whether three or more independent groups have the same distribution. Like Mann-Whitney U extended to k groups.
When to use:
- Continuous or ordinal outcome
- Three or more independent groups
- Data are skewed or non-normal within groups
- Ordinal outcome (e.g. pain scores, Likert scales)
Post-hoc testing: If Kruskal-Wallis is significant, use Dunn’s test with Bonferroni correction for pairwise comparisons.
Worked Example:
Research question: Three hospitals (A, B, C) are compared on patient-reported pain scores (NRS 0–10) at discharge following total knee replacement.
| Hospital | n | Median (IQR) |
|---|---|---|
| A | 35 | 4 (3–6) |
| B | 38 | 6 (4–8) |
| C | 33 | 5 (3–7) |
Data are ordinal and skewed → Kruskal-Wallis
Result: H(2) = 8.74, p = 0.013
Post-hoc (Dunn’s with Bonferroni):
- A vs B: p = 0.010
- A vs C: p = 0.320
- B vs C: p = 0.182
Interpretation: Discharge pain scores differed significantly across the three hospitals (Kruskal-Wallis H(2) = 8.74, p = 0.013). Post-hoc analysis showed Hospital A had significantly lower pain scores than Hospital B (Dunn’s test, p = 0.010) but not Hospital C (p = 0.320). No significant difference was found between Hospitals B and C (p = 0.182).
6.4 Repeated Measures ANOVA
What it does: Tests for differences in a continuous outcome measured at three or more time points in the same subjects.
When to use:
- Same subjects measured at 3+ time points
- Continuous outcome
- Approximately normally distributed data or adequate sample size
Assumption unique to repeated measures: Sphericity — the variances of the differences between all possible pairs of time points should be equal. Tested with Mauchly’s test. If violated, apply Greenhouse-Geisser or Huynh-Feldt epsilon correction to the degrees of freedom.
Worked Example:
Research question: Serum creatinine (μmol/L) is monitored in 30 patients with CKD at baseline, 3 months, 6 months, and 12 months of treatment.
| Time point | Mean creatinine | SD |
|---|---|---|
| Baseline | 142 | 38 |
| 3 months | 138 | 36 |
| 6 months | 131 | 34 |
| 12 months | 128 | 33 |
Mauchly’s test: p = 0.21 (sphericity not violated)
Result: F(3, 87) = 8.43, p < 0.001, η² = 0.225
Post-hoc (pairwise t-tests with Bonferroni):
- Baseline vs 3 months: p = 0.31 (ns)
- Baseline vs 6 months: p = 0.012
- Baseline vs 12 months: p < 0.001
- 3 months vs 12 months: p = 0.003
Interpretation: Serum creatinine decreased significantly over 12 months (repeated measures ANOVA: F(3,87) = 8.43, p < 0.001, η² = 0.23). Significant reductions from baseline were apparent at 6 months (−11 μmol/L, p = 0.012) and 12 months (−14 μmol/L, p < 0.001).
6.5 Friedman Test
What it does: Non-parametric equivalent of repeated measures ANOVA. Compares three or more related groups.
When to use:
- Same subjects measured at 3+ time points
- Data are skewed, ordinal, or assumptions of repeated measures ANOVA are violated
Worked Example:
Research question: Pain scores (NRS 0–10) are compared at 3 time points (baseline, week 4, week 8) in 20 patients with rheumatoid arthritis starting a new biologic therapy. Data are ordinal and skewed.
| Time | Median (IQR) |
|---|---|
| Baseline | 7 (6–9) |
| Week 4 | 5 (3–7) |
| Week 8 | 3 (2–5) |
Friedman χ²(2) = 28.4, p < 0.001
Post-hoc (Wilcoxon with Bonferroni, α adjusted to 0.017):
- Baseline vs Week 4: p = 0.001
- Baseline vs Week 8: p < 0.001
- Week 4 vs Week 8: p = 0.008
Interpretation: Pain scores decreased significantly over the 8-week treatment period (Friedman χ²(2) = 28.4, p < 0.001). All pairwise comparisons showed significant improvement (all p ≤ 0.008).
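The rank-sum mechanics behind the Friedman statistic can be sketched in a few lines (an illustrative helper, not the software used for the worked example; `friedman_chi2` is a hypothetical name, and this sketch omits the tie correction some packages apply):

```python
def friedman_chi2(data):
    """Friedman chi-square. data: list of per-subject tuples, one value per condition."""
    n, k = len(data), len(data[0])
    rank_sums = [0.0] * k
    for row in data:
        # rank each subject's values across conditions; ties share the mean rank
        order = sorted(range(k), key=lambda j: row[j])
        ranks = [0.0] * k
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1          # mean of tied positions, 1-based
            for m in range(i, j + 1):
                ranks[order[m]] = avg
            i = j + 1
        for j in range(k):
            rank_sums[j] += ranks[j]
    # chi-square statistic: 12/(n k (k+1)) * sum of squared rank sums - 3n(k+1)
    return 12 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)
```

With perfectly consistent ordering across subjects, the statistic reaches its maximum of n(k − 1), which is a useful sanity check.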
7. Correlation and Association
7.1 Pearson Correlation
What it does: Quantifies the strength and direction of the linear relationship between two continuous variables. Output is the correlation coefficient r, ranging from −1 (perfect negative linear relationship) to +1 (perfect positive linear relationship).
When to use:
- Both variables continuous
- Approximately bivariate normal distribution
- You are interested in linear association
Interpreting r:
| r (absolute value) | Interpretation |
|---|---|
| 0.00–0.19 | Negligible/very weak |
| 0.20–0.39 | Weak |
| 0.40–0.59 | Moderate |
| 0.60–0.79 | Strong |
| 0.80–1.00 | Very strong |
Important caveat: Correlation ≠ causation. Always plot the data first (scatterplot) — r can miss non-linear relationships, and can be distorted by outliers.
Worked Example:
Research question: Is there a linear association between age (years) and eGFR (mL/min/1.73m²) in a cohort of 150 adults attending a nephrology clinic?
Result: r = −0.58 (95% CI: −0.68 to −0.46), p < 0.001
Interpretation: There is a moderate to strong negative linear relationship between age and eGFR (r = −0.58, 95% CI −0.68 to −0.46, p < 0.001), indicating that kidney function declines with increasing age in this cohort. Age accounts for approximately 34% of the variance in eGFR (r² = 0.336).
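The confidence interval reported above can be reproduced with the Fisher z-transformation (a sketch; `pearson_ci` is an illustrative helper, and the 1.96 normal quantile for a 95% interval is assumed):

```python
from math import sqrt, atanh, tanh

def pearson_ci(r, n, z_crit=1.96):
    """95% CI for Pearson r via the Fisher z-transformation."""
    z = atanh(r)               # transform r to an approximately normal scale
    se = 1 / sqrt(n - 3)       # standard error on the z scale
    return tanh(z - z_crit * se), tanh(z + z_crit * se)

# values from the worked example: r = -0.58, n = 150
lo, hi = pearson_ci(-0.58, 150)
```

Running this gives approximately (−0.68, −0.46), matching the interval in the worked example.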
7.2 Spearman’s Rank Correlation
What it does: Non-parametric measure of monotonic (not necessarily linear) association between two variables. Calculates the Pearson correlation on the ranks of the data.
When to use:
- Ordinal data (e.g., disease severity grade, Likert scale responses)
- Continuous data that are skewed or contain outliers
- Non-linear but monotonic relationships
Worked Example:
Research question: Is NYHA heart failure class (I–IV, ordinal) associated with 6-minute walk distance (metres) in 80 outpatients?
Result: ρ (rho) = −0.71 (95% CI: −0.79 to −0.60), p < 0.001
Interpretation: There is a strong negative monotonic association between NYHA class and 6-minute walk distance (Spearman ρ = −0.71, p < 0.001): higher NYHA class (worse symptoms) is associated with shorter walk distance.
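Because Spearman’s rho is simply the Pearson correlation of the ranks, it can be sketched directly (an illustrative helper assuming no tied values; real software applies a tie correction):

```python
def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on ranks (no ties assumed)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx = my = (n + 1) / 2                      # mean rank
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = sum((a - mx) ** 2 for a in rx)       # equals the y-rank term when there are no ties
    return num / den
```

A perfectly monotonic decreasing sequence gives rho = −1, regardless of how non-linear the raw values are.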
7.3 Common Pitfalls in Correlation Analysis
1. Correlation without scatterplot: Always plot the data. An r of 0.50 could reflect a clean linear trend, a curved relationship, or a trend driven entirely by a few outliers — you cannot tell from the statistic alone.
2. Ecological fallacy: Correlation at the group level (e.g., countries) does not imply correlation at the individual level.
3. Confounding: A correlation between A and B might be explained by a third variable C that is related to both.
4. Restricted range: Correlations are attenuated when you study a narrow range of one variable (e.g., only severely ill patients). True associations may be understated.
5. Multiple testing: If you test 20 correlations, you expect 1 to be significant by chance at α = 0.05.
8. Regression Analysis
8.1 Simple Linear Regression
What it does: Models the linear relationship between one continuous predictor (X) and one continuous outcome (Y). Extends correlation by fitting a line and quantifying the predicted change in Y per unit change in X.
The model:
Y = β₀ + β₁X + ε
β₀ = intercept (value of Y when X = 0)
β₁ = slope (change in Y for each 1-unit increase in X)
ε = residual error
Key outputs:
- β₁ (regression coefficient): The slope — how much Y changes per unit increase in X
- 95% CI for β₁
- R²: Proportion of variance in Y explained by X
- Residual diagnostics: Assess model assumptions
Assumptions:
- Linearity: the relationship is linear
- Independence of residuals
- Homoscedasticity: residuals have constant variance across X values
- Normality of residuals
- No influential outliers
Check assumptions with:
- Residuals vs fitted plot (linearity, homoscedasticity)
- Q-Q plot of residuals (normality)
- Cook’s distance (influential observations)
Worked Example:
Research question: What is the relationship between BMI (kg/m², predictor) and systolic blood pressure (mmHg, outcome) in 200 middle-aged adults?
Result:
SBP = 98.4 + 1.72 × BMI
- β₁ = 1.72 (95% CI: 1.28 to 2.16), p < 0.001
- R² = 0.21 (21% of SBP variance explained by BMI)
Interpretation: For every 1 kg/m² increase in BMI, systolic blood pressure increases by an estimated 1.72 mmHg (95% CI 1.28 to 2.16 mmHg, p < 0.001). BMI explains 21% of the variability in SBP in this cohort.
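The least-squares slope, intercept, and R² follow directly from the definitions above (a minimal sketch; `simple_ols` is an illustrative helper, not a replacement for a full regression routine with standard errors and diagnostics):

```python
def simple_ols(x, y):
    """Least-squares fit of y = b0 + b1*x; returns (b0, b1, r_squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sxy / sxx                         # slope
    b0 = my - b1 * mx                      # intercept
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return b0, b1, 1 - ss_res / ss_tot     # R^2 = explained variance fraction
```

On perfectly linear data the fit recovers the generating line exactly and R² = 1.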
8.2 Multiple Linear Regression
What it does: Models the relationship between two or more predictors and a continuous outcome. Each coefficient represents the effect of that predictor adjusted for all other predictors in the model.
The model:
Y = β₀ + β₁X₁ + β₂X₂ + ... + βₚXₚ + ε
Worked Example:
Research question: What factors independently predict systolic blood pressure in 200 adults? Candidate predictors: BMI, age, sex (female = reference), smoking status (current smoker vs not).
| Predictor | Coefficient (β) | 95% CI | p-value |
|---|---|---|---|
| Intercept | 78.2 | — | — |
| BMI (per kg/m²) | 1.42 | 0.98 to 1.86 | <0.001 |
| Age (per year) | 0.68 | 0.44 to 0.92 | <0.001 |
| Male sex | 4.10 | 1.22 to 6.98 | 0.005 |
| Current smoker | 3.85 | 0.97 to 6.73 | 0.009 |
Adjusted R² = 0.34
Interpretation: After adjustment for other variables, each 1 kg/m² increase in BMI was associated with a 1.42 mmHg increase in SBP (95% CI 0.98–1.86, p < 0.001). Older age, male sex, and current smoking were also independently associated with higher SBP. Together, these four predictors explain 34% of the variance in SBP.
Important: The coefficient for BMI (1.42) differs from the unadjusted coefficient (1.72) because age, sex, and smoking are confounders — they are correlated with BMI and independently predict SBP.
8.3 Logistic Regression
What it does: Models the relationship between one or more predictors and a binary outcome (yes/no, event/no event). Output is the log-odds of the outcome, which is converted to an odds ratio (OR) for interpretation.
The model:
logit(p) = ln(p/(1−p)) = β₀ + β₁X₁ + β₂X₂ + ...
OR for predictor Xj = e^βj
Assumptions:
- Binary outcome
- Independence of observations
- No multicollinearity among predictors
- Large enough sample (at least 10 events per predictor variable — the “EPV rule”)
- Linearity of continuous predictors with the log-odds (check with Box-Tidwell test)
Worked Example:
Research question: What factors independently predict 30-day readmission (yes/no) following hospital admission for COPD exacerbation? Data from 400 admissions.
Outcome: 30-day readmission (n=84, 21%)
| Predictor | OR | 95% CI | p-value |
|---|---|---|---|
| Age (per 10 years) | 1.24 | 1.05 to 1.47 | 0.012 |
| FEV₁% predicted (per 10% increase) | 0.82 | 0.71 to 0.95 | 0.007 |
| Previous admission in past year (yes vs no) | 2.84 | 1.63 to 4.95 | <0.001 |
| Home oxygen use (yes vs no) | 1.93 | 1.09 to 3.42 | 0.025 |
| Eosinophil count (per 0.1×10⁹/L) | 0.87 | 0.76 to 0.99 | 0.038 |
Model fit: Hosmer-Lemeshow goodness-of-fit p = 0.64 (good fit); C-statistic (AUC) = 0.72
Interpretation: Previous admission in the past year was the strongest predictor of 30-day readmission (OR 2.84, 95% CI 1.63–4.95, p < 0.001): patients with prior admissions had nearly three times the odds of readmission compared to those without. Each 10% reduction in FEV₁% was associated with a 22% increase in the odds of readmission (OR 0.82 per 10% improvement, i.e. OR 1.22 per 10% deterioration). The model discriminates readmitted from non-readmitted patients with moderate ability (AUC 0.72).
Reporting the C-statistic / AUC: The AUC (area under the ROC curve) for a logistic model represents the probability that a randomly selected patient who was readmitted had a higher predicted probability than a randomly selected patient who was not. Values: 0.5 = no better than chance; 0.7–0.8 = acceptable; 0.8–0.9 = excellent; >0.9 = outstanding.
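That concordance definition of the AUC can be computed directly from predicted probabilities (a sketch; `c_statistic` is an illustrative helper using the naive all-pairs method, which is fine for small samples):

```python
def c_statistic(pred_probs, outcomes):
    """C-statistic (AUC): probability that a randomly chosen event has a
    higher predicted risk than a randomly chosen non-event (ties count 1/2)."""
    events = [p for p, y in zip(pred_probs, outcomes) if y == 1]
    non_events = [p for p, y in zip(pred_probs, outcomes) if y == 0]
    score = 0.0
    for pe in events:
        for pn in non_events:
            score += 1.0 if pe > pn else 0.5 if pe == pn else 0.0
    return score / (len(events) * len(non_events))
```

For example, predicted risks [0.9, 0.8, 0.3, 0.2] with outcomes [1, 0, 1, 0] give 3 concordant pairs out of 4, i.e. an AUC of 0.75.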
9. Effect Sizes and Association Measures
9.1 Why Effect Sizes Matter
A statistically significant result (small p-value) tells you that an effect probably exists in the population. It does not tell you whether the effect is clinically meaningful. Effect sizes answer the question: “How big is the effect?”
The hierarchy of information:
- P-value: Is there an effect? (binary: yes/no)
- Confidence interval: What is the plausible range of the effect?
- Effect size: How large is the effect, expressed in a standardised or clinically interpretable way?
9.2 Odds Ratio (OR)
Definition: The ratio of the odds of an outcome in the exposed group to the odds in the unexposed group.
Odds of event in group A = P(event in A) / P(no event in A)
Odds of event in group B = P(event in B) / P(no event in B)
OR = [Odds in A] / [Odds in B]
2×2 contingency table notation:
| | Outcome: Yes | Outcome: No |
|---|---|---|
| Exposed (E+) | a | b |
| Unexposed (E−) | c | d |
OR = (a/b) / (c/d) = ad / bc
95% CI: exp(ln(OR) ± 1.96 × √(1/a + 1/b + 1/c + 1/d))
Interpreting OR:
- OR = 1.0: No association
- OR > 1.0: Exposure associated with increased odds of outcome
- OR < 1.0: Exposure associated with decreased odds of outcome
Natural home: Case-control studies and logistic regression models.
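The OR and its log-based (Woolf) confidence interval from a 2×2 table can be sketched as follows (`odds_ratio` is an illustrative helper; the counts used in the usage line are made up for demonstration):

```python
from math import exp, log, sqrt

def odds_ratio(a, b, c, d, z=1.96):
    """OR from a 2x2 table (a,b = exposed yes/no; c,d = unexposed yes/no)
    with the Woolf log-based 95% CI."""
    or_ = (a * d) / (b * c)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # SE of ln(OR)
    return or_, exp(log(or_) - z * se), exp(log(or_) + z * se)

# hypothetical counts: 10/90 events in the exposed, 5/95 in the unexposed
or_, lo, hi = odds_ratio(10, 90, 5, 95)
```

With these counts the OR is (10×95)/(90×5) ≈ 2.11, with the CI spanning it on the log scale.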
9.3 Relative Risk (RR) — also called Risk Ratio
Definition: The ratio of the probability (risk) of an outcome in the exposed group to the probability in the unexposed group.
RR = Risk in exposed / Risk in unexposed
= [a/(a+b)] / [c/(c+d)]
Interpreting RR:
- RR = 1.0: No association
- RR = 2.0: Exposed group has twice the risk
- RR = 0.5: Exposed group has half the risk (50% reduction)
Natural home: Cohort studies and RCTs.
9.4 Odds Ratio vs Relative Risk: When to Use Which
This is one of the most commonly confused distinctions in clinical research. Here is the complete framework:
Study design determines feasibility
| Study design | Can you calculate RR? | Can you calculate OR? |
|---|---|---|
| RCT | Yes (directly from data) | Yes (though RR is usually preferred; ORs arise naturally from logistic regression output) |
| Prospective cohort | Yes | Yes |
| Retrospective cohort | Yes | Yes |
| Case-control | No (sampling from outcome group distorts risk) | Yes — OR is the correct measure |
| Cross-sectional | Prevalence ratio (modified RR) | Yes |
Why can’t you calculate RR from a case-control study? Because you select participants based on the outcome (cases and controls), not based on exposure. The proportion of cases in your sample reflects your sampling ratio, not the true disease risk in the population. The OR is mathematically unaffected by this (it is the same whether you sample 1:1 or 1:4 cases to controls).
Outcome frequency matters
When an outcome is rare (<10%), the OR approximates the RR closely. This is the “rare disease assumption”:
When P(outcome) is small:
OR ≈ RR
When an outcome is common (≥10%), the OR will be further from 1.0 than the RR, and they diverge substantially:
| True RR | True risk in unexposed | Approximate OR |
|---|---|---|
| 2.0 | 5% | 2.1 |
| 2.0 | 20% | 2.7 |
| 2.0 | 40% | 6.0 |
Reporting an OR when the outcome is common and calling it a “risk ratio” substantially overstates the effect. This is a pervasive error in the medical literature.
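The divergence follows directly from the definitions: given an RR and the baseline risk in the unexposed, the implied OR can be computed (a sketch; `odds_ratio_from_rr` is an illustrative helper):

```python
def odds_ratio_from_rr(rr, p0):
    """OR implied by a given RR and baseline risk p0 in the unexposed group."""
    p1 = rr * p0                                # risk in the exposed
    return (p1 / (1 - p1)) / (p0 / (1 - p0))    # ratio of odds
```

At a 5% baseline risk an RR of 2.0 implies an OR of about 2.1; at 20% it implies about 2.7; at 40% the implied OR climbs to 6.0 — the approximation breaks down badly for common outcomes.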
Worked Example:
Scenario: A study of surgical site infection (SSI) after colorectal surgery. Diabetic patients: 40 SSIs in 100 patients (40%). Non-diabetic: 20 SSIs in 100 patients (20%).
RR = (40/100) / (20/100) = 0.40 / 0.20 = 2.0
OR = (40×80) / (60×20) = 3200 / 1200 = 2.67
The OR (2.67) is 33% higher than the RR (2.0). Reporting the OR as if it were a risk ratio would overstate the association. Because this is a cohort study with a common outcome (>10%), report the RR.
Logistic regression outputs ORs — when is this a problem?
Logistic regression models produce ORs, not RRs. When:
- The outcome is rare: OR ≈ RR, report the OR from logistic regression
- The outcome is common: Use alternatives to estimate RR:
- Modified Poisson regression (with robust standard errors) — preferred, produces RR directly
- Log-binomial regression — produces RR directly but can fail to converge
- OR-to-RR conversion formula (Zhang & Yu, 1998):
RR = OR / [(1 − P₀) + (P₀ × OR)]
where P₀ = baseline risk in unexposed group
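The Zhang & Yu conversion is a one-liner (`or_to_rr` is an illustrative helper name). Applying it to the SSI example above — OR 2.67 with a 20% baseline risk in the non-diabetic group — recovers the true RR of 2.0:

```python
def or_to_rr(or_, p0):
    """Zhang & Yu (1998) conversion: approximate RR from an (adjusted) OR
    and the baseline risk p0 in the unexposed group."""
    return or_ / ((1 - p0) + p0 * or_)

rr = or_to_rr(2.67, 0.20)   # SSI example: OR 2.67, baseline risk 20%
```

Note the conversion is exact only for crude 2×2 data; for adjusted ORs it is an approximation.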
Summary decision rule:
Is your study a case-control? → Report OR (only valid measure)
Is your outcome rare (<10%)? → OR ≈ RR, report OR from logistic regression
Is your outcome common (≥10%)?
In a cohort/RCT: → Calculate and report RR directly
From a logistic model: → Use modified Poisson regression for RR
Or report the OR with a clear caveat
9.5 Absolute Risk Reduction (ARR) and Number Needed to Treat (NNT)
ARR: The absolute difference in event rates between two groups.
ARR = Risk in control − Risk in treatment
= (c/(c+d)) − (a/(a+b))
NNT: How many patients need to be treated to prevent one additional outcome event.
NNT = 1 / ARR
- NNT < 10: Very effective treatment
- NNT 10–50: Moderately effective
- NNT > 100: Marginally effective (may still be worthwhile for serious outcomes)
Worked Example:
Scenario: In a trial of prophylactic low-molecular-weight heparin (LMWH) after major orthopaedic surgery: DVT rate = 8% in LMWH group (a/(a+b) = 0.08), 18% in placebo group (c/(c+d) = 0.18).
RR = 0.08 / 0.18 = 0.44 (56% reduction in relative risk)
ARR = 0.18 − 0.08 = 0.10 (10 percentage points)
NNT = 1 / 0.10 = 10
Interpretation: LMWH reduces the relative risk of DVT by 56% (RR 0.44). In absolute terms, for every 10 patients treated with LMWH, one additional DVT is prevented (NNT = 10). The NNT communicates clinical impact in a way the RR alone does not.
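The three steps of the LMWH example are plain arithmetic and can be verified in a few lines (a sketch using the event rates from the trial above):

```python
# event rates from the worked example
risk_control, risk_treatment = 0.18, 0.08

rr = risk_treatment / risk_control   # relative risk
arr = risk_control - risk_treatment  # absolute risk reduction
nnt = 1 / arr                        # number needed to treat
```

This reproduces RR ≈ 0.44, ARR = 0.10 and NNT = 10.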
9.6 Standardised Effect Sizes
When outcomes are measured on different scales and you want to compare effect sizes across studies, use standardised effect sizes:
Cohen’s d: For continuous outcomes (mean difference)
d = (μ₁ − μ₂) / pooled SD
Benchmarks: small d=0.2, medium d=0.5, large d=0.8 (Cohen, 1988 — treat as rough guides only)
Eta-squared (η²): Proportion of variance explained (for ANOVA)
η² = SSbetween / SStotal
Partial η² is preferred for factorial ANOVA.
Omega-squared (ω²): Less biased than η², preferred for meta-analyses.
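Cohen’s d from two groups’ summary statistics can be sketched as follows (`cohens_d` is an illustrative helper; the pooled SD here uses the standard (n−1)-weighted formula):

```python
from math import sqrt

def cohens_d(m1, s1, n1, m2, s2, n2):
    """Cohen's d from group means, SDs, and sizes, using the pooled SD."""
    pooled_sd = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd
```

With equal SDs of 2 and a mean difference of 2, d = 1.0 — a large effect by Cohen’s benchmarks.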
10. Survival and Time-to-Event Analysis
10.1 Why Standard Methods Fail for Survival Data
Consider a study following 100 patients after cancer surgery for 5 years, tracking whether they are alive or dead. Two problems arise that standard regression cannot handle:
Problem 1: Censoring. Some patients are still alive at study end. Some are lost to follow-up. Both are “censored” — they did not experience the event during observation, and we don’t know when (or whether) they would have. (Death from an unrelated cause is treated as censoring only in cause-specific analyses; in an overall-survival analysis, death from any cause is an event.) Excluding censored patients wastes information; treating them as non-events introduces bias.
Problem 2: Variable follow-up times. Patients enrolled at different times have different follow-up durations. A patient followed for 6 months contributes different information from one followed for 48 months.
Survival analysis incorporates both the occurrence of events and the time to event, while properly handling censored observations.
10.2 Core Concepts
Survival function S(t): The probability of surviving (i.e., not experiencing the event) beyond time t.
S(t) = P(T > t)
At t=0: S(0) = 1.0 (everyone is event-free at start)
Over time: S(t) decreases monotonically (or stays flat if no events)
Hazard function h(t): The instantaneous rate of the event at time t, given survival to time t. Sometimes called the “force of mortality.”
Censoring types:
- Right censoring (most common): The event has not occurred by the end of observation
- Left censoring: The event occurred before observation started
- Interval censoring: The event occurred in a known time interval
The critical assumption: Censoring must be non-informative — i.e., the reason for censoring must be unrelated to the probability of experiencing the event. If patients who drop out are more likely to die than those who stay in, estimates will be biased.
10.3 Kaplan-Meier Estimator
What it does: Non-parametrically estimates the survival function S(t) from observed data, accounting for censoring. Produces a step-function survival curve.
The calculation:
S(t) = Π [1 − dj/nj]
where the product is over all event times tj ≤ t
dj = number of events at time tj
nj = number at risk just before time tj
Worked Example:
Research question: Estimate overall survival in 10 patients with metastatic colorectal cancer following first-line chemotherapy.
| Patient | Follow-up (months) | Event (death=1, censored=0) |
|---|---|---|
| 1 | 3 | 1 |
| 2 | 5 | 1 |
| 3 | 6 | 0 (lost to follow-up) |
| 4 | 8 | 1 |
| 5 | 10 | 1 |
| 6 | 12 | 0 (still alive at study end) |
| 7 | 14 | 1 |
| 8 | 18 | 0 |
| 9 | 20 | 1 |
| 10 | 24 | 0 |
KM calculation:
| Time (months) | Events (d) | At risk (n) | S(t) = S(t-prev) × (1 − d/n) |
|---|---|---|---|
| 0 | — | 10 | 1.000 |
| 3 | 1 | 10 | 1.000 × (1 − 1/10) = 0.900 |
| 5 | 1 | 9 | 0.900 × (1 − 1/9) = 0.800 |
| 8 | 1 | 7* | 0.800 × (1 − 1/7) = 0.686 |
| 10 | 1 | 6 | 0.686 × (1 − 1/6) = 0.571 |
| 14 | 1 | 4** | 0.571 × (1 − 1/4) = 0.429 |
| 20 | 1 | 2*** | 0.429 × (1 − 1/2) = 0.214 |
*Patient 3 (censored at 6 months) removed from risk set before time 8
**Patient 6 (censored at 12 months) removed before time 14
***Patient 8 (censored at 18 months) and Patient 10 (censored at 24 months) reduce the risk set
Interpretation: The estimated probability of surviving beyond 20 months is 21.4%. Median survival (where S(t) first falls below 0.5) falls between 10 and 14 months. The KM curve should be presented with number-at-risk tables below the time axis.
Reporting standard: Always include: (1) the KM curve with confidence bands, (2) the number-at-risk table at key time points, (3) median survival with 95% CI for each group.
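The step-by-step table above can be reproduced with a short sketch of the product-limit formula (`kaplan_meier` is an illustrative helper assuming right censoring only):

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate: returns a list of (event time, S(t)) steps.
    events: 1 = event, 0 = censored."""
    s, curve = 1.0, []
    for t in sorted({ti for ti, e in zip(times, events) if e == 1}):
        n = sum(1 for ti in times if ti >= t)   # at risk just before time t
        d = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        s *= 1 - d / n                          # product-limit update
        curve.append((t, s))
    return curve

# the 10-patient cohort from the worked example
times  = [3, 5, 6, 8, 10, 12, 14, 18, 20, 24]
events = [1, 1, 0, 1, 1, 0, 1, 0, 1, 0]
```

Running `kaplan_meier(times, events)` reproduces the table: steps at months 3, 5, 8, 10, 14 and 20, falling from 0.900 to 0.214.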
10.4 Log-Rank Test
What it does: Tests whether the survival curves of two or more groups are identical. The non-parametric equivalent of the t-test for survival data. Uses a weighted sum of differences between observed and expected events at each event time.
When to use:
- Comparing survival curves of 2+ groups
- Non-parametric (makes no assumption about the shape of the survival curve)
- Assumes proportional hazards (the hazard ratio between groups is constant over time)
Worked Example:
Research question: Do patients with KRAS wild-type (WT) colorectal cancer have better overall survival than those with KRAS mutant (MT) cancer following anti-EGFR therapy?
| Group | n | Events | Median OS (months) | 95% CI |
|---|---|---|---|---|
| KRAS WT | 85 | 62 | 18.4 | 14.2–22.6 |
| KRAS MT | 79 | 71 | 9.8 | 7.6–12.0 |
Log-rank test: χ²(1) = 14.8, p < 0.001
Interpretation: Patients with KRAS wild-type tumours had significantly longer overall survival than those with KRAS mutations (median 18.4 vs 9.8 months; log-rank p < 0.001). This finding supports the predictive role of KRAS status for anti-EGFR therapy benefit.
10.5 Cox Proportional Hazards Regression
What it does: The most widely used model for time-to-event data with multiple predictors. Models the hazard (instantaneous risk) as a function of predictor variables. Output is the hazard ratio (HR) — the ratio of hazards between groups.
The model:
h(t|X) = h₀(t) × exp(β₁X₁ + β₂X₂ + ... + βₚXₚ)
Where h₀(t) is the baseline hazard function (unspecified — this is a “semi-parametric” model).
The hazard ratio:
HR for predictor Xj = e^βj
Interpreting HR:
- HR = 1.0: No association with time-to-event
- HR = 2.0: Exposed group has twice the instantaneous rate of the event at any given time
- HR = 0.5: Exposed group has half the hazard (50% risk reduction at any time)
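Converting a fitted Cox coefficient and its standard error into the reported HR and 95% CI is a direct application of HR = e^β (a sketch; `hazard_ratio` is an illustrative helper and the 1.96 normal quantile is assumed):

```python
from math import exp, log

def hazard_ratio(beta, se, z=1.96):
    """HR and 95% CI from a Cox regression coefficient and its standard error."""
    return exp(beta), exp(beta - z * se), exp(beta + z * se)

# e.g. a coefficient of ln(2) corresponds to a doubling of the hazard
hr, lo, hi = hazard_ratio(log(2.0), 0.2)
```

The CI is constructed on the log scale and exponentiated, which is why reported HR intervals are asymmetric around the point estimate.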
The proportional hazards (PH) assumption: The hazard ratio between two groups is constant over time. This is the key assumption of Cox regression. Check it with:
- Log-log plot: log(−log(S(t))) vs log(t) — lines should be parallel
- Schoenfeld residuals plot — no trend over time
- Grambsch-Therneau test (formal statistical test of PH assumption)
If PH is violated: use time-varying coefficients, stratified Cox model, or parametric models (Weibull, log-logistic).
Worked Example:
Research question: What factors predict time to dialysis initiation in a cohort of 280 CKD patients followed for up to 5 years? Predictors: age, sex, eGFR at baseline, proteinuria (g/24h), diabetes, hypertension.
Events: 98 patients started dialysis; 182 censored
| Predictor | HR | 95% CI | p-value |
|---|---|---|---|
| Age (per 10 years) | 1.18 | 0.98–1.42 | 0.082 |
| Male sex | 1.44 | 0.94–2.20 | 0.094 |
| eGFR at baseline (per 10 mL/min/1.73m² increase) | 0.51 | 0.42–0.62 | <0.001 |
| Proteinuria (per 1 g/24h increase) | 1.67 | 1.38–2.02 | <0.001 |
| Diabetes (yes vs no) | 2.03 | 1.32–3.12 | 0.001 |
| Hypertension (yes vs no) | 1.38 | 0.89–2.13 | 0.150 |
PH assumption checked: Schoenfeld residuals test p=0.38 (no violation)
Interpretation:
- Each 10 mL/min/1.73m² higher baseline eGFR was associated with a 49% lower hazard of dialysis initiation (HR 0.51, 95% CI 0.42–0.62, p < 0.001).
- Each 1 g/24h increase in proteinuria was associated with a 67% higher hazard of dialysis initiation (HR 1.67, 95% CI 1.38–2.02, p < 0.001).
- Patients with diabetes had twice the hazard of dialysis initiation compared to non-diabetic patients (HR 2.03, 95% CI 1.32–3.12, p = 0.001).
- After adjustment, age, sex, and hypertension were not independently associated with dialysis initiation.
Reporting template: “In multivariable Cox regression, proteinuria (HR 1.67 per 1 g/24h increase, 95% CI 1.38–2.02, p < 0.001) and diabetes (HR 2.03, 95% CI 1.32–3.12, p = 0.001) were independently associated with time to dialysis initiation after adjustment for baseline eGFR and other covariates.”
11. Multivariable Modelling Strategy
11.1 Univariate vs Multivariable Analysis: The Clinical Workflow
Almost all published clinical research involves both steps:
Step 1: Univariate (crude) analysis
- Each predictor is tested against the outcome individually, without adjustment
- Reports crude (unadjusted) ORs, HRs, or mean differences
- Purpose: describe raw associations, identify candidate variables for multivariable model
Step 2: Multivariable (adjusted) analysis
- Selected predictors are entered simultaneously into a regression model
- Reports adjusted ORs, HRs, or mean differences, with each predictor’s effect estimated after controlling for the others
- Purpose: identify independent predictors, control for confounding
The relationship between crude and adjusted estimates is clinically informative. A variable that is significant in univariate but not multivariable analysis was likely confounded. A variable that appears non-significant univariately but significant in multivariable analysis was previously masked by confounders (negative confounding).
11.2 What Is Confounding?
A confounder is a third variable that:
- Is associated with the exposure/predictor
- Is associated with the outcome
- Is NOT an intermediary on the causal pathway between exposure and outcome
Example: A study finds that coffee drinking is associated with lung cancer. But coffee drinkers are also more likely to smoke, and smoking causes lung cancer. Smoking is a confounder. After adjusting for smoking, the association between coffee and lung cancer disappears.
Controlling for confounders:
- Include them in the regression model (most common)
- Matching (case-control studies, propensity score matching)
- Restriction (study only non-smokers)
- Stratification (analyse smokers and non-smokers separately)
11.3 Selecting Variables for a Multivariable Model
The EPV (events per variable) rule: As a minimum, you need approximately 10 events per predictor variable in logistic and Cox regression to avoid overfitting. With 80 events, include a maximum of 8 predictors.
Approaches to variable selection:
1. Hypothesis-driven selection (preferred in clinical research): Select predictors based on clinical knowledge and prior literature, regardless of statistical significance in univariate analysis. Pre-specify in your protocol.
2. Univariate screening approach:
- Test each candidate predictor in univariate analysis
- Include variables with p < 0.2 (or 0.25) as candidates — not p < 0.05, as this misses potentially important confounders
- Also include clinically important variables regardless of p-value
3. Automated stepwise selection (not recommended as primary approach):
- Backward elimination, forward selection, or bidirectional stepwise
- Problems: capitalises on chance, biased SEs and p-values, unreproducible in different samples
- May be used for exploratory analyses but results should be validated in an independent dataset
11.4 Handling Confounding: A Worked Example
Research question: Is emergency (vs elective) hospital admission associated with in-hospital mortality? Data from 600 admissions.
Univariate analysis:
| Variable | Crude OR | 95% CI | p |
|---|---|---|---|
| Emergency admission (vs elective) | 3.22 | 1.84–5.63 | <0.001 |
| Age (per 10 years) | 1.65 | 1.31–2.08 | <0.001 |
| Charlson comorbidity index | 1.48 | 1.24–1.77 | <0.001 |
| Male sex | 1.29 | 0.76–2.20 | 0.340 |
Multivariable logistic regression:
| Variable | Adjusted OR | 95% CI | p |
|---|---|---|---|
| Emergency admission (vs elective) | 1.87 | 1.01–3.47 | 0.047 |
| Age (per 10 years) | 1.44 | 1.12–1.85 | 0.004 |
| Charlson comorbidity index | 1.36 | 1.12–1.65 | 0.002 |
| Male sex | 1.15 | 0.65–2.03 | 0.627 |
Interpretation: Emergency admission was significantly associated with in-hospital mortality in both univariate (crude OR 3.22) and multivariable (adjusted OR 1.87) analyses. The attenuation from 3.22 to 1.87 indicates that age and comorbidity are confounders — emergency admissions tend to involve older, sicker patients, which partially explains their higher mortality. The adjusted OR represents the “true” independent association after accounting for these differences.
11.5 Propensity Score Methods
The problem: In observational studies, patients who receive a treatment differ systematically from those who don’t. Simply adjusting for confounders in regression may be insufficient when there are many confounders or when the treatment and control groups barely overlap.
Propensity score (PS): The predicted probability of receiving the treatment, given a patient’s observed baseline characteristics. Estimated using logistic regression with treatment as outcome and all confounders as predictors.
Uses of the propensity score:
1. Propensity score matching: Match each treated patient to one (or more) untreated patient(s) with a similar PS. Creates two groups balanced on measured confounders — mimics a randomised trial.
2. PS stratification: Divide patients into quintiles of PS and compare outcomes within each stratum.
3. Inverse probability of treatment weighting (IPTW): Reweight observations so that the weighted sample resembles a randomised trial.
Worked Example:
Research question: Using a registry of 800 STEMI patients, compare 1-year mortality between those who received drug-eluting stent (DES, n=400) vs bare metal stent (BMS, n=400). Patients who received DES were younger, had lower GRACE scores, and fewer comorbidities.
After propensity score matching (caliper width = 0.1 SD of logit PS):
- 312 matched pairs (DES vs BMS)
- Baseline characteristics now balanced (standardised differences all <0.10)
Matched analysis: HR for 1-year mortality, DES vs BMS = 0.74 (95% CI 0.55–0.99, p = 0.043)
Compare to: Unmatched analysis: HR = 0.51 (95% CI 0.39–0.67, p < 0.001) — substantially biased by confounding.
Interpretation: After propensity score matching to account for confounders, DES was associated with a 26% reduction in 1-year mortality compared to BMS (HR 0.74, p = 0.043). The unmatched estimate (51% reduction) was confounded by the baseline differences between groups.
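The matching step itself can be sketched as greedy 1:1 nearest-neighbour matching within a caliper (`ps_match` is an illustrative helper operating on already-estimated scores; real implementations typically match on the logit of the PS and may use optimal rather than greedy matching):

```python
def ps_match(treated, control, caliper):
    """Greedy 1:1 nearest-neighbour matching on propensity scores.
    Returns (treated_ps, control_ps) pairs whose scores differ by <= caliper."""
    available = sorted(control)
    pairs = []
    for t in sorted(treated):
        if not available:
            break
        best = min(available, key=lambda c: abs(c - t))   # nearest unused control
        if abs(best - t) <= caliper:
            pairs.append((t, best))
            available.remove(best)                        # match without replacement
    return pairs
```

Treated patients with no control inside the caliper are simply left unmatched, which is why matched cohorts (312 pairs above, from 400 treated) are smaller than the original sample.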
12. Multivariate Methods
12.1 Terminology Clarification
Multivariable: Multiple predictor variables, ONE outcome (e.g. multiple linear regression)
Multivariate: Multiple outcome variables simultaneously (e.g. MANOVA, PCA)
This distinction is frequently misused in published literature. MANOVA, PCA, and factor analysis are truly “multivariate” methods.
12.2 MANOVA (Multivariate Analysis of Variance)
What it does: Tests whether groups differ on a combination of continuous outcome variables simultaneously. An extension of ANOVA to multiple outcomes.
When to use:
- 3+ continuous outcome variables that are correlated with each other
- One or more grouping factors
- You want to test overall group differences before examining individual outcomes
Why not just run separate ANOVAs?
- Multiple testing inflates Type I error (with 5 outcomes at α=0.05, ~22% chance of at least one false positive)
- Ignores correlations among outcomes — MANOVA uses these to improve power
- MANOVA can detect group differences that no single ANOVA would
MANOVA test statistics: Wilks’ Lambda (most common), Pillai’s trace, Hotelling-Lawley trace, Roy’s largest root. All test the same null hypothesis but differ in robustness to assumption violations. Pillai’s trace is most robust to violations.
Worked Example:
Research question: Does exercise training modality (aerobic vs resistance vs combined vs control, n=30 per group) differentially affect cardiorespiratory fitness across three outcomes: VO₂max (mL/kg/min), 6-minute walk distance (m), and resting heart rate (bpm)?
Outcomes are moderately intercorrelated (r = 0.40–0.65).
MANOVA:
- Pillai’s trace = 0.52, F(9, 342) = 7.44, p < 0.001
Follow-up univariate ANOVAs (with Bonferroni correction, α = 0.017):
- VO₂max: F(3,116) = 12.4, p < 0.001
- 6MWD: F(3,116) = 8.7, p < 0.001
- Resting HR: F(3,116) = 5.2, p = 0.002
Interpretation: Training modality had a significant multivariate effect on cardiorespiratory fitness outcomes (MANOVA: Pillai’s trace = 0.52, F(9,342) = 7.44, p < 0.001). Follow-up univariate ANOVAs revealed significant effects on all three individual outcomes (all p ≤ 0.002 after Bonferroni correction).
12.3 Principal Component Analysis (PCA)
What it does: A data reduction technique that transforms a set of correlated variables into a smaller set of uncorrelated components (principal components) that capture most of the variance in the original data.
When to use:
- Many correlated predictor variables (multicollinearity) — reduce before regression
- Exploratory data analysis of high-dimensional data
- Visualising patterns in complex datasets
Key outputs:
- Eigenvalues: Variance explained by each component. Components with eigenvalue > 1 are typically retained (Kaiser criterion).
- Scree plot: Graph of eigenvalues — look for the “elbow” where the curve flattens.
- Factor loadings: Correlation between original variables and each component. Loadings > 0.4 are typically considered meaningful.
- % variance explained: How much of the total variability each component captures.
Worked Example:
Research question: A metabolic syndrome study measures 8 correlated biomarkers in 300 patients: waist circumference, fasting glucose, HDL-C, LDL-C, triglycerides, SBP, DBP, and insulin. Reduce these to a smaller set of components.
PCA results:
| Component | Eigenvalue | % Variance | Cumulative % |
|---|---|---|---|
| PC1 | 3.12 | 39.0% | 39.0% |
| PC2 | 1.84 | 23.0% | 62.0% |
| PC3 | 1.02 | 12.8% | 74.8% |
| PC4–8 | <0.80 each | <10% each | — |
Three components retained (eigenvalue > 1, 75% variance explained).
Loading matrix (simplified):
| Variable | PC1 (“metabolic risk”) | PC2 (“blood pressure”) | PC3 (“lipid profile”) |
|---|---|---|---|
| Waist circumference | 0.78 | 0.12 | 0.21 |
| Fasting glucose | 0.72 | 0.18 | −0.14 |
| Insulin | 0.69 | 0.08 | 0.22 |
| Triglycerides | 0.61 | 0.23 | 0.48 |
| SBP | 0.15 | 0.82 | 0.19 |
| DBP | 0.22 | 0.79 | 0.08 |
| HDL-C | −0.54 | 0.14 | 0.62 |
| LDL-C | 0.28 | 0.20 | 0.71 |
Interpretation: PC1 captures a “central metabolic risk” factor (high waist, glucose, insulin, TG; low HDL). PC2 represents blood pressure. PC3 captures lipid profile. These three components can replace the 8 original variables as predictors in subsequent analyses with minimal information loss.
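The arithmetic behind eigenvalues and "% variance explained" can be seen most clearly in the simplest possible case. A minimal pure-Python sketch for two standardised variables with a hypothetical correlation of r = 0.8 (for a 2×2 correlation matrix [[1, r], [r, 1]] the eigenvalues have the closed form 1 + r and 1 − r):

```python
# Minimal sketch: eigenvalues of a 2x2 correlation matrix, illustrating
# how PCA apportions variance between components.
def pca_2var(r):
    """Eigenvalues and % variance explained for two standardised
    variables with correlation r (closed form for the 2x2 case)."""
    eigenvalues = [1 + r, 1 - r]              # total variance = 2
    pct = [ev / 2 * 100 for ev in eigenvalues]
    return eigenvalues, pct

# Two biomarkers correlated at r = 0.8 (hypothetical values):
evals, pct = pca_2var(0.8)
# PC1: eigenvalue 1.8 -> 90% of variance; PC2: eigenvalue 0.2 -> 10%.
# The Kaiser criterion (eigenvalue > 1) would retain only PC1.
```

With many variables the same logic applies to the full correlation matrix; in practice one would use a library routine rather than a closed form.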
13. Mixed Models and Longitudinal Data
13.1 Why Standard ANOVA Is Insufficient for Longitudinal Data
Repeated measures ANOVA requires complete data (no missing values), assumes compound symmetry (equal variances and covariances between all time pairs), and cannot handle time-varying covariates. In clinical trials, 10–40% of observations are commonly missing.
Linear mixed effects (LME) models overcome these limitations:
- Handle missing data (missing-at-random) without imputation
- Allow flexible correlation structures (not just compound symmetry)
- Can accommodate unequally spaced measurement occasions
- Can model individual trajectories (random slopes)
13.2 Linear Mixed Effects Models
The model:
Y_ij = (β₀ + b₀ᵢ) + (β₁ + b₁ᵢ)×time_ij + β₂×X_ij + ε_ij
Where:
- β₀, β₁ = fixed effects (population-average intercept and slope)
- b₀ᵢ, b₁ᵢ = random effects for subject i (individual deviations from average)
- ε_ij = residual error
Fixed effects: Average effects across the population (reported) Random effects: Between-subject variability in intercepts and/or slopes
Worked Example:
Research question: A 12-month RCT of a lifestyle intervention in type 2 diabetes. HbA1c is measured at baseline, 3, 6, and 12 months in 120 patients (60 intervention, 60 control). 18% of follow-up data are missing (missing-at-random).
LME model: HbA1c ~ time × treatment + age + baseline HbA1c + (1+time|patient)
Key results:
| Effect | Coefficient | SE | 95% CI | p |
|---|---|---|---|---|
| Time (per month, control arm) | −0.021 | 0.008 | −0.037 to −0.005 | 0.011 |
| Treatment × time interaction | −0.038 | 0.011 | −0.059 to −0.017 | 0.001 |
| Age (per year) | 0.012 | 0.007 | −0.002 to 0.026 | 0.089 |
Interpretation: In the control arm, HbA1c decreased by 0.021% per month (reflecting background treatment changes). In the intervention arm, HbA1c decreased by an additional 0.038% per month compared to control (interaction term p = 0.001), yielding a net additional reduction of 0.46% at 12 months. The mixed model used all available data including observations with missing follow-up, reducing bias compared to complete-case analysis.
14. Diagnostic Test Evaluation
14.1 The 2×2 Table for Diagnostic Tests
All diagnostic test statistics derive from the 2×2 table comparing test result to the true diagnosis (gold standard):
| | Disease present | Disease absent | Total |
|---|---|---|---|
| Test positive | True positive (TP) | False positive (FP) | TP+FP |
| Test negative | False negative (FN) | True negative (TN) | FN+TN |
| Total | TP+FN | FP+TN | N |
14.2 Sensitivity, Specificity, PPV, NPV
Sensitivity: P(test positive | disease present) = TP / (TP+FN)
- A highly sensitive test rarely misses disease (few false negatives)
- “SnNout” — a highly Sensitive test when Negative rules OUT disease
Specificity: P(test negative | disease absent) = TN / (FP+TN)
- A highly specific test rarely gives false positives
- “SpPin” — a highly Specific test when Positive rules IN disease
Positive Predictive Value (PPV): P(disease present | test positive) = TP / (TP+FP)
- Depends heavily on disease prevalence — PPV falls sharply with lower prevalence
Negative Predictive Value (NPV): P(disease absent | test negative) = TN / (FN+TN)
- NPV rises with lower prevalence
The prevalence dependence of PPV and NPV: Unlike sensitivity and specificity (intrinsic test properties), PPV and NPV depend on the prevalence of disease in the tested population. A test with 95% sensitivity and 95% specificity applied to a population with 1% prevalence has a PPV of only 16.1%.
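This prevalence dependence is just Bayes' theorem. A minimal sketch that reproduces the 16.1% figure quoted above (function name is ours, not a library call):

```python
# Sketch: PPV and NPV as a function of prevalence, via Bayes' theorem.
def predictive_values(sens, spec, prev):
    """PPV and NPV for a test with given sensitivity, specificity,
    and disease prevalence in the tested population."""
    ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    npv = spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
    return ppv, npv

# 95% sensitive, 95% specific test at 1% prevalence:
ppv, npv = predictive_values(0.95, 0.95, 0.01)
# ppv ≈ 0.161 (16.1%) -- most positives are false positives
# npv ≈ 0.9995 -- negatives are almost always truly disease-free
```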
Worked Example:
Research question: Evaluate a new point-of-care troponin I assay for ruling out NSTEMI in 500 ED patients with chest pain. True diagnosis confirmed by serial high-sensitivity troponin.
| | NSTEMI (n=80) | No NSTEMI (n=420) |
|---|---|---|
| POC troponin positive (≥40 ng/L) | 72 | 21 |
| POC troponin negative (<40 ng/L) | 8 | 399 |
Prevalence = 80/500 = 16%
Sensitivity = 72/80 = 90.0% (95% CI: 81.2–95.6%)
Specificity = 399/420 = 95.0% (95% CI: 92.5–96.9%)
PPV = 72/93 = 77.4% (95% CI: 67.7–85.3%)
NPV = 399/407 = 98.0% (95% CI: 96.1–99.2%)
Interpretation: This POC troponin assay demonstrates high sensitivity (90%) and specificity (95%) for NSTEMI detection. The NPV of 98.0% supports its use as a rule-out strategy — of patients who test negative, 98% truly do not have NSTEMI. The PPV of 77.4% indicates that 23% of positive results will be false positives at this prevalence (16%), so confirmatory testing is needed for positive results.
14.3 ROC Curves and AUC
What it does: Evaluates a continuous or ordinal diagnostic test across all possible cutpoints. Plots sensitivity (y-axis) against 1-specificity (x-axis) as the threshold varies.
AUC (Area Under the Curve) / C-statistic:
- 0.5 = no discrimination (no better than chance)
- 0.7–0.8 = acceptable discrimination
- 0.8–0.9 = excellent
- >0.9 = outstanding
Optimal cutpoint: Choose based on clinical need:
- For rule-out tests (screening): maximise sensitivity (accept lower specificity)
- For rule-in tests (confirmation): maximise specificity (accept lower sensitivity)
- Youden’s index (sensitivity + specificity − 1): balanced optimum
Comparing two tests: DeLong’s method for comparing paired AUCs from the same sample.
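The AUC also has a useful probabilistic reading: it equals the probability that a randomly chosen diseased patient has a higher test value than a randomly chosen non-diseased patient (equivalently, the Mann-Whitney U statistic divided by n₁×n₂). A minimal sketch with hypothetical biomarker values:

```python
# Sketch: AUC from its probabilistic definition -- the chance a random
# diseased patient outscores a random non-diseased patient (ties count 1/2).
def auc(diseased, healthy):
    wins = sum((d > h) + 0.5 * (d == h)
               for d in diseased for h in healthy)
    return wins / (len(diseased) * len(healthy))

# Hypothetical test values for 4 diseased and 4 healthy patients:
a = auc(diseased=[3, 5, 6, 8], healthy=[1, 2, 4, 7])
# 12 of the 16 patient pairs rank correctly -> AUC = 0.75
```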
Worked Example:
Research question: Compare eGFR alone vs a clinical risk score (incorporating eGFR + proteinuria + age + diabetes) for predicting dialysis within 3 years in 350 CKD patients.
| Model | AUC | 95% CI |
|---|---|---|
| eGFR alone | 0.73 | 0.67–0.79 |
| Clinical risk score | 0.84 | 0.79–0.89 |
| Difference | +0.11 | p = 0.003 |
Interpretation: The clinical risk score (AUC 0.84) significantly outperforms eGFR alone (AUC 0.73) for predicting 3-year dialysis initiation (DeLong’s test p = 0.003). Adding proteinuria, age, and diabetes to eGFR substantially improves discrimination.
15. Agreement and Reliability
15.1 Cohen’s Kappa
What it does: Measures agreement between two raters (or methods) on categorical outcomes, corrected for chance agreement.
κ = (Po − Pe) / (1 − Pe)
Po = observed agreement proportion
Pe = expected agreement by chance
Interpreting kappa (Landis & Koch thresholds — use as rough guides):
| κ | Interpretation |
|---|---|
| <0.00 | Poor (less than chance) |
| 0.00–0.20 | Slight |
| 0.21–0.40 | Fair |
| 0.41–0.60 | Moderate |
| 0.61–0.80 | Substantial |
| 0.81–1.00 | Almost perfect |
Worked Example:
Research question: Two radiologists independently classify 120 chest X-rays as: normal, consolidation, or interstitial change. What is their agreement?
| | Rad2: Normal | Rad2: Consol. | Rad2: Interstitial | Total |
|---|---|---|---|---|
| Rad1: Normal | 48 | 4 | 2 | 54 |
| Rad1: Consolidation | 3 | 28 | 2 | 33 |
| Rad1: Interstitial | 1 | 2 | 30 | 33 |
| Total | 52 | 34 | 34 | 120 |
Po = (48+28+30)/120 = 106/120 = 0.883
Expected agreement:
- Pe(normal) = (54×52)/120² = 0.195
- Pe(consolidation) = (33×34)/120² = 0.078
- Pe(interstitial) = (33×34)/120² = 0.078
- Pe = 0.195 + 0.078 + 0.078 = 0.351
κ = (0.883 − 0.351) / (1 − 0.351) = 0.532 / 0.649 = 0.82
Interpretation: There is almost perfect agreement between the two radiologists for chest X-ray classification (κ = 0.82, 95% CI 0.73–0.91).
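The same calculation can be scripted directly from the agreement table. A minimal sketch using the radiologist counts above (function name is ours):

```python
# Sketch: Cohen's kappa from a k x k agreement table.
def cohens_kappa(table):
    n = sum(sum(row) for row in table)
    po = sum(table[i][i] for i in range(len(table))) / n   # observed agreement
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    pe = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (po - pe) / (1 - pe)

table = [[48, 4, 2],    # Rad1: normal
         [3, 28, 2],    # Rad1: consolidation
         [1, 2, 30]]    # Rad1: interstitial
kappa = cohens_kappa(table)   # ≈ 0.82, matching the hand calculation
```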
15.2 Bland-Altman Analysis
What it does: Assesses the agreement between two continuous measurement methods. Plots the difference between methods (y-axis) against the mean of the two methods (x-axis). Identifies systematic bias and limits of agreement.
Key outputs:
- Bias: Mean difference (Method A − Method B). Non-zero bias indicates systematic over- or under-measurement by one method.
- Limits of agreement (LOA): Bias ± 1.96 × SD of differences. The range within which 95% of differences will fall.
- Clinical decision: Are the LOA clinically acceptable? If the maximum acceptable difference is ±5 mmHg and the LOA are ±3 mmHg, the methods agree well enough for clinical use.
Why NOT to use correlation for method comparison: Pearson r measures association, not agreement. Two methods could be highly correlated but systematically disagree. Bland-Altman is the correct approach.
Worked Example:
Research question: Compare automated oscillometric blood pressure (AOBP) with gold-standard intra-arterial (IA) SBP measurement in 50 ICU patients.
| Statistic | Value |
|---|---|
| Mean AOBP | 124.6 mmHg |
| Mean IA SBP | 128.4 mmHg |
| Mean difference (AOBP − IA) | −3.8 mmHg |
| SD of differences | 8.2 mmHg |
| Upper LOA (+1.96 SD) | −3.8 + 16.1 = +12.3 mmHg |
| Lower LOA (−1.96 SD) | −3.8 − 16.1 = −19.9 mmHg |
Interpretation: AOBP underestimates IA SBP by a mean of 3.8 mmHg. The limits of agreement range from −19.9 to +12.3 mmHg, meaning in 95% of patients, AOBP will differ from IA by between 20 mmHg below and 12 mmHg above. Given the wide LOA, AOBP cannot reliably substitute for intra-arterial measurement in haemodynamically unstable ICU patients where precision of ±5 mmHg is needed.
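The limits of agreement are simple arithmetic once the bias and SD of the differences are known. A sketch reproducing the AOBP example from its summary statistics (with raw paired data, one would compute the mean and SD of the differences first):

```python
# Sketch: 95% limits of agreement = bias ± 1.96 × SD of paired differences.
def limits_of_agreement(bias, sd_diff):
    return bias - 1.96 * sd_diff, bias + 1.96 * sd_diff

# AOBP − IA example above: bias −3.8 mmHg, SD of differences 8.2 mmHg
lower, upper = limits_of_agreement(-3.8, 8.2)
# lower ≈ −19.9 mmHg, upper ≈ +12.3 mmHg
```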
16. Bayesian Methods
16.1 Frequentist vs Bayesian Framework
The fundamental difference:
Frequentist (classical) statistics:
- Parameters (e.g., true treatment effect) are fixed, unknown constants
- Probability refers to long-run frequency of events
- P-value = P(data this extreme | H₀ is true) — does not tell you probability that H₀ is true
- Cannot make probability statements about parameters
Bayesian statistics:
- Parameters have probability distributions reflecting uncertainty
- You start with a prior distribution (beliefs before seeing the data)
- You update with observed data (the likelihood)
- You get a posterior distribution (updated beliefs)
- Can make direct probability statements: “P(true effect > 0 | data) = 0.97”
Bayes’ theorem:
Posterior ∝ Prior × Likelihood
P(θ|data) ∝ P(data|θ) × P(θ)
16.2 Credible Intervals vs Confidence Intervals
Frequentist 95% CI: In repeated sampling, 95% of such intervals would contain the true parameter. Does NOT mean “95% probability the true value is in this interval” for this specific interval — though clinicians routinely interpret it this way.
Bayesian 95% credible interval (CrI): There IS a 95% probability that the true parameter lies within this interval (given the prior and the data). This is the natural, intuitive interpretation most clinicians want.
16.3 Bayesian Analysis in Practice
Worked Example:
Research question: A pilot RCT tests a new immunotherapy in 30 patients with refractory rheumatoid arthritis (15 active, 15 placebo). ACR50 response rates: 7/15 (47%) active, 3/15 (20%) placebo.
Frequentist analysis:
- OR = 3.5, 95% CI 0.71–17.3, p = 0.12
- Conclusion: “Not statistically significant” — ambiguous for a small pilot trial
Bayesian analysis:
- Prior: Weakly informative prior based on existing biologics literature (modest positive effect expected)
- Posterior OR = 3.2 (95% CrI: 1.02–10.4)
- P(OR > 1 | data) = 0.96 → 96% probability that the active treatment has a positive effect
- P(OR > 2 | data) = 0.72 → 72% probability of at least a doubling of odds of response
Interpretation: While the frequentist analysis is technically non-significant (p = 0.12) — likely due to small sample size — Bayesian analysis incorporating prior evidence indicates a 96% posterior probability that the new treatment outperforms placebo. These findings support proceeding to a full Phase III RCT.
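The posterior machinery can be demonstrated with a simple grid approximation, entirely in pure Python. This sketch uses flat Beta(1,1) priors on each arm's response rate rather than the informative prior in the example above, so its posterior probability differs from the 0.96 reported there:

```python
# Sketch: Bayesian comparison of two response rates by grid approximation.
# Flat priors (NOT the informative prior used in the worked example).
from math import comb

def posterior_grid(k, n, steps=400):
    """Binomial posterior under a flat prior, normalised over a grid."""
    grid = [(i + 0.5) / steps for i in range(steps)]
    w = [comb(n, k) * p**k * (1 - p)**(n - k) for p in grid]
    total = sum(w)
    return grid, [x / total for x in w]

grid, post_a = posterior_grid(7, 15)   # active arm: 7/15 responders
_, post_p = posterior_grid(3, 15)      # placebo arm: 3/15 responders

# P(p_active > p_placebo | data): double sum over the joint grid
prob = sum(wa * wp
           for ga, wa in zip(grid, post_a)
           for gp, wp in zip(grid, post_p) if ga > gp)
# high posterior probability that the active arm is better, despite p = 0.12
```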
16.4 Bayes Factors
What it does: A ratio of how well H₁ predicts the data relative to H₀. Can provide evidence for the null — something p-values cannot do.
BF₁₀ = P(data | H₁) / P(data | H₀)
Interpretation:
| BF₁₀ | Evidence for H₁ |
|---|---|
| 1–3 | Anecdotal |
| 3–10 | Moderate |
| 10–30 | Strong |
| 30–100 | Very strong |
| >100 | Extreme |
| <1/3 | Moderate evidence for H₀ |
Clinical use case: Non-inferiority trials — showing that a new (cheaper, safer) treatment is “not meaningfully worse.” A Bayes Factor <1/3 provides positive evidence for the null (no difference) rather than merely failing to reject it.
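A toy case where the Bayes factor has a closed form makes the "evidence for the null" point concrete. For a binomial outcome with H₀: p = 0.5 versus H₁: p ~ Uniform(0,1), the marginal likelihood under the uniform prior integrates to 1/(n+1), while under H₀ it is C(n,k)·0.5ⁿ (this specific prior choice is ours, for illustration):

```python
# Sketch: exact Bayes factor for binomial data, H0: p = 0.5 vs
# H1: p ~ Uniform(0,1).
from math import comb

def bf10_binomial(k, n):
    m1 = 1 / (n + 1)               # ∫ C(n,k) p^k (1-p)^(n-k) dp = 1/(n+1)
    m0 = comb(n, k) * 0.5 ** n     # likelihood at the point null
    return m1 / m0

bf_null = bf10_binomial(50, 100)   # data sit exactly on H0
bf_alt = bf10_binomial(70, 100)    # data far from p = 0.5
# bf_null ≈ 0.12 (< 1/3): moderate evidence FOR the null
# bf_alt > 100: extreme evidence for H1
```

Note that 50/100 successes yields positive evidence for H₀, which a non-significant p-value could never provide.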
17. Meta-Analysis and Systematic Review
17.1 The Evidence Hierarchy
Meta-analysis of randomised controlled trials sits at the top of the evidence hierarchy. It pools quantitative data from multiple studies to produce a single, more precise estimate of effect.
Why meta-analysis?
- Individual trials often underpowered to detect small but clinically important effects
- Pooling increases precision (narrower CI)
- Identifies sources of heterogeneity
- More generalisable than any single study
17.2 Fixed vs Random Effects Models
Fixed effects model:
- Assumes all studies estimate the same true effect
- Between-study variation is due to sampling error only
- Appropriate when studies are functionally identical (same population, intervention, outcome)
- Gives more weight to larger studies
Random effects model (DerSimonian-Laird):
- Assumes studies estimate different but related true effects (a distribution of effects)
- Between-study variability (heterogeneity, τ²) is estimated and incorporated
- Results in wider, more honest CIs
- More appropriate for most clinical meta-analyses where populations and protocols vary
- More weight distributed to smaller studies compared to fixed effects
How to choose: Examine heterogeneity (I²). If I² < 25%, either model is reasonable. If I² ≥ 50% (substantial heterogeneity), random effects model is more appropriate. However, if heterogeneity is very high (I² > 75%), even the random effects pooled estimate should be interpreted cautiously.
17.3 Worked Meta-Analysis Example
Research question: What is the effect of ACE inhibitors on cardiovascular mortality in patients with heart failure? Systematic review identified 6 eligible RCTs.
| Study | Control events/n | ACE-I events/n | OR | 95% CI |
|---|---|---|---|---|
| CONSENSUS (1987) | 44/126 | 29/127 | 0.56 | 0.31–0.99 |
| SOLVD-T (1991) | 452/1284 | 386/1285 | 0.81 | 0.68–0.96 |
| ATLAS (1999) | 52/1596 | 45/1568 | 0.87 | 0.58–1.30 |
| V-HeFT II (1991) | 131/403 | 117/403 | 0.84 | 0.61–1.16 |
| MERIT-HF (1999) | 145/2001 | 128/1990 | 0.89 | 0.70–1.14 |
| CIBIS-II (1999) | 156/1320 | 119/1327 | 0.73 | 0.57–0.94 |
Heterogeneity:
- I² = 18% (low heterogeneity — fixed effects model acceptable)
- Cochran’s Q = 6.1, p = 0.30
Pooled estimate (fixed effects):
- Pooled OR = 0.81 (95% CI: 0.74–0.89), p < 0.001
Random effects (for comparison):
- Pooled OR = 0.81 (95% CI: 0.72–0.91), p < 0.001 (slightly wider CI reflecting residual heterogeneity)
Interpretation: ACE inhibitors are associated with a 19% reduction in the odds of cardiovascular mortality in heart failure patients (pooled OR 0.81, 95% CI 0.74–0.89, p < 0.001). Heterogeneity across studies was low (I² = 18%), supporting the consistency of this effect. This translates to an NNT of approximately 28 over the average trial duration to prevent one cardiovascular death.
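The fixed-effects pooling above is inverse-variance weighting of the per-study log odds ratios. A sketch using the event counts from the table (small differences from the published OR of 0.81 reflect rounding in the per-study ORs, so no exact match is claimed):

```python
# Sketch: fixed-effect (inverse-variance) pooling of log odds ratios
# from 2x2 event counts.
from math import log, exp, sqrt

# (control events, control n, treatment events, treatment n) per trial
trials = [(44, 126, 29, 127), (452, 1284, 386, 1285), (52, 1596, 45, 1568),
          (131, 403, 117, 403), (145, 2001, 128, 1990), (156, 1320, 119, 1327)]

num = den = 0.0
for ec, nc, et, nt in trials:
    a, b = et, nt - et            # treatment events / non-events
    c, d = ec, nc - ec            # control events / non-events
    log_or = log((a * d) / (b * c))
    var = 1 / a + 1 / b + 1 / c + 1 / d   # Woolf variance of log OR
    num += log_or / var           # weight = 1 / variance
    den += 1 / var

pooled_or = exp(num / den)
se = sqrt(1 / den)
ci = (exp(num / den - 1.96 * se), exp(num / den + 1.96 * se))
# pooled_or close to the 0.81 reported above
```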
17.4 Assessing Heterogeneity
Cochran’s Q test: Tests the null hypothesis that all studies estimate the same true effect. Underpowered with few studies; significant Q indicates heterogeneity.
I² statistic: Proportion of total variation attributable to between-study differences (not sampling error).
- I² = 0–25%: Low/negligible
- I² = 26–50%: Moderate
- I² = 51–75%: Substantial
- I² > 75%: Considerable
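I² follows directly from Q via the Higgins & Thompson formula, I² = max(0, (Q − df)/Q) with df = number of studies − 1. The worked example's numbers check out:

```python
# Sketch: I² from Cochran's Q (Higgins & Thompson).
def i_squared(q, k_studies):
    df = k_studies - 1
    return max(0.0, (q - df) / q)

i2 = i_squared(6.1, 6)   # worked meta-analysis above: Q = 6.1, 6 trials
# → 0.18, i.e. I² = 18%, matching the reported value
```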
Sources of heterogeneity — investigate with:
- Subgroup analysis: Does the effect differ by study population, intervention intensity, follow-up duration?
- Meta-regression: Regress the effect size on study-level moderators (e.g., mean age, % female, baseline risk)
17.5 Publication Bias
The problem: Studies showing significant results are more likely to be published than those showing null results. This means a meta-analysis based on published literature may overestimate the true effect.
Detection:
- Funnel plot: Plot each study’s effect size against a measure of its precision (conventionally the standard error, plotted on an inverted axis). Under no bias, the plot is symmetric — smaller studies scatter more widely around the pooled estimate. Asymmetry suggests bias.
- Egger’s test: Formal regression test for funnel plot asymmetry. p < 0.05 suggests asymmetry (possible publication bias).
- Trim-and-fill method: Imputes missing studies to restore funnel symmetry and re-estimates the pooled effect. Shows how sensitive the main result is to potential publication bias.
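The regression behind Egger's test is ordinary least squares of each study's standardised effect (effect/SE) on its precision (1/SE); the intercept estimates small-study asymmetry. The sketch below uses constructed, not real, data: each effect is inflated by bias_per_se × SE, a deliberate small-study bias that the intercept then recovers exactly (the inferential p-value, which needs a t-distribution, is omitted):

```python
# Sketch: the Egger regression intercept, pure Python OLS.
def egger_intercept(effects, ses):
    y = [e / s for e, s in zip(effects, ses)]   # standardised effects
    x = [1 / s for s in ses]                    # precisions
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx                      # least-squares intercept

true_effect, bias_per_se = 0.5, 1.2
ses = [0.1, 0.2, 0.3, 0.4, 0.5]
effects = [true_effect + bias_per_se * s for s in ses]  # biased small studies
intercept = egger_intercept(effects, ses)
# intercept recovers the injected bias (1.2); it would be ~0 without it
```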
18. Reporting Standards and Checklists
18.1 General Reporting Principles
- Always report the test used, the test statistic, degrees of freedom, and exact p-value. Not just “p < 0.05” or “NS” — write “t(48) = 3.14, p = 0.003” or “χ²(2) = 8.74, p = 0.013.”
- Report effect sizes with confidence intervals for all primary outcomes. P-values alone are insufficient.
- Report sample sizes at every step. If 200 enrolled, 180 analysed — state what happened to the other 20 and conduct a sensitivity analysis if possible.
- For non-parametric tests, report median (IQR), not mean (SD).
- Report model fit statistics for regression models — R²/adjusted R² for linear regression; Hosmer-Lemeshow goodness of fit, AUC/C-statistic for logistic regression; overall model χ² and −2 log-likelihood.
- Check and report assumption testing — normality (Shapiro-Wilk), homogeneity of variance (Levene’s), sphericity (Mauchly’s), proportional hazards assumption (Cox).
- Distinguish pre-specified from exploratory analyses. Post-hoc subgroup analyses should be clearly labelled as exploratory and interpreted with caution.
18.2 Reporting Checklists
| Study type | Checklist |
|---|---|
| RCT | CONSORT (www.consort-statement.org) |
| Observational cohort or case-control | STROBE (www.strobe-statement.org) |
| Diagnostic accuracy study | STARD (www.equator-network.org/reporting-guidelines/stard) |
| Systematic review/meta-analysis | PRISMA (www.prisma-statement.org) |
| Prognostic model development | TRIPOD (www.tripod-statement.org) |
| Survival analysis | REMARK (for tumour marker studies) |
18.3 Specimen Results Sections
Randomised Trial (t-test result):
“The primary outcome, change in HbA1c from baseline to 6 months, was significantly greater in the intervention group compared to control (−0.82% vs −0.31%; mean difference −0.51%, 95% CI −0.78 to −0.24%; independent samples t-test: t(178) = −3.74, p < 0.001).”
Survival analysis result:
“Median progression-free survival was 11.2 months (95% CI 8.6–13.8) in the experimental arm and 7.4 months (95% CI 5.9–8.9) in the control arm. The experimental treatment was associated with a 38% reduction in the hazard of progression or death (HR 0.62, 95% CI 0.48–0.80; log-rank p < 0.001).”
Logistic regression result:
“On multivariable logistic regression analysis, prior hospitalisation in the previous year (OR 2.84, 95% CI 1.63–4.95, p < 0.001) and home oxygen use (OR 1.93, 95% CI 1.09–3.42, p = 0.025) were independently associated with 30-day readmission after adjustment for age, FEV₁%, and eosinophil count. The model demonstrated acceptable discrimination (C-statistic 0.72) and good calibration (Hosmer-Lemeshow p = 0.64).”
Appendix: Quick Reference Tables
A1. Choosing the Right Test — Complete Reference
| Research question | Outcome type | Predictor type | Groups/samples | Test |
|---|---|---|---|---|
| Is mean different from reference? | Continuous | None | 1 group | One-sample t-test (parametric) / Wilcoxon (non-param) |
| Is proportion different from reference? | Binary | None | 1 group | One-proportion z-test |
| Does categorical distribution match expected? | Categorical | None | 1 group | Chi-square goodness of fit |
| Are two independent group means different? | Continuous | Binary | 2 indep. | Student’s t / Welch’s t / Mann-Whitney U |
| Are two paired measurements different? | Continuous | Time (2 points) | 2 paired | Paired t-test / Wilcoxon signed-rank |
| Are two paired binary proportions different? | Binary | Time (2 points) | 2 paired | McNemar’s test |
| Are 3+ independent group means different? | Continuous | Categorical | 3+ indep. | One-way ANOVA / Welch’s ANOVA / Kruskal-Wallis |
| Are 3+ repeated measures different? | Continuous | Time (3+ points) | 3+ paired | Repeated measures ANOVA / Friedman |
| Are 3+ paired binary proportions different? | Binary | Time (3+ points) | 3+ paired | Cochran’s Q |
| Is there a categorical association? | Categorical | Categorical | Indep. | Chi-square / Fisher’s exact |
| Is there a linear association? | Continuous | Continuous | — | Pearson r / Spearman ρ |
| Predict continuous outcome from 1+ predictors | Continuous | Mixed | — | Linear regression |
| Predict binary outcome from 1+ predictors | Binary | Mixed | — | Logistic regression |
| Predict time-to-event from 1+ predictors | Time-to-event | Mixed | — | Cox proportional hazards |
| Predict count outcome from 1+ predictors | Count | Mixed | — | Poisson regression |
| Compare survival curves between groups | Time-to-event | Categorical | 2+ indep. | Kaplan-Meier + log-rank |
| Multiple continuous outcomes simultaneously | Continuous | Categorical | 2+ groups | MANOVA |
| Reduce many correlated variables | Continuous | None | — | PCA / Factor analysis |
| Repeated measures with missing data | Continuous | Mixed | 3+ time pts | Linear mixed effects model |
| Agreement between two raters (categorical) | Categorical | — | 2 raters | Cohen’s kappa |
| Agreement between two continuous methods | Continuous | — | 2 methods | Bland-Altman analysis |
| Diagnostic test evaluation | Binary | Continuous/ordinal | — | ROC analysis, sensitivity/specificity |
A2. Non-Parametric Equivalents
| Parametric test | Non-parametric equivalent | Use when |
|---|---|---|
| One-sample t-test | One-sample Wilcoxon | Non-normal data, n < 30 |
| Independent t-test | Mann-Whitney U | Non-normal, ordinal, n < 30 |
| Paired t-test | Wilcoxon signed-rank | Non-normal differences, ordinal |
| One-way ANOVA | Kruskal-Wallis | Non-normal groups, ordinal outcome |
| Repeated measures ANOVA | Friedman test | Non-normal, ordinal, repeated |
| Pearson correlation | Spearman correlation | Non-normal, ordinal, outliers |
| MANOVA | — (robust MANOVA) | Non-normal multivariate data |
A3. Effect Size Reference
| Measure | Formula | Interpretation |
|---|---|---|
| Cohen’s d | (μ₁−μ₂)/pooled SD | 0.2=small, 0.5=medium, 0.8=large |
| Odds ratio | ad/bc | 1=no effect; >1 increased odds; <1 decreased odds |
| Relative risk | [a/(a+b)] / [c/(c+d)] | 1=no effect; >1 increased risk |
| NNT | 1/ARR | Lower = more effective |
| Hazard ratio | e^β (Cox) | 1=no effect; same interpretation as RR |
| r (Pearson) | — | 0.1=small, 0.3=medium, 0.5=large |
| R² | SS_model/SS_total | % variance explained |
| η² (eta-squared) | SS_between/SS_total | 0.01=small, 0.06=medium, 0.14=large |
| AUC/C-statistic | Area under ROC | 0.5=chance; 0.7–0.8=acceptable; >0.8=excellent |
| Kappa (κ) | (Po−Pe)/(1−Pe) | 0.4–0.6=moderate; 0.6–0.8=substantial; >0.8=almost perfect |
A4. P-Value Thresholds in Context
| Scenario | Recommended α | Rationale |
|---|---|---|
| Primary outcome, single test | 0.05 | Standard |
| Secondary outcomes (multiple) | 0.05/k (Bonferroni) | Multiple comparisons |
| Post-hoc pairwise comparisons | Tukey HSD or Bonferroni | Familywise error control |
| Exploratory analysis | 0.05, clearly labelled | Hypothesis-generating only |
| Genome-wide association study | 5×10⁻⁸ | Millions of comparisons |
| Equivalence / non-inferiority | 0.025 (one-sided) | Specific trial design |
A5. Sample Size Formulae
| Design | Formula | Notes |
|---|---|---|
| Two independent means | n = 2(z_α/2 + z_β)²σ²/δ² | σ = SD, δ = minimum detectable difference |
| Two proportions | n = (z_α/2 + z_β)² [p₁(1-p₁)+p₂(1-p₂)] / (p₁-p₂)² | p₁, p₂ = expected proportions |
| Paired design | n = (z_α/2 + z_β)²σ_d²/δ² | σ_d = SD of differences |
| Estimating a single proportion (precision-based) | n = z_α/2² p₀(1-p₀) / E² | p₀ = expected proportion; E = acceptable margin of error |
z values: z_0.025 = 1.96 (α=0.05 two-tailed), z_0.2 = 0.84 (power=80%), z_0.1 = 1.28 (power=90%)
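As a worked check of the two-independent-means formula with those z-values: to detect a 5-unit difference when SD = 10, at α = 0.05 (two-sided) and 80% power (values here are a hypothetical illustration):

```python
# Sketch: sample size per group for comparing two independent means.
from math import ceil

def n_per_group_two_means(sd, delta, z_alpha=1.96, z_beta=0.84):
    """n = 2 (z_α/2 + z_β)² σ² / δ², rounded up to the next whole patient."""
    return ceil(2 * (z_alpha + z_beta) ** 2 * sd ** 2 / delta ** 2)

n = n_per_group_two_means(sd=10, delta=5)
# 2 × (2.80)² × 100 / 25 = 62.7 → 63 per group
```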
Key References and Further Reading
- Altman DG. Practical Statistics for Medical Research. Chapman & Hall, 1991.
- Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986;327:307-310.
- Cox DR. Regression models and life tables. J Royal Stat Soc B. 1972;34:187-220.
- DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves. Biometrics. 1988;44:837-845.
- Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. J Am Stat Assoc. 1958;53:457-481.
- Rothman KJ. No adjustments are needed for multiple comparisons. Epidemiology. 1990;1:43-46.
- Steyerberg EW. Clinical Prediction Models. Springer, 2009.
- Zhang J, Yu KF. What’s the relative risk? A method of correcting the odds ratio in cohort studies of common outcomes. JAMA. 1998;280:1690-1691.
- Vittinghoff E, et al. Regression Methods in Biostatistics. Springer, 2012.
- Harrell FE. Regression Modeling Strategies. Springer, 2015.
This guide is intended as a methodological reference for applied clinical research. Statistical analysis should always be conducted in consultation with a qualified statistician for complex or novel study designs. Software implementations: R (free, recommended), Stata, SPSS, SAS.
Version 1.0 | Prepared for clinical researchers | Field: Medical / clinical research