
Statistical Significance

A statistical measure indicating whether observed results are likely due to a real effect rather than random chance, typically assessed using p-values and hypothesis testing.

Also known as: significance, statistical significance testing

Definition

Statistical significance is a formal statistical concept used to determine whether observed results — such as differences between treatment and control groups — are likely to reflect a real effect rather than random chance. In monitoring and evaluation (M&E), it answers the question: "Could this result have occurred by random variation alone?"

The most common measure is the p-value: the probability of observing results at least as extreme as those obtained, assuming no true effect exists (the null hypothesis). A p-value below a predetermined threshold (typically 0.05) indicates statistical significance — meaning that if the programme truly had no effect, a result this extreme would arise less than 5% of the time. Note that this is not the probability that the result "occurred by chance"; the p-value is computed under the assumption of no effect. Statistical significance also does not measure the size or practical importance of an effect; that requires examining effect size separately.
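The logic behind a p-value can be made concrete with a permutation test: repeatedly reshuffle group labels (simulating "no true effect") and count how often the reshuffled data produce a difference at least as extreme as the observed one. This is a minimal pure-Python sketch; the group scores are hypothetical.

```python
import random

def permutation_p_value(treatment, control, n_perm=10_000, seed=0):
    """Two-sided permutation test: estimate the probability of a mean
    difference at least as extreme as observed, assuming no true effect."""
    rng = random.Random(seed)
    observed = abs(sum(treatment) / len(treatment) - sum(control) / len(control))
    pooled = list(treatment) + list(control)
    n_t = len(treatment)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reshuffling labels simulates the null hypothesis
        diff = abs(sum(pooled[:n_t]) / n_t
                   - sum(pooled[n_t:]) / (len(pooled) - n_t))
        if diff >= observed:
            extreme += 1
    return extreme / n_perm

# Hypothetical outcome scores for two small groups
treat = [12, 15, 14, 16, 13, 17]
ctrl = [10, 11, 12, 9, 13, 11]
p = permutation_p_value(treat, ctrl)  # small p: difference unlikely under the null
```

Here a small p (below 0.05) says that random relabelling rarely reproduces a gap as large as the one observed — the same interpretation the formal definition gives.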

Why It Matters

Statistical significance is essential for credible impact evaluation and evidence-based decision-making. Without it, practitioners cannot distinguish between genuine programme effects and random fluctuations in the data. This is particularly critical when:

  • Making attribution claims — determining whether observed outcomes can reasonably be attributed to the programme rather than external factors or chance
  • Scaling interventions — deciding whether to expand a programme based on evaluation results that may reflect random variation
  • Reporting to donors — providing defensible evidence of impact that meets methodological standards
  • Avoiding false positives — preventing investment in ineffective programmes that appeared successful due to random chance

However, statistical significance alone is insufficient. A result can be statistically significant yet practically meaningless (tiny effect with large sample), or practically important yet not statistically significant (large effect with small sample). Practitioners must examine both statistical significance and effect size to fully interpret evaluation findings.

In Practice

Statistical significance appears primarily in quantitative impact evaluations and quasi-experimental designs. Common applications include:

Impact evaluations using randomized controlled trials (RCTs) or quasi-experimental designs calculate p-values for each outcome indicator to test whether treatment and control groups differ significantly. For example, a health programme might find that vaccination rates are 15 percentage points higher in the treatment group (p=0.02), indicating this difference is unlikely due to chance.
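A difference in rates like the one above is typically tested with a two-proportion z-test. The sketch below uses hypothetical counts (75% vs 60% vaccinated, 400 participants per arm) chosen only to illustrate the calculation, and the standard normal tail via `math.erfc`.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for a difference in proportions.

    x1/n1: successes and sample size in the treatment group,
    x2/n2: successes and sample size in the control group.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)           # pooled rate under the null
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value: P(|Z| >= |z|) for a standard normal Z
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p1 - p2, p_value

# Hypothetical data: 300/400 vaccinated in treatment vs 240/400 in control
diff, p = two_proportion_z_test(300, 400, 240, 400)  # diff = 0.15 (15 pp)
```

With samples this large, a 15-percentage-point gap yields a very small p-value; with much smaller arms, the same gap might not reach significance — which is exactly why sample size matters.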

Survey analysis uses significance testing to determine whether observed differences across demographic groups (disaggregation) reflect real patterns or sampling variation. This validates whether outcome disparities by gender, location, or other characteristics are genuine.

Before-after comparisons test whether changes from baseline to endline are statistically significant, accounting for natural variation in the data.

Best practice requires reporting both p-values and effect sizes (e.g., Cohen's d, odds ratios) alongside confidence intervals. A result showing p=0.049 should not be treated as meaningfully different from p=0.051 — the arbitrary 0.05 threshold creates a false binary. Instead, interpret the full statistical picture: effect magnitude, precision (confidence intervals), and practical relevance to programme goals.
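To show what reporting effect size and precision alongside a p-value looks like, here is a sketch of Cohen's d and an approximate 95% confidence interval for a mean difference, using only the standard library and hypothetical scores.

```python
import math
import statistics as st

def cohens_d(a, b):
    """Cohen's d: standardized mean difference using the pooled SD."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * st.variance(a) + (nb - 1) * st.variance(b)) \
        / (na + nb - 2)
    return (st.mean(a) - st.mean(b)) / math.sqrt(pooled_var)

def diff_ci_95(a, b):
    """Approximate 95% CI for the mean difference (normal critical value 1.96)."""
    se = math.sqrt(st.variance(a) / len(a) + st.variance(b) / len(b))
    d = st.mean(a) - st.mean(b)
    return d - 1.96 * se, d + 1.96 * se

# Hypothetical outcome scores
treat = [12, 15, 14, 16, 13, 17]
ctrl = [10, 11, 12, 9, 13, 11]
d = cohens_d(treat, ctrl)        # magnitude of the effect in SD units
lo, hi = diff_ci_95(treat, ctrl)  # range of plausible true differences
```

Reading all three together — effect size, interval, and p-value — gives the "full statistical picture" the text describes; for small samples a t critical value would be more accurate than 1.96, but the structure of the report is the same.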

Related Topics


Links to: P14 (quasi-experimental-design), P15 (impact-evaluation), effect-size, hypothesis-testing, p-values, power-analysis