When to Use
Impact evaluation is the right approach when you need to know whether a programme caused observed changes in outcomes: not just whether outcomes improved, but whether the improvement was due to the programme. This is a high bar that requires substantial investment in design and data collection. Use it when:
- Scale decisions depend on evidence: governments or donors considering large-scale rollout need credible evidence that the programme works before committing resources
- Programme effectiveness is genuinely uncertain: the intervention has a plausible theory of change but has not been rigorously tested in this context
- Policy competition exists: choosing between two alternative approaches requires a comparative design to determine which is more effective
- Donor requirements mandate it: USAID, USDA, and the World Bank increasingly require impact evaluations for programmes above certain thresholds, particularly for food security, health, and agriculture
- The stakes are high: programmes that affect large numbers of people or involve significant resources warrant the investment in rigorous evaluation
Impact evaluation is not appropriate when the programme is still being developed (use formative evaluation first), when outcomes cannot be measured in the programme timeline, when a counterfactual cannot be ethically or practically constructed, or when the evaluation question is about how outcomes occurred rather than whether they did (use contribution analysis or process tracing instead).
| Scenario | Use Impact Evaluation? | Better Alternative |
|---|---|---|
| Scaling decision for proven model | Yes | — |
| Early-stage programme development | No | Formative evaluation |
| Complex multi-actor change | No | Contribution Analysis |
| How and why change happened | No | Process Tracing |
| No counterfactual possible | No | Contribution Analysis |
| Donor mandates attribution evidence | Yes | — |
How It Works
All impact evaluations rest on one central idea, the counterfactual: what would have happened to programme participants in the absence of the programme. Since you cannot observe the same people both with and without the programme, you construct a comparison group that approximates this counterfactual.
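To make the logic concrete, here is a minimal sketch in Python of the simplest possible estimator: the difference in mean endline outcomes between treated and comparison units. The DataFrame, column names, and values are illustrative assumptions, and the estimate is only as credible as the comparison group's claim to be a valid counterfactual.

```python
import pandas as pd

# Hypothetical survey data: one row per household, with a treatment flag
# and an endline outcome (column names and values are illustrative).
df = pd.DataFrame({
    "treated": [1, 1, 1, 0, 0, 0],
    "dietary_diversity": [6.2, 5.8, 6.5, 5.1, 5.4, 4.9],
})

# The comparison group stands in for the counterfactual: its mean outcome
# approximates what treated households would have looked like untreated.
impact_estimate = (
    df.loc[df["treated"] == 1, "dietary_diversity"].mean()
    - df.loc[df["treated"] == 0, "dietary_diversity"].mean()
)
print(f"Estimated treatment effect: {impact_estimate:.2f}")
```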
Step 1: Plan at the design stage
Impact evaluations must be planned before the programme starts. Retrospective impact evaluation is rarely credible. Baseline data must be collected before the programme begins.
Step 2: Define the evaluation question
State precisely what outcome you are trying to measure, for whom, over what time period, and at what geographic level. Vague questions produce inconclusive evaluations.
Step 3: Choose a design
The design choice depends on whether random assignment is feasible:
- Randomised Controlled Trial (RCT): participants are randomly assigned to treatment or control. Gold standard for internal validity but costly and often ethically difficult
- Quasi-experimental designs: used when randomisation is not possible; options include difference-in-differences, propensity score matching, regression discontinuity, and interrupted time series (a minimal difference-in-differences sketch follows this list). See quasi-experimental design for details
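As a rough illustration of the quasi-experimental logic, the sketch below computes a difference-in-differences estimate from group-by-round means. The figures and labels are placeholders; a real analysis would work from unit-level panel data and test the parallel-trends assumption.

```python
import pandas as pd

# Illustrative panel: mean outcomes by group and survey round
# (numbers are placeholders, not from any real evaluation).
means = pd.DataFrame({
    "group": ["treatment", "treatment", "comparison", "comparison"],
    "round": ["baseline", "endline", "baseline", "endline"],
    "outcome": [4.8, 6.1, 4.9, 5.3],
}).set_index(["group", "round"])["outcome"]

# Difference-in-differences: change in the treatment group minus change
# in the comparison group, netting out the trend common to both.
did = (
    (means["treatment", "endline"] - means["treatment", "baseline"])
    - (means["comparison", "endline"] - means["comparison", "baseline"])
)
print(f"Difference-in-differences estimate: {did:.2f}")
```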
Step 4: Establish baseline
Collect data on outcomes for both treatment and comparison groups before the programme begins. This is non-negotiable. The two groups must be comparable at baseline; any differences should be documented and controlled for in the analysis.
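A minimal balance check might look like the following sketch, which compares baseline means across arms with a t-test. The variable names and values are hypothetical; in practice you would repeat this for every key covariate and report the results in a balance table.

```python
import pandas as pd
from scipy import stats

# Hypothetical baseline data for treatment and comparison households
# (column names and values are illustrative).
baseline = pd.DataFrame({
    "treated": [1, 1, 1, 1, 0, 0, 0, 0],
    "household_income": [210, 185, 240, 200, 205, 190, 230, 215],
})

# Simple balance check: compare baseline means across arms; meaningful
# differences should be documented and controlled for in the analysis.
treat = baseline.loc[baseline["treated"] == 1, "household_income"]
comp = baseline.loc[baseline["treated"] == 0, "household_income"]
t_stat, p_value = stats.ttest_ind(treat, comp)
print(f"Baseline mean (treatment): {treat.mean():.1f}")
print(f"Baseline mean (comparison): {comp.mean():.1f}")
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```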
Step 5: Implement with evaluation integrity
Monitor for contamination (comparison group accessing programme), attrition (losing study participants), and design fidelity (programme delivered as intended). These threats to validity must be managed throughout implementation.
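One simple monitoring check is differential attrition: the share of baseline units not found at follow-up, broken down by arm. The sketch below assumes a hypothetical tracking roster; field names are illustrative.

```python
import pandas as pd

# Hypothetical tracking roster: baseline households with a flag for
# whether each was re-interviewed at endline (names are illustrative).
roster = pd.DataFrame({
    "treated":       [1, 1, 1, 1, 0, 0, 0, 0],
    "found_endline": [1, 1, 0, 1, 1, 0, 0, 1],
})

# Attrition by arm: high or differential attrition between arms is a
# threat to validity and should be investigated, not just reported.
attrition = 1 - roster.groupby("treated")["found_endline"].mean()
print(attrition.rename({0: "comparison", 1: "treatment"}))
```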
Step 6: Collect follow-up data and analyse
Collect midline and endline data at pre-specified intervals. Analyse using the appropriate statistical methods for the chosen design. Report the treatment effect size with confidence intervals, not just significance tests.
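One common way to produce an effect estimate with a confidence interval is to regress the endline outcome on a treatment dummy, as in the sketch below (simulated placeholder data; statsmodels assumed available). More elaborate designs add baseline covariates or fixed effects, but the reporting principle is the same.

```python
import numpy as np
import statsmodels.api as sm

# Simulated endline outcomes for illustration only.
rng = np.random.default_rng(0)
treated = np.repeat([1, 0], 200)
outcome = 5.0 + 0.6 * treated + rng.normal(0, 1.2, size=400)

# Regress the outcome on a treatment dummy; the coefficient is the
# estimated treatment effect, reported with a 95% confidence interval
# rather than a bare significance star.
model = sm.OLS(outcome, sm.add_constant(treated)).fit()
effect = model.params[1]
ci_low, ci_high = model.conf_int()[1]
print(f"Treatment effect: {effect:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```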
Step 7: Interpret and communicate findings
A statistically significant effect is not the same as a practically meaningful one. Report effect sizes in terms decision-makers understand (absolute changes, percentage changes, lives affected) alongside statistical significance.
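Translating an estimate into absolute terms can be as simple as the arithmetic below; the coverage figures are illustrative assumptions, not drawn from any real programme.

```python
# Translate a percentage point effect into absolute terms for
# decision-makers (all figures below are illustrative assumptions).
eligible_households = 100_000
baseline_rate = 0.52   # share with the outcome before the programme
effect_pp = 0.23       # estimated effect in percentage points

additional_households = round(eligible_households * effect_pp)
print(f"Outcome prevalence: {baseline_rate:.0%} to {baseline_rate + effect_pp:.0%}")
print(f"Roughly {additional_households:,} additional households reached")
```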
Key Components
- Counterfactual: a credible comparison group that approximates what would have happened without the programme
- Baseline data: pre-intervention outcome measurements for both groups
- Primary outcome indicator: one or two key outcomes the evaluation is powered to detect
- Sample size calculation: determines how many participants are needed to detect an effect of expected magnitude
- Pre-registration: registering the evaluation design, hypotheses, and analysis plan before data collection (increasingly required by 3ie, J-PAL, and major donors)
- Follow-up data: midline and endline measurements at pre-specified intervals
- Analysis plan: pre-specified statistical methods to prevent data dredging
Best Practices
Commit to the counterfactual. The entire credibility of an impact evaluation depends on the quality of the comparison group. Random assignment is the gold standard; when it is not feasible, document carefully why and use the best available quasi-experimental design.
Mandate baseline data collection. Without a baseline there is no impact evaluation, only a before-after comparison, which cannot rule out trends that would have occurred anyway.
Power the study to detect realistic effects. Underpowered studies produce inconclusive results regardless of how well everything else is done. Work with a statistician to calculate minimum sample sizes based on expected effect sizes.
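As a rough illustration, a two-sample power calculation with statsmodels looks like the sketch below. The assumed effect size is a placeholder, and cluster-randomised designs need a further design-effect adjustment that this simple calculation does not capture.

```python
from statsmodels.stats.power import TTestIndPower

# Minimum sample size per arm to detect an assumed effect of 0.3
# standard deviations with 80% power at the 5% significance level
# (the effect size here is an illustrative assumption, not a benchmark).
analysis = TTestIndPower()
n_per_arm = analysis.solve_power(effect_size=0.3, power=0.8, alpha=0.05)
print(f"Required sample size per arm: {n_per_arm:.0f}")
```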
Use the same instruments across groups. Survey tools and questions must be identical between treatment and comparison groups to ensure comparability.
Pre-register the design. Pre-registration prevents selective reporting of positive findings and builds credibility with donors and policymakers. 3ie, AEA RCT Registry, and RIDIE are the main registries.
Common Mistakes
Starting too late. Impact evaluations designed after implementation begins cannot establish valid baselines. The most common and most costly mistake in impact evaluation is failure to plan prospectively.
Asking the impact evaluation to answer process questions. An impact evaluation tells you whether outcomes changed. It will not tell you why, for whom the effect varied, or what mechanisms produced it. Pair it with qualitative methods for process insights.
Inadequate attention to comparison group quality. Propensity score matching, difference-in-differences, and regression discontinuity all depend on assumptions that must be tested and reported. Presenting quasi-experimental results without discussing the plausibility of design assumptions is misleading.
Conflating statistical significance with programme success. A statistically significant effect of negligible magnitude is not a programme success. Report and interpret effect sizes.
Neglecting negative results. Null results are information. A well-conducted impact evaluation that finds no effect is valuable evidence. Suppress null results and you distort the evidence base.
Examples
Agricultural livelihoods, East Africa. A USDA-funded food security programme in Ethiopia used a quasi-experimental design with propensity score matching to evaluate impact on household dietary diversity and income. Baseline data was collected for 3,000 treatment households and 2,400 matched comparison households before programme start. Midline and endline surveys tracked outcomes over five years. The evaluation found a 0.8 standard deviation improvement in dietary diversity scores in treatment households relative to comparison, attributed to the programme. The effect was concentrated in female-headed households, prompting a design revision for the follow-on programme.
Health, West Africa. A USAID-funded malaria prevention programme in Nigeria used a cluster-randomised trial design, randomising 60 communities to treatment (free bednet distribution plus community health worker visits) or control (free bednets only). The evaluation found that adding community health worker visits produced a 23 percentage point increase in consistent bednet use relative to bednets alone, justifying the additional cost of the community health worker component in national scale-up planning.
Education, South Asia. A World Bank-supported learning improvement programme in Pakistan used a regression discontinuity design based on school-level test score rankings to evaluate impact on student achievement. Schools just below the eligibility threshold were compared to schools just above. The evaluation found a 0.4 standard deviation improvement in literacy scores among Grade 3 students in programme schools, with larger effects for girls and rural schools.
Compared To
| Approach | Causal Claim | Counterfactual | Suitable When |
|---|---|---|---|
| Impact Evaluation | Attributable effect | Explicit | Feasible counterfactual, scale decision |
| Quasi-Experimental Design | Attributable effect | Constructed | Randomisation not feasible |
| Contribution Analysis | Plausible contribution | None | Complex, multi-actor change |
| Process Tracing | Causal mechanism | None | Understanding how change happened |
| Realist Evaluation | Contextual mechanisms | Partial | What works, for whom |
Relevant Indicators
52 donor-aligned indicators across USAID, DFID, World Bank, 3ie, USDA, and Global Fund. Key examples:
- Net attributable change in primary outcome between baseline and endline (treatment vs. comparison)
- Effect size (Cohen's d or percentage point difference) at programme completion; a worked sketch of Cohen's d follows this list
- Proportion of evaluation hypotheses confirmed versus disconfirmed
- Fidelity score for programme implementation as designed
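For reference, Cohen's d is the difference in group means divided by the pooled standard deviation. A minimal sketch, with placeholder scores, follows.

```python
import numpy as np

def cohens_d(treatment: np.ndarray, comparison: np.ndarray) -> float:
    """Standardised mean difference using the pooled standard deviation."""
    n1, n2 = len(treatment), len(comparison)
    pooled_var = (
        (n1 - 1) * treatment.var(ddof=1) + (n2 - 1) * comparison.var(ddof=1)
    ) / (n1 + n2 - 2)
    return (treatment.mean() - comparison.mean()) / np.sqrt(pooled_var)

# Illustrative endline scores (placeholder values).
d = cohens_d(np.array([6.2, 5.8, 6.5, 6.0]), np.array([5.1, 5.4, 4.9, 5.3]))
print(f"Cohen's d: {d:.2f}")
```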
Related Tools
- Evaluation Planner: structure your evaluation design and timeline from programme start
- Indicator Library: find donor-aligned outcome indicators for your sector
Related Topics
- Quasi-Experimental Design, the most common alternative when RCTs are not feasible
- Contribution Analysis, for when a counterfactual cannot be constructed
- Baseline Design, the foundational data collection without which no impact evaluation is possible
- Attribution vs. Contribution, understanding the distinction between impact evaluation and contribution claims
- Mixed Methods Evaluation, pairing quantitative impact estimates with qualitative process insights
Further Reading
- Gertler, P., Martinez, S., Premand, P., Rawlings, L., & Vermeersch, C. (2016). Impact Evaluation in Practice. 2nd ed. World Bank. The most accessible practitioner guide.
- White, H. (2014). Current Challenges in Impact Evaluation. 3ie Working Paper 18. Reviews methodological debates.
- J-PAL (2019). Introduction to Evaluations. Poverty Action Lab. Free online course covering RCT design.
- USAID (2016). Evaluation: Learning from Experience. ADS 203. USAID's policy on evaluation including impact evaluation requirements.