When to Use
Impact evaluation is the right approach when you need to know whether a programme caused observed changes in outcomes: not just whether outcomes improved, but whether the improvement was due to the programme. This is a high bar that requires substantial investment in design and data collection. Use it when:
- Scale decisions depend on evidence: governments or donors considering large-scale rollout need credible evidence that the programme works before committing resources
- Programme effectiveness is genuinely uncertain: the intervention has a plausible theory of change but has not been rigorously tested in this context
- Policy competition exists: choosing between two alternative approaches requires a comparative design to determine which is more effective
- Donor requirements mandate it: USAID, USDA, and the World Bank increasingly require impact evaluations for programmes above certain thresholds, particularly for food security, health, and agriculture
- The stakes are high: programmes that affect large numbers of people or involve significant resources warrant the investment in rigorous evaluation
Impact evaluation is not appropriate when the programme is still being developed (use formative evaluation first), when outcomes cannot be measured in the programme timeline, when a counterfactual cannot be ethically or practically constructed, or when the evaluation question is about how outcomes occurred rather than whether they did (use contribution analysis or process tracing instead).
| Scenario | Use Impact Evaluation? | Better Alternative |
|---|---|---|
| Scaling decision for proven model | Yes | — |
| Early-stage programme development | No | Formative evaluation |
| Complex multi-actor change | No | Contribution Analysis |
| How and why change happened | No | Process Tracing |
| No counterfactual possible | No | Contribution Analysis |
| Donor mandates attribution evidence | Yes | — |
How It Works
All impact evaluations rest on one central idea, the counterfactual: what would have happened to programme participants in the absence of the programme. Since you cannot observe the same people both with and without the programme, you construct a comparison group that approximates this counterfactual.
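To make the logic concrete, here is a minimal sketch in Python of the simplest possible estimator: the difference in mean endline outcomes between treated and comparison units. The DataFrame, column names, and values are illustrative assumptions, and the estimate is only as credible as the comparison group's claim to be a valid counterfactual.

```python
import pandas as pd

# Hypothetical survey data: one row per household, with a treatment flag
# and an endline outcome (column names and values are illustrative).
df = pd.DataFrame({
    "treated": [1, 1, 1, 0, 0, 0],
    "dietary_diversity": [6.2, 5.8, 6.5, 5.1, 5.4, 4.9],
})

# The comparison group stands in for the counterfactual: its mean outcome
# approximates what treated households would have looked like untreated.
impact_estimate = (
    df.loc[df["treated"] == 1, "dietary_diversity"].mean()
    - df.loc[df["treated"] == 0, "dietary_diversity"].mean()
)
print(f"Estimated treatment effect: {impact_estimate:.2f}")
```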
Step 1: Plan at the design stage
Impact evaluations must be planned before the programme starts. Retrospective impact evaluation is rarely credible. Baseline data must be collected before the programme begins.
Step 2: Define the evaluation question
State precisely what outcome you are trying to measure, for whom, over what time period, and at what geographic level. Vague questions produce inconclusive evaluations.
Step 3: Choose a design
The design choice depends on whether random assignment is feasible:
- Randomised Controlled Trial (RCT): participants are randomly assigned to treatment or control. Gold standard for internal validity but costly and often ethically difficult
- Quasi-experimental designs: used when randomisation is not possible; options include difference-in-differences, propensity score matching, regression discontinuity, and interrupted time series (a minimal difference-in-differences sketch follows this list). See quasi-experimental design for details
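As a rough illustration of the quasi-experimental logic, the sketch below computes a difference-in-differences estimate from group-by-round means. The figures and labels are placeholders; a real analysis would work from unit-level panel data and test the parallel-trends assumption.

```python
import pandas as pd

# Illustrative panel: mean outcomes by group and survey round
# (numbers are placeholders, not from any real evaluation).
means = pd.DataFrame({
    "group": ["treatment", "treatment", "comparison", "comparison"],
    "round": ["baseline", "endline", "baseline", "endline"],
    "outcome": [4.8, 6.1, 4.9, 5.3],
}).set_index(["group", "round"])["outcome"]

# Difference-in-differences: change in the treatment group minus change
# in the comparison group, netting out the trend common to both.
did = (
    (means["treatment", "endline"] - means["treatment", "baseline"])
    - (means["comparison", "endline"] - means["comparison", "baseline"])
)
print(f"Difference-in-differences estimate: {did:.2f}")
```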
Step 4: Establish baseline
Collect data on outcomes for both treatment and comparison groups before the programme begins. This is non-negotiable. The two groups must be comparable at baseline; any differences should be documented and controlled for in the analysis.
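A minimal balance check might look like the following sketch, which compares baseline means across arms with a t-test. The variable names and values are hypothetical; in practice you would repeat this for every key covariate and report the results in a balance table.

```python
import pandas as pd
from scipy import stats

# Hypothetical baseline data for treatment and comparison households
# (column names and values are illustrative).
baseline = pd.DataFrame({
    "treated": [1, 1, 1, 1, 0, 0, 0, 0],
    "household_income": [210, 185, 240, 200, 205, 190, 230, 215],
})

# Simple balance check: compare baseline means across arms; meaningful
# differences should be documented and controlled for in the analysis.
treat = baseline.loc[baseline["treated"] == 1, "household_income"]
comp = baseline.loc[baseline["treated"] == 0, "household_income"]
t_stat, p_value = stats.ttest_ind(treat, comp)
print(f"Baseline mean (treatment): {treat.mean():.1f}")
print(f"Baseline mean (comparison): {comp.mean():.1f}")
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```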
Step 5: Implement with evaluation integrity
Monitor for contamination (comparison group accessing programme), attrition (losing study participants), and design fidelity (programme delivered as intended). These threats to validity must be managed throughout implementation.
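One simple monitoring check is differential attrition: the share of baseline units not found at follow-up, broken down by arm. The sketch below assumes a hypothetical tracking roster; field names are illustrative.

```python
import pandas as pd

# Hypothetical tracking roster: baseline households with a flag for
# whether each was re-interviewed at endline (names are illustrative).
roster = pd.DataFrame({
    "treated":       [1, 1, 1, 1, 0, 0, 0, 0],
    "found_endline": [1, 1, 0, 1, 1, 0, 0, 1],
})

# Attrition by arm: high or differential attrition between arms is a
# threat to validity and should be investigated, not just reported.
attrition = 1 - roster.groupby("treated")["found_endline"].mean()
print(attrition.rename({0: "comparison", 1: "treatment"}))
```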
Step 6: Collect follow-up data and analyse
Collect midline and endline data at pre-specified intervals. Analyse using the appropriate statistical methods for the chosen design. Report the treatment effect size with confidence intervals, not just significance tests.
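One common way to produce an effect estimate with a confidence interval is to regress the endline outcome on a treatment dummy, as in the sketch below (simulated placeholder data; statsmodels assumed available). More elaborate designs add baseline covariates or fixed effects, but the reporting principle is the same.

```python
import numpy as np
import statsmodels.api as sm

# Simulated endline outcomes for illustration only.
rng = np.random.default_rng(0)
treated = np.repeat([1, 0], 200)
outcome = 5.0 + 0.6 * treated + rng.normal(0, 1.2, size=400)

# Regress the outcome on a treatment dummy; the coefficient is the
# estimated treatment effect, reported with a 95% confidence interval
# rather than a bare significance star.
model = sm.OLS(outcome, sm.add_constant(treated)).fit()
effect = model.params[1]
ci_low, ci_high = model.conf_int()[1]
print(f"Treatment effect: {effect:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```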
Step 7: Interpret and communicate findings
A statistically significant effect is not the same as a practically meaningful one. Report effect sizes in terms decision-makers understand (absolute changes, percentage changes, lives affected) alongside statistical significance.
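Translating an estimate into absolute terms can be as simple as the arithmetic below; the coverage figures are illustrative assumptions, not drawn from any real programme.

```python
# Translate a percentage point effect into absolute terms for
# decision-makers (all figures below are illustrative assumptions).
eligible_households = 100_000
baseline_rate = 0.52   # share with the outcome before the programme
effect_pp = 0.23       # estimated effect in percentage points

additional_households = round(eligible_households * effect_pp)
print(f"Outcome prevalence: {baseline_rate:.0%} to {baseline_rate + effect_pp:.0%}")
print(f"Roughly {additional_households:,} additional households reached")
```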
Key Components
- Counterfactual: a credible comparison group that approximates what would have happened without the programme
- Baseline data: pre-intervention outcome measurements for both groups
- Primary outcome indicator: one or two key outcomes the evaluation is powered to detect
- Sample size calculation: determines how many participants are needed to detect an effect of expected magnitude
- Pre-registration: registering the evaluation design, hypotheses, and analysis plan before data collection (increasingly required by 3ie, J-PAL, and major donors)
- Follow-up data: midline and endline measurements at pre-specified intervals
- Analysis plan: pre-specified statistical methods to prevent data dredging
Best Practices
Commit to the counterfactual. The entire credibility of an impact evaluation depends on the quality of the comparison group. Random assignment is the gold standard; when it is not feasible, document carefully why and use the best available quasi-experimental design.
Mandate baseline data collection. Without a baseline there is no impact evaluation, only a before-after comparison, which cannot rule out trends that would have occurred anyway.
Power the study to detect realistic effects. Underpowered studies produce inconclusive results regardless of how well everything else is done. Work with a statistician to calculate minimum sample sizes based on expected effect sizes.
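As a rough illustration, a two-sample power calculation with statsmodels looks like the sketch below. The assumed effect size is a placeholder, and cluster-randomised designs need a further design-effect adjustment that this simple calculation does not capture.

```python
from statsmodels.stats.power import TTestIndPower

# Minimum sample size per arm to detect an assumed effect of 0.3
# standard deviations with 80% power at the 5% significance level
# (the effect size here is an illustrative assumption, not a benchmark).
analysis = TTestIndPower()
n_per_arm = analysis.solve_power(effect_size=0.3, power=0.8, alpha=0.05)
print(f"Required sample size per arm: {n_per_arm:.0f}")
```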
Use the same instruments across groups. Survey tools and questions must be identical between treatment and comparison groups to ensure comparability.
Pre-register the design. Pre-registration prevents selective reporting of positive findings and builds credibility with donors and policymakers. 3ie, AEA RCT Registry, and RIDIE are the main registries.
Common Mistakes
Starting too late. Impact evaluations designed after implementation begins cannot establish valid baselines. The most common and most costly mistake in impact evaluation is failure to plan prospectively.
Asking the impact evaluation to answer process questions. An impact evaluation tells you whether outcomes changed. It will not tell you why, for whom the effect varied, or what mechanisms produced it. Pair it with qualitative methods for process insights.
Inadequate attention to comparison group quality. Propensity score matching, difference-in-differences, and regression discontinuity all depend on assumptions that must be tested and reported. Presenting quasi-experimental results without discussing the plausibility of design assumptions is misleading.
Conflating statistical significance with programme success. A statistically significant effect of negligible magnitude is not a programme success. Report and interpret effect sizes.
Neglecting negative results. Null results are information. A well-conducted impact evaluation that finds no effect is valuable evidence. Suppress null results and you distort the evidence base.
Examples
Agricultural livelihoods, East Africa. A USDA-funded food security programme in Ethiopia used a quasi-experimental design with propensity score matching to evaluate impact on household dietary diversity and income. Baseline data was collected for 3,000 treatment households and 2,400 matched comparison households before programme start. Midline and endline surveys tracked outcomes over five years. The evaluation found a 0.8 standard deviation improvement in dietary diversity scores in treatment households relative to comparison, attributed to the programme. The effect was concentrated in female-headed households, prompting a design revision for the follow-on programme.
Health, West Africa. A USAID-funded malaria prevention programme in Nigeria used a cluster-randomised trial design, randomising 60 communities to treatment (free bednet distribution plus community health worker visits) or control (free bednets only). The evaluation found that adding community health worker visits produced a 23 percentage point increase in consistent bednet use relative to bednets alone, justifying the additional cost of the community health worker component in national scale-up planning.
Education, South Asia. A World Bank-supported learning improvement programme in Pakistan used a regression discontinuity design based on school-level test score rankings to evaluate impact on student achievement. Schools just below the eligibility threshold were compared to schools just above. The evaluation found a 0.4 standard deviation improvement in literacy scores among Grade 3 students in programme schools, with larger effects for girls and rural schools.
Compared To
| Approach | Causal Claim | Counterfactual | Suitable When |
|---|---|---|---|
| Impact Evaluation | Attributable effect | Explicit | Feasible counterfactual, scale decision |
| Quasi-Experimental Design | Attributable effect | Constructed | Randomisation not feasible |
| Contribution Analysis | Plausible contribution | None | Complex, multi-actor change |
| Process Tracing | Causal mechanism | None | Understanding how change happened |
| Realist Evaluation | Contextual mechanisms | Partial | What works, for whom |
Relevant Indicators
52 donor-aligned indicators across USAID, DFID, World Bank, 3ie, USDA, and Global Fund. Key examples:
- Net attributable change in primary outcome between baseline and endline (treatment vs. comparison)
- Effect size (Cohen's d or percentage point difference) at programme completion; a worked sketch of Cohen's d follows this list
- Proportion of evaluation hypotheses confirmed versus disconfirmed
- Fidelity score for programme implementation as designed
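For reference, Cohen's d is the difference in group means divided by the pooled standard deviation. A minimal sketch, with placeholder scores, follows.

```python
import numpy as np

def cohens_d(treatment: np.ndarray, comparison: np.ndarray) -> float:
    """Standardised mean difference using the pooled standard deviation."""
    n1, n2 = len(treatment), len(comparison)
    pooled_var = (
        (n1 - 1) * treatment.var(ddof=1) + (n2 - 1) * comparison.var(ddof=1)
    ) / (n1 + n2 - 2)
    return (treatment.mean() - comparison.mean()) / np.sqrt(pooled_var)

# Illustrative endline scores (placeholder values).
d = cohens_d(np.array([6.2, 5.8, 6.5, 6.0]), np.array([5.1, 5.4, 4.9, 5.3]))
print(f"Cohen's d: {d:.2f}")
```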
Related Tools
- Evaluation Planner: structure your evaluation design and timeline from programme start
- Indicator Library: find donor-aligned outcome indicators for your sector
Related Topics
- Quasi-Experimental Design, the most common alternative when RCTs are not feasible
- Contribution Analysis, for when a counterfactual cannot be constructed
- Baseline Design, the foundational data collection without which no impact evaluation is possible
- Attribution vs. Contribution, understanding the distinction between impact evaluation and contribution claims
- Mixed Methods Evaluation, pairing quantitative impact estimates with qualitative process insights
Further Reading
- Gertler, P., Martinez, S., Premand, P., Rawlings, L., & Vermeersch, C. (2016). Impact Evaluation in Practice. 2nd ed. World Bank. The most accessible practitioner guide.
- White, H. (2014). Current Challenges in Impact Evaluation. 3ie Working Paper 18. Reviews methodological debates.
- J-PAL (2019). Introduction to Evaluations. Poverty Action Lab. Free online course covering RCT design.
- USAID (2016). Evaluation: Learning from Experience. ADS 203. USAID's policy on evaluation including impact evaluation requirements.