When to Use
Quasi-experimental designs (QEDs) sit between experimental designs (RCTs) and purely descriptive evaluations. They attempt to answer "Did the programme cause this change?" without random assignment. Use them when:
- Random assignment is not feasible: ethical concerns, operational constraints, or political resistance prevent randomisation, but causal attribution is still needed
- A natural comparison group exists: programme eligibility rules, phase-in schedules, or geographic boundaries create groups that differ only in programme exposure
- Administrative data is available: government registers, health records, or school enrolment data allow retrospective matching and comparison
- A natural experiment occurred: a policy change, eligibility threshold, or external shock creates quasi-random variation in programme exposure that can be exploited
- Donors require attribution evidence: USAID, USDA, and the World Bank accept credible quasi-experimental designs as evidence of programme effectiveness
QEDs are not appropriate when no credible comparison group can be constructed, when the design assumptions cannot be tested or defended, or when process questions (why and how) are more important than causal attribution (use contribution analysis or process tracing in those cases).
| Scenario | Use QED? | Better Alternative |
|---|---|---|
| Ethical or logistical barrier to RCT | Yes | — |
| Natural eligibility threshold exists | Yes (regression discontinuity) | — |
| Phase-in rollout possible | Yes (difference-in-differences) | — |
| No comparison group feasible | No | Contribution Analysis |
| Process questions are primary | No | Process Tracing |
| Donor requires gold-standard evidence | No | RCT |
How It Works
There is no single quasi-experimental design; QED is a family of approaches, each suited to different data situations and assumptions. The four main designs are:
Design 1: Difference-in-Differences (DiD)
Compare the change in outcomes over time in a treatment group against the change in a comparison group that did not receive the programme. The DiD estimate is the "double difference": (treatment post − treatment pre) minus (comparison post − comparison pre). Key assumption: in the absence of the programme, both groups would have experienced similar trends ("parallel trends"). Requires panel data on both groups at baseline and follow-up.
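The double difference can be estimated as a regression with a treatment-by-period interaction. A minimal sketch, assuming a household panel in a hypothetical file `panel.csv` with illustrative columns `household_id`, `treated` (0/1), `post` (0/1), and `outcome`; the coefficient on the interaction is the DiD estimate:

```python
# Minimal difference-in-differences sketch (illustrative; file and column
# names are hypothetical). One row per household per period.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("panel.csv")  # columns: household_id, treated, post, outcome

# The coefficient on treated:post is the double difference:
# (treatment post - treatment pre) - (comparison post - comparison pre)
model = smf.ols("outcome ~ treated + post + treated:post", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["household_id"]}  # cluster SEs by household
)
print("DiD estimate:", model.params["treated:post"])
print(model.summary())
```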
Design 2: Propensity Score Matching (PSM)
Match each programme participant to one or more non-participants who are statistically similar on observed characteristics. Compare outcomes between matched pairs. The PSM estimate is the "average treatment effect on the treated" (ATT). Key assumption: all variables that determine both programme participation and outcomes are observable and included in the matching model.
To implement PSM: collect baseline data on a wide range of characteristics for both participants and non-participants; estimate a logistic regression model predicting programme participation; use the predicted probabilities (propensity scores) to match participants and non-participants; verify balance; compare outcomes.
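A minimal sketch of these steps, assuming a hypothetical `baseline.csv` with a `participant` indicator, an `outcome` column, and illustrative covariate names; 1:1 nearest-neighbour matching on the estimated propensity score is one of several common matching approaches:

```python
# Illustrative PSM sketch (file, column, and covariate names are assumptions).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

df = pd.read_csv("baseline.csv")
covariates = ["hh_size", "land_area", "head_age", "head_educ"]  # assumed baseline variables

# 1. Propensity scores: predicted probability of participation given covariates
df["pscore"] = (
    LogisticRegression(max_iter=1000)
    .fit(df[covariates], df["participant"])
    .predict_proba(df[covariates])[:, 1]
)

# 2. 1:1 nearest-neighbour matching on the propensity score
treated = df[df["participant"] == 1]
control = df[df["participant"] == 0]
nn = NearestNeighbors(n_neighbors=1).fit(control[["pscore"]])
_, idx = nn.kneighbors(treated[["pscore"]])
matched_control = control.iloc[idx.ravel()]

# 3. Verify covariate balance before comparing outcomes (see Best Practices below)

# 4. ATT: mean outcome difference on the matched sample
att = treated["outcome"].mean() - matched_control["outcome"].mean()
print("ATT estimate:", att)
```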
Design 3: Regression Discontinuity (RD)
Exploit a threshold in a continuous eligibility criterion to compare units just on the eligible side of the threshold against units just on the ineligible side. The RD estimate applies only to those near the threshold. Key assumption: units cannot precisely manipulate their score to land on the eligible side of the threshold. Requires a large sample near the threshold and a continuous running variable.
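A minimal sharp-RD sketch under assumed names (a hypothetical `schools.csv`, a running variable `score`, an outcome `pass_rate`, and an illustrative cutoff and bandwidth); the coefficient on the eligibility indicator is the estimated jump at the threshold:

```python
# Illustrative sharp regression discontinuity sketch (all names and values
# are assumptions). Local linear regression within a bandwidth of the cutoff.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("schools.csv")            # columns: score (running variable), pass_rate
cutoff, bandwidth = 50.0, 10.0             # assumed threshold and bandwidth

df["centred"] = df["score"] - cutoff
df["eligible"] = (df["centred"] >= 0).astype(int)  # direction depends on the eligibility rule
local = df[df["centred"].abs() <= bandwidth]

# Separate slopes on each side of the cutoff; the effect is the jump at centred = 0
model = smf.ols("pass_rate ~ eligible + centred + eligible:centred", data=local).fit()
print("RD estimate at the threshold:", model.params["eligible"])
```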
Design 4: Interrupted Time Series (ITS)
Analyse a long time series of outcomes before and after programme introduction, controlling for pre-existing trends. Useful when a single policy or programme is introduced at a specific point in time and administrative data provides many pre-intervention time points. Works without a comparison group but is strengthened by including one.
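A minimal segmented-regression sketch, assuming monthly data in a hypothetical `monthly_rates.csv` with a `month_index` and an outcome column; `post` captures the level change at programme introduction and `months_since` the change in slope relative to the pre-existing trend:

```python
# Illustrative interrupted time series (segmented regression) sketch;
# file name, column names, and the intervention month are assumptions.
import pandas as pd
import statsmodels.formula.api as smf

ts = pd.read_csv("monthly_rates.csv")      # columns: month_index, facility_delivery_rate
intervention_month = 36                    # assumed programme start

ts["post"] = (ts["month_index"] >= intervention_month).astype(int)
ts["months_since"] = (ts["month_index"] - intervention_month).clip(lower=0)

# month_index: pre-existing trend; post: level change; months_since: slope change
model = smf.ols(
    "facility_delivery_rate ~ month_index + post + months_since", data=ts
).fit(cov_type="HAC", cov_kwds={"maxlags": 3})  # Newey-West SEs for autocorrelation
print(model.summary())
```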
Key Components
- Comparison group: a group not receiving the programme whose outcomes can be compared to participants
- Baseline data on both groups: pre-programme outcome and covariate measurements for both treatment and comparison groups
- Identical or comparable instruments: the same survey tools used for both groups at every data collection point
- Balance testing: statistical tests confirming the treatment and comparison groups are comparable at baseline on observed characteristics
- Design assumption testing: explicit tests of the key identifying assumptions (parallel trends, common support for PSM, threshold manipulation tests for RD)
- Sensitivity analysis: testing whether the treatment effect estimate changes under alternative model specifications
- Additional time-invariant measures: baseline variables not expected to change, included to improve matching quality
Best Practices
Maximise comparability through identical instruments. Treatment and comparison group data must be collected using the same survey instruments, at the same time, by the same (or equivalently trained) enumerators. Any difference in data collection contaminates the comparison.
Test and report balance, not just match. PSM is not complete once matching is done; you must test whether matched groups are actually balanced on key variables and report the results. Unbalanced matched samples indicate the matching model needs revision.
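A small sketch of such a balance check, building on the hypothetical objects from the PSM sketch above (`covariates`, `treated`, `control`, `matched_control`); standardised mean differences below 0.10 are a common benchmark:

```python
# Illustrative balance check: standardised mean difference before and after
# matching, per covariate (inputs assumed from the earlier PSM sketch).
import numpy as np

def smd(treated_col, control_col):
    """Standardised mean difference: difference in means over pooled SD."""
    pooled_sd = np.sqrt((treated_col.var() + control_col.var()) / 2)
    return (treated_col.mean() - control_col.mean()) / pooled_sd

for c in covariates:
    print(f"{c}: SMD before = {smd(treated[c], control[c]):.3f}, "
          f"after = {smd(treated[c], matched_control[c]):.3f}")
```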
Pre-specify the primary analysis. Document the intended analysis method, covariates, and outcome specification before data collection. This prevents post-hoc model selection that inflates false positive rates.
Include time-invariant variables in matching. Adding variables that are stable over time (e.g. land ownership, ethnicity, household composition at baseline) improves match quality and reduces bias.
Report design limitations honestly. Every QED involves untestable assumptions. A credible evaluation report states these assumptions clearly and explains why they are reasonable given the context.
Common Mistakes
Treating PSM as sufficient without balance testing. Matching by propensity score does not guarantee balance. Always test covariate balance post-matching and re-match if balance is poor.
Ignoring the parallel trends assumption in DiD. Difference-in-differences estimates are invalid if treatment and comparison groups had different pre-programme trends. Test for parallel trends using pre-programme data if available.
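One simple check, assuming the hypothetical `panel.csv` from the DiD sketch above also contains a pre-programme time index `period`: restrict to pre-programme rows and test whether the treatment group's trend differs from the comparison group's:

```python
# Illustrative pre-trend check (column names are assumptions). A significant
# treated:period interaction in pre-programme data is a warning sign for the
# parallel trends assumption.
import pandas as pd
import statsmodels.formula.api as smf

pre = pd.read_csv("panel.csv").query("post == 0")   # pre-programme rows only
model = smf.ols("outcome ~ treated + period + treated:period", data=pre).fit()
print(model.summary().tables[1])   # inspect the treated:period coefficient
```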
Using a geographically proximate comparison group without spillover controls. If comparison group households can observe or interact with treatment households, contamination biases the estimate toward zero.
Claiming the QED is "as good as an RCT." Quasi-experimental designs make additional assumptions that RCTs do not. Clearly state the design and its assumptions; do not oversell the causal warrant.
Retrospective data fishing. Using existing datasets without a pre-specified analysis plan creates opportunities for model selection that produces false positive findings. Pre-register the analysis wherever possible.
Examples
Food security, Latin America. A USDA-funded programme in Honduras used propensity score matching to evaluate impact on household food security scores. Baseline data included 40 variables on household demographics, assets, and agricultural practices for 2,400 treatment and 2,400 comparison households. After matching, standardised mean differences for all 40 variables fell below 0.10, indicating good balance. A difference-in-differences estimate on the matched sample showed a 0.6 standard deviation improvement in food security scores at endline among treatment households relative to matched comparisons.
Education, East Africa. A school improvement programme in Kenya used regression discontinuity based on district poverty scores that determined programme eligibility. Schools scoring just below the eligibility threshold (eligible) were compared to schools just above (ineligible). Analysis of national exam score data showed a 3.8 percentage point improvement in pass rates among eligible schools relative to ineligible schools at the threshold, with no evidence of score manipulation near the threshold.
Health, South Asia. A DFID-funded community health programme in Bangladesh used interrupted time series analysis of monthly facility delivery rates across 120 intervention sub-districts, with 60 matched comparison sub-districts serving as the comparison series. The ITS model estimated a 12 percentage point increase in facility delivery rates attributable to the programme, above the pre-existing trend, with the effect sustained over 24 months post-introduction.
Compared To
| Design | Randomisation | Counterfactual | Key Assumption |
|---|---|---|---|
| QED (PSM) | None | Matched non-participants | All confounders observed |
| QED (DiD) | None | Comparison group trend | Parallel trends absent the programme |
| QED (RD) | None | Just-ineligible units at the threshold | No score manipulation |
| RCT | Random | Randomised control group | Randomisation integrity |
| Contribution Analysis | None | None | Plausible causal story |
Relevant Indicators
38 indicators across USAID, World Bank, USDA, and 3ie frameworks. Key examples:
- Standardised mean difference on key baseline variables between treatment and comparison groups (target < 0.10)
- Difference-in-differences treatment effect estimate with 95% confidence interval
- Common support percentage (proportion of treatment group with matched comparison units in PSM)
- Number of pre-programme periods used to test parallel trends assumption
Related Tools
- Evaluation Planner: structure baseline data collection and comparison group selection
- Indicator Library: identify appropriate outcome measures for your evaluation
Related Topics
- Impact Evaluation, the broader category that includes both RCTs and quasi-experimental designs
- Baseline Design, collecting the data that enables quasi-experimental analysis
- Sampling Methods, how to sample treatment and comparison populations
- Statistical Significance, interpreting p-values and confidence intervals in evaluation analysis
- Attribution vs. Contribution, when QED is appropriate versus contribution analysis
Further Reading
- Gertler, P. et al. (2016). Impact Evaluation in Practice. 2nd ed. World Bank. Chapters 5-8 cover quasi-experimental designs with accessible explanations.
- Rosenbaum, P. & Rubin, D. (1983). "The Central Role of the Propensity Score in Observational Studies for Causal Effects." Biometrika, 70(1), 41-55. The foundational PSM paper.
- Imbens, G. & Lemieux, T. (2008). "Regression Discontinuity Designs: A Guide to Practice." Journal of Econometrics, 142(2), 615-635. The standard RD reference.
- 3ie (2012). Quasi-Experimental Designs for Development Evaluations. Impact Evaluation Series. Practical guidance for development practitioners.