RCT vs Quasi-Experimental Design

When to use a randomized controlled trial vs a quasi-experimental design. Feasibility, cost, rigor, and what each can actually tell you about your program's impact.

At a Glance

Factor | RCT | Quasi-Experimental
------ | --- | ------------------
Causal evidence | Very strong (gold standard) | Moderate to strong
Requires randomization | Yes | No
Comparison group | Randomly assigned control | Matched or naturally occurring
Typical cost | $100K-500K+ | $30K-150K
Timeline | 2-5 years | 1-3 years
Statistical expertise | High | High
Best for | Standardized, simple interventions | Programs where randomization was not possible
Handles complexity | Poorly | Better (more flexible designs)
Donor acceptance | Universally accepted | Widely accepted for impact evidence

Both designs try to answer the same question: did the program cause the change? The difference is how they construct the counterfactual. An RCT creates it through random assignment. A quasi-experimental design approximates it through statistical methods and naturally occurring comparison groups.

When an RCT Is Feasible

An RCT requires specific conditions. If any of these are missing, stop and look at quasi-experimental alternatives.

You can randomize. The program has not yet reached everyone. A phased rollout, lottery-based selection, or resource constraint creates a natural opportunity to randomize who receives the program first.

The intervention is standardized. Everyone in the treatment group gets roughly the same thing. If the program adapts significantly by site, an RCT measures the average of many different treatments, which is often not useful.

The sample is large enough. Statistical power calculations tell you how many units (individuals, schools, villages) you need. Most cluster-randomized trials need 30+ clusters per arm. If you have 8 districts, an RCT will not produce meaningful results. See How to Choose Sample Size for the math.
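
If you want to sanity-check sample size before committing, a basic power calculation takes only a few lines. The sketch below uses Python's statsmodels; the effect size, significance level, and power target are illustrative assumptions, not recommendations for your program.

```python
# A minimal power-calculation sketch for a two-arm comparison of means.
# effect_size, alpha, and power below are assumed values for illustration.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_arm = analysis.solve_power(effect_size=0.3, alpha=0.05, power=0.8)
print(f"Individuals needed per arm (individual randomization): {n_per_arm:.0f}")
```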

Ethics allow it. Withholding a proven, life-saving intervention from a control group is not ethical. Phased rollout designs, where the control group receives the program later, address this. But if the intervention must reach everyone immediately, randomization is off the table.

The budget supports it. Evaluation costs alone typically run $100K-500K+. That does not include program implementation costs. If your total program budget is $500K, spending half on evaluation makes no sense.

When a Quasi-Experimental Design Fits Better

Most development evaluations land here. The program has already started, randomization was not planned, but comparison data exists. That is the normal situation.

The program was not randomized at the start. This is the most common scenario. The program targeted specific areas based on need, political decisions, or partner capacity. You cannot undo that selection, but you can account for it statistically.

A natural comparison group exists. Non-program areas, people who were eligible but did not participate, or communities on a waiting list. The comparison does not need to be perfect. It needs to be plausible after adjustments.

Baseline data was collected. Most quasi-experimental designs require pre-program data. If you only have endline data, your options narrow significantly.

Budget is moderate. $30K-150K covers most quasi-experimental evaluations, including primary data collection if needed.

The Four Main QED Approaches

Difference-in-Differences (DID)

Compare the change over time in program areas versus comparison areas. If stunting dropped 5 percentage points in program areas but only 1 point in comparison areas, the estimated program effect is 4 points.

What you need: Baseline and endline data for both groups. At minimum two time points, though more is better.

Key assumption: Both groups would have followed the same trend without the program (parallel trends). Check this by comparing pre-program trends if you have the data.

When it works best: Programs that target geographic areas, where routine data exists in both program and non-program sites.
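
In regression form, the DID estimate is the coefficient on a treatment-by-period interaction. The sketch below reproduces the stunting example above with made-up numbers; the column names and data are illustrative assumptions, and a real analysis would also cluster standard errors at the level of assignment.

```python
# A minimal difference-in-differences sketch with illustrative data.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "stunting": [40, 35, 41, 40, 38, 33, 39, 38],  # outcome (%)
    "treated":  [1, 1, 0, 0, 1, 1, 0, 0],          # 1 = program area
    "post":     [0, 1, 0, 1, 0, 1, 0, 1],          # 1 = endline round
})

# The coefficient on treated:post is the DID estimate of the program effect
# (about -4 percentage points with these numbers).
model = smf.ols("stunting ~ treated + post + treated:post", data=df).fit()
print(model.params["treated:post"])
```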

Propensity Score Matching (PSM)

Match each participant with a non-participant who looks statistically similar on observable characteristics (age, income, location, education). Compare outcomes between matched pairs.

What you need: Rich data on characteristics that predict program participation. The more variables, the better the match.

Key assumption: All the factors that determine who participates are captured in your data. If unobserved factors (motivation, political connections) drive participation, PSM cannot fix the bias.

When it works best: Individual-level programs (training, cash transfers) where you have survey data on both participants and non-participants.
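
In practice, PSM comes down to estimating a participation model and matching on the predicted score. The sketch below is one minimal way to do it with scikit-learn on simulated data; the variable names and 1:1 nearest-neighbour matching are assumptions, and a real PSM analysis also needs balance checks and a common-support restriction.

```python
# A minimal propensity score matching sketch on simulated data.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "age":    rng.integers(18, 60, n),
    "income": rng.normal(100, 20, n),
})
# Participation depends only on observables here (the key PSM assumption).
p = 1 / (1 + np.exp(-(0.03 * df["age"] - 0.02 * df["income"])))
df["participated"] = rng.binomial(1, p)
df["outcome"] = 2.0 * df["participated"] + 0.05 * df["income"] + rng.normal(0, 1, n)

# 1. Estimate propensity scores from observed characteristics.
X = df[["age", "income"]]
df["pscore"] = LogisticRegression().fit(X, df["participated"]).predict_proba(X)[:, 1]

# 2. Match each participant to the nearest non-participant on the score.
treated = df[df["participated"] == 1]
control = df[df["participated"] == 0]
nn = NearestNeighbors(n_neighbors=1).fit(control[["pscore"]])
_, idx = nn.kneighbors(treated[["pscore"]])

# 3. Average treated-vs-matched-control gap (effect on participants).
att = (treated["outcome"].values - control["outcome"].values[idx[:, 0]]).mean()
print(f"Estimated effect on participants: {att:.2f}")
```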

Regression Discontinuity (RD)

When eligibility depends on a score or threshold (income below a cutoff, test scores above a line), compare people just above and just below. Those near the cutoff are essentially similar, creating a natural experiment.

What you need: A clear eligibility threshold and data on the running variable (the score that determines eligibility).

Key limitation: Results only apply to people near the cutoff, not the entire population. If your program targets the poorest 20%, RD tells you about the effect for people around the 20th percentile, not for the poorest 5%.

When it works best: Targeted programs with score-based eligibility. Check whether your program uses any kind of ranking or threshold before defaulting to other designs.
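
A sharp RD estimate is the jump in the outcome at the cutoff, fitted with separate slopes on each side within a bandwidth. The sketch below shows the idea on simulated data; the cutoff, bandwidth, and variable names are assumptions, and serious RD work uses data-driven bandwidth selection and checks for manipulation of the score around the threshold.

```python
# A minimal sharp regression discontinuity sketch on simulated data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 2000
df = pd.DataFrame({"poverty_score": rng.uniform(0, 100, n)})
cutoff = 50
df["eligible"] = (df["poverty_score"] < cutoff).astype(int)  # below cutoff gets the program
df["centered"] = df["poverty_score"] - cutoff
df["outcome"] = 10 + 3 * df["eligible"] - 0.05 * df["centered"] + rng.normal(0, 2, n)

# Keep only observations near the threshold and fit separate slopes on each
# side; the coefficient on `eligible` is the jump at the cutoff.
bandwidth = 10
local = df[df["centered"].abs() <= bandwidth]
model = smf.ols("outcome ~ eligible + centered + eligible:centered", data=local).fit()
print(f"Estimated effect at the cutoff: {model.params['eligible']:.2f}")
```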

Interrupted Time Series (ITS)

Analyze trends in an outcome before and after the program starts, using many pre-program data points to establish what the trend would have looked like without the program.

What you need: At least 8-10 data points before the intervention. Monthly health facility data, quarterly education statistics, or annual survey rounds.

Key assumption: Nothing else changed at the same time as the program that could explain the shift in trend. If a new national policy launched the same month, ITS cannot separate the two effects.

When it works best: Programs with strong routine monitoring data but no comparison group. Health system interventions are a common application because facility data often has long time series.
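
The standard ITS model is a segmented regression with a level-shift term at the intervention date and a change-in-slope term afterwards. The sketch below shows that structure on simulated monthly data; the variable names and intervention month are assumptions, and a real analysis should also handle seasonality and autocorrelation.

```python
# A minimal interrupted time series (segmented regression) sketch.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
months = np.arange(36)                  # 24 pre-intervention, 12 post
intervention_month = 24
df = pd.DataFrame({
    "time": months,
    "post": (months >= intervention_month).astype(int),
})
df["time_since"] = np.where(df["post"] == 1, df["time"] - intervention_month, 0)
df["visits"] = 200 + 1.5 * df["time"] + 30 * df["post"] + rng.normal(0, 10, 36)

# `post` captures the immediate level shift at the intervention;
# `time_since` captures any change in slope afterwards.
model = smf.ols("visits ~ time + post + time_since", data=df).fit()
print(model.params[["post", "time_since"]])
```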

Cost Comparison

Component | RCT | Quasi-Experimental
--------- | --- | ------------------
Design and protocol | $15K-40K | $8K-20K
Baseline data collection | $30K-150K | $15K-60K (often uses existing data)
Endline data collection | $30K-150K | $15K-60K
Analysis | $15K-40K | $10K-30K
Midline (if included) | $20K-80K | $10K-40K
IRB/ethical review | $2K-10K | $2K-5K
Total range | $100K-500K+ | $30K-150K

The biggest cost driver is primary data collection. If a quasi-experimental design can use existing administrative or routine monitoring data, costs drop dramatically. DID using health facility records might cost $30K-50K total. The same question answered with an RCT requiring household surveys could cost $200K+.

Common Ways Each Goes Wrong

RCT Failures

Contamination. The control group gets access to the program (or something similar) from another source. Your treatment-control contrast collapses.

Attrition. People drop out of the study at different rates in treatment and control groups. The remaining sample is no longer comparable.

Underpowered. The sample was too small to detect the expected effect. You finish the study and find "no significant effect," but the real problem is that the study could not have detected an effect even if one existed.

Hawthorne effects. People change behavior because they know they are being studied, not because of the program.

QED Failures

Bad comparison group. The comparison group differs from the program group in ways your statistical model does not capture. The results look like a program effect but are actually a selection effect.

Parallel trends violated. In DID, if the comparison group was already on a different trajectory before the program, the estimated effect is biased. Always plot pre-program trends for both groups.
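
That plot is quick to produce if you have two or more pre-program rounds. The sketch below is a minimal version with matplotlib; the column names and numbers are illustrative assumptions.

```python
# A minimal pre-trend check: plot the outcome over pre-program rounds for
# both groups and look for roughly parallel lines.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "round":    [2018, 2019, 2020, 2018, 2019, 2020],
    "group":    ["program"] * 3 + ["comparison"] * 3,
    "stunting": [44, 42, 40, 45, 43, 41],   # pre-program rounds only
})

for name, grp in df.groupby("group"):
    plt.plot(grp["round"], grp["stunting"], marker="o", label=name)
plt.xlabel("Survey round (pre-program)")
plt.ylabel("Stunting (%)")
plt.legend()
plt.show()
```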

Overfitting in PSM. Matching on too many variables with a small sample produces matches that look good statistically but are meaningless practically.

Confounding events in ITS. A policy change, economic shock, or other program launches at the same time as your intervention. ITS cannot separate the effects.

Decision Guide

Work through these questions in order.

1. Can you randomize?

  • Yes, ethically and practically: Consider an RCT. But check that your sample size is sufficient and your budget supports it.
  • No: Move to quasi-experimental options.

2. Do you have baseline data?

  • Yes, for both program and comparison areas: DID is your strongest option.
  • Yes, with a score-based eligibility cutoff: Check if regression discontinuity works.
  • Yes, with many pre-program time points but no comparison group: Consider ITS.
  • No baseline data: PSM with endline data only (weaker), or switch to theory-based approaches.

3. What is your budget?

  • Over $100K and the question demands causal attribution: RCT or strong QED with primary data collection.
  • $30K-100K: QED using existing data where possible. DID with routine data is often the best value.
  • Under $30K: Do not attempt either. Use contribution analysis or other theory-based approaches. See How to Choose Evaluation Methodology.

4. How standardized is the program?

  • Same intervention everywhere: Either design works.
  • Varies significantly by site: QED handles variation better. An RCT measures the average effect across variations, which may not be useful for any specific site.

Use the Evaluation Designer to structure your design once you have made the choice, or the Method Selector to explore alternatives if none of these fit.

Common Mistakes

Mistake 1: Treating "quasi-experimental" as "RCT lite." QED is not a weaker version of an RCT. It is a different family of designs suited to different conditions. A well-executed DID can produce highly credible evidence. A poorly executed RCT with contamination and attrition produces garbage.

Mistake 2: Choosing DID without checking parallel trends. DID requires that the treatment and comparison groups were following the same trajectory before the program. If you cannot show this with data, your DID estimate is unreliable. Plot pre-program trends for both groups. If they diverge, DID is not your design.

Mistake 3: Defaulting to an RCT because the donor asked for "rigorous evidence." Rigorous evidence is not synonymous with RCT. Most donors accept well-designed quasi-experimental evaluations. Ask the donor what they actually need. "Credible evidence of impact" can come from DID or PSM, not only from randomization.

Mistake 4: Ignoring the design effect in cluster-randomized trials. If you randomize at the village or school level but measure individuals, you need far more units than individual-level randomization suggests. A 200-person sample might require 40+ clusters. See How to Choose Sample Size.
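
As a rough illustration of that arithmetic, the design effect is 1 + (m - 1) x ICC, where m is the cluster size and ICC is the intracluster correlation. The sketch below plugs in assumed values (they are not estimates for any particular program) to show how a 200-person individual-level sample inflates to something close to 40 clusters.

```python
# Design effect arithmetic for a cluster-randomized design.
# The ICC and cluster size are assumed values for illustration only.
icc = 0.10            # intracluster correlation (assumed)
cluster_size = 10     # individuals measured per cluster (assumed)
n_individual = 200    # sample needed under individual-level randomization

deff = 1 + (cluster_size - 1) * icc           # design effect = 1.9
n_clustered = n_individual * deff             # 380 individuals
clusters_needed = n_clustered / cluster_size  # 38 clusters in total
print(f"DEFF={deff:.1f}, n={n_clustered:.0f}, clusters={clusters_needed:.0f}")
```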

Mistake 5: Running a QED with a bad comparison group and calling it rigorous. A comparison group that differs systematically from the treatment group in ways your model does not capture is worse than no comparison group at all. It gives you a precise but biased estimate. If you cannot find a credible comparison, use theory-based methods instead of forcing a bad QED.
