How to Choose an Evaluation Methodology

A decision framework for choosing an evaluation design. Covers experimental, quasi-experimental, and non-experimental approaches.

How to Choose the Right Evaluation Design

Most evaluation design mistakes happen because people choose a method before clarifying what they need to know. "We want to do an RCT" is not a starting point. "We need to know whether our training program reduced stunting" is. The method follows from the question, the context, and your constraints. Get those right first.

Before you open a methods textbook, answer three questions in order.

Question 1: What do you need to know?

Your evaluation question determines the entire design. If you need to prove causation ("Did this program cause the change?"), you need a counterfactual. If you need to understand contribution ("Did this program help?"), theory-based designs work. If you need to understand implementation ("Was this delivered as planned?"), a process evaluation is enough. Do not over-engineer the design beyond the question you are actually asking.

Question 2: Is a counterfactual feasible?

A counterfactual answers: "What would have happened without the program?" If you can construct one through randomization or a natural comparison group, experimental and quasi-experimental design options open up. If not, theory-based approaches are your path. Many programs, especially those already operating at scale, cannot construct a credible counterfactual. That is fine. It simply means you use a different family of designs.

Question 3: What are your constraints?

Budget, timeline, ethics, politics, data availability, and donor requirements all shape what is feasible. An RCT that costs $300K and takes 3 years is not viable for a $500K, 2-year program. A difference-in-differences design requires baseline data you may not have collected. Be honest about constraints early. A well-executed modest design beats a poorly executed ambitious one every time.

The Three Design Families

Experimental (RCT)

What it is: Randomly assign people, communities, schools, or facilities to receive the program or not. Compare outcomes between the two groups. Any difference is attributable to the program because randomization ensures the groups are equivalent at the start.

Most NGO evaluations do not need an RCT. That is not a controversial statement. It reflects the reality that RCTs require conditions most programs cannot meet: phased rollout, large sample sizes, substantial budgets, and stable interventions. When those conditions exist, an RCT is the gold standard for causal attribution. When they do not, forcing an RCT produces bad science at high cost.

When to use it:

  • Program is being rolled out in phases (randomize who gets it first)
  • Resources are limited and cannot reach everyone at once (natural rationing)
  • Donor requires rigorous impact evidence (USAID, DFID/FCDO, 3ie)
  • The program is relatively simple and standardized
  • Budget allows ($100K-500K+ for the evaluation alone)
  • Ethical review approves withholding the program from the control group

When it does not work:

  • Program already covers the entire target population (no control group possible)
  • Ethical concerns about withholding a beneficial intervention
  • Program is highly adaptive (RCTs require a stable treatment)
  • Sample size is too small for statistical power
  • The question is "why" not "how much" (RCTs tell you impact, not mechanisms)
Strengths:

  • Strongest causal evidence
  • Clear, credible results
  • Widely accepted by donors and policymakers

Limitations:

  • Expensive and time-consuming
  • Requires randomization (not always ethical/feasible)
  • Tells you "what" not "why"
  • Requires stable intervention and large samples

Quasi-Experimental

What it is: Uses a comparison group, but without randomization. Instead, statistical techniques account for the differences between groups. These designs sit between the rigor of an RCT and the flexibility of non-experimental approaches. They work well when you have a natural comparison group and decent data, but randomization was never possible or the opportunity to randomize has passed.

Common approaches:

Difference-in-Differences (DID): Compare the change over time in the treatment group to the change over time in a comparison group. Requires data from before and after the program for both groups. Assumes both groups would have followed the same trend without the program (parallel trends assumption). This is probably the most common quasi-experimental design in development evaluation because many programs collect baseline and endline data in both program and non-program areas.
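
The core DID calculation is simple enough to sketch in a few lines. The numbers below are hypothetical stunting prevalence figures, not from any real program, and a real analysis would typically use a regression with standard errors rather than raw group means.

```python
# Minimal difference-in-differences sketch using hypothetical group means.
# Stunting prevalence (%) at baseline and endline, program vs. comparison areas.
treat_baseline, treat_endline = 34.0, 28.0   # program districts
comp_baseline, comp_endline = 33.0, 31.5     # comparison districts

change_treat = treat_endline - treat_baseline  # -6.0 percentage points
change_comp = comp_endline - comp_baseline     # -1.5 percentage points

# The DID estimate: how much more the treatment group changed than the
# comparison group, assuming parallel trends would otherwise have held.
did_estimate = change_treat - change_comp      # -4.5 percentage points
print(f"DID estimate: {did_estimate:+.1f} percentage points")
```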

Propensity Score Matching (PSM): Match each program participant with a non-participant who is statistically similar on observable characteristics. Compare outcomes between matched pairs. Requires good data on characteristics that predict participation. PSM handles selection bias on observables but cannot address unobserved differences between groups.
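
As an illustration only, the sketch below estimates a propensity score with scikit-learn's logistic regression on synthetic data and matches each participant to the nearest non-participant; the variables and data-generating step are invented for the example.

```python
# Illustrative propensity score matching on synthetic data (not a real dataset).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 3))                                 # observable characteristics
participated = (X[:, 0] + rng.normal(size=n) > 0).astype(int)
outcome = 2.0 * participated + X[:, 0] + rng.normal(size=n)  # true effect = 2.0

# Step 1: estimate each unit's propensity to participate from observables.
ps = LogisticRegression().fit(X, participated).predict_proba(X)[:, 1]

# Step 2: match each participant to the non-participant with the closest score.
treated = np.where(participated == 1)[0]
controls = np.where(participated == 0)[0]
diffs = [outcome[i] - outcome[controls[np.argmin(np.abs(ps[controls] - ps[i]))]]
         for i in treated]

# Should roughly recover the true effect, since selection is on observables only.
print(f"Estimated effect on the treated: {np.mean(diffs):.2f}")
```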

Regression Discontinuity (RD): When program eligibility is based on a cutoff (e.g., income below $2/day), compare people just above and just below the cutoff. Very strong causal evidence for the group near the cutoff. Limited because it only tells you about the effect at the threshold, not for the whole population. If your program uses any kind of score-based targeting, check whether RD is an option before defaulting to other designs.
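
A toy sketch of the logic, with an invented eligibility score and cutoff: compare average outcomes in a narrow band on either side of the threshold. A real RD analysis would use local regression and sensitivity checks on the bandwidth.

```python
# Toy regression-discontinuity comparison around a hypothetical cutoff (score < 40 = eligible).
import numpy as np

rng = np.random.default_rng(1)
score = rng.uniform(0, 100, 5000)
eligible = score < 40
# Hypothetical outcome: rises with the score, plus a +5 jump for eligible units.
outcome = 0.1 * score + 5.0 * eligible + rng.normal(scale=2.0, size=score.size)

bandwidth = 5  # only compare observations close to the cutoff
just_below = outcome[(score >= 40 - bandwidth) & (score < 40)]   # eligible side
just_above = outcome[(score >= 40) & (score <= 40 + bandwidth)]  # ineligible side

print(f"Local effect at the cutoff: {just_below.mean() - just_above.mean():+.2f}")
```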

Interrupted Time Series (ITS): Analyze trends in an indicator before and after the program starts, using multiple pre-program data points to establish the counterfactual trend. Requires at least 8-10 pre-intervention data points. Works well when you have strong routine monitoring data but no comparison group.
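
The sketch below shows the basic ITS logic with invented quarterly data: fit the pre-program trend, project it forward as the counterfactual, and compare it with what was actually observed. A real analysis would also model seasonality and autocorrelation (for example, with segmented regression).

```python
# Interrupted time series sketch: project the pre-program trend as the counterfactual.
import numpy as np

# Hypothetical quarterly stunting rates (%); the program starts at quarter 10.
pre = np.array([36.0, 35.8, 35.5, 35.6, 35.2, 35.1, 34.9, 34.8, 34.6, 34.5])
post = np.array([33.8, 33.2, 32.8, 32.1])

# Fit a linear trend to the pre-intervention points only.
slope, intercept = np.polyfit(np.arange(len(pre)), pre, 1)
counterfactual = intercept + slope * np.arange(len(pre), len(pre) + len(post))

# Negative values mean observed rates fell below the projected trend.
print("Observed minus projected (pp):", np.round(post - counterfactual, 2))
```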

When to use quasi-experimental designs:

  • A comparison group exists but randomization is not possible
  • Program was not randomized at the start but comparison data is available
  • Budget is moderate ($30K-150K for the evaluation)
  • You need causal evidence but an RCT is not feasible
Strengths:

  • Stronger than non-experimental designs
  • More feasible and affordable than RCTs
  • Multiple design options for different contexts

Limitations:

  • Weaker causal claims than RCTs
  • Requires a good comparison group
  • Statistical expertise needed
  • Assumptions may not hold (parallel trends, no unobserved confounders)

Non-Experimental (Theory-Based and Qualitative)

What it is: No comparison group. Instead, these designs use program theory, qualitative evidence, and multiple sources of data to assess whether and how the program contributed to change. Do not dismiss these as "weak." For complex, adaptive, or systemic programs, they often produce more useful and actionable evidence than a quasi-experimental design applied badly.

Common approaches:

Contribution analysis: Builds a "contribution story" that traces the causal chain from activities to outcomes, testing the program's theory of change against evidence. Does not prove causation but builds a credible, evidence-based case for contribution. This is the workhorse of non-experimental evaluation in development.

Process tracing: Systematically examines the evidence for each link in the causal chain, looking for "smoking guns" (evidence that strongly supports the link) and "hoops" (tests the link must pass). This is a rigorous qualitative method for causal inference, borrowed from political science. It works best when you can identify specific causal mechanisms and test them against documentary and interview evidence.

Most Significant Change (MSC): Collects stories of change from stakeholders, which are then selected by panels as the "most significant." Participatory, captures unexpected changes, and works well for complex programs where predefined indicators miss what matters most.

Realist evaluation: Asks "What works, for whom, in what circumstances, and why?" Tests Context-Mechanism-Outcome (CMO) configurations rather than overall program impact. Particularly useful when you know a program works in some places but not others and you want to understand why.

When to use non-experimental approaches:

  • No comparison group exists or is feasible
  • Program is complex, adaptive, or systemic
  • You need to understand mechanisms, not just outcomes
  • Budget is limited ($15K-60K for the evaluation)
  • The question is "how and why" rather than "how much"
Strengths:

  • Feasible for any program type
  • Captures complexity and context
  • More affordable
  • Explains mechanisms, not just outcomes

Limitations:

  • Cannot prove causation
  • Findings are less generalizable
  • Requires skilled evaluators
  • Some donors do not accept it as sufficient evidence

Mixing Methods

Most real-world evaluations combine methods. A quasi-experimental impact evaluation often includes qualitative components to explain why the numbers look the way they do. A contribution analysis draws on both quantitative routine data and qualitative interviews. Treating "quantitative vs. qualitative" as a binary choice is a mistake. Think about what evidence you need for each evaluation question, and pick the method that generates that evidence. See qualitative vs. quantitative vs. mixed methods for a deeper comparison.

Evaluation Design Comparison

Factor | Experimental (RCT) | Quasi-Experimental | Non-Experimental
Causal evidence | Very strong | Moderate-strong | Moderate (contribution, not attribution)
Feasibility | Low (many conditions needed) | Medium | High
Typical cost | $100K-500K+ | $30K-150K | $15K-60K
Timeline | 2-5 years | 1-3 years | 3-12 months
Statistical expertise | High | High | Moderate
Best for | Simple, standardized interventions | Programs with natural comparison groups | Complex, adaptive, systemic programs
USAID acceptance | Preferred for impact | Accepted | Accepted for non-impact questions
EU acceptance | Accepted | Accepted | Common for most evaluations
FCDO acceptance | Expected for large programs | Accepted | Accepted with strong ToC

Worked Example: Choosing a Design Under Real Constraints

A nutrition program operating in 12 districts wants to know whether its community health worker (CHW) training reduced childhood stunting. The program covers all 12 districts with no untreated comparison area. The evaluation budget is $40,000. The timeline is 4 months.

Walk through the decision sequence:

Can you randomize? No. The program is already implemented everywhere. There is no control group to create.

Is a natural comparison group available? No. All districts received the training. Neighboring districts have different health systems and demographics, making them poor comparators.

Does baseline data exist for a quasi-experimental design? The program collected routine health facility data (stunting rates, CHW visit logs) before and after training, but only in program districts. Without a comparison group, DID is off the table.

What fits the budget and timeline? A $40,000 budget and a 4-month timeline rule out anything requiring primary household survey data collection at scale. But the routine health facility data is already available.

Recommendation: Contribution analysis using existing routine health facility data (stunting trends, CHW visit frequency, referral rates), supplemented by key informant interviews with health workers and district health officials. Map the theory of change from CHW training to stunting reduction, then test each link against the available evidence. If stunting rates improved after training and CHW visit frequency increased, and health workers describe specific practices they changed because of the training, you build a credible contribution story. This fits within budget, uses existing data, and answers the "did the training contribute to improvement?" question without requiring a comparison group. Use the Evaluation Designer to structure the design, or the Method Selector to explore alternatives.

Common Mistakes

Mistake 1: Choosing the method before the question. "We want to do an RCT" is backwards. Start with: "What do we need to know, and what design best answers that question given our constraints?"

Mistake 2: Assuming an RCT is always best. An RCT is the strongest design for attribution, but it answers a narrow question (did the average outcome change?) and misses context, mechanisms, and variation. For complex programs, a well-designed mixed-methods evaluation often produces more useful evidence.

Mistake 3: No comparison group and no theory. If you do not have a comparison group, you need a strong theory-based approach (contribution analysis, process tracing). Simply measuring before and after without either a comparison group or a systematic way to test your theory of change gives you very weak evidence.

Mistake 4: Underfunded evaluations. Trying to run a quasi-experimental design on a $10K budget usually produces unusable results. Match your design ambition to your budget. A well-executed contribution analysis at $25K produces better evidence than a badly executed DID at $25K.

Mistake 5: Ignoring ethical considerations. Withholding a proven intervention from a control group is ethically problematic. Delayed rollout designs (everyone eventually gets the program, but the order is randomized) address this for RCTs. For ongoing programs, pipeline designs (comparing those who received the program first versus those still waiting) are an ethical alternative.

Mistake 6: Treating the design as fixed. Your evaluation design should adapt if circumstances change. If a planned comparison group becomes contaminated (they receive a similar program from another organization), you need to pivot. Build flexibility into your evaluation plan from the start.

Quick Decision Guide

Use this sequence. Each question narrows the options; a short code sketch of the logic follows the list.

1. Do you need to prove your program caused the change (attribution)?

  • Yes, and randomization is feasible: RCT
  • Yes, but randomization is not feasible: Quasi-Experimental (DID, PSM, RD, or ITS depending on data availability)
  • No, you need to show your program contributed: Contribution Analysis or Process Tracing

2. Is a comparison group available?

  • Yes, and it is well-matched: Quasi-Experimental
  • A natural cutoff exists (eligibility threshold): Regression Discontinuity
  • No comparison group: Theory-Based approaches (contribution analysis, realist evaluation, MSC)

3. What is your budget for the evaluation?

  • Over $100K: All options available. Consider whether the question justifies the cost of an RCT.
  • $30K-$100K: Quasi-experimental or theory-based. DID with existing data is often the sweet spot.
  • Under $30K: Theory-based, MSC, or process evaluation. Do not attempt quasi-experimental designs at this budget.

4. How complex is the program?

  • Simple, standardized: Experimental or quasi-experimental works well
  • Complex, adaptive, multi-component: Theory-based, realist evaluation, or developmental evaluation. These designs handle complexity better because they do not require a single stable "treatment."
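
For readers who think in code, here is a minimal, hypothetical translation of the sequence above into a helper function. The inputs and thresholds are simplifications of this guide, not a substitute for judgment about context, ethics, and data quality.

```python
# Hypothetical encoding of the decision sequence above; all thresholds are illustrative.
def suggest_design(need_attribution: bool,
                   can_randomize: bool,
                   has_comparison_group: bool,
                   budget_usd: int,
                   complex_program: bool) -> str:
    if need_attribution and can_randomize and budget_usd > 100_000:
        return "Experimental (RCT)"
    if has_comparison_group and budget_usd >= 30_000 and not complex_program:
        return "Quasi-experimental (DID, PSM, RD, or ITS)"
    return "Theory-based (contribution analysis, process tracing, realist, MSC)"

# Example: attribution needed, no randomization possible, comparison group exists.
print(suggest_design(need_attribution=True, can_randomize=False,
                     has_comparison_group=True, budget_usd=60_000,
                     complex_program=False))
# -> Quasi-experimental (DID, PSM, RD, or ITS)
```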
