How to Choose an Evaluation Methodology

A decision framework for choosing an evaluation design. Covers experimental, quasi-experimental, and non-experimental approaches.

How to Choose the Right Evaluation Design

Most evaluation design mistakes happen because people choose a method before clarifying what they need to know. "We want to do an RCT" is not a starting point. "We need to know whether our training program reduced stunting" is. The method follows from the question, the context, and your constraints. Get those right first.

Before you open a methods textbook, answer three questions in order.

Question 1: What do you need to know?

Your evaluation question determines the entire design. If you need to prove causation ("Did this program cause the change?"), you need a counterfactual. If you need to understand contribution ("Did this program help?"), theory-based designs work. If you need to understand implementation ("Was this delivered as planned?"), a process evaluation is enough. Do not over-engineer the design beyond the question you are actually asking.

Question 2: Is a counterfactual feasible?

A counterfactual answers: "What would have happened without the program?" If you can construct one through randomization or a natural comparison group, experimental and quasi-experimental design options open up. If not, theory-based approaches are your path. Many programs, especially those already operating at scale, cannot construct a credible counterfactual. That is fine. It simply means you use a different family of designs.

Question 3: What are your constraints?

Budget, timeline, ethics, politics, data availability, and donor requirements all shape what is feasible. An RCT that costs $300K and takes 3 years is not viable for a $500K, 2-year program. A difference-in-differences design requires baseline data you may not have collected. Be honest about constraints early. A well-executed modest design beats a poorly executed ambitious one every time.

The Three Design Families

Experimental (RCT)

What it is: Randomly assign people, communities, schools, or facilities to receive the program or not. Compare outcomes between the two groups. Any difference is attributable to the program because randomization ensures the groups are equivalent at the start.

Most NGO evaluations do not need an RCT. That is not a controversial statement. It reflects the reality that RCTs require conditions most programs cannot meet: phased rollout, large sample sizes, substantial budgets, and stable interventions. When those conditions exist, an RCT is the gold standard for causal attribution. When they do not, forcing an RCT produces bad science at high cost.

When to use it:

  • Program is being rolled out in phases (randomize who gets it first)
  • Resources are limited and cannot reach everyone at once (natural rationing)
  • Donor requires rigorous impact evidence (USAID, DFID/FCDO, 3ie)
  • The program is relatively simple and standardized
  • Budget allows ($100K-500K+ for the evaluation alone)
  • Ethical review approves withholding the program from the control group

When it does not work:

  • Program already covers the entire target population (no control group possible)
  • Ethical concerns about withholding a beneficial intervention
  • Program is highly adaptive (RCTs require a stable treatment)
  • Sample size is too small for statistical power
  • The question is "why" not "how much" (RCTs tell you impact, not mechanisms)
Strengths:

  • Strongest causal evidence
  • Clear, credible results
  • Widely accepted by donors and policymakers

Limitations:

  • Expensive and time-consuming
  • Requires randomization (not always ethical/feasible)
  • Tells you "what" not "why"
  • Requires stable intervention and large samples

Quasi-Experimental

What it is: Uses a comparison group, but without randomization. Instead, statistical techniques account for the differences between groups. These designs sit between the rigor of an RCT and the flexibility of non-experimental approaches. They work well when you have a natural comparison group and decent data, but randomization was never possible or the opportunity to randomize has passed.

Common approaches:

Difference-in-Differences (DID): Compare the change over time in the treatment group to the change over time in a comparison group. Requires data from before and after the program for both groups. Assumes both groups would have followed the same trend without the program (parallel trends assumption). This is probably the most common quasi-experimental design in development evaluation because many programs collect baseline and endline data in both program and non-program areas.
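
The core DID calculation is simple enough to sketch in a few lines. The numbers below are hypothetical stunting prevalence figures, not from any real program, and a real analysis would typically use a regression with standard errors rather than raw group means.

```python
# Minimal difference-in-differences sketch using hypothetical group means.
# Stunting prevalence (%) at baseline and endline, program vs. comparison areas.
treat_baseline, treat_endline = 34.0, 28.0   # program districts
comp_baseline, comp_endline = 33.0, 31.5     # comparison districts

change_treat = treat_endline - treat_baseline  # -6.0 percentage points
change_comp = comp_endline - comp_baseline     # -1.5 percentage points

# The DID estimate: how much more the treatment group changed than the
# comparison group, assuming parallel trends would otherwise have held.
did_estimate = change_treat - change_comp      # -4.5 percentage points
print(f"DID estimate: {did_estimate:+.1f} percentage points")
```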

Propensity Score Matching (PSM): Match each program participant with a non-participant who is statistically similar on observable characteristics. Compare outcomes between matched pairs. Requires good data on characteristics that predict participation. PSM handles selection bias on observables but cannot address unobserved differences between groups.
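
As an illustration only, the sketch below estimates a propensity score with scikit-learn's logistic regression on synthetic data and matches each participant to the nearest non-participant; the variables and data-generating step are invented for the example.

```python
# Illustrative propensity score matching on synthetic data (not a real dataset).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 3))                                 # observable characteristics
participated = (X[:, 0] + rng.normal(size=n) > 0).astype(int)
outcome = 2.0 * participated + X[:, 0] + rng.normal(size=n)  # true effect = 2.0

# Step 1: estimate each unit's propensity to participate from observables.
ps = LogisticRegression().fit(X, participated).predict_proba(X)[:, 1]

# Step 2: match each participant to the non-participant with the closest score.
treated = np.where(participated == 1)[0]
controls = np.where(participated == 0)[0]
diffs = [outcome[i] - outcome[controls[np.argmin(np.abs(ps[controls] - ps[i]))]]
         for i in treated]

# Should roughly recover the true effect, since selection is on observables only.
print(f"Estimated effect on the treated: {np.mean(diffs):.2f}")
```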

Regression Discontinuity (RD): When program eligibility is based on a cutoff (e.g., income below $2/day), compare people just above and just below the cutoff. Very strong causal evidence for the group near the cutoff. Limited because it only tells you about the effect at the threshold, not for the whole population. If your program uses any kind of score-based targeting, check whether RD is an option before defaulting to other designs.
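
A toy sketch of the logic, with an invented eligibility score and cutoff: compare average outcomes in a narrow band on either side of the threshold. A real RD analysis would use local regression and sensitivity checks on the bandwidth.

```python
# Toy regression-discontinuity comparison around a hypothetical cutoff (score < 40 = eligible).
import numpy as np

rng = np.random.default_rng(1)
score = rng.uniform(0, 100, 5000)
eligible = score < 40
# Hypothetical outcome: rises with the score, plus a +5 jump for eligible units.
outcome = 0.1 * score + 5.0 * eligible + rng.normal(scale=2.0, size=score.size)

bandwidth = 5  # only compare observations close to the cutoff
just_below = outcome[(score >= 40 - bandwidth) & (score < 40)]   # eligible side
just_above = outcome[(score >= 40) & (score <= 40 + bandwidth)]  # ineligible side

print(f"Local effect at the cutoff: {just_below.mean() - just_above.mean():+.2f}")
```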

Interrupted Time Series (ITS): Analyze trends in an indicator before and after the program starts, using multiple pre-program data points to establish the counterfactual trend. Requires at least 8-10 pre-intervention data points. Works well when you have strong routine monitoring data but no comparison group.
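
The sketch below shows the basic ITS logic with invented quarterly data: fit the pre-program trend, project it forward as the counterfactual, and compare it with what was actually observed. A real analysis would also model seasonality and autocorrelation (for example, with segmented regression).

```python
# Interrupted time series sketch: project the pre-program trend as the counterfactual.
import numpy as np

# Hypothetical quarterly stunting rates (%); the program starts at quarter 10.
pre = np.array([36.0, 35.8, 35.5, 35.6, 35.2, 35.1, 34.9, 34.8, 34.6, 34.5])
post = np.array([33.8, 33.2, 32.8, 32.1])

# Fit a linear trend to the pre-intervention points only.
slope, intercept = np.polyfit(np.arange(len(pre)), pre, 1)
counterfactual = intercept + slope * np.arange(len(pre), len(pre) + len(post))

# Negative values mean observed rates fell below the projected trend.
print("Observed minus projected (pp):", np.round(post - counterfactual, 2))
```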

When to use quasi-experimental designs:

  • A comparison group exists but randomization is not possible
  • Program was not randomized at the start but comparison data is available
  • Budget is moderate ($30K-150K for the evaluation)
  • You need causal evidence but an RCT is not feasible
Strengths:

  • Stronger than non-experimental designs
  • More feasible and affordable than RCTs
  • Multiple design options for different contexts

Limitations:

  • Weaker causal claims than RCTs
  • Requires a good comparison group
  • Statistical expertise needed
  • Assumptions may not hold (parallel trends, no unobserved confounders)

Non-Experimental (Theory-Based and Qualitative)

What it is: No comparison group. Instead, these designs use program theory, qualitative evidence, and multiple sources of data to assess whether and how the program contributed to change. Do not dismiss these as "weak." For complex, adaptive, or systemic programs, they often produce more useful and actionable evidence than a quasi-experimental design applied badly.

Common approaches:

Contribution analysis: Builds a "contribution story" that traces the causal chain from activities to outcomes, testing the program's theory of change against evidence. Does not prove causation but builds a credible, evidence-based case for contribution. This is the workhorse of non-experimental evaluation in development.

Process tracing: Systematically examines the evidence for each link in the causal chain, looking for "smoking guns" (evidence that strongly supports the link) and "hoops" (tests the link must pass). This is a rigorous qualitative method for causal inference, borrowed from political science. It works best when you can identify specific causal mechanisms and test them against documentary and interview evidence.

Most Significant Change (MSC): Collects stories of change from stakeholders, which are then selected by panels as the "most significant." Participatory, captures unexpected changes, and works well for complex programs where predefined indicators miss what matters most.

Realist evaluation: Asks "What works, for whom, in what circumstances, and why?" Tests Context-Mechanism-Outcome (CMO) configurations rather than overall program impact. Particularly useful when you know a program works in some places but not others and you want to understand why.

When to use non-experimental approaches:

  • No comparison group exists or is feasible
  • Program is complex, adaptive, or systemic
  • You need to understand mechanisms, not just outcomes
  • Budget is limited ($15K-60K for the evaluation)
  • The question is "how and why" rather than "how much"
Strengths:

  • Feasible for any program type
  • Captures complexity and context
  • More affordable
  • Explains mechanisms, not just outcomes

Limitations:

  • Cannot prove causation
  • Findings are less generalizable
  • Requires skilled evaluators
  • Some donors do not accept it as sufficient evidence

Mixing Methods

Most real-world evaluations combine methods. A quasi-experimental impact evaluation often includes qualitative components to explain why the numbers look the way they do. A contribution analysis draws on both quantitative routine data and qualitative interviews. Treating "quantitative vs. qualitative" as a binary choice is a mistake. Think about what evidence you need for each evaluation question, and pick the method that generates that evidence. See qualitative vs. quantitative vs. mixed methods for a deeper comparison.

Evaluation Design Comparison

Factor | Experimental (RCT) | Quasi-Experimental | Non-Experimental
Causal evidence | Very strong | Moderate-strong | Moderate (contribution, not attribution)
Feasibility | Low (many conditions needed) | Medium | High
Typical cost | $100K-500K+ | $30K-150K | $15K-60K
Timeline | 2-5 years | 1-3 years | 3-12 months
Statistical expertise | High | High | Moderate
Best for | Simple, standardized interventions | Programs with natural comparison groups | Complex, adaptive, systemic programs
USAID acceptance | Preferred for impact | Accepted | Accepted for non-impact questions
EU acceptance | Accepted | Accepted | Common for most evaluations
FCDO acceptance | Expected for large programs | Accepted | Accepted with strong ToC

Worked Example: Choosing a Design Under Real Constraints

A nutrition program operating in 12 districts wants to know whether its community health worker (CHW) training reduced childhood stunting. The program covers all 12 districts with no untreated comparison area. The evaluation budget is $40,000. The timeline is 4 months.

Walk through the decision sequence:

Can you randomize? No. The program is already implemented everywhere. There is no control group to create.

Is a natural comparison group available? No. All districts received the training. Neighboring districts have different health systems and demographics, making them poor comparators.

Does baseline data exist for a quasi-experimental design? The program collected routine health facility data (stunting rates, CHW visit logs) before and after training, but only in program districts. Without a comparison group, DID is off the table.

What fits the budget and timeline? A $40,000 budget and a 4-month timeline rule out anything requiring primary household survey data collection at scale. But the routine health facility data is already available.

Recommendation: Contribution analysis using existing routine health facility data (stunting trends, CHW visit frequency, referral rates), supplemented by key informant interviews with health workers and district health officials. Map the theory of change from CHW training to stunting reduction, then test each link against the available evidence. If stunting rates improved after training and CHW visit frequency increased, and health workers describe specific practices they changed because of the training, you build a credible contribution story. This fits within budget, uses existing data, and answers the "did the training contribute to improvement?" question without requiring a comparison group. Use the Evaluation Designer to structure the design, or the Method Selector to explore alternatives.

Common Mistakes

Mistake 1: Choosing the method before the question. "We want to do an RCT" is backwards. Start with: "What do we need to know, and what design best answers that question given our constraints?"

Mistake 2: Assuming an RCT is always best. An RCT is the strongest design for attribution, but it answers a narrow question (did the average outcome change?) and misses context, mechanisms, and variation. For complex programs, a well-designed mixed-methods evaluation often produces more useful evidence.

Mistake 3: No comparison group and no theory. If you do not have a comparison group, you need a strong theory-based approach (contribution analysis, process tracing). Simply measuring before and after without either a comparison group or a systematic way to test your theory of change gives you very weak evidence.

Mistake 4: Underfunded evaluations. Trying to run a quasi-experimental design on a $10K budget usually produces unusable results. Match your design ambition to your budget. A well-executed contribution analysis at $25K produces better evidence than a badly executed DID at $25K.

Mistake 5: Ignoring ethical considerations. Withholding a proven intervention from a control group is ethically problematic. Delayed rollout designs (everyone eventually gets the program, but the order is randomized) address this for RCTs. For ongoing programs, pipeline designs (comparing those who received the program first versus those still waiting) are an ethical alternative.

Mistake 6: Treating the design as fixed. Your evaluation design should adapt if circumstances change. If a planned comparison group becomes contaminated (they receive a similar program from another organization), you need to pivot. Build flexibility into your evaluation plan from the start.

Quick Decision Guide

Use this sequence. Each question narrows the options; a short code sketch of the logic follows the list.

1. Do you need to prove your program caused the change (attribution)?

  • Yes, and randomization is feasible: RCT
  • Yes, but randomization is not feasible: Quasi-Experimental (DID, PSM, RD, or ITS depending on data availability)
  • No, you need to show your program contributed: Contribution Analysis or Process Tracing

2. Is a comparison group available?

  • Yes, and it is well-matched: Quasi-Experimental
  • A natural cutoff exists (eligibility threshold): Regression Discontinuity
  • No comparison group: Theory-Based approaches (contribution analysis, realist evaluation, MSC)

3. What is your budget for the evaluation?

  • Over $100K: All options available. Consider whether the question justifies the cost of an RCT.
  • $30K-$100K: Quasi-experimental or theory-based. DID with existing data is often the sweet spot.
  • Under $30K: Theory-based, MSC, or process evaluation. Do not attempt quasi-experimental designs at this budget.

4. How complex is the program?

  • Simple, standardized: Experimental or quasi-experimental works well
  • Complex, adaptive, multi-component: Theory-based, realist evaluation, or developmental evaluation. These designs handle complexity better because they do not require a single stable "treatment."
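
For readers who think in code, here is a minimal, hypothetical translation of the sequence above into a helper function. The inputs and thresholds are simplifications of this guide, not a substitute for judgment about context, ethics, and data quality.

```python
# Hypothetical encoding of the decision sequence above; all thresholds are illustrative.
def suggest_design(need_attribution: bool,
                   can_randomize: bool,
                   has_comparison_group: bool,
                   budget_usd: int,
                   complex_program: bool) -> str:
    if need_attribution and can_randomize and budget_usd > 100_000:
        return "Experimental (RCT)"
    if has_comparison_group and budget_usd >= 30_000 and not complex_program:
        return "Quasi-experimental (DID, PSM, RD, or ITS)"
    return "Theory-based (contribution analysis, process tracing, realist, MSC)"

# Example: attribution needed, no randomization possible, comparison group exists.
print(suggest_design(need_attribution=True, can_randomize=False,
                     has_comparison_group=True, budget_usd=60_000,
                     complex_program=False))
# -> Quasi-experimental (DID, PSM, RD, or ITS)
```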
