M&E Studio
© 2026 Logic Lab LLC. All rights reserved.


How to Choose Sample Size for M&E

A practical guide to sample size for program evaluations, with rules of thumb, worked examples, and budget-statistics tradeoffs.

How Many People Do You Need to Survey?

"How many people do I need to survey?" is the most common technical question in M&E. It is also the question most often answered with a guess. Too few respondents and you cannot detect real change. Too many and you waste budget that could fund better training, longer fieldwork, or deeper qualitative work. Getting sample size right is not about precision for its own sake. It is about spending data collection money where it actually produces usable evidence.

The answer depends on five factors. Understanding them saves you from underpowered studies and bloated budgets alike.

Quick Reference: Rules of Thumb

  • 385 for a basic proportion estimate (95% confidence, 5% margin of error, large population).
  • Double it for cluster sampling.
  • Add 15-20% for non-response.
  • Multiply by the number of subgroups you need to compare (e.g., 2 genders x 3 regions = 6 subgroups).
  • Use the Sampling Calculator to get an exact number for your situation.

The Standard Formula

Before diving into the five factors, here is the basic formula for sample size when estimating a proportion:

n = Z² x p(1-p) / e²

Where:

  • Z = the Z-score for your confidence level (1.96 for 95%, 1.645 for 90%)
  • p = the expected proportion (use 0.5 if unknown, which gives the maximum sample)
  • e = the margin of error you can accept (0.05 for 5%)

Quick calculation: With 95% confidence (Z = 1.96), 50% proportion, and 5% margin of error:

n = 1.96² x 0.5 x 0.5 / 0.05² = 3.8416 x 0.25 / 0.0025 = 384.16, which rounds up to 385

That is where the "385" rule of thumb comes from. It assumes a large population, simple random sampling, and no clustering. In practice, you almost always need more. Or just use the Sampling Calculator.
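The arithmetic is easy to script. A minimal Python sketch of the formula above (the function name is ours, not from any standard library):

```python
import math

def sample_size_proportion(z=1.96, p=0.5, e=0.05):
    """Cochran's formula for estimating a proportion in a large population.

    z: Z-score for the confidence level, p: expected proportion,
    e: acceptable margin of error. Rounds up to a whole respondent.
    """
    return math.ceil(z ** 2 * p * (1 - p) / e ** 2)

print(sample_size_proportion())         # 95% confidence, 5% margin -> 385
print(sample_size_proportion(z=1.645))  # 90% confidence -> 271
```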

Finite Population Correction

If your total population is small (under 5,000), you need fewer respondents than the formula suggests. Apply the finite population correction:

n_adjusted = n / (1 + (n - 1) / N)

Where N is the total population size.

Example: Your program serves 800 households. The formula says 385. With the correction: 385 / (1 + 384/800) = 385 / 1.48 = 260 households. That is a meaningful reduction. Do not skip this step for small populations.
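The correction is a one-liner to apply after the basic formula. A sketch using the 800-household example (helper name is ours):

```python
def fpc_adjust(n, population):
    """Finite population correction: n / (1 + (n - 1) / N)."""
    return n / (1 + (n - 1) / population)

# Program serves 800 households; base formula says 385:
print(round(fpc_adjust(385, 800)))  # -> 260
```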

The Five Factors

1. What change do you want to detect?

This is the minimum detectable effect (MDE): the smallest change in your indicator that you want your survey to be able to pick up. If your program expects to increase handwashing from 30% to 50% (a 20-percentage-point change), that is a large effect and requires fewer respondents. If you expect a shift from 60% to 65% (5 points), that is a small effect and requires many more.

Rule of thumb: Most development programs expect changes of 10-20 percentage points for behavioral indicators. If your expected change is smaller than 5 percentage points, you need a very large sample or should reconsider whether a survey is the right tool for measuring that indicator. Consider whether routine monitoring data or administrative records can track small changes more efficiently than a sample survey.
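Estimating a single proportion is not quite the same as detecting a change between two survey rounds. For change detection, the standard tool is the two-sample proportions formula, which adds a power term. A sketch of the textbook version, assuming 80% power (a common default this guide does not itself specify):

```python
import math

def n_per_group(p1, p2, z_alpha=1.96, z_beta=0.8416):
    """Sample size per survey round (or per arm) to detect a change
    from p1 to p2 at 95% two-sided confidence and 80% power."""
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

print(n_per_group(0.30, 0.50))  # 20-point effect -> 91 per group
print(n_per_group(0.60, 0.65))  # 5-point effect -> 1468 per group
```

The contrast is the point: shrinking the detectable effect from 20 points to 5 points multiplies the required sample by roughly sixteen.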

2. How confident do you need to be?

The confidence level is the probability that your results are not due to chance. The standard is 95% (meaning that if you repeated the survey 100 times, about 95 would give results within your margin of error). Some programs use 90%, which reduces sample size by about 30%. For internal learning purposes, 90% is often adequate. For external evaluations or academic publications, stick with 95%.

The margin of error (also called precision) is the range around your estimate. A 5% margin of error means if you measure 50% adoption, the true value is between 45% and 55%. Tighter margins require larger samples. Moving from 5% to 3% margin of error nearly triples your sample size.

3. How variable is your population?

If everyone in your target population is similar, you need fewer respondents. If there is wide variation, you need more. When you do not know the variance (common when designing a baseline), using 50% as your proportion estimate gives the maximum sample size needed. This is the safest assumption and the one to use when you are planning blind.

If you have data from a previous survey or a similar program in the same area, use that proportion instead. A known baseline value of 20% or 80% requires a substantially smaller sample than the worst-case 50%.

4. Are you sampling clusters?

If you are sampling villages, schools, or health facilities first and then individuals within them, you need to adjust for the design effect. People in the same village tend to be more similar to each other than to people in other villages. Each additional person in the same cluster gives you less new information than a person from a new cluster would.

The design effect depends on the intra-cluster correlation (ICC) and the number of individuals per cluster:

Design Effect = 1 + (cluster size - 1) x ICC

Typical ICC values for common M&E indicators:

Indicator type | Typical ICC | Design effect (20 per cluster)
Immunization coverage | 0.02-0.08 | 1.4-2.5
Handwashing/hygiene behavior | 0.05-0.15 | 2.0-3.9
School attendance | 0.10-0.20 | 2.9-4.8
Crop yield/income | 0.05-0.15 | 2.0-3.9
Nutrition (stunting/wasting) | 0.05-0.15 | 2.0-3.9

A design effect of 2.0 means you need roughly double the sample you would need with simple random sampling. This is the single biggest factor that inflates sample size in real-world M&E, and the one most often ignored in initial planning.

Most sampling methods used in field evaluations involve some form of clustering. If your enumerators are visiting communities, you are almost certainly doing cluster sampling, whether you call it that or not.
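The design effect formula in code, using the school-attendance row of the table above (ICC 0.10, 20 per cluster) as an illustration:

```python
def design_effect(cluster_size, icc):
    """DEFF = 1 + (cluster size - 1) x ICC."""
    return 1 + (cluster_size - 1) * icc

deff = design_effect(20, 0.10)
print(round(deff, 1))     # -> 2.9
print(round(357 * deff))  # an SRS sample of 357 inflates to -> 1035
```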

5. How many subgroups do you need to compare?

If you need to disaggregate results by sex, age group, geography, or disability status, you need enough respondents in each subgroup to make meaningful comparisons. This is where sample sizes get large quickly.

Rule of thumb: You need at least 30-50 respondents per subgroup for basic descriptive statistics, and 100+ per subgroup for meaningful comparisons between groups. If your donor requires gender-disaggregated results and comparison across 3 regions, that is 6 subgroups at 100+ each, meaning your total sample may need to be 600+ before adjusting for design effect and non-response.

Plan your disaggregation requirements before you calculate sample size, not after. Too many teams set a sample size first, then discover at analysis time that they cannot split the data the way the donor expects.
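The subgroup arithmetic as a quick sketch (helper name is ours): size each subgroup first, then scale for design effect and non-response.

```python
import math

def total_sample(per_subgroup, n_subgroups, deff=1.0, response_rate=0.85):
    """Subgroup-first sizing: multiply out, then buffer for non-response."""
    return math.ceil(per_subgroup * n_subgroups * deff / response_rate)

# 2 sexes x 3 regions at 100 each, no clustering, 15% non-response:
print(total_sample(100, 6))            # -> 706
# Same disaggregation under cluster sampling with DEFF = 2.0:
print(total_sample(100, 6, deff=2.0))  # -> 1412
```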

Worked Examples

Example 1: Simple Baseline Survey

  • Population: 5,000 beneficiary households
  • Indicator: Proportion using improved water sources
  • Expected baseline value: Unknown (use 50%)
  • Confidence: 95%, Margin of error: 5%
  • Sampling: Simple random (no clustering)
  • Formula result: 385 households
  • Finite population correction: 385 / (1 + 384/5000) = 357 households
  • With 15% non-response buffer: 420 households

Example 2: Cluster-Sampled Baseline

  • Same as above, but sampling villages first, then 20 households per village
  • ICC: 0.10 (moderate clustering)
  • Design effect: 1 + (20-1) x 0.10 = 2.9
  • Adjusted sample: 357 x 2.9 = 1,035 households
  • With 15% non-response: 1,220 households
  • Actual plan: 61 villages x 20 households each = 1,220 households

Notice how clustering nearly triples the required sample. This is not unusual. If someone hands you a sample size of 400 for a cluster-sampled survey design across 20 villages, ask them to show their design effect calculation.
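Example 2 chained end-to-end in Python (a sketch; the parameter defaults mirror the example, and the function name is ours):

```python
import math

def cluster_sample_size(z=1.96, p=0.5, e=0.05, population=5000,
                        cluster_size=20, icc=0.10, response_rate=0.85):
    """Base formula -> finite population correction -> design effect
    -> non-response buffer, in that order."""
    n = math.ceil(z ** 2 * p * (1 - p) / e ** 2)  # 385
    n = n / (1 + (n - 1) / population)            # ~357 after FPC
    n = n * (1 + (cluster_size - 1) * icc)        # x 2.9 design effect
    return math.ceil(n / response_rate)           # 15% non-response

print(cluster_sample_size())  # -> 1220 households
```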

Example 3: Gender-Disaggregated Endline

  • Need: Compare men vs women on 3 key indicators
  • Minimum per subgroup: 100 (for meaningful comparison)
  • Design effect: 2.0 (cluster sampling)
  • Per subgroup after DEFF: 200
  • Total: 400 (200 men + 200 women)
  • With 15% non-response: 470

Qualitative Sample Sizes

Qualitative research does not use statistical formulas. Instead, it follows saturation logic: collect data until you stop hearing new information.

Method | Guideline | Saturation point
Key informant interviews (homogeneous group) | 12-20 | Usually 12-16
Key informant interviews (diverse population) | 20-40 | Usually 20-25
Focus group discussions | 4-6 groups per segment | 4 groups usually sufficient
FGD participants | 6-10 per group | 8 is the sweet spot
Case studies | 4-10 | Depends on purpose

These are guidelines, not rigid rules. The right number depends on your research question, the diversity of your population, and how much variation you encounter. But do not use "saturation" as a justification for interviewing 3 people and calling it qualitative research. If you have not heard a genuinely new perspective in 4-5 consecutive interviews, you have likely reached saturation. If every interview surprises you, keep going.

For mixed methods designs, your qualitative sample is typically a purposive subset of your quantitative sample. Select for diversity, not representativeness. The qualitative component explains the "why" behind the quantitative patterns, so choose participants who represent different experiences with the program. See How to Choose Evaluation Methodology for guidance on when mixed methods designs make sense.

Common Mistakes

1. Sample too small for disaggregation

You calculate 385 for the overall sample, then try to split by gender, age group, and region. Each cell ends up with 30-40 respondents, which is not enough for reliable comparisons. Fix: decide your disaggregation requirements first and size each subgroup independently.

2. Ignoring non-response

Not everyone you plan to interview will be available. Some refuse. Some are away. Some addresses are wrong. If you plan for exactly 385 and achieve a 15% non-response rate, you end up with 327 completed surveys and an underpowered study. Always add a 15-20% buffer.

3. Convenience sampling disguised as random

Surveying "whoever is at the health facility that day" or "the first 20 households near the road" is not random sampling, no matter what the methodology section says. Convenience samples cannot support statistical inference, regardless of size. A properly randomized sample of 300 is more credible than a convenience sample of 3,000.

4. No sampling frame

You cannot randomly sample without a list to sample from. If you do not have a beneficiary registry, a household listing, or a census, your first step is building one. Skipping this step means you are not doing random sampling, even if you think you are. Budget time and money for listing exercises before data collection.

5. Forgetting finite population correction

If your program serves 500 households and you calculate a sample of 385 (ignoring the correction), you are planning to survey 77% of the population. Apply the finite population correction and you need only about 218. For small populations, this correction makes a real difference in cost and effort.

6. Ignoring clustering in analysis

This is the most dangerous mistake. You sample 30 villages, 15 households per village, and analyze the data as if you had 450 independent observations. You did not. You have 30 clusters. Ignoring this inflates your statistical significance and produces false positive results. If you used cluster sampling in data collection, you must account for it in analysis. Use survey commands (svy in Stata, survey package in R) or multilevel models.
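One way to see the damage: convert the clustered sample into its effective sample size, the number of independent observations it is actually worth. A sketch assuming an ICC of 0.10 (illustrative; the scenario above does not state one):

```python
def effective_sample_size(n, cluster_size, icc):
    """Observations divided by the design effect."""
    return n / (1 + (cluster_size - 1) * icc)

# 30 villages x 15 households = 450 interviews, assumed ICC 0.10:
print(round(effective_sample_size(450, 15, 0.10)))  # about 187, not 450
```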

When Budget and Statistics Disagree

This is the most common real-world problem. Your statistics say 1,200 households; your budget says 400.

Legitimate ways to reduce sample size:

  1. Accept a larger margin of error. Moving from 5% to 7% reduces sample size by about 50%. For program management decisions (not academic research), 7% is often fine.
  2. Accept 90% confidence instead of 95%. This reduces sample size by about 30%. Appropriate for internal learning, less so for external evaluations.
  3. Reduce the number of subgroups. If you cannot afford gender AND age AND region disaggregation, choose the most important one. Be explicit about what you are giving up.
  4. Reduce cluster size. Sampling 15 households per village instead of 25 reduces the design effect, though you need more villages (and more travel costs). There is an optimal balance.
  5. Use stratified sampling. If you know some strata are more variable than others, you can allocate more sample to variable strata and less to homogeneous ones. This improves precision without increasing total sample size.
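Options 1 and 2 are easy to model before the budget meeting. A sketch that tabulates the base formula across confidence levels and margins of error (helper name is ours):

```python
import math

def n_for(z, e, p=0.5):
    """Base proportion-estimate sample size for Z-score z and margin e."""
    return math.ceil(z ** 2 * p * (1 - p) / e ** 2)

for label, z in (("95% confidence", 1.96), ("90% confidence", 1.645)):
    for e in (0.03, 0.05, 0.07):
        print(f"{label}, {e:.0%} margin: n = {n_for(z, e)}")
```

At 95% confidence this prints 1068, 385, and 196: moving from a 5% to a 7% margin roughly halves the sample, which is exactly the trade-off in option 1.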

What you should NOT do:

  • Survey fewer people and hope for the best (underpowered studies waste money; they cannot detect real changes)
  • Drop the comparison group (without a comparison, you cannot attribute change to your program)
  • Ignore clustering in your analysis (this inflates your statistical significance and produces false positive results)
  • Do a "post-hoc power analysis" to justify an underpowered study (this is statistically invalid and reviewers will flag it)

The honest move is to present the donor with options. Show what precision you can achieve at each budget level.

How to Talk to Your Donor

When the sample size conversation comes up, frame it around trade-offs:

"With 400 households, we can measure the overall change in [indicator] with 7% precision, and disaggregate by sex. We cannot reliably compare across all 5 districts. If district-level comparison is essential, we need 800 households, which requires an additional $12,000 for data collection."

Most donors are reasonable when you present the trade-offs clearly, with cost implications attached. What they do not accept is discovering at endline that the sample was too small to detect change, with no warning beforehand.

Three principles for the conversation:

  1. Lead with what they can learn, not what they cannot. "With this sample, you can confidently track overall program change and gender differences" is better than "this sample is too small for district comparisons."
  2. Attach dollar amounts to each option. Donors think in budgets. "An additional 400 households costs $12,000 and gives you district-level comparisons" is a concrete trade-off they can evaluate.
  3. Document the decision. Put the sample size rationale, the assumptions, and the agreed trade-offs in the baseline design report. When someone asks at endline why you only surveyed 400 households, you want a paper trail.

Use the Sampling Calculator to model different scenarios before the conversation. Walk in with 2-3 options at different budget levels so the donor can choose instead of argue.
