What the Five Dimensions Are
The five data quality dimensions are the canonical framework for M&E data quality assessment, codified in USAID ADS 203 and used in DQA templates across most bilateral donors, UN agencies, and INGOs.
| Dimension | The question | What it catches |
|---|---|---|
| Validity | Does the data measure what the indicator says it measures? | Indicator-specification mismatches, proxy failures, cultural translation errors |
| Reliability | Would the same data collected again produce the same result? | Enumerator drift, protocol inconsistency, definitional confusion |
| Timeliness | Does the data arrive when the decision needs it? | Reporting lag, collection delays, cycle misalignment |
| Precision | Is the data detailed and accurate enough to distinguish meaningful differences? | Insufficient disaggregation, margin-of-error too wide, undercount/overcount patterns |
| Integrity | Is the data protected from deliberate bias or manipulation? | Target-driven reporting distortion, missing audit trails, pressure points |
A data system that passes all five dimensions produces defensible data. A system that fails even one produces data that will not withstand external review, will not survive donor audit, or will not actually inform decisions. The dimensions are assessed separately but do not compensate for one another; a failure in any one compromises the whole dataset. See the data quality assurance entry for the formal USAID specification.
Validity: Measuring the Right Thing
Validity asks whether your indicator actually captures the concept you claim to measure. It is primarily an indicator design issue, resolved at the MEL plan stage, but failures can emerge in fieldwork when enumerators interpret indicators differently from the designer.
Three validity failure patterns:
Construct-measurement mismatch. The indicator is labeled to measure concept X but operationally measures concept Y. "Percentage of women empowered" names a construct; its operational measurement (number of women attending a livelihoods session) captures participation, not empowerment. The data may be collected correctly but does not validly represent the intended construct.
Proxy failure. The indicator uses a proxy measure (an easier-to-measure stand-in for a harder-to-measure concept). Proxies are defensible when they correlate strongly with the target construct. They fail when the correlation is weak or when the proxy changes for reasons unrelated to the target construct. Household income as a proxy for household food security is weaker than direct food consumption measures; using household income produces valid income data but not valid food security data.
Cultural or linguistic translation error. Indicator definitions developed in English and translated for field use can lose meaning. "Household" in one language may map to a nuclear unit; in another, to an extended compound of related families. A valid English indicator becomes invalid in Swahili or Bengali unless the translation is tested in the field context.
Validity is caught through indicator review at the design stage. Use the SMART Indicator deep-dive checklist to test construct-measurement alignment. In field contexts, run a cognitive pretest with 5-10 respondents in the actual language: ask them to explain what they think the question means. If their interpretation differs from the indicator specification, validity is compromised.
Reliability: Measuring It Consistently
Reliability asks whether data collected twice under the same conditions would produce the same result. It is primarily an execution issue, resolved through enumerator training and supervision, and it erodes continuously through a program's life.
Three reliability failure patterns:
Enumerator drift. At the start of training, all enumerators apply the same definitions. Three months in, each has developed small interpretive shortcuts. "Exclusive breastfeeding" as defined in training excludes water; one enumerator counts infants given water anyway, assuming the respondent misremembered. The same indicator produces different data across enumerators. Drift is predictable; it is caught with periodic re-training and blind spot-checks, not prevented once and for all.
Protocol inconsistency. Two supervisors interpret the same skip logic differently. One treats a "no" response to screening question 1 as survey-eligible with follow-up probes; the other treats it as ineligible and ends the interview. The two teams produce structurally different datasets. Protocol inconsistency is caught at supervision, through deliberate scenario testing during weekly team meetings.
Definitional confusion at the respondent level. The indicator is reliable across enumerators but reliability fails between the enumerator and the respondent. "Have you used this method in the last 30 days?" produces different answers depending on whether the respondent interprets "used" as started, completed, practiced once, or practiced consistently. The field-level definition must be operationalized in the question wording, not left to respondent interpretation.
Reliability problems are invisible in the final dataset unless specific controls are in place. The primary control is inter-rater agreement: periodically have two enumerators interview the same household independently, compare the responses, and measure the agreement rate. Agreement below 85% for closed questions signals a reliability problem. See the reliability reference entry for agreement calculation methods.
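A minimal sketch of the agreement calculation, assuming paired responses keyed by question ID (the function name and sample data are illustrative, not from any specific platform):

```python
def percent_agreement(responses_a: dict, responses_b: dict) -> float:
    """Share of closed questions where both enumerators recorded the same value."""
    shared = set(responses_a) & set(responses_b)
    if not shared:
        raise ValueError("no overlapping questions to compare")
    matches = sum(responses_a[q] == responses_b[q] for q in shared)
    return matches / len(shared)

# One spot-check household, interviewed independently by two enumerators.
enumerator_1 = {"q1": "yes", "q2": "no", "q3": 2, "q4": "card_seen"}
enumerator_2 = {"q1": "yes", "q2": "no", "q3": 3, "q4": "card_seen"}

rate = percent_agreement(enumerator_1, enumerator_2)
print(f"agreement: {rate:.0%}")   # agreement: 75%
if rate < 0.85:                   # threshold named in the text above
    print("below 85%: flag the indicator and schedule a re-training review")
```

Simple percent agreement overstates consistency when one answer dominates; Cohen's kappa corrects for chance agreement and is the stronger test where response distributions are skewed.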
Timeliness: Data When Decisions Need It
Timeliness asks whether data arrives in time to inform the decisions it is supposed to support. It is primarily a scheduling and reporting system issue, and it is the most common dimension to fail in M&E practice.
Three timeliness failure patterns:
Reporting lag. Data collected in March becomes available for analysis in May and reaches the steering committee in July. The quarterly decision cycle the data was meant to inform has already passed. The data is correct but useless for the purpose. Reporting lag is caught by mapping each indicator to its specific decision cycle and working backward to determine required collection and analysis dates.
Collection delay. The survey is scheduled for Month 6 but fieldwork does not start until Month 8 because enumerator hiring ran late. Baseline data that was meant to inform Year 1 programming lands midway through Year 1. The data is usable for endline comparison but cannot inform the programming decisions it was commissioned for.
Cycle misalignment. An annual survey is scheduled for September in a program that holds quarterly decision meetings. The September data is fresh for the October meeting and stale by the July meeting of the following year. Either the survey frequency is wrong or the decision structure is wrong, but the two are not aligned.
Timeliness fails more often than teams think because reporting lag is cumulative: a 2-week delay in fieldwork becomes a 4-week delay in entry becomes a 6-week delay in analysis becomes an 8-week delay in dissemination. Each step has a legitimate reason for its slippage; the aggregate delay makes the data unusable. Track the full data pipeline, not just collection, to catch timeliness problems.
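A minimal sketch of pipeline tracking under these assumptions; the step names, durations, and dates are illustrative, and each step slips two weeks to mirror the compounding pattern described above:

```python
from datetime import date, timedelta

steps = [
    # (step, planned weeks, actual weeks)
    ("fieldwork",     4, 6),
    ("data entry",    3, 5),
    ("analysis",      2, 4),
    ("dissemination", 1, 3),
]

start = date(2024, 3, 1)
decision_meeting = date(2024, 6, 14)   # the review this data is meant to inform

elapsed_planned = elapsed_actual = 0
for step, planned, actual in steps:
    elapsed_planned += planned
    elapsed_actual += actual
    print(f"{step:>13}: cumulative lag {elapsed_actual - elapsed_planned} weeks")

ready = start + timedelta(weeks=elapsed_actual)
if ready > decision_meeting:
    print(f"data ready {ready}, meeting {decision_meeting}: one cycle too late")
```

Running the same map at the planning stage, backward from the decision date, gives the latest acceptable start date for each step.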
Precision: Enough Detail to Act On
Precision asks whether data is detailed and accurate enough to distinguish the differences that matter. It is partly a sample size issue (precision of estimate), partly a disaggregation issue (precision of breakdown), and partly a measurement accuracy issue (precision of instrument).
Three precision failure patterns:
Insufficient sample size for subgroup comparisons. The overall sample of 400 is precise to within plus or minus 5 percentage points. Disaggregated by sex, each subgroup has n=200 and is precise to within plus or minus 7 points. Disaggregated by sex and age group, each cell has n=50 and is precise to within plus or minus 14 points. The subgroup comparisons that drove the analytical design cannot be made with statistical confidence. See common sampling mistakes for sample size planning.
Aggregation level mismatch. Data is reported at the national or regional level but the decision needs district or facility-level information. A coverage estimate of 72% nationally does not tell the program where to focus outreach; a district breakdown showing a range from 40% to 91% does. Precision failures at this level require either a larger sample or a sampling design stratified to the decision level.
Instrument-level imprecision. The measurement tool does not distinguish values that matter. A screening question that captures only "yes/no" on food security cannot inform design when the program's theory of change requires distinguishing "severely food insecure," "moderately food insecure," "mildly food insecure," and "food secure." Instrument-level precision requires revising the questionnaire or the indicator specification.
Precision is caught through design review: for each indicator, identify the smallest meaningful difference that the program needs to detect, and verify the sample and instrument can detect it. This is rarely done explicitly at design time; precision failures emerge in analysis when the team realizes the data cannot answer the question.
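The arithmetic behind the subgroup example above is simple to script at design time. A minimal sketch, assuming simple random sampling, a proportion near 50%, and 95% confidence (real designs also apply a design effect and non-response buffer; a sketch appears after the self-assessment checklist at the end of this entry):

```python
import math

def margin_of_error(n: int, p: float = 0.5) -> float:
    """Half-width of a 95% confidence interval for a proportion (SRS)."""
    return 1.96 * math.sqrt(p * (1 - p) / n)

# Precision collapses as disaggregation cells shrink.
for label, n in [("overall", 400), ("by sex", 200), ("sex x age", 50)]:
    print(f"{label}: n={n}, ±{margin_of_error(n) * 100:.0f} points")
# overall: n=400, ±5 points
# by sex: n=200, ±7 points
# sex x age: n=50, ±14 points
```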
Integrity: Protection from Bias and Manipulation
Integrity asks whether the data is protected from deliberate distortion: inflation to meet targets, omission of uncomfortable findings, undocumented edits to records. It is primarily a governance and audit-trail issue, rare in frequency but high-consequence when it fails.
Three integrity failure patterns:
Target-driven reporting distortion. The field team is under pressure to report numbers that demonstrate program success. Enumerators who record a "no" are asked to re-check with the respondent. Entry clerks with flagged values are told to "clarify" them. The pressure is rarely explicit; it is embedded in the structure of incentives and reviews. Integrity controls here include blind data entry (entry clerks do not know which values will be reviewed), automated audit of record changes, and a standing protocol that no one below the M&E lead can change a submitted value.
Missing audit trail. Data is cleaned, merged, and analyzed, but the sequence of edits is not recorded. When a reviewer asks why a value changed, no one can explain. Digital platforms with edit-history logging (see paper vs digital data collection) prevent most audit-trail failures; paper systems require parallel change logs.
Selective reporting. The data is collected honestly but only favorable findings are published. The dataset contains the full picture; the summary shows the subset that supports the program narrative. Integrity here is a governance issue: does the program commit to publishing the full DQA report, including negative findings, or does it filter results through internal review?
Integrity failures are rarely caught in routine DQA. A standard DQA tests the first four dimensions reliably; integrity testing requires specific protocols: random record audits, whistleblower pathways, third-party review of negative findings. See data quality assurance for governance-level controls.
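Two of these controls are easy to script against a digital platform's edit log. A minimal sketch, assuming an exported log with editor roles and submission flags (the field names, role labels, and 10% audit fraction are illustrative; adapt them to your platform's export):

```python
import random

# Assumed edit-log structure: who changed what, and whether the record
# had already been submitted when the change was made.
edit_log = [
    {"record": "HH-0412", "field": "employed", "old": "no", "new": "yes",
     "role": "entry_clerk", "after_submission": True},
    {"record": "HH-0199", "field": "age", "old": "34", "new": "43",
     "role": "mel_lead", "after_submission": True},
]

AUTHORIZED = {"mel_lead"}  # protocol: no one below the M&E lead edits submitted values

# Control 1: flag unauthorized post-submission edits.
for e in edit_log:
    if e["after_submission"] and e["role"] not in AUTHORIZED:
        print(f"FLAG {e['record']}: {e['field']} changed by {e['role']}")

# Control 2: reproducible random 10% sample for independent re-verification.
records = [f"HH-{i:04d}" for i in range(1, 501)]
audit_sample = random.Random(20240301).sample(records, k=len(records) // 10)
print(f"audit sample: {len(audit_sample)} of {len(records)} records")
```

Fixing the seed only after fieldwork closes, and sharing it with the reviewer, lets a third party reproduce the audit draw while keeping it unpredictable during collection.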
Sector Examples
Health: Reliability drift in a vaccination coverage survey, East Africa
A health program ran quarterly vaccination coverage surveys using four teams of two enumerators each. Round 1 training was thorough and inter-rater agreement was 92%. By Round 3, two enumerators had rotated out and been replaced without re-training the full team, and the team leads had started allowing "card not seen but recited by caregiver" as verified vaccination (original protocol required card sighting). Agreement dropped to 76%. The coverage estimate drifted upward by 11 percentage points over three rounds, not because coverage improved but because reliability eroded. A re-training at Round 4 restored the original protocol and agreement climbed back to 88%, but the reported coverage series was no longer internally comparable.
WASH: Timeliness failure in a safe water access program, West Africa
A program designed a household survey to inform quarterly adaptive management reviews. Fieldwork in Month 3 took 4 weeks; data entry took 3 weeks; analysis and review took 2 weeks. Results were presented in the Month 6 review meeting, one cycle late. By then, the program had already moved on in its implementation. Three rounds later the pattern was established: data always arrived one cycle late. The program switched to a smaller, more focused monthly assessment and moved the full survey to semi-annual frequency. Timeliness was restored.
Education: Precision failure in a learning outcomes study, South Asia
A program commissioned a learning outcomes assessment intending to disaggregate results by grade, gender, and school type (public vs private-aided). The study enrolled 800 students. Disaggregated three ways, each cell had 30-50 students, which produced margins of error too wide to test hypotheses meaningfully. The headline result (overall learning scores) was precise; the subgroup analysis the program actually needed for design decisions was not. A follow-up study with a stratified sample specifically designed for the three-way disaggregation cost 2.3x the original and took 4 months to execute.
Livelihoods: Integrity failure in post-training tracking, Southern Africa
A livelihoods program tracked post-training employment through self-reporting by trainers whose performance was evaluated partly on placement rates. Over two years, reported employment climbed from 54% to 83%. An external re-contact of former trainees found the actual rate was 52%. Trainers under pressure to show progress had recorded part-time or informal work as employment. The program added third-party verification with a random 10% phone survey and redesigned the trainer metric to remove the pressure point.
Food security: Validity failure in consumption measurement, Sahel
A food security program used "meals consumed per day" as an indicator of improvement. In the pastoralist context, this was not valid: traditional diets involve two meals per day even in food-secure periods, so an increase in meal frequency did not signal improved food security. The indicator measured a shift in cultural eating norms, not the intended construct. Re-specification to the Household Hunger Scale (HHS) resolved the problem and produced comparable data across contexts.
Common Mistakes
Mistake 1: Treating DQA as a one-time audit. Data quality erodes continuously: enumerator drift, protocol slippage, reporting lag, integrity pressure. A DQA conducted at baseline certifies the system at that moment, not forever. Quarterly or semi-annual DQA cycles catch drift before it compounds.
Mistake 2: Confusing reliability with validity. Reliable data collection of an invalid indicator produces consistently wrong data. Validity is designed at the indicator specification stage; reliability is executed at the field level. Both are needed.
Mistake 3: Ignoring precision until analysis. Sample size and disaggregation levels are decided at the design stage. Discovering in analysis that the sample cannot support the intended subgroup comparison is too late to fix without a second study.
Mistake 4: Accepting "the data will be ready soon" as a timeliness plan. Without specific collection, entry, analysis, and reporting deadlines, timeliness fails by default. Each step needs a named owner and a calendar date.
Mistake 5: Not protecting the integrity pressure points. Integrity fails where the structure creates pressure: staff whose performance is judged by the data they report, teams whose funding depends on hitting targets. These pressure points need explicit controls (blind entry, audit, third-party verification), not trust in individual ethics.
Mistake 6: Running DQA without a baseline to compare against. A DQA that finds "acceptable quality" without a prior quality standard or prior assessment has no benchmark. Establish a quality standard in Year 1 and compare subsequent DQAs against it.
Five-Dimension Self-Assessment
Run through this for each priority indicator in your MEL plan. Any "no" answer identifies a specific quality risk.
Validity:
- The indicator's construct-measurement match has been tested (cognitive interviews, expert review, or prior-study replication)
- Translations have been field-tested in each language and context used
- Proxies (where used) have documented correlation with the target construct
Reliability:
- Enumerator training covers interpretive edge cases, not just the instrument
- Inter-rater agreement has been measured in at least one round (target: 85% or higher)
- Re-training is scheduled after every 2-3 rounds of fieldwork or significant turnover
Timeliness:
- The decision cycle the indicator feeds is named, with deadlines
- The data pipeline is mapped from collection to dissemination, with dates for each step
- The collection frequency matches the decision frequency
Precision:
- The smallest meaningful difference the program needs to detect is defined
- Sample size is calculated for that difference, with design effect and non-response buffer applied (see the sketch after this checklist)
- Disaggregation is pre-planned with sample sizes that support the comparisons
Integrity:
- Pressure points (performance-linked reporting, target-driven incentives) are identified
- Audit trails exist for all record changes (digital platforms or parallel paper logs)
- A third-party or random-sample verification cycle is built into the DQA schedule
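For the sample-size item under Precision, a minimal sketch of the calculation, assuming a proportion indicator at 95% confidence; the 1.5 design effect and 90% response rate are illustrative placeholders, not recommendations:

```python
import math

def planned_n(moe: float, deff: float = 1.5, response_rate: float = 0.9,
              p: float = 0.5) -> int:
    """Per-cell sample size for a 95% CI of half-width `moe`,
    inflated for clustering (design effect) and expected non-response."""
    base = p * (1 - p) * (1.96 / moe) ** 2   # simple-random-sampling n
    return math.ceil(base * deff / response_rate)

# ±5 points per reporting cell, clustered design, 90% expected response:
print(planned_n(0.05))   # 641 per cell
# Multiply by the number of pre-planned disaggregation cells to size the survey.
```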
For the DQA process, see how to conduct a DQA. For the broader framework, see data quality assurance, validity, and reliability.