Who This Page Is For
You have finished data collection. The raw export is sitting in your downloads folder. Before you open it and start making charts, you need to clean it. Dirty data produces wrong findings, and wrong findings produce bad decisions.
This page is a step-by-step protocol for cleaning M&E datasets after collection. It covers household surveys, facility assessments, monitoring forms, and any structured data you collect in the field. Follow the steps in order. Skip nothing.
Before You Start
You need three things before you touch a single cell:
- Your raw data export. Download it from KoboToolbox, SurveyCTO, ODK, or whatever platform you used. Export as XLSX or CSV. Do not open it in the platform's online editor.
- Your codebook. This is the document that defines every variable, its expected values, skip logic rules, and valid ranges. If you do not have a codebook, your survey design process skipped a step. Build one now before cleaning.
- A cleaning log template. A spreadsheet with columns for: record ID, variable name, original value, new value, reason for change, date, and who made the change. Every edit you make goes here. No exceptions.
If you designed your data collection tools well, cleaning will be straightforward. If you skipped validation rules during survey design, expect to spend significantly more time here.
Step 1: Back Up and Set Up Your Workspace
Never clean data in the original file. Ever.
- Save the raw export to a folder named raw/ or 01_raw/. This file is read-only from this point forward. Do not edit it.
- Create a copy in a folder named clean/ or 02_clean/. All edits happen in this copy.
- Open your cleaning log. You will update it continuously as you work through the steps below.
- Record the date, the raw file name, the number of records, and the number of variables. This is your starting point.
Name your files with dates: household_survey_raw_20260401.xlsx and household_survey_clean_20260401.xlsx. When someone asks "which version did you analyze?" you need to answer precisely.
Step 2: Detect and Remove Duplicates
Duplicates happen more often than you think. Enumerators resubmit forms, connectivity issues cause double uploads, and copy-paste errors create phantom records.
How to check:
- Sort by unique ID (respondent ID, household ID, or submission UUID). Look for exact matches.
- If you have no unique ID, sort by a combination of location, date, and respondent name. Flag records that match on all three.
- Check submission timestamps. Two records from the same enumerator submitted within 60 seconds of each other are almost certainly duplicates.
Decision rule: If two records have the same unique ID and identical responses on 90% or more of variables, delete the later submission. Log the deletion. If the responses differ meaningfully, flag both for manual review.
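The decision rule above can be sketched in plain Python. This is a minimal illustration, not a prescribed implementation; the field names (submission_id, submitted_at) are assumptions standing in for whatever your platform exports.

```python
# Sketch of the Step 2 decision rule: same unique ID plus >= 90%
# identical responses means the later submission gets deleted.
# Field names "submission_id" and "submitted_at" are illustrative.

def duplicate_action(rec_a, rec_b, id_field="submission_id",
                     ts_field="submitted_at", threshold=0.9):
    """Return what to do with a candidate duplicate pair."""
    if rec_a[id_field] != rec_b[id_field]:
        return "keep_both"
    # Compare every response field except the ID and timestamp.
    shared = [k for k in rec_a if k not in (id_field, ts_field)]
    matches = sum(rec_a[k] == rec_b[k] for k in shared)
    if matches / len(shared) >= threshold:
        # ISO-format timestamps sort correctly as strings.
        later = rec_a if rec_a[ts_field] > rec_b[ts_field] else rec_b
        return ("delete_later", later[ts_field])
    return "manual_review"
```

Whatever tool you use, the key point survives: the comparison excludes the ID and timestamp themselves, and a meaningful difference in responses routes the pair to review rather than deletion.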
Step 3: Check Completeness
Every record should have values for required fields. Missing data is normal, but too much missing data makes a record unusable.
- Calculate the completion rate for each record: number of non-blank required fields divided by total required fields.
- Flag any record below 80% completion. A record missing a quarter of its required fields is suspect.
- For each flagged record, check the submission metadata. Was it a test submission? Did the enumerator abandon it mid-interview? Was the respondent unavailable after consent?
Decision rule: Records below 50% completion are almost always unusable. Exclude them from analysis and log the reason. Records between 50% and 80% completion may be usable for some indicators but not others. Do not discard them automatically. Instead, mark them as partial and decide per-analysis which ones qualify.
Check for blank columns too. If an entire variable has zero responses, something went wrong with the form logic or the variable was never asked. Investigate before moving on.
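The completion-rate calculation and the 50%/80% thresholds translate directly into code. A minimal sketch, assuming a hypothetical list of required fields (yours comes from the codebook):

```python
# Sketch of the Step 3 completion-rate rule. REQUIRED is a stand-in
# for the required-field list defined in your codebook.

REQUIRED = ["age", "hh_size", "income", "consent"]

def completion_status(record, required=REQUIRED):
    """Return (completion rate, disposition) for one record."""
    filled = sum(1 for f in required
                 if record.get(f) not in (None, "", "NA"))
    rate = filled / len(required)
    if rate < 0.5:
        return rate, "exclude"   # almost always unusable; log the reason
    if rate < 0.8:
        return rate, "partial"   # usable for some indicators only
    return rate, "complete"
```

Note that a legitimate value of 0 counts as filled; only blanks and explicit missing codes count against the rate.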
Step 4: Validate Skip Logic
Skip logic errors are the most common source of bad data in electronic surveys, and they are easy to miss if you do not check systematically.
Pull up your codebook. For every conditional question, verify:
- Records that should have skipped did skip. If question 12 only applies to respondents who answered "Yes" to question 10, check that all "No" respondents on question 10 have blank or "N/A" values on question 12.
- Records that should have answered did answer. If a respondent answered "Yes" to question 10 but question 12 is blank, that is a data gap, not a valid skip.
How to do this efficiently: Write simple filter rules. In Excel: filter column Q10 to "No," then check that column Q12 is blank for all filtered rows. In R or Stata, this is a one-line logical check per skip pattern.
Decision rule: If a respondent has answers to questions they should have skipped, those answers are invalid. Set them to blank and log the change. If a respondent skipped questions they should have answered, flag the record for review but do not fabricate data.
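Both skip-logic tests can run in one pass over the data. A sketch using the Q10/Q12 example from above; the column names and the "No"/"Yes" values are illustrative:

```python
# Sketch of the two Step 4 checks: invalid answers (should have
# skipped but answered) and data gaps (should have answered but
# skipped). Column names "q10"/"q12" follow the example in the text.

def skip_logic_errors(records, trigger="q10", dependent="q12"):
    invalid_answers, data_gaps = [], []
    for rec in records:
        answered = rec.get(dependent) not in (None, "")
        if rec.get(trigger) == "No" and answered:
            invalid_answers.append(rec["id"])  # set dependent blank, log it
        elif rec.get(trigger) == "Yes" and not answered:
            data_gaps.append(rec["id"])        # flag; never fabricate data
    return invalid_answers, data_gaps
```

One call per skip pattern in your codebook gives you the full picture.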
Step 5: Check Ranges and Flag Outliers
Every numeric variable has a plausible range. A household with 47 members, an age of 250, or a monthly income of $1 are all data entry or collection errors.
- For each numeric variable, define the valid range from your codebook. Example: age of household head should be 18-99. Number of children under 5 should be 0-10.
- Calculate basic descriptive statistics: minimum, maximum, mean, median, standard deviation.
- Flag any value outside the valid range as an error.
- Flag any value more than 3 standard deviations from the mean as a potential outlier for review.
Common decision rules you can copy:
| Check | Flag Rule |
|---|---|
| Survey duration | Less than 5 minutes for a 30-minute survey: flag for review |
| GPS coordinates | More than 50 km from the expected survey area: flag |
| Age | Outside 0-120: error. Under 15 for head of household: flag |
| Household size | Over 20: flag for review |
| Income/expenditure | Zero for all categories: flag. Top 1%: verify |
Do not automatically delete outliers. Some are real. A household with 15 members exists. An income of $0 might be accurate for an extremely vulnerable household. Flag, investigate, then decide.
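The two-tier rule above (hard range errors vs. soft statistical flags) looks like this in a minimal Python sketch, using the codebook's age-of-household-head example (18-99) as the valid range:

```python
# Sketch of Step 5: values outside the codebook range are errors;
# in-range values more than 3 SD from the mean are flagged for
# review, not deleted.

import statistics

def range_and_outlier_flags(values, lo=18, hi=99, sd_cut=3):
    errors = [v for v in values if not (lo <= v <= hi)]
    in_range = [v for v in values if lo <= v <= hi]
    mean = statistics.mean(in_range)
    sd = statistics.stdev(in_range)
    outliers = [v for v in in_range if abs(v - mean) > sd_cut * sd]
    return errors, outliers
```

Computing the outlier statistics on the in-range values only is a deliberate choice: an impossible value like 250 would otherwise inflate the standard deviation and hide genuine outliers.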
Step 6: Run Consistency Checks
Consistency checks catch logical contradictions that range checks miss. These require cross-referencing two or more variables.
Examples of consistency checks:
- A respondent reports 0 children but later reports school enrollment for 3 children.
- A respondent's reported age is 12 but their occupation is "teacher."
- A household reports no agricultural land but later reports crop harvest quantities.
- Date of birth and reported age differ by more than 1 year.
- A facility reports 0 staff but also reports service delivery numbers.
Build a list of 10-15 consistency checks specific to your survey. Run each one as a logical test. Flag contradictions for manual review.
Decision rule: Do not guess which answer is correct. If two variables contradict each other, you cannot determine which one is wrong from the data alone. Flag the record, and if you can contact the enumerator or respondent, do so. If you cannot, document the contradiction and exclude the conflicting variables (not the entire record) from relevant analyses.
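A check list like the one above runs naturally as a set of named predicates. A sketch with three of the example contradictions; all field names are hypothetical stand-ins for your survey's variables:

```python
# Sketch of Step 6: each consistency check is a (name, predicate)
# pair that returns True when the record contradicts itself.
# Field names are illustrative, not from any real survey.

CHECKS = [
    ("children_vs_enrollment",
     lambda r: r["num_children"] == 0 and r["children_enrolled"] > 0),
    ("age_vs_occupation",
     lambda r: r["age"] < 15 and r["occupation"] == "teacher"),
    ("land_vs_harvest",
     lambda r: r["land_ha"] == 0 and r["harvest_kg"] > 0),
]

def run_consistency_checks(records, checks=CHECKS):
    """Return (record id, check name) for every contradiction found."""
    flags = []
    for rec in records:
        for name, contradicts in checks:
            if contradicts(rec):
                flags.append((rec["id"], name))
    return flags
```

Keeping the checks as named pairs means the flag itself tells the reviewer which contradiction to investigate.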
Step 7: Document Everything in Your Cleaning Log
Your cleaning log is not optional. It is a data quality assurance requirement and an audit trail.
By the time you finish Steps 2 through 6, your log should contain every change you made. Review it now for completeness:
- Every deleted duplicate has an entry.
- Every corrected value shows the original and the new value.
- Every excluded record has a reason.
- Every flagged outlier has a disposition (corrected, verified as accurate, or excluded).
Save the cleaning log alongside the cleaned dataset. Anyone who uses your data should be able to open the log and understand exactly what was changed and why. This is fundamental to good data management.
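If you script any part of your cleaning, the log can be written by the same script. A minimal sketch matching the columns listed in "Before You Start"; the output path and editor name are illustrative:

```python
# Sketch: append cleaning-log entries and save them as CSV. Columns
# mirror the template described in "Before You Start". The file name
# "cleaning_log.csv" is an example, not a required convention.

import csv
import datetime

LOG_COLUMNS = ["record_id", "variable", "original_value",
               "new_value", "reason", "date", "changed_by"]

def log_change(log_rows, record_id, variable, original, new,
               reason, changed_by):
    """Record one edit. Every change to the dataset gets one row."""
    log_rows.append({
        "record_id": record_id, "variable": variable,
        "original_value": original, "new_value": new,
        "reason": reason,
        "date": datetime.date.today().isoformat(),
        "changed_by": changed_by,
    })

def save_log(log_rows, path="cleaning_log.csv"):
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=LOG_COLUMNS)
        writer.writeheader()
        writer.writerows(log_rows)
```

Calling log_change at the moment you make each edit, rather than reconstructing the log afterward, is what keeps the audit trail complete.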
Step 8: Final Verification
Before you declare the dataset clean, run a final round of checks.
- Recount records. Compare your cleaned dataset count to the raw count. The difference should exactly match the number of records you excluded or deleted, as documented in your cleaning log.
- Rerun range checks. Verify that no out-of-range values remain.
- Rerun skip logic checks. Confirm all skip patterns are now consistent.
- Spot-check 10-20 records manually. Open the raw file and the clean file side by side. Pick records at random. Verify the changes match your log.
- Run the DQA Scorecard to get a structured quality assessment of your cleaned dataset.
If everything checks out, rename your file to indicate it is final: household_survey_clean_FINAL_20260401.xlsx. Lock the file or set it to read-only.
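The record-count reconciliation and random spot-check from this step can be sketched as one small helper. It assumes your records carry unique IDs; the parameter names are illustrative:

```python
# Sketch of Step 8: (1) verify that the IDs missing from the clean
# file are exactly the IDs your cleaning log says were removed, and
# (2) draw a random sample of clean records for manual spot-checking.

import random

def reconcile_and_sample(raw_ids, clean_ids, logged_removed_ids, k=10):
    removed = set(raw_ids) - set(clean_ids)
    counts_ok = removed == set(logged_removed_ids)
    sample = random.sample(sorted(clean_ids), min(k, len(clean_ids)))
    return counts_ok, sample
```

Comparing ID sets rather than bare counts is slightly stronger than the count check alone: it catches the case where one record was deleted without logging and a different one was accidentally duplicated, which leaves the totals matching.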
Common Mistakes
Mistake 1: Cleaning in the original file. Once you overwrite raw data, it is gone. Every cleaning decision becomes irreversible and unverifiable. Always work in a copy. Always keep the raw export untouched.
Mistake 2: Deleting records instead of flagging them. Deleting a record removes it from every possible analysis, including ones you have not thought of yet. Flag and exclude instead. Let the analyst decide which records qualify for each specific analysis.
Mistake 3: No cleaning log. If you cannot explain what you changed and why, your cleaned dataset is not trustworthy. When a donor audits your data or a colleague questions a finding, "I cleaned it but didn't document the changes" is not an acceptable answer.
Mistake 4: Cleaning alone. Data cleaning involves judgment calls. A second person should review your flagged records and your decisions about outliers. If your team has only one data person, at minimum have the M&E manager review the cleaning log before analysis begins.
Mistake 5: Treating cleaning as optional. Teams skip cleaning when deadlines are tight, then discover problems during analysis when it is too late to fix them. Budget cleaning time into your workplan from the start. For a baseline or endline survey, plan for 1-3 days of cleaning per 500 responses.
Cleaning Checklist
Print this or copy it into your project folder. Check off each item as you complete it.
- Raw data exported and saved to read-only raw/ folder
- Clean copy created in separate clean/ folder
- Cleaning log template open with starting record count documented
- Duplicates identified and removed (later submission deleted, logged)
- Completion rates calculated; records below 50% excluded and logged
- Skip logic validated for all conditional questions
- Valid ranges defined for all numeric variables
- Out-of-range values corrected or flagged
- Outliers reviewed (3+ SD from mean) and dispositioned
- Consistency checks run (10-15 cross-variable logic tests)
- Contradictions flagged and resolved or documented
- Cleaning log reviewed for completeness
- Final record count reconciled against raw count and cleaning log
- Range and skip logic checks rerun on cleaned dataset
- 10-20 records spot-checked against raw data
- Cleaned file renamed with FINAL tag and locked
- Cleaning log saved alongside cleaned dataset