Who This Page Is For
You have finished data collection. The raw export is sitting in your downloads folder. Before you open it and start making charts, you need to clean it. Dirty data produces wrong findings, and wrong findings produce bad decisions.
This page is a step-by-step protocol for cleaning M&E datasets after collection. It covers household surveys, facility assessments, monitoring forms, and any structured data you collect in the field. Follow the steps in order. Skip nothing.
Before You Start
You need three things before you touch a single cell:
- Your raw data export. Download it from KoboToolbox, SurveyCTO, ODK, or whatever platform you used. Export as XLSX or CSV. Do not open it in the platform's online editor.
- Your codebook. This is the document that defines every variable, its expected values, skip logic rules, and valid ranges. If you do not have a codebook, your survey design process skipped a step. Build one now before cleaning.
- A cleaning log template. A spreadsheet with columns for: record ID, variable name, original value, new value, reason for change, date, and who made the change. Every edit you make goes here. No exceptions.
If you designed your data collection tools well, cleaning will be straightforward. If you skipped validation rules during survey design, expect to spend significantly more time here.
Step 1: Back Up and Set Up Your Workspace
Never clean data in the original file. Ever.
- Save the raw export to a folder named raw/ or 01_raw/. This file is read-only from this point forward. Do not edit it.
- Create a copy in a folder named clean/ or 02_clean/. All edits happen in this copy.
- Open your cleaning log. You will update it continuously as you work through the steps below.
- Record the date, the raw file name, the number of records, and the number of variables. This is your starting point.
Name your files with dates: household_survey_raw_20260401.xlsx and household_survey_clean_20260401.xlsx. When someone asks "which version did you analyze?" you need to answer precisely.
Step 2: Detect and Remove Duplicates
Duplicates happen more often than you think. Enumerators resubmit forms, connectivity issues cause double uploads, and copy-paste errors create phantom records.
How to check:
- Sort by unique ID (respondent ID, household ID, or submission UUID). Look for exact matches.
- If you have no unique ID, sort by a combination of location, date, and respondent name. Flag records that match on all three.
- Check submission timestamps. Two records from the same enumerator submitted within 60 seconds of each other are almost certainly duplicates.
Decision rule: If two records have the same unique ID and identical responses on 90% or more of variables, delete the later submission. Log the deletion. If the responses differ meaningfully, flag both for manual review.
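The decision rule above can be sketched in plain Python. This is a minimal illustration, not a prescribed implementation; the field names (submission_id, submitted_at) are assumptions standing in for whatever your platform exports.

```python
# Sketch of the Step 2 decision rule: same unique ID plus >= 90%
# identical responses means the later submission gets deleted.
# Field names "submission_id" and "submitted_at" are illustrative.

def duplicate_action(rec_a, rec_b, id_field="submission_id",
                     ts_field="submitted_at", threshold=0.9):
    """Return what to do with a candidate duplicate pair."""
    if rec_a[id_field] != rec_b[id_field]:
        return "keep_both"
    # Compare every response field except the ID and timestamp.
    shared = [k for k in rec_a if k not in (id_field, ts_field)]
    matches = sum(rec_a[k] == rec_b[k] for k in shared)
    if matches / len(shared) >= threshold:
        # ISO-format timestamps sort correctly as strings.
        later = rec_a if rec_a[ts_field] > rec_b[ts_field] else rec_b
        return ("delete_later", later[ts_field])
    return "manual_review"
```

Whatever tool you use, the key point survives: the comparison excludes the ID and timestamp themselves, and a meaningful difference in responses routes the pair to review rather than deletion.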
Step 3: Check Completeness
Every record should have values for required fields. Missing data is normal, but too much missing data makes a record unusable.
- Calculate the completion rate for each record: number of non-blank required fields divided by total required fields.
- Flag any record below 80% completion. A record missing a quarter of its required fields is suspect.
- For each flagged record, check the submission metadata. Was it a test submission? Did the enumerator abandon it mid-interview? Was the respondent unavailable after consent?
Decision rule: Records below 50% completion are almost always unusable. Exclude them from analysis and log the reason. Records between 50% and 80% completion may be usable for some indicators but not others. Do not discard them automatically. Instead, mark them as partial and decide per-analysis which ones qualify.
Check for blank columns too. If an entire variable has zero responses, something went wrong with the form logic or the variable was never asked. Investigate before moving on.
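The completion-rate calculation and the 50%/80% thresholds translate directly into code. A minimal sketch, assuming a hypothetical list of required fields (yours comes from the codebook):

```python
# Sketch of the Step 3 completion-rate rule. REQUIRED is a stand-in
# for the required-field list defined in your codebook.

REQUIRED = ["age", "hh_size", "income", "consent"]

def completion_status(record, required=REQUIRED):
    """Return (completion rate, disposition) for one record."""
    filled = sum(1 for f in required
                 if record.get(f) not in (None, "", "NA"))
    rate = filled / len(required)
    if rate < 0.5:
        return rate, "exclude"   # almost always unusable; log the reason
    if rate < 0.8:
        return rate, "partial"   # usable for some indicators only
    return rate, "complete"
```

Note that a legitimate value of 0 counts as filled; only blanks and explicit missing codes count against the rate.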
Step 4: Validate Skip Logic
Skip logic errors are the most common source of bad data in electronic surveys, and they are easy to miss if you do not check systematically.
Pull up your codebook. For every conditional question, verify:
- Records that should have skipped did skip. If question 12 only applies to respondents who answered "Yes" to question 10, check that all "No" respondents on question 10 have blank or "N/A" values on question 12.
- Records that should have answered did answer. If a respondent answered "Yes" to question 10 but question 12 is blank, that is a data gap, not a valid skip.
How to do this efficiently: Write simple filter rules. In Excel: filter column Q10 to "No," then check that column Q12 is blank for all filtered rows. In R or Stata, this is a one-line logical check per skip pattern.
Decision rule: If a respondent has answers to questions they should have skipped, those answers are invalid. Set them to blank and log the change. If a respondent skipped questions they should have answered, flag the record for review but do not fabricate data.
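Both skip-logic tests can run in one pass over the data. A sketch using the Q10/Q12 example from above; the column names and the "No"/"Yes" values are illustrative:

```python
# Sketch of the two Step 4 checks: invalid answers (should have
# skipped but answered) and data gaps (should have answered but
# skipped). Column names "q10"/"q12" follow the example in the text.

def skip_logic_errors(records, trigger="q10", dependent="q12"):
    invalid_answers, data_gaps = [], []
    for rec in records:
        answered = rec.get(dependent) not in (None, "")
        if rec.get(trigger) == "No" and answered:
            invalid_answers.append(rec["id"])  # set dependent blank, log it
        elif rec.get(trigger) == "Yes" and not answered:
            data_gaps.append(rec["id"])        # flag; never fabricate data
    return invalid_answers, data_gaps
```

One call per skip pattern in your codebook gives you the full picture.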
Step 5: Check Ranges and Flag Outliers
Every numeric variable has a plausible range. A household with 47 members, an age of 250, or a monthly income of $1 are all data entry or collection errors.
- For each numeric variable, define the valid range from your codebook. Example: age of household head should be 18-99. Number of children under 5 should be 0-10.
- Calculate basic descriptive statistics: minimum, maximum, mean, median, standard deviation.
- Flag any value outside the valid range as an error.
- Flag any value more than 3 standard deviations from the mean as a potential outlier for review.
Common decision rules you can copy:
| Check | Flag Rule |
|---|---|
| Survey duration | Less than 5 minutes for a 30-minute survey: flag for review |
| GPS coordinates | More than 50 km from the expected survey area: flag |
| Age | Outside 0-120: error. Under 15 for head of household: flag |
| Household size | Over 20: flag for review |
| Income/expenditure | Zero for all categories: flag. Top 1%: verify |
Do not automatically delete outliers. Some are real. A household with 15 members exists. An income of $0 might be accurate for an extremely vulnerable household. Flag, investigate, then decide.
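The two-tier rule above (hard range errors vs. soft statistical flags) looks like this in a minimal Python sketch, using the codebook's age-of-household-head example (18-99) as the valid range:

```python
# Sketch of Step 5: values outside the codebook range are errors;
# in-range values more than 3 SD from the mean are flagged for
# review, not deleted.

import statistics

def range_and_outlier_flags(values, lo=18, hi=99, sd_cut=3):
    errors = [v for v in values if not (lo <= v <= hi)]
    in_range = [v for v in values if lo <= v <= hi]
    mean = statistics.mean(in_range)
    sd = statistics.stdev(in_range)
    outliers = [v for v in in_range if abs(v - mean) > sd_cut * sd]
    return errors, outliers
```

Computing the outlier statistics on the in-range values only is a deliberate choice: an impossible value like 250 would otherwise inflate the standard deviation and hide genuine outliers.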
Step 6: Run Consistency Checks
Consistency checks catch logical contradictions that range checks miss. These require cross-referencing two or more variables.
Examples of consistency checks:
- A respondent reports 0 children but later reports school enrollment for 3 children.
- A respondent's reported age is 12 but their occupation is "teacher."
- A household reports no agricultural land but later reports crop harvest quantities.
- Date of birth and reported age differ by more than 1 year.
- A facility reports 0 staff but also reports service delivery numbers.
Build a list of 10-15 consistency checks specific to your survey. Run each one as a logical test. Flag contradictions for manual review.
Decision rule: Do not guess which answer is correct. If two variables contradict each other, you cannot determine which one is wrong from the data alone. Flag the record, and if you can contact the enumerator or respondent, do so. If you cannot, document the contradiction and exclude the conflicting variables (not the entire record) from relevant analyses.
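A check list like the one above runs naturally as a set of named predicates. A sketch with three of the example contradictions; all field names are hypothetical stand-ins for your survey's variables:

```python
# Sketch of Step 6: each consistency check is a (name, predicate)
# pair that returns True when the record contradicts itself.
# Field names are illustrative, not from any real survey.

CHECKS = [
    ("children_vs_enrollment",
     lambda r: r["num_children"] == 0 and r["children_enrolled"] > 0),
    ("age_vs_occupation",
     lambda r: r["age"] < 15 and r["occupation"] == "teacher"),
    ("land_vs_harvest",
     lambda r: r["land_ha"] == 0 and r["harvest_kg"] > 0),
]

def run_consistency_checks(records, checks=CHECKS):
    """Return (record id, check name) for every contradiction found."""
    flags = []
    for rec in records:
        for name, contradicts in checks:
            if contradicts(rec):
                flags.append((rec["id"], name))
    return flags
```

Keeping the checks as named pairs means the flag itself tells the reviewer which contradiction to investigate.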
Step 7: Document Everything in Your Cleaning Log
Your cleaning log is not optional. It is a data quality assurance requirement and an audit trail.
By the time you finish Steps 2 through 6, your log should contain every change you made. Review it now for completeness:
- Every deleted duplicate has an entry.
- Every corrected value shows the original and the new value.
- Every excluded record has a reason.
- Every flagged outlier has a disposition (corrected, verified as accurate, or excluded).
Save the cleaning log alongside the cleaned dataset. Anyone who uses your data should be able to open the log and understand exactly what was changed and why. This is fundamental to good data management.
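If you script any part of your cleaning, the log can be written by the same script. A minimal sketch matching the columns listed in "Before You Start"; the output path and editor name are illustrative:

```python
# Sketch: append cleaning-log entries and save them as CSV. Columns
# mirror the template described in "Before You Start". The file name
# "cleaning_log.csv" is an example, not a required convention.

import csv
import datetime

LOG_COLUMNS = ["record_id", "variable", "original_value",
               "new_value", "reason", "date", "changed_by"]

def log_change(log_rows, record_id, variable, original, new,
               reason, changed_by):
    """Record one edit. Every change to the dataset gets one row."""
    log_rows.append({
        "record_id": record_id, "variable": variable,
        "original_value": original, "new_value": new,
        "reason": reason,
        "date": datetime.date.today().isoformat(),
        "changed_by": changed_by,
    })

def save_log(log_rows, path="cleaning_log.csv"):
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=LOG_COLUMNS)
        writer.writeheader()
        writer.writerows(log_rows)
```

Calling log_change at the moment you make each edit, rather than reconstructing the log afterward, is what keeps the audit trail complete.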
Step 8: Final Verification
Before you declare the dataset clean, run a final round of checks.
- Recount records. Compare your cleaned dataset count to the raw count. The difference should exactly match the number of records you excluded or deleted, as documented in your cleaning log.
- Rerun range checks. Verify that no out-of-range values remain.
- Rerun skip logic checks. Confirm all skip patterns are now consistent.
- Spot-check 10-20 records manually. Open the raw file and the clean file side by side. Pick records at random. Verify the changes match your log.
- Run the DQA Scorecard to get a structured quality assessment of your cleaned dataset.
If everything checks out, rename your file to indicate it is final: household_survey_clean_FINAL_20260401.xlsx. Lock the file or set it to read-only.
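The record-count reconciliation and random spot-check from this step can be sketched as one small helper. It assumes your records carry unique IDs; the parameter names are illustrative:

```python
# Sketch of Step 8: (1) verify that the IDs missing from the clean
# file are exactly the IDs your cleaning log says were removed, and
# (2) draw a random sample of clean records for manual spot-checking.

import random

def reconcile_and_sample(raw_ids, clean_ids, logged_removed_ids, k=10):
    removed = set(raw_ids) - set(clean_ids)
    counts_ok = removed == set(logged_removed_ids)
    sample = random.sample(sorted(clean_ids), min(k, len(clean_ids)))
    return counts_ok, sample
```

Comparing ID sets rather than bare counts is slightly stronger than the count check alone: it catches the case where one record was deleted without logging and a different one was accidentally duplicated, which leaves the totals matching.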
Common Mistakes
Mistake 1: Cleaning in the original file. Once you overwrite raw data, it is gone. Every cleaning decision becomes irreversible and unverifiable. Always work in a copy. Always keep the raw export untouched.
Mistake 2: Deleting records instead of flagging them. Deleting a record removes it from every possible analysis, including ones you have not thought of yet. Flag and exclude instead. Let the analyst decide which records qualify for each specific analysis.
Mistake 3: No cleaning log. If you cannot explain what you changed and why, your cleaned dataset is not trustworthy. When a donor audits your data or a colleague questions a finding, "I cleaned it but didn't document the changes" is not an acceptable answer.
Mistake 4: Cleaning alone. Data cleaning involves judgment calls. A second person should review your flagged records and your decisions about outliers. If your team has only one data person, at minimum have the M&E manager review the cleaning log before analysis begins.
Mistake 5: Treating cleaning as optional. Teams skip cleaning when deadlines are tight, then discover problems during analysis when it is too late to fix them. Budget cleaning time into your workplan from the start. For a baseline or endline survey, plan for 1-3 days of cleaning per 500 responses.
Cleaning Checklist
Print this or copy it into your project folder. Check off each item as you complete it.
- Raw data exported and saved to read-only raw/ folder
- Clean copy created in separate clean/ folder
- Cleaning log template open with starting record count documented
- Duplicates identified and removed (later submission deleted, logged)
- Completion rates calculated; records below 50% excluded and logged
- Skip logic validated for all conditional questions
- Valid ranges defined for all numeric variables
- Out-of-range values corrected or flagged
- Outliers reviewed (3+ SD from mean) and dispositioned
- Consistency checks run (10-15 cross-variable logic tests)
- Contradictions flagged and resolved or documented
- Cleaning log reviewed for completeness
- Final record count reconciled against raw count and cleaning log
- Range and skip logic checks rerun on cleaned dataset
- 10-20 records spot-checked against raw data
- Cleaned file renamed with FINAL tag and locked
- Cleaning log saved alongside cleaned dataset