Clean M&E Data with AI

A 4-step prompt workflow that takes raw survey data through validation, error detection, correction, and documentation to produce a clean, analysis-ready dataset.

20-30 min · 4 steps · Intermediate · Analysis

What you'll build

A set of data validation rules, error detection queries, a correction protocol, and a cleaning audit log template.

Before you start

  • Your raw dataset (or a sample of it) with variable names
  • The survey instrument or data collection form
  • Your indicator definitions (to know valid ranges and expected values)
1. Define Validation Rules

Start by defining the rules that distinguish valid data from errors. These rules come from your survey instrument and indicator definitions.

Step 1: Define Validation Rules

You are a senior M&E data specialist. I need to clean a dataset from a household survey. Start by generating validation rules.

Based on the survey variables I describe, create a validation rules table with columns:
- Variable name
- Valid range or values
- Rule type (range check, consistency check, skip logic check, duplicate check, completeness check)
- Error severity (critical, warning, minor)
- Suggested action when violated (flag for review, auto-correct, delete record)

Include rules for:
- Numeric ranges (age, household size, income)
- Categorical values (only valid response options)
- Skip logic consistency (if Q5 = "No", Q6 should be blank)
- Cross-variable consistency (e.g., age of household head > age of children)
- Completeness (required fields that should not be blank)
- Duplicates (same respondent ID, same GPS coordinates within X meters)

Here are my survey variables: [Describe your variables, or paste the variable list from your questionnaire]

Critical errors (impossible values like age = 200) should be flagged for deletion or correction. Warnings (unlikely but possible values like household size = 15) should be flagged for review but not auto-corrected.
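In code, a validation rules table maps naturally onto a list of (variable, test, severity) entries. Here is a minimal pandas sketch using synthetic data and hypothetical variable names (`resp_id`, `age`, `hh_size`) with illustrative ranges; adapt the rules to your own instrument:

```python
import pandas as pd

# Synthetic sample rows (never paste real beneficiary data into an AI tool)
df = pd.DataFrame({
    "resp_id": [1, 2, 3, 4],
    "age": [34, 200, 27, -1],
    "hh_size": [4, 15, 3, 5],
})

# Each rule: (variable, validity test, severity) -- ranges are illustrative
rules = [
    ("age", lambda s: s.between(0, 120), "critical"),
    ("hh_size", lambda s: s.between(1, 12), "warning"),
]

for var, valid, severity in rules:
    bad = df.loc[~valid(df[var]), ["resp_id", var]]
    print(f"{var} ({severity}): {len(bad)} violation(s)")
```

Keeping rules as data rather than scattered `if` statements makes it easy to print a rule-by-rule violation summary and to hand the same table to reviewers.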

2. Detect Errors

Apply the validation rules to find specific errors in your data. This produces a list of records and fields that need attention.

Step 2: Detect Errors

Based on the validation rules, generate error detection queries. For each rule, produce:

1. **Query logic**: A plain-language description of what to check (e.g., "Find all records where age < 0 or age > 120")
2. **Implementation**: The query written for a common tool. Provide versions for:
   - Excel formula (for small datasets)
   - Python/pandas (for larger datasets)
3. **Expected error rate**: For each rule type, what error rate is typical? (e.g., "Range violations are typically 1-3% in well-supervised data collection; above 5% suggests systematic problems")

Then suggest an error detection summary template:
- Total records
- Records with at least one error
- Error rate by rule type
- Variables with highest error rates
- Records flagged for deletion vs. correction vs. review

Never paste real beneficiary data into a public AI tool. Use your variable list and validation rules only, or use synthetic sample data. Always test any AI-generated code on a small sample before running it on the full dataset.
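Two of the detection queries above can be sketched in pandas on synthetic data. The variable names (`q5`, `q6`, `resp_id`) are hypothetical stand-ins for your questionnaire:

```python
import pandas as pd

# Synthetic records; q5/q6 are hypothetical skip-logic variables
df = pd.DataFrame({
    "resp_id": [1, 2, 3, 3],
    "q5": ["No", "Yes", "No", "Yes"],
    "q6": [None, "daily", "weekly", "daily"],
})

# Skip-logic check: if q5 == "No", q6 should be blank
skip_errors = df[(df["q5"] == "No") & df["q6"].notna()]

# Duplicate check: same respondent ID appearing more than once
dupes = df[df.duplicated("resp_id", keep=False)]

print(f"Skip-logic violations: {len(skip_errors)}")
print(f"Duplicate-ID records:  {len(dupes)}")
```

For a small dataset, the Excel equivalent of the skip-logic check is a helper column such as `=IF(AND(B2="No",C2<>""),"ERROR","")`, filled down and filtered.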

3. Define Correction Protocol

Decide how to handle each type of error. The correction protocol prevents ad hoc decisions and ensures consistency.

Step 3: Define Correction Protocol

Create a data correction protocol. For each error type found, define:

1. **Decision tree**: A flowchart or decision table for how to handle the error:
   - Can it be corrected from other data in the record? (e.g., GPS coordinates from village name)
   - Can it be corrected from the original paper form or device log?
   - Should it be set to missing?
   - Should the entire record be deleted?
2. **Correction rules**: Specific rules for common corrections:
   - How to handle outliers (Winsorize, cap, delete, or keep with flag?)
   - How to handle missing data (listwise deletion, imputation, or keep with flag?)
   - How to handle duplicate records (keep first, keep most complete, merge, or delete both?)
3. **Never auto-correct list**: Variables or error types where correction must always involve human review (e.g., income data, sensitive questions, key outcome indicators).
4. **Correction documentation**: For every correction made, what must be logged? (record ID, variable, original value, corrected value, reason, who made the correction)

Log every correction. A dataset without a cleaning audit trail cannot be verified, and unverifiable data weakens evaluation credibility.
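One way to make the audit trail unavoidable is to route every edit through a single helper that both applies the change and appends a log row. A minimal sketch, with hypothetical column names and a synthetic record:

```python
import pandas as pd
from datetime import date

audit_log = []  # one dict per correction; saved alongside the clean file

def correct(df, record_id, variable, new_value, reason, corrected_by):
    """Apply a single correction and log it -- never edit without logging."""
    mask = df["resp_id"] == record_id
    original = df.loc[mask, variable].iloc[0]
    df.loc[mask, variable] = new_value
    audit_log.append({
        "record_id": record_id,
        "variable": variable,
        "original_value": original,
        "corrected_value": new_value,
        "reason": reason,
        "date_corrected": date.today().isoformat(),
        "corrected_by": corrected_by,
    })

df = pd.DataFrame({"resp_id": [1, 2], "age": [200.0, 34.0]})
correct(df, 1, "age", float("nan"),
        "impossible value; original form unavailable, set to missing", "analyst_1")
```

Because corrections only happen inside `correct()`, the log and the dataset cannot drift apart, and `pd.DataFrame(audit_log)` produces the audit table directly.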

4. Document the Cleaning

Create the audit log that documents what was cleaned, why, and by whom. This is your evidence that the data is trustworthy.

Step 4: Document the Cleaning

Create a data cleaning documentation package:

1. **Cleaning audit log template**: A table with columns:
   - Record ID
   - Variable
   - Original value
   - Corrected value (or "set to missing" or "record deleted")
   - Reason for correction
   - Rule applied (reference to validation rule number)
   - Date corrected
   - Corrected by
2. **Cleaning summary report template** (to include in evaluation reports):
   - Total records received
   - Total records after cleaning
   - Records deleted and why
   - Variables with highest correction rates
   - Overall data quality assessment (good, acceptable, or concerning, with criteria)
   - Impact on analysis (does the cleaning materially change any indicator values?)
3. **Data quality flag variables**: Suggest 2-3 new variables to add to the clean dataset that flag data quality at the record level (e.g., "quality_flag: 0 = no issues, 1 = minor corrections, 2 = major corrections, 3 = flagged for cautious interpretation").
4. **Version control**: Naming convention for raw vs. cleaned datasets (e.g., "dataset_raw_20260401.csv" vs. "dataset_clean_v1_20260410.csv").

Always keep the raw dataset untouched. Clean a copy. If anyone asks about a data point, you need to be able to show the original value.
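The record-level quality flag from step 4 can be derived mechanically from the audit log, for example from a per-record correction count. A sketch with synthetic counts and illustrative thresholds (0 corrections = no issues, 1-2 = minor, 3+ = major):

```python
import pandas as pd

# Hypothetical per-record correction counts tallied from the audit log
df = pd.DataFrame({"resp_id": [1, 2, 3], "n_corrections": [0, 1, 4]})

# 0 = no issues, 1 = minor corrections (1-2), 2 = major corrections (3+)
df["quality_flag"] = pd.cut(
    df["n_corrections"], bins=[-1, 0, 2, float("inf")], labels=[0, 1, 2]
).astype(int)
```

Deriving the flag from the log, rather than assigning it by hand, keeps the flag consistent with the documented corrections, and the clean file can then be written out under the versioned name while the raw file is left untouched.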

