AI Playbook

Clean M&E Data with AI

4 steps · Works with any AI assistant · No signup required

You'll end up with

A set of data validation rules, error detection queries, a correction protocol, and a cleaning audit log template.

Define Validation Rules

Clean data starts with explicit rules about what "valid" means. This first step turns your dataset description and variable list into a comprehensive, prioritized set of validation rules that the rest of the cleaning process will enforce. Paste your dataset description and survey variables into the placeholder at the end of the prompt.

The AI will generate a comprehensive set of validation rules, one section per variable, followed by a priority ordering of the 10 most critical checks.
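To make the idea of a validation rule concrete, here is a minimal Python sketch of how a few rules of this kind could be encoded and applied to a single record. The `Rule` structure, the variable names (`age`, `sex`, `phone`), and the bounds are illustrative assumptions, not output from any particular assistant or dataset.

```python
from dataclasses import dataclass
from typing import Any, Callable


# Hypothetical rule structure; the field names are assumptions for illustration.
@dataclass
class Rule:
    variable: str
    description: str
    severity: str                 # "blocking", "warning", or "informational"
    check: Callable[[Any], bool]  # returns True when the value is valid
    mandatory: bool = True


def validate(record: dict, rules: list) -> list:
    """Return (variable, severity, message) for every rule the record fails."""
    failures = []
    for rule in rules:
        value = record.get(rule.variable)
        if value is None or value == "":
            if rule.mandatory:
                failures.append((rule.variable, rule.severity, "missing mandatory value"))
            continue  # a blank optional value is valid; skip the value check
        if not rule.check(value):
            failures.append((rule.variable, rule.severity, rule.description))
    return failures


# Example rules; bounds and allowed codes are assumed, not taken from a real survey.
RULES = [
    Rule("age", "age must be between 0 and 120", "blocking",
         lambda v: isinstance(v, int) and 0 <= v <= 120),
    Rule("sex", "sex must be one of the allowed codes", "blocking",
         lambda v: v in {"male", "female"}),
    Rule("phone", "phone must be 10 digits", "warning",
         lambda v: isinstance(v, str) and v.isdigit() and len(v) == 10,
         mandatory=False),
]

# 999 is a typical missing-value code that slips through as a real age.
record = {"age": 999, "sex": "female", "phone": ""}
failures = validate(record, RULES)
for failure in failures:
    print(failure)
```

Keeping severity on the rule rather than in the checking code means the same loop can serve blocking, warning, and informational rules; what to do with each failure is decided afterwards.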

Prompt for this step

You are a senior M&E data specialist. Based on the dataset description and survey variables I provide below, define a comprehensive set of validation rules that will govern the data cleaning process for this dataset.

Produce your response as labelled sections, one per variable (use clearly headed sections rather than tables). Cover every variable in the dataset, not a sample. For each variable, include the following components:

1. **Variable name and type** — The variable identifier as it appears in the dataset, its data type (numeric continuous, numeric discrete, categorical ordinal, categorical nominal, binary, string, date, time, geo-coordinate), and units if applicable.

2. **Valid range or allowed values** — For numeric variables, the plausible min and max bounds (with reasoning: e.g., "age 0-120 based on human lifespan limits"). For categorical variables, the full list of allowed codes. For dates, the valid window. For strings, format constraints (email pattern, phone pattern, ID pattern).

3. **Mandatory vs. optional** — Whether the variable must be non-missing for every record, or may be legitimately blank; if optional, under what conditions blank is valid.

4. **Dependency rules** — Skip-pattern logic and conditional requirements, for example: "if Q12 = Yes, Q13 must be non-missing; if Q12 = No, Q13 must be missing." Name each dependency explicitly.

5. **Cross-variable consistency checks** — Logical relationships with other variables, for example: "age must be consistent with date of birth if both are captured", "household size must equal the sum of adults and children", "pregnancy status must be No for male respondents."

6. **Known error patterns from similar surveys** — Common enumerator or data-entry mistakes to watch for (digit transposition in phone numbers, 999/888 as missing codes, duplicate entries, GPS readings outside the survey area).

7. **Severity flag** — Label each rule as blocking (record cannot be used until fixed), warning (flag for review but not disqualifying), or informational (log only).

End with a closing section titled **Rule priority ordering** that lists the 10 most critical validation checks for this dataset's integrity, ordered from most to least critical, with a one-sentence justification per check. Format the whole response in markdown.

My dataset description and survey variables:
[PASTE YOUR DATASET DESCRIPTION AND VARIABLE LIST HERE]
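
The following is not part of the prompt. Once the assistant returns its rules, the dependency rules and cross-variable checks requested in components 4 and 5 can be written as small functions. This is a rough Python sketch under stated assumptions: the field names (`q12`, `q13`, `household_size`, `adults`, `children`, `date_of_birth`, `interview_date`, `age`) are hypothetical, and the one-year tolerance on age is an arbitrary choice to allow for rounded self-reports.

```python
from datetime import date


def check_skip_pattern(record: dict) -> list:
    """If q12 is "Yes", q13 must be answered; if q12 is "No", q13 must be blank."""
    issues = []
    q12, q13 = record.get("q12"), record.get("q13")
    q13_blank = q13 is None or q13 == ""
    if q12 == "Yes" and q13_blank:
        issues.append("q13 missing although q12 = Yes")
    if q12 == "No" and not q13_blank:
        issues.append("q13 answered although q12 = No")
    return issues


def check_consistency(record: dict) -> list:
    """Cross-variable checks: household composition and age versus date of birth."""
    issues = []
    if record["household_size"] != record["adults"] + record["children"]:
        issues.append("household_size does not equal adults + children")

    # Age in completed years on the interview date.
    dob, interviewed = record["date_of_birth"], record["interview_date"]
    had_birthday = (interviewed.month, interviewed.day) >= (dob.month, dob.day)
    computed_age = interviewed.year - dob.year - (0 if had_birthday else 1)
    if abs(computed_age - record["age"]) > 1:  # assumed tolerance of one year
        issues.append("age is inconsistent with date_of_birth")
    return issues


# Made-up record for illustration only.
record = {
    "q12": "No", "q13": "3 times",
    "household_size": 6, "adults": 2, "children": 3,
    "date_of_birth": date(1990, 5, 20),
    "interview_date": date(2024, 3, 1),
    "age": 33,
}
issues = check_skip_pattern(record) + check_consistency(record)
for issue in issues:
    print(issue)
```

Each function returns a list of messages rather than raising an error, so a single pass over a record can collect every problem for the audit log.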

Step 1 of 4