AI for M&E

Cleaning Messy Datasets for M&E Analysis

75% of M&E datasets have quality issues affecting analysis accuracy. Learn how to use OpenRefine and AI tools to clean survey data in 2-3 hours instead of 15-20.

Ben Playfair · 5 min read

Tags: ai, data quality, openrefine, data cleaning, analysis

The Reality of M&E Data

You've just completed baseline data collection across five countries for a livelihoods program. Eight hundred household surveys are in hand, collected via phone using SurveyCTO. When you open the data file, you see a familiar nightmare:

  • "Farmer/farmer/FARMER" entered three different ways
  • Numeric fields containing "N/A" instead of numbers
  • "Kampala/kampala/KAMPALA/Kampla" as four separate location values
  • Missing values coded as "N/A", "---", "missing", and blank cells
  • Duplicate records from respondents accidentally entered twice

Research shows that 75% of M&E datasets have quality issues affecting analysis accuracy. These inconsistencies create serious problems: analysis tools count Kampala four ways instead of one, statistics are calculated on the wrong data types, and small data entry errors compound into false conclusions that affect program decisions and donor reporting.

The Solution: Structured Data Cleaning

Instead of 15-20 hours manually reviewing spreadsheets, you can use OpenRefine, a free, open-source tool, combined with AI validation to clean datasets in 2-3 hours. That's a 75-85% time reduction.

Common M&E Data Quality Issues

| Issue | Example | Impact |
|-------|---------|--------|
| Inconsistent text formatting | "Farmer/farmer/FARMER" | Analysis counts 3 values instead of 1 |
| Typos and misspellings | "Kampal" instead of "Kampala" | Loses geographic data |
| Duplicate records | Same respondent entered twice | Skewed statistics |
| Inconsistent missing values | "N/A", "---", "missing", blank | Can't analyze patterns |
| Mixed data types | "123" and "One Hundred" in same field | Prevents calculation |
| Extra spaces | " Farmer " vs "Farmer" | Doesn't match filter criteria |

Step-by-Step Cleaning Workflow

Step 1: Initial Assessment

Before cleaning, understand the scope of the problem:

  1. Open your dataset in OpenRefine (File → Create Project → Select CSV/Excel)
  2. Use text facets on categorical columns to see all unique values. This instantly reveals inconsistencies
  3. Document what you find: how many unique values per column, what patterns of error exist
  4. Create a backup of the original file before any changes

Step 2: Standardize Text Formatting

The fastest wins come from basic text operations:

  • Trim whitespace: Edit cells → Common transforms → Trim leading and trailing whitespace
  • Normalize case: Edit cells → Common transforms → To titlecase (for names/locations)
  • Remove special characters: Use GREL expressions for custom transformations
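The same two common transforms, sketched in pandas for script-based workflows (sample values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"occupation": ["  Farmer ", "farmer", "FARMER"]})

# Mirror OpenRefine's "Trim whitespace" and "To titlecase" common transforms
df["occupation"] = df["occupation"].str.strip().str.title()
print(df["occupation"].unique())  # all three variants collapse to "Farmer"
```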

Step 3: Cluster Similar Values

OpenRefine's clustering feature is where the magic happens. It automatically identifies values that are likely the same thing spelled differently:

  1. Click the column dropdown → Edit cells → Cluster and edit
  2. Review suggested clusters: "Kampala", "kampala", "Kampla" → merge to "Kampala"
  3. Try different clustering methods (key collision, nearest neighbor) for different types of errors
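To see what key collision clustering is doing under the hood, here is a rough sketch in the spirit of OpenRefine's fingerprint method (lowercase, strip punctuation, sort tokens). Note its limitation: it merges case and spacing variants, but a genuine typo like "Kampla" produces a different key and needs a nearest-neighbor (edit-distance) method instead:

```python
import re
from collections import defaultdict

def fingerprint(value: str) -> str:
    """Key-collision fingerprint, loosely modeled on OpenRefine's method."""
    tokens = re.sub(r"[^\w\s]", "", value.lower()).split()
    return " ".join(sorted(set(tokens)))

values = ["Kampala", "kampala", "KAMPALA", "Kampla"]
clusters = defaultdict(list)
for v in values:
    clusters[fingerprint(v)].append(v)

# Three case variants share one key; the typo "Kampla" stays in its own cluster
print(dict(clusters))
```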

Step 4: Handle Missing Values

Standardize how missing data is represented:

  1. Decide on a convention: blank cells, "NA", or a specific code your analysis tool expects
  2. Find all variations: facet the column and look for "N/A", "---", "missing", "unknown", "999"
  3. Transform consistently: Edit cells → Transform → use GREL: if(or(value == "N/A", value == "---", value == "missing"), null, value) (GREL's or is a function, not an infix operator)
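The pandas equivalent of that transform, on illustrative data — map every missing-value code to a single convention, then coerce the column to numeric so stray text also becomes NA:

```python
import pandas as pd

df = pd.DataFrame({"income": ["1200", "N/A", "---", "missing", "800"]})

# Standardize all missing-value codes to pandas NA, then enforce numeric type
df["income"] = df["income"].replace(["N/A", "---", "missing"], pd.NA)
df["income"] = pd.to_numeric(df["income"], errors="coerce")
```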

Step 5: Remove Duplicates

  1. Flag duplicates: Sort by a unique identifier, then use the "Blank down" operation
  2. Review before deleting: not all apparent duplicates are actual duplicates - different household members may share an address
  3. Document removals: keep a count of records removed and why
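A script-based version of the same flag-then-review approach, using a hypothetical respondent ID column — duplicates are marked, not deleted, so they can be inspected first:

```python
import pandas as pd

df = pd.DataFrame({
    "respondent_id": ["HH001", "HH002", "HH002", "HH003"],
    "district": ["Kampala", "Gulu", "Gulu", "Mbale"],
})

# Flag repeat IDs (keeping the first occurrence) for review before removal
df["is_duplicate"] = df.duplicated(subset="respondent_id", keep="first")
print(df[df["is_duplicate"]])
```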

Using AI for Validation

After mechanical cleaning, use AI to catch issues human review misses:

Prompt for ChatGPT/Claude:

I have a cleaned dataset with the following columns: [LIST COLUMNS].
Here are 20 sample rows. Please identify:
1. Any remaining inconsistencies in categorical fields
2. Values that seem implausible (out of expected range)
3. Patterns that suggest data entry errors
4. Any rows where the combination of values seems contradictory

Expected value ranges:
- Age: 18-65
- Household size: 1-15
- Income (monthly, USD): 0-2000
- [ADD YOUR RANGES]

This catches logic errors that pure text cleaning misses - like a 5-year-old listed as head of household, or income values that are clearly in the wrong currency.
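Simple range checks like the ones in the prompt can also be scripted, which is useful for repeated collection rounds. A minimal sketch, with illustrative data and ranges:

```python
import pandas as pd

df = pd.DataFrame({"age": [34, 5, 42], "household_size": [4, 3, 22]})

# Hypothetical plausibility ranges, as in the prompt above
ranges = {"age": (18, 65), "household_size": (1, 15)}

# Collect the row indices that fall outside each expected range
flags = {
    col: df[(df[col] < lo) | (df[col] > hi)].index.tolist()
    for col, (lo, hi) in ranges.items()
}
print(flags)  # e.g. an age of 5 and a household of 22 get flagged
```

Scripted checks handle the mechanical ranges; the AI review remains useful for the contradictions no single-column rule catches.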

When to Use What

  • OpenRefine: Best for text standardization, clustering, deduplication across 100+ records
  • AI tools: Best for pattern detection, validation logic, anomaly identification
  • Manual review: Still necessary for ambiguous cases, cultural context, and final quality check
  • Python/R: Better choice when you need reproducible cleaning scripts for repeated data collection rounds

Key Takeaways

  1. Always keep the original file untouched - work on a copy
  2. Document every transformation - OpenRefine creates an audit trail automatically
  3. Clean before analyzing - cleaning after analysis means redoing work
  4. Build cleaning templates for repeated surveys - OpenRefine operations can be exported and reapplied
  5. Budget time for cleaning in your M&E plan - it's not optional overhead, it's essential quality assurance