The Reality of M&E Data
You've just completed baseline data collection across five countries for a livelihoods program. Eight hundred household surveys are in hand, collected via phone using SurveyCTO. When you open the data file, you see a familiar nightmare:
- "Farmer/farmer/FARMER" entered three different ways
- Numeric fields containing "N/A" instead of numbers
- "Kampala/kampala/KAMPALA/Kampla" as four separate location values
- Missing values coded as "N/A", "---", "missing", and blank cells
- Duplicate records from respondents accidentally entered twice
Data quality problems like these are the norm in M&E datasets, not the exception, and they directly undermine analysis accuracy. These inconsistencies create serious problems: analysis tools count Kampala four ways instead of one, statistics are calculated on the wrong data types, and small data entry errors compound into false conclusions that affect program decisions and donor reporting.
The Solution: Structured Data Cleaning
Instead of spending 15-20 hours manually reviewing spreadsheets, you can use OpenRefine, a free, open-source tool, combined with AI validation to clean datasets in 2-3 hours. That's roughly an 80-90% time reduction.
Common M&E Data Quality Issues
| Issue | Example | Impact |
|-------|---------|--------|
| Inconsistent text formatting | "Farmer/farmer/FARMER" | Analysis counts 3 values instead of 1 |
| Typos and misspellings | "Kampla" instead of "Kampala" | Loses geographic data |
| Duplicate records | Same respondent entered twice | Skewed statistics |
| Inconsistent missing values | "N/A", "---", "missing", blank | Can't analyze patterns |
| Mixed data types | "123" and "One Hundred" in same field | Prevents calculation |
| Extra spaces | " Farmer " vs "Farmer" | Doesn't match filter criteria |
Step-by-Step Cleaning Workflow
Step 1: Initial Assessment
Before cleaning, understand the scope of the problem:
- Open your dataset in OpenRefine (File → Create Project → Select CSV/Excel)
- Use text facets on categorical columns to see all unique values. This instantly reveals inconsistencies
- Document what you find: how many unique values per column, what patterns of error exist
- Create a backup of the original file before any changes
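The assessment step can also be scripted. A minimal Python sketch of profiling one column's unique values (the `profile_column` helper and the sample rows are hypothetical, not part of OpenRefine):

```python
from collections import Counter

def profile_column(rows, column):
    """Count every distinct raw value in a column, including blanks."""
    counts = Counter(row.get(column, "") for row in rows)
    return counts.most_common()

# Tiny illustrative sample: four spellings that should be one value
rows = [
    {"occupation": "Farmer"},
    {"occupation": "farmer"},
    {"occupation": "FARMER"},
    {"occupation": " Farmer "},
]
for value, n in profile_column(rows, "occupation"):
    print(repr(value), n)
```

Printing with `repr` makes leading/trailing whitespace visible, which plain spreadsheet views hide.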
Step 2: Standardize Text Formatting
The fastest wins come from basic text operations:
- Trim whitespace: Edit cells → Common transforms → Trim leading and trailing whitespace
- Normalize case: Edit cells → Common transforms → To titlecase (for names/locations)
- Remove special characters: Use GREL expressions for custom transformations
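In Python, the trim and titlecase transforms above amount to a one-line helper (a sketch; `standardize_text` is a hypothetical name, not an OpenRefine API):

```python
def standardize_text(value):
    """Trim whitespace and normalize case, mirroring OpenRefine's
    'Trim leading and trailing whitespace' and 'To titlecase' transforms."""
    return value.strip().title()

print(standardize_text("  FARMER "))   # → Farmer
print(standardize_text(" kampala"))    # → Kampala
```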
Step 3: Cluster Similar Values
OpenRefine's clustering feature is where the magic happens. It automatically identifies values that are likely the same thing spelled differently:
- Click the column dropdown → Edit cells → Cluster and edit
- Review suggested clusters: "Kampala", "kampala", "Kampla" → merge to "Kampala"
- Try different clustering methods (key collision, nearest neighbor) for different types of errors
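A rough Python approximation of the key-collision ("fingerprint") method shows why it works, and also its limits (the `fingerprint` and `cluster` helpers are illustrative, not OpenRefine's exact algorithm):

```python
import re
import unicodedata
from collections import defaultdict

def fingerprint(value):
    """Approximate a key-collision fingerprint: lowercase,
    strip punctuation, sort the unique tokens."""
    v = unicodedata.normalize("NFKD", value.strip().lower())
    v = re.sub(r"[^\w\s]", "", v)
    return " ".join(sorted(set(v.split())))

def cluster(values):
    """Group values whose fingerprints collide."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return [g for g in groups.values() if len(g) > 1]

print(cluster(["Kampala", "kampala", "KAMPALA", "Kampla"]))
```

Note that the "Kampla" typo produces a different fingerprint, so key collision will not catch it; that is exactly the kind of error the nearest-neighbor method exists for.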
Step 4: Handle Missing Values
Standardize how missing data is represented:
- Decide on a convention: blank cells, "NA", or a specific code your analysis tool expects
- Find all variations: facet the column and look for "N/A", "---", "missing", "unknown", "999"
- Transform consistently: Edit cells → Transform → use GREL:
if(or(value == "N/A", value == "---", value == "missing"), null, value)
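The same rule in Python, if you later script the cleaning (a hypothetical helper; adjust `MISSING_CODES` to whatever convention your analysis tool expects):

```python
# Every marker we treat as missing, compared case-insensitively
MISSING_CODES = {"n/a", "na", "---", "missing", "unknown", "999", ""}

def normalize_missing(value):
    """Map every agreed-upon missing-value marker to None."""
    if value is None or value.strip().lower() in MISSING_CODES:
        return None
    return value
```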
Step 5: Remove Duplicates
- Flag duplicates: Sort by a unique identifier, then use the "Blank down" operation
- Review before deleting: not all apparent duplicates are actual duplicates - different household members may share an address
- Document removals: keep a count of records removed and why
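A scripted equivalent of flag-then-review, assuming each record has a unique identifier column (the `flag_duplicates` helper is hypothetical):

```python
def flag_duplicates(rows, key):
    """Return (unique_rows, duplicate_rows), keeping the first
    occurrence of each identifier and setting the rest aside
    for manual review rather than silent deletion."""
    seen = set()
    unique, dupes = [], []
    for row in rows:
        if row[key] in seen:
            dupes.append(row)
        else:
            seen.add(row[key])
            unique.append(row)
    return unique, dupes
```

Returning the duplicates instead of dropping them supports the review-before-deleting and document-removals steps above.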
Using AI for Validation
After mechanical cleaning, use AI to catch issues human review misses:
Prompt for ChatGPT/Claude:
I have a cleaned dataset with the following columns: [LIST COLUMNS].
Here are 20 sample rows. Please identify:
1. Any remaining inconsistencies in categorical fields
2. Values that seem implausible (out of expected range)
3. Patterns that suggest data entry errors
4. Any rows where the combination of values seems contradictory
Expected value ranges:
- Age: 18-65
- Household size: 1-15
- Income (monthly, USD): 0-2000
- [ADD YOUR RANGES]
This catches logic errors that pure text cleaning misses - like a 5-year-old listed as head of household, or income values that are clearly in the wrong currency.
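The range checks in the prompt can also be run mechanically before (or instead of) sending samples to an AI. A sketch using the same example ranges (the column names and `flag_out_of_range` helper are placeholders for your own schema):

```python
# Hypothetical expected ranges, matching the prompt above
RANGES = {"age": (18, 65), "household_size": (1, 15), "income_usd": (0, 2000)}

def flag_out_of_range(rows, ranges):
    """Return (row_index, column, value) for every value outside its
    expected range; non-numeric values are flagged as well."""
    flags = []
    for i, row in enumerate(rows):
        for col, (lo, hi) in ranges.items():
            raw = row.get(col)
            try:
                x = float(raw)
            except (TypeError, ValueError):
                flags.append((i, col, raw))
                continue
            if not lo <= x <= hi:
                flags.append((i, col, raw))
    return flags
```

Mechanical checks like this are deterministic and cheap; the AI pass then focuses on the contradictions a simple range rule cannot express.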
When to Use What
- OpenRefine: Best for text standardization, clustering, deduplication across 100+ records
- AI tools: Best for pattern detection, validation logic, anomaly identification
- Manual review: Still necessary for ambiguous cases, cultural context, and final quality check
- Python/R: Better choice when you need reproducible cleaning scripts for repeated data collection rounds
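For repeated collection rounds, a reproducible script can bundle the trim, case, and missing-value steps into one pass. A minimal stdlib-only sketch (column names and missing-value codes are placeholders):

```python
import csv
import io

def clean_csv(text, text_columns,
              missing_codes=frozenset({"n/a", "---", "missing"})):
    """One reproducible cleaning pass: trim and titlecase the given
    text columns, blanking out known missing-value codes."""
    reader = csv.DictReader(io.StringIO(text))
    cleaned = []
    for row in reader:
        for col in text_columns:
            v = row[col].strip()
            row[col] = "" if v.lower() in missing_codes else v.title()
        cleaned.append(row)
    return cleaned
```

Because the script is code, every round of data gets identical treatment, and the cleaning rules live under version control alongside your M&E documentation.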
Key Takeaways
- Always keep the original file untouched - work on a copy
- Document every transformation - OpenRefine creates an audit trail automatically
- Clean before analyzing - cleaning after analysis means redoing work
- Build cleaning templates for repeated surveys - OpenRefine operations can be exported and reapplied
- Budget time for cleaning in your M&E plan - it's not optional overhead, it's essential quality assurance