How to Govern AI Use in Your M&E Work

Most AI failures in M&E are not run-time problems. They are setup problems. Before AI touches the work, six decisions shape whether the outputs will hold up: which tasks AI handles, where your data goes, where humans stay in the loop, how setup hands off to execution and validation, what you record, and when to stop.


The 6 Governance Decisions to Make Before AI Touches the Work

These are setup decisions, not policy frameworks. Each one maps to a failure mode you can prevent by thinking about it before you start. Execution practices (see the Reliable AI guide) and output validation (see the QA guide) are separate. Get these six right and most of what would otherwise go wrong never happens.

1. Task Fit: Pick What AI Is Actually Good For

AI is not equally good at every M&E task. It is strong at pattern extraction, first-pass synthesis, format conversion, and surfacing candidate themes from large volumes of text. It is weak at evaluative judgment, novel analytical reasoning about unfamiliar contexts, and anything that requires ground-truth verification only a human can do. Before any AI use, write down which specific tasks in the workflow AI will handle and which tasks will stay human-only. Resist scope creep during the work. An AI that was hired to extract quotes from transcripts should not end up drafting the findings chapter.
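If it helps to make the list concrete, here is a minimal sketch of a task registry in Python, for teams that want the boundary machine-checkable rather than implied. The task names and category labels are hypothetical examples, not a required taxonomy.

# A task not in the registry defaults to human-only, so scope creep
# has to be deliberate rather than accidental.
TASK_REGISTRY = {
    "extract_quotes_from_transcripts": "ai_with_spot_check",
    "categorize_evidence_by_theme": "ai_with_full_review",
    "draft_findings_chapter": "human_only",
    "effectiveness_judgment": "human_only",
}

def is_ai_eligible(task: str) -> bool:
    """Fail closed: a task not in the registry stays human-only."""
    return TASK_REGISTRY.get(task, "human_only") != "human_only"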

2. Data Boundaries: Decide What Goes to Which Kind of AI

Not all data can go to all tools. Identifiable participant data, sensitive program context, and confidential findings need a different home than public-facing summaries or anonymized excerpts. Before any AI work, classify your data into three buckets: safe for cloud AI, local-model only, and no-AI-at-all. The classification is practical, not policy: ask what would actually go wrong if this data ended up on a vendor server you cannot audit. Once the classification exists, match data to tool. If you do not have a local-model capability, that shapes which tasks are AI-eligible.
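For teams that keep the classification in a script rather than a document, a minimal sketch might look like the following. The bucket names and example entries are illustrative, not a standard.

# Unclassified data defaults to the most restrictive bucket.
DATA_BUCKETS = {
    "public_summaries": "cloud_ok",
    "anonymized_excerpts": "cloud_ok",
    "identifiable_transcripts": "local_only",
    "confidential_findings": "no_ai",
}

def allowed_tool(data_class: str) -> str:
    """Return the most permissive tool tier for a data class."""
    return DATA_BUCKETS.get(data_class, "no_ai")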

3. Review Placement: Decide Where Humans Stay in the Loop

Human review in AI work is a design decision, not an afterthought. Decide up front where humans sit in the workflow: reviewing every output, spot-checking a sample, owning certain decisions entirely, or stepping in only when flagged. The right placement depends on the task and the stakes. High-stakes evaluative output needs full human review. Routine extraction can be spot-checked. A team lead making recommendations to a funder needs to own that output personally regardless of how much AI helped draft it. Name the review placement for each task before the work starts, not during it.
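One way to name the placement per task before the work starts is a simple mapping like the sketch below. The four modes come from the text above; the task names are hypothetical.

REVIEW_PLACEMENT = {
    "routine_extraction": "spot_check",        # sample a fraction of outputs
    "thematic_coding": "full_review",          # every output gets human eyes
    "funder_recommendation": "decision_owner", # human owns the call entirely
    "format_conversion": "triggered",          # review only when flagged
}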

4. Execution and Validation: Know Where They Live

This page is about setup. Execution practices (how to structure prompts, when to use local models, how to put quality control between steps, how to run for stability) live on the Reliable AI guide. Post-output validation (the 6-layer stack for catching errors) lives on the AI QA guide. Good setup hands off to good execution, which hands off to good validation. If any of the three is missing, the other two work harder than they should.

5. Documentation for Yourself

Three months after an AI-assisted M&E output ships, someone will ask how it was produced. The person asking will usually be you. For every AI-assisted task, record what model you used, what prompt, what input, and what you did with the output. Store it alongside the output, not in a separate system. If the logging is hard to maintain, the tooling is wrong. Keep it a one-page log per AI-assisted output, not a compliance artifact.
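If you want the log to have a fixed shape, here is a minimal sketch, assuming the six fields named above and in the setup moves below; the field names and the file suffix are illustrative choices, not a required format.

from dataclasses import dataclass, asdict
import json

@dataclass
class AIRunLog:
    model: str        # model name and version or date
    prompt: str       # exact prompt text, or a path to it
    input_ref: str    # what went in: file path, batch id
    output_ref: str   # what came out
    validated: str    # what was checked, and by whom
    corrected: str    # what was changed before acceptance

def save_log(log: AIRunLog, output_path: str) -> None:
    # One log per AI-assisted output, stored next to the output itself.
    with open(output_path + ".ai-log.json", "w") as f:
        json.dump(asdict(log), f, indent=2)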

6. Stop Conditions and Recovery: Know When to Pull the Plug

AI-assisted approaches sometimes stop working. The model changes. The data pattern changes. The outputs start drifting. The failure rate crosses a threshold that makes validation more expensive than just doing the work by hand. Before you start, name the specific conditions that would cause you to stop using AI for this task. Also name the recovery path when an AI-assisted output turns out to be wrong after it was accepted: how you catch it, how you correct it, what you tell downstream users. Most teams never think about stop conditions until they are already past them.

Setup Decisions in Practice

Three scenarios showing the difference between M&E work where AI use was set up thoughtfully and work where it was not.

Using AI for a Task It Is Not Good For

Without setup

Team asks AI to "assess program effectiveness based on the attached evaluation documents." AI produces a fluent, confident-sounding effectiveness judgment. Nobody validates whether the evaluative logic fits the actual program context. The judgment drives a recommendation in the report. Two months later a reviewer points out that the conclusion does not follow from the evidence; it follows from the AI's internalized patterns of what "effective programs" look like in general.

With setup

Team asks AI to extract and categorize evidence from the documents, with citation tracking. A senior evaluator reviews the evidence categorization, identifies gaps, and owns the effectiveness judgment themselves. AI accelerated evidence retrieval. The evaluative call stayed with the human who can defend it.

Sensitive Data Going to the Wrong Kind of AI

Without setup

Under deadline pressure, someone uploads 120 KII transcripts with participant names intact to a commercial cloud AI tool for thematic coding. Data sits on the vendor's servers indefinitely. The organization cannot audit what happened to it. When a participant later objects that their name appeared in a document without their consent, the team cannot even confirm whether the data was deleted.

With setup

Data classification happened before any AI work started. Identifiable transcripts go to a local model running on the team's laptop. Anonymized extracts can go to cloud AI for downstream synthesis. When a question comes later about data handling, there is a clear answer.

No Documentation, Cannot Reproduce

Without setup

Team uses AI to help code a qualitative dataset. Six months later, a reviewer asks how the themes were identified. The team member who did the work has moved on. Nobody can reconstruct the prompts, the model version, or which runs produced which themes. The findings technically exist; the process behind them is gone.

With setup

One-page log per AI-assisted task: model, prompt, input, output, what was validated, what was corrected. The log lives alongside the output, not in a separate system. When the reviewer asks, the answer takes five minutes, not five days.

5 Setup Moves That Pay Off Immediately

Write down AI-eligible tasks before AI touches anything

Before any AI use, list the specific tasks in this workflow that AI will handle. Be explicit about boundaries. "AI extracts and categorizes evidence from the documents; the team owns effectiveness judgments." The list exists so you can catch yourself when scope starts to drift during the work. Without it, scope always drifts.

Classify your data before any of it moves toward AI

Three buckets: safe for cloud, local-model only, no-AI-at-all. Do the classification once, at the start of the project, not task-by-task under deadline pressure. The classification becomes a reference you can point at when someone asks why a specific transcript did or did not get AI-assisted processing.

Pilot a small task before scaling the approach

Run one small version of the work end-to-end before committing to doing the whole thing AI-assisted. Code 10 transcripts, not 200. Draft one section, not the full report. The pilot surfaces the failure modes that matter before you have built your timeline around an approach that does not actually work.

Keep a one-page log per AI-assisted output

Model, prompt, input, output, what you validated, what you corrected. Store it alongside the output. Not for compliance. For yourself, three months later, when someone asks and you need to reconstruct the work in less than a day.

Build a stop-and-think trigger into the workflow

Pick a specific moment where the team pauses and asks: is the AI actually helping here, or are we working harder to validate it than we would working without it? The trigger can be time-based (weekly check-in), output-based (after each batch), or failure-based (when errors cross a threshold). Without a trigger, teams stay on AI-assisted approaches past the point they should have stopped.
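For the failure-based version, here is a minimal sketch, assuming the team counts errors per reviewed batch. The 10% threshold is an illustrative default, not a recommendation.

def should_pause(errors_found: int, items_reviewed: int,
                 threshold: float = 0.10) -> bool:
    """True when the observed error rate crosses the stop-and-think line."""
    if items_reviewed == 0:
        return False
    return errors_found / items_reviewed >= threshold

# e.g. should_pause(4, 30) is True at the default threshold: 4/30 > 0.10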

AI Task-Fit Prompt

Describe an M&E task and this prompt returns a structured recommendation on whether AI should touch it, which parts, with what data, and where the human needs to stay in the loop. Use it before AI enters the work, not after.

M&E AI Task-Fit Prompt

I need to decide whether AI should assist with a specific M&E task, and if so, how. Produce a structured recommendation.

Task description:
- What is the task: [e.g., "coding 80 focus group transcripts from a mid-term evaluation" / "drafting the background section of a proposal" / "extracting indicators from a reporting template"]
- What is the output: [coded dataset / draft text / structured data / other]
- Stakes: [Low: internal use / Medium: team deliverable / High: funder-facing / Critical: feeds funding or program decision]
- Evaluative judgments involved: [none / minor / substantive]

Inputs available:
- Source material: [transcripts / documents / datasets / external sources]
- Data sensitivity: [public / anonymized / identifiable / confidential]
- Local AI model available: [yes / no]

Team context:
- Who will do the work: [one person / small team / distributed]
- Who will review the output: [same person / separate reviewer / team lead]
- Deadline pressure: [low / moderate / high]

For this task, produce:

1. AI suitability (one paragraph)
   - Is AI a good fit for this task overall, for part of it, or not at all?
   - What specific sub-tasks should AI handle, and what should stay human-only?
   - What is the failure mode most likely for this task type if AI is misused?

2. Data boundary decision (one paragraph)
   - Which data is safe for cloud AI, which needs local-model only, which should not touch AI?
   - What is the concrete data handling rule for this task?

3. Review placement (one paragraph)
   - Where does the human sit: full review, spot-check, decision owner, or triggered review?
   - What specifically cannot be delegated to AI?

4. Stop conditions (bulleted list of 2-3)
   - Specific conditions that would trigger stopping the AI-assisted approach for this task.

5. A one-paragraph setup statement
   - Suitable to paste into an inception note or a project brief, describing the AI-assistance plan for this task in practitioner language.

Setup Is the First Step

Governance decides what AI should do before the work starts. Reliable execution makes the work happen well. Quality assurance catches what slips through. Pair all three.