ToR Evaluation Questions: AI-Ready M&E Rubric

You are an expert M&E evaluation specialist. Score the evaluation questions section of the Terms of Reference (ToR) I will provide using the rubric below.

SCORING RUBRIC - ToR Evaluation Questions

Score each dimension 1-5 using these criteria:

DIMENSION 1: Scope Coherence
- Score 5: All elements present. Questions align directly with the stated evaluation purpose (formative, summative, or mixed). Questions fit the budget envelope and timeline indicated in the ToR. Each question is proportionate to the program size, complexity, and theory of change. No question requires evidence beyond what the program scope contains.
- Score 4: At least three of four elements present. Purpose alignment and proportionality clear; budget or timeline fit partially demonstrated.
- Score 3: At least two of four elements present. Questions broadly align with purpose but proportionality or resource fit is implicit.
- Score 2: Questions only loosely related to the stated purpose. Several questions exceed the program scope or resource envelope.
- Score 1: Absent or inadequate. Questions disconnected from purpose, scope, or resources.

DIMENSION 2: Answerability
- Score 5: All elements present. Each question can be answered by methods the ToR contemplates. Required data is available or collectable within the timeline. Required respondents are accessible. The unit of analysis is consistent with the question.
- Score 4: At least three of four elements present. Most questions answerable with the contemplated methods; one or two require minor scope adjustments.
- Score 3: At least two of four elements present. Questions broadly answerable but data access or respondent reach is uncertain for some.
- Score 2: Several questions cannot be answered with the methods or data the ToR allows. Counterfactual or attribution claims requested without supporting design.
- Score 1: Absent or inadequate. Questions cannot be answered with the contemplated methods or available evidence.

DIMENSION 3: Decision Relevance
- Score 5: All elements present. Each question is tied to a named decision, learning need, or accountability obligation. The user of each answer is identified (donor, program team, partner, beneficiary representatives). The decision window matches the evaluation timeline. Findings will be actionable within the program lifecycle or successor cycle.
- Score 4: At least three of four elements present. Decision relevance clear; named users or decision windows partial.
- Score 3: At least two of four elements present. Some questions tied to decisions; others framed as general inquiry.
- Score 2: Most questions read as academic interest rather than decision support. No named users.
- Score 1: Absent or inadequate. Questions have no apparent decision use.

DIMENSION 4: Criterion Coverage
- Score 5: All elements present. OECD-DAC criteria (or a documented alternative framework such as utilization, equity, or donor-specific criteria) are used. Each question maps to one or more criteria. Criteria selection is justified by the evaluation purpose (e.g., effectiveness and sustainability for summative; relevance and coherence for formative). Criteria are not applied as a checklist where they do not fit.
- Score 4: At least three of four elements present. Criteria framework used and questions mapped; selection rationale or fit partially developed.
- Score 3: At least two of four elements present. Criteria framework named but mapping implicit. Selection not justified.
- Score 2: Criteria listed at the start but questions not mapped to them. Or all criteria forced onto every question regardless of fit.
- Score 1: Absent or inadequate. No criteria framework used or used incoherently.

DIMENSION 5: Right-Sizing
- Score 5: All elements present. The total number of main questions is realistic for the budget and timeline (typically 3-6 for a standard evaluation). Sub-questions are bounded (2-4 per main question). The depth of inquiry expected for each question is achievable. The team size and skill mix in the ToR can deliver on the question set.
- Score 4: At least three of four elements present. Question count realistic; sub-question scope or team fit partial.
- Score 3: At least two of four elements present. Question count on the high end; sub-questions sometimes overlap or repeat.
- Score 2: Question set is overcommitted (e.g., 10+ main questions, 30+ sub-questions for a short evaluation). Depth not achievable.
- Score 1: Absent or inadequate. Question set cannot be addressed at any defensible depth within the resources.

OUTPUT FORMAT:

Return your assessment as a table followed by a summary:

| Dimension | Score (1-5) | Evidence | Priority Revision |
|-----------|-------------|----------|-------------------|
| Scope Coherence | | | |
| Answerability | | | |
| Decision Relevance | | | |
| Criterion Coverage | | | |
| Right-Sizing | | | |

**Total: X/25**

**Band:** Strong (22-25) / Adequate (17-21) / Needs Revision (11-16) / Substantial Revision (5-10)

**Single Most Important Revision:** [One specific sentence]

For any dimension scored 1 or 2, add a brief explanation and a concrete revision example.

EVALUATION QUESTIONS TO SCORE:

[Paste your evaluation questions section here]
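Because the output format fixes the table columns and the five dimension names, a returned assessment can be checked mechanically before anyone relies on it. The following is a minimal Python sketch under stated assumptions, not part of the rubric itself: the names `DIMENSIONS` and `parse_scores` are hypothetical, and the sketch assumes the response contains the pipe table exactly as specified, with a bare integer in the score column.

```python
import re

# Dimension names exactly as they appear in the rubric's output table.
DIMENSIONS = [
    "Scope Coherence",
    "Answerability",
    "Decision Relevance",
    "Criterion Coverage",
    "Right-Sizing",
]


def parse_scores(response_text: str) -> dict[str, int]:
    """Extract {dimension: score} from the pipe table in a response.

    Assumption: the second column holds a bare integer from 1 to 5.
    Anything else (blank cell, "4/5", bold markup) raises ValueError
    so that a malformed assessment is caught rather than guessed at.
    """
    scores: dict[str, int] = {}
    for line in response_text.splitlines():
        # Split a table row into trimmed cells; non-table lines are skipped
        # because their first cell is not a known dimension name.
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) < 2 or cells[0] not in DIMENSIONS:
            continue
        if re.fullmatch(r"[1-5]", cells[1]) is None:
            raise ValueError(
                f"Score for {cells[0]!r} is not an integer 1-5: {cells[1]!r}"
            )
        scores[cells[0]] = int(cells[1])

    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"Missing dimensions: {missing}")
    return scores
```

Summing the returned values gives a total that can be compared against the `Total: X/25` line in the response, which is a cheap way to catch arithmetic slips in the assessment.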

Scoring Criteria

Scope Coherence

5 Excellent

All elements present. Questions align with the stated purpose, fit the budget and timeline, and remain proportionate to the program. No question requires evidence beyond the program scope.

4 Good

At least three of four elements present. Purpose alignment and proportionality clear; budget or timeline fit partial.

3 Adequate

At least two of four elements present. Questions broadly align with purpose but proportionality or resource fit is implicit.

2 Needs Improvement

Questions only loosely related to stated purpose. Several exceed scope or resources.

1 Inadequate

Absent or inadequate. Questions disconnected from purpose, scope, or resources.

Answerability

5 Excellent

All elements present. Each question answerable with contemplated methods. Data available, respondents accessible, unit of analysis consistent.

4 Good

At least three of four elements present. Most questions answerable; one or two need minor scope adjustment.

3 Adequate

At least two of four elements present. Data access or respondent reach uncertain for some.

2 Needs Improvement

Several questions unanswerable. Attribution or counterfactual claims unsupported by design.

1 Inadequate

Absent or inadequate. Questions cannot be answered with the contemplated methods or evidence.

Decision Relevance

5 Excellent

All elements present. Each question tied to a named decision, learning need, or accountability obligation. User of each answer identified. Decision window matches evaluation timeline. Findings actionable.

4 Good

At least three of four elements present. Decision relevance clear; named users or windows partial.

3 Adequate

At least two of four elements present. Some questions tied to decisions; others framed as general inquiry.

2 Needs Improvement

Most questions read as academic interest. No named users.

1 Inadequate

Absent or inadequate. Questions have no apparent decision use.

Criterion Coverage

5 Excellent

All elements present. OECD-DAC or documented alternative framework used. Each question mapped to one or more criteria. Selection justified by evaluation purpose. Criteria not applied as a checklist.

4 Good

At least three of four elements present. Criteria framework used and questions mapped; rationale or fit partial.

3 Adequate

At least two of four elements present. Framework named but mapping implicit. Selection not justified.

2 Needs Improvement

Criteria listed at the start but questions not mapped. Or all criteria forced onto every question regardless of fit.

1 Inadequate

Absent or inadequate. No criteria framework used or used incoherently.

Right-Sizing

5 Excellent

All elements present. Main-question count realistic (typically 3-6). Sub-questions bounded (2-4 each). Depth achievable. Team size and skill mix can deliver.

4 Good

At least three of four elements present. Question count realistic; sub-question scope or team fit partial.

3 Adequate

At least two of four elements present. Question count on the high end; sub-questions sometimes overlap.

2 Needs Improvement

Question set overcommitted (e.g., 10+ main questions, 30+ sub-questions). Depth not achievable.

1 Inadequate

Absent or inadequate. Question set cannot be addressed at any defensible depth.

Score Interpretation

| Total (out of 25) | Band | Next Step |
|-------------------|------|-----------|
| 22-25 | Strong | Evaluation questions are ready. Minor refinements only. |
| 17-21 | Adequate | Address flagged dimensions before issuing the ToR for bids. |
| 11-16 | Needs Revision | Rework the questions before procurement. Use the Revise prompt as a revision brief. |
| 5-10 | Substantial Revision | Return to the evaluation purpose and redraft the question set. |
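The band arithmetic is simple enough to sketch in code: five dimensions scored 1-5 give a total between 5 and 25, and the thresholds in the table map that total to a band. The Python below is an illustrative sketch only; the function names `total_score` and `band_for_total` are hypothetical and do not come from the rubric.

```python
def total_score(scores: list[int]) -> int:
    """Sum five dimension scores, each of which must be an integer 1-5."""
    if len(scores) != 5:
        raise ValueError(f"Expected 5 dimension scores, got {len(scores)}")
    for s in scores:
        if not 1 <= s <= 5:
            raise ValueError(f"Score out of range 1-5: {s}")
    return sum(scores)


def band_for_total(total: int) -> str:
    """Map a total out of 25 to the band named in the interpretation table."""
    if not 5 <= total <= 25:
        raise ValueError(f"Total must be between 5 and 25, got {total}")
    if total >= 22:
        return "Strong"
    if total >= 17:
        return "Adequate"
    if total >= 11:
        return "Needs Revision"
    return "Substantial Revision"


if __name__ == "__main__":
    # Example: scores in rubric order (Scope Coherence, Answerability,
    # Decision Relevance, Criterion Coverage, Right-Sizing).
    example = [4, 3, 5, 2, 4]
    total = total_score(example)
    print(f"Total: {total}/25, Band: {band_for_total(total)}")
```

A total in the Adequate band can still hide a single dimension scored 1 or 2, so the band is best read alongside the per-dimension scores rather than on its own.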
