Reliability

The consistency and repeatability of a measurement — whether the same tool produces stable results across repeated applications, different raters, or different time periods.

Also known as: Measurement Reliability, Test-Retest Reliability, Inter-Rater Reliability

Definition

Reliability refers to the consistency and repeatability of a measurement — whether your data collection tool produces stable, dependable results when applied repeatedly under similar conditions. A reliable measurement yields the same (or very similar) results when administered multiple times to the same subjects, when used by different data collectors, or when split into parallel forms.

Reliability is a prerequisite for validity: a measurement can be reliable without being valid (consistently measuring the wrong thing), but it cannot be valid without being reliable (inconsistent measurements cannot accurately capture reality). In practice, reliability testing typically precedes full-scale data collection as part of data quality assurance protocols.

Why It Matters

In M&E work, unreliable measurements undermine every downstream decision. If your survey instrument produces different results depending on which enumerator administers it, or if your scoring rubric yields different ratings when applied by different evaluators, you cannot distinguish programme effects from measurement error. This creates false signals that lead to incorrect conclusions about what is working.

Reliability testing is particularly critical when:

  • Introducing new tools — Novel indicators or assessment methods have unknown reliability properties until tested
  • Training new data collectors — Even well-designed tools produce inconsistent results if collectors apply them differently
  • Comparing data across time or groups — Without reliability evidence, observed differences may reflect measurement inconsistency rather than real change
  • Making high-stakes decisions — Funding allocations, programme pivots, and termination decisions require confidence that measurements are stable

Investing in reliability testing upfront prevents costly errors later, including wasted data collection on flawed instruments and erroneous programme conclusions that damage organisational credibility.

In Practice

Reliability manifests in several forms, each tested differently:

Test-retest reliability assesses whether a tool produces stable results over time. The same instrument is administered to the same subjects on two occasions (typically 1-2 weeks apart, long enough that respondents don't recall answers but short enough that the underlying construct hasn't changed). Correlation coefficients above 0.70 generally indicate acceptable stability. This is essential for surveys measuring attitudes, perceptions, or other constructs that could genuinely shift.
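
As a minimal sketch of the calculation, assuming each respondent's scale score has been recorded at two administrations, the test-retest coefficient is simply the correlation between the two waves. The figures below are invented purely for illustration:

```python
import numpy as np

# Hypothetical scores for the same 8 respondents at two administrations,
# roughly two weeks apart (illustrative data, not from any real survey).
time_1 = np.array([12, 18, 15, 22, 9, 17, 20, 14])
time_2 = np.array([13, 17, 16, 21, 10, 18, 19, 15])

# Test-retest reliability as the Pearson correlation between the two waves.
r = np.corrcoef(time_1, time_2)[0, 1]
print(f"Test-retest correlation: {r:.2f}")  # values above ~0.70 suggest acceptable stability
```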

Inter-rater reliability evaluates whether different data collectors apply a tool consistently. Two or more raters independently assess the same subjects using the same instrument (e.g., two evaluators scoring the same programme documentation, two enumerators conducting parallel observations). Metrics include percent agreement (simple but inflated by chance) or Cohen's kappa/Fleiss' kappa (chance-corrected agreement). Training and calibration sessions directly improve inter-rater reliability.
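
A rough illustration of the chance correction is sketched below with a hand-rolled Cohen's kappa applied to invented ratings from two hypothetical evaluators; in real analyses a statistics package would normally be used, but the logic is the same: observed agreement minus the agreement expected by chance, rescaled.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters scoring the same subjects."""
    n = len(rater_a)

    # Observed agreement: share of subjects where both raters gave the same rating.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected (chance) agreement from each rater's marginal rating frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    p_expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)

    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical scores from two evaluators rating the same 10 documents on a 1-3 rubric.
evaluator_1 = [1, 2, 3, 2, 1, 3, 2, 2, 1, 3]
evaluator_2 = [1, 2, 3, 2, 2, 3, 2, 1, 1, 3]
print(f"Cohen's kappa: {cohens_kappa(evaluator_1, evaluator_2):.2f}")
```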

Internal consistency measures whether items within a multi-item scale measure the same construct. Cronbach's alpha is the standard metric, with values above 0.70 indicating acceptable consistency. This is the reliability concern most commonly addressed during survey development — poorly worded or ambiguous items reduce internal consistency and are typically revised or removed.
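
A minimal sketch of the Cronbach's alpha calculation, assuming responses are arranged as respondents by items; the 4-item Likert data below are hypothetical:

```python
import numpy as np

def cronbachs_alpha(item_scores):
    """item_scores: 2-D array, rows = respondents, columns = scale items."""
    item_scores = np.asarray(item_scores, dtype=float)
    k = item_scores.shape[1]                         # number of items in the scale
    item_variances = item_scores.var(axis=0, ddof=1)
    total_variance = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical responses from 6 people to a 4-item attitude scale (1-5 Likert).
responses = [
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 3, 4],
    [4, 4, 5, 4],
    [1, 2, 2, 1],
]
print(f"Cronbach's alpha: {cronbachs_alpha(responses):.2f}")  # above ~0.70 is usually acceptable
```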

Parallel-forms reliability tests whether two versions of the same instrument yield equivalent results. Less common in M&E but relevant when you need alternate versions (e.g., pre/post tests that shouldn't be identical to avoid practice effects).

In practice, reliability is rarely a binary pass/fail. It's a property of your specific tool in your specific context with your specific data collectors. A survey validated in one setting may show poor reliability in another due to cultural differences, literacy levels, or enumerator training quality. Continuous monitoring of reliability metrics — particularly inter-rater agreement during data collection — helps catch drift before it compromises your findings.

Related Topics


See also: Bias, Measurement Error, Instrument Validation