How We Build Pipelines

Staged, gated, local-first.

Staged

Small steps. One simple job per AI call, so failures surface step by step instead of hiding in a wall of output.

Gated

Quality gates between steps. Schema checks, rule-based validators, and rubric-scored judges keep bad output from propagating.

Local-first

Local AI models for sensitive data. Deterministic de-identification when cloud models are needed. You decide where data goes.

Staged: how we compose pipelines

Four decisions we make on every pipeline: which steps are deterministic and which need an AI call, what temperature each step runs at, how long inputs get split, and which pattern the steps are arranged in.

Rule-based nodes where we can, AI nodes where we must

Rule-based nodes do work that does not require intelligence. They cost nothing to run and fail predictably. AI nodes are reserved for work that requires judgment or generation. The cheapest pipeline is the one that uses the fewest AI calls. Before adding an AI node, we ask whether a rule-based node can handle the work. Most of the time it can.

Rule-based (no AI)

  • Routing a record to the right downstream step based on a field value
  • Validating that a date parses and a numeric field is within bounds
  • Converting between JSON and CSV, or between donor templates
  • Checking that a required field is non-empty
  • De-identifying fields before data leaves your infrastructure

AI-required

  • Interpreting an open-ended survey response
  • Summarizing a long document against a specific question
  • Applying a codebook to an interview transcript
  • Writing a narrative paragraph from structured evidence
  • Scoring an output against a quality rubric
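A rule-based node from the first list can be a few lines of ordinary code. The sketch below, with a hypothetical record shape (respondent_id, collected_on, score), shows the date, bounds, and required-field checks as one deterministic validator that fails predictably:

```python
from datetime import date

def validate_record(record: dict) -> list[str]:
    """Rule-based node: returns a list of failure reasons (empty list = pass)."""
    errors = []
    # Required field must be non-empty.
    if not record.get("respondent_id"):
        errors.append("respondent_id is missing or empty")
    # Date must parse (ISO format assumed here for illustration).
    try:
        date.fromisoformat(record.get("collected_on", ""))
    except ValueError:
        errors.append("collected_on is not a valid ISO date")
    # Numeric field must be within bounds.
    score = record.get("score")
    if not isinstance(score, (int, float)) or not 0 <= score <= 100:
        errors.append("score must be a number between 0 and 100")
    return errors

# A well-formed record passes; a malformed one fails with named reasons.
ok = validate_record({"respondent_id": "r-17", "collected_on": "2024-03-01", "score": 82})
bad = validate_record({"respondent_id": "", "collected_on": "last week", "score": 150})
```

No AI call is involved, so this node costs nothing per run and never hallucinates a pass.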

Temperature discipline

Temperature controls how much randomness a model introduces. Matching temperature to task is not a stylistic choice; it is how we get reliable output at each stage. Cold temperatures reduce hallucination in extraction. Warm temperatures improve drafting quality.

Phase                    Temperature   Why
Extraction and analysis  0.0 – 0.2     Same input, same output. Lowest hallucination risk.
Drafting                 0.3 – 0.4     Language needs variation so it does not feel templated.
Polishing and judging    0.0 – 0.1     Consistency matters more than creativity.
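The per-phase settings above can live in a small config a pipeline runner reads before each AI call. The phase names and lookup function here are illustrative, not a real API:

```python
# Per-phase temperature settings, matching the table above.
PHASE_TEMPERATURE = {
    "extract": 0.0,   # deterministic: same input, same output
    "analyze": 0.1,   # low randomness keeps hallucination risk down
    "draft":   0.35,  # some variation so language does not feel templated
    "polish":  0.0,   # consistency over creativity
    "judge":   0.0,   # scores must be reproducible
}

def temperature_for(phase: str) -> float:
    # Unknown phases default to cold: the safe failure mode is
    # deterministic output, not creative drift.
    return PHASE_TEMPERATURE.get(phase, 0.0)
```

Centralizing the mapping means temperature is set per task type once, not re-decided ad hoc at each call site.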

Chunking for long inputs

Long documents do not go into one prompt. A 40-page evaluation report, a hundred-page donor manual, a thousand-row survey dataset: each gets split into chunks sized to the task, processed in parallel, then merged.

Three reasons. Context windows cost money: larger inputs cost more and run slower. Model attention degrades with length, so detail in the middle of a long prompt is more likely to be missed. And per-chunk parallelism can turn a run that would take an hour as a single prompt into a few minutes of parallel work.

How we chunk depends on the task: semantic boundaries for documents, row groups for tables, transcript turns for interviews.
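A minimal sketch of the split-process-merge flow for documents, assuming paragraph boundaries as the split point and a stand-in function in place of the AI call:

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_by_paragraphs(text: str, max_chars: int = 2000) -> list[str]:
    """Split on blank-line paragraph boundaries, packing paragraphs
    into chunks of at most max_chars where possible."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

def process_chunks(text: str, step) -> list:
    """Fan out chunks to the step in parallel, merge results in order."""
    chunks = chunk_by_paragraphs(text)
    with ThreadPoolExecutor() as pool:
        # map preserves input order, so the merge stays deterministic
        return list(pool.map(step, chunks))
```

Row groups for tables and transcript turns for interviews follow the same shape; only the splitting function changes.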

Pipeline topology

Six patterns cover most pipeline topologies we build. Most production pipelines combine several: hierarchical overall, with iterative quality loops around creative stages, tournament selection where variant quality varies, and conditional routing where mixed inputs need different treatment.

Six pipeline patterns

A → B → C

Linear

Steps run one after another. Used when each step builds on the output of the previous one, like extract, then code, then summarize.

A → (B1 ∥ B2 ∥ B3) → C

Parallel (fan-out / fan-in)

One input splits into concurrent branches that rejoin. Used when a document needs multiple independent analyses at once, like extracting themes and flagging risks in the same pass.

A → (B1 → C1 ∥ B2 → C2) → D

Hierarchical

A multi-section output built by running a sub-pipeline per section, then assembled. Used for long documents like evaluation reports with distinct chapters.

A → B → J → (retry back to B)

Iterative

A step runs, a judge scores it, and if quality is low the step re-runs with feedback. Capped at three loops. Used when outputs need iterative polish, like narrative drafting.
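The iterative loop is small enough to sketch directly. The step and judge below are stand-ins for AI calls; the retry cap of three and the 0.85 passing threshold come from this page, the judge's (score, feedback) interface is an assumption:

```python
MAX_LOOPS = 3        # retry cap from the iterative pattern
THRESHOLD = 0.85     # rubric passing score used by the quality gates

def run_with_retries(step, judge, payload):
    """Iterative pattern: run, judge, re-run with feedback, max 3 loops."""
    feedback = None
    for attempt in range(1, MAX_LOOPS + 1):
        output = step(payload, feedback)
        score, feedback = judge(output)
        if score >= THRESHOLD:
            return output, score, attempt
    # Cap reached without passing: signal for human review rather than
    # silently proceeding with a bad output.
    return None, score, attempt
```

Feeding the judge's critique back into the next attempt is what makes the loop converge instead of just re-rolling the dice.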

A → (V1 ∥ V2 ∥ V3) → J

Tournament

Multiple variants of the same step run in parallel, a judge picks the best. Used for creative work where variant quality varies, like drafting a recommendation paragraph.

A → ◆ → (B1 → C1 or B2 → C2)

Conditional

A router sends each item down the right branch based on its characteristics. Used when the same pipeline handles mixed inputs, like routing transcripts one way and reports another.

A = input · B/C/D = processing steps · V = variant · J = judge · ◆ = router

Gated: quality assurance at every join

Every step passes through a quality gate before the next step runs. If the gate fails, the work routes to human review with the reason flagged. The pipeline does not silently proceed with a bad output.

Schema validators

Confirm the output has the right shape: required fields, correct types, valid formats. Deterministic, zero-cost, runs in milliseconds.

Rule-based checks

Enforce business logic: numeric ranges, enum values, cross-field constraints. Also deterministic, also zero-cost.

Rubric-scored judges

Use an AI model to score output against a rubric of specific criteria. Used when quality is something a human would evaluate subjectively.

Variant tournaments

Generate multiple candidates for the same step, then have a judge pick the best. Used when variant quality varies and the best one is worth the extra cost.

Pipelines set a passing threshold of 0.85 on rubric-scored outputs. Anything below routes to human review. Schema and rule-based checks are pass-fail.
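One way to wire the gates at a join, as a sketch: deterministic checks run first because they are free, the rubric judge runs last because it costs an AI call, and any failure routes to human review with the reason attached. The check and judge interfaces here are illustrative:

```python
PASS_THRESHOLD = 0.85  # rubric threshold stated above

def gate(output: dict, schema_check, rule_check, rubric_judge) -> dict:
    """Run cheap deterministic gates first, the AI judge last."""
    for name, check in (("schema", schema_check), ("rules", rule_check)):
        reason = check(output)  # returns None on pass, a reason string on fail
        if reason:
            return {"status": "human_review", "gate": name, "reason": reason}
    score = rubric_judge(output)  # returns a 0.0 - 1.0 rubric score
    if score < PASS_THRESHOLD:
        return {"status": "human_review", "gate": "rubric", "score": score}
    return {"status": "pass", "score": score}
```

Ordering matters: a schema failure short-circuits before any model is invoked, so malformed output never costs a judge call.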

See how each quality assurance method works

Local-first: sensitive data handling

Not every pipeline can run on local AI models, and not every pipeline needs to. The decision depends on what data is involved. We use a three-tier decision ladder.

Which data goes where

Tier 1 — Identifiable + sensitive
Interview transcripts, health records, household rosters
→ Local models only

Tier 2 — Sensitive but de-identifiable
Survey data with personal fields, beneficiary tracking
→ Deterministic anonymization, then cloud

Tier 3 — Public or depersonalized
Reports, indicator data, donor guidance, operational metadata
→ Cloud models directly

Every pipeline logs which tier each step ran on, which model version was used, and what transformations were applied. You can audit where your data went and what processed it.
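A sketch of the tier routing with the audit fields described above. The tier rules match the ladder; the step names, model identifiers, and log shape are placeholders, not the real routing table:

```python
def route_step(tier: int, step: str, model_version: str, log: list) -> str:
    """Pick a destination by data tier and append an audit record."""
    if tier == 1:
        destination, transform = "local", None            # local models only
    elif tier == 2:
        destination, transform = "cloud", "deterministic_deidentify"
    else:
        destination, transform = "cloud", None            # public data, cloud direct
    log.append({
        "step": step,
        "tier": tier,
        "destination": destination,
        "model_version": model_version,
        "transformations": [transform] if transform else [],
    })
    return destination
```

Because the log is written at the routing decision itself, the audit trail cannot drift out of sync with where the data actually went.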

See the full data privacy approach

Discuss a Pilot

Tell us the M&E data task, the volume, and the sensitivity profile. We will scope a pilot that fits.