How We Build Pipelines
Staged, gated, local-first.
Staged
Small steps. One simple job per AI call, so failures surface step by step instead of hiding in a wall of output.
Gated
Quality gates between steps. Schema checks, rule-based validators, and rubric-scored judges keep bad output from propagating.
Local-first
Local AI models for sensitive data. Deterministic de-identification when cloud models are needed. You decide where data goes.
Staged: how we compose pipelines
Four decisions we make on every pipeline: which steps are deterministic and which need an AI call, what temperature each step runs at, how long inputs get split, and which pattern the steps are arranged in.
Rule-based nodes where we can, AI nodes where we must
Rule-based nodes do work that does not require intelligence. They cost nothing to run and fail predictably. AI nodes are reserved for work that requires judgment or generation. The cheapest pipeline is the one that uses the fewest AI calls. Before adding an AI node, we ask whether a rule-based node can handle the work. Most of the time it can.
Rule-based (no AI)
- Routing a record to the right downstream step based on a field value
- Validating that a date parses and a numeric field is within bounds
- Converting between JSON and CSV, or between donor templates
- Checking that a required field is non-empty
- De-identifying fields before data leaves your infrastructure
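A rule-based validation node of this kind takes only a few lines. The sketch below is illustrative, not our actual API; the field names (`respondent_id`, `date`, `age`) are hypothetical:

```python
from datetime import datetime

def validate_record(record: dict) -> list[str]:
    """Deterministic checks: no AI call, zero cost, predictable failures."""
    errors = []
    # Required field must be non-empty.
    if not record.get("respondent_id"):
        errors.append("respondent_id is empty")
    # Date must parse in the expected format.
    try:
        datetime.strptime(record.get("date", ""), "%Y-%m-%d")
    except ValueError:
        errors.append("date does not parse as YYYY-MM-DD")
    # Numeric field must be within plausible bounds.
    age = record.get("age")
    if not isinstance(age, (int, float)) or not 0 <= age <= 120:
        errors.append("age out of bounds")
    return errors
```

A clean record returns an empty error list; anything else returns a list of flagged reasons that can route the record to review.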
AI-required
- Interpreting an open-ended survey response
- Summarizing a long document against a specific question
- Applying a codebook to an interview transcript
- Writing a narrative paragraph from structured evidence
- Scoring an output against a quality rubric
Temperature discipline
Temperature controls how much randomness a model introduces. Matching temperature to task is how we get reliable output at each stage. This is not stylistic. Cold temperatures reduce hallucination in extraction. Warm temperatures improve drafting quality.
| Phase | Temperature | Why |
|---|---|---|
| Extraction and analysis | 0.0 – 0.2 | Same input, same output. Lowest hallucination risk. |
| Drafting | 0.3 – 0.4 | Language needs variation so it does not feel templated. |
| Polishing and judging | 0.0 – 0.1 | Consistency matters more than creativity. |
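In configuration this becomes a simple per-phase lookup. The map below is a sketch, with one value chosen from within each range in the table; the phase names are illustrative:

```python
# Per-phase temperature settings mirroring the table above.
# Each value is picked from within that phase's range.
PHASE_TEMPERATURE = {
    "extraction": 0.1,  # same input, same output; lowest hallucination risk
    "analysis": 0.1,
    "drafting": 0.35,   # enough variation to avoid templated language
    "polishing": 0.0,   # consistency over creativity
    "judging": 0.0,
}

def temperature_for(phase: str) -> float:
    """Look up a phase's temperature; unknown phases default cold."""
    return PHASE_TEMPERATURE.get(phase, 0.0)
```

Defaulting unknown phases to 0.0 is a deliberate choice: a cold temperature is the safe failure mode for anything unclassified.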
Chunking for long inputs
Long documents do not go into one prompt. A 40-page evaluation report, a hundred-page donor manual, a thousand-row survey dataset: each gets split into chunks sized to the task, processed in parallel, then merged.
Three reasons. Cost: larger inputs mean more tokens, which cost more and run slower. Attention: model attention degrades with length, so detail in the middle of a long prompt is more likely to be missed. Speed: per-chunk parallelism can turn a run that would take an hour as one prompt into several minutes of parallel work.
How we chunk depends on the task: semantic boundaries for documents, row groups for tables, transcript turns for interviews.
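For documents, a minimal boundary-aware chunker packs whole paragraphs into size-capped chunks rather than splitting mid-sentence. This sketch uses a character cap as a stand-in for a token budget:

```python
def chunk_paragraphs(text: str, max_chars: int = 2000) -> list[str]:
    """Split on paragraph boundaries, packing paragraphs into chunks
    no larger than max_chars. A single oversized paragraph becomes
    its own chunk rather than being cut mid-sentence."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be processed in parallel and the results merged downstream; row groups for tables and transcript turns for interviews follow the same pack-to-a-budget pattern.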
Pipeline topology
Six patterns cover most pipeline topologies we build. Most production pipelines combine several: hierarchical overall, with iterative quality loops around creative stages, tournament selection where variant quality varies, and conditional routing where mixed inputs need different treatment.
Six pipeline patterns
Linear
Steps run one after another. Used when each step builds on the output of the previous one, like extract, then code, then summarize.
Parallel (fan-out / fan-in)
One input splits into concurrent branches that rejoin. Used when a document needs multiple independent analyses at once, like extracting themes and flagging risks in the same pass.
Hierarchical
A multi-section output built by running a sub-pipeline per section, then assembled. Used for long documents like evaluation reports with distinct chapters.
Iterative
A step runs, a judge scores it, and if quality is low the step re-runs with feedback. Capped at three loops. Used when outputs need iterative polish, like narrative drafting.
Tournament
Multiple variants of the same step run in parallel, a judge picks the best. Used for creative work where variant quality varies, like drafting a recommendation paragraph.
Conditional
A router sends each item down the right branch based on its characteristics. Used when the same pipeline handles mixed inputs, like routing transcripts one way and reports another.
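Two of these patterns can be sketched as plain function composition. This is a simplified model, not our orchestration engine: a step is any callable from item to item, a linear pipeline chains steps, and a conditional pipeline routes each item by a key:

```python
from typing import Callable

# A step is a plain callable; a pipeline is just a composed step.
Step = Callable[[dict], dict]

def linear(*steps: Step) -> Step:
    """Linear: each step builds on the previous step's output."""
    def run(item: dict) -> dict:
        for step in steps:
            item = step(item)
        return item
    return run

def conditional(route: Callable[[dict], str], branches: dict) -> Step:
    """Conditional: a router picks the branch for each item."""
    def run(item: dict) -> dict:
        return branches[route(item)](item)
    return run

# Mixed inputs, different treatment: transcripts one way, reports another.
pipeline = conditional(
    lambda item: item["kind"],
    {
        "transcript": linear(lambda i: {**i, "coded": True}),
        "report": linear(lambda i: {**i, "summarized": True}),
    },
)
```

Because a composed pipeline is itself a step, the patterns nest: a hierarchical pipeline is a linear pipeline whose steps are sub-pipelines.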
(Diagram legend: A = input · B/C/D = processing steps · V = variant · J = judge · ◆ = router.)
Gated: quality assurance at every join
Every step passes through a quality gate before the next step runs. If the gate fails, the work routes to human review with the reason flagged. The pipeline does not silently proceed with a bad output.
Schema validators
Confirm the output has the right shape: required fields, correct types, valid formats. Deterministic, zero-cost, runs in milliseconds.
Rule-based checks
Enforce business logic: numeric ranges, enum values, cross-field constraints. Also deterministic, also zero-cost.
Rubric-scored judges
Use an AI model to score output against a rubric of specific criteria. Used when quality is something a human would evaluate subjectively.
Variant tournaments
Generate multiple candidates for the same step, then have a judge pick the best. Used when variant quality varies and the best one is worth the extra cost.
Pipelines set a passing threshold of 0.85 on rubric-scored outputs. Anything below routes to human review. Schema and rule-based checks are pass-fail.
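The gate logic itself is deterministic. A sketch of the rubric-score branch, with the 0.85 threshold from above (field names are illustrative):

```python
def gate(output: dict, score: float, threshold: float = 0.85) -> dict:
    """Route a rubric-scored output: pass onward, or flag for
    human review with the reason attached."""
    if score >= threshold:
        return {"status": "pass", "output": output}
    return {
        "status": "human_review",
        "output": output,
        "reason": f"rubric score {score:.2f} below threshold {threshold}",
    }
```

The key property is that the low-quality branch never silently proceeds: it always carries an explicit reason to the reviewer.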
Local-first: sensitive data handling
Not every pipeline can run on local AI models, and not every pipeline needs to. The decision depends on what data is involved. We use a three-tier decision ladder.
Which data goes where
| Tier | Examples | Where it runs |
|---|---|---|
| Identifiable + sensitive | Interview transcripts, health records, household rosters | Local models only |
| Sensitive but de-identifiable | Survey data with personal fields, beneficiary tracking | Deterministic anonymization, then cloud |
| Public or depersonalized | Reports, indicator data, donor guidance, operational metadata | Cloud models directly |
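As a routing rule, the ladder reduces to a lookup with a fail-closed default. The tier names here are illustrative; the choice to treat anything unclassified as identifiable + sensitive is our assumption for the sketch, and in practice you decide where data goes:

```python
# Decision ladder as a lookup table (tier names are illustrative).
TIER_DESTINATION = {
    "identifiable_sensitive": "local",            # local models only
    "deidentifiable": "anonymize_then_cloud",     # deterministic de-id first
    "public": "cloud",                            # cloud models directly
}

def destination_for(tier: str) -> str:
    """Fail closed: an unknown tier is handled as if it were
    identifiable + sensitive, so nothing leaks by default."""
    return TIER_DESTINATION.get(tier, "local")
```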
Every pipeline logs which tier each step ran on, which model version was used, and what transformations were applied. You can audit where your data went and what processed it.
Discuss a Pilot
Tell us the M&E data task, the volume, and the sensitivity profile. We will scope a pilot that fits.