Skip to content

Runs are compared by coordinate join, with a paired-difference verdict

touchstone's purpose is choosing a model for your usecases, tracked over time as models change. A single run answers "which is best today"; it cannot answer "did the new model version regress my tasks?" That question needs two runs related to each other. This ADR fixes how two runs are compared, and what counts as a regression.

Status

accepted

Context

Each run is a directory whose per-Cell result.json is the source of truth (ADR 0002), and a Cell's id is a stable coordinatecase__harness__model__t{trial} — that is identical across runs. Two runs of the same matrix therefore share Cell ids by construction. Nothing in the framework exploited this: comparing runs meant reading two report.md files by eye.

Three questions had to be settled:

  1. What corresponds to what across two runs? Content can drift (a case's source commit, a harness version), so matching on outputs is unreliable. The coordinate is the only stable key.
  2. When is a difference a regression? A single Cell's score dip is noise as often as signal — the report already refuses to call a single-run winner without a paired test (ADR-less, but report._significance_section). A cross-run claim deserves the same rigor.
  3. What must never read as a regression? A Cell skipped because its private repo was unreachable on this host (ADR 0008) is coverage loss, not a model getting worse.

Decision

  • Coordinate join. compare_runs(a, b) joins Cells by (case, harness, model, trial). Cells present in only one run are reported as added/removed, never silently dropped. SKIPPED Cells are excluded from the comparison and surfaced in a separate "coverage changed" section.
  • A delta is not a verdict. Per-(case,harness,model) deltas (score, pass-rate, pass^k flip) are shown for triage, but the headline regression call is the paired difference over the cases both runs share, with a bootstrap CI (the same statistic the report uses for within-run winners, see ADR 0013). A drop whose CI crosses 0 is "within noise," not a regression.
  • One gate. compare --fail-on-regression exits non-zero only on a significant paired drop, with --require-cases N refusing to gate on too few shared cases. This single predicate is the entire CI/cron contract — scheduling lives outside the tool (a cron/Action wrapper), not in a daemon.
  • Run identity is a label, not a rename. Runs keep their timestamp id; an optional --label and a runs/.baseline pointer let compare/resolve_run refer to latest, baseline, or a label without changing the immutable run id.
  • compare is read-only over existing runs; it writes only a new compare-<a>.md (and optional .html) under the newer run. No historical result.json/manifest.json is mutated.

Consequences

  • Regression tracking — flagging Cells that flipped pass→fail with a statistically honest verdict — becomes a first-class, scriptable operation, which is the framework's reason to exist over a one-shot benchmark. This is greenfield relative to comparable OSS tools, which ship run viewing but not pass→fail gating for agent evals.
  • The comparison reuses the report's per-case-mean clustering and bootstrap CI, so the two outputs never disagree on the statistics.
  • Bisecting "when did this regress" across N runs is a natural follow-on (a timeline over labeled runs); this ADR only commits to the two-run primitive it builds on.
  • New glossary terms: Baseline (the run a new run is judged against) and Regression (a group that flipped pass→fail or dropped beyond a significance-checked threshold).