Runs are compared by coordinate join, with a paired-difference verdict
touchstone's purpose is choosing a model for your use cases, tracked over time as models
change. A single run answers "which is best today"; it cannot answer "did the new model
version regress my tasks?" That question needs two runs related to each other. This ADR fixes
how two runs are compared, and what counts as a regression.
Status
accepted
Context
Each run is a directory whose per-Cell result.json is the source of truth (ADR 0002), and a
Cell's id is a stable coordinate — case__harness__model__t{trial} — that is identical
across runs. Two runs of the same matrix therefore share Cell ids by construction. Nothing in
the framework exploited this: comparing runs meant reading two report.md files by eye.
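
To make the coordinate concrete, here is a minimal sketch of turning a Cell id back into its four parts. The `Coordinate` type and `parse_cell_id` helper are hypothetical names for illustration, not part of the framework, and the sketch assumes that the harness and model names never contain a double underscore.

```python
from typing import NamedTuple


class Coordinate(NamedTuple):
    """Hypothetical container for the four parts of a Cell id."""
    case: str
    harness: str
    model: str
    trial: int


def parse_cell_id(cell_id: str) -> Coordinate:
    """Split 'case__harness__model__t{trial}' into its components.

    Assumption: harness and model never contain '__'. Splitting from the
    right means a case name containing '__' would still parse.
    """
    parts = cell_id.rsplit("__", 3)
    if len(parts) != 4:
        raise ValueError(f"not a Cell id: {cell_id!r}")
    case, harness, model, trial_part = parts
    if not (trial_part.startswith("t") and trial_part[1:].isdigit()):
        raise ValueError(f"bad trial suffix in Cell id: {cell_id!r}")
    return Coordinate(case, harness, model, int(trial_part[1:]))


# Illustrative id only; the case, harness, and model names are made up.
coord = parse_cell_id("fix-bug__harness-a__model-x__t2")
```

Because the id carries no timestamp or content hash, the same matrix parses to the same coordinates in every run, which is what makes it usable as a join key.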
Three questions had to be settled:
- What corresponds to what across two runs? Content can drift (a case's source commit, a harness version), so matching on outputs is unreliable. The coordinate is the only stable key.
- When is a difference a regression? A single Cell's score dip is noise as often as signal: the report already refuses to call a single-run winner without a paired test (not recorded in an ADR, but implemented in report._significance_section). A cross-run claim deserves the same rigor.
- What must never read as a regression? A Cell skipped because its private repo was unreachable on this host (ADR 0008) is coverage loss, not a model getting worse.
Decision
- Coordinate join. compare_runs(a, b) joins Cells by (case, harness, model, trial). Cells present in only one run are reported as added/removed, never silently dropped. SKIPPED Cells are excluded from the comparison and surfaced in a separate "coverage changed" section.
- A delta is not a verdict. Per-(case, harness, model) deltas (score, pass-rate, pass^k flip) are shown for triage, but the headline regression call is the paired difference over the cases both runs share, with a bootstrap confidence interval (CI), the same statistic the report uses for within-run winners (see ADR 0013). A drop whose confidence interval crosses 0 is "within noise," not a regression.
- One gate. compare --fail-on-regression exits non-zero only on a significant paired drop, with --require-cases N refusing to gate on too few shared cases. This single predicate is the entire CI/cron contract: scheduling lives outside the tool (a cron/Action wrapper), not in a daemon.
- Run identity is a label, not a rename. Runs keep their timestamp id; an optional --label and a runs/.baseline pointer let compare/resolve_run refer to latest, baseline, or a label without changing the immutable run id.
- compare is read-only over existing runs; it writes only a new compare-<a>.md (and optional .html) under the newer run. No historical result.json/manifest.json is mutated.
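
The coordinate join described above can be sketched as follows. This is an illustration under stated assumptions, not the framework's compare_runs: the `join_cells` name, the `JoinResult` type, and the in-memory Cell shape (`{"status": ..., "score": ...}`) are all hypothetical, and treating a Cell as a coverage change whenever either side is SKIPPED is a simplification.

```python
from dataclasses import dataclass, field

# (case, harness, model, trial)
Key = tuple[str, str, str, int]


@dataclass
class JoinResult:
    """Hypothetical result shape for a two-run coordinate join."""
    matched: dict = field(default_factory=dict)   # key -> (score_a, score_b)
    added: list = field(default_factory=list)     # only in run b
    removed: list = field(default_factory=list)   # only in run a
    coverage_changed: list = field(default_factory=list)  # SKIPPED on a side


def join_cells(a: dict, b: dict) -> JoinResult:
    """Join two runs' Cells by coordinate; never drop a Cell silently."""
    result = JoinResult()
    for key in sorted(a.keys() | b.keys()):
        if key not in a:
            result.added.append(key)
        elif key not in b:
            result.removed.append(key)
        elif "SKIPPED" in (a[key]["status"], b[key]["status"]):
            # Coverage loss is reported separately, not scored as a drop.
            result.coverage_changed.append(key)
        else:
            result.matched[key] = (a[key]["score"], b[key]["score"])
    return result


# Made-up data: c2 disappears, c4 appears, c3 is skipped in the newer run.
run_a = {
    ("c1", "h", "m", 0): {"status": "OK", "score": 1.0},
    ("c2", "h", "m", 0): {"status": "OK", "score": 0.5},
    ("c3", "h", "m", 0): {"status": "OK", "score": 1.0},
}
run_b = {
    ("c1", "h", "m", 0): {"status": "OK", "score": 0.0},
    ("c3", "h", "m", 0): {"status": "SKIPPED", "score": None},
    ("c4", "h", "m", 0): {"status": "OK", "score": 1.0},
}
joined = join_cells(run_a, run_b)
```

Every key from either run ends up in exactly one of the four buckets, which is the property the "never silently dropped" rule asks for.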
Consequences
- Regression tracking — flagging Cells that flipped pass→fail with a statistically honest verdict — becomes a first-class, scriptable operation, which is the framework's reason to exist over a one-shot benchmark. This is greenfield relative to comparable OSS tools, which ship run viewing but not pass→fail gating for agent evals.
- The comparison reuses the report's per-case-mean clustering and bootstrap CI, so the two outputs never disagree on the statistics.
- Bisecting "when did this regress" across N runs is a natural follow-on (a timeline over labeled runs); this ADR only commits to the two-run primitive it builds on.
- New glossary terms: Baseline (the run a new run is judged against) and Regression (a group that flipped pass→fail or dropped beyond a significance-checked threshold).
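
As an illustration of the significance-checked verdict, the sketch below runs a percentile bootstrap over per-case paired differences and applies a minimum-shared-cases guard. It is a sketch under assumptions, not the report's actual statistic: the function names, the 95% level, the resample count, the default of 5 required cases, and the returned verdict strings are all hypothetical choices made for the example.

```python
import random
from statistics import mean


def paired_bootstrap_ci(diffs, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired differences."""
    rng = random.Random(seed)  # fixed seed so the verdict is reproducible
    n = len(diffs)
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


def verdict(baseline, candidate, require_cases=5):
    """Compare per-case mean scores of two runs over their shared cases.

    baseline / candidate: dict mapping case name -> per-case mean score.
    """
    shared = sorted(baseline.keys() & candidate.keys())
    if len(shared) < require_cases:
        return "insufficient-cases"  # refuse to gate on too little data
    diffs = [candidate[c] - baseline[c] for c in shared]
    lo, hi = paired_bootstrap_ci(diffs)
    # Regression only if the whole interval lies below zero.
    return "regression" if hi < 0 else "within-noise"


# Made-up per-case means for six cases.
base = {f"c{i}": 1.0 for i in range(6)}
worse = dict(zip(base, [0.5, 0.4, 0.6, 0.5, 0.45, 0.55]))

clear_drop = verdict(base, worse)
no_change = verdict(base, dict(base))
too_few = verdict({"c0": 1.0}, {"c0": 0.0})
```

A wrapper script could map "regression" to a non-zero exit code; how the tool itself treats the insufficient-cases outcome is not specified here, so this sketch simply reports it as its own verdict.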