Skip to content

Judges dispatch by family, CIs are bootstrapped, and tool calls are scored

touchstone's statistics were already careful (unbiased pass@k, clustered SE, a paired significance test, SWE-bench gates, self-preference/contamination caveats). But two claims it made were not actually delivered, and an agentic benchmark was missing an obvious metric. This ADR settles three coupled decisions about judging and metrics.

Status

accepted (extends ADR 0001 — the Judge as a held-fixed control variable)

Context

  1. The cross-family jury was advertised but inert. model_judge documented a diverse jury as the fix for judge self-preference, and the report told users to use one — but every judge call routed through the Anthropic SDK, so non-Anthropic jurors were silently skipped. The recommended mitigation was a no-op for a single-provider list.
  2. The confidence intervals were the approximation the code distrusted. The report used a normal-approx CI explicitly commented "indicative, not exact" — and touchstone runs at the small case-counts (2–5) where that approximation over-claims most.
  3. Tool calls were asserted over but never scored. The trace grader could require/forbid/ budget tool usage, but there was no metric for "did it call the right tools — edit the right files, run the tests, with the right args, in the right order?" — the core agentic signal, and the Trace already carries it.

Decision

  • Judges dispatch by family behind one seam. judge_clients.call_judge(model, prompt) routes to the native provider SDK by model family (or an explicit provider:model): Anthropic, OpenAI (reusing the openai adapter's client conventions), Google. A missing SDK/key for any juror is a per-judge skip (the existing degrade contract), never a crash or a silent zero. The jury pooling and same-family self-preference flagging are unchanged — they now operate on a jury that is actually cross-family. The Judge remains a held-fixed control variable (ADR 0001).
  • Bootstrap CIs replace the normal approximation. A clustered bootstrap over per-case means is the report's headline interval (leaderboard score, paired significance, and the cross-run verdict of ADR 0011); the analytic SE is kept as a cheap secondary. Bootstrap is seeded, so intervals are reproducible.
  • Tool-call correctness is its own grader. A new tool_correctness grader scores precision/recall/F1 (with optional ordering) of observed-vs-expected tool calls over the Trace — separate from the assertion-only trace grader, and reading only the normalized Trace (name/raw_name/kind/input) so it is adapter-portable.
  • Two near-free credibility additions, both opt-in. The judge may see the trajectory (include_trace), not just the final artifact; and when a gold-label file is present the report states judge↔human agreement (Cohen's κ / Krippendorff's α) — the one validity check the tool wholly lacked.

Consequences

  • The self-preference mitigation the report recommends is now real; intervals stop over-claiming at small n; and an agentic benchmark can grade how a task was done, not only the end state.
  • Every addition is opt-in and degrades gracefully: existing cases validate and score unchanged, and a single-provider/offline host loses only the cross-family and κ extras.
  • New grader options ride GraderSpec's extra="allow", so no schema change was needed.
  • Deliberately not adopted (lower value or worse than the status quo): HELM mean-win-rate (deprecated; the paired test is better), calibration/perplexity (no logprobs in agentic runs), RAG retrieval metrics, and pairwise position-bias/Elo machinery (touchstone grades one transcript against an absolute rubric).
  • New glossary terms: Gold label (a human score for judge calibration) and Tool-call score.