Reachability preflight + per-case availability policy (fail-fast by default)

Cases pull their material from external git repos — source: {repo, commit} for the agent-visible upstream code and fixtures: {repo, commit} for the hidden oracle suites — and some of those repos are private. A run on a host that lacks access to one of them (no git auth, a private fork, or simply offline) must behave predictably. We add a cheap reachability preflight that probes each case's external git dependencies before the matrix runs, plus a policy that decides what to do with the unreachable ones: fail the whole run (the default) or degrade them to a new skipped cell status and continue.

Status

accepted

Why

The framework's open-source value proposition is bring your own private repos: a public engine + a public sample battery, plus each user's own private cases and a private fixtures repo holding their held-out oracle. That model guarantees a given host will routinely lack access to some referenced repos — a contributor's fork, a teammate's private fixtures, a CI box without SSH keys, a laptop on a plane.

Three problems with today's behaviour:

  • Late, ambiguous failure. Cloning an inaccessible repo raises a raw CalledProcessError deep in sandbox.prepare_sandbox / fixtures.fixtures_clone, mid-run, after other cells have already provisioned venvs and run agents. The error doesn't distinguish "I can't reach this" (environmental, host-specific) from "this case is broken" (a real defect).
  • validate gives false confidence. It is offline schema-only, so "validate passed" says nothing about whether a run can actually clone the repos it references.
  • A silently shrunk battery is dangerous. Cross-model comparisons (the leaderboard, the paired "is the winner real?" test) assume the same cases ran for every model. If a missing repo quietly drops cases, the verdict is corrupted. So the default must be loud (fail-fast); degrading must be an explicit, opt-in choice.

Design

Probe at access level, not commit level. git ls-remote <url> with a short timeout proves network + auth + repo existence in one cheap call, with no clone. The failure reason is classified from stderr: no-auth, no-network, or not-found. (GitHub returns 404 for a private repo you aren't authorized to see — to avoid leaking its existence — so not-found is treated as an access failure and is degradable.) Whether the pinned commit exists stays the clone's job: a wrong SHA is a defect that should surface loudly, not be silently skipped.
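The access-level probe described above can be sketched roughly as follows. This is a minimal illustration, not the contents of reachability.py: the function names, the stderr marker lists, the "unknown" fallback, and the choice to report a timeout as no-network are all assumptions made for the example, and real git and server messages vary.

```python
import os
import subprocess
from typing import Optional

# Illustrative stderr markers only (assumption); real messages differ by
# git version, transport, and hosting provider.
_NO_NETWORK = ("could not resolve host", "connection timed out",
               "network is unreachable", "connection refused")
_NO_AUTH = ("authentication failed", "permission denied",
            "could not read username", "terminal prompts disabled")
_NOT_FOUND = ("repository not found", "does not appear to be a git repository")


def classify_stderr(stderr: str) -> str:
    """Map git's stderr to one of the access-failure reasons."""
    text = stderr.lower()
    if any(marker in text for marker in _NO_NETWORK):
        return "no-network"
    if any(marker in text for marker in _NO_AUTH):
        return "no-auth"
    if any(marker in text for marker in _NOT_FOUND):
        return "not-found"
    return "unknown"  # fallback label is an assumption of this sketch


def probe(url: str, timeout: float = 10.0) -> Optional[str]:
    """Return None if the repo is reachable, else a failure reason."""
    # GIT_TERMINAL_PROMPT=0 makes git fail instead of prompting for credentials.
    env = dict(os.environ, GIT_TERMINAL_PROMPT="0")
    try:
        proc = subprocess.run(
            ["git", "ls-remote", url],
            capture_output=True, text=True, timeout=timeout, env=env,
        )
    except subprocess.TimeoutExpired:
        return "no-network"  # assumption: a timeout is reported as no-network
    if proc.returncode == 0:
        return None
    return classify_stderr(proc.stderr)
```

Note that the probe never looks at a commit: a reachable repo with a wrong pinned SHA still passes here and fails later at clone time, which is the intended split.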

What is probed. Each case's external git repos: source.repo (when the source is remote) and fixtures.repo. Probe results are cached by URL within a scan, so the 27 multi-language SWE-bench cases that share one fixtures repo — and the many sphinx/pylint cases that share one upstream — collapse to a handful of network calls. A case with only a local source: path (e.g. example-case) has nothing to probe and is always reachable — backward-compatible and offline-clean.
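A sketch of the per-scan URL cache, under stated assumptions: cases are represented here as plain dicts with a name key plus source and fixtures mappings, and scan, external_repos, and the probe callable are hypothetical names for illustration. The probe returns None for a reachable repo and a reason string otherwise.

```python
from typing import Callable, Dict, Iterable, List, Optional


def external_repos(case: dict) -> List[str]:
    """Collect the remote git URLs a case depends on (local paths have none)."""
    urls = []
    for key in ("source", "fixtures"):
        repo = (case.get(key) or {}).get("repo")
        if repo:
            urls.append(repo)
    return urls


def scan(cases: Iterable[dict],
         probe: Callable[[str], Optional[str]]) -> Dict[str, List[str]]:
    """Return {case name: [failure reasons]}; an empty list means reachable."""
    cache: Dict[str, Optional[str]] = {}  # one probe per distinct URL per scan
    report: Dict[str, List[str]] = {}
    for case in cases:
        reasons = []
        for url in external_repos(case):
            if url not in cache:
                cache[url] = probe(url)
            if cache[url] is not None:
                reasons.append(f"{url}: {cache[url]}")
        report[case["name"]] = reasons
    return report
```

Because the cache lives only for the duration of one scan, a later scan (for example on resume) probes every URL afresh.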

Per-case availability. A new availability: required | optional field (default required). optional marks a case that references an external you might not have; it degrades to skipped even under fail mode. This is how a public battery can ship a few "nice to have if you can reach them" cases without breaking a stranger's first run.

Policy --on-unavailable {fail, skip} (default fail).

  • fail — the preflight runs before the run store is created; if any required case is unreachable, the run aborts with a per-case reason list and a non-zero exit, having done no work. Optional-unreachable cases are skipped with a warning.
  • skip — every unreachable case (required or optional) is degraded to skipped and the run proceeds with what's reachable.
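The interaction between the per-case availability field and the run-level policy can be sketched as a small decision function. The name apply_policy and the two dict arguments are assumptions made for this example; only the availability values and the fail/skip policy names come from the design above.

```python
from typing import Dict, List, Tuple


def apply_policy(availability: Dict[str, str],
                 unreachable: Dict[str, str],
                 on_unavailable: str = "fail") -> Tuple[List[str], List[str]]:
    """Split unreachable cases into (blocking, skipped) case names.

    availability: case name -> "required" | "optional" (missing means required)
    unreachable:  case name -> failure reason, for unreachable cases only
    """
    if on_unavailable not in ("fail", "skip"):
        raise ValueError(f"unknown policy: {on_unavailable!r}")
    blocking: List[str] = []
    skipped: List[str] = []
    for name in sorted(unreachable):
        required = availability.get(name, "required") == "required"
        if required and on_unavailable == "fail":
            blocking.append(name)   # aborts the run before any work is done
        else:
            skipped.append(name)    # degraded to the skipped cell status
    return blocking, skipped


availability = {"pub": "required", "priv": "optional", "team": "required"}
unreachable = {"priv": "no-auth", "team": "not-found"}
print(apply_policy(availability, unreachable, "fail"))  # (['team'], ['priv'])
print(apply_policy(availability, unreachable, "skip"))  # ([], ['priv', 'team'])
```

A non-empty blocking list corresponds to the abort path: the caller would print the per-case reasons and exit non-zero without creating the run store.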

New terminal cell status skipped, distinct from failed. Skipped cells are excluded from the benchmark score, the leaderboard, the pass-rate denominator, and the failures drill-down; they get their own "Skipped (unavailable)" report section with the reason. The status is persisted to result.json, so --resume does not retry skipped cells on its own. The preflight does re-probe on resume, however, so a cell skipped because of a transient outage is attempted again automatically once access returns.
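To make the denominator rule concrete, here is a minimal sketch of a pass-rate calculation that leaves skipped cells out. The cell dict shape and the passed flag are hypothetical; the actual result.json schema is not described here, and only the status values done, failed, and skipped come from the design.

```python
from typing import Iterable, Optional


def pass_rate(cells: Iterable[dict]) -> Optional[float]:
    """Pass rate over scored cells; skipped cells are not in the denominator."""
    scored = [c for c in cells if c["status"] != "skipped"]
    if not scored:
        return None  # nothing ran, so there is no rate to report
    # Assumption: a cell passes when it finished ("done") and its hypothetical
    # "passed" flag is true; failed cells stay in the denominator.
    passed = sum(1 for c in scored if c["status"] == "done" and c.get("passed"))
    return passed / len(scored)
```

With one passing cell, one non-passing cell, one failed cell, and one skipped cell, the rate is computed over three cells, not four.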

validate --check-access (network opt-in) runs schema validation and then produces a reachability report, so you can see exactly what a run would skip or fail on without running it. Plain validate stays offline and fast; its contract is unchanged.

Open-source ergonomics. TOUCHSTONE_FIXTURES_REPO env override repoints the default fixtures repo, so a fork can point at its own private fixtures without editing every case.
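One way the override could resolve is sketched below. The function name, the placeholder default URL, and the precedence rule (a repo set explicitly on a case wins over the environment variable) are assumptions of this example; the design above only states that the variable repoints the default.

```python
import os
from typing import Mapping, Optional

# Placeholder value for illustration; not the project's real default.
DEFAULT_FIXTURES_REPO = "https://example.com/org/fixtures.git"


def resolve_fixtures_repo(case_repo: Optional[str],
                          environ: Mapping[str, str] = os.environ) -> str:
    """Pick the fixtures repo URL for a case."""
    if case_repo:
        # Assumption: an explicit per-case repo is left untouched.
        return case_repo
    # An unset or empty variable falls back to the built-in default.
    return environ.get("TOUCHSTONE_FIXTURES_REPO") or DEFAULT_FIXTURES_REPO
```

Passing the environment as a parameter keeps the function easy to exercise without mutating the real process environment.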

Unavailable vs broken (the load-bearing distinction)

The preflight only ever swallows access failures — environmental, host-specific, the kind a different machine would not hit. Schema errors, bad isolation config, a missing local source.path, a wrong commit — these are defects and still fail loudly even under skip mode, because they're caught by validation and the clone, not by the reachability probe. Skip shrinks the battery only for reasons that are about this host, never for reasons that are about the case.

Considered options

  • Skip-by-default (warn + continue). Friendlier for a stranger's first clone, but it silently shrinks the battery and corrupts cross-model comparisons the moment a repo is missing on a host you expected to be complete. Rejected as the default; offered as the --on-unavailable skip opt-in. (Chosen with the user: fail by default, flag to degrade.)
  • Probe the exact pinned commit (fetch by SHA). More precise, but heavier (a real fetch) and servers vary on allowing arbitrary-SHA fetches. Commit correctness is already enforced loudly by the clone, so access-level probing is the right granularity for "do I have access".
  • No preflight; catch the clone error and classify it at run time. Saves one network round trip, but failures surface late, after other cells have already done real work, so fail-fast can no longer abort before any work has been spent. A cheap upfront scan is what makes both clean skipping and honest fail-fast possible.
  • A full host-doctor (also probe container images, harness binaries, API keys). Valuable, but broader than the ask. The git-repo axis is what actually blocks bring-your-own-private-repos; reachability.py is structured so image/harness/key probes can be added later.

Consequences

  • A run makes a few extra git ls-remote calls up front (cached by URL) — negligible next to cloning and provisioning.
  • The default is safe by construction: a missing repo on a host you expected to be complete is a loud, early failure, not a quietly smaller benchmark.
  • Open-sourcing is unblocked: ship engine + a public battery (all required, reachable by anyone with network), and let users add private cases marked optional (or run with --on-unavailable skip), repointing fixtures via TOUCHSTONE_FIXTURES_REPO.
  • One new cell status (skipped) threads through the store, runner, and report; legacy runs (only done/failed) render unchanged.