Reachability preflight + per-case availability policy (fail-fast by default)¶
Cases pull their material from external git repos — source: {repo, commit} for the
agent-visible upstream code and fixtures: {repo, commit} for the hidden oracle suites —
and some of those repos are private. A run on a host that lacks access to one of them
(no git auth, a private fork, or simply offline) must behave predictably. We add a cheap
reachability preflight that probes each case's external git dependencies before the
matrix runs, plus a policy that decides what to do with the unreachable ones: fail the
whole run (the default) or degrade them to a new skipped cell status and continue.
Status¶
accepted
Why¶
The framework's open-source value proposition is bring your own private repos: a public engine + a public sample battery, plus each user's own private cases and a private fixtures repo holding their held-out oracle. That model guarantees a given host will routinely lack access to some referenced repos — a contributor's fork, a teammate's private fixtures, a CI box without SSH keys, a laptop on a plane.
Three problems with today's behaviour:
- Late, ambiguous failure. An inaccessible repo throws a raw CalledProcessError deep in sandbox.prepare_sandbox / fixtures.fixtures_clone, mid-run, after other cells have already provisioned venvs and run agents. The error doesn't distinguish "I can't reach this" (environmental, host-specific) from "this case is broken" (a real defect).
- validate gives false confidence. It is offline and schema-only, so "validate passed" says nothing about whether a run can actually clone the repos it references.
- A silently shrunk battery is dangerous. Cross-model comparisons (the leaderboard, the paired "is the winner real?" test) assume the same cases ran for every model. If a missing repo quietly drops cases, the verdict is corrupted. So the default must be loud (fail-fast); degrading must be an explicit, opt-in choice.
Design¶
Probe at access level, not commit level. git ls-remote <url> with a short timeout proves
network + auth + repo existence in one cheap call, with no clone. The failure reason is
classified from stderr: no-auth, no-network, or not-found. (GitHub returns 404 for a
private repo you aren't authorized to see — to avoid leaking its existence — so not-found is
treated as an access failure and is degradable.) Whether the pinned commit exists stays the
clone's job: a wrong SHA is a defect that should surface loudly, not be silently skipped.
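The access-level probe described above might look roughly like the following. This is a minimal sketch, not the contents of reachability.py: the function names, the 10-second timeout, and the stderr fragments used for classification are all assumptions, and real git messages vary by version and host.

```python
import os
import subprocess
from typing import Optional

# Assumed stderr fragments: real messages differ across git versions and hosts.
NO_AUTH_MARKERS = ("authentication failed", "permission denied", "could not read username")
NOT_FOUND_MARKERS = ("repository not found", "does not appear to be a git repository")


def classify_failure(stderr: str) -> str:
    """Map git's stderr to one of: no-auth, not-found, no-network."""
    text = stderr.lower()
    if any(marker in text for marker in NO_AUTH_MARKERS):
        return "no-auth"
    if any(marker in text for marker in NOT_FOUND_MARKERS):
        return "not-found"
    # Assumption: anything else (DNS failure, refused connection) is no-network.
    return "no-network"


def probe(url: str, timeout: float = 10.0) -> Optional[str]:
    """Return None if the repo is reachable, otherwise a failure reason."""
    # GIT_TERMINAL_PROMPT=0 stops git from blocking on a credential prompt.
    env = dict(os.environ, GIT_TERMINAL_PROMPT="0")
    try:
        result = subprocess.run(
            ["git", "ls-remote", url],
            capture_output=True, text=True, timeout=timeout, env=env,
        )
    except subprocess.TimeoutExpired:
        return "no-network"
    if result.returncode == 0:
        return None
    return classify_failure(result.stderr)
```

Nothing here inspects the pinned commit, which keeps a wrong SHA out of the degradable path.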
What is probed. Each case's external git repos: source.repo (when the source is remote)
and fixtures.repo. Probe results are cached by URL within a scan, so the 27
multi-language SWE-bench cases that share one fixtures repo — and the many sphinx/pylint cases
that share one upstream — collapse to a handful of network calls. A case with only a local
source: path (e.g. example-case) has nothing to probe and is always reachable —
backward-compatible and offline-clean.
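The per-URL caching can be sketched as below. The case dictionaries mirror the source: {repo, commit} and fixtures: {repo, commit} shapes mentioned earlier; the "id" key, the function names, and the injected probe callable are hypothetical and exist only for illustration.

```python
from typing import Callable, Dict, Iterable, List, Optional


def case_repo_urls(case: dict) -> List[str]:
    """Collect the external git URLs a case depends on."""
    urls = []
    source = case.get("source", {})
    if "repo" in source:  # a local source has only a path, so nothing to probe
        urls.append(source["repo"])
    fixtures = case.get("fixtures", {})
    if "repo" in fixtures:
        urls.append(fixtures["repo"])
    return urls


def scan(
    cases: Iterable[dict],
    probe: Callable[[str], Optional[str]],
) -> Dict[str, Dict[str, str]]:
    """Return, per case id, the unreachable URLs and their failure reasons."""
    cache: Dict[str, Optional[str]] = {}  # one probe per distinct URL per scan
    report: Dict[str, Dict[str, str]] = {}
    for case in cases:
        failures = {}
        for url in case_repo_urls(case):
            if url not in cache:
                cache[url] = probe(url)
            reason = cache[url]
            if reason is not None:
                failures[url] = reason
        report[case["id"]] = failures  # an empty dict means the case is reachable
    return report
```

Passing the probe in as a parameter also makes the scan easy to exercise offline with a fake.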
Per-case availability. A new availability: required | optional field (default
required). optional marks a case that references an external you might not have; it
degrades to skipped even under fail mode. This is how a public battery can ship a few
"nice to have if you can reach them" cases without breaking a stranger's first run.
Policy --on-unavailable {fail, skip} (default fail).
- fail: the preflight runs before the run store is created; if any required case is unreachable, the run aborts with a per-case reason list and a non-zero exit, having done no work. Optional-unreachable cases are skipped with a warning.
- skip: every unreachable case (required or optional) is degraded to skipped and the run proceeds with what's reachable.
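How the availability field and the --on-unavailable policy combine can be expressed as a small decision function. This is a sketch under assumptions: the function name and the dictionary-based inputs are hypothetical, and the real runner may structure this differently.

```python
from typing import Dict, Tuple


def apply_policy(
    unreachable: Dict[str, str],   # case id -> failure reason
    availability: Dict[str, str],  # case id -> "required" | "optional"
    on_unavailable: str = "fail",
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split unreachable cases into (blocking, skipped)."""
    blocking: Dict[str, str] = {}
    skipped: Dict[str, str] = {}
    for case_id, reason in unreachable.items():
        # Cases that do not declare availability default to required.
        required = availability.get(case_id, "required") == "required"
        if on_unavailable == "fail" and required:
            blocking[case_id] = reason
        else:
            skipped[case_id] = reason
    return blocking, skipped
```

A caller would abort with a non-zero exit when blocking is non-empty, before creating the run store, and otherwise mark every entry in skipped as a skipped cell.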
New terminal cell status skipped, distinct from failed. Skipped cells are excluded
from the benchmark score, the leaderboard, the pass-rate denominator, and the failures
drill-down; they get their own "Skipped (unavailable)" report section with the reason. The
status is persisted to result.json, so --resume doesn't retry them — but the preflight
re-probes on resume, so a skip caused by a transient outage is reattempted automatically
once access returns.
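The effect on the pass-rate denominator can be illustrated with a short sketch. The cell shape is an assumption: the "status" values come from this document, while the boolean "passed" field and the function name are hypothetical.

```python
from typing import Iterable, Optional


def pass_rate(cells: Iterable[dict]) -> Optional[float]:
    """Pass rate over scored cells; skipped cells never enter the denominator."""
    scored = [cell for cell in cells if cell["status"] != "skipped"]
    if not scored:
        return None  # nothing ran, so there is no rate to report
    passed = sum(1 for cell in scored if cell.get("passed"))
    return passed / len(scored)


cells = [
    {"status": "done", "passed": True},
    {"status": "done", "passed": True},
    {"status": "failed", "passed": False},
    {"status": "skipped", "reason": "no-auth"},
]
rate = pass_rate(cells)  # 2 passed out of 3 scored cells
```

Treating skipped as failed would instead report 2 of 4, penalising a model for a repo the host could not reach.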
validate --check-access (network opt-in) runs schema validation and a reachability
report, so you can see exactly what a run would skip or fail on without running it. Plain
validate stays offline and fast — its contract is unchanged.
Open-source ergonomics. TOUCHSTONE_FIXTURES_REPO env override repoints the default
fixtures repo, so a fork can point at its own private fixtures without editing every case.
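One way the override could resolve is sketched below. Only the TOUCHSTONE_FIXTURES_REPO variable name comes from this document; the placeholder default URL, the function name, and the precedence rule (an explicit per-case fixtures.repo wins over the environment) are assumptions.

```python
import os
from typing import Mapping

# Placeholder only: the real default fixtures URL is not stated in this document.
DEFAULT_FIXTURES_REPO = "https://example.invalid/fixtures.git"


def resolve_fixtures_repo(
    case_fixtures: Mapping[str, str],
    env: Mapping[str, str] = os.environ,
) -> str:
    """Pick the fixtures repo URL for one case."""
    # Assumption: a repo named explicitly by the case takes precedence.
    if "repo" in case_fixtures:
        return case_fixtures["repo"]
    # Otherwise the env override repoints the default for every case at once.
    return env.get("TOUCHSTONE_FIXTURES_REPO", DEFAULT_FIXTURES_REPO)
```

With this shape, a fork exports the variable once and leaves the case files untouched.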
Unavailable vs broken (the load-bearing distinction)¶
The preflight only ever swallows access failures — environmental, host-specific, the kind
a different machine would not hit. Schema errors, bad isolation config, a missing local
source.path, a wrong commit — these are defects and still fail loudly even under skip
mode, because they're caught by validation and the clone, not by the reachability probe. Skip
shrinks the battery only for reasons that are about this host, never for reasons that are
about the case.
Considered options¶
- Skip-by-default (warn + continue). Friendlier for a stranger's first clone, but it
silently shrinks the battery and corrupts cross-model comparisons the moment a repo is
missing on a host you expected to be complete. Rejected as the default; offered as the
--on-unavailable skip opt-in. (Chosen with the user: fail by default, flag to degrade.)
- Probe the exact pinned commit (fetch by SHA). More precise, but heavier (a real fetch), and servers vary on allowing arbitrary-SHA fetches. Commit correctness is already enforced loudly by the clone, so access-level probing is the right granularity for "do I have access".
- No preflight; catch the clone error and classify it at run time. Saves one network round trip, but failures surface late, after other cells have done real work, so fail-fast can't abort before time and compute have been spent. A cheap upfront scan is what makes both clean skipping and honest fail-fast possible.
- A full host-doctor (also probe container images, harness binaries, API keys). Valuable,
but broader than the ask. The git-repo axis is what actually blocks
bring-your-own-private-repos; reachability.py is structured so image/harness/key probes
can be added later.
Consequences¶
- A run makes a few extra git ls-remote calls up front (cached by URL), negligible next to cloning and provisioning.
- The default is safe by construction: a missing repo on a host you expected to be complete is a loud, early failure, not a quietly smaller benchmark.
- Open-sourcing is unblocked: ship engine + a public battery (all required, reachable by anyone with network), and let users add private cases marked optional (or run with --on-unavailable skip), repointing fixtures via TOUCHSTONE_FIXTURES_REPO.
- One new cell status (skipped) threads through the store, runner, and report; legacy runs (only done/failed) render unchanged.