Case fixtures in a separate repo, split by visibility (source vs hidden)

Bulky, hand-written case assets — synthetic codebases to debug and the hidden/ oracle test suites — live in a separate, commit-pinned fixtures repo (krimvp/touchstone-eval-fixtures) instead of in the eval tree. The eval repo keeps only each case's contract (task, graders, expectations); the fixtures repo holds the code. Within the fixtures repo, each case has one directory split by visibility:

<case-id>/
  source/   # agent-VISIBLE input  → promoted into the cell sandbox before the agent runs
  hidden/   # grader ORACLE         → injected at grade time only; the agent never sees it

A case wires the two halves with two independent, commit-pinned pointers:

source:   {repo: krimvp/touchstone-eval-fixtures, commit: <sha>, subdir: <case-id>/source}
fixtures: {repo: krimvp/touchstone-eval-fixtures, commit: <sha>}   # subdir defaults to <case-id>
graders:
  - {type: pytest, inject: ["./hidden/test_x.py"]}             # resolved under <case-id>/hidden/
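
The defaulting and path resolution described above can be sketched in a few lines of Python. This is a minimal illustration only: FixturesPointer and resolve_inject are hypothetical names, not the classes in config.py, and the traversal check is an added assumption rather than documented behaviour.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Optional


@dataclass(frozen=True)
class FixturesPointer:
    """Hypothetical model of the `fixtures:` block (not the real config.py class)."""
    repo: str
    commit: str
    subdir: Optional[str] = None

    def effective_subdir(self, case_id: str) -> str:
        # Per the ADR, subdir defaults to the case id when omitted.
        return self.subdir if self.subdir is not None else case_id


def resolve_inject(pointer: FixturesPointer, case_id: str, inject: str) -> PurePosixPath:
    """Map an `inject:` string to a path relative to the fixtures repo root."""
    rel = PurePosixPath(inject)
    # Assumption: reject paths that would escape the case directory.
    if rel.is_absolute() or ".." in rel.parts:
        raise ValueError(f"inject path must stay inside the case directory: {inject}")
    return PurePosixPath(pointer.effective_subdir(case_id)) / rel
```

With no explicit subdir, an inject string such as ./hidden/test_x.py for a case called my-case would land at my-case/hidden/test_x.py inside the fixtures repo.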

Status

accepted

Why

The harder/diverse battery (ADR-era plan 0002-harder-diverse-battery) checked ~33k lines of case material straight into evals/ — synthetic codebases and, mostly, the hidden oracle test suites (~25k lines across 87 files). That pollutes the runner/eval repo and conflates the definition of a case with its bulky assets. Moving the assets out lets the eval repo stay a lean set of definitions, and lets cases pull their material like any other pinned git dependency — the same model the SWE-bench cases already use for upstream source.

The visibility boundary

source/ and hidden/ are not interchangeable, and the directory split is what enforces it physically:

  RUN TIME                                       GRADE TIME (agent has stopped)
  ─────────                                      ─────────────────────────────
  source.subdir = <case-id>/source               fixtures + inject: ./hidden/test_x.py
        │ clone @commit, promote                        │ clone @commit (cached),
        ▼ (hidden/ is a sibling — NOT promoted)         ▼ copy <case-id>/hidden/* INTO sandbox
   ┌── Sandbox ─────────────┐                     ┌── Sandbox ─────────────┐
   │ source files only      │  ◀── agent edits    │ source + injected oracle│ ─▶ pytest ─▶ Score
   └────────────────────────┘                     └────────────────────────┘

If the agent could see the oracle it would optimise against it (hardcode outputs, edit the tests), and the FAIL→PASS score would measure memorisation, not capability: exactly the ~1.0 saturation the battery was built to break. Because only <case-id>/source/ is promoted and <case-id>/hidden/ is a sibling pulled separately at grade time, the oracle is never in the sandbox while the agent runs.
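
The run-time half of the boundary can be illustrated with a small, self-contained sketch. It is not the code in sandbox.py: promote_subdir and demo are hypothetical helpers, and a temporary directory stands in for the real checkout and sandbox. It shows that promoting only <case-id>/source leaves the sibling hidden/ directory behind.

```python
import shutil
import tempfile
from pathlib import Path


def promote_subdir(checkout: Path, subdir: str, sandbox: Path) -> None:
    """Copy only `checkout/subdir` into `sandbox`, skipping any .git directory."""
    src = (checkout / subdir).resolve()
    if not src.is_dir() or checkout.resolve() not in src.parents:
        raise ValueError(f"subdir must be a directory inside the checkout: {subdir}")
    shutil.copytree(src, sandbox, ignore=shutil.ignore_patterns(".git"), dirs_exist_ok=True)


def demo() -> list[str]:
    """Build a fake checkout with source/ and hidden/, promote source/, list the sandbox."""
    with tempfile.TemporaryDirectory() as tmp:
        checkout = Path(tmp) / "checkout"
        (checkout / "my-case" / "source").mkdir(parents=True)
        (checkout / "my-case" / "hidden").mkdir(parents=True)
        (checkout / "my-case" / "source" / "app.py").write_text("def f(): return 1\n")
        (checkout / "my-case" / "hidden" / "test_x.py").write_text("def test_f(): pass\n")
        sandbox = Path(tmp) / "sandbox"
        promote_subdir(checkout, "my-case/source", sandbox)
        # Only files that came from source/ should be present.
        return sorted(p.relative_to(sandbox).as_posix() for p in sandbox.rglob("*"))
```

Calling demo() lists the sandbox contents after promotion; the oracle file under hidden/ is not among them because it was never inside the promoted sub-directory.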

Mechanics

  • source.subdir (new): clone isolation clones the repo, checks out the commit, then promotes just the named sub-directory into the sandbox (no .git, mirroring copy). One fixtures repo can therefore hold many cases. See sandbox.py.
  • fixtures: {repo, commit, subdir?} (new): names the repo holding grading assets; subdir defaults to the case id. Case.asset(rel) resolves an inject: path against a host-cached clone keyed by (repo, commit). The clone is immutable, so it is reused across cells and runs, and its creation is serialized. See config.py / fixtures.py. Without the block, asset paths resolve case-locally, which keeps existing cases backward compatible.
  • Graders are unchanged except that TestsGrader._inject (shared by tests/pytest/swebench) resolves via Case.asset instead of Case.resolve; inject: path strings are unchanged (./hidden/...).
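
The caching and fallback behaviour in the list above might look roughly like the following. This is a sketch under stated assumptions, not the contents of fixtures.py: cache_dir, ensure_clone and asset are hypothetical names, the clone step is injected as a fetch callable instead of invoking git, and a process-local lock stands in for whatever serialization the real code uses.

```python
import hashlib
import threading
from pathlib import Path
from typing import Callable, Optional

_CREATE_LOCK = threading.Lock()  # assumption: serializes cache creation within one process


def cache_dir(root: Path, repo: str, commit: str) -> Path:
    """Directory for an immutable (repo, commit) clone; same key, same path."""
    key = hashlib.sha256(f"{repo}@{commit}".encode()).hexdigest()[:16]
    return root / key


def ensure_clone(root: Path, repo: str, commit: str,
                 fetch: Callable[[str, str, Path], None]) -> Path:
    """Populate the cache once; later calls reuse it without calling `fetch`."""
    dest = cache_dir(root, repo, commit)
    with _CREATE_LOCK:
        if not dest.exists():
            staging = dest.with_name(dest.name + ".tmp")
            staging.mkdir(parents=True, exist_ok=True)
            fetch(repo, commit, staging)  # e.g. clone + checkout, supplied by the caller
            staging.rename(dest)          # publish only a fully populated clone
    return dest


def asset(case_dir: Path, case_id: str, rel: str,
          fixtures_clone: Optional[Path] = None, subdir: Optional[str] = None) -> Path:
    """Resolve an inject path: fixtures clone if configured, else case-local."""
    if fixtures_clone is None:
        return case_dir / rel  # no fixtures block: backward-compatible local lookup
    return fixtures_clone / (subdir or case_id) / rel
```

Keying the cache on (repo, commit) is safe because a pinned commit never changes, so a populated directory can be reused by every cell and run on the host. Populating a staging directory and renaming it is one way to avoid exposing a half-written clone.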

Considered options

  • Keep everything in evals/ (status quo). Simple, offline, but pollutes the repo and couples definitions to bulky assets; rejected as the suite grew.
  • Move only the agent-visible source/, keep hidden/ local. Removes only ~2.6k of the ~28k lines; the oracle suites are the bulk. Rejected as half a solution.
  • One fixtures repo per codebase. No framework change, but many repos to create/pin. Rejected for management overhead; a single repo with per-case subdirs (via source.subdir + fixtures) is cleaner.
  • Merge source/ and hidden/ into one promoted tree. Would leak the oracle into the sandbox and break held-out grading. Rejected — it defeats the benchmark's purpose.

Consequences

  • The eval repo drops ~28k lines of hardcoded assets; cases become definitions that pull code.
  • Cases that pull from the fixtures repo need network + (private-repo) git auth on the host at run time — the same trade the SWE-bench cases already make for upstream source.
  • The fixtures repo must stay private for anti-memorization cases.
  • evals/example-case/ deliberately stays local (source: path) as the offline worked example and integration fixture.