Case fixtures in a separate repo, split by visibility (source vs hidden)¶
Bulky, hand-written case assets — synthetic codebases to debug and the hidden/ oracle
test suites — live in a separate, commit-pinned fixtures repo
(krimvp/touchstone-eval-fixtures) instead of in the eval tree. The eval repo keeps only each
case's contract (task, graders, expectations); the fixtures repo holds the code. Within the
fixtures repo, each case has one directory split by visibility:
<case-id>/
  source/   # agent-VISIBLE input → promoted into the cell sandbox before the agent runs
  hidden/   # grader ORACLE → injected at grade time only; the agent never sees it
A case wires the two halves with two independent, commit-pinned pointers:
source: {repo: krimvp/touchstone-eval-fixtures, commit: <sha>, subdir: <case-id>/source}
fixtures: {repo: krimvp/touchstone-eval-fixtures, commit: <sha>} # subdir defaults to <case-id>
graders:
- {type: pytest, inject: ["./hidden/test_x.py"]} # resolved under <case-id>/hidden/
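A minimal sketch of how the two pointers and an inject: path could combine into a path inside the fixtures repo. The names RepoPointer, fixtures_subdir and inject_path_in_repo are hypothetical and do not come from the touchstone code base; only the defaulting rule (subdir falls back to the case id) and the ./hidden/... path shape are taken from the text above.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Optional


@dataclass(frozen=True)
class RepoPointer:
    """Hypothetical model of a `source:` or `fixtures:` block."""
    repo: str
    commit: str
    subdir: Optional[str] = None


def fixtures_subdir(pointer: RepoPointer, case_id: str) -> str:
    # Per the ADR, `fixtures.subdir` defaults to the case id when omitted.
    return pointer.subdir if pointer.subdir is not None else case_id


def inject_path_in_repo(pointer: RepoPointer, case_id: str, inject: str) -> PurePosixPath:
    """Map an `inject:` string to a repo-relative path under the case directory."""
    rel = PurePosixPath(inject)  # "./hidden/test_x.py" normalizes to "hidden/test_x.py"
    # Assumption: absolute paths and ".." are rejected so an asset cannot
    # escape its case directory. The real resolver may behave differently.
    if rel.is_absolute() or ".." in rel.parts:
        raise ValueError(f"inject path must stay inside the case directory: {inject}")
    return PurePosixPath(fixtures_subdir(pointer, case_id)) / rel


if __name__ == "__main__":
    ptr = RepoPointer(repo="krimvp/touchstone-eval-fixtures", commit="<sha>")
    print(inject_path_in_repo(ptr, "my-case", "./hidden/test_x.py"))
```

The two pointers stay independent here: the source pointer is never consulted when resolving a grader asset.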
Status¶
accepted
Why¶
The harder/diverse battery (ADR-era plan 0002-harder-diverse-battery) checked ~28k lines of
case material straight into evals/ — synthetic codebases and, mostly, the hidden oracle test
suites (~25k lines across 87 files). That pollutes the runner/eval repo and conflates the
definition of a case with its bulky assets. Moving the assets out lets the eval repo stay a
lean set of definitions, and lets cases pull their material like any other pinned git
dependency — the same model the SWE-bench cases already use for upstream source.
The visibility boundary¶
source/ and hidden/ are not interchangeable, and the directory split is what enforces
it physically:
RUN TIME                                      GRADE TIME (agent has stopped)
─────────                                     ─────────────────────────────
source.subdir = <case-id>/source              fixtures + inject: ./hidden/test_x.py
   │ clone @commit, promote                      │ clone @commit (cached),
   ▼ (hidden/ is a sibling, NOT promoted)        ▼ copy <case-id>/hidden/* INTO sandbox
┌── Sandbox ─────────────┐                    ┌── Sandbox ───────────────┐
│ source files only      │ ◀── agent edits    │ source + injected oracle │ ─▶ pytest ─▶ Score
└────────────────────────┘                    └──────────────────────────┘
If the agent could see the oracle it would optimise against it (hardcode outputs, edit the
tests) and the FAIL→PASS score would measure memorisation, not capability — exactly the ~1.0
saturation the battery was built to break. Because only <case-id>/source/ is promoted and
<case-id>/hidden/ is a sibling pulled separately at grade time, the oracle can never leak.
Mechanics¶
- source.subdir (new): clone isolation clones the repo, checks out the commit, then promotes just the named sub-directory to the sandbox (no .git, mirroring copy). One fixtures repo holds many cases. See sandbox.py.
- fixtures: {repo, commit, subdir?} (new): names the repo holding grading assets; subdir defaults to the case id. Case.asset(rel) resolves an inject: path against a host-cached clone keyed by (repo, commit) (immutable, so reused across cells/runs and serialized on creation). See config.py / fixtures.py. Absent the block, asset paths resolve case-local, which keeps existing cases backward compatible.
- Graders are unchanged except that TestsGrader._inject (shared by tests/pytest/swebench) resolves via Case.asset instead of Case.resolve; inject: path strings are unchanged (./hidden/...).
Considered options¶
- Keep everything in evals/ (status quo). Simple, offline, but pollutes the repo and couples definitions to bulky assets; rejected as the suite grew.
- Move only the agent-visible source/, keep hidden/ local. Removes only ~2.6k of the ~28k lines; the oracle suites are the bulk. Rejected as half a solution.
- One fixtures repo per codebase. No framework change, but many repos to create/pin. Rejected for management overhead; a single repo with per-case subdirs (via source.subdir and fixtures) is cleaner.
- Merge source/ and hidden/ into one promoted tree. Would leak the oracle into the sandbox and break held-out grading. Rejected: it defeats the benchmark's purpose.
Consequences¶
- The eval repo drops ~28k lines of hardcoded assets; cases become definitions that pull code.
- Cases that pull from the fixtures repo need network + (private-repo) git auth on the host at run time — the same trade the SWE-bench cases already make for upstream source.
- The fixtures repo must stay private for anti-memorization cases.
- evals/example-case/ deliberately stays local (source: path) as the offline worked example and integration fixture.