Authoring cases
A Case is one eval, defined by a single case.yaml in its own directory under evals/:
evals/my-task/
├── case.yaml # the eval definition
├── source/ # (optional) files copied into the sandbox
└── graders/ # (optional) rubrics, expected files, injected tests
Run it by id:
A minimal case
The smallest useful case: a prompt, a source folder, a harness/model matrix, and one grader.
id: my-task
description: Add retry logic to the HTTP client.
task:
  prompt: |
    Add retry with exponential backoff to the HTTP client in client.py.
    Keep the public function signature unchanged and add a `max_retries` parameter.
source:
  path: ./source # copied fresh into every cell's sandbox
matrix:
  entries:
    - {harness: openai, models: ["gpt-4o-mini", "llama3.1"]}
graders:
  - type: files
    weight: 1.0
    patterns: ["retry", "backoff"]
expect:
  pass_threshold: 1.0
A full, observed case
A realistic case adds isolation, a per-cell environment, observation, and several graders. This mirrors the bundled real-repo cases:
id: repo-ordinal
description: Reimplement the `ordinal` method to match the library's behavior.
task:
  prompt: |
    In inflect/__init__.py, the `ordinal` method body was replaced with
    `raise NotImplementedError`; reimplement it to match the established behavior.
  # turns: ["...follow-up prompt..."] # optional extra Turns (needs a tracing harness)
source:
  repo: jaraco/inflect # git clone, pinned for reproducibility
  commit: 262a247d2d99a47a520cdb2d46adb90df88b4326
  # path: ./source # OR a local folder (mutually exclusive with repo)
  # subdir: pkg/lib # use only a sub-tree of the cloned repo
container: # optional OS-level isolation (docker)
  image: python:3.12-slim
  caches: [".cache/pip"]
  # harness: true # also run the agent inside the sandbox (ADR 0014)
  # backend: docker # docker (default) | harbor (remote sandboxes, ADR 0015)
  # harbor_runtime: daytona # (harbor) docker | daytona | modal | e2b | runloop | gke
  # env_passthrough: ["OPENAI_API_KEY"] # host secrets to forward in for an in-sandbox harness
environment: # optional per-cell venv + deps
  kind: pip-venv # pip-venv (default) | uv | command
  requirements: ["pytest", "more-itertools"]
  # install: editable # also install the sandbox repo itself
setup: # introduce the task state
  stub: # blank a function body (Python AST)
    - {file: inflect/__init__.py, function: ordinal}
  run:
    - "rm -rf .git"
matrix:
  entries:
    - {harness: openai, models: ["deepseek-v3.1:cloud"]}
    - {harness: claude-code-stream, models: ["claude-opus-4-8"]}
observe: # opt into Tracing / Interaction
  tracing: true
  interaction:
    policy: auto-approve # auto-approve | auto-deny | scripted | llm-based | manual
fixtures: # hidden assets (e.g. graded tests) from a private repo
  repo: krimvp/touchstone-eval-fixtures
  commit: 2c0e6b80849b45b165536064bd4ebfb056e27f32
graders:
  - {type: pytest, weight: 4.0, inject: ["./hidden/test_ordinal.py"]}
  - {type: implemented, gate: true, files: ["inflect/__init__.py"]}
  - {type: trace, weight: 1.0, require_tools: [{kind: write}], require_no_denied: true}
  - {type: efficiency, weight: 2.0, target_cost_usd: 0.30, target_tokens: 2000, target_tool_calls: 20}
expect:
  pass_threshold: 1.0
availability: required # required (default) | optional
Field reference
task
- `prompt`: the first Turn sent to the agent (required).
- `turns`: additional eval-initiated follow-up prompts, sent one at a time after each stop. Multi-turn requires a Tracing-capable harness.
source: where the sandbox comes from
Set exactly one of:
- `path`: a local folder, copied into the sandbox (`copy` isolation).
- `repo` + `commit`: a git repo cloned at a pinned commit (`clone` isolation). Add `subdir` to use only a sub-tree, or set isolation to `worktree` for a local repo.
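The "exactly one of" rule can be sketched as a small check over the parsed `source` block. This is only an illustration, not touchstone's own validator: the function name `source_kind` is hypothetical, and treating `commit` as mandatory whenever `repo` is set is an assumption drawn from the pinned-commit wording above.

```python
def source_kind(source: dict) -> str:
    """Return "path" or "repo" for a case's source block, or raise ValueError.

    Hypothetical helper: the real schema check may differ.
    """
    has_path = "path" in source
    has_repo = "repo" in source
    # Exactly one of the two keys may be present.
    if has_path == has_repo:
        raise ValueError("source must set exactly one of 'path' or 'repo'")
    # Assumption: a repo source is only reproducible with a pinned commit.
    if has_repo and "commit" not in source:
        raise ValueError("source.repo needs a pinned 'commit'")
    return "repo" if has_repo else "path"


print(source_kind({"path": "./source"}))
print(source_kind({"repo": "jaraco/inflect",
                   "commit": "262a247d2d99a47a520cdb2d46adb90df88b4326"}))
```

For authoritative checking, `touchstone validate` remains the tool to use.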
container: OS-level isolation (optional)
Runs the Cell's work inside a Docker container, with the cell bind-mounted at the same path as on the host.
image is required; caches lists paths to persist between cells. Requires a running Docker daemon.
Provisioning and graders always run in the container. To also run the agent inside it, set:
- `harness: true`: run the Harness in the sandbox, not on the host (ADR 0014). A CLI harness (`claude-code`, an ACP/`harnesses.yaml` agent) needs its CLI baked into the image; the in-process `openai` harness needs nothing baked, because its loop stays in the controller and only its effects (bash, file writes) are routed into the sandbox.
- `env_passthrough`: host env vars (e.g. `["ANTHROPIC_API_KEY"]`) to forward into the sandbox so the in-sandbox agent can reach its API key. Only the named vars cross.
- `backend`: the sandbox provider: `docker` (default, a local container) or `harbor`, which runs the Cell in a Harbor sandbox behind the same Executor seam, unlocking remote runtimes via `harbor_runtime` (`docker`/`daytona`/`modal`/`e2b`/`runloop`/`gke`). Harbor needs `pip install touchstone-eval[harbor]` (Python ≥3.12); set `harbor_factory: "pkg.module:callable"` to construct the environment for your Harbor version.
See ADR 0005, ADR 0014, and ADR 0015.
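The "only the named vars cross" behavior of `env_passthrough` amounts to an allow-list filter over the host environment. A minimal sketch, assuming a hypothetical helper `passthrough_env` and assuming that names missing on the host are silently skipped (the real behavior for missing names is not documented here):

```python
import os


def passthrough_env(names, host_env=None):
    """Return only the allow-listed variables from the host environment."""
    host_env = os.environ if host_env is None else host_env
    # Assumption: a name that is not set on the host is skipped, not an error.
    return {name: host_env[name] for name in names if name in host_env}


host = {"ANTHROPIC_API_KEY": "sk-example", "HOME": "/home/me", "AWS_SECRET": "x"}
sandbox_env = passthrough_env(["ANTHROPIC_API_KEY"], host)
print(sorted(sandbox_env))  # only the named variable is forwarded
```

An allow-list, rather than a deny-list, keeps unrelated host secrets out of the sandbox by default.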
environment: a per-cell dependency setup (optional)
A throwaway venv (pip-venv/uv) or project-local install (command). Keeps
dependency-bearing cells reproducible and parallel-safe. See
ADR 0004.
setup: introduce the task state
- `stub`: blank a function/method body (keeps signature + docstring) so the model must restore it.
- `run`: shell commands run in the sandbox before the agent starts (e.g. `rm -rf .git`).
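To make the `stub` step concrete, here is a rough sketch of blanking a function body with Python's standard `ast` module. It is not touchstone's implementation: `stub_function` is a hypothetical name, and using `raise NotImplementedError` as the replacement body is an assumption based on the example prompt above. Note that `ast.unparse` (Python 3.9+) discards comments and original formatting, which a production tool would likely want to preserve.

```python
import ast


def stub_function(source: str, function: str) -> str:
    """Blank every function or method named `function`, keeping signature and docstring."""
    tree = ast.parse(source)
    found = False
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == function:
            found = True
            body = []
            if ast.get_docstring(node, clean=False) is not None:
                body.append(node.body[0])  # a docstring is always the first statement
            # Assumption: the stubbed body raises NotImplementedError.
            body.append(ast.Raise(exc=ast.Name(id="NotImplementedError", ctx=ast.Load()), cause=None))
            node.body = body
    if not found:
        raise LookupError(f"no function named {function!r}")
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


original = '''
def ordinal(num):
    """Return the ordinal form of num."""
    return f"{num}th"
'''
print(stub_function(original, "ordinal"))
```

Matching by name alone means same-named methods in different classes are all stubbed; a stricter version would also take a class or qualified name.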
matrix: what to compare
Use either the cross-product form (harnesses: [...] × models: [...]) or `entries`, which pairs
models with each harness. `entries` is preferred because a model is only valid for some harnesses.
`trials` repeats each cell to measure consistency or pass@k.
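The two matrix forms expand to the same kind of cell list. The sketch below shows one plausible expansion; `expand_matrix` is a hypothetical name, and it assumes `trials` sits inside the `matrix` block and defaults to 1, neither of which is confirmed by this page.

```python
from itertools import product


def expand_matrix(matrix: dict) -> list[tuple[str, str, int]]:
    """Expand a matrix block into (harness, model, trial) cells."""
    trials = matrix.get("trials", 1)  # assumption: trials defaults to 1
    if "entries" in matrix:
        # entries form: each harness lists only the models valid for it
        pairs = [(e["harness"], m) for e in matrix["entries"] for m in e["models"]]
    else:
        # cross-product form: every harness is paired with every model
        pairs = list(product(matrix["harnesses"], matrix["models"]))
    return [(h, m, t) for h, m in pairs for t in range(1, trials + 1)]


cells = expand_matrix({
    "trials": 2,
    "entries": [
        {"harness": "openai", "models": ["deepseek-v3.1:cloud"]},
        {"harness": "claude-code-stream", "models": ["claude-opus-4-8"]},
    ],
})
print(len(cells))  # 2 harness/model pairs x 2 trials = 4
```

The cross-product form would also generate pairs such as an `openai` harness with a Claude-only model, which is why the per-harness pairing is the safer default.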
observe: observation & interaction
tracing: true captures a Trace; interaction.policy answers the agent's
mid-run requests. Absent this block, the cell is output-only.
fixtures: hidden, contamination-proof assets
A private repo holding graded tests or expected outputs, fetched at grade time so they never sit in the sandbox the model sees. See ADR 0007.
graders & expect
One or more graders; each contributes a weighted score. The cell passes iff the
combined weighted (non-skipped) score ≥ expect.pass_threshold. A grader with gate: true
must pass for the cell to pass at all.
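The pass rule can be sketched as follows. This is an interpretation, not the tool's actual scoring code: it assumes the combined score is a weighted mean over non-skipped graders, that each grader score lies in 0.0 to 1.0, that a gate "passes" only at a score of 1.0, and that a skipped gate is ignored. `GraderResult` and `cell_passes` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class GraderResult:
    score: float          # assumed range: 0.0 to 1.0
    weight: float = 0.0   # gate-only graders may carry no weight
    gate: bool = False
    skipped: bool = False


def cell_passes(results: list[GraderResult], pass_threshold: float) -> bool:
    active = [r for r in results if not r.skipped]
    # A failed gate fails the cell regardless of the weighted score.
    if any(r.gate and r.score < 1.0 for r in active):
        return False
    total_weight = sum(r.weight for r in active)
    if total_weight == 0:
        return True  # assumption: with no weighted grader, gates alone decide
    combined = sum(r.weight * r.score for r in active) / total_weight
    return combined >= pass_threshold


results = [
    GraderResult(score=1.0, weight=4.0),   # e.g. pytest
    GraderResult(score=1.0, gate=True),    # e.g. implemented
    GraderResult(score=0.5, weight=1.0),   # e.g. trace
]
print(cell_passes(results, pass_threshold=1.0))  # weighted mean is 4.5 / 5
print(cell_passes(results, pass_threshold=0.8))
```

With `pass_threshold: 1.0`, as in both example cases, every weighted grader has to score fully for the cell to pass under these assumptions.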
availability
required (default) — an unreachable source aborts the run under the fail policy.
optional — the cell degrades to skipped instead. See
ADR 0008.
Validate before you run
touchstone validate # schema-check every case.yaml
touchstone validate --check-access # also probe external repos are reachable
The private held-out set
Public benchmarks leak into training data. Keep your sharpest, never-committed tasks under
evals-private/ (git-ignored) and run them with --evals-dir: