Authoring cases

A Case is one eval, defined by a single case.yaml in its own directory under evals/:

evals/my-task/
├── case.yaml            # the eval definition
├── source/              # (optional) files copied into the sandbox
└── graders/             # (optional) rubrics, expected files, injected tests

Run it by id:

touchstone run --eval my-task --harness openai --with-model <model>

A minimal case

The smallest useful case: a prompt, a source folder, a harness/model matrix, and one grader.

id: my-task
description: Add retry logic to the HTTP client.

task:
  prompt: |
    Add retry with exponential backoff to the HTTP client in client.py.
    Keep the public function signature unchanged and add a `max_retries` parameter.

source:
  path: ./source            # copied fresh into every cell's sandbox

matrix:
  entries:
    - {harness: openai, models: ["gpt-4o-mini", "llama3.1"]}

graders:
  - type: files
    weight: 1.0
    patterns: ["retry", "backoff"]

expect:
  pass_threshold: 1.0

A full, observed case

A realistic case adds isolation, a per-cell environment, observation, and several graders. This mirrors the bundled real-repo cases:

id: repo-ordinal
description: Reimplement the `ordinal` method to match the library's behavior.

task:
  prompt: |
    In inflect/__init__.py, the `ordinal` method body was replaced with
    `raise NotImplementedError` — reimplement it to match the established behavior.
  # turns: ["...follow-up prompt..."]   # optional extra Turns (needs a tracing harness)

source:
  repo: jaraco/inflect                  # git clone, pinned for reproducibility
  commit: 262a247d2d99a47a520cdb2d46adb90df88b4326
  # path: ./source                      # OR a local folder (mutually exclusive with repo)
  # subdir: pkg/lib                      # use only a sub-tree of the cloned repo

container:                              # optional OS-level isolation (docker)
  image: python:3.12-slim
  caches: [".cache/pip"]
  # harness: true                       # also run the agent inside the sandbox (ADR 0014)
  # backend: docker                     # docker (default) | harbor (remote sandboxes, ADR 0015)
  # harbor_runtime: daytona             # (harbor) docker | daytona | modal | e2b | runloop | gke
  # env_passthrough: ["OPENAI_API_KEY"] # host secrets to forward in for an in-sandbox harness

environment:                            # optional per-cell venv + deps
  kind: pip-venv                        # pip-venv (default) | uv | command
  requirements: ["pytest", "more-itertools"]
  # install: editable                   # also install the sandbox repo itself

setup:                                  # introduce the task state
  stub:                                 # blank a function body (Python AST)
    - {file: inflect/__init__.py, function: ordinal}
  run:
    - "rm -rf .git"

matrix:
  entries:
    - {harness: openai, models: ["deepseek-v3.1:cloud"]}
    - {harness: claude-code-stream, models: ["claude-opus-4-8"]}

observe:                                # opt into Tracing / Interaction
  tracing: true
  interaction:
    policy: auto-approve                # auto-approve | auto-deny | scripted | llm-based | manual

fixtures:                               # hidden assets (e.g. graded tests) from a private repo
  repo: krimvp/touchstone-eval-fixtures
  commit: 2c0e6b80849b45b165536064bd4ebfb056e27f32

graders:
  - {type: pytest, weight: 4.0, inject: ["./hidden/test_ordinal.py"]}
  - {type: implemented, gate: true, files: ["inflect/__init__.py"]}
  - {type: trace, weight: 1.0, require_tools: [{kind: write}], require_no_denied: true}
  - {type: efficiency, weight: 2.0, target_cost_usd: 0.30, target_tokens: 2000, target_tool_calls: 20}

expect:
  pass_threshold: 1.0

availability: required                  # required (default) | optional

Field reference

task

  • prompt — the first Turn sent to the agent (required).
  • turns — additional eval-initiated follow-up prompts, sent one at a time after each stop. Multi-turn requires a Tracing-capable harness.

source — where the sandbox comes from

Set exactly one of:

  • path — a local folder, copied into the sandbox (copy isolation).
  • repo + commit — a git repo cloned at a pinned commit (clone isolation). Add subdir to use only a sub-tree, or set isolation to worktree for a local repo.
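
The "exactly one of" rule can be checked mechanically. The sketch below is a hypothetical validator, not touchstone's own schema check: the function name `validate_source`, its error messages, and the requirement that `repo` always comes with `commit` are assumptions for illustration. It treats `source` as a plain dict loaded from case.yaml.

```python
def validate_source(source: dict) -> str:
    """Return the isolation mode implied by a `source` block.

    Hypothetical helper: mirrors the rule described above, not the tool's
    real validation code.
    """
    has_path = "path" in source
    has_repo = "repo" in source

    # Both set or neither set: the block is ambiguous or empty.
    if has_path == has_repo:
        raise ValueError("set exactly one of `path` or `repo`")

    # Assumption: a repo source is always pinned to a commit.
    if has_repo and "commit" not in source:
        raise ValueError("`repo` needs a pinned `commit`")

    return "copy" if has_path else "clone"


print(validate_source({"path": "./source"}))
print(validate_source({
    "repo": "jaraco/inflect",
    "commit": "262a247d2d99a47a520cdb2d46adb90df88b4326",
}))
```

In practice `touchstone validate` is the place to catch this; the sketch only shows the shape of the check.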

container — OS-level isolation (optional)

Runs the Cell's work inside a Docker container, with the cell bind-mounted at the same path it has on the host. image is required; caches lists paths to persist between cells. Requires a running Docker daemon. Provisioning and graders always run in the container. To also run the agent inside it, set:

  • harness: true — run the Harness in the sandbox, not on the host (ADR 0014). A CLI harness (claude-code, an ACP/harnesses.yaml agent) needs its CLI baked into the image; the in-process openai harness needs nothing baked — its loop stays in the controller and only its effects (bash, file writes) are routed into the sandbox.
  • env_passthrough — host env vars (e.g. ["ANTHROPIC_API_KEY"]) to forward into the sandbox so the in-sandbox agent can reach its API key. Only the named vars cross.
  • backend — the sandbox provider: docker (default, a local container) or harbor, which runs the Cell in a Harbor sandbox behind the same Executor seam, unlocking remote runtimes via harbor_runtime (docker/daytona/modal/e2b/runloop/gke). Harbor needs pip install touchstone-eval[harbor] (Python ≥3.12); set harbor_factory: "pkg.module:callable" to construct the environment for your Harbor version.

See ADR 0005, ADR 0014, and ADR 0015.
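
To make "only the named vars cross" concrete, here is a minimal sketch of an allow-list filter over the host environment. The helper name `passthrough_env` is hypothetical, and the choice to silently skip names that are unset on the host is an assumption; the tool may instead warn or fail.

```python
import os


def passthrough_env(names, host_env=None):
    """Return only the allow-listed variables from the host environment.

    Hypothetical helper. Names missing on the host are skipped (assumption).
    """
    host_env = os.environ if host_env is None else host_env
    return {name: host_env[name] for name in names if name in host_env}


# Example host environment; the values are placeholders.
host = {
    "ANTHROPIC_API_KEY": "sk-example",
    "AWS_SECRET_ACCESS_KEY": "not-forwarded",
}
print(passthrough_env(["ANTHROPIC_API_KEY"], host))
```

Everything outside the list stays on the host, which is the point of naming the variables explicitly rather than forwarding the whole environment.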

environment — a per-cell dependency setup (optional)

Sets up a throwaway venv (pip-venv or uv) or a project-local install (command), which keeps dependency-bearing cells reproducible and parallel-safe. See ADR 0004.

setup — introduce the task state

  • stub — blank a function/method body (keeps signature + docstring) so the model must restore it.
  • run — shell commands run in the sandbox before the agent starts (e.g. rm -rf .git).
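
The stub step can be approximated with Python's standard `ast` module. This is a sketch of the idea, not touchstone's implementation: `stub_function` is a hypothetical name, and `ast.unparse` (Python 3.9+) drops comments and original formatting, which a real tool would likely preserve. The replacement body, `raise NotImplementedError`, matches what the example prompt above describes.

```python
import ast


def stub_function(source: str, name: str) -> str:
    """Blank the body of every function or method called `name`.

    Keeps the signature and docstring; the rest of the body becomes
    `raise NotImplementedError`. Hypothetical helper for illustration.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        is_func = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        if is_func and node.name == name:
            new_body = []
            if ast.get_docstring(node) is not None:
                new_body.append(node.body[0])  # keep the docstring statement
            new_body.append(ast.parse("raise NotImplementedError").body[0])
            node.body = new_body
    return ast.unparse(tree)


original = '''
class Engine:
    def ordinal(self, n):
        """Return the ordinal form of n."""
        return str(n) + "th"
'''
print(stub_function(original, "ordinal"))
```

Matching on the bare function name is a simplification; if two classes define the same method name, both are stubbed.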

matrix — what to compare

Either the cross-product form (harnesses: [...] × models: [...]) or entries that pair models per-harness (preferred — a model is only valid for some harnesses). trials repeats each cell for consistency / pass@k.
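
The two matrix forms expand to the same kind of cell list. The sketch below shows one plausible expansion; `expand_matrix` is a hypothetical name, and placing `trials` inside the `matrix` block with a default of 1 is an assumption rather than documented behavior.

```python
from itertools import product


def expand_matrix(matrix: dict) -> list[tuple[str, str, int]]:
    """Expand a `matrix` block into (harness, model, trial) cells.

    Hypothetical helper; key placement and defaults are assumptions.
    """
    trials = matrix.get("trials", 1)
    if "entries" in matrix:
        # Per-harness pairing: each entry lists only the models valid for it.
        pairs = [(e["harness"], m) for e in matrix["entries"] for m in e["models"]]
    else:
        # Cross-product form: every harness is combined with every model.
        pairs = list(product(matrix["harnesses"], matrix["models"]))
    return [(h, m, t) for h, m in pairs for t in range(1, trials + 1)]


paired = {
    "entries": [
        {"harness": "openai", "models": ["deepseek-v3.1:cloud"]},
        {"harness": "claude-code-stream", "models": ["claude-opus-4-8"]},
    ],
    "trials": 3,
}
for cell in expand_matrix(paired):
    print(cell)
```

With the paired form above, two harness/model pairs and three trials give six cells. The cross-product form with the same two harnesses and two models would give twelve, including pairings a harness may not support, which is why entries is preferred.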

observe — observation & interaction

tracing: true captures a Trace; interaction.policy sets how the agent's mid-run requests are answered (for example, auto-approve or auto-deny). Without this block, the cell is output-only.

fixtures — hidden, contamination-proof assets

A private repo holding graded tests or expected outputs, fetched at grade time so they never sit in the sandbox the model sees. See ADR 0007.

graders & expect

One or more graders; each contributes a weighted score. The cell passes iff the combined weighted (non-skipped) score ≥ expect.pass_threshold. A grader with gate: true must pass for the cell to pass at all.
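
The pass rule can be sketched as a weighted mean with a gate veto. This is one reading of the description above, not the tool's actual scoring code: the names `GraderResult` and `cell_passes` are hypothetical. The sketch also assumes that scores are normalized to 0.0..1.0, that the combined score is the weight-normalized mean, that a gate "passes" at a score of 1.0, and that a cell with no weighted results fails.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GraderResult:
    score: Optional[float]  # normalized 0.0..1.0; None means the grader was skipped
    weight: float = 1.0
    gate: bool = False


def cell_passes(results: list[GraderResult], pass_threshold: float) -> bool:
    """Apply the gate veto, then compare the weighted mean to the threshold."""
    active = [r for r in results if r.score is not None]  # drop skipped graders

    # A failing gate vetoes the cell regardless of the weighted score.
    if any(r.gate and r.score < 1.0 for r in active):
        return False

    total_weight = sum(r.weight for r in active)
    if total_weight == 0:
        return False  # assumption: nothing to score means no pass

    combined = sum(r.weight * r.score for r in active) / total_weight
    return combined + 1e-9 >= pass_threshold  # small tolerance for float error


results = [
    GraderResult(score=1.0, weight=4.0),             # e.g. pytest
    GraderResult(score=1.0, weight=0.0, gate=True),  # e.g. implemented
    GraderResult(score=0.5, weight=1.0),             # e.g. trace
    GraderResult(score=None, weight=2.0),            # skipped, so excluded
]
print(cell_passes(results, pass_threshold=0.9))
print(cell_passes(results, pass_threshold=1.0))
```

Here the combined score is (4.0 × 1.0 + 1.0 × 0.5) / 5.0 = 0.9, so the cell clears a threshold of 0.9 but not 1.0. With pass_threshold: 1.0, as in both example cases, every non-skipped weighted grader has to score in full.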

availability

required (default) — an unreachable source aborts the run under the fail policy. optional — the cell degrades to skipped instead. See ADR 0008.

Validate before you run

touchstone validate                 # schema-check every case.yaml
touchstone validate --check-access  # also probe that external repos are reachable

The private held-out set

Public benchmarks leak into training data. Keep your sharpest, never-committed tasks under evals-private/ (git-ignored) and run them with --evals-dir:

touchstone --evals-dir evals-private run --harness openai --with-model <model>