Concepts

touchstone fixes a small vocabulary so the framework, its docs, and its cases all mean the same thing by the same word. This page is the working glossary; the canonical source is CONTEXT.md in the repo.

The one idea

Hold everything constant except the model.

A real eval is a controlled experiment. The harness, the judge, and the responder are control variables — fixed across the whole comparison — so a difference in the report reflects a difference in the model, not the scaffolding around it. Every concept below exists to make that control possible.

Benchmark structure

Run ──expands──▶ many Cells ──each has──▶ one Harness · one Model · one Trial
 │                                            │
 └─ produces a Report                         └─ realized by one Adapter

| Term | Meaning |
| --- | --- |
| Case | One eval: a task plus the input, artifacts, graders, and expectations needed to judge it. Lives in a case.yaml. |
| Run | A single execution of the benchmark that expands a matrix and produces a report. |
| Cell | The atomic unit of work and persistence: one (Case × Harness × Model × Trial). |
| Trial | One repeated attempt at the same (Case × Harness × Model) coordinates, used to measure consistency and pass@k. |
| Matrix | The axes to compare. Models are paired per-harness (entries: [{harness, models}]), because a model is only meaningful relative to the harness that runs it. |
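
To make the expansion concrete, here is a minimal Python sketch of how a Matrix could expand into Cells. The `Cell` dataclass, the `expand` function, and the harness and model names are hypothetical illustrations, not touchstone's actual API; only the per-harness pairing of models and the (Case × Harness × Model × Trial) shape come from the definitions above.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Cell:
    """One (Case x Harness x Model x Trial) coordinate. Hypothetical type."""
    case: str
    harness: str
    model: str
    trial: int


def expand(cases, entries, trials):
    """Expand a matrix into Cells.

    Models are paired with their harness entry rather than crossed
    with every harness, mirroring entries: [{harness, models}].
    """
    cells = []
    for entry in entries:
        for case, model, trial in product(cases, entry["models"], range(trials)):
            cells.append(Cell(case, entry["harness"], model, trial))
    return cells


# Placeholder names, chosen only for illustration.
entries = [
    {"harness": "harness-a", "models": ["model-x", "model-y"]},
    {"harness": "harness-b", "models": ["model-x"]},
]
cells = expand(["case-1", "case-2"], entries, trials=3)
print(len(cells))  # 2 cases * 3 harness/model pairs * 3 trials = 18
```

Because `model-y` is listed only under `harness-a`, no Cell pairs it with `harness-b`.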

The swappable pieces

| Term | Meaning |
| --- | --- |
| Harness | The swappable thing that turns a Case's task into an output, behind one interface. Every Harness is realized by exactly one Adapter. |
| Grader | A component that turns a Harness's result into a Score. A Case can have several; they combine per the Case's pass_threshold. |
| Judge | The fixed auxiliary LLM used by the model-as-judge grader. A control variable, held constant across the matrix. |
| Responder | The fixed auxiliary LLM that answers an agent's mid-run questions under the llm-based interaction policy. Also a control variable. |
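
As an illustration of several Graders feeding one verdict, the sketch below assumes that Scores lie in [0, 1] and that a Case passes when their mean reaches pass_threshold. That combining rule is an assumption made for this example; this page does not specify how touchstone combines Scores. The `case_passes` function and the `exact-match` grader name are likewise hypothetical.

```python
def case_passes(scores, pass_threshold):
    """Return True when the mean grader Score reaches pass_threshold.

    ASSUMPTION: averaging is only one plausible combining rule;
    the real framework may combine Scores differently.
    """
    if not scores:
        raise ValueError("a Case needs at least one grader score")
    return sum(scores.values()) / len(scores) >= pass_threshold


# Hypothetical grader names mapped to Scores in [0, 1].
scores = {"exact-match": 1.0, "model-as-judge": 0.5}
print(case_passes(scores, pass_threshold=0.7))  # mean is 0.75
```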

Observation & interaction

A Harness can do more than return final text. Its Adapter can declare two opt-in capabilities, Tracing and Interaction; the table below also defines the terms those capabilities rely on.

| Term | Meaning |
| --- | --- |
| Trace | The normalized, vendor-neutral event stream captured from a run: messages, tool calls, token usage, permission events. The framework's own schema, never an external protocol's types. |
| Tracing | A Harness's ability to emit a Trace. Harnesses that lack it degrade to output-only. |
| Interaction | A Harness's ability to let the framework answer the agent's mid-run requests (tool permission / approval / input). Strictly richer than Tracing: you can't answer what you can't observe. |
| Tool Kind | The portable category of a tool call (read · write · execute · search · fetch · other), the one tool axis that means the same across agents. |
| Interaction Policy | The per-Case rule that answers agent-initiated requests: auto-approve · auto-deny · scripted · llm-based · manual. |

See Observation & interaction for the Trace event types and the policies.
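
The sketch below shows one way an Interaction Policy could answer a tool-permission request. The enum values come from the table above; everything else is an assumption for illustration: the `answer` function, the idea that a scripted policy is a mapping from Tool Kind to a decision, and the `ask` callback standing in for the Responder (llm-based) or a human (manual).

```python
from enum import Enum


class ToolKind(Enum):
    READ = "read"
    WRITE = "write"
    EXECUTE = "execute"
    SEARCH = "search"
    FETCH = "fetch"
    OTHER = "other"


class Policy(Enum):
    AUTO_APPROVE = "auto-approve"
    AUTO_DENY = "auto-deny"
    SCRIPTED = "scripted"
    LLM_BASED = "llm-based"
    MANUAL = "manual"


def answer(policy, kind, script=None, ask=None):
    """Decide whether to approve a tool call of the given kind.

    ASSUMPTIONS: `script` maps ToolKind to bool and unlisted kinds
    are denied; `ask` is a callable that delegates the decision to
    an external decider (an LLM or a person).
    """
    if policy is Policy.AUTO_APPROVE:
        return True
    if policy is Policy.AUTO_DENY:
        return False
    if policy is Policy.SCRIPTED:
        return (script or {}).get(kind, False)
    if ask is None:
        raise ValueError(f"{policy.value} needs an external decider")
    return bool(ask(kind))


script = {ToolKind.READ: True, ToolKind.EXECUTE: False}
print(answer(Policy.SCRIPTED, ToolKind.READ, script=script))   # True
print(answer(Policy.SCRIPTED, ToolKind.WRITE, script=script))  # False
```

Keying the script on Tool Kind rather than on tool names keeps the same policy meaningful across agents, which is the stated purpose of that axis.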

Execution & isolation

| Term | Meaning |
| --- | --- |
| Sandbox | The isolated working directory a Cell's Harness operates in, prepared fresh from the Case source. Never shared between Cells. |
| Isolation Mode | How a Sandbox is created: copy (a folder), clone (git at a commit), or worktree (git worktree at a commit). |
| Environment | A Cell's own throwaway dependency setup: a per-Cell virtualenv (pip-venv/uv) or project-local install (command), so dependency-bearing Cells stay reproducible and parallel-safe. |
| Executor | Where a Cell's non-Harness commands run: LocalExecutor (host subprocesses) or ContainerExecutor (docker exec, for OS-level isolation and OS packages). |
| Reachability / Availability | A preflight that probes a Case's external git repos before the Run does work, then applies the availability policy (fail aborts on an unreachable required Case; skip/optional degrade it to skipped). |
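
The three Isolation Modes can be sketched with the standard library and plain git commands. The `prepare_sandbox` function and its parameters are hypothetical, and the exact git invocations are an assumption about how each mode might be realized, not touchstone's implementation. The demo exercises only the copy mode, so it runs without git.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def prepare_sandbox(mode, source, dest, commit=None):
    """Create a fresh Sandbox at `dest` from the Case source. Hypothetical helper."""
    source, dest = Path(source), Path(dest)
    if mode == "copy":
        # copytree refuses an existing dest, so a Sandbox is never reused.
        shutil.copytree(source, dest)
        return dest
    if mode not in ("clone", "worktree"):
        raise ValueError(f"unknown isolation mode: {mode!r}")
    if commit is None:
        raise ValueError(f"{mode} mode needs a commit")
    if mode == "clone":
        subprocess.run(["git", "clone", str(source), str(dest)], check=True)
        subprocess.run(["git", "-C", str(dest), "checkout", "--detach", commit], check=True)
    else:
        subprocess.run(
            ["git", "-C", str(source), "worktree", "add", "--detach", str(dest), commit],
            check=True,
        )
    return dest


with tempfile.TemporaryDirectory() as tmp:
    case_src = Path(tmp) / "case"
    case_src.mkdir()
    (case_src / "task.txt").write_text("fix the bug\n")
    cell_0 = prepare_sandbox("copy", case_src, Path(tmp) / "cell-0")
    cell_1 = prepare_sandbox("copy", case_src, Path(tmp) / "cell-1")
    (cell_0 / "task.txt").write_text("edited by cell 0\n")
    print((cell_1 / "task.txt").read_text().strip())  # fix the bug
```

An edit made inside one Cell's Sandbox leaves the other Cell's copy untouched, which is what lets Cells run in parallel.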

Conversation

| Term | Meaning |
| --- | --- |
| Turn | One eval-initiated prompt sent to the agent within a Cell. The first Turn is the Case's task; later Turns are scripted follow-ups. |
| Conversation | The ordered Turns of a Case, each sent once the agent's previous Turn reaches a stop. A single-prompt Case is a one-Turn Conversation. |
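
A Conversation reduces to a simple loop: send a Turn, wait for the agent to stop, send the next. In this sketch `run_conversation` and the `send` callable are hypothetical; `send` is assumed to block until the agent reaches a stop and then return its reply.

```python
def run_conversation(turns, send):
    """Send each Turn in order, waiting for the agent to stop between Turns.

    ASSUMPTION: `send(prompt)` blocks until the agent reaches a stop
    and returns the agent's reply as a string.
    """
    transcript = []
    for index, prompt in enumerate(turns):
        reply = send(prompt)
        transcript.append({"turn": index, "prompt": prompt, "reply": reply})
    return transcript


def fake_agent(prompt):
    """Stand-in agent that echoes the prompt in upper case."""
    return prompt.upper()


# The first Turn is the Case's task; the second is a scripted follow-up.
transcript = run_conversation(["fix the bug", "now add a test"], fake_agent)
print([entry["reply"] for entry in transcript])
# ['FIX THE BUG', 'NOW ADD A TEST']
```

A single-prompt Case is the same loop with a one-element `turns` list.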

How they relate

  • A Run expands into many Cells; each Cell has one Case, one Harness, one Model, and one Trial index.
  • A Harness is realized by exactly one Adapter, which declares its Tracing and Interaction capabilities.
  • A Tracing-capable Harness produces a Trace per Cell, alongside the raw transcript. Interaction implies Tracing, not vice-versa.
  • Graders may read the final output, the Trace, or both. A trace-dependent grader on an output-only Harness is a hard failure for that Cell.
  • Each Cell gets its own Sandbox (and, if declared, its own Environment), so Cells run in parallel without contention and a crash is resumable.
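
Two of the rules above can be checked mechanically: Interaction implies Tracing, and a trace-dependent grader on an output-only Harness fails the Cell. The sketch below is a hypothetical illustration; `Capabilities`, `CellFailure`, `check_graders`, and the grader names are not touchstone's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capabilities:
    """Capabilities an Adapter declares for its Harness. Hypothetical type."""
    tracing: bool = False
    interaction: bool = False

    def __post_init__(self):
        # You can't answer what you can't observe.
        if self.interaction and not self.tracing:
            raise ValueError("Interaction implies Tracing")


class CellFailure(Exception):
    """Hard failure for a single Cell."""


def check_graders(caps, graders):
    """Raise CellFailure if any grader needs a Trace the Harness cannot emit.

    `graders` maps a grader name to whether it depends on the Trace.
    """
    for name, needs_trace in graders.items():
        if needs_trace and not caps.tracing:
            raise CellFailure(f"grader {name!r} needs a Trace, but the Harness is output-only")


output_only = Capabilities()
check_graders(output_only, {"final-text": False})  # passes silently
try:
    check_graders(output_only, {"tool-calls": True})
except CellFailure as exc:
    print(exc)
```

Validating the pairing before the Cell runs surfaces the mismatch as an explicit failure instead of a silently empty Score.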