Concepts

touchstone fixes a small vocabulary so the framework, its docs, and its cases all mean the same thing by the same word. This page is the working glossary; the canonical source is CONTEXT.md in the repo.

The one idea

Hold everything constant except the model.

A real eval is a controlled experiment. The harness, the judge, and the responder are control variables — fixed across the whole comparison — so a difference in the report reflects a difference in the model, not the scaffolding around it. Every concept below exists to make that control possible.

Benchmark structure

Run ──expands──▶ many Cells ──each has──▶ one Harness · one Model · one Trial
 │                                            │
 └─ produces a Report                         └─ realized by one Adapter

Term	Meaning
Case	One eval — a task plus the input, artifacts, graders, and expectations needed to judge it. Lives in a `case.yaml`.
Run	A single execution of the benchmark that expands a matrix and produces a report.
Cell	The atomic unit of work and persistence: one `(Case × Harness × Model × Trial)`.
Trial	One repeated attempt at the same `(Case × Harness × Model)` coordinates, used to measure consistency and pass@k.
Matrix	The axes to compare. Models are paired per-harness (`entries: [{harness, models}]`), because a model is only meaningful relative to the harness that runs it.
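
As a rough illustration of how a Matrix might expand into Cells, the sketch below pairs models per harness, as the `entries: [{harness, models}]` shape suggests. The `Cell` dataclass, the `expand` function, and every case, harness, and model name are hypothetical, chosen for this example rather than taken from touchstone's code.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Cell:
    """One (Case x Harness x Model x Trial) coordinate."""
    case: str
    harness: str
    model: str
    trial: int


def expand(cases, entries, trials):
    """Expand a matrix into Cells.

    `entries` mirrors the `entries: [{harness, models}]` shape: each
    harness lists only the models it is paired with.
    """
    cells = []
    for case in cases:
        for entry in entries:
            for model, trial in product(entry["models"], range(trials)):
                cells.append(Cell(case, entry["harness"], model, trial))
    return cells


# Hypothetical names, for illustration only.
entries = [
    {"harness": "harness-a", "models": ["model-x", "model-y"]},
    {"harness": "harness-b", "models": ["model-x"]},
]
cells = expand(["case-1", "case-2"], entries, trials=3)
print(len(cells))  # 2 cases * (2 + 1) pairings * 3 trials = 18
```

Because models are listed under each harness, the expansion never produces a pairing the matrix did not ask for: `harness-b` is run with `model-x` only.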

The swappable pieces

Term	Meaning
Harness	The swappable thing that turns a Case's task into an output, behind one interface. Every Harness is an Adapter.
Grader	A component that turns a Harness's result into a Score. A Case can have several; they combine per the Case's `pass_threshold`.
Judge	The fixed auxiliary LLM used by the model-as-judge grader. A control variable, held constant across the matrix.
Responder	The fixed auxiliary LLM that answers an agent's mid-run questions under the `llm-based` interaction policy. Also a control variable.
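
This page does not spell out how several Scores combine against `pass_threshold`, so the sketch below is only one plausible reading: it assumes each Score is a number in [0, 1] and that a Case passes when the unweighted mean reaches the threshold. The `Score` dataclass, `case_passes`, and the grader names are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Score:
    grader: str
    value: float  # assumed to lie in [0.0, 1.0]


def case_passes(scores, pass_threshold):
    """Assumed rule: the unweighted mean of all Scores must reach the threshold."""
    if not scores:
        raise ValueError("a Case needs at least one Score to be judged")
    return mean(s.value for s in scores) >= pass_threshold


# Hypothetical grader names; the mean here is 0.75.
scores = [Score("exact-match", 1.0), Score("model-as-judge", 0.5)]
print(case_passes(scores, pass_threshold=0.7))  # True
print(case_passes(scores, pass_threshold=0.8))  # False
```

Other combination rules (all graders must pass, weighted means) are just as plausible; CONTEXT.md in the repo is the place to confirm which one applies.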

Observation & interaction

A Harness can do more than return final text. Two opt-in capabilities, declared by its Adapter:

Term	Meaning
Trace	The normalized, vendor-neutral event stream captured from a run — messages, tool calls, token usage, permission events. The framework's own schema, never an external protocol's types.
Tracing	A Harness's ability to emit a Trace. Harnesses that lack it degrade to output-only.
Interaction	A Harness's ability to let the framework answer the agent's mid-run requests (tool permission / approval / input). Strictly richer than Tracing — you can't answer what you can't observe.
Tool Kind	The portable category of a tool call (`read · write · execute · search · fetch · other`) — the one tool axis that means the same across agents.
Interaction Policy	The per-Case rule that answers agent-initiated requests: `auto-approve · auto-deny · scripted · llm-based · manual`.

See Observation & interaction for the Trace event types and the policies.
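
To make the two vocabularies above concrete, the following sketch models Tool Kind and Interaction Policy as enums and answers a tool-permission request under the three policies that need no outside help. The enum values come from the tables above; the class names, `answer_permission`, and the shape of the script (a mapping from Tool Kind to a canned decision) are assumptions made for illustration.

```python
from enum import Enum


class ToolKind(Enum):
    READ = "read"
    WRITE = "write"
    EXECUTE = "execute"
    SEARCH = "search"
    FETCH = "fetch"
    OTHER = "other"


class InteractionPolicy(Enum):
    AUTO_APPROVE = "auto-approve"
    AUTO_DENY = "auto-deny"
    SCRIPTED = "scripted"
    LLM_BASED = "llm-based"
    MANUAL = "manual"


def answer_permission(policy, tool_kind, script=None):
    """Return True to approve a tool-permission request, False to deny it."""
    if policy is InteractionPolicy.AUTO_APPROVE:
        return True
    if policy is InteractionPolicy.AUTO_DENY:
        return False
    if policy is InteractionPolicy.SCRIPTED:
        # Assumed script shape: a mapping from ToolKind to a canned decision;
        # kinds the script does not mention are denied.
        return bool((script or {}).get(tool_kind, False))
    # llm-based needs the Responder and manual needs a human; neither is sketched.
    raise NotImplementedError(f"policy {policy.value!r} is not sketched")


script = {ToolKind.READ: True, ToolKind.EXECUTE: False}
print(answer_permission(InteractionPolicy.SCRIPTED, ToolKind.READ, script))   # True
print(answer_permission(InteractionPolicy.SCRIPTED, ToolKind.WRITE, script))  # False
```

Keying decisions on the Tool Kind rather than on a tool's name is what would let one script apply across agents whose tools are named differently.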

Execution & isolation

Term	Meaning
Sandbox	The isolated working directory a Cell's Harness operates in, prepared fresh from the Case source. Never shared between Cells.
Isolation Mode	How a Sandbox is created: `copy` (a folder), `clone` (git at a commit), or `worktree` (git worktree at a commit).
Environment	A Cell's own throwaway dependency setup — a per-Cell virtualenv (`pip-venv`/`uv`) or project-local install (`command`), so dependency-bearing Cells stay reproducible and parallel-safe.
Executor	Where a Cell's non-Harness commands run: `LocalExecutor` (host subprocesses) or `ContainerExecutor` (`docker exec`, for OS-level isolation and OS packages).
Reachability / Availability	A preflight that probes a Case's external git repos before the Run does work, then applies the availability policy (`fail` aborts on an unreachable required Case; `skip`/`optional` degrade it to `skipped`).
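
The three Isolation Modes map naturally onto a directory copy and two standard git operations. The sketch below shows one way that mapping could look; `prepare_sandbox` and its parameters are hypothetical, and the git invocations are a guess at what the framework does, not a description of it.

```python
import shutil
import subprocess
from pathlib import Path


def prepare_sandbox(mode, source, dest, commit=None):
    """Create a fresh Sandbox at `dest` from `source` using the given Isolation Mode."""
    source, dest = Path(source), Path(dest)
    if dest.exists():
        raise FileExistsError(f"{dest} already exists; a Sandbox is never shared")
    if mode in ("clone", "worktree") and commit is None:
        raise ValueError(f"isolation mode {mode!r} needs a commit")

    if mode == "copy":
        # Plain folder copy; no git involved.
        shutil.copytree(source, dest)
    elif mode == "clone":
        # Full clone, then pin the working tree to the requested commit.
        subprocess.run(["git", "clone", "--quiet", str(source), str(dest)], check=True)
        subprocess.run(
            ["git", "-C", str(dest), "checkout", "--quiet", "--detach", commit],
            check=True,
        )
    elif mode == "worktree":
        # Extra working tree that shares the source repository's object store.
        subprocess.run(
            ["git", "-C", str(source), "worktree", "add", "--detach", str(dest), commit],
            check=True,
        )
    else:
        raise ValueError(f"unknown isolation mode: {mode!r}")
    return dest
```

Refusing to reuse an existing `dest` is how this sketch keeps each Cell's Sandbox separate; a real implementation would also need a cleanup step, for example `git worktree remove` for the `worktree` mode.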

Conversation

Term	Meaning
Turn	One eval-initiated prompt sent to the agent within a Cell. The first Turn is the Case's task; later Turns are scripted follow-ups.
Conversation	The ordered Turns of a Case. Each Turn is sent only after the agent has reached a stop on the previous one. A single-prompt Case is a one-Turn Conversation.
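
The ordering rule for Turns can be sketched as a simple loop. Here `run_conversation` and the `send` callable are hypothetical stand-ins: `send` is assumed to block until the agent reaches a stop and then return its reply.

```python
from typing import Callable, List


def run_conversation(turns: List[str], send: Callable[[str], str]) -> List[str]:
    """Send each Turn in order; `send` blocks until the agent reaches a stop."""
    if not turns:
        raise ValueError("a Conversation has at least one Turn (the Case's task)")
    replies = []
    for prompt in turns:
        # The next Turn is only sent after the previous call has returned.
        replies.append(send(prompt))
    return replies


# Stand-in agent for illustration: echoes the prompt in upper case.
replies = run_conversation(
    ["fix the failing test", "now add a changelog entry"], str.upper
)
print(replies)  # ['FIX THE FAILING TEST', 'NOW ADD A CHANGELOG ENTRY']
```

A single-prompt Case is the same loop with a one-element list.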

How they relate

A Run expands into many Cells; each Cell pairs one Case with one Harness, one Model, and one Trial index.
A Harness is realized by exactly one Adapter, which declares its Tracing and Interaction capabilities.
A Tracing-capable Harness produces a Trace per Cell, alongside the raw transcript. Interaction implies Tracing, not vice-versa.
Graders may read the final output, the Trace, or both. A trace-dependent grader on an output-only Harness is a hard failure for that Cell.
Each Cell gets its own Sandbox (and, if declared, its own Environment), so Cells can run in parallel without contention and a Run that crashes can be resumed.
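
Two of the rules above lend themselves to a small check: Interaction implies Tracing, and a trace-dependent grader on an output-only Harness fails the Cell. The sketch below encodes both. `Capabilities`, `CellFailure`, `check_graders`, and the grader names are hypothetical, not touchstone's actual types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capabilities:
    """Opt-in capabilities an Adapter declares for its Harness."""
    tracing: bool = False
    interaction: bool = False

    def __post_init__(self):
        # Interaction implies Tracing, not the other way round.
        if self.interaction and not self.tracing:
            raise ValueError("Interaction requires Tracing")


class CellFailure(Exception):
    """Raised when a Cell cannot be graded as configured."""


def check_graders(capabilities, graders):
    """`graders` maps a grader name to whether it needs the Trace."""
    for name, needs_trace in graders.items():
        if needs_trace and not capabilities.tracing:
            raise CellFailure(
                f"grader {name!r} needs a Trace, but the Harness is output-only"
            )


output_only = Capabilities()
check_graders(output_only, {"final-text": False})  # fine: no Trace needed
try:
    check_graders(output_only, {"tool-usage": True})
except CellFailure as err:
    print(err)
```

Running this check before a Cell starts would surface the mismatch early, rather than after the Harness has already done its work.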