Concepts
touchstone fixes a small vocabulary so the framework, its docs, and its cases all mean the
same thing by the same word. This page is the working glossary; the canonical source is
CONTEXT.md in the repo.
The one idea
Hold everything constant except the model.
A real eval is a controlled experiment. The harness, the judge, and the responder are control variables — fixed across the whole comparison — so a difference in the report reflects a difference in the model, not the scaffolding around it. Every concept below exists to make that control possible.
Benchmark structure
Run ──expands──▶ many Cells ──each has──▶ one Harness · one Model · one Trial
│                                         │
└─ produces a Report                      └─ realized by one Adapter
| Term | Meaning |
|---|---|
| Case | One eval — a task plus the input, artifacts, graders, and expectations needed to judge it. Lives in a case.yaml. |
| Run | A single execution of the benchmark that expands a matrix and produces a report. |
| Cell | The atomic unit of work and persistence: one (Case × Harness × Model × Trial). |
| Trial | One repeated attempt at the same (Case × Harness × Model) coordinates, used to measure consistency and pass@k. |
| Matrix | The axes to compare. Models are paired per-harness (entries: [{harness, models}]), because a model is only meaningful relative to the harness that runs it. |
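
The expansion of a Run into Cells can be sketched as below. This is a minimal illustration, not touchstone's implementation: the `Cell` dataclass, the `expand_matrix` function, and the harness and model names are all hypothetical. Only the `entries: [{harness, models}]` shape comes from the Matrix definition above.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Cell:
    """One (Case x Harness x Model x Trial) coordinate (hypothetical type)."""
    case: str
    harness: str
    model: str
    trial: int


def expand_matrix(cases, entries, trials):
    """Expand cases x per-harness model lists x trial indices into Cells."""
    cells = []
    for case in cases:
        for entry in entries:
            # Models are paired with their harness, not crossed with every harness.
            for model, trial in product(entry["models"], range(trials)):
                cells.append(Cell(case, entry["harness"], model, trial))
    return cells


# Placeholder names, chosen only for illustration.
entries = [
    {"harness": "harness-a", "models": ["model-x", "model-y"]},
    {"harness": "harness-b", "models": ["model-z"]},
]
cells = expand_matrix(["case-1", "case-2"], entries, trials=3)
print(len(cells))  # 2 cases * (2 + 1) harness/model pairs * 3 trials = 18
```

Because models are listed per harness, `model-x` is never scheduled on `harness-b`; a plain cross product of all harnesses and all models would create that meaningless Cell.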
The swappable pieces
| Term | Meaning |
|---|---|
| Harness | The swappable thing that turns a Case's task into an output, behind one interface. Every Harness is an Adapter. |
| Grader | A component that turns a Harness's result into a Score. A Case can have several; they combine per the Case's pass_threshold. |
| Judge | The fixed auxiliary LLM used by the model-as-judge grader. A control variable, held constant across the matrix. |
| Responder | The fixed auxiliary LLM that answers an agent's mid-run questions under the llm-based interaction policy. Also a control variable. |
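
One way several Graders could combine against a Case's `pass_threshold` is sketched below. The combination rule here (mean score compared with the threshold) is an assumption made for illustration; this page does not state the actual rule. The `Score` type, the `case_passes` function, and the `exact-match` grader name are hypothetical.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass(frozen=True)
class Score:
    """A single grader's verdict (hypothetical type)."""
    grader: str
    value: float  # assumed to be normalized to the range 0.0..1.0


def case_passes(scores, pass_threshold):
    """ASSUMPTION: a Cell passes when the mean grader score meets the threshold."""
    if not scores:
        raise ValueError("a Case needs at least one grader score")
    return fmean(s.value for s in scores) >= pass_threshold


scores = [Score("exact-match", 1.0), Score("model-as-judge", 0.5)]
print(case_passes(scores, pass_threshold=0.7))  # mean is 0.75, so True
print(case_passes(scores, pass_threshold=0.8))  # mean is 0.75, so False
```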
Observation & interaction
A Harness can do more than return final text. Its Adapter can declare two opt-in capabilities, Tracing and Interaction; the table below defines them and the terms they rely on.
| Term | Meaning |
|---|---|
| Trace | The normalized, vendor-neutral event stream captured from a run — messages, tool calls, token usage, permission events. The framework's own schema, never an external protocol's types. |
| Tracing | A Harness's ability to emit a Trace. Harnesses that lack it degrade to output-only. |
| Interaction | A Harness's ability to let the framework answer the agent's mid-run requests (tool permission / approval / input). Strictly richer than Tracing — you can't answer what you can't observe. |
| Tool Kind | The portable category of a tool call (read · write · execute · search · fetch · other) — the one tool axis that means the same across agents. |
| Interaction Policy | The per-Case rule that answers agent-initiated requests: auto-approve · auto-deny · scripted · llm-based · manual. |
See Observation & interaction for the Trace event types and the policies.
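
The Tool Kind categories and Interaction Policy names above map naturally onto enums. The sketch below is illustrative only: the class and function names are hypothetical, and the shape of a script (a mapping from Tool Kind to a decision, with unlisted kinds denied) is an assumption, not the framework's documented format.

```python
from enum import Enum


class ToolKind(Enum):
    READ = "read"
    WRITE = "write"
    EXECUTE = "execute"
    SEARCH = "search"
    FETCH = "fetch"
    OTHER = "other"


class InteractionPolicy(Enum):
    AUTO_APPROVE = "auto-approve"
    AUTO_DENY = "auto-deny"
    SCRIPTED = "scripted"
    LLM_BASED = "llm-based"
    MANUAL = "manual"


def answer_permission(policy, tool_kind, script=None):
    """Return True to approve a tool-permission request, False to deny it."""
    if policy is InteractionPolicy.AUTO_APPROVE:
        return True
    if policy is InteractionPolicy.AUTO_DENY:
        return False
    if policy is InteractionPolicy.SCRIPTED:
        # ASSUMPTION: a script maps Tool Kinds to decisions; unlisted kinds are denied.
        return (script or {}).get(tool_kind, False)
    # llm-based needs the Responder and manual needs a human; neither is modeled here.
    raise NotImplementedError(f"{policy.value} is not covered by this sketch")


script = {ToolKind.READ: True, ToolKind.EXECUTE: False}
print(answer_permission(InteractionPolicy.SCRIPTED, ToolKind.READ, script))   # True
print(answer_permission(InteractionPolicy.SCRIPTED, ToolKind.WRITE, script))  # False
```

Keying decisions on Tool Kind rather than on a vendor-specific tool name is what lets one policy apply across different agents.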
Execution & isolation
| Term | Meaning |
|---|---|
| Sandbox | The isolated working directory a Cell's Harness operates in, prepared fresh from the Case source. Never shared between Cells. |
| Isolation Mode | How a Sandbox is created: copy (a folder), clone (git at a commit), or worktree (git worktree at a commit). |
| Environment | A Cell's own throwaway dependency setup — a per-Cell virtualenv (pip-venv/uv) or project-local install (command), so dependency-bearing Cells stay reproducible and parallel-safe. |
| Executor | Where a Cell's non-Harness commands run: LocalExecutor (host subprocesses) or ContainerExecutor (docker exec, for OS-level isolation and OS packages). |
| Reachability / Availability | A preflight that probes a Case's external git repos before the Run does work, then applies the availability policy (fail aborts on an unreachable required Case; skip/optional degrade it to skipped). |
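
The availability preflight described in the last row can be sketched as a pure function. Everything named here is hypothetical: the `preflight` function, the dict keys `name`, `repo`, and `availability`, and the injected `is_reachable` probe. A real probe might shell out to `git ls-remote`, but that is an assumption and is deliberately left out so the sketch runs without network access.

```python
def preflight(cases, is_reachable):
    """Split Cases into runnable and skipped; abort on an unreachable required Case.

    cases: list of dicts with hypothetical keys "name", "repo", "availability".
    is_reachable: callable(repo_url) -> bool, injected so the probe stays swappable.
    """
    runnable, skipped = [], []
    for case in cases:
        if is_reachable(case["repo"]):
            runnable.append(case["name"])
        elif case["availability"] == "fail":
            # A required Case that cannot be fetched aborts the Run before any work starts.
            raise RuntimeError(f"required Case {case['name']!r} is unreachable")
        elif case["availability"] in ("skip", "optional"):
            skipped.append(case["name"])
        else:
            raise ValueError(f"unknown availability policy: {case['availability']!r}")
    return runnable, skipped


# Placeholder data; the URLs use the reserved .invalid domain and are never contacted.
cases = [
    {"name": "case-1", "repo": "https://example.invalid/up.git", "availability": "fail"},
    {"name": "case-2", "repo": "https://example.invalid/down.git", "availability": "skip"},
]
runnable, skipped = preflight(cases, is_reachable=lambda url: url.endswith("up.git"))
print(runnable, skipped)  # ['case-1'] ['case-2']
```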
Conversation
| Term | Meaning |
|---|---|
| Turn | One eval-initiated prompt sent to the agent within a Cell. The first Turn is the Case's task; later Turns are scripted follow-ups. |
| Conversation | The ordered Turns of a Case, each sent once the agent's previous Turn reaches a stop. A single-prompt Case is a one-Turn Conversation. |
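
A Conversation reduces to a sequential loop over Turns. The sketch below assumes a hypothetical `send_turn` callable that blocks until the agent reaches a stop; `run_conversation`, `echo_agent`, and the prompts are illustrative names and data, not part of the framework.

```python
def run_conversation(turns, send_turn):
    """Send each Turn in order, waiting for the agent to stop before the next one.

    send_turn: callable(prompt) -> reply; assumed to block until the agent stops.
    """
    transcript = []
    for index, prompt in enumerate(turns):
        reply = send_turn(prompt)
        transcript.append({"turn": index, "prompt": prompt, "reply": reply})
    return transcript


def echo_agent(prompt):
    # Stand-in agent for illustration; a real Harness would run the model here.
    return f"done: {prompt}"


turns = ["Fix the failing test.", "Now add a changelog entry."]
transcript = run_conversation(turns, echo_agent)
print([entry["reply"] for entry in transcript])
```

A single-prompt Case is the same loop with a one-element `turns` list, so no separate code path is needed for it.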
How they relate
- A Run expands into many Cells; each Cell has one Case, one Harness, one Model, and one Trial index.
- A Harness is realized by exactly one Adapter, which declares its Tracing and Interaction capabilities.
- A Tracing-capable Harness produces a Trace per Cell, alongside the raw transcript. Interaction implies Tracing, not vice-versa.
- Graders may read the final output, the Trace, or both. A trace-dependent grader on an output-only Harness is a hard failure for that Cell.
- Each Cell gets its own Sandbox (and, if declared, its own Environment), so Cells run in parallel without contention and a crashed Run can be resumed.
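
Two of the rules above are invariants that can be checked mechanically: Interaction implies Tracing, and a trace-dependent grader cannot run against an output-only Harness. The sketch below is one hedged way to encode them; the `Capabilities` type and `check_grader` function are hypothetical names, and where the real framework enforces these rules is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capabilities:
    """Capabilities an Adapter declares for its Harness (hypothetical type)."""
    tracing: bool = False
    interaction: bool = False

    def __post_init__(self):
        # Interaction implies Tracing, not the other way round.
        if self.interaction and not self.tracing:
            raise ValueError("Interaction requires Tracing")


def check_grader(needs_trace, caps):
    """Raise if a trace-dependent grader is paired with an output-only Harness."""
    if needs_trace and not caps.tracing:
        raise RuntimeError("trace-dependent grader on an output-only Harness")


output_only = Capabilities()
traced = Capabilities(tracing=True)

check_grader(needs_trace=True, caps=traced)        # allowed: a Trace is available
check_grader(needs_trace=False, caps=output_only)  # allowed: grader reads only the output
```

Calling `check_grader(needs_trace=True, caps=output_only)` raises, which mirrors the hard failure for that Cell.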