Architecture
touchstone is a small, layered pipeline. A Run expands a matrix into Cells; each Cell flows through preparation → execution → grading, writing its own artifacts. The whole design is organized around one contract — the Trace — and one principle — hold everything constant except the model.
The cell lifecycle
                   ┌────────────────────────────────── one Cell ───────────────────────────────────┐
Case ───┐          │                                                                               │
Matrix ─┼─▶ expand ▶ Cell ▶ Sandbox    ▶ Environment ▶ setup  ▶ Harness   ▶ Trace       ▶ Graders  ▶ result.json
Trials ─┘          │        (copy/       (per-cell     (stub/   (drives     (normalized   (scored) │
                   │         clone/       venv or       run)     a model)    events)               │
                   │         worktree)    container)                                               │
                   └───────────────────────────────────────────────────────────────────────────────┘
                                                           │
                                  Reachability preflight ◀─┘ (probe external repos before any work)
                                                           ▼
                                               report.md + manifest.json
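The `expand` step in the diagram can be pictured with a small sketch. This is not touchstone's actual code: the `Cell` fields, the `cell_id` format, and the `expand` signature are hypothetical, and it assumes the matrix is a mapping of axis names (such as `model`) to lists of values.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Cell:
    """One point in the expanded matrix (hypothetical shape for illustration)."""

    case: str
    params: tuple[tuple[str, str], ...]  # e.g. (("model", "model-a"),)
    trial: int

    @property
    def cell_id(self) -> str:
        # A stable id lets each Cell own its artifact directory.
        axes = "+".join(f"{k}={v}" for k, v in self.params)
        return f"{self.case}__{axes}__t{self.trial}"


def expand(case: str, matrix: dict[str, list[str]], trials: int) -> list[Cell]:
    """Cross every matrix axis, then repeat each combination `trials` times."""
    axes = sorted(matrix)  # fixed axis order keeps cell ids deterministic
    cells = []
    for values in product(*(matrix[a] for a in axes)):
        for trial in range(trials):
            cells.append(Cell(case, tuple(zip(axes, values)), trial))
    return cells


cells = expand("fix-bug", {"model": ["model-a", "model-b"]}, trials=3)
print(len(cells))        # 6 (2 models x 3 trials)
print(cells[0].cell_id)  # fix-bug__model=model-a__t0
```

Because every Cell carries its own id, each one can write its artifacts independently of the others, which is what makes parallel runs and resume straightforward in a design like this.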
Layers
| Layer | Responsibility | Code |
|---|---|---|
| Config | Parse & validate case.yaml (pydantic) | config.py |
| Runner | Expand the matrix, orchestrate cells, resume, parallelize | runner.py, concurrency.py |
| Reachability | Preflight external repos; apply the availability policy | reachability.py |
| Sandbox | Prepare an isolated working tree per cell | sandbox.py |
| Environment | Provision a per-cell venv / project install | environment.py, setup.py |
| Executor | Run a Cell's work — provisioning, graders, and (opt-in) the Harness — on the host or in a sandbox | executor.py |
| Harness | Drive a model, emit a Trace | harness/ |
| Trace | The normalized event stream + sink | trace.py |
| Interaction | Answer agent-initiated requests | interaction/ |
| Grader | Turn a result into a Score | grader/ |
| Report / Export | Render comparisons; map to LangFuse | report.py, export/ |
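To make the Harness, Trace, and Grader rows concrete, here is a minimal sketch of how an adapter might normalize vendor events and how a grader might consume them. Everything in it is an assumption for illustration: the `TraceEvent` fields, the made-up vendor payload shapes, and the `tool_call_count` grader are not taken from `trace.py` or `grader/`.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class TraceEvent:
    """A vendor-neutral event (hypothetical fields, not touchstone's schema)."""

    kind: str  # e.g. "message", "tool_call", "unknown"
    payload: dict[str, Any] = field(default_factory=dict)


def adapt_vendor_event(raw: dict[str, Any]) -> TraceEvent:
    """Translate one made-up vendor event into the shared schema."""
    if raw.get("type") == "function_call":
        return TraceEvent("tool_call", {"name": raw["name"], "args": raw.get("arguments", {})})
    if raw.get("type") == "text":
        return TraceEvent("message", {"text": raw["content"]})
    # Keep unrecognized events instead of dropping them silently.
    return TraceEvent("unknown", {"raw": raw})


def tool_call_count(trace: list[TraceEvent]) -> int:
    """A toy grader: it reads only TraceEvents, never vendor payloads."""
    return sum(1 for event in trace if event.kind == "tool_call")


raw_events = [
    {"type": "text", "content": "Looking at the failing test."},
    {"type": "function_call", "name": "read_file", "arguments": {"path": "a.py"}},
    {"type": "heartbeat"},
]
trace = [adapt_vendor_event(raw) for raw in raw_events]
print(tool_call_count(trace))  # 1
```

The grader here never inspects the raw vendor dictionaries, so a second adapter with a different wire format would only need its own `adapt_*` function.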
Two contracts hold it together
- The Trace: every Harness Adapter translates its native events into one vendor-neutral schema. Graders and the LangFuse export depend on the Trace, never on ACP or a vendor SDK. This is what lets a new Adapter (the `openai` loop, a future Claude SDK adapter) slot in without touching graders.
- The Executor seam: provisioning, setup, the command/test graders, and (opt-in, ADR 0014) the Harness itself run through one `run(argv, cwd, env)` + filesystem interface, so a sandbox backend is a swap, not a rewrite: a local docker container, or a Harbor sandbox (Daytona/Modal/E2B/Runloop/GKE) via `backend: harbor`.
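The Executor seam can be sketched as a small protocol with interchangeable backends. This is a sketch under stated assumptions, not the contents of `executor.py`: only the `run(argv, cwd, env)` shape comes from the text above, while the return type, the class names `HostExecutor` and `RecordingExecutor`, and the `command_grader` helper are hypothetical. The filesystem half of the interface is omitted.

```python
from __future__ import annotations

import os
import subprocess
import sys
from pathlib import Path
from typing import Mapping, Protocol, Sequence


class Executor(Protocol):
    """The seam: anything with this `run` method can act as a backend."""

    def run(
        self, argv: Sequence[str], cwd: Path, env: Mapping[str, str]
    ) -> subprocess.CompletedProcess[str]: ...


class HostExecutor:
    """Runs the command directly on the host."""

    def run(self, argv, cwd, env):
        return subprocess.run(
            list(argv), cwd=cwd, env=dict(env),
            capture_output=True, text=True, check=False,
        )


class RecordingExecutor:
    """A stand-in backend that records commands instead of running them."""

    def __init__(self) -> None:
        self.calls: list[list[str]] = []

    def run(self, argv, cwd, env):
        self.calls.append(list(argv))
        return subprocess.CompletedProcess(list(argv), 0, stdout="", stderr="")


def command_grader(executor: Executor, workdir: Path) -> bool:
    """A toy command grader: passes when the command exits with status 0."""
    argv = [sys.executable, "-c", "print('ok')"]
    result = executor.run(argv, cwd=workdir, env=dict(os.environ))
    return result.returncode == 0


print(command_grader(HostExecutor(), Path(".")))       # True
print(command_grader(RecordingExecutor(), Path(".")))  # True
```

The grader depends only on the protocol, so pointing it at a container or remote sandbox would mean writing another class with the same `run` method rather than changing the grader.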
Why these shapes
Each non-obvious choice is recorded as an ADR. Start here:
- 0001 · Responder-mediated interaction: keep the agent the only variable when answering its requests.
- 0002 · Parallel-safe store & isolation: per-cell `result.json` as the source of truth.
- 0003 · ACP as the single rich adapter (superseded by 0010): one protocol for many agents…
- 0006 · Native stream-json Claude adapter: …but ACP is not the only rich path.
- 0009 · OpenAI-compatible in-process adapter: the rich path that needs no vendor CLI.
- 0010 · Rich-adapter substrate: three rich adapters now, so the shared concerns get one home.
- 0004 · Per-cell environment · 0005 · Pluggable provisioning & executor: reproducible, isolated dependencies.
- 0007 · Fixtures repo source & hidden: contamination-proof graded assets.
- 0008 · Reachability & availability policy: a missing private repo can never silently shrink the benchmark.
Repository layout
src/touchstone/   the framework (config, harness/, grader/, interaction/, runner, report, cli)
evals/            public example cases (run with the bundled `echo` or any harness)
evals-private/    your held-out, never-committed cases (git-ignored)
docs/             this site (+ ADRs under docs/adr/)
The canonical glossary lives in `CONTEXT.md`; this site's Concepts page is the readable digest.