Architecture

touchstone is a small, layered pipeline. A Run expands a matrix into Cells; each Cell flows through preparation → execution → grading and writes its own artifacts. The design is organized around a central contract, the Trace, and one principle: hold everything constant except the model.
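
To make the expansion step concrete, here is a minimal sketch of how a matrix and a trial count could fan out into Cells. The `Cell` dataclass, the `expand` function, and the axis names are illustrative assumptions, not touchstone's actual API; the real shapes live in `runner.py` and `config.py`.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Cell:
    """One (case, matrix point, trial) combination. Hypothetical shape."""
    case: str
    params: tuple  # (axis, value) pairs in sorted axis order
    trial: int


def expand(case: str, matrix: dict[str, list], trials: int) -> list[Cell]:
    """Take the Cartesian product of all matrix axes, repeated once per trial."""
    axes = sorted(matrix)  # a fixed axis order keeps the expansion deterministic
    cells = []
    for values in product(*(matrix[axis] for axis in axes)):
        params = tuple(zip(axes, values))
        for trial in range(trials):
            cells.append(Cell(case, params, trial))
    return cells


# Assumed example axes: two models, one harness, two trials each.
cells = expand(
    "fix-bug",
    {"model": ["model-a", "model-b"], "harness": ["echo"]},
    trials=2,
)
print(len(cells))  # 4
```

Because every Cell carries its full parameter tuple, two Cells that differ only in `model` share everything else, which is the "hold everything constant except the model" principle expressed as data.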

The cell lifecycle

                          ┌──────────────────────── one Cell ────────────────────────┐
  Case ──┐                │                                                            │
  Matrix ─┼─▶ expand ▶ Cell ▶ Sandbox ▶ Environment ▶ setup ▶ Harness ▶ Trace ▶ Graders ▶ result.json
  Trials ─┘                │   (copy/      (per-cell    (stub/   (drives    (normalized   (scored)   │
                          │    clone/       venv or     run)     a model)    events)                 │
                          │    worktree)    container)                                                │
                          └────────────────────────────────────────────────────────────┘
                                                   │
                          Reachability preflight ◀─┘   (probe external repos before any work)
                                                   ▼
                                          report.md + manifest.json
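
The lifecycle in the diagram can be read as an ordered list of stages that share per-cell state and always end in a `result.json`. The sketch below illustrates that reading only; `run_cell`, the stage callables, the result fields, and the cell id format are all hypothetical and are not taken from touchstone's code.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable

# Stage names follow the diagram; the callables are placeholders.
Stage = tuple[str, Callable[[dict], None]]


def run_cell(cell_id: str, stages: list[Stage], out_dir: Path) -> dict:
    """Run stages in order, stop at the first failure, always write result.json."""
    state: dict = {"cell": cell_id}
    result = {"cell": cell_id, "status": "ok", "failed_stage": None}
    for name, stage in stages:
        try:
            stage(state)
        except Exception as exc:  # record which stage broke instead of crashing the Run
            result.update(status="error", failed_stage=name, error=str(exc))
            break
    result["score"] = state.get("score")
    cell_dir = out_dir / cell_id
    cell_dir.mkdir(parents=True, exist_ok=True)
    (cell_dir / "result.json").write_text(json.dumps(result, indent=2))
    return result


def fake_harness(state: dict) -> None:
    """Stands in for a Harness: emits a tiny normalized trace."""
    state["trace"] = [{"type": "message", "text": "done"}]


def fake_grader(state: dict) -> None:
    """Stands in for a Grader: scores from the trace alone."""
    state["score"] = 1.0 if state["trace"] else 0.0


stages = [
    ("sandbox", lambda s: None),
    ("environment", lambda s: None),
    ("setup", lambda s: None),
    ("harness", fake_harness),
    ("graders", fake_grader),
]
out = Path(tempfile.mkdtemp())
result = run_cell("case-a__model-a__0", stages, out)
```

Writing one file per Cell, even on failure, means a Cell's outcome never depends on any other Cell having finished.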

Layers

| Layer | Responsibility | Code |
| --- | --- | --- |
| Config | Parse & validate `case.yaml` (pydantic) | `config.py` |
| Runner | Expand the matrix, orchestrate cells, resume, parallelize | `runner.py`, `concurrency.py` |
| Reachability | Preflight external repos; apply the availability policy | `reachability.py` |
| Sandbox | Prepare an isolated working tree per cell | `sandbox.py` |
| Environment | Provision a per-cell venv / project install | `environment.py`, `setup.py` |
| Executor | Run a Cell's work (provisioning, graders, and, opt-in, the Harness) on the host or in a sandbox | `executor.py` |
| Harness | Drive a model, emit a Trace | `harness/` |
| Trace | The normalized event stream + sink | `trace.py` |
| Interaction | Answer agent-initiated requests | `interaction/` |
| Grader | Turn a result into a Score | `grader/` |
| Report / Export | Render comparisons; map to LangFuse | `report.py`, `export/` |

Two contracts hold it together

The Trace: every Harness Adapter translates its native events into one vendor-neutral schema. Graders and the LangFuse export depend on the Trace, never on ACP or a vendor SDK. This is what lets a new Adapter (the openai loop, a future Claude SDK adapter) slot in without touching graders.
The Executor seam: provisioning, setup, the command/test graders, and (opt-in, ADR 0014) the Harness itself all run through one `run(argv, cwd, env)` + filesystem interface. A sandbox backend is therefore a swap, not a rewrite: it can be a local docker container or, via `backend: harbor`, a Harbor sandbox (Daytona, Modal, E2B, Runloop, or GKE).
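
Both contracts can be illustrated in a few lines. In the sketch below, only the `run(argv, cwd, env)` signature comes from the text above. The exit-code return value, the `TraceEvent` fields, the made-up vendor event format, and every class and function name are assumptions for illustration, and the filesystem half of the Executor interface is left out.

```python
import os
import subprocess
import sys
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class TraceEvent:
    """Vendor-neutral event; field names are illustrative."""
    kind: str  # e.g. "message" or "tool_call"
    payload: dict = field(default_factory=dict)


def adapt_native(native: dict) -> TraceEvent:
    """Adapter side: translate one made-up vendor event into the neutral shape."""
    if "function" in native:
        return TraceEvent("tool_call", {"name": native["function"]})
    return TraceEvent("message", {"text": native.get("content", "")})


def count_tool_calls(trace: list[TraceEvent]) -> int:
    """Grader side: reads only TraceEvents, never the vendor format."""
    return sum(1 for event in trace if event.kind == "tool_call")


class Executor(Protocol):
    """The seam: anything with this method can run a Cell's commands."""
    def run(self, argv: list[str], cwd: str, env: dict[str, str]) -> int: ...


class HostExecutor:
    """Runs the command directly on the host and returns its exit code."""
    def run(self, argv: list[str], cwd: str, env: dict[str, str]) -> int:
        return subprocess.run(argv, cwd=cwd, env=env).returncode


class RecordingExecutor:
    """Stand-in for a sandbox backend: same interface, different behavior."""
    def __init__(self) -> None:
        self.calls: list[list[str]] = []

    def run(self, argv: list[str], cwd: str, env: dict[str, str]) -> int:
        self.calls.append(argv)
        return 0


def command_grader(executor: Executor, cwd: str) -> bool:
    """Passes when the command exits 0; unaware of which backend runs it."""
    argv = [sys.executable, "-c", "pass"]
    return executor.run(argv, cwd=cwd, env=dict(os.environ)) == 0


trace = [adapt_native({"function": "read_file"}), adapt_native({"content": "hi"})]
tool_calls = count_tool_calls(trace)
host_ok = command_grader(HostExecutor(), cwd=".")
sandbox = RecordingExecutor()
sandbox_ok = command_grader(sandbox, cwd=".")
```

The grader functions never import a vendor SDK or name a backend, so adding an Adapter or swapping the Executor leaves them unchanged.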

Why these shapes

Each non-obvious choice is recorded as an ADR. Start here:

0001 · Responder-mediated interaction — keep the agent the only variable when answering its requests.
0002 · Parallel-safe store & isolation — per-cell result.json as the source of truth.
0003 · ACP as the single rich adapter (superseded by 0010) — one protocol for many agents…
0006 · Native stream-json Claude adapter — …but ACP is not the only rich path.
0009 · OpenAI-compatible in-process adapter — the rich path that needs no vendor CLI.
0010 · Rich-adapter substrate — three rich adapters now, so the shared concerns get one home.
0004 · Per-cell environment · 0005 · Pluggable provisioning & executor — reproducible, isolated dependencies.
0007 · Fixtures repo source & hidden — contamination-proof graded assets.
0008 · Reachability & availability policy — a missing private repo can never silently shrink the benchmark.

Repository layout

src/touchstone/      the framework (config, harness/, grader/, interaction/, runner, report, cli)
evals/               public example cases (run with the bundled `echo` or any harness)
evals-private/       your held-out, never-committed cases (git-ignored)
docs/                this site (+ ADRs under docs/adr/)

The canonical glossary lives in CONTEXT.md; this site's Concepts page is the readable digest.