Architecture

touchstone is a small, layered pipeline. A Run expands a matrix into Cells; each Cell flows through preparation → execution → grading and writes its own artifacts. The design is organized around a central contract, the Trace, and one principle: hold everything constant except the model.

The cell lifecycle

                          ┌──────────────────────── one Cell ────────────────────────┐
  Case ──┐                │                                                            │
  Matrix ─┼─▶ expand ▶ Cell ▶ Sandbox ▶ Environment ▶ setup ▶ Harness ▶ Trace ▶ Graders ▶ result.json
  Trials ─┘                │   (copy/      (per-cell    (stub/   (drives    (normalized   (scored)   │
                          │    clone/       venv or     run)     a model)    events)                 │
                          │    worktree)    container)                                                │
                          └────────────────────────────────────────────────────────────┘
                          Reachability preflight ◀─┘   (probe external repos before any work)
                                          report.md + manifest.json
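
The expand step in the diagram can be sketched as follows. This is a minimal illustration, not touchstone's actual runner: it assumes the matrix is a mapping of axis name to values and that expansion is a plain Cartesian product repeated once per trial. The `Cell` fields, the `expand` signature, and the axis names are all hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Cell:
    """Hypothetical Cell: one case, one point in the matrix, one trial."""
    case: str
    params: tuple[tuple[str, str], ...]  # (axis, value) pairs, sorted by axis
    trial: int


def expand(case: str, matrix: dict[str, list[str]], trials: int) -> list[Cell]:
    """Cartesian product of the matrix axes, repeated `trials` times."""
    axes = sorted(matrix)  # fixed axis order keeps the expansion deterministic
    cells = []
    for values in product(*(matrix[axis] for axis in axes)):
        for trial in range(trials):
            cells.append(Cell(case, tuple(zip(axes, values)), trial))
    return cells


# Assumed example inputs: two models, one harness, two trials each.
cells = expand(
    "fix-bug",
    {"model": ["model-a", "model-b"], "harness": ["echo"]},
    trials=2,
)
for cell in cells:
    print(cell)
```

Only the `model` axis varies between these four Cells; the case, harness, and trial count stay fixed, which is the "hold everything constant except the model" principle in miniature.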

Layers

| Layer | Responsibility | Code |
|---|---|---|
| Config | Parse & validate case.yaml (pydantic) | config.py |
| Runner | Expand the matrix, orchestrate cells, resume, parallelize | runner.py, concurrency.py |
| Reachability | Preflight external repos; apply the availability policy | reachability.py |
| Sandbox | Prepare an isolated working tree per cell | sandbox.py |
| Environment | Provision a per-cell venv / project install | environment.py, setup.py |
| Executor | Run a Cell's work (provisioning, graders, and, opt-in, the Harness) on the host or in a sandbox | executor.py |
| Harness | Drive a model, emit a Trace | harness/ |
| Trace | The normalized event stream + sink | trace.py |
| Interaction | Answer agent-initiated requests | interaction/ |
| Grader | Turn a result into a Score | grader/ |
| Report / Export | Render comparisons; map to LangFuse | report.py, export/ |
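
The Harness, Trace, and Grader rows describe a one-way data flow: an Adapter emits normalized events, and a Grader reads only those events. The sketch below illustrates that flow under loud assumptions. `TraceEvent`, the event kinds, the made-up native event format, and both function names are hypothetical and are not touchstone's real schema or API.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class TraceEvent:
    """Hypothetical vendor-neutral event; the real Trace schema may differ."""
    kind: str  # assumed kinds: "message", "tool_call"
    payload: dict[str, Any] = field(default_factory=dict)


def adapt_vendor_events(native: list[dict[str, Any]]) -> list[TraceEvent]:
    """Toy Adapter: translate an invented native format into TraceEvents."""
    trace = []
    for event in native:
        if event["type"] == "assistant_text":
            trace.append(TraceEvent("message", {"text": event["content"]}))
        elif event["type"] == "function":
            trace.append(
                TraceEvent(
                    "tool_call",
                    {"name": event["fn"], "args": event.get("arguments", {})},
                )
            )
        # Native events with no Trace equivalent are dropped in this sketch.
    return trace


def grade_tool_used(trace: list[TraceEvent], tool: str) -> float:
    """Toy Grader: 1.0 if the named tool was called, else 0.0."""
    called = any(
        e.kind == "tool_call" and e.payload["name"] == tool for e in trace
    )
    return 1.0 if called else 0.0


native_events = [
    {"type": "assistant_text", "content": "Running the tests."},
    {"type": "function", "fn": "run_tests", "arguments": {"path": "tests/"}},
    {"type": "heartbeat"},
]
trace = adapt_vendor_events(native_events)
score = grade_tool_used(trace, "run_tests")
print(len(trace), score)
```

Because `grade_tool_used` never touches the native dictionaries, a second Adapter with a different native format could reuse it unchanged.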

Two contracts hold it together

  • The Trace: every Harness Adapter translates its native events into one vendor-neutral schema. Graders and the LangFuse export depend on the Trace, never on ACP or a vendor SDK. This is what lets a new Adapter (the openai loop, a future Claude SDK adapter) slot in without touching graders.
  • The Executor seam: provisioning, setup, the command/test graders, and (opt-in, ADR 0014) the Harness itself all run through one run(argv, cwd, env) + filesystem interface. Changing the sandbox backend, whether a local docker container or a Harbor sandbox (Daytona/Modal/E2B/Runloop/GKE, via backend: harbor), is therefore a swap, not a rewrite.
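
A seam like the Executor one could be sketched as a small protocol. The `run(argv, cwd, env)` shape comes from the text above; everything else here is an assumption for illustration: the `RunResult` type, the `HostExecutor` class, the choice to layer `env` over the host environment, and the `command_grader` helper. The filesystem half of the interface is omitted.

```python
import os
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Protocol, Sequence


@dataclass(frozen=True)
class RunResult:
    """Hypothetical result type; the real interface may return something else."""
    returncode: int
    stdout: str
    stderr: str


class Executor(Protocol):
    """Anything with this run() method can act as a backend in this sketch."""

    def run(
        self, argv: Sequence[str], cwd: Path, env: Mapping[str, str]
    ) -> RunResult: ...


class HostExecutor:
    """Runs commands directly on the host via subprocess."""

    def run(
        self, argv: Sequence[str], cwd: Path, env: Mapping[str, str]
    ) -> RunResult:
        proc = subprocess.run(
            list(argv),
            cwd=cwd,
            env={**os.environ, **env},  # assumption: overlay on the host env
            capture_output=True,
            text=True,
        )
        return RunResult(proc.returncode, proc.stdout, proc.stderr)


def command_grader(executor: Executor, argv: Sequence[str], cwd: Path) -> bool:
    """Passes when the command exits 0; it never learns which backend ran it."""
    return executor.run(argv, cwd, env={}).returncode == 0


with tempfile.TemporaryDirectory() as tmp:
    host = HostExecutor()
    ok = command_grader(host, [sys.executable, "-c", "raise SystemExit(0)"], Path(tmp))
    failed = command_grader(host, [sys.executable, "-c", "raise SystemExit(3)"], Path(tmp))
print(ok, failed)
```

A container-backed class with the same `run` method could be passed to `command_grader` in place of `HostExecutor`, which is the sense in which the backend is a swap.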

Why these shapes

Each non-obvious choice is recorded as an ADR under docs/adr/.

Repository layout

src/touchstone/      the framework (config, harness/, grader/, interaction/, runner, report, cli)
evals/               public example cases (run with the bundled `echo` or any harness)
evals-private/       your held-out, never-committed cases (git-ignored)
docs/                this site (+ ADRs under docs/adr/)

The canonical glossary lives in CONTEXT.md; this site's Concepts page is the readable digest.