CLI reference¶
| Global flag | Default | Meaning |
|---|---|---|
| `--evals-dir` | `evals` | directory of case folders to discover |
| `--runs-dir` | `runs` | where run outputs are written |
| `--version` | | print the version and exit |
Commands: run · list · validate · report ·
compare · baseline · export.
run¶
Expand the matrix, execute every cell, and write a report.
| Option | Meaning |
|---|---|
| `--eval ID` | case id to run (repeatable; default: all discovered cases) |
| `--harness NAME` | restrict to these matrix harnesses (repeatable) |
| `--model NAME` | narrow to these declared models (repeatable) |
| `--with-model [HARNESS=]MODEL` | replace a harness's models to push new models through it (repeatable) |
| `--trials N` | override the trial count |
| `--resume RUN_ID` | continue an interrupted run |
| `--workers N` | run cells in parallel (default 1) |
| `--keep-sandboxes` | don't delete sandboxes after the run |
| `--llm-concurrency N` | cap concurrent judge/responder LLM calls |
| `--on-unavailable {fail,skip}` | unreachable required case: abort (`fail`, default) or `skip` |
| `--label NAME` | name this run (stored in the manifest; usable as a compare/baseline ref) |
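To make "expand the matrix" concrete, here is a minimal Python sketch of how cells could be enumerated as the cross product of cases, harnesses, models, and trials. The `Cell` type, the `expand_matrix` function, and the shape of its inputs are hypothetical illustrations, not touchstone's internals.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Cell:
    """One unit of work: a case run once on a (harness, model) pair."""
    case: str
    harness: str
    model: str
    trial: int


def expand_matrix(cases, harness_models, trials):
    """Enumerate every (case, harness, model, trial) combination.

    cases: list of case ids
    harness_models: dict mapping harness name -> list of model names (assumed shape)
    trials: number of trials per combination
    """
    cells = []
    for case in cases:
        for harness, models in harness_models.items():
            for model, trial in product(models, range(1, trials + 1)):
                cells.append(Cell(case, harness, model, trial))
    return cells


cells = expand_matrix(
    cases=["my-task"],
    harness_models={"openai": ["gpt-4o-mini", "llama3.1"]},
    trials=3,
)
print(len(cells))  # 1 case x 1 harness x 2 models x 3 trials = 6
```

Under this reading, `--eval`, `--harness`, and `--model` shrink the inputs to the product, while `--trials` changes its last dimension.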
--model vs --with-model¶
`--model` narrows the models a case already declares (a filter). `--with-model` replaces them, so you can compare arbitrary models on a harness without editing the case. Prefix `HARNESS=` to scope an override to one harness when a run spans several.
# Two models on the same harness, replacing whatever the case declared:
touchstone run --eval my-task --harness openai \
  --with-model "gpt-4o-mini" --with-model "llama3.1"
# Scope overrides per harness in a multi-harness run:
touchstone run --eval my-task \
  --with-model "openai=llama3.1" \
  --with-model "claude-code-stream=claude-opus-4-8"
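The difference between filtering and replacing can be sketched in a few lines of Python. This is an illustration under stated assumptions, not touchstone's code: `parse_with_model` and `resolve_models` are hypothetical names, and letting `--with-model` take precedence when both flags are given is an assumption the reference above does not confirm.

```python
def parse_with_model(value):
    """Split a '[HARNESS=]MODEL' value into (harness_or_None, model)."""
    harness, sep, model = value.partition("=")
    if sep:
        return harness, model
    return None, value  # no prefix: applies to every harness


def resolve_models(declared, harness, model_filters=(), with_models=()):
    """Return the models to run for one harness.

    declared: models the case declares for this harness
    model_filters: values passed via --model (a narrowing filter)
    with_models: raw values passed via --with-model (a replacement)
    """
    overrides = [
        model
        for scope, model in map(parse_with_model, with_models)
        if scope is None or scope == harness
    ]
    if overrides:
        return overrides  # replacement: declared models are ignored (assumed precedence)
    if model_filters:
        return [m for m in declared if m in model_filters]  # filter only
    return list(declared)


scoped = ["openai=llama3.1", "claude-code-stream=claude-opus-4-8"]
print(resolve_models(["gpt-4o"], "openai", with_models=scoped))
print(resolve_models(["gpt-4o", "gpt-4o-mini"], "openai", model_filters=["gpt-4o-mini"]))
```

The first call yields only the override scoped to `openai`; the second keeps just the declared model that matches the filter.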
Parallel & resumable¶
touchstone run --workers 4 # 4 cells at once (each fully isolated)
touchstone run --resume 20260101-1200-ab12 # pick up after a crash
Each cell's result.json is the source of truth and the manifest is a derived index, so
parallel cells never contend and a resumed run skips already-finished cells.
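One way to picture the resume behaviour: if each cell owns its own directory and a readable `result.json` marks it finished, a resumed run only has to look for cells that lack one. The sketch below assumes a `<run_dir>/<cell_id>/result.json` layout and a `pending_cells` helper, both hypothetical; the real on-disk layout may differ.

```python
import json
import tempfile
from pathlib import Path


def pending_cells(run_dir, planned):
    """Return the planned cell ids that have no readable result.json yet."""
    todo = []
    for cell_id in planned:
        result = Path(run_dir) / cell_id / "result.json"  # assumed layout
        try:
            json.loads(result.read_text())
        except (OSError, json.JSONDecodeError):
            todo.append(cell_id)  # missing or half-written: run it again
    return todo


with tempfile.TemporaryDirectory() as run_dir:
    done = Path(run_dir) / "cell-a"
    done.mkdir()
    (done / "result.json").write_text(json.dumps({"score": 1.0}))
    remaining = pending_cells(run_dir, ["cell-a", "cell-b"])
    print(remaining)  # ['cell-b']
```

Because every cell writes only inside its own directory in this sketch, parallel workers have nothing to lock, and the manifest can be rebuilt from the per-cell files at any time.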
list¶
List discovered cases and past runs.
validate¶
Schema-check every case.yaml without running anything.
touchstone validate # schema only
touchstone validate --check-access # also probe external repos are reachable
report¶
Re-render a run's report from its cell results.
touchstone report <run-id> # report.md (default)
touchstone report <run-id> --format html # self-contained report.html (matrix + transcripts)
touchstone report <run-id> --format both # both files
touchstone report <run-id> --format html --open # …and open it in a browser
| Option | Meaning |
|---|---|
| `--format {md,html,both}` | which report(s) to write (default `md`) |
| `--open` | open the rendered report in a browser (the HTML when both were written) |
The HTML report is a single self-contained file (CSS/JS/data inlined, no network, no build step): a clickable comparison matrix, a sortable leaderboard, and per-cell transcripts (nested cards, Tool Kind badges, per-step duration/cost, permission events, grader scores).
compare¶
Join two runs by coordinate and flag regressions. Reads two existing run dirs and writes a
compare-<A>.md (and/or .html) under run B; no historical run is mutated.
touchstone compare # B = latest, A = baseline (else the run before B)
touchstone compare run-a run-b # explicit A → B (run_id / label / latest / baseline)
touchstone compare champ latest --format html
touchstone compare --fail-on-regression # exit non-zero on a SIGNIFICANT paired regression
| Option | Meaning |
|---|---|
| `A` (positional) | baseline run ref; default: the baseline pointer, else the run before B |
| `B` (positional) | candidate run ref; default: `latest` |
| `--threshold T` | score drop that counts as a regression (default 0.0 = any drop) |
| `--require-cases N` | refuse to gate on fewer than N shared cases (default 1) |
| `--fail-on-regression` | exit non-zero only on a significant paired regression |
| `--format {md,html,both}` | comparison artifact format(s) to write (default `md`) |
| `--notify WEBHOOK` | POST a one-line summary to a webhook on regression (best-effort) |
The verdict never rests on a single noisy cell: a regression gates only when, for a
(harness, model) pair, the paired bootstrap CI over the shared cases lies strictly below
-threshold. A per-cell dip whose CI crosses 0 is within noise. The HTML diff (--format html) is a
regressions-first grid with a paired-CI banner; click a regressed row to open both runs'
transcripts side by side.
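The gating rule can be illustrated with a small percentile bootstrap over per-case score differences. This is a sketch under stated assumptions, not the tool's implementation: the function names, the 95% level, the resample count, and the use of the mean difference are all illustrative choices that the reference above does not specify.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of (B - A) over shared cases.

    scores_a, scores_b: dicts mapping case id -> score for one (harness, model).
    """
    shared = sorted(set(scores_a) & set(scores_b))
    if not shared:
        raise ValueError("no shared cases to compare")
    # Pairing: each case contributes one difference, so case difficulty cancels out.
    deltas = [scores_b[c] - scores_a[c] for c in shared]
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(deltas, k=len(deltas))  # resample cases with replacement
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def is_regression(scores_a, scores_b, threshold=0.0):
    """Gate only when the whole CI lies strictly below -threshold."""
    _, hi = paired_bootstrap_ci(scores_a, scores_b)
    return hi < -threshold


baseline = {c: 0.9 for c in "abcde"}
clear_drop = {c: 0.5 for c in "abcde"}
noisy = {"a": 1.0, "b": 0.8, "c": 1.0, "d": 0.8, "e": 0.9}

print(is_regression(baseline, clear_drop))  # consistent drop on every case
print(is_regression(baseline, noisy))       # ups and downs that average out
```

In this sketch the consistent drop gates because the upper end of its interval is below zero, while the mixed result does not because its interval straddles zero. Raising `threshold` makes the gate stricter about how large a drop must be.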
CI / cron gate (no bespoke Action needed): run the matrix, then compare against the baseline and fail the job on a real drop.
touchstone run --harness openai --with-model <model> --label nightly
touchstone compare --fail-on-regression --notify "$SLACK_WEBHOOK"
baseline¶
Manage the "blessed" run new runs are compared against (a runs/.baseline pointer).
touchstone baseline set <ref> # pin to a run_id / label / latest (resolved to a concrete id)
touchstone baseline show # print the current baseline
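The run refs accepted by `compare` and `baseline set` (a run_id, a label, `latest`, or `baseline`) can be thought of as a small lookup over the known runs. The sketch below is hypothetical: the `runs` list of dicts, the `resolve_ref` and `default_pair` helpers, and the rule that a label shared by several runs resolves to the newest one are assumptions made for illustration.

```python
def resolve_ref(ref, runs, baseline=None):
    """Resolve a run ref to a concrete run_id.

    runs: list of {"run_id": ..., "label": ...} dicts, oldest first (assumed shape)
    baseline: run_id stored in the baseline pointer, or None if unset
    """
    if ref == "latest":
        return runs[-1]["run_id"]
    if ref == "baseline":
        if baseline is None:
            raise LookupError("no baseline set")
        return baseline
    for run in reversed(runs):  # newest match wins (assumption)
        if ref in (run["run_id"], run.get("label")):
            return run["run_id"]
    raise LookupError(f"unknown run ref: {ref}")


def default_pair(runs, baseline=None):
    """Defaults for `compare`: B = latest, A = baseline, else the run before B."""
    b = runs[-1]["run_id"]
    if baseline is not None:
        return baseline, b
    if len(runs) < 2:
        raise LookupError("need at least two runs to compare")
    return runs[-2]["run_id"], b


runs = [
    {"run_id": "r1", "label": "champ"},
    {"run_id": "r2", "label": None},
    {"run_id": "r3", "label": "nightly"},
]
print(resolve_ref("champ", runs))         # r1
print(default_pair(runs))                 # ('r2', 'r3')
print(default_pair(runs, baseline="r1"))  # ('r1', 'r3')
```

Storing the resolved id rather than the ref itself means a pointer set from `latest` keeps referring to the same run after newer runs appear.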
export¶
Export a run's Traces to LangFuse JSON (needs the langfuse extra).
touchstone export <run-id> # write langfuse.json
touchstone export <run-id> --push # push to a configured LangFuse
Common recipes¶
# Offline smoke test — no API spend
touchstone run --eval example-case --harness echo
# Any provider, any model — no vendor CLI
export OPENAI_BASE_URL=http://localhost:11434/v1
touchstone run --harness openai --with-model llama3.1
# Your private held-out tasks
touchstone --evals-dir evals-private run --harness openai --with-model <model>
# Compare a whole battery across two models, 3 trials each, 4 at a time
touchstone run --harness openai \
  --with-model A --with-model B \
  --trials 3 --workers 4