CLI reference

touchstone [--evals-dir DIR] [--runs-dir DIR] <command> [options]

Global flag	Default	Meaning
`--evals-dir`	`evals`	directory of case folders to discover
`--runs-dir`	`runs`	where run outputs are written
`--version`		print the version and exit

Commands: run · list · validate · report · compare · baseline · export.

run

Expand the matrix, execute every cell, and write a report.

touchstone run [options]

Option	Meaning
`--eval ID`	case id to run (repeatable; default: all discovered cases)
`--harness NAME`	restrict to these matrix harnesses (repeatable)
`--model NAME`	narrow to these declared models (repeatable)
`--with-model [HARNESS=]MODEL`	replace a harness's models — push new models through it (repeatable)
`--trials N`	override the trial count
`--resume RUN_ID`	continue an interrupted run
`--workers N`	run cells in parallel (default `1`)
`--keep-sandboxes`	don't delete sandboxes after the run
`--llm-concurrency N`	cap concurrent judge/responder LLM calls
`--on-unavailable {fail,skip}`	when a required case is unreachable: abort (`fail`, default) or skip it (`skip`)
`--label NAME`	name this run (stored in the manifest; usable as a `compare`/`baseline` ref)

`--model` vs `--with-model`

--model narrows the models a case already declares (a filter).
--with-model replaces them — so you can compare arbitrary models on a harness without editing the case. Prefix HARNESS= to scope an override to one harness when a run spans several.

# Two models on the same harness, replacing whatever the case declared:
touchstone run --eval my-task --harness openai \
  --with-model "gpt-4o-mini" --with-model "llama3.1"

# Scope overrides per harness in a multi-harness run:
touchstone run --eval my-task \
  --with-model "openai=llama3.1" \
  --with-model "claude-code-stream=claude-opus-4-8"
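
The filter-versus-replace distinction can be sketched in a few lines of Python. This is an illustration only, not touchstone's implementation: `parse_override` and `resolve_models` are hypothetical names, and the sketch assumes that an applicable `--with-model` override wins over `--model` when both are given, since the reference does not say how the two combine.

```python
from typing import Optional, Sequence


def parse_override(spec: str) -> tuple[Optional[str], str]:
    """Split a '[HARNESS=]MODEL' spec into (harness or None, model).

    Assumes model names themselves contain no '='.
    """
    harness, sep, model = spec.partition("=")
    return (harness, model) if sep else (None, spec)


def resolve_models(
    harness: str,
    declared: Sequence[str],
    narrow: Sequence[str] = (),
    with_model: Sequence[str] = (),
) -> list[str]:
    """Return the models to run for one harness (hypothetical helper)."""
    overrides = [parse_override(spec) for spec in with_model]
    # An unscoped override applies to every harness; a scoped one only to its own.
    replaced = [m for h, m in overrides if h is None or h == harness]
    if replaced:
        return replaced  # --with-model replaces the declared models
    if narrow:
        # --model only filters what the case already declares
        return [m for m in declared if m in narrow]
    return list(declared)
```

With a case that declares `["gpt-4o", "gpt-4o-mini"]`, `narrow=["gpt-4o-mini"]` keeps one declared model, whereas `with_model=["llama3.1"]` discards the declaration and runs only `llama3.1`.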

Parallel & resumable

touchstone run --workers 4               # 4 cells at once (each fully isolated)
touchstone run --resume 20260101-1200-ab12   # pick up after a crash

Each cell's result.json is the source of truth and the manifest is a derived index, so parallel cells never contend and a resumed run skips already-finished cells.
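
A rough sketch of why this design makes resuming cheap: if the set of finished cells is always recomputed from the result.json files, there is no shared state to corrupt. The `cells/<cell-id>/result.json` layout and all function names below are assumptions made for illustration; the actual on-disk layout is not described here.

```python
import json
from pathlib import Path


def finished_cells(run_dir: Path) -> set[str]:
    """Ids of cells whose result.json exists and parses as JSON.

    Assumes a hypothetical layout: <run_dir>/cells/<cell-id>/result.json.
    """
    done = set()
    for result in run_dir.glob("cells/*/result.json"):
        try:
            json.loads(result.read_text())
        except (OSError, json.JSONDecodeError):
            continue  # unreadable or half-written: treat the cell as unfinished
        done.add(result.parent.name)
    return done


def cells_to_run(run_dir: Path, planned: list[str]) -> list[str]:
    """On resume, keep only the planned cells that have no valid result yet."""
    done = finished_cells(run_dir)
    return [cell for cell in planned if cell not in done]


def rebuild_manifest(run_dir: Path) -> dict:
    """The manifest is derived: it can always be regenerated from the cells."""
    return {"cells": sorted(finished_cells(run_dir))}
```

Because each worker writes only inside its own cell directory and the index is rebuilt from those files, nothing here needs a lock.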

list

List discovered cases and past runs.

touchstone list

validate

Schema-check every case.yaml without running anything.

touchstone validate                  # schema only
touchstone validate --check-access   # also probe that external repos are reachable

report

Re-render a run's report from its cell results.

touchstone report <run-id>                 # report.md (default)
touchstone report <run-id> --format html   # self-contained report.html (matrix + transcripts)
touchstone report <run-id> --format both   # both files
touchstone report <run-id> --format html --open   # …and open it in a browser

Option	Meaning
`--format {md,html,both}`	which report(s) to write (default `md`)
`--open`	open the rendered report in a browser (the HTML when both were written)

The HTML report is a single self-contained file (CSS/JS/data inlined, no network, no build step): a clickable comparison matrix, a sortable leaderboard, and per-cell transcripts (nested cards, Tool Kind badges, per-step duration/cost, permission events, grader scores).

compare

Join two runs by coordinate and flag regressions. Reads two existing run dirs and writes a compare-<A>.md (and/or .html) under run B; no historical run is mutated.

touchstone compare                          # B = latest, A = baseline (else the run before B)
touchstone compare run-a run-b              # explicit A → B (run_id / label / latest / baseline)
touchstone compare champ latest --format html
touchstone compare --fail-on-regression     # exit non-zero on a SIGNIFICANT paired regression

Option	Meaning
`A` (positional)	baseline run ref; default: the `baseline` pointer, else the run before B
`B` (positional)	candidate run ref; default: `latest`
`--threshold T`	score drop that counts as a regression (default `0.0` = any drop)
`--require-cases N`	refuse to gate on fewer than N shared cases (default `1`)
`--fail-on-regression`	exit non-zero only on a significant paired regression
`--format {md,html,both}`	comparison artifact format(s) to write (default `md`)
`--notify WEBHOOK`	POST a one-line summary to a webhook on regression (best-effort)

The verdict never hinges on a single noisy cell: a regression gates only when the paired bootstrap CI for a (harness, model) pair, computed over the cases both runs share, lies strictly below -threshold. A per-cell dip whose CI crosses 0 is treated as within noise. The HTML diff (--format html) is a regressions-first grid with a paired-CI banner; click a regressed row to open both runs' transcripts side by side.
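
To make the gating rule concrete, here is a minimal sketch of a paired percentile bootstrap over per-case score differences for one (harness, model) pair. It is not touchstone's code: the function names, the 95% level, the resample count, and the choice to return `False` when too few cases are shared are all assumptions for illustration.

```python
import random
from statistics import mean


def paired_bootstrap_ci(a_scores, b_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI for mean(B - A) over the cases present in both runs."""
    shared = sorted(a_scores.keys() & b_scores.keys())
    if not shared:
        raise ValueError("no shared cases")
    # Pairing: each case contributes one difference, B minus A.
    deltas = [b_scores[case] - a_scores[case] for case in shared]
    rng = random.Random(seed)
    # Resample the differences with replacement and record each resample's mean.
    means = sorted(
        mean(rng.choices(deltas, k=len(deltas))) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def is_regression(a_scores, b_scores, threshold=0.0, require_cases=1, **kwargs):
    """True only when the whole CI lies strictly below -threshold."""
    shared = a_scores.keys() & b_scores.keys()
    if len(shared) < require_cases:
        return False  # assumption: too few shared cases means "do not gate"
    _, hi = paired_bootstrap_ci(a_scores, b_scores, **kwargs)
    return hi < -threshold
```

When B scores lower than A on every shared case, the upper bound is negative and the pair gates. When the differences alternate in sign, the interval straddles 0 and the dip counts as noise.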

CI / cron gate (no bespoke Action needed): run the matrix, then compare against the baseline and fail the job on a real drop.

touchstone run --harness openai --with-model <model> --label nightly
touchstone compare --fail-on-regression --notify "$SLACK_WEBHOOK"

baseline

Manage the "blessed" run new runs are compared against (a runs/.baseline pointer).

touchstone baseline set <ref>   # pin to a run_id / label / latest (resolved to a concrete id)
touchstone baseline show        # print the current baseline
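
The ref forms accepted here (run_id, label, latest, baseline) suggest a small resolution step before the pointer is written. The sketch below is hypothetical: it assumes run directories sort chronologically by id, that each run has a `manifest.json` with a `label` key, and that the pointer is a plain text file holding one run id. None of these details are confirmed by this reference.

```python
import json
from pathlib import Path

POINTER = ".baseline"  # the runs/.baseline pointer; plain-text content is assumed


def _run_ids(runs_dir: Path) -> list[str]:
    # Assumes ids such as 20260101-1200-ab12 sort oldest to newest.
    return sorted(p.name for p in runs_dir.iterdir() if p.is_dir())


def resolve_ref(runs_dir: Path, ref: str) -> str:
    """Resolve a run_id / label / 'latest' / 'baseline' ref to a concrete run id."""
    ids = _run_ids(runs_dir)
    if ref == "latest":
        if not ids:
            raise LookupError("no runs found")
        return ids[-1]
    if ref == "baseline":
        pointer = runs_dir / POINTER
        if not pointer.is_file():
            raise LookupError("no baseline set")
        return pointer.read_text().strip()
    if ref in ids:
        return ref
    # Otherwise treat the ref as a label; the newest matching run wins (assumption).
    for run_id in reversed(ids):
        manifest = runs_dir / run_id / "manifest.json"  # hypothetical filename
        if manifest.is_file() and json.loads(manifest.read_text()).get("label") == ref:
            return run_id
    raise LookupError(f"unknown run ref: {ref}")


def set_baseline(runs_dir: Path, ref: str) -> str:
    """Pin the baseline to a concrete id, so later runs cannot shift it."""
    run_id = resolve_ref(runs_dir, ref)
    (runs_dir / POINTER).write_text(run_id + "\n")
    return run_id
```

Storing the resolved id rather than the ref itself is what keeps a baseline set from `latest` stable after newer runs appear.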

export

Export a run's Traces to LangFuse JSON (needs the langfuse extra).

touchstone export <run-id>          # write langfuse.json
touchstone export <run-id> --push   # push to a configured LangFuse

Common recipes

# Offline smoke test — no API spend
touchstone run --eval example-case --harness echo

# Any provider, any model — no vendor CLI
export OPENAI_BASE_URL=http://localhost:11434/v1
touchstone run --harness openai --with-model llama3.1

# Your private held-out tasks
touchstone --evals-dir evals-private run --harness openai --with-model <model>

# Compare a whole battery across two models, 3 trials each, 4 at a time
touchstone run --harness openai \
  --with-model A --with-model B \
  --trials 3 --workers 4