CLI reference

touchstone [--evals-dir DIR] [--runs-dir DIR] <command> [options]
Global flag    Default    Meaning
--evals-dir    evals      directory of case folders to discover
--runs-dir     runs       where run outputs are written
--version                 print the version and exit

Commands: run · list · validate · report · compare · baseline · export.

run

Expand the matrix, execute every cell, and write a report.

touchstone run [options]
Option                          Meaning
--eval ID                       case id to run (repeatable; default: all discovered cases)
--harness NAME                  restrict to these matrix harnesses (repeatable)
--model NAME                    narrow to these declared models (repeatable)
--with-model [HARNESS=]MODEL    replace a harness's models, pushing new models through it (repeatable)
--trials N                      override the trial count
--resume RUN_ID                 continue an interrupted run
--workers N                     run cells in parallel (default 1)
--keep-sandboxes                don't delete sandboxes after the run
--llm-concurrency N             cap concurrent judge/responder LLM calls
--on-unavailable {fail,skip}    unreachable required case: abort (fail, default) or skip
--label NAME                    name this run (stored in the manifest; usable as a compare/baseline ref)
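
As a rough illustration of what "expand the matrix" could mean, the Python sketch below builds one cell per (case, harness, model, trial). The function name `expand_matrix`, the dict-shaped matrix, and the trial index inside each cell are assumptions made for illustration; they are not taken from touchstone's internals.

```python
from itertools import product


def expand_matrix(cases, matrix, trials=1):
    """Return one cell per (case, harness, model, trial).

    `matrix` maps a harness name to the list of models to push through it.
    This shape is hypothetical, chosen only to make the idea concrete.
    """
    cells = []
    for case in cases:
        for harness, models in matrix.items():
            # Every model is paired with every trial index for this harness.
            for model, trial in product(models, range(trials)):
                cells.append((case, harness, model, trial))
    return cells


if __name__ == "__main__":
    for cell in expand_matrix(["my-task"], {"openai": ["A", "B"]}, trials=3):
        print(cell)
```

Under these assumptions, one case, one harness, two models, and three trials yield six cells, which is the unit that `--workers` would schedule in parallel.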

--model vs --with-model

  • --model narrows the models a case already declares (a filter).
  • --with-model replaces them — so you can compare arbitrary models on a harness without editing the case. Prefix HARNESS= to scope an override to one harness when a run spans several.
# Two models on the same harness, replacing whatever the case declared:
touchstone run --eval my-task --harness openai \
  --with-model "gpt-4o-mini" --with-model "llama3.1"

# Scope overrides per harness in a multi-harness run:
touchstone run --eval my-task \
  --with-model "openai=llama3.1" \
  --with-model "claude-code-stream=claude-opus-4-8"
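
The filter-versus-replace distinction can be sketched in Python as below. This is a minimal model of the behavior described above, not touchstone's code: `resolve_models` is a hypothetical helper, and the choice to ignore `--model` filters once an override applies is an assumption.

```python
def resolve_models(declared, harness, model_filters=(), with_models=()):
    """Pick the models to run on one harness.

    declared      -- models the case already declares for this harness
    model_filters -- values passed via --model (narrow the declared set)
    with_models   -- values passed via --with-model, optionally "HARNESS=MODEL"
    """
    overrides = []
    for spec in with_models:
        scope, sep, name = spec.partition("=")
        if not sep:
            overrides.append(spec)   # unscoped: applies to every harness
        elif scope == harness:
            overrides.append(name)   # scoped: applies to this harness only

    if overrides:
        # Replacement: the declared models are discarded entirely.
        # (Assumption: --model filters are not applied to overrides.)
        return overrides

    if model_filters:
        # Filter: keep only declared models that were asked for.
        return [m for m in declared if m in model_filters]
    return list(declared)


if __name__ == "__main__":
    print(resolve_models(["gpt-4o"], "openai",
                         with_models=["gpt-4o-mini", "llama3.1"]))
```

A scoped override for a different harness leaves the declared models untouched, which is why the `HARNESS=` prefix matters in multi-harness runs.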

Parallel & resumable

touchstone run --workers 4               # 4 cells at once (each fully isolated)
touchstone run --resume 20260101-1200-ab12   # pick up after a crash

Each cell's result.json is the source of truth and the manifest is a derived index, so parallel cells never contend and a resumed run skips already-finished cells.
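
One way to picture "result.json is the source of truth" is the Python sketch below: a resume step scans for finished cells and runs only the rest, and the manifest is rebuilt from the same scan. The directory layout (`<run-dir>/<cell-id>/result.json`) and all function names are assumptions for illustration; only the `result.json` file name comes from the text above.

```python
import json
from pathlib import Path


def finished_cells(run_dir):
    """Return ids of cells whose result.json exists and parses as JSON."""
    done = set()
    # Hypothetical layout: one sub-directory per cell.
    for path in Path(run_dir).glob("*/result.json"):
        try:
            json.loads(path.read_text())
        except (OSError, json.JSONDecodeError):
            continue  # unreadable or half-written file: treat as unfinished
        done.add(path.parent.name)
    return done


def pending_cells(run_dir, all_cells):
    """Cells a resumed run still has to execute, in their original order."""
    done = finished_cells(run_dir)
    return [cell for cell in all_cells if cell not in done]


def rebuild_manifest(run_dir):
    """Derive the index from the cell results; nothing is stored twice."""
    return {"cells": sorted(finished_cells(run_dir))}
```

Because each cell writes only its own file and the index is recomputed from those files, parallel workers in this sketch never share a writable resource.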

list

List discovered cases and past runs.

touchstone list

validate

Schema-check every case.yaml without running anything.

touchstone validate                  # schema only
touchstone validate --check-access   # also probe that external repos are reachable

report

Re-render a run's report from its cell results.

touchstone report <run-id>                 # report.md (default)
touchstone report <run-id> --format html   # self-contained report.html (matrix + transcripts)
touchstone report <run-id> --format both   # both files
touchstone report <run-id> --format html --open   # …and open it in a browser
Option                     Meaning
--format {md,html,both}    which report(s) to write (default md)
--open                     open the rendered report in a browser (the HTML when both were written)

The HTML report is a single self-contained file (CSS/JS/data inlined, no network, no build step): a clickable comparison matrix, a sortable leaderboard, and per-cell transcripts (nested cards, Tool Kind badges, per-step duration/cost, permission events, grader scores).

compare

Join two runs by coordinate and flag regressions. Reads two existing run dirs and writes a compare-<A>.md (and/or .html) under run B; no historical run is mutated.

touchstone compare                          # B = latest, A = baseline (else the run before B)
touchstone compare run-a run-b              # explicit A → B (run_id / label / latest / baseline)
touchstone compare champ latest --format html
touchstone compare --fail-on-regression     # exit non-zero on a SIGNIFICANT paired regression
Option                     Meaning
A (positional)             baseline run ref; default: the baseline pointer, else the run before B
B (positional)             candidate run ref; default: latest
--threshold T              score drop that counts as a regression (default 0.0 = any drop)
--require-cases N          refuse to gate on fewer than N shared cases (default 1)
--fail-on-regression       exit non-zero only on a significant paired regression
--format {md,html,both}    comparison artifact format(s) to write (default md)
--notify WEBHOOK           POST a one-line summary to a webhook on regression (best-effort)

The verdict is never a single noisy cell: a regression gates only when a (harness, model) paired bootstrap CI over the shared cases lies strictly below -threshold. A per-cell dip whose CI crosses 0 is within noise. The HTML diff (--format html) is a regressions-first grid with a paired-CI banner; click a regressed row to open both runs' transcripts side by side.
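
The gating rule can be sketched in Python as a percentile bootstrap over per-case score differences for one (harness, model) pair. This is an illustrative sketch, not touchstone's implementation: the function names, the 95% interval, the resample count, and the fixed seed are all assumptions.

```python
import random


def paired_bootstrap_ci(diffs, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired differences."""
    rng = random.Random(seed)
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample the cases with replacement, keeping each A/B pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]


def gates_regression(run_a, run_b, threshold=0.0, require_cases=1):
    """True when the whole CI of (B - A) lies strictly below -threshold.

    run_a / run_b map case id -> score for a single (harness, model) pair.
    """
    shared = sorted(set(run_a) & set(run_b))  # join by case coordinate
    if not shared or len(shared) < require_cases:
        return False  # too few shared cases: do not gate
    diffs = [run_b[case] - run_a[case] for case in shared]
    _, upper = paired_bootstrap_ci(diffs)
    return upper < -threshold
```

In this sketch a consistent drop across every shared case gates, while a mix of gains and losses whose interval crosses 0 does not, matching the "within noise" reading above.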

CI / cron gate (no bespoke Action needed): run the matrix, then compare against the baseline and fail the job on a real drop.

touchstone run --harness openai --with-model <model> --label nightly
touchstone compare --fail-on-regression --notify "$SLACK_WEBHOOK"

baseline

Manage the "blessed" run that new runs are compared against (a runs/.baseline pointer).

touchstone baseline set <ref>   # pin to a run_id / label / latest (resolved to a concrete id)
touchstone baseline show        # print the current baseline
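
How a ref (run_id / label / latest / baseline) might resolve to a concrete run id, and how compare's default A and B could be chosen, is sketched in Python below. The in-memory list of `(run_id, label)` tuples, the function names, and the "newest matching label wins" rule are assumptions for illustration; the tool's actual lookup may differ.

```python
def resolve_ref(ref, runs, baseline_id=None):
    """Resolve a ref to a concrete run id.

    runs        -- list of (run_id, label) tuples, oldest first (hypothetical)
    baseline_id -- the id stored in the baseline pointer, if any
    """
    if not runs:
        raise LookupError("no runs found")
    if ref == "latest":
        return runs[-1][0]
    if ref == "baseline":
        if baseline_id is None:
            raise LookupError("no baseline set")
        return baseline_id
    # Assumption: when several runs share a label, the newest one wins.
    for run_id, label in reversed(runs):
        if ref == run_id or ref == label:
            return run_id
    raise LookupError(f"unknown run ref: {ref!r}")


def default_pair(runs, baseline_id=None):
    """Defaults for compare: B = latest, A = baseline, else the run before B."""
    b = resolve_ref("latest", runs)
    if baseline_id is not None:
        return baseline_id, b
    if len(runs) < 2:
        raise LookupError("need a baseline or at least two runs")
    return runs[-2][0], b
```

Resolving a label or `latest` to a fixed id at `baseline set` time means that, in this sketch, later runs cannot silently move the baseline.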

export

Export a run's Traces to LangFuse JSON (needs the langfuse extra).

touchstone export <run-id>          # write langfuse.json
touchstone export <run-id> --push   # push to a configured LangFuse instance

Common recipes

# Offline smoke test — no API spend
touchstone run --eval example-case --harness echo

# Any provider, any model — no vendor CLI
export OPENAI_BASE_URL=http://localhost:11434/v1
touchstone run --harness openai --with-model llama3.1

# Your private held-out tasks
touchstone --evals-dir evals-private run --harness openai --with-model <model>

# Compare a whole battery across two models, 3 trials each, 4 at a time
touchstone run --harness openai \
  --with-model A --with-model B \
  --trials 3 --workers 4