Quickstart
This page takes you from zero to a graded comparison report. It's written to be followed by a human or an agent — every step is a copy-pasteable command with the expected outcome.
1. Install
touchstone is published on PyPI as touchstone-eval (the import package and CLI stay
touchstone).
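A minimal install sketch, assuming pip is your installer and the target environment is already active (only the distribution name comes from the text above):

```shell
# Install the distribution named above; the CLI and import package are both "touchstone".
python -m pip install touchstone-eval
```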
Verify the CLI:
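One shell-level check that relies on no touchstone-specific flag is to confirm the executable is on your PATH. This is a sketch; `cli_path` is a hypothetical variable name used only for illustration:

```shell
# "command -v" is a POSIX shell builtin that prints the resolved path of a command.
cli_path="$(command -v touchstone || true)"

if [ -n "$cli_path" ]; then
  echo "touchstone found at: $cli_path"
else
  echo "touchstone is not on PATH; check that the install targeted the active environment" >&2
fi
```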
The core install has no network dependencies. Optional extras pull in only what you use:
| Extra | Adds | Needed for |
|---|---|---|
| openai | openai SDK | the OpenAI-compatible harness (any provider) |
| judge | anthropic SDK | the model-as-judge grader & the llm-based responder |
| langfuse | langfuse | exporting traces to LangFuse |
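Extras use pip's standard bracket syntax. The sketch below assumes pip and picks two of the extras from the table; `extras` is a hypothetical variable used only to make the selection easy to edit:

```shell
# Any comma-separated subset of the extras listed in the table: openai, judge, langfuse.
extras="openai,judge"

# Quote the requirement so the shell does not treat the brackets as a glob pattern.
python -m pip install "touchstone-eval[${extras}]"
```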
2. Smoke-test the loop (no API spend)
The built-in echo harness runs the whole pipeline — sandbox → harness → graders →
report — with no model and no network. Use it to confirm your install works:
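A likely invocation is sketched below. It is inferred from the `--eval` and `--harness` flags used in section 4 and from the cell id in the expected output, so treat it as an assumption rather than a verified command:

```shell
# Assumed invocation: the bundled example-case run through the built-in echo harness.
touchstone run --eval example-case --harness echo
```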
You should see a cell pass and a report path printed:
Run 20260101-1200-ab12: 1 cells (1 to run, 0 skipped)
[example-case__echo__opus__t1] PASS (score=1.0)
Done. Report: runs/20260101-1200-ab12/report.md
3. Run a real model — any provider
The openai harness drives any OpenAI-compatible endpoint. It is provider-neutral: it
reads the OpenAI SDK's standard env vars, so you point it anywhere by setting OPENAI_BASE_URL
(and a key, if the endpoint needs one).
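As a sketch of pointing the harness at a local server: the URL below assumes vLLM's default OpenAI-compatible address, and the key and model name are placeholders to replace with your own values.

```shell
# Assumed endpoint: vLLM's OpenAI-compatible server listens on port 8000 by default.
export OPENAI_BASE_URL="http://localhost:8000/v1"

# Placeholder key; set a real one only if your endpoint requires it.
export OPENAI_API_KEY="replace-with-your-key"

# Same flags as the comparison command in section 4; "my-local-model" is hypothetical.
touchstone run --eval example-case --harness openai --with-model "my-local-model"
```

Switching providers then means changing the two exported variables; the `touchstone run` line stays the same apart from the model name.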
Swap the endpoint, keep the harness
The same --harness openai works against vLLM, LM Studio, a LiteLLM proxy, Groq, DeepSeek,
OpenAI itself, and other compatible endpoints, so you're never tied to one provider. See the
OpenAI-compatible harness for the full story.
4. Compare models head-to-head
--with-model replaces a case's declared models for the named harness, so you can push new
models through the same harness without editing the case:
touchstone run --eval example-case \
--harness openai \
--with-model "deepseek-v3.1:cloud" \
--with-model "gpt-oss:120b"
Each model becomes its own row in the report — same task, same harness, same graders.
5. Read the report
Every run writes a report.md plus per-cell artifacts under runs/<run-id>/:
runs/<run-id>/
├── report.md # the comparison, rendered
├── manifest.json # derived index of all cells
└── cells/<case>__<harness>__<model>__t<trial>/
├── result.json # the cell's source of truth (score, metrics)
├── trace.jsonl # the normalized Trace (tool calls, tokens, …)
├── transcript.jsonl # raw harness transcript
└── sandbox/ # the working tree the model produced
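Because each cell directory name encodes the case, harness, model, and trial, a few lines of shell can recover those fields when you script over a run. This is a sketch: `split_cell_id` is a hypothetical helper, not part of touchstone, and it assumes `__` appears only as the field separator in the case and harness names.

```shell
# Split a cell directory name of the form <case>__<harness>__<model>__t<trial>.
split_cell_id() {
  cell_id="$1"
  cell_case="${cell_id%%__*}"      # text before the first "__"
  rest="${cell_id#*__}"            # drop the case field
  cell_harness="${rest%%__*}"      # text before the next "__"
  rest="${rest#*__}"               # drop the harness field
  cell_model="${rest%__t*}"        # drop the trailing "__t<trial>"
  cell_trial="${rest##*__t}"       # keep only the trial number
}

# Run id taken from the sample output in section 2; substitute your own.
run_dir="runs/20260101-1200-ab12"

for dir in "$run_dir"/cells/*/; do
  [ -d "$dir" ] || continue        # skip when the glob matched nothing
  split_cell_id "$(basename "$dir")"
  echo "case=$cell_case harness=$cell_harness model=$cell_model trial=$cell_trial"
done
```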
Re-render a report any time:
6. Write your first case
You've been using the bundled example-case. To grade your own work, add a case under evals/
and run it. Head to Authoring cases: the minimal case is a prompt, a source
folder, a harness/model matrix, and one grader.
Next steps
- Concepts — the vocabulary: Case, Run, Cell, Harness, Grader, Trace.
- Harnesses — which adapters exist and what each can observe.
- Authoring cases — the full case.yaml reference.
- Graders — how outcomes turn into scores.