Quickstart

This page takes you from zero to a graded comparison report. It's written to be followed by a human or an agent — every step is a copy-pasteable command with the expected outcome.

1. Install

touchstone is published on PyPI as touchstone-eval (the import package and CLI stay touchstone).

pip install "touchstone-eval[openai]"   # [openai] adds the provider-neutral harness
# or, as a uv tool:
uv tool install "touchstone-eval[openai]"

Verify the CLI:

touchstone --version
touchstone --help
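For an agent or script following along, the same check can be done from Python. This is a minimal sketch that assumes only what is stated above (the distribution is named touchstone-eval and the CLI is named touchstone). The helper names are illustrative and not part of touchstone.

```python
import shutil
from importlib import metadata  # standard library, Python 3.8+


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def cli_on_path(command):
    """Return True if an executable with this name is found on PATH."""
    return shutil.which(command) is not None


if __name__ == "__main__":
    print("touchstone-eval version:", installed_version("touchstone-eval"))
    print("touchstone CLI on PATH:", cli_on_path("touchstone"))
```

A None version or a False result suggests the install went into a different environment than the one your shell is using.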

The core install has no network dependencies. Optional extras pull in only what you use:

| Extra | Adds | Needed for |
|---|---|---|
| openai | openai SDK | the OpenAI-compatible harness (any provider) |
| judge | anthropic SDK | the model-as-judge grader & the LLM-based responder |
| langfuse | langfuse | exporting traces to LangFuse |

2. Smoke-test the loop (no API spend)

The built-in echo harness runs the whole pipeline — sandbox → harness → graders → report — with no model and no network. Use it to confirm your install works:

touchstone run --eval example-case --harness echo --trials 1

You should see a cell pass and a report path printed:

Run 20260101-1200-ab12: 1 cells (1 to run, 0 skipped)
  [example-case__echo__opus__t1] PASS (score=1.0)
Done. Report: runs/20260101-1200-ab12/report.md
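If a script or agent needs to confirm the outcome rather than eyeball it, the cell lines can be parsed. The sketch below infers the line format purely from the sample output above, so treat the pattern as an assumption that may need adjusting against real output.

```python
import re

# ASSUMPTION: pattern inferred from the sample output; the real format may differ.
CELL_LINE = re.compile(
    r"^\s*\[(?P<cell>[^\]]+)\]\s+(?P<status>[A-Z]+)\s+\(score=(?P<score>[0-9.]+)\)\s*$"
)


def parse_cell_lines(output):
    """Extract (cell_id, status, score) tuples from captured run output."""
    results = []
    for line in output.splitlines():
        match = CELL_LINE.match(line)
        if match:
            results.append((match["cell"], match["status"], float(match["score"])))
    return results


sample = """\
Run 20260101-1200-ab12: 1 cells (1 to run, 0 skipped)
  [example-case__echo__opus__t1] PASS (score=1.0)
Done. Report: runs/20260101-1200-ab12/report.md
"""

print(parse_cell_lines(sample))
# -> [('example-case__echo__opus__t1', 'PASS', 1.0)]
```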

3. Run a real model — any provider

The openai harness drives any OpenAI-compatible endpoint. It is provider-neutral: it reads the OpenAI SDK's standard env vars, so you point it anywhere by setting OPENAI_BASE_URL (and a key, if the endpoint needs one).

# Ollama (local server)
export OPENAI_BASE_URL=http://localhost:11434/v1   # Ollama's OpenAI endpoint
# no key needed for a local server
touchstone run --eval example-case --harness openai --with-model llama3.1

# OpenRouter
export OPENAI_BASE_URL=https://openrouter.ai/api/v1
export OPENAI_API_KEY=sk-or-...
touchstone run --eval example-case --harness openai \
  --with-model "meta-llama/llama-3.1-70b-instruct"

# OpenAI
export OPENAI_API_KEY=sk-...        # default endpoint; leave OPENAI_BASE_URL unset
touchstone run --eval example-case --harness openai --with-model gpt-4o-mini

Swap the endpoint, keep the harness

The same --harness openai works against vLLM, LM Studio, a LiteLLM proxy, Groq, DeepSeek, OpenAI itself, and other compatible endpoints, so you're never tied to one provider. See the OpenAI-compatible harness for the full story.
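Before spending on a run, it can help to confirm which endpoint the environment variables actually point at. The sketch below builds a request for the endpoint's model listing from OPENAI_BASE_URL and OPENAI_API_KEY. It assumes the endpoint exposes the OpenAI-style GET /models route, which not every compatible server is guaranteed to do. The function name is illustrative and not part of touchstone.

```python
import os
import urllib.request

# Used when OPENAI_BASE_URL is unset, mirroring the OpenAI SDK's default.
DEFAULT_BASE_URL = "https://api.openai.com/v1"


def models_request(env=None):
    """Build a GET request for <base_url>/models from OpenAI-style env vars."""
    env = os.environ if env is None else env
    base_url = env.get("OPENAI_BASE_URL") or DEFAULT_BASE_URL
    request = urllib.request.Request(base_url.rstrip("/") + "/models")
    api_key = env.get("OPENAI_API_KEY")
    if api_key:
        # Only send a key when one is configured (local servers may need none).
        request.add_header("Authorization", f"Bearer {api_key}")
    return request


if __name__ == "__main__":
    req = models_request()
    print("Endpoint in effect:", req.full_url)
    # To actually query it (makes a network call):
    # with urllib.request.urlopen(req, timeout=10) as resp:
    #     print(resp.status)
```

Printing the URL first makes a stale OPENAI_BASE_URL from an earlier shell session easy to spot.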

4. Compare models head-to-head

--with-model replaces a case's declared models for the named harness, so you can push new models through the same harness without editing the case:

touchstone run --eval example-case \
  --harness openai \
  --with-model "deepseek-v3.1:cloud" \
  --with-model "gpt-oss:120b"

Each model becomes its own row in the report — same task, same harness, same graders.
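When the model list comes from a script rather than being typed by hand, the command can be assembled programmatically. This sketch uses only the flags shown on this page (--eval, --harness, --with-model, --trials). The function name is illustrative and not part of touchstone.

```python
import shlex


def build_run_command(eval_name, harness, models, trials=None):
    """Assemble a `touchstone run` argv with one --with-model per model."""
    argv = ["touchstone", "run", "--eval", eval_name, "--harness", harness]
    for model in models:
        argv += ["--with-model", model]
    if trials is not None:
        argv += ["--trials", str(trials)]
    return argv


argv = build_run_command(
    "example-case", "openai", ["deepseek-v3.1:cloud", "gpt-oss:120b"]
)
print(shlex.join(argv))  # shell-safe rendering, handy for logging
# To execute: import subprocess; subprocess.run(argv, check=True)
```

Passing the argv list to subprocess directly, rather than a joined string, avoids shell quoting problems with model names that contain slashes or colons.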

5. Read the report

Every run writes a report.md plus per-cell artifacts under runs/<run-id>/:

runs/<run-id>/
├── report.md                 # the comparison, rendered
├── manifest.json             # derived index of all cells
└── cells/<case>__<harness>__<model>__t<trial>/
    ├── result.json           # the cell's source of truth (score, metrics)
    ├── trace.jsonl           # the normalized Trace (tool calls, tokens, …)
    ├── transcript.jsonl      # raw harness transcript
    └── sandbox/              # the working tree the model produced
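The layout above is easy to walk from a script. The sketch below collects one row per cell by reading each result.json. It assumes that result.json has a top-level "score" key (the page only says the file holds the score and metrics) and that the harness, model, and trial parts of the directory name contain no double underscore. Both function names are illustrative.

```python
import json
from pathlib import Path


def split_cell_id(cell_id):
    """Split '<case>__<harness>__<model>__t<trial>' into its four parts."""
    # rsplit keeps any '__' inside the case name intact.
    case, harness, model, trial = cell_id.rsplit("__", 3)
    return case, harness, model, int(trial[1:])  # drop the leading 't'


def collect_scores(run_dir):
    """Return (case, harness, model, trial, score) rows for every cell."""
    rows = []
    for result_path in sorted(Path(run_dir).glob("cells/*/result.json")):
        data = json.loads(result_path.read_text(encoding="utf-8"))
        # ASSUMPTION: the score lives under a top-level "score" key.
        rows.append((*split_cell_id(result_path.parent.name), data.get("score")))
    return rows


if __name__ == "__main__":
    # Replace with a real run directory, e.g. runs/<run-id>.
    for row in collect_scores("runs/20260101-1200-ab12"):
        print(row)
```

Since result.json is described as the cell's source of truth, reading it directly is likely more robust than scraping report.md.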

Re-render a report any time:

touchstone report <run-id>

6. Write your first case

You've been using the bundled example-case. To grade your own work, add a case under evals/ and run it. Head to Authoring cases for the details: the minimal case is a prompt, a source folder, a harness/model matrix, and one grader.

touchstone run --eval my-first-case --harness openai --with-model <model>

Next steps

  • Concepts — the vocabulary: Case, Run, Cell, Harness, Grader, Trace.
  • Harnesses — which adapters exist and what each can observe.
  • Authoring cases — the full case.yaml reference.
  • Graders — how outcomes turn into scores.