Your use cases. Every model. One honest verdict.

touchstone is a personal eval benchmark. You bring the tasks you actually care about; it runs each one across the models and agents you're weighing, in isolated, reproducible sandboxes, and grades the outcomes so you can decide what to ship. No vendor lock-in: run Claude, or any model from any provider through an OpenAI-compatible endpoint.

your task → harness × model → isolated sandbox → graders → report

A 30-second taste

Run your own tasks against any OpenAI-compatible endpoint — here a local Ollama model — with no vendor CLI:

pip install "touchstone-eval[openai]"

export OPENAI_BASE_URL=http://localhost:11434/v1   # swap for any endpoint
touchstone run --harness openai --with-model llama3.1

…or compare two models head-to-head on the same harness and your own case:

touchstone run --eval my-task \
  --harness openai \
  --with-model "deepseek-v3.1:cloud" \
  --with-model "gpt-oss:120b"

Each (case × harness × model × trial) runs in its own sandbox, is graded independently, and lands in a single comparison report.

Why touchstone

Any model, any provider

Claude over its CLI, any ACP agent (droid, gemini, codex), or any OpenAI-compatible endpoint — OpenAI, OpenRouter, Together, Groq, DeepSeek, vLLM, Ollama, a LiteLLM proxy. Point at a base_url, pick a model, go.

Your tasks, not someone else's

Public benchmarks leak into training data. touchstone grades the work you do — bug fixes, refactors, real repos — including a private, never-committed held-out set.

Observe everything

A normalized Trace captures tool calls, tokens, cost and permission events — so graders can score how a model worked, not just its final answer.

Grade like you mean it

Hidden test suites, file/pattern checks, model-as-judge, tool-usage budgets and efficiency ramps, combined and checked against each case's pass threshold.

Isolated & reproducible

Every cell gets a fresh sandbox (copy / git clone / worktree), its own throwaway venv, and optional container isolation. Parallel-safe, resumable after a crash.

Hold the controls constant

The harness, judge and responder are fixed across the matrix, so the model is the only thing that varies — the controlled-variable property a real eval needs.

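
To make the "Observe everything" and "Grade like you mean it" ideas above a little more concrete, here is a minimal sketch of scoring a trace rather than only a final answer. Everything in it is hypothetical: the `Trace` fields, the two grader functions, the linear efficiency ramp and the weighted-mean combination rule are assumptions made for illustration, not touchstone's actual schema or API.

```python
from dataclasses import dataclass, field


# Hypothetical, simplified stand-in for a normalized trace; the real schema may differ.
@dataclass
class Trace:
    tool_calls: list[str] = field(default_factory=list)
    tokens: int = 0
    cost_usd: float = 0.0
    tests_passed: bool = False


def hidden_tests_grader(trace: Trace) -> float:
    """1.0 if the hidden test suite passed, else 0.0."""
    return 1.0 if trace.tests_passed else 0.0


def tool_budget_grader(trace: Trace, budget: int = 10) -> float:
    """Full marks within budget, then a linear ramp down to 0 at twice the budget."""
    used = len(trace.tool_calls)
    if used <= budget:
        return 1.0
    return max(0.0, 1.0 - (used - budget) / budget)


def grade(trace: Trace, pass_threshold: float = 0.8) -> tuple[float, bool]:
    """Combine grader scores as a weighted mean (an assumed rule) and apply the threshold."""
    weighted = [(hidden_tests_grader(trace), 0.7), (tool_budget_grader(trace), 0.3)]
    score = sum(s * w for s, w in weighted) / sum(w for _, w in weighted)
    return score, score >= pass_threshold
```

With these assumed weights, a run that passes the hidden tests but uses 15 tool calls against a budget of 10 still passes, with a lower score than a run that stayed within budget. That is the kind of "how it worked" signal a trace makes available to graders.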

How it works

A Run expands a matrix into many Cells — one per (case × harness × model × trial). Each Cell prepares an isolated Sandbox from the case source, hands it to a Harness (the swappable thing that drives a model), captures a Trace of what happened, then scores the result with one or more Graders. Outcomes merge into a single report you can read or diff.
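
The flow described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not touchstone's implementation: `Cell`, `expand_matrix`, `run_cell` and `echo_harness` are hypothetical names, the sandbox uses only the plain-copy strategy, and the harness is a stub that never calls a model.

```python
import itertools
import shutil
import tempfile
from pathlib import Path
from typing import Callable, NamedTuple


class Cell(NamedTuple):
    case: str
    harness: str
    model: str
    trial: int


def expand_matrix(cases, harnesses, models, trials):
    """One Cell per (case x harness x model x trial)."""
    return [
        Cell(c, h, m, t)
        for c, h, m, t in itertools.product(cases, harnesses, models, range(trials))
    ]


def run_cell(cell: Cell, case_dir: Path, harness: Callable[[Path, str], dict]) -> dict:
    """Copy the case into a fresh sandbox, run the harness there, return its trace."""
    with tempfile.TemporaryDirectory(prefix="sandbox-") as tmp:
        sandbox = Path(tmp) / cell.case
        shutil.copytree(case_dir, sandbox)  # plain-copy sandbox; the original is untouched
        trace = harness(sandbox, cell.model)
    return {"cell": cell, "trace": trace}


def echo_harness(sandbox: Path, model: str) -> dict:
    """Placeholder harness: records which files it saw instead of driving a model."""
    return {"model": model, "files": sorted(p.name for p in sandbox.iterdir())}
```

For example, `expand_matrix(["my-task"], ["openai"], ["deepseek-v3.1:cloud", "gpt-oss:120b"], trials=3)` yields six cells. Because each one gets its own temporary directory, cells can run in parallel without seeing each other's changes.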

The whole design rests on one idea: hold everything constant except the model. Same task, same tools, same judge — so a difference in the report is a difference in the model, not the scaffolding.
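
One way to picture that controlled-variable property is a comparison step that refuses to rank models if anything other than the model differs between cells. The sketch below is hypothetical: the outcome dictionaries and their keys (`harness`, `judge`, `model`, `passed`) are assumed for illustration and are not touchstone's report format.

```python
from collections import defaultdict


def compare_models(outcomes):
    """Return pass rate per model, but only if harness and judge are constant."""
    controls = {(o["harness"], o["judge"]) for o in outcomes}
    if len(controls) != 1:
        raise ValueError(f"controls vary across cells: {sorted(controls)}")

    tally = defaultdict(lambda: [0, 0])  # model -> [passes, total]
    for o in outcomes:
        tally[o["model"]][0] += int(o["passed"])
        tally[o["model"]][1] += 1
    return {model: passes / total for model, (passes, total) in tally.items()}
```

If two cells were run with different harnesses or judges, the function raises instead of producing a number, since a gap in pass rate could then come from the scaffolding rather than the model.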
