Your use cases. Every model. One honest verdict.
touchstone is a personal eval benchmark. You bring the tasks you actually care about; it runs each one across the models and agents you're weighing, in isolated, reproducible sandboxes, and grades the outcomes so you can decide what to ship. No vendor lock-in: run Claude, or any model from any provider through an OpenAI-compatible endpoint.
your task → harness × model → isolated sandbox → graders → report
A 30-second taste
Run your own tasks against any OpenAI-compatible endpoint (here, a local Ollama model) with no vendor CLI:
pip install "touchstone-eval[openai]"
export OPENAI_BASE_URL=http://localhost:11434/v1 # swap for any endpoint
touchstone run --harness openai --with-model llama3.1
…or compare two models head-to-head on the same harness and your own case:
touchstone run --eval my-task \
  --harness openai \
  --with-model "deepseek-v3.1:cloud" \
  --with-model "gpt-oss:120b"
Each (case × harness × model × trial) runs in its own sandbox, is graded independently, and
lands in a single comparison report.
Why touchstone
-
Any model, any provider
Claude over its CLI, any ACP agent (droid, gemini, codex), or any OpenAI-compatible endpoint: OpenAI, OpenRouter, Together, Groq, DeepSeek, vLLM, Ollama, a LiteLLM proxy. Point at a base_url, pick a model, go.
-
Your tasks, not someone else's
Public benchmarks leak into training data. touchstone grades the work you do — bug fixes, refactors, real repos — including a private, never-committed held-out set.
-
Observe everything
A normalized Trace captures tool calls, tokens, cost and permission events — so graders can score how a model worked, not just its final answer.
-
Grade like you mean it
Hidden test suites, file/pattern checks, model-as-judge, tool-usage budgets and efficiency ramps — combined per a case's pass threshold.
-
Isolated & reproducible
Every cell gets a fresh sandbox (copy / git clone / worktree), its own throwaway venv, and optional container isolation. Parallel-safe, resumable after a crash.
-
Hold the controls constant
The harness, judge and responder are fixed across the matrix, so the model is the only thing that varies — the controlled-variable property a real eval needs.
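To make the "Grade like you mean it" item concrete, here is a minimal sketch of how several grader scores might be folded into one pass/fail verdict. It is not touchstone's actual API: `GraderResult`, `efficiency_ramp`, `combine`, the weighted-mean rule, and the linear ramp shape are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraderResult:
    """Hypothetical shape: one grader's score in [0, 1] plus a relative weight."""
    name: str
    score: float
    weight: float = 1.0


def efficiency_ramp(tool_calls: int, budget: int, hard_limit: int) -> float:
    """Assumed ramp: full credit up to `budget`, zero at `hard_limit`, linear between."""
    if tool_calls <= budget:
        return 1.0
    if tool_calls >= hard_limit:
        return 0.0
    return (hard_limit - tool_calls) / (hard_limit - budget)


def combine(results: list[GraderResult], pass_threshold: float) -> tuple[float, bool]:
    """Assumed rule: weighted mean of grader scores, compared with the case threshold."""
    total_weight = sum(r.weight for r in results)
    if total_weight <= 0:
        raise ValueError("at least one grader with positive weight is required")
    score = sum(r.score * r.weight for r in results) / total_weight
    return score, score >= pass_threshold


results = [
    GraderResult("hidden_tests", 1.0, weight=2.0),
    GraderResult("file_check", 1.0),
    GraderResult("efficiency", efficiency_ramp(tool_calls=30, budget=20, hard_limit=40)),
]
score, passed = combine(results, pass_threshold=0.8)
print(f"score={score:.3f} passed={passed}")  # score=0.875 passed=True
```

A weighted mean is only one plausible combination rule; an all-graders-must-pass rule would be stricter and would change the verdict for runs that trade efficiency for correctness.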
How it works
A Run expands a matrix into many Cells — one per (case × harness × model × trial).
Each Cell prepares an isolated Sandbox from the case source, hands it to a Harness
(the swappable thing that drives a model), captures a Trace of what happened, then scores
the result with one or more Graders. Outcomes merge into a single report you can read or
diff.
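The matrix expansion described above can be sketched as a Cartesian product. This is an illustrative sketch, not touchstone's internals: the `Cell` dataclass and `expand_matrix` function are hypothetical names, and numbering trials from 1 is an assumption.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Cell:
    """Hypothetical record for one unit of work in the matrix."""
    case: str
    harness: str
    model: str
    trial: int


def expand_matrix(cases, harnesses, models, trials):
    """Return one Cell per (case, harness, model, trial) combination."""
    return [
        Cell(case, harness, model, trial)
        for case, harness, model, trial in product(
            cases, harnesses, models, range(1, trials + 1)
        )
    ]


cells = expand_matrix(
    cases=["my-task"],
    harnesses=["openai"],
    models=["deepseek-v3.1:cloud", "gpt-oss:120b"],
    trials=3,
)
print(len(cells))  # 1 case x 1 harness x 2 models x 3 trials = 6
print(cells[0])
```

Because every cell is an independent value, each one can be handed to its own sandbox and run in parallel, and a crashed run can resume by skipping cells that already have a result.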
The whole design rests on one idea: hold everything constant except the model. Same task, same tools, same judge — so a difference in the report is a difference in the model, not the scaffolding.
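One way to picture that controlled comparison is a small aggregation that refuses to compare models unless each was run on exactly the same cells. The sketch below is an assumption-laden illustration: the row layout, the `pass_rate_by_model` helper, the placeholder model names, and the pass/fail values are all made up and do not reflect touchstone's report format or any real results.

```python
from collections import defaultdict


def pass_rate_by_model(rows):
    """Per-model pass rate, allowed only when every model ran the same cells."""
    if not rows:
        return {}
    keys = defaultdict(set)    # model -> set of (case, harness, trial) it ran
    passes = defaultdict(int)  # model -> number of passing cells
    totals = defaultdict(int)  # model -> number of graded cells
    for case, harness, model, trial, passed in rows:
        keys[model].add((case, harness, trial))
        passes[model] += int(passed)
        totals[model] += 1
    coverage = list(keys.values())
    if any(k != coverage[0] for k in coverage):
        raise ValueError("models ran different cells; the comparison is not controlled")
    return {model: passes[model] / totals[model] for model in keys}


# Made-up outcomes: (case, harness, model, trial, passed).
rows = [
    ("my-task", "openai", "model-a", 1, True),
    ("my-task", "openai", "model-a", 2, True),
    ("my-task", "openai", "model-b", 1, True),
    ("my-task", "openai", "model-b", 2, False),
]
rates = pass_rate_by_model(rows)
print(rates)  # {'model-a': 1.0, 'model-b': 0.5}
```

The coverage check is the point of the sketch: if one model missed a trial or ran under a different harness, a gap in pass rates could come from the scaffolding, so the helper raises instead of reporting a number.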