Your use cases. Every model. One honest verdict.

touchstone is a personal eval benchmark. You bring the tasks you actually care about, it runs each one across the models and agents you're weighing — in isolated, reproducible sandboxes — and grades the outcomes so you can decide what to ship. No vendor lock-in: run Claude, or any model from any provider through an OpenAI-compatible endpoint.

your task → harness × model → isolated sandbox → graders → report

A 30-second taste

Run your own tasks against any OpenAI-compatible endpoint — here a local Ollama model — with no vendor CLI:

pip install "touchstone-eval[openai]"

export OPENAI_BASE_URL=http://localhost:11434/v1   # swap for any endpoint
touchstone run --harness openai --with-model llama3.1

…or compare two models head-to-head on the same harness and your own case:

touchstone run --eval my-task \
  --harness openai \
  --with-model "deepseek-v3.1:cloud" \
  --with-model "gpt-oss:120b"

Each (case × harness × model × trial) runs in its own sandbox, is graded independently, and lands in a single comparison report.

Why touchstone

  • Any model, any provider


    Claude over its CLI, any ACP agent (droid, gemini, codex), or any OpenAI-compatible endpoint — OpenAI, OpenRouter, Together, Groq, DeepSeek, vLLM, Ollama, a LiteLLM proxy. Point at a base_url, pick a model, go.

    OpenAI-compatible harness

  • Your tasks, not someone else's


    Public benchmarks leak into training data. touchstone grades the work you do — bug fixes, refactors, real repos — including a private, never-committed held-out set.

    Authoring cases

  • Observe everything


    A normalized Trace captures tool calls, tokens, cost and permission events — so graders can score how a model worked, not just its final answer.

    Observation & interaction

  • Grade like you mean it


    Hidden test suites, file/pattern checks, model-as-judge, tool-usage budgets and efficiency ramps — combined per a case's pass threshold.

    Graders

  • Isolated & reproducible


    Every cell gets a fresh sandbox (copy / git clone / worktree), its own throwaway venv, and optional container isolation. Parallel-safe, resumable after a crash.

    Concepts

  • Hold the controls constant


    The harness, judge and responder are fixed across the matrix, so the model is the only thing that varies — the controlled-variable property a real eval needs.

    Concepts
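As a rough illustration of how several graders might be "combined per a case's pass threshold", here is a minimal sketch. It assumes each grader returns a score between 0 and 1 and that scores are merged by a weighted mean; the names `GraderResult` and `combine` are hypothetical and are not touchstone's API, and the real combination rule may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraderResult:
    """One grader's verdict for a cell (hypothetical shape, not touchstone's)."""
    name: str
    score: float          # assumed to be normalized to the range 0.0..1.0
    weight: float = 1.0   # assumed relative importance of this grader


def combine(results, pass_threshold):
    """Return (combined_score, passed) using a weighted mean of grader scores."""
    total_weight = sum(r.weight for r in results)
    if total_weight <= 0:
        raise ValueError("at least one grader with positive weight is required")
    combined = sum(r.score * r.weight for r in results) / total_weight
    return combined, combined >= pass_threshold


# Illustrative numbers only: a hidden test suite that passed fully,
# plus a judge and a tool-usage budget that each gave half credit.
results = [
    GraderResult("hidden_tests", 1.0, weight=2.0),
    GraderResult("judge", 0.5),
    GraderResult("tool_budget", 0.5),
]
score, passed = combine(results, pass_threshold=0.7)
print(score, passed)
```

A weighted mean is only one plausible choice; a stricter scheme could require every grader to clear its own bar before the case counts as passed.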

How it works

A Run expands a matrix into many Cells — one per (case × harness × model × trial). Each Cell prepares an isolated Sandbox from the case source, hands it to a Harness (the swappable thing that drives a model), captures a Trace of what happened, then scores the result with one or more Graders. Outcomes merge into a single report you can read or diff.
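The matrix expansion described above can be pictured as a Cartesian product. The following is a minimal sketch under that assumption; `Cell` and `expand_matrix` are hypothetical names used for illustration and do not reflect touchstone's internal types.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Cell:
    """One unit of work: a single (case, harness, model, trial) combination."""
    case: str
    harness: str
    model: str
    trial: int


def expand_matrix(cases, harnesses, models, trials):
    """Expand the run matrix into one Cell per combination, trials numbered from 1."""
    return [
        Cell(case, harness, model, trial)
        for case, harness, model, trial in product(
            cases, harnesses, models, range(1, trials + 1)
        )
    ]


# One case, one harness, two models, three trials each (trial count is illustrative).
cells = expand_matrix(
    cases=["my-task"],
    harnesses=["openai"],
    models=["deepseek-v3.1:cloud", "gpt-oss:120b"],
    trials=3,
)
print(len(cells))  # 1 * 1 * 2 * 3 = 6 cells
```

Because every cell carries its full coordinates, each one can be sandboxed, run, and graded independently, which is what makes parallel execution and resuming after a crash straightforward.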

The whole design rests on one idea: hold everything constant except the model. Same task, same tools, same judge — so a difference in the report is a difference in the model, not the scaffolding.

Read the concepts · See the architecture