Skip to content

The OpenAI-compatible harness

openai (alias openai-compatible) is touchstone's provider-neutral harness. It runs a minimal agentic coding loop in-process against any endpoint that speaks the OpenAI Chat Completions wire format — no vendor CLI, no Claude. It is both Tracing- and Interaction-capable.

This is usually the harness you want: it holds the tool surface, system prompt, and loop constant while you vary only the model, which is exactly the controlled-variable property an eval needs.

pip install "touchstone-eval[openai]"

The whole setup

There is nothing provider-specific to configure. The endpoint and key come from the OpenAI SDK's standard environment variables — the convention Ollama, vLLM, LM Studio, LiteLLM and OpenRouter all share:

export OPENAI_BASE_URL=http://localhost:11434/v1   # Ollama; swap for ANY endpoint
export OPENAI_API_KEY=sk-...                        # omit entirely for keyless local servers
touchstone run --harness openai --with-model llama3.1

Change OPENAI_BASE_URL and the same harness runs anywhere — you are never tied to one provider.

Endpoint resolution

Source Precedence
A named profile's base_url (from openai_agents.yaml) highest
$OPENAI_BASE_URL then
the SDK default (https://api.openai.com/v1) fallback

The key is read from $OPENAI_API_KEY (or a profile's api_key_env). Keyless servers (Ollama, vLLM, LM Studio) accept any value, so a placeholder is sent when none is set — only real OpenAI truly needs a key. The resolved endpoint is recorded in each cell's result.json.

Provider cheat-sheet

Provider OPENAI_BASE_URL Key var
OpenAI (unset — SDK default) OPENAI_API_KEY
Ollama (local) http://localhost:11434/v1 (none)
vLLM / LM Studio http://localhost:8000/v1 (none)
OpenRouter https://openrouter.ai/api/v1 OPENAI_API_KEY
Together https://api.together.xyz/v1 OPENAI_API_KEY
Groq https://api.groq.com/openai/v1 OPENAI_API_KEY
DeepSeek https://api.deepseek.com OPENAI_API_KEY
LiteLLM proxy http://localhost:4000/v1 your proxy key

The tool surface

The loop offers a small, fixed set of tools that map cleanly onto the portable Tool Kinds — so traces are gradable across models:

Tool Tool Kind Does
bash execute run a shell command in the sandbox (build, tests, git, …)
read_file read read a UTF-8 file
write_file write create / overwrite a file
edit_file write replace the first exact occurrence of a string
grep search regex search across the sandbox
list_dir search list a directory

bash alone is enough to do anything; the explicit tools exist so the Trace carries clean, gradable signal. File and search tools are confined to the sandbox (path-escape attempts are rejected).

How the loop works

system prompt + task
  model call ──tool_calls?──▶ for each call:
      ▲                         emit tool_call → (interaction policy) → run tool → emit tool_result
      │                         append result to the conversation
      └────────── yes ──────────┘
      no  → final message → emit usage + stop

Each model⇄tool round-trip is replayed into the Trace. Token usage comes from the API response; cost is filled when the profile carries a price table (otherwise it stays null, and the efficiency grader simply skips the cost dimension).

The loop is robust by construction: malformed tool arguments, unknown tools, path escapes, runaway loops (max_iters), a wall-clock timeout_s, and API/network errors all end the cell cleanly rather than crashing the run.

Interaction

Because the loop owns every tool call, it can mediate each one through the Case's Interaction Policy — making this harness strictly richer than a headless adapter.

When a Case opts into interaction:

observe:
  tracing: true
  interaction:
    policy: scripted
    rules:
      - {tool_kind: execute, decision: deny}   # never let the model run shell commands
    default: allow

…the harness emits a permission_request before each tool, asks the policy, emits the permission_response, then either runs the tool or returns the refusal to the model and moves on. A policy may also rewrite a tool's input before it runs (updated_input). All five policies work: auto-approve · auto-deny · scripted · llm-based · manual.

When a Case requests no interaction (tracing only), there is no policy and tools execute directly — exactly like claude-code-stream, with no permission noise in the Trace.

Named profiles

Optional. The generic openai harness already covers any endpoint; use a named profile only to run several endpoints in one comparison or to attach a price table. Drop entries in openai_agents.yaml (repo root, next to evals/):

ollama:
  base_url: "http://localhost:11434/v1"
  price_in: 0.0          # USD per 1M input tokens  → lets the harness report cost_usd
  price_out: 0.0         # USD per 1M output tokens
openrouter:
  base_url: "https://openrouter.ai/api/v1"
  api_key_env: "OPENROUTER_API_KEY"

Each entry becomes a harness name usable in a case's matrix or via --harness.

Field Default Meaning
base_url $OPENAI_BASE_URL → api.openai.com the endpoint
api_key_env OPENAI_API_KEY env var holding the key (ignored by keyless servers)
model fallback model if the matrix omits one (usually leave unset)
temperature 0.0 sampling temperature; null omits the param entirely
max_iters 50 max model⇄tool round-trips per Turn
timeout_s 1800 wall-clock budget for the whole Cell
price_in / price_out USD per 1M tokens, to report cost_usd
extra_body provider-specific create() knobs (routing, provider prefs)

Worked example: compare two providers

# Pin two endpoints in openai_agents.yaml as `ollama` and `openrouter`, then:
touchstone run --eval my-task \
  --harness ollama     --with-model "ollama=deepseek-v3.1:cloud" \
  --harness openrouter --with-model "openrouter=anthropic/claude-3.7-sonnet"

Same task, same tool surface, same graders — two providers side by side in one report.