The OpenAI-compatible harness

openai (alias openai-compatible) is touchstone's provider-neutral harness. It runs a minimal agentic coding loop in-process against any endpoint that speaks the OpenAI Chat Completions wire format — no vendor CLI, no Claude. It is both Tracing- and Interaction-capable.

This is usually the harness you want: it holds the tool surface, system prompt, and loop constant while you vary only the model, which is exactly the controlled-variable property an eval needs.

pip install "touchstone-eval[openai]"

The whole setup

There is nothing provider-specific to configure. The endpoint and key come from the OpenAI SDK's standard environment variables — the convention Ollama, vLLM, LM Studio, LiteLLM and OpenRouter all share:

export OPENAI_BASE_URL=http://localhost:11434/v1   # Ollama; swap for ANY endpoint
export OPENAI_API_KEY=sk-...                        # omit entirely for keyless local servers
touchstone run --harness openai --with-model llama3.1

Change OPENAI_BASE_URL and the same harness runs anywhere — you are never tied to one provider.

Endpoint resolution

Source	Precedence
A named profile's `base_url` (from `openai_agents.yaml`)	highest
`$OPENAI_BASE_URL`	then
the SDK default (`https://api.openai.com/v1`)	fallback

The key is read from $OPENAI_API_KEY (or from the env var named by a profile's api_key_env). Keyless servers (Ollama, vLLM, LM Studio) accept any value, so a placeholder is sent when none is set; only hosted providers such as OpenAI or OpenRouter actually require a real key. The resolved endpoint is recorded in each cell's result.json.
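The precedence rules above can be sketched in a few lines of Python. This is an illustrative sketch, not touchstone's actual code: the helper name `resolve_endpoint`, the dict shape of `profile`, and the literal `"placeholder"` key value are all assumptions made for the example.

```python
import os
from typing import Mapping, Optional, Tuple

SDK_DEFAULT_BASE_URL = "https://api.openai.com/v1"


def resolve_endpoint(
    profile: Optional[Mapping[str, str]] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, str]:
    """Return (base_url, api_key). Hypothetical helper, for illustration only."""
    profile = profile or {}
    env = os.environ if env is None else env

    # Precedence: profile base_url, then $OPENAI_BASE_URL, then the SDK default.
    base_url = (
        profile.get("base_url")
        or env.get("OPENAI_BASE_URL")
        or SDK_DEFAULT_BASE_URL
    )

    # The key comes from the env var named by api_key_env (default OPENAI_API_KEY).
    key_var = profile.get("api_key_env", "OPENAI_API_KEY")
    # Keyless servers accept any value, so fall back to a placeholder (assumed value).
    api_key = env.get(key_var) or "placeholder"
    return base_url, api_key
```

Passing `env` explicitly keeps the function easy to test; called with no arguments it reads the real process environment.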

Provider cheat-sheet

Provider	`OPENAI_BASE_URL`	Key var
OpenAI	(unset — SDK default)	`OPENAI_API_KEY`
Ollama (local)	`http://localhost:11434/v1`	(none)
vLLM	`http://localhost:8000/v1`	(none)
LM Studio	`http://localhost:1234/v1`	(none)
OpenRouter	`https://openrouter.ai/api/v1`	`OPENAI_API_KEY`
Together	`https://api.together.xyz/v1`	`OPENAI_API_KEY`
Groq	`https://api.groq.com/openai/v1`	`OPENAI_API_KEY`
DeepSeek	`https://api.deepseek.com`	`OPENAI_API_KEY`
LiteLLM proxy	`http://localhost:4000/v1`	your proxy key

The tool surface

The loop offers a small, fixed set of tools that map cleanly onto the portable Tool Kinds — so traces are gradable across models:

Tool	Tool Kind	Does
`bash`	`execute`	run a shell command in the sandbox (build, tests, git, …)
`read_file`	`read`	read a UTF-8 file
`write_file`	`write`	create / overwrite a file
`edit_file`	`write`	replace the first exact occurrence of a string
`grep`	`search`	regex search across the sandbox
`list_dir`	`search`	list a directory

bash alone is enough to do anything; the explicit tools exist so the Trace carries clean, gradable signal. File and search tools are confined to the sandbox (path-escape attempts are rejected).
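As a rough illustration of the sandbox confinement and the "first exact occurrence" behaviour of `edit_file`, here is a minimal sketch. The function names, error types, and messages are assumptions for the example; touchstone's real implementation may differ.

```python
from pathlib import Path


def resolve_in_sandbox(sandbox: Path, user_path: str) -> Path:
    """Resolve user_path inside sandbox; raise PermissionError if it escapes."""
    root = sandbox.resolve()
    # Joining with an absolute user_path discards root, so that case is caught too.
    target = (root / user_path).resolve()
    try:
        target.relative_to(root)  # raises ValueError when target is outside root
    except ValueError:
        raise PermissionError(f"path escapes sandbox: {user_path}") from None
    return target


def edit_file(sandbox: Path, user_path: str, old: str, new: str) -> None:
    """Replace only the first exact occurrence of `old` in a UTF-8 file."""
    path = resolve_in_sandbox(sandbox, user_path)
    text = path.read_text(encoding="utf-8")
    if old not in text:
        raise ValueError("string to replace was not found")
    path.write_text(text.replace(old, new, 1), encoding="utf-8")
```

Resolving both paths before comparing them means `..` segments and symlinks are normalised first, which is why a plain string-prefix check is not used here.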

How the loop works

system prompt + task
      │
      ▼
  model call ──tool_calls?──▶ for each call:
      ▲                         emit tool_call → (interaction policy) → run tool → emit tool_result
      │                         append result to the conversation
      └────────── yes ──────────┘
      │
      no  → final message → emit usage + stop

Each model⇄tool round-trip is replayed into the Trace. Token usage comes from the API response; cost is filled when the profile carries a price table (otherwise it stays null, and the efficiency grader simply skips the cost dimension).
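The cost rule can be illustrated with a small sketch, assuming prices are expressed in USD per 1M tokens (as in the profile fields) and that token counts come from the `prompt_tokens` and `completion_tokens` fields of the Chat Completions `usage` object. The function name `cost_usd` is chosen for the example.

```python
from typing import Optional


def cost_usd(
    prompt_tokens: int,
    completion_tokens: int,
    price_in: Optional[float],
    price_out: Optional[float],
) -> Optional[float]:
    """Return the cost in USD, or None when no price table is available."""
    # Compare against None explicitly: a price of 0.0 (a local model) is a
    # valid price and should yield a cost of 0.0, not "unknown".
    if price_in is None or price_out is None:
        return None
    return (prompt_tokens * price_in + completion_tokens * price_out) / 1_000_000
```

Returning `None` rather than `0.0` for a missing price table keeps "free" and "unknown" distinguishable, which matches the null-cost behaviour described above.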

The loop is robust by construction: malformed tool arguments, unknown tools, path escapes, runaway loops (max_iters), a wall-clock timeout_s, and API/network errors all end the cell cleanly rather than crashing the run.
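The overall shape of such a loop might look like the sketch below. It is a simplification under stated assumptions, not the harness's code: `call_model` is a hypothetical callable that returns an assistant message in Chat Completions shape, tracing and interaction are omitted, and tool errors are fed back to the model as tool results (the harness may instead stop the cell). The timeout is only checked between model calls.

```python
import json
import time
from typing import Any, Callable, Dict, List

Message = Dict[str, Any]


def run_loop(
    call_model: Callable[[List[Message]], Message],
    tools: Dict[str, Callable[..., Any]],
    messages: List[Message],
    max_iters: int = 50,
    timeout_s: float = 1800.0,
) -> str:
    """Run model/tool round-trips; return "stop", "max_iters" or "timeout"."""
    deadline = time.monotonic() + timeout_s
    for _ in range(max_iters):
        if time.monotonic() > deadline:
            return "timeout"
        reply = call_model(messages)
        messages.append(reply)
        calls = reply.get("tool_calls") or []
        if not calls:
            return "stop"  # final message, no more tools requested
        for call in calls:
            name = call["function"]["name"]
            if name not in tools:
                result = f"error: unknown tool {name!r}"
            else:
                try:
                    args = json.loads(call["function"]["arguments"])
                    result = tools[name](**args)
                except (json.JSONDecodeError, TypeError) as exc:
                    # Malformed JSON, non-object arguments, or a bad signature.
                    result = f"error: bad arguments: {exc}"
            messages.append(
                {"role": "tool", "tool_call_id": call["id"], "content": str(result)}
            )
    return "max_iters"  # runaway loop guard
```

Every failure path here produces either a tool message or a status string, so a misbehaving model cannot raise out of the loop through its tool calls.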

Interaction

Because the loop owns every tool call, it can mediate each one through the Case's Interaction Policy — making this harness strictly richer than a headless adapter.

When a Case opts into interaction:

observe:
  tracing: true
  interaction:
    policy: scripted
    rules:
      - {tool_kind: execute, decision: deny}   # never let the model run shell commands
    default: allow

…the harness emits a permission_request before each tool, asks the policy, emits the permission_response, then either runs the tool or returns the refusal to the model and moves on. A policy may also rewrite a tool's input before it runs (updated_input). All five policies work: auto-approve · auto-deny · scripted · llm-based · manual.
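To make the request/response sequence concrete, here is a minimal sketch of how a scripted policy like the one above could be evaluated. The names `decide` and `mediate`, the event payloads, the refusal text, and the first-matching-rule-wins semantics are assumptions for illustration; input rewriting via updated_input is left out.

```python
from typing import Any, Callable, Dict


def decide(policy: Dict[str, Any], tool_kind: str) -> str:
    """Return "allow" or "deny"; the first matching rule wins (assumed)."""
    for rule in policy.get("rules", []):
        if rule.get("tool_kind") == tool_kind:
            return rule["decision"]
    return policy.get("default", "allow")


def mediate(
    policy: Dict[str, Any],
    tool_name: str,
    tool_kind: str,
    tool_input: Dict[str, Any],
    run_tool: Callable[[Dict[str, Any]], str],
    emit: Callable[[str, Dict[str, Any]], None],
) -> str:
    """Ask the policy before running a tool; return the result or a refusal."""
    emit("permission_request", {"tool": tool_name, "input": tool_input})
    decision = decide(policy, tool_kind)
    emit("permission_response", {"tool": tool_name, "decision": decision})
    if decision != "allow":
        # The refusal goes back to the model as the tool's result.
        return f"denied by policy: {tool_name}"
    return run_tool(tool_input)
```

With the example policy, a `bash` call (Tool Kind `execute`) is refused while `read_file` falls through to the `allow` default.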

When a Case requests no interaction (tracing only), there is no policy and tools execute directly — exactly like claude-code-stream, with no permission noise in the Trace.

Named profiles

Optional. The generic openai harness already covers any endpoint; use a named profile only to run several endpoints in one comparison or to attach a price table. Drop entries in openai_agents.yaml (repo root, next to evals/):

ollama:
  base_url: "http://localhost:11434/v1"
  price_in: 0.0          # USD per 1M input tokens  → lets the harness report cost_usd
  price_out: 0.0         # USD per 1M output tokens
openrouter:
  base_url: "https://openrouter.ai/api/v1"
  api_key_env: "OPENROUTER_API_KEY"

Each entry becomes a harness name usable in a case's matrix or via --harness.

Field	Default	Meaning
`base_url`	`$OPENAI_BASE_URL` → api.openai.com	the endpoint
`api_key_env`	`OPENAI_API_KEY`	env var holding the key (ignored by keyless servers)
`model`	–	fallback model if the matrix omits one (usually leave unset)
`temperature`	`0.0`	sampling temperature; `null` omits the param entirely
`max_iters`	`50`	max model⇄tool round-trips per Turn
`timeout_s`	`1800`	wall-clock budget for the whole Cell
`price_in` / `price_out`	–	USD per 1M tokens, to report `cost_usd`
`extra_body`	–	provider-specific `create()` knobs (routing, provider prefs)
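One way to picture the defaults in this table is as a dataclass. The sketch below is hypothetical: the class name `Profile` and the method `create_kwargs` are invented for the example, and only the field names and default values come from the table. `extra_body` is passed through under the OpenAI Python SDK's parameter of the same name.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Profile:
    """Illustrative container for one openai_agents.yaml entry."""

    base_url: Optional[str] = None        # None: fall back to $OPENAI_BASE_URL
    api_key_env: str = "OPENAI_API_KEY"
    model: Optional[str] = None
    temperature: Optional[float] = 0.0    # None omits the parameter entirely
    max_iters: int = 50
    timeout_s: int = 1800
    price_in: Optional[float] = None      # USD per 1M input tokens
    price_out: Optional[float] = None     # USD per 1M output tokens
    extra_body: Dict[str, Any] = field(default_factory=dict)

    def create_kwargs(self, model: str) -> Dict[str, Any]:
        """Build keyword arguments for a chat completion request."""
        kwargs: Dict[str, Any] = {"model": model}
        if self.temperature is not None:
            kwargs["temperature"] = self.temperature
        if self.extra_body:
            kwargs["extra_body"] = self.extra_body
        return kwargs
```

For instance, `Profile(**entry)` with the `openrouter` entry above would keep `max_iters=50` and `timeout_s=1800` while overriding only `base_url` and `api_key_env`.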

Worked example: compare two providers

# Pin two endpoints in openai_agents.yaml as `ollama` and `openrouter`, then:
touchstone run --eval my-task \
  --harness ollama     --with-model "ollama=deepseek-v3.1:cloud" \
  --harness openrouter --with-model "openrouter=anthropic/claude-3.7-sonnet"

Same task, same tool surface, same graders — two providers side by side in one report.