The OpenAI-compatible harness¶
openai (alias openai-compatible) is touchstone's provider-neutral harness. It runs a
minimal agentic coding loop in-process against any endpoint that speaks the OpenAI Chat
Completions wire format — no vendor CLI, no Claude. It is both Tracing- and
Interaction-capable.
This is usually the harness you want: it holds the tool surface, system prompt, and loop constant while you vary only the model, which is exactly the controlled-variable property an eval needs.
The whole setup¶
There is nothing provider-specific to configure. The endpoint and key come from the OpenAI SDK's standard environment variables — the convention Ollama, vLLM, LM Studio, LiteLLM and OpenRouter all share:
export OPENAI_BASE_URL=http://localhost:11434/v1 # Ollama; swap for ANY endpoint
export OPENAI_API_KEY=sk-... # omit entirely for keyless local servers
touchstone run --harness openai --with-model llama3.1
Change OPENAI_BASE_URL and the same harness runs anywhere — you are never tied to one
provider.
Endpoint resolution¶
| Source | Precedence |
|---|---|
A named profile's base_url (from openai_agents.yaml) |
highest |
$OPENAI_BASE_URL |
then |
the SDK default (https://api.openai.com/v1) |
fallback |
The key is read from $OPENAI_API_KEY (or a profile's api_key_env). Keyless servers (Ollama,
vLLM, LM Studio) accept any value, so a placeholder is sent when none is set — only real OpenAI
truly needs a key. The resolved endpoint is recorded in each cell's result.json.
Provider cheat-sheet¶
| Provider | OPENAI_BASE_URL |
Key var |
|---|---|---|
| OpenAI | (unset — SDK default) | OPENAI_API_KEY |
| Ollama (local) | http://localhost:11434/v1 |
(none) |
| vLLM / LM Studio | http://localhost:8000/v1 |
(none) |
| OpenRouter | https://openrouter.ai/api/v1 |
OPENAI_API_KEY |
| Together | https://api.together.xyz/v1 |
OPENAI_API_KEY |
| Groq | https://api.groq.com/openai/v1 |
OPENAI_API_KEY |
| DeepSeek | https://api.deepseek.com |
OPENAI_API_KEY |
| LiteLLM proxy | http://localhost:4000/v1 |
your proxy key |
The tool surface¶
The loop offers a small, fixed set of tools that map cleanly onto the portable Tool Kinds — so traces are gradable across models:
| Tool | Tool Kind | Does |
|---|---|---|
bash |
execute |
run a shell command in the sandbox (build, tests, git, …) |
read_file |
read |
read a UTF-8 file |
write_file |
write |
create / overwrite a file |
edit_file |
write |
replace the first exact occurrence of a string |
grep |
search |
regex search across the sandbox |
list_dir |
search |
list a directory |
bash alone is enough to do anything; the explicit tools exist so the Trace carries clean,
gradable signal. File and search tools are confined to the sandbox (path-escape attempts are
rejected).
How the loop works¶
system prompt + task
│
▼
model call ──tool_calls?──▶ for each call:
▲ emit tool_call → (interaction policy) → run tool → emit tool_result
│ append result to the conversation
└────────── yes ──────────┘
│
no → final message → emit usage + stop
Each model⇄tool round-trip is replayed into the Trace. Token usage comes from
the API response; cost is filled when the profile carries a price table (otherwise it stays
null, and the efficiency grader simply skips the cost dimension).
The loop is robust by construction: malformed tool arguments, unknown tools, path escapes,
runaway loops (max_iters), a wall-clock timeout_s, and API/network errors all end the cell
cleanly rather than crashing the run.
Interaction¶
Because the loop owns every tool call, it can mediate each one through the Case's Interaction Policy — making this harness strictly richer than a headless adapter.
When a Case opts into interaction:
observe:
tracing: true
interaction:
policy: scripted
rules:
- {tool_kind: execute, decision: deny} # never let the model run shell commands
default: allow
…the harness emits a permission_request before each tool, asks the policy, emits the
permission_response, then either runs the tool or returns the refusal to the model and moves
on. A policy may also rewrite a tool's input before it runs (updated_input). All five
policies work: auto-approve · auto-deny · scripted · llm-based · manual.
When a Case requests no interaction (tracing only), there is no policy and tools execute
directly — exactly like claude-code-stream, with no permission noise in the Trace.
Named profiles¶
Optional. The generic openai harness already covers any endpoint; use a named profile only to
run several endpoints in one comparison or to attach a price table. Drop entries in
openai_agents.yaml (repo root, next to evals/):
ollama:
base_url: "http://localhost:11434/v1"
price_in: 0.0 # USD per 1M input tokens → lets the harness report cost_usd
price_out: 0.0 # USD per 1M output tokens
openrouter:
base_url: "https://openrouter.ai/api/v1"
api_key_env: "OPENROUTER_API_KEY"
Each entry becomes a harness name usable in a case's matrix or via --harness.
| Field | Default | Meaning |
|---|---|---|
base_url |
$OPENAI_BASE_URL → api.openai.com |
the endpoint |
api_key_env |
OPENAI_API_KEY |
env var holding the key (ignored by keyless servers) |
model |
– | fallback model if the matrix omits one (usually leave unset) |
temperature |
0.0 |
sampling temperature; null omits the param entirely |
max_iters |
50 |
max model⇄tool round-trips per Turn |
timeout_s |
1800 |
wall-clock budget for the whole Cell |
price_in / price_out |
– | USD per 1M tokens, to report cost_usd |
extra_body |
– | provider-specific create() knobs (routing, provider prefs) |
Worked example: compare two providers¶
# Pin two endpoints in openai_agents.yaml as `ollama` and `openrouter`, then:
touchstone run --eval my-task \
--harness ollama --with-model "ollama=deepseek-v3.1:cloud" \
--harness openrouter --with-model "openrouter=anthropic/claude-3.7-sonnet"
Same task, same tool surface, same graders — two providers side by side in one report.