Skip to content

An in-process OpenAI-compatible adapter; the rich path that needs no vendor CLI

Any model reachable through an OpenAI-compatible Chat Completions endpoint can be driven as a Tracing- and Interaction-capable Harness by an agentic loop the framework runs itself — no vendor CLI, no Node bridge, no ACP. The new openai (alias openai-compatible) harness holds the tool surface, system prompt, and loop constant and varies only the model, making it the cleanest controlled-variable path in the suite. This extends ADR 0006's "ACP is no longer the only rich path" to its conclusion: the rich path need not shell out to any external agent.

Status

accepted (extends ADR 0003 and ADR 0006)

Context

ADR 0003 made ACP the single rich adapter; ADR 0006 added a native Claude path over stream-json. Both still drive an external agent process (an ACP agent, or claude). That leaves a gap: to compare an arbitrary open or hosted model — a local vLLM/Ollama build, an OpenRouter or DeepSeek model, a LiteLLM proxy — you needed that vendor to ship a coding-agent CLI, and you'd then be comparing agents, not models.

For an eval the harness is a control variable. If model A runs through aider and model B through codex, the report compares scaffolding, not models. The OpenAI Chat Completions + function-calling format is the lingua franca every provider now exposes, so the framework can own the loop once and point it anywhere.

ADR 0006 also established that the two capabilities ACP bundles — Tracing and Interaction — are modeled separately, and that headless autonomy needs no protocol. An in-process loop has an even stronger position: it is the client, so it observes every tool call and can gate each one, getting both capabilities for free.

Decision

Add openai / openai-compatible:

  • Provider-neutral by construction. One generic harness; the endpoint and key come from the OpenAI SDK's standard env vars (OPENAI_BASE_URL + OPENAI_API_KEY). No provider-named built-ins, no provider-scoped key vars. Named endpoints are opt-in via openai_agents.yaml (only to run several at once, or to attach a price table). The model is the matrix's, opaque to the framework.
  • A fixed, gradable tool surface. bash (execute), read_file (read), write_file / edit_file (write), grep / list_dir (search) — mapping directly onto the portable Tool Kinds. File/search tools are sandbox-confined.
  • Tracing. Every model⇄tool round-trip is replayed into the Trace (message / tool_call / tool_result / usage / stop); usage comes from the API response, and cost is filled when a profile carries a price table (else null, which the efficiency grader skips).
  • Interaction. Because the loop owns each call, it mediates through the Case's Interaction Policy — emit permission_request → ask the policy → permission_response → run or refuse, with optional updated_input rewriting. So Capabilities(tracing=True, interaction=True), and all five policies work. When a Case asks for tracing only (no policy), tools run directly with no permission noise — matching claude-code-stream.
  • Optional dependency. The openai SDK is an extra (pip install touchstone-eval[openai]) imported lazily, so the core stays network-dep-free.

The Trace stays the contract. ACP and the Claude adapters remain for what they're best at (comparing real agents; first-party Claude). This adapter is for comparing models under a constant harness.

Considered options

  • Use an existing agent CLI per provider (aider, codex, …) via harnesses.yaml/ACP. Zero new code, but compares agents; the output-only ones emit no Trace and fail the trace / efficiency graders most cases use. Kept as an option, not the answer.
  • An in-process OpenAI-compatible loop (chosen). One constant harness across every provider; first-class Tracing and Interaction; usage straight from the API. Cost: the framework owns a (minimal) agentic loop. Weaker models score worse under a minimal harness than under a polished CLI — which is correct when the harness is the control variable.
  • A heavier in-process framework (LangChain-style agent). More features, but more dependencies and a less legible, less constant tool surface. Rejected for an eval, where a small fixed surface is the point.

Consequences

  • touchstone can compare any model from any provider with no vendor CLI — OpenAI, OpenRouter, Together, Groq, DeepSeek, vLLM, Ollama, LM Studio, a LiteLLM proxy — by changing only OPENAI_BASE_URL + the model.
  • It is the first in-process / API Harness; harness/base.py always anticipated one, and it validates the Trace/Interaction contracts beyond subprocess adapters.
  • A provider that doesn't support function-calling simply yields a no-tool run (final text + end_turn) — degraded, not broken.
  • Some models, tuned for other harnesses, emit tool calls as text (Anthropic-style <function_calls><invoke> XML, or the <tool_call>{json}</tool_call> convention) instead of native function calls — which would otherwise stall the loop (observed with deepseek-v3.2). A textual fallback parses those, executes them, and feeds results back as a user message, so such a model still works. Native function-calling remains the expected path (and the system prompt asks for it); the fallback is a robustness net, not the contract.
  • The minimal loop is a deliberate floor, not a ceiling: a model's competence, not the scaffolding's polish, is what the report should reflect.