Harnesses
A Harness is the swappable thing that turns a Case's task into an output. Every Harness is
an Adapter behind one interface (harness/base.py):
class Harness:
    name: str
    capabilities: Capabilities  # tracing? interaction?

    def run(self, ctx: RunContext) -> RunResult: ...
The interface is deliberately agnostic — a CLI agent, an ACP agent, and an in-process API loop
all implement the same run. What differs is how much each can observe and answer.
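As a rough illustration of the shared interface, the sketch below implements a toy adapter. The fields on Capabilities, RunContext, and RunResult (tracing, interaction, prompt, output) and the helper run_case are assumptions made for this example; the real definitions live in harness/base.py and may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capabilities:
    # Hypothetical fields; the page only names the two capabilities.
    tracing: bool = False
    interaction: bool = False


@dataclass
class RunContext:
    prompt: str  # assumed: the Case's task text


@dataclass
class RunResult:
    output: str  # assumed: the harness's final answer


class Harness:
    name: str
    capabilities: Capabilities

    def run(self, ctx: RunContext) -> RunResult:
        raise NotImplementedError


class UppercaseHarness(Harness):
    """Toy output-only adapter: returns the prompt in upper case."""

    name = "uppercase"
    capabilities = Capabilities()

    def run(self, ctx: RunContext) -> RunResult:
        return RunResult(output=ctx.prompt.upper())


def run_case(harness: Harness, prompt: str) -> str:
    # The caller depends only on run(), not on how the adapter works.
    return harness.run(RunContext(prompt=prompt)).output
```

Because run_case only touches run, swapping UppercaseHarness for a CLI-backed or ACP-backed adapter would not change the calling code.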
Capabilities
Two opt-in capabilities, declared per Adapter (Interaction implies Tracing):
- Tracing — emits a normalized Trace of tool calls, tokens, cost.
- Interaction — answers the agent's mid-run requests via an Interaction Policy.
A Harness that exposes neither is output-only: you still get its final answer, but the Trace contains no tool events.
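The two rules in this section (Interaction implies Tracing, and neither capability means output-only) can be expressed as a small value type. This is a sketch under assumed field names, not the project's actual Capabilities class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capabilities:
    # Field names are assumptions for illustration.
    tracing: bool = False
    interaction: bool = False

    def __post_init__(self) -> None:
        # Interaction implies Tracing, so reject the inconsistent combination.
        if self.interaction and not self.tracing:
            raise ValueError("interaction requires tracing")

    @property
    def output_only(self) -> bool:
        # Neither capability: only the final answer is available.
        return not self.tracing and not self.interaction
```

Validating at construction time means an inconsistent declaration fails when the Adapter is defined rather than in the middle of a run.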
Built-in harnesses
| Harness | Adapter | Tracing | Interaction | Needs |
|---|---|---|---|---|
| echo | fake | – | – | nothing (offline smoke test) |
| claude-code | Claude CLI | – | – | claude on PATH |
| claude-code-stream | Claude CLI (stream-json) | ✓ | – | claude on PATH |
| openai / openai-compatible | in-process OpenAI loop | ✓ | ✓ | [openai] extra + an endpoint |
| droid, gemini, codex, claude-acp, devin-cli | ACP | ✓ | ✓ | the agent's CLI on PATH |
The openai harness is the no-CLI path
Unlike every other rich harness, openai needs no vendor CLI — it runs the agentic
loop itself against any OpenAI-compatible endpoint. It's both Tracing- and
Interaction-capable. See the OpenAI-compatible harness.
Three ways to add a harness, without writing code
ACP agents → acp_agents.yaml
Any agent that speaks the Agent Client Protocol drops
in as a profile. Built-ins (droid, gemini, codex, claude-acp, devin-cli) work once
the agent's CLI is on PATH.
# acp_agents.yaml (repo root, next to evals/)
my-agent:
  argv: ["my-agent", "acp"]
  model_via: session  # launch | session | none
  name_map: {"Run shell command": "bash"}
  kind_overrides: {"Run shell command": "execute"}
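One plausible reading of name_map and kind_overrides is that they normalize the tool title and kind an agent reports before the call lands in the Trace. The sketch below assumes exactly that; the function name, the returned dict shape, and the lookup-by-title behavior are guesses for illustration, not the documented semantics.

```python
def normalize_tool_call(title: str, kind: str, profile: dict) -> dict:
    """Apply a profile's (assumed) renaming rules to one reported tool call."""
    # name_map: agent-specific title -> normalized tool name (assumed meaning).
    name = profile.get("name_map", {}).get(title, title)
    # kind_overrides: agent-specific title -> tool kind (assumed meaning).
    kind = profile.get("kind_overrides", {}).get(title, kind)
    return {"name": name, "kind": kind}


# The profile from the YAML example above, as a plain dict.
my_agent = {
    "argv": ["my-agent", "acp"],
    "model_via": "session",
    "name_map": {"Run shell command": "bash"},
    "kind_overrides": {"Run shell command": "execute"},
}
```

Under this reading, titles missing from either map pass through unchanged, so a profile only needs entries for the tools whose reported names differ from the normalized ones.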
OpenAI-compatible endpoints → openai_agents.yaml
Optional — the generic openai harness already covers any endpoint via OPENAI_BASE_URL. Use
this only to pin named endpoints (so one run can compare several at once) or attach a price
table:
# openai_agents.yaml
ollama:
  base_url: "http://localhost:11434/v1"
  price_in: 0.0
  price_out: 0.0
openrouter:
  base_url: "https://openrouter.ai/api/v1"
  api_key_env: "OPENROUTER_API_KEY"
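To make the profile fields concrete, here is a hedged sketch of how such an entry might be consumed. It assumes that api_key_env names an environment variable, that OPENAI_API_KEY is the fallback when the field is absent, and that price_in / price_out are prices per one million tokens. None of these are confirmed by this page, so treat the unit in particular as a placeholder.

```python
import os
from typing import Mapping, Optional


def resolve_endpoint(profile: dict, env: Optional[Mapping[str, str]] = None) -> dict:
    """Turn a profile into connection settings (hypothetical helper)."""
    env = os.environ if env is None else env
    key_var = profile.get("api_key_env", "OPENAI_API_KEY")  # assumed default
    return {
        "base_url": profile["base_url"],
        "api_key": env.get(key_var),  # None when the variable is unset
    }


def estimate_cost(profile: dict, tokens_in: int, tokens_out: int,
                  per_tokens: int = 1_000_000) -> float:
    """Cost from the price table; the per-million unit is an assumption."""
    price_in = profile.get("price_in", 0.0)
    price_out = profile.get("price_out", 0.0)
    return (tokens_in * price_in + tokens_out * price_out) / per_tokens
```

Passing env explicitly keeps the helper testable without touching the real process environment.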
Generic CLI agents → harnesses.yaml
Any agent exposed as a command-line tool. Placeholders are substituted per run; the process
runs with cwd=sandbox and its stdout/stderr become the transcript. Output-only (no Trace):
# harnesses.yaml
aider:
  argv: ["aider", "--model", "{model}", "--message", "{prompt}", "--yes", "--no-auto-commits"]
  timeout_s: 1200
Placeholders: {prompt} {model} {sandbox} {artifacts_dir} {mcp}.
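Per-run substitution can be sketched as follows. The helper name render_argv is hypothetical and the real implementation may differ. The point illustrated is substituting each argv element separately, so a prompt containing spaces or braces stays a single argument and needs no shell quoting.

```python
import re

# Only the documented placeholders are recognized; other braces are left alone.
PLACEHOLDER = re.compile(r"\{(prompt|model|sandbox|artifacts_dir|mcp)\}")


def render_argv(template: list[str], values: dict[str, str]) -> list[str]:
    """Substitute placeholders in each argv element (hypothetical helper)."""

    def substitute(match: re.Match) -> str:
        # Raises KeyError if the template uses a placeholder with no value.
        return values[match.group(1)]

    # A function replacement inserts the value literally, with no
    # backslash or group-reference processing, in a single pass.
    return [PLACEHOLDER.sub(substitute, arg) for arg in template]
```

Because substitution is a single pass, placeholder-like text inside a substituted value (for example a prompt that mentions {model}) is not expanded again.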
Choosing a harness
- Just trying touchstone? echo runs the whole loop offline.
- Any open/hosted model, no CLI? openai: point at an endpoint, done. (Recommended default.)
- Comparing coding agents themselves (their tools, prompts, scaffolding)? The ACP harnesses or a harnesses.yaml CLI agent.
- Claude specifically, with first-party tracing? claude-code-stream.
Graders that read the Trace need a Tracing harness
Most cases declare observe.tracing: true and use trace / efficiency graders. An
output-only harness (echo, claude-code, a harnesses.yaml agent) can't satisfy those —
the cell hard-fails. Prefer openai, claude-code-stream, or an ACP agent for graded runs.
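The rule above amounts to a pre-flight check. In this sketch the capability pairs are copied from the built-in harnesses table, while the function name, the error type, and the idea of a static lookup are assumptions; the real check presumably reads each Adapter's declared Capabilities.

```python
# (tracing, interaction) per built-in harness, taken from the table above.
BUILTIN_CAPABILITIES = {
    "echo": (False, False),
    "claude-code": (False, False),
    "claude-code-stream": (True, False),
    "openai": (True, True),
    "droid": (True, True),
    "gemini": (True, True),
    "codex": (True, True),
    "claude-acp": (True, True),
    "devin-cli": (True, True),
}


def check_cell(harness: str, needs_tracing: bool) -> None:
    """Fail early when a case needs a Trace the harness cannot produce."""
    tracing, _interaction = BUILTIN_CAPABILITIES[harness]
    if needs_tracing and not tracing:
        raise RuntimeError(f"{harness} is output-only; the case needs tracing")


def tracing_harnesses() -> list[str]:
    # Harnesses that can satisfy a case with observe.tracing: true.
    return sorted(n for n, (tracing, _) in BUILTIN_CAPABILITIES.items() if tracing)
```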
Running a harness inside a sandbox
By default the Harness runs on the host (against the cell's isolated working directory). For
stronger isolation — running the agent's bash/file effects inside a throwaway container (and,
in an enterprise deployment, a per-cell Kubernetes pod or Daytona workspace) — a Case adds a
container block with harness: true (ADR 0014):
container:
  image: python:3.12-slim              # the sandbox image
  harness: true                        # run the agent in the sandbox, not on the host
  env_passthrough: ["OPENAI_API_KEY"]  # forward just this secret into the sandbox
- openai needs nothing baked into the image. Its loop and model calls stay in the touchstone controller; only its effects (running a command, reading/writing files) are routed into the sandbox through the Executor. So one generic runtime image isolates any model you point the harness at, which makes this the recommended pattern for graded runs under strong isolation.
- CLI harnesses must bake their agent CLI into the image. claude-code / claude-code-stream need claude in the image; an ACP or harnesses.yaml agent needs its binary there. harness: false (the default) keeps them on the host.
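The split described for openai (loop in the controller, effects in the sandbox) can be pictured as a small seam. Everything here is an assumption for illustration: the class names, the run signature, and the choice of docker exec against an already-running container. The actual Executor interface is not shown on this page.

```python
import subprocess
from typing import Protocol


class Executor(Protocol):
    def run(self, argv: list[str]) -> str: ...


class HostExecutor:
    """Runs the command directly on the host."""

    def run(self, argv: list[str]) -> str:
        done = subprocess.run(argv, capture_output=True, text=True, check=True)
        return done.stdout


class DockerExecutor:
    """Runs the command inside an already-running container via docker exec."""

    def __init__(self, container: str) -> None:
        self.container = container

    def wrap(self, argv: list[str]) -> list[str]:
        # Prefix the command so it executes inside the container.
        return ["docker", "exec", self.container, *argv]

    def run(self, argv: list[str]) -> str:
        done = subprocess.run(self.wrap(argv), capture_output=True, text=True, check=True)
        return done.stdout


def tool_bash(executor: Executor, argv: list[str]) -> str:
    # The agent loop calls this; it never needs to know where the command ran.
    return executor.run(argv)
```

With this shape, switching from host execution to a sandbox means handing the loop a different executor, which is why the image needs no model-specific tooling.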
Each cell is started and torn down independently, so cells already run in separate sandboxes
under --workers. The backend (container.backend) determines where that sandbox lives: docker for a
local container, or harbor to run each cell in a remote Harbor
sandbox (Daytona / Modal / E2B / Runloop / GKE) behind the same Executor seam
(ADR 0015). The harbor backend is the enterprise "each cell in its own pod" path.