Harnesses

A Harness is the swappable component that turns a Case's task into an output. Every Harness is an Adapter behind one interface (harness/base.py):

class Harness:
    name: str
    capabilities: Capabilities          # tracing? interaction?
    def run(self, ctx: RunContext) -> RunResult: ...

The interface is deliberately agnostic about how the agent runs: a CLI agent, an ACP agent, and an in-process API loop all implement the same run method. What differs is how much each can observe and answer.
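As a rough illustration of how small an Adapter can be, the sketch below implements the interface for an echo-style harness. The Capabilities, RunContext and RunResult definitions here are hypothetical stand-ins (the real ones live in harness/base.py and may carry different fields); only name, capabilities and run come from the interface above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capabilities:
    # Hypothetical stand-in for the real class in harness/base.py.
    tracing: bool = False
    interaction: bool = False


@dataclass
class RunContext:
    # Hypothetical stand-in: assume the context carries the Case's task text.
    prompt: str


@dataclass
class RunResult:
    # Hypothetical stand-in: assume the result carries the final output.
    output: str


class EchoLikeHarness:
    """Output-only harness: returns the prompt unchanged and emits no tool events."""

    name = "echo-like"
    capabilities = Capabilities()  # neither Tracing nor Interaction

    def run(self, ctx: RunContext) -> RunResult:
        return RunResult(output=ctx.prompt)


result = EchoLikeHarness().run(RunContext(prompt="say hi"))
print(result.output)
```

A CLI or ACP Adapter would differ only inside run; the caller still sees one method and one declared set of capabilities.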

Capabilities

Two opt-in capabilities, declared per Adapter (Interaction implies Tracing):

  • Tracing — emits a normalized Trace of tool calls, tokens, cost.
  • Interaction — answers the agent's mid-run requests via an Interaction Policy.

A Harness that exposes neither is output-only: you still get its final answer, but the Trace contains no tool events.
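The "Interaction implies Tracing" rule can be enforced at declaration time. The following is a minimal sketch, assuming capabilities are modelled as two booleans; the field names and the output_only helper are illustrative, not touchstone's actual definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capabilities:
    # Assumed shape: two independent flags (hypothetical field names).
    tracing: bool = False
    interaction: bool = False

    def __post_init__(self) -> None:
        # An Adapter that answers mid-run requests must also emit a Trace.
        if self.interaction and not self.tracing:
            raise ValueError("Interaction implies Tracing")

    @property
    def output_only(self) -> bool:
        # Neither capability: only the final answer is available.
        return not self.tracing and not self.interaction


print(Capabilities().output_only)
print(Capabilities(tracing=True, interaction=True).output_only)
```

Rejecting the invalid combination in the constructor means no later code has to handle an Interaction-only Adapter.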

Built-in harnesses

Harness | Adapter | Tracing | Interaction | Needs
echo | fake | no | no | nothing (offline smoke test)
claude-code | Claude CLI | no | no | claude on PATH
claude-code-stream | Claude CLI (stream-json) | yes |  | claude on PATH
openai / openai-compatible | in-process OpenAI loop | yes | yes | [openai] extra + an endpoint
droid, gemini, codex, claude-acp, devin-cli | ACP | yes |  | the agent's CLI on PATH

The openai harness is the no-CLI path

Unlike every other rich harness, openai needs no vendor CLI — it runs the agentic loop itself against any OpenAI-compatible endpoint. It's both Tracing- and Interaction-capable. See the OpenAI-compatible harness.

Three ways to add a harness — without writing code

ACP agents → acp_agents.yaml

Any agent that speaks the Agent Client Protocol drops in as a profile. Built-ins (droid, gemini, codex, claude-acp, devin-cli) work once the agent's CLI is on PATH.

# acp_agents.yaml (repo root, next to evals/)
my-agent:
  argv: ["my-agent", "acp"]
  model_via: session            # launch | session | none
  name_map: {"Run shell command": "bash"}
  kind_overrides: {"Run shell command": "execute"}
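The source does not spell out how name_map and kind_overrides are applied. One plausible reading, sketched below purely as an assumption, is that both are keyed by the tool title the agent reports: name_map rewrites it to a normalized tool name for the Trace, and kind_overrides replaces the reported tool kind. The normalize_tool_call helper and its return shape are hypothetical.

```python
# Mirrors the name_map / kind_overrides entries of the my-agent profile above.
PROFILE = {
    "name_map": {"Run shell command": "bash"},
    "kind_overrides": {"Run shell command": "execute"},
}


def normalize_tool_call(profile: dict, title: str, kind: str) -> dict:
    """Assumed behaviour: look up the agent's tool title in both tables."""
    name = profile.get("name_map", {}).get(title, title)
    kind = profile.get("kind_overrides", {}).get(title, kind)
    return {"name": name, "kind": kind}


print(normalize_tool_call(PROFILE, "Run shell command", "other"))
print(normalize_tool_call(PROFILE, "Read file", "read"))
```

Under this reading, titles missing from either table pass through unchanged, so a profile only needs entries for tools whose names or kinds differ from the normalized vocabulary.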

OpenAI-compatible endpoints → openai_agents.yaml

Optional — the generic openai harness already covers any endpoint via OPENAI_BASE_URL. Use this only to pin named endpoints (so one run can compare several at once) or attach a price table:

# openai_agents.yaml
ollama:
  base_url: "http://localhost:11434/v1"
  price_in: 0.0
  price_out: 0.0
openrouter:
  base_url: "https://openrouter.ai/api/v1"
  api_key_env: "OPENROUTER_API_KEY"

See the full field reference.
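To make the role of api_key_env concrete, here is a minimal sketch of how a named endpoint could be resolved into connection settings. The resolve_endpoint helper is hypothetical, the profiles are written as a Python dict rather than loaded from YAML, and falling back to OPENAI_API_KEY when api_key_env is absent is an assumption, not documented behaviour.

```python
import os

# Same data as openai_agents.yaml above, inlined to keep the sketch stdlib-only.
ENDPOINTS = {
    "ollama": {
        "base_url": "http://localhost:11434/v1",
        "price_in": 0.0,
        "price_out": 0.0,
    },
    "openrouter": {
        "base_url": "https://openrouter.ai/api/v1",
        "api_key_env": "OPENROUTER_API_KEY",
    },
}


def resolve_endpoint(name: str, env=None) -> dict:
    """Turn a named profile into settings; the key is read from env, never stored."""
    env = os.environ if env is None else env
    profile = ENDPOINTS[name]
    # Assumption: without api_key_env, fall back to OPENAI_API_KEY.
    key_var = profile.get("api_key_env", "OPENAI_API_KEY")
    return {
        "base_url": profile["base_url"],
        "api_key": env.get(key_var),  # None if the variable is unset
        "price_in": profile.get("price_in"),
        "price_out": profile.get("price_out"),
    }


settings = resolve_endpoint("openrouter", env={"OPENROUTER_API_KEY": "sk-test"})
print(settings["base_url"])
```

Naming the environment variable instead of the key itself keeps secrets out of the YAML file, which is what lets several endpoints sit side by side in one run.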

Generic CLI agents → harnesses.yaml

Any agent exposed as a command-line tool. Placeholders are substituted per run; the process runs with cwd=sandbox and its stdout/stderr become the transcript. Output-only (no Trace):

# harnesses.yaml
aider:
  argv: ["aider", "--model", "{model}", "--message", "{prompt}", "--yes", "--no-auto-commits"]
  timeout_s: 1200

Placeholders: {prompt} {model} {sandbox} {artifacts_dir} {mcp}.
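The per-run substitution can be pictured as follows. This is a sketch under stated assumptions, not touchstone's implementation: render_argv and run_cli are hypothetical helpers, unknown placeholders are left untouched, and stdout and stderr are merged into a single transcript string.

```python
import re
import subprocess

_PLACEHOLDER = re.compile(r"\{(\w+)\}")


def render_argv(template: list, values: dict) -> list:
    """Substitute {name} placeholders in each argv element in a single pass."""

    def substitute(arg: str) -> str:
        # Single pass: braces inside a substituted value are not expanded again.
        return _PLACEHOLDER.sub(lambda m: values.get(m.group(1), m.group(0)), arg)

    return [substitute(arg) for arg in template]


def run_cli(template: list, values: dict, timeout_s: float) -> str:
    """Run the agent with cwd=sandbox and return its combined output."""
    proc = subprocess.run(
        render_argv(template, values),
        cwd=values["sandbox"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the transcript
        text=True,
        timeout=timeout_s,  # raises subprocess.TimeoutExpired when exceeded
    )
    return proc.stdout


AIDER = ["aider", "--model", "{model}", "--message", "{prompt}", "--yes", "--no-auto-commits"]
argv = render_argv(AIDER, {"model": "my-model", "prompt": "fix {model} bug"})
print(argv)
```

Passing the command as an argv list rather than a shell string means a prompt containing quotes or spaces stays a single argument, and the single-pass regex keeps braces in the prompt from being treated as placeholders.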

Choosing a harness

  • Just trying touchstone? echo — runs the whole loop offline.
  • Any open/hosted model, no CLI? openai — point at an endpoint, done. (Recommended default.)
  • Comparing coding agents themselves (their tools, prompts, scaffolding)? the ACP harnesses or a harnesses.yaml CLI agent.
  • Claude specifically, with first-party tracing? claude-code-stream.

Graders that read the Trace need a Tracing harness

Most cases declare observe.tracing: true and use trace / efficiency graders. An output-only harness (echo, claude-code, a harnesses.yaml agent) can't satisfy those — the cell hard-fails. Prefer openai, claude-code-stream, or an ACP agent for graded runs.
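The hard-fail described above amounts to a check before the cell runs. A minimal sketch, assuming the Case is available as a dict with an observe.tracing key and the harness's Tracing capability as a boolean; the preflight function and TracingRequired exception are hypothetical names.

```python
class TracingRequired(Exception):
    """Raised when a Case needs a Trace that the chosen harness cannot emit."""


def preflight(case: dict, harness_name: str, harness_tracing: bool) -> None:
    # observe.tracing: true means the Case's graders will read the Trace.
    wants_trace = bool(case.get("observe", {}).get("tracing", False))
    if wants_trace and not harness_tracing:
        raise TracingRequired(
            f"case requires tracing but harness {harness_name!r} is output-only"
        )


case = {"observe": {"tracing": True}}
preflight(case, "openai", harness_tracing=True)  # passes silently
try:
    preflight(case, "echo", harness_tracing=False)
except TracingRequired as exc:
    print(exc)
```

Failing before the run starts, rather than grading an empty Trace, avoids reporting a misleading zero score for a harness that simply cannot observe tool calls.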

Running a harness inside a sandbox

By default the Harness runs on the host (against the cell's isolated working directory). For stronger isolation, a Case adds a container block with harness: true (ADR 0014). The agent's bash/file effects then run inside a throwaway container or, in an enterprise deployment, a per-cell Kubernetes pod or Daytona workspace:

container:
  image: python:3.12-slim          # the sandbox image
  harness: true                    # run the agent in the sandbox, not on the host
  env_passthrough: ["OPENAI_API_KEY"]   # forward just this secret into the sandbox

  • openai needs nothing baked into the image. Its loop and model calls stay in the touchstone controller; only its effects (running a command, reading/writing files) are routed into the sandbox through the Executor. So one generic runtime image isolates any model you point the harness at. This is the recommended pattern for graded runs under strong isolation.
  • CLI harnesses must bake their agent CLI into the image. claude-code / claude-code-stream need claude in the image; an ACP or harnesses.yaml agent needs its binary there. harness: false (the default) keeps them on the host.

Each cell is started and torn down independently, so cells already run in separate sandboxes under --workers. The backend (container.backend) decides where that sandbox lives: docker for a local container, or harbor to run each cell in a remote Harbor sandbox (Daytona / Modal / E2B / Runloop / GKE) behind the same Executor seam (ADR 0015). The harbor backend is the enterprise "each cell in its own pod" path.