Observation & interaction¶
A Harness can do more than return final text. Two opt-in capabilities make a run observable
and answerable. A Case turns them on with an observe block:
observe:
tracing: true # capture a Trace
interaction: # answer the agent's mid-run requests
policy: auto-approve
Absent the block, the cell is output-only. Interaction implies Tracing — you can't answer a request you can't observe.
The Trace¶
The Trace is touchstone's normalized, vendor-neutral event stream. Every Adapter translates its native events (ACP notifications, Claude stream-json, the OpenAI loop's own calls) into the same schema — so graders and the LangFuse export read the Trace, never a source protocol.
It's written one JSON object per line to trace.jsonl. Every event has seq, id, ts,
type, parent_id (so tool calls nest under the message that spawned them), and a payload.
Event types¶
| Type | Payload |
|---|---|
message |
role + text (assistant/user visible text) |
thought |
text (agent reasoning) |
tool_call |
tool_call_id, raw_name, name, kind, input |
tool_result |
tool_call_id, status, output, locations |
usage |
input_tokens, output_tokens, cost_usd, context_used |
permission_request |
request_id, tool_call_id, options |
permission_response |
request_id, outcome, chosen_option |
stop |
reason, final_output |
Tool Kinds¶
A tool call carries three names: raw_name (verbatim from the agent), name (the Adapter's
normalized name), and kind — the one portable category that means the same across every
agent:
read · write · execute · search · fetch · other
Cross-model grading uses the Tool Kind (every agent's "edit a file" is write), while
within-agent checks may use raw_name. The openai harness maps its fixed tools directly:
bash→execute, read_file→read, write_file/edit_file→write, grep/list_dir→search.
Interaction policies¶
The Interaction Policy answers agent-initiated mid-run requests — chiefly "may I run this tool?". Every request and its answer is recorded in the Trace, whatever the policy.
| Policy | Behavior | Reproducible |
|---|---|---|
auto-approve |
allow every request | ✓ |
auto-deny |
deny every request | ✓ |
scripted |
ordered rules, first match wins, else a default | ✓ |
llm-based |
a fixed Responder LLM decides under case guidelines | ✓ (control variable) |
manual |
prompt the operator on the terminal | ✗ (excluded from aggregates) |
Scripted rules¶
observe:
tracing: true
interaction:
policy: scripted
default: allow
rules:
- {tool_kind: execute, decision: deny} # never run shell commands
- {raw_name: "rm", decision: deny} # match by verbatim name
- {input_regex: "sudo", decision: deny} # match on the tool input
Each rule may match on tool_kind, raw_name, name, input_regex, or prompt_regex, and
either decision: allow|deny, supply an answer: (for free-text questions), or pick an
option:. A rule with no match keys is a catch-all.
Which harness can do what¶
| Harness | Tracing | Interaction |
|---|---|---|
echo, claude-code, harnesses.yaml CLI agents |
– | – |
claude-code-stream |
✓ | – |
openai / openai-compatible |
✓ | ✓ |
ACP agents (droid, gemini, codex, claude-acp, devin-cli) |
✓ | ✓ |
The openai harness mediates every tool call
Because it owns the agentic loop, the OpenAI-compatible harness routes
each tool call through the policy — emitting permission_request → asking the policy →
permission_response → running the tool (or returning the refusal). A policy can even
rewrite a tool's input before it runs. When a case asks for tracing only, tools run
directly with no permission noise.
Soft-degrade rules¶
- Tracing requested, harness can't → output-only, with a warning. But a trace-dependent grader with no Trace is a hard failure for that cell.
- Interaction requested, harness can't → agent-initiated requests fall back to the harness's own defaults, with a warning.
manualis non-reproducible and excluded from aggregation;llm-basedis included but flagged responder-mediated.
Export to LangFuse¶
A Trace maps losslessly onto LangFuse spans (cell → trace, turn/tool_call → span, generation → generation):