Observation & interaction¶

A Harness can do more than return final text. Two opt-in capabilities make a run observable and answerable. A Case turns them on with an observe block:

observe:
  tracing: true            # capture a Trace
  interaction:             # answer the agent's mid-run requests
    policy: auto-approve

Absent the block, the cell is output-only. Interaction implies Tracing — you can't answer a request you can't observe.

The Trace¶

The Trace is touchstone's normalized, vendor-neutral event stream. Every Adapter translates its native events (ACP notifications, Claude stream-json, the OpenAI loop's own calls) into the same schema — so graders and the LangFuse export read the Trace, never a source protocol.

It's written one JSON object per line to trace.jsonl. Every event has seq, id, ts, type, parent_id (so tool calls nest under the message that spawned them), and a payload.

Event types¶

Type	Payload
`message`	`role` + `text` (assistant/user visible text)
`thought`	`text` (agent reasoning)
`tool_call`	`tool_call_id`, `raw_name`, `name`, `kind`, `input`
`tool_result`	`tool_call_id`, `status`, `output`, `locations`
`usage`	`input_tokens`, `output_tokens`, `cost_usd`, `context_used`
`permission_request`	`request_id`, `tool_call_id`, `options`
`permission_response`	`request_id`, `outcome`, `chosen_option`
`stop`	`reason`, `final_output`

Tool Kinds¶

A tool call carries three names: raw_name (verbatim from the agent), name (the Adapter's normalized name), and kind — the one portable category that means the same across every agent:

read · write · execute · search · fetch · other

Cross-model grading uses the Tool Kind (every agent's "edit a file" is write), while within-agent checks may use raw_name. The openai harness maps its fixed tools directly: bash→execute, read_file→read, write_file/edit_file→write, grep/list_dir→search.

Interaction policies¶

The Interaction Policy answers agent-initiated mid-run requests — chiefly "may I run this tool?". Every request and its answer is recorded in the Trace, whatever the policy.

Policy	Behavior	Reproducible
`auto-approve`	allow every request	✓
`auto-deny`	deny every request	✓
`scripted`	ordered rules, first match wins, else a default	✓
`llm-based`	a fixed Responder LLM decides under case guidelines	✓ (control variable)
`manual`	prompt the operator on the terminal	✗ (excluded from aggregates)

Scripted rules¶

observe:
  tracing: true
  interaction:
    policy: scripted
    default: allow
    rules:
      - {tool_kind: execute, decision: deny}        # never run shell commands
      - {raw_name: "rm", decision: deny}            # match by verbatim name
      - {input_regex: "sudo", decision: deny}       # match on the tool input

Each rule may match on tool_kind, raw_name, name, input_regex, or prompt_regex, and either decision: allow|deny, supply an answer: (for free-text questions), or pick an option:. A rule with no match keys is a catch-all.

Which harness can do what¶

Harness	Tracing	Interaction
`echo`, `claude-code`, `harnesses.yaml` CLI agents	–	–
`claude-code-stream`	✓	–
`openai` / `openai-compatible`	✓	✓
ACP agents (`droid`, `gemini`, `codex`, `claude-acp`, `devin-cli`)	✓	✓

The openai harness mediates every tool call

Because it owns the agentic loop, the OpenAI-compatible harness routes each tool call through the policy — emitting permission_request → asking the policy → permission_response → running the tool (or returning the refusal). A policy can even rewrite a tool's input before it runs. When a case asks for tracing only, tools run directly with no permission noise.

Soft-degrade rules¶

Tracing requested, harness can't → output-only, with a warning. But a trace-dependent grader with no Trace is a hard failure for that cell.
Interaction requested, harness can't → agent-initiated requests fall back to the harness's own defaults, with a warning.
manual is non-reproducible and excluded from aggregation; llm-based is included but flagged responder-mediated.

Export to LangFuse¶

A Trace maps losslessly onto LangFuse spans (cell → trace, turn/tool_call → span, generation → generation):

pip install "touchstone-eval[langfuse]"
touchstone export <run-id>          # writes langfuse.json
touchstone export <run-id> --push   # push to a configured LangFuse