Skip to content

Observation & interaction

A Harness can do more than return final text. Two opt-in capabilities make a run observable and answerable. A Case turns them on with an observe block:

observe:
  tracing: true            # capture a Trace
  interaction:             # answer the agent's mid-run requests
    policy: auto-approve

Absent the block, the cell is output-only. Interaction implies Tracing — you can't answer a request you can't observe.

The Trace

The Trace is touchstone's normalized, vendor-neutral event stream. Every Adapter translates its native events (ACP notifications, Claude stream-json, the OpenAI loop's own calls) into the same schema — so graders and the LangFuse export read the Trace, never a source protocol.

It's written one JSON object per line to trace.jsonl. Every event has seq, id, ts, type, parent_id (so tool calls nest under the message that spawned them), and a payload.

Event types

Type Payload
message role + text (assistant/user visible text)
thought text (agent reasoning)
tool_call tool_call_id, raw_name, name, kind, input
tool_result tool_call_id, status, output, locations
usage input_tokens, output_tokens, cost_usd, context_used
permission_request request_id, tool_call_id, options
permission_response request_id, outcome, chosen_option
stop reason, final_output

Tool Kinds

A tool call carries three names: raw_name (verbatim from the agent), name (the Adapter's normalized name), and kind — the one portable category that means the same across every agent:

read · write · execute · search · fetch · other

Cross-model grading uses the Tool Kind (every agent's "edit a file" is write), while within-agent checks may use raw_name. The openai harness maps its fixed tools directly: bash→execute, read_file→read, write_file/edit_file→write, grep/list_dir→search.

Interaction policies

The Interaction Policy answers agent-initiated mid-run requests — chiefly "may I run this tool?". Every request and its answer is recorded in the Trace, whatever the policy.

Policy Behavior Reproducible
auto-approve allow every request
auto-deny deny every request
scripted ordered rules, first match wins, else a default
llm-based a fixed Responder LLM decides under case guidelines ✓ (control variable)
manual prompt the operator on the terminal ✗ (excluded from aggregates)

Scripted rules

observe:
  tracing: true
  interaction:
    policy: scripted
    default: allow
    rules:
      - {tool_kind: execute, decision: deny}        # never run shell commands
      - {raw_name: "rm", decision: deny}            # match by verbatim name
      - {input_regex: "sudo", decision: deny}       # match on the tool input

Each rule may match on tool_kind, raw_name, name, input_regex, or prompt_regex, and either decision: allow|deny, supply an answer: (for free-text questions), or pick an option:. A rule with no match keys is a catch-all.

Which harness can do what

Harness Tracing Interaction
echo, claude-code, harnesses.yaml CLI agents
claude-code-stream
openai / openai-compatible
ACP agents (droid, gemini, codex, claude-acp, devin-cli)

The openai harness mediates every tool call

Because it owns the agentic loop, the OpenAI-compatible harness routes each tool call through the policy — emitting permission_request → asking the policy → permission_response → running the tool (or returning the refusal). A policy can even rewrite a tool's input before it runs. When a case asks for tracing only, tools run directly with no permission noise.

Soft-degrade rules

  • Tracing requested, harness can't → output-only, with a warning. But a trace-dependent grader with no Trace is a hard failure for that cell.
  • Interaction requested, harness can't → agent-initiated requests fall back to the harness's own defaults, with a warning.
  • manual is non-reproducible and excluded from aggregation; llm-based is included but flagged responder-mediated.

Export to LangFuse

A Trace maps losslessly onto LangFuse spans (cell → trace, turn/tool_call → span, generation → generation):

pip install "touchstone-eval[langfuse]"
touchstone export <run-id>          # writes langfuse.json
touchstone export <run-id> --push   # push to a configured LangFuse