Responder-mediated interaction is allowed, with the Responder as a fixed control variable

When a Case enables Interaction, the agent's mid-run requests (tool-permission prompts and clarification questions) are answered by an Interaction Policy. We allow five policies (auto-approve, auto-deny, scripted, llm-based, and manual) rather than only deterministic ones, because we cannot predict the questions an agent will ask, and a fixed script cannot cover that space. The llm-based policy uses a Responder LLM driven by per-Case guidelines.
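As a rough illustration of how the five policies could be dispatched, here is a minimal sketch. The names `InteractionPolicy`, `Request`, and `answer` are hypothetical and not taken from the actual codebase, and the literal "approve"/"deny" replies are an assumption about what the two automatic policies return.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, Optional


class InteractionPolicy(Enum):
    """The five policies named in this decision (enum itself is hypothetical)."""
    AUTO_APPROVE = "auto-approve"
    AUTO_DENY = "auto-deny"
    SCRIPTED = "scripted"
    LLM_BASED = "llm-based"
    MANUAL = "manual"


@dataclass(frozen=True)
class Request:
    kind: str  # assumed values: "tool_permission" or "clarification"
    text: str


def answer(
    policy: InteractionPolicy,
    request: Request,
    script: Optional[Dict[str, str]] = None,
    responder: Optional[Callable[[Request], str]] = None,
) -> str:
    """Return the reply the agent receives for one mid-run request."""
    if policy is InteractionPolicy.AUTO_APPROVE:
        return "approve"  # assumed reply format
    if policy is InteractionPolicy.AUTO_DENY:
        return "deny"  # assumed reply format
    if policy is InteractionPolicy.SCRIPTED:
        # A script only covers the requests its author anticipated.
        if script is None or request.text not in script:
            raise LookupError(f"no scripted answer for: {request.text!r}")
        return script[request.text]
    if policy is InteractionPolicy.LLM_BASED:
        if responder is None:
            raise ValueError("llm-based policy needs a responder callable")
        # The callable stands in for the Responder LLM plus its guidelines.
        return responder(request)
    # MANUAL: a human types the answer, which is why it is not reproducible.
    return input(f"{request.text} > ")
```

The `LookupError` branch shows the gap that motivates the decision: a scripted policy has no answer for a question it did not anticipate, while the llm-based branch can respond to any request.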

To keep cross-model comparisons honest, the Responder, like the Judge in model-as-judge evaluation, is a control variable held constant across the entire matrix: the same responder model, the same guidelines, and temperature 0. The thing being benchmarked is the agent, not the (agent + responder) pair.
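One way to picture "held constant across the matrix" is a run matrix in which only the agent model varies for a given Case. The sketch below is an assumption-laden illustration: `ResponderConfig`, `RunSpec`, and `build_matrix` are hypothetical names, and the model identifiers in the usage lines are placeholders.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ResponderConfig:
    model: str
    guidelines: str
    temperature: float = 0.0  # fixed at 0, per the decision


@dataclass(frozen=True)
class RunSpec:
    agent_model: str
    case_id: str
    responder: ResponderConfig


def build_matrix(
    agent_models: List[str],
    guidelines_by_case: Dict[str, str],
    responder_model: str,
) -> List[RunSpec]:
    """Cross every agent with every Case while the Responder stays fixed."""
    runs = []
    for case_id, guidelines in guidelines_by_case.items():
        # One Responder config per Case, shared by every agent on that Case.
        responder = ResponderConfig(model=responder_model, guidelines=guidelines)
        for agent in agent_models:
            runs.append(RunSpec(agent, case_id, responder))
    return runs


runs = build_matrix(
    agent_models=["agent-a", "agent-b"],
    guidelines_by_case={"case-1": "Approve read-only tools.", "case-2": "Prefer the main branch."},
    responder_model="responder-x",
)
```

Because the configs are frozen dataclasses, two runs on the same Case compare equal on their `responder` field, which makes it easy to assert that nothing but the agent changed.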

Status

accepted

Considered options

Deterministic policies only (auto-approve/auto-deny/scripted). Fully reproducible, but cannot answer open-ended questions an agent invents at runtime, so interactive Cases would stall or behave unrealistically.
Responder allowed (chosen). Trades strict determinism for coverage of the unpredictable question space, recovering interpretability by fixing the Responder and recording every request/response in the Trace.

Consequences

Under llm-based, two compared models may receive different interactions, because each agent asks its own questions; this is expected and inherent, not a bug. Results are tagged "responder-mediated."
manual is non-reproducible and is excluded from leaderboard aggregation; llm-based is included but flagged.
The Responder's identity and guidelines are recorded per Run so a result is always interpretable.
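The consequences above can be sketched as a leaderboard aggregation step. This is only an illustration under stated assumptions: `RunResult` and `aggregate` are hypothetical names, the mean is an assumed aggregation function, and the policy strings mirror the policy names used in this record.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass(frozen=True)
class RunResult:
    agent_model: str
    score: float
    policy: str  # one of the five policy names, e.g. "llm-based"
    # Recorded per Run so that a result stays interpretable.
    responder_model: Optional[str] = None
    responder_guidelines: Optional[str] = None


def aggregate(results: List[RunResult]) -> Dict[str, dict]:
    """Average scores per agent, dropping manual runs and flagging llm-based ones."""
    rows: Dict[str, dict] = {}
    for r in results:
        if r.policy == "manual":
            continue  # non-reproducible, excluded from the leaderboard
        row = rows.setdefault(r.agent_model, {"scores": [], "responder_mediated": False})
        row["scores"].append(r.score)
        if r.policy == "llm-based":
            row["responder_mediated"] = True  # included, but flagged
    return {
        agent: {
            "mean_score": mean(row["scores"]),
            "responder_mediated": row["responder_mediated"],
        }
        for agent, row in rows.items()
    }


board = aggregate([
    RunResult("agent-a", 1.0, "scripted"),
    RunResult("agent-a", 0.5, "llm-based", "responder-x", "Approve read-only tools."),
    RunResult("agent-a", 0.0, "manual"),
    RunResult("agent-b", 0.9, "manual"),
])
```

In this example `agent-b` has only manual runs, so it does not appear on the board at all, while `agent-a` is averaged over its two reproducible runs and carries the "responder-mediated" flag.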