
Responder-mediated interaction is allowed, with the Responder as a fixed control variable

When a Case enables Interaction, the agent's mid-run requests (tool permission, clarification questions) are answered by an Interaction Policy. We allow five policies — auto-approve, auto-deny, scripted, llm-based, and manual — rather than only deterministic ones, because we cannot predict the questions an agent will ask, and a fixed script cannot cover that space. The llm-based policy uses a Responder LLM driven by per-Case guidelines.
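As a rough illustration of how the five policies differ, the sketch below dispatches a mid-run request to each one. Only the policy names come from this record. The `Request` type, the `answer` function, the `"approve"`/`"deny"` reply strings, and the `responder` callable are hypothetical placeholders, not the actual API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, Optional


class InteractionPolicy(Enum):
    AUTO_APPROVE = "auto-approve"
    AUTO_DENY = "auto-deny"
    SCRIPTED = "scripted"
    LLM_BASED = "llm-based"
    MANUAL = "manual"


@dataclass(frozen=True)
class Request:
    kind: str  # hypothetical values: "tool_permission" or "clarification"
    text: str


def answer(
    policy: InteractionPolicy,
    request: Request,
    script: Optional[Dict[str, str]] = None,
    responder: Optional[Callable[[Request], str]] = None,
) -> str:
    """Return a reply to one mid-run request under the given policy."""
    if policy is InteractionPolicy.AUTO_APPROVE:
        return "approve"  # placeholder reply string (assumption)
    if policy is InteractionPolicy.AUTO_DENY:
        return "deny"  # placeholder reply string (assumption)
    if policy is InteractionPolicy.SCRIPTED:
        # A fixed script only covers questions anticipated in advance.
        if script is None or request.text not in script:
            raise LookupError(f"no scripted answer for: {request.text!r}")
        return script[request.text]
    if policy is InteractionPolicy.LLM_BASED:
        if responder is None:
            raise ValueError("llm-based policy needs a responder callable")
        # The responder stands in for the Responder LLM call.
        return responder(request)
    # MANUAL: a human answers, which cannot be reproduced in code.
    raise NotImplementedError("manual policy requires a human operator")
```

The `LookupError` branch shows the gap that motivates the llm-based policy: a scripted policy has nothing to say once the agent asks a question nobody wrote down.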

To keep cross-model comparisons honest, the Responder, like the Judge used for model-as-judge scoring, is a control variable held constant across the entire matrix: same Responder model, same guidelines, temperature 0. The thing being benchmarked is the agent, not the (agent + Responder) pair.
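One way to enforce "held constant" is to make the Responder configuration immutable and check that every cell of the matrix carries the same one. The sketch below covers the matrix for a single Case, since guidelines are per-Case. `ResponderConfig`, `RunSpec`, `build_matrix`, `assert_responder_fixed`, and the fingerprint scheme are all hypothetical names chosen for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class ResponderConfig:
    model: str
    guidelines: str
    temperature: float = 0.0  # fixed at 0 per this decision

    def fingerprint(self) -> str:
        """Short stable hash of the full configuration."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


@dataclass(frozen=True)
class RunSpec:
    agent_model: str  # the variable under test
    responder: ResponderConfig  # the control variable


def build_matrix(agent_models: List[str], responder: ResponderConfig) -> List[RunSpec]:
    # Every agent model is paired with the same Responder configuration.
    return [RunSpec(agent_model=m, responder=responder) for m in agent_models]


def assert_responder_fixed(matrix: List[RunSpec]) -> None:
    """Raise if more than one Responder configuration appears in the matrix."""
    fingerprints = {spec.responder.fingerprint() for spec in matrix}
    if len(fingerprints) > 1:
        raise ValueError(f"Responder varies across matrix: {sorted(fingerprints)}")
```

Hashing the whole configuration, rather than comparing model names alone, also catches a silent edit to the guidelines or the temperature.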

Status

accepted

Considered options

  • Deterministic policies only (auto-approve/auto-deny/scripted). Fully reproducible, but cannot answer open-ended questions an agent invents at runtime, so interactive Cases would stall or behave unrealistically.
  • Responder allowed (chosen). Trades strict determinism for coverage of the unpredictable question space, recovering interpretability by fixing the Responder and recording every request/response in the Trace.

Consequences

  • Two compared models may receive different interactions under llm-based, because each agent asks its own questions and the Responder answers whatever it is asked. This is expected and inherent, not a bug. Results are tagged "responder-mediated."
  • manual is non-reproducible and is excluded from leaderboard aggregation; llm-based is included but flagged.
  • The Responder's identity and guidelines are recorded per Run so a result is always interpretable.
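The consequences above could translate into an aggregation step along these lines. This is a minimal sketch under stated assumptions: `RunResult`, `tags_for`, `aggregate`, and the use of a mean score are hypothetical. Only the policy names, the exclusion of manual, and the "responder-mediated" tag come from this record.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class RunResult:
    agent_model: str
    policy: str  # one of the five policy names
    score: float
    responder_model: Optional[str] = None  # recorded per Run
    responder_guidelines: Optional[str] = None  # recorded per Run


def tags_for(result: RunResult) -> Tuple[str, ...]:
    # llm-based results are included on the leaderboard but flagged.
    return ("responder-mediated",) if result.policy == "llm-based" else ()


def aggregate(results: List[RunResult]) -> Dict[str, dict]:
    """Aggregate per-agent scores, skipping manual Runs."""
    board: Dict[str, dict] = {}
    for r in results:
        if r.policy == "manual":
            continue  # non-reproducible, excluded from aggregation
        if r.policy == "llm-based" and not (r.responder_model and r.responder_guidelines):
            raise ValueError("llm-based Run is missing its Responder identity or guidelines")
        entry = board.setdefault(r.agent_model, {"scores": [], "tags": set()})
        entry["scores"].append(r.score)
        entry["tags"].update(tags_for(r))
    return {
        model: {
            "mean_score": sum(e["scores"]) / len(e["scores"]),  # mean is an assumption
            "tags": sorted(e["tags"]),
        }
        for model, e in board.items()
    }
```

Rejecting an llm-based Run that lacks its Responder identity keeps uninterpretable results off the leaderboard rather than averaging them in silently.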