Skip to content

Graders

A Grader turns a Harness's result into a Score. A Case can declare several; each carries a weight, and the cell passes iff the combined weighted (non-skipped) score ≥ expect.pass_threshold. Two cross-cutting flags:

  • weight — relative contribution to the combined score (default 1.0).
  • gate: true — a cap-only grader: it never adds credit, it only disqualifies the cell (→ 0) when it doesn't fully pass. Use for "must implement something" or "must not break existing tests" checks.

A grader that can't apply (e.g. a trace grader with no Trace) is marked skipped and excluded from the weighted total, rather than failing.

At a glance

type Grades Reads Needs
command a shell command's exit code sandbox
files expected files / regex patterns present sandbox
tests / pytest fraction of injected tests that pass sandbox hidden tests
swebench SWE-bench FAIL→PASS + PASS→PASS gate sandbox hidden tests
implemented did the agent write anything real sandbox — (use as gate)
model_judge open-ended quality via an LLM judge output [judge] extra
trace tool usage / budgets / permissions Trace a Tracing harness
efficiency how cheaply the result was reached Trace a Tracing harness

command

Run a command in the produced sandbox; pass iff it exits as expected.

- {type: command, cmd: "pytest -q", weight: 1.0, expect_exit: 0, timeout_s: 300}

files

Two independent checks (use either or both). Score = fraction of checks that pass.

- type: files
  weight: 1.0
  expect_dir: ./graders/expected     # each file here must exist & match byte-for-byte
  patterns: ["retry", "backoff"]     # each regex must appear somewhere in the sandbox

tests / pytest

Partial-credit test grader — value = passed / (passed + failed + errors). Hidden test files are injected at grade time, so they never sit in the sandbox the model saw. pytest is a Python alias of the language-agnostic tests grader.

- {type: pytest, weight: 4.0, inject: ["./hidden/test_x.py"]}

- type: tests                        # any language
  weight: 4.0
  cmd: "node --test hidden.test.js"
  inject: ["./hidden/x.test.js"]
  # junit_xml: "target/surefire-reports/TEST-*.xml"   # parse JUnit XML for counts

- {type: tests, gate: true, cmd: "...", junit_xml: "..."}   # a regression gate

swebench

One block that grades a real-issue fix the SWE-bench way: a FAIL_TO_PASS set (the bug's tests, scored for credit) plus a PASS_TO_PASS set (existing tests that must stay green, scored as a built-in gate).

- type: swebench
  weight: 4.0
  inject: [{src: hidden/test_x.py, dest: testing/test_x.py}]
  fail_to_pass: ["testing/test_x.py::test_bug"]
  pass_to_pass: ["testing/test_x.py::test_unrelated"]
  # cmd_prefix: "python3 -m pytest -q -p no:cacheprovider"

implemented

A gate that checks the agent actually wrote an implementation (not just a stub or a comment). Use it as gate: true so a non-attempt disqualifies the cell rather than scoring partial.

- {type: implemented, gate: true}                    # scan all *.py (excludes test_*)
- {type: implemented, gate: true, files: ["a.py"]}   # scan specific files

model_judge

Sends the rubric + task prompt + produced output to one or more judge models and asks for a JSON verdict. The judge is a control variable — fixed across the matrix. Auto-skips when ANTHROPIC_API_KEY is unset, so offline runs aren't penalized.

- type: model_judge
  weight: 1.0
  rubric: ./graders/rubric.md
  model: claude-opus-4-8       # a single judge, or a jury:
  # models: [claude-opus-4-8, claude-sonnet-4-6]
  pass_threshold: 0.7

Requires the judge extra: pip install "touchstone-eval[judge]".

trace

Assert over the observed Tracehow the model worked, not just its output. Binary budgets (a cliff at the threshold). Skipped if the harness produced no Trace.

- type: trace
  weight: 1.0
  require_tools: [{kind: write}]     # ≥1 tool_call of each listed Tool Kind
  require_no_denied: true            # no permission request ended in a deny
  # max_tool_calls: 30               # budgets on tool calls / tokens / cost

efficiency

At the frontier, correctness saturates — so spend is the differentiator. This grades each configured metric on a smooth ramp: at/under target → 1.0, then target/actual once over (2× over → 0.5). The cell score is the mean across the configured metrics. Skipped without a Trace.

- type: efficiency
  weight: 2.0
  target_cost_usd: 0.30      # full credit at/under; ramps down above
  target_tokens: 2000
  target_tool_calls: 20
  pass_threshold: 0.8

Pair correctness with efficiency

A common, discriminating combo: a pytest/swebench grader for correctness (most of the weight), an implemented gate so non-attempts score zero, a trace grader to require the right kind of work, and an efficiency grader to separate the models that all pass.