Graders¶
A Grader turns a Harness's result into a Score. A Case can declare several; each carries
a weight, and the cell passes iff the combined weighted (non-skipped) score ≥
expect.pass_threshold. Two cross-cutting flags:
weight— relative contribution to the combined score (default1.0).gate: true— a cap-only grader: it never adds credit, it only disqualifies the cell (→ 0) when it doesn't fully pass. Use for "must implement something" or "must not break existing tests" checks.
A grader that can't apply (e.g. a trace grader with no Trace) is marked skipped and excluded from the weighted total, rather than failing.
At a glance¶
type |
Grades | Reads | Needs |
|---|---|---|---|
command |
a shell command's exit code | sandbox | — |
files |
expected files / regex patterns present | sandbox | — |
tests / pytest |
fraction of injected tests that pass | sandbox | hidden tests |
swebench |
SWE-bench FAIL→PASS + PASS→PASS gate | sandbox | hidden tests |
implemented |
did the agent write anything real | sandbox | — (use as gate) |
model_judge |
open-ended quality via an LLM judge | output | [judge] extra |
trace |
tool usage / budgets / permissions | Trace | a Tracing harness |
efficiency |
how cheaply the result was reached | Trace | a Tracing harness |
command¶
Run a command in the produced sandbox; pass iff it exits as expected.
files¶
Two independent checks (use either or both). Score = fraction of checks that pass.
- type: files
weight: 1.0
expect_dir: ./graders/expected # each file here must exist & match byte-for-byte
patterns: ["retry", "backoff"] # each regex must appear somewhere in the sandbox
tests / pytest¶
Partial-credit test grader — value = passed / (passed + failed + errors). Hidden test files
are injected at grade time, so they never sit in the sandbox the model saw. pytest is a
Python alias of the language-agnostic tests grader.
- {type: pytest, weight: 4.0, inject: ["./hidden/test_x.py"]}
- type: tests # any language
weight: 4.0
cmd: "node --test hidden.test.js"
inject: ["./hidden/x.test.js"]
# junit_xml: "target/surefire-reports/TEST-*.xml" # parse JUnit XML for counts
- {type: tests, gate: true, cmd: "...", junit_xml: "..."} # a regression gate
swebench¶
One block that grades a real-issue fix the SWE-bench way: a FAIL_TO_PASS set (the bug's tests, scored for credit) plus a PASS_TO_PASS set (existing tests that must stay green, scored as a built-in gate).
- type: swebench
weight: 4.0
inject: [{src: hidden/test_x.py, dest: testing/test_x.py}]
fail_to_pass: ["testing/test_x.py::test_bug"]
pass_to_pass: ["testing/test_x.py::test_unrelated"]
# cmd_prefix: "python3 -m pytest -q -p no:cacheprovider"
implemented¶
A gate that checks the agent actually wrote an implementation (not just a stub or a comment).
Use it as gate: true so a non-attempt disqualifies the cell rather than scoring partial.
- {type: implemented, gate: true} # scan all *.py (excludes test_*)
- {type: implemented, gate: true, files: ["a.py"]} # scan specific files
model_judge¶
Sends the rubric + task prompt + produced output to one or more judge models and asks for a
JSON verdict. The judge is a control variable — fixed across the matrix. Auto-skips when
ANTHROPIC_API_KEY is unset, so offline runs aren't penalized.
- type: model_judge
weight: 1.0
rubric: ./graders/rubric.md
model: claude-opus-4-8 # a single judge, or a jury:
# models: [claude-opus-4-8, claude-sonnet-4-6]
pass_threshold: 0.7
Requires the judge extra: pip install "touchstone-eval[judge]".
trace¶
Assert over the observed Trace — how the model worked, not just its output. Binary budgets (a cliff at the threshold). Skipped if the harness produced no Trace.
- type: trace
weight: 1.0
require_tools: [{kind: write}] # ≥1 tool_call of each listed Tool Kind
require_no_denied: true # no permission request ended in a deny
# max_tool_calls: 30 # budgets on tool calls / tokens / cost
efficiency¶
At the frontier, correctness saturates — so spend is the differentiator. This grades each
configured metric on a smooth ramp: at/under target → 1.0, then target/actual once over
(2× over → 0.5). The cell score is the mean across the configured metrics. Skipped without
a Trace.
- type: efficiency
weight: 2.0
target_cost_usd: 0.30 # full credit at/under; ramps down above
target_tokens: 2000
target_tool_calls: 20
pass_threshold: 0.8
Pair correctness with efficiency
A common, discriminating combo: a pytest/swebench grader for correctness (most of the
weight), an implemented gate so non-attempts score zero, a trace grader to require
the right kind of work, and an efficiency grader to separate the models that all pass.