touchstone: benchmark methodology & findings
A log of how the eval cases are built and what they revealed about the model under test (glm-5.1:cloud, a custom Ollama-cloud model driven through Factory droid over ACP).
Methodology
- Hidden tests. Grading tests live in `evals/<case>/hidden/` and are injected into the sandbox only at grade time (`pytest` grader `inject:`). The agent never sees them, so it can't code to the tests. Expected values are generated from a reference / stdlib oracle so they're exact and fair.
- Partial credit. The `pytest` grader scores the fraction of tests passing, so a partially correct solution earns a sensible in-between score (SWE-bench-style).
- Gates (cap-only). Validity checks are `gate: true`: they never add credit, they only disqualify (→ 0). `implemented` (module still raises `NotImplementedError` → not attempted) and `command` import-forbid checks (used the stdlib shortcut → cheated) are gates. Correctness is the score; not-attempting and cheating are failures, not partial credit.
- Consistency. `trials: N` runs each cell N times. `pass@k` = best of k, `pass^k` = all k pass. `pass^k` is the honest reliability signal.
- Real-repo tasks. `source: {repo, commit}` clones a real library at a pinned commit; the `setup` step blanks a target function (`stub`) and runs `rm -rf .git` so the original can't be recovered. The agent reimplements it inside the real codebase; the real function is the oracle.
- Observation. The ACP adapter captures a normalized Trace (tool calls, tokens, permission events). droid emits no usage over ACP, so token usage is recovered from its session settings file. The `trace` grader can assert on tool usage; tool/token counts appear in the report.
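
The partial-credit and gate rules above can be sketched in a few lines. This is a minimal illustration, not touchstone's actual grader: `CheckResult` and `score_cell` are hypothetical names, and returning 0 for a cell with no tests is an assumption.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    """Outcome of one validity check (hypothetical structure)."""
    name: str
    passed: bool
    gate: bool = False  # gate checks can only disqualify, never add credit


def score_cell(tests_passed: int, tests_total: int, checks: list[CheckResult]) -> float:
    """Return the fraction of hidden tests passing, capped to 0 by any failed gate."""
    # A failed gate (not attempted, or cheated) zeroes the cell outright.
    if any(c.gate and not c.passed for c in checks):
        return 0.0
    # Assumption: a cell with no tests scores 0 rather than raising.
    if tests_total == 0:
        return 0.0
    return tests_passed / tests_total
```

For example, `score_cell(9, 10, [CheckResult("implemented", True, gate=True)])` gives 0.9, while the same test counts with a failed `implemented` gate give 0.0. The gate never raises a score; it only caps it.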
Case catalog
- Conventions (string tasks with exact reference-defined behavior): slug, titlecase, number→words, pluralize, humanize-bytes, roman.
- Hard algorithms (from scratch, stdlib import forbidden): regex engine, unified diff, topological sort, JSON parser, glob matcher, CSV parser.
- Real-repo reimplements (clone @ commit, stub, fix in place): inflection.parameterize, toolz.merge_with, more_itertools.collapse / chunked_even / windowed / split_into, funcy.chunks, boltons.iterutils.bucketize.
- Real-repo reimplements with dependencies (per-cell venv; see ADR 0004): python-slugify `smart_truncate` (a `requirements` dep, tiny repo) and werkzeug `secure_filename` (install: editable, src-layout, large repo). These exercise the broader Sandbox: the hidden tests can't even import the package without the provisioned environment.
- Non-Python real-repo reimplements: word-wrap (zero-dep CommonJS, graded over `node --test`) and commons-text `CaseUtils.toCamelCase` (Java/Maven, graded over Surefire, with a real `commons-lang3` dependency Maven resolves). Shows the framework isn't tied to Python (only `setup.stub`, the venv `environment`, and the `pytest` default are) and that dependency-bearing non-Python projects work (deps live in `~/.m2`, not `site-packages`).
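
For the Python real-repo cases, the "stub" step amounts to replacing a target function's body with a `raise NotImplementedError`. The sketch below shows one way to do that with the standard `ast` module. It is an assumption about the approach, not touchstone's `setup.stub` implementation, and `stub_function` is a hypothetical helper. Note that `ast.unparse` (Python 3.9+) drops comments and original formatting, which a real setup step may want to preserve.

```python
import ast


def stub_function(source: str, name: str) -> str:
    """Return `source` with the body of function `name` replaced by a raise.

    Hypothetical helper: blanks the first function (or method) whose name
    matches, keeping its signature so callers and imports still resolve.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == name:
            # Replace the whole body, docstring included, with a single raise.
            node.body = [
                ast.Raise(exc=ast.Name(id="NotImplementedError", ctx=ast.Load()), cause=None)
            ]
            ast.fix_missing_locations(tree)
            return ast.unparse(tree)
    raise ValueError(f"function {name!r} not found")
```

Calling `stub_function("def add(a, b):\n    return a + b\n", "add")` yields a module where `add` keeps its signature but raises `NotImplementedError`, which is the state the `implemented` gate treats as "not attempted".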
Findings (glm-5.1 via droid)
- Capability is high. When it writes code, it solves hard algorithms correctly: a from-scratch regex engine, JSON/CSV parsers, glob, toposort, and tricky `more_itertools` edge logic (even-remainder `chunked_even`, `windowed` fill/step, `split_into` `None`), all at 100%. Algorithmic complexity is not the discriminator.
- Subtle conventions/edges cost points. It loses a little where it must infer an unstated convention or a corner case: slug `&` → `and`, titlecase AP minor-words, `parameterize('--edges--')` (strip leading/trailing separators). → ~90%.
- Reliability is the real weakness. On output-heavy tasks it is highly stochastic: it often reads, reasons for tens of thousands of tokens, and ends the turn without writing any code. With `trials: 3`:
  - regex: solved 100% in one earlier single shot, then 0/3; `pass^3 = false`, mean 0%.
  - diff: 1/3; `pass@3 = true` but `pass^3 = false`, mean 33%.
  - markdown: 0/3 (incl. one droid internal error).

  Single-shot scores flatter the model; `pass^k` tells the truth. The `implemented` gate correctly scores these non-attempts 0 (not the ~28% they got before gating).
- Harness flakiness is real and isolated. droid occasionally raises an internal ACP error mid-prompt; the adapter records it and fails just that cell without aborting the run.
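
To make the `pass@k` / `pass^k` distinction concrete, here is a minimal sketch that computes both from per-trial scores. The function names are hypothetical, and treating a trial as "passed" only at a score of 1.0 is an assumption about the threshold. The sample list is one set of trial scores consistent with the diff row above (1 of 3 solved), not recorded output.

```python
def pass_at_k(trial_scores: list[float], threshold: float = 1.0) -> bool:
    """pass@k: at least one of the k trials passed (best of k)."""
    return any(score >= threshold for score in trial_scores)


def pass_hat_k(trial_scores: list[float], threshold: float = 1.0) -> bool:
    """pass^k: every one of the k trials passed (requires at least one trial)."""
    return bool(trial_scores) and all(score >= threshold for score in trial_scores)


def mean_score(trial_scores: list[float]) -> float:
    """Average score across trials, counting gated non-attempts as 0."""
    return sum(trial_scores) / len(trial_scores)


# Illustrative scores matching "1/3 solved": one full pass, two zeros.
diff_trials = [1.0, 0.0, 0.0]
print(pass_at_k(diff_trials), pass_hat_k(diff_trials), round(mean_score(diff_trials), 2))
```

With these scores `pass@3` is true while `pass^3` is false and the mean is about 0.33, which is why a single lucky shot can look like a solved case while the reliability signal says otherwise.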
Takeaway
For this model, the benchmark should be read through pass^k across trials, not a single
shot. The discriminators are (a) subtle conventions/edges and (b) reliability of producing
code on large tasks — not raw algorithmic difficulty.