touchstone — benchmark methodology & findings

A log of how the eval cases are built and what they revealed about the model under test (glm-5.1:cloud, a custom Ollama-cloud model driven through Factory droid over ACP).

Methodology

  • Hidden tests. Grading tests live in evals/<case>/hidden/ and are injected into the sandbox only at grade time (pytest grader inject:). The agent never sees them, so it can't code-to-the-tests. Expected values are generated from a reference / stdlib oracle so they're exact and fair.
  • Partial credit. The pytest grader scores the fraction of tests passing — a partially-correct solution earns a sensible in-between score (SWE-bench-style).
  • Gates (cap-only). Validity checks are marked gate: true: they never add credit, they only disqualify (→ 0). Two checks are gates: implemented (the module still raises NotImplementedError → not attempted) and the command import-forbid checks (the agent used the stdlib shortcut → cheated). Correctness is the score; not attempting and cheating are failures, not partial credit.
  • Consistency. trials: N runs each cell N times. pass@k is true if at least one of the k trials passes (best of k); pass^k is true only if all k pass. pass^k is the honest reliability signal.
  • Real-repo tasks. source: {repo, commit} clones a real library at a pinned commit; the setup step blanks a target function (stub) and runs rm -rf .git so the original can't be recovered. The agent reimplements the function inside the real codebase; the real function is the oracle.
  • Observation. The ACP adapter captures a normalized Trace (tool calls, tokens, permission events). droid emits no usage over ACP, so token usage is recovered from its session settings file. The trace grader can assert on tool usage; tool/token counts appear in the report.
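
The partial-credit and gate rules above can be sketched in a few lines. This is only an illustration of the scoring rule as described, not touchstone's actual grader code; the names GateResult and score_cell are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    """Outcome of one validity check (hypothetical type, for illustration)."""
    name: str
    passed: bool


def score_cell(tests_passed: int, tests_total: int, gates: list[GateResult]) -> float:
    """Return the fraction of hidden tests passing, capped to 0 by any failed gate."""
    # Gates never add credit: a single failed gate disqualifies the cell.
    if any(not gate.passed for gate in gates):
        return 0.0
    # Assumption: a cell with no hidden tests scores 0 rather than raising.
    if tests_total == 0:
        return 0.0
    return tests_passed / tests_total
```

Under this sketch, a solution that passes 9 of 10 hidden tests with all gates green scores 0.9, while the same test result with a failed import-forbid gate scores 0.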

Case catalog

  • Conventions (string tasks with exact reference-defined behavior): slug, titlecase, number→words, pluralize, humanize-bytes, roman.
  • Hard algorithms (from scratch, stdlib import forbidden): regex engine, unified diff, topological sort, JSON parser, glob matcher, CSV parser.
  • Real-repo reimplements (clone @ commit, stub, fix in place): inflection.parameterize, toolz.merge_with, more_itertools.collapse / chunked_even / windowed / split_into, funcy.chunks, boltons.iterutils.bucketize.
  • Real-repo reimplements with dependencies (per-cell venv; see ADR 0004): python-slugify smart_truncate (a requirements dep — tiny repo) and werkzeug secure_filename (install: editable, src-layout — large repo). These exercise the broader Sandbox: the hidden tests can't even import the package without the provisioned environment.
  • Non-Python real-repo reimplements: word-wrap (zero-dep CommonJS, graded over node --test) and commons-text CaseUtils.toCamelCase (Java/Maven, graded over Surefire, with a real commons-lang3 dependency Maven resolves). Shows the framework isn't tied to Python — only setup.stub, the venv environment, and the pytest default are — and that dependency-bearing non-Python projects work (deps live in ~/.m2, not site-packages).
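
For the "stdlib import forbidden" cases in the catalog above, the gate is described as a command check. One way such a check could be written is a static scan of the submitted module's imports. The sketch below is an assumption about how that might look, not the check touchstone ships; forbidden_imports is a hypothetical helper, and it does not catch dynamic imports such as __import__ or importlib.

```python
import ast


def forbidden_imports(source: str, forbidden: set[str]) -> set[str]:
    """Return the forbidden top-level modules that `source` imports statically."""
    found: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # `import a.b, c` yields one alias per imported module.
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # `from . import x` has module=None; treat it as no module name.
            names = [node.module or ""]
        else:
            continue
        for name in names:
            top_level = name.split(".")[0]
            if top_level in forbidden:
                found.add(top_level)
    return found
```

A gate built on this would fail the cell whenever the returned set is non-empty, for example when a from-scratch regex engine imports re.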

Findings (glm-5.1 via droid)

  1. Capability is high. When it writes code, it solves hard algorithms correctly — a from-scratch regex engine, JSON/CSV parsers, glob, toposort, and tricky more_itertools edge logic (even-remainder chunked_even, windowed fill/step, split_into None) — all 100%. Algorithmic complexity is not the discriminator.

  2. Subtle conventions/edges cost points. It loses a little where it must infer an unstated convention or a corner case: slug &and, titlecase AP minor-words, parameterize('--edges--') (strip leading/trailing separators). → ~90%.

  3. Reliability is the real weakness. On output-heavy tasks it is highly stochastic — it often reads, reasons for tens of thousands of tokens, and ends the turn without writing any code. With trials: 3:

     • regex: solved 100% in one earlier single shot, then 0/3 (pass^3 = false), mean 0%.
     • diff: 1/3 (pass@3 = true but pass^3 = false), mean 33%.
     • markdown: 0/3 (incl. one droid internal error).

Single-shot scores flatter the model; pass^k tells the truth. The implemented gate correctly scores these non-attempts 0 (not the ~28% they got before gating).

  4. Harness flakiness is real and isolated. droid occasionally raises an internal ACP error mid-prompt; the adapter records it and fails just that cell without aborting the run.

Takeaway

For this model, the benchmark should be read through pass^k across trials, not a single shot. The discriminators are (a) subtle conventions/edges and (b) reliability of producing code on large tasks — not raw algorithmic difficulty.
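
To make the three trial metrics concrete, here is a minimal sketch of how per-trial scores roll up into pass@k, pass^k, and the mean. It assumes a trial counts as a pass only at a score of 1.0 (the log does not state the threshold), and summarize_trials is a hypothetical name rather than part of touchstone.

```python
def summarize_trials(scores: list[float], pass_threshold: float = 1.0) -> dict:
    """Summarize k trial scores for one cell as pass@k, pass^k, and mean."""
    if not scores:
        raise ValueError("need at least one trial score")
    # Assumption: a trial passes only when it reaches the threshold (default 1.0).
    passed = [score >= pass_threshold for score in scores]
    return {
        "pass_at_k": any(passed),   # best of k: at least one trial passed
        "pass_hat_k": all(passed),  # reliability: every trial passed
        "mean": sum(scores) / len(scores),
    }
```

With three trials scored 1.0, 0.0, 0.0 (the shape of the diff result), this gives pass@3 = true, pass^3 = false, and a mean of about 33%.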

Running

touchstone validate
touchstone run --eval <case> --workers N        # trials come from each case's matrix
touchstone run --harness droid --with-model A --with-model B   # compare models, same harness
touchstone report <run_id>                       # per-case matrix + leaderboard
touchstone export <run_id>                       # LangFuse JSON