The results UI is a static, self-contained HTML bundle — not a server
The aggregate report.md answers "which model scored best," but it cannot show how an agent
reached an outcome: what it read, what it edited, what each step cost, why a grader scored it.
For an agentic-coding benchmark that trajectory is the signal. This ADR decides the shape of
the results UI that exposes it.
Status
accepted
Context
Every Cell already persists a normalized Trace (trace.jsonl: messages, tool calls/results,
usage, permission events, nested by parent_id) plus a result.json of scores and metrics. The
data needed to render a step-through transcript exists; only the rendering was missing.
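As a rough illustration of what "nested by parent_id" gives a renderer, the sketch below turns JSONL text into a forest of events. Only parent_id comes from this ADR; the "id" and "type" keys and the function name nest_by_parent are assumptions made for the example, not the actual Trace schema.

```python
import json
from collections import defaultdict


def nest_by_parent(jsonl_text):
    """Build a forest of trace events from JSONL text, linked by parent_id.

    Assumes every line is a JSON object with a unique "id" and an optional
    "parent_id" (hypothetical field layout, see the note above).
    """
    events = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    known_ids = {event["id"] for event in events}

    children = defaultdict(list)
    for event in events:
        parent = event.get("parent_id")
        # Events with a missing or unknown parent are treated as roots.
        key = parent if parent in known_ids else None
        children[key].append(event)

    def build(event):
        # File order is preserved, so children stay in capture order.
        kids = children.get(event["id"], [])
        return {**event, "children": [build(kid) for kid in kids]}

    return [build(root) for root in children.get(None, [])]


# Made-up three-event trace: a message, its tool call, and the tool result.
SAMPLE = "\n".join([
    json.dumps({"id": "m1", "parent_id": None, "type": "message"}),
    json.dumps({"id": "t1", "parent_id": "m1", "type": "tool_call"}),
    json.dumps({"id": "r1", "parent_id": "t1", "type": "tool_result"}),
])
forest = nest_by_parent(SAMPLE)
```

A transcript view can then walk the forest depth-first and emit one card per node, indenting by depth.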
The field offers two models. The SaaS tools (Braintrust, Langfuse, Arize Phoenix, W&B Weave)
are servers with a database and a hosted UI — rich, but operationally heavy and wrong for a
solo-maintained, local-first, often-private benchmark. Inspect AI's inspect view bundle is the
other model: emit a self-contained static site you open from file:// or push to GitHub
Pages, no server, no database.
A server/DB would also duplicate what touchstone already has — the runs/ tree is the
database, and result.json/trace.jsonl are the records.
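To make "the runs/ tree is the database" concrete, here is a minimal sketch of reading records straight from disk. The directory names (run-1, cell-a), the "score" key, and the helper load_results are hypothetical; the ADR only states that result.json files live somewhere under runs/.

```python
import json
import tempfile
from pathlib import Path


def load_results(runs_root):
    """Yield (cell_dir, result) for every result.json found under runs_root."""
    for path in sorted(Path(runs_root).rglob("result.json")):
        yield path.parent, json.loads(path.read_text(encoding="utf-8"))


# Usage against a throwaway tree; the layout below is invented for the example.
with tempfile.TemporaryDirectory() as tmp:
    cell = Path(tmp) / "run-1" / "cell-a"
    cell.mkdir(parents=True)
    (cell / "result.json").write_text(json.dumps({"score": 1.0}), encoding="utf-8")
    loaded = [(cell_dir.name, result) for cell_dir, result in load_results(tmp)]
```

No index or query layer is involved: a directory walk plus json.loads is the whole read path, which is the point the paragraph above makes.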
Decision
- Static, self-contained bundle. report --format html renders one report.html with CSS/JS inlined and data embedded as <script type="application/json"> islands. No server, no database, no network, no build step, no CDN; it opens offline and is shareable as a file. All interactivity (sort, filter, drill-in) is vanilla JS over the embedded model.
- One compute model, two renderers. A pure ReportModel (report_model.py) is factored out of the markdown renderer; markdown and HTML both render from it, so they can never disagree on a number. Markdown stays the default and stays byte-identical.
- The transcript viewer is the headline. Clicking a matrix/grid cell opens its Trace as ordered cards nested by parent_id: tool-call cards with a Tool Kind badge, collapsible input→output, per-step duration (ts deltas) and token/cost (from surrounding usage), permission events inline, and the cell's grader explanations beside the trajectory.
- Bounded size. Embedded payloads are capped (on top of capture-time truncation) so a long run can't produce a runaway file.
- Output-only cells degrade, showing scores/metrics with an explicit "no trace" note.
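The first two decisions can be sketched together: one model object, a markdown renderer, and an HTML renderer that embeds the same model as a JSON island. This is a minimal sketch under stated assumptions, not the contents of report_model.py; the scores field, the function names, and the report-model element id are all made up for illustration.

```python
import html
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ReportModel:
    """Hypothetical stand-in for the real model: one score per (model, task)."""
    scores: dict  # e.g. {"model-a": {"task-1": 0.75}}


def render_markdown(model):
    """Render the model as a markdown table."""
    lines = ["| model | task | score |", "| --- | --- | --- |"]
    for name, tasks in sorted(model.scores.items()):
        for task, score in sorted(tasks.items()):
            lines.append(f"| {name} | {task} | {score:.2f} |")
    return "\n".join(lines)


def json_island(element_id, payload):
    """Serialize payload into an inert <script type="application/json"> element."""
    text = json.dumps(payload, sort_keys=True)
    # Escape "<" so an embedded string such as "</script>" cannot close the
    # element early; \u003c is still valid JSON and parses back to "<".
    text = text.replace("<", "\\u003c")
    safe_id = html.escape(element_id)
    return f'<script type="application/json" id="{safe_id}">{text}</script>'


def render_html(model):
    """Render a single-file page that carries the same model as embedded data."""
    island = json_island("report-model", asdict(model))
    return f'<!doctype html>\n<html><body>\n<main id="app"></main>\n{island}\n</body></html>'


# Both outputs come from the same object, so the numbers cannot drift apart.
model = ReportModel(scores={"model-a": {"</script>": 0.5}})
markdown_report = render_markdown(model)
html_report = render_html(model)
```

On the page, client-side code would typically read the island with JSON.parse on the element's textContent. Because the browser never executes a script of type application/json, the data stays inert until the page's own code parses it.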
Consequences
- The single most valuable thing markdown can't do — inspecting one agent's trajectory — is available with zero infrastructure, and a run is shareable as one file.
- ReportModel becomes the reuse point for the cross-run diff view (ADR 0011) and any future renderer, and it is the "aggregate/stats split" long flagged as a refactor.
- Power users who want a server keep a zero-code path: touchstone already exports Langfuse JSON (export), so "point at a local Langfuse/Phoenix" stands alongside the static bundle.
- Explicitly out of scope (and not to be added without revisiting this ADR): a served/long-running UI, a database, cross-history search, agent control-flow graphs. The static bundle is the ceiling for a solo project.