The results UI is a static, self-contained HTML bundle — not a server

The aggregate report.md answers "which model scored best," but it cannot show how an agent reached an outcome: what it read, what it edited, what each step cost, and why a grader scored it the way it did. For an agentic-coding benchmark, that trajectory is the signal. This ADR decides the shape of the results UI that exposes it.

Status

accepted

Context

Every Cell already persists a normalized Trace (trace.jsonl: messages, tool calls/results, usage, permission events, nested by parent_id) plus a result.json of scores and metrics. The data needed to render a step-through transcript exists; only the rendering was missing.
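As a rough illustration of why the persisted data is sufficient, the sketch below rebuilds a nested step tree from flat trace.jsonl lines. The field names `id` and `type` are assumptions made for the example; only `parent_id` is named in this ADR, and the real Trace schema may differ.

```python
import json
from collections import defaultdict


def nest_events(lines):
    """Turn flat JSONL trace events into a forest nested by parent_id."""
    events = [json.loads(line) for line in lines if line.strip()]

    # Index events by their parent; roots have parent_id null (None).
    children = defaultdict(list)
    for event in events:
        children[event.get("parent_id")].append(event)

    def build(event):
        # "id" is a hypothetical key name for the event identifier.
        kids = [build(child) for child in children[event["id"]]]
        return {"event": event, "children": kids}

    return [build(root) for root in children[None]]


# Hypothetical trace lines, in file order.
sample = [
    '{"id": "1", "parent_id": null, "type": "message"}',
    '{"id": "2", "parent_id": "1", "type": "tool_call"}',
    '{"id": "3", "parent_id": "2", "type": "tool_result"}',
]
forest = nest_events(sample)
```

File order is preserved within each parent, so a renderer can walk the forest depth-first to produce ordered, nested cards.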

The field offers two models. The SaaS tools (Braintrust, Langfuse, Arize Phoenix, W&B Weave) are servers with a database and a hosted UI — rich, but operationally heavy and wrong for a solo-maintained, local-first, often-private benchmark. Inspect AI's inspect view bundle is the other model: emit a self-contained static site you open from file:// or push to GitHub Pages, no server, no database.

A server/DB would also duplicate what touchstone already has — the runs/ tree is the database, and result.json/trace.jsonl are the records.

Decision

  • Static, self-contained bundle. report --format html renders one report.html with CSS/JS inlined and data embedded as <script type="application/json"> islands. No server, no database, no network, no build step, no CDN — it opens offline and is shareable as a file. All interactivity (sort, filter, drill-in) is vanilla JS over the embedded model.
  • One compute model, two renderers. A pure ReportModel (report_model.py) is factored out of the markdown renderer; markdown and HTML both render from it, so they can never disagree on a number. Markdown stays the default and stays byte-identical.
  • The transcript viewer is the headline. Clicking a matrix/grid cell opens its Trace as ordered cards nested by parent_id: tool-call cards with a Tool Kind badge, collapsible input→output, per-step duration (ts deltas) and token/cost (from surrounding usage), permission events inline, and the cell's grader explanations beside the trajectory.
  • Bounded size. Embedded payloads are capped (on top of capture-time truncation) so a long run can't produce a runaway file.
  • Output-only cells degrade gracefully: they show scores/metrics with an explicit "no trace" note.
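The embedded-data and bounded-size decisions above can be sketched as follows. This is a minimal illustration, not touchstone's actual code: `cap_strings`, `json_island`, and the `MAX_FIELD_CHARS` limit are hypothetical names and values. It also shows one detail worth getting right with JSON islands: a literal `</script>` inside the data must not be able to end the element early.

```python
import json

MAX_FIELD_CHARS = 2000  # hypothetical per-string cap, applied at render time


def cap_strings(value, limit=MAX_FIELD_CHARS):
    """Recursively truncate long strings so the payload stays bounded."""
    if isinstance(value, str) and len(value) > limit:
        return value[:limit] + f"... [truncated {len(value) - limit} chars]"
    if isinstance(value, list):
        return [cap_strings(item, limit) for item in value]
    if isinstance(value, dict):
        return {key: cap_strings(item, limit) for key, item in value.items()}
    return value


def json_island(element_id, payload):
    """Serialize payload into an inert <script type="application/json"> element."""
    body = json.dumps(cap_strings(payload), ensure_ascii=False)
    # In serialized JSON, "<" can only occur inside strings, so replacing it
    # with its \u003c escape keeps the JSON valid while preventing
    # "</script>" or "<!--" in the data from breaking the HTML.
    body = body.replace("<", "\\u003c")
    return f'<script type="application/json" id="{element_id}">{body}</script>'


island = json_island("cell-1", {"output": "</script><b>"})
```

On the page, vanilla JS can read the island back with `JSON.parse(document.getElementById("cell-1").textContent)`, which requires no network access and works from file://.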

Consequences

  • The single most valuable thing markdown can't do — inspecting one agent's trajectory — is available with zero infrastructure, and a run is shareable as one file.
  • ReportModel becomes the reuse point for the cross-run diff view (ADR 0011) and any future renderer, and it delivers the "aggregate/stats split" long flagged as a refactor.
  • Power users who want a server keep a zero-code path: touchstone already exports Langfuse JSON (export), so "point at a local Langfuse/Phoenix" stands alongside the static bundle.
  • Explicitly out of scope (and not to be added without revisiting this ADR): a served/long-running UI, a database, cross-history search, agent control-flow graphs. The static bundle is the ceiling for a solo project.
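To make the "one compute model, two renderers" idea and the ReportModel reuse point concrete, here is a minimal sketch. The fields and function names (`CellScore`, `mean_by_model`, `render_markdown`, `render_html`) are assumptions for illustration; the real report_model.py is not shown in this ADR and likely carries far more.

```python
from dataclasses import dataclass
from html import escape


@dataclass(frozen=True)
class CellScore:
    # Hypothetical minimal record for one matrix cell.
    model: str
    task: str
    score: float


@dataclass(frozen=True)
class ReportModel:
    cells: tuple[CellScore, ...]

    def mean_by_model(self) -> dict[str, float]:
        """All aggregation lives here, never in a renderer."""
        scores: dict[str, list[float]] = {}
        for cell in self.cells:
            scores.setdefault(cell.model, []).append(cell.score)
        return {name: sum(vals) / len(vals) for name, vals in scores.items()}


def render_markdown(model: ReportModel) -> str:
    lines = ["| model | mean score |", "| --- | --- |"]
    for name, mean in sorted(model.mean_by_model().items()):
        lines.append(f"| {name} | {mean:.3f} |")
    return "\n".join(lines)


def render_html(model: ReportModel) -> str:
    rows = "".join(
        f"<tr><td>{escape(name)}</td><td>{mean:.3f}</td></tr>"
        for name, mean in sorted(model.mean_by_model().items())
    )
    return f"<table>{rows}</table>"


report = ReportModel((
    CellScore("a", "t1", 1.0),
    CellScore("a", "t2", 0.0),
    CellScore("b", "t1", 0.5),
))
```

Because both renderers only format values computed by the model, a number shown in markdown and in HTML comes from the same calculation, and a later renderer such as a diff view could reuse it unchanged.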