
Borrow Harbor's sandboxes behind touchstone's own Executor seam

Harbor is a framework for running agentic tasks in sandboxes, with backends for docker, Daytona, Modal, E2B, Runloop, GKE, and others. touchstone gets those remote runtimes (the enterprise "each Cell in its own pod/workspace" requirement) by wrapping Harbor's BaseEnvironment behind its own Executor interface (ADR 0014), rather than adopting Harbor's task/agent/verifier framework. The Executor is the seam; Harbor is just one more place a Cell's commands run. This also retires the kubernetes/daytona placeholder backends ADR 0014 reserved.

Status

accepted (extends ADR 0014; supersedes its reserved kubernetes/daytona backend values)

Context

ADR 0014 made the Executor the single place a Cell's work runs (provisioning, grading, and the Harness) and reserved container.backend: kubernetes|daytona as "future drop-ins." Writing each remote backend (a Kubernetes pod client, a Daytona client, a Modal client, …) is a lot of undifferentiated transport code — the exact problem Harbor already solved: its BaseEnvironment unifies docker + a fleet of cloud sandbox providers behind one async interface (start / stop / exec / upload_* / download_* / is_file / is_dir).

The decomposition Harbor is built on — task / harness(agent) / sandbox, mixed freely — is the same one touchstone arrived at (Case / Harness / Executor), and Harbor's external vs installed agent split maps one-to-one onto ADR 0014's host-harness vs in-sandbox-harness boundary. So Harbor is the right thing to borrow at the sandbox layer. But adopting Harbor as the framework would discard touchstone's differentiators that live above the sandbox: the Trace contract, the model-comparison matrix, Interaction policies, judge credibility, regression tracking. The goal is therefore: borrow Harbor's backends, keep touchstone's seam and semantics.

Two mismatches stand between the two:

  • Harbor's BaseEnvironment is async; the Executor is sync.
  • A Harbor sandbox is a remote pod/VM whose filesystem the host does not share — unlike the docker ContainerExecutor, which gets a same-path bind mount. ADR 0014's filesystem surface was built for exactly this, but the runner still assumed the prepared Cell was already visible to the backend (true for a bind mount, false for a remote pod).

Decision

1. HarborExecutor(Executor) wraps a Harbor BaseEnvironment:

  • async ↔ sync via one dedicated event loop per executor (Harbor's own usage pattern — "each env drives start/exec/verify synchronously on its own loop"). All calls to an environment go through that loop.
  • run maps argv → a single shell command (quoting each token, since Harbor's exec takes a string), then exec(command, cwd, env, timeout_sec) → ExecResult.
  • The filesystem surface maps onto Harbor: read_text via exec cat, write_text via upload_file from a temp, is_file/is_dir native, list_dir via exec ls -1Ap, copy_into_sandbox via upload_file/upload_dir, and resolve_in_sandbox lexically (the host has no tree to .resolve() against — confinement is enforced in the sandbox's namespace, the very reason ADR 0014 put resolution behind the seam).
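
The first two bullets can be sketched as follows. This is a minimal illustration under stated assumptions, not touchstone's or Harbor's actual code: FakeEnvironment, SyncBridge, and the ExecResult fields are hypothetical stand-ins, and running the dedicated loop on a background thread is one possible way to own "one loop per executor". Only the exec(command, cwd, env, timeout_sec) call shape comes from the text above.

```python
import asyncio
import shlex
import threading
from dataclasses import dataclass


@dataclass
class ExecResult:
    # Hypothetical result shape; the real ExecResult fields may differ.
    return_code: int
    stdout: str


class FakeEnvironment:
    """Stand-in for an async sandbox environment (not Harbor's real class)."""

    async def exec(self, command, cwd=None, env=None, timeout_sec=None):
        await asyncio.sleep(0)  # yield once, as real async I/O would
        return ExecResult(return_code=0, stdout=command)  # echo the command back


class SyncBridge:
    """Owns one event loop on a background thread and runs coroutines on it."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def call(self, coro, timeout=None):
        # Schedule on the dedicated loop and block the calling thread for the result.
        return asyncio.run_coroutine_threadsafe(coro, self._loop).result(timeout)

    def close(self):
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


def run(bridge, environment, argv, cwd=None, env=None, timeout_sec=None):
    # exec takes a single string, so every argv token is shell-quoted first.
    command = shlex.join(argv)
    return bridge.call(
        environment.exec(command, cwd=cwd, env=env, timeout_sec=timeout_sec)
    )


bridge = SyncBridge()
result = run(bridge, FakeEnvironment(), ["echo", "hello world", "$HOME"])
bridge.close()
print(result.stdout)  # echo 'hello world' '$HOME'
```

Quoting each token keeps spaces and shell metacharacters literal inside the sandbox, and routing every call through the bridge means synchronous call sites never see a coroutine.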

2. Non-shared-FS staging in the Executor base. The Executor gains shares_host_filesystem() (True by default) and stage_in/stage_out (no-ops by default). HarborExecutor returns False and uploads the prepared Cell to its same absolute path before the run (the bind-mount trick, over upload_dir) and downloads the agent's edits back before host-side grading. The runner calls stage_in/stage_out unconditionally; shared-FS backends ignore them, so docker/host are byte-identical. Setup stubbing now reads/writes through the Executor FS surface too, so it lands in the sandbox wherever it lives. Because a host harness can't edit a tree the in-sandbox graders see, a non-shared-FS backend always runs the Harness in the sandbox (the ADR 0005 boundary doesn't apply).
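
A rough sketch of that contract, assuming a single cell_dir argument for the staging hooks (the real signatures are not given here). RemoteExecutor and run_cell are hypothetical names; the remote backend only records the transfers it would perform instead of calling upload_dir or a download.

```python
class Executor:
    """Base seam: shared-filesystem behaviour is the default."""

    def shares_host_filesystem(self) -> bool:
        return True

    def stage_in(self, cell_dir: str) -> None:
        """No-op: a bind mount already exposes the prepared Cell."""

    def stage_out(self, cell_dir: str) -> None:
        """No-op: the agent's edits are already on the host."""


class RemoteExecutor(Executor):
    """Hypothetical non-shared-FS backend that logs its transfers."""

    def __init__(self):
        self.log = []

    def shares_host_filesystem(self) -> bool:
        return False

    def stage_in(self, cell_dir: str) -> None:
        # Upload to the same absolute path the host uses.
        self.log.append(("upload", cell_dir, cell_dir))

    def stage_out(self, cell_dir: str) -> None:
        # Bring the agent's edits back for host-side grading.
        self.log.append(("download", cell_dir, cell_dir))


def run_cell(executor, cell_dir, run_harness, grade):
    """Runner order: stage in, run the Harness, stage out, then grade on the host."""
    executor.stage_in(cell_dir)   # called unconditionally
    run_harness(executor, cell_dir)
    executor.stage_out(cell_dir)  # called unconditionally
    return grade(cell_dir)


remote = RemoteExecutor()
verdict = run_cell(
    remote,
    "/work/cell-1",
    run_harness=lambda ex, d: ex.log.append(("harness", d, d)),
    grade=lambda d: "graded",
)
print([step for step, *_ in remote.log])  # ['upload', 'harness', 'download']
```

Because the base class hooks do nothing, the runner needs no branch on backend type; a shared-FS executor goes through the same call sequence with no observable effect.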

3. Optional and lazy; the floor moves to 3.12. Harbor needs Python ≥3.12, and touchstone's requires-python is raised to match (the project owner opted in). The Harbor import still lives behind pip install touchstone-eval[harbor] and is imported lazily inside the executor's factory, so non-Harbor runs pull nothing. container.backend becomes docker | harbor; container.harbor_runtime (docker/daytona/modal/e2b/runloop/gke/…) is passed through to Harbor, and env_passthrough secrets cross into the (possibly cloud) sandbox by name only — never the whole host environment.
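
The two mechanisms in this step, by-name passthrough and the lazy optional import, might look roughly like this. select_passthrough and load_optional are hypothetical helper names, skipping unset variables is an assumed policy, and the module names passed below are stand-ins (Harbor's import name is not stated here).

```python
import importlib
import os


def select_passthrough(names, environ=None):
    """Copy only the named variables; names that are unset are skipped (assumed policy)."""
    environ = os.environ if environ is None else environ
    return {name: environ[name] for name in names if name in environ}


def load_optional(module_name, extra):
    """Import at call time so runs that never use the backend import nothing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"{module_name!r} is not installed; "
            f"try: pip install touchstone-eval[{extra}]"
        ) from exc


host_env = {"API_KEY": "secret", "HOME": "/root", "PATH": "/usr/bin"}
sandbox_env = select_passthrough(["API_KEY", "NOT_SET"], host_env)
print(sandbox_env)  # {'API_KEY': 'secret'}

try:
    load_optional("module_that_is_not_installed_xyz", "harbor")
except RuntimeError as err:
    message = str(err)
    print(message)
```

Selecting by name means HOME, PATH, and every other host variable stay on the host unless a Case lists them explicitly.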

4. Construction is isolated, verified, and overridable. Harbor couples environment construction to its own Task/Trial config models, so it is quarantined in build_harbor_environment: harbor_runtime → EnvironmentType, image → the task env's docker_image, passthrough secrets → its env. This is verified end-to-end against Harbor 0.16.1 on the docker runtime (construct → start → exec → upload → stop, plus a full touchstone run whose in-sandbox agent recorded the container's OS, not the host's). For another Harbor version/runtime a Case can still bypass it with container.harbor_factory: "pkg.module:callable". HarborExecutor itself (the seam mapping) is additionally unit-tested against a fake async environment, so it is covered with or without a live Harbor.
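
Resolving a "pkg.module:callable" string such as the harbor_factory value could be done along these lines. resolve_factory is a hypothetical helper, and the arguments touchstone passes to the resolved factory are not specified here, so the sketch stops at returning the callable.

```python
import importlib


def resolve_factory(spec):
    """Turn 'pkg.module:callable' into the callable it names."""
    module_name, sep, attr = spec.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'pkg.module:callable', got {spec!r}")
    target = importlib.import_module(module_name)
    # Allow dotted attributes after the colon, e.g. 'pkg.mod:Class.method'.
    for part in attr.split("."):
        target = getattr(target, part)
    if not callable(target):
        raise TypeError(f"{spec!r} does not name a callable")
    return target


# Demonstrated with a standard-library function in place of a real factory.
join = resolve_factory("posixpath:join")
print(join("work", "cell-1"))  # work/cell-1
```

Validating the string up front gives a Case author a clear error for a malformed value rather than an import failure deep inside the run.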

Considered options

  • Adopt Harbor as the framework. This would discard the Trace/matrix/Interaction/judge differentiators, and Harbor's task format would replace the Case schema. Rejected: wrong layer.
  • Write each remote backend natively (k8s, Daytona, Modal, …). Months of transport code duplicating Harbor. Rejected; that's the reserved-backend debt ADR 0014 left, now paid by one adapter.
  • Make HarborExecutor async / the whole Executor async. A viral refactor of every call site for one backend's benefit. Rejected — the per-executor event loop confines async to the adapter.
  • Wrap Harbor behind the Executor seam (chosen). One adapter unlocks Harbor's whole fleet; the Trace, graders, harnesses, and runner are unchanged; the non-shared-FS staging is a clean, reusable generalization (it would serve any future remote backend, Harbor or not).

Consequences

  • The enterprise model is real, not reserved. container.backend: harbor + harbor_runtime: daytona (or modal/e2b/runloop/gke) runs each Cell in its own remote sandbox, agent included, with touchstone's grading and Trace intact. --workers already isolates Cells; Harbor isolates the compute.
  • stage_in/stage_out + shares_host_filesystem() are the contract for any non-shared-FS backend. A future native backend implements the FS surface + these three hooks and slots in the same way — the abstraction is now honestly backend-neutral (the gap the ADR 0014 review flagged is closed by construction, not just documented).
  • The build is verified, the override remains. build_harbor_environment is validated against Harbor 0.16.1 (docker); harbor_factory stays as the escape hatch for other runtimes/versions, and build_harbor_environment raises with that guidance if the extra is absent.
  • The cost is the 3.12 floor. Raising requires-python to 3.12 drops 3.10/3.11 users — an accepted trade for making Harbor (and its backends) first-class. The Harbor package is still optional (lazy, behind the extra); only container.backend: harbor Cases pull it.
  • CONTEXT.md gains the backend vocabulary; docs/cases.md and docs/harnesses.md document backend: harbor.

Future work

  • Validate the cloud runtimes (Daytona/Modal/E2B/Runloop) against live providers and pin a Harbor version in a CI lane; docker is verified here.
  • Map Harbor's own verifier/trajectory hooks onto touchstone's Trace where they add signal (the Harness still owns the loop; this would only enrich observation).