Borrow Harbor's sandboxes behind touchstone's own Executor seam
Harbor is a framework for running agentic tasks in sandboxes,
with backends for docker, Daytona, Modal, E2B, Runloop, GKE, …. touchstone gets those remote
runtimes — the enterprise "each Cell in its own pod/workspace" requirement — by wrapping Harbor's
BaseEnvironment behind its own Executor interface (ADR 0014), rather than adopting Harbor's
task/agent/verifier framework. The Executor is the seam; Harbor is just one more place a Cell's
commands run. This also retires the kubernetes/daytona placeholder backends ADR 0014 reserved.
Status
accepted (extends ADR 0014; supersedes its reserved kubernetes/daytona backend values)
Context
ADR 0014 made the Executor the single place a Cell's work runs (provisioning, grading, and the
Harness) and reserved container.backend: kubernetes|daytona as "future drop-ins." Writing each
remote backend (a Kubernetes pod client, a Daytona client, a Modal client, …) is a lot of
undifferentiated transport code — the exact problem Harbor already solved: its BaseEnvironment
unifies docker + a fleet of cloud sandbox providers behind one async interface (start / stop /
exec / upload_* / download_* / is_file / is_dir).
The decomposition Harbor is built on — task / harness(agent) / sandbox, mixed freely — is the same one touchstone arrived at (Case / Harness / Executor), and Harbor's external vs installed agent split maps one-to-one onto ADR 0014's host-harness vs in-sandbox-harness boundary. So Harbor is the right thing to borrow at the sandbox layer. But adopting Harbor as the framework would discard touchstone's differentiators that live above the sandbox: the Trace contract, the model-comparison matrix, Interaction policies, judge credibility, regression tracking. The goal is therefore: borrow Harbor's backends, keep touchstone's seam and semantics.
Two mismatches stand between the two:
- Harbor's BaseEnvironment is async; the Executor is sync.
- A Harbor sandbox is a remote pod/VM whose filesystem the host does not share, unlike the docker ContainerExecutor, which gets a same-path bind mount. ADR 0014's filesystem surface was built for exactly this, but the runner still assumed the prepared Cell was already visible to the backend (true for a bind mount, false for a remote pod).
Decision
1. HarborExecutor(Executor) wraps a Harbor BaseEnvironment:
- async ↔ sync via one dedicated event loop per executor (Harbor's own usage pattern — "each env drives start/exec/verify synchronously on its own loop"). All calls to an environment go through that loop.
- run maps argv → a single shell command (quoting each token, since Harbor's exec takes a string), then exec(command, cwd, env, timeout_sec) → ExecResult.
- The filesystem surface maps onto Harbor: read_text via exec cat, write_text via upload_file from a temp, is_file/is_dir native, list_dir via exec ls -1Ap, copy_into_sandbox via upload_file/upload_dir, and resolve_in_sandbox lexically (the host has no tree to .resolve() against; confinement is enforced in the sandbox's namespace, the very reason ADR 0014 put resolution behind the seam).
2. Non-shared-FS staging in the Executor base. The Executor gains shares_host_filesystem()
(True by default) and stage_in/stage_out (no-ops by default). HarborExecutor returns False
and uploads the prepared Cell to its same absolute path before the run (the bind-mount trick,
over upload_dir) and downloads the agent's edits back before host-side grading. The runner calls
stage_in/stage_out unconditionally; shared-FS backends ignore them, so docker/host are
byte-identical. Setup stubbing now reads/writes through the Executor FS surface too, so it lands in
the sandbox wherever it lives. Because a host harness can't edit a tree the in-sandbox graders see,
a non-shared-FS backend always runs the Harness in the sandbox (the ADR 0005 boundary doesn't apply).
3. Optional and lazy; the floor moves to 3.12. Harbor needs Python ≥3.12, and touchstone's
requires-python is raised to match (the project owner opted in). The Harbor import still lives
behind pip install touchstone-eval[harbor] and is imported lazily inside the executor's factory,
so non-Harbor runs pull nothing. container.backend becomes docker | harbor;
container.harbor_runtime (docker/daytona/modal/e2b/runloop/gke/…) is passed through to Harbor, and
env_passthrough secrets cross into the (possibly cloud) sandbox by name only — never the whole
host environment.
4. Construction is isolated, verified, and overridable. Harbor couples environment
construction to its own Task/Trial config models, so it is quarantined in
build_harbor_environment: harbor_runtime → EnvironmentType, image → the task env's
docker_image, passthrough secrets → its env. This is verified end-to-end against Harbor 0.16.1
on the docker runtime (construct → start → exec → upload → stop, plus a full touchstone run whose
in-sandbox agent recorded the container's OS, not the host's). For another Harbor version/runtime a
Case can still bypass it with container.harbor_factory: "pkg.module:callable". HarborExecutor
itself — the seam mapping — is additionally unit-tested against a fake async environment, so it is
covered with or without a live Harbor.
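The "pkg.module:callable" form of container.harbor_factory can be resolved lazily with the standard library. The following is a minimal sketch, not touchstone's actual loader: load_factory is a hypothetical name, the parsing rules are an assumption modelled on the common entry-point convention, and the source does not say which arguments the resolved factory receives.

```python
import importlib
from typing import Any, Callable


def load_factory(spec: str) -> Callable[..., Any]:
    """Resolve a "pkg.module:callable" string (hypothetical helper)."""
    module_name, sep, attr_path = spec.partition(":")
    if not (sep and module_name and attr_path):
        raise ValueError(f"expected 'pkg.module:callable', got {spec!r}")
    # The import happens here, at call time, so a run that never names a
    # factory never imports the module behind it.
    module = importlib.import_module(module_name)
    target: Any = module
    for part in attr_path.split("."):
        target = getattr(target, part)
    if not callable(target):
        raise TypeError(f"{spec!r} does not name a callable")
    return target


# Demonstrated with a stdlib callable so the sketch runs without Harbor.
join_paths = load_factory("posixpath:join")
print(join_paths("cells", "case-1"))  # prints: cells/case-1
```

Keeping the import inside the function is what makes the override lazy: the cost of the named module is paid only by Cases that set the option.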
Considered options
- Adopt Harbor as the framework. This discards the Trace/matrix/Interaction/judge differentiators, and Harbor's task format would replace the Case schema. Rejected: wrong layer.
- Write each remote backend natively (k8s, Daytona, Modal, …). Months of transport code duplicating Harbor. Rejected; that's the reserved-backend debt ADR 0014 left, now paid by one adapter.
- Make HarborExecutor async / the whole Executor async. A viral refactor of every call site for one backend's benefit. Rejected: the per-executor event loop confines async to the adapter.
- Wrap Harbor behind the Executor seam (chosen). One adapter unlocks Harbor's whole fleet; the Trace, graders, harnesses, and runner are unchanged; the non-shared-FS staging is a clean, reusable generalization (it would serve any future remote backend, Harbor or not).
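The chosen option rests on two small mechanics from the Decision: a private event loop that confines async to the adapter, and argv quoted into a single command string. The sketch below illustrates both under stated assumptions. FakeAsyncEnvironment and SyncExecutor are hypothetical stand-ins, not Harbor's BaseEnvironment or touchstone's HarborExecutor, and the ExecResult fields shown are invented for the example.

```python
import asyncio
import shlex
from dataclasses import dataclass


@dataclass
class ExecResult:
    # Field names are assumptions made for this sketch.
    return_code: int
    stdout: str


class FakeAsyncEnvironment:
    """Hypothetical async sandbox; it echoes the command it was given."""

    async def start(self) -> None:
        await asyncio.sleep(0)

    async def exec(self, command: str) -> ExecResult:
        await asyncio.sleep(0)
        return ExecResult(return_code=0, stdout=command)

    async def stop(self) -> None:
        await asyncio.sleep(0)


class SyncExecutor:
    """Sync facade: every call is driven to completion on one private loop."""

    def __init__(self, env: FakeAsyncEnvironment) -> None:
        self._env = env
        self._loop = asyncio.new_event_loop()  # one dedicated loop per executor
        self._loop.run_until_complete(env.start())

    def run(self, argv: list[str]) -> ExecResult:
        # The environment takes a string, so each token is shell-quoted
        # before the tokens are joined.
        command = shlex.join(argv)
        return self._loop.run_until_complete(self._env.exec(command))

    def close(self) -> None:
        try:
            self._loop.run_until_complete(self._env.stop())
        finally:
            self._loop.close()


executor = SyncExecutor(FakeAsyncEnvironment())
result = executor.run(["echo", "hello world", "$HOME"])
executor.close()
print(result.stdout)  # prints: echo 'hello world' '$HOME'
```

One limitation of this simple form: run_until_complete cannot be called from a thread that is already running an event loop, so a caller that is itself async would need the private loop to live on its own thread.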
Consequences
- The enterprise model is real, not reserved. container.backend: harbor + harbor_runtime: daytona (or modal/e2b/runloop/gke) runs each Cell in its own remote sandbox, agent included, with touchstone's grading and Trace intact. --workers already isolates Cells; Harbor isolates the compute.
- stage_in/stage_out + shares_host_filesystem() are the contract for any non-shared-FS backend. A future native backend implements the FS surface + these three hooks and slots in the same way; the abstraction is now honestly backend-neutral (the gap the ADR 0014 review flagged is closed by construction, not just documented).
- The build is verified, the override remains. build_harbor_environment is validated against Harbor 0.16.1 (docker); harbor_factory stays as the escape hatch for other runtimes/versions, and build_harbor_environment raises with that guidance if the extra is absent.
- The cost is the 3.12 floor. Raising requires-python to 3.12 drops 3.10/3.11 users, an accepted trade for making Harbor (and its backends) first-class. The Harbor package is still optional (lazy, behind the extra); only container.backend: harbor Cases pull it.
- CONTEXT.md gains the backend vocabulary; docs/cases.md / harnesses.md document backend: harbor.
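The three-hook staging contract can be illustrated with a toy backend. This is a sketch under assumptions, not touchstone's code: the hook names come from this ADR, but the signatures, the CopyingExecutor class, and run_cell are hypothetical, and a local directory stands in for the remote sandbox where the real adapter would upload and download.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


class Executor:
    """Base: shared-filesystem backends inherit no-op staging."""

    def shares_host_filesystem(self) -> bool:
        return True

    def stage_in(self, cell_dir: Path) -> None:
        pass

    def stage_out(self, cell_dir: Path) -> None:
        pass


class CopyingExecutor(Executor):
    """Hypothetical non-shared-FS backend backed by a local directory."""

    def __init__(self, sandbox_root: Path) -> None:
        self._root = sandbox_root

    def shares_host_filesystem(self) -> bool:
        return False

    def sandbox_path(self, cell_dir: Path) -> Path:
        # Mirror the host's absolute path beneath the stand-in sandbox root.
        resolved = cell_dir.resolve()
        return self._root / resolved.relative_to(resolved.anchor)

    def stage_in(self, cell_dir: Path) -> None:
        shutil.copytree(cell_dir, self.sandbox_path(cell_dir), dirs_exist_ok=True)

    def stage_out(self, cell_dir: Path) -> None:
        shutil.copytree(self.sandbox_path(cell_dir), cell_dir, dirs_exist_ok=True)


def run_cell(executor: Executor, cell_dir: Path, work: Callable[[], None]) -> None:
    """The runner calls both hooks unconditionally."""
    executor.stage_in(cell_dir)
    work()
    executor.stage_out(cell_dir)


with tempfile.TemporaryDirectory() as tmp:
    cell = Path(tmp) / "cell"
    cell.mkdir()
    (cell / "answer.txt").write_text("draft")
    remote = CopyingExecutor(Path(tmp) / "sandbox")

    def agent_edit() -> None:
        # Simulates the in-sandbox harness editing the staged copy.
        (remote.sandbox_path(cell) / "answer.txt").write_text("final")

    run_cell(remote, cell, agent_edit)
    staged_back = (cell / "answer.txt").read_text()

print(staged_back)  # prints: final
```

Because the base class supplies no-op hooks, the runner needs no branch on backend type, which is how shared-filesystem backends stay unaffected.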
Future work
- Validate the cloud runtimes (Daytona/Modal/E2B/Runloop) against live providers and pin a Harbor version in a CI lane; docker is verified here.
- Map Harbor's own verifier/trajectory hooks onto touchstone's Trace where they add signal (the Harness still owns the loop; this would only enrich observation).