Run the Harness inside the Sandbox (the Executor-hosted harness)
The Harness (the agent under test) may now run inside the Cell's isolated Sandbox instead of always on the host. This lifts the boundary ADR 0005 deferred ("the agent itself still runs on the host"). It also generalizes the Executor from "where a Cell's non-harness commands run" into the single backend-neutral seam for all of a Cell's work: provisioning, setup, grading, and the Harness. A Cell can therefore be confined to a container today and to a Kubernetes pod or a Daytona workspace tomorrow without touching the harness or grader code.
Status
accepted (extends ADR 0005; builds on ADR 0009 and ADR 0010)
Context
ADR 0005 introduced the Executor seam and a ContainerExecutor, but drew an explicit
boundary: only a Cell's non-harness commands (provisioning, setup.run, the
command/tests/pytest graders) ran in the container; the Harness still ran on the host
against the bind-mounted Sandbox. That was the high-value slice for the original goal (the
reproducible part — deps + grading — is containerized; the variable under test — the agent —
is not), and it dodged the hard part: a CLI agent's binary isn't in the image.
But "sandbox the deps, run the agent on the host" is the wrong shape for an enterprise
deployment, where the whole point is to run each Cell in its own throwaway, network- and
filesystem-isolated compute unit — a container, a Kubernetes pod, a Daytona workspace — with the
agent's side effects (arbitrary bash, file writes) confined there, not on the orchestrator
host. With the harness pinned to the host, the orchestrator is exactly where untrusted
agent-driven commands land. The isolation that matters most is the isolation we didn't have.
Two facts make lifting the boundary tractable now:
- The openai Harness owns its loop in-process (ADR 0009). Its agentic loop and model calls are just code and an HTTPS call from the controller; only its tool effects (bash, read_file/write_file/edit_file, grep, list_dir) touch the Sandbox. If those effects route through the Executor, the harness is sandboxed without baking anything into an image: touchstone provides the agent and confines only its effects. This is the "provide our own harness into any sandbox" path the framework wanted.
- CLI Harnesses already shell out. claude-code, claude-code-stream, and the generic cli_agent launch a subprocess. Routing that one launch through the Executor instead of subprocess.run puts the agent in the Sandbox, provided its CLI is baked into the image, which is the stated assumption for image-shipped harnesses.
What was missing was (a) a way to hand the Harness the Cell's Executor, (b) a filesystem surface on the Executor for backends whose filesystem the host does not share (a remote pod has no bind mount), and (c) the env/secret plumbing to let an in-sandbox agent reach its API key.
Decision
1. The Executor is the Cell's whole compute backend. Its remit grows from "non-harness
commands" to "everything a Cell runs, the Harness included." The interface gains a filesystem
surface alongside run/create_venv:
read_text · write_text · is_file · is_dir · list_dir
plus resolve_in_sandbox (path confinement) and copy_into_sandbox (artifact staging). The
base class implements them all host-backed, correct for every backend whose filesystem the
host shares — LocalExecutor (it is the host) and ContainerExecutor (the Sandbox is
bind-mounted at its same absolute path). A backend whose filesystem the host does not share
(a remote pod, a Daytona workspace) overrides them to act over its transport; nothing in the
harness or graders changes.
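The split between host-backed defaults and a transport-backed override can be sketched as follows. This is a minimal illustration, not touchstone's actual classes: the method signatures, the sandbox-relative path arguments, and the FakeRemoteExecutor (an in-memory dict standing in for a remote transport) are all assumptions. Path confinement is left out here for brevity.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


class Executor:
    """Hypothetical base class: host-backed defaults for shared-filesystem backends."""

    def __init__(self, sandbox: Path) -> None:
        self.sandbox = sandbox.resolve()

    def run(self, argv: list[str]) -> subprocess.CompletedProcess[str]:
        # Commands run with the Sandbox as their working directory.
        return subprocess.run(argv, cwd=self.sandbox, capture_output=True, text=True)

    # Filesystem surface: plain host I/O, valid when the host shares the Sandbox.
    def read_text(self, rel: str) -> str:
        return (self.sandbox / rel).read_text()

    def write_text(self, rel: str, content: str) -> None:
        target = self.sandbox / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)

    def is_file(self, rel: str) -> bool:
        return (self.sandbox / rel).is_file()

    def is_dir(self, rel: str) -> bool:
        return (self.sandbox / rel).is_dir()

    def list_dir(self, rel: str = ".") -> list[str]:
        return sorted(p.name for p in (self.sandbox / rel).iterdir())


class LocalExecutor(Executor):
    """Inherits every default unchanged: it is the host."""


class FakeRemoteExecutor(Executor):
    """Stand-in for a non-shared-FS backend; a dict plays the remote transport."""

    def __init__(self) -> None:
        self.files: dict[str, str] = {}

    def run(self, argv: list[str]) -> subprocess.CompletedProcess[str]:
        raise NotImplementedError("a real backend would exec over its transport")

    def read_text(self, rel: str) -> str:
        return self.files[rel]

    def write_text(self, rel: str, content: str) -> None:
        self.files[rel] = content

    def is_file(self, rel: str) -> bool:
        return rel in self.files

    def is_dir(self, rel: str) -> bool:
        prefix = rel.rstrip("/") + "/"
        return any(key.startswith(prefix) for key in self.files)

    def list_dir(self, rel: str = ".") -> list[str]:
        prefix = "" if rel in (".", "") else rel.rstrip("/") + "/"
        names = {k[len(prefix):].split("/", 1)[0] for k in self.files if k.startswith(prefix)}
        return sorted(names)


def append_line(executor: Executor, rel: str, line: str) -> str:
    """Caller code (harness or grader) that never learns which backend it has."""
    existing = executor.read_text(rel) if executor.is_file(rel) else ""
    executor.write_text(rel, existing + line + "\n")
    return executor.read_text(rel)
```

append_line behaves the same against either executor, which is the property the seam is meant to guarantee.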
Crucially, path confinement is part of the surface (resolve_in_sandbox, defaulting to the
sandbox_fs.safe_resolve substrate of ADR 0010): a symlink inside a remote pod's Sandbox must be
resolved in the pod, not against the orchestrator host, or the escape check is meaningless
there. Putting resolution behind the same seam as I/O is what makes "nothing in the harness
changes" actually true for a non-shared-FS backend — confinement, I/O, and staging all swap
together. Likewise artifact materialization (a CLI harness's .claude/ skills + MCP config) goes
through copy_into_sandbox, so it reaches the sandbox wherever it lives rather than being written
to the host and lost.
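For a shared-filesystem backend, the confinement check might look roughly like the sketch below. The names resolve_on_host and SandboxEscape are hypothetical, and the real sandbox_fs.safe_resolve of ADR 0010 may differ in signature and behavior. The point is only that symlinks and ".." are resolved before the containment test, which is why a remote backend has to perform the same resolution on its own side.

```python
from pathlib import Path


class SandboxEscape(ValueError):
    """Raised when a requested path resolves outside the sandbox root."""


def resolve_on_host(root: Path, requested: str) -> Path:
    """Resolve `requested` under `root`, refusing anything that lands outside it.

    Hypothetical host-backed default; assumes the host can see the Sandbox.
    """
    root = root.resolve()
    # resolve() follows symlinks and collapses "..", so the check below sees
    # the real destination. An absolute `requested` replaces root entirely
    # in the join and is then rejected unless it already lies inside root.
    candidate = (root / requested).resolve()
    if not candidate.is_relative_to(root):
        raise SandboxEscape(f"{requested!r} resolves outside the sandbox")
    return candidate
```

Running this on the orchestrator against a remote pod's paths would follow the orchestrator's symlinks, not the pod's, so a non-shared-FS backend would need to override the method and resolve inside the pod.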
2. The Harness runs through ctx.executor. RunContext gains an executor (default
LocalExecutor). The CLI family launches via ctx.executor.run(...); the openai harness
mediates every tool through ctx.executor (commands via run, files via the FS surface). With
the default LocalExecutor this is byte-for-byte the old host behavior — the existing Trace
tests pin it.
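A rough sketch of what "the Harness runs through ctx.executor" could mean in code is shown below. Only the names RunContext, LocalExecutor, and ctx.executor come from this ADR; the fields, the function names launch_cli_agent and handle_tool_call, the tool argument shapes, and the use of sh -c for the bash tool are assumptions made for illustration.

```python
from __future__ import annotations

import subprocess
from dataclasses import dataclass
from pathlib import Path


class LocalExecutor:
    """Minimal host executor, just enough surface for this sketch."""

    def __init__(self, sandbox: Path) -> None:
        self.sandbox = sandbox

    def run(self, argv: list[str]) -> subprocess.CompletedProcess[str]:
        return subprocess.run(argv, cwd=self.sandbox, capture_output=True, text=True)

    def read_text(self, rel: str) -> str:
        return (self.sandbox / rel).read_text()

    def write_text(self, rel: str, content: str) -> None:
        (self.sandbox / rel).write_text(content)


@dataclass
class RunContext:
    sandbox: Path
    executor: LocalExecutor | None = None  # any Executor backend in practice

    def __post_init__(self) -> None:
        # Default to the host, which keeps the pre-existing behavior.
        if self.executor is None:
            self.executor = LocalExecutor(self.sandbox)


def launch_cli_agent(ctx: RunContext, argv: list[str]) -> str:
    """CLI family: the single launch goes through the executor, not subprocess.run."""
    return ctx.executor.run(argv).stdout


def handle_tool_call(ctx: RunContext, name: str, args: dict[str, str]) -> str:
    """In-process harness: every tool effect is mediated by the executor."""
    if name == "bash":
        result = ctx.executor.run(["sh", "-c", args["command"]])
        return result.stdout + result.stderr
    if name == "read_file":
        return ctx.executor.read_text(args["path"])
    if name == "write_file":
        ctx.executor.write_text(args["path"], args["content"])
        return "ok"
    raise ValueError(f"unknown tool: {name}")
```

Swapping the executor on the context is the only change needed to move these effects into another backend; neither function refers to the host directly.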
3. The boundary is lifted per Case, opt-in. container.harness: true runs the Harness inside
the Cell's sandbox (against the container-built venv); the default keeps ADR 0005's host-harness
boundary, so a Case whose image doesn't bake the agent CLI is unaffected. container.backend
selects the provider (docker; remote runtimes via harbor, ADR 0015). container.env_passthrough
names host secrets (e.g. ANTHROPIC_API_KEY) to forward into the sandbox. Naming them is necessary
because the per-command env delta forwards only what a command changed, and a secret usually
equals the host value, so it would otherwise be dropped. Only the named vars cross.
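The interaction between the env delta and env_passthrough can be illustrated with a small sketch. The function name sandbox_env and its parameters are hypothetical, and the real plumbing may compute the delta differently; the example only shows why an unchanged secret needs to be named to cross the boundary.

```python
from __future__ import annotations

from collections.abc import Iterable, Mapping


def sandbox_env(
    host_env: Mapping[str, str],
    command_env: Mapping[str, str],
    env_passthrough: Iterable[str] = (),
) -> dict[str, str]:
    """Return the variables forwarded into the sandbox for one command."""
    # Delta: only what the command set or changed relative to the host.
    forwarded = {k: v for k, v in command_env.items() if host_env.get(k) != v}
    # Named passthrough: forwarded even though the value equals the host's.
    # setdefault keeps a value the command deliberately changed.
    for name in env_passthrough:
        if name in host_env:
            forwarded.setdefault(name, host_env[name])
    return forwarded


host = {"PATH": "/usr/bin", "ANTHROPIC_API_KEY": "sk-test", "OTHER_SECRET": "x"}
command = {**host, "VIRTUAL_ENV": "/sbx/.venv", "PATH": "/sbx/.venv/bin:/usr/bin"}

# Without passthrough the API key is dropped, because it equals the host value.
without = sandbox_env(host, command)
# With passthrough only the named secret crosses; OTHER_SECRET stays on the host.
with_key = sandbox_env(host, command, ["ANTHROPIC_API_KEY"])
```

The values above are placeholders, not real configuration.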
Considered options
- Keep the harness on the host (ADR 0005 status quo). Simplest, but leaves untrusted agent commands running on the orchestrator and blocks the per-Cell-pod enterprise model. Rejected as the end state; retained as the default for back-compat.
- Bake every agent (incl. openai) into images and always shell out. Uniform, but forces a custom image per agent and throws away ADR 0009's in-process advantage (the controlled loop, no vendor image). Rejected: the openai path should need nothing baked.
- A separate Sandbox/Runtime abstraction parallel to the Executor. A second seam doing 90% of what the Executor does (run a command, in a place, with an env). Rejected: one seam, extended, is the deep module; two would re-derive each other.
- Share the host filesystem to every backend (always bind-mount). Works for local docker, impossible for a remote pod / Daytona. The FS surface is what makes those real, so the abstraction is backend-neutral rather than docker-shaped. Chosen.
Consequences
- The enterprise shape is reachable. A Cell can run end-to-end, agent included, in its own isolated unit. docker is wired natively; remote runtimes (Daytona, Modal, E2B, Runloop, GKE) arrive via the harbor backend (ADR 0015), which wraps Harbor's sandboxes as one Executor subclass, superseding the kubernetes/daytona placeholders this ADR originally reserved.
- The openai harness is sandboxable with no image work. Its effects route into whatever backend the Cell uses; the loop stays in the controller. This is the recommended way to grade any model under strong isolation without a per-agent image.
- CLI harnesses run in-sandbox when their CLI is in the image. container.harness: true + an image with claude/droid/… baked in puts the agent in the container; the venv it sees is the container-built one (same bind-mounted path inside and out).
- Secrets cross the boundary only by name. env_passthrough is explicit; the no-blanket-leak rule from ADR 0005 holds, now covering the harness too.
- Back-compat is total. No container ⇒ host LocalExecutor, identical to before; a container without harness: true keeps the agent on the host (ADR 0005's slice intact). Only opting in changes where the agent runs.
- Timeouts are reaped in the right namespace. docker exec has no remote timeout, so a host-side timeout would kill the exec client and orphan the process inside the container. The ContainerExecutor wraps commands with a container-side timeout(1) (probed once at start) so a hung agent command dies in the container and its partial output survives; a host-side timeout remains as a backstop. A future remote backend owns the equivalent guarantee for its transport.
- Not yet in-sandbox: the ACP adapter. ACP drives an external agent over stdio from the controller process; running that agent in a remote sandbox means the JSON-RPC transport must cross the boundary too. Deferred: it composes the same Executor seam when done (ADR 0010).
- CONTEXT.md's Executor term is widened to "where a Cell's work runs, the Harness included," and notes the filesystem surface and the per-Case harness-location knob.
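The container-side timeout described in the consequences above could be assembled along these lines. This is a sketch under assumptions, not the ContainerExecutor's actual code: the function names, the `command -v timeout` probe, and the 10-second grace added to the host backstop are all illustrative choices.

```python
from __future__ import annotations

import subprocess
from collections.abc import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def _returncode(argv: Sequence[str]) -> int:
    """Default runner: execute argv on the host and return its exit status."""
    return subprocess.run(list(argv), capture_output=True).returncode


def probe_timeout(container: str, runner: Runner = _returncode) -> bool:
    """Check once whether timeout(1) exists inside the container."""
    probe = ["docker", "exec", container, "sh", "-c", "command -v timeout"]
    return runner(probe) == 0


def docker_exec_argv(
    container: str, argv: list[str], timeout_s: int | None, have_timeout: bool
) -> list[str]:
    """Build the docker exec command, wrapping argv in a container-side timeout."""
    inner = list(argv)
    if timeout_s is not None and have_timeout:
        # The timer runs inside the container, so the hung process is killed
        # there. GNU coreutils timeout exits with status 124 on expiry.
        inner = ["timeout", str(timeout_s), *inner]
    return ["docker", "exec", container, *inner]


def host_backstop(timeout_s: int | None, grace_s: float = 10.0) -> float | None:
    """Host-side timeout, set later than the in-container one so it rarely fires."""
    return None if timeout_s is None else timeout_s + grace_s
```

A caller would pass docker_exec_argv(...) to subprocess.run with timeout=host_backstop(...), so the host-side limit only fires if the in-container timer fails to.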