
Pluggable provisioning + a backend-neutral Executor (toward containers)

A Case's environment block now selects a provisioner by kind, and all provisioning commands run through a Cell Executor rather than calling subprocess directly. This generalizes the previously pip-only environment to other ecosystems and establishes the single seam a container backend will swap in behind — without rewriting the provisioning recipes, the graders, or the runner.

Status

accepted

Context

The environment block (ADR 0004) provisioned a per-Cell virtualenv and pip-installed into it. That covers Python, but real projects differ in how dependencies are installed (pip vs uv, npm, cargo, gradle/maven) and where they live (Python's shared mutable site-packages vs project-local node_modules/target/). Non-Python cases had to smuggle their installs through setup.run — untyped, unvalidated, and conflating "install dependencies" with "introduce the task state."

Separately, every command a Cell runs outside the harness (provisioning, setup.run, the command/tests/pytest graders) was a direct subprocess.run(argv, cwd, env). To run a Cell inside a container later, each of those call sites would otherwise need its own docker exec path. We want one seam, exercised now by the local backend, that a container backend slots into.

Decision

1. Provisioner kind. EnvironmentSpec.kind selects the strategy:

  • pip-venv (default) — unchanged from ADR 0004: a stdlib venv per Cell, pip-installs requirements / requirement_files / the repo (install: editable). Every existing Case keeps its exact behavior.
  • uv — the same venv model via the uv CLI (faster resolver/installer); same fields.
  • command — for ecosystems with project-local deps: run the declared commands in the Sandbox (npm ci, cargo fetch, …). No venv; subprocesses inherit the host env, and the installed deps live in the Sandbox (torn down with it). This is the typed replacement for the old setup.run-as-installer pattern.

commands also serves as a post-install hook for the venv kinds (e.g. python -m playwright install), run under the venv env.
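As a rough illustration of how the three kinds and their fields fit together, here is a minimal sketch of an environment spec with validate-time checks. The field names (kind, requirements, requirement_files, install, commands) come from this ADR; the dataclass shape, the validate function, and the error strings are assumptions made for the example, not the project's actual schema code.

```python
from dataclasses import dataclass, field
from typing import List, Optional

KINDS = ("pip-venv", "uv", "command")
# Fields that only make sense for the venv-based kinds.
PIP_ONLY_FIELDS = ("requirements", "requirement_files", "install")


@dataclass
class EnvironmentSpec:
    """Hypothetical model of a Case's environment block."""
    kind: str = "pip-venv"  # omitted kind falls back to pip-venv
    requirements: List[str] = field(default_factory=list)
    requirement_files: List[str] = field(default_factory=list)
    install: Optional[str] = None  # e.g. "editable"
    commands: List[str] = field(default_factory=list)


def validate(spec: EnvironmentSpec) -> List[str]:
    """Return a list of problems; an empty list means the spec is valid."""
    errors: List[str] = []
    if spec.kind not in KINDS:
        errors.append(f"unknown kind: {spec.kind!r}")
    if spec.kind == "command":
        # The command kind has no venv, so pip-only fields are rejected.
        used = [name for name in PIP_ONLY_FIELDS if getattr(spec, name)]
        if used:
            errors.append("command kind forbids: " + ", ".join(used))
        if not spec.commands:
            errors.append("command kind requires non-empty commands")
    return errors
```

Under these assumptions, EnvironmentSpec(kind="command", commands=["npm ci"]) validates cleanly, while a command spec that also sets requirements is rejected before any provisioning runs.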

2. The Executor seam. executor.py defines Executor.run(argv, *, cwd, env, timeout, shell) -> ExecResult. LocalExecutor (host subprocess) is the only backend today; provision_env is written entirely against the interface and takes an optional executor (defaulting to LocalExecutor). The provisioner contract — the returned env dict threaded to every later subprocess — is unchanged.
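The seam can be pictured with a short sketch. Only the run signature and the names Executor, LocalExecutor, and ExecResult come from this ADR; the ExecResult fields, the keyword defaults, and the run_commands helper are assumptions for illustration and may not match executor.py.

```python
import subprocess
from dataclasses import dataclass
from typing import List, Mapping, Optional, Protocol, Sequence, Union

Argv = Union[str, Sequence[str]]


@dataclass
class ExecResult:
    # Assumed fields; the ADR does not spell out what ExecResult contains.
    returncode: int
    stdout: str
    stderr: str


class Executor(Protocol):
    def run(self, argv: Argv, *, cwd: Optional[str],
            env: Optional[Mapping[str, str]],
            timeout: Optional[float], shell: bool) -> ExecResult: ...


class LocalExecutor:
    """Host-subprocess backend: a thin wrapper over subprocess.run."""

    def run(self, argv: Argv, *, cwd: Optional[str] = None,
            env: Optional[Mapping[str, str]] = None,
            timeout: Optional[float] = None,
            shell: bool = False) -> ExecResult:
        proc = subprocess.run(argv, cwd=cwd, env=env, timeout=timeout,
                              shell=shell, capture_output=True, text=True)
        return ExecResult(proc.returncode, proc.stdout, proc.stderr)


def run_commands(commands: Sequence[str], *, cwd: str,
                 env: Mapping[str, str],
                 executor: Optional[Executor] = None) -> List[ExecResult]:
    """Hypothetical helper: run declared commands through any Executor."""
    executor = executor if executor is not None else LocalExecutor()
    results = []
    for command in commands:
        # Assumption: declared commands are shell strings such as "npm ci".
        result = executor.run(command, cwd=cwd, env=env,
                              timeout=600, shell=True)
        if result.returncode != 0:
            raise RuntimeError(f"command failed: {command}")
        results.append(result)
    return results
```

Because run_commands only touches the interface, a different backend can be passed as executor without changing the helper, which is the property the seam is meant to provide.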

Considered options

  • Keep using setup.run for non-Python deps. Zero new schema, but untyped, unvalidated, not reproducible per-ecosystem, and conflates provisioning with task setup. Rejected.
  • A provisioner per ecosystem (node, cargo, go, maven, …). A zoo of near-identical kinds; their deps are all project-local, so one generic command kind covers them with honest, explicit commands. Rejected; we chose the single command kind instead.
  • Jump straight to a container backend. Strongest isolation and the real answer to OS packages, but heavyweight (needs a running daemon — dockerd isn't even up by default in CI), and we want the provisioning model proven on the cheap local backend first. Deferred, deliberately, behind the Executor seam.

Consequences

  • Backward compatible: a Case that declares no kind gets pip-venv; all existing cases and the environment threading are untouched. provision_env's signature gains an optional executor.
  • command-kind validation forbids the pip-only fields and requires non-empty commands, so misuse fails at validate time.
  • Container backend (now implemented). A Case may declare a container: {image, setup, python} block; the runner then builds a ContainerExecutor instead of LocalExecutor. The executor starts a long-lived container from image with docker run, bind-mounting the cell directory (rw) and the case directory (ro) at their same absolute paths. Host and container paths therefore coincide, so no cwd translation is needed and a venv built in the container resolves identically inside and out. Provisioning, setup.run, and the command/tests/pytest graders run through it (docker exec); the provisioner recipes are unchanged. container.setup runs OS-level prep (apt-get install …) once at start, which is the OS-package story. Only changed env vars are forwarded into the container (not the whole host environment), so host secrets don't leak. A pinned image (by digest) plus pinned deps gives a reproducible grading environment: the door ADR 0004 left open.
  • Boundary (lifted in ADR 0014): originally the Harness (the agent under test) ran on the host against the bind-mounted Sandbox, so it couldn't use a container-built venv (it was handed the host env in container mode). ADR 0014 lifts this per Case (container.harness: true): the Harness then runs inside the Cell's sandbox through the Executor — the in-process openai harness needs nothing baked in (only its effects route in), a CLI harness needs its agent CLI in the image. The default still keeps the agent on the host, so the slice described here (containerize the reproducible part — deps + grading — leave the variable under test on the host) remains the back-compatible default.
  • Reproducibility otherwise remains the case author's responsibility (pin versions / use lockfiles / pin the image digest); the Executor seam is what made the container tier a drop-in rather than a rewrite.
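To make the container bullet above concrete, the following sketch shows one way the docker run and docker exec argument lists could be assembled. It is not the actual ContainerExecutor: the function names, the "sleep infinity" keep-alive command, and the reading of "changed env vars" as "differs from the host environment" are all assumptions. Only standard Docker CLI flags are used (--detach, --volume, --workdir, --env).

```python
import os
from typing import Dict, List, Mapping, Optional, Sequence


def docker_run_argv(image: str, cell_dir: str, case_dir: str) -> List[str]:
    """Start a long-lived container with both directories mounted at
    the same absolute paths they have on the host."""
    return [
        "docker", "run", "--detach",
        "--volume", f"{cell_dir}:{cell_dir}:rw",
        "--volume", f"{case_dir}:{case_dir}:ro",
        image,
        "sleep", "infinity",  # assumed keep-alive; the image must provide it
    ]


def changed_env(env: Mapping[str, str],
                base: Mapping[str, str]) -> Dict[str, str]:
    """Keep only variables that are new or differ from the host's value."""
    return {k: v for k, v in env.items() if base.get(k) != v}


def docker_exec_argv(container: str, argv: Sequence[str], *, cwd: str,
                     env: Mapping[str, str],
                     host_env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Build a docker exec call that forwards only the changed variables."""
    base = os.environ if host_env is None else host_env
    out = ["docker", "exec", "--workdir", cwd]
    for key, value in sorted(changed_env(env, base).items()):
        out += ["--env", f"{key}={value}"]
    # cwd needs no translation because the mounts mirror the host paths.
    return out + [container, *argv]
```

In this sketch a host-only variable that the provisioner never touched is absent from the exec command, while a PATH rewritten to point at the Cell's venv is forwarded. Variables removed from env relative to the host are not handled here.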