PX-bench measures product experience: what an agent's work is actually like to use, form and function together. The test puts an agent inside a complete, opinionated application (the host app), asks it to add a feature the way a product team would, and scores the result across eight categories. The host app is the substrate the whole measurement rests on, and a measurement is only as honest as its substrate.
Substrate contamination is quiet. A contaminated run still produces a clean per-category score, indistinguishable on its face from an honest one. It just stops measuring product judgment. What it measures instead depends on how the substrate was contaminated, and there are three ways: the model has already seen the app, the app was built to be solved, or the agent can tell it's a test.
The model has already seen the app
The first risk is the familiar one in evals. If the host app's code is public, the model may carry it in training data, and a high score then measures memory rather than judgment. The defense sits upstream of everything else here: host apps are built in-house and never published, so the code under the agent's hands is code no model has seen.
That settles the code. The harder problem is that an app no model has ever seen can still be contaminated, in two subtler ways: by the people who build it, and by the agent that takes the test.
The app was built to be solved
A host app earns its place as a substrate by being a genuine, opinionated product. Its conventions are real, applied consistently, the kind a senior team accretes over time. The rubric rewards an agent for reading those conventions and respecting them. When every create-and-edit flow in the app opens a right-side drawer, a new editor that opens as a modal is a real product-fit miss, because the app genuinely settled that question and the agent ignored it.
That logic holds only if the conventions are organic. If whoever builds the host app knows which conventions are about to become rubric items, the traps come out artificial: too clean, too discoverable, too perfectly placed for an agent to trip on. The app quietly stops being a product and becomes an obstacle course, easier than any real handed-off codebase, and the score stops measuring real product work.
So the app is built eval-blind. The work is split across two documents that never merge:
- A build spec describes the app as ordinary product decisions: screens, components, design tokens, state patterns, house style. The deliberately non-default conventions that make the app opinionated (a custom utility where a framework default exists, an idiosyncratic but consistent file layout) are specified here as house style, never flagged as traps.
- A measurement map records which of those conventions become scoring opportunities, each tied to a rubric item. It is the eval-aware companion to the build spec, and the builder never sees it.
The builder, whether a person or a coding agent, works from the build spec alone, in a directory with no eval material anywhere above it. That last detail is operational, not just a matter of principle: a coding agent with a path up into the eval repo can drift into the rubric on its own and tune the app toward it without ever being asked to. The build spec is self-contained for the same reason, with no links that lead back into eval material.
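The "nothing above it" rule can be checked mechanically before a build session starts. The sketch below is one way to do that, not the tooling actually used: the marker names in `EVAL_MARKERS` are hypothetical stand-ins, since the eval repo's real layout is not described here.

```python
from pathlib import Path

# Hypothetical marker names; substitute whatever identifies eval material
# in your own layout. The real eval repo's layout is an assumption here.
EVAL_MARKERS = ("rubric", "measurement-map.md", "scenarios")


def eval_material_above(build_dir: Path, markers=EVAL_MARKERS) -> list[Path]:
    """Return every marker path found in any ancestor of build_dir."""
    found = []
    for ancestor in build_dir.resolve().parents:
        for name in markers:
            candidate = ancestor / name
            if candidate.exists():
                found.append(candidate)
    return found
```

An empty list means no ancestor directory exposes a marker. The check only looks upward, which is the direction a coding agent would wander to reach the eval repo.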
Eval-awareness re-enters exactly once, after the build is done: we read the finished code against the measurement map and patch any convention that didn't come out as intended. That is the only point where anyone touching the app knows what it is for.
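As an illustration of that single verification pass, the measurement map can be thought of as a list of convention-to-rubric pairs reconciled against what a reviewer actually found in the finished code. This is a minimal sketch under assumptions: the `MapEntry` shape, the convention strings, and the rubric labels are all hypothetical, not the real map format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MapEntry:
    convention: str   # worded as in the build spec (hypothetical wording)
    rubric_item: str  # the scoring opportunity it feeds (hypothetical label)


def conventions_to_patch(measurement_map, observed):
    """Return the entries whose convention was not observed in the finished code."""
    return [entry for entry in measurement_map if entry.convention not in observed]


# Illustrative entries only, loosely based on the examples in this post.
measurement_map = [
    MapEntry("create/edit flows open a right-side drawer", "product-fit"),
    MapEntry("custom utility used where a framework default exists", "convention-reading"),
]
observed = {"create/edit flows open a right-side drawer"}

for entry in conventions_to_patch(measurement_map, observed):
    print(f"patch needed: {entry.convention} (feeds {entry.rubric_item})")
```

Anything the function returns is a convention that did not come out as intended and needs a patch before the app is tagged.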
One downstream choice falls out of building blind. For v1 the host app ships with no AI-assistant guide: no CLAUDE.md, AGENTS.md, or .cursor/rules. Most real handed-off codebases include one, so leaving it out costs a little realism. But even a carefully eval-blind guide names the conventions we want the agent to discover by reading code, which gift-wraps several rubric items outright. We ship code only and take the signal over the realism. Whether shipping a guide systematically moves scores is a question worth measuring on its own, later.
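The no-guide rule is easy to enforce as a pre-tag check. The sketch below looks only at the repository root for the three paths named above; the function name is illustrative, and a stricter version might also search nested directories.

```python
from pathlib import Path

# The three guide locations named in the text, relative to the repo root.
GUIDE_PATHS = ("CLAUDE.md", "AGENTS.md", ".cursor/rules")


def assistant_guides_present(repo_root: Path) -> list[str]:
    """Return the guide paths that exist at the root of repo_root."""
    return [rel for rel in GUIDE_PATHS if (repo_root / rel).exists()]
```

An empty result means the app ships code only, as intended for v1.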
The agent can tell it's a test
Even a clean, un-rigged app fails as a substrate if the agent taking the test can tell it is a test. An agent that knows it is being scored may behave differently than it would on real product work, and at a minimum, anything in the repo that points back to the eval is a handle it could use to figure that out. So the substrate ships with no breadcrumbs and no path back to the research that produced it.
The host app lives in its own git repository, separate from the eval repo and ignored by it. The separation is load-bearing: a coding session opened inside the app walks up to that repo's .git, not the eval repo's, so maintenance on the app can never pull research context into working context. The eval consumes one tagged commit of that repo, pinned in the scenario config. Nothing floats; a run consumes exactly the state we pinned.
The sandbox is where the rest of the defense lives. At image-build time the Dockerfile clones the pinned tag, then strips the git history before the agent ever sees the container:
# clone the pinned tag, then strip every triangulation handle from .git
RUN --mount=type=secret,id=github_token,required=true \
    TOKEN=$(cat /run/secrets/github_token) \
    && git clone --depth 1 --branch "${REF_APP_TAG}" "https://${TOKEN}@${REPO_PATH}" . \
    && rm -rf .git \
    && git init -q -b main \
    && git -c user.email=dev@task-tracker.local -c user.name="Task Tracker" add -A \
    && git -c user.email=dev@task-tracker.local -c user.name="Task Tracker" commit -q -m "Initial commit"
Left intact, git remote -v, git tag, and git log would each leak a triangulation handle: the remote URL, the v1-eval-2026-06-07 tag, and a single commit authored just before the run. Any one of them could tip an attentive agent off that it is inside a benchmark. After the scrub the agent sees an ordinary repo: no remote, no tag, a generic committer, one plain initial commit. git diff still works for its own changes, which is all it needs. (The real remote, tag, and history still live in our private mirror, which is how we track and pin the app; the scrub only changes what the container shows.)
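The post-scrub state can be asserted from inside the built image. The following is a sketch of such a check, not the project's actual test: it assumes the app is cloned into `/workspace` (the Dockerfile fragment only shows a clone into the current directory) and that `git` is on the path. The decision logic is kept in a pure function so it can be exercised without a repository.

```python
import subprocess


def scrub_problems(remotes: str, tags: str, commit_count: int) -> list[str]:
    """Describe every leftover handle, given raw git output for the repo."""
    problems = []
    if remotes.strip():
        problems.append("remote still configured")
    if tags.strip():
        problems.append("tag still present")
    if commit_count != 1:
        problems.append(f"expected 1 commit, found {commit_count}")
    return problems


def git(repo: str, *args: str) -> str:
    """Run a git command inside repo and return its stdout."""
    result = subprocess.run(
        ["git", "-C", repo, *args],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def check_workspace(repo: str = "/workspace") -> list[str]:
    """Collect git state from repo (assumed location) and report problems."""
    return scrub_problems(
        remotes=git(repo, "remote", "-v"),
        tags=git(repo, "tag"),
        commit_count=int(git(repo, "rev-list", "--count", "HEAD")),
    )
```

The commit count alone proves little, because a `--depth 1` clone also shows a single commit; the empty remote and tag lists are what distinguish the scrubbed repo from the shallow clone.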
Two smaller details close the same gap. The token that clones the private repo is mounted as a build secret, so it never lands in image history where docker history would surface it. And the scoring scaffolding lives at /opt/scorers/, outside the agent's /workspace, so a casual ls never reveals that anything is grading the work.
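The path separation can be guarded with a small check so that a later refactor does not move scaffolding under the agent's working tree. This is a minimal sketch using the two paths named above; the function name is illustrative, and `is_relative_to` requires Python 3.9 or later.

```python
from pathlib import PurePosixPath


def is_outside_workspace(path: str, workspace: str = "/workspace") -> bool:
    """True if path is not the workspace itself or anything beneath it."""
    return not PurePosixPath(path).is_relative_to(PurePosixPath(workspace))


print(is_outside_workspace("/opt/scorers/"))           # True
print(is_outside_workspace("/workspace/.scorers/run"))  # False
```

The comparison is made on path components rather than string prefixes, so a sibling directory such as `/workspace-old` is correctly treated as outside.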
What this buys, and what it doesn't
Three defenses, one goal: a score that reflects product judgment on an app the model has never seen, whose conventions are genuine, presented as ordinary product code with nothing announcing that it is being measured. The substrate is checkable. A skeptic can ask how the app was built and what the agent could see inside the container, and get a concrete answer at each step.
This is v1 of the discipline, and it is honest about its limits. The strong isolation boundary for a run is the sandboxed container, not the on-disk repo layout or the git scrub; those are defense in depth on top of it. The discipline reduces contamination; it does not eliminate it. A determined agent might still infer from the shape of the task that it is in an eval, and we cannot prove a build was perfectly blind, only that it was structured to be. A few questions stay open: how aggressive the git scrub should be, whether the no-guide choice survives study, and whether the build-blind-then-verify split has a cleaner form than the current handoff. We will learn the answers by running the eval.
Version 1.0 — June 2026. Reach us at hello@chordio.com.