The contamination problem

PX-bench measures product experience: what it is actually like to use the product an agent ships, form and function together. The test puts a coding agent inside a complete, opinionated application (the host app), asks it to add a feature the way a product team would, and scores the result across eight categories. The host app is the substrate the whole measurement rests on.

Contamination is quiet. A compromised run still posts a confident per-category score, indistinguishable from an honest one; it just stops measuring product experience. It enters in four ways, taken here one at a time.

The model has already seen the app

If the host app's code is public, the model may carry it in training data, and a high score measures memory rather than judgment. The whole premise of the test is that the agent meets the app cold; reading an unfamiliar codebase and finding its conventions is a large part of what gets scored. So we build host apps in-house and hold them out; the code is never open-sourced and never quoted in docs or demos.

Held out is perishable, though. Every run sends the app's code to a model provider's API, and a published transcript can quote it. So each score names the substrate version it ran on, and a host app retires once its code becomes public.
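
One way to keep a score tied to its substrate is to make the version part of the record itself. The sketch below is illustrative only and is not PX-bench's actual schema: the `Substrate` and `ScoreRecord` names, their fields, and the retirement rule as coded are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Optional


@dataclass(frozen=True)
class Substrate:
    """A host app at one version (hypothetical schema)."""
    app: str
    version: str
    # Assumed field: the date the app's code became public, if it ever did.
    code_public_since: Optional[date] = None

    def is_retired(self, on: date) -> bool:
        # Treated as retired from the day its code is public.
        return self.code_public_since is not None and on >= self.code_public_since


@dataclass(frozen=True)
class ScoreRecord:
    """One published score, always carrying the substrate it ran on."""
    agent: str
    substrate: Substrate
    category_scores: Dict[str, float]  # category names are left to the rubric
    run_date: date

    def is_valid(self) -> bool:
        # A run on a retired substrate no longer measures a cold read.
        return not self.substrate.is_retired(self.run_date)
```

Because the substrate travels with the score, a reader can tell two runs on different versions apart, and a run dated after the code went public can be flagged rather than silently compared.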

The app was built to be solved

A host app earns its place by being a genuine, opinionated product. Its conventions are deliberate and consistently applied, and the rubric rewards an agent for reading and respecting them. When every create-and-edit flow opens a right-side drawer, a new editor that opens as a modal is a real product-fit miss.

That logic holds only if nobody tuned the app toward the test. A builder who knows which conventions are about to become rubric items plants traps: too clean, too discoverable, too perfectly placed. The app stops being a product and becomes an obstacle course.

We split the defense across two documents that never merge:

A build spec describes the app as ordinary product decisions (screens, components, design tokens, house style). Even the deliberately non-default conventions that make the app opinionated (a custom utility where a framework default exists) are specified as house style, never as traps.
A measurement map records which conventions become scoring opportunities, each tied to a rubric item. The builder never sees it.

The builder works from the build spec alone, with no path into the eval repo, where the rubric and the map live. One limit is worth stating. The split doesn't make the conventions organic, the way a real product's are; people who know the rubric still designed them. What it prevents is aiming the app at the test, since the builder never learns which conventions are scored and so can't make those few too clean or too discoverable. The defensible claim is more modest: the conventions are consistent, and the builder who applied them was blind to what gets scored.
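
The two-document split can be pictured as two data structures that share nothing but convention ids. This is a minimal sketch under stated assumptions: the `Convention` type, the three example conventions, and the rubric id `R-product-fit-03` are hypothetical, and the checks would run on the eval side only, since the builder never holds the map.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Convention:
    """One house-style decision in the build spec. It has no 'scored' field."""
    id: str
    rule: str


# Build spec: everything the builder sees (hypothetical entries).
BUILD_SPEC = [
    Convention("editor-surface", "Create-and-edit flows open in a right-side drawer."),
    Convention("date-format", "Dates render through the in-house date utility."),
    Convention("empty-states", "Every list has an illustrated empty state."),
]

# Measurement map: lives only in the eval repo (hypothetical rubric id).
MEASUREMENT_MAP = {"editor-surface": "R-product-fit-03"}


def rubric_leaks(spec, measurement_map):
    """Return rubric ids that appear anywhere in the build spec text."""
    text = " ".join(c.id + " " + c.rule for c in spec)
    return sorted(r for r in set(measurement_map.values()) if r in text)


def unknown_conventions(spec, measurement_map):
    """Return mapped convention ids that the build spec never defines."""
    known = {c.id for c in spec}
    return sorted(k for k in measurement_map if k not in known)
```

Both functions should return empty lists for a clean pair. The scored convention sits among unscored ones with nothing to mark it out, which is the property the split is meant to protect.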

The app was written by a model

Host apps have to come from somewhere, and the obvious modern answer is a coding agent. The catch is that code can be literally unseen and still be familiar. A model that writes a codebase draws every unforced choice — naming, file shape, idiom — from its own distribution, and the habits a model writes are the habits it reads most fluently. For a benchmark that compares agents across model families, this is the sharpest version of the problem. If one family's model writes the substrate, its agents may hold a home-field advantage.

There are two ways out. A human builder is the clean one: the app is then no model's own output, so no agent under test reads its own habits back. Keeping the model builder is the economical one, and it takes three disciplines:

Pin the decisions. The build spec fixes every choice the rubric will score, so the builder supplies execution, not direction.
Disclose the builder. Every published score names who or what built the host app, so a reader knows when an agent is being scored on its own family's code.
Measure the advantage. Sibling apps built from one spec by different models' agents test whether scores follow the builder.
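
The third discipline reduces to a simple comparison once sibling apps exist. The sketch below assumes each run is recorded as an (agent family, builder family, score) tuple. The function name, the tuple layout, and the example numbers are illustrative, not results, and a raw difference of means is only a first look, not a significance test.

```python
from collections import defaultdict
from statistics import mean


def home_field_advantage(runs):
    """runs: iterable of (agent_family, builder_family, score) tuples.

    Returns, per agent family, its mean score on apps its own family built
    minus its mean score on apps other families built. Families missing
    either kind of run are skipped.
    """
    home = defaultdict(list)
    away = defaultdict(list)
    for agent_family, builder_family, score in runs:
        bucket = home if agent_family == builder_family else away
        bucket[agent_family].append(score)
    return {
        family: mean(home[family]) - mean(away[family])
        for family in home
        if family in away
    }


# Made-up scores for two sibling apps built from one spec by families A and B.
example_runs = [
    ("A", "A", 80), ("A", "B", 70),
    ("B", "A", 60), ("B", "B", 60),
]
advantage = home_field_advantage(example_runs)
```

A gap that stays near zero for every family suggests scores do not follow the builder. A consistently positive gap for one family is the home-field advantage the disclosure is meant to warn about. With only a handful of sibling apps the estimate is noisy, so it supports a disclosure more than a verdict.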

The agent can tell it's a test

An agent that knows it is being scored may behave differently than it would on real work, so anything that signals "test" is contamination. The signals live at two layers: the host app, and the brief that hands over the task.

At the app layer the defense is separation. The host app lives in its own repository and never references the eval; no rubric, no scoring code, no comment, branch, or commit message that hints at a benchmark. Authoring the app as a standalone product gets most of this for free, and we still review the git history before a run to confirm every commit is grounded in the app rather than the eval. The agent then works in a sandboxed container that holds the app and the task, with everything that scores the run withheld; scoring happens afterwards, against the workspace the agent leaves behind.
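
Part of the pre-run history review can be mechanised as a first pass before a person reads the log. This is a sketch, not the review we actually run: the term list in `TELL_PATTERN` is a hypothetical starting point, and the helper assumes a local clone with `git` on the path.

```python
import re
import subprocess

# Hypothetical tell list; a real one would be tuned to the eval's own vocabulary.
TELL_PATTERN = re.compile(
    r"\b(benchmark|rubric|eval|scor(?:e|ing)|px-bench)\b", re.IGNORECASE
)


def find_tells(lines):
    """Return the lines that contain a test-signalling term."""
    return [line for line in lines if TELL_PATTERN.search(line)]


def commit_text_and_branches(repo_path):
    """Collect commit messages and branch names from a local git repo."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--all", "--format=%H %s%n%b"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    branches = subprocess.run(
        ["git", "-C", repo_path, "branch", "--all", "--format=%(refname:short)"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    return log + branches
```

Calling `find_tells(commit_text_and_branches(path))` on the host app's clone lists candidate lines for a person to read. A word such as "score" can be a legitimate product term, so a hit is a prompt for review rather than proof of a leak, and an empty result does not replace reading the history.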

The brief is the other tell, and the easy one to get wrong. A brief that reads like an eval prompt, listing scoring criteria instead of product needs, gives the game away. So it is written the way a product manager would write it: the motivation behind the feature, user stories, an out-of-scope list, and requirements framed as product needs rather than rubric items.

The payoff, and the limits

Four defenses, one goal: a score produced on an app the model under test has never seen, built blind to the rubric, with nothing in the agent's reach that says "test". The discipline is auditable in structure, not provable. A skeptic can ask how the app was built, who or what built it, and what the agent could find in the container, and get a concrete answer at each step. Nobody can demonstrate a build was perfectly blind; the structure's job is to keep the answer key away from everyone who touched the code.

Some channels stay open. A model-built host app can be disclosed, but it cannot be cleared until we rebuild it with other models and check whether the scores follow the builder, and every substrate ages toward retirement the moment its code surfaces in public. Naming these is part of the method; a benchmark with no contamination problem has stopped looking for one.

Version 1.1 — June 2026. Reach us at hello@chordio.com.