PX-bench measures product experience: what an agent's work is actually like to use, form and function together. It drops an agent inside a complete, opinionated application, asks it to add a feature the way a product team would, and scores the result across eight categories. Every run produces a number.
A number invites comparison. If one agent scores 0.72 and another 0.69, the natural read is that the first did better. That read holds only if the two numbers are far enough apart to mean something, and they may not be, because the same agent on the same task does not score the same twice.
The number moves on its own
Run the identical eval again, change nothing, and the score drifts. We ran one agent on one task ten times: the composite landed anywhere from 0.66 to 0.79. Nothing about the task changed between those runs. The spread is the eval measuring itself.
Two things move it. The scorers are one source: the scoring layer leans on language-model judges, and even handed the exact same finished work, they wobble a little from run to run. The agent is the other, and it is the larger source. Asked to build the same feature twice, an agent builds genuinely different things each time: an empty state here, a keyboard shortcut it shipped last time and skipped this time, a slightly different layout. All of it lands in the categories and moves the composite.
Name the two. Instrument noise (the scorers alone, call it σ_scorer) is how much the number moves when the work under it is held fixed. Total noise (σ_total) is how much it moves when everything is free to vary. Total noise is always at least instrument noise, and the gap between them is how much the agent's own variability adds on top.
Why we measure it
Two reasons, both about not fooling ourselves.
The first is a guardrail on every comparison we publish. If a single run's number would drift by some amount with nothing changed, then a gap between two agents smaller than that drift is not a finding. It is noise dressed up as signal. Measuring total noise turns that intuition into a stated rule: a gap smaller than roughly twice σ_total is not interpretable from a single run. That distance is the noise floor, and we print it next to the numbers so a reader can see which gaps carry weight and which don't.
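The rule is simple enough to write down as a check. The sketch below is illustrative only: the function names are hypothetical, not part of PX-bench, and the σ_total value passed in is the figure reported for the reference subject later in this write-up.

```python
def single_run_floor(sigma_total: float) -> float:
    """Noise floor for one run: roughly twice the total noise."""
    if sigma_total < 0:
        raise ValueError("sigma_total must be non-negative")
    return 2.0 * sigma_total


def gap_is_interpretable(score_a: float, score_b: float, sigma_total: float) -> bool:
    """True only when the gap between two single-run scores reaches the floor."""
    return abs(score_a - score_b) >= single_run_floor(sigma_total)


# The 0.72 vs 0.69 comparison from the opening, against sigma_total = 0.047.
print(gap_is_interpretable(0.72, 0.69, sigma_total=0.047))  # False: 0.03 is under 0.094
```

A gap that fails this check is not evidence that the agents are equal, only that a single run cannot tell them apart.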
The second reason is operational. Total noise tells us how many times to run each agent-and-task cell (one agent on one task). Averaging N runs shrinks the spread of the average by a factor of √N, so running a cell more times buys precision. We pick the smallest N at which the averaged noise floor, 2σ_total / √N, drops below the smallest gap between agents we mean to call real, rather than running every cell an arbitrary number of times.
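Choosing N this way is a short search. The sketch below assumes the floor for an N-run average is 2σ_total / √N; the helper names and the 0.045 target gap are hypothetical, chosen only to show the mechanics, and are not a threshold stated anywhere in this document.

```python
import math


def averaged_floor(sigma_total: float, n_runs: int) -> float:
    """Noise floor for the mean of n_runs scores: 2 * sigma_total / sqrt(n_runs)."""
    return 2.0 * sigma_total / math.sqrt(n_runs)


def runs_needed(sigma_total: float, smallest_gap: float, max_runs: int = 1000) -> int:
    """Smallest N whose averaged floor is at or below the gap we want to resolve."""
    if sigma_total < 0 or smallest_gap <= 0:
        raise ValueError("sigma_total must be >= 0 and smallest_gap must be > 0")
    for n in range(1, max_runs + 1):
        if averaged_floor(sigma_total, n) <= smallest_gap:
            return n
    raise ValueError("gap not resolvable within max_runs")


# Hypothetical target gap of 0.045 with sigma_total = 0.047.
print(runs_needed(0.047, 0.045))  # 5
```

Because precision improves only with √N, halving the floor costs four times the runs, which is why the target gap is worth fixing before the run count.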
How we measure it
One study per source, because the two only mean something once they are isolated.
For instrument noise, we freeze a single agent's finished build and re-score only that. Same files, same diff, the scoring layer run several times over. Anything that moves is the instrument, since the work beneath it is byte-identical.
For total noise, we run the whole eval end to end, agent and then scorers, several times for one fixed agent and task. Everything instrument noise captures is in here too; the difference between the two numbers is everything the agent did differently from one build to the next.
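One way to make that difference precise: if the scorers' wobble and the agent's build-to-build variation are roughly independent (an assumption, not something the studies establish), the two combine in quadrature rather than by plain subtraction. The symbol σ_agent is introduced here for illustration, and the numbers are the reference-subject figures reported below.

```latex
\[
\sigma_{\mathrm{total}}^{2} = \sigma_{\mathrm{scorer}}^{2} + \sigma_{\mathrm{agent}}^{2}
\quad\Longrightarrow\quad
\sigma_{\mathrm{agent}} = \sqrt{\sigma_{\mathrm{total}}^{2} - \sigma_{\mathrm{scorer}}^{2}}
= \sqrt{0.047^{2} - 0.007^{2}} \approx 0.046
\]
```

Under that assumption the agent accounts for nearly all of the total, which is consistent with instrument noise being negligible.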
Both numbers are the ordinary sample standard deviation of the run-level score across the repeated runs, reported with bootstrapped confidence intervals on a pinned seed so the analysis is itself reproducible. On our reference subject (Sonnet 4.6, on the task-tracker scenario):
- Instrument noise is about 0.007. Negligible. Most rubric items score identically every single time, and the residual lives in a couple of genuinely subjective judge calls that no amount of re-scoring will make deterministic.
- Total noise is about 0.047, roughly seven times larger. So the agent rebuilding the feature differently, not the scorers disagreeing, is what actually moves a PX-bench number between runs. Twice that, about 0.094, is the noise floor for a single run. Averaging five runs pulls the floor down to about 0.042 (0.094 / √5), fine enough to separate the agents we care about, so each cell runs five times.
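For readers who want to reproduce the shape of that analysis, here is a minimal sketch of a sample standard deviation with a seeded bootstrap interval. It uses a percentile bootstrap, which is one common choice and an assumption on our part here; the function name is hypothetical, and the ten scores are made up for illustration, not the actual run data.

```python
import random
import statistics

# Made-up run-level scores for illustration only; NOT the real PX-bench runs.
ILLUSTRATIVE_SCORES = [0.66, 0.70, 0.72, 0.68, 0.79, 0.74, 0.71, 0.76, 0.69, 0.73]


def bootstrap_sd_ci(scores, n_resamples=2000, seed=0, level=0.95):
    """Return (sample SD, CI low, CI high) using a percentile bootstrap.

    The seed is pinned so repeated calls give the same interval.
    """
    rng = random.Random(seed)
    n = len(scores)
    resampled_sds = []
    for _ in range(n_resamples):
        # Draw n scores with replacement and record their sample SD.
        resample = [scores[rng.randrange(n)] for _ in range(n)]
        resampled_sds.append(statistics.stdev(resample))
    resampled_sds.sort()
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return statistics.stdev(scores), resampled_sds[lo_idx], resampled_sds[hi_idx]


sd, ci_low, ci_high = bootstrap_sd_ci(ILLUSTRATIVE_SCORES, seed=7)
print(f"sd={sd:.3f}, 95% CI=({ci_low:.3f}, {ci_high:.3f})")
```

With only ten runs the interval on a standard deviation is wide, which is worth keeping in mind when reading the point estimates above.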
Measuring the floor also flushes out things that look like noise but aren't. Two were worth the trip.
The first was the sample size itself. Early on, with four runs, total noise looked like 0.035. At ten runs it was 0.047. The small sample had understated the spread, which is exactly why the noise study fixes its run count at ten in advance rather than stopping once the number looks settled.
The second was a bug in disguise. One accessibility check kept wobbling across re-scores of the same frozen build, which should be impossible for a deterministic check, where the same input always yields the same output. The cause was a contrast badge counted once per row of a list instead of once per component, so the score tracked how many rows a given build happened to render. That is not noise. It is a determinism defect, and it had been hiding inside what we would otherwise have filed as scorer variance. Collapsing the repeated component to a single count fixed it, and the item now scores identically on every replay. Re-scoring a frozen build turns out to be the cleanest test we have that a "deterministic" scorer is actually deterministic.
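The shape of that fix, and of the replay test that caught it, can be sketched in a few lines. Everything here is hypothetical: the finding format, the field names, and the function names are invented for illustration and are not the actual PX-bench scorer.

```python
def count_failing_rows(findings):
    """The defective count: one failure per rendered row."""
    return sum(1 for f in findings if f["failed"])


def count_failing_components(findings):
    """The corrected count: each failing component once, however many rows show it."""
    return len({f["component_id"] for f in findings if f["failed"]})


def replays_agree(score_fn, frozen_build, replays=5):
    """Re-score the same frozen input several times and require identical results."""
    results = [score_fn(frozen_build) for _ in range(replays)]
    return all(r == results[0] for r in results)


# The same failing badge rendered in three rows versus seven rows.
badge = {"component_id": "status-badge", "failed": True}
three_rows = [dict(badge) for _ in range(3)]
seven_rows = [dict(badge) for _ in range(7)]

print(count_failing_rows(three_rows), count_failing_rows(seven_rows))              # 3 7
print(count_failing_components(three_rows), count_failing_components(seven_rows))  # 1 1
```

A check like `replays_agree` is cheap to keep in the scoring layer's own test suite, so a regression of this kind surfaces as a failed test rather than as unexplained variance.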
What this buys, and what it doesn't
The floor does one concrete thing: it stops us reading meaning into gaps the eval cannot resolve. A reader can take any two published numbers, measure the distance between them against the floor, and know whether the comparison is real or an artifact of run-to-run drift. That is the point of publishing the number rather than keeping it as an internal sanity check.
The floor has limits worth naming. The figures here come from a single subject, and a weaker agent's messier output may well be noisier to score, so we treat the floor as provisional and will widen it as we measure builds from other agents. Noise is also partly a property of the task: a tightly scoped feature leaves the agent less room to build differently, so a different scenario would carry a different floor. And the floor reduces overconfidence; it does not abolish it. It tells you when a gap is too small to trust, not that every gap above it reflects exactly what you think it does. What it removes is the easiest way to be wrong, which is to read a leaderboard's third decimal as though the eval could see that far.
Version 1.0 — June 2026. Reach us at hello@chordio.com.