What gets measured
Chordio's PX-bench measures how well an AI coding agent handles product experience: not how a screen looks on its own, but the whole experience a finished product gives the people who use it. When an agent adds a feature to an app, product experience is the judgment behind the work. Does the feature land in the right place? Does it behave the way the rest of the app behaves? Are the empty and error states finished, and does the copy read in the product's voice?
Product experience has a defined shape: a taxonomy breaks it into eight categories a senior designer would recognize, running from whether the agent built what was asked, through visual hierarchy and convention-following, to accessibility. And it has a place we measure it: the host app, a complete, opinionated application with its own design system, conventions, and quirks, the kind of codebase a new contributor is handed on their first day.
With the capability defined and a real app to build in, the problem is the measurement: how to turn one agent's attempt at a feature into a score we can stand behind. Product judgment is easy to assert and hard to pin down, which is why every part of the evaluation is deliberate, starting with the task we hand the agent.
The task the agent is handed
Each test is built around a scenario: one feature we ask the agent to add to the host app. The agent receives it as a brief, a feature request written the way a product manager would file it. One scenario asks the agent to add saved views to a task tracker, so a user can name a filter-and-sort setup, switch between saved ones, and edit or delete them.
The brief reads like an ordinary ticket:
...
Save the current view. As a user, I can take whatever filter, sort, and grouping setup I have right now and save it under a name I choose, so I can come back to it later, from any device I sign in on.
A saved view belongs to the user's account and travels with them. The current, unsaved configuration can stay on the device, as it does today.
...
It explains why the feature matters and what the user should be able to do. What it deliberately leaves out is how to build any of it. It never says whether the editor should be a modal or a drawer, where in the app the feature should live, or how a saved view should be stored. A real ticket leaves these decisions to be worked out from the product, and whether the agent makes them well is the product judgment we measure.
The bar is set before the agent runs
Before an agent ever touches the host app, we map its decision points: the places where the brief leaves a choice open, and a careful agent and a careless one part ways. Some are planted in the app's code, waiting to be noticed. Others live in the brief's silences. For each one we write down, in advance, what a good answer looks like, and we tie it to a specific item in the rubric, the scoresheet drawn from the eight categories. The agent never sees this map.
One planted decision point:
...
The decision: deleting a saved view should be protected against accidents.
What's already in the app: the task tracker shows a confirmation dialog before it deletes a task, then an "undo" toast for a few seconds after.
The trap: the agent can reach for a raw browser confirm() box, build its own one-off dialog, or skip the undo entirely. Each is working code. Each also breaks from the way the app already handles this exact situation, which is a miss on two categories at once: whether the feature fits the product, and whether it follows the app's conventions.
...
Setting the bar before we see the work is what makes the grade a measurement, not just an opinion formed after the fact.
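One way to picture an entry in the decision-point map is as a small, immutable record written before any run. The sketch below is illustrative only: the class name, field names, and the wording of the good answer are our assumptions for this example, not PX-bench's actual schema. The content is taken from the saved-views decision point above.

```python
from dataclasses import dataclass
from enum import Enum


class Origin(Enum):
    """Where a decision point comes from (labels are hypothetical)."""
    PLANTED_IN_CODE = "planted in the app's code"
    BRIEF_SILENCE = "left open by the brief"


@dataclass(frozen=True)  # frozen: the bar cannot be edited after it is written
class DecisionPoint:
    decision: str                  # the choice the brief leaves open
    origin: Origin                 # planted in code, or a silence in the brief
    existing_pattern: str          # what the host app already does
    good_answer: str               # written down in advance of any run
    traps: tuple[str, ...]         # working-but-wrong ways to resolve it
    rubric_items: tuple[str, ...]  # rubric items this decision is tied to


# Hypothetical encoding of the delete-protection decision point.
delete_protection = DecisionPoint(
    decision="Deleting a saved view should be protected against accidents.",
    origin=Origin.PLANTED_IN_CODE,
    existing_pattern="Confirmation dialog before deleting a task, then an undo toast.",
    good_answer="Reuse the app's existing confirmation dialog and undo toast.",
    traps=("raw browser confirm() box", "one-off custom dialog", "no undo"),
    rubric_items=("fits the product", "follows the app's conventions"),
)
```

Freezing the record mirrors the point of this section: once the bar is written, nothing about the agent's work can quietly change it.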
The agent doesn't know it's being tested
The agent works inside a sealed container (a sandbox), the way a new contributor works after being handed a repository and a ticket. It has a shell, a file editor, and a tool that lets it screenshot whatever the running app renders. It reads the existing code, makes its changes, checks its own work, and signals when it believes the feature is done.
It doesn't know it's in an evaluation. The rubric and the map of decision points are nowhere in its environment. We also scrub the repository of anything that would give the game away: the commit history is rewritten and the origin removed, so the project looks like an ordinary internal product rather than a test fixture assembled the day before.
An agent told it was being graded on accessibility would pad its work with accessibility features. We don't want the work an agent produces for an exam. We want the product experience it ships by default, when it thinks it's just building a feature, because that's the one a real team would actually receive.
The harness around all this is Inspect AI, the open-source evaluation framework built by the UK AI Safety Institute (since renamed the AI Security Institute) to run model evaluations. It keeps each run controlled, repeatable, and identical for every agent we put through it.
The work gets scored
When the agent submits, scoring begins, one rubric item at a time. We match the tool to the question.
Some questions have a single, verifiable answer, and a script is the right tool: count the accessibility violations, flag a hardcoded color where the app has a design token for it.
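As a rough illustration of the second kind of scripted check, the sketch below flags hex color literals in a stylesheet whose value already exists as a design token. The function names, the token names, and the assumption that tokens map to hex values are all hypothetical, chosen for this example. A production check would also need to handle rgb() values, named colors, and 8-digit hex, which this sketch ignores.

```python
import re

# Matches 3- or 6-digit hex colors such as #fff or #1A73E8.
HEX_COLOR = re.compile(r"#(?:[0-9a-fA-F]{6}|[0-9a-fA-F]{3})\b")


def normalize(hex_color: str) -> str:
    """Lowercase a hex color and expand shorthand (#fff -> #ffffff)."""
    digits = hex_color.lstrip("#").lower()
    if len(digits) == 3:
        digits = "".join(ch * 2 for ch in digits)
    return "#" + digits


def find_hardcoded_colors(source: str, tokens: dict[str, str]) -> list[tuple[int, str, str]]:
    """Return (line number, literal, token name) for each literal a token covers."""
    token_by_value = {normalize(value): name for name, value in tokens.items()}
    findings = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for match in HEX_COLOR.finditer(line):
            token = token_by_value.get(normalize(match.group()))
            if token is not None:
                findings.append((line_no, match.group(), token))
    return findings


# Hypothetical design tokens and a hypothetical snippet of agent-written CSS.
tokens = {"color-primary": "#1A73E8", "color-surface": "#FFFFFF"}
css = """.save-view-button {
  background: #1a73e8;
  color: #fff;
  border-color: #123456;
}
"""
findings = find_hardcoded_colors(css, tokens)
```

Here the first two literals are flagged because tokens exist for them, while #123456 is left alone: the check described above only fires where the app already has a token for the color.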
Most are harder to verify. The same feature can be built countless ways in code, so a script scanning for one expected shape often can't even tell what the agent actually built. Those go to a scoring agent of our own, one that reads the code the way a reviewer would and scores it against the bar we set in advance.
Two things guard these scores. A navigator steps through the finished app's screens and states automatically, capturing what actually renders, so the scoring agent grades the running product and not just its source code. And each item is scored by the same agent run on models from three different AI labs, with the median taken as the score, so no single vendor's blind spot decides it alone.
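The median-of-three rule is simple enough to sketch directly. The lab labels and the numeric scale below are assumptions for illustration; the point is only that one outlying score cannot move the result.

```python
from statistics import median


def item_score(scores_by_lab: dict[str, float]) -> float:
    """Combine one rubric item's scores from three labs' models via the median."""
    if len(scores_by_lab) != 3:
        raise ValueError("expected exactly one score from each of three labs")
    return median(scores_by_lab.values())


# Hypothetical scores for one rubric item: one model scores it far lower
# than the other two, and the median ignores that outlier.
combined = item_score({"lab_a": 2, "lab_b": 3, "lab_c": 0})
```

Unlike a mean, the median is unchanged however far the single dissenting score sits from the others.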
The scores get audited
A scorer that's confidently wrong is worse than no scorer at all. So while we're still earning trust in the scoring itself, every item gets a second, independent reading. A separate auditor, built on a different method and different models, re-grades the work from scratch, reading the run's artifacts directly: the code the agent wrote and the screens the navigator captured. Where the auditor and the scorer agree, the score stands. Where they disagree, a person reads the case.
Some disagreements turn out to be scorer bugs, and we fix them. Some are genuinely ambiguous calls, worth flagging in public rather than papering over. The rule underneath the loop is plain: a number we can't defend doesn't get published.
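The audit loop in this section reduces to a small rule, sketched below. The names are hypothetical, and treating "agree" as exact equality of scores is our simplifying assumption; the text does not say how agreement is defined.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Reconciled:
    item_id: str
    score: Optional[float]    # None until a person has read the case
    needs_human_review: bool


def reconcile(scorer: dict[str, float], auditor: dict[str, float]) -> list[Reconciled]:
    """Let agreed scores stand; route every disagreement to a person."""
    if scorer.keys() != auditor.keys():
        raise ValueError("scorer and auditor must grade the same rubric items")
    results = []
    for item_id in sorted(scorer):
        agree = scorer[item_id] == auditor[item_id]  # assumption: exact match
        results.append(
            Reconciled(
                item_id=item_id,
                score=scorer[item_id] if agree else None,
                needs_human_review=not agree,
            )
        )
    return results


# Hypothetical rubric items and scores.
results = reconcile(
    scorer={"conventions": 1.0, "empty_states": 2.0},
    auditor={"conventions": 1.0, "empty_states": 3.0},
)
review_queue = [r.item_id for r in results if r.needs_human_review]
```

A disputed item carries no score at all until someone has read it, which matches the rule that a number that can't be defended doesn't get published.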
How it holds together
A feature request goes in, and a checked score comes out. Everything in between is built to make that score mean something: the task is realistic, the bar is set before anyone sees the work, the agent builds without knowing it's being tested, and the grades are checked by something that doesn't share the grader's blind spots. The result isn't a verdict on whether the feature works. Plenty of ways to build it work. It's a measurement of the whole product experience the agent shipped, not just the working code underneath.
Version 1.0 — June 2026. Reach us at hello@chordio.com.