A methodology for evaluating AI agents on product experience

The capability we measure

Product experience (PX) is what a product is like to use: the combined effect of its form (the elements it is built from and how they are organized) and its function (what it does, and whether that actually meets the need). An agent builds the form and the function; PX-bench measures the quality of the experience they produce. When an agent adds a feature to an app, we can evaluate the resulting experience by answering questions like "Does the feature land in the right place? Does it behave the way the rest of the app behaves? Are the empty and error states included, and does the copy read in the product's voice?"

PX-bench uses a set of rubrics that represent these questions, among many others. The quality of an experience is not something you can read off a screen directly, so we score the narrower judgments that compose it and aggregate from there. A taxonomy is how we make those judgments measurable: it organizes the rubrics into eight categories a senior designer would recognize, ranging from whether the agent built what was asked, through visual hierarchy and convention-following, to accessibility.

The host app

We measure product experience within a host app: a complete, opinionated application with its own design system, conventions, and quirks, the kind of codebase a new contributor is handed on their first day.

The host app lets us evaluate categories that are relational: for example, whether the agent's output fits naturally within existing product constraints and adheres to existing styles and UX patterns.

The whole evaluation runs on Inspect AI, the open-source framework the UK AI Safety Institute built for running model evaluations. That keeps every run controlled, repeatable, and identical for every agent we put through it.

The task: an ordinary feature request

Each test is built around a scenario, a feature we ask the agent to add to the host app. The agent receives it as a brief, a feature request written the way a product manager would file it.

For example, one scenario asks the agent to add saved views to a task tracker app. The brief reads like an ordinary ticket:

Save the current view. As a user, I can take whatever filter, sort, and grouping setup I have right now and save it under a name I choose, so I can come back to it later, from any device I sign in on.

A saved view belongs to the user's account and travels with them. The current, unsaved configuration can stay on the device, as it does today.

The brief explains why the feature matters and what the user should be able to do. What it deliberately leaves out is how to build any of it. It never says whether the editor should be a modal or a drawer, where in the app the feature should live, or how a saved view should be stored. A real-world ticket leaves these decisions for the product team (product managers, designers, and developers) to work out, and so the benchmark leaves this work for the agent.

Establishing ground truth

Before an agent ever touches the host app, we define the ground truth: the answer key the work will be graded against. We map the scenario's decision points, the places where a choice is left open and a careful agent parts ways with a careless one. Some are planted in the app's code, waiting to be noticed. Others live in the brief's silences. For each, we write down in advance what a good answer looks like and tie it to a specific rubric item. That answer rests on established design guidelines that experts have vetted over years, such as Apple's Human Interface Guidelines and Google's Material Design, as well as on our own judgment. The agent never sees this map.

An example planted decision point:

The decision: deleting a saved view should be protected against accidents.

What's already in the app: the task tracker shows a confirmation dialog before it deletes a task, then an "undo" toast for a few seconds after.

The trap: the agent can reach for a raw browser confirm() box, build its own one-off dialog, or skip the undo entirely. Each is working code. Each also breaks from the way the app already handles this exact situation, which is a miss on two categories at once: whether the feature fits the product, and whether it follows the app's conventions.
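
One way to picture an entry in the map is as a small record that ties a decision to the rubric categories it counts against. The sketch below is illustrative only: the class name, field names, and category labels are assumptions made for this example, not PX-bench's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionPoint:
    """One entry in the ground-truth map (hypothetical schema)."""

    decision: str          # what the agent has to get right
    source: str            # "planted" in the app's code, or a "silence" in the brief
    existing_pattern: str  # what the host app already does in this situation
    good_answer: str       # written down before any agent runs
    categories: tuple[str, ...] = ()  # rubric categories a miss counts against


# The saved-view deletion example, expressed as a record.
delete_saved_view = DecisionPoint(
    decision="Deleting a saved view should be protected against accidents.",
    source="planted",
    existing_pattern="Confirmation dialog before delete, then an undo toast.",
    good_answer="Reuse the app's existing confirmation dialog and undo toast.",
    categories=("fits the product", "follows conventions"),
)
```

Keeping the record immutable reflects the point that the answer is fixed in advance and does not shift once agents start running.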

The agent doesn't know it's being tested

The agent works inside a sealed container (a sandbox), the way a new contributor works after being handed a repository and a ticket. It has a shell, a file editor, and a tool that lets it screenshot whatever the running app renders. It reads the existing code, makes its changes, checks its own work, and signals when it believes the feature is done.

It doesn't know it's in an evaluation. The rubric and the map of decision points are nowhere in its environment. We also scrub the repository of anything that would give the game away. The commit history is rewritten and the origin removed, so the project looks like an ordinary internal product rather than a test fixture assembled the day before.

An agent told it was being graded on accessibility would pad its work with accessibility. We don't want the work an agent produces for an exam. We want the product experience it ships by default, when it thinks it's just building a feature, because that's the one a product team would receive.

The work gets scored

When the agent delivers its work, scoring begins, one rubric item at a time, each graded on the same fixed scale. We match the evaluation method to the item.

Some items can be verified by a deterministic check, and a script is the right tool: count the accessibility violations, flag a hardcoded color where the app has a design token for it.
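
As a rough illustration of the hardcoded-color check, the sketch below scans a stylesheet for hex literals that duplicate a design token's value. The token names, token values, and function name are hypothetical, and a real check would also need to cover other color notations such as rgb().

```python
import re

# Hypothetical design tokens: token name -> hex value.
DESIGN_TOKENS = {"--color-danger": "#d92d20", "--color-surface": "#ffffff"}

# Matches 6-digit or 3-digit hex color literals.
HEX_COLOR = re.compile(r"#(?:[0-9a-fA-F]{6}|[0-9a-fA-F]{3})\b")


def find_hardcoded_colors(
    source: str, tokens: dict[str, str]
) -> list[tuple[int, str, str]]:
    """Return (line number, literal, token name) for each hex literal
    whose value is already available as a design token."""
    by_value = {value.lower(): name for name, value in tokens.items()}
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in HEX_COLOR.finditer(line):
            literal = match.group(0).lower()
            if literal in by_value:
                findings.append((lineno, literal, by_value[literal]))
    return findings


# Example input: the first rule hardcodes a color, the second uses the token.
agent_css = """\
.delete-button { color: #D92D20; }
.card { background: var(--color-surface); }
"""
findings = find_hardcoded_colors(agent_css, DESIGN_TOKENS)
```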

Most can't be. They call for judgment. The same feature can be built countless ways in code, so a script scanning for one expected shape often won't find it. Those go to a scoring agent of our own, one that reads the code the way a reviewer would and grades it against the ground truth we set in advance.

Two things guard these scores. First, a navigator steps through the finished app's screens and states automatically, capturing what actually renders, so the scoring agent grades the running product and not just its source code. Second, each item is scored three times by the same scoring agent, each run using a model from a different AI lab, and the median is taken as the score, so no single vendor's blind spot decides it alone.
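
The median step is simple enough to show directly. This is a minimal sketch, assuming a 0 to 4 integer scale and placeholder lab names; neither is specified above.

```python
from statistics import median

# Hypothetical scores for one rubric item, one per lab's model.
# The 0-4 scale and the lab names are assumptions for illustration.
scores_by_lab = {"lab_a": 3, "lab_b": 4, "lab_c": 1}


def item_score(scores: dict[str, int]) -> int:
    """Take the median of exactly three gradings, so a single outlier
    (high or low) cannot decide the item's score on its own."""
    if len(scores) != 3:
        raise ValueError("expected one score from each of three labs")
    return median(scores.values())


final = item_score(scores_by_lab)
```

With three graders, the median discards the one reading that sits furthest to either side, which is why a single model's blind spot does not move the result.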

The scores get audited

A scorer that's confidently wrong is worse than no scorer at all. So while we're still earning trust in the scoring itself, every item gets a second, independent reading. A separate auditor, built on a different method and different models, re-grades the work from scratch, reading the run's artifacts directly: the code the agent wrote and the screens the navigator captured. Where the auditor and the scorer agree, the score stands. Where they disagree, a person reads the case.
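
The agree-or-escalate rule can be sketched as a small triage function. Here "agreement" is assumed to mean an exact match on the item's score, and the item names are hypothetical; the real comparison may be more tolerant.

```python
def triage(
    scorer: dict[str, int], auditor: dict[str, int]
) -> tuple[dict[str, int], list[str]]:
    """Split rubric items into scores that stand and items a person must read.

    Assumption: the scorer and auditor agree only when their scores match exactly.
    An item the auditor did not grade is treated as a disagreement.
    """
    standing: dict[str, int] = {}
    needs_review: list[str] = []
    for item, score in scorer.items():
        if auditor.get(item) == score:
            standing[item] = score
        else:
            needs_review.append(item)
    return standing, needs_review


# Hypothetical item names and scores.
scorer_scores = {"delete-uses-app-dialog": 1, "empty-state-present": 4}
auditor_scores = {"delete-uses-app-dialog": 1, "empty-state-present": 2}
standing, needs_review = triage(scorer_scores, auditor_scores)
```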

Some disagreements turn out to be scorer bugs, and we fix them. Some are genuinely ambiguous calls, worth flagging in public rather than papering over. The rule underneath the loop is plain: a number we can't defend doesn't get published.

Reading the score

A feature request goes in, and a checked score comes out. For an agent developer, the value is in the signal behind the number. A low score is explainable. It breaks down by category and by item, each tied to the decision the agent got wrong and the evidence behind the call. That makes it actionable, a clear read on where product-experience capability is weak and what to change.
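
To make the breakdown concrete, the sketch below groups item scores by category. It assumes a plain mean within each category, which is an illustrative choice and not a description of how PX-bench aggregates; the category and item names are likewise hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (category, item, score) triples for one run.
item_scores = [
    ("conventions", "delete uses app dialog", 1),
    ("conventions", "uses design tokens", 3),
    ("accessibility", "focus returns after close", 4),
]


def by_category(items: list[tuple[str, str, int]]) -> dict[str, float]:
    """Average item scores within each category (assumed aggregation rule)."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for category, _item, score in items:
        grouped[category].append(score)
    return {category: mean(scores) for category, scores in grouped.items()}


breakdown = by_category(item_scores)
```

Because each triple keeps its item name, a low category average can be traced back to the specific items that pulled it down.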

A high score that holds across versions also tells a story. Because the same scenario reruns cheaply, developers can keep pushing the agent on cost and speed, and a number that stays put tells you the change earned its savings without quietly degrading the experience the product ships. The score tells you what to fix and warns you when a change costs you quality.
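
A version-to-version comparison might look like the following sketch, which lists the items whose score dropped between two runs of the same scenario. The function name, item names, and scores are hypothetical.

```python
def regressions(
    before: dict[str, int], after: dict[str, int]
) -> dict[str, tuple[int, int]]:
    """Return items scored in both runs whose score went down,
    mapped to their (before, after) scores."""
    return {
        item: (before[item], after[item])
        for item in before
        if item in after and after[item] < before[item]
    }


# Hypothetical per-item scores for two versions of the same agent.
v1 = {"delete-uses-app-dialog": 4, "empty-state-present": 4}
v2 = {"delete-uses-app-dialog": 4, "empty-state-present": 2}
dropped = regressions(v1, v2)
```

An empty result is the "number that stays put" case: the cheaper or faster version held its quality on every item.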

Next comes the comparison this methodology was built for. Several agents go through the same scenario, scored the same way and published side by side — each number still carrying the evidence behind it.

Version 1.0 — June 2026. Reach us at hello@chordio.com.