Why definition precedes measurement
A great deal is being said about the design capability of AI agents. Very little of it agrees on what the phrase means.
One conversation calls an agent "great at design" because its outputs are pleasing to look at. Another, because it produces code that matches a Figma reference closely. A third, because it picks reasonable layouts for common UI tasks. These are three different claims about three different capabilities. The published benchmarks fragment along the same lines: UI-Bench measures aesthetic preference, the Design2Code line of work measures code-from-image fidelity, and nothing widely cited measures whether the agent makes good product judgments when adding a feature to a product that already exists.
This is the recurring shape of capability research in a new domain: before measurement is useful, the field needs a shared vocabulary for what's being measured. METR's work depends on an agreed unit of "task" before "task success rate" can mean anything; the UK AI Safety Institute's cyber evaluations establish a taxonomy of capability before measuring against it. Without that step, every result has to be re-argued from first principles.
We don't call the capability we evaluate "design capability": the word design keeps pulling toward the surface, toward how a thing looks. The capability that actually decides whether an agent's work is good is wider than that — it covers everything the agent ships, of which the visual surface is one part — and we give it its own name: product experience, the PX in PX-bench. The categories that follow are the structure we measure it against, in this first scenario and the ones after it.
What we mean by product experience
Product experience is the product-facing judgment an agent applies when it builds a feature: what to build, where it fits, how it behaves, how it reads, how it holds up, and who can use it. We treat it as a capability — a property of the agent under test that reaches a stated outcome ("add an export action to the dashboard without fragmenting the existing one") and generalizes beyond any single task. The test is not whether a screen is attractive in isolation, but whether the product the agent hands back is one a person can actually move through.
The product the agent works in is the context. Our scenarios hold it fixed — a held-out app with a settled house style — and vary the agent, because that is where the judgments that matter most become checkable: whether a feature fits and whether it matches the app's conventions can only be scored against a product that already exists, exactly the setting single-prompt benchmarks skip. How we hold the context fixed, and keep general coding ability and reference-design fidelity out of the score, are questions for the methodology piece.
The categories
Eight categories, ordered as a ladder — the deep, judgment-heavy calls first, the mechanical checks last. Each names one decision, filed by which decision it tests, not by which craft (visual, writing, interaction) happens to show it. That rule, more than the list, is what keeps the categories from overlapping; we return to it after the list. Each has a short pass-and-fail example.
1. Intent fidelity
Whether the agent built what was actually asked for — every requested capability present and working on its core path, nothing important left out, nothing extra bolted on.
Pass. The brief asks for saved views — a way to name the current filter-and-sort setup and switch between saved ones. The agent lets a user create a view, switch to it, rename it, and delete it, and each works.
Fail. The brief asks for editable views, but the agent ships create and delete with no way to rename. Or it builds an elaborate sharing system no one asked for and leaves the core feature half-finished.
The most basic question on the list, and the easiest to take for granted: did the agent build the right thing at all? It can produce something clean and well-styled that quietly omits a capability the brief asked for. The scope is deliberately narrow — present and working on its core path, not whether it was done well. How well is what the categories below measure.
2. Product fit
Whether the feature attaches to the existing product in the right structural place and shape — container and pattern choice, entry-point placement, and consolidation over fragmentation.
Pass. A single-property change uses in-place edit; a destructive confirmation uses a modal; a long setup uses a dedicated page. The "create" action reuses an existing create flow rather than adding a fourth, and new settings join the existing settings surface.
Fail. A drawer for a single-field setting; a modal for a twelve-step flow; a fourth create flow parallel to three that exist; a new top-level area duplicating one the product already has.
Product fit is the call a senior product designer can make from a wireframe alone, before any pixel is styled — and the one we most expect to differentiate agents. It chooses the structural pattern: where the feature lives and what shape it takes, read against the product that already exists. (Executing that choice in the app's established way is Convention adherence.) It is also least visible to single-prompt evaluation — with no existing product, there is no structural relation to get right or wrong.
3. Visual craft
Whether the surface is composed into a clear visual hierarchy — emphasis, grouping, alignment, spacing rhythm, and type-scale use that guide the eye to the right thing first.
Pass. The view name anchors each row; secondary metadata recedes; related controls are grouped and aligned on a consistent rhythm; the type scale signals what is primary at a glance.
Fail. Everything competes at one weight; the primary action is indistinguishable from a tertiary one; ragged alignment makes a simple list feel noisy; headings and body sit at the same size.
Composition, not component choice: an agent can reuse every component the design system offers and still stack them into a screen with no hierarchy — everything at one weight, nothing grouped, the eye with nowhere to land. (Reuse of the system's components and tokens is Convention adherence.) And it is legibility, not taste — whether the layout guides the eye to the right thing, not whether it is beautiful. That line is load-bearing: legible is something senior product designers converge on; beautiful is preference, the one thing here we don't measure — we leave it to benchmarks built for it, like UI-Bench.
4. Convention adherence
Whether the agent executes in the app's house style — reusing existing components and tokens, matching naming and file conventions, and formatting dates and numbers the way the rest of the app does.
Pass. A new form reuses the existing input components and an existing button variant; dates and numbers match the app's formatting; new files and symbols follow the established naming.
Fail. The feature introduces a fourth button shape; the new modal has a different close affordance than the existing three; a color is hardcoded where a token exists; a date is rendered raw where the app has a formatter.
The fidelity question: did the agent use what the app already had, or reinvent it? It is distinct from Product fit, which chooses the pattern — Convention is whether that choice is executed in the established way — and from Visual craft: an agent can match every convention and still compose poorly.
5. Pathway completeness
Whether the agent designs the whole interaction — every path (cancel, back, undo, error-recovery, no dead-ends) and every required state (loading, empty, error, pending) present and leading somewhere.
Pass. Every destructive action has a cancel; every multi-step flow has back and error-recovery; loading, empty, and error states exist and each leads somewhere.
Fail. A create flow with no edit path; an error state that offers no way forward; a wizard's third step with no way back; a blank screen where the empty state should be.
Anecdotally the most common failure mode in AI-built product work: the agent finishes the happy path and calls it done, and the cancel buttons, the undo, the error recoveries are the first skipped. The category checks one thing — does each path and state exist and lead somewhere? It owns the presence of an error state and its way forward; whether the UI survives the condition that triggers it (a 500) is Resilience, and whether the pattern is the right one is Product fit. Mere presence is what makes it the most clear-cut category here.
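The presence check this category describes can be sketched as a small function. This is a minimal illustration under stated assumptions, not PX-bench's scoring code: the state names come from the category description above, while the `screen` mapping, the `pathway_gaps` function, and the exit labels are hypothetical.

```python
# Hypothetical sketch: the states a screen is expected to define, taken
# from the category description (loading, empty, error, pending).
REQUIRED_STATES = ("loading", "empty", "error", "pending")


def pathway_gaps(screen: dict[str, list[str]]) -> list[str]:
    """Return presence gaps for one screen.

    `screen` maps each implemented state to the exits it offers
    (for example a retry button or a back link). A state that is absent
    is a missing-state gap; a state with no exits is a dead end.
    """
    gaps = []
    for state in REQUIRED_STATES:
        if state not in screen:
            gaps.append(f"missing state: {state}")
        elif not screen[state]:
            gaps.append(f"dead end: {state}")
    return gaps


# Illustrative input: the error state exists but offers no way forward,
# and the pending state was never built.
saved_views = {
    "loading": ["resolves to list"],
    "empty": ["create first view"],
    "error": [],
}
print(pathway_gaps(saved_views))
```

The check asks only whether a state exists and leads somewhere. It says nothing about whether the copy is any good (Content & language) or whether the screen survives the failure that triggers it (Resilience).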
6. Content & language
Whether the words are right — label and microcopy quality, empty-state and error-message quality (not mere presence), and a voice that matches the product.
Pass. Empty-state copy explains what to do next; error messages are specific enough to suggest an action; labels and microcopy use the product's own terms.
Fail. A generic "Something went wrong" with no next step; an empty state left with placeholder text; a label that names a concept differently than the rest of the app.
Its own decision, separated from the states it lives inside: whether an empty state exists is Pathway, whether its copy is any good is Content. The two fail independently — a well-placed empty state with useless copy, or sharp copy in a state that was never built. The test: if fixing it means changing only the words, it belongs here.
7. Resilience
Whether the rendered product holds together under realistic conditions, not just the demo condition — long and extreme content, every viewport, rendering under API failure, throttled network, and performance.
Pass. A 200-character title wraps; a 50-line list scrolls without distorting its container; the layout survives at 320px without horizontal scroll; the UI still renders when the API returns a 500.
Fail. A long title overflows its card; the mobile breakpoint hides the primary action behind a sidebar that won't open; the page jumps when content loads; a 500 produces a blank screen.
The failure modes a single screenshot can't reveal. A clean shot of a fresh, populated, desktop-sized screen says almost nothing about how the design behaves when the title runs long, the window is narrow, or the API errors — and those edges are exactly where AI-built UIs crack.
8. Accessibility
Whether the produced UI meets established accessibility standards and practices.
Pass. Form inputs are programmatically labeled. Focus order follows visual order. Contrast meets WCAG AA against the actual rendered backgrounds. Modals trap focus and restore it on close. Interactive elements have appropriate roles.
Fail. A button is a <div> with an onClick. Focus is invisible. Contrast meets the spec against the mockup background but not the rendered one. Keyboard users can tab into a modal but not out.
The compliance end of the ladder: the most rule-bound call here, the standards explicit and checkable (axe-core), the cost of getting it wrong concrete — real users are shut out. It comes last because it shades closest to "follow the rules" and furthest from open-ended product judgment — not because it matters least. If anything the reverse: a lab that can't get accessibility right has no standing to grade the subtler calls above it.
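The contrast check in the pass and fail examples is one of the more mechanical items, and it can be sketched directly from the WCAG 2.x definitions of relative luminance and contrast ratio. The formula and the AA thresholds (4.5:1 for normal text, 3:1 for large text) come from WCAG; the function names and the sample colors are assumptions made for illustration, and a real audit would more likely lean on a tool such as axe-core.

```python
def _linear(channel: int) -> float:
    """Convert one 8-bit sRGB channel to its linear value (WCAG 2.x)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a color given as '#rrggbb'."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)


def contrast_ratio(fg: str, bg: str) -> float:
    """WCAG contrast ratio between two colors, from 1.0 up to 21.0."""
    l1, l2 = relative_luminance(fg), relative_luminance(bg)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


def meets_aa(fg: str, bg: str, large_text: bool = False) -> bool:
    """True if the pair meets the WCAG AA contrast threshold."""
    return contrast_ratio(fg, bg) >= (3.0 if large_text else 4.5)


# Hypothetical colors: the same gray text checked against a white mockup
# background and against a light-gray background as actually rendered.
TEXT = "#767676"
print(meets_aa(TEXT, "#ffffff"), meets_aa(TEXT, "#eeeeee"))
```

With these sample colors the text clears AA on the white mockup background and falls short on the light-gray rendered one, which is the failure the example above describes: the check has to run against the background the user actually sees.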
How every defect gets one home
A category scheme is only useful for reporting if every scored finding has one home: the categories have to be mutually exclusive and collectively exhaustive (MECE). We don't enforce that at the category layer, where one observed problem, say a low-contrast error message in a modal, is at once visual, copy, interaction, and accessibility. We enforce it one level down, at the atomic scored item, the way METR gets reviewers to agree: each item has one clear home and a yes-or-no (or graded) test, filed by which decision it tests rather than which surface the symptom shows on. One observed problem can then decompose into several items. A poor error state is at once a Pathway gap (it's missing), a Resilience gap (it white-screens on a 500), a Content gap (the message is useless), and an Accessibility gap (it's never announced). Those are four findings, each with its own home, and they can be read together as a cross-cut when someone wants the "how good is error handling" view. Genuinely two-home items get a precedence rule and authored splits in the rubric, not here.
For the feature-extension work we score today, the categories are exhaustive by construction: they tile the whole act of adding a feature — build it, fit it, compose it, match the house style, finish its paths and states, word it, harden it, make it accessible — with no miscellaneous bucket. The one known seam, interaction feel, we name below.
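The item-level rule in this section can be sketched as a data structure: each scored item carries exactly one home category, and cross-cutting views are built from tags rather than from shared ownership. This is a minimal sketch under stated assumptions, not the PX-bench rubric format. The category names come from the list above; `ScoredItem`, its fields, the item ids, the questions, and the `error-handling` tag are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    """The eight categories; there is deliberately no miscellaneous member."""
    INTENT_FIDELITY = "Intent fidelity"
    PRODUCT_FIT = "Product fit"
    VISUAL_CRAFT = "Visual craft"
    CONVENTION_ADHERENCE = "Convention adherence"
    PATHWAY_COMPLETENESS = "Pathway completeness"
    CONTENT_LANGUAGE = "Content & language"
    RESILIENCE = "Resilience"
    ACCESSIBILITY = "Accessibility"


@dataclass(frozen=True)
class ScoredItem:
    item_id: str
    home: Category            # a single field, so an item has exactly one home
    question: str             # the yes-or-no test for this item
    passed: bool
    cross_cuts: frozenset = frozenset()  # tags for reporting views only


def by_category(items):
    """Group items by home; each item lands in exactly one group."""
    report = defaultdict(list)
    for item in items:
        report[item.home].append(item)
    return dict(report)


def cross_cut(items, tag):
    """Collect items sharing a reporting tag, whatever their home."""
    return [item for item in items if tag in item.cross_cuts]


# Illustrative decomposition of one observed problem: a poor error state.
TAG = frozenset({"error-handling"})
items = [
    ScoredItem("path-01", Category.PATHWAY_COMPLETENESS,
               "Does an error state exist and lead somewhere?", False, TAG),
    ScoredItem("res-01", Category.RESILIENCE,
               "Does the UI still render when the API returns a 500?", False, TAG),
    ScoredItem("cont-01", Category.CONTENT_LANGUAGE,
               "Is the error message specific enough to suggest an action?", False, TAG),
    ScoredItem("a11y-01", Category.ACCESSIBILITY,
               "Is the error announced to assistive technology?", False, TAG),
]
print([item.item_id for item in cross_cut(items, "error-handling")])
```

Keeping `home` a single value and `cross_cuts` a separate set of tags is one way to keep category totals free of double counting while still allowing the "how good is error handling" view.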
What we deliberately exclude
Taste, and the calls only the stakeholder can make. Some judgments have no rubric-answer: "our brand should feel more reserved," "this copy is more us." Reasonable senior product designers don't converge on them, and measurement needs convergence, so they sit outside the scheme by construction — including pure aesthetic preference, which is why beautiful is not a category and legible (Visual craft) is.
Marketing and editorial content. Long-form copy, landing-page narrative, editorial voice — real and assessable, but there is no marketing surface to get right when the task is wiring a feature into an existing app. We would add a content-oriented scenario before claiming to measure it.
Finally, two things our scenarios run through that are confounds, not part of product experience: general coding ability (whether the code compiles and runs, as opposed to whether the product judgment is sound) and fidelity to a reference design (reproducing a given mockup rather than deciding what to build). Each is real, each is measured well elsewhere, and neither is what we mean by product experience. We scope them down or hold them fixed; how is for the methodology piece.
A skeleton, and one open gap
This is a first version, not a complete grammar of product experience. The rubric is where these categories become specific items tested against specific apps, and the methodology piece covers how each item is actually scored.
The eight categories leave one known gap: interaction feel — the responsiveness and weight of an interaction, motion, the touches that make a UI feel alive — has no category yet. For now it spreads across Product fit, Pathway, and Resilience, which holds up on a feature like saved views that involves little novel interaction. A more interaction-heavy task would need its own category, and we would add one then.
Version 1.0 — June 2026. Reach us at hello@chordio.com.