A taxonomy of product experience for evaluating what AI agents build

Definition precedes measurement

A great deal is being said about the design capability of AI agents. Very little of it agrees on what the phrase means.

One conversation calls an agent "great at design" because its outputs are pleasing to look at. Another does so because it produces code that closely matches a Figma reference. A third does so because it picks reasonable layouts for common UI tasks. These are three different claims about three different capabilities. The published benchmarks fragment along the same lines: UI-Bench measures aesthetic preference, the Design2Code line of work measures code-from-image fidelity, and nothing widely cited measures whether the agent makes good product judgments when adding a feature to a product that already exists.

This is the recurring shape of capability research in a new domain: before measurement is useful, the field needs a shared vocabulary for what's being measured. METR's work depends on an agreed unit of "task" before "task success rate" can mean anything; the UK AI Safety Institute's cyber evaluations establish a taxonomy of capability before measuring against it. Without that step, every result has to be re-argued from first principles.

We don't call the capability we evaluate "design capability", since the word design keeps pulling toward the surface, toward how a thing looks. A product is two things at once: its form (the elements it is built from and how they are organized) and its function (what it does, and whether that meets the need). What decides whether an agent's work is good is the combined effect of both as a person uses it, not the surface alone. We call that the product experience, the PX in PX-bench, our benchmark for it. It is wider than visual design, which is only the surface, and narrower than "good at coding," which is the whole of software skill. What we measure is the quality of the experience an agent produces, and the categories that follow are the structure we measure that quality against.

Product experience, defined

Product experience is what an agent's finished product is like to use, form and function together. What decides it is the product-facing judgment the agent applies when it builds a feature: what to build, where it fits, how it behaves, how it reads, how it holds up, and who can use it. We treat that judgment as a capability, a property of the agent under test that lets it reach a stated outcome ("add an export action to the dashboard without fragmenting the existing one") and generalize beyond any single task.

The categories

There are eight categories. Each names one decision, and findings are filed by which decision they test, not by which craft (visual, writing, interaction) happens to show the symptom. That rule, more than the list, is what keeps the categories from overlapping; we return to it after the list. The categories are peers: none is more important than the next, and the order they appear in carries no weight. Each comes with a short pass example and a short fail example.

1. Intent fidelity

Whether the agent built what was actually asked for, with every requested capability present and working on its core path, nothing important left out, nothing extra bolted on.

Pass. The brief asks for saved views — a way to name the current filter-and-sort setup and switch between saved ones. The agent lets a user create a view, switch to it, rename it, and delete it, and each works.

Fail. The brief asks for editable views, but the agent ships create and delete with no way to rename. Or it builds an elaborate sharing system no one asked for and leaves the core feature half-finished.

A basic question, and the easiest to take for granted: did the agent build the right thing at all? It can produce something clean and well-styled that quietly omits a capability the brief asked for. The scope is deliberately narrow, covering whether the capability is present and working on its core path, not whether it was done well. How well is what the other categories measure.

2. Product fit

Whether the feature attaches to the existing product in the right structural place and shape, including container and pattern choice, entry-point placement, and consolidation over fragmentation.

Pass. A single-property change uses in-place edit; a destructive confirmation uses a modal; a long setup uses a dedicated page. The "create" action reuses an existing create flow rather than adding a fourth, and new settings join the existing settings surface.

Fail. A drawer for a single-field setting; a modal for a twelve-step flow; a fourth create flow parallel to three that exist; a new top-level area duplicating one the product already has.

Product fit is the call a senior product designer can make from a wireframe alone, before any pixel is styled, and the one we most expect to differentiate agents. It chooses the structural pattern: where the feature lives and what shape it takes, read against the product that already exists. (Executing that choice in the app's established way is Convention adherence.) It is also least visible to single-prompt evaluation, since with no existing product there is no structural relation to get right or wrong.

3. Visual craft

Whether the surface is composed into a clear visual hierarchy through emphasis, grouping, alignment, spacing rhythm, and type-scale use that guide the eye to the right thing first.

Pass. The view name anchors each row; secondary metadata recedes; related controls are grouped and aligned on a consistent rhythm; the type scale signals what is primary at a glance.

Fail. Everything competes at one weight; the primary action is indistinguishable from a tertiary one; ragged alignment makes a simple list feel noisy; headings and body sit at the same size.

Composition, not component choice: an agent can reuse every component the design system offers and still stack them into a screen with no hierarchy, everything at one weight, nothing grouped, the eye with nowhere to land. (Reuse of the system's components and tokens is Convention adherence.) And it is legibility rather than taste, a question of whether the layout guides the eye to the right thing, not whether it is beautiful. This distinction matters. Legible is something senior product designers converge on; beautiful is preference, the one thing here we don't measure, and we leave it to benchmarks built for it, like UI-Bench.

4. Convention adherence

Whether the agent executes in the app's house style, reusing existing components and tokens, matching naming and file conventions, and formatting dates and numbers the way the rest of the app does.

Pass. A new form reuses the existing input components and an existing button variant; dates and numbers match the app's formatting; new files and symbols follow the established naming.

Fail. The feature introduces a fourth button shape; the new modal has a different close affordance than the existing three; a color is hardcoded where a token exists; a date is rendered raw where the app has a formatter.

The fidelity question: did the agent use what the app already had, or reinvent it? It is distinct from Product fit, which chooses the pattern (Convention is whether that choice is executed in the established way), and from Visual craft: an agent can match every convention and still compose poorly.

5. Pathway completeness

Whether the agent designs the whole interaction, with every path (cancel, back, undo, error-recovery, no dead-ends) and every required state (loading, empty, error, pending) present and leading somewhere.

Pass. Every destructive action has a cancel; every multi-step flow has back and error-recovery; loading, empty, and error states exist and each leads somewhere.

Fail. A create flow with no edit path; an error state that offers no way forward; a wizard's third step with no way back; a blank screen where the empty state should be.

Anecdotally the most common failure mode in AI-built product work: the agent finishes the happy path and calls it done, and the cancel buttons, the undo, the error recoveries are the first skipped. The category checks one thing: does each path and state exist and lead somewhere? It owns the presence of an error state and its way forward; whether the UI survives the condition that triggers it (a 500) is Resilience, and whether the pattern is the right one is Product fit. Mere presence is what makes it the most clear-cut category here.
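The presence check can be pictured as a small graph problem. What follows is a minimal sketch, not PX-bench's implementation: it assumes a hypothetical representation in which each state of a flow maps to the states its controls lead to. The state names and the helpers `REQUIRED_STATES`, `missing_states`, and `find_dead_ends` are invented for illustration.

```python
from typing import Dict, List, Set

# Required states named in the category definition above.
REQUIRED_STATES: Set[str] = {"loading", "empty", "error", "pending"}


def missing_states(flow: Dict[str, List[str]]) -> Set[str]:
    """Required states that the flow never defines."""
    return REQUIRED_STATES - set(flow)


def find_dead_ends(flow: Dict[str, List[str]], terminal: Set[str]) -> List[str]:
    """States with no way forward that are not intended end points."""
    return sorted(
        state
        for state, next_states in flow.items()
        if not next_states and state not in terminal
    )


# Hypothetical flow for a saved-views feature (illustrative assumption).
flow = {
    "list": ["create", "loading"],
    "loading": ["list", "empty", "error"],
    "empty": ["create"],
    "create": ["pending", "list"],  # "list" doubles as the cancel path
    "pending": ["list", "error"],
    "error": [],                    # no retry and no way back
}

print(missing_states(flow))                  # required states not defined
print(find_dead_ends(flow, terminal=set()))  # states that lead nowhere
```

In this example every required state is present, yet `error` is reported as a dead end: the state exists but leads nowhere, which is the kind of gap this category is meant to catch.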

6. Content & language

Whether the words are right, spanning label and microcopy quality, empty-state and error-message quality (not mere presence), and a voice that matches the product.

Pass. Empty-state copy explains what to do next; error messages are specific enough to suggest an action; labels and microcopy use the product's own terms.

Fail. A generic "Something went wrong" with no next step; an empty state left with placeholder text; a label that names a concept differently than the rest of the app.

Its own decision, separated from the states it lives inside: whether an empty state exists is Pathway, whether its copy is any good is Content. The two fail independently. You can get a well-placed empty state with useless copy, or sharp copy in a state that was never built. The test: if fixing it means changing only the words, it belongs here.

7. Resilience

Whether the rendered product holds together under realistic conditions, not just the demo condition. That means long and extreme content, the full range of viewports, rendering under API failure, a throttled network, and acceptable performance.

Pass. A 200-character title wraps; a 50-line list scrolls without distorting its container; the layout survives at 320px without horizontal scroll; the UI still renders when the API returns a 500.

Fail. A long title overflows its card; the mobile breakpoint hides the primary action behind a sidebar that won't open; the page jumps when content loads; a 500 produces a blank screen.

The failure modes a single screenshot can't reveal. A clean shot of a fresh, populated, desktop-sized screen says almost nothing about how the design behaves when the title runs long, the window is narrow, or the API errors. Those edges are exactly where AI-built UIs crack.
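To make "not just the demo condition" concrete, the conditions can be laid out as a grid. This is a sketch under stated assumptions: the 320px viewport, the 200-character title, and the 500 response come from the examples above, while the other values and the name `resilience_matrix` are hypothetical choices made for illustration.

```python
from itertools import product
from typing import Dict, List

VIEWPORT_WIDTHS_PX = [320, 768, 1440]  # 320 is from the text; others assumed
TITLE_LENGTHS = [20, 200]              # 200 is from the text; 20 assumed
API_STATUSES = [200, 500]              # 500 is from the text


def resilience_matrix() -> List[Dict[str, int]]:
    """Every combination of viewport, content length, and API status."""
    return [
        {"viewport_px": width, "title_chars": chars, "api_status": status}
        for width, chars, status in product(
            VIEWPORT_WIDTHS_PX, TITLE_LENGTHS, API_STATUSES
        )
    ]


conditions = resilience_matrix()
demo = {"viewport_px": 1440, "title_chars": 20, "api_status": 200}
edges = [c for c in conditions if c != demo]
print(len(conditions), len(edges))
```

The clean desktop screenshot corresponds to the single `demo` cell; every other cell in the grid is a condition that the screenshot says nothing about.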

8. Accessibility

Whether the produced UI meets established accessibility standards and practices.

Pass. Form inputs are programmatically labeled. Focus order follows visual order. Contrast meets WCAG AA against the actual rendered backgrounds. Modals trap focus and restore it on close. Interactive elements have appropriate roles.

Fail. A button is a <div> with an onClick. Focus is invisible. Contrast meets the spec against the mockup background but not the rendered one. Keyboard users can tab into a modal but not out.

The most rule-bound call of the eight. The standards are explicit, many of them are checkable automatically with tools such as axe-core, and the cost of getting it wrong is concrete, since people are shut out. It shades closest to "follow the rules" and furthest from open-ended product judgment, but that places it apart from the others, not beneath them. If anything the reverse: a lab that can't get accessibility right has no standing to grade the subtler calls.
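The contrast check is one place where the rule is fully explicit. The sketch below follows the WCAG 2.x definitions of relative luminance and contrast ratio, with the AA thresholds of 4.5:1 for normal text and 3:1 for large text. The function names and the example colors are our own assumptions, and the sketch handles opaque six-digit hex colors only, with no alpha blending.

```python
def _linearize(channel: int) -> float:
    """Convert an 8-bit sRGB channel to linear light (WCAG 2.x formula)."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of an opaque '#rrggbb' color."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(foreground: str, background: str) -> float:
    """Contrast ratio, from 1.0 (identical) to 21.0 (black on white)."""
    l1 = relative_luminance(foreground)
    l2 = relative_luminance(background)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


def meets_aa(foreground: str, background: str, large_text: bool = False) -> bool:
    """True if the pair meets the WCAG AA contrast threshold."""
    threshold = 3.0 if large_text else 4.5
    return contrast_ratio(foreground, background) >= threshold


# Hypothetical case: grey text specified against a white mockup,
# but rendered on an off-white panel.
print(meets_aa("#767676", "#ffffff"))  # mockup background
print(meets_aa("#767676", "#f5f5f5"))  # rendered background
```

The same text color passes against the white mockup background and fails against the off-white rendered one, which is why the check has to run against the actual rendered backgrounds.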

Every defect gets one home

A category scheme is only useful for reporting if every scored finding has one home: the categories have to be mutually exclusive and collectively exhaustive, or MECE. We don't enforce that at the category layer, where one observed problem (a low-contrast error message in a modal) is at once visual, copy, interaction, and accessibility. We enforce it one level down, at the atomic scored item, the way METR gets reviewers to agree: each item has one clear home and a yes-or-no (or graded) test, and it is filed by which decision it tests rather than which surface the symptom shows on. One observed problem can then decompose into several items. A poor error state is at once a Pathway gap (it's missing), a Resilience gap (it white-screens on a 500), a Content gap (the message is useless), and an Accessibility gap (it's never announced). Those are four findings, each with its own home, and they can be read together as a cross-cut when someone wants the "how good is error handling" view. Items that genuinely have two homes are resolved in the rubric, with a precedence rule and authored splits, not at the category level.
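The one-home rule and the cross-cut view can be sketched as a data structure. This is an illustration under our own assumptions, not the PX-bench rubric format: the `ScoredItem` fields, the item identifiers, the question wording, and the tag name `error-handling` are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import Dict, FrozenSet, List


class Category(Enum):
    INTENT_FIDELITY = "Intent fidelity"
    PRODUCT_FIT = "Product fit"
    VISUAL_CRAFT = "Visual craft"
    CONVENTION_ADHERENCE = "Convention adherence"
    PATHWAY_COMPLETENESS = "Pathway completeness"
    CONTENT_LANGUAGE = "Content & language"
    RESILIENCE = "Resilience"
    ACCESSIBILITY = "Accessibility"


@dataclass(frozen=True)
class ScoredItem:
    item_id: str                        # hypothetical identifier scheme
    category: Category                  # a single field, so exactly one home
    question: str                       # the yes-or-no test
    passed: bool
    tags: FrozenSet[str] = frozenset()  # cross-cut labels, not homes


ERR = frozenset({"error-handling"})
ITEMS = [
    ScoredItem("e1", Category.PATHWAY_COMPLETENESS,
               "Does the error state offer a way forward?", False, ERR),
    ScoredItem("e2", Category.RESILIENCE,
               "Does the UI still render when the API returns a 500?", False, ERR),
    ScoredItem("e3", Category.CONTENT_LANGUAGE,
               "Is the error message specific enough to suggest an action?", False, ERR),
    ScoredItem("e4", Category.ACCESSIBILITY,
               "Is the error announced to assistive technology?", False, ERR),
    ScoredItem("v1", Category.VISUAL_CRAFT,
               "Is the primary action visually distinct?", True),
]


def by_category(items: List[ScoredItem]) -> Dict[Category, List[ScoredItem]]:
    """Report view: each item appears under exactly one category."""
    grouped: Dict[Category, List[ScoredItem]] = defaultdict(list)
    for item in items:
        grouped[item.category].append(item)
    return dict(grouped)


def cross_cut(items: List[ScoredItem], tag: str) -> List[ScoredItem]:
    """Cross-cut view: all items carrying a tag, whatever their home."""
    return [item for item in items if tag in item.tags]


for item in cross_cut(ITEMS, "error-handling"):
    print(item.category.value, "-", item.question)
```

Because `category` is a single field, no item can be counted twice in the per-category report, while the tag lets the four error-state findings be pulled back together on demand.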

For the feature-extension work we score today, the categories are exhaustive by construction: they tile the whole act of adding a feature (build it, fit it, compose it, match the house style, finish its paths and states, word it, harden it, make it accessible) with no miscellaneous bucket. The one known seam, interaction feel, we name below.

The deliberate exclusions

Taste, and the calls only the stakeholder can make. Some judgments have no rubric-answer: "our brand should feel more reserved," "this copy is more us." Reasonable senior product designers don't converge on them, and measurement needs convergence, so they sit outside the scheme by construction. That includes pure aesthetic preference, which is why beautiful is not a category and legible (Visual craft) is.

Marketing and editorial content. Long-form copy, landing-page narrative, and editorial voice are legitimate and assessable, but there is no marketing surface to get right when the task is wiring a feature into an existing app. We would add a content-oriented scenario before claiming to measure it.

Finally, our scenarios run through two things that are confounds, not part of product experience: general coding ability (whether the code compiles and runs, as opposed to whether the product judgment is sound) and fidelity to a reference design (reproducing a given mockup rather than deciding what to build). Each is genuine, each is measured well elsewhere, and neither is what we mean by product experience. We scope them down or hold them fixed so they don't move the score.

A skeleton, and one open gap

This is a first version, not a complete grammar of product experience, and it stops at the categories. Turning each into specific items, tested and scored against specific apps, is a separate layer of the work.

The eight categories leave one known gap: interaction feel (the responsiveness and weight of an interaction, motion, the touches that make a UI feel alive) has no category yet. For now it spreads across Product fit, Pathway, and Resilience, which holds up on a feature like saved views that involves little novel interaction. A more interaction-heavy task would need its own category, and we would add one then.

Version 1.0 — June 2026. Reach us at hello@chordio.com.