Most product work isn't a blank canvas
Most evaluations of AI design capability run on isolated prompts: design a settings page, build a pricing table, create a sign-up flow. The agent starts in an empty editor, and the result is judged on its own.
Real product work almost never looks like that. A feature gets built inside an app that already exists, with its own conventions, its own design system, and the quirks any real codebase carries. The job is usually to add something to what is already there, in a way that fits in.
So that is what we test. Each scenario fixes one complete, opinionated app and asks the agent to build a feature inside its actual codebase. We call that app the host app.
Some questions need an app to answer
In PX-bench we split product-experience capability into eight categories. Two of them only mean something when there is an existing app to compare against.
Product fit asks whether a feature lands in the right place in the product: the right container, the right entry point, one consolidated surface instead of three scattered ones.
Convention adherence asks whether the agent built the way the app already builds, reusing its components and tokens and matching how it names and formats things.
Neither question has an answer in an empty editor. "Does this fit?" needs something to fit into; "does this match?" needs conventions to match. That is the whole reason we test inside an existing app: it is the only way to ask them.
What a host app is
A host app is a complete, opinionated, multi-screen application with its own component library, design tokens, and code patterns, plus the small idiosyncrasies real products pick up. It has several ways to view its data, real empty and loading and error states, keyboard shortcuts, a command palette, and a mobile layout.
We build these ourselves. They are not shipping products, but each one is built to behave like one, conventions and idiosyncrasies included. That matters because of how the test works: add a feature without breaking the conventions only means something if the conventions are real and applied consistently. So we design each host app as a genuine product in its own right, never reverse-engineered to fit the rubric. An app built to trip the agent up would only measure how well agents handle contrived traps. A coherent one measures what we actually care about: whether an agent can read an unfamiliar codebase, learn how it works, and extend it the same way.
Two decisions that only exist inside an app
Our first host app is a task tracker: list, board, and calendar views; filtering, sorting, and grouping; a command palette; the usual furniture of a real productivity tool. The first scenario asks the agent to add saved views: let a user name the current filter-and-sort setup, switch between saved ones, and edit or delete them.
The brief reads like a normal product ticket. What it deliberately does not do is tell the agent how to build any of it. The answers are in the existing app, waiting to be read. Two examples follow, each the kind of decision that simply does not exist until there is an existing codebase to make it in.
Where does the editor live? The brief asks for a way to create and edit a saved view. It never says whether that should be a modal, a drawer, a separate page, or an inline form. On a blank canvas, several of those are defensible and there is no way to be wrong. In this app, creating and editing a task already opens a right-side drawer. So a saved-view editor that opens as a modal is the weaker answer. Modals are fine in general. Here the app already settled how "create and edit" looks, and a modal breaks from it. The right call is knowable only by reading the app. That is Product fit, and here one answer is clearly better, precisely because the app exists.

The existing create-and-edit pattern is a right-side drawer. The brief never names a container; the app already did.
Where do saved views live? This one runs deeper. The app already remembers the current view (your active filters, sort, grouping, columns) by writing it to the browser's local storage. An agent reading the code finds that pattern quickly, and the tempting move is to extend it: saved views are just more view state, so store them the same way. But the brief asks for views that belong to the user's account and follow them across devices, and one browser's local storage cannot do that. A saved view is a named, persisted entity that belongs on the server, alongside the app's tasks, through the same data layer the app already uses for everything account-bound. The ephemeral current view stays local; the saved views do not. Telling those apart requires reading the existing code, recognizing two distinct patterns already in it, and judging which to reuse and which to deliberately set aside.
Two patterns already in the codebase, for two different jobs: the current view stays in local storage, a saved view follows the user to the server.
Neither decision is visible from the prompt alone. There is no existing drawer to match and no local-storage pattern to be tempted by until there is an app that made those choices first.
What this lets us see
The two decisions above are instances of a general payoff. Holding a complete app fixed reveals judgments an isolated prompt cannot reach:
- Does the feature fit the structure? The saved-view editor had a right container to find and an existing create flow to reuse. With no product, there is no structure to fit.
- Did the agent honor the conventions or reinvent them? Did it reach for the component and token that already exist and match the app's naming and formatting, or write reasonable-in-isolation code that quietly drifts from everything around it?
- Did the agent carry the feature through every surface it touches? Saved views reach the filter bar, every view type, and the command palette. Does the agent extend all of them, or just the one in front of it?
There is also a quieter signal worth naming. A host app with a hand-rolled design system (coherent, but not a library the agent has seen a thousand times) separates an agent that genuinely reads and absorbs a design system from one that pattern-matches to a popular one it has memorized. An app built on an off-the-shelf kit cannot tell those apart; a hand-rolled one can. It is one of the axes we expect to matter, and one a host app is well placed to test.
What's next
For now there is a single host app, and we get as much out of it as we can: more scenarios on the same app, many agents scored on each. A second app comes later. That ordering is intentional: it lets us get one host app right before building the next.
The apps that follow will each vary one axis on purpose: the design system (hand-rolled versus a familiar library), the domain (an over-represented one like task tracking versus something models see far less of), the flavor of the conventions. Each becomes a question we can actually answer rather than a confound we have to hope cancels out.
The through-line stays the same. The most consequential design judgments are relational: they are about fitting work into a product that already exists. That makes the product the agent works inside part of the test itself.
Version 1.0 — June 2026. Reach us at hello@chordio.com.