Research notes from the lab.
The taxonomies, methodological framings, and open research agenda that structure our evaluations. Substantive scaffolding that lives alongside the results: definitions, exclusions, decisions made in public, and the questions we intend to study next.
- № 01
A taxonomy of product experience for evaluating what AI agents build
You can't measure what you can't define, and for AI agents, "design" covers three different things at once. So we define what we evaluate: the product experience, judged across eight categories, with the exclusions stated up front.
Methodology · v1.0 · Jun 2, 2026
- № 02
Agentic design benchmarks need product context
Most design benchmarks hand the agent a blank canvas: design a settings page, build a pricing table. But real agents add features to apps that already exist, where a feature can land in the wrong place or reinvent a pattern the codebase already has. Those are the failures a blank canvas can never surface.
Methodology · v1.0 · Jun 3, 2026
- № 03
A methodology for evaluating AI agents on product experience
A feature request goes in; a score we're prepared to publish comes out. This is the account of everything in between, and of why each step is built so the number holds up.
Methodology · v1.0 · Jun 4, 2026
- № 04
The contamination problem
A contaminated benchmark still posts a confident score. It just stops measuring product experience. Four ways the host app gets compromised, none of them visible in the score.
Validity · v1.1 · Jun 9, 2026
- № 05
The noise floor
Run the same agent on the same task ten times, change nothing, and the score swings from 0.66 to 0.79. So when one agent edges out another, how do you know the gap is real?
Validity · v1.0 · Jun 9, 2026
- № 06
Scoring the scorers
A feature whose create flow returned an error scored a perfect 1.0. Two scoring models read the code and inferred it worked; the one that ran the app was outvoted. How do you catch a scorer that is confidently wrong?
Validity · v1.0 · Jun 10, 2026
- № 07
The AI design evaluation landscape
Design benchmarks judge outputs with no product context. Coding benchmarks have the context and never judge the experience. The two halves of the measurement that matters already exist, in separate fields, and nothing public puts them together.
Field notes · v1.0 · Jun 11, 2026