§ P Publications

Research notes from the lab.

These notes set out the taxonomies, methodological framings, and open research agenda that structure our evaluations. They are the substantive scaffolding that lives alongside the results: definitions, exclusions, decisions made in public, and the questions we intend to study next.

Published

№ 01
A taxonomy of product experience for evaluating what AI agents build
You can't measure what you can't define, and for AI agents, "design" covers three different things at once. So we define what we evaluate: the product experience, judged across eight categories, with the exclusions stated up front.
Methodology
v1.0
Jun 2, 2026
№ 02
Agentic design benchmarks need product context
Most design benchmarks hand the agent a blank canvas: design a settings page, build a pricing table. But real agents add features to apps that already exist, where a feature can land in the wrong place or reinvent a pattern the codebase already has. Those are the failures a blank canvas can never surface.
Methodology
v1.0
Jun 3, 2026
№ 03
A methodology for evaluating AI agents on product experience
A feature request goes in; a score we're prepared to publish comes out. This is the account of everything in between, and of why each step is built so the number holds up.
Methodology
v1.0
Jun 4, 2026
№ 04
The contamination problem
A contaminated benchmark still posts a confident score. It just stops measuring product experience. Four ways the host app gets compromised, none of them visible in the score.
Validity
v1.1
Jun 9, 2026
№ 05
The noise floor
Run the same agent on the same task ten times, change nothing, and the score swings from 0.66 to 0.79. So when one agent edges out another, how do you know the gap is real?
Validity
v1.0
Jun 9, 2026
№ 06
Scoring the scorers
A feature whose create flow returned an error scored a perfect 1.0. Two scoring models read the code and inferred it worked; the one that ran the app was outvoted. How do you catch a scorer that is confidently wrong?
Validity
v1.0
Jun 10, 2026
№ 07
The AI design evaluation landscape
Design benchmarks judge outputs with no product context. Coding benchmarks have the context and never judge the experience. The two halves of the measurement that matters already exist, in separate fields, and nothing public puts them together.
Field notes
v1.0
Jun 11, 2026