Product-experience performance report · Confidential

GPT-5.5

Scenario: Task tracker / add saved views · Evaluated 2026-06-17 · 5 epochs · Build cost 766k tokens · ~$1.90

GPT-5.5 builds the right feature and makes it robust, then reaches for its own components instead of the ones the app already ships.

Its biggest gap is product fit: it builds a custom popover for create and edit instead of reusing the app's drawer, and it skips a real error path on API failure. Accessibility and content trail where the feature leans on its own UI rather than the app's. Intent, resilience, and code conventions are at or near the comparative reference.

Overall score: 82 / 100 (± 4 across 5 epochs) · Strong


Executive summary

The 60-second read. Everything below is evidence for these claims.

The read

GPT-5.5 clears the hard part. All four saved-view operations work, persistence is server-backed and account-bound, and the feature is responsive and survives an API failure without taking the page down. On intent fidelity (99) and resilience (92) it matches the comparative reference; visual craft (86) and code conventions (85) sit just behind it.

The gaps are in fitting into this app and finishing the unhappy paths, not in capability. For create and edit, GPT-5.5 builds its own popover instead of reusing the app's Drawer (product fit 73 vs the reference's 97), and on a 500 the saved-views list falls through to the empty state with no inline error or retry (error-state ≈ 2/100 in every epoch). That same custom container is also why keyboard accessibility is the score that varies most from run to run.

Highest-leverage fix: bias the agent to import the app's components before building its own. In 1 of the 5 epochs it did reuse the drawer and product fit jumped to 97, so this is reachable, not a ceiling.

Capability profile

GPT-5.5 vs a comparative SOTA model (│)

Cat-1 · Intent fidelity
Cat-2 · Product fit
Cat-3 · Visual craft
Cat-4 · Convention adherence
Cat-5 · Pathway completeness
Cat-6 · Content & language
Cat-7 · Resilience
Cat-8 · Accessibility

Top findings

High · Builds its own components instead of reusing the app's
High · No real error path when the API fails
Med · Undo covers delete, but not reconfigure
Med · Accessibility is solid where it reuses, shaky where it doesn't
Med · Ships the unhappy-path states, but rebuilds them

Recommendations

Import before build

Feed the host app's component inventory in and route create/edit through the existing drawer. Moves product fit and inherits the focus trap for free.

Make the error path first-class

Require an explicit error / empty / loading branch that reuses the app's InlineError. Closes the sharpest, most consistent defect.

AA contrast + keyboard gate

Run axe and a keyboard pass before the agent reports done, and lift muted text to the AA token. Steadies the most volatile score.
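The AA contrast part of that gate can be checked numerically. The sketch below implements the standard WCAG contrast-ratio formula; the function names are illustrative, and the report does not state which colours or tokens the app uses, so any hex values passed in are assumptions.

```typescript
// Sketch of the WCAG 2.x contrast-ratio formula. Function names are
// illustrative; the task tracker's actual colour tokens are not known here.

/** Convert one 8-bit sRGB channel (0-255) to its linear-light value. */
function channelToLinear(value8bit: number): number {
  const c = value8bit / 255;
  return c <= 0.04045 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

/** Relative luminance of a colour written as "#rrggbb". */
function relativeLuminance(hex: string): number {
  const h = hex.replace(/^#/, "");
  if (!/^[0-9a-fA-F]{6}$/.test(h)) {
    throw new Error(`Expected a #rrggbb colour, got "${hex}"`);
  }
  const r = channelToLinear(parseInt(h.slice(0, 2), 16));
  const g = channelToLinear(parseInt(h.slice(2, 4), 16));
  const b = channelToLinear(parseInt(h.slice(4, 6), 16));
  return 0.2126 * r + 0.7152 * g + 0.0722 * b;
}

/** Contrast ratio between two colours, from 1 (identical) up to 21 (black on white). */
function contrastRatio(foreground: string, background: string): number {
  const a = relativeLuminance(foreground);
  const b = relativeLuminance(background);
  return (Math.max(a, b) + 0.05) / (Math.min(a, b) + 0.05);
}

/** AA asks for at least 4.5:1 on normal text and 3:1 on large text. */
function meetsAA(foreground: string, background: string, largeText = false): boolean {
  return contrastRatio(foreground, background) >= (largeText ? 3 : 4.5);
}
```

As a reference point, `#767676` on white comes out at roughly 4.54:1 and passes AA for normal text, while the slightly lighter `#777777` comes out at roughly 4.48:1 and fails. Muted text colours tend to sit close to that boundary.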

Capability profile

Eight categories, the locked PX-bench taxonomy. Each score is the mean across 5 epochs, shown against a comparative SOTA model run on the same scenario and harness (the tick).

Cat-1 · Intent fidelity
Did it build what was asked, working in the happy path?

Cat-2 · Product fit
Does the feature attach to the app in the right place and shape?

Cat-3 · Visual craft
Is the surface composed into a clear visual hierarchy?

Cat-4 · Convention adherence
Does it work in the app's house style: reuse, tokens, naming?

Cat-5 · Pathway completeness
Are all paths and states present and reachable?

Cat-6 · Content & language
Are the words right: labels, errors, empty copy, voice?

Cat-7 · Resilience
Does it hold together under long content, small screens, failure?

Cat-8 · Accessibility
Keyboard, contrast, labels, axe violations.

Bar = GPT-5.5 (mean of 5 epochs). │ = comparative SOTA model. Color reflects band: green ≥ 75, amber 50-74, red < 50.
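The banding rule can be written as a small function. This is a sketch of the thresholds as stated; how the report treats a fractional mean between 74 and 75 is not specified, so placing it in amber here is an assumption.

```typescript
type Band = "green" | "amber" | "red";

// Thresholds as stated in the report: green >= 75, amber 50-74, red < 50.
// Assumption: a fractional mean such as 74.6 stays amber (no rounding first).
function bandFor(score: number): Band {
  if (score >= 75) return "green";
  if (score >= 50) return "amber";
  return "red";
}
```

Under this rule, intent fidelity (99) is green, product fit (73) is amber, and the error-state item (about 2) is red.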

Findings

The patterns behind the scores, ranked by user-facing impact. Each pairs a concrete failure with the fix.

High · Builds its own components instead of reusing the app's

The pattern

For create and edit, GPT-5.5 builds a custom absolutely-positioned popover (a <div role="dialog"> in SavedViewsControl.tsx) rather than importing the app's right-side <Drawer> — the exact container new-task and edit-task use. It does reliably reuse Toast (for the undo) and ConfirmDialog (for delete), but it rebuilds the container, the loading line, and the inline error. That parallel container is the single biggest reason product fit lands at 73 against the comparative reference's 97.

Why it matters

A hand-rolled container drifts from the rest of the app and is the visible 'this feature feels different' tell. It also drags keyboard accessibility down, because the custom popover doesn't inherit the drawer's focus trap (see the accessibility finding). In 1 of the 5 epochs GPT-5.5 did reuse the drawer and product fit jumped to 97, so this is reachable, not a capability ceiling.

Evidence

Create / edit live in a custom popover instead of the app's drawer:

// app/features/tasks/SavedViewsControl.tsx
- <div role="dialog" className="absolute ..."> // custom popover, no focus trap
+ import { Drawer } from '@/components/Drawer' // app uses Drawer for new/edit task
+ <Drawer open={open} onOpenChange={setOpen}> ... </Drawer>

Rubric items

R-2.1, R-4.1, R-8.2, G-2, G-15

Confidence

High. Component reuse is a deterministic check, and all three scorers independently flagged the create/edit container as a custom popover rather than the app's drawer.
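To illustrate why component reuse lends itself to a deterministic check, here is a minimal sketch of a source scan. It is not the harness's actual check, which the report does not describe; the inventory shape and the function name are hypothetical. It flags raw DOM elements that carry an ARIA role the host app already covers with a component.

```typescript
// Hypothetical inventory entry: which host-app component covers which ARIA role.
interface InventoryEntry {
  component: string;    // e.g. "Drawer"
  importPath: string;   // e.g. "@/components/Drawer"
  replacesRole: string; // role a hand-rolled equivalent would carry, e.g. "dialog"
}

interface ReuseViolation {
  role: string;
  suggestedComponent: string;
  suggestedImport: string;
}

// Scan TSX source text for raw elements that duplicate an inventory component.
function findHandRolledContainers(
  source: string,
  inventory: InventoryEntry[]
): ReuseViolation[] {
  const violations: ReuseViolation[] = [];
  for (const entry of inventory) {
    // A lowercase tag name means a raw DOM element, not a React component.
    const rawElementWithRole = new RegExp(
      `<[a-z][a-z0-9]*\\b[^>]*\\brole=["']${entry.replacesRole}["']`
    );
    if (rawElementWithRole.test(source)) {
      violations.push({
        role: entry.replacesRole,
        suggestedComponent: entry.component,
        suggestedImport: entry.importPath,
      });
    }
  }
  return violations;
}
```

Run against the `<div role="dialog" ...>` line from the evidence with a `Drawer` entry in the inventory, this returns one violation. Run against the `<Drawer ...>` version it returns none. A text scan like this is only a first pass; a real check would more likely work on the parsed syntax tree.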

High · No real error path when the API fails

The pattern

Saved views load through a failable useQuery, but the code never handles isError — it does savedViews = data ?? [], so a 500 produces an empty array and the list renders the empty state, 'No saved views yet.' Where an error message does appear it is a bare line of text with no retry and not the app's InlineError. This is the sharpest defect in the run: error-state scored about 2 out of 100, in every one of the 5 epochs.

Why it matters

A failed load looks identical to 'you have no saved views' — the worst kind of failure, because it quietly implies data loss and offers no way forward. Real users hit 500s eventually, and right now the feature has no honest answer for them.

Evidence

On a 500 the list falls through to the empty state instead of an error:

// app/features/tasks/SavedViewsControl.tsx
- const savedViews = savedViewsQuery.data ?? [] // 500 → [] → "No saved views yet"
+ if (savedViewsQuery.isError)
+   return <InlineError onRetry={() => savedViewsQuery.refetch()} />
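The same fix can be expressed as an explicit state resolution that keeps "failed" and "empty" apart. This is a minimal sketch, not the app's code: `QuerySnapshot` models only the fields needed here, `isLoading` is an assumed field alongside the `data` and `isError` fields seen in the diff, and `resolveListState` is a hypothetical helper name.

```typescript
// Minimal model of a query result. data and isError appear in the report's
// diff; isLoading is an assumption about the query object's shape.
interface QuerySnapshot<T> {
  data: T | undefined;
  isLoading: boolean;
  isError: boolean;
}

type ListState<T> =
  | { kind: "loading" }
  | { kind: "error" }
  | { kind: "empty" }
  | { kind: "ready"; items: T[] };

// Check the error flag first so a failed load can never read as "no items".
function resolveListState<T>(query: QuerySnapshot<T[]>): ListState<T> {
  if (query.isError) return { kind: "error" };
  const data = query.data;
  if (query.isLoading || data === undefined) return { kind: "loading" };
  if (data.length === 0) return { kind: "empty" };
  return { kind: "ready", items: data };
}
```

With this shape, a 500 resolves to `error`, and the component can render the app's InlineError with a retry. Only a successful load that returns an empty array resolves to `empty`.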

Rubric items

R-5.6, R-7.3, G-8

Confidence

High. All three scorers, every epoch — the most consistent finding in the run. The page itself survives the 500 (resilience R-7.3 = 99); it just shows no error UI.

Category detail

Every rubric item, what it tested, and its score (mean across 5 epochs). Each AUTO-AGENT item is scored by three independent scorers (Anthropic, OpenAI, Google); the report records each provider's verdict for the item, with its reasoning, and where they agreed or split.

How this was measured

The eval

Scenario: Add a “saved views” feature to a working 8-screen task tracker with its own design system.
What's tested: Whether the agent designs a complete product experience: intent, fit, craft, conventions, pathways, content, resilience, accessibility.
Scoring: Three tiers. 4 deterministic script checks, 26 cross-provider LLM-judge items, 2 human-panel items.
Cross-provider: Every judge item is scored independently by Anthropic, OpenAI, and Google models. The reported score is their median; we publish the spread.
Comparison: A comparative SOTA model run on the same scenario and harness (6 epochs), shown as the reference tick on each bar. Shown without naming the specific model.
Samples: 5 epochs of GPT-5.5; category and overall scores are means, with the spread shown as ± on the overall.
Build cost: 722k in + 44k out tokens, ~$1.90 per run. The agent's own spend producing the feature, not the scorers'.
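The aggregation described above (median across the three providers per judge item, then a mean across epochs) can be sketched as follows. The report does not define how the published spread is computed, so taking it as highest minus lowest provider score is an assumption, and the numbers in the usage note are made up for illustration.

```typescript
// One judge item in one epoch, scored independently by the three providers.
interface ProviderScores {
  anthropic: number;
  openai: number;
  google: number;
}

// Median of exactly three values, without sorting.
function medianOfThree(s: ProviderScores): number {
  const lowPair = Math.min(s.anthropic, s.openai);
  const highPair = Math.max(s.anthropic, s.openai);
  return Math.max(lowPair, Math.min(highPair, s.google));
}

// Assumption: spread is the gap between the highest and lowest provider score.
function providerSpread(s: ProviderScores): number {
  const values = [s.anthropic, s.openai, s.google];
  return Math.max(...values) - Math.min(...values);
}

// Mean of per-epoch scores, as used for category and overall figures.
function meanOf(values: number[]): number {
  if (values.length === 0) throw new Error("cannot average zero epochs");
  let sum = 0;
  for (const v of values) sum += v;
  return sum / values.length;
}
```

For example, provider scores of 70, 90, and 80 give a reported item score of 80 with a spread of 20, and per-epoch scores of 80, 84, and 82 average to 82.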

Coverage & confidence

Items scored: 29 of 32
Pending human: 2 (R-2.4, R-4.6)
No-signal: 1 (R-4.4, no skeleton shipped)
Overall spread: ± 4 points across 5 epochs

No-signal items are excluded from the mean, not counted as zero. Each category shows its own coverage so a partial category never reads as complete.
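A short sketch of what excluding rather than zeroing means for a category mean. The type and function names are hypothetical, and the item IDs and scores in the usage note are invented for illustration.

```typescript
// An item either has a score or reported no-signal (nothing to evaluate).
type ItemResult =
  | { id: string; status: "scored"; score: number }
  | { id: string; status: "no-signal" };

interface CategorySummary {
  mean: number | null; // null when no item in the category was scored
  scored: number;      // items contributing to the mean
  total: number;       // all items in the category, for the coverage figure
}

// No-signal items drop out of the denominator; they are not counted as zero.
function summarizeCategory(items: ItemResult[]): CategorySummary {
  let sum = 0;
  let scored = 0;
  for (const item of items) {
    if (item.status === "scored") {
      sum += item.score;
      scored += 1;
    }
  }
  return { mean: scored > 0 ? sum / scored : null, scored, total: items.length };
}
```

With two scored items at 90 and 70 plus one no-signal item, the category mean is 80 with coverage 2 of 3. Counting the no-signal item as zero would instead give about 53, which would misread a missing measurement as a failure.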

Ground truth set by

Senior product designer A · Senior product designer B · Senior product designer C

Human-panel items (R-2.4, R-4.6) reported with inter-rater agreement; a scorer earns its place by matching designer judgment at ≥ 0.7.

Provenance & reproducibility

Run set: reportable-gpt-5-5 · 2026-06-17
Subject: openai/gpt-5.5-2026-04-23
Scorers: Anthropic + OpenAI + Google, cross-provider consensus
Scenario: 01-task-tracker / add-saved-views
Harness: Inspect AI (AISI), sandboxed container

Every score traces to its run, rubric version, and scorer version. The harness, scenario, and rubric are pinned per run; the run is reproducible.

What this does and doesn't tell you

One scenario. This measures product experience on a feature-add to an existing app. It is one slice of capability, not a verdict on everything GPT-5.5 builds.

Out of scope. We do not score speed, animation polish, or end-to-end team workflows.

Coverage conditioning. One item reports no-signal: GPT-5.5 shipped a text loading line, not the app's skeleton, so there was nothing to score for loading-convention match. It is excluded from the mean, not zeroed.

Demonstration. This page shows a real grid result in the report format we share with customers. The comparison reference is a frontier model run on the same scenario, shown without naming it.

Phase-0 footnote. Visual craft is scored from code in this run; render-based scoring lands next phase.
