GPT-5.5
GPT-5.5 builds the right feature and makes it robust, then reaches for its own components instead of the ones the app already ships.
Its biggest gap is product fit: it builds a custom popover for create and edit instead of reusing the app's drawer, and it skips a real error path on API failure. Accessibility and content trail where the feature leans on its own UI rather than the app's. Intent, resilience, and code conventions are at or near the comparative reference.
The 60-second read. Everything below is evidence for these claims.
The read
GPT-5.5 clears the hard part. All four saved-view operations work, persistence is server-backed and account-bound, and the feature is responsive and survives an API failure without taking the page down. On intent fidelity (99) and resilience (92) it matches the comparative reference; visual craft (86) and code conventions (85) sit just behind it.
The leaks are about fitting into this app and finishing the unhappy paths, not capability. For create and edit, GPT-5.5 builds its own popover instead of reusing the app's Drawer (product fit 73 vs the reference's 97), and on a 500 the saved-views list falls through to the empty state with no inline error or retry (error-state ≈ 2/100, every epoch). That same custom container is why keyboard accessibility is the most run-to-run-volatile score.
Highest-leverage fix: bias the agent to import the app's components before building its own. In 1 of the 5 epochs it did reuse the drawer and product fit jumped to 97, so this is reachable, not a ceiling.
Capability profile
GPT-5.5 vs a comparative SOTA model (│)
Top findings
Recommended next investments
Eight categories, the locked PX-bench taxonomy. Each score is the mean across 5 epochs, shown against a comparative SOTA model run on the same scenario and harness (the tick).
Bar = GPT-5.5 (mean of 5 epochs). │ = comparative SOTA model. Color reflects band: green ≥ 75, amber 50–74, red < 50.
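The color bands amount to a two-threshold lookup. A minimal sketch, assuming a hypothetical helper name `bandFor` and assuming that a fractional mean between 74 and 75 counts as amber (the legend does not say):

```typescript
// Hypothetical helper: maps a 0-100 category score to the legend's color band.
type Band = "green" | "amber" | "red";

function bandFor(score: number): Band {
  if (score >= 75) return "green"; // green: 75 and above
  if (score >= 50) return "amber"; // amber: 50 up to (but not including) 75 -- assumption for fractions
  return "red";                    // red: below 50
}
```

Under this reading, the product fit score of 73 cited above lands in amber and the resilience score of 92 lands in green.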
The patterns behind the scores, ranked by user-facing impact. Each pairs a concrete failure with the fix.
The pattern
For create and edit, GPT-5.5 builds a custom absolutely-positioned popover (a <div role="dialog"> in SavedViewsControl.tsx) rather than importing the app's right-side <Drawer> — the exact container new-task and edit-task use. It does reliably reuse Toast (for the undo) and ConfirmDialog (for delete), but it rebuilds the container, the loading line, and the inline error. That parallel container is the single biggest reason product fit lands at 73 against the comparative reference's 97.
Why it matters
A hand-rolled container drifts from the rest of the app and is the visible 'this feature feels different' tell. It also drags keyboard accessibility down, because the custom popover doesn't inherit the drawer's focus trap (see the accessibility finding). In 1 of the 5 epochs GPT-5.5 did reuse the drawer and product fit jumped to 97, so this is reachable, not a capability ceiling.
Evidence
Create / edit live in a custom popover instead of the app's drawer:
Recommendation
Feed the host app's component inventory into the agent and bias it toward import-before-build. Specifically, route create/edit through the existing <Drawer>; that one change moves product fit and inherits the focus trap for free.
Rubric items
Confidence
High. Component reuse is a deterministic check, and all three scorers independently flagged the create/edit container as a custom popover rather than the app's drawer.
The pattern
Saved views load through a failable useQuery, but the code never handles isError — it does savedViews = data ?? [], so a 500 produces an empty array and the list renders the empty state, 'No saved views yet.' Where an error message does appear it is a bare line of text with no retry and not the app's InlineError. This is the sharpest defect in the run: error-state scored about 2 out of 100, in every one of the 5 epochs.
Why it matters
A failed load looks identical to 'you have no saved views' — the worst kind of failure, because it quietly implies data loss and offers no way forward. Real users hit 500s eventually, and right now the feature has no honest answer for them.
Evidence
On a 500 the list falls through to the empty state instead of an error:
Recommendation
Handle isError explicitly, reuse the app's InlineError component, and add a retry. Make an explicit empty / loading / error branch a required step before the agent reports done.
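One way to make the loading / error / empty branches explicit is to derive a single tagged state from the query result before rendering. The sketch below is framework-free so it can run on its own. It is not the app's code: `QueryResult`, `ListState`, `deriveListState`, the `SavedView` shape, and the `isLoading` and `refetch` fields are all assumptions for illustration; only `data` and `isError` are named in the finding.

```typescript
// Hypothetical saved-view shape, for illustration only.
interface SavedView { id: string; name: string; }

// Assumed subset of a query result; `isLoading` and `refetch` are assumptions.
interface QueryResult<T> {
  data: T | undefined;
  isLoading: boolean;
  isError: boolean;
  refetch: () => void;
}

// Exactly one branch is active at a time, so a failure can never
// be mistaken for "no saved views".
type ListState =
  | { kind: "loading" }
  | { kind: "error"; retry: () => void }
  | { kind: "empty" }
  | { kind: "ready"; views: SavedView[] };

function deriveListState(q: QueryResult<SavedView[]>): ListState {
  // Check the error first: `data ?? []` would hide it behind an empty array.
  if (q.isError) return { kind: "error", retry: q.refetch };
  if (q.isLoading || q.data === undefined) return { kind: "loading" };
  if (q.data.length === 0) return { kind: "empty" };
  return { kind: "ready", views: q.data };
}
```

In a component, the `error` branch would render the app's InlineError with `retry` wired to its action, and only the `empty` branch would show 'No saved views yet.'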
Rubric items
Confidence
High. All three scorers, every epoch — the most consistent finding in the run. The page itself survives the 500 (resilience R-7.3 = 99); it just shows no error UI.
Every rubric item, what it tested, and its score (mean across 5 epochs). Expand an AUTO-AGENT item to see the three independent scorers (Anthropic, OpenAI, Google), each provider's mean for the item with representative reasoning quoted from one epoch, and where they agreed or split.
The eval
- Scenario: Add a “saved views” feature to a working 8-screen task tracker with its own design system.
- What's tested: Whether the agent designs a complete product experience: intent, fit, craft, conventions, pathways, content, resilience, accessibility.
- Scoring: Three tiers. 4 deterministic script checks, 26 cross-provider LLM-judge items, 2 human-panel items.
- Cross-provider: Every judge item is scored independently by Anthropic, OpenAI, and Google models. The reported score is their median; we publish the spread.
- Comparison: A comparative SOTA model run on the same scenario and harness (6 epochs), shown as the reference tick on each bar. Shown without naming the specific model.
- Samples: 5 epochs of GPT-5.5; category and overall scores are means, with the spread shown as ± on the overall.
- Build cost: 722k in + 44k out tokens, ~$1.90 per run. The agent's own spend producing the feature, not the scorers'.
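The cross-provider rule above (three independent scores, report the median, publish the spread) can be sketched as follows. The names `ProviderScores` and `consensus` are hypothetical, and defining spread as highest minus lowest is an assumption; the report does not state how its spread is computed.

```typescript
// Hypothetical container for one judge item's three provider scores.
interface ProviderScores { anthropic: number; openai: number; google: number; }

function consensus(s: ProviderScores): { median: number; spread: number } {
  // Sort ascending; with exactly three values the median is the middle one.
  const [low, mid, high] = [s.anthropic, s.openai, s.google]
    .sort((a, b) => a - b) as [number, number, number];
  // Assumption: spread = highest score minus lowest score.
  return { median: mid, spread: high - low };
}
```

Taking the median means a single outlier scorer cannot move the reported score, while the spread still exposes the disagreement.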
Coverage & confidence
- Items scored: 29 of 32
- Pending human: 2 (R-2.4, R-4.6)
- No-signal: 1 (R-4.4); shipped a text loader, no skeleton to judge
- Overall spread: ± 4 points across 5 epochs
No-signal items are excluded from the mean, not counted as zero. Each category shows its own coverage so a partial category never reads as complete.
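The difference between excluding an item and scoring it as zero is easy to see in code. A minimal sketch, assuming a hypothetical `categoryMean` helper in which `null` stands for an item that is either no-signal or still pending (the representation is an assumption, not the harness's actual data model):

```typescript
// Assumption: null marks an item with no score (no-signal or pending human review).
type ItemScore = number | null;

function categoryMean(scores: ItemScore[]): { mean: number | null; scored: number; total: number } {
  // Keep only items that actually received a score.
  const scored = scores.filter((s): s is number => s !== null);
  // A category with nothing scored has no mean, rather than a mean of zero.
  const mean = scored.length === 0
    ? null
    : scored.reduce((sum, s) => sum + s, 0) / scored.length;
  // Report coverage alongside the mean so a partial category is visible as partial.
  return { mean, scored: scored.length, total: scores.length };
}
```

For example, `categoryMean([80, null, 90])` averages only the two scored items and reports coverage of 2 of 3, whereas treating the missing item as zero would pull the category down.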
Ground truth set by
Human-panel items (R-2.4, R-4.6) reported with inter-rater agreement; a scorer earns its place by matching designer judgment at ≥ 0.7.
Provenance & reproducibility
- Run set: reportable-gpt-5-5 · 2026-06-17
- Subject model: openai/gpt-5.5-2026-04-23
- Scorers: Anthropic + OpenAI + Google, cross-provider consensus
- Scenario: 01-task-tracker / add-saved-views
- Harness: Inspect AI (AISI), sandboxed container
Every score traces to its run, rubric version, and scorer version. The harness, scenario, and rubric are pinned per run; the run is reproducible.