PX-failure survey

Where does your coding agent fall short on product experience?

A five-minute self-audit you run on your own machine. It reads your coding-agent history, classifies the moments where the agent's output needed rework, and returns a small summary, designed to minimize identifiers, that you choose whether to share.

We're building PX-bench, a public benchmark for how well coding agents build product experience. This survey is its in-the-wild half: instead of a lab scenario, it looks at the real work you've already done with an agent and classifies where what it built needed fixing, including broken layouts, wrong information hierarchy, missing states, and off copy. Run it, see what it finds, and if you'd like, send the result back to help ground the benchmark in real sessions.

Copy the prompt.

Paste it into a coding agent with shell and file access, running in a project where you've done real work. It writes nothing to your code.

Survey prompt
You are going to audit this machine's coding-agent session transcripts to measure where the agent fell short on **product experience (PX)** — the design and product quality of what it built, as opposed to whether the code merely compiled. Work carefully and conservatively; the goal is an honest, defensible count, not a big one.

## Privacy — non-negotiable

- Operate **entirely locally**. Do not send anything anywhere.
- Your final output is **structured counts + short quotes only**. Never include source code, full prompts, file paths, repo names, client names, or secrets.
- Evidence quotes must be ≤200 characters and may be paraphrased to remove sensitive detail. When in doubt, redact.
- The user reviews the final JSON and chooses whether to share it. Tell them so.

## STEP 1 — Locate the transcripts

Claude Code and Conductor both store session transcripts as JSONL under `~/.claude/projects/<encoded-cwd>/*.jsonl`, where `<encoded-cwd>` is the absolute project path with `/` and `.` replaced by `-`. 

Run this to list the candidate transcripts for the **current** project, newest first:

```bash
proj="$HOME/.claude/projects/$(pwd | sed 's/[/.]/-/g')"
ls -lt "$proj"/*.jsonl 2>/dev/null || echo "No transcripts for this project. Try: ls ~/.claude/projects/ | grep -i <your-project-name>"
```

If nothing matches, list all projects (`ls ~/.claude/projects/`) and ask the user which one(s) to analyze, or accept a path they give you. If the user uses a different agent (Cursor, etc.), ask them to point you at its chat/transcript export and adapt the extractor's parsing accordingly — the classification rules below are agent-agnostic.

**Default scope:** up to the **8** most recent sessions for the chosen project, skipping tiny ones (<20 KB). If there are more, tell the user the count and proceed with the most recent 8 unless they say otherwise.
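
The default scope can be sketched in Python roughly as follows. This is a minimal sketch, not part of the extractor: `pick_sessions`, `MAX_SESSIONS`, and `MIN_BYTES` are illustrative names, and "20 KB" is assumed to mean 20 × 1024 bytes.

```python
import os

MAX_SESSIONS = 8          # default cap from the scope rule
MIN_BYTES = 20 * 1024     # assumption: "20 KB" means 20 * 1024 bytes

def pick_sessions(paths, max_sessions=MAX_SESSIONS, min_bytes=MIN_BYTES):
    """Return (chosen, n_eligible): the newest transcripts that are large enough."""
    eligible = [p for p in paths if os.path.getsize(p) >= min_bytes]
    eligible.sort(key=os.path.getmtime, reverse=True)   # newest first
    return eligible[:max_sessions], len(eligible)
```

If `n_eligible` is larger than the cap, that is the count to report to the user before proceeding.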

## STEP 1b — Detect the harness setup (availability)

What predicts where agents fall short isn't only the model; it's also the **harness**: which skills, rules, and tools were in play. What you *used* in a session comes from the transcript (Steps 2–3). What was *available* you read off disk. Capture both: the gap between what was installed and what was actually used is the most useful thing we can later hand back.

Run these (read-only — they list names and presence, nothing more):

```bash
# Skills available — global (all projects) and project-local
ls ~/.claude/skills/ 2>/dev/null     # global skills
ls .claude/skills/ 2>/dev/null       # project skills
# Rules files — presence only; do NOT read or send their contents
ls -la ~/.claude/CLAUDE.md CLAUDE.md .claude/CLAUDE.md .cursor/rules .cursorrules 2>/dev/null
```

(For Cursor or other agents, list the equivalents — `.cursor/rules`, `.cursorrules`, etc.)

**De-identification — same bar as the rest of the survey:**
- **Public/known skills** (the gstack suite — `design-review`, `qa`, `qa-only`, `ship`, `review`, `investigate`, `browse`, `verify`, `office-hours`, `codex`, … — and Claude Code built-ins) → record the **name** + scope (`project`/`global`).
- **Custom/private skills** (org- or project-specific names, or anything you don't recognize as public) → **count only, never the name** (a name like `acme-deploy` leaks the company).
- **Rules files** (CLAUDE.md, .cursorrules) → record **presence** + scope, **never the content**.
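
One way to apply the skill rules above mechanically is sketched below. It assumes a hand-maintained allowlist: `PUBLIC_SKILLS` holds only the names listed in the rule above (the list there ends with "…", so it is not a complete registry), and `deidentify_skills` is an illustrative helper, not part of the survey tooling.

```python
# Assumption: an allowlist built from the public names listed above; extend it
# only with skills you are certain are public.
PUBLIC_SKILLS = {"design-review", "qa", "qa-only", "ship", "review", "investigate",
                 "browse", "verify", "office-hours", "codex"}

def deidentify_skills(found):
    """found: iterable of (name, scope) pairs, scope being "project" or "global".

    Public skills are kept by name and scope; everything else becomes a count.
    """
    available = []
    n_custom = 0
    for name, scope in found:
        if name in PUBLIC_SKILLS:
            available.append({"name": name, "scope": scope})
        else:
            n_custom += 1   # unrecognized is treated as private: counted, never named
    return {"skills_available": available, "n_custom_skills_available": n_custom}
```

Treating every unrecognized name as custom errs toward under-reporting names, which matches the "when in doubt, redact" bar.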

## STEP 2 — Extract candidate episodes mechanically

Save this script to a temp file and run it on each chosen transcript. It does *only* mechanical extraction (no judgment): it pulls the real human turns (each paired with the assistant text that preceded it — i.e. what was "delivered") and the agent's own self-review observations. The JSONL has sharp edges (tool-results masquerade as user messages, system tags, skill injections, a duplicative `last-prompt` record) — this handles them.

```python
# save as /tmp/px_extract.py ; run: python3 /tmp/px_extract.py <transcript.jsonl>
import json, sys, glob

SKILL_MARKERS = ("Base directory for this skill:", "<command-message>", "<command-name>",
                 "<local-command-stdout>", "<system_instruction>", "<system-reminder>",
                 "<task-notification>", "Caveat:", "<user-prompt-submit-hook>", "<bash-")
REVIEW_SKILLS = {"design-review","qa","qa-only","browse","view-prototype","verify","investigate","review"}
DEFECT_KW = ("the main problem","the issue is","is broken","doesn't match","too small","too low",
             "low contrast","overflow","crushe","unreadable","misalign","cut off","clipped","cramped",
             "not responsive","layout shift","the problem","slivers","squished","off-screen",
             "my mistake","i merged","should have","regress")

def parts(content):
    if isinstance(content, str): return content, [], False
    txt=[]; tools=[]; is_tr=False
    if isinstance(content, list):
        for b in content:
            if not isinstance(b, dict): continue
            if b.get("type")=="text": txt.append(b.get("text",""))
            elif b.get("type")=="tool_use": tools.append(b.get("name"))
            elif b.get("type")=="tool_result": is_tr=True
    return "\n".join(txt), tools, is_tr

def is_human(text, is_tr, o):
    if is_tr or o.get("isMeta"): return False
    s=(text or "").strip()
    if not s: return False
    if s.startswith("[Request interrupted"): return False   # interruption marker, not content
    if s.startswith("<") and any(m in s[:60] for m in ("<command","<system","<task-notif","<local-command","<bash-")): return False
    if any(s.startswith(m) or m in s[:40] for m in SKILL_MARKERS): return False
    return True

def main(path):
    events=[]; model="unknown"
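    with open(path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            line=line.strip()
            if not line: continue
            try: o=json.loads(line)
            except json.JSONDecodeError: continue   # skip malformed or truncated lines
            if not isinstance(o,dict): continue     # a record must be a JSON object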
            if o.get("type")=="last-prompt": continue
            msg=o.get("message")
            if not isinstance(msg,dict): continue
            text,tools,is_tr=parts(msg.get("content"))
            if o.get("type")=="assistant" and msg.get("model"): model=msg["model"]
            if o.get("type")=="user" and is_human(text,is_tr,o):
                events.append(("human",text.strip(),[],None))
            elif o.get("type")=="assistant" and (text.strip() or tools):
                events.append(("assistant",text.strip(),tools,o.get("attributionSkill")))

    n_human=sum(1 for e in events if e[0]=="human")
    n_asst=sum(1 for e in events if e[0]=="assistant")
    tools_used=sorted({t for e in events if e[0]=="assistant" for t in e[2]})
    skills_invoked=sorted({e[3] for e in events if e[0]=="assistant" and e[3]})
    episodes=[]
    for i,(kind,text,tools,skill) in enumerate(events):
        if kind=="human":
            prev=""
            for j in range(i-1,-1,-1):
                if events[j][0]=="assistant" and events[j][1]: prev=events[j][1]; break
            episodes.append({"type":"human_turn","i":i,"human":text[:600],"prev_assistant_tail":prev[-400:]})
        else:
            low=text.lower()
            if (skill in REVIEW_SKILLS or any(k in low for k in DEFECT_KW)) and text:
                fixed=any(("Edit" in events[k][2] or "Write" in events[k][2] or "MultiEdit" in events[k][2])
                          for k in range(i+1,min(i+7,len(events))) if events[k][0]=="assistant")
                if fixed or skill in REVIEW_SKILLS:
                    episodes.append({"type":"agent_self","i":i,"skill":skill,
                                     "observation":text[:500],"followed_by_edit":fixed})
    print(json.dumps({"file":path.split("/")[-1],"model":model,
        "n_human_turns":n_human,"n_assistant_turns":n_asst,
        "mostly_autonomous": n_human<=3,
        "tools_used":tools_used,"skills_invoked":skills_invoked,
        "candidates":episodes}, indent=1))

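if __name__=="__main__":
    if len(sys.argv)<2: sys.exit("usage: python3 /tmp/px_extract.py <transcript.jsonl>")
    matches=glob.glob(sys.argv[1]) or [sys.argv[1]]   # fall back to the literal path if the glob matches nothing
    main(matches[0])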
```

The candidates are **high-recall, low-precision** — most are *not* PX defects. Your job in Step 3 is to filter hard.

## STEP 3 — Classify each candidate (this is the judgment)

For every candidate, first decide its **`nature`**, then — only if it's a `px_defect` — its category. Read `prev_assistant_tail` (for human turns) to see what the agent had delivered; the defect is relative to that.

**First, drop the non-episodes.** Many candidates are neither defects nor decisions of interest — do **not** emit an episode for them:
- Pure operational/neutral requests: "run the server", "push to github", "open a PR", "show me", "run it", "continue".
- Pure approvals/acks: "yes", "go ahead", "nice", "perfect".
- Verification turns where nothing was wrong (the agent confirms it works; the user moves on).
- Plan-level dialogue *before any work is delivered* (a long planning back-and-forth has no delivered artifact to be deficient — those are `scope_change`/`clarification` at most, and usually just dropped). A session that is *entirely* planning legitimately yields ~0 defects.

Emit an episode only when the candidate is a defect (any `nature` ∈ {px_defect, functional_defect, infra_env}) **or** a substantive non-defect intervention worth counting in the denominator (`scope_change`, `preference`, `clarification`).

### 3a. `nature` — what kind of episode is this?

| `nature` | Use when… | Counts toward PX? |
|---|---|---|
| `px_defect` | the agent's **delivered** product experience was deficient — a competent senior designer/engineer would independently call it wrong/incomplete/sloppy | **YES** |
| `functional_defect` | code/logic was wrong: build/runtime error, crash, wrong data, or the agent did the wrong *operation* | no (non-PX rework) |
| `infra_env` | environment/ops: deploy, dev server, git/merge, config, tooling, ports | no (non-PX rework) |
| `scope_change` | the user **adds or changes a requirement** ("also add…", "let's now…", "we changed X to Y") | **no — not a defect** |
| `preference` | the user steers a **subjective** choice the agent had gotten *reasonably* ("try another font", "make it warmer", "I prefer…") | **no — not a defect** |
| `clarification` | the user supplies info the agent **couldn't have known**, or answers the agent's question | **no — not a defect** |

### 3b. The defect-vs-preference test (decides credibility — apply strictly)

- Language invokes a **quality bar** → `px_defect`: *readable, broken, messy, inconsistent, doesn't match, too small/low, unclear, cut off, "this is wrong", "you broke…".*
- Language invokes **taste/direction** → `preference`: *prefer, warmer, elegant, "try…", vibe, "let's go with…".*
- **When genuinely unsure, it is NOT a defect.** Under-counting is the correct, defensible direction. If you do file a borderline case as `px_defect`, set `confidence: low`.
- A long taste-exploration (e.g. swapping fonts five times) is mostly `preference`. Only the turns that name a real quality failure ("hard to read", "looks messy") are defects.

### 3c. Dedup rules

- **One root cause = one episode.** A QA/review pass that finds several sub-issues of one underlying mistake (mobile grid + header + hero all un-responsive) → **one** episode.
- A multi-turn back-and-forth fixing **one** defect → **one** episode; set `resolution_turns` to the number of turns.
- The big first prompt of a session (the task spec) is **never** an episode.
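
The merge step can be sketched as below, under stated assumptions: `root_cause` is a hypothetical working label you assign while classifying (it is not a survey field and is dropped from the result), and `resolution_turns` is taken to be the number of candidates merged into the episode.

```python
def dedup_episodes(candidates):
    """Collapse candidates that share (session_label, root_cause) into one episode.

    The first candidate in each group supplies the episode's fields.
    """
    merged = {}
    for cand in candidates:
        key = (cand["session_label"], cand["root_cause"])
        if key not in merged:
            episode = {k: v for k, v in cand.items() if k != "root_cause"}
            episode["resolution_turns"] = 1
            merged[key] = episode
        else:
            # Assumption: each further candidate in the group is one more turn.
            merged[key]["resolution_turns"] += 1
    return list(merged.values())
```

Deciding which candidates share a root cause remains a judgment call; the sketch only does the bookkeeping once that label exists.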

### 3d. If `nature == px_defect`, file `px_category` (exactly one)

File by **where the fix belongs**, not where the symptom shows; one category per episode. **Convention vs. Visual craft** (the one boundary worth memorizing): reaching for the *wrong primitive* (a hardcoded value where a token exists, a parallel component, a raw date) → `convention_adherence`; composing the *right* primitives into an *unclear* result (everything one weight, nothing grouped, ragged alignment, no type-scale hierarchy) → `visual_craft`. On a true tie, the most objective/foundational category wins: absent whole capability → Intent; WCAG/axe-named → Accessibility; condition-triggered → Resilience; structural placement/consolidation → Product fit; string-only fix → Content.

| `px_category` | A defect here means… |
|---|---|
| `intent_fidelity` | a whole requested capability is **absent or behaviorally wrong** vs. the ask (not just imperfectly executed) |
| `product_fit` | wrong **structural** place/shape: container/pattern choice (modal vs drawer), entry-point placement, fragmenting instead of consolidating across views |
| `visual_craft` | the composition is **illegible**: no clear visual hierarchy/emphasis, ragged spacing/alignment, type scale not used to signal primary vs supporting — the eye doesn't land on the primary thing first. A **legibility** judgment, *not* aesthetic preference (pure taste is `preference`, not a defect) |
| `convention_adherence` | not in the **house style**: duplicated instead of reusing an existing component, hardcoded values where tokens exist, wrong naming/file conventions, wrong date/number formatting |
| `pathway_completeness` | a path or state is **missing**: no cancel/undo/error-recovery, a dead-end, or an absent loading/empty/error/pending state |
| `content_language` | the **words** are wrong: label/error-message quality, empty-state copy, microcopy off-voice, typos, terminology drift |
| `resilience` | breaks only when a **condition varies**: long content, a specific viewport, API failure, throttled network, performance/jank |
| `accessibility` | an **axe/WCAG or keyboard/screen-reader** failure: contrast, focus order, missing labels, not keyboard-operable |
| `other_px` | a genuine PX defect that fits **none** of the above. **You must give a `subfacet` label.** Now that `visual_craft` owns composition/hierarchy, the residual case is a happy-path **rendering glitch** the eight categories don't cover — e.g. a wrong element type (a read-only field rendered as an input box), overlap, or a control rendering broken at default data/width/network. Don't force-fit; label it. |

### 3e. Other fields per episode

- `subfacet`: short kebab-case label below the category (`responsive-layout`, `information-hierarchy`, `microcopy-voice`, `empty-state-missing`, `happy-path-render-glitch`…). **Required** for `other_px`. Encouraged everywhere.
- `caught_by`: `human` (the agent treated it as done; the user had to flag it → *shipped-broken*) or `agent_self` (the agent found and fixed it in a QA/review/verify pass before handing back).
- `severity`: `cosmetic` | `functional` | `blocking`.
- `resolution_turns`: the number of turns it took to fix the defect (see 3c).
- `evidence`: ≤200-char verbatim (or redacted) quote of the trigger.
- `description`: one neutral line.
- `confidence`: `high` | `med` | `low`.
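
A first-pass cleanup for `evidence` quotes might look like the sketch below. The patterns are illustrative assumptions and far from exhaustive (they will not catch client names or secrets), and `clean_evidence` is a hypothetical helper; read every quote yourself before it goes into the JSON.

```python
import re

MAX_EVIDENCE = 200   # character cap for evidence quotes

# Assumption: rough patterns for URLs, emails, and file paths only.
_PATTERNS = [
    (re.compile(r"https?://\S+"), "[url]"),
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[email]"),
    (re.compile(r"(?<![\w/])(?:~|\.{1,2})?/[\w.-]+(?:/[\w.-]+)+"), "[path]"),
]

def clean_evidence(quote):
    """Collapse whitespace, mask obvious identifiers, and cap the length."""
    text = " ".join(quote.split())
    for pattern, placeholder in _PATTERNS:
        text = pattern.sub(placeholder, text)
    if len(text) > MAX_EVIDENCE:
        text = text[:MAX_EVIDENCE - 3] + "..."
    return text
```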

### 3f. Assemble each session's `harness_profile`

Per session, fill `harness_profile` — the harness → outcome signal. Pull from three places:

- **Availability** (from the STEP 1b disk scan): `skills_available` (public names + scope), `n_custom_skills_available`, `review_skills_available`, `has_project_rules`, `has_global_rules`.
- **Usage** (from the extractor's `skills_invoked` + `tools_used`, plus the transcript): `skills_invoked`, `n_custom_skills_invoked`, `review_skills_invoked`; and the booleans — `used_browser_tools` / `captured_screenshots` (browser/screenshot tool names in `tools_used`), `ran_app` / `ran_tests` (a dev-server or test command was run), `used_plan_mode`.
- **The gap** (compute it): `review_available_not_used` = the review skills in `review_skills_available` **not** in `review_skills_invoked`. This is the actionable bit — "installed but not run."
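
The gap is a plain set difference. A minimal sketch, assuming the two review lists have already been filled in (the function name simply mirrors the field name):

```python
def review_available_not_used(review_skills_available, review_skills_invoked):
    """Review skills that were installed but never run, in their original order."""
    invoked = set(review_skills_invoked)
    return [name for name in review_skills_available if name not in invoked]
```

For example, with `["design-review", "qa"]` available and nothing invoked, both names land in the gap.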

Also set `build_context` (greenfield vs. existing app — a confounder for convention/product-fit) and `model_version` / `harness_version` if you can read them. Same de-identification bar as STEP 1b (public skills by name, custom as counts, rules by presence). Fill what you can confidently determine; omit the rest — partial is fine.

## STEP 4 — Emit the output

Produce two things:

1. **A JSON object** conforming to the survey schema — `sessions[]`, `episodes[]` (include the non-defect episodes too), and a `totals` self-check block. Fill `model` / `agent_harness` / `app_domain` / `task_type` and the `harness_profile` (STEP 3f) per session (`app_domain` and `task_type` you infer from the work; keep `app_domain` generic — "marketing site", not a brand). Schema fields: see the survey's `schema.json`; the shape is:

```json
{
  "survey_version": "1.0",
  "sessions": [{"session_label":"session-1","model":"claude-opus-4-8","agent_harness":"conductor","app_domain":"marketing site","task_type":"feature_add","n_human_turns":23,"n_assistant_turns":120,"mostly_autonomous":false,"harness_profile":{"build_context":"existing","has_project_rules":true,"has_global_rules":true,"skills_available":[{"name":"design-review","scope":"global"},{"name":"qa","scope":"global"}],"n_custom_skills_available":3,"review_skills_available":["design-review","qa"],"skills_invoked":["browse"],"review_skills_invoked":[],"review_available_not_used":["design-review","qa"],"ran_app":true,"ran_tests":false,"captured_screenshots":true,"used_browser_tools":true,"used_plan_mode":false}}],
  "episodes": [
    {"session_label":"session-1","nature":"px_defect","px_category":"visual_craft","subfacet":"information-hierarchy","caught_by":"human","severity":"functional","evidence":"the information hierarchy is broken — content above the header","description":"Promo content placed above the page header, burying the H1.","confidence":"high"},
    {"session_label":"session-1","nature":"preference","px_category":null,"caught_by":"human","evidence":"try IBM Plex Serif instead","description":"User exploring heading typeface; prior choice was reasonable.","confidence":"high"}
  ],
  "totals": {"n_sessions":1,"n_episodes":2,"by_nature":{"px_defect":1,"functional_defect":0,"infra_env":0,"scope_change":0,"preference":1,"clarification":0},"px_by_category":{"intent_fidelity":0,"product_fit":0,"visual_craft":1,"convention_adherence":0,"pathway_completeness":0,"content_language":0,"resilience":0,"accessibility":0,"other_px":0},"px_caught_by_human":1}
}
```

2. **A short readable digest**: the **task mix** you analyzed (e.g. "4 landing-page builds, 1 dashboard feature-add"), then PX defects by category (with %), shipped-broken rate, PX share of all rework, and any `other_px` themes you noticed. Call out caveats you hit — especially **task skew** ("all sessions were marketing-copy work, so content & language dominates and there were no multi-step flows to test pathway/state"), mostly-autonomous sessions, and any category that was structurally invisible (accessibility rarely shows up in human corrections without tooling).
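
The `totals` self-check and the digest's rates can both be recomputed from `episodes[]` rather than tallied by hand. The sketch below makes two assumptions that the survey text does not spell out: the shipped-broken rate is the share of PX defects with `caught_by: human`, and "PX share of all rework" is PX defects over all three defect natures. `build_totals`, `pct`, and `digest_rates` are illustrative names.

```python
from collections import Counter

NATURES = ("px_defect", "functional_defect", "infra_env",
           "scope_change", "preference", "clarification")
PX_CATEGORIES = ("intent_fidelity", "product_fit", "visual_craft",
                 "convention_adherence", "pathway_completeness",
                 "content_language", "resilience", "accessibility", "other_px")
REWORK = ("px_defect", "functional_defect", "infra_env")   # the three defect natures

def build_totals(sessions, episodes):
    """Recompute the `totals` block from sessions[] and episodes[]."""
    px = [e for e in episodes if e["nature"] == "px_defect"]
    by_nature = Counter(e["nature"] for e in episodes)
    by_category = Counter(e["px_category"] for e in px)
    return {
        "n_sessions": len(sessions),
        "n_episodes": len(episodes),
        "by_nature": {n: by_nature[n] for n in NATURES},
        "px_by_category": {c: by_category[c] for c in PX_CATEGORIES},
        "px_caught_by_human": sum(e.get("caught_by") == "human" for e in px),
    }

def pct(part, whole):
    """Percentage rounded to one decimal, or None when the denominator is zero."""
    return round(100 * part / whole, 1) if whole else None

def digest_rates(totals):
    """Rates for the digest, under the assumed definitions stated above."""
    n_px = totals["by_nature"]["px_defect"]
    n_rework = sum(totals["by_nature"][n] for n in REWORK)
    return {
        "px_pct_by_category": {c: pct(n, n_px)
                               for c, n in totals["px_by_category"].items() if n},
        "shipped_broken_rate_pct": pct(totals["px_caught_by_human"], n_px),
        "px_share_of_rework_pct": pct(n_px, n_rework),
    }
```

Comparing `build_totals(...)` with the `totals` you wrote is a cheap way to catch a miscount; the rates come back as `None` when a session set has no PX defects, which is worth stating in the digest rather than printing 0%.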

Then tell the user: *review the JSON, redact anything you like, and — if you choose to share it — submit it on the PX-bench survey page you opened this prompt from (paste it into the form). Submitting it means you agree to the **Survey Contribution Terms** and **Survey Privacy Notice**: the submission is built to minimize identifiers and is kept separate from any contact email, the raw corpus is kept confidential, and you can ask us to delete your submission while it can still be identified. Or keep it — it's yours either way.*

## Calibration examples (from real sessions)

- "the headers are input boxes instead of read-only text — fix" → `px_defect` / `other_px` / `happy-path-render-glitch` (wrong element *type* on the happy path — not a hierarchy/composition problem, so not Visual craft; not condition-varied, so not Resilience). caught_by `human`.
- "the product grid is locked to 5 columns; on a phone the cards crush into unreadable slivers" (agent's own design-review) → `px_defect` / `resilience` / `responsive-layout`. caught_by `agent_self`.
- "the information hierarchy is broken — you put content above the header" → `px_defect` / `visual_craft` / `information-hierarchy` (the eye lands on promo, not the H1 — visual hierarchy, not structural placement). caught_by `human`.
- "the row's name, metadata, and actions are all one weight — nothing for the eye to land on" → `px_defect` / `visual_craft` / `visual-hierarchy` (right tokens, composed flat; the fix recomposes the layout, not the primitives — so Visual craft, not Convention). caught_by `agent_self`.
- "find a different term from 'swear by', we just used it on the previous screen" → `px_defect` / `content_language` / `terminology-repetition`, `confidence: med`. caught_by `human`.
- "try Marck Script instead" → `preference` (taste; prior font was a fine choice). **Not** a defect.
- "let's add a 12-month subscription option" → `scope_change`. **Not** a defect.
- "404 NOT_FOUND when deploying to Vercel" → `infra_env`. Non-PX rework.
- "you merged the rows when I wanted them removed" → `functional_defect` (wrong operation). Non-PX rework.

How it works.

  1. Paste it into your agent

    Claude Code, Conductor, Cursor, or anything else with shell and file access. Run it in a project where you've done real work with the agent.

  2. Review the summary

    It prints a small JSON summary and a short readable digest. The prompt only reads your history; nothing leaves your machine until you send it.

  3. Send it back, or don't

    If you'd like to contribute, paste the JSON into the form below. It's built to minimize identifiers, and it's yours to keep either way.

Review and redact before you send.

The output is designed to hold only counts and short quotes, with identifiers minimized. Before you share it, read it over and take out anything you'd rather keep private.

What happens to your data

By sending your results, you grant us a license to use them in our research, the PX-bench benchmark, and our publications. We keep the raw corpus confidential and publish only aggregate findings and short redacted examples. You can ask us to delete your submission while we can still identify it.

Contribute your results.

Submitting your results means you agree to the Contribution Terms and Privacy Notice. It's built to minimize identifiers; redact anything you like before you send it.