PX-failure survey
Where does your coding agent fall short on product experience?

A five-minute self-audit you run on your own machine. It reads your coding-agent history and the app that history produced, then classifies the moments where the agent's output had to be reworked. It returns a small summary, designed to minimize identifiers, that you choose whether to share.
Copy the prompt.

Paste it into a coding agent with shell and file access, running in a project where you've done user-facing work. It reads your session logs and does a static read of the app's code.
You are going to audit this machine's coding-agent work to measure where the agent fell short on **product experience (PX)** — the design and product quality of what it built, as opposed to whether the code merely compiled. It comes in two parts: **Part A** (STEPS 1–3) classifies the rework moments in your session **transcripts** — what the agent delivered that later needed fixing; **Part B** (STEP 4) is a **static read of the app as it stands today**, surfacing the latent defects nobody caught. Both run entirely locally, and Part B never starts the app or installs anything. Work carefully and conservatively throughout — the goal is an honest, defensible count, not a big one.

## Privacy — non-negotiable

- Operate **entirely locally**. Do not send anything anywhere.
- Your final output is **structured counts + short quotes only**. Never include source code, full prompts, file paths, repo names, client names, or secrets.
- Evidence quotes must be ≤200 characters and may be paraphrased to remove sensitive detail. When in doubt, redact.
- The user reviews the final JSON and chooses whether to share it. Tell them so.

## STEP 1 — Locate the transcripts

Claude Code and Conductor both store session transcripts as JSONL under `~/.claude/projects/<encoded-cwd>/*.jsonl`, where `<encoded-cwd>` is the absolute project path with `/` and `.` replaced by `-`. 

Run this to list the candidate transcripts for the **current** project, newest first:

```bash
proj="$HOME/.claude/projects/$(pwd | sed 's/[/.]/-/g')"
ls -lt "$proj"/*.jsonl 2>/dev/null || echo "No transcripts for this project. Try: ls ~/.claude/projects/ | grep -i <your-project-name>"
```

If nothing matches, list all projects (`ls ~/.claude/projects/`) and ask the user which one(s) to analyze, or accept a path they give you. If the user uses a different agent (Cursor, etc.), ask them to point you at its chat/transcript export and adapt the extractor's parsing accordingly — the classification rules below are agent-agnostic.

**Default scope:** up to the **8 most recent** sessions for the chosen project, skipping tiny ones (<20 KB). If more than 8 qualify, tell the user the count and proceed with the most recent 8 unless they say otherwise.
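The default scope can be applied mechanically. The sketch below is one way to do it, assuming "20 KB" means 20 × 1024 bytes and "most recent" means newest modification time; `pick_sessions` and `scan` are illustrative names, not part of the survey tooling.

```python
from pathlib import Path

MIN_BYTES = 20 * 1024   # assumption: "<20 KB" read as 20 * 1024 bytes
MAX_SESSIONS = 8        # default cap on sessions analyzed

def pick_sessions(files, min_bytes=MIN_BYTES, limit=MAX_SESSIONS):
    """files: iterable of (path, size_bytes, mtime) tuples.

    Returns (chosen_paths, n_eligible): the newest `limit` transcripts that
    are at least `min_bytes` large, plus how many qualified in total.
    """
    eligible = [f for f in files if f[1] >= min_bytes]
    eligible.sort(key=lambda f: f[2], reverse=True)   # newest first
    return [f[0] for f in eligible[:limit]], len(eligible)

def scan(project_dir):
    """Collect (path, size, mtime) for every *.jsonl file in a project directory."""
    out = []
    for p in Path(project_dir).glob("*.jsonl"):
        st = p.stat()
        out.append((str(p), st.st_size, st.st_mtime))
    return out
```

Call it as `chosen, n_eligible = pick_sessions(scan(proj_dir))`; when `n_eligible` exceeds 8, report that count to the user before proceeding.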

## STEP 1b — Detect the harness setup (availability)

What predicts where agents fall short isn't only the model; it's also the **harness**: which skills, rules, and tools were in play. What you *used* in a session comes from the transcript (Steps 2–3). What was *available* you read off disk. Capture both: the gap between what was installed and what was actually used is the most useful thing we can later hand back.

Run these (read-only — they list names and presence, nothing more):

```bash
# Skills available — global (all projects) and project-local
ls ~/.claude/skills/ 2>/dev/null     # global skills
ls .claude/skills/ 2>/dev/null       # project skills
# Rules files — presence only; do NOT read or send their contents
ls -lad ~/.claude/CLAUDE.md CLAUDE.md .claude/CLAUDE.md .cursor/rules .cursorrules 2>/dev/null
```

(For Cursor or other agents, list the equivalents — `.cursor/rules`, `.cursorrules`, etc.)

**De-identification — same bar as the rest of the survey:**
- **Public/known skills** (the gstack suite — `design-review`, `qa`, `qa-only`, `ship`, `review`, `investigate`, `browse`, `verify`, `office-hours`, `codex`, … — and Claude Code built-ins) → record the **name** + scope (`project`/`global`).
- **Custom/private skills** (org- or project-specific names, or anything you don't recognize as public) → **count only, never the name** (a name like `acme-deploy` leaks the company).
- **Rules files** (CLAUDE.md, .cursorrules) → record **presence** + scope, **never the content**.
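The skill rules above amount to a small partition: recognized public names are kept with their scope, and everything else collapses into a count. A minimal sketch follows. `KNOWN_PUBLIC` holds only the names spelled out above (the source list is open-ended, so treat it as partial), and `deidentify_skills` is a hypothetical helper name.

```python
# Only the public skill names listed explicitly above; the real list is longer.
KNOWN_PUBLIC = {"design-review", "qa", "qa-only", "ship", "review",
                "investigate", "browse", "verify", "office-hours", "codex"}

def deidentify_skills(global_skills, project_skills, known_public=KNOWN_PUBLIC):
    """Keep public skill names with their scope; reduce everything else to a count."""
    available = []
    n_custom = 0
    for scope, names in (("global", global_skills), ("project", project_skills)):
        for name in sorted(set(names)):
            if name in known_public:
                available.append({"name": name, "scope": scope})
            else:
                n_custom += 1   # custom/private: counted, never named
    return {"skills_available": available, "n_custom_skills_available": n_custom}
```

The unrecognized branch is deliberately the default: a name you are unsure about is treated as private and never written to the output.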

**Also record yourself.** Fill the output's `classifier` block with **your own** model id and harness — you, the agent running this audit. You are the measurement instrument, and you are often the same model family you're auditing; recording it lets the analysis quantify that.

## STEP 2 — Extract candidate episodes mechanically

Save this script to a temp file and run it on each chosen transcript. It does *only* mechanical extraction (no judgment): it pulls the real human turns (each paired with the assistant text that preceded it — i.e. what was "delivered") and the agent's own self-review observations. The JSONL has sharp edges (tool-results masquerade as user messages, system tags, skill injections, a duplicative `last-prompt` record) — this handles them.

```python
# save as /tmp/px_extract.py ; run: python3 /tmp/px_extract.py <transcript.jsonl>
import json, sys, glob

SKILL_MARKERS = ("Base directory for this skill:", "<command-message>", "<command-name>",
                 "<local-command-stdout>", "<system_instruction>", "<system-reminder>",
                 "<task-notification>", "Caveat:", "<user-prompt-submit-hook>", "<bash-")
REVIEW_SKILLS = {"design-review","qa","qa-only","browse","view-prototype","verify","investigate","review"}
DEFECT_KW = ("the main problem","the issue is","is broken","doesn't match","too small","too low",
             "low contrast","overflow","crushe","unreadable","misalign","cut off","clipped","cramped",
             "not responsive","layout shift","the problem","slivers","squished","off-screen",
             "my mistake","i merged","should have","regress")

def parts(content):
    if isinstance(content, str): return content, [], False
    txt=[]; tools=[]; is_tr=False
    if isinstance(content, list):
        for b in content:
            if not isinstance(b, dict): continue
            if b.get("type")=="text": txt.append(b.get("text",""))
            elif b.get("type")=="tool_use": tools.append(b.get("name"))
            elif b.get("type")=="tool_result": is_tr=True
    return "\n".join(txt), tools, is_tr

def is_human(text, is_tr, o):
    if is_tr or o.get("isMeta"): return False
    s=(text or "").strip()
    if not s: return False
    if s.startswith("[Request interrupted"): return False   # interruption marker, not content
    if s.startswith("<") and any(m in s[:60] for m in ("<command","<system","<task-notif","<local-command","<bash-")): return False
    if any(s.startswith(m) or m in s[:40] for m in SKILL_MARKERS): return False
    return True

def main(path):
    events=[]; model="unknown"
    with open(path, encoding="utf-8", errors="replace") as fh:   # transcripts are UTF-8 JSONL
        for line in fh:
            line=line.strip()
            if not line: continue
            try: o=json.loads(line)
            except json.JSONDecodeError: continue        # skip malformed lines
            if not isinstance(o, dict) or o.get("type")=="last-prompt": continue
            msg=o.get("message")
            if not isinstance(msg,dict): continue
            text,tools,is_tr=parts(msg.get("content"))
            if o.get("type")=="assistant" and msg.get("model"): model=msg["model"]
            if o.get("type")=="user" and is_human(text,is_tr,o):
                events.append(("human",text.strip(),[],None))
            elif o.get("type")=="assistant" and (text.strip() or tools):
                events.append(("assistant",text.strip(),tools,o.get("attributionSkill")))

    n_human=sum(1 for e in events if e[0]=="human")
    n_asst=sum(1 for e in events if e[0]=="assistant")
    tools_used=sorted({t for e in events if e[0]=="assistant" for t in e[2]})
    skills_invoked=sorted({e[3] for e in events if e[0]=="assistant" and e[3]})
    episodes=[]
    for i,(kind,text,tools,skill) in enumerate(events):
        if kind=="human":
            prev=""
            for j in range(i-1,-1,-1):
                if events[j][0]=="assistant" and events[j][1]: prev=events[j][1]; break
            episodes.append({"type":"human_turn","i":i,"human":text[:600],"prev_assistant_tail":prev[-400:]})
        else:
            low=text.lower()
            if (skill in REVIEW_SKILLS or any(k in low for k in DEFECT_KW)) and text:
                fixed=any(("Edit" in events[k][2] or "Write" in events[k][2] or "MultiEdit" in events[k][2])
                          for k in range(i+1,min(i+7,len(events))) if events[k][0]=="assistant")
                if fixed or skill in REVIEW_SKILLS:
                    episodes.append({"type":"agent_self","i":i,"skill":skill,
                                     "observation":text[:500],"followed_by_edit":fixed})
    print(json.dumps({"file":path.split("/")[-1],"model":model,
        "n_human_turns":n_human,"n_assistant_turns":n_asst,
        "mostly_autonomous": n_human<=3,
        "tools_used":tools_used,"skills_invoked":skills_invoked,
        "candidates":episodes}, indent=1))

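if __name__=="__main__":
    # Accept a literal path or a quoted glob pattern; fail clearly if nothing matches.
    matches = glob.glob(sys.argv[1]) if len(sys.argv) > 1 else []
    if not matches: sys.exit("usage: python3 /tmp/px_extract.py <transcript.jsonl>")
    main(matches[0])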
```

The candidates are **high-recall, low-precision** — most are *not* PX defects. Your job in Step 3 is to filter hard.

## STEP 3 — Classify each candidate (this is the judgment)

For every candidate, first decide its **`nature`**, then — only if it's a `px_defect` — its category. Read `prev_assistant_tail` (for human turns) to see what the agent had delivered; the defect is relative to that.

**First, drop the non-episodes.** Many candidates are neither defects nor decisions of interest — do **not** emit an episode for them:
- Pure operational/neutral requests: "run the server", "push to github", "open a PR", "show me", "run it", "continue".
- Pure approvals/acks: "yes", "go ahead", "nice", "perfect".
- Verification turns where nothing was wrong (the agent confirms it works; the user moves on).
- Plan-level dialogue *before any work is delivered* (a long planning back-and-forth has no delivered artifact to be deficient — those are `scope_change`/`clarification` at most, and usually just dropped). A session that is *entirely* planning legitimately yields ~0 defects.

Emit an episode only when the candidate is a defect (any `nature` ∈ {px_defect, functional_defect, infra_env}) **or** a substantive non-defect intervention worth counting in the denominator (`scope_change`, `preference`, `clarification`).

### 3a. `nature` — what kind of episode is this?

| `nature` | Use when… | Counts toward PX? |
|---|---|---|
| `px_defect` | the agent's **delivered** product experience was deficient — a competent senior designer/engineer would independently call it wrong/incomplete/sloppy | **YES** |
| `functional_defect` | code/logic was wrong: build/runtime error, crash, wrong data, or the agent did the wrong *operation* | no (non-PX rework) |
| `infra_env` | environment/ops: deploy, dev server, git/merge, config, tooling, ports | no (non-PX rework) |
| `scope_change` | the user **adds or changes a requirement** ("also add…", "let's now…", "we changed X to Y") | **no — not a defect** |
| `preference` | the user steers a **subjective** choice the agent had made *reasonably* ("try another font", "make it warmer", "I prefer…") | **no — not a defect** |
| `clarification` | the user supplies info the agent **couldn't have known**, or answers the agent's question | **no — not a defect** |

### 3b. The defect-vs-preference test (decides credibility — apply strictly)

- Language invokes a **quality bar** → `px_defect`: *readable, broken, messy, inconsistent, doesn't match, too small/low, unclear, cut off, "this is wrong", "you broke…".*
- Language invokes **taste/direction** → `preference`: *prefer, warmer, elegant, "try…", vibe, "let's go with…".*
- **When genuinely unsure, it is NOT a defect.** Under-counting is the correct, defensible direction. If you do file a borderline case as `px_defect`, set `confidence: low`.
- A long taste-exploration (e.g. swapping fonts five times) is mostly `preference`. Only the turns that name a real quality failure ("hard to read", "looks messy") are defects.

### 3c. Dedup rules

- **One root cause = one episode.** A QA/review pass that finds several sub-issues of one underlying mistake (mobile grid + header + hero all un-responsive) → **one** episode.
- A multi-turn back-and-forth fixing **one** defect → **one** episode; set `resolution_turns` to the number of turns.
- The big first prompt of a session (the task spec) is **never** an episode.

### 3d. If `nature == px_defect`, file `px_category` (exactly one)

File by **where the fix belongs**, not where the symptom shows; one category per episode. **Convention vs. Visual craft** (the one boundary worth memorizing): reaching for the *wrong primitive* (a hardcoded value where a token exists, a parallel component, a raw date) → `convention_adherence`; composing the *right* primitives into an *unclear* result (everything one weight, nothing grouped, ragged alignment, no type-scale hierarchy) → `visual_craft`. On a true tie, the most objective/foundational category wins: absent whole capability → Intent; WCAG/axe-named → Accessibility; condition-triggered → Resilience; structural placement/consolidation → Product fit; string-only fix → Content.

| `px_category` | A defect here means… |
|---|---|
| `intent_fidelity` | a whole requested capability is **absent or behaviorally wrong** vs. the ask (not just imperfectly executed) |
| `product_fit` | wrong **structural** place/shape: container/pattern choice (modal vs drawer), entry-point placement, fragmenting instead of consolidating across views |
| `visual_craft` | the composition is **illegible**: no clear visual hierarchy/emphasis, ragged spacing/alignment, type scale not used to signal primary vs supporting — the eye doesn't land on the primary thing first. A **legibility** judgment, *not* aesthetic preference (pure taste is `preference`, not a defect) |
| `convention_adherence` | not in the **house style**: duplicated instead of reusing an existing component, hardcoded values where tokens exist, wrong naming/file conventions, wrong date/number formatting |
| `pathway_completeness` | a path or state is **missing**: no cancel/undo/error-recovery, a dead-end, or an absent loading/empty/error/pending state |
| `content_language` | the **words** are wrong: label/error-message quality, empty-state copy, microcopy off-voice, typos, terminology drift |
| `resilience` | breaks only when a **condition varies**: long content, a specific viewport, API failure, throttled network, performance/jank |
| `accessibility` | an **axe/WCAG or keyboard/screen-reader** failure: contrast, focus order, missing labels, not keyboard-operable |
| `other_px` | a genuine PX defect that fits **none** of the above. **You must give a `subfacet` label.** Now that `visual_craft` owns composition/hierarchy, the residual case is a happy-path **rendering glitch** the eight categories don't cover — e.g. a wrong element type (a read-only field rendered as an input box), overlap, or a control rendering broken at default data/width/network. Don't force-fit; label it. |
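The tie-break in 3d can be read as a precedence list. The sketch below assumes the order in which the tie-break names the categories is the precedence order, which is an interpretation rather than something the rules state outright. `break_tie` is an illustrative name, and it only applies after judgment has narrowed an episode to a true tie.

```python
# Assumed precedence: the order in which the tie-break rule lists the categories.
TIE_BREAK_ORDER = ["intent_fidelity", "accessibility", "resilience",
                   "product_fit", "content_language"]

def break_tie(candidates):
    """Pick one px_category from a set of equally plausible candidates.

    Returns None when no candidate is covered by the tie-break rule,
    which means the call goes back to judgment.
    """
    for cat in TIE_BREAK_ORDER:
        if cat in candidates:
            return cat
    return None
```

For example, a copy problem that only appears under long content ties `content_language` with `resilience`, and the sketch resolves it to `resilience`. A tie between `convention_adherence` and `visual_craft` returns `None`, because that boundary is decided by the wrong-primitive versus unclear-composition test instead.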

### 3e. Other fields per episode

- `subfacet`: short kebab-case label below the category (`responsive-layout`, `information-hierarchy`, `microcopy-voice`, `empty-state-missing`, `happy-path-render-glitch`…). **Required** for `other_px`. Encouraged everywhere.
- `caught_by`: `human` (the agent treated it as done; the user had to flag it → *shipped-broken*) or `agent_self` (the agent found and fixed it in a QA/review/verify pass before handing back).
- `severity`: `cosmetic` | `functional` | `blocking`.
- `evidence`: ≤200-char verbatim (or redacted) quote of the trigger.
- `description`: one neutral line.
- `confidence`: `high` | `med` | `low`.

### 3f. Assemble each session's `harness_profile`

Per session, fill `harness_profile` — the harness → outcome signal. Pull from three places:

- **Availability** (from the STEP 1b disk scan): `skills_available` (public names + scope), `n_custom_skills_available`, `review_skills_available`, `has_project_rules`, `has_global_rules`.
- **Usage** (from the extractor's `skills_invoked` + `tools_used`, plus the transcript): `skills_invoked`, `n_custom_skills_invoked`, `review_skills_invoked`; and the booleans — `used_browser_tools` / `captured_screenshots` (browser/screenshot tool names in `tools_used`), `ran_app` / `ran_tests` (a dev-server or test command was run), `used_plan_mode`.
- **The gap** (compute it): `review_available_not_used` = the review skills in `review_skills_available` **not** in `review_skills_invoked`. This is the actionable bit — "installed but not run."
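The gap is a plain set difference. A minimal sketch, taking the two lists as already filled in from the disk scan and the transcript (the function name simply mirrors the field it computes):

```python
def review_available_not_used(review_skills_available, review_skills_invoked):
    """Review skills that were installed but never ran in the session."""
    invoked = set(review_skills_invoked)
    # Keep the order of the availability list so the output is stable.
    return [s for s in review_skills_available if s not in invoked]
```

With `["design-review", "qa"]` available and nothing invoked, the result is `["design-review", "qa"]`: both were installed and neither ran.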

Also set `build_context` (greenfield vs. existing app — a confounder for convention/product-fit) and `model_version` / `harness_version` if you can read them. Same de-identification bar as STEP 1b (public skills by name, custom as counts, rules by presence). Fill what you can confidently determine; omit the rest — partial is fine.

## STEP 4 — PART B: audit the app as it stands today

Episodes only count defects somebody **caught**. The categories nobody catches by eye — accessibility, missing states, resilience under varied conditions — are invisible to Steps 2–3 by construction. Part B closes that gap: audit the project's **current state** against the same categories and record what's latent in it right now. Running it here, with the transcripts in hand, is what makes it sharp — they give you the two things a bare code audit can't have: the **asks** (so intent is verifiable against what was actually requested) and **authorship** (so findings separate the agent's work from the human developer's). This is also where the audit pays *you* back most — a concrete defect list for your own app.

Part B is a **static code audit** — it never starts the app, drives a browser, or runs axe. Skip it entirely only if the project has no UI surface (a pure library/CLI) or the user declines — set `app_audit.performed = false` in that case.

Ground rules:

- **Local-only, read-only, and install-nothing.** Read the code; do **not** start the app, and make **no changes** to your application's code, content, or dependencies. The static accessibility pass uses a linter **only if your project already has one configured** (item 3) — it never installs one. Nothing leaves your machine.
- **Same conservative bar as Step 3:** a finding is something a competent senior product engineer/designer would independently call deficient — not a nitpick, not taste. When unsure, don't file it.
- **One root cause = one finding.** Group repeated violations by rule + surface (one missing-label pattern across one form = one finding), so the composition isn't distorted by a count explosion.
- **Same privacy bar:** evidence is a generic surface description ("a multi-step form has no error state") — never file paths, component/product names, or code.
- **Don't re-file Part A episodes.** Findings describe the current state. A defect that was caught and fixed in-session is gone; one that was caught but never fixed may legitimately appear in both.
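The "one root cause = one finding" rule can be applied mechanically to raw violations before you judge them. A sketch, assuming each raw violation carries a `rule` and a generic `surface` description; both keys and `n_occurrences` are hypothetical working fields, not schema fields.

```python
from collections import defaultdict

def group_violations(violations):
    """Collapse repeated violations into one entry per (rule, surface) pair."""
    counts = defaultdict(int)
    for v in violations:
        counts[(v["rule"], v["surface"])] += 1
    # One candidate finding per group; the occurrence count is a private note.
    return [{"rule": rule, "surface": surface, "n_occurrences": n}
            for (rule, surface), n in counts.items()]
```

Five unlabeled inputs on one form become a single candidate finding, so a noisy pattern cannot inflate a category's count.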

Procedure:

1. **Scope.** Prefer the surfaces the analyzed sessions touched (you know them from Step 2) — that's where attribution is strongest. Tag each finding `agent_touched`: `yes` (transcripts show the agent built/edited that surface), `likely`, or `unknown`. Estimate `agent_authorship` for the audited surfaces overall.
2. **Static pass.** Walk the code for each category in turn:
   - `product_fit`: wrong structural container/pattern (modal vs drawer vs inline), buried or duplicated entry points, a concern fragmented across views instead of consolidated. This is a structural call made from the code + the asks, "before any pixel is styled".
   - `visual_craft`: weak hierarchy/emphasis or ragged spacing-rhythm/alignment/type-scale in the new surfaces, read from the diff (class strings, structure/nesting, weight & size usage) against how the app weights comparable surfaces, as the benchmark scores it in v1. **File at lower confidence**, since you're reading code, not pixels.
   - `pathway_completeness`: missing loading/empty/error/pending states, dead-end flows, no cancel/undo.
   - `resilience`: fixed widths/no breakpoints, unhandled fetch failures, unbounded content.
   - `convention_adherence`: hardcoded values where tokens exist, parallel/duplicate components, raw date/number formatting.
   - `content_language`: UI strings with typos, terminology drift, off-voice copy, slop words.
   - `accessibility`: gets its own method in **item 3**. It's the category transcripts and eyeballs miss entirely, so audit it deliberately.
3. **Accessibility — the static subset.** This is where `accessibility` is scored, from **code only**. Two paths, in order:
   - **Fast path — run the linter you already have.** If the project is already configured with an accessibility linter, run it scoped to the surfaces the agent touched and harvest its warnings: `eslint-plugin-jsx-a11y` (React — frequently already on in Next.js / CRA configs), `eslint-plugin-vuejs-accessibility` (Vue), the **Svelte** compiler's built-in a11y warnings, `@angular-eslint` template rules (Angular), or `html-validate`'s WCAG ruleset (plain HTML). This is higher-recall than reading by eye and needs no judgment. Don't install one — if none is configured, use the fallback.
   - **Fallback — apply the checklist by reading the diff.** Walk the touched components for the high-yield, statically-decidable failures, roughly ordered by how often agents trip them: (1) **non-semantic interactives** — a `<div>`/`<span>` with an `onClick` but no `role` + `tabIndex` + key handler; (2) **icon-only or empty controls with no accessible name** — a `<button>`/`<a>` whose only child is an icon/SVG and no `aria-label`; (3) **images with no `alt`** (missing, not a deliberate `alt=""`); (4) **form controls with no associated label** (no `htmlFor`/wrapping/`aria-label`/`aria-labelledby`); (5) **invalid or incomplete ARIA** — a bad role, ARIA props a role requires but lacks, ARIA on an element that doesn't support it; (6) **positive `tabIndex`**; (7) document-level — a missing `<html lang>`, a viewport that disables zoom (`user-scalable=no` / `maximum-scale=1`), a missing `<title>`; (8) links-as-buttons (`<a onClick>` with no `href`), empty headings, skipped heading levels; (9) `outline: none` / `outline-0` on a focusable element with no `:focus-visible` replacement (file at **low** confidence — noisy).
   - **What this can't see, and who owns it.** Static cannot compute **color contrast** (the single most common real a11y failure — it needs resolved colors over the actual rendered background), focus **order**, or the computed accessibility tree. Those are **out of scope here** by construction — the controlled PX-bench benchmark runs axe against the rendered app and owns that grade. So the survey's `accessibility` is the *statically-detectable subset*: a lower bound that deliberately excludes contrast. Set every finding's `detected_by: static_code`.
4. **The intent check (recommended — possible only because you have the transcripts).** From the analyzed sessions, list the main capabilities the user actually asked for, then verify each is present in the current app and behaviorally right (trace it through the code). A capability the agent reported done that is absent or wrong **now** is a latent `intent_fidelity` finding — the class nobody catches in-session, because the human trusted "done." File it `detected_by: static_code`.
5. **Honesty about coverage.** List in `categories_assessed` only what you could genuinely evaluate — `intent_fidelity` only if you did the intent check (item 4 above). A category you didn't assess must **not** appear as a zero — that's what `categories_assessed` is for. Every category is code-assessable in v1, so on a project with a UI you can typically assess all of them, **`accessibility` included** — but as the static subset only (item 3): include it in `categories_assessed`, and read its count as a contrast-excluding lower bound, never a clean WCAG pass. File `visual_craft` at lower confidence (it's the least-validated category, and you're reading code, not pixels).
6. **Record.** Full counts per category in `latent_by_category`; up to 3 exemplar findings per category (highest severity first) in `findings[]`, each with `detected_by: static_code` (the only audit mode in v1).
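The Record step (item 6) can be sketched as a count plus a capped, severity-sorted selection. The ranking `blocking` > `functional` > `cosmetic` is an assumption about what "highest severity first" means, and `record` is an illustrative helper name.

```python
from collections import Counter

# Assumed ranking for "highest severity first".
SEVERITY_RANK = {"blocking": 0, "functional": 1, "cosmetic": 2}

def record(all_findings, per_category=3):
    """Build `latent_by_category` (full counts) and `findings` (capped exemplars)."""
    latent = Counter(f["px_category"] for f in all_findings)
    exemplars = []
    for cat in latent:
        in_cat = [f for f in all_findings if f["px_category"] == cat]
        in_cat.sort(key=lambda f: SEVERITY_RANK[f["severity"]])   # stable sort
        # static_code is the only audit mode in v1, so stamp it on every exemplar.
        exemplars.extend({**f, "detected_by": "static_code"} for f in in_cat[:per_category])
    return {"latent_by_category": dict(latent), "findings": exemplars}
```

The counts always cover every finding, while the exemplars are capped, so `latent_by_category` can legitimately exceed the number of entries in `findings[]`.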

## STEP 5 — Emit the output

Produce three things:

1. **A JSON object** conforming to the survey schema — `classifier` (you — STEP 1b), `sessions[]`, `episodes[]` (include the non-defect episodes too), `app_audit` (Part B — `performed: false` if skipped, else the static-audit object), and a `totals` self-check block. Fill `model` / `agent_harness` / `app_domain` / `task_type` and the `harness_profile` (STEP 3f) per session (`app_domain` and `task_type` you infer from the work; keep `app_domain` generic — "marketing site", not a brand). Schema fields: see the survey's `schema.json`.

   **Emit these top-level keys verbatim, and no others:** `survey_version`, `classifier`, `sessions`, `episodes`, `totals` (required), plus optional `generated_at`, `respondent_note`, and `app_audit`. Three drift traps that get a submission bounced or quarantined — avoid all three:
   - The rework array is named exactly **`episodes`** (not `px_episodes`, `px_episodes_full`, or `episodes_full`), and it holds **every** episode — the non-defect ones (`nature` `scope_change` / `preference` / `infra_env` / `functional_defect` / `clarification`) as full entries too, never a separate counts block. `totals.n_episodes` must equal the length of `episodes`.
   - Part B is the structured **`app_audit`** object (`{"performed": false}` if skipped), never a free-text `app_audit_note`.
   - Any extra context goes in **`respondent_note`** (free text). Invented top-level keys (e.g. `corpus`, `nondefect_episode_counts`) are stored but ignored by the analyzer, so that data is lost — fold it into `respondent_note` or the schema's real fields instead.

   **Stay inside the closed enums** (off-enum values are quarantined): `classifier.harness` and each session's `agent_harness` are one of `claude_code` / `conductor` / `cursor` / `aider` / `codex` / `other` / `unknown` (a single value — put Conductor as `conductor`, not `"claude_code (conductor)"`); `task_type` is one of `new_build` / `feature_add` / `refactor` / `bugfix` / `mixed`; `nature`, `px_category`, `severity`, `caught_by`, and `confidence` exactly as the schema lists them. Every session needs `n_human_turns`. The shape is:

```json
{
  "survey_version": "1.1",
  "classifier": {"model":"claude-fable-5","harness":"claude_code"},
  "sessions": [{"session_label":"session-1","model":"claude-opus-4-8","agent_harness":"conductor","app_domain":"marketing site","task_type":"feature_add","n_human_turns":23,"n_assistant_turns":120,"mostly_autonomous":false,"harness_profile":{"build_context":"existing","has_project_rules":true,"has_global_rules":true,"skills_available":[{"name":"design-review","scope":"global"},{"name":"qa","scope":"global"}],"n_custom_skills_available":3,"review_skills_available":["design-review","qa"],"skills_invoked":["browse"],"review_skills_invoked":[],"review_available_not_used":["design-review","qa"],"ran_app":true,"ran_tests":false,"captured_screenshots":true,"used_browser_tools":true,"used_plan_mode":false}}],
  "episodes": [
    {"session_label":"session-1","nature":"px_defect","px_category":"visual_craft","subfacet":"information-hierarchy","caught_by":"human","severity":"functional","evidence":"the information hierarchy is broken — content above the header","description":"Promo content placed above the page header, burying the H1.","confidence":"high"},
    {"session_label":"session-1","nature":"preference","px_category":null,"caught_by":"human","evidence":"try IBM Plex Serif instead","description":"User exploring heading typeface; prior choice was reasonable.","confidence":"high"}
  ],
  "app_audit": {"performed":true,"agent_authorship":"most","categories_assessed":["product_fit","visual_craft","convention_adherence","pathway_completeness","content_language","resilience","accessibility"],"latent_by_category":{"pathway_completeness":3,"resilience":2,"accessibility":2,"content_language":1},"findings":[{"px_category":"pathway_completeness","subfacet":"empty-state-missing","agent_touched":"yes","severity":"functional","evidence":"the primary list view renders blank with zero items — no empty state","description":"No empty state on the main list view.","confidence":"high","detected_by":"static_code"},{"px_category":"accessibility","subfacet":"non-semantic-interactive","agent_touched":"yes","severity":"functional","evidence":"a row's click target is a div with onClick and no role, tabindex, or key handler","description":"Custom clickable rows are not keyboard-operable.","confidence":"high","detected_by":"static_code"}]},
  "totals": {"n_sessions":1,"n_episodes":2,"by_nature":{"px_defect":1,"functional_defect":0,"infra_env":0,"scope_change":0,"preference":1,"clarification":0},"px_by_category":{"intent_fidelity":0,"product_fit":0,"visual_craft":1,"convention_adherence":0,"pathway_completeness":0,"content_language":0,"resilience":0,"accessibility":0,"other_px":0},"px_caught_by_human":1}
}
```

2. **A short readable digest**: the **task mix** you analyzed (e.g. "4 landing-page builds, 1 dashboard feature-add"), then PX defects by category (with %), shipped-broken rate, PX share of all rework, and any `other_px` themes you noticed. If Part B ran, add the latent picture — defects sitting in the app right now, by category — and the **caught-vs-latent contrast**: what humans flagged in-session vs. what shipped silently. Call out caveats you hit — especially **task skew** ("all sessions were marketing-copy work, so content & language dominates and there were no multi-step flows to test pathway/state"), mostly-autonomous sessions, and any category that was structurally invisible or unassessed.

3. **A tune-up block — for the user, not for us.** From their own data, as observations (≤8 sessions is not statistics):
   - Their **availability-vs-usage gap**: "you have `design-review` and `qa` installed; they ran in 1 of 8 sessions; in that session the agent caught its own defects before handing back — in the other 7, you were the QA."
   - For their top 1–2 defect categories, the matching countermeasure: `content_language` → voice/terminology rules + a copy pass before handoff; `resilience` → "screenshot at ~375px before declaring done"; `pathway_completeness` → "enumerate loading/empty/error states for every new view"; `convention_adherence` → point the rules file at the token/component inventory; `accessibility` → add an a11y linter (`eslint-plugin-jsx-a11y` or your framework's equivalent) to catch the wiring at edit time, and wire axe into your run/QA loop to catch contrast and focus order (transcripts can't see a11y, and neither can eyeballs).
   - Offer to **draft** — never auto-apply — a rules-file block targeting their recurring defects, printed for them to paste into CLAUDE.md (or equivalent).

Then tell the user: *review the JSON, redact anything you like, and — if you choose to share it — submit it on the PX-bench survey page you opened this prompt from (paste it into the form). Submitting it means you agree to the **Survey Contribution Terms** and **Survey Privacy Notice**: the submission is built to minimize identifiers and is kept separate from any contact email, the raw corpus is kept confidential, and you can ask us to delete your submission while it can still be identified. Or keep it — it's yours either way.*
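Computing the `totals` block from `episodes` keeps the two from drifting apart, and the digest rates fall out of the same counts. The sketch below makes two labeled assumptions: the shipped-broken rate is human-caught PX defects over all PX defects, and "all rework" is the three defect natures (`px_defect`, `functional_defect`, `infra_env`). Function names are illustrative.

```python
from collections import Counter

NATURES = ["px_defect", "functional_defect", "infra_env",
           "scope_change", "preference", "clarification"]
CATEGORIES = ["intent_fidelity", "product_fit", "visual_craft", "convention_adherence",
              "pathway_completeness", "content_language", "resilience",
              "accessibility", "other_px"]
REWORK = ("px_defect", "functional_defect", "infra_env")   # assumed meaning of "rework"

def build_totals(sessions, episodes):
    """Derive the `totals` self-check block directly from the episode list."""
    by_nature = Counter(e["nature"] for e in episodes)
    px = [e for e in episodes if e["nature"] == "px_defect"]
    by_cat = Counter(e["px_category"] for e in px)
    return {
        "n_sessions": len(sessions),
        "n_episodes": len(episodes),
        "by_nature": {n: by_nature.get(n, 0) for n in NATURES},
        "px_by_category": {c: by_cat.get(c, 0) for c in CATEGORIES},
        "px_caught_by_human": sum(1 for e in px if e.get("caught_by") == "human"),
    }

def digest_rates(totals):
    """Rates for the readable digest; None when the denominator is zero."""
    n_px = totals["by_nature"]["px_defect"]
    n_rework = sum(totals["by_nature"][n] for n in REWORK)
    return {
        "shipped_broken_rate": totals["px_caught_by_human"] / n_px if n_px else None,
        "px_share_of_rework": n_px / n_rework if n_rework else None,
    }
```

Run against the two example episodes above, `build_totals` reproduces the example's `totals` block, including the zero-filled categories.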

## STEP 6 — Validate before you emit

Models drift from the schema (renamed arrays, off-enum values, missing fields, totals that don't match `episodes`), which gets a submission bounced at upload or quarantined in analysis. So **before you show the user the JSON**, write your assembled object to a file and run this structural check. Fix everything it lists, then re-run until it prints `OK`. Only then present the JSON.

```python
import json, sys
from collections import Counter

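with open(sys.argv[1], encoding="utf-8") as f:  # path to the JSON you just wrote
    sub = json.load(f)
if not isinstance(sub, dict):
    # Every check below calls sub.get(...), so a top-level array or string must stop here.
    sys.exit("FIX THESE:\n- top-level JSON must be an object")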

TOP_REQ = ("survey_version", "classifier", "sessions", "episodes", "totals")
TOP_OK = {*TOP_REQ, "generated_at", "respondent_note", "app_audit"}
HARNESS = {"claude_code", "conductor", "cursor", "aider", "codex", "other", "unknown"}
TASK = {"new_build", "feature_add", "refactor", "bugfix", "mixed"}
NATURE = {"px_defect", "functional_defect", "infra_env", "scope_change", "preference", "clarification"}
CAT = {"intent_fidelity", "product_fit", "visual_craft", "convention_adherence",
       "pathway_completeness", "content_language", "resilience", "accessibility", "other_px"}
SEV = {"cosmetic", "functional", "blocking"}
CONF = {"high", "med", "low"}
err = []

for k in TOP_REQ:
    if k not in sub:
        alt = {"episodes": ("px_episodes", "px_episodes_full", "episodes_full"),
               "app_audit": ("app_audit_note",)}.get(k, ())
        hit = next((a for a in alt if a in sub), None)
        err.append(f"missing top-level key `{k}`" + (f" (you named it `{hit}` — rename it)" if hit else ""))
for k in sub:
    if k not in TOP_OK:
        err.append(f"unknown top-level key `{k}` — fold it into `respondent_note` or drop it; the analyzer ignores it")

c = sub.get("classifier")
if isinstance(c, dict):
    for k in ("model", "harness"):
        if k not in c:
            err.append(f"classifier missing required `{k}`")
    if c.get("harness") not in HARNESS:
        err.append(f"classifier.harness `{c.get('harness')}` not in {sorted(HARNESS)}")
else:
    err.append("classifier must be an object")

sessions = sub.get("sessions", [])
if isinstance(sessions, list):
    for i, s in enumerate(sessions):
        if not isinstance(s, dict):
            err.append(f"sessions[{i}] must be an object")
            continue
        for k in ("session_label", "model", "agent_harness", "task_type", "n_human_turns"):
            if k not in s:
                err.append(f"sessions[{i}] missing required `{k}`")
        if s.get("agent_harness") not in HARNESS:
            err.append(f"sessions[{i}].agent_harness `{s.get('agent_harness')}` not in {sorted(HARNESS)}")
        if s.get("task_type") not in TASK:
            err.append(f"sessions[{i}].task_type `{s.get('task_type')}` not in {sorted(TASK)}")
else:
    err.append("sessions must be an array")

eps = sub.get("episodes", [])
if isinstance(eps, list):
    for i, e in enumerate(eps):
        if not isinstance(e, dict):
            err.append(f"episodes[{i}] must be an object")
            continue
        for k in ("session_label", "nature", "caught_by", "evidence", "description", "confidence"):
            if k not in e:
                err.append(f"episodes[{i}] missing required `{k}`")
        if e.get("nature") not in NATURE:
            err.append(f"episodes[{i}].nature `{e.get('nature')}` invalid")
        if e.get("caught_by") not in {"human", "agent_self"}:
            err.append(f"episodes[{i}].caught_by invalid")
        if e.get("confidence") not in CONF:
            err.append(f"episodes[{i}].confidence invalid")
        if e.get("nature") == "px_defect":
            if e.get("px_category") not in CAT:
                err.append(f"episodes[{i}] px_defect needs a valid `px_category`")
            if e.get("severity") not in SEV:
                err.append(f"episodes[{i}] px_defect needs `severity`")
            if e.get("px_category") == "other_px" and not e.get("subfacet"):
                err.append(f"episodes[{i}] other_px needs a `subfacet`")
else:
    err.append("episodes must be an array")

t = sub.get("totals", {}) or {}
if not isinstance(t, dict):
    err.append("totals must be an object")
    t = {}
eps_for_counts = eps if isinstance(eps, list) else []
if t.get("n_episodes") != len(eps_for_counts):
    err.append(f"totals.n_episodes ({t.get('n_episodes')}) != number of episodes ({len(eps_for_counts)}) — every non-defect episode must be an entry in `episodes`, not a separate counts block")
bn = Counter(e.get("nature") for e in eps_for_counts if isinstance(e, dict))
by_nature = t.get("by_nature")
if isinstance(by_nature, dict):
    # Check both directions: keys listed in by_nature, and valid natures counted but left out of it.
    for k in sorted(set(by_nature) | (set(bn) & NATURE)):
        v = by_nature.get(k, 0)
        if bn.get(k, 0) != v:
            err.append(f"totals.by_nature.{k} ({v}) != counted from episodes ({bn.get(k, 0)})")
else:
    err.append("totals.by_nature must be an object")

print("OK — submission is well-formed" if not err else "FIX THESE:\n- " + "\n- ".join(err))
```

This is a structural gate, not the full schema — it catches the common drift. The authoritative check still runs at analysis time against `schema.json`.
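
The gate above does not check the privacy rules. A supplementary pass could look like the following. This is a rough sketch, not part of the official check: `privacy_warnings` and `PATH_HINT` are hypothetical names, the 200-character limit comes from the Privacy section, and the path pattern is a loose heuristic that will miss some paths and may flag harmless text.

```python
import re

MAX_EVIDENCE_CHARS = 200  # limit stated in the Privacy section

# Heuristic (an assumption): a token starting with /, ~/, ./ or ../ that has at least two segments.
PATH_HINT = re.compile(r"(?:^|\s)(?:~|\.{1,2})?/[\w.-]+/[\w.-]+")


def privacy_warnings(sub):
    """Return warnings for evidence quotes that are too long or look like they contain a path."""
    warnings = []
    episodes = sub.get("episodes", []) if isinstance(sub, dict) else []
    if not isinstance(episodes, list):
        return warnings
    for i, e in enumerate(episodes):
        evidence = e.get("evidence") if isinstance(e, dict) else None
        if not isinstance(evidence, str):
            continue  # the structural check already reports missing fields
        if len(evidence) > MAX_EVIDENCE_CHARS:
            warnings.append(
                f"episodes[{i}].evidence is {len(evidence)} chars (limit {MAX_EVIDENCE_CHARS})"
            )
        if PATH_HINT.search(evidence):
            warnings.append(f"episodes[{i}].evidence may contain a file path; paraphrase it")
    return warnings


# Illustrative data only.
demo = {"episodes": [{"evidence": "x" * 201}, {"evidence": "broke in ~/work/acme/app.tsx"}]}
for w in privacy_warnings(demo):
    print("-", w)
```

Treat the output as prompts for a manual read, not as proof that a quote is safe to share.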

## Calibration examples (from real sessions)

- "the headers are input boxes instead of read-only text — fix" → `px_defect` / `other_px` / `happy-path-render-glitch` (wrong element *type* on the happy path — not a hierarchy/composition problem, so not Visual craft; not condition-varied, so not Resilience). caught_by `human`.
- "the product grid is locked to 5 columns; on a phone the cards crush into unreadable slivers" (agent's own design-review) → `px_defect` / `resilience` / `responsive-layout`. caught_by `agent_self`.
- "the information hierarchy is broken — you put content above the header" → `px_defect` / `visual_craft` / `information-hierarchy` (the eye lands on promo, not the H1 — visual hierarchy, not structural placement). caught_by `human`.
- "the row's name, metadata, and actions are all one weight — nothing for the eye to land on" → `px_defect` / `visual_craft` / `visual-hierarchy` (right tokens, composed flat; the fix recomposes the layout, not the primitives — so Visual craft, not Convention). caught_by `agent_self`.
- "find a different term from 'swear by', we just used it on the previous screen" → `px_defect` / `content_language` / `terminology-repetition`, `confidence: med`. caught_by `human`.
- "try Marck Script instead" → `preference` (taste; prior font was a fine choice). **Not** a defect.
- "let's add a 12-month subscription option" → `scope_change`. **Not** a defect.
- "404 NOT_FOUND when deploying to Vercel" → `infra_env`. Non-PX rework.
- "you merged the rows when I wanted them removed" → `functional_defect` (wrong operation). Non-PX rework.
- *(Part B)* the settings save flow shows "Saved ✓" even when the request fails; no failure state exists in the code → latent finding, `pathway_completeness` / `error-state-missing`, `detected_by: static_code`, `agent_touched: yes` (transcripts show the agent built that form). Nobody caught it in-session — exactly what Part B exists to surface.
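
For orientation, two of the transcript examples above could be written as `episodes` entries like this. It is a sketch against the fields the STEP 6 check requires, not the full schema: `session_label`, `description`, and `severity` are placeholders or assumptions, since the examples above do not state them.

```python
# Fields the STEP 6 structural check requires on every episode.
REQUIRED = ("session_label", "nature", "caught_by", "evidence", "description", "confidence")

repeat = {
    "session_label": "s1",  # placeholder label
    "nature": "px_defect",
    "px_category": "content_language",
    "subfacet": "terminology-repetition",
    "severity": "cosmetic",  # assumed; the example does not state a severity
    "caught_by": "human",
    "evidence": "find a different term from 'swear by', we just used it on the previous screen",
    "description": "Same phrase reused on consecutive screens",  # placeholder wording
    "confidence": "med",
}

font = {
    "session_label": "s1",  # placeholder label
    "nature": "preference",  # not a defect, so no px_category or severity
    "caught_by": "human",
    "evidence": "try Marck Script instead",
    "description": "Taste-driven font swap; the prior font was a fine choice",
    "confidence": "high",  # assumed
}

missing = [(ep["evidence"][:20], k) for ep in (repeat, font) for k in REQUIRED if k not in ep]
print("missing fields:", missing)
```

Non-defect rework such as the font swap still gets its own entry in `episodes`, which is what keeps `totals.n_episodes` consistent.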

Hand it to your agent to run.

01
Paste it into your agent
Claude Code, Conductor, Cursor, or anything else with shell and file access. Run it in a project where you've done user-facing work with the agent. Run time is ~8–10 min using Opus 4.8 at high effort, or a similar model.
02
Review the summary
It prints a small JSON summary, a short readable digest, and a tune-up for your setup. It only reads your session logs and your project's code, locally; nothing leaves your machine until you send it.
03
Send it back, or don't
If you'd like to contribute, paste the JSON into the form below with your email. It's built to minimize identifiers, and it's yours to keep either way.
Review and redact before you send.

The output is designed to hold only counts and short quotes, with identifiers minimized. Before you share it, read it over and take out anything you'd rather keep private.
No source code, full prompts, secrets, or credentials.
No file paths, repository names, or client or employer names.
Keep evidence quotes short (200 characters or fewer) and paraphrase to remove anything sensitive.
When in doubt, redact. The summary is counts and short quotes, nothing more.
What happens to your data
By sending your results, you grant us a license to use them in our research, the PX-bench benchmark, and our publications. We keep the raw corpus confidential and publish only aggregate findings and short redacted examples. You can ask us to delete your submission for as long as we hold it.
Contribution Terms → Privacy Notice →
Contribute your results.

Submitting your results means you agree to the Contribution Terms and Privacy Notice. It's built to minimize identifiers; redact anything you like before you send it.