← All projects

Project · Production · Open source

Reasonable UX

Point it at a URL. A three-model agent suite walks the site, scores the UX, and generates a PDF. The architecture question was whether the advisor pattern's cost premium was justified — and how to measure it.

$0.52–$1.94 per run depending on variant
17 judge calls eval harness $0.58 total
2W / 1L v3_8step wins cheapest variant

What it is

Reasonable UX is a Playwright-based audit agent. It walks a site through ten steps of screenshot → reasoning → action, scores the UX across heuristics, re-reads the findings through three to five personas inferred from the site itself, and stitches the pages together into a single executive summary. Output is a PDF.

The thing I care about is the reasoning layer. Most agent projects end up shaped like this: one model, one prompt, everything runs at whatever tier you paid for. That's expensive on a long-horizon task, and the ceiling on quality is whatever the single model happens to be good at. Reasonable UX is tiered by reasoning depth instead — cheap models do the cheap work, expensive models are consulted only when judgment is actually needed.

A single Opus-only run across a multi-page site cost ~$8. That's not viable for iterative calibration.

Architecture

URL input
    ↓
Scout      [Haiku 4.5]     preflight — bot-block check, interest 1–5
    ↓  interest ≥ 3?
Executor   [Sonnet 4.6]    10-step screenshot → reasoning → Playwright action loop
    ↓  escalate?
Advisor    [Opus 4.6]      judgment-only — beta advisor_20260301 tool
                           executor decides when to escalate; ~8 calls/run
    ↓
Synthesis  [Haiku 4.5]     multi-page exec summary
    ↓
PDF report
Judgment routes up. Execution stays down.

Why not Opus everywhere?

A full Opus run cost ~$8. The architecture question was whether the advisor escalation pattern justified its premium. The eval found it didn't: v3_8step (no advisor) posted 2W/1L vs baseline at $0.70/run. v2_advisor cost $1.83 for a 1W/2L record. The cheaper architecture won.

Why strip image history?

Screenshot bloat is the first cost wall on a multi-step agent loop. Stripping prior images from message history after each step — keeping text reasoning, discarding the visuals — was the highest-leverage cost change in the project. Wired in the first commit, not added under budget pressure. Without it, layering the advisor on top wouldn't pencil out.

Why nav:Label instead of CSS selectors?

The prompt instructs Claude to emit "target": "nav:Pricing" for navigation clicks. Dispatch translates that to get_by_role('link', name='Pricing'). Brittle selectors break when classes change; a semantic handle doesn't. This is the kind of thing the executor wouldn't arrive at on its own — it required a deliberate convention imposed at prompt-design time.

Eval harness

The advisor variant posted 1W/2L against the no-advisor baseline. The cheapest configuration won two of three sites. I needed a way to quantify that before acting on it — so I built a separate eval harness.

  • Pre-registered rubric locked before any judge calls: 4 dimensions (overall verdict, specificity, actionability, coverage)
  • 2-run pilot for calibration, locked before the full sweep
  • 17 Opus-as-judge pairwise comparisons, randomized A/B label order, $0.58 total
  • Nav drift regression detection: _nav_drift_check() fires on every eval run, flags CSS-selector violations automatically
Variant Cost tok/step Overall verdict
v1_baseline $1.32 9,430 champion
v3_8step $0.70 6,753 2W / 1L — cheapest winner
v2_advisor $1.83 12,703 1W / 2L
v4_8step_adv $1.21 10,539 2W / 1L

Finding: the advisor tool added 38–160% cost overhead while losing the overall verdict comparison 1W/2L. v3_8step matched its quality record at 47% of the cost.

Full methodology: DECISIONS.md §9 →

Live artifacts

Audited: Stripe Linear Glossier Figma

Reports available on request. Composite scores ranged 2.00–3.26 across sites and variants.

Report output

Page 1 — cover. stripe.com, 3-persona walkthrough across 22 steps, 66 friction points and 66 prioritized fixes. Editorial theme. Page 3 — executive summary. Overall UX score 3.0/5, 3 personas evaluated, 66 friction points, 66 recommendations. Severity distribution and ranked friction themes (e.g. Pricing transparency 32×). Page 5 — persona card. Design Team Lead (Evaluator) at a Series B fintech, 40-person design org. JTBD, frustrations, success criteria, in-voice quote. Page 9 — walkthrough. Design Team Lead on Stripe homepage. Annotated screenshot, observation, verdict, CTA/Copy/Flow subscores, friction points, and recommendations.
1 · Cover — pg 1 of 35
Selected pages from a 35-page Stripe audit (April 30). Editorial theme — reports also ship in technical and studio variants.

Hard-won bits

Image stripping — pattern from an earlier project

Screenshot bloat is the first cost wall you hit on a multi-step agent loop. I ran into it building QAgent and solved it by walking prior screenshots out of the message history before each new turn — text reasoning stays, images don't. Reasonable UX was built on top of that lesson: stripping was wired in the first commit, not added later under budget pressure. Worth flagging because the advisor pattern adds its own tokens — without image stripping as the floor, layering Opus on top wouldn't pencil out.

JPEG quality tiers

40% quality for per-step screenshots, 60% for full-page below-fold crops. Tuned until legibility broke, then backed off one step. The agent doesn't need print-quality images to reason about layout; spending bytes on crispness it won't use is money on the floor.

The nav:Label dispatch convention

The prompt instructs Claude to emit "target": "nav:Pricing" for navigation links instead of CSS selectors. Dispatch code translates that to get_by_role("link", name="Pricing"). Brittle selectors are a known failure mode of AI-driven browser testing — classes change, IDs get regenerated, your agent breaks weekly. A semantic handle sidesteps that entirely, and it's the kind of thing the executor would not have arrived at on its own.

Persona threading

Step one infers the evaluator persona from the first screenshot. Every subsequent step is prompted as that persona. The result is an internally coherent critique rather than a committee of disagreeing reviewers averaged into incoherence. The site tells you who its reader is; the agent just listens.

Pre-ship data-flow audit

Before publishing, I mapped the full data flow from LLM-extracted strings to HTML output. Found: Claude might extract <script> tags or event handlers from a malicious audit target verbatim into friction points or observations. When the report HTML is opened — including when shared with a founder — the injected markup executes. Stored XSS in the primary deliverable. Fixed with html.escape() wrapping at all 7 injection points in agent_core.py. Caught because the audit was part of the pre-public checklist, not triggered by an incident.

What I'd tell another builder

The advisor pattern isn't just cost optimization. It's a concrete shape for "spec-first, verification-loop" development. You write a specification by picking which model does what and what escalation looks like. The executor model then does the cheap iteration; the advisor sees only the moments where the spec is ambiguous. If you find yourself wanting a bigger model everywhere, that's usually a sign your spec isn't crisp — not that your executor is too dumb.

"Cost down ~4× ($8 Opus → ~$2 Sonnet per 20-URL eval run). The swap regressed pass rate to 5/20... Batch 33 recalibrated labels for Sonnet's actual scoring behavior, restoring 19/20."

Architecture decision record — DECISIONS.md →