Project · Production

Reasonable UX

Point it at a URL. A multi-model agent suite audits the UX and produces a scored PDF. The interesting part isn't the output — it's the architecture that makes it cheap enough to run on real sites.

What it is

Reasonable UX is a Playwright-based audit agent. It walks a site through ten steps of screenshot → reasoning → action, scores the UX across heuristics, re-reads the findings through three to five personas inferred from the site itself, and stitches the pages together into a single executive summary. Output is a PDF.

The thing I care about is the reasoning layer. Most agent projects end up shaped like this: one model, one prompt, everything runs at whatever tier you paid for. That's expensive on a long-horizon task, and the ceiling on quality is whatever the single model happens to be good at. Reasonable UX is tiered by reasoning depth instead — cheap models do the cheap work, expensive models are consulted only when judgment is actually needed.

Architecture

Five layers, each tuned to what it's actually doing.

Planner (Sonnet 4.6) scrapes the target page and generates prioritized test cases. HTML analysis is cheap; don't pay Opus to skim a DOM.
Agent loop (Sonnet 4.6 with advisor, Opus 4.5 otherwise) runs ten steps of screenshot → Claude → Playwright action → repeat.
Advisor tools (advisor_20260301 beta) layer Opus 4.6 judgment on top of a Sonnet executor. The executor decides when to escalate; Opus only sees the ambiguous calls. Advisor-enabled runs get 2048 tokens per advisor call versus 1024 for the executor. Judgment routes up; execution stays down.
Persona revalidation (Sonnet 4.6) re-reads the findings through three to five personas inferred from the site itself. Evaluation coherence beats blended averages.
Report synthesis (Haiku 3.5) stitches multi-page runs into one executive summary. Long-context retrieval on cheap tokens.
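The tiering above can be written down as plain data, which makes the "who does what" decision reviewable in one place. This is a minimal sketch, not the project's actual config: the names `Tier` and `build_tiers` are hypothetical, the model strings are the labels used in this write-up rather than API model IDs, and token budgets are filled in only where the text states them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    model: str                                 # label from the write-up, not an API model ID
    max_tokens: Optional[int] = None           # None means the write-up does not state a budget
    advisor: Optional[str] = None              # model consulted on escalation, if any
    advisor_max_tokens: Optional[int] = None


def build_tiers(advisor_enabled: bool) -> dict[str, Tier]:
    """Map each layer to its model; the agent loop depends on the advisor switch."""
    if advisor_enabled:
        # Sonnet executes at 1024 tokens; Opus advises at 2048 tokens per call.
        agent_loop = Tier("Sonnet 4.6", max_tokens=1024,
                          advisor="Opus 4.6", advisor_max_tokens=2048)
    else:
        # Opus-only baseline used for A/B comparison.
        agent_loop = Tier("Opus 4.5")
    return {
        "planner": Tier("Sonnet 4.6"),
        "agent_loop": agent_loop,
        "persona_revalidation": Tier("Sonnet 4.6"),
        "report_synthesis": Tier("Haiku 3.5"),
    }
```

Keeping the advisor as a field on the agent-loop tier, rather than as a separate layer, mirrors the idea that judgment routes up from the executor instead of running as its own pass.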

Hard-won bits

Four things worth calling out.

Image stripping — pattern from an earlier project

Screenshot bloat is the first cost wall you hit on a multi-step agent loop. I ran into it building QAgent and solved it by walking prior screenshots out of the message history before each new turn — text reasoning stays, images don't. Reasonable UX was built on top of that lesson: stripping was wired in the first commit, not added later under budget pressure. Worth flagging because the advisor pattern adds its own tokens — without image stripping as the floor, layering Opus on top wouldn't pencil out.
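A rough sketch of the stripping step, assuming the history is a list of Anthropic Messages-style dicts whose `content` is either a string or a list of typed blocks. The function name `strip_images` and the placeholder text are hypothetical, and images nested inside tool-result blocks are not handled here.

```python
from typing import Any


def strip_images(
    messages: list[dict[str, Any]],
    placeholder: str = "[earlier screenshot removed]",
) -> list[dict[str, Any]]:
    """Return a copy of the history with every image block swapped for a short text note.

    Text reasoning is kept as-is; the input list and its messages are not mutated.
    """
    stripped: list[dict[str, Any]] = []
    for message in messages:
        content = message["content"]
        if isinstance(content, str):
            # Plain-string content cannot carry an image.
            stripped.append(dict(message))
            continue
        kept = []
        for block in content:
            if block.get("type") == "image":
                kept.append({"type": "text", "text": placeholder})
            else:
                kept.append(block)
        stripped.append({**message, "content": kept})
    return stripped
```

Calling this on the accumulated history before appending the new turn means only the current screenshot is ever sent as image data, so image cost stays flat per step instead of growing with the length of the run.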

JPEG quality tiers

40% quality for per-step screenshots, 60% for full-page below-fold crops. Tuned until legibility broke, then backed off one step. The agent doesn't need print-quality images to reason about layout; spending bytes on crispness it won't use is money on the floor.
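The two tiers can live in one small lookup so the numbers aren't scattered across call sites. This is a sketch under assumptions: `screenshot_kwargs` and the tier names are hypothetical, and capturing the below-fold tier with `full_page=True` is a guess at how the full-page image is taken before cropping. The `type`, `quality`, and `full_page` keys are real arguments of Playwright's `page.screenshot`.

```python
# JPEG quality per screenshot tier, using the values quoted above.
JPEG_QUALITY = {"step": 40, "below_fold": 60}


def screenshot_kwargs(kind: str) -> dict:
    """Build keyword arguments for Playwright's page.screenshot for a given tier."""
    if kind not in JPEG_QUALITY:
        raise ValueError(f"unknown screenshot kind: {kind!r}")
    kwargs = {"type": "jpeg", "quality": JPEG_QUALITY[kind]}
    if kind == "below_fold":
        # Assumption: capture the whole page, then crop below the fold elsewhere.
        kwargs["full_page"] = True
    return kwargs
```

With a Playwright page in hand, usage would look like `page.screenshot(**screenshot_kwargs("step"))`.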

The `nav:Label` dispatch convention

The prompt instructs Claude to emit "target": "nav:Pricing" for navigation links instead of CSS selectors. Dispatch code translates that to get_by_role("link", name="Pricing"). Brittle selectors are a known failure mode of AI-driven browser testing: classes change, IDs get regenerated, your agent breaks weekly. A semantic handle keyed to the link's role and visible label sidesteps most of that, and it's the kind of thing the executor would not have arrived at on its own.
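A minimal sketch of what that dispatch could look like. `resolve_target` is a hypothetical name, and falling back to `page.locator` for targets without the prefix is an assumption about how other actions are handled; `get_by_role` and `locator` are the real Playwright page methods.

```python
NAV_PREFIX = "nav:"


def resolve_target(page, target: str):
    """Turn a model-emitted target string into a Playwright locator.

    'nav:Pricing' becomes a role-based lookup on the link's accessible name;
    anything else is treated as a selector (an assumption for this sketch).
    """
    if target.startswith(NAV_PREFIX):
        label = target[len(NAV_PREFIX):].strip()
        if not label:
            raise ValueError("nav: target is missing a link label")
        return page.get_by_role("link", name=label)
    return page.locator(target)
```

A click then reads `resolve_target(page, action["target"]).click()`, where `action` is the parsed model output. The role lookup matches on the accessible name, so it survives markup refactors as long as the link's label stays the same.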

Persona threading

Step one of the agent loop infers the evaluator persona from the first screenshot. Every subsequent step in that run is prompted as that persona. The result is an internally coherent critique rather than a committee of disagreeing reviewers averaged into incoherence. The site tells you who its reader is; the agent just listens.
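One way to sketch the threading: hold the persona in a small object, ask for it only while it is unset, and pin it once the first step answers. Everything here is illustrative. `PersonaThread`, `build_system_prompt`, and the prompt wording are hypothetical, not the project's actual prompts.

```python
from typing import Optional

# Hypothetical wording for the step-one instruction.
INFER_INSTRUCTION = (
    "Before acting, state in one sentence who this site's primary reader is. "
    "Return it under the key 'persona'."
)


def build_system_prompt(base_prompt: str, persona: Optional[str]) -> str:
    """Ask for a persona when none is known yet; otherwise pin the known one."""
    if persona is None:
        return f"{base_prompt}\n\n{INFER_INSTRUCTION}"
    return f"{base_prompt}\n\nEvaluate every screen as this persona: {persona}"


class PersonaThread:
    """Carries the persona inferred at step one through the rest of a run."""

    def __init__(self, base_prompt: str) -> None:
        self.base_prompt = base_prompt
        self.persona: Optional[str] = None

    def system_prompt(self) -> str:
        return build_system_prompt(self.base_prompt, self.persona)

    def record(self, persona: str) -> None:
        # Only the first inference sticks, so later steps cannot drift.
        if self.persona is None:
            self.persona = persona.strip()
```

Ignoring later calls to `record` is the design choice that matters here: it keeps one evaluator voice for the whole run even if a later response volunteers a different persona.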

Status

In production. Advisor wiring landed in batch 17 on 2026-04-12 — Sonnet 4.6 executor with Opus 4.6 advisor across the agent loop, below-fold analysis, personas.py, and persona_agent.py. The --advisor CLI flag enables A/B comparison against the Opus-only baseline. Matched multi-page runs show 35–59% token overhead on advisor-enabled runs, but the advisor tokens are billed separately and aren't yet split out in reporting. Next piece of work is surfacing that cost breakdown in the PDF output so A/B decisions can be made on $/run, not just quality.
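For the cost breakdown, the arithmetic is simple enough to sketch. Assumptions: overhead is measured as the extra tokens of an advisor-enabled run relative to its matched baseline run, and each model is priced at a single blended rate per million tokens (real pricing usually separates input and output tokens). The function names are hypothetical, and no actual prices or run data are implied.

```python
def token_overhead_pct(baseline_tokens: int, advisor_run_tokens: int) -> float:
    """Percent extra tokens used by the advisor-enabled run versus the baseline."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return (advisor_run_tokens - baseline_tokens) * 100 / baseline_tokens


def cost_per_run(tokens_by_model: dict[str, int], usd_per_mtok: dict[str, float]) -> float:
    """Dollar cost of one run, given per-model token counts and per-model rates.

    Rates are supplied by the caller; one blended rate per model is a simplification.
    """
    return sum(
        tokens * usd_per_mtok[model] / 1_000_000
        for model, tokens in tokens_by_model.items()
    )
```

Token overhead and dollar overhead can diverge: a run that uses more tokens overall may still cost less if most of them land on the cheaper executor, which is why the split by model has to be reported before $/run can be compared.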

What I'd tell another builder

The advisor pattern isn't just cost optimization. It's a concrete shape for "spec-first, verification-loop" development. You write a specification by picking which model does what and what escalation looks like. The executor model then does the cheap iteration; the advisor sees only the moments where the spec is ambiguous. If you find yourself wanting a bigger model everywhere, that's usually a sign your spec isn't crisp — not that your executor is too dumb.