Project · Archival

QA Agent

Point it at a URL. A planner scrapes the page, picks prioritized test cases, and hands them to an agent that drives Playwright screenshot by screenshot until every test has a verdict. Built to find out what agentic browser testing actually feels like from the inside.

shipped
archival
84% token reduction

What it is

QA Agent is an LLM-driven browser testing agent. Point it at a URL; a planner scrapes the page and generates prioritized test cases, then the agent executes them in Playwright — screenshot → Claude → action → repeat — and emits a JSON + HTML report with per-step reasoning and pass/fail verdicts.

Two modes. QA mode (default) takes a goal picked by the planner or passed with --goal and returns pass/fail. UX mode (--mode ux) evaluates the page through three personas in parallel — a CPA vetting firm software, a B2B SaaS specialist, and an experienced SaaS UX designer — each scoring CTA clarity, copy, and flow 1–5 and surfacing friction with actionable recs. A second below-the-fold pass re-navigates, takes a full-page screenshot, and merges findings (pricing signals, social proof, trust indicators) the step-by-step agent couldn't see. UX runs emit a PDF report clean enough to hand to a founder.

Hard-won bits

Nine things worth keeping across projects.

Image stripping — 84% token reduction. Only the current screenshot is sent to Claude each step; prior screenshots are stripped from conversation history while the text reasoning trail is preserved. Average usage dropped from ~55,000 to ~8,800 tokens per run with no measurable hit to test quality. Single biggest cost lever in the codebase.
JPEG at 40% quality. Screenshots are compressed before base64 encoding. Claude reads the page clearly at this level; PNG would cost significantly more per image block.
Model tiering — Sonnet planner, Opus agent. Planner work (HTML → prioritized test cases) doesn't need the reasoning depth that vision-based step-by-step decisions do, so the planner runs on claude-sonnet-4-6 and the agent loop runs on claude-opus-4-5.
--token-budget with pre- and post-call checks. Budget is checked both before and after each API call, so a single step can't blow past the remaining budget after the fact.
URL anchor in every step prompt. The target URL is re-injected into every step, which prevents the agent from hallucinating a different domain and navigating away mid-evaluation.
Click timeout recovery. If a CSS selector isn't found within 10 seconds, the agent gets a feedback message and continues instead of crashing. Covers direct clicks and navigate-converted-to-click cases.
Selector blocklist at the dispatch layer. :contains() is stripped from selectors before Playwright sees them — the model keeps reaching for it despite being told not to, so the blocklist handles it defensively rather than via more prompt scolding.
CI env var flips headless + priority filter. os.environ.get("CI", "false") drives both the Playwright launch (headless in CI, headed locally) and the suite runner, which runs HIGH-priority tests only and marks lower priorities as "Skipped in CI." Same binary feeds both layers, so local and CI runs stay consistent.
Exit codes for CI pass/fail. Suite runner returns sys.exit(0 if result else 1) so GitHub Actions treats a failed suite as a failed job without any extra plumbing.

Status

Complete / archival. QA Agent served its purpose — it was the scaffold for learning how vision-based browser agents behave, and that learning now lives in Reasonable UX, which forked the Playwright + Claude vision architecture and extended it with the advisor pattern, multi-model tiering, and a Jinja2 PDF pipeline. Not under active development.

The archival framing is deliberate. Most of the reason to build QA Agent was to find out what the failure modes of agentic browser testing look like from the inside; a successor that outgrew it is a better outcome than protecting a codebase that already did its job.