I Was Wrong About the Advisor Pattern

I built an eval harness to check whether my advisor pattern was doing anything. It wasn't.

reasonable-ux is a Playwright agent that walks a site and scores the UX. The full project writeup is here.


I built reasonable-ux with an advisor pattern. Sonnet handles each step of the site walkthrough. When a step needs more judgment, the executor kicks the question up to Opus. Judgment routes up, execution stays down. I liked the design.
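A minimal sketch of that routing, for readers who have not seen the pattern. Everything here is an assumption for illustration: `StepResult`, `run_step`, and the two stub functions are hypothetical names, and the stubs stand in for the real model calls. It is not the reasonable-ux code.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    step: str
    note: str
    needs_judgment: bool          # the executor's own flag that it wants a second opinion
    answered_by: str = "executor"


def run_step(
    step: str,
    executor: Callable[[str], StepResult],
    advisor: Callable[[str, str], str],
) -> StepResult:
    """Run one walkthrough step; escalate to the advisor only when flagged."""
    result = executor(step)
    if result.needs_judgment:
        # Judgment routes up: the advisor sees the step and the executor's note.
        advice = advisor(step, result.note)
        return StepResult(step, advice, needs_judgment=False, answered_by="advisor")
    # Execution stays down: the executor's answer is used as-is.
    return result


# Stubs standing in for the cheaper and the more expensive model (hypothetical).
def fake_executor(step: str) -> StepResult:
    hard = "checkout" in step
    return StepResult(step=step, note=f"observed {step}", needs_judgment=hard)


def fake_advisor(step: str, note: str) -> str:
    return f"advisor read of {step}: {note}"


results = [
    run_step(s, fake_executor, fake_advisor)
    for s in ["open homepage", "try checkout"]
]
```

The point of the shape is that the expensive call only happens on flagged steps, which is also why it is easy to put behind a flag later.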

Then I built an eval harness to check whether the advisor was actually doing anything.

It wasn’t.


Self-scoring is a polite mirror

The first thing I tried was having each variant grade its own output. Score the report. Compare the numbers.

The composite scores landed within 0.09 points of each other across all four variants. That’s not a finding. That’s the model telling me what I want to hear, dressed up as data. If you let the same model that wrote the report grade the report, you get something halfway between flattery and noise.
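One way to make "that's not a finding" mechanical is to check the spread before reading anything into the ranking. This is a sketch under assumptions: the function names, the threshold, and the scores are all made up for illustration and are not the numbers from these runs.

```python
def score_spread(scores: dict[str, float]) -> float:
    """Gap between the best and worst composite score."""
    return max(scores.values()) - min(scores.values())


def is_discriminating(scores: dict[str, float], min_spread: float) -> bool:
    """True only if the scorer separates the variants by at least min_spread."""
    return score_spread(scores) >= min_spread


# Placeholder scores and threshold (hypothetical, not measured data).
toy_scores = {
    "variant_a": 7.0,
    "variant_b": 7.0625,
    "variant_c": 7.03125,
    "variant_d": 7.0,
}
MIN_SPREAD = 0.25

usable = is_discriminating(toy_scores, MIN_SPREAD)
```

If `usable` comes back false, the scorer is not telling the variants apart, and the ranking it produces is not worth arguing about.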

I needed someone outside the room.

The harness

Four variants of reasonable-ux ran against three sites: Stripe, Linear, Glossier. Each pair of reports went to Opus 4.6 with a locked rubric, and the judge had to pick a winner on four dimensions: overall verdict, specificity, actionability, coverage.

A few things I made myself do up front, because I didn’t trust myself not to fudge them after the fact.

  • Wrote the rubric and locked it before running a single judge call
  • Calibrated on a two-run pilot, then closed the rubric for good
  • Randomized which report was labeled “A” and which was “B” so position bias couldn’t sneak in
  • Logged every call to Langfuse so I could see the exact reasoning later
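Two of those guardrails are easy to sketch: fingerprinting the rubric so a later edit is detectable, and shuffling the A/B labels while keeping a key to map the judge's pick back to a variant. The helper names, the `"full_advisor"` label, and the seed are assumptions for illustration, and the judge call itself is left out.

```python
import hashlib
import random


def rubric_fingerprint(rubric_text: str) -> str:
    """Hash the rubric so any change after lock-in shows up as a new digest."""
    return hashlib.sha256(rubric_text.encode("utf-8")).hexdigest()


def blind_pair(
    report_1: tuple[str, str],
    report_2: tuple[str, str],
    rng: random.Random,
) -> tuple[dict[str, str], dict[str, str]]:
    """Shuffle two (variant_name, report_text) pairs into anonymous A/B slots.

    Returns (labeled, key): labeled maps "A"/"B" to report text for the judge,
    key maps "A"/"B" back to the variant name and never goes in the prompt.
    """
    pair = [report_1, report_2]
    rng.shuffle(pair)
    labeled = {"A": pair[0][1], "B": pair[1][1]}
    key = {"A": pair[0][0], "B": pair[1][0]}
    return labeled, key


def unblind(verdict: str, key: dict[str, str]) -> str:
    """Translate the judge's "A", "B", or "tie" back to a variant name."""
    if verdict == "tie":
        return "tie"
    return key[verdict]


rng = random.Random(17)  # seeded so the label assignment can be replayed
labeled, key = blind_pair(
    ("v3_8step", "report text one"),
    ("full_advisor", "report text two"),
    rng,
)
winner = unblind("A", key)  # pretend the judge answered "A"
```

Storing the fingerprint and the key next to each logged judge call means a verdict can always be traced to the exact rubric and the exact label assignment it was made under.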

Total cost: $0.58 across 17 pairwise comparisons. Less than a sandwich.

The result

The advisor pattern lost.

The cheapest variant, v3_8step (no advisor, fewer steps), won two of three sites. The full advisor variant cost more than twice as much and posted a 1W/2L record. It matched the baseline on specificity and actionability but lost coverage every time. The extra reasoning was making the reports tighter where they needed to be looser.
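For anyone reproducing this, here is one way to roll pairwise verdicts up into a W/L record per variant. It is a sketch, not the aggregation used above: the tuple layout and the `tally` helper are assumptions, and the rows are placeholders rather than the actual verdicts.

```python
from collections import defaultdict


def tally(verdicts):
    """Count wins, losses, and ties per variant.

    verdicts: iterable of (site, variant_a, variant_b, winner), where winner
    is one of the two variant names or the string "tie".
    """
    record = defaultdict(lambda: {"W": 0, "L": 0, "T": 0})
    for _site, a, b, winner in verdicts:
        if winner == "tie":
            record[a]["T"] += 1
            record[b]["T"] += 1
            continue
        if winner not in (a, b):
            raise ValueError(f"winner {winner!r} is not part of the pair")
        loser = b if winner == a else a
        record[winner]["W"] += 1
        record[loser]["L"] += 1
    return dict(record)


# Placeholder rows (hypothetical sites, variants, and outcomes).
toy_verdicts = [
    ("site_1", "variant_x", "variant_y", "variant_x"),
    ("site_2", "variant_x", "variant_y", "variant_y"),
    ("site_3", "variant_x", "variant_y", "tie"),
]
records = tally(toy_verdicts)
```

Running the same tally per rubric dimension instead of on the overall verdict is how a pattern like "matches on specificity, loses on coverage" becomes visible.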

Sitting with being wrong

The thing I had been most proud of architecturally turned out to cost more for less.

There’s a version of this where I rationalize. The judge was biased. Opus prefers Opus. The sites I picked weren’t representative. The rubric was wrong. I had all of those in my head within ten minutes. Most of them were probably at least partly true. None of them changed what the data said.

I shipped the change to make v3_8step the default and kept advisor as an opt-in flag for runs where I want the deeper read.

What I’d tell someone building their first eval harness

Two things stick.

Write the rubric before you run it. Anything you decide after seeing the numbers is rationalization with extra steps. Pre-registering forces you to commit to what winning means before you know who’s winning.

Build the harness even when you’re sure of the answer. Especially when you’re sure. The whole reason to do this is that you can’t tell which assumptions are load-bearing and which are vibes until something forces the comparison.

The eval cost less than a sandwich and overturned the design decision I felt strongest about. That’s a good ratio.