Project · Personal OS

Agent Panel

Eleven specialist agents, each with a typed interface and a narrow scope. An eval harness runs periodic trajectory checks to surface drift before it compounds. The whole thing is LLMOps patterns made operational at personal scale.

The problem

A single Claude session that covers vault queries, hiring analysis, curriculum design, and automation review accumulates context contamination fast. Each domain has different priors, different failure modes, and different things worth flagging. One generalist agent ends up averaging across all of them — and the average isn't good at any of them.

The decision was to route by domain rather than smash everything into one session. Each specialist agent gets a briefing document, a narrow input contract, and a defined output shape. Context stays clean. Judgment stays domain-specific.
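The routing decision can be sketched as a small typed contract plus a domain registry. This is a hypothetical illustration, not the actual codebase: the names (`AgentRequest`, `route`, the registry entries) are assumptions, and the deliberate absence of a generalist fallback is the point.

```python
# Hypothetical sketch of domain routing with typed contracts.
# Names and registry entries are illustrative, not from the real system.
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentRequest:
    domain: str   # e.g. "vault", "hiring", "llmops"
    intent: str   # narrow input contract, defined per agent
    payload: str

# Registry maps a domain to its specialist. There is intentionally no
# generalist fallback: an unknown domain is an error, not a shrug.
REGISTRY: dict[str, str] = {
    "vault": "vault-expert",
    "hiring": "hiring-analyst",
    "llmops": "llmops-expert",
}

def route(req: AgentRequest) -> str:
    try:
        return REGISTRY[req.domain]
    except KeyError:
        raise ValueError(f"no specialist for domain {req.domain!r}")
```

Failing loudly on an unregistered domain is what keeps the "average across all domains" failure mode from sneaking back in.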

What's inside

Specialists break down roughly by domain:

  • Vault expert — four allowlisted intents over the Obsidian vault (project status, commit digest, skills inventory, dashboard snapshot). No free-form read; no PII surface.
  • Hiring analyst + reviewer — expert/adversarial pair that evaluates portfolio signal and hiring trajectory. Expert proposes; reviewer stress-tests.
  • LLMOps expert + reviewer — grounded in a briefing on orchestration, routing, cost, and eval patterns. Used before architectural decisions.
  • Software architecture expert + reviewer — same expert/reviewer pattern for system design choices.
  • Portfolio design expert + reviewer — evaluates portfolio site decisions against a hiring-audience standard.
  • QRSPI compliance reviewer — checks plans against an 11-principle methodology. Blocks shortcuts that look like progress.
  • General AI advisor — surfaces bleeding-edge LLMOps patterns, brainstorms project ideas, analyzes AI articles.
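The vault expert's allowlist gate can be sketched as follows. The four intent names come from the list above; the function name and dispatch shape are assumptions for illustration.

```python
# Sketch of the vault expert's intent allowlist. The four intents match
# the ones described above; the handler shape is an assumption.
ALLOWED_INTENTS = frozenset({
    "project_status",
    "commit_digest",
    "skills_inventory",
    "dashboard_snapshot",
})

def handle_vault_query(intent: str, args: dict) -> dict:
    # Anything outside the allowlist is rejected before touching the
    # vault, which is what keeps free-form reads (and PII) off the surface.
    if intent not in ALLOWED_INTENTS:
        raise PermissionError(f"intent {intent!r} is not allowlisted")
    return {"intent": intent, "args": args}  # dispatch stub
```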

Eval harness

The agents are only half of it. The eval harness is what makes the panel a portfolio piece rather than a prompt library.

A weekly review script samples recent agent outputs, scores them against a rubric (domain relevance, overclaim rate, citation quality, actionability), and writes findings back to the vault. The trajectory check runs on a separate cadence — it looks at the delta between what the agents recommended and what actually shipped, and flags cases where the gap is systematic rather than situational.
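The rubric scoring step can be sketched as a weighted sum over the four axes named above. The weights here are illustrative assumptions, as is the 0–1 score convention (with overclaim rate inverted so that less overclaiming scores higher).

```python
# Minimal sketch of rubric scoring, assuming 0-1 scores per axis.
# Axes match the rubric named above; the weights are assumptions.
RUBRIC_WEIGHTS = {
    "domain_relevance": 0.35,
    "overclaim_rate": 0.25,   # inverted: less overclaiming scores higher
    "citation_quality": 0.20,
    "actionability": 0.20,
}

def score_output(scores: dict[str, float]) -> float:
    missing = RUBRIC_WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"rubric axes missing: {sorted(missing)}")
    return sum(RUBRIC_WEIGHTS[axis] * scores[axis] for axis in RUBRIC_WEIGHTS)
```

Requiring every axis to be present keeps a lazy sample from silently scoring well by omission.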

The practical effect: agent drift surfaces as a number, not a vibe. When the hiring analyst starts recommending things that contradict the portfolio strategy, the eval catches it before it compounds into a week of work aimed in the wrong direction.
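One way to make the systematic-versus-situational distinction concrete: track whether each recommendation actually shipped, and flag when the mismatch rate trends one way over a window rather than scattering. This is a hedged sketch; the function name and threshold are assumptions.

```python
# Hypothetical sketch of the "systematic vs situational" check.
# followed[i] is True when the i-th recommendation actually shipped;
# the 0.5 threshold is an assumption, not the real tuning.
def is_systematic_drift(followed: list[bool], threshold: float = 0.5) -> bool:
    if not followed:
        return False
    ignore_rate = followed.count(False) / len(followed)
    return ignore_rate >= threshold
```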

Architecture pattern

Every specialist follows the same shape: a briefing document (domain knowledge, pitfall catalog, verification questions, kill case), a markdown persona file, and a typed prompt contract. The harness reads the persona files to discover agents; new agents are added by dropping a file, not by editing routing code.
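The file-drop discovery described above might look like the sketch below, assuming persona files are markdown with a simple `name:` field near the top. The directory layout, field name, and filename fallback are all assumptions.

```python
# Sketch of file-drop agent discovery: scan a directory of markdown
# persona files, read a "name:" field, fall back to the filename.
# The field name and layout are assumptions about the real files.
from pathlib import Path

def discover_agents(persona_dir: str) -> list[str]:
    agents = []
    for path in sorted(Path(persona_dir).glob("*.md")):
        for line in path.read_text().splitlines():
            if line.startswith("name:"):
                agents.append(line.split(":", 1)[1].strip())
                break
        else:
            agents.append(path.stem)  # no name field: use the filename
    return agents
```

Adding an agent is then literally dropping a file into the directory; no routing code changes hands.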

Expert/reviewer pairs are the key structural decision. The expert proposes; the reviewer adversarially stress-tests before the proposal reaches Ryan. This is the trust ladder pattern applied to personal decision-making — the reviewer catches the category of error where the expert is confidently wrong.
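The pairing can be sketched as a two-stage pipeline where the callables stand in for LLM calls. The key design choice shown here is an assumption consistent with the text: the reviewer's objections don't silently edit the proposal; both reach the human.

```python
# Illustrative expert/reviewer pairing. The callables stand in for
# LLM calls; returning both artifacts (rather than letting the
# reviewer overwrite the proposal) is the assumed contract.
from typing import Callable

def run_pair(expert: Callable[[str], str],
             reviewer: Callable[[str, str], list[str]],
             task: str) -> tuple[str, list[str]]:
    proposal = expert(task)
    objections = reviewer(task, proposal)  # adversarial pass
    return proposal, objections            # both reach the human
```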

Status

Running daily. Eleven specialist agents deployed; eval harness is operational. The panel is the infrastructure layer everything else builds on top of — vault queries, project status reads, architectural reviews, and portfolio decisions all route through it. The ongoing work is tightening the eval rubric as the agent output patterns mature.