Demo: your toggles save to localStorage; nothing leaves the browser. The live tracker runs against a SQLite-backed dashboard on my private box.

Agentic architecture, MCP tool design, Claude Code workflows, context management. ~70% of exam surface overlaps existing daily practice.

pursue — priority #1

Foundational GenAI-on-AWS vocabulary. Pre-SAA warmup that fills the Bedrock/SageMaker gap SAA skims over.

pursue

Locked legibility cert. Standalone study (Cantrill + Tutorials Dojo); curriculum Phase 7 deploys provide hands-on.

pursue (locked)

You don't start here at zero. Name what you've built in LLMOps terms:

  • reasonable-ux runs Opus as an "advisor" over Sonnet/Haiku "executors." That's model routing + tiered inference — literally chapter one of any LLMOps book.
  • reasons-qagent pairs Playwright with a Claude vision model for scoring. That's grounded evals — the "measure quality with a judge model" pattern.
  • An agent-patterns/ reference catalog is LLMOps knowledge management.
  • Claude Code skills (/build-prompt, /session-review, /checkpoint) are prompt engineering + workflow discipline with feedback loops.
  • A post-commit pipeline (sync.py → validators) is observability thinking — pointed at a vault instead of an LLM.

What's missing is vocabulary, telemetry, and a gateway — not foundational skill.

  • Reframe existing work in LLMOps vocabulary 20 min
  • Flag the one gap that feels embarrassing and write why 10 min

Everything in LLMOps fits in one of four layers. If you can name the layer, you can shop for tools.

| Layer | What it does | First tool |
| --- | --- | --- |
| Gateway | Abstracts "which model/provider" away from your code. Adds fallbacks + cost tracking. | LiteLLM |
| Observability | Records every call: prompt, response, cost, latency, who/when. | Langfuse |
| Eval | Scores output quality against a known-good set. Deterministic first, LLM-judge second. | Ragas (RAG) or DeepEval (general) |
| Structured outputs | Forces the LLM to return a typed object, not free-form text. | Instructor |

If a tool claims to "do LLMOps" and you can't place it in one of these four boxes, it's either an orchestration framework (higher layer) or hype.
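
To make the layers concrete, here is a minimal sketch of all four meeting in one call path. It assumes LiteLLM's built-in Langfuse callback and Instructor's LiteLLM wrapper; the model ID, schema, and assertions are placeholders, not anything from these projects.

```python
# Sketch only: the four layers in one call path. Model ID, schema, and
# assertions are placeholders, not values from reasonable-ux.
import litellm
import instructor
from pydantic import BaseModel

# Observability: LiteLLM's built-in Langfuse callback; it reads
# LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST from the environment.
litellm.success_callback = ["langfuse"]

# Structured outputs: a typed object instead of free-form text.
class Summary(BaseModel):              # hypothetical schema
    title: str
    bullet_points: list[str]

# Gateway: Instructor wraps litellm.completion, so every call goes through
# one provider-agnostic entry point.
client = instructor.from_litellm(litellm.completion)

result = client.chat.completions.create(
    model="anthropic/claude-3-5-haiku-20241022",   # swapped by routing config later
    max_tokens=1024,
    response_model=Summary,
    messages=[{"role": "user", "content": "Summarize this note: ..."}],
)

# Eval, Level 1: deterministic assertions, no judge model involved.
assert result.title.strip()
assert 1 <= len(result.bullet_points) <= 5
```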

  • Read Applied LLMs sections 1 + 3 45 min
  • Skim ZenML's 457 case study roundup 20 min
  • Build the 4-layer tool-slotting table 30 min

Hands-on. Pick one project per session — don't try to retrofit both the same afternoon.

Recommended order: reasonable-ux first (most active, so the richest telemetry will follow). Retrofit reasons-qagent the following weekend (~1 hr — Langfuse already up, mostly copy-paste config).

  1. Wrap existing Claude calls through LiteLLM. Add Langfuse instrumentation. Wrap scoring outputs in Instructor.
  2. Save a session prompt in the project's CLAUDE.md: "this repo uses LiteLLM + Langfuse + Instructor — routing lives in config/models.py." A possible shape for that config is sketched after this list.
  3. Checkpoint: you can now see every call's cost + latency in a dashboard. You went from "it works" to "I can measure it."
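
One plausible shape for that config, assuming LiteLLM's Router owns the advisor/executor split; the tier names and model IDs here are placeholders to adapt, not the project's actual values.

```python
# config/models.py (hypothetical layout): the one file that knows which model runs where.
from litellm import Router

# Tier names are what call sites request; litellm_params say what actually runs.
MODEL_LIST = [
    {"model_name": "executor",        "litellm_params": {"model": "anthropic/claude-3-5-haiku-20241022"}},
    {"model_name": "executor-strong", "litellm_params": {"model": "anthropic/claude-3-5-sonnet-20241022"}},
    {"model_name": "advisor",         "litellm_params": {"model": "anthropic/claude-3-opus-20240229"}},
]

router = Router(model_list=MODEL_LIST)

# Call sites ask for a tier, never a provider or model version:
#   router.completion(model="executor", messages=[...])
```

Call sites ask for a tier name rather than a provider string, which is what turns the Phase 4 routing experiments into a one-file change.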
  • Stand up self-hosted Langfuse 30 min
  • Install the three deps in reasonable-ux venv 15 min
  • Find and wrap Claude-call entry points through LiteLLM 1 hr
  • Add Langfuse instrumentation 30 min
  • Wrap scoring outputs in Instructor 45 min
  • Document the stack in reasonable-ux CLAUDE.md 15 min
  • Run /session-review before commit 10 min
  • reasons-qagent: copy the LiteLLM wrap from reasonable-ux 30 min
  • reasons-qagent: point Langfuse at it as a second project 15 min
  • reasons-qagent: Instructor for the multi-model scoring output (sketch below) 15 min
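
For that last task, a sketch of what the Instructor wrap on the scoring output could look like; the rubric fields are invented for illustration, and the existing screenshot-plus-rubric payload is left as a placeholder rather than reproduced.

```python
# Hypothetical scoring schema for the judge call; rubric fields are invented.
import instructor
from litellm import completion
from pydantic import BaseModel, Field

class UxScore(BaseModel):
    layout: int = Field(ge=1, le=5)        # illustrative rubric dimensions
    readability: int = Field(ge=1, le=5)
    rationale: str

client = instructor.from_litellm(completion)

score = client.chat.completions.create(
    model="anthropic/claude-3-5-sonnet-20241022",
    max_tokens=512,
    response_model=UxScore,
    # Keep the existing screenshot-plus-rubric payload; only the response type changes.
    messages=[{"role": "user", "content": "<rubric + screenshot message as today>"}],
)
# A malformed score now raises at the call site instead of leaking free text downstream.
```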

This is the part 90% of people skip and pay for later.

  • Level 1 — Deterministic assertions. Regex, schema checks, length bounds. No LLM involved. Cheap, fast, catches dumb failures. Build these first.
  • Level 2 — LLM-as-judge. A smarter model scores the output. Slower, expensive per run, catches quality regressions.
  • Level 3 — Human review on a sample. The ground truth layer. Weekly, on a random sample of ~20 rows.

Read, in this order, spread over 2–3 weeknights:

  1. Hamel Husain — Your AI Product Needs Evals — the single most important thing in this curriculum
  2. Jason Liu — There Are Only 6 RAG Evals — only if you'll do RAG
  3. Applied LLMs sections 1 + 3

Phase 3 deliverable: one eval-set file checked into reasonable-ux with 20 labeled examples + Level-1 assertions running in pytest. Nothing fancy.
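
A sketch of what that deliverable could look like, assuming a JSONL golden set; the path and field names are placeholders to swap for whatever reasonable-ux actually records.

```python
# tests/test_eval_level1.py: deterministic checks over the golden set.
# Path and field names are illustrative, not reasonable-ux's actual layout.
import json
from pathlib import Path

import pytest

GOLDEN = Path("evals/golden_set.jsonl")   # 20 hand-labeled examples, one JSON object per line

def cases():
    return [json.loads(line) for line in GOLDEN.read_text().splitlines() if line.strip()]

@pytest.mark.parametrize("case", cases(), ids=lambda c: c["id"])
def test_level1(case):
    out = case["output"]                             # latest captured model output
    assert len(out) <= case.get("max_chars", 2000)   # length bound
    for required in case.get("must_contain", []):    # required substrings
        assert required in out
    for banned in case.get("must_not_contain", []):  # banned phrases
        assert banned not in out
```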

  • Read Hamel Husain — Your AI Product Needs Evals 45 min
  • Assemble a 20-example golden set for reasonable-ux 2 hrs
  • Write Level-1 deterministic assertions as pytest 1 hr
  • Wire eval run into reasonable-ux CI (or local pre-push) 30 min
  • Optional — Jason Liu 6 RAG evals (only if vault-search graduates from learning-sandbox) 30 min

Once the gateway is in place, routing gets interesting:

  • Fallback chains. If Anthropic is down, LiteLLM retries OpenAI transparently. Free resilience.
  • Cost-based routing. Try Haiku first, escalate to Sonnet only if Haiku's output fails validation. You already think this way in reasonable-ux — formalize it via LiteLLM's router (see the sketch after this list).
  • Prompt caching discipline. Anthropic's cache is cheap on reads (0.1× base input cost) but expensive on writes (1.25–2× cost). If your cache prefix isn't byte-exact stable, you pay write prices forever and never notice. Log cache_read_input_tokens and cache_creation_input_tokens separately from day one.
  • Hard cost ceilings. LiteLLM virtual keys enforce per-key budgets in code. Dashboards alert after the fact; virtual keys prevent the bill.
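
A sketch of the escalate-on-validation-failure loop from the cost-based-routing bullet, assuming the validation step is the Phase 3 Level-1 checks (or a Pydantic parse); model IDs are placeholders. Provider-outage fallbacks are a separate concern handled by the Router's fallbacks setting, not this loop.

```python
# Cost-based routing sketch: try the cheap tier, escalate only when validation fails.
from litellm import completion
from pydantic import ValidationError

CHEAP = "anthropic/claude-3-5-haiku-20241022"      # placeholder model IDs
STRONG = "anthropic/claude-3-5-sonnet-20241022"

def run_with_escalation(messages, validate):
    """validate(text) raises AssertionError/ValidationError when Level-1 checks fail."""
    for model in (CHEAP, STRONG):
        resp = completion(model=model, messages=messages, max_tokens=1024)
        text = resp.choices[0].message.content
        try:
            validate(text)
            return model, text   # Langfuse shows which tier actually answered, and what it cost
        except (AssertionError, ValidationError):
            continue
    raise RuntimeError("both tiers failed validation")
```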

Phase 4 deliverable: one concrete routing change with before/after cost numbers from Langfuse. The deliverable is the number, not the elegance.

  • Read Anthropic prompt caching docs 30 min
  • Scan LiteLLM router docs 20 min
  • Configure a fallback chain in reasonable-ux 45 min
  • Enable split cache-token telemetry (sketch after this list) 30 min
  • Create a LiteLLM virtual key with a hard budget 30 min
  • Run the Haiku-first/Sonnet-escalate experiment 2 hrs
  • Capture the win (or loss) in a project note 20 min
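
A sketch of the split cache-token telemetry, reading the usage fields straight off the Anthropic SDK response; whether the same fields surface through LiteLLM depends on version, so treat this as the ground-truth check. The prefix, logger name, and message are placeholders.

```python
# Log cache writes and reads separately so a silently broken cache prefix shows up fast.
import logging
import anthropic

log = logging.getLogger("llm.cache")
client = anthropic.Anthropic()

STABLE_PREFIX = "<long, byte-exact-stable system prompt>"   # placeholder

resp = client.messages.create(
    model="claude-3-5-haiku-20241022",
    max_tokens=512,
    system=[{"type": "text", "text": STABLE_PREFIX,
             "cache_control": {"type": "ephemeral"}}],
    messages=[{"role": "user", "content": "..."}],
)

u = resp.usage
# Healthy: creation tokens on the first call, read tokens after that.
# Creation tokens on every call means the prefix isn't stable and you're paying write prices.
log.info("input=%s cache_write=%s cache_read=%s",
         u.input_tokens, u.cache_creation_input_tokens, u.cache_read_input_tokens)
```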

Private skill → public artifact is what moves your compensation floor.

Three concrete writeups, each built from retrofit telemetry:

  1. "A solo dev's minimum-viable LLMOps stack" — LiteLLM + Langfuse + Instructor, 4 deps, install in an afternoon. Screenshots of Langfuse dashboards from reasonable-ux.
  2. "Cost discipline for Claude apps" — prompt-caching telemetry, virtual-key budgets, routing-based savings. Numbers from a real project.
  3. "Evals without a team" — Hamel's Level 1/2/3 applied to a single-dev workflow. Published eval-set + scorer code on GitHub.

Cadence: one writeup per quarter is realistic. Don't try to batch them.

Rule: writing-before-doing reads as hollow; writing-after-doing reads as credible. Hold the line.

  • Gate — 2 weeks of real Langfuse data on reasonable-ux checkpoint
  • Pull the three hero numbers from Langfuse 30 min
  • Draft writeup #1 — A solo dev's minimum-viable LLMOps stack 3 hrs
  • Publish to portfolio site 1 hr
  • Share on X under @reasonequals 15 min

LLMOps brushes against platform engineering. For the AWS SAA legibility cert:

  • Networking basics (VPC, subnets, security groups) — 1–2 weekends.
  • IAM + secrets — a couple sessions. The Whoop token and Langfuse keys already forced some of this.
  • One serverless deployment (Lambda + API Gateway) — 1 weekend hands-on.
  • One container deployment (ECS or Fargate) — 1 weekend hands-on.

Realistic AWS SAA study window: 3–4 months of nights/weekends. Target exam: late 2026 or early 2027.

Skip: Classical MLOps (MLflow, Kubeflow, SageMaker training). Kubernetes depth (homelab teaches fundamentals; CKA isn't needed). Data engineering (Airflow, dbt, Spark).

  • Take AWS AI Practitioner (AIF-C01) as the warmup 2–3 weeks
  • VPC / subnets / security groups 1–2 weekends
  • IAM + secrets 1 weeknight
  • Serverless deployment — Lambda + API Gateway 1 weekend
  • Container deployment — ECS Fargate 1 weekend
  • AWS SAA (SAA-C03) 3–4 months study