LLMOps Curriculum
Agentic architecture, MCP tool design, Claude Code workflows, context management. ~70% of exam surface overlaps existing daily practice.
Foundational GenAI-on-AWS vocabulary. Pre-SAA warmup that fills the Bedrock/SageMaker gap SAA skims over.
Locked legibility cert. Standalone study (Cantrill + Tutorials Dojo); curriculum Phase 7 deploys provide hands-on.
You don't start here at zero. Name what you've built in LLMOps terms:
- reasonable-ux runs Opus as an "advisor" over Sonnet/Haiku "executors." That's model routing + tiered inference — literally chapter one of any LLMOps book.
- reasons-qagent pairs Playwright with a Claude vision model for scoring. That's grounded evals — the "measure quality with a judge model" pattern.
- An `agent-patterns/` reference catalog is LLMOps knowledge management.
- Claude Code skills (`/build-prompt`, `/session-review`, `/checkpoint`) are prompt engineering + workflow discipline with feedback loops.
- A post-commit pipeline (`sync.py` → validators) is observability thinking — pointed at a vault instead of an LLM.
What's missing is vocabulary, telemetry, and a gateway — not foundational skill.
- Reframe existing work in LLMOps vocabulary 20 min
- Where: today's daily note, short section titled "LLMOps vocabulary map"
- Output: 5-row table mapping each existing artifact to its LLMOps term
- Acceptance: you can say out loud "I already do X — it's called Y" for each row without hesitation
- Flag the one gap that feels embarrassing and write why 10 min
- Where: same daily note, one sentence
- Output: e.g., "I've been vibes-evaluating reasonable-ux output for months — no golden set, no deterministic assertions"
- Acceptance: gap is specific enough to become a Phase 3 task, not vague
Everything in LLMOps fits in one of four layers. If you can name the layer, you can shop for tools.
| Layer | What it does | First tool |
|---|---|---|
| Gateway | Abstracts "which model/provider" away from your code. Adds fallbacks + cost tracking. | LiteLLM |
| Observability | Records every call: prompt, response, cost, latency, who/when. | Langfuse |
| Eval | Scores output quality against a known-good set. Deterministic first, LLM-judge second. | Ragas (RAG) or DeepEval (general) |
| Structured outputs | Forces the LLM to return a typed object, not free-form text. | Instructor |
If a tool claims to "do LLMOps" and you can't place it in one of these four boxes, it's either an orchestration framework (higher layer) or hype.
- Read Applied LLMs sections 1 + 3 45 min
- Output: highlight 3 lines that surprise you; paste them into the daily note
- Acceptance: you could defend each highlight in one sentence
- Skim ZenML's 457 case study roundup 20 min
- Goal: confirm the "LangChain-as-never-extracted-scaffolding" pathology is real across industries
- Acceptance: you can name 2 other recurring pathologies from the writeup
- Build the 4-layer tool-slotting table 30 min
- Where: `40-reference/llmops-tools-map.md` (new file — reference material, not a daily note)
- Output: 10+ tools slotted into Gateway / Observability / Eval / Structured Outputs / Orchestration. Each row: tool name, layer, one-line purpose, install-now vs defer vs skip.
- Acceptance: given a new tool name you haven't seen, you can slot it in < 30 seconds
Hands-on. Pick one project per session — don't try to retrofit both the same afternoon.
Recommended order: reasonable-ux first (most active, richest telemetry will follow). Retrofit reasons-qagent the following weekend (~1 hr — Langfuse already up, mostly copy-paste config).
- Wrap existing Claude calls through LiteLLM. Add Langfuse instrumentation. Wrap scoring outputs in Instructor.
- Save a session prompt in the project's CLAUDE.md: "this repo uses LiteLLM + Langfuse + Instructor — routing lives in `config/models.py`."
- Checkpoint: you can now see every call's cost + latency in a dashboard. You went from "it works" to "I can measure it."
- Stand up self-hosted Langfuse 30 min
- Where: run the docker-compose from the langfuse repo on the heavy-AI homelab host — not on the Mac
- Output: Langfuse UI reachable on LAN; project created; public/secret keys copied into a `.env` you git-ignore
- Acceptance: you can open the Langfuse dashboard and see "0 traces"
- Install the three deps in reasonable-ux venv 15 min
- Command: `uv add litellm langfuse instructor` (or pip equivalent)
- Acceptance: `python -c "import litellm, langfuse, instructor"` runs clean
- Find and wrap Claude-call entry points through LiteLLM 1 hr
- How: grep for `anthropic.`, `Anthropic(`, `client.messages.create` to find call sites
- Replacement: `litellm.completion(model="claude-sonnet-4-6", messages=...)` — LiteLLM speaks the OpenAI schema, so response parsing will need touching (sketch below)
- Centralize: routing config lives in `config/models.py` — tiers declared there, not sprinkled at call sites
- Acceptance: one full reasonable-ux audit run completes end-to-end through LiteLLM
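A minimal call-site sketch, assuming LiteLLM's OpenAI-shaped response; the model string and function name are illustrative, not lifted from reasonable-ux:

```python
# Sketch only: a direct anthropic-SDK call replaced by LiteLLM's OpenAI-style API.
import litellm

def run_audit_step(system_prompt: str, user_prompt: str) -> str:
    response = litellm.completion(
        model="claude-sonnet-4-6",  # assumption: whatever ID your LiteLLM version maps to Sonnet
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        max_tokens=1024,
    )
    # OpenAI schema: the text lives under choices[0].message, not content[0].text
    return response.choices[0].message.content
```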
- Add Langfuse instrumentation 30 min
- How: `@observe()` decorator on the top-level audit entry point; `langfuse_context.update_current_observation(...)` at scoring boundaries for metadata (sketch below)
- Acceptance: running one audit produces ≥1 trace in the Langfuse UI with cost + latency populated
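A sketch assuming the Langfuse decorator API this note names (`@observe()`, `langfuse_context`); import paths move between SDK versions, and `score_page` is a stand-in for the real scoring step:

```python
# Sketch: one trace per audit run, metadata attached at the scoring boundary.
from langfuse.decorators import observe, langfuse_context

@observe()  # nested observation: appears inside the audit trace
def score_page(url: str, persona: str) -> dict:
    result = {"score": 0, "critique": ""}  # stand-in for the real scorer
    langfuse_context.update_current_observation(
        metadata={"url": url, "persona": persona, "scorer": "ux-judge"}
    )
    return result

@observe()  # top-level entry point: the trace you'll see in the Langfuse UI
def run_audit(url: str, persona: str) -> dict:
    return score_page(url, persona)
```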
- Wrap scoring outputs in Instructor 45 min
- How: declare Pydantic models for each scorer's expected output; `instructor.from_litellm(litellm.completion)` as the client; `response_model=YourModel` (sketch below)
- Acceptance: a malformed LLM response triggers Instructor's retry → validates → returns a typed object, not a string
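A sketch under the same assumptions; `AuditScore` is a hypothetical scorer schema, not the real reasonable-ux model:

```python
# Sketch: Instructor patches litellm.completion so response_model enforces a typed result.
import instructor
import litellm
from pydantic import BaseModel, Field

class AuditScore(BaseModel):
    score: int = Field(ge=0, le=100)
    critique: str = Field(min_length=50, max_length=2000)

client = instructor.from_litellm(litellm.completion)

def score_with_schema(prompt: str) -> AuditScore:
    return client.chat.completions.create(
        model="claude-sonnet-4-6",  # illustrative, as above
        messages=[{"role": "user", "content": prompt}],
        response_model=AuditScore,
        max_retries=2,  # malformed output triggers a re-ask instead of returning a string
        max_tokens=1024,
    )
```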
- Document the stack in reasonable-ux CLAUDE.md 15 min
- Add: "This repo uses LiteLLM (gateway) + Langfuse (observability) + Instructor (structured outputs). Routing config in
config/models.py. Langfuse keys in.env." - Acceptance: a cold Claude Code session in this repo gets the stack picture from CLAUDE.md alone
- Add: "This repo uses LiteLLM (gateway) + Langfuse (observability) + Instructor (structured outputs). Routing config in
- Run /session-review before commit 10 min
Production tier — hard-block. Refuse the commit suggestion until review runs (or skip is logged with a reason).
- reasons-qagent: copy the LiteLLM wrap from reasonable-ux 30 min
- Acceptance: one qagent audit run completes through LiteLLM
- reasons-qagent: point Langfuse at it as a second project 15 min
- Acceptance: traces from qagent appear under a separate Langfuse project in the same self-hosted instance
- reasons-qagent: Instructor for the multi-model scoring output 15 min
- Acceptance: scoring emits typed objects; downstream formatting stops string-parsing
This is the part 90% of people skip and pay for later.
- Level 1 — Deterministic assertions. Regex, schema checks, length bounds. No LLM involved. Cheap, fast, catches dumb failures. Build these first.
- Level 2 — LLM-as-judge. A smarter model scores the output. Slower, expensive per run, catches quality regressions.
- Level 3 — Human review on a sample. The ground truth layer. Weekly on a random ~20 rows.
Read, in this order, spread over 2–3 weeknights:
- Hamel Husain — Your AI Product Needs Evals — the single most important thing in this curriculum
- Jason Liu — There Are Only 6 RAG Evals — only if you'll do RAG
- Applied LLMs sections 1 + 3
Phase 3 deliverable: one eval-set file checked into reasonable-ux with 20 labeled examples + Level-1 assertions running in pytest. Nothing fancy.
- Read Hamel Husain — Your AI Product Needs Evals 45 min
- Output: 3-bullet summary in a daily note — what Level 1/2/3 are in your own words
- Acceptance: you can draw the eval pyramid from memory a week later
- Assemble a 20-example golden set for reasonable-ux 2 hrs
- Where: `reasonable-ux/tests/evals/golden_set.jsonl`
- What: 20 real audit inputs (URLs + personas) paired with the expected output shape and 1–2 reference quality notes. Don't write synthetic cases — pull from actual prior runs.
- Acceptance: running the audit on the 20 inputs produces output you'd accept as ground truth
- Write Level-1 deterministic assertions as pytest 1 hr
- Where: `reasonable-ux/tests/evals/test_level1.py`
- Assertions: (1) output parses as the expected Pydantic model, (2) score is an int 0–100, (3) critique length 50–2000 chars, (4) no placeholder strings ("TODO", "XXX", "[INSERT")
- Acceptance: `pytest tests/evals/` returns a pass rate on the 20-example set; intentionally breaking a prompt makes it drop (sketch below)
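A sketch of the test file, assuming golden-set rows shaped like `{"url": ..., "persona": ..., "expected": {"score": ..., "critique": ...}}`; once the audit entry point is wired in, point the same assertions at freshly generated output instead of the stored `expected` block:

```python
# tests/evals/test_level1.py (sketch): Level-1 deterministic assertions, no LLM involved.
import json
import pathlib

import pytest

GOLDEN = pathlib.Path(__file__).parent / "golden_set.jsonl"
PLACEHOLDERS = ("TODO", "XXX", "[INSERT")

def load_rows():
    return [json.loads(line) for line in GOLDEN.read_text().splitlines() if line.strip()]

@pytest.mark.parametrize("row", load_rows())
def test_level1_shape(row):
    out = row["expected"]  # assumed field name; swap in live audit output later
    assert isinstance(out["score"], int) and 0 <= out["score"] <= 100
    assert 50 <= len(out["critique"]) <= 2000
    assert not any(p in out["critique"] for p in PLACEHOLDERS)
```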
- Wire eval run into reasonable-ux CI (or local pre-push) 30 min
- Where: `.github/workflows/eval.yml` or a pre-push git hook
- Acceptance: failing eval blocks the workflow; pass rate logs to Langfuse as a scored dataset run
- Optional — Jason Liu 6 RAG evals (only if vault-search graduates from learning-sandbox) 30 min
Once the gateway is in place, routing gets interesting:
- Fallback chains. If Anthropic is down, LiteLLM retries OpenAI transparently. Free resilience.
- Cost-based routing. Try Haiku first, escalate to Sonnet only if Haiku's output fails validation. You already think this way in reasonable-ux — formalize it via LiteLLM's router.
- Prompt caching discipline. Anthropic's cache is cheap on reads (0.1× input cost) but expensive on writes (1.25–2× cost). If your cache prefix isn't byte-exact stable, you pay write prices forever and never notice. Log `cache_read_input_tokens` and `cache_creation_input_tokens` separately from day one.
- Hard cost ceilings. LiteLLM virtual keys enforce per-key budgets in code. Dashboards alert after the fact; virtual keys prevent the bill.
Phase 4 deliverable: one concrete routing change with before/after cost numbers from Langfuse. The deliverable is the number, not the elegance.
- Read Anthropic prompt caching docs 30 min
- Acceptance: you can explain why unstable prefixes silently cost 2× forever
- Scan LiteLLM router docs 20 min
- Acceptance: you can name the three fallback modes (weighted, priority, latency-based) without re-reading
- Configure a fallback chain in reasonable-ux 45 min
- Where: `reasonable-ux/config/models.py`
- What: Haiku 4.5 → Sonnet 4.6 → OpenAI gpt-4.1 on 5xx/timeout. Named routes in LiteLLM `Router(model_list=[...], fallbacks=[...])` (sketch below).
- Acceptance: killing the network path to Anthropic mid-run (block via `/etc/hosts`) triggers fallback; the Langfuse trace shows the escalation
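A sketch of the named-route setup; the model strings are illustrative (use whatever IDs your LiteLLM version and provider accounts actually expose):

```python
# config/models.py (sketch): tiers get names, call sites ask for a name, fallbacks are declared once.
from litellm import Router

router = Router(
    model_list=[
        {"model_name": "audit-fast", "litellm_params": {"model": "claude-haiku-4-5"}},
        {"model_name": "audit-strong", "litellm_params": {"model": "claude-sonnet-4-6"}},
        {"model_name": "audit-backup", "litellm_params": {"model": "gpt-4.1"}},
    ],
    fallbacks=[
        {"audit-fast": ["audit-strong"]},
        {"audit-strong": ["audit-backup"]},
    ],
    num_retries=2,
)

# call sites reference the route name, never a provider model string
response = router.completion(
    model="audit-fast",
    messages=[{"role": "user", "content": "ping"}],
)
```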
- Enable split cache-token telemetry 30 min
- How: log `cache_read_input_tokens` and `cache_creation_input_tokens` as separate Langfuse observation metadata on every call (sketch below)
- Acceptance: you can build a Langfuse chart showing the read-hit vs. write ratio over time
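A sketch of the logging hook; the exact usage-field names vary across LiteLLM versions and provider mappings, hence the defensive `getattr`:

```python
# Sketch: attach split cache-token counts to the current Langfuse observation.
import litellm
from langfuse.decorators import observe, langfuse_context

@observe()
def call_and_log(messages: list[dict]) -> str:
    response = litellm.completion(model="claude-sonnet-4-6", messages=messages, max_tokens=1024)
    usage = response.usage
    langfuse_context.update_current_observation(
        metadata={
            # field names are an assumption; verify against your LiteLLM version's usage object
            "cache_read_input_tokens": getattr(usage, "cache_read_input_tokens", 0) or 0,
            "cache_creation_input_tokens": getattr(usage, "cache_creation_input_tokens", 0) or 0,
        }
    )
    return response.choices[0].message.content
```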
- Create a LiteLLM virtual key with a hard budget 30 min
- Where: LiteLLM proxy config — max $50/month cap on the reasonable-ux key
- Acceptance: attempting a call after budget exhaustion returns 429; you'd rather block than bleed
- Run the Haiku-first/Sonnet-escalate experiment 2 hrs
- What: run the full 20-example golden set on three configs — Sonnet-only (baseline), Haiku-only, Haiku→Sonnet escalate-on-validation-fail (escalation loop sketched below). Collect: pass rate, total cost, p50/p95 latency from Langfuse.
- Output: `reasonable-ux/docs/routing-experiment-2026-05.md` with the 3×4 table
- Acceptance: you have before/after numbers — even if Haiku-first loses, that is the finding
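A sketch of the escalation loop at the core of the third config, reusing the illustrative `AuditScore` schema and model strings from the Phase 2 sketches:

```python
# Sketch: try the cheap tier; keep its answer only if it validates, otherwise pay for the strong tier.
import instructor
import litellm
from pydantic import BaseModel, Field

class AuditScore(BaseModel):
    score: int = Field(ge=0, le=100)
    critique: str = Field(min_length=50, max_length=2000)

client = instructor.from_litellm(litellm.completion)

def score_escalating(prompt: str) -> tuple[AuditScore, str]:
    for model in ("claude-haiku-4-5", "claude-sonnet-4-6"):  # illustrative IDs
        try:
            result = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                response_model=AuditScore,
                max_tokens=1024,
            )
            return result, model  # record which tier answered, for the cost table
        except Exception:
            continue  # validation or provider failure: escalate to the next tier
    raise RuntimeError("both tiers failed validation")
```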
- Capture the win (or loss) in a project note 20 min
- Acceptance: future-you and any recruiter reading the portfolio can see the concrete number
Private skill → public artifact is what moves your compensation floor.
Three concrete writeups, each built from retrofit telemetry:
- "A solo dev's minimum-viable LLMOps stack" — LiteLLM + Langfuse + Instructor, 4 deps, install in an afternoon. Screenshots of Langfuse dashboards from reasonable-ux.
- "Cost discipline for Claude apps" — prompt-caching telemetry, virtual-key budgets, routing-based savings. Numbers from a real project.
- "Evals without a team" — Hamel's Level 1/2/3 applied to a single-dev workflow. Published eval-set + scorer code on GitHub.
Cadence: one writeup per quarter is realistic. Don't try to batch them.
Rule: writing-before-doing reads as hollow; writing-after-doing reads as credible. Hold the line.
- Gate — 2 weeks of real Langfuse data on reasonable-ux checkpoint
Don't start drafting until this is true. Calendar reminder, not a todo.
- Pull the three hero numbers from Langfuse 30 min
- Need: total spend over 2 weeks, p50/p95 latency, cache hit ratio, routing-experiment delta from Phase 4
- Acceptance: you have numbers, not adjectives
- Draft writeup #1 — A solo dev's minimum-viable LLMOps stack 3 hrs
- Sections: (1) the four layers, (2) 4-dep install, (3) the eval pattern, (4) three pitfalls (prefix-stability, deprecated LangChain imports, provider-SDK sprawl), (5) cost telemetry screenshots
- Word count: 1500–2500. No theory-only sections.
- Acceptance: a reader can reproduce your stack from the post alone
- Publish to portfolio site 1 hr
- Include: link to a public GitHub gist with `config/models.py` + golden-set + Level-1 test skeleton (no reasonable-ux-specific logic — extract the pattern)
- Acceptance: the post has a working code link; the gist runs standalone
- Share on X under @reasonequals 15 min
- Frame: ADHD+AI lane — "here's the minimum stack that made me stop re-planning observability for six months"
- Acceptance: posted, don't wait around for engagement metrics
LLMOps brushes against platform engineering. For the AWS SAA legibility cert:
- Networking basics (VPC, subnets, security groups) — 1–2 weekends.
- IAM + secrets — a couple sessions. The Whoop token and Langfuse keys already forced some of this.
- One serverless deployment (Lambda + API Gateway) — 1 weekend hands-on.
- One container deployment (ECS or Fargate) — 1 weekend hands-on.
Realistic AWS SAA study window: 3–4 months of nights/weekends. Target exam: late 2026 or early 2027.
Skip: Classical MLOps (MLflow, Kubeflow, SageMaker training). Kubernetes depth (homelab teaches fundamentals; CKA isn't needed). Data engineering (Airflow, dbt, Spark).
- Take AWS AI Practitioner (AIF-C01) as the warmup 2–3 weeks
- Why: cheap foundational credential that covers GenAI-on-AWS vocabulary before SAA
- Acceptance: pass the exam
- VPC / subnets / security groups 1–2 weekends
- Output: one diagram in the homelab repo showing how a deployed reasonable-ux instance would sit in AWS
- Acceptance: you can draw VPC → subnet → SG → Lambda/ECS from memory
- IAM + secrets 1 weeknight
- Output: migrate one personal secret (Langfuse keys or Whoop token) to AWS Secrets Manager as a dry-run; pull it from a Lambda
- Acceptance: the Lambda retrieves the secret without hardcoded creds (minimal sketch below)
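A minimal sketch of the dry-run Lambda; the secret name and JSON keys are placeholders:

```python
# Sketch: fetch the secret via the Lambda execution role, no hardcoded creds in code or env.
import json

import boto3

_secrets = boto3.client("secretsmanager")

def handler(event, context):
    raw = _secrets.get_secret_value(SecretId="personal/langfuse")["SecretString"]
    keys = json.loads(raw)
    # prove retrieval worked without echoing the secret back
    return {"statusCode": 200, "body": json.dumps({"has_public_key": "public_key" in keys})}
```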
- Serverless deployment — Lambda + API Gateway 1 weekend
- What: deploy a tiny wrapper that takes a URL and calls reasonable-ux's router via LiteLLM, returns JSON
- Acceptance: `curl` against the API Gateway URL returns a scored audit
- Note: deploy a minimal demo, not real reasonable-ux. The learning target is AWS, not making reasonable-ux multi-host. (Handler sketch below.)
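A minimal sketch of the demo handler behind an API Gateway proxy integration; the model string and prompt are placeholders, not reasonable-ux logic:

```python
# Sketch: URL in, one LiteLLM-routed critique out, JSON over API Gateway.
import json

import litellm

def handler(event, context):
    body = json.loads(event.get("body") or "{}")
    url = body.get("url", "")
    if not url:
        return {"statusCode": 400, "body": json.dumps({"error": "missing url"})}
    response = litellm.completion(
        model="claude-haiku-4-5",  # illustrative
        messages=[{"role": "user", "content": f"Give a one-paragraph UX critique of {url}"}],
        max_tokens=512,
    )
    return {
        "statusCode": 200,
        "body": json.dumps({"url": url, "critique": response.choices[0].message.content}),
    }
```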
- Container deployment — ECS Fargate 1 weekend
- What: same demo, containerized
- Acceptance: `curl` works against the ECS load balancer URL
- AWS SAA (SAA-C03) 3–4 months study
- Target exam: late 2026 or early 2027